Tag
5 articles
Learn how to set up a benchmarking framework to evaluate AI coding agents like Claude Code and GPT-5.5, similar to industry benchmarks used in 2026.
Poolside AI introduces Laguna XS.2 and M.1, agentic coding models that score 68.2% and 72.5%, respectively, on SWE-bench Verified, demonstrating significant progress in automated software development.
A new study by METR reveals that nearly half of AI-generated code that passes industry benchmarks would be rejected by real developers due to quality and maintainability issues.
OpenAI announces it will no longer evaluate models on SWE-bench Verified, citing contamination and data leakage issues. The organization recommends SWE-bench Pro as a replacement.
OpenAI plans to retire the SWE-bench Verified benchmark, citing flaws that undermine its validity as a coding performance measure. The move highlights concerns about memorization in AI model evaluations.