Tag

#SWE-bench

3 articles

Half of AI-written code that passes industry test would get rejected by real developers, new study finds

A new study by METR reveals that nearly half of AI-generated code that passes industry benchmarks would be rejected by real developers due to quality and maintainability issues.

Mar 11

Why we no longer evaluate SWE-bench Verified

OpenAI announces it will no longer evaluate SWE-bench Verified due to contamination and data leakage issues. The organization recommends SWE-bench Pro as a replacement.

Feb 23

OpenAI wants to retire the AI coding benchmark that everyone has been competing on

OpenAI plans to retire the SWE-bench Verified benchmark, citing flaws that undermine its validity as a coding performance measure. The move highlights concerns about memorization in AI model evaluations.

Feb 23