The Problem with "My Model Can Code"
Everyone building AI tooling right now says their model "codes well." Vendor blog posts compare pass rates, leaderboards shuffle every few weeks, and the numbers keep climbing. But if you have spent any time actually using these tools in a real codebase, you know that a model which aces a benchmark can still produce subtly wrong code that passes your linter and fails in production.
That gap between benchmark score and real utility is not a minor issue. It is the central problem of evaluating language models on code. Understanding how these benchmarks are designed, what they actually measure, and where they break down is not just of academic interest. It shapes which tools you pick, how much you trust AI-generated patches, and how you think about the economics of AI-assisted development.
This post unpacks the most widely cited coding benchmarks used in 2025 and 2026, explains the mechanics behind them, and gives you a way to reason about what any given score actually means for your work.
HumanEval: Where Benchmark-Driven Evaluation Started
HumanEval, released by OpenAI in 2021, is the grandfather of coding benchmarks. It consists of 164 Python programming problems, each with a function signature, a docstring, and a set of unit tests that the model does not see. The model generates a completion, and the tests determine whether it passed.
The metric is pass@k: given k samples from the model for a single problem, what is the probability that at least one of them passes all tests?
    import numpy as np
    def pass_at_k(n: int, c: int, k: int) -> float:
        """
        n: total samples generated
        c: number of correct samples
        k: k in pass@k
        Returns the probability that at least one of k samples is correct.
        """
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(
            1.0 - k / np.arange(n - c + 1, n + 1)
        )
The elegance of pass@k is that it does not hinge on the luck of a single draw. If you sample only once and report whether it passed, you are measuring a combination of model quality and sampling luck. The estimator above takes n samples (with n at least k), counts the c correct ones, and uses combinatorics to give an unbiased estimate without needing to enumerate all k-subsets.
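As a sanity check on the estimator above, the sketch below compares the product form with the direct combinatorial form, 1 - C(n-c, k) / C(n, k), using only the standard library. The function names and the sample counts (20 samples, 5 correct) are made up for illustration.

```python
from math import comb, prod


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Product form of the estimator, written with the standard library only."""
    if n - c < k:
        return 1.0  # every k-subset must contain at least one correct sample
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def pass_at_k_direct(n: int, c: int, k: int) -> float:
    """Direct form: 1 minus the chance that all k drawn samples are incorrect."""
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 20 samples for one problem, 5 of them correct.
for k in (1, 5, 10):
    print(k, round(pass_at_k_product(20, 5, k), 4), round(pass_at_k_direct(20, 5, k), 4))
```

The two columns should agree for every k, and for k = 1 both reduce to c / n, which is 0.25 here.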
However, HumanEval has a well-documented contamination problem. These 164 problems have been scraped, discussed, and reproduced across the public internet for years. Any model trained on web data after 2022 has almost certainly seen them. Reported pass@1 numbers on HumanEval in 2025 are often in the 90s, which says very little about whether the model can actually write code you have never seen before.
A second problem is diversity. HumanEval is almost entirely simple algorithmic functions: reverse a string, find the median, check if a number is prime. Real engineering work is rarely any of those things.
SWE-bench: The Benchmark That Actually Scared Engineers
SWE-bench, introduced by researchers at Princeton in 2023 and updated through 2025, changed how the field thought about coding evaluation. Instead of synthetic problems, it pulls real GitHub issues and their corresponding pull request fixes from popular Python open-source repositories: Django, Flask, scikit-learn, SymPy, Requests, and others.
The task for the model is: given the repository state before the fix and the text of the GitHub issue, produce a patch that makes the tests associated with the real fix pass, without breaking tests that passed before.
    git checkout $BASE_COMMIT
    git apply model_patch.diff
    python -m pytest tests/ -k "$RELEVANT_TESTS" --tb=no -q
This is qualitatively different from HumanEval. The model must:
- Read and understand a multi-file Python codebase it has never seen in that exact state
- Identify the root cause from a natural-language issue description that may be ambiguous
- Produce a patch in unified diff format that applies cleanly
- Pass tests it did not write, which often cover edge cases the issue description never mentions
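The pass criterion behind that list can be sketched as a small check. SWE-bench labels each instance with two test lists, FAIL_TO_PASS (tests the fix should turn green) and PASS_TO_PASS (tests that must stay green). The function name and the shape of the results dictionary below are assumptions for illustration, not the official harness API.

```python
def is_resolved(
    test_results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Return True only if the patch fixes the target tests without regressions.

    test_results maps a test ID to True (passed) or False (failed) after the
    model's patch has been applied. A test missing from the results counts as
    a failure, since it may not have run at all.
    """
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    no_regressions = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and no_regressions


# Hypothetical test IDs for a single instance.
results = {
    "tests/test_query.py::test_order_by_f": True,
    "tests/test_query.py::test_select_related": False,  # regression
}
print(
    is_resolved(
        results,
        fail_to_pass=["tests/test_query.py::test_order_by_f"],
        pass_to_pass=["tests/test_query.py::test_select_related"],
    )
)  # False
```

Under this criterion a patch that fixes the reported bug but breaks an unrelated test still counts as a failure.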
The early numbers were sobering. When SWE-bench was released, the best available models resolved around 2 to 4 percent of instances. The full benchmark has 2,294 instances; SWE-bench Verified is a human-filtered subset of 500 where annotators confirmed the issue is reproducible and the test actually catches the bug.
By mid-2025, Claude Sonnet 4 and similar frontier models were hitting 50 to 60 percent on SWE-bench Verified when given tool access. That sounds impressive, but it means a fully autonomous agent still fails on roughly half of real bugs from well-maintained Python libraries, and these are libraries the models have definitely been trained on.
What SWE-bench Measures That HumanEval Cannot
To be concrete about the difference, consider a real class of problem from SWE-bench. A Django issue might look like:
QuerySet.order_by() with an F() expression containing a string annotation raises AttributeError when combined with select_related().
Resolving this requires understanding the Django ORM internals at several layers: how order_by processes F objects, how select_related attaches extra fields, where the AttributeError originates in the query compilation chain, and whether fixing the root cause is safe given how the code is used elsewhere.
A model that can write def median(lst): return sorted(lst)[len(lst)//2] has demonstrated almost nothing about whether it can navigate that problem.
SWE-bench forces localisation (finding the right file and function), causal reasoning (understanding why the bug occurs, not just that it does), and side-effect awareness (not breaking unrelated functionality). These are the skills that actually matter in a senior engineering role.
One important nuance: SWE-bench scores depend heavily on the agent scaffold, not just the underlying model. Giving a model a Python REPL, the ability to run tests mid-generation, and a file-browsing tool produces dramatically different scores than querying the model in a single pass. When you see a headline SWE-bench number, always check whether it was achieved with tool use, what tools were permitted, and how many model calls were made per instance.
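One way to keep those questions attached to a score is to record the scaffold alongside it. The sketch below is a hypothetical record type, not a format any leaderboard uses; the field names, model names, and numbers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreReport:
    """A headline score plus the scaffold details needed to interpret it."""

    model: str
    benchmark: str
    score_pct: float
    tools: frozenset[str]  # e.g. {"python_repl", "run_tests", "file_browser"}
    max_calls_per_instance: int  # 1 means single-pass generation


def comparable(a: ScoreReport, b: ScoreReport) -> bool:
    """Two scores are directly comparable only under the same scaffold."""
    return (
        a.benchmark == b.benchmark
        and a.tools == b.tools
        and a.max_calls_per_instance == b.max_calls_per_instance
    )


# Hypothetical reports for two different setups.
single_pass = ScoreReport("model-a", "SWE-bench Verified", 30.0, frozenset(), 1)
agentic = ScoreReport(
    "model-b",
    "SWE-bench Verified",
    50.0,
    frozenset({"python_repl", "run_tests"}),
    30,
)
print(comparable(single_pass, agentic))  # False
```

Exact equality is a deliberately strict rule; a looser policy could allow small differences in call budget.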
LiveCodeBench and the Contamination Arms Race
The contamination problem is not unique to HumanEval. Any static benchmark becomes stale once models have trained on it. LiveCodeBench, maintained by researchers at UC Berkeley, MIT, and Cornell, addresses this by continuously pulling new problems from competitive programming platforms: LeetCode, CodeForces, and AtCoder, using only problems published after a specified cutoff date.
    {
        "problem_id": "lc_3305",
        "platform": "leetcode",
        "published_at": "2025-02-10",  # after any plausible training cutoff
        "difficulty": "medium",
        "prompt": "...",
        "test_cases": [...],
        "solution_type": "function"
    }
By using problems that did not exist when the model was trained, LiveCodeBench gives a cleaner signal on generalisation. The tradeoff is that competitive programming problems skew toward algorithmic puzzles: dynamic programming, graph traversal, bit manipulation. They are more novel than HumanEval but still not representative of production engineering work.
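The cutoff filter itself takes only a few lines. A minimal sketch, assuming records shaped like the example above with an ISO-format published_at field; the function name, the sample records, and the cutoff date are illustrative.

```python
from datetime import date


def post_cutoff(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [
        p
        for p in problems
        if date.fromisoformat(p["published_at"]) > training_cutoff
    ]


# Hypothetical records; only "published_at" is used by the filter.
problems = [
    {"problem_id": "a", "published_at": "2024-03-01"},
    {"problem_id": "b", "published_at": "2025-02-10"},
]
fresh = post_cutoff(problems, training_cutoff=date(2024, 10, 1))
print([p["problem_id"] for p in fresh])  # ['b']
```

The hard part in practice is knowing the true training cutoff, which this sketch simply takes as an input.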
A subtler issue is that platform difficulty ratings are not stable. A "hard" LeetCode problem from 2019 and a "hard" problem from 2025 are not the same thing, partly because the community has mapped out solution patterns so thoroughly that even older hard problems are well-represented in training data in aggregated form.
LiveCodeBench also tracks performance over time in a way that is useful for detecting when a model has been updated: if pass rates on a rolling window suddenly jump on post-cutoff problems, that is evidence the model was retrained on more recent data than claimed.
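A rolling view like this can be approximated by bucketing results by publication month. The sketch below assumes each result is a (published_at, passed) pair; the function name and the data are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_month(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Group (published_at, passed) pairs by YYYY-MM and return pass rates."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, seen]
    for published_at, passed in results:
        month = published_at[:7]  # "2025-02-10" -> "2025-02"
        totals[month][0] += int(passed)
        totals[month][1] += 1
    return {m: p / n for m, (p, n) in sorted(totals.items())}


# Hypothetical results for one model.
results = [
    ("2024-11-03", True),
    ("2024-11-17", False),
    ("2024-12-01", True),
    ("2024-12-15", True),
]
print(pass_rate_by_month(results))  # {'2024-11': 0.5, '2024-12': 1.0}
```

With only a handful of problems per month the rates are noisy, so a jump between buckets is a prompt to look closer rather than proof of retraining.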
BigCodeBench: Multi-Step, Library-Heavy Tasks
BigCodeBench, released in 2024 and updated through 2025, takes a different angle again. Where HumanEval is algorithmic and SWE-bench is repository-level, BigCodeBench focuses on tasks that require correctly calling real Python libraries.
    import pandas as pd
    import matplotlib.pyplot as plt
    def process_sales(df_orders: pd.DataFrame, df_products: pd.DataFrame) -> plt.Figure:
        """
        Merge orders with products on product_id, convert order_date to
        datetime, resample by month, and return a bar chart of monthly revenue.
        Each row in df_orders has: order_id, product_id, quantity, order_date (str).
        Each row in df_products has: product_id, unit_price.
        """
        merged = df_orders.merge(df_products, on="product_id")
        merged["order_date"] = pd.to_datetime(merged["order_date"])
        merged["revenue"] = merged["quantity"] * merged["unit_price"]
        monthly = (
            merged.set_index("order_date")
            .resample("ME")["revenue"]
            .sum()
            .reset_index()
        )
        fig, ax = plt.subplots()
        ax.bar(monthly["order_date"].dt.strftime("%Y-%m"), monthly["revenue"])
        ax.set_xlabel("Month")
        ax.set_ylabel("Revenue")
        return fig
The challenge here is not algorithmic cleverness but API correctness: using the right pandas resample offset alias ("ME" in pandas 2.2+, not the deprecated "M"), understanding merge semantics, and producing output in exactly the shape the test expects. These tasks are closer to what a data engineer or ML practitioner actually does.
BigCodeBench evaluates two variants: Complete, where the model fills in a function body from a detailed docstring, and Instruct, where the model works from a shorter natural-language instruction. Comparing the two helps separate code generation from task interpretation, which is useful for understanding where models fail.
The Gotcha: Benchmark Gaming and Annotation Artifacts
Here is something that is easy to miss until you read the papers carefully. Both HumanEval and many SWE-bench instances were originally human-authored but have been reproduced in model-generated datasets, blog posts, and documentation. The line between "in training data" and "not in training data" is genuinely blurry.
But the more damaging problem is annotation artifacts. Consider how SWE-bench instances are collected: a human researcher identifies a GitHub issue, confirms a test was added in the fixing PR, and includes the instance. This means every instance has a test added with the fix. Models that learn to pattern-match "issues that require adding a test in the diff" will do better on SWE-bench than models that fix bugs without adding tests, even if the non-test-adding fix is just as correct.
I noticed this when reviewing diffs generated by models on SWE-bench instances that had already been resolved in the real repo. The model-generated patches that scored highest often included a new test even when the original fix PR did not require one, because the benchmark's evaluation harness rewards running new test code that happens to pass.
A related artifact: some SWE-bench instances have underspecified test suites. The model can pass by doing something that technically satisfies the tests but misses the spirit of the fix. Quantifying how often this happens requires human review, which is expensive and subjective.
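A toy illustration of that failure mode, not taken from any SWE-bench instance: the issue asks for port validation, but the only test covers the happy path, so a patch that skips the validation still passes. All names here are hypothetical.

```python
def parse_port(value: str) -> int:
    """Intended fix: parse a port string and reject values outside 1-65535."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def parse_port_shortcut(value: str) -> int:
    """A 'patch' that satisfies the weak test below but skips the range check."""
    return int(value)


def weak_test(fn) -> bool:
    """Underspecified test: only checks the happy path from the issue report."""
    return fn("8080") == 8080


print(weak_test(parse_port), weak_test(parse_port_shortcut))  # True True
print(parse_port_shortcut("70000"))  # 70000, which the intended fix would reject
```

Both versions are indistinguishable to the test, which is why catching this class of problem takes a human reading the diff.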
SWE-bench Verified addresses part of this by having annotators confirm instances are genuinely resolvable, but it does not fully fix the test-as-ground-truth problem.
Reasoning Traces and the CodeContests Dimension
The 2025 generation of reasoning models (o3, Claude 3.7 Sonnet with extended thinking, and similar) introduced a new evaluation dimension: can the model reason through code the way a human thinks before writing?
CodeContests, from DeepMind, evaluates competitive programming at the contest level. Problems require finding non-obvious algorithms, not just applying known ones. The dataset includes problems where brute-force approaches time out and the correct solution requires an insight about the problem structure.
    def min_cut_stoer_wagner(N: int, roads: list[tuple[int, int, int]]) -> int:
        """Global minimum cut of an undirected weighted graph (Stoer-Wagner)."""
        adj = [[0] * N for _ in range(N)]
        for u, v, w in roads:
            adj[u][v] += w
            adj[v][u] += w
        best = float("inf")
        merged = list(range(N))  # vertices that have not been contracted yet
        for phase in range(N - 1):
            key = [0] * N
            in_A = [False] * N
            prev, last = -1, -1
            for _ in range(len(merged)):
                # Add the remaining vertex most tightly connected to A
                z = max((k for k in merged if not in_A[k]), key=lambda k: key[k])
                in_A[z] = True
                prev, last = last, z
                for w in merged:
                    if not in_A[w]:
                        key[w] += adj[z][w]
            # Cut of the phase: weight separating the last-added vertex from the rest
            best = min(best, key[last])
            # Merge the last two vertices added to A (fold last into prev)
            for w in merged:
                adj[prev][w] += adj[last][w]
                adj[w][prev] += adj[w][last]
            adj[prev][prev] = 0
            merged.remove(last)
        return best
On CodeContests, most models without chain-of-thought reasoning perform at or near zero on the hardest problems. With extended thinking, performance improves significantly, which provides concrete evidence that test-time compute matters for code as much as it does for maths.
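For contrast with the Stoer-Wagner routine above, this is roughly what the brute-force approach to the same global minimum cut problem looks like. It is a sketch for tiny inputs only, and the function name and example graph are made up; it can double as a reference answer when checking a faster solution on small cases.

```python
from itertools import combinations


def min_cut_brute_force(n: int, roads: list[tuple[int, int, int]]) -> int:
    """Global minimum cut by trying every way to split the vertices in two.

    Vertex 0 is pinned to one side so each split is counted once. That still
    leaves 2**(n - 1) - 1 splits, which is why this only works for tiny inputs.
    """
    if n < 2:
        raise ValueError("a cut needs at least two vertices")
    best = None
    others = range(1, n)
    for size in range(0, n - 1):  # how many other vertices join vertex 0
        for group in combinations(others, size):
            side = {0, *group}
            # Sum the weights of edges with exactly one endpoint on this side
            weight = sum(w for u, v, w in roads if (u in side) != (v in side))
            if best is None or weight < best:
                best = weight
    return best


# Hypothetical 4-cycle: heavy edges 0-1 and 2-3, light edges 1-2 and 3-0.
roads = [(0, 1, 3), (1, 2, 1), (2, 3, 3), (3, 0, 1)]
print(min_cut_brute_force(4, roads))  # 2
```

The number of splits doubles with every added vertex, so contest-sized inputs are far out of reach for this approach.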
This is where the benchmark landscape diverges most sharply. A model that is excellent on SWE-bench (multi-file repository understanding, grounded in real codebases) is not necessarily strong on CodeContests (pure algorithmic reasoning), and vice versa. Picking the right benchmark for your use case matters enormously.
How to Apply This When Evaluating AI Coding Tools
If you are choosing between AI coding assistants or evaluating a new model for your team, here is a practical framework grounded in what these benchmarks actually measure.
Match the benchmark to your task type. If your team writes data pipelines and internal tooling, BigCodeBench performance is more predictive than HumanEval. If you are building an AI agent to triage and fix production bugs, SWE-bench Verified is the right signal. If you are running competitive programming practice, LiveCodeBench matters.
Always check the agent scaffold. When a vendor reports SWE-bench numbers, find out if it was zero-shot (single prompt, no tools) or agentic (multiple calls, tool access). A 50% score with 30 model calls per instance and a Python REPL is not comparable to a 30% score with 3 calls and no tools. Neither is better intrinsically; they just represent different deployment modes.
Run your own internal evals before trusting published numbers. Take 20 to 30 representative tasks from your actual backlog. Use the same prompt format you will deploy. Measure pass rate and, separately, how often the generated code is correct in structure but broken in a subtle way that tests do not catch. The latter number will be worse than the former, sometimes significantly.
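    // Minimal internal eval harness for TypeScript projects
    // Uses execFile (no shell) to avoid shell injection risks
    import { execFile } from "node:child_process";
    import * as fs from "node:fs";
    interface EvalCase {
      id: string;
      prompt: string;
      outputFile: string;
      testBin: string;
      testArgs: string[];
    }
    // Runs a binary and resolves with its exit status instead of throwing
    function execFileNoThrow(bin: string, args: string[]): Promise<{ status: number }> {
      return new Promise((resolve) => {
        execFile(bin, args, (error) => {
          // A numeric error.code is the exit code; anything else (spawn failure, signal) maps to 1
          const status = error ? (typeof error.code === "number" ? error.code : 1) : 0;
          resolve({ status });
        });
      });
    }
    async function runEval(
      cases: EvalCase[],
      generateFn: (prompt: string) => Promise<string>
    ) {
      const results: { id: string; passed: boolean }[] = [];
      // Run cases one at a time so file writes and test runs cannot interfere with each other
      for (const c of cases) {
        const generated = await generateFn(c.prompt);
        fs.writeFileSync(c.outputFile, generated, "utf8");
        const { status } = await execFileNoThrow(c.testBin, c.testArgs);
        results.push({ id: c.id, passed: status === 0 });
      }
      const passed = results.filter((r) => r.passed).length;
      const passRate = results.length > 0 ? passed / results.length : 0;
      console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
      return results;
    }
Factor in latency and cost. A model that scores 60% on SWE-bench with an average of 45 seconds and 200k tokens per instance is not economically equivalent to one that scores 55% with 8 seconds and 20k tokens. For interactive developer tools, latency matters more than raw accuracy up to a point.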
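The comparison in the paragraph above becomes concrete once the token budget is divided by the resolve rate. A small sketch, assuming failed attempts cost the same as successful ones; the function name is made up and the figures are the illustrative ones from the text.

```python
def tokens_per_resolved(resolve_rate: float, tokens_per_instance: float) -> float:
    """Expected tokens spent for each instance that actually gets resolved."""
    return tokens_per_instance / resolve_rate


# The two hypothetical models from the paragraph above.
slow = tokens_per_resolved(0.60, 200_000)
fast = tokens_per_resolved(0.55, 20_000)
print(round(slow), round(fast), round(slow / fast, 1))  # 333333 36364 9.2
```

On these numbers the higher-scoring model spends roughly nine times as many tokens per resolved instance, which is the figure to weigh against its five-point accuracy lead.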
Watch for benchmark saturation. When multiple frontier models cluster above 85% on a benchmark, it has stopped being informative. This already happened with HumanEval, and it is happening with parts of SWE-bench Lite. When you see near-saturated benchmarks being cited, find the harder variant or the more recent version.
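One way to see why clustered scores stop being informative is to put an interval around them. The sketch below uses the Wilson score interval and assumes each problem is an independent pass/fail trial; the two pass counts are made up, chosen to sit near 90% on a benchmark of 164 problems, the size of HumanEval.

```python
from math import sqrt


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical scores for two models on a 164-problem benchmark.
low_a, high_a = wilson_interval(148, 164)
low_b, high_b = wilson_interval(151, 164)
print(f"model A: {low_a:.3f} to {high_a:.3f}")
print(f"model B: {low_b:.3f} to {high_b:.3f}")
print("intervals overlap:", low_b < high_a)
```

The same function applies to an internal eval of 20 to 30 tasks, where the interval is wider still, so small differences between tools on a set that size should be read cautiously.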
Where the Field Is Heading
The direction of travel in 2025 and 2026 is toward benchmarks that are harder to game and closer to real engineering.
SWE-bench Multimodal introduces issues where the relevant context is in a screenshot or a diagram, not just text. This closes the gap between benchmark conditions and how engineers actually work, where a Figma spec or a screenshot of a crash dialog is often the primary input.
AgentBench and WebArena evaluate multi-step agent behaviour in sandboxed environments: not just writing code, but using terminals, reading documentation, navigating error messages iteratively. The unit of evaluation shifts from "does this patch pass" to "did the agent accomplish the goal."
Cross-language evaluation is becoming more common. Most benchmarks are Python-centric, but real engineering teams use TypeScript, Go, Rust, and Java. Models that score well on Python-heavy benchmarks often show larger gaps on statically typed languages where the feedback loop is different.
The honest assessment: no current benchmark captures what it means to be a good software engineer over a multi-month project. Benchmarks measure a bounded, reproducible slice of that. They are useful for comparing models against each other on consistent conditions, but they should not be confused with a measure of engineering ability in full.
Key Takeaways
- pass@k gives an unbiased estimate of model capability from multiple samples; single-sample pass rate is noisier and more luck-dependent.
- HumanEval scores in the 90s are largely contaminated; they do not tell you much about how a model performs on code it has never seen.
- SWE-bench Verified is currently the strongest signal for repository-level reasoning, but scores depend heavily on the agent scaffold and number of model calls permitted.
- LiveCodeBench trades representativeness for freshness; it is the best contamination-resistant benchmark for algorithmic tasks.
- BigCodeBench is more relevant than HumanEval for engineers working with real libraries and APIs.
- CodeContests reveals whether a model can reason about algorithmic problems, which is a distinct skill from repository-level bug fixing.
- Any headline benchmark number without agent scaffold details, tool list, and call budget is incomplete and potentially misleading.
- Run your own evals on tasks representative of your actual work. Twenty real cases from your backlog will tell you more than any published leaderboard.