Coding Benchmarks Got Boring. ProgramBench Made Them Honest.

dehakuran.com · May 2026 · 4 min read

Agentic coding benchmarks have started to feel boring lately.

Not because the field is boring. The opposite, actually. The models are getting more capable, the tooling is getting more sophisticated, and the gap between "autocomplete" and "software engineer-ish system" keeps getting smaller.

But the way we measure them has not kept up.


The Leaderboard Problem

For a while, every new model launch came with the same kind of claim: better on this benchmark, higher on that leaderboard, strongest coding model yet, best agentic performance to date.

Duh. Everyone is the best of the best, somehow.

I was already skeptical of the inflated numbers. Many coding benchmarks started to feel less like a measure of real engineering ability and more like a measure of benchmark familiarity: solve a constrained issue, patch a known repo, pass a visible test suite, optimize for a narrow task definition.

Useful? Yes.

Enough? Not anymore.

"A low score can be more useful than another inflated leaderboard jump."

The Question Got Bigger

As models become more sophisticated, the way we measure them needs to evolve too. We should stop asking only whether an agent can fix a bug, complete a function, or pass a curated test.

We should ask whether it can reason about a whole program as a system.

  • Can it infer architecture?
  • Can it reconstruct intent?
  • Can it turn observed behavior back into design?
  • Can it build something coherent enough to survive contact with reality?

That is why ProgramBench caught my attention.

ProgramBench Is Simple, Brutal, and Useful

ProgramBench, from researchers at Meta Superintelligence Labs, Stanford University, and Harvard University, asks a much more uncomfortable question:

Can language models rebuild programs from scratch?

The setup is simple, but brutal. Given only a compiled binary and its documentation, an agent must architect and implement a complete codebase that reproduces the original program's behavior.

No original source code. No neat issue description. No friendly scaffold. No repo full of hints.

Just the artifact, the docs, and the requirement to make the thing live again.

The benchmark contains 200 tasks, from compact command-line tools to widely used systems such as FFmpeg, SQLite, and the PHP interpreter. The agent must choose a language, design the architecture, write the source code, and produce a build script. Evaluation is behavioral: does the rebuilt program match the reference executable?
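
To make "behavioral" concrete, here is a minimal sketch of what that comparison could look like. This is not the benchmark's actual harness; the binary paths, test cases, and matching rule are illustrative assumptions on my part.

```python
# Hypothetical sketch of behavioral evaluation. NOT ProgramBench's real harness;
# binary paths, test cases, and the pass criterion are invented for illustration.
import subprocess

def run(binary: str, args: list[str], stdin: str = "") -> tuple[int, str, str]:
    """Run a binary and capture its exit code, stdout, and stderr."""
    result = subprocess.run(
        [binary, *args],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout, result.stderr

def behavior_match_rate(reference: str, rebuilt: str, tests: list[dict]) -> float:
    """Fraction of test cases where the rebuilt binary matches the reference."""
    passed = 0
    for case in tests:
        ref = run(reference, case["args"], case.get("stdin", ""))
        new = run(rebuilt, case["args"], case.get("stdin", ""))
        if ref == new:  # same exit code, same stdout, same stderr
            passed += 1
    return passed / len(tests)

# Illustrative test cases for a hypothetical word-count style CLI tool.
tests = [
    {"args": ["--version"]},
    {"args": ["-l"], "stdin": "one\ntwo\n"},
]
score = behavior_match_rate("./reference_bin", "./rebuilt_bin", tests)
print(f"{score:.0%} of tests matched")
```

A real harness would need far more care (timeouts, nondeterministic output, file side effects), but the core idea is this kind of differential testing: same inputs in, same observable behavior out.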

And the Models Hit the Floor

This is the funny part. The results are not another smooth leaderboard climb.

Across 9 evaluated language models, ProgramBench reports that none fully resolved any task. The best model passed 95% of tests on only 3% of tasks.

Somehow, that feels right.

Not because the models are bad. They are not. But because the benchmark is finally asking a question hard enough to expose the gap.

There is a big difference between editing code that already has shape and rebuilding the shape itself. Patching a bug inside an existing repository lets the agent inherit years of human architecture. Rebuilding from a binary asks the agent to recover that architecture from behavior.
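
To make that concrete, imagine the agent's first moves against an unfamiliar binary: probing it like a black box and treating every observation as a design constraint. A toy sketch; the binary path and probe inputs are invented for illustration, and a real agent would generate probes adaptively:

```python
# Toy black-box probing loop: observe a binary's behavior before rebuilding it.
# "./reference_bin" and the probe inputs are made-up examples.
import subprocess

probes = [
    ["--help"],           # discover documented flags and usage text
    ["--version"],        # pin down the exact version string to reproduce
    ["nonexistent.txt"],  # observe error messages and exit codes
]

for args in probes:
    result = subprocess.run(
        ["./reference_bin", *args], capture_output=True, text=True, timeout=10
    )
    print(f"args={args} exit={result.returncode}")
    print(result.stdout or result.stderr)
```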

That is closer to software engineering than most coding benchmarks are willing to be.

Why This Matters

Real engineering is rarely just "write the next line" or "fix this isolated test." It is understanding invisible constraints, reconstructing assumptions, making design decisions under uncertainty, and building something that works beyond the happy path.

The current generation of agents is impressive when the rails are visible. ProgramBench removes most of the rails. That is why the low scores are more interesting than another saturated benchmark.

A benchmark where everyone scores near the ceiling tells us less and less over time. A benchmark where frontier models hit the floor tells us where the frontier actually is.

Maybe the next phase of agentic coding evaluation will be less about who can solve the most toy tasks and more about who can recover, rebuild, and reason across entire systems.

That would make benchmarking interesting again.


Sources: ProgramBench official site and paper, "ProgramBench: Can Language Models Rebuild Programs From Scratch?" Views are my own and do not represent any commercial entity.

Frequently Asked Questions

What does ProgramBench measure?

ProgramBench measures whether coding agents can rebuild complete programs from only a compiled binary and documentation. It covers 200 tasks, from compact CLI tools to FFmpeg, SQLite, and the PHP interpreter.

How did current models perform on ProgramBench?

Current models performed poorly in the useful way: across 9 evaluated language models, none fully resolved any task. The best model passed 95% of tests on only 3% of tasks.

Why is ProgramBench harder than typical coding benchmarks?

ProgramBench removes the existing repository shape. Instead of patching code, the agent must infer architecture, choose a language, write source code, and produce a build script from behavior alone.

AI Coding · Developer Tools · Software Engineering · AI Agents

Deha Kuran

AI Executive, Engineer, and Evangelist. Head of AI Business Operations at Philips.
