The best AI models for coding: nine models on one real build task

Everyone asks which model is best for coding. We ran nine of them, closed and open, through the same build task and the same rubric. GPT-5.5-Codex was the most consistent, but an open-weight model won one of the two tasks outright. Here is the full leaderboard and how to read it.

The most common question we get from the coders using Perch is not about features. It is "which model should I run?" It is a fair question and a hard one, because almost nobody publishes a coding comparison that includes open-weight models on the same footing as the closed frontier, on a task that looks like real work rather than a puzzle.

So we ran one. We gave nine models the same build task, scored each against a fixed rubric, and ranked them. The task was not a toy. It was a single-file interactive dashboard built from a real accounts-payable dataset, the kind of data-dense, opinionated front end that separates a model that can write code from one that can ship something a person would actually use. Below is the full leaderboard, what surprised us, and how to read it if you are choosing a model to code with.

How we tested

Every model got the same prompt, the same context, and the same evaluation. We ran two variants of the task:

  • The build task. Produce a single self-contained interactive dashboard, a "Perch Intelligence Center," from a real accounts-payable run. The data included recoverable duplicate payments, a purchase-order conflict, a queue of unsupported payments with dollar exposure, and an entity graph stitched from payments, vendors, POs, and receipts. Getting this right meant handling real numbers correctly, laying out a dense information surface, and making it interactive, all in one file.
  • The concept task. A more design-led variant of the same surface, weighted toward visual quality, layout, and interaction rather than raw data handling.

Each output was scored from 0 to 100 against a fixed rubric covering functional completeness, fidelity to the underlying data, design and layout quality, interaction and responsiveness, and code quality. Same rubric, every model, every run.
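
As a rough illustration of how a rubric like this can collapse into a single number, the sketch below averages the five dimensions. The equal weighting, the dimension keys, and the sample scores are all assumptions made for illustration; the weights actually used in this run are not stated in this article.

```python
from typing import Dict, Optional

# Dimension names paraphrase the rubric described above.
# The keys themselves are hypothetical, chosen for this sketch.
DIMENSIONS = (
    "functional_completeness",
    "data_fidelity",
    "design_layout",
    "interaction_responsiveness",
    "code_quality",
)


def rubric_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-dimension scores (each 0 to 100) into one 0 to 100 total."""
    if weights is None:
        # ASSUMPTION: equal weights. The real rubric weights are not published here.
        weights = {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= 100:
            raise ValueError(f"{d} must be between 0 and 100, got {scores[d]}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Illustrative sub-scores only; these are NOT results from the benchmark.
example = {
    "functional_completeness": 80,
    "data_fidelity": 70,
    "design_layout": 90,
    "interaction_responsiveness": 60,
    "code_quality": 100,
}
print(rubric_score(example))
```

Passing a different `weights` dictionary is how a design-led variant such as the concept task could be expressed with the same function, again as an assumption about how such a reweighting might work.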

The models were served the way you would actually run them: the closed frontier models natively, and the open-weight models through hosted inference. That is worth stating plainly, because how a model is served affects latency and sometimes quantization, and it is part of the real-world picture when you pick one to code with.

How to read this. This is a directional benchmark, not a universal ranking. It is one run per model per task, on two tasks, in a single domain: data-dense interactive front ends. It does not measure large multi-file refactors, backend systems, or algorithmic problem solving, and the design dimensions carry a degree of subjective judgment that any honest evaluator should admit. Read it as a grounded snapshot of how these models handle a real build, not as the last word on any of them.

The build task leaderboard

The core task: turn a real AP run into a working, interactive intelligence dashboard in a single file.

Chart legend: closed weights, open weights. Axis: score out of 100.
Build task: a single-file interactive AP intelligence dashboard, scored 0 to 100 against a fixed rubric. Copper bars are closed-weight models, green bars are open-weight models.

GPT-5.5-Codex took the build cleanly. What stands out underneath it is how tight the next tier is: Opus 4.8 at 75 and the open-weight DeepSeek-V4-Flash at 73 are effectively a coin flip apart. An open-weight model landed within two points of a closed frontier model on a real, data-heavy build. That is the headline the raw ranking hides.

The rest of the open field shipped working output. Nemotron-Super-3-120B, GLM-5.1, and Qwen3-Coder-480B all produced usable dashboards, just with rougher layouts, thinner interactivity, or looser handling of the numbers. Kimi-K2.6 was the outlier at 30, which sets up the pattern in the next leaderboard.

The concept task leaderboard

The design-led variant, weighted toward visual quality and interaction.

Chart legend: closed weights, open weights. Axis: score out of 100.
Concept task: the design-led variant of the same surface. An open-weight model, DeepSeek-V4-Flash, finishes first.

Here the order changes. DeepSeek-V4-Flash, an open-weight model, finished first, ahead of GPT-5.5-Codex and every other closed model. Kimi-K2.6, which scored 30 on the build, jumped to 83 on the design task. The closed models were strong and consistent, but on this task they did not win.

What surprised us

GPT-5.5-Codex is the safe default, not the runaway winner. Its 82 and 84 were the most consistent scores in the field, and if you want one model that will not embarrass you across varied work, it is the reasonable pick. But it topped only one of the two tasks, and on the other it finished two points behind DeepSeek-V4-Flash.

DeepSeek-V4-Flash is the story. An open-weight model came within two points of Opus 4.8 on the build (73 against 75) and won the design task outright (86). If you had assumed open models were a compromise you accept to save money, this run says otherwise, at least for interactive front-end work.

Open models are more volatile. Kimi-K2.6 swung from 30 to 83 across the two tasks. That range is the real cost of open weights right now: the ceiling is competitive, but the floor is lower and more task-dependent than that of the closed models, which clustered tightly. If you run open models, you want a harness that lets you check the output before you trust it, not one that assumes the first result is good.
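
To make the volatility point concrete, here is a minimal sketch that computes the score spread for the three models whose scores on both tasks are quoted in this article. The pairs are taken from the text above; treating the larger-minus-smaller gap as "volatility" is our simplification, and with one run per task it is a crude measure at best.

```python
# Scores quoted in this article for the two tasks.
# Only models with both scores stated in the text are included.
# The spread does not depend on which score belongs to which task.
SCORES = {
    "GPT-5.5-Codex": (82, 84),
    "DeepSeek-V4-Flash": (73, 86),
    "Kimi-K2.6": (30, 83),
}


def spread(pair):
    """Gap between a model's best and worst task score."""
    return max(pair) - min(pair)


spreads = {model: spread(pair) for model, pair in SCORES.items()}
most_volatile = max(spreads, key=spreads.get)

# Print models from tightest to widest spread.
for model, gap in sorted(spreads.items(), key=lambda item: item[1]):
    print(f"{model}: {gap} points")
```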

The gap that matters is not open versus closed. It is task versus task. The same model can lead one build and trail another. The useful takeaway is not "use model X for everything," it is "be able to switch."

What this means if you code with AI

The practical conclusion is not a single model name. It is that you should be choosing per task, and that most coding tools do not let you. If your assistant is welded to one vendor's model, you cannot run DeepSeek-V4-Flash on the task where it wins, or drop to a cheaper open model on the work that does not need the frontier, or compare two outputs side by side before you commit.
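
Choosing per task can be as simple as a lookup. The sketch below is a hypothetical router, not a Perch feature: it picks the top scorer for a task and can optionally fall back to the best open-weight model when that model is within a chosen margin of the leader. The scores are the ones quoted in this article, so models whose scores are not stated in the text are left out; assigning GPT-5.5-Codex 82 to the build and 84 to the concept task follows the order in which the scores are quoted and is an assumption. The function name and the `open_margin` parameter are ours.

```python
# Scores quoted in this article. Opus 4.8's concept score is not stated
# in the text, so it appears only under "build".
SCORES_BY_TASK = {
    "build": {
        "GPT-5.5-Codex": 82,  # ASSUMPTION: 82 is the build score
        "Opus 4.8": 75,
        "DeepSeek-V4-Flash": 73,
        "Kimi-K2.6": 30,
    },
    "concept": {
        "DeepSeek-V4-Flash": 86,
        "GPT-5.5-Codex": 84,  # ASSUMPTION: 84 is the concept score
        "Kimi-K2.6": 83,
    },
}

OPEN_WEIGHT = {"DeepSeek-V4-Flash", "Kimi-K2.6"}


def pick_model(task, open_margin=0):
    """Return the model to use for a task.

    open_margin is a hypothetical knob: if the best open-weight model is
    within this many points of the leader, prefer it.
    """
    scores = SCORES_BY_TASK[task]
    best = max(scores, key=scores.get)
    open_candidates = [
        model
        for model in scores
        if model in OPEN_WEIGHT and scores[best] - scores[model] <= open_margin
    ]
    if open_candidates:
        return max(open_candidates, key=scores.get)
    return best


print(pick_model("build"))
print(pick_model("concept"))
print(pick_model("build", open_margin=10))
```

With a margin of zero the router simply returns the leader for each task; widening the margin is one way to express "use an open model when the frontier is not clearly better."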

Perch is model-agnostic, and that is the only reason this comparison was possible to run at all. It is not a dedicated coding harness in the way Cursor or Claude Code are, and we would not pretend otherwise. But it can be used as one, and the thing it does that they do not is let you point the same workflow at GPT-5.5-Codex, Opus 4.8, or an open-weight model, with the cost of each run in view. For the coders already using it, that is the whole appeal: run the model you actually want, see what it produces, see what it cost.

This is the same principle that runs through the rest of our work. The reason a build like the one we tested is a fair measure is that the output is checkable: the numbers are either right or they are not, the interactions either work or they do not. That is the standard we hold models to across everything we build, laid out in "How we evaluate models for verifiable work," and it is the same discipline that keeps agents reliable in production, covered in "Why most AI agents fail in production."

If you want to run open and closed models through one workflow and judge the output yourself, that is what the Perch CLI is for. If you want to see it on your own work, talk to us.