The Harness Problem Is Also A Training Problem

@AaronCQL

9 June 2026

In February, this post on "The Harness Problem" made headlines: the bottleneck in AI coding agents isn't the model, it's the harness. It argues that with the same LLM, swapping only the edit tool makes success rates swing wildly.

I've been building Pim, my own harness on top of Pi, so naturally I wanted to check it myself, but with one change: measure cost (input/output tokens and actual money spent), not just whether the task passed. After all, an agent also spends input tokens (your context budget), output tokens (your time), and dollars (your money!). A tool can "pass" and still be horribly expensive.

So I built a micro benchmark of 12 fully specified editing tasks, scored by exact byte-match, to compare 3 different edit tools:¹

replace: string replacement strategy that Claude and many other models are trained on
patch: OpenAI's V4A diff format that Codex models use
hashline: the proposed solution to the harness problem, where every line is tagged with a short hash so that models can reference stable anchors instead of just line numbers
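To make the difference between the tools concrete, here is a minimal sketch of the replace and hashline ideas. It is not Pim's code or the original hashline implementation: the function names, the `number:hash|line` layout, the use of SHA-1, and the 4-character hash width are all assumptions made for illustration.

```python
import hashlib


def replace_once(text: str, old: str, new: str) -> str:
    """replace-style edit: swap one exact, unique occurrence of `old`."""
    count = text.count(old)
    if count != 1:
        # Zero or several matches are rejected rather than guessed at.
        raise ValueError(f"expected exactly 1 match, found {count}")
    return text.replace(old, new, 1)


def tag_lines(text: str, width: int = 4) -> list[str]:
    """hashline-style read: prefix every line with its number and a short content hash."""
    tagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        digest = hashlib.sha1(line.encode("utf-8")).hexdigest()[:width]
        tagged.append(f"{number}:{digest}|{line}")
    return tagged


source = "a = 1\nb = 2\n"
print(replace_once(source, "b = 2", "b = 3"))
for row in tag_lines(source):
    print(row)
```

Note that the tagged read is strictly longer than the plain file, which is where the extra input tokens discussed later come from.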

The article is right that the harness dominates. But once you measure cost across different models and tasks, "which edit tool is best" stops having a clear answer. It actually dissolves into two other questions:

What was the model trained on?
What are you optimising for?

Spoiler: hashline is almost never the cheapest edit tool to use.

TLDR: What's The Cost?

Let's say we're optimising purely for cost (and let's face it, these APIs aren't getting cheaper), which we define as the combined dollar cost of the input and output tokens.² To ensure costs across different models are fairly represented, the benchmark uses a range of different models, from frontier ones like GPT 5.5 and Opus 4.8 to locally hosted models like Qwen 3.6 35B.
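As a rough sketch of that definition, the cost of a trial is just tokens multiplied by per-million prices. The function name and the token counts below are made up for illustration; the prices are the Qwen 3.6 35B rates from the footnote. It assumes no prompt-caching discount.

```python
def dollar_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Combined dollar cost of one trial, with prices quoted per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Hypothetical trial: 50,000 input tokens and 2,000 output tokens
# at $0.14/M input and $1/M output.
cost = dollar_cost(50_000, 2_000, input_price_per_m=0.14, output_price_per_m=1.0)
print(f"${cost:.4f}")  # $0.0090
```

Even at a 7x higher output price, input tokens make up most of the bill in this example, which is why a tool that saves a few output tokens can still lose on dollars.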

For models fluent in patch, it's the most efficient tool to use, but for models that are not, it's exorbitantly expensive. Across the full benchmark, hashline is never cheaper than replace, and ranges from 8% to 38% more expensive.

Replace: Simple Is Better

The most robust edit tool is replace: almost 100% pass rate with near-zero invalid tool calls across all 5 tested models. It is the only tool that never melts down.

Patch: Cost Of Being Untrained

Patch swings from best to worst depending purely on the model. Pim's tool description for patch is deliberately minimal: it tells the model the input is "a V4A patch", what the begin/end markers look like, and nothing else. There is no mention of the grammar (@@ hunk headers, +/- line prefixes, etc.) at all. The error messages are descriptive, but they only appear after a failed call.

Thus, if a model produces a valid call on its first try, that knowledge almost certainly came from its weights, not from the prompt. That makes first-call success a good proxy for "is the model trained on this tool/format?".
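Computing the two numbers side by side could look something like this. The `Trial` record and its field names are hypothetical, not the benchmark's actual schema, and the sample trials are invented to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    model: str
    first_call_valid: bool  # did the very first tool call parse and apply?
    passed: bool            # did the final file byte-match the expected output?


def rates(trials: list[Trial]) -> tuple[float, float]:
    """Return (first-call success rate, pass rate) for a list of trials."""
    if not trials:
        raise ValueError("no trials to score")
    total = len(trials)
    first_call = sum(t.first_call_valid for t in trials) / total
    passed = sum(t.passed for t in trials) / total
    return first_call, passed


# Invented data: a model that never gets the first call right but often recovers.
sample = [
    Trial("model-x", first_call_valid=False, passed=True),
    Trial("model-x", first_call_valid=False, passed=True),
    Trial("model-x", first_call_valid=False, passed=True),
    Trial("model-x", first_call_valid=False, passed=False),
]
print(rates(sample))  # (0.0, 0.75)
```

Keeping the two rates separate is the point: the first says something about training exposure, the second about the ability to recover.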

GPT 5.5 and Opus 4.8 both produce valid first calls every time. The rest (Gemini 3.5 Flash, DeepSeek V4 Flash, and Qwen 3.6 35B) all sit at 0% first-call success.

However, Gemini still passes 97% of the time. It just pays 13.7x the input tokens and around 6.6x the dollars to get there, retrying against the error messages until a patch finally applies. Qwen and DeepSeek can't recover at all: they stall at 50%-61%, burning 6x-22x the input tokens on the way down.
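A simplified model shows why input tokens can grow much faster than the number of retries. It assumes that every call resends the whole conversation, that each failed call appends a fixed number of tokens (the bad patch plus its error message), and that there is no prompt caching. The function name and all numbers are illustrative, not measurements from the benchmark.

```python
def total_input_tokens(base_context: int, growth_per_retry: int, calls: int) -> int:
    """Sum input tokens over `calls` model calls.

    Call number k (0-based) sees the base context plus k failed attempts,
    each of which added `growth_per_retry` tokens to the conversation.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    return sum(base_context + attempt * growth_per_retry for attempt in range(calls))


one_shot = total_input_tokens(2_000, 500, calls=1)    # 2000
five_calls = total_input_tokens(2_000, 500, calls=5)  # 15000
print(five_calls / one_shot)  # 7.5
```

Under these assumptions, five calls cost 7.5x the input tokens of one, because the growing context is paid for again on every attempt.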

First-call success tracks training exposure (either the model has seen V4A or it hasn't). General capability tracks whether the model can recover, and how much that costs. Gemini is the clean dissociation: obviously capable (97% pass on patch), obviously not trained on V4A (0% first-call success on patch), recovers, but bleeds tokens doing it.

The most surprising result was Opus 4.8.³ It is not a Codex model, yet it consistently nails patch on the first try. There are no official sources (that I could find) saying whether Opus 4.8 was deliberately fine-tuned on V4A or whether this was just incidental exposure (since V4A is publicly documented), only that it didn't come from the prompt.

Hashline: Narrow Use Case

Lastly, hashline is the one I most wanted to like, because its whole pitch is about how weaker models can stand to gain the most from it. However, for the weaker/smaller models like DeepSeek V4 Flash and Qwen 3.6 35B, the pass rates are a few percentage points short of the 100% they get from just using replace.

The one metric where hashline ever edges replace is output tokens, and even that only holds on some models. However, it trades slightly fewer output tokens for disproportionately more input tokens. Across all five models, the dollar cost of that trade still favours replace over hashline.

This tradeoff happens because every read call injects the hash anchors, and every edit changes the anchors again, even if the line numbers didn't change. In an agentic loop, an agent must therefore run a read ➝ edit ➝ read ➝ edit cycle to make multiple sequential edits to a file, paying for the anchors on every read.
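The sketch below illustrates why an anchor from an earlier read stops working after an edit. It assumes anchors are derived from line content and that the tool rejects anchors that no longer match; the `number:hash` layout, SHA-1, and the function names are assumptions, not the real hashline implementation.

```python
import hashlib


def anchor(number: int, line: str, width: int = 8) -> str:
    """Anchor for a line: its number plus a short hash of its content."""
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest()[:width]
    return f"{number}:{digest}"


def edit_line(lines: list[str], target: str, new_line: str) -> list[str]:
    """Replace the line whose current anchor equals `target`; reject stale anchors."""
    for index, line in enumerate(lines):
        if anchor(index + 1, line) == target:
            return lines[:index] + [new_line] + lines[index + 1:]
    raise LookupError(f"stale or unknown anchor: {target}")


lines = ["def area(r):", "    return 3.14 * r * r"]
first_read = anchor(2, lines[1])
lines = edit_line(lines, first_read, "    return 3.1416 * r * r")

# Line 2 is still line 2, but its content (and so its anchor) has changed.
try:
    edit_line(lines, first_read, "    return math.pi * r * r")
except LookupError as error:
    print(error)
```

To edit the same line again, the agent needs the new anchor, which in practice means another read of the file.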

Where hashline's design genuinely pays off seems to be high edit density: amortising the anchor tax of a single read across many compact edits. On exactly that kind of task, it can be the cheapest tool. However, this is a very narrow use case, and the costs seem lower only for GPT 5.5 and Opus 4.8, with the rest still incurring higher dollar costs.

There Is No Best Edit Tool

To conclude, the "best" edit tool really depends on the model and what you are optimising for:

replace as the sensible default: especially on weaker or local models. It is the most robust, never spirals, and stays cheap.
patch if (and only if) the model was trained on it: on Codex models (and apparently on Claude), it is first-try clean and the lowest cost. On anything else, it ranges from expensive to downright catastrophic.
hashline when edit density is high enough to amortise the anchor tax: it can reduce output tokens, but usually at the cost of significantly more input tokens over a long agentic workflow.

Footnotes

¹ All charts are computed from the raw per-trial results.
² Token prices for Qwen 3.6 35B use OpenRouter's rate of $0.14/M input and $1/M output.
³ Opus 4.7 also has 100% first-call success; 4.6 is where first-call success degrades toward 0%.
