The Harness Problem Is Also A Training Problem
In February, this post on "The Harness Problem" made headlines: the bottleneck in AI coding agents isn't the model, it's the harness. It argues that with the same LLM, swapping only the edit tool makes success rates swing wildly.
I've been building Pim, my own harness on top of Pi, so naturally I wanted to check it myself, but with one change: measure cost (in terms of input/output tokens and actual money spent), not just whether the task passed. After all, an agent also spends input tokens (your context budget), output tokens (your time), and dollars (your money!) — a tool can "pass" and still be horribly expensive.
So I built a micro-benchmark of 12 fully specified editing tasks, scored by exact byte-match, and ran it across 3 different edit tools:[1]
- replace: the string replacement strategy that Claude and many other models are trained on
- patch: OpenAI's V4A diff format that Codex models use
- hashline: the proposed solution to the harness problem, where every line is tagged with a short hash so that models can reference stable anchors instead of just line numbers
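To make the scoring rule concrete, here is a minimal sketch of a replace-style edit followed by an exact byte-match check. The function names and the "exactly one match" rule are my assumptions for illustration, not Pim's actual implementation.

```python
def apply_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, rejecting missing or ambiguous matches.

    Assumption: the tool insists on exactly one occurrence of `old`.
    """
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for the old string, found {count}")
    return text.replace(old, new, 1)


def score(result: bytes, expected: bytes) -> bool:
    """Exact byte-match: any difference at all, even trailing whitespace, fails."""
    return result == expected


# Hypothetical task: a tiny file and its fully specified target.
original = "def greet():\n    print('hello')\n"
expected = "def greet():\n    print('hello, world')\n"

edited = apply_replace(original, "print('hello')", "print('hello, world')")
passed = score(edited.encode("utf-8"), expected.encode("utf-8"))

# One stray newline is enough to fail the task.
near_miss = score((edited + "\n").encode("utf-8"), expected.encode("utf-8"))
```

Byte-match scoring leaves no room for "close enough", which is why a fully specified task matters: there has to be exactly one correct output.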
The article is right that the harness dominates. But once you measure cost across different models and tasks, "which edit tool is best" stops having a clear answer. It actually dissolves into two other questions:
- What was the model trained on?
- What are you optimising for?
Spoiler: hashline is almost never the cheapest edit tool to use.
TLDR: What's The Cost?
Let's say we're optimising purely for cost (and let's face it, these APIs aren't getting cheaper), which we define as the combined dollar cost of the input and output tokens.[2] To ensure costs across different models are fairly represented, the benchmark uses a range of different models, from frontier ones like GPT 5.5 and Opus 4.8 to locally hosted models like Qwen 3.6 35B.
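As a rough sketch of that cost definition: dollars are just tokens multiplied by the per-million rates. The Qwen 3.6 35B rates below are the ones quoted in the footnotes; the token counts are made up for illustration, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


def dollar_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Combined dollar cost of the input and output tokens of one trial."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Rates from the footnote: $0.14/M input and $1/M output.
QWEN = Pricing(input_per_million=0.14, output_per_million=1.00)

# Hypothetical trial: 200k input tokens, 10k output tokens.
cost = dollar_cost(200_000, 10_000, QWEN)  # roughly 0.038 USD
```

Because output tokens are usually priced several times higher than input tokens, a tool that saves a few output tokens can still lose on dollars if it inflates the input side.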
For models fluent in patch, it's the most efficient tool
to use, but for models that are not, it's exorbitantly expensive.
Across the full benchmark, hashline is never cheaper than
replace, and ranges from 8% to 38% more expensive.
Replace: Simple Is Better
The most robust edit tool is replace: almost 100% pass
rate with near-zero invalid tool calls across all 5 tested models. It
is the only tool that never melts down.
Patch: Cost Of Being Untrained
patch swings from best to worst depending purely
on the model. Pim's tool description for patch is
deliberately minimal: it tells the model the input is "a V4A patch", what the begin/end
markers look like, and nothing else. There is no mention of the
grammar (@@ hunk headers, +/-
line prefixes, etc.) at all. The error messages are descriptive but
they only appear after a failed call.
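For readers who haven't seen the format, this is roughly what a small V4A patch looks like, as far as I understand the publicly documented format. The file name is hypothetical, and the checker is a deliberately shallow sketch of the markers and line prefixes, not a real V4A parser.

```python
# An illustrative V4A-style patch (hypothetical file and contents).
EXAMPLE_PATCH = """\
*** Begin Patch
*** Update File: greet.py
@@ def greet():
-    print('hello')
+    print('hello, world')
*** End Patch
"""


def looks_like_v4a(patch: str) -> bool:
    """Shallow structural check: begin/end markers plus known line prefixes.

    This does not validate hunks against a file; it only checks the shape.
    """
    lines = patch.strip("\n").split("\n")
    if len(lines) < 3:
        return False
    if lines[0] != "*** Begin Patch" or lines[-1] != "*** End Patch":
        return False
    allowed_prefixes = ("*** ", "@@", "+", "-", " ")
    return all(line.startswith(allowed_prefixes) for line in lines[1:-1])


# A plain unified diff has the right line prefixes but the wrong envelope.
UNIFIED_DIFF = "--- a/greet.py\n+++ b/greet.py\n@@ -1,2 +1,2 @@\n-old\n+new\n"

valid = looks_like_v4a(EXAMPLE_PATCH)
invalid = looks_like_v4a(UNIFIED_DIFF)
```

A model that only knows unified diffs will produce something like the second string, which is close enough to look plausible and still get rejected on the first call.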
Thus, if a model produces a valid call on its first try, that knowledge almost certainly came from its weights, not from the prompt. That makes first-call success a good proxy for "is the model trained on this tool/format?".
GPT 5.5 and Opus 4.8 are both able to produce valid first calls every time. The rest, Gemini 3.5 Flash, DeepSeek V4 Flash, and Qwen 3.6 35B, all sit at 0% first-call success.
However, Gemini still passes 97% of the time. It just pays 13.7x the input tokens and around 6.6x the dollars to get there, retrying against the error messages until a patch finally applies. Qwen and DeepSeek can't recover at all: they stall at 50%-61%, burning 6x-22x the input tokens on the way down.
First-call success tracks training exposure (either the model has seen
V4A or it hasn't). General capability tracks whether the model can
recover, and how much that costs. Gemini is the clean dissociation:
obviously capable (97% pass on patch), obviously not
trained on V4A (0% first-call success on patch),
recovers, but bleeds tokens doing it.
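The two metrics above can be computed from per-trial records along these lines. This is a sketch: the Trial shape is my assumption for illustration, and the sample data is invented (placeholder model names, not benchmark results).

```python
from dataclasses import dataclass


@dataclass
class Trial:
    model: str
    tool: str
    call_valid: list[bool]  # validity of each edit-tool call, in order
    passed: bool            # final exact byte-match result


def _select(trials: list[Trial], model: str, tool: str) -> list[Trial]:
    return [t for t in trials if t.model == model and t.tool == tool]


def first_call_success(trials: list[Trial], model: str, tool: str) -> float:
    """Share of trials whose very first tool call was valid (training proxy)."""
    rows = [t for t in _select(trials, model, tool) if t.call_valid]
    return sum(t.call_valid[0] for t in rows) / len(rows)


def pass_rate(trials: list[Trial], model: str, tool: str) -> float:
    """Share of trials that ended in a byte-exact result (capability proxy)."""
    rows = _select(trials, model, tool)
    return sum(t.passed for t in rows) / len(rows)


# Invented data: model-a is fluent, model-b recovers only after retries.
trials = [
    Trial("model-a", "patch", [True], True),
    Trial("model-a", "patch", [True], True),
    Trial("model-b", "patch", [False, False, True], True),
    Trial("model-b", "patch", [False, True], True),
]

fluent = first_call_success(trials, "model-a", "patch")
untrained = first_call_success(trials, "model-b", "patch")
recovered = pass_rate(trials, "model-b", "patch")
```

In this toy data, model-b shows the same dissociation: zero first-call success but a perfect pass rate, with the extra calls being where the tokens go.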
The most surprising result was Opus 4.8.[3]
It is not a Codex model, yet it consistently nails patch on the first
try. I could find no official source saying whether Opus 4.8 was
deliberately fine-tuned on V4A or merely picked it up through incidental
exposure (V4A is publicly documented). All the benchmark shows is that
the knowledge didn't come from the prompt.
Hashline: Narrow Use Case
Lastly, hashline is the one I most wanted to like,
because its whole pitch is about how weaker models can stand to gain
the most from it. However, for the weaker/smaller models like DeepSeek
V4 Flash and Qwen 3.6 35B, the pass rates are a few percentage points
short of the 100% they get from just using replace.
The one metric where hashline ever edges
replace is output tokens, and even that only holds on
some models. However, it trades slightly fewer output tokens for
disproportionately more input tokens. Across all five models, the
dollar cost of that trade still favours replace over
hashline.
This tradeoff happens because every read call injects the hash
anchors, and every edit changes the anchors again, even if the line
numbers didn't. So to make multiple sequential edits to a file, an
agent must run a read ➝ edit ➝ read ➝ edit cycle, paying for the
anchors again on every read.
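Here is a toy illustration of why anchors go stale. I'm assuming the anchor combines the line number with a short hash of the line's content; the real hashline scheme may differ, and all names here are hypothetical.

```python
import hashlib


def anchor(line_no: int, line: str) -> str:
    """Line number plus a short content hash (assumed scheme, 6 hex chars)."""
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest()[:6]
    return f"{line_no}:{digest}"


def render(lines: list[str]) -> list[str]:
    """What a read call would return: every line prefixed with its anchor."""
    return [f"{anchor(i, line)}|{line}" for i, line in enumerate(lines, start=1)]


def apply_edit(lines: list[str], target: str, new_line: str) -> list[str]:
    """Replace the line whose current anchor matches `target`."""
    for i, line in enumerate(lines, start=1):
        if anchor(i, line) == target:
            return lines[: i - 1] + [new_line] + lines[i:]
    raise LookupError(f"stale or unknown anchor: {target}")


lines = ["a = 1", "b = 2", "c = 3"]
old_anchor = anchor(2, lines[1])

# The first edit succeeds and keeps every line number the same...
edited = apply_edit(lines, old_anchor, "b = 20")

# ...but the anchor for line 2 has changed, so reusing it fails.
try:
    apply_edit(edited, old_anchor, "b = 200")
    reused_ok = True
except LookupError:
    reused_ok = False
```

The agent can only get the fresh anchor from another read, and each read carries the anchor prefix on every line of the file, not just the one being edited.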
Where hashline's design genuinely pays off seems to be high edit density: amortising the anchor tax of a single read across many compact edits. On exactly that kind of task, it can be the cheapest tool. However, this is a very narrow use case, and the costs seem lower only for GPT 5.5 and Opus 4.8, with the rest still incurring higher dollar costs.
There Is No Best Edit Tool
To conclude, the "best" edit tool really depends on the model and what you are optimising for:
- replace as the sensible default, especially on weaker or local models. It is the most robust, never spirals, and stays cheap.
- patch if (and only if) the model was trained on it: on Codex models (and apparently on Claude), it is first-try clean and the lowest cost. On anything else, it ranges from expensive to downright catastrophic.
- hashline when edit density is high enough to amortise the anchor tax: it can reduce output tokens, but usually at the cost of significantly more input tokens over a long agentic workflow.
Footnotes
1. All charts are computed from the raw per-trial results.
2. Token prices for Qwen 3.6 35B use OpenRouter's rates of $0.14/M input and $1/M output.
3. Opus 4.7 also has 100% first-call success; 4.6 is where first-call success degrades toward 0%.