---
title: The Harness Problem Is Also A Training Problem
description: No, you really shouldn't default to hashline edits just because.
date: 9 June 2026
---

In February, this post on ["The Harness Problem"](https://blog.can.ac/2026/02/12/the-harness-problem/) made headlines: the bottleneck in AI coding agents isn't the model, it's the harness. It argues that with the same LLM, swapping only the edit tool makes success rates swing wildly.

I've been building [Pim](https://github.com/AaronCQL/pim-agent), my own harness on top of Pi, so naturally I wanted to check it myself, but with one change: measure *cost*, not just whether the task passed. After all, an agent also spends input tokens (your context budget), output tokens (your time), and dollars (your money!), so a tool can "pass" and still be horribly expensive.

So I built a [micro benchmark](https://github.com/AaronCQL/pim-agent/tree/main/benchmarks/edit) of 12 fully-specified editing tasks, scored by exact byte-match, to compare 3 different edit tools:[^data]

1. `replace`: string replacement strategy that Claude and many other models are trained on
2. `patch`: OpenAI's V4A diff format that Codex models use
3. `hashline`: the proposed solution to the harness problem, where every line is tagged with a short hash so that models can reference stable anchors instead of just line numbers
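To make the simplest of the three tools and the scoring rule concrete, here is a minimal sketch. It is not Pim's implementation: the function names are made up, and the "old string must match exactly once" rule is an assumption borrowed from how string-replacement edit tools commonly behave.

```python
def apply_replace(text: str, old: str, new: str) -> str:
    """Replace `old` with `new`, refusing ambiguous or missing matches.

    Assumption: the tool requires exactly one occurrence of `old`.
    """
    matches = text.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly 1 match for old string, found {matches}")
    return text.replace(old, new, 1)


def score_exact(actual: bytes, expected: bytes) -> bool:
    """Pass only on a byte-for-byte match; near misses count as failures."""
    return actual == expected


# Hypothetical task: bump a timeout in a two-line config file.
before = "timeout = 30\nretries = 3\n"
expected = b"timeout = 60\nretries = 3\n"

after = apply_replace(before, "timeout = 30", "timeout = 60")
passed = score_exact(after.encode("utf-8"), expected)

# Dropping the trailing newline is enough to fail the trial.
near_miss = score_exact(after.rstrip("\n").encode("utf-8"), expected)
```

The point of byte-match scoring is that there is no partial credit: an edit that is "basically right" but mangles whitespace still counts as a fail.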

The article is right that the harness dominates. But once you measure *cost* across different models and tasks, "which edit tool is best" stops having a clear answer. It actually dissolves into two other questions:

1. What was the model trained on?
2. What are you optimising for?

> Spoiler: `hashline` is almost never the cheapest edit tool to use.

## TLDR: What's The Cost?

Let's say we're optimising purely for cost (and let's face it, these APIs aren't getting cheaper), which we define as the combined dollar cost of the input and output tokens.[^qwen] To ensure costs across different models are fairly represented, the benchmark uses a range of different models, from frontier ones like GPT 5.5 and Opus 4.8 to locally hosted models like Qwen 3.6 35B.
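As a rough sketch of that definition, the cost of a trial is just tokens times price, summed over input and output. The prices below are the Qwen 3.6 35B rates from the footnote; the token counts are made up for illustration, and the sketch assumes flat per-token pricing with no cached-input discount.

```python
def dollar_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Combined dollar cost of one trial, with prices quoted per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Qwen 3.6 35B rates from the footnote (dollars per million tokens).
QWEN_INPUT_PRICE = 0.14
QWEN_OUTPUT_PRICE = 1.00

# Hypothetical trial: 200k input tokens and 10k output tokens.
cost = dollar_cost(200_000, 10_000, QWEN_INPUT_PRICE, QWEN_OUTPUT_PRICE)
# 200_000 * 0.14 + 10_000 * 1.00 = 38_000, so about $0.038
```

Because output tokens are priced several times higher than input tokens, a tool can use far more input tokens and still land close to another tool in dollars.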

For models fluent in `patch`, it's the most efficient tool to use, but for models that are not, it's exorbitantly expensive. Across the full benchmark, `hashline` is never cheaper than `replace`, and ranges from 8% to 38% more expensive.

```chart
{
  "title": "Average dollar cost per model per tool",
  "subtitle": "Relative to replace tool",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"name": "replace",  "color": "#f9a8d4", "values": [1.00, 1.00, 1.00, 1.00, 1.00]},
    {"name": "patch",    "color": "#5eead4", "values": [0.91, 0.96, 6.63, 4.90, 5.78]},
    {"name": "hashline", "color": "#fcd34d", "values": [1.08, 1.09, 1.16, 1.25, 1.38]}
  ],
  "y": {"max": 7, "ticks": [0, 2, 4, 6]},
  "labels": {"decimals": 2, "suffix": "×"}
}
```

## Replace: Simple Is Better

The most robust edit tool is `replace`: almost 100% pass rate with near-zero invalid tool calls across all 5 tested models. It is the only tool that never melts down.

```chart
{
  "title": "Pass rate by model and edit tool",
  "subtitle": "72 trials (12 tasks × 6 reps) per model per tool",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"name": "replace",  "color": "#f9a8d4", "values": [100, 98.6, 100, 100, 100]},
    {"name": "patch",    "color": "#5eead4", "values": [100, 94.4, 97.2, 61.1, 50]},
    {"name": "hashline", "color": "#fcd34d", "values": [100, 98.6, 100, 95.8, 97.2]}
  ],
  "y": {"max": 100, "ticks": [0, 25, 50, 75, 100]},
  "labels": {"decimals": 1, "trim": true}
}
```

## Patch: Cost Of Being Untrained

`patch` swings from best to worst depending purely on the model. Pim's tool description for `patch` is [deliberately minimal](https://github.com/AaronCQL/pim-agent/blob/13541c129d01be4dbf78f874fcbe467f75be7b71/src/extensions/apply-patch/index.ts#L20-L22): it tells the model the input is "a V4A patch", what the begin/end markers look like, and nothing else. There is no mention of the grammar (`@@` hunk headers, `+`/`-` line prefixes, etc.) at all. The error messages are descriptive, but they only appear *after* a failed call.
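For readers who haven't seen the format, here is roughly what a V4A patch looks like, going by OpenAI's public description of it rather than Pim's source. The file path and contents are made up, and the two helpers are a hypothetical sketch, not a real V4A parser: one checks only the begin/end envelope, the other pulls out the `+`/`-` lines.

```python
EXAMPLE_PATCH = """\
*** Begin Patch
*** Update File: src/config.py
@@ def load_config():
-    timeout = 30
+    timeout = 60
*** End Patch"""

# A plain unified diff, for contrast. It has no V4A envelope.
UNIFIED_DIFF = """\
--- a/src/config.py
+++ b/src/config.py
@@ -1 +1 @@
-    timeout = 30
+    timeout = 60"""


def has_patch_envelope(patch: str) -> bool:
    """Check only the begin/end markers, not the grammar in between."""
    lines = patch.strip().splitlines()
    return (
        len(lines) >= 2
        and lines[0] == "*** Begin Patch"
        and lines[-1] == "*** End Patch"
    )


def changed_lines(patch: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) lines from inside the envelope."""
    body = patch.strip().splitlines()[1:-1]
    removed = [line[1:] for line in body if line.startswith("-")]
    added = [line[1:] for line in body if line.startswith("+")]
    return removed, added


envelope_ok = has_patch_envelope(EXAMPLE_PATCH)
unified_ok = has_patch_envelope(UNIFIED_DIFF)
changed = changed_lines(EXAMPLE_PATCH)
```

A model that only knows the envelope from the tool description still has to guess everything between the markers, which is exactly the part the description leaves out.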

Thus, if a model produces a valid call on its first try, that knowledge almost certainly came from its weights, not from the prompt. That makes first-call success a good proxy for "is the model trained on this tool/format?".
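Here is a small sketch of how that proxy can be computed separately from the pass rate. The `Trial` record is a hypothetical schema, not the layout of the benchmark's raw results, and the four trials are invented.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    call_valid: list[bool]  # validity of each patch call, in the order it was made
    passed: bool            # whether the final file matched byte for byte


def first_call_success_rate(trials: list[Trial]) -> float:
    """Percentage of trials whose very first tool call was valid."""
    attempted = [t for t in trials if t.call_valid]
    if not attempted:
        return 0.0
    return 100 * sum(t.call_valid[0] for t in attempted) / len(attempted)


def pass_rate(trials: list[Trial]) -> float:
    """Percentage of trials that ended with the correct file."""
    return 100 * sum(t.passed for t in trials) / len(trials)


# Invented trials: two clean first calls, one recovery, one failure.
trials = [
    Trial(call_valid=[True], passed=True),
    Trial(call_valid=[False, False, True], passed=True),
    Trial(call_valid=[False, False, False], passed=False),
    Trial(call_valid=[True], passed=True),
]

first_call = first_call_success_rate(trials)
overall = pass_rate(trials)
```

Keeping the two numbers apart matters: a trial that recovers on the third call counts towards the pass rate but not towards first-call success.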

GPT 5.5 and Opus 4.8 both produce valid first calls every time. The rest (Gemini 3.5 Flash, DeepSeek V4 Flash, and Qwen 3.6 35B) all sit at 0% first-call success.

```chart
{
  "title": "Percentage of valid first calls to patch",
  "subtitle": "72 trials (12 tasks × 6 reps) per model",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"color": "#f9a8d4", "values": [100, 100, 0, 0, 0]}
  ],
  "y": {"max": 100, "ticks": [0, 25, 50, 75, 100]},
  "labels": {"decimals": 0}
}
```

However, Gemini still *passes* 97% of the time. It just pays 13.7x the input tokens and around 6.6x the dollars to get there, retrying against the error messages until a patch finally applies. Qwen and DeepSeek recover far less often: they stall at 50% and 61% pass rates respectively, burning 6.5x and 21.8x the input tokens on the way down.

```chart
{
  "title": "Average cost of patch per model",
  "subtitle": "Relative to replace tool",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"name": "input tokens",  "color": "#f9a8d4", "values": [0.92, 0.94, 13.73, 21.80, 6.50]},
    {"name": "output tokens", "color": "#5eead4", "values": [0.89, 0.99, 6.40, 5.87, 4.56]},
    {"name": "dollar cost",   "color": "#fcd34d", "values": [0.91, 0.96, 6.63, 4.90, 5.78]}
  ],
  "y": {"max": 23, "ticks": [0, 5, 10, 15, 20]},
  "rule": 1,
  "labels": {"decimals": 1, "suffix": "×"}
}
```

First-call success tracks training exposure (either the model has seen V4A or it hasn't). General capability tracks whether the model can recover, and how much that costs. Gemini is the clean dissociation: obviously capable (97% pass on `patch`), obviously not trained on V4A (0% first-call success on `patch`), recovers, but bleeds tokens doing it.

The most surprising result was Opus 4.8.[^opus] It is not a Codex model, yet it consistently nails `patch` on the first try. I could not find any official source saying whether Opus 4.8 was deliberately fine-tuned on V4A or just picked it up through incidental exposure (V4A is publicly documented). All the benchmark shows is that the knowledge didn't come from the prompt.

## Hashline: Narrow Use Case

Lastly, `hashline` is the one I most wanted to like, because its whole pitch is that weaker models stand to gain the most from it. However, for the weaker/smaller models like DeepSeek V4 Flash and Qwen 3.6 35B, the pass rates (95.8% and 97.2%) fall a few percentage points short of the 100% they get from just using `replace`.

The one metric where `hashline` ever edges `replace` is output tokens, and even that only holds on some models. Even then, it trades slightly fewer output tokens for disproportionately more input tokens. Across all five models, the dollar cost of that trade still favours `replace` over `hashline`.

```chart
{
  "title": "Average cost of hashline per model",
  "subtitle": "Relative to replace tool",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"name": "input tokens",  "color": "#f9a8d4", "values": [1.58, 2.52, 1.26, 1.50, 1.58]},
    {"name": "output tokens", "color": "#5eead4", "values": [0.74, 1.08, 0.93, 0.93, 1.06]},
    {"name": "dollar cost",   "color": "#fcd34d", "values": [1.08, 1.09, 1.16, 1.25, 1.38]}
  ],
  "y": {"max": 2.7, "ticks": [0, 0.5, 1, 1.5, 2, 2.5]},
  "rule": 1,
  "labels": {"decimals": 2, "suffix": "×"}
}
```

This tradeoff happens because every read call injects the hash anchors, and every edit changes the anchors again, even when the line numbers stay the same. So to make multiple sequential edits to a file, an agent has to run a `read ➝ edit ➝ read ➝ edit` cycle, paying the anchor overhead on every read.
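A toy version of the mechanism shows why a re-read is needed. Everything here is an assumption for illustration: the `number:anchor|` tag format, the 6-character SHA-1 prefix, and the stale-anchor error are not taken from any real hashline implementation.

```python
import hashlib


def anchor(line: str) -> str:
    """Short content hash for one line (hypothetical: first 6 hex chars of SHA-1)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()[:6]


def read_with_anchors(text: str) -> list[str]:
    """What a read call returns: every line prefixed with `number:anchor|`."""
    return [
        f"{n}:{anchor(line)}|{line}"
        for n, line in enumerate(text.splitlines(), start=1)
    ]


def edit_line(text: str, line_no: int, expected_anchor: str, new_line: str) -> str:
    """Replace one line, but only if the caller's anchor still matches its content."""
    lines = text.splitlines()
    if anchor(lines[line_no - 1]) != expected_anchor:
        raise ValueError("stale anchor, read the file again")
    lines[line_no - 1] = new_line
    return "\n".join(lines) + "\n"


source = "timeout = 30\nretries = 3\n"
first_read = read_with_anchors(source)
stale = anchor("timeout = 30")

edited = edit_line(source, 1, stale, "timeout = 60")
second_read = read_with_anchors(edited)

# Line 1 is still line 1, but its anchor moved with the content,
# so reusing the anchor from the first read is rejected.
try:
    edit_line(edited, 1, stale, "timeout = 90")
    stale_rejected = False
except ValueError:
    stale_rejected = True
```

In this sketch only the edited line's anchor changes, but the agent cannot know the new value without reading again, and every read repeats the per-line tag overhead as input tokens.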

Where hashline's design genuinely pays off seems to be high edit density: amortising the anchor tax of a single read across many compact edits. On exactly that [kind of task](https://github.com/AaronCQL/pim-agent/tree/main/benchmarks/edit/tasks/many-field-edits), it can be the cheapest tool. However, this is a very narrow use case, and even there the dollar cost is lower only for GPT 5.5 and Opus 4.8 (roughly 20% cheaper than `replace`); the other three models still pay more.

```chart
{
  "title": "Average cost of hashline on a high edit-density trial",
  "subtitle": "Relative to replace tool",
  "groups": ["GPT 5.5", "Opus 4.8", "Gemini 3.5 Flash", "DeepSeek V4 Flash", "Qwen 3.6 35B"],
  "series": [
    {"name": "input tokens",  "color": "#f9a8d4", "values": [1.55, 1.78, 1.07, 2.00, 1.40]},
    {"name": "output tokens", "color": "#5eead4", "values": [0.34, 0.52, 1.15, 0.57, 0.84]},
    {"name": "dollar cost",   "color": "#fcd34d", "values": [0.78, 0.80, 1.10, 1.08, 1.13]}
  ],
  "y": {"max": 2.2, "ticks": [0, 0.5, 1, 1.5, 2]},
  "rule": 1,
  "labels": {"decimals": 2, "suffix": "×"}
}
```

## There Is No Best Edit Tool

To conclude, the "best" edit tool really depends on the model and what you are optimising for:

- **`replace` as the sensible default**: especially on weaker or local models. It is the most robust, never spirals, and stays cheap.
- **`patch` if (and only if) the model was trained on it**: on Codex models (and apparently on recent Opus models), it is first-try clean and the cheapest option. On anything else, it ranges from expensive to downright catastrophic.
- **`hashline` when edit density is high enough to amortise the anchor tax**: it can reduce output tokens, but usually at the cost of significantly more input tokens over a long agentic workflow.

[^data]: All charts are computed from the [raw per-trial results](https://github.com/AaronCQL/pim-agent/tree/main/benchmarks/edit/results/r1).

[^qwen]: Token prices for Qwen 3.6 35B use OpenRouter's rate of $0.14/M input and $1/M output.

[^opus]: Opus 4.7 also has 100% first-call success; 4.6 is where first-call success degrades toward 0%.
