Harness Engineering — Rethinking How We Work With AI Agents

“Give it a map, not a 1,000-page instruction manual.”

There’s a pattern I keep seeing in teams that are getting real, consistent results from AI agents — and it’s not better prompts.

It’s better scaffolding.

OpenAI published a piece on how one of its teams built a product with Codex that crystallized something I’d been circling around. They called it Harness Engineering. Once I understood the framing, I couldn’t unsee it.


The Prompt Trap

Most of us started the same way: write a detailed system prompt, pack in all the rules, run the agent, get frustrated when it doesn’t follow them.

So we write a longer prompt. Add more rules. Get more inconsistent results.

The problem isn’t the prompt. The problem is we’re treating the agent like a function that takes instructions as input and produces correct output. It doesn’t work that way. An agent working on a complex, multi-step task needs process — not just instructions.

That process is the harness.


What Harness Engineering Is

The harness is everything around the agent: the loop it operates in, the docs it can navigate, the lints enforcing structure, the hooks that automate verification, the review cycle before a human sees anything.

The code the agent writes matters less than the environment it runs inside.

OpenAI’s core insight was this: agents perform better when given a map, not an encyclopedia. A giant AGENTS.md with 500 lines of rules fails in predictable ways:

  • Context is scarce. A huge instruction file crowds out the task, the code, and the relevant docs.
  • Too much guidance becomes non-guidance. When everything is “important,” nothing is.
  • It rots. Monolithic rule files become graveyards of stale instructions. Agents can’t tell what’s still true.

So instead of the encyclopedia, keep your context file short — around 100 lines — and treat it as a table of contents. The actual knowledge lives in a structured docs/ directory the agent can navigate when it needs to.

CLAUDE.md        ← map (~100 lines)
docs/
  architecture.md   ← where things live and why
  conventions.md
  workflow.md
  ...

This is progressive disclosure: the agent explores what it needs, when it needs it. Of all the docs in this structure, architecture.md is the most important starting point, because it’s the codemap that answers “where is the thing that does X?” See “ARCHITECTURE.md - Save AI Slop” for what that file should actually contain.
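One way to keep the map honest is to check it mechanically. The sketch below is only an illustration, not anything from the OpenAI piece: it assumes the map is called CLAUDE.md, that it points at docs with plain relative paths like docs/architecture.md, and that 100 lines is a hard budget. The function name check_map and the constant MAX_MAP_LINES are made up for this example.

```python
import re
from pathlib import Path

MAX_MAP_LINES = 100  # assumed budget, taken from the "around 100 lines" guideline

# Matches relative doc paths such as docs/architecture.md anywhere in the map text.
DOC_REF = re.compile(r"docs/[\w./-]+\.md")


def check_map(repo_root: Path, map_name: str = "CLAUDE.md") -> list[str]:
    """Return a list of problems found in the map file (empty means it passes)."""
    map_path = repo_root / map_name
    if not map_path.is_file():
        return [f"{map_name} is missing"]

    text = map_path.read_text(encoding="utf-8")
    problems = []

    # The map is a table of contents, so it has to stay short.
    line_count = len(text.splitlines())
    if line_count > MAX_MAP_LINES:
        problems.append(f"{map_name} has {line_count} lines (budget: {MAX_MAP_LINES})")

    # Every doc the map points at must exist, or the agent follows a dead link.
    for ref in sorted(set(DOC_REF.findall(text))):
        if not (repo_root / ref).is_file():
            problems.append(f"{map_name} references missing file: {ref}")

    return problems
```

Run in CI, a check like this catches the two failure modes described above: the map growing back into an encyclopedia, and the map pointing at docs that no longer exist.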


The Loop

The other half of the harness is the loop. One task, one cycle:

  1. Identify — define the task clearly
  2. Replicate — reproduce the problem or scenario
  3. Fix — agent proposes a solution
  4. Test — validate correctness
  5. Deploy / PR — ship it

An agent should do ONE task per loop. Not “build this feature” — “fix this specific failing test.” Not “refactor the module” — “extract this function and update its callers.”

Scope is a forcing function for reliability.

Day 0 rule: start with a human at every step, giving feedback, helping the agent resolve issues. The loop improves with time as the harness accumulates patterns. Don’t start with all the automation — start with the loop itself.
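To make the shape of the loop concrete, here is a minimal sketch of one task moving through the five steps with a human checkpoint after each one. Everything in it is an assumption for illustration: do_step stands in for whatever actually invokes the agent, approve stands in for the human giving feedback, and the names run_loop and LoopRun are invented here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

STEPS = ("identify", "replicate", "fix", "test", "deploy")


@dataclass
class LoopRun:
    task: str
    completed: list[str] = field(default_factory=list)
    stopped_at: Optional[str] = None  # step where the human said no, if any


def run_loop(
    task: str,
    do_step: Callable[[str, str], str],   # hypothetical: (step, task) -> agent output
    approve: Callable[[str, str], bool],  # hypothetical: (step, output) -> human verdict
) -> LoopRun:
    """Run ONE task through the five steps, pausing for approval after each."""
    run = LoopRun(task=task)
    for step in STEPS:
        output = do_step(step, task)
        if not approve(step, output):
            # Day 0 rule: a rejected step stops the loop instead of pushing on.
            run.stopped_at = step
            break
        run.completed.append(step)
    return run
```

As the harness matures, individual approve checks can be swapped for automated ones (a test run, a lint pass) without changing the loop itself.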


Five Things That Make the Harness Stronger

1. AI-written code should be AI-reviewed

The biggest bottleneck in AI-driven engineering isn’t writing — it’s reviewing. Agents can produce ten PRs in an hour. Humans can’t review ten PRs in an hour.

The solution isn’t to slow the agents down. It’s to put a review pass before the human sees anything. Have a different model review what the first model wrote. Cross-model review consistently produces better results than same-model self-review — the review cycle goes deeper before a human needs to step in.

Anything written by AI should be fixable by AI.
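A review pass like this can be sketched as a small loop between two callables. This is an illustration under stated assumptions, not a real API: write and review are hypothetical wrappers around two different models, and the round limit of 3 is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    approved: bool
    comments: list[str]


def review_cycle(
    task: str,
    write: Callable[[str, list[str]], str],  # hypothetical: model A, (task, feedback) -> draft
    review: Callable[[str, str], Review],    # hypothetical: model B, (task, draft) -> verdict
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Alternate writer and reviewer until approval or the round budget runs out."""
    feedback: list[str] = []
    draft = ""
    for _ in range(max_rounds):
        draft = write(task, feedback)
        verdict = review(task, draft)
        if verdict.approved:
            return draft, True
        # Reviewer comments become the writer's input for the next round.
        feedback = verdict.comments
    # Budget exhausted: hand the last draft to a human, flagged as unapproved.
    return draft, False
```

The human only sees drafts that either passed the reviewer or exhausted the budget, which is where their attention is most useful.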

2. CLI over MCP for tool access

When an agent needs to access a tool (a database, a cloud service, an API), give it a CLI with scoped permissions rather than a broad MCP connection. A scoped IAM role with read-only CloudWatch access is more predictable, easier to reason about, and requires less prompt overhead than connecting an MCP server and prompting the agent to stay within boundaries.

This isn’t a hard rule. But CLI with explicit scope is usually the more reliable starting point.
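As a rough illustration of explicit scope on the CLI side, the sketch below wraps command execution in an allowlist. It is an assumption-heavy example: the allowlist contents, and the names is_allowed and run_scoped, are invented here. It complements a scoped IAM role and does not replace it, since the role is what actually enforces permissions.

```python
import shlex
import subprocess

# Hypothetical allowlist: command prefixes the agent may run (read-only log access).
ALLOWED_PREFIXES = [
    ("aws", "logs", "filter-log-events"),
    ("aws", "logs", "describe-log-groups"),
]


def is_allowed(command: str) -> bool:
    """True if the command starts with one of the allowlisted prefixes."""
    argv = shlex.split(command)
    return any(tuple(argv[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES)


def run_scoped(command: str, timeout: int = 30) -> str:
    """Run an allowlisted command and return its stdout; refuse anything else."""
    if not is_allowed(command):
        raise PermissionError(f"command not in allowlist: {command}")
    # argv is passed directly (no shell), so shell metacharacters are not interpreted.
    done = subprocess.run(
        shlex.split(command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return done.stdout
```

The boundary lives in code and in the role's policy, so the prompt does not have to spend words asking the agent to behave.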

3. Logs are signal, not noise

Agents can catch things humans miss in logs — edge-case failures, patterns across multiple lines, silent errors that don’t trip a crash reporter. As humans, we’ve trained ourselves to scan for the high-signal stuff: crashes, exceptions, hard warnings. An agent reads everything.

The key is giving agents structured, queryable logs rather than raw dumps, and building a pipeline where agents identify bugs, assign a confidence score to each finding, and only act (or escalate) above a threshold.
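Here is a minimal sketch of both halves: querying structured logs and triaging findings by confidence. It assumes logs are JSON Lines with flat fields, and the two thresholds (0.9 to act, 0.6 to escalate) are placeholders to tune, not recommendations. Finding, query, and triage are names invented for this example.

```python
import json
from dataclasses import dataclass

ACT_THRESHOLD = 0.9       # assumed: agent may act on its own at or above this
ESCALATE_THRESHOLD = 0.6  # assumed: below this, the finding is dropped


@dataclass
class Finding:
    summary: str
    confidence: float  # 0.0 to 1.0, assigned by the agent


def query(log_lines: list[str], **filters) -> list[dict]:
    """Return JSON log events whose fields equal every given filter value."""
    events = (json.loads(line) for line in log_lines if line.strip())
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


def triage(findings: list[Finding]) -> dict[str, list[Finding]]:
    """Bucket findings into act / escalate / ignore by confidence."""
    buckets: dict[str, list[Finding]] = {"act": [], "escalate": [], "ignore": []}
    for finding in findings:
        if finding.confidence >= ACT_THRESHOLD:
            buckets["act"].append(finding)
        elif finding.confidence >= ESCALATE_THRESHOLD:
            buckets["escalate"].append(finding)
        else:
            buckets["ignore"].append(finding)
    return buckets
```

Keeping the thresholds in code rather than in a prompt means the decision to act is deterministic even though the confidence score itself is not.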

4. Custom lints are guardrails

Anything enforceable by a deterministic rule — file length, function length, naming conventions, import structure — should be a lint, not an agent prompt.

Lints are deterministic. Prompts are probabilistic. If you want functions under 35 lines, write a lint that fails the build when a function exceeds 35 lines. Then give the agent context of that rule so it doesn’t try to enforce it manually.

The goal is local autonomy: agents have freedom to solve the problem, within clearly enforced structural boundaries. Paradoxically, more constraints produce better output — the agent spends its capacity on the problem, not on negotiating structure.
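The 35-line rule is small enough to sketch in full. The version below assumes the codebase is Python and uses the standard ast module; long_functions and main are names made up for this example, and a real project would more likely express the rule as a plugin for its existing linter.

```python
import ast

MAX_FUNCTION_LINES = 35


def long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[tuple[str, int]]:
    """Return (name, line_count) for every function longer than max_lines."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count from the `def` line through the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                violations.append((node.name, length))
    return violations


def main(paths: list[str]) -> int:
    """Check each file; return 1 if any function is too long, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for name, length in long_functions(handle.read()):
                print(f"{path}: function {name} is {length} lines (max {MAX_FUNCTION_LINES})")
                failed = True
    return 1 if failed else 0
```

Wiring main into the build, for example by passing its return value to sys.exit in a small script, makes the rule fail deterministically, and the printed message gives the agent something specific to fix.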

5. Repo-local is the only reality

From an agent’s perspective, anything it can’t access in-context doesn’t exist. Chat threads, verbal decisions, shared docs, engineering lore — none of it is visible. Only what’s in the repository, versioned and readable, is real.

This means: convert institutional knowledge to repo-local markdown. Architecture decisions, conventions, context about why something was built a certain way — if it’s not in the repo, the agent will work around it or contradict it.


What Changes

Before harness engineering: the agent is a function. You pass it instructions, you get inconsistent results, you iterate on the prompt.

After: the agent is a process participant. It has a loop to operate in, a docs directory to navigate, lints to respect, and a review cycle that validates its output before a human sees it.

The discipline shifts from writing better prompts to designing better environments.

What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code.

That’s the reframe. The harness is the work.


Further Reading