Harness Engineering — Rethinking How We Work With AI Agents

“Give it a map, not a 1,000-page instruction manual.”

There’s a pattern I keep seeing in teams that are getting real, consistent results from AI agents — and it’s not better prompts.

It’s better scaffolding.

OpenAI published a piece on building software with Codex that crystallized something I’d been circling around. They called it Harness Engineering. Once I understood the framing, I couldn’t unsee it.

The Prompt Trap

Most of us started the same way: write a detailed system prompt, pack in all the rules, run the agent, get frustrated when it doesn’t follow them.

So we write a longer prompt. Add more rules. And the results get less consistent, not more.

The problem isn’t the prompt. The problem is we’re treating the agent like a function that takes instructions as input and produces correct output. It doesn’t work that way. An agent working on a complex, multi-step task needs process — not just instructions.

That process is the harness.

What Harness Engineering Is

The harness is everything around the agent: the loop it operates in, the docs it can navigate, the lints enforcing structure, the hooks that automate verification, the review cycle before a human sees anything.

The code the agent writes matters less than the environment it runs inside.

OpenAI’s core insight was this: agents perform better when given a map, not an encyclopedia. A giant AGENTS.md with 500 lines of rules fails in predictable ways:

Context is scarce. A huge instruction file crowds out the task, the code, and the relevant docs.
Too much guidance becomes non-guidance. When everything is “important,” nothing is.
It rots. Monolithic rule files become graveyards of stale instructions. Agents can’t tell what’s still true.

So instead of the encyclopedia, keep your context file (AGENTS.md, CLAUDE.md, or whatever your agent reads first) short, around 100 lines, and treat it as a table of contents. The actual knowledge lives in a structured docs/ directory the agent can navigate when it needs to.

CLAUDE.md        ← map (~100 lines)
docs/
  architecture.md   ← where things live and why
  conventions.md
  workflow.md
  ...

This is progressive disclosure: the agent explores what it needs, when it needs it. Of all the docs in this structure, architecture.md is the most important starting point. It’s the codemap that answers “where is the thing that does X?” See “ARCHITECTURE.md - Save AI Slop” for what that file should actually contain.
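A map only helps if it stays short and its pointers stay valid, and both properties can be checked mechanically. The sketch below is one way to do that, under my own assumptions: the 100-line budget comes from the guideline above, while the function names and the rule "every docs/*.md path mentioned in the map must exist" are illustrative choices, not part of any tool.

```python
import re
from pathlib import Path

MAX_MAP_LINES = 100  # assumed budget, taken from the "around 100 lines" guideline


def check_map(map_text: str, existing_docs: set[str],
              max_lines: int = MAX_MAP_LINES) -> list[str]:
    """Return a list of problems found in the text of the map file."""
    problems = []
    line_count = len(map_text.splitlines())
    if line_count > max_lines:
        problems.append(f"map has {line_count} lines (budget is {max_lines})")
    # Every docs/*.md path the map mentions should exist in the repo.
    for ref in sorted(set(re.findall(r"docs/[\w./-]+\.md", map_text))):
        if ref not in existing_docs:
            problems.append(f"map points at missing file: {ref}")
    return problems


def check_repo(root: Path, map_name: str = "CLAUDE.md") -> list[str]:
    """Run check_map against a real checkout rooted at `root`."""
    map_text = (root / map_name).read_text(encoding="utf-8")
    existing = {p.relative_to(root).as_posix() for p in root.glob("docs/**/*.md")}
    return check_map(map_text, existing)
```

Run from CI, a non-empty result from `check_repo(Path("."))` would fail the build, which keeps the map from quietly growing back into an encyclopedia.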

The Loop

The other half of the harness is the loop. One task, one cycle:

Identify — define the task clearly
Replicate — reproduce the problem or scenario
Fix — agent proposes a solution
Test — validate correctness
Deploy / PR — ship it

An agent should do ONE task per loop. Not “build this feature” — “fix this specific failing test.” Not “refactor the module” — “extract this function and update its callers.”

Scope is a forcing function for reliability.

Day 0 rule: start with a human at every step, giving feedback, helping the agent resolve issues. The loop improves with time as the harness accumulates patterns. Don’t start with all the automation — start with the loop itself.
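To make the shape of the loop concrete, here is a minimal sketch with a human approval gate after every step, as the Day 0 rule suggests. Everything in it is hypothetical scaffolding: the `handlers` and `approve` callables stand in for whatever your agent and your reviewer actually do, and the step names simply mirror the five steps listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

STEPS = ["identify", "replicate", "fix", "test", "deploy"]


@dataclass
class LoopResult:
    completed: list = field(default_factory=list)
    stopped_at: Optional[str] = None  # step where the loop halted, if any


def run_loop(task: str,
             handlers: dict,
             approve: Callable[[str, str], bool]) -> LoopResult:
    """Run one task through the loop, one step at a time.

    handlers maps each step name to a callable(task) -> bool (did it succeed?).
    approve(step, task) is the human checkpoint; returning False halts the loop.
    """
    result = LoopResult()
    for step in STEPS:
        succeeded = handlers[step](task)
        if not succeeded or not approve(step, task):
            result.stopped_at = step
            return result
        result.completed.append(step)
    return result
```

As the harness matures, `approve` can be swapped for an automated check on the steps you trust, while the loop itself stays the same.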

Five Things That Make the Harness Stronger

1. AI-written code should be AI-reviewed

The biggest bottleneck in AI-driven engineering isn’t writing — it’s reviewing. Agents can produce ten PRs in an hour. Humans can’t review ten PRs in an hour.

The solution isn’t to slow the agents down. It’s to put a review pass before the human sees anything. Have a different model review what the first model wrote. In my experience, cross-model review tends to catch more than same-model self-review, so more problems get found and fixed before a human needs to step in.

Anything written by AI should be fixable by AI.
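The review pass can be sketched as a small write-review-revise cycle. This is an illustration under assumptions, not a real API: `write` and `review` are placeholder callables that would wrap two different models, and the three-round cap is an arbitrary number I picked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Review:
    approved: bool
    comments: str = ""


def review_cycle(task: str,
                 write: Callable[[str, str], str],
                 review: Callable[[str, str], Review],
                 max_rounds: int = 3):
    """Alternate writer and reviewer until approval or the round cap.

    Returns (draft, rounds_used, approved). An unapproved result is the
    signal to escalate to a human instead of looping forever.
    """
    feedback = ""
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = write(task, feedback)   # writer model sees the last review comments
        verdict = review(task, draft)   # reviewer is assumed to be a different model
        if verdict.approved:
            return draft, round_no, True
        feedback = verdict.comments
    return draft, max_rounds, False
```

The human only sees drafts that either passed review or exhausted the cap, which is where the reviewing bottleneck starts to ease.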

2. CLI over MCP for tool access

When an agent needs to access a tool (a database, a cloud service, an API), give it a CLI with scoped permissions rather than a broad MCP (Model Context Protocol) connection. The AWS CLI running under an IAM role with read-only CloudWatch access is more predictable, easier to reason about, and requires less prompt overhead than connecting an MCP server and prompting the agent to stay within boundaries.

This isn’t a hard rule. But CLI with explicit scope is usually the more reliable starting point.
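As a rough illustration of what "scoped" can mean, the sketch below builds a read-only IAM policy document for CloudWatch Logs and checks that nothing in it grants a write. The three `logs:` actions are real CloudWatch Logs actions; the helper names, the read-verb list, and the example ARN are my own assumptions, and a real policy should be reviewed against the AWS documentation before use.

```python
# Verbs treated as read-only in this sketch (an assumption, not an AWS rule).
READ_ONLY_VERBS = ("Get", "List", "Describe", "Filter")


def read_only_logs_policy(log_group_arn: str) -> dict:
    """Build an IAM policy document that only allows reading one log group."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
            ],
            "Resource": log_group_arn,
        }],
    }


def write_actions(policy: dict) -> list:
    """Return every allowed action whose verb is not on the read-only list."""
    offending = []
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            verb = action.split(":", 1)[-1]
            if not verb.startswith(READ_ONLY_VERBS):
                offending.append(action)  # includes wildcards such as "logs:*"
    return offending
```

With a role like this attached, the agent can run something like `aws logs filter-log-events --log-group-name <name> --filter-pattern ERROR`, and the boundary is enforced by IAM rather than by a sentence in the prompt.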

3. Logs are signal, not noise

Agents can catch things humans miss in logs — edge-case failures, patterns across multiple lines, silent errors that don’t trip a crash reporter. As humans, we’ve trained ourselves to scan for the high-signal stuff: crashes, exceptions, hard warnings. An agent reads everything.

The key is giving agents structured, queryable logs rather than raw dumps, and building a pipeline where agents identify bugs, assign a confidence score to each finding, and only act (or escalate) above a threshold.
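Here is a small sketch of both halves: logs as queryable records, and findings routed by confidence. The JSON-lines format, the field names, and the two thresholds (0.8 to act, 0.5 to escalate) are assumptions made for the example; the original idea only requires that some threshold exists.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str
    confidence: float  # 0.0 to 1.0, assigned by the agent (hypothetical scale)


def parse_json_lines(raw: str) -> list:
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def query(records: list, **filters) -> list:
    """Return the records whose fields equal every given filter value."""
    return [r for r in records
            if all(r.get(key) == value for key, value in filters.items())]


def triage(findings: list, act_at: float = 0.8, escalate_at: float = 0.5) -> dict:
    """Bucket findings: act on confident ones, escalate uncertain ones, drop the rest."""
    buckets = {"act": [], "escalate": [], "ignore": []}
    for finding in findings:
        if finding.confidence >= act_at:
            buckets["act"].append(finding)
        elif finding.confidence >= escalate_at:
            buckets["escalate"].append(finding)
        else:
            buckets["ignore"].append(finding)
    return buckets
```

The agent would call something like `query(records, level="error", service="api")` instead of reading a raw dump, and only the `act` bucket turns into automated fixes.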

4. Custom lints are guardrails

Anything enforceable by a deterministic rule — file length, function length, naming conventions, import structure — should be a lint, not an agent prompt.

Lints are deterministic. Prompts are probabilistic. If you want functions under 35 lines, write a lint that fails the build when a function exceeds 35 lines. Then tell the agent the rule exists, so it knows the limit and leaves enforcement to the lint.
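The 35-line rule is small enough to sketch in full. This version uses Python’s standard `ast` module to measure each function from its `def` line to its last line; the names `long_functions` and `main` are mine, and counting decorators or blank lines differently would be an equally reasonable choice.

```python
import ast

MAX_FUNCTION_LINES = 35  # the limit used as the example in the text


def long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list:
    """Return (name, length) for every function longer than max_lines."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length counts from the def line through the last body line.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                offenders.append((node.name, length))
    return offenders


def main(paths: list) -> int:
    """Check each file; return 1 if any function is too long, else 0."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for name, length in long_functions(handle.read()):
                print(f"{path}: {name} is {length} lines (max {MAX_FUNCTION_LINES})")
                failed = True
    return 1 if failed else 0
```

Wired into CI as `sys.exit(main(sys.argv[1:]))`, the non-zero exit code fails the build, and the printed message gives the agent something specific to fix.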

The goal is local autonomy: agents have freedom to solve the problem, within clearly enforced structural boundaries. Paradoxically, more constraints produce better output — the agent spends its capacity on the problem, not on negotiating structure.

5. Repo-local is the only reality

From an agent’s perspective, anything it can’t access in-context doesn’t exist. Chat threads, verbal decisions, shared docs, engineering lore — none of it is visible. Only what’s in the repository, versioned and readable, is real.

This means: convert institutional knowledge to repo-local markdown. Architecture decisions, conventions, context about why something was built a certain way — if it’s not in the repo, the agent will work around it or contradict it.

What Changes

Before harness engineering: the agent is a function. You pass it instructions, you get inconsistent results, you iterate on the prompt.

After: the agent is a process participant. It has a loop to operate in, a docs directory to navigate, lints to respect, and a review cycle that validates its output before a human sees it.

The discipline shifts from writing better prompts to designing better environments.

What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code.

That’s the reframe. The harness is the work.
