Harness Engineering — Rethinking How We Work With AI Agents
“Give it a map, not a 1,000-page instruction manual.”
There’s a pattern I keep seeing in teams that are getting real, consistent results from AI agents — and it’s not better prompts.
It’s better scaffolding.
OpenAI published a piece on how they built Codex that crystallized something I’d been circling around. They called it Harness Engineering. Once I understood the framing, I couldn’t unsee it.
The Prompt Trap
Most of us started the same way: write a detailed system prompt, pack in all the rules, run the agent, get frustrated when it doesn’t follow them.
So we write a longer prompt. Add more rules. Get more inconsistent results.
The problem isn’t the prompt. The problem is we’re treating the agent like a function that takes instructions as input and produces correct output. It doesn’t work that way. An agent working on a complex, multi-step task needs process — not just instructions.
That process is the harness.
What Harness Engineering Is
The harness is everything around the agent: the loop it operates in, the docs it can navigate, the lints enforcing structure, the hooks that automate verification, the review cycle before a human sees anything.
The code the agent writes matters less than the environment it runs inside.
OpenAI’s core insight was this: agents perform better when given a map, not an encyclopedia. A giant AGENTS.md with 500 lines of rules fails in predictable ways:
- Context is scarce. A huge instruction file crowds out the task, the code, and the relevant docs.
- Too much guidance becomes non-guidance. When everything is “important,” nothing is.
- It rots. Monolithic rule files become graveyards of stale instructions. Agents can’t tell what’s still true.
So instead of the encyclopedia, keep your context file short — around 100 lines — and treat it as a table of contents. The actual knowledge lives in a structured docs/ directory the agent can navigate when it needs to.
CLAUDE.md                ← map (~100 lines)
docs/
    architecture.md      ← where things live and why
    conventions.md
    workflow.md
    ...
This is progressive disclosure: the agent explores what it needs, when it needs it. Of all the docs in this structure, architecture.md is the most important starting point. It’s the codemap that answers “where is the thing that does X?” See “ARCHITECTURE.md - Save AI Slop” for what that file should actually contain.
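One way to keep the map honest is to check it mechanically. The sketch below is my own illustration, not something from the OpenAI piece: it assumes the map is called CLAUDE.md, that it refers to docs by plain paths like docs/architecture.md, and that 100 lines is the budget. The function name check_map and the regex are hypothetical.

```python
import re
from pathlib import Path

MAX_MAP_LINES = 100  # assumed budget, taken from the "~100 lines" guideline
DOC_LINK = re.compile(r"docs/[\w\-./]+\.md")  # assumed style of doc references


def check_map(repo_root: Path, map_name: str = "CLAUDE.md") -> list[str]:
    """Return a list of problems with the context map; empty means it passes."""
    problems = []
    text = (repo_root / map_name).read_text(encoding="utf-8")

    # The map should stay a table of contents, not grow into an encyclopedia.
    line_count = len(text.splitlines())
    if line_count > MAX_MAP_LINES:
        problems.append(f"{map_name} has {line_count} lines (limit {MAX_MAP_LINES})")

    # Every doc the map points to should actually exist in the repo.
    for target in sorted(set(DOC_LINK.findall(text))):
        if not (repo_root / target).is_file():
            problems.append(f"{map_name} points to missing file: {target}")
    return problems
```

Run from CI or a pre-commit hook, a non-empty result would fail the check, which catches both a bloated map and stale pointers.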
The Loop
The other half of the harness is the loop. One task, one cycle:
- Identify — define the task clearly
- Replicate — reproduce the problem or scenario
- Fix — agent proposes a solution
- Test — validate correctness
- Deploy / PR — ship it
An agent should do ONE task per loop. Not “build this feature” — “fix this specific failing test.” Not “refactor the module” — “extract this function and update its callers.”
Scope is a forcing function for reliability.
Day 0 rule: start with a human at every step, giving feedback, helping the agent resolve issues. The loop improves with time as the harness accumulates patterns. Don’t start with all the automation — start with the loop itself.
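To make the loop and the Day 0 rule concrete, here is a minimal sketch under my own assumptions: each step is a callable that returns True on success, and a human approval callback runs after every step. The names STEPS, LoopRun, and run_loop are hypothetical, not part of any tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# The five steps of the loop, in order.
STEPS = ("identify", "replicate", "fix", "test", "deploy")


@dataclass
class LoopRun:
    task: str
    completed: list = field(default_factory=list)
    stopped_at: Optional[str] = None  # step where the cycle halted, if any


def run_loop(
    task: str,
    handlers: dict,  # step name -> callable(task) returning True on success
    approve: Callable[[str, str], bool],  # human checkpoint: (step, task) -> bool
) -> LoopRun:
    """Run one task through one cycle, stopping at the first failed or rejected step."""
    run = LoopRun(task)
    for step in STEPS:
        succeeded = handlers[step](task)
        # Day 0: a human signs off on every step before the loop continues.
        if not succeeded or not approve(step, task):
            run.stopped_at = step
            break
        run.completed.append(step)
    return run
```

As the harness matures, the approve callback is the piece you would relax, for example by auto-approving steps that have been reliable for a while.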
Five Things That Make the Harness Stronger
1. AI-written code should be AI-reviewed
The biggest bottleneck in AI-driven engineering isn’t writing — it’s reviewing. Agents can produce ten PRs in an hour. Humans can’t review ten PRs in an hour.
The solution isn’t to slow the agents down. It’s to put a review pass before the human sees anything. Have a different model review what the first model wrote. In my experience, cross-model review tends to produce better results than same-model self-review: the review cycle goes deeper before a human needs to step in.
Anything written by AI should be fixable by AI.
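A rough sketch of that review pass, with the model calls left abstract. The write and review callables are placeholders for whatever two models you use; Review, review_cycle, and the round limit of 3 are assumptions for illustration.

```python
from typing import Callable, NamedTuple


class Review(NamedTuple):
    approved: bool
    comments: list  # reviewer feedback passed back to the author model


def review_cycle(
    task: str,
    write: Callable[[str, list], str],  # author model: (task, feedback) -> patch
    review: Callable[[str], Review],    # a different model: patch -> Review
    max_rounds: int = 3,                # assumed cap before a human takes over
):
    """Iterate write/review until the reviewer approves or the rounds run out."""
    feedback = []
    patch = ""
    for _ in range(max_rounds):
        patch = write(task, feedback)
        verdict = review(patch)
        if verdict.approved:
            return patch, True
        feedback = verdict.comments
    # Not approved: hand the last patch to a human along with the open comments.
    return patch, False
```

The human only sees patches that already survived the reviewer, or ones the two models could not settle within the round limit.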
2. CLI over MCP for tool access
When an agent needs to access a tool (a database, a cloud service, an API), give it a CLI with scoped permissions rather than a broad MCP connection. A scoped IAM role with read-only CloudWatch access is more predictable, easier to reason about, and requires less prompt overhead than connecting an MCP server and prompting the agent to stay within boundaries.
This isn’t a hard rule. But CLI with explicit scope is usually the more reliable starting point.
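The real boundary should be the IAM role itself, but a thin wrapper can add a second, visible layer. The sketch below assumes the agent submits commands as strings and that only a few read-only AWS CLI subcommands for CloudWatch Logs are allowed; the allow-list and the scoped_argv name are illustrative, not a recommendation of exactly which commands to expose.

```python
import shlex

# Hypothetical allow-list: read-only AWS CLI subcommands for CloudWatch Logs.
ALLOWED_PREFIXES = (
    ("aws", "logs", "describe-log-groups"),
    ("aws", "logs", "filter-log-events"),
    ("aws", "logs", "get-log-events"),
)


def scoped_argv(command: str) -> list:
    """Split a command string and return its argv only if it is on the allow-list."""
    argv = shlex.split(command)
    for prefix in ALLOWED_PREFIXES:
        if tuple(argv[: len(prefix)]) == prefix:
            return argv
    raise PermissionError(f"command not in scope: {command!r}")
```

The returned list is meant to be passed to subprocess.run without shell=True, so shell metacharacters in the agent's input are treated as plain arguments rather than executed.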
3. Logs are signal, not noise
Agents can catch things humans miss in logs — edge-case failures, patterns across multiple lines, silent errors that don’t trip a crash reporter. As humans, we’ve trained ourselves to scan for the high-signal stuff: crashes, exceptions, hard warnings. An agent reads everything.
The key is giving agents structured, queryable logs, not raw dumps, and building a pipeline where agents can identify bugs, assign a confidence score to each finding, and act (or escalate) only above a threshold.
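Here is a small sketch of both halves, assuming logs arrive as JSON lines and that the agent attaches a confidence between 0.0 and 1.0 to each finding. The field names, the Finding type, and the 0.8 threshold are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str
    confidence: float  # 0.0 to 1.0, assigned by the agent (hypothetical scale)


def parse_json_lines(lines):
    """Parse JSON-lines logs into dicts, skipping lines that are not JSON objects."""
    records = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def query(records, **fields):
    """Return records whose fields match every given key/value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in fields.items())]


def triage(findings, threshold=0.8):
    """Split findings into those to act on (or escalate) and those to hold."""
    act = [f for f in findings if f.confidence >= threshold]
    hold = [f for f in findings if f.confidence < threshold]
    return act, hold
```

Findings in the hold list are not thrown away; they can be recorded so that a repeated low-confidence pattern becomes visible over time.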
4. Custom lints are guardrails
Anything enforceable by a deterministic rule — file length, function length, naming conventions, import structure — should be a lint, not an agent prompt.
Lints are deterministic. Prompts are probabilistic. If you want functions under 35 lines, write a lint that fails the build when a function exceeds 35 lines. Then tell the agent the rule exists so it doesn’t try to enforce it manually.
The goal is local autonomy: agents have freedom to solve the problem, within clearly enforced structural boundaries. Paradoxically, more constraints produce better output — the agent spends its capacity on the problem, not on negotiating structure.
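For a Python codebase, the 35-line rule can be sketched with the standard ast module. This is a minimal example rather than a full linter plugin; it counts from the def line to the last line of the body, and the function name long_functions is my own.

```python
import ast

MAX_FUNCTION_LINES = 35  # the limit used as the example in the text


def long_functions(source: str, limit: int = MAX_FUNCTION_LINES) -> list:
    """Return (name, length) for every function longer than the limit."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length spans the def line through the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > limit:
                offenders.append((node.name, length))
    return offenders
```

In CI, a script would run this over each changed file and exit non-zero when the list is non-empty, which is what turns the convention into a build failure instead of a prompt.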
5. Repo-local is the only reality
From an agent’s perspective, anything it can’t access in-context doesn’t exist. Chat threads, verbal decisions, shared docs, engineering lore — none of it is visible. Only what’s in the repository, versioned and readable, is real.
This means: convert institutional knowledge to repo-local markdown. Architecture decisions, conventions, context about why something was built a certain way — if it’s not in the repo, the agent will work around it or contradict it.
What Changes
Before harness engineering: the agent is a function. You pass it instructions, you get inconsistent results, you iterate on the prompt.
After: the agent is a process participant. It has a loop to operate in, a docs directory to navigate, lints to respect, and a review cycle that validates its output before a human sees it.
The discipline shifts from writing better prompts to designing better environments.
What’s become clear: building software still demands discipline, but the discipline shows up more in the scaffolding than in the code.
That’s the reframe. The harness is the work.