1. Where AI Still Falls Short

The authors of SWE-Bench just released a set of numbers that should make engineers uncomfortable.

Their new hardcore benchmark, ProgramBench, asks AI to rebuild real open-source software projects from scratch. No internet access. No judging by code similarity. Only final behavior counts.

The result: Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro — every frontier model scored a 0% completion rate.

To be clear, this does not mean AI cannot write code. It can write plenty of code. It can make individual functions look beautiful.

But ask it to build a real project from zero that actually runs, and it stuffs all the logic into one monolithic file. No modularity, no architecture, no long-term plan. In the end, it fails behavioral verification.

What does that tell us?

Code generation is no longer the bottleneck. The bottleneck is system architecture and engineering practice.

Now look at another data point.

LangChain ran an experiment. They used the same gpt-5.2-codex model without changing a single weight. They only optimized the engineering structure around the model. The result: their coding agent went from 52.8 to 66.5 on Terminal-Bench 2.0, jumping from outside the Top 30 into the Top 5.

OpenAI has shared an even more extreme case. A three-person team used Codex for five months to write roughly one million lines of code and merge about 1,500 PRs. This was not a demo. It was a real software product with internal daily active users and external alpha testers. OpenAI’s team said their focus had shifted from writing code to four things: designing the environment, making intent explicit, building feedback loops, and letting the Agent see, verify, and repair its own work.

Put these two data points together and the conclusion is obvious:

Raw models have a 0% completion rate on serious engineering tasks. Add the right engineering infrastructure around the same model, and a three-person team can build production-grade software.

The gap between those two outcomes is the Harness.

2. What Is a Harness?

The literal meaning of harness is the gear used to control a horse.

A large model, whether Opus 4.7 or GPT-5.5, is like a wild horse. If you cannot control it with reins, its speed means nothing.

Harness Engineering is the craft of building that gear and using it to control the horse.

Hooks, Skills, MCP, CLAUDE.md / AGENTS.md, sub-agents, plugins, tools — you have probably heard of these or used a few of them. Harness is the umbrella term for designing all of them as one system.

A lot of people use AI like this: they see a new MCP that looks fun and install it; they see a hook example and copy it; they write two random lines in CLAUDE.md and forget about it; when a bug appears, they have no idea which part to fix. The features do not work together, the extra gear just tires out the horse, and the overall effect is limited.

The point of the term Harness Engineering is to force you to look at these scattered actions as a system.

It gives you a checklist: is your AI workflow complete across five core dimensions? Is your horse always in the best possible condition?

The industry has offered a simple formula:

Agent = Model + Harness

That is why some tools feel so smooth while others sit on strong models yet feel brainless in practice: the difference is the harness.

Seen against the evolution of AI engineering, it sits at the outside of a three-layer stack:

  • The innermost layer is Prompt Engineering, which is about how to instruct the AI.
  • The middle layer is Context Engineering, which is about what information to give the AI and when to give it.
  • The outer layer, Harness Engineering, wraps both of those and adds tool orchestration, state persistence, verification loops, task decomposition, sub-Agents, permission sandboxes, and rollback mechanisms. Together, they form a complete engineering infrastructure.

Next, we will use the five dimensions of Harness to look back at your current Claude Code / Codex / OpenClaw / Cursor setup. You will get a clearer sense of whether your Agent needs more gear, or whether it needs less weight so it can run lighter.

3. The Five Core Dimensions of Harness

To turn the scattered idea of Harness into concrete engineering actions, we can break it into five dimensions:

  • Context management
  • Execution capability
  • Task orchestration
  • Feedback loops
  • Architectural guardrails

3.1 Context Management: A Three-Layer Memory Architecture

AI forgets project rules in long conversations because of how the context window works. In each turn, the model sees a flat list of messages. Everything you said earlier is mixed together with the current question. There is no hierarchy between project rules and casual chat. The longer the conversation gets, the more diluted the earlier constraints become.

OpenAI made this clear in its Codex engineering article: from the Agent’s point of view, knowledge that is not available at runtime does not exist. Rules you said out loud, discussed in Slack, or assumed as team convention are invisible to AI unless they live in the repository as files.

The core Harness move at this layer is to turn rules into structured files. In production practice, this usually has three layers.

The first layer is AGENTS.md or CLAUDE.md in the project root. Think of it as the project map. It gets loaded at the top of the context in every new conversation. It includes the tech stack, directory structure, things that are forbidden, commands that must be run before committing, and UI style no-go zones. Keep it to about 100 lines.

The second layer is detailed rule files split by topic under the docs directory, such as frontend.md, security.md, and api-design.md. The AI sees the pointers in AGENTS.md and reads these files when needed, instead of stuffing all of them into context at once.

The third layer is the built-in memory mechanism in tools like Claude Code: lightweight indexes of roughly 150 characters each are always loaded; detailed files are pulled in on demand; raw records are accessed only through search tools like grep. This design balances two needs: the AI can see the big picture, while the context window does not get blown up.
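
As a rough illustration of this layering (a minimal sketch, not Claude Code’s actual implementation), a harness might assemble context like this:

```python
from pathlib import Path

INDEX_CHARS = 150  # budget per always-loaded index entry, per the figure above

def build_context(repo: Path, task: str) -> str:
    """Layered context: map always loaded, details on demand, raw via search."""
    parts = [(repo / "AGENTS.md").read_text()]  # layer 1: the project map

    # Layer 2: docs/ files are only *pointed to*; the agent reads one in full
    # when the task actually touches that topic.
    for doc in sorted((repo / "docs").glob("*.md")):
        lines = doc.read_text().splitlines()
        summary = lines[0][:INDEX_CHARS] if lines else ""
        parts.append(f"[read on demand] {doc.name}: {summary}")

    # Layer 3: raw records never enter context automatically; the agent
    # reaches them through search tools such as grep or `git log`.
    parts.append("Raw history: query via grep / git log tools, never inlined.")

    parts.append(f"Current task: {task}")
    return "\n\n".join(parts)
```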

OpenAI admitted one mistake in its article. At first, they put all their rules into one giant AGENTS.md that ran for thousands of lines. The AI became more likely to ignore key information. They fixed it by switching to a two-layer structure: a map plus detailed documents.

Core principle: persist context in layers, and avoid consuming so much context that you damage the model’s attention.

3.2 Execution Capability: Giving the Model Hands and Feet

A model by itself can only output text. It can tell you to run npm install in the terminal, but it cannot run the command, inspect the result, or adjust the next step based on the error. Pure output leaves AI unable to close the loop. At every step, a human has to act as the messenger.

The Harness work at this layer is to connect the model to a real operating environment. There are three levels, from basic to advanced.

The basic layer is terminal plus file system plus browser. The terminal lets AI run commands, install dependencies, execute tests, and inspect logs. The file system lets it read code, modify files, and write intermediate documents. The browser lets it view real pages, click buttons, and verify with screenshots.
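
What closing the loop looks like at this basic layer can be sketched in a few lines; `ask_model` is a placeholder, since the actual LLM call varies by provider:

```python
import subprocess

def run_command(cmd: str, timeout: int = 120) -> str:
    """Run a model-proposed shell command and capture the real outcome."""
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return f"exit={result.returncode}\n{result.stdout}{result.stderr}"

def agent_step(ask_model, goal: str, history: list[str]) -> None:
    # Pure text output stops at step 1; a harness supplies steps 2 and 3.
    cmd = ask_model(goal, history)              # 1. model proposes "npm install"
    observation = run_command(cmd)              # 2. harness actually runs it
    history.append(f"$ {cmd}\n{observation}")   # 3. result informs the next step
```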

The middle layer is MCP, the Model Context Protocol: a standard that lets AI connect to external capabilities. Common targets include databases, search engines, crawlers, design tools, and monitoring systems.

The top layer is Skills. Skills package multi-step workflows into reusable capability bundles: writing a technical tweet, generating a weekly report, scraping competitor data from a website. When the AI sees a matching request, it invokes the whole Skill directly instead of redesigning the steps every time.
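
Packaging details differ by tool, but conceptually a Skill is just a named, reusable bundle of steps. A toy representation; the structure here is illustrative, not any specific tool’s on-disk format:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    trigger: str       # what kind of request this skill matches
    steps: list[str]   # the fixed workflow, injected when the skill is invoked

weekly_report = Skill(
    name="weekly-report",
    trigger="user asks for a weekly status report",
    steps=[
        "Read progress.md and the last 7 days of git log",
        "Group completed work by feature area",
        "Draft the report from docs/report-template.md",
        "Run a spell check before presenting the draft",
    ],
)
```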

But you cannot pile on tools forever. Vercel hit a counterexample while building an internal text-to-SQL Agent. At first, they built a stack of specialized tools: schema lookup, query validation, and error recovery. The success rate was 80%. Later, they deleted 80% of the specialized tools and let Claude use basic Unix tools like grep, cat, find, and ls to read files and write SQL by itself. The success rate rose to 100%, speed improved by 3.5x, and token usage dropped by 37%.

The reason is simple: the more tools you give the model, the larger its choice space becomes at every step, and the more likely it is to pick the wrong tool or go down the wrong path.

Core principle: choose every tool deliberately. Fewer and sharper beats more and messier.

3.3 Task Orchestration: Letting AI Handle Long Tasks

The most common failure mode on long tasks is that AI tries to one-shot an entire feature. The model’s context window is limited. Asking it to swallow a requirement like “a list page with search, filters, and pagination” all at once is like asking an engineer to skip the design doc, skip task decomposition, skip iterations, and just grind until the end. Failure is guaranteed.

Anthropic described this failure mode in detail in its article on long-task Harness: the model tries to write everything in one pass, runs out of context, realizes halfway through that the earlier plan was wrong, then goes back to rewrite earlier code. The more it changes, the more tangled things get.

The Harness work at this layer is to structure long tasks.

Step one is Plan Mode. The AI first outputs an implementation plan: which subtasks to split into, and how each subtask will be implemented. A human confirms the plan before the AI starts editing. This is the brake pedal. It moves the cost of going in the wrong direction up into the planning stage.

Step two is step-by-step execution. Do one subtask at a time, and verify each one after completion. This prevents the collapse that happens when the context fills up midstream.

Step three is externalized state. After each feature is completed, ask the AI to write a document, usually called progress.md or plan.md. It should state in detail what has been completed, which technical approach was used, which key architectural decisions were made, which bugs remain unresolved, and what is still left to do. This document is external memory across context windows. When a new conversation starts, the AI reads it and immediately gets back into the work. (A skeleton is sketched after step four.)

Step four is parallelism. Use sub-agents to run independent subtasks at the same time.
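
To make step three concrete, here is a minimal sketch of such a skeleton, written out by the agent at the end of each feature; the exact fields are a suggestion, not a standard:

```python
from pathlib import Path

# A progress.md skeleton covering: what is done, the approach taken,
# key decisions, unresolved bugs, and what remains. Fields are illustrative.
PROGRESS_TEMPLATE = """\
# Progress

## Done
- [x] Search box (approach: server-side filtering with debounced queries)

## Key decisions
- Cursor-based pagination instead of offsets, to survive concurrent writes

## Unresolved bugs
- Resetting filters also clears the search term; it should persist

## Next
- [ ] Column sorting
- [ ] Empty-state UI
"""

Path("progress.md").write_text(PROGRESS_TEMPLATE)
```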

Anthropic proposed a more advanced solution for long-task engineering called the Ralph Loop: a two-stage relay between an Initializer Agent that sets up the project once and a Coding Agent that picks up the work in every new conversation window. Section 4.1 walks through it in detail.

Core principle: progress.md and git commits are save points for AI. Save state before you trust it with long tasks.

3.4 Feedback Loops: AI Test-Driven Development

Models do not run code. They read the code text and judge whether it looks like it should work. If a piece of code follows a familiar pattern, the variable names line up, and the indentation looks clean, the model marks it as done. But whether code runs has nothing to do with whether it looks runnable. Only running it can prove that. This is why AI often says “fixed” with total confidence while the project is still full of errors.

The Harness work at this layer is to move verification from humans to automation. There are three kinds of feedback.

Rule feedback: make the AI automatically run the linter, typecheck, unit tests, and integration tests before every commit. If any one of them fails, the task is not done (a minimal gate is sketched below).

Visual feedback: for UI tasks, make the AI use tools like Playwright to open the browser, click through the user path, and produce screenshots as evidence of completion.

LLM review feedback: make another AI independently review the code that was just written, looking for logic holes, architecture problems, and potential bugs.
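
Rule feedback is the easiest loop to wire up. A minimal gate, assuming a typical Node project with the usual npm scripts; swap in your own stack’s commands:

```python
import subprocess
import sys

# All checks must pass before the agent may report the task as done.
# These commands assume common npm conventions; adjust for your project.
CHECKS = [
    ["npm", "run", "lint"],
    ["npx", "tsc", "--noEmit"],
    ["npm", "test"],
]

def verify() -> bool:
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # The failure text goes back to the agent, not to a human.
            print(f"FAILED {' '.join(cmd)}\n{result.stdout}{result.stderr}")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if verify() else 1)
```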

Anthropic gave a concrete number in its engineering blog: giving the model a loop to verify its own work can improve output quality by 2x to 3x. This is the single most reliable Harness investment.

There is one counterintuitive design point here. Letting the same AI that generated the code review its own work performs much worse than you might expect, because the generator is naturally biased toward defending itself. Anthropic’s experience is to split the generator and evaluator into two independent Agents with different role configurations and different prompts. Real peer review finds problems. This is the same reason traditional software teams say you should not be the only reviewer of your own code.
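
A sketch of that split, with `call_model` standing in for whichever LLM API the harness uses; the substance is the two independent role prompts and the absence of shared conversation history:

```python
GENERATOR_SYSTEM = "You are an implementer. Write code that satisfies the task."
REVIEWER_SYSTEM = (
    "You are an independent reviewer who did not write this code. "
    "Hunt for logic holes, architecture problems, and missing edge cases. "
    "Your job is to find problems, not to defend the code."
)

def generate_then_review(call_model, task: str) -> tuple[str, str]:
    # Two separate agents: different system prompts, no shared conversation,
    # so the evaluator has no stake in defending the generator's choices.
    code = call_model(system=GENERATOR_SYSTEM, user=task)
    review = call_model(
        system=REVIEWER_SYSTEM,
        user=f"Task:\n{task}\n\nCode under review:\n{code}",
    )
    return code, review
```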

Core principle: AI saying it fixed something means nothing. Passing tests is what counts.

3.5 Architectural Guardrails: Blocking Bad AI Code Before It Lands

AI-written code has a hidden problem: it imitates the patterns already present in the repository. Good code gets imitated. Bad code gets imitated too. Each individual AI commit may look reasonable in isolation, but stacked together, the project keeps getting worse.

OpenAI called out this behavior in its Codex article: Agents copy existing patterns in the repository. If the existing patterns are unstable, inconsistent, or simply bad, AI amplifies them.

The Harness work at this layer is to move architectural rules out of documents and into executable code, so bad code gets blocked automatically before it enters the main branch.

The most basic layer is pre-commit hooks. Before a git commit, they automatically run a batch of checks. Noncompliant commits are blocked immediately.

The second layer is an architecture linter. It checks architectural violations specifically: the UI layer cannot directly access the database layer, module dependencies must flow in one direction, files above a size threshold must be split. This is different from the syntax linter in 3.4. That one checks syntax errors; this one checks architecture errors. A minimal version is sketched below.

The third layer is the CI gate as a backstop. Even if local hooks are bypassed, CI runs the checks again to ensure the main branch always satisfies architectural constraints.
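
Architecture rules are often just dependency rules, which makes them scriptable. A minimal check of the “UI must not touch the database layer” kind, with example directory names:

```python
import re
import sys
from pathlib import Path

# Example rule: nothing under src/ui/ may import from the db layer.
FORBIDDEN_IMPORT = re.compile(r"""(from|import|require\()\s*['"][^'"]*/db/""")

def lint_architecture(ui_root: str = "src/ui") -> int:
    violations = 0
    for file in Path(ui_root).rglob("*.ts"):
        for lineno, line in enumerate(file.read_text().splitlines(), start=1):
            if FORBIDDEN_IMPORT.search(line):
                print(f"{file}:{lineno}: UI layer reaches into db/: {line.strip()}")
                violations += 1
    return violations

if __name__ == "__main__":
    sys.exit(1 if lint_architecture() else 0)
```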

OpenAI also has a more aggressive practice it calls “garbage collection”: periodically run background Codex tasks to scan the whole codebase, find places that drift away from architectural principles, and automatically open small PRs to pay down technical debt. The logic is that AI creates code faster, so technical debt accumulates faster. Debt cleanup has to be automated too.

Core principle: account for AI’s limits ahead of time, and set up guardrails.


4. How Top Teams Build Harnesses

4.1 Anthropic / Claude Code: Textbook Long-Task Engineering

Claude Code is Anthropic’s own agent harness for driving Claude. It breaks Harness into 12 independent components. I will not list all of them here. Instead, let’s focus on the two designs most worth copying.

Design 1: A three-layer memory architecture

Claude Code’s memory system has three layers.

The top layer is a lightweight index. Each item is about 150 characters and is always loaded into context. Its job is to make sure the AI always knows what files, modules, and core conventions exist in the project. Because each item is short, dozens or even hundreds of them will not blow up the context window.

The middle layer is detailed files: README.md, ARCHITECTURE.md, API docs, and design notes for each module. These files are not loaded by default. When the AI sees a lightweight index item that says “see docs/architecture.md,” it reads that file only when needed. After the information enters context, it can also be compressed out when it is no longer needed.

The bottom layer is raw records: the full git log, complete conversation history, and complete log files. This layer is never loaded automatically. The AI can only retrieve it through commands like grep and tail. It has the most data, but it does not pollute context.

The core idea behind these three layers is simple: context is scarce, so it must be layered by “how often it needs to be read” and “how important it is.” Put what must always be seen at the top. Put what is occasionally needed in the middle. Leave everything else to search.

Design 2: Ralph Loop for long tasks across context windows

For long tasks that take days and span multiple conversation windows, Anthropic designed a workflow called Ralph Loop. It is basically a two-stage relay.

The first stage is the Initializer Agent. It runs only once at the beginning of the project. It does four things: sets up the development environment, breaks the full requirement into a feature list, writes the first progress.md with the todo items, and makes the initial git commit.

The second stage is the Coding Agent. It runs in every new conversation window. Its fixed process is:

  1. Read git log first to see the commit history.
  2. Read progress.md to see current progress and the unfinished list.
  3. Pick the highest-priority item from the unfinished list.
  4. Complete that item, run verification, and commit to git.
  5. Update progress.md with what this round did and what the next round should do.

Even if the AI gets interrupted, the model version changes, or the conversation window fills up, the next AI only needs to read git log and progress.md to get back into state. The key to this design is not that the AI is smarter. It is that progress.md and git history together create external memory across context windows.
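
A skeleton of that relay, with `run_agent` standing in as a placeholder for one complete agent conversation; all durable state lives in git and progress.md rather than in the loop itself:

```python
CODING_PROMPT = """\
You are the Coding Agent. Follow this fixed routine:
1. Run `git log --oneline -20` to see the commit history.
2. Read progress.md for current state and the unfinished list.
3. Pick the highest-priority unfinished item.
4. Implement it, run the verification suite, and commit to git.
5. Update progress.md: what this round did, what the next round should do.
"""

def ralph_loop(run_agent, max_rounds: int = 50) -> None:
    for _ in range(max_rounds):
        run_agent(CODING_PROMPT)               # fresh context window every round
        remaining = open("progress.md").read()
        if "- [ ]" not in remaining:           # assumes checklist-style todos
            break                              # unfinished list is empty: done
```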

What you can learn from this design:

  • Your own project should at least have a progress.md that clearly states what is done, which key architectural decisions were made, which bugs remain unresolved, and what should happen next. Do not leave this stuff only in your head.
  • Treat each completed feature as one git commit. Git history itself is an AI-readable work log.
  • Layer your context. Put what the AI must always see in AGENTS.md, put what it may need to look up in docs, and leave the rest to search.

4.2 OpenAI / Codex: Rebuilding the Development Environment for AI

Behind the story of OpenAI’s three-person Codex team using AI to write one million lines of code in five months, the most important factor was not model strength. It was an engineering idea they call Codex legibility.

Codex legibility roughly means “readable by Codex.” The point is that future codebases must be readable not only by humans, but also by Agents. That means all the development infrastructure originally built for human engineers — logs, monitoring, debugging tools, local environments — has to be rebuilt into forms AI can use directly.

OpenAI had five concrete practices.

1. Each git worktree automatically starts an independent app instance. When Codex works on a change, the corresponding branch can automatically spin up an independent development server. The AI can open that instance, operate it, inspect responses, and verify whether the change actually worked as intended. What used to require an engineer to manually run npm run dev is now done by the AI.

2. Chrome DevTools Protocol is connected to the Agent runtime. This gives AI the same browser-level debugging ability as an engineer: inspect the DOM, listen to network requests, inject JS, take screenshots. Reproducing a UI bug no longer requires an engineer to open the browser by hand. The AI can reproduce and locate it by itself.

3. Logs, metrics, and traces are exposed for AI queries. OpenAI lets Codex directly access production monitoring data through query languages like LogQL, PromQL, and TraceQL. That means the AI does not need an engineer to manually paste logs into the chat during debugging. It can grep logs, inspect metric anomalies, and trace requests by itself.

4. Custom linters turn architectural constraints into executable rules. OpenAI encoded which layers cannot call which layers, which names must be consistent, and which patterns are forbidden as automatic linter rules. Any AI-written code that violates those rules gets blocked.

5. Background garbage collection tasks. This is the most interesting one. OpenAI periodically runs a batch of background Codex tasks to scan the entire codebase for places that drift away from architectural principles, then automatically opens small PRs to fix them. In other words, paying down technical debt is automated too. It no longer depends on engineers manually refactoring code.

Together, these five practices turn the whole development environment into an AI-readable, AI-operable, AI-verifiable workbench.

What you can learn from this design:

  • If your project logs are unreadable even to humans, AI will understand them even less. Refactoring logs into a structured format, such as JSON with a fixed schema, is the first step toward making them AI-friendly; see the sketch after this list.
  • Move the architectural principles in your head, in Slack, and in code review comments into linter rules. The real value is not only preventing violations. It also turns architecture into something learnable: AI can read the linter error and learn how to write compliant code.
  • During system design, ask one question: can AI see this state? If not, figure out how to expose it.
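
On the logging point, the refactor is mechanical. A sketch using Python’s standard library that emits one JSON object per line with fixed keys, so an agent can grep and parse logs reliably:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """One JSON object per line, fixed keys: trivial for an agent to grep/parse."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "module": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment authorized")
# -> {"ts": "...", "level": "INFO", "module": "checkout", "msg": "payment authorized"}
```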

4.3 Nous Research / Superpowers: An Open-Source, Reusable Skills Framework

Superpowers is an open-source Agent Skills framework from Nous Research. If Claude Code is Anthropic’s internal engineering practice, and Codex is OpenAI’s model for rebuilding the development environment, then Superpowers is the open-source version individual developers can copy and use at home.

It includes several common workflows out of the box.

TDD workflow: AI writes tests before implementation. The work is not done until the tests pass. This turns the feedback loop from 3.4 into the default path. AI must pass tests before it can deliver.

Two-stage Code Review: the first stage has the generator agent write code; the second stage has the reviewer agent review it. The two agents use different role configurations and prompts, forcing a real split between generator and evaluator.

Sub-agent collaboration template: it includes a standard flow for task decomposition, parallel execution, and result synthesis, ready to use out of the box.

Hermes Agent, the sister project of Superpowers, has an even more forward-looking direction called Self-Evolution. This system uses DSPy and GEPA to continuously optimize the Harness itself. Put simply, DSPy treats prompts and tool descriptions as optimizable parameters, while GEPA uses a genetic-algorithm-like approach to find better configurations from the success and failure traces in execution records. The whole process does not retrain the model. It only tunes the Harness. In other words, Harness can automatically improve from the tasks it has done before.

What you can learn from this design:

  • If you do not want to design a TDD workflow and two-stage review from scratch, copy the Superpowers templates and start there.
  • Package the few task types you do most often, such as writing a blog post or analyzing a dataset, into Skills. Next time, the AI can call them by itself instead of making you teach the workflow again.
  • Pay attention to the DSPy / GEPA line of work. Stronger models are the big trend, but self-improving Harness may be the improvement path closer to your actual work.

5. Three Counterintuitive Harness Principles

Now that we have covered how top teams build Harnesses, let’s pull out a few principles in this field that are especially easy to get wrong.

Principle 1: More Harness Is Not Always Better

Intuitively, more tools should make AI stronger. In reality, the more tools you give it, the larger the choice space becomes at every step, and the more likely it is to pick the wrong tool, take the wrong path, or call an interface it should not call.

Vercel’s internal text-to-SQL Agent, which we met in 3.2, is the cleanest demonstration. Recall the numbers: removing 80% of the specialized tools (schema lookup, query validation, error recovery) and letting Claude work with basic Unix tools like grep, cat, find, and ls raised the success rate from 80% to 100%, improved speed by 3.5x, and cut token usage by 37%.

Manus observed the same pattern: a heavily armed agent gets dumber.

The practical rule is simple: before adding any tool, ask what specific behavioral gap it solves. If you cannot answer, do not add it yet.

Principle 2: How You Organize Context Matters Much More Than How Much Context You Have

Many people see that Claude 4 Opus has a million-token context window, and their first reaction is, “Then just stuff everything in.” That is an expensive misunderstanding.

Chroma Research ran a Context Rot study testing 18 mainstream models under long-context conditions. The conclusion: as input length grows, models become less reliable at using context. Even when task difficulty stays the same, simply increasing context from 10,000 tokens to 500,000 tokens significantly reduces performance.

Stanford’s famous Lost in the Middle paper proved another phenomenon: models use information near the beginning and end of the context best, and tend to ignore information in the middle. That means the position of key information in the context directly affects whether AI can use it.

The Manus team added an economic angle. In their typical Agent tasks, the ratio of input tokens to output tokens is about 100:1. Most of the Agent’s time and cost is spent repeatedly feeding context into the model, and the price difference between KV-cache hits and misses is 10x. For the same task, context organization that allows cache reuse can change your monthly API bill by an order of magnitude.

Put these three things together and the message is clear: context design is a real engineering problem. It is not as simple as throwing everything in and letting the model pick what matters.
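
One concrete consequence of the KV-cache point: keep the front of the prompt byte-stable and treat history as append-only, so the provider’s prefix cache can be reused across turns. A minimal sketch of that discipline (the caching itself happens on the provider side):

```python
def build_messages(system: str, tools_desc: str,
                   history: list[dict], task: str) -> list[dict]:
    # Stable prefix: byte-identical on every call. Even a timestamp or a
    # reordered tool list at the front invalidates the cached prefix.
    prefix = [
        {"role": "system", "content": system},      # never edited mid-task
        {"role": "system", "content": tools_desc},  # fixed wording and order
    ]
    # Append-only history: earlier messages are never rewritten in place,
    # only extended, which preserves the reusable shared prefix.
    return prefix + history + [{"role": "user", "content": task}]
```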

Principle 3: Harness Should Get Simpler as Models Improve

Behind every Harness component is an implicit assumption: the model cannot do this by itself, so it needs an external patch. But models are improving fast, and old patches may already be obsolete.

Anthropic gave a concrete example in its Harness Design blog. In early versions of Claude Code, they added an explicit “planning” step that forced AI to output a plan before acting. Later, after a new version of Claude was released, Anthropic found that planning ability had been internalized by the model. The external planning step became redundant overhead, so they deleted it.

The practical method is this: every time you upgrade the model version, review your Harness. Which components still compensate for real capability gaps? Which ones are just historical baggage? Which checks can now be trusted to the model?

A good Harness is just thick enough to cover the current gaps in model capability, and no thicker.


Conclusion: Harness Engineering Is Becoming a Core Skill

The fact that every frontier model scored 0% on ProgramBench makes the point in reverse: Harness Engineering is becoming more important.

AI can already write a large share of your code today. But AI will not do requirements analysis, decompose tasks, design the context architecture, decide which tools to add or remove, establish verification standards, or defend architectural boundaries for you.

The sum of those things is Harness Engineering.

It sounds like a new term, but at its core it is the systematic migration of traditional software engineering methods into AI workflows:

  • AGENTS.md corresponds to requirements documents and design documents.
  • Task orchestration corresponds to iterative decomposition in agile development.
  • Feedback loops correspond to unit tests and code review.
  • Architectural guardrails correspond to code standards and security checks.

These were already the fundamentals of good engineering.

Next time your Agent messes things up again, look back at your Harness across the five dimensions and see where it stands.

In most cases, the AI is already smart enough. What is missing is a good set of reins.