Eval suites for codebase-specific agent use

Most AI rollouts skip evals because they feel like overhead. A small, codebase-specific eval suite, built in an afternoon, is the cheapest way to keep model and prompt changes from becoming a vibes call.

Published 07 May 2026

Read time 3 min · 495 words

Tool focus Codex, Claude Code, OpenAI, Anthropic

The most valuable thing the team takes away from a good AI workshop is not a workflow document or a tool short list. It is a small, codebase-specific eval suite — three to five tasks with known good outputs that get re-run every time something changes. Without it, every model upgrade and prompt tweak is a vibes call. With it, the team has a real signal.

What three to five tasks look like

The instinct is to make the eval suite ambitious. Resist. The first version should have three to five tasks, no more, each chosen because it represents a class of work the agent will actually do.

A useful starting set for an engineering team:

  1. A small refactor with full test coverage. A single function gets extracted from one file into a helper module, with tests passing.
  2. A repo question with a known answer. "Where is the authentication middleware applied?" — with the right files named in the answer.
  3. A deterministic migration. A version bump or rename that touches three to five files in a predictable way.

Each task has a written prompt, a known good output (a diff, a file list, a structured response), and an automated comparator that says pass or fail.
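As a minimal sketch, one way to represent such a task in code — the dataclass shape and the example file paths are assumptions for illustration, not a prescribed format:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    """One eval task: a prompt, a known good output, and a pass/fail check."""
    name: str
    prompt: str                             # the exact prompt given to the agent
    expected: dict                          # known good output (diff, file list, ...)
    compare: Callable[[dict, dict], bool]   # comparator: (actual, expected) -> pass?

# Example: a repo question with a known answer. The paths are hypothetical.
repo_question = EvalTask(
    name="auth-middleware-location",
    prompt="Where is the authentication middleware applied?",
    expected={"files": ["src/middleware/auth.py", "src/app.py"]},
    # Pass if every expected file is named somewhere in the agent's answer.
    compare=lambda actual, expected: all(
        f in actual.get("answer", "") for f in expected["files"]
    ),
)
```

Keeping the comparator attached to the task means the runner never needs task-specific logic; it just calls `compare`.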

How to build the comparator

The comparator does not have to be smart. For a refactor task it can be: did the test suite pass, and is the file list close to the expected one? For a question task: did the answer mention the expected files? Exact-match comparison is fragile; structured checks are robust.
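A structured-checks version of the refactor comparator might look like the following sketch. The Jaccard threshold and the injectable test command are assumptions, chosen to keep the "close to expected" check tolerant and the function testable:

```python
import subprocess

def files_close(changed: set[str], expected: set[str], threshold: float = 0.8) -> bool:
    """Jaccard similarity of touched-file sets, so near-misses can still pass."""
    if not changed and not expected:
        return True
    return len(changed & expected) / len(changed | expected) >= threshold

def refactor_passed(changed: set[str], expected: set[str], test_cmd: list[str]) -> bool:
    """Pass iff the test suite is green AND the file list is close to expected."""
    tests_green = subprocess.run(test_cmd, capture_output=True).returncode == 0
    return tests_green and files_close(changed, expected)
```

In practice `test_cmd` would be something like `["pytest", "-q"]` run in the repo the agent just modified; making it a parameter keeps the comparator itself deterministic and easy to exercise.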

The point is not perfect grading. The point is a deterministic signal that catches regressions. A model upgrade that quietly drops the agent's repo-navigation skill from "good" to "guessing" should fail the comparator the same day, not three weeks later when the team has noticed PRs getting worse and is trying to debug why.

When to run the suite

Run the suite on every workflow change worth caring about. Concretely: model upgrade, prompt template change, new MCP server in the rotation, new hook installed, new sub-agent in regular use.

Once the suite is small and fast, this stops being effortful. A model upgrade that takes an hour to evaluate against the suite is two cups of coffee, not a project.
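The whole loop fits in a few lines. A minimal runner, assuming each task is a `(name, prompt, expected, comparator)` tuple and `run_agent` is your own harness for invoking the coding agent:

```python
def run_suite(tasks, run_agent) -> bool:
    """Run each (name, prompt, expected, compare) task; return True iff all pass."""
    all_passed = True
    for name, prompt, expected, compare in tasks:
        actual = run_agent(prompt)          # invoke the agent with the task prompt
        ok = compare(actual, expected)      # deterministic pass/fail
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
        all_passed = all_passed and ok
    return all_passed
```

Wire the boolean result into CI (exit non-zero on failure) and a model upgrade becomes a red or green run instead of a judgment call.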

Keeping the suite honest

The biggest failure mode is not the suite being wrong; it is the suite going stale. Tasks that made sense for the codebase eight months ago no longer match the kind of work the team does. The fix is a quarterly review, with one rule: any task that is no longer representative gets replaced, not patched.

The eval suite is, in the long run, more valuable than any specific prompt or tool choice. The prompts will change, the tools will be replaced, the models will get better. The eval suite is what turns those changes from drama into a passing or failing run.

