Eval suites for codebase-specific agent use
Most AI rollouts skip evals because they feel like overhead. A small, codebase-specific eval suite, built in an afternoon, is the cheapest way to keep model and prompt changes from becoming a vibes call.
Published: 07 May 2026
Read time: 3 min · 495 words
Tool focus: Codex, Claude Code, OpenAI, Anthropic
The most valuable thing the team takes away from a good AI workshop is not a workflow document or a tool shortlist. It is a small, codebase-specific eval suite — three to five tasks with known good outputs that get re-run every time something changes. Without it, every model upgrade and prompt tweak is a vibes call. With it, the team has a real signal.
What three to five tasks look like
The instinct is to make the eval suite ambitious. Resist. The first version should have three to five tasks, no more, each chosen because it represents a class of work the agent will actually do.
A useful starting set for an engineering team:
- A small refactor with full test coverage. A single function gets extracted from one file into a helper module, with tests passing.
- A repo question with a known answer. "Where is the authentication middleware applied?" — with the right files named in the answer.
- A deterministic migration. A version bump or rename that touches three to five files in a predictable way.
Each task has a written prompt, a known good output (a diff, a file list, a structured response), and an automated comparator that says pass or fail.
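A minimal way to pin that down is a small task record. The sketch below assumes Python and invented names (`EvalTask`, `AgentResult`); the real shape depends on how the team drives the agent and captures its output.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AgentResult:
    """What one agent run produced, captured so a comparator can grade it."""
    answer_text: str                  # the agent's final response
    changed_files: set[str] = field(default_factory=set)  # files the run touched
    tests_passed: bool = False        # did the repo's test suite pass afterwards

@dataclass
class EvalTask:
    """One codebase-specific eval task: prompt, known good output, comparator."""
    name: str
    prompt: str                       # the exact prompt handed to the agent
    expected_files: set[str]          # files the right answer should touch or name
    comparator: Callable[["EvalTask", AgentResult], bool]  # pass or fail check
```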
How to build the comparator
The comparator does not have to be smart. For a refactor task, it can be: did the test suite pass, and is the file list close to the expected one? For a question task, it can be: did the answer mention the expected files? Exact-match comparison is fragile; structured checks are robust.
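Both comparators fit in a few lines. A rough sketch, reusing the `EvalTask` and `AgentResult` shapes from the block above; the one-file slack allowance and the lowercase substring match are arbitrary choices, not requirements:

```python
def refactor_comparator(task: EvalTask, result: AgentResult) -> bool:
    """Structured checks for a refactor task: tests green, file list close to expected."""
    if not result.tests_passed:
        return False
    missing = task.expected_files - result.changed_files
    unexpected = result.changed_files - task.expected_files
    # Allow one stray file (a lockfile, say), but fail if an expected file is untouched.
    return not missing and len(unexpected) <= 1

def question_comparator(task: EvalTask, result: AgentResult) -> bool:
    """Structured check for a repo question: did the answer name the expected files?"""
    answer = result.answer_text.lower()
    return all(path.lower() in answer for path in task.expected_files)
```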
The point is not perfect grading. The point is a deterministic signal that catches regressions. A model upgrade that quietly drops the agent's repo-navigation skill from "good" to "guessing" should fail the comparator the same day, not three weeks later when the team has noticed PRs getting worse and is trying to debug why.
When to run the suite
Run the suite on every workflow change worth caring about. Concretely: model upgrade, prompt template change, new MCP server in the rotation, new hook installed, new sub-agent in regular use.
Once the suite is small and fast, this stops being effortful. A model upgrade that takes an hour to evaluate against the suite is two cups of coffee, not a project.
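The runner can stay equally small. A sketch building on the shapes above; `run_agent` is a hypothetical hook for however the team invokes Codex or Claude Code and collects the result:

```python
from typing import Callable

def run_suite(tasks: list[EvalTask],
              run_agent: Callable[[str], AgentResult]) -> int:
    """Run every task against the current setup; return a nonzero exit code on any failure."""
    failed = 0
    for task in tasks:
        result = run_agent(task.prompt)            # however the team drives the agent
        ok = task.comparator(task, result)
        print(f"{'PASS' if ok else 'FAIL'}  {task.name}")
        failed += 0 if ok else 1
    return 1 if failed else 0

# Typical use: sys.exit(run_suite(TASKS, run_agent)) in a small script run after a
# model upgrade, prompt change, or new MCP server; a nonzero exit fails the check.
```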
Keeping the suite honest
The biggest failure mode is not the suite being wrong; it is the suite going stale. Tasks that made sense for the codebase eight months ago no longer match the kind of work the team does. The fix is a quarterly review, with the rule that any task that is no longer representative is replaced rather than patched.
The eval suite is, in the long run, more valuable than any specific prompt or tool choice. The prompts will change, the tools will be replaced, the models will get better. The eval suite is what turns those changes from drama into a passing or failing run.
Related notes
29 Apr 2026 · 3 min
Claude Code hooks that actually save time
Claude Code hooks are easy to over-engineer. The right four save real time and prevent the failure modes you actually hit in week one.
22 Apr 2026 · 3 min
Codex on a real repo
Codex is a repo-aware coding agent. Used carelessly it generates churn the team has to clean up. Used with scope and a real review gate, it ships work.
02 May 2026 · 3 min
First MCP server tool design
Building an MCP server is mostly an API design problem with one extra constraint: the caller is a model, not a person. Naming and arguments matter more than transport.