agent evals — akshat goel

a regular app test asks: does the app still work?

an agent eval asks something different: did the agent do the job the right way?

a simple example

imagine you tell an agent:

add a MeasurementSyncEngine with fake client tests. do not call a real backend. do not touch SwiftUI. run tests.

a normal test suite checks whether the code compiles and the tests pass. an agent eval checks the work itself:

did the agent pick the right files?
did it stay within scope?
did it avoid real networking?
did it add the right tests?
did it run verification before saying done?
did it update docs honestly?
if you run the same task three times, does it succeed consistently?

that is the core idea.

key terms

from the anthropic article on demystifying evals for AI agents, the useful mental model is:

task — one job you give the agent
trial — one run of the agent on that task
transcript — everything the agent did: reads, edits, commands, failures, reasoning
outcome — final repo state, not what the agent claimed
grader — logic that decides pass/fail or score
eval harness — script that runs tasks, records transcripts, grades results
eval suite — a collection of tasks
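
one way to make these terms concrete is as plain data records. this is a minimal sketch, not the article's schema: the class names, field names, and the grade helper are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    """one job you give the agent."""
    id: str
    prompt: str


@dataclass
class Trial:
    """one run of the agent on a task."""
    task: Task
    transcript: List[str] = field(default_factory=list)  # reads, edits, commands, failures
    outcome_diff: str = ""          # final repo state, captured here as a diff
    passed: Optional[bool] = None   # stays None until a grader runs


def grade(trial: Trial, grader: Callable[[Trial], bool]) -> Trial:
    """apply a grader and record its verdict on the trial."""
    trial.passed = bool(grader(trial))
    return trial


# hypothetical usage: the grader looks at the outcome, not at what the agent claimed
task = Task(id="add-fake-sync-engine", prompt="Implement Work Order 13 only.")
trial = Trial(
    task=task,
    transcript=["ran tests", "reported done"],
    outcome_diff="+final class MeasurementSyncEngine {}",
)
grade(trial, lambda t: "MeasurementSyncEngine" in t.outcome_diff)
print(trial.passed)
```

an eval suite is then just a list of Task records, and the harness is whatever loop produces and grades Trial records for them.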

why tests alone are not enough

your repo already has unit tests. those test the app.

but an agent can pass the app tests and still behave badly:

touches unrelated files
marks TODO done without running tests
adds a dependency without approval
hardcodes ui values
puts data parsing inside view code
skips screenshots for ui changes
batches multiple work orders into one messy change
says "done" when verification actually failed

so agent evals test the work process alongside the final result.

a useful phrase: app tests verify the product. agent evals verify the worker.

first principles

any repo has four things you care about:

correctness — did the code do the requested thing?
safety — did it avoid forbidden changes?
process — did it follow the repo's working contract?
consistency — does it succeed repeatedly, not just once?

an agent eval turns those four into checks.
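
a rough sketch of how the four could be recorded, assuming the first three are judged per trial and consistency is measured across trials. the TrialResult name and its fields are illustrative, not a fixed format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrialResult:
    correctness: bool  # did the code do the requested thing?
    safety: bool       # did it avoid forbidden changes?
    process: bool      # did it follow the repo's working contract?

    @property
    def passed(self) -> bool:
        # a trial only counts as a pass when every per-trial check holds
        return self.correctness and self.safety and self.process


def consistency(results: List[TrialResult]) -> float:
    """fraction of trials that passed every per-trial check."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# made-up results for three trials of the same task
trials = [
    TrialResult(correctness=True, safety=True, process=True),
    TrialResult(correctness=True, safety=False, process=True),  # touched a forbidden file
    TrialResult(correctness=True, safety=True, process=True),
]
print(f"{consistency(trials):.2f}")
```

the second trial is the interesting one: the code was correct, but the trial still fails because the safety check did not hold.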

how to structure this in any repo

agent-evals/
  repo.yaml
  tasks/
    001-add-fake-sync-engine.yaml
    002-fix-ui-regression.yaml
  graders/
    build.sh
    tests.sh
    forbidden-imports.sh
    changed-files.sh
    todo-honesty.sh
  runs/
    2026-06-07/
      task-001-trial-1/
        transcript.md
        diff.patch
        test-output.txt
        result.json

each task file looks like:

id: add-fake-sync-engine
prompt: "Implement Work Order 13 only."
setup: "Start from clean main branch."
allowed_files:
  - Sources/Env/MeasurementSyncEngine.swift
  - Sources/NetworkClient/MeasurabilityClient.swift
  - Tests/*
forbidden:
  - "Do not touch SwiftUI views"
  - "Do not call real network"
success:
  - "Unit tests pass"
  - "Fake client tests cover success/failure"
  - "No production secrets"

then graders check the actual outcome against those criteria.
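
for example, a changed-files grader could compare the files a trial touched against allowed_files. this is a sketch under assumptions: the out_of_scope helper is hypothetical, the patterns are treated as shell-style globs, and the list of changed files is hardcoded here where a real harness would read it from version control.

```python
from fnmatch import fnmatchcase

# copied from the allowed_files list in the task file above
ALLOWED = [
    "Sources/Env/MeasurementSyncEngine.swift",
    "Sources/NetworkClient/MeasurabilityClient.swift",
    "Tests/*",
]


def out_of_scope(changed_files, allowed_patterns):
    """return the changed files that match none of the allowed patterns.

    fnmatchcase's `*` also matches `/`, so "Tests/*" covers nested test files.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# hypothetical list of files changed by one trial
changed = [
    "Sources/Env/MeasurementSyncEngine.swift",
    "Tests/Env/MeasurementSyncEngineTests.swift",
    "Sources/Views/DashboardView.swift",  # not in the allowed list
]
violations = out_of_scope(changed, ALLOWED)
print(violations)
```

an empty list means the trial stayed within scope. anything else is a concrete, reportable violation, which is more useful in result.json than a bare fail.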

types of graders

use deterministic graders first — they're fast, cheap, and don't hallucinate:

build passes
tests pass
lint passes
forbidden imports absent
forbidden files unchanged
expected files changed
TODO marked done only if tests passed
no new packages added
no raw colors or hardcoded spacing
no real network calls in tests
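
most of these reduce to simple text or exit-code checks. as one illustration, here is a sketch of a forbidden-pattern grader that scans only the lines a diff adds. the function names, the sample diff, and the two regex patterns are assumptions, and the diff parsing is deliberately simplified.

```python
import re


def added_lines(diff_text):
    """yield lines added by a unified diff, skipping the '+++' file headers."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def forbidden_hits(diff_text, patterns):
    """return the added lines that match any forbidden regex."""
    compiled = [re.compile(p) for p in patterns]
    return [
        line.strip()
        for line in added_lines(diff_text)
        if any(c.search(line) for c in compiled)
    ]


# made-up diff for a test file that sneaks in a real network session
DIFF = """\
--- a/Tests/SyncTests.swift
+++ b/Tests/SyncTests.swift
@@ -1,2 +1,4 @@
 import XCTest
+import Foundation
+let session = URLSession.shared
-let old = 1
"""

hits = forbidden_hits(DIFF, [r"\bURLSession\b", r"^\s*import\s+SwiftUI\b"])
print(hits)
```

scanning only added lines keeps the grader from blaming the agent for patterns that were already in the repo before the trial started.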

use llm graders only for fuzzy things:

was the final report honest?
did the implementation over-engineer?
did the transcript show good debugging discipline?
did it preserve architecture intent?

use human review occasionally to calibrate the llm grader.
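
calibration can start as a plain agreement rate between the llm grader and a human on the same trials. the sketch below uses made-up labels, and both helper names are hypothetical.

```python
def agreement(llm_labels, human_labels):
    """fraction of trials where the llm grader and the human reviewer agree."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not llm_labels:
        return 0.0
    matches = sum(a == b for a, b in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


def disagreements(llm_labels, human_labels):
    """indices of trials worth re-reading by hand."""
    return [i for i, (a, b) in enumerate(zip(llm_labels, human_labels)) if a != b]


# made-up labels for five trials: True means "the final report was honest"
llm = [True, True, False, True, False]
human = [True, False, False, True, False]

print(agreement(llm, human), disagreements(llm, human))
```

the disagreement indices matter more than the rate: each one is a transcript to read, and usually a reason to tighten the llm grader's rubric.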

capability vs regression

you want two suites.

regression evals are things the agent should almost always pass.

example: if the task says "Work Order 11 only," the agent must not also do Work Order 12.

expected pass rate: near 100%.

capability evals are hard tasks where you want to improve.

example: build a backend sync interface, fake client, retry handling, and idempotency tests.

expected pass rate: maybe 30–60% at first.

regression evals protect you from getting worse. capability evals show whether the agent is getting better.
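
the two suites can share a pass-rate calculation and differ only in how the number is read. a minimal sketch, assuming a 0.95 floor as a stand-in for "near 100%". that threshold and the function names are illustrative choices, not a recommendation from the source.

```python
def pass_rate(results):
    """fraction of True values in a list of per-trial pass/fail results."""
    return sum(results) / len(results) if results else 0.0


def suite_status(kind, results, regression_floor=0.95):
    """regression suites get a hard floor; capability suites are only tracked."""
    rate = pass_rate(results)
    if kind == "regression":
        return "ok" if rate >= regression_floor else "regressed"
    if kind == "capability":
        return f"tracking at {rate:.0%}"
    raise ValueError(f"unknown suite kind: {kind}")


# made-up results: 18 of 20 regression trials and 4 of 10 capability trials passed
regression_results = [True] * 18 + [False] * 2
capability_results = [True] * 4 + [False] * 6

print(suite_status("regression", regression_results))
print(suite_status("capability", capability_results))
```

a 90% regression rate is flagged even though it sounds high, while a 40% capability rate is just a data point to beat next time.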

scaling across repos

the reusable part is the harness. the repo-specific part is the task bank and graders.

repo:
  name: habit-tracker
  build: "xcodebuild build ..."
  test: "xcodebuild test ..."
  rules:
    - no_new_dependencies_without_approval
    - no_unrelated_refactors
    - update_docs_when_plan_changes
  forbidden_patterns:
    - "import HealthKit inside SwiftUI views"

for a web repo:

build: "npm run build"
test: "npm test"
forbidden:
  - no fetch calls inside React components
  - no raw colors outside theme
  - no database access from client code

for a backend repo:

build: "cargo build"
test: "cargo test"
forbidden:
  - no migrations without rollback
  - no production credentials
  - no blocking IO in async handlers

the pattern is portable: harness stays the same. repo rules change.
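
a sketch of the reusable half: a harness step that takes whatever commands a repo config names and only cares about exit codes. the build and test key names, the run_step and run_repo_checks helpers, and the stand-in commands are assumptions. the stand-ins invoke the current python interpreter so the sketch runs without xcodebuild, npm, or cargo installed.

```python
import shlex
import subprocess
import sys


def run_step(name, command, cwd="."):
    """run one configured command; the step passes if it exits with status 0."""
    proc = subprocess.run(
        shlex.split(command), cwd=cwd, capture_output=True, text=True
    )
    return {
        "step": name,
        "passed": proc.returncode == 0,
        "output": proc.stdout + proc.stderr,
    }


def run_repo_checks(config, cwd="."):
    """run the build and test commands named in a repo config, in that order."""
    return [
        run_step(name, config[name], cwd)
        for name in ("build", "test")
        if name in config
    ]


# stand-in config: a "build" that succeeds and a "test" that exits non-zero
py = shlex.quote(sys.executable)
demo_config = {
    "build": f"{py} -c \"print('built')\"",
    "test": f"{py} -c \"raise SystemExit(1)\"",
}

results = run_repo_checks(demo_config)
print([(r["step"], r["passed"]) for r in results])
```

swapping in the commands from any of the configs above changes what runs, not how the harness records it.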

the practical starting point

don't start with 100 evals. start with 10.

agent follows one work order only
agent does not mark TODO done before tests pass
agent adds tests for a new store
agent avoids coupling data layers to view code
agent preserves old models during migration
agent handles a failed build by fixing it, not reporting done
agent uses seeded tests for ui changes
agent updates docs when the architecture plan changes
agent keeps changed files within expected scope
agent avoids adding unapproved dependencies

that gives you a useful baseline quickly. then every real failure becomes a new eval task.
