
NOTES · 2026

agent evals.

app tests verify the product. agent evals verify the worker.

a regular app test asks: does the app still work?

an agent eval asks something different: did the agent do the job the way it was asked to?

a simple example

imagine you tell an agent:

add a MeasurementSyncEngine with fake client tests. do not call a real backend. do not touch SwiftUI. run tests.

a normal test suite checks whether the code compiles and the tests pass. an agent eval checks the work itself:

  • did the agent pick the right files?
  • did it stay within scope?
  • did it avoid real networking?
  • did it add the right tests?
  • did it run verification before saying done?
  • did it update docs honestly?
  • if you run the same task three times, does it succeed consistently?

that is the core idea.

key terms

from anthropic's article "demystifying evals for AI agents", the useful mental model is:

  • task — one job you give the agent
  • trial — one run of the agent on that task
  • transcript — everything the agent did: reads, edits, commands, failures, reasoning
  • outcome — final repo state, not what the agent claimed
  • grader — logic that decides pass/fail or score
  • eval harness — script that runs tasks, records transcripts, grades results
  • eval suite — a collection of tasks
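
to make the vocabulary concrete, here is a minimal sketch of these terms as python types. the names (Task, Trial, grade_trial, no_real_network) are illustrative assumptions, not a schema from the article, and the outcome is simplified to a map of file paths to contents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One job you give the agent."""
    id: str
    prompt: str


@dataclass
class Trial:
    """One run of the agent on a task."""
    task: Task
    transcript: list[str]    # everything the agent did, in order
    outcome: dict[str, str]  # final repo state (path -> contents), not the agent's claims


# A grader inspects a trial and decides pass/fail.
Grader = Callable[[Trial], bool]


def grade_trial(trial: Trial, graders: dict[str, Grader]) -> dict[str, bool]:
    """Run every grader against one trial and collect the verdicts by name."""
    return {name: grader(trial) for name, grader in graders.items()}


def no_real_network(trial: Trial) -> bool:
    """Hypothetical grader: fail if any file in the outcome uses URLSession.shared."""
    return not any("URLSession.shared" in body for body in trial.outcome.values())
```

in this picture an eval suite is a list of Task values, and the harness is the loop that produces one Trial per run and calls grade_trial on it.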

why tests alone are not enough

your repo already has unit tests. those test the app.

but an agent can pass app tests while still behaving badly:

  • touches unrelated files
  • marks TODO done without running tests
  • adds a dependency without approval
  • hardcodes ui values
  • puts data parsing inside view code
  • skips screenshots for ui changes
  • batches multiple work orders into one messy change
  • says "done" when verification actually failed

so agent evals test the work process alongside the final result.
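
a process check like "did it run verification before saying done?" can be graded from the transcript alone. the sketch below assumes a hypothetical transcript format (a list of dicts with a "type" of "command", "edit", or "message") and uses a crude heuristic (the word "test" in the command) to spot test runs; a real harness would record these events in its own format.

```python
def verified_before_done(events: list[dict]) -> bool:
    """Return True only if the last test run before the first "done" message
    succeeded and no edit happened after it.

    The event shape is an assumption for illustration:
      {"type": "command", "cmd": "...", "exit_code": 0}
      {"type": "edit", "path": "..."}
      {"type": "message", "text": "..."}
    """
    last_test_exit = None
    for event in events:
        kind = event.get("type")
        if kind == "command" and "test" in event.get("cmd", ""):
            last_test_exit = event.get("exit_code")
        elif kind == "edit":
            # An edit after the last test run invalidates that verification.
            last_test_exit = None
        elif kind == "message" and "done" in event.get("text", "").lower():
            return last_test_exit == 0
    # The agent never claimed to be done, so there is nothing to pass.
    return False
```

this catches two of the failures above: claiming "done" after a failed test run, and claiming "done" after editing files without re-running the tests.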

a useful phrase: app tests verify the product. agent evals verify the worker.

first principles

any repo has four things you care about:

  1. correctness — did the code do the requested thing?
  2. safety — did it avoid forbidden changes?
  3. process — did it follow the repo's working contract?
  4. consistency — does it succeed repeatedly, not just once?

an agent eval turns those four into checks.
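
as a rough sketch, the first three can be recorded per trial and the fourth computed across trials. TrialResult and consistency are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    correctness: bool  # did the code do the requested thing?
    safety: bool       # did it avoid forbidden changes?
    process: bool      # did it follow the repo's working contract?

    @property
    def passed(self) -> bool:
        """A trial passes only if all three per-trial checks pass."""
        return self.correctness and self.safety and self.process


def consistency(trials: list[TrialResult]) -> float:
    """Fraction of trials of the same task that passed (0.0 if there were none)."""
    if not trials:
        return 0.0
    return sum(t.passed for t in trials) / len(trials)
```

running the same task three times and getting two passes gives a consistency of about 0.67, which is a different signal from a single green run.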

how to structure this in any repo

agent-evals/
  repo.yaml
  tasks/
    001-add-fake-sync-engine.yaml
    002-fix-ui-regression.yaml
  graders/
    build.sh
    tests.sh
    forbidden-imports.sh
    changed-files.sh
    todo-honesty.sh
  runs/
    2026-06-07/
      task-001-trial-1/
        transcript.md
        diff.patch
        test-output.txt
        result.json

each task file looks like:

id: add-fake-sync-engine
prompt: "Implement Work Order 13 only."
setup: "Start from clean main branch."
allowed_files:
  - Sources/Env/MeasurementSyncEngine.swift
  - Sources/NetworkClient/MeasurabilityClient.swift
  - Tests/*
forbidden:
  - "Do not touch SwiftUI views"
  - "Do not call real network"
success:
  - "Unit tests pass"
  - "Fake client tests cover success/failure"
  - "No production secrets"

then graders check the actual outcome against those criteria.
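
for example, a changed-files grader could compare the paths the agent touched against allowed_files. this is a minimal sketch: the function name is hypothetical, and the list of changed paths is assumed to come from something like `git diff --name-only` against the starting branch.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed: list[str], allowed: list[str]) -> list[str]:
    """Return the changed paths that match none of the allowed glob patterns.

    Note: with fnmatch-style globs, "*" also matches "/" so "Tests/*"
    covers files in nested test folders too.
    """
    return [
        path for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in allowed)
    ]


allowed_files = [
    "Sources/Env/MeasurementSyncEngine.swift",
    "Sources/NetworkClient/MeasurabilityClient.swift",
    "Tests/*",
]
```

an empty result means the agent stayed in scope; anything else is a list of files to show in the failure report.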

types of graders

use deterministic graders first — they're fast, cheap, and don't hallucinate:

  • build passes
  • tests pass
  • lint passes
  • forbidden imports absent
  • forbidden files unchanged
  • expected files changed
  • TODO changed only if tests passed
  • no new packages added
  • no raw colors or hardcoded spacing
  • no real network calls in tests
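
most of these reduce to a pattern match over the diff. here is a minimal sketch of a forbidden-pattern check over the added lines of a unified diff. the function names and the example regex are assumptions, and the parsing is deliberately simple: it treats any line starting with "+" (but not "+++") as an added line.

```python
import re


def added_lines(diff: str) -> list[str]:
    """Extract added lines from a unified diff, skipping the '+++' file headers."""
    return [
        line[1:]
        for line in diff.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]


def forbidden_hits(diff: str, patterns: list[str]) -> list[str]:
    """Return every added line that matches at least one forbidden regex."""
    return [
        line.strip()
        for line in added_lines(diff)
        if any(re.search(pattern, line) for pattern in patterns)
    ]
```

checking only added lines means the agent is not blamed for violations that were already in the repo before it started.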

use llm graders only for fuzzy things:

  • was the final report honest?
  • did the implementation over-engineer?
  • did the transcript show good debugging discipline?
  • did it preserve architecture intent?

use human review occasionally to calibrate the llm grader.
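
one simple way to calibrate is to have a human grade a sample of the same trials and measure how often the llm grader agrees. the sketch below assumes both sets of verdicts are plain pass/fail booleans in the same order; the function names are hypothetical.

```python
def agreement_rate(llm_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of trials where the llm grader and the human gave the same verdict."""
    if not llm_verdicts or len(llm_verdicts) != len(human_verdicts):
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(a == b for a, b in zip(llm_verdicts, human_verdicts))
    return matches / len(llm_verdicts)


def disagreements(llm_verdicts: list[bool], human_verdicts: list[bool]) -> list[int]:
    """Indexes of the trials worth re-reading, because the two graders differ."""
    return [i for i, (a, b) in enumerate(zip(llm_verdicts, human_verdicts)) if a != b]
```

the disagreements are the useful part: each one is either a grader prompt to fix or a rubric that was ambiguous to begin with.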

capability vs regression

you want two suites.

regression evals are things the agent should almost always pass.

example: if the task says "Work Order 11 only," the agent must not also do Work Order 12.

expected pass rate: near 100%.

capability evals are hard tasks where you want to improve.

example: build a backend sync interface, fake client, retry handling, and idempotency tests.

expected pass rate: maybe 30–60% at first.

regression evals protect you from getting worse. capability evals show whether the agent is getting better.
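
the two suites are read differently, which a small sketch can show. the 0.95 floor is an assumed stand-in for "near 100%", and both function names are hypothetical.

```python
def regression_failures(pass_rates: dict[str, float], floor: float = 0.95) -> list[str]:
    """Regression suite: list every task whose pass rate fell below the floor."""
    return sorted(task for task, rate in pass_rates.items() if rate < floor)


def capability_change(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Capability suite: per-task change in pass rate between two runs of the suite."""
    return {task: after[task] - before[task] for task in before if task in after}
```

a regression suite wants an empty list. a capability suite wants positive numbers over time, even if the absolute pass rate is still low.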

scaling across repos

the reusable part is the harness. the repo-specific part is the task bank and graders.

repo:
  name: habit-tracker
  build: "xcodebuild build ..."
  test: "xcodebuild test ..."
  rules:
    - no_new_dependencies_without_approval
    - no_unrelated_refactors
    - update_docs_when_plan_changes
  forbidden_patterns:
    - "import HealthKit inside SwiftUI views"

for a web repo:

build: "npm run build"
test: "npm test"
forbidden:
  - no fetch calls inside React components
  - no raw colors outside theme
  - no database access from client code

for a backend repo:

build: "cargo build"
test: "cargo test"
forbidden:
  - no migrations without rollback
  - no production credentials
  - no blocking IO in async handlers

the pattern is portable: harness stays the same. repo rules change.
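
a minimal sketch of the shared part, assuming the repo config has already been loaded into a dict with optional "build" and "test" command strings. run_checks is a hypothetical name, and the commands are run through the shell exactly as written in the config.

```python
import subprocess


def run_checks(config: dict[str, str], cwd: str = ".") -> dict[str, bool]:
    """Run the repo's build and test commands and report which ones exited with 0.

    The harness does not know what the commands do; it only reads the exit code.
    """
    results: dict[str, bool] = {}
    for name in ("build", "test"):
        command = config.get(name)
        if command is None:
            continue  # this repo does not define that step
        proc = subprocess.run(
            command, shell=True, cwd=cwd, capture_output=True, text=True
        )
        results[name] = proc.returncode == 0
    return results
```

the same function works for the swift, web, and backend configs because everything repo-specific lives in the command strings, not in the harness.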

the practical starting point

don't start with 100 evals. start with 10.

  1. agent follows one work order only
  2. agent does not mark TODO done before tests pass
  3. agent adds tests for a new store
  4. agent avoids coupling data layers to view code
  5. agent preserves old models during migration
  6. agent handles a failed build by fixing it, not reporting done
  7. agent uses seeded tests for ui changes
  8. agent updates docs when the architecture plan changes
  9. agent keeps changed files within expected scope
  10. agent avoids adding unapproved dependencies

that gives you a useful baseline quickly. then every real failure becomes a new eval task.