Errata is a shared agentic harness for model evaluation: same prompt, same tools, different models. Write evaluation recipes in markdown, run them from the command line, and judge the results yourself.

$ go install github.com/errata-app/errata-cli/cmd/errata@latest
$ errata run -r go_docstore.md --verbose

errata: DocStore Bug Fix Challenge (1 task, 6 models)

  claude-sonnet-4-6      PASS   12804ms  $0.089  4/4 criteria
  gemini-2.5-pro         PASS   18443ms  $0.047  4/4 criteria
  o3                     PASS   21587ms  $0.120  4/4 criteria
  gpt-4.1                FAIL   15872ms  $0.031  2/4 criteria
  claude-haiku-4.5       FAIL    9241ms  $0.004  1/4 criteria
  llama-3.1-8b-instruct  FAIL    6102ms  $0.000  0/4 criteria

Recipes are markdown files. They define which models to test, what tools they get, the tasks to run, and how to score the results. No SDK, no boilerplate.
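The recipe schema isn't documented here, so the file below is only an illustrative sketch: the section names (Models, Tools, Task, Criteria) and tool names are assumptions, not the actual format. It shows the four pieces a recipe has to carry.

```markdown
<!-- go_docstore.md — hypothetical recipe; section and tool names are illustrative -->
# DocStore Bug Fix Challenge

## Models
- claude-sonnet-4-6
- gpt-4.1
- llama-3.1-8b-instruct

## Tools
- read_file
- write_file
- run_tests

## Task
`DocStore.Get` returns stale entries after a delete.
Find and fix the bug without breaking the existing tests.

## Criteria
- Identifies the stale read in `Get`
- Fix compiles and all tests pass
- No unrelated changes
- Explains the root cause
```

Four criteria here would match the `4/4 criteria` column in the run output above.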

There's also an interactive TUI mode — run errata with no arguments, type a prompt, and watch every model work side by side. Pick the winner. Your preferences are logged to data/preferences.jsonl.


GitHub — CLI and README