Errata is a shared agentic harness for model evaluation: same prompt, same tools, different models. Write evaluation recipes in markdown, run them from the command line, and judge the results yourself.
```
$ go install github.com/errata-app/errata-cli/cmd/errata@latest
$ errata run -r go_docstore.md --verbose
errata: DocStore Bug Fix Challenge (1 task, 6 models)
  claude-sonnet-4-6      PASS  12804ms  $0.089  4/4 criteria
  gemini-2.5-pro         PASS  18443ms  $0.047  4/4 criteria
  o3                     PASS  21587ms  $0.120  4/4 criteria
  gpt-4.1                FAIL  15872ms  $0.031  2/4 criteria
  claude-haiku-4.5       FAIL   9241ms  $0.004  1/4 criteria
  llama-3.1-8b-instruct  FAIL   6102ms  $0.000  0/4 criteria
```
Recipes are markdown files. They define which models to test, what tools they get, the tasks to run, and how to score the results. No SDK, no boilerplate.
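As an illustration only (the actual recipe schema isn't shown here, so every heading and field name below is an assumption, not Errata's documented format), a recipe might look something like:

```markdown
# DocStore Bug Fix Challenge

<!-- Hypothetical layout: models to test, tools they get, one task, scoring criteria. -->

models: claude-sonnet-4-6, gemini-2.5-pro, o3, gpt-4.1
tools: read_file, write_file, run_tests

## Task

Fix the failing test in the DocStore package without modifying the test itself.

## Criteria

- The test suite passes
- No test files were changed
- The diff is small and focused
- The model explains the root cause
```

The pass/fail counts in the sample run (e.g. `4/4 criteria`) suggest each criterion is scored independently per model.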
There's also an interactive TUI mode: run `errata` with no arguments, type a prompt, and watch every model work side by side. Pick the winner; your preferences are logged to `data/preferences.jsonl`.
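Since the preference log is JSONL (one JSON object per line), it's easy to post-process with standard tooling. A minimal sketch, assuming a record shape of my own invention (the field names below are not Errata's documented schema):

```python
import json

# Hypothetical preference record -- "prompt", "winner", and "candidates"
# are assumed field names, not Errata's actual schema.
record = {
    "prompt": "Fix the flaky DocStore test",
    "winner": "claude-sonnet-4-6",
    "candidates": ["claude-sonnet-4-6", "gpt-4.1"],
}

# JSONL is one JSON object per line; appending a record is a single write.
line = json.dumps(record)
parsed = json.loads(line)  # round-trips cleanly
print(parsed["winner"])    # -> claude-sonnet-4-6
```

Logs in this shape can be tallied later to see which model you actually prefer in practice, not just which one passes automated criteria.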
GitHub: CLI and README