Benchmark runner quickstart

Run a retrieval benchmark, with optional faithfulness scoring, that gates CI on configurable thresholds.

from attune_rag.benchmark import main

exit_code = main()
print(exit_code)  # 0 when every metric meets its threshold

Prerequisites

Run your first benchmark

  1. Import and call main. With no arguments, the runner uses its defaults:

    from attune_rag.benchmark import main
    
    exit_code = main()
    

    A return value of 0 means all metrics passed their thresholds.

  2. Pass a custom query file. Supply your own queries via argv:

    exit_code = main(["--queries", "my_queries.jsonl"])
    
  3. Enable faithfulness scoring. Add --with-faithfulness to score generated answers against their source passages:

    exit_code = main(["--queries", "my_queries.jsonl", "--with-faithfulness"])
    

    A non-zero return value means at least one metric fell below its configured threshold; check the logged output to see which one.
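
The argument lists in steps 2 and 3 can also be assembled from keyword options. The sketch below is a minimal illustration, not part of attune_rag: build_argv is a hypothetical helper, and it uses only the two flags shown above (--queries and --with-faithfulness).

```python
from typing import List, Optional


def build_argv(queries: Optional[str] = None, with_faithfulness: bool = False) -> List[str]:
    """Hypothetical helper: build the argv list passed to main()."""
    argv: List[str] = []
    if queries is not None:
        # Custom query file, as in step 2.
        argv += ["--queries", queries]
    if with_faithfulness:
        # Optional faithfulness scoring, as in step 3.
        argv.append("--with-faithfulness")
    return argv


print(build_argv("my_queries.jsonl", with_faithfulness=True))
# ['--queries', 'my_queries.jsonl', '--with-faithfulness']
```

The resulting list can be passed straight to the runner, for example main(build_argv("my_queries.jsonl", with_faithfulness=True)).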

Expected output

Retrieval precision: 0.91  ✓
Retrieval recall:    0.87  ✓
Faithfulness:        0.93  ✓
Benchmark passed (exit 0)

Next: Wire main() into your CI pipeline and set threshold values in your benchmark configuration to enforce quality gates on every pull request.
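
One possible way to wire this into CI is a small wrapper script that forwards the runner's return value as the process exit code, so the job fails whenever a metric misses its threshold. This is a sketch under assumptions: run_gate and cli are hypothetical names, and the runner is passed in as a parameter so the wrapper can be exercised with a stub instead of the real main.

```python
import sys
from typing import Callable, List, Optional


def run_gate(runner: Callable[..., int], argv: Optional[List[str]] = None) -> int:
    """Hypothetical wrapper: call the benchmark runner and return its exit code."""
    # With no arguments, call the runner bare so it uses its defaults.
    exit_code = runner(argv) if argv else runner()
    if exit_code != 0:
        # Non-zero means at least one metric fell below its threshold.
        print(f"Benchmark gate failed (exit {exit_code})", file=sys.stderr)
    return exit_code


def cli() -> int:
    """Hypothetical entry point: forward command-line arguments to main."""
    # Imported here so run_gate can be tested without attune_rag installed.
    from attune_rag.benchmark import main

    return run_gate(main, sys.argv[1:])
```

Ending the script with sys.exit(cli()) makes the CI step succeed on 0 and fail on any other value.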

Tags: benchmark, ci, precision, recall, quality