Evaluation Runner
Production - Automated evaluation datasets, benchmark execution, metric calculation
The Evaluation Runner provides automated testing of agent performance against curated datasets. It executes benchmark suites, calculates quality metrics, and tracks performance trends over time.
12.2.11.1 Evaluation Architecture
| Component | File | Purpose |
|---|---|---|
| AutoEvalRunnerService | eval_runner_service.py | Orchestrates evaluation runs |
| EvalDatasetService | eval_dataset_service.py | Manages evaluation datasets |
| EvalRunnerRoutes | eval_runner_routes.py | REST API endpoints |
| EvalDatasetRoutes | eval_dataset_routes.py | Dataset management endpoints |
Evaluation Workflow
Dataset (questions + expected answers)
  |
  v
EvalRunner.execute(agent_id, dataset_id)
  |
  +-- For each test case:
  |     +-- Send question to agent
  |     +-- Compare response to expected answer
  |     +-- Calculate metrics (accuracy, relevance, SQL correctness)
  |
  v
EvalResult (aggregate scores + per-case results)
12.2.11.2 API Endpoints
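The evaluation workflow diagrammed above can be read as a simple loop. The following is a minimal Python sketch of that loop, not the actual EvalRunner implementation: the names `EvalCase`, `CaseResult`, `EvalResult`, `normalize`, and `run_eval` are hypothetical, the agent is modeled as a plain callable, and the comparison is an assumed normalized exact match (the real service also scores relevance and SQL correctness, which are not shown here).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    # Hypothetical shape: one question with its expected answer.
    question: str
    expected_answer: str


@dataclass
class CaseResult:
    question: str
    response: str
    passed: bool


@dataclass
class EvalResult:
    # Aggregate scores plus per-case results, mirroring the diagram.
    total_cases: int
    passed: int
    failed: int
    accuracy: float
    cases: list[CaseResult] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def run_eval(agent: Callable[[str], str], dataset: list[EvalCase]) -> EvalResult:
    """Send each question to the agent and compare the response to the expected answer."""
    cases = []
    for case in dataset:
        response = agent(case.question)
        # Assumed comparison: normalized exact match.
        ok = normalize(response) == normalize(case.expected_answer)
        cases.append(CaseResult(case.question, response, ok))
    passed = sum(c.passed for c in cases)
    total = len(cases)
    accuracy = passed / total if total else 0.0
    return EvalResult(total, passed, total - passed, accuracy, cases)


if __name__ == "__main__":
    dataset = [EvalCase("What is 2 + 2?", "4"), EvalCase("Capital of France?", "Paris")]
    result = run_eval(lambda question: "4", dataset)
    print(result.passed, result.failed, result.accuracy)
```

Keeping per-case results alongside the aggregate makes it possible to inspect individual failures after a run rather than only the headline score.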
# Create evaluation dataset
curl -X POST http://localhost:8000/api/v1/eval-datasets \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "name": "SQL Generation Benchmark v2",
    "test_cases": [
      {
        "question": "What is total revenue for 2024?",
        "expected_sql": "SELECT SUM(revenue) FROM orders WHERE YEAR(order_date) = 2024",
        "expected_answer": "Total revenue for 2024 is $12.5M",
        "tags": ["aggregation", "date_filter"]
      }
    ]
  }'
# Run evaluation
curl -X POST http://localhost:8000/api/v1/eval-runs \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "agent_id": "sql-agent-v2",
    "dataset_id": "dataset-uuid-123",
    "config": {"temperature": 0.0, "max_retries": 0}
  }'
# Get evaluation results
curl "http://localhost:8000/api/v1/eval-runs/{run_id}?tenant_id=acme-corp"
Evaluation Result
{
  "run_id": "run-uuid-456",
  "agent_id": "sql-agent-v2",
  "dataset_id": "dataset-uuid-123",
  "status": "completed",
  "metrics": {
    "overall_accuracy": 0.85,
    "sql_correctness": 0.82,
    "semantic_similarity": 0.91,
    "execution_success_rate": 0.88
  },
  "total_cases": 50,
  "passed": 42,
  "failed": 8,
  "duration_seconds": 245
}
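A result payload like the one above is easy to sanity-check on the client side. The sketch below is illustrative only: `summarize` and `SAMPLE_RESULT` are hypothetical names, and the derived fields (`pass_rate`, `seconds_per_case`, `weakest_metric`) are not part of the API response. Note that the pass rate computed from the counts (42 of 50) need not equal `overall_accuracy`, whose exact definition is not given here.

```python
# Sample payload copied from the Evaluation Result example above.
SAMPLE_RESULT = {
    "run_id": "run-uuid-456",
    "agent_id": "sql-agent-v2",
    "dataset_id": "dataset-uuid-123",
    "status": "completed",
    "metrics": {
        "overall_accuracy": 0.85,
        "sql_correctness": 0.82,
        "semantic_similarity": 0.91,
        "execution_success_rate": 0.88,
    },
    "total_cases": 50,
    "passed": 42,
    "failed": 8,
    "duration_seconds": 245,
}


def summarize(result: dict) -> dict:
    """Derive a few client-side figures from an evaluation result (hypothetical helper)."""
    total = result["total_cases"]
    passed = result["passed"]
    failed = result["failed"]
    # The counts should add up; a mismatch suggests a partial or corrupted result.
    if passed + failed != total:
        raise ValueError(f"passed ({passed}) + failed ({failed}) != total_cases ({total})")
    metrics = result["metrics"]
    return {
        "pass_rate": passed / total if total else 0.0,
        "seconds_per_case": result["duration_seconds"] / total if total else 0.0,
        # Name of the lowest-scoring metric, a natural place to start investigating.
        "weakest_metric": min(metrics, key=metrics.get),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_RESULT))
```

Recording these derived figures per run is one simple way to follow performance trends across runs of the same dataset.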