Benchmarking
Production - Spider benchmark, accuracy metrics, regression testing
The benchmarking system evaluates SQL generation quality against standard datasets like Spider and custom tenant-specific test suites.
12.3.7.1 Spider Benchmark
Implemented in data-plane/ai-service/src/sql_generation/benchmark/spider_benchmark.py:
# Run Spider benchmark
curl -X POST http://localhost:8000/api/v1/text-to-sql/benchmark \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "benchmark": "spider",
    "difficulty": "all",
    "limit": 100,
    "dialect": "trino"
  }'

Example response:

{
  "benchmark": "spider",
  "total_cases": 100,
  "metrics": {
    "exact_match_accuracy": 0.72,
    "execution_accuracy": 0.81,
    "partial_match_accuracy": 0.88,
    "by_difficulty": {
      "easy": {"count": 25, "accuracy": 0.92},
      "medium": {"count": 40, "accuracy": 0.82},
      "hard": {"count": 25, "accuracy": 0.68},
      "extra_hard": {"count": 10, "accuracy": 0.50}
    }
  },
  "duration_seconds": 180
}
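
Because the benchmark output is plain JSON, it can be fed into a simple regression check. The following is a minimal sketch, not the service's own regression logic: the names `BASELINE`, `TOLERANCE`, `find_regressions`, and `counts_are_consistent` are hypothetical, and the baseline numbers are made up for illustration. It assumes only the response shape shown above.

```python
import json
from typing import Dict, List

# Hypothetical baseline and tolerance, chosen for illustration only.
BASELINE: Dict[str, float] = {
    "exact_match_accuracy": 0.70,
    "execution_accuracy": 0.80,
    "partial_match_accuracy": 0.85,
}
TOLERANCE = 0.02

# The example response from this section, used as sample input.
SAMPLE_REPORT = json.loads("""
{
  "benchmark": "spider",
  "total_cases": 100,
  "metrics": {
    "exact_match_accuracy": 0.72,
    "execution_accuracy": 0.81,
    "partial_match_accuracy": 0.88,
    "by_difficulty": {
      "easy": {"count": 25, "accuracy": 0.92},
      "medium": {"count": 40, "accuracy": 0.82},
      "hard": {"count": 25, "accuracy": 0.68},
      "extra_hard": {"count": 10, "accuracy": 0.50}
    }
  },
  "duration_seconds": 180
}
""")


def find_regressions(
    report: dict,
    baseline: Dict[str, float],
    tolerance: float = TOLERANCE,
) -> List[str]:
    """Return metrics that are missing or more than `tolerance` below baseline."""
    metrics = report["metrics"]
    regressions = []
    for name, expected in baseline.items():
        actual = metrics.get(name)
        if actual is None or actual < expected - tolerance:
            regressions.append(name)
    return regressions


def counts_are_consistent(report: dict) -> bool:
    """Check that the per-difficulty counts add up to total_cases."""
    buckets = report["metrics"]["by_difficulty"].values()
    return sum(bucket["count"] for bucket in buckets) == report["total_cases"]


if __name__ == "__main__":
    print("regressions:", find_regressions(SAMPLE_REPORT, BASELINE))
    print("counts consistent:", counts_are_consistent(SAMPLE_REPORT))
```

In a CI job, a non-empty result from `find_regressions` could fail the build. The tolerance absorbs small run-to-run variation, which matters when only a subset of cases is sampled via `limit`.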