Benchmarking

Production - Spider benchmark, accuracy metrics, regression testing

The benchmarking system evaluates SQL generation quality against standard datasets such as Spider and against custom, tenant-specific test suites.
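As a rough illustration of how exact-match and execution accuracy can diverge for the same generated query, the sketch below scores one predicted query against a gold query in both ways. It is not the service's implementation. It assumes an in-memory SQLite database rather than Trino, a made-up `singer` table, and hypothetical helpers `exact_match` and `execution_match`. The string comparison is also a simplification: Spider's official exact-match metric compares parsed SQL components rather than raw text.

```python
import sqlite3


def normalize(sql: str) -> str:
    """Lowercase, drop a trailing semicolon, and collapse whitespace."""
    return " ".join(sql.strip().rstrip(";").lower().split())


def exact_match(predicted: str, gold: str) -> bool:
    # Simplified assumption: compare normalized text only.
    # Spider's official metric compares parsed query components instead.
    return normalize(predicted) == normalize(gold)


def execution_match(conn: sqlite3.Connection, predicted: str, gold: str) -> bool:
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        # A query that fails to run counts as a miss.
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 31), (2, 'Bo', 25), (3, 'Cy', 40);
    """
)

gold = "SELECT name FROM singer WHERE age > 30"
predicted = "SELECT name FROM singer WHERE NOT age <= 30"

print("exact match:", exact_match(predicted, gold))
print("execution match:", execution_match(conn, predicted, gold))
```

Here the two queries differ textually but return the same rows, so the pair counts toward execution accuracy and not toward exact-match accuracy.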

12.3.7.1 Spider Benchmark

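The Spider benchmark is implemented in data-plane/ai-service/src/sql_generation/benchmark/spider_benchmark.py. The request below runs up to 100 Spider cases across all difficulty levels, targeting the Trino dialect: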

# Run Spider benchmark
curl -X POST http://localhost:8000/api/v1/text-to-sql/benchmark \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "benchmark": "spider",
    "difficulty": "all",
    "limit": 100,
    "dialect": "trino"
  }'

Example response:

{
  "benchmark": "spider",
  "total_cases": 100,
  "metrics": {
    "exact_match_accuracy": 0.72,
    "execution_accuracy": 0.81,
    "partial_match_accuracy": 0.88,
    "by_difficulty": {
      "easy": {"count": 25, "accuracy": 0.92},
      "medium": {"count": 40, "accuracy": 0.82},
      "hard": {"count": 25, "accuracy": 0.68},
      "extra_hard": {"count": 10, "accuracy": 0.50}
    }
  },
  "duration_seconds": 180
}
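For regression testing, a response like the one above could be compared against a stored baseline. The following is a minimal sketch under stated assumptions: the `BASELINE` thresholds and the helpers `find_regressions` and `counts_are_consistent` are hypothetical and not part of the service. Only the field names are taken from the sample response.

```python
import json
from typing import Dict, List

# Hypothetical minimum acceptable scores; in practice these would come
# from a previously recorded benchmark run.
BASELINE: Dict[str, float] = {
    "exact_match_accuracy": 0.70,
    "execution_accuracy": 0.80,
    "partial_match_accuracy": 0.85,
}


def find_regressions(report: dict, baseline: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """List every metric that is missing or falls below its baseline."""
    metrics = report["metrics"]
    failures = []
    for name, floor in baseline.items():
        score = metrics.get(name)
        if score is None:
            failures.append(f"{name}: missing from report")
        elif score < floor - tolerance:
            failures.append(f"{name}: {score:.2f} is below baseline {floor:.2f}")
    return failures


def counts_are_consistent(report: dict) -> bool:
    """Check that the per-difficulty counts add up to total_cases."""
    buckets = report["metrics"]["by_difficulty"].values()
    return sum(bucket["count"] for bucket in buckets) == report["total_cases"]


# The sample response from this section, trimmed to the fields used here.
report = json.loads("""
{
  "benchmark": "spider",
  "total_cases": 100,
  "metrics": {
    "exact_match_accuracy": 0.72,
    "execution_accuracy": 0.81,
    "partial_match_accuracy": 0.88,
    "by_difficulty": {
      "easy": {"count": 25, "accuracy": 0.92},
      "medium": {"count": 40, "accuracy": 0.82},
      "hard": {"count": 25, "accuracy": 0.68},
      "extra_hard": {"count": 10, "accuracy": 0.50}
    }
  }
}
""")

failures = find_regressions(report, BASELINE)
print("counts consistent:", counts_are_consistent(report))
print("regressions:", failures if failures else "none")
```

A check of this kind could run in CI after each benchmark invocation and fail the build when `find_regressions` returns a non-empty list. The `tolerance` parameter allows for small run-to-run variation.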
