MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Evaluation Runner

Status: Production. Covers automated evaluation datasets, benchmark execution, and metric calculation.

The Evaluation Runner provides automated testing of agent performance against curated datasets. It executes benchmark suites, calculates quality metrics, and tracks performance trends over time.


12.2.11.1 Evaluation Architecture

| Component | File | Purpose |
|---|---|---|
| AutoEvalRunnerService | eval_runner_service.py | Orchestrates evaluation runs |
| EvalDatasetService | eval_dataset_service.py | Manages evaluation datasets |
| EvalRunnerRoutes | eval_runner_routes.py | REST API endpoints |
| EvalDatasetRoutes | eval_dataset_routes.py | Dataset management endpoints |

Evaluation Workflow

Dataset (questions + expected answers)
    |
    v
EvalRunner.execute(agent_id, dataset_id)
    |
    +-- For each test case:
    |     +-- Send question to agent
    |     +-- Compare response to expected answer
    |     +-- Calculate metrics (accuracy, relevance, SQL correctness)
    |
    v
EvalResult (aggregate scores + per-case results)
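
The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the code in eval_runner_service.py: `EvalCase`, `CaseResult`, `EvalResult`, `normalize`, and `run_evaluation` are hypothetical names, the agent is modeled as a plain callable, and the comparison is a normalized exact match standing in for the relevance and SQL correctness metrics, whose actual implementation is not shown here.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One dataset entry: a question and the answer the agent should give."""
    question: str
    expected_answer: str


@dataclass
class CaseResult:
    question: str
    response: str
    passed: bool


@dataclass
class EvalResult:
    """Aggregate scores plus per-case results."""
    total_cases: int
    passed: int
    failed: int
    overall_accuracy: float
    cases: list[CaseResult] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not fail a case."""
    return " ".join(text.lower().split())


def run_evaluation(agent: Callable[[str], str], dataset: list[EvalCase]) -> EvalResult:
    """Send each question to the agent, compare the response, and aggregate."""
    cases = []
    for case in dataset:
        response = agent(case.question)
        # Assumption: exact match after normalization; a real runner would
        # likely use semantic similarity or SQL result comparison instead.
        passed = normalize(response) == normalize(case.expected_answer)
        cases.append(CaseResult(case.question, response, passed))
    total = len(cases)
    passed_count = sum(c.passed for c in cases)
    accuracy = passed_count / total if total else 0.0
    return EvalResult(total, passed_count, total - passed_count, accuracy, cases)
```

Passing the agent as a callable keeps the loop testable with a stub, for example `run_evaluation(lambda q: "42", [EvalCase("What is 6 x 7?", "42")])`.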

12.2.11.2 API Endpoints

# Create evaluation dataset
curl -X POST http://localhost:8000/api/v1/eval-datasets \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "name": "SQL Generation Benchmark v2",
    "test_cases": [
      {
        "question": "What is total revenue for 2024?",
        "expected_sql": "SELECT SUM(revenue) FROM orders WHERE YEAR(order_date) = 2024",
        "expected_answer": "Total revenue for 2024 is $12.5M",
        "tags": ["aggregation", "date_filter"]
      }
    ]
  }'
 
# Run evaluation
curl -X POST http://localhost:8000/api/v1/eval-runs \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "agent_id": "sql-agent-v2",
    "dataset_id": "dataset-uuid-123",
    "config": {"temperature": 0.0, "max_retries": 0}
  }'
 
# Get evaluation results (replace {run_id} with the ID of the run; the URL is quoted so the shell does not interpret "?")
curl "http://localhost:8000/api/v1/eval-runs/{run_id}?tenant_id=acme-corp"
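
The same calls can be made from Python with only the standard library. The sketch below mirrors the curl commands above (same paths, the `X-Tenant-ID` header on POST requests, and the `tenant_id` query parameter on the results request). The helper names `build_post`, `build_get_run`, and `send` are hypothetical, and the shape of the responses to the POST calls is not documented here, so the sketch only decodes whatever JSON comes back.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_post(path: str, tenant_id: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request carrying the tenant header used in the curl examples."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )


def build_get_run(run_id: str, tenant_id: str) -> urllib.request.Request:
    """Build the GET request for one run's results, with a URL-encoded tenant_id."""
    query = urllib.parse.urlencode({"tenant_id": tenant_id})
    safe_run_id = urllib.parse.quote(run_id, safe="")
    return urllib.request.Request(f"{BASE_URL}/eval-runs/{safe_run_id}?{query}", method="GET")


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON body (needs a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example: start a run with the same payload as the curl command.
run_request = build_post(
    "/eval-runs",
    "acme-corp",
    {
        "agent_id": "sql-agent-v2",
        "dataset_id": "dataset-uuid-123",
        "config": {"temperature": 0.0, "max_retries": 0},
    },
)
# send(run_request) would submit it; it is left uncalled so the sketch runs offline.
```

Separating request construction from sending makes it possible to inspect or unit-test the requests without a live server.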

Evaluation Result

{
  "run_id": "run-uuid-456",
  "agent_id": "sql-agent-v2",
  "dataset_id": "dataset-uuid-123",
  "status": "completed",
  "metrics": {
    "overall_accuracy": 0.85,
    "sql_correctness": 0.82,
    "semantic_similarity": 0.91,
    "execution_success_rate": 0.88
  },
  "total_cases": 50,
  "passed": 42,
  "failed": 8,
  "duration_seconds": 245
}
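
A result like the one above can feed a simple quality gate or trend check. The following sketch is one way to do that on the client side, under stated assumptions: `pass_rate`, `check_thresholds`, and `metric_deltas` are hypothetical helpers, and the threshold and baseline values are made up for illustration. Note that the pass rate derived from the counts (42 of 50) need not equal `overall_accuracy`, since how that metric is computed is not specified here.

```python
import json

# The sample result shown above, trimmed to the fields used below.
RESULT_JSON = """
{
  "run_id": "run-uuid-456",
  "status": "completed",
  "metrics": {
    "overall_accuracy": 0.85,
    "sql_correctness": 0.82,
    "semantic_similarity": 0.91,
    "execution_success_rate": 0.88
  },
  "total_cases": 50,
  "passed": 42,
  "failed": 8
}
"""


def pass_rate(result: dict) -> float:
    """Fraction of test cases that passed, computed from the raw counts."""
    return result["passed"] / result["total_cases"]


def check_thresholds(metrics: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def metric_deltas(current: dict, baseline: dict) -> dict:
    """Difference per metric between two runs, for metrics present in both."""
    return {
        name: round(current[name] - baseline[name], 4)
        for name in current
        if name in baseline
    }


result = json.loads(RESULT_JSON)

# Hypothetical minimums chosen only to show one passing and one failing check.
failures = check_thresholds(
    result["metrics"], {"overall_accuracy": 0.80, "sql_correctness": 0.85}
)

# Hypothetical metrics from an earlier run, used to illustrate trend tracking.
baseline_metrics = {"overall_accuracy": 0.80, "sql_correctness": 0.84}
deltas = metric_deltas(result["metrics"], baseline_metrics)
```

With these illustrative inputs, `failures` contains a single entry for `sql_correctness`, and `deltas` shows accuracy improving while SQL correctness drops, which is the kind of regression a trend check is meant to surface.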