MATIH Platform is in active MVP development. Documentation reflects current implementation status.
12. AI Service
SQL Generation

Production - LLM-powered NL to SQL conversion with multi-dialect support

The SQL Generation component converts parsed natural language questions into SQL queries using LLM prompting with schema context. It supports multiple SQL dialects including Trino, Spark SQL, PostgreSQL, BigQuery, Snowflake, DuckDB, and MySQL.


12.3.2.1 Generation Process

The LLM receives a carefully constructed prompt containing:

  1. System instructions: SQL generation rules, dialect-specific syntax
  2. Schema context: Relevant tables, columns, relationships
  3. Semantic mappings: Business term to column mappings
  4. Sample queries: Similar previously-successful queries
  5. User question: The natural language question with parsed intent and entities

async def _generate_sql(self, question, parsed_question, context, dialect):
    return await self.llm_client.generate_sql(
        question=question,
        tables=context.tables,
        columns=context.columns,
        relationships=context.relationships,
        semantic_mappings=context.semantic_mappings,
        sample_queries=context.sample_queries,
        dialect=dialect,
        intent=parsed_question.intent.value,
        entities=parsed_question.entities,
        time_references=parsed_question.time_references,
        aggregations=parsed_question.aggregations,
    )
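As a sketch, the five prompt sections listed above might be assembled into a single prompt string as follows. The template text and helper name here are illustrative assumptions, not the actual MATIH implementation:

```python
# Illustrative prompt assembly -- section order follows the list above.
# SYSTEM_RULES and build_prompt are assumed names, not MATIH internals.
SYSTEM_RULES = (
    "You are a SQL generator. Emit a single {dialect} query. "
    "Use only the tables and columns provided."
)

def build_prompt(question, tables, columns, semantic_mappings, sample_queries, dialect):
    """Concatenate the five prompt sections in a fixed order."""
    parts = [
        SYSTEM_RULES.format(dialect=dialect),                      # 1. system instructions
        "Tables: " + ", ".join(tables),                            # 2. schema context
        "Columns: " + ", ".join(f"{t}.{c}" for t, c in columns),
        "Mappings: " + "; ".join(                                  # 3. semantic mappings
            f"{term} -> {col}" for term, col in semantic_mappings.items()
        ),
        "Examples:\n" + "\n".join(sample_queries),                 # 4. sample queries
        "Question: " + question,                                   # 5. user question
    ]
    return "\n\n".join(parts)
```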

12.3.2.2 Supported Dialects

Dialect    | Key        | Notes
Trino      | trino      | Default dialect, federated query engine
Spark SQL  | spark      | Distributed processing engine
PostgreSQL | postgresql | Relational database
BigQuery   | bigquery   | Google Cloud data warehouse
Snowflake  | snowflake  | Cloud data warehouse
DuckDB     | duckdb     | Embedded analytical database
MySQL      | mysql      | Relational database

Dialect Transpilation

The SQLValidator includes a transpilation feature using sqlglot:

def transpile_sql(self, sql: str, source_dialect: str, target_dialect: str) -> str:
    # Normalize user-facing dialect keys to sqlglot dialect names,
    # falling back to the raw key for names not in the map.
    source = self.DIALECT_MAP.get(source_dialect, source_dialect)
    target = self.DIALECT_MAP.get(target_dialect, target_dialect)
    return sqlglot.transpile(sql, read=source, write=target)[0]
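A DIALECT_MAP of the following shape would let the validator accept the user-facing keys from the table above while passing unknown names through to sqlglot unchanged. The mapping below is an assumed illustration; only the postgresql-to-postgres renaming is known to be required, since sqlglot's dialect name is "postgres":

```python
# Assumed illustration of a dialect alias map. sqlglot identifies
# dialects by lowercase strings such as "trino", "spark", "postgres".
DIALECT_MAP = {
    "postgresql": "postgres",  # sqlglot uses "postgres"
    "trino": "trino",
    "spark": "spark",
    "bigquery": "bigquery",
    "snowflake": "snowflake",
    "duckdb": "duckdb",
    "mysql": "mysql",
}

def resolve_dialect(name: str) -> str:
    """Map a user-facing dialect key to a sqlglot dialect name,
    leaving unrecognized names untouched."""
    return DIALECT_MAP.get(name, name)
```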

12.3.2.3 Auto-Correction Loop

When validation fails, the generator attempts auto-correction:

retries = 0
for attempt in range(self.settings.max_sql_retries + 1):
    validation_result = self.validator.validate(sql, dialect, available_tables, available_columns)

    if validation_result.is_valid:
        break

    if not auto_correct or attempt >= self.settings.max_sql_retries:
        break

    # Send errors back to LLM for correction
    error_msg = "; ".join(e.get("message", "") for e in validation_result.errors)
    sql = await self.llm_client.correct_sql(
        sql=sql,
        error=error_msg,
        tables=context.tables,
        columns=context.columns,
    )
    retries += 1
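The loop above can be exercised end-to-end with stubbed components. In this minimal synchronous sketch, the validate and correct callables are illustrative stand-ins for SQLValidator and the LLM client:

```python
# Synchronous sketch of the auto-correction loop; validate returns a list
# of error messages (empty means valid), correct returns a revised query.
def generate_with_correction(sql, validate, correct, max_retries=2, auto_correct=True):
    """Return (sql, retries): re-ask the corrector until validation
    passes or the retry budget is exhausted."""
    retries = 0
    for attempt in range(max_retries + 1):
        errors = validate(sql)
        if not errors:
            break
        if not auto_correct or attempt >= max_retries:
            break
        sql = correct(sql, "; ".join(errors))
        retries += 1
    return sql, retries
```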

12.3.2.4 API Usage

curl -X POST http://localhost:8000/api/v1/text-to-sql \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "question": "What is the average order value by region for Q4 2024?",
    "dialect": "trino",
    "validate": true,
    "auto_correct": true,
    "catalog": "lakehouse"
  }'

Response:

{
  "success": true,
  "query": {
    "sql": "SELECT r.region_name, AVG(o.revenue) AS avg_order_value\nFROM lakehouse.sales.orders o\nJOIN lakehouse.sales.regions r ON o.region_id = r.region_id\nWHERE o.order_date >= DATE '2024-10-01' AND o.order_date < DATE '2025-01-01'\nGROUP BY r.region_name\nORDER BY avg_order_value DESC",
    "confidence": 0.87,
    "tables_used": ["lakehouse.sales.orders", "lakehouse.sales.regions"],
    "explanation": "This query calculates aggregate values from orders, regions using AVG for Q4 2024.",
    "status": "validated"
  },
  "validation": {
    "is_valid": true,
    "errors": [],
    "warnings": []
  },
  "retries": 0
}
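The same call can be made and post-processed from Python using only the standard library. The endpoint, headers, payload, and response fields follow the example above; the function names are illustrative:

```python
import json
import urllib.request

def text_to_sql_request(question, tenant_id, dialect="trino",
                        base_url="http://localhost:8000"):
    """Build the POST request shown in the curl example; send it with
    urllib.request.urlopen(req) when the service is running."""
    payload = {
        "question": question,
        "dialect": dialect,
        "validate": True,
        "auto_correct": True,
        "catalog": "lakehouse",
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/text-to-sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )

def extract_validated_sql(response_text):
    """Return the generated SQL only when the service reports success and
    validation found no errors; otherwise return None."""
    resp = json.loads(response_text)
    if resp.get("success") and resp.get("validation", {}).get("is_valid"):
        return resp["query"]["sql"]
    return None
```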