MATIH Platform is in active MVP development. Documentation reflects current implementation status.
12. AI Service
SQL Generation

Production - LLM-powered NL to SQL conversion with multi-dialect support

The SQL Generation component converts parsed natural language questions into SQL queries using LLM prompting with schema context. It supports multiple SQL dialects including Trino, Spark SQL, PostgreSQL, BigQuery, Snowflake, DuckDB, and MySQL.


12.3.2.1 Generation Process

The LLM receives a carefully constructed prompt containing:

  1. System instructions: SQL generation rules, dialect-specific syntax
  2. Schema context: Relevant tables, columns, relationships
  3. Semantic mappings: Business term to column mappings
  4. Sample queries: Similar previously-successful queries
  5. User question: The natural language question with parsed intent and entities

async def _generate_sql(self, question, parsed_question, context, dialect):
    return await self.llm_client.generate_sql(
        question=question,
        tables=context.tables,
        columns=context.columns,
        relationships=context.relationships,
        semantic_mappings=context.semantic_mappings,
        sample_queries=context.sample_queries,
        dialect=dialect,
        intent=parsed_question.intent.value,
        entities=parsed_question.entities,
        time_references=parsed_question.time_references,
        aggregations=parsed_question.aggregations,
    )
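As a sketch, the five prompt sections listed above might be assembled into a single prompt string as follows. The template text and helper name here are illustrative assumptions, not the actual MATIH implementation:

```python
# Illustrative prompt assembly -- section order follows the list above.
# SYSTEM_RULES and build_prompt are assumed names, not MATIH internals.
SYSTEM_RULES = (
    "You are a SQL generator. Emit a single {dialect} query. "
    "Use only the tables and columns provided."
)

def build_prompt(question, tables, columns, semantic_mappings, sample_queries, dialect):
    """Concatenate the five prompt sections in a fixed order."""
    parts = [
        SYSTEM_RULES.format(dialect=dialect),                      # 1. system instructions
        "Tables: " + ", ".join(tables),                            # 2. schema context
        "Columns: " + ", ".join(f"{t}.{c}" for t, c in columns),
        "Mappings: " + "; ".join(                                  # 3. semantic mappings
            f"{term} -> {col}" for term, col in semantic_mappings.items()
        ),
        "Examples:\n" + "\n".join(sample_queries),                 # 4. sample queries
        "Question: " + question,                                   # 5. user question
    ]
    return "\n\n".join(parts)
```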

12.3.2.2 Supported Dialects

Dialect    | Key        | Notes
Trino      | trino      | Default dialect, federated query engine
Spark SQL  | spark      | Distributed processing engine
PostgreSQL | postgresql | Relational database
BigQuery   | bigquery   | Google Cloud data warehouse
Snowflake  | snowflake  | Cloud data warehouse
DuckDB     | duckdb     | Embedded analytical database
MySQL      | mysql      | Relational database

Dialect Transpilation

The SQLValidator includes a transpilation feature using sqlglot:

def transpile_sql(self, sql: str, source_dialect: str, target_dialect: str) -> str:
    # Normalize user-facing dialect keys to sqlglot dialect names,
    # falling back to the raw key for names not in the map.
    source = self.DIALECT_MAP.get(source_dialect, source_dialect)
    target = self.DIALECT_MAP.get(target_dialect, target_dialect)
    return sqlglot.transpile(sql, read=source, write=target)[0]
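A DIALECT_MAP of the following shape would let the validator accept the user-facing keys from the table above while passing unknown names through to sqlglot unchanged. The mapping below is an assumed illustration; only the postgresql-to-postgres renaming is known to be required, since sqlglot's dialect name is "postgres":

```python
# Assumed illustration of a dialect alias map. sqlglot identifies
# dialects by lowercase strings such as "trino", "spark", "postgres".
DIALECT_MAP = {
    "postgresql": "postgres",  # sqlglot uses "postgres"
    "trino": "trino",
    "spark": "spark",
    "bigquery": "bigquery",
    "snowflake": "snowflake",
    "duckdb": "duckdb",
    "mysql": "mysql",
}

def resolve_dialect(name: str) -> str:
    """Map a user-facing dialect key to a sqlglot dialect name,
    leaving unrecognized names untouched."""
    return DIALECT_MAP.get(name, name)
```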

12.3.2.3 Auto-Correction Loop

When validation fails, the generator attempts auto-correction:

retries = 0
for attempt in range(self.settings.max_sql_retries + 1):
    validation_result = self.validator.validate(sql, dialect, available_tables, available_columns)

    if validation_result.is_valid:
        break

    if not auto_correct or attempt >= self.settings.max_sql_retries:
        break

    # Send errors back to LLM for correction
    error_msg = "; ".join(e.get("message", "") for e in validation_result.errors)
    sql = await self.llm_client.correct_sql(
        sql=sql,
        error=error_msg,
        tables=context.tables,
        columns=context.columns,
    )
    retries += 1
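The loop above can be exercised end-to-end with stubbed components. In this minimal synchronous sketch, the validate and correct callables are illustrative stand-ins for SQLValidator and the LLM client:

```python
# Synchronous sketch of the auto-correction loop; validate returns a list
# of error messages (empty means valid), correct returns a revised query.
def generate_with_correction(sql, validate, correct, max_retries=2, auto_correct=True):
    """Return (sql, retries): re-ask the corrector until validation
    passes or the retry budget is exhausted."""
    retries = 0
    for attempt in range(max_retries + 1):
        errors = validate(sql)
        if not errors:
            break
        if not auto_correct or attempt >= max_retries:
            break
        sql = correct(sql, "; ".join(errors))
        retries += 1
    return sql, retries
```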

12.3.2.4 API Usage

curl -X POST http://localhost:8000/api/v1/text-to-sql \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "question": "What is the average order value by region for Q4 2024?",
    "dialect": "trino",
    "validate": true,
    "auto_correct": true,
    "catalog": "lakehouse"
  }'

Response:

{
  "success": true,
  "query": {
    "sql": "SELECT r.region_name, AVG(o.revenue) AS avg_order_value\nFROM lakehouse.sales.orders o\nJOIN lakehouse.sales.regions r ON o.region_id = r.region_id\nWHERE o.order_date >= DATE '2024-10-01' AND o.order_date < DATE '2025-01-01'\nGROUP BY r.region_name\nORDER BY avg_order_value DESC",
    "confidence": 0.87,
    "tables_used": ["lakehouse.sales.orders", "lakehouse.sales.regions"],
    "explanation": "This query calculates aggregate values from orders, regions using AVG for Q4 2024.",
    "status": "validated"
  },
  "validation": {
    "is_valid": true,
    "errors": [],
    "warnings": []
  },
  "retries": 0
}
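The same call can be made and post-processed from Python using only the standard library. The endpoint, headers, payload, and response fields follow the example above; the function names are illustrative:

```python
import json
import urllib.request

def text_to_sql_request(question, tenant_id, dialect="trino",
                        base_url="http://localhost:8000"):
    """Build the POST request shown in the curl example; send it with
    urllib.request.urlopen(req) when the service is running."""
    payload = {
        "question": question,
        "dialect": dialect,
        "validate": True,
        "auto_correct": True,
        "catalog": "lakehouse",
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/text-to-sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )

def extract_validated_sql(response_text):
    """Return the generated SQL only when the service reports success and
    validation found no errors; otherwise return None."""
    resp = json.loads(response_text)
    if resp.get("success") and resp.get("validation", {}).get("is_valid"):
        return resp["query"]["sql"]
    return None
```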