Prompt Engineering Patterns

The AI Service includes a comprehensive prompt engineering framework in agents/prompt_engineering.py that provides template management, versioning, A/B testing, few-shot example curation, prompt chains for complex workflows, and optimization strategies. This section covers the framework architecture, template system, testing infrastructure, and best practices for prompt development within the MATIH platform.


Framework Overview

+-------------------+     +-------------------+     +-------------------+
| Template Manager  |     | Chain Manager     |     | A/B Test Manager  |
| (versioning,      |     | (multi-step       |     | (experiment       |
|  variables)       |     |  workflows)       |     |  tracking)        |
+--------+----------+     +--------+----------+     +--------+----------+
         |                         |                          |
         v                         v                          v
+----------------------------------------------------------------+
|                    Prompt Registry                              |
|  Stores all prompt templates with metadata and versions        |
+----------------------------------------------------------------+
         |                         |                          |
         v                         v                          v
+-------------------+     +-------------------+     +-------------------+
| Few-Shot Manager  |     | Optimization      |     | Metrics Collector |
| (example          |     | (compression,     |     | (token usage,     |
|  selection)       |     |  expansion)       |     |  quality)         |
+-------------------+     +-------------------+     +-------------------+

Core Types

PromptMessage

# Shared imports for the snippets in this section
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any

class PromptRole(str, Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    FUNCTION = "function"

@dataclass
class PromptMessage:
    """A single message in a prompt."""
    role: PromptRole
    content: str
    name: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

Template Formats

class TemplateFormat(str, Enum):
    JINJA2 = "jinja2"       # Jinja2 templates with {{ variable }}
    FSTRING = "fstring"     # Python f-string style {variable}
    MUSTACHE = "mustache"   # Mustache style {{variable}}
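
For example, the Jinja2 and f-string styles render the same variable as follows (assuming the jinja2 package is available; Mustache delimiters match the Jinja2 example):

from jinja2 import Template

# Jinja2 / Mustache style: {{ variable }}
Template("Generate SQL for the {{ dialect }} dialect.").render(dialect="postgresql")

# Python f-string style: {variable}
"Generate SQL for the {dialect} dialect.".format(dialect="postgresql")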

PromptVersion

@dataclass
class PromptVersion:
    """Version information for a prompt template."""
    version: str
    created_at: datetime
    created_by: str
    description: str
    is_active: bool = True
    performance_metrics: dict[str, float] = field(default_factory=dict)

Template Management

PromptTemplate

@dataclass
class PromptTemplate:
    """A versioned prompt template."""
    id: str
    name: str
    description: str
    format: TemplateFormat = TemplateFormat.JINJA2
    messages: list[PromptMessage] = field(default_factory=list)
    variables: dict[str, str] = field(default_factory=dict)  # name -> description
    version: PromptVersion | None = None
    tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

Template Registry

The template registry manages all prompt templates with version tracking:

class PromptTemplateRegistry:
    """Registry for managing prompt templates."""

    def __init__(self):
        self._templates: dict[str, PromptTemplate] = {}
        self._version_history: dict[str, list[PromptVersion]] = {}

    def register(self, template: PromptTemplate) -> None:
        """Register or update a prompt template."""
        self._templates[template.id] = template
        if template.version is not None:
            self._version_history.setdefault(template.id, []).append(
                template.version
            )

    def get(self, template_id: str) -> PromptTemplate | None:
        """Get a template by ID."""
        return self._templates.get(template_id)

    def render(
        self,
        template_id: str,
        variables: dict[str, Any],
    ) -> list[dict[str, str]]:
        """Render a template with variables."""
        template = self._templates[template_id]
        rendered_messages = []

        for msg in template.messages:
            content = self._render_content(
                msg.content,
                variables,
                template.format,
            )
            rendered_messages.append({
                "role": msg.role.value,
                "content": content,
            })

        return rendered_messages

    def _render_content(
        self,
        content: str,
        variables: dict[str, Any],
        fmt: TemplateFormat,
    ) -> str:
        """Render message content according to the template format
        (simplified implementation)."""
        if fmt == TemplateFormat.FSTRING:
            return content.format(**variables)
        # JINJA2 and MUSTACHE both use {{ variable }} delimiters,
        # so Jinja2 rendering covers both here.
        from jinja2 import Template
        return Template(content).render(**variables)
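
For example, registering and rendering a template (the template ID and content here are illustrative, not shipped templates):

registry = PromptTemplateRegistry()
registry.register(PromptTemplate(
    id="summary_v1",
    name="Topic Summary",
    description="Summarize a topic in one sentence",
    messages=[
        PromptMessage(role=PromptRole.SYSTEM, content="You are a concise analyst."),
        PromptMessage(role=PromptRole.USER, content="Summarize {{ topic }} in one sentence."),
    ],
    variables={"topic": "The subject to summarize"},
))

messages = registry.render("summary_v1", {"topic": "quarterly revenue"})
# [{"role": "system", ...},
#  {"role": "user", "content": "Summarize quarterly revenue in one sentence."}]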

Built-in Templates

The AI Service ships with pre-built templates for common tasks:

Template ID                | Purpose                    | Variables
---------------------------|----------------------------|-----------------------------------
sql_generation_v3          | Text-to-SQL generation     | schema_context, question, dialect
sql_correction_v2          | SQL error correction       | sql, error, tables, columns
intent_classification_v2   | Intent classification      | message, available_intents
analysis_prompt_v1         | Statistical analysis       | data, analysis_type
visualization_selection_v1 | Chart type selection       | data_shape, intent
conversation_summary_v1    | Conversation summarization | messages
query_suggestion_v1        | Query suggestions          | tables, user_history
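
Rendering a built-in template works the same way, assuming the built-in templates are registered on a shared registry instance (schema_text stands in for serialized schema context):

messages = registry.render(
    "sql_generation_v3",
    {
        "schema_context": schema_text,
        "question": "Total revenue by region for last quarter",
        "dialect": "postgresql",
    },
)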

Few-Shot Example Management

FewShotExample

@dataclass
class FewShotExample:
    """A few-shot example for prompt enhancement."""
    input_text: str
    output_text: str
    explanation: str | None = None
    category: str | None = None
    quality_score: float = 1.0
    usage_count: int = 0
    success_rate: float = 1.0

Example Selection

The few-shot manager selects the most relevant examples for each query:

class FewShotManager:
    """Manages few-shot examples for prompts."""

    def __init__(self, max_examples: int = 5):
        self._examples: dict[str, list[FewShotExample]] = {}
        self._max_examples = max_examples

    def add_example(
        self,
        category: str,
        example: FewShotExample,
    ) -> None:
        """Add a few-shot example to a category."""
        self._examples.setdefault(category, []).append(example)

    def select_examples(
        self,
        category: str,
        query: str,
        k: int = 3,
    ) -> list[FewShotExample]:
        """Select the most relevant examples."""
        candidates = self._examples.get(category, [])
        k = min(k, self._max_examples)

        # Score by similarity and quality
        scored = [
            (ex, self._score(ex, query))
            for ex in candidates
        ]

        # Sort by combined score, best first
        scored.sort(key=lambda x: x[1], reverse=True)

        return [ex for ex, _ in scored[:k]]

    def _score(
        self,
        example: FewShotExample,
        query: str,
    ) -> float:
        """Score an example for relevance to a query."""
        # Combine textual similarity with quality metrics
        similarity = self._text_similarity(example.input_text, query)
        quality = example.quality_score * example.success_rate
        # Less-used examples get a small boost to keep the pool diverse
        novelty_bonus = 1.0 / (1 + example.usage_count * 0.1)
        return similarity * 0.6 + quality * 0.3 + novelty_bonus * 0.1

    def _text_similarity(self, a: str, b: str) -> float:
        """Jaccard similarity over lowercased token sets (simple baseline)."""
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        if not tokens_a or not tokens_b:
            return 0.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
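
A brief usage sketch (the category name and example content are illustrative):

manager = FewShotManager()
manager.add_example("text_to_sql", FewShotExample(
    input_text="How many orders were placed last month?",
    output_text="SELECT COUNT(*) FROM orders WHERE created_at >= NOW() - INTERVAL '1 month'",
    category="text_to_sql",
))

examples = manager.select_examples(
    "text_to_sql",
    query="How many users signed up last week?",
    k=2,
)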

Dynamic Example Updates

The few-shot pool is continuously updated based on feedback:

Source            | Action                            | Impact
------------------|-----------------------------------|----------------------------
Positive rating   | Increase quality_score            | Example used more often
Negative rating   | Decrease quality_score            | Example used less often
User correction   | Add corrected pair as new example | Expand example pool
Execution success | Increase success_rate             | Prefer successful patterns
Execution failure | Decrease success_rate             | Avoid failing patterns
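
A sketch of how a rating could be applied; record_feedback and its step size are illustrative, not the platform's actual API:

def record_feedback(example: FewShotExample, positive: bool) -> None:
    """Nudge an example's quality score from a user rating (illustrative)."""
    delta = 0.05 if positive else -0.05
    example.quality_score = min(1.0, max(0.0, example.quality_score + delta))
    example.usage_count += 1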

Prompt Chains

For complex workflows that require multiple LLM calls, prompt chains coordinate sequential prompt execution:

@dataclass
class PromptChainStep:
    """A single step in a prompt chain."""
    template_id: str
    input_mapping: dict[str, str]      # variable -> source key in context
    output_key: str                    # Key under which the result is stored
    condition: str | None = None       # Optional condition gating execution

class PromptChain:
    """Executes a chain of prompts sequentially."""

    def __init__(self, steps: list[PromptChainStep]):
        self._steps = steps

    async def execute(
        self,
        initial_context: dict[str, Any],
        llm_client: Any,
    ) -> dict[str, Any]:
        """Execute the prompt chain."""
        context = dict(initial_context)

        for step in self._steps:
            # Skip steps whose condition is not met
            if step.condition and not self._evaluate(
                step.condition, context
            ):
                continue

            # Map chain context into template variables
            variables = {
                var: context.get(source, "")
                for var, source in step.input_mapping.items()
            }

            # Render via the shared template registry and call the LLM
            messages = registry.render(step.template_id, variables)
            response = await llm_client.chat(messages)

            # Store the output for downstream steps
            context[step.output_key] = response["content"]

        return context

    def _evaluate(self, condition: str, context: dict[str, Any]) -> bool:
        """Minimal condition check: treat the condition string as a
        context key and test its truthiness (simplified)."""
        return bool(context.get(condition))

Example Chain: Enhanced Text-to-SQL

enhanced_sql_chain = PromptChain([
    PromptChainStep(
        template_id="query_rewrite_v1",
        input_mapping={"question": "user_question"},
        output_key="rewritten_question",
    ),
    PromptChainStep(
        template_id="sql_generation_v3",
        input_mapping={
            "question": "rewritten_question",
            "schema_context": "schema",
            "dialect": "dialect",
        },
        output_key="generated_sql",
    ),
    PromptChainStep(
        template_id="sql_explanation_v1",
        input_mapping={
            "sql": "generated_sql",
            "question": "user_question",
        },
        output_key="explanation",
    ),
])
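
Executing the chain might look like this, where llm_client stands in for whatever chat-completion client the service wires in and schema_text is the serialized schema context:

result = await enhanced_sql_chain.execute(
    initial_context={
        "user_question": "Which products sold best last quarter?",
        "schema": schema_text,
        "dialect": "postgresql",
    },
    llm_client=llm_client,
)
sql = result["generated_sql"]
explanation = result["explanation"]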

A/B Testing

The prompt A/B testing infrastructure in prompt_ab_testing/ enables controlled experiments:

Experiment Configuration

@dataclass
class PromptExperiment:
    """An A/B test experiment for prompts."""
    id: str
    name: str
    template_id: str
    variant_a: PromptVersion      # Control
    variant_b: PromptVersion      # Treatment
    traffic_split: float = 0.5    # Fraction to variant B
    metrics: list[str] = field(default_factory=list)
    status: str = "active"        # active, paused, completed
    start_date: datetime = field(default_factory=datetime.utcnow)
    end_date: datetime | None = None
    min_sample_size: int = 100

Experiment Execution

import hashlib

class PromptABTestManager:
    """Manages prompt A/B testing experiments."""

    def __init__(self):
        self._experiments: dict[str, PromptExperiment] = {}

    async def get_variant(
        self,
        experiment_id: str,
        user_id: str,
    ) -> PromptVersion:
        """Deterministically assign a user to a variant."""
        experiment = self._experiments[experiment_id]

        # Deterministic assignment based on a user_id digest. A stable
        # digest is used because Python's built-in hash() is salted per
        # process and would reshuffle assignments on every restart.
        digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
        if int(digest, 16) % 100 < experiment.traffic_split * 100:
            return experiment.variant_b
        return experiment.variant_a

    async def record_outcome(
        self,
        experiment_id: str,
        variant: str,
        metrics: dict[str, float],
    ) -> None:
        """Record the outcome of a prompt execution."""
        # Track per-variant metrics for comparison
        ...
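
Assuming an experiment with ID "sql_gen_exp_1" has been registered on a manager instance, per-request usage looks roughly like:

variant = await manager.get_variant("sql_gen_exp_1", user_id="user-123")
# ... render and execute the prompt using the assigned variant ...
await manager.record_outcome(
    "sql_gen_exp_1",
    variant=variant.version,
    metrics={"latency_ms": 840.0, "sql_validity": 1.0},
)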

Tracked Metrics

Metric            | Description
------------------|----------------------------------------
accuracy          | Response accuracy (from user feedback)
latency_ms        | End-to-end response time
token_count       | Total tokens used
user_satisfaction | User rating score
sql_validity      | Whether generated SQL is valid
correction_rate   | How often users correct the response

Optimization Strategies

class OptimizationStrategy(str, Enum):
    NONE = "none"
    COMPRESS = "compress"      # Reduce token count
    EXPAND = "expand"          # Add more context
    SIMPLIFY = "simplify"      # Simplify language
    TECHNICAL = "technical"    # Use technical language

Prompt Compression

class PromptOptimizer:
    """Optimizes prompts for token efficiency."""
 
    async def compress(
        self,
        messages: list[dict[str, str]],
        target_reduction: float = 0.3,
    ) -> list[dict[str, str]]:
        """Compress prompt to reduce token count."""
        # Strategies:
        # 1. Remove redundant examples
        # 2. Abbreviate schema descriptions
        # 3. Compress conversation history via summarization
        # 4. Remove low-relevance context
        ...
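
As a simple non-LLM fallback for strategies 3 and 4, the sketch below drops the oldest non-system messages until a rough estimate meets the target; the 4-characters-per-token heuristic is an approximation, not the service's tokenizer:

def estimate_tokens(messages: list[dict[str, str]]) -> int:
    """Rough token estimate: ~4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def trim_history(
    messages: list[dict[str, str]],
    target_reduction: float = 0.3,
) -> list[dict[str, str]]:
    """Drop the oldest non-system messages until the estimated
    token count falls below (1 - target_reduction) of the original."""
    budget = estimate_tokens(messages) * (1 - target_reduction)
    trimmed = list(messages)
    while estimate_tokens(trimmed) > budget:
        idx = next(
            (i for i, m in enumerate(trimmed) if m["role"] != "system"),
            None,
        )
        if idx is None:
            break  # Only system messages left; stop trimming
        trimmed.pop(idx)
    return trimmed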

Token Budget Management

Component            | Token Budget | Priority
---------------------|--------------|----------------------------
System prompt        | 500-1000     | High (always included)
Schema context       | 1000-2000    | High (essential for SQL)
Few-shot examples    | 500-1000     | Medium (improves quality)
Conversation history | 500-2000     | Medium (context dependent)
User preferences     | 100-200      | Low (optional enhancement)
Total budget         | 4096-8192    | Varies by model
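
A sketch of enforcing these budgets when assembling a prompt in priority order; the component keys and the assembly function are illustrative:

COMPONENT_BUDGETS = [
    # (context key, max tokens), in descending priority
    ("system_prompt", 1000),
    ("schema_context", 2000),
    ("few_shot_examples", 1000),
    ("conversation_history", 2000),
    ("user_preferences", 200),
]

def assemble_prompt(parts: dict[str, str], total_budget: int = 8192) -> str:
    """Concatenate components in priority order, truncating each to its
    cap and skipping any that would overflow the total (~4 chars/token)."""
    remaining = total_budget
    chunks: list[str] = []
    for key, cap in COMPONENT_BUDGETS:
        text = parts.get(key, "")[: cap * 4]
        tokens = len(text) // 4
        if not text or tokens > remaining:
            continue
        chunks.append(text)
        remaining -= tokens
    return "\n\n".join(chunks)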

Best Practices

Prompt Development Workflow

  1. Draft: Create initial template with clear instructions
  2. Test: Run against evaluation dataset with known correct answers
  3. Iterate: Refine based on failure analysis
  4. Version: Create versioned snapshot with performance metrics
  5. A/B Test: Run controlled experiment against current production version
  6. Deploy: Promote winning variant to production
  7. Monitor: Track ongoing quality metrics via feedback pipeline

Template Guidelines

Guideline                                     | Rationale
----------------------------------------------|--------------------------------------
Use explicit SQL dialect in system prompt     | Prevents cross-dialect syntax errors
Include 2-3 relevant few-shot examples        | Improves accuracy by 15-20%
Specify output format explicitly              | Reduces post-processing failures
Include negative examples                     | Prevents common mistakes
Separate concerns per template                | Enables independent optimization
Use low temperature (0.1-0.3) for SQL         | SQL requires deterministic output
Use higher temperature (0.5-0.7) for analysis | Analysis benefits from creativity
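
These temperature guidelines can be encoded as per-task generation settings; the mapping below is an illustrative sketch, not the platform's actual configuration:

TASK_SETTINGS = {
    # Deterministic output for SQL generation and correction
    "sql_generation": {"temperature": 0.2, "max_tokens": 1024},
    "sql_correction": {"temperature": 0.1, "max_tokens": 1024},
    # Some creative latitude for analysis and summarization
    "analysis": {"temperature": 0.6, "max_tokens": 2048},
    "conversation_summary": {"temperature": 0.5, "max_tokens": 512},
}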