Prompt Engineering Patterns
The AI Service includes a comprehensive prompt engineering framework in agents/prompt_engineering.py. It provides template management with versioning, A/B testing, few-shot example curation, prompt chains for multi-step workflows, and token-optimization strategies. This section covers the framework architecture, the template system, the testing infrastructure, and best practices for prompt development within the MATIH platform.
Framework Overview
+-------------------+ +-------------------+ +-------------------+
| Template Manager | | Chain Manager | | A/B Test Manager |
| (versioning, | | (multi-step | | (experiment |
| variables) | | workflows) | | tracking) |
+--------+----------+ +--------+----------+ +--------+----------+
| | |
v v v
+----------------------------------------------------------------+
| Prompt Registry |
| Stores all prompt templates with metadata and versions |
+----------------------------------------------------------------+
| | |
v v v
+-------------------+ +-------------------+ +-------------------+
| Few-Shot Manager | | Optimization | | Metrics Collector |
| (example | | (compression, | | (token usage, |
| selection) | | expansion) | | quality) |
+-------------------+ +-------------------+ +-------------------+
Core Types
PromptMessage
class PromptRole(str, Enum):
SYSTEM = "system"
USER = "user"
ASSISTANT = "assistant"
FUNCTION = "function"
@dataclass
class PromptMessage:
"""A single message in a prompt."""
role: PromptRole
content: str
name: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
Template Formats
class TemplateFormat(str, Enum):
JINJA2 = "jinja2" # Jinja2 templates with {{ variable }}
FSTRING = "fstring" # Python f-string style {variable}
    MUSTACHE = "mustache" # Mustache style {{variable}}
PromptVersion
@dataclass
class PromptVersion:
"""Version information for a prompt template."""
version: str
created_at: datetime
created_by: str
description: str
is_active: bool = True
    performance_metrics: dict[str, float] = field(default_factory=dict)
Template Management
PromptTemplate
@dataclass
class PromptTemplate:
"""A versioned prompt template."""
id: str
name: str
description: str
format: TemplateFormat = TemplateFormat.JINJA2
messages: list[PromptMessage] = field(default_factory=list)
variables: dict[str, str] = field(default_factory=dict) # name -> description
version: PromptVersion | None = None
tags: list[str] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
Template Registry
The template registry manages all prompt templates with version tracking:
class PromptTemplateRegistry:
"""Registry for managing prompt templates."""
def __init__(self):
self._templates: dict[str, PromptTemplate] = {}
self._version_history: dict[str, list[PromptVersion]] = {}
def register(self, template: PromptTemplate) -> None:
"""Register or update a prompt template."""
self._templates[template.id] = template
        # Guard against templates registered without version info
        if template.version is not None:
            self._version_history.setdefault(template.id, []).append(
                template.version
            )
def get(self, template_id: str) -> PromptTemplate | None:
"""Get a template by ID."""
return self._templates.get(template_id)
def render(
self,
template_id: str,
variables: dict[str, Any],
) -> list[dict[str, str]]:
"""Render a template with variables."""
template = self._templates[template_id]
rendered_messages = []
for msg in template.messages:
content = self._render_content(
msg.content,
variables,
template.format,
)
rendered_messages.append({
"role": msg.role.value,
"content": content,
})
        return rendered_messages
Built-in Templates
The AI Service ships with pre-built templates for common tasks:
| Template ID | Purpose | Variables |
|---|---|---|
| sql_generation_v3 | Text-to-SQL generation | schema_context, question, dialect |
| sql_correction_v2 | SQL error correction | sql, error, tables, columns |
| intent_classification_v2 | Intent classification | message, available_intents |
| analysis_prompt_v1 | Statistical analysis | data, analysis_type |
| visualization_selection_v1 | Chart type selection | data_shape, intent |
| conversation_summary_v1 | Conversation summarization | messages |
| query_suggestion_v1 | Query suggestions | tables, user_history |
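As a sketch of what rendering one of these templates looks like, the snippet below re-implements the registry's Jinja2-style `{{ variable }}` substitution in miniature. The template body here is illustrative, not the actual sql_generation_v3 prompt from agents/prompt_engineering.py:

```python
import re

# Illustrative message list in the shape produced by PromptTemplate.messages
TEMPLATE = [
    {"role": "system",
     "content": "You are a {{ dialect }} SQL expert. Schema:\n{{ schema_context }}"},
    {"role": "user", "content": "{{ question }}"},
]

def render(messages, variables):
    """Substitute {{ var }} placeholders (Jinja2-style) with variable values."""
    def sub(content):
        return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                      lambda m: str(variables.get(m.group(1), "")), content)
    return [{"role": m["role"], "content": sub(m["content"])} for m in messages]

rendered = render(TEMPLATE, {
    "dialect": "postgresql",
    "schema_context": "orders(id, total, created_at)",
    "question": "Total revenue last month?",
})
```

The real registry additionally dispatches on TemplateFormat, so fstring and mustache templates follow the same render path with a different substitution rule.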
Few-Shot Example Management
FewShotExample
@dataclass
class FewShotExample:
"""A few-shot example for prompt enhancement."""
input_text: str
output_text: str
explanation: str | None = None
category: str | None = None
quality_score: float = 1.0
usage_count: int = 0
    success_rate: float = 1.0
Example Selection
The few-shot manager selects the most relevant examples for each query:
class FewShotManager:
"""Manages few-shot examples for prompts."""
def __init__(self, max_examples: int = 5):
self._examples: dict[str, list[FewShotExample]] = {}
self._max_examples = max_examples
def add_example(
self,
category: str,
example: FewShotExample,
) -> None:
"""Add a few-shot example to a category."""
self._examples.setdefault(category, []).append(example)
def select_examples(
self,
category: str,
query: str,
k: int = 3,
) -> list[FewShotExample]:
"""Select the most relevant examples."""
candidates = self._examples.get(category, [])
# Score by similarity and quality
scored = [
(ex, self._score(ex, query))
for ex in candidates
]
# Sort by combined score
scored.sort(key=lambda x: x[1], reverse=True)
return [ex for ex, _ in scored[:k]]
def _score(
self,
example: FewShotExample,
query: str,
) -> float:
"""Score an example for relevance to a query."""
        # Combine textual similarity with quality metrics
        similarity = self._text_similarity(example.input_text, query)
        quality = example.quality_score * example.success_rate
        # Favor less-used examples to keep the selected set diverse
        novelty_bonus = 1.0 / (1 + example.usage_count * 0.1)
        return similarity * 0.6 + quality * 0.3 + novelty_bonus * 0.1
Dynamic Example Updates
The few-shot pool is continuously updated based on feedback:
| Source | Action | Impact |
|---|---|---|
| Positive rating | Increase quality_score | Example used more often |
| Negative rating | Decrease quality_score | Example used less often |
| User correction | Add corrected pair as new example | Expand example pool |
| Execution success | Increase success_rate | Prefer successful patterns |
| Execution failure | Decrease success_rate | Avoid failing patterns |
Prompt Chains
For complex workflows that require multiple LLM calls, prompt chains coordinate sequential prompt execution:
@dataclass
class PromptChainStep:
"""A single step in a prompt chain."""
template_id: str
input_mapping: dict[str, str] # variable -> source
output_key: str # Key to store result
condition: str | None = None # Optional condition to execute
class PromptChain:
"""Executes a chain of prompts sequentially."""
def __init__(self, steps: list[PromptChainStep]):
self._steps = steps
async def execute(
self,
initial_context: dict[str, Any],
llm_client: Any,
) -> dict[str, Any]:
"""Execute the prompt chain."""
context = dict(initial_context)
for step in self._steps:
# Check condition
if step.condition and not self._evaluate(
step.condition, context
):
continue
# Map inputs
variables = {
var: context.get(source, "")
for var, source in step.input_mapping.items()
}
# Render and execute
            # Render via the shared PromptTemplateRegistry instance
            messages = registry.render(step.template_id, variables)
response = await llm_client.chat(messages)
# Store output
context[step.output_key] = response["content"]
        return context
Example Chain: Enhanced Text-to-SQL
enhanced_sql_chain = PromptChain([
PromptChainStep(
template_id="query_rewrite_v1",
input_mapping={"question": "user_question"},
output_key="rewritten_question",
),
PromptChainStep(
template_id="sql_generation_v3",
input_mapping={
"question": "rewritten_question",
"schema_context": "schema",
"dialect": "dialect",
},
output_key="generated_sql",
),
PromptChainStep(
template_id="sql_explanation_v1",
input_mapping={
"sql": "generated_sql",
"question": "user_question",
},
output_key="explanation",
),
])
A/B Testing
The prompt A/B testing infrastructure in prompt_ab_testing/ enables controlled experiments:
Experiment Configuration
@dataclass
class PromptExperiment:
"""An A/B test experiment for prompts."""
id: str
name: str
template_id: str
variant_a: PromptVersion # Control
variant_b: PromptVersion # Treatment
traffic_split: float = 0.5 # Fraction to variant B
metrics: list[str] = field(default_factory=list)
status: str = "active" # active, paused, completed
start_date: datetime = field(default_factory=datetime.utcnow)
end_date: datetime | None = None
    min_sample_size: int = 100
Experiment Execution
class PromptABTestManager:
"""Manages prompt A/B testing experiments."""
async def get_variant(
self,
experiment_id: str,
user_id: str,
) -> PromptVersion:
"""Deterministically assign a user to a variant."""
experiment = self._experiments[experiment_id]
        # Deterministic assignment based on a stable hash of user_id.
        # Use hashlib rather than the built-in hash(), whose output
        # varies between processes (PYTHONHASHSEED).
        import hashlib
        digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        if hash_val < experiment.traffic_split * 100:
            return experiment.variant_b
        return experiment.variant_a
async def record_outcome(
self,
experiment_id: str,
variant: str,
metrics: dict[str, float],
) -> None:
"""Record the outcome of a prompt execution."""
# Track per-variant metrics for comparison
        ...
Tracked Metrics
| Metric | Description |
|---|---|
| accuracy | Response accuracy (from user feedback) |
| latency_ms | End-to-end response time |
| token_count | Total tokens used |
| user_satisfaction | User rating score |
| sql_validity | Whether generated SQL is valid |
| correction_rate | How often users correct the response |
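The deterministic assignment behind get_variant can be sketched as a standalone function. It uses hashlib for a hash that is stable across processes, unlike Python's built-in hash():

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str,
                   traffic_split: float = 0.5) -> str:
    """Deterministically bucket a user into variant 'A' or 'B'."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "B" if bucket < traffic_split * 100 else "A"

# The same user always lands in the same bucket for a given experiment
assert assign_variant("exp1", "user42") == assign_variant("exp1", "user42")

# Across many users the split approximates the configured ratio
variants = [assign_variant("exp1", f"user{i}") for i in range(10_000)]
share_b = variants.count("B") / len(variants)
```

Sticky assignment matters here: if a user flipped between variants mid-experiment, per-variant metrics such as correction_rate would be contaminated by crossover effects.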
Optimization Strategies
class OptimizationStrategy(str, Enum):
NONE = "none"
COMPRESS = "compress" # Reduce token count
EXPAND = "expand" # Add more context
SIMPLIFY = "simplify" # Simplify language
    TECHNICAL = "technical" # Use technical language
Prompt Compression
class PromptOptimizer:
"""Optimizes prompts for token efficiency."""
async def compress(
self,
messages: list[dict[str, str]],
target_reduction: float = 0.3,
) -> list[dict[str, str]]:
"""Compress prompt to reduce token count."""
# Strategies:
# 1. Remove redundant examples
# 2. Abbreviate schema descriptions
# 3. Compress conversation history via summarization
# 4. Remove low-relevance context
        ...
Token Budget Management
| Component | Token Budget | Priority |
|---|---|---|
| System prompt | 500-1000 | High (always included) |
| Schema context | 1000-2000 | High (essential for SQL) |
| Few-shot examples | 500-1000 | Medium (improves quality) |
| Conversation history | 500-2000 | Medium (context dependent) |
| User preferences | 100-200 | Low (optional enhancement) |
| Total budget | 4096-8192 | Varies by model |
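A minimal sketch of enforcing this table: fill components in priority order until the budget runs out. The component names follow the table; the greedy strategy, the `allocate` helper, and the 4-characters-per-token estimate are assumptions for illustration:

```python
# (component, max_tokens, priority: lower number = higher priority)
BUDGETS = [
    ("system_prompt", 1000, 0),
    ("schema_context", 2000, 0),
    ("few_shot_examples", 1000, 1),
    ("conversation_history", 2000, 1),
    ("user_preferences", 200, 2),
]

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def allocate(components: dict[str, str], total_budget: int = 4096) -> dict[str, str]:
    """Keep components in priority order until the budget is exhausted."""
    result: dict[str, str] = {}
    remaining = total_budget
    for name, cap, _prio in sorted(BUDGETS, key=lambda b: b[2]):
        text = components.get(name, "")
        if not text:
            continue
        allowed = min(cap, remaining)
        if allowed <= 0:
            break
        kept = text[: allowed * 4]  # truncate to the allowed token estimate
        result[name] = kept
        remaining -= estimate_tokens(kept)
    return result

out = allocate({"system_prompt": "x" * 8000, "user_preferences": "y" * 100})
```

A production PromptOptimizer would summarize or compress rather than truncate, but the priority ordering is the same: low-priority components absorb the cuts first.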
Best Practices
Prompt Development Workflow
- Draft: Create initial template with clear instructions
- Test: Run against evaluation dataset with known correct answers
- Iterate: Refine based on failure analysis
- Version: Create versioned snapshot with performance metrics
- A/B Test: Run controlled experiment against current production version
- Deploy: Promote winning variant to production
- Monitor: Track ongoing quality metrics via feedback pipeline
Template Guidelines
| Guideline | Rationale |
|---|---|
| Use explicit SQL dialect in system prompt | Prevents cross-dialect syntax errors |
| Include 2-3 relevant few-shot examples | Improves accuracy by 15-20% |
| Specify output format explicitly | Reduces post-processing failures |
| Include negative examples | Prevents common mistakes |
| Separate concerns per template | Enables independent optimization |
| Use low temperature (0.1-0.3) for SQL | SQL requires deterministic output |
| Use higher temperature (0.5-0.7) for analysis | Analysis benefits from creativity |
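The temperature guidance in the last two rows might be captured as a per-task configuration map. The temperatures follow the table; the task names, max_tokens values, and `settings_for` helper are illustrative assumptions:

```python
# Per-task generation settings following the guidelines above.
TASK_SETTINGS: dict[str, dict[str, float]] = {
    "sql_generation": {"temperature": 0.2, "max_tokens": 512},
    "sql_correction": {"temperature": 0.1, "max_tokens": 512},
    "analysis": {"temperature": 0.6, "max_tokens": 1024},
    "conversation_summary": {"temperature": 0.5, "max_tokens": 256},
}

def settings_for(task: str) -> dict[str, float]:
    """Fall back to conservative SQL-style settings for unknown tasks."""
    return TASK_SETTINGS.get(task, {"temperature": 0.2, "max_tokens": 512})
```

Keeping these settings alongside the template registry, rather than hard-coded at call sites, lets an A/B experiment vary temperature as well as prompt wording.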