MATIH Platform is in active MVP development. Documentation reflects current implementation status.
11. Pipelines & Data Engineering
Flink Streaming Jobs
LLM Operations

LLM Operations

The LLM Operations Flink SQL job aggregates LLM API call metrics from the matih.ai.llm-ops Kafka topic into 15-minute tumbling windows. It tracks cache hit rates, token usage, costs, latency, and error rates grouped by provider, model, and tenant.

Source: infrastructure/flink/jobs/llm-operations-agg.sql


Job Configuration

PropertyValue
Source topicmatih.ai.llm-ops
Consumer groupflink-llm-ops-agg
Sink tablepolaris.matih_analytics.llm_operations_metrics
Window typeTumbling, 15 minutes
Watermark delay30 seconds
Group bytenant_id, provider, model

Source Schema

ColumnTypeDescription
tenant_idSTRINGTenant identifier (NOT NULL)
providerSTRINGLLM provider (openai, azure, anthropic)
modelSTRINGModel name (gpt-4o, claude-3-opus)
event_typeSTRINGEvent classification
cache_hitBOOLEANWhether the response was served from cache
tokens_inputINTInput token count
tokens_outputINTOutput token count
latency_msINTAPI call latency in milliseconds
cost_usdDOUBLEEstimated cost in USD
successBOOLEANWhether the call succeeded
timestampTIMESTAMP(3)Event timestamp with watermark

Output Columns

ColumnExpressionDescription
window_startTUMBLE_START(timestamp, 15 MIN)Window start time
window_endTUMBLE_END(timestamp, 15 MIN)Window end time
tenant_idGroup keyTenant identifier
providerCOALESCE(provider, 'unknown')LLM provider
modelCOALESCE(model, 'unknown')Model name
cache_hit_rateCache hits / totalProportion of cached responses
avg_tokens_inputAVG(tokens_input)Average input tokens per call
avg_tokens_outputAVG(tokens_output)Average output tokens per call
total_requestsCOUNT(*)Total LLM API calls
total_cost_usdSUM(cost_usd)Total cost in window
avg_latency_msAVG(latency_ms)Average API latency
error_rateFailed / totalError rate

Cache Hit Rate Calculation

CAST(SUM(CASE WHEN cache_hit = TRUE THEN 1 ELSE 0 END) AS DOUBLE)
    / GREATEST(COUNT(*), 1) AS cache_hit_rate

A high cache hit rate indicates effective semantic caching. Low rates may suggest cache configuration issues or highly varied queries.


Cost Analysis

The aggregated cost data enables cost optimization:

AnalysisQuery Pattern
Cost per tenantGroup by tenant_id, sum total_cost_usd
Cost per modelGroup by model, sum total_cost_usd
Cost trendTime series of total_cost_usd over windows
Cost per requesttotal_cost_usd / total_requests

Error Rate Calculation

CAST(SUM(CASE WHEN success = FALSE THEN 1 ELSE 0 END) AS DOUBLE)
    / GREATEST(COUNT(*), 1) AS error_rate

Common error categories include rate limiting (429), context length exceeded (400), and provider outages (503).


Downstream Consumers

ConsumerPurpose
AI Service dashboard APILLM cost and performance dashboards
Billing serviceTenant cost attribution
Grafana dashboardsOperational monitoring and alerting
Cost optimization engineModel routing recommendations

Related Pages