MATIH Platform is in active MVP development. Documentation reflects current implementation status.
11. Pipelines & Data Engineering
Flink Streaming Jobs
Agent Performance

Agent Performance

The Agent Performance Flink SQL job aggregates agent execution traces from the matih.ai.agent-traces Kafka topic into 5-minute tumbling windows. It computes success rates, latency statistics, token usage, cost, and error rates per agent and tenant.

Source: infrastructure/flink/jobs/agent-performance-agg.sql


Job Configuration

PropertyValue
Source topicmatih.ai.agent-traces
Consumer groupflink-agent-perf-agg
Sink tablepolaris.matih_analytics.agent_performance_metrics
Window typeTumbling, 5 minutes
Watermark delay30 seconds
Filteraction IN ('completed', 'failed')

Source Schema

ColumnTypeDescription
event_idSTRINGUnique event identifier
event_typeSTRINGEvent classification
tenant_idSTRINGTenant identifier (NOT NULL)
trace_idSTRINGDistributed trace ID
agent_idSTRINGAgent identifier (NOT NULL)
actionSTRINGAction type (completed, failed)
session_idSTRINGParent session ID
latency_msINTExecution latency in milliseconds
tokens_inputINTInput tokens consumed
tokens_outputINTOutput tokens generated
cost_usdDOUBLEEstimated cost in USD
tools_usedARRAY of STRINGTools invoked during execution
error_messageSTRINGError message (NULL on success)
timestampTIMESTAMP(3)Event timestamp with watermark

Output Columns

ColumnExpressionDescription
window_startTUMBLE_START(timestamp, 5 MIN)Window start time
window_endTUMBLE_END(timestamp, 5 MIN)Window end time
tenant_idGroup keyTenant identifier
agent_idGroup keyAgent identifier
agent_nameNULLRequires enrichment from agent_definitions
success_rateCompleted without error / totalAgent success rate
avg_latency_msAVG(latency_ms)Average execution latency
p50_latency_msNULLRequires UDF (use Trino for ad-hoc)
p95_latency_msNULLRequires UDF (use Trino for ad-hoc)
p99_latency_msNULLRequires UDF (use Trino for ad-hoc)
total_tracesCOUNT(*)Total trace events
total_tokensSUM(tokens_input + tokens_output)Total tokens consumed
total_cost_usdSUM(cost_usd)Total cost in USD
error_rateErrors / totalError rate

Known Limitations

Percentile Latencies

Percentile latencies (p50, p95, p99) require the PERCENTILE_APPROX() function which needs a registered UDF. These columns are set to NULL in the streaming job. Use Trino or StarRocks for ad-hoc percentile queries over the Iceberg table:

-- Trino query for p95 latency
SELECT
    agent_id,
    approx_percentile(avg_latency_ms, 0.95) AS p95_latency
FROM matih_analytics.agent_performance_metrics
WHERE window_start >= TIMESTAMP '2026-02-12 00:00:00'
GROUP BY agent_id

Agent Name Enrichment

The agent_name field requires a JOIN with the agent_definitions table and is not available in the trace event payload. Downstream queries should JOIN on agent_id.


Success Rate Calculation

CAST(SUM(CASE WHEN action = 'completed' AND error_message IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
    / GREATEST(COUNT(*), 1) AS success_rate

Related Pages