Agent Performance
The Agent Performance Flink SQL job aggregates agent execution traces from the matih.ai.agent-traces Kafka topic into 5-minute tumbling windows. It computes success rates, latency statistics, token usage, cost, and error rates per agent and tenant.
Source: infrastructure/flink/jobs/agent-performance-agg.sql
Job Configuration
| Property | Value |
|---|---|
| Source topic | matih.ai.agent-traces |
| Consumer group | flink-agent-perf-agg |
| Sink table | polaris.matih_analytics.agent_performance_metrics |
| Window type | Tumbling, 5 minutes |
| Watermark delay | 30 seconds |
| Filter | action IN ('completed', 'failed') |
Source Schema
| Column | Type | Description |
|---|---|---|
| event_id | STRING | Unique event identifier |
| event_type | STRING | Event classification |
| tenant_id | STRING | Tenant identifier (NOT NULL) |
| trace_id | STRING | Distributed trace ID |
| agent_id | STRING | Agent identifier (NOT NULL) |
| action | STRING | Action type (completed, failed) |
| session_id | STRING | Parent session ID |
| latency_ms | INT | Execution latency in milliseconds |
| tokens_input | INT | Input tokens consumed |
| tokens_output | INT | Output tokens generated |
| cost_usd | DOUBLE | Estimated cost in USD |
| tools_used | ARRAY&lt;STRING&gt; | Tools invoked during execution |
| error_message | STRING | Error message (NULL on success) |
| timestamp | TIMESTAMP(3) | Event timestamp with watermark |
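For reference, a Flink SQL source definition consistent with this schema might look as follows. Only the topic, consumer group, and 30-second watermark delay come from this page; the remaining connector options (bootstrap servers, format, startup mode) are illustrative assumptions.

```sql
-- Illustrative source DDL; options marked "assumption" are not from this page.
CREATE TABLE agent_traces (
  event_id STRING,
  event_type STRING,
  tenant_id STRING NOT NULL,
  trace_id STRING,
  agent_id STRING NOT NULL,
  action STRING,
  session_id STRING,
  latency_ms INT,
  tokens_input INT,
  tokens_output INT,
  cost_usd DOUBLE,
  tools_used ARRAY<STRING>,
  error_message STRING,
  `timestamp` TIMESTAMP(3),
  -- Event time with the 30-second watermark delay from the job configuration
  WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'matih.ai.agent-traces',
  'properties.group.id' = 'flink-agent-perf-agg',
  'properties.bootstrap.servers' = 'kafka:9092',  -- assumption
  'format' = 'json',                              -- assumption
  'scan.startup.mode' = 'group-offsets'           -- assumption
);
```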
Output Columns
| Column | Expression | Description |
|---|---|---|
| window_start | TUMBLE_START(`timestamp`, INTERVAL '5' MINUTE) | Window start time |
| window_end | TUMBLE_END(`timestamp`, INTERVAL '5' MINUTE) | Window end time |
| tenant_id | Group key | Tenant identifier |
| agent_id | Group key | Agent identifier |
| agent_name | NULL | Requires enrichment from agent_definitions |
| success_rate | Completed without error / total | Agent success rate |
| avg_latency_ms | AVG(latency_ms) | Average execution latency (ms) |
| p50_latency_ms | NULL | Requires UDF (use Trino for ad-hoc) |
| p95_latency_ms | NULL | Requires UDF (use Trino for ad-hoc) |
| p99_latency_ms | NULL | Requires UDF (use Trino for ad-hoc) |
| total_traces | COUNT(*) | Total trace events |
| total_tokens | SUM(tokens_input + tokens_output) | Total tokens consumed |
| total_cost_usd | SUM(cost_usd) | Total cost in USD |
| error_rate | Errors / total | Error rate |
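Putting these expressions together, the core of the aggregation might be sketched as below. Table and column names are taken from this page; the exact statement lives in agent-performance-agg.sql, and the error_rate predicate shown here (action = 'failed' or a non-NULL error_message) is an assumption mirroring the success_rate expression.

```sql
-- Sketch of the windowed aggregation; see agent-performance-agg.sql for the real job.
INSERT INTO polaris.matih_analytics.agent_performance_metrics
SELECT
  TUMBLE_START(`timestamp`, INTERVAL '5' MINUTE) AS window_start,
  TUMBLE_END(`timestamp`, INTERVAL '5' MINUTE)   AS window_end,
  tenant_id,
  agent_id,
  CAST(NULL AS STRING) AS agent_name,            -- enriched downstream
  CAST(SUM(CASE WHEN action = 'completed' AND error_message IS NULL
                THEN 1 ELSE 0 END) AS DOUBLE)
    / GREATEST(COUNT(*), 1)                      AS success_rate,
  AVG(latency_ms)                                AS avg_latency_ms,
  CAST(NULL AS INT)                              AS p50_latency_ms, -- requires UDF
  CAST(NULL AS INT)                              AS p95_latency_ms, -- requires UDF
  CAST(NULL AS INT)                              AS p99_latency_ms, -- requires UDF
  COUNT(*)                                       AS total_traces,
  SUM(tokens_input + tokens_output)              AS total_tokens,
  SUM(cost_usd)                                  AS total_cost_usd,
  CAST(SUM(CASE WHEN action = 'failed' OR error_message IS NOT NULL
                THEN 1 ELSE 0 END) AS DOUBLE)
    / GREATEST(COUNT(*), 1)                      AS error_rate
FROM agent_traces
WHERE action IN ('completed', 'failed')
GROUP BY
  TUMBLE(`timestamp`, INTERVAL '5' MINUTE),
  tenant_id,
  agent_id;
```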
Known Limitations
Percentile Latencies
Percentile latencies (p50, p95, p99) require the PERCENTILE_APPROX() function which needs a registered UDF. These columns are set to NULL in the streaming job. Use Trino or StarRocks for ad-hoc percentile queries over the Iceberg table:
```sql
-- Trino query for p95 latency
SELECT
  agent_id,
  approx_percentile(avg_latency_ms, 0.95) AS p95_latency
FROM matih_analytics.agent_performance_metrics
WHERE window_start >= TIMESTAMP '2026-02-12 00:00:00'
GROUP BY agent_id
```
Note that this computes a percentile over the per-window averages rather than over raw trace latencies, so treat the result as an approximation.
Agent Name Enrichment
The agent_name field requires a JOIN with the agent_definitions table and is not available in the trace event payload. Downstream queries should JOIN on agent_id.
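For example, a downstream Trino query might resolve agent_name like this. The agent_definitions table is referenced on this page, but its column names (agent_id, name) are assumptions.

```sql
-- Illustrative enrichment JOIN; the agent_definitions schema is assumed.
SELECT
  m.window_start,
  m.agent_id,
  d.name AS agent_name,
  m.success_rate
FROM matih_analytics.agent_performance_metrics m
LEFT JOIN matih_analytics.agent_definitions d
  ON m.agent_id = d.agent_id;
```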
Success Rate Calculation
```sql
CAST(SUM(CASE WHEN action = 'completed' AND error_message IS NULL THEN 1 ELSE 0 END) AS DOUBLE)
  / GREATEST(COUNT(*), 1) AS success_rate
```
A trace counts as successful only when it reports action = 'completed' and carries no error_message; the GREATEST(COUNT(*), 1) guard prevents division by zero.
Related Pages
- Session Analytics -- Session-level metrics
- LLM Operations -- LLM-specific metrics
- Flink Overview -- Flink architecture