Span Analysis
Span analysis examines the internal structure of distributed traces to identify performance bottlenecks, error patterns, and optimization opportunities. Each trace consists of a tree of spans representing individual operations across services.
Span Anatomy
Each span contains:
| Field | Description |
|---|---|
| traceId | Unique identifier for the entire trace |
| spanId | Unique identifier for this span |
| parentSpanId | ID of the parent span |
| operationName | Name of the operation (e.g., POST /api/v1/search) |
| serviceName | Service that created the span |
| startTime | When the span started |
| duration | How long the span lasted |
| status | OK, ERROR, or UNSET |
| attributes | Key-value metadata |
| events | Timestamped log entries within the span |
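As a rough illustration of how these fields fit together, the sketch below models a span as a Python dataclass and rebuilds the span tree from `parentSpanId` links. The snake_case field names, the millisecond units, and the helper names `build_children_index` and `render_tree` are assumptions made for this example, not part of any tracing SDK.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Fields mirror the table above; types and units are assumptions.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    operation_name: str
    service_name: str
    start_time_ms: float
    duration_ms: float
    status: str = "UNSET"  # OK, ERROR, or UNSET
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def build_children_index(spans):
    """Map each parent span ID to its child spans, ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_time_ms)
    return children


def render_tree(children, parent_id=None, depth=0):
    """Return the tree as indented lines, starting from the root span."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{span.operation_name} ({span.duration_ms:.0f}ms)")
        lines.extend(render_tree(children, span.span_id, depth + 1))
    return lines


# Hypothetical spans, loosely based on the span types used in this section.
spans = [
    Span("t1", "a", None, "agent.orchestrate", "AI Service", 0, 2500),
    Span("t1", "b", "a", "llm.generate", "AI Service", 100, 1800),
    Span("t1", "c", "a", "sql.execute", "Query Engine", 1950, 200),
]
tree = render_tree(build_children_index(spans))
print("\n".join(tree))
```

Indexing children by parent ID keeps the tree reconstruction linear in the number of spans, which matters for traces with thousands of spans.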
Common Span Types in MATIH
| Span | Service | Description |
|---|---|---|
| HTTP POST /api/v1/chat | API Gateway | Incoming user request |
| agent.orchestrate | AI Service | Agent orchestration |
| llm.generate | AI Service | LLM API call |
| sql.execute | Query Engine | SQL query execution |
| db.query | Any | Database operation |
| kafka.produce | Any | Kafka message production |
| vector.search | AI Service | Vector similarity search |
Identifying Bottlenecks
Sequential Bottlenecks
Look for long spans that block the critical path:
[---- agent.orchestrate (2500ms) ----]
  [-- llm.generate (1800ms) --] <-- Bottleneck: 72% of total
  [-- sql.execute (200ms) --]
  [-- vector.search (50ms) --]
Parallelization Opportunities
Look for sequential spans that could be parallelized:
[-- fetch_schema (100ms) --]
[-- fetch_metadata (80ms) --]
[-- fetch_stats (60ms) --]
These three fetches could run in parallel, reducing total time from 240ms to 100ms.
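Both checks in this section reduce to simple arithmetic over span durations. The sketch below reproduces the numbers from the two diagrams; the helper names `child_shares` and `parallel_saving` are hypothetical, and the parallel estimate assumes the calls are independent and do not contend for a shared resource.

```python
def child_shares(parent_ms, children_ms):
    """Return each child's share of the parent span duration, largest first."""
    shares = {name: ms / parent_ms for name, ms in children_ms.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


def parallel_saving(durations_ms):
    """Compare back-to-back execution with fully concurrent execution."""
    sequential = sum(durations_ms)
    parallel = max(durations_ms)  # bounded below by the slowest call
    return sequential, parallel


shares = child_shares(
    2500, {"llm.generate": 1800, "sql.execute": 200, "vector.search": 50}
)
sequential_ms, parallel_ms = parallel_saving([100, 80, 60])

print(f"{shares[0][0]}: {shares[0][1]:.0%}")     # llm.generate: 72%
print(f"{sequential_ms}ms -> {parallel_ms}ms")  # 240ms -> 100ms
```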
Error Analysis
Error Span Attributes
| Attribute | Description |
|---|---|
| otel.status_code | ERROR for failed spans |
| error.type | Exception class name |
| error.message | Error message |
| error.stack | Stack trace (if configured) |
Error Propagation
Errors typically propagate up the span tree. When a database query span fails and the exception is not handled along the way, the parent service span is also marked as errored, and so on up to the root span.
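Because a single failure can mark a whole chain of spans as errored, it helps to locate the deepest errored span, which is usually closest to the root cause. The following is a minimal sketch under that assumption; spans are plain dictionaries keyed by the field names from the Span Anatomy table, and `error_origins` and `path_to_root` are hypothetical helpers.

```python
def error_origins(spans):
    """Return errored spans that have no errored child (the likely origin)."""
    errored_parents = {s["parentSpanId"] for s in spans if s["status"] == "ERROR"}
    return [
        s for s in spans
        if s["status"] == "ERROR" and s["spanId"] not in errored_parents
    ]


def path_to_root(span, spans):
    """Follow parentSpanId links from a span up to the root span."""
    by_id = {s["spanId"]: s for s in spans}
    path = [span["operationName"]]
    while span["parentSpanId"] in by_id:
        span = by_id[span["parentSpanId"]]
        path.append(span["operationName"])
    return path


# Illustrative trace: a failed db.query marks its ancestors as errored.
trace = [
    {"spanId": "1", "parentSpanId": None, "operationName": "HTTP POST /api/v1/chat", "status": "ERROR"},
    {"spanId": "2", "parentSpanId": "1", "operationName": "sql.execute", "status": "ERROR"},
    {"spanId": "3", "parentSpanId": "2", "operationName": "db.query", "status": "ERROR"},
    {"spanId": "4", "parentSpanId": "1", "operationName": "vector.search", "status": "OK"},
]

origins = error_origins(trace)
propagation = path_to_root(origins[0], trace)
print(" -> ".join(propagation))
```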
Performance Attributes
Key attributes for performance analysis:
| Attribute | Description |
|---|---|
| db.statement | SQL query text (may be truncated) |
| db.rows_affected | Number of rows returned or affected |
| http.status_code | HTTP response status |
| llm.model | LLM model used |
| llm.tokens.input | Input token count |
| llm.tokens.output | Output token count |
| llm.cost_usd | Estimated LLM cost |
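One way to use the LLM attributes is to roll them up per model across a set of spans. The sketch below assumes spans have been exported as dictionaries with an `attributes` mapping; the function name `llm_usage_by_model` and all token and cost values are made up for illustration.

```python
from collections import defaultdict


def llm_usage_by_model(spans):
    """Sum call counts, token counts, and estimated cost per llm.model."""
    totals = defaultdict(
        lambda: {"calls": 0, "input": 0, "output": 0, "cost_usd": 0.0}
    )
    for span in spans:
        attrs = span.get("attributes", {})
        model = attrs.get("llm.model")
        if model is None:
            continue  # not an LLM span
        entry = totals[model]
        entry["calls"] += 1
        entry["input"] += attrs.get("llm.tokens.input", 0)
        entry["output"] += attrs.get("llm.tokens.output", 0)
        entry["cost_usd"] += attrs.get("llm.cost_usd", 0.0)
    return dict(totals)


# Made-up sample data, not measurements.
sample_spans = [
    {"attributes": {"llm.model": "gpt-4", "llm.tokens.input": 1200,
                    "llm.tokens.output": 300, "llm.cost_usd": 0.25}},
    {"attributes": {"llm.model": "gpt-4", "llm.tokens.input": 800,
                    "llm.tokens.output": 200, "llm.cost_usd": 0.5}},
    {"attributes": {"db.statement": "SELECT 1", "db.rows_affected": 1}},
]

usage = llm_usage_by_model(sample_spans)
print(usage["gpt-4"])
```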
Analysis Queries in Grafana
Slow Traces
Find traces that contain an ai-service span longer than 5 seconds:
{duration > 5s && resource.service.name = "ai-service"}
Error Traces
Find traces with errors:
{status = error && resource.service.name = "ai-service"}
Specific Operation
Find traces for a specific operation and model:
{name = "llm.generate" && span.llm.model = "gpt-4"}
Tail Sampling
For production, use tail sampling to capture all error traces, all slow traces, and a small sample of the remaining ones:
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
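The effect of these three policies can be sketched in Python: a trace is kept as soon as any one policy matches. This only mimics the decision logic for illustration and is not the Collector's implementation; the `keep_trace` function, the span dictionary keys, and the CRC32-based bucketing are assumptions of this sketch. The `decision_wait` setting is not modeled here; in the Collector it controls how long spans are buffered before the decision is made.

```python
import zlib


def trace_duration_ms(spans):
    """Duration from the earliest span start to the latest span end."""
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    return end - start


def keep_trace(trace_id, spans, threshold_ms=5000, sampling_percentage=1.0):
    """Return the name of the first matching policy, or None to drop the trace."""
    if any(s["status"] == "ERROR" for s in spans):
        return "errors"
    if trace_duration_ms(spans) > threshold_ms:
        return "slow"
    # Deterministic bucket in [0, 10000) derived from the trace ID, so the
    # same trace always gets the same decision (an assumption of this sketch).
    bucket = zlib.crc32(trace_id.encode()) % 10000
    if bucket < sampling_percentage * 100:
        return "sample"
    return None


# Illustrative traces with made-up timings.
failed = [{"start_ms": 0, "duration_ms": 120, "status": "ERROR"}]
slow = [{"start_ms": 0, "duration_ms": 6000, "status": "OK"}]

print(keep_trace("trace-a", failed))
print(keep_trace("trace-b", slow))
```

Evaluating the cheap, deterministic policies first means the probabilistic policy only applies to traces that are neither errored nor slow.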