Span Analysis

Span analysis examines the internal structure of distributed traces to identify performance bottlenecks, error patterns, and optimization opportunities. Each trace consists of a tree of spans representing individual operations across services.

Span Anatomy

Each span contains:

Field	Description
`traceId`	Unique identifier for the entire trace
`spanId`	Unique identifier for this span
`parentSpanId`	ID of the parent span (empty for the root span)
`operationName`	Name of the operation (e.g., `POST /api/v1/search`)
`serviceName`	Service that created the span
`startTime`	When the span started
`duration`	How long the span lasted
`status`	OK, ERROR, or UNSET
`attributes`	Key-value metadata
`events`	Timestamped log entries within the span
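
The fields above are enough to rebuild the span tree of a trace. The following is a minimal sketch, not MATIH's actual data model: the `Span` class, the `build_tree` helper, the millisecond units, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Field names mirror the table above; units are assumed to be milliseconds.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    operation_name: str
    service_name: str
    start_time_ms: int
    duration_ms: int
    status: str = "UNSET"  # OK, ERROR, or UNSET
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_tree(spans):
    """Link each span to its parent and return the root span(s)."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_span_id) if s.parent_span_id else None
        if parent is None:
            roots.append(s)  # root span, or parent missing from this batch
        else:
            parent.children.append(s)
    return roots


# Illustrative spans (made-up IDs and timings).
spans = [
    Span("t1", "a", None, "HTTP POST /api/v1/chat", "api-gateway", 0, 2600),
    Span("t1", "b", "a", "agent.orchestrate", "ai-service", 50, 2500),
    Span("t1", "c", "b", "llm.generate", "ai-service", 100, 1800),
]
roots = build_tree(spans)
print(roots[0].operation_name, "->", [c.operation_name for c in roots[0].children])
```

Spans whose parent is missing from the batch are treated as roots here, which keeps partially collected traces usable.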

Common Span Types in MATIH

Span	Service	Description
`HTTP POST /api/v1/chat`	API Gateway	Incoming user request
`agent.orchestrate`	AI Service	Agent orchestration
`llm.generate`	AI Service	LLM API call
`sql.execute`	Query Engine	SQL query execution
`db.query`	Any	Database operation
`kafka.produce`	Any	Kafka message production
`vector.search`	AI Service	Vector similarity search

Identifying Bottlenecks

Sequential Bottlenecks

Look for long spans that block the critical path:

[---- agent.orchestrate (2500ms) ----]
  [-- llm.generate (1800ms) --]       <-- Bottleneck: 72% of total
  [-- sql.execute (200ms) --]
  [-- vector.search (50ms) --]
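
The percentage in the diagram can be reproduced by dividing each child span's duration by the parent's. This is a small sketch using the durations shown above; the `rank_children` helper is hypothetical, and the self-time calculation assumes the children do not overlap.

```python
# (operation name, duration in ms) for the parent and its direct children,
# taken from the example diagram.
parent = ("agent.orchestrate", 2500)
children = [
    ("llm.generate", 1800),
    ("sql.execute", 200),
    ("vector.search", 50),
]


def rank_children(parent_ms, child_spans):
    """Return (name, duration, share of parent), longest child first."""
    ranked = sorted(child_spans, key=lambda c: c[1], reverse=True)
    return [(name, ms, ms / parent_ms) for name, ms in ranked]


ranked = rank_children(parent[1], children)
for name, ms, share in ranked:
    print(f"{name:15s} {ms:5d} ms  {share:.0%}")

# Time spent in the parent itself, assuming the children ran one after
# another without overlapping.
self_time_ms = parent[1] - sum(ms for _, ms in children)
print("self time:", self_time_ms, "ms")
```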

Parallelization Opportunities

Look for sequential spans that could be parallelized:

[-- fetch_schema (100ms) --]
                           [-- fetch_metadata (80ms) --]
                                                       [-- fetch_stats (60ms) --]

If these three fetches do not depend on each other, they could run in parallel, reducing total time from 240ms (100 + 80 + 60) to roughly 100ms, the duration of the slowest fetch.
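
As a rough illustration of that arithmetic, the sketch below compares the sequential and parallel totals and runs three simulated fetches concurrently with `asyncio.gather`. The `fake_fetch` function is a stand-in that only sleeps; real fetches would need to be independent and non-blocking for this to apply.

```python
import asyncio

# Durations (ms) from the example; the keys are the span names.
FETCHES = {"fetch_schema": 100, "fetch_metadata": 80, "fetch_stats": 60}

sequential_ms = sum(FETCHES.values())  # one after another
parallel_ms = max(FETCHES.values())    # bounded by the slowest fetch


async def fake_fetch(name, ms):
    """Stand-in for a real I/O call; sleeps to simulate latency."""
    await asyncio.sleep(ms / 1000)
    return name


async def fetch_all():
    # gather() schedules all three coroutines concurrently and returns
    # their results in the order they were passed in.
    return await asyncio.gather(*(fake_fetch(n, ms) for n, ms in FETCHES.items()))


results = asyncio.run(fetch_all())
print(sequential_ms, parallel_ms, results)
```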

Error Analysis

Error Span Attributes

Attribute	Description
`otel.status_code`	`ERROR` for failed spans
`error.type`	Exception class name
`error.message`	Error message
`error.stack`	Stack trace (if configured)

Error Propagation

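Errors usually propagate up the span tree as the failure bubbles through instrumented callers. A failed database query span typically causes the parent service span to be marked as errored, and so on up to the root span, unless a caller handles the failure and recovers.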
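
Because many spans in a failed trace can carry an error status, it helps to find where the error started. One possible heuristic, sketched below with made-up span data, is to pick errored spans that have no errored child; the `error_origins` helper and the dictionary layout are assumptions for illustration.

```python
# Each span: id, parent id, operation name, status. Values are illustrative.
spans = [
    {"id": "a", "parent": None, "name": "HTTP POST /api/v1/chat", "status": "ERROR"},
    {"id": "b", "parent": "a", "name": "agent.orchestrate", "status": "ERROR"},
    {"id": "c", "parent": "b", "name": "llm.generate", "status": "OK"},
    {"id": "d", "parent": "b", "name": "sql.execute", "status": "ERROR"},
    {"id": "e", "parent": "d", "name": "db.query", "status": "ERROR"},
]


def error_origins(spans):
    """Return errored spans that have no errored child (likely origins)."""
    errored = [s for s in spans if s["status"] == "ERROR"]
    parents_of_errors = {s["parent"] for s in errored}
    return [s for s in errored if s["id"] not in parents_of_errors]


origins = error_origins(spans)
print([s["name"] for s in origins])
```

In this example the only errored span without an errored child is `db.query`, so it is the most likely starting point for investigation.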

Performance Attributes

Key attributes for performance analysis:

Attribute	Description
`db.statement`	SQL query text (may be truncated)
`db.rows_affected`	Number of rows returned or modified by the statement
`http.status_code`	HTTP response status
`llm.model`	LLM model used
`llm.tokens.input`	Input token count
`llm.tokens.output`	Output token count
`llm.cost_usd`	Estimated LLM cost
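
These attributes can be aggregated across spans to see where tokens and cost accumulate. The sketch below sums the LLM attributes per model; the span dictionaries, the numeric values, and the `llm_usage_by_model` helper are hypothetical, and only the attribute keys come from the table above.

```python
from collections import defaultdict

# Illustrative spans; attribute values are made up.
spans = [
    {"name": "llm.generate", "attributes": {
        "llm.model": "gpt-4", "llm.tokens.input": 1200,
        "llm.tokens.output": 300, "llm.cost_usd": 0.054}},
    {"name": "llm.generate", "attributes": {
        "llm.model": "gpt-4", "llm.tokens.input": 800,
        "llm.tokens.output": 150, "llm.cost_usd": 0.033}},
    {"name": "sql.execute", "attributes": {
        "db.statement": "SELECT 1", "db.rows_affected": 1}},
]


def llm_usage_by_model(spans):
    """Sum call count, tokens, and estimated cost per LLM model."""
    totals = defaultdict(lambda: {"calls": 0, "input": 0, "output": 0, "cost_usd": 0.0})
    for span in spans:
        attrs = span.get("attributes", {})
        model = attrs.get("llm.model")
        if model is None:
            continue  # not an LLM span
        t = totals[model]
        t["calls"] += 1
        t["input"] += attrs.get("llm.tokens.input", 0)
        t["output"] += attrs.get("llm.tokens.output", 0)
        t["cost_usd"] += attrs.get("llm.cost_usd", 0.0)
    return dict(totals)


usage = llm_usage_by_model(spans)
print(usage)
```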

Analysis Queries in Grafana

Slow Traces

Find traces that contain an `ai-service` span longer than 5 seconds (in TraceQL, `duration` refers to the span's duration):

{duration > 5s && resource.service.name = "ai-service"}

Error Traces

Find traces with errors:

{status = error && resource.service.name = "ai-service"}

Specific Operation

Find traces for a specific operation:

{name = "llm.generate" && span.llm.model = "gpt-4"}
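
These queries can also be run outside the Grafana UI. The sketch below only builds a search URL and sends nothing; it assumes the traces are stored in Grafana Tempo and reachable through its HTTP search endpoint (`/api/search` with a `q` parameter). The base URL is a hypothetical placeholder, and `traceql_search_url` is an illustrative helper, not part of any library.

```python
from urllib.parse import urlencode

# Hypothetical Tempo address; replace with the real one for your deployment.
TEMPO_URL = "http://tempo.example.internal:3200"


def traceql_search_url(query, limit=20, base_url=TEMPO_URL):
    """Build a Tempo search URL for a TraceQL query (no request is sent)."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url}/api/search?{params}"


slow = '{duration > 5s && resource.service.name = "ai-service"}'
print(traceql_search_url(slow))
```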

Tail Sampling

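For production, use tail sampling to capture all error traces, all slow traces, and a small sample of everything else. The configuration below keeps every trace with an error status, every trace that exceeds the 5-second latency threshold, and a 1% probabilistic sample of all traces: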

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
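
The effect of these three policies can be approximated in a few lines. This is a simplified model of the decision, not the collector's implementation: it uses a random draw for the probabilistic policy and does not model exact threshold boundary behavior. The `keep_trace` function and the span dictionary layout are assumptions for illustration.

```python
import random


def keep_trace(spans, threshold_ms=5000, sampling_percentage=1.0, rng=random.random):
    """Return the name of the first matching policy, or None to drop the trace.

    spans: list of dicts with "start_ms", "end_ms", and "status" keys.
    A trace is kept if any policy matches, so the check order does not
    change the keep/drop outcome.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return "errors"
    # Trace latency: earliest span start to latest span end.
    trace_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if trace_ms > threshold_ms:
        return "slow"
    if rng() * 100 < sampling_percentage:
        return "sample"
    return None


failed = [{"start_ms": 0, "end_ms": 120, "status": "ERROR"}]
slow = [{"start_ms": 0, "end_ms": 6000, "status": "OK"}]
fast = [{"start_ms": 0, "end_ms": 200, "status": "OK"}]
print(keep_trace(failed), keep_trace(slow), keep_trace(fast))
```

The decision needs the whole trace, which is why the collector waits (`decision_wait: 30s`) before deciding.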

Trace Correlation Logging