MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Span Analysis

Span analysis examines the internal structure of distributed traces to identify performance bottlenecks, error patterns, and optimization opportunities. Each trace consists of a tree of spans representing individual operations across services.


Span Anatomy

Each span contains:

| Field | Description |
| --- | --- |
| traceId | Unique identifier for the entire trace |
| spanId | Unique identifier for this span |
| parentSpanId | ID of the parent span |
| operationName | Name of the operation (e.g., POST /api/v1/search) |
| serviceName | Service that created the span |
| startTime | When the span started |
| duration | How long the span lasted |
| status | OK, ERROR, or UNSET |
| attributes | Key-value metadata |
| events | Timestamped log entries within the span |
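
To make the tree structure concrete, the fields above can be modeled as a small record type and linked through parentSpanId. This is a minimal sketch, not MATIH's actual data model: the `Span` class, the `build_tree` helper, the millisecond units, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record mirroring the fields listed above."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    operation_name: str
    service_name: str
    start_time_ms: int  # assumed unit: milliseconds since trace start
    duration_ms: int
    status: str = "UNSET"  # OK, ERROR, or UNSET
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def build_tree(spans):
    """Return (root, children), where children maps span_id to child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children[span.parent_span_id].append(span)
    return root, children


# Hypothetical three-span trace using operation names from this page.
spans = [
    Span("t1", "a", None, "HTTP POST /api/v1/chat", "API Gateway", 0, 2600),
    Span("t1", "b", "a", "agent.orchestrate", "AI Service", 50, 2500),
    Span("t1", "c", "b", "llm.generate", "AI Service", 100, 1800),
]
root, children = build_tree(spans)
child_ops = [s.operation_name for s in children[root.span_id]]
print(root.operation_name, "->", child_ops)
```

Indexing children by parent ID keeps tree construction to a single pass over the spans, which is usually enough for the per-trace analyses described in the rest of this page.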

Common Span Types in MATIH

| Span | Service | Description |
| --- | --- | --- |
| HTTP POST /api/v1/chat | API Gateway | Incoming user request |
| agent.orchestrate | AI Service | Agent orchestration |
| llm.generate | AI Service | LLM API call |
| sql.execute | Query Engine | SQL query execution |
| db.query | Any | Database operation |
| kafka.produce | Any | Kafka message production |
| vector.search | AI Service | Vector similarity search |

Identifying Bottlenecks

Sequential Bottlenecks

Look for long spans that block the critical path:

[---- agent.orchestrate (2500ms) ----]
  [-- llm.generate (1800ms) --]       <-- Bottleneck: 72% of total
  [-- sql.execute (200ms) --]
  [-- vector.search (50ms) --]
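
The percentage in the diagram is the child span's duration divided by the parent's. The sketch below reproduces that arithmetic for the example trace. The `duration_shares` helper is a hypothetical name, and the "unaccounted" figure assumes the children run one after another.

```python
def duration_shares(parent_ms, children_ms):
    """Return (name, share of parent duration) pairs, largest share first."""
    shares = {name: ms / parent_ms for name, ms in children_ms.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Durations taken from the example trace above.
children_ms = {"llm.generate": 1800, "sql.execute": 200, "vector.search": 50}
ranked = duration_shares(2500, children_ms)
for name, share in ranked:
    print(f"{name}: {share:.0%}")
# llm.generate: 72%
# sql.execute: 8%
# vector.search: 2%

# Time in the parent not covered by any child, assuming the children
# run sequentially and do not overlap.
unaccounted_ms = 2500 - sum(children_ms.values())
print(f"unaccounted: {unaccounted_ms}ms")  # unaccounted: 450ms
```

A large unaccounted remainder is itself worth investigating, since it points to work inside the parent operation that is not yet instrumented with its own span.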

Parallelization Opportunities

Look for sequential spans that could be parallelized:

[-- fetch_schema (100ms) --]
                           [-- fetch_metadata (80ms) --]
                                                       [-- fetch_stats (60ms) --]

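Provided the three fetches do not depend on one another's results, they could run in parallel, reducing total time from 240ms (100 + 80 + 60) to about 100ms, the duration of the slowest fetch.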
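
One way to run independent fetches concurrently is `asyncio.gather`. The sketch below is illustrative only: the `fetch` coroutine stands in for real I/O by sleeping, and the delays mirror the span durations in the diagram. It assumes the fetches are I/O-bound and independent.

```python
import asyncio
import time


async def fetch(name, delay_s):
    """Stand-in for an I/O-bound call; sleeps instead of doing real work."""
    await asyncio.sleep(delay_s)
    return name


async def fetch_all():
    # gather() schedules all three coroutines concurrently, so wall-clock
    # time approaches the slowest fetch rather than the sum of all three.
    return await asyncio.gather(
        fetch("schema", 0.10),
        fetch("metadata", 0.08),
        fetch("stats", 0.06),
    )


start = time.perf_counter()
results = asyncio.run(fetch_all())
elapsed_ms = (time.perf_counter() - start) * 1000
print(results, f"{elapsed_ms:.0f}ms")
```

`gather` returns results in the order the coroutines were passed, regardless of which finishes first. After such a change, the three spans in the trace should start at roughly the same time instead of forming a staircase.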


Error Analysis

Error Span Attributes

| Attribute | Description |
| --- | --- |
| otel.status_code | ERROR for failed spans |
| error.type | Exception class name |
| error.message | Error message |
| error.stack | Stack trace (if configured) |

Error Propagation

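Errors usually propagate up the span tree. When a database query fails and the calling code does not handle the exception, the parent service span is also marked as errored, and the error continues up to the root span. A parent that catches and recovers from the failure can still finish with status OK, so start the investigation at the deepest errored span.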
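
Because a failure marks its ancestors as well, the span most likely to hold the root cause is an errored span with no errored children. The following is a minimal sketch of that filter. The dictionary layout and the `root_cause_spans` helper are assumptions for illustration, not a MATIH API.

```python
def root_cause_spans(spans):
    """Return IDs of ERROR spans that have no ERROR child."""
    errored = {s["spanId"] for s in spans if s["status"] == "ERROR"}
    # Parents of errored spans only inherited the failure from below.
    parents_of_errors = {
        s["parentSpanId"] for s in spans if s["spanId"] in errored
    }
    return sorted(errored - parents_of_errors)


# Hypothetical trace: a failed db span marks its ancestors as errored.
spans = [
    {"spanId": "root", "parentSpanId": None, "status": "ERROR"},
    {"spanId": "svc", "parentSpanId": "root", "status": "ERROR"},
    {"spanId": "db", "parentSpanId": "svc", "status": "ERROR"},
    {"spanId": "cache", "parentSpanId": "svc", "status": "OK"},
]
causes = root_cause_spans(spans)
print(causes)  # ['db']
```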


Performance Attributes

Key attributes for performance analysis:

| Attribute | Description |
| --- | --- |
| db.statement | SQL query text (may be truncated) |
| db.rows_affected | Number of rows affected or returned by the statement |
| http.status_code | HTTP response status |
| llm.model | LLM model used |
| llm.tokens.input | Input token count |
| llm.tokens.output | Output token count |
| llm.cost_usd | Estimated LLM cost in USD |
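
The LLM attributes lend themselves to simple roll-ups, for example total tokens and cost per model across a set of spans. The sketch below assumes spans are available as dictionaries with an `attributes` mapping. The `llm_usage_by_model` helper and all token and cost values are made up for illustration.

```python
from collections import defaultdict


def llm_usage_by_model(spans):
    """Sum call count, tokens, and estimated cost per llm.model value."""
    totals = defaultdict(
        lambda: {"calls": 0, "input": 0, "output": 0, "cost_usd": 0.0}
    )
    for span in spans:
        attrs = span.get("attributes", {})
        model = attrs.get("llm.model")
        if model is None:
            continue  # not an LLM span
        row = totals[model]
        row["calls"] += 1
        row["input"] += attrs.get("llm.tokens.input", 0)
        row["output"] += attrs.get("llm.tokens.output", 0)
        row["cost_usd"] += attrs.get("llm.cost_usd", 0.0)
    return dict(totals)


# Hypothetical spans; the numbers are illustrative, not measured.
spans = [
    {"name": "llm.generate", "attributes": {
        "llm.model": "gpt-4", "llm.tokens.input": 1200,
        "llm.tokens.output": 300, "llm.cost_usd": 0.25}},
    {"name": "llm.generate", "attributes": {
        "llm.model": "gpt-4", "llm.tokens.input": 800,
        "llm.tokens.output": 200, "llm.cost_usd": 0.5}},
    {"name": "sql.execute", "attributes": {"db.rows_affected": 42}},
]
usage = llm_usage_by_model(spans)
print(usage["gpt-4"])
```

Spans without an `llm.model` attribute are skipped, so the same function can be run over a whole trace without filtering by operation name first.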

Analysis Queries in Grafana

Slow Traces

Find traces that contain an ai-service span longer than 5 seconds:

{duration > 5s && resource.service.name = "ai-service"}

Error Traces

Find traces with errors:

{status = error && resource.service.name = "ai-service"}

Specific Operation

Find traces for a specific operation and model:

{name = "llm.generate" && span.llm.model = "gpt-4"}

Tail Sampling

For production, use tail sampling to keep all error traces, all traces slower than 5 seconds, and a 1% sample of everything else:

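processors:
  tail_sampling:
    # Buffer a trace's spans for 30s before making the sampling decision
    decision_wait: 30s
    # A trace is kept if any policy below matches
    policies:
      # Keep every trace that contains an ERROR span
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep every trace slower than 5 seconds
      - name: slow
        type: latency
        latency: {threshold_ms: 5000}
      # Keep roughly 1% of traces regardless of status or latency
      - name: sample
        type: probabilistic
        probabilistic: {sampling_percentage: 1}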
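
The combined effect of the three policies can be sketched as a single decision function. This is a simplified model of the policy logic, not the collector's implementation: the `keep_trace` helper and the span dictionary layout are hypothetical, the comparison used exactly at the 5000ms threshold is an assumption, and a plain random draw stands in for however the processor actually selects its 1%.

```python
import random


def keep_trace(spans, rng=random.random):
    """Return True if any of the three sampling policies would keep the trace."""
    # "errors" policy: any span with ERROR status keeps the whole trace.
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # "slow" policy: trace duration runs from earliest start to latest end.
    start = min(s["start_ms"] for s in spans)
    end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    if end - start > 5000:  # assumed comparison at the threshold
        return True
    # "sample" policy: keep about 1% of the remaining traces.
    return rng() < 0.01


# Hypothetical traces for illustration.
fast_ok = [{"status": "OK", "start_ms": 0, "duration_ms": 120}]
slow_ok = [{"status": "OK", "start_ms": 0, "duration_ms": 6200}]
failed = [{"status": "ERROR", "start_ms": 0, "duration_ms": 90}]

print(keep_trace(failed), keep_trace(slow_ok))  # True True
```

The `rng` parameter exists only so the probabilistic branch can be exercised deterministically, for example `keep_trace(fast_ok, rng=lambda: 0.5)` returns False.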