MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Trace Correlation

Trace correlation links distributed traces with logs and metrics so that a single request can be followed across all three signals. MATIH uses the trace ID as the common identifier across the three observability pillars, which lets an engineer move from a metric spike to the trace that caused it, and from that trace to the relevant log entries.


Correlation Architecture

Metric Alert --> Exemplar Trace ID --> Trace View --> Log Lines
                                          |
                                     Span Attributes --> Metric Labels

Trace-to-Log Correlation

Every log line includes the trace ID and span ID for correlation:

Python (structlog)

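import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that attaches the active trace and span IDs."""
    ctx = trace.get_current_span().get_span_context()
    # is_valid is False when no span is active. Unlike is_recording(), it stays
    # True for unsampled spans, so their log lines can still be correlated.
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(processors=[
    add_trace_context,  # must run before the renderer
    structlog.dev.ConsoleRenderer(),
])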
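
The format specifiers in the processor above zero-pad the IDs to the fixed widths used by W3C Trace Context (32 hex characters for the trace ID, 16 for the span ID). The following is a minimal sketch of why the padding matters; the integer values and the `to_hex_ids` helper are illustrative assumptions, not part of the MATIH codebase.

```python
from typing import Tuple

# Illustrative values only: these integers stand in for ctx.trace_id and ctx.span_id.
trace_id_int = 0x0AF7651916CD43DD8448EB211C80319C
span_id_int = 0x00F067AA0BA902B7

def to_hex_ids(trace_id: int, span_id: int) -> Tuple[str, str]:
    """Render IDs as lowercase, zero-padded hex, as the processor does."""
    return format(trace_id, "032x"), format(span_id, "016x")

trace_id_hex, span_id_hex = to_hex_ids(trace_id_int, span_id_int)
unpadded = hex(trace_id_int)[2:]  # drops the leading zero, so it is one character short

print(trace_id_hex, len(trace_id_hex))
print(span_id_hex, len(span_id_hex))
print(unpadded, len(unpadded))
```

Without the padding, an ID that happens to start with a zero nibble would be logged in a shorter form than the one stored by the tracing backend, and a lookup by that string would likely fail.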

Java (Spring Boot)

Trace context is automatically included in Spring Boot logs via the Micrometer Tracing integration:

2025-06-15 10:30:00 [trace_id=abc123, span_id=def456] INFO c.m.i.TenantService - Provisioning tenant acme

Log-to-Trace Linking in Grafana

Loki is configured with derived fields that extract trace IDs from log lines:

datasources:
  - name: Loki
    type: loki
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "trace_id=(\\w+)"
          name: TraceID
          url: "$${__value.raw}"

Clicking a trace ID in a log line opens the corresponding trace in the Tempo data source.
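
To check what the derived field would capture before deploying the data source, the same pattern can be tried against a sample log line. This is a rough sketch using Python's `re` module rather than Grafana's own matcher; the `extract_trace_id` helper is hypothetical and the sample is the Spring Boot log line shown earlier.

```python
import re
from typing import Optional

# Same pattern as matcherRegex above (the YAML string "\\w" unescapes to the regex \w).
TRACE_ID_PATTERN = re.compile(r"trace_id=(\w+)")

def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first capture group, which Grafana substitutes for ${__value.raw}."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None

sample = (
    "2025-06-15 10:30:00 [trace_id=abc123, span_id=def456] "
    "INFO c.m.i.TenantService - Provisioning tenant acme"
)
print(extract_trace_id(sample))
print(extract_trace_id("a line with no trace context"))
```

Because `\w+` stops at the first non-word character, the capture ends at the comma and the span ID is not included in the link.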


Metric-to-Trace Linking

Prometheus histogram metrics include exemplars that contain trace IDs:

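from opentelemetry import trace
from prometheus_client import Histogram

request_duration = Histogram(
    "matih_http_request_duration_seconds",
    "Request duration",
    ["method", "endpoint"],
)

# Take the trace ID from the active span, formatted as in the log processor.
ctx = trace.get_current_span().get_span_context()
current_trace_id = format(ctx.trace_id, "032x")

# Record with exemplar (exemplars are exposed only in the OpenMetrics format)
request_duration.labels(method="POST", endpoint="/search").observe(
    0.45,
    exemplar={"traceID": current_trace_id},
)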

In Grafana, enabling exemplars on the panel's Prometheus query overlays individual exemplar points on the graph; each point links to its trace.
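
When debugging missing exemplars, it can help to confirm that the trace ID is actually present in the scrape output. In the OpenMetrics text format an exemplar follows the sample value after a `#`. The sketch below parses such a line with a simplified regex; the sample line is hand-written for illustration, and `exemplar_trace_id` is a hypothetical helper, not a prometheus_client API.

```python
import re
from typing import Optional

# Simplified patterns: they assume label values contain no '#', '}' or escaped quotes.
EXEMPLAR_PATTERN = re.compile(r"#\s*\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)")
LABEL_PATTERN = re.compile(r'(\w+)="([^"]*)"')

def exemplar_trace_id(sample_line: str, label: str = "traceID") -> Optional[str]:
    """Return the trace ID attached to a sample line, or None if it has no exemplar."""
    match = EXEMPLAR_PATTERN.search(sample_line)
    if match is None:
        return None
    labels = dict(LABEL_PATTERN.findall(match.group("labels")))
    return labels.get(label)

# Hand-written example of a histogram bucket sample carrying an exemplar.
line = (
    'matih_http_request_duration_seconds_bucket'
    '{method="POST",endpoint="/search",le="0.5"} 1.0 '
    '# {traceID="0af7651916cd43dd8448eb211c80319c"} 0.45'
)
print(exemplar_trace_id(line))
print(exemplar_trace_id('matih_http_request_duration_seconds_count{method="POST"} 1.0'))
```

The label name passed as `label` has to match the key used in `observe(...)` and the exemplar label configured on the Grafana data source.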


Tenant Context Propagation

The tenant ID is propagated alongside trace context:

Header          Purpose
traceparent     W3C trace context (trace_id, span_id)
X-Tenant-Id     Tenant identifier
X-Request-Id    Request correlation ID

These headers are set by the API gateway and propagated through all downstream services.
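
A downstream service can read these three headers to build its correlation context, for example to bind it to a logger. The following is a minimal sketch that assumes the standard W3C `traceparent` layout (`version-traceid-parentid-flags`); the `correlation_context` helper and the returned key names are assumptions for illustration. In practice the OpenTelemetry propagator would normally handle `traceparent` itself.

```python
import re
from typing import Dict, Mapping, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
TRACEPARENT_PATTERN = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def correlation_context(headers: Mapping[str, str]) -> Dict[str, Optional[object]]:
    """Extract trace, tenant, and request identifiers from incoming headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    context: Dict[str, Optional[object]] = {
        "trace_id": None,
        "parent_span_id": None,
        "sampled": None,
        "tenant_id": lowered.get("x-tenant-id"),
        "request_id": lowered.get("x-request-id"),
    }
    match = TRACEPARENT_PATTERN.match(lowered.get("traceparent", ""))
    # All-zero IDs are invalid in W3C Trace Context and are ignored here.
    if match and match.group(2) != "0" * 32 and match.group(3) != "0" * 16:
        context["trace_id"] = match.group(2)
        context["parent_span_id"] = match.group(3)
        context["sampled"] = bool(int(match.group(4), 16) & 0x01)
    return context

incoming = {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    "x-tenant-id": "acme",
    "X-Request-Id": "req-42",
}
ctx = correlation_context(incoming)
print(ctx)
```

A malformed or missing `traceparent` leaves the trace fields as `None` while the tenant and request IDs are still returned, so tenant-scoped logging keeps working for untraced requests.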


Correlation Workflow

  1. Alert fires -- Prometheus alert triggers on high error rate
  2. View dashboard -- Grafana dashboard shows the error spike with exemplars
  3. Click exemplar -- Navigate to the specific trace in Tempo
  4. View trace -- See the full span tree with error annotations
  5. Jump to logs -- Click "Logs for this span" to see related log entries in Loki
  6. Root cause -- Identify the failing service and specific error from logs
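
Step 5 can also be done by hand in Grafana Explore by filtering Loki on the trace ID. The sketch below builds such a LogQL query using the line filter operator `|=`; the `logs_query_for_trace` helper and the `app="tenant-service"` stream selector are hypothetical and should be replaced with the labels actually used in the Loki deployment.

```python
import re

def logs_query_for_trace(trace_id: str, selector: str = '{app="tenant-service"}') -> str:
    """Build a LogQL query that keeps only lines containing the given trace ID."""
    # Validate first so that arbitrary text cannot be injected into the query.
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    return f'{selector} |= "trace_id={trace_id}"'

query = logs_query_for_trace("0af7651916cd43dd8448eb211c80319c")
print(query)
```

The filter string matches the `trace_id=<id>` form that appears in the log lines shown earlier, so the same query works for both the Python and the Java services.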