MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Agent Architecture

Production - Data Plane Agent (Java:8085) + Ops Agent (Python:8080)

The MATIH platform has two distinct agent services: the Data Plane Agent, which bridges the Control Plane and Data Plane for lifecycle management, and the Ops Agent Service, which provides AI-powered operational intelligence.


2.4.F.1 Data Plane Agent (Port 8085)

The Data Plane Agent is a lightweight Java coordination service deployed in each tenant namespace. It serves as the local representative of the Control Plane within the tenant's Data Plane.

Core responsibilities:

  • Receive provisioning and lifecycle commands from the Control Plane via Kafka
  • Report tenant health status back to the Control Plane
  • Coordinate local service lifecycle (start, stop, scale)
  • Forward configuration updates from config-service to local services
  • Aggregate local metrics for Control Plane consumption

Communication Pattern

Control Plane                    Data Plane (per-tenant namespace)
+-----------------+             +----------------------+
| tenant-service  |--Kafka----->| data-plane-agent     |
| config-service  |--Kafka----->|                      |
+-----------------+             +----------+-----------+
                                           |
                                +----------v-----------+
                                | Local tenant services |
                                | (query-engine,        |
                                |  ai-service, etc.)    |
                                +-----------------------+

Kafka Topics

Direction          Topic Pattern              Content
Control --> Agent  tenant.:tenantId.commands  Provisioning commands, scale requests
Control --> Agent  tenant.:tenantId.config    Configuration updates, feature flags
Agent --> Control  tenant.:tenantId.status    Health reports, resource utilization
Agent --> Control  tenant.:tenantId.events    Business events, error reports
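
As a minimal sketch, the topic names in the table above could be derived from a tenant ID as shown here. The `TopicKind` enum and the `topic_for` helper are hypothetical names used only for illustration, and the sketch assumes that `:tenantId` is a placeholder replaced by the literal tenant ID; neither is confirmed by the platform source.

```python
from enum import Enum


class TopicKind(Enum):
    """Topic suffixes from the Kafka Topics table (enum name is hypothetical)."""
    COMMANDS = "commands"  # Control --> Agent
    CONFIG = "config"      # Control --> Agent
    STATUS = "status"      # Agent --> Control
    EVENTS = "events"      # Agent --> Control


def topic_for(tenant_id: str, kind: TopicKind) -> str:
    """Build a per-tenant topic name, assuming :tenantId is substituted verbatim."""
    if not tenant_id or "." in tenant_id:
        # A dot inside the ID would make the topic name ambiguous to parse.
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant.{tenant_id}.{kind.value}"
```

For example, `topic_for("acme-corp", TopicKind.STATUS)` yields `tenant.acme-corp.status`. Rejecting IDs that contain a dot is a design choice of this sketch, not a documented platform rule.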

Health Aggregation

The Data Plane Agent polls all local services every 30 seconds and publishes an aggregated health report:

{
  "tenantId": "acme-corp",
  "timestamp": "2026-02-12T10:30:00Z",
  "overallStatus": "HEALTHY",
  "services": {
    "query-engine": { "status": "UP", "responseTime": 12 },
    "ai-service": { "status": "UP", "responseTime": 45 },
    "bi-service": { "status": "UP", "responseTime": 8 },
    "catalog-service": { "status": "UP", "responseTime": 15 }
  },
  "resources": {
    "cpuUsage": 0.45,
    "memoryUsage": 0.62,
    "podCount": 14
  }
}
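
The aggregation step might look roughly like the following sketch, which assembles a report in the shape shown above from per-service probe results. The function name `aggregate_health`, the rule that a tenant is `HEALTHY` only when every service is `UP`, and the `DEGRADED` fallback value are all assumptions; the source shows only the `HEALTHY` case.

```python
from datetime import datetime, timezone


def aggregate_health(tenant_id, services, resources, now=None):
    """Combine per-service probe results into one tenant health report.

    Assumption: the tenant is HEALTHY only when at least one service is
    known and every service reports UP. "DEGRADED" is a hypothetical
    status value used for every other case.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    all_up = bool(services) and all(
        s["status"] == "UP" for s in services.values()
    )
    return {
        "tenantId": tenant_id,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "overallStatus": "HEALTHY" if all_up else "DEGRADED",
        "services": services,
        "resources": resources,
    }
```

A caller would run this once per polling interval and publish the returned dictionary as JSON.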

2.4.F.2 Ops Agent Service (Port 8080)

The Ops Agent Service provides intelligent operations management through specialized AI agents that monitor, diagnose, and remediate platform issues.

Core responsibilities:

  • Automated incident detection and triage via anomaly detection
  • Root cause analysis using AI agents with access to metrics, logs, and traces
  • Runbook automation and execution
  • Performance anomaly detection via statistical models
  • Capacity planning recommendations
  • Knowledge base via ChromaDB for operational patterns

Observability Integration

Source          Protocol       Data
Prometheus      REST (PromQL)  Metrics: CPU, memory, latency, error rates
Loki            REST (LogQL)   Log entries from all services
Tempo           REST           Distributed traces
Kubernetes API  REST           Pod status, events, resource utilization
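
To illustrate what the REST access to Prometheus and Loki involves, the sketch below builds request URLs for the standard Prometheus instant-query endpoint (`/api/v1/query`) and the Loki range-query endpoint (`/loki/api/v1/query_range`). The base URLs and function names are hypothetical; how the Ops Agent actually issues these requests is not described in the source.

```python
from urllib.parse import urlencode

# Hypothetical in-cluster base URLs; real service names will differ.
PROMETHEUS_URL = "http://prometheus:9090"
LOKI_URL = "http://loki:3100"


def prometheus_query_url(promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def loki_query_range_url(logql: str, start_ns: int, end_ns: int,
                         limit: int = 100) -> str:
    """URL for a range query against the Loki HTTP API.

    start_ns and end_ns are Unix epoch timestamps in nanoseconds.
    """
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"
```

The returned strings can be passed to any HTTP client; query text is URL-encoded so that PromQL and LogQL selectors containing braces and quotes survive transport.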

AI-Powered Diagnosis

The Ops Agent handles each anomaly in six stages:

1. DETECT: Statistical anomaly in metrics (e.g., p99 latency spike)
2. GATHER: Collect correlated metrics, logs, and traces
3. ANALYZE: AI agent reasons about root cause using gathered evidence
4. RECOMMEND: Suggest remediation (scale up, restart, configuration change)
5. EXECUTE: If auto-remediation is enabled, execute the runbook
6. REPORT: Publish incident report with timeline and root cause
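
The six stages above can be sketched as a single control flow. This is an illustration under stated assumptions, not the Ops Agent's implementation: the z-score test stands in for whatever statistical model the service uses, and `detect`, `handle_metric`, and the `gather`, `analyze`, and `run_runbook` callables are hypothetical names.

```python
from statistics import mean, stdev


def detect(history, latest, threshold=3.0):
    """DETECT: flag `latest` if its z-score against `history` exceeds threshold."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return latest != history[0]  # any change from a flat series is anomalous
    return abs(latest - mean(history)) / sigma > threshold


def handle_metric(history, latest, gather, analyze, run_runbook,
                  auto_remediate=False):
    """Run the six stages for one metric sample; return a report or None."""
    if not detect(history, latest):            # 1. DETECT
        return None
    evidence = gather()                        # 2. GATHER metrics, logs, traces
    root_cause, action = analyze(evidence)     # 3. ANALYZE and 4. RECOMMEND
    executed = False
    if auto_remediate:                         # 5. EXECUTE only when enabled
        run_runbook(action)
        executed = True
    return {                                   # 6. REPORT
        "rootCause": root_cause,
        "recommendation": action,
        "executed": executed,
    }
```

Passing the gather, analyze, and runbook steps in as callables keeps the control flow testable without a live metrics backend or AI model.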

Resource Profile

The Ops Agent Service requires the most resources of any Data Plane service:

Environment  CPU Request  CPU Limit  Memory Request  Memory Limit
Development  500m         2000m      1Gi             4Gi
Production   500m         2000m      1Gi             4Gi
