Agent Architecture
The MATIH platform has two distinct agent services: the Data Plane Agent, which bridges the Control Plane and Data Plane for lifecycle management, and the Ops Agent Service, which provides AI-powered operational intelligence.
2.4.F.1 Data Plane Agent (Port 8085)
The Data Plane Agent is a lightweight Java coordination service deployed in each tenant namespace. It serves as the local representative of the Control Plane within the tenant's Data Plane.
Core responsibilities:
- Receive provisioning and lifecycle commands from the Control Plane via Kafka
- Report tenant health status back to the Control Plane
- Coordinate local service lifecycle (start, stop, scale)
- Forward configuration updates from config-service to local services
- Aggregate local metrics for Control Plane consumption
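The lifecycle coordination listed above (start, stop, scale) amounts to routing each incoming command to a handler. The following is a minimal sketch in Python for brevity, even though the agent itself is a Java service. The command envelope and its field names (`action`, `service`, `replicas`) are hypothetical, since the command schema is not specified here, and the handlers only return a description instead of touching real services.

```python
import json
from typing import Callable, Dict

# Hypothetical command envelope, e.g.
#   {"action": "scale", "service": "query-engine", "replicas": 3}
# The field names are illustrative assumptions, not the platform's schema.

def start(cmd: dict) -> str:
    return f"started {cmd['service']}"

def stop(cmd: dict) -> str:
    return f"stopped {cmd['service']}"

def scale(cmd: dict) -> str:
    return f"scaled {cmd['service']} to {cmd['replicas']} replicas"

# Dispatch table: one handler per supported lifecycle action.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "start": start,
    "stop": stop,
    "scale": scale,
}

def handle_command(raw: bytes) -> str:
    """Decode one command message and route it to the matching handler."""
    cmd = json.loads(raw)
    handler = HANDLERS.get(cmd.get("action"))
    if handler is None:
        # Unknown actions are rejected rather than silently ignored.
        raise ValueError(f"unsupported action: {cmd.get('action')!r}")
    return handler(cmd)
```

A dispatch table keeps the set of supported actions explicit, so an unexpected command fails loudly instead of being dropped.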
Communication Pattern
Control Plane                   Data Plane (per-tenant namespace)
+-----------------+             +-----------------------+
| tenant-service  |--Kafka----->|   data-plane-agent    |
| config-service  |--Kafka----->|                       |
+-----------------+             +-----------+-----------+
                                            |
                                +-----------v-----------+
                                | Local tenant services |
                                | (query-engine,        |
                                |  ai-service, etc.)    |
                                +-----------------------+
Kafka Topics
| Direction | Topic Pattern | Content |
|---|---|---|
| Control --> Agent | tenant.:tenantId.commands | Provisioning commands, scale requests |
| Control --> Agent | tenant.:tenantId.config | Configuration updates, feature flags |
| Agent --> Control | tenant.:tenantId.status | Health reports, resource utilization |
| Agent --> Control | tenant.:tenantId.events | Business events, error reports |
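As an illustration of the naming pattern in the table, the sketch below builds concrete topic names for one tenant. It assumes that `:tenantId` is a placeholder replaced verbatim by the tenant identifier; the helper names and the validation rule are hypothetical.

```python
# The four suffixes come from the table above. Substituting the tenant ID for
# the ":tenantId" placeholder is an assumption about how the pattern expands.
INBOUND_SUFFIXES = ("commands", "config")   # Control --> Agent
OUTBOUND_SUFFIXES = ("status", "events")    # Agent --> Control

def tenant_topic(tenant_id: str, suffix: str) -> str:
    """Return the topic name for one tenant and one suffix."""
    if suffix not in INBOUND_SUFFIXES + OUTBOUND_SUFFIXES:
        raise ValueError(f"unknown topic suffix: {suffix!r}")
    if not tenant_id or "." in tenant_id:
        # Assumed rule: a dot in the ID would make the topic ambiguous to parse.
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return f"tenant.{tenant_id}.{suffix}"

def agent_subscriptions(tenant_id: str) -> list:
    """Topics the agent would consume from (Control --> Agent)."""
    return [tenant_topic(tenant_id, s) for s in INBOUND_SUFFIXES]

def agent_publications(tenant_id: str) -> list:
    """Topics the agent would publish to (Agent --> Control)."""
    return [tenant_topic(tenant_id, s) for s in OUTBOUND_SUFFIXES]
```

For the tenant `acme-corp`, `agent_subscriptions` yields `tenant.acme-corp.commands` and `tenant.acme-corp.config`.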
Health Aggregation
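This section covers how per-service health is rolled up into a single tenant-level report. The sketch below shows one way that roll-up might work. The rule for deriving `overallStatus` is an assumption: only `HEALTHY` appears in this document, so `DEGRADED` and `UNHEALTHY` are hypothetical values, and the function names are illustrative.

```python
from datetime import datetime, timezone
from typing import Optional

def overall_status(services: dict) -> str:
    """Assumed roll-up rule: all UP -> HEALTHY, some UP -> DEGRADED, none -> UNHEALTHY."""
    if not services:
        return "UNHEALTHY"  # nothing reported is treated as a failure (assumption)
    up = sum(1 for s in services.values() if s["status"] == "UP")
    if up == len(services):
        return "HEALTHY"
    return "DEGRADED" if up > 0 else "UNHEALTHY"

def build_report(tenant_id: str, services: dict, resources: dict,
                 now: Optional[datetime] = None) -> dict:
    """Assemble a report with the same top-level fields as the example in this section."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "tenantId": tenant_id,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "overallStatus": overall_status(services),
        "services": services,
        "resources": resources,
    }
```

Passing `now` explicitly keeps the function deterministic, which makes the roll-up rule easy to test in isolation from the polling loop.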
The Data Plane Agent polls all local services every 30 seconds and publishes an aggregated health report:
{
  "tenantId": "acme-corp",
  "timestamp": "2026-02-12T10:30:00Z",
  "overallStatus": "HEALTHY",
  "services": {
    "query-engine": { "status": "UP", "responseTime": 12 },
    "ai-service": { "status": "UP", "responseTime": 45 },
    "bi-service": { "status": "UP", "responseTime": 8 },
    "catalog-service": { "status": "UP", "responseTime": 15 }
  },
  "resources": {
    "cpuUsage": 0.45,
    "memoryUsage": 0.62,
    "podCount": 14
  }
}
2.4.F.2 Ops Agent Service (Port 8080)
The Ops Agent Service provides intelligent operations management through specialized AI agents that monitor, diagnose, and remediate platform issues.
Core responsibilities:
- Automated incident detection and triage via anomaly detection
- Root cause analysis using AI agents with access to metrics, logs, and traces
- Runbook automation and execution
- Performance anomaly detection via statistical models
- Capacity planning recommendations
- Knowledge base of operational patterns, stored in ChromaDB
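The statistical anomaly detection listed above is not specified further in this document. As one simple possibility, a z-score check flags a sample that lies far from the recent mean. The function name, the default threshold of 3 standard deviations, and the sample values are all assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence

def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """Flag `latest` if it is more than `threshold` sample standard
    deviations away from the mean of `history` (z-score test)."""
    if len(history) < 2:
        return False  # too little data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold

# Hypothetical p99 latency samples in milliseconds.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
spike_detected = is_anomalous(baseline, 180.0)
```

A z-score assumes roughly stable, symmetric noise; latency data with strong daily cycles would likely need a seasonal baseline instead.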
Observability Integration
| Source | Protocol | Data |
|---|---|---|
| Prometheus | REST (PromQL) | Metrics: CPU, memory, latency, error rates |
| Loki | REST (LogQL) | Log entries from all services |
| Tempo | REST | Distributed traces |
| Kubernetes API | REST | Pod status, events, resource utilization |
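To make the protocols in the table concrete, the sketch below builds request URLs for each source. The paths are the public HTTP endpoints of Prometheus (`/api/v1/query`), Loki (`/loki/api/v1/query_range`), Tempo (`/api/traces/<traceID>`), and the Kubernetes pod list API. The base URLs are hypothetical, and how the Ops Agent actually issues these requests is not described in this document.

```python
from urllib.parse import quote, urlencode

# Hypothetical in-cluster base URLs; real deployments will differ.
PROMETHEUS = "http://prometheus:9090"
LOKI = "http://loki:3100"
TEMPO = "http://tempo:3200"
KUBERNETES = "https://kubernetes.default.svc"

def prometheus_query_url(promql: str) -> str:
    """Instant PromQL query."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"

def loki_query_url(logql: str, limit: int = 100) -> str:
    """LogQL range query, capped at `limit` entries."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{LOKI}/loki/api/v1/query_range?{params}"

def tempo_trace_url(trace_id: str) -> str:
    """Fetch a single trace by ID."""
    return f"{TEMPO}/api/traces/{quote(trace_id, safe='')}"

def k8s_pods_url(namespace: str) -> str:
    """List pods in one namespace."""
    return f"{KUBERNETES}/api/v1/namespaces/{quote(namespace, safe='')}/pods"
```

Encoding the query with `urlencode` matters because PromQL and LogQL expressions routinely contain braces, quotes, and brackets that are not valid in a raw URL.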
AI-Powered Diagnosis
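The six-stage flow this section lists (DETECT through REPORT) can be sketched as a plain function pipeline. Everything here is a simplified assumption: the `Incident` fields, the callback signatures, and the `auto_remediate` flag are hypothetical names, and the AI reasoning step is reduced to an injected function.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Incident:
    signal: str                       # what DETECT flagged
    evidence: List[str] = field(default_factory=list)
    root_cause: str = ""
    recommendation: str = ""
    executed: bool = False
    timeline: List[str] = field(default_factory=list)

def run_pipeline(signal: str,
                 gather: Callable[[str], List[str]],
                 analyze: Callable[[List[str]], str],
                 recommend: Callable[[str], str],
                 execute: Callable[[str], None],
                 auto_remediate: bool = False) -> Incident:
    """Run the stages in order, recording each one on the timeline."""
    incident = Incident(signal=signal)
    incident.timeline.append("DETECT")
    incident.evidence = gather(signal)
    incident.timeline.append("GATHER")
    incident.root_cause = analyze(incident.evidence)
    incident.timeline.append("ANALYZE")
    incident.recommendation = recommend(incident.root_cause)
    incident.timeline.append("RECOMMEND")
    if auto_remediate:
        # EXECUTE only runs when auto-remediation is enabled.
        execute(incident.recommendation)
        incident.executed = True
        incident.timeline.append("EXECUTE")
    incident.timeline.append("REPORT")
    return incident
```

Keeping EXECUTE behind an explicit flag means the default path ends with a recommendation that a human can review.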
The Ops Agent handles an anomaly in six stages:
1. DETECT: Statistical anomaly in metrics (e.g., p99 latency spike)
2. GATHER: Collect correlated metrics, logs, and traces
3. ANALYZE: AI agent reasons about root cause using gathered evidence
4. RECOMMEND: Suggest remediation (scale up, restart, configuration change)
5. EXECUTE: If auto-remediation is enabled, execute the runbook
6. REPORT: Publish incident report with timeline and root cause
Resource Profile
The Ops Agent Service requires the most resources of any Data Plane service:
| Environment | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---|---|---|---|---|
| Development | 500m | 2000m | 1Gi | 4Gi |
| Production | 500m | 2000m | 1Gi | 4Gi |
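The values in the table use Kubernetes quantity notation, where `m` means thousandths of a CPU core and `Gi` means 2^30 bytes. The sketch below converts them to plain numbers and computes the ratio of limit to request. The helper names are hypothetical, and only the suffixes that appear in or near this table are handled.

```python
# Binary memory suffixes as defined by Kubernetes resource quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '500m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)

def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '1Gi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # a bare number is already in bytes

# Values copied from the table above (identical for both environments).
profile = {
    "requests": {"cpu": "500m", "memory": "1Gi"},
    "limits": {"cpu": "2000m", "memory": "4Gi"},
}

cpu_ratio = cpu_cores(profile["limits"]["cpu"]) / cpu_cores(profile["requests"]["cpu"])
memory_ratio = memory_bytes(profile["limits"]["memory"]) / memory_bytes(profile["requests"]["memory"])
```

Both ratios come out to 4, meaning the service may burst to four times the CPU and memory it is guaranteed at scheduling time.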
Related Sections
- Service Topology -- Cross-plane communication
- Event-Driven Architecture -- Kafka topic design
- Observability -- Monitoring stack integration