MATIH Platform is in active MVP development. Documentation reflects current implementation status.
2. Architecture

Design Philosophy
Every architectural decision in the MATIH platform traces back to a small set of core principles. This section articulates those principles, explains the trade-offs made during design, and documents the constraints that shaped the platform's structure. Understanding these foundational choices is essential for anyone extending, operating, or evaluating the system.


Core Design Principles

1. Intent to Insights

The platform's central thesis is that the distance between a user's question and a data-driven answer should be measured in seconds, not sprints. Every architectural choice is evaluated against this north star: does it reduce the friction between human intent and analytical insight?

This principle drives the multi-agent AI architecture, the conversational interface, and the tight integration between the query engine, semantic layer, and visualization services. The system is designed so that a natural language question flows through intent classification, SQL generation, query execution, and visualization rendering in a single request lifecycle.
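The lifecycle above can be sketched as a single pipeline. This is an illustrative outline only: the stage names come from the text, but every function body here is a placeholder, not the platform's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of the single-request lifecycle described above.
# Stage names follow the text; implementations are placeholders.

@dataclass
class InsightResult:
    intent: str
    sql: str
    rows: list
    chart: str

def classify_intent(question: str) -> str:
    # Placeholder: a real implementation would call the AI service.
    return "aggregation" if "total" in question.lower() else "lookup"

def generate_sql(question: str, intent: str) -> str:
    # Placeholder for the text-to-SQL step.
    return "SELECT region, SUM(amount) FROM sales GROUP BY region"

def execute_query(sql: str) -> list:
    # Placeholder for the query engine call.
    return [("EMEA", 1200), ("APAC", 900)]

def render_visualization(rows: list, intent: str) -> str:
    # Placeholder for the visualization service.
    return "bar_chart"

def answer(question: str) -> InsightResult:
    # One request lifecycle: intent -> SQL -> execution -> rendering.
    intent = classify_intent(question)
    sql = generate_sql(question, intent)
    rows = execute_query(sql)
    chart = render_visualization(rows, intent)
    return InsightResult(intent, sql, rows, chart)
```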

2. Separation of Concerns via Two-Plane Architecture

The platform enforces a strict boundary between platform management (Control Plane) and tenant workload execution (Data Plane). This separation is not merely organizational -- it is an architectural invariant that shapes deployment topology, security boundaries, failure domains, and scaling strategies.

Aspect         | Control Plane                              | Data Plane
Responsibility | Platform operations, tenant lifecycle, IAM | Tenant-specific data processing, AI/ML
Technology     | Java/Spring Boot 3.2 (homogeneous)         | Java, Python, Node.js (polyglot)
Namespace      | matih-control-plane                        | matih-data-plane (per-tenant)
Scaling        | Scales with tenant count                   | Scales with workload volume
Failure impact | Affects management operations              | Affects tenant workloads only
Data access    | Platform metadata only                     | Tenant data and models

The two-plane model ensures that a runaway query in one tenant's data plane cannot degrade the platform's ability to manage other tenants. It also allows independent scaling -- the Control Plane can remain lean while individual tenant Data Planes scale vertically or horizontally based on workload demands.

3. Multi-Tenancy as a First-Class Concern

Multi-tenancy is not bolted on after the fact -- it permeates every layer of the stack. From JWT claims that carry tenant_id to Hibernate tenant identifier resolvers that route database queries, from Kafka event headers that tag messages with tenant context to Kubernetes namespace policies that enforce network isolation, tenant awareness is woven into the platform's fabric.

The design uses a hybrid isolation model:

  • Namespace-level isolation in Kubernetes for network and resource boundaries
  • Schema-level isolation in PostgreSQL for data separation
  • Logical isolation in application code via TenantContext propagation
  • Topic-level isolation in Kafka for event stream separation

This hybrid approach balances the operational simplicity of shared infrastructure with the security guarantees that enterprise customers require.
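The logical-isolation layer can be illustrated with context propagation. The platform's TenantContext is Java; this is a hedged Python analogue using contextvars, showing the same idea of binding a tenant to the request context and scoping every data access to it. The scoped_table helper is hypothetical.

```python
import contextvars

# Sketch of logical tenant isolation via context propagation,
# analogous to the TenantContext pattern described above.

_tenant_id = contextvars.ContextVar("tenant_id", default=None)

class TenantContext:
    @staticmethod
    def set(tenant_id: str):
        # Bind the tenant for the current request context.
        return _tenant_id.set(tenant_id)

    @staticmethod
    def get() -> str:
        tenant = _tenant_id.get()
        if tenant is None:
            raise RuntimeError("No tenant bound to this request")
        return tenant

def scoped_table(table: str) -> str:
    # Schema-level isolation: route every query to the tenant's
    # own schema rather than a shared one.
    return f"{TenantContext.get()}.{table}"
```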

4. Cloud-Agnostic, Kubernetes-Native

The platform is designed to run on any Kubernetes distribution -- AKS, EKS, GKE, or on-premises clusters. Cloud-specific concerns are abstracted behind well-defined interfaces:

  • Storage: Uses Kubernetes PersistentVolumeClaims rather than cloud-specific storage APIs
  • Secrets: Kubernetes Secrets with External Secrets Operator for cloud vault integration
  • DNS: Abstracted through cert-manager and ingress controllers
  • Identity: Workload identity patterns that map to Azure AD, AWS IAM, or GCP IAM

The platform uses 55+ Helm charts for deployment, with environment-specific value overrides (values-dev.yaml, values-prod.yaml) rather than separate chart structures per cloud provider.
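The override pattern keeps one chart per service, with environments differing only in values files. A hypothetical values-dev.yaml might look like this (key names are illustrative, not the platform's actual chart values):

```yaml
# Hypothetical values-dev.yaml override; key names are illustrative.
replicaCount: 1
resources:
  requests:
    cpu: 100m
    memory: 256Mi
ingress:
  host: dev.example.internal
```

Deployment then layers the override onto the base values, e.g. `helm upgrade --install <service> ./chart -f values.yaml -f values-dev.yaml`.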

5. Convention Over Configuration

The platform establishes strong conventions that reduce decision fatigue and enforce consistency:

  • Port assignment: Every service has a registered port in scripts/config/components.yaml, the single source of truth
  • Health endpoints: Java services use /api/v1/actuator/health; Python services use /health
  • API versioning: All APIs follow the /api/v1/ prefix convention, with header-based version negotiation
  • Event naming: Events follow CATEGORY.ACTION naming (e.g., QUERY.COMPLETED, MODEL.PUBLISHED)
  • Database naming: Each service owns a dedicated database named after the service (e.g., iam, tenant, billing)

6. Observability by Default

Every service ships with built-in observability. This is not optional instrumentation added during debugging -- it is structural:

  • Structured logging via StructuredLoggingConfig in commons-java, with tenant and correlation ID enrichment
  • Distributed tracing via OpenTelemetry with @Traced annotations and automatic span propagation
  • Metrics via Micrometer with tenant-aware dimensional tagging
  • Health indicators via custom ComponentHealthCheck implementations

The observability stack (Prometheus, Grafana, Tempo, Loki) is deployed in its own namespace (matih-observability) and is accessible through the observability-api service in the Control Plane.
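The log-enrichment idea can be sketched as follows. This is not the commons-java StructuredLoggingConfig; it is a minimal Python analogue showing tenant and correlation IDs carried as first-class fields in structured (JSON) log output. Field names are illustrative.

```python
import json
import logging

# Sketch of tenant- and correlation-ID-enriched structured logging,
# in the spirit of StructuredLoggingConfig described above.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Enrichment fields; populated per request in practice.
            "tenant_id": getattr(record, "tenant_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

def make_logger() -> logging.Logger:
    logger = logging.getLogger("matih.sketch")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```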


Architectural Trade-Offs

Every design decision involves trade-offs. This section documents the conscious trade-offs made in the MATIH architecture and the reasoning behind each choice.

Microservices vs. Monolith

Choice: 24 microservices across two planes.

Trade-off: The platform accepts the operational complexity of distributed systems (network partitions, distributed transactions, deployment coordination) in exchange for:

  • Independent deployment and scaling of individual services
  • Technology diversity (Java for data-intensive services, Python for AI/ML, Node.js for rendering)
  • Fault isolation -- a crashed AI service does not take down billing
  • Team autonomy -- different teams can own different services

Mitigation: The complexity cost is mitigated by:

  • Commons libraries that provide consistent behavior across services
  • A single component registry (components.yaml) as the source of truth
  • Standardized deployment via Helm charts and scripted CD pipelines
  • Comprehensive observability that makes distributed debugging tractable

Synchronous REST vs. Asynchronous Events

Choice: Hybrid communication model using both REST APIs and Kafka event streaming.

Trade-off: Maintaining two communication paradigms increases system complexity, but each serves a distinct purpose:

Pattern            | Used For                                                                   | Examples
Synchronous REST   | Request-response flows requiring immediate results                         | Query execution, dashboard rendering, user authentication
Asynchronous Kafka | Fire-and-forget notifications, event sourcing, cross-service coordination  | Audit logging, billing events, data quality alerts
Redis Pub/Sub      | Real-time updates to connected clients                                     | Live dashboard updates, agent progress streaming

The key principle is: if the caller needs to wait for the result, use REST; if the caller can proceed without the result, use events.
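For the event path, the CATEGORY.ACTION naming convention and tenant-tagged headers can be sketched together. The class below loosely mirrors the idea of a platform event envelope; its fields and validation are illustrative, not the actual DataPlaneEvent contract.

```python
import re
from dataclasses import dataclass, field

# Sketch of the CATEGORY.ACTION event convention with tenant context
# carried in headers (as on Kafka messages). Fields are illustrative.

_EVENT_NAME = re.compile(r"^[A-Z]+\.[A-Z_]+$")

@dataclass
class PlatformEvent:
    name: str                      # e.g. "QUERY.COMPLETED"
    tenant_id: str
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if not _EVENT_NAME.match(self.name):
            raise ValueError(f"Event name must be CATEGORY.ACTION: {self.name}")

    def headers(self) -> dict:
        # Tenant context travels with the event, not inside the payload.
        return {"tenant_id": self.tenant_id, "event": self.name}
```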

Polyglot vs. Single Language

Choice: Three primary languages (Java, Python, Node.js) across the platform.

Trade-off: Multiple language runtimes increase operational burden (different build tools, different container images, different debugging approaches). The platform accepts this cost because:

  • Java/Spring Boot provides the best ecosystem for enterprise data services (JDBC, Hibernate multi-tenancy, Spring Security)
  • Python/FastAPI is the natural choice for AI/ML workloads (LangChain, PyTorch, scikit-learn, pandas)
  • Node.js delivers optimal server-side rendering performance for visualization components

Mitigation: Commons libraries (commons-java, commons-python, commons-typescript) ensure behavioral consistency regardless of language. Cross-language contracts are defined through shared event schemas and API specifications.

Per-Tenant Namespaces vs. Shared Namespaces

Choice: Hybrid model -- Control Plane in a shared namespace, Data Plane in per-tenant namespaces.

Trade-off: Per-tenant namespaces increase Kubernetes resource overhead (more ServiceAccounts, NetworkPolicies, ResourceQuotas), but provide:

  • Network isolation via Kubernetes NetworkPolicies
  • Resource quota enforcement per tenant
  • Independent lifecycle management
  • Security boundary that maps naturally to organizational boundaries

Design Constraints

Several constraints shaped the architecture. Understanding these constraints helps explain why certain approaches were chosen over alternatives.

Constraint 1: Enterprise Multi-Tenancy Requirements

Enterprise customers require data isolation guarantees that go beyond logical separation. The platform must demonstrate:

  • No cross-tenant data leakage, even under failure conditions
  • Auditable access controls with per-tenant audit trails
  • Resource isolation so one tenant's workload cannot degrade another's experience
  • Independent data residency and compliance postures per tenant

These requirements drove the decision to use namespace-level isolation, per-tenant databases, and the TenantContext propagation pattern.

Constraint 2: Variable Workload Profiles

AI/ML workloads have fundamentally different resource profiles than CRUD operations:

  • A text-to-SQL request might consume 2-10 seconds of GPU time
  • A dashboard refresh is a sub-100ms database query
  • A model training job might run for hours

The platform handles this variance through:

  • Separate resource quotas per service type
  • GPU scheduling via Kubernetes device plugins (for AI/ML services)
  • Async processing via Kafka for long-running operations
  • Redis-backed session storage for stateful conversation flows
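The routing decision between inline execution and async processing can be sketched with a duration threshold. This is an assumption-laden illustration: the threshold value, queue, and job shape are all hypothetical, not platform code.

```python
# Sketch: short requests run inline (REST); long-running work is
# enqueued for async processing. Threshold and queue are illustrative.

ASYNC_THRESHOLD_SECONDS = 1.0
_queue = []

def run_inline(job):
    return job["handler"]()

def enqueue(job):
    _queue.append(job)
    return {"status": "accepted", "position": len(_queue)}

def submit(job):
    if job["estimated_seconds"] > ASYNC_THRESHOLD_SECONDS:
        return enqueue(job)        # e.g. model training, text-to-SQL
    return run_inline(job)         # e.g. sub-100ms dashboard refresh
```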

Constraint 3: Regulatory Compliance

Financial services and healthcare customers operate under strict regulatory frameworks (SOC 2, HIPAA, GDPR). The architecture accounts for this through:

  • Immutable audit logs via the audit-service with Elasticsearch backing
  • Encryption at rest and in transit (TLS everywhere, field-level encryption via EncryptionService)
  • Role-based access control with fine-grained permissions via RbacService
  • Data residency controls through per-tenant namespace placement

Constraint 4: Developer Experience

The platform must be operable by a small team. This constraint eliminated approaches that require dedicated infrastructure teams:

  • Helm-based deployment rather than custom operators (lower learning curve)
  • Script-based CD (cd-new.sh) rather than complex CI/CD platforms
  • Convention-driven configuration rather than extensive per-service setup
  • Single source of truth (components.yaml) rather than scattered configuration

Layered Architecture

Within each service, the codebase follows a layered architecture pattern:

+------------------------------------------+
|         API Layer (Controllers)          |
|  - Request validation                    |
|  - Input sanitization                    |
|  - API versioning                        |
+------------------------------------------+
|      Service Layer (Business Logic)      |
|  - Domain operations                     |
|  - Cross-cutting concerns                |
|  - Transaction management                |
+------------------------------------------+
|      Repository Layer (Persistence)      |
|  - Database access                       |
|  - Cache management                      |
|  - External service clients              |
+------------------------------------------+
|           Infrastructure Layer           |
|  - Kafka producers/consumers             |
|  - Redis connections                     |
|  - HTTP clients                          |
+------------------------------------------+

Each layer has a clear responsibility boundary:

  • The API layer handles HTTP concerns and delegates to services
  • The Service layer contains business logic and is unit-testable without infrastructure
  • The Repository layer abstracts data access behind interfaces
  • The Infrastructure layer manages connections to external systems

Security Architecture Principles

Security is not a feature -- it is a structural property of the system. The platform follows defense-in-depth with multiple security layers:

  1. Network perimeter: Kong API Gateway validates all inbound requests
  2. Authentication: JWT tokens validated at every service via JwtTokenValidator
  3. Authorization: RBAC with tenant-scoped permissions via RbacService and PermissionEvaluator
  4. Input validation: SecurityFilter and InputValidation reject malicious payloads before they reach business logic
  5. Tenant isolation: TenantContextHolder ensures every operation is scoped to the correct tenant
  6. Data encryption: EncryptionService and KeyManagementService provide field-level encryption
  7. Audit: AuditLogger records all security-relevant operations

The SecurityFilter runs near the top of the filter chain (Ordered.HIGHEST_PRECEDENCE + 10). Before any business logic executes, it validates request headers (X-Tenant-ID, X-User-ID, X-Request-ID, X-Correlation-ID), query parameters, and URI paths for injection attacks, path traversal, and other suspicious patterns.
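The screening step can be illustrated in miniature. This is not the SecurityFilter's actual logic; it is a hedged sketch showing the shape of the check, i.e. required headers plus pattern-based rejection before business logic runs. The patterns and header set here are examples only; real validation is far more thorough.

```python
import re

# Illustrative pre-business-logic screening, in the spirit of the
# SecurityFilter described above. Patterns are examples only.

_SUSPICIOUS = [
    re.compile(r"\.\./"),                                      # path traversal
    re.compile(r"(?i)<script"),                                # script injection
    re.compile(r"(?i)\b(union|drop)\b.*\b(select|table)\b"),   # crude SQL injection
]

def screen_request(path: str, headers: dict) -> bool:
    # Reject requests missing required tenant/request context.
    required = ("X-Tenant-ID", "X-Request-ID")
    if any(h not in headers for h in required):
        return False
    # Reject requests whose path or header values look malicious.
    for value in [path, *headers.values()]:
        if any(p.search(value) for p in _SUSPICIOUS):
            return False
    return True
```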


Resilience Patterns

The platform employs several resilience patterns to handle partial failures gracefully:

Circuit Breakers

Inter-service calls use CircuitBreakerConfig from commons-java to prevent cascade failures. When a downstream service becomes unhealthy, the circuit opens and requests fail fast rather than accumulating timeouts.
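The fail-fast behavior can be sketched with a minimal breaker. This is not the commons-java CircuitBreakerConfig; the thresholds, state handling, and API below are illustrative.

```python
import time

# Minimal circuit-breaker sketch: after enough consecutive failures,
# the circuit opens and calls fail fast instead of accumulating timeouts.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```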

Retry with Backoff

RetryableRestClient implements exponential backoff for transient failures, with configurable retry counts and jitter to prevent thundering herd problems.
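A sketch of the backoff-with-jitter idea, in the spirit of RetryableRestClient (the real client is Java; parameters and the injectable sleep hook here are illustrative):

```python
import random
import time

# Exponential backoff with full jitter: each retry waits a random
# delay in [0, base * 2^attempt), spreading retries across callers
# to avoid a thundering herd.

def retry_with_backoff(fn, attempts=4, base_delay=0.1, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise               # retries exhausted; surface the error
            delay = random.uniform(0, base_delay * (2 ** attempt))
            sleep(delay)
```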

Graceful Degradation

Services are designed to degrade gracefully when dependencies are unavailable:

  • The AI service falls back to in-memory session storage when Redis is down
  • The query engine returns cached results when the semantic layer is temporarily unreachable
  • The notification service queues messages locally when Kafka is unavailable
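The first fallback above can be sketched as a session store that degrades to local memory when its primary backend errors. The client interface and error handling below are illustrative, not the AI service's actual code.

```python
# Sketch of graceful degradation: prefer Redis, fall back to an
# in-memory dict when the connection fails, rather than failing
# the request. The redis_client interface is illustrative.

class SessionStore:
    def __init__(self, redis_client=None):
        self.redis = redis_client
        self.fallback = {}

    def put(self, key, value):
        if self.redis is not None:
            try:
                self.redis.set(key, value)
                return "redis"
            except ConnectionError:
                pass  # degrade rather than fail the request
        self.fallback[key] = value
        return "memory"

    def get(self, key):
        if self.redis is not None:
            try:
                return self.redis.get(key)
            except ConnectionError:
                pass
        return self.fallback.get(key)
```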

Health Checks

Every service implements deep health checks via ComponentHealthCheck and HealthIndicatorRegistry that verify not just that the process is running, but that its dependencies (database, cache, message broker) are reachable and functional.
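The deep-check idea can be sketched as an aggregator that probes each dependency and rolls the results up into one status, loosely in the spirit of ComponentHealthCheck (report shape and probe interface are illustrative):

```python
# Sketch of a deep health check: probe each dependency, not just
# process liveness, and aggregate into a single UP/DOWN status.

def deep_health(checks: dict) -> dict:
    components = {}
    for name, probe in checks.items():
        try:
            probe()                      # e.g. ping DB, cache, broker
            components[name] = "UP"
        except Exception as exc:
            components[name] = f"DOWN: {exc}"
    status = "UP" if all(v == "UP" for v in components.values()) else "DOWN"
    return {"status": status, "components": components}
```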


Evolution Strategy

The architecture is designed to evolve without wholesale replacement. Key extension points include:

  • New services can be added by creating a Helm chart, registering in components.yaml, and implementing the commons library interfaces
  • New event types can be introduced by extending DataPlaneEvent.EventCategory and registering handlers
  • New tenant isolation models can be supported by extending TenantIdentifierResolver
  • New cloud providers can be supported by adding Terraform modules without changing application code

The Architecture Decision Records (see ADRs) document the evolution of these decisions over time, ensuring that future contributors understand not just what was decided, but why.