MATIH Platform is in active MVP development. Documentation reflects current implementation status.
2. Architecture

Design Philosophy
Every architectural decision in the MATIH platform traces back to a small set of core principles. This section articulates those principles, explains the trade-offs made during design, and documents the constraints that shaped the platform's structure. Understanding these foundational choices is essential for anyone extending, operating, or evaluating the system.


Core Design Principles

1. Intent to Insights

The platform's central thesis is that the distance between a user's question and a data-driven answer should be measured in seconds, not sprints. Every architectural choice is evaluated against this north star: does it reduce the friction between human intent and analytical insight?

This principle drives the multi-agent AI architecture, the conversational interface, and the tight integration between the query engine, semantic layer, and visualization services. The system is designed so that a natural language question flows through intent classification, SQL generation, query execution, and visualization rendering in a single request lifecycle.
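The lifecycle above can be sketched as a single pipeline. This is an illustrative outline only: the stage names come from the text, but every function body here is a placeholder, not the platform's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of the single-request lifecycle described above.
# Stage names follow the text; implementations are placeholders.

@dataclass
class InsightResult:
    intent: str
    sql: str
    rows: list
    chart: str

def classify_intent(question: str) -> str:
    # Placeholder: a real implementation would call the AI service.
    return "aggregation" if "total" in question.lower() else "lookup"

def generate_sql(question: str, intent: str) -> str:
    # Placeholder for the text-to-SQL step.
    return "SELECT region, SUM(amount) FROM sales GROUP BY region"

def execute_query(sql: str) -> list:
    # Placeholder for the query engine call.
    return [("EMEA", 1200), ("APAC", 900)]

def render_visualization(rows: list, intent: str) -> str:
    # Placeholder for the visualization service.
    return "bar_chart"

def answer(question: str) -> InsightResult:
    # One request lifecycle: intent -> SQL -> execution -> rendering.
    intent = classify_intent(question)
    sql = generate_sql(question, intent)
    rows = execute_query(sql)
    chart = render_visualization(rows, intent)
    return InsightResult(intent, sql, rows, chart)
```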

2. Separation of Concerns via Two-Plane Architecture

The platform enforces a strict boundary between platform management (Control Plane) and tenant workload execution (Data Plane). This separation is not merely organizational -- it is an architectural invariant that shapes deployment topology, security boundaries, failure domains, and scaling strategies.

Aspect         | Control Plane                              | Data Plane
Responsibility | Platform operations, tenant lifecycle, IAM | Tenant-specific data processing, AI/ML
Technology     | Java/Spring Boot 3.2 (homogeneous)         | Java, Python, Node.js (polyglot)
Namespace      | matih-control-plane                        | matih-data-plane (per-tenant)
Scaling        | Scales with tenant count                   | Scales with workload volume
Failure impact | Affects management operations              | Affects tenant workloads only
Data access    | Platform metadata only                     | Tenant data and models

The two-plane model ensures that a runaway query in one tenant's data plane cannot degrade the platform's ability to manage other tenants. It also allows independent scaling -- the Control Plane can remain lean while individual tenant Data Planes scale vertically or horizontally based on workload demands.

3. Multi-Tenancy as a First-Class Concern

Multi-tenancy is not bolted on after the fact -- it permeates every layer of the stack. From JWT claims that carry tenant_id to Hibernate tenant identifier resolvers that route database queries, from Kafka event headers that tag messages with tenant context to Kubernetes namespace policies that enforce network isolation, tenant awareness is woven into the platform's fabric.

The design uses a hybrid isolation model:

  • Namespace-level isolation in Kubernetes for network and resource boundaries
  • Schema-level isolation in PostgreSQL for data separation
  • Logical isolation in application code via TenantContext propagation
  • Topic-level isolation in Kafka for event stream separation

This hybrid approach balances the operational simplicity of shared infrastructure with the security guarantees that enterprise customers require.
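The logical-isolation layer can be illustrated with context propagation. The platform's TenantContext is Java; this is a hedged Python analogue using contextvars, showing the same idea of binding a tenant to the request context and scoping every data access to it. The scoped_table helper is hypothetical.

```python
import contextvars

# Sketch of logical tenant isolation via context propagation,
# analogous to the TenantContext pattern described above.

_tenant_id = contextvars.ContextVar("tenant_id", default=None)

class TenantContext:
    @staticmethod
    def set(tenant_id: str):
        # Bind the tenant for the current request context.
        return _tenant_id.set(tenant_id)

    @staticmethod
    def get() -> str:
        tenant = _tenant_id.get()
        if tenant is None:
            raise RuntimeError("No tenant bound to this request")
        return tenant

def scoped_table(table: str) -> str:
    # Schema-level isolation: route every query to the tenant's
    # own schema rather than a shared one.
    return f"{TenantContext.get()}.{table}"
```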

4. Cloud-Agnostic, Kubernetes-Native

The platform is designed to run on any Kubernetes distribution -- AKS, EKS, GKE, or on-premises clusters. Cloud-specific concerns are abstracted behind well-defined interfaces:

  • Storage: Uses Kubernetes PersistentVolumeClaims rather than cloud-specific storage APIs
  • Secrets: Kubernetes Secrets with External Secrets Operator for cloud vault integration
  • DNS: Abstracted through cert-manager and ingress controllers
  • Identity: Workload identity patterns that map to Azure AD, AWS IAM, or GCP IAM

The platform uses 55+ Helm charts for deployment, with environment-specific value overrides (values-dev.yaml, values-prod.yaml) rather than separate chart structures per cloud provider.
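The override pattern keeps one chart per service, with environments differing only in values files. A hypothetical values-dev.yaml might look like this (key names are illustrative, not the platform's actual chart values):

```yaml
# Hypothetical values-dev.yaml override; key names are illustrative.
replicaCount: 1
resources:
  requests:
    cpu: 100m
    memory: 256Mi
ingress:
  host: dev.example.internal
```

Deployment then layers the override onto the base values, e.g. `helm upgrade --install <service> ./chart -f values.yaml -f values-dev.yaml`.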

5. Convention Over Configuration

The platform establishes strong conventions that reduce decision fatigue and enforce consistency:

  • Port assignment: Every service has a registered port in scripts/config/components.yaml, the single source of truth
  • Health endpoints: Java services use /api/v1/actuator/health; Python services use /health
  • API versioning: All APIs follow the /api/v1/ prefix convention, with header-based version negotiation
  • Event naming: Events follow CATEGORY.ACTION naming (e.g., QUERY.COMPLETED, MODEL.PUBLISHED)
  • Database naming: Each service owns a dedicated database named after the service (e.g., iam, tenant, billing)

6. Observability by Default

Every service ships with built-in observability. This is not optional instrumentation added during debugging -- it is structural:

  • Structured logging via StructuredLoggingConfig in commons-java, with tenant and correlation ID enrichment
  • Distributed tracing via OpenTelemetry with @Traced annotations and automatic span propagation
  • Metrics via Micrometer with tenant-aware dimensional tagging
  • Health indicators via custom ComponentHealthCheck implementations

The observability stack (Prometheus, Grafana, Tempo, Loki) is deployed in its own namespace (matih-observability) and is accessible through the observability-api service in the Control Plane.
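The log-enrichment idea can be sketched as follows. This is not the commons-java StructuredLoggingConfig; it is a minimal Python analogue showing tenant and correlation IDs carried as first-class fields in structured (JSON) log output. Field names are illustrative.

```python
import json
import logging

# Sketch of tenant- and correlation-ID-enriched structured logging,
# in the spirit of StructuredLoggingConfig described above.

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Enrichment fields; populated per request in practice.
            "tenant_id": getattr(record, "tenant_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

def make_logger() -> logging.Logger:
    logger = logging.getLogger("matih.sketch")
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    return logger
```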


Architectural Trade-Offs

Every design decision involves trade-offs. This section documents the conscious trade-offs made in the MATIH architecture and the reasoning behind each choice.

Microservices vs. Monolith

Choice: 24 microservices across two planes.

Trade-off: The platform accepts the operational complexity of distributed systems (network partitions, distributed transactions, deployment coordination) in exchange for:

  • Independent deployment and scaling of individual services
  • Technology diversity (Java for data-intensive services, Python for AI/ML, Node.js for rendering)
  • Fault isolation -- a crashed AI service does not take down billing
  • Team autonomy -- different teams can own different services

Mitigation: The complexity cost is mitigated by:

  • Commons libraries that provide consistent behavior across services
  • A single component registry (components.yaml) as the source of truth
  • Standardized deployment via Helm charts and scripted CD pipelines
  • Comprehensive observability that makes distributed debugging tractable

Synchronous REST vs. Asynchronous Events

Choice: Hybrid communication model using both REST APIs and Kafka event streaming.

Trade-off: Maintaining two communication paradigms increases system complexity, but each serves a distinct purpose:

Pattern            | Used For                                                                   | Examples
Synchronous REST   | Request-response flows requiring immediate results                         | Query execution, dashboard rendering, user authentication
Asynchronous Kafka | Fire-and-forget notifications, event sourcing, cross-service coordination  | Audit logging, billing events, data quality alerts
Redis Pub/Sub      | Real-time updates to connected clients                                     | Live dashboard updates, agent progress streaming

The key principle is: if the caller needs to wait for the result, use REST; if the caller can proceed without the result, use events.
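For the event path, the CATEGORY.ACTION naming convention and tenant-tagged headers can be sketched together. The class below loosely mirrors the idea of a platform event envelope; its fields and validation are illustrative, not the actual DataPlaneEvent contract.

```python
import re
from dataclasses import dataclass, field

# Sketch of the CATEGORY.ACTION event convention with tenant context
# carried in headers (as on Kafka messages). Fields are illustrative.

_EVENT_NAME = re.compile(r"^[A-Z]+\.[A-Z_]+$")

@dataclass
class PlatformEvent:
    name: str                      # e.g. "QUERY.COMPLETED"
    tenant_id: str
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if not _EVENT_NAME.match(self.name):
            raise ValueError(f"Event name must be CATEGORY.ACTION: {self.name}")

    def headers(self) -> dict:
        # Tenant context travels with the event, not inside the payload.
        return {"tenant_id": self.tenant_id, "event": self.name}
```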

Polyglot vs. Single Language

Choice: Three primary languages (Java, Python, Node.js) across the platform.

Trade-off: Multiple language runtimes increase operational burden (different build tools, different container images, different debugging approaches). The platform accepts this cost because:

  • Java/Spring Boot provides the best ecosystem for enterprise data services (JDBC, Hibernate multi-tenancy, Spring Security)
  • Python/FastAPI is the natural choice for AI/ML workloads (LangChain, PyTorch, scikit-learn, pandas)
  • Node.js delivers optimal server-side rendering performance for visualization components

Mitigation: Commons libraries (commons-java, commons-python, commons-typescript) ensure behavioral consistency regardless of language. Cross-language contracts are defined through shared event schemas and API specifications.

Per-Tenant Namespaces vs. Shared Namespaces

Choice: Hybrid model -- Control Plane in a shared namespace, Data Plane in per-tenant namespaces.

Trade-off: Per-tenant namespaces increase Kubernetes resource overhead (more ServiceAccounts, NetworkPolicies, ResourceQuotas), but provide:

  • Network isolation via Kubernetes NetworkPolicies
  • Resource quota enforcement per tenant
  • Independent lifecycle management
  • Security boundary that maps naturally to organizational boundaries

Design Constraints

Several constraints shaped the architecture. Understanding these constraints helps explain why certain approaches were chosen over alternatives.

Constraint 1: Enterprise Multi-Tenancy Requirements

Enterprise customers require data isolation guarantees that go beyond logical separation. The platform must demonstrate:

  • No cross-tenant data leakage, even under failure conditions
  • Auditable access controls with per-tenant audit trails
  • Resource isolation so one tenant's workload cannot degrade another's experience
  • Independent data residency and compliance postures per tenant

These requirements drove the decision to use namespace-level isolation, per-tenant databases, and the TenantContext propagation pattern.

Constraint 2: Variable Workload Profiles

AI/ML workloads have fundamentally different resource profiles than CRUD operations:

  • A text-to-SQL request might consume 2-10 seconds of GPU time
  • A dashboard refresh is a sub-100ms database query
  • A model training job might run for hours

The platform handles this variance through:

  • Separate resource quotas per service type
  • GPU scheduling via Kubernetes device plugins (for AI/ML services)
  • Async processing via Kafka for long-running operations
  • Redis-backed session storage for stateful conversation flows
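The routing decision between inline execution and async processing can be sketched with a duration threshold. This is an assumption-laden illustration: the threshold value, queue, and job shape are all hypothetical, not platform code.

```python
# Sketch: short requests run inline (REST); long-running work is
# enqueued for async processing. Threshold and queue are illustrative.

ASYNC_THRESHOLD_SECONDS = 1.0
_queue = []

def run_inline(job):
    return job["handler"]()

def enqueue(job):
    _queue.append(job)
    return {"status": "accepted", "position": len(_queue)}

def submit(job):
    if job["estimated_seconds"] > ASYNC_THRESHOLD_SECONDS:
        return enqueue(job)        # e.g. model training, text-to-SQL
    return run_inline(job)         # e.g. sub-100ms dashboard refresh
```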

Constraint 3: Regulatory Compliance

Financial services and healthcare customers operate under strict regulatory frameworks (SOC 2, HIPAA, GDPR). The architecture accounts for this through:

  • Immutable audit logs via the audit-service with Elasticsearch backing
  • Encryption at rest and in transit (TLS everywhere, field-level encryption via EncryptionService)
  • Role-based access control with fine-grained permissions via RbacService
  • Data residency controls through per-tenant namespace placement

Constraint 4: Developer Experience

The platform must be operable by a small team. This constraint eliminated approaches that require dedicated infrastructure teams:

  • Helm-based deployment rather than custom operators (lower learning curve)
  • Script-based CD (cd-new.sh) rather than complex CI/CD platforms
  • Convention-driven configuration rather than extensive per-service setup
  • Single source of truth (components.yaml) rather than scattered configuration

Layered Architecture

Within each service, the codebase follows a layered architecture pattern:

+------------------------------------------+
|         API Layer (Controllers)          |
|  - Request validation                    |
|  - Input sanitization                    |
|  - API versioning                        |
+------------------------------------------+
|      Service Layer (Business Logic)      |
|  - Domain operations                     |
|  - Cross-cutting concerns                |
|  - Transaction management                |
+------------------------------------------+
|      Repository Layer (Persistence)      |
|  - Database access                       |
|  - Cache management                      |
|  - External service clients              |
+------------------------------------------+
|           Infrastructure Layer           |
|  - Kafka producers/consumers             |
|  - Redis connections                     |
|  - HTTP clients                          |
+------------------------------------------+

Each layer has a clear responsibility boundary:

  • The API layer handles HTTP concerns and delegates to services
  • The Service layer contains business logic and is unit-testable without infrastructure
  • The Repository layer abstracts data access behind interfaces
  • The Infrastructure layer manages connections to external systems

Security Architecture Principles

Security is not a feature -- it is a structural property of the system. The platform follows defense-in-depth with multiple security layers:

  1. Network perimeter: Kong API Gateway validates all inbound requests
  2. Authentication: JWT tokens validated at every service via JwtTokenValidator
  3. Authorization: RBAC with tenant-scoped permissions via RbacService and PermissionEvaluator
  4. Input validation: SecurityFilter and InputValidation reject malicious payloads before they reach business logic
  5. Tenant isolation: TenantContextHolder ensures every operation is scoped to the correct tenant
  6. Data encryption: EncryptionService and KeyManagementService provide field-level encryption
  7. Audit: AuditLogger records all security-relevant operations

The SecurityFilter runs near the top of the filter chain (Ordered.HIGHEST_PRECEDENCE + 10). Before any business logic executes, it validates request headers (X-Tenant-ID, X-User-ID, X-Request-ID, X-Correlation-ID), query parameters, and URI paths for injection attacks, path traversal, and other suspicious patterns.
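The screening step can be illustrated in miniature. This is not the SecurityFilter's actual logic; it is a hedged sketch showing the shape of the check, i.e. required headers plus pattern-based rejection before business logic runs. The patterns and header set here are examples only; real validation is far more thorough.

```python
import re

# Illustrative pre-business-logic screening, in the spirit of the
# SecurityFilter described above. Patterns are examples only.

_SUSPICIOUS = [
    re.compile(r"\.\./"),                                      # path traversal
    re.compile(r"(?i)<script"),                                # script injection
    re.compile(r"(?i)\b(union|drop)\b.*\b(select|table)\b"),   # crude SQL injection
]

def screen_request(path: str, headers: dict) -> bool:
    # Reject requests missing required tenant/request context.
    required = ("X-Tenant-ID", "X-Request-ID")
    if any(h not in headers for h in required):
        return False
    # Reject requests whose path or header values look malicious.
    for value in [path, *headers.values()]:
        if any(p.search(value) for p in _SUSPICIOUS):
            return False
    return True
```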


Resilience Patterns

The platform employs several resilience patterns to handle partial failures gracefully:

Circuit Breakers

Inter-service calls use CircuitBreakerConfig from commons-java to prevent cascade failures. When a downstream service becomes unhealthy, the circuit opens and requests fail fast rather than accumulating timeouts.
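The fail-fast behavior can be sketched with a minimal breaker. This is not the commons-java CircuitBreakerConfig; the thresholds, state handling, and API below are illustrative.

```python
import time

# Minimal circuit-breaker sketch: after enough consecutive failures,
# the circuit opens and calls fail fast instead of accumulating timeouts.

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # half-open: allow one probe call
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```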

Retry with Backoff

RetryableRestClient implements exponential backoff for transient failures, with configurable retry counts and jitter to prevent thundering herd problems.
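A sketch of the backoff-with-jitter idea, in the spirit of RetryableRestClient (the real client is Java; parameters and the injectable sleep hook here are illustrative):

```python
import random
import time

# Exponential backoff with full jitter: each retry waits a random
# delay in [0, base * 2^attempt), spreading retries across callers
# to avoid a thundering herd.

def retry_with_backoff(fn, attempts=4, base_delay=0.1, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise               # retries exhausted; surface the error
            delay = random.uniform(0, base_delay * (2 ** attempt))
            sleep(delay)
```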

Graceful Degradation

Services are designed to degrade gracefully when dependencies are unavailable:

  • The AI service falls back to in-memory session storage when Redis is down
  • The query engine returns cached results when the semantic layer is temporarily unreachable
  • The notification service queues messages locally when Kafka is unavailable
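The first fallback above can be sketched as a session store that degrades to local memory when its primary backend errors. The client interface and error handling below are illustrative, not the AI service's actual code.

```python
# Sketch of graceful degradation: prefer Redis, fall back to an
# in-memory dict when the connection fails, rather than failing
# the request. The redis_client interface is illustrative.

class SessionStore:
    def __init__(self, redis_client=None):
        self.redis = redis_client
        self.fallback = {}

    def put(self, key, value):
        if self.redis is not None:
            try:
                self.redis.set(key, value)
                return "redis"
            except ConnectionError:
                pass  # degrade rather than fail the request
        self.fallback[key] = value
        return "memory"

    def get(self, key):
        if self.redis is not None:
            try:
                return self.redis.get(key)
            except ConnectionError:
                pass
        return self.fallback.get(key)
```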

Health Checks

Every service implements deep health checks via ComponentHealthCheck and HealthIndicatorRegistry that verify not just that the process is running, but that its dependencies (database, cache, message broker) are reachable and functional.
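The deep-check idea can be sketched as an aggregator that probes each dependency and rolls the results up into one status, loosely in the spirit of ComponentHealthCheck (report shape and probe interface are illustrative):

```python
# Sketch of a deep health check: probe each dependency, not just
# process liveness, and aggregate into a single UP/DOWN status.

def deep_health(checks: dict) -> dict:
    components = {}
    for name, probe in checks.items():
        try:
            probe()                      # e.g. ping DB, cache, broker
            components[name] = "UP"
        except Exception as exc:
            components[name] = f"DOWN: {exc}"
    status = "UP" if all(v == "UP" for v in components.values()) else "DOWN"
    return {"status": status, "components": components}
```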


Evolution Strategy

The architecture is designed to evolve without wholesale replacement. Key extension points include:

  • New services can be added by creating a Helm chart, registering in components.yaml, and implementing the commons library interfaces
  • New event types can be introduced by extending DataPlaneEvent.EventCategory and registering handlers
  • New tenant isolation models can be supported by extending TenantIdentifierResolver
  • New cloud providers can be supported by adding Terraform modules without changing application code

The Architecture Decision Records (see ADRs) document the evolution of these decisions over time, ensuring that future contributors understand not just what was decided, but why.