MATIH Platform is in active MVP development. Documentation reflects current implementation status.
20. Appendices & Reference
Architecture Decision Records

Architecture Decision Records (ADRs) document the significant architectural decisions made during the design and evolution of the MATIH Enterprise Platform. Each ADR captures the context, the decision itself, and the anticipated consequences. ADRs serve as institutional memory, helping current and future team members understand why the platform is built the way it is.


ADR Format

Each ADR follows this standard structure:

Section                   Purpose
Title                     Short, descriptive name for the decision
Status                    Current status: Accepted, Proposed, Deprecated, or Superseded
Date                      When the decision was made or last updated
Context                   The forces and constraints that led to this decision
Decision                  The architectural choice that was made
Consequences              The positive, negative, and neutral outcomes of this decision
Alternatives Considered   Other options evaluated and reasons for rejection

ADR-0001: Microservices Architecture with Control Plane / Data Plane Split

Status: Accepted | Date: 2025-03-15 | Deciders: Platform Architecture Team

Context

MATIH aims to be a unified platform for Data Engineering, AI/ML, Business Intelligence, and Conversational Analytics. The platform must support multi-tenancy, independent scaling of different workload types, and the ability to evolve individual capabilities without coordinating monolithic deployments.

Key constraints included:

  • Different workload profiles: control plane operations (low latency, transactional) versus data plane operations (high throughput, compute-intensive, potentially GPU-accelerated)
  • Multi-tenant isolation requirements where tenant data must never cross boundaries
  • The need for independent deployment and scaling of individual services
  • Team autonomy: different teams should be able to develop and deploy their services independently
  • Cloud-agnostic deployment targeting Azure, AWS, and GCP

Decision

We adopt a microservices architecture with a clear separation between the Control Plane and the Data Plane.

Control Plane (10 services): Manages platform-wide concerns including identity, tenant lifecycle, configuration, auditing, billing, notifications, observability, infrastructure, platform registry, and API gateway. All Control Plane services are Java Spring Boot 3.2 applications running in the matih-control-plane Kubernetes namespace.

Data Plane (14 services): Handles tenant-specific workloads including AI/conversational analytics, ML model lifecycle, query execution, BI dashboards, catalog metadata, semantic layer, pipelines, data quality, governance, ontology, ops agents, server-side rendering, and agent coordination. Data Plane services use a mix of Java Spring Boot (for data-intensive services) and Python FastAPI (for AI/ML services), running in the matih-data-plane Kubernetes namespace.

Communication patterns:

  • Synchronous: REST APIs with JWT-based authentication for request/response operations
  • Asynchronous: Apache Kafka (via Strimzi) for event-driven communication and domain events
  • Real-time: WebSocket and Server-Sent Events (SSE) for streaming AI responses

Consequences

Positive:

  • Independent scaling: AI services can scale GPU resources without affecting billing infrastructure
  • Technology diversity: Python FastAPI for AI/ML workloads, Java Spring Boot for enterprise services
  • Fault isolation: a failure in the ML training pipeline does not impact the authentication system
  • Team autonomy: separate CI/CD pipelines per service
  • Clear security boundary between platform management and tenant data

Negative:

  • Increased operational complexity: 24+ services to monitor, deploy, and debug
  • Network overhead for inter-service communication
  • Distributed tracing required to understand request flows
  • Data consistency challenges requiring eventual consistency patterns

Neutral:

  • Kubernetes-native deployment is both an enabler and a dependency
  • Service mesh may be required in the future for advanced traffic management

Alternatives Considered

  1. Monolithic application: Rejected due to scaling limitations, deployment coupling, and inability to support multiple technology stacks.
  2. Modular monolith: Considered as a simpler starting point, but rejected because the Control Plane / Data Plane split requires fundamentally different deployment characteristics (CPU-bound vs GPU-bound).
  3. Serverless / function-based: Rejected due to cold start latency issues for conversational AI, difficulty managing state in LangGraph agents, and cloud-provider lock-in.

ADR-0002: Multi-Tenancy via Kubernetes Namespace Isolation

Status: Accepted | Date: 2025-03-22 | Deciders: Platform Architecture Team, Security Team

Context

MATIH is a multi-tenant SaaS platform where each customer (tenant) must have strong data isolation guarantees. Tenants may have compliance requirements (SOC 2, HIPAA, GDPR) that mandate strict separation of data and compute resources. At the same time, the platform must be cost-efficient, avoiding the overhead of entirely separate clusters per tenant.

Decision

We implement multi-tenancy using a combination of strategies:

  1. Shared services with TenantContext: Control Plane services are shared across all tenants. A TenantContext object, propagated via JWT claims and thread-local storage, ensures every database query, cache operation, and Kafka message is scoped to the correct tenant. The TenantContext is set in a servlet filter / middleware that extracts the tenant ID from the JWT and sets it on the current thread.

  2. Namespace isolation for compute: Each tenant's data plane workloads run in a dedicated Kubernetes namespace (matih-tenant-{slug}). Network policies restrict cross-namespace traffic to only explicitly permitted service-to-service calls.

  3. Dedicated database schemas: Each tenant has its own database schema within shared PostgreSQL instances (dev/staging) or dedicated database instances (production enterprise tier). Row-level security (RLS) provides defense-in-depth for shared-schema configurations.

  4. Per-tenant ingress: Enterprise tenants receive a dedicated NGINX Ingress Controller with a unique LoadBalancer IP, DNS zone, and TLS certificate. Standard tenants share a common ingress with path-based routing.

The TenantContext class in commons/commons-java/ is the canonical implementation:

public class TenantContext {
    private static final ThreadLocal<String> currentTenant = new ThreadLocal<>();

    public static void setTenantId(String tenantId) {
        currentTenant.set(tenantId);
    }

    public static String getTenantId() {
        return currentTenant.get();
    }

    // Must be called when request processing ends; application servers reuse
    // worker threads, so a missed clear() would leak tenant state.
    public static void clear() {
        currentTenant.remove();
    }
}
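
A minimal sketch of the filter pattern described above, showing how the tenant ID is bound and cleared around a request (illustrative names, not the production filter class; the JWT is assumed to be validated before the tenant claim is read):

```java
// Illustrative filter logic: bind the tenant from the validated JWT for the
// duration of the request, and always clear it in finally so pooled threads
// never carry tenant state into the next request.
final class TenantBinding {
    private static final ThreadLocal<String> currentTenant = new ThreadLocal<>();

    static String getTenantId() {
        return currentTenant.get();
    }

    static void runWithTenant(String tenantIdFromJwt, Runnable request) {
        currentTenant.set(tenantIdFromJwt);
        try {
            request.run(); // downstream data access reads getTenantId()
        } finally {
            currentTenant.remove();
        }
    }
}
```

The finally block is the critical piece: without it, one tenant's context could bleed into another tenant's request on a reused worker thread.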

Consequences

Positive:

  • Strong isolation: Kubernetes network policies prevent cross-tenant network traffic
  • Flexible isolation levels: shared infrastructure for cost-efficiency, dedicated resources for compliance
  • Audit trail: every operation is tagged with tenant context
  • Defense-in-depth: namespace isolation + database schema separation + RLS

Negative:

  • Complexity in service code: every data access layer must be tenant-aware
  • Operational overhead: managing per-tenant namespaces, network policies, and secrets
  • Thread-local TenantContext requires careful management in async/reactive code paths

Neutral:

  • Tenant provisioning is a multi-step workflow managed by the Tenant Service

Alternatives Considered

  1. Single shared database with RLS only: Rejected for enterprise customers who require physical data separation.
  2. Cluster-per-tenant: Rejected due to cost (each AKS/EKS/GKE cluster carries significant overhead) and operational complexity.
  3. Virtual clusters (vCluster): Considered but deferred due to maturity concerns and additional abstraction layer.

ADR-0003: JWT-Based Authentication with OAuth2 and MFA

Status: Accepted | Date: 2025-04-01 | Deciders: Security Team, Platform Architecture Team

Context

The platform requires a secure, scalable authentication mechanism that supports:

  • Multi-tenant user authentication
  • Single Sign-On (SSO) via enterprise identity providers (Azure AD, Okta, Google Workspace)
  • Multi-factor authentication (MFA) for compliance
  • API token authentication for programmatic access (SDKs, CI/CD)
  • Stateless validation for high-throughput data plane services
  • SCIM 2.0 provisioning for automated user lifecycle management

Decision

We implement JWT-based authentication with RS256 (RSA signature with SHA-256) signed tokens, issued by the IAM Service. The authentication flow works as follows:

  1. User authenticates via email/password or OAuth2/SAML SSO through the IAM Service
  2. MFA challenge is presented if MFA is enabled for the user or required by tenant policy
  3. JWT access token is issued with claims: sub (user ID), tenant_id, roles, permissions, exp, iat
  4. JWT refresh token is issued with a longer TTL for token renewal
  5. Downstream services validate the JWT signature using the IAM Service's public key (cached locally) and extract tenant context and authorization claims

Token structure:

{
  "header": {
    "alg": "RS256",
    "typ": "JWT",
    "kid": "key-2026-01"
  },
  "payload": {
    "sub": "usr-a1b2c3d4",
    "tenant_id": "tnt-xyz789",
    "tenant_slug": "acme",
    "email": "jane@acme.com",
    "roles": ["TENANT_ADMIN", "DATA_ENGINEER"],
    "permissions": ["dashboard:read", "dashboard:write", "query:execute"],
    "iss": "matih-iam",
    "aud": "matih-platform",
    "exp": 1739361600,
    "iat": 1739358000,
    "jti": "tok-unique-id"
  }
}

Token lifetimes:

  • Access token: 1 hour (configurable per tenant)
  • Refresh token: 7 days (configurable per tenant)
  • API token: 90 days (configurable, with rotation reminders)
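
For illustration, the claim-extraction half of downstream validation can be sketched with the standard library alone (the class name is an assumption; in practice a JWT library verifies the RS256 signature against the IAM Service's cached public key before any claim is trusted):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Claim extraction only: a JWT is three Base64URL segments joined by dots.
// Signature verification against the IAM public key MUST happen first and
// is deliberately omitted here.
final class JwtPayload {
    static String decode(String jwt) {
        String[] parts = jwt.split("\\."); // header.payload.signature
        byte[] json = Base64.getUrlDecoder().decode(parts[1]);
        return new String(json, StandardCharsets.UTF_8);
    }
}
```

Because the payload is only Base64URL-encoded (not encrypted), no secret should ever be placed in a claim.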

Consequences

Positive:

  • Stateless validation: data plane services do not need to call the IAM Service for every request
  • Standard protocols: OAuth2 and SAML are widely supported by enterprise identity providers
  • MFA adds a strong security layer for sensitive operations
  • SCIM enables automated provisioning from identity providers

Negative:

  • Token revocation requires distributed cache invalidation (solved via Redis blacklist with TTL)
  • JWT size grows with permissions, requiring careful claim management
  • RSA key rotation requires coordinated rollout across all services

Alternatives Considered

  1. Session-based authentication: Rejected because it requires sticky sessions or centralized session storage, creating a scalability bottleneck.
  2. Opaque tokens with introspection: Rejected because every API call would require a round-trip to the IAM Service.
  3. PASETO tokens: Considered for stronger cryptographic guarantees, but rejected due to limited library support in the Java/Python ecosystem.

ADR-0004: Authorization via RBAC with OPA Policy Engine

Status: Accepted | Date: 2025-04-08 | Deciders: Security Team

Context

The platform requires fine-grained authorization that supports:

  • Role-based access control (RBAC) for standard operations
  • Attribute-based access control (ABAC) for data-level policies (e.g., column masking, row filtering)
  • Dynamic policies that can be updated without code changes or redeployment
  • Tenant-specific policy customization
  • Compliance reporting on access patterns

Decision

We implement a two-tier authorization model:

Tier 1 - RBAC (Role-Based Access Control): Handled in-process by each service. JWT tokens contain role and permission claims. Services check these claims against required permissions for each endpoint. Built-in roles include: PLATFORM_ADMIN, TENANT_ADMIN, DATA_ENGINEER, DATA_ANALYST, ML_ENGINEER, BI_ANALYST, AUDITOR, VIEWER.
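
An in-process Tier-1 check reduces to testing the JWT's claims against the permission an endpoint declares. A hedged sketch (the role-to-permission table and helper names are illustrative assumptions, not the platform's actual API):

```java
import java.util.*;

// Effective permissions = explicit permission claims + permissions implied
// by roles. The role table here is a hypothetical fragment for illustration.
final class RbacCheck {
    static final Map<String, Set<String>> ROLE_PERMISSIONS = Map.of(
        "DATA_ANALYST", Set.of("dashboard:read", "query:execute"),
        "BI_ANALYST", Set.of("dashboard:read", "dashboard:write"));

    static boolean isAllowed(List<String> roles, List<String> permissions, String required) {
        if (permissions.contains(required)) return true;
        for (String role : roles) {
            if (ROLE_PERMISSIONS.getOrDefault(role, Set.of()).contains(required)) return true;
        }
        return false;
    }
}
```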

Tier 2 - OPA (Open Policy Agent): For complex authorization decisions that go beyond role checks (data masking rules, cross-tenant access, time-based policies, resource quotas), services delegate to an OPA sidecar. Policies are written in Rego and distributed via OPA bundles stored in the Config Service.

Policy evaluation flow:

Request -> JWT Validation -> RBAC Check (in-process)
                                |
                         [If complex policy needed]
                                |
                         OPA Sidecar -> Policy Decision -> Allow/Deny

Example Rego policy for data access:

package matih.governance

default allow = false

allow {
    input.user.roles[_] == "DATA_ENGINEER"
    input.resource.type == "table"
    input.action == "read"
    not is_pii_column(input.resource.column)
}

allow {
    input.user.roles[_] == "TENANT_ADMIN"
}

is_pii_column(column) {
    column.classification == "PII"
}
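
For the Tier-2 path, a service wraps its attributes in the {"input": ...} envelope that OPA's Data API expects and POSTs it to the sidecar (POST http://localhost:8181/v1/data/matih/governance/allow); the boolean "result" field of the response is the decision. The sketch below only constructs the input document, and the helper name is an assumption:

```java
// Builds the {"input": ...} document for OPA's Data API; field names mirror
// the Rego policy above. Sending the request and reading the "result" field
// is left to the service's HTTP client.
final class OpaInputBuilder {
    static String build(String role, String resourceType, String action, String classification) {
        return "{\"input\":{"
            + "\"user\":{\"roles\":[\"" + role + "\"]},"
            + "\"resource\":{\"type\":\"" + resourceType + "\","
            + "\"column\":{\"classification\":\"" + classification + "\"}},"
            + "\"action\":\"" + action + "\"}}";
    }
}
```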

Consequences

Positive:

  • Standard RBAC covers 90% of authorization needs with minimal overhead
  • OPA provides a powerful, declarative policy engine for complex scenarios
  • Policies are version-controlled and auditable
  • Tenant-specific policies can be applied without code changes

Negative:

  • Two-tier model adds complexity to the authorization flow
  • OPA sidecar adds resource overhead to every pod
  • Rego language has a learning curve for developers
  • Policy testing requires specialized tooling

Alternatives Considered

  1. RBAC only: Rejected because data-level policies (column masking, row filtering) cannot be expressed purely in role-based terms.
  2. Casbin: Considered as a lighter-weight alternative to OPA, but rejected due to limited support for complex policy logic and no sidecar deployment model.
  3. Custom policy engine: Rejected to avoid reinventing the wheel when OPA is a CNCF graduated project with strong community support.

ADR-0005: Event-Driven Architecture with Apache Kafka (Strimzi)

Status: Accepted | Date: 2025-04-15 | Deciders: Platform Architecture Team

Context

The platform requires asynchronous communication between services for domain events, audit trail capture, real-time analytics, and decoupled integration. Events include tenant provisioning state changes, AI agent traces, query execution metrics, data quality alerts, and user activity streams.

Decision

We adopt Apache Kafka as the central event backbone, deployed via the Strimzi Operator on Kubernetes. Key decisions:

  • Strimzi Operator manages Kafka cluster lifecycle within Kubernetes, providing declarative configuration via CRDs
  • Topic naming convention: matih.{domain}.{event-type} (e.g., matih.ai.state-changes, matih.tenant.provisioning)
  • Serialization: Avro with Schema Registry for schema evolution, JSON for simple events
  • Partitioning: By tenant ID to ensure ordering within a tenant
  • Security: TLS encryption (SSL protocol) for inter-broker and client communication
  • Retention: Configurable per topic (7 days for operational events, 90 days for audit events)
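
The per-tenant ordering guarantee follows directly from keying messages by tenant ID: a given key always maps to the same partition. The sketch below illustrates the principle with a simple modular hash (Kafka's default partitioner actually applies murmur2 to the serialized key bytes, but the property is the same):

```java
// Same tenant ID -> same partition -> Kafka preserves ordering for that
// tenant. Simplified stand-in for Kafka's murmur2-based default partitioner.
final class TenantPartitioner {
    static int partitionFor(String tenantId, int numPartitions) {
        return (tenantId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Note the corollary: increasing a topic's partition count remaps keys, so partition counts should be chosen generously up front.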

Consequences

Positive:

  • Decoupled services: producers and consumers evolve independently
  • Event replay: consumers can reprocess historical events for recovery or analytics
  • Real-time analytics: Flink jobs consume Kafka topics for streaming aggregations
  • Audit trail: all significant events are captured for compliance

Negative:

  • Operational complexity of managing a Kafka cluster (mitigated by Strimzi Operator)
  • Eventual consistency: consumers may see events with delay
  • Message ordering is only guaranteed within a partition (solved by partitioning by tenant ID)

Alternatives Considered

  1. RabbitMQ: Rejected due to limited replay capability and lower throughput for streaming use cases.
  2. NATS: Considered for its simplicity, but rejected due to limited ecosystem for schema management and stream processing integration.
  3. AWS SNS/SQS, Azure Service Bus: Rejected due to cloud-provider lock-in.

ADR-0006: LangGraph for Multi-Agent AI Orchestration

Status: Accepted | Date: 2025-05-01 | Deciders: AI Engineering Team

Context

The AI Service must orchestrate multiple specialized agents (intent classification, schema retrieval, SQL generation, SQL validation, query execution, visualization, guardrails) in a flexible, stateful workflow. Agents need to:

  • Execute conditionally based on previous agent outputs
  • Support cycles (e.g., retry SQL generation after validation failure)
  • Maintain conversation state across turns
  • Stream intermediate results to the frontend via WebSocket/SSE

Decision

We adopt LangGraph as the multi-agent orchestration framework. LangGraph provides a graph-based execution model where:

  • Nodes represent individual agents or processing steps
  • Edges define transitions with conditional routing
  • State is managed via a typed state dictionary passed through the graph
  • Checkpointing enables conversation persistence and replay

The orchestrator graph follows this structure:

START -> IntentClassifier -> SchemaRetriever -> SQLGenerator
             |                                      |
             v                                      v
      [Non-SQL Intent]                      SQLValidator
             |                                |        |
             v                           [Valid]   [Invalid]
      DirectResponse                        |        |
                                            v        v
                                       Executor   SQLGenerator (retry)
                                            |
                                            v
                                     Visualizer -> END
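
LangGraph itself is a Python framework, but the control flow in the diagram can be sketched language-agnostically: nodes read and write a shared state object, and a router chooses the next node, which is what makes the validate-then-regenerate cycle possible. The node logic below is a deliberately toy stand-in, not the AI Service's actual agents:

```java
import java.util.*;

// Toy sketch of the orchestrator's graph traversal: each node mutates the
// shared state map, and the next node is chosen conditionally, including
// the SQLValidator -> SQLGenerator retry edge.
final class OrchestratorSketch {
    static Map<String, Object> run(String question) {
        Map<String, Object> state = new HashMap<>();
        state.put("question", question);
        state.put("attempts", 0);
        String node = "intent";
        while (!node.equals("END")) {
            switch (node) {
                case "intent":
                    state.put("intent", question.toLowerCase().contains("how many") ? "sql" : "chat");
                    node = state.get("intent").equals("sql") ? "generate" : "respond";
                    break;
                case "generate":
                    int attempts = (int) state.get("attempts") + 1;
                    state.put("attempts", attempts);
                    // First attempt is deliberately "invalid" to exercise the retry edge.
                    state.put("sql", attempts < 2 ? "SELECT *" : "SELECT count(*) FROM orders");
                    node = "validate";
                    break;
                case "validate":
                    boolean valid = ((String) state.get("sql")).contains("count");
                    node = valid ? "execute" : "generate"; // cycle back on failure
                    break;
                case "execute":
                    state.put("result", 42);
                    node = "END";
                    break;
                case "respond":
                    state.put("result", "direct answer");
                    node = "END";
                    break;
            }
        }
        return state;
    }
}
```

In LangGraph proper, the while loop is replaced by the compiled graph, the map by a typed state schema, and each case by a registered node with conditional edges.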

Consequences

Positive:

  • Directed graph model maps naturally to multi-agent workflows
  • Built-in support for cycles (retry loops) and conditional routing
  • State management and checkpointing simplify conversation persistence
  • Streaming support for real-time feedback to users

Negative:

  • LangGraph is a relatively new framework with evolving APIs
  • Debugging graph execution requires specialized tooling (LangSmith)
  • Performance overhead of graph traversal for simple queries

Alternatives Considered

  1. LangChain sequential chains: Rejected due to lack of cycle support and limited state management.
  2. Custom orchestrator: Rejected to avoid maintaining a bespoke framework.
  3. AutoGen: Considered for multi-agent conversations, but rejected due to limited control over execution flow.

ADR-0007: Trino as the Federated Query Engine

Status: Accepted | Date: 2025-05-10 | Deciders: Data Engineering Team

Context

The platform must execute SQL queries across heterogeneous data sources (PostgreSQL, Iceberg/Parquet on object storage, ClickHouse, StarRocks, Elasticsearch) with a unified SQL interface. The query engine must support:

  • ANSI SQL compatibility
  • Federated queries joining data from multiple sources
  • Cost-based query optimization
  • Integration with the Iceberg table format via Polaris catalog
  • Horizontal scaling for concurrent query workloads

Decision

We adopt Trino (formerly PrestoSQL) as the federated query engine, configured with connectors for:

  • Iceberg (via Polaris catalog): Primary lakehouse storage
  • PostgreSQL: Operational databases
  • ClickHouse: OLAP analytics
  • Elasticsearch: Full-text search and log analytics
  • Hive Metastore: Legacy Hive-compatible data

Trino is deployed as a coordinator + worker cluster, with workers scaling based on query load.

Consequences

Positive:

  • Single SQL interface for all data sources
  • Strong ANSI SQL support
  • Active open-source community and enterprise support (Starburst)
  • Cost-based optimizer produces efficient query plans

Negative:

  • JVM memory management requires careful tuning for large queries
  • No built-in support for DML operations on some connectors
  • Coordinator is a single point of failure (mitigated by Kubernetes restart policies)

Alternatives Considered

  1. Apache Spark SQL: Considered for unified analytics, but rejected as a query engine due to higher latency for interactive queries.
  2. DuckDB: Considered for embedded analytics, but rejected due to lack of distributed query support.
  3. Presto (PrestoDB): Rejected in favor of Trino due to Trino's more active community and better Iceberg integration.

ADR-0008: Kubernetes-Native Deployment with Helm Charts

Status: Accepted | Date: 2025-03-20 | Deciders: Platform Engineering Team

Context

The platform requires a consistent deployment mechanism across development, staging, and production environments on Azure (AKS), AWS (EKS), and GCP (GKE). Deployments must be:

  • Repeatable and version-controlled
  • Configurable per environment
  • Capable of managing 55+ component deployments in a specific order
  • Integrated with CI/CD pipelines

Decision

We adopt Helm as the package manager for Kubernetes deployments. Each service has its own Helm chart under infrastructure/helm/. Environment-specific values are maintained in values-dev.yaml, values-staging.yaml, and values-prod.yaml files.

Deployment is orchestrated by a multi-phase CD pipeline (scripts/cd-new.sh) that deploys components in dependency order across 14 phases, from Terraform infrastructure through data infrastructure, observability, compute engines, control plane, data plane, and frontend.

Consequences

Positive:

  • Industry-standard tooling with strong community support
  • Template-based configuration with environment-specific overrides
  • Built-in rollback support
  • Integration with GitOps tools (ArgoCD, Flux) for production

Negative:

  • Helm's deep merge behavior requires careful management of base and override values (see Rule 4 in project guidelines)
  • Chart maintenance overhead for 55+ charts
  • Helm template debugging can be challenging for complex charts

Alternatives Considered

  1. Kustomize: Considered for its simplicity, but rejected due to limited support for complex parameterization.
  2. Pulumi: Considered for programmatic infrastructure, but rejected due to additional language runtime dependency.
  3. Raw Kubernetes manifests: Rejected due to lack of parameterization and release management.

ADR-0009: PostgreSQL as the Primary Relational Database

Status: Accepted | Date: 2025-03-18 | Deciders: Platform Architecture Team, Data Engineering Team

Context

The platform requires a relational database for:

  • Transactional data in control plane services (users, tenants, configuration, audit)
  • Operational data in data plane services (sessions, agent traces, model metadata, pipeline state)
  • Support for JSON/JSONB columns for semi-structured data
  • Row-level security for multi-tenant isolation
  • Extensions for full-text search, geospatial data, and vector embeddings (pgvector)

Decision

We adopt PostgreSQL as the primary relational database, deployed via the Bitnami Helm chart in Kubernetes. Each service uses a dedicated database schema to provide logical isolation while sharing the same PostgreSQL cluster in development and staging environments. Production deployments use Azure Database for PostgreSQL Flexible Server (or equivalent on AWS/GCP) with per-service dedicated instances for enterprise tenants.

Connection pooling is managed at the application level using HikariCP (Java) and asyncpg (Python), with pool sizes tuned per service based on workload characteristics.

Database migrations are managed by:

  • Flyway for Java Spring Boot services (version-numbered SQL scripts)
  • Alembic for Python FastAPI services (revision-based Python migration scripts)

Consequences

Positive:

  • Mature, proven technology with extensive ecosystem
  • Strong ACID guarantees for transactional workloads
  • Rich extension ecosystem (pgvector, PostGIS, pg_trgm)
  • Row-level security provides defense-in-depth for multi-tenancy
  • Managed service options on all major cloud providers

Negative:

  • Single-node write bottleneck (mitigated by read replicas and connection pooling)
  • Schema management across 20+ services requires disciplined migration practices
  • Shared PostgreSQL in dev/staging can lead to resource contention

Alternatives Considered

  1. MySQL: Rejected due to weaker JSON support, no built-in RLS, and limited extension ecosystem.
  2. CockroachDB: Considered for distributed SQL, but rejected due to operational complexity and cost.
  3. MongoDB: Rejected for services requiring strong relational integrity and multi-table transactions.

ADR Index

ADR        Title                                                   Status     Date
ADR-0001   Microservices with Control Plane / Data Plane Split     Accepted   2025-03-15
ADR-0002   Multi-Tenancy via Kubernetes Namespace Isolation        Accepted   2025-03-22
ADR-0003   JWT-Based Authentication with OAuth2 and MFA            Accepted   2025-04-01
ADR-0004   Authorization via RBAC with OPA Policy Engine           Accepted   2025-04-08
ADR-0005   Event-Driven Architecture with Apache Kafka (Strimzi)   Accepted   2025-04-15
ADR-0006   LangGraph for Multi-Agent AI Orchestration              Accepted   2025-05-01
ADR-0007   Trino as the Federated Query Engine                     Accepted   2025-05-10
ADR-0008   Kubernetes-Native Deployment with Helm Charts           Accepted   2025-03-20
ADR-0009   PostgreSQL as the Primary Relational Database           Accepted   2025-03-18

Process for New ADRs

When a significant architectural decision is needed, the following process applies:

  1. Draft: Create a new ADR document following the template above with status "Proposed"
  2. Review: Present the ADR to the Architecture Review Board (ARB) for discussion
  3. Decide: The ARB votes to accept, modify, or reject the proposal
  4. Record: Update the ADR status to "Accepted" and commit to the repository
  5. Communicate: Share the ADR with all engineering teams via the platform notification system

An ADR should be created for any decision that:

  • Affects multiple services or teams
  • Is difficult or expensive to reverse
  • Has significant trade-offs that future team members should understand
  • Changes a previously accepted ADR