MATIH Platform is in active MVP development. Documentation reflects current implementation status.
20. Appendices & Reference
Architecture Decision Records

Architecture Decision Records (ADRs) document the significant architectural decisions made during the design and evolution of the MATIH Enterprise Platform. Each ADR captures the context, the decision itself, and the anticipated consequences. ADRs serve as institutional memory, helping current and future team members understand why the platform is built the way it is.


ADR Format

Each ADR follows this standard structure:

Section                   Purpose
Title                     Short, descriptive name for the decision
Status                    Current status: Accepted, Proposed, Deprecated, or Superseded
Date                      When the decision was made or last updated
Context                   The forces and constraints that led to this decision
Decision                  The architectural choice that was made
Consequences              The positive, negative, and neutral outcomes of this decision
Alternatives Considered   Other options evaluated and reasons for rejection

ADR-0001: Microservices Architecture with Control Plane / Data Plane Split

Status: Accepted | Date: 2025-03-15 | Deciders: Platform Architecture Team

Context

MATIH aims to be a unified platform for Data Engineering, AI/ML, Business Intelligence, and Conversational Analytics. The platform must support multi-tenancy, independent scaling of different workload types, and the ability to evolve individual capabilities without coordinating monolithic deployments.

Key constraints included:

  • Different workload profiles: control plane operations (low latency, transactional) versus data plane operations (high throughput, compute-intensive, potentially GPU-accelerated)
  • Multi-tenant isolation requirements where tenant data must never cross boundaries
  • The need for independent deployment and scaling of individual services
  • Team autonomy: different teams should be able to develop and deploy their services independently
  • Cloud-agnostic deployment targeting Azure, AWS, and GCP

Decision

We adopt a microservices architecture with a clear separation between the Control Plane and the Data Plane.

Control Plane (10 services): Manages platform-wide concerns including identity, tenant lifecycle, configuration, auditing, billing, notifications, observability, infrastructure, platform registry, and API gateway. All Control Plane services are Java Spring Boot 3.2 applications running in the matih-control-plane Kubernetes namespace.

Data Plane (14 services): Handles tenant-specific workloads including AI/conversational analytics, ML model lifecycle, query execution, BI dashboards, catalog metadata, semantic layer, pipelines, data quality, governance, ontology, ops agents, server-side rendering, and agent coordination. Data Plane services use a mix of Java Spring Boot (for data-intensive services) and Python FastAPI (for AI/ML services), running in the matih-data-plane Kubernetes namespace.

Communication patterns:

  • Synchronous: REST APIs with JWT-based authentication for request/response operations
  • Asynchronous: Apache Kafka (via Strimzi) for event-driven communication and domain events
  • Real-time: WebSocket and Server-Sent Events (SSE) for streaming AI responses

Consequences

Positive:

  • Independent scaling: AI services can scale GPU resources without affecting billing infrastructure
  • Technology diversity: Python FastAPI for AI/ML workloads, Java Spring Boot for enterprise services
  • Fault isolation: a failure in the ML training pipeline does not impact the authentication system
  • Team autonomy: separate CI/CD pipelines per service
  • Clear security boundary between platform management and tenant data

Negative:

  • Increased operational complexity: 24+ services to monitor, deploy, and debug
  • Network overhead for inter-service communication
  • Distributed tracing required to understand request flows
  • Data consistency challenges requiring eventual consistency patterns

Neutral:

  • Kubernetes-native deployment is both an enabler and a dependency
  • Service mesh may be required in the future for advanced traffic management

Alternatives Considered

  1. Monolithic application: Rejected due to scaling limitations, deployment coupling, and inability to support multiple technology stacks.
  2. Modular monolith: Considered as a simpler starting point, but rejected because the Control Plane / Data Plane split requires fundamentally different deployment characteristics (CPU-bound vs GPU-bound).
  3. Serverless / function-based: Rejected due to cold start latency issues for conversational AI, difficulty managing state in LangGraph agents, and cloud-provider lock-in.

ADR-0002: Multi-Tenancy via Kubernetes Namespace Isolation

Status: Accepted | Date: 2025-03-22 | Deciders: Platform Architecture Team, Security Team

Context

MATIH is a multi-tenant SaaS platform where each customer (tenant) must have strong data isolation guarantees. Tenants may have compliance requirements (SOC 2, HIPAA, GDPR) that mandate strict separation of data and compute resources. At the same time, the platform must be cost-efficient, avoiding the overhead of entirely separate clusters per tenant.

Decision

We implement multi-tenancy using a combination of strategies:

  1. Shared services with TenantContext: Control Plane services are shared across all tenants. A TenantContext object, propagated via JWT claims and thread-local storage, ensures every database query, cache operation, and Kafka message is scoped to the correct tenant. The TenantContext is set in a servlet filter / middleware that extracts the tenant ID from the JWT and sets it on the current thread.

  2. Namespace isolation for compute: Each tenant's data plane workloads run in a dedicated Kubernetes namespace (matih-tenant-{slug}). Network policies restrict cross-namespace traffic to only explicitly permitted service-to-service calls.

  3. Dedicated database schemas: Each tenant has its own database schema within shared PostgreSQL instances (dev/staging) or dedicated database instances (production enterprise tier). Row-level security (RLS) provides defense-in-depth for shared-schema configurations.

  4. Per-tenant ingress: Enterprise tenants receive a dedicated NGINX Ingress Controller with a unique LoadBalancer IP, DNS zone, and TLS certificate. Standard tenants share a common ingress with path-based routing.

The TenantContext class in commons/commons-java/ is the canonical implementation:

public class TenantContext {
    private static final ThreadLocal<String> currentTenant = new ThreadLocal<>();

    public static void setTenantId(String tenantId) {
        currentTenant.set(tenantId);
    }

    public static String getTenantId() {
        return currentTenant.get();
    }

    // Must be called when request processing ends; application servers reuse
    // worker threads, so a missed clear() would leak tenant state.
    public static void clear() {
        currentTenant.remove();
    }
}
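
A minimal sketch of the filter pattern described above, showing how the tenant ID is bound and cleared around a request (illustrative names, not the production filter class; the JWT is assumed to be validated before the tenant claim is read):

```java
// Illustrative filter logic: bind the tenant from the validated JWT for the
// duration of the request, and always clear it in finally so pooled threads
// never carry tenant state into the next request.
final class TenantBinding {
    private static final ThreadLocal<String> currentTenant = new ThreadLocal<>();

    static String getTenantId() {
        return currentTenant.get();
    }

    static void runWithTenant(String tenantIdFromJwt, Runnable request) {
        currentTenant.set(tenantIdFromJwt);
        try {
            request.run(); // downstream data access reads getTenantId()
        } finally {
            currentTenant.remove();
        }
    }
}
```

The finally block is the critical piece: without it, one tenant's context could bleed into another tenant's request on a reused worker thread.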

Consequences

Positive:

  • Strong isolation: Kubernetes network policies prevent cross-tenant network traffic
  • Flexible isolation levels: shared infrastructure for cost-efficiency, dedicated resources for compliance
  • Audit trail: every operation is tagged with tenant context
  • Defense-in-depth: namespace isolation + database schema separation + RLS

Negative:

  • Complexity in service code: every data access layer must be tenant-aware
  • Operational overhead: managing per-tenant namespaces, network policies, and secrets
  • Thread-local TenantContext requires careful management in async/reactive code paths

Neutral:

  • Tenant provisioning is a multi-step workflow managed by the Tenant Service

Alternatives Considered

  1. Single shared database with RLS only: Rejected for enterprise customers who require physical data separation.
  2. Cluster-per-tenant: Rejected due to cost (each AKS/EKS/GKE cluster carries significant overhead) and operational complexity.
  3. Virtual clusters (vCluster): Considered but deferred due to maturity concerns and additional abstraction layer.

ADR-0003: JWT-Based Authentication with OAuth2 and MFA

Status: Accepted | Date: 2025-04-01 | Deciders: Security Team, Platform Architecture Team

Context

The platform requires a secure, scalable authentication mechanism that supports:

  • Multi-tenant user authentication
  • Single Sign-On (SSO) via enterprise identity providers (Azure AD, Okta, Google Workspace)
  • Multi-factor authentication (MFA) for compliance
  • API token authentication for programmatic access (SDKs, CI/CD)
  • Stateless validation for high-throughput data plane services
  • SCIM 2.0 provisioning for automated user lifecycle management

Decision

We implement JWT-based authentication with RS256 (RSA signature with SHA-256) signed tokens, issued by the IAM Service. The authentication flow works as follows:

  1. User authenticates via email/password or OAuth2/SAML SSO through the IAM Service
  2. MFA challenge is presented if MFA is enabled for the user or required by tenant policy
  3. JWT access token is issued with claims: sub (user ID), tenant_id, roles, permissions, exp, iat
  4. JWT refresh token is issued with a longer TTL for token renewal
  5. Downstream services validate the JWT signature using the IAM Service's public key (cached locally) and extract tenant context and authorization claims

Token structure:

{
  "header": {
    "alg": "RS256",
    "typ": "JWT",
    "kid": "key-2026-01"
  },
  "payload": {
    "sub": "usr-a1b2c3d4",
    "tenant_id": "tnt-xyz789",
    "tenant_slug": "acme",
    "email": "jane@acme.com",
    "roles": ["TENANT_ADMIN", "DATA_ENGINEER"],
    "permissions": ["dashboard:read", "dashboard:write", "query:execute"],
    "iss": "matih-iam",
    "aud": "matih-platform",
    "exp": 1739361600,
    "iat": 1739358000,
    "jti": "tok-unique-id"
  }
}

Token lifetimes:

  • Access token: 1 hour (configurable per tenant)
  • Refresh token: 7 days (configurable per tenant)
  • API token: 90 days (configurable, with rotation reminders)
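
For illustration, the claim-extraction half of downstream validation can be sketched with the standard library alone (the class name is an assumption; in practice a JWT library verifies the RS256 signature against the IAM Service's cached public key before any claim is trusted):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Claim extraction only: a JWT is three Base64URL segments joined by dots.
// Signature verification against the IAM public key MUST happen first and
// is deliberately omitted here.
final class JwtPayload {
    static String decode(String jwt) {
        String[] parts = jwt.split("\\."); // header.payload.signature
        byte[] json = Base64.getUrlDecoder().decode(parts[1]);
        return new String(json, StandardCharsets.UTF_8);
    }
}
```

Because the payload is only Base64URL-encoded (not encrypted), no secret should ever be placed in a claim.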

Consequences

Positive:

  • Stateless validation: data plane services do not need to call the IAM Service for every request
  • Standard protocols: OAuth2 and SAML are widely supported by enterprise identity providers
  • MFA adds a strong security layer for sensitive operations
  • SCIM enables automated provisioning from identity providers

Negative:

  • Token revocation requires distributed cache invalidation (solved via Redis blacklist with TTL)
  • JWT size grows with permissions, requiring careful claim management
  • RSA key rotation requires coordinated rollout across all services

Alternatives Considered

  1. Session-based authentication: Rejected because it requires sticky sessions or centralized session storage, creating a scalability bottleneck.
  2. Opaque tokens with introspection: Rejected because every API call would require a round-trip to the IAM Service.
  3. PASETO tokens: Considered for stronger cryptographic guarantees, but rejected due to limited library support in the Java/Python ecosystem.

ADR-0004: Authorization via RBAC with OPA Policy Engine

Status: Accepted | Date: 2025-04-08 | Deciders: Security Team

Context

The platform requires fine-grained authorization that supports:

  • Role-based access control (RBAC) for standard operations
  • Attribute-based access control (ABAC) for data-level policies (e.g., column masking, row filtering)
  • Dynamic policies that can be updated without code changes or redeployment
  • Tenant-specific policy customization
  • Compliance reporting on access patterns

Decision

We implement a two-tier authorization model:

Tier 1 - RBAC (Role-Based Access Control): Handled in-process by each service. JWT tokens contain role and permission claims. Services check these claims against required permissions for each endpoint. Built-in roles include: PLATFORM_ADMIN, TENANT_ADMIN, DATA_ENGINEER, DATA_ANALYST, ML_ENGINEER, BI_ANALYST, AUDITOR, VIEWER.
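
An in-process Tier-1 check reduces to testing the JWT's claims against the permission an endpoint declares. A hedged sketch (the role-to-permission table and helper names are illustrative assumptions, not the platform's actual API):

```java
import java.util.*;

// Effective permissions = explicit permission claims + permissions implied
// by roles. The role table here is a hypothetical fragment for illustration.
final class RbacCheck {
    static final Map<String, Set<String>> ROLE_PERMISSIONS = Map.of(
        "DATA_ANALYST", Set.of("dashboard:read", "query:execute"),
        "BI_ANALYST", Set.of("dashboard:read", "dashboard:write"));

    static boolean isAllowed(List<String> roles, List<String> permissions, String required) {
        if (permissions.contains(required)) return true;
        for (String role : roles) {
            if (ROLE_PERMISSIONS.getOrDefault(role, Set.of()).contains(required)) return true;
        }
        return false;
    }
}
```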

Tier 2 - OPA (Open Policy Agent): For complex authorization decisions that go beyond role checks (data masking rules, cross-tenant access, time-based policies, resource quotas), services delegate to an OPA sidecar. Policies are written in Rego and distributed via OPA bundles stored in the Config Service.

Policy evaluation flow:

Request -> JWT Validation -> RBAC Check (in-process)
                                |
                         [If complex policy needed]
                                |
                         OPA Sidecar -> Policy Decision -> Allow/Deny

Example Rego policy for data access:

package matih.governance

default allow = false

allow {
    input.user.roles[_] == "DATA_ENGINEER"
    input.resource.type == "table"
    input.action == "read"
    not is_pii_column(input.resource.column)
}

allow {
    input.user.roles[_] == "TENANT_ADMIN"
}

is_pii_column(column) {
    column.classification == "PII"
}
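
For the Tier-2 path, a service wraps its attributes in the {"input": ...} envelope that OPA's Data API expects and POSTs it to the sidecar (POST http://localhost:8181/v1/data/matih/governance/allow); the boolean "result" field of the response is the decision. The sketch below only constructs the input document, and the helper name is an assumption:

```java
// Builds the {"input": ...} document for OPA's Data API; field names mirror
// the Rego policy above. Sending the request and reading the "result" field
// is left to the service's HTTP client.
final class OpaInputBuilder {
    static String build(String role, String resourceType, String action, String classification) {
        return "{\"input\":{"
            + "\"user\":{\"roles\":[\"" + role + "\"]},"
            + "\"resource\":{\"type\":\"" + resourceType + "\","
            + "\"column\":{\"classification\":\"" + classification + "\"}},"
            + "\"action\":\"" + action + "\"}}";
    }
}
```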

Consequences

Positive:

  • Standard RBAC covers 90% of authorization needs with minimal overhead
  • OPA provides a powerful, declarative policy engine for complex scenarios
  • Policies are version-controlled and auditable
  • Tenant-specific policies can be applied without code changes

Negative:

  • Two-tier model adds complexity to the authorization flow
  • OPA sidecar adds resource overhead to every pod
  • Rego language has a learning curve for developers
  • Policy testing requires specialized tooling

Alternatives Considered

  1. RBAC only: Rejected because data-level policies (column masking, row filtering) cannot be expressed purely in role-based terms.
  2. Casbin: Considered as a lighter-weight alternative to OPA, but rejected due to limited support for complex policy logic and no sidecar deployment model.
  3. Custom policy engine: Rejected to avoid reinventing the wheel when OPA is a CNCF graduated project with strong community support.

ADR-0005: Event-Driven Architecture with Apache Kafka (Strimzi)

Status: Accepted | Date: 2025-04-15 | Deciders: Platform Architecture Team

Context

The platform requires asynchronous communication between services for domain events, audit trail capture, real-time analytics, and decoupled integration. Events include tenant provisioning state changes, AI agent traces, query execution metrics, data quality alerts, and user activity streams.

Decision

We adopt Apache Kafka as the central event backbone, deployed via the Strimzi Operator on Kubernetes. Key decisions:

  • Strimzi Operator manages Kafka cluster lifecycle within Kubernetes, providing declarative configuration via CRDs
  • Topic naming convention: matih.{domain}.{event-type} (e.g., matih.ai.state-changes, matih.tenant.provisioning)
  • Serialization: Avro with Schema Registry for schema evolution, JSON for simple events
  • Partitioning: By tenant ID to ensure ordering within a tenant
  • Security: TLS encryption (SSL protocol) for inter-broker and client communication
  • Retention: Configurable per topic (7 days for operational events, 90 days for audit events)
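
The per-tenant ordering guarantee follows directly from keying messages by tenant ID: a given key always maps to the same partition. The sketch below illustrates the principle with a simple modular hash (Kafka's default partitioner actually applies murmur2 to the serialized key bytes, but the property is the same):

```java
// Same tenant ID -> same partition -> Kafka preserves ordering for that
// tenant. Simplified stand-in for Kafka's murmur2-based default partitioner.
final class TenantPartitioner {
    static int partitionFor(String tenantId, int numPartitions) {
        return (tenantId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Note the corollary: increasing a topic's partition count remaps keys, so partition counts should be chosen generously up front.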

Consequences

Positive:

  • Decoupled services: producers and consumers evolve independently
  • Event replay: consumers can reprocess historical events for recovery or analytics
  • Real-time analytics: Flink jobs consume Kafka topics for streaming aggregations
  • Audit trail: all significant events are captured for compliance

Negative:

  • Operational complexity of managing a Kafka cluster (mitigated by Strimzi Operator)
  • Eventual consistency: consumers may see events with delay
  • Message ordering is only guaranteed within a partition (solved by partitioning by tenant ID)

Alternatives Considered

  1. RabbitMQ: Rejected due to limited replay capability and lower throughput for streaming use cases.
  2. NATS: Considered for its simplicity, but rejected due to limited ecosystem for schema management and stream processing integration.
  3. AWS SNS/SQS, Azure Service Bus: Rejected due to cloud-provider lock-in.

ADR-0006: LangGraph for Multi-Agent AI Orchestration

Status: Accepted | Date: 2025-05-01 | Deciders: AI Engineering Team

Context

The AI Service must orchestrate multiple specialized agents (intent classification, schema retrieval, SQL generation, SQL validation, query execution, visualization, guardrails) in a flexible, stateful workflow. Agents need to:

  • Execute conditionally based on previous agent outputs
  • Support cycles (e.g., retry SQL generation after validation failure)
  • Maintain conversation state across turns
  • Stream intermediate results to the frontend via WebSocket/SSE

Decision

We adopt LangGraph as the multi-agent orchestration framework. LangGraph provides a graph-based execution model where:

  • Nodes represent individual agents or processing steps
  • Edges define transitions with conditional routing
  • State is managed via a typed state dictionary passed through the graph
  • Checkpointing enables conversation persistence and replay

The orchestrator graph follows this structure:

START -> IntentClassifier -> SchemaRetriever -> SQLGenerator
             |                                      |
             v                                      v
      [Non-SQL Intent]                      SQLValidator
             |                                |        |
             v                           [Valid]   [Invalid]
      DirectResponse                        |        |
                                            v        v
                                       Executor   SQLGenerator (retry)
                                            |
                                            v
                                     Visualizer -> END
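
LangGraph itself is a Python framework, but the control flow in the diagram can be sketched language-agnostically: nodes read and write a shared state object, and a router chooses the next node, which is what makes the validate-then-regenerate cycle possible. The node logic below is a deliberately toy stand-in, not the AI Service's actual agents:

```java
import java.util.*;

// Toy sketch of the orchestrator's graph traversal: each node mutates the
// shared state map, and the next node is chosen conditionally, including
// the SQLValidator -> SQLGenerator retry edge.
final class OrchestratorSketch {
    static Map<String, Object> run(String question) {
        Map<String, Object> state = new HashMap<>();
        state.put("question", question);
        state.put("attempts", 0);
        String node = "intent";
        while (!node.equals("END")) {
            switch (node) {
                case "intent":
                    state.put("intent", question.toLowerCase().contains("how many") ? "sql" : "chat");
                    node = state.get("intent").equals("sql") ? "generate" : "respond";
                    break;
                case "generate":
                    int attempts = (int) state.get("attempts") + 1;
                    state.put("attempts", attempts);
                    // First attempt is deliberately "invalid" to exercise the retry edge.
                    state.put("sql", attempts < 2 ? "SELECT *" : "SELECT count(*) FROM orders");
                    node = "validate";
                    break;
                case "validate":
                    boolean valid = ((String) state.get("sql")).contains("count");
                    node = valid ? "execute" : "generate"; // cycle back on failure
                    break;
                case "execute":
                    state.put("result", 42);
                    node = "END";
                    break;
                case "respond":
                    state.put("result", "direct answer");
                    node = "END";
                    break;
            }
        }
        return state;
    }
}
```

In LangGraph proper, the while loop is replaced by the compiled graph, the map by a typed state schema, and each case by a registered node with conditional edges.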

Consequences

Positive:

  • Directed graph model maps naturally to multi-agent workflows
  • Built-in support for cycles (retry loops) and conditional routing
  • State management and checkpointing simplify conversation persistence
  • Streaming support for real-time feedback to users

Negative:

  • LangGraph is a relatively new framework with evolving APIs
  • Debugging graph execution requires specialized tooling (LangSmith)
  • Performance overhead of graph traversal for simple queries

Alternatives Considered

  1. LangChain sequential chains: Rejected due to lack of cycle support and limited state management.
  2. Custom orchestrator: Rejected to avoid maintaining a bespoke framework.
  3. AutoGen: Considered for multi-agent conversations, but rejected due to limited control over execution flow.

ADR-0007: Trino as the Federated Query Engine

Status: Accepted | Date: 2025-05-10 | Deciders: Data Engineering Team

Context

The platform must execute SQL queries across heterogeneous data sources (PostgreSQL, Iceberg/Parquet on object storage, ClickHouse, StarRocks, Elasticsearch) with a unified SQL interface. The query engine must support:

  • ANSI SQL compatibility
  • Federated queries joining data from multiple sources
  • Cost-based query optimization
  • Integration with the Iceberg table format via Polaris catalog
  • Horizontal scaling for concurrent query workloads

Decision

We adopt Trino (formerly PrestoSQL) as the federated query engine, configured with connectors for:

  • Iceberg (via Polaris catalog): Primary lakehouse storage
  • PostgreSQL: Operational databases
  • ClickHouse: OLAP analytics
  • Elasticsearch: Full-text search and log analytics
  • Hive Metastore: Legacy Hive-compatible data

Trino is deployed as a coordinator + worker cluster, with workers scaling based on query load.

Consequences

Positive:

  • Single SQL interface for all data sources
  • Strong ANSI SQL support
  • Active open-source community and enterprise support (Starburst)
  • Cost-based optimizer produces efficient query plans

Negative:

  • JVM memory management requires careful tuning for large queries
  • No built-in support for DML operations on some connectors
  • Coordinator is a single point of failure (mitigated by Kubernetes restart policies)

Alternatives Considered

  1. Apache Spark SQL: Considered for unified analytics, but rejected as a query engine due to higher latency for interactive queries.
  2. DuckDB: Considered for embedded analytics, but rejected due to lack of distributed query support.
  3. Presto (PrestoDB): Rejected in favor of Trino due to Trino's more active community and better Iceberg integration.

ADR-0008: Kubernetes-Native Deployment with Helm Charts

Status: Accepted | Date: 2025-03-20 | Deciders: Platform Engineering Team

Context

The platform requires a consistent deployment mechanism across development, staging, and production environments on Azure (AKS), AWS (EKS), and GCP (GKE). Deployments must be:

  • Repeatable and version-controlled
  • Configurable per environment
  • Capable of managing 55+ component deployments in a specific order
  • Integrated with CI/CD pipelines

Decision

We adopt Helm as the package manager for Kubernetes deployments. Each service has its own Helm chart under infrastructure/helm/. Environment-specific values are maintained in values-dev.yaml, values-staging.yaml, and values-prod.yaml files.

Deployment is orchestrated by a multi-phase CD pipeline (scripts/cd-new.sh) that deploys components in dependency order across 14 phases, from Terraform infrastructure through data infrastructure, observability, compute engines, control plane, data plane, and frontend.

Consequences

Positive:

  • Industry-standard tooling with strong community support
  • Template-based configuration with environment-specific overrides
  • Built-in rollback support
  • Integration with GitOps tools (ArgoCD, Flux) for production

Negative:

  • Helm's deep merge behavior requires careful management of base and override values (see Rule 4 in project guidelines)
  • Chart maintenance overhead for 55+ charts
  • Helm template debugging can be challenging for complex charts

Alternatives Considered

  1. Kustomize: Considered for its simplicity, but rejected due to limited support for complex parameterization.
  2. Pulumi: Considered for programmatic infrastructure, but rejected due to additional language runtime dependency.
  3. Raw Kubernetes manifests: Rejected due to lack of parameterization and release management.

ADR-0009: PostgreSQL as the Primary Relational Database

Status: Accepted | Date: 2025-03-18 | Deciders: Platform Architecture Team, Data Engineering Team

Context

The platform requires a relational database for:

  • Transactional data in control plane services (users, tenants, configuration, audit)
  • Operational data in data plane services (sessions, agent traces, model metadata, pipeline state)
  • Support for JSON/JSONB columns for semi-structured data
  • Row-level security for multi-tenant isolation
  • Extensions for full-text search, geospatial data, and vector embeddings (pgvector)

Decision

We adopt PostgreSQL as the primary relational database, deployed via the Bitnami Helm chart in Kubernetes. Each service uses a dedicated database schema to provide logical isolation while sharing the same PostgreSQL cluster in development and staging environments. Production deployments use Azure Database for PostgreSQL Flexible Server (or equivalent on AWS/GCP) with per-service dedicated instances for enterprise tenants.

Connection pooling is managed at the application level using HikariCP (Java) and asyncpg (Python), with pool sizes tuned per service based on workload characteristics.

Database migrations are managed by:

  • Flyway for Java Spring Boot services (version-numbered SQL scripts)
  • Alembic for Python FastAPI services (revision-based Python migration scripts)

Consequences

Positive:

  • Mature, proven technology with extensive ecosystem
  • Strong ACID guarantees for transactional workloads
  • Rich extension ecosystem (pgvector, PostGIS, pg_trgm)
  • Row-level security provides defense-in-depth for multi-tenancy
  • Managed service options on all major cloud providers

Negative:

  • Single-node write bottleneck (mitigated by read replicas and connection pooling)
  • Schema management across 20+ services requires disciplined migration practices
  • Shared PostgreSQL in dev/staging can lead to resource contention

Alternatives Considered

  1. MySQL: Rejected due to weaker JSON support, no built-in RLS, and limited extension ecosystem.
  2. CockroachDB: Considered for distributed SQL, but rejected due to operational complexity and cost.
  3. MongoDB: Rejected for services requiring strong relational integrity and multi-table transactions.

ADR Index

ADR        Title                                                   Status     Date
ADR-0001   Microservices with Control Plane / Data Plane Split     Accepted   2025-03-15
ADR-0002   Multi-Tenancy via Kubernetes Namespace Isolation        Accepted   2025-03-22
ADR-0003   JWT-Based Authentication with OAuth2 and MFA            Accepted   2025-04-01
ADR-0004   Authorization via RBAC with OPA Policy Engine           Accepted   2025-04-08
ADR-0005   Event-Driven Architecture with Apache Kafka (Strimzi)   Accepted   2025-04-15
ADR-0006   LangGraph for Multi-Agent AI Orchestration              Accepted   2025-05-01
ADR-0007   Trino as the Federated Query Engine                     Accepted   2025-05-10
ADR-0008   Kubernetes-Native Deployment with Helm Charts           Accepted   2025-03-20
ADR-0009   PostgreSQL as the Primary Relational Database           Accepted   2025-03-18

Process for New ADRs

When a significant architectural decision is needed, the following process applies:

  1. Draft: Create a new ADR document following the template above with status "Proposed"
  2. Review: Present the ADR to the Architecture Review Board (ARB) for discussion
  3. Decide: The ARB votes to accept, modify, or reject the proposal
  4. Record: Update the ADR status to "Accepted" and commit to the repository
  5. Communicate: Share the ADR with all engineering teams via the platform notification system

An ADR should be created for any decision that:

  • Affects multiple services or teams
  • Is difficult or expensive to reverse
  • Has significant trade-offs that future team members should understand
  • Changes a previously accepted ADR