# Architecture Decision Records
Architecture Decision Records (ADRs) document the significant architectural decisions made during the design and evolution of the MATIH platform. Each record captures the context, the decision, the alternatives considered, and the consequences -- ensuring that future contributors understand not just what was decided, but why.
## ADR Index
| ADR | Title | Status | Date |
|---|---|---|---|
| ADR-0001 | Platform Architecture | Accepted | 2025-03 |
| ADR-0002 | Multi-Tenancy Model | Accepted | 2025-03 |
| ADR-0003 | Authentication Strategy | Accepted | 2025-04 |
| ADR-0004 | Authorization Model | Accepted | 2025-04 |
| ADR-0005 | Event-Driven Communication | Accepted | 2025-05 |
| ADR-0006 | API Gateway Selection | Accepted | 2025-05 |
| ADR-0007 | Data Plane Technology Mix | Accepted | 2025-06 |
| ADR-0008 | Observability Strategy | Accepted | 2025-07 |
| ADR-0009 | Configuration Management | Accepted | 2025-08 |
## ADR-0001: Platform Architecture
**Status:** Accepted
**Context:**
The MATIH platform needs to serve as a unified data/AI/ML/BI platform for enterprise customers. The architecture must support:
- Multiple tenants with strict data isolation
- Diverse workloads (OLAP queries, AI inference, ML training, dashboard rendering)
- Cloud-agnostic deployment (Azure, AWS, GCP, on-premises)
- Independent scaling of different platform capabilities
- Small operations team (platform must be operable without dedicated SRE)
**Decision:**
Adopt a two-plane microservices architecture:
- Control Plane: 10 Java/Spring Boot services for platform management, deployed in a shared namespace
- Data Plane: 14 polyglot services for tenant workloads, deployed in per-tenant namespaces
- Kubernetes-native: All services deployed as Helm charts on Kubernetes
- Event-driven: Kafka for asynchronous communication, REST for synchronous
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Monolithic application | Cannot scale individual capabilities independently; single technology stack would be suboptimal for both Java enterprise services and Python AI/ML workloads |
| Serverless (Lambda/Functions) | Cold start latency unacceptable for interactive analytics; state management complexity; vendor lock-in contradicts cloud-agnostic requirement |
| Single-plane microservices | No clear separation between management and workload concerns; harder to enforce tenant isolation; scaling management services affects workload services |
| Service mesh (Istio-based) | Complexity overhead too high for initial team size; decided to evaluate after MVP phase |
**Consequences:**
- 24 services to build, deploy, and monitor (operational complexity)
- Need for commons libraries to enforce consistency across services
- Need for standardized deployment scripts and CI/CD pipelines
- Clear security boundary between management and workload operations
- Independent scaling of Control Plane and Data Plane
- Ability to use optimal technology stack per problem domain
## ADR-0002: Multi-Tenancy Model
**Status:** Accepted
**Context:**
Enterprise customers require strong tenant isolation guarantees. The platform must prevent cross-tenant data access, limit resource consumption per tenant, and provide auditable isolation boundaries. At the same time, operational efficiency requires shared infrastructure where safe to do so.
**Decision:**
Adopt a hybrid isolation model with four layers:
- Kubernetes namespace isolation: Each tenant gets a dedicated namespace with NetworkPolicies and ResourceQuotas
- Database schema isolation: Each tenant's data is stored in a separate PostgreSQL schema within a shared database
- Application-level isolation: `TenantContextHolder` (ThreadLocal) ensures every operation is scoped to the correct tenant
- Event isolation: Kafka messages are keyed by `tenant_id` for partition affinity
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Shared schema with tenant column | Insufficient isolation -- a bug in a WHERE clause could leak data across tenants; no database-level enforcement |
| Database per tenant | Excessive resource consumption -- each PostgreSQL instance has memory overhead; connection pool management becomes complex at 100+ tenants |
| VM per tenant | Cost-prohibitive at scale; long provisioning times; underutilized resources for small tenants |
| Container per tenant (no namespace) | Insufficient network isolation; no resource quota enforcement; no RBAC boundary |
**Consequences:**
- `TenantContext` must be propagated through every layer (filter chain, service layer, repository layer, event publishing)
- Thread pool dispatch requires explicit context wrapping (`wrapWithContext`)
- Every database query is automatically scoped via the Hibernate `TenantIdentifierResolver`
- Redis keys, Kafka messages, and log entries must carry tenant context
- Tenant provisioning requires creating a Kubernetes namespace, database schemas, and network policies
- Operational queries (platform admin) must use the `SYSTEM_TENANT` context
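The `TenantContextHolder` and `wrapWithContext` mechanics described above can be sketched as follows. This is a minimal illustration under assumed internals, not the platform's actual commons-java code:

```java
// Minimal sketch of the ThreadLocal-based tenant context described in this ADR.
// Class internals are assumptions; the real commons-java code may differ.
final class TenantContextHolder {
    private static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    static void set(String tenantId) { CURRENT_TENANT.set(tenantId); }
    static String get() { return CURRENT_TENANT.get(); }
    static void clear() { CURRENT_TENANT.remove(); }

    // Captures the caller's tenant on the submitting thread so the context
    // survives dispatch to a thread pool, then cleans up after the task runs.
    static Runnable wrapWithContext(Runnable task) {
        String tenantId = get();
        return () -> {
            set(tenantId);
            try {
                task.run();
            } finally {
                clear();
            }
        };
    }
}
```

The failure mode this guards against: a bare `executor.submit(task)` runs on a pool thread with no tenant scope, which is exactly why the consequence list requires explicit context wrapping at every thread-pool boundary.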
## ADR-0003: Authentication Strategy
**Status:** Accepted
**Context:**
The platform needs a unified authentication mechanism that works across:
- Browser-based single-page applications (SPAs)
- CLI tools and API clients
- Service-to-service communication
- Long-lived API integrations
**Decision:**
Use JWT-based authentication with four token types:
- Access tokens (15-minute expiry): Short-lived tokens for user requests, carrying `tenant_id` and `roles` as claims
- Refresh tokens (7-day expiry): Used to obtain new access tokens without re-authentication
- Service tokens (5-minute expiry): Short-lived tokens for inter-service communication
- API key tokens (configurable expiry): Long-lived tokens for programmatic access
All tokens are signed with HMAC-SHA256 and validated at every service boundary.
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Session-based authentication | Does not work well for API clients and service-to-service calls; requires sticky sessions or shared session store |
| OAuth2 with external IdP only | Adds latency for token introspection on every request; external dependency for critical path |
| mTLS only | Excellent for service-to-service, but poor developer experience for browser-based SPAs; complex certificate management |
| API keys only | No expiration enforcement without custom infrastructure; no standard claims structure |
**Consequences:**
- JWT secret must be securely distributed to all services
- Token validation can happen locally at each service (no round-trip to IAM)
- `tenant_id` is embedded in the token and cannot be forged by clients
- Short access token expiry (15 min) limits the window for stolen tokens
- Refresh tokens require secure storage on the client side
- Service tokens enable zero-trust inter-service communication
- Token blacklisting requires a shared Redis store for immediate revocation
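To make the token design concrete, here is a hand-rolled sketch of issuing and locally validating an HS256-signed token carrying a `tenant_id` claim. The claim layout follows this ADR, but the class and method names are assumptions, and a real service would use a vetted JWT library rather than this:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Illustrative HS256 token sketch; not the platform's actual IAM code.
final class JwtSketch {
    private static String b64(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    private static String hmacSha256(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return b64(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Issues header.payload.signature with tenant_id and exp claims.
    static String issue(String tenantId, long expEpochSeconds, byte[] secret) {
        String header = b64("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = b64(("{\"tenant_id\":\"" + tenantId + "\",\"exp\":" + expEpochSeconds + "}")
                .getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + hmacSha256(signingInput, secret);
    }

    // Local validation: recompute the signature and compare in constant time.
    // No round-trip to the IAM service is required.
    static boolean hasValidSignature(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) return false;
        String expected = hmacSha256(token.substring(0, lastDot), secret);
        String actual = token.substring(lastDot + 1);
        return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
                                     actual.getBytes(StandardCharsets.UTF_8));
    }
}
```

The local recomputation is the property the decision relies on: every service can validate tokens without calling IAM. Full validation would additionally check `exp` and any revocation blacklist.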
## ADR-0004: Authorization Model
**Status:** Accepted
**Context:**
The platform needs a fine-grained authorization model that supports:
- Role-based access control (RBAC) for platform-level permissions
- Tenant-scoped permissions (a user's role may differ between tenants)
- Resource-level permissions (e.g., "can edit this specific dashboard")
- Data-level access control (column-level and row-level security)
**Decision:**
Implement a hierarchical RBAC model with three layers:
- Platform roles: Global permissions (e.g., `platform_admin`, `tenant_creator`)
- Tenant roles: Tenant-scoped permissions (e.g., `data_analyst`, `dashboard_editor`)
- Resource permissions: Fine-grained access to specific resources
Authorization is enforced at two levels:
- Application layer: `RbacService` and `PermissionEvaluator` in commons-java check permissions on every API call
- Data layer: The governance service enforces column-level and row-level security via Apache Polaris
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Attribute-based access control (ABAC) | Higher complexity for initial MVP; can be added as an extension to RBAC later |
| External authorization service (OPA/Cedar) | Additional infrastructure dependency; latency on every authorization check; decided to evaluate post-MVP |
| Simple ACL lists | Insufficient for hierarchical tenant/resource model; does not scale well with user count |
**Consequences:**
- Every API endpoint must declare its required permissions
- Role assignment is tenant-scoped -- the same user can have different roles in different tenants
- Permission checks add ~1ms per request (acceptable overhead)
- The governance service handles data-level access separately from API-level access
- Role hierarchy reduces permission management overhead (e.g., `admin` inherits all `editor` permissions)
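The role-inheritance consequence above can be illustrated with a toy permission check. Role and permission names here are invented for the example and are not the platform's actual catalog:

```java
import java.util.Map;
import java.util.Set;

// Toy role hierarchy: a role grants a permission if it, or any role it
// inherits from, declares that permission directly.
final class RoleHierarchy {
    // Each role maps to the roles it inherits from (assumed example data).
    private static final Map<String, Set<String>> PARENTS = Map.of(
            "admin", Set.of("editor"),
            "editor", Set.of("viewer"),
            "viewer", Set.of());

    private static final Map<String, Set<String>> DIRECT_PERMS = Map.of(
            "admin", Set.of("dashboard:delete"),
            "editor", Set.of("dashboard:edit"),
            "viewer", Set.of("dashboard:view"));

    static boolean hasPermission(String role, String permission) {
        if (DIRECT_PERMS.getOrDefault(role, Set.of()).contains(permission)) {
            return true;
        }
        // Walk up the inheritance chain.
        for (String parent : PARENTS.getOrDefault(role, Set.of())) {
            if (hasPermission(parent, permission)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, granting `admin` never requires re-listing the `editor` and `viewer` permissions, which is the management-overhead reduction the consequence describes.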
## ADR-0005: Event-Driven Communication
**Status:** Accepted
**Context:**
The platform has multiple communication patterns:
- Request-response (user queries, API calls)
- Fire-and-forget notifications (audit events, billing meters)
- Real-time streaming (AI chat tokens, live dashboards)
- Cross-plane coordination (Control Plane to Data Plane commands)
A single communication mechanism cannot optimally serve all these patterns.
**Decision:**
Use a hybrid communication model:
- Synchronous REST for request-response interactions requiring immediate results
- Apache Kafka for durable, asynchronous event streaming (audit, billing, cross-service coordination)
- Redis Pub/Sub for ephemeral, real-time notifications (config changes, streaming tokens)
All events extend the `DataPlaneEvent` base class with standardized fields: `eventId`, `eventType`, `category`, `tenantId`, `correlationId`, and `payload`.
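A minimal sketch of that event envelope, assuming only the field list above. The ADR describes a base class; a Java record is used here purely for brevity, and the `of(...)` factory is an assumption:

```java
import java.util.Map;
import java.util.UUID;

// Sketch of the standardized event envelope; field names follow the ADR,
// everything else is illustrative.
record DataPlaneEvent(String eventId,
                      String eventType,
                      String category,
                      String tenantId,
                      String correlationId,
                      Map<String, Object> payload) {

    static DataPlaneEvent of(String eventType, String category, String tenantId,
                             String correlationId, Map<String, Object> payload) {
        // Every event gets a unique id; tenantId is mandatory for isolation.
        return new DataPlaneEvent(UUID.randomUUID().toString(), eventType,
                category, tenantId, correlationId, payload);
    }
}
```

Publishing the Kafka record keyed by `tenantId` (per ADR-0002) then gives the per-tenant partition affinity the isolation model depends on.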
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| REST only | Cannot handle fire-and-forget patterns efficiently; polling for events wastes resources |
| Kafka only | Adds unnecessary latency and complexity for simple request-response patterns |
| RabbitMQ instead of Kafka | Kafka's log-based architecture better suited for event sourcing and replay; better Kubernetes operator support (Strimzi) |
| gRPC for all inter-service | Excellent for performance but higher implementation complexity; REST is sufficient for most services; gRPC adopted only where needed (Temporal, Ray) |
**Consequences:**
- Two messaging systems to operate (Kafka + Redis)
- Developers must choose the appropriate communication pattern for each interaction
- Events must carry tenant context for isolation
- Event schema evolution must be managed (schemaVersion field)
- Kafka topic naming and partitioning conventions must be documented and enforced
## ADR-0006: API Gateway Selection
**Status:** Accepted
**Context:**
The platform needs an API gateway that provides:
- JWT validation and claims extraction
- Per-tenant rate limiting
- Request routing to multiple backend services
- WebSocket and SSE support for streaming
- Extensibility for custom logic (tenant context injection)
**Decision:**
Adopt Kong 3.5.0 in DB-less (declarative) mode with custom Lua plugins.
Three custom plugins were developed:
- JWT claims extraction (injects `X-Tenant-ID` and `X-User-ID` headers)
- Tenant-aware rate limiting (per-tenant quotas stored in Redis)
- Request validation (input sanitization at the edge)
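The core idea of the tenant-aware rate limiting plugin — counting requests per tenant per time window in a shared store — can be sketched as follows. The actual plugin is Lua with counters in Redis; this Java version with an in-memory map is purely illustrative, and the window size and API are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual fixed-window limiter: each tenant gets its own counter per
// one-minute window, mirroring a Redis key with a TTL.
final class TenantRateLimiter {
    private final int limitPerWindow;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    TenantRateLimiter(int limitPerWindow) {
        this.limitPerWindow = limitPerWindow;
    }

    // Returns true if this request is within the tenant's quota.
    boolean allow(String tenantId, long epochSeconds) {
        // Key combines tenant and window start, so tenants never share quota.
        String windowKey = tenantId + ":" + (epochSeconds / 60);
        int count = counters.computeIfAbsent(windowKey, k -> new AtomicInteger())
                            .incrementAndGet();
        return count <= limitPerWindow;
    }
}
```

Because the key embeds the tenant id, one noisy tenant exhausting its quota cannot consume another tenant's allowance — the isolation property the plugin exists to enforce.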
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Spring Cloud Gateway | Would require all gateway logic in Java; less mature plugin ecosystem; harder to extend with custom middleware |
| Envoy + custom filters | Excellent performance but custom filters require C++ or Wasm; higher development effort for custom logic |
| AWS API Gateway / Azure APIM | Cloud-specific -- contradicts cloud-agnostic requirement |
| Traefik | Less mature plugin ecosystem; custom middleware requires Go plugins compiled into the binary |
| NGINX Ingress Controller only | Insufficient for JWT claims extraction and tenant-aware rate limiting without extensive custom Lua |
**Consequences:**
- Lua plugin development requires a different skill set than the primary Java/Python stack
- DB-less mode means configuration changes require redeployment (acceptable for GitOps workflow)
- Kong provides extensive built-in telemetry (Prometheus metrics)
- Single point of entry simplifies security auditing
- Gateway becomes a critical path component -- must be highly available
## ADR-0007: Data Plane Technology Mix
**Status:** Accepted
**Context:**
The Data Plane must support diverse workloads:
- SQL query execution and data-intensive processing (best served by JVM ecosystem)
- AI/ML inference, text-to-SQL, agent orchestration (best served by Python ecosystem)
- Server-side chart rendering (best served by Node.js/headless browser ecosystem)
**Decision:**
Adopt a polyglot architecture with three primary stacks:
| Stack | Services | Rationale |
|---|---|---|
| Java / Spring Boot 3.2 | query-engine, catalog-service, semantic-layer, bi-service, pipeline-service, data-plane-agent | JDBC, Hibernate multi-tenancy, Trino integration, Spring ecosystem |
| Python / FastAPI | ai-service, ml-service, data-quality-service, ontology-service, governance-service, ops-agent-service, auth-proxy | LangChain, PyTorch, pandas, scikit-learn, LLM libraries |
| Node.js | render-service | Puppeteer/Playwright for chart rendering, npm visualization ecosystem |
Shared behavior is enforced through commons libraries: commons-java, commons-python, commons-typescript.
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Java only | Python AI/ML ecosystem is vastly superior; forcing AI workloads into Java would result in inferior capabilities and slower development |
| Python only | Python is suboptimal for high-throughput data services (query engine, catalog); lacks Hibernate multi-tenancy support; GIL limits concurrency |
| Go for infrastructure services | Would add a fourth language; team expertise is in Java for backend services |
**Consequences:**
- Three different build systems (Gradle, pip/poetry, npm)
- Three different container base images
- Three different debugging and profiling toolchains
- Commons libraries must provide consistent cross-language contracts
- Service interaction tests must verify cross-language serialization compatibility
## ADR-0008: Observability Strategy
**Status:** Accepted
**Context:**
With 24 microservices across multiple namespaces, the platform requires comprehensive observability to be operable. The team is small, so observability must be automated rather than relying on manual investigation.
**Decision:**
Adopt the three pillars of observability with open-source tools:
| Pillar | Tool | Integration |
|---|---|---|
| Metrics | Prometheus + Grafana | Micrometer (Java), prometheus_client (Python) |
| Tracing | Tempo | OpenTelemetry SDK, @Traced annotations |
| Logging | Loki | Structured JSON logging with tenant context enrichment |
All three are deployed in the `matih-observability` namespace and queryable through the `observability-api` Control Plane service.
Key decisions within this strategy:
- Structured JSON logging (not plain text) for machine-parseable logs
- Tenant context (`tenant_id`, `user_id`, `correlation_id`) enriched into every log line, metric label, and trace attribute
- `@Traced`, `@Timed`, and `@Logged` annotations in commons-java for declarative instrumentation
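A tenant-enriched structured log line might look like the following hand-rolled sketch. The field names follow this ADR, but real services would emit these through their logging framework (with proper JSON escaping) rather than string concatenation:

```java
// Illustrative only: shows the shape of the machine-parseable log line with
// tenant context. A real logging framework handles escaping and timestamps.
final class TenantLogLine {
    static String format(String level, String message,
                         String tenantId, String userId, String correlationId) {
        return "{\"level\":\"" + level + "\"," +
               "\"message\":\"" + message + "\"," +
               "\"tenant_id\":\"" + tenantId + "\"," +
               "\"user_id\":\"" + userId + "\"," +
               "\"correlation_id\":\"" + correlationId + "\"}";
    }
}
```

Carrying the same `tenant_id` and `correlation_id` in logs, metric labels, and trace attributes is what lets Loki, Prometheus, and Tempo queries be joined per tenant and per request.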
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Datadog / New Relic | Vendor lock-in; cost scales unpredictably with data volume; contradicts cloud-agnostic principle |
| ELK Stack (Elasticsearch + Logstash + Kibana) | Higher resource consumption than Loki; Elasticsearch already used for audit/search |
| Jaeger for tracing | Tempo offers better Grafana integration and lower resource usage; compatible with same OpenTelemetry SDK |
**Consequences:**
- All services must include observability dependencies (`commons-java` handles this for Java services)
- Log volume management requires attention (retention policies, sampling)
- Grafana dashboards must be provisioned as code (JSON models in Helm charts)
- Alert rules must be tenant-aware to avoid noisy per-tenant alerts
- The `observability-api` service provides a unified query interface for the admin UI
## ADR-0009: Configuration Management
**Status:** Accepted
**Context:**
The platform needs centralized configuration that supports:
- Per-environment settings (dev, staging, production)
- Per-tenant overrides (custom settings per customer)
- Feature flags for gradual rollouts
- Hot reload without service restarts
- Audit trail for configuration changes
**Decision:**
Build a dedicated config-service that provides hierarchical configuration with Redis Pub/Sub for change distribution:
```
Global defaults
  --> Environment overrides (dev, staging, prod)
    --> Service-specific overrides
      --> Tenant-specific overrides
```
Configuration changes published via Redis Pub/Sub trigger local cache invalidation in all subscribed services, enabling zero-downtime configuration updates.
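Resolution through the override chain above amounts to "the most specific layer that defines a key wins", checked from tenant down to global defaults. The class shape and example keys below are illustrative assumptions, not the config-service's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of hierarchical config resolution: layers ordered most specific
// first; the first layer containing the key supplies the value.
final class HierarchicalConfig {
    private final List<Map<String, String>> layers;

    HierarchicalConfig(Map<String, String> tenant, Map<String, String> service,
                       Map<String, String> environment, Map<String, String> global) {
        this.layers = List.of(tenant, service, environment, global);
    }

    Optional<String> get(String key) {
        for (Map<String, String> layer : layers) {
            if (layer.containsKey(key)) {
                return Optional.of(layer.get(key));
            }
        }
        return Optional.empty();
    }
}
```

In the real service the resolved values would sit in a local cache, and a Redis Pub/Sub message on a config change would invalidate that cache, so the next `get` re-resolves without a restart.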
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Spring Cloud Config Server | Only serves Java services; no tenant-specific override support built-in |
| Consul / etcd | Additional infrastructure dependency; does not natively support hierarchical overrides with tenant scoping |
| Kubernetes ConfigMaps only | No hot reload (requires pod restart); no tenant-specific overrides; no audit trail |
| Environment variables only | No dynamic updates; no hierarchy; difficult to manage at scale |
**Consequences:**
- Configuration reads are fast (local cache with Redis invalidation)
- Configuration writes go through the config-service API (audit trail)
- Tenant-specific overrides enable per-customer customization
- Feature flags can be toggled without deployment
- Services must handle configuration change events gracefully (no restart required)
- The config-service itself uses file-based configuration for bootstrap (avoiding circular dependency)
## ADR Template
Future architectural decisions should follow this template:
```markdown
## ADR-XXXX: [Title]
**Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXXX
**Context:**
What is the issue that we are seeing that is motivating this decision?
**Decision:**
What is the change that we are proposing and/or doing?
**Alternatives Considered:**
What other options did we evaluate, and why were they rejected?
**Consequences:**
What becomes easier or harder as a result of this decision?
```
## Decision Log Summary
The following table summarizes the key technology choices made across all ADRs:
| Decision Area | Choice | Key Rationale |
|---|---|---|
| Architecture | Two-plane microservices | Separation of management and workload concerns |
| Multi-tenancy | Hybrid isolation (namespace + schema + app) | Balance of security and operational efficiency |
| Authentication | JWT with 4 token types | Local validation, no round-trips, multi-client support |
| Authorization | Hierarchical RBAC | Platform, tenant, and resource-level permissions |
| Async messaging | Apache Kafka | Durable event streaming with exactly-once semantics |
| Real-time messaging | Redis Pub/Sub | Low-latency ephemeral notifications |
| API Gateway | Kong 3.5.0 (DB-less) | Custom Lua plugins, declarative config, SSE/WS support |
| Backend (Control Plane) | Java 21 / Spring Boot 3.2 | Enterprise ecosystem, Hibernate multi-tenancy |
| Backend (AI/ML) | Python / FastAPI | AI/ML library ecosystem, async support |
| Backend (Rendering) | Node.js | Headless browser ecosystem for chart rendering |
| Metrics | Prometheus + Grafana | Open-source, Kubernetes-native, tenant-aware labels |
| Tracing | Tempo + OpenTelemetry | Grafana integration, low resource footprint |
| Logging | Loki | Grafana integration, label-based querying |
| Configuration | Custom config-service + Redis | Hierarchical overrides, tenant-specific, hot reload |
## Related Sections
- Design Philosophy -- The principles behind these decisions
- Service Topology -- How these decisions shape service interactions
- Kubernetes and Helm -- Infrastructure decisions in practice
- Appendices -- Full ADR documents