# Architecture Decision Records
Architecture Decision Records (ADRs) document the significant architectural decisions made during the design and evolution of the MATIH platform. Each record captures the context, the decision, the alternatives considered, and the consequences -- ensuring that future contributors understand not just what was decided, but why.
## ADR Index
| ADR | Title | Status | Date |
|---|---|---|---|
| ADR-0001 | Platform Architecture | Accepted | 2025-03 |
| ADR-0002 | Multi-Tenancy Model | Accepted | 2025-03 |
| ADR-0003 | Authentication Strategy | Accepted | 2025-04 |
| ADR-0004 | Authorization Model | Accepted | 2025-04 |
| ADR-0005 | Event-Driven Communication | Accepted | 2025-05 |
| ADR-0006 | API Gateway Selection | Accepted | 2025-05 |
| ADR-0007 | Data Plane Technology Mix | Accepted | 2025-06 |
| ADR-0008 | Observability Strategy | Accepted | 2025-07 |
| ADR-0009 | Configuration Management | Accepted | 2025-08 |
## ADR-0001: Platform Architecture
**Status:** Accepted
**Context:**
The MATIH platform needs to serve as a unified data/AI/ML/BI platform for enterprise customers. The architecture must support:
- Multiple tenants with strict data isolation
- Diverse workloads (OLAP queries, AI inference, ML training, dashboard rendering)
- Cloud-agnostic deployment (Azure, AWS, GCP, on-premises)
- Independent scaling of different platform capabilities
- Small operations team (platform must be operable without dedicated SRE)
**Decision:**
Adopt a two-plane microservices architecture:
- Control Plane: 10 Java/Spring Boot services for platform management, deployed in a shared namespace
- Data Plane: 14 polyglot services for tenant workloads, deployed in per-tenant namespaces
- Kubernetes-native: All services deployed as Helm charts on Kubernetes
- Event-driven: Kafka for asynchronous communication, REST for synchronous
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Monolithic application | Cannot scale individual capabilities independently; single technology stack would be suboptimal for both Java enterprise services and Python AI/ML workloads |
| Serverless (Lambda/Functions) | Cold start latency unacceptable for interactive analytics; state management complexity; vendor lock-in contradicts cloud-agnostic requirement |
| Single-plane microservices | No clear separation between management and workload concerns; harder to enforce tenant isolation; scaling management services affects workload services |
| Service mesh (Istio-based) | Complexity overhead too high for initial team size; decided to evaluate after MVP phase |
**Consequences:**
- 24 services to build, deploy, and monitor (operational complexity)
- Need for commons libraries to enforce consistency across services
- Need for standardized deployment scripts and CI/CD pipelines
- Clear security boundary between management and workload operations
- Independent scaling of Control Plane and Data Plane
- Ability to use optimal technology stack per problem domain
## ADR-0002: Multi-Tenancy Model
**Status:** Accepted
**Context:**
Enterprise customers require strong tenant isolation guarantees. The platform must prevent cross-tenant data access, limit resource consumption per tenant, and provide auditable isolation boundaries. At the same time, operational efficiency requires shared infrastructure where safe to do so.
**Decision:**
Adopt a hybrid isolation model with four layers:
- Kubernetes namespace isolation: Each tenant gets a dedicated namespace with NetworkPolicies and ResourceQuotas
- Database schema isolation: Each tenant's data is stored in a separate PostgreSQL schema within a shared database
- Application-level isolation: `TenantContextHolder` (ThreadLocal) ensures every operation is scoped to the correct tenant
- Event isolation: Kafka messages are keyed by `tenant_id` for partition affinity
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Shared schema with tenant column | Insufficient isolation -- a bug in a WHERE clause could leak data across tenants; no database-level enforcement |
| Database per tenant | Excessive resource consumption -- each PostgreSQL instance has memory overhead; connection pool management becomes complex at 100+ tenants |
| VM per tenant | Cost-prohibitive at scale; long provisioning times; underutilized resources for small tenants |
| Container per tenant (no namespace) | Insufficient network isolation; no resource quota enforcement; no RBAC boundary |
**Consequences:**
- `TenantContext` must be propagated through every layer (filter chain, service layer, repository layer, event publishing)
- Thread pool dispatch requires explicit context wrapping (`wrapWithContext`)
- Every database query is automatically scoped via the Hibernate `TenantIdentifierResolver`
- Redis keys, Kafka messages, and log entries must carry tenant context
- Tenant provisioning requires creating a Kubernetes namespace, database schemas, and network policies
- Operational queries (platform admin) must use the `SYSTEM_TENANT` context
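The `TenantContextHolder` and `wrapWithContext` mechanics described above can be sketched as follows. This is a minimal illustration under assumed internals, not the platform's actual commons-java code:

```java
// Minimal sketch of the ThreadLocal-based tenant context described in this ADR.
// Class internals are assumptions; the real commons-java code may differ.
final class TenantContextHolder {
    private static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    static void set(String tenantId) { CURRENT_TENANT.set(tenantId); }
    static String get() { return CURRENT_TENANT.get(); }
    static void clear() { CURRENT_TENANT.remove(); }

    // Captures the caller's tenant on the submitting thread so the context
    // survives dispatch to a thread pool, then cleans up after the task runs.
    static Runnable wrapWithContext(Runnable task) {
        String tenantId = get();
        return () -> {
            set(tenantId);
            try {
                task.run();
            } finally {
                clear();
            }
        };
    }
}
```

The failure mode this guards against: a bare `executor.submit(task)` runs on a pool thread with no tenant scope, which is exactly why the consequence list requires explicit context wrapping at every thread-pool boundary.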
## ADR-0003: Authentication Strategy
**Status:** Accepted
**Context:**
The platform needs a unified authentication mechanism that works across:
- Browser-based single-page applications (SPAs)
- CLI tools and API clients
- Service-to-service communication
- Long-lived API integrations
**Decision:**
Use JWT-based authentication with four token types:
- Access tokens (15-minute expiry): Short-lived tokens for user requests, carrying `tenant_id` and `roles` as claims
- Refresh tokens (7-day expiry): Used to obtain new access tokens without re-authentication
- Service tokens (5-minute expiry): Short-lived tokens for inter-service communication
- API key tokens (configurable expiry): Long-lived tokens for programmatic access
All tokens are signed with HMAC-SHA256 and validated at every service boundary.
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Session-based authentication | Does not work well for API clients and service-to-service calls; requires sticky sessions or shared session store |
| OAuth2 with external IdP only | Adds latency for token introspection on every request; external dependency for critical path |
| mTLS only | Excellent for service-to-service, but poor developer experience for browser-based SPAs; complex certificate management |
| API keys only | No expiration enforcement without custom infrastructure; no standard claims structure |
**Consequences:**
- JWT secret must be securely distributed to all services
- Token validation can happen locally at each service (no round-trip to IAM)
- `tenant_id` is embedded in the token and cannot be forged by clients
- Short access token expiry (15 min) limits the window for stolen tokens
- Refresh tokens require secure storage on the client side
- Service tokens enable zero-trust inter-service communication
- Token blacklisting requires a shared Redis store for immediate revocation
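To make the token design concrete, here is a hand-rolled sketch of issuing and locally validating an HS256-signed token carrying a `tenant_id` claim. The claim layout follows this ADR, but the class and method names are assumptions, and a real service would use a vetted JWT library rather than this:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Illustrative HS256 token sketch; not the platform's actual IAM code.
final class JwtSketch {
    private static String b64(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    private static String hmacSha256(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return b64(mac.doFinal(data.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Issues header.payload.signature with tenant_id and exp claims.
    static String issue(String tenantId, long expEpochSeconds, byte[] secret) {
        String header = b64("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = b64(("{\"tenant_id\":\"" + tenantId + "\",\"exp\":" + expEpochSeconds + "}")
                .getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + hmacSha256(signingInput, secret);
    }

    // Local validation: recompute the signature and compare in constant time.
    // No round-trip to the IAM service is required.
    static boolean hasValidSignature(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) return false;
        String expected = hmacSha256(token.substring(0, lastDot), secret);
        String actual = token.substring(lastDot + 1);
        return MessageDigest.isEqual(expected.getBytes(StandardCharsets.UTF_8),
                                     actual.getBytes(StandardCharsets.UTF_8));
    }
}
```

The local recomputation is the property the decision relies on: every service can validate tokens without calling IAM. Full validation would additionally check `exp` and any revocation blacklist.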
## ADR-0004: Authorization Model
**Status:** Accepted
**Context:**
The platform needs a fine-grained authorization model that supports:
- Role-based access control (RBAC) for platform-level permissions
- Tenant-scoped permissions (a user's role may differ between tenants)
- Resource-level permissions (e.g., "can edit this specific dashboard")
- Data-level access control (column-level and row-level security)
**Decision:**
Implement a hierarchical RBAC model with three layers:
- Platform roles: Global permissions (e.g., `platform_admin`, `tenant_creator`)
- Tenant roles: Tenant-scoped permissions (e.g., `data_analyst`, `dashboard_editor`)
- Resource permissions: Fine-grained access to specific resources
Authorization is enforced at two levels:
- Application layer: `RbacService` and `PermissionEvaluator` in commons-java check permissions on every API call
- Data layer: The governance service enforces column-level and row-level security via Apache Polaris
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Attribute-based access control (ABAC) | Higher complexity for initial MVP; can be added as an extension to RBAC later |
| External authorization service (OPA/Cedar) | Additional infrastructure dependency; latency on every authorization check; decided to evaluate post-MVP |
| Simple ACL lists | Insufficient for hierarchical tenant/resource model; does not scale well with user count |
**Consequences:**
- Every API endpoint must declare its required permissions
- Role assignment is tenant-scoped -- the same user can have different roles in different tenants
- Permission checks add ~1ms per request (acceptable overhead)
- The governance service handles data-level access separately from API-level access
- Role hierarchy reduces permission management overhead (e.g., `admin` inherits all `editor` permissions)
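The role-inheritance consequence above can be illustrated with a toy permission check. Role and permission names here are invented for the example and are not the platform's actual catalog:

```java
import java.util.Map;
import java.util.Set;

// Toy role hierarchy: a role grants a permission if it, or any role it
// inherits from, declares that permission directly.
final class RoleHierarchy {
    // Each role maps to the roles it inherits from (assumed example data).
    private static final Map<String, Set<String>> PARENTS = Map.of(
            "admin", Set.of("editor"),
            "editor", Set.of("viewer"),
            "viewer", Set.of());

    private static final Map<String, Set<String>> DIRECT_PERMS = Map.of(
            "admin", Set.of("dashboard:delete"),
            "editor", Set.of("dashboard:edit"),
            "viewer", Set.of("dashboard:view"));

    static boolean hasPermission(String role, String permission) {
        if (DIRECT_PERMS.getOrDefault(role, Set.of()).contains(permission)) {
            return true;
        }
        // Walk up the inheritance chain.
        for (String parent : PARENTS.getOrDefault(role, Set.of())) {
            if (hasPermission(parent, permission)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, granting `admin` never requires re-listing the `editor` and `viewer` permissions, which is the management-overhead reduction the consequence describes.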
## ADR-0005: Event-Driven Communication
**Status:** Accepted
**Context:**
The platform has multiple communication patterns:
- Request-response (user queries, API calls)
- Fire-and-forget notifications (audit events, billing meters)
- Real-time streaming (AI chat tokens, live dashboards)
- Cross-plane coordination (Control Plane to Data Plane commands)
A single communication mechanism cannot optimally serve all these patterns.
**Decision:**
Use a hybrid communication model:
- Synchronous REST for request-response interactions requiring immediate results
- Apache Kafka for durable, asynchronous event streaming (audit, billing, cross-service coordination)
- Redis Pub/Sub for ephemeral, real-time notifications (config changes, streaming tokens)
All events extend the `DataPlaneEvent` base class with standardized fields: `eventId`, `eventType`, `category`, `tenantId`, `correlationId`, and `payload`.
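A minimal sketch of that event envelope, assuming only the field list above. The ADR describes a base class; a Java record is used here purely for brevity, and the `of(...)` factory is an assumption:

```java
import java.util.Map;
import java.util.UUID;

// Sketch of the standardized event envelope; field names follow the ADR,
// everything else is illustrative.
record DataPlaneEvent(String eventId,
                      String eventType,
                      String category,
                      String tenantId,
                      String correlationId,
                      Map<String, Object> payload) {

    static DataPlaneEvent of(String eventType, String category, String tenantId,
                             String correlationId, Map<String, Object> payload) {
        // Every event gets a unique id; tenantId is mandatory for isolation.
        return new DataPlaneEvent(UUID.randomUUID().toString(), eventType,
                category, tenantId, correlationId, payload);
    }
}
```

Publishing the Kafka record keyed by `tenantId` (per ADR-0002) then gives the per-tenant partition affinity the isolation model depends on.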
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| REST only | Cannot handle fire-and-forget patterns efficiently; polling for events wastes resources |
| Kafka only | Adds unnecessary latency and complexity for simple request-response patterns |
| RabbitMQ instead of Kafka | Kafka's log-based architecture better suited for event sourcing and replay; better Kubernetes operator support (Strimzi) |
| gRPC for all inter-service | Excellent for performance but higher implementation complexity; REST is sufficient for most services; gRPC adopted only where needed (Temporal, Ray) |
**Consequences:**
- Two messaging systems to operate (Kafka + Redis)
- Developers must choose the appropriate communication pattern for each interaction
- Events must carry tenant context for isolation
- Event schema evolution must be managed (schemaVersion field)
- Kafka topic naming and partitioning conventions must be documented and enforced
## ADR-0006: API Gateway Selection
**Status:** Accepted
**Context:**
The platform needs an API gateway that provides:
- JWT validation and claims extraction
- Per-tenant rate limiting
- Request routing to multiple backend services
- WebSocket and SSE support for streaming
- Extensibility for custom logic (tenant context injection)
**Decision:**
Adopt Kong 3.5.0 in DB-less (declarative) mode with custom Lua plugins.
Three custom plugins were developed:
- JWT claims extraction (injects `X-Tenant-ID` and `X-User-ID` headers)
- Tenant-aware rate limiting (per-tenant quotas stored in Redis)
- Request validation (input sanitization at the edge)
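The core idea of the tenant-aware rate limiting plugin — counting requests per tenant per time window in a shared store — can be sketched as follows. The actual plugin is Lua with counters in Redis; this Java version with an in-memory map is purely illustrative, and the window size and API are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual fixed-window limiter: each tenant gets its own counter per
// one-minute window, mirroring a Redis key with a TTL.
final class TenantRateLimiter {
    private final int limitPerWindow;
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    TenantRateLimiter(int limitPerWindow) {
        this.limitPerWindow = limitPerWindow;
    }

    // Returns true if this request is within the tenant's quota.
    boolean allow(String tenantId, long epochSeconds) {
        // Key combines tenant and window start, so tenants never share quota.
        String windowKey = tenantId + ":" + (epochSeconds / 60);
        int count = counters.computeIfAbsent(windowKey, k -> new AtomicInteger())
                            .incrementAndGet();
        return count <= limitPerWindow;
    }
}
```

Because the key embeds the tenant id, one noisy tenant exhausting its quota cannot consume another tenant's allowance — the isolation property the plugin exists to enforce.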
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Spring Cloud Gateway | Would require all gateway logic in Java; less mature plugin ecosystem; harder to extend with custom middleware |
| Envoy + custom filters | Excellent performance but custom filters require C++ or Wasm; higher development effort for custom logic |
| AWS API Gateway / Azure APIM | Cloud-specific -- contradicts cloud-agnostic requirement |
| Traefik | Less mature plugin ecosystem; custom middleware requires Go plugins compiled into the binary |
| NGINX Ingress Controller only | Insufficient for JWT claims extraction and tenant-aware rate limiting without extensive custom Lua |
**Consequences:**
- Lua plugin development requires a different skill set than the primary Java/Python stack
- DB-less mode means configuration changes require redeployment (acceptable for GitOps workflow)
- Kong provides extensive built-in telemetry (Prometheus metrics)
- Single point of entry simplifies security auditing
- Gateway becomes a critical path component -- must be highly available
## ADR-0007: Data Plane Technology Mix
**Status:** Accepted
**Context:**
The Data Plane must support diverse workloads:
- SQL query execution and data-intensive processing (best served by JVM ecosystem)
- AI/ML inference, text-to-SQL, agent orchestration (best served by Python ecosystem)
- Server-side chart rendering (best served by Node.js/headless browser ecosystem)
**Decision:**
Adopt a polyglot architecture with three primary stacks:
| Stack | Services | Rationale |
|---|---|---|
| Java / Spring Boot 3.2 | query-engine, catalog-service, semantic-layer, bi-service, pipeline-service, data-plane-agent | JDBC, Hibernate multi-tenancy, Trino integration, Spring ecosystem |
| Python / FastAPI | ai-service, ml-service, data-quality-service, ontology-service, governance-service, ops-agent-service, auth-proxy | LangChain, PyTorch, pandas, scikit-learn, LLM libraries |
| Node.js | render-service | Puppeteer/Playwright for chart rendering, npm visualization ecosystem |
Shared behavior is enforced through commons libraries: commons-java, commons-python, commons-typescript.
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Java only | Python AI/ML ecosystem is vastly superior; forcing AI workloads into Java would result in inferior capabilities and slower development |
| Python only | Python is suboptimal for high-throughput data services (query engine, catalog); lacks Hibernate multi-tenancy support; GIL limits concurrency |
| Go for infrastructure services | Would add a fourth language; team expertise is in Java for backend services |
**Consequences:**
- Three different build systems (Gradle, pip/poetry, npm)
- Three different container base images
- Three different debugging and profiling toolchains
- Commons libraries must provide consistent cross-language contracts
- Service interaction tests must verify cross-language serialization compatibility
## ADR-0008: Observability Strategy
**Status:** Accepted
**Context:**
With 24 microservices across multiple namespaces, the platform requires comprehensive observability to be operable. The team is small, so observability must be automated rather than relying on manual investigation.
**Decision:**
Adopt the three pillars of observability with open-source tools:
| Pillar | Tool | Integration |
|---|---|---|
| Metrics | Prometheus + Grafana | Micrometer (Java), prometheus_client (Python) |
| Tracing | Tempo | OpenTelemetry SDK, @Traced annotations |
| Logging | Loki | Structured JSON logging with tenant context enrichment |
All three are deployed in the `matih-observability` namespace and queryable through the `observability-api` Control Plane service.
Key decisions within this strategy:
- Structured JSON logging (not plain text) for machine-parseable logs
- Tenant context (`tenant_id`, `user_id`, `correlation_id`) enriched into every log line, metric label, and trace attribute
- `@Traced`, `@Timed`, and `@Logged` annotations in commons-java for declarative instrumentation
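A tenant-enriched structured log line might look like the following hand-rolled sketch. The field names follow this ADR, but real services would emit these through their logging framework (with proper JSON escaping) rather than string concatenation:

```java
// Illustrative only: shows the shape of the machine-parseable log line with
// tenant context. A real logging framework handles escaping and timestamps.
final class TenantLogLine {
    static String format(String level, String message,
                         String tenantId, String userId, String correlationId) {
        return "{\"level\":\"" + level + "\"," +
               "\"message\":\"" + message + "\"," +
               "\"tenant_id\":\"" + tenantId + "\"," +
               "\"user_id\":\"" + userId + "\"," +
               "\"correlation_id\":\"" + correlationId + "\"}";
    }
}
```

Carrying the same `tenant_id` and `correlation_id` in logs, metric labels, and trace attributes is what lets Loki, Prometheus, and Tempo queries be joined per tenant and per request.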
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Datadog / New Relic | Vendor lock-in; cost scales unpredictably with data volume; contradicts cloud-agnostic principle |
| ELK Stack (Elasticsearch + Logstash + Kibana) | Higher resource consumption than Loki; Elasticsearch already used for audit/search |
| Jaeger for tracing | Tempo offers better Grafana integration and lower resource usage; compatible with same OpenTelemetry SDK |
**Consequences:**
- All services must include observability dependencies (`commons-java` handles this for Java services)
- Log volume management requires attention (retention policies, sampling)
- Grafana dashboards must be provisioned as code (JSON models in Helm charts)
- Alert rules must be tenant-aware to avoid noisy per-tenant alerts
- The `observability-api` service provides a unified query interface for the admin UI
## ADR-0009: Configuration Management
**Status:** Accepted
**Context:**
The platform needs centralized configuration that supports:
- Per-environment settings (dev, staging, production)
- Per-tenant overrides (custom settings per customer)
- Feature flags for gradual rollouts
- Hot reload without service restarts
- Audit trail for configuration changes
**Decision:**
Build a dedicated config-service that provides hierarchical configuration with Redis Pub/Sub for change distribution:
```
Global defaults
  --> Environment overrides (dev, staging, prod)
    --> Service-specific overrides
      --> Tenant-specific overrides
```
Configuration changes published via Redis Pub/Sub trigger local cache invalidation in all subscribed services, enabling zero-downtime configuration updates.
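Resolution through the override chain above amounts to "the most specific layer that defines a key wins", checked from tenant down to global defaults. The class shape and example keys below are illustrative assumptions, not the config-service's actual API:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of hierarchical config resolution: layers ordered most specific
// first; the first layer containing the key supplies the value.
final class HierarchicalConfig {
    private final List<Map<String, String>> layers;

    HierarchicalConfig(Map<String, String> tenant, Map<String, String> service,
                       Map<String, String> environment, Map<String, String> global) {
        this.layers = List.of(tenant, service, environment, global);
    }

    Optional<String> get(String key) {
        for (Map<String, String> layer : layers) {
            if (layer.containsKey(key)) {
                return Optional.of(layer.get(key));
            }
        }
        return Optional.empty();
    }
}
```

In the real service the resolved values would sit in a local cache, and a Redis Pub/Sub message on a config change would invalidate that cache, so the next `get` re-resolves without a restart.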
**Alternatives Considered:**
| Alternative | Reason Rejected |
|---|---|
| Spring Cloud Config Server | Only serves Java services; no tenant-specific override support built-in |
| Consul / etcd | Additional infrastructure dependency; does not natively support hierarchical overrides with tenant scoping |
| Kubernetes ConfigMaps only | No hot reload (requires pod restart); no tenant-specific overrides; no audit trail |
| Environment variables only | No dynamic updates; no hierarchy; difficult to manage at scale |
**Consequences:**
- Configuration reads are fast (local cache with Redis invalidation)
- Configuration writes go through the config-service API (audit trail)
- Tenant-specific overrides enable per-customer customization
- Feature flags can be toggled without deployment
- Services must handle configuration change events gracefully (no restart required)
- The config-service itself uses file-based configuration for bootstrap (avoiding circular dependency)
## ADR Template
Future architectural decisions should follow this template:
```markdown
## ADR-XXXX: [Title]
**Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXXX
**Context:**
What is the issue that we are seeing that is motivating this decision?
**Decision:**
What is the change that we are proposing and/or doing?
**Alternatives Considered:**
What other options did we evaluate, and why were they rejected?
**Consequences:**
What becomes easier or harder as a result of this decision?
```
## Decision Log Summary
The following table summarizes the key technology choices made across all ADRs:
| Decision Area | Choice | Key Rationale |
|---|---|---|
| Architecture | Two-plane microservices | Separation of management and workload concerns |
| Multi-tenancy | Hybrid isolation (namespace + schema + app) | Balance of security and operational efficiency |
| Authentication | JWT with 4 token types | Local validation, no round-trips, multi-client support |
| Authorization | Hierarchical RBAC | Platform, tenant, and resource-level permissions |
| Async messaging | Apache Kafka | Durable event streaming with exactly-once semantics |
| Real-time messaging | Redis Pub/Sub | Low-latency ephemeral notifications |
| API Gateway | Kong 3.5.0 (DB-less) | Custom Lua plugins, declarative config, SSE/WS support |
| Backend (Control Plane) | Java 21 / Spring Boot 3.2 | Enterprise ecosystem, Hibernate multi-tenancy |
| Backend (AI/ML) | Python / FastAPI | AI/ML library ecosystem, async support |
| Backend (Rendering) | Node.js | Headless browser ecosystem for chart rendering |
| Metrics | Prometheus + Grafana | Open-source, Kubernetes-native, tenant-aware labels |
| Tracing | Tempo + OpenTelemetry | Grafana integration, low resource footprint |
| Logging | Loki | Grafana integration, label-based querying |
| Configuration | Custom config-service + Redis | Hierarchical overrides, tenant-specific, hot reload |
## Related Sections
- Design Philosophy -- The principles behind these decisions
- Service Topology -- How these decisions shape service interactions
- Kubernetes and Helm -- Infrastructure decisions in practice
- Appendices -- Full ADR documents