MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Architecture Decision Records

Architecture Decision Records (ADRs) document the significant architectural decisions made during the design and evolution of the MATIH platform. Each record captures the context, the decision, the alternatives considered, and the consequences -- ensuring that future contributors understand not just what was decided, but why.


ADR Index

| ADR | Title | Status | Date |
| --- | --- | --- | --- |
| ADR-0001 | Platform Architecture | Accepted | 2025-03 |
| ADR-0002 | Multi-Tenancy Model | Accepted | 2025-03 |
| ADR-0003 | Authentication Strategy | Accepted | 2025-04 |
| ADR-0004 | Authorization Model | Accepted | 2025-04 |
| ADR-0005 | Event-Driven Communication | Accepted | 2025-05 |
| ADR-0006 | API Gateway Selection | Accepted | 2025-05 |
| ADR-0007 | Data Plane Technology Mix | Accepted | 2025-06 |
| ADR-0008 | Observability Strategy | Accepted | 2025-07 |
| ADR-0009 | Configuration Management | Accepted | 2025-08 |

ADR-0001: Platform Architecture

Status: Accepted

Context:

The MATIH platform needs to serve as a unified data/AI/ML/BI platform for enterprise customers. The architecture must support:

  • Multiple tenants with strict data isolation
  • Diverse workloads (OLAP queries, AI inference, ML training, dashboard rendering)
  • Cloud-agnostic deployment (Azure, AWS, GCP, on-premises)
  • Independent scaling of different platform capabilities
  • Small operations team (platform must be operable without dedicated SRE)

Decision:

Adopt a two-plane microservices architecture:

  • Control Plane: 10 Java/Spring Boot services for platform management, deployed in a shared namespace
  • Data Plane: 14 polyglot services for tenant workloads, deployed in per-tenant namespaces
  • Kubernetes-native: All services deployed as Helm charts on Kubernetes
  • Event-driven: Kafka for asynchronous communication, REST for synchronous

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Monolithic application | Cannot scale individual capabilities independently; single technology stack would be suboptimal for both Java enterprise services and Python AI/ML workloads |
| Serverless (Lambda/Functions) | Cold start latency unacceptable for interactive analytics; state management complexity; vendor lock-in contradicts cloud-agnostic requirement |
| Single-plane microservices | No clear separation between management and workload concerns; harder to enforce tenant isolation; scaling management services affects workload services |
| Service mesh (Istio-based) | Complexity overhead too high for initial team size; decided to evaluate after MVP phase |

Consequences:

  • 24 services to build, deploy, and monitor (operational complexity)
  • Need for commons libraries to enforce consistency across services
  • Need for standardized deployment scripts and CI/CD pipelines
  • Clear security boundary between management and workload operations
  • Independent scaling of Control Plane and Data Plane
  • Ability to use optimal technology stack per problem domain

ADR-0002: Multi-Tenancy Model

Status: Accepted

Context:

Enterprise customers require strong tenant isolation guarantees. The platform must prevent cross-tenant data access, limit resource consumption per tenant, and provide auditable isolation boundaries. At the same time, operational efficiency requires shared infrastructure where safe to do so.

Decision:

Adopt a hybrid isolation model with four layers:

  1. Kubernetes namespace isolation: Each tenant gets a dedicated namespace with NetworkPolicies and ResourceQuotas
  2. Database schema isolation: Each tenant's data is stored in a separate PostgreSQL schema within a shared database
  3. Application-level isolation: TenantContextHolder (ThreadLocal) ensures every operation is scoped to the correct tenant
  4. Event isolation: Kafka messages are keyed by tenant_id for partition affinity
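
The application-level layer can be sketched as follows. TenantContextHolder is the name used by the platform; the method names here are illustrative, and in practice a servlet filter sets the tenant after JWT validation and clears it when the request completes.

```java
// Sketch of the ThreadLocal-based tenant context described in layer 3.
// Method names are illustrative, not the actual commons-java API.
public final class TenantContextHolder {
    private static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    private TenantContextHolder() {}

    /** Bind a tenant to the current thread (e.g. in a filter, after JWT validation). */
    public static void setTenantId(String tenantId) {
        CURRENT_TENANT.set(tenantId);
    }

    /** Returns the tenant bound to the current thread, or fails loudly if none is set. */
    public static String requireTenantId() {
        String tenantId = CURRENT_TENANT.get();
        if (tenantId == null) {
            throw new IllegalStateException("No tenant bound to current thread");
        }
        return tenantId;
    }

    /** Must run when the request completes, so pooled threads never leak context. */
    public static void clear() {
        CURRENT_TENANT.remove();
    }
}
```

Failing loudly on a missing tenant (rather than returning null) is what turns a forgotten context propagation into an error instead of a cross-tenant leak.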

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Shared schema with tenant column | Insufficient isolation -- a bug in a WHERE clause could leak data across tenants; no database-level enforcement |
| Database per tenant | Excessive resource consumption -- each PostgreSQL instance has memory overhead; connection pool management becomes complex at 100+ tenants |
| VM per tenant | Cost-prohibitive at scale; long provisioning times; underutilized resources for small tenants |
| Container per tenant (no namespace) | Insufficient network isolation; no resource quota enforcement; no RBAC boundary |

Consequences:

  • TenantContext must be propagated through every layer (filter chain, service layer, repository layer, event publishing)
  • Thread pool dispatch requires explicit context wrapping (wrapWithContext)
  • Every database query is automatically scoped via Hibernate TenantIdentifierResolver
  • Redis keys, Kafka messages, and log entries must carry tenant context
  • Tenant provisioning requires creating Kubernetes namespace, database schemas, and network policies
  • Operational queries (platform admin) must use SYSTEM_TENANT context
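
The wrapWithContext consequence can be sketched like this: capture the tenant on the submitting thread, re-bind it on the worker thread, and always clear it. A minimal ThreadLocal holder is inlined so the example is self-contained; the class name is illustrative.

```java
// Sketch of wrapWithContext: ThreadLocal state does not cross thread pools,
// so the tenant must be captured at submission and re-bound on the worker.
public final class WrapWithContextSketch {
    public static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    public static Runnable wrapWithContext(Runnable task) {
        String tenantId = TENANT.get();      // captured on the submitting thread
        return () -> {
            TENANT.set(tenantId);            // re-bound on the pool thread
            try {
                task.run();
            } finally {
                TENANT.remove();             // never leak context into the pool
            }
        };
    }
}
```

Any task submitted to an executor would be wrapped first, e.g. `pool.submit(wrapWithContext(task))`; submitting the bare task would silently run it without tenant scope.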

ADR-0003: Authentication Strategy

Status: Accepted

Context:

The platform needs a unified authentication mechanism that works across:

  • Browser-based single-page applications (SPAs)
  • CLI tools and API clients
  • Service-to-service communication
  • Long-lived API integrations

Decision:

Use JWT-based authentication with four token types:

  1. Access tokens (15-minute expiry): Short-lived tokens for user requests, carrying tenant_id and roles as claims
  2. Refresh tokens (7-day expiry): Used to obtain new access tokens without re-authentication
  3. Service tokens (5-minute expiry): Short-lived tokens for inter-service communication
  4. API key tokens (configurable expiry): Long-lived tokens for programmatic access

All tokens are signed with HMAC-SHA256 and validated at every service boundary.
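
A minimal sketch of the HS256 signing and validation scheme named above, using only the JDK. Claim content mirrors the ADR (tenant_id, roles); a production service would use a vetted JWT library and also verify the exp claim, which this sketch omits.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Illustrative HMAC-SHA256 (HS256) JWT signing and verification.
public final class Hs256Jwt {
    private static final Base64.Encoder B64 = Base64.getUrlEncoder().withoutPadding();

    public static String sign(String claimsJson, byte[] secret) {
        String header = B64.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = B64.encodeToString(claimsJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + B64.encodeToString(hmac(signingInput, secret));
    }

    /** Recomputes the signature locally and compares in constant time -- no IAM round-trip. */
    public static boolean verify(String token, byte[] secret) {
        int lastDot = token.lastIndexOf('.');
        if (lastDot < 0) return false;
        byte[] expected = hmac(token.substring(0, lastDot), secret);
        byte[] actual = Base64.getUrlDecoder().decode(token.substring(lastDot + 1));
        return MessageDigest.isEqual(expected, actual);
    }

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because verification only needs the shared secret, every service can validate tokens locally, which is the property the consequences below rely on.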

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Session-based authentication | Does not work well for API clients and service-to-service calls; requires sticky sessions or shared session store |
| OAuth2 with external IdP only | Adds latency for token introspection on every request; external dependency for critical path |
| mTLS only | Excellent for service-to-service, but poor developer experience for browser-based SPAs; complex certificate management |
| API keys only | No expiration enforcement without custom infrastructure; no standard claims structure |

Consequences:

  • JWT secret must be securely distributed to all services
  • Token validation can happen locally at each service (no round-trip to IAM)
  • tenant_id is embedded in the token and cannot be forged by clients
  • Short access token expiry (15 min) limits the window for stolen tokens
  • Refresh tokens require secure storage on the client side
  • Service tokens enable zero-trust inter-service communication
  • Token blacklisting requires a shared Redis store for immediate revocation

ADR-0004: Authorization Model

Status: Accepted

Context:

The platform needs a fine-grained authorization model that supports:

  • Role-based access control (RBAC) for platform-level permissions
  • Tenant-scoped permissions (a user's role may differ between tenants)
  • Resource-level permissions (e.g., "can edit this specific dashboard")
  • Data-level access control (column-level and row-level security)

Decision:

Implement a hierarchical RBAC model with three layers:

  1. Platform roles: Global permissions (e.g., platform_admin, tenant_creator)
  2. Tenant roles: Tenant-scoped permissions (e.g., data_analyst, dashboard_editor)
  3. Resource permissions: Fine-grained access to specific resources

Authorization is enforced at two levels:

  • Application layer: RbacService and PermissionEvaluator in commons-java check permissions on every API call
  • Data layer: Governance service enforces column-level and row-level security via Apache Polaris
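
The application-layer check can be sketched as a tenant-scoped role lookup against flattened role permissions. Role names, permission strings, and the class name are illustrative, not the platform's actual catalog; the point is that the same user resolves to different roles in different tenants.

```java
import java.util.Map;
import java.util.Set;

// Illustrative tenant-scoped RBAC check with role inheritance pre-flattened
// (tenant_admin already includes every dashboard_editor permission).
public final class RbacSketch {
    private static final Map<String, Set<String>> ROLE_PERMISSIONS = Map.of(
            "dashboard_editor", Set.of("dashboard:read", "dashboard:write"),
            "tenant_admin",     Set.of("dashboard:read", "dashboard:write", "tenant:manage"));

    // Role assignments are tenant-scoped: (tenantId, userId) -> role.
    private static final Map<String, String> ASSIGNMENTS = Map.of(
            "acme/alice",   "tenant_admin",
            "globex/alice", "dashboard_editor");

    public static boolean hasPermission(String tenantId, String userId, String permission) {
        String role = ASSIGNMENTS.get(tenantId + "/" + userId);
        return role != null && ROLE_PERMISSIONS.getOrDefault(role, Set.of()).contains(permission);
    }
}
```

Flattening inheritance at role-definition time keeps the per-request check to a map lookup and a set membership test, consistent with the ~1ms overhead noted in the consequences.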

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Attribute-based access control (ABAC) | Higher complexity for initial MVP; can be added as an extension to RBAC later |
| External authorization service (OPA/Cedar) | Additional infrastructure dependency; latency on every authorization check; decided to evaluate post-MVP |
| Simple ACL lists | Insufficient for hierarchical tenant/resource model; does not scale well with user count |

Consequences:

  • Every API endpoint must declare its required permissions
  • Role assignment is tenant-scoped -- the same user can have different roles in different tenants
  • Permission checks add ~1ms per request (acceptable overhead)
  • The governance service handles data-level access separately from API-level access
  • Role hierarchy reduces permission management overhead (e.g., admin inherits all editor permissions)

ADR-0005: Event-Driven Communication

Status: Accepted

Context:

The platform has multiple communication patterns:

  • Request-response (user queries, API calls)
  • Fire-and-forget notifications (audit events, billing meters)
  • Real-time streaming (AI chat tokens, live dashboards)
  • Cross-plane coordination (Control Plane to Data Plane commands)

A single communication mechanism cannot optimally serve all these patterns.

Decision:

Use a hybrid communication model:

  1. Synchronous REST for request-response interactions requiring immediate results
  2. Apache Kafka for durable, asynchronous event streaming (audit, billing, cross-service coordination)
  3. Redis Pub/Sub for ephemeral, real-time notifications (config changes, streaming tokens)

All events extend the DataPlaneEvent base class with standardized fields: eventId, eventType, category, tenantId, correlationId, payload.
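
The envelope can be sketched with exactly the standardized fields listed above. The record form and factory method are illustrative; the actual base class in commons-java may differ (and the ADR-0005 consequences note a schemaVersion field for evolution, omitted here).

```java
import java.util.UUID;

// Illustrative DataPlaneEvent envelope with the standardized fields from the decision.
public record DataPlaneEvent(
        String eventId,        // unique per event
        String eventType,      // e.g. "pipeline.run.completed"
        String category,       // e.g. "audit", "billing"
        String tenantId,       // every event carries tenant context (ADR-0002);
                               // also the Kafka message key for partition affinity
        String correlationId,  // ties the event back to the originating request
        String payload) {      // event-specific body, JSON in practice

    public static DataPlaneEvent of(String eventType, String category,
                                    String tenantId, String correlationId, String payload) {
        return new DataPlaneEvent(UUID.randomUUID().toString(),
                eventType, category, tenantId, correlationId, payload);
    }
}
```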

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| REST only | Cannot handle fire-and-forget patterns efficiently; polling for events wastes resources |
| Kafka only | Adds unnecessary latency and complexity for simple request-response patterns |
| RabbitMQ instead of Kafka | Kafka's log-based architecture better suited for event sourcing and replay; better Kubernetes operator support (Strimzi) |
| gRPC for all inter-service | Excellent for performance but higher implementation complexity; REST is sufficient for most services; gRPC adopted only where needed (Temporal, Ray) |

Consequences:

  • Two messaging systems to operate (Kafka + Redis)
  • Developers must choose the appropriate communication pattern for each interaction
  • Events must carry tenant context for isolation
  • Event schema evolution must be managed (schemaVersion field)
  • Kafka topic naming and partitioning conventions must be documented and enforced
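
Why keying by tenant_id yields partition affinity can be shown in a few lines: a deterministic hash of the key selects the partition, so all of a tenant's events land on the same partition and are consumed in order. This sketch uses String.hashCode for illustration; Kafka's default partitioner actually uses murmur2.

```java
// Illustrative key-to-partition mapping; not Kafka's actual partitioner.
public final class PartitionAffinitySketch {
    public static int partitionFor(String tenantId, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes
        return Math.floorMod(tenantId.hashCode(), numPartitions);
    }
}
```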

ADR-0006: API Gateway Selection

Status: Accepted

Context:

The platform needs an API gateway that provides:

  • JWT validation and claims extraction
  • Per-tenant rate limiting
  • Request routing to multiple backend services
  • WebSocket and SSE support for streaming
  • Extensibility for custom logic (tenant context injection)

Decision:

Adopt Kong 3.5.0 in DB-less (declarative) mode with custom Lua plugins.

Three custom plugins were developed:

  1. JWT claims extraction (injects X-Tenant-ID, X-User-ID headers)
  2. Tenant-aware rate limiting (per-tenant quotas stored in Redis)
  3. Request validation (input sanitization at the edge)
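
The tenant-aware rate limiting plugin itself is Lua with counters in Redis; the fixed-window logic it would implement can be sketched in Java with an in-memory map standing in for Redis. Window size, key format, and the class name are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative fixed-window, per-tenant rate limiter. In the real plugin,
// each counter is a Redis key with a TTL, so old windows expire automatically;
// this in-memory map does not expire and is for demonstration only.
public final class TenantRateLimiter {
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();
    private final int limitPerWindow;
    private final long windowMillis;

    public TenantRateLimiter(int limitPerWindow, long windowMillis) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
    }

    /** True if this tenant still has quota in the current window. */
    public boolean allow(String tenantId, long nowMillis) {
        long window = nowMillis / windowMillis;   // fixed-window bucket
        String key = tenantId + ":" + window;     // mirrors a per-tenant, per-window Redis key
        int count = counters.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        return count <= limitPerWindow;
    }
}
```

Because quotas are keyed per tenant, one tenant exhausting its window never affects another tenant's requests.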

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Spring Cloud Gateway | Would require all gateway logic in Java; less mature plugin ecosystem; harder to extend with custom middleware |
| Envoy + custom filters | Excellent performance but custom filters require C++ or Wasm; higher development effort for custom logic |
| AWS API Gateway / Azure APIM | Cloud-specific -- contradicts cloud-agnostic requirement |
| Traefik | Less mature plugin ecosystem; custom middleware requires Go plugins compiled into the binary |
| NGINX Ingress Controller only | Insufficient for JWT claims extraction and tenant-aware rate limiting without extensive custom Lua |

Consequences:

  • Lua plugin development requires a different skill set than the primary Java/Python stack
  • DB-less mode means configuration changes require redeployment (acceptable for GitOps workflow)
  • Kong provides extensive built-in telemetry (Prometheus metrics)
  • Single point of entry simplifies security auditing
  • Gateway becomes a critical path component -- must be highly available

ADR-0007: Data Plane Technology Mix

Status: Accepted

Context:

The Data Plane must support diverse workloads:

  • SQL query execution and data-intensive processing (best served by JVM ecosystem)
  • AI/ML inference, text-to-SQL, agent orchestration (best served by Python ecosystem)
  • Server-side chart rendering (best served by Node.js/headless browser ecosystem)

Decision:

Adopt a polyglot architecture with three primary stacks:

| Stack | Services | Rationale |
| --- | --- | --- |
| Java / Spring Boot 3.2 | query-engine, catalog-service, semantic-layer, bi-service, pipeline-service, data-plane-agent | JDBC, Hibernate multi-tenancy, Trino integration, Spring ecosystem |
| Python / FastAPI | ai-service, ml-service, data-quality-service, ontology-service, governance-service, ops-agent-service, auth-proxy | LangChain, PyTorch, pandas, scikit-learn, LLM libraries |
| Node.js | render-service | Puppeteer/Playwright for chart rendering, npm visualization ecosystem |

Shared behavior is enforced through commons libraries: commons-java, commons-python, commons-typescript.

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Java only | Python AI/ML ecosystem is vastly superior; forcing AI workloads into Java would result in inferior capabilities and slower development |
| Python only | Python is suboptimal for high-throughput data services (query engine, catalog); lacks Hibernate multi-tenancy support; GIL limits concurrency |
| Go for infrastructure services | Would add a fourth language; team expertise is in Java for backend services |

Consequences:

  • Three different build systems (Gradle, pip/poetry, npm)
  • Three different container base images
  • Three different debugging and profiling toolchains
  • Commons libraries must provide consistent cross-language contracts
  • Service interaction tests must verify cross-language serialization compatibility

ADR-0008: Observability Strategy

Status: Accepted

Context:

With 24 microservices across multiple namespaces, the platform requires comprehensive observability to be operable. The team is small, so observability must be automated rather than relying on manual investigation.

Decision:

Adopt the three pillars of observability with open-source tools:

| Pillar | Tool | Integration |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | Micrometer (Java), prometheus_client (Python) |
| Tracing | Tempo | OpenTelemetry SDK, @Traced annotations |
| Logging | Loki | Structured JSON logging with tenant context enrichment |

All three are deployed in the matih-observability namespace and queryable through the observability-api Control Plane service.

Key decisions within this strategy:

  • Structured JSON logging (not plain text) for machine-parseable logs
  • Tenant context (tenant_id, user_id, correlation_id) enriched into every log line, metric label, and trace attribute
  • @Traced, @Timed, and @Logged annotations in commons-java for declarative instrumentation
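
What a tenant-enriched structured log line looks like can be sketched directly. Field names follow the ADR (tenant_id, user_id, correlation_id); real services would emit this through their logging framework with proper JSON escaping, not by hand-building strings as this sketch does.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative machine-parseable JSON log line with tenant context enrichment.
public final class StructuredLog {
    public static String line(String level, String message,
                              String tenantId, String userId, String correlationId) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", Instant.now().toString());
        fields.put("level", level);
        fields.put("message", message);
        fields.put("tenant_id", tenantId);
        fields.put("user_id", userId);
        fields.put("correlation_id", correlationId);
        StringBuilder sb = new StringBuilder("{");
        fields.forEach((k, v) -> sb.append('"').append(k).append("\":\"").append(v).append("\","));
        sb.setLength(sb.length() - 1); // drop trailing comma
        return sb.append('}').toString();
    }
}
```

With every line carrying tenant_id, Loki label queries can slice logs per tenant without parsing free-form text.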

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Datadog / New Relic | Vendor lock-in; cost scales unpredictably with data volume; contradicts cloud-agnostic principle |
| ELK Stack (Elasticsearch + Logstash + Kibana) | Higher resource consumption than Loki; Elasticsearch already used for audit/search |
| Jaeger for tracing | Tempo offers better Grafana integration and lower resource usage; compatible with same OpenTelemetry SDK |

Consequences:

  • All services must include observability dependencies (commons-java handles this for Java services)
  • Log volume management requires attention (retention policies, sampling)
  • Grafana dashboards must be provisioned as code (JSON models in Helm charts)
  • Alert rules must be tenant-aware to avoid noisy per-tenant alerts
  • The observability-api service provides a unified query interface for the admin UI

ADR-0009: Configuration Management

Status: Accepted

Context:

The platform needs centralized configuration that supports:

  • Per-environment settings (dev, staging, production)
  • Per-tenant overrides (custom settings per customer)
  • Feature flags for gradual rollouts
  • Hot reload without service restarts
  • Audit trail for configuration changes

Decision:

Build a dedicated config-service that provides hierarchical configuration with Redis Pub/Sub for change distribution:

Global defaults
  --> Environment overrides (dev, staging, prod)
    --> Service-specific overrides
      --> Tenant-specific overrides

Configuration changes published via Redis Pub/Sub trigger local cache invalidation in all subscribed services, enabling zero-downtime configuration updates.
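
The four-layer resolution order can be sketched as successive map overlays, where each layer overrides the ones before it. Keys and values are illustrative, and the Redis Pub/Sub invalidation path is omitted; this shows only how an effective configuration is computed.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative hierarchical resolution: global -> environment -> service -> tenant.
public final class HierarchicalConfig {
    public static Map<String, String> resolve(Map<String, String> globals,
                                              Map<String, String> environment,
                                              Map<String, String> service,
                                              Map<String, String> tenant) {
        Map<String, String> effective = new LinkedHashMap<>();
        for (Map<String, String> layer : List.of(globals, environment, service, tenant)) {
            effective.putAll(layer); // later layers override earlier ones
        }
        return effective;
    }
}
```

For example, a tenant-level `query.timeout` override wins over the global default, while untouched keys fall through to the layer that defines them.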

Alternatives Considered:

| Alternative | Reason Rejected |
| --- | --- |
| Spring Cloud Config Server | Only serves Java services; no tenant-specific override support built-in |
| Consul / etcd | Additional infrastructure dependency; does not natively support hierarchical overrides with tenant scoping |
| Kubernetes ConfigMaps only | No hot reload (requires pod restart); no tenant-specific overrides; no audit trail |
| Environment variables only | No dynamic updates; no hierarchy; difficult to manage at scale |

Consequences:

  • Configuration reads are fast (local cache with Redis invalidation)
  • Configuration writes go through the config-service API (audit trail)
  • Tenant-specific overrides enable per-customer customization
  • Feature flags can be toggled without deployment
  • Services must handle configuration change events gracefully (no restart required)
  • The config-service itself uses file-based configuration for bootstrap (avoiding circular dependency)

ADR Template

Future architectural decisions should follow this template:

## ADR-XXXX: [Title]
 
**Status:** Proposed | Accepted | Deprecated | Superseded by ADR-XXXX
 
**Context:**
What is the issue that we are seeing that is motivating this decision?
 
**Decision:**
What is the change that we are proposing and/or doing?
 
**Alternatives Considered:**
What other options did we evaluate, and why were they rejected?
 
**Consequences:**
What becomes easier or harder as a result of this decision?

Decision Log Summary

The following table summarizes the key technology choices made across all ADRs:

| Decision Area | Choice | Key Rationale |
| --- | --- | --- |
| Architecture | Two-plane microservices | Separation of management and workload concerns |
| Multi-tenancy | Hybrid isolation (namespace + schema + app) | Balance of security and operational efficiency |
| Authentication | JWT with 4 token types | Local validation, no round-trips, multi-client support |
| Authorization | Hierarchical RBAC | Platform, tenant, and resource-level permissions |
| Async messaging | Apache Kafka | Durable event streaming with exactly-once semantics |
| Real-time messaging | Redis Pub/Sub | Low-latency ephemeral notifications |
| API Gateway | Kong 3.5.0 (DB-less) | Custom Lua plugins, declarative config, SSE/WS support |
| Backend (Control Plane) | Java 21 / Spring Boot 3.2 | Enterprise ecosystem, Hibernate multi-tenancy |
| Backend (AI/ML) | Python / FastAPI | AI/ML library ecosystem, async support |
| Backend (Rendering) | Node.js | Headless browser ecosystem for chart rendering |
| Metrics | Prometheus + Grafana | Open-source, Kubernetes-native, tenant-aware labels |
| Tracing | Tempo + OpenTelemetry | Grafana integration, low resource footprint |
| Logging | Loki | Grafana integration, label-based querying |
| Configuration | Custom config-service + Redis | Hierarchical overrides, tenant-specific, hot reload |

Related Sections