MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Control Plane Service Interactions

Production - REST + Kafka inter-service communication

This section documents how the 10 Control Plane services communicate with each other. Understanding these interaction patterns is essential for debugging cross-service issues, planning capacity, and reasoning about failure propagation within the management layer.


2.3.A.1 Synchronous REST Interactions

The following table shows every direct REST dependency between Control Plane services:

| Source Service | Target Service | Protocol | Purpose | Failure Impact |
|---|---|---|---|---|
| api-gateway | iam-service | REST | JWT token validation, signing key retrieval | All authenticated requests fail |
| api-gateway | config-service | REST | Dynamic routing configuration, feature flags | Falls back to cached config |
| api-gateway | All services | REST | Request proxying to target services | Specific service routes fail |
| tenant-service | iam-service | REST | Create tenant admin user during provisioning | Provisioning pauses at ACTIVATE phase |
| tenant-service | infrastructure-service | REST | Namespace creation, database provisioning, DNS setup | Provisioning pauses at current phase |
| billing-service | config-service | REST | Retrieve tenant tier and feature entitlements | Falls back to cached tier data |
| observability-api | Prometheus | REST | Query metrics via PromQL | Metrics API returns empty results |
| observability-api | Elasticsearch | REST | Query audit logs and search indices | Search API returns empty results |

REST Client Configuration

All inter-service REST calls use the RetryableRestClient from commons-java, which provides:

// Retry configuration (from commons-java)
RetryableRestClient {
    maxRetries: 3
    initialBackoffMs: 100
    maxBackoffMs: 5000
    backoffMultiplier: 2.0
    jitterFactor: 0.1   // 10% jitter to prevent thundering herd
 
    // Circuit breaker settings
    failureThreshold: 5       // Open circuit after 5 consecutive failures
    resetTimeoutMs: 30000     // Try half-open after 30 seconds
    halfOpenRequests: 3       // Allow 3 test requests in half-open state
}
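
To make the retry settings above concrete, the following sketch computes the delay schedule they imply. It is not the commons-java implementation: the class name BackoffPolicy and its methods are hypothetical, and applying the jitter symmetrically (plus or minus 10% of the base delay) is an assumption, since the configuration only states the jitter factor.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper illustrating exponential backoff with jitter. */
final class BackoffPolicy {
    private final long initialBackoffMs;
    private final long maxBackoffMs;
    private final double multiplier;
    private final double jitterFactor;

    BackoffPolicy(long initialBackoffMs, long maxBackoffMs,
                  double multiplier, double jitterFactor) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.multiplier = multiplier;
        this.jitterFactor = jitterFactor;
    }

    /** Delay before retry number `attempt` (1-based), capped, without jitter. */
    long baseDelayMs(int attempt) {
        double raw = initialBackoffMs * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(raw, maxBackoffMs);
    }

    /** Base delay spread by +/- jitterFactor (assumed symmetric). */
    long jitteredDelayMs(int attempt) {
        long base = baseDelayMs(attempt);
        if (jitterFactor <= 0.0) {
            return base; // no jitter configured
        }
        double offset = ThreadLocalRandom.current()
                .nextDouble(-jitterFactor, jitterFactor);
        return Math.round(base * (1.0 + offset));
    }
}
```

With the values shown in the configuration (100 ms initial, multiplier 2.0, 3 retries), the base delays are 100 ms, 200 ms, and 400 ms, so the 5000 ms cap only comes into play if maxRetries is raised.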

When a downstream service is unavailable, the circuit breaker opens and subsequent requests fail fast (within 1 ms) instead of accumulating timeouts. This prevents cascading failures, such as a slow IAM service exhausting the API gateway's thread pool.
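
The open, half-open, and closed transitions described above can be sketched as a small state machine. This is an illustration only, not the RetryableRestClient code: the class SimpleCircuitBreaker is hypothetical, the clock is injected so the behavior can be exercised without waiting, and closing the circuit on the first successful half-open request is a simplifying assumption.

```java
import java.util.function.LongSupplier;

/** Hypothetical circuit breaker mirroring the settings listed above. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long resetTimeoutMs;
    private final int halfOpenRequests;
    private final LongSupplier clockMs; // injected time source, in milliseconds

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int halfOpenPermits = 0;
    private long openedAtMs = 0;

    SimpleCircuitBreaker(int failureThreshold, long resetTimeoutMs,
                         int halfOpenRequests, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.resetTimeoutMs = resetTimeoutMs;
        this.halfOpenRequests = halfOpenRequests;
        this.clockMs = clockMs;
    }

    /** Returns false when the call should fail fast without reaching the network. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN
                && clockMs.getAsLong() - openedAtMs >= resetTimeoutMs) {
            state = State.HALF_OPEN;            // reset timeout elapsed: probe again
            halfOpenPermits = halfOpenRequests;
        }
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.HALF_OPEN && halfOpenPermits > 0) {
            halfOpenPermits--;                  // limited number of test requests
            return true;
        }
        return false;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;                   // assumption: one success closes
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = clockMs.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

A caller would check allowRequest() before each REST call and report the outcome with recordSuccess() or recordFailure(). While the breaker is open, the check returns immediately, which is what keeps request threads from piling up behind timeouts.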


2.3.A.2 Asynchronous Kafka Interactions

Services publish events to Kafka topics when operations complete. Consumers process these events asynchronously, decoupling the producer from consumer availability.

| Publisher | Topic | Event Types | Consumers |
|---|---|---|---|
| tenant-service | tenant.lifecycle.events | TENANT_CREATED, TENANT_PROVISIONED, TENANT_SUSPENDED, TENANT_DELETED | audit-service, billing-service, notification-service |
| iam-service | security.audit.events | USER_LOGIN, USER_LOGOUT, MFA_ENABLED, PASSWORD_CHANGED, API_KEY_CREATED | audit-service |
| billing-service | billing.usage.events | USAGE_THRESHOLD_REACHED, SUBSCRIPTION_CHANGED, INVOICE_GENERATED | notification-service |
| config-service | config.change.events | CONFIG_UPDATED, FEATURE_FLAG_TOGGLED | All services (via consumer groups) |

Event Flow: Tenant Provisioning

The tenant provisioning workflow demonstrates how Kafka coordinates multiple services:

tenant-service: POST /api/v1/tenants (creates tenant record)
    |
    +--> [Phase 1-8: Provisioning executes]
    |
    +--> Kafka: tenant.lifecycle.events
         {eventType: "TENANT_PROVISIONED", tenantId: "acme-corp", ...}
         |
         +--> audit-service: Records provisioning in audit trail
         |    (persists to PostgreSQL + Elasticsearch)
         |
         +--> billing-service: Creates subscription record
         |    (sets up usage metering for new tenant)
         |
         +--> notification-service: Sends welcome notification
              (email to tenant admin, Slack to ops channel)

Each consuming service operates in its own consumer group, so every service receives every event independently of the others. If the notification-service is temporarily down, it processes the events it missed when it restarts, provided it comes back within the topic's configured retention period.
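
The per-group delivery behavior can be illustrated with a small in-memory model. This is not Kafka client code: InMemoryTopic is a hypothetical stand-in that keeps one offset per consumer group, and it assumes a new group starts reading from the beginning of the log (in Kafka itself, the starting position of a new group depends on consumer configuration).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a topic with per-group offsets. */
final class InMemoryTopic {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    /** Appends an event to the retained log. */
    void publish(String event) {
        log.add(event);
    }

    /** Returns the events this group has not seen yet and advances its offset. */
    List<String> poll(String groupId) {
        int from = committedOffsets.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committedOffsets.put(groupId, log.size());
        return batch;
    }
}
```

In this model, if audit-service polls after every publish while notification-service does not poll at all for a while, the first poll by notification-service returns every retained event in order. That matches the catch-up behavior described above.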


2.3.A.3 Redis Pub/Sub Interactions

Redis Pub/Sub is used for ephemeral, real-time broadcasts that do not require durability:

| Publisher | Channel | Subscribers | Purpose |
|---|---|---|---|
| config-service | config:changes | All Control Plane services | Configuration change notification |
| observability-api | health:status:control-plane | api-gateway | Real-time health status for routing decisions |

When the config-service updates a configuration value, it publishes to the config:changes Redis channel. All subscribing services receive the notification and refresh their local configuration cache from the config-service REST API. This pattern enables zero-downtime configuration changes.
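
The notify-then-refetch pattern described above can be sketched without a Redis client. In this illustration, LocalConfigCache is a hypothetical class, the injected function stands in for the config-service REST call, and it is assumed that the message on config:changes identifies the key that changed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical local cache refreshed when a change notification arrives. */
final class LocalConfigCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the REST call to config-service.
    private final Function<String, String> fetchFromConfigService;

    LocalConfigCache(Function<String, String> fetchFromConfigService) {
        this.fetchFromConfigService = fetchFromConfigService;
    }

    /** Serves from the local cache, fetching on first access. */
    String get(String key) {
        return cache.computeIfAbsent(key, fetchFromConfigService);
    }

    /** Invoked by the subscriber when a message arrives on config:changes. */
    void onConfigChanged(String key) {
        String fresh = fetchFromConfigService.apply(key);
        if (fresh == null) {
            cache.remove(key);      // value was deleted upstream
        } else {
            cache.put(key, fresh);  // replace the stale entry
        }
    }
}
```

The notification carries no configuration data in this sketch; it only triggers a refetch. A service that misses a broadcast keeps serving its cached value until the next notification or restart, which is the trade-off of using a non-durable channel.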


2.3.A.4 Service-to-Service Authentication

Inter-service REST calls are authenticated using short-lived service tokens:

// Service token generation (from iam-service)
String serviceToken = jwtTokenProvider.generateServiceToken(
    "tenant-service",                    // requesting service name
    Set.of("user:create", "user:read")   // required scopes
);
 
// Token claims
{
    "sub": "service:tenant-service",
    "iss": "matih-platform",
    "type": "service",
    "scopes": ["user:create", "user:read"],
    "exp": 1707700300    // 5-minute expiry
}

Service tokens are cached locally for their validity period (minus a 30-second buffer) to avoid requesting new tokens on every inter-service call.
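
A minimal sketch of that caching rule follows. The class CachedServiceToken and its nested Token type are hypothetical, and the issuer supplier stands in for whatever call obtains a token from iam-service. The time source is injected so the 30-second buffer can be checked deterministically.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Hypothetical cache that reuses a service token until 30 s before expiry. */
final class CachedServiceToken {
    static final class Token {
        final String value;
        final Instant expiresAt;

        Token(String value, Instant expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private static final Duration REFRESH_BUFFER = Duration.ofSeconds(30);

    private final Supplier<Token> issuer;  // stand-in for the token request
    private final Supplier<Instant> now;   // injected time source
    private Token current;

    CachedServiceToken(Supplier<Token> issuer, Supplier<Instant> now) {
        this.issuer = issuer;
        this.now = now;
    }

    /** Returns the cached token, requesting a new one inside the buffer window. */
    synchronized String get() {
        Instant refreshAt = (current == null)
                ? null
                : current.expiresAt.minus(REFRESH_BUFFER);
        if (current == null || !now.get().isBefore(refreshAt)) {
            current = issuer.get();
        }
        return current.value;
    }
}
```

With the 5-minute expiry shown in the token claims, a token is reused for 270 seconds and replaced on the first call after that, so a request never leaves with a token that is about to expire in flight.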

Header Propagation

When a Control Plane service makes a downstream call, it propagates these headers:

Authorization: Bearer <service-token>
X-Tenant-ID: <current-tenant-context>
X-Correlation-ID: <original-request-correlation-id>
X-Request-ID: <new-request-id-for-this-hop>
X-Source-Service: <calling-service-name>

This header chain enables end-to-end request tracing across service boundaries.
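
As an illustration of the propagation rule, the sketch below builds the header set for one downstream hop. DownstreamHeaders is a hypothetical helper, and using a random UUID for X-Request-ID is an assumption; the source only states that each hop gets a new request ID while the correlation ID is carried over unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper that assembles headers for one downstream call. */
final class DownstreamHeaders {
    private DownstreamHeaders() {
    }

    static Map<String, String> build(String serviceToken,
                                     String tenantId,
                                     String correlationId,
                                     String sourceService) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Authorization", "Bearer " + serviceToken);
        headers.put("X-Tenant-ID", tenantId);
        // Carried over unchanged so every hop shares one trace key.
        headers.put("X-Correlation-ID", correlationId);
        // Fresh per hop so individual calls can be told apart (UUID is assumed).
        headers.put("X-Request-ID", UUID.randomUUID().toString());
        headers.put("X-Source-Service", sourceService);
        return headers;
    }
}
```

Two calls made while handling the same inbound request share the X-Correlation-ID value and differ in X-Request-ID, which lets logs be grouped by request and then ordered by hop.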


2.3.A.5 Dependency Graph

The complete dependency graph for Control Plane services, showing both synchronous and asynchronous dependencies:

                    +------------------+
                    |   api-gateway    |
                    +--------+---------+
                             |
              +--------------+--------------+
              |  (REST)      |  (REST)      |
    +---------v---+  +-------v-----+  +-----v---------+
    | iam-service |  |config-service|  | All services  |
    +------+------+  +------+------+  | (proxied)     |
           |                |          +---------------+
           |    +-----------+----------+
           |    | Redis Pub/Sub        |
           |    | (config broadcast)   |
           |    +----------------------+
           |
    +------v------+
    |tenant-service|
    +------+------+
           |
    +------+------+------+------+------+
    |      |      |      |      |      |
    v      v      v      v      v      v
  infra  notif  audit  billing  K8s   DNS
  svc    svc    svc    svc     API   API
 (REST) (Kafka)(Kafka)(Kafka)

Critical Path

The critical path for the platform is: api-gateway --> iam-service. If the IAM service is unavailable, no authenticated requests can be processed. This dependency is mitigated by:

  1. Token caching: The gateway caches validated tokens for their remaining TTL
  2. Multi-replica deployment: IAM service runs 3 replicas in production
  3. Fast health checks: 10-second intervals with 3-failure threshold for removal

Startup Order

Services should start in this order to satisfy dependencies:

1. PostgreSQL, Redis, Kafka (infrastructure)
2. config-service (configuration source)
3. iam-service (authentication)
4. All other Control Plane services (in any order)
5. api-gateway (last, needs all upstreams ready)

In Kubernetes, this ordering is enforced through readiness probes: the api-gateway becomes ready only after its health checks confirm that iam-service and config-service are reachable.
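
The readiness rule can be sketched as an aggregate check that a readiness endpoint might call. GatewayReadiness is a hypothetical class, and each BooleanSupplier stands in for a reachability check against one upstream; how the real gateway performs those checks is not specified here.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical aggregate readiness check for the api-gateway. */
final class GatewayReadiness {
    private final BooleanSupplier[] upstreamChecks;

    GatewayReadiness(BooleanSupplier... upstreamChecks) {
        this.upstreamChecks = upstreamChecks;
    }

    /** Ready only if every upstream check passes; an exception counts as not ready. */
    boolean isReady() {
        for (BooleanSupplier check : upstreamChecks) {
            try {
                if (!check.getAsBoolean()) {
                    return false;
                }
            } catch (RuntimeException e) {
                return false; // unreachable upstream
            }
        }
        return true;
    }
}
```

Wiring one check for iam-service and one for config-service reproduces the behavior described above: the gateway reports not ready until both respond, so it receives no traffic before its dependencies are up.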

