Control Plane Service Interactions
This section documents how the 10 Control Plane services communicate with each other. Understanding these interaction patterns is essential for debugging cross-service issues, planning capacity, and reasoning about failure propagation within the management layer.
2.3.A.1 Synchronous REST Interactions
The following table lists the direct REST dependencies of Control Plane services, including calls to backends such as Prometheus and Elasticsearch:
| Source Service | Target Service | Protocol | Purpose | Failure Impact |
|---|---|---|---|---|
| api-gateway | iam-service | REST | JWT token validation, signing key retrieval | All authenticated requests fail |
| api-gateway | config-service | REST | Dynamic routing configuration, feature flags | Falls back to cached config |
| api-gateway | All services | REST | Request proxying to target services | Specific service routes fail |
| tenant-service | iam-service | REST | Create tenant admin user during provisioning | Provisioning pauses at ACTIVATE phase |
| tenant-service | infrastructure-service | REST | Namespace creation, database provisioning, DNS setup | Provisioning pauses at current phase |
| billing-service | config-service | REST | Retrieve tenant tier and feature entitlements | Falls back to cached tier data |
| observability-api | Prometheus | REST | Query metrics via PromQL | Metrics API returns empty results |
| observability-api | Elasticsearch | REST | Query audit logs and search indices | Search API returns empty results |
REST Client Configuration
All inter-service REST calls use the RetryableRestClient from commons-java, which provides retries with exponential backoff and a circuit breaker:

    // Retry configuration (from commons-java)
    RetryableRestClient {
        maxRetries: 3
        initialBackoffMs: 100
        maxBackoffMs: 5000
        backoffMultiplier: 2.0
        jitterFactor: 0.1          // 10% jitter to prevent thundering herd

        // Circuit breaker settings
        failureThreshold: 5        // Open circuit after 5 consecutive failures
        resetTimeoutMs: 30000      // Try half-open after 30 seconds
        halfOpenRequests: 3        // Allow 3 test requests in half-open state
    }

When a downstream service is unavailable, the circuit breaker opens and subsequent requests fail fast (within 1 ms) rather than accumulating timeouts. This prevents cascading failures in which a slow IAM service causes thread pool exhaustion in the API gateway.
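To make these settings concrete, here is a minimal sketch of how a retry schedule and a three-state circuit breaker could be derived from them. The class and method names (`BackoffSketch`, `CircuitBreakerSketch`) are hypothetical and are not taken from commons-java; the real RetryableRestClient may compute delays differently. The sketch assumes symmetric jitter of plus or minus 10% and, for brevity, does not enforce the limit of 3 half-open test requests.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper illustrating the retry settings; not part of commons-java. */
final class BackoffSketch {
    static final long INITIAL_BACKOFF_MS = 100;
    static final long MAX_BACKOFF_MS = 5000;
    static final double BACKOFF_MULTIPLIER = 2.0;
    static final double JITTER_FACTOR = 0.1;

    /** Delay before retry number `attempt` (1-based), capped at MAX_BACKOFF_MS, without jitter. */
    static long baseDelayMs(int attempt) {
        double raw = INITIAL_BACKOFF_MS * Math.pow(BACKOFF_MULTIPLIER, attempt - 1);
        return (long) Math.min(raw, MAX_BACKOFF_MS);
    }

    /** Base delay scaled by a random factor in [1 - JITTER_FACTOR, 1 + JITTER_FACTOR) (assumed symmetric). */
    static long jitteredDelayMs(int attempt) {
        double jitter = ThreadLocalRandom.current().nextDouble(-JITTER_FACTOR, JITTER_FACTOR);
        return Math.round(baseDelayMs(attempt) * (1.0 + jitter));
    }
}

/** Hypothetical breaker; the current time is passed in so the transitions are easy to follow. */
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    static final int FAILURE_THRESHOLD = 5;
    static final long RESET_TIMEOUT_MS = 30_000;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    /** Returns true if a call may proceed; moves OPEN to HALF_OPEN once the reset timeout has elapsed. */
    boolean allowRequest(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= RESET_TIMEOUT_MS) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    /** Any success closes the circuit and resets the failure count. */
    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** Opens the circuit on the fifth consecutive failure, or on any failure while half-open. */
    void recordFailure(long nowMs) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
            state = State.OPEN;
            openedAtMs = nowMs;
        }
    }

    State state() {
        return state;
    }
}
```

With the values above, the un-jittered delays before the three retries are 100 ms, 200 ms, and 400 ms, so the 5000 ms cap only matters if maxRetries is raised.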
2.3.A.2 Asynchronous Kafka Interactions
Services publish events to Kafka topics when operations complete. Consumers process these events asynchronously, decoupling the producer from consumer availability.
| Publisher | Topic | Event Types | Consumers |
|---|---|---|---|
| tenant-service | tenant.lifecycle.events | TENANT_CREATED, TENANT_PROVISIONED, TENANT_SUSPENDED, TENANT_DELETED | audit-service, billing-service, notification-service |
| iam-service | security.audit.events | USER_LOGIN, USER_LOGOUT, MFA_ENABLED, PASSWORD_CHANGED, API_KEY_CREATED | audit-service |
| billing-service | billing.usage.events | USAGE_THRESHOLD_REACHED, SUBSCRIPTION_CHANGED, INVOICE_GENERATED | notification-service |
| config-service | config.change.events | CONFIG_UPDATED, FEATURE_FLAG_TOGGLED | All services (via consumer groups) |
Event Flow: Tenant Provisioning
The tenant provisioning workflow demonstrates how Kafka coordinates multiple services:
    tenant-service: POST /api/v1/tenants (creates tenant record)
      |
      +--> [Phase 1-8: Provisioning executes]
      |
      +--> Kafka: tenant.lifecycle.events
           {eventType: "TENANT_PROVISIONED", tenantId: "acme-corp", ...}
             |
             +--> audit-service: Records provisioning in audit trail
             |      (persists to PostgreSQL + Elasticsearch)
             |
             +--> billing-service: Creates subscription record
             |      (sets up usage metering for new tenant)
             |
             +--> notification-service: Sends welcome notification
                    (email to tenant admin, Slack to ops channel)

Each consumer operates in its own consumer group, so each one receives every event independently of the others. If the notification-service is temporarily down, it processes the events it missed when it restarts, provided they are still within the topic's configured retention period.
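The consumer-group behaviour can be illustrated with a small in-memory model. This is only a sketch: `TopicLogSketch` is a hypothetical stand-in for a single topic partition, it does not use the Kafka client API, and it assumes a new group starts reading from the beginning of the log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for one topic partition; hypothetical, not the Kafka client API. */
final class TopicLogSketch {
    private final List<String> log = new ArrayList<>();
    // One committed offset per consumer group, so groups never affect each other.
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    void publish(String event) {
        log.add(event);
    }

    /** Returns the events this group has not yet seen and advances only this group's offset. */
    List<String> poll(String groupId) {
        int from = committedOffsets.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committedOffsets.put(groupId, log.size());
        return batch;
    }
}
```

In this model, a group that polls late (for example notification-service after a restart) still receives every event published while it was away, because its offset did not move in the meantime.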
2.3.A.3 Redis Pub/Sub Interactions
Redis Pub/Sub is used for ephemeral, real-time broadcasts that do not require durability:
| Publisher | Channel | Subscribers | Purpose |
|---|---|---|---|
| config-service | config:changes | All Control Plane services | Configuration change notification |
| observability-api | health:status:control-plane | api-gateway | Real-time health status for routing decisions |
When the config-service updates a configuration value, it publishes to the config:changes Redis channel. All subscribing services receive the notification and refresh their local configuration cache from the config-service REST API. This pattern enables zero-downtime configuration changes.
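The notify-then-refetch pattern described above might look like the following on the subscriber side. This is a sketch under assumptions: `ConfigCacheSketch` is a hypothetical class, the Redis subscription itself is omitted, the notification is assumed to carry the changed key, and the fetch function stands in for the call to the config-service REST API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical local configuration cache refreshed on change notifications. */
final class ConfigCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the REST call to config-service (assumption).
    private final Function<String, String> fetchFromConfigService;

    ConfigCacheSketch(Function<String, String> fetchFromConfigService) {
        this.fetchFromConfigService = fetchFromConfigService;
    }

    /** Serves from the local cache, fetching only on the first read of a key. */
    String get(String key) {
        return cache.computeIfAbsent(key, fetchFromConfigService);
    }

    /** Invoked by the subscriber when a message arrives on config:changes. */
    void onConfigChanged(String key) {
        String fresh = fetchFromConfigService.apply(key);
        if (fresh != null) {
            cache.put(key, fresh);
        } else {
            cache.remove(key);
        }
    }
}
```

The notification only triggers the refresh; the value itself always comes from the config-service, so a subscriber never applies a value it has not fetched from the source.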
2.3.A.4 Service-to-Service Authentication
Inter-service REST calls are authenticated using short-lived service tokens:
    // Service token generation (from iam-service)
    String serviceToken = jwtTokenProvider.generateServiceToken(
        "tenant-service",                    // requesting service name
        Set.of("user:create", "user:read")   // required scopes
    );

    // Token claims
    {
      "sub": "service:tenant-service",
      "iss": "matih-platform",
      "type": "service",
      "scopes": ["user:create", "user:read"],
      "exp": 1707700300  // 5-minute expiry
    }

Service tokens are cached locally for their validity period (minus a 30-second buffer) to avoid requesting new tokens on every inter-service call.
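A minimal sketch of that caching rule follows. `ServiceTokenCacheSketch` and its nested `Token` type are hypothetical names, and the issuer function stands in for the call that obtains a token from iam-service; only the 30-second buffer and the 5-minute expiry come from the text above.

```java
import java.util.function.LongFunction;

/** Hypothetical cache that reuses a service token until 30 seconds before it expires. */
final class ServiceTokenCacheSketch {
    static final long REFRESH_BUFFER_SECONDS = 30;

    /** Token value plus its `exp` claim in epoch seconds. */
    static final class Token {
        final String value;
        final long expEpochSeconds;

        Token(String value, long expEpochSeconds) {
            this.value = value;
            this.expEpochSeconds = expEpochSeconds;
        }
    }

    // Stand-in for requesting a new token; receives the current time in epoch seconds (assumption).
    private final LongFunction<Token> issuer;
    private Token cached;

    ServiceTokenCacheSketch(LongFunction<Token> issuer) {
        this.issuer = issuer;
    }

    /** Returns the cached token, requesting a new one once the buffer window is reached. */
    synchronized String get(long nowEpochSeconds) {
        if (cached == null || nowEpochSeconds >= cached.expEpochSeconds - REFRESH_BUFFER_SECONDS) {
            cached = issuer.apply(nowEpochSeconds);
        }
        return cached.value;
    }
}
```

With a 5-minute expiry, a token issued at time t is reused until t + 270 seconds, which keeps a token from expiring while a request that carries it is still in flight.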
Header Propagation
When a Control Plane service makes a downstream call, it propagates these headers:
    Authorization: Bearer <service-token>
    X-Tenant-ID: <current-tenant-context>
    X-Correlation-ID: <original-request-correlation-id>
    X-Request-ID: <new-request-id-for-this-hop>
    X-Source-Service: <calling-service-name>

This header chain enables end-to-end request tracing across service boundaries.
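As an illustration, the header set for a downstream call could be assembled as follows. `HeaderPropagationSketch` is a hypothetical helper, not part of the platform's libraries; generating a fresh correlation ID when the incoming request has none is an assumption, as is the use of random UUIDs for IDs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper that builds the headers for one downstream hop. */
final class HeaderPropagationSketch {
    static Map<String, String> downstreamHeaders(
            Map<String, String> incoming, String serviceToken, String callingService) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("Authorization", "Bearer " + serviceToken);

        // Tenant context is forwarded unchanged when present.
        String tenantId = incoming.get("X-Tenant-ID");
        if (tenantId != null) {
            out.put("X-Tenant-ID", tenantId);
        }

        // The correlation ID stays the same across all hops; a new one is minted
        // only if the caller did not send one (assumption).
        String correlationId = incoming.get("X-Correlation-ID");
        out.put("X-Correlation-ID",
                correlationId != null ? correlationId : UUID.randomUUID().toString());

        // The request ID is always new for this hop; the incoming one is not reused.
        out.put("X-Request-ID", UUID.randomUUID().toString());
        out.put("X-Source-Service", callingService);
        return out;
    }
}
```

Keeping X-Correlation-ID constant while regenerating X-Request-ID lets a trace be grouped by the original request and still distinguish each individual hop.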
2.3.A.5 Dependency Graph
The complete dependency graph for Control Plane services, showing both synchronous and asynchronous dependencies:
                      +------------------+
                      |   api-gateway    |
                      +--------+---------+
                               |
                +--------------+--------------+
                | (REST)       | (REST)       |
      +---------v---+  +-------v------+ +-----v---------+
      | iam-service |  |config-service| | All services  |
      +------+------+  +------+-------+ | (proxied)     |
             |                |         +---------------+
             |    +-----------+----------+
             |    | Redis Pub/Sub        |
             |    | (config broadcast)   |
             |    +----------------------+
             |
      +------v-------+
      |tenant-service|
      +------+-------+
             |
      +------+------+------+------+------+
      |      |      |      |      |      |
      v      v      v      v      v      v
    infra  notif  audit billing  K8s    DNS
     svc    svc    svc    svc    API    API
    (REST)(Kafka)(Kafka)(Kafka)

Critical Path
The critical path for the platform is: api-gateway --> iam-service. If the IAM service is unavailable, no authenticated requests can be processed. This dependency is mitigated by:
- Token caching: The gateway caches validated tokens for their remaining TTL
- Multi-replica deployment: IAM service runs 3 replicas in production
- Fast health checks: 10-second intervals with a 3-failure threshold for removal, so an unhealthy replica is taken out of rotation after about 30 seconds
Startup Order
Services should start in this order to satisfy dependencies:
1. PostgreSQL, Redis, Kafka (infrastructure)
2. config-service (configuration source)
3. iam-service (authentication)
4. All other Control Plane services (in any order)
5. api-gateway (last, needs all upstreams ready)

In Kubernetes, this is enforced through readiness probes. The api-gateway only becomes ready after its health checks confirm that iam-service and config-service are reachable.
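The readiness rule for the gateway reduces to a conjunction over its required upstreams. The sketch below is illustrative only: `GatewayReadinessSketch` is a hypothetical name, and the reachability check is passed in as a function rather than implemented as a real health probe.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical readiness check: ready only when every required upstream is reachable. */
final class GatewayReadinessSketch {
    static final List<String> REQUIRED_UPSTREAMS = List.of("iam-service", "config-service");

    /** `isReachable` stands in for a real health check against the named service (assumption). */
    static boolean isReady(Predicate<String> isReachable) {
        return REQUIRED_UPSTREAMS.stream().allMatch(isReachable);
    }
}
```

A readiness endpoint built this way reports not-ready as long as either upstream is down, so the gateway receives no traffic until both are available.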
Related Sections
- Control Plane Overview -- Service descriptions and resource allocation
- Gateway Architecture -- API gateway deep dive
- Event-Driven Architecture -- Kafka and Redis patterns
- Service Topology -- Full platform topology