Control Plane Service Interactions

Production - REST + Kafka inter-service communication

This section documents how the 10 Control Plane services communicate with each other. Understanding these interaction patterns is essential for debugging cross-service issues, planning capacity, and reasoning about failure propagation within the management layer.

2.3.A.1 Synchronous REST Interactions

The following table shows every direct REST dependency between Control Plane services:

Source Service	Target Service	Protocol	Purpose	Failure Impact
`api-gateway`	`iam-service`	REST	JWT token validation, signing key retrieval	All authenticated requests fail
`api-gateway`	`config-service`	REST	Dynamic routing configuration, feature flags	Falls back to cached config
`api-gateway`	All services	REST	Request proxying to target services	Specific service routes fail
`tenant-service`	`iam-service`	REST	Create tenant admin user during provisioning	Provisioning pauses at ACTIVATE phase
`tenant-service`	`infrastructure-service`	REST	Namespace creation, database provisioning, DNS setup	Provisioning pauses at current phase
`billing-service`	`config-service`	REST	Retrieve tenant tier and feature entitlements	Falls back to cached tier data
`observability-api`	Prometheus	REST	Query metrics via PromQL	Metrics API returns empty results
`observability-api`	Elasticsearch	REST	Query audit logs and search indices	Search API returns empty results
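
The "falls back to cached" behaviour listed for callers of `config-service` can be pictured with a small last-known-good wrapper. This is only a sketch: `CachedFallback` and its methods are hypothetical names, not part of commons-java, and it assumes a failed remote call surfaces as a `RuntimeException`.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical wrapper: serves the last successfully fetched value when the remote call fails.
final class CachedFallback<T> {
    private final Supplier<T> remoteFetch;
    private final AtomicReference<T> lastKnownGood = new AtomicReference<>();

    CachedFallback(Supplier<T> remoteFetch) {
        this.remoteFetch = remoteFetch;
    }

    // Returns the fresh value if the remote call succeeds, otherwise the cached copy (if any).
    Optional<T> get() {
        try {
            T fresh = remoteFetch.get();
            lastKnownGood.set(fresh);
            return Optional.ofNullable(fresh);
        } catch (RuntimeException e) {
            // Remote unavailable: fall back to the cached copy, which may be absent on a cold start.
            return Optional.ofNullable(lastKnownGood.get());
        }
    }
}
```

A service that has never reached `config-service` has nothing to fall back to, so callers still need a policy for the empty case.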

REST Client Configuration

All inter-service REST calls use the RetryableRestClient from commons-java, which provides:

// Retry configuration (from commons-java)
RetryableRestClient {
    maxRetries: 3
    initialBackoffMs: 100
    maxBackoffMs: 5000
    backoffMultiplier: 2.0
    jitterFactor: 0.1   // 10% jitter to prevent thundering herd
 
    // Circuit breaker settings
    failureThreshold: 5       // Open circuit after 5 consecutive failures
    resetTimeoutMs: 30000     // Try half-open after 30 seconds
    halfOpenRequests: 3       // Allow 3 test requests in half-open state
}
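
As a rough illustration of how the retry settings above combine, the following sketch computes the delay before each retry. The class and method names are hypothetical and not the actual commons-java API, and it assumes the jitter is applied symmetrically (plus or minus 10%) around the capped exponential delay.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not the real commons-java implementation.
final class BackoffSketch {
    static final int MAX_RETRIES = 3;
    static final long INITIAL_BACKOFF_MS = 100;
    static final long MAX_BACKOFF_MS = 5000;
    static final double BACKOFF_MULTIPLIER = 2.0;
    static final double JITTER_FACTOR = 0.1;

    // Delay before retry number `attempt` (0-based), without jitter, capped at MAX_BACKOFF_MS.
    static long baseDelayMs(int attempt) {
        double raw = INITIAL_BACKOFF_MS * Math.pow(BACKOFF_MULTIPLIER, attempt);
        return (long) Math.min(raw, MAX_BACKOFF_MS);
    }

    // Assumption: jitter shifts the delay uniformly within +/- JITTER_FACTOR of the base delay.
    static long jitteredDelayMs(int attempt) {
        long base = baseDelayMs(attempt);
        double offset = ThreadLocalRandom.current().nextDouble(-JITTER_FACTOR, JITTER_FACTOR);
        return Math.round(base * (1.0 + offset));
    }
}
```

With these values the un-jittered delays for the three retries are 100 ms, 200 ms, and 400 ms. The 5000 ms cap would only take effect from the seventh retry onward, so it is never reached with `maxRetries: 3`.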

When a downstream service is unavailable, the circuit breaker opens after 5 consecutive failures, and subsequent requests fail fast (within 1 ms) rather than accumulating timeouts. After the 30-second reset timeout, the breaker moves to half-open and admits up to 3 test requests. This prevents cascading failures in which a slow IAM service exhausts the API gateway's thread pool.
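
The state transitions implied by the circuit breaker settings can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the commons-java code: the class name, the injected clock, and the rule that a single success in half-open closes the circuit are all hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; names and structure are assumptions, not commons-java code.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold = 5;
    private final long resetTimeoutMs = 30_000;
    private final int halfOpenRequests = 3;
    private final LongSupplier clockMs;

    private State state = State.CLOSED;
    private int consecutiveFailures;
    private int halfOpenPermits;
    private long openedAtMs;

    CircuitBreakerSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    // Returns false when the call should fail fast without touching the network.
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clockMs.getAsLong() - openedAtMs >= resetTimeoutMs) {
            state = State.HALF_OPEN;
            halfOpenPermits = halfOpenRequests;
        }
        if (state == State.CLOSED) return true;
        if (state == State.HALF_OPEN && halfOpenPermits > 0) {
            halfOpenPermits--;
            return true;
        }
        return false;
    }

    // Assumption: any success resets the failure count and closes the circuit.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    // Opens the circuit on the 5th consecutive failure, or on any failure while half-open.
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAtMs = clockMs.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

Because `allowRequest()` only compares a timestamp and a counter, rejecting a call while the circuit is open costs no network round trip, which is what makes the fail-fast path cheap.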

2.3.A.2 Asynchronous Kafka Interactions

Services publish events to Kafka topics when operations complete. Consumers process these events asynchronously, decoupling the producer from consumer availability.

Publisher	Topic	Event Types	Consumers
`tenant-service`	`tenant.lifecycle.events`	TENANT_CREATED, TENANT_PROVISIONED, TENANT_SUSPENDED, TENANT_DELETED	audit-service, billing-service, notification-service
`iam-service`	`security.audit.events`	USER_LOGIN, USER_LOGOUT, MFA_ENABLED, PASSWORD_CHANGED, API_KEY_CREATED	audit-service
`billing-service`	`billing.usage.events`	USAGE_THRESHOLD_REACHED, SUBSCRIPTION_CHANGED, INVOICE_GENERATED	notification-service
`config-service`	`config.change.events`	CONFIG_UPDATED, FEATURE_FLAG_TOGGLED	All services (via consumer groups)

Event Flow: Tenant Provisioning

The tenant provisioning workflow demonstrates how Kafka coordinates multiple services:

tenant-service: POST /api/v1/tenants (creates tenant record)
    |
    +--> [Phase 1-8: Provisioning executes]
    |
    +--> Kafka: tenant.lifecycle.events
         {eventType: "TENANT_PROVISIONED", tenantId: "acme-corp", ...}
         |
         +--> audit-service: Records provisioning in audit trail
         |    (persists to PostgreSQL + Elasticsearch)
         |
         +--> billing-service: Creates subscription record
         |    (sets up usage metering for new tenant)
         |
         +--> notification-service: Sends welcome notification
              (email to tenant admin, Slack to ops channel)

Each consumer runs in its own consumer group, so every consumer receives every event independently of the others. If the notification-service is temporarily down, it resumes from its last committed offset when it restarts and processes the events it missed, provided they are still within the topic's configured retention period.
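
The fan-out and catch-up behaviour of consumer groups can be modelled in a few lines. The sketch below is an in-memory stand-in for a single topic partition, not Kafka client code; `TopicSketch` is a hypothetical name, and it assumes each group starts reading from the beginning of the log and that retention never expires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model of one topic partition with per-group offsets.
// This only illustrates the semantics; it is not Kafka client code.
final class TopicSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    void publish(String event) {
        log.add(event);
    }

    // Returns every event the group has not yet consumed and advances that group's offset.
    List<String> poll(String consumerGroup) {
        int from = committedOffsets.getOrDefault(consumerGroup, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committedOffsets.put(consumerGroup, log.size());
        return batch;
    }
}
```

In this model, a group that never polled while two events were published receives both on its next poll, while groups that kept polling each saw the same events one at a time. Offsets are tracked per group, so one slow consumer does not hold back the others.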

2.3.A.3 Redis Pub/Sub Interactions

Redis Pub/Sub is used for ephemeral, real-time broadcasts that do not require durability:

Publisher	Channel	Subscribers	Purpose
`config-service`	`config:changes`	All Control Plane services	Configuration change notification
`observability-api`	`health:status:control-plane`	`api-gateway`	Real-time health status for routing decisions

When the config-service updates a configuration value, it publishes to the config:changes Redis channel. All subscribing services receive the notification and refresh their local configuration cache from the config-service REST API. This pattern enables zero-downtime configuration changes.
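
The notify-then-refresh pattern described above might look roughly like this on the subscriber side. The sketch is hypothetical: `ConfigCacheSketch` is not a platform class, the fetch function stands in for the config-service REST call, and it assumes the Pub/Sub message carries only the key that changed and that config values are never null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical local cache that reloads a key when a change notification arrives.
final class ConfigCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the REST call to config-service (assumed to return non-null values).
    private final Function<String, String> fetchFromConfigService;

    ConfigCacheSketch(Function<String, String> fetchFromConfigService) {
        this.fetchFromConfigService = fetchFromConfigService;
    }

    // Serves from the local cache, loading the value on first access.
    String get(String key) {
        return cache.computeIfAbsent(key, fetchFromConfigService);
    }

    // Called by the Pub/Sub subscriber; assumption: the message carries only the changed key.
    void onConfigChanged(String key) {
        cache.put(key, fetchFromConfigService.apply(key));
    }
}
```

Keeping the notification payload small and re-reading the value over REST means a subscriber never applies a value it did not fetch from the source of truth. Redis Pub/Sub does not redeliver messages, so a service that was disconnected when a change was published keeps its stale entry until the next notification or restart.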

2.3.A.4 Service-to-Service Authentication

Inter-service REST calls are authenticated using short-lived service tokens:

// Service token generation (from iam-service)
String serviceToken = jwtTokenProvider.generateServiceToken(
    "tenant-service",                    // requesting service name
    Set.of("user:create", "user:read")   // required scopes
);
 
// Token claims
{
    "sub": "service:tenant-service",
    "iss": "matih-platform",
    "type": "service",
    "scopes": ["user:create", "user:read"],
    "exp": 1707700300    // 5-minute expiry
}

Service tokens are cached locally for their validity period (minus a 30-second buffer) to avoid requesting new tokens on every inter-service call.
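
One way to picture the caching rule above is a holder that reissues the token once it is within 30 seconds of expiry. This is a sketch under assumptions: `ServiceTokenCacheSketch`, its nested `Token` type, and the injected issuer and clock are hypothetical and do not reflect the real `jwtTokenProvider` API.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical cache; the token type and issuer are illustrative assumptions.
final class ServiceTokenCacheSketch {
    static final class Token {
        final String value;
        final long expiresAtEpochSec;

        Token(String value, long expiresAtEpochSec) {
            this.value = value;
            this.expiresAtEpochSec = expiresAtEpochSec;
        }
    }

    private static final long REFRESH_BUFFER_SEC = 30;
    private final Supplier<Token> issuer;       // stand-in for the call that issues a new token
    private final LongSupplier nowEpochSec;     // injectable clock, in epoch seconds
    private Token cached;

    ServiceTokenCacheSketch(Supplier<Token> issuer, LongSupplier nowEpochSec) {
        this.issuer = issuer;
        this.nowEpochSec = nowEpochSec;
    }

    // Reuses the cached token until it is within REFRESH_BUFFER_SEC of its expiry.
    synchronized String get() {
        long now = nowEpochSec.getAsLong();
        if (cached == null || now >= cached.expiresAtEpochSec - REFRESH_BUFFER_SEC) {
            cached = issuer.get();
        }
        return cached.value;
    }
}
```

With a 5-minute expiry, a token is reused for up to 270 seconds. The buffer keeps a token from expiring while a request that carries it is still in flight.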

Header Propagation

When a Control Plane service makes a downstream call, it propagates these headers:

Authorization: Bearer <service-token>
X-Tenant-ID: <current-tenant-context>
X-Correlation-ID: <original-request-correlation-id>
X-Request-ID: <new-request-id-for-this-hop>
X-Source-Service: <calling-service-name>

This header chain enables end-to-end request tracing across service boundaries.
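
The propagation rules can be sketched as a function from inbound headers to outbound headers. The helper below is hypothetical: it treats header names as case-sensitive map keys for brevity and assumes a correlation ID is minted only when the inbound request does not carry one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper that builds the outbound headers for one downstream hop.
final class HeaderPropagationSketch {
    static Map<String, String> downstreamHeaders(Map<String, String> inbound,
                                                 String serviceToken,
                                                 String callingService) {
        Map<String, String> out = new LinkedHashMap<>();
        // The downstream call is authenticated with the service token.
        out.put("Authorization", "Bearer " + serviceToken);
        // Tenant context is carried over unchanged.
        out.put("X-Tenant-ID", inbound.get("X-Tenant-ID"));
        // Assumption: mint a correlation ID only if the inbound request lacks one.
        out.put("X-Correlation-ID",
                inbound.getOrDefault("X-Correlation-ID", UUID.randomUUID().toString()));
        // Every hop gets a fresh request ID.
        out.put("X-Request-ID", UUID.randomUUID().toString());
        out.put("X-Source-Service", callingService);
        return out;
    }
}
```

The correlation ID stays constant across the whole call chain, while the request ID changes at each hop, so a trace can be grouped by the former and ordered by the latter.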

2.3.A.5 Dependency Graph

The complete dependency graph for Control Plane services, showing both synchronous and asynchronous dependencies:

                    +------------------+
                    |   api-gateway    |
                    +--------+---------+
                             |
              +--------------+--------------+
              |  (REST)      |  (REST)      |
    +---------v---+  +-------v-----+  +-----v---------+
    | iam-service |  |config-service|  | All services  |
    +------+------+  +------+------+  | (proxied)     |
           |                |          +---------------+
           |    +-----------+----------+
           |    | Redis Pub/Sub        |
           |    | (config broadcast)   |
           |    +----------------------+
           |
    +------v------+
    |tenant-service|
    +------+------+
           |
    +------+------+------+------+------+
    |      |      |      |      |      |
    v      v      v      v      v      v
  infra  notif  audit  billing  K8s   DNS
  svc    svc    svc    svc     API   API
 (REST) (Kafka)(Kafka)(Kafka)

Critical Path

The critical path for the platform is: api-gateway --> iam-service. If the IAM service is unavailable, no authenticated requests can be processed. This dependency is mitigated by:

Token caching: The gateway caches validated tokens for their remaining TTL
Multi-replica deployment: IAM service runs 3 replicas in production
Fast health checks: 10-second intervals with 3-failure threshold for removal

Startup Order

Services should start in this order to satisfy dependencies:

1. PostgreSQL, Redis, Kafka (infrastructure)
2. config-service (configuration source)
3. iam-service (authentication)
4. All other Control Plane services (in any order)
5. api-gateway (last, needs all upstreams ready)

In Kubernetes, this is enforced through readiness probes. The api-gateway only becomes ready after its health checks confirm that iam-service and config-service are reachable.
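
The readiness rule for the gateway amounts to an all-of check over its required upstreams. The following is a minimal sketch, assuming a hypothetical `isReachable` predicate that wraps whatever health check the gateway actually performs; the class name is illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical readiness check: ready only when every required upstream is reachable.
final class GatewayReadinessSketch {
    static final List<String> REQUIRED_UPSTREAMS = Arrays.asList("iam-service", "config-service");

    // `isReachable` is an assumed stand-in for the gateway's real upstream health check.
    static boolean isReady(Predicate<String> isReachable) {
        return REQUIRED_UPSTREAMS.stream().allMatch(isReachable);
    }
}
```

A readiness endpoint backed by a check like this would report not-ready until both upstreams respond, which keeps traffic away from the gateway even if its pod starts before they do.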

Related Sections

Control Plane Overview -- Service descriptions and resource allocation
Gateway Architecture -- API gateway deep dive
Event-Driven Architecture -- Kafka and Redis patterns
Service Topology -- Full platform topology
