Tenant Service Architecture
The Tenant Service manages the complete lifecycle of tenants from initial creation through provisioning, configuration, scaling, suspension, and eventual decommissioning. It is the most operationally complex Control Plane service, orchestrating interactions with the IAM service, infrastructure service, Kubernetes API, DNS providers, and all Data Plane services.
2.3.C.1 Provisioning State Machine
Tenant provisioning is implemented as a state machine with 8 numbered phases plus an intermediate ingress phase (5.5). Each phase is idempotent, so it can be safely retried if it fails partway through. The state machine persists its current phase to PostgreSQL, which enables recovery from crashes.
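The phase ordering can be sketched as a Java enum. This is an illustration rather than the service's actual code: the enum name mirrors the `ProvisioningPhase` type referenced by the persistence entity, while the `next()` helper and the placement of `DEPLOY_INGRESS` between `DEPLOY_SERVICES` and `CONFIGURE` are assumptions inferred from the 5.5 numbering.

```java
import java.util.Optional;

/** Provisioning phases in execution order (illustrative sketch). */
enum ProvisioningPhase {
    VALIDATE,
    CREATE_NAMESPACE,
    DEPLOY_SECRETS,
    DEPLOY_DATABASES,
    DEPLOY_SERVICES,
    DEPLOY_INGRESS,   // documented as phase 5.5; position is assumed
    CONFIGURE,
    VERIFY,
    ACTIVATE;

    /** Returns the phase that follows this one, or empty after ACTIVATE. */
    Optional<ProvisioningPhase> next() {
        ProvisioningPhase[] all = values();
        int i = ordinal() + 1;
        return i < all.length ? Optional.of(all[i]) : Optional.empty();
    }
}
```

Keeping the order in one place means the persisted phase name is enough to determine what runs next after a restart.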
Phase Diagram
+----------+     +-----------+     +----------+     +----------+
| VALIDATE |---->| CREATE    |---->| DEPLOY   |---->| DEPLOY   |
|          |     | NAMESPACE |     | SECRETS  |     | DATABASES|
+----------+     +-----------+     +----------+     +----------+
                                                         |
+----------+     +-----------+     +----------+     +----v-----+
| ACTIVATE |<----| VERIFY    |<----| CONFIGURE|<----| DEPLOY   |
|          |     |           |     |          |     | SERVICES |
+----------+     +-----------+     +----------+     +----------+
                                        ^
                                   +----+-----+
                                   | DEPLOY   |
                                   | INGRESS  |
                                   | (5.5)    |
                                   +----------+
Phase Details
Phase 1: VALIDATE
Validates the tenant creation request:
- Tenant name uniqueness check
- Slug format validation (lowercase, alphanumeric, hyphens)
- Tier validation against available plans
- Admin email format and uniqueness
- Cloud provider availability check
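The slug format check can be sketched as follows. Slug rules matter because the namespace name in Phase 2 is derived from the slug, and Kubernetes limits namespace names to 63 characters. The class name `TenantSlug`, the ban on leading, trailing, and doubled hyphens, and the derived length limit are assumptions for illustration; the source only states "lowercase, alphanumeric, hyphens".

```java
import java.util.regex.Pattern;

/** Slug checks and the derived namespace name (illustrative sketch). */
final class TenantSlug {
    // Namespace prefix taken from the CREATE_NAMESPACE phase.
    static final String NAMESPACE_PREFIX = "matih-data-plane-";
    // Kubernetes namespace names are DNS labels, limited to 63 characters.
    static final int MAX_SLUG_LENGTH = 63 - NAMESPACE_PREFIX.length();
    // Assumed rule: lowercase alphanumeric groups separated by single hyphens.
    private static final Pattern FORMAT = Pattern.compile("[a-z0-9]+(-[a-z0-9]+)*");

    private TenantSlug() {}

    /** True if the slug matches the assumed format and fits in a namespace name. */
    static boolean isValid(String slug) {
        return slug != null
                && slug.length() <= MAX_SLUG_LENGTH
                && FORMAT.matcher(slug).matches();
    }

    /** Builds the tenant namespace name, rejecting invalid slugs. */
    static String namespaceFor(String slug) {
        if (!isValid(slug)) {
            throw new IllegalArgumentException("Invalid tenant slug: " + slug);
        }
        return NAMESPACE_PREFIX + slug;
    }
}
```

For example, `TenantSlug.namespaceFor("acme")` yields `matih-data-plane-acme`, while `Acme` or `-acme` is rejected before any resources are created.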
Phase 2: CREATE_NAMESPACE
Creates the tenant's Kubernetes namespace with security boundaries:
# Created resources:
# 1. Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: matih-data-plane-{tenant-slug}
  labels:
    matih.io/tenant: "{tenant-slug}"
    matih.io/tier: "{tier}"
    matih.io/managed-by: "tenant-service"
# 2. NetworkPolicy (restrict cross-namespace traffic)
# 3. ResourceQuota (CPU, memory, pod limits per tier)
# 4. ServiceAccount (for pod identity)
# 5. RBAC RoleBindings
Phase 3: DEPLOY_SECRETS
Creates Kubernetes secrets for database credentials, service tokens, and external integrations:
| Secret | Contents | Used By |
|---|---|---|
| tenant-db-credentials | PostgreSQL username, password | All Java services |
| tenant-redis-credentials | Redis password | All services |
| tenant-kafka-credentials | Kafka SASL credentials | Event-producing services |
| tenant-jwt-secret | JWT signing key | IAM service proxy |
| tenant-llm-api-key | OpenAI/Azure API key | AI service |
Phase 4: DEPLOY_DATABASES
Provisions per-tenant PostgreSQL schemas for each Data Plane service:
-- Schema creation for each service
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_ai;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_bi;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_query;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_catalog;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_pipeline;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_ml;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_quality;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_ontology;
CREATE SCHEMA IF NOT EXISTS {tenant_slug}_governance;
Phase 5: DEPLOY_SERVICES
Deploys all 14 Data Plane services via Helm:
For each service in [query-engine, catalog-service, semantic-layer, bi-service,
                     pipeline-service, ai-service, ml-service, data-quality-service,
                     render-service, data-plane-agent, ontology-service,
                     governance-service, ops-agent-service, auth-proxy]:
  helm install {tenant}-{service} infrastructure/helm/{service}/ \
    --namespace matih-data-plane-{tenant-slug} \
    --values values.yaml \
    --values values-{environment}.yaml \
    --set tenant.id={tenant-slug} \
    --set tenant.tier={tier}
Phase 5.5: DEPLOY_INGRESS
Provisions the tenant's ingress infrastructure:
- Deploy NGINX ingress controller in tenant namespace (dedicated LoadBalancer IP)
- Create Azure DNS child zone (e.g., acme.matih.ai) with NS delegation
- Create A records pointing to the tenant's LoadBalancer IP
- Create cert-manager Certificate for TLS (DNS01 challenge)
- Create Kubernetes Ingress resource with TLS termination
Phase 6: CONFIGURE
Applies tenant-specific configuration overrides through the config-service:
- AI model preferences (GPT-4, Claude, custom models)
- Query timeout limits
- Dashboard branding
- Feature flag overrides based on tier
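One way to picture how an override interacts with a tier default is a two-layer lookup. This is a sketch under assumptions, not the config-service API: the class name `FeatureFlags`, the flag names in the usage example, and the rule that unknown flags are disabled are all hypothetical.

```java
import java.util.Map;

/** Two-layer feature flag lookup (illustrative sketch, not the config-service API). */
final class FeatureFlags {
    private final Map<String, Boolean> tierDefaults;
    private final Map<String, Boolean> tenantOverrides;

    FeatureFlags(Map<String, Boolean> tierDefaults, Map<String, Boolean> tenantOverrides) {
        // Defensive immutable copies so later changes to the inputs have no effect.
        this.tierDefaults = Map.copyOf(tierDefaults);
        this.tenantOverrides = Map.copyOf(tenantOverrides);
    }

    /** A tenant override wins; otherwise the tier default applies; unknown flags are off. */
    boolean isEnabled(String flag) {
        Boolean override = tenantOverrides.get(flag);
        if (override != null) {
            return override;
        }
        return tierDefaults.getOrDefault(flag, false);
    }
}
```

With tier defaults `{"ai-chat": true, "custom-models": false}` and a tenant override `{"custom-models": true}`, both flags resolve to enabled, and any flag missing from both maps resolves to disabled.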
Phase 7: VERIFY
Health-checks all deployed services:
for (Service service : deployedServices) {
    String healthUrl = String.format(
        "http://%s.%s.svc.cluster.local:%d/health",
        service.getName(),
        tenantNamespace,
        service.getPort()
    );
    HealthStatus status = healthChecker.check(healthUrl, Duration.ofSeconds(30));
    if (status != HealthStatus.HEALTHY) {
        throw new ProvisioningException(
            "Service " + service.getName() + " failed health check"
        );
    }
}
Phase 8: ACTIVATE
Finalizes provisioning:
- Creates tenant admin user via IAM service
- Sets tenant status to ACTIVE
- Publishes TENANT_PROVISIONED event to Kafka
- Sends welcome notification to tenant admin
2.3.C.2 State Persistence and Recovery
The provisioning state is persisted to PostgreSQL after each phase transition:
@Entity
@Table(name = "tenant_provisioning_state")
public class ProvisioningState {
    @Id
    private UUID tenantId;
    private ProvisioningPhase currentPhase;
    private ProvisioningStatus status; // IN_PROGRESS, COMPLETED, FAILED
    private String failureReason;
    private Map<String, Object> phaseOutputs; // Results from each phase
    private Instant startedAt;
    private Instant lastUpdatedAt;
    private int retryCount;
}
If the tenant-service crashes during provisioning, the state machine resumes from the last completed phase on restart. Each phase checks whether its work has already been done (idempotency) before executing.
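The resume-and-skip behavior can be sketched as a small runner. Everything here is hypothetical scaffolding: `PhaseStep`, `PhaseCheckpoint`, and `ProvisioningRunner` are illustrative names, and the in-memory checkpoint stands in for the `tenant_provisioning_state` row that the real service writes to PostgreSQL.

```java
import java.util.ArrayList;
import java.util.List;

/** One idempotent unit of provisioning work (hypothetical interface). */
interface PhaseStep {
    String name();
    boolean alreadyDone();   // e.g. "does the namespace already exist?"
    void execute();
}

/** In-memory stand-in for the persisted provisioning state. */
final class PhaseCheckpoint {
    private int completed = 0;   // number of phases finished so far

    int completed() { return completed; }
    void markCompleted(int count) { completed = count; }
}

final class ProvisioningRunner {
    private ProvisioningRunner() {}

    /**
     * Runs steps in order, starting after the last checkpointed one.
     * Returns the names of the steps that actually executed.
     */
    static List<String> run(List<PhaseStep> steps, PhaseCheckpoint checkpoint) {
        List<String> executed = new ArrayList<>();
        for (int i = checkpoint.completed(); i < steps.size(); i++) {
            PhaseStep step = steps.get(i);
            if (!step.alreadyDone()) {         // idempotency guard
                step.execute();
                executed.add(step.name());
            }
            checkpoint.markCompleted(i + 1);   // persist after each phase transition
        }
        return executed;
    }
}
```

If a crash happens after a phase's side effects but before its checkpoint is written, the next run revisits that phase, and the `alreadyDone()` guard prevents the work from being applied twice.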
2.3.C.3 Tenant Tiers and Quotas
Each tenant is assigned a tier that determines resource quotas and feature access:
| Tier | CPU Quota | Memory Quota | Max Pods | Max Users | Features |
|---|---|---|---|---|---|
| Free | 2 cores | 4Gi | 20 | 5 | Basic analytics, limited AI |
| Professional | 8 cores | 16Gi | 50 | 50 | Full analytics, AI chat, ML |
| Enterprise | Custom | Custom | Custom | Unlimited | All features, custom models, SLA |
Tier-specific quotas are enforced via Kubernetes ResourceQuotas in the tenant namespace. Feature gating is enforced through the config-service, which evaluates tenant tier when resolving feature flags.
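The fixed values from the tier table can be represented in code roughly as follows. This is a sketch: `Tier`, `Quota`, and `TierQuotas` are hypothetical names, and modelling Enterprise as "no fixed defaults" is an assumption based on its Custom and Unlimited entries.

```java
import java.util.Optional;

enum Tier { FREE, PROFESSIONAL, ENTERPRISE }

/** Fixed quota values copied from the tier table. */
final class Quota {
    final int cpuCores;
    final int memoryGi;
    final int maxPods;
    final int maxUsers;

    Quota(int cpuCores, int memoryGi, int maxPods, int maxUsers) {
        this.cpuCores = cpuCores;
        this.memoryGi = memoryGi;
        this.maxPods = maxPods;
        this.maxUsers = maxUsers;
    }
}

final class TierQuotas {
    private TierQuotas() {}

    /** Empty for ENTERPRISE, whose limits are listed as Custom. */
    static Optional<Quota> defaultsFor(Tier tier) {
        switch (tier) {
            case FREE:
                return Optional.of(new Quota(2, 4, 20, 5));
            case PROFESSIONAL:
                return Optional.of(new Quota(8, 16, 50, 50));
            default:
                return Optional.empty();   // ENTERPRISE
        }
    }

    /** True if adding one more user stays within the tier's user limit. */
    static boolean canAddUser(Tier tier, int currentUsers) {
        return defaultsFor(tier)
                .map(q -> currentUsers < q.maxUsers)
                .orElse(true);   // ENTERPRISE: unlimited users
    }
}
```

A Free tenant with 5 users would be refused a sixth, while an Enterprise tenant is never blocked by this check. The CPU, memory, and pod values would feed the namespace ResourceQuota instead of an application-level check.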
2.3.C.4 Tenant Lifecycle Operations
Suspension
Tenant suspension preserves all data but stops all services:
1. Set tenant status to SUSPENDED
2. Scale all Data Plane deployments to 0 replicas
3. Revoke all active JWT tokens (add to blacklist)
4. Publish TENANT_SUSPENDED event
5. Send notification to tenant admin
Reactivation
1. Scale all Data Plane deployments back to configured replicas
2. Wait for all services to pass health checks
3. Set tenant status to ACTIVE
4. Publish TENANT_ACTIVATED event
Deletion
Tenant deletion is a destructive, multi-step process:
1. Set tenant status to DELETING
2. Export audit logs to long-term storage (compliance requirement)
3. Delete all Helm releases in tenant namespace
4. Delete tenant PostgreSQL schemas
5. Delete tenant Kubernetes namespace (cascades all resources)
6. Delete DNS zone and ingress records
7. Delete tenant record from Control Plane database
8. Publish TENANT_DELETED event
Deletion requires platform_admin role and a confirmation code.
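The status changes in these three operations can be guarded by a small transition table. The sketch below uses only the statuses named above. Allowing deletion from SUSPENDED, treating DELETING as terminal, and the `TenantLifecycle` helper itself are assumptions for illustration.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum TenantStatus { ACTIVE, SUSPENDED, DELETING }

/** Allowed tenant status transitions and the deletion guard (illustrative sketch). */
final class TenantLifecycle {
    private static final Map<TenantStatus, Set<TenantStatus>> ALLOWED =
            new EnumMap<>(TenantStatus.class);

    static {
        ALLOWED.put(TenantStatus.ACTIVE,
                EnumSet.of(TenantStatus.SUSPENDED, TenantStatus.DELETING));
        // Assumption: a suspended tenant may be reactivated or deleted.
        ALLOWED.put(TenantStatus.SUSPENDED,
                EnumSet.of(TenantStatus.ACTIVE, TenantStatus.DELETING));
        // Assumption: DELETING is terminal.
        ALLOWED.put(TenantStatus.DELETING, EnumSet.noneOf(TenantStatus.class));
    }

    private TenantLifecycle() {}

    static boolean canTransition(TenantStatus from, TenantStatus to) {
        return ALLOWED.get(from).contains(to);
    }

    /** Deletion needs a legal transition, the platform_admin role, and a matching code. */
    static boolean mayDelete(TenantStatus current, Set<String> callerRoles,
                             String expectedCode, String suppliedCode) {
        return canTransition(current, TenantStatus.DELETING)
                && callerRoles.contains("platform_admin")
                && expectedCode != null
                && expectedCode.equals(suppliedCode);
    }
}
```

Checking the transition before step 1 of each operation keeps a retried or duplicated request from, for example, reactivating a tenant that is already being deleted.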
Related Sections
- Control Plane Overview -- All Control Plane services
- Namespace Isolation -- Kubernetes namespace per tenant
- Tenant Lifecycle -- Complete provisioning workflow documentation