Two-Tier Provisioning
MATIH's provisioning architecture separates tenant setup into two distinct tiers: the control plane tier and the data plane tier. This separation is a deliberate design choice that isolates fast, lightweight metadata operations from slow, resource-intensive infrastructure operations. It ensures that control plane responsiveness is never blocked by data plane provisioning, and that each tier can fail and recover independently.
Architectural Overview
Tenant Registration Request
|
v
+---------------------------+
| CONTROL PLANE TIER |
| (Synchronous, fast) |
| |
| 1. Create tenant record |
| 2. Assign admin user |
| 3. Configure billing |
| 4. Set default roles |
| 5. Initialize settings |
+---------------------------+
|
Responds to user
(tenant created)
|
v
+---------------------------+
| DATA PLANE TIER |
| (Asynchronous, slow) |
| |
| 6. Create namespace |
| 7. Deploy database |
| 8. Deploy core services |
| 9. Configure networking |
| 10. Deploy data services |
| 11. Deploy ingress |
| 12. Create DNS zone |
| 13. Configure ingress |
| 14. Deploy monitoring |
| 15. Setup observability |
+---------------------------+
Control Plane Tier
The control plane tier handles tenant metadata creation. These operations interact only with the control plane PostgreSQL database and are completed synchronously within the HTTP request lifecycle.
Operations
| Step | Service | Duration | Description |
|---|---|---|---|
| Create tenant record | TenantService | < 100ms | Insert tenant entity with configuration |
| Create admin user | IamServiceClient | < 200ms | Call IAM service to create initial admin |
| Configure billing | BillingPlanService | < 100ms | Assign default billing plan and subscription |
| Set default roles | RoleService | < 100ms | Create tenant-specific roles (Admin, Analyst, Viewer) |
| Initialize settings | SettingsService | < 100ms | Set default tenant configuration values |
| Create API key | ApiKeyService | < 100ms | Generate initial admin API key |
Characteristics
| Property | Value |
|---|---|
| Execution model | Synchronous, within HTTP request |
| Typical duration | 200-500ms |
| Dependencies | PostgreSQL, IAM Service |
| Failure recovery | Database transaction rollback |
| User feedback | Immediate HTTP response |
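The rollback behaviour listed above can be illustrated with a minimal sketch. It uses an in-memory SQLite database in place of the control plane PostgreSQL database, and a local `admins` table stands in for the IAM service call; the table names, columns, and the `create_tenant` and `make_schema` helpers are assumptions made for illustration, not the actual schema or service code.

```python
import sqlite3
import uuid


def make_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical control plane tables for the sketch."""
    conn.executescript(
        """
        CREATE TABLE tenants (id TEXT PRIMARY KEY, name TEXT NOT NULL,
                              slug TEXT UNIQUE NOT NULL, status TEXT NOT NULL);
        CREATE TABLE admins (tenant_id TEXT NOT NULL, email TEXT NOT NULL);
        CREATE TABLE roles (tenant_id TEXT NOT NULL, name TEXT NOT NULL);
        """
    )


def create_tenant(conn: sqlite3.Connection, name: str, slug: str, admin_email: str) -> str:
    """Run the metadata steps in one transaction; a failure undoes all of them."""
    tenant_id = str(uuid.uuid4())
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "INSERT INTO tenants (id, name, slug, status) VALUES (?, ?, ?, 'PROVISIONING')",
            (tenant_id, name, slug),
        )
        conn.execute(
            "INSERT INTO admins (tenant_id, email) VALUES (?, ?)",
            (tenant_id, admin_email),
        )
        for role in ("Admin", "Analyst", "Viewer"):
            conn.execute(
                "INSERT INTO roles (tenant_id, name) VALUES (?, ?)",
                (tenant_id, role),
            )
    return tenant_id
```

If any step raises, for example a constraint violation while creating the admin, no partial tenant row is left behind. A real remote IAM call would not be covered by the database transaction and would need its own compensation.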
Tenant Record After Control Plane Tier
After the control plane tier completes, the tenant record has:
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "Acme Corporation",
"slug": "acme",
"tier": "PROFESSIONAL",
"status": "PROVISIONING",
"region": "eastus",
"adminEmail": "admin@acme.com",
"provisioningStartedAt": "2026-02-12T10:00:00Z",
"deploymentType": null,
"kubernetesNamespace": null,
"azureAksClusterName": null
}
The tenant is usable for basic control plane operations (viewing settings, managing users) but cannot run queries or access data plane features until provisioning completes.
Data Plane Tier
The data plane tier handles infrastructure provisioning and service deployment. These operations are asynchronous and managed by the ProvisioningOrchestrator with a state machine that tracks progress.
Operations
| Phase | Service | Duration | Description |
|---|---|---|---|
| CREATE_NAMESPACE | ProvisioningService | 5-30s | Create Kubernetes namespace and RBAC |
| SETUP_DATABASE | ProvisioningService | 30-120s | Provision tenant database schema |
| DEPLOY_CORE_SERVICES | TenantHelmService | 60-300s | Deploy essential services (query engine, AI service) |
| CONFIGURE_NETWORKING | ProvisioningService | 10-30s | Apply network policies |
| DEPLOY_DATA_SERVICES | TenantHelmService | 60-300s | Deploy data pipeline and catalog services |
| DEPLOY_INGRESS_CONTROLLER | TenantIngressService | 30-120s | Deploy per-tenant NGINX |
| CREATE_DNS_ZONE | AzureDnsService | 10-60s | Create child DNS zone with NS delegation |
| CREATE_TENANT_INGRESS | TenantIngressService | 10-30s | Create Ingress resources and TLS certificates |
| DEPLOY_MONITORING | TenantHelmService | 30-60s | Deploy Prometheus, Grafana dashboards |
| SETUP_OBSERVABILITY | ProvisioningService | 10-30s | Configure log aggregation and tracing |
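As a rough cross-check, the per-phase ranges in the table can be summed to get lower and upper bounds on a full run. The sketch below assumes the phases run strictly one after another with no retries, which the table itself does not state.

```python
# (phase, min_seconds, max_seconds), copied from the operations table.
PHASES = [
    ("CREATE_NAMESPACE", 5, 30),
    ("SETUP_DATABASE", 30, 120),
    ("DEPLOY_CORE_SERVICES", 60, 300),
    ("CONFIGURE_NETWORKING", 10, 30),
    ("DEPLOY_DATA_SERVICES", 60, 300),
    ("DEPLOY_INGRESS_CONTROLLER", 30, 120),
    ("CREATE_DNS_ZONE", 10, 60),
    ("CREATE_TENANT_INGRESS", 10, 30),
    ("DEPLOY_MONITORING", 30, 60),
    ("SETUP_OBSERVABILITY", 10, 30),
]


def total_bounds(phases):
    """Return (best_case_seconds, worst_case_seconds) for a sequential run."""
    best = sum(low for _, low, _ in phases)
    worst = sum(high for _, _, high in phases)
    return best, worst


best, worst = total_bounds(PHASES)
print(f"{best} s to {worst} s ({best / 60:.2f} to {worst / 60:.0f} minutes)")
# prints "255 s to 1080 s (4.25 to 18 minutes)"
```

Under that assumption a run takes roughly 4 to 18 minutes, which is in line with the 5-15 minute typical duration quoted for shared clusters.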
Characteristics
| Property | Value |
|---|---|
| Execution model | Asynchronous (@Async with custom thread pool) |
| Typical duration | 5-15 minutes (shared), 15-45 minutes (dedicated) |
| Dependencies | Kubernetes, Helm, Azure, Terraform |
| Failure recovery | State machine retry with exponential backoff |
| User feedback | WebSocket status updates, polling endpoint |
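The asynchronous execution model can be sketched with a worker pool: the request handler records an initial status, submits the phase runner, and returns without waiting. This is an illustration in Python rather than the Spring `@Async` code; the `status` store, the `handle_registration` and `run_phases` names, the shortened phase list, and the final `COMPLETED` status value are all assumptions.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Shortened phase list for illustration; the full list has ten phases.
PHASES = ["CREATE_NAMESPACE", "SETUP_DATABASE", "DEPLOY_CORE_SERVICES"]

# Hypothetical in-memory status store, keyed by tenant id.
status = {}

# Dedicated pool so provisioning work never occupies request threads.
executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="provisioning")


def run_phases(tenant_id, phases, run_phase):
    """Execute phases in order, recording progress after each one."""
    for phase in phases:
        status[tenant_id]["currentPhase"] = phase
        run_phase(tenant_id, phase)  # raises on failure, leaving status as-is
        status[tenant_id]["completedPhases"].append(phase)
    status[tenant_id]["currentPhase"] = None
    status[tenant_id]["status"] = "COMPLETED"


def handle_registration(tenant_id, run_phase) -> Future:
    """Record the initial status and hand the slow work to the pool."""
    status[tenant_id] = {
        "status": "PROVISIONING",
        "currentPhase": None,
        "completedPhases": [],
    }
    return executor.submit(run_phases, tenant_id, PHASES, run_phase)
```

The caller gets a `Future` back immediately; a polling endpoint or WebSocket publisher would read from the status store while the phases run.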
Tier-Specific Provisioning Paths
The data plane tier follows different paths depending on the tenant tier:
Free Tier Path
Free tier tenants are provisioned on the shared cluster with resource quotas:
INITIAL
|
v
VALIDATING_INPUT
|
v
CREATING_TENANT_RECORD
|
v
ALLOCATING_SHARED_CLUSTER <-- Select shared cluster, create namespace
|
v
CONFIGURING_QUOTAS <-- Apply ResourceQuota and LimitRange
|
v
DEPLOYING_SERVICES <-- Deploy subset of services via Helm
|
v
VERIFYING_CONNECTIVITY <-- Health checks on deployed services
|
v
COMPLETED
Free tier resource quotas:
| Resource | Limit |
|---|---|
| CPU requests | 2 cores |
| CPU limits | 4 cores |
| Memory requests | 4 Gi |
| Memory limits | 8 Gi |
| Pods | 20 |
| Services | 10 |
| PVCs | 5 |
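The quota table maps directly onto the `spec.hard` keys of a Kubernetes `ResourceQuota`. The sketch below builds such a manifest as a plain dictionary; the quota name `tenant-quota` and the namespace `tenant-acme` are hypothetical, and how the platform actually applies the object is not shown here.

```python
import json

# Free tier limits from the table, using standard ResourceQuota keys.
FREE_TIER_LIMITS = {
    "requests.cpu": "2",
    "limits.cpu": "4",
    "requests.memory": "4Gi",
    "limits.memory": "8Gi",
    "pods": "20",
    "services": "10",
    "persistentvolumeclaims": "5",
}


def resource_quota_manifest(namespace, hard, name="tenant-quota"):
    """Build a ResourceQuota manifest for one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": dict(hard)},  # copy so callers cannot mutate the template
    }


manifest = resource_quota_manifest("tenant-acme", FREE_TIER_LIMITS)
print(json.dumps(manifest, indent=2))
```

The JSON output can be piped to `kubectl apply -f -` or passed to a Kubernetes client; a `LimitRange` for per-container defaults would be a separate object.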
Professional Tier Path
Professional tier tenants get a dedicated namespace on the shared cluster with higher resource quotas and full service deployment:
INITIAL
|
v
VALIDATING_INPUT
|
v
CREATING_TENANT_RECORD
|
v
ALLOCATING_SHARED_CLUSTER <-- Dedicated namespace in shared cluster
|
v
CONFIGURING_QUOTAS <-- Higher quotas
|
v
DEPLOYING_SERVICES <-- Full service stack
|
v
VERIFYING_CONNECTIVITY
|
v
COMPLETED
Enterprise Tier Path
Enterprise tier tenants get a fully dedicated Kubernetes cluster provisioned through Terraform:
INITIAL
|
v
VALIDATING_INPUT
|
v
CREATING_TENANT_RECORD
|
v
VALIDATING_SERVICE_PRINCIPAL <-- Validate Azure credentials
|
v
ACQUIRING_TERRAFORM_LOCK <-- Distributed lock for Terraform state
|
v
PROVISIONING_INFRASTRUCTURE <-- Terraform: AKS, networking, storage
|
v
CREATING_KUBERNETES_RESOURCES <-- Namespace, RBAC, network policies
|
v
DEPLOYING_SERVICES <-- Full service stack + custom config
|
v
VERIFYING_CONNECTIVITY
|
v
COMPLETED
Provisioning Job Entity
Each provisioning attempt is tracked by a TenantProvisioningJob entity:
| Field | Type | Description |
|---|---|---|
| id | UUID | Job identifier |
| tenantId | UUID | Target tenant |
| tier | TenantTier | Tier being provisioned |
| currentState | TenantProvisioningState | Current state machine state |
| initiatedBy | UUID | User who triggered provisioning |
| startedAt | Instant | Job start time |
| completedAt | Instant | Job completion time |
| errorMessage | String | Last error message |
| retryCount | Integer | Number of retry attempts |
| maxRetries | Integer | Maximum allowed retries (default: 3) |
| nextRetryAt | Instant | Scheduled retry time |
| context | Map | Key-value metadata (tenant slug, cluster name, etc.) |
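To show how the job fields and the tier-specific state paths fit together, here is a minimal Python sketch of the entity with a method that moves it one state forward. The field names mirror the table; the tier names `FREE` and `ENTERPRISE`, the `PATHS` lookup, and the `advance` method are assumptions for illustration and not the actual entity code.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional

SHARED_PATH = [
    "INITIAL", "VALIDATING_INPUT", "CREATING_TENANT_RECORD",
    "ALLOCATING_SHARED_CLUSTER", "CONFIGURING_QUOTAS",
    "DEPLOYING_SERVICES", "VERIFYING_CONNECTIVITY", "COMPLETED",
]
ENTERPRISE_PATH = [
    "INITIAL", "VALIDATING_INPUT", "CREATING_TENANT_RECORD",
    "VALIDATING_SERVICE_PRINCIPAL", "ACQUIRING_TERRAFORM_LOCK",
    "PROVISIONING_INFRASTRUCTURE", "CREATING_KUBERNETES_RESOURCES",
    "DEPLOYING_SERVICES", "VERIFYING_CONNECTIVITY", "COMPLETED",
]
# Free and Professional share the same states and differ in quotas and services.
PATHS = {"FREE": SHARED_PATH, "PROFESSIONAL": SHARED_PATH, "ENTERPRISE": ENTERPRISE_PATH}


@dataclass
class TenantProvisioningJob:
    tenant_id: uuid.UUID
    tier: str
    initiated_by: uuid.UUID
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    current_state: str = "INITIAL"
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    completed_at: Optional[datetime] = None
    error_message: Optional[str] = None
    retry_count: int = 0
    max_retries: int = 3
    next_retry_at: Optional[datetime] = None
    context: Dict[str, str] = field(default_factory=dict)

    def advance(self) -> str:
        """Move to the next state on this tier's path; COMPLETED is terminal."""
        path = PATHS[self.tier]
        index = path.index(self.current_state)
        if index < len(path) - 1:
            self.current_state = path[index + 1]
        if self.current_state == "COMPLETED" and self.completed_at is None:
            self.completed_at = datetime.now(timezone.utc)
        return self.current_state
```

Keeping the paths as ordered lists makes the completed and remaining states for a status response a matter of slicing the list at the current index.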
Provisioning Status API
Clients can track provisioning progress through a polling endpoint:
GET /api/v1/tenants/{tenantId}/provisioning/status
Authorization: Bearer {admin_token}
{
"tenantId": "550e8400-e29b-41d4-a716-446655440000",
"status": "PROVISIONING",
"currentPhase": "DEPLOYING_SERVICES",
"completedPhases": [
"VALIDATING_INPUT",
"CREATING_TENANT_RECORD",
"ALLOCATING_SHARED_CLUSTER",
"CONFIGURING_QUOTAS"
],
"remainingPhases": [
"VERIFYING_CONNECTIVITY"
],
"progress": 80,
"startedAt": "2026-02-12T10:00:00Z",
"estimatedCompletion": "2026-02-12T10:08:00Z",
"retryCount": 0
}
Failure Handling Comparison
| Aspect | Control Plane Tier | Data Plane Tier |
|---|---|---|
| Failure recovery | Transaction rollback | State-machine-based retry |
| Retry strategy | Immediate re-attempt | Exponential backoff (60s, 120s, 240s) |
| Max retries | 1 (within transaction) | 3 (configurable) |
| Rollback | Automatic (DB transaction) | Step-by-step reverse operations |
| User notification | Immediate error response | Async notification (email, webhook) |
| Admin visibility | Error in HTTP response | Provisioning dashboard with phase details |
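The data plane retry schedule in the table (60s, 120s, 240s, at most 3 retries) corresponds to a delay that doubles with each attempt. A minimal sketch, assuming a 60-second base delay and hypothetical helper names `backoff_delay` and `schedule_retry`:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backoff_delay(retry_count: int, base_seconds: int = 60) -> int:
    """Delay before the next retry: 60s, 120s, 240s for counts 0, 1, 2."""
    return base_seconds * 2 ** retry_count


def schedule_retry(
    retry_count: int, max_retries: int = 3, now: Optional[datetime] = None
) -> Optional[datetime]:
    """Return the next retry time, or None once the retry budget is spent."""
    if retry_count >= max_retries:
        return None
    now = now or datetime.now(timezone.utc)
    return now + timedelta(seconds=backoff_delay(retry_count))


print([backoff_delay(n) for n in range(3)])  # prints [60, 120, 240]
```

A `None` result is the point at which a job would be marked as failed and step-by-step rollback would begin.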
Benefits of Two-Tier Separation
Responsiveness. The control plane tier typically responds to the user within 200-500ms. The user can start configuring their tenant (settings, users, roles) while infrastructure provisioning runs in the background.
Resilience. A failure in Kubernetes or Terraform does not prevent the tenant record from being created. The data plane tier can retry independently.
Observability. Each tier has its own monitoring. Control plane operations are tracked through standard HTTP metrics. Data plane provisioning has dedicated dashboards with phase-level progress.
Scalability. Control plane operations scale with the database. Data plane operations scale with the number of available provisioning workers. These can be scaled independently.
Testability. Control plane logic can be tested with database integration tests. Data plane logic can be tested with Kubernetes test clusters or mocked clients.
Next Steps
- 10-Phase Provisioning Flow -- detailed walkthrough of each data plane phase
- DNS Zone Management -- how DNS is configured during provisioning
- Per-Tenant Ingress -- ingress controller deployment