Provisioning State Machine
The provisioning system uses two complementary mechanisms: a high-level TenantProvisioningState state machine managed by the ProvisioningOrchestrator, and a step-level execution system managed by the ProvisioningService. Together they provide reliable, observable, and recoverable provisioning workflows.
High-Level Provisioning States
The TenantProvisioningState enum defines all possible states in the provisioning lifecycle:
public enum TenantProvisioningState implements State {
    INITIAL,
    VALIDATING_INPUT,
    CREATING_TENANT_RECORD,

    // Service Principal (PRO/ENTERPRISE)
    VALIDATING_SERVICE_PRINCIPAL,

    // Terraform (PRO/ENTERPRISE)
    ACQUIRING_TERRAFORM_LOCK,
    PLANNING_INFRASTRUCTURE,
    APPLYING_INFRASTRUCTURE,
    RELEASING_TERRAFORM_LOCK,

    // Shared cluster (FREE)
    ALLOCATING_SHARED_CLUSTER,
    CREATING_SHARED_NAMESPACE,
    CONFIGURING_QUOTAS,
    APPLYING_RESOURCE_QUOTAS,
    APPLYING_NETWORK_POLICIES,
    CREATING_TENANT_SCHEMA,

    // Dedicated cluster (PRO/ENTERPRISE)
    PROVISIONING_INFRASTRUCTURE,
    CREATING_KUBERNETES_RESOURCES,

    // Service deployment (common)
    DEPLOYING_SERVICES,
    DEPLOYING_DATA_PLANE_AGENT,
    DEPLOYING_QUERY_ENGINE,
    DEPLOYING_CATALOG_SERVICE,
    // ... more per-service states

    // Verification and finalization
    VERIFYING_CONNECTIVITY,
    CONFIGURING_CONTROL_PLANE,
    SENDING_WELCOME_EMAIL,
    COMPLETED,

    // Error states
    FAILED,
    ROLLING_BACK,
    ROLLED_BACK;
}
Terminal states: COMPLETED, FAILED, ROLLED_BACK
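The terminal-state rule listed above can be expressed as a small helper on the enum. The following is only a sketch: SketchState is a hypothetical, trimmed-down stand-in for TenantProvisioningState, and isTerminal() is an assumed method name that the real enum may not provide.

```java
// Hypothetical, trimmed-down stand-in for TenantProvisioningState.
enum SketchState {
    INITIAL,
    VALIDATING_INPUT,
    DEPLOYING_SERVICES,
    COMPLETED,
    FAILED,
    ROLLING_BACK,
    ROLLED_BACK;

    // Assumed helper: true only for the three terminal states listed above.
    boolean isTerminal() {
        return this == COMPLETED || this == FAILED || this == ROLLED_BACK;
    }
}

class TerminalStateDemo {
    public static void main(String[] args) {
        // Print every state together with its terminal flag.
        for (SketchState s : SketchState.values()) {
            System.out.println(s + " terminal=" + s.isTerminal());
        }
    }
}
```

Note that ROLLING_BACK is deliberately not terminal in this sketch: a job in that state is still expected to move on to ROLLED_BACK or FAILED.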
Step-Level Execution
Each provisioning step is tracked as a ProvisioningStep entity with its own lifecycle:
public enum StepStatus {
    PENDING,
    RUNNING,
    COMPLETED,
    FAILED,
    SKIPPED,
    ROLLED_BACK
}
Step Lifecycle Methods
public void start() {
    this.status = StepStatus.RUNNING;
    this.startedAt = Instant.now();
}

public void complete(String result) {
    this.status = StepStatus.COMPLETED;
    this.completedAt = Instant.now();
    this.durationMs = completedAt.toEpochMilli() - startedAt.toEpochMilli();
}

public void fail(String error, String details) {
    this.status = StepStatus.FAILED;
    this.errorMessage = error;
    this.errorDetails = details;
}

public boolean canRetry() {
    return status == StepStatus.FAILED && retryCount < maxRetries;
}
Complete Step Definitions (56 Steps)
The ProvisioningService.PROVISIONING_STEPS list defines all steps with their type, name, order, and rollback support:
| Order | Step Type | Rollback | Phase |
|---|---|---|---|
| 1 | VALIDATE_INPUT | No | 0 |
| 2 | CHECK_QUOTA | No | 0 |
| 3 | CREATE_TENANT_RECORD | No | 1 |
| 4 | CREATE_ADMIN_USER | Yes | 1 |
| 5 | SEED_DEFAULT_ROLES | Yes | 1 |
| 6 | GENERATE_API_KEYS | Yes | 1 |
| 7 | CREATE_NAMESPACE | Yes | 2 |
| 8 | CREATE_RESOURCE_QUOTA | Yes | 2 |
| 9 | CREATE_LIMIT_RANGE | Yes | 2 |
| 10 | CREATE_NETWORK_POLICY | Yes | 2 |
| 11 | CREATE_SERVICE_ACCOUNT | Yes | 2 |
| 12 | CREATE_POD_SECURITY_POLICY | Yes | 2 |
| 13 | CREATE_RBAC_BINDINGS | Yes | 2 |
| 14-20 | Cloud Storage steps | Yes | 3 |
| 21-25 | Data Infrastructure steps | Yes | 4 |
| 26-39 | Service Deployment steps | Yes | 5 |
| 40-42 | Ingress and DNS steps | Yes | 5.5 |
| 43-48 | Observability and Security steps | Yes/No | 6 |
| 49 | PROVISION_CLOUD_INFRASTRUCTURE | Yes | 7 |
| 50-54 | Cloud AI Services | Yes | 7.6 |
| 55 | SEND_WELCOME_EMAIL | No | 8 |
| 56 | ACTIVATE_TENANT | No | 8 |
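To illustrate how the Rollback column might be used, the sketch below models a few rows of the table and derives a rollback plan from them. Everything here is an assumption made for illustration: StepDef and RollbackPlanSketch are hypothetical names, not the actual types behind PROVISIONING_STEPS, and undoing steps in reverse order is a common convention rather than something the source confirms.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical shape of one entry in the step list; the real definition class may differ.
final class StepDef {
    final int order;
    final String type;
    final boolean rollbackSupported;

    StepDef(int order, String type, boolean rollbackSupported) {
        this.order = order;
        this.type = type;
        this.rollbackSupported = rollbackSupported;
    }
}

final class RollbackPlanSketch {
    // Assumption: roll back only steps that support it and that ran
    // (order <= lastExecutedOrder), newest first.
    static List<StepDef> rollbackPlan(List<StepDef> steps, int lastExecutedOrder) {
        List<StepDef> plan = new ArrayList<>();
        for (StepDef s : steps) {
            if (s.rollbackSupported && s.order <= lastExecutedOrder) {
                plan.add(s);
            }
        }
        plan.sort(Comparator.comparingInt((StepDef s) -> s.order).reversed());
        return plan;
    }

    public static void main(String[] args) {
        // A few rows copied from the table above.
        List<StepDef> steps = List.of(
                new StepDef(1, "VALIDATE_INPUT", false),
                new StepDef(3, "CREATE_TENANT_RECORD", false),
                new StepDef(4, "CREATE_ADMIN_USER", true),
                new StepDef(7, "CREATE_NAMESPACE", true),
                new StepDef(8, "CREATE_RESOURCE_QUOTA", true));

        // Suppose provisioning stopped after step 7.
        for (StepDef s : rollbackPlan(steps, 7)) {
            System.out.println(s.order + " " + s.type);
        }
    }
}
```

With the sample rows and a last executed order of 7, the plan contains CREATE_NAMESPACE followed by CREATE_ADMIN_USER; the steps marked "No" and the step that never ran are left out.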
Retry Logic
Each step allows up to 3 retries (configurable via maxRetries). When a step fails and can retry:
- The retryCount is incremented
- The step status is reset to PENDING
- Error information is cleared
- Provisioning re-executes from the failed step
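The retry steps listed above can be sketched as a single reset method. This is a minimal illustration, not the actual entity: RetryableStep is a hypothetical stand-in for ProvisioningStep, and prepareRetry() is an assumed method name. Only canRetry() mirrors the logic shown earlier.

```java
// Hypothetical, minimal stand-in for the ProvisioningStep entity.
class RetryableStep {
    enum Status { PENDING, RUNNING, COMPLETED, FAILED }

    Status status = Status.PENDING;
    int retryCount = 0;
    int maxRetries = 3; // default described in the text
    String errorMessage;
    String errorDetails;

    void fail(String error, String details) {
        this.status = Status.FAILED;
        this.errorMessage = error;
        this.errorDetails = details;
    }

    boolean canRetry() {
        return status == Status.FAILED && retryCount < maxRetries;
    }

    // Assumed method name: applies the reset described in the list above.
    // Returns false (and changes nothing) when no retries are left.
    boolean prepareRetry() {
        if (!canRetry()) {
            return false;
        }
        retryCount++;            // the retryCount is incremented
        status = Status.PENDING; // the status is reset to PENDING
        errorMessage = null;     // error information is cleared
        errorDetails = null;
        return true;
    }
}

class RetryDemo {
    public static void main(String[] args) {
        RetryableStep step = new RetryableStep();
        int attempts = 0;
        do {
            attempts++;
            step.fail("boom", "simulated failure"); // every attempt fails
        } while (step.prepareRetry());
        System.out.println("attempts=" + attempts + " retryCount=" + step.retryCount);
        // prints "attempts=4 retryCount=3"
    }
}
```

In this sketch a step that always fails is attempted four times in total: the initial attempt plus three retries, after which canRetry() returns false and the step stays FAILED.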
The orchestrator also supports exponential backoff for scheduled retries:
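job.setNextRetryAt(Instant.now().plusSeconds(60 * (long) Math.pow(2, job.getRetryCount())));
With this formula the delay is 60 seconds at a retry count of 0, 120 seconds at 1, and 240 seconds at 2, doubling with each further retry.
Idempotency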
Every provisioning step carries an idempotency key formed as {tenantId}-{stepType}:
ProvisioningStep step = ProvisioningStep.builder()
    .idempotencyKey(tenant.getId() + "-" + def.type.name())
    .build();
This ensures that re-executing a step does not create duplicate resources.
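One way such a key can prevent duplicate work is a lookup before execution. The sketch below is an assumption-laden illustration, not the service's implementation: IdempotencyGuard, keyFor, and runOnce are hypothetical names, and the in-memory map stands in for whatever persistent store the real system uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory guard; a real system would persist the keys.
class IdempotencyGuard {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Same key format as described above: {tenantId}-{stepType}.
    static String keyFor(String tenantId, String stepType) {
        return tenantId + "-" + stepType;
    }

    // Runs the action only the first time a key is seen;
    // later calls with the same key return the stored result.
    String runOnce(String key, Supplier<String> action) {
        return results.computeIfAbsent(key, k -> action.get());
    }
}

class IdempotencyDemo {
    public static void main(String[] args) {
        IdempotencyGuard guard = new IdempotencyGuard();
        int[] created = {0}; // counts how often the "resource" is created
        String key = IdempotencyGuard.keyFor("tenant-42", "CREATE_NAMESPACE");

        guard.runOnce(key, () -> { created[0]++; return "ns-created"; });
        guard.runOnce(key, () -> { created[0]++; return "ns-created"; }); // re-execution

        System.out.println(key + " executed " + created[0] + " time(s)");
        // prints "tenant-42-CREATE_NAMESPACE executed 1 time(s)"
    }
}
```

The second call finds the existing key and skips the action, which is the behavior a retry of the same step for the same tenant would rely on.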
Stale Job Cleanup
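A scheduled task runs hourly by default (the cron expression can be overridden via matih.provisioning.stale-check-cron) to detect stale provisioning jobs, using a cutoff four hours in the past, and mark them as failed: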
@Scheduled(cron = "${matih.provisioning.stale-check-cron:0 0 * * * *}")
public void cleanupStaleJobs() {
    Instant cutoff = Instant.now().minus(Duration.ofHours(4));
    List<TenantProvisioningJob> staleJobs = jobRepository.findStuckJobs(cutoff);
    // Mark as failed with timeout message
}
Source Files
| File | Path |
|---|---|
| State enum | control-plane/tenant-service/src/main/java/com/matih/tenant/statemachine/TenantProvisioningState.java |
| Orchestrator | control-plane/tenant-service/src/main/java/com/matih/tenant/service/provisioning/ProvisioningOrchestrator.java |
| ProvisioningService | control-plane/tenant-service/src/main/java/com/matih/tenant/service/ProvisioningService.java |
| Step entity | control-plane/tenant-service/src/main/java/com/matih/tenant/entity/ProvisioningStep.java |
| Job entity | control-plane/tenant-service/src/main/java/com/matih/tenant/entity/TenantProvisioningJob.java |