MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Provisioning State Machine

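The provisioning system uses two complementary mechanisms: a high-level TenantProvisioningState state machine managed by the ProvisioningOrchestrator, and a step-level execution system managed by the ProvisioningService. The state machine tracks where a tenant is in the overall provisioning lifecycle, while step-level execution records the status, retry count, and rollback support of each individual step. Together they make provisioning workflows observable and recoverable.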


High-Level Provisioning States

The TenantProvisioningState enum defines all possible states in the provisioning lifecycle:

public enum TenantProvisioningState implements State {
    INITIAL,
    VALIDATING_INPUT,
    CREATING_TENANT_RECORD,
 
    // Service Principal (PRO/ENTERPRISE)
    VALIDATING_SERVICE_PRINCIPAL,
 
    // Terraform (PRO/ENTERPRISE)
    ACQUIRING_TERRAFORM_LOCK,
    PLANNING_INFRASTRUCTURE,
    APPLYING_INFRASTRUCTURE,
    RELEASING_TERRAFORM_LOCK,
 
    // Shared cluster (FREE)
    ALLOCATING_SHARED_CLUSTER,
    CREATING_SHARED_NAMESPACE,
    CONFIGURING_QUOTAS,
    APPLYING_RESOURCE_QUOTAS,
    APPLYING_NETWORK_POLICIES,
    CREATING_TENANT_SCHEMA,
 
    // Dedicated cluster (PRO/ENTERPRISE)
    PROVISIONING_INFRASTRUCTURE,
    CREATING_KUBERNETES_RESOURCES,
 
    // Service deployment (common)
    DEPLOYING_SERVICES,
    DEPLOYING_DATA_PLANE_AGENT,
    DEPLOYING_QUERY_ENGINE,
    DEPLOYING_CATALOG_SERVICE,
    // ... more per-service states
 
    // Verification and finalization
    VERIFYING_CONNECTIVITY,
    CONFIGURING_CONTROL_PLANE,
    SENDING_WELCOME_EMAIL,
    COMPLETED,
 
    // Error states
    FAILED,
    ROLLING_BACK,
    ROLLED_BACK;
}

Terminal states: COMPLETED, FAILED, ROLLED_BACK
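
As a minimal sketch of how the terminal states could be checked in code, the example below uses a trimmed-down, hypothetical copy of the enum. The names SketchProvisioningState and isTerminal() are assumptions made for illustration and are not taken from TenantProvisioningState itself.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, shortened stand-in for TenantProvisioningState (illustration only).
enum SketchProvisioningState {
    INITIAL,
    VALIDATING_INPUT,
    DEPLOYING_SERVICES,
    COMPLETED,
    FAILED,
    ROLLING_BACK,
    ROLLED_BACK;

    // The three terminal states listed in the documentation.
    private static final Set<SketchProvisioningState> TERMINAL =
            EnumSet.of(COMPLETED, FAILED, ROLLED_BACK);

    // Assumed helper: true when no further transition is expected.
    boolean isTerminal() {
        return TERMINAL.contains(this);
    }
}
```

A caller could use such a helper to stop polling a provisioning job once isTerminal() returns true.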


Step-Level Execution

Each provisioning step is tracked as a ProvisioningStep entity with its own lifecycle:

public enum StepStatus {
    PENDING,
    RUNNING,
    COMPLETED,
    FAILED,
    SKIPPED,
    ROLLED_BACK
}

Step Lifecycle Methods

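// Marks the step as running and records when execution began.
public void start() {
    this.status = StepStatus.RUNNING;
    this.startedAt = Instant.now();
}

// Marks the step as completed and records its duration.
// Assumes start() was called first, so startedAt is non-null.
public void complete(String result) {
    this.status = StepStatus.COMPLETED;
    this.completedAt = Instant.now();
    this.durationMs = completedAt.toEpochMilli() - startedAt.toEpochMilli();
}

// Marks the step as failed and stores the error for later inspection.
public void fail(String error, String details) {
    this.status = StepStatus.FAILED;
    this.errorMessage = error;
    this.errorDetails = details;
}

// A step is retryable only if it failed and has retries remaining.
public boolean canRetry() {
    return status == StepStatus.FAILED && retryCount < maxRetries;
}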

Complete Step Definitions (56 Steps)

The ProvisioningService.PROVISIONING_STEPS list defines all steps with their type, name, order, and rollback support:

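| Order | Step Type | Rollback | Phase |
| --- | --- | --- | --- |
| 1 | VALIDATE_INPUT | No | 0 |
| 2 | CHECK_QUOTA | No | 0 |
| 3 | CREATE_TENANT_RECORD | No | 1 |
| 4 | CREATE_ADMIN_USER | Yes | 1 |
| 5 | SEED_DEFAULT_ROLES | Yes | 1 |
| 6 | GENERATE_API_KEYS | Yes | 1 |
| 7 | CREATE_NAMESPACE | Yes | 2 |
| 8 | CREATE_RESOURCE_QUOTA | Yes | 2 |
| 9 | CREATE_LIMIT_RANGE | Yes | 2 |
| 10 | CREATE_NETWORK_POLICY | Yes | 2 |
| 11 | CREATE_SERVICE_ACCOUNT | Yes | 2 |
| 12 | CREATE_POD_SECURITY_POLICY | Yes | 2 |
| 13 | CREATE_RBAC_BINDINGS | Yes | 2 |
| 14-20 | Cloud Storage steps | Yes | 3 |
| 21-25 | Data Infrastructure steps | Yes | 4 |
| 26-39 | Service Deployment steps | Yes | 5 |
| 40-42 | Ingress and DNS steps | Yes | 5.5 |
| 43-48 | Observability and Security steps | Yes/No | 6 |
| 49 | PROVISION_CLOUD_INFRASTRUCTURE | Yes | 7 |
| 50-54 | Cloud AI Services | Yes | 7.6 |
| 55 | SEND_WELCOME_EMAIL | No | 8 |
| 56 | ACTIVATE_TENANT | No | 8 |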

Retry Logic

Each step allows up to 3 retries (configurable via maxRetries). When a step fails and can retry:

  1. The retryCount is incremented
  2. The step status is reset to PENDING
  3. Error information is cleared
  4. Provisioning re-executes from the failed step
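
The reset sequence above can be sketched as follows. This is an illustrative stand-in rather than the actual ProvisioningStep entity: the class name RetryableStepSketch and the method resetForRetry() are assumptions, and only the fields needed for the retry path are modeled.

```java
// Hypothetical stand-in for the ProvisioningStep entity (illustration only).
class RetryableStepSketch {
    enum Status { PENDING, RUNNING, COMPLETED, FAILED }

    Status status = Status.PENDING;
    int retryCount = 0;
    int maxRetries = 3; // default number of retries described in the text
    String errorMessage;
    String errorDetails;

    void fail(String error, String details) {
        status = Status.FAILED;
        errorMessage = error;
        errorDetails = details;
    }

    boolean canRetry() {
        return status == Status.FAILED && retryCount < maxRetries;
    }

    // Assumed method name. Applies list items 1-3: increment the counter,
    // reset the status to PENDING, and clear the stored error.
    // Returns false (and changes nothing) when no retries remain.
    boolean resetForRetry() {
        if (!canRetry()) {
            return false;
        }
        retryCount++;
        status = Status.PENDING;
        errorMessage = null;
        errorDetails = null;
        return true;
    }
}
```

With maxRetries set to 3, a step that keeps failing is reset three times; the fourth call to resetForRetry() returns false and the step stays in FAILED.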

The orchestrator also supports exponential backoff for scheduled retries:

job.setNextRetryAt(Instant.now().plusSeconds(60 * (long) Math.pow(2, job.getRetryCount())));
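
To make the schedule concrete, the sketch below evaluates the same expression for the first few retry counts. The class BackoffSketch and the method retryDelay() are hypothetical names; whether the orchestrator increments retryCount before or after computing the delay is not shown in the source, so the example simply takes the count as an input.

```java
import java.time.Duration;

// Hypothetical helper that isolates the backoff formula (illustration only).
class BackoffSketch {

    // Same arithmetic as the orchestrator snippet: 60 seconds * 2^retryCount.
    static Duration retryDelay(int retryCount) {
        return Duration.ofSeconds(60 * (long) Math.pow(2, retryCount));
    }

    public static void main(String[] args) {
        // Prints the delay in seconds for retry counts 0 through 3.
        for (int n = 0; n <= 3; n++) {
            System.out.println("retryCount=" + n + " -> " + retryDelay(n).getSeconds() + "s");
        }
    }
}
```

For retry counts 0, 1, 2, and 3 the delays are 60, 120, 240, and 480 seconds, so each scheduled retry waits twice as long as the previous one.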

Idempotency

Every provisioning step carries an idempotency key formed as {tenantId}-{stepType}:

ProvisioningStep step = ProvisioningStep.builder()
        .idempotencyKey(tenant.getId() + "-" + def.type.name())
        .build();

This ensures that re-executing a step does not create duplicate resources.
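
One way such a key can prevent duplicate work is sketched below. The source does not show how the key is checked, so this is an assumption: the class IdempotencySketch, its in-memory set, and the runOnce() method are all hypothetical, and a real service would more likely rely on a database lookup or a unique constraint.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory guard keyed by {tenantId}-{stepType} (illustration only).
class IdempotencySketch {
    private final Set<String> executedKeys = new HashSet<>();

    // Same key format as described in the text.
    static String keyFor(String tenantId, String stepType) {
        return tenantId + "-" + stepType;
    }

    // Runs the action only if this key has not completed before.
    // Returns true if the action ran, false if it was skipped.
    boolean runOnce(String tenantId, String stepType, Runnable action) {
        String key = keyFor(tenantId, stepType);
        if (executedKeys.contains(key)) {
            return false;
        }
        action.run();
        // Record the key only after the action succeeds, so a step that
        // throws can be attempted again.
        executedKeys.add(key);
        return true;
    }
}
```

Calling runOnce() twice with the same tenant and step type runs the action once; the second call returns false without touching the resource.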


Stale Job Cleanup

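A scheduled task runs hourly by default (the cron expression can be overridden through the matih.provisioning.stale-check-cron property) to detect stale provisioning jobs, using a cutoff of four hours before the current time, and mark them as failed: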

@Scheduled(cron = "${matih.provisioning.stale-check-cron:0 0 * * * *}")
public void cleanupStaleJobs() {
    Instant cutoff = Instant.now().minus(Duration.ofHours(4));
    List<TenantProvisioningJob> staleJobs = jobRepository.findStuckJobs(cutoff);
    // Mark as failed with timeout message
}
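
The body of the cleanup is elided in the snippet above. The following sketch shows one plausible shape for it using plain in-memory objects. Everything here is an assumption made for illustration: the Job class, its fields, the "RUNNING" and "FAILED" status strings, the timeout message, and the interpretation that a job is stale when it started before the cutoff.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the stale-job cleanup (illustration only).
class StaleJobSketch {

    // Minimal stand-in for TenantProvisioningJob.
    static final class Job {
        final String id;
        final Instant startedAt;
        String status = "RUNNING";
        String error;

        Job(String id, Instant startedAt) {
            this.id = id;
            this.startedAt = startedAt;
        }
    }

    // Marks every running job that started before (now - maxAge) as failed
    // and returns the jobs that were changed.
    static List<Job> failStale(List<Job> jobs, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        List<Job> stale = jobs.stream()
                .filter(j -> "RUNNING".equals(j.status) && j.startedAt.isBefore(cutoff))
                .collect(Collectors.toList());
        for (Job job : stale) {
            job.status = "FAILED";
            job.error = "Provisioning timed out";
        }
        return stale;
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the cutoff logic easy to test.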

Source Files

| File | Path |
| --- | --- |
| State enum | control-plane/tenant-service/src/main/java/com/matih/tenant/statemachine/TenantProvisioningState.java |
| Orchestrator | control-plane/tenant-service/src/main/java/com/matih/tenant/service/provisioning/ProvisioningOrchestrator.java |
| ProvisioningService | control-plane/tenant-service/src/main/java/com/matih/tenant/service/ProvisioningService.java |
| Step entity | control-plane/tenant-service/src/main/java/com/matih/tenant/entity/ProvisioningStep.java |
| Job entity | control-plane/tenant-service/src/main/java/com/matih/tenant/entity/TenantProvisioningJob.java |