MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Upgrade Execution

The UpgradeOrchestrator manages the end-to-end execution of component upgrades across tenant deployments. It supports multiple deployment strategies (rolling, canary, blue-green, recreate, staged), pre- and post-upgrade validation checks, batch processing with configurable delays, automatic rollback on failure, and real-time progress tracking via Kafka events.


Initiate an Upgrade

Endpoint: POST /api/v1/registry/tenants/:tenantId/upgrades

curl -X POST http://localhost:8084/api/v1/registry/tenants/550e8400/upgrades \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "componentId": "aaa-111",
    "fromVersionId": "bbb-222",
    "toVersionId": "ccc-333",
    "strategy": "CANARY",
    "canaryPercentage": 10,
    "batchSize": 5,
    "batchDelaySeconds": 60,
    "healthCheckIntervalSeconds": 30,
    "healthCheckTimeoutSeconds": 300,
    "maxUnavailablePercentage": 25,
    "totalInstances": 20,
    "autoRollbackEnabled": true,
    "failureThreshold": 3,
    "initiatedBy": "admin@matih.ai",
    "installedComponents": {
      "kafka": "3.6.0",
      "postgresql": "16.1"
    }
  }'

Before creating the execution, the orchestrator validates three conditions: no active upgrade already exists for the same component and tenant, both versions belong to the same component, and the upgrade path validation reports no blockers.


UpgradeRequest Structure

| Field | Type | Default | Description |
|---|---|---|---|
| componentId | UUID | required | Target component |
| fromVersionId | UUID | required | Current version |
| toVersionId | UUID | required | Target version |
| strategy | UpgradeStrategy | ROLLING | Deployment strategy |
| canaryPercentage | Integer | 10 | Percentage of instances for canary |
| batchSize | Integer | 5 | Instances per batch |
| batchDelaySeconds | Integer | 60 | Delay between batches in seconds |
| healthCheckIntervalSeconds | Integer | 30 | Health check polling interval |
| healthCheckTimeoutSeconds | Integer | 300 | Health check timeout |
| maxUnavailablePercentage | Integer | 25 | Maximum unavailable instances |
| totalInstances | Integer | 10 | Total instances to upgrade |
| autoRollbackEnabled | boolean | false | Enable automatic rollback on failure |
| failureThreshold | Integer | 3 | Number of failures before auto-rollback |
| initiatedBy | String | -- | User who initiated the upgrade |
| installedComponents | Map | -- | Currently installed component versions |
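
The defaults in the table can be captured in a small client-side model. The following is a minimal sketch, not the service's own class: the Python `UpgradeRequest` dataclass and its `to_json` helper are hypothetical, and omitting unset optional fields from the body is an assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class UpgradeRequest:
    # Required identifiers; the table lists no default for these.
    componentId: str
    fromVersionId: str
    toVersionId: str
    # Defaults below are copied from the UpgradeRequest table.
    strategy: str = "ROLLING"
    canaryPercentage: int = 10
    batchSize: int = 5
    batchDelaySeconds: int = 60
    healthCheckIntervalSeconds: int = 30
    healthCheckTimeoutSeconds: int = 300
    maxUnavailablePercentage: int = 25
    totalInstances: int = 10
    autoRollbackEnabled: bool = False
    failureThreshold: int = 3
    initiatedBy: Optional[str] = None
    installedComponents: Dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        # Assumption: fields left as None may be omitted from the request body.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body)


# Only the fields that differ from the defaults need to be supplied.
request = UpgradeRequest(
    componentId="aaa-111",
    fromVersionId="bbb-222",
    toVersionId="ccc-333",
    strategy="CANARY",
    totalInstances=20,
)
payload = json.loads(request.to_json())
```

The resulting JSON string can be passed as the `-d` body of the POST request to the upgrades endpoint.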

Upgrade Strategies

| Strategy | Description | Use Case |
|---|---|---|
| ROLLING | Gradual replacement in batches | Standard production upgrades |
| CANARY | Small percentage first, then full rollout after monitoring | Risk-sensitive upgrades |
| BLUE_GREEN | Full parallel deployment with traffic switch | Zero-downtime critical upgrades |
| RECREATE | All instances replaced at once | Development and testing environments |
| STAGED | Manual approval required between stages | Regulated environments |

Execution Status Lifecycle

| Status | Description | Valid Transitions |
|---|---|---|
| PENDING | Created but not started | VALIDATING, CANCELLED |
| VALIDATING | Running pre-upgrade checks | IN_PROGRESS, CANARY_RUNNING, FAILED |
| IN_PROGRESS | Actively upgrading instances in batches | PAUSED, COMPLETED, FAILED, ROLLING_BACK, CANCELLED |
| CANARY_RUNNING | Canary instances being deployed | CANARY_MONITORING, PAUSED, FAILED, CANCELLED |
| CANARY_MONITORING | Monitoring canary health metrics | IN_PROGRESS (via promote), ROLLING_BACK, CANCELLED |
| PAUSED | Manually paused by operator | IN_PROGRESS (via resume), CANCELLED |
| ROLLING_BACK | Rollback in progress | ROLLED_BACK, FAILED |
| ROLLED_BACK | Successfully rolled back | Terminal state |
| COMPLETED | All instances upgraded successfully | Terminal state |
| FAILED | Upgrade failed with errors | Terminal state |
| CANCELLED | Manually cancelled by operator | Terminal state |
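
The lifecycle table maps directly onto a lookup structure. Below is a minimal sketch that encodes the listed transitions exactly as given; the names `TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical and not part of the orchestrator's API.

```python
from typing import Dict, Set

# Allowed transitions, transcribed from the status lifecycle table.
TRANSITIONS: Dict[str, Set[str]] = {
    "PENDING": {"VALIDATING", "CANCELLED"},
    "VALIDATING": {"IN_PROGRESS", "CANARY_RUNNING", "FAILED"},
    "IN_PROGRESS": {"PAUSED", "COMPLETED", "FAILED", "ROLLING_BACK", "CANCELLED"},
    "CANARY_RUNNING": {"CANARY_MONITORING", "PAUSED", "FAILED", "CANCELLED"},
    "CANARY_MONITORING": {"IN_PROGRESS", "ROLLING_BACK", "CANCELLED"},
    "PAUSED": {"IN_PROGRESS", "CANCELLED"},
    "ROLLING_BACK": {"ROLLED_BACK", "FAILED"},
    # Terminal states have no outgoing transitions.
    "ROLLED_BACK": set(),
    "COMPLETED": set(),
    "FAILED": set(),
    "CANCELLED": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the table lists target as reachable from current."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(status: str) -> bool:
    """Return True for known statuses that have no outgoing transitions."""
    return status in TRANSITIONS and not TRANSITIONS[status]
```

A client could call `can_transition` before issuing a pause, resume, or cancel request to avoid a round trip that the service would reject.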

Start an Upgrade

Endpoint: POST /api/v1/registry/upgrades/:executionId/start

Transitions the execution from PENDING to VALIDATING and begins asynchronous pre-upgrade checks. The five validation checks are:

  1. Version compatibility -- Confirms the upgrade path is valid
  2. Resource availability -- Verifies sufficient resources exist for the upgrade
  3. Backup verification -- Confirms a backup exists for rollback
  4. Health endpoints -- Validates health check endpoints are accessible
  5. Dependency check -- Verifies dependency versions are compatible

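If all checks pass, the next step depends on the strategy: CANARY begins a canary deployment (CANARY_RUNNING), and any other strategy starts a rolling upgrade immediately (IN_PROGRESS). If a check fails, the execution transitions to FAILED.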
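
The validation step and the strategy branch that follows it can be sketched as below. This is an illustration under assumptions: the check names in `CHECK_ORDER`, the callable-per-check shape, and both function names are hypothetical, and the real checks run asynchronously inside the orchestrator.

```python
from typing import Callable, Dict, List

# The five pre-upgrade checks, in the order listed above (names are illustrative).
CHECK_ORDER = [
    "version_compatibility",
    "resource_availability",
    "backup_verification",
    "health_endpoints",
    "dependency_check",
]


def run_pre_upgrade_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check in order and return the names of those that failed."""
    return [name for name in CHECK_ORDER if not checks[name]()]


def status_after_validation(failed_checks: List[str], strategy: str) -> str:
    """Pick the next execution status once validation has finished."""
    if failed_checks:
        return "FAILED"
    # CANARY starts a canary deployment; every other strategy goes rolling.
    return "CANARY_RUNNING" if strategy == "CANARY" else "IN_PROGRESS"
```

Collecting all failures rather than stopping at the first one is a design choice made here so that an operator sees every blocker in a single pass.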


Instance Upgrade Lifecycle

Each instance goes through its own lifecycle during the upgrade:

| Status | Description |
|---|---|
| PENDING | Waiting to be upgraded |
| DRAINING | Draining active connections |
| UPGRADING | Applying the version update |
| HEALTH_CHECKING | Running post-upgrade health validation |
| COMPLETED | Successfully upgraded |
| FAILED | Upgrade failed for this instance |
| ROLLED_BACK | Instance reverted to previous version |

Canary Workflow

For canary deployments, the orchestrator follows this sequence:

  1. Calculate canary instance count from canaryPercentage (e.g., 10% of 20 = 2 instances)
  2. Deploy the new version to canary instances
  3. Transition to CANARY_MONITORING status
  4. A scheduled monitor checks canary health every 30 seconds
  5. If health checks fail and auto-rollback is enabled, rollback is triggered automatically
  6. On manual promotion (POST /api/v1/registry/upgrades/:executionId/promote-canary), the remaining instances are upgraded via rolling strategy
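
Step 1 of the sequence above amounts to a percentage calculation. The sketch below reproduces the 10% of 20 = 2 example; rounding up and always selecting at least one instance are assumptions, since the source does not state how fractional counts are handled, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def canary_instance_count(total_instances: int, canary_percentage: int) -> int:
    """Number of instances that receive the new version first."""
    if not 1 <= canary_percentage <= 100:
        raise ValueError("canary_percentage must be between 1 and 100")
    if total_instances <= 0:
        return 0
    # Integer ceiling of total * percentage / 100 (rounding up is an assumption).
    count = (total_instances * canary_percentage + 99) // 100
    return min(count, total_instances)


def split_canary(
    instances: Sequence[str], canary_percentage: int
) -> Tuple[List[str], List[str]]:
    """Split instances into the canary group and the remainder."""
    n = canary_instance_count(len(instances), canary_percentage)
    return list(instances[:n]), list(instances[n:])
```

The second list returned by `split_canary` corresponds to the instances that step 6 upgrades after promotion.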

Pause, Resume, and Cancel

| Operation | Endpoint | Valid From |
|---|---|---|
| Pause | POST /api/v1/registry/upgrades/:executionId/pause | IN_PROGRESS, CANARY_RUNNING |
| Resume | POST /api/v1/registry/upgrades/:executionId/resume | PAUSED |
| Cancel | POST /api/v1/registry/upgrades/:executionId/cancel | Any non-terminal status |

Pausing an upgrade stops batch processing at the current position. Resuming continues from where the upgrade left off.
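
The pause and resume behaviour can be pictured as a cursor over the instance list. The following is a minimal sketch with a hypothetical `BatchRunner` class; it leaves out the delay between batches (batchDelaySeconds) and the health checks that the orchestrator performs.

```python
from typing import List, Sequence


class BatchRunner:
    """Hands out instances in batches and remembers where it stopped."""

    def __init__(self, instances: Sequence[str], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.instances = list(instances)
        self.batch_size = batch_size
        self.position = 0  # index of the next instance to upgrade
        self.paused = False

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    @property
    def done(self) -> bool:
        return self.position >= len(self.instances)

    def next_batch(self) -> List[str]:
        """Return the next batch, or an empty list while paused or finished."""
        if self.paused or self.done:
            return []
        batch = self.instances[self.position:self.position + self.batch_size]
        self.position += len(batch)
        return batch
```

Because `position` only advances when a batch is handed out, a resume picks up at the first instance that has not yet been processed.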


Rollback

Endpoint: POST /api/v1/registry/upgrades/:executionId/rollback?reason=Manual+rollback

Initiates a rollback that reverts all upgraded and health-checking instances to the previous version. The rollback runs asynchronously and records its reason, initiator, start time, and the affected instance IDs in the RollbackInfo structure.

RollbackInfo Structure

| Field | Type | Description |
|---|---|---|
| autoRollbackEnabled | boolean | Whether auto-rollback is configured |
| failureThreshold | int | Number of failures that trigger auto-rollback |
| currentFailures | int | Current failure count |
| rollbackReason | String | Reason for the rollback |
| rollbackInitiatedAt | LocalDateTime | When the rollback started |
| rollbackInitiatedBy | String | Who or what triggered the rollback |
| rolledBackInstances | List | Instance IDs that were rolled back |

Auto-rollback triggers when currentFailures reaches or exceeds failureThreshold and autoRollbackEnabled is true.
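
The trigger condition is a two-part predicate over RollbackInfo. Below is a minimal sketch that mirrors the fields from the table in a hypothetical Python dataclass; `should_auto_rollback` and `record_failure` are illustrative helpers, not the orchestrator's methods.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class RollbackInfo:
    autoRollbackEnabled: bool = False
    failureThreshold: int = 3
    currentFailures: int = 0
    rollbackReason: Optional[str] = None
    rollbackInitiatedAt: Optional[datetime] = None
    rollbackInitiatedBy: Optional[str] = None
    rolledBackInstances: List[str] = field(default_factory=list)


def should_auto_rollback(info: RollbackInfo) -> bool:
    """True when auto-rollback is enabled and the threshold has been reached."""
    return info.autoRollbackEnabled and info.currentFailures >= info.failureThreshold


def record_failure(info: RollbackInfo) -> bool:
    """Count one more failure and report whether a rollback should start."""
    info.currentFailures += 1
    return should_auto_rollback(info)
```

With the default threshold of 3, the first two calls to `record_failure` return False and the third returns True, provided autoRollbackEnabled is set.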


Upgrade Statistics

Endpoint: GET /api/v1/registry/components/:componentId/upgrade-statistics

ExecutionStatistics Structure

| Field | Type | Description |
|---|---|---|
| componentId | UUID | Component identifier |
| totalExecutions | long | Total upgrade executions |
| successfulExecutions | long | Completed successfully |
| failedExecutions | long | Failed or rolled back |
| successRate | double | Success percentage |
| averageDurationSeconds | double | Average upgrade duration |
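
One plausible way to derive these fields from a list of executions is sketched below. The formulas are assumptions, since the source does not define them: successRate is taken as successful / total × 100, the average duration is taken over all supplied executions, and the `summarize` function with its (status, duration) input shape is hypothetical.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]


def summarize(
    component_id: str, executions: List[Tuple[str, Number]]
) -> Dict[str, Union[str, Number]]:
    """Aggregate (status, duration_seconds) pairs into ExecutionStatistics fields."""
    total = len(executions)
    successful = sum(1 for status, _ in executions if status == "COMPLETED")
    # "Failed or rolled back", per the table description.
    failed = sum(1 for status, _ in executions if status in ("FAILED", "ROLLED_BACK"))
    # Guard against division by zero when no executions exist yet.
    success_rate = (successful / total * 100.0) if total else 0.0
    average = (sum(d for _, d in executions) / total) if total else 0.0
    return {
        "componentId": component_id,
        "totalExecutions": total,
        "successfulExecutions": successful,
        "failedExecutions": failed,
        "successRate": success_rate,
        "averageDurationSeconds": average,
    }
```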

Query Executions

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/registry/upgrades/:executionId | Get execution by ID |
| GET | /api/v1/registry/tenants/:tenantId/upgrades | List all executions for a tenant |
| GET | /api/v1/registry/tenants/:tenantId/components/:componentId/upgrades/active | Get active execution |
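
A client can build these paths with a few small helpers. The sketch below only constructs URLs and sends no requests; the base URL is taken from the curl example earlier on this page, and the function names are hypothetical.

```python
from urllib.parse import quote

# Host and port as used in the curl example; adjust for other environments.
BASE_URL = "http://localhost:8084/api/v1/registry"


def _segment(value: str) -> str:
    """Percent-encode a single path segment, including any slashes."""
    return quote(value, safe="")


def execution_url(execution_id: str) -> str:
    """URL for fetching one execution by ID."""
    return f"{BASE_URL}/upgrades/{_segment(execution_id)}"


def tenant_upgrades_url(tenant_id: str) -> str:
    """URL for listing all executions of a tenant."""
    return f"{BASE_URL}/tenants/{_segment(tenant_id)}/upgrades"


def active_upgrade_url(tenant_id: str, component_id: str) -> str:
    """URL for the active execution of one component within a tenant."""
    return (
        f"{BASE_URL}/tenants/{_segment(tenant_id)}"
        f"/components/{_segment(component_id)}/upgrades/active"
    )
```

Each returned URL is intended for a GET request carrying the same Authorization header as the curl example.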