MATIH Platform is in active MVP development. Documentation reflects current implementation status.
18. CI/CD & Build System
CD Pipeline Overview

The MATIH Continuous Deployment pipeline is orchestrated by scripts/cd-new.sh, a 1,290-line Bash script that coordinates 25 stages from Terraform infrastructure provisioning through post-deployment validation. The pipeline features dependency tracking, atomic state management, lock-based concurrency control, automatic rollback on failure, and dry-run preview mode.

Source file: scripts/cd-new.sh (Version 1.3.0)


Stage Architecture

The 25 stages are organized into four groups:

Infrastructure & Build (Stages 00-06)
├── 00-terraform                    Provision cloud infrastructure
├── 01-build-setup                  Docker buildx, environment setup
├── 01a-validate-schemas            Hibernate schema validation
├── 02-build-base-images            Base container images
├── 03-build-commons                Shared libraries
├── 04-build-service-images         All service Docker images
├── 04a-sync-tenant-images          Sync images to tenant registries
├── 05a-control-plane-infrastructure  CP PostgreSQL, Redis, Kafka
├── 05b-data-plane-infrastructure   DP PostgreSQL, Redis, Kafka
└── 06-ingress-controller           NGINX ingress

Control Plane (Stages 07-09)
├── 07-control-plane-monitoring     Prometheus, Grafana for CP
├── 08-control-plane-services       IAM, tenant, config, etc.
└── 09-control-plane-frontend       Control plane UI

Data Plane (Stages 10-17)
├── 10-data-plane-monitoring        Prometheus, Grafana for DP
├── 11-compute-engines              Spark, Flink, Ray, Trino
├── 12-workflow-orchestration       Airflow
├── 13-data-catalogs                OpenMetadata
├── 14-ml-infrastructure            KubeRay, MLflow
├── 15-ai-infrastructure            vLLM, Ollama
├── 15a-matih-operator              Platform operator
├── 16-data-plane-services          ai-service, ml-service, etc.
└── 17-data-plane-frontend          Workbenches

Validation (Stage 18)
└── 18-validate                     Health checks, smoke tests

Dependency Graph

Each stage declares its dependencies. The pipeline verifies all dependencies are satisfied before executing a stage:

00-terraform
├──> 01-build-setup ──> 01a-validate-schemas
│    └──> 02-build-base-images ──> 03-build-commons ──> 04-build-service-images
├──> 05a-control-plane-infrastructure ─┐
└──> 05b-data-plane-infrastructure ────┴──> 06-ingress-controller

     ┌────────────────────────────────────────┘
     ├──> 07-cp-monitoring ──> 08-cp-services ──> 09-cp-frontend
     ├──> 10-dp-monitoring
     ├──> 11-compute-engines ──> 14-ml-infra ──> 15-ai-infra
     ├──> 12-workflow-orchestration ──> 13-data-catalogs
     └──> 15a-matih-operator
          └──> 16-dp-services ──> 17-dp-frontend

08-cp-services + 16-dp-services ──> 18-validate
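The dependency check described above can be sketched as follows (a minimal illustration; the `STAGE_DEPS` array and `deps_satisfied` function are assumed names, not the actual cd-new.sh implementation):

```shell
#!/usr/bin/env bash
# Minimal sketch of per-stage dependency verification.
# STAGE_DEPS and deps_satisfied are illustrative names, not from cd-new.sh.
declare -A STAGE_DEPS=(
    ["01-build-setup"]="00-terraform"
    ["02-build-base-images"]="01-build-setup"
    ["18-validate"]="08-control-plane-services 16-data-plane-services"
)

# COMPLETED_STAGES holds one finished stage name per line.
COMPLETED_STAGES=$'00-terraform\n01-build-setup'

# Succeeds only if every declared dependency of the stage has completed.
deps_satisfied() {
    local stage="$1" dep
    for dep in ${STAGE_DEPS[$stage]:-}; do
        if ! grep -qxF "$dep" <<< "$COMPLETED_STAGES"; then
            echo "Blocked: $stage requires $dep" >&2
            return 1
        fi
    done
    return 0
}

deps_satisfied "02-build-base-images" && echo "02-build-base-images is ready"
```

A stage with no entry in the map (such as 00-terraform) has no prerequisites and always passes the check.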

Pipeline Commands

# Run all stages
./scripts/cd-new.sh all dev
 
# Run infrastructure stages only
./scripts/cd-new.sh infra dev
 
# Run platform stages (05a through 15)
./scripts/cd-new.sh platform dev
 
# Run service stages only
./scripts/cd-new.sh services dev
 
# Run a specific stage
./scripts/cd-new.sh 04-build-service-images dev
./scripts/cd-new.sh 04 dev  # Short form
 
# Pass arguments to a stage script
./scripts/cd-new.sh 04-build-service-images dev -- --only copilot
 
# Validation only
./scripts/cd-new.sh validate dev
 
# Rollback a specific release
./scripts/cd-new.sh rollback ai-service matih-data-plane dev
 
# Rollback everything
./scripts/cd-new.sh rollback-all dev
 
# View status and history
./scripts/cd-new.sh status
./scripts/cd-new.sh history
./scripts/cd-new.sh deps
 
# Dry run (preview)
DRY_RUN=true ./scripts/cd-new.sh all dev
 
# Clear state (allow re-running stages)
./scripts/cd-new.sh clear-state

State Management

The pipeline tracks stage completion in .cd-pipeline-state:

# Format: stage:status:timestamp:version
00-terraform:completed:2026-02-10T14:30:00+05:30:v1.2.3
01-build-setup:completed:2026-02-10T14:32:15+05:30:v1.2.3
04-build-service-images:started:2026-02-10T14:35:00+05:30:v1.2.3

State is written atomically using temp files and mv:

save_stage_state() {
    local stage="$1"
    local status="$2"
    local temp_file="${STATE_FILE}.tmp.$$"  # $$ = PID, unique per run
    {
        # Copy every record except the one being updated...
        grep -v "^${stage}:" "$STATE_FILE" 2>/dev/null || true
        # ...then append the fresh record for this stage
        echo "${stage}:${status}:$(date -Iseconds):${version}"
    } > "$temp_file"
    mv "$temp_file" "$STATE_FILE"  # Atomic rename: readers never see a partial file
}
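A natural companion for resuming interrupted runs is a reader that checks the recorded status (a sketch; `stage_completed` is an assumed helper name, while the `stage:status:timestamp:version` record format comes from the example above):

```shell
# Sketch of reading the state file back. stage_completed is an assumed
# helper name; the record format stage:status:timestamp:version is from above.
STATE_FILE=".cd-pipeline-state"

stage_completed() {
    grep -q "^${1}:completed:" "$STATE_FILE" 2>/dev/null
}

printf '%s\n' \
    "00-terraform:completed:2026-02-10T14:30:00+05:30:v1.2.3" \
    "04-build-service-images:started:2026-02-10T14:35:00+05:30:v1.2.3" \
    > "$STATE_FILE"

stage_completed "00-terraform" && echo "skipping 00-terraform"
stage_completed "04-build-service-images" || echo "re-running 04-build-service-images"
```

Matching on `:completed:` rather than the bare stage name ensures that a stage recorded as `started` (i.e. interrupted mid-run) is re-executed.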

Concurrency Control

A file-based lock prevents concurrent pipeline runs:

acquire_lock() {
    exec 200>"$LOCK_FILE"  # Open (and create) the lock file on FD 200
    if command -v flock &>/dev/null; then
        flock -w "$LOCK_TIMEOUT" 200 || return 1  # Linux: wait up to LOCK_TIMEOUT seconds
    else
        # macOS fallback: PID file (best effort, not race-free)
        if [[ -f "${LOCK_FILE}.pid" ]] && kill -0 "$(cat "${LOCK_FILE}.pid")" 2>/dev/null; then
            return 1  # Another pipeline run is still alive
        fi
        echo $$ > "${LOCK_FILE}.pid"
    fi
}

Rollback Mechanism

The pipeline records all Helm releases deployed during a run. On failure, it rolls back in reverse order:

# Recorded format: namespace/release:revision
DEPLOYED_RELEASES+=("matih-data-plane/ai-service:5")
 
# Rollback: decrement revision
helm rollback ai-service 4 -n matih-data-plane --wait

If the release is at revision 1 (first install), the pipeline uninstalls instead:

if [[ "$current_revision" -le 1 ]]; then
    helm uninstall "$release" -n "$namespace" --wait
else
    helm rollback "$release" "$target_revision" -n "$namespace" --wait
fi
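Putting the two excerpts together, the reverse-order rollback loop might look like this (a sketch that prints the helm commands instead of executing them; the loop structure and `rollback_all` name are assumptions, while the `namespace/release:revision` record format and the revision-1 uninstall rule come from the text):

```shell
# Sketch of a reverse-order rollback over the recorded releases.
# Helm commands are echoed, not executed, so the sketch runs anywhere.
DEPLOYED_RELEASES=(
    "matih-control-plane/iam-service:3"
    "matih-data-plane/ai-service:5"
)

rollback_all() {
    local i entry namespace release revision target
    # Walk the array backwards so the most recent deployment is undone first.
    for (( i=${#DEPLOYED_RELEASES[@]}-1; i>=0; i-- )); do
        entry="${DEPLOYED_RELEASES[$i]}"
        namespace="${entry%%/*}"
        release="${entry#*/}"; release="${release%%:*}"
        revision="${entry##*:}"
        if (( revision <= 1 )); then
            # First install: nothing earlier to roll back to, so uninstall
            echo "helm uninstall $release -n $namespace --wait"
        else
            target=$(( revision - 1 ))
            echo "helm rollback $release $target -n $namespace --wait"
        fi
    done
}

rollback_all
```

Run against the two sample records, this prints the rollback for ai-service (revision 5 → 4) before the one for iam-service (revision 3 → 2).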

AKS Health Check

Before each stage that deploys to Kubernetes, the pipeline runs a cluster health check:

if _stage_needs_healthy_cluster "$stage"; then
    if ! aks_healing_pre_deploy; then
        log_error "AKS cluster health check failed"
        return 1
    fi
fi

Stage 00 (Terraform) is excluded since it does not interact with Kubernetes.
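One way such a gate might be expressed (the predicate body is an assumption; the text only states that 00-terraform is excluded because it does not touch Kubernetes):

```shell
# Illustrative predicate; the body is an assumption. The text only states
# that 00-terraform skips the health check as it runs before the cluster exists.
_stage_needs_healthy_cluster() {
    case "$1" in
        00-terraform) return 1 ;;  # Provisions the cluster, cannot require it
        *)            return 0 ;;
    esac
}

_stage_needs_healthy_cluster "08-control-plane-services" && echo "health check required"
```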


ESO Suspension

In dev environments, the pipeline suspends External Secrets Operator (ESO) syncs before deployment. ESO would otherwise overwrite dev credentials with random passwords from Azure Key Vault:

if [[ "$DRY_RUN" != "true" ]] && type k8s_dev_suspend_eso &>/dev/null; then
    k8s_dev_suspend_eso
fi

Build Nodepool Cleanup

After build stages complete, the pipeline scales down multi-architecture build nodepools to zero for cost optimization:

cleanup_build_nodepools() {
    local cleanup_script="${SCRIPT_DIR}/build-cleanup.sh"
    "$cleanup_script" --wait-timeout "$BUILD_CLEANUP_WAIT_TIMEOUT"
}

Cleanup failures are tracked, and an alert is raised after three consecutive failures:

if [[ $CLEANUP_FAILURE_COUNT -ge $MAX_CLEANUP_FAILURES ]]; then
    log_error "ALERT: Cleanup has failed $CLEANUP_FAILURE_COUNT consecutive times!"
fi
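The two excerpts above could be combined into a tracking wrapper like this (a sketch; `run_cleanup_with_tracking` and the always-failing stub are assumptions made for illustration, while the counter and threshold variables follow the excerpts):

```shell
# Sketch of consecutive-failure tracking around the cleanup step.
# run_cleanup_with_tracking is an assumed wrapper; the counter and
# threshold variables follow the excerpts above.
MAX_CLEANUP_FAILURES=3
CLEANUP_FAILURE_COUNT=0

cleanup_build_nodepools() { return 1; }  # Stub that always fails, for illustration

run_cleanup_with_tracking() {
    if cleanup_build_nodepools; then
        CLEANUP_FAILURE_COUNT=0  # A success resets the streak
    else
        CLEANUP_FAILURE_COUNT=$(( CLEANUP_FAILURE_COUNT + 1 ))
        if (( CLEANUP_FAILURE_COUNT >= MAX_CLEANUP_FAILURES )); then
            echo "ALERT: Cleanup has failed $CLEANUP_FAILURE_COUNT consecutive times!" >&2
        fi
    fi
}

run_cleanup_with_tracking   # count = 1
run_cleanup_with_tracking   # count = 2
run_cleanup_with_tracking   # count = 3, alert fires
```

Resetting the counter on success means the alert fires only for a genuine streak of failures, not for intermittent ones.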