CD Pipeline Overview
The MATIH Continuous Deployment pipeline is orchestrated by scripts/cd-new.sh, a 1,290-line Bash script that coordinates 25 stages from Terraform infrastructure provisioning through post-deployment validation. The pipeline features dependency tracking, atomic state management, lock-based concurrency control, automatic rollback on failure, and dry-run preview mode.
Source file: scripts/cd-new.sh (Version 1.3.0)
Stage Architecture
The 25 stages are organized into four groups:
Infrastructure & Build (Stages 00-06)
├── 00-terraform Provision cloud infrastructure
├── 01-build-setup Docker buildx, environment setup
├── 01a-validate-schemas Hibernate schema validation
├── 02-build-base-images Base container images
├── 03-build-commons Shared libraries
├── 04-build-service-images All service Docker images
├── 04a-sync-tenant-images Sync images to tenant registries
├── 05a-control-plane-infrastructure CP PostgreSQL, Redis, Kafka
├── 05b-data-plane-infrastructure DP PostgreSQL, Redis, Kafka
└── 06-ingress-controller NGINX ingress
Control Plane (Stages 07-09)
├── 07-control-plane-monitoring Prometheus, Grafana for CP
├── 08-control-plane-services IAM, tenant, config, etc.
└── 09-control-plane-frontend Control plane UI
Data Plane (Stages 10-17)
├── 10-data-plane-monitoring Prometheus, Grafana for DP
├── 11-compute-engines Spark, Flink, Ray, Trino
├── 12-workflow-orchestration Airflow
├── 13-data-catalogs OpenMetadata
├── 14-ml-infrastructure KubeRay, MLflow
├── 15-ai-infrastructure vLLM, Ollama
├── 15a-matih-operator Platform operator
├── 16-data-plane-services ai-service, ml-service, etc.
└── 17-data-plane-frontend Workbenches
Validation (Stage 18)
└── 18-validate Health checks, smoke tests
Dependency Graph
Each stage declares its dependencies. The pipeline verifies all dependencies are satisfied before executing a stage:
00-terraform
├──> 01-build-setup ──> 01a-validate-schemas
│    └──> 02-build-base-images ──> 03-build-commons ──> 04-build-service-images
├──> 05a-control-plane-infrastructure ─┐
└──> 05b-data-plane-infrastructure ────┴──> 06-ingress-controller
                                            │
     ┌──────────────────────────────────────┘
     ├──> 07-cp-monitoring ──> 08-cp-services ──> 09-cp-frontend
     ├──> 10-dp-monitoring
     ├──> 11-compute-engines ──> 14-ml-infra ──> 15-ai-infra
     ├──> 12-workflow-orchestration ──> 13-data-catalogs
     └──> 15a-matih-operator
          └──> 16-dp-services ──> 17-dp-frontend
08-cp-services + 16-dp-services ──> 18-validate
Pipeline Commands
# Run all stages
./scripts/cd-new.sh all dev
# Run infrastructure stages only
./scripts/cd-new.sh infra dev
# Run platform stages (05a through 15)
./scripts/cd-new.sh platform dev
# Run service stages only
./scripts/cd-new.sh services dev
# Run a specific stage
./scripts/cd-new.sh 04-build-service-images dev
./scripts/cd-new.sh 04 dev # Short form
# Pass arguments to a stage script
./scripts/cd-new.sh 04-build-service-images dev -- --only copilot
# Validation only
./scripts/cd-new.sh validate dev
# Rollback a specific release
./scripts/cd-new.sh rollback ai-service matih-data-plane dev
# Rollback everything
./scripts/cd-new.sh rollback-all dev
# View status and history
./scripts/cd-new.sh status
./scripts/cd-new.sh history
./scripts/cd-new.sh deps
# Dry run (preview)
DRY_RUN=true ./scripts/cd-new.sh all dev
# Clear state (allow re-running stages)
./scripts/cd-new.sh clear-state
State Management
The pipeline tracks stage completion in .cd-pipeline-state:
# Format: stage:status:timestamp:version
00-terraform:completed:2026-02-10T14:30:00+05:30:v1.2.3
01-build-setup:completed:2026-02-10T14:32:15+05:30:v1.2.3
04-build-service-images:started:2026-02-10T14:35:00+05:30:v1.2.3
State is written atomically using temp files and mv:
save_stage_state() {
    local stage="$1"
    local status="$2"
    local temp_file="${STATE_FILE}.tmp.$$"
    {
        grep -v "^${stage}:" "$STATE_FILE" 2>/dev/null || true
        echo "${stage}:${status}:$(date -Iseconds):${version}"
    } > "$temp_file"
    mv "$temp_file" "$STATE_FILE"  # Atomic rename
}
Concurrency Control
A file-based lock prevents concurrent pipeline runs:
acquire_lock() {
    touch "$LOCK_FILE"
    exec 200>"$LOCK_FILE"
    if command -v flock &>/dev/null; then
        # Linux: block up to LOCK_TIMEOUT seconds waiting for the lock
        flock -w "$LOCK_TIMEOUT" 200 || return 1
    else
        # macOS fallback (no flock): best-effort PID file
        echo $$ > "${LOCK_FILE}.pid"
    fi
}
Rollback Mechanism
The pipeline records all Helm releases deployed during a run. On failure, it rolls back in reverse order:
# Recorded format: namespace/release:revision
DEPLOYED_RELEASES+=("matih-data-plane/ai-service:5")
# Rollback: decrement revision
helm rollback ai-service 4 -n matih-data-plane --wait
If the release is at revision 1 (first install), the pipeline uninstalls instead:
if [[ "$current_revision" -le 1 ]]; then
    helm uninstall "$release" -n "$namespace" --wait
else
    helm rollback "$release" "$target_revision" -n "$namespace" --wait
fi
AKS Health Check
Before each stage that deploys to Kubernetes, the pipeline runs a cluster health check:
if _stage_needs_healthy_cluster "$stage"; then
    if ! aks_healing_pre_deploy; then
        log_error "AKS cluster health check failed"
        return 1
    fi
fi
Stage 00 (Terraform) is excluded since it does not interact with Kubernetes.
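The gating helper itself is not shown in the excerpt. A minimal hypothetical sketch, assuming the check is keyed on the stage name and that only the Terraform stage skips it:

```shell
# Hypothetical sketch: every stage except 00-terraform deploys to
# Kubernetes, so only 00-terraform bypasses the cluster health check.
_stage_needs_healthy_cluster() {
    local stage="$1"
    [[ "$stage" != "00-terraform" ]]
}
```

With this shape, `_stage_needs_healthy_cluster 00-terraform` returns non-zero and the health check is skipped; any other stage name triggers it.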
ESO Suspension
In dev environments, the pipeline suspends External Secrets Operator (ESO) syncs before deployment. ESO would otherwise overwrite dev credentials with random passwords from Azure Key Vault:
if [[ "$DRY_RUN" != "true" ]] && type k8s_dev_suspend_eso &>/dev/null; then
    k8s_dev_suspend_eso
fi
Build Nodepool Cleanup
After build stages complete, the pipeline scales down multi-architecture build nodepools to zero for cost optimization:
cleanup_build_nodepools() {
    local cleanup_script="${SCRIPT_DIR}/build-cleanup.sh"
    "$cleanup_script" --wait-timeout "$BUILD_CLEANUP_WAIT_TIMEOUT"
}
Cleanup failures are tracked, and an alert is raised after 3 consecutive failures:
if [[ $CLEANUP_FAILURE_COUNT -ge $MAX_CLEANUP_FAILURES ]]; then
    log_error "ALERT: Cleanup has failed $CLEANUP_FAILURE_COUNT consecutive times!"
fi
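The counter bookkeeping around that alert is not shown above. A minimal sketch, assuming a global counter that resets on success and increments on failure (the helper name `record_cleanup_result` is hypothetical):

```shell
MAX_CLEANUP_FAILURES=3
CLEANUP_FAILURE_COUNT=0

# Hypothetical helper: reset the counter when cleanup succeeds,
# increment it on failure, and alert once the threshold is reached.
record_cleanup_result() {
    local exit_code="$1"
    if [[ "$exit_code" -eq 0 ]]; then
        CLEANUP_FAILURE_COUNT=0
    else
        CLEANUP_FAILURE_COUNT=$((CLEANUP_FAILURE_COUNT + 1))
    fi
    if [[ "$CLEANUP_FAILURE_COUNT" -ge "$MAX_CLEANUP_FAILURES" ]]; then
        echo "ALERT: Cleanup has failed $CLEANUP_FAILURE_COUNT consecutive times!" >&2
    fi
}
```

Resetting on success means the alert only fires for consecutive failures, not for occasional flakes spread across many runs.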