# Verifying the Deployment
After installation and initial configuration, run a comprehensive verification to confirm that the MATIH Platform is fully operational. This page covers the built-in verification tools, manual checks, and troubleshooting procedures for common issues.
## Automated Verification Tools
MATIH provides two primary scripts for deployment verification:
### Platform Status
The platform-status.sh script checks the status of all platform components:
```sh
./scripts/tools/platform-status.sh
```

This script verifies:
| Check | Description |
|---|---|
| Kubernetes connectivity | Can reach the cluster API server |
| Namespace existence | matih-system, matih-shared, and tenant namespaces exist |
| Pod health | All pods are in Running or Completed state |
| Service endpoints | All Kubernetes services have endpoints |
| Persistent volumes | All PVCs are Bound |
| Ingress status | Ingress resources have assigned addresses |
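The persistent-volume check, for example, reduces to listing any claim that is not Bound. A minimal sketch of that check (assumes `jq` is installed; not the script's actual code):

```shell
# unbound_pvcs: read `kubectl get pvc -A -o json` on stdin and print every
# claim that is not in the Bound phase; no output means the check passes.
unbound_pvcs() {
  jq -r '.items[]
    | select(.status.phase != "Bound")
    | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
}

# Usage:
# kubectl get pvc -A -o json | unbound_pvcs
```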
Sample output:

```text
MATIH Platform Status
=====================

Cluster: matih-aks-cluster (Azure AKS)
Kubernetes: v1.29.1

Namespaces:
  matih-system ............ OK
  matih-shared ............ OK
  tenant-acme-corp ........ OK

Control Plane Services:
  iam-service            2/2 pods ready .... OK
  tenant-service         2/2 pods ready .... OK
  config-service         1/1 pods ready .... OK
  api-gateway            2/2 pods ready .... OK
  notification-service   1/1 pods ready .... OK
  audit-service          1/1 pods ready .... OK

Data Plane Services (tenant-acme-corp):
  ai-service             2/2 pods ready .... OK
  query-engine           2/2 pods ready .... OK
  bi-service             1/1 pods ready .... OK
  ml-service             1/1 pods ready .... OK

Infrastructure:
  postgresql             1/1 pods ready .... OK
  redis                  1/1 pods ready .... OK
  kafka                  3/3 pods ready .... OK

Overall Status: HEALTHY
```

### Health Check
The health-check.sh script performs deeper health validation:
```sh
./scripts/disaster-recovery/health-check.sh
```

This script performs:
| Check | Description |
|---|---|
| Service health endpoints | HTTP health check on each service |
| Database connectivity | Test connection to PostgreSQL |
| Message broker | Verify Kafka broker availability |
| Cache | Test Redis connectivity |
| DNS resolution | Verify internal DNS resolution |
| Certificate validity | Check TLS certificate expiration |
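The per-service HTTP check can be sketched as: fetch the health payload and require a top-level status of `UP` (the Spring Boot actuator convention). This is an illustrative helper, not the script's actual implementation, and it assumes `jq` is installed:

```shell
# check_health: succeed only if the JSON health payload on stdin reports
# a top-level status of UP.
check_health() {
  jq -e '.status == "UP"' >/dev/null
}

# Example against a live service (with a port-forward in place):
# curl -s http://localhost:8081/actuator/health | check_health \
#   && echo "iam-service OK" || echo "iam-service UNHEALTHY"
```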
## Manual Verification Steps

### 1. Verify Kubernetes Cluster Health
```sh
# Check node status
kubectl get nodes -o wide

# Expected: All nodes showing STATUS "Ready"
# NAME          STATUS   ROLES    AGE   VERSION
# node-pool-0   Ready    <none>   2d    v1.29.1
# node-pool-1   Ready    <none>   2d    v1.29.1
# node-pool-2   Ready    <none>   2d    v1.29.1
```

### 2. Verify Pod Status
```sh
# Check all pods across namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# This should return no results (no failing pods)
# If pods are failing, investigate with:
#   kubectl describe pod <pod-name> -n <namespace>
#   kubectl logs <pod-name> -n <namespace>
```

### 3. Verify Control Plane Services
Check each control plane service's health endpoint:
```sh
# Using port-forward (if not exposed via ingress)
kubectl port-forward svc/iam-service 8081:8081 -n matih-system &

# Check health
curl -s http://localhost:8081/actuator/health | jq .
```

Expected response:
```json
{
  "status": "UP",
  "components": {
    "db": { "status": "UP" },
    "redis": { "status": "UP" },
    "kafka": { "status": "UP" },
    "diskSpace": { "status": "UP" }
  }
}
```

### 4. Verify Data Plane Services
```sh
# AI Service health
curl -s http://localhost:8000/health | jq .

# Expected:
# {
#   "status": "healthy",
#   "version": "1.0.0",
#   "dependencies": {
#     "database": "connected",
#     "redis": "connected",
#     "kafka": "connected"
#   }
# }
```

### 5. Verify Database Connectivity
```sh
# Check database pods
kubectl get pods -l app=postgresql -A

# Check database service
kubectl get svc -l app=postgresql -A
```

### 6. Verify Kafka
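Beyond the pod checks in this step, you can assert that specific topics are present in the broker's topic list. A minimal sketch (the topic names in the usage comment are hypothetical examples, not confirmed MATIH topic names):

```shell
# check_topics: read a newline-separated topic list on stdin and report any
# expected topic (passed as arguments) that is missing from it.
check_topics() {
  list=$(cat)
  for t in "$@"; do
    echo "$list" | grep -qx -- "$t" || echo "missing topic: $t"
  done
}

# Usage (topic names are hypothetical):
# kubectl exec kafka-0 -n matih-shared -- \
#   kafka-topics.sh --list --bootstrap-server localhost:9092 \
#   | check_topics matih.audit.events matih.notifications
```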
```sh
# Check Kafka pods
kubectl get pods -l app=kafka -A

# Verify topic creation (topics should exist after service deployment)
kubectl exec -it kafka-0 -n matih-shared -- \
  kafka-topics.sh --list --bootstrap-server localhost:9092
```

### 7. Verify Ingress and TLS
```sh
# Check ingress resources
kubectl get ingress -A

# Check certificate status
kubectl get certificates -A

# Validate tenant ingress
./scripts/tools/validate-tenant-ingress.sh --tenant acme-corp
```

## Port Validation
Verify that no port conflicts exist:
```sh
./scripts/tools/validate-ports.sh
```

The source of truth for port assignments is `scripts/config/components.yaml`. This script checks that actual service ports match the defined configuration.
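The same comparison can be scripted by hand: diff expected `service port` pairs against what the cluster reports. A sketch under stated assumptions (single-port Services; the jsonpath and the expected-ports file are illustrative, not part of the platform tooling):

```shell
# diff_ports: compare expected "service port" pairs (file given as $1) with
# actual pairs read from stdin; prints any line present on only one side.
diff_ports() {
  expected=$(mktemp); actual=$(mktemp)
  sort "$1" > "$expected"
  sort > "$actual"
  comm -3 "$expected" "$actual"
  rm -f "$expected" "$actual"
}

# Usage (hypothetical expected-ports.txt, single-port Services assumed):
# kubectl get svc -n matih-system \
#   -o jsonpath='{range .items[*]}{.metadata.name} {.spec.ports[0].port}{"\n"}{end}' \
#   | diff_ports expected-ports.txt
```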
### Expected Port Assignments
Control Plane:
| Service | Port |
|---|---|
| api-gateway | 8080 |
| iam-service | 8081 |
| tenant-service | 8082 |
| platform-registry | 8084 |
| notification-service | 8085 |
| audit-service | 8086 |
| billing-service | 8087 |
| observability-api | 8088 |
| infrastructure-service | 8089 |
| config-service | 8888 |
Data Plane:
| Service | Port |
|---|---|
| ai-service | 8000 |
| ml-service | 8000 |
| data-quality-service | 8000 |
| query-engine | 8080 |
| bi-service | 8084 |
| catalog-service | 8086 |
| pipeline-service | 8092 |
## Smoke Tests
After verifying infrastructure health, run smoke tests to confirm end-to-end functionality:
### Authentication Smoke Test
```sh
# Register a test user
curl -s -X POST http://localhost:8081/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "email": "test@acme.com",
    "password": "Test@12345",
    "firstName": "Test",
    "lastName": "User"
  }' | jq .status

# Login
curl -s -X POST http://localhost:8081/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "email": "test@acme.com",
    "password": "Test@12345"
  }' | jq .tokenType

# Expected: "Bearer"
```

### AI Service Smoke Test
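The query below sends an authenticated request, so it needs a bearer token from the login step. A minimal helper for pulling it out of the login response (the `accessToken` field name is an assumption; adjust it to the actual IAM response shape):

```shell
# extract_token: read a login response on stdin and print the bearer token.
# NOTE: the accessToken field name is assumed, not confirmed by the API.
extract_token() {
  jq -r '.accessToken // empty'
}

# Usage (hypothetical):
# ACCESS_TOKEN=$(curl -s -X POST http://localhost:8081/api/v1/auth/login \
#   -H "Content-Type: application/json" \
#   -d '{"email": "test@acme.com", "password": "Test@12345"}' | extract_token)
```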
```sh
# Test a simple query (requires configured data source)
curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "How many tables are in the database?",
    "tenantId": "acme-corp"
  }' | jq .status

# Expected: "success" or "completed"
```

## Troubleshooting Common Issues
### Pods in CrashLoopBackOff
Diagnosis:
```sh
# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check container logs
kubectl logs <pod-name> -n <namespace>

# Check previous container logs (if restarting)
kubectl logs <pod-name> -n <namespace> --previous
```

Common causes:
| Symptom | Cause | Resolution |
|---|---|---|
| "Connection refused" to database | Database not ready or wrong credentials | Check database pod status and secret values |
| "Invalid JWT secret" | JWT secret not configured | Run dev-secrets.sh or check ESO sync |
| OOMKilled | Insufficient memory | Increase memory limits in Helm values |
| "Unrecognized option" | Wrong CLI flags for the image version | Verify image tag and flag compatibility |
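To surface restart-prone pods cluster-wide in one pass, you can filter `kubectl get pods -o json` for nonzero restart counts. A sketch (assumes `jq` is installed):

```shell
# restarting_pods: read `kubectl get pods -o json` output on stdin and print
# namespace/name for every pod whose containers have restarted at least once.
restarting_pods() {
  jq -r '.items[]
    | select((([.status.containerStatuses[]?.restartCount] | add) // 0) > 0)
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Usage:
# kubectl get pods -A -o json | restarting_pods
```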
### Pods in CreateContainerConfigError
Diagnosis:
```sh
kubectl describe pod <pod-name> -n <namespace>
# Look at the Events section for the specific error
```

Common causes:
| Error Message | Cause | Resolution |
|---|---|---|
| `secret "X" not found` | Missing Kubernetes secret | Run dev-secrets.sh or check ESO ExternalSecret |
| `configmap "X" not found` | Missing ConfigMap | Check Helm chart templates |
| `container has runAsNonRoot and image will run as root` | Security context mismatch | Adjust podSecurityContext in values |
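When chasing down a missing secret, it helps to list every secret a pod actually references before checking each one. A sketch that scans env vars and volumes in the pod spec (assumes `jq` is installed):

```shell
# referenced_secrets: read a pod's JSON on stdin and print each distinct
# secret name referenced via volumes, env secretKeyRef, or envFrom.
referenced_secrets() {
  jq -r '[.spec.volumes[]?.secret.secretName,
          .spec.containers[].env[]?.valueFrom.secretKeyRef.name,
          .spec.containers[].envFrom[]?.secretRef.name]
         | map(select(. != null)) | unique | .[]'
}

# Usage:
# kubectl get pod <pod-name> -n <namespace> -o json | referenced_secrets
# then verify each: kubectl get secret <name> -n <namespace>
```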
### Services Cannot Communicate
Diagnosis:
```sh
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check network policies
kubectl get networkpolicies -n <namespace>

# Test connectivity from within a pod
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -s http://<target-service>:<port>/health
```

Common causes:
| Symptom | Cause | Resolution |
|---|---|---|
| Empty endpoints | No matching pods for service selector | Check pod labels match service selector |
| Connection timeout | NetworkPolicy blocking traffic | Review network policy rules |
| DNS resolution failure | CoreDNS not running | Check kube-dns pods in kube-system |
### Database Connection Failures
```sh
# Test database connectivity
kubectl exec -it <pod-name> -n <namespace> -- \
  pg_isready -h <db-host> -p 5432 -U matih

# Check database secrets
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.password}' | base64 -d
```

### Certificate Issues
```sh
# Check certificate status
kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -l app=cert-manager -n cert-manager
```

## Monitoring After Verification
Once verification passes, set up ongoing monitoring:
| What to Monitor | Tool | Alert Condition |
|---|---|---|
| Pod restarts | Prometheus + Grafana | Any restart in 5 minutes |
| Service latency | Prometheus + Grafana | P95 above 2 seconds |
| Error rates | Prometheus + Grafana | Above 1% for 5 minutes |
| Disk usage | Prometheus + Grafana | Above 80% capacity |
| Certificate expiry | cert-manager | Within 30 days |
| Database connections | PostgreSQL metrics | Pool exhaustion |
## Verification Checklist
| Category | Check | Status |
|---|---|---|
| Cluster | All nodes Ready | |
| Cluster | No failing pods | |
| Control Plane | IAM service healthy | |
| Control Plane | Tenant service healthy | |
| Control Plane | Config service healthy | |
| Control Plane | API gateway healthy | |
| Data Plane | AI service healthy | |
| Data Plane | Query engine healthy | |
| Infrastructure | PostgreSQL connected | |
| Infrastructure | Redis connected | |
| Infrastructure | Kafka brokers available | |
| Networking | Ingress configured | |
| Networking | TLS certificates valid | |
| Smoke Test | Authentication works | |
| Smoke Test | Query execution works | |
## Next Steps
With the platform verified and operational, proceed to Chapter 5: Quickstart Tutorials for hands-on guides to your first natural language query, dashboard, and ML model.