MATIH Platform is in active MVP development. Documentation reflects current implementation status.
19. Observability & Operations
Health Checks
Dependency Checks

Dependency checks verify the connectivity and health of external dependencies that MATIH services rely on. These checks run as part of readiness probes and the platform health check script, ensuring that services only receive traffic when their dependencies are available.
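
Since a readiness probe must answer quickly, the individual checks can be run concurrently and aggregated. A minimal sketch (the wiring and function names here are illustrative, not the actual MATIH service code):

```python
import asyncio

# Hypothetical aggregator: "checks" maps a dependency name to a
# zero-argument async check function; all checks run concurrently.
async def run_dependency_checks(checks: dict) -> dict:
    names = list(checks)
    results = await asyncio.gather(*(checks[name]() for name in names))
    dependencies = dict(zip(names, results))
    healthy = all(r["status"] == "healthy" for r in results)
    return {
        "status": "healthy" if healthy else "degraded",
        "dependencies": dependencies,
    }
```

Running the checks with `asyncio.gather` keeps the probe's latency close to that of the slowest single dependency rather than the sum of all of them.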


Dependencies by Service

| Service | Dependency | Type | Critical |
|---|---|---|---|
| AI Service | PostgreSQL | Database | Yes |
| AI Service | Redis | Cache | Yes |
| AI Service | Kafka | Message queue | Yes |
| AI Service | Dgraph | Graph database | No (degraded mode) |
| AI Service | Pinecone | Vector store | No (mock mode) |
| AI Service | OpenAI API | LLM provider | No (cached responses) |
| Query Engine | PostgreSQL | Database | Yes |
| Query Engine | StarRocks | OLAP engine | Yes |
| IAM Service | PostgreSQL | Database | Yes |
| IAM Service | Redis | Session store | Yes |
| Tenant Service | PostgreSQL | Database | Yes |
| Tenant Service | Kafka | Event bus | Yes |
| API Gateway | Redis | Rate limiting | Yes |
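
One way a service could declare this mapping in code is a small registry of dependency descriptors. A sketch, using field names that are illustrative rather than the actual MATIH configuration schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical dependency descriptor; "fallback" names the reduced-
# functionality mode used when a non-critical dependency is down.
@dataclass(frozen=True)
class Dependency:
    name: str
    kind: str
    critical: bool
    fallback: Optional[str] = None

# The AI Service row set from the table above, expressed as a registry.
AI_SERVICE_DEPENDENCIES = [
    Dependency("postgresql", "database", critical=True),
    Dependency("redis", "cache", critical=True),
    Dependency("kafka", "message queue", critical=True),
    Dependency("dgraph", "graph database", critical=False, fallback="degraded mode"),
    Dependency("pinecone", "vector store", critical=False, fallback="mock mode"),
    Dependency("openai", "llm provider", critical=False, fallback="cached responses"),
]
```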

Check Types

Database Check

Executes a minimal query to verify connectivity:

import time

async def check_postgresql(pool) -> dict:
    start = time.monotonic()
    try:
        async with pool.acquire() as conn:
            await conn.fetchval("SELECT 1")
        latency_ms = (time.monotonic() - start) * 1000
        return {"status": "healthy", "latency_ms": round(latency_ms, 2)}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Cache Check

Pings Redis to verify connectivity:

async def check_redis(client) -> dict:
    try:
        await client.ping()
        return {"status": "healthy"}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Message Queue Check

Verifies Kafka broker availability:

async def check_kafka(config) -> dict:
    try:
        producer = AIOKafkaProducer(bootstrap_servers=config.bootstrap_servers)
        await producer.start()
        await producer.stop()
        return {"status": "healthy", "broker_count": len(config.bootstrap_servers)}
    except Exception as e:
        return {"status": "unhealthy", "error": str(e)}

Critical vs. Non-Critical Dependencies

| Classification | Behavior on Failure | Example |
|---|---|---|
| Critical | Service reports not ready, stops receiving traffic | PostgreSQL, Redis |
| Non-critical | Service reports degraded, continues serving with reduced functionality | Dgraph, Pinecone |
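
The overall status can then be derived from the per-dependency results and the set of critical dependencies. A sketch of the decision logic (illustrative, not the exact MATIH implementation):

```python
def overall_status(dep_results: dict, critical: set) -> str:
    # dep_results: name -> {"status": ...}; critical: names that gate readiness.
    unhealthy = {name for name, r in dep_results.items() if r["status"] != "healthy"}
    if unhealthy & critical:
        return "not_ready"   # readiness probe fails; service stops receiving traffic
    if unhealthy:
        return "degraded"    # keeps serving with reduced functionality
    return "healthy"
```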

Dependency Health Response Format

{
  "status": "degraded",
  "dependencies": {
    "postgresql": {
      "status": "healthy",
      "latency_ms": 2.1,
      "pool_size": 10,
      "pool_available": 8
    },
    "redis": {
      "status": "healthy",
      "latency_ms": 0.5
    },
    "kafka": {
      "status": "healthy",
      "broker_count": 3
    },
    "dgraph": {
      "status": "unhealthy",
      "error": "Connection refused"
    },
    "pinecone": {
      "status": "degraded",
      "mode": "mock",
      "reason": "API key not configured"
    }
  }
}

Timeout Configuration

| Dependency | Check Timeout | Description |
|---|---|---|
| PostgreSQL | 5 seconds | Database query timeout |
| Redis | 2 seconds | Ping timeout |
| Kafka | 10 seconds | Broker connection timeout |
| Dgraph | 5 seconds | GraphQL health endpoint |
| Pinecone | 5 seconds | API health check |
| External APIs | 10 seconds | HTTP request timeout |
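
A per-dependency timeout can be enforced around any check with `asyncio.wait_for`, so a hung connection cannot stall the whole probe. A sketch, with the timeout table above expressed as a plain dict (assumed to be configurable in practice):

```python
import asyncio

# Timeouts in seconds, mirroring the table above (assumed configurable).
CHECK_TIMEOUTS = {"postgresql": 5, "redis": 2, "kafka": 10, "dgraph": 5, "pinecone": 5}

async def check_with_timeout(name: str, check_coro) -> dict:
    # Unknown dependencies fall back to the external-API timeout of 10 s.
    timeout = CHECK_TIMEOUTS.get(name, 10)
    try:
        return await asyncio.wait_for(check_coro, timeout=timeout)
    except asyncio.TimeoutError:
        return {"status": "unhealthy", "error": f"check timed out after {timeout}s"}
```

A timed-out check is reported with the same shape as any other failure, so the aggregation and criticality logic do not need a special case.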

Monitoring Dependency Health

Dependency health is tracked via Prometheus metrics:

| Metric | Type | Labels |
|---|---|---|
| matih_dependency_up | Gauge | service, dependency |
| matih_dependency_latency_seconds | Histogram | service, dependency |
| matih_dependency_errors_total | Counter | service, dependency, error_type |
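
With the Python `prometheus_client` library, the metrics in the table could be registered and updated as follows. A sketch only: the metric and label names follow the table, while the helper function, its signature, and the `error_type` value are assumptions.

```python
from prometheus_client import Counter, Gauge, Histogram

DEPENDENCY_UP = Gauge(
    "matih_dependency_up",
    "1 if the dependency check passed, 0 otherwise",
    ["service", "dependency"],
)
DEPENDENCY_LATENCY = Histogram(
    "matih_dependency_latency_seconds",
    "Latency of dependency health checks",
    ["service", "dependency"],
)
DEPENDENCY_ERRORS = Counter(
    "matih_dependency_errors_total",
    "Dependency check failures",
    ["service", "dependency", "error_type"],
)

# Hypothetical helper that records one check result into all three metrics.
def record_check(service: str, dependency: str, result: dict, latency_s: float):
    healthy = result.get("status") == "healthy"
    DEPENDENCY_UP.labels(service, dependency).set(1 if healthy else 0)
    DEPENDENCY_LATENCY.labels(service, dependency).observe(latency_s)
    if not healthy:
        DEPENDENCY_ERRORS.labels(service, dependency, "check_failed").inc()
```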