Service Discovery

The ServiceDiscoveryController and ServiceDiscoveryService provide runtime service registration, instance discovery, health aggregation, and dependency graph management. Services register their instances at startup, and other services use the discovery API to locate healthy instances with weighted load balancing.

Register a Service Instance

Endpoint: POST /api/v1/services/register

curl -X POST http://localhost:8084/api/v1/services/register \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "serviceName": "ai-service",
    "instanceId": "ai-service-pod-abc123",
    "host": "10.0.1.42",
    "port": 8000,
    "protocol": "http",
    "version": "2.1.0",
    "healthCheckUrl": "http://10.0.1.42:8000/health",
    "metricsUrl": "http://10.0.1.42:8000/metrics",
    "region": "us-east-1",
    "zone": "us-east-1a",
    "weight": 100,
    "metadata": {
      "gpu": "true",
      "model-loaded": "gpt-4"
    },
    "tags": ["gpu", "production"]
  }'

If an instance with the same serviceName and instanceId already exists, it is updated rather than duplicated. New instances start in STARTING status with zero consecutive failures.
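The update-or-create behavior can be pictured with a minimal in-memory sketch. The `Registry` and `Instance` names are hypothetical stand-ins for the real persistence layer, and leaving the status of an existing instance untouched on re-registration is an assumption, since the text above does not say how re-registration affects status.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    service_name: str
    instance_id: str
    host: str
    port: int
    weight: int = 100
    status: str = "STARTING"        # new instances start in STARTING
    consecutive_failures: int = 0   # and with zero consecutive failures


class Registry:
    """Hypothetical in-memory stand-in for the service's database."""

    def __init__(self) -> None:
        self._instances = {}  # keyed by (serviceName, instanceId)

    def register(self, service_name: str, instance_id: str,
                 host: str, port: int, weight: int = 100) -> Instance:
        key = (service_name, instance_id)
        existing = self._instances.get(key)
        if existing is None:
            existing = Instance(service_name, instance_id, host, port, weight)
            self._instances[key] = existing
        else:
            # Same serviceName and instanceId: update in place, never duplicate.
            # Assumption: status and failure count are left as they were.
            existing.host, existing.port, existing.weight = host, port, weight
        return existing

    def count(self) -> int:
        return len(self._instances)


registry = Registry()
first = registry.register("ai-service", "ai-service-pod-abc123", "10.0.1.42", 8000)
second = registry.register("ai-service", "ai-service-pod-abc123", "10.0.1.43", 8000)
```

After both calls the registry holds one record, and `second` is the same object as `first` with the new host.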

ServiceInstance Structure

Field	Type	Description
`id`	UUID	Internal database identifier
`serviceName`	String	Logical service name
`instanceId`	String	Unique instance identifier (e.g., pod name)
`host`	String	Instance hostname or IP address
`port`	Integer	Service port
`protocol`	String	Protocol (http, https, grpc)
`version`	String	Running software version
`status`	ServiceStatus	Current instance status
`healthCheckUrl`	String	URL for health probes
`metricsUrl`	String	URL for Prometheus metrics
`metadata`	Map	Key-value metadata pairs
`tags`	Set	Tags for filtering
`lastHealthCheck`	Instant	Last health check timestamp
`consecutiveFailures`	Integer	Consecutive failed health checks
`region`	String	Cloud region
`zone`	String	Availability zone
`weight`	Integer	Load balancing weight
`registeredAt`	Instant	Registration timestamp
`lastUpdated`	Instant	Last update timestamp
`deregisteredAt`	Instant	Deregistration timestamp

Service Status

Status	Description
`STARTING`	Instance registered, not yet healthy
`HEALTHY`	Passing health checks
`UNHEALTHY`	Failing health checks
`DRAINING`	Draining connections before shutdown
`DOWN`	Instance is down
`DEREGISTERED`	Instance has been deregistered

Deregister an Instance

Endpoint: DELETE /api/v1/services/:serviceName/instances/:instanceId

Marks the instance as DEREGISTERED and records the deregistration timestamp. Deregistered instances are excluded from discovery queries but retained in the database for audit purposes.
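Deregistration as described is a soft delete. The sketch below illustrates that idea over plain dictionaries; the helper names and the second instance ID are made up for illustration, and the ISO-8601 timestamp format is an assumption.

```python
from datetime import datetime, timezone


def deregister(instance: dict) -> dict:
    """Soft-delete: flag the record and stamp it, but keep it stored."""
    instance["status"] = "DEREGISTERED"
    instance["deregisteredAt"] = datetime.now(timezone.utc).isoformat()
    return instance


def discoverable(instances: list) -> list:
    """Instances that discovery queries may still return."""
    return [i for i in instances if i["status"] != "DEREGISTERED"]


stored = [
    {"instanceId": "ai-service-pod-abc123", "status": "HEALTHY", "deregisteredAt": None},
    {"instanceId": "ai-service-pod-def456", "status": "HEALTHY", "deregisteredAt": None},
]
deregister(stored[0])
visible = discoverable(stored)
```

The deregistered record stays in `stored` for auditing, while `visible` contains only the remaining instance.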

Service Discovery Endpoints

List All Services

Endpoint: GET /api/v1/services

Returns a list of all unique service names that have at least one active (non-deregistered) instance.

Get All Instances

Endpoint: GET /api/v1/services/:serviceName

Returns all instances for a service, including unhealthy and starting instances.

Get Healthy Instances

Endpoint: GET /api/v1/services/:serviceName/healthy

Returns only instances with HEALTHY status, suitable for routing traffic.

Get Single Instance (Load Balanced)

Endpoint: GET /api/v1/services/:serviceName/instance

Returns a single healthy instance selected via weighted random load balancing. Instances with higher weight values are more likely to be selected. If no healthy instance exists, the endpoint returns 404.
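One common way to implement weighted random selection is to pick each healthy instance with probability proportional to its weight. The sketch below assumes that interpretation; the server's actual algorithm is not documented here, and `pick_instance` is a hypothetical helper.

```python
import random


def pick_instance(instances: list, rng=random):
    """Return one HEALTHY instance, chosen with probability weight / sum(weights).

    Returns None when no healthy instance exists (the endpoint answers 404
    in that case). Weights are assumed to be positive integers.
    """
    healthy = [i for i in instances if i["status"] == "HEALTHY"]
    if not healthy:
        return None
    weights = [i["weight"] for i in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]


instances = [
    {"instanceId": "a", "status": "HEALTHY", "weight": 300},
    {"instanceId": "b", "status": "HEALTHY", "weight": 100},
    {"instanceId": "c", "status": "UNHEALTHY", "weight": 1000},
]

# Seeded generator so the tally is reproducible.
rng = random.Random(42)
tally = {"a": 0, "b": 0, "c": 0}
for _ in range(4000):
    tally[pick_instance(instances, rng)["instanceId"]] += 1
```

Under these weights, instance `a` should be picked roughly three times as often as `b`, and `c` is never picked because it is not healthy.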

Service Status Summary

Endpoint: GET /api/v1/services/status

Returns a status summary for every registered service.

{
  "ai-service": {
    "serviceName": "ai-service",
    "totalInstances": 5,
    "healthyInstances": 4,
    "unhealthyInstances": 1,
    "overallStatus": "UP"
  },
  "query-engine": {
    "serviceName": "query-engine",
    "totalInstances": 3,
    "healthyInstances": 3,
    "unhealthyInstances": 0,
    "overallStatus": "UP"
  }
}

The overallStatus is UP if at least one healthy instance exists, DOWN otherwise.
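The summary can be derived from the instance list roughly as follows. This is a sketch under two assumptions that the text does not confirm: deregistered instances are excluded from the counts, and `unhealthyInstances` counts only instances in UNHEALTHY status (so a STARTING instance adds to the total only).

```python
def summarize(instances: list) -> dict:
    """Build a per-service status summary keyed by service name."""
    summary = {}
    for inst in instances:
        if inst["status"] == "DEREGISTERED":
            continue  # assumption: deregistered records are not counted
        entry = summary.setdefault(inst["serviceName"], {
            "serviceName": inst["serviceName"],
            "totalInstances": 0,
            "healthyInstances": 0,
            "unhealthyInstances": 0,
        })
        entry["totalInstances"] += 1
        if inst["status"] == "HEALTHY":
            entry["healthyInstances"] += 1
        elif inst["status"] == "UNHEALTHY":
            entry["unhealthyInstances"] += 1
    for entry in summary.values():
        # UP if at least one healthy instance exists, DOWN otherwise.
        entry["overallStatus"] = "UP" if entry["healthyInstances"] >= 1 else "DOWN"
    return summary


instances = (
    [{"serviceName": "ai-service", "status": "HEALTHY"}] * 4
    + [{"serviceName": "ai-service", "status": "UNHEALTHY"}]
    + [{"serviceName": "batch-service", "status": "STARTING"}]  # hypothetical service
)
status = summarize(instances)
```

Here `ai-service` comes out as UP with 4 of 5 instances healthy, and the hypothetical `batch-service` is DOWN because its only instance has not become healthy yet.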

Health Endpoints

Aggregated Service Health

Endpoint: GET /api/v1/services/:serviceName/health

Returns the aggregated health status for a service, computed by the HealthAggregationService. This includes health check results across all instances.

Platform-Wide Health

Endpoint: GET /api/v1/services/health

Returns the overall platform health status, aggregating health across all registered services.

Trigger Health Check

Endpoint: POST /api/v1/services/:serviceName/instances/:instanceId/health-check

Manually triggers a health check for a specific instance. Returns 202 Accepted because the health check runs asynchronously.

Service Dependencies

Add a Dependency

Endpoint: POST /api/v1/services/:serviceName/dependencies

curl -X POST http://localhost:8084/api/v1/services/ai-service/dependencies \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "dependsOn": "query-engine",
    "type": "RUNTIME",
    "required": true,
    "minVersion": "2.0.0",
    "healthCheckEndpoint": "/health",
    "timeoutMs": 5000
  }'

Duplicate dependencies (same serviceName and dependsOn) are rejected with an error.
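The uniqueness rule amounts to treating the pair (serviceName, dependsOn) as a key. A minimal sketch, assuming an in-memory store; `DependencyStore` and `DuplicateDependencyError` are hypothetical names, and the real API's error status and body are not specified here.

```python
class DuplicateDependencyError(ValueError):
    """Hypothetical error type standing in for the API's error response."""


class DependencyStore:
    def __init__(self) -> None:
        self._by_pair = {}  # keyed by (serviceName, dependsOn)

    def add(self, service_name: str, dependency: dict) -> dict:
        key = (service_name, dependency["dependsOn"])
        if key in self._by_pair:
            # Only the pair matters: a different type or version range
            # for the same pair is still a duplicate.
            raise DuplicateDependencyError(
                f"{service_name} already depends on {dependency['dependsOn']}"
            )
        record = {"serviceName": service_name, **dependency}
        self._by_pair[key] = record
        return record


store = DependencyStore()
created = store.add("ai-service", {"dependsOn": "query-engine", "type": "RUNTIME", "required": True})

try:
    store.add("ai-service", {"dependsOn": "query-engine", "type": "OPTIONAL", "required": False})
    rejected = False
except DuplicateDependencyError:
    rejected = True
```

The second call is rejected even though its `type` differs, because the service pair already exists.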

ServiceDependency Structure

Field	Type	Description
`id`	UUID	Dependency identifier
`serviceName`	String	Service that has the dependency
`dependsOn`	String	Service being depended on
`type`	DependencyType	Classification of the dependency
`minVersion`	String	Minimum required version
`maxVersion`	String	Maximum supported version
`required`	Boolean	Whether the dependency is required
`healthCheckEndpoint`	String	Health check endpoint on the dependency
`timeoutMs`	Integer	Health check timeout in milliseconds

Dependency Types

Type	Description
`RUNTIME`	Required at runtime for normal operation
`BUILD`	Required at build time only
`OPTIONAL`	Optional dependency for enhanced features
`DEV`	Development-only dependency

Dependency Queries

Get Dependencies

Endpoint: GET /api/v1/services/:serviceName/dependencies

Returns all declared dependencies for a service.

Get Dependents

Endpoint: GET /api/v1/services/:serviceName/dependents

Returns a list of service names that depend on the specified service. Useful for impact analysis before taking a service offline.
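The dependents query is the reverse lookup of the dependencies query. The sketch below shows the idea over a list of dependency records; `dependents_of` is a hypothetical helper, and sorting the result is an assumption made for stable output.

```python
def dependents_of(target: str, dependencies: list) -> list:
    """Names of services that declare a direct dependency on `target`."""
    return sorted({d["serviceName"] for d in dependencies if d["dependsOn"] == target})


dependencies = [
    {"serviceName": "ai-service", "dependsOn": "query-engine"},
    {"serviceName": "ai-service", "dependsOn": "redis"},
    {"serviceName": "query-engine", "dependsOn": "postgresql"},
    {"serviceName": "query-engine", "dependsOn": "kafka"},
]

impacted = dependents_of("query-engine", dependencies)
```

With these records, taking `query-engine` offline directly affects `ai-service`. This lookup covers direct dependents only; transitive impact would need a graph traversal.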

Check Dependency Health

Endpoint: GET /api/v1/services/:serviceName/dependencies/health

Checks whether all required dependencies have at least one healthy instance.

{
  "serviceName": "ai-service",
  "allHealthy": true,
  "dependencies": [
    {
      "serviceName": "query-engine",
      "healthy": true,
      "healthyInstances": 3
    },
    {
      "serviceName": "redis",
      "healthy": true,
      "healthyInstances": 1
    }
  ]
}
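The check above can be sketched as a fold over the declared dependencies. This assumes that every declared dependency appears in the response but that only those with `required` set to true influence `allHealthy`; the function name and the `healthy_counts` input are hypothetical.

```python
def check_dependency_health(service_name: str, dependencies: list, healthy_counts: dict) -> dict:
    """Report per-dependency health and whether all required ones are healthy."""
    results = []
    all_healthy = True
    for dep in dependencies:
        count = healthy_counts.get(dep["dependsOn"], 0)
        healthy = count >= 1  # healthy means at least one healthy instance
        results.append({
            "serviceName": dep["dependsOn"],
            "healthy": healthy,
            "healthyInstances": count,
        })
        if dep["required"] and not healthy:
            all_healthy = False
    return {"serviceName": service_name, "allHealthy": all_healthy, "dependencies": results}


deps = [
    {"dependsOn": "query-engine", "required": True},
    {"dependsOn": "redis", "required": True},
]
report = check_dependency_health("ai-service", deps, {"query-engine": 3, "redis": 1})
```

With three healthy `query-engine` instances and one healthy `redis` instance, `report` matches the sample response shown above.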

Full Dependency Graph

Endpoint: GET /api/v1/services/dependencies/graph

Returns the complete dependency graph across all services as a set of nodes (service names) and directed edges (dependency relationships).

{
  "nodes": ["ai-service", "query-engine", "redis", "kafka", "postgresql"],
  "edges": [
    { "from": "ai-service", "to": "query-engine", "type": "RUNTIME", "required": true },
    { "from": "ai-service", "to": "redis", "type": "RUNTIME", "required": true },
    { "from": "query-engine", "to": "postgresql", "type": "RUNTIME", "required": true },
    { "from": "query-engine", "to": "kafka", "type": "OPTIONAL", "required": false }
  ]
}
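A client can use this graph to derive a start-up order in which every service comes after the services it depends on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) on the sample response; `startup_order` is a hypothetical helper, not part of the API.

```python
import json
from graphlib import CycleError, TopologicalSorter

GRAPH_JSON = """
{
  "nodes": ["ai-service", "query-engine", "redis", "kafka", "postgresql"],
  "edges": [
    { "from": "ai-service", "to": "query-engine", "type": "RUNTIME", "required": true },
    { "from": "ai-service", "to": "redis", "type": "RUNTIME", "required": true },
    { "from": "query-engine", "to": "postgresql", "type": "RUNTIME", "required": true },
    { "from": "query-engine", "to": "kafka", "type": "OPTIONAL", "required": false }
  ]
}
"""


def startup_order(graph: dict) -> list:
    """Order services so that each one follows all of its dependencies.

    Raises graphlib.CycleError if the dependency graph contains a cycle.
    """
    sorter = TopologicalSorter()
    for node in graph["nodes"]:
        sorter.add(node)  # keeps isolated services in the result
    for edge in graph["edges"]:
        sorter.add(edge["from"], edge["to"])  # "to" must start before "from"
    return list(sorter.static_order())


def has_cycle(graph: dict) -> bool:
    try:
        startup_order(graph)
        return False
    except CycleError:
        return True


order = startup_order(json.loads(GRAPH_JSON))
```

In the resulting order, `postgresql` and `kafka` precede `query-engine`, which in turn precedes `ai-service`. The relative order of unrelated services is not significant.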
