ML Service Architecture
The ML Service is a Python/FastAPI application with 50+ API routers covering the full machine learning lifecycle. It follows a modular architecture where each subsystem (training, serving, features, monitoring) is independently deployable and configurable. This section examines the module organization, framework integrations, deployment topology, and configuration management.
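The application is assembled from per-domain routers at startup. The sketch below shows the general wiring pattern; the router modules and the create_app factory are illustrative stand-ins, not the service's actual bootstrap code.

```python
# Illustrative sketch only: the router names and create_app factory are
# hypothetical, not the service's actual bootstrap code.
from fastapi import APIRouter, FastAPI

# Each domain exposes its own APIRouter (hypothetical modules).
models_router = APIRouter(prefix="/api/v1/models", tags=["models"])
predictions_router = APIRouter(prefix="/api/v1/predictions", tags=["predictions"])


@models_router.get("/")
async def list_models() -> list[dict]:
    """Placeholder handler; the real service delegates to a registry layer."""
    return []


def create_app() -> FastAPI:
    """Assemble the service from per-domain routers."""
    app = FastAPI(title="ml-service")
    app.include_router(models_router)
    app.include_router(predictions_router)
    return app


app = create_app()
```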
Module Organization
```
src/
    main.py                                  # FastAPI entry point
    main/                                    # Application bootstrap
    api/                                     # API router definitions
    auth/                                    # Authentication middleware
    middleware/                              # Request middleware
    models/                                  # Data models
        ensemble.py                          # Ensemble configuration
        model_metadata.py                    # Model metadata
        prediction.py                        # Prediction request/response
    training/                                # Training subsystem
        distributed_trainer.py               # Ray Train integration
        distributed_workflow_service.py      # Training workflows
        deepspeed_fsdp_trainer.py            # DeepSpeed/FSDP training
        hyperparameter_tuner.py              # Ray Tune integration
        gpu_manager.py                       # GPU resource management
        gpu_tracker.py                       # GPU utilization tracking
        checkpoint_manager.py                # Checkpoint persistence
        enhanced_checkpoint_service.py       # Advanced checkpointing
        job_manager.py                       # Training job lifecycle
        job_scheduler.py                     # Job scheduling
        job_monitoring_service.py            # Job health monitoring
        metrics_collector.py                 # Training metrics collection
        cost_calculator.py                   # Training cost estimation
        validation_pipeline.py               # Model validation
    serving/                                 # Inference subsystem
        prediction_service.py                # Online predictions
        model_loader.py                      # Model loading/caching
        ray_serve.py                         # Ray Serve deployments
        advanced_ray_serve.py                # Advanced serving features
        triton_inference_service.py          # Triton integration
    registry/                                # Model registry
        model_registry.py                    # MLflow integration
    features/                                # Feature store subsystem
        feature_store.py                     # Feast core integration
        unified_feature_store.py             # Unified feature API
        feast_online_store.py                # Online feature serving
        feast_offline_store.py               # Offline feature retrieval
        feast_registry_service.py            # Feature registry
        feature_group_service.py             # Feature group management
        feature_serving.py                   # Feature serving endpoints
        feature_versioning_service.py        # Feature versioning
        feature_materialization_service.py   # Materialization
        streaming_feature_service.py         # Streaming features
        iceberg_offline_store.py             # Iceberg-backed offline store
        aerospike_online_store.py            # Aerospike-backed online store
        embedding_feature_service.py         # Embedding features
        agentic_feature_interface.py         # Agent-accessible features
        registry_state_machine.py            # Feature lifecycle FSM
    monitoring/                              # Model monitoring
        drift_detection_service.py           # Data/concept drift
        model_monitoring_service.py          # Performance monitoring
        performance_monitoring_service.py    # Detailed perf tracking
        retraining_trigger_service.py        # Automated retraining
    testing/                                 # Model testing
        model_testing_service.py             # A/B testing, canary
    ray_air/                                 # Ray AIR integration
        orchestrator.py                      # Ray AIR orchestration
        ray_data_service.py                  # Ray Data integration
        ray_serve_service.py                 # Ray Serve management
    ray_cluster/                             # Ray cluster management
    active_learning/                         # Active learning
    automl/                                  # AutoML pipelines
    batch/                                   # Batch prediction
    caching/                                 # Model/feature caching
    compliance/                              # Model compliance
    compression/                             # Model compression
    cost/                                    # Cost tracking
    datasets/                                # Dataset management
    debugging/                               # Model debugging
    embeddings/                              # Embedding generation
    explainability/                          # Model explainability
    fairness/                                # Fairness evaluation
    governance/                              # Model governance
    inference/                               # Inference optimization
    labeling/                                # Data labeling
    lifecycle/                               # Model lifecycle
    observability/                           # Metrics/tracing
    pipeline/                                # ML pipeline orchestration
    pipelines/                               # Pre-built pipelines
    reproducibility/                         # Reproducibility tools
    safety/                                  # Model safety
    scheduler/                               # Job scheduling
    shadow/                                  # Shadow deployments
    storage/                                 # Artifact storage
    templates/                               # ML templates
    tracking/                                # Experiment tracking
    validation/                              # Data validation
    versioning/                              # Model versioning
    vllm/                                    # vLLM optimization
```
Framework Integration Map
The ML Service integrates with multiple external ML frameworks:
```
ML Service (FastAPI)
 |
 +-- Ray AIR ---------> Ray Cluster
 |      +-- Ray Train (distributed training)
 |      +-- Ray Tune (hyperparameter tuning)
 |      +-- Ray Serve (model serving)
 |      +-- Ray Data (data processing)
 |
 +-- MLflow ----------> MLflow Server
 |      +-- Tracking (experiment tracking)
 |      +-- Registry (model registry)
 |      +-- Artifacts (model artifact storage)
 |
 +-- Feast -----------> Feature Store
 |      +-- Online Store (Redis/Aerospike)
 |      +-- Offline Store (Iceberg/S3)
 |      +-- Registry (feature definitions)
 |
 +-- ONNX Runtime ----> (in-process inference)
 |
 +-- Triton ----------> Triton Inference Server
 |
 +-- scikit-learn ----> (training/inference)
 |
 +-- PyTorch ---------> (training via Ray)
 |
 +-- TensorFlow ------> (training via Ray)
```
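At startup the service establishes clients for the core integrations. The following is a minimal sketch using the public APIs of ray, mlflow, and feast, with connection values matching the configuration defaults shown later in this section; the wiring itself is illustrative, not the actual startup code.

```python
# Illustrative startup wiring, assuming the MLServiceSettings defaults
# shown in the Configuration section below.
import mlflow
import ray
from feast import FeatureStore

# Ray: connect to the external cluster via Ray Client.
ray.init(address="ray://localhost:10001", namespace="matih-ml")

# MLflow: point the tracking client at the MLflow server.
mlflow.set_tracking_uri("http://localhost:5000")

# Feast: load the feature repository (the online/offline stores
# are resolved from the repo's own configuration).
store = FeatureStore(repo_path="/opt/feast/repo")
```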
API Router Architecture
The 50+ API routers are organized by domain:
| Router Group | Endpoints | Key Operations |
|---|---|---|
| Models | /api/v1/models | CRUD, versioning, metadata |
| Ensembles | /api/v1/ensembles | Create, configure, predict |
| Features | /api/v1/features | Feature groups, serving, materialization |
| Predictions | /api/v1/predictions | Single, batch, streaming |
| Training | /api/v1/training | Jobs, status, metrics |
| Tuning | /api/v1/tuning | Hyperparameter search |
| Deployment | /api/v1/deployments | Deploy, scale, rollback |
| Experiments | /api/v1/experiments | Tracking, comparison |
| Feature Store | /api/v1/feature-store | Feast operations |
| Monitoring | /api/v1/monitoring | Drift, performance, alerts |
| Performance | /api/v1/performance | Benchmarks, profiling |
| Drift | /api/v1/drift | Detection, analysis |
| Retraining | /api/v1/retraining | Triggers, scheduling |
| vLLM | /api/v1/vllm | LLM optimization |
| Reproducibility | /api/v1/reproducibility | Experiment reproducibility |
| Compliance | /api/v1/compliance | Model cards, auditing |
| Labeling | /api/v1/labeling | Data labeling workflows |
| Caching | /api/v1/caching | Cache management |
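For illustration, a single-prediction call against the Predictions router might look like the following; the payload fields and model identifier are hypothetical, not taken from the actual request schema.

```python
# Hypothetical request/response shape for the predictions router.
import httpx

response = httpx.post(
    "http://localhost:8000/api/v1/predictions",
    json={
        "model_id": "churn-classifier",  # hypothetical model identifier
        "model_version": "3",
        "features": {"tenure_months": 14, "plan": "pro"},
    },
    headers={"Authorization": "Bearer <token>"},
)
response.raise_for_status()
print(response.json())
```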
Configuration
```python
from pydantic import BaseSettings


class MLServiceSettings(BaseSettings):
    """ML Service configuration."""

    # Service
    service_name: str = "ml-service"
    service_port: int = 8000

    # Database
    database_url: str = "postgresql+asyncpg://..."

    # MLflow
    mlflow_tracking_uri: str = "http://localhost:5000"
    mlflow_artifact_root: str = "s3://mlflow-artifacts"

    # Ray
    ray_address: str = "ray://localhost:10001"
    ray_namespace: str = "matih-ml"
    ray_dashboard_port: int = 8265

    # Feast
    feast_repo_path: str = "/opt/feast/repo"
    feast_online_store_type: str = "redis"
    feast_offline_store_type: str = "file"

    # Object Storage
    s3_endpoint: str = "http://localhost:9000"
    s3_bucket: str = "ml-artifacts"

    # GPU
    gpu_enabled: bool = False
    max_gpu_per_job: int = 4
    gpu_memory_fraction: float = 0.9

    # Monitoring
    drift_check_interval_minutes: int = 60
    performance_alert_threshold: float = 0.1

    class Config:
        env_file = ".env"
        env_prefix = "MATIH_ML_"
```
Multi-Tenancy
All ML Service operations are tenant-scoped:
| Resource | Isolation Mechanism |
|---|---|
| Models | Stored with tenant_id in metadata |
| Experiments | MLflow experiments prefixed with tenant ID |
| Features | Feast feature views scoped by tenant |
| Training jobs | Ray namespaces per tenant |
| Artifacts | S3 paths prefixed with tenant/{tenant_id}/ |
| Predictions | Request validation ensures tenant access |
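The following hypothetical helpers illustrate the naming conventions from the table; the exact separators and prefixes used by the real service may differ.

```python
# Hypothetical helpers illustrating the tenant-scoping conventions above;
# the actual separator for MLflow experiment prefixes is an assumption.
def experiment_name(tenant_id: str, name: str) -> str:
    """MLflow experiments are prefixed with the tenant ID."""
    return f"{tenant_id}/{name}"


def artifact_prefix(tenant_id: str) -> str:
    """S3 artifact paths are prefixed per tenant."""
    return f"tenant/{tenant_id}/"


assert experiment_name("acme", "churn-v2") == "acme/churn-v2"
assert artifact_prefix("acme") == "tenant/acme/"
```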
Deployment
```yaml
# Kubernetes deployment
replicaCount: 2

resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:
    cpu: 2000m
    memory: 4Gi

# GPU nodes for training
nodeSelector:
  gpu: "true"

tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
Health Checks
| Endpoint | Purpose |
|---|---|
| GET /health | Liveness check |
| GET /health/ready | Readiness (DB, MLflow, Ray connectivity) |
| GET /health/ray | Ray cluster health |
| GET /health/mlflow | MLflow server health |
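A readiness handler along these lines exercises the downstream dependencies. This is an illustrative sketch, not the service's actual handler; the database check is omitted for brevity.

```python
# Illustrative readiness probe; the dependency checks are sketches.
import mlflow
import ray
from fastapi import APIRouter, Response, status

health_router = APIRouter(prefix="/health")


def _mlflow_ok() -> bool:
    """MLflow server reachable (listing experiments raises on failure)."""
    try:
        mlflow.MlflowClient().search_experiments(max_results=1)
        return True
    except Exception:
        return False


@health_router.get("/ready")
async def ready(response: Response) -> dict:
    checks = {
        # Ray Client connection established during startup.
        "ray": ray.is_initialized(),
        "mlflow": _mlflow_ok(),
    }
    if not all(checks.values()):
        response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
    return checks
```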