MATIH Platform is in active MVP development. Documentation reflects current implementation status.

ML Service Architecture

Status: Production | Stack: Python/FastAPI | Port: 8000 | Integrations: Ray, MLflow, Feast

The ML Service provides machine learning operations (MLOps) capabilities: model training orchestration, experiment tracking, model versioning, feature store integration, and model serving. It bridges the gap between data scientists working in notebooks and production ML deployments.


2.4.C.1 MLOps Pipeline

Data Preparation                Training                  Deployment
+------------------+    +-------------------+    +-------------------+
| Feature Store    |    | Ray Cluster       |    | Model Registry    |
| (Feast)          |--->| (distributed      |--->| (MLflow)          |
|                  |    |  training)        |    |                   |
| - Feature defs   |    | - Hyperparameter  |    | - Version control |
| - Point-in-time  |    |   tuning          |    | - Stage promotion |
| - Online serving |    | - Distributed     |    | - Artifact store  |
+------------------+    |   data parallel   |    +--------+----------+
                        +-------------------+             |
                                                          v
                                                +-------------------+
                                                | Model Serving     |
                                                | (Ray Serve /      |
                                                |  Triton)          |
                                                |                   |
                                                | - A/B testing     |
                                                | - Canary deploy   |
                                                | - Auto-scaling    |
                                                +-------------------+
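The three phases above can be sketched end to end. The function names and data shapes below are illustrative stubs standing in for the Feast, Ray Train, and MLflow clients, not the service's actual code:

```python
# Illustrative end-to-end pipeline skeleton. Each stub stands in for a
# real client call (Feast, Ray Train, MLflow); names and shapes are
# assumptions for the sketch, not the service's actual API.

def fetch_training_data(feature_view: str) -> list[dict]:
    # Stand-in for Feast's point-in-time-correct historical feature join.
    return [{"feature_a": 1.0, "label": 0}, {"feature_a": 2.0, "label": 1}]

def train_model(rows: list[dict]) -> dict:
    # Stand-in for a distributed Ray Train job; returns metrics + artifact.
    accuracy = sum(r["label"] for r in rows) / len(rows)
    return {"artifact": "model.bin", "accuracy": accuracy}

def register_model(result: dict, name: str) -> dict:
    # Stand-in for the MLflow registry: new versions start unpromoted.
    return {"name": name, "version": 1, "stage": "Development", **result}

rows = fetch_training_data("user_features")
model = register_model(train_model(rows), "churn-classifier")
```

In the real service, each stub is replaced by the corresponding integration from the table below, while the orchestration shape stays the same.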

2.4.C.2 Infrastructure Integration

| Component | Technology | Purpose |
|-----------|------------|---------|
| Distributed training | Ray Train | Scales training across multiple workers |
| Hyperparameter tuning | Ray Tune | Bayesian optimization, grid/random search |
| Experiment tracking | MLflow | Metrics, parameters, artifacts per run |
| Model registry | MLflow Registry | Model versioning with stage gates |
| Feature store | Feast | Feature engineering and point-in-time joins |
| Model serving | Ray Serve / Triton | Low-latency inference endpoints |
| Artifact storage | MinIO (S3-compatible) | Model files, datasets, checkpoints |
| GPU scheduling | Kubernetes device plugin | GPU allocation for training and inference |
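To make the hyperparameter tuning row concrete, here is a tiny pure-Python random search over a grid, mimicking in miniature what Ray Tune does at cluster scale. The search space and objective are invented for illustration:

```python
import random

# Invented search space and toy objective, purely for illustration.
SPACE = {"lr": [1e-3, 1e-2, 1e-1], "depth": [2, 4, 8]}

def objective(cfg: dict) -> float:
    # Toy score that peaks at lr=1e-2, depth=4; higher is better.
    return -abs(cfg["lr"] - 1e-2) - abs(cfg["depth"] - 4) / 10

def sample() -> dict:
    # Draw one random configuration from the grid.
    return {k: random.choice(v) for k, v in SPACE.items()}

random.seed(0)
best = max((sample() for _ in range(20)), key=objective)
```

Ray Tune adds scheduling (early stopping of poor trials), Bayesian search strategies, and parallel execution across the cluster on top of this basic sample-and-score loop.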

Ray Cluster Configuration

| Environment | Head Node | Worker Nodes | GPU Workers |
|-------------|-----------|--------------|-------------|
| Development | 1 (2 CPU, 4Gi) | 0 | 0 |
| Production | 1 (4 CPU, 8Gi) | 2-8 (auto-scale) | 1-4 (on demand) |

2.4.C.3 Model Lifecycle

Models progress through defined stages:

Development --> Staging --> Production --> Archived
    |              |            |
    |              |            +--> Monitoring (drift detection)
    |              |
    |              +--> Validation (automated testing)
    |
    +--> Experiment tracking (MLflow)
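The legal transitions above form a small state machine. A minimal sketch, with stage names taken from the diagram and the enforcement logic assumed:

```python
# Allowed stage transitions, mirroring the lifecycle diagram above.
ALLOWED = {
    "Development": {"Staging"},
    "Staging": {"Production"},
    "Production": {"Archived"},
    "Archived": set(),
}

def promote(current: str, target: str) -> str:
    """Return the new stage, refusing any transition the diagram forbids."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Encoding the transitions as data makes skipped stages (e.g. Development straight to Production) impossible by construction rather than by convention.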

Stage Gates

| Transition | Requirements |
|------------|--------------|
| Development --> Staging | All unit tests pass, metrics meet baseline |
| Staging --> Production | Integration tests pass, A/B test shows improvement, manual approval |
| Production --> Archived | New model version promoted, old version traffic drained |
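The gate requirements can be expressed as predicates over a candidate version's evaluation record. The field names here are assumptions for the sketch, not a documented schema:

```python
# Hypothetical evaluation record fields; names are illustrative only.

def staging_gate(r: dict) -> bool:
    # Development -> Staging: unit tests pass and metric meets baseline.
    return r["unit_tests_passed"] and r["metric"] >= r["baseline"]

def production_gate(r: dict) -> bool:
    # Staging -> Production: integration tests pass, A/B test shows
    # improvement, and a human has approved the promotion.
    return r["integration_tests_passed"] and r["ab_uplift"] > 0 and r["approved"]

candidate = {
    "unit_tests_passed": True, "metric": 0.91, "baseline": 0.88,
    "integration_tests_passed": True, "ab_uplift": 0.02, "approved": False,
}
```

Note that the manual-approval requirement means the production gate can never be satisfied by automation alone.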

2.4.C.4 Key APIs

| Endpoint | Method | Description |
|----------|--------|-------------|
| /api/v1/ml/experiments | GET/POST | Experiment management |
| /api/v1/ml/experiments/{id}/runs | GET/POST | Training run management |
| /api/v1/ml/models | GET/POST | Model registry |
| /api/v1/ml/models/{id}/versions | GET | Model version history |
| /api/v1/ml/models/{id}/deploy | POST | Deploy model for serving |
| /api/v1/ml/models/{id}/promote | POST | Promote model to next stage |
| /api/v1/ml/predict | POST | Run inference against deployed model |
| /api/v1/ml/features | GET | Feature store catalog |
| /api/v1/ml/features/serve | POST | Get features for inference |
| /api/v1/ml/training/submit | POST | Submit distributed training job |
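As an example of calling the inference endpoint, the sketch below builds a request body for POST /api/v1/ml/predict. The payload field names are assumptions, since the request schema is not documented in this section:

```python
import json

def build_predict_request(model_id: str, features: dict) -> str:
    # Hypothetical body shape for POST /api/v1/ml/predict; the real
    # schema may differ (field names here are assumptions).
    return json.dumps({"model_id": model_id, "inputs": [features]})

body = build_predict_request("churn-classifier", {"feature_a": 1.0})
```

The body would then be sent with Content-Type: application/json to the endpoint above.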

Related Sections