Machine Learning Platform
The Machine Learning pillar of the MATIH Platform provides a complete ML lifecycle management system that integrates with the data engineering, governance, and conversational analytics pillars. Data scientists and ML engineers can track experiments, train models at scale, deploy to production with A/B testing, and monitor for drift -- all within a unified platform that maintains full lineage from training data through deployed predictions.
1.1 ML Lifecycle Overview
The MATIH ML Platform supports the complete machine learning lifecycle:
| Data Preparation | Training | Deployment | Monitoring |
|---|---|---|---|
| Feature Store (Feast) | Experiment Tracking | Model Registry | Drift Detection |
| Data Quality Checks | Hyperparameter Tuning | Staging/Production | Performance Metrics |
| Dataset Versioning | Distributed Training | A/B Testing | Alerting |
| Schema Validation | Auto-logging | Canary Deployment | Retraining Triggers |
| | Resource Provisioning | Traffic Management | |

Key Services
| Service | Technology | Role |
|---|---|---|
| ml-service | Python FastAPI (port 8000) | Experiment tracking, model registry, deployment orchestration |
| Ray (KubeRay) | Python | Distributed training and hyperparameter tuning |
| MLflow | Python | Experiment tracking backend and artifact storage |
| Feast | Python | Feature store for shared feature definitions |
| vLLM | Python | High-performance LLM inference serving |
| Triton | C++/Python | Multi-framework model serving (TensorFlow, PyTorch, ONNX) |
1.2 Experiment Tracking
The ml-service provides comprehensive experiment tracking that captures every aspect of a training run:
| Feature | Description | Storage |
|---|---|---|
| Experiment creation | Named groups of related training runs with shared objectives | PostgreSQL |
| Parameter logging | Hyperparameters, data versions, preprocessing settings, feature selections | PostgreSQL JSONB |
| Metric tracking | Training loss, validation metrics, custom metrics with step-level granularity | PostgreSQL + time-series |
| Artifact storage | Model weights, evaluation plots, feature importance charts, confusion matrices | S3-compatible (MinIO in dev) |
| Run comparison | Side-by-side comparison of metrics across runs with difference highlighting | ML Workbench UI |
| Auto-logging | Automatic capture of framework-specific metrics for scikit-learn, PyTorch, TensorFlow, XGBoost | Framework-specific hooks |
| Environment capture | Python version, package versions, GPU configuration, random seeds | Automatic at run start |
| Git integration | Commit hash, branch, diff from HEAD captured for reproducibility | Automatic from workspace |
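The experiment/run/parameter/metric hierarchy can be sketched in a few lines of plain Python. This is an illustrative in-memory model of the concepts above, not the ml-service API: the `Run` class, `compare_runs` helper, and field names are assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """One training run inside an experiment (illustrative, not the real ml-service schema)."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)  # metric name -> list of (step, value)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        # Step-level granularity: each metric keeps its full history.
        self.metrics.setdefault(key, []).append((step, value))

    def latest(self, key):
        return self.metrics[key][-1][1]

def compare_runs(runs, metric, higher_is_better=True):
    """Side-by-side comparison: sort runs best-first by the metric's latest value."""
    return sorted(runs, key=lambda r: r.latest(metric), reverse=higher_is_better)

# Two runs from the churn_prediction example below
a = Run("random_forest"); a.log_param("max_depth", 10); a.log_metric("auc", 0.82)
b = Run("xgboost"); b.log_param("n_estimators", 500); b.log_metric("auc", 0.89)
best = compare_runs([a, b], "auc")[0]
print(best.name)  # -> xgboost
```

In the real platform, parameters land in PostgreSQL JSONB and metric histories in the time-series store; the comparison view is what the ML Workbench renders side by side.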
Experiment Organization
Project: Customer Analytics
|
+-- Experiment: churn_prediction_v3
| |
| +-- Run 1: random_forest, max_depth=10, AUC=0.82
| +-- Run 2: random_forest, max_depth=20, AUC=0.85
| +-- Run 3: xgboost, n_estimators=500, AUC=0.89 <-- Best
| +-- Run 4: neural_net, hidden=[128,64], AUC=0.87
|
+-- Experiment: demand_forecast_v2
| |
| +-- Run 1: prophet, yearly_seasonality=True, MAPE=12.3%
| +-- Run 2: lstm, seq_len=30, MAPE=9.8% <-- Best
|
+-- Experiment: pricing_optimization_v1
|
+-- Run 1: linear_regression, features=12, R2=0.74
+-- Run 2: gradient_boost, features=25, R2=0.91 <-- Best

1.3 Model Registry
The model registry provides version-controlled model management with promotion stages:
| Stage | Description | Governance |
|---|---|---|
| Development | Model created from experiment run, not yet validated | No restrictions |
| Staging | Model promoted for testing; deployed to staging environment | Requires run metrics meeting baseline thresholds |
| Production | Model serving live traffic | Requires approval from model owner or team lead |
| Archived | Previous production model, preserved for rollback | Automatically archived when replaced |
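The Staging gate in the table above is mechanical: a version may only advance if its run metrics clear the baseline thresholds. A minimal sketch of such a gate, assuming a hypothetical per-model baseline policy (the threshold values and function name here are illustrative, not the registry's actual interface):

```python
# Assumed baseline policy for the example; real thresholds are set per model.
BASELINES = {"auc": 0.80}

def can_promote_to_staging(run_metrics: dict, baselines: dict = BASELINES) -> bool:
    """A model version may enter Staging only if every baselined metric
    meets or exceeds its threshold. Missing metrics fail the gate."""
    return all(run_metrics.get(m, float("-inf")) >= t for m, t in baselines.items())

assert can_promote_to_staging({"auc": 0.89})      # clears the 0.80 baseline
assert not can_promote_to_staging({"auc": 0.74})  # blocked below baseline
assert not can_promote_to_staging({})             # blocked: metric never logged
```

Promotion to Production adds a human step on top of this check (owner or team-lead approval), which is why it is a governance rule rather than code.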
Model Cards
Every registered model includes a model card with:
- Description -- What the model does and its intended use case
- Training data -- Dataset versions, date ranges, and quality scores (linked via Context Graph)
- Performance metrics -- Primary and secondary metrics with confidence intervals
- Fairness metrics -- Bias indicators across protected attributes (if applicable)
- Limitations -- Known failure modes, data requirements, and drift sensitivity
- Deployment history -- When and where the model was deployed, with rollback history
- Lineage -- Full lineage from source tables through feature engineering to model artifact
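As a rough illustration of the record behind a model card, the fields above map naturally onto a structured object. The dataclass and field names below are assumptions for the example, not the registry's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Illustrative model-card record; field names are assumed, not the real schema."""
    description: str
    training_data: list            # dataset versions and date ranges
    performance_metrics: dict      # metric -> (value, confidence interval)
    limitations: list = field(default_factory=list)
    deployment_history: list = field(default_factory=list)

card = ModelCard(
    description="Predicts 30-day churn probability for active customers",
    training_data=["customers_v12 (2023-01 .. 2024-06)"],
    performance_metrics={"auc": (0.89, (0.87, 0.91))},
    limitations=["Sensitive to drift in the tenure distribution"],
)
```

Lineage and fairness metrics are not stored inline like this; they are resolved through the Context Graph links described above.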
1.4 Distributed Training
MATIH integrates Ray for distributed model training on Kubernetes:
| Feature | Implementation | Benefit |
|---|---|---|
| Automatic resource provisioning | KubeRay operator creates worker pods on demand | No manual cluster management |
| Data parallelism | Ray Train distributes data across workers with gradient synchronization | Linear scaling for large datasets |
| Hyperparameter tuning | Ray Tune with search algorithms (Bayesian, HyperBand, PBT) | Efficient exploration of hyperparameter space |
| Fault tolerance | Automatic checkpointing to S3-compatible storage with worker recovery | Training survives pod preemption |
| GPU scheduling | Kubernetes GPU device plugin with fractional GPU support | Efficient GPU utilization across tenants |
| Resource quotas | Per-tenant resource limits enforced by Kubernetes ResourceQuotas | Fair resource sharing |
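To make the hyperparameter-tuning row concrete, here is what Ray Tune automates, reduced to a plain sequential random search. This is a conceptual sketch only: the search space, objective, and trial count are invented for illustration, and Ray Tune adds what this sketch lacks (distributed execution, schedulers such as HyperBand and PBT, and checkpointing).

```python
import random

# Assumed search space for the example: tree depth and log-uniform learning rate.
search_space = {
    "max_depth": lambda: random.randint(4, 24),
    "learning_rate": lambda: 10 ** random.uniform(-3, -1),
}

def objective(cfg):
    # Stand-in for a real training run; returns a mock validation score
    # that peaks near max_depth=12 and learning_rate=0.05.
    return 1.0 - abs(cfg["max_depth"] - 12) / 24 - abs(cfg["learning_rate"] - 0.05)

random.seed(7)  # reproducible draws, mirroring the platform's seed capture
trials = [{k: draw() for k, draw in search_space.items()} for _ in range(20)]
best = max(trials, key=objective)
```

Bayesian search and HyperBand improve on this by spending the trial budget where early results look promising instead of sampling uniformly.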
Training Workflow
1. User configures training job in ML Workbench:
- Select experiment
- Choose framework (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Set resource requirements (CPUs, GPUs, memory)
- Configure hyperparameter search space
2. ml-service creates Ray cluster via KubeRay:
- Head node: orchestration, metric aggregation
- Worker nodes: distributed data loading and training
- GPU nodes: model training with CUDA
3. Training executes with automatic logging:
- Metrics stream to ml-service via callback hooks
- Checkpoints saved to artifact storage periodically
- Resource utilization tracked by Prometheus
4. Training completes:
- Best model artifact saved to registry
- Metrics and parameters logged to experiment
- Ray cluster automatically scaled down
- Notification sent to user

1.5 Model Serving
MATIH supports multiple model serving patterns:
| Pattern | Use Case | Implementation |
|---|---|---|
| Real-time inference | Low-latency predictions via REST API | Ray Serve or Triton Inference Server |
| Batch inference | Processing large datasets offline | Spark job triggered by pipeline-service |
| Streaming inference | Real-time predictions on Kafka event streams | Flink job with embedded model |
| A/B testing | Compare model versions with traffic splitting | Ray Serve traffic management |
| Shadow mode | New model runs alongside production without serving results | Dual-path execution with metric comparison |
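The shadow-mode row is the easiest pattern to miswire, so here is its core contract as a sketch: both models score every request, but only the production prediction is ever returned. The function and log shape are illustrative assumptions; the real dual-path execution lives in the serving layer.

```python
def serve_with_shadow(request, prod_model, shadow_model, log):
    """Shadow mode: score the request with both models, return only the
    production result, and record both predictions for offline comparison."""
    prod_pred = prod_model(request)
    shadow_pred = shadow_model(request)
    log.append({"request": request, "prod": prod_pred, "shadow": shadow_pred})
    return prod_pred  # the caller never sees the shadow output

log = []
result = serve_with_shadow({"tenure": 14}, lambda r: 0.2, lambda r: 0.35, log)
assert result == 0.2             # production answer served
assert log[0]["shadow"] == 0.35  # shadow answer captured for metric comparison
```

Because the shadow model never affects responses, it can be evaluated against live traffic with zero user-facing risk before any A/B split begins.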
A/B Testing
Model A/B testing is built into the serving infrastructure:
Incoming Request
|
Traffic Splitter (Ray Serve)
/ \
Model A (90%) Model B (10%)
| |
Prediction Prediction
| |
Metric Logger Metric Logger
| |
Response Response

Configuration:
- Traffic split percentages (configurable per model version)
- Metric collection for both versions (latency, accuracy, business metrics)
- Automatic promotion: Model B promoted to 100% if it meets performance criteria
- Automatic rollback: Model B traffic reduced to 0% if error rate exceeds threshold
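The split-plus-rollback behavior can be sketched as a small stateful router. This is an illustrative stand-in for Ray Serve's traffic management, with assumed threshold values (10% split, 5% error ceiling, 100-request minimum):

```python
import random

class TrafficSplitter:
    """Sketch of A/B routing with automatic rollback (illustrative only;
    the real splitter is Ray Serve traffic management)."""
    def __init__(self, b_fraction=0.10, max_error_rate=0.05, min_requests=100):
        self.b_fraction = b_fraction
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests  # don't judge B on tiny samples
        self.b_requests = 0
        self.b_errors = 0

    def route(self):
        return "B" if random.random() < self.b_fraction else "A"

    def record_b(self, error: bool):
        self.b_requests += 1
        self.b_errors += int(error)
        rate = self.b_errors / self.b_requests
        if self.b_requests >= self.min_requests and rate > self.max_error_rate:
            self.b_fraction = 0.0  # automatic rollback: all traffic back to A

splitter = TrafficSplitter()
for _ in range(200):
    splitter.record_b(error=True)  # simulate a consistently failing model B
assert splitter.b_fraction == 0.0  # rolled back
```

Automatic promotion is the mirror image: once B has enough traffic and meets the performance criteria, `b_fraction` is raised to 1.0 instead of dropped to 0.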
1.6 Drift Detection and Monitoring
The ML Platform continuously monitors deployed models for drift:
| Drift Type | Detection Method | Action |
|---|---|---|
| Data drift | Statistical tests (KS test, PSI) on input feature distributions | Alert data engineering team, trigger data quality review |
| Concept drift | Monitoring prediction distribution changes over time | Alert ML team, trigger retraining evaluation |
| Performance drift | Tracking business metrics (conversion rate, accuracy) against baseline | Alert model owner, automatic rollback if threshold exceeded |
| Feature drift | Monitoring individual feature statistics against training distribution | Highlight drifting features in ML Workbench |
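Of the detection methods above, the Population Stability Index (PSI) is simple enough to show end to end. The sketch below bins a baseline distribution, compares live-traffic bin frequencies against it, and sums the weighted log-ratios; the `PSI > 0.2` cutoff used in the assertion is a common rule of thumb, assumed here rather than a platform-mandated threshold.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training baseline ('expected')
    and live inputs ('actual'), using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(values, i):
        left, right = edges[i], edges[i + 1]
        n = sum(1 for v in values
                if left <= v < right or (i == bins - 1 and v == right))
        return max(n / len(values), 1e-6)  # floor to avoid log(0)

    return sum((frac(actual, i) - frac(expected, i))
               * math.log(frac(actual, i) / frac(expected, i))
               for i in range(bins))

baseline = [i / 100 for i in range(100)]        # uniform training distribution
shifted  = [0.5 + i / 200 for i in range(100)]  # live traffic shifted upward

assert psi(baseline, baseline) < 0.01  # identical distributions: no drift
assert psi(baseline, shifted) > 0.2    # shifted distribution: drift flagged
```

In production this computation runs per feature, per window, inside the Flink job described next, against baseline statistics frozen at training time.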
Drift detection runs as a Flink streaming job that processes inference logs from Kafka:
Inference Log (Kafka)
-> Flink Drift Detection Job
-> Compute feature statistics per window (1 hour, 1 day)
-> Compare against training baseline statistics
-> If drift exceeds threshold:
-> Publish drift alert event to Kafka
-> notification-service sends alert
-> ml-service annotates model in registry
-> Context Graph updates model health status

1.7 Feature Store Integration
MATIH integrates Feast as the feature store for shared feature definitions:
| Capability | Description |
|---|---|
| Feature definitions | Declare features once, reuse across experiments and models |
| Online/offline serving | Feast serves features for training (offline store) and inference (online store via Redis) |
| Point-in-time joins | Correct historical feature values for training, preventing data leakage |
| Feature lineage | Context Graph tracks which models use which features |
| Feature monitoring | Data quality scores applied to features, surfaced in ML Workbench |
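The point-in-time join row deserves a concrete example, since it is the feature-store capability that prevents data leakage. The sketch below shows the core idea that Feast's offline store computes at scale: for each training label, take the latest feature value recorded at or before the label's event time, never a later one. The data shapes are invented for illustration.

```python
from bisect import bisect_right

def point_in_time_join(feature_history, label_events):
    """For each (event_time, label), attach the most recent feature value
    observed at or before event_time. feature_history must be sorted
    ascending by timestamp; events with no prior value get None."""
    times = [t for t, _ in feature_history]
    rows = []
    for event_time, label in label_events:
        i = bisect_right(times, event_time) - 1  # last index with time <= event_time
        value = feature_history[i][1] if i >= 0 else None
        rows.append((event_time, value, label))
    return rows

history = [(1, 10.0), (5, 12.5), (9, 13.0)]   # (timestamp, feature value)
events  = [(4, "churned"), (9, "retained")]   # (event time, training label)

rows = point_in_time_join(history, events)
assert rows == [(4, 10.0, "churned"), (9, 13.0, "retained")]
```

Note that the event at time 4 gets the value from time 1, not the value from time 5: a naive latest-value join would silently train the model on information from the future.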
Deep Dive References
- ML Service Architecture -- Complete service documentation with code walkthroughs
- Experiment Tracking -- Detailed experiment and run management
- Model Lifecycle -- Registry, promotion, and deployment workflows
- Training Infrastructure -- Ray integration and distributed training patterns
- Inference and Serving -- Model serving, A/B testing, and canary deployment
- Monitoring and Drift -- Drift detection, alerting, and retraining triggers