Machine Learning Platform
The Machine Learning pillar of the MATIH Platform provides a complete ML lifecycle management system that integrates with the data engineering, governance, and conversational analytics pillars. Data scientists and ML engineers can track experiments, train models at scale, deploy to production with A/B testing, and monitor for drift -- all within a unified platform that maintains full lineage from training data through deployed predictions.
1.1 ML Lifecycle Overview
The MATIH ML Platform supports the complete machine learning lifecycle:
| Data Preparation | Training | Deployment | Monitoring |
|---|---|---|---|
| Feature Store (Feast) | Experiment Tracking | Model Registry | Drift Detection |
| Data Quality Checks | Hyperparameter Tuning | Staging/Production | Performance Metrics |
| Dataset Versioning | Distributed Training | A/B Testing | Alerting |
| Schema Validation | Auto-logging | Canary Deployment | Retraining Triggers |
| | Resource Provisioning | Traffic Management | |

Key Services
| Service | Technology | Role |
|---|---|---|
| ml-service | Python FastAPI (port 8000) | Experiment tracking, model registry, deployment orchestration |
| Ray (KubeRay) | Python | Distributed training and hyperparameter tuning |
| MLflow | Python | Experiment tracking backend and artifact storage |
| Feast | Python | Feature store for shared feature definitions |
| vLLM | Python | High-performance LLM inference serving |
| Triton | C++/Python | Multi-framework model serving (TensorFlow, PyTorch, ONNX) |
1.2 Experiment Tracking
The ml-service provides comprehensive experiment tracking that captures every aspect of a training run:
| Feature | Description | Storage |
|---|---|---|
| Experiment creation | Named groups of related training runs with shared objectives | PostgreSQL |
| Parameter logging | Hyperparameters, data versions, preprocessing settings, feature selections | PostgreSQL JSONB |
| Metric tracking | Training loss, validation metrics, custom metrics with step-level granularity | PostgreSQL + time-series |
| Artifact storage | Model weights, evaluation plots, feature importance charts, confusion matrices | S3-compatible (MinIO in dev) |
| Run comparison | Side-by-side comparison of metrics across runs with difference highlighting | ML Workbench UI |
| Auto-logging | Automatic capture of framework-specific metrics for scikit-learn, PyTorch, TensorFlow, XGBoost | Framework-specific hooks |
| Environment capture | Python version, package versions, GPU configuration, random seeds | Automatic at run start |
| Git integration | Commit hash, branch, diff from HEAD captured for reproducibility | Automatic from workspace |
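The experiment/run/parameter/metric hierarchy can be sketched in a few lines of plain Python. This is an illustrative in-memory model of the concepts above, not the ml-service API: the `Run` class, `compare_runs` helper, and field names are assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """One training run inside an experiment (illustrative, not the real ml-service schema)."""
    name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)  # metric name -> list of (step, value)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        # Step-level granularity: each metric keeps its full history.
        self.metrics.setdefault(key, []).append((step, value))

    def latest(self, key):
        return self.metrics[key][-1][1]

def compare_runs(runs, metric, higher_is_better=True):
    """Side-by-side comparison: sort runs best-first by the metric's latest value."""
    return sorted(runs, key=lambda r: r.latest(metric), reverse=higher_is_better)

# Two runs from the churn_prediction example below
a = Run("random_forest"); a.log_param("max_depth", 10); a.log_metric("auc", 0.82)
b = Run("xgboost"); b.log_param("n_estimators", 500); b.log_metric("auc", 0.89)
best = compare_runs([a, b], "auc")[0]
print(best.name)  # -> xgboost
```

In the real platform, parameters land in PostgreSQL JSONB and metric histories in the time-series store; the comparison view is what the ML Workbench renders side by side.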
Experiment Organization
Project: Customer Analytics
|
+-- Experiment: churn_prediction_v3
| |
| +-- Run 1: random_forest, max_depth=10, AUC=0.82
| +-- Run 2: random_forest, max_depth=20, AUC=0.85
| +-- Run 3: xgboost, n_estimators=500, AUC=0.89 <-- Best
| +-- Run 4: neural_net, hidden=[128,64], AUC=0.87
|
+-- Experiment: demand_forecast_v2
| |
| +-- Run 1: prophet, yearly_seasonality=True, MAPE=12.3%
| +-- Run 2: lstm, seq_len=30, MAPE=9.8% <-- Best
|
+-- Experiment: pricing_optimization_v1
|
+-- Run 1: linear_regression, features=12, R2=0.74
+-- Run 2: gradient_boost, features=25, R2=0.91 <-- Best

1.3 Model Registry
The model registry provides version-controlled model management with promotion stages:
| Stage | Description | Governance |
|---|---|---|
| Development | Model created from experiment run, not yet validated | No restrictions |
| Staging | Model promoted for testing; deployed to staging environment | Requires run metrics meeting baseline thresholds |
| Production | Model serving live traffic | Requires approval from model owner or team lead |
| Archived | Previous production model, preserved for rollback | Automatically archived when replaced |
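The Staging gate in the table above is mechanical: a version may only advance if its run metrics clear the baseline thresholds. A minimal sketch of such a gate, assuming a hypothetical per-model baseline policy (the threshold values and function name here are illustrative, not the registry's actual interface):

```python
# Assumed baseline policy for the example; real thresholds are set per model.
BASELINES = {"auc": 0.80}

def can_promote_to_staging(run_metrics: dict, baselines: dict = BASELINES) -> bool:
    """A model version may enter Staging only if every baselined metric
    meets or exceeds its threshold. Missing metrics fail the gate."""
    return all(run_metrics.get(m, float("-inf")) >= t for m, t in baselines.items())

assert can_promote_to_staging({"auc": 0.89})      # clears the 0.80 baseline
assert not can_promote_to_staging({"auc": 0.74})  # blocked below baseline
assert not can_promote_to_staging({})             # blocked: metric never logged
```

Promotion to Production adds a human step on top of this check (owner or team-lead approval), which is why it is a governance rule rather than code.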
Model Cards
Every registered model includes a model card with:
- Description -- What the model does and its intended use case
- Training data -- Dataset versions, date ranges, and quality scores (linked via Context Graph)
- Performance metrics -- Primary and secondary metrics with confidence intervals
- Fairness metrics -- Bias indicators across protected attributes (if applicable)
- Limitations -- Known failure modes, data requirements, and drift sensitivity
- Deployment history -- When and where the model was deployed, with rollback history
- Lineage -- Full lineage from source tables through feature engineering to model artifact
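As a rough illustration of the record behind a model card, the fields above map naturally onto a structured object. The dataclass and field names below are assumptions for the example, not the registry's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Illustrative model-card record; field names are assumed, not the real schema."""
    description: str
    training_data: list            # dataset versions and date ranges
    performance_metrics: dict      # metric -> (value, confidence interval)
    limitations: list = field(default_factory=list)
    deployment_history: list = field(default_factory=list)

card = ModelCard(
    description="Predicts 30-day churn probability for active customers",
    training_data=["customers_v12 (2023-01 .. 2024-06)"],
    performance_metrics={"auc": (0.89, (0.87, 0.91))},
    limitations=["Sensitive to drift in the tenure distribution"],
)
```

Lineage and fairness metrics are not stored inline like this; they are resolved through the Context Graph links described above.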
1.4 Distributed Training
MATIH integrates Ray for distributed model training on Kubernetes:
| Feature | Implementation | Benefit |
|---|---|---|
| Automatic resource provisioning | KubeRay operator creates worker pods on demand | No manual cluster management |
| Data parallelism | Ray Train distributes data across workers with gradient synchronization | Linear scaling for large datasets |
| Hyperparameter tuning | Ray Tune with search algorithms (Bayesian, HyperBand, PBT) | Efficient exploration of hyperparameter space |
| Fault tolerance | Automatic checkpointing to S3-compatible storage with worker recovery | Training survives pod preemption |
| GPU scheduling | Kubernetes GPU device plugin with fractional GPU support | Efficient GPU utilization across tenants |
| Resource quotas | Per-tenant resource limits enforced by Kubernetes ResourceQuotas | Fair resource sharing |
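To make the hyperparameter-tuning row concrete, here is what Ray Tune automates, reduced to a plain sequential random search. This is a conceptual sketch only: the search space, objective, and trial count are invented for illustration, and Ray Tune adds what this sketch lacks (distributed execution, schedulers such as HyperBand and PBT, and checkpointing).

```python
import random

# Assumed search space for the example: tree depth and log-uniform learning rate.
search_space = {
    "max_depth": lambda: random.randint(4, 24),
    "learning_rate": lambda: 10 ** random.uniform(-3, -1),
}

def objective(cfg):
    # Stand-in for a real training run; returns a mock validation score
    # that peaks near max_depth=12 and learning_rate=0.05.
    return 1.0 - abs(cfg["max_depth"] - 12) / 24 - abs(cfg["learning_rate"] - 0.05)

random.seed(7)  # reproducible draws, mirroring the platform's seed capture
trials = [{k: draw() for k, draw in search_space.items()} for _ in range(20)]
best = max(trials, key=objective)
```

Bayesian search and HyperBand improve on this by spending the trial budget where early results look promising instead of sampling uniformly.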
Training Workflow
1. User configures training job in ML Workbench:
- Select experiment
- Choose framework (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Set resource requirements (CPUs, GPUs, memory)
- Configure hyperparameter search space
2. ml-service creates Ray cluster via KubeRay:
- Head node: orchestration, metric aggregation
- Worker nodes: distributed data loading and training
- GPU nodes: model training with CUDA
3. Training executes with automatic logging:
- Metrics stream to ml-service via callback hooks
- Checkpoints saved to artifact storage periodically
- Resource utilization tracked by Prometheus
4. Training completes:
- Best model artifact saved to registry
- Metrics and parameters logged to experiment
- Ray cluster automatically scaled down
- Notification sent to user

1.5 Model Serving
MATIH supports multiple model serving patterns:
| Pattern | Use Case | Implementation |
|---|---|---|
| Real-time inference | Low-latency predictions via REST API | Ray Serve or Triton Inference Server |
| Batch inference | Processing large datasets offline | Spark job triggered by pipeline-service |
| Streaming inference | Real-time predictions on Kafka event streams | Flink job with embedded model |
| A/B testing | Compare model versions with traffic splitting | Ray Serve traffic management |
| Shadow mode | New model runs alongside production without serving results | Dual-path execution with metric comparison |
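The shadow-mode row is the easiest pattern to miswire, so here is its core contract as a sketch: both models score every request, but only the production prediction is ever returned. The function and log shape are illustrative assumptions; the real dual-path execution lives in the serving layer.

```python
def serve_with_shadow(request, prod_model, shadow_model, log):
    """Shadow mode: score the request with both models, return only the
    production result, and record both predictions for offline comparison."""
    prod_pred = prod_model(request)
    shadow_pred = shadow_model(request)
    log.append({"request": request, "prod": prod_pred, "shadow": shadow_pred})
    return prod_pred  # the caller never sees the shadow output

log = []
result = serve_with_shadow({"tenure": 14}, lambda r: 0.2, lambda r: 0.35, log)
assert result == 0.2             # production answer served
assert log[0]["shadow"] == 0.35  # shadow answer captured for metric comparison
```

Because the shadow model never affects responses, it can be evaluated against live traffic with zero user-facing risk before any A/B split begins.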
A/B Testing
Model A/B testing is built into the serving infrastructure:
Incoming Request
|
Traffic Splitter (Ray Serve)
/ \
Model A (90%) Model B (10%)
| |
Prediction Prediction
| |
Metric Logger Metric Logger
| |
Response Response

Configuration:
- Traffic split percentages (configurable per model version)
- Metric collection for both versions (latency, accuracy, business metrics)
- Automatic promotion: Model B promoted to 100% if it meets performance criteria
- Automatic rollback: Model B traffic reduced to 0% if error rate exceeds threshold
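The split-plus-rollback behavior can be sketched as a small stateful router. This is an illustrative stand-in for Ray Serve's traffic management, with assumed threshold values (10% split, 5% error ceiling, 100-request minimum):

```python
import random

class TrafficSplitter:
    """Sketch of A/B routing with automatic rollback (illustrative only;
    the real splitter is Ray Serve traffic management)."""
    def __init__(self, b_fraction=0.10, max_error_rate=0.05, min_requests=100):
        self.b_fraction = b_fraction
        self.max_error_rate = max_error_rate
        self.min_requests = min_requests  # don't judge B on tiny samples
        self.b_requests = 0
        self.b_errors = 0

    def route(self):
        return "B" if random.random() < self.b_fraction else "A"

    def record_b(self, error: bool):
        self.b_requests += 1
        self.b_errors += int(error)
        rate = self.b_errors / self.b_requests
        if self.b_requests >= self.min_requests and rate > self.max_error_rate:
            self.b_fraction = 0.0  # automatic rollback: all traffic back to A

splitter = TrafficSplitter()
for _ in range(200):
    splitter.record_b(error=True)  # simulate a consistently failing model B
assert splitter.b_fraction == 0.0  # rolled back
```

Automatic promotion is the mirror image: once B has enough traffic and meets the performance criteria, `b_fraction` is raised to 1.0 instead of dropped to 0.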
1.6 Drift Detection and Monitoring
The ML Platform continuously monitors deployed models for drift:
| Drift Type | Detection Method | Action |
|---|---|---|
| Data drift | Statistical tests (KS test, PSI) on input feature distributions | Alert data engineering team, trigger data quality review |
| Concept drift | Monitoring prediction distribution changes over time | Alert ML team, trigger retraining evaluation |
| Performance drift | Tracking business metrics (conversion rate, accuracy) against baseline | Alert model owner, automatic rollback if threshold exceeded |
| Feature drift | Monitoring individual feature statistics against training distribution | Highlight drifting features in ML Workbench |
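Of the detection methods above, the Population Stability Index (PSI) is simple enough to show end to end. The sketch below bins a baseline distribution, compares live-traffic bin frequencies against it, and sums the weighted log-ratios; the `PSI > 0.2` cutoff used in the assertion is a common rule of thumb, assumed here rather than a platform-mandated threshold.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training baseline ('expected')
    and live inputs ('actual'), using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]

    def frac(values, i):
        left, right = edges[i], edges[i + 1]
        n = sum(1 for v in values
                if left <= v < right or (i == bins - 1 and v == right))
        return max(n / len(values), 1e-6)  # floor to avoid log(0)

    return sum((frac(actual, i) - frac(expected, i))
               * math.log(frac(actual, i) / frac(expected, i))
               for i in range(bins))

baseline = [i / 100 for i in range(100)]        # uniform training distribution
shifted  = [0.5 + i / 200 for i in range(100)]  # live traffic shifted upward

assert psi(baseline, baseline) < 0.01  # identical distributions: no drift
assert psi(baseline, shifted) > 0.2    # shifted distribution: drift flagged
```

In production this computation runs per feature, per window, inside the Flink job described next, against baseline statistics frozen at training time.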
Drift detection runs as a Flink streaming job that processes inference logs from Kafka:
Inference Log (Kafka)
-> Flink Drift Detection Job
-> Compute feature statistics per window (1 hour, 1 day)
-> Compare against training baseline statistics
-> If drift exceeds threshold:
-> Publish drift alert event to Kafka
-> notification-service sends alert
-> ml-service annotates model in registry
-> Context Graph updates model health status

1.7 Feature Store Integration
MATIH integrates Feast as the feature store for shared feature definitions:
| Capability | Description |
|---|---|
| Feature definitions | Declare features once, reuse across experiments and models |
| Online/offline serving | Feast serves features for training (offline store) and inference (online store via Redis) |
| Point-in-time joins | Correct historical feature values for training, preventing data leakage |
| Feature lineage | Context Graph tracks which models use which features |
| Feature monitoring | Data quality scores applied to features, surfaced in ML Workbench |
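The point-in-time join row deserves a concrete example, since it is the feature-store capability that prevents data leakage. The sketch below shows the core idea that Feast's offline store computes at scale: for each training label, take the latest feature value recorded at or before the label's event time, never a later one. The data shapes are invented for illustration.

```python
from bisect import bisect_right

def point_in_time_join(feature_history, label_events):
    """For each (event_time, label), attach the most recent feature value
    observed at or before event_time. feature_history must be sorted
    ascending by timestamp; events with no prior value get None."""
    times = [t for t, _ in feature_history]
    rows = []
    for event_time, label in label_events:
        i = bisect_right(times, event_time) - 1  # last index with time <= event_time
        value = feature_history[i][1] if i >= 0 else None
        rows.append((event_time, value, label))
    return rows

history = [(1, 10.0), (5, 12.5), (9, 13.0)]   # (timestamp, feature value)
events  = [(4, "churned"), (9, "retained")]   # (event time, training label)

rows = point_in_time_join(history, events)
assert rows == [(4, 10.0, "churned"), (9, 13.0, "retained")]
```

Note that the event at time 4 gets the value from time 1, not the value from time 5: a naive latest-value join would silently train the model on information from the future.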
Deep Dive References
- ML Service Architecture -- Complete service documentation with code walkthroughs
- Experiment Tracking -- Detailed experiment and run management
- Model Lifecycle -- Registry, promotion, and deployment workflows
- Training Infrastructure -- Ray integration and distributed training patterns
- Inference and Serving -- Model serving, A/B testing, and canary deployment
- Monitoring and Drift -- Drift detection, alerting, and retraining triggers