MATIH Platform is in active MVP development. Documentation reflects current implementation status.
1. Introduction

Machine Learning Platform

Status: Production -- ml-service provides experiment tracking, model registry, Ray-based training, and model serving

The Machine Learning pillar of the MATIH Platform provides a complete ML lifecycle management system that integrates with the data engineering, governance, and conversational analytics pillars. Data scientists and ML engineers can track experiments, train models at scale, deploy to production with A/B testing, and monitor for drift -- all within a unified platform that maintains full lineage from training data through deployed predictions.


1.1 ML Lifecycle Overview

The MATIH ML Platform supports the complete machine learning lifecycle:

Data Preparation                Training                 Deployment              Monitoring
     |                            |                         |                       |
  Feature Store (Feast)    Experiment Tracking      Model Registry           Drift Detection
  Data Quality Checks      Hyperparameter Tuning    Staging/Production       Performance Metrics
  Dataset Versioning       Distributed Training     A/B Testing              Alerting
  Schema Validation        Auto-logging             Canary Deployment        Retraining Triggers
                           Resource Provisioning    Traffic Management

Key Services

Service       | Technology                 | Role
--------------|----------------------------|-----------------------------------------------------------
ml-service    | Python FastAPI (port 8000) | Experiment tracking, model registry, deployment orchestration
Ray (KubeRay) | Python                     | Distributed training and hyperparameter tuning
MLflow        | Python                     | Experiment tracking backend and artifact storage
Feast         | Python                     | Feature store for shared feature definitions
vLLM          | Python                     | High-performance LLM inference serving
Triton        | C++/Python                 | Multi-framework model serving (TensorFlow, PyTorch, ONNX)

1.2 Experiment Tracking

The ml-service provides comprehensive experiment tracking that captures every aspect of a training run:

Feature             | Description                                                                                    | Storage
--------------------|------------------------------------------------------------------------------------------------|-----------------------------
Experiment creation | Named groups of related training runs with shared objectives                                   | PostgreSQL
Parameter logging   | Hyperparameters, data versions, preprocessing settings, feature selections                     | PostgreSQL JSONB
Metric tracking     | Training loss, validation metrics, custom metrics with step-level granularity                  | PostgreSQL + time-series
Artifact storage    | Model weights, evaluation plots, feature importance charts, confusion matrices                 | S3-compatible (MinIO in dev)
Run comparison      | Side-by-side comparison of metrics across runs with difference highlighting                    | ML Workbench UI
Auto-logging        | Automatic capture of framework-specific metrics for scikit-learn, PyTorch, TensorFlow, XGBoost | Framework-specific hooks
Environment capture | Python version, package versions, GPU configuration, random seeds                              | Automatic at run start
Git integration     | Commit hash, branch, diff from HEAD captured for reproducibility                               | Automatic from workspace
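The run record implied by this table can be sketched in plain Python. This is an illustrative data model only -- the `Run` class, method names, and field names below are examples, not the actual ml-service schema or client API:

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """Illustrative experiment run record (not the real ml-service schema)."""
    run_id: str
    params: dict = field(default_factory=dict)    # hyperparameters -> PostgreSQL JSONB
    metrics: dict = field(default_factory=dict)   # metric name -> list of (step, value)
    artifacts: list = field(default_factory=list) # keys into S3-compatible storage

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step=0):
        # Step-level granularity: each metric keeps its full history.
        self.metrics.setdefault(key, []).append((step, value))

run = Run(run_id="run-3")
run.log_param("model", "xgboost")
run.log_param("n_estimators", 500)
run.log_metric("auc", 0.86, step=1)
run.log_metric("auc", 0.89, step=2)
```

In the real system, environment capture and Git integration populate additional fields automatically at run start; here they are omitted for brevity.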

Experiment Organization

Project: Customer Analytics
  |
  +-- Experiment: churn_prediction_v3
  |     |
  |     +-- Run 1: random_forest, max_depth=10, AUC=0.82
  |     +-- Run 2: random_forest, max_depth=20, AUC=0.85
  |     +-- Run 3: xgboost, n_estimators=500, AUC=0.89  <-- Best
  |     +-- Run 4: neural_net, hidden=[128,64], AUC=0.87
  |
  +-- Experiment: demand_forecast_v2
  |     |
  |     +-- Run 1: prophet, yearly_seasonality=True, MAPE=12.3%
  |     +-- Run 2: lstm, seq_len=30, MAPE=9.8%  <-- Best
  |
  +-- Experiment: pricing_optimization_v1
        |
        +-- Run 1: linear_regression, features=12, R2=0.74
        +-- Run 2: gradient_boost, features=25, R2=0.91  <-- Best
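Picking the best run within an experiment, as in the trees above, reduces to a max over the primary metric. A minimal sketch using the churn_prediction_v3 runs:

```python
# Runs from the churn_prediction_v3 experiment above, keyed by primary metric.
runs = [
    {"run": 1, "model": "random_forest", "auc": 0.82},
    {"run": 2, "model": "random_forest", "auc": 0.85},
    {"run": 3, "model": "xgboost",       "auc": 0.89},
    {"run": 4, "model": "neural_net",    "auc": 0.87},
]

best = max(runs, key=lambda r: r["auc"])
print(best["model"])  # -> xgboost
```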

1.3 Model Registry

The model registry provides version-controlled model management with promotion stages:

Stage       | Description                                                 | Governance
------------|-------------------------------------------------------------|--------------------------------------------------
Development | Model created from experiment run, not yet validated        | No restrictions
Staging     | Model promoted for testing; deployed to staging environment | Requires run metrics meeting baseline thresholds
Production  | Model serving live traffic                                  | Requires approval from model owner or team lead
Archived    | Previous production model, preserved for rollback           | Automatically archived when replaced
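The promotion rules in this table amount to a small state machine. The sketch below illustrates the idea; the function name, threshold value, and `approved` flag are examples, not the actual ml-service API:

```python
# Allowed promotion transitions, per the stage table above.
ALLOWED = {
    "Development": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
}

def promote(model, target, *, baseline_auc=0.80, approved=False):
    """Illustrative governance checks; threshold and approval flag are examples."""
    if target not in ALLOWED.get(model["stage"], set()):
        raise ValueError(f"cannot move {model['stage']} -> {target}")
    if target == "Staging" and model["metrics"]["auc"] < baseline_auc:
        raise ValueError("run metrics below baseline threshold")
    if target == "Production" and not approved:
        raise ValueError("requires approval from model owner or team lead")
    model["stage"] = target
    return model

m = {"name": "churn_xgb", "stage": "Development", "metrics": {"auc": 0.89}}
promote(m, "Staging")                    # passes the baseline-threshold check
promote(m, "Production", approved=True)  # passes the approval check
```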

Model Cards

Every registered model includes a model card with:

  • Description -- What the model does and its intended use case
  • Training data -- Dataset versions, date ranges, and quality scores (linked via Context Graph)
  • Performance metrics -- Primary and secondary metrics with confidence intervals
  • Fairness metrics -- Bias indicators across protected attributes (if applicable)
  • Limitations -- Known failure modes, data requirements, and drift sensitivity
  • Deployment history -- When and where the model was deployed, with rollback history
  • Lineage -- Full lineage from source tables through feature engineering to model artifact
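A model card is structured metadata stored alongside the model version. A hypothetical payload mirroring the fields above (keys and values are illustrative, not a documented schema):

```python
# Hypothetical model-card payload; keys mirror the bullet list above.
model_card = {
    "description": "Predicts 90-day customer churn probability",
    "training_data": {"dataset_version": "v3", "date_range": "2023-01..2023-12"},
    "performance": {"auc": 0.89, "auc_ci_95": [0.87, 0.91]},
    "fairness": {"demographic_parity_gap": 0.03},
    "limitations": ["requires 30 days of activity history"],
    "deployment_history": [{"env": "staging", "deployed_at": "2024-02-01"}],
    "lineage": ["source_table -> feature_engineering -> model_artifact"],
}
```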

1.4 Distributed Training

MATIH integrates Ray for distributed model training on Kubernetes:

Feature                         | Implementation                                                     | Benefit
--------------------------------|--------------------------------------------------------------------|--------------------------------------------
Automatic resource provisioning | KubeRay operator creates worker pods on demand                     | No manual cluster management
Data parallelism                | Ray Train distributes data across workers with gradient synchronization | Linear scaling for large datasets
Hyperparameter tuning           | Ray Tune with search algorithms (Bayesian, HyperBand, PBT)         | Efficient exploration of hyperparameter space
Fault tolerance                 | Automatic checkpointing to S3-compatible storage with worker recovery | Training survives pod preemption
GPU scheduling                  | Kubernetes GPU device plugin with fractional GPU support           | Efficient GPU utilization across tenants
Resource quotas                 | Per-tenant resource limits enforced by Kubernetes ResourceQuotas   | Fair resource sharing
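The search strategies Ray Tune runs at scale can be illustrated with a plain random search over the same kind of space. This stdlib sketch shows the idea only -- it is not Ray Tune's API, and the objective function stands in for a real training run's validation score:

```python
import random

random.seed(42)

# Toy objective standing in for a real run's validation score:
# best near lr=0.01, with a weak preference for max_depth near 8.
def objective(config):
    return 1.0 - abs(config["lr"] - 0.01) - 0.001 * abs(config["max_depth"] - 8)

# Search space: each entry is a sampler, mimicking a tuning library's
# log-uniform and integer-range primitives.
search_space = {
    "lr": lambda: 10 ** random.uniform(-4, -1),
    "max_depth": lambda: random.randint(2, 16),
}

def random_search(space, num_samples=50):
    best_cfg, best_score = None, float("-inf")
    for _ in range(num_samples):
        cfg = {name: sample() for name, sample in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search(search_space)
```

Ray Tune improves on this brute-force loop with Bayesian optimization, HyperBand early stopping, and population-based training, while scheduling trials across the cluster.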

Training Workflow

1. User configures training job in ML Workbench:
   - Select experiment
   - Choose framework (PyTorch, TensorFlow, scikit-learn, XGBoost)
   - Set resource requirements (CPUs, GPUs, memory)
   - Configure hyperparameter search space

2. ml-service creates Ray cluster via KubeRay:
   - Head node: orchestration, metric aggregation
   - Worker nodes: distributed data loading and training
   - GPU nodes: model training with CUDA

3. Training executes with automatic logging:
   - Metrics stream to ml-service via callback hooks
   - Checkpoints saved to artifact storage periodically
   - Resource utilization tracked by Prometheus

4. Training completes:
   - Best model artifact saved to registry
   - Metrics and parameters logged to experiment
   - Ray cluster automatically scaled down
   - Notification sent to user
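The periodic checkpointing in step 3 is what makes step 2's workers safe to preempt. A simplified save/resume loop (illustrative only: a real job restores model weights from S3-compatible storage via the training framework, not a step counter from a local file):

```python
import json
import os
import tempfile

# Illustrative checkpoint save/resume; a real job writes model state to
# S3-compatible storage, not a local JSON file.
def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}   # fresh start: no checkpoint yet

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")

def train(total_steps=10, checkpoint_every=3):
    state = load_checkpoint(ckpt)      # resume after pod preemption
    for step in range(state["step"], total_steps):
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(ckpt, state)
    return state

train(total_steps=4)             # simulate a run cut short after step 4
resumed = load_checkpoint(ckpt)  # last durable checkpoint is step 3
final = train(total_steps=10)    # second attempt resumes from step 3
```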

1.5 Model Serving

MATIH supports multiple model serving patterns:

Pattern             | Use Case                                                   | Implementation
--------------------|------------------------------------------------------------|-----------------------------------------------
Real-time inference | Low-latency predictions via REST API                       | Ray Serve or Triton Inference Server
Batch inference     | Processing large datasets offline                          | Spark job triggered by pipeline-service
Streaming inference | Real-time predictions on Kafka event streams               | Flink job with embedded model
A/B testing         | Compare model versions with traffic splitting              | Ray Serve traffic management
Shadow mode         | New model runs alongside production without serving results | Dual-path execution with metric comparison

A/B Testing

Model A/B testing is built into the serving infrastructure:

Incoming Request
      |
  Traffic Splitter (Ray Serve)
     /          \
  Model A (90%)   Model B (10%)
     |              |
  Prediction     Prediction
     |              |
  Metric Logger  Metric Logger
     |              |
  Response       Response

Configuration:

  • Traffic split percentages (configurable per model version)
  • Metric collection for both versions (latency, accuracy, business metrics)
  • Automatic promotion: Model B promoted to 100% if it meets performance criteria
  • Automatic rollback: Model B traffic reduced to 0% if error rate exceeds threshold
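The splitter-plus-rollback behavior above can be sketched in plain Python. Ray Serve provides weighted routing natively, so the class and method names here are for exposition only:

```python
import random

random.seed(7)

class TrafficSplitter:
    """Illustrative weighted router with automatic rollback (not Ray Serve's API)."""

    def __init__(self, weights, error_threshold=0.05):
        self.weights = dict(weights)              # e.g. {"model_a": 90, "model_b": 10}
        self.requests = {m: 0 for m in weights}
        self.errors = {m: 0 for m in weights}
        self.error_threshold = error_threshold

    def route(self):
        # Weighted random choice among versions still receiving traffic.
        live = [m for m, w in self.weights.items() if w > 0]
        model = random.choices(live, weights=[self.weights[m] for m in live])[0]
        self.requests[model] += 1
        return model

    def record_error(self, model):
        self.errors[model] += 1
        rate = self.errors[model] / max(self.requests[model], 1)
        if rate > self.error_threshold:
            self.weights[model] = 0               # automatic rollback to 0% traffic

splitter = TrafficSplitter({"model_a": 90, "model_b": 10})
counts = {"model_a": 0, "model_b": 0}
for _ in range(1000):
    counts[splitter.route()] += 1
```

Automatic promotion is the mirror image: if Model B's metrics meet the configured criteria over an evaluation window, its weight is raised to 100 and Model A's to 0.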

1.6 Drift Detection and Monitoring

The ML Platform continuously monitors deployed models for drift:

Drift Type        | Detection Method                                                  | Action
------------------|-------------------------------------------------------------------|------------------------------------------------------------
Data drift        | Statistical tests (KS test, PSI) on input feature distributions   | Alert data engineering team, trigger data quality review
Concept drift     | Monitoring prediction distribution changes over time              | Alert ML team, trigger retraining evaluation
Performance drift | Tracking business metrics (conversion rate, accuracy) against baseline | Alert model owner, automatic rollback if threshold exceeded
Feature drift     | Monitoring individual feature statistics against training distribution | Highlight drifting features in ML Workbench

Drift detection runs as a Flink streaming job that processes inference logs from Kafka:

Inference Log (Kafka)
  -> Flink Drift Detection Job
    -> Compute feature statistics per window (1 hour, 1 day)
    -> Compare against training baseline statistics
    -> If drift exceeds threshold:
      -> Publish drift alert event to Kafka
      -> notification-service sends alert
      -> ml-service annotates model in registry
      -> Context Graph updates model health status
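The Population Stability Index (PSI) check named in the table can be sketched in plain Python. This is a simplified stand-in for the Flink job's windowed computation: both distributions are given as precomputed histograms over the same bins, and the 0.2 threshold is a common rule of thumb rather than a platform default:

```python
import math

def psi(baseline_counts, live_counts, eps=1e-6):
    """Population Stability Index between two histograms over the same bins."""
    b_total, l_total = sum(baseline_counts), sum(live_counts)
    score = 0.0
    for b, l in zip(baseline_counts, live_counts):
        b_pct = max(b / b_total, eps)   # eps guards against empty bins
        l_pct = max(l / l_total, eps)
        score += (l_pct - b_pct) * math.log(l_pct / b_pct)
    return score

baseline = [100, 300, 400, 150, 50]   # training-time feature histogram
stable   = [95, 310, 390, 155, 50]    # live window: similar distribution
shifted  = [300, 350, 200, 100, 50]   # live window: drifted distribution

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 significant drift.
```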

1.7 Feature Store Integration

MATIH integrates Feast as the feature store for shared feature definitions:

Capability            | Description
----------------------|----------------------------------------------------------------------------------------
Feature definitions   | Declare features once, reuse across experiments and models
Online/offline serving | Feast serves features for training (offline store) and inference (online store via Redis)
Point-in-time joins   | Correct historical feature values for training, preventing data leakage
Feature lineage       | Context Graph tracks which models use which features
Feature monitoring    | Data quality scores applied to features, surfaced in ML Workbench
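Point-in-time correctness means each training example may only see feature values known at or before its event time. Feast performs this join automatically; the minimal sketch below just illustrates the rule, with made-up timestamps and values:

```python
from datetime import datetime

# Feature value history for one entity, sorted by validity timestamp:
# (valid_from, value).
feature_history = [
    (datetime(2024, 1, 1), 0.10),
    (datetime(2024, 2, 1), 0.25),
    (datetime(2024, 3, 1), 0.40),
]

def point_in_time_value(history, event_time):
    """Latest value at or before event_time; None if nothing was known yet."""
    value = None
    for ts, v in history:
        if ts <= event_time:
            value = v
        else:
            break   # history is sorted; later entries would leak future data
    return value

# A training example dated Feb 15 must see the Feb 1 value; joining the
# Mar 1 value instead would be data leakage.
v = point_in_time_value(feature_history, datetime(2024, 2, 15))
```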

Deep Dive References