MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Data Scientist Journey: Predictive Maintenance for CNC Machines

Persona: Lin Wei, Senior Data Scientist at Apex Manufacturing
Goal: Build a predictive maintenance model to reduce unplanned downtime across 200 CNC machines
Primary Workbenches: ML Workbench, Data Workbench
Timeline: 6-week project from data exploration to production deployment


Business Context

Apex Manufacturing operates 200 CNC machines across 4 plants, running 2 shifts per day. Unplanned downtime currently costs the company $2.1M annually; each hour of an idle production line translates to $15K-$50K in lost output, depending on the machine type and order backlog. The current maintenance approach is calendar-based: every machine is serviced every 500 operating hours regardless of condition. This leads to two problems: over-maintenance on healthy machines (wasted labor and parts) and under-maintenance on degrading machines (unexpected failures).

Lin Wei's objective is to replace this fixed schedule with condition-based predictions: score each machine's health in real time, predict remaining useful life (RUL), and generate maintenance work orders before failures occur.


Stage 1: Ingestion

Connecting IoT Sensor Streams

Lin Wei starts in the Data Workbench by examining what data sources are already connected. The IoT gateway pushes sensor data from all 200 machines through MQTT into Kafka, which the Ingestion Service consumes continuously.

Ingestion configuration for the sensor stream:

{
  "source": {
    "type": "kafka",
    "config": {
      "bootstrap_servers": "kafka.apex-iot.internal:9092",
      "topic_pattern": "sensors.cnc.*",
      "consumer_group": "matih-ingestion-sensors",
      "format": "avro",
      "schema_registry_url": "http://schema-registry:8081"
    }
  },
  "destination": {
    "type": "s3_parquet",
    "config": {
      "bucket": "apex-data-lake",
      "prefix": "raw/sensors/",
      "partition_by": ["plant_id", "date"],
      "compression": "snappy"
    }
  },
  "transform": {
    "aggregation_window": "5m",
    "aggregations": ["mean", "stddev", "min", "max", "count"],
    "group_by": ["machine_id", "sensor_type"]
  }
}
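To make the "transform" section concrete, here is a minimal Python sketch of what a 5-minute aggregation grouped by machine_id and sensor_type could compute. It is an illustration only, not the Ingestion Service's implementation: the function name aggregate_5min, the record keys (ts, value), and the choice of population standard deviation are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev

WINDOW_SECONDS = 5 * 60  # mirrors "aggregation_window": "5m"


def aggregate_5min(readings):
    """Bucket raw readings into 5-minute windows per (machine_id, sensor_type).

    `readings` is an iterable of dicts with hypothetical keys:
    machine_id, sensor_type, ts (timezone-aware datetime), value (float).
    """
    buckets = defaultdict(list)
    for r in readings:
        epoch = int(r["ts"].timestamp())
        window_start = epoch - (epoch % WINDOW_SECONDS)  # floor to the window
        buckets[(r["machine_id"], r["sensor_type"], window_start)].append(r["value"])

    rows = []
    for (machine_id, sensor_type, window_start), values in sorted(buckets.items()):
        rows.append({
            "machine_id": machine_id,
            "sensor_type": sensor_type,
            "window_start": datetime.fromtimestamp(window_start, tz=timezone.utc),
            "mean": mean(values),
            "stddev": pstdev(values),  # assumption: population stddev
            "min": min(values),
            "max": max(values),
            "count": len(values),
        })
    return rows
```

Each output row carries the five aggregations named in the config, so a downstream consumer can tell a quiet window (low count) from a stable one (low stddev).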

Connecting SAP for Maintenance History

The SAP ERP PostgreSQL database contains equipment registry and maintenance logs. Lin Wei configures an Airbyte CDC connector to capture ongoing maintenance events:

| Source Table | Sync Mode | Frequency | Key Fields |
|---|---|---|---|
| sap.equipment_master | Full refresh | Daily | machine_id, machine_type, install_date, plant_id |
| sap.maintenance_orders | CDC (incremental) | Every 15 min | order_id, machine_id, maintenance_type, start_time, end_time |
| sap.parts_consumed | CDC (incremental) | Every 15 min | order_id, part_number, quantity, cost |
| sap.production_orders | CDC (incremental) | Every 15 min | order_id, machine_id, product_id, planned_qty, actual_qty |

Importing Equipment Specifications

Machine specification sheets from manufacturers arrive as CSV files. Lin Wei uses the File Import feature in Data Workbench to upload them:

File: cnc_machine_specs.csv (5,247 rows)
Columns: machine_id, manufacturer, model, spindle_max_rpm, max_feed_rate,
         axis_count, coolant_type, rated_power_kw, weight_kg, install_date

After ingestion, Lin Wei verifies the data is flowing by checking row counts in the catalog:

| Dataset | Expected Volume | Actual Volume | Status |
|---|---|---|---|
| sensor_readings (5-min agg) | ~57,600 rows/day (200 machines x 12 sensors x 24 readings) | 56,832 rows/day | 98.7% completeness |
| maintenance_logs | ~15 new records/day | 14 records/day | Normal |
| equipment_registry | 5,247 machines | 5,247 machines | Complete |

Stage 2: Discovery

Exploring the Sensor Data Catalog

In the Data Workbench catalog, Lin Wei browses the sensor data assets. The Catalog Service has automatically profiled the ingested data:

Catalog: apex_manufacturing
  └── sensors/
      ├── sensor_readings_5min     (100M rows, 47 sensor types)
      │   ├── vibration_x          (g-force, range: 0.01 - 8.4)
      │   ├── vibration_y          (g-force, range: 0.01 - 7.9)
      │   ├── vibration_z          (g-force, range: 0.01 - 9.1)
      │   ├── spindle_temperature  (celsius, range: 22 - 94)
      │   ├── coolant_temperature  (celsius, range: 15 - 38)
      │   ├── coolant_flow_rate    (L/min, range: 0 - 45)
      │   ├── spindle_speed        (rpm, range: 0 - 12000)
      │   ├── feed_rate            (mm/min, range: 0 - 8000)
      │   ├── spindle_load         (%, range: 0 - 100)
      │   ├── power_consumption    (kW, range: 0 - 75)
      │   └── ... (37 more sensor types)
      └── sensor_readings_raw      (archived, not for direct query)

Data Quality Profiling

The Data Quality Service flags several issues that Lin Wei investigates:

| Issue | Affected Records | Root Cause | Resolution |
|---|---|---|---|
| Vibration readings = 0.00 for 3 machines | 12,400 records | Miscalibrated accelerometers (machines M-0047, M-0112, M-0198) | Exclude from training; notify maintenance to recalibrate |
| Temperature spikes > 150 C | 847 records | Sensor noise during machine startup | Apply startup exclusion window (first 10 min after power-on) |
| Missing coolant_flow_rate on 12 machines | 8.2% of readings | Older machines lack flow sensors | Use coolant_temperature as proxy feature |
| Duplicate timestamps | 2,100 records | IoT gateway retry on network timeout | Deduplicate by (machine_id, sensor_type, timestamp) |
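Two of these resolutions, deduplication and the startup exclusion window, are simple enough to sketch in Python. This is a hedged illustration rather than the Data Quality Service's logic: clean_readings and the power_on_times mapping are hypothetical, and the source does not say where power-on events are recorded.

```python
from datetime import datetime, timedelta

STARTUP_EXCLUSION = timedelta(minutes=10)  # "first 10 min after power-on"


def clean_readings(readings, power_on_times):
    """Drop duplicate readings and readings taken during machine startup.

    readings: list of dicts with machine_id, sensor_type, timestamp, value.
    power_on_times: hypothetical mapping of machine_id -> list of power-on datetimes.
    """
    seen = set()
    cleaned = []
    for r in readings:
        key = (r["machine_id"], r["sensor_type"], r["timestamp"])
        if key in seen:
            continue  # a gateway retry produced a duplicate; keep the first copy
        seen.add(key)
        in_startup = any(
            start <= r["timestamp"] < start + STARTUP_EXCLUSION
            for start in power_on_times.get(r["machine_id"], [])
        )
        if not in_startup:
            cleaned.append(r)
    return cleaned
```

Keeping the first copy of a duplicate is a design choice made here for simplicity; if retried messages can carry corrected values, keeping the last copy may be the better rule.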

Lineage Tracing

Lin Wei traces the data lineage from raw sensor signal to aggregated feature:

  MQTT Broker          Kafka Topic         5-min Aggregation        S3 Parquet
  ┌──────────┐        ┌──────────┐        ┌──────────────────┐     ┌──────────┐
  │ Raw      │───────▶│ sensors. │───────▶│ Pipeline:        │────▶│ sensors/ │
  │ signal   │        │ cnc.     │        │ sensor_agg_5min  │     │ 5min_agg │
  │ (100 Hz) │        │ plant_1  │        │                  │     │ .parquet │
  │          │        │          │        │ mean, stddev,    │     │          │
  │          │        │          │        │ min, max, count  │     │          │
  └──────────┘        └──────────┘        └──────────────────┘     └──────────┘

Stage 3: Query

Feature Engineering with SQL

Lin Wei moves to the Query Engine to build features for the predictive model. The key insight: machine degradation shows up as gradual changes in sensor distributions over time. Rolling statistics capture these trends.

Rolling window features for vibration analysis:

-- Rolling statistics over multiple time horizons
-- Uses DuckDB for efficient window functions on S3 parquet
SELECT
    s.machine_id,
    s.timestamp,
    -- 1-hour rolling statistics
    AVG(vibration_rms) OVER w1h AS vibration_mean_1h,
    STDDEV(vibration_rms) OVER w1h AS vibration_std_1h,
    MAX(vibration_rms) OVER w1h AS vibration_max_1h,
    -- 6-hour rolling statistics
    AVG(vibration_rms) OVER w6h AS vibration_mean_6h,
    STDDEV(vibration_rms) OVER w6h AS vibration_std_6h,
    -- 24-hour rolling statistics
    AVG(vibration_rms) OVER w24h AS vibration_mean_24h,
    STDDEV(vibration_rms) OVER w24h AS vibration_std_24h,
    -- Rate of change (trend detection): short-term mean relative to the 24h baseline
    (AVG(vibration_rms) OVER w1h - AVG(vibration_rms) OVER w24h)
        / NULLIF(AVG(vibration_rms) OVER w24h, 0) AS vibration_trend_ratio,
    -- Temperature range over the last hour (max minus min, in degrees Celsius)
    (MAX(spindle_temperature) OVER w1h - MIN(spindle_temperature) OVER w1h)
        AS temp_range_1h,
    -- Operating context
    AVG(spindle_speed) OVER w1h AS avg_spindle_speed_1h,
    AVG(spindle_load) OVER w1h AS avg_spindle_load_1h,
    -- Hours since the most recent maintenance completed before this reading
    EXTRACT(EPOCH FROM (s.timestamp - m.last_maintenance_time)) / 3600.0
        AS hours_since_maintenance
FROM sensor_readings_5min s
LEFT JOIN LATERAL (
    -- Only count maintenance finished at or before the reading, so the
    -- feature never uses information from the future
    SELECT MAX(ml.end_time) AS last_maintenance_time
    FROM maintenance_logs ml
    WHERE ml.machine_id = s.machine_id
      AND ml.maintenance_type IN ('PREVENTIVE', 'CORRECTIVE')
      AND ml.end_time <= s.timestamp
) m ON true
WINDOW
    w1h  AS (PARTITION BY s.machine_id ORDER BY s.timestamp
             RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW),
    w6h  AS (PARTITION BY s.machine_id ORDER BY s.timestamp
             RANGE BETWEEN INTERVAL '6' HOUR PRECEDING AND CURRENT ROW),
    w24h AS (PARTITION BY s.machine_id ORDER BY s.timestamp
             RANGE BETWEEN INTERVAL '24' HOUR PRECEDING AND CURRENT ROW)
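For readers who want to prototype the same idea outside SQL, the following Python sketch computes a time-based rolling mean and the trend ratio for one machine. It assumes readings arrive in timestamp order with unique timestamps; RollingMean and vibration_trend_ratio are illustrative names, not platform APIs.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingMean:
    """Trailing time-window mean, analogous to
    RANGE BETWEEN INTERVAL ... PRECEDING AND CURRENT ROW."""

    def __init__(self, window):
        self.window = window
        self.items = deque()  # (timestamp, value) in arrival order
        self.total = 0.0

    def add(self, ts, value):
        self.items.append((ts, value))
        self.total += value
        # Evict readings that fall before the start of the window.
        while self.items[0][0] < ts - self.window:
            _, old = self.items.popleft()
            self.total -= old
        return self.total / len(self.items)


def vibration_trend_ratio(series):
    """Yield (ts, mean_1h, mean_24h, trend_ratio) for time-ordered (ts, value) pairs."""
    short_w = RollingMean(timedelta(hours=1))
    long_w = RollingMean(timedelta(hours=24))
    for ts, value in series:
        m1, m24 = short_w.add(ts, value), long_w.add(ts, value)
        ratio = (m1 - m24) / m24 if m24 != 0 else None  # same role as NULLIF(..., 0)
        yield ts, m1, m24, ratio
```

A positive ratio means the last hour is running above the 24-hour baseline, which is the upward drift the feature is meant to expose.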

Failure label construction -- joining sensor data with maintenance events:

-- Create target variable: time-to-failure for each sensor reading
-- A reading is labeled with the hours remaining until the next unplanned failure
WITH failure_events AS (
    SELECT
        machine_id,
        start_time AS failure_time,
        description AS failure_mode,
        CASE
            WHEN description ILIKE '%bearing%' THEN 'BEARING_WEAR'
            WHEN description ILIKE '%spindle%alignment%' THEN 'SPINDLE_ALIGNMENT'
            WHEN description ILIKE '%coolant%' THEN 'COOLANT_SYSTEM'
            WHEN description ILIKE '%electrical%' OR description ILIKE '%drive%'
                THEN 'ELECTRICAL'
            ELSE 'OTHER'
        END AS failure_category
    FROM maintenance_logs
    WHERE maintenance_type = 'CORRECTIVE'
      AND unplanned = true
)
SELECT
    s.*,
    f.failure_time,
    f.failure_category,
    EXTRACT(EPOCH FROM (f.failure_time - s.timestamp)) / 3600.0
        AS hours_to_failure
FROM sensor_features s
LEFT JOIN LATERAL (
    SELECT failure_time, failure_category
    FROM failure_events fe
    WHERE fe.machine_id = s.machine_id
      AND fe.failure_time > s.timestamp
    ORDER BY fe.failure_time ASC
    LIMIT 1
) f ON true
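The lateral join picks, for each reading, the first failure strictly after the reading's timestamp. A minimal Python sketch of that lookup for a single machine, assuming its failure times are already sorted, uses binary search. The function name is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def hours_to_next_failure(reading_ts, failure_times):
    """Return hours until the first failure strictly after reading_ts, or None.

    failure_times must be sorted ascending and belong to one machine.
    None mirrors the NULL the LEFT JOIN yields when no later failure exists,
    i.e. a right-censored observation.
    """
    idx = bisect_right(failure_times, reading_ts)  # first element > reading_ts
    if idx == len(failure_times):
        return None
    return (failure_times[idx] - reading_ts).total_seconds() / 3600.0
```

Readings that return None carry no observed failure. A survival model treats them as censored rather than discarding them, which is why the query uses a LEFT JOIN instead of an inner join.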

Stage 4: Orchestration

Streaming + Batch Pipeline Design

Lin Wei designs a two-tier pipeline architecture using the Pipeline Service (Temporal workflows):

                    Apex Predictive Maintenance Pipeline

  STREAMING (continuous)                 BATCH (nightly at 2 AM)
  ┌─────────────────────────┐           ┌─────────────────────────┐
  │ Kafka sensor topic      │           │ S3 sensor parquet       │
  │         │               │           │         │               │
  │         ▼               │           │         ▼               │
  │ 5-min aggregation       │           │ Feature computation     │
  │         │               │           │ (rolling 1h/6h/24h)     │
  │         ▼               │           │         │               │
  │ Anomaly detection       │           │         ▼               │
  │ (Z-score > 3.5)         │           │ Model retraining check  │
  │         │               │           │ (drift detected?)       │
  │         ▼               │           │         │               │
  │ Real-time health score  │           │    ┌────┴─────┐         │
  │ (Ray Serve inference)   │           │    │ Yes      │ No      │
  │         │               │           │    ▼          ▼         │
  │         ▼               │           │ Retrain    Log metrics  │
  │ Alert if score < 0.3    │           │ model      and exit     │
  └─────────────────────────┘           └─────────────────────────┘
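The streaming tier's anomaly step is described only as "Z-score > 3.5". One plausible reading, sketched below in Python, compares each new 5-minute aggregate against a recent baseline for the same machine and sensor. The use of the absolute z-score, the baseline window, and the zero-variance behavior are all assumptions; the source does not specify them.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 3.5  # from the pipeline diagram


def is_anomalous(value, baseline, threshold=Z_THRESHOLD):
    """Flag `value` when its absolute z-score against `baseline` exceeds threshold.

    baseline: recent values for the same machine and sensor (assumed window).
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return False  # assumption: no z-score can be computed on a flat baseline
    return abs(value - mu) / sigma > threshold
```

For example, against a baseline of vibration means near 0.40 g with a spread of about 0.014 g, a reading of 0.90 g is flagged while 0.41 g is not.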

Temporal workflow definition for the batch pipeline:

{
  "workflow": "predictive_maintenance_batch",
  "schedule": "0 2 * * *",
  "activities": [
    {
      "name": "compute_features",
      "type": "sql_transform",
      "config": {
        "query_file": "features/rolling_sensor_features.sql",
        "output_table": "ml.sensor_features_daily",
        "partition_by": "date"
      },
      "timeout": "45m"
    },
    {
      "name": "validate_features",
      "type": "data_quality",
      "config": {
        "suite": "sensor_feature_quality",
        "expectations": [
          {"type": "expect_column_values_to_not_be_null", "column": "vibration_mean_1h"},
          {"type": "expect_column_values_to_be_between", "column": "vibration_std_1h", "min": 0, "max": 10},
          {"type": "expect_table_row_count_to_be_between", "min": 50000}
        ],
        "fail_pipeline_on_violation": true
      }
    },
    {
      "name": "check_model_drift",
      "type": "model_monitoring",
      "config": {
        "model_name": "cnc_predictive_maintenance",
        "metrics": ["psi", "csi"],
        "threshold": 0.15,
        "reference_window": "30d"
      }
    },
    {
      "name": "retrain_if_drifted",
      "type": "conditional",
      "condition": "check_model_drift.drift_detected == true",
      "activity": {
        "name": "retrain_model",
        "type": "ml_training",
        "config": {
          "experiment": "cnc_predictive_maintenance",
          "training_data": "ml.sensor_features_daily",
          "model_type": "random_survival_forest",
          "hyperparameters": {
            "n_estimators": 500,
            "max_depth": 12,
            "min_samples_leaf": 20
          }
        }
      }
    }
  ]
}
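The check_model_drift activity compares PSI against a 0.15 threshold. As a reference point, the standard Population Stability Index can be sketched in a few lines of Python. How the platform bins features is not stated in the source, so the bin_edges argument and the eps floor for empty bins are assumptions.

```python
import math


def population_stability_index(reference, current, bin_edges, eps=1e-6):
    """PSI between two samples of one feature, using shared bin edges.

    PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share).
    """

    def proportions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            idx = sum(1 for edge in bin_edges if v >= edge)  # bin index
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]  # floor avoids log(0)

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


DRIFT_THRESHOLD = 0.15  # matches "threshold" in the workflow definition
```

A call such as population_stability_index(last_30_days, today, bin_edges) > DRIFT_THRESHOLD would then play the role of drift_detected for one feature.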

Data Quality Gates

The pipeline includes sensor-specific quality gates that run before any model training:

| Gate | Rule | Action on Failure |
|---|---|---|
| Completeness | > 95% of expected sensor readings received | Pause pipeline, alert data engineering |
| Freshness | Most recent reading < 15 minutes old | Retry after 5 minutes, then alert |
| Value range | All sensors within physical limits | Quarantine out-of-range readings |
| Consistency | No machine has > 5% duplicate timestamps | Deduplicate and log warning |
| Sufficient history | Each machine has > 30 days of data for training | Exclude new machines from training set |
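The completeness and freshness gates reduce to two small checks. The Python sketch below shows one way to express them; the function names and return shapes are hypothetical, and the actions on failure (pause, retry, alert) are left to the caller.

```python
from datetime import datetime, timedelta, timezone


def completeness_gate(received_rows, expected_rows, minimum=0.95):
    """Pass when more than `minimum` of the expected readings arrived."""
    ratio = received_rows / expected_rows
    return ratio > minimum, ratio


def freshness_gate(latest_reading, now, max_age=timedelta(minutes=15)):
    """Pass when the most recent reading is younger than `max_age`."""
    age = now - latest_reading
    return age < max_age, age
```

With the volumes reported during ingestion, completeness_gate(56_832, 57_600) passes with a ratio of about 0.987.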

Stage 5: Analysis

Failure Mode Profiling

Lin Wei profiles the failure data to understand what the model needs to predict:

  Failure Mode Distribution (Apex Manufacturing, 2 years)
  ════════════════════════════════════════════════════════

  BEARING_WEAR        ███████████████████████████████████████████  43%
  SPINDLE_ALIGNMENT   ████████████████████████████               28%
  COOLANT_SYSTEM      ██████████████████                         18%
  ELECTRICAL          ███████████                                11%

  Total failures: 847 events across 200 machines
  Failure rate: 2.3% of operating 5-min windows
  Mean time to failure after first anomaly: 72 hours (bearing), 48h (spindle)

Class imbalance analysis:

| Class | Count | Percentage | Strategy |
|---|---|---|---|
| Normal operation | 97,200,000 | 97.7% | Subsample to 10% for training |
| Pre-failure (< 72h to failure) | 1,890,000 | 1.9% | Use all samples |
| Active failure | 410,000 | 0.4% | Use all samples, apply SMOTE on minority subclasses |
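The subsampling strategy can be sketched as follows. This is a simplified illustration: the "label" key and its values are hypothetical, a fixed seed is assumed for reproducibility, and the SMOTE step is deliberately left out.

```python
import random


def build_training_set(rows, normal_fraction=0.10, seed=42):
    """Keep every pre-failure and failure row; keep a random 10% of normal rows.

    Each row is a dict with a hypothetical "label" key whose value is
    "normal", "pre_failure", or "active_failure".
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    normal = [r for r in rows if r["label"] == "normal"]
    minority = [r for r in rows if r["label"] != "normal"]
    keep = round(len(normal) * normal_fraction)
    return rng.sample(normal, keep) + minority
```

Applied to the counts in the table, this leaves about 9.72M normal rows against 2.3M pre-failure and failure rows, which is roughly an 81% to 19% split instead of 97.7% to 2.3%.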

Sensor-failure correlation analysis:

-- Top sensor features correlated with impending bearing failure
-- Run in ML Workbench notebook
SELECT
    sensor_feature,
    CORR(feature_value, CASE WHEN hours_to_failure < 72 THEN 1 ELSE 0 END)
        AS correlation_with_failure,
    AVG(CASE WHEN hours_to_failure < 72 THEN feature_value END)
        AS avg_pre_failure,
    AVG(CASE WHEN hours_to_failure > 168 THEN feature_value END)
        AS avg_normal
FROM sensor_features_pivoted
WHERE failure_category = 'BEARING_WEAR'
GROUP BY sensor_feature
ORDER BY ABS(correlation_with_failure) DESC
LIMIT 10

| Sensor Feature | Correlation | Pre-Failure Avg | Normal Avg | Signal |
|---|---|---|---|---|
| vibration_std_24h | 0.72 | 1.84 g | 0.42 g | Variability increases 4.4x |
| vibration_trend_ratio | 0.68 | +0.31 | -0.02 | Upward trend before failure |
| spindle_load_std_6h | 0.54 | 8.7% | 3.1% | Load becomes erratic |
| temp_range_1h | 0.51 | 12.4 C | 4.2 C | Temperature oscillation |
| power_consumption_mean_6h | 0.43 | 48.2 kW | 38.7 kWh | Draws more power |
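The correlation column is a Pearson correlation between a feature and a binary "fails within 72 hours" flag. A small Python sketch of the same calculation, with hypothetical function names, makes the definition explicit:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def failure_correlation(feature_values, hours_to_failure, horizon=72):
    """Correlate a feature with the binary 'fails within horizon hours' flag.

    A missing hours_to_failure (None) counts as 0, matching the ELSE branch
    of the SQL CASE expression.
    """
    labels = [1.0 if h is not None and h < horizon else 0.0 for h in hours_to_failure]
    return pearson(feature_values, labels)
```

Because the label is binary, this is the point-biserial correlation. It measures linear separation only, so a feature with a weak coefficient can still be useful to a tree-based model.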

Stage 6: Productionization

Model Training in ML Workbench

Lin Wei trains a Random Survival Forest model, which suits time-to-event prediction where the "event" is machine failure.

Experiment configuration:

{
  "experiment_name": "cnc_predictive_maintenance",
  "model_type": "random_survival_forest",
  "training_config": {
    "features": [
      "vibration_mean_1h", "vibration_std_1h", "vibration_max_1h",
      "vibration_mean_6h", "vibration_std_6h",
      "vibration_mean_24h", "vibration_std_24h", "vibration_trend_ratio",
      "spindle_temperature_mean_1h", "temp_range_1h",
      "coolant_flow_mean_1h", "spindle_load_std_6h",
      "power_consumption_mean_6h", "hours_since_maintenance",
      "machine_type", "machine_age_years"
    ],
    "target": "hours_to_failure",
    "event_column": "failure_occurred",
    "stratify_by": "failure_category",
    "train_test_split": 0.8,
    "cross_validation_folds": 5
  },
  "hyperparameters": {
    "n_estimators": 500,
    "max_depth": 12,
    "min_samples_leaf": 20,
    "max_features": "sqrt",
    "n_jobs": -1
  },
  "compute": {
    "type": "ray_train",
    "num_workers": 4,
    "resources_per_worker": {"cpu": 4, "memory_gb": 16}
  }
}

Model evaluation results:

| Metric | Value | Threshold | Status |
|---|---|---|---|
| Concordance Index (C-index) | 0.84 | > 0.80 | Pass |
| Time-dependent AUC (24h horizon) | 0.91 | > 0.85 | Pass |
| Time-dependent AUC (72h horizon) | 0.87 | > 0.80 | Pass |
| Brier Score (24h) | 0.08 | < 0.15 | Pass |
| Calibration slope | 1.03 | 0.9 - 1.1 | Pass |
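The headline metric, the concordance index, measures how often the model ranks two machines in the right order of failure. A compact Python sketch of Harrell's C is shown below as a reference definition. It is a quadratic-time illustration, not the ML Workbench's evaluation code, and it ignores pairs with tied times.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times: observed hours to failure or to censoring.
    events: True if a failure was observed, False if censored.
    risk_scores: higher means the model expects an earlier failure.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is a failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.84 means about 84% of comparable machine pairs are ordered correctly.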

Deploying to Ray Serve

The trained model is registered in the Model Registry and deployed for real-time inference:

{
  "deployment": {
    "model_name": "cnc_predictive_maintenance",
    "model_version": "v3",
    "serving_framework": "ray_serve",
    "endpoint": "/api/v1/ml/predict/machine-health",
    "config": {
      "num_replicas": 2,
      "max_concurrent_queries": 100,
      "ray_actor_options": {
        "num_cpus": 2,
        "memory": 4294967296
      }
    },
    "input_schema": {
      "machine_id": "string",
      "sensor_features": "object (16 feature values)"
    },
    "output_schema": {
      "machine_id": "string",
      "health_score": "float (0-1, lower = higher risk)",
      "predicted_rul_hours": "float",
      "failure_probability_24h": "float",
      "failure_probability_72h": "float",
      "top_risk_factors": "array of {feature, contribution}"
    }
  }
}

Sample inference response:

{
  "machine_id": "M-0073",
  "health_score": 0.28,
  "predicted_rul_hours": 34.5,
  "failure_probability_24h": 0.12,
  "failure_probability_72h": 0.67,
  "failure_mode_probabilities": {
    "BEARING_WEAR": 0.81,
    "SPINDLE_ALIGNMENT": 0.11,
    "COOLANT_SYSTEM": 0.05,
    "ELECTRICAL": 0.03
  },
  "top_risk_factors": [
    {"feature": "vibration_std_24h", "contribution": 0.34, "value": 2.1, "normal_range": "0.2-0.6"},
    {"feature": "vibration_trend_ratio", "contribution": 0.22, "value": 0.28, "normal_range": "-0.05-0.05"},
    {"feature": "temp_range_1h", "contribution": 0.15, "value": 11.3, "normal_range": "2-5"}
  ],
  "recommended_action": "Schedule bearing inspection within 24 hours",
  "timestamp": "2026-02-28T14:30:00Z"
}

CMMS Integration

When a machine's health score drops below 0.3, the system automatically generates a maintenance work order in the CMMS:

  Sensor Data ──▶ Ray Serve ──▶ Health Score < 0.3? ──▶ CMMS Work Order
       │              │               │                       │
       │              │          ┌────┴─────┐                 │
       │              │          │ Yes      │ No              │
       │              │          ▼          ▼                 │
       │              │    Generate WO   Continue             │
       │              │    with priority  monitoring           │
       │              │    and failure                         │
       │              │    mode context                        │
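The decision in the diagram can be sketched as a small function that turns an inference response into a work order. The 0.3 threshold comes from the walkthrough; everything else here is an assumption for illustration, including the work-order fields and the priority rule, since the CMMS schema is not described in the source.

```python
HEALTH_THRESHOLD = 0.3  # from the walkthrough


def build_work_order(prediction):
    """Return a work-order dict when health_score < 0.3, otherwise None.

    The returned fields and the priority rule are illustrative assumptions,
    not the CMMS schema.
    """
    if prediction["health_score"] >= HEALTH_THRESHOLD:
        return None  # healthy enough: continue monitoring
    modes = prediction["failure_mode_probabilities"]
    likely_mode = max(modes, key=modes.get)
    # Hypothetical rule: urgent when failure within 24h is more likely than not.
    priority = "URGENT" if prediction["failure_probability_24h"] >= 0.5 else "HIGH"
    return {
        "machine_id": prediction["machine_id"],
        "priority": priority,
        "suspected_failure_mode": likely_mode,
        "predicted_rul_hours": prediction["predicted_rul_hours"],
        "risk_factors": [f["feature"] for f in prediction["top_risk_factors"]],
    }
```

Fed the sample response for M-0073 (health score 0.28), this sketch yields a HIGH-priority order with BEARING_WEAR as the suspected failure mode.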

Stage 7: Feedback

Production Monitoring Dashboard

Lin Wei configures monitoring for the deployed model in the ML Workbench:

| Metric | Frequency | Alert Threshold | Current Value |
|---|---|---|---|
| Precision @ Recall=80% | Weekly | < 0.70 | 0.78 |
| False positive rate | Daily | > 15% | 11.2% |
| Prediction latency P99 | Continuous | > 500ms | 127ms |
| Feature drift (PSI) | Daily | > 0.15 | 0.04 |
| Prediction volume | Hourly | < 80% of expected | 98.3% |

Business impact tracking:

| Metric | Before Model | After Model (Month 3) | Improvement |
|---|---|---|---|
| Unplanned downtime hours/month | 142 hours | 94 hours | -33.8% |
| Avg. maintenance response time | 4.2 hours | 1.8 hours | -57% |
| False maintenance dispatches | 23/month | 18/month | -22% |
| Maintenance cost per unit | $1.42 | $1.18 | -17% |
| Caught-before-failure rate | 0% (reactive) | 62% | N/A |

Alert Configuration

{
  "alerts": [
    {
      "name": "model_accuracy_degradation",
      "metric": "precision_at_recall_80",
      "condition": "< 0.70 for 3 consecutive evaluations",
      "action": "trigger_retraining_pipeline",
      "notify": ["lin.wei@apex.com", "#ml-ops-alerts"]
    },
    {
      "name": "false_positive_spike",
      "metric": "false_positive_rate",
      "condition": "> 15% daily",
      "action": "notify_and_log",
      "notify": ["lin.wei@apex.com", "maintenance-supervisor@apex.com"]
    },
    {
      "name": "feature_drift_detected",
      "metric": "population_stability_index",
      "condition": "> 0.15 on any feature",
      "action": "trigger_drift_investigation_workflow",
      "notify": ["lin.wei@apex.com"]
    }
  ]
}
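The first alert fires only after three consecutive evaluations below 0.70, which guards against reacting to a single noisy week. A minimal Python sketch of that condition, with a hypothetical function name, looks like this:

```python
def should_trigger(history, threshold=0.70, consecutive=3):
    """True when the latest `consecutive` evaluations are all below threshold.

    history: metric values in chronological order, oldest first.
    """
    if len(history) < consecutive:
        return False  # not enough evaluations yet
    return all(value < threshold for value in history[-consecutive:])
```

With weekly evaluations, a history of 0.78, 0.69, 0.68, 0.66 would trigger retraining, while 0.69, 0.72, 0.68, 0.66 would not, because the run of low values was interrupted.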

Stage 8: Experimentation

Model Comparison

Lin Wei runs a structured experiment comparing three modeling approaches:

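| Model | C-Index | AUC (24h) | AUC (72h) | Training Time | Inference P99 |
|---|---|---|---|---|---|
| Random Survival Forest | 0.84 | 0.91 | 0.87 | 45 min | 127ms |
| Cox Proportional Hazards | 0.79 | 0.85 | 0.81 | 12 min | 23ms |
| LSTM Autoencoder (anomaly) | 0.81 | 0.89 | 0.83 | 3.5 hours | 89ms |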

Decision: Random Survival Forest provides the best predictive performance. The Cox model is kept as a fast fallback for edge cases. The LSTM approach is noted for future investigation with more training data.

Transfer Learning Across Machine Types

Lin Wei tests whether a model trained on CNC lathes can transfer to CNC mills:

-- Evaluate transfer learning: train on lathes, test on mills
SELECT
    e.machine_type AS test_machine_type,
    COUNT(*) AS test_samples,
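    -- Precision: true positives / all predicted failures
    AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0
         WHEN predicted_failure AND NOT actual_failure THEN 0.0
         ELSE NULL END) AS precision,
    -- Recall: true positives / all actual failures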
    AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0
         WHEN NOT predicted_failure AND actual_failure THEN 0.0
         ELSE NULL END) AS recall
FROM model_predictions p
JOIN equipment_registry e ON p.machine_id = e.machine_id
WHERE p.experiment_id = 'transfer_learning_lathe_to_mill'
GROUP BY e.machine_type

| Transfer Scenario | C-Index | AUC (24h) | Viable? |
|---|---|---|---|
| Lathe model on lathes (baseline) | 0.84 | 0.91 | Yes (baseline) |
| Lathe model on mills (zero-shot) | 0.71 | 0.78 | Marginal |
| Lathe model fine-tuned on 30 days of mill data | 0.82 | 0.88 | Yes |
| Separate mill model (full training) | 0.83 | 0.89 | Yes (more data needed) |

Finding: Fine-tuning the lathe model with just 30 days of mill-specific data achieves near-parity with a full model, reducing the cold-start problem for new machine types.

ROI Summary

After 3 months in production:

| Metric | Value |
|---|---|
| Unplanned downtime reduction | 34% (142h to 94h per month) |
| Annual cost savings (downtime averted) | $780K |
| Maintenance labor efficiency gain | 22% fewer false dispatches |
| Model development cost (6-week project) | ~$45K (labor + compute) |
| Ongoing operating cost | ~$2K/month (compute + monitoring) |
| ROI (Year 1) | 17:1 |
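The table does not say how the 17:1 figure was computed. A quick check using only the table's own numbers, offered as a sketch rather than the authors' method, shows which denominator reproduces it:

```python
annual_savings = 780_000          # "Annual cost savings"
development_cost = 45_000         # "Model development cost"
monthly_operating_cost = 2_000    # "Ongoing operating cost"

# Savings relative to the one-time development cost only.
roi_vs_development = annual_savings / development_cost

# Savings relative to development plus 12 months of operation.
first_year_cost = development_cost + 12 * monthly_operating_cost
roi_vs_first_year_cost = annual_savings / first_year_cost

print(f"{roi_vs_development:.1f}:1")      # 17.3:1
print(f"{roi_vs_first_year_cost:.1f}:1")  # 11.3:1
```

The 17:1 figure therefore appears to compare annual savings with development cost alone. Including the first year's operating cost gives a ratio closer to 11:1.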
