Data Scientist Journey: Predictive Maintenance for CNC Machines

Persona: Lin Wei, Senior Data Scientist at Apex Manufacturing Goal: Build a predictive maintenance model to reduce unplanned downtime across 200 CNC machines Primary Workbenches: ML Workbench, Data Workbench Timeline: 6-week project from data exploration to production deployment

Business Context

Apex Manufacturing operates 200 CNC machines across 4 plants, running 2 shifts per day. Unplanned downtime currently costs the company $2.1M annually -- each hour of an idle production line translates to$ 15K-$50K in lost output depending on the machine type and order backlog. The current maintenance approach is calendar-based: every machine gets serviced every 500 operating hours regardless of condition. This leads to two problems: over-maintenance on healthy machines (wasted labor and parts) and under-maintenance on degrading machines (unexpected failures).

Lin Wei's objective is to replace this fixed schedule with condition-based predictions: score each machine's health in real time, predict remaining useful life (RUL), and generate maintenance work orders before failures occur.

Stage 1: Ingestion

Connecting IoT Sensor Streams

Lin Wei starts in the Data Workbench by examining what data sources are already connected. The IoT gateway pushes sensor data from all 200 machines through MQTT into Kafka, which the Ingestion Service consumes continuously.

Ingestion configuration for the sensor stream:

{
  "source": {
    "type": "kafka",
    "config": {
      "bootstrap_servers": "kafka.apex-iot.internal:9092",
      "topic_pattern": "sensors.cnc.*",
      "consumer_group": "matih-ingestion-sensors",
      "format": "avro",
      "schema_registry_url": "http://schema-registry:8081"
    }
  },
  "destination": {
    "type": "s3_parquet",
    "config": {
      "bucket": "apex-data-lake",
      "prefix": "raw/sensors/",
      "partition_by": ["plant_id", "date"],
      "compression": "snappy"
    }
  },
  "transform": {
    "aggregation_window": "5m",
    "aggregations": ["mean", "stddev", "min", "max", "count"],
    "group_by": ["machine_id", "sensor_type"]
  }
}

Connecting SAP for Maintenance History

The SAP ERP PostgreSQL database contains equipment registry and maintenance logs. Lin Wei configures an Airbyte CDC connector to capture ongoing maintenance events:

Source Table	Sync Mode	Frequency	Key Fields
`sap.equipment_master`	Full refresh	Daily	`machine_id`, `machine_type`, `install_date`, `plant_id`
`sap.maintenance_orders`	CDC (incremental)	Every 15 min	`order_id`, `machine_id`, `maintenance_type`, `start_time`, `end_time`
`sap.parts_consumed`	CDC (incremental)	Every 15 min	`order_id`, `part_number`, `quantity`, `cost`
`sap.production_orders`	CDC (incremental)	Every 15 min	`order_id`, `machine_id`, `product_id`, `planned_qty`, `actual_qty`

Importing Equipment Specifications

Machine specification sheets from manufacturers arrive as CSV files. Lin Wei uses the File Import feature in Data Workbench to upload them:

File: cnc_machine_specs.csv (5,247 rows)
Columns: machine_id, manufacturer, model, spindle_max_rpm, max_feed_rate,
         axis_count, coolant_type, rated_power_kw, weight_kg, install_date

After ingestion, Lin Wei verifies the data is flowing by checking row counts in the catalog:

Dataset	Expected Volume	Actual Volume	Status
`sensor_readings` (5-min agg)	~57,600 rows/day (200 machines x 12 sensors x 24 readings)	56,832 rows/day	98.7% completeness
`maintenance_logs`	~15 new records/day	14 records/day	Normal
`equipment_registry`	5,247 machines	5,247 machines	Complete

Stage 2: Discovery

Exploring the Sensor Data Catalog

In the Data Workbench catalog, Lin Wei browses the sensor data assets. The Catalog Service has automatically profiled the ingested data:

Catalog: apex_manufacturing
  └── sensors/
      ├── sensor_readings_5min     (100M rows, 47 sensor types)
      │   ├── vibration_x          (g-force, range: 0.01 - 8.4)
      │   ├── vibration_y          (g-force, range: 0.01 - 7.9)
      │   ├── vibration_z          (g-force, range: 0.01 - 9.1)
      │   ├── spindle_temperature  (celsius, range: 22 - 94)
      │   ├── coolant_temperature  (celsius, range: 15 - 38)
      │   ├── coolant_flow_rate    (L/min, range: 0 - 45)
      │   ├── spindle_speed        (rpm, range: 0 - 12000)
      │   ├── feed_rate            (mm/min, range: 0 - 8000)
      │   ├── spindle_load         (%, range: 0 - 100)
      │   ├── power_consumption    (kW, range: 0 - 75)
      │   └── ... (37 more sensor types)
      └── sensor_readings_raw      (archived, not for direct query)

Data Quality Profiling

The Data Quality Service flags several issues that Lin Wei investigates:

Issue	Affected Records	Root Cause	Resolution
Vibration readings = 0.00 for 3 machines	12,400 records	Miscalibrated accelerometers (machines M-0047, M-0112, M-0198)	Exclude from training; notify maintenance to recalibrate
Temperature spikes > 150C	847 records	Sensor noise during machine startup	Apply startup exclusion window (first 10 min after power-on)
Missing `coolant_flow_rate` on 12 machines	8.2% of readings	Older machines lack flow sensors	Use `coolant_temperature` as proxy feature
Duplicate timestamps	2,100 records	IoT gateway retry on network timeout	Deduplicate by (`machine_id`, `sensor_type`, `timestamp`)

Lineage Tracing

Lin Wei traces the data lineage from raw sensor signal to aggregated feature:

  MQTT Broker          Kafka Topic         5-min Aggregation        S3 Parquet
  ┌──────────┐        ┌──────────┐        ┌──────────────────┐     ┌──────────┐
  │ Raw      │───────▶│ sensors. │───────▶│ Pipeline:        │────▶│ sensors/ │
  │ signal   │        │ cnc.     │        │ sensor_agg_5min  │     │ 5min_agg │
  │ (100 Hz) │        │ plant_1  │        │                  │     │ .parquet │
  │          │        │          │        │ mean, stddev,    │     │          │
  │          │        │          │        │ min, max, count  │     │          │
  └──────────┘        └──────────┘        └──────────────────┘     └──────────┘

Stage 3: Query

Feature Engineering with SQL

Lin Wei moves to the Query Engine to build features for the predictive model. The key insight: machine degradation shows up as gradual changes in sensor distributions over time. Rolling statistics capture these trends.

Rolling window features for vibration analysis:

-- Rolling statistics over multiple time horizons
-- Uses DuckDB for efficient window functions on S3 parquet
SELECT
    machine_id,
    timestamp,
    -- 1-hour rolling statistics
    AVG(vibration_rms) OVER w1h AS vibration_mean_1h,
    STDDEV(vibration_rms) OVER w1h AS vibration_std_1h,
    MAX(vibration_rms) OVER w1h AS vibration_max_1h,
    -- 6-hour rolling statistics
    AVG(vibration_rms) OVER w6h AS vibration_mean_6h,
    STDDEV(vibration_rms) OVER w6h AS vibration_std_6h,
    -- 24-hour rolling statistics
    AVG(vibration_rms) OVER w24h AS vibration_mean_24h,
    STDDEV(vibration_rms) OVER w24h AS vibration_std_24h,
    -- Rate of change (trend detection)
    (AVG(vibration_rms) OVER w1h - AVG(vibration_rms) OVER w24h)
        / NULLIF(AVG(vibration_rms) OVER w24h, 0) AS vibration_trend_ratio,
    -- Temperature rate of change (degrees per hour)
    (MAX(spindle_temperature) OVER w1h - MIN(spindle_temperature) OVER w1h)
        AS temp_range_1h,
    -- Operating context
    AVG(spindle_speed) OVER w1h AS avg_spindle_speed_1h,
    AVG(spindle_load) OVER w1h AS avg_spindle_load_1h,
    -- Hours since last maintenance
    EXTRACT(EPOCH FROM (timestamp - last_maintenance_time)) / 3600.0
        AS hours_since_maintenance
FROM sensor_readings_5min s
LEFT JOIN (
    SELECT machine_id,
           MAX(end_time) AS last_maintenance_time
    FROM maintenance_logs
    WHERE maintenance_type IN ('PREVENTIVE', 'CORRECTIVE')
    GROUP BY machine_id
) m ON s.machine_id = m.machine_id
WINDOW
    w1h  AS (PARTITION BY machine_id ORDER BY timestamp
             RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW),
    w6h  AS (PARTITION BY machine_id ORDER BY timestamp
             RANGE BETWEEN INTERVAL '6' HOUR PRECEDING AND CURRENT ROW),
    w24h AS (PARTITION BY machine_id ORDER BY timestamp
             RANGE BETWEEN INTERVAL '24' HOUR PRECEDING AND CURRENT ROW)

Failure label construction -- joining sensor data with maintenance events:

-- Create target variable: time-to-failure for each sensor reading
-- A reading is labeled with the hours remaining until the next unplanned failure
WITH failure_events AS (
    SELECT
        machine_id,
        start_time AS failure_time,
        description AS failure_mode,
        CASE
            WHEN description ILIKE '%bearing%' THEN 'BEARING_WEAR'
            WHEN description ILIKE '%spindle%alignment%' THEN 'SPINDLE_ALIGNMENT'
            WHEN description ILIKE '%coolant%' THEN 'COOLANT_SYSTEM'
            WHEN description ILIKE '%electrical%' OR description ILIKE '%drive%'
                THEN 'ELECTRICAL'
            ELSE 'OTHER'
        END AS failure_category
    FROM maintenance_logs
    WHERE maintenance_type = 'CORRECTIVE'
      AND unplanned = true
)
SELECT
    s.*,
    f.failure_time,
    f.failure_category,
    EXTRACT(EPOCH FROM (f.failure_time - s.timestamp)) / 3600.0
        AS hours_to_failure
FROM sensor_features s
LEFT JOIN LATERAL (
    SELECT failure_time, failure_category
    FROM failure_events fe
    WHERE fe.machine_id = s.machine_id
      AND fe.failure_time > s.timestamp
    ORDER BY fe.failure_time ASC
    LIMIT 1
) f ON true

Stage 4: Orchestration

Streaming + Batch Pipeline Design

Lin Wei designs a two-tier pipeline architecture using the Pipeline Service (Temporal workflows):

                    Apex Predictive Maintenance Pipeline

  STREAMING (continuous)                 BATCH (nightly at 2 AM)
  ┌─────────────────────────┐           ┌─────────────────────────┐
  │ Kafka sensor topic      │           │ S3 sensor parquet       │
  │         │               │           │         │               │
  │         ▼               │           │         ▼               │
  │ 5-min aggregation       │           │ Feature computation     │
  │         │               │           │ (rolling 1h/6h/24h)     │
  │         ▼               │           │         │               │
  │ Anomaly detection       │           │         ▼               │
  │ (Z-score > 3.5)         │           │ Model retraining check  │
  │         │               │           │ (drift detected?)       │
  │         ▼               │           │         │               │
  │ Real-time health score  │           │    ┌────┴─────┐         │
  │ (Ray Serve inference)   │           │    │ Yes      │ No      │
  │         │               │           │    ▼          ▼         │
  │         ▼               │           │ Retrain    Log metrics  │
  │ Alert if score < 0.3    │           │ model      and exit     │
  └─────────────────────────┘           └─────────────────────────┘

Temporal workflow definition for the batch pipeline:

{
  "workflow": "predictive_maintenance_batch",
  "schedule": "0 2 * * *",
  "activities": [
    {
      "name": "compute_features",
      "type": "sql_transform",
      "config": {
        "query_file": "features/rolling_sensor_features.sql",
        "output_table": "ml.sensor_features_daily",
        "partition_by": "date"
      },
      "timeout": "45m"
    },
    {
      "name": "validate_features",
      "type": "data_quality",
      "config": {
        "suite": "sensor_feature_quality",
        "expectations": [
          {"type": "expect_column_values_to_not_be_null", "column": "vibration_mean_1h"},
          {"type": "expect_column_values_to_be_between", "column": "vibration_std_1h", "min": 0, "max": 10},
          {"type": "expect_table_row_count_to_be_between", "min": 50000}
        ],
        "fail_pipeline_on_violation": true
      }
    },
    {
      "name": "check_model_drift",
      "type": "model_monitoring",
      "config": {
        "model_name": "cnc_predictive_maintenance",
        "metrics": ["psi", "csi"],
        "threshold": 0.15,
        "reference_window": "30d"
      }
    },
    {
      "name": "retrain_if_drifted",
      "type": "conditional",
      "condition": "check_model_drift.drift_detected == true",
      "activity": {
        "name": "retrain_model",
        "type": "ml_training",
        "config": {
          "experiment": "cnc_predictive_maintenance",
          "training_data": "ml.sensor_features_daily",
          "model_type": "random_survival_forest",
          "hyperparameters": {
            "n_estimators": 500,
            "max_depth": 12,
            "min_samples_leaf": 20
          }
        }
      }
    }
  ]
}

Data Quality Gates

The pipeline includes sensor-specific quality gates that run before any model training:

Gate	Rule	Action on Failure
Completeness	> 95% of expected sensor readings received	Pause pipeline, alert data engineering
Freshness	Most recent reading < 15 minutes old	Retry after 5 minutes, then alert
Value range	All sensors within physical limits	Quarantine out-of-range readings
Consistency	No machine has > 5% duplicate timestamps	Deduplicate and log warning
Sufficient history	Each machine has > 30 days of data for training	Exclude new machines from training set

Stage 5: Analysis

Failure Mode Profiling

Lin Wei profiles the failure data to understand what the model needs to predict:

  Failure Mode Distribution (Apex Manufacturing, 2 years)
  ════════════════════════════════════════════════════════

  BEARING_WEAR        ███████████████████████████████████████████  43%
  SPINDLE_ALIGNMENT   ████████████████████████████               28%
  COOLANT_SYSTEM      ██████████████████                         18%
  ELECTRICAL          ███████████                                11%

  Total failures: 847 events across 200 machines
  Failure rate: 2.3% of operating 5-min windows
  Mean time to failure after first anomaly: 72 hours (bearing), 48h (spindle)

Class imbalance analysis:

Class	Count	Percentage	Strategy
Normal operation	97,200,000	97.7%	Subsample to 10% for training
Pre-failure (< 72h to failure)	1,890,000	1.9%	Use all samples
Active failure	410,000	0.4%	Use all samples, apply SMOTE on minority subclasses

Sensor-failure correlation analysis:

-- Top sensor features correlated with impending bearing failure
-- Run in ML Workbench notebook
SELECT
    sensor_feature,
    CORR(feature_value, CASE WHEN hours_to_failure < 72 THEN 1 ELSE 0 END)
        AS correlation_with_failure,
    AVG(CASE WHEN hours_to_failure < 72 THEN feature_value END)
        AS avg_pre_failure,
    AVG(CASE WHEN hours_to_failure > 168 THEN feature_value END)
        AS avg_normal
FROM sensor_features_pivoted
WHERE failure_category = 'BEARING_WEAR'
GROUP BY sensor_feature
ORDER BY ABS(correlation_with_failure) DESC
LIMIT 10

Sensor Feature	Correlation	Pre-Failure Avg	Normal Avg	Signal
`vibration_std_24h`	0.72	1.84 g	0.42 g	Variability increases 4.4x
`vibration_trend_ratio`	0.68	+0.31	-0.02	Upward trend before failure
`spindle_load_std_6h`	0.54	8.7%	3.1%	Load becomes erratic
`temp_range_1h`	0.51	12.4 C	4.2 C	Temperature oscillation
`power_consumption_mean_6h`	0.43	48.2 kW	38.7 kW	Draws more power

Stage 6: Productionization

Model Training in ML Workbench

Lin Wei trains a Random Survival Forest model -- appropriate for time-to-event prediction where the "event" is machine failure:

Experiment configuration:

{
  "experiment_name": "cnc_predictive_maintenance",
  "model_type": "random_survival_forest",
  "training_config": {
    "features": [
      "vibration_mean_1h", "vibration_std_1h", "vibration_max_1h",
      "vibration_mean_6h", "vibration_std_6h",
      "vibration_mean_24h", "vibration_std_24h", "vibration_trend_ratio",
      "spindle_temperature_mean_1h", "temp_range_1h",
      "coolant_flow_mean_1h", "spindle_load_std_6h",
      "power_consumption_mean_6h", "hours_since_maintenance",
      "machine_type", "machine_age_years"
    ],
    "target": "hours_to_failure",
    "event_column": "failure_occurred",
    "stratify_by": "failure_category",
    "train_test_split": 0.8,
    "cross_validation_folds": 5
  },
  "hyperparameters": {
    "n_estimators": 500,
    "max_depth": 12,
    "min_samples_leaf": 20,
    "max_features": "sqrt",
    "n_jobs": -1
  },
  "compute": {
    "type": "ray_train",
    "num_workers": 4,
    "resources_per_worker": {"cpu": 4, "memory_gb": 16}
  }
}

Model evaluation results:

Metric	Value	Threshold	Status
Concordance Index (C-index)	0.84	> 0.80	Pass
Time-dependent AUC (24h horizon)	0.91	> 0.85	Pass
Time-dependent AUC (72h horizon)	0.87	> 0.80	Pass
Brier Score (24h)	0.08	< 0.15	Pass
Calibration slope	1.03	0.9 - 1.1	Pass

Deploying to Ray Serve

The trained model is registered in the Model Registry and deployed for real-time inference:

{
  "deployment": {
    "model_name": "cnc_predictive_maintenance",
    "model_version": "v3",
    "serving_framework": "ray_serve",
    "endpoint": "/api/v1/ml/predict/machine-health",
    "config": {
      "num_replicas": 2,
      "max_concurrent_queries": 100,
      "ray_actor_options": {
        "num_cpus": 2,
        "memory": 4294967296
      }
    },
    "input_schema": {
      "machine_id": "string",
      "sensor_features": "object (16 feature values)"
    },
    "output_schema": {
      "machine_id": "string",
      "health_score": "float (0-1, lower = higher risk)",
      "predicted_rul_hours": "float",
      "failure_probability_24h": "float",
      "failure_probability_72h": "float",
      "top_risk_factors": "array of {feature, contribution}"
    }
  }
}

Sample inference response:

{
  "machine_id": "M-0073",
  "health_score": 0.28,
  "predicted_rul_hours": 34.5,
  "failure_probability_24h": 0.12,
  "failure_probability_72h": 0.67,
  "failure_mode_probabilities": {
    "BEARING_WEAR": 0.81,
    "SPINDLE_ALIGNMENT": 0.11,
    "COOLANT_SYSTEM": 0.05,
    "ELECTRICAL": 0.03
  },
  "top_risk_factors": [
    {"feature": "vibration_std_24h", "contribution": 0.34, "value": 2.1, "normal_range": "0.2-0.6"},
    {"feature": "vibration_trend_ratio", "contribution": 0.22, "value": 0.28, "normal_range": "-0.05-0.05"},
    {"feature": "temp_range_1h", "contribution": 0.15, "value": 11.3, "normal_range": "2-5"}
  ],
  "recommended_action": "Schedule bearing inspection within 24 hours",
  "timestamp": "2026-02-28T14:30:00Z"
}

CMMS Integration

When a machine's health score drops below 0.3, the system automatically generates a maintenance work order in the CMMS:

  Sensor Data ──▶ Ray Serve ──▶ Health Score < 0.3? ──▶ CMMS Work Order
       │              │               │                       │
       │              │          ┌────┴─────┐                 │
       │              │          │ Yes      │ No              │
       │              │          ▼          ▼                 │
       │              │    Generate WO   Continue             │
       │              │    with priority  monitoring           │
       │              │    and failure                         │
       │              │    mode context                        │

Stage 7: Feedback

Production Monitoring Dashboard

Lin Wei configures monitoring for the deployed model in the ML Workbench:

Metric	Frequency	Alert Threshold	Current Value
Precision @ Recall=80%	Weekly	< 0.70	0.78
False positive rate	Daily	> 15%	11.2%
Prediction latency P99	Continuous	> 500ms	127ms
Feature drift (PSI)	Daily	> 0.15	0.04
Prediction volume	Hourly	< 80% of expected	98.3%

Business impact tracking:

Metric	Before Model	After Model (Month 3)	Improvement
Unplanned downtime hours/month	142 hours	94 hours	-33.8%
Avg. maintenance response time	4.2 hours	1.8 hours	-57%
False maintenance dispatches	23/month	18/month	-22%
Maintenance cost per unit	$1.42	$1.18	-17%
Caught-before-failure rate	0% (reactive)	62%	N/A

Alert Configuration

{
  "alerts": [
    {
      "name": "model_accuracy_degradation",
      "metric": "precision_at_recall_80",
      "condition": "< 0.70 for 3 consecutive evaluations",
      "action": "trigger_retraining_pipeline",
      "notify": ["lin.wei@apex.com", "#ml-ops-alerts"]
    },
    {
      "name": "false_positive_spike",
      "metric": "false_positive_rate",
      "condition": "> 15% daily",
      "action": "notify_and_log",
      "notify": ["lin.wei@apex.com", "maintenance-supervisor@apex.com"]
    },
    {
      "name": "feature_drift_detected",
      "metric": "population_stability_index",
      "condition": "> 0.15 on any feature",
      "action": "trigger_drift_investigation_workflow",
      "notify": ["lin.wei@apex.com"]
    }
  ]
}

Stage 8: Experimentation

Model Comparison

Lin Wei runs a structured experiment comparing three modeling approaches:

Model	C-Index	AUC (24h)	AUC (72h)	Training Time	Inference P99
Random Survival Forest	0.84	0.91	0.87	45 min	127ms
Cox Proportional Hazards	0.79	0.85	0.81	12 min	23ms
LSTM Autoencoder (anomaly)	0.81	0.89	0.83	3.5 hours	89ms

Decision: Random Survival Forest provides the best predictive performance. The Cox model is kept as a fast fallback for edge cases. The LSTM approach is noted for future investigation with more training data.

Transfer Learning Across Machine Types

Lin Wei tests whether a model trained on CNC lathes can transfer to CNC mills:

-- Evaluate transfer learning: train on lathes, test on mills
SELECT
    e.machine_type AS test_machine_type,
    COUNT(*) AS test_samples,
    AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0 ELSE 0.0 END)
        AS precision,
    AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0
         WHEN NOT predicted_failure AND actual_failure THEN 0.0
         ELSE NULL END) AS recall
FROM model_predictions p
JOIN equipment_registry e ON p.machine_id = e.machine_id
WHERE p.experiment_id = 'transfer_learning_lathe_to_mill'
GROUP BY e.machine_type

Transfer Scenario	C-Index	AUC (24h)	Viable?
Lathe model on lathes (baseline)	0.84	0.91	Yes (baseline)
Lathe model on mills (zero-shot)	0.71	0.78	Marginal
Lathe model fine-tuned on 30 days of mill data	0.82	0.88	Yes
Separate mill model (full training)	0.83	0.89	Yes (more data needed)

Finding: Fine-tuning the lathe model with just 30 days of mill-specific data achieves near-parity with a full model, reducing the cold-start problem for new machine types.

ROI Summary

After 3 months in production:

Metric	Value
Unplanned downtime reduction	34% (142h to 94h per month)
Annual cost savings (downtime averted)	$780K
Maintenance labor efficiency gain	22% fewer false dispatches
Model development cost (6-week project)	~$45K (labor + compute)
Ongoing operating cost	~$2K/month (compute + monitoring)
ROI (Year 1)	17:1

Related Walkthroughs

ML Engineer Journey -- Tomas deploys visual inspection using techniques from Lin Wei's feature engineering
BI Lead Journey -- Carlos uses Lin Wei's health scores in the OEE dashboard
Executive Journey -- Karen reviews predictive maintenance ROI in strategic planning
Manufacturing Overview -- Full dataset and KPI reference

Industry Overview ML Engineer Journey