Data Scientist Journey: Predictive Maintenance for CNC Machines
Persona: Lin Wei, Senior Data Scientist at Apex Manufacturing Goal: Build a predictive maintenance model to reduce unplanned downtime across 200 CNC machines Primary Workbenches: ML Workbench, Data Workbench Timeline: 6-week project from data exploration to production deployment
Business Context
Apex Manufacturing operates 200 CNC machines across 4 plants, running 2 shifts per day. Unplanned downtime currently costs the company 15K-$50K in lost output depending on the machine type and order backlog. The current maintenance approach is calendar-based: every machine gets serviced every 500 operating hours regardless of condition. This leads to two problems: over-maintenance on healthy machines (wasted labor and parts) and under-maintenance on degrading machines (unexpected failures).
Lin Wei's objective is to replace this fixed schedule with condition-based predictions: score each machine's health in real time, predict remaining useful life (RUL), and generate maintenance work orders before failures occur.
Stage 1: Ingestion
Connecting IoT Sensor Streams
Lin Wei starts in the Data Workbench by examining what data sources are already connected. The IoT gateway pushes sensor data from all 200 machines through MQTT into Kafka, which the Ingestion Service consumes continuously.
Ingestion configuration for the sensor stream:
{
"source": {
"type": "kafka",
"config": {
"bootstrap_servers": "kafka.apex-iot.internal:9092",
"topic_pattern": "sensors.cnc.*",
"consumer_group": "matih-ingestion-sensors",
"format": "avro",
"schema_registry_url": "http://schema-registry:8081"
}
},
"destination": {
"type": "s3_parquet",
"config": {
"bucket": "apex-data-lake",
"prefix": "raw/sensors/",
"partition_by": ["plant_id", "date"],
"compression": "snappy"
}
},
"transform": {
"aggregation_window": "5m",
"aggregations": ["mean", "stddev", "min", "max", "count"],
"group_by": ["machine_id", "sensor_type"]
}
}Connecting SAP for Maintenance History
The SAP ERP PostgreSQL database contains equipment registry and maintenance logs. Lin Wei configures an Airbyte CDC connector to capture ongoing maintenance events:
| Source Table | Sync Mode | Frequency | Key Fields |
|---|---|---|---|
sap.equipment_master | Full refresh | Daily | machine_id, machine_type, install_date, plant_id |
sap.maintenance_orders | CDC (incremental) | Every 15 min | order_id, machine_id, maintenance_type, start_time, end_time |
sap.parts_consumed | CDC (incremental) | Every 15 min | order_id, part_number, quantity, cost |
sap.production_orders | CDC (incremental) | Every 15 min | order_id, machine_id, product_id, planned_qty, actual_qty |
Importing Equipment Specifications
Machine specification sheets from manufacturers arrive as CSV files. Lin Wei uses the File Import feature in Data Workbench to upload them:
File: cnc_machine_specs.csv (5,247 rows)
Columns: machine_id, manufacturer, model, spindle_max_rpm, max_feed_rate,
axis_count, coolant_type, rated_power_kw, weight_kg, install_dateAfter ingestion, Lin Wei verifies the data is flowing by checking row counts in the catalog:
| Dataset | Expected Volume | Actual Volume | Status |
|---|---|---|---|
sensor_readings (5-min agg) | ~57,600 rows/day (200 machines x 12 sensors x 24 readings) | 56,832 rows/day | 98.7% completeness |
maintenance_logs | ~15 new records/day | 14 records/day | Normal |
equipment_registry | 5,247 machines | 5,247 machines | Complete |
Stage 2: Discovery
Exploring the Sensor Data Catalog
In the Data Workbench catalog, Lin Wei browses the sensor data assets. The Catalog Service has automatically profiled the ingested data:
Catalog: apex_manufacturing
└── sensors/
├── sensor_readings_5min (100M rows, 47 sensor types)
│ ├── vibration_x (g-force, range: 0.01 - 8.4)
│ ├── vibration_y (g-force, range: 0.01 - 7.9)
│ ├── vibration_z (g-force, range: 0.01 - 9.1)
│ ├── spindle_temperature (celsius, range: 22 - 94)
│ ├── coolant_temperature (celsius, range: 15 - 38)
│ ├── coolant_flow_rate (L/min, range: 0 - 45)
│ ├── spindle_speed (rpm, range: 0 - 12000)
│ ├── feed_rate (mm/min, range: 0 - 8000)
│ ├── spindle_load (%, range: 0 - 100)
│ ├── power_consumption (kW, range: 0 - 75)
│ └── ... (37 more sensor types)
└── sensor_readings_raw (archived, not for direct query)Data Quality Profiling
The Data Quality Service flags several issues that Lin Wei investigates:
| Issue | Affected Records | Root Cause | Resolution |
|---|---|---|---|
| Vibration readings = 0.00 for 3 machines | 12,400 records | Miscalibrated accelerometers (machines M-0047, M-0112, M-0198) | Exclude from training; notify maintenance to recalibrate |
| Temperature spikes > 150C | 847 records | Sensor noise during machine startup | Apply startup exclusion window (first 10 min after power-on) |
Missing coolant_flow_rate on 12 machines | 8.2% of readings | Older machines lack flow sensors | Use coolant_temperature as proxy feature |
| Duplicate timestamps | 2,100 records | IoT gateway retry on network timeout | Deduplicate by (machine_id, sensor_type, timestamp) |
Lineage Tracing
Lin Wei traces the data lineage from raw sensor signal to aggregated feature:
MQTT Broker Kafka Topic 5-min Aggregation S3 Parquet
┌──────────┐ ┌──────────┐ ┌──────────────────┐ ┌──────────┐
│ Raw │───────▶│ sensors. │───────▶│ Pipeline: │────▶│ sensors/ │
│ signal │ │ cnc. │ │ sensor_agg_5min │ │ 5min_agg │
│ (100 Hz) │ │ plant_1 │ │ │ │ .parquet │
│ │ │ │ │ mean, stddev, │ │ │
│ │ │ │ │ min, max, count │ │ │
└──────────┘ └──────────┘ └──────────────────┘ └──────────┘Stage 3: Query
Feature Engineering with SQL
Lin Wei moves to the Query Engine to build features for the predictive model. The key insight: machine degradation shows up as gradual changes in sensor distributions over time. Rolling statistics capture these trends.
Rolling window features for vibration analysis:
-- Rolling statistics over multiple time horizons
-- Uses DuckDB for efficient window functions on S3 parquet
SELECT
machine_id,
timestamp,
-- 1-hour rolling statistics
AVG(vibration_rms) OVER w1h AS vibration_mean_1h,
STDDEV(vibration_rms) OVER w1h AS vibration_std_1h,
MAX(vibration_rms) OVER w1h AS vibration_max_1h,
-- 6-hour rolling statistics
AVG(vibration_rms) OVER w6h AS vibration_mean_6h,
STDDEV(vibration_rms) OVER w6h AS vibration_std_6h,
-- 24-hour rolling statistics
AVG(vibration_rms) OVER w24h AS vibration_mean_24h,
STDDEV(vibration_rms) OVER w24h AS vibration_std_24h,
-- Rate of change (trend detection)
(AVG(vibration_rms) OVER w1h - AVG(vibration_rms) OVER w24h)
/ NULLIF(AVG(vibration_rms) OVER w24h, 0) AS vibration_trend_ratio,
-- Temperature rate of change (degrees per hour)
(MAX(spindle_temperature) OVER w1h - MIN(spindle_temperature) OVER w1h)
AS temp_range_1h,
-- Operating context
AVG(spindle_speed) OVER w1h AS avg_spindle_speed_1h,
AVG(spindle_load) OVER w1h AS avg_spindle_load_1h,
-- Hours since last maintenance
EXTRACT(EPOCH FROM (timestamp - last_maintenance_time)) / 3600.0
AS hours_since_maintenance
FROM sensor_readings_5min s
LEFT JOIN (
SELECT machine_id,
MAX(end_time) AS last_maintenance_time
FROM maintenance_logs
WHERE maintenance_type IN ('PREVENTIVE', 'CORRECTIVE')
GROUP BY machine_id
) m ON s.machine_id = m.machine_id
WINDOW
w1h AS (PARTITION BY machine_id ORDER BY timestamp
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW),
w6h AS (PARTITION BY machine_id ORDER BY timestamp
RANGE BETWEEN INTERVAL '6' HOUR PRECEDING AND CURRENT ROW),
w24h AS (PARTITION BY machine_id ORDER BY timestamp
RANGE BETWEEN INTERVAL '24' HOUR PRECEDING AND CURRENT ROW)Failure label construction -- joining sensor data with maintenance events:
-- Create target variable: time-to-failure for each sensor reading
-- A reading is labeled with the hours remaining until the next unplanned failure
WITH failure_events AS (
SELECT
machine_id,
start_time AS failure_time,
description AS failure_mode,
CASE
WHEN description ILIKE '%bearing%' THEN 'BEARING_WEAR'
WHEN description ILIKE '%spindle%alignment%' THEN 'SPINDLE_ALIGNMENT'
WHEN description ILIKE '%coolant%' THEN 'COOLANT_SYSTEM'
WHEN description ILIKE '%electrical%' OR description ILIKE '%drive%'
THEN 'ELECTRICAL'
ELSE 'OTHER'
END AS failure_category
FROM maintenance_logs
WHERE maintenance_type = 'CORRECTIVE'
AND unplanned = true
)
SELECT
s.*,
f.failure_time,
f.failure_category,
EXTRACT(EPOCH FROM (f.failure_time - s.timestamp)) / 3600.0
AS hours_to_failure
FROM sensor_features s
LEFT JOIN LATERAL (
SELECT failure_time, failure_category
FROM failure_events fe
WHERE fe.machine_id = s.machine_id
AND fe.failure_time > s.timestamp
ORDER BY fe.failure_time ASC
LIMIT 1
) f ON trueStage 4: Orchestration
Streaming + Batch Pipeline Design
Lin Wei designs a two-tier pipeline architecture using the Pipeline Service (Temporal workflows):
Apex Predictive Maintenance Pipeline
STREAMING (continuous) BATCH (nightly at 2 AM)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Kafka sensor topic │ │ S3 sensor parquet │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ 5-min aggregation │ │ Feature computation │
│ │ │ │ (rolling 1h/6h/24h) │
│ ▼ │ │ │ │
│ Anomaly detection │ │ ▼ │
│ (Z-score > 3.5) │ │ Model retraining check │
│ │ │ │ (drift detected?) │
│ ▼ │ │ │ │
│ Real-time health score │ │ ┌────┴─────┐ │
│ (Ray Serve inference) │ │ │ Yes │ No │
│ │ │ │ ▼ ▼ │
│ ▼ │ │ Retrain Log metrics │
│ Alert if score < 0.3 │ │ model and exit │
└─────────────────────────┘ └─────────────────────────┘Temporal workflow definition for the batch pipeline:
{
"workflow": "predictive_maintenance_batch",
"schedule": "0 2 * * *",
"activities": [
{
"name": "compute_features",
"type": "sql_transform",
"config": {
"query_file": "features/rolling_sensor_features.sql",
"output_table": "ml.sensor_features_daily",
"partition_by": "date"
},
"timeout": "45m"
},
{
"name": "validate_features",
"type": "data_quality",
"config": {
"suite": "sensor_feature_quality",
"expectations": [
{"type": "expect_column_values_to_not_be_null", "column": "vibration_mean_1h"},
{"type": "expect_column_values_to_be_between", "column": "vibration_std_1h", "min": 0, "max": 10},
{"type": "expect_table_row_count_to_be_between", "min": 50000}
],
"fail_pipeline_on_violation": true
}
},
{
"name": "check_model_drift",
"type": "model_monitoring",
"config": {
"model_name": "cnc_predictive_maintenance",
"metrics": ["psi", "csi"],
"threshold": 0.15,
"reference_window": "30d"
}
},
{
"name": "retrain_if_drifted",
"type": "conditional",
"condition": "check_model_drift.drift_detected == true",
"activity": {
"name": "retrain_model",
"type": "ml_training",
"config": {
"experiment": "cnc_predictive_maintenance",
"training_data": "ml.sensor_features_daily",
"model_type": "random_survival_forest",
"hyperparameters": {
"n_estimators": 500,
"max_depth": 12,
"min_samples_leaf": 20
}
}
}
}
]
}Data Quality Gates
The pipeline includes sensor-specific quality gates that run before any model training:
| Gate | Rule | Action on Failure |
|---|---|---|
| Completeness | > 95% of expected sensor readings received | Pause pipeline, alert data engineering |
| Freshness | Most recent reading < 15 minutes old | Retry after 5 minutes, then alert |
| Value range | All sensors within physical limits | Quarantine out-of-range readings |
| Consistency | No machine has > 5% duplicate timestamps | Deduplicate and log warning |
| Sufficient history | Each machine has > 30 days of data for training | Exclude new machines from training set |
Stage 5: Analysis
Failure Mode Profiling
Lin Wei profiles the failure data to understand what the model needs to predict:
Failure Mode Distribution (Apex Manufacturing, 2 years)
════════════════════════════════════════════════════════
BEARING_WEAR ███████████████████████████████████████████ 43%
SPINDLE_ALIGNMENT ████████████████████████████ 28%
COOLANT_SYSTEM ██████████████████ 18%
ELECTRICAL ███████████ 11%
Total failures: 847 events across 200 machines
Failure rate: 2.3% of operating 5-min windows
Mean time to failure after first anomaly: 72 hours (bearing), 48h (spindle)Class imbalance analysis:
| Class | Count | Percentage | Strategy |
|---|---|---|---|
| Normal operation | 97,200,000 | 97.7% | Subsample to 10% for training |
| Pre-failure (< 72h to failure) | 1,890,000 | 1.9% | Use all samples |
| Active failure | 410,000 | 0.4% | Use all samples, apply SMOTE on minority subclasses |
Sensor-failure correlation analysis:
-- Top sensor features correlated with impending bearing failure
-- Run in ML Workbench notebook
SELECT
sensor_feature,
CORR(feature_value, CASE WHEN hours_to_failure < 72 THEN 1 ELSE 0 END)
AS correlation_with_failure,
AVG(CASE WHEN hours_to_failure < 72 THEN feature_value END)
AS avg_pre_failure,
AVG(CASE WHEN hours_to_failure > 168 THEN feature_value END)
AS avg_normal
FROM sensor_features_pivoted
WHERE failure_category = 'BEARING_WEAR'
GROUP BY sensor_feature
ORDER BY ABS(correlation_with_failure) DESC
LIMIT 10| Sensor Feature | Correlation | Pre-Failure Avg | Normal Avg | Signal |
|---|---|---|---|---|
vibration_std_24h | 0.72 | 1.84 g | 0.42 g | Variability increases 4.4x |
vibration_trend_ratio | 0.68 | +0.31 | -0.02 | Upward trend before failure |
spindle_load_std_6h | 0.54 | 8.7% | 3.1% | Load becomes erratic |
temp_range_1h | 0.51 | 12.4 C | 4.2 C | Temperature oscillation |
power_consumption_mean_6h | 0.43 | 48.2 kW | 38.7 kW | Draws more power |
Stage 6: Productionization
Model Training in ML Workbench
Lin Wei trains a Random Survival Forest model -- appropriate for time-to-event prediction where the "event" is machine failure:
Experiment configuration:
{
"experiment_name": "cnc_predictive_maintenance",
"model_type": "random_survival_forest",
"training_config": {
"features": [
"vibration_mean_1h", "vibration_std_1h", "vibration_max_1h",
"vibration_mean_6h", "vibration_std_6h",
"vibration_mean_24h", "vibration_std_24h", "vibration_trend_ratio",
"spindle_temperature_mean_1h", "temp_range_1h",
"coolant_flow_mean_1h", "spindle_load_std_6h",
"power_consumption_mean_6h", "hours_since_maintenance",
"machine_type", "machine_age_years"
],
"target": "hours_to_failure",
"event_column": "failure_occurred",
"stratify_by": "failure_category",
"train_test_split": 0.8,
"cross_validation_folds": 5
},
"hyperparameters": {
"n_estimators": 500,
"max_depth": 12,
"min_samples_leaf": 20,
"max_features": "sqrt",
"n_jobs": -1
},
"compute": {
"type": "ray_train",
"num_workers": 4,
"resources_per_worker": {"cpu": 4, "memory_gb": 16}
}
}Model evaluation results:
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Concordance Index (C-index) | 0.84 | > 0.80 | Pass |
| Time-dependent AUC (24h horizon) | 0.91 | > 0.85 | Pass |
| Time-dependent AUC (72h horizon) | 0.87 | > 0.80 | Pass |
| Brier Score (24h) | 0.08 | < 0.15 | Pass |
| Calibration slope | 1.03 | 0.9 - 1.1 | Pass |
Deploying to Ray Serve
The trained model is registered in the Model Registry and deployed for real-time inference:
{
"deployment": {
"model_name": "cnc_predictive_maintenance",
"model_version": "v3",
"serving_framework": "ray_serve",
"endpoint": "/api/v1/ml/predict/machine-health",
"config": {
"num_replicas": 2,
"max_concurrent_queries": 100,
"ray_actor_options": {
"num_cpus": 2,
"memory": 4294967296
}
},
"input_schema": {
"machine_id": "string",
"sensor_features": "object (16 feature values)"
},
"output_schema": {
"machine_id": "string",
"health_score": "float (0-1, lower = higher risk)",
"predicted_rul_hours": "float",
"failure_probability_24h": "float",
"failure_probability_72h": "float",
"top_risk_factors": "array of {feature, contribution}"
}
}
}Sample inference response:
{
"machine_id": "M-0073",
"health_score": 0.28,
"predicted_rul_hours": 34.5,
"failure_probability_24h": 0.12,
"failure_probability_72h": 0.67,
"failure_mode_probabilities": {
"BEARING_WEAR": 0.81,
"SPINDLE_ALIGNMENT": 0.11,
"COOLANT_SYSTEM": 0.05,
"ELECTRICAL": 0.03
},
"top_risk_factors": [
{"feature": "vibration_std_24h", "contribution": 0.34, "value": 2.1, "normal_range": "0.2-0.6"},
{"feature": "vibration_trend_ratio", "contribution": 0.22, "value": 0.28, "normal_range": "-0.05-0.05"},
{"feature": "temp_range_1h", "contribution": 0.15, "value": 11.3, "normal_range": "2-5"}
],
"recommended_action": "Schedule bearing inspection within 24 hours",
"timestamp": "2026-02-28T14:30:00Z"
}CMMS Integration
When a machine's health score drops below 0.3, the system automatically generates a maintenance work order in the CMMS:
Sensor Data ──▶ Ray Serve ──▶ Health Score < 0.3? ──▶ CMMS Work Order
│ │ │ │
│ │ ┌────┴─────┐ │
│ │ │ Yes │ No │
│ │ ▼ ▼ │
│ │ Generate WO Continue │
│ │ with priority monitoring │
│ │ and failure │
│ │ mode context │Stage 7: Feedback
Production Monitoring Dashboard
Lin Wei configures monitoring for the deployed model in the ML Workbench:
| Metric | Frequency | Alert Threshold | Current Value |
|---|---|---|---|
| Precision @ Recall=80% | Weekly | < 0.70 | 0.78 |
| False positive rate | Daily | > 15% | 11.2% |
| Prediction latency P99 | Continuous | > 500ms | 127ms |
| Feature drift (PSI) | Daily | > 0.15 | 0.04 |
| Prediction volume | Hourly | < 80% of expected | 98.3% |
Business impact tracking:
| Metric | Before Model | After Model (Month 3) | Improvement |
|---|---|---|---|
| Unplanned downtime hours/month | 142 hours | 94 hours | -33.8% |
| Avg. maintenance response time | 4.2 hours | 1.8 hours | -57% |
| False maintenance dispatches | 23/month | 18/month | -22% |
| Maintenance cost per unit | $1.42 | $1.18 | -17% |
| Caught-before-failure rate | 0% (reactive) | 62% | N/A |
Alert Configuration
{
"alerts": [
{
"name": "model_accuracy_degradation",
"metric": "precision_at_recall_80",
"condition": "< 0.70 for 3 consecutive evaluations",
"action": "trigger_retraining_pipeline",
"notify": ["lin.wei@apex.com", "#ml-ops-alerts"]
},
{
"name": "false_positive_spike",
"metric": "false_positive_rate",
"condition": "> 15% daily",
"action": "notify_and_log",
"notify": ["lin.wei@apex.com", "maintenance-supervisor@apex.com"]
},
{
"name": "feature_drift_detected",
"metric": "population_stability_index",
"condition": "> 0.15 on any feature",
"action": "trigger_drift_investigation_workflow",
"notify": ["lin.wei@apex.com"]
}
]
}Stage 8: Experimentation
Model Comparison
Lin Wei runs a structured experiment comparing three modeling approaches:
| Model | C-Index | AUC (24h) | AUC (72h) | Training Time | Inference P99 |
|---|---|---|---|---|---|
| Random Survival Forest | 0.84 | 0.91 | 0.87 | 45 min | 127ms |
| Cox Proportional Hazards | 0.79 | 0.85 | 0.81 | 12 min | 23ms |
| LSTM Autoencoder (anomaly) | 0.81 | 0.89 | 0.83 | 3.5 hours | 89ms |
Decision: Random Survival Forest provides the best predictive performance. The Cox model is kept as a fast fallback for edge cases. The LSTM approach is noted for future investigation with more training data.
Transfer Learning Across Machine Types
Lin Wei tests whether a model trained on CNC lathes can transfer to CNC mills:
-- Evaluate transfer learning: train on lathes, test on mills
SELECT
e.machine_type AS test_machine_type,
COUNT(*) AS test_samples,
AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0 ELSE 0.0 END)
AS precision,
AVG(CASE WHEN predicted_failure AND actual_failure THEN 1.0
WHEN NOT predicted_failure AND actual_failure THEN 0.0
ELSE NULL END) AS recall
FROM model_predictions p
JOIN equipment_registry e ON p.machine_id = e.machine_id
WHERE p.experiment_id = 'transfer_learning_lathe_to_mill'
GROUP BY e.machine_type| Transfer Scenario | C-Index | AUC (24h) | Viable? |
|---|---|---|---|
| Lathe model on lathes (baseline) | 0.84 | 0.91 | Yes (baseline) |
| Lathe model on mills (zero-shot) | 0.71 | 0.78 | Marginal |
| Lathe model fine-tuned on 30 days of mill data | 0.82 | 0.88 | Yes |
| Separate mill model (full training) | 0.83 | 0.89 | Yes (more data needed) |
Finding: Fine-tuning the lathe model with just 30 days of mill-specific data achieves near-parity with a full model, reducing the cold-start problem for new machine types.
ROI Summary
After 3 months in production:
| Metric | Value |
|---|---|
| Unplanned downtime reduction | 34% (142h to 94h per month) |
| Annual cost savings (downtime averted) | $780K |
| Maintenance labor efficiency gain | 22% fewer false dispatches |
| Model development cost (6-week project) | ~$45K (labor + compute) |
| Ongoing operating cost | ~$2K/month (compute + monitoring) |
| ROI (Year 1) | 17:1 |
Related Walkthroughs
- ML Engineer Journey -- Tomas deploys visual inspection using techniques from Lin Wei's feature engineering
- BI Lead Journey -- Carlos uses Lin Wei's health scores in the OEE dashboard
- Executive Journey -- Karen reviews predictive maintenance ROI in strategic planning
- Manufacturing Overview -- Full dataset and KPI reference