ML Engineer Journey: Visual Quality Inspection System
Persona: Tomas Rivera, ML Engineer at Apex Manufacturing
Goal: Deploy an automated visual inspection system for real-time defect detection on the production line
Primary Workbenches: ML Workbench, Data Workbench, Pipeline Service
Timeline: 8-week project from data preparation to edge deployment
Business Context
Apex Manufacturing currently relies on human inspectors at the end of each production line to identify surface defects on machined parts. Each inspector examines roughly 200 parts per shift using magnification and gauging tools. The process has three problems: it is slow (45 seconds per part), inconsistent (inspector agreement rate is only 82%), and expensive ($2M in scrap and in missed defects that reach customers).
Tomas Rivera's objective is to build and deploy a CNN-based visual inspection system that captures images of every part at line speed, classifies defects in real time, and routes rejected parts to rework stations automatically. The system must achieve > 97% accuracy with P99 inference latency under 200ms to keep pace with the production line.
Stage 1: Ingestion
Connecting Image Metadata
The camera system at each production line captures high-resolution images (4096x3072, 12MP) and stores them on a local NAS. Tomas does not ingest raw images into the platform -- instead, he ingests the metadata and labels, using the platform to manage the ML lifecycle while images remain on fast local storage accessible to training infrastructure.
Image metadata ingestion configuration:
{
"source": {
"type": "postgresql",
"config": {
"host": "qms-db.apex.internal",
"database": "quality_vision",
"schema": "inspection",
"tables": [
"image_metadata",
"defect_annotations",
"inspection_results",
"camera_calibration"
]
}
},
"sync": {
"mode": "cdc_incremental",
"frequency": "every_15_minutes",
"cursor_field": "updated_at"
}
}
Connected Data Sources
| Source | Type | Connector | Key Tables | Frequency |
|---|---|---|---|---|
| Quality Vision DB | PostgreSQL | Airbyte CDC | image_metadata, defect_annotations | Every 15 min |
| Quality Management System | PostgreSQL | Airbyte CDC | inspection_results, defect_taxonomy | Every 15 min |
| SAP ERP | PostgreSQL | Airbyte CDC | production_orders, product_specs | Every 15 min |
| Camera calibration logs | CSV Import | File Import | camera_config, lens_distortion_params | Monthly |
Data Volume Assessment
| Dataset | Volume | Growth Rate |
|---|---|---|
| image_metadata | 250K records (6 months of history) | ~1,400 images/day (2 lines, 2 shifts) |
| defect_annotations | 38K labeled defects across 22K images | ~200 new annotations/day |
| inspection_results | 1M historical results (human inspector decisions) | ~1,400/day |
| production_orders | 500K orders | ~300/day |
| Raw images (on NAS, not ingested) | 4.2 TB | ~18 GB/day |
Stage 2: Discovery
Mapping Quality-Related Assets
Tomas explores the data catalog in the Data Workbench to understand the quality data landscape:
Catalog: apex_manufacturing
└── quality/
├── image_metadata (250K rows)
│ ├── image_id (UUID, unique)
│ ├── production_order_id (FK to production_orders)
│ ├── machine_id (FK to equipment_registry)
│ ├── camera_id (1 of 8 cameras across 2 lines)
│ ├── capture_timestamp (millisecond precision)
│ ├── image_path (NAS path, not in platform)
│ ├── part_type (product SKU)
│ └── exposure_settings (JSON: ISO, aperture, shutter)
├── defect_annotations (38K rows)
│ ├── annotation_id (UUID)
│ ├── image_id (FK to image_metadata)
│ ├── defect_type (string -- INCONSISTENT across plants)
│ ├── bounding_box (JSON: x, y, width, height)
│ ├── annotator_id (human labeler)
│ └── confidence (annotator self-assessed, 1-5)
└── inspection_results (1M rows)
├── result_id (UUID)
├── image_id (FK)
├── inspector_id (human inspector)
├── decision (PASS / FAIL / REWORK)
    └── defect_codes (array of codes)
Discovering Labeling Inconsistencies
Data profiling reveals a critical issue: the same defect type is named differently across plants and even between annotators at the same plant.
-- Discover labeling inconsistencies across plants
SELECT
e.plant_id,
d.defect_type,
COUNT(*) AS annotation_count
FROM defect_annotations d
JOIN image_metadata im ON d.image_id = im.image_id
JOIN equipment_registry e ON im.machine_id = e.machine_id
GROUP BY e.plant_id, d.defect_type
ORDER BY e.plant_id, annotation_count DESC
| Plant | Defect Label (as entered) | Count | Standardized Label |
|---|---|---|---|
| Plant 1 | scratch | 4,200 | SURFACE_SCRATCH |
| Plant 2 | surface scratch | 3,100 | SURFACE_SCRATCH |
| Plant 3 | SCRATCH_SURFACE | 2,800 | SURFACE_SCRATCH |
| Plant 1 | burr | 3,600 | BURR |
| Plant 2 | edge burr | 2,900 | BURR |
| Plant 1 | pit | 2,100 | SURFACE_PIT |
| Plant 2 | pitting | 1,800 | SURFACE_PIT |
| Plant 3 | corrosion_pit | 1,400 | SURFACE_PIT |
| All | crack / fracture / hairline_crack | 2,400 | CRACK |
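The alias-to-canonical mapping driving this standardization can be sketched as a small lookup. This is a minimal illustration: `normalize_defect_label` is a hypothetical helper, and in practice the alias sets would be generated from the ontology's `aliases` fields rather than hard-coded.

```python
# Hypothetical alias map mirroring the raw labels observed in the table above.
# In production this would be derived from the ontology's "aliases" fields.
DEFECT_ALIASES = {
    "SURFACE_SCRATCH": {"scratch", "surface scratch", "scratch_surface", "scr"},
    "BURR": {"burr", "edge burr", "deburr_needed", "rough_edge"},
    "SURFACE_PIT": {"pit", "pitting", "corrosion_pit", "surface_void"},
    "CRACK": {"crack", "fracture", "hairline_crack", "stress_crack"},
}

# Invert to a flat lookup: raw label (lowercased) -> canonical class
_LOOKUP = {alias: canon
           for canon, aliases in DEFECT_ALIASES.items()
           for alias in aliases}

def normalize_defect_label(raw: str) -> str:
    """Map a free-text annotator label to its canonical defect class."""
    return _LOOKUP.get(raw.strip().lower(), "OTHER")
```

Lowercasing before lookup collapses case variants like `SCRATCH_SURFACE` and `scratch_surface` into one alias entry.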
Standardizing via Ontology Service
Tomas creates a defect taxonomy in the Ontology Service to standardize labels:
{
"ontology": "apex_defect_taxonomy",
"version": "1.0",
"classes": [
{
"id": "SURFACE_SCRATCH",
"label": "Surface Scratch",
"description": "Linear mark on machined surface caused by tool contact or handling",
"aliases": ["scratch", "surface scratch", "SCRATCH_SURFACE", "scr"],
"severity_levels": ["minor", "major", "critical"],
"parent": "SURFACE_DEFECT"
},
{
"id": "BURR",
"label": "Burr",
"description": "Raised edge or small piece of material remaining after machining",
"aliases": ["burr", "edge burr", "deburr_needed", "rough_edge"],
"severity_levels": ["minor", "major"],
"parent": "EDGE_DEFECT"
},
{
"id": "SURFACE_PIT",
"label": "Surface Pit",
"description": "Small cavity or depression in the machined surface",
"aliases": ["pit", "pitting", "corrosion_pit", "surface_void"],
"severity_levels": ["minor", "major", "critical"],
"parent": "SURFACE_DEFECT"
},
{
"id": "CRACK",
"label": "Crack",
"description": "Fracture line in the material, potentially structural",
"aliases": ["crack", "fracture", "hairline_crack", "stress_crack"],
"severity_levels": ["major", "critical"],
"parent": "STRUCTURAL_DEFECT"
},
{
"id": "DIMENSIONAL_OOS",
"label": "Dimensional Out-of-Spec",
"description": "Part dimension outside specified tolerance",
"aliases": ["oos", "out_of_spec", "tolerance_fail"],
"severity_levels": ["major", "critical"],
"parent": "DIMENSIONAL_DEFECT"
}
]
}
Stage 3: Query
Building Training Dataset
Tomas constructs the training dataset by joining image metadata with standardized inspection outcomes:
-- Build labeled training dataset with balanced sampling
WITH standardized_labels AS (
SELECT
d.image_id,
CASE
WHEN d.defect_type IN ('scratch', 'surface scratch', 'SCRATCH_SURFACE')
THEN 'SURFACE_SCRATCH'
WHEN d.defect_type IN ('burr', 'edge burr', 'rough_edge')
THEN 'BURR'
WHEN d.defect_type IN ('pit', 'pitting', 'corrosion_pit')
THEN 'SURFACE_PIT'
WHEN d.defect_type IN ('crack', 'fracture', 'hairline_crack')
THEN 'CRACK'
ELSE 'OTHER'
END AS defect_class,
d.bounding_box,
d.confidence AS annotator_confidence
FROM defect_annotations d
WHERE d.confidence >= 3 -- filter low-confidence annotations
),
image_labels AS (
SELECT
im.image_id,
im.image_path,
im.part_type,
im.camera_id,
im.capture_timestamp,
COALESCE(sl.defect_class, 'NO_DEFECT') AS label,
sl.bounding_box,
ir.decision AS inspector_decision
FROM image_metadata im
LEFT JOIN standardized_labels sl ON im.image_id = sl.image_id
LEFT JOIN inspection_results ir ON im.image_id = ir.image_id
)
SELECT
image_id,
image_path,
part_type,
camera_id,
label,
bounding_box,
inspector_decision,
-- Stratified split: same part_type distribution in train/test
NTILE(5) OVER (PARTITION BY label, part_type ORDER BY capture_timestamp)
AS fold_id
FROM image_labels
WHERE capture_timestamp >= CURRENT_DATE - INTERVAL '180 days'
Dataset composition after balancing:
| Class | Raw Count | After Balancing | Train | Validation | Test |
|---|---|---|---|---|---|
| NO_DEFECT | 212,000 | 15,000 (subsampled) | 10,500 | 2,250 | 2,250 |
| SURFACE_SCRATCH | 10,100 | 10,100 | 7,070 | 1,515 | 1,515 |
| BURR | 6,500 | 6,500 | 4,550 | 975 | 975 |
| SURFACE_PIT | 5,300 | 5,300 | 3,710 | 795 | 795 |
| CRACK | 2,400 | 2,400 (+ augmentation) | 1,680 | 360 | 360 |
| DIMENSIONAL_OOS | 1,700 | 1,700 (+ augmentation) | 1,190 | 255 | 255 |
| Total | 238,000 | 41,000 | 28,700 | 6,150 | 6,150 |
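The majority-class subsampling and 70/15/15 split above can be sketched in plain Python. This is a simplified sketch, not the actual pipeline (which assigns folds via the SQL `NTILE` above); `balance_and_split` and the record shape are illustrative.

```python
import random
from collections import defaultdict

def balance_and_split(records, cap=15_000, seed=42):
    """Cap the majority class, then split each class 70/15/15 by capture time.

    `records` is a list of dicts with 'label' and 'capture_timestamp' keys.
    Returns (train, val, test) lists.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)

    train, val, test = [], [], []
    for label, rows in by_label.items():
        if len(rows) > cap:  # subsample majority classes such as NO_DEFECT
            rows = rng.sample(rows, cap)
        # Sort by time so the test split is strictly later than training,
        # matching the temporal-leakage check used later in the project
        rows.sort(key=lambda r: r["capture_timestamp"])
        n = len(rows)
        t, v = int(n * 0.70), int(n * 0.85)
        train += rows[:t]
        val += rows[t:v]
        test += rows[v:]
    return train, val, test
```

Splitting per class keeps the label distribution identical across train, validation, and test, which is what "stratified" means here.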
Feature Vector Preparation
For models that use pre-extracted feature vectors (transfer learning from ImageNet), Tomas queries the precomputed embeddings stored as parquet on S3:
-- Query precomputed image embeddings for model training
SELECT
e.image_id,
e.embedding_vector, -- 2048-dim float array (ResNet-50 penultimate layer)
il.label,
il.part_type
FROM s3.image_embeddings e
JOIN image_labels il ON e.image_id = il.image_id
WHERE il.fold_id <= 4 -- train + validation folds
Stage 4: Orchestration
Model Training Pipeline
Tomas builds the end-to-end training pipeline using the Pipeline Service:
Visual Inspection Training Pipeline (Weekly, Saturday 2 AM)
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌───────────┐
│ 1. Validate │───▶│ 2. Augment │───▶│ 3. Train │───▶│ 4. Eval │
│ Labels │ │ Data │ │ (Ray Train │ │ Model │
│ │ │ │ │ 4 GPUs) │ │ │
│ Agreement │ │ Rotation, │ │ │ │ Precision │
│ rate > 90% │ │ flip, noise,│ │ EfficientNet│ │ > 0.95? │
│ │ │ brightness │ │ -B4 fine- │ │ │
└─────────────┘ └─────────────┘ │ tune │ └─────┬─────┘
└──────────────┘ │
┌─────┴─────┐
│ Yes │ No
▼ ▼
┌───────────┐ ┌──────────┐
│ 5. Deploy │ │ Alert + │
│ (shadow │ │ manual │
│ first) │ │ review │
└───────────┘ └──────────┘
Temporal workflow definition:
{
"workflow": "visual_inspection_training",
"schedule": "0 2 * * 6",
"activities": [
{
"name": "validate_labels",
"type": "data_quality",
"config": {
"suite": "defect_label_quality",
"expectations": [
{
"type": "expect_column_pair_values_a_to_be_greater_than_b",
"column_A": "inter_annotator_agreement",
"column_B": "0.90",
"description": "Label agreement rate must exceed 90%"
},
{
"type": "expect_column_values_to_be_in_set",
"column": "defect_class",
"value_set": ["NO_DEFECT", "SURFACE_SCRATCH", "BURR", "SURFACE_PIT", "CRACK", "DIMENSIONAL_OOS"]
},
{
"type": "expect_table_row_count_to_be_between",
"min": 30000,
"description": "Minimum training samples"
}
]
}
},
{
"name": "augment_data",
"type": "python_activity",
"config": {
"script": "augmentation/defect_augmentation.py",
"augmentations": [
{"type": "random_rotation", "degrees": 15},
{"type": "horizontal_flip", "probability": 0.5},
{"type": "gaussian_noise", "std": 0.02},
{"type": "brightness_contrast", "brightness": 0.2, "contrast": 0.2},
{"type": "random_crop_resize", "scale_range": [0.8, 1.0]}
],
"oversample_classes": ["CRACK", "DIMENSIONAL_OOS"],
"target_min_per_class": 3000
}
},
{
"name": "train_model",
"type": "ml_training",
"config": {
"experiment": "visual_quality_inspection",
"framework": "pytorch",
"model_architecture": "efficientnet_b4",
"pretrained_weights": "imagenet",
"training": {
"epochs": 50,
"batch_size": 32,
"learning_rate": 0.001,
"lr_scheduler": "cosine_annealing",
"warmup_epochs": 5,
"early_stopping_patience": 10,
"optimizer": "adamw",
"weight_decay": 0.01
},
"compute": {
"type": "ray_train",
"num_workers": 4,
"resources_per_worker": {"cpu": 4, "gpu": 1, "memory_gb": 32}
}
},
"timeout": "4h"
},
{
"name": "evaluate_model",
"type": "ml_evaluation",
"config": {
"metrics": ["accuracy", "precision", "recall", "f1", "confusion_matrix"],
"per_class": true,
"deployment_gate": {
"overall_precision": {"min": 0.95},
"overall_recall": {"min": 0.93},
"crack_recall": {"min": 0.99, "description": "Safety-critical defect must have near-perfect recall"},
"false_reject_rate": {"max": 0.05}
}
}
},
{
"name": "deploy_shadow",
"type": "conditional",
"condition": "evaluate_model.gate_passed == true",
"activity": {
"name": "shadow_deployment",
"type": "ml_deployment",
"config": {
"model_name": "visual_quality_inspector",
"deployment_mode": "shadow",
"shadow_duration_hours": 48,
"compare_with": "current_production_model"
}
}
}
]
}
Stage 5: Analysis
Training Data Quality Validation
Before training, Tomas validates the label quality rigorously:
Inter-annotator agreement analysis:
-- Measure label agreement rate on images labeled by multiple annotators
-- (assumes standardized_labels also carries annotator_id from defect_annotations)
SELECT
a1.image_id,
a1.defect_class AS annotator_1_label,
a2.defect_class AS annotator_2_label,
CASE WHEN a1.defect_class = a2.defect_class THEN 1 ELSE 0 END AS agreement
FROM standardized_labels a1
JOIN standardized_labels a2
ON a1.image_id = a2.image_id
AND a1.annotator_id < a2.annotator_id
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Overall agreement rate | 91.3% | > 90% | Pass |
| Agreement on CRACK class | 96.7% | > 95% | Pass |
| Agreement on NO_DEFECT | 94.2% | > 90% | Pass |
| Agreement on SURFACE_SCRATCH vs SURFACE_PIT | 83.1% | > 80% | Pass (marginal) |
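The rates in this table are raw percent agreement. A stricter, chance-corrected check is Cohen's kappa, which can be computed from the same annotator pairs; this sketch is illustrative and not part of the documented pipeline.

```python
from collections import Counter

def cohens_kappa(pairs):
    """Cohen's kappa for two annotators' labels on the same images.

    `pairs` is a list of (label_a, label_b) tuples, one per doubly-labeled image.
    """
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    # Expected chance agreement from each annotator's marginal label distribution
    ca = Counter(a for a, _ in pairs)
    cb = Counter(b for _, b in pairs)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means no better than chance, which is why it is a more honest metric than raw agreement when one class (here NO_DEFECT) dominates.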
Data leakage check:
| Check | Method | Result |
|---|---|---|
| Train/test image overlap | Hash comparison of image_ids across splits | 0 overlaps (clean) |
| Same-part leakage | Verify no production_order appears in both train and test | Clean |
| Temporal leakage | Test set is strictly after training set chronologically | Clean |
| Camera bias | Uniform camera distribution across train/test | Balanced (+/- 3%) |
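The first two leakage checks reduce to set intersections over the split membership; a minimal sketch, with `check_split_leakage` and the record fields as illustrative names:

```python
def check_split_leakage(train, test):
    """Return (image_overlap, order_overlap) between train and test records.

    Each record is a dict with 'image_id' and 'production_order_id'.
    Both returned sets should be empty for a clean split.
    """
    image_overlap = ({r["image_id"] for r in train}
                     & {r["image_id"] for r in test})
    # Same-part leakage: one production order must not span both splits
    order_overlap = ({r["production_order_id"] for r in train}
                     & {r["production_order_id"] for r in test})
    return image_overlap, order_overlap
```

Checking production orders as well as image IDs matters because multiple near-identical images of the same part would otherwise let the model memorize its way across the split boundary.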
Defect distribution across production lines:
Defect Distribution by Production Line
═══════════════════════════════════════
Line A (Plant 1): SCRATCH ████████ BURR ██████ PIT ████ CRACK ██
Line B (Plant 1): SCRATCH ██████ BURR ████ PIT ██████ CRACK █
Line C (Plant 2): SCRATCH ██████████ BURR ████ PIT ███ CRACK ██
Line D (Plant 2): SCRATCH █████ BURR ████████ PIT ████ CRACK ███
Note: Line D has elevated burr rate -- check tool wear patterns
Stage 6: Productionization
Edge Deployment Architecture
The visual inspection system must run close to the production line for latency requirements. Tomas deploys via Ray Serve with a dedicated endpoint:
Production Line Camera ──▶ Edge Server ──▶ Ray Serve ──▶ Decision
│ │ │ │
│ 4K image capture │ Preprocess │ CNN │ PASS/FAIL
│ every 2.5 seconds │ + resize to │ inference │ + defect
│ per part │ 380x380 │ < 200ms │ class
│ │ │ │
│ └── GPU: NVIDIA │ │
│ T4 (16GB) │ │
│ │ │
│ Fallback: if confidence < 0.8 │
│ route to human inspector queue │
Ray Serve deployment configuration:
{
"deployment": {
"model_name": "visual_quality_inspector",
"model_version": "v2.1",
"serving_framework": "ray_serve",
"endpoint": "/api/v1/ml/predict/visual-inspection",
"config": {
"num_replicas": 4,
"max_concurrent_queries": 50,
"ray_actor_options": {
"num_cpus": 2,
"num_gpus": 0.5,
"memory": 8589934592
},
"autoscaling_config": {
"min_replicas": 2,
"max_replicas": 8,
"target_num_ongoing_requests_per_replica": 10,
"upscale_delay_s": 30,
"downscale_delay_s": 300
}
},
"preprocessing": {
"resize": [380, 380],
"normalize": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]},
"format": "RGB"
},
"postprocessing": {
"confidence_threshold": 0.8,
"low_confidence_action": "route_to_human_inspector",
"multi_defect_handling": "report_all_above_threshold"
}
}
}
Sample inference response:
{
"image_id": "img-2026-02-28-143022-cam3",
"prediction": "FAIL",
"defects": [
{
"class": "SURFACE_SCRATCH",
"confidence": 0.94,
"severity": "major",
"bounding_box": {"x": 120, "y": 340, "width": 85, "height": 12},
"area_percentage": 0.7
}
],
"overall_confidence": 0.94,
"inference_time_ms": 67,
"recommended_action": "REWORK",
"model_version": "v2.1",
"camera_id": "cam-line-a-03",
"timestamp": "2026-02-28T14:30:22.451Z"
}
Latency performance under production load:
| Metric | Target | Actual | Status |
|---|---|---|---|
| P50 inference latency | < 100ms | 52ms | Pass |
| P95 inference latency | < 150ms | 89ms | Pass |
| P99 inference latency | < 200ms | 134ms | Pass |
| Throughput | > 25 images/sec per replica | 31 images/sec | Pass |
| Availability | > 99.9% | 99.97% (in first month) | Pass |
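The postprocessing policy from the deployment config (0.8 confidence threshold, report-all-above-threshold, low-confidence fallback to the human inspector queue) can be sketched as plain routing logic. Function and field names are illustrative, and the 0.5 lower band for human review is an assumed policy detail, not taken from the config.

```python
CONFIDENCE_THRESHOLD = 0.8

def route_prediction(defects):
    """Apply the serving postprocessing policy to per-class detections.

    `defects` is a list of {'class': str, 'confidence': float} dicts for one
    image. Returns (decision, reported_defects).
    """
    # report_all_above_threshold: every confident detection is reported
    reported = [d for d in defects if d["confidence"] >= CONFIDENCE_THRESHOLD]
    if reported:
        return "FAIL", reported
    # Assumed band: sub-threshold but non-trivial confidence goes to a human
    if any(d["confidence"] >= 0.5 for d in defects):
        return "HUMAN_REVIEW", []
    return "PASS", []
```

Keeping the decision logic outside the model makes the threshold tunable per line without retraining.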
Stage 7: Feedback
Real-Time Monitoring Dashboard
Tomas configures continuous monitoring in the ML Workbench:
| Metric | Frequency | Alert Threshold | Current Value |
|---|---|---|---|
| False reject rate (FRR) | Hourly | > 5% | 3.2% |
| False accept rate (FAR) | Hourly | > 1% | 0.4% |
| Throughput (images/min) | Continuous | < 20/min per line | 24/min |
| Inference latency P99 | Continuous | > 200ms | 134ms |
| Human override rate | Daily | > 10% | 6.8% |
| Model confidence distribution | Daily | Mean < 0.85 | Mean 0.92 |
Inspector override tracking:
The system tracks when human inspectors disagree with the model's decision. These overrides serve as a continuous feedback signal:
-- Track inspector overrides as model feedback
SELECT
DATE_TRUNC('week', override_timestamp) AS week,
model_prediction,
inspector_decision,
COUNT(*) AS override_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
PARTITION BY DATE_TRUNC('week', override_timestamp)
) AS override_pct
FROM inspection_overrides
WHERE override_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, override_count DESC
| Override Pattern | Count/Week | Trend | Action |
|---|---|---|---|
| Model: FAIL, Inspector: PASS | 34 | Decreasing | Acceptable -- conservative model |
| Model: PASS, Inspector: FAIL | 8 | Stable | Critical -- investigate each case |
| Model: SCRATCH, Inspector: PIT | 12 | Stable | Add to active learning queue |
Alert Configuration
{
"alerts": [
{
"name": "false_accept_spike",
"metric": "false_accept_rate_hourly",
"condition": "> 1.0%",
"severity": "critical",
"action": "pause_auto_pass_and_notify",
"notify": ["tomas.rivera@apex.com", "#quality-ops", "shift-supervisor"]
},
{
"name": "latency_degradation",
"metric": "inference_latency_p99",
"condition": "> 200ms for 5 consecutive minutes",
"severity": "high",
"action": "scale_up_replicas",
"notify": ["tomas.rivera@apex.com"]
},
{
"name": "model_confidence_drift",
"metric": "mean_prediction_confidence",
"condition": "< 0.85 daily average",
"severity": "warning",
"action": "trigger_investigation_workflow",
"notify": ["tomas.rivera@apex.com", "lin.wei@apex.com"]
}
]
}
Stage 8: Experimentation
Architecture Comparison
Tomas runs a structured experiment comparing three CNN architectures:
| Model | Accuracy | Precision | Recall | F1 | Inference P99 | Model Size |
|---|---|---|---|---|---|---|
| ResNet-50 (fine-tuned) | 95.8% | 0.94 | 0.93 | 0.935 | 98ms | 98 MB |
| EfficientNet-B4 (fine-tuned) | 97.2% | 0.96 | 0.95 | 0.955 | 134ms | 75 MB |
| YOLOv8-m (defect localization) | 96.1% | 0.95 | 0.94 | 0.945 | 78ms | 52 MB |
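The F1 column is the harmonic mean of precision and recall, which can be verified directly against the table:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# EfficientNet-B4 row: P = 0.96, R = 0.95 -> F1 = 0.955 (to 3 decimals)
```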
Per-class performance (EfficientNet-B4, production model):
| Defect Class | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| NO_DEFECT | 0.98 | 0.97 | 0.975 | Low false reject rate |
| SURFACE_SCRATCH | 0.95 | 0.94 | 0.945 | Most common defect |
| BURR | 0.96 | 0.95 | 0.955 | Clear visual signature |
| SURFACE_PIT | 0.93 | 0.92 | 0.925 | Sometimes confused with scratch |
| CRACK | 0.99 | 0.99 | 0.990 | Safety-critical, tuned for recall |
| DIMENSIONAL_OOS | 0.91 | 0.89 | 0.900 | Hardest class -- subtle visual cues |
Decision: EfficientNet-B4 selected for production. YOLOv8 earmarked for a future iteration where defect localization (bounding box prediction) is needed for automated rework guidance.
Active Learning Experiment
Tomas tests whether active learning -- using the model's own low-confidence predictions to select the most informative images for human labeling -- can reduce labeling costs:
Active Learning Loop
════════════════════
Production ──▶ Model Inference ──▶ Confidence < 0.8? ──▶ Human Label Queue
│ │ │
│ ┌────┴─────┐ │
│ │ Yes │ No │
│ ▼ ▼ │
│ Queue for Log prediction │
│ labeling as ground truth │
│ │ │
│ ▼ │
│ Add to next │
│ training batch ◀──────────────────┘
| Training Strategy | Labeled Images Used | Accuracy | Label Cost |
|---|---|---|---|
| Random sampling (baseline) | 41,000 | 97.2% | $82K ($2/label) |
| Active learning (uncertainty) | 24,600 | 97.1% | $49K |
| Active learning (diversity + uncertainty) | 22,100 | 96.9% | $44K |
Finding: Active learning achieves comparable accuracy with 40% fewer labeled images, saving approximately $38K per training cycle in labeling costs.
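The uncertainty strategy simply ranks unlabeled predictions by model confidence and sends the least confident to annotators. A minimal sketch, with `select_for_labeling` and the record fields as illustrative names (the diversity variant would additionally de-duplicate visually similar images):

```python
def select_for_labeling(predictions, budget):
    """Pick the `budget` most uncertain predictions for human labeling.

    `predictions` is a list of {'image_id': str, 'confidence': float} dicts;
    lower model confidence = more informative to label.
    """
    ranked = sorted(predictions, key=lambda p: p["confidence"])
    return [p["image_id"] for p in ranked[:budget]]
```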
Production Impact Summary
After 2 months in production across 2 production lines:
| Metric | Before (Manual) | After (AI + Human Fallback) | Change |
|---|---|---|---|
| Inspection throughput | 200 parts/shift/inspector | 1,400 parts/shift/line | +600% |
| Defect detection rate | 85% | 97.2% | +12.2pp |
| False reject rate | 8% | 3.2% | -4.8pp |
| Inspector labor cost | $1.4M/year | $420K/year (oversight role) | -70% |
| Customer escapes (defects shipped) | 340/year | 41/year | -88% |
| Scrap cost reduction | -- | $860K/year | Significant |
Related Walkthroughs
- Data Scientist Journey -- Lin Wei's sensor feature engineering techniques inform Tomas's data preparation approach
- BI Lead Journey -- Carlos tracks quality metrics from Tomas's inspection system in the OEE dashboard
- Executive Journey -- Karen reviews the quality inspection ROI and expansion plan
- Manufacturing Overview -- Full dataset and KPI reference