ML Engineer Journey: Visual Quality Inspection System
Persona: Tomas Rivera, ML Engineer at Apex Manufacturing
Goal: Deploy an automated visual inspection system for real-time defect detection on the production line
Primary Workbenches: ML Workbench, Data Workbench, Pipeline Service
Timeline: 8-week project from data preparation to edge deployment
Business Context
Apex Manufacturing currently relies on human inspectors at the end of each production line to identify surface defects on machined parts. Each inspector examines roughly 200 parts per shift using magnification and gauging tools. The process has three problems: it is slow (45 seconds per part), inconsistent (inspector agreement rate is only 82%), and expensive ($2M in scrap and in missed defects that reach customers).
Tomas Rivera's objective is to build and deploy a CNN-based visual inspection system that captures images of every part at line speed, classifies defects in real time, and routes rejected parts to rework stations automatically. The system must achieve > 97% accuracy with P99 inference latency under 200ms to keep pace with the production line.
Stage 1: Ingestion
Connecting Image Metadata
The camera system at each production line captures high-resolution images (4096x3072, 12MP) and stores them on a local NAS. Tomas does not ingest raw images into the platform -- instead, he ingests the metadata and labels, using the platform to manage the ML lifecycle while images remain on fast local storage accessible to training infrastructure.
Image metadata ingestion configuration:
{
"source": {
"type": "postgresql",
"config": {
"host": "qms-db.apex.internal",
"database": "quality_vision",
"schema": "inspection",
"tables": [
"image_metadata",
"defect_annotations",
"inspection_results",
"camera_calibration"
]
}
},
"sync": {
"mode": "cdc_incremental",
"frequency": "every_15_minutes",
"cursor_field": "updated_at"
}
}
Connected Data Sources
| Source | Type | Connector | Key Tables | Frequency |
|---|---|---|---|---|
| Quality Vision DB | PostgreSQL | Airbyte CDC | image_metadata, defect_annotations | Every 15 min |
| Quality Management System | PostgreSQL | Airbyte CDC | inspection_results, defect_taxonomy | Every 15 min |
| SAP ERP | PostgreSQL | Airbyte CDC | production_orders, product_specs | Every 15 min |
| Camera calibration logs | CSV Import | File Import | camera_config, lens_distortion_params | Monthly |
Data Volume Assessment
| Dataset | Volume | Growth Rate |
|---|---|---|
| image_metadata | 250K records (6 months of history) | ~1,400 images/day (2 lines, 2 shifts) |
| defect_annotations | 38K labeled defects across 22K images | ~200 new annotations/day |
| inspection_results | 1M historical results (human inspector decisions) | ~1,400/day |
| production_orders | 500K orders | ~300/day |
| Raw images (on NAS, not ingested) | 4.2 TB | ~18 GB/day |
Stage 2: Discovery
Mapping Quality-Related Assets
Tomas explores the data catalog in the Data Workbench to understand the quality data landscape:
Catalog: apex_manufacturing
└── quality/
├── image_metadata (250K rows)
│ ├── image_id (UUID, unique)
│ ├── production_order_id (FK to production_orders)
│ ├── machine_id (FK to equipment_registry)
│ ├── camera_id (1 of 8 cameras across 2 lines)
│ ├── capture_timestamp (millisecond precision)
│ ├── image_path (NAS path, not in platform)
│ ├── part_type (product SKU)
│ └── exposure_settings (JSON: ISO, aperture, shutter)
├── defect_annotations (38K rows)
│ ├── annotation_id (UUID)
│ ├── image_id (FK to image_metadata)
│ ├── defect_type (string -- INCONSISTENT across plants)
│ ├── bounding_box (JSON: x, y, width, height)
│ ├── annotator_id (human labeler)
│ └── confidence (annotator self-assessed, 1-5)
└── inspection_results (1M rows)
├── result_id (UUID)
├── image_id (FK)
├── inspector_id (human inspector)
├── decision (PASS / FAIL / REWORK)
    └── defect_codes (array of codes)
Discovering Labeling Inconsistencies
Data profiling reveals a critical issue: the same defect type is named differently across plants and even between annotators at the same plant.
-- Discover labeling inconsistencies across plants
SELECT
e.plant_id,
d.defect_type,
COUNT(*) AS annotation_count
FROM defect_annotations d
JOIN image_metadata im ON d.image_id = im.image_id
JOIN equipment_registry e ON im.machine_id = e.machine_id
GROUP BY e.plant_id, d.defect_type
ORDER BY e.plant_id, annotation_count DESC
| Plant | Defect Label (as entered) | Count | Standardized Label |
|---|---|---|---|
| Plant 1 | scratch | 4,200 | SURFACE_SCRATCH |
| Plant 2 | surface scratch | 3,100 | SURFACE_SCRATCH |
| Plant 3 | SCRATCH_SURFACE | 2,800 | SURFACE_SCRATCH |
| Plant 1 | burr | 3,600 | BURR |
| Plant 2 | edge burr | 2,900 | BURR |
| Plant 1 | pit | 2,100 | SURFACE_PIT |
| Plant 2 | pitting | 1,800 | SURFACE_PIT |
| Plant 3 | corrosion_pit | 1,400 | SURFACE_PIT |
| All | crack / fracture / hairline_crack | 2,400 | CRACK |
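The alias-to-canonical mapping driving this standardization can be sketched as a small lookup. This is a minimal illustration: `normalize_defect_label` is a hypothetical helper, and in practice the alias sets would be generated from the ontology's `aliases` fields rather than hard-coded.

```python
# Hypothetical alias map mirroring the raw labels observed in the table above.
# In production this would be derived from the ontology's "aliases" fields.
DEFECT_ALIASES = {
    "SURFACE_SCRATCH": {"scratch", "surface scratch", "scratch_surface", "scr"},
    "BURR": {"burr", "edge burr", "deburr_needed", "rough_edge"},
    "SURFACE_PIT": {"pit", "pitting", "corrosion_pit", "surface_void"},
    "CRACK": {"crack", "fracture", "hairline_crack", "stress_crack"},
}

# Invert to a flat lookup: raw label (lowercased) -> canonical class
_LOOKUP = {alias: canon
           for canon, aliases in DEFECT_ALIASES.items()
           for alias in aliases}

def normalize_defect_label(raw: str) -> str:
    """Map a free-text annotator label to its canonical defect class."""
    return _LOOKUP.get(raw.strip().lower(), "OTHER")
```

Lowercasing before lookup collapses case variants like `SCRATCH_SURFACE` and `scratch_surface` into one alias entry.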
Standardizing via Ontology Service
Tomas creates a defect taxonomy in the Ontology Service to standardize labels:
{
"ontology": "apex_defect_taxonomy",
"version": "1.0",
"classes": [
{
"id": "SURFACE_SCRATCH",
"label": "Surface Scratch",
"description": "Linear mark on machined surface caused by tool contact or handling",
"aliases": ["scratch", "surface scratch", "SCRATCH_SURFACE", "scr"],
"severity_levels": ["minor", "major", "critical"],
"parent": "SURFACE_DEFECT"
},
{
"id": "BURR",
"label": "Burr",
"description": "Raised edge or small piece of material remaining after machining",
"aliases": ["burr", "edge burr", "deburr_needed", "rough_edge"],
"severity_levels": ["minor", "major"],
"parent": "EDGE_DEFECT"
},
{
"id": "SURFACE_PIT",
"label": "Surface Pit",
"description": "Small cavity or depression in the machined surface",
"aliases": ["pit", "pitting", "corrosion_pit", "surface_void"],
"severity_levels": ["minor", "major", "critical"],
"parent": "SURFACE_DEFECT"
},
{
"id": "CRACK",
"label": "Crack",
"description": "Fracture line in the material, potentially structural",
"aliases": ["crack", "fracture", "hairline_crack", "stress_crack"],
"severity_levels": ["major", "critical"],
"parent": "STRUCTURAL_DEFECT"
},
{
"id": "DIMENSIONAL_OOS",
"label": "Dimensional Out-of-Spec",
"description": "Part dimension outside specified tolerance",
"aliases": ["oos", "out_of_spec", "tolerance_fail"],
"severity_levels": ["major", "critical"],
"parent": "DIMENSIONAL_DEFECT"
}
]
}
Stage 3: Query
Building Training Dataset
Tomas constructs the training dataset by joining image metadata with standardized inspection outcomes:
-- Build labeled training dataset with balanced sampling
WITH standardized_labels AS (
SELECT
d.image_id,
CASE
WHEN d.defect_type IN ('scratch', 'surface scratch', 'SCRATCH_SURFACE')
THEN 'SURFACE_SCRATCH'
WHEN d.defect_type IN ('burr', 'edge burr', 'rough_edge')
THEN 'BURR'
WHEN d.defect_type IN ('pit', 'pitting', 'corrosion_pit')
THEN 'SURFACE_PIT'
WHEN d.defect_type IN ('crack', 'fracture', 'hairline_crack')
THEN 'CRACK'
ELSE 'OTHER'
END AS defect_class,
d.bounding_box,
d.confidence AS annotator_confidence
FROM defect_annotations d
WHERE d.confidence >= 3 -- filter low-confidence annotations
),
image_labels AS (
SELECT
im.image_id,
im.image_path,
im.part_type,
im.camera_id,
im.capture_timestamp,
COALESCE(sl.defect_class, 'NO_DEFECT') AS label,
sl.bounding_box,
ir.decision AS inspector_decision
FROM image_metadata im
LEFT JOIN standardized_labels sl ON im.image_id = sl.image_id
LEFT JOIN inspection_results ir ON im.image_id = ir.image_id
)
SELECT
image_id,
image_path,
part_type,
camera_id,
label,
bounding_box,
inspector_decision,
-- Stratified split: same part_type distribution in train/test
NTILE(5) OVER (PARTITION BY label, part_type ORDER BY capture_timestamp)
AS fold_id
FROM image_labels
WHERE capture_timestamp >= CURRENT_DATE - INTERVAL '180 days'
Dataset composition after balancing:
| Class | Raw Count | After Balancing | Train | Validation | Test |
|---|---|---|---|---|---|
| NO_DEFECT | 212,000 | 15,000 (subsampled) | 10,500 | 2,250 | 2,250 |
| SURFACE_SCRATCH | 10,100 | 10,100 | 7,070 | 1,515 | 1,515 |
| BURR | 6,500 | 6,500 | 4,550 | 975 | 975 |
| SURFACE_PIT | 5,300 | 5,300 | 3,710 | 795 | 795 |
| CRACK | 2,400 | 2,400 (+ augmentation) | 1,680 | 360 | 360 |
| DIMENSIONAL_OOS | 1,700 | 1,700 (+ augmentation) | 1,190 | 255 | 255 |
| Total | 238,000 | 41,000 | 28,700 | 6,150 | 6,150 |
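The majority-class subsampling and 70/15/15 split above can be sketched in plain Python. This is a simplified sketch, not the actual pipeline (which assigns folds via the SQL `NTILE` above); `balance_and_split` and the record shape are illustrative.

```python
import random
from collections import defaultdict

def balance_and_split(records, cap=15_000, seed=42):
    """Cap the majority class, then split each class 70/15/15 by capture time.

    `records` is a list of dicts with 'label' and 'capture_timestamp' keys.
    Returns (train, val, test) lists.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for r in records:
        by_label[r["label"]].append(r)

    train, val, test = [], [], []
    for label, rows in by_label.items():
        if len(rows) > cap:  # subsample majority classes such as NO_DEFECT
            rows = rng.sample(rows, cap)
        # Sort by time so the test split is strictly later than training,
        # matching the temporal-leakage check used later in the project
        rows.sort(key=lambda r: r["capture_timestamp"])
        n = len(rows)
        t, v = int(n * 0.70), int(n * 0.85)
        train += rows[:t]
        val += rows[t:v]
        test += rows[v:]
    return train, val, test
```

Splitting per class keeps the label distribution identical across train, validation, and test, which is what "stratified" means here.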
Feature Vector Preparation
For models that use pre-extracted feature vectors (transfer learning from ImageNet), Tomas queries the precomputed embeddings stored as parquet on S3:
-- Query precomputed image embeddings for model training
SELECT
e.image_id,
e.embedding_vector, -- 2048-dim float array (ResNet-50 penultimate layer)
il.label,
il.part_type
FROM s3.image_embeddings e
JOIN image_labels il ON e.image_id = il.image_id
WHERE il.fold_id <= 4 -- train + validation folds
Stage 4: Orchestration
Model Training Pipeline
Tomas builds the end-to-end training pipeline using the Pipeline Service:
Visual Inspection Training Pipeline (Weekly, Saturday 2 AM)
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌───────────┐
│ 1. Validate │───▶│ 2. Augment │───▶│ 3. Train │───▶│ 4. Eval │
│ Labels │ │ Data │ │ (Ray Train │ │ Model │
│ │ │ │ │ 4 GPUs) │ │ │
│ Agreement │ │ Rotation, │ │ │ │ Precision │
│ rate > 90% │ │ flip, noise,│ │ EfficientNet│ │ > 0.95? │
│ │ │ brightness │ │ -B4 fine- │ │ │
└─────────────┘ └─────────────┘ │ tune │ └─────┬─────┘
└──────────────┘ │
┌─────┴─────┐
│ Yes │ No
▼ ▼
┌───────────┐ ┌──────────┐
│ 5. Deploy │ │ Alert + │
│ (shadow │ │ manual │
│ first) │ │ review │
└───────────┘ └──────────┘
Temporal workflow definition:
{
"workflow": "visual_inspection_training",
"schedule": "0 2 * * 6",
"activities": [
{
"name": "validate_labels",
"type": "data_quality",
"config": {
"suite": "defect_label_quality",
"expectations": [
{
"type": "expect_column_pair_values_a_to_be_greater_than_b",
"column_A": "inter_annotator_agreement",
"column_B": "0.90",
"description": "Label agreement rate must exceed 90%"
},
{
"type": "expect_column_values_to_be_in_set",
"column": "defect_class",
"value_set": ["NO_DEFECT", "SURFACE_SCRATCH", "BURR", "SURFACE_PIT", "CRACK", "DIMENSIONAL_OOS"]
},
{
"type": "expect_table_row_count_to_be_between",
"min": 30000,
"description": "Minimum training samples"
}
]
}
},
{
"name": "augment_data",
"type": "python_activity",
"config": {
"script": "augmentation/defect_augmentation.py",
"augmentations": [
{"type": "random_rotation", "degrees": 15},
{"type": "horizontal_flip", "probability": 0.5},
{"type": "gaussian_noise", "std": 0.02},
{"type": "brightness_contrast", "brightness": 0.2, "contrast": 0.2},
{"type": "random_crop_resize", "scale_range": [0.8, 1.0]}
],
"oversample_classes": ["CRACK", "DIMENSIONAL_OOS"],
"target_min_per_class": 3000
}
},
{
"name": "train_model",
"type": "ml_training",
"config": {
"experiment": "visual_quality_inspection",
"framework": "pytorch",
"model_architecture": "efficientnet_b4",
"pretrained_weights": "imagenet",
"training": {
"epochs": 50,
"batch_size": 32,
"learning_rate": 0.001,
"lr_scheduler": "cosine_annealing",
"warmup_epochs": 5,
"early_stopping_patience": 10,
"optimizer": "adamw",
"weight_decay": 0.01
},
"compute": {
"type": "ray_train",
"num_workers": 4,
"resources_per_worker": {"cpu": 4, "gpu": 1, "memory_gb": 32}
}
},
"timeout": "4h"
},
{
"name": "evaluate_model",
"type": "ml_evaluation",
"config": {
"metrics": ["accuracy", "precision", "recall", "f1", "confusion_matrix"],
"per_class": true,
"deployment_gate": {
"overall_precision": {"min": 0.95},
"overall_recall": {"min": 0.93},
"crack_recall": {"min": 0.99, "description": "Safety-critical defect must have near-perfect recall"},
"false_reject_rate": {"max": 0.05}
}
}
},
{
"name": "deploy_shadow",
"type": "conditional",
"condition": "evaluate_model.gate_passed == true",
"activity": {
"name": "shadow_deployment",
"type": "ml_deployment",
"config": {
"model_name": "visual_quality_inspector",
"deployment_mode": "shadow",
"shadow_duration_hours": 48,
"compare_with": "current_production_model"
}
}
}
]
}
Stage 5: Analysis
Training Data Quality Validation
Before training, Tomas validates the label quality rigorously:
Inter-annotator agreement analysis:
-- Measure label agreement rate on images labeled by multiple annotators
-- (assumes standardized_labels also carries annotator_id from defect_annotations)
SELECT
a1.image_id,
a1.defect_class AS annotator_1_label,
a2.defect_class AS annotator_2_label,
CASE WHEN a1.defect_class = a2.defect_class THEN 1 ELSE 0 END AS agreement
FROM standardized_labels a1
JOIN standardized_labels a2
ON a1.image_id = a2.image_id
AND a1.annotator_id < a2.annotator_id
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Overall agreement rate | 91.3% | > 90% | Pass |
| Agreement on CRACK class | 96.7% | > 95% | Pass |
| Agreement on NO_DEFECT | 94.2% | > 90% | Pass |
| Agreement on SURFACE_SCRATCH vs SURFACE_PIT | 83.1% | > 80% | Pass (marginal) |
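The rates in this table are raw percent agreement. A stricter, chance-corrected check is Cohen's kappa, which can be computed from the same annotator pairs; this sketch is illustrative and not part of the documented pipeline.

```python
from collections import Counter

def cohens_kappa(pairs):
    """Cohen's kappa for two annotators' labels on the same images.

    `pairs` is a list of (label_a, label_b) tuples, one per doubly-labeled image.
    """
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    # Expected chance agreement from each annotator's marginal label distribution
    ca = Counter(a for a, _ in pairs)
    cb = Counter(b for _, b in pairs)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means perfect agreement; 0.0 means no better than chance, which is why it is a more honest metric than raw agreement when one class (here NO_DEFECT) dominates.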
Data leakage check:
| Check | Method | Result |
|---|---|---|
| Train/test image overlap | Hash comparison of image_ids across splits | 0 overlaps (clean) |
| Same-part leakage | Verify no production_order appears in both train and test | Clean |
| Temporal leakage | Test set is strictly after training set chronologically | Clean |
| Camera bias | Uniform camera distribution across train/test | Balanced (+/- 3%) |
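The first two leakage checks reduce to set intersections over the split membership; a minimal sketch, with `check_split_leakage` and the record fields as illustrative names:

```python
def check_split_leakage(train, test):
    """Return (image_overlap, order_overlap) between train and test records.

    Each record is a dict with 'image_id' and 'production_order_id'.
    Both returned sets should be empty for a clean split.
    """
    image_overlap = ({r["image_id"] for r in train}
                     & {r["image_id"] for r in test})
    # Same-part leakage: one production order must not span both splits
    order_overlap = ({r["production_order_id"] for r in train}
                     & {r["production_order_id"] for r in test})
    return image_overlap, order_overlap
```

Checking production orders as well as image IDs matters because multiple near-identical images of the same part would otherwise let the model memorize its way across the split boundary.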
Defect distribution across production lines:
Defect Distribution by Production Line
═══════════════════════════════════════
Line A (Plant 1): SCRATCH ████████ BURR ██████ PIT ████ CRACK ██
Line B (Plant 1): SCRATCH ██████ BURR ████ PIT ██████ CRACK █
Line C (Plant 2): SCRATCH ██████████ BURR ████ PIT ███ CRACK ██
Line D (Plant 2): SCRATCH █████ BURR ████████ PIT ████ CRACK ███
Note: Line D has elevated burr rate -- check tool wear patterns
Stage 6: Productionization
Edge Deployment Architecture
The visual inspection system must run close to the production line for latency requirements. Tomas deploys via Ray Serve with a dedicated endpoint:
Production Line Camera ──▶ Edge Server ──▶ Ray Serve ──▶ Decision
│ │ │ │
│ 4K image capture │ Preprocess │ CNN │ PASS/FAIL
│ every 2.5 seconds │ + resize to │ inference │ + defect
│ per part │ 380x380 │ < 200ms │ class
│ │ │ │
│ └── GPU: NVIDIA │ │
│ T4 (16GB) │ │
│ │ │
│ Fallback: if confidence < 0.8 │
│ route to human inspector queue │
Ray Serve deployment configuration:
{
"deployment": {
"model_name": "visual_quality_inspector",
"model_version": "v2.1",
"serving_framework": "ray_serve",
"endpoint": "/api/v1/ml/predict/visual-inspection",
"config": {
"num_replicas": 4,
"max_concurrent_queries": 50,
"ray_actor_options": {
"num_cpus": 2,
"num_gpus": 0.5,
"memory": 8589934592
},
"autoscaling_config": {
"min_replicas": 2,
"max_replicas": 8,
"target_num_ongoing_requests_per_replica": 10,
"upscale_delay_s": 30,
"downscale_delay_s": 300
}
},
"preprocessing": {
"resize": [380, 380],
"normalize": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]},
"format": "RGB"
},
"postprocessing": {
"confidence_threshold": 0.8,
"low_confidence_action": "route_to_human_inspector",
"multi_defect_handling": "report_all_above_threshold"
}
}
}
Sample inference response:
{
"image_id": "img-2026-02-28-143022-cam3",
"prediction": "FAIL",
"defects": [
{
"class": "SURFACE_SCRATCH",
"confidence": 0.94,
"severity": "major",
"bounding_box": {"x": 120, "y": 340, "width": 85, "height": 12},
"area_percentage": 0.7
}
],
"overall_confidence": 0.94,
"inference_time_ms": 67,
"recommended_action": "REWORK",
"model_version": "v2.1",
"camera_id": "cam-line-a-03",
"timestamp": "2026-02-28T14:30:22.451Z"
}
Latency performance under production load:
| Metric | Target | Actual | Status |
|---|---|---|---|
| P50 inference latency | < 100ms | 52ms | Pass |
| P95 inference latency | < 150ms | 89ms | Pass |
| P99 inference latency | < 200ms | 134ms | Pass |
| Throughput | > 25 images/sec per replica | 31 images/sec | Pass |
| Availability | > 99.9% | 99.97% (in first month) | Pass |
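The postprocessing policy from the deployment config (0.8 confidence threshold, report-all-above-threshold, low-confidence fallback to the human inspector queue) can be sketched as plain routing logic. Function and field names are illustrative, and the 0.5 lower band for human review is an assumed policy detail, not taken from the config.

```python
CONFIDENCE_THRESHOLD = 0.8

def route_prediction(defects):
    """Apply the serving postprocessing policy to per-class detections.

    `defects` is a list of {'class': str, 'confidence': float} dicts for one
    image. Returns (decision, reported_defects).
    """
    # report_all_above_threshold: every confident detection is reported
    reported = [d for d in defects if d["confidence"] >= CONFIDENCE_THRESHOLD]
    if reported:
        return "FAIL", reported
    # Assumed band: sub-threshold but non-trivial confidence goes to a human
    if any(d["confidence"] >= 0.5 for d in defects):
        return "HUMAN_REVIEW", []
    return "PASS", []
```

Keeping the decision logic outside the model makes the threshold tunable per line without retraining.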
Stage 7: Feedback
Real-Time Monitoring Dashboard
Tomas configures continuous monitoring in the ML Workbench:
| Metric | Frequency | Alert Threshold | Current Value |
|---|---|---|---|
| False reject rate (FRR) | Hourly | > 5% | 3.2% |
| False accept rate (FAR) | Hourly | > 1% | 0.4% |
| Throughput (images/min) | Continuous | < 20/min per line | 24/min |
| Inference latency P99 | Continuous | > 200ms | 134ms |
| Human override rate | Daily | > 10% | 6.8% |
| Model confidence distribution | Daily | Mean < 0.85 | Mean 0.92 |
Inspector override tracking:
The system tracks when human inspectors disagree with the model's decision. These overrides serve as a continuous feedback signal:
-- Track inspector overrides as model feedback
SELECT
DATE_TRUNC('week', override_timestamp) AS week,
model_prediction,
inspector_decision,
COUNT(*) AS override_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
PARTITION BY DATE_TRUNC('week', override_timestamp)
) AS override_pct
FROM inspection_overrides
WHERE override_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, override_count DESC
| Override Pattern | Count/Week | Trend | Action |
|---|---|---|---|
| Model: FAIL, Inspector: PASS | 34 | Decreasing | Acceptable -- conservative model |
| Model: PASS, Inspector: FAIL | 8 | Stable | Critical -- investigate each case |
| Model: SCRATCH, Inspector: PIT | 12 | Stable | Add to active learning queue |
Alert Configuration
{
"alerts": [
{
"name": "false_accept_spike",
"metric": "false_accept_rate_hourly",
"condition": "> 1.0%",
"severity": "critical",
"action": "pause_auto_pass_and_notify",
"notify": ["tomas.rivera@apex.com", "#quality-ops", "shift-supervisor"]
},
{
"name": "latency_degradation",
"metric": "inference_latency_p99",
"condition": "> 200ms for 5 consecutive minutes",
"severity": "high",
"action": "scale_up_replicas",
"notify": ["tomas.rivera@apex.com"]
},
{
"name": "model_confidence_drift",
"metric": "mean_prediction_confidence",
"condition": "< 0.85 daily average",
"severity": "warning",
"action": "trigger_investigation_workflow",
"notify": ["tomas.rivera@apex.com", "lin.wei@apex.com"]
}
]
}
Stage 8: Experimentation
Architecture Comparison
Tomas runs a structured experiment comparing three CNN architectures:
| Model | Accuracy | Precision | Recall | F1 | Inference P99 | Model Size |
|---|---|---|---|---|---|---|
| ResNet-50 (fine-tuned) | 95.8% | 0.94 | 0.93 | 0.935 | 98ms | 98 MB |
| EfficientNet-B4 (fine-tuned) | 97.2% | 0.96 | 0.95 | 0.955 | 134ms | 75 MB |
| YOLOv8-m (defect localization) | 96.1% | 0.95 | 0.94 | 0.945 | 78ms | 52 MB |
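The F1 column is the harmonic mean of precision and recall, which can be verified directly against the table:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# EfficientNet-B4 row: P = 0.96, R = 0.95 -> F1 = 0.955 (to 3 decimals)
```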
Per-class performance (EfficientNet-B4, production model):
| Defect Class | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| NO_DEFECT | 0.98 | 0.97 | 0.975 | Low false reject rate |
| SURFACE_SCRATCH | 0.95 | 0.94 | 0.945 | Most common defect |
| BURR | 0.96 | 0.95 | 0.955 | Clear visual signature |
| SURFACE_PIT | 0.93 | 0.92 | 0.925 | Sometimes confused with scratch |
| CRACK | 0.99 | 0.99 | 0.990 | Safety-critical, tuned for recall |
| DIMENSIONAL_OOS | 0.91 | 0.89 | 0.900 | Hardest class -- subtle visual cues |
Decision: EfficientNet-B4 selected for production. YOLOv8 earmarked for a future iteration where defect localization (bounding box prediction) is needed for automated rework guidance.
Active Learning Experiment
Tomas tests whether active learning -- using the model's own low-confidence predictions to select the most informative images for human labeling -- can reduce labeling costs:
Active Learning Loop
════════════════════
Production ──▶ Model Inference ──▶ Confidence < 0.8? ──▶ Human Label Queue
│ │ │
│ ┌────┴─────┐ │
│ │ Yes │ No │
│ ▼ ▼ │
│ Queue for Log prediction │
│ labeling as ground truth │
│ │ │
│ ▼ │
│ Add to next │
│ training batch ◀──────────────────┘
| Training Strategy | Labeled Images Used | Accuracy | Label Cost |
|---|---|---|---|
| Random sampling (baseline) | 41,000 | 97.2% | $82K ($2/label) |
| Active learning (uncertainty) | 24,600 | 97.1% | $49K |
| Active learning (diversity + uncertainty) | 22,100 | 96.9% | $44K |
Finding: Active learning achieves comparable accuracy with 40% fewer labeled images, saving approximately $38K per training cycle in labeling costs.
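The uncertainty strategy simply ranks unlabeled predictions by model confidence and sends the least confident to annotators. A minimal sketch, with `select_for_labeling` and the record fields as illustrative names (the diversity variant would additionally de-duplicate visually similar images):

```python
def select_for_labeling(predictions, budget):
    """Pick the `budget` most uncertain predictions for human labeling.

    `predictions` is a list of {'image_id': str, 'confidence': float} dicts;
    lower model confidence = more informative to label.
    """
    ranked = sorted(predictions, key=lambda p: p["confidence"])
    return [p["image_id"] for p in ranked[:budget]]
```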
Production Impact Summary
After 2 months in production across 2 production lines:
| Metric | Before (Manual) | After (AI + Human Fallback) | Change |
|---|---|---|---|
| Inspection throughput | 200 parts/shift/inspector | 1,400 parts/shift/line | +600% |
| Defect detection rate | 85% | 97.2% | +12.2pp |
| False reject rate | 8% | 3.2% | -4.8pp |
| Inspector labor cost | $1.4M/year | $420K/year (oversight role) | -70% |
| Customer escapes (defects shipped) | 340/year | 41/year | -88% |
| Scrap cost reduction | -- | $860K/year | Significant |
Related Walkthroughs
- Data Scientist Journey -- Lin Wei's sensor feature engineering techniques inform Tomas's data preparation approach
- BI Lead Journey -- Carlos tracks quality metrics from Tomas's inspection system in the OEE dashboard
- Executive Journey -- Karen reviews the quality inspection ROI and expansion plan
- Manufacturing Overview -- Full dataset and KPI reference