MATIH Platform is in active MVP development. Documentation reflects current implementation status.
21. Industry Examples & Walkthroughs
Manufacturing & Supply Chain

ML Engineer Journey: Visual Quality Inspection System

Persona: Tomas Rivera, ML Engineer at Apex Manufacturing
Goal: Deploy an automated visual inspection system for real-time defect detection on the production line
Primary Workbenches: ML Workbench, Data Workbench, Pipeline Service
Timeline: 8-week project from data preparation to edge deployment


Business Context

Apex Manufacturing currently relies on human inspectors at the end of each production line to identify surface defects on machined parts. Each inspector examines roughly 200 parts per shift using magnification and gauging tools. The process has three problems: it is slow (45 seconds per part), inconsistent (inspector agreement rate is only 82%), and expensive ($1.4M annually in labor plus $2M in scrap from missed defects that reach customers).

Tomas Rivera's objective is to build and deploy a CNN-based visual inspection system that captures images of every part at line speed, classifies defects in real time, and routes rejected parts to rework stations automatically. The system must achieve > 97% accuracy with P99 inference latency under 200ms to keep pace with the production line.


Stage 1: Ingestion

Connecting Image Metadata

The camera system at each production line captures high-resolution images (4096x3072, 12MP) and stores them on a local NAS. Tomas does not ingest raw images into the platform -- instead, he ingests the metadata and labels, using the platform to manage the ML lifecycle while images remain on fast local storage accessible to training infrastructure.

Image metadata ingestion configuration:

{
  "source": {
    "type": "postgresql",
    "config": {
      "host": "qms-db.apex.internal",
      "database": "quality_vision",
      "schema": "inspection",
      "tables": [
        "image_metadata",
        "defect_annotations",
        "inspection_results",
        "camera_calibration"
      ]
    }
  },
  "sync": {
    "mode": "cdc_incremental",
    "frequency": "every_15_minutes",
    "cursor_field": "updated_at"
  }
}
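The cdc_incremental sync mode can be pictured as repeated pulls of only the rows whose cursor_field (updated_at) is past the last saved watermark. A stand-alone sketch of that behavior -- fetch_incremental is illustrative, not a platform API:

```python
from datetime import datetime

def fetch_incremental(rows, cursor):
    """Return rows newer than the cursor plus the advanced cursor value.

    Mimics cursor-based incremental sync: each run picks up only rows
    whose cursor field ("updated_at") is strictly past the last watermark,
    then saves the new watermark for the next run.
    """
    new_rows = [r for r in rows if r["updated_at"] > cursor]
    next_cursor = max((r["updated_at"] for r in new_rows), default=cursor)
    return new_rows, next_cursor

rows = [
    {"image_id": "a", "updated_at": datetime(2026, 2, 28, 14, 0)},
    {"image_id": "b", "updated_at": datetime(2026, 2, 28, 14, 10)},
]
batch, cursor = fetch_incremental(rows, datetime(2026, 2, 28, 14, 5))
# only "b" is newer than the 14:05 watermark
```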

Connected Data Sources

| Source | Type | Connector | Key Tables | Frequency |
|---|---|---|---|---|
| Quality Vision DB | PostgreSQL | Airbyte CDC | image_metadata, defect_annotations | Every 15 min |
| Quality Management System | PostgreSQL | Airbyte CDC | inspection_results, defect_taxonomy | Every 15 min |
| SAP ERP | PostgreSQL | Airbyte CDC | production_orders, product_specs | Every 15 min |
| Camera calibration logs | CSV Import | File Import | camera_config, lens_distortion_params | Monthly |

Data Volume Assessment

| Dataset | Volume | Growth Rate |
|---|---|---|
| image_metadata | 250K records (6 months of history) | ~1,400 images/day (2 lines, 2 shifts) |
| defect_annotations | 38K labeled defects across 22K images | ~200 new annotations/day |
| inspection_results | 1M historical results (human inspector decisions) | ~1,400/day |
| production_orders | 500K orders | ~300/day |
| Raw images (on NAS, not ingested) | 4.2 TB | ~18 GB/day |

Stage 2: Discovery

Mapping Quality-Related Assets

Tomas explores the data catalog in the Data Workbench to understand the quality data landscape:

Catalog: apex_manufacturing
  └── quality/
      ├── image_metadata            (250K rows)
      │   ├── image_id              (UUID, unique)
      │   ├── production_order_id   (FK to production_orders)
      │   ├── machine_id            (FK to equipment_registry)
      │   ├── camera_id             (1 of 8 cameras across 2 lines)
      │   ├── capture_timestamp     (millisecond precision)
      │   ├── image_path            (NAS path, not in platform)
      │   ├── part_type             (product SKU)
      │   └── exposure_settings     (JSON: ISO, aperture, shutter)
      ├── defect_annotations        (38K rows)
      │   ├── annotation_id         (UUID)
      │   ├── image_id              (FK to image_metadata)
      │   ├── defect_type           (string -- INCONSISTENT across plants)
      │   ├── bounding_box          (JSON: x, y, width, height)
      │   ├── annotator_id          (human labeler)
      │   └── confidence            (annotator self-assessed, 1-5)
      └── inspection_results        (1M rows)
          ├── result_id             (UUID)
          ├── image_id              (FK)
          ├── inspector_id          (human inspector)
          ├── decision              (PASS / FAIL / REWORK)
          └── defect_codes          (array of codes)

Discovering Labeling Inconsistencies

Data profiling reveals a critical issue: the same defect type is named differently across plants and even between annotators at the same plant.

-- Discover labeling inconsistencies across plants
SELECT
    e.plant_id,
    d.defect_type,
    COUNT(*) AS annotation_count
FROM defect_annotations d
JOIN image_metadata im ON d.image_id = im.image_id
JOIN equipment_registry e ON im.machine_id = e.machine_id
GROUP BY e.plant_id, d.defect_type
ORDER BY e.plant_id, annotation_count DESC

| Plant | Defect Label (as entered) | Count | Standardized Label |
|---|---|---|---|
| Plant 1 | scratch | 4,200 | SURFACE_SCRATCH |
| Plant 2 | surface scratch | 3,100 | SURFACE_SCRATCH |
| Plant 3 | SCRATCH_SURFACE | 2,800 | SURFACE_SCRATCH |
| Plant 1 | burr | 3,600 | BURR |
| Plant 2 | edge burr | 2,900 | BURR |
| Plant 1 | pit | 2,100 | SURFACE_PIT |
| Plant 2 | pitting | 1,800 | SURFACE_PIT |
| Plant 3 | corrosion_pit | 1,400 | SURFACE_PIT |
| All | crack / fracture / hairline_crack | 2,400 | CRACK |

Standardizing via Ontology Service

Tomas creates a defect taxonomy in the Ontology Service to standardize labels:

{
  "ontology": "apex_defect_taxonomy",
  "version": "1.0",
  "classes": [
    {
      "id": "SURFACE_SCRATCH",
      "label": "Surface Scratch",
      "description": "Linear mark on machined surface caused by tool contact or handling",
      "aliases": ["scratch", "surface scratch", "SCRATCH_SURFACE", "scr"],
      "severity_levels": ["minor", "major", "critical"],
      "parent": "SURFACE_DEFECT"
    },
    {
      "id": "BURR",
      "label": "Burr",
      "description": "Raised edge or small piece of material remaining after machining",
      "aliases": ["burr", "edge burr", "deburr_needed", "rough_edge"],
      "severity_levels": ["minor", "major"],
      "parent": "EDGE_DEFECT"
    },
    {
      "id": "SURFACE_PIT",
      "label": "Surface Pit",
      "description": "Small cavity or depression in the machined surface",
      "aliases": ["pit", "pitting", "corrosion_pit", "surface_void"],
      "severity_levels": ["minor", "major", "critical"],
      "parent": "SURFACE_DEFECT"
    },
    {
      "id": "CRACK",
      "label": "Crack",
      "description": "Fracture line in the material, potentially structural",
      "aliases": ["crack", "fracture", "hairline_crack", "stress_crack"],
      "severity_levels": ["major", "critical"],
      "parent": "STRUCTURAL_DEFECT"
    },
    {
      "id": "DIMENSIONAL_OOS",
      "label": "Dimensional Out-of-Spec",
      "description": "Part dimension outside specified tolerance",
      "aliases": ["oos", "out_of_spec", "tolerance_fail"],
      "severity_levels": ["major", "critical"],
      "parent": "DIMENSIONAL_DEFECT"
    }
  ]
}
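Applied at query time, the taxonomy's aliases collapse the free-text labels into canonical class IDs. A minimal normalization sketch, assuming the ontology JSON above has been loaded as a dict (only two classes shown for brevity):

```python
def build_alias_map(ontology):
    """Map every alias (case-insensitive) to its canonical class ID."""
    alias_map = {}
    for cls in ontology["classes"]:
        alias_map[cls["id"].lower()] = cls["id"]
        for alias in cls["aliases"]:
            alias_map[alias.lower()] = cls["id"]
    return alias_map

def normalize_label(raw, alias_map):
    """Return the canonical defect class, or 'OTHER' if unmapped."""
    return alias_map.get(raw.strip().lower(), "OTHER")

ontology = {
    "classes": [
        {"id": "SURFACE_SCRATCH",
         "aliases": ["scratch", "surface scratch", "SCRATCH_SURFACE", "scr"]},
        {"id": "BURR",
         "aliases": ["burr", "edge burr", "deburr_needed", "rough_edge"]},
    ]
}
amap = build_alias_map(ontology)
# "SCRATCH_SURFACE" (the Plant 3 spelling) maps to SURFACE_SCRATCH
```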

Stage 3: Query

Building Training Dataset

Tomas constructs the training dataset by joining image metadata with standardized inspection outcomes:

-- Build labeled training dataset with balanced sampling
WITH standardized_labels AS (
    SELECT
        d.image_id,
        CASE
            WHEN d.defect_type IN ('scratch', 'surface scratch', 'SCRATCH_SURFACE')
                THEN 'SURFACE_SCRATCH'
            WHEN d.defect_type IN ('burr', 'edge burr', 'rough_edge')
                THEN 'BURR'
            WHEN d.defect_type IN ('pit', 'pitting', 'corrosion_pit')
                THEN 'SURFACE_PIT'
            WHEN d.defect_type IN ('crack', 'fracture', 'hairline_crack')
                THEN 'CRACK'
            ELSE 'OTHER'
        END AS defect_class,
        d.bounding_box,
        d.confidence AS annotator_confidence
    FROM defect_annotations d
    WHERE d.confidence >= 3  -- filter low-confidence annotations
),
image_labels AS (
    SELECT
        im.image_id,
        im.image_path,
        im.part_type,
        im.camera_id,
        im.capture_timestamp,
        COALESCE(sl.defect_class, 'NO_DEFECT') AS label,
        sl.bounding_box,
        ir.decision AS inspector_decision
    FROM image_metadata im
    LEFT JOIN standardized_labels sl ON im.image_id = sl.image_id
    LEFT JOIN inspection_results ir ON im.image_id = ir.image_id
)
SELECT
    image_id,
    image_path,
    part_type,
    camera_id,
    label,
    bounding_box,
    inspector_decision,
    -- Stratified split: same part_type distribution in train/test
    NTILE(5) OVER (PARTITION BY label, part_type ORDER BY capture_timestamp)
        AS fold_id
FROM image_labels
WHERE capture_timestamp >= CURRENT_DATE - INTERVAL '180 days'

Dataset composition after balancing:

| Class | Raw Count | After Balancing | Train | Validation | Test |
|---|---|---|---|---|---|
| NO_DEFECT | 212,000 | 15,000 (subsampled) | 10,500 | 2,250 | 2,250 |
| SURFACE_SCRATCH | 10,100 | 10,100 | 7,070 | 1,515 | 1,515 |
| BURR | 6,500 | 6,500 | 4,550 | 975 | 975 |
| SURFACE_PIT | 5,300 | 5,300 | 3,710 | 795 | 795 |
| CRACK | 2,400 | 2,400 (+ augmentation) | 1,680 | 360 | 360 |
| DIMENSIONAL_OOS | 1,700 | 1,700 (+ augmentation) | 1,190 | 255 | 255 |
| Total | 238,000 | 41,000 | 28,700 | 6,150 | 6,150 |

Feature Vector Preparation

For models that use pre-extracted feature vectors (transfer learning from ImageNet), Tomas queries the precomputed embeddings stored as parquet on S3:

-- Query precomputed image embeddings for model training
SELECT
    e.image_id,
    e.embedding_vector,  -- 2048-dim float array (ResNet-50 penultimate layer)
    il.label,
    il.part_type
FROM s3.image_embeddings e
JOIN image_labels il ON e.image_id = il.image_id
WHERE il.fold_id <= 4  -- train + validation folds
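With embeddings precomputed, even a very light classifier gives a quick baseline before committing GPU time to fine-tuning. A toy nearest-centroid sketch over cosine similarity -- 3-dim vectors stand in for the 2048-dim ResNet-50 embeddings, and the data is illustrative only:

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def predict(embedding, centroids):
    """Assign the class whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(embedding, centroids[label]))

# toy 3-dim embeddings standing in for the 2048-dim feature vectors
train = {
    "BURR":  [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "CRACK": [[0.0, 1.0, 0.9], [0.1, 0.9, 1.0]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
print(predict([0.95, 0.15, 0.05], centroids))  # BURR
```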

Stage 4: Orchestration

Model Training Pipeline

Tomas builds the end-to-end training pipeline using the Pipeline Service:

  Visual Inspection Training Pipeline (Weekly, Saturday 2 AM)

  ┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌───────────┐
  │ 1. Validate │───▶│ 2. Augment  │───▶│ 3. Train     │───▶│ 4. Eval   │
  │    Labels   │    │    Data     │    │   (Ray Train │    │   Model   │
  │             │    │             │    │    4 GPUs)   │    │           │
  │ Agreement   │    │ Rotation,   │    │             │    │ Precision │
  │ rate > 90%  │    │ flip, noise,│    │ EfficientNet│    │ > 0.95?   │
  │             │    │ brightness  │    │ -B4 fine-   │    │           │
  └─────────────┘    └─────────────┘    │ tune        │    └─────┬─────┘
                                        └──────────────┘          │
                                                            ┌─────┴─────┐
                                                            │ Yes       │ No
                                                            ▼           ▼
                                                     ┌───────────┐ ┌──────────┐
                                                     │ 5. Deploy │ │ Alert +  │
                                                     │   (shadow │ │ manual   │
                                                     │    first) │ │ review   │
                                                     └───────────┘ └──────────┘

Temporal workflow definition:

{
  "workflow": "visual_inspection_training",
  "schedule": "0 2 * * 6",
  "activities": [
    {
      "name": "validate_labels",
      "type": "data_quality",
      "config": {
        "suite": "defect_label_quality",
        "expectations": [
          {
            "type": "expect_column_pair_values_a_to_be_greater_than_b",
            "column_A": "inter_annotator_agreement",
            "column_B": "0.90",
            "description": "Label agreement rate must exceed 90%"
          },
          {
            "type": "expect_column_values_to_be_in_set",
            "column": "defect_class",
            "value_set": ["NO_DEFECT", "SURFACE_SCRATCH", "BURR", "SURFACE_PIT", "CRACK", "DIMENSIONAL_OOS"]
          },
          {
            "type": "expect_table_row_count_to_be_between",
            "min": 30000,
            "description": "Minimum training samples"
          }
        ]
      }
    },
    {
      "name": "augment_data",
      "type": "python_activity",
      "config": {
        "script": "augmentation/defect_augmentation.py",
        "augmentations": [
          {"type": "random_rotation", "degrees": 15},
          {"type": "horizontal_flip", "probability": 0.5},
          {"type": "gaussian_noise", "std": 0.02},
          {"type": "brightness_contrast", "brightness": 0.2, "contrast": 0.2},
          {"type": "random_crop_resize", "scale_range": [0.8, 1.0]}
        ],
        "oversample_classes": ["CRACK", "DIMENSIONAL_OOS"],
        "target_min_per_class": 3000
      }
    },
    {
      "name": "train_model",
      "type": "ml_training",
      "config": {
        "experiment": "visual_quality_inspection",
        "framework": "pytorch",
        "model_architecture": "efficientnet_b4",
        "pretrained_weights": "imagenet",
        "training": {
          "epochs": 50,
          "batch_size": 32,
          "learning_rate": 0.001,
          "lr_scheduler": "cosine_annealing",
          "warmup_epochs": 5,
          "early_stopping_patience": 10,
          "optimizer": "adamw",
          "weight_decay": 0.01
        },
        "compute": {
          "type": "ray_train",
          "num_workers": 4,
          "resources_per_worker": {"cpu": 4, "gpu": 1, "memory_gb": 32}
        }
      },
      "timeout": "4h"
    },
    {
      "name": "evaluate_model",
      "type": "ml_evaluation",
      "config": {
        "metrics": ["accuracy", "precision", "recall", "f1", "confusion_matrix"],
        "per_class": true,
        "deployment_gate": {
          "overall_precision": {"min": 0.95},
          "overall_recall": {"min": 0.93},
          "crack_recall": {"min": 0.99, "description": "Safety-critical defect must have near-perfect recall"},
          "false_reject_rate": {"max": 0.05}
        }
      }
    },
    {
      "name": "deploy_shadow",
      "type": "conditional",
      "condition": "evaluate_model.gate_passed == true",
      "activity": {
        "name": "shadow_deployment",
        "type": "ml_deployment",
        "config": {
          "model_name": "visual_quality_inspector",
          "deployment_mode": "shadow",
          "shadow_duration_hours": 48,
          "compare_with": "current_production_model"
        }
      }
    }
  ]
}
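The deploy_shadow step is conditioned on `evaluate_model.gate_passed`. The gate semantics amount to a min/max threshold check over the evaluation metrics; the function below is a stand-alone illustration of that logic, not the platform's evaluator:

```python
def gate_passed(metrics, gate):
    """Check evaluation metrics against min/max deployment thresholds.

    gate entries look like {"overall_precision": {"min": 0.95}, ...};
    any unmet bound fails the whole gate, in which case the pipeline
    alerts for manual review instead of deploying.
    """
    for name, bounds in gate.items():
        value = metrics[name]
        if "min" in bounds and value < bounds["min"]:
            return False
        if "max" in bounds and value > bounds["max"]:
            return False
    return True

gate = {
    "overall_precision": {"min": 0.95},
    "overall_recall": {"min": 0.93},
    "crack_recall": {"min": 0.99},
    "false_reject_rate": {"max": 0.05},
}
metrics = {"overall_precision": 0.96, "overall_recall": 0.95,
           "crack_recall": 0.99, "false_reject_rate": 0.032}
# all four bounds hold, so the shadow deployment step would run
```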

Stage 5: Analysis

Training Data Quality Validation

Before training, Tomas validates the label quality rigorously:

Inter-annotator agreement analysis:

-- Measure label agreement rate: images labeled by multiple annotators
-- (standardized_labels is the CTE defined earlier, extended to also
--  carry d.annotator_id so annotator pairs can be compared)
SELECT
    a1.image_id,
    a1.defect_class AS annotator_1_label,
    a2.defect_class AS annotator_2_label,
    CASE WHEN a1.defect_class = a2.defect_class THEN 1 ELSE 0 END AS agreement
FROM standardized_labels a1
JOIN standardized_labels a2
    ON a1.image_id = a2.image_id
    AND a1.annotator_id < a2.annotator_id

| Metric | Value | Threshold | Status |
|---|---|---|---|
| Overall agreement rate | 91.3% | > 90% | Pass |
| Agreement on CRACK class | 96.7% | > 95% | Pass |
| Agreement on NO_DEFECT | 94.2% | > 90% | Pass |
| Agreement on SURFACE_SCRATCH vs SURFACE_PIT | 83.1% | > 80% | Pass (marginal) |
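The agreement metrics above can be reproduced from raw (image, annotator, label) tuples. A minimal sketch mirroring the SQL self-join, with toy data in place of the production tables:

```python
from itertools import combinations

def agreement_rate(annotations):
    """Fraction of annotator pairs on the same image that agree.

    annotations: list of (image_id, annotator_id, label) tuples.
    Mirrors the SQL self-join: every unordered annotator pair per image
    contributes one agree/disagree observation.
    """
    by_image = {}
    for image_id, annotator_id, label in annotations:
        by_image.setdefault(image_id, []).append((annotator_id, label))
    agree = total = 0
    for labels in by_image.values():
        for (_, l1), (_, l2) in combinations(labels, 2):
            total += 1
            agree += l1 == l2
    return agree / total if total else None

data = [
    ("img1", "ann1", "CRACK"), ("img1", "ann2", "CRACK"),
    ("img2", "ann1", "BURR"),  ("img2", "ann2", "SURFACE_PIT"),
]
print(agreement_rate(data))  # 0.5: one agreeing pair, one disagreeing
```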

Data leakage check:

| Check | Method | Result |
|---|---|---|
| Train/test image overlap | Hash comparison of image_ids across splits | 0 overlaps (clean) |
| Same-part leakage | Verify no production_order appears in both train and test | Clean |
| Temporal leakage | Test set is strictly after training set chronologically | Clean |
| Camera bias | Uniform camera distribution across train/test | Balanced (+/- 3%) |
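The first two checks reduce to set intersections on identifier columns. A sketch with illustrative row dicts (not the actual split tables):

```python
def check_split_leakage(train, test, key):
    """Return the set of key values appearing in both splits (empty = clean)."""
    train_keys = {row[key] for row in train}
    test_keys = {row[key] for row in test}
    return train_keys & test_keys

train = [{"image_id": "a", "production_order_id": "po1"},
         {"image_id": "b", "production_order_id": "po2"}]
test = [{"image_id": "c", "production_order_id": "po3"}]

# clean split: no shared images and no shared production orders
assert check_split_leakage(train, test, "image_id") == set()
assert check_split_leakage(train, test, "production_order_id") == set()
```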

Defect distribution across production lines:

  Defect Distribution by Production Line
  ═══════════════════════════════════════

  Line A (Plant 1):  SCRATCH ████████  BURR ██████  PIT ████  CRACK ██
  Line B (Plant 1):  SCRATCH ██████    BURR ████    PIT ██████  CRACK █
  Line C (Plant 2):  SCRATCH ██████████ BURR ████   PIT ███   CRACK ██
  Line D (Plant 2):  SCRATCH █████     BURR ████████ PIT ████  CRACK ███

  Note: Line D has elevated burr rate -- check tool wear patterns

Stage 6: Productionization

Edge Deployment Architecture

The visual inspection system must run close to the production line to meet its latency requirements. Tomas deploys via Ray Serve with a dedicated endpoint:

  Production Line Camera ──▶ Edge Server ──▶ Ray Serve ──▶ Decision
       │                       │                │              │
       │  4K image capture     │  Preprocess    │  CNN         │  PASS/FAIL
       │  every 2.5 seconds    │  + resize to   │  inference   │  + defect
       │  per part             │  380x380       │  < 200ms     │  class
       │                       │                │              │
       │                       └── GPU: NVIDIA  │              │
       │                           T4 (16GB)    │              │
       │                                        │              │
       │                       Fallback: if confidence < 0.8   │
       │                       route to human inspector queue   │

Ray Serve deployment configuration:

{
  "deployment": {
    "model_name": "visual_quality_inspector",
    "model_version": "v2.1",
    "serving_framework": "ray_serve",
    "endpoint": "/api/v1/ml/predict/visual-inspection",
    "config": {
      "num_replicas": 4,
      "max_concurrent_queries": 50,
      "ray_actor_options": {
        "num_cpus": 2,
        "num_gpus": 0.5,
        "memory": 8589934592
      },
      "autoscaling_config": {
        "min_replicas": 2,
        "max_replicas": 8,
        "target_num_ongoing_requests_per_replica": 10,
        "upscale_delay_s": 30,
        "downscale_delay_s": 300
      }
    },
    "preprocessing": {
      "resize": [380, 380],
      "normalize": {"mean": [0.485, 0.456, 0.406], "std": [0.229, 0.224, 0.225]},
      "format": "RGB"
    },
    "postprocessing": {
      "confidence_threshold": 0.8,
      "low_confidence_action": "route_to_human_inspector",
      "multi_defect_handling": "report_all_above_threshold"
    }
  }
}

Sample inference response:

{
  "image_id": "img-2026-02-28-143022-cam3",
  "prediction": "FAIL",
  "defects": [
    {
      "class": "SURFACE_SCRATCH",
      "confidence": 0.94,
      "severity": "major",
      "bounding_box": {"x": 120, "y": 340, "width": 85, "height": 12},
      "area_percentage": 0.7
    }
  ],
  "overall_confidence": 0.94,
  "inference_time_ms": 67,
  "recommended_action": "REWORK",
  "model_version": "v2.1",
  "camera_id": "cam-line-a-03",
  "timestamp": "2026-02-28T14:30:22.451Z"
}
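The postprocessing policy (confidence_threshold of 0.8 with route_to_human_inspector as the low-confidence action) can be sketched as a small routing function. The logic below is an illustrative simplification, not the deployed service code:

```python
def route_decision(defects, confidence_threshold=0.8):
    """Apply the postprocessing policy to a list of defect predictions.

    defects: list of {"class": ..., "confidence": ...} dicts.
    Defects above the threshold are reported (FAIL); if the model saw
    something but is unsure, the part goes to the human inspector queue.
    Returns (decision, reported_defects).
    """
    reported = [d for d in defects if d["confidence"] >= confidence_threshold]
    if defects and not reported:
        # model detected something but below threshold -- human fallback
        return "ROUTE_TO_HUMAN", []
    return ("FAIL", reported) if reported else ("PASS", [])

decision, found = route_decision(
    [{"class": "SURFACE_SCRATCH", "confidence": 0.94}])
# decision == "FAIL" with the scratch reported, as in the sample response
```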

Latency performance under production load:

| Metric | Target | Actual | Status |
|---|---|---|---|
| P50 inference latency | < 100ms | 52ms | Pass |
| P95 inference latency | < 150ms | 89ms | Pass |
| P99 inference latency | < 200ms | 134ms | Pass |
| Throughput | > 25 images/sec per replica | 31 images/sec | Pass |
| Availability | > 99.9% | 99.97% (in first month) | Pass |

Stage 7: Feedback

Real-Time Monitoring Dashboard

Tomas configures continuous monitoring in the ML Workbench:

| Metric | Frequency | Alert Threshold | Current Value |
|---|---|---|---|
| False reject rate (FRR) | Hourly | > 5% | 3.2% |
| False accept rate (FAR) | Hourly | > 1% | 0.4% |
| Throughput (images/min) | Continuous | < 20/min per line | 24/min |
| Inference latency P99 | Continuous | > 200ms | 134ms |
| Human override rate | Daily | > 10% | 6.8% |
| Model confidence distribution | Daily | Mean < 0.85 | Mean 0.92 |

Inspector override tracking:

The system tracks when human inspectors disagree with the model's decision. These overrides serve as a continuous feedback signal:

-- Track inspector overrides as model feedback
SELECT
    DATE_TRUNC('week', override_timestamp) AS week,
    model_prediction,
    inspector_decision,
    COUNT(*) AS override_count,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
        PARTITION BY DATE_TRUNC('week', override_timestamp)
    ) AS override_pct
FROM inspection_overrides
WHERE override_timestamp >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, override_count DESC

| Override Pattern | Count/Week | Trend | Action |
|---|---|---|---|
| Model: FAIL, Inspector: PASS | 34 | Decreasing | Acceptable -- conservative model |
| Model: PASS, Inspector: FAIL | 8 | Stable | Critical -- investigate each case |
| Model: SCRATCH, Inspector: PIT | 12 | Stable | Add to active learning queue |

Alert Configuration

{
  "alerts": [
    {
      "name": "false_accept_spike",
      "metric": "false_accept_rate_hourly",
      "condition": "> 1.0%",
      "severity": "critical",
      "action": "pause_auto_pass_and_notify",
      "notify": ["tomas.rivera@apex.com", "#quality-ops", "shift-supervisor"]
    },
    {
      "name": "latency_degradation",
      "metric": "inference_latency_p99",
      "condition": "> 200ms for 5 consecutive minutes",
      "severity": "high",
      "action": "scale_up_replicas",
      "notify": ["tomas.rivera@apex.com"]
    },
    {
      "name": "model_confidence_drift",
      "metric": "mean_prediction_confidence",
      "condition": "< 0.85 daily average",
      "severity": "warning",
      "action": "trigger_investigation_workflow",
      "notify": ["tomas.rivera@apex.com", "lin.wei@apex.com"]
    }
  ]
}

Stage 8: Experimentation

Architecture Comparison

Tomas runs a structured experiment comparing three CNN architectures:

| Model | Accuracy | Precision | Recall | F1 | Inference P99 | Model Size |
|---|---|---|---|---|---|---|
| ResNet-50 (fine-tuned) | 95.8% | 0.94 | 0.93 | 0.935 | 98ms | 98 MB |
| EfficientNet-B4 (fine-tuned) | 97.2% | 0.96 | 0.95 | 0.955 | 134ms | 75 MB |
| YOLOv8-m (defect localization) | 96.1% | 0.95 | 0.94 | 0.945 | 78ms | 52 MB |

Per-class performance (EfficientNet-B4, production model):

| Defect Class | Precision | Recall | F1 | Notes |
|---|---|---|---|---|
| NO_DEFECT | 0.98 | 0.97 | 0.975 | Low false reject rate |
| SURFACE_SCRATCH | 0.95 | 0.94 | 0.945 | Most common defect |
| BURR | 0.96 | 0.95 | 0.955 | Clear visual signature |
| SURFACE_PIT | 0.93 | 0.92 | 0.925 | Sometimes confused with scratch |
| CRACK | 0.99 | 0.99 | 0.990 | Safety-critical, tuned for recall |
| DIMENSIONAL_OOS | 0.91 | 0.89 | 0.900 | Hardest class -- subtle visual cues |

Decision: EfficientNet-B4 selected for production. YOLOv8 earmarked for a future iteration where defect localization (bounding box prediction) is needed for automated rework guidance.

Active Learning Experiment

Tomas tests whether active learning -- using the model's own low-confidence predictions to select the most informative images for human labeling -- can reduce labeling costs:

  Active Learning Loop
  ════════════════════

  Production ──▶ Model Inference ──▶ Confidence < 0.8? ──▶ Human Label Queue
       │                                    │                       │
       │                               ┌────┴─────┐                │
       │                               │ Yes      │ No             │
       │                               ▼          ▼                │
       │                         Queue for    Log prediction       │
       │                         labeling     as ground truth      │
       │                               │                           │
       │                               ▼                           │
       │                         Add to next                       │
       │                         training batch ◀──────────────────┘

| Training Strategy | Labeled Images Used | Accuracy | Label Cost |
|---|---|---|---|
| Random sampling (baseline) | 41,000 | 97.2% | $82K (at $2/label) |
| Active learning (uncertainty) | 24,600 | 97.1% | $49K |
| Active learning (diversity + uncertainty) | 22,100 | 96.9% | $44K |

Finding: Active learning achieves comparable accuracy with 40% fewer labeled images, saving approximately $38K per training cycle in labeling costs.
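Uncertainty sampling -- the selection strategy in the second experiment row -- is commonly implemented by ranking images by predictive entropy and labeling the top of the ranking. A minimal sketch with toy class probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, k):
    """Pick the k most uncertain images for the human label queue.

    predictions: {image_id: [class probabilities]}.
    """
    ranked = sorted(predictions, key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

preds = {
    "img1": [0.97, 0.01, 0.02],   # confident -- logged as ground truth
    "img2": [0.40, 0.35, 0.25],   # most uncertain -- queue for labeling
    "img3": [0.55, 0.30, 0.15],
}
print(select_for_labeling(preds, 1))  # ['img2']
```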

Production Impact Summary

After 2 months in production across 2 production lines:

| Metric | Before (Manual) | After (AI + Human Fallback) | Change |
|---|---|---|---|
| Inspection throughput | 200 parts/shift/inspector | 1,400 parts/shift/line | +600% |
| Defect detection rate | 85% | 97.2% | +12.2pp |
| False reject rate | 8% | 3.2% | -4.8pp |
| Inspector labor cost | $1.4M/year | $420K/year (oversight role) | -70% |
| Customer escapes (defects shipped) | 340/year | 41/year | -88% |
| Scrap cost reduction | -- | $860K/year | Significant |

Related Walkthroughs