
Inference Optimization

The Inference Optimization module provides tools for reducing model size, improving inference latency, and maximizing throughput. Techniques include quantization, pruning, knowledge distillation, ONNX conversion, and TensorRT compilation. The implementation is in src/inference/inference_optimizer.py.


Optimization Techniques

| Technique | Latency Reduction | Size Reduction | Accuracy Impact |
|---|---|---|---|
| ONNX Conversion | 20-40% | Minimal | None |
| INT8 Quantization | 40-60% | 50-75% | Low (less than 1%) |
| FP16 Mixed Precision | 30-50% | 50% | Minimal |
| Pruning | 20-50% | 30-70% | Low to moderate |
| Knowledge Distillation | Variable | 50-90% | Moderate |
| TensorRT Compilation | 50-70% | Variable | None to low |
| Dynamic Batching | N/A (throughput) | N/A | None |

Optimize Model

POST /api/v1/inference/optimize
{
  "model_id": "model-xyz789",
  "optimizations": ["onnx_conversion", "quantization"],
  "quantization_config": {
    "method": "dynamic",
    "precision": "int8"
  },
  "validation_dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn LIMIT 1000"
  },
  "accuracy_threshold": 0.99
}
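
The endpoint can be called from any HTTP client. A minimal sketch using Python's requests library is shown below; the base URL and bearer-token auth are assumptions about the deployment, not documented values.

import requests

# Hypothetical base URL and credentials; substitute your deployment's values.
BASE_URL = "https://matih.example.com"

payload = {
    "model_id": "model-xyz789",
    "optimizations": ["onnx_conversion", "quantization"],
    "quantization_config": {"method": "dynamic", "precision": "int8"},
    "validation_dataset": {
        "source": "sql",
        "query": "SELECT * FROM ml_features.customer_churn LIMIT 1000",
    },
    "accuracy_threshold": 0.99,
}

response = requests.post(
    f"{BASE_URL}/api/v1/inference/optimize",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
    timeout=60,
)
response.raise_for_status()
print(response.json()["optimization_id"])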

Response

{
  "optimization_id": "opt-abc123",
  "status": "completed",
  "original": {
    "format": "pkl",
    "size_mb": 245,
    "latency_ms": 15.2,
    "accuracy": 0.912
  },
  "optimized": {
    "format": "onnx",
    "size_mb": 62,
    "latency_ms": 6.8,
    "accuracy": 0.910
  },
  "improvement": {
    "size_reduction": "74.7%",
    "latency_reduction": "55.3%",
    "accuracy_change": "-0.002"
  },
  "artifact_id": "model-xyz789-optimized"
}
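
The improvement figures follow directly from the original and optimized measurements, as this quick check shows:

# Reproduce the reported improvement figures from the measurements above.
size_reduction = (245 - 62) / 245        # 0.747 -> "74.7%"
latency_reduction = (15.2 - 6.8) / 15.2  # 0.553 -> "55.3%"
accuracy_change = 0.910 - 0.912          # -0.002
print(f"{size_reduction:.1%} {latency_reduction:.1%} {accuracy_change:+.3f}")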

ONNX Conversion

Converts models from framework-specific formats to ONNX for portable, optimized inference:

class InferenceOptimizer:
    def convert_to_onnx(self, model, sample_input):
        import onnx

        # Export via the framework-specific converter (see the table below),
        # then load and validate the resulting ONNX graph.
        onnx_path = self._export_onnx(model, sample_input)
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        return onnx_path

| Source Framework | Conversion Method |
|---|---|
| scikit-learn | skl2onnx converter |
| XGBoost | onnxmltools converter |
| LightGBM | onnxmltools converter |
| PyTorch | torch.onnx.export |
| TensorFlow | tf2onnx converter |
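
As an illustration of the scikit-learn path, a minimal export with the skl2onnx converter could look like the sketch below; the function name and file path are illustrative, not the platform's internal _export_onnx implementation.

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

def export_sklearn_to_onnx(model, n_features, path="model.onnx"):
    # Declare the input signature: a float32 tensor with a dynamic batch dim.
    initial_types = [("input", FloatTensorType([None, n_features]))]
    onnx_model = convert_sklearn(model, initial_types=initial_types)
    with open(path, "wb") as f:
        f.write(onnx_model.SerializeToString())
    return path

# Validate the export by running a sample batch through ONNX Runtime:
# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx")
# preds = sess.run(None, {sess.get_inputs()[0].name: X.astype("float32")})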

Quantization

Reduces numerical precision of model weights and activations:

| Quantization Type | Description | Supported |
|---|---|---|
| Dynamic | Quantize weights statically, activations dynamically | All ONNX models |
| Static | Calibrate with representative data | ONNX models with a calibration set |
| Quantization-Aware Training | Train with simulated quantization | PyTorch models |
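
For instance, the "dynamic"/"int8" combination from the request example maps to ONNX Runtime's dynamic quantization tooling; the sketch below is a minimal example with illustrative file paths.

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",        # FP32 ONNX model produced above
    model_output="model.int8.onnx",  # destination for the quantized model
    weight_type=QuantType.QInt8,     # store weights as INT8
)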

Pruning

Removes low-importance weights or neurons to reduce model size:

| Pruning Method | Description |
|---|---|
| Magnitude | Remove weights below a threshold |
| Structured | Remove entire filters or attention heads |
| Lottery Ticket | Find a sparse subnetwork |
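
A minimal sketch of magnitude pruning using PyTorch's built-in utilities; the layer shape and the 30% sparsity target are arbitrary examples, not platform defaults.

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(128, 64)
# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)
# Bake the mask into the weights and drop the re-parametrization.
prune.remove(layer, "weight")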

Benchmarking

After optimization, the module runs benchmarks to validate performance:

{
  "benchmark": {
    "iterations": 1000,
    "warmup_iterations": 100,
    "batch_sizes": [1, 16, 32, 64],
    "results": {
      "batch_1": {"p50_ms": 5.2, "p95_ms": 8.1, "p99_ms": 12.3},
      "batch_16": {"p50_ms": 8.5, "p95_ms": 14.2, "p99_ms": 21.0},
      "batch_32": {"p50_ms": 12.1, "p95_ms": 19.8, "p99_ms": 28.5}
    }
  }
}
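
The percentiles can be reproduced with a simple timing loop; run_inference below is a placeholder for a single forward pass, not part of the platform API.

import time
import numpy as np

def benchmark(run_inference, iterations=1000, warmup_iterations=100):
    for _ in range(warmup_iterations):  # warm caches before timing
        run_inference()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    # Report the same percentiles the module records.
    return {f"p{p}_ms": round(float(np.percentile(latencies_ms, p)), 2)
            for p in (50, 95, 99)}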

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| OPTIMIZATION_MAX_TIME_MINUTES | 30 | Maximum optimization time |
| OPTIMIZATION_ACCURACY_THRESHOLD | 0.99 | Minimum accuracy ratio vs. the original model |
| TENSORRT_WORKSPACE_SIZE_GB | 4 | TensorRT workspace memory |
| ONNX_OPSET_VERSION | 17 | ONNX opset version |
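
A sketch of how a service might read these variables at startup; the variable names come from the table above, while the parsing code itself is illustrative.

import os

# Fall back to the documented defaults when a variable is unset.
MAX_TIME_MINUTES = int(os.environ.get("OPTIMIZATION_MAX_TIME_MINUTES", "30"))
ACCURACY_THRESHOLD = float(os.environ.get("OPTIMIZATION_ACCURACY_THRESHOLD", "0.99"))
TENSORRT_WORKSPACE_GB = int(os.environ.get("TENSORRT_WORKSPACE_SIZE_GB", "4"))
ONNX_OPSET = int(os.environ.get("ONNX_OPSET_VERSION", "17"))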