Inference Optimization
The Inference Optimization module provides tools for reducing model size, cutting inference latency, and maximizing throughput. Techniques include quantization, pruning, knowledge distillation, ONNX conversion, and TensorRT compilation. The implementation lives in src/inference/inference_optimizer.py.
Optimization Techniques
| Technique | Latency Reduction | Size Reduction | Accuracy Impact |
|---|---|---|---|
| ONNX Conversion | 20-40% | Minimal | None |
| INT8 Quantization | 40-60% | 50-75% | Low (less than 1%) |
| FP16 Mixed Precision | 30-50% | 50% | Minimal |
| Pruning | 20-50% | 30-70% | Low to moderate |
| Knowledge Distillation | Variable | 50-90% | Moderate |
| TensorRT Compilation | 50-70% | Variable | None to low |
| Dynamic Batching | N/A (throughput) | N/A | None |
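As a worked example of reading the table (illustrative arithmetic only, not output from the module): the quoted 50-75% size reduction for INT8 quantization maps a 245 MB model into roughly a 61-123 MB range.

```python
# Expected size range after INT8 quantization of a 245 MB model,
# given the table's quoted 50-75% size reduction.
original_mb = 245
low_reduction, high_reduction = 0.50, 0.75

# Larger reduction -> smaller model, so the high reduction gives the lower bound.
size_range_mb = (original_mb * (1 - high_reduction),
                 original_mb * (1 - low_reduction))
assert size_range_mb == (61.25, 122.5)
```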
Optimize Model
POST /api/v1/inference/optimize

```json
{
  "model_id": "model-xyz789",
  "optimizations": ["onnx_conversion", "quantization"],
  "quantization_config": {
    "method": "dynamic",
    "precision": "int8"
  },
  "validation_dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn LIMIT 1000"
  },
  "accuracy_threshold": 0.99
}
```

Response
```json
{
  "optimization_id": "opt-abc123",
  "status": "completed",
  "original": {
    "format": "pkl",
    "size_mb": 245,
    "latency_ms": 15.2,
    "accuracy": 0.912
  },
  "optimized": {
    "format": "onnx",
    "size_mb": 62,
    "latency_ms": 6.8,
    "accuracy": 0.910
  },
  "improvement": {
    "size_reduction": "74.7%",
    "latency_reduction": "55.3%",
    "accuracy_change": "-0.002"
  },
  "artifact_id": "model-xyz789-optimized"
}
```

ONNX Conversion
Converts models from framework-specific formats to ONNX for portable, optimized inference:
```python
class InferenceOptimizer:
    def convert_to_onnx(self, model, sample_input):
        import onnx

        # Export the model to ONNX, then load and structurally validate
        # the resulting graph before returning its path.
        onnx_path = self._export_onnx(model, sample_input)
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        return onnx_path
```

| Source Framework | Conversion Method |
|---|---|
| scikit-learn | skl2onnx converter |
| XGBoost | onnxmltools converter |
| LightGBM | onnxmltools converter |
| PyTorch | torch.onnx.export |
| TensorFlow | tf2onnx converter |
Quantization
Reduces numerical precision of model weights and activations:
| Quantization Type | Description | Supported |
|---|---|---|
| Dynamic | Quantize weights statically, activations dynamically | All ONNX models |
| Static | Calibrate with representative data | ONNX with calibration set |
| Quantization-Aware Training | Train with simulated quantization | PyTorch models |
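To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization (a hypothetical helper, not the module's API): floats are mapped to int8 through a single scale factor, and dequantizing back exposes the rounding error that the "Accuracy Impact" column refers to.

```python
def quantize_int8(weights):
    """Quantize a list of floats to int8 with a symmetric per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to floats for comparison with the originals."""
    return [v * scale for v in q]

weights = [0.82, -0.41, 0.05, -1.27, 0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case rounding error per weight is bounded by half the scale step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

Dynamic quantization applies this to weights ahead of time and picks activation scales at runtime; static quantization instead fixes activation scales from a calibration set, which is why it needs representative data.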
Pruning
Removes low-importance weights or neurons to reduce model size:
| Pruning Method | Description |
|---|---|
| Magnitude | Remove weights below threshold |
| Structured | Remove entire filters or attention heads |
| Lottery Ticket | Find sparse subnetwork |
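Magnitude pruning, the simplest of the three, can be sketched in a few lines (a hypothetical helper for illustration, not the module's API): weights below a threshold are zeroed, and the resulting sparsity is what drives the size reduction.

```python
def magnitude_prune(weights, threshold):
    """Zero out weights below the magnitude threshold; return (pruned, sparsity)."""
    pruned = [0.0 if abs(w) < threshold else w for w in weights]
    sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)
    return pruned, sparsity

weights = [0.9, -0.02, 0.4, 0.001, -0.7, 0.03]
pruned, sparsity = magnitude_prune(weights, threshold=0.05)
assert pruned == [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
assert sparsity == 0.5
```

Structured pruning works the same way but removes whole filters or heads at once, which is what lets dense hardware actually realize the latency gain.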
Benchmarking
After optimization, the module runs benchmarks to validate performance:
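The reported p50/p95/p99 values are order statistics over per-request timings. A minimal nearest-rank sketch (illustrative; the module's exact percentile method is not specified):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of raw latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Ten illustrative single-request timings in milliseconds.
latencies = [4.8, 5.2, 5.1, 6.0, 7.9, 5.3, 8.1, 5.0, 12.3, 5.4]
p50 = percentile(latencies, 50)  # 5.3
p95 = percentile(latencies, 95)  # 12.3
```

Note that with only a handful of samples the tail percentiles collapse onto the worst observation, which is why the benchmark uses 1000 iterations after a 100-iteration warmup.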
```json
{
  "benchmark": {
    "iterations": 1000,
    "warmup_iterations": 100,
    "batch_sizes": [1, 16, 32, 64],
    "results": {
      "batch_1": {"p50_ms": 5.2, "p95_ms": 8.1, "p99_ms": 12.3},
      "batch_16": {"p50_ms": 8.5, "p95_ms": 14.2, "p99_ms": 21.0},
      "batch_32": {"p50_ms": 12.1, "p95_ms": 19.8, "p99_ms": 28.5}
    }
  }
}
```

Configuration
| Environment Variable | Default | Description |
|---|---|---|
| OPTIMIZATION_MAX_TIME_MINUTES | 30 | Max optimization time |
| OPTIMIZATION_ACCURACY_THRESHOLD | 0.99 | Min accuracy ratio vs. original |
| TENSORRT_WORKSPACE_SIZE_GB | 4 | TensorRT workspace memory |
| ONNX_OPSET_VERSION | 17 | ONNX opset version |
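A hedged sketch of how these settings might be read, using the variable names and defaults from the table (the module's actual parsing may differ):

```python
import os

def load_config(env=os.environ):
    """Read optimizer settings from the environment, falling back to defaults."""
    return {
        "max_time_minutes": int(env.get("OPTIMIZATION_MAX_TIME_MINUTES", "30")),
        "accuracy_threshold": float(env.get("OPTIMIZATION_ACCURACY_THRESHOLD", "0.99")),
        "tensorrt_workspace_gb": int(env.get("TENSORRT_WORKSPACE_SIZE_GB", "4")),
        "onnx_opset_version": int(env.get("ONNX_OPSET_VERSION", "17")),
    }

cfg = load_config(env={})  # an empty environment yields the documented defaults
assert cfg["max_time_minutes"] == 30
assert cfg["onnx_opset_version"] == 17
```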