Chapter 11: Pipelines and Data Engineering
The Pipelines and Data Engineering layer provides the data movement, transformation, and orchestration capabilities of the MATIH platform. It encompasses the Pipeline Service (a unified FastAPI application), integrations with Apache Airflow for DAG orchestration, Apache Spark v4.1.1 for distributed processing, Apache Flink for real-time streaming, and Temporal for workflow orchestration. Pre-built pipeline templates accelerate development for common data engineering patterns across industries.
What You Will Learn
By the end of this chapter, you will understand:
- The Pipeline Service architecture, its unified API for managing pipelines across multiple execution engines, and the operator framework for data transformations
- Airflow integration for DAG orchestration, custom operators for MATIH workloads, DAG synchronization, and Airflow API management
- Spark integration with Spark v4.1.1 and Spark Connect for distributed data processing, Iceberg table operations, and Spark Operator on Kubernetes
- Flink streaming for CDC pipelines, real-time analytics aggregation, and the four production Flink SQL jobs
- Temporal workflows for long-running data pipeline orchestration, activity definitions, retry policies, and workflow monitoring
- Pipeline templates organized by industry vertical, providing pre-built transformation patterns for rapid pipeline development
- The complete API surface for pipeline management, monitoring, scheduling, and execution
Chapter Structure
| Section | Description | Audience |
|---|---|---|
| Pipeline Service | Architecture, unified API, operator framework, DAG generation, connection management | Data engineers, platform engineers |
| Airflow | DAG orchestration, custom operators, DAG sync, API integration, monitoring | Data engineers |
| Spark | Spark v4.1.1, Spark Connect, Iceberg integration, Spark Operator, Polaris catalog | Data engineers, platform engineers |
| Flink | CDC streaming, state-transition archival, session analytics, agent performance, LLM operations | Data engineers, platform engineers |
| Temporal | Workflow orchestration, activities, retry policies, workflow monitoring | Data engineers, backend developers |
| Templates | Industry-specific pipeline templates, template library management, template SDK | Data engineers, solution architects |
| API Reference | Complete REST API documentation for all Pipeline Service endpoints | All developers |
Pipelines at a Glance
The pipeline layer supports four execution engines unified behind a single API:
                            +-------------------+
                            |  Pipeline Service |
                            |  (Python/FastAPI) |
                            |     Port 8000     |
                            +---------+---------+
                                      |
        +-------------------+---------+---------+-------------------+
        |                   |                   |                   |
+-------v--------+  +-------v--------+  +-------v--------+  +-------v--------+
| Apache Airflow |  |  Apache Spark  |  |  Apache Flink  |  |    Temporal    |
|   (DAG Orch)   |  |     v4.1.1     |  |  (Streaming)   |  |  (Workflows)   |
+-------+--------+  +-------+--------+  +-------+--------+  +-------+--------+
        |                   |                   |                   |
+-------v--------+  +-------v--------+  +-------v--------+  +-------v--------+
| Scheduled ETL  |  |  Distributed   |  |  CDC / Real-   |  |  Long-running  |
|  Batch jobs    |  |  Processing    |  |  time Agg      |  | Orchestration  |
+----------------+  +----------------+  +----------------+  +----------------+

Key Numbers
| Metric | Value |
|---|---|
| Pipeline Service technology | Python 3.11, FastAPI |
| Pipeline Service port | 8000 |
| Spark version | 4.1.1 with Spark Connect |
| Flink jobs (production) | 4 (CDC archival, session analytics, agent performance, LLM operations) |
| Pipeline templates | 10+ industry verticals |
| Custom Airflow operators | 8 (database extract, API extract, cloud storage, Kafka, SQL transform, dbt, ClickHouse load, Delta load) |
| Supported connections | JDBC, API, SFTP, Kafka, S3/GCS/Azure Blob, Iceberg |
Key Source Files
| Component | Location |
|---|---|
| Pipeline API routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/pipelines.py |
| Airflow routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/airflow.py |
| Template routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/templates.py |
| DAG generator | data-plane/pipeline-service/src/matih_pipeline/dags/pipeline_dag_generator.py |
| Airflow service | data-plane/pipeline-service/src/matih_pipeline/services/airflow_service.py |
| Spark service | data-plane/pipeline-service/src/matih_pipeline/services/spark_service.py |
| Spark operator service | data-plane/pipeline-service/src/matih_pipeline/services/spark_operator_service.py |
| Flink service | data-plane/pipeline-service/src/matih_pipeline/services/flink_service.py |
| Flink operator service | data-plane/pipeline-service/src/matih_pipeline/services/flink_operator_service.py |
| Template service | data-plane/pipeline-service/src/matih_pipeline/services/template_service.py |
| Flink CDC job | infrastructure/flink/jobs/state-transition-cdc.sql |
| Flink session analytics | infrastructure/flink/jobs/session-analytics.sql |
| Flink agent performance | infrastructure/flink/jobs/agent-performance-agg.sql |
| Flink LLM operations | infrastructure/flink/jobs/llm-operations-agg.sql |
| Spark Iceberg template | templates/spark/spark-iceberg-job.yaml |
| Pipeline templates | templates/data/ (10 industry verticals) |
Design Principles
- Engine agnosticism. The Pipeline Service provides a unified interface for pipeline definition regardless of whether execution happens on Airflow, Spark, Flink, or Temporal.
- Template-driven development. Pre-built templates for common patterns reduce pipeline development from days to hours.
- Lineage by default. Every pipeline execution emits OpenLineage events for automatic lineage tracking.
- Quality gates. Pipelines integrate with the Data Quality Service to validate data at critical transformation points.
- Cost awareness. Pipeline execution includes cost estimation and tracking for resource optimization.
- Observability everywhere. Every pipeline execution emits metrics, logs, and traces for complete visibility.
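To make "lineage by default" concrete, here is a minimal OpenLineage-style run event a pipeline execution could emit. The top-level field names follow the public OpenLineage specification; the `matih` namespaces and producer URL are placeholders, not the platform's actual configuration.

```python
import uuid
from datetime import datetime, timezone

def lineage_event(job_name: str, event_type: str,
                  inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style run event (illustrative namespaces)."""
    def dataset(name: str) -> dict:
        return {"namespace": "matih", "name": name}  # placeholder namespace

    return {
        "eventType": event_type,                     # START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "matih.pipelines", "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        "producer": "https://example.com/matih-pipeline-service",  # illustrative
    }

event = lineage_event("orders_daily", "COMPLETE",
                      inputs=["raw.orders"], outputs=["analytics.orders_daily"])
```

Emitting one event at start and one at completion is enough for the Data Catalog (Chapter 10) to stitch together input-to-output lineage automatically.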
Execution Engine Selection
The Pipeline Service selects the appropriate execution engine based on pipeline characteristics:
| Engine | Best For | Trigger |
|---|---|---|
| Airflow | Scheduled batch ETL, DAG-based workflows | Cron schedules, API triggers |
| Spark | Large-scale distributed processing, Iceberg operations | Data volume > 10GB, complex transformations |
| Flink | Real-time streaming, CDC | Continuous streams, sub-minute latency requirements |
| Temporal | Long-running workflows, human-in-the-loop | Multi-step orchestration, approval workflows |
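The selection rules in the table above can be sketched as a simple heuristic. This is illustrative only: the real Pipeline Service logic lives in its services layer and may weigh more signals than these four.

```python
def select_engine(streaming: bool, data_volume_gb: float,
                  long_running: bool, human_in_the_loop: bool) -> str:
    """Map pipeline characteristics to an execution engine (illustrative)."""
    if streaming:
        return "flink"     # continuous streams, sub-minute latency
    if long_running or human_in_the_loop:
        return "temporal"  # multi-step orchestration, approval workflows
    if data_volume_gb > 10:
        return "spark"     # large-scale distributed processing
    return "airflow"       # scheduled batch ETL by default
```

Note the precedence: streaming wins over everything else, since a continuous source rules out batch engines regardless of volume.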
How This Chapter Connects
The Pipeline Service integrates with several platform services:
- The Data Catalog (Chapter 10) receives lineage events from pipeline executions and provides schema metadata for transformations
- The Data Quality Service (Chapter 10) validates data at quality gates within pipeline stages
- The Query Engine (Chapter 9) executes SQL transformations and provides data access for pipeline reads
- The AI Service (Chapter 12) consumes the analytics produced by Flink streaming jobs
Begin with the Pipeline Service section to understand the unified architecture.