Chapter 11: Pipelines and Data Engineering
The Pipelines and Data Engineering layer provides the data movement, transformation, and orchestration capabilities of the MATIH platform. It encompasses the Pipeline Service (a unified FastAPI application), integrations with Apache Airflow for DAG orchestration, Apache Spark v4.1.1 for distributed processing, Apache Flink for real-time streaming, and Temporal for workflow orchestration. Pre-built pipeline templates accelerate development for common data engineering patterns across industries.
What You Will Learn
By the end of this chapter, you will understand:
- The Pipeline Service architecture, its unified API for managing pipelines across multiple execution engines, and the operator framework for data transformations
- Airflow integration for DAG orchestration, custom operators for MATIH workloads, DAG synchronization, and Airflow API management
- Spark integration with Spark v4.1.1 and Spark Connect for distributed data processing, Iceberg table operations, and Spark Operator on Kubernetes
- Flink streaming for CDC pipelines, real-time analytics aggregation, and the four production Flink SQL jobs
- Temporal workflows for long-running data pipeline orchestration, activity definitions, retry policies, and workflow monitoring
- Pipeline templates organized by industry vertical, providing pre-built transformation patterns for rapid pipeline development
- The complete API surface for pipeline management, monitoring, scheduling, and execution
Chapter Structure
| Section | Description | Audience |
|---|---|---|
| Pipeline Service | Architecture, unified API, operator framework, DAG generation, connection management | Data engineers, platform engineers |
| Airflow | DAG orchestration, custom operators, DAG sync, API integration, monitoring | Data engineers |
| Spark | Spark v4.1.1, Spark Connect, Iceberg integration, Spark Operator, Polaris catalog | Data engineers, platform engineers |
| Flink | CDC streaming, state-transition archival, session analytics, agent performance, LLM operations | Data engineers, platform engineers |
| Temporal | Workflow orchestration, activities, retry policies, workflow monitoring | Data engineers, backend developers |
| Templates | Industry-specific pipeline templates, template library management, template SDK | Data engineers, solution architects |
| API Reference | Complete REST API documentation for all Pipeline Service endpoints | All developers |
Pipelines at a Glance
The pipeline layer supports four execution engines unified behind a single API:
                            +-------------------+
                            |  Pipeline Service |
                            |  (Python/FastAPI) |
                            |     Port 8000     |
                            +---------+---------+
                                      |
        +-------------------+---------+---------+-------------------+
        |                   |                   |                   |
+-------v--------+  +-------v--------+  +-------v--------+  +-------v--------+
| Apache Airflow |  |  Apache Spark  |  |  Apache Flink  |  |    Temporal    |
|   (DAG Orch)   |  |     v4.1.1     |  |  (Streaming)   |  |  (Workflows)   |
+-------+--------+  +-------+--------+  +-------+--------+  +-------+--------+
        |                   |                   |                   |
+-------v--------+  +-------v--------+  +-------v--------+  +-------v--------+
| Scheduled ETL  |  |  Distributed   |  |  CDC / Real-   |  |  Long-running  |
|  Batch jobs    |  |  Processing    |  |  time Agg      |  | Orchestration  |
+----------------+  +----------------+  +----------------+  +----------------+

Key Numbers
| Metric | Value |
|---|---|
| Pipeline Service technology | Python 3.11, FastAPI |
| Pipeline Service port | 8000 |
| Spark version | 4.1.1 with Spark Connect |
| Flink jobs (production) | 4 (CDC archival, session analytics, agent performance, LLM operations) |
| Pipeline templates | 10+ industry verticals |
| Custom Airflow operators | 8 (database extract, API extract, cloud storage, Kafka, SQL transform, dbt, ClickHouse load, Delta load) |
| Supported connections | JDBC, API, SFTP, Kafka, S3/GCS/Azure Blob, Iceberg |
Key Source Files
| Component | Location |
|---|---|
| Pipeline API routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/pipelines.py |
| Airflow routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/airflow.py |
| Template routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/templates.py |
| DAG generator | data-plane/pipeline-service/src/matih_pipeline/dags/pipeline_dag_generator.py |
| Airflow service | data-plane/pipeline-service/src/matih_pipeline/services/airflow_service.py |
| Spark service | data-plane/pipeline-service/src/matih_pipeline/services/spark_service.py |
| Spark operator service | data-plane/pipeline-service/src/matih_pipeline/services/spark_operator_service.py |
| Flink service | data-plane/pipeline-service/src/matih_pipeline/services/flink_service.py |
| Flink operator service | data-plane/pipeline-service/src/matih_pipeline/services/flink_operator_service.py |
| Template service | data-plane/pipeline-service/src/matih_pipeline/services/template_service.py |
| Flink CDC job | infrastructure/flink/jobs/state-transition-cdc.sql |
| Flink session analytics | infrastructure/flink/jobs/session-analytics.sql |
| Flink agent performance | infrastructure/flink/jobs/agent-performance-agg.sql |
| Flink LLM operations | infrastructure/flink/jobs/llm-operations-agg.sql |
| Spark Iceberg template | templates/spark/spark-iceberg-job.yaml |
| Pipeline templates | templates/data/ (10 industry verticals) |
Design Principles
- Engine agnosticism. The Pipeline Service provides a unified interface for pipeline definition regardless of whether execution happens on Airflow, Spark, Flink, or Temporal.
- Template-driven development. Pre-built templates for common patterns reduce pipeline development from days to hours.
- Lineage by default. Every pipeline execution emits OpenLineage events for automatic lineage tracking.
- Quality gates. Pipelines integrate with the Data Quality Service to validate data at critical transformation points.
- Cost awareness. Pipeline execution includes cost estimation and tracking for resource optimization.
- Observability everywhere. Every pipeline execution emits metrics, logs, and traces for complete visibility.
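To make "lineage by default" concrete, here is a minimal OpenLineage-style run event a pipeline execution could emit. The top-level field names follow the public OpenLineage specification; the `matih` namespaces and producer URL are placeholders, not the platform's actual configuration.

```python
import uuid
from datetime import datetime, timezone

def lineage_event(job_name: str, event_type: str,
                  inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style run event (illustrative namespaces)."""
    def dataset(name: str) -> dict:
        return {"namespace": "matih", "name": name}  # placeholder namespace

    return {
        "eventType": event_type,                     # START, COMPLETE, or FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "matih.pipelines", "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        "producer": "https://example.com/matih-pipeline-service",  # illustrative
    }

event = lineage_event("orders_daily", "COMPLETE",
                      inputs=["raw.orders"], outputs=["analytics.orders_daily"])
```

Emitting one event at start and one at completion is enough for the Data Catalog (Chapter 10) to stitch together input-to-output lineage automatically.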
Execution Engine Selection
The Pipeline Service selects the appropriate execution engine based on pipeline characteristics:
| Engine | Best For | Trigger |
|---|---|---|
| Airflow | Scheduled batch ETL, DAG-based workflows | Cron schedules, API triggers |
| Spark | Large-scale distributed processing, Iceberg operations | Data volume > 10GB, complex transformations |
| Flink | Real-time streaming, CDC | Continuous streams, sub-minute latency requirements |
| Temporal | Long-running workflows, human-in-the-loop | Multi-step orchestration, approval workflows |
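The selection rules in the table above can be sketched as a simple heuristic. This is illustrative only: the real Pipeline Service logic lives in its services layer and may weigh more signals than these four.

```python
def select_engine(streaming: bool, data_volume_gb: float,
                  long_running: bool, human_in_the_loop: bool) -> str:
    """Map pipeline characteristics to an execution engine (illustrative)."""
    if streaming:
        return "flink"     # continuous streams, sub-minute latency
    if long_running or human_in_the_loop:
        return "temporal"  # multi-step orchestration, approval workflows
    if data_volume_gb > 10:
        return "spark"     # large-scale distributed processing
    return "airflow"       # scheduled batch ETL by default
```

Note the precedence: streaming wins over everything else, since a continuous source rules out batch engines regardless of volume.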
How This Chapter Connects
The Pipeline Service integrates with several platform services:
- The Data Catalog (Chapter 10) receives lineage events from pipeline executions and provides schema metadata for transformations
- The Data Quality Service (Chapter 10) validates data at quality gates within pipeline stages
- The Query Engine (Chapter 9) executes SQL transformations and provides data access for pipeline reads
- The AI Service (Chapter 12) consumes the analytics produced by Flink streaming jobs
Begin with the Pipeline Service section to understand the unified architecture.