MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Chapter 11: Pipelines and Data Engineering

Overview

The Pipelines and Data Engineering layer provides the data movement, transformation, and orchestration capabilities of the MATIH platform. It encompasses the Pipeline Service (a unified FastAPI application), integrations with Apache Airflow for DAG orchestration, Apache Spark v4.1.1 for distributed processing, Apache Flink for real-time streaming, and Temporal for workflow orchestration. Pre-built pipeline templates accelerate development for common data engineering patterns across industries.


What You Will Learn

By the end of this chapter, you will understand:

  • The Pipeline Service architecture, its unified API for managing pipelines across multiple execution engines, and the operator framework for data transformations
  • Airflow integration for DAG orchestration, custom operators for MATIH workloads, DAG synchronization, and Airflow API management
  • Spark integration with Spark v4.1.1 and Spark Connect for distributed data processing, Iceberg table operations, and Spark Operator on Kubernetes
  • Flink streaming for CDC pipelines, real-time analytics aggregation, and the four production Flink SQL jobs
  • Temporal workflows for long-running data pipeline orchestration, activity definitions, retry policies, and workflow monitoring
  • Pipeline templates organized by industry vertical, providing pre-built transformation patterns for rapid pipeline development
  • The complete API surface for pipeline management, monitoring, scheduling, and execution
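
To make the operator framework concrete, here is a minimal sketch of what a transformation-operator contract could look like. The class and method names (`Operator`, `apply`, `FilterOperator`) are illustrative assumptions, not the Pipeline Service's actual interface:

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable

class Operator(ABC):
    """Illustrative base class for a pipeline transformation operator."""

    def __init__(self, name: str, **config: Any) -> None:
        self.name = name
        self.config = config

    @abstractmethod
    def apply(self, rows: Iterable[dict]) -> list[dict]:
        """Transform a batch of records and return the results."""

class FilterOperator(Operator):
    """Keep only rows whose column value exceeds a threshold."""

    def apply(self, rows: Iterable[dict]) -> list[dict]:
        column, threshold = self.config["column"], self.config["threshold"]
        return [r for r in rows if r.get(column, 0) > threshold]

# Usage: drop non-positive order amounts from a batch
op = FilterOperator("positive-amounts", column="amount", threshold=0)
result = op.apply([{"amount": 10}, {"amount": -5}])
```

The real framework ships eight operators (see Key Numbers below), but they plausibly share this batch-in/batch-out shape.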

Chapter Structure

| Section | Description | Audience |
|---|---|---|
| Pipeline Service | Architecture, unified API, operator framework, DAG generation, connection management | Data engineers, platform engineers |
| Airflow | DAG orchestration, custom operators, DAG sync, API integration, monitoring | Data engineers |
| Spark | Spark v4.1.1, Spark Connect, Iceberg integration, Spark Operator, Polaris catalog | Data engineers, platform engineers |
| Flink | CDC streaming, state-transition archival, session analytics, agent performance, LLM operations | Data engineers, platform engineers |
| Temporal | Workflow orchestration, activities, retry policies, workflow monitoring | Data engineers, backend developers |
| Templates | Industry-specific pipeline templates, template library management, template SDK | Data engineers, solution architects |
| API Reference | Complete REST API documentation for all Pipeline Service endpoints | All developers |

Pipelines at a Glance

The pipeline layer supports four execution engines unified behind a single API:

                      +-------------------+
                      |  Pipeline Service |
                      |  (Python/FastAPI) |
                      |  Port 8000        |
                      +--------+----------+
                               |
              +----------------+----------------+----------------+
              |                |                |                |
     +--------v------+ +------v-------+ +------v-------+ +-----v-------+
     | Apache Airflow| | Apache Spark | | Apache Flink | | Temporal    |
     | (DAG Orch)    | | v4.1.1       | | (Streaming)  | | (Workflows) |
     +---------------+ +--------------+ +--------------+ +-------------+
              |                |                |                |
     +--------v------+ +------v-------+ +------v-------+ +-----v-------+
     | Scheduled ETL | | Distributed  | | CDC / Real-  | | Long-running|
      | Batch jobs    | | Processing   | | time Agg     | |Orchestration|
     +---------------+ +--------------+ +--------------+ +-------------+

Key Numbers

| Metric | Value |
|---|---|
| Pipeline Service technology | Python 3.11, FastAPI |
| Pipeline Service port | 8000 |
| Spark version | 4.1.1 with Spark Connect |
| Flink jobs (production) | 4 (CDC archival, session analytics, agent performance, LLM operations) |
| Pipeline templates | 10+ industry verticals |
| Custom Airflow operators | 8 (database extract, API extract, cloud storage, Kafka, SQL transform, dbt, ClickHouse load, Delta load) |
| Supported connections | JDBC, API, SFTP, Kafka, S3/GCS/Azure Blob, Iceberg |

Key Source Files

| Component | Location |
|---|---|
| Pipeline API routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/pipelines.py |
| Airflow routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/airflow.py |
| Template routes | data-plane/pipeline-service/src/matih_pipeline/api/routes/templates.py |
| DAG generator | data-plane/pipeline-service/src/matih_pipeline/dags/pipeline_dag_generator.py |
| Airflow service | data-plane/pipeline-service/src/matih_pipeline/services/airflow_service.py |
| Spark service | data-plane/pipeline-service/src/matih_pipeline/services/spark_service.py |
| Spark operator service | data-plane/pipeline-service/src/matih_pipeline/services/spark_operator_service.py |
| Flink service | data-plane/pipeline-service/src/matih_pipeline/services/flink_service.py |
| Flink operator service | data-plane/pipeline-service/src/matih_pipeline/services/flink_operator_service.py |
| Template service | data-plane/pipeline-service/src/matih_pipeline/services/template_service.py |
| Flink CDC job | infrastructure/flink/jobs/state-transition-cdc.sql |
| Flink session analytics | infrastructure/flink/jobs/session-analytics.sql |
| Flink agent performance | infrastructure/flink/jobs/agent-performance-agg.sql |
| Flink LLM operations | infrastructure/flink/jobs/llm-operations-agg.sql |
| Spark Iceberg template | templates/spark/spark-iceberg-job.yaml |
| Pipeline templates | templates/data/ (10 industry verticals) |

Design Principles

  1. Engine agnosticism. The Pipeline Service provides a unified interface for pipeline definition regardless of whether execution happens on Airflow, Spark, Flink, or Temporal.

  2. Template-driven development. Pre-built templates for common patterns reduce pipeline development from days to hours.

  3. Lineage by default. Every pipeline execution emits OpenLineage events for automatic lineage tracking.

  4. Quality gates. Pipelines integrate with the Data Quality Service to validate data at critical transformation points.

  5. Cost awareness. Pipeline execution includes cost estimation and tracking for resource optimization.

  6. Observability everywhere. Every pipeline execution emits metrics, logs, and traces for complete visibility.
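
Principle 3 ("lineage by default") can be pictured as the service emitting an OpenLineage-style RunEvent per execution. A minimal hand-rolled sketch follows; it builds the JSON shape directly rather than using the openlineage-python client, and the namespace and dataset names are made up for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

def lineage_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style RunEvent for a completed run.

    Namespace and producer values are illustrative placeholders.
    """
    ns = "matih-pipelines"
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": ns, "name": job_name},
        "inputs": [{"namespace": ns, "name": n} for n in inputs],
        "outputs": [{"namespace": ns, "name": n} for n in outputs],
        "producer": "matih-pipeline-service",
    }

event = lineage_event("orders-daily-etl",
                      inputs=["raw.orders"],
                      outputs=["analytics.orders_clean"])
serialized = json.dumps(event)
```

In the platform, these events flow to the Data Catalog (Chapter 10), which stitches runs into end-to-end lineage graphs.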


Execution Engine Selection

The Pipeline Service selects the appropriate execution engine based on pipeline characteristics:

| Engine | Best For | Trigger |
|---|---|---|
| Airflow | Scheduled batch ETL, DAG-based workflows | Cron schedules, API triggers |
| Spark | Large-scale distributed processing, Iceberg operations | Data volume > 10 GB, complex transformations |
| Flink | Real-time streaming, CDC | Continuous streams, sub-minute latency requirements |
| Temporal | Long-running workflows, human-in-the-loop | Multi-step orchestration, approval workflows |
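
The selection table can be sketched as a small decision function. The parameter names, precedence order, and 10 GB threshold below are illustrative; the service's actual heuristics may differ:

```python
def select_engine(
    streaming: bool = False,
    long_running: bool = False,
    data_volume_gb: float = 0.0,
) -> str:
    """Pick an execution engine from pipeline characteristics.

    Mirrors the selection table: Flink for continuous streams, Temporal
    for long-running / human-in-the-loop workflows, Spark above ~10 GB,
    and Airflow for everything else (scheduled batch ETL). The precedence
    and threshold are illustrative assumptions.
    """
    if streaming:            # sub-minute latency, CDC, continuous streams
        return "flink"
    if long_running:         # multi-step orchestration, approvals
        return "temporal"
    if data_volume_gb > 10:  # large-scale distributed processing
        return "spark"
    return "airflow"         # default: scheduled, DAG-based batch ETL

choice = select_engine(data_volume_gb=50)
```

A real implementation would likely weigh several signals together (e.g. a streaming source with an approval step), but the single-pass precedence above captures the table's intent.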

How This Chapter Connects

The Pipeline Service integrates with several platform services:

  • The Data Catalog (Chapter 10) receives lineage events from pipeline executions and provides schema metadata for transformations
  • The Data Quality Service (Chapter 10) validates data at quality gates within pipeline stages
  • The Query Engine (Chapter 9) executes SQL transformations and provides data access for pipeline reads
  • The AI Service (Chapter 12) consumes the analytics produced by Flink streaming jobs

Begin with the Pipeline Service section to understand the unified architecture.