Pipeline and Streaming Architecture
The Pipeline Service orchestrates data engineering workflows -- ETL/ELT pipelines, data transformation jobs, CDC streams, and scheduled data processing. It integrates with Temporal for workflow orchestration, Apache Flink for stream processing, and Apache Spark for batch processing.
2.4.E.1 Pipeline Service (Port 8092)
Core responsibilities:
- Pipeline definition and versioning (DAG-based workflows)
- Job scheduling via Temporal workflow engine
- Pipeline execution monitoring, alerting, and SLA tracking
- Data transformation orchestration (SQL, Python, Spark, Flink)
- Pipeline lineage and dependency tracking
- Failure recovery with checkpoint-based retry
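The checkpoint-based retry in the last bullet can be sketched in plain Python: on failure, a run resumes from the last completed task rather than re-running the whole DAG. The names here (`run_tasks`, the checkpoint set) are illustrative, not the service's actual API.

```python
# Hypothetical sketch of checkpoint-based retry: a failed run resumes from
# the last completed task instead of re-executing the whole pipeline.

def run_tasks(tasks, execute, checkpoint):
    """Run tasks in order, skipping any already recorded in the checkpoint."""
    for task_id in tasks:
        if task_id in checkpoint:
            continue  # succeeded in a previous attempt
        execute(task_id)         # may raise; checkpoint preserves completed work
        checkpoint.add(task_id)  # persist progress after each success

# Simulated run: the third task fails once, then the retry resumes after it.
completed = set()
calls = []

def flaky(task_id):
    calls.append(task_id)
    if task_id == "quality_check" and calls.count("quality_check") == 1:
        raise RuntimeError("transient failure")

tasks = ["extract_orders", "transform_revenue", "quality_check", "notify"]
try:
    run_tasks(tasks, flaky, completed)
except RuntimeError:
    pass
run_tasks(tasks, flaky, completed)  # retry picks up from the checkpoint
```

On the retry, `extract_orders` and `transform_revenue` are skipped; only `quality_check` and `notify` execute.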
Pipeline Definition
Pipelines are defined as DAGs (Directed Acyclic Graphs) of tasks:
```json
{
  "name": "daily_revenue_pipeline",
  "schedule": "0 6 * * *",
  "tasks": [
    {
      "id": "extract_orders",
      "type": "sql",
      "query": "SELECT * FROM raw.orders WHERE date = '${execution_date}'",
      "output": "staging.orders_daily"
    },
    {
      "id": "transform_revenue",
      "type": "sql",
      "depends_on": ["extract_orders"],
      "query": "SELECT region, SUM(amount) as revenue FROM staging.orders_daily GROUP BY region",
      "output": "analytics.revenue_by_region"
    },
    {
      "id": "quality_check",
      "type": "quality",
      "depends_on": ["transform_revenue"],
      "checks": ["row_count > 0", "revenue >= 0"]
    },
    {
      "id": "notify",
      "type": "notification",
      "depends_on": ["quality_check"],
      "channel": "slack",
      "message": "Daily revenue pipeline completed"
    }
  ]
}
```

2.4.E.2 Temporal Integration
Pipeline execution is orchestrated through Temporal workflows:
```
Pipeline Service --> Temporal Server
                        |
                        +--> Workflow: daily_revenue_pipeline
                               |
                               +--> Activity: extract_orders
                               |      (executes SQL via query-engine)
                               |
                               +--> Activity: transform_revenue
                               |      (executes SQL via query-engine)
                               |
                               +--> Activity: quality_check
                               |      (calls data-quality-service)
                               |
                               +--> Activity: notify
                                      (publishes to Kafka for notification-service)
```

Temporal provides:
- Durable execution -- Workflows survive service restarts
- Retry policies -- Configurable per-activity retry with backoff
- Visibility -- Query workflow status, history, and pending activities
- Versioning -- Deploy new workflow versions alongside existing ones
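The activity ordering shown in the diagram above falls out of each task's `depends_on` list. A minimal sketch of deriving that execution order with Kahn's topological sort (an illustration, not the service's actual scheduler):

```python
from collections import deque

def execution_order(tasks):
    """Topologically sort tasks by their depends_on edges (Kahn's algorithm)."""
    deps = {t["id"]: set(t.get("depends_on", [])) for t in tasks}
    dependents = {tid: [] for tid in deps}
    for tid, ds in deps.items():
        for d in ds:
            dependents[d].append(tid)
    ready = deque(tid for tid, ds in deps.items() if not ds)
    order = []
    while ready:
        tid = ready.popleft()
        order.append(tid)
        for nxt in dependents[tid]:
            deps[nxt].discard(tid)
            if not deps[nxt]:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: pipeline is not a DAG")
    return order

tasks = [
    {"id": "extract_orders"},
    {"id": "transform_revenue", "depends_on": ["extract_orders"]},
    {"id": "quality_check", "depends_on": ["transform_revenue"]},
    {"id": "notify", "depends_on": ["quality_check"]},
]
print(execution_order(tasks))
# → ['extract_orders', 'transform_revenue', 'quality_check', 'notify']
```

The cycle check doubles as validation that a submitted pipeline definition really is a DAG.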
2.4.E.3 Flink Stream Processing
Apache Flink handles real-time data transformations and CDC processing:
| Flink Job | Source | Sink | Purpose |
|---|---|---|---|
| agent-performance-agg | Kafka: ai.agent.events | ClickHouse | Aggregate agent performance metrics |
| llm-operations-agg | Kafka: ai.llm.events | ClickHouse | Aggregate LLM token usage and latency |
| session-analytics | Kafka: session.events | ClickHouse | Real-time session analytics |
| state-transition-cdc | PostgreSQL CDC | Kafka | Capture state changes for event sourcing |
Flink jobs are deployed to Kubernetes via the Flink Kubernetes Operator, which manages each job's lifecycle as a custom resource. Each job consumes events from Kafka topics, applies transformations (windowed aggregations, joins, filters), and writes results to ClickHouse for analytical queries.
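The windowed aggregations these jobs perform can be illustrated with a plain-Python sketch of a tumbling event-time window; the real jobs use Flink's windowing API, and the window width and field names below are assumptions for illustration only.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # example: 1-minute tumbling windows

def tumbling_window_sum(events, window_ms=WINDOW_MS):
    """Bucket (timestamp_ms, key, value) events into fixed windows and sum values --
    roughly what a Flink tumbling-window aggregation does before sinking to ClickHouse."""
    windows = defaultdict(float)
    for ts, key, value in events:
        window_start = ts - (ts % window_ms)  # align to the window boundary
        windows[(window_start, key)] += value
    return dict(windows)

events = [
    (1_000, "agent-a", 120.0),   # e.g. a metric sample from ai.agent.events
    (30_000, "agent-a", 80.0),
    (61_000, "agent-a", 200.0),  # lands in the next 1-minute window
    (5_000, "agent-b", 50.0),
]
print(tumbling_window_sum(events))
# → {(0, 'agent-a'): 200.0, (60000, 'agent-a'): 200.0, (0, 'agent-b'): 50.0}
```

Flink adds what this sketch omits: event-time watermarks for late data, keyed state, and exactly-once checkpointing.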
2.4.E.4 Key APIs
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/pipelines | GET/POST | Pipeline management |
| /api/v1/pipelines/{id}/run | POST | Trigger pipeline execution |
| /api/v1/pipelines/{id}/schedule | PUT | Configure scheduling (cron) |
| /api/v1/pipelines/{id}/runs | GET | Execution history |
| /api/v1/pipelines/{id}/runs/{runId} | GET | Run details and task status |
| /api/v1/pipelines/{id}/lineage | GET | Pipeline data lineage |
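As an illustration of how a caller might address these endpoints, here is a minimal client sketch that only builds requests; the base URL is an assumption (the service listens on port 8092 per this section), and the paths come straight from the table.

```python
# Hypothetical client for the Pipeline Service REST API. It constructs
# (method, url) pairs; an HTTP library such as requests would send them.

class PipelineClient:
    def __init__(self, base_url="http://pipeline-service:8092"):  # assumed host
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/api/v1/pipelines{path}"

    def trigger_run(self, pipeline_id):
        # POST /api/v1/pipelines/{id}/run
        return ("POST", self._url(f"/{pipeline_id}/run"))

    def run_status(self, pipeline_id, run_id):
        # GET /api/v1/pipelines/{id}/runs/{runId}
        return ("GET", self._url(f"/{pipeline_id}/runs/{run_id}"))

client = PipelineClient()
print(client.trigger_run("daily_revenue_pipeline"))
# → ('POST', 'http://pipeline-service:8092/api/v1/pipelines/daily_revenue_pipeline/run')
```

A typical flow would POST to `/run`, then poll the returned run via `/runs/{runId}` until its status is terminal.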
Related Sections
- Pipeline Flow -- End-to-end pipeline execution
- CDC Patterns -- Change data capture details
- Kafka Data Store -- Kafka as pipeline backbone
- Pipelines Chapter -- Complete pipeline documentation