Pipeline and Streaming Architecture
The Pipeline Service orchestrates data engineering workflows -- ETL/ELT pipelines, data transformation jobs, CDC streams, and scheduled data processing. It integrates with Temporal for workflow orchestration, Apache Flink for stream processing, and Apache Spark for batch processing.
2.4.E.1 Pipeline Service (Port 8092)
Core responsibilities:
- Pipeline definition and versioning (DAG-based workflows)
- Job scheduling via Temporal workflow engine
- Pipeline execution monitoring, alerting, and SLA tracking
- Data transformation orchestration (SQL, Python, Spark, Flink)
- Pipeline lineage and dependency tracking
- Failure recovery with checkpoint-based retry
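The checkpoint-based retry in the last bullet can be sketched in plain Python: on failure, a run resumes from the last completed task rather than re-running the whole DAG. The names here (`run_tasks`, the checkpoint set) are illustrative, not the service's actual API.

```python
# Hypothetical sketch of checkpoint-based retry: a failed run resumes from
# the last completed task instead of re-executing the whole pipeline.

def run_tasks(tasks, execute, checkpoint):
    """Run tasks in order, skipping any already recorded in the checkpoint."""
    for task_id in tasks:
        if task_id in checkpoint:
            continue  # succeeded in a previous attempt
        execute(task_id)         # may raise; checkpoint preserves completed work
        checkpoint.add(task_id)  # persist progress after each success

# Simulated run: the third task fails once, then the retry resumes after it.
completed = set()
calls = []

def flaky(task_id):
    calls.append(task_id)
    if task_id == "quality_check" and calls.count("quality_check") == 1:
        raise RuntimeError("transient failure")

tasks = ["extract_orders", "transform_revenue", "quality_check", "notify"]
try:
    run_tasks(tasks, flaky, completed)
except RuntimeError:
    pass
run_tasks(tasks, flaky, completed)  # retry picks up from the checkpoint
```

On the retry, `extract_orders` and `transform_revenue` are skipped; only `quality_check` and `notify` execute.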
Pipeline Definition
Pipelines are defined as DAGs (Directed Acyclic Graphs) of tasks:
```json
{
  "name": "daily_revenue_pipeline",
  "schedule": "0 6 * * *",
  "tasks": [
    {
      "id": "extract_orders",
      "type": "sql",
      "query": "SELECT * FROM raw.orders WHERE date = '${execution_date}'",
      "output": "staging.orders_daily"
    },
    {
      "id": "transform_revenue",
      "type": "sql",
      "depends_on": ["extract_orders"],
      "query": "SELECT region, SUM(amount) as revenue FROM staging.orders_daily GROUP BY region",
      "output": "analytics.revenue_by_region"
    },
    {
      "id": "quality_check",
      "type": "quality",
      "depends_on": ["transform_revenue"],
      "checks": ["row_count > 0", "revenue >= 0"]
    },
    {
      "id": "notify",
      "type": "notification",
      "depends_on": ["quality_check"],
      "channel": "slack",
      "message": "Daily revenue pipeline completed"
    }
  ]
}
```

2.4.E.2 Temporal Integration
Pipeline execution is orchestrated through Temporal workflows:
```
Pipeline Service --> Temporal Server
                        |
                        +--> Workflow: daily_revenue_pipeline
                               |
                               +--> Activity: extract_orders
                               |      (executes SQL via query-engine)
                               |
                               +--> Activity: transform_revenue
                               |      (executes SQL via query-engine)
                               |
                               +--> Activity: quality_check
                               |      (calls data-quality-service)
                               |
                               +--> Activity: notify
                                      (publishes to Kafka for notification-service)
```

Temporal provides:
- Durable execution -- Workflows survive service restarts
- Retry policies -- Configurable per-activity retry with backoff
- Visibility -- Query workflow status, history, and pending activities
- Versioning -- Deploy new workflow versions alongside existing ones
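The activity ordering shown in the diagram above falls out of each task's `depends_on` list. A minimal sketch of deriving that execution order with Kahn's topological sort (an illustration, not the service's actual scheduler):

```python
from collections import deque

def execution_order(tasks):
    """Topologically sort tasks by their depends_on edges (Kahn's algorithm)."""
    deps = {t["id"]: set(t.get("depends_on", [])) for t in tasks}
    dependents = {tid: [] for tid in deps}
    for tid, ds in deps.items():
        for d in ds:
            dependents[d].append(tid)
    ready = deque(tid for tid, ds in deps.items() if not ds)
    order = []
    while ready:
        tid = ready.popleft()
        order.append(tid)
        for nxt in dependents[tid]:
            deps[nxt].discard(tid)
            if not deps[nxt]:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("cycle detected: pipeline is not a DAG")
    return order

tasks = [
    {"id": "extract_orders"},
    {"id": "transform_revenue", "depends_on": ["extract_orders"]},
    {"id": "quality_check", "depends_on": ["transform_revenue"]},
    {"id": "notify", "depends_on": ["quality_check"]},
]
print(execution_order(tasks))
# → ['extract_orders', 'transform_revenue', 'quality_check', 'notify']
```

The cycle check doubles as validation that a submitted pipeline definition really is a DAG.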
2.4.E.3 Flink Stream Processing
Apache Flink handles real-time data transformations and CDC processing:
| Flink Job | Source | Sink | Purpose |
|---|---|---|---|
| agent-performance-agg | Kafka: ai.agent.events | ClickHouse | Aggregate agent performance metrics |
| llm-operations-agg | Kafka: ai.llm.events | ClickHouse | Aggregate LLM token usage and latency |
| session-analytics | Kafka: session.events | ClickHouse | Real-time session analytics |
| state-transition-cdc | PostgreSQL CDC | Kafka | Capture state changes for event sourcing |
Flink jobs are deployed to Kubernetes via the Flink Kubernetes Operator, which manages each job's lifecycle as a custom resource. Each job consumes events from Kafka topics, applies transformations (windowed aggregations, joins, filters), and writes results to ClickHouse for analytical queries.
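The windowed aggregations these jobs perform can be illustrated with a plain-Python sketch of a tumbling event-time window; the real jobs use Flink's windowing API, and the window width and field names below are assumptions for illustration only.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # example: 1-minute tumbling windows

def tumbling_window_sum(events, window_ms=WINDOW_MS):
    """Bucket (timestamp_ms, key, value) events into fixed windows and sum values --
    roughly what a Flink tumbling-window aggregation does before sinking to ClickHouse."""
    windows = defaultdict(float)
    for ts, key, value in events:
        window_start = ts - (ts % window_ms)  # align to the window boundary
        windows[(window_start, key)] += value
    return dict(windows)

events = [
    (1_000, "agent-a", 120.0),   # e.g. a metric sample from ai.agent.events
    (30_000, "agent-a", 80.0),
    (61_000, "agent-a", 200.0),  # lands in the next 1-minute window
    (5_000, "agent-b", 50.0),
]
print(tumbling_window_sum(events))
# → {(0, 'agent-a'): 200.0, (60000, 'agent-a'): 200.0, (0, 'agent-b'): 50.0}
```

Flink adds what this sketch omits: event-time watermarks for late data, keyed state, and exactly-once checkpointing.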
2.4.E.4 Key APIs
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/pipelines | GET/POST | Pipeline management |
| /api/v1/pipelines/{id}/run | POST | Trigger pipeline execution |
| /api/v1/pipelines/{id}/schedule | PUT | Configure scheduling (cron) |
| /api/v1/pipelines/{id}/runs | GET | Execution history |
| /api/v1/pipelines/{id}/runs/{runId} | GET | Run details and task status |
| /api/v1/pipelines/{id}/lineage | GET | Pipeline data lineage |
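As an illustration of how a caller might address these endpoints, here is a minimal client sketch that only builds requests; the base URL is an assumption (the service listens on port 8092 per this section), and the paths come straight from the table.

```python
# Hypothetical client for the Pipeline Service REST API. It constructs
# (method, url) pairs; an HTTP library such as requests would send them.

class PipelineClient:
    def __init__(self, base_url="http://pipeline-service:8092"):  # assumed host
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/api/v1/pipelines{path}"

    def trigger_run(self, pipeline_id):
        # POST /api/v1/pipelines/{id}/run
        return ("POST", self._url(f"/{pipeline_id}/run"))

    def run_status(self, pipeline_id, run_id):
        # GET /api/v1/pipelines/{id}/runs/{runId}
        return ("GET", self._url(f"/{pipeline_id}/runs/{run_id}"))

client = PipelineClient()
print(client.trigger_run("daily_revenue_pipeline"))
# → ('POST', 'http://pipeline-service:8092/api/v1/pipelines/daily_revenue_pipeline/run')
```

A typical flow would POST to `/run`, then poll the returned run via `/runs/{runId}` until its status is terminal.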
Related Sections
- Pipeline Flow -- End-to-end pipeline execution
- CDC Patterns -- Change data capture details
- Kafka Data Store -- Kafka as pipeline backbone
- Pipelines Chapter -- Complete pipeline documentation