Pipeline Flow
The pipeline flow traces the execution of data pipelines from definition through orchestration, execution, and monitoring. Pipelines are managed by the Pipeline Service and orchestrated through Temporal for durable, fault-tolerant workflow execution.
Pipeline Execution Path
Data Workbench / API
|
v
Pipeline Service (Port 8092)
| 1. Validate pipeline definition
| 2. Apply tenant context
| 3. Submit workflow to Temporal
|
v
Temporal (Workflow Orchestration)
| 4. Create workflow execution
| 5. Execute steps sequentially/in parallel
|
v
Step Execution
+-- SQL Transform --> Query Engine --> Trino
+-- Spark Job --> Spark on Kubernetes
+-- Flink Job --> Flink Operator
+-- Python Script --> Job container
+-- Data Quality Check --> Data Quality Service
|
v
Pipeline Service
| 6. Update pipeline status
| 7. Publish completion event (Kafka)
| 8. Trigger notifications
|
v
Result (pipeline output)
Pipeline Lifecycle
| Phase | State | Description |
|---|---|---|
| Definition | DRAFT | Pipeline created in the editor |
| Validation | VALIDATED | Schema and dependency checks passed |
| Scheduling | SCHEDULED | Cron schedule configured |
| Execution | RUNNING | Temporal workflow executing steps |
| Step completion | STEP_COMPLETED | Individual step finished |
| Success | SUCCEEDED | All steps completed successfully |
| Failure | FAILED | A step failed and the retry policy is exhausted |
| Retry | RETRYING | Automatic retry of failed step |
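The lifecycle above can be modeled as a small state machine. A minimal sketch follows; the transition map is illustrative (the actual Pipeline Service may permit additional transitions, such as re-validating a draft), and all names are assumptions:

```python
from enum import Enum


class PipelineState(Enum):
    DRAFT = "DRAFT"
    VALIDATED = "VALIDATED"
    SCHEDULED = "SCHEDULED"
    RUNNING = "RUNNING"
    STEP_COMPLETED = "STEP_COMPLETED"
    RETRYING = "RETRYING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Allowed transitions between lifecycle states (illustrative only).
TRANSITIONS = {
    PipelineState.DRAFT: {PipelineState.VALIDATED},
    PipelineState.VALIDATED: {PipelineState.SCHEDULED, PipelineState.RUNNING},
    PipelineState.SCHEDULED: {PipelineState.RUNNING},
    PipelineState.RUNNING: {
        PipelineState.STEP_COMPLETED,
        PipelineState.RETRYING,
        PipelineState.SUCCEEDED,
        PipelineState.FAILED,
    },
    PipelineState.STEP_COMPLETED: {PipelineState.RUNNING, PipelineState.SUCCEEDED},
    PipelineState.RETRYING: {PipelineState.RUNNING, PipelineState.FAILED},
}


def transition(current: PipelineState, target: PipelineState) -> PipelineState:
    """Validate and apply a lifecycle state transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Guarding transitions this way keeps impossible status updates (e.g. DRAFT jumping straight to RUNNING) from ever being persisted.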
Step Types
| Step Type | Execution Engine | Use Case |
|---|---|---|
| SQL transform | Query Engine / Trino | Data transformation via SQL |
| Spark job | Spark on Kubernetes | Large-scale batch processing |
| Flink job | Flink Operator | Real-time stream processing |
| Python script | Job container | Custom Python logic |
| Data quality check | Data Quality Service | Automated quality validation |
| Notification | Notification Service | Alert on step completion/failure |
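Routing a step to its execution engine reduces to a lookup on the step type. The sketch below shows that dispatch pattern; the executor functions, type strings, and `Step` fields are hypothetical stand-ins for the real service code:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Step:
    name: str
    step_type: str  # e.g. "SQL_TRANSFORM", "SPARK_JOB"
    config: dict = field(default_factory=dict)


# Hypothetical executors; each would submit work to the engine named in
# the table above (Query Engine/Trino, Spark on Kubernetes, etc.).
def run_sql_transform(step: Step) -> str:
    return f"submitted {step.name} to Query Engine"


def run_spark_job(step: Step) -> str:
    return f"submitted {step.name} to Spark on Kubernetes"


EXECUTORS: Dict[str, Callable[[Step], str]] = {
    "SQL_TRANSFORM": run_sql_transform,
    "SPARK_JOB": run_spark_job,
    # "FLINK_JOB", "PYTHON_SCRIPT", "DATA_QUALITY_CHECK", "NOTIFICATION" ...
}


def dispatch(step: Step) -> str:
    """Route a step to the executor registered for its type."""
    try:
        executor = EXECUTORS[step.step_type]
    except KeyError:
        raise ValueError(f"unknown step type: {step.step_type}")
    return executor(step)
```

New step types can then be added by registering an executor, without touching the dispatch logic.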
Temporal Integration
Temporal provides durable workflow execution:
| Feature | Benefit |
|---|---|
| Durable execution | Workflow survives service restarts |
| Automatic retries | Configurable retry policies per step |
| Timeout handling | Per-step and per-workflow timeout limits |
| Parallel execution | Independent steps run concurrently |
| Saga pattern | Compensating actions on failure |
| Visibility | Query workflow state and history |
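The per-step retry behavior can be sketched as an exponential-backoff policy. The field names below mirror the shape of Temporal's retry-policy options, but this is a standalone illustration, not the Temporal SDK:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryPolicy:
    # Illustrative fields modeled on Temporal-style retry options.
    initial_interval_s: float = 1.0
    backoff_coefficient: float = 2.0
    maximum_interval_s: float = 60.0
    maximum_attempts: int = 5

    def next_delay(self, attempt: int) -> float:
        """Delay before retry number `attempt` (1-based), capped at the maximum."""
        delay = self.initial_interval_s * (self.backoff_coefficient ** (attempt - 1))
        return min(delay, self.maximum_interval_s)


def run_with_retries(activity: Callable[[], object], policy: RetryPolicy):
    """Invoke `activity` until it succeeds or attempts are exhausted."""
    last_error = None
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return activity()
        except Exception as exc:
            last_error = exc
            # A real orchestrator would sleep policy.next_delay(attempt) here;
            # Temporal handles this durably on the server side.
    raise last_error
```

With Temporal, the retry loop and timers live server-side, so retries survive worker restarts rather than depending on an in-process loop like this one.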
Event Publishing
Pipeline state changes are published as Kafka events:
| Event | Published When | Consumers |
|---|---|---|
| PIPELINE_STARTED | Workflow begins | Audit, notification |
| PIPELINE_STEP_COMPLETED | Each step finishes | Monitoring |
| PIPELINE_SUCCEEDED | All steps complete | Audit, notification, billing |
| PIPELINE_FAILED | A step fails permanently | Audit, notification, alerting |
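An event payload for these topics might look like the sketch below. The field names are illustrative, the real event schema is owned by the Pipeline Service, and actual publishing would go through a Kafka producer client:

```python
import json
from datetime import datetime, timezone


def build_event(event_type: str, pipeline_id: str, tenant_id: str, detail: dict) -> str:
    """Serialize a pipeline lifecycle event as a JSON Kafka message value.

    Field names (pipeline_id, tenant_id, ...) are assumptions for
    illustration; the production schema may differ.
    """
    return json.dumps({
        "event_type": event_type,  # e.g. "PIPELINE_SUCCEEDED"
        "pipeline_id": pipeline_id,
        "tenant_id": tenant_id,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
    })
```

Keying messages by `pipeline_id` would preserve per-pipeline ordering for consumers such as audit and alerting.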
Scheduling
Pipelines support cron-based scheduling:
| Feature | Description |
|---|---|
| Cron expressions | Standard cron syntax with timezone support |
| Dependencies | Trigger on completion of upstream pipelines |
| Manual trigger | On-demand execution via API or UI |
| Backfill | Re-run pipelines for historical date ranges |
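Backfill expands a historical date range into one run per schedule interval. A minimal sketch for a daily schedule, with the function name and daily granularity as assumptions:

```python
from datetime import date, timedelta
from typing import List


def backfill_dates(start: date, end: date) -> List[date]:
    """Run dates for a daily backfill over [start, end], inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

Each generated date would then be passed to the pipeline as its logical execution date, so transforms can filter source data to that partition.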
Related Pages
- Query Flow -- SQL execution within pipeline steps
- ML Flow -- ML training pipelines
- Data Engineer Persona -- Pipeline building workflow
- Compute Engines -- Spark, Flink