Pipeline Flow
The pipeline flow traces the execution of data pipelines from definition through orchestration, execution, and monitoring. Pipelines are managed by the Pipeline Service and orchestrated through Temporal for durable, fault-tolerant workflow execution.
Pipeline Execution Path
Data Workbench / API
|
v
Pipeline Service (Port 8092)
| 1. Validate pipeline definition
| 2. Apply tenant context
| 3. Submit workflow to Temporal
|
v
Temporal (Workflow Orchestration)
| 4. Create workflow execution
| 5. Execute steps sequentially/in parallel
|
v
Step Execution
+-- SQL Transform --> Query Engine --> Trino
+-- Spark Job --> Spark on Kubernetes
+-- Flink Job --> Flink Operator
+-- Python Script --> Job container
+-- Data Quality Check --> Data Quality Service
|
v
Pipeline Service
| 6. Update pipeline status
| 7. Publish completion event (Kafka)
| 8. Trigger notifications
|
v
Result (pipeline output)
Pipeline Lifecycle
| Phase | State | Description |
|---|---|---|
| Definition | DRAFT | Pipeline created in the editor |
| Validation | VALIDATED | Schema and dependency checks passed |
| Scheduling | SCHEDULED | Cron schedule configured |
| Execution | RUNNING | Temporal workflow executing steps |
| Step completion | STEP_COMPLETED | Individual step finished |
| Success | SUCCEEDED | All steps completed successfully |
| Failure | FAILED | A step failed and the retry policy is exhausted |
| Retry | RETRYING | Automatic retry of failed step |
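The lifecycle above can be modeled as a small state machine. A minimal sketch follows; the transition map is illustrative (the actual Pipeline Service may permit additional transitions, such as re-validating a draft), and all names are assumptions:

```python
from enum import Enum


class PipelineState(Enum):
    DRAFT = "DRAFT"
    VALIDATED = "VALIDATED"
    SCHEDULED = "SCHEDULED"
    RUNNING = "RUNNING"
    STEP_COMPLETED = "STEP_COMPLETED"
    RETRYING = "RETRYING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Allowed transitions between lifecycle states (illustrative only).
TRANSITIONS = {
    PipelineState.DRAFT: {PipelineState.VALIDATED},
    PipelineState.VALIDATED: {PipelineState.SCHEDULED, PipelineState.RUNNING},
    PipelineState.SCHEDULED: {PipelineState.RUNNING},
    PipelineState.RUNNING: {
        PipelineState.STEP_COMPLETED,
        PipelineState.RETRYING,
        PipelineState.SUCCEEDED,
        PipelineState.FAILED,
    },
    PipelineState.STEP_COMPLETED: {PipelineState.RUNNING, PipelineState.SUCCEEDED},
    PipelineState.RETRYING: {PipelineState.RUNNING, PipelineState.FAILED},
}


def transition(current: PipelineState, target: PipelineState) -> PipelineState:
    """Validate and apply a lifecycle state transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Guarding transitions this way keeps impossible status updates (e.g. DRAFT jumping straight to RUNNING) from ever being persisted.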
Step Types
| Step Type | Execution Engine | Use Case |
|---|---|---|
| SQL transform | Query Engine / Trino | Data transformation via SQL |
| Spark job | Spark on Kubernetes | Large-scale batch processing |
| Flink job | Flink Operator | Real-time stream processing |
| Python script | Job container | Custom Python logic |
| Data quality check | Data Quality Service | Automated quality validation |
| Notification | Notification Service | Alert on step completion/failure |
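Routing a step to its execution engine reduces to a lookup on the step type. The sketch below shows that dispatch pattern; the executor functions, type strings, and `Step` fields are hypothetical stand-ins for the real service code:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Step:
    name: str
    step_type: str  # e.g. "SQL_TRANSFORM", "SPARK_JOB"
    config: dict = field(default_factory=dict)


# Hypothetical executors; each would submit work to the engine named in
# the table above (Query Engine/Trino, Spark on Kubernetes, etc.).
def run_sql_transform(step: Step) -> str:
    return f"submitted {step.name} to Query Engine"


def run_spark_job(step: Step) -> str:
    return f"submitted {step.name} to Spark on Kubernetes"


EXECUTORS: Dict[str, Callable[[Step], str]] = {
    "SQL_TRANSFORM": run_sql_transform,
    "SPARK_JOB": run_spark_job,
    # "FLINK_JOB", "PYTHON_SCRIPT", "DATA_QUALITY_CHECK", "NOTIFICATION" ...
}


def dispatch(step: Step) -> str:
    """Route a step to the executor registered for its type."""
    try:
        executor = EXECUTORS[step.step_type]
    except KeyError:
        raise ValueError(f"unknown step type: {step.step_type}")
    return executor(step)
```

New step types can then be added by registering an executor, without touching the dispatch logic.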
Temporal Integration
Temporal provides durable workflow execution:
| Feature | Benefit |
|---|---|
| Durable execution | Workflow survives service restarts |
| Automatic retries | Configurable retry policies per step |
| Timeout handling | Per-step and per-workflow timeout limits |
| Parallel execution | Independent steps run concurrently |
| Saga pattern | Compensating actions on failure |
| Visibility | Query workflow state and history |
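The per-step retry behavior can be sketched as an exponential-backoff policy. The field names below mirror the shape of Temporal's retry-policy options, but this is a standalone illustration, not the Temporal SDK:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RetryPolicy:
    # Illustrative fields modeled on Temporal-style retry options.
    initial_interval_s: float = 1.0
    backoff_coefficient: float = 2.0
    maximum_interval_s: float = 60.0
    maximum_attempts: int = 5

    def next_delay(self, attempt: int) -> float:
        """Delay before retry number `attempt` (1-based), capped at the maximum."""
        delay = self.initial_interval_s * (self.backoff_coefficient ** (attempt - 1))
        return min(delay, self.maximum_interval_s)


def run_with_retries(activity: Callable[[], object], policy: RetryPolicy):
    """Invoke `activity` until it succeeds or attempts are exhausted."""
    last_error = None
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return activity()
        except Exception as exc:
            last_error = exc
            # A real orchestrator would sleep policy.next_delay(attempt) here;
            # Temporal handles this durably on the server side.
    raise last_error
```

With Temporal, the retry loop and timers live server-side, so retries survive worker restarts rather than depending on an in-process loop like this one.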
Event Publishing
Pipeline state changes are published as Kafka events:
| Event | Published When | Consumers |
|---|---|---|
| PIPELINE_STARTED | Workflow begins | Audit, notification |
| PIPELINE_STEP_COMPLETED | Each step finishes | Monitoring |
| PIPELINE_SUCCEEDED | All steps complete | Audit, notification, billing |
| PIPELINE_FAILED | A step fails permanently | Audit, notification, alerting |
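An event payload for these topics might look like the sketch below. The field names are illustrative, the real event schema is owned by the Pipeline Service, and actual publishing would go through a Kafka producer client:

```python
import json
from datetime import datetime, timezone


def build_event(event_type: str, pipeline_id: str, tenant_id: str, detail: dict) -> str:
    """Serialize a pipeline lifecycle event as a JSON Kafka message value.

    Field names (pipeline_id, tenant_id, ...) are assumptions for
    illustration; the production schema may differ.
    """
    return json.dumps({
        "event_type": event_type,  # e.g. "PIPELINE_SUCCEEDED"
        "pipeline_id": pipeline_id,
        "tenant_id": tenant_id,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
    })
```

Keying messages by `pipeline_id` would preserve per-pipeline ordering for consumers such as audit and alerting.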
Scheduling
Pipelines support cron-based scheduling:
| Feature | Description |
|---|---|
| Cron expressions | Standard cron syntax with timezone support |
| Dependencies | Trigger on completion of upstream pipelines |
| Manual trigger | On-demand execution via API or UI |
| Backfill | Re-run pipelines for historical date ranges |
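Backfill expands a historical date range into one run per schedule interval. A minimal sketch for a daily schedule, with the function name and daily granularity as assumptions:

```python
from datetime import date, timedelta
from typing import List


def backfill_dates(start: date, end: date) -> List[date]:
    """Run dates for a daily backfill over [start, end], inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]
```

Each generated date would then be passed to the pipeline as its logical execution date, so transforms can filter source data to that partition.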
Related Pages
- Query Flow -- SQL execution within pipeline steps
- ML Flow -- ML training pipelines
- Data Engineer Persona -- Pipeline building workflow
- Compute Engines -- Spark, Flink