MATIH Platform is in active MVP development. Documentation reflects current implementation status.
2.4.E Pipeline and Streaming Architecture

Status: Production -- Pipeline Service (Java, port 8092) + Apache Flink + Apache Spark

The Pipeline Service orchestrates data engineering workflows -- ETL/ELT pipelines, data transformation jobs, CDC (change data capture) streams, and scheduled data processing. It integrates with Temporal for workflow orchestration, Apache Flink for stream processing, and Apache Spark for batch processing.


2.4.E.1 Pipeline Service (Port 8092)

Core responsibilities:

  • Pipeline definition and versioning (DAG-based workflows)
  • Job scheduling via Temporal workflow engine
  • Pipeline execution monitoring, alerting, and SLA tracking
  • Data transformation orchestration (SQL, Python, Spark, Flink)
  • Pipeline lineage and dependency tracking
  • Failure recovery with checkpoint-based retry

Pipeline Definition

Pipelines are defined as DAGs (Directed Acyclic Graphs) of tasks:

{
  "name": "daily_revenue_pipeline",
  "schedule": "0 6 * * *",
  "tasks": [
    {
      "id": "extract_orders",
      "type": "sql",
      "query": "SELECT * FROM raw.orders WHERE date = '${execution_date}'",
      "output": "staging.orders_daily"
    },
    {
      "id": "transform_revenue",
      "type": "sql",
      "depends_on": ["extract_orders"],
      "query": "SELECT region, SUM(amount) as revenue FROM staging.orders_daily GROUP BY region",
      "output": "analytics.revenue_by_region"
    },
    {
      "id": "quality_check",
      "type": "quality",
      "depends_on": ["transform_revenue"],
      "checks": ["row_count > 0", "revenue >= 0"]
    },
    {
      "id": "notify",
      "type": "notification",
      "depends_on": ["quality_check"],
      "channel": "slack",
      "message": "Daily revenue pipeline completed"
    }
  ]
}
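
The execution order of tasks falls out of the depends_on edges. Below is a minimal sketch of that dependency resolution, using Kahn's topological sort over the example definition above; the class and method names are illustrative, not the actual service code:

import java.util.*;

public class DagOrder {

    // Returns tasks in an order where every task runs after its dependencies.
    public static List<String> topologicalOrder(Map<String, List<String>> dependsOn) {
        Map<String, Integer> inDegree = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (var entry : dependsOn.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String dep : entry.getValue()) {
                inDegree.merge(entry.getKey(), 1, Integer::sum);
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(entry.getKey());
                inDegree.putIfAbsent(dep, 0);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((task, degree) -> { if (degree == 0) ready.add(task); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            for (String next : dependents.getOrDefault(task, List.of())) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        // If any task was never released, the definition contains a cycle.
        if (order.size() != inDegree.size())
            throw new IllegalArgumentException("cycle detected: not a DAG");
        return order;
    }

    public static void main(String[] args) {
        // depends_on edges from the daily_revenue_pipeline example above.
        Map<String, List<String>> deps = Map.of(
                "extract_orders", List.of(),
                "transform_revenue", List.of("extract_orders"),
                "quality_check", List.of("transform_revenue"),
                "notify", List.of("quality_check"));
        System.out.println(topologicalOrder(deps));
        // [extract_orders, transform_revenue, quality_check, notify]
    }
}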

2.4.E.2 Temporal Integration

Pipeline execution is orchestrated through Temporal workflows:

Pipeline Service --> Temporal Server
                      |
                      +--> Workflow: daily_revenue_pipeline
                           |
                           +--> Activity: extract_orders
                           |    (executes SQL via query-engine)
                           |
                           +--> Activity: transform_revenue
                           |    (executes SQL via query-engine)
                           |
                           +--> Activity: quality_check
                           |    (calls data-quality-service)
                           |
                           +--> Activity: notify
                                (publishes to Kafka for notification-service)

Temporal provides:

  • Durable execution -- Workflows survive service restarts
  • Retry policies -- Configurable per-activity retry with backoff
  • Visibility -- Query workflow status, history, and pending activities
  • Versioning -- Deploy new workflow versions alongside existing ones
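
As a sketch of how a DAG definition could map onto a Temporal workflow, the Java fragment below chains the four example tasks as activities with per-activity retry options. The interfaces, method names, and timeout values are illustrative assumptions, not the actual Pipeline Service code:

import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

@ActivityInterface
interface PipelineActivities {
    void runSqlTask(String taskId, String query, String outputTable); // via query-engine
    void runQualityChecks(String taskId);                             // via data-quality-service
    void notifyChannel(String channel, String message);               // via Kafka
}

@WorkflowInterface
interface DailyRevenuePipeline {
    @WorkflowMethod
    void run(String executionDate);
}

class DailyRevenuePipelineImpl implements DailyRevenuePipeline {

    // Per-activity retry with backoff, as described above (values are examples).
    private final PipelineActivities activities = Workflow.newActivityStub(
            PipelineActivities.class,
            ActivityOptions.newBuilder()
                    .setStartToCloseTimeout(Duration.ofMinutes(30))
                    .setRetryOptions(RetryOptions.newBuilder()
                            .setBackoffCoefficient(2.0)
                            .setMaximumAttempts(3)
                            .build())
                    .build());

    @Override
    public void run(String executionDate) {
        // Each completed activity is checkpointed in Temporal's event history,
        // so a restarted worker resumes after the last finished task instead
        // of re-running the whole pipeline (durable execution).
        activities.runSqlTask("extract_orders",
                "SELECT * FROM raw.orders WHERE date = '" + executionDate + "'",
                "staging.orders_daily");
        activities.runSqlTask("transform_revenue",
                "SELECT region, SUM(amount) as revenue"
                        + " FROM staging.orders_daily GROUP BY region",
                "analytics.revenue_by_region");
        activities.runQualityChecks("quality_check");
        activities.notifyChannel("slack", "Daily revenue pipeline completed");
    }
}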

2.4.E.3 Flink Stream Processing

Apache Flink handles real-time data transformations and CDC processing:

Flink Job              Source                   Sink        Purpose
agent-performance-agg  Kafka: ai.agent.events   ClickHouse  Aggregate agent performance metrics
llm-operations-agg     Kafka: ai.llm.events     ClickHouse  Aggregate LLM token usage and latency
session-analytics      Kafka: session.events    ClickHouse  Real-time session analytics
state-transition-cdc   PostgreSQL CDC           Kafka       Capture state changes for event sourcing

Flink jobs are deployed to Kubernetes through the Flink Kubernetes Operator. Each job consumes events from Kafka topics, applies transformations (windowed aggregations, joins, filters), and writes results to ClickHouse for analytical queries.
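
As an illustration, a job in the shape of agent-performance-agg might look like the DataStream sketch below, which counts events per agent over a one-minute tumbling window. The bootstrap servers, event parsing, and print sink are assumptions; the production jobs write to ClickHouse:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class AgentPerformanceAgg {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")            // assumption
                .setTopics("ai.agent.events")
                .setGroupId("agent-performance-agg")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "ai.agent.events")
                // Hypothetical parsing: extract an agent id from each raw event.
                .map(event -> Tuple2.of(extractAgentId(event), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                // One-minute tumbling window: events per agent per minute.
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();                                     // production jobs sink to ClickHouse

        env.execute("agent-performance-agg");
    }

    // Placeholder for real JSON parsing of ai.agent.events payloads.
    private static String extractAgentId(String rawEvent) {
        return rawEvent;
    }
}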


2.4.E.4 Key APIs

Endpoint                              Method    Description
/api/v1/pipelines                     GET/POST  Pipeline management
/api/v1/pipelines/{id}/run            POST      Trigger pipeline execution
/api/v1/pipelines/{id}/schedule       PUT       Configure scheduling (cron)
/api/v1/pipelines/{id}/runs           GET       Execution history
/api/v1/pipelines/{id}/runs/{runId}   GET       Run details and task status
/api/v1/pipelines/{id}/lineage        GET       Pipeline data lineage
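
As a usage sketch, a client could trigger a run of the example pipeline with the JDK's built-in HttpClient. The host, pipeline id, and request payload below are hypothetical:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerPipelineRun {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // POST /api/v1/pipelines/{id}/run against a local Pipeline Service.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8092/api/v1/pipelines/daily_revenue_pipeline/run"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"execution_date\": \"2024-01-01\"}")) // hypothetical payload
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}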
