MATIH Platform is in active MVP development. Documentation reflects current implementation status.
# Data Engineer

The Data Engineer is responsible for building and maintaining the data infrastructure that powers analytics, ML, and BI across the organization. In the MATIH Platform, Data Engineers work primarily in the Data Workbench (port 3002) and interact with pipeline orchestration, data quality, catalog management, and query engine services.


## Role Summary

| Attribute | Details |
| --- | --- |
| Primary workbench | Data Workbench (3002) |
| Key services | Pipeline Service, Data Quality Service, Catalog Service, Query Engine |
| Common tasks | Build pipelines, monitor data quality, manage schemas, optimize queries |
| Technical depth | High: SQL, Python, pipeline DAGs, infrastructure configuration |

## Day-in-the-Life Workflow

A typical workday for a Data Engineer on the MATIH Platform:

| Time | Activity | Platform Feature |
| --- | --- | --- |
| 9:00 AM | Check overnight pipeline runs | Pipeline Service dashboard, failure alerts |
| 9:30 AM | Investigate a failed pipeline step | Conversational AI: "Why did the customer_etl pipeline fail?" |
| 10:00 AM | Fix data quality issue | Data Quality Service profiling, rule editor |
| 11:00 AM | Add new data source | Catalog Service connector registration |
| 1:00 PM | Build transformation pipeline | Pipeline Service visual editor, SQL transforms |
| 3:00 PM | Optimize slow query | Query Engine explain plan, Trino query analysis |
| 4:00 PM | Review data quality metrics | Data Quality dashboards, anomaly alerts |

## Key Capabilities

### Pipeline Orchestration

Data Engineers build, schedule, and monitor data pipelines through the Pipeline Service:

| Feature | Description |
| --- | --- |
| Visual pipeline editor | Drag-and-drop pipeline builder with SQL and Python steps |
| Temporal orchestration | Durable workflow execution with automatic retries |
| Schedule management | Cron-based scheduling with timezone support |
| Dependency tracking | Automatic detection and visualization of pipeline dependencies |
| Failure alerting | Real-time notifications on pipeline failures |
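Dependency tracking amounts to treating pipeline steps as a directed acyclic graph and deriving a valid execution order from it. A minimal sketch, using hypothetical step names rather than the Pipeline Service's actual definition format:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: step name -> set of upstream steps it depends on.
# These names are illustrative, not part of the Pipeline Service API.
pipeline = {
    "extract_customers": set(),
    "extract_orders": set(),
    "clean_customers": {"extract_customers"},
    "join_orders": {"clean_customers", "extract_orders"},
    "load_warehouse": {"join_orders"},
}

# A topological sort yields an execution order that respects every edge;
# the orchestrator can run steps with no mutual dependency in parallel.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The same graph structure drives the dependency visualization: each edge in the dict is an arrow in the rendered DAG.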

### Data Quality

The Data Quality Service provides automated data profiling and quality monitoring:

| Feature | Description |
| --- | --- |
| Automated profiling | Statistical analysis of every table column |
| Quality rules | Configurable rules for null checks, range validation, uniqueness |
| Anomaly detection | Statistical anomaly detection on data freshness and volume |
| Quality dashboards | Visual quality scores per table, column, and pipeline |
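The three rule types above (null checks, range validation, uniqueness) can be sketched as simple row-level predicates. The table and column names below are hypothetical; the Data Quality Service's actual rule syntax is not shown here.

```python
# Illustrative rows from a hypothetical customers table.
rows = [
    {"id": 1, "email": "a@x.com", "age": 34},
    {"id": 2, "email": None,      "age": 29},
    {"id": 3, "email": "c@x.com", "age": 130},
    {"id": 3, "email": "d@x.com", "age": 41},
]

def null_check(rows, column):
    """Rows where the column is null."""
    return [r for r in rows if r[column] is None]

def range_check(rows, column, lo, hi):
    """Non-null rows whose value falls outside [lo, hi]."""
    return [r for r in rows if r[column] is not None and not lo <= r[column] <= hi]

def uniqueness_check(rows, column):
    """Rows whose value has already been seen in the column."""
    seen, dupes = set(), []
    for r in rows:
        if r[column] in seen:
            dupes.append(r)
        seen.add(r[column])
    return dupes

null_violations = null_check(rows, "email")          # 1 row: null email
range_violations = range_check(rows, "age", 0, 120)  # 1 row: age 130
dupe_violations = uniqueness_check(rows, "id")       # 1 row: duplicate id 3
```

A quality score per column can then be derived as the fraction of rows passing all rules, which is the kind of figure the quality dashboards surface per table and pipeline.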

### Catalog Management

The Catalog Service maintains metadata about all data assets:

| Feature | Description |
| --- | --- |
| Schema discovery | Automatic schema detection from connected sources |
| Lineage tracking | Column-level data lineage across pipelines |
| Tag management | Business and technical metadata tagging |
| Search | Full-text search across table names, column names, and descriptions |
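Catalog search matches a term against all three metadata fields listed above. A minimal sketch, assuming a simplified entry shape (the Catalog Service's real schema is not shown here):

```python
# Illustrative catalog entries; field names are assumptions.
catalog = [
    {"table": "customers", "columns": ["id", "email", "signup_date"],
     "description": "Customer master data"},
    {"table": "orders", "columns": ["id", "customer_id", "total"],
     "description": "Order transactions"},
]

def search(catalog, term):
    """Case-insensitive substring match on table names, column names, and descriptions."""
    term = term.lower()
    hits = []
    for entry in catalog:
        haystack = [entry["table"], entry["description"], *entry["columns"]]
        if any(term in text.lower() for text in haystack):
            hits.append(entry["table"])
    return hits

print(search(catalog, "customer"))  # matches both: table name and column name
```

Note that `orders` matches the query "customer" only through its `customer_id` column, which is why indexing column names, and not just table names, matters for discoverability.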

## Backend Services

The Data Engineer persona interacts with these backend services:

| Service | Port | Interaction |
| --- | --- | --- |
| pipeline-service | 8092 | Pipeline CRUD, scheduling, monitoring |
| data-quality-service | 8000 | Quality rules, profiling, anomaly alerts |
| catalog-service | 8086 | Schema metadata, lineage, search |
| query-engine | 8080 | SQL execution via Trino |
| ai-service | 8000 | Natural language pipeline diagnostics |
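The service-to-port mapping above can be captured in a small client-side helper. The hostname and HTTP scheme below are assumptions for a local deployment; only the service names and ports come from the table.

```python
# Ports taken from the backend services table; host/scheme are assumed.
SERVICES = {
    "pipeline-service": 8092,
    "data-quality-service": 8000,
    "catalog-service": 8086,
    "query-engine": 8080,
    "ai-service": 8000,
}

def base_url(service: str, host: str = "localhost") -> str:
    """Build the base URL for a platform service in a local deployment."""
    return f"http://{host}:{SERVICES[service]}"

print(base_url("pipeline-service"))  # http://localhost:8092
```

Note that data-quality-service and ai-service share port 8000 in the table above, so in a single-host deployment they would need distinct hosts or a routing layer in front.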

## Related Chapters