Data Engineer
The Data Engineer is responsible for building and maintaining the data infrastructure that powers analytics, ML, and BI across the organization. In the MATIH Platform, Data Engineers work primarily in the Data Workbench (port 3002) and interact with pipeline orchestration, data quality, catalog management, and query engine services.
Role Summary
| Attribute | Details |
|---|---|
| Primary workbench | Data Workbench (3002) |
| Key services | Pipeline Service, Data Quality Service, Catalog Service, Query Engine |
| Common tasks | Build pipelines, monitor data quality, manage schemas, optimize queries |
| Technical depth | High -- SQL, Python, pipeline DAGs, infrastructure configuration |
Day-in-the-Life Workflow
A typical workday for a Data Engineer on the MATIH Platform:
| Time | Activity | Platform Feature |
|---|---|---|
| 9:00 AM | Check overnight pipeline runs | Pipeline Service dashboard, failure alerts |
| 9:30 AM | Investigate a failed pipeline step | Conversational AI: "Why did the customer_etl pipeline fail?" |
| 10:00 AM | Fix data quality issue | Data Quality Service profiling, rule editor |
| 11:00 AM | Add new data source | Catalog Service connector registration |
| 1:00 PM | Build transformation pipeline | Pipeline Service visual editor, SQL transforms |
| 3:00 PM | Optimize slow query | Query Engine explain plan, Trino query analysis |
| 4:00 PM | Review data quality metrics | Data Quality dashboards, anomaly alerts |
Key Capabilities
Pipeline Orchestration
Data Engineers build, schedule, and monitor data pipelines through the Pipeline Service:
| Feature | Description |
|---|---|
| Visual pipeline editor | Drag-and-drop pipeline builder with SQL and Python steps |
| Temporal orchestration | Durable workflow execution with automatic retries |
| Schedule management | Cron-based scheduling with timezone support |
| Dependency tracking | Automatic detection and visualization of pipeline dependencies |
| Failure alerting | Real-time notifications on pipeline failures |
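The dependency tracking and scheduling concepts above can be sketched as a pipeline definition plus an execution-order resolution. This is a minimal illustration, not the Pipeline Service's actual schema: the field names (`name`, `schedule`, `steps`) and the step names are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline definition -- field names and steps are
# illustrative, not the Pipeline Service's real configuration format.
pipeline = {
    "name": "customer_etl",
    "schedule": "0 2 * * *",  # cron: daily at 02:00
    "timezone": "UTC",
    # each step maps to the list of steps it depends on
    "steps": {
        "extract_customers": [],
        "clean_customers": ["extract_customers"],
        "join_orders": ["clean_customers"],
        "load_warehouse": ["join_orders"],
    },
}

# Resolve a valid execution order from the declared dependencies,
# mirroring what the orchestrator's dependency tracking computes.
order = list(TopologicalSorter(pipeline["steps"]).static_order())
print(order)
```

Declaring dependencies explicitly, rather than ordering steps by hand, is what lets the orchestrator retry a failed step without rerunning its upstream steps.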
Data Quality
The Data Quality Service provides automated data profiling and quality monitoring:
| Feature | Description |
|---|---|
| Automated profiling | Statistical analysis of every table column |
| Quality rules | Configurable rules for null checks, range validation, uniqueness |
| Anomaly detection | Statistical anomaly detection on data freshness and volume |
| Quality dashboards | Visual quality scores per table, column, and pipeline |
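To make the rule types above concrete, here is a toy evaluation of null, range, and uniqueness checks against in-memory rows. The Data Quality Service applies equivalent checks server-side; the function names and sample data here are illustrative assumptions.

```python
# Sample rows seeded with one violation of each rule type.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 151},   # out of range
    {"id": 2, "age": 28},    # duplicate id
    {"id": 4, "age": None},  # null age
]

def check_not_null(rows, col):
    # Fails if any value in the column is missing.
    return all(r[col] is not None for r in rows)

def check_range(rows, col, lo, hi):
    # Nulls are the null check's job; skip them here.
    return all(lo <= r[col] <= hi for r in rows if r[col] is not None)

def check_unique(rows, col):
    vals = [r[col] for r in rows]
    return len(vals) == len(set(vals))

results = {
    "id_not_null": check_not_null(rows, "id"),
    "age_in_range": check_range(rows, "age", 0, 120),
    "age_not_null": check_not_null(rows, "age"),
    "id_unique": check_unique(rows, "id"),
}
print(results)
```

Each rule returns an independent pass/fail verdict, which is what lets the dashboards roll results up into per-column and per-table quality scores.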
Catalog Management
The Catalog Service maintains metadata about all data assets:
| Feature | Description |
|---|---|
| Schema discovery | Automatic schema detection from connected sources |
| Lineage tracking | Column-level data lineage across pipelines |
| Tag management | Business and technical metadata tagging |
| Search | Full-text search across table names, column names, and descriptions |
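The search behavior described above can be sketched with a small in-memory catalog. The entry structure and sample tables are assumptions for illustration, not the Catalog Service's actual data model or API.

```python
# Toy catalog entries: table name, columns, free-text description.
catalog = [
    {"table": "customers",
     "columns": ["id", "email", "signup_date"],
     "description": "Customer master data"},
    {"table": "orders",
     "columns": ["id", "customer_id", "total"],
     "description": "Order transactions"},
]

def search(catalog, term):
    """Case-insensitive match across table names, column names,
    and descriptions, as the Search feature describes."""
    term = term.lower()
    return [
        entry["table"] for entry in catalog
        if term in entry["table"].lower()
        or any(term in c.lower() for c in entry["columns"])
        or term in entry["description"].lower()
    ]

print(search(catalog, "customer"))
```

Note that "customer" matches the orders table too, via its customer_id column: searching column names as well as table names is what surfaces join keys and foreign-key relationships.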
Backend Services
The Data Engineer persona interacts with these backend services:
| Service | Port | Interaction |
|---|---|---|
| pipeline-service | 8092 | Pipeline CRUD, scheduling, monitoring |
| data-quality-service | 8000 | Quality rules, profiling, anomaly alerts |
| catalog-service | 8086 | Schema metadata, lineage, search |
| query-engine | 8080 | SQL execution via Trino |
| ai-service | 8000 | Natural language pipeline diagnostics |
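A small helper can turn the registry above into request URLs. The ports come from the table; the endpoint path in the example is a hypothetical route, not a documented one.

```python
# Service-to-port mapping taken from the Backend Services table.
SERVICES = {
    "pipeline-service": 8092,
    "data-quality-service": 8000,
    "catalog-service": 8086,
    "query-engine": 8080,
    "ai-service": 8000,
}

def service_url(service, path, host="localhost"):
    """Build a base URL for a backend service; the host default
    and path layout are assumptions for local development."""
    port = SERVICES[service]
    return f"http://{host}:{port}/{path.lstrip('/')}"

# Hypothetical endpoint path, shown only to demonstrate URL assembly.
print(service_url("pipeline-service", "/pipelines/customer_etl/runs"))
```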
Related Chapters
- Data Engineering Capabilities -- Full capability description
- Data Stores -- PostgreSQL, Kafka, Trino architecture
- Pipeline Flow -- Pipeline execution lifecycle
- Technology Stack: Data Infrastructure -- Data technologies