Data Engineer
The Data Engineer is responsible for building and maintaining the data infrastructure that powers analytics, ML, and BI across the organization. In the MATIH Platform, Data Engineers work primarily in the Data Workbench (port 3002) and interact with pipeline orchestration, data quality, catalog management, and query engine services.
Role Summary
| Attribute | Details |
|---|---|
| Primary workbench | Data Workbench (3002) |
| Key services | Pipeline Service, Data Quality Service, Catalog Service, Query Engine |
| Common tasks | Build pipelines, monitor data quality, manage schemas, optimize queries |
| Technical depth | High -- SQL, Python, pipeline DAGs, infrastructure configuration |
Day-in-the-Life Workflow
A typical workday for a Data Engineer on the MATIH Platform:
| Time | Activity | Platform Feature |
|---|---|---|
| 9:00 AM | Check overnight pipeline runs | Pipeline Service dashboard, failure alerts |
| 9:30 AM | Investigate a failed pipeline step | Conversational AI: "Why did the customer_etl pipeline fail?" |
| 10:00 AM | Fix data quality issue | Data Quality Service profiling, rule editor |
| 11:00 AM | Add new data source | Catalog Service connector registration |
| 1:00 PM | Build transformation pipeline | Pipeline Service visual editor, SQL transforms |
| 3:00 PM | Optimize slow query | Query Engine explain plan, Trino query analysis |
| 4:00 PM | Review data quality metrics | Data Quality dashboards, anomaly alerts |
Key Capabilities
Pipeline Orchestration
Data Engineers build, schedule, and monitor data pipelines through the Pipeline Service:
| Feature | Description |
|---|---|
| Visual pipeline editor | Drag-and-drop pipeline builder with SQL and Python steps |
| Temporal orchestration | Durable workflow execution with automatic retries |
| Schedule management | Cron-based scheduling with timezone support |
| Dependency tracking | Automatic detection and visualization of pipeline dependencies |
| Failure alerting | Real-time notifications on pipeline failures |
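The dependency tracking and scheduling concepts above can be sketched as a pipeline definition plus an execution-order resolution. This is a minimal illustration, not the Pipeline Service's actual schema: the field names (`name`, `schedule`, `steps`) and the step names are assumptions.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline definition -- field names and steps are
# illustrative, not the Pipeline Service's real configuration format.
pipeline = {
    "name": "customer_etl",
    "schedule": "0 2 * * *",  # cron: daily at 02:00
    "timezone": "UTC",
    # each step maps to the list of steps it depends on
    "steps": {
        "extract_customers": [],
        "clean_customers": ["extract_customers"],
        "join_orders": ["clean_customers"],
        "load_warehouse": ["join_orders"],
    },
}

# Resolve a valid execution order from the declared dependencies,
# mirroring what the orchestrator's dependency tracking computes.
order = list(TopologicalSorter(pipeline["steps"]).static_order())
print(order)
```

Declaring dependencies explicitly, rather than ordering steps by hand, is what lets the orchestrator retry a failed step without rerunning its upstream steps.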
Data Quality
The Data Quality Service provides automated data profiling and quality monitoring:
| Feature | Description |
|---|---|
| Automated profiling | Statistical analysis of every table column |
| Quality rules | Configurable rules for null checks, range validation, uniqueness |
| Anomaly detection | Statistical anomaly detection on data freshness and volume |
| Quality dashboards | Visual quality scores per table, column, and pipeline |
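To make the rule types above concrete, here is a toy evaluation of null, range, and uniqueness checks against in-memory rows. The Data Quality Service applies equivalent checks server-side; the function names and sample data here are illustrative assumptions.

```python
# Sample rows seeded with one violation of each rule type.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 151},   # out of range
    {"id": 2, "age": 28},    # duplicate id
    {"id": 4, "age": None},  # null age
]

def check_not_null(rows, col):
    # Fails if any value in the column is missing.
    return all(r[col] is not None for r in rows)

def check_range(rows, col, lo, hi):
    # Nulls are the null check's job; skip them here.
    return all(lo <= r[col] <= hi for r in rows if r[col] is not None)

def check_unique(rows, col):
    vals = [r[col] for r in rows]
    return len(vals) == len(set(vals))

results = {
    "id_not_null": check_not_null(rows, "id"),
    "age_in_range": check_range(rows, "age", 0, 120),
    "age_not_null": check_not_null(rows, "age"),
    "id_unique": check_unique(rows, "id"),
}
print(results)
```

Each rule returns an independent pass/fail verdict, which is what lets the dashboards roll results up into per-column and per-table quality scores.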
Catalog Management
The Catalog Service maintains metadata about all data assets:
| Feature | Description |
|---|---|
| Schema discovery | Automatic schema detection from connected sources |
| Lineage tracking | Column-level data lineage across pipelines |
| Tag management | Business and technical metadata tagging |
| Search | Full-text search across table names, column names, and descriptions |
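The search behavior described above can be sketched with a small in-memory catalog. The entry structure and sample tables are assumptions for illustration, not the Catalog Service's actual data model or API.

```python
# Toy catalog entries: table name, columns, free-text description.
catalog = [
    {"table": "customers",
     "columns": ["id", "email", "signup_date"],
     "description": "Customer master data"},
    {"table": "orders",
     "columns": ["id", "customer_id", "total"],
     "description": "Order transactions"},
]

def search(catalog, term):
    """Case-insensitive match across table names, column names,
    and descriptions, as the Search feature describes."""
    term = term.lower()
    return [
        entry["table"] for entry in catalog
        if term in entry["table"].lower()
        or any(term in c.lower() for c in entry["columns"])
        or term in entry["description"].lower()
    ]

print(search(catalog, "customer"))
```

Note that "customer" matches the orders table too, via its customer_id column: searching column names as well as table names is what surfaces join keys and foreign-key relationships.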
Backend Services
The Data Engineer persona interacts with these backend services:
| Service | Port | Interaction |
|---|---|---|
| pipeline-service | 8092 | Pipeline CRUD, scheduling, monitoring |
| data-quality-service | 8000 | Quality rules, profiling, anomaly alerts |
| catalog-service | 8086 | Schema metadata, lineage, search |
| query-engine | 8080 | SQL execution via Trino |
| ai-service | 8000 | Natural language pipeline diagnostics |
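A small helper can turn the registry above into request URLs. The ports come from the table; the endpoint path in the example is a hypothetical route, not a documented one.

```python
# Service-to-port mapping taken from the Backend Services table.
SERVICES = {
    "pipeline-service": 8092,
    "data-quality-service": 8000,
    "catalog-service": 8086,
    "query-engine": 8080,
    "ai-service": 8000,
}

def service_url(service, path, host="localhost"):
    """Build a base URL for a backend service; the host default
    and path layout are assumptions for local development."""
    port = SERVICES[service]
    return f"http://{host}:{port}/{path.lstrip('/')}"

# Hypothetical endpoint path, shown only to demonstrate URL assembly.
print(service_url("pipeline-service", "/pipelines/customer_etl/runs"))
```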
Related Chapters
- Data Engineering Capabilities -- Full capability description
- Data Stores -- PostgreSQL, Kafka, Trino architecture
- Pipeline Flow -- Pipeline execution lifecycle
- Technology Stack: Data Infrastructure -- Data technologies