Data Ingestion
Data Ingestion is the entry point for bringing external data into the Matih platform. It provides a unified interface for connecting to external databases, SaaS applications, cloud storage, and flat files, and it lands that data in Apache Iceberg tables, where it becomes immediately queryable by the platform's query engines and consumable by the AI, BI, and ML workbenches.
What You Will Learn
By the end of this chapter, you will understand:
- The Ingestion Service architecture, its Java/Spring Boot 3.2 API layer, and its integration with Airbyte for connector-based ingestion
- The connector catalog with 600+ Airbyte connectors spanning databases, SaaS applications, APIs, and cloud storage
- File import for CSV, Excel (XLSX), Parquet, JSON, and Avro files with automatic schema inference, preview, and column mapping
- The sync lifecycle from source configuration through schema discovery, stream selection, scheduling, execution, and monitoring
- Per-tenant isolation where each tenant receives an isolated Airbyte deployment with dedicated resources
- The complete API surface for source management, connection configuration, sync orchestration, and file import
Chapter Structure
| Section | Description | Audience |
|---|---|---|
| Architecture | System architecture, data flow diagrams, component integration map, per-tenant deployment model | Architects, platform engineers |
| Getting Started | Step-by-step guide to configure a source, run a sync, and query ingested data | All users |
| Connectors | Connector catalog overview, database connectors, SaaS connectors, cloud storage connectors | Data engineers, analysts |
| File Import | File upload and import for CSV, Excel, Parquet with schema inference and preview | Data engineers, analysts |
| API Reference | Complete REST API documentation for all Ingestion Service endpoints | Backend developers |
| Sync Monitoring | Sync status dashboard, common errors, Grafana dashboards, alerting | Data engineers, platform operators |
Data Ingestion at a Glance
The Ingestion Service is a Java/Spring Boot 3.2 service that orchestrates data movement from external sources into the platform's Iceberg lakehouse.
+---------------------+
|   Data Workbench    |
|    (Frontend UI)    |
+----------+----------+
           |
+----------v----------+
|  Ingestion Service  |
| (Java/Spring Boot)  |
|      Port 8113      |
+----------+----------+
           |
    +------+----------------+
    |                       |
+---v---------------+   +---v-----------------+
|      Airbyte      |   | File Import Engine  |
| (600+ connectors) |   | (CSV/Excel/Parquet) |
+---+---------------+   +---+-----------------+
    |                       |
    +-----------+-----------+
                |
     +----------v----------+
     |   Apache Iceberg    |
     |  (Polaris Catalog)  |
     +----------+----------+
                |
      +---------+---+-------------+-------------+
      |             |             |             |
+-----v-----+ +-----v-----+ +-----v-----+ +-----v-----+
|   Trino   | |ClickHouse | | StarRocks | |   Spark   |
|  (OLAP)   | |(Real-time)| |   (MPP)   | |  (Batch)  |
+-----------+ +-----------+ +-----------+ +-----------+

Key Numbers
| Metric | Value |
|---|---|
| Technology | Java 21, Spring Boot 3.2 |
| Service port | 8113 |
| Connector engine | Airbyte (600+ connectors) |
| File formats supported | CSV, Excel (XLSX), Parquet, JSON, Avro |
| Sync modes | Full Refresh, Incremental, CDC |
| Storage format | Apache Iceberg (via Polaris REST Catalog) |
| Multi-tenancy | Per-tenant isolated Airbyte deployment |
| Scheduling | Cron-based or manual trigger |
Key Capabilities
Connector-Based Ingestion (Airbyte)
The platform integrates with Airbyte to provide 600+ pre-built connectors for databases, SaaS applications, APIs, and cloud storage. Connectors handle authentication, pagination, rate limiting, schema evolution, and incremental extraction automatically. Users configure a source, select streams (tables or collections), choose a sync mode, and set a schedule. Airbyte handles the extraction and the Ingestion Service routes the data into tenant-specific Iceberg tables.
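To make the stream-selection step concrete, here is a minimal sketch of how a sync mode might be chosen per discovered stream: prefer CDC when the source supports it, fall back to cursor-based incremental when a cursor field exists, and use full refresh otherwise. The `Stream` record and the selection rule are illustrative assumptions, not the Ingestion Service's actual logic.

```java
import java.util.List;

public class StreamSelectionSketch {

    enum SyncMode { FULL_REFRESH, INCREMENTAL, CDC }

    /** What schema discovery might report for one stream (table or collection). */
    record Stream(String name, boolean supportsCdc, String cursorField) {}

    /** Prefer CDC, then cursor-based incremental, else full refresh. */
    static SyncMode chooseMode(Stream s) {
        if (s.supportsCdc()) return SyncMode.CDC;
        if (s.cursorField() != null) return SyncMode.INCREMENTAL;
        return SyncMode.FULL_REFRESH;
    }

    public static void main(String[] args) {
        List<Stream> discovered = List.of(
            new Stream("orders", true, "updated_at"),
            new Stream("customers", false, "updated_at"),
            new Stream("country_codes", false, null));
        for (Stream s : discovered) {
            System.out.println(s.name() + " -> " + chooseMode(s)); // orders -> CDC, etc.
        }
    }
}
```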
File Import
For ad-hoc data loading, users upload files directly through the Data Workbench. The Ingestion Service infers the schema automatically, presents a preview with sample rows and detected column types, and allows users to adjust column mappings and target table names before importing. Supported formats include CSV (with delimiter and encoding detection), Excel (with multi-sheet support), Parquet (with full schema preservation), JSON, and Avro.
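The preview step's type detection can be illustrated with a small sketch: scan the sample values of a column and pick the narrowest type that fits every value. The type names and the inference order here are assumptions for illustration, not the File Import Engine's actual implementation.

```java
import java.util.List;

public class SchemaInferenceSketch {

    enum ColumnType { LONG, DOUBLE, BOOLEAN, STRING }

    /** Pick the narrowest type that every sample value fits. */
    static ColumnType inferType(List<String> samples) {
        boolean allLong = true, allDouble = true, allBool = true;
        for (String v : samples) {
            if (!v.matches("-?\\d+")) allLong = false;
            if (!v.matches("-?\\d+(\\.\\d+)?")) allDouble = false;
            if (!v.equalsIgnoreCase("true") && !v.equalsIgnoreCase("false")) allBool = false;
        }
        if (allLong) return ColumnType.LONG;       // every value is an integer
        if (allDouble) return ColumnType.DOUBLE;   // every value is numeric
        if (allBool) return ColumnType.BOOLEAN;    // every value is true/false
        return ColumnType.STRING;                  // fallback: keep as text
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "42", "-7")));  // LONG
        System.out.println(inferType(List.of("1.5", "2")));       // DOUBLE
        System.out.println(inferType(List.of("true", "FALSE")));  // BOOLEAN
        System.out.println(inferType(List.of("abc", "1")));       // STRING
    }
}
```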
Per-Tenant Isolation
Every tenant receives a dedicated Airbyte deployment. Source credentials are stored in tenant-scoped Kubernetes secrets. Sync jobs execute in tenant-isolated pods. Ingested data lands in tenant-specific Iceberg namespaces. This ensures complete data isolation, resource isolation, and credential isolation across tenants.
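The isolation layers above can be sketched as tenant-scoped resource naming. The exact naming scheme (prefixes, namespace layout) is an assumption for illustration only; the real conventions live in the Tenant Service provisioning code.

```java
public class TenantScopingSketch {

    /** Kubernetes namespace for a tenant's Airbyte deployment (assumed scheme). */
    static String airbyteNamespace(String tenantId) {
        return "tenant-" + tenantId + "-airbyte";
    }

    /** Kubernetes secret holding one source's credentials (assumed scheme). */
    static String credentialSecret(String tenantId, String sourceName) {
        return "tenant-" + tenantId + "-source-" + sourceName + "-creds";
    }

    /** Iceberg namespace under which the tenant's ingested tables land (assumed scheme). */
    static String icebergNamespace(String tenantId) {
        return tenantId + ".raw";
    }

    public static void main(String[] args) {
        String tenant = "acme";
        System.out.println(airbyteNamespace(tenant));                     // tenant-acme-airbyte
        System.out.println(credentialSecret(tenant, "orders-postgres"));  // tenant-acme-source-orders-postgres-creds
        System.out.println(icebergNamespace(tenant));                     // acme.raw
    }
}
```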
Key Source Files
| Component | Location |
|---|---|
| Application entry point | control-plane/ingestion-service/src/main/java/com/matih/ingestion/IngestionServiceApplication.java |
| Controllers | |
| Source Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/SourceController.java |
| Connection Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/ConnectionController.java |
| Sync Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/SyncController.java |
| File Import Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/FileImportController.java |
| Services | |
| Source Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/SourceService.java |
| Connection Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/ConnectionService.java |
| Sync Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/SyncService.java |
| File Import Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/FileImportService.java |
| Clients | |
| Airbyte Client | control-plane/ingestion-service/src/main/java/com/matih/ingestion/client/AirbyteClient.java |
| Pipeline Service Client | control-plane/ingestion-service/src/main/java/com/matih/ingestion/client/PipelineServiceClient.java |
| Entities | |
| Ingestion Source | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/IngestionSource.java |
| Ingestion Connection | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/IngestionConnection.java |
| Sync History | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/SyncHistory.java |
| File Import Job | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/FileImportJob.java |
| Configuration | |
| Service Config | control-plane/ingestion-service/src/main/java/com/matih/ingestion/config/IngestionServiceConfig.java |
| Helm Chart | infrastructure/helm/control-plane/ingestion-service/ |
Design Principles
- Connector reuse over custom integration. The platform uses Airbyte's connector ecosystem rather than building custom integrations. This provides breadth (600+ sources), maintenance (the Airbyte community maintains connectors), and reliability (battle-tested extraction logic).
- Schema-first ingestion. Every ingestion flow begins with schema discovery. Users see exactly which streams (tables, collections, API resources) are available, their column types, and supported sync modes before any data moves.
- Tenant isolation at every layer. Source credentials, Airbyte deployments, sync jobs, and destination tables are all scoped to a single tenant, preventing cross-tenant data leakage by design.
- Lakehouse-native landing. All ingested data lands in Apache Iceberg format via the Polaris REST Catalog. This provides ACID transactions, time travel, schema evolution, and partition pruning from the moment data arrives.
- Observability by default. Every sync job emits structured metrics (records synced, bytes transferred, duration) and retains full history for debugging and audit.
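The observability principle can be sketched as a small structured-metrics record. The field names mirror the metrics named above (records synced, bytes transferred, duration), but the record shape and the derived throughput calculation are assumptions for illustration, not the SyncHistory entity's actual schema.

```java
import java.time.Duration;

public class SyncMetricsSketch {

    /** Structured metrics one sync run might emit (illustrative shape). */
    record SyncMetrics(String connectionId, long recordsSynced,
                       long bytesTransferred, Duration duration) {

        /** Average throughput in records per second (0 for a zero-length run). */
        double recordsPerSecond() {
            long seconds = duration.toSeconds();
            return seconds == 0 ? 0 : (double) recordsSynced / seconds;
        }
    }

    public static void main(String[] args) {
        SyncMetrics m = new SyncMetrics("conn-123", 1_200_000, 350_000_000L,
                                        Duration.ofMinutes(10));
        System.out.println(m.recordsPerSecond()); // 1_200_000 records / 600 s = 2000.0
    }
}
```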
How This Chapter Connects
The Ingestion Service integrates with several platform components:
- The Query Engine (Chapter 9) provides SQL access to ingested data through Trino, ClickHouse, and StarRocks
- The Data Catalog (Chapter 10) automatically registers ingested tables and tracks lineage from source to Iceberg
- The Pipeline Service (Chapter 11) transforms ingested raw data through ETL/ELT pipelines
- The AI Service (Chapter 12) queries ingested data for natural language analytics and insight generation
- The Tenant Service (Chapter 7) provisions per-tenant Airbyte deployments during tenant onboarding
Begin with the Architecture section to understand the system design, or jump directly to Getting Started to configure your first data source and run a sync.