Data Ingestion
Data Ingestion is the entry point for bringing external data into the Matih platform. It provides a unified interface for connecting to external databases, SaaS applications, cloud storage, and flat files, and it lands that data in Apache Iceberg tables, where it becomes immediately queryable by the platform's query engines and consumable by the AI, BI, and ML workbenches.
What You Will Learn
By the end of this chapter, you will understand:
- The Ingestion Service architecture, its Java/Spring Boot 3.2 API layer, and its integration with Airbyte for connector-based ingestion
- The connector catalog with 600+ Airbyte connectors spanning databases, SaaS applications, APIs, and cloud storage
- File import for CSV, Excel (XLSX), Parquet, JSON, and Avro files with automatic schema inference, preview, and column mapping
- The sync lifecycle from source configuration through schema discovery, stream selection, scheduling, execution, and monitoring
- Per-tenant isolation where each tenant receives an isolated Airbyte deployment with dedicated resources
- The complete API surface for source management, connection configuration, sync orchestration, and file import
Chapter Structure
| Section | Description | Audience |
|---|---|---|
| Architecture | System architecture, data flow diagrams, component integration map, per-tenant deployment model | Architects, platform engineers |
| Getting Started | Step-by-step guide to configure a source, run a sync, and query ingested data | All users |
| Connectors | Connector catalog overview, database connectors, SaaS connectors, cloud storage connectors | Data engineers, analysts |
| File Import | File upload and import for CSV, Excel, Parquet with schema inference and preview | Data engineers, analysts |
| API Reference | Complete REST API documentation for all Ingestion Service endpoints | Backend developers |
| Sync Monitoring | Sync status dashboard, common errors, Grafana dashboards, alerting | Data engineers, platform operators |
Data Ingestion at a Glance
The Ingestion Service is a Java/Spring Boot 3.2 service that orchestrates data movement from external sources into the platform's Iceberg lakehouse.
+---------------------+
|   Data Workbench    |
|    (Frontend UI)    |
+----------+----------+
           |
+----------v----------+
|  Ingestion Service  |
| (Java/Spring Boot)  |
|      Port 8113      |
+----------+----------+
           |
    +------+----------------+
    |                       |
+---v---------------+   +---v-----------------+
|      Airbyte      |   | File Import Engine  |
| (600+ connectors) |   | (CSV/Excel/Parquet) |
+---+---------------+   +---+-----------------+
    |                       |
    +-----------+-----------+
                |
     +----------v----------+
     |   Apache Iceberg    |
     |  (Polaris Catalog)  |
     +----------+----------+
                |
      +---------+---+-------------+-------------+
      |             |             |             |
+-----v-----+ +-----v-----+ +-----v-----+ +-----v-----+
|   Trino   | |ClickHouse | | StarRocks | |   Spark   |
|  (OLAP)   | |(Real-time)| |   (MPP)   | |  (Batch)  |
+-----------+ +-----------+ +-----------+ +-----------+

Key Numbers
| Metric | Value |
|---|---|
| Technology | Java 21, Spring Boot 3.2 |
| Service port | 8113 |
| Connector engine | Airbyte (600+ connectors) |
| File formats supported | CSV, Excel (XLSX), Parquet, JSON, Avro |
| Sync modes | Full Refresh, Incremental, CDC |
| Storage format | Apache Iceberg (via Polaris REST Catalog) |
| Multi-tenancy | Per-tenant isolated Airbyte deployment |
| Scheduling | Cron-based or manual trigger |
Key Capabilities
Connector-Based Ingestion (Airbyte)
The platform integrates with Airbyte to provide 600+ pre-built connectors for databases, SaaS applications, APIs, and cloud storage. Connectors handle authentication, pagination, rate limiting, schema evolution, and incremental extraction automatically. Users configure a source, select streams (tables or collections), choose a sync mode, and set a schedule. Airbyte handles the extraction and the Ingestion Service routes the data into tenant-specific Iceberg tables.
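To make the stream-selection step concrete, here is a minimal sketch of how a sync mode might be chosen per discovered stream: prefer CDC when the source supports it, fall back to cursor-based incremental when a cursor field exists, and use full refresh otherwise. The `Stream` record and the selection rule are illustrative assumptions, not the Ingestion Service's actual logic.

```java
import java.util.List;

public class StreamSelectionSketch {

    enum SyncMode { FULL_REFRESH, INCREMENTAL, CDC }

    /** What schema discovery might report for one stream (table or collection). */
    record Stream(String name, boolean supportsCdc, String cursorField) {}

    /** Prefer CDC, then cursor-based incremental, else full refresh. */
    static SyncMode chooseMode(Stream s) {
        if (s.supportsCdc()) return SyncMode.CDC;
        if (s.cursorField() != null) return SyncMode.INCREMENTAL;
        return SyncMode.FULL_REFRESH;
    }

    public static void main(String[] args) {
        List<Stream> discovered = List.of(
            new Stream("orders", true, "updated_at"),
            new Stream("customers", false, "updated_at"),
            new Stream("country_codes", false, null));
        for (Stream s : discovered) {
            System.out.println(s.name() + " -> " + chooseMode(s)); // orders -> CDC, etc.
        }
    }
}
```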
File Import
For ad-hoc data loading, users upload files directly through the Data Workbench. The Ingestion Service infers the schema automatically, presents a preview with sample rows and detected column types, and allows users to adjust column mappings and target table names before importing. Supported formats include CSV (with delimiter and encoding detection), Excel (with multi-sheet support), Parquet (with full schema preservation), JSON, and Avro.
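The preview step's type detection can be illustrated with a small sketch: scan the sample values of a column and pick the narrowest type that fits every value. The type names and the inference order here are assumptions for illustration, not the File Import Engine's actual implementation.

```java
import java.util.List;

public class SchemaInferenceSketch {

    enum ColumnType { LONG, DOUBLE, BOOLEAN, STRING }

    /** Pick the narrowest type that every sample value fits. */
    static ColumnType inferType(List<String> samples) {
        boolean allLong = true, allDouble = true, allBool = true;
        for (String v : samples) {
            if (!v.matches("-?\\d+")) allLong = false;
            if (!v.matches("-?\\d+(\\.\\d+)?")) allDouble = false;
            if (!v.equalsIgnoreCase("true") && !v.equalsIgnoreCase("false")) allBool = false;
        }
        if (allLong) return ColumnType.LONG;       // every value is an integer
        if (allDouble) return ColumnType.DOUBLE;   // every value is numeric
        if (allBool) return ColumnType.BOOLEAN;    // every value is true/false
        return ColumnType.STRING;                  // fallback: keep as text
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "42", "-7")));  // LONG
        System.out.println(inferType(List.of("1.5", "2")));       // DOUBLE
        System.out.println(inferType(List.of("true", "FALSE")));  // BOOLEAN
        System.out.println(inferType(List.of("abc", "1")));       // STRING
    }
}
```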
Per-Tenant Isolation
Every tenant receives a dedicated Airbyte deployment. Source credentials are stored in tenant-scoped Kubernetes secrets. Sync jobs execute in tenant-isolated pods. Ingested data lands in tenant-specific Iceberg namespaces. This ensures complete data isolation, resource isolation, and credential isolation across tenants.
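The isolation layers above can be sketched as tenant-scoped resource naming. The exact naming scheme (prefixes, namespace layout) is an assumption for illustration only; the real conventions live in the Tenant Service provisioning code.

```java
public class TenantScopingSketch {

    /** Kubernetes namespace for a tenant's Airbyte deployment (assumed scheme). */
    static String airbyteNamespace(String tenantId) {
        return "tenant-" + tenantId + "-airbyte";
    }

    /** Kubernetes secret holding one source's credentials (assumed scheme). */
    static String credentialSecret(String tenantId, String sourceName) {
        return "tenant-" + tenantId + "-source-" + sourceName + "-creds";
    }

    /** Iceberg namespace under which the tenant's ingested tables land (assumed scheme). */
    static String icebergNamespace(String tenantId) {
        return tenantId + ".raw";
    }

    public static void main(String[] args) {
        String tenant = "acme";
        System.out.println(airbyteNamespace(tenant));                     // tenant-acme-airbyte
        System.out.println(credentialSecret(tenant, "orders-postgres"));  // tenant-acme-source-orders-postgres-creds
        System.out.println(icebergNamespace(tenant));                     // acme.raw
    }
}
```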
Key Source Files
| Component | Location |
|---|---|
| Application entry point | control-plane/ingestion-service/src/main/java/com/matih/ingestion/IngestionServiceApplication.java |
| Controllers | |
| Source Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/SourceController.java |
| Connection Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/ConnectionController.java |
| Sync Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/SyncController.java |
| File Import Controller | control-plane/ingestion-service/src/main/java/com/matih/ingestion/controller/FileImportController.java |
| Services | |
| Source Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/SourceService.java |
| Connection Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/ConnectionService.java |
| Sync Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/SyncService.java |
| File Import Service | control-plane/ingestion-service/src/main/java/com/matih/ingestion/service/FileImportService.java |
| Clients | |
| Airbyte Client | control-plane/ingestion-service/src/main/java/com/matih/ingestion/client/AirbyteClient.java |
| Pipeline Service Client | control-plane/ingestion-service/src/main/java/com/matih/ingestion/client/PipelineServiceClient.java |
| Entities | |
| Ingestion Source | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/IngestionSource.java |
| Ingestion Connection | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/IngestionConnection.java |
| Sync History | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/SyncHistory.java |
| File Import Job | control-plane/ingestion-service/src/main/java/com/matih/ingestion/entity/FileImportJob.java |
| Configuration | |
| Service Config | control-plane/ingestion-service/src/main/java/com/matih/ingestion/config/IngestionServiceConfig.java |
| Helm Chart | infrastructure/helm/control-plane/ingestion-service/ |
Design Principles
- Connector reuse over custom integration. The platform uses Airbyte's connector ecosystem rather than building custom integrations. This provides breadth (600+ sources), maintenance (the Airbyte community maintains connectors), and reliability (battle-tested extraction logic).
- Schema-first ingestion. Every ingestion flow begins with schema discovery. Users see exactly which streams (tables, collections, API resources) are available, their column types, and supported sync modes before any data moves.
- Tenant isolation at every layer. Source credentials, Airbyte deployments, sync jobs, and destination tables are all scoped to a single tenant, preventing cross-tenant data leakage by design.
- Lakehouse-native landing. All ingested data lands in Apache Iceberg format via the Polaris REST Catalog. This provides ACID transactions, time travel, schema evolution, and partition pruning from the moment data arrives.
- Observability by default. Every sync job emits structured metrics (records synced, bytes transferred, duration) and retains full history for debugging and audit.
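The observability principle can be sketched as a small structured-metrics record. The field names mirror the metrics named above (records synced, bytes transferred, duration), but the record shape and the derived throughput calculation are assumptions for illustration, not the SyncHistory entity's actual schema.

```java
import java.time.Duration;

public class SyncMetricsSketch {

    /** Structured metrics one sync run might emit (illustrative shape). */
    record SyncMetrics(String connectionId, long recordsSynced,
                       long bytesTransferred, Duration duration) {

        /** Average throughput in records per second (0 for a zero-length run). */
        double recordsPerSecond() {
            long seconds = duration.toSeconds();
            return seconds == 0 ? 0 : (double) recordsSynced / seconds;
        }
    }

    public static void main(String[] args) {
        SyncMetrics m = new SyncMetrics("conn-123", 1_200_000, 350_000_000L,
                                        Duration.ofMinutes(10));
        System.out.println(m.recordsPerSecond()); // 1_200_000 records / 600 s = 2000.0
    }
}
```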
How This Chapter Connects
The Ingestion Service integrates with several platform components:
- The Query Engine (Chapter 9) provides SQL access to ingested data through Trino, ClickHouse, and StarRocks
- The Data Catalog (Chapter 10) automatically registers ingested tables and tracks lineage from source to Iceberg
- The Pipeline Service (Chapter 11) transforms ingested raw data through ETL/ELT pipelines
- The AI Service (Chapter 12) queries ingested data for natural language analytics and insight generation
- The Tenant Service (Chapter 7) provisions per-tenant Airbyte deployments during tenant onboarding
Begin with the Architecture section to understand the system design, or jump directly to Getting Started to configure your first data source and run a sync.