MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
Getting Started

This guide walks through the complete workflow for ingesting external data into the Matih platform: configuring a source, discovering its schema, creating a sync connection, running the first sync, and querying the results.


Prerequisites

Before you begin, ensure the following are in place:

  • Active tenant. Your organization must have a provisioned tenant. The Tenant Service automatically deploys the per-tenant Airbyte instance and Iceberg namespace during provisioning.
  • User account with Data Engineer or Admin role. Source and connection management require the data:ingestion:manage permission.
  • Source system credentials. Have the hostname, port, database name, username, and password (or API key) for the external data source you want to connect.
  • Network connectivity. The platform must be able to reach the source system. For on-premises databases, ensure the appropriate firewall rules or VPN tunnels are configured.

Step 1: Navigate to Data Workbench -- Ingestion

  1. Log in to the Matih platform.
  2. Open the Data Workbench from the left navigation panel.
  3. Select the Ingestion tab. This opens the Ingestion dashboard showing your configured sources, active connections, and recent sync history.

Step 2: Browse and Select a Connector

  1. Click Add Source in the upper right corner.
  2. The connector catalog displays available connector types organized by category (Databases, SaaS, Cloud Storage, Files, APIs).
  3. Use the search bar or browse by category to find your connector. For this example, select PostgreSQL.
  4. Click Select to proceed to the configuration form.

Available Connector Categories

| Category      | Examples                                                             |
|---------------|----------------------------------------------------------------------|
| Databases     | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, MariaDB, CockroachDB |
| SaaS          | Salesforce, HubSpot, Stripe, Zendesk, Jira, GitHub, Google Analytics |
| Cloud Storage | Amazon S3, Google Cloud Storage, Azure Blob Storage, SFTP            |
| APIs          | REST API (generic), GraphQL, Webhook                                 |

Step 3: Configure Connection Credentials

Fill in the connection configuration form for your selected connector. Each connector has its own set of required and optional fields.

PostgreSQL Example

| Field       | Value                                         | Required |
|-------------|-----------------------------------------------|----------|
| Source Name | production-orders-db                          | Yes      |
| Description | Production PostgreSQL database for order data | No       |
| Host        | orders-db.example.com                         | Yes      |
| Port        | 5432                                          | Yes      |
| Database    | orders                                        | Yes      |
| Username    | readonly_user                                 | Yes      |
| Password    | ********                                      | Yes      |
| SSL Mode    | require                                       | No       |
| Schema      | public                                        | No       |

Click Save to create the source. The source is created with status INACTIVE until the connection is tested.
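Before saving, it can help to sanity-check the form values locally. A minimal sketch in Python, assuming the fields from the table above; the camelCase keys and the validation rules are illustrative, not the platform's actual API:

```python
# Illustrative validation of a PostgreSQL source configuration.
# Field names mirror the form above; the rules here are assumptions.
REQUIRED_FIELDS = {"sourceName", "host", "port", "database", "username", "password"}

def validate_source_config(config: dict) -> list:
    """Return a list of validation errors (empty if the config looks valid)."""
    errors = [f"missing required field: {f}"
              for f in sorted(REQUIRED_FIELDS - config.keys())]
    port = config.get("port")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        errors.append("port must be an integer between 1 and 65535")
    return errors

config = {
    "sourceName": "production-orders-db",
    "host": "orders-db.example.com",
    "port": 5432,
    "database": "orders",
    "username": "readonly_user",
    "password": "********",
    "sslMode": "require",   # optional
    "schema": "public",     # optional
}
assert validate_source_config(config) == []
```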


Step 4: Test Connection

  1. On the source detail page, click Test Connection.
  2. The Ingestion Service sends the credentials to Airbyte, which attempts to connect to the source system.
  3. On success, the source status changes to ACTIVE and the lastTestedAt timestamp is recorded.
  4. On failure, the status changes to ERROR and the error message is displayed (e.g., "Connection refused", "Authentication failed", "SSL handshake error").

Troubleshooting Connection Failures

| Error                 | Cause                                  | Resolution                                                                       |
|-----------------------|----------------------------------------|----------------------------------------------------------------------------------|
| Connection refused    | Host unreachable or port blocked       | Verify hostname, port, and firewall rules                                        |
| Authentication failed | Wrong username or password             | Verify credentials; check if the user has LOGIN privilege                        |
| SSL handshake error   | SSL mode mismatch                      | Match the SSL mode to the server configuration (require, verify-ca, verify-full) |
| Unknown database      | Database name does not exist           | Verify the database name on the source system                                    |
| Timeout               | Network latency or connectivity issue  | Check VPN/tunnel status; increase connection timeout                             |
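The table above can be folded into a small triage helper for automated alerting. A sketch, assuming these substrings appear in the driver's error text (real messages vary by driver and version):

```python
# Map common connection-test error messages to the resolutions in the
# table above. The match substrings are assumptions, not an official list.
RESOLUTIONS = {
    "connection refused": "Verify hostname, port, and firewall rules",
    "authentication failed": "Verify credentials and the LOGIN privilege",
    "ssl": "Match the SSL mode to the server configuration",
    "unknown database": "Verify the database name on the source system",
    "timeout": "Check VPN/tunnel status; increase connection timeout",
}

def suggest_resolution(error_message: str) -> str:
    msg = error_message.lower()
    for pattern, resolution in RESOLUTIONS.items():
        if pattern in msg:
            return resolution
    return "See Sync Monitoring for common errors and resolutions"
```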

Step 5: Select Streams and Tables

  1. Once the connection test succeeds, click Discover Schema.
  2. The Ingestion Service calls Airbyte's schema discovery endpoint, which introspects the source system and returns all available streams.
  3. A stream corresponds to a table (for databases), a collection (for MongoDB), or an API resource (for SaaS connectors).
  4. For each stream, you can see:
    • Stream name (e.g., public.orders, public.customers)
    • Columns with detected data types
    • Supported sync modes (Full Refresh, Incremental, CDC)
  5. Select the streams you want to sync by toggling the checkboxes.
  6. For incremental or CDC streams, select the cursor column (e.g., updated_at) if required.
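The selections above can be pictured as per-stream metadata plus your choices. A sketch of that shape; the class is illustrative (the platform's actual model is Airbyte's catalog format, which carries the same information):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of a discovered stream and the selections made in
# the UI: whether it syncs, in which mode, and with which cursor column.
@dataclass
class DiscoveredStream:
    name: str                   # e.g. "public.orders"
    columns: dict               # column name -> detected data type
    supported_sync_modes: list  # e.g. ["full_refresh", "incremental"]
    selected: bool = False
    sync_mode: Optional[str] = None
    cursor_column: Optional[str] = None

orders = DiscoveredStream(
    name="public.orders",
    columns={"order_id": "integer", "customer_id": "integer",
             "updated_at": "timestamp"},
    supported_sync_modes=["full_refresh", "incremental"],
)

# Select the stream for incremental sync, with updated_at as the cursor.
orders.selected = True
orders.sync_mode = "incremental"
orders.cursor_column = "updated_at"
assert orders.sync_mode in orders.supported_sync_modes
assert orders.cursor_column in orders.columns
```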

Sync Mode Selection Guide

| Sync Mode    | When to Use                                                                | How It Works                                                                                                                                |
|--------------|----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Full Refresh | Small tables, reference data, or when you need a complete snapshot every sync | Extracts all rows from the source on every sync. Replaces the entire destination table.                                                    |
| Incremental  | Large tables with a reliable updated_at or created_at column               | Extracts only rows where the cursor column value is greater than the last sync's maximum. Appends to the destination.                       |
| CDC          | Tables where you need real-time change capture, including deletes          | Uses the database's transaction log (WAL for PostgreSQL) to capture inserts, updates, and deletes. Requires specific database configuration. |
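The incremental cursor mechanic can be made concrete. A self-contained sketch using sqlite3 as a stand-in source (the platform performs the real extraction through Airbyte): each sync reads only rows past the last recorded cursor value, then advances the cursor.

```python
import sqlite3

# Stand-in "source" table with an updated_at cursor column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-01-02")])

def incremental_extract(conn, last_cursor):
    """Extract only rows past the cursor, as Incremental mode does."""
    rows = conn.execute(
        "SELECT order_id, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (last_cursor,)).fetchall()
    new_cursor = rows[-1][1] if rows else last_cursor
    return rows, new_cursor

# First sync: the cursor starts below all rows, so everything is extracted.
rows, cursor = incremental_extract(src, "")
assert len(rows) == 2 and cursor == "2024-01-02"

# A new row arrives; the next sync picks up only that row.
src.execute("INSERT INTO orders VALUES (3, '2024-01-03')")
rows, cursor = incremental_extract(src, cursor)
assert rows == [(3, "2024-01-03")]
```

Note the caveat implied by "reliable updated_at column": rows updated without bumping the cursor column are silently missed, which is why CDC exists for delete/true-change capture.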

Step 6: Set Schedule

  1. In the connection configuration, choose a sync schedule:
    • Manual -- sync runs only when you trigger it
    • Scheduled -- sync runs on a cron schedule
  2. For scheduled syncs, enter a cron expression or select a preset:
| Preset            | Cron Expression | Description                              |
|-------------------|-----------------|------------------------------------------|
| Every hour        | 0 * * * *       | Runs at the top of every hour            |
| Every 6 hours     | 0 */6 * * *     | Runs at midnight, 6 AM, noon, 6 PM       |
| Daily at midnight | 0 0 * * *       | Runs once daily at 00:00 UTC             |
| Weekly on Sunday  | 0 0 * * 0       | Runs once weekly at midnight Sunday UTC  |
  3. Click Create Connection to save the connection with your selected streams, sync mode, and schedule.
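The presets are ordinary five-field cron expressions (minute, hour, day-of-month, month, day-of-week). A minimal matcher, supporting just enough syntax (`*`, plain numbers, `*/n`) to check the presets against a timestamp; this is an illustration, not the platform's scheduler:

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """True if dt matches a five-field cron expression (subset: *, n, */n)."""
    # Cron numbers Sunday as 0; isoweekday() numbers it as 7.
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    for field, value in zip(expr.split(), values):
        if field == "*":
            continue
        if field.startswith("*/"):
            if value % int(field[2:]) != 0:
                return False
        elif int(field) != value:
            return False
    return True

# "Every 6 hours" fires at 06:00 but not at 07:00.
assert cron_matches("0 */6 * * *", datetime(2024, 1, 1, 6, 0))
assert not cron_matches("0 */6 * * *", datetime(2024, 1, 1, 7, 0))
```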

Step 7: Run First Sync

  1. On the connection detail page, click Sync Now to trigger the first sync manually.
  2. The sync status changes to RUNNING. You can monitor progress on the Sync Monitoring dashboard.
  3. During the sync, Airbyte:
    • Connects to the source using the configured credentials
    • Extracts data from the selected streams
    • Writes the data to Apache Iceberg tables in your tenant's namespace
  4. When the sync completes, the status changes to SUCCEEDED with a summary:
    • Records synced -- total number of rows extracted
    • Bytes synced -- total data volume transferred
    • Duration -- elapsed time in seconds
  5. If the sync fails, the status changes to FAILED with an error message. See Sync Monitoring for common errors and resolutions.
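The status lifecycle above (RUNNING, then SUCCEEDED or FAILED) can be sketched as a polling loop. Here `fetch_status` is a stub standing in for however your client reads sync status; a real loop would also sleep between polls:

```python
# Poll a sync run until it reaches a terminal state. The status names
# come from the steps above; the polling interface is an assumption.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}

def wait_for_sync(fetch_status, max_polls: int = 10) -> str:
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
    raise TimeoutError("sync still RUNNING after polling budget exhausted")

# Stub: the sync reports RUNNING twice, then SUCCEEDED.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
assert wait_for_sync(lambda: next(statuses)) == "SUCCEEDED"
```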

Step 8: Query Ingested Data via SQL Workbench

Once the sync completes, the ingested data is immediately available for querying.

  1. Navigate to the SQL Workbench from the left navigation panel.
  2. Select the Trino query engine (or ClickHouse/StarRocks depending on your workload).
  3. Browse the catalog to find your ingested tables. They appear under your tenant's Iceberg catalog, in the namespace corresponding to the connection name.
  4. Write and execute a SQL query:
```sql
-- View all tables in the ingested namespace
SHOW TABLES IN iceberg.raw_data;

-- Query the ingested orders table
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status
FROM iceberg.raw_data.orders
ORDER BY order_date DESC
LIMIT 100;

-- Check row counts to verify sync completeness
SELECT COUNT(*) AS total_rows
FROM iceberg.raw_data.orders;
```
  5. Verify the row count matches what you expect from the source system.
  6. The data is now available to all platform consumers: the AI Service for natural language queries, the BI Service for dashboards, and the Pipeline Service for downstream transformations.
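If you verify syncs regularly, the row-count comparison can be scripted once both counts are in hand. A sketch, assuming you have already fetched the source-side count and the destination count from the query above (for Full Refresh the counts should match exactly; an Incremental sync that is mid-run may lag slightly):

```python
# Compare a source row count against the destination count. The
# tolerance parameter is illustrative: 0 for Full Refresh, a small
# allowance if rows keep arriving while the sync runs.
def reconcile(source_rows: int, destination_rows: int, tolerance: int = 0) -> str:
    if abs(source_rows - destination_rows) <= tolerance:
        return "OK"
    return f"MISMATCH: source={source_rows}, destination={destination_rows}"

print(reconcile(1_000_000, 1_000_000))  # OK
```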

Next Steps

After completing your first ingestion:

  • Add more sources. Connect additional databases, SaaS applications, or cloud storage. See Connectors for the full catalog.
  • Import files. Upload CSV, Excel, or Parquet files for ad-hoc data loading. See File Import.
  • Set up monitoring. Configure alerts for sync failures and track data freshness. See Sync Monitoring.
  • Build pipelines. Use the Pipeline Service (Chapter 11) to transform ingested raw data into analytics-ready models.
  • Ask questions. Use the AI Service (Chapter 12) to query your ingested data using natural language.