MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
Getting Started

This guide walks through the complete workflow for ingesting external data into the Matih platform: configuring a source, discovering its schema, creating a sync connection, running the first sync, and querying the results.


Prerequisites

Before you begin, ensure the following are in place:

  • Active tenant. Your organization must have a provisioned tenant. The Tenant Service automatically deploys the per-tenant Airbyte instance and Iceberg namespace during provisioning.
  • User account with Data Engineer or Admin role. Source and connection management require the data:ingestion:manage permission.
  • Source system credentials. Have the hostname, port, database name, username, and password (or API key) for the external data source you want to connect.
  • Network connectivity. The platform must be able to reach the source system. For on-premises databases, ensure the appropriate firewall rules or VPN tunnels are configured.

Step 1: Navigate to Data Workbench -- Ingestion

  1. Log in to the Matih platform.
  2. Open the Data Workbench from the left navigation panel.
  3. Select the Ingestion tab. This opens the Ingestion dashboard showing your configured sources, active connections, and recent sync history.

Step 2: Browse and Select a Connector

  1. Click Add Source in the upper right corner.
  2. The connector catalog displays available connector types organized by category (Databases, SaaS, Cloud Storage, Files, APIs).
  3. Use the search bar or browse by category to find your connector. For this example, select PostgreSQL.
  4. Click Select to proceed to the configuration form.

Available Connector Categories

| Category      | Examples                                                             |
|---------------|----------------------------------------------------------------------|
| Databases     | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, MariaDB, CockroachDB |
| SaaS          | Salesforce, HubSpot, Stripe, Zendesk, Jira, GitHub, Google Analytics |
| Cloud Storage | Amazon S3, Google Cloud Storage, Azure Blob Storage, SFTP            |
| APIs          | REST API (generic), GraphQL, Webhook                                 |

Step 3: Configure Connection Credentials

Fill in the connection configuration form for your selected connector. Each connector has its own set of required and optional fields.

PostgreSQL Example

| Field       | Value                                         | Required |
|-------------|-----------------------------------------------|----------|
| Source Name | production-orders-db                          | Yes      |
| Description | Production PostgreSQL database for order data | No       |
| Host        | orders-db.example.com                         | Yes      |
| Port        | 5432                                          | Yes      |
| Database    | orders                                        | Yes      |
| Username    | readonly_user                                 | Yes      |
| Password    | ********                                      | Yes      |
| SSL Mode    | require                                       | No       |
| Schema      | public                                        | No       |

Click Save to create the source. The source is created with status INACTIVE until the connection is tested.
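Before saving, it can help to sanity-check the form values locally. A minimal sketch in Python, assuming the fields from the table above; the camelCase keys and the validation rules are illustrative, not the platform's actual API:

```python
# Illustrative validation of a PostgreSQL source configuration.
# Field names mirror the form above; the rules here are assumptions.
REQUIRED_FIELDS = {"sourceName", "host", "port", "database", "username", "password"}

def validate_source_config(config: dict) -> list:
    """Return a list of validation errors (empty if the config looks valid)."""
    errors = [f"missing required field: {f}"
              for f in sorted(REQUIRED_FIELDS - config.keys())]
    port = config.get("port")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        errors.append("port must be an integer between 1 and 65535")
    return errors

config = {
    "sourceName": "production-orders-db",
    "host": "orders-db.example.com",
    "port": 5432,
    "database": "orders",
    "username": "readonly_user",
    "password": "********",
    "sslMode": "require",   # optional
    "schema": "public",     # optional
}
assert validate_source_config(config) == []
```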


Step 4: Test Connection

  1. On the source detail page, click Test Connection.
  2. The Ingestion Service sends the credentials to Airbyte, which attempts to connect to the source system.
  3. On success, the source status changes to ACTIVE and the lastTestedAt timestamp is recorded.
  4. On failure, the status changes to ERROR and the error message is displayed (e.g., "Connection refused", "Authentication failed", "SSL handshake error").

Troubleshooting Connection Failures

| Error                 | Cause                                  | Resolution                                                                       |
|-----------------------|----------------------------------------|----------------------------------------------------------------------------------|
| Connection refused    | Host unreachable or port blocked       | Verify hostname, port, and firewall rules                                        |
| Authentication failed | Wrong username or password             | Verify credentials; check if the user has LOGIN privilege                        |
| SSL handshake error   | SSL mode mismatch                      | Match the SSL mode to the server configuration (require, verify-ca, verify-full) |
| Unknown database      | Database name does not exist           | Verify the database name on the source system                                    |
| Timeout               | Network latency or connectivity issue  | Check VPN/tunnel status; increase connection timeout                             |
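The table above can be folded into a small triage helper for automated alerting. A sketch, assuming these substrings appear in the driver's error text (real messages vary by driver and version):

```python
# Map common connection-test error messages to the resolutions in the
# table above. The match substrings are assumptions, not an official list.
RESOLUTIONS = {
    "connection refused": "Verify hostname, port, and firewall rules",
    "authentication failed": "Verify credentials and the LOGIN privilege",
    "ssl": "Match the SSL mode to the server configuration",
    "unknown database": "Verify the database name on the source system",
    "timeout": "Check VPN/tunnel status; increase connection timeout",
}

def suggest_resolution(error_message: str) -> str:
    msg = error_message.lower()
    for pattern, resolution in RESOLUTIONS.items():
        if pattern in msg:
            return resolution
    return "See Sync Monitoring for common errors and resolutions"
```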

Step 5: Select Streams and Tables

  1. Once the connection test succeeds, click Discover Schema.
  2. The Ingestion Service calls Airbyte's schema discovery endpoint, which introspects the source system and returns all available streams.
  3. A stream corresponds to a table (for databases), a collection (for MongoDB), or an API resource (for SaaS connectors).
  4. For each stream, you can see:
    • Stream name (e.g., public.orders, public.customers)
    • Columns with detected data types
    • Supported sync modes (Full Refresh, Incremental, CDC)
  5. Select the streams you want to sync by toggling the checkboxes.
  6. For incremental or CDC streams, select the cursor column (e.g., updated_at) if required.
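The selections above can be pictured as per-stream metadata plus your choices. A sketch of that shape; the class is illustrative (the platform's actual model is Airbyte's catalog format, which carries the same information):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative model of a discovered stream and the selections made in
# the UI: whether it syncs, in which mode, and with which cursor column.
@dataclass
class DiscoveredStream:
    name: str                   # e.g. "public.orders"
    columns: dict               # column name -> detected data type
    supported_sync_modes: list  # e.g. ["full_refresh", "incremental"]
    selected: bool = False
    sync_mode: Optional[str] = None
    cursor_column: Optional[str] = None

orders = DiscoveredStream(
    name="public.orders",
    columns={"order_id": "integer", "customer_id": "integer",
             "updated_at": "timestamp"},
    supported_sync_modes=["full_refresh", "incremental"],
)

# Select the stream for incremental sync, with updated_at as the cursor.
orders.selected = True
orders.sync_mode = "incremental"
orders.cursor_column = "updated_at"
assert orders.sync_mode in orders.supported_sync_modes
assert orders.cursor_column in orders.columns
```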

Sync Mode Selection Guide

| Sync Mode    | When to Use                                                                | How It Works                                                                                                                                |
|--------------|----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Full Refresh | Small tables, reference data, or when you need a complete snapshot every sync | Extracts all rows from the source on every sync. Replaces the entire destination table.                                                    |
| Incremental  | Large tables with a reliable updated_at or created_at column               | Extracts only rows where the cursor column value is greater than the last sync's maximum. Appends to the destination.                       |
| CDC          | Tables where you need real-time change capture, including deletes          | Uses the database's transaction log (WAL for PostgreSQL) to capture inserts, updates, and deletes. Requires specific database configuration. |
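The incremental cursor mechanic can be made concrete. A self-contained sketch using sqlite3 as a stand-in source (the platform performs the real extraction through Airbyte): each sync reads only rows past the last recorded cursor value, then advances the cursor.

```python
import sqlite3

# Stand-in "source" table with an updated_at cursor column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-01-02")])

def incremental_extract(conn, last_cursor):
    """Extract only rows past the cursor, as Incremental mode does."""
    rows = conn.execute(
        "SELECT order_id, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (last_cursor,)).fetchall()
    new_cursor = rows[-1][1] if rows else last_cursor
    return rows, new_cursor

# First sync: the cursor starts below all rows, so everything is extracted.
rows, cursor = incremental_extract(src, "")
assert len(rows) == 2 and cursor == "2024-01-02"

# A new row arrives; the next sync picks up only that row.
src.execute("INSERT INTO orders VALUES (3, '2024-01-03')")
rows, cursor = incremental_extract(src, cursor)
assert rows == [(3, "2024-01-03")]
```

Note the caveat implied by "reliable updated_at column": rows updated without bumping the cursor column are silently missed, which is why CDC exists for delete/true-change capture.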

Step 6: Set Schedule

  1. In the connection configuration, choose a sync schedule:
    • Manual -- sync runs only when you trigger it
    • Scheduled -- sync runs on a cron schedule
  2. For scheduled syncs, enter a cron expression or select a preset:
| Preset            | Cron Expression | Description                              |
|-------------------|-----------------|------------------------------------------|
| Every hour        | 0 * * * *       | Runs at the top of every hour            |
| Every 6 hours     | 0 */6 * * *     | Runs at midnight, 6 AM, noon, 6 PM       |
| Daily at midnight | 0 0 * * *       | Runs once daily at 00:00 UTC             |
| Weekly on Sunday  | 0 0 * * 0       | Runs once weekly at midnight Sunday UTC  |
  3. Click Create Connection to save the connection with your selected streams, sync mode, and schedule.
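The presets are ordinary five-field cron expressions (minute, hour, day-of-month, month, day-of-week). A minimal matcher, supporting just enough syntax (`*`, plain numbers, `*/n`) to check the presets against a timestamp; this is an illustration, not the platform's scheduler:

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """True if dt matches a five-field cron expression (subset: *, n, */n)."""
    # Cron numbers Sunday as 0; isoweekday() numbers it as 7.
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    for field, value in zip(expr.split(), values):
        if field == "*":
            continue
        if field.startswith("*/"):
            if value % int(field[2:]) != 0:
                return False
        elif int(field) != value:
            return False
    return True

# "Every 6 hours" fires at 06:00 but not at 07:00.
assert cron_matches("0 */6 * * *", datetime(2024, 1, 1, 6, 0))
assert not cron_matches("0 */6 * * *", datetime(2024, 1, 1, 7, 0))
```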

Step 7: Run First Sync

  1. On the connection detail page, click Sync Now to trigger the first sync manually.
  2. The sync status changes to RUNNING. You can monitor progress on the Sync Monitoring dashboard.
  3. During the sync, Airbyte:
    • Connects to the source using the configured credentials
    • Extracts data from the selected streams
    • Writes the data to Apache Iceberg tables in your tenant's namespace
  4. When the sync completes, the status changes to SUCCEEDED with a summary:
    • Records synced -- total number of rows extracted
    • Bytes synced -- total data volume transferred
    • Duration -- elapsed time in seconds
  5. If the sync fails, the status changes to FAILED with an error message. See Sync Monitoring for common errors and resolutions.
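The status lifecycle above (RUNNING, then SUCCEEDED or FAILED) can be sketched as a polling loop. Here `fetch_status` is a stub standing in for however your client reads sync status; a real loop would also sleep between polls:

```python
# Poll a sync run until it reaches a terminal state. The status names
# come from the steps above; the polling interface is an assumption.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}

def wait_for_sync(fetch_status, max_polls: int = 10) -> str:
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
    raise TimeoutError("sync still RUNNING after polling budget exhausted")

# Stub: the sync reports RUNNING twice, then SUCCEEDED.
statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
assert wait_for_sync(lambda: next(statuses)) == "SUCCEEDED"
```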

Step 8: Query Ingested Data via SQL Workbench

Once the sync completes, the ingested data is immediately available for querying.

  1. Navigate to the SQL Workbench from the left navigation panel.
  2. Select the Trino query engine (or ClickHouse/StarRocks depending on your workload).
  3. Browse the catalog to find your ingested tables. They appear under your tenant's Iceberg catalog, in the namespace corresponding to the connection name.
  4. Write and execute a SQL query:
```sql
-- View all tables in the ingested namespace
SHOW TABLES IN iceberg.raw_data;

-- Query the ingested orders table
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status
FROM iceberg.raw_data.orders
ORDER BY order_date DESC
LIMIT 100;

-- Check row counts to verify sync completeness
SELECT COUNT(*) AS total_rows
FROM iceberg.raw_data.orders;
```
  5. Verify the row count matches what you expect from the source system.
  6. The data is now available to all platform consumers: the AI Service for natural language queries, the BI Service for dashboards, and the Pipeline Service for downstream transformations.
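If you verify syncs regularly, the row-count comparison can be scripted once both counts are in hand. A sketch, assuming you have already fetched the source-side count and the destination count from the query above (for Full Refresh the counts should match exactly; an Incremental sync that is mid-run may lag slightly):

```python
# Compare a source row count against the destination count. The
# tolerance parameter is illustrative: 0 for Full Refresh, a small
# allowance if rows keep arriving while the sync runs.
def reconcile(source_rows: int, destination_rows: int, tolerance: int = 0) -> str:
    if abs(source_rows - destination_rows) <= tolerance:
        return "OK"
    return f"MISMATCH: source={source_rows}, destination={destination_rows}"

print(reconcile(1_000_000, 1_000_000))  # OK
```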

Next Steps

After completing your first ingestion:

  • Add more sources. Connect additional databases, SaaS applications, or cloud storage. See Connectors for the full catalog.
  • Import files. Upload CSV, Excel, or Parquet files for ad-hoc data loading. See File Import.
  • Set up monitoring. Configure alerts for sync failures and track data freshness. See Sync Monitoring.
  • Build pipelines. Use the Pipeline Service (Chapter 11) to transform ingested raw data into analytics-ready models.
  • Ask questions. Use the AI Service (Chapter 12) to query your ingested data using natural language.