Getting Started
This guide walks through the complete workflow for ingesting external data into the Matih platform: configuring a source, discovering its schema, creating a sync connection, running the first sync, and querying the results.
Prerequisites
Before you begin, ensure the following are in place:
- Active tenant. Your organization must have a provisioned tenant. The Tenant Service automatically deploys the per-tenant Airbyte instance and Iceberg namespace during provisioning.
- User account with Data Engineer or Admin role. Source and connection management require the data:ingestion:manage permission.
- Source system credentials. Have the hostname, port, database name, username, and password (or API key) for the external data source you want to connect.
- Network connectivity. The platform must be able to reach the source system. For on-premises databases, ensure the appropriate firewall rules or VPN tunnels are configured.
Step 1: Navigate to Data Workbench -- Ingestion
- Log in to the Matih platform.
- Open the Data Workbench from the left navigation panel.
- Select the Ingestion tab. This opens the Ingestion dashboard showing your configured sources, active connections, and recent sync history.
Step 2: Browse and Select a Connector
- Click Add Source in the upper right corner.
- The connector catalog displays available connector types organized by category (Databases, SaaS, Cloud Storage, Files, APIs).
- Use the search bar or browse by category to find your connector. For this example, select PostgreSQL.
- Click Select to proceed to the configuration form.
Available Connector Categories
| Category | Examples |
|---|---|
| Databases | PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, MariaDB, CockroachDB |
| SaaS | Salesforce, HubSpot, Stripe, Zendesk, Jira, GitHub, Google Analytics |
| Cloud Storage | Amazon S3, Google Cloud Storage, Azure Blob Storage, SFTP |
| APIs | REST API (generic), GraphQL, Webhook |
Step 3: Configure Connection Credentials
Fill in the connection configuration form for your selected connector. Each connector has its own set of required and optional fields.
PostgreSQL Example
| Field | Value | Required |
|---|---|---|
| Source Name | production-orders-db | Yes |
| Description | Production PostgreSQL database for order data | No |
| Host | orders-db.example.com | Yes |
| Port | 5432 | Yes |
| Database | orders | Yes |
| Username | readonly_user | Yes |
| Password | ******** | Yes |
| SSL Mode | require | No |
| Schema | public | No |
Click Save to create the source. The source is created with status INACTIVE until the connection is tested.
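As a rough sketch, the configuration form above can be thought of as a record that is validated before the source is created. The field names mirror the table; the validation helper itself is illustrative, not a platform API.

```python
# Hypothetical sketch: validate a PostgreSQL source configuration before
# creating it. Field names mirror the configuration form above; the
# validator is illustrative, not part of the Matih platform.

REQUIRED_FIELDS = ("sourceName", "host", "port", "database", "username", "password")

def validate_source_config(config: dict) -> list:
    """Return a list of validation errors (an empty list means valid)."""
    errors = [f"missing required field: {f}" for f in REQUIRED_FIELDS if not config.get(f)]
    port = config.get("port")
    if port is not None and not (isinstance(port, int) and 0 < port < 65536):
        errors.append("port must be an integer between 1 and 65535")
    return errors

config = {
    "sourceName": "production-orders-db",
    "host": "orders-db.example.com",
    "port": 5432,
    "database": "orders",
    "username": "readonly_user",
    "password": "********",
    "sslMode": "require",   # optional
    "schema": "public",     # optional
}
print(validate_source_config(config))  # [] -- the example config is valid
```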
Step 4: Test Connection
- On the source detail page, click Test Connection.
- The Ingestion Service sends the credentials to Airbyte, which attempts to connect to the source system.
- On success, the source status changes to ACTIVE and the lastTestedAt timestamp is recorded.
- On failure, the status changes to ERROR and the error message is displayed (e.g., "Connection refused", "Authentication failed", "SSL handshake error").
Troubleshooting Connection Failures
| Error | Cause | Resolution |
|---|---|---|
| Connection refused | Host unreachable or port blocked | Verify hostname, port, and firewall rules |
| Authentication failed | Wrong username or password | Verify credentials; check if the user has LOGIN privilege |
| SSL handshake error | SSL mode mismatch | Match the SSL mode to the server configuration (require, verify-ca, verify-full) |
| Unknown database | Database name does not exist | Verify the database name on the source system |
| Timeout | Network latency or connectivity issue | Check VPN/tunnel status; increase connection timeout |
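The troubleshooting table above can be encoded as a simple lookup from error text to suggested resolution. This classifier is a sketch for illustration only, not platform code.

```python
# Illustrative mapping from common connection-test errors to resolutions,
# following the troubleshooting table above. Not part of the platform.

RESOLUTIONS = {
    "connection refused": "Verify hostname, port, and firewall rules",
    "authentication failed": "Verify credentials; check if the user has LOGIN privilege",
    "ssl handshake": "Match the SSL mode to the server configuration",
    "unknown database": "Verify the database name on the source system",
    "timeout": "Check VPN/tunnel status; increase connection timeout",
}

def suggest_resolution(error_message: str) -> str:
    """Match the error text case-insensitively against known patterns."""
    msg = error_message.lower()
    for pattern, resolution in RESOLUTIONS.items():
        if pattern in msg:
            return resolution
    return "Inspect the full error message in the sync logs"

print(suggest_resolution("FATAL: Authentication failed for user 'readonly_user'"))
# Verify credentials; check if the user has LOGIN privilege
```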
Step 5: Select Streams and Tables
- Once the connection test succeeds, click Discover Schema.
- The Ingestion Service calls Airbyte's schema discovery endpoint, which introspects the source system and returns all available streams.
- A stream corresponds to a table (for databases), a collection (for MongoDB), or an API resource (for SaaS connectors).
- For each stream, you can see:
- Stream name (e.g., public.orders, public.customers)
- Columns with detected data types
- Supported sync modes (Full Refresh, Incremental, CDC)
- Select the streams you want to sync by toggling the checkboxes.
- For incremental or CDC streams, select the cursor column (e.g., updated_at) if required.
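Conceptually, the discovered schema is a catalog of streams, each carrying its columns and supported sync modes, from which you enable the streams to sync. The structure below is a sketch; Airbyte's actual catalog format differs in detail.

```python
# Sketch of a discovered catalog and a stream selection made against it.
# The shapes are illustrative, not Airbyte's real catalog format.

catalog = [
    {"name": "public.orders",
     "columns": {"order_id": "integer", "updated_at": "timestamp"},
     "sync_modes": ["full_refresh", "incremental", "cdc"]},
    {"name": "public.customers",
     "columns": {"customer_id": "integer", "name": "string"},
     "sync_modes": ["full_refresh", "incremental"]},
]

# Enable both streams, using updated_at as the incremental cursor
selection = {
    "public.orders": {"sync_mode": "incremental", "cursor": "updated_at"},
    "public.customers": {"sync_mode": "full_refresh", "cursor": None},
}

# Sanity-check the selection against what discovery reported
for stream in catalog:
    choice = selection[stream["name"]]
    assert choice["sync_mode"] in stream["sync_modes"]
    if choice["sync_mode"] in ("incremental", "cdc") and choice["cursor"]:
        assert choice["cursor"] in stream["columns"]
print("stream selection is consistent with the discovered catalog")
```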
Sync Mode Selection Guide
| Sync Mode | When to Use | How It Works |
|---|---|---|
| Full Refresh | Small tables, reference data, or when you need a complete snapshot every sync | Extracts all rows from the source on every sync. Replaces the entire destination table. |
| Incremental | Large tables with a reliable updated_at or created_at column | Extracts only rows where the cursor column value is greater than the last sync's maximum. Appends to the destination. |
| CDC | Tables where you need real-time change capture, including deletes | Uses the database's transaction log (WAL for PostgreSQL) to capture inserts, updates, and deletes. Requires specific database configuration. |
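The incremental row in the table above can be sketched as a few lines of logic: only rows whose cursor value exceeds the last sync's maximum are extracted, and the stored cursor then advances. This mirrors the behavior described; it is not Airbyte's implementation.

```python
# Minimal sketch of incremental sync: extract rows newer than the stored
# cursor, then advance the cursor state. Illustrative only.

def incremental_extract(rows, cursor_field, state):
    """Return (new_rows, updated_state) for one incremental sync run."""
    last = state.get("max_cursor")
    new_rows = [r for r in rows if last is None or r[cursor_field] > last]
    if new_rows:
        state = {"max_cursor": max(r[cursor_field] for r in new_rows)}
    return new_rows, state

source = [
    {"order_id": 1, "updated_at": "2024-05-01T10:00:00Z"},
    {"order_id": 2, "updated_at": "2024-05-02T09:30:00Z"},
    {"order_id": 3, "updated_at": "2024-05-03T12:15:00Z"},
]

# First sync: no prior state, so every row is extracted
batch, state = incremental_extract(source, "updated_at", {})
print(len(batch), state["max_cursor"])  # 3 2024-05-03T12:15:00Z

# Second sync: nothing changed at the source, so nothing is extracted
batch, state = incremental_extract(source, "updated_at", state)
print(len(batch))  # 0
```

Note that ISO-8601 timestamps compare correctly as strings, which is why the sketch can use plain `>` on the cursor values.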
Step 6: Set Schedule
- In the connection configuration, choose a sync schedule:
- Manual -- sync runs only when you trigger it
- Scheduled -- sync runs on a cron schedule
- For scheduled syncs, enter a cron expression or select a preset:
| Preset | Cron Expression | Description |
|---|---|---|
| Every hour | 0 * * * * | Runs at the top of every hour |
| Every 6 hours | 0 */6 * * * | Runs at midnight, 6 AM, noon, 6 PM |
| Daily at midnight | 0 0 * * * | Runs once daily at 00:00 UTC |
| Weekly on Sunday | 0 0 * * 0 | Runs once weekly at midnight Sunday UTC |
- Click Create Connection to save the connection with your selected streams, sync mode, and schedule.
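The preset expressions above use standard five-field cron syntax (minute, hour, day-of-month, month, day-of-week). A tiny matcher covering just the features those presets need looks like this; real schedulers handle many more cron features.

```python
# Tiny cron matcher for the preset expressions above. Supports '*',
# '*/n', and literal values only -- a sketch, not a full cron parser.

from datetime import datetime

def field_matches(expr: str, value: int) -> bool:
    if expr == "*":
        return True
    if expr.startswith("*/"):
        return value % int(expr[2:]) == 0
    return value == int(expr)

def cron_matches(cron: str, ts: datetime) -> bool:
    minute, hour, dom, month, dow = cron.split()
    # cron day-of-week: 0 = Sunday; Python weekday(): 0 = Monday
    py_dow = (ts.weekday() + 1) % 7
    return (field_matches(minute, ts.minute) and field_matches(hour, ts.hour)
            and field_matches(dom, ts.day) and field_matches(month, ts.month)
            and field_matches(dow, py_dow))

noon = datetime(2024, 5, 5, 12, 0)          # a Sunday at noon UTC
print(cron_matches("0 */6 * * *", noon))    # True: noon is a */6 hour
print(cron_matches("0 0 * * 0", noon))      # False: weekly preset fires at 00:00
```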
Step 7: Run First Sync
- On the connection detail page, click Sync Now to trigger the first sync manually.
- The sync status changes to RUNNING. You can monitor progress on the Sync Monitoring dashboard.
- During the sync, Airbyte:
- Connects to the source using the configured credentials
- Extracts data from the selected streams
- Writes the data to Apache Iceberg tables in your tenant's namespace
- When the sync completes, the status changes to SUCCEEDED with a summary:
- Records synced -- total number of rows extracted
- Bytes synced -- total data volume transferred
- Duration -- elapsed time in seconds
- If the sync fails, the status changes to FAILED with an error message. See Sync Monitoring for common errors and resolutions.
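The completion summary above can be sketched as a simple aggregation over the batches a sync extracted. Field names follow the bullet list; the batch data and timestamps are fabricated for illustration.

```python
# Hedged sketch of the sync summary reported on completion: records,
# bytes, and duration aggregated over extracted batches. Illustrative
# data only; not the platform's actual reporting code.

def summarize_sync(batches, started_at, finished_at):
    return {
        "status": "SUCCEEDED",
        "recordsSynced": sum(b["records"] for b in batches),
        "bytesSynced": sum(b["bytes"] for b in batches),
        "durationSeconds": finished_at - started_at,
    }

batches = [{"records": 120_000, "bytes": 45_000_000},
           {"records": 80_000, "bytes": 30_000_000}]
summary = summarize_sync(batches, started_at=1_700_000_000, finished_at=1_700_000_420)
print(summary)
# {'status': 'SUCCEEDED', 'recordsSynced': 200000, 'bytesSynced': 75000000, 'durationSeconds': 420}
```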
Step 8: Query Ingested Data via SQL Workbench
Once the sync completes, the ingested data is immediately available for querying.
- Navigate to the SQL Workbench from the left navigation panel.
- Select the Trino query engine (or ClickHouse/StarRocks depending on your workload).
- Browse the catalog to find your ingested tables. They appear under your tenant's Iceberg catalog, in the namespace corresponding to the connection name.
- Write and execute a SQL query:

```sql
-- View all tables in the ingested namespace
SHOW TABLES IN iceberg.raw_data;

-- Query the ingested orders table
SELECT
  order_id,
  customer_id,
  order_date,
  total_amount,
  status
FROM iceberg.raw_data.orders
ORDER BY order_date DESC
LIMIT 100;

-- Check row counts to verify sync completeness
SELECT COUNT(*) AS total_rows
FROM iceberg.raw_data.orders;
```

- Verify the row count matches what you expect from the source system.
- The data is now available to all platform consumers: the AI Service for natural language queries, the BI Service for dashboards, and the Pipeline Service for downstream transformations.
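The row-count verification step can be automated as a comparison between source and destination counts. Everything below is hypothetical: `run_query` is a stand-in for whatever client library you use to execute SQL against each engine, and the counts are fabricated.

```python
# Illustrative completeness check comparing source and destination row
# counts. run_query is a hypothetical stand-in, not a real client API.

def run_query(engine: str, sql: str) -> int:
    # Stand-in returning a canned COUNT(*) result; a real check would
    # execute the SQL through the engine's client library.
    fake_results = {
        ("postgres", "SELECT COUNT(*) FROM public.orders"): 200_000,
        ("trino", "SELECT COUNT(*) FROM iceberg.raw_data.orders"): 200_000,
    }
    return fake_results[(engine, sql)]

source_count = run_query("postgres", "SELECT COUNT(*) FROM public.orders")
dest_count = run_query("trino", "SELECT COUNT(*) FROM iceberg.raw_data.orders")
assert source_count == dest_count, f"row count mismatch: {source_count} vs {dest_count}"
print("sync complete:", dest_count, "rows")
```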
Next Steps
After completing your first ingestion:
- Add more sources. Connect additional databases, SaaS applications, or cloud storage. See Connectors for the full catalog.
- Import files. Upload CSV, Excel, or Parquet files for ad-hoc data loading. See File Import.
- Set up monitoring. Configure alerts for sync failures and track data freshness. See Sync Monitoring.
- Build pipelines. Use the Pipeline Service (Chapter 11) to transform ingested raw data into analytics-ready models.
- Ask questions. Use the AI Service (Chapter 12) to query your ingested data using natural language.