MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
File Import
Overview

File Import provides a direct way to upload and ingest structured data files into the MATIH platform without configuring a connector or an Airbyte source. Users upload a file through the Data Workbench, preview the inferred schema, adjust column mappings, and import the data into an Iceberg table.


Supported Formats

Format  | Extensions       | Max File Size | Features
--------|------------------|---------------|---------
CSV     | .csv, .tsv, .txt | 500 MB        | Auto-detect delimiter, encoding, headers; configurable quoting
Excel   | .xlsx, .xls      | 100 MB        | Multi-sheet support, data type inference, header row detection
Parquet | .parquet         | 1 GB          | Full schema preservation, partition awareness, predicate pushdown
JSON    | .json, .jsonl    | 500 MB        | Nested object flattening, array expansion
Avro    | .avro            | 1 GB          | Schema registry compatible, full type preservation

File Import Workflow

The file import process follows a four-step workflow.

+----------+     +-----------+     +--------------+     +----------+
| 1. Upload|---->| 2. Preview|---->| 3. Configure |---->| 4. Import|
| (file)   |     | (schema)  |     | (mappings)   |     | (execute)|
+----------+     +-----------+     +--------------+     +----------+
     |                |                   |                   |
  UPLOADED        PREVIEWING          PREVIEWING          IMPORTING
                                                              |
                                                         COMPLETED
                                                         or FAILED

Step 1: Upload

Upload a file via the Data Workbench UI or the REST API. The file is stored in tenant-scoped object storage and a FileImportJob record is created with status UPLOADED.

API:

POST /api/v1/files/upload
Content-Type: multipart/form-data

file: <binary file content>

Response:

{
  "id": "a1b2c3d4-...",
  "tenantId": "...",
  "fileName": "sales_2024.csv",
  "fileSize": 2456789,
  "fileFormat": "csv",
  "status": "UPLOADED",
  "createdAt": "2024-03-15T10:30:00Z"
}
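The upload call above can be sketched in Python using only the standard library. The endpoint path and the `file` form field follow the request shown; the helper names and error handling are illustrative, not part of the platform API:

```python
import json
import urllib.request
import uuid

def build_multipart(field_name: str, filename: str, content: bytes):
    """Build a multipart/form-data body and its matching Content-Type value."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"

def upload_file(base_url: str, path: str) -> dict:
    """POST a local file to the upload endpoint; returns the FileImportJob JSON."""
    with open(path, "rb") as f:
        content = f.read()
    body, content_type = build_multipart("file", path.rsplit("/", 1)[-1], content)
    req = urllib.request.Request(
        f"{base_url}/api/v1/files/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Authentication headers (if your deployment requires them) would be added to the `Request` alongside `Content-Type`.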

Step 2: Preview

Request a preview of the uploaded file. The Ingestion Service reads the file, infers column names and data types, and returns sample rows.

API:

GET /api/v1/files/{fileId}/preview

Response:

{
  "fileId": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "columns": [
    { "name": "order_id", "type": "INTEGER", "nullable": false, "sampleValues": ["1001", "1002", "1003"] },
    { "name": "customer_name", "type": "STRING", "nullable": true, "sampleValues": ["Alice", "Bob", "Carol"] },
    { "name": "order_date", "type": "DATE", "nullable": false, "sampleValues": ["2024-01-15", "2024-01-16", "2024-01-17"] },
    { "name": "amount", "type": "DECIMAL", "nullable": false, "sampleValues": ["150.00", "299.50", "75.25"] }
  ],
  "previewRows": [
    { "order_id": 1001, "customer_name": "Alice", "order_date": "2024-01-15", "amount": 150.00 },
    { "order_id": 1002, "customer_name": "Bob", "order_date": "2024-01-16", "amount": 299.50 },
    { "order_id": 1003, "customer_name": "Carol", "order_date": "2024-01-17", "amount": 75.25 }
  ],
  "totalRows": 15423,
  "inferredTypes": {
    "order_id": "INTEGER",
    "customer_name": "STRING",
    "order_date": "DATE",
    "amount": "DECIMAL"
  }
}
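Because the preview payload is plain JSON, a small helper can summarize the inferred schema before you decide on mappings. This is a sketch written against the response shape shown above:

```python
def summarize_preview(preview: dict) -> list[str]:
    """Render one line per column: name, inferred type, and nullability."""
    lines = []
    for col in preview["columns"]:
        null = "NULL" if col["nullable"] else "NOT NULL"
        lines.append(f'{col["name"]}: {col["type"]} {null}')
    return lines
```

For the sample response above, this yields lines like `order_id: INTEGER NOT NULL`.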

Step 3: Configure

Optionally update the target table name, target schema, and column mappings. This step allows you to:

  • Rename the destination table
  • Place the table in a specific Iceberg schema/namespace
  • Rename columns
  • Change inferred data types
  • Exclude columns from the import

API:

PUT /api/v1/files/{fileId}/schema

{
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "columnMappings": {
    "order_id": "order_id",
    "customer_name": "customer",
    "order_date": "date",
    "amount": "total_amount"
  }
}
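Assembling the PUT body can be sketched as below. Note one assumption: the document does not specify the exclusion mechanism, so this sketch assumes that a source column omitted from columnMappings is excluded from the import:

```python
def build_schema_update(table: str, schema: str, renames: dict, exclude: tuple = ()) -> dict:
    """Assemble the body for PUT /api/v1/files/{fileId}/schema.

    Assumption: a source column left out of columnMappings is excluded
    from the import (the exclusion mechanism is not documented above).
    """
    mappings = {src: dst for src, dst in renames.items() if src not in exclude}
    return {
        "targetTableName": table,
        "targetSchema": schema,
        "columnMappings": mappings,
    }
```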

Step 4: Import

Trigger the import. The Ingestion Service reads the file from object storage, applies column mappings, and writes the data into an Iceberg table.

API:

POST /api/v1/files/{fileId}/import

Response:

{
  "id": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "status": "IMPORTING",
  "createdAt": "2024-03-15T10:30:00Z"
}

Poll the status endpoint until the job reaches COMPLETED or FAILED:

GET /api/v1/files/{fileId}/status
{
  "id": "a1b2c3d4-...",
  "status": "COMPLETED",
  "recordsImported": 15423,
  "completedAt": "2024-03-15T10:31:45Z"
}
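The polling step can be sketched as a small loop. Here fetch_status is any callable returning the status JSON (for example, a GET against the status endpoint); injecting it keeps the sketch usable and testable without a live server. The interval and timeout values are illustrative:

```python
import time

def wait_for_import(fetch_status, poll_interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll until the job reaches COMPLETED or FAILED, then return its JSON."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("import did not finish within timeout")
```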

Import Job Statuses

Status     | Description
-----------|------------
UPLOADED   | File has been uploaded and stored. Preview has not been requested yet.
PREVIEWING | Schema inference is in progress or the schema has been updated.
IMPORTING  | Data is being written to the Iceberg table.
COMPLETED  | Import finished successfully. recordsImported contains the row count.
FAILED     | Import failed. errorMessage contains the failure reason.
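For client-side bookkeeping, the statuses can be modeled as a tiny state table. The transition set below is an assumption read off the workflow diagram earlier in this page, not a documented contract:

```python
# Assumed transitions, read from the workflow diagram; not an official contract.
VALID_TRANSITIONS = {
    "UPLOADED": {"PREVIEWING"},
    "PREVIEWING": {"PREVIEWING", "IMPORTING"},  # re-previewing after schema edits
    "IMPORTING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}

def is_terminal(status: str) -> bool:
    """A job is finished when no further transitions are possible."""
    return not VALID_TRANSITIONS[status]
```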

Querying Imported Data

After a successful import, the data is immediately available via the Query Engine.

-- Query the imported table
SELECT *
FROM iceberg.raw_data.sales_orders_2024
LIMIT 10;
 
-- Check row count matches the import
SELECT COUNT(*) FROM iceberg.raw_data.sales_orders_2024;
-- Expected: 15423

Format-Specific Guides

  • CSV Import -- delimiter detection, encoding handling, header options, quoting rules
  • Excel Import -- multi-sheet support, data type handling, formula evaluation
  • Parquet Import -- schema preservation, partition handling, nested type support