MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
File Import
Overview

File Import provides a direct way to upload and ingest structured data files into the MATIH platform without configuring a connector or an Airbyte source. Users upload a file through the Data Workbench, preview the inferred schema, adjust column mappings, and import the data into an Iceberg table.


Supported Formats

Format  | Extensions       | Max File Size | Features
--------|------------------|---------------|---------
CSV     | .csv, .tsv, .txt | 500 MB        | Auto-detect delimiter, encoding, headers; configurable quoting
Excel   | .xlsx, .xls      | 100 MB        | Multi-sheet support, data type inference, header row detection
Parquet | .parquet         | 1 GB          | Full schema preservation, partition awareness, predicate pushdown
JSON    | .json, .jsonl    | 500 MB        | Nested object flattening, array expansion
Avro    | .avro            | 1 GB          | Schema registry compatible, full type preservation

File Import Workflow

The file import process follows a four-step workflow.

+----------+     +-----------+     +--------------+     +----------+
| 1. Upload|---->| 2. Preview|---->| 3. Configure |---->| 4. Import|
| (file)   |     | (schema)  |     | (mappings)   |     | (execute)|
+----------+     +-----------+     +--------------+     +----------+
     |                |                   |                   |
  UPLOADED        PREVIEWING          PREVIEWING          IMPORTING
                                                              |
                                                         COMPLETED
                                                         or FAILED

Step 1: Upload

Upload a file via the Data Workbench UI or the REST API. The file is stored in tenant-scoped object storage and a FileImportJob record is created with status UPLOADED.

API:

POST /api/v1/files/upload
Content-Type: multipart/form-data

file: <binary file content>

Response:

{
  "id": "a1b2c3d4-...",
  "tenantId": "...",
  "fileName": "sales_2024.csv",
  "fileSize": 2456789,
  "fileFormat": "csv",
  "status": "UPLOADED",
  "createdAt": "2024-03-15T10:30:00Z"
}
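The upload call above can be sketched in Python using only the standard library. The endpoint path and the `file` form field follow the request shown; the helper names and error handling are illustrative, not part of the platform API:

```python
import json
import urllib.request
import uuid

def build_multipart(field_name: str, filename: str, content: bytes):
    """Build a multipart/form-data body and its matching Content-Type value."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"

def upload_file(base_url: str, path: str) -> dict:
    """POST a local file to the upload endpoint; returns the FileImportJob JSON."""
    with open(path, "rb") as f:
        content = f.read()
    body, content_type = build_multipart("file", path.rsplit("/", 1)[-1], content)
    req = urllib.request.Request(
        f"{base_url}/api/v1/files/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Authentication headers (if your deployment requires them) would be added to the `Request` alongside `Content-Type`.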

Step 2: Preview

Request a preview of the uploaded file. The Ingestion Service reads the file, infers column names and data types, and returns sample rows.

API:

GET /api/v1/files/{fileId}/preview

Response:

{
  "fileId": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "columns": [
    { "name": "order_id", "type": "INTEGER", "nullable": false, "sampleValues": ["1001", "1002", "1003"] },
    { "name": "customer_name", "type": "STRING", "nullable": true, "sampleValues": ["Alice", "Bob", "Carol"] },
    { "name": "order_date", "type": "DATE", "nullable": false, "sampleValues": ["2024-01-15", "2024-01-16", "2024-01-17"] },
    { "name": "amount", "type": "DECIMAL", "nullable": false, "sampleValues": ["150.00", "299.50", "75.25"] }
  ],
  "previewRows": [
    { "order_id": 1001, "customer_name": "Alice", "order_date": "2024-01-15", "amount": 150.00 },
    { "order_id": 1002, "customer_name": "Bob", "order_date": "2024-01-16", "amount": 299.50 },
    { "order_id": 1003, "customer_name": "Carol", "order_date": "2024-01-17", "amount": 75.25 }
  ],
  "totalRows": 15423,
  "inferredTypes": {
    "order_id": "INTEGER",
    "customer_name": "STRING",
    "order_date": "DATE",
    "amount": "DECIMAL"
  }
}
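Because the preview payload is plain JSON, a small helper can summarize the inferred schema before you decide on mappings. This is a sketch written against the response shape shown above:

```python
def summarize_preview(preview: dict) -> list[str]:
    """Render one line per column: name, inferred type, and nullability."""
    lines = []
    for col in preview["columns"]:
        null = "NULL" if col["nullable"] else "NOT NULL"
        lines.append(f'{col["name"]}: {col["type"]} {null}')
    return lines
```

For the sample response above, this yields lines like `order_id: INTEGER NOT NULL`.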

Step 3: Configure

Optionally update the target table name, target schema, and column mappings. This step allows you to:

  • Rename the destination table
  • Place the table in a specific Iceberg schema/namespace
  • Rename columns
  • Change inferred data types
  • Exclude columns from the import

API:

PUT /api/v1/files/{fileId}/schema

{
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "columnMappings": {
    "order_id": "order_id",
    "customer_name": "customer",
    "order_date": "date",
    "amount": "total_amount"
  }
}
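Assembling the PUT body can be sketched as below. Note one assumption: the document does not specify the exclusion mechanism, so this sketch assumes that a source column omitted from columnMappings is excluded from the import:

```python
def build_schema_update(table: str, schema: str, renames: dict, exclude: tuple = ()) -> dict:
    """Assemble the body for PUT /api/v1/files/{fileId}/schema.

    Assumption: a source column left out of columnMappings is excluded
    from the import (the exclusion mechanism is not documented above).
    """
    mappings = {src: dst for src, dst in renames.items() if src not in exclude}
    return {
        "targetTableName": table,
        "targetSchema": schema,
        "columnMappings": mappings,
    }
```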

Step 4: Import

Trigger the import. The Ingestion Service reads the file from object storage, applies column mappings, and writes the data into an Iceberg table.

API:

POST /api/v1/files/{fileId}/import

Response:

{
  "id": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "status": "IMPORTING",
  "createdAt": "2024-03-15T10:30:00Z"
}

Poll the status endpoint until the job reaches COMPLETED or FAILED:

GET /api/v1/files/{fileId}/status
{
  "id": "a1b2c3d4-...",
  "status": "COMPLETED",
  "recordsImported": 15423,
  "completedAt": "2024-03-15T10:31:45Z"
}
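The polling step can be sketched as a small loop. Here fetch_status is any callable returning the status JSON (for example, a GET against the status endpoint); injecting it keeps the sketch usable and testable without a live server. The interval and timeout values are illustrative:

```python
import time

def wait_for_import(fetch_status, poll_interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll until the job reaches COMPLETED or FAILED, then return its JSON."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("import did not finish within timeout")
```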

Import Job Statuses

Status     | Description
-----------|------------
UPLOADED   | File has been uploaded and stored. Preview has not been requested yet.
PREVIEWING | Schema inference is in progress or the schema has been updated.
IMPORTING  | Data is being written to the Iceberg table.
COMPLETED  | Import finished successfully. recordsImported contains the row count.
FAILED     | Import failed. errorMessage contains the failure reason.
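For client-side bookkeeping, the statuses can be modeled as a tiny state table. The transition set below is an assumption read off the workflow diagram earlier in this page, not a documented contract:

```python
# Assumed transitions, read from the workflow diagram; not an official contract.
VALID_TRANSITIONS = {
    "UPLOADED": {"PREVIEWING"},
    "PREVIEWING": {"PREVIEWING", "IMPORTING"},  # re-previewing after schema edits
    "IMPORTING": {"COMPLETED", "FAILED"},
    "COMPLETED": set(),
    "FAILED": set(),
}

def is_terminal(status: str) -> bool:
    """A job is finished when no further transitions are possible."""
    return not VALID_TRANSITIONS[status]
```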

Querying Imported Data

After a successful import, the data is immediately available via the Query Engine.

-- Query the imported table
SELECT *
FROM iceberg.raw_data.sales_orders_2024
LIMIT 10;
 
-- Check row count matches the import
SELECT COUNT(*) FROM iceberg.raw_data.sales_orders_2024;
-- Expected: 15423

Format-Specific Guides

  • CSV Import -- delimiter detection, encoding handling, header options, quoting rules
  • Excel Import -- multi-sheet support, data type handling, formula evaluation
  • Parquet Import -- schema preservation, partition handling, nested type support