File Import
File Import provides a direct way to upload and ingest structured data files into the Matih platform without configuring a connector or Airbyte source. Users upload a file through the Data Workbench, preview the inferred schema, adjust column mappings, and import the data into an Iceberg table.
Supported Formats
| Format | Extensions | Max File Size | Features |
|---|---|---|---|
| CSV | .csv, .tsv, .txt | 500 MB | Auto-detect delimiter, encoding, headers; configurable quoting |
| Excel | .xlsx, .xls | 100 MB | Multi-sheet support, data type inference, header row detection |
| Parquet | .parquet | 1 GB | Full schema preservation, partition awareness, predicate pushdown |
| JSON | .json, .jsonl | 500 MB | Nested object flattening, array expansion |
| Avro | .avro | 1 GB | Schema registry compatible, full type preservation |
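As a sketch, the extension and size limits in the table above can be checked client-side before uploading. The helper below is illustrative only (the format names and the table itself are the source of truth, not a platform API):

```python
from pathlib import Path

# Client-side pre-check mirroring the limits in the table above.
# Extension -> (format, max size in bytes); names are illustrative only.
FORMAT_LIMITS = {
    ".csv": ("csv", 500 * 1024**2), ".tsv": ("csv", 500 * 1024**2),
    ".txt": ("csv", 500 * 1024**2),
    ".xlsx": ("excel", 100 * 1024**2), ".xls": ("excel", 100 * 1024**2),
    ".parquet": ("parquet", 1024**3),
    ".json": ("json", 500 * 1024**2), ".jsonl": ("json", 500 * 1024**2),
    ".avro": ("avro", 1024**3),
}

def check_file(name: str, size: int) -> str:
    """Return the detected format, or raise ValueError if the file
    has an unsupported extension or exceeds its format's size limit."""
    fmt_info = FORMAT_LIMITS.get(Path(name).suffix.lower())
    if fmt_info is None:
        raise ValueError(f"unsupported extension: {name}")
    fmt, max_size = fmt_info
    if size > max_size:
        raise ValueError(f"{fmt} files are limited to {max_size} bytes")
    return fmt
```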
File Import Workflow
The file import process follows a four-step workflow.
```
+----------+      +-----------+      +--------------+      +----------+
| 1. Upload|----->| 2. Preview|----->| 3. Configure |----->| 4. Import|
|  (file)  |      |  (schema) |      |  (mappings)  |      | (execute)|
+----------+      +-----------+      +--------------+      +----------+
     |                  |                   |                    |
  UPLOADED          PREVIEWING         PREVIEWING           IMPORTING
                                                                 |
                                                            COMPLETED
                                                            or FAILED
```

Step 1: Upload
Upload a file via the Data Workbench UI or the REST API. The file is stored in tenant-scoped object storage and a FileImportJob record is created with status UPLOADED.
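For scripted uploads, the multipart request can be assembled with the Python standard library alone. This is a minimal sketch: the base URL is a placeholder, no authentication is shown, and the real service may require additional headers:

```python
import uuid
from urllib import request

def build_multipart(filename: str, content: bytes) -> tuple[bytes, str]:
    """Assemble a multipart/form-data body with a single 'file' part,
    as expected by POST /api/v1/files/upload. Returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + content + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def upload(base_url: str, filename: str, content: bytes):
    """Send the upload request (hypothetical base_url; auth omitted)."""
    body, content_type = build_multipart(filename, content)
    req = request.Request(f"{base_url}/api/v1/files/upload", data=body,
                          headers={"Content-Type": content_type},
                          method="POST")
    # The response JSON contains the FileImportJob record shown below.
    return request.urlopen(req)
```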
API:

```
POST /api/v1/files/upload
Content-Type: multipart/form-data

file: <binary file content>
```

Response:

```json
{
  "id": "a1b2c3d4-...",
  "tenantId": "...",
  "fileName": "sales_2024.csv",
  "fileSize": 2456789,
  "fileFormat": "csv",
  "status": "UPLOADED",
  "createdAt": "2024-03-15T10:30:00Z"
}
```

Step 2: Preview
Request a preview of the uploaded file. The Ingestion Service reads the file, infers column names and data types, and returns sample rows.
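One use of the preview is to sanity-check the inferred types before configuring the import in Step 3. A minimal sketch, assuming the column shape shown in the example response below (the parser table covers only the types that appear there):

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Parsers for the inferred types shown in the preview response.
# A sketch: extend this table for other types the service may infer.
PARSERS = {
    "INTEGER": int,
    "STRING": str,
    "DATE": date.fromisoformat,
    "DECIMAL": Decimal,
}

def verify_preview(columns: list) -> list:
    """Return names of columns whose sample values fail to parse as the
    inferred type -- candidates for a manual type override in Step 3."""
    suspect = []
    for col in columns:
        parse = PARSERS.get(col["type"])
        if parse is None:
            suspect.append(col["name"])
            continue
        try:
            for value in col["sampleValues"]:
                parse(value)
        except (ValueError, InvalidOperation):
            suspect.append(col["name"])
    return suspect
```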
API:

```
GET /api/v1/files/{fileId}/preview
```

Response:

```json
{
  "fileId": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "columns": [
    { "name": "order_id", "type": "INTEGER", "nullable": false, "sampleValues": ["1001", "1002", "1003"] },
    { "name": "customer_name", "type": "STRING", "nullable": true, "sampleValues": ["Alice", "Bob", "Carol"] },
    { "name": "order_date", "type": "DATE", "nullable": false, "sampleValues": ["2024-01-15", "2024-01-16", "2024-01-17"] },
    { "name": "amount", "type": "DECIMAL", "nullable": false, "sampleValues": ["150.00", "299.50", "75.25"] }
  ],
  "previewRows": [
    { "order_id": 1001, "customer_name": "Alice", "order_date": "2024-01-15", "amount": 150.00 },
    { "order_id": 1002, "customer_name": "Bob", "order_date": "2024-01-16", "amount": 299.50 },
    { "order_id": 1003, "customer_name": "Carol", "order_date": "2024-01-17", "amount": 75.25 }
  ],
  "totalRows": 15423,
  "inferredTypes": {
    "order_id": "INTEGER",
    "customer_name": "STRING",
    "order_date": "DATE",
    "amount": "DECIMAL"
  }
}
```

Step 3: Configure
Optionally update the target table name, target schema, and column mappings. This step allows you to:
- Rename the destination table
- Place the table in a specific Iceberg schema/namespace
- Rename columns
- Change inferred data types
- Exclude columns from the import
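The request body for this call can be built programmatically. A sketch; note the assumption (not confirmed above) that a source column omitted from columnMappings is how exclusion is expressed:

```python
import json

def build_schema_payload(table: str, schema: str, mappings: dict) -> str:
    """JSON body for PUT /api/v1/files/{fileId}/schema.
    `mappings` maps source column names to destination names.
    Assumption: omitting a source column excludes it from the import."""
    return json.dumps({
        "targetTableName": table,
        "targetSchema": schema,
        "columnMappings": mappings,
    }, indent=2)
```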
API:

```
PUT /api/v1/files/{fileId}/schema
```

```json
{
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "columnMappings": {
    "order_id": "order_id",
    "customer_name": "customer",
    "order_date": "date",
    "amount": "total_amount"
  }
}
```

Step 4: Import
Trigger the import. The Ingestion Service reads the file from object storage, applies column mappings, and writes the data into an Iceberg table.
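After triggering the import, a client typically polls the status endpoint until the job settles. A sketch of the polling loop, written against an injected fetch_status callable (which should GET /api/v1/files/{fileId}/status and return the parsed JSON) so the HTTP layer stays out of the way:

```python
import time

def poll_import(fetch_status, interval: float = 2.0,
                timeout: float = 600.0) -> dict:
    """Call `fetch_status()` repeatedly until the job reaches a
    terminal state (COMPLETED or FAILED), or raise on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(interval)
    raise TimeoutError("import did not reach COMPLETED or FAILED in time")
```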
API:

```
POST /api/v1/files/{fileId}/import
```

Response:

```json
{
  "id": "a1b2c3d4-...",
  "fileName": "sales_2024.csv",
  "targetTableName": "sales_orders_2024",
  "targetSchema": "raw_data",
  "status": "IMPORTING",
  "createdAt": "2024-03-15T10:30:00Z"
}
```

Poll the status endpoint until the job reaches COMPLETED or FAILED:

```
GET /api/v1/files/{fileId}/status
```

```json
{
  "id": "a1b2c3d4-...",
  "status": "COMPLETED",
  "recordsImported": 15423,
  "completedAt": "2024-03-15T10:31:45Z"
}
```

Import Job Statuses
| Status | Description |
|---|---|
| UPLOADED | File has been uploaded and stored. Preview has not been requested yet. |
| PREVIEWING | Schema inference is in progress or the schema has been updated. |
| IMPORTING | Data is being written to the Iceberg table. |
| COMPLETED | Import finished successfully. recordsImported contains the row count. |
| FAILED | Import failed. errorMessage contains the failure reason. |
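The statuses can be modeled as a small state machine. The transitions below follow the workflow diagram above and are our reading, not a documented guarantee (for instance, the service may allow re-previewing after a failure):

```python
from enum import Enum

class ImportStatus(Enum):
    UPLOADED = "UPLOADED"
    PREVIEWING = "PREVIEWING"
    IMPORTING = "IMPORTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

# Transitions implied by the four-step workflow (an assumption):
# upload -> preview -> (re)configure -> import -> completed/failed.
TRANSITIONS = {
    ImportStatus.UPLOADED: {ImportStatus.PREVIEWING},
    ImportStatus.PREVIEWING: {ImportStatus.PREVIEWING, ImportStatus.IMPORTING},
    ImportStatus.IMPORTING: {ImportStatus.COMPLETED, ImportStatus.FAILED},
    ImportStatus.COMPLETED: set(),
    ImportStatus.FAILED: set(),
}

def is_terminal(status: ImportStatus) -> bool:
    """A status with no outgoing transitions is terminal."""
    return not TRANSITIONS[status]
```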
Querying Imported Data
After a successful import, the data is immediately available via the Query Engine.
```sql
-- Query the imported table
SELECT *
FROM iceberg.raw_data.sales_orders_2024
LIMIT 10;

-- Check that the row count matches the import
SELECT COUNT(*) FROM iceberg.raw_data.sales_orders_2024;
-- Expected: 15423
```

Format-Specific Guides
- CSV Import -- delimiter detection, encoding handling, header options, quoting rules
- Excel Import -- multi-sheet support, data type handling, formula evaluation
- Parquet Import -- schema preservation, partition handling, nested type support