CSV Import
CSV (Comma-Separated Values) is the most common format for file import. The Matih platform auto-detects delimiters, encoding, headers, and data types, allowing most CSV files to be imported without manual configuration.
Automatic Detection
When a CSV file is uploaded, the Ingestion Service performs the following automatic detection steps.
| Detection | Method | Fallback |
|---|---|---|
| Delimiter | Statistical analysis of first 100 lines for , \t \| and ; | Defaults to , |
| Encoding | BOM detection, then charset detection (UTF-8, Latin-1, Windows-1252) | Defaults to UTF-8 |
| Header row | Checks if first row contains non-numeric values distinct from data rows | Assumes first row is header |
| Data types | Samples first 1,000 rows to infer INTEGER, DECIMAL, DATE, TIMESTAMP, BOOLEAN, STRING | Defaults to STRING |
| Quote character | Detects " or ' based on presence in first 100 lines | Defaults to " |
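The delimiter step can be sketched with Python's standard-library `csv.Sniffer`, which performs a similar statistical analysis. This is an illustration of the documented behavior, not the Ingestion Service's actual code; the sample string and fallback are assumptions.

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Guess the delimiter from a text sample, in the spirit of the
    platform's statistical analysis of the first 100 lines.
    csv.Sniffer is an illustrative stand-in, not the real implementation."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=",\t|;")
        return dialect.delimiter
    except csv.Error:
        return ","  # mirrors the documented fallback

sample = "region;quarter;revenue\nEMEA;Q1;1200\nAPAC;Q2;900\n"
print(detect_delimiter(sample))  # ;
```

When no candidate delimiter appears consistently, the sketch falls back to the comma, matching the fallback column in the table above.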
Configuration Options
If automatic detection produces incorrect results, you can override detection via the schema update endpoint.
Delimiter
| Delimiter | Character | Common File Extensions |
|---|---|---|
| Comma | , | .csv |
| Tab | \t | .tsv, .txt |
| Pipe | \| | .psv, .txt |
| Semicolon | ; | .csv (European locales) |
Encoding
| Encoding | Use Case |
|---|---|
| UTF-8 | Default for modern systems, supports all Unicode characters |
| UTF-8-BOM | UTF-8 with Byte Order Mark (common in Excel-generated CSVs) |
| Latin-1 (ISO-8859-1) | Western European characters |
| Windows-1252 | Windows default encoding, superset of Latin-1 |
| UTF-16 | Used by some legacy systems |
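A BOM-first detection chain like the one described above can be sketched as follows. The ordering of the non-BOM candidates is an assumption based on the detection table; it is not the platform's actual code.

```python
def detect_encoding(raw: bytes) -> str:
    """Sketch of BOM detection followed by charset trial-decoding.
    Candidate order is an assumption, not the production logic."""
    boms = {
        b"\xef\xbb\xbf": "utf-8-sig",   # UTF-8 with BOM
        b"\xff\xfe": "utf-16-le",
        b"\xfe\xff": "utf-16-be",
    }
    for bom, name in boms.items():
        if raw.startswith(bom):
            return name
    # Try strict decodes; cp1252 before latin-1 since latin-1 never fails
    for candidate in ("utf-8", "cp1252", "latin-1"):
        try:
            raw.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue
    return "utf-8"  # documented default

print(detect_encoding("café".encode("cp1252")))  # cp1252
```

Note that Latin-1 accepts every byte value, so it can only ever be a last resort in a chain like this.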
Header Options
| Option | Behavior |
|---|---|
| Auto-detect (default) | First row is used as header if it appears to contain column names |
| First row is header | Force the first row to be treated as column names |
| No header | Columns are named column_1, column_2, etc. |
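The auto-detect heuristic can be approximated as: the first row looks like a header when some of its cells are non-numeric while the corresponding data cells are numeric. A minimal sketch of that rule (an illustration, not the platform's implementation):

```python
def looks_like_header(first_row: list[str], data_row: list[str]) -> bool:
    """Return True when the first row appears to contain column names:
    at least one cell is non-numeric where the data cell is numeric."""
    def is_numeric(value: str) -> bool:
        try:
            float(value)
            return True
        except ValueError:
            return False

    return any(
        not is_numeric(h) and is_numeric(d)
        for h, d in zip(first_row, data_row)
    )

print(looks_like_header(["region", "revenue"], ["EMEA", "1200"]))  # True
```

A file whose first row is indistinguishable from data (e.g., all-text columns) would fall through to the documented fallback of assuming a header.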
Type Inference Rules
The type inference engine samples the first 1,000 rows and applies the following rules in order.
| Priority | Type | Detection Rule |
|---|---|---|
| 1 | BOOLEAN | All values are true/false, yes/no, 1/0, t/f (case-insensitive) |
| 2 | INTEGER | All values match ^-?\d+$ and fit within 64-bit range |
| 3 | DECIMAL | All values match ^-?\d+\.\d+$ |
| 4 | DATE | All values match ISO 8601 date (YYYY-MM-DD) or common date formats (MM/DD/YYYY, DD-MM-YYYY) |
| 5 | TIMESTAMP | All values match ISO 8601 datetime or common timestamp formats |
| 6 | STRING | Default fallback for all other values |
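The priority order in the table can be expressed as a simple first-match chain. The sketch below uses the regexes from the table directly; for brevity it checks only the ISO 8601 date form and omits TIMESTAMP, and it is an illustration of the rules rather than the production inference engine.

```python
import re

BOOLEANS = {"true", "false", "yes", "no", "1", "0", "t", "f"}

def infer_type(values: list[str]) -> str:
    """Apply the documented priority order to a column sample.
    Empty cells are treated as nulls and ignored."""
    cells = [v for v in values if v != ""]
    if not cells:
        return "STRING"
    if all(v.lower() in BOOLEANS for v in cells):
        return "BOOLEAN"
    if all(re.fullmatch(r"-?\d+", v) for v in cells):
        return "INTEGER"
    if all(re.fullmatch(r"-?\d+\.\d+", v) for v in cells):
        return "DECIMAL"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in cells):
        return "DATE"
    return "STRING"

print(infer_type(["1", "2", "-3"]))  # INTEGER
```

Because BOOLEAN has the highest priority, a column containing only 1s and 0s infers as BOOLEAN rather than INTEGER; a single value outside that set (such as 2) pushes it down the chain.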
If a column contains mixed types (e.g., some integers and some strings), the type is widened to STRING. Null values (empty cells) are allowed in all types.
Overriding Inferred Types
Use the column mappings in the schema update step to change inferred types.
```json
{
  "targetTableName": "sales_data",
  "columnMappings": {
    "zip_code": "zip_code",
    "phone_number": "phone"
  }
}
```

If a column like zip_code is inferred as INTEGER (because values like 01234 parse as numbers), the preview will show the issue. You can exclude the column and re-add it with the correct type, or accept the inferred type and apply a transformation after import.
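The ZIP-code pitfall is easy to demonstrate: once a value like 01234 is parsed as an integer, the leading zero cannot be recovered.

```python
# Why INTEGER inference corrupts ZIP codes: leading zeros are lost on parse.
raw_zip = "01234"
as_int = int(raw_zip)

print(as_int)                   # 1234 -- the leading zero is gone
print(str(as_int) == raw_zip)   # False -- round-tripping does not restore it
```

This is why such columns should be kept as STRING even though every value happens to parse as a number.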
Handling Large CSV Files
| File Size | Behavior |
|---|---|
| Under 50 MB | Full preview with all rows counted |
| 50–200 MB | Preview uses first 10,000 rows; total row count is estimated |
| 200–500 MB | Preview uses first 5,000 rows; import runs as async background job |
| Over 500 MB | Use the Cloud Storage Connector instead of file import |
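The size tiers above amount to a simple threshold lookup. The sketch below encodes them for reference; the function name and return strings are illustrative, not part of any API.

```python
MB = 1024 * 1024

def preview_strategy(size_bytes: int) -> str:
    """Map a file size to the documented preview tier.
    Thresholds come from the table above; wording is illustrative."""
    if size_bytes < 50 * MB:
        return "full preview, exact row count"
    if size_bytes < 200 * MB:
        return "first 10,000 rows, estimated row count"
    if size_bytes < 500 * MB:
        return "first 5,000 rows, async background import"
    return "use the Cloud Storage Connector"

print(preview_strategy(int(2.3 * MB)))  # full preview, exact row count
```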
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| Garbled characters in preview | Wrong encoding detected | Check source system encoding; manually specify encoding in column mapping step |
| Wrong number of columns | Delimiter mismatch | Verify the file uses the expected delimiter; check for unquoted delimiters in data values |
| All columns detected as STRING | Mixed types in data | Check for header-like values in data rows, or non-standard number formats (e.g., 1,234.56 with comma thousands separator) |
| Empty rows at end of file | Trailing newlines | These are automatically filtered during import |
| Date columns detected as STRING | Non-standard date format | Common for formats like Jan 15, 2024 or 15/01/2024. Import as STRING and transform post-ingestion. |
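For the mixed-types issue caused by comma thousands separators, a small pre-processing step can make such values match the DECIMAL pattern again. This assumes US-style formatting where the comma is only a grouping character and the period is the decimal point; European-style values (1.234,56) would need the opposite treatment.

```python
def normalize_number(value: str) -> str:
    """Strip US-style comma thousands separators so values like
    '1,234.56' match the documented DECIMAL regex. Assumes '.' is
    the decimal point; not suitable for European-formatted numbers."""
    return value.replace(",", "")

print(normalize_number("1,234.56"))  # 1234.56
```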
Example: Full CSV Import Workflow
1. Upload: POST /api/v1/files/upload
File: quarterly_revenue.csv (2.3 MB, UTF-8, comma-delimited)
2. Preview: GET /api/v1/files/{fileId}/preview
Result: 4 columns detected
- region (STRING)
- quarter (STRING)
- revenue (DECIMAL)
- headcount (INTEGER)
Total rows: 1,200
3. Configure: PUT /api/v1/files/{fileId}/schema
```json
{
  "targetTableName": "quarterly_revenue",
  "targetSchema": "finance",
  "columnMappings": {
    "region": "region",
    "quarter": "fiscal_quarter",
    "revenue": "revenue_usd",
    "headcount": "employee_count"
  }
}
```
4. Import: POST /api/v1/files/{fileId}/import
Result: 1,200 records imported into iceberg.finance.quarterly_revenue
5. Query:
```sql
SELECT region, fiscal_quarter, revenue_usd
FROM iceberg.finance.quarterly_revenue
WHERE region = 'EMEA'
ORDER BY fiscal_quarter;
```