MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
CSV Import

CSV (Comma-Separated Values) is the most common format for file import. The MATIH platform auto-detects delimiters, encoding, headers, and data types, allowing most CSV files to be imported without manual configuration.


Automatic Detection

When a CSV file is uploaded, the Ingestion Service performs the following automatic detection steps.

| Detection | Method | Fallback |
| --- | --- | --- |
| Delimiter | Statistical analysis of the first 100 lines for `,`, `\t`, `\|`, `;` | |
| Encoding | BOM detection, then charset detection (UTF-8, Latin-1, Windows-1252) | Defaults to UTF-8 |
| Header row | Checks whether the first row contains non-numeric values distinct from the data rows | Assumes first row is header |
| Data types | Samples the first 1,000 rows to infer INTEGER, DECIMAL, DATE, TIMESTAMP, BOOLEAN, or STRING | Defaults to STRING |
| Quote character | Detects `"` or `'` based on presence in the first 100 lines | Defaults to `"` |
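The delimiter step can be pictured as a frequency analysis over the sampled lines: a real delimiter appears on every line, ideally the same number of times per line. The following is an illustrative sketch of that idea, not the Ingestion Service's actual implementation:

```python
CANDIDATES = [",", "\t", "|", ";"]

def detect_delimiter(sample_lines):
    """Pick the candidate delimiter that appears on every sampled line,
    preferring one with a consistent per-line count."""
    scores = {}
    for cand in CANDIDATES:
        counts = [line.count(cand) for line in sample_lines]
        if min(counts) == 0:
            continue  # absent from at least one line: not the delimiter
        consistent = 1 if len(set(counts)) == 1 else 0
        scores[cand] = (consistent, sum(counts))
    if not scores:
        return ","  # nothing matched: fall back to comma
    return max(scores, key=lambda c: scores[c])

lines = ["region,quarter,revenue", "EMEA,Q1,1200.50", "APAC,Q2,980.00"]
print(detect_delimiter(lines))  # ","
```

A statistical approach like this is why unquoted delimiters inside data values (see Common Issues below) can throw detection off: they make the per-line counts inconsistent.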

Configuration Options

If automatic detection produces incorrect results, you can override detection via the schema update endpoint.

Delimiter

| Delimiter | Character | Common File Extensions |
| --- | --- | --- |
| Comma | `,` | `.csv` |
| Tab | `\t` | `.tsv`, `.txt` |
| Pipe | `\|` | `.psv`, `.txt` |
| Semicolon | `;` | `.csv` (European locales) |

Encoding

| Encoding | Use Case |
| --- | --- |
| UTF-8 | Default for modern systems; supports all Unicode characters |
| UTF-8-BOM | UTF-8 with a Byte Order Mark (common in Excel-generated CSVs) |
| Latin-1 (ISO-8859-1) | Western European characters |
| Windows-1252 | Windows default encoding; a superset of Latin-1 |
| UTF-16 | Used by some legacy systems |
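BOM detection, the first step in the encoding pipeline, is a simple prefix check on the raw bytes. A minimal sketch (the actual service additionally falls back to charset detection when no BOM is present):

```python
# Byte Order Mark signatures, checked against the start of the file.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),  # UTF-8 with BOM (Excel exports)
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_bom(raw: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for signature, encoding in BOMS:
        if raw.startswith(signature):
            return encoding
    return None

print(detect_bom(b"\xef\xbb\xbfregion,quarter"))  # utf-8-sig
```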

Header Options

| Option | Behavior |
| --- | --- |
| Auto-detect (default) | The first row is used as the header if it appears to contain column names |
| First row is header | Forces the first row to be treated as column names |
| No header | Columns are named `column_1`, `column_2`, etc. |
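The auto-detect heuristic (first row non-numeric and distinct from the data rows) can be approximated as follows. This is an illustrative simplification, not the platform's code:

```python
def looks_numeric(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False

def first_row_is_header(rows):
    """Treat row 0 as a header when none of its cells are numeric
    but at least one cell in the remaining rows is."""
    header, data = rows[0], rows[1:]
    header_numeric = any(looks_numeric(cell) for cell in header)
    data_numeric = any(looks_numeric(cell) for row in data for cell in row)
    return not header_numeric and data_numeric

rows = [["region", "revenue"], ["EMEA", "1200.50"], ["APAC", "980.00"]]
print(first_row_is_header(rows))  # True
```

When every column is textual, a heuristic like this cannot distinguish a header from data, which is why the "First row is header" and "No header" overrides exist.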

Type Inference Rules

The type inference engine samples the first 1,000 rows and applies the following rules in order.

| Priority | Type | Detection Rule |
| --- | --- | --- |
| 1 | BOOLEAN | All values are `true`/`false`, `yes`/`no`, `1`/`0`, or `t`/`f` (case-insensitive) |
| 2 | INTEGER | All values match `^-?\d+$` and fit within the 64-bit range |
| 3 | DECIMAL | All values match `^-?\d+\.\d+$` |
| 4 | DATE | All values match an ISO 8601 date (`YYYY-MM-DD`) or a common date format (`MM/DD/YYYY`, `DD-MM-YYYY`) |
| 5 | TIMESTAMP | All values match an ISO 8601 datetime or a common timestamp format |
| 6 | STRING | Default fallback for all other values |

If a column contains mixed types (e.g., some integers and some strings), the type is widened to STRING. Null values (empty cells) are allowed in all types.
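The priority ordering and the widening behavior can be sketched as below. The sketch simplifies the DATE rule to the ISO form only and omits TIMESTAMP; the real engine supports the additional formats listed in the table:

```python
import re

def infer_type(values):
    """Apply the priority rules in order; the first rule matching every
    non-empty value wins. Empty cells (nulls) are ignored."""
    non_null = [v for v in values if v != ""]
    if not non_null:
        return "STRING"
    checks = [
        ("BOOLEAN", lambda v: v.lower() in
            {"true", "false", "yes", "no", "1", "0", "t", "f"}),
        ("INTEGER", lambda v: re.fullmatch(r"-?\d+", v)
            and -2**63 <= int(v) < 2**63),
        ("DECIMAL", lambda v: re.fullmatch(r"-?\d+\.\d+", v)),
        ("DATE",    lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
    ]
    for type_name, rule in checks:
        if all(rule(v) for v in non_null):
            return type_name
    return "STRING"  # mixed or unmatched values widen to STRING

print(infer_type(["1200.50", "980.00", ""]))  # DECIMAL
print(infer_type(["42", "oops"]))             # STRING
```

Note that BOOLEAN outranks INTEGER, so a column containing only `1` and `0` is inferred as BOOLEAN; override it in the schema update step if it is actually a count or flag column you want numeric.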

Overriding Inferred Types

Use the column mappings in the schema update step to change inferred types.

```json
{
  "targetTableName": "sales_data",
  "columnMappings": {
    "zip_code": "zip_code",
    "phone_number": "phone"
  }
}
```

If a column like zip_code is inferred as INTEGER (values like 01234 parse as numbers, silently dropping the leading zero), the preview will show the issue. You can exclude the column and re-add it with the correct type, or accept the inferred type and apply a transformation after import.
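The leading-zero pitfall is easy to reproduce: once a value like 01234 has been parsed as an integer, the zero cannot be recovered, which is why ZIP-like identifier columns should stay STRING:

```python
raw = ["01234", "94105", "00501"]

# Inferred as INTEGER: leading zeros are silently dropped.
as_int = [int(v) for v in raw]
print(as_int)  # [1234, 94105, 501]

# Kept as STRING: the values survive intact.
print(raw)     # ['01234', '94105', '00501']
```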


Handling Large CSV Files

| File Size | Behavior |
| --- | --- |
| Under 50 MB | Full preview with all rows counted |
| 50–200 MB | Preview uses the first 10,000 rows; the total row count is estimated |
| 200–500 MB | Preview uses the first 5,000 rows; the import runs as an async background job |
| Over 500 MB | Use the Cloud Storage Connector instead of file import |
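The size tiers exist because the preview only needs a prefix of the file: rows are streamed and reading stops at the cap rather than loading the whole file into memory. A minimal sketch of that pattern (illustrative, not the platform's implementation):

```python
import csv
import io

def preview(stream, max_rows=10_000):
    """Read the header plus at most max_rows data rows from a CSV stream,
    without consuming the rest of the file."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = []
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        rows.append(row)
    return header, rows

data = io.StringIO("region,revenue\nEMEA,1200.50\nAPAC,980.00\n")
header, rows = preview(data, max_rows=1)
print(header, rows)  # ['region', 'revenue'] [['EMEA', '1200.50']]
```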

Common Issues

| Issue | Cause | Resolution |
| --- | --- | --- |
| Garbled characters in preview | Wrong encoding detected | Check the source system's encoding; manually specify the encoding in the column mapping step |
| Wrong number of columns | Delimiter mismatch | Verify the file uses the expected delimiter; check for unquoted delimiters in data values |
| All columns detected as STRING | Mixed types in data | Check for header-like values in data rows, or non-standard number formats (e.g., `1,234.56` with a comma thousands separator) |
| Empty rows at end of file | Trailing newlines | Automatically filtered during import |
| Date columns detected as STRING | Non-standard date format | Common for formats like `Jan 15, 2024` or `15/01/2024`; import as STRING and transform post-ingestion |
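The "garbled characters" symptom in the first row is classic mojibake: UTF-8 bytes decoded with the wrong charset turn each accented character into two. A quick demonstration:

```python
text = "café"
utf8_bytes = text.encode("utf-8")    # b'caf\xc3\xa9'

# Decoding UTF-8 bytes as Latin-1 garbles the accented character.
print(utf8_bytes.decode("latin-1"))  # cafÃ©

# Decoding with the correct charset recovers the original text.
print(utf8_bytes.decode("utf-8"))    # café
```

If the preview shows sequences like `Ã©` where accented characters should be, the file is almost certainly UTF-8 being read as Latin-1 or Windows-1252.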

Example: Full CSV Import Workflow

1. Upload: POST /api/v1/files/upload
   File: quarterly_revenue.csv (2.3 MB, UTF-8, comma-delimited)

2. Preview: GET /api/v1/files/{fileId}/preview
   Result: 4 columns detected
   - region (STRING)
   - quarter (STRING)
   - revenue (DECIMAL)
   - headcount (INTEGER)
   Total rows: 1,200

3. Configure: PUT /api/v1/files/{fileId}/schema
   {
     "targetTableName": "quarterly_revenue",
     "targetSchema": "finance",
     "columnMappings": {
       "region": "region",
       "quarter": "fiscal_quarter",
       "revenue": "revenue_usd",
       "headcount": "employee_count"
     }
   }

4. Import: POST /api/v1/files/{fileId}/import
   Result: 1,200 records imported into iceberg.finance.quarterly_revenue

5. Query:
   SELECT region, fiscal_quarter, revenue_usd
   FROM iceberg.finance.quarterly_revenue
   WHERE region = 'EMEA'
   ORDER BY fiscal_quarter;