MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
CSV Import

CSV (Comma-Separated Values) is the most common format for file import. The MATIH platform auto-detects delimiters, encoding, headers, and data types, allowing most CSV files to be imported without manual configuration.


Automatic Detection

When a CSV file is uploaded, the Ingestion Service performs the following automatic detection steps.

| Detection | Method | Fallback |
| --- | --- | --- |
| Delimiter | Statistical analysis of the first 100 lines for `,`, `\t`, `\|`, `;` | |
| Encoding | BOM detection, then charset detection (UTF-8, Latin-1, Windows-1252) | Defaults to UTF-8 |
| Header row | Checks whether the first row contains non-numeric values distinct from the data rows | Assumes first row is header |
| Data types | Samples the first 1,000 rows to infer INTEGER, DECIMAL, DATE, TIMESTAMP, BOOLEAN, or STRING | Defaults to STRING |
| Quote character | Detects `"` or `'` based on presence in the first 100 lines | Defaults to `"` |
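The delimiter step can be pictured as a frequency analysis over the sampled lines: a real delimiter appears on every line, ideally the same number of times per line. The following is an illustrative sketch of that idea, not the Ingestion Service's actual implementation:

```python
CANDIDATES = [",", "\t", "|", ";"]

def detect_delimiter(sample_lines):
    """Pick the candidate delimiter that appears on every sampled line,
    preferring one with a consistent per-line count."""
    scores = {}
    for cand in CANDIDATES:
        counts = [line.count(cand) for line in sample_lines]
        if min(counts) == 0:
            continue  # absent from at least one line: not the delimiter
        consistent = 1 if len(set(counts)) == 1 else 0
        scores[cand] = (consistent, sum(counts))
    if not scores:
        return ","  # nothing matched: fall back to comma
    return max(scores, key=lambda c: scores[c])

lines = ["region,quarter,revenue", "EMEA,Q1,1200.50", "APAC,Q2,980.00"]
print(detect_delimiter(lines))  # ","
```

A statistical approach like this is why unquoted delimiters inside data values (see Common Issues below) can throw detection off: they make the per-line counts inconsistent.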

Configuration Options

If automatic detection produces incorrect results, you can override detection via the schema update endpoint.

Delimiter

| Delimiter | Character | Common File Extensions |
| --- | --- | --- |
| Comma | `,` | `.csv` |
| Tab | `\t` | `.tsv`, `.txt` |
| Pipe | `\|` | `.psv`, `.txt` |
| Semicolon | `;` | `.csv` (European locales) |

Encoding

| Encoding | Use Case |
| --- | --- |
| UTF-8 | Default for modern systems; supports all Unicode characters |
| UTF-8-BOM | UTF-8 with a Byte Order Mark (common in Excel-generated CSVs) |
| Latin-1 (ISO-8859-1) | Western European characters |
| Windows-1252 | Windows default encoding; a superset of Latin-1 |
| UTF-16 | Used by some legacy systems |
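BOM detection, the first step in the encoding pipeline, is a simple prefix check on the raw bytes. A minimal sketch (the actual service additionally falls back to charset detection when no BOM is present):

```python
# Byte Order Mark signatures, checked against the start of the file.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),  # UTF-8 with BOM (Excel exports)
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_bom(raw: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for signature, encoding in BOMS:
        if raw.startswith(signature):
            return encoding
    return None

print(detect_bom(b"\xef\xbb\xbfregion,quarter"))  # utf-8-sig
```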

Header Options

| Option | Behavior |
| --- | --- |
| Auto-detect (default) | The first row is used as the header if it appears to contain column names |
| First row is header | Forces the first row to be treated as column names |
| No header | Columns are named `column_1`, `column_2`, etc. |
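The auto-detect heuristic (first row non-numeric and distinct from the data rows) can be approximated as follows. This is an illustrative simplification, not the platform's code:

```python
def looks_numeric(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False

def first_row_is_header(rows):
    """Treat row 0 as a header when none of its cells are numeric
    but at least one cell in the remaining rows is."""
    header, data = rows[0], rows[1:]
    header_numeric = any(looks_numeric(cell) for cell in header)
    data_numeric = any(looks_numeric(cell) for row in data for cell in row)
    return not header_numeric and data_numeric

rows = [["region", "revenue"], ["EMEA", "1200.50"], ["APAC", "980.00"]]
print(first_row_is_header(rows))  # True
```

When every column is textual, a heuristic like this cannot distinguish a header from data, which is why the "First row is header" and "No header" overrides exist.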

Type Inference Rules

The type inference engine samples the first 1,000 rows and applies the following rules in order.

| Priority | Type | Detection Rule |
| --- | --- | --- |
| 1 | BOOLEAN | All values are `true`/`false`, `yes`/`no`, `1`/`0`, or `t`/`f` (case-insensitive) |
| 2 | INTEGER | All values match `^-?\d+$` and fit within the 64-bit range |
| 3 | DECIMAL | All values match `^-?\d+\.\d+$` |
| 4 | DATE | All values match an ISO 8601 date (`YYYY-MM-DD`) or a common date format (`MM/DD/YYYY`, `DD-MM-YYYY`) |
| 5 | TIMESTAMP | All values match an ISO 8601 datetime or a common timestamp format |
| 6 | STRING | Default fallback for all other values |

If a column contains mixed types (e.g., some integers and some strings), the type is widened to STRING. Null values (empty cells) are allowed in all types.
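The priority ordering and the widening behavior can be sketched as below. The sketch simplifies the DATE rule to the ISO form only and omits TIMESTAMP; the real engine supports the additional formats listed in the table:

```python
import re

def infer_type(values):
    """Apply the priority rules in order; the first rule matching every
    non-empty value wins. Empty cells (nulls) are ignored."""
    non_null = [v for v in values if v != ""]
    if not non_null:
        return "STRING"
    checks = [
        ("BOOLEAN", lambda v: v.lower() in
            {"true", "false", "yes", "no", "1", "0", "t", "f"}),
        ("INTEGER", lambda v: re.fullmatch(r"-?\d+", v)
            and -2**63 <= int(v) < 2**63),
        ("DECIMAL", lambda v: re.fullmatch(r"-?\d+\.\d+", v)),
        ("DATE",    lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v)),
    ]
    for type_name, rule in checks:
        if all(rule(v) for v in non_null):
            return type_name
    return "STRING"  # mixed or unmatched values widen to STRING

print(infer_type(["1200.50", "980.00", ""]))  # DECIMAL
print(infer_type(["42", "oops"]))             # STRING
```

Note that BOOLEAN outranks INTEGER, so a column containing only `1` and `0` is inferred as BOOLEAN; override it in the schema update step if it is actually a count or flag column you want numeric.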

Overriding Inferred Types

Use the column mappings in the schema update step to change inferred types.

```json
{
  "targetTableName": "sales_data",
  "columnMappings": {
    "zip_code": "zip_code",
    "phone_number": "phone"
  }
}
```

If a column like zip_code is inferred as INTEGER (values like 01234 parse as numbers, silently dropping the leading zero), the preview will show the issue. You can exclude the column and re-add it with the correct type, or accept the inferred type and apply a transformation after import.
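The leading-zero pitfall is easy to reproduce: once a value like 01234 has been parsed as an integer, the zero cannot be recovered, which is why ZIP-like identifier columns should stay STRING:

```python
raw = ["01234", "94105", "00501"]

# Inferred as INTEGER: leading zeros are silently dropped.
as_int = [int(v) for v in raw]
print(as_int)  # [1234, 94105, 501]

# Kept as STRING: the values survive intact.
print(raw)     # ['01234', '94105', '00501']
```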


Handling Large CSV Files

| File Size | Behavior |
| --- | --- |
| Under 50 MB | Full preview with all rows counted |
| 50–200 MB | Preview uses the first 10,000 rows; the total row count is estimated |
| 200–500 MB | Preview uses the first 5,000 rows; the import runs as an async background job |
| Over 500 MB | Use the Cloud Storage Connector instead of file import |
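The size tiers exist because the preview only needs a prefix of the file: rows are streamed and reading stops at the cap rather than loading the whole file into memory. A minimal sketch of that pattern (illustrative, not the platform's implementation):

```python
import csv
import io

def preview(stream, max_rows=10_000):
    """Read the header plus at most max_rows data rows from a CSV stream,
    without consuming the rest of the file."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = []
    for i, row in enumerate(reader):
        if i >= max_rows:
            break
        rows.append(row)
    return header, rows

data = io.StringIO("region,revenue\nEMEA,1200.50\nAPAC,980.00\n")
header, rows = preview(data, max_rows=1)
print(header, rows)  # ['region', 'revenue'] [['EMEA', '1200.50']]
```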

Common Issues

| Issue | Cause | Resolution |
| --- | --- | --- |
| Garbled characters in preview | Wrong encoding detected | Check the source system's encoding; manually specify the encoding in the column mapping step |
| Wrong number of columns | Delimiter mismatch | Verify the file uses the expected delimiter; check for unquoted delimiters in data values |
| All columns detected as STRING | Mixed types in data | Check for header-like values in data rows, or non-standard number formats (e.g., `1,234.56` with a comma thousands separator) |
| Empty rows at end of file | Trailing newlines | Automatically filtered during import |
| Date columns detected as STRING | Non-standard date format | Common for formats like `Jan 15, 2024` or `15/01/2024`; import as STRING and transform post-ingestion |
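The "garbled characters" symptom in the first row is classic mojibake: UTF-8 bytes decoded with the wrong charset turn each accented character into two. A quick demonstration:

```python
text = "café"
utf8_bytes = text.encode("utf-8")    # b'caf\xc3\xa9'

# Decoding UTF-8 bytes as Latin-1 garbles the accented character.
print(utf8_bytes.decode("latin-1"))  # cafÃ©

# Decoding with the correct charset recovers the original text.
print(utf8_bytes.decode("utf-8"))    # café
```

If the preview shows sequences like `Ã©` where accented characters should be, the file is almost certainly UTF-8 being read as Latin-1 or Windows-1252.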

Example: Full CSV Import Workflow

1. Upload: POST /api/v1/files/upload
   File: quarterly_revenue.csv (2.3 MB, UTF-8, comma-delimited)

2. Preview: GET /api/v1/files/{fileId}/preview
   Result: 4 columns detected
   - region (STRING)
   - quarter (STRING)
   - revenue (DECIMAL)
   - headcount (INTEGER)
   Total rows: 1,200

3. Configure: PUT /api/v1/files/{fileId}/schema
   {
     "targetTableName": "quarterly_revenue",
     "targetSchema": "finance",
     "columnMappings": {
       "region": "region",
       "quarter": "fiscal_quarter",
       "revenue": "revenue_usd",
       "headcount": "employee_count"
     }
   }

4. Import: POST /api/v1/files/{fileId}/import
   Result: 1,200 records imported into iceberg.finance.quarterly_revenue

5. Query:
   SELECT region, fiscal_quarter, revenue_usd
   FROM iceberg.finance.quarterly_revenue
   WHERE region = 'EMEA'
   ORDER BY fiscal_quarter;