MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
File Import
Excel Import

Excel Import

The Matih platform supports importing data from Microsoft Excel files (.xlsx and .xls). Excel import handles multi-sheet workbooks, mixed data types within columns, merged cells, and formula evaluation.


Supported Formats

FormatExtensionMax File SizeNotes
Excel 2007+.xlsx100 MBOpenXML format. Recommended.
Excel 97-2003.xls50 MBLegacy binary format.

Multi-Sheet Support

When an Excel file contains multiple sheets, each sheet is treated as a separate stream during preview. You can select which sheets to import.

Preview Response

{
  "fileId": "a1b2c3d4-...",
  "fileName": "financial_report.xlsx",
  "columns": [
    { "name": "region", "type": "STRING", "nullable": false, "sampleValues": ["North", "South", "East"] },
    { "name": "q1_revenue", "type": "DECIMAL", "nullable": false, "sampleValues": ["1250000.00", "980000.00"] },
    { "name": "q2_revenue", "type": "DECIMAL", "nullable": true, "sampleValues": ["1310000.00", "1020000.00"] }
  ],
  "previewRows": [
    { "region": "North", "q1_revenue": 1250000.00, "q2_revenue": 1310000.00 },
    { "region": "South", "q1_revenue": 980000.00, "q2_revenue": 1020000.00 }
  ],
  "totalRows": 48,
  "inferredTypes": {
    "region": "STRING",
    "q1_revenue": "DECIMAL",
    "q2_revenue": "DECIMAL"
  }
}

For multi-sheet workbooks, each sheet can be imported as a separate table by uploading once and requesting the preview for each sheet.


Data Type Handling

Excel cells carry explicit type information. The import engine maps Excel types to Iceberg types.

Excel Cell TypeIceberg TypeNotes
TextSTRINGDirect mapping
Number (integer)LONGWhole numbers without decimal places
Number (decimal)DOUBLENumbers with decimal places
DateDATEExcel serial dates are converted to ISO 8601
DateTimeTIMESTAMPTime-zone naive timestamps
BooleanBOOLEANTRUE/FALSE cell values
FormulaEvaluated typeFormulas are evaluated; the result type is used
Error (#N/A, #REF!)NULLError cells are imported as null values
EmptyNULLEmpty cells are imported as null values

Date Handling

Excel stores dates as serial numbers (days since January 1, 1900). The import engine detects date-formatted cells and converts them to ISO 8601 date strings. This handles:

  • Date-only values: 2024-03-15
  • DateTime values: 2024-03-15T14:30:00
  • Custom date formats: The underlying serial number is converted regardless of the display format

Mixed-Type Columns

Excel columns can contain mixed types (e.g., some cells are numbers, others are text). The import engine handles this by:

  1. Sampling all non-empty cells in the column
  2. If more than 90% of values are the same type, using that type (with exceptions converted to null or string)
  3. If types are mixed, widening to STRING

Header Row Detection

The import engine detects the header row using these heuristics:

HeuristicWeightDescription
First row is all textHighAll cells in row 1 contain text values while subsequent rows contain mixed types
First row values are uniqueMediumHeader values are typically distinct column names
First row has no empty cellsLowHeaders usually do not contain blanks
Formatting differsMediumBold, different font size, or background color on first row

If the header is not in row 1 (e.g., the file has title rows or metadata rows above the data), the preview will show incorrect results. In this case, you can specify the header row offset during the schema configuration step.


Merged Cells

Merged cells in Excel are handled as follows:

Merge DirectionBehavior
Horizontal merge (across columns)The value is placed in the leftmost column; other columns in the merge receive null
Vertical merge (across rows)The value is placed in the topmost row; other rows in the merge receive the same value (forward fill)

Formulas

Excel formulas are evaluated during import. The result value is used, not the formula text.

Formula TypeHandling
Simple formulas (=A1+B1)Evaluated, result imported
Cross-sheet references (=Sheet2!A1)Evaluated if the referenced sheet is present
External references (=[other.xlsx]Sheet1!A1)Evaluate to #REF! error, imported as null
Array formulasEvaluated, results imported
Volatile functions (=NOW(), =RAND())Evaluated at import time

Common Issues

IssueCauseResolution
Dates appear as numbers (e.g., 45371)Date column not detected as date-formattedThe column was stored as a plain number in Excel without date formatting. Import as INTEGER and convert post-ingestion using DATE_ADD('1899-12-30', INTERVAL value DAY).
Empty columns appear in previewHidden columns or trailing empty cellsExclude empty columns in the column mapping step
First row data imported as headerHeader detection false positiveSpecify that the file has no header row in the schema configuration
#N/A values in source appear as nullExpected behaviorError cells are intentionally mapped to null. If you need to preserve error indicators, export the Excel file as CSV first.
Large file upload timeoutFile exceeds 100 MBSplit the workbook into smaller files, or export individual sheets as CSV/Parquet
Number precision lossExcel stores all numbers as IEEE 754 doublesVery large integers (>2^53) may lose precision. For financial data, verify decimal precision in the preview.

Example: Multi-Sheet Excel Import

1. Upload: POST /api/v1/files/upload
   File: annual_report_2024.xlsx (8.5 MB, 3 sheets)

2. Preview Sheet 1: GET /api/v1/files/{fileId}/preview
   Sheet: "Revenue"
   Columns: region (STRING), month (DATE), amount (DECIMAL)
   Total rows: 600

3. Configure: PUT /api/v1/files/{fileId}/schema
   {
     "targetTableName": "revenue_2024",
     "targetSchema": "finance"
   }

4. Import: POST /api/v1/files/{fileId}/import
   Result: 600 records imported into iceberg.finance.revenue_2024

5. Repeat steps 2-4 for remaining sheets (Expenses, Headcount)