MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Parquet Import

Apache Parquet is a columnar storage format optimized for analytical workloads. Importing Parquet files into the MATIH platform preserves the full schema, including nested types, nullability constraints, and column metadata. Parquet is the recommended format for large datasets because it provides efficient compression, predicate pushdown, and column pruning.


Advantages of Parquet Import

| Advantage | Description |
| --- | --- |
| Schema preservation | Parquet files carry an embedded schema with exact column types, eliminating the need for type inference |
| No data loss | Types map directly from Parquet to Iceberg without ambiguity (unlike CSV, where `123` could be an integer or a string) |
| Compression | Parquet files are typically 3-10x smaller than equivalent CSV files, reducing upload time and storage |
| Large file support | Up to 1 GB per file upload, compared to 500 MB for CSV |
| Fast import | Columnar format enables direct column-to-column mapping without row-by-row parsing |

Type Mapping

Parquet types map directly to Iceberg types with full fidelity.

| Parquet Type | Iceberg Type | Notes |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | Direct mapping |
| INT32 | INT | 32-bit integer |
| INT64 | LONG | 64-bit integer |
| FLOAT | FLOAT | 32-bit IEEE 754 |
| DOUBLE | DOUBLE | 64-bit IEEE 754 |
| BINARY (UTF8) | STRING | UTF-8 encoded strings |
| BINARY (raw) | BINARY | Raw byte arrays |
| FIXED_LEN_BYTE_ARRAY | FIXED(n) | Fixed-length binary |
| INT32 (DATE) | DATE | Days since epoch |
| INT64 (TIMESTAMP_MILLIS) | TIMESTAMP | Millisecond precision |
| INT64 (TIMESTAMP_MICROS) | TIMESTAMP | Microsecond precision |
| BYTE_ARRAY (DECIMAL) | DECIMAL(p,s) | Arbitrary-precision decimal |
| INT32/INT64 (DECIMAL) | DECIMAL(p,s) | Integer-encoded decimal |
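As an illustration, the scalar mapping above can be expressed as a lookup keyed on the Parquet physical type plus its optional logical annotation. The helper below is a hypothetical sketch, not the platform's actual implementation:

```python
# Hypothetical sketch of the Parquet -> Iceberg scalar type mapping above.
# The logical annotation (shown in parentheses in the table) takes
# precedence over the bare physical type.
PARQUET_TO_ICEBERG = {
    ("BOOLEAN", None): "BOOLEAN",
    ("INT32", None): "INT",
    ("INT64", None): "LONG",
    ("FLOAT", None): "FLOAT",
    ("DOUBLE", None): "DOUBLE",
    ("BINARY", "UTF8"): "STRING",
    ("BINARY", None): "BINARY",
    ("INT32", "DATE"): "DATE",
    ("INT64", "TIMESTAMP_MILLIS"): "TIMESTAMP",
    ("INT64", "TIMESTAMP_MICROS"): "TIMESTAMP",
}

def iceberg_type(physical, logical=None):
    """Resolve an Iceberg type from a Parquet physical/logical type pair."""
    try:
        return PARQUET_TO_ICEBERG[(physical, logical)]
    except KeyError:
        raise ValueError(f"Unsupported Parquet type: {physical} ({logical})")

print(iceberg_type("INT32"))                      # INT
print(iceberg_type("INT64", "TIMESTAMP_MICROS"))  # TIMESTAMP
```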

Nested Types

| Parquet Logical Type | Iceberg Type | Description |
| --- | --- | --- |
| LIST | LIST<element_type> | Ordered collection of elements |
| MAP | MAP<key_type, value_type> | Key-value pairs |
| GROUP (struct) | STRUCT<field: type, ...> | Named fields with types |

Nested types are preserved during import. You can query nested fields using Trino's nested type syntax:

-- Access struct fields
SELECT address.city, address.zip_code
FROM iceberg.raw_data.customers;
 
-- Unnest arrays
SELECT o.customer_id, item.product_name
FROM iceberg.raw_data.orders AS o
CROSS JOIN UNNEST(o.line_items) AS t(item);
 
-- Access map values
SELECT properties['color'] AS color
FROM iceberg.raw_data.products;

Schema Inference

Unlike CSV and Excel, Parquet files include an embedded schema. The preview step reads this schema directly rather than inferring it from data values.

{
  "fileId": "...",
  "fileName": "transactions.parquet",
  "columns": [
    { "name": "transaction_id", "type": "LONG", "nullable": false },
    { "name": "customer_id", "type": "LONG", "nullable": false },
    { "name": "amount", "type": "DECIMAL(10,2)", "nullable": false },
    { "name": "currency", "type": "STRING", "nullable": false },
    { "name": "timestamp", "type": "TIMESTAMP", "nullable": false },
    { "name": "metadata", "type": "MAP<STRING,STRING>", "nullable": true }
  ],
  "previewRows": [
    {
      "transaction_id": 100001,
      "customer_id": 5001,
      "amount": 150.00,
      "currency": "USD",
      "timestamp": "2024-03-15T14:30:00Z",
      "metadata": { "source": "web", "device": "mobile" }
    }
  ],
  "totalRows": 1250000,
  "inferredTypes": {
    "transaction_id": "LONG",
    "customer_id": "LONG",
    "amount": "DECIMAL(10,2)",
    "currency": "STRING",
    "timestamp": "TIMESTAMP",
    "metadata": "MAP<STRING,STRING>"
  }
}
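Because the preview response is plain JSON, a client can sanity-check it before configuring the import. The following is a minimal sketch using only the standard library; the field names follow the example response above, but the `check_preview` helper and its validation rules are hypothetical:

```python
import json

def check_preview(payload, required_non_nullable):
    """Return a list of problems found in a preview response; empty means OK."""
    preview = json.loads(payload)
    problems = []
    columns = {c["name"]: c for c in preview["columns"]}
    for name in required_non_nullable:
        col = columns.get(name)
        if col is None:
            problems.append(f"missing column: {name}")
        elif col["nullable"]:
            problems.append(f"column is nullable: {name}")
    if preview["totalRows"] == 0:
        problems.append("file contains no rows")
    return problems

# Abbreviated version of the preview response shown above.
sample = json.dumps({
    "columns": [
        {"name": "transaction_id", "type": "LONG", "nullable": False},
        {"name": "metadata", "type": "MAP<STRING,STRING>", "nullable": True},
    ],
    "totalRows": 1250000,
})
print(check_preview(sample, {"transaction_id", "metadata"}))
# ['column is nullable: metadata']
```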

Partitioned Parquet Files

If a Parquet file was written as part of a partitioned dataset (e.g., Hive-style partitioning with year=2024/month=03/data.parquet), the single-file import treats it as a standalone file. Partition columns embedded in the file path are not automatically extracted.

To import partitioned datasets:

  • Single file: Upload individual Parquet files. Each file is imported as-is without partition column extraction.
  • Full partitioned dataset: Use the S3 or GCS cloud storage connector, which handles Hive-style partitioned directories and extracts partition columns automatically.
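For context, Hive-style partition columns are encoded as key=value segments of the file path, which is what the cloud storage connectors extract. A simplified sketch of that extraction (illustrative only, not the connector's actual code):

```python
def partition_columns(path):
    """Extract Hive-style key=value partition segments from an object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_columns("s3://bucket/events/year=2024/month=03/data.parquet"))
# {'year': '2024', 'month': '03'}
```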

Compression

Parquet files may use internal compression at the column chunk level. The import engine supports all standard Parquet compression codecs.

| Codec | Supported | Notes |
| --- | --- | --- |
| UNCOMPRESSED | Yes | No compression |
| SNAPPY | Yes | Fast decompression, moderate compression ratio |
| GZIP | Yes | High compression ratio, slower decompression |
| LZO | Yes | Fast compression and decompression |
| BROTLI | Yes | High compression ratio |
| LZ4 | Yes | Very fast decompression |
| ZSTD | Yes | Good balance of speed and compression ratio |

No configuration is needed. The codec is detected from the file metadata automatically.


Row Group Handling

Parquet files are organized into row groups (typically 128 MB each). The import engine processes row groups sequentially, which means:

  • Preview reads only the first row group to generate sample rows, making preview fast even for very large files
  • Import processes all row groups, writing each batch to the Iceberg table transactionally
  • Memory usage is bounded by a single row group rather than the entire file
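The bounded-memory behavior can be pictured with a generic batching sketch. Real row groups come from the Parquet file footer, not Python lists, so this is purely illustrative:

```python
def row_groups(rows, group_size):
    """Yield one 'row group' at a time instead of loading the whole file."""
    for start in range(0, len(rows), group_size):
        yield rows[start:start + group_size]

def import_file(rows, group_size):
    """Process each row group as one batch; return the number of rows written."""
    written = 0
    for batch in row_groups(rows, group_size):
        # In the real engine each batch would be committed to the Iceberg
        # table transactionally; memory is bounded by a single batch.
        written += len(batch)
    return written

data = [{"id": i} for i in range(10)]
preview = next(row_groups(data, 4))   # preview reads only the first group
print(len(preview))                   # 4
print(import_file(data, group_size=4))  # 10, processed as batches of 4, 4, 2
```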

Common Issues

| Issue | Cause | Resolution |
| --- | --- | --- |
| Unsupported logical type | Parquet file uses a non-standard logical type annotation | Export the file using standard Parquet logical types (convert INT96 timestamps to INT64 TIMESTAMP_MICROS) |
| Nested types appear as JSON strings | Query engine does not support the nesting depth | Flatten nested structures during import using column mappings, or query with UNNEST/struct access syntax |
| Row count mismatch | File contains row groups with different schemas (schema evolution within the file) | Export the file with a consistent schema; the import engine uses the schema from the file footer |
| Slow preview for large files | File has very large row groups (>256 MB) | Expected behavior for files with non-standard row group sizes; preview will still complete |

Example: Parquet Import Workflow

1. Upload: POST /api/v1/files/upload
   File: web_events_2024q1.parquet (450 MB, Snappy compressed, 1.25M rows)

2. Preview: GET /api/v1/files/{fileId}/preview
   Schema read from Parquet metadata (instant, no inference needed)
   Columns: event_id (LONG), user_id (LONG), event_type (STRING),
            timestamp (TIMESTAMP), properties (MAP<STRING,STRING>)
   Total rows: 1,250,000

3. Configure: PUT /api/v1/files/{fileId}/schema
   {
     "targetTableName": "web_events_2024_q1",
     "targetSchema": "analytics"
   }

4. Import: POST /api/v1/files/{fileId}/import
   Result: 1,250,000 records imported into iceberg.analytics.web_events_2024_q1

5. Query:
   SELECT event_type, COUNT(*) as event_count
   FROM iceberg.analytics.web_events_2024_q1
   GROUP BY event_type
   ORDER BY event_count DESC;
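Steps 3 and 4 of the workflow could be scripted against the API. The sketch below uses only the standard library; the base URL, headers, and error handling are assumptions, and only the endpoint paths come from the example above:

```python
import json
import urllib.request

BASE = "https://matih.example.com/api/v1"  # hypothetical host

def schema_config(table, schema):
    """Build the JSON body for PUT /files/{fileId}/schema (step 3)."""
    return json.dumps({"targetTableName": table, "targetSchema": schema}).encode()

def configure_and_import(file_id, table, schema):
    """Steps 3 and 4: configure the target table, then trigger the import."""
    req = urllib.request.Request(
        f"{BASE}/files/{file_id}/schema",
        data=schema_config(table, schema),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)  # raises on HTTP errors
    urllib.request.urlopen(
        urllib.request.Request(f"{BASE}/files/{file_id}/import", method="POST"))

# Example body for step 3:
print(schema_config("web_events_2024_q1", "analytics").decode())
```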