Parquet Import
Apache Parquet is a columnar storage format optimized for analytical workloads. Importing Parquet files into the Matih platform preserves the full schema, including nested types, nullability constraints, and column metadata. Parquet is the recommended format for large datasets because it provides efficient compression, predicate pushdown, and column pruning.
Advantages of Parquet Import
| Advantage | Description |
|---|---|
| Schema preservation | Parquet files carry an embedded schema with exact column types, eliminating the need for type inference |
| No data loss | Types map directly from Parquet to Iceberg without ambiguity (unlike CSV where 123 could be integer or string) |
| Compression | Parquet files are typically 3-10x smaller than equivalent CSV files, reducing upload time and storage |
| Large file support | Up to 1 GB per file upload, compared to 500 MB for CSV |
| Fast import | Columnar format enables direct column-to-column mapping without row-by-row parsing |
Type Mapping
Parquet types map directly to Iceberg types with full fidelity.
| Parquet Type | Iceberg Type | Notes |
|---|---|---|
| BOOLEAN | BOOLEAN | Direct mapping |
| INT32 | INT | 32-bit integer |
| INT64 | LONG | 64-bit integer |
| FLOAT | FLOAT | 32-bit IEEE 754 |
| DOUBLE | DOUBLE | 64-bit IEEE 754 |
| BINARY (UTF8) | STRING | UTF-8 encoded strings |
| BINARY (raw) | BINARY | Raw byte arrays |
| FIXED_LEN_BYTE_ARRAY | FIXED(n) | Fixed-length binary |
| INT32 (DATE) | DATE | Days since epoch |
| INT64 (TIMESTAMP_MILLIS) | TIMESTAMP | Millisecond precision |
| INT64 (TIMESTAMP_MICROS) | TIMESTAMP | Microsecond precision |
| BYTE_ARRAY (DECIMAL) | DECIMAL(p,s) | Arbitrary-precision decimal |
| INT32/INT64 (DECIMAL) | DECIMAL(p,s) | Integer-encoded decimal |
Nested Types
| Parquet Logical Type | Iceberg Type | Description |
|---|---|---|
| LIST | LIST<element_type> | Ordered collection of elements |
| MAP | MAP<key_type, value_type> | Key-value pairs |
| GROUP (struct) | STRUCT<field: type, ...> | Named fields with types |
Nested types are preserved during import. You can query nested fields using Trino's nested type syntax:
```sql
-- Access struct fields
SELECT address.city, address.zip_code
FROM iceberg.raw_data.customers;

-- Unnest arrays
SELECT o.customer_id, item.product_name
FROM iceberg.raw_data.orders AS o
CROSS JOIN UNNEST(o.line_items) AS t(item);

-- Access map values
SELECT properties['color'] AS color
FROM iceberg.raw_data.products;
```
Schema Inference
Unlike CSV and Excel, Parquet files include an embedded schema. The preview step reads this schema directly rather than inferring it from data values.
```json
{
  "fileId": "...",
  "fileName": "transactions.parquet",
  "columns": [
    { "name": "transaction_id", "type": "LONG", "nullable": false },
    { "name": "customer_id", "type": "LONG", "nullable": false },
    { "name": "amount", "type": "DECIMAL(10,2)", "nullable": false },
    { "name": "currency", "type": "STRING", "nullable": false },
    { "name": "timestamp", "type": "TIMESTAMP", "nullable": false },
    { "name": "metadata", "type": "MAP<STRING,STRING>", "nullable": true }
  ],
  "previewRows": [
    {
      "transaction_id": 100001,
      "customer_id": 5001,
      "amount": 150.00,
      "currency": "USD",
      "timestamp": "2024-03-15T14:30:00Z",
      "metadata": { "source": "web", "device": "mobile" }
    }
  ],
  "totalRows": 1250000,
  "inferredTypes": {
    "transaction_id": "LONG",
    "customer_id": "LONG",
    "amount": "DECIMAL(10,2)",
    "currency": "STRING",
    "timestamp": "TIMESTAMP",
    "metadata": "MAP<STRING,STRING>"
  }
}
```
Partitioned Parquet Files
If a Parquet file was written as part of a partitioned dataset (e.g., Hive-style partitioning with year=2024/month=03/data.parquet), the single file import treats it as a standalone file. Partition columns embedded in the file path are not automatically extracted.
To import partitioned datasets:
- Single file: Upload individual Parquet files. Each file is imported as-is without partition column extraction.
- Full partitioned dataset: Use the S3 or GCS cloud storage connector, which handles Hive-style partitioned directories and extracts partition columns automatically.
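For reference, what the cloud storage connector does with Hive-style paths amounts to parsing `key=value` directory segments out of the file path. A minimal sketch (the function name is illustrative):

```python
def extract_partition_columns(path: str) -> dict[str, str]:
    """Pull Hive-style key=value partition segments out of a file path.

    Directory segments like "year=2024" become partition columns;
    the file name itself is ignored.
    """
    partitions = {}
    for segment in path.split("/")[:-1]:  # skip the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions

print(extract_partition_columns("events/year=2024/month=03/data.parquet"))
# → {'year': '2024', 'month': '03'}
```

With single-file upload this step never runs, which is why the `year` and `month` values above would be lost unless they also exist as columns inside the file.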
Compression
Parquet files may use internal compression at the column chunk level. The import engine supports all standard Parquet compression codecs.
| Codec | Supported | Notes |
|---|---|---|
| UNCOMPRESSED | Yes | No compression |
| SNAPPY | Yes | Fast decompression, moderate compression ratio |
| GZIP | Yes | High compression ratio, slower decompression |
| LZO | Yes | Fast compression and decompression |
| BROTLI | Yes | High compression ratio |
| LZ4 | Yes | Very fast decompression |
| ZSTD | Yes | Good balance of speed and compression ratio |
No configuration is needed. The codec is detected from the file metadata automatically.
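Codec detection itself requires parsing the Thrift-encoded footer metadata (which libraries like PyArrow handle), but a cheap sanity check that an upload is a Parquet file at all is possible because every Parquet file starts and ends with the four-byte `PAR1` magic. A sketch of that check (the function name is illustrative):

```python
import io

MAGIC = b"PAR1"  # every Parquet file begins and ends with these four bytes

def looks_like_parquet(f) -> bool:
    """Cheap pre-flight check before handing a file to the import engine."""
    f.seek(0)
    head = f.read(4)
    f.seek(-4, io.SEEK_END)
    tail = f.read(4)
    return head == MAGIC and tail == MAGIC

# A fake minimal "file" with the right framing; real codec detection
# still needs the Thrift footer that sits just before the trailing magic.
buf = io.BytesIO(b"PAR1" + b"\x00" * 16 + b"PAR1")
print(looks_like_parquet(buf))  # → True
```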
Row Group Handling
Parquet files are organized into row groups (typically 128 MB each). The import engine processes row groups sequentially, which means:
- Preview reads only the first row group to generate sample rows, making preview fast even for very large files
- Import processes all row groups, writing each batch to the Iceberg table transactionally
- Memory usage is bounded by a single row group rather than the entire file
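The bounded-memory property comes from streaming one row group at a time rather than materializing the whole file. The control flow is roughly the following (the batch source and writer are stand-ins, not the platform's actual internals):

```python
def preview(row_groups, sample_size=10):
    """Preview reads only the first row group, so it stays fast for large files."""
    first = next(iter(row_groups), [])
    return first[:sample_size]

def import_all(row_groups, write_batch):
    """Import streams every row group; peak memory is one group, not the file."""
    total = 0
    for group in row_groups:
        write_batch(group)  # stand-in for the transactional Iceberg append
        total += len(group)
    return total
```

Each `write_batch` call corresponds to one committed batch, which is why a failure mid-import leaves only whole row groups written.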
Common Issues
| Issue | Cause | Resolution |
|---|---|---|
| Unsupported logical type | Parquet file uses a non-standard logical type annotation | Export the file using standard Parquet logical types (INT96 timestamps should be converted to INT64 TIMESTAMP_MICROS) |
| Nested types appear as JSON strings | Query engine does not support the nesting depth | Flatten nested structures during import using column mappings, or query with UNNEST/struct access syntax |
| Row count mismatch | File contains row groups with different schemas (schema evolution within file) | Export the file with a consistent schema. The import engine uses the schema from the file footer. |
| Slow preview for large files | File has very large row groups (>256 MB) | Expected behavior for files with non-standard row group sizes. Preview will still complete. |
Example: Parquet Import Workflow
1. Upload: POST /api/v1/files/upload
File: web_events_2024q1.parquet (450 MB, Snappy compressed, 1.2M rows)
2. Preview: GET /api/v1/files/{fileId}/preview
Schema read from Parquet metadata (instant, no inference needed)
Columns: event_id (LONG), user_id (LONG), event_type (STRING),
timestamp (TIMESTAMP), properties (MAP<STRING,STRING>)
Total rows: 1,250,000
3. Configure: PUT /api/v1/files/{fileId}/schema
   ```json
   {
     "targetTableName": "web_events_2024_q1",
     "targetSchema": "analytics"
   }
   ```
4. Import: POST /api/v1/files/{fileId}/import
Result: 1,250,000 records imported into iceberg.analytics.web_events_2024_q1
5. Query:
   ```sql
   SELECT event_type, COUNT(*) AS event_count
   FROM iceberg.analytics.web_events_2024_q1
   GROUP BY event_type
   ORDER BY event_count DESC;
   ```
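Scripted, steps 2-4 of the workflow above reduce to three HTTP calls. A sketch that assembles them with the standard library (the base URL and file ID are hypothetical, authentication is omitted, and the multipart upload of step 1 is left out for brevity; only the endpoint paths come from the workflow above):

```python
import json
import urllib.request

BASE = "https://matih.example.com"  # hypothetical host

def build_request(method, path, body=None):
    """Assemble one workflow request; nothing is sent in this sketch."""
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        BASE + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )

file_id = "abc123"  # in practice, returned by the upload step
steps = [
    build_request("GET", f"/api/v1/files/{file_id}/preview"),
    build_request("PUT", f"/api/v1/files/{file_id}/schema",
                  {"targetTableName": "web_events_2024_q1",
                   "targetSchema": "analytics"}),
    build_request("POST", f"/api/v1/files/{file_id}/import"),
]
for req in steps:
    print(req.get_method(), req.full_url)
```

Each request would be dispatched with `urllib.request.urlopen(req)`; the preview response is the JSON document shown in the Schema Inference section.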