MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Parquet Import

Apache Parquet is a columnar storage format optimized for analytical workloads. Importing Parquet files into the MATIH platform preserves the full schema, including nested types, nullability constraints, and column metadata. Parquet is the recommended format for large datasets because it provides efficient compression, predicate pushdown, and column pruning.


Advantages of Parquet Import

| Advantage | Description |
| --- | --- |
| Schema preservation | Parquet files carry an embedded schema with exact column types, eliminating the need for type inference |
| No data loss | Types map directly from Parquet to Iceberg without ambiguity (unlike CSV, where `123` could be an integer or a string) |
| Compression | Parquet files are typically 3-10x smaller than equivalent CSV files, reducing upload time and storage |
| Large file support | Up to 1 GB per file upload, compared to 500 MB for CSV |
| Fast import | Columnar format enables direct column-to-column mapping without row-by-row parsing |

Type Mapping

Parquet types map directly to Iceberg types with full fidelity.

| Parquet Type | Iceberg Type | Notes |
| --- | --- | --- |
| BOOLEAN | BOOLEAN | Direct mapping |
| INT32 | INT | 32-bit integer |
| INT64 | LONG | 64-bit integer |
| FLOAT | FLOAT | 32-bit IEEE 754 |
| DOUBLE | DOUBLE | 64-bit IEEE 754 |
| BINARY (UTF8) | STRING | UTF-8 encoded strings |
| BINARY (raw) | BINARY | Raw byte arrays |
| FIXED_LEN_BYTE_ARRAY | FIXED(n) | Fixed-length binary |
| INT32 (DATE) | DATE | Days since epoch |
| INT64 (TIMESTAMP_MILLIS) | TIMESTAMP | Millisecond precision |
| INT64 (TIMESTAMP_MICROS) | TIMESTAMP | Microsecond precision |
| BYTE_ARRAY (DECIMAL) | DECIMAL(p,s) | Arbitrary-precision decimal |
| INT32/INT64 (DECIMAL) | DECIMAL(p,s) | Integer-encoded decimal |
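As an illustration, the scalar mapping above can be expressed as a lookup keyed on the Parquet physical type plus its optional logical annotation. The helper below is a hypothetical sketch, not the platform's actual implementation:

```python
# Hypothetical sketch of the Parquet -> Iceberg scalar type mapping above.
# The logical annotation (shown in parentheses in the table) takes
# precedence over the bare physical type.
PARQUET_TO_ICEBERG = {
    ("BOOLEAN", None): "BOOLEAN",
    ("INT32", None): "INT",
    ("INT64", None): "LONG",
    ("FLOAT", None): "FLOAT",
    ("DOUBLE", None): "DOUBLE",
    ("BINARY", "UTF8"): "STRING",
    ("BINARY", None): "BINARY",
    ("INT32", "DATE"): "DATE",
    ("INT64", "TIMESTAMP_MILLIS"): "TIMESTAMP",
    ("INT64", "TIMESTAMP_MICROS"): "TIMESTAMP",
}

def iceberg_type(physical, logical=None):
    """Resolve an Iceberg type from a Parquet physical/logical type pair."""
    try:
        return PARQUET_TO_ICEBERG[(physical, logical)]
    except KeyError:
        raise ValueError(f"Unsupported Parquet type: {physical} ({logical})")

print(iceberg_type("INT32"))                      # INT
print(iceberg_type("INT64", "TIMESTAMP_MICROS"))  # TIMESTAMP
```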

Nested Types

| Parquet Logical Type | Iceberg Type | Description |
| --- | --- | --- |
| LIST | LIST<element_type> | Ordered collection of elements |
| MAP | MAP<key_type, value_type> | Key-value pairs |
| GROUP (struct) | STRUCT<field: type, ...> | Named fields with types |

Nested types are preserved during import. You can query nested fields using Trino's nested type syntax:

-- Access struct fields
SELECT address.city, address.zip_code
FROM iceberg.raw_data.customers;
 
-- Unnest arrays
SELECT o.customer_id, item.product_name
FROM iceberg.raw_data.orders AS o
CROSS JOIN UNNEST(o.line_items) AS t(item);
 
-- Access map values
SELECT properties['color'] AS color
FROM iceberg.raw_data.products;

Schema Inference

Unlike CSV and Excel, Parquet files include an embedded schema. The preview step reads this schema directly rather than inferring it from data values.

{
  "fileId": "...",
  "fileName": "transactions.parquet",
  "columns": [
    { "name": "transaction_id", "type": "LONG", "nullable": false },
    { "name": "customer_id", "type": "LONG", "nullable": false },
    { "name": "amount", "type": "DECIMAL(10,2)", "nullable": false },
    { "name": "currency", "type": "STRING", "nullable": false },
    { "name": "timestamp", "type": "TIMESTAMP", "nullable": false },
    { "name": "metadata", "type": "MAP<STRING,STRING>", "nullable": true }
  ],
  "previewRows": [
    {
      "transaction_id": 100001,
      "customer_id": 5001,
      "amount": 150.00,
      "currency": "USD",
      "timestamp": "2024-03-15T14:30:00Z",
      "metadata": { "source": "web", "device": "mobile" }
    }
  ],
  "totalRows": 1250000,
  "inferredTypes": {
    "transaction_id": "LONG",
    "customer_id": "LONG",
    "amount": "DECIMAL(10,2)",
    "currency": "STRING",
    "timestamp": "TIMESTAMP",
    "metadata": "MAP<STRING,STRING>"
  }
}
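Because the preview response is plain JSON, a client can sanity-check it before configuring the import. The following is a minimal sketch using only the standard library; the field names follow the example response above, but the `check_preview` helper and its validation rules are hypothetical:

```python
import json

def check_preview(payload, required_non_nullable):
    """Return a list of problems found in a preview response; empty means OK."""
    preview = json.loads(payload)
    problems = []
    columns = {c["name"]: c for c in preview["columns"]}
    for name in required_non_nullable:
        col = columns.get(name)
        if col is None:
            problems.append(f"missing column: {name}")
        elif col["nullable"]:
            problems.append(f"column is nullable: {name}")
    if preview["totalRows"] == 0:
        problems.append("file contains no rows")
    return problems

# Abbreviated version of the preview response shown above.
sample = json.dumps({
    "columns": [
        {"name": "transaction_id", "type": "LONG", "nullable": False},
        {"name": "metadata", "type": "MAP<STRING,STRING>", "nullable": True},
    ],
    "totalRows": 1250000,
})
print(check_preview(sample, {"transaction_id", "metadata"}))
# ['column is nullable: metadata']
```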

Partitioned Parquet Files

If a Parquet file was written as part of a partitioned dataset (e.g., Hive-style partitioning with year=2024/month=03/data.parquet), the single-file import treats it as a standalone file. Partition columns embedded in the file path are not automatically extracted.

To import partitioned datasets:

  • Single file: Upload individual Parquet files. Each file is imported as-is without partition column extraction.
  • Full partitioned dataset: Use the S3 or GCS cloud storage connector, which handles Hive-style partitioned directories and extracts partition columns automatically.
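For context, Hive-style partition columns are encoded as key=value segments of the file path, which is what the cloud storage connectors extract. A simplified sketch of that extraction (illustrative only, not the connector's actual code):

```python
def partition_columns(path):
    """Extract Hive-style key=value partition segments from an object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_columns("s3://bucket/events/year=2024/month=03/data.parquet"))
# {'year': '2024', 'month': '03'}
```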

Compression

Parquet files may use internal compression at the column chunk level. The import engine supports all standard Parquet compression codecs.

| Codec | Supported | Notes |
| --- | --- | --- |
| UNCOMPRESSED | Yes | No compression |
| SNAPPY | Yes | Fast decompression, moderate compression ratio |
| GZIP | Yes | High compression ratio, slower decompression |
| LZO | Yes | Fast compression and decompression |
| BROTLI | Yes | High compression ratio |
| LZ4 | Yes | Very fast decompression |
| ZSTD | Yes | Good balance of speed and compression ratio |

No configuration is needed. The codec is detected from the file metadata automatically.


Row Group Handling

Parquet files are organized into row groups (typically 128 MB each). The import engine processes row groups sequentially, which means:

  • Preview reads only the first row group to generate sample rows, making preview fast even for very large files
  • Import processes all row groups, writing each batch to the Iceberg table transactionally
  • Memory usage is bounded by a single row group rather than the entire file
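The bounded-memory behavior can be pictured with a generic batching sketch. Real row groups come from the Parquet file footer, not Python lists, so this is purely illustrative:

```python
def row_groups(rows, group_size):
    """Yield one 'row group' at a time instead of loading the whole file."""
    for start in range(0, len(rows), group_size):
        yield rows[start:start + group_size]

def import_file(rows, group_size):
    """Process each row group as one batch; return the number of rows written."""
    written = 0
    for batch in row_groups(rows, group_size):
        # In the real engine each batch would be committed to the Iceberg
        # table transactionally; memory is bounded by a single batch.
        written += len(batch)
    return written

data = [{"id": i} for i in range(10)]
preview = next(row_groups(data, 4))   # preview reads only the first group
print(len(preview))                   # 4
print(import_file(data, group_size=4))  # 10, processed as batches of 4, 4, 2
```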

Common Issues

| Issue | Cause | Resolution |
| --- | --- | --- |
| Unsupported logical type | Parquet file uses a non-standard logical type annotation | Export the file using standard Parquet logical types (convert INT96 timestamps to INT64 TIMESTAMP_MICROS) |
| Nested types appear as JSON strings | Query engine does not support the nesting depth | Flatten nested structures during import using column mappings, or query with UNNEST/struct access syntax |
| Row count mismatch | File contains row groups with different schemas (schema evolution within the file) | Export the file with a consistent schema; the import engine uses the schema from the file footer |
| Slow preview for large files | File has very large row groups (>256 MB) | Expected behavior for files with non-standard row group sizes; preview will still complete |

Example: Parquet Import Workflow

1. Upload: POST /api/v1/files/upload
   File: web_events_2024q1.parquet (450 MB, Snappy compressed, 1.25M rows)

2. Preview: GET /api/v1/files/{fileId}/preview
   Schema read from Parquet metadata (instant, no inference needed)
   Columns: event_id (LONG), user_id (LONG), event_type (STRING),
            timestamp (TIMESTAMP), properties (MAP<STRING,STRING>)
   Total rows: 1,250,000

3. Configure: PUT /api/v1/files/{fileId}/schema
   {
     "targetTableName": "web_events_2024_q1",
     "targetSchema": "analytics"
   }

4. Import: POST /api/v1/files/{fileId}/import
   Result: 1,250,000 records imported into iceberg.analytics.web_events_2024_q1

5. Query:
   SELECT event_type, COUNT(*) as event_count
   FROM iceberg.analytics.web_events_2024_q1
   GROUP BY event_type
   ORDER BY event_count DESC;
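Steps 3 and 4 of the workflow could be scripted against the API. The sketch below uses only the standard library; the base URL, headers, and error handling are assumptions, and only the endpoint paths come from the example above:

```python
import json
import urllib.request

BASE = "https://matih.example.com/api/v1"  # hypothetical host

def schema_config(table, schema):
    """Build the JSON body for PUT /files/{fileId}/schema (step 3)."""
    return json.dumps({"targetTableName": table, "targetSchema": schema}).encode()

def configure_and_import(file_id, table, schema):
    """Steps 3 and 4: configure the target table, then trigger the import."""
    req = urllib.request.Request(
        f"{BASE}/files/{file_id}/schema",
        data=schema_config(table, schema),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)  # raises on HTTP errors
    urllib.request.urlopen(
        urllib.request.Request(f"{BASE}/files/{file_id}/import", method="POST"))

# Example body for step 3:
print(schema_config("web_events_2024_q1", "analytics").decode())
```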