File Format on Parquet

Binary Protocol Extensions

The extension mechanism of the binary Thrift field-id 32767 has some desirable properties:

Existing readers will ignore these extensions without any modifications
Existing readers will ignore the extension bytes with little processing overhead
The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.
Extensions can be appended to existing Thrift serialized structs without requiring Thrift libraries for manipulation (or changes to the thrift IDL).

Because only one field-id is reserved, the extension bytes themselves require disambiguation; otherwise readers cannot decode extensions safely. This is left to implementers, who MUST put enough unique state in their extension bytes for disambiguation. This can be achieved relatively easily by adding a UUID at the start or end of the extension bytes. The extension mechanism deliberately does not specify a disambiguation scheme, which gives implementers more flexibility.
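
As an illustration of the UUID approach, here is a minimal Python sketch. It covers only the payload bytes, not the Thrift framing of field-id 32767. The UUID value and the function names wrap_extension and unwrap_extension are hypothetical, chosen for this example; they are not part of the Parquet specification.

```python
import uuid

# Hypothetical identifier for one implementer's extension (an assumption for
# illustration; a real implementer would generate and publish their own).
MY_EXTENSION_ID = uuid.UUID("12345678-1234-5678-1234-567812345678")


def wrap_extension(payload: bytes) -> bytes:
    """Prefix the freeform payload with the 16-byte UUID."""
    return MY_EXTENSION_ID.bytes + payload


def unwrap_extension(extension_bytes: bytes):
    """Return the payload if the bytes carry our UUID, else None."""
    prefix = MY_EXTENSION_ID.bytes
    if len(extension_bytes) >= len(prefix) and extension_bytes[: len(prefix)] == prefix:
        return extension_bytes[len(prefix):]
    # Some other implementer's extension (or garbage): ignore it.
    return None
```

A reader that finds extension bytes it does not recognize gets None back and can carry on as if no extension were present.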

Configurations

Row Group Size

Larger row groups allow for larger column chunks, which makes larger sequential I/O possible. Larger row groups also require more buffering in the write path (or a two-pass write). We recommend large row groups (512 MB to 1 GB). Since an entire row group might need to be read, it should fit completely in one HDFS block. Therefore, HDFS block sizes should also be set larger. An optimized read setup would be: 1 GB row groups, 1 GB HDFS block size, and 1 HDFS block per HDFS file.
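
The sizing arithmetic can be sketched in Python as follows. The helper names and the average row size are assumptions for illustration; real writers track buffered bytes rather than relying on a fixed estimate.

```python
MB = 1024 * 1024
GB = 1024 * MB


def rows_per_row_group(target_row_group_bytes: int, avg_row_bytes: int) -> int:
    """Estimate how many rows fit in a row group of the target size."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, target_row_group_bytes // avg_row_bytes)


def fits_in_one_block(row_group_bytes: int, hdfs_block_bytes: int) -> bool:
    """A row group should not straddle two HDFS blocks."""
    return row_group_bytes <= hdfs_block_bytes
```

For example, with an assumed average row size of 1 KiB, a 1 GB row group holds about one million rows, and it fits in one block only if the HDFS block size is also at least 1 GB.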

Extensibility

There are many places in the format for compatible extensions:

File Version: The file metadata contains a version.
Encodings: Encodings are specified by enum and more can be added in the future.
Page types: Additional page types can be added and safely skipped.
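
To show why unknown page types can be skipped safely, here is a minimal Python sketch. It uses a toy framing (1-byte type, 4-byte little-endian size, then payload), which is an assumption for illustration only. Real Parquet pages carry a Thrift-serialized PageHeader whose size field plays the same role.

```python
import struct


def frame(page_type: int, payload: bytes) -> bytes:
    """Toy page framing (NOT the real Parquet layout): type, size, payload."""
    return struct.pack("<BI", page_type, len(payload)) + payload


def read_known_pages(chunk: bytes, known_types: set[int]) -> list[tuple[int, bytes]]:
    """Collect pages of known types; step over everything else using its size."""
    pages = []
    pos = 0
    while pos < len(chunk):
        page_type, size = struct.unpack_from("<BI", chunk, pos)
        pos += 5  # 1 byte of type plus 4 bytes of size
        if page_type in known_types:
            pages.append((page_type, chunk[pos:pos + size]))
        # Known or not, the size tells the reader where the next page starts.
        pos += size
    return pages
```

Because every page declares its own size, an older reader never needs to understand the payload of a page type added after it was written.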

Metadata

There are two types of metadata: file metadata and page header metadata.

All Thrift structures are serialized using the TCompactProtocol. The full definition of these structures is given in the Parquet Thrift definition.

File metadata

In the diagram below, file metadata is described by the FileMetaData structure. This file metadata provides offset and size information useful when navigating the Parquet file.

Page header metadata (PageHeader and children in the diagram) is stored in-line with the page data, and is used in the reading and decoding of data.
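
As a sketch of how a reader finds the file metadata, the Python below locates the serialized FileMetaData bytes in an unencrypted file. It relies on the standard footer layout: the file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The function name locate_file_metadata is an assumption for illustration, and decoding the Thrift bytes themselves is out of scope here.

```python
import struct

MAGIC = b"PAR1"


def locate_file_metadata(data: bytes) -> tuple[int, int]:
    """Return (start_offset, length) of the serialized FileMetaData."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file")
    # The 4 bytes before the trailing magic hold the metadata length.
    (metadata_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - metadata_len
    if start < len(MAGIC):
        raise ValueError("metadata length exceeds file size")
    return start, metadata_len
```

A reader would then pass data[start:start + length] to a Thrift compact-protocol decoder to obtain the FileMetaData structure.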

Nested Encoding

To encode nested columns, Parquet uses the Dremel encoding with definition and repetition levels. Definition levels specify how many optional fields in the path for the column are defined. Repetition levels specify at which repeated field in the path the value is repeated. The max definition and repetition levels can be computed from the schema (i.e. how much nesting there is). This defines the maximum number of bits required to store the levels (levels are defined for all values in the column).
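
The computation of the maximum levels can be sketched in Python as below. The schema path is modeled as a plain list of repetition types from the root to the leaf, which is a simplification chosen for this example. The sketch follows the Dremel convention that repeated fields, like optional ones, add to the definition level.

```python
def max_levels(path: list[str]) -> tuple[int, int]:
    """Return (max_definition_level, max_repetition_level) for a column path.

    Each entry of `path` is "required", "optional", or "repeated".
    """
    max_def = sum(1 for rep in path if rep != "required")
    max_rep = sum(1 for rep in path if rep == "repeated")
    return max_def, max_rep


def bit_width(max_level: int) -> int:
    """Bits needed to store any level in the range 0..max_level."""
    return max_level.bit_length()
```

For a path of optional, repeated, required fields this gives a maximum definition level of 2 and a maximum repetition level of 1, needing 2 bits and 1 bit respectively. A column whose path is entirely required needs 0 bits, so its levels need not be stored at all.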

Bloom Filter

Parquet Bloom Filter

Problem statement

In their current format, column statistics and dictionaries can be used for predicate pushdown. Statistics include minimum and maximum value, which can be used to filter out values not in the range. Dictionaries are more specific, and readers can filter out values that are between min and max but not in the dictionary. However, when there are too many distinct values, writers sometimes choose not to add dictionaries because of the extra space they occupy. This leaves columns with large cardinalities and widely separated min and max without support for predicate pushdown.
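
The gap described above can be made concrete with a small Python sketch. The function names are hypothetical and the checks are simplified to equality predicates on integers.

```python
def can_skip_by_min_max(min_value: int, max_value: int, target: int) -> bool:
    """Statistics can rule out a target only if it lies outside [min, max]."""
    return target < min_value or target > max_value


def can_skip_by_dictionary(dictionary, target: int) -> bool:
    """A dictionary can rule out a target, but only if the writer kept one."""
    return dictionary is not None and target not in dictionary


# A column chunk with widely separated values (illustrative data).
values = {3, 1_000_000_000}
target = 500

skip_with_stats = can_skip_by_min_max(min(values), max(values), target)
skip_with_dict = can_skip_by_dictionary(values, target)
skip_without_dict = can_skip_by_dictionary(None, target)
```

Here the statistics cannot exclude 500 because it lies between the minimum and maximum. A dictionary could exclude it, but if the writer dropped the dictionary to save space, the reader has no way to skip the chunk.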

Nulls

Nullity is encoded in the definition levels (which are run-length encoded). NULL values are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs would be encoded with run-length encoding (0, 1000 times) for the definition levels and nothing else.
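
The example above can be sketched in Python for a non-nested optional column. This uses a plain list of (level, run length) pairs, which is a simplification: the actual Parquet encoding is an RLE/bit-packing hybrid. The function name encode_optional_column is an assumption for illustration.

```python
from itertools import groupby


def encode_optional_column(values: list):
    """Split values into run-length encoded definition levels and non-null data."""
    # In a non-nested schema, level 1 means "defined" and level 0 means NULL.
    def_levels = [0 if v is None else 1 for v in values]
    data = [v for v in values if v is not None]
    runs = [(level, sum(1 for _ in group)) for level, group in groupby(def_levels)]
    return runs, data
```

Calling encode_optional_column([None] * 1000) yields a single run of level 0 with length 1000 and an empty data list, matching the description above.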

Page Index

Parquet Page Index: Layout to Support Page Skipping

In Parquet, a page index is optional metadata for a ColumnChunk, containing statistics for DataPages that can be used to skip those pages when scanning in ordered and unordered columns. The page index is stored using the OffsetIndex and ColumnIndex structures, defined in parquet.thrift.
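
A minimal Python sketch of page skipping follows. The PageEntry class is a hypothetical flattening of the per-page information held in ColumnIndex (min and max values) and OffsetIndex (page location); it is not the Thrift structure itself, and values are assumed to be integers.

```python
from dataclasses import dataclass


@dataclass
class PageEntry:
    min_value: int             # per-page minimum, as a ColumnIndex would provide
    max_value: int             # per-page maximum, as a ColumnIndex would provide
    offset: int                # page location, as an OffsetIndex would provide
    compressed_page_size: int  # page size, as an OffsetIndex would provide


def pages_to_read(pages: list[PageEntry], lo: int, hi: int) -> list[PageEntry]:
    """Keep only pages whose [min, max] range overlaps the query range [lo, hi]."""
    return [p for p in pages if p.max_value >= lo and p.min_value <= hi]
```

With the index loaded up front, the reader can seek directly to the offsets of the surviving pages without touching the headers of the pages it skips.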

Problem Statement

In previous versions of the format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader had to process the page header to determine whether the page could be skipped based on the statistics. This meant the reader had to access all pages in a column and was therefore likely to read most of the column data from disk.

Implementation status

This page summarizes the features supported by different Parquet implementations.

Note: If you find out of date information, please help us improve the accuracy of this page by opening an issue or submitting a pull request.

Legend

The value in each box means:

✅: supported. Footnote added when support is partial. When data is available, links to release notes are provided on the implementing version.
❌: not supported
(R): only read support
(W): only write support
(blank): no data

Implementations

arrow (C++)
parquet-java (Java)
arrow-go (Go)
arrow-rs (Rust)
cudf (cuDF C++)
hyparquet (JavaScript)
duckdb (C++)
polars (Rust)

Physical types

Physical types are defined by the enum Type in parquet.thrift.

Parquet format versions

This page describes how features are added to the Parquet format specification and how they affect reader and writer compatibility. See the Implementation status page for which implementations (arrow, parquet-java, arrow-rs, etc.) support each feature.

Note: If you find out-of-date information, please open an issue or pull request.

Feature compatibility

The Parquet format spec classifies changes by their effect on reader and writer compatibility. Changes differ in their forward compatibility, that is, whether an older reader can read files that use a newer feature.