Data Pages on Parquet

Compression

Parquet compression definitions

This document contains the specification of all supported compression codecs.

Overview

Parquet allows the data block inside dictionary pages and data pages to be compressed for better space efficiency. The Parquet format supports several compression codecs covering different areas in the compression ratio / processing cost spectrum.

The detailed specifications of compression codecs are maintained externally by their respective authors or maintainers, which we reference hereafter.

For all compression codecs except the deprecated LZ4 codec, the raw data of a (data or dictionary) page is fed as-is to the underlying compression library, without any additional framing or padding. The information required for precise allocation of compressed and decompressed buffers is written in the PageHeader struct.
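As a minimal sketch of that contract, the example below compresses a page's raw bytes with no extra framing and records the two sizes a reader needs to allocate buffers. It makes several assumptions: gzip from the Python standard library stands in for whichever codec is configured, `PageSizes` is a hypothetical stand-in for the relevant part of the PageHeader struct, and the field names `uncompressed_page_size` and `compressed_page_size` are assumed rather than taken from the text above.

```python
import gzip
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PageSizes:
    """Hypothetical stand-in for the size fields of the PageHeader struct."""
    uncompressed_page_size: int
    compressed_page_size: int


def compress_page(raw_page: bytes) -> Tuple[PageSizes, bytes]:
    # The raw page bytes are handed to the codec as-is: no framing, no padding.
    compressed = gzip.compress(raw_page)
    return PageSizes(len(raw_page), len(compressed)), compressed


def decompress_page(sizes: PageSizes, compressed: bytes) -> bytes:
    # Both sizes are known up front, so a reader can size its buffers exactly
    # and reject a page whose lengths disagree with the header.
    if len(compressed) != sizes.compressed_page_size:
        raise ValueError("compressed size does not match the header")
    raw_page = gzip.decompress(compressed)
    if len(raw_page) != sizes.uncompressed_page_size:
        raise ValueError("uncompressed size does not match the header")
    return raw_page
```

The size checks are a design choice of this sketch, not a requirement stated above; they show how the header information can double as a cheap sanity check.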

Encodings

Parquet encoding definitions

This file contains the specification of all supported encodings.

Unless otherwise stated in page or encoding documentation, any encoding can be used with any page type.

Supported Encodings

For details on current implementation status, see the Implementation Status page.

Encoding type	Encoding enum	Supported Types
Plain	PLAIN = 0	All Physical Types
Dictionary Encoding	PLAIN_DICTIONARY = 2 (Deprecated), RLE_DICTIONARY = 8	All Physical Types
Run Length Encoding / Bit-Packing Hybrid	RLE = 3	BOOLEAN, Dictionary Indices
Delta Encoding	DELTA_BINARY_PACKED = 5	INT32, INT64
Delta-length byte array	DELTA_LENGTH_BYTE_ARRAY = 6	BYTE_ARRAY
Delta Strings	DELTA_BYTE_ARRAY = 7	BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
Byte Stream Split	BYTE_STREAM_SPLIT = 9	INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY

Deprecated Encodings

Encoding type	Encoding enum
Bit-packed (Deprecated)	BIT_PACKED = 4
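The two tables above can be transcribed into a small lookup, which may be convenient when checking whether an encoding applies to a column's physical type. This is only an illustrative sketch: the names `Encoding`, `DEPRECATED`, `SUPPORTED_TYPES` and `supports` are hypothetical, and `None` is used here to mean "All Physical Types".

```python
from enum import IntEnum


class Encoding(IntEnum):
    """Enum values copied from the tables above."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3
    BIT_PACKED = 4
    DELTA_BINARY_PACKED = 5
    DELTA_LENGTH_BYTE_ARRAY = 6
    DELTA_BYTE_ARRAY = 7
    RLE_DICTIONARY = 8
    BYTE_STREAM_SPLIT = 9


# Encodings the tables mark as deprecated.
DEPRECATED = frozenset({Encoding.PLAIN_DICTIONARY, Encoding.BIT_PACKED})

# None means "All Physical Types". BIT_PACKED is omitted because the
# deprecated-encodings table lists no supported types for it.
SUPPORTED_TYPES = {
    Encoding.PLAIN: None,
    Encoding.PLAIN_DICTIONARY: None,
    Encoding.RLE_DICTIONARY: None,
    # RLE also covers dictionary indices, which are not a physical type.
    Encoding.RLE: {"BOOLEAN"},
    Encoding.DELTA_BINARY_PACKED: {"INT32", "INT64"},
    Encoding.DELTA_LENGTH_BYTE_ARRAY: {"BYTE_ARRAY"},
    Encoding.DELTA_BYTE_ARRAY: {"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"},
    Encoding.BYTE_STREAM_SPLIT: {
        "INT32", "INT64", "FLOAT", "DOUBLE", "FIXED_LEN_BYTE_ARRAY",
    },
}


def supports(encoding: Encoding, physical_type: str) -> bool:
    """Return True if the table lists physical_type for this encoding."""
    if encoding not in SUPPORTED_TYPES:
        return False
    allowed = SUPPORTED_TYPES[encoding]
    return allowed is None or physical_type in allowed
```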

Encryption

Parquet Modular Encryption

Parquet files containing sensitive information can be protected by the modular encryption mechanism, which encrypts and authenticates the file data and metadata while preserving regular Parquet functionality (columnar projection, predicate pushdown, encoding and compression).

1 Problem Statement

Existing data protection solutions (such as flat encryption of files, in-storage encryption, or use of an encrypting storage client) can be applied to Parquet files, but have various security or performance issues. An encryption mechanism, integrated in the Parquet format, allows for an optimal combination of data security, processing speed and encryption granularity.

Checksumming

Pages of all kinds can be individually checksummed. This allows checksums to be disabled at the HDFS file level, to better support single-row lookups. Checksums are calculated using the standard CRC32 algorithm (as used in, e.g., GZip) on the serialized binary representation of a page, not including the page header itself.
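A minimal sketch of such a page checksum, using `zlib.crc32` from the Python standard library (the same CRC32 used by GZip). The helper names `page_crc32` and `verify_page` are hypothetical, and `page_bytes` is assumed to hold the serialized page without its header.

```python
import zlib


def page_crc32(page_bytes: bytes) -> int:
    """CRC32 over the serialized page bytes (page header excluded)."""
    # Mask to 32 bits so the result is always an unsigned value.
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, expected_crc: int) -> bool:
    """Return True if the stored checksum matches the page contents."""
    return page_crc32(page_bytes) == expected_crc
```

For example, `page_crc32(b"123456789")` yields `0xCBF43926`, the well-known check value for the standard CRC32 algorithm.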

Column Chunks

Column chunks are composed of pages written back to back. The pages share a common header and readers can skip over pages they are not interested in. The data for the page follows the header and can be compressed and/or encoded. The compression and encoding are specified in the page metadata.

A column chunk might be partly or completely dictionary encoded. This means that dictionary indexes are stored in the data pages instead of the actual values, and the actual values are stored in the dictionary page. See details in Encodings.md. The dictionary page must be placed at the first position of the column chunk, and a column chunk can contain at most one dictionary page.
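The idea can be sketched in a few lines: distinct values go into a dictionary, and the data pages carry only indexes into it. This is a simplified illustration, not the on-disk format. The helpers `dictionary_encode` and `is_valid_layout` are hypothetical, as are the page-type labels `"DICTIONARY_PAGE"` and `"DATA_PAGE"` used to model a chunk's page sequence.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Split values into a dictionary and a list of indexes into it."""
    dictionary: List[str] = []
    positions: Dict[str, int] = {}
    indexes: List[int] = []
    for value in values:
        if value not in positions:
            # First occurrence: assign the next dictionary slot.
            positions[value] = len(dictionary)
            dictionary.append(value)
        indexes.append(positions[value])
    return dictionary, indexes


def is_valid_layout(page_types: Sequence[str]) -> bool:
    """A dictionary page may appear only as the first page of the chunk.

    Forbidding it everywhere after position 0 also guarantees that there
    is at most one dictionary page.
    """
    return all(page_type != "DICTIONARY_PAGE" for page_type in page_types[1:])
```

With this sketch, `dictionary_encode(["a", "b", "a", "c", "b"])` returns `(["a", "b", "c"], [0, 1, 0, 2, 1])`.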

Error Recovery

If the file metadata is corrupt, the file is lost. If the column metadata is corrupt, that column chunk is lost (but column chunks for this column in other row groups are okay). If a page header is corrupt, the remaining pages in that chunk are lost. If the data within a page is corrupt, that page is lost. The file will be more resilient to corruption with smaller row groups.
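The page-level rules above can be illustrated with a toy reader for a single column chunk. This is a sketch under simplifying assumptions: each page is modeled as a dictionary with hypothetical `header_ok`, `data_ok` and `values` keys, standing in for whatever corruption checks a real reader performs.

```python
from typing import Any, Dict, List


def recover_pages(pages: List[Dict[str, Any]]) -> List[Any]:
    """Return the values of every page that is still readable in a chunk."""
    recovered = []
    for page in pages:
        if not page["header_ok"]:
            # Without a valid header the reader cannot locate the next page,
            # so the remaining pages in this chunk are lost.
            break
        if not page["data_ok"]:
            # Corrupt data costs only this page; the header still tells the
            # reader where the next page starts.
            continue
        recovered.append(page["values"])
    return recovered
```

A corrupt page body drops one page, while a corrupt page header drops everything from that point on, which is why smaller units limit the damage.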