<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Pages on Parquet</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/</link><description>Recent content in Data Pages on Parquet</description><generator>Hugo</generator><language>en</language><atom:link href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/index.xml" rel="self" type="application/rss+xml"/><item><title>Compression</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/compression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/compression/</guid><description>&lt;h1 id="parquet-compression-definitions"&gt;Parquet compression definitions&lt;/h1&gt;
&lt;p&gt;This document contains the specification of all supported compression codecs.&lt;/p&gt;
&lt;h2 id="overview"&gt;Overview&lt;/h2&gt;
&lt;p&gt;Parquet allows the data block inside dictionary pages and data pages to
be compressed for better space efficiency. The Parquet format supports
several compression codecs covering different areas in the compression
ratio / processing cost spectrum.&lt;/p&gt;
&lt;p&gt;The detailed specifications of the compression codecs are maintained externally
by their respective authors or maintainers and are referenced below.&lt;/p&gt;
&lt;p&gt;For all compression codecs except the deprecated &lt;code&gt;LZ4&lt;/code&gt; codec, the raw data
of a (data or dictionary) page is fed &lt;em&gt;as-is&lt;/em&gt; to the underlying compression
library, without any additional framing or padding. The information required
for precise allocation of compressed and decompressed buffers is written
in the &lt;code&gt;PageHeader&lt;/code&gt; struct.&lt;/p&gt;</description></item><item><title>Encodings</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/</guid><description>&lt;h1 id="parquet-encoding-definitions"&gt;Parquet encoding definitions&lt;/h1&gt;
&lt;p&gt;This file contains the specification of all supported encodings.&lt;/p&gt;
&lt;p&gt;Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.&lt;/p&gt;
&lt;h3 id="supported-encodings"&gt;Supported Encodings&lt;/h3&gt;
&lt;p&gt;For details on current implementation status, see the &lt;a href="https://parquet.apache.org/docs/file-format/implementationstatus/#encodings"&gt;Implementation Status&lt;/a&gt; page.&lt;/p&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Encoding type&lt;/th&gt;
 &lt;th&gt;Encoding enum&lt;/th&gt;
 &lt;th&gt;Supported Types&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#PLAIN"&gt;Plain&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;PLAIN = 0&lt;/td&gt;
 &lt;td&gt;All Physical Types&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#DICTIONARY"&gt;Dictionary Encoding&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;PLAIN_DICTIONARY = 2 (Deprecated) &lt;br&gt; RLE_DICTIONARY = 8&lt;/td&gt;
 &lt;td&gt;All Physical Types&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#RLE"&gt;Run Length Encoding / Bit-Packing Hybrid&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;RLE = 3&lt;/td&gt;
 &lt;td&gt;BOOLEAN, Dictionary Indices&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#DELTAENC"&gt;Delta Encoding&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;DELTA_BINARY_PACKED = 5&lt;/td&gt;
 &lt;td&gt;INT32, INT64&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#DELTALENGTH"&gt;Delta-length byte array&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;DELTA_LENGTH_BYTE_ARRAY = 6&lt;/td&gt;
 &lt;td&gt;BYTE_ARRAY&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#DELTASTRING"&gt;Delta Strings&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;DELTA_BYTE_ARRAY = 7&lt;/td&gt;
 &lt;td&gt;BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY&lt;/td&gt;
 &lt;/tr&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#BYTESTREAMSPLIT"&gt;Byte Stream Split&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;BYTE_STREAM_SPLIT = 9&lt;/td&gt;
 &lt;td&gt;INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
&lt;h3 id="deprecated-encodings"&gt;Deprecated Encodings&lt;/h3&gt;
&lt;table&gt;
 &lt;thead&gt;
 &lt;tr&gt;
 &lt;th&gt;Encoding type&lt;/th&gt;
 &lt;th&gt;Encoding enum&lt;/th&gt;
 &lt;/tr&gt;
 &lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
 &lt;td&gt;&lt;a href="https://alamb.github.io/parquet-site/docs/file-format/data-pages/encodings/#BITPACKED"&gt;Bit-packed (Deprecated)&lt;/a&gt;&lt;/td&gt;
 &lt;td&gt;BIT_PACKED = 4&lt;/td&gt;
 &lt;/tr&gt;
 &lt;/tbody&gt;
&lt;/table&gt;
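&lt;p&gt;As a rough illustration, the two tables above can be transcribed into a small lookup structure. This is a minimal sketch, not part of the specification: the names &lt;code&gt;Encoding&lt;/code&gt;, &lt;code&gt;SUPPORTED_TYPES&lt;/code&gt;, &lt;code&gt;DEPRECATED&lt;/code&gt; and &lt;code&gt;supports&lt;/code&gt; are hypothetical, and "All Physical Types" is assumed to mean the eight Parquet physical types listed in the code.&lt;/p&gt;

```python
from enum import IntEnum


class Encoding(IntEnum):
    """Encoding enum values as listed in the two tables above."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2  # deprecated
    RLE = 3
    BIT_PACKED = 4  # deprecated
    DELTA_BINARY_PACKED = 5
    DELTA_LENGTH_BYTE_ARRAY = 6
    DELTA_BYTE_ARRAY = 7
    RLE_DICTIONARY = 8
    BYTE_STREAM_SPLIT = 9


# Assumption: "All Physical Types" expands to these eight names.
ALL_PHYSICAL_TYPES = frozenset({
    "BOOLEAN", "INT32", "INT64", "INT96", "FLOAT", "DOUBLE",
    "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY",
})

# Supported physical types per encoding, transcribed from the first table.
# RLE also covers dictionary indices, which are not a physical type and are
# therefore not modeled here. BIT_PACKED has no types listed in its table.
SUPPORTED_TYPES = {
    Encoding.PLAIN: ALL_PHYSICAL_TYPES,
    Encoding.PLAIN_DICTIONARY: ALL_PHYSICAL_TYPES,
    Encoding.RLE_DICTIONARY: ALL_PHYSICAL_TYPES,
    Encoding.RLE: frozenset({"BOOLEAN"}),
    Encoding.DELTA_BINARY_PACKED: frozenset({"INT32", "INT64"}),
    Encoding.DELTA_LENGTH_BYTE_ARRAY: frozenset({"BYTE_ARRAY"}),
    Encoding.DELTA_BYTE_ARRAY: frozenset({"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}),
    Encoding.BYTE_STREAM_SPLIT: frozenset(
        {"INT32", "INT64", "FLOAT", "DOUBLE", "FIXED_LEN_BYTE_ARRAY"}
    ),
}

DEPRECATED = frozenset({Encoding.PLAIN_DICTIONARY, Encoding.BIT_PACKED})


def supports(encoding, physical_type):
    """Return True if the table lists physical_type for this encoding."""
    return physical_type in SUPPORTED_TYPES.get(encoding, frozenset())
```

&lt;p&gt;For example, &lt;code&gt;supports(Encoding.DELTA_BINARY_PACKED, "INT64")&lt;/code&gt; is true, while the same call with &lt;code&gt;"DOUBLE"&lt;/code&gt; is false.&lt;/p&gt;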
&lt;p&gt;&lt;a name="PLAIN"&gt;&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Encryption</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/encryption/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/encryption/</guid><description>&lt;h1 id="parquet-modular-encryption"&gt;Parquet Modular Encryption&lt;/h1&gt;
&lt;p&gt;Parquet files containing sensitive information can be protected by the modular encryption
mechanism, which encrypts and authenticates the file data and metadata while still allowing
regular Parquet functionality (columnar projection, predicate pushdown, encoding
and compression).&lt;/p&gt;
&lt;h2 id="1-problem-statement"&gt;1 Problem Statement&lt;/h2&gt;
&lt;p&gt;Existing data protection solutions (such as flat encryption of files, in-storage encryption,
or use of an encrypting storage client) can be applied to Parquet files, but have various
security or performance issues. An encryption mechanism, integrated in the Parquet format,
allows for an optimal combination of data security, processing speed and encryption granularity.&lt;/p&gt;</description></item><item><title>Checksumming</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/checksumming/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/checksumming/</guid><description>&lt;p&gt;Pages of all kinds can be individually checksummed. This allows disabling of checksums
at the HDFS file level, to better support single row lookups. Checksums are calculated
using the standard CRC32 algorithm (as used in, e.g., GZip) on the serialized binary
representation of a page (not including the page header itself).&lt;/p&gt;</description></item><item><title>Column Chunks</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/columnchunks/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/columnchunks/</guid><description>&lt;p&gt;Column chunks are composed of pages written back to back. The pages share a common
header and readers can skip over pages they are not interested in. The data for the
page follows the header and can be compressed and/or encoded. The compression and
encoding are specified in the page metadata.&lt;/p&gt;
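&lt;p&gt;The back-to-back layout described above can be sketched with a toy model. This is only an illustration under stated assumptions: real Parquet page headers are Thrift-encoded &lt;code&gt;PageHeader&lt;/code&gt; structs, whereas the 5-byte header, the page-kind tags and the helper names &lt;code&gt;write_page&lt;/code&gt; and &lt;code&gt;scan_pages&lt;/code&gt; used here are hypothetical stand-ins. The point it shows is that a reader can use the size recorded in each header to jump over pages it does not need.&lt;/p&gt;

```python
# Toy model only: a hypothetical 5-byte header stands in for the real
# Thrift-encoded PageHeader. Layout: 1 byte page kind, then a 4-byte
# little-endian payload size, followed directly by the payload.
HEADER_SIZE = 5
DICTIONARY, DATA = 0, 1  # hypothetical page-kind tags


def write_page(kind, payload):
    """Serialize one toy page: header followed directly by its payload."""
    return bytes([kind]) + len(payload).to_bytes(4, "little") + payload


def scan_pages(chunk, wanted_kind):
    """Walk a toy column chunk and collect payloads of the wanted kind."""
    found = []
    offset = 0
    while offset != len(chunk):
        kind = chunk[offset]
        size = int.from_bytes(chunk[offset + 1:offset + HEADER_SIZE], "little")
        start = offset + HEADER_SIZE
        if kind == wanted_kind:
            found.append(chunk[start:start + size])
        # Jump straight to the next header without reading the payload.
        offset = start + size
    return found


# Pages are written back to back to form the column chunk.
chunk = (
    write_page(DICTIONARY, b"dict")
    + write_page(DATA, b"page-1")
    + write_page(DATA, b"page-2")
)
data_pages = scan_pages(chunk, DATA)
```

&lt;p&gt;In this sketch the payloads are opaque bytes; in a real file they may be compressed and/or encoded as described by the page metadata.&lt;/p&gt;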
&lt;p&gt;A column chunk might be partly or completely dictionary encoded. This means that
dictionary indexes are stored in the data pages instead of the actual values, and the
actual values are stored in the dictionary page. See Encodings.md for details.
The dictionary page must be placed at the first position of the column chunk. At
most one dictionary page can be placed in a column chunk.&lt;/p&gt;</description></item><item><title>Error Recovery</title><link>https://alamb.github.io/parquet-site/docs/file-format/data-pages/errorrecovery/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/data-pages/errorrecovery/</guid><description>&lt;p&gt;If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
that column chunk is lost (but column chunks for this column in other row groups are
okay). If a page header is corrupt, the remaining pages in that chunk are lost. If
the data within a page is corrupt, that page is lost. The file will be more
resilient to corruption with smaller row groups.&lt;/p&gt;</description></item></channel></rss>