<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>File Format on Parquet</title><link>https://alamb.github.io/parquet-site/docs/file-format/</link><description>Recent content in File Format on Parquet</description><generator>Hugo</generator><language>en</language><atom:link href="https://alamb.github.io/parquet-site/docs/file-format/index.xml" rel="self" type="application/rss+xml"/><item><title>Binary Protocol Extensions</title><link>https://alamb.github.io/parquet-site/docs/file-format/binaryprotocolextensions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/binaryprotocolextensions/</guid><description>&lt;h1 id="binary-protocol-extensions"&gt;Binary Protocol Extensions&lt;/h1&gt;
&lt;p&gt;The extension mechanism of the &lt;code&gt;binary&lt;/code&gt; Thrift field-id &lt;code&gt;32767&lt;/code&gt; has some desirable properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Existing readers will ignore these extensions without any modifications&lt;/li&gt;
&lt;li&gt;Existing readers will ignore the extension bytes with little processing overhead&lt;/li&gt;
&lt;li&gt;The content of the extension is freeform and can be encoded in any format. This format is not restricted to Thrift.&lt;/li&gt;
&lt;li&gt;Extensions can be appended to existing Thrift serialized structs &lt;a href="https://alamb.github.io/parquet-site/docs/file-format/binaryprotocolextensions/#appending-extensions-to-thrift"&gt;without requiring Thrift libraries&lt;/a&gt; for manipulation (or changes to the Thrift IDL).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Because only one field-id is reserved, the extension bytes themselves require
disambiguation; otherwise readers will not be able to decode extensions safely.
This is left to implementers, who MUST put enough unique state in their extension
bytes for disambiguation. This can be achieved relatively easily by adding a
&lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt; at the
start or end of the extension bytes. The extension does not specify a
disambiguation mechanism to allow more flexibility to implementers.&lt;/p&gt;</description></item><item><title>Configurations</title><link>https://alamb.github.io/parquet-site/docs/file-format/configurations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/configurations/</guid><description>&lt;h3 id="row-group-size"&gt;Row Group Size&lt;/h3&gt;
&lt;p&gt;Larger row groups allow for larger column chunks, which makes it
possible to do larger sequential IO. Larger groups also require more buffering in
the write path (or a two-pass write). We recommend large row groups (512MB - 1GB).
Since an entire row group might need to be read, we want it to completely fit on
one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An
optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block
per HDFS file.&lt;/p&gt;</description></item><item><title>Extensibility</title><link>https://alamb.github.io/parquet-site/docs/file-format/extensibility/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/extensibility/</guid><description>&lt;p&gt;There are many places in the format for compatible extensions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;File Version: The file metadata contains a version.&lt;/li&gt;
&lt;li&gt;Encodings: Encodings are specified by enum and more can be added in the future.&lt;/li&gt;
&lt;li&gt;Page types: Additional page types can be added and safely skipped.&lt;/li&gt;
&lt;/ul&gt;</description></item><item><title>Metadata</title><link>https://alamb.github.io/parquet-site/docs/file-format/metadata/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/metadata/</guid><description>&lt;p&gt;There are two types of metadata: file metadata, and page header metadata.&lt;/p&gt;
&lt;p&gt;All Thrift structures are serialized using the TCompactProtocol. The full
definition of these structures is given in the Parquet
&lt;a href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift"&gt;Thrift definition&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="file-metadata"&gt;File metadata&lt;/h2&gt;
&lt;p&gt;In the diagram below, file metadata is described by the &lt;code&gt;FileMetaData&lt;/code&gt;
structure. This file metadata provides offset and size information useful
when navigating the Parquet file.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://alamb.github.io/parquet-site/images/FileMetaData.svg" alt="Parquet Metadata format"&gt;&lt;/p&gt;
&lt;h2 id="page-header"&gt;Page header&lt;/h2&gt;
&lt;p&gt;Page header metadata (&lt;code&gt;PageHeader&lt;/code&gt; and children in the diagram) is stored
in-line with the page data, and is used in the reading and decoding of data.&lt;/p&gt;</description></item><item><title>Nested Encoding</title><link>https://alamb.github.io/parquet-site/docs/file-format/nestedencoding/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/nestedencoding/</guid><description>&lt;p&gt;To encode nested columns, Parquet uses the Dremel encoding with definition and
repetition levels. Definition levels specify how many optional fields in the
path for the column are defined. Repetition levels specify at which repeated field
in the path the value is repeated. The max definition and repetition levels can
be computed from the schema (i.e. how much nesting there is). This defines the
maximum number of bits required to store the levels (levels are defined for all
values in the column).&lt;/p&gt;</description></item><item><title>Bloom Filter</title><link>https://alamb.github.io/parquet-site/docs/file-format/bloomfilter/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/bloomfilter/</guid><description>&lt;h1 id="parquet-bloom-filter"&gt;Parquet Bloom Filter&lt;/h1&gt;
&lt;h3 id="problem-statement"&gt;Problem statement&lt;/h3&gt;
&lt;p&gt;In their current format, column statistics and dictionaries can be used for predicate
pushdown. Statistics include minimum and maximum value, which can be used to filter out
values not in the range. Dictionaries are more specific, and readers can filter out values
that are between min and max but not in the dictionary. However, when there are too many
distinct values, writers sometimes choose not to add dictionaries because of the extra
space they occupy. This leaves columns with large cardinalities and widely separated min
and max without support for predicate pushdown.&lt;/p&gt;</description></item><item><title>Nulls</title><link>https://alamb.github.io/parquet-site/docs/file-format/nulls/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/nulls/</guid><description>&lt;p&gt;Nullity is encoded in the definition levels (which are run-length encoded). NULL values
are not encoded in the data. For example, in a non-nested schema, a column with 1000 NULLs
would be encoded with run-length encoding (0, 1000 times) for the definition levels and
nothing else.&lt;/p&gt;</description></item><item><title>Page Index</title><link>https://alamb.github.io/parquet-site/docs/file-format/pageindex/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/pageindex/</guid><description>&lt;h1 id="parquet-page-index-layout-to-support-page-skipping"&gt;Parquet Page Index: Layout to Support Page Skipping&lt;/h1&gt;
&lt;p&gt;In Parquet, a &lt;em&gt;page index&lt;/em&gt; is optional metadata for a
ColumnChunk, containing statistics for DataPages that can be used
to skip those pages when scanning in ordered and unordered columns.
The page index is stored using the OffsetIndex and ColumnIndex structures,
defined in &lt;a href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift"&gt;&lt;code&gt;parquet.thrift&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="problem-statement"&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;In previous versions of the format, Statistics were stored for ColumnChunks in
ColumnMetaData and for individual pages inside DataPageHeader structs. When
reading pages, a reader had to process the page header to determine
whether the page could be skipped based on the statistics. This means the reader
had to access all pages in a column, thus likely reading most of the column
data from disk.&lt;/p&gt;</description></item><item><title>Implementation status</title><link>https://alamb.github.io/parquet-site/docs/file-format/implementationstatus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/implementationstatus/</guid><description>&lt;p&gt;This page summarizes the features supported by different Parquet
implementations.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: If you find out-of-date information, please help us improve the accuracy
of this page by opening an issue or submitting a pull request.&lt;/p&gt;
&lt;h3 id="legend"&gt;Legend&lt;/h3&gt;
&lt;p&gt;The value in each box means:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;✅: supported. Footnote added when support is partial. When data is available, links to release notes are provided on the implementing version.&lt;/li&gt;
&lt;li&gt;❌: not supported&lt;/li&gt;
&lt;li&gt;(R): only read support&lt;/li&gt;
&lt;li&gt;(W): only write support&lt;/li&gt;
&lt;li&gt;(blank): no data&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="implementations"&gt;Implementations&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow/tree/main/cpp/src/parquet"&gt;arrow&lt;/a&gt; (C++)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/parquet-java"&gt;parquet-java&lt;/a&gt; (Java)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-go/tree/main/parquet"&gt;arrow-go&lt;/a&gt; (Go)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/apache/arrow-rs/blob/main/parquet/README.md"&gt;arrow-rs&lt;/a&gt; (Rust)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rapidsai/cudf"&gt;cudf&lt;/a&gt; (cuDF C++)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/hyparam/hyparquet"&gt;hyparquet&lt;/a&gt; (JavaScript)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/duckdb/duckdb"&gt;duckdb&lt;/a&gt; (C++)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pola-rs/polars"&gt;polars&lt;/a&gt; (Rust)&lt;/li&gt;
&lt;/ul&gt;
&lt;!-- Status source in data/implementations --&gt;
&lt;h3 id="physical-types"&gt;&lt;a href="#physical-types"&gt;Physical types&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;Physical types are defined by the &lt;a href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32"&gt;&lt;code&gt;enum Type&lt;/code&gt; in parquet.thrift&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Parquet format versions</title><link>https://alamb.github.io/parquet-site/docs/file-format/versions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://alamb.github.io/parquet-site/docs/file-format/versions/</guid><description>&lt;p&gt;This page describes how features are added to the &lt;a href="https://github.com/apache/parquet-format"&gt;Parquet format
specification&lt;/a&gt; and how they affect
reader and writer compatibility. See the
&lt;a href="../implementationstatus/"&gt;Implementation status&lt;/a&gt; page for which implementations
(arrow, parquet-java, arrow-rs, etc.) support each feature.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: If you find out-of-date information, please open an issue or pull request.&lt;/p&gt;
&lt;h2 id="feature-compatibility"&gt;Feature compatibility&lt;/h2&gt;
&lt;p&gt;The Parquet format spec &lt;a href="https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#compatibility-and-feature-enablement"&gt;classifies changes&lt;/a&gt; by their effect on reader and
writer compatibility. Changes differ in their &lt;em&gt;forward&lt;/em&gt; compatibility — whether
an older reader can read files that use a newer feature.&lt;/p&gt;</description></item></channel></rss>