
Apache Arrow User Guide —— Reading and writing Parquet files

Date: 2023-01-14 11:08:08


Reading Parquet files

The arrow::FileReader class reads data for an entire file or row group into an ::arrow::Table. The StreamReader and StreamWriter classes allow for data to be read and written using a C++ input/output streams approach, field by field, column by column and row by row. This approach is offered for ease of use and type-safety. It is of course also useful when data must be streamed as files are read and written incrementally. Please note that the performance of the StreamReader and StreamWriter classes will not be as good, due to the type checking and the fact that column values are processed one at a time.

The Parquet arrow::FileReader requires a ::arrow::io::RandomAccessFile instance representing the input file. Finer-grained options are available through the arrow::FileReaderBuilder helper class.

#include "parquet/arrow/reader.h"

{
  // ...
  arrow::Status st;
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::shared_ptr<arrow::io::RandomAccessFile> input = ...;

  // Open Parquet file reader
  std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
  st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
  if (!st.ok()) {
    // Handle error instantiating file reader...
  }

  // Read entire file as a single Arrow table
  std::shared_ptr<arrow::Table> table;
  st = arrow_reader->ReadTable(&table);
  if (!st.ok()) {
    // Handle error reading Parquet data...
  }
}

The StreamReader allows for Parquet files to be read using standard C++ input operators, which ensures type-safety. Please note that types must match the schema exactly, i.e. if the schema field is an unsigned 16-bit integer then you must supply a uint16_t value. Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

  • Attempt to read a field by supplying the incorrect type.
  • Attempt to read beyond the end of a row.
  • Attempt to read beyond the end of the file.

#include "arrow/io/file.h"
#include "parquet/stream_reader.h"

{
  std::shared_ptr<arrow::io::ReadableFile> infile;

  PARQUET_ASSIGN_OR_THROW(
      infile, arrow::io::ReadableFile::Open("test.parquet"));

  parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};

  std::string article;
  float price;
  uint32_t quantity;

  while (!os.eof()) {
    os >> article >> price >> quantity >> parquet::EndRow;
    // ...
  }
}

Writing Parquet files

The arrow::WriteTable() function writes an entire ::arrow::Table to an output file.

#include "parquet/arrow/writer.h"

{
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(
      outfile, arrow::io::FileOutputStream::Open("test.parquet"));

  // Write the table using row groups of (at most) 3 rows.
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), outfile, /*chunk_size=*/3));
}

The StreamWriter allows for Parquet files to be written using standard C++ output operators. This type-safe approach also ensures that rows are written without omitting fields, and allows for new row groups to be created automatically (after a certain volume of data) or explicitly by using the EndRowGroup stream modifier. Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:

  • Attempt to write a field using an incorrect type.
  • Attempt to write too many fields in a row.
  • Attempt to skip a required field.

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

{
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(
      outfile, arrow::io::FileOutputStream::Open("test.parquet"));

  parquet::WriterProperties::Builder builder;
  std::shared_ptr<parquet::schema::GroupNode> schema;

  // Set up builder with required compression type etc.
  // Define schema.
  // ...

  parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

  // Loop over some data structure which provides the required
  // fields to be written and write each row.
  for (const auto& a : getArticles()) {
    os << a.name() << a.price() << a.quantity() << parquet::EndRow;
  }
}



The Parquet format is a space-efficient columnar storage format for complex data. The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.

Supported Parquet features

The Parquet format has many features, and Parquet C++ supports a subset of them.

Page types

Unsupported page type: INDEX_PAGE. When reading a Parquet file, pages of this type are ignored.

[Table: supported page types]


Compression

Unsupported compression codec: LZO.

[Table: supported compression codecs]


(1) On the read side, Parquet C++ is able to decompress both the regular LZ4 block format and the ad-hoc Hadoop LZ4 format used by the reference Parquet implementation. On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.

Encodings

[Table: supported encodings]


(1) Only supported for encoding definition and repetition levels, not values.

(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version 2.4 or greater is selected in WriterProperties::version().

Types

Physical types

[Table: supported physical types and their default Arrow mappings]


(1) Can be mapped to other Arrow types, depending on the logical type (see below).

(2) On the write side, ArrowWriterProperties::support_deprecated_int96_timestamps() must be enabled.

(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.

Logical types

Specific logical types can override the default Arrow type mapping for a given physical type.

[Table: supported logical types and their Arrow mappings]


(1) On the write side, the Parquet physical type INT32 is generated.

(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.

(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.

(4) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING.

(5) On the write side, an Arrow LargeList or FixedSizedList is also mapped to a Parquet LIST.

(6) On the read side, a key with multiple values does not get deduplicated, in contradiction with the Parquet specification.

Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary).

Converted types
While converted types are deprecated in the Parquet format (they are superseded by logical types), they are recognized and emitted by the Parquet C++ implementation so as to maximize compatibility with other Parquet implementations.

Special cases
An Arrow Extension type is written out as its storage type. It can still be recreated at read time using Parquet metadata (see "Roundtripping Arrow types" below).
An Arrow Dictionary type is written out as its value type. It can still be recreated at read time using Parquet metadata (see "Roundtripping Arrow types" below).

Roundtripping Arrow types
While there is no bijection between Arrow types and Parquet types, it is possible to serialize the Arrow schema as part of the Parquet file metadata. This is enabled using ArrowWriterProperties::store_schema().
On the read path, the serialized schema will be automatically recognized and will recreate the original Arrow data, converting the Parquet data as required (for example, a LargeList will be recreated from the Parquet LIST type).
As an example, when serializing an Arrow LargeList to Parquet, the data is written out as a Parquet LIST. When read back, the Parquet LIST data is decoded as an Arrow LargeList if ArrowWriterProperties::store_schema() was enabled when writing the file; otherwise, it is decoded as an Arrow List.

Serialization details
The Arrow schema is serialized as an Arrow IPC schema message, then base64-encoded and stored under the ARROW:schema metadata key in the Parquet file metadata.

Limitations
Writing or reading back FixedSizedList data with null entries is not supported.

Encryption
Parquet C++ implements all features specified in the encryption specification, except for encryption of column index and Bloom filter modules. More specifically, Parquet C++ supports:

  • AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms.
  • AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page, Data PageHeader, Dictionary PageHeader module types. Other module types (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not supported.
  • EncryptionWithFooterKey and EncryptionWithColumnKey modes.
  • Encrypted Footer and Plaintext Footer modes.

https://arrow.apache.org/docs/cpp/parquet.html


From: https://blog.51cto.com/feishujun/6007519
