parquet_s3_fdw is a PostgreSQL Foreign Data Wrapper (FDW) for accessing Parquet files on a local file system and on Amazon S3. It is a read-only Apache Parquet foreign data wrapper with S3 support. This version of parquet_s3_fdw works with PostgreSQL 13.
Installation
Install dependent libraries
parquet_s3_fdw requires libarrow and libparquet installed on your system (version 0.15+; for earlier versions use the arrow-0.14 branch). Please refer to the building guide. The AWS SDK for C++ (libaws-cpp-sdk-core, libaws-cpp-sdk-s3) is also required (confirmed with version 1.8.14). Attention! We recommend building libarrow, libparquet, and the AWS SDK for C++ from source. Linking failed for us with pre-compiled binaries because the gcc versions used for Arrow and the AWS SDK differ.
Build and install parquet_s3_fdw
make install
or, when PostgreSQL is installed in a custom location: make install PG_CONFIG=/path/to/pg_config
Additional compilation flags can be passed through either the custom CCFLAGS variable or the standard PG_CFLAGS, PG_CXXFLAGS, PG_CPPFLAGS variables.
Usage
Load extension CREATE EXTENSION parquet_s3_fdw;
Create server CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw;
If using MinIO instead of AWS S3, use the use_minio option when creating the server: CREATE SERVER parquet_s3_srv FOREIGN DATA WRAPPER parquet_s3_fdw OPTIONS (use_minio 'true');
Create user mapping CREATE USER MAPPING FOR public SERVER parquet_s3_srv OPTIONS (user 's3user', password 's3password');
You have to specify a user name and password when accessing Amazon S3.
Create foreign table Now you should be able to create a foreign table from Parquet files. parquet_s3_fdw currently supports a limited set of column types (to be extended shortly) and does not support structs and nested lists.
Example:
CREATE FOREIGN TABLE userdata (
id int,
first_name text,
last_name text
)
SERVER parquet_s3_srv
OPTIONS (
filename 's3://bucket/dir/userdata1.parquet'
);
The following options are supported:
- filename - space-separated list of paths to Parquet files to read. A path on AWS S3 is specified by starting it with s3://. Mixing local paths and S3 paths is not supported;
- dirname - path to a directory containing Parquet files to read (see the example after this list);
- sorted - space-separated list of columns that the Parquet files are presorted by; this helps postgres avoid redundant sorting when running a query with an ORDER BY clause, or in other cases when a presorted set is beneficial (Group Aggregate, Merge Join);
- files_in_order - specifies that the files listed in filename or returned by files_func are ordered according to the sorted option and have no range-wise intersection; this allows a Gather Merge node to be used on top of a parallel Multifile scan (default false);
- use_mmap - whether memory map operations will be used instead of file read operations (default false);
- use_threads - enables Apache Arrow's parallel column decoding/decompression (default false);
- files_func - user defined function that is used by parquet_s3_fdw to retrieve the list of parquet files on each query; function must take one JSONB argument and return text array of full paths to parquet files;
- files_func_arg - argument for the function specified by files_func;
- max_open_files - the limit for the number of Parquet files open simultaneously.
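For example, a table over a whole directory of presorted files might look like this (a minimal sketch; the bucket, directory, and column names are hypothetical):
CREATE FOREIGN TABLE userdata_dir (
    id int,
    first_name text,
    last_name text
)
SERVER parquet_s3_srv
OPTIONS (
    dirname 's3://bucket/dir',  -- read all Parquet files in this directory
    sorted 'id'                 -- files are presorted by id
);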
A foreign table may be created for a single Parquet file or for a set of files. It is also possible to specify a user-defined function that returns a list of file paths (see the sketch below). Depending on the number of files and the table options, parquet_s3_fdw may use one of several execution strategies, including the Multifile and Multifile Merge readers controlled by the GUC variables below.
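A sketch of a foreign table driven by such a function (assuming list_parquet_s3_files is defined as in the Import section below; the directory argument is hypothetical):
CREATE FOREIGN TABLE userdata_func (
    id int,
    first_name text,
    last_name text
)
SERVER parquet_s3_srv
OPTIONS (
    files_func 'list_parquet_s3_files',             -- called on each query to list the files
    files_func_arg '{"dir": "/path/to/directory"}'  -- passed to the function as jsonb
);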
GUC variables:
- parquet_fdw.use_threads - global switch that allows the user to enable or disable threads (default true);
- parquet_fdw.enable_multifile - enable the Multifile reader (default true);
- parquet_fdw.enable_multifile_merge - enable the Multifile Merge reader (default true).
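For example, threads and the Multifile Merge reader can be toggled per session (a usage sketch of the GUC variables listed above):
SET parquet_fdw.use_threads = false;           -- disable multi-threaded decoding
SET parquet_fdw.enable_multifile_merge = true; -- keep the Multifile Merge reader enabled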
Parallel queries
parquet_s3_fdw also supports parallel query execution (not to be confused with Apache Arrow's multi-threaded decoding feature).
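A minimal way to try it (a sketch; whether workers are actually launched depends on your configuration and table size):
SET max_parallel_workers_per_gather = 4;  -- standard PostgreSQL parallel-query setting
EXPLAIN SELECT count(*) FROM userdata;    -- the plan may include a Gather node with parallel workers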
Import
parquet_s3_fdw also supports the IMPORT FOREIGN SCHEMA command to discover Parquet files in the specified filesystem directory and create foreign tables according to those files. It can be used as follows:
IMPORT FOREIGN SCHEMA "/path/to/directory"
FROM SERVER parquet_s3_srv
INTO public;
It is important that remote_schema here is a path to a local filesystem directory and is double-quoted.
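Since IMPORT FOREIGN SCHEMA accepts the standard OPTIONS clause, table options such as sorted (or schemaless, described later) can be passed through; a hedged sketch:
IMPORT FOREIGN SCHEMA "/path/to/directory"
FROM SERVER parquet_s3_srv
INTO public
OPTIONS (sorted 'one');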
Another way to import parquet files into foreign tables is to use import_parquet_s3 or import_parquet_s3_explicit:
CREATE FUNCTION import_parquet_s3(
tablename text,
schemaname text,
servername text,
userfunc regproc,
args jsonb,
options jsonb)
CREATE FUNCTION import_parquet_s3_explicit(
tablename text,
schemaname text,
servername text,
attnames text[],
atttypes regtype[],
userfunc regproc,
args jsonb,
options jsonb)
The only difference between import_parquet_s3 and import_parquet_s3_explicit is that the latter allows specifying a set of attributes (columns) to import. attnames and atttypes here are arrays of attribute names and attribute types respectively (see the example below).
userfunc is a user-defined function. It must take a jsonb argument and return a text array of filesystem paths to the Parquet files to be imported. args is a user-specified jsonb object that is passed to userfunc as its argument. A simple implementation of such a function and its usage may look like this:
CREATE FUNCTION list_parquet_s3_files(args jsonb)
RETURNS text[] AS
$$
BEGIN
    -- collect the full path of every *.parquet file in the given directory
    RETURN (SELECT array_agg(args->>'dir' || '/' || filename)
            FROM pg_ls_dir(args->>'dir') AS files(filename)
            WHERE filename ~~ '%.parquet');
END
$$
LANGUAGE plpgsql;
SELECT import_parquet_s3_explicit(
'abc',
'public',
'parquet_s3_srv',
array['one', 'three', 'six'],
array['int8', 'text', 'bool']::regtype[],
'list_parquet_s3_files',
'{"dir": "/path/to/directory"}',
'{"sorted": "one"}'
);
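The non-explicit variant takes the same arguments minus the attribute arrays (a usage sketch built from the import_parquet_s3 signature above):
SELECT import_parquet_s3(
    'abc',
    'public',
    'parquet_s3_srv',
    'list_parquet_s3_files',
    '{"dir": "/path/to/directory"}',
    '{"sorted": "one"}'
);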
Features
- Support SELECT on Parquet files on the local file system or Amazon S3.
- Support MinIO access instead of Amazon S3.
- Allow control over whether foreign servers keep connections open after transaction completion. This is controlled by the keep_connections option and defaults to on.
- Provide the parquet_s3_fdw_get_connections() function to report open foreign server connections.
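For example (a usage sketch; the exact result columns depend on the extension version):
SELECT * FROM parquet_s3_fdw_get_connections();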
Schemaless mode
This feature enables the user to work without a fixed schema:
- No specific foreign schema (column definition) is needed for each Parquet file.
- A schemaless foreign table has only one jsonb column representing the data from the Parquet file by the following rule: jsonb key: Parquet column name; jsonb value: Parquet column data.
Schemaless mode has several benefits:
- Flexibility over the data structure of Parquet files: by merging all column data into one jsonb column, a schemaless foreign table can query any Parquet file whose columns can all be mapped to PostgreSQL types.
- No pre-defined foreign table schema (column definition). The lack of a schema means that the foreign table queries all columns of the Parquet file, including those the user does not yet use.
Schemaless mode usage
Schemaless mode is controlled by the schemaless option:
- schemaless option is true: schemaless mode is enabled.
- schemaless option is false: schemaless mode is disabled (we call this non-schemaless mode).
- If the schemaless option is not configured, the default value is false.
The schemaless option is supported in CREATE FOREIGN TABLE, IMPORT FOREIGN SCHEMA, import_parquet_s3() and import_parquet_s3_explicit().
A schemaless foreign table needs at least one jsonb column to represent the data: if there is more than one jsonb column, only one column is populated and all other columns are filled with NULL; if there is no jsonb column, all columns are filled with NULL.
CREATE FOREIGN TABLE example_schemaless (
    id int,
    v jsonb
) SERVER parquet_s3_srv
OPTIONS (filename '/path/to/parquet_file', schemaless 'true');
SELECT * FROM example_schemaless;
id | v
----+---------------------------------------------------------------------------------------------------------------------------------
| {"one": 1, "six": "t", "two": [1, 2, 3], "five": "2018-01-01", "four": "2018-01-01 00:00:00", "seven": 0.5, "three": "foo"}
| {"one": 2, "six": "f", "two": [null, 5, 6], "five": "2018-01-02", "four": "2018-01-02 00:00:00", "seven": null, "three": "bar"}
(2 rows)
Create foreign table: with IMPORT FOREIGN SCHEMA, import_parquet_s3() and import_parquet_s3_explicit(), the foreign table is created with a fixed column definition, like below:
CREATE FOREIGN TABLE example (
    v jsonb
) SERVER parquet_s3_srv
OPTIONS (filename '/path/to/parquet_file', schemaless 'true');
Query data:
-- non-schemaless mode
SELECT * FROM example;
one | two | three | four | five | six | seven
-----+------------+-------+---------------------+------------+-----+-------
1 | {1,2,3} | foo | 2018-01-01 00:00:00 | 2018-01-01 | t | 0.5
2 | {NULL,5,6} | bar | 2018-01-02 00:00:00 | 2018-01-02 | f |
(2 rows)
-- schemaless mode
SELECT * FROM example_schemaless;
v
---------------------------------------------------------------------------------------------------------------------------------
{"one": 1, "six": "t", "two": [1, 2, 3], "five": "2018-01-01", "four": "2018-01-01 00:00:00", "seven": 0.5, "three": "foo"}
{"one": 2, "six": "f", "two": [null, 5, 6], "five": "2018-01-02", "four": "2018-01-02 00:00:00", "seven": null, "three": "bar"}
(2 rows)
Fetch values in a jsonb expression: use the ->> jsonb arrow operator, which returns text. The user may cast the jsonb expression to get the corresponding data representation. For example, in the expression v->>'col', the fetched value corresponds to the column named col in the Parquet file; we call such an expression a schemaless variable, or slvar.
SELECT v->>'two', sqrt((v->>'one')::int) FROM example_schemaless;
?column? | sqrt
--------------+--------------------
[1, 2, 3] | 1
[null, 5, 6] | 1.4142135623730951
(2 rows)
Some features differ from non-schemaless mode:
- Row-group filter support: in schemaless mode, parquet_s3_fdw can execute row-group filtering for the WHERE conditions below (see the sketch after this list):
  1. slvar::type {operator} const. For example: (v->>'int64_col')::int8 = 100
  2. const {operator} slvar::type. For example: 100 = (v->>'int64_col')::int8
  3. slvar::boolean is true/false. For example: (v->>'bool_col')::boolean is false
  4. !(slvar::boolean). For example: !(v->>'bool_col')::boolean
  5. jsonb exist operator: ((v->>'col')::jsonb) ? element, (v->'col') ? element and v ? 'col'
  The cast must match the Parquet column type; otherwise, the filter will be skipped.
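A query shaped to match the first pattern might look like this (a sketch; int64_col is a hypothetical Parquet column):
SELECT * FROM example_schemaless WHERE (v->>'int64_col')::int8 = 100;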
- To use a presorted column of a Parquet file, the user must: define the column name in the sorted option, the same as in non-schemaless mode, and use the slvar instead of the column name in the ORDER BY clause. If the sorted Parquet column is not a text column, add an explicit cast to the mapped type of that column.
CREATE FOREIGN TABLE example_sorted (v jsonb)
SERVER parquet_s3_srv
OPTIONS (filename '/path/to/example1.parquet /path/to/example2.parquet', sorted 'int64_col', schemaless 'true');
EXPLAIN (COSTS OFF) SELECT * FROM example_sorted ORDER BY (v->>'int64_col')::int8;
QUERY PLAN
--------------------------------
Foreign Scan on example_sorted
Reader: Multifile Merge
Row groups:
example1.parquet: 1, 2
example2.parquet: 1
(5 rows)
- Support for Arrow Nested List and Map: these types are treated as nested jsonb values, which can be accessed with the -> operator.
SELECT * FROM example_schemaless;
v
----------------------------------------------------------------------------
{"array_col": [19, 20], "jsonb_col": {"1": "foo", "2": "bar", "3": "baz"}}
{"array_col": [21, 22], "jsonb_col": {"4": "test1", "5": "test2"}}
(2 rows)
SELECT v->'array_col'->1, v->'jsonb_col'->'1' FROM example_schemaless;
?column? | ?column?
----------+----------
20 | "foo"
22 |
(2 rows)
- The PostgreSQL cost of evaluating (jsonb->>'col')::type is much higher than fetching a column directly in non-schemaless mode, so for some complex queries the schemaless query plan can differ from the non-schemaless plan.
For other features, schemaless mode works the same as non-schemaless mode.
Limitations
Modification (INSERT, UPDATE and DELETE) is not supported.
Transactions are not supported.
A single foreign table cannot use Parquet files on both the local file system and Amazon S3.
The AWS region is hard-coded as "ap-northeast-1". To use another region, you need to modify the source code by changing "AP_NORTHEAST_1" in parquet_s3_fdw_connection.cpp.
For queries that return a record type, parquet_s3_fdw only fills data for the columns referred to in the target list or other clauses; the remaining columns are filled with NULL.
Example:
-- columns c1 and c3 are referred to in the ORDER BY clause, so they are filled with values. The other columns c2, c4, c5, c6 are filled with NULL.
SELECT t1 FROM tbl t1 ORDER BY tbl.c3, tbl.c1;
t1
------------------
(101,,00101,,,,)
(102,,00102,,,,)
(2 rows)
The 4th and 5th arguments of the import_parquet_s3_explicit() function are meaningless in schemaless mode and should be passed as NULL. If they are not NULL, the WARNING below will occur:
WARNING: parquet_s3_fdw: attnames and atttypes are expected to be NULL. They are meaningless for schemaless table.
HINT: Schemaless table imported always contain "v" column with "jsonb" type.
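A schemaless import call therefore looks like this (a sketch based on the import_parquet_s3_explicit signature above; the paths are hypothetical):
SELECT import_parquet_s3_explicit(
    'example_import',
    'public',
    'parquet_s3_srv',
    NULL,  -- attnames: meaningless in schemaless mode
    NULL,  -- atttypes: meaningless in schemaless mode
    'list_parquet_s3_files',
    '{"dir": "/path/to/directory"}',
    '{"schemaless": "true"}'
);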
Schemaless mode does not support creating a partitioned table via CREATE TABLE parent_tbl (v jsonb) PARTITION BY RANGE ((v->>'a')::int).
https://github.com/pgspider/parquet_s3_fdw https://www.postgresql.org/docs/current/parallel-query.html