
parquet & DuckDB


Ingest and split data in a flow

https://docs.outerbounds.com/recsys-tutorial-L2/

Given that our dataset is in a Parquet file, in this lesson you will learn how to leverage DuckDB, an open-source, hyper-performant database for analytics workloads. You can follow along with the code in this flow. DuckDB has become popular as a fast way to keep Pandas DataFrame interfaces while processing data faster and consuming less memory, as demonstrated in public benchmarks.
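As a minimal sketch of that workflow (assuming the same cleaned_spotify_dataset.parquet file used in the flow below), DuckDB can run SQL directly against the Parquet file and hand the result back as an ordinary pandas DataFrame:

import duckdb

# Query the Parquet file with SQL and fetch the result as a pandas DataFrame.
top_pairs = duckdb.sql("""
    SELECT user_id, artist, COUNT(*) AS n_tracks
    FROM 'cleaned_spotify_dataset.parquet'
    GROUP BY user_id, artist
    ORDER BY n_tracks DESC
    LIMIT 10
""").df()
print(top_pairs)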

In this flow, you will see how to prepare the dataset using DuckDB queries. The data is then split into train, validation, and test splits. In general, it is good practice to have a validation set for choosing the best hyperparameters and a held out test set to give an estimate of performance on unseen data. Later, you will extend this flow to evaluate, tune, and deploy a model to make real-time predictions.

from metaflow import FlowSpec, step, S3, Parameter, current

class DataFlow(FlowSpec):

    IS_DEV = Parameter(
        name='is_dev',
        help='Flag for dev runs, which use a smaller subsampled dataset',
        default='1'
    )
    
    @step
    def start(self):
        self.next(self.prepare_dataset)

    @step
    def prepare_dataset(self):
        """
        Get the data in the right shape by reading the parquet dataset
        and using DuckDB SQL-based wrangling to quickly prepare the datasets for
        training our Recommender System.
        """
        import duckdb
        import numpy as np
        con = duckdb.connect(database=':memory:')
        con.execute("""
            CREATE TABLE playlists AS 
            SELECT *, 
            CONCAT(user_id, '-', playlist) as playlist_id,
            CONCAT(artist, '|||', track) as track_id
            FROM 'cleaned_spotify_dataset.parquet'
            ;
        """)
        con.execute("SELECT * FROM playlists LIMIT 1;")
        print(con.fetchone())
        columns = ['row_id', 'user_id', 'track_id', 'playlist_id', 'artist']
        for col in columns:
            con.execute("SELECT COUNT(DISTINCT {}) FROM playlists;".format(col))
            print("# of {}".format(col), con.fetchone()[0])
        sampling_cmd = ''
        if self.IS_DEV == '1':
            print("Subsampling data, since this is DEV")
            sampling_cmd = ' USING SAMPLE 10 PERCENT (bernoulli)'
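        # Build one row per playlist: the ordered artist/track sequences, plus the
        # same track sequence with its last item removed (track_test_x) and that
        # last track kept as the prediction target (track_test_y).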
        dataset_query = """
            SELECT * FROM
            (   
                SELECT 
                    playlist_id,
                    LIST(artist ORDER BY row_id ASC) as artist_sequence,
                    LIST(track_id ORDER BY row_id ASC) as track_sequence,
                    array_pop_back(LIST(track_id ORDER BY row_id ASC)) as track_test_x,
                    LIST(track_id ORDER BY row_id ASC)[-1] as track_test_y
                FROM 
                    playlists
                GROUP BY playlist_id 
                HAVING len(track_sequence) > 2
            ) 
            {}
            ;
            """.format(sampling_cmd)
        con.execute(dataset_query)
        df = con.fetch_df()
        print("# rows: {}".format(len(df)))
        print(df.iloc[0].tolist())
        con.close()
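        # Shuffle the rows, then split 70% / 20% / 10% into train / validation / test.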
        train, validate, test = np.split(
            df.sample(frac=1, random_state=42), 
            [int(.7 * len(df)), int(.9 * len(df))])
        self.df_dataset = df
        self.df_train = train
        self.df_validate = validate
        self.df_test = test
        print("# testing rows: {}".format(len(self.df_test)))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DataFlow()

 

parquet

https://www.databricks.com/glossary/what-is-parquet

What is Parquet?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.

Characteristics of Parquet

  • Free and open source file format.
  • Language agnostic.
  • Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.
  • Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.
  • Highly efficient data compression and decompression.
  • Supports complex data types and advanced nested data structures.

Benefits of Parquet

  • Good for storing big data of any kind (structured data tables, images, videos, documents).
  • Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.
  • Increased data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values need not read the entire row of data.

Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store the data. Parquet is optimized to work with complex data in bulk and offers several options for efficient data compression and encoding. This approach is especially good for queries that need to read only certain columns from a large table: Parquet can read just the needed columns, greatly minimizing IO.
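A quick sketch of what that means in practice (using pandas and the dataset from the flow above; any Parquet file and column subset behaves the same way): asking for two columns reads only those two column chunks from disk, never the full rows.

import pandas as pd

# Only the 'artist' and 'track' columns are read from the Parquet file;
# the other columns are skipped entirely, which is the IO saving described above.
df = pd.read_parquet("cleaned_spotify_dataset.parquet", columns=["artist", "track"])
print(df.head())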

Advantages of Storing Data in a Columnar Format:

  • Columnar storage like Apache Parquet is designed to bring efficiency compared to row-based files like CSV. When querying columnar storage, you can skip over non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented databases. This way of storing data translates into hardware savings and lower latency for accessing data.
  • Apache Parquet was built from the ground up, which is why it is able to support advanced nested data structures. The layout of Parquet data files is optimized for queries that process large volumes of data, in the gigabyte range for each individual file.
  • Parquet is built to support flexible compression options and efficient encoding schemes. Since the values within a column tend to be of the same type and similar to one another, compressing each column is straightforward (which also makes queries faster). Data can be compressed with one of several available codecs, so different data files can be compressed differently (see the short sketch after this list).
  • Apache Parquet works best with interactive and serverless technologies like AWS Athena, Amazon Redshift Spectrum, Google BigQuery and Google Dataproc.
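As the compression sketch referenced above (hypothetical output file names; pandas with the pyarrow engine is assumed), writing the same table with different codecs is a single argument:

import pandas as pd

df = pd.read_parquet("cleaned_spotify_dataset.parquet")

# Same data, two codecs: column-wise encoding is handled automatically,
# and the chosen codec is applied on top of it.
df.to_parquet("spotify_snappy.parquet", compression="snappy")  # fast, default
df.to_parquet("spotify_zstd.parquet", compression="zstd")      # usually smaller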

Difference Between Parquet and CSV

CSV is a simple and common format that is used by many tools such as Excel, Google Sheets, and numerous others. Even though CSV files are the default format for many data processing pipelines, they have some disadvantages, especially where cloud pricing is tied to how much data is stored, scanned, or processed:

  • Amazon Athena and Spectrum will charge based on the amount of data scanned per query.
  • Google and Amazon will charge you according to the amount of data stored on GS/S3.
  • Google Dataproc charges are time-based.

Parquet has helped its users reduce storage requirements by at least one-third on large datasets; in addition, it has greatly improved scan and deserialization times and hence overall costs. The following table compares the savings and speedup obtained by converting data from CSV to Parquet.

Dataset                              | Size on Amazon S3 | Query Run Time | Data Scanned             | Cost
Data stored as CSV files             | 1 TB              | 236 seconds    | 1.15 TB                  | $5.75
Data stored in Apache Parquet format | 130 GB            | 6.78 seconds   | 2.51 GB                  | $0.01
Savings                              | ~87% less storage | ~35x faster    | ~99.8% less data scanned | ~99.8% lower cost
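Getting from the CSV row to the Parquet row above typically starts with a one-off conversion; here is a minimal sketch with DuckDB (the file names are placeholders):

import duckdb

con = duckdb.connect()
# Read the CSV (schema inferred) and rewrite it as a compressed Parquet file.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('events.csv'))
    TO 'events.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")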

DuckDB

https://duckdb.org/

DuckDB is a fast, portable database system.

Query and transform your data anywhere using DuckDB's feature-rich SQL dialect.

-- Get the top-3 busiest train stations
SELECT
    station_name,
    count(*) AS num_services
FROM train_services
GROUP BY ALL
ORDER BY num_services DESC
LIMIT 3;

DuckDB at a glance

 

Simple

DuckDB is easy to install and deploy. It has zero external dependencies and runs in-process in its host application or as a single binary.


Portable

DuckDB runs on Linux, macOS, Windows, and all popular hardware architectures. It has idiomatic client APIs for major programming languages.


Feature-rich

DuckDB offers a rich SQL dialect. It can read and write file formats such as CSV, Parquet, and JSON, to and from the local file system and remote endpoints such as S3 buckets.
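For example (a sketch; the URL is a placeholder for any HTTP(S)- or S3-hosted Parquet file), once the httpfs extension is loaded, a remote file can be queried like a local table:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # one-time download of the extension
con.execute("LOAD httpfs")     # enables http(s):// and s3:// paths

# Count the rows of a remote Parquet file without downloading it manually.
n = con.execute(
    "SELECT COUNT(*) FROM 'https://example.com/data/events.parquet'"
).fetchone()[0]
print(n)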


Fast

DuckDB runs analytical queries at blazing speed thanks to its columnar engine, which supports parallel execution and can process larger-than-memory workloads.


Extensible

DuckDB is extensible by third-party features such as new data types, functions, file formats and new SQL syntax.


Free

DuckDB and its core extensions are open-source under the permissive MIT License.


 

DuckDB: In-Process Python Analytics for Not-So-Big Data

https://juejin.cn/post/7375152719482437666

Is Big Data Dead?

In short, DuckDB is a fast database with revolutionary intent: it makes single-machine analytics feasible even for very large datasets, and it questions the necessity of big-data-based solutions.

In a widely circulated 2023 post on the MotherDuck blog, provocatively titled "Big Data is Dead", Jordan Tigani argued that "most applications do not need to process massive amounts of data".

He wrote: "The amount of data processed for analytics workloads is almost certainly smaller than you think." It therefore makes sense to consider simple, single-machine analytics software before investing in a more expensive data warehouse or distributed analytics system.


 

Why should quants use DuckDB?

http://www.jieyu.ai/blog/2024/02/01/why-should-you-use-duckdb/


The previous note showed how to run SQL over DataFrames via DuckDB. It also highlighted DuckDB's distinctive ASOF JOIN feature, which is particularly handy because quants constantly need to align market data across different bar periods (a short sketch of an ASOF JOIN appears after the list below). But DuckDB's strengths do not end there:

  • A drop-in replacement for sqlite, with a SQL feature set that even surpasses Postgres
  • Excellent ease of use
  • A performance monster
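Here is a minimal sketch of the ASOF JOIN mentioned above (table names, symbols, and timestamps are made up for illustration): each trade is matched with the most recent quote at or before its timestamp.

import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE quotes AS SELECT * FROM (VALUES
        ('000001', TIMESTAMP '2024-01-02 09:30:00', 10.00),
        ('000001', TIMESTAMP '2024-01-02 09:31:00', 10.05)
    ) AS q(symbol, ts, bid)
""")
con.execute("""
    CREATE TABLE trades AS SELECT * FROM (VALUES
        ('000001', TIMESTAMP '2024-01-02 09:30:30', 10.02),
        ('000001', TIMESTAMP '2024-01-02 09:31:45', 10.06)
    ) AS t(symbol, ts, price)
""")

# ASOF JOIN: for each trade, take the latest quote with quote.ts <= trade.ts.
print(con.execute("""
    SELECT t.symbol, t.ts, t.price, q.bid
    FROM trades t
    ASOF JOIN quotes q
      ON t.symbol = q.symbol AND t.ts >= q.ts
""").df())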

As yet another piece of software coming out of the sparsely populated Netherlands, that cold northern land delivers a pleasant surprise once again. In technology, simply throwing people at the problem really doesn't help.


sqlite is extremely lightweight, using memory or a single file as storage, with no need for the complex server installation and configuration other databases require. It is the most widely deployed database in the world -- but that may change: in terms of functionality alone, DuckDB not only implements everything sqlite offers, its SQL dialect is arguably even richer than Postgres's. And on performance, sqlite is nowhere close -- in fact, on joins and group-bys over large datasets, the H2O.ai benchmarks show DuckDB ranking first even against star players such as ClickHouse and Polars!

 

Why is DuckDB becoming more and more popular?

https://zhuanlan.zhihu.com/p/666164886

There are many reasons why companies are now building products on DuckDB. The database is designed for fast analytical queries, so it is optimized for aggregations, joins, and complex queries over large datasets -- the kinds of queries typically used in analytics and reporting. In addition:

  • It is easy to install, deploy, and use. There is no server to configure -- DuckDB runs embedded inside the application. This makes it easy to integrate into different programming languages and environments.
  • Despite its simplicity, DuckDB has a rich feature set. It supports the full SQL standard, transactions, and secondary indexes, and integrates well with popular data-analysis languages such as Python and R.
  • DuckDB is free; anyone can use and modify it, which lowers the barrier to adoption for developers and data analysts.
  • DuckDB is well tested and stable. It has an extensive test suite and runs continuous integration and testing on a variety of platforms to ensure stability.
  • DuckDB delivers performance comparable to specialized OLAP databases while being much easier to deploy. This makes it suitable both for analytical queries over small and medium datasets and for large enterprise datasets.

In short, DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of a specialized columnar database. All of these factors -- performance, simplicity, features, and open-source licensing -- have made DuckDB increasingly popular among developers and data analysts.
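A small sketch of that embedding in Python (the DataFrame contents are made up): DuckDB can see pandas DataFrames in the local scope and query them by name, with no server or loading step.

import duckdb
import pandas as pd

# An in-memory DataFrame; DuckDB queries it by variable name via replacement scans.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "amount": [120, 80, 200, 150],
})

totals = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
""").df()
print(totals)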

 

Architecture Overview

https://zhuanlan.zhihu.com/p/626311150

The architecture of DuckDB's components is very much "textbook": the database is divided into a parser, logical planner, optimizer, physical planner, execution engine, and transaction and storage managers.

As an embedded database, DuckDB has no client protocol interface or server process; it is accessed through a C/C++ API. In addition, DuckDB provides a SQLite compatibility layer that allows applications that previously used SQLite to use DuckDB by re-linking or via library overloading.

  • Parser
    • The SQL parser is derived from the Postgres SQL parser

 

  • Logical Planner
    • The binder resolves expressions that reference objects in the schema (such as tables or views), matching them to column names and types.
    • The plan generator converts the AST produced by the binder into a tree of basic logical query operators, yielding a type-resolved logical query plan. DuckDB also keeps statistics on the stored data and propagates them as part of the different expression trees during planning; these statistics are used by the optimizer itself and also to prevent integer overflow.

 

  • Optimizer
    • Join ordering is optimized with dynamic programming, falling back to a greedy algorithm for complex join graphs.
    • All subqueries are eliminated.
    • A set of rewrite rules simplifies the expression tree, for example common subexpression elimination and constant folding.
    • Cardinality estimation is done with a combination of sampling and HyperLogLog; this process yields an optimized logical plan.
    • The physical planner converts the logical plan into a physical plan, choosing suitable implementations where applicable, for example a sort-merge join or a hash join (a small example of inspecting the resulting plan follows this list).
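From Python, one way to peek at what the planner produced is EXPLAIN (a sketch; the table and query are arbitrary), which prints the physical plan DuckDB chose:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT range AS id, range % 10 AS grp FROM range(1000)")

# EXPLAIN returns the physical plan selected by the optimizer/physical planner
# (e.g. a hash aggregate here); EXPLAIN ANALYZE would also run it and report timings.
for key, plan in con.execute("EXPLAIN SELECT grp, COUNT(*) FROM t GROUP BY grp").fetchall():
    print(key)
    print(plan)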

 

  • Execution Engine
    • DuckDB originally used a pull-based "Vector Volcano" execution engine and later switched to push-based pipeline execution.
    • DuckDB uses vectorized execution to speed up computation, with many internally implemented vector types and vectorized operators.
    • For portability reasons there is no JIT compilation, since a JIT engine depends on large compiler libraries (such as LLVM) with additional transitive dependencies.

 

  • Transaction Management
    • … for use by concurrent transactions and aborts

 

  • Persistent Storage
    • Single-file storage.
    • DuckDB uses the read-optimized DataBlocks storage layout (a single file). Logical tables are horizontally partitioned into chunks of columns, which are compressed into physical blocks using lightweight compression methods. Each block carries min/max indexes for every column, so it can quickly be determined whether the block is relevant to a query; in addition, each block carries a lightweight per-column index that can further limit the number of values scanned.

 

The vectors, execution, and storage components are covered in a bit more detail below; the other parts are skipped for now…

Vectors

Internally, DuckDB is a vectorized, push-based model: during execution, vectors flow between operators rather than individual tuples.

DuckDB has a large number of custom vector formats, quite similar to Arrow but friendlier for execution; they were designed together with the Velox team, so they share many similarities with Velox's vectors.

Each vector holds data of a single type. There is a distinction between the logical array and its actual physical implementation; to the operators it always looks like a consistent logical array.

The post illustrates four types of vectors, each with its own logical and physical representation.

 

From: https://www.cnblogs.com/lightsong/p/18638573
