首页 > 数据库 >SQLite vs Pandas

SQLite vs Pandas

时间:2023-04-30 23:46:21浏览次数:130  
标签:SQLite bonus self df vs ._ employee Pandas def

Analysis details

For the analysis, we ran the six tasks 10 times each, for 5 different sample sizes, for each of 3 programs: pandas, sqlite, and memory-sqlite (where database is in memory instead of on disk). See below for the definitions of each task.

Our sample data was randomly generated. Here’s what it looks like:

 

sql_vs_pandas$ head -n 5 data/sample.100.csv

 

qqFjQHQc,c,1981,82405.59262172286

vILuhVGz,a,1908,27712.27152250119

mwCjpoOF,f,1992,58974.38538762843

kGbriYAK,d,1927,42258.24179716961

MeoxuJng,c,1955,96907.56416314292

 

This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.

Results

sqlite or memory-sqlite is faster for the following tasks:

  • select two columns from data (<.1 millisecond for any data size for sqlite. pandas scales with the data, up to just under 0.5 seconds for 10 million records)
  • filter data (>10x-50x faster with sqlite. The difference is more pronounced as data grows in size)
  • sort by single column: pandas is always a bit slower, but this was the closest

pandas is faster for the following tasks:

  • groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records)
  • load data from disk (5x faster for >10k records, even better for smaller data)
  • join data (2-5x faster, but slower for smallest dataset of 1000 rows)

Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.

There is no significant speedup from loading sqlite in its own shell vs. via pandas.

Overall, joining and loading data is the slowest whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas‘ second-fastest operation for larger data).

Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.

All code is on our GitHub page.

 

Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).

sqlite is first, followed by pandas:

sort

def sort(self):

	self._cursor.execute(‘SELECT * FROM employee ORDER BY name ASC;’)

	self._conn.commit()

	 

	def sort(self):

	self.df_employee.sort_values(by=’name’)

  

 

select

def select(self):

    self._cursor.execute(‘SELECT name, dept FROM employee;’)

    self._conn.commit()

    def select(self):

    self.df_employee[[“name”, “dept”]]

 

 

load

def load(self):

    self._cursor.execute(‘CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);’)

    df = pd.read_csv(self.employee_file)

    df.columns = employee_columns

    df.to_sql(’employee’, self._conn, if_exists=’replace’)

     

    self._cursor.execute(‘CREATE TABLE bonus (name varchar(255), bonus double);’)

    df_bonus = pd.read_csv(self.bonus_file)

    df_bonus.columns = bonus_columns

    df_bonus.to_sql(‘bonus’, self._conn, if_exists=’replace’)

 

def load(self):

    self.df_employee = pd.read_csv(self.employee_file)

    self.df_employee.columns = employee_columns

     

    self.df_bonus = pd.read_csv(self.bonus_file)

    self.df_bonus.columns = bonus_columns

 

 

join

def join(self):

    self._cursor.execute(‘SELECT employee.name, employee.salary + bonus.bonus ‘

    ‘FROM employee INNER JOIN bonus ON employee.name = bonus.name’)

    self._conn.commit()

 

def join(self):

    joined = self.df_employee.merge(self.df_bonus, on=’name’)

    joined[‘total’] = joined[‘bonus’] + joined[‘salary’]

 

 

filter

def filter(self):

	self._cursor.execute(‘SELECT * FROM employee WHERE dept = “a”;’)

	self._conn.commit()

	 

	def filter(self):

	self.df_employee[self.df_employee[‘dept’] == ‘a’]

  

 

group by

def groupby(self):

    self._cursor.execute(‘SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;’)

    self._conn.commit()

     

    def groupby(self):

    self.df_employee.groupby(“dept”).agg({‘birth’: np.mean, ‘salary’: np.sum})

 

 

标签:SQLite,bonus,self,df,vs,._,employee,Pandas,def
From: https://www.cnblogs.com/a00ium/p/17365988.html

相关文章

  • 文心一言 VS chatgpt (13)-- 算法导论3.1 8题 3.2 1题
    八、可以扩展我们的记号到有两个参数n和m的情形,其中的n和m可以按不同速率独立地趋于无穷。对于给定的函数g(n,m),用O(g(n,m))来表示以下函数集:O(g(n,m))={f(n,m):存在正常量c、和,使得对所有n>=n0或m>=m0,有0<=f(n,m)<=cg(n,m)}对Ω(g(n,m))和θ(g(n,m))给出相应的定义。文......
  • 文心一言 VS chatgpt (13)-- 算法导论3.1 8题 3.2 1题
    八、可以扩展我们的记号到有两个参数n和m的情形,其中的n和m可以按不同速率独立地趋于无穷。对于给定的函数g(n,m),用O(g(n,m))来表示以下函数集:O(g(n,m))=对Ω(g(n,m))和θ(g(n,m))给出相应的定义。文心一言:chatgpt:类比于单个参数的情形,我们可以定义类似的记号:O(g(n,......
  • 搭配vscode的博客园
    在完成上篇文章:上传本地笔记到博客园的配置后新建工作区,文件夹名称为Cnblogs(自定义)点击右侧导航栏的CnblogsConsole点击扩展设置,可以设置存放博文的文件夹为上方新建的文件夹在本地文件夹新建文件后,可以选择右键保存到博客园,然后选择关联已有的博客或者新建博客,如果新......
  • VS Code 配置 C/C++ 环境(编译/调试)
    1,VSCode安装点击此处跳转到官网下载安装VSCode2,环境准备这里我们安装VSCode官方推荐的方法安装即可。官方文档。2.1下载MSYS2点击进入MSYS2官网,找到Installation的Downloadtheinstaller:msys2-x86_64-20230318.exe,点击msys2-x86_64-20230318.exe下载。然后根......
  • python 读写sqlite3 读写内存中的数据库
    Python中,可以使用标准库sqlite3来读写SQLite数据库。下面是一个示例代码,展示如何连接到SQLite数据库,创建表格,插入数据,查询数据和关闭数据库:importsqlite3#连接到数据库conn=sqlite3.connect('example.db')#创建一个表格conn.execute('''CREATETABLEIFNOTE......
  • 使用VSCode取代Keil实现STM32和51单片机的开发
    使用VisualStudioCode开发STM32和51单片机,VSCode作为编辑器来开发嵌入式程序。视频教程:https://www.bilibili.com/video/BV18e4y1H7xX/VSCode简介VisualStudioCode是是由微软研发的一个轻量级但功能强大的源代码编辑器,这个软件是免费开源的,可在您的桌面上运行,并且可用于Windo......
  • 西门子成都工厂的DevSecOps实践
    大家好,我是Edison。4月15日,成都.NET线下技术沙龙活动中,我分享了一个主题《西门子成都工厂的DevSecOps实践》,向大家介绍了我们为什么要做DevSecOps以及我们目前是怎么做DevSecOps的。整个分享从Why-What-How的顺序讲解,基于我们目前正在使用的.NET6技术栈,我们也给出了一些参考......
  • Windows下配置vscode和gcc开发环境
    vscode下载安装下载地址vscode官网,直接点击下载可能会比较慢,可以在开始下载之后复制下载地址,然后用vscode.cdn.azure.cn替换下载地址开头,用国内镜像加速下载下载安装完成后打开,安装简体中文插件和C++插件,重启MinGW-w64下载安装由于MinGW-w64的官方二进制版本已经很久没更新了,......
  • 四月学习之LVS https配置
    1、需要增加443端口调度ipvsadm-A-t172.16.1.100:443-srripvsadm-a-t172.16.1.100:443-r172.16.1.5:443ipvsadm-a-t172.16.1.100:443-r172.16.1.6:443ipvsadm-L-nIPVirtualServerversion1.2.1(size=4096)ProtLocalAddress:PortSchedulerFlags->......
  • python 读写sqlite3
    importsqlite3#连接到SQLite3数据库conn=sqlite3.connect('example.db')#创建一个表conn.execute('''CREATETABLEIFNOTEXISTSusers(idINTEGERPRIMARYKEYAUTOINCREMENT,nameTEXTNOTNULL,ageI......