Analysis details
For the analysis, we ran the six tasks 10 times each, for 5 different sample sizes, for each of 3 programs: pandas, sqlite, and memory-sqlite (where database is in memory instead of on disk). See below for the definitions of each task.
Our sample data was randomly generated. Here’s what it looks like:
sql_vs_pandas$ head -n 5 data/sample.100.csv
qqFjQHQc,c,1981,82405.59262172286
vILuhVGz,a,1908,27712.27152250119
mwCjpoOF,f,1992,58974.38538762843
kGbriYAK,d,1927,42258.24179716961
MeoxuJng,c,1955,96907.56416314292
This consists of a random string of 8 characters, a random single character (for the filtering operation), a random integer simulating a year (1900-2000), and a uniform random float value between 10000 and 100000.
Results
sqlite or memory-sqlite is faster for the following tasks:
- select two columns from data (<.1 millisecond for any data size for sqlite. pandas scales with the data, up to just under 0.5 seconds for 10 million records)
- filter data (>10x-50x faster with sqlite. The difference is more pronounced as data grows in size)
- sort by single column: pandas is always a bit slower, but this was the closest
pandas is faster for the following tasks:
- groupby computation of a mean and sum (significantly better for large data, only 2x faster for <10k records)
- load data from disk (5x faster for >10k records, even better for smaller data)
- join data (2-5x faster, but slower for smallest dataset of 1000 rows)
Comparing memory-sqlite vs. sqlite, there was no meaningful difference, especially as data size increased.
There is no significant speedup from loading sqlite in its own shell vs. via pandas.
Overall, joining and loading data is the slowest whereas select and filter are generally the fastest. Further, pandas seems to be optimized for group-by operations, where it performs really well (group-by is pandas‘ second-fastest operation for larger data).
Note that this analysis assumes you are equally proficient in writing code with both! But these results could encourage you to learn the tool that you are less familiar with, if the performance gains are significant.
All code is on our GitHub page.
Below are the definitions of our six tasks: sort, select, load, join, filter, and group by (see driver/sqlite_driver.py or driver/pandas_driver.py).
sqlite is first, followed by pandas:
sort
def sort(self): self._cursor.execute(‘SELECT * FROM employee ORDER BY name ASC;’) self._conn.commit() def sort(self): self.df_employee.sort_values(by=’name’)
select
def select(self): self._cursor.execute(‘SELECT name, dept FROM employee;’) self._conn.commit() def select(self): self.df_employee[[“name”, “dept”]]
load
def load(self): self._cursor.execute(‘CREATE TABLE employee (name varchar(255), dept char(1), birth int, salary double);’) df = pd.read_csv(self.employee_file) df.columns = employee_columns df.to_sql(’employee’, self._conn, if_exists=’replace’) self._cursor.execute(‘CREATE TABLE bonus (name varchar(255), bonus double);’) df_bonus = pd.read_csv(self.bonus_file) df_bonus.columns = bonus_columns df_bonus.to_sql(‘bonus’, self._conn, if_exists=’replace’) def load(self): self.df_employee = pd.read_csv(self.employee_file) self.df_employee.columns = employee_columns self.df_bonus = pd.read_csv(self.bonus_file) self.df_bonus.columns = bonus_columns
join
def join(self): self._cursor.execute(‘SELECT employee.name, employee.salary + bonus.bonus ‘ ‘FROM employee INNER JOIN bonus ON employee.name = bonus.name’) self._conn.commit() def join(self): joined = self.df_employee.merge(self.df_bonus, on=’name’) joined[‘total’] = joined[‘bonus’] + joined[‘salary’]
filter
def filter(self): self._cursor.execute(‘SELECT * FROM employee WHERE dept = “a”;’) self._conn.commit() def filter(self): self.df_employee[self.df_employee[‘dept’] == ‘a’]
group by
def groupby(self): self._cursor.execute(‘SELECT avg(birth), sum(salary) FROM employee GROUP BY dept;’) self._conn.commit() def groupby(self): self.df_employee.groupby(“dept”).agg({‘birth’: np.mean, ‘salary’: np.sum})
标签:SQLite,bonus,self,df,vs,._,employee,Pandas,def From: https://www.cnblogs.com/a00ium/p/17365988.html