标签:df Country Delhi Capital Belgium pandas Population
Table of Contents
Set-up
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # 如果不需要,则修改为none即可。删除该段代码不起作用。
Intro
The Pandas library is built on NumPy
provides data structures and data analysis tools.
import pandas as pd
Pandas切片左闭右开
Pandas Data Structures - 2 Types
Notice:
当同时对s和df进行操作的时候:
- s会直接在原始的Series对象上进行更改,而不会创建新的对象 (Series是可变对象)
- 而df并不会修改原始数据,除非进行赋值操作,或者设置了参数
inplace=True
(非破坏操作)
Series
- (my understanding: ordinary/normal data) )
- A one-dimensional labeled array
- capable of holding any data type
s = pd.Series([1, 2, 3, 4],
index = ['a', 'b', 'c', 'd'])
print(s)
a 1
b 2
c 3
d 4
dtype: int64
DataFrame
- two-dimensional labeled data structure
- columns of different types
- 大多的DataFrame操作都会自动返回一个新的 DataFrame 对象
# Series
data = {
'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [11190846, 1303171035, 207847528]
}
print(data)
# series -> DataFrame
# index默认从0开始递增
# 修改默认行索引:index=['foo', 'bar', 'baz']
df = pd.DataFrame(data,
columns = ['Country', 'Capital', 'Population'])
df # 原始的 DataFrame 对象
# **注意**
# 对df.__任何操作都不会修改它本身数据,除非进行了赋值
# eg. df = df.drop(0)
{'Country': ['Belgium', 'India', 'Brazil'], 'Capital': ['Brussels', 'New Delhi', 'Brasília'], 'Population': [11190846, 1303171035, 207847528]}
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
Series VS DataFrame
Dropping
# drop rows from series: index = 'a' and 'c'
# 删除指定行
s.drop( ['a', 'c'] )
s
df
# drop columns
# 删除指定列 axis=1删除列,axis=0删除行
df.drop('Country', axis=1)
df
# 删除第一行数据, index=0 (不是标题行)
df.drop(0)
df
# Notice
# 看以下输出,数据本身并没有改变
# drop方法会自动返回一个 经过删除操作的 Series/DataFrame
# 如果想改变原有数据,参数中添加:inplace=True
b 2
d 4
dtype: int64
a 1
b 2
c 3
d 4
dtype: int64
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Capital |
Population |
0 |
Brussels |
11190846 |
1 |
New Delhi |
1303171035 |
2 |
Brasília |
207847528 |
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
Sort & Rank
df
df.sort_index() # 按照index排序,index默认从0开始递增,所以这里看不出变化
df.sort_values( by='Country' ) # sory by 'values'
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
2 |
Brazil |
Brasília |
207847528 |
1 |
India |
New Delhi |
1303171035 |
df
# 给每个元素在该列进行排序
df.rank() # assign ranks to entires
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
0 |
1.0 |
2.0 |
1.0 |
1 |
3.0 |
3.0 |
3.0 |
2 |
2.0 |
1.0 |
2.0 |
I/O
Notice:
- 对于CSV和Excel来说:使用pandas读取到的文件为DataFrame
- 对于SQL来说:在Pandas的pd.read_sql()函数中,结果可以是不同的数据结构,取决于查询和它的执行方式. eg. 一个单一的值(比如计数),一组值(比如名字的列表),一个类似表的结构(比如SELECT语句的结果集)。
CSV
Aim:
读取1.csv文件,经过DataFrame处理后,以2.csv的形式输出
# Read
# nrows = number of rows
# 用pandas读取csv -> 存入DataFrame中
# 因为pd.read_csv()方法产生的数据结构是一个DataFrame
# header=0或者None: 文件第一行作为标题行
df = pd.read_csv('1.csv', header = None, nrows = 3)
df
# Write
# 将DataFrame写入CSV文件中
df.to_csv('2.csv')
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
... |
26 |
27 |
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 |
0 |
respondent_id |
h1n1_concern |
h1n1_knowledge |
behavioral_antiviral_meds |
behavioral_avoidance |
behavioral_face_mask |
behavioral_wash_hands |
behavioral_large_gatherings |
behavioral_outside_home |
behavioral_touch_face |
... |
income_poverty |
marital_status |
rent_or_own |
employment_status |
hhs_geo_region |
census_msa |
household_adults |
household_children |
employment_industry |
employment_occupation |
1 |
26707 |
2 |
2 |
0 |
1 |
0 |
1 |
1 |
0 |
1 |
... |
> $75,000 |
Not Married |
Rent |
Employed |
mlyzmhmf |
MSA, Not Principle City |
1 |
0 |
atmlpfrs |
hfxkjkmi |
2 |
26708 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
... |
Below Poverty |
Not Married |
Rent |
Employed |
bhuqouqj |
Non-MSA |
3 |
0 |
atmlpfrs |
xqwwgdyp |
3 rows × 36 columns
Excel
- 如果一个xlsx文件 - 只有一个sheet
# Read
df = pd.read_excel('3.xlsx')
# Write
df.to_excel('4.xlsx', sheet_name = 'new')
- 如果一个xlsx文件 - 有多个sheet
# 从一个excel中 - 读取所有sheet
sheets = pd.ExcelFile('5.xls')
# Read
df1 = pd.read_excel(sheets, 'sheet_1')
df2 = pd.read_excel(sheets, 'sheet_2')
# Write
merged_df = pd.concat([df1, df2], axis=0) # Merge the dataframes
merged_df.to_excel('merged_file.xlsx', index=False) # Write to new xlsx file
SQL Query/Database Table
存在问题,需要不断安装sqlalchemy,可能是虚拟环境的问题
jupyter kernelspec list
Available kernels:
pycharm-616d70e3 /Users/huang/Library/Jupyter/kernels/pycharm-616d70e3
python3 /Users/huang/opt/anaconda3/envs/WebAppFlask3_9_MachineLearning/share/jupyter/kernels/python3
Databricks也无法运行
官方文档:https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
# !pip install sqlalchemy
!conda install sqlalchemy
# >>> from sqlalchemy import create_engine
# >>> engine = create_engine( )
# >>> pd.read_sql( , engine)
# >>> pd.read_sql_table( , engine)
# >>> pd.read_sql_query( , engine)
# read_sql() is a convenience wrapper around read_sql_table() andread_sql_query()
# >>> df.to_sql( , en 'myDf' gine)
Collecting package metadata (current_repodata.json): done
Solving environment: done
## Package Plan ##
environment location: /Users/huang/opt/anaconda3
added / updated specs:
- sqlalchemy
The following packages will be downloaded:
package | build
---------------------------|-----------------
boltons-23.0.0 | py39hecd8cb5_0 423 KB
conda-23.3.1 | py39hecd8cb5_0 962 KB
jsonpatch-1.32 | pyhd3eb1b0_0 15 KB
jsonpointer-2.1 | pyhd3eb1b0_0 9 KB
------------------------------------------------------------
Total: 1.4 MB
The following NEW packages will be INSTALLED:
boltons pkgs/main/osx-64::boltons-23.0.0-py39hecd8cb5_0
jsonpatch pkgs/main/noarch::jsonpatch-1.32-pyhd3eb1b0_0
jsonpointer pkgs/main/noarch::jsonpointer-2.1-pyhd3eb1b0_0
The following packages will be UPDATED:
conda 23.1.0-py39hecd8cb5_0 --> 23.3.1-py39hecd8cb5_0
Proceed ([y]/n)?
Selection
Getting
print(s)
s['b']
a 1
b 2
c 3
d 4
dtype: int64
2
df
df[1:] # get subset
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
|
Country |
Capital |
Population |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
Selecting, Boolean Indexing & Setting
# by position
# 选择(0,0)的元素
df
df.iloc[0,0]
df.iat[0,0]
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
'Belgium'
'Belgium'
# By Label
df.loc[0, 'Country'] #选择第0行Country列
df.at[0, 'Country']
# Notice:
# 在访问单个标量值时,.at比.loc快,但它只适用于一次访问一个值。
'Belgium'
'Belgium'
# Boolean Indexing
s
# 选择Series对象s中所有<=1的元素
# ~ 取反操作
s[~(s>1)]
s[ (s<-1) | (s>2) ]
df['Population']>1200000000 # -> 返回的是一个bool类型的Series对象
df[ df['Population']>1200000000 ]
a 1
b 2
c 3
d 4
dtype: int64
a 1
dtype: int64
c 3
d 4
dtype: int64
0 False
1 True
2 False
Name: Population, dtype: bool
|
Country |
Capital |
Population |
1 |
India |
New Delhi |
1303171035 |
# 修改索引a的值为6
s
s['a'] = 6 # 结果会被修改 - 会直接在原始的Series对象上进行更改,而不会创建新的对象
s
a 1
b 2
c 3
d 4
dtype: int64
a 6
b 2
c 3
d 4
dtype: int64
s[0:2] # 左闭右开
a 6
b 2
dtype: int64
loc和iloc区别
- loc: location -
df.loc[0, 'Country']
- iloc: integer location -
df.iloc[0,0]
## Retrieving Series/DataFrame Information
df
df.shape
df.index
df.columns
df.info()
df.count()
|
Country |
Capital |
Population |
0 |
Belgium |
Brussels |
11190846 |
1 |
India |
New Delhi |
1303171035 |
2 |
Brazil |
Brasília |
207847528 |
(3, 3)
RangeIndex(start=0, stop=3, step=1)
Index(['Country', 'Capital', 'Population'], dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 3 non-null object
1 Capital 3 non-null object
2 Population 3 non-null int64
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
Country 3
Capital 3
Population 3
dtype: int64
### Basic Information
### Summary
## Applying Functions
## Data Alignment
### Internal Data Alignment
### Arithmetic Operations with Fill Methods
Learning Link
https://www.datacamp.com/search?q=pandas
标签:df,
Country,
Delhi,
Capital,
Belgium,
pandas,
Population
From: https://www.cnblogs.com/huangkenicole/p/17375323.html