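1 Introduction to DataFrame
A DataFrame represents a table, a spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series that all share the same index. Compared with other DataFrame-like structures you may have used before (such as R's data.frame), row-oriented and column-oriented operations on a DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional arrays rather than as a list, a dictionary, or some other collection of one-dimensional arrays.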
DataFrame([data, index, columns, dtype, copy])
# Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
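To make the "dictionary of Series sharing one index" view more concrete, here is a minimal sketch. The names `population`, `area` and `frame`, as well as the numbers, are invented for illustration and do not come from the examples in this article.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
population = pd.Series({'Ohio': 11.7, 'Nevada': 3.1, 'Utah': 3.3})
area = pd.Series({'Ohio': 44.8, 'Utah': 84.9})

# Each Series becomes a column; the labels are aligned into one shared row index.
frame = pd.DataFrame({'population': population, 'area': area})

print(frame)
print(frame.dtypes)                             # every column carries its own dtype
print(frame['area'].index.equals(frame.index))  # each column shares the frame's index
```

Labels that are missing from one of the Series (here the area for Nevada) show up as NaN in the aligned result.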
2 Creating a DataFrame
import pandas as pd
import numpy as np
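Creating from a dictionary
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, np.nan],  # np.nan marks a missing value (NA)
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
pd.DataFrame(data,
             # index=['a', 'b', 'c', 'd', 'e']
             # index=range(5)
             )  # an integer index is generated by default; the dictionary keys become columns and each list supplies that column's values, one per row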
Output:
    state    year  pop
0    Ohio  2000.0  1.5
1    Ohio  2001.0  1.7
2    Ohio  2002.0  3.6
3  Nevada  2001.0  2.4
4  Nevada     NaN  2.9
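The year column prints as 2000.0 rather than 2000 because np.nan is a floating-point value, so the whole column ends up stored as floats. A small sketch using a shortened version of the dictionary above (the names `with_nan` and `without_nan` are made up for this example):

```python
import numpy as np
import pandas as pd

with_nan = pd.DataFrame({'year': [2000, 2001, np.nan]})
without_nan = pd.DataFrame({'year': [2000, 2001, 2002]})

print(with_nan['year'].dtype)         # float64: the NaN forces a float column
print(without_nan['year'].dtype)      # an integer dtype (typically int64)
print(with_nan['year'].isna().sum())  # number of missing values in the column
```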
Creating a DataFrame with pd.DataFrame.from_dict
# two levels of nesting
d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}
df_index = pd.DataFrame.from_dict(d, orient='index')
df_index
Output:
   tp   fp
a  26  112
b  26   91
c  23   74
df_columns = pd.DataFrame.from_dict(d, orient='columns')
df_columns
Output:
      a   b   c
fp  112  91  74
tp   26  26  23
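The two orient values differ only in which level of the nested dictionary supplies the row labels. The following sketch reuses the dictionary d from above and checks that the orient='index' result matches the transpose of the orient='columns' result. Labels are sorted on both axes before comparing, on the assumption that label order may differ between pandas versions; the names `by_index`, `by_columns`, `left` and `right` are made up here.

```python
import pandas as pd

d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}

by_index = pd.DataFrame.from_dict(d, orient='index')      # outer keys become row labels
by_columns = pd.DataFrame.from_dict(d, orient='columns')  # outer keys become column labels

# Sort the labels on both axes so the comparison does not depend on label order.
left = by_index.sort_index(axis=0).sort_index(axis=1)
right = by_columns.T.sort_index(axis=0).sort_index(axis=1)

print(list(left.index) == list(right.index))
print((left.to_numpy() == right.to_numpy()).all())
```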
Create a DataFrame by passing a NumPy array together with row index labels and column labels
data = pd.DataFrame(np.arange(10, 26).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data
Output:
          one  two  three  four
Ohio       10   11     12    13
Colorado   14   15     16    17
Utah       18   19     20    21
New York   22   23     24    25
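Once the array has row and column labels, a single cell can be reached either by label or by position. A brief sketch on the same data, using the standard .loc and .iloc accessors (they are not introduced elsewhere in this article):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(10, 26).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

print(data.shape)               # (4, 4): same shape as the underlying array
print(data.loc['Utah', 'two'])  # 19, looked up by row and column label
print(data.iloc[2, 1])          # 19, the same cell by integer position
```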
Create a df with a date index
np.random.seed(10)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df
                   A         B         C         D
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-03  0.004291 -0.174600  0.433026  1.203037
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
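The call to np.random.seed(10) is what makes the numbers above reproducible. A minimal sketch, wrapping the same construction in a hypothetical helper named `make_frame` so it can be run twice:

```python
import numpy as np
import pandas as pd

def make_frame(seed):
    """Build the 6x4 random frame from the text for a given seed (hypothetical helper)."""
    np.random.seed(seed)
    dates = pd.date_range('20190101', periods=6)
    return pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

first = make_frame(10)
second = make_frame(10)

print(first.equals(second))             # the same seed gives the same values
print(first.index[0], first.index[-1])  # first and last date in the index
```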
3 Basic DataFrame attributes
DataFrame.index: The index (row labels) of the DataFrame.
df.index
Output:
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')
Set the index name
df.index.name = 'time'
DataFrame.columns: The column labels of the DataFrame.
df.columns
Output:
Index(['A', 'B', 'C', 'D'], dtype='object')
Set the name of the column index
df.columns.name = 'alphabet'
DataFrame.values: Return a NumPy representation of the DataFrame.
View the underlying NumPy data
df.values
Output (this capture shows only the last three rows of df):
array([[-0.96506567,  1.02827408,  0.22863013,  0.44513761],
       [-1.13660221,  0.13513688,  1.484537  , -1.07980489],
       [-1.97772828, -1.7433723 ,  0.26607016,  2.38496733]])
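What df.values returns depends on the column types: when every column has the same dtype the array keeps it, while mixed columns fall back to a common object dtype. A small sketch with made-up frames `numeric` and `mixed`; DataFrame.to_numpy() is the method form that pandas also provides for the same conversion.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})
mixed = pd.DataFrame({'x': [1.0, 2.0], 'label': ['a', 'b']})

print(type(numeric.values), numeric.values.shape)  # a 2-D ndarray of shape (2, 2)
print(numeric.values.dtype)   # float64: all columns share one dtype
print(mixed.values.dtype)     # object: the only common type for floats and strings
print(np.array_equal(numeric.values, numeric.to_numpy()))
```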
4 DataFrame indexing
DataFrame.head(self[, n]) Return the first n rows.
df.head(3)  # show the first three rows
                   A         B         C         D
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-03  0.004291 -0.174600  0.433026  1.203037
DataFrame.tail(self[, n]) Return the last n rows.
df.tail(3)  # show the last three rows
                   A         B         C         D
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
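Both methods take an optional n. The sketch below, on a made-up ten-row frame, shows the default of five rows and the behaviour of a negative n, which keeps everything except the last (or first) rows:

```python
import pandas as pd

frame = pd.DataFrame({'n': range(10)})

print(len(frame.head()))             # 5: n defaults to 5
print(frame.tail(3)['n'].tolist())   # [7, 8, 9]
print(frame.head(-2)['n'].tolist())  # negative n: all rows except the last 2
```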
DataFrame.set_index(self, keys[, drop, …])
df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two',
                         'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
df
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
# set_index converts one or more columns of the DataFrame into the row index
df2 = df.set_index(['c', 'd'])
df2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default drop=True; with drop=False the columns used for the index are also kept as regular columns
df.set_index(['c', 'd'], drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns
df2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
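As a quick check that the two methods undo each other, the following sketch rebuilds the same frame, looks up one row through the two-level index, and then restores the original columns. The names `indexed` and `restored` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = df.set_index(['c', 'd'])
print(list(indexed.index.names))     # ['c', 'd']: the two index levels
print(indexed.loc[('two', 1), 'a'])  # 4: one cell addressed by its (c, d) pair

# reset_index puts c and d back as columns, placed before a and b.
restored = indexed.reset_index()
print(list(restored.columns))        # ['c', 'd', 'a', 'b']
print(restored[['a', 'b', 'c', 'd']].equals(df))
```

After reordering the columns, the restored frame compares equal to the original one.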
5 DataFrame computation and descriptive statistics
DataFrame.round(self[, decimals]) Round a DataFrame to a variable number of decimal places.
Round the values to two decimal places (df here is the date-indexed frame from section 2)
df.round(2)
               A     B     C     D
2019-01-01  1.33  0.72 -1.55 -0.01
2019-01-02  0.62 -0.72  0.27  0.11
2019-01-03  0.00 -0.17  0.43  1.20
2019-01-04 -0.97  1.03  0.23  0.45
2019-01-05 -1.14  0.14  1.48 -1.08
2019-01-06 -1.98 -1.74  0.27  2.38
Specify a different number of decimal places for different columns
df.round({'A': 1, 'C': 2})
              A         B     C         D
2019-01-01  1.3  0.715279 -1.55 -0.008384
2019-01-02  0.6 -0.720086  0.27  0.108549
2019-01-03  0.0 -0.174600  0.43  1.203037
2019-01-04 -1.0  1.028274  0.23  0.445138
2019-01-05 -1.1  0.135137  1.48 -1.079805
2019-01-06 -2.0 -1.743372  0.27  2.384967
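Two details are easy to miss in the output above: columns that are not listed in the dictionary are left untouched, and round returns a new object rather than modifying the original. A small sketch on a made-up frame named `sample`:

```python
import pandas as pd

sample = pd.DataFrame({'A': [1.331587, -0.965066],
                       'B': [0.715279, 1.028274],
                       'C': [-1.545400, 0.228630]})

rounded = sample.round({'A': 1, 'C': 2})  # B is not listed, so it keeps full precision

print(rounded)
print(rounded['B'].tolist() == sample['B'].tolist())  # B is unchanged
print(sample.iloc[0, 0])                              # the original frame is not modified
```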
DataFrame.describe(self[, percentiles, …]) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
# quick statistical summary of the numeric columns
df.describe()
Output (computed on only the last three rows of df, hence count = 3):
alphabet         A         B         C         D
count     3.000000  3.000000  3.000000  3.000000
mean     -1.359799 -0.193320  0.659746  0.583433
std       0.541972  1.414715  0.714535  1.736521
min      -1.977728 -1.743372  0.228630 -1.079805
25%      -1.557165 -0.804118  0.247350 -0.317334
50%      -1.136602  0.135137  0.266070  0.445138
75%      -1.050834  0.581705  0.875304  1.415052
max      -0.965066  1.028274  1.484537  2.384967
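Each row of the describe() result corresponds to an ordinary Series method, which can help when reading the table. A minimal sketch on a made-up column containing one NaN, to show that missing values are excluded and that std is the sample standard deviation:

```python
import pandas as pd

sample = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, float('nan')]})
summary = sample.describe()

print(summary)
# The same figures computed one at a time:
print(sample['x'].count())         # 4: the NaN is not counted
print(sample['x'].mean())          # 2.5
print(sample['x'].std())           # sample standard deviation (ddof=1)
print(sample['x'].quantile(0.25))  # 1.75, the '25%' row
```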
DataFrame.apply(self, func[, axis, …]) Apply a function along an axis of the DataFrame.
df
                   A         B         C  D    F
2019-01-01  0.000000  0.000000 -1.545400  5  NaN
2019-01-02  0.621336 -0.720086  0.265512  5  1.0
2019-01-03  0.004291 -0.174600  0.433026  5  2.0
2019-01-04 -0.965066  1.028274  0.228630  5  3.0
2019-01-05 -1.136602  0.135137  1.484537  5  4.0
2019-01-06 -1.977728 -1.743372  0.266070  5  5.0
df.apply(np.cumsum, axis=0, result_type=None)
                   A         B         C   D     F
2019-01-01  0.000000  0.000000 -1.545400   5   NaN
2019-01-02  0.621336 -0.720086 -1.279889  10   1.0
2019-01-03  0.625627 -0.894686 -0.846863  15   3.0
2019-01-04 -0.339438  0.133588 -0.618232  20   6.0
2019-01-05 -1.476040  0.268725  0.866305  25  10.0
2019-01-06 -3.453769 -1.474647  1.132375  30  15.0
df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
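The axis argument decides whether the function receives one column or one row at a time. A small sketch on made-up integer data (not the df above), so the results are easy to verify by hand:

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the df above.
frame = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

col_range = frame.apply(lambda x: x.max() - x.min())          # axis=0: x is one column
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)  # axis=1: x is one row
running = frame.apply(np.cumsum, axis=0)                      # cumulative sum down each column

print(col_range.tolist())     # [3, 20]
print(row_range.tolist())     # [9, 26, 18]
print(running['B'].tolist())  # [10, 40, 60]
```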
6 Reindexing, selection, and label manipulation
DataFrame.rename(self[, mapper, index, …]) Alter axes labels.
Rename a column
df.rename(columns={'A': 'key2'}, inplace=False)
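With inplace=False (the default), rename returns a relabelled copy and leaves the original untouched. A short sketch on a made-up two-column frame, which also shows the index argument for row labels:

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

renamed = frame.rename(columns={'A': 'key2'})   # returns a new DataFrame
relabelled = frame.rename(index={0: 'first'})   # the same idea for row labels

print(list(renamed.columns))   # ['key2', 'B']
print(list(frame.columns))     # ['A', 'B']: the original is unchanged
print(list(relabelled.index))  # ['first', 1]
```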
7 Sorting
DataFrame.sort_index(self[, axis, level, …]) Sort object by labels (along an axis).
# axis=0 (the default) sorts the rows by their index labels; ascending=True (the default) sorts in ascending order
df.sort_index(axis=0, ascending=False)
# df.sort_index(axis=0, ascending=True)
                   A         B         C         D
2019-01-06 -1.977728 -1.743372  0.266070  2.384967
2019-01-05 -1.136602  0.135137  1.484537 -1.079805
2019-01-04 -0.965066  1.028274  0.228630  0.445138
2019-01-03  0.004291 -0.174600  0.433026  1.203037
2019-01-02  0.621336 -0.720086  0.265512  0.108549
2019-01-01  1.331587  0.715279 -1.545400 -0.008384
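The same method can also sort by column labels when axis=1 is passed. A minimal sketch on a made-up frame whose row and column labels start out unordered:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=['y', 'z', 'x'])

by_rows = frame.sort_index()                      # axis=0, ascending: rows x, y, z
by_rows_desc = frame.sort_index(ascending=False)  # rows z, y, x
by_cols = frame.sort_index(axis=1)                # columns a, b

print(list(by_rows.index))       # ['x', 'y', 'z']
print(list(by_rows_desc.index))  # ['z', 'y', 'x']
print(list(by_cols.columns))     # ['a', 'b']
```

In each case the values move together with their labels, so by_rows.loc['x', 'b'] is still 3.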