Pandas in Action
Chapter 3: Series methods
-
read_csv(): importing a dataset
import pandas as pd
pd.read_csv(filepath_or_buffer="./file/chapter_03/pokemon.csv") # or pd.read_csv("./file/chapter_03/pokemon.csv")
         Pokemon            Type
0      Bulbasaur  Grass / Poison
1        Ivysaur  Grass / Poison
2       Venusaur  Grass / Poison
3     Charmander            Fire
4     Charmeleon            Fire
..           ...             ...
804    Stakataka    Rock / Steel
805  Blacephalon    Fire / Ghost
806      Zeraora        Electric
807       Meltan           Steel
808     Melmetal           Steel

[809 rows x 2 columns]
-
read_csv(): setting the index column
Use the index_col parameter to set the index column; here the string "Pokemon" is passed to index_col.
pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon")
                       Type
Pokemon
Bulbasaur    Grass / Poison
Ivysaur      Grass / Poison
Venusaur     Grass / Poison
Charmander             Fire
Charmeleon             Fire
...                     ...
Stakataka      Rock / Steel
Blacephalon    Fire / Ghost
Zeraora            Electric
Meltan                Steel
Melmetal              Steel

[809 rows x 1 columns]
-
read_csv(): converting a DataFrame to a Series
Even when the data has a single column, Pandas imports it into a DataFrame by default. To get a Series, call the squeeze() method on the result.
Older versions of pandas accepted squeeze=True directly in read_csv(). That parameter was deprecated in pandas 1.4 and removed in pandas 2.0, so call squeeze("columns") on the DataFrame instead.
pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
Pokemon
Bulbasaur      Grass / Poison
Ivysaur        Grass / Poison
Venusaur       Grass / Poison
Charmander               Fire
Charmeleon               Fire
                    ...
Stakataka        Rock / Steel
Blacephalon      Fire / Ghost
Zeraora              Electric
Meltan                  Steel
Melmetal                Steel
Name: Type, Length: 809, dtype: object
We now have a Series whose index labels are the Pokemon names and whose values are the Pokemon types.
- Pandas has named the Series Type, after the column name in the CSV file.
- The Series holds 809 values.
- dtype: object indicates that the Series holds strings.
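The import-and-squeeze pattern can be tried without the book's data files. The following is a minimal sketch, assuming a small made-up CSV held in memory with io.StringIO; the three sample rows are illustrative and are not the real dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for pokemon.csv.
csv_text = "Pokemon,Type\nBulbasaur,Grass / Poison\nCharmander,Fire\nSquirtle,Water\n"

# index_col leaves a one-column DataFrame ...
frame = pd.read_csv(io.StringIO(csv_text), index_col="Pokemon")
# ... which squeeze("columns") collapses into a Series.
series = frame.squeeze("columns")

print(type(frame).__name__)   # DataFrame
print(type(series).__name__)  # Series
print(series.name)            # Type
```

The Series keeps the column header as its name and the Pokemon column as its index.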
-
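read_csv(): parsing a column as dates
When importing data, Pandas infers the most suitable data type for each column, but for the sake of predictable behaviour it avoids making assumptions about the data. google_stocks.csv contains a Date column in YYYY-MM-DD format; unless Pandas is told explicitly to treat those values as dates, they are imported as strings. The parse_dates parameter of read_csv() specifies which columns to convert to dates; it accepts a list of strings (the column names).
pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"]).head()
        Date  Close
0 2004-08-19  49.98
1 2004-08-20  53.95
2 2004-08-23  54.50
3 2004-08-24  52.24
4 2004-08-25  52.80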
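To see what parse_dates changes, compare the column type with and without it. This is a small sketch under the assumption of a two-row in-memory CSV (the rows are copied from the output above); is_datetime64_any_dtype comes from pandas.api.types.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Two rows in the same layout as google_stocks.csv.
csv_text = "Date,Close\n2004-08-19,49.98\n2004-08-20,53.95\n"

raw = pd.read_csv(io.StringIO(csv_text))                           # Date stays text
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])  # Date becomes datetime

print(is_datetime64_any_dtype(raw["Date"]))     # False
print(is_datetime64_any_dtype(parsed["Date"]))  # True
```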
-
read_csv(): parsing a column as dates, using it as the index, and converting to a Series
pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")
Date
2004-08-19      49.98
2004-08-20      53.95
2004-08-23      54.50
2004-08-24      52.24
2004-08-25      52.80
               ...
2019-10-21    1246.15
2019-10-22    1242.80
2019-10-23    1259.13
2019-10-24    1260.99
2019-10-25    1265.13
Name: Close, Length: 3824, dtype: float64
-
read_csv(): squeeze() has no effect when there are multiple columns
pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date").squeeze("columns")
                                       Battle          State
Start Date
1774-09-01                       Powder Alarm  Massachusetts
1774-12-14  Storming of Fort William and Mary  New Hampshire
1775-04-19   Battles of Lexington and Concord  Massachusetts
1775-04-19                    Siege of Boston  Massachusetts
1775-04-20                 Gunpowder Incident       Virginia
...                                       ...            ...
1782-09-11                Siege of Fort Henry       Virginia
1782-09-13         Grand Assault on Gibraltar            NaN
1782-10-18          Action of 18 October 1782            NaN
1782-12-06          Action of 6 December 1782            NaN
1783-01-22          Action of 22 January 1783       Virginia

[232 rows x 2 columns]
-
read_csv(): importing only the index column and one value column, then converting to a Series
The usecols parameter of read_csv() accepts a list of the columns Pandas should import. Here Start Date and State are selected, with Start Date as the index and State as the values. Because only one value column remains once the index is set, the DataFrame can be squeezed into a Series.
pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date", usecols=["Start Date", "State"]).squeeze("columns")
Start Date
1774-09-01    Massachusetts
1774-12-14    New Hampshire
1775-04-19    Massachusetts
1775-04-19    Massachusetts
1775-04-20         Virginia
                  ...
1782-09-11         Virginia
1782-09-13              NaN
1782-10-18              NaN
1782-12-06              NaN
1783-01-22         Virginia
Name: State, Length: 232, dtype: object
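The contrast between the two imports can be reproduced on a tiny sample. This is a sketch only, assuming a two-row in-memory CSV laid out like revolutionary_war.csv (rows taken from the output above).

```python
import io

import pandas as pd

# Two sample rows in the same three-column layout as revolutionary_war.csv.
csv_text = (
    "Battle,Start Date,State\n"
    "Powder Alarm,9/1/1774,Massachusetts\n"
    "Gunpowder Incident,4/20/1775,Virginia\n"
)

# Two value columns remain after setting the index, so squeeze() changes nothing.
both = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Start Date"], index_col="Start Date"
).squeeze("columns")

# usecols keeps one value column, so squeeze() returns a Series.
state_only = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Start Date"],
    index_col="Start Date",
    usecols=["Start Date", "State"],
).squeeze("columns")

print(type(both).__name__)        # DataFrame
print(type(state_only).__name__)  # Series
```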
-
sort_values(): sorting by value
sort_values() returns a new Series with the values sorted in ascending order.
google = pd.read_csv("./file/chapter_03/google_stocks.csv", parse_dates=["Date"], index_col="Date").squeeze("columns")
google.sort_values()
Date
2004-09-03      49.82
2004-09-01      49.94
2004-08-19      49.98
2004-09-02      50.57
2004-09-07      50.60
               ...
2019-04-23    1264.55
2019-10-25    1265.13
2018-07-26    1268.33
2019-04-26    1272.18
2019-04-29    1287.58
Name: Close, Length: 3824, dtype: float64
Strings in a Series are sorted alphabetically.
pokemon = pd.read_csv("./file/chapter_03/pokemon.csv", index_col="Pokemon").squeeze("columns")
pokemon.sort_values()
Pokemon
Illumise                Bug
Silcoon                 Bug
Pinsir                  Bug
Burmy                   Bug
Wurmple                 Bug
                  ...
Tirtouga       Water / Rock
Relicanth      Water / Rock
Corsola        Water / Rock
Carracosta     Water / Rock
Empoleon      Water / Steel
Name: Type, Length: 809, dtype: object
Pandas sorts uppercase letters before lowercase letters.
pd.Series(data=["Adam", "adam", "Ben"]).sort_values()
0    Adam
2     Ben
1    adam
dtype: object
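If the uppercase-first ordering is not wanted, one option (not covered in the text above) is the key parameter of sort_values(), available since pandas 1.1. The sketch below assumes a made-up list of names and lowercases the values for comparison only.

```python
import pandas as pd

# Made-up names chosen so the two orderings differ.
names = pd.Series(["adam", "Ben", "carl"])

# Default: uppercase letters sort before lowercase ones.
default_order = names.sort_values()

# key receives the whole Series and must return a Series of the same length;
# here the comparison is done on lowercased copies of the values.
folded_order = names.sort_values(key=lambda s: s.str.lower())

print(default_order.tolist())  # ['Ben', 'adam', 'carl']
print(folded_order.tolist())   # ['adam', 'Ben', 'carl']
```

The key function affects only the comparison; the original values are returned unchanged.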
-
sort_values(): pass ascending=False to sort in descending order; the default is True
google.sort_values(ascending=False)
Date
2019-04-29    1287.58
2019-04-26    1272.18
2018-07-26    1268.33
2019-10-25    1265.13
2019-04-23    1264.55
               ...
2004-09-07      50.60
2004-09-02      50.57
2004-08-19      49.98
2004-09-01      49.94
2004-09-03      49.82
Name: Close, Length: 3824, dtype: float64
Sorting strings in descending order arranges them in reverse alphabetical order.
pokemon.sort_values(ascending=False)
Pokemon
Empoleon      Water / Steel
Corsola        Water / Rock
Relicanth      Water / Rock
Carracosta     Water / Rock
Tirtouga       Water / Rock
                  ...
Kricketune              Bug
Cascoon                 Bug
Scatterbug              Bug
Kricketot               Bug
Grubbin                 Bug
Name: Type, Length: 809, dtype: object
-
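sort_values(): the na_position parameter controls where NaN values are placed in the sorted result. It defaults to "last", which puts missing values at the end of the sorted Series.
# battles is the State Series built in the usecols example above
battles = pd.read_csv("./file/chapter_03/revolutionary_war.csv", parse_dates=["Start Date"], index_col="Start Date", usecols=["Start Date", "State"]).squeeze("columns")
battles.sort_values(na_position="last")
Start Date
1781-09-06    Connecticut
1779-07-05    Connecticut
1777-04-27    Connecticut
1777-09-03       Delaware
1777-05-17        Florida
                 ...
1782-08-08            NaN
1782-08-25            NaN
1782-09-13            NaN
1782-10-18            NaN
1782-12-06            NaN
Name: State, Length: 232, dtype: object
To show the missing values first, set na_position to "first".
battles.sort_values(na_position="first")
Start Date
1775-09-17         NaN
1775-12-31         NaN
1776-03-03         NaN
1776-03-25         NaN
1776-05-18         NaN
                ...
1781-07-06    Virginia
1781-07-01    Virginia
1781-06-26    Virginia
1781-04-25    Virginia
1783-01-22    Virginia
Name: State, Length: 232, dtype: object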
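The effect of na_position is easy to check on a short Series. This is a minimal sketch, assuming five made-up values with two gaps.

```python
import numpy as np
import pandas as pd

# Made-up sample: three states and two missing values.
states = pd.Series(["Virginia", np.nan, "Connecticut", np.nan, "Florida"])

last = states.sort_values()                      # na_position="last" is the default
first = states.sort_values(na_position="first")  # missing values lead the result

print(last.tolist())
print(first.tolist())
```

In both cases the non-missing values are still sorted alphabetically; only the block of NaN values moves.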
-
dropna() returns a Series with all missing values removed. The method looks only at NaN in the Series values, not in the index.
battles.dropna().sort_values()
Start Date
1781-09-06    Connecticut
1779-07-05    Connecticut
1777-04-27    Connecticut
1777-09-03       Delaware
1777-05-17        Florida
                 ...
1781-07-06       Virginia
1781-07-01       Virginia
1781-06-26       Virginia
1781-04-25       Virginia
1783-01-22       Virginia
Name: State, Length: 162, dtype: object
The new Series is shorter than the original because dropna() left out the 70 NaN values in battles (232 - 162 = 70).
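The length arithmetic can be verified directly: the number of rows dropped equals the number of missing values. A small sketch, assuming a made-up five-element Series:

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing values.
states = pd.Series(["Virginia", np.nan, "Connecticut", np.nan, "Florida"])

missing = states.isna().sum()  # count of NaN values
cleaned = states.dropna()      # new Series; states itself is unchanged

print(len(states), missing, len(cleaned))  # 5 2 3
```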
-
sort_index(): sorting by index; the ascending parameter defaults to True
sort_index() sorts a Series by its index; each value moves together with its index label.
pokemon.sort_index() # or pokemon.sort_index(ascending=True)
Pokemon
Abomasnow        Grass / Ice
Abra                 Psychic
Absol                   Dark
Accelgor                 Bug
Aegislash      Steel / Ghost
                  ...
Zoroark                 Dark
Zorua                   Dark
Zubat        Poison / Flying
Zweilous       Dark / Dragon
Zygarde      Dragon / Ground
Name: Type, Length: 809, dtype: object
With a date index, the rows are sorted from the earliest date to the latest.
battles.sort_index()
Start Date
1774-09-01    Massachusetts
1774-12-14    New Hampshire
1775-04-19    Massachusetts
1775-04-19    Massachusetts
1775-04-20         Virginia
                  ...
1783-01-22         Virginia
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
Name: State, Length: 232, dtype: object
NaT (not a time) marks a missing date value.
-
sort_index(): use the na_position parameter to show NaT first
battles.sort_index(na_position="first")
Start Date
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
1774-09-01    Massachusetts
                  ...
1782-09-11         Virginia
1782-09-13              NaN
1782-10-18              NaN
1782-12-06              NaN
1783-01-22         Virginia
Name: State, Length: 232, dtype: object
-
sort_index(): sorting dates from latest to earliest
battles.sort_index(ascending=False)
Start Date
1783-01-22         Virginia
1782-12-06              NaN
1782-10-18              NaN
1782-09-13              NaN
1782-09-11         Virginia
                  ...
1774-09-01    Massachusetts
NaT              New Jersey
NaT                Virginia
NaT                     NaN
NaT                     NaN
Name: State, Length: 232, dtype: object
-
nsmallest() returns the n smallest values, sorted in ascending order; n defaults to 5. It does not work on a Series of strings.
google.nsmallest()
Date
2004-09-03    49.82
2004-09-01    49.94
2004-08-19    49.98
2004-09-02    50.57
2004-09-07    50.60
Name: Close, dtype: float64
-
nlargest() returns the n largest values, sorted in descending order; n defaults to 5. It does not work on a Series of strings.
google.nlargest()
Date
2019-04-29    1287.58
2019-04-26    1272.18
2018-07-26    1268.33
2019-10-25    1265.13
2019-04-23    1264.55
Name: Close, dtype: float64
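Both methods take the number of rows as their first argument. The sketch below assumes six made-up prices with no duplicates and shows that, for such data, nsmallest(3) matches sorting and taking the first three rows.

```python
import pandas as pd

# Made-up closing prices, all distinct.
prices = pd.Series([52.24, 49.98, 54.50, 53.95, 52.80, 50.57])

print(prices.nsmallest(3).tolist())  # [49.98, 50.57, 52.24]
print(prices.nlargest(2).tolist())   # [54.5, 53.95]

# With distinct values, nsmallest(n) gives the same rows as sort_values().head(n).
same = prices.nsmallest(3).equals(prices.sort_values().head(3))
print(same)  # True
```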
-
The inplace parameter replaces the existing Series
battles.sort_values(inplace=True)
With inplace=True, the method modifies the existing object rather than returning a new copy.
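The difference between the two call styles shows up in the return value. A minimal sketch, assuming a made-up three-element Series:

```python
import pandas as pd

# Made-up prices in unsorted order.
prices = pd.Series([52.24, 49.98, 54.50])

# Without inplace, a new sorted Series is returned and prices is untouched.
sorted_copy = prices.sort_values()
before = prices.tolist()
print(before)  # [52.24, 49.98, 54.5]

# With inplace=True, the call returns None and prices itself is reordered.
result = prices.sort_values(inplace=True)
print(result)           # None
print(prices.tolist())  # [49.98, 52.24, 54.5]
```

Because the in-place call returns None, assigning its result back to the variable would discard the data.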
-
value_counts(): counting the occurrences of each value
The result is sorted by count in descending order by default.
pokemon.value_counts()
Type
Normal                65
Water                 61
Grass                 38
Psychic               35
Fire                  30
                      ..
Fire / Psychic         1
Normal / Ground        1
Psychic / Fighting     1
Dark / Ghost           1
Fire / Ghost           1
Name: count, Length: 159, dtype: int64
value_counts() returns a new Series: its index labels are the values of the pokemon Series, and its values are their respective counts.
-
nunique(): the number of unique values
pokemon.nunique()
159
-
value_counts(): the ascending parameter controls the sort order
It defaults to False, which sorts the counts in descending order. To sort in ascending order, set ascending to True.
pokemon.value_counts(ascending=True)
Type
Fire / Ghost         1
Fighting / Dark      1
Fighting / Steel     1
Normal / Ground      1
Fire / Psychic       1
                    ..
Fire                30
Psychic             35
Grass               38
Water               61
Normal              65
Name: count, Length: 159, dtype: int64
-
value_counts(): the normalize parameter returns the relative frequency of each unique value
pokemon.value_counts(normalize=True)
Type
Normal                0.080346
Water                 0.075402
Grass                 0.046972
Psychic               0.043263
Fire                  0.037083
                        ...
Fire / Psychic        0.001236
Normal / Ground       0.001236
Psychic / Fighting    0.001236
Dark / Ghost          0.001236
Fire / Ghost          0.001236
Name: proportion, Length: 159, dtype: float64
Multiply the values in the Series by 100 to get percentages.
pokemon.value_counts(normalize=True) * 100
Type
Normal                8.034611
Water                 7.540173
Grass                 4.697157
Psychic               4.326329
Fire                  3.708282
                        ...
Fire / Psychic        0.123609
Normal / Ground       0.123609
Psychic / Fighting    0.123609
Dark / Ghost          0.123609
Fire / Ghost          0.123609
Name: proportion, Length: 159, dtype: float64
-
round(): setting the precision of the percentages
(pokemon.value_counts(normalize=True) * 100).round(2)
Type
Normal                8.03
Water                 7.54
Grass                 4.70
Psychic               4.33
Fire                  3.71
                      ...
Fire / Psychic        0.12
Normal / Ground       0.12
Psychic / Fighting    0.12
Dark / Ghost          0.12
Fire / Ghost          0.12
Name: proportion, Length: 159, dtype: float64
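The frequencies returned by normalize=True are simply the counts divided by the number of non-missing values. The sketch below assumes eight made-up type labels, picked so the fractions come out exact.

```python
import pandas as pd

# Made-up sample: 4 Fire, 2 Water, 2 Grass.
types = pd.Series(["Fire", "Water", "Fire", "Grass", "Fire", "Water", "Fire", "Grass"])

counts = types.value_counts()                # absolute counts
shares = types.value_counts(normalize=True)  # counts divided by the total
percent = (shares * 100).round(1)            # percentages with one decimal place

print(counts["Fire"], shares["Fire"], percent["Fire"])  # 4 0.5 50.0
```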
-
max(): returns the largest value in the Series
google.max()
1287.58
-
min(): returns the smallest value in the Series
google.min()
49.82
-
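value_counts(): the bins parameter groups values into intervals
buckets = [0, 200, 400, 600, 800, 1000, 1200, 1400]
google.value_counts(bins=buckets)
(200.0, 400.0]      1568
(-0.001, 200.0]      595
(400.0, 600.0]       575
(1000.0, 1200.0]     406
(600.0, 800.0]       380
(800.0, 1000.0]      207
(1200.0, 1400.0]      93
Name: count, dtype: int64
- A parenthesis means the endpoint is excluded from the interval.
- A square bracket means the endpoint is included in the interval.
- A closed interval includes both endpoints, e.g. [5, 10].
- An open interval excludes both endpoints, e.g. (5, 10).
- value_counts() with the bins parameter returns half-open intervals, which include one endpoint and exclude the other.
- bins also accepts an integer. Pandas then takes the difference between the largest and smallest values in the Series and divides that range into the specified number of bins.
The returned Series is sorted by count in descending order.
To list the intervals in order instead, sort the result by index in ascending order.
google.value_counts(bins=buckets).sort_index() # or google.value_counts(bins=buckets, sort=False)
(-0.001, 200.0]      595
(200.0, 400.0]      1568
(400.0, 600.0]       575
(600.0, 800.0]       380
(800.0, 1000.0]      207
(1000.0, 1200.0]     406
(1200.0, 1400.0]      93
Name: count, dtype: int64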
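Which endpoint a half-open interval includes matters for values that sit exactly on a bin edge. This sketch assumes five made-up prices, two of which (200 and 400) fall on an edge, and checks that each lands in the interval that is closed on the right.

```python
import pandas as pd

# Made-up prices; 200 and 400 sit exactly on bin edges.
prices = pd.Series([50, 200, 250, 400, 1200])

# Explicit bin edges, listed in interval order via sort_index().
by_bin = prices.value_counts(bins=[0, 200, 400, 1400]).sort_index()
print(by_bin.tolist())  # [2, 2, 1]

# The intervals are closed on the right: 200 belongs to the first bin only.
first_bin, second_bin = by_bin.index[0], by_bin.index[1]
print(200 in first_bin, 200 in second_bin)  # True False

# An integer asks pandas to choose that many equal-width bins itself.
auto = prices.value_counts(bins=3)
print(auto.sum())  # 5
```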
-
value_counts() excludes NaN values by default; to count them as well, pass dropna=False
battles.value_counts(dropna=False)
State
NaN               70
South Carolina    31
New York          28
New Jersey        24
Virginia          21
Massachusetts     11
Pennsylvania      10
North Carolina     9
Florida            8
Georgia            6
Rhode Island       3
Connecticut        3
Vermont            3
New Hampshire      1
Delaware           1
Indiana            1
Louisiana          1
Ohio               1
Name: count, dtype: int64
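The difference between the two settings is exactly the number of missing values. A small sketch, assuming a made-up six-element Series with three gaps:

```python
import numpy as np
import pandas as pd

# Made-up sample: 2 Virginia, 1 Ohio, 3 missing.
states = pd.Series(["Virginia", np.nan, "Virginia", np.nan, np.nan, "Ohio"])

without_nan = states.value_counts()          # NaN ignored (default)
with_nan = states.value_counts(dropna=False)  # NaN counted as its own entry

print(without_nan.sum())  # 3
print(with_nan.sum())     # 6
```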
-
value_counts() also works on the index of a Series
battles.index.value_counts()
Start Date
1781-04-25    2
1781-05-22    2
1780-08-18    2
1781-09-13    2
1782-03-16    2
             ..
1778-06-30    1
1778-07-03    1
1778-07-27    1
1778-08-21    1
1783-01-22    1
Name: count, Length: 217, dtype: int64
-
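apply(): calling a function on every Series value
Functions are first-class objects in Python.
Anything you can do with a number, you can also do with a function:
- store functions in a list
- use a function as the value for a dictionary key
- pass a function as an argument to another function
- return a function from another function
A function is a sequence of instructions that produces an output; a function call is the actual execution of those instructions.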
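The four uses listed above can each be shown in a line or two of plain Python. The function names below (double, square, apply_to_all, make_adder) are made up for illustration.

```python
def double(x):
    return x * 2


def square(x):
    return x * x


# Store functions in a list.
steps = [double, square]

# Use functions as the values of dictionary keys.
by_name = {"double": double, "square": square}


# Pass a function as an argument to another function.
def apply_to_all(func, values):
    return [func(v) for v in values]


# Return a function from another function.
def make_adder(n):
    def add(x):
        return x + n
    return add


print([f(3) for f in steps])            # [6, 9]
print(by_name["square"](4))             # 16
print(apply_to_all(double, [1, 2, 3]))  # [2, 4, 6]
print(make_adder(10)(5))                # 15
```

Passing double without parentheses hands over the function itself; adding parentheses is what executes it. Series.apply() relies on the same idea.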
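The round() function rounds a value to the nearest integer: a fractional part above 0.5 rounds up and one below 0.5 rounds down (in Python 3, a fractional part of exactly 0.5 rounds to the nearest even integer).
google.apply(round)
Date
2004-08-19      50
2004-08-20      54
2004-08-23      54
2004-08-24      52
2004-08-25      53
              ...
2019-10-21    1246
2019-10-22    1243
2019-10-23    1259
2019-10-24    1261
2019-10-25    1265
Name: Close, Length: 3824, dtype: int64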
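One detail of Python 3's built-in round() is visible in the output above: 54.50 becomes 54, not 55, because exact halves go to the nearest even integer. A short sketch, using three prices taken from the output above:

```python
import pandas as pd

# Exact halves round to the nearest even integer in Python 3.
print(round(0.5), round(1.5), round(2.5))  # 0 2 2

# Three closing prices from the google Series.
prices = pd.Series([49.98, 54.50, 52.80])
rounded = prices.apply(round)
print(rounded.tolist())  # [50, 54, 53]
```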
Define a function single_or_multi that returns "Multi" if the type contains a slash (/) and "Single" otherwise.
def single_or_multi(pokemon_type):
    if "/" in pokemon_type:
        return "Multi"
    return "Single"
pokemon.apply(single_or_multi)
Pokemon
Bulbasaur       Multi
Ivysaur         Multi
Venusaur        Multi
Charmander     Single
Charmeleon     Single
                ...
Stakataka       Multi
Blacephalon     Multi
Zeraora        Single
Meltan         Single
Melmetal       Single
Name: Type, Length: 809, dtype: object
Pandas calls single_or_multi once for each value in the Series.
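For a simple substring test like this one, the same labels can also be produced without apply(). The sketch below is an alternative that is not part of the text above: it assumes a three-row made-up sample and uses Series.str.contains() followed by map().

```python
import pandas as pd

# Three sample rows standing in for the pokemon Series.
pokemon = pd.Series(
    ["Grass / Poison", "Fire", "Rock / Steel"],
    index=["Bulbasaur", "Charmander", "Stakataka"],
    name="Type",
)


def single_or_multi(pokemon_type):
    if "/" in pokemon_type:
        return "Multi"
    return "Single"


# apply() calls the function once per value.
via_apply = pokemon.apply(single_or_multi)

# Alternative: build a boolean mask, then map True/False to the two labels.
via_str = pokemon.str.contains("/", regex=False).map({True: "Multi", False: "Single"})

print(via_apply.tolist())
print(via_str.tolist())
```

apply() remains the more general tool, since it accepts any Python function.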
-
Coding challenge
Determine which day of the week saw the most battles during the American Revolutionary War.
The final output should be a Series with the days of the week as index labels and the number of battles on each day as values.
Raw data
pd.read_csv("./file/chapter_03/revolutionary_war.csv")
                                Battle  Start Date          State
0                         Powder Alarm    9/1/1774  Massachusetts
1    Storming of Fort William and Mary  12/14/1774  New Hampshire
2     Battles of Lexington and Concord   4/19/1775  Massachusetts
3                      Siege of Boston   4/19/1775  Massachusetts
4                   Gunpowder Incident   4/20/1775       Virginia
..                                 ...         ...            ...
227                Siege of Fort Henry   9/11/1782       Virginia
228         Grand Assault on Gibraltar   9/13/1782            NaN
229          Action of 18 October 1782  10/18/1782            NaN
230          Action of 6 December 1782   12/6/1782            NaN
231          Action of 22 January 1783   1/22/1783       Virginia

[232 rows x 3 columns]
Import only the Start Date column, parsed as dates. Since there is only one column, call squeeze() to convert it to a Series.
war = pd.read_csv("./file/chapter_03/revolutionary_war.csv", usecols=["Start Date"], parse_dates=["Start Date"]).squeeze("columns")
war
0     1774-09-01
1     1774-12-14
2     1775-04-19
3     1775-04-19
4     1775-04-20
         ...
227   1782-09-11
228   1782-09-13
229   1782-10-18
230   1782-12-06
231   1783-01-22
Name: Start Date, Length: 232, dtype: datetime64[ns]
Define a function that converts a date to its day of the week.
def day_for_week(date):
    return date.strftime("%A")
Drop the NaT values, use apply() to call day_for_week() on each value of the Series, and then count how often each unique value occurs.
war.dropna().apply(day_for_week).value_counts()
Start Date
Saturday     39
Friday       39
Wednesday    32
Thursday     31
Sunday       31
Tuesday      29
Monday       27
Name: count, dtype: int64
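As an alternative to apply() with strftime (not the approach used above), a datetime Series also exposes the weekday name through the .dt accessor. The sketch below assumes a four-element sample built from dates in the dataset plus one missing value.

```python
import pandas as pd

# Three dates from the dataset and one missing value (becomes NaT).
war = pd.Series(pd.to_datetime(["1775-04-19", "1775-04-19", "1775-04-20", None]))

# Drop NaT, convert each date to its weekday name, then count the names.
counts = war.dropna().dt.day_name().value_counts()

print(counts["Wednesday"], counts["Thursday"])  # 2 1
```

dt.day_name() returns English names by default, whereas strftime("%A") follows the process locale.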