Pandas 数据分析实战
第一章 Pandas 概述
-
read_csv() 没有设置索引列
read_csv 函数导入 movies.csv 文件,由于没有设置索引,Pandas 会生产一个从 0 开始的数字索引
movies = pd.read_csv('./file/chapter_01/movies.csv') print(movies)
Rank Title Studio Gross Year 0 1 Avengers: Endgame Buena Vista $2,796.30 2019 1 2 Avatar Fox $2,789.70 2009 2 3 Titanic Paramount $2,187.50 1997 3 4 Star Wars: The Force Awakens Buena Vista $2,068.20 2015 4 5 Avengers: Infinity War Buena Vista $2,048.40 2018 .. ... ... ... ... ... 777 778 Yogi Bear Warner Brothers $201.60 2010 778 779 Garfield: The Movie Fox $200.80 2004 779 780 Cats & Dogs Warner Brothers $200.70 2001 780 781 The Hunt for Red October Paramount $200.50 1990 781 782 Valkyrie MGM $200.30 2008 [782 rows x 5 columns]
为避免显示结果过多导致页面过长,Pandas 仅显示 DataFrame 的前五行和后五行,省略的数据用 ... 代替
-
read_csv() 指定索引列(index_col)
使用 Title 作为索引
movies = pd.read_csv("./file/chapter_01/movies.csv", index_col="Title") print(movies)
Rank Studio Gross Year Title Avengers: Endgame 1 Buena Vista $2,796.30 2019 Avatar 2 Fox $2,789.70 2009 Titanic 3 Paramount $2,187.50 1997 Star Wars: The Force Awakens 4 Buena Vista $2,068.20 2015 Avengers: Infinity War 5 Buena Vista $2,048.40 2018 ... ... ... ... ... Yogi Bear 778 Warner Brothers $201.60 2010 Garfield: The Movie 779 Fox $200.80 2004 Cats & Dogs 780 Warner Brothers $200.70 2001 The Hunt for Red October 781 Paramount $200.50 1990 Valkyrie 782 MGM $200.30 2008 [782 rows x 4 columns]
-
操作 DataFrame
-
head():
从开头提取几行,默认是 5 行 ,可以选择传参,比如 head(4),获取前 4 行数据
movies.head()
Rank Studio Gross Year Title Avengers: Endgame 1 Buena Vista $2,796.30 2019 Avatar 2 Fox $2,789.70 2009 Titanic 3 Paramount $2,187.50 1997 Star Wars: The Force Awakens 4 Buena Vista $2,068.20 2015 Avengers: Infinity War 5 Buena Vista $2,048.40 2018
-
tail()
查看数据集的尾部数据,默认是 5 行,可以选择传参,比如 tail(6),获取尾部 6 行数据
movies.tail()
Rank Studio Gross Year Title Yogi Bear 778 Warner Brothers $201.60 2010 Garfield: The Movie 779 Fox $200.80 2004 Cats & Dogs 780 Warner Brothers $200.70 2001 The Hunt for Red October 781 Paramount $200.50 1990 Valkyrie 782 MGM $200.30 2008
-
len()
查询 DataFrame 有多少行
len(movies)
782
-
属性:shape
查询 DataFrame 的行数和列数
movies.shape
(782, 5)
-
属性:size
查询单元格总数
movies.size
3910
-
属性:dtypes
查询列的数据类型,int64 表示整数列,object 表示文本列
movies.dtypes
Rank int64 Title object Studio object Gross object Year int64 dtype: object
-
iloc[]
根据索引位置,提取信息。索引从 0 开始计算。返回一个 Series 对象。
movies.iloc[499]
Rank 500 Title Maze Runner: The Death Cure Studio Fox Gross $288.30 Year 2018 Name: 499, dtype: object
-
loc[]
使用索引标签提取数据,在导入数据的时候,设置了 Title 作为索引
movies.loc["Forrest Gump"]
Rank 119 Studio Paramount Gross $677.90 Year 1994 Name: Forrest Gump, dtype: object
索引标签可以包含重复项。尽可能保持索引标签的唯一性,唯一的标签集合可以加快 Pandas 定位和提取特定行的速度
movies.loc["101 Dalmatians"]
Rank Studio Gross Year Title 101 Dalmatians 425 Buena Vista $320.70 1996 101 Dalmatians 708 Buena Vista $215.90 1961
-
sort_values()
按 Year 列中的值进行排序,取最新上映的 5 部电影 。by 指定列名或者索引值, ascending 是否按指定列的数组升序排序,默认为 True,即升序排列
movies.sort_values(by='Year', ascending= False).head()
Rank ... Year Title ... Avengers: Endgame 1 ... 2019 John Wick: Chapter 3 - Parabellum 458 ... 2019 The Wandering Earth 114 ... 2019 Toy Story 4 198 ... 2019 How to Train Your Dragon: The Hidden World 199 ... 2019 [5 rows x 4 columns]
还可以根据多列的值对 DateFrame 进行排序,先按 Studio 列的值排序,如果出现重复值,再按 Year 列的值对电影进行排序
movies.sort_values(by=['Studio','Year']).head()
Rank Studio Gross Year Title The Blair Witch Project 588 Artisan $248.60 1999 101 Dalmatians 708 Buena Vista $215.90 1961 The Jungle Book 755 Buena Vista $205.80 1967 Who Framed Roger Rabbit 410 Buena Vista $329.80 1988 Dead Poets Society 636 Buena Vista $235.90 1989
-
sort_index()
可以按照字母顺序查看电影信息(设置了 Title 为索引)
movies.sort_index().head()
Rank Studio Gross Year Title 10,000 B.C. 536 Warner Brothers $269.80 2008 101 Dalmatians 708 Buena Vista $215.90 1961 101 Dalmatians 425 Buena Vista $320.70 1996 2 Fast 2 Furious 632 Universal $236.40 2003 2012 93 Sony $769.70 2009
-
-
计算 Series 中的值
从 DataFrame 中提取一列数据作为 Series。Pandas 在 Series 中保存了 DataFrame 的索引,即设置的 Title
movies['Studio']
Title Avengers: Endgame Buena Vista Avatar Fox Titanic Paramount Star Wars: The Force Awakens Buena Vista Avengers: Infinity War Buena Vista ... Yogi Bear Warner Brothers Garfield: The Movie Fox Cats & Dogs Warner Brothers The Hunt for Red October Paramount Valkyrie MGM Name: Studio, Length: 782, dtype: object
Series 中有多行数据,Pandas 会截断数据集,只显示前五行和后五行
计算 Series 列中唯一值出现的次数,将结果限制在排名前10的电源公司:
movies['Studio'].value_counts().head(10)
Studio Warner Brothers 132 Buena Vista 125 Fox 117 Universal 109 Sony 86 Paramount 76 Dreamworks 27 Lionsgate 21 New Line 16 MGM 11 Name: count, dtype: int64
返回值是另一个 Series 对象。此时,Pandas 使用 Studio 列中的公司名称作为索引标签,他们的计算作为Series 值
-
根据一个或多个条件筛选列
movies[movies["Studio"] == "Universal"]
Rank Studio Gross Year Title Jurassic World 6 Universal $1,671.70 2015 Furious 7 8 Universal $1,516.00 2015 Jurassic World: Fallen Kingdom 13 Universal $1,309.50 2018 The Fate of the Furious 17 Universal $1,236.00 2017 Minions 19 Universal $1,159.40 2015 ... ... ... ... ... The Break-Up 763 Universal $205.00 2006 Everest 766 Universal $203.40 2015 Patch Adams 772 Universal $202.30 1998 Kindergarten Cop 775 Universal $202.00 1990 Straight Outta Compton 776 Universal $201.60 2015 [109 rows x 4 columns]
可以将过滤条件赋给一个变量
filter_by_universal = movies["Studio"] == "Universal" movies[filter_by_universal]
Rank Studio Gross Year Title Jurassic World 6 Universal $1,671.70 2015 Furious 7 8 Universal $1,516.00 2015 Jurassic World: Fallen Kingdom 13 Universal $1,309.50 2018 The Fate of the Furious 17 Universal $1,236.00 2017 Minions 19 Universal $1,159.40 2015
还可以根据多个条件过滤DataFrame行,获取2015年由 Universal 发行的所有电影
filter_by_universal = movies["Studio"] == "Universal" filter_by_year = movies['Year'] == 2015 movies[filter_by_universal & filter_by_year]
Rank Studio Gross Year Title Jurassic World 6 Universal $1,671.70 2015 Furious 7 8 Universal $1,516.00 2015 Minions 19 Universal $1,159.40 2015 Fifty Shades of Grey 165 Universal $571.00 2015 Pitch Perfect 2 504 Universal $287.50 2015 Ted 2 702 Universal $216.70 2015 Everest 766 Universal $203.40 2015 Straight Outta Compton 776 Universal $201.60 2015
若要筛选符合以下两个条件中任意一个的电源:Universal 发行的电源或2015年发行的电影
filter_by_universal = movies["Studio"] == "Universal" filter_by_year = movies['Year'] == 2015 movies[filter_by_universal | filter_by_year]
Rank Studio Gross Year Title Star Wars: The Force Awakens 4 Buena Vista $2,068.20 2015 Jurassic World 6 Universal $1,671.70 2015 Furious 7 8 Universal $1,516.00 2015 Avengers: Age of Ultron 9 Buena Vista $1,405.40 2015 Jurassic World: Fallen Kingdom 13 Universal $1,309.50 2018 ... ... ... ... ... The Break-Up 763 Universal $205.00 2006 Everest 766 Universal $203.40 2015 Patch Adams 772 Universal $202.30 1998 Kindergarten Cop 775 Universal $202.00 1990 Straight Outta Compton 776 Universal $201.60 2015 [140 rows x 4 columns]
可以将小于或大于特定值的列值作为筛选条件。以1975年之前发行的电影为筛选条件
filter_by_1975 = movies['Year'] < 1975 movies[filter_by_1975]
Rank Studio Gross Year Title The Exorcist 252 Warner Brothers $441.30 1973 Gone with the Wind 288 MGM $402.40 1939 Bambi 540 RKO $267.40 1942 The Godfather 604 Paramount $245.10 1972 101 Dalmatians 708 Buena Vista $215.90 1961 The Jungle Book 755 Buena Vista $205.80 1967
指定一个范围,所有值必须在该范围之内。筛选1983-1986年发行的电影
year__between = movies["Year"].between(1983, 1986) print(movies[year__between])
Rank Studio Gross Year Title Return of the Jedi 222 Fox $475.10 1983 Back to the Future 311 Universal $381.10 1985 Top Gun 357 Paramount $356.80 1986 Indiana Jones and the Temple of Doom 403 Paramount $333.10 1984 Crocodile Dundee 413 Paramount $328.20 1986 Beverly Hills Cop 432 Paramount $316.40 1984 Rocky IV 467 MGM $300.50 1985 Rambo: First Blood Part II 469 TriStar $300.40 1985 Ghostbusters 485 Columbia $295.20 1984 Out of Africa 662 Universal $227.50 1985
使用 DataFrame 索引来过滤。查找标题中带有“dark"一词的所有电影。先将索引的电影标题中的大写字母转换为小写字母。
str_lower__str_contains = movies.index.str.lower().str.contains("dark") print(movies[str_lower__str_contains])
Rank Studio Gross Year Title Transformers: Dark of the Moon 23 Paramount $1,123.80 2011 The Dark Knight Rises 27 Warner Brothers $1,084.90 2012 The Dark Knight 39 Warner Brothers $1,004.90 2008 Thor: The Dark World 132 Buena Vista $644.60 2013 Star Trek Into Darkness 232 Paramount $467.40 2013 Fifty Shades Darker 309 Universal $381.50 2017 Dark Shadows 600 Warner Brothers $245.50 2012 Dark Phoenix 603 Fox $245.10 2019
-
对数据分组
筛选出总票房最高的电影公司,首先按电影公司名称汇总 Gross 列的值。注意 Gross 列的值存储为文本而不是数字。Pandas 将列的值作为文本导入,以保留原始 CSV 中的美元符号和逗号符合,需要删除美元符号和逗号符合,然后将列的值转换为数字
movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
Title Avengers: Endgame 2796.30 Avatar 2789.70 Titanic 2187.50 Star Wars: The Force Awakens 2068.20 Avengers: Infinity War 2048.40 ... Yogi Bear 201.60 Garfield: The Movie 200.80 Cats & Dogs 200.70 The Hunt for Red October 200.50 Valkyrie 200.30 Name: Gross, Length: 782, dtype: object
可以看到 dtype 类型为 object ,可以转换为浮点数
movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
Title Avengers: Endgame 2796.3 Avatar 2789.7 Titanic 2187.5 Star Wars: The Force Awakens 2068.2 Avengers: Infinity War 2048.4 ... Yogi Bear 201.6 Garfield: The Movie 200.8 Cats & Dogs 200.7 The Hunt for Red October 200.5 Valkyrie 200.3 Name: Gross, Length: 782, dtype: float64
以上操作都是在原始数据结构的副本上进行的操作,需要修改原始数据,需要进行覆盖操作
gross_series = movies['Gross'].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float) movies['Gross'] = gross_series print(movies.dtypes)
Rank int64 Studio object Gross float64 Year int64 dtype: object
计算电影的平均票房收入
movies['Gross'].mean()
439.0308184143222
计算每个电影公司的总票房收入
先对电影公司进行分组
studios = movies.groupby("Studio")
对电影出票数量由多到少的顺序排序
studios['Gross'].count().sort_values(ascending=False).head()
Studio Warner Brothers 132 Buena Vista 125 Fox 117 Universal 109 Sony 86 Name: Gross, dtype: int64
计算每个电影公司的总票房,找出票房最高的5家公司
studios['Gross'].sum().sort_values(ascending=False).head()
Studio Buena Vista 73585.0 Warner Brothers 58643.8 Fox 50420.8 Universal 44302.3 Sony 32822.5 Name: Gross, dtype: float64