Tags: index, python, Series, notes, df, DataFrame, pd, pandas

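Study notes from Song Tian's Pandas course at Beijing Institute of Technology. Some content is supplemented from the Runoob tutorial (菜鸟教程).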

Pandas is a third-party Python library that provides high-performance, easy-to-use data types and analysis tools.

Pandas is built on NumPy. Pandas focuses on how data is expressed for applications and on the relationship between data and its index.

Data types

Series = index + one-dimensional data

DataFrame = row and column indexes + two-dimensional data

Series

A Series is a one-dimensional array-like object made up of a set of data and the index associated with that data.

import numpy as np
import pandas as pd

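# Signature: pd.Series(data, index, dtype, name, copy=False)

# Create from a list. If no index is given, the index starts from 0.
pd.Series([6, 5, 4, 3])
pd.Series([6, 5, 4, 3], index=['a', 'b', 'c', 'd'])

# Create from a scalar value
pd.Series(25, index=[3, 2, 1])

# Create from a dict
pd.Series({'a': 6, 'b': 5, 'c': 4, 'd': 3})
# With an explicit index, values are picked by label; 'e' is not in the dict, so it becomes NaN
pd.Series({'a': 6, 'b': 5, 'c': 4, 'd': 3}, index=['d', 'a', 'c', 'b', 'e'])

# Create from an ndarray
pd.Series(np.arange(5), index=np.arange(9, 4, -1))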

Basic Series operations mainly work on the data and on the index:

s = pd.Series([6, 5, 4, 3], index=['a', 'b', 'c', 'd'])

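# Get all index labels
s.index

# Get all values
s.values

# Get a value by index (the automatic positional index and the custom index coexist)
s['b'] # value is 5
s.iloc[1] # value is 5; s[1] on a non-integer index is deprecated in recent pandas
# The two indexes coexist, but they cannot be mixed
#s[[0, 'b']] # !!! raises an error

# Slicing. Works like slicing an ndarray
s[:3]

# Index with a comparison
s[s > s.median()]

# NumPy operations can be applied to a Series
np.exp(s)

# The reserved word in. When a custom index exists, in checks the custom index.
'a' in s   # True
0 in s     # False

# .get() avoids the error that [] raises when the index does not exist
s.get('z', 100) # the index does not exist, so 100 is returned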
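
The two kinds of index described above can be made explicit with .loc (labels) and .iloc (positions). The following is a minimal sketch, not part of the course notes; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([6, 5, 4, 3], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']          # label-based lookup
by_position = s.iloc[1]        # position-based lookup
label_slice = s.loc['a':'c']   # label slices include both endpoints
pos_slice = s.iloc[0:2]        # positional slices exclude the stop position

print(by_label, by_position)
print(list(label_slice.index), list(pos_slice.index))
```

Using .loc and .iloc avoids any ambiguity about which of the two indexes a bare [] is consulting.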

Series alignment:

# Series + Series. Data with different indexes is aligned automatically during the operation.
a = pd.Series([1, 2, 3], ['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])
print(a + b)
'''
Output:
a    NaN
b    NaN
c    8.0
d    8.0
e    NaN
dtype: float64
'''
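
If the NaN entries are not wanted, the method form of the operator can treat a label missing from one side as a chosen value. This is a small sketch reusing the same two Series; choosing 0 as the fill value is an assumption made for the example.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Labels present on only one side are treated as 0 before adding
total = a.add(b, fill_value=0)
print(total)
```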

Both the Series object and its index can have a name:

a.name = 'Series object'
a.index.name = 'index column'

DataFrame

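A DataFrame is a tabular data structure made up of a set of columns that share the same index; each column can hold a different data type. In short, a DataFrame is a two-dimensional labeled array.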

'''
index: row labels
columns: column labels
'''
# Signature: pd.DataFrame(data, index, columns, dtype, copy)

# Create from a two-dimensional ndarray
df = pd.DataFrame(np.arange(10).reshape(2, 5))

# Create from a list of dicts
dt = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(dt)

# Create from a dict of Series
dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(dt, index=['b', 'c', 'd'], columns=['one', 'two', 'three'])

# Create from a dict of lists
dl = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}
df = pd.DataFrame(dl, index=['a', 'b', 'c', 'd'])
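
To see how the index and columns arguments select and pad data, here is a short sketch built on the dict-of-Series example above. The comments describe what I expect pandas to produce.

```python
import pandas as pd

dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}

# Only rows b, c, d are kept; 'three' has no source data, so it is all NaN
df = pd.DataFrame(dt, index=['b', 'c', 'd'], columns=['one', 'two', 'three'])
print(df)
print(df.shape)
```

Row d exists only in 'two', so its value in 'one' is NaN as well.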

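DataFrame operations:

# Get all column labels; returns an Index
df.columns

# Get all row labels; returns an Index
df.index

# Get all table values
df.values

# ['column label'] returns one column as a Series
df['one']
df.one

# The loc attribute returns the given row as a Series whose index is df's column index and whose values are that row's values
df.loc['a']
df.loc['a', 'one'] # returns the value at row a, column one

# ['column label']['row label'] returns the value where that column and row meet
df['two']['b']

# Drop the given row (returns a new DataFrame; df itself is unchanged)
df.drop('a')
# Drop the given column
df.drop('one', axis=1)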

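'''
Reindexing
reindex(index, columns, fill_value = nan, method, limit, copy)
index, columns: the new custom row and column indexes
fill_value: value used to fill the missing positions created by reindexing
method: fill method; ffill carries the last valid value forward, bfill fills backward
limit: maximum number of consecutive fills
'''
df = df.reindex(index=['d', 'c', 'a'], columns=['three', 'two', 'one'], fill_value = 200)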
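
The difference between fill_value and method is easier to see on a small frame with gaps in its index. This is an illustrative sketch; the column name x and the data are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]}, index=[0, 2, 4])

# Fill the new rows 1, 3, 5 with a constant
constant = df.reindex(index=range(6), fill_value=-1)

# Carry the previous row forward into the new rows
forward = df.reindex(index=range(6), method='ffill')

# Pull the next row backward; row 5 has no later row, so it stays NaN
backward = df.reindex(index=range(6), method='bfill')

print(constant['x'].tolist())
print(forward['x'].tolist())
print(backward['x'].tolist())
```

Note that method only works when the original index is monotonically increasing or decreasing.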

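df.index and df.columns both return an Index. Common Index methods are listed below:

Method	Description
.append(idx)	Concatenate another Index object, producing a new Index object
.difference(idx)	Compute the set difference, producing a new Index object
.intersection(idx)	Compute the intersection, producing a new Index object
.union(idx)	Compute the union, producing a new Index object
.delete(loc)	Delete the element at position loc
.insert(loc, e)	Insert an element e at position loc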

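new_col = df.columns.delete(2)
# Append 'e' at the end; loc must not exceed the current length of the index
new_index = df.index.insert(len(df.index), 'e')
# method='ffill' only applies to a monotonic index, so fill_value is used here
new_df = df.reindex(index=new_index, columns=new_col, fill_value=0)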
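
A quick sketch of the Index methods listed above, using two small hand-made Index objects (left and right are names chosen for this example). Each call returns a new Index, since an Index is immutable.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

union = left.union(right)
common = left.intersection(right)
only_left = left.difference(right)
shorter = left.delete(0)         # remove the element at position 0
longer = left.insert(1, 'z')     # insert 'z' at position 1

print(list(union), list(common), list(only_left))
print(list(shorter), list(longer), list(left))
```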

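Data operations

Arithmetic rules

Arithmetic operations align the operands on their row and column indexes, pad the missing entries, then compute; the result is usually float because padding introduces NaN.
Entries missing during padding are filled with NaN.
Operations between 2-D and 1-D data, or between 1-D and 0-D data, are broadcast.
The four arithmetic operations produce a new object.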

a = pd.DataFrame(np.arange(12).reshape(3,4))
b = pd.DataFrame(np.arange(20).reshape(4,5))
c = pd.Series(np.arange(4))

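# Automatic padding; missing entries become NaN
a + b

# Method forms (add, sub, mul, div) accept fill_value, which stands in for entries missing from one operand
a.mul(b, fill_value = 0)

# Operations across different dimensions are broadcast
c - 10
b - c # operates along axis 1 by default: c is subtracted from every row of b
b.sub(c, axis=0) # c is subtracted from every column of b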
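
The two broadcast directions are easy to mix up, so here is a small check using the same b and c. The expected values in the comments follow from the data.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# c's index is matched against b's columns; column 4 has no partner and becomes NaN
by_row = b - c

# c's index is matched against b's rows; every row has a partner
by_col = b.sub(c, axis=0)

print(by_row)
print(by_col)
```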

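Comparison rules

Comparison operations only compare elements with the same index; no padding is done
Operations between 2-D and 1-D data, or between 1-D and 0-D data, are broadcast
Binary operations with >, <, >=, <=, ==, != and similar symbols produce Boolean objects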

a = pd.DataFrame(np.arange(12).reshape(3,4))
b = pd.DataFrame(np.arange(12, 0, -1).reshape(3,4))
c = pd.Series(np.arange(4))

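# Same-dimension comparison; both operands must have the same shape and labels
a > b
a == b

# Different dimensions: broadcast
c > 0
a > c # operates along axis 1 by default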
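
Because comparisons do not pad, comparing two DataFrames with different labels fails instead of producing NaN. A minimal sketch of that behavior, with frames chosen for the example:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))

try:
    _ = a > b            # shapes and labels differ
    raised = False
except ValueError:
    raised = True

# Identically labeled operands compare element by element
same_labels = a > (a - 1)

print(raised)
print(same_labels)
```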

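Data cleaning

Data cleaning is the process of handling data that is not usable.

Many datasets contain missing data, badly formatted data, wrong data, or duplicate data. To make data analysis more accurate, this unusable data has to be dealt with.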

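Cleaning empty values

The dropna() method removes rows (or columns) that contain empty cells. Its format is:

'''
axis: the axis; 0 drops the whole row when an empty value is found, 1 drops the whole column.
how: 'any' drops a row (column) if any value in it is NaN; 'all' drops it only if every value is NaN.
thresh: the number of non-empty values a row (column) needs in order to be kept.
subset: the columns to check. For several columns, pass a list of column names.
inplace: True modifies the source data. The default False returns a new DataFrame and leaves the source data unchanged.
'''
# Signature: DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)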
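
The parameters above can be compared side by side on a small frame. This sketch uses made-up data; the comments list which rows I expect to survive each call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one':   [1.0, np.nan, 3.0, np.nan],
                   'two':   [9.0, np.nan, np.nan, np.nan],
                   'three': [5.0, 6.0, 7.0, np.nan]})

any_nan = df.dropna()                      # keeps row 0 only
all_nan = df.dropna(how='all')             # drops row 3 only
two_valid = df.dropna(thresh=2)            # keeps rows with at least 2 non-NaN values
check_three = df.dropna(subset=['three'])  # looks at column 'three' only

print(any_nan.index.tolist())
print(all_nan.index.tolist())
print(two_valid.index.tolist())
print(check_three.index.tolist())
```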

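The fillna() method replaces empty cells

# Replace all empty cells with 123
df.fillna(123, inplace = True)

# Replace the empty cells in column one with 123
# (assign the result back; fillna(inplace=True) on df['one'] may not update df in recent pandas)
df['one'] = df['one'].fillna(123)

# mean(), median() and mode() are often used to compute a column's mean, median and mode as the replacement value
x = df['one'].mean()
df['one'] = df['one'].fillna(x)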
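
Here is a small sketch of filling with a column statistic, including the dict form of fillna that targets individual columns. The data is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 8.0, 6.0]})

# Fill column 'one' with its mean (the mean skips NaN, so it is 2.0)
df['one'] = df['one'].fillna(df['one'].mean())

# Fill column 'two' with its median (7.0) by passing a dict of column -> value
filled = df.fillna({'two': df['two'].median()})

print(filled)
```

mode() returns a Series because there can be several modes, so a single replacement value would be taken with something like df['one'].mode().iloc[0].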

Cleaning badly formatted data

Convert all cells in a column to the same format.

# The third date is in a different format
data = {
  "Date": ['2020/12/01', '2020/12/02' , '20201226'],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

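# pandas 2.0+ infers one format from the first value, so format='mixed' is needed to parse each value on its own
df['Date'] = pd.to_datetime(df['Date'], format='mixed')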

print(df.to_string())
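
An alternative that does not depend on format inference is to normalize the strings first and then parse them with one explicit format. This is a sketch under the assumption that the only variation is the presence of slashes.

```python
import pandas as pd

dates = pd.Series(['2020/12/01', '2020/12/02', '20201226'])

# Strip the slashes so that every value looks like YYYYMMDD
normalized = dates.str.replace('/', '', regex=False)
parsed = pd.to_datetime(normalized, format='%Y%m%d')

print(parsed)
```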

Cleaning wrong data

Wrong data can be replaced or removed.

person = {
  "name": ['Google', 'Runoob' , 'Taobao'],
  "age": [50, 200, 12345]    
}

df = pd.DataFrame(person)

# Replace
# Set every age greater than 120 to 120:
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.loc[x, "age"] = 120

# Remove
# Drop every row whose age is greater than 120:
for x in df.index:
  if df.loc[x, "age"] > 120:
    df.drop(x, inplace = True)
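
The two loops above can also be written without iterating over rows. The following sketch shows one vectorized way to do each; it is an alternative to the loops, not what the original tutorial uses.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Google', 'Runoob', 'Taobao'],
                   'age': [50, 200, 12345]})

# Replace: cap every age at 120, returning a new DataFrame
capped = df.assign(age=df['age'].clip(upper=120))

# Remove: keep only the rows whose age is at most 120
kept = df[df['age'] <= 120]

print(capped)
print(kept)
```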

Cleaning duplicate data

person = {
  "name": ['Google', 'Runoob', 'Runoob', 'Taobao'],
  "age": [50, 40, 40, 23]  
}
df = pd.DataFrame(person)

# The duplicated() method checks whether rows are duplicated.
# It returns True for a row that repeats an earlier row, and False otherwise
print(df.duplicated())
''' Output:
0    False
1    False
2     True
3    False
dtype: bool
'''

# The drop_duplicates() method removes duplicate rows.
df.drop_duplicates(inplace = True)
print(df)
'''Output:
     name  age
0  Google   50
1  Runoob   40
3  Taobao   23
'''
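
By default a row counts as a duplicate only when every column matches. The subset and keep parameters change that, as in this small sketch with slightly altered data (the two Runoob rows now differ in age).

```python
import pandas as pd

df = pd.DataFrame({'name': ['Google', 'Runoob', 'Runoob', 'Taobao'],
                   'age': [50, 40, 41, 23]})

full_rows = df.drop_duplicates()                            # nothing is dropped
name_flags = df.duplicated(subset=['name'])                 # compare on 'name' only
by_name = df.drop_duplicates(subset=['name'], keep='last')  # keep the last Runoob row

print(name_flags.tolist())
print(by_name)
```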

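Data feature analysis

Summarizing data is a lossy process of extracting its features. It covers the following aspects:

Basic statistics (including sorting)
Distribution / cumulative statistics
Correlation, periodicity, and so on
Data mining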

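Sorting data

# Sort by index along the given axis; ascending by default
df.sort_index(axis=0, ascending=True)

# Sort by value along the given axis; ascending by default
# Signature: Series.sort_values(axis=0, ascending=True)
# by: a label or a list of labels on the axis to sort by
# Signature: DataFrame.sort_values(by, axis=0, ascending=True)

Note: by default, NaN values are placed at the end of the sorted result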
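
A short sketch of both kinds of sorting, with a NaN included to show where it ends up. The frame is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [3.0, np.nan, 1.0, 2.0],
                   'two': [9, 8, 7, 6]},
                  index=['c', 'a', 'd', 'b'])

by_index = df.sort_index()                            # rows ordered by label
by_value = df.sort_values(by='one', ascending=False)  # rows ordered by column 'one'

print(by_index.index.tolist())
print(by_value.index.tolist())
```

Row a holds the NaN and is still last even though the sort is descending.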

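Basic statistical analysis

For Series:

Method	Description
`.argmin()`, `.argmax()`	Automatic (positional) index of the minimum and maximum value
`.idxmin()`, `.idxmax()`	Custom index (label) of the minimum and maximum value

Basic statistical functions for both Series and DataFrame:

Method	Description
`.sum()`	Sum of the data. Computed along axis 0 by default; the same applies below
`.count()`	Number of non-NaN values
`.mean()`, `.median()`	Arithmetic mean and median of the data
`.var()`, `.std()`	Variance and standard deviation of the data
`.min()`, `.max()`	Minimum and maximum of the data

pandas also has a single method that gives a statistical summary along axis 0 (of each column): .describe()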
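
The difference between the positional and label results, and the effect of the axis argument, can be checked with a few lines. A minimal sketch with small hand-made data:

```python
import pandas as pd

s = pd.Series([6, 5, 4, 3], index=['a', 'b', 'c', 'd'])
max_pos = s.argmax()     # position of the largest value
max_label = s.idxmax()   # label of the largest value

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]})
col_sums = df.sum()          # axis 0: one result per column
row_sums = df.sum(axis=1)    # axis 1: one result per row
summary = df.describe()      # count, mean, std, min, quartiles, max per column

print(max_pos, max_label)
print(col_sums.to_dict(), row_sums.tolist())
print(summary)
```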

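Cumulative statistical analysis

Cumulative statistical functions for both Series and DataFrame:

Method	Description
.cumsum()	Gives the sum of the first 1, 2, ..., n values in turn (axis 0 by default; the same applies below)
.cumprod()	Gives the product of the first 1, 2, ..., n values in turn
.cummax()	Gives the maximum of the first 1, 2, ..., n values in turn
.cummin()	Gives the minimum of the first 1, 2, ..., n values in turn

Rolling (window) functions for both Series and DataFrame:

Method	Description
.rolling(w).sum()	Sum of each run of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each run of w adjacent elements
.rolling(w).var()	Variance of each run of w adjacent elements
.rolling(w).std()	Standard deviation of each run of w adjacent elements
.rolling(w).min/max()	Minimum/maximum of each run of w adjacent elements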
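
A small sketch contrasting cumulative and rolling results on the same Series. With a window of 2, the first position has no complete window, so I expect it to be NaN.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

running_sum = s.cumsum()          # sum of everything seen so far
running_prod = s.cumprod()        # product of everything seen so far
window_sum = s.rolling(2).sum()   # sum of each pair of adjacent values

print(running_sum.tolist())
print(running_prod.tolist())
print(window_sum.tolist())
```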

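Correlation analysis

Determine how two sets of data are correlated.

Correlation functions for both Series and DataFrame:

Method	Description
.cov()	Compute the covariance matrix
.corr()	Compute the correlation coefficient matrix; Pearson, Spearman, Kendall and other coefficients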

hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=['2008', '2009', '2010', '2011', '2012'])
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=['2008', '2009', '2010', '2011', '2012'])

hprice.corr(m2)
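
To connect .corr() with .cov(), the default (Pearson) coefficient can be rebuilt from the covariance and the two standard deviations. This is a sketch for checking the relationship on the same data; manual and numpy_r are names chosen for the example.

```python
import numpy as np
import pandas as pd

years = ['2008', '2009', '2010', '2011', '2012']
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=years)
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=years)

r = hprice.corr(m2)   # Pearson by default

# Pearson r = cov(x, y) / (std(x) * std(y))
manual = hprice.cov(m2) / (hprice.std() * m2.std())

# NumPy's correlation matrix should agree on the off-diagonal entry
numpy_r = np.corrcoef(hprice, m2)[0, 1]

print(r, manual, numpy_r)
```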

From: https://www.cnblogs.com/hzyuan/p/17062983.html

[python] pandas library study notes