
The Pandas Library (Part 1)

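Pandas is an open-source data analysis library that is widely used in Python and built on top of NumPy.
Import the library under the alias pd, and import NumPy under the alias np: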
```
import pandas as pd
import numpy as np
```

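1. Basic data structures in Pandas

1.1 Series

A Series is a one-dimensional array, similar to a one-dimensional NumPy array and also close to Python's built-in list.

A Series can hold different data types, including strings, booleans, and numbers.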
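To make the point about data types concrete, here is a minimal sketch. The names `mixed` and `ints` are made up for illustration and are not part of the original walkthrough.

```python
import pandas as pd

# A Series holding an integer, a string, a boolean and a float
# (`mixed` is a hypothetical name used only in this sketch)
mixed = pd.Series([1, 'two', True, 4.0])

# Mixed contents are stored under the generic `object` dtype
print(mixed.dtype)

# A Series that contains only integers gets a numeric dtype instead
ints = pd.Series([1, 2, 3])
print(ints.dtype)
```

Keeping a single type per Series is usually preferable, because numeric dtypes allow vectorized arithmetic while `object` falls back to plain Python objects.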

Defining a Series from a list:

s = pd.Series([2,4,6,np.nan,10])
s
## Output
0     2.0
1     4.0
2     6.0
3     NaN
4    10.0
dtype: float64

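The default index is 0, 1, 2, 3, ...; a custom index can be specified when the Series is defined, for example:

# With a string index, slices can still use integer positions as well as the string labels. A slice by string labels includes both endpoints instead of excluding the right one.
s = pd.Series([2,4,6,np.nan,10],index = ['a','b','c','d','e'])
s
## Output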
a     2.0
b     4.0
c     6.0
d     NaN
e    10.0
dtype: float64
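The difference between slicing by label and slicing by position can be sketched as follows. The name `s_demo` is hypothetical, and `.iloc` is used here to make the positional slice explicit.

```python
import numpy as np
import pandas as pd

# Same data as above, under a hypothetical name so that `s` is left untouched
s_demo = pd.Series([2, 4, 6, np.nan, 10], index=['a', 'b', 'c', 'd', 'e'])

by_label = s_demo['b':'d']        # label slice: 'b', 'c' and 'd' (end included)
by_position = s_demo.iloc[1:3]    # positional slice: positions 1 and 2 (end excluded)

print(list(by_label.index))
print(list(by_position.index))
```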

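The remaining Series examples use the default integer index again, i.e. s as first defined with pd.Series([2,4,6,np.nan,10]).

Viewing the index:

s.index
## Output
RangeIndex(start=0, stop=5, step=1)

Viewing the values:

s.values
## Output
array([ 2.,  4.,  6., nan, 10.])

Viewing the value at a given index:
```
s[2]
## Output
6.0
```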

Slicing:

s[1:4]
## Output
1    4.0
2    6.0
3    NaN
dtype: float64

Naming the index:

s.index.name = '索引'   # '索引' means "index"
s
## Output
索引
0     2.0
1     4.0
2     6.0
3     NaN
4    10.0
dtype: float64

1.2 DataFrame

A DataFrame is a two-dimensional, table-like data structure; it can be thought of as a container of Series.

Defining a DataFrame from a two-dimensional array:

df = pd.DataFrame(np.random.randn(6,4))
df
## Output
          0         1         2         3
0 -2.034398  0.122136  0.476024  1.430138
1 -0.121024  0.370069  0.471880 -0.694771
2  0.972311  1.168978  0.323533 -0.139779
3 -0.062181 -0.268727  0.123024 -0.362190
4 -0.114693 -0.871248  0.078057  0.045691
5  1.918886 -1.305533 -1.127720 -0.112664

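By default, both the row and column indexes are integers starting from 0. The index and columns parameters set the row index and the column labels, respectively.

First define a sequence of dates to use as the row index:

date = pd.date_range('20230101',periods = 6)
date
## Output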
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4),index = date,columns = list('ABCD'))
df
## Output
                   A         B         C         D
2023-01-01  0.348475  1.129904 -1.885244  1.143722
2023-01-02  0.430959 -1.432529  2.208207  1.180147
2023-01-03 -0.349989 -2.051909 -0.501761 -1.526153
2023-01-04  0.507018  0.206360  0.218701  0.275458
2023-01-05 -0.571106 -1.224011  2.588948 -1.205931
2023-01-06 -0.040628 -0.764029  0.132087  0.036127

Defining a DataFrame from a dictionary:

df2 = pd.DataFrame({'A':1,
                    'B':pd.Timestamp('20230201'),
                    'C':pd.Series([2]*5),
                    'D':np.array([3]*5,dtype = int),
                    'E':pd.Categorical(['test','train','test','train','test']),
                    'F':'abc'})
df2
## Output
   A          B  C  D      E    F
0  1 2023-02-01  2  3   test  abc
1  1 2023-02-01  2  3  train  abc
2  1 2023-02-01  2  3   test  abc
3  1 2023-02-01  2  3  train  abc
4  1 2023-02-01  2  3   test  abc

Each key of the dictionary becomes one column, and its value can be any object that can be converted to a Series.

Different columns of a DataFrame may have different data types, while the values within a single column share one data type.
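The per-column data types can be checked with `dtypes`. The sketch below builds a smaller frame in the same way; `frame` is a hypothetical name and the three-row size is chosen only to keep the example short.

```python
import numpy as np
import pandas as pd

# `frame` is a hypothetical, reduced version of a dictionary-built DataFrame
frame = pd.DataFrame({
    'A': 1,                                           # scalar, repeated on every row
    'C': pd.Series([2] * 3),                          # Series
    'D': np.array([3] * 3, dtype=int),                # NumPy array
    'E': pd.Categorical(['test', 'train', 'test']),   # categorical data
})

# One dtype per column; different columns may differ
print(frame.dtypes)
```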

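Viewing data

df.head() shows the first few rows of the DataFrame df (5 by default)

df.tail() shows the last few rows (5 by default)

df.dtypes shows the data type of each column

df.index shows the row index

df.columns shows the column labels

df.values shows the data values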
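The inspection calls listed above can be tried on a small frame. This is only a sketch: `demo` and its contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame filled with the integers 0..23
demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

print(demo.head(2))        # first 2 rows instead of the default 5
print(demo.tail(2))        # last 2 rows
print(demo.dtypes)         # dtype of each column
print(demo.index)          # row index
print(demo.columns)        # column labels
print(demo.values.shape)   # the underlying NumPy array is 6 by 4
```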

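2. Reading and manipulating data with Pandas

Reading data

df = pd.read_excel(file_name)   # read an Excel file located in the current working directory; file_name is a placeholder
df = pd.read_excel(path_and_file_name)   # read an Excel file located elsewhere; path_and_file_name is a placeholder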

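Row operations

df.iloc[0]  # the first row of df
df.iloc[0:5]  # rows 1 to 5 by position; the end is excluded
df.loc[1:5]   # rows labelled 1 to 5, both ends included; with the default integer index these are rows 2 to 6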
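The distinction between position-based `iloc` and label-based `loc` is easier to see when the row labels are not 0, 1, 2, .... A minimal sketch, with a made-up frame `demo` whose labels are 10, 20, 30 and 40:

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
demo = pd.DataFrame({'x': ['a', 'b', 'c', 'd']}, index=[10, 20, 30, 40])

first_row = demo.iloc[0]       # position 0, i.e. the row labelled 10
pos_slice = demo.iloc[0:2]     # positions 0 and 1; end excluded
label_slice = demo.loc[10:30]  # labels 10 through 30; both ends included

print(list(pos_slice.index))
print(list(label_slice.index))
```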

Adding rows:

# Add a row, for example to the df2 defined above
# First define the row to add
s = pd.Series({'A':2,'B':'2023-02-05','C':2,'D':5,'E':'test','F':'a'})
s.name = 5   # the label of the new row; df2 has 5 rows labelled 0 to 4, so the new one is named 5
# Then add s to df2
df2 = pd.concat([df2, s.to_frame().T])   # DataFrame.append was removed in pandas 2.0; pandas.concat replaces it
df2
## Output
   A                    B  C  D      E    F
0  1  2023-02-01 00:00:00  2  3   test  abc
1  1  2023-02-01 00:00:00  2  3  train  abc
2  1  2023-02-01 00:00:00  2  3   test  abc
3  1  2023-02-01 00:00:00  2  3  train  abc
4  1  2023-02-01 00:00:00  2  3   test  abc
5  2           2023-02-05  2  5   test    a
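As an alternative for adding a single row, pandas also allows assignment through `.loc` with a label that does not exist yet. The sketch below is not the approach used in this walkthrough; `demo` is a hypothetical two-column frame.

```python
import pandas as pd

# Hypothetical frame with rows labelled 0 and 1
demo = pd.DataFrame({'A': [1, 1], 'E': ['test', 'train']})

# Assigning to a new label (here 2, the current length) appends a row
demo.loc[len(demo)] = [2, 'test']

print(demo)
```

This is convenient for one-off additions; when many rows are added, collecting them first and calling `pd.concat` once tends to be faster.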

Deleting rows:

df2 = df2.drop([5])   # delete the row labelled 5

Column operations

# Using df2 as the example
df2['E']    # select the column labelled 'E'; df2.E is equivalent
## Output
0     test
1    train
2     test
3    train
4     test
Name: E, dtype: object

df2[['A','E']][2:4]    # rows 3 to 4 of the columns labelled 'A' and 'E'
## Output
   A      E
2  1   test
3  1  train

Adding columns:

df2['G'] = range(1, len(df2)+1)   # assign directly to a new column label
df2
## Output
   A                    B  C  D      E    F  G
0  1  2023-02-01 00:00:00  2  3   test  abc  1
1  1  2023-02-01 00:00:00  2  3  train  abc  2
2  1  2023-02-01 00:00:00  2  3   test  abc  3
3  1  2023-02-01 00:00:00  2  3  train  abc  4
4  1  2023-02-01 00:00:00  2  3   test  abc  5

Deleting columns:

df2 = df2.drop('G',axis = 1)   # delete column 'G'; axis = 1 drops columns, axis = 0 drops rows

Selecting data by label

df2.loc[[0,2,3],['A','C']]
## Output
   A  C
0  1  2
2  1  2
3  1  2

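Conditional selection

df2[df2['E'] == 'test'][:2]  # select the rows where column 'E' equals 'test' and keep only the first two
## Output
   A                    B  C  D     E    F
0  1  2023-02-01 00:00:00  2  3  test  abc
2  1  2023-02-01 00:00:00  2  3  test  abc

df2[(df2.E == 'test') & (df2.F == 'abc')]   # combine conditions with &; each condition needs its own parentheses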
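Conditions can be combined with `&` (and), `|` (or) and `~` (not). A small sketch on a made-up frame `demo`, assuming the columns 'E' and 'D' shown here:

```python
import pandas as pd

# Hypothetical data for illustrating combined conditions
demo = pd.DataFrame({
    'E': ['test', 'train', 'test', 'train'],
    'D': [3, 5, 7, 9],
})

both = demo[(demo['E'] == 'test') & (demo['D'] > 3)]     # both conditions hold
either = demo[(demo['E'] == 'test') | (demo['D'] > 7)]   # at least one holds
negated = demo[~(demo['E'] == 'test')]                   # condition does not hold

print(list(both.index), list(either.index), list(negated.index))
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `==` and `>` in Python.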

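3. Handling missing values and outliers

3.1 Handling missing values

Detecting missing values

df.isnull()   # True for missing values, False otherwise
df.notnull()   # False for missing values, True otherwise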
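Because `isnull()` returns a frame of booleans, summing it gives the number of missing values per column. A minimal sketch with a hypothetical frame `demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'A' and two in 'B'
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 6.0]})

missing_per_column = demo.isnull().sum()   # True counts as 1 when summed
any_missing = demo.isnull().any().any()    # is anything missing at all?

print(missing_per_column)
print(any_missing)
```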

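Filling missing values

# fill_value is a placeholder for the value to fill with, e.g. the mean or median
df['A'] = df['A'].fillna(fill_value)   # fill the missing values in column 'A' and assign the result back so that df is updated; in recent pandas versions, fillna(..., inplace = True) on a selected column may not update df
df.fillna(fill_value)   # fill every missing value in df; returns a new DataFrame unless inplace = True is passed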
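Filling with the mean or the median of a column might look like the sketch below. The frame `demo` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [10.0, 20.0, np.nan]})

# mean() and median() skip missing values by default
demo['A'] = demo['A'].fillna(demo['A'].mean())      # mean of 1 and 3
demo['B'] = demo['B'].fillna(demo['B'].median())    # median of 10 and 20

print(demo)
```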

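Dropping missing values

df.dropna()
Parameters:
how = 'all'  drops only the rows (or columns) in which every value is missing; the default 'any' drops those with at least one missing value
inplace = True  modifies the original data; by default the original is left unchanged
axis = 0  chooses between rows (0) and columns (1); rows are the default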
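The effect of the `how` and `axis` parameters can be sketched on two small made-up frames, `rows_demo` and `cols_demo`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing
rows_demo = pd.DataFrame({'A': [1.0, np.nan, np.nan], 'B': [4.0, 5.0, np.nan]})

any_dropped = rows_demo.dropna()            # default how='any': only row 0 survives
all_dropped = rows_demo.dropna(how='all')   # only the fully missing row 2 is dropped

# Hypothetical frame: column 'A' has a missing value, column 'B' is complete
cols_demo = pd.DataFrame({'A': [1.0, np.nan], 'B': [4.0, 5.0]})
cols_dropped = cols_demo.dropna(axis=1)     # drop columns instead of rows

# None of the calls above changed the original frames (inplace defaults to False)
print(len(rows_demo), list(cols_demo.columns))
```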

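3.2 Handling outliers

Check a column directly for values that make no sense, for example:

df[df.投票人数 < 0]    # rows with a negative number of votes (投票人数 is the vote-count column)
df[df['投票人数']%1 != 0]   # rows with a fractional number of votes

If the outliers do not affect the overall distribution of the data, they are usually simply deleted.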
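Deleting such rows can be done by keeping only the rows that fail both checks. In this sketch the frame `demo` and the column name `votes` are hypothetical stand-ins for the vote-count column used above.

```python
import pandas as pd

# Hypothetical vote counts: -5 is negative and 33.5 is fractional
demo = pd.DataFrame({'votes': [120.0, -5.0, 33.5, 48.0]})

negative = demo['votes'] < 0          # impossible: negative count
fractional = demo['votes'] % 1 != 0   # impossible: non-integer count

# Keep only the rows that are neither negative nor fractional
cleaned = demo[~(negative | fractional)]

print(cleaned)
```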

4. Saving data

df.to_excel('mvdata.xlsx')   # save df to the current directory under the file name mvdata.xlsx

From: https://www.cnblogs.com/DYDNyang/p/17097610.html
