首页 > 其他分享 >pandas

pandas

时间:2023-05-05 20:58:56浏览次数:44  
标签:df Country Delhi Capital Belgium pandas Population

Table of Contents

Set-up

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # 如果不需要,则修改为none即可。删除该段代码不起作用。

Intro

The Pandas library is built on NumPy
provides data structures and data analysis tools.

import pandas as pd

Pandas切片左闭右开


Pandas Data Structures - 2 Types

Notice:
当同时对s和df进行操作的时候:

  1. s会直接在原始的Series对象上进行更改,而不会创建新的对象 (Series是可变对象)
  2. 而df并不会修改原始数据,除非进行赋值操作,或者设置了参数inplace=True(非破坏操作)

Series

  1. (my understanding: ordinary/normal data) )
  2. A one-dimensional labeled array
  3. capable of holding any data type
 s = pd.Series([1, 2, 3, 4],
              index = ['a', 'b', 'c', 'd'])
print(s)
a    1
b    2
c    3
d    4
dtype: int64

DataFrame

  1. two-dimensional labeled data structure
  2. columns of different types
  3. 大多的DataFrame操作都会自动返回一个新的 DataFrame 对象
# Series
data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528]
    }

print(data)

# series -> DataFrame
# index默认从0开始递增
# 修改默认行索引:index=['foo', 'bar', 'baz']
df = pd.DataFrame(data,
                columns = ['Country', 'Capital', 'Population'])
df # 原始的 DataFrame 对象

# **注意**
# 对df.__任何操作都不会修改它本身数据,除非进行了赋值
# eg. df = df.drop(0)
{'Country': ['Belgium', 'India', 'Brazil'], 'Capital': ['Brussels', 'New Delhi', 'Brasília'], 'Population': [11190846, 1303171035, 207847528]}
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Series VS DataFrame

image.png

Dropping

# drop rows from series: index = 'a' and 'c' 
# 删除指定行
s.drop( ['a', 'c'] ) 
s

df

# drop columns
# 删除指定列 axis=1删除列,axis=0删除行
df.drop('Country', axis=1)
df

# 删除第一行数据, index=0 (不是标题行)
df.drop(0)
df

# Notice
# 看以下输出,数据本身并没有改变
# drop方法会自动返回一个 经过删除操作的 Series/DataFrame
# 如果想改变原有数据,参数中添加:inplace=True
b    2
d    4
dtype: int64






a    1
b    2
c    3
d    4
dtype: int64
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Capital Population
0 Brussels 11190846
1 New Delhi 1303171035
2 Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Sort & Rank

df
df.sort_index() # 按照index排序,index默认从0开始递增,所以这里看不出变化
df.sort_values( by='Country' ) # sory by 'values'
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 Belgium Brussels 11190846
2 Brazil Brasília 207847528
1 India New Delhi 1303171035
df
# 给每个元素在该列进行排序
df.rank() # assign ranks to entires
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
0 1.0 2.0 1.0
1 3.0 3.0 3.0
2 2.0 1.0 2.0

I/O

Notice:

  1. 对于CSV和Excel来说:使用pandas读取到的文件为DataFrame
  2. 对于SQL来说:在Pandas的pd.read_sql()函数中,结果可以是不同的数据结构,取决于查询和它的执行方式. eg. 一个单一的值(比如计数),一组值(比如名字的列表),一个类似表的结构(比如SELECT语句的结果集)。

CSV

Aim:
读取1.csv文件,经过DataFrame处理后,以2.csv的形式输出

# Read

# nrows = number of rows

# 用pandas读取csv -> 存入DataFrame中
# 因为pd.read_csv()方法产生的数据结构是一个DataFrame

# header=0或者None: 文件第一行作为标题行

df = pd.read_csv('1.csv', header = None, nrows = 3)

df
# Write
# 将DataFrame写入CSV文件中
df.to_csv('2.csv')
0 1 2 3 4 5 6 7 8 9 ... 26 27 28 29 30 31 32 33 34 35
0 respondent_id h1n1_concern h1n1_knowledge behavioral_antiviral_meds behavioral_avoidance behavioral_face_mask behavioral_wash_hands behavioral_large_gatherings behavioral_outside_home behavioral_touch_face ... income_poverty marital_status rent_or_own employment_status hhs_geo_region census_msa household_adults household_children employment_industry employment_occupation
1 26707 2 2 0 1 0 1 1 0 1 ... > $75,000 Not Married Rent Employed mlyzmhmf MSA, Not Principle City 1 0 atmlpfrs hfxkjkmi
2 26708 1 1 0 0 0 0 0 0 0 ... Below Poverty Not Married Rent Employed bhuqouqj Non-MSA 3 0 atmlpfrs xqwwgdyp

3 rows × 36 columns

Excel

  1. 如果一个xlsx文件 - 只有一个sheet
# Read
df = pd.read_excel('3.xlsx')

# Write
df.to_excel('4.xlsx', sheet_name = 'new')
  1. 如果一个xlsx文件 - 有多个sheet
# 从一个excel中 - 读取所有sheet
sheets = pd.ExcelFile('5.xls')

# Read
df1 = pd.read_excel(sheets, 'sheet_1')
df2 = pd.read_excel(sheets, 'sheet_2')

# Write
merged_df = pd.concat([df1, df2], axis=0) # Merge the dataframes
merged_df.to_excel('merged_file.xlsx', index=False) # Write to new xlsx file

SQL Query/Database Table

存在问题,需要不断安装sqlalchemy,可能是虚拟环境的问题
jupyter kernelspec list
Available kernels:
pycharm-616d70e3 /Users/huang/Library/Jupyter/kernels/pycharm-616d70e3
python3 /Users/huang/opt/anaconda3/envs/WebAppFlask3_9_MachineLearning/share/jupyter/kernels/python3

Databricks也无法运行

官方文档:https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html

image.png

# !pip install sqlalchemy
!conda install sqlalchemy

# >>> from sqlalchemy import create_engine

# >>> engine = create_engine( )

# >>> pd.read_sql( , engine)

# >>> pd.read_sql_table( , engine)

# >>> pd.read_sql_query( , engine)
# read_sql() is a convenience wrapper around read_sql_table() andread_sql_query()
# >>> df.to_sql( , en 'myDf' gine)
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/huang/opt/anaconda3

  added / updated specs:
    - sqlalchemy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boltons-23.0.0             |   py39hecd8cb5_0         423 KB
    conda-23.3.1               |   py39hecd8cb5_0         962 KB
    jsonpatch-1.32             |     pyhd3eb1b0_0          15 KB
    jsonpointer-2.1            |     pyhd3eb1b0_0           9 KB
    ------------------------------------------------------------
                                           Total:         1.4 MB

The following NEW packages will be INSTALLED:

  boltons            pkgs/main/osx-64::boltons-23.0.0-py39hecd8cb5_0 
  jsonpatch          pkgs/main/noarch::jsonpatch-1.32-pyhd3eb1b0_0 
  jsonpointer        pkgs/main/noarch::jsonpointer-2.1-pyhd3eb1b0_0 

The following packages will be UPDATED:

  conda                               23.1.0-py39hecd8cb5_0 --> 23.3.1-py39hecd8cb5_0 


Proceed ([y]/n)? 

Selection

Getting

print(s)
s['b']
a    1
b    2
c    3
d    4
dtype: int64





2
df
df[1:] # get subset
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
Country Capital Population
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Selecting, Boolean Indexing & Setting

# by position
# 选择(0,0)的元素
df
df.iloc[0,0]
df.iat[0,0]
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
'Belgium'






'Belgium'
# By Label
df.loc[0, 'Country'] #选择第0行Country列
df.at[0, 'Country']

# Notice:
# 在访问单个标量值时,.at比.loc快,但它只适用于一次访问一个值。

'Belgium'






'Belgium'
#  Boolean Indexing

s

# 选择Series对象s中所有<=1的元素
# ~ 取反操作
s[~(s>1)]

s[ (s<-1) | (s>2) ]

df['Population']>1200000000  # -> 返回的是一个bool类型的Series对象
df[ df['Population']>1200000000 ]

a    1
b    2
c    3
d    4
dtype: int64






a    1
dtype: int64






c    3
d    4
dtype: int64






0    False
1     True
2    False
Name: Population, dtype: bool
Country Capital Population
1 India New Delhi 1303171035
# 修改索引a的值为6
s
s['a'] = 6 #  结果会被修改 - 会直接在原始的Series对象上进行更改,而不会创建新的对象
s

a    1
b    2
c    3
d    4
dtype: int64






a    6
b    2
c    3
d    4
dtype: int64
s[0:2] # 左闭右开
a    6
b    2
dtype: int64

loc和iloc区别

  1. loc: location - df.loc[0, 'Country']
  2. iloc: integer location - df.iloc[0,0]

image.png

## Retrieving Series/DataFrame Information
df
df.shape
df.index
df.columns
df.info()
df.count()
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528
(3, 3)






RangeIndex(start=0, stop=3, step=1)






Index(['Country', 'Capital', 'Population'], dtype='object')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     3 non-null      object
 1   Capital     3 non-null      object
 2   Population  3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes





Country       3
Capital       3
Population    3
dtype: int64
### Basic Information


### Summary


## Applying Functions

## Data Alignment
### Internal Data Alignment

### Arithmetic Operations with Fill Methods

https://www.datacamp.com/search?q=pandas




标签:df,Country,Delhi,Capital,Belgium,pandas,Population
From: https://www.cnblogs.com/huangkenicole/p/17375323.html

相关文章

  • Python+Pandas批量合并大量excel文件
    requirments.txtet-xmlfile==1.1.0numpy==1.24.3openpyxl==3.1.2pandas==2.0.1python-dateutil==2.8.2pytz==2023.3six==1.16.0tzdata==2023.3main.pyimportosimportpandasaspddir_path=os.path.dirname(os.path.abspath(__file__))source_location=o......
  • Python教程:pandas读写txt文件——DataFrame和Series
    大家用pandas一般都是读写csv文件或者tsv文件,读写txt文件时一般就withopen了,其实pandas数据类型操作起来更加方便,还是建议全用pandas这一套。读txt文件代码如下,主要是设置正则表达式的分隔符(sep参数),和列名取消(header参数),以及不需要列索引(index_col)。1df=pd.read_csv("workl......
  • 【pandas基础】--数据读取
    数据读取是第一步,只有成功加载数据之后,后续的操作才有可能。pandas可以读取和导入各种数据格式的数据,如CSV,Excel,JSON,SQL,HTML等,不需要手动编写复杂的读取代码。1.各类数据源pandas提供了导入各类常用文件格式数据的接口,这里介绍3种最常用的加载数据的接口。1.1从CSV文件读......
  • python pandas 生成excle
    首先需要导包需要两个包分别是pandas和openpyxlpipinstall-i https://pypi.doubanio.com/simple/ --trusted-host pypi.doubanio.com pandaspipinstallopenpyxl 创建Python开始写入#使用pandas生成xlsx的excel文件importpandasaspdimportpandasaspddata=......
  • pandas drop、loc和iloc函数对数据框进行操作
    pandas.DataFrame.drop—从行或列中删除指定的标签参考:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html语法格式DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise......
  • 关于如何使用pandas将不同的数据写入到同一个Excel的不同Sheet表中
    笔者在之前使用python将数据写入到Excel表格中,一般都只需要写入一张Sheet表中就可以了因最近一个小的需求,需要将不同的大列表数据写入到不同的Sheet表中这样,再使用之前的pandas.DataFrame(content_list),然后.to_excel("./xxxxx.xlsx")就不再好用了于是笔者又去看了一个pandas......
  • 关于pandas.ExcelWriter()对象在执行.save()时报错FutureWarning: save is not part o
    有时使用pandas将多份数据,写入到Excel中不同的Sheet,可能会用到pandas.ExcelWriter("xxxx.xlsx")对象这样在结束时,就会对对象进行.save()和close(),当然这也是从大部分网站上看到的使用方法但是笔者在实际使用过程中,按这个过程,遇到报错:FutureWarning:saveisnotpartofthepu......
  • Pandas
    Series结构Series结构,也称Series序列,是Pandas常用的数据结构之一,它是一种类似于一维数组的结构,由一组数据值(value)和一组标签组成,其中标签与数据值具有对应关系。标签不必是唯一的,但必须是可哈希类型(即不可变类型,如python中的整形、字符串、数值、浮点)。该对象既支持基于整数的索......
  • 【pandas基础】--概述
    Pandas是一个开源的Python数据分析库。它提供了快速,灵活和富有表现力的数据结构,旨在使数据清洗和分析变得简单而快速。Pandas是基于NumPy数组构建的,因此它在许多NumPy函数上提供了直接的支持。它还提供了用于对表格数据进行操作的数据结构,例如Series和DataFrame。Pandas提供了......
  • SQLite vs Pandas
    AnalysisdetailsFortheanalysis,weranthesixtasks10timeseach,for5differentsamplesizes,foreachof3programs:pandas,sqlite,andmemory-sqlite(wheredatabaseisinmemoryinsteadofondisk).See below forthedefinitionsofeachtask.Ou......