
pandas Explained

Time: 2023-02-06 19:32:02  Views: 51
Tags: info food survival pandas titanic explained print age


# coding=utf-8

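'''
What the pandas functions do:
read_csv(path) reads a CSV file and returns a DataFrame

Attributes and methods of <class 'pandas.core.frame.DataFrame'> (the object returned by read_csv):

Attributes:
columns the names of all columns (features); dtypes lists the same names together with their types
dtypes the data type of each column
shape a tuple (number of rows, number of columns)
loc label-based access to a single row, a slice of rows, or a list of rows
loc[1] the second row (label 1 under the default zero-based index)
loc[2:5] rows 3 to 6 (labels 2 through 5; both endpoints are included)
loc[[2,5,10]] rows 3, 6 and 11 (the labels must be passed as a list)
loc[row_label, column_name] a single value
Methods:
head(n) the first n rows; n defaults to 5
max() DataFrame['column_name'].max()
min() DataFrame['column_name'].min()
mean() DataFrame['column_name'].mean() the mean
sort_values("Sodium_(mg)", inplace=True) sorts by a column; ascending defaults to True, pass ascending=False for descending order; NaN values are placed last
titanic_survival.pivot_table(index='Pclass',values='Survived',aggfunc='mean') the mean of Survived within each Pclass group; aggfunc defaults to the mean
dropna() drops the rows (or, with axis=1, the columns) that contain missing values
apply() applies a custom function to each column (or, with axis=1, to each row)
'''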

import pandas
# read_csv reads the file into a DataFrame
food_info = pandas.read_csv('food_info.csv')
#print(food_info)
#print(type(food_info)) #<class 'pandas.core.frame.DataFrame'>
#print(food_info.dtypes) # prints the data type of each column (feature)
'''
#object - For string values
#int - For integer values
#float - For float values
#datetime - For time values
#bool - For Boolean values
#print(food_info.dtypes)
'''


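# head() returns the first five rows by default; pass a number to choose how many
first_rows = food_info.head()
#print(first_rows)
#print(food_info.head(3)) # the first three rows
#print(food_info.columns) # all column names (dtypes also lists each column's type)
#print(food_info.shape) # (rows, columns)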



#pandas uses zero-indexing
# loc selects rows by label: a single label, a slice, or a list of labels
# Series object representing the row at index 1 (the second row).
#print(food_info.loc[1])
# Series object representing the seventh row.
#food_info.loc[6]
# Will throw an error: "KeyError: 'the label [8620] is not in the [index]'"
#food_info.loc[8620]
#The object dtype is equivalent to a string in Python
# Returns a DataFrame containing the rows at indexes 3, 4, 5, and 6.
#print(food_info.loc[3:6])
# Returns a DataFrame containing the rows at indexes 2, 5, and 10. Either of the following approaches will work.
# Method 1
#two_five_ten = [2,5,10]
#food_info.loc[two_five_ten]
# Method 2
#food_info.loc[[2,5,10]]
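
The row-selection rules above are easy to confuse with ordinary Python slicing, so here is a minimal sketch of how they behave. It uses a small made-up DataFrame (`df` and its values are assumptions standing in for `food_info`) and contrasts label-based `loc` with position-based `iloc`.

```python
import pandas as pd

# Hypothetical stand-in for food_info: five rows with the default integer index.
df = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003, 1004, 1005],
    "Water_(g)": [15.87, 15.87, 0.24, 42.41, 41.11],
})

# .loc slices by label and includes BOTH endpoints.
by_label = df.loc[1:3]        # labels 1, 2, 3

# .iloc slices by position and excludes the stop, like a Python list.
by_position = df.iloc[1:3]    # positions 1, 2

# A list of labels selects exactly those rows, in the order given.
picked = df.loc[[0, 2, 4]]

# A row label plus a column name returns a single value.
single = df.loc[2, "Water_(g)"]

print(len(by_label), len(by_position), list(picked.index), single)
```

With the default index the labels happen to equal the positions, which is why the inclusive endpoint of `loc` is the main difference a reader will notice.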



# Select a single column
# Series object representing the "NDB_No" column.
ndb_col = food_info['NDB_No']
#print(ndb_col)
# Alternatively, you can access a column by passing in a string variable.
#col_name = "NDB_No"
#ndb_col = food_info[col_name]

# Select several columns at once
columns = ["Zinc_(mg)", "Copper_(mg)"]
zinc_copper = food_info[columns]
#print(zinc_copper)
# Skipping the assignment.
#zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]


# Collect the columns whose name ends with the unit "(g)"
col_names = food_info.columns.tolist() # tolist() returns all column names as a list
print(col_names)
gram_columns = []
for c in col_names:
    if c.endswith('(g)'):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
#print(gram_df)
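
The same column filter can be written more compactly. The sketch below is one possible alternative, not the original author's code; the tiny `food` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of food_info with mixed units in the column names.
food = pd.DataFrame({
    "Shrt_Desc": ["BUTTER", "CHEESE"],
    "Water_(g)": [15.87, 42.41],
    "Iron_(mg)": [0.02, 0.31],
    "Protein_(g)": [0.85, 21.40],
})

# A list comprehension replaces the explicit loop and append.
gram_columns = [c for c in food.columns if c.endswith("(g)")]
gram_df = food[gram_columns]

print(gram_columns)
```

Note that "Iron_(mg)" is excluded because `endswith("(g)")` matches the full suffix, including the opening parenthesis.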


# Arithmetic on a single column
div_1000 = food_info['Iron_(mg)'] / 1000
print(div_1000)
# Adds 100 to each value in the column and returns a Series object.
#add_100 = food_info["Iron_(mg)"] + 100

# Subtracts 100 from each value in the column and returns a Series object.
#sub_100 = food_info["Iron_(mg)"] - 100

# Multiplies each value in the column by 2 and returns a Series object.
#mult_2 = food_info["Iron_(mg)"]*2

# Multiply two columns element-wise
#It applies the arithmetic operator to the first value in both columns, the second value in both columns, and so on
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams



# the "Vit_A_IU" column ranges from 0 to 100000, while the "Fiber_TD_(g)" column ranges from 0 to 79
#For certain calculations, columns like "Vit_A_IU" can have a greater effect on the result,
#due to the scale of the values
# The largest value in the "Energ_Kcal" column.
max_calories = food_info["Energ_Kcal"].max()
# Divide the values in "Energ_Kcal" by the largest value.
normalized_calories = food_info["Energ_Kcal"] / max_calories
normalized_protein = food_info["Protein_(g)"] / food_info["Protein_(g)"].max()
normalized_fat = food_info["Lipid_Tot_(g)"] / food_info["Lipid_Tot_(g)"].max()
#print(food_info.shape)
food_info["Normalized_Protein"] = normalized_protein
food_info["Normalized_Fat"] = normalized_fat
#print(food_info.shape)
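
To make the scale argument concrete, here is a small sketch of dividing each column by its own maximum. The numbers are invented to match the ranges mentioned in the comments (0 to 100000 and 0 to 79), and normalizing a whole DataFrame in one expression is an illustrative shortcut rather than what the script above does.

```python
import pandas as pd

# Hypothetical values spanning the ranges quoted in the comments above.
food = pd.DataFrame({
    "Vit_A_IU": [0.0, 50000.0, 100000.0],
    "Fiber_TD_(g)": [0.0, 39.5, 79.0],
})

# food.max() is a Series of per-column maxima; the division aligns on
# column names, so every column is divided by its own maximum.
normalized = food / food.max()

print(normalized)
```

After this step both columns lie between 0 and 1, so neither dominates a calculation purely because of its units.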


# Sorting: ascending defaults to True; pass ascending=False for descending order. NaN values are placed last.
#By default, pandas will sort the data by the column we specify in ascending order and return a new DataFrame
# Sorts the DataFrame in-place, rather than returning a new DataFrame.
#print(food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True)
print(food_info)
#Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
print(food_info["Sodium_(mg)"])
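
A short sketch of the sorting behaviour described above, using a made-up four-row table (the names and sodium values are assumptions). It shows that NaN stays at the end in both directions unless `na_position` is changed, and that without `inplace=True` the original DataFrame is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sodium value.
sodium = pd.DataFrame({
    "Shrt_Desc": ["A", "B", "C", "D"],
    "Sodium_(mg)": [643.0, np.nan, 2.0, 1146.0],
})

ascending = sodium.sort_values("Sodium_(mg)")                    # returns a new DataFrame
descending = sodium.sort_values("Sodium_(mg)", ascending=False)  # NaN is still last
nan_first = sodium.sort_values("Sodium_(mg)", na_position="first")

print(ascending["Shrt_Desc"].tolist())
print(descending["Shrt_Desc"].tolist())
print(nan_first["Shrt_Desc"].tolist())
```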


# coding=utf-8

import pandas as pd
import numpy as np


titanic_survival = pd.read_csv('titanic_train.csv')
#print(titanic_survival.head())

'''
# Count the records whose Age is missing
age = titanic_survival['Age']
#print(age.loc[0:10])
age_is_null = pd.isnull(age) # True where the value is NaN, False otherwise
#print(age_is_null)
age_null_true = age[age_is_null] # the entries that are NaN
#print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)

#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
# Wrong: the sum includes NaN ages, so the result is NaN
mean_age = sum(titanic_survival['Age']) / len(titanic_survival['Age'])
print(mean_age)

# Correct: keep only the non-null ages
good_ages = titanic_survival["Age"][~age_is_null]
#print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)

# missing data is so common that many pandas methods automatically filter for it
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)

# Mean fare for first, second and third class
# Pclass holds the passenger class and Fare holds the ticket price
#mean fare for each class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)


#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
# Mean of Survived within each Pclass group
# aggfunc defaults to the mean
passenger_survival = titanic_survival.pivot_table(
    index='Pclass', values='Survived', aggfunc='mean'
)
print(passenger_survival)

'''
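
The missing-value pitfall described in the block above can be reproduced on a handful of values. This is a minimal sketch with an invented `age` Series; it compares the naive built-in `sum`, a manual filter, and `Series.mean()`, which skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

null_count = age.isnull().sum()          # number of missing ages

# Python's built-in sum propagates NaN, so the naive mean is NaN.
naive_mean = sum(age) / len(age)

# Filtering out the nulls first gives the intended result.
good_ages = age[age.notnull()]
manual_mean = sum(good_ages) / len(good_ages)

# Series.mean() ignores NaN by default (skipna=True).
builtin_mean = age.mean()

print(null_count, naive_mean, manual_mean, builtin_mean)
```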

# Embarked is the port where the passenger boarded
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare","Survived"], aggfunc="sum")
print(port_stats)
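
For readers who know `groupby`, the sketch below shows that a single-index `pivot_table` gives the same numbers as grouping and aggregating. The six-row `titanic` table is invented for illustration, and the aggregation is passed as a string name ("mean", "sum"), which pandas accepts in place of a NumPy function.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data.
titanic = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Survival rate per class via pivot_table ...
pivot = titanic.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

# ... and the same calculation via groupby.
grouped = titanic.groupby("Pclass")["Survived"].mean()

# Several value columns can be aggregated at once.
totals = titanic.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

print(pivot)
print(totals)
```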

#specifying axis=1 or axis='columns' will drop any columns that have null values
# dropna drops the rows or columns that contain missing values
print(len(titanic_survival))
drop_na_columns = titanic_survival.dropna(axis=1)
new_titanic_survival = titanic_survival.dropna(axis=0,subset=["Age", "Sex"]) # drop a row if either Age or Sex is NaN
print(len(new_titanic_survival))
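
The effect of `axis` and `subset` is easier to see on a table small enough to check by eye. The `passengers` DataFrame below is made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with gaps in Age, Sex and Cabin.
passengers = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", None, "male"],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# axis=1 drops every column that contains at least one missing value.
complete_columns = passengers.dropna(axis=1)

# axis=0 (the default) drops every row that contains a missing value.
complete_rows = passengers.dropna(axis=0)

# subset limits the check to the listed columns.
age_sex_known = passengers.dropna(axis=0, subset=["Age", "Sex"])

print(list(complete_columns.columns))
print(complete_rows["Name"].tolist())
print(age_sex_known["Name"].tolist())
```

`dropna` returns a new DataFrame here, so `passengers` itself keeps all four rows.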

# loc[row_label, column_name]
row_index_83_age = titanic_survival.loc[83,"Age"]
row_index_766_pclass = titanic_survival.loc[766,"Pclass"]
print(row_index_83_age)
print(row_index_766_pclass)

# Renumber the index after sorting
new_titanic_survival = titanic_survival.sort_values("Age",ascending=False)
print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:10])
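
Sorting keeps each row's original label, which is why the index is reset afterwards. A minimal sketch with an invented `ages` table shows the difference between looking up label 0 before and after `reset_index(drop=True)`.

```python
import numpy as np
import pandas as pd

# Hypothetical data; one age is missing.
ages = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 80.0, 35.0],
})

by_age = ages.sort_values("Age", ascending=False)
# The rows moved, but they kept their old labels.
print(list(by_age.index))

# drop=True discards the old labels instead of saving them as a column.
reindexed = by_age.reset_index(drop=True)

print(by_age.loc[0, "Name"], reindexed.loc[0, "Name"])
```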


# Custom functions with apply
# This function returns the hundredth item from a series
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item

# Return the hundredth item from each column
# (stored under a new name so the function is not overwritten)
hundredth_items = titanic_survival.apply(hundredth_row)
print(hundredth_items)


# Counts the null values in a column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
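
Counting nulls per column is common enough that it can also be done without a custom function. The sketch below, on a made-up `passengers` table, checks that `apply` with a counting function agrees with `isnull().sum()`; the helper name `null_count` is chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with missing Age and Cabin values.
passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

def null_count(column):
    """Return how many entries of the column are missing."""
    return len(column[pd.isnull(column)])

# apply calls the function once per column and collects the results in a Series.
via_apply = passengers.apply(null_count)

# isnull() yields a boolean DataFrame; summing counts the True values per column.
via_builtin = passengers.isnull().sum()

print(via_apply)
print(via_builtin)
```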


#By passing in the axis=1 argument, we can use the DataFrame.apply() method to iterate over rows instead of columns.
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
print(classes)


def is_minor(row):
    # A NaN age compares as False, so passengers of unknown age are not flagged as minors
    if row["Age"] < 18:
        return True
    else:
        return False

minors = titanic_survival.apply(is_minor, axis=1)
#print(minors)

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)
print(age_labels)


titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived")
print(age_group_survival)
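
Row-wise `apply` calls a Python function once per row, which can be slow on large tables. As a possible alternative (not part of the original script), the same labels can be built from whole-column conditions with `numpy.select`. The `people` table is invented for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age is missing.
people = pd.DataFrame({
    "Age": [5.0, np.nan, 40.0, 17.0],
    "Survived": [1, 0, 0, 1],
})

def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    return "adult"

# Row-wise version: one function call per row.
row_wise = people.apply(generate_age_label, axis=1)

# Vectorized version: conditions are checked in order, first match wins.
conditions = [people["Age"].isnull(), people["Age"] < 18]
labels = np.select(conditions, ["unknown", "minor"], default="adult")
vectorized = pd.Series(labels, index=people.index)

# Survival rate per label, using the default aggfunc (the mean).
survival = people.assign(age_labels=vectorized).pivot_table(
    index="age_labels", values="Survived"
)

print(row_wise.tolist())
print(survival)
```

The null check has to come first in `conditions`, mirroring the order of the `if` branches in the function.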


From: https://blog.51cto.com/u_15955675/6040314
