
Machine Learning (III)


Let's work through some introductory exercises. Most of what follows assumes the Kaggle environment.

Q1 Titanic

https://www.kaggle.com/competitions/titanic

Imports
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

  1. Load the data

CSV tables are typically read with pandas.

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

This step matters: it confirms the data was loaded correctly before anything else is built on it.
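A few quick checks beyond head() make this concrete. The following is a minimal sketch, not part of the original notebook (the expected shape applies to the standard Titanic train.csv):

print(train_data.shape)           # expect (891, 12) for the standard train.csv
print(train_data.dtypes)          # confirm the inferred column types
print(train_data.isnull().sum())  # count missing values per column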

  2. Explore the data

In this dataset it is easy to see that women survived at a much higher rate than men (other attributes matter as well).

women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

  3. Random forest

Here we use scikit-learn's random forest to classify survival from four input features.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Q2 Spaceship Titanic

https://www.kaggle.com/competitions/spaceship-titanic

If all you need is a simple random forest model, scikit-learn is sufficient; for more complex tasks, TensorFlow Decision Forests is recommended.

Imports
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

  1. Load the data
dataset_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
print("Full train dataset shape is {}".format(dataset_df.shape))
dataset_df.head(5)

  2. Explore the data

Get a rough overall picture first:

dataset_df.describe()
dataset_df.info()

Plot a bar chart of the output label (a bar chart has a discrete x-axis):

plot_df = dataset_df.Transported.value_counts()
plot_df.plot(kind="bar")

Also plot histograms for some numeric input features (a histogram has a continuous x-axis):

fig, ax = plt.subplots(5, 1, figsize=(10, 10))
plt.subplots_adjust(top=2)

sns.histplot(dataset_df['Age'], color='b', bins=50, ax=ax[0]);
sns.histplot(dataset_df['FoodCourt'], color='b', bins=50, ax=ax[1]);
sns.histplot(dataset_df['ShoppingMall'], color='b', bins=50, ax=ax[2]);
sns.histplot(dataset_df['Spa'], color='b', bins=50, ax=ax[3]);
sns.histplot(dataset_df['VRDeck'], color='b', bins=50, ax=ax[4]);

  3. Process the dataset

The raw dataset is fairly messy: it mixes numbers, letters, and symbols, and many values are missing.

Some columns should not influence the outcome, so drop them first:

dataset_df = dataset_df.drop(['PassengerId', 'Name'], axis=1)
dataset_df.head(5)

The following line counts the missing values in each column:

dataset_df.isnull().sum().sort_values(ascending=False)

TensorFlow Decision Forests copes with missing values, but it cannot ingest boolean (True/False) columns. Fill the missing entries with 0 (i.e., False), then convert the boolean columns and the label to int:

dataset_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = dataset_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].fillna(value=0)
dataset_df.isnull().sum().sort_values(ascending=False)
label = "Transported"
dataset_df[label] = dataset_df[label].astype(int)
dataset_df['VIP'] = dataset_df['VIP'].astype(int)
dataset_df['CryoSleep'] = dataset_df['CryoSleep'].astype(int)

Cabin is a composite attribute of the form Deck/Cabin_num/Side; such columns are best split apart:

dataset_df[["Deck", "Cabin_num", "Side"]] = dataset_df["Cabin"].str.split("/", expand=True)
dataset_df = dataset_df.drop('Cabin', axis=1)

Take a look at the processed dataset:

dataset_df.head(5)

Split it into a training set and a validation set:

def split_dataset(dataset, test_ratio=0.20):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))
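Note that np.random.rand is unseeded here, so the split changes on every run. If reproducibility matters, one alternative (a sketch using scikit-learn, not part of the original notebook) is a seeded split:

from sklearn.model_selection import train_test_split

# Deterministic 80/20 split as an alternative to split_dataset above
train_ds_pd, valid_ds_pd = train_test_split(dataset_df, test_size=0.20, random_state=42)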

Convert from pandas DataFrames to TensorFlow datasets:

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label)

  4. Train the model
rf = tfdf.keras.RandomForestModel()
rf.fit(x=train_ds)

  5. Evaluate

Visualize the random forest (here, the first tree down to depth 3):

tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

Out-of-bag (OOB) evaluation:

logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.show()

inspector = rf.make_inspector()
inspector.evaluation()

Evaluate on the validation set:

rf.compile(metrics=["accuracy"])  # register accuracy so evaluate() reports it
evaluation = rf.evaluate(x=valid_ds, return_dict=True)

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")

Variable importances:

inspector.variable_importances()["NUM_AS_ROOT"]
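NUM_AS_ROOT (how often a feature is the root of a tree) is only one of several importance measures the inspector computes; to list the available ones:

# List every variable-importance metric available for this model
print(inspector.variable_importances().keys())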

  6. Predict on the test set

Don't forget that the test set must go through the same preprocessing as the training set:

# Load the test dataset
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
submission_id = test_df.PassengerId

# Replace NaN values with zero
test_df[['VIP', 'CryoSleep']] = test_df[['VIP', 'CryoSleep']].fillna(value=0)

# Create new features Deck, Cabin_num and Side from the Cabin column, then drop Cabin
test_df[["Deck", "Cabin_num", "Side"]] = test_df["Cabin"].str.split("/", expand=True)
test_df = test_df.drop('Cabin', axis=1)

# Convert boolean to 1's and 0's
test_df['VIP'] = test_df['VIP'].astype(int)
test_df['CryoSleep'] = test_df['CryoSleep'].astype(int)

# Convert pd dataframe to tf dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df)

# Get predictions for the test data
predictions = rf.predict(test_ds)
n_predictions = (predictions > 0.5).astype(bool)
output = pd.DataFrame({'PassengerId': submission_id,
                       'Transported': n_predictions.squeeze()})

output.head()

Generate the CSV:

sample_submission_df = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')
sample_submission_df['Transported'] = n_predictions.squeeze()
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()

By the way, the above is the standard approach; what I actually submitted is the solution below:

Neural network solution
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from tensorflow import keras
from tensorflow.keras import layers

# Loading data
train_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train_data = train_data.drop(['PassengerId', 'Name'], axis=1)
train_data[["Deck", "Cabin_num", "Side"]] = train_data["Cabin"].str.split("/", expand=True)
train_data = train_data.drop('Cabin', axis=1)
train_data['Transported'] = train_data['Transported'].fillna(value=0).astype(int)

test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data = test_data.drop(['PassengerId', 'Name'], axis=1)
test_data[["Deck", "Cabin_num", "Side"]] = test_data["Cabin"].str.split("/", expand=True)
test_data = test_data.drop('Cabin', axis=1)

# Preprocessor
features_num = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Cabin_num"]
features_cat = ["HomePlanet", "Destination", "Deck", "Side"]
features_bool = ["VIP", "CryoSleep"]  # kept out of features_num to avoid duplicated columns

transformer_num = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)

transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore')
)

transformer_bool = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)

preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
    (transformer_bool, features_bool)
)

# Splitting the data
X = train_data.drop('Transported', axis=1)
y = train_data['Transported']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Applying the preprocessor
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)

input_shape = [X_train.shape[1]]

model = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    
    layers.Dense(1024, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),

    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0003),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

early_stopping = keras.callbacks.EarlyStopping(patience=20, min_delta=0.001, restore_best_weights=True)
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=6, verbose=1)
checkpoint = keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)

# Training
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=100,
    callbacks=[early_stopping, lr_scheduler, checkpoint]
)

# Plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['accuracy', 'val_accuracy']].plot(title="Accuracy")

# Validating
val_loss, val_accuracy = model.evaluate(X_valid, y_valid, verbose=0)

print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")

# Prediction
test_data_processed = preprocessor.transform(test_data)
predictions = model.predict(test_data_processed)
predicted_labels = (predictions > 0.5).astype(bool).flatten()

submission_df = pd.DataFrame({
    "PassengerId": pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")["PassengerId"],  
    "Transported": predicted_labels
})

submission_df.to_csv("submission.csv", index=False)

Structuring the preprocessor like this makes it highly reusable; where possible, write it along these lines:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
import pandas as pd

numeric_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1', 'cat_feature2']
boolean_features = ['bool_feature1']
date_features = ['date_feature1']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

boolean_transformer = FunctionTransformer(lambda x: x.astype(int), validate=False)  # assumes the boolean columns contain no NaN

# Extract the year per column, so this works on the DataFrame slice that ColumnTransformer passes in
date_transformer = FunctionTransformer(lambda x: x.apply(lambda col: pd.to_datetime(col).dt.year), validate=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features),
        ('date', date_transformer, date_features)
    ])
# In theory, almost every data preprocessing step can be folded into the preprocessor
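Built this way, the preprocessor drops straight into a Pipeline and is reused unchanged between training and inference. A hypothetical usage sketch (the classifier choice and the X_train/y_train/X_test variables are placeholders):

from sklearn.linear_model import LogisticRegression

# The preprocessor becomes the first stage of the full model pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])
# clf.fit(X_train, y_train)    # preprocessing is fit on the training data only
# preds = clf.predict(X_test)  # the identical preprocessing is applied automatically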


    Abstract背景:现有的深度学习测试在很⼤程度上依赖于⼿动标记的数据,因此通常⽆法暴露罕⻅输⼊的错误⾏为。本文:DeepXploreTask:awhite-boxframeworktotestDLModels方法:neuroncoveragedifferentialtestingwithmultipleDLsystems(models)joint-optimizationpro......