Let's work through some beginner problems. Most of what follows assumes the Kaggle environment.
Q1 Titanic
https://www.kaggle.com/competitions/titanic
import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
- Loading the data
We typically use pandas to read the CSV files.
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()
This is an important step: previewing with head() confirms the data was loaded correctly.
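Beyond head(), a couple of cheap sanity checks catch loading problems early (a minimal sketch using standard pandas calls):
print(train_data.shape, test_data.shape)  # row/column counts
train_data.info()                         # dtypes and non-null counts
print(train_data.isnull().sum())          # missing values per column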
- Exploring the data
In this dataset it is easy to see that women survived at a far higher rate than men (other attributes matter as well).
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)
print("% of women who survived:", rate_women)
- Random forest
Here we use scikit-learn's random forest to classify based on four input features.
from sklearn.ensemble import RandomForestClassifier
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
Q2 Spaceship Titanic
https://www.kaggle.com/competitions/spaceship-titanic
If you only need a simple random forest model, scikit-learn is enough; for more complex tasks, TensorFlow Decision Forests is recommended.
import
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
- Loading the data
dataset_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
print("Full train dataset shape is {}".format(dataset_df.shape))
dataset_df.head(5)
- Exploring the data
First, take a quick look at the whole dataset:
dataset_df.describe()
dataset_df.info()
Plot a bar chart of the output label (a bar chart's x-axis is discrete):
plot_df = dataset_df.Transported.value_counts()
plot_df.plot(kind="bar")
Also plot histograms of the numeric input features (a histogram's x-axis is continuous):
fig, ax = plt.subplots(5,1, figsize=(10, 10))
plt.subplots_adjust(top = 2)
sns.histplot(dataset_df['Age'], color='b', bins=50, ax=ax[0]);
sns.histplot(dataset_df['FoodCourt'], color='b', bins=50, ax=ax[1]);
sns.histplot(dataset_df['ShoppingMall'], color='b', bins=50, ax=ax[2]);
sns.histplot(dataset_df['Spa'], color='b', bins=50, ax=ax[3]);
sns.histplot(dataset_df['VRDeck'], color='b', bins=50, ax=ax[4]);
- Preprocessing the dataset
The raw dataset is messy: it mixes numbers, letters, and symbols, and contains many missing values.
Some columns clearly should not affect the outcome, so drop them first:
dataset_df = dataset_df.drop(['PassengerId', 'Name'], axis=1)
dataset_df.head(5)
The following code counts the missing values in each column:
dataset_df.isnull().sum().sort_values(ascending=False)
TensorFlow Decision Forests can cope with missing values, but it cannot consume boolean (True/False) columns. Fill the missing values in these boolean and spending columns with 0 (False for the booleans), then cast the boolean columns to int:
dataset_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']] = dataset_df[['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].fillna(value=0)
dataset_df.isnull().sum().sort_values(ascending=False)
label = "Transported"
dataset_df[label] = dataset_df[label].astype(int)
dataset_df['VIP'] = dataset_df['VIP'].astype(int)
dataset_df['CryoSleep'] = dataset_df['CryoSleep'].astype(int)
The Cabin column packs Deck/Cabin_num/Side into one value; it is best to split them into separate columns:
dataset_df[["Deck", "Cabin_num", "Side"]] = dataset_df["Cabin"].str.split("/", expand=True)
dataset_df = dataset_df.drop('Cabin', axis=1)
Take a look at the processed dataset:
dataset_df.head(5)
Split into training and validation sets:
def split_dataset(dataset, test_ratio=0.20):
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]
train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in validation.".format(
    len(train_ds_pd), len(valid_ds_pd)))
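Note that split_dataset draws from np.random without a seed, so every run produces a different partition. Seeding NumPy beforehand makes it reproducible (the seed value here is an arbitrary choice):
np.random.seed(42)  # set before calling split_dataset for a repeatable split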
Convert from pandas DataFrames to TensorFlow datasets:
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label)
- Training the model
rf = tfdf.keras.RandomForestModel()
rf.fit(x=train_ds)
- Evaluation
Visualize one tree of the random forest:
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)
Out-of-bag (OOB) evaluation: each tree is scored on the training examples it never saw during bagging, which gives a validation signal without a held-out set:
import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.show()
inspector = rf.make_inspector()
inspector.evaluation()
Evaluate on the validation set:
evaluation = rf.evaluate(x=valid_ds, return_dict=True)
for name, value in evaluation.items():
    print(f"{name}: {value:.4f}")
Feature importance (NUM_AS_ROOT counts how many trees use a feature as their root split):
inspector.variable_importances()["NUM_AS_ROOT"]
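This is only one of several importance measures TFDF records; listing the keys shows which ones your version provides (a quick check on the same inspector):
# Available importance measures; the exact set depends on the TFDF version
print(inspector.variable_importances().keys())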
- Predicting on the test set
Don't forget that the test set must go through the same preprocessing as the training set:
# Load the test dataset
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
submission_id = test_df.PassengerId
# Replace NaN values with zero
test_df[['VIP', 'CryoSleep']] = test_df[['VIP', 'CryoSleep']].fillna(value=0)
# Create new features Deck, Cabin_num and Side from the Cabin column, then drop Cabin
test_df[["Deck", "Cabin_num", "Side"]] = test_df["Cabin"].str.split("/", expand=True)
test_df = test_df.drop('Cabin', axis=1)
# Convert boolean to 1's and 0's
test_df['VIP'] = test_df['VIP'].astype(int)
test_df['CryoSleep'] = test_df['CryoSleep'].astype(int)
# Convert pd dataframe to tf dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df)
# Get predictions for the test data
predictions = rf.predict(test_ds)
n_predictions = (predictions > 0.5).astype(bool)
output = pd.DataFrame({'PassengerId': submission_id,
                       'Transported': n_predictions.squeeze()})
output.head()
Generate the CSV:
sample_submission_df = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')
sample_submission_df['Transported'] = n_predictions
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()
By the way, the above is the standard approach. What I actually submitted is the following:
Neural-network solution
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow import keras
from tensorflow.keras import layers
# Loading data
train_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train_data = train_data.drop(['PassengerId', 'Name'], axis=1)
train_data[["Deck", "Cabin_num", "Side"]] = train_data["Cabin"].str.split("/", expand=True)
train_data = train_data.drop('Cabin', axis=1)
train_data['Transported'] = train_data['Transported'].fillna(value=0).astype(int)
test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data = test_data.drop(['PassengerId', 'Name'], axis=1)
test_data[["Deck", "Cabin_num", "Side"]] = test_data["Cabin"].str.split("/", expand=True)
test_data = test_data.drop('Cabin', axis=1)
# Preprocessor
# VIP and CryoSleep are handled by the boolean transformer below, so they are not duplicated here
features_num = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Cabin_num"]
features_cat = ["HomePlanet", "Destination", "Deck", "Side"]
features_bool = ["VIP", "CryoSleep"]
transformer_num = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)
transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore')
)
transformer_bool = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)
preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
    (transformer_bool, features_bool)
)
# Splitting the data
X = train_data.drop('Transported', axis=1)
y = train_data['Transported']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)
# Applying the preprocessor
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)
input_shape = [X_train.shape[1]]
model = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(1024, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    layers.Dense(1, activation='sigmoid')
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0003),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
early_stopping = keras.callbacks.EarlyStopping(patience=20, min_delta=0.001, restore_best_weights=True)
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=6, verbose=1)
checkpoint = keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)
# Training
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=100,
    callbacks=[early_stopping, lr_scheduler, checkpoint]
)
# Plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['accuracy', 'val_accuracy']].plot(title="Accuracy")
# Validating
val_loss, val_accuracy = model.evaluate(X_valid, y_valid, verbose=0)
print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")
# Prediction
test_data_processed = preprocessor.transform(test_data)
predictions = model.predict(test_data_processed)
predicted_labels = (predictions > 0.5).astype(bool).flatten()
submission_df = pd.DataFrame({
    "PassengerId": pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")["PassengerId"],
    "Transported": predicted_labels
})
submission_df.to_csv("submission.csv", index=False)
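Since restore_best_weights=True already rolls the weights back to the best epoch, the best_model.h5 checkpoint mainly matters when predicting in a fresh session. A minimal sketch for reloading it (assumes the file saved by the ModelCheckpoint above):
# Optional: reload the checkpointed model instead of retraining
best_model = keras.models.load_model('best_model.h5')
predictions = best_model.predict(test_data_processed)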
Building the preprocessor this way makes it very easy to reuse. For example, prefer writing it like this:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
import pandas as pd
numeric_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1', 'cat_feature2']
boolean_features = ['bool_feature1']
date_features = ['date_feature1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
boolean_transformer = FunctionTransformer(lambda x: x.fillna(False).astype(int), validate=False)
# Parse each date column and reduce it to its year
date_transformer = FunctionTransformer(lambda x: x.apply(lambda col: pd.to_datetime(col).dt.year), validate=False)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features),
        ('date', date_transformer, date_features)
    ])
# In theory, almost every preprocessing step can be encapsulated in the preprocessor
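For instance, reusing it end to end might look like this (a sketch: the file paths and feature names above are placeholders, not real data):
# Hypothetical usage; assumes train.csv/test.csv contain the columns listed above
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
X_train = preprocessor.fit_transform(train_df)  # fit on training data only
X_test = preprocessor.transform(test_df)        # reuse the fitted transformers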