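Starting from zero, I am working toward competing on Kaggle on my own. My goals:
- Win my first Kaggle medal in December
- Finish reading Abhishek Thakur's Approaching (Almost) Any Machine Learning Problem in the short term and blog about it: https://github.com/abhishekkrthakur/approachingalmost
- Publish at least 21 blog posts in December
- Keep up eight hours of study every day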
Approaching categorical variables (experiments)
RandomForestClassifier (random forest model)
In the previous post (https://blog.51cto.com/u_15683639/8822476), we trained a random forest with default parameters and got the following results:
Fold = 0, AUC = 0.7143420371128966
Fold = 1, AUC = 0.7182654891323974
Fold = 2, AUC = 0.7162629185564836
Fold = 3, AUC = 0.7138862032799431
Fold = 4, AUC = 0.7169939048511448
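These scores are clearly weaker than the logistic regression results, so we try to improve the model.
XGBoost
From here on, we take the XGBoost baseline as the starting point and try to improve the model by changing how the variables are preprocessed.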
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn import preprocessing
mapping = {
    'ord_1': {
        'Novice': 0,
        'Contributor': 1,
        'Expert': 2,
        'Master': 3,
        'Grandmaster': 4
    },
    'ord_2': {
        "Freezing": 0,
        "Warm": 1,
        "Cold": 2,
        "Boiling Hot": 3,
        "Hot": 4,
        "Lava Hot": 5
    }
}
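def run(fold):
    # Load the training data, which already contains a "kfold" column
    df = pd.read_csv("input/cat_train.csv")

    # Every column except id, target and kfold is a feature
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]

    for col in features:
        # Second experiment: ordinal-encode the columns listed in `mapping`
        # if col in mapping.keys():
        #     df.loc[:, col] = df[col].map(mapping[col])
        #     df.loc[:, col] = df[col].fillna(np.mean(df[col]))
        #     continue

        # Cast to string and mark missing values. Depending on the pandas
        # version, astype(str) may already turn NaN into the string "nan";
        # either way, missing values end up as their own category.
        # Plain df[col] assignment lets the column dtype change.
        df[col] = df[col].astype(str).fillna("NONE")

        # Label-encode each category as an integer
        lbl = preprocessing.LabelEncoder()
        df[col] = lbl.fit_transform(df[col])

    # Split into training and validation sets by fold
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)

    # Feature matrices for training and validation
    x_train = df_train[features].values
    x_valid = df_valid[features].values

    model = xgb.XGBClassifier(
        n_jobs=-1,
        max_depth=7,
        n_estimators=200
    )
    model.fit(x_train, df_train.target.values)

    # Predicted probability of the positive class
    valid_preds = model.predict_proba(x_valid)[:, 1]
    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)
    print(f"Fold = {fold}, AUC = {auc}")


if __name__ == "__main__":
    # Run all five folds
    for fold in range(5):
        run(fold)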
Results:
Fold = 0, AUC = 0.7593127296060671
Fold = 1, AUC = 0.7636982814902823
Fold = 2, AUC = 0.7612284353949836
Fold = 3, AUC = 0.7588940006054744
Fold = 4, AUC = 0.7588587930357571
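The run function above filters on a kfold column that is expected to already exist in input/cat_train.csv. For readers who do not have that file prepared, here is a minimal sketch of one common way to create such a column with scikit-learn's StratifiedKFold. The helper name add_kfold_column, the seed, and the toy data are assumptions made for illustration, not necessarily how the original file was built.

```python
import pandas as pd
from sklearn import model_selection


def add_kfold_column(df, n_splits=5, seed=42):
    """Return a shuffled copy of df with a stratified `kfold` column.

    Hypothetical helper: the name, seed and defaults are illustrative.
    """
    # Shuffle the rows, then reset the index so labels equal positions
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    df["kfold"] = -1

    # Stratify on the target so each fold keeps the same class ratio
    kf = model_selection.StratifiedKFold(n_splits=n_splits)
    for fold, (_, valid_idx) in enumerate(kf.split(X=df, y=df["target"].values)):
        df.loc[valid_idx, "kfold"] = fold
    return df


# Toy data standing in for the real training file
toy_folds = pd.DataFrame({"id": range(20), "target": [0, 1] * 10})
toy_folds = add_kfold_column(toy_folds)
print(toy_folds["kfold"].value_counts().sort_index())
```

On the real data, the returned frame would be written back with to_csv before calling run.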
For variables whose categories have an inherent order, replace each category with its rank:
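mapping = {
    'ord_1': {
        'Novice': 0,
        'Contributor': 1,
        'Expert': 2,
        'Master': 3,
        'Grandmaster': 4
    },
    # Note: this ord_2 order is the one used for the scores below; the
    # natural temperature order would be
    # Freezing < Cold < Warm < Hot < Boiling Hot < Lava Hot.
    'ord_2': {
        "Freezing": 0,
        "Warm": 1,
        "Cold": 2,
        "Boiling Hot": 3,
        "Hot": 4,
        "Lava Hot": 5
    }
}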
Missing values ("NONE") in these columns are filled with the column mean:
df.loc[:, col] = df[col].fillna(np.mean(df[col]))
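To make the mapping and fill steps concrete, here is a minimal sketch on a toy column. The toy values and variable names are made up for illustration; only the ord_1 ranks come from the mapping above.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical data) with one missing value
toy = pd.DataFrame({"ord_1": ["Novice", "Expert", np.nan, "Grandmaster"]})

ord_1_map = {
    "Novice": 0,
    "Contributor": 1,
    "Expert": 2,
    "Master": 3,
    "Grandmaster": 4,
}

# map() replaces each category with its rank; missing values stay NaN
encoded = toy["ord_1"].map(ord_1_map)

# Option 1: fill missing ranks with the mean of the observed ranks
mean_filled = encoded.fillna(encoded.mean())

# Option 2: fill missing ranks with 0
zero_filled = encoded.fillna(0)

print(mean_filled.tolist())
print(zero_filled.tolist())
```

Note that filling with 0 makes a missing value indistinguishable from Novice, whose rank is also 0, while the mean fill can land on or between existing ranks.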
Results with the ordinal mapping and mean fill:
Fold = 0, AUC = 0.7592536106272681
Fold = 1, AUC = 0.7628082884089922
Fold = 2, AUC = 0.7617342260822304
Fold = 3, AUC = 0.7580344934678367
Fold = 4, AUC = 0.7591430836126225
With missing values filled with 0 instead:
Fold = 0, AUC = 0.7598587660799765
Fold = 1, AUC = 0.7635835866917194
Fold = 2, AUC = 0.7618911101907058
Fold = 3, AUC = 0.7578635518090245
Fold = 4, AUC = 0.7597198400752665
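Model performance is almost unchanged: the mean AUC over the five folds is about 0.7604 with label encoding alone, 0.7602 with the ordinal mapping and mean fill, and 0.7606 with the ordinal mapping and zero fill.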
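To compare the three XGBoost runs at a glance, the per-fold scores can be averaged. This is a minimal sketch using the numbers reported above; the dictionary keys are labels chosen for this example.

```python
from statistics import mean

# Fold AUCs copied from the three XGBoost results reported above
fold_aucs = {
    "label encoding only": [
        0.7593127296060671, 0.7636982814902823, 0.7612284353949836,
        0.7588940006054744, 0.7588587930357571,
    ],
    "ordinal mapping, mean fill": [
        0.7592536106272681, 0.7628082884089922, 0.7617342260822304,
        0.7580344934678367, 0.7591430836126225,
    ],
    "ordinal mapping, zero fill": [
        0.7598587660799765, 0.7635835866917194, 0.7618911101907058,
        0.7578635518090245, 0.7597198400752665,
    ],
}

# Average the five folds of each run
mean_aucs = {name: mean(scores) for name, scores in fold_aucs.items()}

for name, value in mean_aucs.items():
    print(f"{name}: mean AUC = {value:.4f}")
```

The three means differ only in the fourth decimal place, which is smaller than the spread between folds within any single run.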