In this advanced practice section, we build further optimizations on top of the original Baseline. The general optimization ideas come from two directions: feature engineering and the model itself.
- Feature selection and removal: analyze feature importance and use feature-selection methods (e.g. model-based importance) to keep the most predictive features, or drop features that contribute little to model performance (see the sketch after this list).
- Feature combination and interaction: combine features (multiply, divide, etc.) to create new features that capture complex relationships between them.
- Binning numerical features: split continuous numerical features into intervals, which can make the model more robust to those features.
- Encoding categorical features: besides One-Hot encoding, try other schemes such as Label Encoding or Target Encoding to handle categorical features better.
- Mining time features: besides the date and hour extraction in the example, try extracting day of week, month, and other time information that may influence user behavior.
- Feature scaling: map numerical features into a similar range, which helps model convergence and performance.
- Dimensionality reduction: use algorithms such as PCA or t-SNE to map high-dimensional features into a lower-dimensional space, reducing dimensionality, computational cost, and noise.
- Sample balancing: for imbalanced datasets, use oversampling or undersampling to even out the class counts so the model does not favor the majority class.
- Ensemble learning: combine the predictions of multiple base models (voting, averaging, etc.) to improve generalization and accuracy.
- Regularization: constrain the model with L1/L2 regularization to reduce the risk of overfitting and improve generalization.
- Optimizer choice: pick an optimization algorithm suited to the problem and data (gradient descent, stochastic gradient descent, Newton's method, etc.) to improve training efficiency and accuracy.
- Hyperparameter tuning: search for the best hyperparameter combination with grid search, random search, etc. to push performance further.
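As a concrete illustration of the first idea, here is a minimal sketch of model-based feature selection; X and y are placeholders for an already-prepared numeric feature matrix and its labels, not variables from this notebook:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Sketch: keep only the features whose importance clears the default threshold.
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector.fit(X, y)  # X, y are assumed to be prepared beforehand
print(X.columns[selector.get_support()])  # names of the retained features
X_selected = selector.transform(X)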
Steps for this session:
- On top of the existing feature engineering, add new features and observe whether the model's F1 score changes (a small helper for this is sketched after this list).
- Try three groups of features that improve model accuracy, and record the feature-encoding process.
- Retrain the model with the best features and submit the results.
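To make the "did F1 change" check in the first step repeatable, a small helper along these lines can be reused for every candidate feature set (a sketch; it mirrors the cross-validation calls used later in this section):
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

# Sketch: score a candidate feature set with out-of-fold F1 on the positive class.
def evaluate_features(model, features, target):
    pred = cross_val_predict(model, features, target, cv=5)
    return f1_score(target, pred)

# Example (after the feature engineering below):
# evaluate_features(DecisionTreeClassifier(),
#                   train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
#                   train_data['target'])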
# Import libraries
import pandas as pd
import numpy as np
# Read the training and test set files
train_data = pd.read_csv('用户新增预测挑战赛公开数据/train.csv')
test_data = pd.read_csv('用户新增预测挑战赛公开数据/test.csv')
# Expand the udmap field into key1..key9 columns (a manual one-hot-style expansion)
def udmap_onethot(d):
    v = np.zeros(9)
    if d == 'unknown':  # missing udmap: leave all nine values at 0
        return v
    d = eval(d)  # udmap is stored as a dict-like string
    for i in range(1, 10):
        if 'key' + str(i) in d:
            v[i - 1] = d['key' + str(i)]
    return v
train_udmap_df = pd.DataFrame(np.vstack(train_data['udmap'].apply(udmap_onethot)))
test_udmap_df = pd.DataFrame(np.vstack(test_data['udmap'].apply(udmap_onethot)))
train_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
test_udmap_df.columns = ['key' + str(i) for i in range(1, 10)]
# Encode whether udmap is missing
train_data['udmap_isunknown'] = (train_data['udmap'] == 'unknown').astype(int)
test_data['udmap_isunknown'] = (test_data['udmap'] == 'unknown').astype(int)
# Concatenate the udmap features with the original data
train_data = pd.concat([train_data, train_udmap_df], axis=1)
test_data = pd.concat([test_data, test_udmap_df], axis=1)
# Extract the frequency feature of eid
train_data['eid_freq'] = train_data['eid'].map(train_data['eid'].value_counts())
test_data['eid_freq'] = test_data['eid'].map(train_data['eid'].value_counts())
# Extract the label (target mean) feature of eid
train_data['eid_mean'] = train_data['eid'].map(train_data.groupby('eid')['target'].mean())
test_data['eid_mean'] = test_data['eid'].map(train_data.groupby('eid')['target'].mean())
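Note that eid_mean is a target (mean) encoding computed on the full training set, so every row's own label contributes to its feature value. A leakage-reducing variant computes out-of-fold means instead; a minimal sketch, not part of the original pipeline:
from sklearn.model_selection import KFold

# Sketch: out-of-fold target encoding, so a row never sees its own label.
def oof_target_encode(df, col, target, n_splits=5):
    encoded = pd.Series(np.nan, index=df.index)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fit_idx, enc_idx in kf.split(df):
        fold_means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][col].map(fold_means).to_numpy()
    return encoded.fillna(df[target].mean())  # unseen categories get the global mean

# train_data['eid_mean'] = oof_target_encode(train_data, 'eid', 'target')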
# Parse the millisecond timestamp and extract the hour
train_data['common_ts'] = pd.to_datetime(train_data['common_ts'], unit='ms')
test_data['common_ts'] = pd.to_datetime(test_data['common_ts'], unit='ms')
train_data['common_ts_hour'] = train_data['common_ts'].dt.hour
test_data['common_ts_hour'] = test_data['common_ts'].dt.hour
# Import models
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
# Import cross-validation and evaluation metrics
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
Train and validate SGDClassifier:
pred = cross_val_predict(
    SGDClassifier(max_iter=20),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target'],
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.864     0.763     0.811    533155
           1      0.156     0.267     0.197     87201

    accuracy                          0.694    620356
   macro avg      0.510     0.515     0.504    620356
weighted avg      0.765     0.694     0.724    620356
Train and validate DecisionTreeClassifier:
pred = cross_val_predict(
    DecisionTreeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target'],
    cv=5
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.934     0.940     0.937    533155
           1      0.617     0.592     0.605     87201

    accuracy                          0.891    620356
   macro avg      0.776     0.766     0.771    620356
weighted avg      0.889     0.891     0.890    620356
Train and validate MultinomialNB:
pred = cross_val_predict(
    MultinomialNB(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.893     0.736     0.807    533155
           1      0.221     0.458     0.298     87201

    accuracy                          0.697    620356
   macro avg      0.557     0.597     0.552    620356
weighted avg      0.798     0.697     0.735    620356
Train and validate RandomForestClassifier:
pred = cross_val_predict(
    RandomForestClassifier(n_estimators=5),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.921     0.955     0.938    533155
           1      0.645     0.500     0.563     87201

    accuracy                          0.891    620356
   macro avg      0.783     0.728     0.751    620356
weighted avg      0.882     0.891     0.885    620356
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.ensemble import BaggingClassifier
Train and validate RidgeClassifier:
pred = cross_val_predict(
    RidgeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.864     0.998     0.926    533155
           1      0.730     0.039     0.074     87201

    accuracy                          0.863    620356
   macro avg      0.797     0.518     0.500    620356
weighted avg      0.845     0.863     0.806    620356
Train and validate ExtraTreeClassifier:
pred = cross_val_predict(
    ExtraTreeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.922     0.931     0.927    533155
           1      0.553     0.519     0.536     87201

    accuracy                          0.873    620356
   macro avg      0.738     0.725     0.731    620356
weighted avg      0.870     0.873     0.872    620356
Train and validate BaggingClassifier:
pred = cross_val_predict(
    BaggingClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.930     0.966     0.948    533155
           1      0.730     0.555     0.630     87201

    accuracy                          0.909    620356
   macro avg      0.830     0.761     0.789    620356
weighted avg      0.902     0.909     0.903    620356
Train and validate DecisionTreeClassifier again:
pred = cross_val_predict(
    DecisionTreeClassifier(),
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
    train_data['target']
)
print(classification_report(train_data['target'], pred, digits=3))
              precision    recall  f1-score   support

           0      0.956     0.948     0.952    533155
           1      0.699     0.735     0.716     87201

    accuracy                          0.918    620356
   macro avg      0.828     0.842     0.834    620356
weighted avg      0.920     0.918     0.919    620356
Add more features, and switch the model from DecisionTreeClassifier to BaggingClassifier:
train_data["common_ts_month"] = train_data["common_ts"].dt.month # 添加新特征“month”,代表”当前月份“。
test_data["common_ts_month"] = test_data["common_ts"].dt.month
train_data["common_ts_day"] = train_data["common_ts"].dt.day # 添加新特征“day”,代表”当前日期“。
test_data["common_ts_day"] = test_data["common_ts"].dt.day
train_data["common_ts_hour"] = train_data["common_ts"].dt.hour # 添加新特征“hour”,代表”当前小时“。
test_data["common_ts_hour"] = test_data["common_ts"].dt.hour
train_data["common_ts_minute"] = train_data["common_ts"].dt.minute # 添加新特征“minute”,代表”当前分钟“。
test_data["common_ts_minute"] = test_data["common_ts"].dt.minute
train_data["common_ts_weekofyear"] = train_data["common_ts"].dt.isocalendar().week.astype(int) # 添加新特征“weekofyear”,代表”当年第几周“,并转换成 int,否则 LightGBM 无法处理。
test_data["common_ts_weekofyear"] = test_data["common_ts"].dt.isocalendar().week.astype(int)
train_data["common_ts_dayofyear"] = train_data["common_ts"].dt.dayofyear # 添加新特征“dayofyear”,代表”当年第几日“。
test_data["common_ts_dayofyear"] = test_data["common_ts"].dt.dayofyear
train_data["common_ts_dayofweek"] = train_data["common_ts"].dt.dayofweek # 添加新特征“dayofweek”,代表”当周第几日“。
test_data["common_ts_dayofweek"] = test_data["common_ts"].dt.dayofweek
train_data["common_ts_is_weekend"] = train_data["common_ts"].dt.dayofweek // 6 # 添加新特征“is_weekend”,代表”是否是周末“,1 代表是周末,0 代表不是周末。
test_data["common_ts_is_weekend"] = test_data["common_ts"].dt.dayofweek // 6
train_data["common_ts_hoursofday"] = (train_data["common_ts_hour"] < 2) | (train_data["common_ts_hour"] > 13)
train_data["common_ts_hoursofday"] = train_data["common_ts_hoursofday"].astype(int)
test_data["common_ts_hoursofday"] = (test_data["common_ts_hour"] < 2) | (train_data["common_ts_hour"] > 13)
test_data["common_ts_hoursofday"] = test_data["common_ts_hoursofday"].astype(int)
train_data['x1_freq'] = train_data['x1'].map(train_data['x1'].value_counts())
test_data['x1_freq'] = test_data['x1'].map(train_data['x1'].value_counts())
train_data['x1_mean'] = train_data['x1'].map(train_data.groupby('x1')['target'].mean())
test_data['x1_mean'] = test_data['x1'].map(train_data.groupby('x1')['target'].mean())
train_data['x2_freq'] = train_data['x2'].map(train_data['x2'].value_counts())
test_data['x2_freq'] = test_data['x2'].map(train_data['x2'].value_counts())
train_data['x2_mean'] = train_data['x2'].map(train_data.groupby('x2')['target'].mean())
test_data['x2_mean'] = test_data['x2'].map(train_data.groupby('x2')['target'].mean())
## x3_freq and x4_freq contain NaN values
#train_data['x3_freq'] = train_data['x3'].map(train_data['x3'].value_counts())
#test_data['x3_freq'] = test_data['x3'].map(train_data['x3'].value_counts())
#train_data['x4_freq'] = train_data['x4'].map(train_data['x4'].value_counts())
#test_data['x4_freq'] = test_data['x4'].map(train_data['x4'].value_counts())
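The NaNs come from missing x3/x4 values (value_counts drops NaN) and, on the test side, from categories never seen in training; if you want to keep these features, filling with 0 is one option (a sketch, kept commented out like the original):
# Sketch: keep x3_freq by filling unseen-category NaNs with 0; same for x4.
# train_data['x3_freq'] = train_data['x3'].map(train_data['x3'].value_counts()).fillna(0)
# test_data['x3_freq'] = test_data['x3'].map(train_data['x3'].value_counts()).fillna(0)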
train_data['x6_freq'] = train_data['x6'].map(train_data['x6'].value_counts())
test_data['x6_freq'] = test_data['x6'].map(train_data['x6'].value_counts())
train_data['x6_mean'] = train_data['x6'].map(train_data.groupby('x6')['target'].mean())
test_data['x6_mean'] = test_data['x6'].map(train_data.groupby('x6')['target'].mean())
train_data['x7_freq'] = train_data['x7'].map(train_data['x7'].value_counts())
test_data['x7_freq'] = test_data['x7'].map(train_data['x7'].value_counts())
train_data['x7_mean'] = train_data['x7'].map(train_data.groupby('x7')['target'].mean())
test_data['x7_mean'] = test_data['x7'].map(train_data.groupby('x7')['target'].mean())
train_data['x8_freq'] = train_data['x8'].map(train_data['x8'].value_counts())
test_data['x8_freq'] = test_data['x8'].map(train_data['x8'].value_counts())
train_data['x8_mean'] = train_data['x8'].map(train_data.groupby('x8')['target'].mean())
test_data['x8_mean'] = test_data['x8'].map(train_data.groupby('x8')['target'].mean())
train_data["x7_01"] = train_data["x7"] == 1
train_data["x7_01"] = train_data["x7_01"].astype(int)
test_data["x7_01"] = test_data["x7"] == 1
test_data["x7_01"] = test_data["x7_01"].astype(int)
# clf = DecisionTreeClassifier()
# clf.fit(
#     train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
#     train_data['target']
# )
clf = BaggingClassifier()
clf.fit(
    train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),  # feature matrix: drop the non-feature columns
    train_data['target']
)
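BaggingClassifier is trained with default hyperparameters here. Following the hyperparameter-tuning idea from the list at the top, a grid search could be layered on; a sketch, where the grid values are assumptions rather than tuned choices:
from sklearn.model_selection import GridSearchCV

# Sketch: search a small grid of BaggingClassifier hyperparameters by F1.
param_grid = {'n_estimators': [10, 20, 50], 'max_samples': [0.5, 0.8, 1.0]}
search = GridSearchCV(BaggingClassifier(), param_grid, scoring='f1', cv=3, n_jobs=-1)
# search.fit(train_data.drop(['udmap', 'common_ts', 'uuid', 'target'], axis=1),
#            train_data['target'])
# clf = search.best_estimator_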
result_df = pd.DataFrame({
    'uuid': test_data['uuid'],
    'target': clf.predict(test_data.drop(['udmap', 'common_ts', 'uuid'], axis=1))  # predict on the test set with the trained clf and store the result in the 'target' column
})
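The cell above only builds result_df; writing it out as a submission file would look like this (the filename 'submit.csv' is an assumption about the expected format):
# Write the predictions to a CSV for submission (filename assumed).
result_df.to_csv('submit.csv', index=None)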
Finally, submit the results; you can see that the score has improved.