模型亮点
- 随机森林,初始测试集上评分为0.53,调参后测试集上评分为0.85
- 梯度提升决策树,初始测试集上评分为0.56,调参后测试集上评分为0.88
-----------------------------------------以下为模型具体实现-----------------------------------------
Step1.数据读取
import pandas as pd df=pd.read_csv('bankpep.csv',index_col=0) df.head()
Step2.数据清洗
# 1)性别:female->1,male->0 df.loc[df['sex']=='FEMALE','sex']=1 df.loc[df['sex']=='MALE','sex']=0
# 2)婚否、车、储蓄行为、目前行为、是否按揭、是否接受提议:yes->1,no->0 col=['married','car','save_act','current_act','mortgage','pep'] for i in col: df.loc[df[i]=='YES',i]=1 df.loc[df[i]=='NO',i]=0
# 3)地区:rural->1,suburan->2,town->3,inner_city->4 df.loc[df['region']=='RURAL','region']=1 df.loc[df['region']=='SUBURBAN','region']=2 df.loc[df['region']=='TOWN','region']=3 df.loc[df['region']=='INNER_CITY','region']=4
Step3.划分训练集和测试集
from sklearn.model_selection import train_test_split x=(df.drop('pep',axis=1)).astype(float) y=df['pep'].astype(float) x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
Step4.启动随机森林分类器
from sklearn.ensemble import RandomForestClassifier model1=RandomForestClassifier(n_estimators=1,max_depth=1) #初始参数设定 model1.fit(x_train,y_train)
Step5.模型评价-随机森林
def train_score(model,x_train,y_train): print("训练集上评分:",round(model.score(x_train,y_train),2)) def test_score(model,x_test,y_test): print("测试集上评分:",round(model.score(x_test,y_test),2)) print("-----随机森林-----") train_score(model1,x_train,y_train) test_score(model1,x_test,y_test)
Step6.优化参数-随机森林
from sklearn.model_selection import RandomizedSearchCV def randomizedsearch(model,params,x_train,y_train): model=RandomizedSearchCV(model,params,n_iter=100,cv=10) model.fit(x_train,y_train) #再次用新参数拟合 print("最优评分:",round(model.best_score_,2)) print("最优参数:",model.best_params_) return model print("-----随机森林-----") params1={'n_estimators':range(1,21),'max_depth':range(1,21)} #注意,参数要加引号 model1=randomizedsearch(model1,params1,x_train,y_train)
Step7.模型保存-随机森林
from sklearn.externals import joblib joblib.dump(model1,'d:\gbdt.pkl') new_model1=joblib.load('d:\gbdt.pkl') print("-----随机森林-----") print("测试集上预测结果:\n",new_model1.predict(x_test))
Step8.启动梯度提升分类器
from sklearn.ensemble import GradientBoostingClassifier model2=GradientBoostingClassifier(learning_rate=0.01,n_estimators=1,max_depth=1) #初始参数设定 model2.fit(x_train,y_train)
Step9.模型评价-梯度提升决策树
print("-----梯度提升决策树-----") train_score(model2,x_train,y_train) test_score(model2,x_test,y_test)
Step10.优化参数-梯度提升决策树
import numpy as np params2={'learning_rate':np.arange(0.01,0.3,0.1),'n_estimators':range(1,21),'max_depth':range(1,21)} model2=randomizedsearch(model2,params2,x_train,y_train)
Step11.模型保存-梯度提升决策树
from sklearn.externals import joblib joblib.dump(model2,'d:\gbdt.pkl') new_model2=joblib.load('d:\gbdt.pkl') print("-----梯度提升决策树-----") print("测试集上预测结果:\n",new_model2.predict(x_test))
-END
标签:df,RandomForest,print,train,-----,test,model,GBDT,决策树 From: https://www.cnblogs.com/peitongshi/p/17476065.html