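Course link: Course competition: California 2020 house price prediction (bilibili)
Competition link: California House Prices | Kaggle
Mu Li's official solution: "Beat 90% of data scientists with 10 lines of code?" (bilibili)
Kaggle competition: California 2020 house price prediction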
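The 2020 California house price dataset from Mu Li's course is large. Training on it with the method from Section 4.10 of the textbook makes hyperparameter tuning very time-consuming, so I switched to AutoML (AutoGluon) for training and prediction.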
# Unzip the data
import os
import zipfile


def unzip_file(zip_file, extract_dir):
    # Extract every file in the archive into extract_dir
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    # Delete the archive once extraction has finished
    os.remove(zip_file)


unzip_file('train.csv.zip', '/home/NAS/HUIDA/YaqinJiang/my')
unzip_file('test.csv.zip', '/home/NAS/HUIDA/YaqinJiang/my')
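Because `unzip_file` deletes the archive as soon as extraction finishes, it can be worth trying the pattern on a throwaway archive first. The following is a minimal sketch using only the standard library; the helper name `unzip_keep`, the `keep_archive` flag, and the `demo.csv` file are hypothetical and not part of the original post.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_keep(zip_file, extract_dir, keep_archive=True):
    """Extract zip_file into extract_dir; delete the archive only if asked to."""
    extract_dir = Path(extract_dir)
    extract_dir.mkdir(parents=True, exist_ok=True)  # create the target if missing
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
        names = zip_ref.namelist()
    if not keep_archive:
        Path(zip_file).unlink()
    return names


# Try it on a throwaway archive inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / 'demo.csv.zip'
    with zipfile.ZipFile(archive, 'w') as zf:
        zf.writestr('demo.csv', 'Id,Sold Price\n0,1000\n')
    extracted = unzip_keep(archive, tmp / 'out')
    demo_exists = (tmp / 'out' / 'demo.csv').exists()
    archive_kept = archive.exists()
```

Keeping the archive by default means a failed or interrupted run does not require downloading the data again.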
# Training
from autogluon.tabular import TabularDataset, TabularPredictor
import numpy as np
import pandas as pd

train_data = TabularDataset(r".../my/train.csv")

# Log-transform the columns whose values span a wide range
large_val_cols = ['Lot', 'Total interior livable area',
                  'Tax assessed value', 'Annual tax amount',
                  'Listed Price', 'Last Sold Price']
for col in large_val_cols + ['Sold Price']:
    train_data[col] = np.log(train_data[col] + 1)

# Drop the Id column and the State column (every row is in California)
predictor = TabularPredictor(label='Sold Price').fit(
    train_data.drop(columns=['Id', 'State']))
# Prediction
test_data = TabularDataset(r".../my/test.csv")

# Apply the same log transform to the test features
for col in large_val_cols:
    test_data[col] = np.log(test_data[col] + 1)

preds = predictor.predict(test_data.drop(columns=['Id', 'State']))

# Undo the log transform on the predicted label before writing the submission
submission = pd.DataFrame({'Id': test_data['Id'], 'Sold Price': np.exp(preds) - 1})
submission.to_csv('submission_automl.csv', index=False)
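The label is trained in log(x + 1) space and the predictions are mapped back with exp(p) - 1, so the two steps have to be exact inverses of each other. The sketch below checks that round trip on made-up numbers, using NumPy's `log1p` and `expm1` as equivalent spellings of the same two operations; the toy values are illustrative and not taken from the competition data.

```python
import numpy as np

# Made-up prices, including a zero to show why 1 is added before the log.
sold_price = np.array([0.0, 1000.0, 260000.0, 1150000.0])

logged = np.log1p(sold_price)   # same as np.log(sold_price + 1)

# Pretend a model predicted the logged label exactly, then invert it.
restored = np.expm1(logged)     # same as np.exp(logged) - 1

round_trip_ok = bool(np.allclose(restored, sold_price))
zero_is_finite = bool(np.isfinite(logged).all())  # log(0 + 1) = 0, not -inf
```

If the inverse step were skipped, the submission would contain logged prices rather than prices in dollars.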
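After submitting to Kaggle, the private score was 0.12561 and the public score was 0.13845. The private score is close to but does not beat Mu Li's baseline (0.12502), while the public score is slightly better (lower) than his baseline (0.13911).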
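Scores like these can be approximated locally before submitting. The post does not state the leaderboard metric, so the sketch below assumes it is the RMSE between the logarithms of the predicted and actual prices; the function name `log_rmse` and the sample prices are hypothetical.

```python
import numpy as np


def log_rmse(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); an assumed stand-in for the leaderboard metric."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Made-up prices in dollars (not logged).
actual = np.array([500000.0, 800000.0, 1200000.0])

perfect = log_rmse(actual, actual)             # identical predictions
ten_pct_high = log_rmse(actual, actual * 1.1)  # every prediction 10% too high
```

Under this assumption a uniform 10% over-prediction scores log(1.1), roughly 0.095, which gives a sense of scale for the 0.125 to 0.139 range reported above. To use it on real data, hold out part of `train.csv` and compare the un-logged predictions against the un-logged `Sold Price`.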