1 Project Overview
This project uses the TF-IDF method together with four machine learning models (a Gaussian naive Bayes classifier, a ridge regression classifier, a support vector machine, and an MLP deep neural network) to classify news texts. It also includes the code for scraping news articles from cnr.cn (央广网), which is the source of the project's dataset.
2 News Text Classification with Machine Learning
2.1 jieba word segmentation
For each news article, jieba segments the text into words, which are then joined with spaces into a single string for the TF-IDF vectorizer: x_train = [" ".join(jieba.lcut(i)) for i in x_train]. A minimal sketch of this step follows.
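The sketch below uses a made-up sample sentence; the token list shown in the comment is only indicative of what jieba typically produces.
import jieba

sentence = "央广网发布了一条科技新闻"   # hypothetical news sentence
tokens = jieba.lcut(sentence)           # e.g. ['央广网', '发布', '了', '一条', '科技', '新闻']
print(" ".join(tokens))                 # space-joined string ready for TfidfVectorizer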
2.2 TfidfVectorizer
The TF-IDF method turns the text into a matrix that the sklearn classifiers can consume. It works as follows (a usage sketch appears after the list):
- Divide the number of occurrences of each word in a document by the total number of words in that document; these new features are called term frequencies (tf). This addresses the problem that raw occurrence counts are a good starting point, but longer documents have higher average word counts than shorter ones even when they describe the same topic.
- Down-weight words that appear in many documents of the training corpus, so that words appearing in only a small fraction of the documents, which carry more information, stand out (inverse document frequency, idf).
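A minimal sketch of the vectorization, assuming the documents were already segmented and space-joined as in 2.1; the two toy documents are made up, and min_df is lowered to 1 so the tiny corpus yields a non-empty vocabulary:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["央广网 发布 科技 新闻", "旅游 新闻 介绍 美食"]   # hypothetical segmented documents
tf_idf = TfidfVectorizer(ngram_range=(1, 2), min_df=1)    # the project uses min_df=2 on the full corpus
x = tf_idf.fit_transform(docs)                            # sparse document-term matrix of tf-idf weights
print(x.shape, len(tf_idf.vocabulary_))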
2.3 sklearn machine learning
Naive Bayes, a ridge regression classifier, and a support vector machine are used to classify the news texts (a minimal usage sketch follows the list).
- Naive Bayes methods are a set of supervised learning algorithms based on Bayes' theorem, with the "naive" assumption that every pair of features is mutually independent. GaussianNB implements the Gaussian naive Bayes algorithm for classification.
- The ridge classifier is a classifier built on ridge regression. It first converts the target values and then treats the problem as a regression task (multi-output regression in the multiclass case).
- A support vector machine is a generalized linear classifier that performs binary classification under supervised learning; its decision boundary is the maximum-margin hyperplane solved from the training samples. SVMs measure the empirical risk with the hinge loss and add a regularization term to the objective to control the structural risk, which makes them sparse and robust classifiers. Through kernel methods they can also perform non-linear classification, and they are among the most common kernel learning methods.
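All three classifiers share the same fit/predict interface in sklearn. Below is a minimal sketch with a made-up dense feature matrix standing in for the TF-IDF output of 2.4; note that GaussianNB requires dense input, which is why the full code below calls .toarray() on the sparse matrices.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import RidgeClassifier
from sklearn import svm

# Tiny hypothetical feature matrix with two classes.
x_train = np.array([[0.1, 0.0], [0.0, 0.9], [0.2, 0.1], [0.0, 0.8]])
y_train = [0, 1, 0, 1]
x_test = np.array([[0.15, 0.05]])

for clf in (GaussianNB(), RidgeClassifier(), svm.SVC()):
    clf.fit(x_train, y_train)                 # identical training call for each model
    print(type(clf).__name__, clf.predict(x_test))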
2.4 Experiment source code
# News text classification with sklearn machine learning models
import pandas as pd
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import RidgeClassifier
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
import time


def load_data(filepath):
    # Read the scraped news CSV and split it into training and test sets
    train_df = pd.read_csv(filepath)
    x_train, y_train = train_df['text'], train_df['label']
    x_train, x_test, y_train, y_test = \
        train_test_split(x_train, y_train, test_size=0.2)
    return x_train, x_test, y_train, y_test


def data_prep(x_train, y_train, x_test):
    # Segment with jieba, vectorize with TF-IDF, and keep at most 20,000 of the best features
    tf_idf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    x_train = [" ".join(jieba.lcut(i)) for i in x_train]
    x_test = [" ".join(jieba.lcut(i)) for i in x_test]
    x_train = tf_idf.fit_transform(x_train)
    x_test = tf_idf.transform(x_test)
    selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
    selector.fit(x_train, y_train)
    x_train = selector.transform(x_train)  # numpy.float64
    x_test = selector.transform(x_test)
    return x_train, x_test


def main():
    x_train, x_test, y_train, y_test = load_data("newses.csv")
    x_train, x_test = data_prep(x_train, y_train, x_test)

    start = time.time()
    gnb = GaussianNB()  # Gaussian naive Bayes
    print(f'1:{cross_val_score(gnb, x_train.toarray(), y_train, cv=10)}')
    gnb.fit(x_train.toarray(), y_train)
    answer_gnb = pd.Series(gnb.predict(x_test.toarray()))
    answer_gnb.to_csv("answer_gnb.csv", header=False, index=False)
    score_gnb = f1_score(y_test, answer_gnb, average='macro')
    print(f'F1_score_gnb:{score_gnb}')
    end = time.time()
    print(f'Time: {end - start}s')

    start = time.time()
    rc = RidgeClassifier()  # ridge classifier
    print(f'\n2:{cross_val_score(rc, x_train, y_train, cv=10)}')
    rc.fit(x_train, y_train)
    answer_rc = pd.Series(rc.predict(x_test))
    answer_rc.to_csv("answer_rc.csv", header=False, index=False)
    score_rc = f1_score(y_test, answer_rc, average='macro')
    print(f'F1_score_rc:{score_rc}')
    end = time.time()
    print(f'Time: {end - start}s')

    start = time.time()
    sv = svm.SVC()  # support vector machine
    print(f'\n3:{cross_val_score(sv, x_train, y_train, cv=10)}')
    sv.fit(x_train, y_train)
    answer_sv = pd.Series(sv.predict(x_test))
    answer_sv.to_csv("answer_sv.csv", header=False, index=False)
    score_sv = f1_score(y_test, answer_sv, average='macro')
    print(f'F1_score_sv:{score_sv}')
    end = time.time()
    print(f'Time: {end - start}s')


main()
2.5 Experiment notes
- After jieba segmentation, the sklearn classifiers improve markedly.
- To run machine learning algorithms on text, the content must first be converted into numeric feature vectors; the TF-IDF method is recommended over a plain bag-of-words count.
- Split the dataset into training and test sets before converting the text into feature vectors, so that unlabeled text can be handled the same way (the vectorizer is fit only on the training set).
3 News Text Classification with Deep Learning
3.1 MLP
By model structure, artificial neural networks can be broadly divided into feedforward networks (also called multilayer perceptron networks) and feedback networks (also called Hopfield networks). Mathematically, the former can be viewed as a class of large-scale non-linear mapping systems, while the latter are a class of large-scale non-linear dynamical systems.
3.2 Experiment source code
# News text classification with a deep learning MLP
import numpy as np
import pandas as pd
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from tensorflow.python.keras import models
from tensorflow.python.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score


def load_data(filepath):
    # Read the scraped news CSV and split it into training, validation, and test sets
    train_df = pd.read_csv(filepath)
    x_train, y_train = train_df['text'], train_df['label']
    x_train, x_test, y_train, y_test = \
        train_test_split(x_train, y_train, test_size=0.2)
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2)
    return x_train, x_val, x_test, y_train, y_val, y_test


def data_prep(x_train, y_train, x_val, x_test):
    # Segment with jieba, vectorize with TF-IDF, and keep at most 20,000 of the best features
    tf_idf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    x_train = [" ".join(jieba.lcut(i)) for i in x_train]
    x_val = [" ".join(jieba.lcut(i)) for i in x_val]
    x_test = [" ".join(jieba.lcut(i)) for i in x_test]
    x_train = tf_idf.fit_transform(x_train)
    x_val = tf_idf.transform(x_val)
    x_test = tf_idf.transform(x_test)
    selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
    selector.fit(x_train, y_train)
    x_train = selector.transform(x_train)  # numpy.float64
    x_val = selector.transform(x_val)
    x_test = selector.transform(x_test)
    return x_train, x_val, x_test


def main():
    x_train, x_val, x_test, y_train, y_val, y_test = load_data("newses.csv")
    x_train, x_val, x_test = data_prep(x_train, y_train, x_val, x_test)
    model = models.Sequential([
        Dropout(rate=0.2, input_shape=x_train.shape[1:]),  # x_train.shape[1:]: (20000,)
        Dense(units=64, activation='relu'),
        Dropout(rate=0.2),
        Dense(10, activation='softmax')  # 10 news categories
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train.toarray(), y_train, epochs=100, verbose=0,
                        validation_data=(x_val.toarray(), y_val),
                        batch_size=128)
    history = history.history
    print('Validation accuracy: {acc}, loss: {loss}'.format(acc=history['val_accuracy'][-1],
                                                            loss=history['val_loss'][-1]))
    model.evaluate(x_test.toarray(), y_test)
    y_predict = model.predict(x_test.toarray())
    predicts = []
    for i in y_predict:
        predicts.append(np.argmax(i))  # take the class with the highest probability
    print(f'Predicts:{predicts}')
    score = f1_score(y_test, predicts, average='macro')
    print(f'F1_score:{score}')
    model.save('News_mlp_model.h5')


main()
3.3 Experiment notes
- Whether the loss function is chosen correctly affects whether the program runs at all.
- The neural network model is strongly affected by the amount of training data.
- The dataset is split into training, validation, and test sets before being fed to the model.
- A validation set can be passed in during training to monitor the model's accuracy on it.
- The values returned by Model.predict() must be converted with np.argmax() to obtain the predicted classes (see the sketch below).
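A minimal sketch of that conversion, with made-up probabilities standing in for the output of model.predict():
import numpy as np

# Hypothetical model.predict() output for two samples over three classes.
y_predict = np.array([[0.1, 0.7, 0.2],
                      [0.8, 0.1, 0.1]])
predicts = np.argmax(y_predict, axis=1)   # index of the highest probability per row
print(predicts)                           # [1 0]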
4 News Text Scraping
4.1 requests + BeautifulSoup
requests downloads the page source, and BeautifulSoup extracts the news links and the news text; a minimal sketch follows.
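The sketch below assumes the category page http://tech.cnr.cn/ from section 4.3 is reachable and still links articles whose URLs end in "shtml":
import requests
from bs4 import BeautifulSoup

# Download one category page and list the article links it contains.
r = requests.get("http://tech.cnr.cn/", timeout=30)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text, "html.parser")
links = [a.get("href") for a in soup.find_all("a") if a.get("href", "").endswith("shtml")]
print(len(links), links[:3])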
4.2 Scraping the news text
First, collect all the news links; next, iterate over those links, fetch the title and body of each article and attach a category label; finally, convert the result into a pandas DataFrame and save it as a CSV file.
4.3 Experiment source code
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_html_text(url):
    # Download a page; on failure, return the URL itself as a sentinel value
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        print(url)
        return url


def parse_news_page(html):
    # Extract the title and all paragraph text of one article as a single string
    try:
        ilt = []
        soup = BeautifulSoup(html, "html.parser")
        title = soup.find("title").string
        ilt.append(title)
        content = soup.find_all("p")
        for p in content:
            s = p.text.strip()
            s = "".join(s.split("\n"))
            ilt.append(s)
        news = "".join(ilt)
        return news
    except Exception as e:
        return e


def parse_href_page(html, hrefs):
    # Collect all article links (URLs ending in "shtml") on a category page
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("a")
    for tag in tags:
        href = tag.get("href", "")  # some <a> tags have no href attribute
        if href.endswith("shtml") and href not in hrefs:
            hrefs.append(href)
    return hrefs


def get_newses(url, newses, labels, count):
    # Fetch every article under one category page and label it with `count`
    hrefs = []
    html = get_html_text(url)
    parse_href_page(html, hrefs)
    for href in hrefs:
        html = get_html_text(href)
        if html == href:  # download failed, skip this article
            continue
        news = parse_news_page(html)
        # print(news)
        newses.append(news)
        labels.append(count)


def main():
    newses = []
    labels = []
    urls = ["http://finance.cnr.cn/", "http://tech.cnr.cn/", "http://food.cnr.cn/",
            "http://health.cnr.cn/", "http://edu.cnr.cn/", "http://travel.cnr.cn/",
            "http://military.cnr.cn/", "http://auto.cnr.cn/", "http://house.cnr.cn/",
            "http://gongyi.cnr.cn/"]
    count = 0
    for url in urls:
        print(url)
        get_newses(url, newses, labels, count)
        count += 1
    newses = pd.DataFrame({"label": labels, "text": newses})
    newses.to_csv("newses.csv", index=False)


main()
4.4 Experiment notes
- When parsing the page that lists all the news links, it is important to extract only the links and not the titles; otherwise an article might end up with its title captured but not its link, or the other way around.
- The article title is instead taken from the page that contains the full article, where it is much easier to extract.
- The scraper needs a function that parses an article page, a function that parses the page listing all the news links, a function that fetches every article of one category, and a main function that iterates over the category links to collect the articles of all 10 categories.
5 Project Results
Figure 1
Figure 2
Figure 3