1 Naive Bayes: Algorithm Description
How it works:
For a given item to classify, compute the probability of each class conditioned on the item's appearance, and assign the item to whichever class has the largest probability.
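A minimal sketch of that decision rule (the probabilities below are made-up numbers, not output of the book's code):

# hypothetical posteriors P(class | item) for a single item
posteriors = {0: 0.35, 1: 0.65}
# pick the class with the largest probability
predicted = max(posteriors, key=posteriors.get)
print(predicted)  # -> 1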
2 Pseudocode for Computing the Probabilities
Count the number of documents in each class
For every training document:
    For each class:
        If a token appears in the document:
            increment the count for that token
        increment the count for total tokens
For each class:
    For each token:
        divide the token count by the total token count to get the conditional probability
Return the conditional probability for each class
3 Wiki: Bayes' Theorem, the Naive Assumption, and Smoothing
(1) Bayes' theorem
Bayes' theorem relates the conditional probabilities of two random events A and B:
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the probability of A given that B has occurred.
Each term in Bayes' theorem has a conventional name:
- P(A|B) is the conditional probability of A given B; because it is derived from the value of B, it is also called the posterior probability of A.
- P(B|A) is the conditional probability of B given A; because it is derived from the value of A, it is also called the posterior probability of B.
- P(A) is the prior probability (or marginal probability) of A. It is called "prior" because it does not take any information about B into account.
- P(B) is the prior or marginal probability of B.
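As a worked example with made-up numbers: if P(spam) = 0.3, P("buy"|spam) = 0.2, and P("buy") = 0.08 over all mail, then Bayes' theorem gives P(spam|"buy") = 0.2 * 0.3 / 0.08 = 0.75, i.e. under these assumptions a message containing "buy" is spam with probability 0.75.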
(2) The naive assumption
The features in the dataset are assumed to be mutually independent: the probability that a feature or word appears is unrelated to which other words it is adjacent to.
(3) Smoothing
When classifying a document with a naive Bayes classifier, many probabilities are multiplied together to obtain the probability that the document belongs to a given class. If any one of those probabilities is 0, the final product is 0 as well. To avoid zero probabilities, a smoothing term can be added.
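A minimal sketch of add-one (Laplace) smoothing on hypothetical counts; the trainNB0 function below applies the same idea by initializing its numerators to 1 and its denominators to 2:

import numpy as np

word_counts = np.array([3, 0, 1])   # hypothetical per-word counts within one class
total = word_counts.sum()
raw = word_counts / total           # [0.75, 0.0, 0.25] -- the zero would wipe out any product
smoothed = (word_counts + 1) / (total + len(word_counts))  # add-one smoothing
print(smoothed)                     # [0.571..., 0.142..., 0.285...] -- no zeros remain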
4 Strengths and Weaknesses of Naive Bayes
(1) Strengths
Still effective when data is scarce; can handle problems with more than two classes.
(2) Weaknesses
Fairly sensitive to how the input data is prepared.
5 Python Implementation
(1) Converting word lists to vectors
import numpy as np

def loadDataSet():
    """
    Create some sample data to experiment with.
    """
    # the document set, already split into tokens
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # the class label vector
    # 0 marks a normal document, 1 marks an abusive document
    # the labels are used in training so the program can detect abusive posts automatically
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec
# listOPosts, listClasses = loadDataSet()
def createVocabList(dataSet):
    """
    Create a list of every unique word that appears in any document.
    """
    # start with an empty set
    vocabSet = set([])
    # build the union of all the per-document word sets
    for document in dataSet:
        # add each document's set of new words to the vocabulary;
        # the | operator takes the union of two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
# myVocabList = createVocabList(listOPosts)
# print(myVocabList)
#
# index_stupid = myVocabList.index('stupid')
# print(index_stupid)
# set-of-words model
def setOfWord2Vec(vocabList, inputSet):
    """
    :param vocabList: the vocabulary list
    :param inputSet: a document
    :return: the document vector; each element is 1 or 0, indicating whether the corresponding vocabulary word appears in the input document
    """
    # create a vector of zeros with the same length as the vocabulary
    returnVec = [0]*len(vocabList)
    # iterate over every word in the document
    for word in inputSet:
        # if the word is in the vocabulary
        if word in vocabList:
            # set the corresponding entry of the output vector to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word %s is not in my Vocabulary!' % word)
    return returnVec
# result_1 = setOfWord2Vec(myVocabList, listOPosts[0])
# result_2 = setOfWord2Vec(myVocabList, listOPosts[3])
# print(result_1, result_2)
(2) Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCatagory):
    """
    :param trainMatrix: the matrix of training document vectors
    :param trainCatagory: the labels of the training documents
    :return: p0Vect, p1Vect, pAbusive
    """
    # total number of training documents
    numTrainDocs = len(trainMatrix)
    # length of the vocabulary (number of columns)
    numWords = len(trainMatrix[0])
    # probability that an arbitrary document is abusive
    pAbusive = sum(trainCatagory)/float(numTrainDocs)
    # # vectors of zeros, one entry per vocabulary word
    # p0Num = np.zeros(numWords)
    # p1Num = np.zeros(numWords)
    #
    # # denominator terms
    # p0Denom = 0.0
    # p1Denom = 0.0
    # if any single probability is 0, the final product is 0 as well;
    # to lessen that effect, initialize every word count to 1 and each denominator to 2
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    # iterate over every document in the training set
    for i in range(numTrainDocs):
        # if this document is labeled abusive
        if trainCatagory[i] == 1:
            # add the document vectors element-wise; each entry of p1Num ends up being the
            # total number of times that word occurs across the abusive documents (a row vector)
            p1Num += trainMatrix[i]
            # sum each row's entries; p1Denom ends up being the total number of word
            # occurrences across the abusive documents (a scalar)
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # # vectors of per-word occurrence frequencies
    # p1Vect = p1Num/p1Denom
    # p0Vect = p0Num/p0Denom
    # multiplying many tiny numbers together causes underflow;
    # the fix is to take the natural log of the product, which avoids underflow
    # and floating-point rounding errors
    # because the log is monotonic, this does not change which class wins
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
    '''
    p0Vect: probability of each word occurring, given a non-abusive document
    p1Vect: probability of each word occurring, given an abusive document
    pAbusive: probability that an arbitrary document is abusive
    '''
    return p0Vect, p1Vect, pAbusive
# trainMat = []
# for postinDoc in listOPosts:
# trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
# print('------------------------')
# print(trainMat)
# p0v, p1v, pAb = trainNB0(trainMat, listClasses)
# print(p0v)
# print(p0v[index_stupid])
# print('------------------------')
#
# print(p1v)
# print(p1v[index_stupid])
# print('------------------------')
#
# print(pAb)
# print('------------------------')
(3) Naive Bayes classification function (Bernoulli naive Bayes)
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    :param vec2Classify: the vector to classify
    :param p0Vec: log word probabilities for class 0 (non-abusive)
    :param p1Vec: log word probabilities for class 1 (abusive)
    :param pClass1: prior probability that a document is class 1
    :return: the predicted class, 1 or 0
    """
    # the word probabilities are stored as logs, so the products in Bayes' rule become sums
    p1 = sum(vec2Classify*p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify*p0Vec) + np.log(1.0-pClass1)
    # print(p1, p0)
    if p1 > p0:
        return 1
    else:
        return 0
def testingNB():
    # training part
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))
    # testing part
    # the input test document
    testEntry = ['love', 'my', 'dalmation']
    # convert the test document into a vector using the vocabulary
    thisDoc = np.array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
# testingNB()
(4) Naive Bayes bag-of-words model
def bagOfWords2VecMN(vocabList, inputSet):
    # create a vector of zeros with the same length as the vocabulary
    returnVec = [0] * len(vocabList)
    # iterate over every word in the document
    for word in inputSet:
        # if the word is in the vocabulary
        if word in vocabList:
            # increment the corresponding entry of the output vector
            returnVec[vocabList.index(word)] += 1
    return returnVec
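A quick contrast between the two models, using the functions defined above on a hypothetical three-word vocabulary (not an example from the book):

vocab = ['dog', 'stupid', 'my']
doc = ['stupid', 'dog', 'stupid']
print(setOfWord2Vec(vocab, doc))     # [1, 1, 0] -- presence/absence only
print(bagOfWords2VecMN(vocab, doc))  # [1, 2, 0] -- occurrence counts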
6 Example: Filtering Spam with Naive Bayes
from bayes import createVocabList, setOfWord2Vec, trainNB0, classifyNB
import random
import numpy as np
def textParse(bigString):
    """
    Accept a long string bigString and parse it into a list of words longer than 2 characters.
    :param bigString: a long string
    :return: a list of words
    """
    import re
    # split on runs of characters outside [a-zA-Z0-9_]
    listOfTokens = re.split(r'\W+', bigString)
    # lowercase the tokens longer than 2 characters and collect them into a list
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
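For example, running textParse on a sample sentence:

print(textParse('This book is the best book on Python or M.L. I have ever laid eyes upon.'))
# -> ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']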
def spamTest():
    """
    Automated testing of the naive Bayes spam classifier.
    :return:
    """
    # docList: one entry per document (each document is a list of words)
    docList = []
    # classList: the class of each document, aligned one-to-one with docList
    classList = []
    # fullText: a flat list of every word in every document
    fullText = []
    # iterate over the txt files in the spam and ham directories
    for i in range(1, 26):
        # open one text file in the directory and parse it into a document
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        # append the document to docList
        docList.append(wordList)
        # extend fullText with the document's words
        fullText.extend(wordList)
        # append the document's class to classList
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    # build the vocabulary from docList via createVocabList
    vocabList = createVocabList(docList)
    # initialize trainingSet as the list of all 50 document indices
    trainingSet = list(range(50))
    # initialize testSet as empty
    testSet = []
    # repeat 10 times
    for i in range(10):
        # pick a random integer index between 0 and the current training-set length
        randIndex = int(random.uniform(0, len(trainingSet)))
        # add the element at that random index to the test set
        testSet.append(trainingSet[randIndex])
        # and delete it from the training set
        del(trainingSet[randIndex])
    # initialize the training matrix
    trainMat = []
    # initialize the training class list
    trainClasses = []
    # iterate over each docIndex remaining in the training set
    for docIndex in trainingSet:
        # append the document's word vector to trainMat
        trainMat.append(setOfWord2Vec(vocabList, docList[docIndex]))
        # append the document's class to trainClasses
        trainClasses.append(classList[docIndex])
    # call trainNB0 on trainMat and trainClasses to compute p0V, p1V, pSpam
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))
    # initialize the error count
    errorCount = 0
    # iterate over each docIndex in the test set
    for docIndex in testSet:
        # build the word vector
        wordVector = setOfWord2Vec(vocabList, docList[docIndex])
        # if the predicted class differs from the actual class
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            # add 1 to the error count
            errorCount += 1
    # print the error rate
    print('the error rate is:', float(errorCount)/len(testSet))
'''
Randomly selecting one part of the data as the training set while holding the remainder out as the test set is known as hold-out cross-validation.
'''
# spamTest()
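Because the train/test split is random, the error rate printed by spamTest varies from run to run. A minimal sketch of one refinement, assuming spamTest is modified to return the error rate instead of printing it (this is not part of the book's code):

# hypothetical: requires spamTest() to return the error rate rather than print it
# rates = [spamTest() for _ in range(10)]
# print('mean error rate:', sum(rates)/len(rates))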
7 Reproducing the Book's Example with pandas and scikit-learn
In [56]:
import os, sys
import re
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from sklearn.feature_extraction.text import CountVectorizer
In [57]:
text_dict = {}
for label in os.listdir('./email/'):
# print(label)
text_dict[label] = []
for filename in os.listdir('./email/{}'.format(label)):
# print(filename)
read_file = open('./email/{}/{}'.format(label, filename), 'r', encoding='UTF-8', errors='ignore').read()
text_dict[label].append(read_file)
# print(text_dict)
for text_dict_key in text_dict.keys():
text_dict[text_dict_key] = list(map(lambda x: ' '.join([word for word in [piece.lower() for piece in re.split(r'\W+', x) if len(piece)>2] \
if re.findall(r'^[A-Za-z]', word)]), text_dict[text_dict_key]))
# text_dict
row_doc = text_dict['ham'] + text_dict['spam']
row_doc
Out[57]:
['peter with jose out town you want meet once while keep things going and some interesting stuff let know eugene',
'ryan whybrew commented your status ryan wrote turd ferguson butt horn',
'arvind thirumalai commented your status arvind wrote you know reply this email comment this status',
'thanks peter definitely check this how your book going heard chapter came and was good shape hope you are doing well cheers troy',
'jay stepp commented your status jay wrote the reply this email comment this status see the comment thread follow the link below',
'linkedin kerry haloney requested add you connection linkedin peter like add you professional network linkedin kerry haloney',
'peter the hotels are the ones that rent out the tent they are all lined the hotel grounds much for being one with nature more like being one with couple dozen tour groups and nature have about pictures from that trip can through them and get you jpgs favorite scenic pictures where are you and jocelyn now new york will you come tokyo for chinese new year perhaps see the two you then will thailand for winter holiday see mom take care',
'yeah ready may not here because jar jar has plane tickets germany for',
'benoit mandelbrot benoit mandelbrot wilmott team benoit mandelbrot the mathematician the father fractal mathematics and advocate more sophisticated modelling quantitative finance died october aged wilmott magazine has often featured mandelbrot his ideas and the work others inspired his fundamental insights you must logged view these articles from past issues wilmott magazine',
'peter sure thing sounds good let know what time would good for you will come prepared with some ideas and can from there regards vivek',
'linkedin julius requested add you connection linkedin peter looking forward the book accept view invitation from julius',
'yay you both doing fine working mba design strategy cca top art school new program focusing more right brained creative and strategic approach management the way done today',
'thought about this and think possible should get another lunch have car now and could come pick you this time does this wednesday work can have signed copy you book',
'saw this the way the coast thought might like hangzhou huge one day wasn enough but got glimpse went inside the china pavilion expo pretty interesting each province has exhibit',
'hommies just got phone call from the roofer they will come and spaying the foaming today will dusty pls close all the doors and windows could you help close bathroom window cat window and the sliding door behind the don know how can those cats survive sorry for any inconvenience',
'scifinance now automatically generates gpu enabled pricing risk model source code that runs faster than serial code using new nvidia fermi class tesla series gpu scifinance derivatives pricing and risk model development tool that automatically generates and gpu enabled source code from concise high level model specifications parallel computing cuda programming expertise required scifinance automatic gpu enabled monte carlo pricing model source code generation capabilities have been significantly extended the latest release this includes',
'will there the latest',
'that cold there going retirement party are the leaves changing color',
'what going there talked john email talked about some computer stuff that went bike riding the rain was not that cold went the museum yesterday was get and they had free food the same time was giants game when got done had take the train with all the giants fans they are drunk',
'been working running website using jquery and the jqplot plugin not too far away from having prototype launch you used jqplot right not think you would like',
'there was guy the gas station who told that knew mandarin and python could get job with the fbi',
'hello since you are owner least one google groups group that uses the customized welcome message pages files are writing inform you that will longer supporting these features starting february made this decision that can focus improving the core functionalities google groups mailing lists and forum discussions instead these features encourage you use products that are designed specifically for file storage and page creation such google docs and google sites for example you can easily create your pages google sites and share the site http www google com support sites bin answer answer with the members your group you can also store your files the site attaching files pages http www google com support sites bin answer answer the site youre just looking for place upload your files that your group members can download them suggest you try google docs you can upload files http docs google com support bin answer answer and share access with either group http docs google com support bin answer answer individual http docs google com support bin answer answer assigning either edit download only access the files you have received this mandatory email service announcement update you about important changes google groups',
'zach hamm commented your status zach wrote doggy style enough said thank you good night',
'this mail was sent from notification only address that cannot accept incoming mail please not reply this message thank you for your online reservation the store you selected has located the item you requested and has placed hold your name please note that all items are held for day please note store prices may differ from those online you have questions need assistance with your reservation please contact the store the phone number listed below you can also access store information such store hours and location the web http www borders com online store storedetailview_98',
'peter these are the only good scenic ones and too bad there was girl back one them just try enjoy the blue sky',
'codeine for visa only codeine methylmorphine narcotic opioid pain reliever have pills for for for visa only',
'ordercializviagra online save pharmacy noprescription required buy canadian drugs wholesale prices and save fda approved drugs superb quality drugs only accept all major credit cards',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe the proven naturalpenisenhancement that works moneyback guaranteeed',
'buy ambiem zolpidem pill pills pills pills pills pills',
'ordercializviagra online save pharmacy noprescription required buy canadian drugs wholesale prices and save fda approved drugs superb quality drugs only accept all major credit cards order today from',
'buyviagra brandviagra femaleviagra from per pill viagranoprescription needed from certified canadian pharmacy buy here accept visa amex check worldwide delivery',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'home based business opportunity knocking your door dont rude and let this chance you can earn great income and find your financial life transformed learn more here your success work from home finder experts',
'codeine the most competitive price net codeine wilson codeine wilson freeviagra pills codeine wilson freeviagra pills codeine wilson freeviagra pills',
'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order save off quality watches',
'hydrocodone vicodin brand watson vicodin brand watson brand watson noprescription required free express fedex days delivery for over order major credit cards check',
'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order',
'percocet withoutprescription tabs percocet narcotic analgesic used treat moderate moderately severepain top quality express shipping safe discreet private buy cheap percocet online',
'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'experience with biggerpenis today grow inches more the safest most effective methods of_penisen1argement save your time and money bettererections with effective ma1eenhancement products ma1eenhancement supplement trusted millions buy today',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe the proven naturalpenisenhancement that works moneyback guaranteeed',
'percocet withoutprescription tabs percocet narcotic analgesic used treat moderate moderately severepain top quality express shipping safe discreet private buy cheap percocet online',
'codeine for visa only codeine methylmorphine narcotic opioid pain reliever have pills for for for visa only',
'oem adobe microsoft softwares fast order and download microsoft office professional plus microsoft windows ultimate adobe photoshop cs5 extended adobe acrobat pro extended windows professional thousand more titles',
'bargains here buy phentermin buy genuine phentermin low cost visa accepted',
'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'bargains here buy phentermin buy genuine phentermin low cost visa accepted']
In [58]:
count_vec = CountVectorizer()
count_vec.fit_transform(row_doc).toarray()
Out[58]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 2, 0, 0],
[0, 0, 1, ..., 0, 0, 0]], dtype=int64)
In [59]:
DataFrame(count_vec.fit_transform(row_doc).toarray(), columns=count_vec.get_feature_names())
Out[59]:
about | accept | accepted | access | acrobat | add | address | adobe | advocate | aged | … | yeah | year | yesterday | york | you | your | youre | yourpenis | zach | zolpidem | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 1 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 9 | 5 | 1 | 0 | 0 | 0 |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 |
23 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 5 | 3 | 0 | 0 | 0 | 0 |
24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
26 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
29 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
30 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 |
34 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
46 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
47 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
49 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
50 rows × 645 columns
In [60]:
len(text_dict['ham'])
Out[60]:
25
In [61]:
len(text_dict['spam'])
Out[61]:
25
In [62]:
text_dict.keys()
Out[62]:
dict_keys(['ham', 'spam'])
In [71]:
# normal mail is labeled ham, junk mail is labeled spam
dataSet_df = DataFrame(count_vec.fit_transform(row_doc).toarray(), columns=count_vec.get_feature_names()).join(DataFrame(['ham']*25+['spam']*25, columns=['label']))
dataSet_df
Out[71]:
about | accept | accepted | access | acrobat | add | address | adobe | advocate | aged | … | year | yesterday | york | you | your | youre | yourpenis | zach | zolpidem | label | |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ham |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ham |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ham |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ham |
5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ham |
6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | ham |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
10 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ham |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ham |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
18 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ham |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
21 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 9 | 5 | 1 | 0 | 0 | 0 | ham |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 | ham |
23 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | … | 0 | 0 | 0 | 5 | 3 | 0 | 0 | 0 | 0 | ham |
24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ham |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
26 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | spam |
29 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
30 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | spam |
34 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
35 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | spam |
36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
37 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | spam |
38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | spam |
40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | spam |
43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
45 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
46 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
47 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | spam |
49 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | spam |
50 rows × 646 columns
In [100]:
# generate a random permutation of the 50 row indices
indices = np.random.permutation(dataSet_df.shape[0])
# take 40 rows as the training set
x_train = dataSet_df.iloc[:, :-1].values[indices[:-10]]
y_train = dataSet_df.iloc[:, -1].values[indices[:-10]]
# take the remaining 10 rows as the test set
x_test = dataSet_df.iloc[:, :-1].values[indices[-10:]]
y_test = dataSet_df.iloc[:, -1].values[indices[-10:]]
In [101]:
from sklearn.naive_bayes import GaussianNB
gauss_NB = GaussianNB()
gauss_NB.fit(x_train, y_train)
Out[101]:
GaussianNB(priors=None)
In [102]:
gauss_NB.predict(x_test)
Out[102]:
array(['spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam',
'ham'],
dtype='<U4')
In [103]:
gauss_NB.predict_proba(x_train)
Out[103]:
array([[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 0., 1.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.],
[ 1., 0.],
[ 0., 1.],
[ 0., 1.],
[ 0., 1.],
[ 1., 0.]])
In [104]:
gauss_NB.score(x_test, y_test)
Out[104]:
0.90000000000000002
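GaussianNB models each word count as a continuous Gaussian feature; for count vectors like these, a multinomial model is usually the more natural fit. A minimal sketch of the comparison, reusing the same split (this cell is an addition, not part of the original notebook):

from sklearn.naive_bayes import MultinomialNB
multi_NB = MultinomialNB()  # applies add-one (Laplace) smoothing by default
multi_NB.fit(x_train, y_train)
print(multi_NB.score(x_test, y_test))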
In [105]:
test_text = []
for my_index in indices[-10:]:
test_text.append((row_doc[my_index],dataSet_df.iloc[:, -1].values[my_index]))
test_text
Out[105]:
[('get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order save off quality watches',
'spam'),
('you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'spam'),
('jay stepp commented your status jay wrote the reply this email comment this status see the comment thread follow the link below',
'ham'),
('experience with biggerpenis today grow inches more the safest most effective methods of_penisen1argement save your time and money bettererections with effective ma1eenhancement products ma1eenhancement supplement trusted millions buy today',
'spam'),
('benoit mandelbrot benoit mandelbrot wilmott team benoit mandelbrot the mathematician the father fractal mathematics and advocate more sophisticated modelling quantitative finance died october aged wilmott magazine has often featured mandelbrot his ideas and the work others inspired his fundamental insights you must logged view these articles from past issues wilmott magazine',
'ham'),
('that cold there going retirement party are the leaves changing color',
'ham'),
('you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
'spam'),
('peter with jose out town you want meet once while keep things going and some interesting stuff let know eugene',
'ham'),
('buy ambiem zolpidem pill pills pills pills pills pills', 'spam'),
('linkedin kerry haloney requested add you connection linkedin peter like add you professional network linkedin kerry haloney',
'ham')]