
"Machine Learning in Action" Study Notes (3): Naive Bayes



1 Description of the Naive Bayes Algorithm

How it works:

For a given item to classify, compute the probability of each class conditioned on this item's occurrence, and assign the item to the class whose probability is largest.
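A minimal sketch of this decision rule in Python (the likelihood and prior values below are made-up numbers, purely for illustration):

# Hypothetical values of P(x|c) and P(c) for two classes (made-up numbers)
likelihood = {'spam': 0.02, 'ham': 0.001}   # P(x | c)
prior = {'spam': 0.5, 'ham': 0.5}           # P(c)

# Pick the class that maximizes P(x|c) * P(c); the evidence P(x) is the same
# for every class, so it can be dropped from the comparison
best = max(likelihood, key=lambda c: likelihood[c] * prior[c])
print(best)  # -> spam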


2 Pseudocode for Computing the Probabilities

Count the number of documents in each class
For each training document:
    For each class:
        If a token appears in the document:
            increase the count for that token
        increase the count of total tokens
For each class:
    For each token:
        divide the token's count by the total token count to get its conditional probability
return the conditional probability for each class

3 Wiki: Bayes' Theorem, the "Naive" Assumption, and Smoothing

(1) Bayes' theorem



Bayes' theorem is a theorem about the conditional probabilities of the random events A and B:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the probability that A occurs given that B has occurred.

Each term in Bayes' theorem has a conventional name:

  • P(A|B) is the conditional probability of A given that B has occurred; because it is derived from the value of B, it is called the posterior probability of A.
  • P(B|A) is the conditional probability of B given that A has occurred; because it is derived from the value of A, it is called the posterior probability of B (in Bayesian inference this factor is usually called the likelihood).
  • P(A) is the prior probability (or marginal probability) of A. It is called "prior" because it does not take any information about B into account.
  • P(B) is the prior or marginal probability of B.

(2) Naive

The features in the dataset are assumed to be mutually independent: the probability that a feature (a word) appears has nothing to do with which other words it is adjacent to.
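Concretely, for a document made of the words w1, ..., wn and a class c, this independence assumption lets the class-conditional probability factor into a product of per-word probabilities:

P(w1, w2, ..., wn | c) = P(w1|c) · P(w2|c) · ... · P(wn|c)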

(3) Smoothing

When a naive Bayes classifier is used to classify documents, many probabilities are multiplied together to obtain the probability that a document belongs to a class. If any one of those probabilities is 0, the whole product becomes 0. To avoid zero probabilities, we can add a smoothing term.
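A common fix is Laplace (add-one) smoothing: with V denoting the vocabulary, the conditional probability of word wi in class c is estimated as

P(wi | c) = (count(wi, c) + 1) / (Σw count(w, c) + |V|)

The book's code below uses a simplified variant of this idea: every word count starts at 1 and every denominator starts at 2.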



4 Strengths and Weaknesses of Naive Bayes

(1) Strengths

Remains effective even when data is scarce, and can handle problems with more than two classes.

(2) Weaknesses

Sensitive to how the input data is prepared.


5 Python Implementation

(1) Functions for converting a word list into a vector

import numpy as np  # needed by trainNB0 and classifyNB below

def loadDataSet():
    """
    Create some sample data for experiments.
    """
    # The collection of documents, already split into tokens
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]

    # The class labels
    # 0 marks a normal document, 1 marks an abusive document
    # The labels are used to train the program so that it can detect abusive posts automatically
    classVec = [0, 1, 0, 1, 0, 1]

    return postingList, classVec

# listOPosts, listClasses = loadDataSet()

def createVocabList(dataSet):
    """
    Create a list of all the unique words that appear in any of the documents.
    """
    # Create an empty set
    vocabSet = set([])

    # Build up the union of all the sets
    for document in dataSet:

        # Add the set of new words from each document to the running set
        # The | operator takes the union of two sets
        vocabSet = vocabSet | set(document)

    return list(vocabSet)

# myVocabList = createVocabList(listOPosts)
# print(myVocabList)
#
# index_stupid = myVocabList.index('stupid')
# print(index_stupid)

# Set-of-words model
def setOfWord2Vec(vocabList, inputSet):
    """
    :param vocabList: the vocabulary
    :param inputSet: a document
    :return: the document vector; each element is 1 or 0, indicating whether the
             corresponding vocabulary word appears in the input document
    """

    # Create a vector of zeros with the same length as the vocabulary
    returnVec = [0]*len(vocabList)

    # Iterate over every word in the document
    for word in inputSet:

        # If the word is in the vocabulary
        if word in vocabList:

            # set the corresponding entry of the output vector to 1
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word %s is not in my Vocabulary!' % word)
    return returnVec

# result_1 = setOfWord2Vec(myVocabList, listOPosts[0])
# result_2 = setOfWord2Vec(myVocabList, listOPosts[3])
# print(result_1, result_2)

(2) The naive Bayes classifier training function

def trainNB0(trainMatrix, trainCatagory):
    """
    :param trainMatrix: the matrix of training-document vectors
    :param trainCatagory: the labels of the training documents
    :return: p0Vect, p1Vect, pAbusive (see the note at the end of the function)
    """
    # Total number of training documents
    numTrainDocs = len(trainMatrix)

    # Length of the vocabulary (number of columns)
    numWords = len(trainMatrix[0])

    # Probability that an arbitrary document is abusive
    pAbusive = sum(trainCatagory)/float(numTrainDocs)

    # # Vectors of vocabulary length, filled with zeros
    # p0Num = np.zeros(numWords)
    # p1Num = np.zeros(numWords)
    #
    # # denom: the denominator terms
    # p0Denom = 0.0
    # p1Denom = 0.0

    # If any single probability is 0, the final product is also 0.
    # To reduce this effect, initialize every word count to 1 and each denominator to 2.
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0

    # Iterate over every document in the training set
    for i in range(numTrainDocs):
        # If this document is labeled abusive
        if trainCatagory[i] == 1:
            # Element-wise sum of document vectors; each entry of p1Num ends up being
            # the total number of times that word appears in abusive documents (a row vector)
            p1Num += trainMatrix[i]
            # Sum of one document's entries; p1Denom ends up being the total number
            # of word occurrences across all abusive documents (a scalar)
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])

    # # Vectors of per-word occurrence frequencies
    # p1Vect = p1Num/p1Denom
    # p0Vect = p0Num/p0Denom

    # Multiplying many very small numbers together causes underflow.
    # The fix is to take the natural log of the product; logs avoid underflow
    # and errors from floating-point rounding.
    # Working with natural logs loses nothing for classification purposes.
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)

    '''
    p0Vect: probability of each word, given a non-abusive document
    p1Vect: probability of each word, given an abusive document
    pAbusive: probability that an arbitrary document is abusive
    '''
    return p0Vect, p1Vect, pAbusive

# trainMat = []
# for postinDoc in listOPosts:
#     trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
# print('------------------------')
# print(trainMat)


# p0v, p1v, pAb = trainNB0(trainMat, listClasses)

# print(p0v)
# print(p0v[index_stupid])
# print('------------------------')
#
# print(p1v)
# print(p1v[index_stupid])
# print('------------------------')
#
# print(pAb)
# print('------------------------')

(3) The naive Bayes classification function (Bernoulli naive Bayes)

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    :param vec2Classify: the word vector to classify
    :param p0Vec: log-probability vector for class 0 (non-abusive), from trainNB0
    :param p1Vec: log-probability vector for class 1 (abusive), from trainNB0
    :param pClass1: probability that an arbitrary document belongs to class 1
    :return: the predicted class, 1 or 0
    """
    p1 = sum(vec2Classify*p1Vec) + np.log(pClass1)
    p0 = sum(vec2Classify*p0Vec) + np.log(1.0-pClass1)
    # print(p1, p0)
    if p1 > p0:
        return 1
    else:
        return 0
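classifyNB works in log space: since log(a·b) = log a + log b, the product P(c) · Π P(wi|c) becomes the sum log P(c) + Σ log P(wi|c), which is exactly what the two sum(...) + np.log(...) lines compute (the element-wise product with vec2Classify zeroes out the words that are absent from the document). Because log is monotonically increasing, comparing the two sums picks the same winner as comparing the original products, without the underflow problem.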

def testingNB():

    # Training part
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWord2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))

    # Testing part

    # Input test document
    testEntry = ['love', 'my', 'dalmation']

    # Convert the test document into a vector using the vocabulary
    thisDoc = np.array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))

    testEntry = ['stupid', 'garbage']
    thisDoc = np.array(setOfWord2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))


# testingNB()

(4) The naive Bayes bag-of-words model

def bagOfWords2VecMN(vocabList, inputSet):
    # Create a vector of zeros with the same length as the vocabulary
    returnVec = [0] * len(vocabList)

    # Iterate over every word in the document
    for word in inputSet:

        # If the word is in the vocabulary
        if word in vocabList:

            # increment the corresponding entry of the output vector
            returnVec[vocabList.index(word)] += 1
    return returnVec
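A quick illustration of the difference between the set-of-words and bag-of-words models, assuming the toy data from loadDataSet above (the document here is hypothetical and repeats 'my'):

listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)

doc = ['my', 'dog', 'ate', 'my', 'steak']
print(setOfWord2Vec(myVocabList, doc))     # the entry for 'my' is 1 (present or not)
print(bagOfWords2VecMN(myVocabList, doc))  # the entry for 'my' is 2 (number of occurrences)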

6 Example: Using Naive Bayes to Filter Spam

from bayes import createVocabList, setOfWord2Vec, trainNB0, classifyNB
import random
import numpy as np

def textParse(bigString):
    """
    Take a long string bigString and parse it into a list of the words
    longer than 2 characters.
    :param bigString: a long string
    :return: a list of words
    """
    import re

    # Split on any character that is not a word character ([a-zA-Z0-9_])
    listOfTokens = re.split(r'\W+', bigString)

    # Lower-case the words longer than 2 characters and collect them in a list
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
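For example, on the kind of sentence that appears in these emails:

print(textParse('This book is the best book on Python or M.L. I have ever laid eyes upon.'))
# -> ['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']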

def spamTest():
    """
    Automated testing of the naive Bayes spam classifier.
    :return:
    """
    # Initialize the document list; every element is one document (a list of words)
    docList = []

    # Initialize the class list; classList lines up one-to-one with docList,
    # giving each document's class
    classList = []

    # Initialize the full-text list; every element is a single word
    fullText = []

    # Iterate over the txt files in the spam and ham directories
    for i in range(1, 26):
        # Open one text file in the directory and parse it into a document
        wordList = textParse(open('email/spam/%d.txt' % i).read())

        # Append the document to docList
        docList.append(wordList)

        # Extend fullText with the document's words
        fullText.extend(wordList)

        # Record the document's class in classList
        classList.append(1)

        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)

    # Build the vocabulary from docList with createVocabList
    vocabList = createVocabList(docList)

    # Initialize the training set: a list of the indices 0..49
    trainingSet = list(range(50))

    # Initialize the test set as empty
    testSet = []

    # Repeat 10 times
    for i in range(10):
        # Pick a random integer between 0 and the current training-set length
        # as the random index randIndex
        randIndex = int(random.uniform(0, len(trainingSet)))

        # Append the element at that index of the training set to the test set
        testSet.append(trainingSet[randIndex])

        # and delete it from the training set
        del(trainingSet[randIndex])

    # Initialize the training matrix
    trainMat = []

    # Initialize the training-class list
    trainClasses = []

    # Iterate over every index docIndex in the training set
    for docIndex in trainingSet:
        # Append the document's word vector to the training matrix trainMat
        trainMat.append(setOfWord2Vec(vocabList, docList[docIndex]))
        # Append the document's class to trainClasses
        trainClasses.append(classList[docIndex])

    # Call trainNB0 on trainMat and trainClasses to compute p0V, p1V, pSpam
    p0V, p1V, pSpam = trainNB0(np.array(trainMat), np.array(trainClasses))

    # Initialize the error counter
    errorCount = 0

    # Iterate over every index docIndex in the test set
    for docIndex in testSet:
        # Build the word vector
        wordVector = setOfWord2Vec(vocabList, docList[docIndex])

        # If the predicted class differs from the actual class
        if classifyNB(np.array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            # increment the error count
            errorCount += 1

    # Print the error rate
    print('the error rate is:', float(errorCount)/len(testSet))

'''
The process of randomly selecting one part of the data as the training set, with the
remainder serving as the test set, is called hold-out cross-validation.
'''
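Because the split is random, the error rate of a single run is noisy; a more stable estimate averages the error over several runs. A minimal sketch, assuming spamTest() is modified to return its error rate instead of printing it:

# Hypothetical wrapper: assumes spamTest() returns the error rate
errors = [spamTest() for _ in range(10)]
print('mean error rate:', sum(errors) / len(errors))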

# spamTest()

7 Reproducing the Book's Example with pandas and scikit-learn


In [56]:

import os, sys
import re
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
from sklearn.feature_extraction.text import CountVectorizer






In [57]:


text_dict = {}
for label in os.listdir('./email/'):
#     print(label)
    text_dict[label] = []
    for filename in os.listdir('./email/{}'.format(label)):
#         print(filename)
        read_file = open('./email/{}/{}'.format(label, filename), 'r', encoding='UTF-8', errors='ignore').read()
        text_dict[label].append(read_file)
#     print(text_dict)
for text_dict_key in text_dict.keys():
    text_dict[text_dict_key] = list(map(lambda x: ' '.join([word for word in [piece.lower() for piece in re.split(r'\W+', x) if len(piece) > 2]
                                                            if re.findall(r'^[A-Za-z]', word)]), text_dict[text_dict_key]))
# text_dict
row_doc = text_dict['ham'] + text_dict['spam']
row_doc




Out[57]:



['peter with jose out town you want meet once while keep things going and some interesting stuff let know eugene',
 'ryan whybrew commented your status ryan wrote turd ferguson butt horn',
 'arvind thirumalai commented your status arvind wrote you know reply this email comment this status',
 'thanks peter definitely check this how your book going heard chapter came and was good shape hope you are doing well cheers troy',
 'jay stepp commented your status jay wrote the reply this email comment this status see the comment thread follow the link below',
 'linkedin kerry haloney requested add you connection linkedin peter like add you professional network linkedin kerry haloney',
 'peter the hotels are the ones that rent out the tent they are all lined the hotel grounds much for being one with nature more like being one with couple dozen tour groups and nature have about pictures from that trip can through them and get you jpgs favorite scenic pictures where are you and jocelyn now new york will you come tokyo for chinese new year perhaps see the two you then will thailand for winter holiday see mom take care',
 'yeah ready may not here because jar jar has plane tickets germany for',
 'benoit mandelbrot benoit mandelbrot wilmott team benoit mandelbrot the mathematician the father fractal mathematics and advocate more sophisticated modelling quantitative finance died october aged wilmott magazine has often featured mandelbrot his ideas and the work others inspired his fundamental insights you must logged view these articles from past issues wilmott magazine',
 'peter sure thing sounds good let know what time would good for you will come prepared with some ideas and can from there regards vivek',
 'linkedin julius requested add you connection linkedin peter looking forward the book accept view invitation from julius',
 'yay you both doing fine working mba design strategy cca top art school new program focusing more right brained creative and strategic approach management the way done today',
 'thought about this and think possible should get another lunch have car now and could come pick you this time does this wednesday work can have signed copy you book',
 'saw this the way the coast thought might like hangzhou huge one day wasn enough but got glimpse went inside the china pavilion expo pretty interesting each province has exhibit',
 'hommies just got phone call from the roofer they will come and spaying the foaming today will dusty pls close all the doors and windows could you help close bathroom window cat window and the sliding door behind the don know how can those cats survive sorry for any inconvenience',
 'scifinance now automatically generates gpu enabled pricing risk model source code that runs faster than serial code using new nvidia fermi class tesla series gpu scifinance derivatives pricing and risk model development tool that automatically generates and gpu enabled source code from concise high level model specifications parallel computing cuda programming expertise required scifinance automatic gpu enabled monte carlo pricing model source code generation capabilities have been significantly extended the latest release this includes',
 'will there the latest',
 'that cold there going retirement party are the leaves changing color',
 'what going there talked john email talked about some computer stuff that went bike riding the rain was not that cold went the museum yesterday was get and they had free food the same time was giants game when got done had take the train with all the giants fans they are drunk',
 'been working running website using jquery and the jqplot plugin not too far away from having prototype launch you used jqplot right not think you would like',
 'there was guy the gas station who told that knew mandarin and python could get job with the fbi',
 'hello since you are owner least one google groups group that uses the customized welcome message pages files are writing inform you that will longer supporting these features starting february made this decision that can focus improving the core functionalities google groups mailing lists and forum discussions instead these features encourage you use products that are designed specifically for file storage and page creation such google docs and google sites for example you can easily create your pages google sites and share the site http www google com support sites bin answer answer with the members your group you can also store your files the site attaching files pages http www google com support sites bin answer answer the site youre just looking for place upload your files that your group members can download them suggest you try google docs you can upload files http docs google com support bin answer answer and share access with either group http docs google com support bin answer answer individual http docs google com support bin answer answer assigning either edit download only access the files you have received this mandatory email service announcement update you about important changes google groups',
 'zach hamm commented your status zach wrote doggy style enough said thank you good night',
 'this mail was sent from notification only address that cannot accept incoming mail please not reply this message thank you for your online reservation the store you selected has located the item you requested and has placed hold your name please note that all items are held for day please note store prices may differ from those online you have questions need assistance with your reservation please contact the store the phone number listed below you can also access store information such store hours and location the web http www borders com online store storedetailview_98',
 'peter these are the only good scenic ones and too bad there was girl back one them just try enjoy the blue sky',
 'codeine for visa only codeine methylmorphine narcotic opioid pain reliever have pills for for for visa only',
 'ordercializviagra online save pharmacy noprescription required buy canadian drugs wholesale prices and save fda approved drugs superb quality drugs only accept all major credit cards',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe the proven naturalpenisenhancement that works moneyback guaranteeed',
 'buy ambiem zolpidem pill pills pills pills pills pills',
 'ordercializviagra online save pharmacy noprescription required buy canadian drugs wholesale prices and save fda approved drugs superb quality drugs only accept all major credit cards order today from',
 'buyviagra brandviagra femaleviagra from per pill viagranoprescription needed from certified canadian pharmacy buy here accept visa amex check worldwide delivery',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
 'home based business opportunity knocking your door dont rude and let this chance you can earn great income and find your financial life transformed learn more here your success work from home finder experts',
 'codeine the most competitive price net codeine wilson codeine wilson freeviagra pills codeine wilson freeviagra pills codeine wilson freeviagra pills',
 'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order save off quality watches',
 'hydrocodone vicodin brand watson vicodin brand watson brand watson noprescription required free express fedex days delivery for over order major credit cards check',
 'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order',
 'percocet withoutprescription tabs percocet narcotic analgesic used treat moderate moderately severepain top quality express shipping safe discreet private buy cheap percocet online',
 'get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
 'experience with biggerpenis today grow inches more the safest most effective methods of_penisen1argement save your time and money bettererections with effective ma1eenhancement products ma1eenhancement supplement trusted millions buy today',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe the proven naturalpenisenhancement that works moneyback guaranteeed',
 'percocet withoutprescription tabs percocet narcotic analgesic used treat moderate moderately severepain top quality express shipping safe discreet private buy cheap percocet online',
 'codeine for visa only codeine methylmorphine narcotic opioid pain reliever have pills for for for visa only',
 'oem adobe microsoft softwares fast order and download microsoft office professional plus microsoft windows ultimate adobe photoshop cs5 extended adobe acrobat pro extended windows professional thousand more titles',
 'bargains here buy phentermin buy genuine phentermin low cost visa accepted',
 'you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
 'bargains here buy phentermin buy genuine phentermin low cost visa accepted']




In [58]:

count_vec = CountVectorizer()
count_vec.fit_transform(row_doc).toarray()





Out[58]:



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 2, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int64)
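CountVectorizer is doing the same job as the hand-written bagOfWords2VecMN above: each row of the array is one document, each column one vocabulary word, and each entry the number of times that word occurs in the document.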



In [59]:

DataFrame(count_vec.fit_transform(row_doc).toarray(), columns=count_vec.get_feature_names())





Out[59]:



    about  accept  accepted  access  acrobat  add  address  adobe  advocate  aged ... yeah  year  yesterday  york  you  your  youre  yourpenis  zach  zolpidem
0       0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    1     0      0          0     0         0
1       0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    0     1      0          0     0         0
2       0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    1     1      0          0     0         0
3       0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    1     1      0          0     0         0
4       0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    0     1      0          0     0         0
..    ...     ...       ...     ...      ...  ...      ...    ...       ...   ... ...  ...   ...        ...   ...  ...   ...    ...        ...   ...       ...
45      0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    0     0      0          0     0         0
46      0       0         0       0        1    0        0      3         0     0 ...    0     0          0     0    0     0      0          0     0         0
47      0       0         1       0        0    0        0      0         0     0 ...    0     0          0     0    0     0      0          0     0         0
48      0       0         0       0        0    0        0      0         0     0 ...    0     0          0     0    1     0      0          2     0         0
49      0       0         1       0        0    0        0      0         0     0 ...    0     0          0     0    0     0      0          0     0         0

50 rows × 645 columns



In [60]:


len(text_dict['ham'])



Out[60]:



25

In [61]:

len(text_dict['spam'])






Out[61]:



25





In [62]:

text_dict.keys()









Out[62]:



dict_keys(['ham', 'spam'])




In [71]:

# Normal mail is labeled ham, junk mail is labeled spam
dataSet_df = DataFrame(count_vec.fit_transform(row_doc).toarray(), columns=count_vec.get_feature_names()).join(DataFrame(['ham']*25+['spam']*25, columns=['label']))
dataSet_df





Out[71]:



    about  accept  accepted  access  acrobat  add  address  adobe  advocate  aged ... year  yesterday  york  you  your  youre  yourpenis  zach  zolpidem  label
0       0       0         0       0        0    0        0      0         0     0 ...    0          0     0    1     0      0          0     0         0    ham
1       0       0         0       0        0    0        0      0         0     0 ...    0          0     0    0     1      0          0     0         0    ham
2       0       0         0       0        0    0        0      0         0     0 ...    0          0     0    1     1      0          0     0         0    ham
3       0       0         0       0        0    0        0      0         0     0 ...    0          0     0    1     1      0          0     0         0    ham
4       0       0         0       0        0    0        0      0         0     0 ...    0          0     0    0     1      0          0     0         0    ham
..    ...     ...       ...     ...      ...  ...      ...    ...       ...   ... ...  ...        ...   ...  ...   ...    ...        ...   ...       ...    ...
45      0       0         0       0        0    0        0      0         0     0 ...    0          0     0    0     0      0          0     0         0   spam
46      0       0         0       0        1    0        0      3         0     0 ...    0          0     0    0     0      0          0     0         0   spam
47      0       0         1       0        0    0        0      0         0     0 ...    0          0     0    0     0      0          0     0         0   spam
48      0       0         0       0        0    0        0      0         0     0 ...    0          0     0    1     0      0          2     0         0   spam
49      0       0         1       0        0    0        0      0         0     0 ...    0          0     0    0     0      0          0     0         0   spam

50 rows × 646 columns





In [100]:


# Generate a random permutation of the 50 row indices
indices = np.random.permutation(dataSet_df.shape[0])
# Take 40 rows as the training set
x_train = dataSet_df.iloc[:, :-1].values[indices[:-10]]
y_train = dataSet_df.iloc[:, -1].values[indices[:-10]]

# Take the remaining 10 rows as the test set
x_test = dataSet_df.iloc[:, :-1].values[indices[-10:]]
y_test = dataSet_df.iloc[:, -1].values[indices[-10:]]






In [101]:

from sklearn.naive_bayes import GaussianNB
gauss_NB = GaussianNB()
gauss_NB.fit(x_train, y_train)


Out[101]:



GaussianNB(priors=None)




In [102]:

gauss_NB.predict(x_test)





Out[102]:



array(['spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam',
       'ham'], 
      dtype='<U4')




In [103]:



gauss_NB.predict_proba(x_train)



Out[103]:



array([[ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.]])




In [104]:

gauss_NB.score(x_test, y_test)


Out[104]:


0.90000000000000002
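GaussianNB models every feature as a continuous Gaussian, which is a questionable fit for word counts. scikit-learn's MultinomialNB is the classical choice for count data, and its alpha parameter is exactly the Laplace smoothing discussed in section 3. A minimal sketch, assuming the same x_train/y_train/x_test/y_test split as above:

from sklearn.naive_bayes import MultinomialNB

multi_NB = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing
multi_NB.fit(x_train, y_train)
print(multi_NB.score(x_test, y_test))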



In [105]:

test_text = []
for my_index in indices[-10:]:
    test_text.append((row_doc[my_index], dataSet_df.iloc[:, -1].values[my_index]))
test_text



Out[105]:


[('get off online watchesstore discount watches for all famous brands watches arolexbvlgari dior hermes oris cartier and more brands louis vuitton bags wallets gucci bags tiffany jewerly enjoy full year warranty shipment via reputable courier fedex ups dhl and ems speedpost you will recieve your order save off quality watches',
  'spam'),
 ('you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
  'spam'),
 ('jay stepp commented your status jay wrote the reply this email comment this status see the comment thread follow the link below',
  'ham'),
 ('experience with biggerpenis today grow inches more the safest most effective methods of_penisen1argement save your time and money bettererections with effective ma1eenhancement products ma1eenhancement supplement trusted millions buy today',
  'spam'),
 ('benoit mandelbrot benoit mandelbrot wilmott team benoit mandelbrot the mathematician the father fractal mathematics and advocate more sophisticated modelling quantitative finance died october aged wilmott magazine has often featured mandelbrot his ideas and the work others inspired his fundamental insights you must logged view these articles from past issues wilmott magazine',
  'ham'),
 ('that cold there going retirement party are the leaves changing color',
  'ham'),
 ('you have everything gain incredib1e gains length inches yourpenis permanantly amazing increase thickness yourpenis betterejacu1ation control experience rock harderecetions explosive intenseorgasns increase volume ofejacu1ate doctor designed and endorsed herbal natural safe',
  'spam'),
 ('peter with jose out town you want meet once while keep things going and some interesting stuff let know eugene',
  'ham'),
 ('buy ambiem zolpidem pill pills pills pills pills pills', 'spam'),
 ('linkedin kerry haloney requested add you connection linkedin peter like add you professional network linkedin kerry haloney',
  'ham')]






