[Natural Language Processing] Classification Code for Chinese Spam Email

Date: 2024-06-02 16:46:43
Tags: spam, Chinese, features, labels, label, train, test, corpus, natural language

The code is as follows:

"""
author: wangyilin
"""
import numpy as np
from sklearn.model_selection import train_test_split


def get_data():
    '''
    Load the data.
    :return: the text data and the corresponding labels (1 = ham, 0 = spam)
    '''
    with open("/home/bkrc/Desktop/6/data/ham_data.txt", encoding="utf8") as ham_f, \
            open("/home/bkrc/Desktop/6/data/spam_data.txt", encoding="utf8") as spam_f:
        ham_data = ham_f.readlines()
        spam_data = spam_f.readlines()

        ham_label = np.ones(len(ham_data)).tolist()
        spam_label = np.zeros(len(spam_data)).tolist()

        corpus = ham_data + spam_data

        labels = ham_label + spam_label

    return corpus, labels


def prepare_datasets(corpus, labels, test_data_proportion=0.3):
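    '''
    Split the data into training and test sets.
    :param corpus: text data
    :param labels: label data
    :param test_data_proportion: proportion of the data held out for testing
    :return: training data, test data, training labels, test labels
    '''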
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels,
                                                        test_size=test_data_proportion, random_state=42)
    return train_X, test_X, train_Y, test_Y


def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)

    return filtered_corpus, filtered_labels


from sklearn import metrics


def get_metrics(true_labels, predicted_labels):
    # Precision, recall, and F1 are averaged over both classes, weighted by support
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels,
                               predicted_labels),
        2))
    print('Precision:', np.round(
        metrics.precision_score(true_labels,
                                predicted_labels,
                                average='weighted'),
        2))
    print('Recall:', np.round(
        metrics.recall_score(true_labels,
                             predicted_labels,
                             average='weighted'),
        2))
    print('F1 score:', np.round(
        metrics.f1_score(true_labels,
                         predicted_labels,
                         average='weighted'),
        2))


def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions


def main():
    corpus, labels = get_data()  # load the dataset

    print("Total number of samples:", len(labels))

    corpus, labels = remove_empty_docs(corpus, labels)

    print('Sample document:', corpus[10])
    print('Label of the sample:', labels[10])
    label_name_map = ["spam", "ham"]  # index 0 = spam, index 1 = ham
    print('Actual types:', label_name_map[int(labels[10])], label_name_map[int(labels[5900])])

    # Split the data into training and test sets
    train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus,
                                                                            labels,
                                                                            test_data_proportion=0.3)

    from normalization import normalize_corpus
    # Normalize the text
    norm_train_corpus = normalize_corpus(train_corpus)
    norm_test_corpus = normalize_corpus(test_corpus)

    from feature_extractors import bow_extractor, tfidf_extractor
    import gensim
    import jieba

    # Bag-of-words features
    bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
    bow_test_features = bow_vectorizer.transform(norm_test_corpus)

    # TF-IDF features
    tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
    tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

    # tokenize documents
    tokenized_train = [jieba.lcut(text)
                       for text in norm_train_corpus]
    print(tokenized_train[2:10])
    tokenized_test = [jieba.lcut(text)
                      for text in norm_test_corpus]
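    # Build the word2vec model. gensim >= 4.0 names the dimensionality
    # parameter vector_size (it was size in 3.x). Note that the model is
    # trained here but not used by the classifiers below.
    model = gensim.models.Word2Vec(tokenized_train,
                                   vector_size=500,
                                   window=100,
                                   min_count=30,
                                   sample=1e-3)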

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    mnb = MultinomialNB()
    svm = SGDClassifier(loss='hinge', max_iter=100)
    lr = LogisticRegression()

    # Multinomial naive Bayes on bag-of-words features
    print("Naive Bayes classifier on bag-of-words features")
    mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Logistic regression on bag-of-words features
    print("Logistic regression on bag-of-words features")
    lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                      train_features=bow_train_features,
                                                      train_labels=train_labels,
                                                      test_features=bow_test_features,
                                                      test_labels=test_labels)

    # Support vector machine on bag-of-words features
    print("Support vector machine on bag-of-words features")
    svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)


    # Multinomial naive Bayes on TF-IDF features
    print("Naive Bayes model on TF-IDF features")
    mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)
    # Logistic regression on TF-IDF features
    print("Logistic regression model on TF-IDF features")
    lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                        train_features=tfidf_train_features,
                                                        train_labels=train_labels,
                                                        test_features=tfidf_test_features,
                                                        test_labels=test_labels)


    # Support vector machine on TF-IDF features
    print("Support vector machine model on TF-IDF features")
    svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)

    import re

    num = 0
    # Show up to four spam messages that the TF-IDF SVM classified correctly
    for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
        if label == 0 and predicted_label == 0:
            print('Email type:', label_name_map[int(label)])
            print('Predicted email type:', label_name_map[int(predicted_label)])
            print('Text:-')
            print(re.sub('\n', ' ', document))

            num += 1
            if num == 4:
                break

    num = 0
    # Show up to four ham messages that the TF-IDF SVM misclassified as spam
    for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
        if label == 1 and predicted_label == 0:
            print('Email type:', label_name_map[int(label)])
            print('Predicted email type:', label_name_map[int(predicted_label)])
            print('Text:-')
            print(re.sub('\n', ' ', document))

            num += 1
            if num == 4:
                break


if __name__ == "__main__":
    main()
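
The script imports `bow_extractor` and `tfidf_extractor` from a local `feature_extractors` module that is not included in the post. The sketch below is one possible version, inferred only from how the script calls them: each takes the normalized corpus and returns a fitted vectorizer together with the training feature matrix. It assumes each normalized document is a string of space-separated tokens. The `token_pattern` override is also an assumption: scikit-learn's default pattern drops single-character tokens, which are common in Chinese.

```python
# Hypothetical feature_extractors.py; the original module is not shown in the post.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumption: keep single-character tokens, which the default pattern would drop.
TOKEN_PATTERN = r"(?u)\b\w+\b"


def bow_extractor(corpus, ngram_range=(1, 1)):
    """Fit a bag-of-words vectorizer and return (vectorizer, count features)."""
    vectorizer = CountVectorizer(token_pattern=TOKEN_PATTERN,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


def tfidf_extractor(corpus, ngram_range=(1, 1)):
    """Fit a TF-IDF vectorizer and return (vectorizer, TF-IDF features)."""
    vectorizer = TfidfVectorizer(token_pattern=TOKEN_PATTERN,
                                 ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
```

Returning the fitted vectorizer lets the caller apply `transform` to the test set, so the test features use the vocabulary learned from the training set only.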

From: https://www.cnblogs.com/mllt/p/18227265/py_ai_NLP_zwljyjfl
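
The script also imports `normalize_corpus` from a local `normalization` module that the post does not include. The following is a minimal sketch of what such a function might do, not the author's implementation: segment each document with `jieba.lcut`, strip punctuation, drop stopwords, and join the remaining tokens with spaces. The stopword set, the character filter, and the `tokenizer` and `stopwords` parameters are all assumptions made for illustration.

```python
# Hypothetical normalization.py; the original module is not shown in the post.
import re

# Assumption: a tiny illustrative stopword set; a real module would load a fuller list.
STOPWORDS = frozenset({"的", "了", "是", "在", "和"})

# Assumption: keep only CJK characters, ASCII letters, and digits inside a token.
_NON_WORD = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")


def normalize_corpus(corpus, tokenizer=None, stopwords=STOPWORDS):
    """Return one string of space-separated tokens per input document."""
    if tokenizer is None:
        import jieba  # imported lazily so a custom tokenizer works without jieba
        tokenizer = jieba.lcut
    normalized = []
    for doc in corpus:
        # Strip punctuation and symbols from each token
        tokens = (_NON_WORD.sub("", tok) for tok in tokenizer(doc.strip()))
        # Drop tokens that became empty and tokens in the stopword set
        kept = [tok for tok in tokens if tok and tok not in stopwords]
        normalized.append(" ".join(kept))
    return normalized
```

Called as `normalize_corpus(train_corpus)`, it matches the way the script uses it. Passing a different `tokenizer` is only a convenience for trying the function without jieba installed.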
