
Spam detection with neural networks: LSTM or CNN (1-D convolution) both work well [note: the original code has issues]

Posted: 2023-05-31 15:05:01 · Views: 41

 

from sklearn.feature_extraction.text import CountVectorizer
import os
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfTransformer
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical, pad_sequences
from sklearn.neural_network import MLPClassifier


max_features=500
max_document_length=1024



def load_one_file(filename):
    # Concatenate a mail file into one string; Enron mails are not all
    # valid UTF-8, so decode permissively.
    x = ""
    with open(filename, encoding="latin-1") as f:
        for line in f:
            x += line.strip("\r\n")
    return x

def load_files_from_dir(rootdir):
    # Collect the text of every regular file directly under rootdir.
    x = []
    for name in os.listdir(rootdir):
        path = os.path.join(rootdir, name)
        if os.path.isfile(path):
            x.append(load_one_file(path))
    return x

def load_all_files():
    # The Enron spam corpus ships as enron1..enron6; this listing uses the
    # first four subsets.
    ham = []
    spam = []
    for i in range(1, 5):
        path = "../data/mail/enron%d/ham/" % i
        print("Load %s" % path)
        ham += load_files_from_dir(path)
        path = "../data/mail/enron%d/spam/" % i
        print("Load %s" % path)
        spam += load_files_from_dir(path)
    return ham, spam

def get_features_by_wordbag():
    ham, spam=load_all_files()
    x=ham+spam
    y=[0]*len(ham)+[1]*len(spam)
    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    print(vectorizer)
    x=vectorizer.fit_transform(x)
    x=x.toarray()
    return x,y

def show_different_max_features():
    # Sweep max_features and plot Naive Bayes accuracy for each vocabulary size.
    global max_features
    a = []
    b = []
    for i in range(1000, 20000, 2000):
        max_features = i
        print("max_features=%d" % i)
        x, y = get_features_by_wordbag()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
        gnb = GaussianNB()
        gnb.fit(x_train, y_train)
        y_pred = gnb.predict(x_test)
        score = metrics.accuracy_score(y_test, y_pred)
        a.append(max_features)
        b.append(score)
    plt.plot(a, b, 'r')
    plt.xlabel("max_features")
    plt.ylabel("metrics.accuracy_score")
    plt.title("metrics.accuracy_score VS max_features")
    plt.show()

def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print("NB and wordbag")
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    y_pred = gnb.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print("SVM and wordbag")
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))

def get_features_by_wordbag_tfidf():
    ham, spam=load_all_files()
    x=ham+spam
    y=[0]*len(ham)+[1]*len(spam)
    vectorizer = CountVectorizer(binary=True,
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    print(vectorizer)
    x=vectorizer.fit_transform(x)
    x=x.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    print(transformer)
    tfidf = transformer.fit_transform(x)
    x = tfidf.toarray()
    return  x,y


def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print "CNN and tf"

    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Building convolutional network
    network = input_data(shape=[None, max_document_length], name='input')
    # input_dim caps the vocabulary the embedding can index; it must be at
    # least as large as the vocabulary built by get_features_by_tf().
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100,run_id="spam")

def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print "RNN and wordbag"

    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10,run_id="spm-run",n_epoch=5)


def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print("DNN and wordbag")

    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes=(5, 2),
                        random_state=1)
    print(clf)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test, y_pred))



def get_features_by_tf():
    # Turn each mail into a fixed-length sequence of word IDs for the
    # embedding-based CNN/RNN models.
    global max_document_length
    ham, spam = load_all_files()
    x = ham + spam
    y = [0] * len(ham) + [1] * len(spam)
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x = vp.fit_transform(x, unused_y=None)
    x = np.array(list(x))
    return x, y



if __name__ == "__main__":
    print("Hello spam-mail")
    #print("get_features_by_wordbag")
    #x,y=get_features_by_wordbag()
    #x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)

    #print("get_features_by_wordbag_tfidf")
    #x,y=get_features_by_wordbag_tfidf()
    #x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)
    #NB
    #do_nb_wordbag(x_train, x_test, y_train, y_test)
    #show_different_max_features()

    #SVM
    #do_svm_wordbag(x_train, x_test, y_train, y_test)

    #DNN
    #do_dnn_wordbag(x_train, x_test, y_train, y_test)

    print("get_features_by_tf")
    # The CNN/RNN models embed word IDs, so they need the sequence features
    # from get_features_by_tf(), not bag-of-words counts.
    x, y = get_features_by_tf()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
    #CNN
    do_cnn_wordbag(x_train, x_test, y_train, y_test)

    #RNN
    #do_rnn_wordbag(x_train, x_test, y_train, y_test)

When you write your own detector, remember to compare several algorithms against each other as well.
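As a minimal sketch of such a comparison (reusing the get_features_by_wordbag() features and the classifiers imported above; compare_models is a hypothetical helper, not part of the original listing):

from sklearn.model_selection import cross_val_score

def compare_models(x, y):
    # Hypothetical helper: 5-fold cross-validated accuracy for each baseline,
    # all scored on the same feature matrix.
    models = {
        "NB": GaussianNB(),
        "SVM": svm.SVC(),
        "MLP": MLPClassifier(solver='lbfgs', alpha=1e-5,
                             hidden_layer_sizes=(5, 2), random_state=1),
    }
    for name, clf in models.items():
        scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
        print("%s: mean accuracy %.4f (std %.4f)" % (name, scores.mean(), scores.std()))

# Usage:
# x, y = get_features_by_wordbag()
# compare_models(x, y)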

Tags: features,max,print,test,train,wordbag,pass,CNN,LSTM
From: https://blog.51cto.com/u_11908275/6386902
