
Kaggle NLP Getting Started: Disaster Tweet Classification (Natural Language Processing with Disaster Tweets)

Posted: 2024-04-11 19:13:59
Tags: Twitter, Natural, sentence, text, list, Kaggle, feature, words, word

This task is much like classifying news articles by their headlines. I used naive Bayes.
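As a rough illustration of the idea before the full script, the sketch below turns a few sentences into 0/1 word-presence vectors and fits `MultinomialNB` on them. The toy sentences, their labels, and the hand-picked `vocab` list are invented for illustration and are not taken from the competition data.

```python
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration (1 = disaster, 0 = not a disaster)
toy_sentences = [
    "earthquake destroyed many houses",
    "forest fire near the town",
    "flood warning houses evacuated",
    "i love this sunny day",
    "great movie last night",
    "my new phone is great",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Hypothetical hand-picked vocabulary; the full script uses the most frequent words instead
vocab = ["earthquake", "fire", "flood", "houses", "love", "great"]


def to_vector(sentence, vocab):
    """Return a 0/1 vector marking which vocabulary words occur in the sentence."""
    words = set(sentence.split())
    return [1 if w in words else 0 for w in vocab]


X = [to_vector(s, vocab) for s in toy_sentences]
clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
clf.fit(X, toy_labels)

# Classify an unseen sentence using the same vocabulary
new_vec = to_vector("big earthquake and houses collapsed", vocab)
toy_prediction = int(clf.predict([new_vec])[0])
print(new_vec, toy_prediction)
```

Words outside the vocabulary are simply ignored, so the choice of feature words determines what the classifier can see.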

# Import the required packages
import random
import sys

from sklearn import model_selection
from sklearn.naive_bayes import MultinomialNB
import joblib
import re,string
import pandas as pd
import numpy as np
def text_to_words(file_path):  # Split each tweet into a word list and collect the labels
    myTrain = pd.read_csv(file_path)
    sentences_arr = []
    lab_arr = list(myTrain.values[:, 4])  # column 4 of train.csv is `target`
    for i in range(len(myTrain.values)):
        sentence = myTrain.values[i, 3].split('\t')[-1].strip()  # column 3 is the tweet text
        # Replace runs of punctuation and whitespace with a single space (raw string avoids invalid-escape warnings)
        sentence = re.sub(r"[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】“”!,。?、~@#¥%……&*()《》:]+", " ", sentence)
        sentence = sentence.split(' ')
        sentences_arr.append(sentence)

    return sentences_arr, lab_arr
def load_stopwords(file_path):  # Load the stop-word list, one word per line
    with open(file_path, encoding='UTF-8') as f:  # the context manager closes the file
        stopwords = [line.strip() for line in f]  # strip() removes surrounding whitespace
    return stopwords
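def get_dict(sentences_arr, stopswords):  # Build the word-frequency dictionary
    stopswords = set(stopswords)  # set membership tests are faster than list lookups
    word_dic = {}
    for sentence in sentences_arr:
        for word in sentence:
            if word != ' ' and word.isalpha():  # isalpha() is True only for non-empty, all-letter strings
                if word not in stopswords:
                    word_dic[word] = word_dic.get(word, 0) + 1  # default 0 so the counts are exact
    word_dic = sorted(word_dic.items(), key=lambda x: x[1], reverse=True)  # sort by frequency, descending

    return word_dic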

def get_feature_words(word_dic, word_num):  # Use the word_num most frequent words as feature words
    # word_dic is a list of (word, count) pairs already sorted by count
    return [word for word, _ in word_dic[:word_num]]

# Text features
def get_text_features(train_data_list, test_data_list, feature_words):  # Convert training and test sentences into feature vectors based on the feature words
    def text_features(text, feature_words):
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]  # 0/1 presence vector
        return features
    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list


sentences_arr, lab_arr = text_to_words('../train.csv')  # tokenized tweets and their labels
print(sentences_arr[0])

stopwords = load_stopwords('../stopwords.txt')  # load the stop words
word_dic = get_dict(sentences_arr, stopwords)  # build the word-frequency dictionary
train_data_list, test_data_list, train_class_list, test_class_list = model_selection.train_test_split(sentences_arr, lab_arr, test_size=0.1)  # hold out 10% for validation
feature_words = get_feature_words(word_dic, 1000)  # build the feature-word list

train_feature_list, test_feature_list = get_text_features(train_data_list, test_data_list, feature_words)  # build the feature vectors
from sklearn.metrics import accuracy_score,classification_report

# Create the naive Bayes classifier
classifier = MultinomialNB(alpha=1.0,  # Laplace smoothing
                          fit_prior=True,  # whether to learn class prior probabilities
                          class_prior=None)

print(type(train_feature_list))
print(type(train_class_list))
classifier.fit(train_feature_list, train_class_list)  # train the model

predict = classifier.predict(test_feature_list)  # evaluate on the validation split
test_accuracy = accuracy_score(test_class_list, predict)  # argument order is (y_true, y_pred)
print("Accuracy (accuracy_score): %.4f" % (test_accuracy))
print("Classification report for classifier:\n", classification_report(test_class_list, predict))
joblib.dump(classifier, "NewsClassification.model")

myModel = joblib.load("NewsClassification.model")

def load_sentence(sentence):
    sentence = re.sub(r"[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】“”!,。?、~@#¥%……&*()《》:]+", " ", sentence)  # replace punctuation with spaces
    sentence = sentence.split(' ')
    return sentence




p_data = 'We had a big earthquake here and many houses collapsed'
sentence = load_sentence(p_data)
sentence= [sentence]
print('Tokens:', sentence)
p_words = get_text_features(sentence, sentence, feature_words)  # build the feature vector
res = myModel.predict(p_words[0])
print("Predicted class:", int(res[0]))  # predict() returns an array, so take its first element


cnt = 0
ids = []  # named `ids` to avoid shadowing the built-in `id`
target = []
myTest = pd.read_csv('../test.csv')
for i in range(len(myTest.values)):
    sentence = myTest.values[i, 3].split('\t')[-1].strip()  # column 3 of test.csv is the tweet text
    sentence = re.sub(r"[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】“”!,。?、~@#¥%……&*()《》:]+", " ", sentence)  # replace punctuation with spaces
    sentence = sentence.split(' ')
    sentence = [sentence]
    print('Tokens:', sentence)
    p_words = get_text_features(sentence, sentence, feature_words)  # build the feature vector
    res = myModel.predict(p_words[0])
    print("Predicted class:", int(res[0]))
    ids.append(myTest.values[i, 0])
    target.append(int(res[0]))  # predict() returns an array, so take its first element
    cnt = cnt + 1
    if cnt % 1000 == 0:
        print(cnt)  # progress indicator
myAns = pd.DataFrame({'id': ids, 'target': target})
myAns.to_csv("myAns.csv", index=False, sep=',')
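The loop above calls `predict` once per tweet. One possible alternative, which is not the method used in this post, is to let scikit-learn's `CountVectorizer` build the 0/1 word-presence features and to predict all rows in a single call. The sketch below uses invented toy strings in place of the `text` and `target` columns, and it relies on scikit-learn's built-in English stop-word list instead of `stopwords.txt`, so its vocabulary will not match the script's exactly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the `text` and `target` columns (invented for illustration)
train_texts = [
    "Earthquake destroyed many houses",
    "Forest fire spreading near the town",
    "Flood warning, houses evacuated",
    "I love this sunny day",
    "Great movie last night",
    "My new phone is great",
]
train_targets = [1, 1, 1, 0, 0, 0]
test_texts = [
    "Houses collapsed after the earthquake",
    "Great night at the movie",
]

# binary=True gives 0/1 presence features; max_features keeps the most frequent words
pipeline = make_pipeline(
    CountVectorizer(binary=True, stop_words="english", max_features=1000),
    MultinomialNB(alpha=1.0),
)
pipeline.fit(train_texts, train_targets)

# A single call classifies every row at once
batch_predictions = pipeline.predict(test_texts)
print(batch_predictions.tolist())
```

With the real data, `train_texts` and `test_texts` would be the `text` columns of the two CSV files, and the predictions could be written to the submission file in the same way as `myAns`.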

From: https://www.cnblogs.com/wljss/p/18129880
