[Text Classification] Bag of Tricks for Efficient Text Classification

Date: 2023-01-16 18:00:51

Tags: nn Classification Efficient Text self n-gram embedding config out

·Reading summary:
This post covers the fastText model proposed in the paper.
·References:
[1] Bag of Tricks for Efficient Text Classification

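[0] Abstract

The paper proposes the fastText model, whose accuracy is close to that of deep learning baselines while being far faster to train and evaluate.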

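[1] Introduction

Deep learning models achieve very good performance in practice, but they tend to be relatively slow at both training and test time, which limits their use on very large datasets.

Linear classifiers are generally regarded as strong baselines for text classification. With the right features they often reach state-of-the-art performance, and they can scale to very large corpora.

The fastText model proposed in the paper shows that a linear model with a rank constraint and a fast loss approximation can be trained on a billion words within ten minutes while still achieving performance on par with the state of the art.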

[2] Model architecture

[Figure: fastText model architecture]

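The architecture is easier to explain from the code.

The input to the fastText model is the embedding vectors of the unigrams, bigrams, and trigrams. At each position the three embeddings are concatenated, and the result is averaged over the sequence to give the hidden-layer input, i.e. the document vector. Finally, fully connected (fc) layers perform the classification.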
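For reference, the paper writes this architecture as a linear model with two weight matrices, trained by minimizing the negative log-likelihood over the classes. In the expression below, $x_n$ is the normalized bag of n-gram features of the $n$-th document, $y_n$ is its label, $A$ is the embedding (lookup) matrix, $B$ is the classifier matrix, $f$ is the softmax function, and $N$ is the number of documents:

```latex
-\frac{1}{N} \sum_{n=1}^{N} y_n \log\bigl( f(B A x_n) \bigr)
```

The rank constraint mentioned in the introduction comes from the small hidden dimension shared by $A$ and $B$: the product $BA$ can have rank no larger than that dimension. Note that the PyTorch code in this post inserts an extra ReLU layer between the averaged vector and the output, so it is not a strictly linear model.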

A PyTorch version of fastText looks like this:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        # Unigram embedding: initialised from pretrained vectors when they are provided.
        if config.embedding_pretrained is not None:
            self.embedding = nn.Embedding.from_pretrained(config.embedding_pretrained, freeze=False)
        else:
            self.embedding = nn.Embedding(config.n_vocab, config.embed, padding_idx=config.n_vocab - 1)
        # Bigram and trigram embeddings are indexed by hash bucket and trained from scratch.
        self.embedding_ngram2 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.embedding_ngram3 = nn.Embedding(config.n_gram_vocab, config.embed)
        self.dropout = nn.Dropout(config.dropout)
        self.fc1 = nn.Linear(config.embed * 3, config.hidden_size)
        self.fc2 = nn.Linear(config.hidden_size, config.num_classes)

    def forward(self, x):
        # x is a tuple of index tensors: x[0] holds the word ids, x[2] the bigram
        # buckets and x[3] the trigram buckets (x[1] is not used here).
        out_word = self.embedding(x[0])
        out_bigram = self.embedding_ngram2(x[2])
        out_trigram = self.embedding_ngram3(x[3])
        # Concatenate the three embeddings along the feature dimension.
        out = torch.cat((out_word, out_bigram, out_trigram), -1)

        # Average over the sequence dimension to obtain the document vector.
        out = out.mean(dim=1)
        out = self.dropout(out)
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
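
The concatenate-then-average step in `forward` can be traced on a toy input. The sketch below uses plain Python lists instead of tensors and handles a single document rather than a batch; the function name `concat_and_average` and all the numbers are made up for illustration.

```python
def concat_and_average(word_vecs, bigram_vecs, trigram_vecs):
    """Concatenate the three embeddings per position, then average over positions."""
    # Per-position concatenation: the feature dimension becomes 3x the embedding size.
    per_position = [w + b + t for w, b, t in zip(word_vecs, bigram_vecs, trigram_vecs)]
    seq_len = len(per_position)
    # Column-wise mean over the sequence gives one vector for the whole document.
    return [sum(column) / seq_len for column in zip(*per_position)]


# Two positions, embedding size 2 for each of the three embedding tables.
word_vecs = [[1.0, 2.0], [3.0, 4.0]]
bigram_vecs = [[0.0, 0.0], [2.0, 2.0]]
trigram_vecs = [[1.0, 1.0], [1.0, 1.0]]

doc_vector = concat_and_average(word_vecs, bigram_vecs, trigram_vecs)
print(doc_vector)  # [2.0, 3.0, 1.0, 1.0, 1.0, 1.0]
```

The document vector has length 3 x 2 = 6 here, which is why `fc1` in the model takes `config.embed * 3` input features.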

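As the code shows, the unigram embedding can be initialized from pretrained word vectors, whereas the bigram and trigram embeddings have to be learned by the model itself.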

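However, as the corpus grows, the number of distinct bigrams and trigrams grows with it, so memory requirements keep increasing and model construction slows down considerably. We use the following remedies:

1. Store bigrams and trigrams via hashing
2. Switch from character-level to word-level granularity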
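
A rough back-of-the-envelope calculation shows why hashing helps. The sketch below compares the size of an embedding table with one row per possible bigram against a table with a fixed number of hash buckets. The vocabulary size, embedding dimension, bucket count, and the 4-bytes-per-value (float32) figure are all hypothetical numbers chosen for illustration.

```python
def embedding_table_bytes(num_rows, embed_dim, bytes_per_value=4):
    """Memory needed for a dense embedding table of float32 values."""
    return num_rows * embed_dim * bytes_per_value


vocab_size = 50_000   # hypothetical vocabulary size
embed_dim = 300       # hypothetical embedding dimension
buckets = 250_000     # hypothetical number of hash buckets

# Upper bound: one row for every possible ordered pair of vocabulary items.
explicit_bigram_bytes = embedding_table_bytes(vocab_size ** 2, embed_dim)
# With hashing, the table size is fixed by the bucket count, not by the corpus.
hashed_bigram_bytes = embedding_table_bytes(buckets, embed_dim)

print(explicit_bigram_bytes / 1e12, "TB")  # 3.0 TB
print(hashed_bigram_bytes / 1e6, "MB")     # 300.0 MB
```

A real corpus contains far fewer distinct bigrams than this upper bound, but the count still grows with the corpus, while the hashed table stays the same size.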

When building the dataset, we hash each bigram and trigram into a single index value, as follows:

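def biGramHash(sequence, t, buckets):
    # Bucket index for position t, computed from the previous token id
    # (0 is used when there is no previous token).
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets


def triGramHash(sequence, t, buckets):
    # Same idea, combining the two previous token ids.
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets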
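
To see how these hashes might be applied to a whole sample, here is a minimal sketch. It repeats the two hash functions so that it runs on its own. The helper `ngram_buckets`, the token ids, and the bucket count of 1000 are assumptions for illustration and do not come from the original post.

```python
def biGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets


def triGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets


def ngram_buckets(sequence, buckets):
    """Hypothetical helper: one bigram and one trigram bucket index per position."""
    bigram_ids = [biGramHash(sequence, t, buckets) for t in range(len(sequence))]
    trigram_ids = [triGramHash(sequence, t, buckets) for t in range(len(sequence))]
    return bigram_ids, trigram_ids


token_ids = [3, 7, 11]  # made-up token ids for a three-token sample
bigram_ids, trigram_ids = ngram_buckets(token_ids, buckets=1000)
print(bigram_ids)   # [0, 261, 609]
print(trigram_ids)  # [0, 261, 98]
```

Every index falls in the range [0, buckets), so the n-gram embedding tables need only `buckets` rows (presumably `config.n_gram_vocab` in the model code), at the cost of different n-grams occasionally colliding in the same bucket.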

From: https://blog.51cto.com/u_15942590/6010684
