[Multi-label Text Classification] Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution

Date: 2023-01-16 18:00:13
Tags: Loss, Multi, Classification, Methods, Label, Sample, Distribution


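· Reading summary:
  This paper reads more like a survey of loss functions for multi-label text classification; the loss functions it covers (including those designed for the long-tail problem) were all proposed in earlier work.
· References: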
  [1] Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution

[1] Loss Functions

Given a training set of $N$ samples $\{(x^1, y^1), \dots, (x^N, y^N)\}$, where $y^k = [y_1^k, \dots, y_C^k] \in \{0,1\}^C$ and $C$ is the number of classes, and assuming the model output for a sample is $z^k = [z_1^k, \dots, z_C^k] \in \mathbb{R}^C$, the BCE loss is defined as:
$$L_{BCE} = \begin{cases} -\log(p_i^k) & \text{if } y_i^k = 1 \\ -\log(1 - p_i^k) & \text{otherwise} \end{cases}$$

where $p_i^k = \sigma(z_i^k)$. In multi-label classification the model outputs must be squashed into [0,1], which is why the sigmoid function is used.

In a single-label problem, the ground truth $y^k$ is a one-hot vector. In the multi-label case, the ground truth $y^k$ is like a one-hot vector with some extra 1s, for example [0,1,0,1], which means the sample belongs to both class 1 and class 3 (counting classes from 0).
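As a rough illustration of the BCE loss for one multi-label sample, here is a minimal pure-Python sketch. It assumes the standard form (negative log of the sigmoid probability for positive labels, negative log of one minus that probability otherwise), averaged over the classes. The function names and the logit values are hypothetical and chosen only for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(logits, labels):
    """Mean binary cross-entropy over the C labels of a single sample."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        # Positive label: penalize low p. Negative label: penalize high p.
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

# Hypothetical logits for a 4-class problem with the multi-hot target [0, 1, 0, 1].
logits = [-2.0, 3.0, -1.0, 0.5]
labels = [0, 1, 0, 1]
loss = bce_loss(logits, labels)
```

Each class is scored independently here, which is what lets one sample carry several positive labels at once.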

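  This plain BCE is easily affected by label imbalance: because head samples are so numerous, the losses of all head samples might sum to 100 while the losses of all tail samples together stay below 10. Below we introduce three alternatives that address class imbalance in long-tailed data for multi-label text classification. Their main idea is to reweight BCE so that rare sample-label pairs receive a reasonable amount of "attention".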

[2] Focal Loss (FL)

By multiplying BCE by a tunable focusing parameter $\gamma \ge 0$, Focal Loss places higher loss weight on "hard-to-classify" samples, that is, samples whose predicted probability for the ground truth is low. For multi-label classification, Focal Loss is defined as:
$$L_{FL} = \begin{cases} -(1 - p_i^k)^{\gamma} \log(p_i^k) & \text{if } y_i^k = 1 \\ -(p_i^k)^{\gamma} \log(1 - p_i^k) & \text{otherwise} \end{cases}$$
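To make the effect of the focusing parameter concrete, the following sketch computes the focal term for a single sample-label pair. It assumes the usual formulation in which the BCE term is scaled by (1 - p)^gamma for positive labels and by p^gamma for negative ones. The default `gamma=2.0` and the logits are illustrative assumptions, not values taken from the post.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def focal_term(z, y, gamma=2.0):
    """Focal loss for one sample-label pair with logit z and label y."""
    p = sigmoid(z)
    if y == 1:
        # A confident correct positive (p near 1) is down-weighted by (1 - p)^gamma.
        return -((1.0 - p) ** gamma) * math.log(p)
    # A confident correct negative (p near 0) is down-weighted by p^gamma.
    return -(p ** gamma) * math.log(1.0 - p)

easy_positive = focal_term(4.0, 1)   # high probability for the true label
hard_positive = focal_term(-4.0, 1)  # low probability for the true label
```

With `gamma=0.0` the modulating factor equals 1 and the term falls back to plain BCE.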

[3] Class-balanced focal loss (CB)

For a multi-label task, first compute the frequency $n_i$ of each class. Each class then has its own balancing term $r_{CB}$:
$$r_{CB} = \frac{1 - \beta}{1 - \beta^{n_i}}$$
where $\beta \in [0,1)$ controls how fast the effective number of samples grows. The loss function becomes
$$L_{CB} = \begin{cases} -r_{CB} (1 - p_i^k)^{\gamma} \log(p_i^k) & \text{if } y_i^k = 1 \\ -r_{CB} (p_i^k)^{\gamma} \log(1 - p_i^k) & \text{otherwise} \end{cases}$$
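The class-balanced term can be sketched as below, assuming the effective-number form (1 - beta) / (1 - beta^n_i), where n_i is the number of training samples of class i. The class counts and the default `beta=0.9` are hypothetical values used only to show how the weight behaves.

```python
def class_balanced_weight(n_i, beta=0.9):
    """Balancing term for a class with n_i samples; requires 0 <= beta < 1 and n_i >= 1."""
    return (1.0 - beta) / (1.0 - beta ** n_i)

# Hypothetical class frequencies: one head class and one tail class.
counts = {"head": 1000, "tail": 5}
weights = {name: class_balanced_weight(n) for name, n in counts.items()}
```

The tail class ends up with the larger weight, and setting `beta=0.0` gives every class a weight of 1, that is, no rebalancing.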

[4] Distribution-balanced loss (DB)

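  By integrating rebalanced weighting with negative-tolerant regularization (NTR), Distribution-balanced Loss first reduces the redundant information caused by label co-occurrence (which is critical in the multi-label setting) and then assigns lower weight to "easy-to-classify" negative samples.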

First, to rebalance the weights: in the single-label case, a sample can be weighted by the resampling probability $P_i^C = \frac{1}{C} \frac{1}{n_i}$. In the multi-label case, however, applying the same strategy oversamples a sample that has several labels, with probability $P^I = \frac{1}{C} \sum_{y_i^k = 1} \frac{1}{n_i}$. We therefore combine the two to rebalance the weights:
$$r_{DB} = \frac{P_i^C}{P^I}$$
  The weight above can be smoothed so that it is bounded:

$$\hat{r}_{DB} = \alpha + \sigma(\beta \times (r_{DB} - \mu))$$

The range of $\hat{r}_{DB}$ is then $[\alpha, \alpha + 1]$. The rebalanced focal loss (R-FL) is

$$L_{R\text{-}FL} = \begin{cases} -\hat{r}_{DB} (1 - p_i^k)^{\gamma} \log(p_i^k) & \text{if } y_i^k = 1 \\ -\hat{r}_{DB} (p_i^k)^{\gamma} \log(1 - p_i^k) & \text{otherwise} \end{cases}$$
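The rebalancing weight can be sketched as follows. The sketch assumes a class-level sampling probability of 1 / (C * n_i), an instance-level probability equal to the sum of 1 / n_i over the sample's positive labels divided by C, and a smoothed weight of alpha + sigmoid(beta * (r - mu)). The function name, the hyperparameter defaults and the class counts are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function: maps a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def rebalanced_weights(labels, class_counts, alpha=0.1, beta=10.0, mu=0.3):
    """Smoothed rebalancing weight for every class of one multi-hot sample."""
    if 1 not in labels:
        raise ValueError("this sketch assumes at least one positive label")
    num_classes = len(class_counts)
    # Instance-level probability: contributions of all positive labels of the sample.
    p_instance = sum(1.0 / n for y, n in zip(labels, class_counts) if y == 1) / num_classes
    weights = []
    for n in class_counts:
        p_class = 1.0 / (num_classes * n)   # class-level sampling probability
        r = p_class / p_instance            # raw rebalancing weight
        weights.append(alpha + sigmoid(beta * (r - mu)))  # bounded in [alpha, alpha + 1]
    return weights

# Hypothetical sample labelled with a head class (1000 samples) and a tail class (5 samples).
w = rebalanced_weights([1, 0, 1], [1000, 50, 5])
```

For a sample with a single positive label, the raw weight of that label is exactly 1, so smoothing only changes the scale on which multi-label samples are discounted.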

NTR then treats the positive and negative samples of the same label differently, introducing a scale factor $\lambda$ and an intrinsic class-specific bias $v_i$ to lower the threshold for tail classes and avoid over-suppression:

$$L_{NTR\text{-}FL} = \begin{cases} -(1 - q_i^k)^{\gamma} \log(q_i^k) & \text{if } y_i^k = 1 \\ -\frac{1}{\lambda} (q_i^k)^{\gamma} \log(1 - q_i^k) & \text{otherwise} \end{cases}$$

For positive samples, $q_i^k = \sigma(z_i^k - v_i)$; for negative samples, $q_i^k = \sigma(\lambda (z_i^k - v_i))$. $v_i$ can be estimated by minimizing the loss function at the start of training, with scale factor $\kappa$ and class prior $p_i = n_i / N$:
$$\hat{b}_i = -\log\left(\frac{1}{p_i} - 1\right), \quad v_i = -\kappa \times \hat{b}_i$$
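The class-specific bias can be computed directly from the class prior. The sketch below assumes the form b_hat = -log(1 / p_i - 1) and v_i = -kappa * b_hat with p_i = n_i / N. The function name, the default `kappa=0.05` and the counts are hypothetical.

```python
import math

def ntr_bias(n_i, n_total, kappa=0.05):
    """Class-specific bias v_i from the class prior; requires 0 < n_i < n_total."""
    p_i = n_i / n_total                  # class prior
    b_hat = -math.log(1.0 / p_i - 1.0)   # logit of the prior
    return -kappa * b_hat

# Hypothetical counts: a rare class and a class covering half the data.
v_rare = ntr_bias(1, 100)
v_balanced = ntr_bias(50, 100)
```

For a class with prior below 0.5 the estimate `b_hat` is negative, so `v_i` comes out positive; at a prior of exactly 0.5 the bias is zero.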
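  Finally, by integrating the rebalanced weights with NTR, the Distribution-balanced Loss is
$$L_{DB} = \begin{cases} -\hat{r}_{DB} (1 - q_i^k)^{\gamma} \log(q_i^k) & \text{if } y_i^k = 1 \\ -\hat{r}_{DB} \frac{1}{\lambda} (q_i^k)^{\gamma} \log(1 - q_i^k) & \text{otherwise} \end{cases}$$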
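Putting the pieces together, a single term of the Distribution-balanced Loss might be sketched as below. It assumes the rebalancing weight `r_hat` and the bias `v` have already been computed, that positives use q = sigmoid(z - v), and that negatives use q = sigmoid(lam * (z - v)) with an extra 1 / lam factor. All names and default values are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function: maps a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def db_term(z, y, r_hat, v, gamma=2.0, lam=2.0):
    """Distribution-balanced loss for one sample-label pair.

    z: logit, y: label (0 or 1), r_hat: smoothed rebalancing weight,
    v: class-specific bias, gamma: focusing parameter, lam: NTR scale factor.
    """
    if y == 1:
        q = sigmoid(z - v)
        return -r_hat * (1.0 - q) ** gamma * math.log(q)
    # Negatives use a scaled logit and are additionally divided by lam.
    q = sigmoid(lam * (z - v))
    return -r_hat * (1.0 / lam) * q ** gamma * math.log(1.0 - q)

# Hypothetical values for one positive and one negative label.
pos = db_term(z=1.5, y=1, r_hat=0.8, v=0.2)
neg = db_term(z=-1.0, y=0, r_hat=0.8, v=0.2)
```

With `r_hat=1.0`, `v=0.0`, `gamma=0.0` and `lam=1.0` the term reduces to plain BCE, which is a convenient sanity check for an implementation.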

[5] Experimental Results

  The experiments compare the effect of the different loss functions, with an SVM as the baseline model.

[Figure: experimental results for the different loss functions]


From: https://blog.51cto.com/u_15942590/6010686
