Text Classification Datasets


Date: 2023-01-08 15:56:34


Yahoo! Answers Topic Classification Dataset


A simple way to get this dataset is to load it with huggingface datasets. The download URL can be found in its source code: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz
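If you would rather fetch the archive yourself, the following is a minimal sketch using only the standard library. The function names `download` and `extract` and the `data/` paths are made up for illustration; only the URL comes from the text above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the text above.
YAHOO_URL = "https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz"


def download(url, dest):
    """Download url to dest unless the file is already there."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive, out_dir):
    """Unpack a .tgz archive and return the names of its members."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(out_dir)
    return names


# Usage (fetches the whole archive, so it is left commented out):
# archive = download(YAHOO_URL, "data/yahoo_answers_csv.tgz")
# print(extract(archive, "data"))
```

With the `datasets` library installed, something like `load_dataset("yahoo_answers_topics")` should give the same data without handling the archive by hand, assuming that dataset id is still available on the hub.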


Yahoo! Answers topic classification dataset. It has 1,400,000 training samples and 60,000 test samples. Each sample has 4 fields, for example:

topic_id: "5",
question_title: "why doesn't an optical mouse work on a glass table?",
question_content: "or even on some surfaces?",
best_answer: "Optical mice use an LED and a camera to rapidly capture images of the surface
              beneath the mouse.  The infomation from the camera is analyzed by a DSP (Digital 
              Signal Processor) and used to detect imperfections in the underlying surface and 
              determine motion. Some materials, such as glass, mirrors or other very shiny, 
              uniform surfaces interfere with the ability of the DSP to accurately analyze 
              the surface beneath the mouse.  \nSince glass is transparent and very uniform, 
              the mouse is unable to pick up enough imperfections in the underlying surface 
              to determine motion.  Mirrored surfaces are also a problem, since they constantly 
              reflect back the same image, causing the DSP not to recognize motion properly.
              When the system is unable to see surface changes associated with movement, the 
              mouse will not work properly."
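The four fields can be read with the standard `csv` module. This is only a sketch: it assumes the extracted CSV has no header row and stores the columns in the order shown above, and that line breaks inside a field are written as a literal backslash followed by "n" (as the example answer suggests). The name `read_yahoo_rows` and the shortened sample row are made up for illustration.

```python
import csv
import io

FIELDS = ["topic_id", "question_title", "question_content", "best_answer"]


def read_yahoo_rows(handle):
    """Yield one dict per CSV row, keyed by the four field names."""
    for row in csv.reader(handle):
        # Turn the escaped line breaks (backslash + "n") back into newlines.
        yield {key: value.replace("\\n", "\n") for key, value in zip(FIELDS, row)}


# One row, shortened from the example above.
sample = io.StringIO(
    '"5","why doesn\'t an optical mouse work on a glass table?",'
    '"or even on some surfaces?",'
    '"Optical mice use an LED and a camera.\\nSince glass is transparent and very uniform."\n'
)
rows = list(read_yahoo_rows(sample))
print(rows[0]["topic_id"], rows[0]["question_title"])
```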

There are 10 labels in total:

Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government
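To turn a `topic_id` into a readable name, a small lookup is enough. The sketch below assumes the ids are 1-based and follow the order of the `classes.txt` file shipped in the archive (Health and Sports included); the helper name `topic_name` is hypothetical.

```python
# Topic names, assumed to be in classes.txt order; ids are assumed to be 1-based.
YAHOO_TOPICS = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]


def topic_name(topic_id):
    """Map a 1-based topic id (string or int) to its topic name."""
    index = int(topic_id) - 1
    if not 0 <= index < len(YAHOO_TOPICS):
        raise ValueError(f"topic id out of range: {topic_id!r}")
    return YAHOO_TOPICS[index]


print(topic_name("5"))  # Computers & Internet
```

Under that assumption the optical mouse question above, with `topic_id` "5", lands in Computers & Internet, which looks right.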

AG News

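The AG News dataset was collected by ComeToMyHead, an academic news search engine, from more than 2,000 news sources. It has 120,000 training samples and 7,600 test samples. Each sample has 3 fields, for example: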

id: "3",
headline: "Wall St. Bears Claw Back Into the Black (Reuters)",
content: "Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics,
          are seeing green again."

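There are 4 labels. The CSV rows carry only a number from 1 to 4 with no explanation of what it means; the names are listed in the classes.txt shipped alongside the CSV files: World, Sports, Business, Sci/Tech.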
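For training, the headline and content are often joined into one string. The sketch below does that and also cleans up the stray backslash seen in the example ("dwindling\band"), on the assumption that it marks a former line break. It further assumes the first column (shown as `id` above) is the class number from 1 to 4; `read_ag_news_rows` is a made-up name.

```python
import csv
import io


def read_ag_news_rows(handle):
    """Yield (label, text) pairs from rows of: class number, headline, content."""
    for label, headline, content in csv.reader(handle):
        # Assumption: a backslash in the raw text marks a former line break.
        text = f"{headline} {content}".replace("\\", " ")
        # Collapse any repeated whitespace left behind.
        yield int(label), " ".join(text.split())


sample = io.StringIO(
    '"3","Wall St. Bears Claw Back Into the Black (Reuters)",'
    '"Reuters - Short-sellers, Wall Street\'s dwindling\\band of ultra-cynics,'
    ' are seeing green again."\n'
)
label, text = next(read_ag_news_rows(sample))
print(label, text)
```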

SST

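SST stands for Stanford Sentiment Treebank; the data comes mainly from movie reviews. The dataset is split into train/dev/test, with 67,349, 872, and 1,821 samples respectively. Each sample has 2 fields, for example: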

sentence: "for those moviegoers who complain that ` they don't 
           make movies like they used to anymore"
label: "0"

There are 2 labels: 0 means negative and 1 means positive.
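A sketch for reading these two fields, assuming the files are tab-separated with a `sentence` / `label` header row (the layout is an assumption, as is the name `read_sst_rows`). Quoting is switched off because the sentences contain stray quote characters and backticks.

```python
import csv
import io
from collections import Counter

LABEL_NAMES = {"0": "negative", "1": "positive"}


def read_sst_rows(handle):
    """Yield (sentence, label_name) from a tab-separated file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["sentence"].strip(), LABEL_NAMES[row["label"]]


sample = io.StringIO(
    "sentence\tlabel\n"
    "for those moviegoers who complain that ` they don't make movies "
    "like they used to anymore\t0\n"
)
rows = list(read_sst_rows(sample))
print(Counter(name for _, name in rows))
```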

IMDB

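IMDB movie review dataset. It is split into a training set and a test set of 25,000 samples each, and each split contains 12,500 positive and 12,500 negative samples. The samples are organized into folders; here is a positive example from the training set: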

If you like adult comedy cartoons, like South Park, then this is nearly a similar
format about the small adventures of three teenage girls at Bromwell High.
Keisha, Natella and Latrina have given exploding sweets and behaved like bitches
, I think Keisha is a good leader. There are also small stories going on with the
teachers of the school. There's the idiotic principal, Mr. Bip, the nervous Maths
teacher and many others. The cast is also fantastic, Lenny Henry's Gina Yashere,
EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Pony's Doon Mackichan,
Dead Ringers' Mark Perry and Blunder's Nina Conti. I didn't know this came from
Canada, but it is very good. Very good!

There are two labels, pos and neg.
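Since the label is encoded in the folder name, loading a split comes down to walking the two folders. This sketch assumes one review per `.txt` file under `pos/` and `neg/` inside each split directory; the function name and the `aclImdb/train` path in the usage comment are assumptions, not something the text above confirms.

```python
from pathlib import Path


def read_imdb_split(split_dir):
    """Return (text, label) pairs from <split_dir>/pos and <split_dir>/neg."""
    samples = []
    for label in ("pos", "neg"):
        # Any other folders in the split directory are ignored.
        for path in sorted(Path(split_dir, label).glob("*.txt")):
            samples.append((path.read_text(encoding="utf-8"), label))
    return samples


# Usage, assuming the archive was unpacked to ./aclImdb (hypothetical path):
# train = read_imdb_split("aclImdb/train")
# print(len(train))
```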

Yelp Review Full




Yelp is a US business review site, similar to Dianping. The dataset is split into a training set of 650,000 samples and a test set of 50,000 samples. For example:

label: "5",
text: "dr. goldberg offers everything i look for in a general practitioner.  
       he's nice and easy to talk to without being patronizing; he's always on 
       time in seeing his patients; he's affiliated with a top-notch hospital 
       (nyu) which my parents have explained to me is very important in case 
       something happens and you need surgery; and you can get referrals to see 
       specialists without having to see him first.  really, what more do you 
       need?  i'm sitting here trying to think of any complaints i have about 
       him, but i'm really drawing a blank."

The label is the rating, from 1 star to 5 stars.
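Most classifiers expect class indices starting at 0, so the star rating usually gets shifted by one. A minimal sketch, assuming each CSV row is `label, text` with no header; `read_yelp_rows` is a made-up name and the sample row is shortened from the example above.

```python
import csv
import io


def read_yelp_rows(handle):
    """Yield (class_index, text) from rows of: star rating, review text."""
    for stars, text in csv.reader(handle):
        rating = int(stars)
        if not 1 <= rating <= 5:
            raise ValueError(f"unexpected star rating: {stars!r}")
        # Shift 1..5 stars to 0..4 class indices.
        yield rating - 1, text


sample = io.StringIO(
    '"5","dr. goldberg offers everything i look for in a general practitioner."\n'
)
class_index, text = next(read_yelp_rows(sample))
print(class_index)  # 4
```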

From: https://www.cnblogs.com/zzk0/p/16964477.html
