News Classification
Yahoo! Answers Topic Classification Dataset
Download
A simple way to get this dataset is to load it with Hugging Face datasets. The download URL can be found in that library's source code: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz
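If you would rather fetch the archive by hand, here is a minimal sketch using only the Python standard library. The helper names download and extract_tgz and the data directory are illustrative assumptions, not part of any library.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

YAHOO_URL = "https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz"


def download(url: str, dest: Path) -> Path:
    """Download url to dest unless the file is already there (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_tgz(archive: Path, target_dir: Path) -> List[str]:
    """Unpack a gzip-compressed tar archive and return the member names."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names


# Example call (needs network access, so it is left commented out):
# extract_tgz(download(YAHOO_URL, Path("data") / "yahoo_answers_csv.tgz"), Path("data"))
```

The same two helpers should work for the other .tgz and .tar.gz links in this post, since they only depend on the archive format.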
Dataset Overview
The Yahoo! Answers topic classification dataset has 1,400,000 training samples and 60,000 test samples. Each sample has 4 fields, for example:
topic_id: "5",
question_title: "why doesn't an optical mouse work on a glass table?",
question_content: "or even on some surfaces?",
best_answer: "Optical mice use an LED and a camera to rapidly capture images of the surface
beneath the mouse. The infomation from the camera is analyzed by a DSP (Digital
Signal Processor) and used to detect imperfections in the underlying surface and
determine motion. Some materials, such as glass, mirrors or other very shiny,
uniform surfaces interfere with the ability of the DSP to accurately analyze
the surface beneath the mouse. \nSince glass is transparent and very uniform,
the mouse is unable to pick up enough imperfections in the underlying surface
to determine motion. Mirrored surfaces are also a problem, since they constantly
reflect back the same image, causing the DSP not to recognize motion properly.
When the system is unable to see surface changes associated with movement, the
mouse will not work properly."
There are 10 labels:
Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government
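A rough sketch of turning a CSV row into the four fields and a readable topic name. It assumes the CSV has no header row, that the columns follow the field order of the example, and that topic_id is 1-based in the order of the list above (so "5" is Computers & Internet, which matches the optical mouse example). The function name parse_yahoo_rows is made up for illustration.

```python
import csv
import io

# Label names in the order listed above; ids are assumed to start at 1.
YAHOO_TOPICS = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]

FIELDS = ["topic_id", "question_title", "question_content", "best_answer"]


def parse_yahoo_rows(csv_text):
    """Parse header-less CSV text into dicts and attach the topic name."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:  # skip blank lines
            continue
        row = dict(zip(FIELDS, record))
        row["topic"] = YAHOO_TOPICS[int(row["topic_id"]) - 1]
        rows.append(row)
    return rows


# The example sample, with best_answer cut down to its first sentence.
sample = '''"5","why doesn't an optical mouse work on a glass table?","or even on some surfaces?","Optical mice use an LED and a camera to rapidly capture images of the surface beneath the mouse."'''

rows = parse_yahoo_rows(sample)
print(rows[0]["topic"])  # Computers & Internet
```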
AGNews
Download
Training set: https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Test set: https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
Dataset Overview
The AGNews dataset was collected by the academic news search engine ComeToMyHead from more than 2,000 news sources. It has 120,000 training samples and 7,600 test samples. Each sample has 3 fields, for example:
id: "3",
headline: "Wall St. Bears Claw Back Into the Black (Reuters)",
content: "Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics,
are seeing green again."
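There are 4 labels. The CSV files contain only the numbers 1 to 4, stored in the first column (shown as id in the example above), with no label names. In the standard AG News release the numbers stand for 1 World, 2 Sports, 3 Business, and 4 Sci/Tech.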
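A small sketch of reading such rows and attaching label names. The mapping 1 World, 2 Sports, 3 Business, 4 Sci/Tech is the one usually published with AG News and should be treated as an assumption here, as should the header-less three-column layout. NewsItem and parse_ag_news are illustrative names.

```python
import csv
import io
from typing import List, NamedTuple

# Assumed mapping; the CSV files themselves only carry the numbers.
AG_NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


class NewsItem(NamedTuple):
    label: int
    headline: str
    content: str


def parse_ag_news(csv_text: str) -> List[NewsItem]:
    """Parse header-less rows of the form label, headline, content."""
    items = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:  # skip blank lines
            continue
        label, headline, content = record
        items.append(NewsItem(int(label), headline, content))
    return items


# Raw string so the literal backslash in the sample text is kept as-is.
sample = r'''"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."'''

items = parse_ag_news(sample)
print(AG_NEWS_LABELS[items[0].label], "|", items[0].headline)
```

The backslash in the sample text is part of the data, so the sketch leaves it untouched rather than treating it as an escape character.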
Sentiment Analysis
SST-2
Download
https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
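Dataset Overview
SST stands for Stanford Sentiment Treebank; the data comes mainly from movie reviews. The dataset is split into train/dev/test, with 67,349, 872, and 1,821 samples respectively. Each sample has 2 fields, for example: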
sentence: "for those moviegoers who complain that ` they don't
make movies like they used to anymore"
label: "0"
There are 2 labels: 0 means negative and 1 means positive.
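A minimal sketch of reading the labelled files, assuming they are tab-separated with a header row of sentence and label (as in the GLUE train.tsv and dev.tsv). The function name parse_sst2 is illustrative.

```python
import csv
import io

SST2_LABELS = {0: "negative", 1: "positive"}


def parse_sst2(tsv_text):
    """Return (sentence, label) pairs from tab-separated text with a header row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # sentences may contain stray quote characters
    )
    return [(row["sentence"], int(row["label"])) for row in reader]


sample = (
    "sentence\tlabel\n"
    "for those moviegoers who complain that ` they don't "
    "make movies like they used to anymore\t0\n"
)

pairs = parse_sst2(sample)
print(SST2_LABELS[pairs[0][1]])  # negative
```

Quoting is switched off because the sentences are raw tokenized text, so a lone quote character should not be read as the start of a quoted field.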
IMDB
Download
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
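Dataset Overview
The IMDB movie review dataset. It is split into a training set and a test set of 25,000 samples each, and each split has 12,500 positive and 12,500 negative samples. Each sample is a separate text file, organized into folders by label. Here is a positive example from the training set: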
If you like adult comedy cartoons, like South Park, then this is nearly a similar
format about the small adventures of three teenage girls at Bromwell High.
Keisha, Natella and Latrina have given exploding sweets and behaved like bitches
, I think Keisha is a good leader. There are also small stories going on with the
teachers of the school. There's the idiotic principal, Mr. Bip, the nervous Maths
teacher and many others. The cast is also fantastic, Lenny Henry's Gina Yashere,
EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Pony's Doon Mackichan,
Dead Ringers' Mark Perry and Blunder's Nina Conti. I didn't know this came from
Canada, but it is very good. Very good!
There are two labels, pos and neg.
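Because the label is carried by the folder name rather than by a column, loading looks a little different. A sketch, assuming the extracted archive has the layout aclImdb/train/pos, aclImdb/train/neg, aclImdb/test/pos, aclImdb/test/neg with one .txt file per review; load_imdb_split is an illustrative name.

```python
from pathlib import Path
from typing import List, Tuple


def load_imdb_split(root: Path, split: str) -> List[Tuple[str, str]]:
    """Return (review_text, label) pairs for one split, label being 'pos' or 'neg'."""
    samples = []
    for label in ("pos", "neg"):
        folder = root / split / label
        # Sort so the order is stable across runs and platforms.
        for path in sorted(folder.glob("*.txt")):
            samples.append((path.read_text(encoding="utf-8"), label))
    return samples


# Example call, assuming the archive was extracted into the current directory:
# train = load_imdb_split(Path("aclImdb"), "train")
```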
Yelp Review Full
Download
https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz
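Dataset Overview
Yelp is a US business review site, similar to Dianping. The dataset is split into a training set of 650,000 samples and a test set of 50,000 samples. An example: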
label: "5",
text: "dr. goldberg offers everything i look for in a general practitioner.
he's nice and easy to talk to without being patronizing; he's always on
time in seeing his patients; he's affiliated with a top-notch hospital
(nyu) which my parents have explained to me is very important in case
something happens and you need surgery; and you can get referrals to see
specialists without having to see him first. really, what more do you
need? i'm sitting here trying to think of any complaints i have about
him, but i'm really drawing a blank."
The label is the rating, from 1 star to 5 stars.
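Most classifiers expect class indices starting at 0, so the star rating is often shifted by one. A sketch, assuming header-less CSV rows of the form label, text; parse_yelp is an illustrative name, and the second sample row is invented purely to show a different rating.

```python
import csv
import io
from collections import Counter


def parse_yelp(csv_text):
    """Return (text, class_index) pairs, where class_index = stars - 1."""
    pairs = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:  # skip blank lines
            continue
        stars, text = record
        pairs.append((text, int(stars) - 1))
    return pairs


sample = (
    '"5","dr. goldberg offers everything i look for in a general practitioner."\n'
    '"1","hypothetical one-star review, not taken from the dataset."\n'
)

pairs = parse_yelp(sample)
# Quick look at the class distribution of whatever was parsed.
print(Counter(index for _, index in pairs))
```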
From: https://www.cnblogs.com/zzk0/p/16964477.html