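1. Sentence tokenization:
import nltk
from nltk.tokenize import sent_tokenize  # English sentence tokenizer
# Requires the Punkt data: nltk.download('punkt') ('punkt_tab' on newer NLTK releases)
# Text to split into sentences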
paragraph = 'You must follow me carefully. I shall have to controvert one or twoideas that are almost universally accepted. The geometry, forinstance, they taught you at school is founded on a misconception.'
tokenized_text = sent_tokenize(paragraph)
print(tokenized_text)
Output of tokenized_text:
['You must follow me carefully.', 'I shall have to controvert one or twoideas that are almost universally accepted.', 'The geometry, forinstance, they taught you at school is founded on a misconception.']
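For contrast, here is a minimal sketch of a purely rule-based splitter. It is not how sent_tokenize works (that function relies on NLTK's pre-trained Punkt model); naive_sent_split is a hypothetical helper built on the standard library only, and the sample strings are made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'You must follow me carefully. I shall have to controvert one or two ideas.'
sentences = naive_sent_split(sample)
print(sentences)

# The same rule also splits after abbreviations, which is usually not wanted:
broken = naive_sent_split('Dr. Smith arrived. He sat down.')
print(broken)
```

The second call shows why a fixed rule is fragile: 'Dr.' ends up as a sentence of its own, which is the kind of case a trained sentence tokenizer is meant to handle.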
2. Word tokenization:
from nltk import word_tokenize  # word tokenizer
text = 'You must follow me carefully.'
tokenized_word = word_tokenize(text)
print(tokenized_word)
Output of tokenized_word:
['You', 'must', 'follow', 'me', 'carefully', '.']
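To make the idea of tokenization concrete, the sketch below approximates it with a single regular expression. This is only a stand-in under stated assumptions: naive_word_tokenize is a hypothetical helper, and it does not reproduce word_tokenize's rules (for example, NLTK splits "don't" into "do" and "n't").

```python
import re

def naive_word_tokenize(text):
    """Return runs of word characters and single punctuation marks (hypothetical helper)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = naive_word_tokenize('You must follow me carefully.')
print(tokens)

# Contractions are where the naive rule and NLTK diverge:
contraction = naive_word_tokenize("I don't know")
print(contraction)
```

On the simple sentence the result matches the NLTK output above; on the contraction the regex yields 'don', "'", 't', which is rarely what downstream steps expect.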
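3. Removing punctuation from text:
3.1 Method 1: replacing punctuation with str.translate
import string
punctuation = string.punctuation  # ASCII punctuation characters
text = 'You must follow me carefully.'  # text to process
# Build a translation table that maps every punctuation character to a space,
# which strips the punctuation but leaves a space where each mark was
text_1 = text.translate(str.maketrans(punctuation, ' ' * len(punctuation)))
print(text_1)
Output of text_1: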
'You must follow me carefully '
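The trailing space in that result comes from mapping '.' to ' '. As a small sketch of an alternative, str.maketrans also accepts a third argument listing characters to delete outright. The variable names below are illustrative only.

```python
import string

text = 'You must follow me carefully.'

# Third argument of str.maketrans: characters to delete instead of replace
delete_table = str.maketrans('', '', string.punctuation)
text_deleted = text.translate(delete_table)
print(repr(text_deleted))

# Trade-off: deleting can glue words together when no space surrounds the mark
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
glued = 'one,two'.translate(delete_table)
spaced = 'one,two'.translate(space_table)
print(glued, '|', spaced)
```

Replacing with spaces is the safer choice when the text will be split on whitespace afterwards; deleting is tidier when punctuation is always followed by a space.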
3.2 Method 2: filtering the token list
from nltk import word_tokenize  # word tokenizer
import string  # provides the punctuation characters
punctuation = string.punctuation  # ASCII punctuation characters (a string)
punctuation = list(punctuation)  # convert the string to a list of single characters
text = 'You must follow me carefully.'  # text to process
tokenized_word = word_tokenize(text)
# List comprehension: [expression, for loop(s), optional condition]
text_1 = [word for word in tokenized_word if word not in punctuation]
print(text_1)
Output of text_1:
['You', 'must', 'follow', 'me', 'carefully']
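One caveat with this filter is that it only drops tokens equal to a single punctuation character. Tokenizers can also emit multi-character punctuation tokens such as '--' or '...'. The sketch below, using a hand-written token list and a hypothetical helper is_punctuation, drops any token made up entirely of punctuation.

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """True when every character of the token is ASCII punctuation (hypothetical helper)."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Hand-written sample tokens, not produced by NLTK
tokens = ['Wait', '...', 'he', 'said', '--', 'now', '!']

single_char_filter = [t for t in tokens if t not in PUNCT]
all_char_filter = [t for t in tokens if not is_punctuation(t)]
print(single_char_filter)
print(all_char_filter)
```

The first list still contains '...' and '--', while the second keeps only the words.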
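4. Removing stop words:
Note:
Stop words are noise words that carry little meaning on their own. Common English stop words include: is, am, a, are, the, an, to, for...
import string
from nltk import word_tokenize  # word tokenizer
from nltk.corpus import stopwords  # stop-word corpus (requires nltk.download('stopwords'))
punctuation = list(string.punctuation)  # punctuation characters, as in section 3.2
# Load the English stop words
stop_words = stopwords.words('english')
text = 'I shall have to controvert one or twoideas that are almost universally accepted.'
# Tokenize
tokenized_text = word_tokenize(text)
# Drop punctuation and stop words (the stop-word list is lowercase, so the capitalized 'I' is kept)
text_1 = [word for word in tokenized_text if word not in punctuation and word not in stop_words]
print(text_1)
Output of text_1: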
['I', 'shall', 'controvert', 'one', 'twoideas', 'almost', 'universally', 'accepted']
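'I' survives the filter because the NLTK stop-word list is all lowercase and the comparison is case-sensitive. A minimal sketch of a case-insensitive variant follows. To keep it runnable without the corpus, stop_words here is a tiny hypothetical subset rather than stopwords.words('english'), and the token list is written out by hand.

```python
import string

# Tiny hypothetical subset standing in for stopwords.words('english')
stop_words = {'i', 'have', 'to', 'or', 'that', 'are'}
punct = set(string.punctuation)

# Tokens of the example sentence, written out by hand
tokens = ['I', 'shall', 'have', 'to', 'controvert', 'one', 'or', 'twoideas',
          'that', 'are', 'almost', 'universally', 'accepted', '.']

# Lowercase each token only for the membership test, so the original casing is kept
filtered = [t for t in tokens if t not in punct and t.lower() not in stop_words]
print(filtered)
```

Converting the stop words to a set also makes each lookup constant-time, which matters on longer texts.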
5. Extracting and plotting word frequencies:
5.1 Extracting word frequencies
import nltk
paragraph = 'You must follow me carefully. I shall have to controvert one or twoideas that are almost universally accepted. The geometry, forinstance, they taught you at school is founded on a misconception.'
# Tokenize
tokenized_word = nltk.word_tokenize(paragraph)
# Count frequencies; w.lower() lowercases each token so 'You' and 'you' are counted together
word_freqs = nltk.FreqDist(w.lower() for w in tokenized_word)
word_freqs_dict = word_freqs.items()
print(word_freqs_dict)
Output of word_freqs_dict:
dict_items([('you', 2), ('must', 1), ('follow', 1), ('me', 1), ('carefully', 1), ('.', 3), ('i', 1), ('shall', 1), ('have', 1), ('to', 1), ('controvert', 1), ('one', 1), ('or', 1), ('twoideas', 1), ('that', 1), ('are', 1), ('almost', 1), ('universally', 1), ('accepted', 1), ('the', 1), ('geometry', 1), (',', 2), ('forinstance', 1), ('they', 1), ('taught', 1), ('at', 1), ('school', 1), ('is', 1), ('founded', 1), ('on', 1), ('a', 1), ('misconception', 1)])
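FreqDist is a subclass of collections.Counter, so Counter methods such as most_common are available on word_freqs as well. The sketch below shows the same counting idea with the standard library alone. The regex tokenizer is an assumption standing in for nltk.word_tokenize, and it skips punctuation instead of counting it.

```python
import re
from collections import Counter

paragraph = ('You must follow me carefully. I shall have to controvert one or '
             'twoideas that are almost universally accepted. The geometry, '
             'forinstance, they taught you at school is founded on a misconception.')

# Stand-in tokenizer: lowercase the text and keep runs of letters only
words = re.findall(r'[a-z]+', paragraph.lower())
word_counts = Counter(words)

top = word_counts.most_common(1)  # the single most frequent word with its count
print(top)
print(word_counts['school'])
```

With punctuation left out, 'you' is the only word that occurs more than once in this paragraph.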
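5.2 Plotting word frequencies
# Count word frequencies
word_freqs = nltk.FreqDist(w.lower() for w in tokenized_word)
# Plot the frequency of every token (requires matplotlib)
word_freqs.plot()
5.3 Plotting the three most frequent tokens
# cumulative=True plots running totals rather than individual counts
word_freqs.plot(3, cumulative=True)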
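As a rough illustration of what the cumulative plot displays, the sketch below computes the running totals by hand. The counts are copied from a few entries of the word_freqs output in section 5.1, and a plain Counter stands in for the FreqDist. Because punctuation was not filtered out, the three most frequent tokens are '.', 'you' and ','.

```python
from collections import Counter
from itertools import accumulate

# A few counts copied from the word_freqs output in section 5.1
counts = Counter({'you': 2, 'must': 1, '.': 3, ',': 2, 'the': 1})

top3 = counts.most_common(3)
labels = [token for token, _ in top3]
cumulative = list(accumulate(count for _, count in top3))

for label, total in zip(labels, cumulative):
    print(label, total)
```

Each value is the sum of all counts up to that token, so the curve can only rise from left to right.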
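6. Searching for words
from nltk.book import *  # loads the sample texts text1 to text9 (requires nltk.download('book'))
text1.concordance('boy')  # show each occurrence of 'boy' with its surrounding context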
Output:
Displaying 25 of 65 matches:
? Why is almost every robust healthy boy with a robust healthy soul in him , a
lings in a most direful manner . " My boy ," said the landlord , " you ' ll hav
idends . Rising from a little cabin - boy in short clothes of the drabbest drab
ain ' t Captain Peleg ; HE ' S AHAB , boy ; and Ahab of old , thou knowest , wa
to have a wicked name . Besides , my boy , he has a wife -- not three voyages
stors , and scolding her little black boy meantime . " Wood - house !" cried I
Careful , careful !-- come , Bildad , boy -- say your last . Luck to ye , Starb
, no ! he went before . Poor Alabama boy ! On the grim Pequod ' s forecastle ,
t sleep then . Didn ' t that Dough - Boy , the steward , tell me that of a mor
r hold for , every night , as Dough - Boy tells me he suspects ; what ' s that
in - Table . It is noon ; and Dough - Boy , the steward , thrusting his pale lo
he was the youngest son , and little boy of this weary family party . His were
vious repast , often the pale Dough - Boy was fain to bring on a great baron of
ith a sudden humor , assisted Dough - Boy ' s memory by snatching him up bodily
ions of these three savages , Dough - Boy ' s whole life was one continual lip
much so , that the trembling Dough - Boy almost looked to see whether any mark
all tend to tranquillize poor Dough - Boy . How could he forget that in his Isl
vivial indiscretions . Alas ! Dough - Boy ! hard fares the white waiter who wai
O men , you will yet see that -- Ha ! boy , come back ? bad pennies come not so
ht bells there ! d ' ye hear , bell - boy ? Strike the bell eight , thou Pip !
CING ) Go it , Pip ! Bang it , bell - boy ! Rig it , dig it , stig it , quig it
, dig it , stig it , quig it , bell - boy ! Make fire - flies ; break the jingl
ness , have mercy on this small black boy down here ; preserve him from all men
cried Ahab . " Time ! time !" Dough - Boy hurried below , glanced at the watch
fter hold for , so often , as Dough - Boy long suspected . They were hidden dow
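The concordance view can be sketched in a few lines of plain Python. This is not NLTK's implementation: concordance_lines is a hypothetical helper that measures context in tokens rather than characters, and the token list is made up. Like the output above, it matches case-insensitively.

```python
def concordance_lines(tokens, word, window=3):
    """Return each occurrence of `word` with up to `window` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            lines.append(f'{left} [{tok}] {right}'.strip())
    return lines

# Made-up sample tokens
tokens = 'My boy , said the landlord , be a good Boy'.split()
lines = concordance_lines(tokens, 'boy')
for line in lines:
    print(line)
```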
text3.similar('time')  # list words that occur in contexts similar to those of 'time'
Output:
day land was days cattle all lord field stone east trees way sheep son men plain people souls cities kings
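The listed words are not synonyms of 'time'; they are words that show up between the same neighbours. The sketch below illustrates that idea under simplifying assumptions: similar_words is a hypothetical helper that treats the (previous word, next word) pair as a context and ranks candidates by how many contexts they share with the target. NLTK's actual scoring may differ, and the token list is made up.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Rank words by the number of (previous, next) contexts shared with `word`."""
    tokens = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts[word.lower()]
    scores = {w: len(ctx & target) for w, ctx in contexts.items()
              if w != word.lower() and ctx & target}
    # Most shared contexts first, alphabetical order as a tie-breaker
    return sorted(scores, key=lambda w: (-scores[w], w))

# Made-up sample tokens
tokens = 'in the day of rest and in the time of need and in the land of plenty'.split()
result = similar_words(tokens, 'time')
print(result)
```

In the sample, 'day' and 'land' both occur between 'the' and 'of', exactly like 'time', so they are reported as similar.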