Common NLTK Modules for English Text Tokenization

Date: 2024-09-10 12:50:57
Tags: me, boy, word, text, tokenization, tokenized, word frequency, module, NLTK

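Contents

1. Sentence segmentation module:

2. Word tokenization module:

3. Removing punctuation from text:

4. Removing stop words:

5. Word frequency extraction and plotting:

5.1 Extracting word frequencies

5.2 Plotting word frequencies

5.3 Plotting the three most frequent words

6. Word search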


1. Sentence segmentation module:

import nltk
from nltk.tokenize import sent_tokenize  # English sentence splitter

# sent_tokenize needs the Punkt data; download it once with nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead)

# Text to split into sentences
paragraph = 'You must follow me carefully. I shall have to controvert one or two ideas that are almost universally accepted. The geometry, for instance, they taught you at school is founded on a misconception.'

tokenized_text = sent_tokenize(paragraph)
print(tokenized_text)

Output of tokenized_text:
['You must follow me carefully.', 'I shall have to controvert one or two ideas that are almost universally accepted.', 'The geometry, for instance, they taught you at school is founded on a misconception.']
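To see why a trained sentence splitter is useful, here is a minimal sketch of a naive alternative that uses only the standard library. The helper name naive_sent_split is hypothetical and is not part of NLTK. It splits after '.', '!' or '?' whenever whitespace follows, which works for simple text but breaks on abbreviations; the Punkt model behind sent_tokenize is designed to handle many such cases.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

simple = naive_sent_split('You must follow me carefully. It is founded on a misconception.')
print(simple)
# ['You must follow me carefully.', 'It is founded on a misconception.']

# The same rule wrongly treats the period of an abbreviation as a sentence end
broken = naive_sent_split('Dr. Smith taught geometry. He was strict.')
print(broken)
# ['Dr.', 'Smith taught geometry.', 'He was strict.']
```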

2. Word tokenization module:

from nltk import word_tokenize  # word tokenizer


text = 'You must follow me carefully.'
tokenized_word = word_tokenize(text)
print(tokenized_word)

Output of tokenized_word:
['You', 'must', 'follow', 'me', 'carefully', '.']
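Note that the final period comes out as its own token, which a plain text.split() would not do. As a rough illustration of the idea, the sketch below uses a standard-library regular expression as a stand-in; simple_tokenize is a hypothetical helper, not the algorithm NLTK uses. It matches the result above for this sentence, but it differs on contractions, where NLTK's Treebank-style rules would typically produce 'do' and "n't" for "don't".

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for a word tokenizer, not NLTK's implementation."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize('You must follow me carefully.')
print(tokens)
# ['You', 'must', 'follow', 'me', 'carefully', '.']

contraction = simple_tokenize("I don't know.")
print(contraction)
# ['I', 'don', "'", 't', 'know', '.']
```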


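3. Removing punctuation from text:

3.1 Method 1: replace punctuation with str.translate

import string


punctuation = string.punctuation  # ASCII punctuation characters
text = 'You must follow me carefully.'  # text to process

# Build a translation table that maps every punctuation character to a space.
# This replaces punctuation with spaces rather than deleting it.
text_1 = text.translate(str.maketrans(punctuation, ' ' * len(punctuation)))
print(text_1)

Value of text_1 (note the trailing space left in place of the period):
'You must follow me carefully '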
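If the goal is to delete punctuation outright instead of leaving spaces behind, str.maketrans also accepts a third argument listing characters to remove. The sketch below shows both variants side by side; the variable names are illustrative only.

```python
import string

text = 'You must follow me carefully.'

# The third argument of str.maketrans lists characters to delete
delete_table = str.maketrans('', '', string.punctuation)
# Two-argument form: map each punctuation character to a space
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

text_deleted = text.translate(delete_table)
print(repr(text_deleted))  # 'You must follow me carefully'

# Trade-off: deleting glues hyphenated words together, spacing keeps them apart
joined = 'well-known idea'.translate(delete_table)
spaced = 'well-known idea'.translate(space_table)
print(joined)  # wellknown idea
print(spaced)  # well known idea
```

Replacing with spaces is usually the safer choice before tokenizing, since it never merges two words into one.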


3.2 Method 2: filter the tokenized words

from nltk import word_tokenize  # word tokenizer
import string  # provides the punctuation characters


punctuation = string.punctuation  # ASCII punctuation characters (a str)
punctuation = list(punctuation)  # convert the string to a list of characters

text = 'You must follow me carefully.'  # text to process
tokenized_word = word_tokenize(text)

# List comprehension: [expression  for-loop(s)  condition]
text_1 = [word for word in tokenized_word if word not in punctuation]
print(text_1)

Output of text_1:

['You', 'must', 'follow', 'me', 'carefully']

4. Removing stop words:

Note: stop words are noise words that carry little meaning on their own. Common English stop words include: is, am, a, are, the, an, to, for, ...

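import string
from nltk import word_tokenize  # word tokenizer
from nltk.corpus import stopwords  # stop-word corpus; download once with nltk.download('stopwords')


# Load the English stop words (the list is all lowercase)
stop_words = stopwords.words('english')
punctuation = list(string.punctuation)

text = 'I shall have to controvert one or two ideas that are almost universally accepted.'
# Tokenize
tokenized_text = word_tokenize(text)
# Filter out punctuation and stop words (the comparison is case-sensitive, so 'I' is kept)
text_1 = [word for word in tokenized_text if word not in punctuation and word not in stop_words]
print(text_1)

Output of text_1:

['I', 'shall', 'controvert', 'one', 'two', 'ideas', 'almost', 'universally', 'accepted']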
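The output above still contains 'I' because the stop-word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. To keep it runnable without NLTK data, it uses a tiny hand-written stop-word set and hand-written tokens; both are assumptions standing in for stopwords.words('english') and word_tokenize.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer
stop_words = {'i', 'have', 'to', 'or', 'that', 'are', 'the', 'a', 'is'}
punctuation = set(string.punctuation)

# Tokens written out by hand so the sketch runs without NLTK
tokens = ['I', 'shall', 'have', 'to', 'controvert', 'one', '.']

# Case-sensitive test: 'I' slips through because only 'i' is in the set
case_sensitive = [w for w in tokens
                  if w not in punctuation and w not in stop_words]

# Case-insensitive test: lowercase each token before the lookup
case_insensitive = [w for w in tokens
                    if w not in punctuation and w.lower() not in stop_words]

print(case_sensitive)    # ['I', 'shall', 'controvert', 'one']
print(case_insensitive)  # ['shall', 'controvert', 'one']
```

Storing the stop words in a set rather than a list also makes each lookup constant-time, which matters on longer texts.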

5. Word frequency extraction and plotting:

5.1 Extracting word frequencies

import nltk

paragraph = 'You must follow me carefully. I shall have to controvert one or two ideas that are almost universally accepted. The geometry, for instance, they taught you at school is founded on a misconception.'

# Tokenize
tokenized_word = nltk.word_tokenize(paragraph)

# Count word frequencies
word_freqs = nltk.FreqDist(w.lower() for w in tokenized_word)  # w.lower() converts each token to lowercase
word_freqs_dict = word_freqs.items()
print(word_freqs_dict)

Output of word_freqs_dict:

dict_items([('you', 2), ('must', 1), ('follow', 1), ('me', 1), ('carefully', 1), ('.', 3), ('i', 1), ('shall', 1), ('have', 1), ('to', 1), ('controvert', 1), ('one', 1), ('or', 1), ('two', 1), ('ideas', 1), ('that', 1), ('are', 1), ('almost', 1), ('universally', 1), ('accepted', 1), ('the', 1), ('geometry', 1), (',', 2), ('for', 1), ('instance', 1), ('they', 1), ('taught', 1), ('at', 1), ('school', 1), ('is', 1), ('founded', 1), ('on', 1), ('a', 1), ('misconception', 1)])
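The counts above include punctuation tokens such as '.' and ','. nltk.FreqDist is a subclass of collections.Counter, so the sketch below uses a plain Counter as a stand-in to show two common follow-up steps: dropping punctuation before counting, and asking for the most frequent entries with most_common. The token list is written by hand and is only illustrative.

```python
from collections import Counter
import string

punctuation = set(string.punctuation)

# Hand-written tokens so the sketch runs without NLTK
tokens = ['You', 'must', 'follow', 'me', '.', 'They', 'taught', 'you', 'at', 'school', '.']

# Lowercase the tokens and skip punctuation before counting
words = [t.lower() for t in tokens if t not in punctuation]
word_counts = Counter(words)

print(word_counts['you'])          # 2
print(word_counts.most_common(1))  # [('you', 2)]
```

The same most_common call should be available on a FreqDist, since it inherits from Counter.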

5.2 Plotting word frequencies

# Count word frequencies
word_freqs = nltk.FreqDist(w.lower() for w in tokenized_word)

# Plot the frequency of every token (requires matplotlib)
word_freqs.plot()

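5.3 Plotting the three most frequent words

# cumulative=True draws running totals; use cumulative=False to plot each word's own count
word_freqs.plot(3, cumulative=True)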
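With cumulative=True the plotted values are running totals over the most frequent samples, so the last point equals the combined count of the top three. The sketch below reproduces that arithmetic with the standard library only; it uses collections.Counter as a stand-in for FreqDist and a hand-written token list, and it reflects my understanding of the plot rather than NLTK's own code.

```python
from collections import Counter
from itertools import accumulate

# Hand-written tokens: 'the' x3, 'whale' x2, 'sea' x1
tokens = ['the', 'whale', 'the', 'sea', 'the', 'whale']

top = Counter(tokens).most_common(3)
counts = [count for _, count in top]

# Running totals, as a cumulative plot would show them
cumulative = list(accumulate(counts))

print(top)         # [('the', 3), ('whale', 2), ('sea', 1)]
print(cumulative)  # [3, 5, 6]
```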

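6. Word search

from nltk.book import *  # import all names from nltk.book (text1, text3, ...); needs nltk.download('book')


text1.concordance('boy')  # show occurrences of 'boy' with surrounding context (text1 is Moby Dick)
Output: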
Displaying 25 of 65 matches:
 ? Why is almost every robust healthy boy with a robust healthy soul in him , a
lings in a most direful manner . " My boy ," said the landlord , " you ' ll hav
idends . Rising from a little cabin - boy in short clothes of the drabbest drab
ain ' t Captain Peleg ; HE ' S AHAB , boy ; and Ahab of old , thou knowest , wa
 to have a wicked name . Besides , my boy , he has a wife -- not three voyages 
stors , and scolding her little black boy meantime . " Wood - house !" cried I 
Careful , careful !-- come , Bildad , boy -- say your last . Luck to ye , Starb
 , no ! he went before . Poor Alabama boy ! On the grim Pequod ' s forecastle ,
 t sleep then . Didn ' t that Dough - Boy , the steward , tell me that of a mor
r hold for , every night , as Dough - Boy tells me he suspects ; what ' s that 
in - Table . It is noon ; and Dough - Boy , the steward , thrusting his pale lo
 he was the youngest son , and little boy of this weary family party . His were
vious repast , often the pale Dough - Boy was fain to bring on a great baron of
ith a sudden humor , assisted Dough - Boy ' s memory by snatching him up bodily
ions of these three savages , Dough - Boy ' s whole life was one continual lip 
 much so , that the trembling Dough - Boy almost looked to see whether any mark
all tend to tranquillize poor Dough - Boy . How could he forget that in his Isl
vivial indiscretions . Alas ! Dough - Boy ! hard fares the white waiter who wai
O men , you will yet see that -- Ha ! boy , come back ? bad pennies come not so
ht bells there ! d ' ye hear , bell - boy ? Strike the bell eight , thou Pip ! 
CING ) Go it , Pip ! Bang it , bell - boy ! Rig it , dig it , stig it , quig it
, dig it , stig it , quig it , bell - boy ! Make fire - flies ; break the jingl
ness , have mercy on this small black boy down here ; preserve him from all men
cried Ahab . " Time ! time !" Dough - Boy hurried below , glanced at the watch 
fter hold for , so often , as Dough - Boy long suspected . They were hidden dow

 

text3.similar('time')  # find words used in contexts similar to 'time' (text3 is The Book of Genesis)

Output:

day land was days cattle all lord field stone east trees way sheep son
men plain people souls cities kings
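A concordance is essentially a keyword-in-context listing. As a rough illustration of the idea, here is a minimal standard-library sketch; the kwic helper is hypothetical and much simpler than NLTK's implementation. It matches case-insensitively, as the concordance output for 'boy' does with 'Boy', and it measures context in tokens rather than characters.

```python
def kwic(tokens, keyword, width=3):
    """Hypothetical helper: list each hit with up to `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append(f'{left} [{tok}] {right}'.strip())
    return lines

# Hand-written, pre-tokenized sample text
tokens = 'My boy , said the landlord , you are a lucky boy'.split()
hits = kwic(tokens, 'boy', width=2)
for line in hits:
    print(line)
# My [boy] , said
# a lucky [boy]
```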


From: https://blog.csdn.net/Hiweir/article/details/142086841
