
jieba Word Segmentation


Task: segment the text of Journey to the West (西游记) with jieba and report the 20 most frequent words.
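The script below leans on two jieba calls: jieba.lcut(text), which returns the segmentation as a plain list of strings, and jieba.add_word(word), which registers a custom dictionary entry so the segmenter keeps it intact. Note that add_word only influences later calls to lcut, never a list that has already been produced. A minimal sketch (the sample sentence is invented, and the exact token boundaries may vary by jieba version and dictionary):

import jieba

jieba.add_word("齐天大圣")            # register the custom entry first ...
print(jieba.lcut("齐天大圣到此一游"))  # ... so it survives segmentation,
                                      # e.g. ['齐天大圣', '到此', '一游']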

Input:

import jieba

excludes = {"一个", "我们", "怎么", "那里", "不知", "不是", "只见", "两个",
            "不敢", "这个", "如何", "原来", "甚么", "不曾", "闻言", "正是",
            "那怪", "一声"}

# Register the characters' names and aliases before segmenting; add_word
# only affects subsequent calls to lcut.
for name in ("孙悟空", "金公", "孙行者", "心猿", "齐天大圣", "斗战胜佛",
             "美猴王", "三藏法师", "玄奘", "金蝉子", "江流儿", "御弟",
             "沙僧", "沙和尚", "沙悟净", "刀圭", "黄婆", "悟能", "猪悟能",
             "猪刚鬣", "木母", "白龙马", "天龙马", "玉龙三太子",
             "八部天龙广力菩萨"):
    jieba.add_word(name)

with open("西游记1.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)

# Map every alias to a canonical character name while counting.
counts = {}
for word in words:
    if len(word) == 1:                      # skip single-character tokens
        continue
    elif word in {"师父", "三藏", "玄奘", "三藏法师", "金蝉子", "江流儿",
                  "御弟"}:
        rword = "唐僧"
    elif word in {"大圣", "老孙", "孙悟空", "美猴王", "孙行者", "齐天大圣",
                  "斗战胜佛", "金公", "心猿"}:
        rword = "悟空"
    elif word in {"悟能", "八戒", "猪悟能", "呆子", "木母", "猪刚鬣"}:
        rword = "猪八戒"
    elif word in {"沙僧", "沙悟净", "沙和尚", "刀圭", "黄婆"}:
        rword = "悟净"
    elif word in {"天龙马", "玉龙三太子", "八部天龙广力菩萨"}:
        rword = "白龙马"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Remove common stop words; pop() tolerates words that never occurred.
for word in excludes:
    counts.pop(word, None)

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:20]:
    print("{0:<10}{1:>5}".format(word, count))
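The final print call uses Python's format mini-language: "{0:<10}{1:>5}" left-aligns the word in a 10-character field and right-aligns its count in a 5-character field. A quick illustration (the name and count are invented):

print("{0:<10}{1:>5}".format("悟空", 4000))

One caveat: the width counts characters, not display columns, and CJK characters typically render two columns wide, so the columns can look ragged in a terminal. Padding with the full-width space, "{0:{1}<10}".format(word, chr(12288)), is a common workaround.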

Output:

(omitted in the original post)

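As a design note, the elif chain is really one big alias-to-canonical-name mapping, which can be expressed as a dictionary; combined with collections.Counter, the counting and reporting collapse considerably. The following is a sketch intended to be equivalent to the script above (same file name, alias groups, and stop-word set), except that it registers every alias with add_word, a slight generalization of the original list:

import jieba
from collections import Counter

# Each alias maps to the canonical character name.
ALIASES = {}
for canon, names in {
    "唐僧": ("师父", "三藏", "玄奘", "三藏法师", "金蝉子", "江流儿", "御弟"),
    "悟空": ("大圣", "老孙", "孙悟空", "美猴王", "孙行者", "齐天大圣",
             "斗战胜佛", "金公", "心猿"),
    "猪八戒": ("悟能", "八戒", "猪悟能", "呆子", "木母", "猪刚鬣"),
    "悟净": ("沙僧", "沙悟净", "沙和尚", "刀圭", "黄婆"),
    "白龙马": ("天龙马", "玉龙三太子", "八部天龙广力菩萨"),
}.items():
    for alias in names:
        ALIASES[alias] = canon

excludes = {"一个", "我们", "怎么", "那里", "不知", "不是", "只见", "两个",
            "不敢", "这个", "如何", "原来", "甚么", "不曾", "闻言", "正是",
            "那怪", "一声"}

for name in ALIASES:                # register aliases before segmenting
    jieba.add_word(name)

with open("西游记1.txt", encoding="utf-8") as f:
    words = jieba.lcut(f.read())

# Normalize aliases, drop single-character tokens, count, filter, report.
counts = Counter(ALIASES.get(w, w) for w in words if len(w) > 1)
for w in excludes:
    counts.pop(w, None)
for word, count in counts.most_common(20):
    print("{0:<10}{1:>5}".format(word, count))

The dictionary form makes adding a new alias a one-line change and removes the risk of a typo hiding in a long chain of == comparisons.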
From: https://www.cnblogs.com/ChenWenshi/p/17914004.html
