jieba-cant-extract-single-character

Time: 2023-11-13 15:57:54

Tags: jieba short wc character analyse words extract

jieba can't extract a single character

Subtitle: jieba cannot extract single characters
Created: 2023-11-13T15:28+08:00
Published: 2023-11-13T15:45+08:00

Take the sentence "我喜欢赵" ("I like Zhao") as an example, where 「赵」 stands in for a person's name, and extract keywords with jieba:

import jieba
import jieba.analyse
import jieba.posseg as pseg

short_content = "我喜欢赵"
short_words = pseg.lcut(short_content)  # [pair('我', 'r'), pair('喜欢', 'v'), pair('赵', 'nr')]; the part-of-speech tags are correct
jieba.analyse.extract_tags(short_content)  # ['喜欢']; but "赵" is missing from the extracted keywords

The cause is a piece of code in jieba/analyse/tfidf.py that filters out every string shorter than 2 characters:

# jieba/analyse/tfidf.py
# extract_tags()
freq = {}
for w in words:
    if allowPOS:
        if w.flag not in allowPOS:
            continue
        elif not withFlag:
            w = w.word
    wc = w.word if allowPOS and withFlag else w
    if len(wc.strip()) < 2 or wc.lower() in self.stop_words: # <--- LOOK HERE
        continue
    freq[w] = freq.get(w, 0.0) + 1.0
total = sum(freq.values())

The intent is to filter out whitespace-only strings (\s+) and punctuation, but it also discards person names written as a single character.
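To make the effect of that condition concrete, here is a minimal standalone sketch. It does not import jieba; `keep_token` is a hypothetical helper that only mirrors the length and stop word check from the excerpt, and the token list is made up for illustration.

```python
def keep_token(token, stop_words, min_len=2):
    """Return True if a token would survive the length and stop word check.

    Hypothetical helper that mirrors the condition in the excerpt:
    a token is dropped when its stripped length is below min_len
    or when its lowercase form is a stop word.
    """
    return len(token.strip()) >= min_len and token.lower() not in stop_words


# Illustrative tokens: two single-character words, one two-character word,
# a whitespace token and a punctuation mark.
tokens = ["我", "喜欢", "赵", " ", "。"]

kept = [t for t in tokens if keep_token(t, stop_words=set())]
print(kept)  # ['喜欢']
```

With the threshold at 2, whitespace and punctuation are dropped as intended, but so is every single-character word, including "赵".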

Steps to fix:

Because this modifies a third-party library directly, open the file with administrator privileges and change that line to:
```
if len(wc.strip()) < 1 or wc.lower() in self.stop_words:
```
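With this change only whitespace is filtered out; if whitespace is not filtered, an error is raised as well.
Add punctuation to the stop word list with jieba.analyse.set_stop_words('./path_to_stopwords.txt').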
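The two steps work together, which the following sketch tries to illustrate. It is not jieba's code: `count_terms` is a hypothetical function that imitates only the frequency-counting loop with the threshold lowered to 1, and the tokens and stop words are sample values.

```python
def count_terms(tokens, stop_words, min_len=1):
    """Count token frequencies the way the patched loop would.

    Hypothetical re-creation for illustration: whitespace-only tokens
    are dropped by the length check, punctuation only by the stop words.
    """
    freq = {}
    for token in tokens:
        if len(token.strip()) < min_len or token.lower() in stop_words:
            continue
        freq[token] = freq.get(token, 0.0) + 1.0
    return freq


tokens = ["我", "喜欢", "赵", " ", "。"]

# Threshold 1 alone: the name survives, but so does the punctuation mark.
without_stop_words = count_terms(tokens, stop_words=set())
print(list(without_stop_words))  # ['我', '喜欢', '赵', '。']

# Threshold 1 plus punctuation stop words: only real words remain.
with_stop_words = count_terms(tokens, stop_words={"。", "，"})
print(list(with_stop_words))  # ['我', '喜欢', '赵']
```

Lowering the threshold on its own lets punctuation through, which is why the stop word list is needed as the second step.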

Punctuation list attached; replace the spaces with newlines before use:

0 1 2 3 4 5 6 7 8 9 , . : ; " ' / [ ] { } \ | + - ) ( ) * & ^ % $ # @ ! ~ ` 。 ， ！ ？ ： ； 「 」 、 ： — … … — — － － ／ ［ ］ 【 】 （ ） ｛ ｝ 《 》 〈 〉 「 」 『 』 “ ” ‘ ’ ‵ ′ ＂ ＇ ﹃ ﹄ ﹁ ﹂ ﹏ ﹏ ︴ ︵ ︿ ﹀ ︹
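Replacing the spaces with newlines can be scripted. The sketch below is one possible way to do it; `write_stopwords`, the output file name, and the shortened sample string are assumptions made for illustration, so paste the full list above in place of `sample` when using it.

```python
import tempfile
from pathlib import Path


def write_stopwords(space_separated, path):
    """Split a space-separated list and write one entry per line.

    Hypothetical helper; returns the entries that were written.
    """
    entries = space_separated.split()
    Path(path).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return entries


# Shortened sample of the punctuation list, for illustration only.
sample = ", . ! ? 。 ， ！ ？"

# Assumed output location: a file in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "jieba_punctuation_stopwords.txt"
entries = write_stopwords(sample, out_path)
print(out_path.read_text(encoding="utf-8").splitlines())
```

The resulting file path can then be passed to jieba.analyse.set_stop_words, as described in the steps above.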

Remember to revert the change when you are done.

Reference: GitHub - fxsjy/jieba: 结巴中文分词 (Jieba Chinese word segmentation)

From： https://www.cnblogs.com/ticlab/p/17829330.html
