
Keyword Extraction Algorithms

Date: 2023-02-14 15:25:30
Tags: extraction, language, doc, keywords, ngram, algorithm, kw, import

KeyBERT (better results on English than on Chinese)

Link: https://hidadeng.github.io/blog/keybert_tutorial/

 

Usage:

!pip3 install keybert==0.5.1
!pip3 install gensim==3.8.3
from keybert import KeyBERT
import jieba

# Multilingual sentence-transformers model used as the embedding backend
bertModel = KeyBERT('distiluse-base-multilingual-cased')
# bertModel = KeyBERT('distilbert-base-nli-mean-tokens')
doc = "Primovist 10ml pre-filled glass syringes"

# Tokenize with jieba and re-join with spaces (this step matters for Chinese text)
doc = " ".join(jieba.cut(doc))
kw = bertModel.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=10)
print(kw)

# Chinese text, with a spaCy pipeline as the embedding backend
from keybert import KeyBERT
import spacy
import jieba

zh_model = spacy.load("zh_core_web_sm")
bertModel = KeyBERT(model=zh_model)
# Chinese test data

doc = "各有关单位:   为提高体外诊断试剂临床试验的科学性与合理性,优化临床试验技术要求和管理要求,按照中共中央办公厅、国务院办公厅印发《关于深化审评审批制度改革鼓励药品医疗器械创新的意见》(厅字〔2017〕42号)总体要求和国家药品监督管理局的统一部署,结合《医疗器械监督管理条例》及配套规章的修改情况,我中心对《体外诊断试剂临床试验技术指导原则》(国家食品药品监督管理总局通告 2014年第16号)进行修订,形成了《体外诊断试剂临床试验指导原则(征求意见稿)》。"

# Join the jieba tokens with spaces so the text is word-delimited like English
doc = ' '.join(jieba.lcut(doc))


# Extract keywords
keywords = bertModel.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=10)
print(keywords)

# English text
from keybert import KeyBERT
import spacy

doc="Primovist 10ml pre-filled glass syringes: Possible issue with plunger resistance A Dear Healthcare Professional Letter has been issued by Bayer to inform healthcare professionals of possible increased plunger resistance with single Primovist 10 mL pre-filled glass syringes. "

# Use the spaCy English pipeline as the embedding backend
en_model = spacy.load("en_core_web_sm")
kw_model = KeyBERT(model=en_model)
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=20)
print(keywords)

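Common extract_keywords parameters

bertModel.extract_keywords(docs, keyphrase_ngram_range, stop_words, top_n)

  • docs: the document string (words separated by spaces)
  • keyphrase_ngram_range: the n-gram range for candidate keyphrases, default (1, 1)
  • stop_words: list of stop words
  • top_n: number of keywords to return, default 5
  • highlight: print the document with the keywords highlighted, default False
  • use_maxsum: default False; whether to select keywords with Max Sum Similarity
  • use_mmr: default False; whether to select keywords with Maximal Marginal Relevance (MMR)
  • diversity: only used when use_mmr=True; a value between 0 and 1, where higher values give more diverse keywords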
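As a rough illustration of what use_mmr and diversity control, the sketch below implements Maximal Marginal Relevance in plain Python over toy 2-D vectors. The helper names (cosine, mmr_select), the vectors, and the candidate words are invented for illustration; this is not KeyBERT's internal code, which works on real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(doc_vec, cand_vecs, cands, top_n=2, diversity=0.5):
    """Hypothetical helper: pick top_n candidates, trading relevance to the
    document against similarity to the candidates already picked."""
    relevance = [cosine(doc_vec, v) for v in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cands)), key=lambda i: relevance[i])]
    remaining = [i for i in range(len(cands)) if i not in selected]
    while remaining and len(selected) < top_n:
        def score(i):
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [cands[i] for i in selected]


# Toy data (assumed): two near-duplicate candidates and one distinct candidate.
doc_vec = [1.0, 0.2]
cands = ["syringe", "syringes", "plunger"]
cand_vecs = [[1.0, 0.0], [0.98, 0.05], [0.4, 0.9]]

low = mmr_select(doc_vec, cand_vecs, cands, top_n=2, diversity=0.0)
high = mmr_select(doc_vec, cand_vecs, cands, top_n=2, diversity=0.9)
print("diversity=0.0:", low)
print("diversity=0.9:", high)
```

With diversity close to 0 the second pick is simply the next most relevant candidate, even if it nearly duplicates the first; with a high diversity the near-duplicate is penalized and a less similar candidate is chosen instead.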

For the keyphrase_ngram_range parameter:

  • (1, 1): single words only
  • (2, 2): two-word phrases only
  • (1, 2): both single words and two-word phrases
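To make the three settings concrete, here is a minimal sketch that enumerates candidate n-grams from a space-delimited string. The function candidate_ngrams is a made-up helper that splits on whitespace only; KeyBERT builds its candidates with its own vectorizer, so the real candidate list may differ (for example in lowercasing and punctuation handling).

```python
def candidate_ngrams(text, ngram_range=(1, 1)):
    """Hypothetical helper: list every n-gram of whitespace-separated tokens
    whose length n lies inside ngram_range (inclusive on both ends)."""
    tokens = text.split()
    shortest, longest = ngram_range
    candidates = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    return candidates


text = "glass syringes plunger"
for rng in [(1, 1), (2, 2), (1, 2)]:
    print(rng, candidate_ngrams(text, rng))
```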

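The installed spaCy version may not match the version of "zh_core_web_sm" or "en_core_web_sm".

If the model will not install, download it first and then install it from the local file: pip install /path/to/en_core_web_sm-2.2.5.tar.gz

Download location: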

https://github.com/explosion/spacy-models/tree/master/meta
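Before downloading anything, it can help to see which versions are actually installed. The sketch below uses only the standard library; installed_versions is a hypothetical helper, and it assumes the models were installed as pip packages under the names shown. spaCy model packages follow the spaCy version in their first two numbers (en_core_web_sm-2.2.5 is built for spaCy 2.2.x), and spaCy also ships its own check, python -m spacy validate.

```python
from importlib import metadata


def installed_versions(packages=("spacy", "zh_core_web_sm", "en_core_web_sm")):
    """Hypothetical helper: map each package name to its installed version,
    or to None when the package is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If the first two numbers of a model version differ from those of spaCy, pick the matching model archive from the download page.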

 

YAKE! (poor results on Chinese; supports 20+ other languages)

Link: https://github.com/LIAAD/yake

Usage: install directly from GitHub

pip install git+https://github.com/LIAAD/yake
import yake
text = "近日,国家药品监督管理局经审查,批准了腾讯医疗健康(深圳)有限公司生产的^慢性青光眼样视神经病变眼底图像辅助诊断软件^创新产品注册申请。"

language = "zh"
# language = "en"
# language = "el"
# language = "pt"
# language = "ar"
max_ngram_size = 1
deduplication_threshold = 0.9  # deduplication threshold
deduplication_algo = 'seqm'  # deduplication algorithm
windowSize = 1
numOfKeywords = 20

# custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold,
#                                             dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords)

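# Load the Chinese stop-word list shipped with yake (adjust the path to your own installation)
with open(r'D:\app\Python310\Lib\site-packages\yake\StopwordsList\stopwords_zh.txt', 'r', encoding='utf-8') as f:
    # One stop word per line; pass a set of words rather than the raw file contents
    stop_words = set(f.read().lower().split("\n"))

custom_kw_extractor = yake.KeywordExtractor(lan=language, stopwords=stop_words)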
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)

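YAKE! supports many languages.

Change the language with, for example, language = "ar". The value is the two-letter code of the language (for example "en", "pt", "el", "ar", "zh"). After installing yake, the stop-word list for each language can be found in the StopwordsList folder of the installed package.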
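Rather than hard-coding the site-packages path, the stop-word file can be located from the installed package. This is a minimal sketch: stopwords_path and load_stopwords are hypothetical helpers, and the StopwordsList/stopwords_<code>.txt layout is assumed from the path used in the example above.

```python
import importlib.util
from pathlib import Path


def stopwords_path(lang, package="yake"):
    """Hypothetical helper: build the expected path of the stop-word file for
    a two-letter language code, or return None if the package is missing."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    # Assumed layout: <package dir>/StopwordsList/stopwords_<code>.txt
    return Path(spec.origin).parent / "StopwordsList" / f"stopwords_{lang[:2].lower()}.txt"


def load_stopwords(path):
    """Read one stop word per line into a lowercase set, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


path = stopwords_path("zh")
if path is not None and path.exists():
    print(f"{len(load_stopwords(path))} stop words loaded from {path}")
else:
    print("stop-word file not found (is yake installed?)")
```

The resulting set can be passed as the stopwords argument of yake.KeywordExtractor.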

 

 

TF-IDF (for Chinese)

Reference: https://blog.csdn.net/asialee_bird/article/details/81486700

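import jieba.analyse

text = "Affects the quality of early treatment of patients with blood poisoning Supervision of health services must contribute to safer services with a higher quality. But little is known about the extent to which supervision succeeds in achieving these goals. "

# allowPOS restricts the result to the listed jieba part-of-speech tags; for Chinese text,
# allowPOS=('ns', 'n', 'vn', 'v') keeps place names, nouns, verbal nouns and verbs.
# English tokens are tagged 'eng', so that filter would drop every word of this sample; it is left empty here.
kw = jieba.analyse.extract_tags(text, topK=20, withWeight=True, allowPOS=())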
print(kw)
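For readers who want to see the weighting itself, here is a small TF-IDF sketch in plain Python. It is not jieba's implementation: jieba looks up IDF values in a prebuilt table, whereas this hypothetical tfidf_keywords helper estimates IDF from a tiny toy corpus of pre-tokenized documents.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Hypothetical helper: rank the tokens of one document by
    term frequency times a smoothed inverse document frequency."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each word once per document
    term_freq = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {
        word: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy corpus (assumed), already split into words.
corpus = [
    ["supervision", "of", "health", "services"],
    ["quality", "of", "early", "treatment"],
    ["supervision", "of", "supervision", "goals"],
]
ranked = tfidf_keywords(corpus[2], corpus)
print(ranked)
```

A word that occurs in every document, such as "of" here, gets a score of zero, while a word that is frequent in one document but rare elsewhere rises to the top.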

 

RAKE (works well on English)

Link:

https://github.com/laserwave/keywords_extraction_rake

See the link for usage.
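The general RAKE idea can be sketched in a few lines: split the text into candidate phrases at stop words and punctuation, score each word by degree divided by frequency, and score each phrase by the sum of its word scores. The code below is an illustrative sketch with a tiny made-up stop-word list, not the code from the linked repository.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (assumed); real lists are much longer.
STOPWORDS = {"a", "an", "and", "by", "for", "has", "in", "is", "of", "the", "to", "with"}


def rake(text, stopwords=STOPWORDS, top_k=3):
    """Hypothetical helper: return the top_k (phrase, score) pairs."""
    phrases = []
    # Punctuation and stop words both act as phrase boundaries.
    for fragment in re.split(r"[.,:;!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9\-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    word_score = {word: degree[word] / freq[word] for word in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


text = "Possible issue with plunger resistance of pre-filled glass syringes."
results = rake(text)
for phrase, score in results:
    print(f"{score:.1f}  {phrase}")
```

Longer phrases made of words that rarely appear alone score highest, which is why RAKE tends to favor multi-word keyphrases.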

From: https://www.cnblogs.com/avivi/p/17119666.html
