
Chinese Text Typo Detection and Automatic Correction

Date: 2022-11-03 11:31:27
Tags: text, candidate, misspelled, pinyin, typo, phrase, correction, correct, dict

Machine Learning AI Algorithm Engineering   WeChat official account: datayx


How to use:

  • Run in the terminal: python Autochecker4Chinese.py
  • The script prints its result to the terminal:

[Figure: screenshot of the program output]



Getting the code and run instructions:

Follow the WeChat official account datayx and reply with 纠错 ("correction") to receive them.


1. Make a detector

  • Construct a dict for detecting misspelled Chinese phrases: each key is a Chinese phrase, and its value is the frequency with which that phrase appears in a corpus.
  • You can build this dict by collecting a corpus from the internet, or take the easier route and load a dict that someone else has already created. Here we choose the second way and construct the dict from a file.
  • The detector works like this: any phrase that does not appear in the dict is flagged as a misspelled phrase.
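
As a minimal sketch of the detector described above, the snippet below builds the phrase-to-frequency dict and performs the lookup. The post does not state the file format, so this assumes one "phrase frequency [tag]" entry per line (the layout used by jieba's dict.txt); the names build_word_freq, load_word_freq and is_misspelled are hypothetical and chosen only for illustration.

```python
def build_word_freq(lines):
    """Build {phrase: frequency} from lines shaped like 'phrase frequency [tag]'.

    The line format is an assumption; adapt the parsing to your dict file.
    """
    word_freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        word_freq[parts[0]] = word_freq.get(parts[0], 0) + int(parts[1])
    return word_freq


def load_word_freq(path):
    """Read the dict file at `path` (hypothetical location) as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return build_word_freq(handle)


def is_misspelled(phrase, word_freq):
    """The detector rule: a phrase missing from the dict counts as misspelled."""
    return phrase not in word_freq


# Tiny in-memory example instead of a real dict file.
word_freq = build_word_freq(["中国 100 ns", "人民 50 n", "", "malformed"])
print(is_misspelled("中国", word_freq), is_misspelled("中郭", word_freq))
```

One consequence of this rule is that any correct phrase missing from the dict is also flagged, so the coverage of the dict file directly limits the detector's precision.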




2. Make an autocorrector

  • For each misspelled phrase, use edit distance to generate a list of candidate corrections.
  • Sort the candidate list by the likelihood of each candidate being the correct phrase, based on the following rules:
  • If the candidate's pinyin matches the misspelled phrase's pinyin exactly, put the candidate in the first tier; these are the most likely corrections.
  • Otherwise, if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, put the candidate in the second tier.
  • Otherwise, put the candidate in the third tier.
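
The candidate generation and three-tier ranking above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original implementation: candidates are limited to edit distance 1, the replacement alphabet is taken from the characters in the dict, ties inside a tier are broken by frequency, and to_pinyin is an injected function. The toy pinyin table covers only the demo characters; in practice a library such as pypinyin (its lazy_pinyin function) could play that role. All function names here are hypothetical.

```python
def edits1(phrase, charset):
    """Return every string one edit (delete, transpose, replace, insert) away."""
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in charset]
    inserts = [left + ch + right for left, right in splits for ch in charset]
    return set(deletes + transposes + replaces + inserts)


def candidates(phrase, word_freq):
    """Keep only the one-edit variants that are known dictionary phrases."""
    charset = {ch for word in word_freq for ch in word}  # assumed alphabet
    return {c for c in edits1(phrase, charset) if c in word_freq}


def rank_candidates(phrase, cands, word_freq, to_pinyin):
    """Sort by tier: exact pinyin match, first-character match, everything else."""
    target = to_pinyin(phrase)

    def tier(cand):
        py = to_pinyin(cand)
        if py == target:
            return 0
        if py[:1] == target[:1]:
            return 1
        return 2

    # Frequency as a tie-breaker inside a tier is an assumption of this sketch.
    return sorted(cands, key=lambda c: (tier(c), -word_freq[c]))


def auto_correct(phrase, word_freq, to_pinyin):
    """Return the top-ranked candidate, or the phrase unchanged if none exists."""
    ranked = rank_candidates(phrase, candidates(phrase, word_freq),
                             word_freq, to_pinyin)
    return ranked[0] if ranked else phrase


# Toy pinyin table for the demo only (hypothetical stand-in for a pinyin library).
TOY_PINYIN = {"中": "zhong", "国": "guo", "郭": "guo", "间": "jian", "城": "cheng"}


def toy_pinyin(phrase):
    return [TOY_PINYIN.get(ch, ch) for ch in phrase]


word_freq = {"中国": 100, "中间": 80, "城郭": 5}
ranked = rank_candidates("中郭", candidates("中郭", word_freq), word_freq, toy_pinyin)
print(ranked)
```

In the demo, 中国 shares the full pinyin of the misspelled 中郭, 中间 shares only the first character's pinyin, and 城郭 shares neither, so each candidate lands in a different tier.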




3. Correct the misspelled phrases in a sentence

  • For any given sentence, use jieba to segment it into phrases.
  • Once segmentation is done, check whether each remaining phrase exists in the word_freq dict; if it does not, it is a misspelled phrase.
  • Use the auto_correct function to correct each misspelled phrase.
  • Output the corrected sentence.
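
The sentence-level steps above might look like the sketch below. It is an assumption-laden illustration: the segmenter and the correction function are passed in as callables so the example runs on its own, tokens that are not purely Chinese (such as punctuation) are left untouched, and the names correct_sentence and is_chinese are hypothetical. When no segmenter is supplied, the sketch falls back to jieba.lcut, which requires the third-party jieba package.

```python
def is_chinese(token):
    """True if every character lies in the basic CJK Unified Ideographs range."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def correct_sentence(sentence, word_freq, auto_correct, segment=None):
    """Segment the sentence, correct tokens missing from word_freq, and rejoin."""
    if segment is None:
        import jieba  # third-party segmenter named in the text
        segment = jieba.lcut
    corrected = []
    for token in segment(sentence):
        if token in word_freq or not is_chinese(token):
            corrected.append(token)  # known phrase, or punctuation: keep as is
        else:
            corrected.append(auto_correct(token))  # misspelled: replace
    return "".join(corrected)


# Stand-ins for the demo: a fixed segmentation and a lookup-based corrector.
demo_word_freq = {"我": 10, "爱": 10, "中国": 100}


def demo_segment(sentence):
    return ["我", "爱", "中郭", "!"]


def demo_auto_correct(token):
    return {"中郭": "中国"}.get(token, token)


print(correct_sentence("我爱中郭!", demo_word_freq, demo_auto_correct, demo_segment))
```

Injecting the segmenter keeps the correction loop easy to exercise without jieba installed; with a real segmenter, the result depends on how the misspelled text happens to be split.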


From: https://blog.51cto.com/u_15404184/5819255
