Machine Learning AI Algorithm Engineering, WeChat official account: datayx
How to use:
- Run in the terminal: python Autochecker4Chinese.py
- You will get the following result:
Getting the code and the run tutorial:
Follow the WeChat official account datayx and reply 纠错 to get them.
1. Make a detector
- Construct a dict for detecting misspelled Chinese phrases: each key is a Chinese phrase, and its value is the frequency with which that phrase appears in a corpus.
- You can build this dict by collecting a corpus from the internet, or you can take the easier route and load dicts already created by others. Here we choose the second way and construct the dict from a file.
- The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.
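The detector described above can be sketched as follows. This is only a minimal illustration, not the code from Autochecker4Chinese.py: the function names `load_word_freq` and `is_misspelled`, and the "phrase frequency" line format (whitespace-separated), are assumptions made for the example.

```python
def load_word_freq(lines):
    """Build a phrase -> frequency dict from lines of the form 'phrase count'.

    The whitespace-separated 'phrase count' format is an assumption;
    adapt the parsing to whatever dictionary file you actually load.
    """
    word_freq = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phrase, count = parts[0], int(parts[1])
        # Sum the counts if the same phrase appears more than once.
        word_freq[phrase] = word_freq.get(phrase, 0) + count
    return word_freq


def is_misspelled(phrase, word_freq):
    """A phrase is flagged as misspelled when it is absent from the dict."""
    return phrase not in word_freq


# Toy data standing in for a real dictionary file.
sample_lines = ["天气 500", "天空 300", "", "空气 400"]
word_freq = load_word_freq(sample_lines)
print(is_misspelled("天汽", word_freq))  # True
print(is_misspelled("天气", word_freq))  # False
```

To load from a real file, the same function can be given an open file object, for example `with open(path, encoding="utf-8") as f: word_freq = load_word_freq(f)`, where `path` points to your own dictionary file.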
2. Make an autocorrector
- To correct a misspelled phrase, we use edit distance to build a list of candidate corrections for it.
- We sort the candidate list by the likelihood of each candidate being the correct phrase, based on the following rules:
- If the candidate's pinyin matches the misspelled phrase's pinyin exactly, we put the candidate in the first tier, which means it is among the most likely phrases to be selected.
- Else if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second tier.
- Otherwise, we put the candidate in the third tier.
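The candidate generation and three-tier ranking above might look roughly like the sketch below. Several things here are assumptions for illustration only: the tiny `WORD_FREQ` and `CHAR_PINYIN` tables are hand-made toy data (a real system would use a full dictionary and a pinyin library), candidates are limited to edit distance 1, and ordering candidates by frequency inside each tier is a choice made for this example rather than something the text specifies.

```python
# Toy data (hypothetical): phrase frequencies and a per-character pinyin table.
WORD_FREQ = {"天气": 500, "天空": 300, "空气": 400, "田七": 20, "水汽": 50}
CHAR_PINYIN = {"天": "tian", "气": "qi", "空": "kong", "田": "tian",
               "七": "qi", "水": "shui", "汽": "qi"}
# Characters used for replace/insert edits: every character in the dict.
CHARSET = set("".join(WORD_FREQ))


def edits1(phrase, charset):
    """All strings one edit (delete, transpose, replace, insert) away."""
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in charset]
    inserts = [l + c + r for l, r in splits for c in charset]
    return set(deletes + transposes + replaces + inserts)


def pinyin_of(phrase):
    """Per-character pinyin; unknown characters fall back to themselves."""
    return [CHAR_PINYIN.get(ch, ch) for ch in phrase]


def rank_candidates(misspelled, word_freq, charset):
    """Return known candidates ordered by the three pinyin tiers."""
    candidates = {p for p in edits1(misspelled, charset) if p in word_freq}
    target = pinyin_of(misspelled)
    tiers = ([], [], [])
    for cand in candidates:
        cand_pinyin = pinyin_of(cand)
        if cand_pinyin == target:
            tiers[0].append(cand)        # full pinyin match
        elif cand_pinyin[:1] == target[:1]:
            tiers[1].append(cand)        # first character's pinyin matches
        else:
            tiers[2].append(cand)        # everything else
    ranked = []
    for tier in tiers:
        # Assumption: within a tier, more frequent phrases come first.
        ranked.extend(sorted(tier, key=lambda p: word_freq[p], reverse=True))
    return ranked


def auto_correct(misspelled, word_freq, charset):
    """Pick the top-ranked candidate, or keep the phrase if none is found."""
    ranked = rank_candidates(misspelled, word_freq, charset)
    return ranked[0] if ranked else misspelled


print(rank_candidates("天汽", WORD_FREQ, CHARSET))  # ['天气', '天空', '水汽']
print(auto_correct("天汽", WORD_FREQ, CHARSET))     # 天气
```

In this toy run, 天气 shares the full pinyin "tian qi" with the misspelled 天汽, 天空 shares only the first character's pinyin, and 水汽 shares neither, so the three candidates land in the first, second, and third tiers respectively.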
3. Correct the misspelled phrases in a sentence
- For any given sentence, use jieba to do the segmentation.
- Once segmentation is done, take the resulting segment list and check whether each remaining phrase exists in the word_freq dict; if it does not, it is a misspelled phrase.
- Use the auto_correct function to correct the misspelled phrase.
- Output the corrected sentence.
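The sentence-level steps above can be sketched as below. To keep the example runnable without extra packages, the segmenter and the corrector are passed in as functions: `toy_segment` and `toy_auto_correct` are hypothetical stand-ins, and in practice one would presumably pass `jieba.lcut` and the real auto_correct function instead. Skipping non-Chinese tokens such as punctuation is also an assumption of this sketch.

```python
def is_chinese(token):
    """True when every character is a CJK unified ideograph."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def correct_sentence(sentence, word_freq, segment, auto_correct):
    """Segment the sentence, fix phrases missing from word_freq, rejoin."""
    corrected = []
    for token in segment(sentence):
        # Assumption: only Chinese tokens are checked; punctuation is kept.
        if is_chinese(token) and token not in word_freq:
            token = auto_correct(token)
        corrected.append(token)
    return "".join(corrected)


# Toy stand-ins (hypothetical) for the dictionary, segmenter and corrector.
word_freq = {"今天": 100, "天气": 500, "很好": 80}
fixes = {"天汽": "天气"}


def toy_segment(sentence):
    """Split Chinese text into 2-character chunks; other chars stand alone."""
    tokens, buf = [], ""
    for ch in sentence:
        if is_chinese(ch):
            buf += ch
            if len(buf) == 2:
                tokens.append(buf)
                buf = ""
        else:
            if buf:
                tokens.append(buf)
                buf = ""
            tokens.append(ch)
    if buf:
        tokens.append(buf)
    return tokens


def toy_auto_correct(phrase):
    """Look the phrase up in a fixed table; keep it when no fix is known."""
    return fixes.get(phrase, phrase)


result = correct_sentence("今天天汽很好。", word_freq, toy_segment, toy_auto_correct)
print(result)  # 今天天气很好。
```

Passing the segmenter and corrector as parameters keeps the sentence-level logic independent of any particular segmentation library, which also makes it easy to test with small hand-made inputs.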
Tags: text, candidate, misspelled, pinyin, typo, phrase, error correction, correct, dict From: https://blog.51cto.com/u_15404184/5819255