How to use:

  • Run in the terminal: python Autochecker4Chinese.py
  • You will get the following result:


1. Make a detector

  • Construct a dict to detect misspelled Chinese phrases: each key is a Chinese phrase, and its value is the frequency with which that phrase appears in the corpus.
  • You can do this by collecting a corpus from the internet or, more easily, by loading dicts that others have already created. Here we take the second approach and construct the dict from a file.
  • The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.
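
As a rough illustration of the detector, here is a minimal sketch. It assumes the dict file holds one entry per line in the form "<phrase> <frequency>", optionally followed by extra columns; that layout, and the names build_word_freq and is_misspelled, are assumptions for this example rather than the script's actual code. The sample lines stand in for the lines of a real file.

```python
from typing import Dict, Iterable


def build_word_freq(lines: Iterable[str]) -> Dict[str, int]:
    """Parse '<phrase> <frequency> [extra columns]' lines into a dict.

    The line layout is an assumption made for this sketch.
    """
    word_freq: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        try:
            word_freq[parts[0]] = int(parts[1])
        except ValueError:
            continue  # skip lines whose frequency is not an integer
    return word_freq


def is_misspelled(phrase: str, word_freq: Dict[str, int]) -> bool:
    """A phrase counts as misspelled when it is absent from the dict."""
    return phrase not in word_freq


# In-memory sample standing in for the lines of a dict file.
sample_lines = ["中国 5000", "北京 3000 ns", "", "天气 1200"]
word_freq = build_word_freq(sample_lines)

print(is_misspelled("天气", word_freq))  # False: the phrase is in the dict
print(is_misspelled("田气", word_freq))  # True: the phrase is not in the dict
```

With a real file, the same function could be fed an open file handle, for example build_word_freq(open(path, encoding="utf-8")), where path is whatever dict file you load.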


2. Make an autocorrector

  • Make an autocorrector for misspelled phrases: we use edit distance to build a list of correction candidates for each misspelled phrase.
  • We sort the candidate list by the likelihood of each candidate being the correct phrase, according to the following rules:
  • If the candidate's pinyin exactly matches the misspelled phrase's pinyin, we put the candidate in the first group, meaning it is among the phrases most likely to be selected.
  • Otherwise, if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second group.
  • Otherwise, we put the candidate in the third group.
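
The candidate generation and the three ranking rules above might be sketched as follows. Several things here are assumptions for illustration: candidates are limited to edit distance 1, the replacement characters are drawn from the dict's own phrases, ties inside a group are broken by frequency, and toy_pinyin is a tiny hand-written lookup standing in for a real pinyin converter.

```python
from typing import Callable, Dict, Iterable, List, Set


def edits1(phrase: str, charset: Iterable[str]) -> Set[str]:
    """All strings one edit (delete, transpose, replace, insert) away."""
    chars = list(charset)
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in chars]
    inserts = [l + c + r for l, r in splits for c in chars]
    return set(deletes + transposes + replaces + inserts)


def rank_candidates(
    wrong: str,
    word_freq: Dict[str, int],
    to_pinyin: Callable[[str], List[str]],
) -> List[str]:
    """Return known candidates, best first, using the three pinyin rules."""
    # Assumption: the usable characters are those seen in the dict.
    charset = {ch for phrase in word_freq for ch in phrase}
    candidates = {c for c in edits1(wrong, charset) if c in word_freq}
    wrong_py = to_pinyin(wrong)

    def group(candidate: str) -> int:
        py = to_pinyin(candidate)
        if py == wrong_py:
            return 0  # full pinyin match
        if py[:1] == wrong_py[:1]:
            return 1  # first character's pinyin matches
        return 2  # everything else

    # Assumption: inside a group, more frequent phrases come first.
    return sorted(candidates, key=lambda c: (group(c), -word_freq[c]))


# Hypothetical lookup table; a real converter would replace this.
TOY_PINYIN = {"天": "tian", "田": "tian", "气": "qi", "地": "di", "空": "kong"}


def toy_pinyin(phrase: str) -> List[str]:
    return [TOY_PINYIN.get(ch, ch) for ch in phrase]


word_freq = {"天气": 1200, "田地": 300, "空气": 800}
ranked = rank_candidates("田气", word_freq, toy_pinyin)
print(ranked)  # ['天气', '田地', '空气']
```

Here "天气" shares the full pinyin "tian qi" with the misspelled "田气", "田地" shares only the first syllable, and "空气" shares neither, so they land in the first, second, and third groups respectively.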



3. Correct misspelled phrases in a sentence

  • For any given sentence, use jieba to do the segmentation.
  • Once segmentation produces the segment list, check whether each phrase exists in the word_freq dict; if it does not, it is a misspelled phrase.
  • Use the auto_correct function to correct the misspelled phrase.
  • Output the corrected sentence.
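
The four steps above could be wired together roughly like this. To keep the sketch runnable without extra packages, the segmenter and the corrector are passed in as functions: toy_segment and toy_auto_correct are hypothetical stand-ins, and correct_sentence is a name chosen for this example. In real use the segmenter would presumably be jieba.cut and the corrector the auto_correct function mentioned above.

```python
from typing import Callable, Dict, Iterable, List


def correct_sentence(
    sentence: str,
    segment: Callable[[str], Iterable[str]],
    word_freq: Dict[str, int],
    auto_correct: Callable[[str], str],
) -> str:
    """Segment the sentence, fix unknown phrases, and join the result."""
    corrected: List[str] = []
    for phrase in segment(sentence):
        if phrase in word_freq:
            corrected.append(phrase)  # known phrase: keep as is
        else:
            corrected.append(auto_correct(phrase))  # unknown: try to fix
    return "".join(corrected)


def toy_segment(sentence: str) -> List[str]:
    """Stand-in segmenter: split into two-character chunks."""
    return [sentence[i:i + 2] for i in range(0, len(sentence), 2)]


# Stand-in corrector backed by a fixed table of known fixes.
TOY_FIXES = {"田气": "天气"}


def toy_auto_correct(phrase: str) -> str:
    return TOY_FIXES.get(phrase, phrase)


word_freq = {"今天": 2000, "天气": 1200, "很好": 900}
result = correct_sentence("今天田气很好", toy_segment, word_freq, toy_auto_correct)
print(result)  # 今天天气很好
```

Passing the segmenter and corrector as arguments is only a convenience for this sketch; it makes the control flow easy to try out before plugging in the real components.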




