
autochecker4chinese's Introduction

A solution for automatic spell checking and correction of Chinese text

How to use:

  • Run in the terminal: python Autochecker4Chinese.py
  • The script prints each misspelled phrase it finds next to its correction, followed by the corrected sentence.

1. Make a detector

  • Construct a dict to detect misspelled Chinese phrases: the key is a Chinese phrase and the value is the frequency with which it appears in the corpus.
  • You can build this dict by collecting a corpus from the internet, or take the easier route and load a dict that others have already created. Here we choose the second way and construct the dict from a file.
  • The detector works as follows: any phrase that does not appear in this dict is flagged as a misspelled phrase.
def construct_dict( file_path ):
    
    word_freq = {}
    with open(file_path, "r") as f:
        for line in f:
            info = line.split()
            word = info[0]
            frequency = int(info[1])  # store as int so frequencies compare numerically
            word_freq[word] = frequency
    
    return word_freq
FILE_PATH = "./token_freq_pos%40350k_jieba.txt"
phrase_freq = construct_dict( FILE_PATH )
print( type(phrase_freq) )
print( len(phrase_freq) )
<type 'dict'>
349045
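
The detection rule can be sketched in Python 3 as a plain membership test. This is a minimal illustration, not the repository's code: the tiny `toy_freq` dict and the `is_misspelled` helper are assumptions made for the example, standing in for the dict loaded above.

```python
# Toy stand-in for the phrase -> frequency dict built by construct_dict
# (the entries and counts here are made up for illustration).
toy_freq = {"呕吐": 1200, "沙龙": 800, "东方之珠": 150}


def is_misspelled(phrase, phrase_freq):
    """Flag a phrase as misspelled when it is absent from the dict."""
    return phrase not in phrase_freq


print(is_misspelled("呕吐", toy_freq))  # known phrase
print(is_misspelled("呕涂", toy_freq))  # unknown phrase
```

One consequence of this rule is that a valid phrase missing from the dict is also flagged, so the coverage of the dict directly affects the false-positive rate.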

2. Make an autocorrector

  • To correct a misspelled phrase, we use edit distance to build a list of candidate corrections: every phrase that is one edit (delete, transpose, replace or insert) away from the misspelled phrase and that exists in the dict.
  • We sort the candidate list by the likelihood of each candidate being the correct phrase, based on the following rules:
    • If the candidate's pinyin exactly matches the misspelled phrase's pinyin, we put the candidate in the first order, meaning it is among the most likely phrases to be selected.
    • Else if the pinyin of the candidate's first character matches the pinyin of the misspelled phrase's first character, we put the candidate in the second order.
    • Otherwise, we put the candidate in the third order.
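
The three ordering rules can be sketched in Python 3 as a small tier function. This is an illustration under stated assumptions: `TOY_PINYIN`, `to_pinyin` and `rank_tier` are hypothetical names, and the hand-written lookup table stands in for the `pinyin` package so that the example has no external dependency.

```python
# Hypothetical per-character pinyin table; the real code asks the
# `pinyin` package instead. Tones are dropped, as with format="strip".
TOY_PINYIN = {"呕": "ou", "涂": "tu", "吐": "tu", "心": "xin", "粗": "cu"}


def to_pinyin(phrase):
    """Return the list of syllables for a phrase, one per character."""
    return [TOY_PINYIN[ch] for ch in phrase]


def rank_tier(candidate, error_phrase):
    """1 = full pinyin match, 2 = first syllable matches, 3 = neither."""
    cand, err = to_pinyin(candidate), to_pinyin(error_phrase)
    if cand == err:
        return 1
    if cand[:1] == err[:1]:
        return 2
    return 3


for cand in ["呕吐", "呕心", "粗涂"]:
    print(cand, rank_tier(cand, "呕涂"))
```

Returning a tier number rather than three separate lists is a design choice made for this sketch; it lets candidates be sorted with a single key such as `(tier, -frequency)`.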
import pinyin
# load the Chinese characters used to generate edits
# all characters in the file are concatenated into one string
def load_cn_words_dict( file_path ):
    cn_words_dict = ""
    with open(file_path, "r") as f:
        for word in f:
            cn_words_dict += word.strip().decode("utf-8")
    return cn_words_dict
# generate all phrases at edit distance 1 from the Chinese phrase
def edits1(phrase, cn_words_dict):
    "All edits that are one edit away from `phrase`."
    phrase = phrase.decode("utf-8")
    splits     = [(phrase[:i], phrase[i:])  for i in range(len(phrase) + 1)]
    deletes    = [L + R[1:]                 for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:]   for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]             for L, R in splits if R for c in cn_words_dict]
    inserts    = [L + c + R                 for L, R in splits for c in cn_words_dict]
    return set(deletes + transposes + replaces + inserts)
# return the phrases that exist in phrase_freq
def known(phrases): return set(phrase for phrase in phrases if phrase.encode("utf-8") in phrase_freq)
# get the candidate phrases for the error phrase
# we rank the candidate phrases according to their pinyin:
# if the candidate's pinyin exactly matches the error phrase's pinyin, it goes into the first order
# if the pinyin of the candidate's first character matches that of the error phrase's first character, it goes into the second order
# otherwise the candidate goes into the third order
def get_candidates( error_phrase ):
    
    candidates_1st_order = []
    candidates_2nd_order = []
    candidates_3rd_order = []
    
    error_pinyin = pinyin.get(error_phrase, format="strip", delimiter="/").encode("utf-8")
    cn_words_dict = load_cn_words_dict( "./cn_dict.txt" )
    candidate_phrases = list( known(edits1(error_phrase, cn_words_dict)) )
    
    for candidate_phrase in candidate_phrases:
        candidate_pinyin = pinyin.get(candidate_phrase, format="strip", delimiter="/").encode("utf-8")
        if candidate_pinyin == error_pinyin:
            candidates_1st_order.append(candidate_phrase)
        elif candidate_pinyin.split("/")[0] == error_pinyin.split("/")[0]:
            candidates_2nd_order.append(candidate_phrase)
        else:
            candidates_3rd_order.append(candidate_phrase)
    
    return candidates_1st_order, candidates_2nd_order, candidates_3rd_order
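def auto_correct( error_phrase ):
    
    c1_order, c2_order, c3_order = get_candidates(error_phrase)
    # print c1_order, c2_order, c3_order
    # candidates are unicode while the keys of phrase_freq are utf-8 byte strings,
    # so encode each candidate before looking up its frequency
    freq = lambda phrase: phrase_freq.get(phrase.encode("utf-8"), 0)
    if c1_order:
        return max(c1_order, key=freq)
    elif c2_order:
        return max(c2_order, key=freq)
    elif c3_order:
        return max(c3_order, key=freq)
    # no known phrase is one edit away: return the input unchanged
    return error_phrase.decode("utf-8")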
# test for the auto_correct 
error_phrase_1 = "呕涂" # should be "呕吐"
error_phrase_2 = "东方之朱" # should be "东方之珠"
error_phrase_3 = "沙拢" # should be "沙龙"

print error_phrase_1, auto_correct( error_phrase_1 )
print error_phrase_2, auto_correct( error_phrase_2 )
print error_phrase_3, auto_correct( error_phrase_3 )
呕涂 呕吐
东方之朱 东方之珠
沙拢 沙龙
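
For readers on Python 3, the candidate generation step can be sketched without any `decode`/`encode` calls, since `str` is already Unicode there. This is a sketch under stated assumptions, not the repository's code: `toy_charset` and `toy_freq` are made-up stand-ins for cn_dict.txt and the frequency dict.

```python
def edits1(phrase, charset):
    """All phrases one delete, transpose, replace or insert away from `phrase`."""
    splits = [(phrase[:i], phrase[i:]) for i in range(len(phrase) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in charset]
    inserts = [L + c + R for L, R in splits for c in charset]
    return set(deletes + transposes + replaces + inserts)


# Toy data, made up for the example.
toy_charset = "吐图龙"
toy_freq = {"呕吐": 1200, "呕": 90, "沙龙": 800}

candidates = edits1("呕涂", toy_charset)
# Keep only the candidates that exist in the dict.
known = {c for c in candidates if c in toy_freq}
print(sorted(known))
```

Note that the single-character deletion "呕" survives the dict filter alongside "呕吐"; the pinyin ordering and the frequency comparison are what decide between such candidates. For a phrase of n characters, replacements and insertions alone produce about (2n + 1) × len(charset) strings, so a character set of several thousand entries makes this step the main cost.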

3. Correct the misspelled phrases in a sentence

  • For any given sentence, use jieba to do the segmentation.
  • Once segmentation is done, go through the segment list: skip punctuation, and check whether each remaining phrase exists in the phrase_freq dict. If it does not, it is a misspelled phrase.
  • Use the auto_correct function to correct the misspelled phrase.
  • Output the corrected sentence.
import jieba
import string
import re
PUNCTUATION_LIST = string.punctuation
PUNCTUATION_LIST += "。,?:;{}[]‘“”《》/!%……()"
def auto_correct_sentence( error_sentence, verbose=True):
    
    jieba_cut = jieba.cut(error_sentence.decode("utf-8"), cut_all=False)
    seg_list = "\t".join(jieba_cut).split("\t")
    
    correct_sentence = ""
    
    for phrase in seg_list:
        
        correct_phrase = phrase
        # skip punctuation
        if phrase not in PUNCTUATION_LIST.decode("utf-8"):
            # check if the phrase is in our dict; if not, it is a misspelled phrase
            if phrase.encode("utf-8") not in phrase_freq:
                correct_phrase = auto_correct(phrase.encode("utf-8"))
                if verbose :
                    print phrase, correct_phrase
    
        correct_sentence += correct_phrase
    
    if verbose:
        print correct_sentence
    return correct_sentence
err_sent = '机七学习是人工智能领遇最能体现智能的一个分知!'
correct_sent = auto_correct_sentence( err_sent )
机七 机器
领遇 领域
分知 分枝
机器学习是人工智能领域最能体现智能的一个分枝!
print correct_sent
机器学习是人工智能领域最能体现智能的一个分枝!
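
The per-segment loop can be sketched in Python 3 independently of jieba. The example below is an illustration only: it starts from an already segmented list, and `toy_freq`, `toy_fixes` and `correct_segments` are hypothetical names with made-up data, with a simple lookup standing in for auto_correct.

```python
import string

# ASCII punctuation plus a few full-width marks (an illustrative subset).
PUNCTUATION = set(string.punctuation) | set("。,?:;!《》“”")

# Toy stand-ins (made up): a phrase dict and a lookup-based corrector.
toy_freq = {"机器": 500, "学习": 900, "是": 9000, "人工智能": 300, "领域": 400}
toy_fixes = {"机七": "机器", "领遇": "领域"}


def correct_segments(segments, phrase_freq, correct):
    """Correct each unknown, non-punctuation segment and rejoin the sentence."""
    out = []
    for seg in segments:
        if seg not in PUNCTUATION and seg not in phrase_freq:
            seg = correct(seg)
        out.append(seg)
    return "".join(out)


# As if produced by a segmenter.
segments = ["机七", "学习", "是", "人工智能", "领遇", "!"]
print(correct_segments(segments, toy_freq, lambda s: toy_fixes.get(s, s)))
```

Passing the corrector in as a function keeps the loop testable on its own; in the real pipeline the segmentation result determines which phrases are checked, so a different segmentation can change what gets flagged.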

autochecker4chinese's People

Contributors

beyondacm

autochecker4chinese's Issues

A question about "word_freq"

I think the following function needs a small change, which I have marked with a comment. When you use "max(c1_order, key=phrase_freq.get )", the frequencies may be strings rather than numbers.
def construct_dict( file_path ):
    word_freq = {}
    with open(file_path, mode='r', encoding='UTF-8') as f:
        for line in f:
            info = line.split()
            word = info[0]
            frequency = int(info[1])  # changed from: frequency = info[1]
            word_freq[word] = frequency
    return word_freq
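
The effect described in this issue can be reproduced with a short Python 3 sketch. The two phrases and their counts are made up for the example; only the comparison behaviour is the point.

```python
# Frequencies as read from the file (strings) versus converted to int.
# The two phrases and their counts are made up for the example.
freq_as_str = {"呕吐": "1000", "呕图": "999"}
freq_as_int = {phrase: int(count) for phrase, count in freq_as_str.items()}

candidates = ["呕吐", "呕图"]

# Strings compare character by character, so "999" sorts above "1000".
print(max(candidates, key=freq_as_str.get))
# Integers compare numerically, so 1000 beats 999.
print(max(candidates, key=freq_as_int.get))
```

With string values the less frequent phrase wins whenever its count starts with a larger digit, which is why converting with `int` matters for the frequency-based choice.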

pinyin

Hi, where is the pinyin file?

Operating Environment

There were many problems when I ran your program. I list all of them here:

  1. Please list all the dependent libraries and the operating environment. Your program seems to be based on Python 2; please state this up front so that other readers know.

  2. The reload and sys.setdefaultencoding('utf-8') functions could not be found. I tried removing both of them, but that led to a series of further problems.

Thanks for your contribution. I hope you can solve these problems soon.
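
For context on the second point: the built-in `reload` and `sys.setdefaultencoding` exist only in Python 2. A possible Python 3 approach, sketched below as an assumption rather than as the author's fix, is to drop both calls and pass the encoding explicitly when opening files. The sample file contents and the three-column layout (phrase, frequency, tag) are made up for the example.

```python
import os
import tempfile


def construct_dict(file_path):
    """Read 'phrase frequency ...' lines into a dict, decoding as UTF-8."""
    word_freq = {}
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            info = line.split()
            if len(info) >= 2:  # skip blank or malformed lines
                word_freq[info[0]] = int(info[1])
    return word_freq


# Write a two-line sample file (made-up contents) and load it back.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("呕吐 1200 v\n沙龙 800 n\n")
sample_freq = construct_dict(tmp.name)
os.remove(tmp.name)
print(sample_freq)
```

Because Python 3 strings are already Unicode, the `.decode("utf-8")` and `.encode("utf-8")` calls in the listings above would also have to be removed in such a port.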

Correcting errors

机七-->机
The "七" was not corrected. What went wrong here?

Is correction based only on word frequency?

I saw the following code. It seems to consider only word frequency, without taking context into account. I think the algorithm needs improvement.
def auto_correct( error_phrase ):

    c1_order, c2_order, c3_order = get_candidates(error_phrase)
    # print c1_order, c2_order, c3_order
    if c1_order:
        return max(c1_order, key=phrase_freq.get )
    elif c2_order:
        return max(c2_order, key=phrase_freq.get )
    else:
        return max(c3_order, key=phrase_freq.get )

'function' object has no attribute 'get'

Traceback (most recent call last):
  File "Autochecker4Chinese.py", line 128, in <module>
    main()
  File "Autochecker4Chinese.py", line 117, in main
    correct_sent = auto_correct_sentence( err_sent_1 )
  File "Autochecker4Chinese.py", line 99, in auto_correct_sentence
    correct_phrase = auto_correct(phrase.encode("utf-8"))
  File "Autochecker4Chinese.py", line 76, in auto_correct
    c1_order, c2_order, c3_order = get_candidates(error_phrase)
  File "Autochecker4Chinese.py", line 58, in get_candidates
    error_pinyin = pinyin.get(error_phrase, format="strip", delimiter="/")#.encode("utf-8")
Thanks a lot!
