Combining trigram and automatic weight distribution in Chinese spelling error correction. (English) Zbl 1095.68707

Summary: Research on spelling correction increasingly focuses on context-sensitive spelling error correction, which is more difficult than traditional isolated-word error correction. The paper presents CInsunSpell, a novel and efficient algorithm for Chinese spelling error correction. Correction proceeds in two phases: a checking phase and a correcting phase. In the checking phase, a trigram algorithm operating within a fixed-size window locates potential errors in a local area. The correcting phase employs a new method that automatically and dynamically distributes weights among the characters in the confusion set as well as in the Bayesian language model. According to the authors, these tactics perform well.
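
The two phases described in the summary can be illustrated with a minimal Python sketch. This is not the CInsunSpell implementation: the summary gives no formulas, so the add-alpha smoothing, the three-trigram window, the threshold value, and the softmax used to spread weights over the confusion set are all assumptions made for illustration, as are every function name, the toy corpus, and the confusion set.

```python
import math
from collections import Counter

PAD_L, PAD_R = "^^", "$$"  # hypothetical sentence-boundary markers


def train(corpus):
    """Count character trigrams and their two-character contexts."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        padded = PAD_L + sentence + PAD_R
        vocab.update(padded)
        for k in range(len(padded) - 2):
            tri[padded[k:k + 3]] += 1
            ctx[padded[k:k + 2]] += 1
    return tri, ctx, len(vocab)


def window_score(text, i, model, alpha=1.0):
    """Mean smoothed log-probability of the three trigrams containing text[i]."""
    tri, ctx, vocab_size = model
    padded = PAD_L + text + PAD_R
    total = 0.0
    for end in range(i + 2, i + 5):  # text[i] sits at padded[i + 2]
        gram = padded[end - 2:end + 1]
        total += math.log((tri[gram] + alpha) / (ctx[gram[:2]] + alpha * vocab_size))
    return total / 3


def detect(text, model, threshold=-2.5):
    """Checking phase: flag positions whose window score is below the threshold."""
    return [i for i in range(len(text)) if window_score(text, i, model) < threshold]


def candidate_weights(text, i, candidates, model):
    """Assumed stand-in for weight distribution: softmax over window scores."""
    scores = [window_score(text[:i] + c + text[i + 1:], i, model) for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return {c: e / norm for c, e in zip(candidates, exps)}


def correct(text, model, confusion, threshold=-2.5):
    """Correcting phase: replace each flagged character by its heaviest candidate."""
    chars = list(text)
    for i in detect(text, model, threshold):
        candidates = [chars[i]] + confusion.get(chars[i], [])
        weights = candidate_weights("".join(chars), i, candidates, model)
        chars[i] = max(candidates, key=weights.get)  # ties keep the original
    return "".join(chars)


# Toy data, invented for this example only.
corpus = ["我们在学校学习", "他们在学校工作", "我们在公司工作"]
confusion = {"再": ["在"], "在": ["再"]}
model = train(corpus)
print(correct("我们再学校学习", model, confusion))
```

Flagged characters with no confusion-set entry are left unchanged, so a low window score alone never alters the text. A realistic system would need a large training corpus and a tuned threshold rather than the fixed value assumed here.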

MSC:

68T50 Natural language processing
68T10 Pattern recognition, speech recognition

Keywords:

language model
