
Chinese Word Segmentation Probability Dictionary Training and Enrich Solution

Wang Qi, Zeng Guangping, Wang Yonghao, Gui Xunlong

Abstract


Word segmentation is a necessary component of search engines for Asian languages, and the probability dictionary is the core component of word segmentation applications based on statistical language models. The traditional way to build a probability dictionary is manual annotation, which is slow and inefficient and usually fails to cover recently coined words. As society progresses, new words continually enter the language. Incorporating these new words into the probability dictionary, so as to improve the recall and precision of word segmentation, is a major challenge for search engines for Asian languages such as Chinese. This article introduces an approach for automatically learning and enriching a probability dictionary. The solution, based on unsupervised machine learning, extracts word occurrence probabilities and word transition probabilities from user search logs, and learns new words that do not exist in the current lexicon in order to enrich the tokenization probability dictionary.
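The abstract does not give the training procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each log entry is a query whose terms the user has already separated with spaces, and it estimates word occurrence probabilities, word transition probabilities, and new-word candidates by simple counting. The function name `train_probability_dictionary`, the `min_count` threshold, and the sample queries are all hypothetical.

```python
from collections import Counter


def train_probability_dictionary(queries, lexicon, min_count=2):
    """Estimate word and transition probabilities from search queries.

    Assumption (not from the paper): every query is a string whose
    terms are separated by whitespace typed by the user.
    """
    unigram = Counter()  # how often each word occurs
    bigram = Counter()   # how often each adjacent word pair occurs

    for query in queries:
        tokens = query.split()
        unigram.update(tokens)
        bigram.update(zip(tokens, tokens[1:]))

    # Word occurrence probability: relative frequency over all tokens.
    total = sum(unigram.values())
    word_prob = {w: c / total for w, c in unigram.items()}

    # Word transition probability P(next | prev), normalised by how often
    # `prev` is followed by any word at all.
    prev_totals = Counter()
    for (prev, _), count in bigram.items():
        prev_totals[prev] += count
    transfer_prob = {
        (prev, nxt): count / prev_totals[prev]
        for (prev, nxt), count in bigram.items()
    }

    # New-word candidates: frequent tokens missing from the current lexicon.
    new_words = {
        w for w, c in unigram.items() if w not in lexicon and c >= min_count
    }
    return word_prob, transfer_prob, new_words


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample_queries = ["北京 天气", "北京 天气 预报", "微博 热搜", "微博 登录"]
    sample_lexicon = {"北京", "天气", "预报", "热搜", "登录"}
    probs, transfers, candidates = train_probability_dictionary(
        sample_queries, sample_lexicon
    )
    print(candidates)
```

The `min_count` threshold is a crude guard against treating one-off typos as new words; a real system would likely need stronger evidence before adding an entry to the dictionary.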

Keywords


word segmentation, probability dictionary, statistical language model, Machine Learning.


