Adaptive compression-based models of Chinese text

Teahan, W.J. and Wu, P. and Liu, W. (2014) Adaptive compression-based models of Chinese text. In: International Conference on Audio, Language and Image Processing (ICALIP), 7 - 9 July 2014, Shanghai, China.

Full-text not available from this repository.


Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
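
To illustrate what an adaptive, character-based model of this kind does, the sketch below estimates the cost in bits of each character from the counts gathered so far, and only then updates those counts. It is a minimal illustration and not the authors' implementation: it assumes escape method C, omits exclusions, updates every context order, and falls back to a uniform distribution over an assumed alphabet of 65536 symbols. The names SimplePPM, cost_bits, update and bits_per_character are hypothetical.

```python
import math
from collections import defaultdict


class SimplePPM:
    """Simplified adaptive PPM-style model (escape method C, no exclusions)."""

    def __init__(self, max_order=2, alphabet_size=65536):
        self.max_order = max_order
        # Assumed alphabet size for the uniform fallback below order 0.
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = times `symbol` followed `context`.
        self.counts = defaultdict(lambda: defaultdict(int))

    def cost_bits(self, history, symbol):
        """Bits needed to code `symbol` after `history` with current counts."""
        bits = 0.0
        # Try the longest context first, escaping to shorter ones.
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - order:]
            seen = self.counts.get(context)
            if not seen:
                continue  # context never seen: escaping is free
            total = sum(seen.values())
            distinct = len(seen)
            if symbol in seen:
                return bits - math.log2(seen[symbol] / (total + distinct))
            # Symbol is novel in this context: pay for an escape.
            bits -= math.log2(distinct / (total + distinct))
        # Nothing predicted the symbol: code it uniformly.
        return bits + math.log2(self.alphabet_size)

    def update(self, history, symbol):
        """Record `symbol` in every context order from 0 up to max_order."""
        for order in range(min(self.max_order, len(history)) + 1):
            context = history[len(history) - order:]
            self.counts[context][symbol] += 1


def bits_per_character(text, max_order=2):
    """Average coding cost when the model learns while reading `text`."""
    model = SimplePPM(max_order=max_order)
    total_bits = 0.0
    for i, ch in enumerate(text):
        history = text[max(0, i - max_order):i]
        total_bits += model.cost_bits(history, ch)  # predict first
        model.update(history, ch)                   # then learn
    return total_bits / len(text) if text else 0.0
```

Calling bits_per_character on a longer sample of text gives a rough compression rate in bits per character. Because the model starts empty and adapts as it reads, repeated material becomes progressively cheaper to code.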

Item Type: Conference or Workshop Item (UNSPECIFIED)
Subjects: Research Publications
Departments: College of Physical and Applied Sciences > School of Computer Science
Date Deposited: 09 Dec 2014 16:28
Last Modified: 09 Apr 2016 02:34
URI: http://e.bangor.ac.uk/id/eprint/238
Identification Number: DOI: 10.1109/ICALIP.2014.7009920
Publisher: IEEE publishing
