This thesis proposes a compression scheme for Chinese text based on a large-alphabet Burrows-Wheeler transform (BWT). An input Chinese text file is first parsed into tokens drawn from a large alphabet that combines the Big-5 character set with ASCII. The resulting token stream is then processed by BWT, move-to-front (MTF) coding, and arithmetic coding. To improve processing speed, we also study practical implementations of BWT, MTF, and arithmetic coding that remain workable under the large-alphabet requirement. The scheme has been implemented as a program suitable for practical use. In experiments on Chinese text files, the compression rate of our program is about 12.9% better than that of the commonly used Win-ZIP, about 4.7% better than that of Win-RAR, and about 1.7% better than that of BZIP2, the original BWT-based compression program.
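The first stages of the pipeline described above (large-alphabet parsing, BWT, and MTF) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the lead-byte range used to detect two-byte Big-5 characters, the choice of the sorted set of observed tokens as the initial MTF list, and the naive rotation sort are all assumptions made for illustration. The sketch makes no attempt at the speed optimizations the thesis studies, and arithmetic coding is omitted.

```python
def parse_tokens(data):
    """Split Big-5/ASCII bytes into integer tokens (one token per character).

    Assumption: a byte in 0x81..0xFE is a Big-5 lead byte and is combined
    with the following byte; any other byte is a single-byte token.
    """
    tokens = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFE and i + 1 < len(data):
            tokens.append((b << 8) | data[i + 1])  # two-byte character
            i += 2
        else:
            tokens.append(b)  # ASCII or stray byte
            i += 1
    return tokens


def bwt(tokens):
    """Naive BWT: sort all rotations, return (last column, primary index)."""
    n = len(tokens)
    if n == 0:
        return [], 0
    doubled = tokens + tokens
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last = [tokens[(i - 1) % n] for i in order]
    return last, order.index(0)


def inverse_bwt(last, primary):
    """Invert the BWT using a stable sort of the last column."""
    n = len(last)
    # order[j] is the row whose rotation is row j shifted left by one symbol.
    order = sorted(range(n), key=lambda i: last[i])
    out = []
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return out


def mtf_encode(symbols, alphabet):
    """Move-to-front: emit each symbol's current rank, then move it to front."""
    table = list(alphabet)
    ranks = []
    for s in symbols:
        r = table.index(s)  # linear search; slow for large alphabets
        ranks.append(r)
        table.insert(0, table.pop(r))
    return ranks


def mtf_decode(ranks, alphabet):
    """Inverse of mtf_encode given the same initial alphabet order."""
    table = list(alphabet)
    symbols = []
    for r in ranks:
        s = table.pop(r)
        symbols.append(s)
        table.insert(0, s)
    return symbols


if __name__ == "__main__":
    text = "壓縮中文文字abc壓縮中文"
    tokens = parse_tokens(text.encode("big5"))
    last, primary = bwt(tokens)
    alphabet = sorted(set(tokens))  # assumed initial MTF list
    ranks = mtf_encode(last, alphabet)
    restored = inverse_bwt(mtf_decode(ranks, alphabet), primary)
    print(restored == tokens)
```

Treating each Chinese character as a single token means the BWT groups contexts at the character level rather than splitting a character's two bytes across unrelated contexts. The linear search in `mtf_encode` also shows why a large alphabet calls for more careful implementations: its cost grows with the alphabet size.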