Spell-check for Chinese: viable methods

I hope the reader has as much fun reading this as I did writing it.  It was truly a pleasure.

 

Abstract

This paper is divided into two parts.  The first looks at the development of input methods for computing in Chinese and their current manifestations.  It focuses particularly on the present schemes devised for keyboard input, as they are the most widely employed tools for this process.  The second part examines what these input systems mean for word processing.  Given the nature of character-based writing and the various input methods, I look at what makes it more difficult to develop word processing mechanisms, particularly a robust spell-check system, for Chinese than for alphabetic scripts.  Spell-check approaches based on graphic and phonetic input methods are discussed.  The results and implications of the approaches are compared.

 

I  Introduction

Around the 10th century AD, movable type for printing was invented in China (“Printing” 2005).  The printing of books, mostly Buddhist texts, on paper had been going on for centuries, but it was all done on large, individually carved blocks, with the form of each character delicately finished by a master artisan.  It was at this time that a Chinese inventor took printing to the next level by devising movable type.  Individual characters would first be carved out and later arranged on a printing block to reproduce the desired text on a page.  This invention, though, never significantly changed Chinese society by delivering on the promise of the cheaply produced written word, and the reason was simple.  The Chinese language is represented by a character set numbering in the tens of thousands.  The task of organizing and manipulating such a large typeset was tremendously impractical and costly, and so individually carved blocks continued to be the medium of choice for printing.

This same story can be repeated in the West, although with vastly different results.  Johannes Gutenberg is the man largely credited with the invention of movable type in his shop in Germany in 1450.  Although it came many centuries after the Chinese invention, his invention was truly revolutionary.  Languages that employ a limited set of letters for written expression, such as English, which requires only 52 letters (upper and lower case), flourished with the onset of print material that could be reproduced cheaply and in mass.

This historical example provides an interesting illustration of the different results that can come from technological innovation due to differences in the written word.  There is a parallel to be drawn from this to modern times with the development of computers and their interaction with character-based scripts and alphabets.  Viewed by some users of character script as the tool that will finally level the playing field for text processing, computers have in fact done pretty much the reverse.  They have amplified the differences intrinsic to character-based writing and its alphabetic counterpart, revealing the unwieldiness of the former as an obstacle that will forever prevent it from reaching parity in computing speed with alphanumeric systems.  As the expert sinologist Professor Marshall Unger points out, the inefficiency of character writing will only increase “as the scope and number of computer applications grow” (Unger 1987).  Much to the disappointment of character script users, computers will only continue to accentuate the performance gap as processor speeds continue to rise.

All this is not to say that there is no future for languages that employ character-based writing systems.  Once one is cognizant of the difference between alphabetic and character systems, it is possible to find a way out of this problem, perhaps even by adopting attributes of the former.  This paper looks at one dimension of the character computing issue, namely that of input systems for Chinese languages.  Starting from a discussion of the various schemes devised for this intricate task, it then moves on to the implications these systems have for word processing.

 

II  Input Systems

An input system, in the context of this paper on computing, is a way by which users, following predefined rules and procedures, interact with some medium in order to produce a desired output from the computer.  In this case, the output the user intends is a display of text on the computer screen.  The development of these input systems for Chinese characters was a long and arduous task.  It did not come easily, as can be seen from the obstacles faced by early typewriters.

 

Early input methods

Before the invention of computers, typewriters were the tools used for word processing.  The words ‘input system’ or ‘input method’ can be applied to typewriters in a loose sense since the users were still interacting through some medium to write Chinese characters.  The difference is that in a typewriter the process was completely mechanical, while a computer turns the actions of the user into an electrical signal.

The first English typewriter was invented in 1867.  Nearly 50 years later, in 1917, the first Chinese character typewriter appeared.  It was manufactured by the Japanese firm Nippon Typewriter Co.  The Nippon typewriter had a flat bed of 3,000 Japanese characters, most of which were Kanji, some of which could be used to type Chinese.  This typewriter was considered a shorthand version since the Japanese language contains in excess of 30,000 characters (Russo 2000).  To use the typewriter, paper was wrapped around the cylindrical rubber platen, which moved on rollers over the bed of type.  The operator used a lever to control an arm that picked up a piece of metal type from the bed, pressed it against the paper, and returned it to its niche.  The process was slow and tedious, even for trained typists.

The large bed of characters persisted in typewriters developed later on.  The problem with Chinese character typewriters was twofold.  Not only did they require a lot of movement to operate, since touch typing was impossible with these machines, but they also burdened their operators with the task of finding the right character amid a bed of literally thousands of characters.  Finding the right character was a time-consuming task.  The positions of the characters had to be memorized, since what the operator saw was a collection of upside-down, backward characters in very small type.  The machine takes 160 to 200 times longer to master than a Western-language typewriter (Feng 1989).

 

Input systems for modern computing

It would seem that the early modern period was a dark time for Chinese character computing, but it was still some kind of improvement over the past.  Moving into the computer age, many more options began to open up to users of Chinese characters.  From the definition of input systems given previously, we can categorize these various new ways of inputting Chinese by their mediums into three basic groups.  First, there is voice-to-text, which utilizes a microphone and voice recognition software in order to write Chinese.  Second, there is handwriting recognition, which uses a pressure-sensitive tablet, stylus, and specialized software to convert handwritten characters to text on a screen.  I will not cover either group here since neither of these two types of input systems has the potential to become the dominant system given the current technology.  The last and largest category of input systems is that based on the keyboard.  To date, this has been the most viable medium for inputting Chinese in a speedy and practical manner.  The proliferation of keyboard-based input systems is proof of their superiority, and so attention will be given to their discussion.

 

Keyboard input methods

Keyboard input methods are by far the largest category.  They can be further divided into two subgroups.  One is shape-based, graphic input methods, such as Wubizixing and Cangjie.  The other is pronunciation-based, phonetic input methods, such as Pinyin, Zhuyin, and methods that use non-Mandarin sounds as a starting point.  While there still exist many other keyboard input systems for Chinese, such as English-to-Chinese or Hanzhong, telegraphic code, and the four corners system or CKC, the number of people who use these systems exclusively is negligible when compared to the users of the graphic and phonetic systems described below.

 

Graphic methods

Graphic input systems rely on the visual components of characters.  Graphical components of characters are assigned to specific keys on the keyboard.  Users follow set rules to input these components in the right order to get the appropriate character to appear on the screen.  There are two major graphical input systems for the keyboard: Wubizixing, which is used in China, and Cangjie, which is found in Taiwan.

While there may be differences across the systems, graphical input systems generally follow a predictable pattern.  First, the user identifies the shape of the character they would like to type.  Having the strokes of the character in mind, the user then follows one of the following rules, depending on the input method used.  Strokes may be inputted in the standard stroke order, as one would write the character by hand.  Alternatively, the user may be called on to single out the identifying graphical components of the character and input just parts of those graphs in combination to get the character.  The rules governing this second type of graphical input system are more ad hoc since they do not follow an accepted convention like stroke order.
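The component-based pattern described above can be sketched as a simple table lookup.  This is only a toy illustration and not the actual rule set of any input method.  The key letters are what I believe Cangjie assigns to these three components, but both tables, and the function names encode and lookup, are assumptions made for the example.

```python
# Toy component-to-key table. The letters are believed to match the Cangjie
# assignments for these components, but treat them as illustrative only.
COMPONENT_KEYS = {"日": "A", "月": "B", "木": "D"}

# Hypothetical code table: key sequence -> character.
CODE_TABLE = {
    "A": "日",
    "B": "月",
    "D": "木",
    "AB": "明",   # 日 + 月
    "DD": "林",   # 木 + 木
    "DDD": "森",  # 木 + 木 + 木
}


def encode(components):
    """Turn an ordered list of graphical components into a key sequence."""
    return "".join(COMPONENT_KEYS[c] for c in components)


def lookup(keys):
    """Return the character for a key sequence, or None if nothing matches."""
    return CODE_TABLE.get(keys)


print(lookup(encode(["日", "月"])))  # 明
print(lookup("DA"))                  # None: no character has this code here
```

The point of the sketch is that the user must know both the components of the character and the order in which the system expects them; a wrong order or a wrong component either finds nothing or finds a different character.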

 

Phonetic methods

The most common phonetic input methods for Chinese are based on the Pinyin and Zhuyin systems, the latter also known as BoPoMoFo.  Pinyin uses the alphabet to indicate the sounds of the Chinese language, while Zhuyin, which was developed in 1913, is a phonetic system that resembles Japanese kana.  Both of these systems were developed to model the sounds of standard Mandarin.  The Pinyin input system, since it utilizes the alphabet, follows the standard QWERTY layout found on most English language computers.  The Zhuyin input system, as it appears in places where it is used, such as Taiwan, maps its 40 keys (36 sounds, 4 tones) onto the QWERTY keyboard by also using the fourth row, which is reserved for numbers and symbols when typing English.

Phonetic input systems, as their name implies, are based on sounds, and so users type in their natural spoken language.  Some developers, using this fact, have made headway in developing phonetic input systems based on non-Mandarin Chinese sounds, such as Cantonese.  One such system is the Red Dragonfly Chinese Input System, which can be readily downloaded online.  It functions similarly to Pinyin, but it is a Romanized system based on the Cantonese topolect spoken in southern China.

 

Which is the better system?

Graphic input systems, by their nature, do not suffer from the word ambiguity problem, but they are troubled by difficulties of another sort.  Graphic input systems require two things.  First, the user must know what the character looks like.  Without this knowledge, the user will be at a loss as to how to input the character on any graphic-based system.  Second, the user must know, depending on the particular system being used, in what order to input the graphical information to produce the right character on the screen.  Neither of these requirements is based in the natural spoken language of the user, and so they must be learned separately.  This is a major hindrance to the proliferation of graphic systems and accounts for the popularity of Pinyin and other phonetic systems.

It should come as no surprise that, because it is easy to learn and use, Pinyin is the most popular Chinese input method by far.  Over 97% of the users in China use Pinyin for input (Chen 1997).  Although Pinyin and phonetic input methods in general possess much currency with users, they suffer from several problems, one of which is conversion error when going from sound to character.  As explained above, phonetic input methods convert sounds into characters, but Mandarin, on which most of these systems are based, has a corpus of only 398 distinct syllables if tones are excluded.  The number rises to 1277 with the inclusion of tone (DeFrancis 1984).  This small syllabary must correspond to over 6000 common Chinese characters under the phonetic input systems, so it is very difficult for the system to select the correct corresponding Chinese characters automatically.
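The scale of the problem can be seen with some quick arithmetic on the figures just cited.  The sketch below simply divides the character count by the syllable counts; it assumes characters are spread evenly over syllables, which is not true in practice, and it uses 6000 as a lower bound since the text says "over 6000".

```python
# Figures quoted in the text (DeFrancis 1984); 6000 is a lower bound.
COMMON_CHARACTERS = 6000
SYLLABLES_NO_TONE = 398
SYLLABLES_WITH_TONE = 1277


def average_candidates(characters, syllables):
    """Average number of characters sharing one syllable, assuming an even spread."""
    return characters / syllables


print(round(average_candidates(COMMON_CHARACTERS, SYLLABLES_NO_TONE), 1))    # 15.1
print(round(average_candidates(COMMON_CHARACTERS, SYLLABLES_WITH_TONE), 1))  # 4.7
```

Even when tone is typed, each syllable still stands for several characters on average, which is why the system cannot pick the right character from the sound alone.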

For phonetic systems, the solution to the ‘homonym’ problem lies in n-gram language models or the clear delineation of groups of Chinese characters that come together to form words.  Both these frameworks are built on the same foundation: taken individually, each syllable can be expressed by multiple Chinese characters, but if syllables are grouped together into logical, lexical units, the occurrence of homonyms declines dramatically.  Multiple studies have shown this to be the case.  In a Chinese dictionary of 60,000 words, some 4,000, or about 7 percent of its entries, have homonyms.  For a 120,000 word dictionary, the number of homonyms increases to about 6,000, but this is only about 5 percent of its entries (Zhou 1987).  Through the delineation of words, the homonym problem is reduced to a manageable size.  At this level, precise determinations of the appropriate characters can be made through context by the computer program or the user.
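A small sketch may make the effect of word grouping concrete.  The candidate lists and the frequency counts below are invented for illustration and are far smaller than any real lexicon; the names SYLLABLE_CHARS, WORD_COUNTS, character_candidates and word_candidates are likewise hypothetical.

```python
from itertools import product

# Hypothetical, heavily truncated candidate lists for two toneless syllables.
SYLLABLE_CHARS = {
    "shi": ["是", "时", "事", "市", "十"],
    "jian": ["间", "见", "件", "建"],
}

# Hypothetical word lexicon with made-up frequency counts.
WORD_COUNTS = {
    ("shi", "jian"): {"时间": 90, "事件": 40},
}


def character_candidates(syllables):
    """Every character string obtainable by treating each syllable on its own."""
    return ["".join(chars) for chars in product(*(SYLLABLE_CHARS[s] for s in syllables))]


def word_candidates(syllables):
    """Lexicon words matching the whole syllable group, most frequent first."""
    counts = WORD_COUNTS.get(tuple(syllables), {})
    return sorted(counts, key=counts.get, reverse=True)


print(len(character_candidates(["shi", "jian"])))  # 20
print(word_candidates(["shi", "jian"]))            # ['时间', '事件']
```

Treated character by character, the two syllables admit twenty combinations even in this tiny table; treated as one word, only two candidates remain, and context or frequency can choose between them.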

 

III Word Processing

Western alphabetic scripts

Word processing for alphabetic scripts is made easier by the orthography, or conventions of writing, already in use by the time computers arrived.  With word division and a high degree of standardization when it comes to spelling, western alphabetic scripts are simply easier to analyze.  In order to illustrate this point more clearly, it is necessary to compare the alphabetic scripts with the character-based Chinese writing system.

First, Chinese writing does not employ word division.  In the context of word processing, this can become problematic.  In alphabetic scripts, a simple rule can be utilized to single out a word for analysis: Anything that appears between two spaces is a word.  For Chinese writing, the same rule does not hold.  Instead more complex rules must be devised to ‘lift’ the words out of context. 
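The contrast can be sketched in a few lines.  For alphabetic text, splitting on spaces is enough.  For Chinese, the sketch uses forward maximum matching, a common dictionary heuristic that I am substituting here as an example of a "more complex rule"; it is not taken from either study discussed later, and the tiny lexicon is hypothetical.

```python
def split_alphabetic(text):
    """Word division for alphabetic scripts: anything between spaces is a word."""
    return text.split()


def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry,
    falling back to a single character when nothing in the lexicon matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy lexicon.
LEXICON = {"我们", "喜欢", "中文", "输入", "输入法"}

print(split_alphabetic("we like Chinese input methods"))
# ['we', 'like', 'Chinese', 'input', 'methods']
print(forward_maximum_match("我们喜欢中文输入法", LEXICON))
# ['我们', '喜欢', '中文', '输入法']
```

Note that the Chinese result depends entirely on what the lexicon contains, which is exactly where the lack of an agreed definition of 'word' becomes a practical problem.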

If that were not hard enough already, the issue is compounded by the fact that in Chinese, there is no standardized concept of ‘word’ laid down by authority.  The closest things to it are the character constructions adopted by the majority of Chinese users.  These are characters that predictably appear next to each other in a string to describe something.

 

Solutions to the spell-check problem

Chinese does not have many of the orthographic qualities that make alphabetic writing systems friendly to word processing.  Still, there is a way to build a spell-check model for Chinese even though it does not possess things like word division.  The solution lies in breaking down individual characters into the keystrokes used in their input and constructing some sort of framework that can identify Chinese words through context.  The two studies described in the next two sections seek to do just that.  One chooses a graphic system as the starting point and the other selects a phonetic system.

 

A graphic input system approach

A spell-check system for graphic input for Chinese must take into consideration the inability to differentiate between a character that is keyed in intentionally by the user and a legitimate character that was the result of pressing the incorrect key sequence.  In a study to develop a spelling correction system for Chinese, a group of researchers in Taiwan looked at ways it could be done for the Taiwanese graphic input system Cangjie.  The original paper can be found here.  It is the first entry.

In building the foundation of their system, they established two things.  First, they looked to the input key order for Cangjie as the starting point for building their spell-check system.  Second, they used various n-gram language models and Chinese frequent strings (CFS) as a benchmark against which to compare their output to determine if something was amiss with the input.  Since there is no official standardization of words or word division in Chinese writing, the n-gram language models and CFS serve as close substitutes by looking at real world use of characters and their associations with each other.

From these two starting points, the researchers constructed a mechanism to determine what they call the ‘confusing set.’  The confusing set is built from legitimate sequences of words based on the n-gram language models and CFS that resemble the actual input by the user.  Thus, a list of candidate sentences is generated among which the intended sentence may be present.
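The idea of a confusing set can be sketched roughly as follows.  This is my own simplified reading, not the researchers' algorithm: it considers only single-key substitutions, allows at most one changed character per sentence, and ranks candidates with made-up character bigram counts instead of the word-level models and CFS used in the study.  The key codes are what I believe Cangjie assigns to these characters, but they should be treated as illustrative.

```python
# Illustrative key codes (believed to match Cangjie, but not verified here).
CODES = {"明": "AB", "朋": "BB", "天": "MK", "友": "KE"}

# Hypothetical bigram counts standing in for a real language model.
BIGRAM_COUNTS = {("明", "天"): 50, ("朋", "友"): 60}


def one_key_apart(a, b):
    """True when two key sequences differ by exactly one substituted key."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def confusing_set(sentence):
    """The input plus every sentence reachable by replacing one character
    with a character whose code is one key substitution away."""
    candidates = [sentence]
    for i, ch in enumerate(sentence):
        for other, code in CODES.items():
            if one_key_apart(CODES[ch], code):
                candidates.append(sentence[:i] + other + sentence[i + 1:])
    return candidates


def score(sentence):
    """Sum of bigram counts over adjacent character pairs."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(sentence, sentence[1:]))


def rank(sentence):
    """Candidates ordered from most to least likely under the toy model."""
    return sorted(confusing_set(sentence), key=score, reverse=True)


print(rank("明友"))  # ['朋友', '明友']
```

Here the typed 明 is a perfectly legitimate character, so the error cannot be detected by looking at the character alone; it only shows up because 朋友 is a far more plausible sequence than 明友.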

The researchers ran two experiments that tested the spell-check model with 485,272 sentences.  In the first experiment, they used the Academia Sinica Chinese Electronic Dictionary (ASCED) along with a traditional word bi-gram language model as the benchmark for determining the ‘confusing set.’  The resulting candidates in the ‘confusing set’ were then ranked by likelihood.  In this experiment, the model's top choice was the correct sentence 80.95% of the time.  Given the top ten choices, the accuracy of the model goes up to 84.84%.  This is a small incremental gain of 3.89 percentage points in going from offering only the top choice to offering the top ten.

The second experiment used a dictionary consisting of Chinese frequent strings and a uni-gram language model as the benchmark for ‘confusing set’ determination.  The results from this experiment were even better than the first in terms of prediction accuracy, with the top choice being right 87.32% of the time.  Given the top ten choices, the accuracy improves to 97.48%, which is a gain of 10.16 percentage points in predictive accuracy.

While the results were impressive for both experiments, there are still questions about the robustness of this model.  In designing the experiments, the researchers decided to test their model only with sentences that contain at most one typing error.  Would it be able to handle sentences with more than one error?  If so, what would be the prediction accuracy of the top choices?  What would be the cumulative prediction rate amongst the top 3 or 5 choices?

 

A phonetic input system approach

Microsoft researchers in China examined ways to approach spell-check in Chinese with phonetic input systems, using Pinyin.  Their paper can be found here.

They used the individual letter inputs as the basic unit of analysis to build the set of candidate sentences.  As the benchmark from which to derive the candidate set, they used a statistical language model that was an amalgamation of tri-gram modeling and maximum likelihood-based methods.
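For readers unfamiliar with the terms, the sketch below shows what a tri-gram model with maximum likelihood estimates looks like in its most basic form.  It is a textbook illustration on a made-up six-token corpus, without the smoothing a real system would need, and should not be read as the researchers' actual model; the function names are my own.

```python
from collections import Counter


def train_trigram(tokens):
    """Count trigrams and their two-token contexts, with boundary padding."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    bigrams = Counter(zip(padded, padded[1:]))
    return trigrams, bigrams


def trigram_probability(trigrams, bigrams, w1, w2, w3):
    """Maximum likelihood estimate of P(w3 | w1, w2): count(w1 w2 w3) / count(w1 w2).
    Returns 0.0 for an unseen context (no smoothing in this sketch)."""
    context = bigrams[(w1, w2)]
    return trigrams[(w1, w2, w3)] / context if context else 0.0


# Made-up, pre-segmented toy corpus.
corpus = ["我", "是", "学生", "我", "是", "老师"]
trigrams, bigrams = train_trigram(corpus)

print(trigram_probability(trigrams, bigrams, "我", "是", "学生"))  # 0.5
print(trigram_probability(trigrams, bigrams, "我", "是", "医生"))  # 0.0
```

A candidate sentence can then be scored by multiplying these probabilities along the sentence, so that candidates made of word sequences seen often in the training text rank above those that are not.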

They conducted their test in two parts.  First, they looked at error rates for Pinyin input before any attempt at correction was made.  For a set of perfectly inputted Pinyin, the error rate of Pinyin to character conversion was 6.84%.  With actual input, which included spelling errors from test subjects, the error rate of Pinyin to character conversion rose to 20.84%.  The interesting thing to note here is that only 4.6% of the Chinese characters were typed incorrectly in the set produced by the test subjects, but the compounding of error due to a word-based input model pushed the conversion error rate much higher.

In the second part, they applied their language model for spelling correction at varying weights to determine the optimum level which minimizes error.  In the perfect input case, the results under the system were worse, but only slightly, since the adaptive spell-check model was overcompensating for errors that were not there.  The result of the test on actual input shows that the error rate can be decreased to a low of slightly above 13% for test subjects.  The relative error reduction rate is about 50%.  Since the researchers did not seek to control the number of input errors per sentence, this model may prove to be more robust, and may have higher prediction accuracy when limited to only one error per sentence, as was the case in the graphic input system approach to spell-check.
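The text above does not say which baseline the relative reduction is measured against, so the sketch below only works through the figures quoted here under two possible readings.  The value 13.0 is an assumption standing in for "slightly above 13%", and the names are hypothetical.

```python
PERFECT_INPUT_ERROR = 6.84   # conversion error with perfectly typed Pinyin (%)
ACTUAL_INPUT_ERROR = 20.84   # conversion error with real typing errors (%)
CORRECTED_ERROR = 13.0       # assumption: stands in for "slightly above 13%"


def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before


# Reading 1: reduction of the total conversion error.
overall = relative_reduction(ACTUAL_INPUT_ERROR, CORRECTED_ERROR)

# Reading 2: reduction of only the error added by typing mistakes,
# that is, the part above the perfect-input baseline.
excess = relative_reduction(
    ACTUAL_INPUT_ERROR - PERFECT_INPUT_ERROR,
    CORRECTED_ERROR - PERFECT_INPUT_ERROR,
)

print(round(overall, 2))  # 0.38
print(round(excess, 2))   # 0.56
```

On these numbers, a figure of about 50% fits the second reading better: the spell-check model removes roughly half of the error that typing mistakes add on top of the unavoidable conversion error.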

 

Comparison and discussion of results

It is interesting to note that although the two studies analyze input systems with fundamentally different starting points, in both cases the development of a spell-checking system begins with breaking down the system into sub-character components.  For graphic input systems, this was the keystroke order.  For phonetic input systems, it was the individual letters that constitute sound.  From these building blocks, a consideration set of possible correct sentences can be built from the user's input.  Evidence from both studies shows that a high degree of accuracy can be achieved through analysis of these individual components, whether they are graphical or phonetic.

A second commonality that arises between these two studies and approaches is the agreement that the unit for word processing analysis must be the word, not individual characters.  The step of determining possible alternatives to what was inputted by the user was only possible under a defined set of words or commonly appearing Chinese character strings.  Without identifying and grouping these characters as distinct units of thought, it would not have been possible to achieve the accuracy level the researchers did with character-by-character analysis.  In neither case was the need for word-based benchmarking questioned.

Both systems establish a viable framework with which to analyze a string of Chinese character inputs for the purpose of spell-check.  With further development, it will be possible to achieve some sort of parity with alphabetic scripts.

 

Works Cited

Chen, Yuan. 1997. Chinese Language Processing. Shanghai.

DeFrancis, John. 1984. The Chinese Language: Fact and Fantasy. University of Hawaii Press.

Feng, Zhiwei. 1989. Xiandai hanyu he jisuanji. Beijing.

"Printing," Microsoft® Encarta® Online Encyclopedia 2005
http://encarta.msn.com © 1997-2005 Microsoft Corporation. All Rights Reserved.

Russo, Thomas A. 2000. Office Collectibles: 100 Years of Business Technology. Atglen, PA: Schiffer. p. 161.

Unger, J. Marshall. 1987. The Fifth Generation Fallacy. New York.

Zhou, Youguang. 1987. Zhongguo yuci chuli he xiandai hanzixue. Yuwen jianshe 5:7-13.