|
Spell-check for Chinese:
viable methods
I hope the reader has as much fun
reading this as I did writing it. It was truly a pleasure.
Abstract
This paper
is divided into two parts. In the first, it takes a look at the
development of input methods for computing in Chinese and their
current manifestations. It focuses particularly on the present
schemes devised for keyboard inputting as they are the most
widely employed tools for this process. The second part of this
paper examines what these input systems mean for word
processing. Given the nature of character-based writing and the
various input methods, I look at what makes it more difficult to
develop word processing mechanisms, particularly a robust
spell-check system, compared to alphabets. Spell-check
approaches based on graphic and phonetic input methods are
discussed. The results and implications of the approaches are
compared.
I Introduction
Around the
11th century AD, movable type for printing was invented in
China (“Printing” 2005). While the printing of books, mostly
Buddhist texts, on paper had been going on for centuries, it was
all done on large, individually carved blocks, with the form of
each character delicately finished by a master artisan. It was
at this time that a Chinese inventor took the idea of printing
to the next level by devising the concept of printing with
movable type. Individual characters would first be carved out
and later arranged on a printing block to reproduce the desired
text on a page. This invention, though, never came to have a
significant impact on Chinese society with the promise of the
cheaply produced written word, and the reason was simple. The
Chinese language is represented by a character set numbering in
the tens of thousands. The task of organizing and manipulating
such a large typeset was tremendously impractical and costly,
and so individually carved blocks continued to be the medium of
choice for printing.
The same
story was repeated in the West, although with vastly
different results.
Johannes Gutenberg is the man largely credited with
the invention of movable type in his shop in Germany in 1450.
Although it came several centuries after the Chinese invention,
his invention was truly revolutionary. Languages that employ a
limited set of letters for written expression, such as English,
which requires only 52 letters (upper and lower case),
flourished with the onset of printed material that could be
mass-produced cheaply.
This
historical example provides an interesting illustration of the
different results that can come from technological innovation
due to the differences in the written word. A parallel can be
drawn to modern times with the development of computers and
their interaction with character-based scripts and alphabets.
Viewed by some users of character script as the tool that will
finally level the playing field for text processing, computers
have in fact done pretty much the reverse. They have amplified
the differences between character-based writing and its
alphabetic counterpart, revealing the unwieldiness of the former
as an obstacle that will forever prevent it from reaching parity
with alphabetic systems in computing speed. As the expert
sinologist Professor Marshall Unger points out, the inefficiency
of character writing will only increase “as the scope and number
of computer applications grow” (Unger 1987). Much to the
disappointment of character script users, computers will only
continue to accentuate the performance gap as processor speeds
continue to rise.
All this is
not to say that there is no future for languages that employ
character-based writing systems. Once one is cognizant of the
difference between alphabetic and character systems, it is
possible to find a way out of this problem, perhaps even by
adopting attributes of the former. This paper will look at one
dimension of the character computing issue, namely that of input
systems for Chinese languages. Starting with a discussion of the
various schemes devised for this intricate task, it will then
move on to the implications these systems have for word
processing.
II Input Systems
An input
system in the context of this paper on computing is a way by
which users, following predefined rules and procedures, interact
with some medium in order to produce a desired output from the
computer. In this case, it is a display of text across the
computer screen that is intended by the user. The development
of these input systems for Chinese characters was a long and
arduous task. It did not come easily, as can be seen from the
obstacles faced by early typewriters.
Early input methods
Before the
invention of computers, typewriters were the tools used for word
processing. The words ‘input system’ or ‘input method’ can be
applied to typewriters in a loose sense since the users were
still interacting through some medium to write Chinese
characters. The difference is that in a typewriter the process
was completely mechanical, while a computer turns the actions
of the user into electrical signals.
The first
English typewriter was invented in 1867. Fifty years later, in
1917, the first Chinese character typewriter appeared. It
was manufactured by the Japanese firm Nippon Typewriter Co. The
Nippon typewriter had a flat bed of 3,000 Japanese characters,
most of which were Kanji, some of which could be used to type
Chinese. This typewriter was considered a shorthand version
since the Japanese language contains in excess of 30,000
characters (Russo 2000). To use the typewriter, paper was
wrapped around the cylindrical rubber platen, which moved on
rollers over the bed of type. The operator used a lever to
control an arm that picked up a piece of metal type from the
bed, pressed it against the paper, and returned it to its niche.
The process was slow and tedious, even for trained typists.
The large
bed of characters persisted in typewriters developed
later on. The problem with Chinese character typewriters was
twofold. Not only did they require a lot of movement to
operate, since touch typing was impossible with these machines,
but they also burdened their operators with the task of finding
the right character amid a bed of literally thousands of
characters. Finding the right character was a time-consuming
task. The positions had to be memorized, since what the operator
saw was a collection of upside-down, backward characters in very
small type. The machine took 160 to 200 times longer to master
than a Western-language typewriter (Feng 1989).
Input systems for modern computing
It would
seem that the early modern period was a dark time for Chinese
character computing, but it was still some kind of improvement
over the past. Moving into the computer age, many more options
began to open up to users of Chinese characters. From the
definition of input systems given previously, we can categorize
these various new ways of inputting Chinese by their mediums
into three basic groups. First there is voice-to-text which
utilizes a microphone and voice recognition software in order to
write Chinese. Second there is handwriting recognition which
uses a pressure sensitive tablet, stylus, and specialized
software to convert handwritten characters to text on a screen.
I will not cover either group here, since neither of
these two types of input system has the potential to become
the dominant system given the current technology. The last and
largest category of input systems is that based on the
keyboard. To date, this has been the most viable medium for
inputting Chinese in a speedy and practical manner. The
proliferation of keyboard-based input systems is proof of
their superiority, and so the discussion will focus on them.
Keyboard input methods
Keyboard
input methods are by far the largest category. They can be
further divided into two subgroups. One is shape-based, graphic
input methods, such as Wubizixing and Cangjie. The other is
pronunciation-based, phonetic input methods, such as Pinyin,
Zhuyin, and methods that use non-Mandarin sounds as a starting
point. While there exist many other keyboard input
systems for Chinese, such as English-to-Chinese or Hanzhong,
telegraphic code, and the four corners system or CKC, the number
of people who use these systems exclusively is negligible
compared to the graphic and phonetic systems described below.
Graphic methods
Graphic
input systems rely on the visual components of characters.
Graphical components of characters are assigned to specific keys
on the keyboard. Users follow set rules to input these
components in the right order to get the appropriate character
to appear on the screen. There are two major graphical input
systems for the keyboard: Wubizixing, which is used in
China, and Cangjie, which is found in Taiwan.
While there
may be differences across the systems, graphical input methods
generally follow a predictable pattern. First, the user
identifies the shape of the character they would like to type.
With the strokes of the character in mind, the user then
follows one of two kinds of rules, depending on the input method
used. Strokes may be entered in the standard stroke order, as
one would write the character by hand. Alternatively, the user
may be called on to single out the identifying graphical
components of the character and enter just parts of those
components in combination to get the character. The rules
governing this second type of graphical input are more ad hoc,
since they do not follow an accepted convention like stroke
order.
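The component-to-key idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the table below is a toy with five entries, the key letters are stand-ins chosen for the example and should not be relied on as an actual Cangjie or Wubizixing table, and `lookup` is a hypothetical helper name.

```python
from typing import Optional

# Toy code table: key sequence -> character. Illustrative only; a real
# graphic input method ships a table with thousands of entries.
CODE_TABLE = {
    "A": "日",   # "sun" component on key A
    "B": "月",   # "moon" component on key B
    "AB": "明",  # "bright" = sun component followed by moon component
    "D": "木",   # "tree" component on key D
    "DD": "林",  # "woods" = tree component typed twice
}

def lookup(keys: str) -> Optional[str]:
    """Return the character for a key sequence, or None if no such code exists."""
    return CODE_TABLE.get(keys.upper())
```

In this toy table `lookup("ab")` yields 明, while `lookup("ba")` yields nothing: the same components in the wrong order do not produce the character, which is why the input order has to be learned along with the shapes.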
Phonetic methods
The most
common phonetic input methods for Chinese are based on the
Pinyin and Zhuyin systems, the latter also known as BoPoMoFo.
Pinyin uses the Latin alphabet to indicate the sounds of the
Chinese language, while Zhuyin, which was developed in 1913, is
a phonetic system that resembles Japanese kana. Both of these
systems were developed to model the sounds of standard Mandarin.
The Pinyin input system, since it utilizes the alphabet, follows
the standard QWERTY layout found on most English-language
computers. The Zhuyin input system, as it appears in places
where it is used, such as Taiwan, maps its 40 keys (36 sounds, 4
tones) onto the QWERTY keyboard by also using the fourth row,
which is reserved for numbers and symbols when typing English.
Phonetic
input systems as their name implies are based on sounds, and so
users type in their natural spoken language. Some developers
using this fact have made headway in developing phonetic input
systems based on non-Mandarin Chinese sounds, such as
Cantonese. One such system is the Red Dragonfly Chinese Input
System, which can be readily downloaded online. It functions
similarly to Pinyin, but it is a Romanized system based on the
Cantonese topolect spoken in southern China.
Which is the better system?
Graphic
input systems, by their nature, do not suffer from the ambiguity
that arises when many characters share one sound, but they are
troubled by difficulties of another sort. Graphic input systems
require two things. First, the user must know what the character
looks like. Without this knowledge, the user will be at a loss
as to how to input the character on any graphic-based system.
Second, the user must know, depending on the particular system
being used, in what order to input the graphical information to
produce the right character on the screen. Neither of these
requirements is based in the natural spoken language of the
user, and so they must be learned separately. This is a major
hindrance to their proliferation and accounts for the popularity
of Pinyin and other phonetic systems.
It should
come as no surprise that, because it is easy to learn and use,
Pinyin is the most popular Chinese input method by far. Over
97% of the users in China use Pinyin for input (Chen 1997).
Although Pinyin and phonetic input methods in general enjoy
much currency with users, they suffer from several problems, one
of which is conversion error when going from sound to
character. As explained above, phonetic input methods convert
sounds into characters, but Mandarin, on which most of these
systems are based, has only 398 distinct syllables if tones are
excluded. The number rises to 1277 with the inclusion of tone
(DeFrancis 1984). This small syllabary must correspond to over
6000 common Chinese characters, so it is very difficult for a
system to select the correct corresponding Chinese characters
automatically.
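To put a rough number on this mismatch, the figures above can be divided out. The short Python sketch below assumes, purely for illustration, that the 6000 common characters are spread evenly over the syllables, which real usage certainly is not; the function name is my own.

```python
COMMON_CHARACTERS = 6000     # "over 6000 common Chinese characters"
SYLLABLES_NO_TONE = 398      # distinct syllables without tone (DeFrancis 1984)
SYLLABLES_WITH_TONE = 1277   # distinct syllables with tone (DeFrancis 1984)

def average_candidates(characters: int, syllables: int) -> float:
    """Average number of characters sharing one syllable (even-spread assumption)."""
    return characters / syllables

print(round(average_candidates(COMMON_CHARACTERS, SYLLABLES_NO_TONE), 1))    # 15.1
print(round(average_candidates(COMMON_CHARACTERS, SYLLABLES_WITH_TONE), 1))  # 4.7
```

Even when the user types the tone, a single syllable still stands for several characters on average, so the system cannot choose a character from the sound alone.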
For
phonetic systems, the solution to the ‘homonym’ problem lies in
n-gram language models or the clear delineation of groups of
Chinese characters that come together to form words. Both these
frameworks are built on the same foundation: taken
individually, each syllable can be expressed by multiple Chinese
characters, but when syllables are grouped together into
logical, lexical units, the occurrence of homonyms declines
dramatically. Multiple studies have shown this to be the case.
In a Chinese dictionary of 60,000 words, some 4,000 entries, or
about 7 percent, have homonyms. For a 120,000-word dictionary,
the number of entries with homonyms increases to about 6,000,
but their share falls to about 5 percent (Zhou 1987). Through
the delineation of words, the homonym problem is reduced to a
manageable size. At this level, precise determinations of the
appropriate characters can be made through context by the
computer program or the user.
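The effect of grouping syllables into words can be shown with a toy example in Python. The two tables are small hand-made samples that I assume for illustration, not data from any of the studies cited, and the function names are hypothetical.

```python
# Toy tables, assumed for illustration: toneless Pinyin -> written forms.
CHAR_CANDIDATES = {
    "shi": ["是", "时", "事", "十", "市", "世"],
    "jie": ["接", "街", "节", "界", "姐"],
    "jian": ["见", "间", "件", "建"],
}
WORD_CANDIDATES = {
    ("shi", "jie"): ["世界"],           # "world"
    ("shi", "jian"): ["时间", "事件"],   # "time", "event"
}

def char_level_readings(syllables):
    """Number of character strings possible if each syllable is converted alone."""
    count = 1
    for syllable in syllables:
        count *= len(CHAR_CANDIDATES.get(syllable, []))
    return count

def word_level_candidates(syllables):
    """Written forms listed for the syllables taken together as one word."""
    return WORD_CANDIDATES.get(tuple(syllables), [])
```

In this sample, converting "shi" and "jie" one syllable at a time allows 6 x 5 = 30 character strings, while the word-level table offers a single candidate. The pair "shi jian" still leaves two candidates, which is the residue that context, or the user, has to resolve.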
III Word Processing
Western alphabetic scripts
Word
processing for alphabetic scripts is made easier by the
orthography, or conventions of writing, already in use by the
time computers arrived. With word division and a high degree of
standardization in spelling, Western alphabetic scripts are
simply easier to analyze. To illustrate this point more clearly,
it is necessary to compare alphabetic scripts with the
character-based Chinese writing system.
First,
Chinese writing does not employ word division. In the context
of word processing, this can become problematic. In alphabetic
scripts, a simple rule can be utilized to single out a word for
analysis: Anything that appears between two spaces is a word.
For Chinese writing, the same rule does not hold. Instead more
complex rules must be devised to ‘lift’ the words out of
context.
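The contrast can be made concrete with a short Python sketch. For the Chinese side I use forward maximum matching against a word list, a common textbook technique that I am substituting here for illustration; it is not a method described by the studies discussed below, and the five-entry lexicon is a hypothetical sample.

```python
# Hypothetical sample lexicon; a real one would hold tens of thousands of words.
LEXICON = {"中文", "输入", "输入法", "方法", "法"}

def split_alphabetic(text):
    """Alphabetic scripts: anything between two spaces is a word."""
    return text.split()

def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry.

    Falls back to a single character when nothing in the lexicon matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words
```

Here `split_alphabetic("spell check for Chinese")` needs no outside knowledge at all, whereas `forward_max_match("中文输入法", LEXICON)` can only return the two words 中文 and 输入法 because the lexicon already lists them. The quality of the segmentation therefore depends entirely on the quality of the word list.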
As if that were
not hard enough already, the issue is compounded by the fact
that in Chinese, there is no standardized concept of ‘word’ laid
down by authority. The closest thing to it is the set of
character constructions adopted by the majority of Chinese
users. These are characters that predictably appear next to each
other in a string to describe something.
Solutions to the spell-check problem
Chinese
does not have many of the orthographic qualities that alphabetic
writing systems have which make them friendly to word
processing. Still, there is a way to build a spell-check model
for Chinese even though it does not possess things like word
division. The solution lies in breaking down individual
characters to the keystrokes used in their input and
constructing some sort of framework that can identify Chinese
words through context. Two studies described in the next two
sections seek to do just that. One chooses a graphic system as
the starting point and the other selects a phonetic system.
A graphic input system approach
A
spell-check system for graphic input for Chinese must take into
consideration the inability to differentiate between a character
that is keyed in intentionally by the user and a legitimate
character that was the result of pressing the incorrect key
sequence. In a study to develop a spelling correction system
for Chinese, a group of researchers in Taiwan looked at ways it
could be done for the Taiwanese graphic input system Cangjie.
The original paper can be found
here.
It is the first entry.
In building
the foundation for their system, they established two things.
First, they looked to input key order for Cangjie as the
starting point for building their spell-check system. Second,
they used various n-gram language models and Chinese frequent
strings (CFS) as a benchmark with which to compare their output
to determine if something was amiss with the input. Since there
is no official standardization of word or word division in
Chinese writing, the n-gram language models and CFS serve as
close substitutes by looking at real world use of characters and
their associations with each other.
From these
two starting points, the researchers constructed a mechanism to
determine what they call the ‘confusing set.’ The confusing set
is built from legitimate sequences of words based on the n-gram
language models and CFS that resemble the actual input by the
user. Thus, a list of candidate sentences is generated among
which the intended sentence may be present.
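A much simplified Python sketch of this idea follows. It is my own reading of the description above, not the researchers' implementation: the code table and frequency counts are invented toy data, "resemblance" is reduced to a one-key difference between codes of equal length, and at most one character per sentence is replaced.

```python
# Toy data, assumed for illustration only.
CODE_TO_CHAR = {"AB": "明", "BB": "朋", "AA": "昌", "MK": "天"}
FREQUENT_STRINGS = {"明天": 120, "朋友": 90}  # string -> made-up corpus count

def one_key_neighbors(code):
    """Legitimate codes of the same length that differ in at most one key."""
    return [
        other for other in CODE_TO_CHAR
        if len(other) == len(code)
        and sum(a != b for a, b in zip(code, other)) <= 1
    ]

def confusing_set(codes):
    """Candidate sentences in which at most one character has been replaced.

    Every typed code is assumed to be legitimate, since a mistyped sequence
    that maps to no character would already be rejected by the input method.
    """
    typed = [CODE_TO_CHAR[c] for c in codes]
    candidates = {"".join(typed)}
    for i, code in enumerate(codes):
        for alt in one_key_neighbors(code):
            chars = list(typed)
            chars[i] = CODE_TO_CHAR[alt]
            candidates.add("".join(chars))
    return candidates

def rank(candidates):
    """Order candidates by the total count of frequent strings they contain."""
    def score(sentence):
        return sum(n for s, n in FREQUENT_STRINGS.items() if s in sentence)
    return sorted(candidates, key=lambda s: (-score(s), s))
```

If the user means 明天 ("tomorrow") but types `BB` instead of `AB`, the screen shows the legitimate but unintended 朋天. In this toy setup `rank(confusing_set(["BB", "MK"]))` places 明天 first, because it is the only candidate that contains a frequent string.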
Running two
experiments that tested the spell-check model with 485,272
sentences, the researchers reported their results. In the first
experiment, they used the Academia Sinica Chinese Electronic
Dictionary (ASCED) along with a traditional word bi-gram
language model as the benchmark for determining the ‘confusing
set.’ The resulting candidates in the ‘confusing set’ were then
ranked by likelihood. The model’s top choice was the correct
sentence 80.95% of the time. Given the top ten choices, the
accuracy goes up to 84.84%, a small incremental gain of 3.89
percentage points from offering ten candidates instead of one.
The second
experiment used a dictionary consisting of Chinese frequent
strings and a uni-gram language model as the benchmark for
‘confusing set’ determination. The results from this experiment
were even better than the first in terms of prediction accuracy,
with the top choice being right 87.32% of the time. Given the
top ten choices, the accuracy improves to 97.48%, which is a
gain of 10.16 percentage points.
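The "top choice" and "top ten" figures are instances of what is usually called top-k accuracy. The Python sketch below shows how such a figure is computed from ranked candidate lists; the function is a generic helper of my own, not code from the study, and the last two lines simply recompute the gains from the percentages reported above.

```python
def top_k_accuracy(ranked_lists, intended, k):
    """Share of test sentences whose intended form is among the first k candidates."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, intended)
        if target in ranked[:k]
    )
    return hits / len(intended)

# Gains from top-1 to top-10, in percentage points, from the reported figures.
gain_first_experiment = round(84.84 - 80.95, 2)   # 3.89
gain_second_experiment = round(97.48 - 87.32, 2)  # 10.16
```

The same function with k set to 3 or 5 would give the intermediate cumulative rates, which the reported results do not break out.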
While the
results were impressive for both experiments, there are still
questions about the robustness of this model. In the experiment
design, the researchers decided to test their model only
with sentences that contain at most one typing error. Would it
be able to handle sentences with more than one error? If so,
how accurate would its top choice be?
What would be the cumulative prediction rate among the top 3
or 5 choices?
A phonetic input system approach
A look at
ways to approach spell-check in Chinese with phonetic input
systems was conducted by Microsoft researchers in China with
Pinyin. Their paper can be found
here.
They used
the individual letter inputs as the basic unit of analysis to
build the set of candidate sentences and a statistical language
model that was an amalgamation of tri-gram modeling and maximum
likelihood-based methods to serve as the benchmark from which to
derive the candidate set.
They
conducted their test in two parts. First, they looked at error
rates for Pinyin input before any attempt at correction was
made. For a set of perfectly inputted Pinyin, the error rate of
Pinyin-to-character conversion was 6.84%. With actual input,
which included spelling errors from test subjects, the error
rate of Pinyin-to-character conversion rose to 20.84%. The
interesting thing to note here is that only 4.6% of the Chinese
characters were typed incorrectly in the set produced by the
test subjects, but the compounding of error due to a word-based
input model pushed the conversion error rate much higher.
In the
second part, they applied their language model for spelling
correction at varying weights to determine the optimum level
that minimizes error. In the perfect input case, the results
under the system were worse, but only slightly, since the
adaptive spell-check model was overcompensating for errors
that were not there. The result of the test on actual input
shows that the error rate can be decreased to a low of slightly
above 13% for test subjects. The relative error reduction rate
is about 50%. Since the researchers did not seek to control the
number of input errors per sentence, this model may prove to be
more robust, and may have a higher prediction accuracy when
limited to only one error, as was the case in the graphic input
system approach to spell-check.
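The trade-off behind the weighting can be illustrated with a small Python sketch. It is not the researchers' model, which combined tri-gram modeling with maximum likelihood-based methods. As a stand-in I assume a uni-gram table of invented word probabilities, a typing model that charges a fixed penalty per edit, and a hypothetical `lm_weight` parameter.

```python
import math

# Invented uni-gram probabilities for two Pinyin words one keystroke apart.
WORD_PROB = {
    "wanshang": 0.03,    # 晚上, "evening" (assumed to be the more frequent word)
    "wangshang": 0.005,  # 网上, "online"
}
KEY_ERROR = 0.05  # assumed probability charged for each edit

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def best_correction(typed, lm_weight):
    """Pick the word maximizing weighted language-model score plus typing score."""
    def score(word):
        return (lm_weight * math.log(WORD_PROB[word])
                + edit_distance(typed, word) * math.log(KEY_ERROR))
    return max(WORD_PROB, key=score)
```

With `lm_weight=1.0`, the misspelling "wansahng" is corrected to "wanshang", and a correctly typed "wangshang" is left alone. With `lm_weight=3.0`, the correctly typed "wangshang" is overridden by the more frequent "wanshang". That is the kind of overcompensation on perfect input described above, and it is why the weight has to be tuned rather than simply maximized.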
Comparison and discussion of results
It is
interesting to note that although the two studies analyze input
systems that have fundamentally different starting points, in
both the development of a spell-checking system begins with
breaking down the input into sub-character components. For
graphic input systems, this was the keystroke order. For
phonetic input systems, it was the individual letters that spell
out a sound. From these building blocks, a consideration set of
possible correct sentences can be built from the user’s input.
Evidence from both studies shows that a high degree of accuracy
can be achieved through analysis of these individual components,
whether they are graphical or phonetic.
A second
commonality that arises between these two studies and approaches
is the agreement that the unit for word processing analysis must
be the word, not individual characters. The step of determining
possible alternatives to what was inputted by the user was only
possible under a defined set of words or commonly appearing
Chinese character strings. Without identifying and grouping
these characters as distinct units of thought, it would not have
been possible to achieve the accuracy level the researchers did
with character-by-character analysis. In neither case was the
need for word-based benchmarking questioned.
Both
systems establish a viable framework with which to analyze a
string of Chinese character inputs for the purpose of
spell-check. With further development, it will be possible to
achieve some sort of parity with alphabetic scripts.
Works Cited
Chen, Yuan.
1997. Chinese Language Processing. Shanghai.
DeFrancis,
John. 1984. The Chinese Language: Fact and Fantasy. University
of Hawaii Press.
Feng,
Zhiwei. 1989. Xiandai hanyu he jisuanji. Beijing.
"Printing,"
Microsoft® Encarta® Online Encyclopedia 2005
http://encarta.msn.com © 1997-2005 Microsoft Corporation. All
Rights Reserved.
Russo,
Thomas A. 2000. Office Collectibles: 100 Years of Business
Technology. Atglen, PA: Schiffer. p. 161.
Unger, J.
Marshall. 1987. The Fifth Generation Fallacy. New York.
Zhou,
Youguang. 1987. Zhongguo yuci chuli he xiandai hanzixue. Yuwen
jianshe 5:7-13 |