(Translation Studies / Bilingual Lexicography)
Department of English, School of Foreign Studies, Nanjing University (南京大学外国语学院英语系)
Name | Wu Zhijie (吴志杰) | Student ID | MG0209051 |
Year | Second year | Supervisors | Zhang Boran (张柏然), Liu Huawen (刘华文) |
1. Specific topic of research
Chinese Word Segmentation in Machine Translation: A Cognitive Approach |
|||||||||||||||||||||||||||||||||
2. Significance of the topic (the rationale for your proposed research)
The word-segmentation systems currently available are either linguistically oriented or statistically oriented. Both types, however, have inherent defects that cannot be overcome, because Chinese is a pragmatically oriented language. Machine translation programs in particular do not produce very satisfactory results in Chinese word segmentation. The cognitive approach proposed here is designed in the light of how human readers segment words while reading, and aims to address the word-segmentation problem in machine translation from a different perspective. |
|||||||||||||||||||||||||||||||||
3. Existing research on this topic (literature review)
I. Status Quo of Chinese Word Segmentation Systems
Chinese word segmentation is one of the most important components of natural language processing for Chinese, and is often described as the bottleneck in Chinese language understanding and processing.
a. Linguistically-based word-segmentation systems
Although the significance of Chinese word segmentation has long been recognized, pioneering research in this field did not begin until the 1980s. Since the first Chinese word segmentation system, CDWS, was developed in 1983 by the Computer Department of Beijing University of Aeronautics and Astronautics, many studies have been conducted and quite a few models have been established (Wang et al. 2003). First came the mechanical matching method, which was employed in the CDWS system. In this method, a large lexicon intended to contain all possible Chinese words is established first; certain rules are then applied to divide the input sentence into small linguistic blocks, which are compared with the entries in the lexicon. If all of these blocks match entries in the lexicon, word segmentation is accomplished. Otherwise, an alternative division of the sentence is carried out and the process is repeated, possibly several times, before an acceptable result is obtained. The method has several subcategories: Maximum Matching and Minimum Matching, distinguished by whether priority is given to long words or to short words, and Obverse (forward) Matching and Reverse Matching, distinguished by the direction of processing. Among them, the Maximum Reverse Matching Method has been the most widely used. This method, however, is notorious for its poor handling of segmentation ambiguity. To improve its performance, a feature lexicon, a binding matrix, and grammar analysis have recently been incorporated into it. The feature lexicon extracts function words (e.g., 了), words with affixes (e.g., 老虎), and words formed by doubling characters (e.g., 明明白白), and is employed as a kind of pre-processing before the mechanical matching process. The binding matrix checks the results of the mechanical matching process against a grammar matrix and a semantic matrix at the phrase level. Grammar analysis also applies grammar rules for the purpose of word segmentation, but it is performed synchronously with the mechanical matching at both the phrase level and the sentence level, and therefore differs from the grammar matrix used in the binding matrix (Liu 2000; Wang et al. 2003; Luo et al. 1997; Yin 1998). These techniques, although different from each other, all belong to the formal rule-based methods, and the systems built on them do not produce very satisfactory results.
b. Statistically-based word-segmentation systems
There is another, less developed approach for Chinese word
segmentation, that is, a statistically-based word-segmentation processing system. Such a system usually employs word frequency and character co-occurrence probability to decide word boundaries; the only example is the system designed by Harbin Industrial University. Although it improves the segmentation of uncommon words, it does not perform well on common words, whose component characters combine very flexibly with other characters to form words and in most cases carry multiple meanings (Liu 2000; Wang et al. 2003).
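The idea of deciding word boundaries from character co-occurrence can be illustrated with a minimal sketch. It is not the Harbin system, whose internals are not described here. It assumes pointwise mutual information (PMI) as the co-occurrence measure, a tiny synthetic corpus built from a handful of words, and a hand-picked threshold; the names `train`, `pmi`, `segment`, `CORPUS`, and the threshold value are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy corpus: every subject + verb + object combination,
# stored as raw, unsegmented strings (24 six-character sentences).
SUBJECTS = ["我们", "他们", "老师", "朋友"]
VERBS = ["学习", "喜欢"]
OBJECTS = ["中文", "英语", "音乐"]
CORPUS = ["".join(parts) for parts in product(SUBJECTS, VERBS, OBJECTS)]


def train(corpus):
    """Count single characters and adjacent character pairs."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(zip(sentence, sentence[1:]))
    return chars, pairs


def pmi(a, b, chars, pairs):
    """Pointwise mutual information of the adjacent pair (a, b)."""
    if pairs[(a, b)] == 0:
        return float("-inf")  # never seen together: treat as a boundary
    p_ab = pairs[(a, b)] / sum(pairs.values())
    p_a = chars[a] / sum(chars.values())
    p_b = chars[b] / sum(chars.values())
    return math.log(p_ab / (p_a * p_b))


def segment(sentence, chars, pairs, threshold=2.3):
    """Keep two characters together when their PMI reaches the threshold.

    The threshold is an assumption tuned by hand for this toy corpus.
    """
    if not sentence:
        return []
    words = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b, chars, pairs) >= threshold:
            words[-1] += b      # strong association: same word
        else:
            words.append(b)     # weak association: start a new word
    return words


CHARS, PAIRS = train(CORPUS)
print(segment("我们学习中文", CHARS, PAIRS))  # ['我们', '学习', '中文']
```

In this toy corpus every word-internal pair scores higher than every cross-word pair, so a single threshold separates them. In real text, common characters that combine freely with many neighbours blur that gap, which is the weakness noted above.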
Translation Memory also belongs to this category, though it is a rather distant relative of the system designed by Harbin Industrial University. Translation Memory deals with word segmentation (more accurately, linguistic block segmentation) by following what human translators have already done. At present it is of very limited use, because Chinese-English parallel corpora are scarce and the technology for aligning parallel corpora is immature, let alone available on the market.
The difference between these two approaches can be described as deductive vs. inductive. The fundamental difference lies in the source of knowledge that ultimately determines the behavior of the system. Deductive systems rely on linguists and language engineers, who create or modify sets of rules in accordance with their knowledge, expertise, and intuition, while inductive systems depend on examples, which usually take the form of a corpus. The rules of inductive systems are often derived by the system itself from the examples. Neither approach has so far given a satisfactory response to the increasing need for Chinese word segmentation.
The problem is that both approaches, in addition to their obvious advantages, have a number of serious drawbacks.
II. Shortcomings discussed
Why can't these systems segment Chinese words very well? According to the available information on the design of these translation systems, most programs adopt either a linguistically-based or a statistically-based word-segmentation system. However, neither approach achieves a very satisfactory result. Rule-based, linguistically oriented systems fall short because most Chinese words can serve as more than one part of speech and have several, or even dozens of, meanings, and a single character can form words with many different characters placed before or after it. A statistically-based word-segmentation system cannot solve the problem either. It can only ensure that a certain percentage of segmentations are correct, while the remaining words are poorly processed and sometimes segmented absurdly. This approach improves the segmentation of unusual words, but does a very poor job on common words, whose component characters combine very flexibly with other characters to form words and in most cases carry multiple meanings (Liu 2000; Wang et al. 2003).
As some scholars have argued, European languages are mainly syntax-oriented while Chinese is basically pragmatics-oriented. In other words, pragmatic context and co-text play an important role in understanding Chinese texts, and hence are significant in Chinese word segmentation. All the above-mentioned systems, however, fail to take pragmatic contextual and co-textual information into consideration, which largely accounts for their poor word-segmentation performance. How to incorporate pragmatic contextual and co-textual information into word-segmentation systems therefore becomes highly relevant to the problem here. |
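The mechanical matching method reviewed in part I.a, and the ambiguity problem raised in part II, can be made concrete with a minimal sketch of forward and reverse maximum matching. It is an illustration under stated assumptions, not the CDWS implementation: the lexicon is a hypothetical five-entry toy, the maximum word length of four characters is arbitrary, and any character not covered by the lexicon is emitted as a single-character word.

```python
# Hypothetical toy lexicon; a real system would load a large dictionary.
LEXICON = {"研究", "研究生", "生命", "起源", "的"}


def forward_max_match(sentence, lexicon, max_len=4):
    """Scan left to right, always taking the longest lexicon entry."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(sentence, lexicon, max_len=4):
    """Scan right to left, always taking the longest lexicon entry."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = sentence[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the sentence
    return words


text = "研究生命的起源"
print(forward_max_match(text, LEXICON))  # ['研究生', '命', '的', '起源']
print(reverse_max_match(text, LEXICON))  # ['研究', '生命', '的', '起源']
```

Both directions consult the same lexicon, yet only the reverse scan yields the reading a human would choose ("study the origin of life"). Nothing in the lexicon tells the matcher which division is right, which is the kind of ambiguity that contextual information is meant to resolve.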
|||||||||||||||||||||||||||||||||
4. Specific, real questions existing research does not answer (or does not answer adequately) but you attempt to answer. Also, state if these are definition, basic data, descriptive, or causes-and-effects questions and what you have done to avoid reinventing the wheel in your proposed study.
The question my study intends to address, and which existing research does not answer, is how pragmatic contextual and co-textual information can be incorporated into word-segmentation systems so that the word-segmentation problem is solved more satisfactorily and the performance of machine translation is greatly improved. To avoid reinventing the wheel, I first conducted a pilot study to assess all the translation programs available on the market; the results show that a relatively large proportion of the translation errors can be attributed to mistakes in word segmentation. This means my research question is a genuine one of relatively great significance. Second, I have tried my best to carry out a thorough literature review, so that I can be confident my approach has not been tried before. |
|||||||||||||||||||||||||||||||||
5. Your tentative answer to the questions (your hypothesis): interpretive, descriptive, explanatory, or predictive? How will you operationalize your hypothesis so that it can be tested (e.g. by measuring its key concepts)? If your hypothesis is an interpretive one, does it have testable consequences?
My hypothesis is predictive in nature: if pragmatic contextual and co-textual information is incorporated into word-segmentation systems, the word-segmentation problem will be solved more satisfactorily and the performance of machine translation will be greatly improved. To operationalize the hypothesis, I will conduct one test to see how well the machine translation programs available on the market deal with Chinese word segmentation (my pilot study shows that a relatively large proportion of the translation errors can be attributed to mistakes in word segmentation). In addition, I will carry out two surveys and several interviews to find out how Chinese readers segment a Chinese sentence into words and how similarly they carry out this process. |
|||||||||||||||||||||||||||||||||
6. Methods by which you will obtain and analyze your evidence: What theoretical model will you use? Do you propose to conduct conceptual research, empirical research, or applied research? If you opt for empirical research, what specific kind(s) of empirical research (naturalistic, experimental, qualitative, quantitative?) will you do and what specific empirical research method(s) (case study, corpus study, survey, historical/archival study, etc.) will you use?
The main theoretical model of my study will be a process model, which will present the word-segmentation process of my newly designed model step by step. The other two theoretical models, i.e., the comparative model and the causal model, will be employed as well. The comparative model will be used to compare the results of human segmentation with those of translation programs, while the causal model will be applied to the justification of the new model.
My study is, in the first place, conceptual research, since its focus is a new model for Chinese word segmentation. However, the part that supports the new model is empirical. In this part, one test, two surveys and six interviews will be carried out to obtain evidence. The test and surveys are mainly quantitative studies, whereas the interviews aim at obtaining qualitative data. |
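One way the quantitative comparison between two segmentations (human vs. program, or one respondent vs. another) could be scored is sketched below. This is an assumption for illustration only: the proposal does not name a metric, and the boundary-level precision, recall and F1 used here, together with the function names and the sample data, are hypothetical choices.

```python
def boundaries(words):
    """Return the character offsets at which a word boundary falls.

    The end of the sentence is excluded because every segmentation shares it.
    """
    cuts, position = set(), 0
    for word in words[:-1]:
        position += len(word)
        cuts.add(position)
    return cuts


def boundary_agreement(reference, candidate):
    """Score a candidate segmentation against a reference one.

    Returns (precision, recall, f1) computed over internal word boundaries.
    """
    if "".join(reference) != "".join(candidate):
        raise ValueError("both segmentations must cover the same text")
    ref, cand = boundaries(reference), boundaries(candidate)
    shared = len(ref & cand)
    precision = shared / len(cand) if cand else 1.0
    recall = shared / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: a human reading and a program's output for one sentence.
human = ["研究", "生命", "的", "起源"]
program = ["研究生", "命", "的", "起源"]
precision, recall, f1 = boundary_agreement(human, program)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Averaging such scores over a test set would give one figure per translation program, and applying the same function to pairs of survey respondents would give a rough measure of how similarly people segment the same sentence.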
|||||||||||||||||||||||||||||||||
7. Evidence you will use or expect to find. Are they textual or contextual? Why do you think your evidence supports your hypothesis?
The evidence I will use consists mainly of contextual variables. To be more specific, I will use source-text variables, especially the word segmentation of source texts. Chinese word segmentation is one of the most important components of natural language processing for Chinese, and is often described as the bottleneck in Chinese language understanding and processing. How source texts are segmented clearly affects the quality of machine translation. My new model of Chinese word segmentation takes pragmatic contextual and co-textual information into consideration, which, hopefully, will contribute to better performance of machine translation systems. |
|||||||||||||||||||||||||||||||||
8. Preliminary findings (Give one example, e.g. variables you've identified and patterns or regularities you've found in your pilot study.)
In my pilot study, I found that the computer translation programs available on the market, which are either linguistically oriented or statistically oriented, are unable to handle Chinese word segmentation very well in Chinese-English translation. The study shows that a relatively large proportion of the translation errors can be attributed to mistakes in word segmentation. In my survey studies and interviews, I found that people carry out word segmentation in quite similar ways, and that in most cases they rely on pragmatic contextual and co-textual information rather than on linguistic principles. |
|||||||||||||||||||||||||||||||||
9. Major references
Lamb, Sydney. (1999). Pathways of the brain: The neurocognitive basis of language. Amsterdam & Philadelphia: John Benjamins.
Liu, Kaiying 刘开瑛. (2000). 《中文文本自动分词和标注》 [Automatic word segmentation and tagging of Chinese texts]. 北京: 商务印书馆 [Beijing: The Commercial Press].
Luo, Zhengqing et al. 骆正清等. (1997). 《汉语自动分词研究综述》 [A survey of research on automatic Chinese word segmentation]. 《浙江大学学报》 [Journal of Zhejiang University]. Vol. 31, No. 3.
Taylor, John. (2002). Cognitive grammar. Oxford: Oxford University Press.
Wang, Ke et al. 王科等. (2003). 《汉语分词的主要技术及其应用展望》 [Main techniques in Chinese word-segmentation and their prospective applications]. 《通信技术》 [Communications Technology]. 2003, No. 6.
Wu, Andi. (2003). Chinese word segmentation in MSR-NLP. The Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. Retrieved May 11, 2004, from http://jefferson.village.virginia.edu/pmc/text-only/issue.591/moulthro.591
Yin, Jianping 殷建平. (1998). 《汉语自动分词方法》 [Automatic word segmentation methods for Chinese]. 《计算机工程与科学》 [Computer Engineering and Science]. Vol. 20, No. 3. |
|||||||||||||||||||||||||||||||||
10. The timetable for your research
|
|||||||||||||||||||||||||||||||||
Supervisor's signature (1) | | Date | |
Supervisor's signature (2) | | Date | |