Proposal for the Thesis Project

(Translation Studies / Bilingual Lexicography)

Department of English, School of Foreign Studies, Nanjing University

Graduate Thesis Proposal Form (Translation Studies / Bilingual Lexicography)

Name: 吴志杰 (Wu Zhijie)

Student ID: MG0209051

Year: Second year

Supervisors: 张柏然 (Zhang Boran), 刘华文 (Liu Huawen)

1. Specific topic of research

Chinese Word Segmentation in Machine Translation: A Cognitive Approach

2. Significance of the topic (the rationale for your proposed research)

Existing word-segmentation systems are either linguistically oriented or statistically oriented. Both types, however, have inherent defects that cannot be overcome, because Chinese is a pragmatically oriented language. As a result, machine translation programs do not segment Chinese text very satisfactorily. The cognitive approach proposed here is modeled on how human readers segment words as they read, and aims to address the word-segmentation problem in machine translation from a different perspective.

3. Existing research on this topic (literature review)

I. Status Quo of Chinese Word Segmentation Systems

Chinese word segmentation is one of the most important components of natural language processing for the Chinese language and is often referred to as the bottleneck in Chinese language understanding and processing.

a. Linguistically-based word-segmentation systems

Although the significance of Chinese word segmentation has long been recognized, pioneering research in this field did not begin until the 1980s. Since the first Chinese word segmentation system, CDWS, was developed in 1983 by the Computer Department of Beijing University of Aeronautics and Astronautics, many studies have been conducted and quite a few models have been established (Wang et al. 2003). First came the mechanical matching method, which was employed in the CDWS system. In this method, a large lexicon containing all possible Chinese words is first established; rules are then applied to divide the input sentence into small linguistic blocks, which are compared with the entries in the lexicon. If every block matches an entry, segmentation is complete. Otherwise, an alternative division of the sentence is tried and the process is repeated, possibly several times, until an acceptable result is obtained. The method has several variants: Maximum Matching and Minimum Matching, which differ in whether long or short words are given priority, and Forward Matching and Reverse Matching, which differ in the direction of processing. Among them, Reverse Maximum Matching has been the most widely used. The method, however, is notorious for its poor handling of segmentation ambiguity.

To improve its performance, feature lexicons, binding matrices, and grammar analysis have more recently been incorporated. A feature lexicon extracts function words (e.g., 了), words with affixes (e.g., 老虎), and reduplicated words (e.g., 明明白白) as a pre-processing step before mechanical matching. A binding matrix checks the results of mechanical matching against a grammar matrix and a semantic matrix at the phrase level. Grammar analysis also applies grammar rules to word segmentation, but it runs in parallel with mechanical matching at both the phrase and the sentence level, and so differs from the binding matrix (Liu 2000; Wang et al. 2003; Luo et al. 1997; Yin 1998). Although these techniques differ from one another, they all belong to the formal rule-based methods, and the systems built on them do not produce very satisfactory results.
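To make the matching procedure concrete, the following is a minimal Python sketch of forward and reverse maximum matching. It is an illustration only, not the CDWS implementation: the function names, the toy lexicon, the four-character limit on word length, and the example sentence 研究生命的起源 are all assumptions introduced here, and unlisted single characters are simply accepted as one-character words.

```python
def reverse_maximum_matching(sentence, lexicon, max_word_len=4):
    """Segment right to left, preferring the longest lexicon entry (hypothetical helper)."""
    words = []
    end = len(sentence)
    while end > 0:
        # Try the longest candidate ending at `end`, then shrink it from the left.
        for size in range(min(max_word_len, end), 0, -1):
            candidate = sentence[end - size:end]
            # Assumption: an unlisted single character is accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def forward_maximum_matching(sentence, lexicon, max_word_len=4):
    """Segment left to right, preferring the longest lexicon entry (hypothetical helper)."""
    words = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate starting at `start`, then shrink it from the right.
        for size in range(min(max_word_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                start += size
                break
    return words


# Toy lexicon and sentence, invented for illustration.
toy_lexicon = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_maximum_matching("研究生命的起源", toy_lexicon))  # ['研究', '生命', '的', '起源']
print(forward_maximum_matching("研究生命的起源", toy_lexicon))  # ['研究生', '命', '的', '起源']
```

On this toy input the reverse direction happens to recover the intended reading, while the forward direction is misled by the longer entry 研究生. Neither direction consults context, which is why ambiguity remains a weakness of the method.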

b. Statistically-based word-segmentation systems

The other, less developed approach to Chinese word segmentation is statistically based. It usually employs word frequency and character co-occurrence probability to decide word boundaries; the only example is the system designed by Harbin Institute of Technology. Although it improves the segmentation of uncommon words, it does not perform well on common words, whose component characters combine very flexibly with other characters to form words and in most cases have multiple meanings (Liu 2000; Wang et al. 2003). Translation Memory also falls into this category, though it is only a remote relative of the Harbin system. It deals with word segmentation (more accurately, linguistic block segmentation) by following what human translators have already done. At present it is of very limited use, because Chinese-English parallel corpora are scarce and the technology for aligning parallel corpora is immature; still less is such a system available on the market.
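The frequency-based idea can be illustrated with a minimal Python sketch that chooses the segmentation whose words have the highest combined probability. This is a generic unigram model offered under stated assumptions, not a reconstruction of the Harbin system: the function name, the toy frequency counts, the `unknown_count` smoothing value, and the example sentence are all hypothetical, and character co-occurrence statistics are left out for brevity.

```python
import math


def max_probability_segment(sentence, word_freq, max_word_len=4, unknown_count=0.5):
    """Return the segmentation that maximizes the product of word probabilities."""
    total = sum(word_freq.values())
    n = len(sentence)
    # best[i] = (best log-probability for sentence[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            count = word_freq.get(word)
            if count is None:
                if len(word) > 1:
                    continue  # unlisted multi-character strings are not candidates
                count = unknown_count  # assumption: crude smoothing for unlisted characters
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    words.reverse()
    return words


# Toy frequency counts, invented for illustration.
toy_freq = {"研究": 50, "研究生": 10, "生命": 40, "命": 2, "的": 200, "起源": 20}
print(max_probability_segment("研究生命的起源", toy_freq))  # ['研究', '生命', '的', '起源']
```

The result depends entirely on the counts supplied. A character that is frequent and combines freely with many neighbors receives competing high-probability readings, which is consistent with the weakness on common words noted above.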

The difference between these two approaches can be described as deductive versus inductive: they differ fundamentally in the source of knowledge that ultimately determines the behavior of the system. Deductive systems rely on linguists and language engineers, who create or modify sets of rules in accordance with their knowledge, expertise, and intuition, whereas inductive systems depend on examples, usually in the form of a corpus, from which the system itself often derives its rules. Neither has so far met the increasing need for Chinese word segmentation. The problem is that both approaches, for all their obvious advantages, have a number of serious drawbacks.

II. Shortcomings discussed

Why can't these systems segment Chinese words very well? According to the available information on their design, most translation programs adopt either a linguistically based or a statistically based word-segmentation system, and neither achieves a very satisfactory result. Rule-based, linguistically oriented systems fall short because most Chinese words can serve as more than one part of speech and have several, or even dozens of, meanings, and a single character can form words with many different characters placed before or after it. Statistically based systems cannot solve the problem either: they can ensure only that a certain percentage of words are segmented correctly, while the rest are poorly processed, sometimes with ridiculous segmentation mistakes. This approach improves the segmentation of unusual words but does a very poor job on common words, whose component characters combine very flexibly with other characters and in most cases have multiple meanings (Liu 2000; Wang et al. 2003).

As some scholars have argued, European languages are mainly syntactically oriented, while Chinese is basically pragmatically oriented. In other words, pragmatic context and co-text play an important role in understanding Chinese texts and are therefore significant for Chinese word segmentation. All the above-mentioned systems, however, fail to take pragmatic contextual and co-textual information into consideration, which largely accounts for their poor segmentation performance. How to incorporate such information into word-segmentation systems is therefore the central problem here.

4. Specific, real questions existing research does not answer (or does not answer adequately) but you attempt to answer. Also, state if these are definition, basic data, descriptive, or causes-and-effects questions and what you have done to avoid reinventing the wheel in your proposed study.

The question that my study addresses, and that existing research does not answer, is how pragmatic contextual and co-textual information can be incorporated into word-segmentation systems so that the word-segmentation problem is solved more satisfactorily and the performance of machine translation is greatly improved. To avoid reinventing the wheel, I first conducted a pilot study assessing all the translation programs available on the market; the results show that a relatively large proportion of translation errors can be attributed to mistakes in word segmentation. This means that my research question is a genuine one of considerable significance. Second, I have tried my best to conduct a thorough literature review, to make sure that my approach has not been tried before.

5. Your tentative answer to the questions (your hypothesis): interpretive, descriptive, explanatory, or predictive? How will you operationalize your hypothesis so that it can be tested (e.g. by measuring its key concepts)? If your hypothesis is an interpretive one, does it have testable consequences?

My hypothesis is predictive in nature: if pragmatic contextual and co-textual information is incorporated into word-segmentation systems, the word-segmentation problem will be solved more satisfactorily and the performance of machine translation will be greatly improved. To operationalize this hypothesis, I will conduct one test to see how well the machine translation programs available on the market deal with Chinese word segmentation. (My pilot study shows that a relatively large proportion of translation errors can be attributed to mistakes in word segmentation.) In addition, I will carry out two survey studies and several interviews to find out how Chinese readers segment a Chinese sentence into words and how similarly they carry out this process.

6. Methods by which you will obtain and analyze your evidence: What theoretical model will you use? Do you propose to conduct conceptual research, empirical research, or applied research? If you opt for empirical research, what specific kind(s) of empirical research (naturalistic, experimental, qualitative, quantitative?) will you do and what specific empirical research method(s) (case study, corpus study, survey, historical/archival study, etc.) will you use?

The main theoretical model of my study will be a process model, which will present the word-segmentation process of my newly designed model step by step. Two other theoretical models, the comparative model and the causal model, will also be employed. The comparative model will be used to compare the results of human segmentation with those of translation programs, while the causal model will be applied to the justification of the new model.

My study, in the first place, is conceptual research, since the focus of study is a new model for Chinese word segmentation. However, the part that supports my new model is empirical. In this part, one test, two surveys and six interviews will be carried out to obtain my evidence. The test and surveys are mainly quantitative studies whereas the interviews aim at obtaining qualitative data.

7. Evidence you will use or expect to find. Are they textual or contextual? Why do you think your evidence supports your hypothesis?

The evidence I will use consists mainly of contextual variables. More specifically, I will use source-text variables, especially the word segmentation of source texts.

Chinese word segmentation is one of the most important components of natural language processing for the Chinese language and is often referred to as the bottleneck in Chinese language understanding and processing. How source texts are segmented clearly affects the quality of machine translation. My new model of Chinese word segmentation takes pragmatic contextual and co-textual information into consideration, which, I hope, will contribute to better performance in machine translation systems.

8. Preliminary findings (Give one example, e.g. variables you've identified and patterns or regularities you've found in your pilot study.)

In my pilot study, I found that the computer translation programs available on the market, which are either linguistically oriented or statistically oriented, do not handle Chinese word segmentation very well in Chinese-English translation. The study shows that a relatively large proportion of translation errors can be attributed to mistakes in word segmentation. In my survey studies and interviews, I found that people carry out the word-segmentation process in quite similar ways and that in most cases they rely on pragmatic contextual and co-textual information rather than on linguistic principles.

9. Major references

Lamb, Sydney. (1999). Pathways of the brain: The neurocognitive basis of language. Amsterdam & Philadelphia: John Benjamins.

Liu, Kaiying 刘开瑛. (2000).《中文文本自动分词和标注》[Automatic word segmentation and tagging of Chinese texts]. Beijing: The Commercial Press (商务印书馆).

Luo, Zhengqing et al. 骆正清等. (1997). 《汉语自动分词研究综述》[A survey of research on automatic Chinese word segmentation].《浙江大学学报》[Journal of Zhejiang University]. Vol. 31, No. 3.

Taylor, John. (2002). Cognitive grammar. Oxford: Oxford University Press.

Wang, Ke et al. 王科等. (2003). 《汉语分词的主要技术及其应用展望》 [Main techniques in Chinese word-segmentation and their prospective applications].《通信技术》[Communications Technology]. 2003, No. 6.

Wu, Andi. (2003). Chinese word segmentation in MSR-NLP. The Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan. Retrieved May 11, 2004, from http://jefferson.village.virginia.edu/pmc/text-only/issue.591/moulthro.591

Yin, Jianping 殷建平. (1998). 《汉语自动分词方法》[Automatic word segmentation methods for Chinese]. 《计算机工程与科学》[Computer Engineering and Science]. Vol. 20, No. 3.

10. The timetable for your research

Items	Schedule	Present Status
Literature reading and review	By April 20, 2004	Almost finished and open to new information
2 Surveys	April 9 and 16	Done
5 Interviews	April 9 and 16	Done
Test	By August 1	Unfinished (but a pilot study conducted)
Data collection	By August 15	Data of surveys collected; 4 of 5 interviews transcribed; Waiting for test results
New model	By August 20	A rough model designed
First draft	By September 15	Literature review almost finished; Surveys and Interviews partly finished; Justification of the cognitive approach almost done
Second draft	By September 30	Unfinished
Third draft	By October 30	Unfinished

Supervisor's signature (1)

Date

Supervisor's signature (2)

Date