A Brief Introduction to Translation Corpora

Li Ping

College English Department, Nanjing Institute of Meteorology lee5110@263.net

Related Introductions Published in China

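Baker created the world's first corpus of translated English, the Translational English Corpus. She proposed that parallel corpora, multilingual corpora, and comparable corpora can be used to discover and identify semantic features that are hard to detect with conventional methods; to study the style and linguistic habits of texts, such as redundancy, lexical co-occurrence, degree of conventionality, forms of coherence, syntactic patterns, and even the use of punctuation; and to help us choose appropriate translation strategies. For example, when translating Goethe's works into modern Latin, if:

1) the average sentence length in modern German is 12 words,

2) the average sentence length in Goethe's writing is 24 words, and

3) the average sentence length in Latin literary works is 24 words,

4) then the average sentence length of Goethe's works translated into Latin should be 48 words.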

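H. J. Vermeer holds that only in this way can the translation reflect Goethe's deviation from the norms of ordinary German texts, and the extent of that deviation (Baker, 1995:238): Goethe's sentences are twice as long as the German norm, so the Latin version should be twice as long as the Latin norm. More importantly, corpora let us discover and verify certain translation norms and translation universals quickly and reliably, such as simplification, explicitation, and conventionalization. Corpus research is data-driven quantitative analysis: it works bottom-up, deriving theoretical conclusions from concrete data, and its results can be replicated. It is therefore objective and valid, goes a long way toward overcoming the subjectivity and arbitrariness of translation research, and is an important complement to qualitative analysis.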
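The arithmetic behind Vermeer's example can be written out as a small Python sketch. The function name and parameter names are illustrative assumptions, and the proportional scaling rule is one reading of the example, not a formula given in the source: the author's deviation from the source-language norm (24 / 12 = 2) is applied to the target-language norm (24 × 2 = 48).

```python
def target_sentence_length(source_norm, author_mean, target_norm):
    """Scale the target-language norm by the author's deviation from the source norm.

    All three arguments are average sentence lengths in words.
    The function name and the proportional rule are assumptions for illustration.
    """
    if source_norm <= 0:
        raise ValueError("source_norm must be positive")
    deviation = author_mean / source_norm  # Goethe vs. modern German: 24 / 12 = 2.0
    return target_norm * deviation


# Figures from the example: German norm 12, Goethe 24, Latin norm 24.
print(target_sentence_length(source_norm=12, author_mean=24, target_norm=24))  # 48.0
```

An author who writes exactly at the source norm would, under the same rule, be translated at exactly the target norm.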

(Liao Qiyi. Research Paradigms and Translation Studies in China. Chinese Translators Journal (中国翻译), 2001 (5).)

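At present, Baker mainly uses corpora to study the linguistic features of translated text and the style of translators. To study translator style, the Centre for Translation Studies at UMIST (University of Manchester Institute of Science and Technology) built a large Translational English Corpus. The corpus collects English material translated from many languages. The Centre has also developed software for processing this material semi-automatically, and by the end of 2001 the corpus had reached a total size of 20 million words. It consists mainly of fiction and biography, with small subcorpora of news and travel texts. Each text in the corpus has a header that gives researchers information about the translation: the translator's name, gender, nationality, and occupation, the direction of translation, the source language, the publisher of the translation, and so on. So that the work of different translators can be compared and questions of style explored from different angles, the corpus includes the work of a number of experienced literary translators: five or six works by the same translator from different source languages or different authors, as well as multiple translations of the same original by different translators.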

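(Zhang Meifang. Using Corpora to Investigate Translator Style: A Review of Baker's New Research Method. Journal of PLA University of Foreign Languages (解放军外语学院学报), 2002 (3).)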

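The Centre for Translation and Intercultural Studies (CTIS) has created what is currently the world's largest electronic database of English translations, the Translational English Corpus (TEC). The corpus already holds about 70 million words of translated English and is an important electronic resource for translators and translation researchers.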

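(Ke Ping. Translation and Interpreting Programmes and Translation Research Institutions at Universities around the World. Chinese Translators Journal (中国翻译), 2002 (6).)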

What is a corpus?

A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.

A computer corpus is a corpus which is encoded in a standardised and homogenous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance. (EAGLES Preliminary Recommendations on Corpus Typology)

What is TEC?

TEC is a computerised collection of contemporary translational English text. It is freely available to the research community, with a set of software tools to allow scholars to investigate the language of translated English. The corpus is continually being enlarged and the software tools refined and made more versatile and user-friendly.

TEC is a corpus of contemporary translational English: it consists of written texts translated into English from a variety of source languages, European and non-European. It was set up and is currently managed by Professor Mona Baker at the Centre for Translation & Intercultural Studies. The custom-made software for processing the corpus, which is downloadable from the web, is designed by Dr. Saturnino Luz, Trinity College Dublin, who is also in charge of maintaining the corpus.

What does TEC consist of?

TEC consists of four subcorpora: fiction, biography, news and inflight magazines. The overall size of the corpus is currently (2003) around 10 million words. It can be accessed freely via the web, using a custom-built concordancer designed by Dr. Saturnino Luz.

TEC is meticulously documented in terms of extralinguistic features such as gender, nationality and occupation of the translator, direction of translation, source language, publisher of the translated text, etc. This information is held in a separate header file for each text. The concordancing software is designed to make the information in the header file available to the researcher at a glance.
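For readers unfamiliar with concordancing, the following Python sketch shows the basic keyword-in-context (KWIC) idea that such software builds on. It is not the TEC concordancer: the function `kwic`, its parameters, and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left, match, right) context triples for each whole-word hit."""
    flat = " ".join(text.split())  # collapse line breaks so each context stays on one line
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Invented sample text, not taken from TEC.
sample = "She said that he was late. He said that it did not matter."
for left, word, right in kwic(sample, "that", width=12):
    print(f"{left:>12} | {word} | {right}")
```

A real concordancer would also attach the source file name or header details to each hit, which is what lets a researcher see who translated the line being displayed.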

What type of research does TEC support?

TEC has supported a broad range of studies in two main areas: the way in which the patterning of translated text might be different from that of non-translated text in the same language, and stylistic variation across individual translators. Examples of both types of study can be found in the Selected Bibliography attached to this document.

TEC files

1. Subcorpus: Inflight magazines

Lufthansa Bordbuch and Blue Wings

2. Subcorpus: Newspapers

The Guardian and The European

3. Subcorpus: Biography

4. Subcorpus: Fiction

Sample Header File

TITLE :

Filename: fn000009.txt

Subcorpus: Fiction

Collection: Memoirs of Leticia Valle

TRANSLATOR

Name: Carol Maier

Gender: female

Nationality: American

Employment: Lecturer

TRANSLATION

Mode: written

Extent: 55179

Publisher: University of Nebraska Press

Place: USA

Date: 1994

Comments: Title in European Women Writers Series

AUTHOR

Name: Rosa Chacel

Gender: female

Nationality: Spanish

SOURCE TEXT

Language: Spanish

Mode: written

Status: original

Place: Spain

Date: 1945
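A header in the layout shown above can be read into a nested dictionary with a few lines of Python. This is a minimal sketch under assumptions: it treats any all-capitals line without a value as a section name and every `Key: value` line as a field of the current section. The function name `parse_header` is hypothetical, the sample is a shortened copy of the header above, and the actual TEC header files may be stored in a different format.

```python
def parse_header(text):
    """Parse 'SECTION' / 'Key: value' lines into {section: {key: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between fields
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key.isupper() and not value:
            # e.g. "TITLE :", "TRANSLATOR", "SOURCE TEXT" start a new section
            current = sections.setdefault(key, {})
        elif sep and current is not None:
            current[key] = value
    return sections


# Shortened copy of the sample header, for illustration only.
sample = """\
TITLE :
Filename: fn000009.txt
Subcorpus: Fiction
TRANSLATOR
Name: Carol Maier
Gender: female
SOURCE TEXT
Language: Spanish
Date: 1945
"""

header = parse_header(sample)
print(header["TRANSLATOR"]["Name"])       # Carol Maier
print(header["SOURCE TEXT"]["Language"])  # Spanish
```

With headers in this form, texts can be filtered by translator gender, source language, or publication date before any concordance is run.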

Basic Methodologies

Comparable: two corpora in the same language, one consisting of translated and the other of non-translated texts;

Parallel: corpora of source texts and their translations;

Multilingual: corpora of non-translated texts in two or more languages, from the same domain, time period, etc.

Examples of corpus studies

Comparable corpora (corpora of translated and non-translated texts in the same language, and in similar domains) e.g. Olohan & Baker (based on TEC and BNC)

Parallel corpora (corpora of source texts and their single or multiple translations) e.g. Bosseaux (The Waves + 2 French translations)

Parallel corpora, with monolingual reference corpus in the language of the translated subcorpus e.g. Wallace (ECPC; IT & popular science English texts and two sets of Chinese translations, plus SINICA Chinese reference corpus)

Parallel corpora, with a monolingual reference corpus in each language (translated and non-translated) e.g. Kenny (GEPCOLT; experimental German literary texts and single English translations, plus BNC & Mannheim Reference Corpora)

Features Investigated (translated vs. non-translated texts)

Broad features (universals?): explicitation, simplification, normalisation, levelling out

Specific features (syntactic, lexical, literary): zero/that variation; contractions; split infinitives; use of idioms; recurrent lexical patterns; reformulation markers; marked collocations; point of view; deixis, etc. (Mona Baker)
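As an illustration of how one of the specific features listed above might be compared across a translated and a non-translated corpus, the sketch below counts contractions per 1,000 tokens. Everything in it is an assumption made for illustration: the regular expression covers only a handful of contraction endings with a straight apostrophe, deliberately leaves out `'s` because it cannot be told apart from the possessive, and the two sample strings are invented, not drawn from TEC or the BNC.

```python
import re

# Matches forms such as don't, they're, we've, I'll, he'd, I'm (straight apostrophe only).
CONTRACTION = re.compile(r"\b\w+(?:n't|'(?:re|ve|ll|d|m))\b", re.IGNORECASE)
TOKEN = re.compile(r"\w+(?:'\w+)?")


def contractions_per_thousand(text):
    """Return the number of contractions per 1,000 tokens of text."""
    tokens = TOKEN.findall(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    return 1000 * len(CONTRACTION.findall(text)) / len(tokens)


# Invented stand-ins for a translated and a non-translated sample.
translated = "I do not know. They are late. We will wait."
non_translated = "I don't know. They're late. We will wait."
print(contractions_per_thousand(translated))      # 0.0
print(contractions_per_thousand(non_translated))  # 250.0
```

Normalising to a rate per 1,000 tokens is what makes corpora of different sizes comparable; a real study would also need a significance test and far larger samples.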

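Laviosa (1998b) compared English narrative prose written by native-speaker authors with narrative prose translated into English and found four main features of vocabulary use in the translations: a relatively low ratio of lexical words to function words, a relatively high ratio of high-frequency to low-frequency words, greater repetition of the most frequent words, and less variety among the most frequent words. Other studies show that, beyond word choice, translated text is characterized by normalization, simplification (see Baker 1993, 1998), explicitation (i.e. increased cohesion, see Øverås 1998), and sanitization (i.e. removal of meaning that lies between the lines, see Kenny 1998). These are typical features commonly found in English translations. Further research on this basis can reveal the inherent regularities of translation, what Frawley (1984) calls the third code, and can also help translators and translation students pay deliberate attention to these issues. (Zhonghua Xiao)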
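The lexical measures described above can be approximated with simple counts. The Python sketch below is a loose illustration, not Laviosa's actual operationalisation: the function-word list is a tiny hand-picked sample, the tokenizer is a plain regular expression, and the names `lexical_profile`, `FUNCTION_WORDS`, and `top_n` are hypothetical.

```python
import re
from collections import Counter

# A very small illustrative function-word list; a real study would use a full one.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "is", "are", "was", "were", "be", "it", "he",
    "she", "they", "we", "i", "you", "that", "this", "not",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lexical_profile(text, top_n=10):
    """Return rough counterparts of the lexical measures discussed above."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    counts = Counter(tokens)
    lexical = sum(1 for t in tokens if t not in FUNCTION_WORDS)
    top = counts.most_common(top_n)
    return {
        "tokens": len(tokens),
        # share of lexical (content) words among all tokens
        "lexical_density": lexical / len(tokens),
        # share of all tokens taken up by the top_n most frequent words
        "top_share": sum(c for _, c in top) / len(tokens),
        # number of distinct word forms divided by number of tokens
        "type_token_ratio": len(counts) / len(tokens),
    }


profile = lexical_profile("The cat sat on the mat", top_n=1)
print(profile["tokens"], profile["lexical_density"])  # 6 0.5
```

Running the same function over a translated and a non-translated sample of similar size would show whether the translated text has the lower lexical density and the heavier reliance on its most frequent words that the findings above describe.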

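Uses of Parallel and Comparable Corpora in Language Research

1. They give deeper insight into the languages compared, insight that is easily overlooked in studies of monolingual corpora;

2. Through a range of comparisons, they reveal what is universal across languages as well as what is language-specific, typological, or cultural;

3. They reveal differences between source texts and translations, and between native and non-native texts;

4. They serve a number of practical applications, such as lexicography, foreign language teaching, and translation.

(Aijmer & Altenberg 1996: 12, cited in Zhonghua Xiao)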

English-Chinese and Chinese-English Translation Corpora

http://icl.pku.edu.cn/project/parallel/default.htm (Peking University Chinese-English Parallel Corpus)

http://www.ling.lancs.ac.uk/corplang/babel/babel.htm (The Babel English-Chinese Parallel Corpus)

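Currently the largest bilingual parallel corpus in China, and the largest bilingual corpus involving Chinese anywhere in the world:

The General Chinese-English Parallel Corpus of the National Research Centre for Foreign Language Education (developed under the direction of Professor Wang Kefei). Use is mainly made of four subcorpora of its translated-text component: Chinese-English literary, English-Chinese literary, Chinese-English non-literary, and English-Chinese non-literary, from which 1.5 million, 1.7 million, 1 million, and 1.3 million characters/words were selected respectively. Chinese-to-English translation accounts for about 40% and English-to-Chinese for about 60%; literary and non-literary texts account for roughly 55% and 45% respectively (Wang Kefei 2003).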

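Free Corpora Outside China

To query the BNC online, first log in at the BNC home page (http://info.ox.ac.uk/bnc/index.html), or download the corpus client software SARA by anonymous FTP (ftp://sable.ox.ac.uk/pub/ota/BNC/SARA/). After registration it can be used free of charge for 20 days.

The COBUILD corpus can be queried online from the University of Birmingham home page or the COBUILD home page (http://titania.cobuild.collins.co.uk/index.html). Without registration, however, only the demo is available, and each query displays only 50 concordance lines.

TeCCon is the concordancing software for the British Translational English Corpus. It can be reached from the University of Manchester website (http://www.art.man.ac.uk/SML/ctis/research/tec.htm), which contains the corpus, the concordancing software, and documentation for both. The site offers two ways to run concordances online: through the concordancer plug-in on the web page, or by downloading the client program TEC Browser (http://ronaldo.cs.tcd.ie/tec/jnlp/), installing it on your own computer, and then querying over the network.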

References

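Ke Ping. Translation and Interpreting Programmes and Translation Research Institutions at Universities around the World. Chinese Translators Journal (中国翻译), 2002 (6).

Liao Qiyi. Corpora and Translation Studies. Foreign Language Teaching and Research (外语教学与研究), 2000 (5).

Liao Qiyi. Research Paradigms and Translation Studies in China. Chinese Translators Journal (中国翻译), 2001 (5).

Wang Kefei. The Construction and Application of Bilingual Parallel Corpora (2000-2003).

Xiao Zhonghua. Uses of Parallel and Comparable Corpora in Language Research. 中国英语教育, 2003 (1).

Zhang Meifang. Using Corpora to Investigate Translator Style: A Review of Baker's New Research Method. Journal of PLA University of Foreign Languages (解放军外语学院学报), 2002 (3).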

Baker, Mona. Corpus-based Translation Studies (Lecture Handout), 2004.

http://www.monabaker.com/tsresources/TranslationalEnglishCorpus.htm

http://www.art.man.ac.uk/SML/ctis/research/tec.htm