CWMT 2017 Machine Translation Evaluation Guidelines

CWMT 2017 Machine Translation Evaluation Committee

 

I.            Introduction

The 13th China Workshop on Machine Translation (CWMT 2017) will be held in Dalian, China, on September 27-29, 2017. CWMT 2017 will continue the ongoing series of machine translation (MT) evaluation campaigns. Compared with the previous evaluation (CWMT 2015), there are several changes in the evaluation plan:

1. The Chinese-English and English-Chinese news translation tasks are co-organized by CWMT and WMT2017; participants in WMT 2017 are welcome to submit their results to CWMT and to take part in the CWMT event as well.

2. A new Japanese-Chinese patent domain translation task is added, which is co-organized by CWMT and Beijing Lingosail Technology Co. Ltd.; we also welcome other contributors from industry to participate in the organization of the evaluation in the future.

3. The training period starts immediately after the release of these guidelines. Participants can obtain the corresponding data and tools and start training right after registration. We encourage potential participants to register as soon as possible.

4. The "Double-Blind Evaluation" task is not held this year; the organizer will not provide a baseline system, the corresponding key steps, or intermediate result files for the evaluation tasks.

We sincerely hope that this campaign will strengthen cooperation and connections in machine translation research and technology among domestic and overseas research sites, and promote cooperation between academia and industry.

 

Information on CWMT 2017 is provided below:

 

The sponsor of CWMT 2017 machine translation evaluation is:

Chinese Information Processing Society of China

 

The organizers of this evaluation are:

Nanjing University

Institute of Computing Technology, Chinese Academy of Sciences

 

The resource providers of this evaluation are:

Beijing Lingosail Technology Co. Ltd.

Datum Data Co., Ltd.

Harbin Institute of Technology

Inner Mongolia University

Institute of Automation, Chinese Academy of Sciences

Institute of Computing Technology, Chinese Academy of Sciences

Institute of Intelligent Machines, Chinese Academy of Sciences

Nanjing University

Northeastern University

Northwest University of Nationalities

Peking University

Qinghai Normal University

Tibet University

Xiamen University

Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences                

Xinjiang University

 

The chair of this evaluation is:

HUANG Shujian (Nanjing University)

 

The committee members of this evaluation are:

Aishan Wumaier (Xinjiang University)

WEI Yongpeng (Beijing Lingosail Technology Co. Ltd)

XIAO Tong (Northeastern University)

YANG YaTing (Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences)

Yiliyaer Jiaermuhamaiti (Nanjing University)

ZHANG Jiajun (Institute of Automation, Chinese Academy of Sciences)

ZHAO Hongmei (Institute of Computing Technology, Chinese Academy of Sciences)

 

For more information about CWMT 2017 and the MT evaluation tasks, please visit:

                  http://nlp.nju.edu.cn/cwmt2017/evaluation.html

 

 

II.       Evaluation Tasks

The CWMT 2017 MT evaluation campaign consists of six tasks, covering four domains and five language pairs overall, as listed in Table 1.

No | Task ID | Task Name | Domain
1 | CE | Chinese-to-English News Translation | News
2 | EC | English-to-Chinese News Translation | News
3 | MC | Mongolian-to-Chinese Daily Expression Translation | Daily Expressions
4 | TC | Tibetan-to-Chinese Government Document Translation | Government Documents
5 | UC | Uyghur-to-Chinese News Translation | News
6 | JC | Japanese-to-Chinese Patent Domain Translation | Patent

Table 1 CWMT2017 MT evaluation tasks

For every evaluation task, participants are free to choose the MT technology they wish to use, such as rule-based, example-based, statistical, or neural machine translation.

Participants may also use system combination technology, but should indicate this explicitly in the system description and describe the performance of each single system in the technical report. Here, system combination technology means using translations from two or more single systems to reconstruct or select translation results at the character, word, phrase or sentence level. Techniques that do not explicitly generate results from two or more single systems will not be considered system combination technologies this time; examples of such techniques are collaborative decoding in statistical machine translation, output ensembles in neural machine translation, and reranking of n-best results from a single system. Systems that use system combination technology will be indicated as such in the report of the evaluation results.

 

III.      Evaluation Methods

1.           Evaluation Metrics

Automatic evaluation

Automatic evaluation tools will be used to evaluate the system performance in automatic metrics including: BLEU-SBP, BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.

The organizers will use the following setting in the automatic evaluation:

(1) All the scores of these metrics will be case-sensitive; case-insensitive scores for some metrics may also be listed for reference;

(2) BLEU-SBP will be the primary metric;

(3) The evaluation of EC, TC, UC, MC and JC will be based on characters instead of words;

(4) In the evaluation of EC, TC, UC, MC and JC, the organizer will convert the full-width Chinese characters in the A3 area of GB2312 in the Chinese translations to half-width characters;

(5) The evaluation of CE will be based on English words.
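For illustration, the character-based scoring in (3) and the full-width-to-half-width conversion in (4) can be sketched in Python as follows. This is an informal sketch, not the official evaluation tool; the exact character ranges handled by the official tool may differ (the A3 area of GB2312 corresponds to the full-width ASCII forms U+FF01-U+FF5E in Unicode).

```python
def fullwidth_to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants (U+FF01-U+FF5E) and the
    ideographic space (U+3000) to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return ''.join(out)

def to_characters(text: str) -> str:
    """Split a Chinese sentence into space-separated characters,
    as done for character-based scoring."""
    return ' '.join(text.replace(' ', ''))

print(fullwidth_to_halfwidth('１２３，ＡＢＣ'))  # prints: 123,ABC
```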

 

2.           Evaluation Procedure

CWMT 2017 MT evaluation will take the following four stages:

(1) Registration: The potential participant sends the registration form and evaluation agreement to the organizer. The organizer sends the training and development data to the confirmed participants (by FTP).

(2) Training stage: The participants train and develop their systems based on the data released or additional data (data conditions are listed in Section IV).

(3) Test stage: The organizer releases the source file of the test set. The participants run their systems and submit the translation results and system descriptions in the required format (Appendix B) by the deadline.

(4) Evaluation and reporting stage: The organizer evaluates the submitted translations and reports back the final evaluation results. The participants prepare their technical reports and attend the CWMT 2017 workshop.

The exact time schedule is listed in Section VII.

 

IV.   Evaluation Data and Training Conditions

The organizer will offer several language resources including the training corpora, development sets and test sets (source files).

1.         Training Sets

Please refer to Appendix D for the resource list released by CWMT2017 and Appendix B for the document structure description.

CWMT 2017 MT evaluation campaign will update and add the following training data:

New datasets for the CE, EC tasks:

- NEU English-Chinese Parallel Corpus (2017)

- Datum English-Chinese Parallel Corpus (2017)

The EC and CE tasks are co-organized with WMT17, so the data provided by WMT17 can also be used for the corresponding EC and CE evaluations. In addition to the training, development and test data provided by CWMT2017, WMT17 also allows the following data to be used[1]:

Parallel data in English and Chinese (News Commentary V12 and UN Parallel Corpus V1.0)

Chinese and English monolingual training data (Europarl, News Commentary, Common Crawl, News Crawl, News Discussions etc.); LDC Gigaword in English and Chinese (LDC2011T07, LDC2009T13, LDC2007T07, LDC2009T27)

New datasets for the MC task:

- IMU Mongolian-Chinese Parallel Corpus (2017)

- ICTCAS Mongolian-Chinese Parallel Corpus (2017)

New datasets for the TC task:

- ICTCAS Tibetan-Chinese Parallel Corpus (2017)

New datasets for the UC task:

- XJU Chinese-Uyghur Parallel Corpus (2017)

- ICTCAS Uyghur-Chinese Parallel Corpus (2017)

- XJIPC-CAS Uyghur-Chinese Parallel Corpus (2017)

New datasets for the JC task:

- Lingosail Chinese-Japanese Parallel Corpus (2017)

Participants can obtain the training data of the tasks for which they register.

 

2.         Training Conditions

For statistical machine translation systems, two kinds of training conditions are allowed in CWMT 2017: Constrained training and Unconstrained training.

(1) Constrained training

Under this condition, only data provided by the evaluation organizer can be used for system development, subject to the following restrictions:

- The primary systems must be trained under the constrained condition, so that all primary systems can be evaluated under comparable conditions.

- Rule-based MT modules or systems can use hand-crafted translation knowledge sources such as rules, templates, and dictionaries. Participants using a rule-based MT system are required to describe the size of the knowledge sources and the ways they were constructed and used in the system description and the technical report.

- Tools for monolingual processing (such as lexical analyzers, parsers and named entity recognizers) are not subject to the training data restrictions.

- Tools for bilingual translation (such as named entity translators and syllable-to-character converters) must not use any additional resources. The exceptions are tools for translating numerals and time words.

- For any evaluation task, systems may only use the corpora related to that task. Using the corpora of any other evaluation task is not acceptable, even if the participant takes part in more than one task.

- Constrained training corpora for the Chinese-English and English-Chinese evaluation tasks (co-organized by CWMT and WMT) are composed of data provided by CWMT (listed in Appendix D) and data provided by WMT. The participant should state clearly, in the system description and technical report, which parts of the data are used: WMT data, CWMT data, or both. Submissions with different training data conditions will be indicated in the report of the evaluation results.

(2) Unconstrained training

Under this condition, the participant is allowed to use data from other resources to assist the training of their systems, subject to the following restrictions:

- The contrast systems of participants can be developed under the unconstrained training condition.

- If participants use additional data, they should declare in the system description and technical report whether the data is publicly accessible. If it is, they should provide the origin of the data.

- Participants are welcome to use their own online translation systems, but these systems should also be described briefly in the system description and technical report. The translation results of online systems will only be used as reference and will be excluded from the ranking of unconstrained training systems.

 

3.         The Development Sets

Information on the development sets is provided in Table 2.

Task ID | Size | Provider | Note
CE | 2,002 sentences | Nanjing University | single reference
EC | 2,002 sentences | Nanjing University | single reference
MC | 1,000 sentences | Inner Mongolia University | 4 references
TC | 650 sentences | Qinghai Normal University | 4 references
UC | 700 sentences | Xinjiang University | 4 references
JC | 3,000 sentences | Beijing Lingosail Technology Co. Ltd. | single reference

Table 2 The development sets for CWMT2017

The CE and EC tasks share the same development data, which is the combination of 1,002 sentences translated from English to Chinese and 1,000 sentences translated from Chinese to English.

4.         The Test Sets

Information on the test sets is provided in Table 3.

Task ID | Size | Provider | Note
CE | 1,000 sentences | Nanjing University | single reference
EC | 1,000 sentences | Nanjing University | single reference
MC | 1,001 sentences | Inner Mongolia University | 4 references
TC | 729 sentences | Qinghai Normal University | 4 references
UC | 1,000 sentences | Xinjiang Technical Institute of Physics & Chemistry, CAS | 4 references
JC | 1,000 sentences | Beijing Lingosail Technology Co. Ltd. | single reference

Table 3 The test sets for CWMT2017

The MC, TC and UC tasks reuse the evaluation data of the previous evaluation (CWMT2015).

Please refer to Appendix B for instructions regarding the format of the development and test sets.

 

V.       Results Submission

The translation result(s) must be returned to the organizer before the deadline. For each task they registered for, each participant should submit one final translation result as the primary result and at most three other translation results as contrast results. Each system submission should be accompanied by its system description. Please refer to Appendix B for the format of MT evaluation data and submission files.

Participants in the CE and EC news translation tasks can submit results to CWMT2017, WMT2017, or both. However, each submission should follow the requirements of that particular event.

 

VI.   Technical Report Submission

After the evaluation, participants should submit a detailed technical report to CWMT 2017 describing the architecture, major technologies and the use of data. Each team ought to send at least one person to attend CWMT2017 and exchange related technologies. Please see the reporting requirements in Appendix C.

 

VII.                 Evaluation Calendar

No | Date | Event
1 | March 15, 2017 | Registration starts. The organizer sends training and development data, and scripts for scoring and format checking, to the participants according to their registration.
2 | March 31, 2017 | Deadline for registration. The data and scripts will not be provided to other organizations. (Please contact the organizer for later registrations.)
3 | May 2, 2017, 10:00 GMT+8 (postponed to May 15, 10:00 GMT+8) | Release of test data for the CE and EC tasks to participants
4 | May 8, 2017, 17:30 GMT+8 (postponed to May 22, 17:30 GMT+8) | Deadline for submitting results for the CE and EC tasks
5 | May 20, 2017, 10:00 GMT+8 | Release of test data for the JC, UC, MC and TC tasks to participants
6 | May 27, 2017, 17:30 GMT+8 | Deadline for submitting results for the JC, UC, MC and TC tasks
7 | June 15, 2017 | Preliminary release of evaluation results to participants
8 | June 30, 2017 | Deadline for submitting technical reports
9 | July 10, 2017 | Reviews of technical reports sent to participants, who should revise their reports accordingly
10 | July 30, 2017 | Deadline for submitting camera-ready technical reports
11 | September 27-29, 2017 | CWMT 2017 workshop. Official public release of results.

 

VIII.             Appendix

This document includes the following appendices:

Appendix A: Registration Form and Evaluation Agreement

Appendix B: Format of MT Evaluation Data

Appendix C: Requirements of Technical Report

Appendix D: List of Resources Released by the Organizer


 

Appendix A: Registration Form and Evaluation Agreement

Any organization engaged in MT research or development can register for the CWMT 2017 evaluation. Participating sites should fill in the registration form and agreement and send them to the organizer by either email or post. The registration form and the agreement should be signed by the person in charge of the participating team/organization, or stamped with the official seal of the team/organization.

The evaluation does not charge any registration fee. Each participant should send at least one person to attend the workshop (CWMT 2017).

Please send the registration form and the agreement to:

Name: HUANG Shujian

Email: huangsj@nju.edu.cn

Address: Department of Computer Science and Technology, Nanjing University, 163 Xianlin Avenue, Nanjing 210023, China

Post Code: 210023

Telephone: 025-89680690

 

Appendix B: Format of MT Evaluation Data

This appendix describes the format of the data released by the organizer and the result files that the participants should submit.

All files should be encoded in UTF-8. Among them, the development set (including its references), the test set and the final translation result files must be strict XML files (whose format is defined by the XML DTD described in Section III of this appendix) encoded in UTF-8 with BOM; all the others are plain text files encoded in UTF-8 without BOM.
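For illustration, the BOM conventions above can be handled in Python with the built-in utf-8-sig codec; the file names below are placeholders chosen for the example.

```python
# Write an XML result file as UTF-8 with BOM, and a plain-text
# training file as UTF-8 without BOM.
with open('example.xml', 'w', encoding='utf-8-sig') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')

with open('example.txt', 'w', encoding='utf-8') as f:
    f.write('战法训练有了新的突破\n')

# Reading with encoding='utf-8-sig' strips the BOM if one is present,
# so it is a safe default when reading both kinds of files.
with open('example.xml', encoding='utf-8-sig') as f:
    first = f.read(5)
print(first)  # prints: <?xml
```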

I.                              Data released by the organizer

The organizer will release three kinds of data: training sets, development sets and test sets. Here we take the Chinese-to-English translation task as an example for illustration purposes.

1. Training Set

The training data contains one sentence per line. The parallel corpus of each language pair is made of a source file and a target file, which contain the source and target sentences respectively.

Figure 1 illustrates the data format of the parallel corpus.

 

Chinese File | English File
战法训练有了新的突破 | Tactical training made new breakthrough
第一章总则 | Chapter I general rules
人民币依其面额支付 | The renminbi is paid by denomination
…… | ……

Figure 1 Example of the parallel corpus
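A minimal sketch of reading such a sentence-aligned pair of files in Python (the file names are placeholders; line i of the source file corresponds to line i of the target file):

```python
def load_parallel(src_path: str, tgt_path: str):
    """Read a sentence-aligned parallel corpus as a list of
    (source sentence, target sentence) pairs."""
    with open(src_path, encoding='utf-8') as fs, \
         open(tgt_path, encoding='utf-8') as ft:
        return [(s.rstrip('\n'), t.rstrip('\n')) for s, t in zip(fs, ft)]

# Demo with the three pairs from Figure 1.
zh = ['战法训练有了新的突破', '第一章总则', '人民币依其面额支付']
en = ['Tactical training made new breakthrough', 'Chapter I general rules',
      'The renminbi is paid by denomination']
with open('train.zh', 'w', encoding='utf-8') as f:
    f.write('\n'.join(zh) + '\n')
with open('train.en', 'w', encoding='utf-8') as f:
    f.write('\n'.join(en) + '\n')

pairs = load_parallel('train.zh', 'train.en')
print(len(pairs))  # prints: 3
```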

2. Development Set

There are source files and reference files in the development set and the test set.

(1)       Source File

A source file contains a single srcset element, which has the following attributes:

- setid: the dataset

- srclang: the source language, one of {en, zh, mn, uy, ti, jp}

- trglang: the target language, one of {en, zh, mn, uy, ti, jp}

A srcset element contains one or more DOC element(s); each DOC element carries a single attribute, docid, which indicates the genre of the DOC.

Each DOC element contains several seg elements, each with an id attribute.

One or more segments may be encapsulated inside other elements, such as p. Only the text surrounded by seg elements is to be translated.

Figure 2 shows an example of the source file.

<?xml version="1.0" encoding="UTF-8"?>

<srcset setid="zh_en_news_trans" srclang="zh" trglang="en">

<DOC docid="news">

<p>

<seg id="1">sentence 1</seg>

<seg id="2">sentence 2</seg>

……

</p>

……

</DOC>

</srcset>

Figure 2 Example of the source file
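A source file in this format can be read with standard XML tooling; the following Python sketch (not an official tool) collects the seg texts to be translated, keyed by docid and seg id:

```python
import xml.etree.ElementTree as ET

def read_srcset(path: str):
    """Collect the segments to translate from a source file."""
    root = ET.parse(path).getroot()   # the srcset element
    assert root.tag == 'srcset'
    segs = {}
    for doc in root.iter('DOC'):
        docid = doc.get('docid')
        for seg in doc.iter('seg'):
            segs[(docid, seg.get('id'))] = (seg.text or '').strip()
    return segs

# Demo with a file mirroring Figure 2 ('src.xml' is a placeholder name).
sample = '''<?xml version="1.0" encoding="UTF-8"?>
<srcset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news">
<p>
<seg id="1">sentence 1</seg>
<seg id="2">sentence 2</seg>
</p>
</DOC>
</srcset>'''
with open('src.xml', 'w', encoding='utf-8-sig') as f:
    f.write(sample)

segs = read_srcset('src.xml')
print(segs[('news', '1')])  # prints: sentence 1
```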

 

(2)      Reference file

A reference file contains a refset element with the following attributes:

- setid: the dataset

- srclang: the source language, one of {en, zh, mn, uy, ti, jp}

- trglang: the target language, one of {en, zh, mn, uy, ti, jp}

Each refset element contains several DOC elements. Each DOC has two attributes:

- docid: the genre of the DOC

- site: the indicator for different references, one of {1, 2, 3, 4}

Figure 3 shows an example of the reference file.

<?xml version="1.0" encoding="UTF-8"?>

<refset setid="zh_en_news_trans" srclang="zh" trglang="en">

<DOC docid="news" sysid="ref" site="1">

<p>

<seg id="1">reference 11 </seg>

<seg id="2">reference 12</seg>

……

</p>

……

</DOC>

<DOC docid="news" sysid="ref" site="2">

<p>

<seg id="1">reference 21 </seg>

<seg id="2">reference 22</seg>

……

</p>

……

</DOC>

<DOC docid="news" sysid="ref" site="3">

<p>

<seg id="1">reference 31</seg>

<seg id="2">reference 32</seg>

……

</p>

……

</DOC>

<DOC docid="news" sysid="ref" site="4">

<p>

<seg id="1">reference 41 </seg>

<seg id="2">reference 42</seg>

……

</p>

……

</DOC>

</refset>

Figure 3 Example of the reference file
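For multi-reference scoring, the references can be grouped by seg id in site order; a Python sketch (not an official tool, with the sample abbreviated to two of the four sites in Figure 3):

```python
import xml.etree.ElementTree as ET

def read_refset(path: str):
    """Group reference translations by seg id, in site order:
    {seg_id: [reference from site 1, reference from site 2, ...]}."""
    root = ET.parse(path).getroot()
    refs = {}
    for doc in sorted(root.iter('DOC'), key=lambda d: int(d.get('site'))):
        for seg in doc.iter('seg'):
            refs.setdefault(seg.get('id'), []).append((seg.text or '').strip())
    return refs

sample = '''<refset setid="zh_en_news_trans" srclang="zh" trglang="en">
<DOC docid="news" sysid="ref" site="1"><p>
<seg id="1">reference 11</seg></p></DOC>
<DOC docid="news" sysid="ref" site="2"><p>
<seg id="1">reference 21</seg></p></DOC>
</refset>'''
with open('ref.xml', 'w', encoding='utf-8-sig') as f:
    f.write(sample)

print(read_refset('ref.xml')['1'])  # prints: ['reference 11', 'reference 21']
```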

 

3. Test set

For convenience of evaluation, the organizer will release only the source files of the test set, in the same format as the development set.

 

II.                           Data Format in the Submission

Participants need to submit translations and system descriptions in the format below.

1. File Naming

Please name the submitted files according to the naming scheme in the following table (here "ce", "ict" and "2017" are used as examples of the Task ID, Participant ID and year of the test data, respectively).

File | Naming mode | Example
final translation result | [Task ID]-[year of the test data]-[Participant ID]-[primary or contrast]-[System ID].xml | ce-2017-ict-primary-a.xml, ce-2017-ict-contrast-c.xml
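A small helper that composes file names following this pattern (a convenience sketch, not an official script):

```python
def submission_name(task, year, site, primary=True, system_id='a'):
    """Compose a submission file name such as ce-2017-ict-primary-a.xml."""
    kind = 'primary' if primary else 'contrast'
    return f'{task}-{year}-{site}-{kind}-{system_id}.xml'

print(submission_name('ce', 2017, 'ict'))  # prints: ce-2017-ict-primary-a.xml
```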

 

2. Final translation result

The final translation result is the output of the participant's translation system, with proper post-processing such as recasing and detokenization applied.

The submission for each system should be an XML file whose format is similar to that of the source file of the test set.

The final submission file contains a tstset element with the following attributes:

- setid: the dataset

- srclang: the source language, one of {en, zh, mn, uy, ti, jp}

- trglang: the target language, one of {en, zh, mn, uy, ti, jp}

The tstset element contains a system element with the following attributes:

- site: the label of the participant

- sysid: the identification of the MT system

The value of the system element is the description of the participating system, including the following information:

- Hardware and software environment: the operating system and its version, number of CPUs, CPU type and frequency, system memory size, etc.

- Execution time: the time from accepting the input to generating the output.

- Technology outline: an outline of the main technologies and important parameters of the participating system. If the system uses system combination techniques, the single systems being combined and the combination techniques should be described here.

- Training data: a description of the training and development data used, with an indication of the training condition (constrained or unconstrained). For the CE and EC tasks, please also indicate the source of the training data (WMT, CWMT or both).

- External technology: a declaration of external technologies used in the participating system but not owned by the participating site, including open-source code, free software, shareware and commercial software.

The content of each DOC element is exactly the same as that of the test set's source file, as described above.

Figure 4 shows an example of the final submission file.

<?xml version="1.0" encoding="UTF-8"?>

<tstset setid="zh_en_news_trans" srclang="zh" trglang="en">

<system site="unit name" sysid="system identification">

description information of the participating system

............

</system>

<DOC docid="document name" sysid="system identification">

<p>

<seg id="1">submit translation 1</seg>

<seg id="2">submit translation 2</seg>

……

</p>

……

</DOC>

</tstset>

Figure 4 Illustration of the final submission file
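The layout of Figure 4 can be produced with standard XML tooling, which also escapes reserved characters in the text automatically; a Python sketch in which all argument values are placeholders chosen for the example:

```python
import xml.etree.ElementTree as ET

def write_submission(path, setid, srclang, trglang, site, sysid,
                     description, docs):
    """Build a tstset submission file.  `docs` maps each docid to a
    list of (seg id, translation) pairs."""
    tstset = ET.Element('tstset', setid=setid, srclang=srclang,
                        trglang=trglang)
    system = ET.SubElement(tstset, 'system', site=site, sysid=sysid)
    system.text = description
    for docid, segs in docs.items():
        doc = ET.SubElement(tstset, 'DOC', docid=docid, sysid=sysid)
        p = ET.SubElement(doc, 'p')
        for seg_id, translation in segs:
            seg = ET.SubElement(p, 'seg', id=str(seg_id))
            seg.text = translation  # reserved characters escaped on serialization
    body = ET.tostring(tstset, encoding='utf-8', xml_declaration=True)
    with open(path, 'wb') as f:
        f.write(b'\xef\xbb\xbf' + body)  # the submission must carry a BOM

write_submission('ce-2017-ict-primary-a.xml', 'zh_en_news_trans', 'zh', 'en',
                 'ict', 'primary-a', 'description of the system',
                 {'news': [(1, 'translation with < and &')]})
```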

 

3. Document details

Please pay attention to the following points when generating the submission file:

- Please note that the CWMT 2017 evaluation adopts a strict XML file format. The main difference between this XML file format and the NIST evaluation file format is the following: in an XML file, if any of the following five characters occurs in text outside tags, it must be replaced by its escape sequence:

Character | Escape sequence
& | &amp;
< | &lt;
> | &gt;
" | &quot;
' | &apos;
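A minimal escaping helper for these five characters; note that '&' must be replaced first so that already-produced entities are not escaped again (the standard library function xml.sax.saxutils.escape offers the same functionality):

```python
def escape_xml(text: str) -> str:
    """Replace the five reserved XML characters with their escape
    sequences; '&' is handled first to avoid double-escaping."""
    for ch, esc in [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'),
                    ('"', '&quot;'), ("'", '&apos;')]:
        text = text.replace(ch, esc)
    return text

print(escape_xml('a<b & "c"'))  # prints: a&lt;b &amp; &quot;c&quot;
```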

 

- As for Chinese encoding, the middle dot in a foreign person name should be written as the UTF-8 byte sequence "E2 80 A2", for example "托德·西蒙斯";

- As for English tokenization, the tokenization should be consistent with the "normalizeText" function of the Perl script "mteval-v11b.pl" released by NIST. The main part of the script is listed in Figure 5.

# language-dependent part (assuming Western languages):

$norm_text = " $norm_text ";
# Add a space to the beginning and the end of the text (removed again at the end)

$norm_text =~ tr/[A-Z]/[a-z]/ unless $preserve_case;
# Convert uppercase letters to lowercase, unless the user asks to preserve case

$norm_text =~ s/([\{-\~\[-\` -\&\(-\+\:-\@\/])/ $1 /g;   # tokenize punctuation
# Add a space to both sides of the following symbols (hexadecimal ASCII codes in parentheses):
# { | } ~ (0x7b-0x7e)
# [ \ ] ^ _ ` (0x5b-0x60)
# (space) ! " # $ % & (0x20-0x26)
# ( ) * + (0x28-0x2b)
# : ; < = > ? @ (0x3a-0x40)
# / (0x2f)

$norm_text =~ s/([^0-9])([\.,])/$1 $2 /g; # tokenize period and comma unless preceded by a digit
# When a period '.' or comma ',' follows a non-digit character, add a space to both sides of it (no space is added when a digit precedes it)

$norm_text =~ s/([\.,])([^0-9])/ $1 $2/g; # tokenize period and comma unless followed by a digit
# When a period '.' or comma ',' is not followed by a digit 0-9, add a space to both sides of it

$norm_text =~ s/([0-9])(-)/$1 $2 /g; # tokenize dash when preceded by a digit
# When a digit 0-9 is followed by a hyphen, add a space to both sides of the hyphen

$norm_text =~ s/\s+/ /g; # one space only between words
# Replace runs of whitespace characters with a single space

$norm_text =~ s/^\s+//;  # no leading space
# Remove whitespace at the beginning of the text

$norm_text =~ s/\s+$//;  # no trailing space
# Remove whitespace at the end of the text

Figure 5 The main part of the tokenization script
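For participants working in Python, the rules of Figure 5 can be transcribed as follows. This is an informal transcription for illustration; the Perl script remains the reference implementation.

```python
import re

def normalize_text(text: str, preserve_case: bool = False) -> str:
    """Transcription of the normalizeText rules from mteval-v11b.pl."""
    norm = f' {text} '
    if not preserve_case:
        norm = norm.lower()
    # tokenize punctuation (same character ranges as the Perl regex)
    norm = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@/])', r' \1 ', norm)
    # tokenize period and comma unless preceded by a digit
    norm = re.sub(r'([^0-9])([\.,])', r'\1 \2 ', norm)
    # tokenize period and comma unless followed by a digit
    norm = re.sub(r'([\.,])([^0-9])', r' \1 \2', norm)
    # tokenize dash when preceded by a digit
    norm = re.sub(r'([0-9])(-)', r'\1 \2 ', norm)
    norm = re.sub(r'\s+', ' ', norm)  # one space only between words
    return norm.strip()

print(normalize_text('Hello, world! (test-1)'))
# prints: hello , world ! ( test-1 )
```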

III.                        Description of the CWMT 2017 XML document structure

<?xml version="1.0" encoding="UTF-8"?>

<!ELEMENT srcset (DOC+)>

<!ATTLIST srcset setid CDATA #REQUIRED>

<!ATTLIST srcset srclang (en | zh | mn | uy | ti | jp ) #REQUIRED>

<!ATTLIST srcset trglang (en | zh | mn | uy | ti | jp) #REQUIRED>

<!ELEMENT refset (DOC+)>

<!ATTLIST refset setid CDATA #REQUIRED>

<!ATTLIST refset srclang (en | zh | mn | uy | ti | jp) #REQUIRED>

<!ATTLIST refset trglang (en | zh | mn | uy | ti | jp) #REQUIRED>

<!ELEMENT tstset (system, DOC+)>

<!ATTLIST tstset setid CDATA #REQUIRED>

<!ATTLIST tstset srclang (en | zh | mn | uy | ti | jp) #REQUIRED>

<!ATTLIST tstset trglang (en | zh | mn | uy | ti | jp) #REQUIRED>

<!ELEMENT system (#PCDATA) >

<!ATTLIST system site CDATA #REQUIRED >

<!ATTLIST system sysid CDATA #REQUIRED >

<!ELEMENT DOC ( p* )>

<!ATTLIST DOC docid CDATA #REQUIRED>

<!ATTLIST DOC site CDATA #IMPLIED>

<!ELEMENT p (seg*)>

<!ELEMENT seg (#PCDATA)>

<!ATTLIST seg id CDATA #REQUIRED>

 

 

Appendix C: Requirements of Technical Report

All participating sites should submit a technical report to the 13th China Workshop on Machine Translation (CWMT 2017). The technical report should describe in detail the technologies used in the participating system(s), so that the reader understands how the reported results were obtained. A good technical report should be detailed enough for the reader to replicate the work described in it. The report should be no shorter than 5,000 Chinese characters or 3,000 English words.

In general, a technical report should provide the following information:

Introduction: Give the background information, introduce the evaluation tasks participated in, and outline the participating systems.

System: Describe the architecture and each module of the participating system in detail, with a focus on the technologies used. Any open technology adopted should be explicitly declared; technologies developed by the participating site itself should be described in detail. If the participating site uses system combination techniques, the single systems (including their results) as well as the combination technique should be described. If the participating site uses hand-crafted translation knowledge sources such as rules, templates, and dictionaries, the size of the knowledge sources and the ways they were constructed and used should be described.

Data: Give a detailed description of the data used in system training and of how the data was processed.

Experiment: Give a detailed description of the experimental process, the parameters, and the results obtained on the evaluation set. Analyze the results.

Conclusion: (open)

 

Appendix D: Resource List Released by the Organizer

Unless otherwise indicated, the data files provided by the organizer are encoded in UTF-8 (without BOM).

1. The Chinese-English news resources provided by the organizer

1.1 Training data

Each entry below gives the ChineseLDC resource ID, followed by the resource description.

Datum2015
- Name: Datum English-Chinese Parallel Corpus (2015) (Part)
- Providers: Datum Data Co., Ltd.
- Languages: Chinese-English
- Domain: Multi-domain, including: textbooks for language education, bilingual books, technological documents, bilingual news, government white papers, government documents, bilingual resources on the web, etc.
- Size: 1,000,004 sentence pairs
- Description: A part of the "Bilingual / Multi-lingual Parallel Corpus" developed by Datum Data Co., Ltd. under the support of the National High Technology Research and Development Program of China (863 Program).

CASICT2011 (CLDC-2010-001, CLDC-2012-001)
- Name: ICT Web Chinese-English Parallel Corpus (2013)
- Providers: Institute of Computing Technology, Chinese Academy of Sciences
- Languages: Chinese-English
- Domain: Multi-domain
- Size: 1,936,633 sentence pairs
- Description: The parallel corpus is automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, sentence alignment, etc., are entirely automatic. The accuracy of the corpus is about 95%. This work was supported by the National Natural Science Foundation of China (Grant No. 60603095).

CASICT2015
- Name: ICT Web Chinese-English Parallel Corpus (2015)
- Providers: Institute of Computing Technology, Chinese Academy of Sciences
- Languages: Chinese-English
- Domain: Multi-domain
- Size: 2,036,834 sentence pairs
- Description: The parallel corpus is automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, sentence alignment, etc., are entirely automatic. The Institute of Computing Technology has roughly corrected this corpus; its accuracy is greater than 99%. Three sources of sentences were selected: 60% from the web, 20% from movie subtitles, and the remaining 20% from English-to-Chinese or Chinese-to-English dictionaries.

CASIA2015
- Name: CASIA Web Chinese-English Parallel Corpus (2015)
- Providers: Institute of Automation, Chinese Academy of Sciences
- Languages: Chinese-English
- Domain: Multi-domain
- Size: 1,050,000 sentence pairs
- Description: The parallel corpus is automatically acquired from the web. All the processes, including parallel web page discovery and verification, parallel text extraction, sentence alignment, etc., are entirely automatic.

Datum2017
- Name: Datum English-Chinese Parallel Corpus (2017)
- Providers: Datum Data Co., Ltd.
- Languages: Chinese-English
- Size: 1,000,000 sentence pairs, divided into 20 parts

NEU2017
- Name: NEU Chinese-English Parallel Corpus (2017)
- Providers: Natural Language Processing Group, Northeastern University
- Languages: Chinese-to-English, English-to-Chinese
- Size: 2,000,000 sentence pairs

SSMT2007 MT Evaluation Data

 

(2007-863-001)

Name

SSMT2007 Machine Translation Evaluation Data (a part of Chinese-English & English-Chinese MT evaluation data)

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

News

Size

This is the test data of the SSMT 2007 MT evaluation, which contains data for two translation directions (Chinese-English and English-Chinese) in the news domain. The source file of the Chinese-English data contains 1,002 Chinese sentences; the source file of the English-Chinese data contains 955 English sentences. There are 4 reference translations made by human experts for each test sentence.

Description

 

HTRDP(863)2005 MT Evaluation Data

 

(2005-863-001)

Name

HTRDP(“863 Program”) 2005 Machine Translation Evaluation Data (a part of Chinese-English & English-Chinese MT evaluation data)

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

The data contains two genres: dialog data from Olympics-related domains (game reports, weather forecasts, traffic and hotels, travel, food, etc.), and text data from the news domain.

Size

The source files of dialog and text data in the Chinese-to-English and English-to-Chinese directions contain about 460 sentence pairs each.

Description

 

HTRDP(863)2004 MT Evaluation Data

 

(2004-863-001)

Name

HTRDP (“863 Program”) 2004 Machine Translation Evaluation Data (a part of Chinese-English & English-Chinese MT evaluation data)

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

The data contains two genres: text data and dialog data. It covers the general domain and Olympics-related domains, including game reports, weather forecasts, traffic and hotels, travel, food, etc.

Size

The source files of the Chinese-to-English direction contain 400 dialog sentences and 308 text sentences. The source files of the English-to-Chinese direction contain 400 dialog sentences and 310 text sentences. There are 4 reference translations made by human experts for each source sentence.

Description

The test data for the 2004 “863 Program” machine translation evaluation.

HTRDP(863)2003 MT Evaluation Data

 

(2003-863-004)

Name

HTRDP (“863 Program”) 2003 Machine Translation Evaluation Data (A part of Chinese-English & English-Chinese MT evaluation data)

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

The data covers Olympic-related domains which include game reports, weather forecasts, traffic and hotels, travel, foods, etc.

Size

The source files of the Chinese-to-English direction contain 437 dialog sentences and 169 text sentences, and the source files of the English-to-Chinese direction contain 496 dialog sentences and 322 text sentences. There are 4 reference translations made by human experts for each source sentence.

Description

The test data for the 2003 “863 Program” machine translation evaluation.

CWMT2008 Machine Translation Evaluation Data

 

(CLDC-2009-001)

(CLDC-2009-002)

Name

CWMT2008 Machine Translation Evaluation Data

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

News

Size

The source files of dialog and text data in the Chinese-to-English and English-to-Chinese directions contain about 1,000 sentence pairs each. There are 4 reference translations made by human experts for each source sentence.

Description

 

CWMT2009 Machine Translation Evaluation Data

Name

CWMT2009 Machine Translation Evaluation Data

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

Chinese-English

Domain

News

Size

The source files of dialog and text data in the Chinese-to-English and English-to-Chinese directions contain about 1,000 sentence pairs each. There are 4 reference translations made by human experts for each source sentence.

Description

 

CWMT2011 Machine Translation Evaluation Data

Name

CWMT2011 Machine Translation Evaluation Data

Providers

Institute of Computing Technology, Chinese Academy of Sciences

Languages

English-to-Chinese

Domain

News

Size

The source files of dialog and text data in the English-to-Chinese direction contain 3,187 sentence pairs. There are 4 reference translations made by human experts for each source sentence.

Description

 

1.2. The Chinese monolingual resources

XMU-CWMT2017

Name

XMU Natural Language Processing Group XINHUA News Corpus (2017)

Providers

Xiamen University

Languages

Chinese

Domain

News

Size