Description of Data.

This folder contains data collected and shared by the China Workshop on Machine Translation (CWMT) community for the training, development and evaluation of machine translation systems between Chinese and English in the news domain.

There are three types of data for Chinese-English machine translation: Monolingual Chinese text, Parallel Chinese-English text, and Multiple-Reference text.

Monolingual Chinese text:

  1. xmu corpus

    The xmu corpus was collected by the NLP lab at Xiamen University, China. The corpus contains news articles published in 2011 on the xinhuanet website (http://news.xinhuanet.com), drawn from different channels including political news, international news, financial news, forum, education, etc.

    Each article is marked with its title, source, date and url.

    The corpus contains a total of 662,904 articles, about 11 million words.

Parallel Chinese-English text:

  1. casia2015 corpus

    The casia2015 corpus is provided by the research group at the Institute of Automation, Chinese Academy of Sciences.

    The corpus contains about 1 million sentence pairs automatically collected from the web.

  2. casict2011 corpus

    The casict2011 corpus is provided by the research group at the Institute of Computing Technology, Chinese Academy of Sciences.

    The corpus consists of 2 parts, each containing about 1 million sentence pairs (about 2 million in total) automatically collected from the web.

    The sentence-level alignment precision is about 90%.

  3. casict2015 corpus

    The casict2015 corpus is provided by the research group at the Institute of Computing Technology, Chinese Academy of Sciences.

    The corpus contains about 2 million sentence pairs, including sentences collected from the web (60%), from movie subtitles (20%) and from an English/Chinese thesaurus (20%).

    The sentence-level alignment precision is higher than 99%.

  4. datum2015 corpus

    The datum2015 corpus is provided by Datum Data Co., Ltd.

    The corpus contains 1 million sentence pairs covering different genres such as textbooks for language education, bilingual books, technological documents, bilingual news, government white papers, government documents, bilingual resources on the web, etc.

    Please note that several portions of the Chinese side of the data are word-segmented.

  5. datum2017 corpus

    The datum2017 corpus is provided by Datum Data Co., Ltd.

    The corpus contains 20 files, covering different genres such as news, conversations, legal documents, novels, etc.

    Each file has 50,000 sentences. The whole corpus contains 1 million sentences.

    The first 10 files (Book1-Book10) have their Chinese side word-segmented.

  6. neu2017 corpus

    The neu2017 corpus is provided by the NLP lab of Northeastern University, China.

    The corpus contains 2 million sentence pairs automatically collected from the web, including news, technical documents, etc.

    The sentence-level alignment precision is about 90%.
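
Because parts of datum2015 and the first 10 files of datum2017 are word-segmented on the Chinese side while the other corpora are not, it may be useful to undo the segmentation before mixing them. The following is a minimal Python sketch, not part of the distributed data or tools. It assumes that segmentation is marked by whitespace between Chinese tokens; the function name desegment and the character ranges used to detect Chinese text are illustrative choices, not something specified by the data providers.

```python
import re

# Assumption: Chinese text is detected by these Unicode ranges only:
# CJK Unified Ideographs, CJK punctuation (excluding the ideographic
# space U+3000), and fullwidth forms.
_CJK = r"\u4e00-\u9fff\u3001-\u303f\uff00-\uffef"

# Whitespace that has a Chinese character on both sides.
_SEG_SPACE = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def desegment(line: str) -> str:
    """Remove whitespace between adjacent Chinese characters.

    Whitespace next to non-Chinese tokens (Latin words, digits) is kept,
    because it cannot be told apart from ordinary word spacing.
    """
    return _SEG_SPACE.sub("", line.strip())


# Example with a made-up segmented sentence:
print(desegment("我 喜欢 机器 翻译 。"))
```

Whether to also drop the spaces around embedded English words or numbers depends on how the unsegmented corpora format such text, so that decision is left out of the sketch.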

Multiple-Reference text:

This part of the data is provided by the research group at the Institute of Computing Technology, Chinese Academy of Sciences.

The datasets are the development/evaluation data from the MT evaluations held in China in previous years, including 863-2003, 863-2004, 863-2005, SSMT-2007, CWMT2008, CWMT2009 and CWMT2011.

Please see the readme file in each data folder for more information.

Download:

The data can currently be downloaded from the following FTP server.

ftp://nlp.nju.edu.cn

username: cwmt-wmt

passwd: cwmt-wmt
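
As a rough illustration of scripted access, the sketch below uses Python's standard ftplib module with the host and login given above. The directory layout on the server is not described here, so the sketch only lists the top-level directory and offers a generic download helper; the function names list_root and download and the timeout value are assumptions made for this example.

```python
from ftplib import FTP

# Host and credentials as given in this file.
HOST = "nlp.nju.edu.cn"
USER = "cwmt-wmt"
PASSWORD = "cwmt-wmt"


def list_root(host=HOST, user=USER, password=PASSWORD, timeout=30):
    """Log in and return the names found in the server's top-level directory."""
    with FTP(host, timeout=timeout) as ftp:
        ftp.login(user=user, passwd=password)
        return ftp.nlst()


def download(remote_name, local_path, host=HOST, user=USER,
             password=PASSWORD, timeout=30):
    """Fetch one remote file in binary mode and write it to local_path."""
    with FTP(host, timeout=timeout) as ftp, open(local_path, "wb") as out:
        ftp.login(user=user, passwd=password)
        ftp.retrbinary(f"RETR {remote_name}", out.write)


# Usage (requires network access to the server):
#   for name in list_root():
#       print(name)
```

Calling list_root() first shows which names actually exist on the server, and those names can then be passed to download().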