This is a new task that aims to evaluate systems on the translation of scientific publications for the biological and health domains. The documents were retrieved from the Scielo database of scientific publications. The biomedical translation task will address the following language pairs:
We will make available parallel corpora for the above three language pairs, as well as monolingual corpora for each of the four languages. The documents for both the parallel and the monolingual corpora were retrieved from the Scielo database. Each document consists of a title, an abstract, or both, depending on their availability in the database. Additionally, we will also make available a parallel corpus of Medline titles.
All files are available in the WMT'16 biomedical task Google Drive account.
The parallel documents from the Scielo database are located in the "scielo" folder. There is no parallel dataset for the biological domain and the language pair FR/EN. Please use out-of-domain corpora or the health and Medline datasets as training data for this pair.
The Scielo corpus is available in the BioC XML format, for which readers and writers are available for many programming languages, as well as various natural language processing tools for biomedicine. There are specific values for the attribute "key" of the XML tag "infon" to identify the language of each document, the section (title or abstract) and the number of the sentence, as illustrated in the example below:
We have aligned the documents from the Scielo database with the GMA tool. The files derived from this alignment are located in the "scielo-gma" folder and include the following files for each section of the document (title and abstract/text):
The Medline documents are located in the "medline" folder.
The monolingual documents are located in the "scielo-monolingual" folder.
The training data and the test data are available in the BioC format. More information about BioC, as well as readers and writers for many programming languages, can be found on the BioC website.
An example of the test set format is shown below for the English to Spanish (en2es) language pair:
An example of the submission format is shown below for the above en2es language pair:
Please identify each sentence with the corresponding "sentnum" specified in the test file. The submission file has the same format as the test file, except for the "language" attribute, which should contain the target language instead of the source language, and the "text" tag, which should contain the translation of the text into the target language.
Please register your team using this form. You will receive an email confirming your registration. The submission link is provided in that email.
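As a rough illustration of the conversion described above, the following Python sketch turns a test file into a submission file using only the standard library. The sample XML layout (a "sentence" element holding the "sentnum" infon and the "text" tag) is an assumption made for the example; the actual test files should be checked against it. The names `to_submission` and `SAMPLE_TEST_XML` are hypothetical, and `str.upper` merely stands in for a real translation system.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified test file: the real layout may differ.
SAMPLE_TEST_XML = """<collection><document><id>doc1</id><passage>
<infon key="language">en</infon><infon key="section">title</infon>
<sentence><infon key="sentnum">1</infon><text>Heart disease</text></sentence>
</passage></document></collection>"""


def to_submission(test_xml, target_lang, translate):
    """Return submission XML: same structure, new language, translated text."""
    root = ET.fromstring(test_xml)

    # Replace the source language with the target language.
    for infon in root.iter("infon"):
        if infon.get("key") == "language":
            infon.text = target_lang

    # Translate every "text" tag in place; "sentnum" infons are left
    # untouched, so each translation keeps its sentence identifier.
    for text in root.iter("text"):
        text.text = translate(text.text or "")

    return ET.tostring(root, encoding="unicode")


# Placeholder "translation" for demonstration only.
submission_xml = to_submission(SAMPLE_TEST_XML, "es", str.upper)
print(submission_xml)
```

Editing the parsed test file in place, rather than building a new document, keeps every other element identical to the test file, which is what the submission format asks for.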
The test files are available in the "testset" folder in the WMT'16 biomedical task Google Drive account and are named according to the dataset (biological or health) and the language pair (e.g., en2es or es2en). For instance, the test file for the biological dataset for English to Spanish is called "biological_en2es.xml".
The name of each submission file should consist of the original test file name preceded by the team identifier (as registered in the form above) and the run number. For example, the submission file for run 1 of the "HPI" team for the biological dataset for English to Spanish should be called "HPI_run1_biological_en2es.xml".
Each team is allowed to submit up to 3 runs per test file, i.e., 3 runs for the "biological_en2es.xml" test file, 3 runs for the "biological_es2en.xml" test file, etc. There is no biological test set for either the "fr2en" or the "en2fr" language pair.
|Release of training data||end of January 2016|
|Release of test data||April 15, 2016|
|Results submission deadline|
|Paper submission deadline|
|Notification of acceptance||June 5, 2016|
|Camera-ready deadline||June 22, 2016|