Shared Task: Biomedical Translation Task

Task description

This is a new task that aims to evaluate systems on the translation of scientific publications in the biological and health domains. The documents were retrieved from the Scielo database of scientific publications. The biomedical translation task will address the following language pairs: Spanish/English (ES/EN), French/English (FR/EN) and Portuguese/English (PT/EN).

Data

We will make available parallel corpora for the above three language pairs, as well as monolingual corpora for each of the four languages. Both the parallel and the monolingual corpora were retrieved from the Scielo database. Each document consists of a title, an abstract, or both, depending on their availability in the database. Additionally, we will make available a parallel corpus of Medline titles.

All files are available in the WMT'16 biomedical task Google Drive account.

Parallel corpora from Scielo

The parallel documents from the Scielo database are located in the "scielo" folder. There is no parallel dataset for the biological domain for the FR/EN language pair. Please use out-of-domain corpora or the health and Medline datasets as training data.

Dataset | ES/EN | FR/EN | PT/EN
Biological | es-en-training-biological.xml.gz | - | pt-en-training-biological.xml.gz
Health | es-en-training-health.xml.gz | fr-en-training-health.xml.gz | pt-en-training-health.xml.gz

The Scielo corpus is available in the BioC XML format, for which readers and writers are available for many programming languages, as well as various natural language processing tools for biomedicine. There are specific values for the attribute "key" of the XML tag "infon" to identify the language of each document, the section (title or abstract) and the number of the sentence, as illustrated in the example below:

<document>
  <id>S0034-77441998000200003</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">abstract</infon>
    <sentence>
      <infon key="sentnum">0</infon>
      <text>The gastrointestinal activity of an aqueous extract of the dry wood of Quassia amara was investigated using animal models. </text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text> Oral administration of the extract to mice produces an increase of gastrointestinal transit at doses of 500 and 1000 mg/kg. The antiulcerogenic activity was measured inducing ulcers on Sprague-Dowly rats with indomethacin or ethanol and by the induction of stress.</text>
    </sentence>
    <sentence>
      <infon key="sentnum">2</infon>
      <text> The experimental group was treated orally with the extract, using doses of 250, 500 and 1000 mg/kg before inducing the ulcers.</text>
    </sentence>
    ...
  </passage>
</document>
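The "infon" keys described above can be read with any XML parser. The following is a minimal sketch using Python's standard library; the helper names `infon` and `read_sentences` are hypothetical, the sample texts are abbreviated, and the function assumes it is handed a single <document> element (how documents are wrapped inside the distributed files is not covered here).

```python
import xml.etree.ElementTree as ET

# Shortened sample in the layout of the example (sentence texts abbreviated).
SAMPLE = """<document>
  <id>S0034-77441998000200003</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">abstract</infon>
    <sentence>
      <infon key="sentnum">0</infon>
      <text>The gastrointestinal activity of an aqueous extract was investigated. </text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text> Oral administration of the extract to mice increases gastrointestinal transit.</text>
    </sentence>
  </passage>
</document>"""


def infon(element, key):
    """Return the text of the direct child <infon> with the given key, or None."""
    for child in element.findall("infon"):
        if child.get("key") == key:
            return child.text
    return None


def read_sentences(document):
    """Collect (doc id, language, section, sentnum, text) for every sentence."""
    doc_id = document.findtext("id")
    rows = []
    for passage in document.findall("passage"):
        language = infon(passage, "language")
        section = infon(passage, "section")
        for sentence in passage.findall("sentence"):
            sentnum = int(infon(sentence, "sentnum"))
            # Sentence texts may carry leading or trailing spaces; strip them.
            text = (sentence.findtext("text") or "").strip()
            rows.append((doc_id, language, section, sentnum, text))
    return rows


rows = read_sentences(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Keeping the document identifier, section and "sentnum" together with each sentence makes it straightforward to pair source and target sentences later on.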

Aligned parallel corpora from Scielo

We have aligned the documents from the Scielo database with the GMA tool. The files derived from this alignment are located in the "scielo-gma" folder and include the following files for each section of the document (title and abstract/text):

Dataset | ES/EN | FR/EN | PT/EN
Biological | es-en-gma-biological.tar.gz | - | pt-en-gma-biological.tar.gz
Health | es-en-gma-health.tar.gz | fr-en-gma-health.tar.gz | pt-en-gma-health.tar.gz

Parallel corpora from Medline

The Medline documents are located in the "medline" folder.

Dataset | ES/EN | FR/EN | PT/EN
Medline | pubmed_en_es.txt.zip | pubmed_en_fr.txt.zip | pubmed_en_pt.txt.zip

Monolingual corpora from Scielo

The monolingual Scielo documents will be located in the "scielo-monolingual" folder.

Out-of-domain corpora

For out-of-domain corpora, please check other machine translation tasks in the WMT'16 challenge, such as news and IT.

Evaluation

Evaluation will be carried out both automatically and manually. Automatic evaluation will make use of standard machine translation metrics, such as BLEU and/or METEOR. Native speakers in each of the languages will manually check the quality of the translation for a small sample of the submissions. The Appraise system will be used for this purpose.
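For readers unfamiliar with BLEU, the sketch below illustrates the idea behind the metric: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is only an illustration, not the scoring script used for this task. It assumes whitespace tokenisation, a single reference per sentence and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis (assumption)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # whitespace tokenisation (assumption)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output that is shorter than the reference.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"])
print(round(score, 4))
```

Established implementations differ in tokenisation and smoothing, so scores from a sketch like this are not comparable with officially reported results.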

Submission format

The training data and the test data are available in the BioC format. More information about BioC, as well as readers and writers for many programming languages, can be found on the BioC website.

An example of the test set format is shown below for the English to Spanish (en2es) language pair:

<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>title sentence</text>
    </sentence>
  </passage>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">abstract</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>sentence 0</text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text>sentence 1</text>
    </sentence>
    ...
  </passage>
</document>

An example of the submission format is shown below for the above en2es language pair:

<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">ES</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>translation of title sentence</text>
    </sentence>
  </passage>
  <passage>
    <infon key="language">ES</infon>
    <infon key="section">abstract</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>translation of sentence 0</text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text>translation of sentence 1</text>
    </sentence>
    ...
  </passage>
</document>

Please identify each sentence with the corresponding "sentnum" specified in the test file. The submission file has the same format as the test file, except for the "language" attribute, which should contain the target language instead of the source language, and the "text" tag, which should contain the translation of the text into the target language.
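One possible way to produce a file in this format is to copy the test document and change only the "language" value and the content of each "text" tag, leaving identifiers, sections and "sentnum" values untouched. The sketch below assumes this approach; `make_submission` and `fake_translate` are hypothetical names, and `fake_translate` merely stands in for a real translation system.

```python
import xml.etree.ElementTree as ET

# Small test document following the en2es example.
TEST_XML = """<document>
  <id>S123456789</id>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">title</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>title sentence</text>
    </sentence>
  </passage>
  <passage>
    <infon key="language">EN</infon>
    <infon key="section">abstract</infon>
    <offset>-1</offset>
    <sentence>
      <infon key="sentnum">0</infon>
      <offset>-1</offset>
      <text>sentence 0</text>
    </sentence>
    <sentence>
      <infon key="sentnum">1</infon>
      <offset>-1</offset>
      <text>sentence 1</text>
    </sentence>
  </passage>
</document>"""


def make_submission(document, target_language, translate):
    """Rewrite a test document in place into the submission format."""
    for passage in document.iter("passage"):
        for infon in passage.findall("infon"):
            if infon.get("key") == "language":
                infon.text = target_language  # e.g. EN becomes ES
        for sentence in passage.findall("sentence"):
            text = sentence.find("text")
            if text is not None:
                text.text = translate(text.text or "")
    return document


def fake_translate(source):
    """Placeholder for a real translation system (hypothetical)."""
    return "translation of " + source


submission = make_submission(ET.fromstring(TEST_XML), "ES", fake_translate)
print(ET.tostring(submission, encoding="unicode"))
```

Because the document tree is modified in place rather than rebuilt, every sentence keeps the "sentnum" it had in the test file.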

Submission Requirements

Please register your team using this form. You will receive an email confirming your registration, which also contains the link for submission.

The test files are available in the "testset" folder in the WMT'16 biomedical task Google Drive account. They are named according to the dataset (biological or health) and the language pair (e.g., en2es or es2en). For instance, the test file for the biological dataset for English to Spanish is called "biological_en2es.xml".

The name of each submission file should consist of the original test file name preceded by the team identifier (as registered in the form above) and the run number. For example, the submission file for run 1 of the "HPI" team for the biological dataset for English to Spanish should be called "HPI_run1_biological_en2es.xml".

Each team is allowed to submit up to 3 runs per test file, i.e., 3 runs for the "biological_en2es.xml" test file, 3 runs for the "biological_es2en.xml" test file, etc. There is no biological test set for either the "fr2en" or the "en2fr" language pair.

Important dates

Release of training data: end of January 2016
Release of test data: April 15, 2016
Results submission deadline: April 26, 2016 (extended from April 22, 2016)
Paper submission deadline: May 15, 2016 (extended from May 8, 2016)
Notification of acceptance: June 5, 2016
Camera-ready deadline: June 22, 2016

Organisers

Antonio Jimeno Yepes (IBM Research Australia)
Aurélie Névéol (LIMSI, CNRS, France)
Mariana Neves (Hasso-Plattner Institute, Germany)
Karin Verspoor (University of Melbourne, Australia)


Please contact us by email at wmtbiomedical@gmail.com. Please also join our discussion forum.