WMT17 Bandit Learning Task

The Task: Bandit Learning for Machine Translation

Bandit Learning for MT is a framework for training and improving MT systems from weak or partial feedback. Instead of a gold-standard human-generated translation, the learner receives feedback only on a single proposed translation (which is why the feedback is called partial), in the form of a translation quality judgement (which can be as weak as a binary accept/reject decision).

Amazon and the University of Heidelberg organize this Shared Task with the goal of encouraging researchers to investigate algorithms for learning from weak user feedback instead of from human references or post-edits, which require skilled translators. We are interested in finding systems that learn efficiently and effectively from this type of feedback, i.e. they learn fast and achieve high translation quality. Developing such algorithms is interesting for interactive machine learning and for human feedback in NLP in general.

In the WMT task setup, the user feedback will be simulated by a service hosted on Amazon Web Services (AWS), where participants can submit translations, receive feedback, and use this feedback for training an MT model. Reference translations will not be revealed at any point; evaluations are also done via the service.


Please find all details about setup, infrastructure, baselines, and final results in the 2017 shared task description paper.

    author = {Artem Sokolov and Julia Kreutzer and Kellen Sunderland and Pavel Danchenko and Witold Szymaniak and Hagen F\"{u}rstenau and Stefan Riezler},
    title = {A Shared Task on Bandit Learning for Machine Translation}, 
    booktitle = {Proceedings of the 2nd Conference on Machine Translation {(WMT)}}, 
    address = {Copenhagen, Denmark},
    month = sep,
    year = 2017

Important Dates

All dates are preliminary.

Registration via e-mail: until March 28, 2017
Access to mock service: March 13, 2017
Access to development service: March 30, 2017
Leaderboard is available: April 05, 2017
Online learning starts: April 25, 2017
Notification of evaluation results: May 26, 2017
Paper submission deadline: June 9, 2017
Camera-ready deadline: July 14, 2017

Why is it called Bandit Learning?

The name bandit is inherited from a model in which, in each round, a gambler in a casino pulls the arm of one of several slot machines, each called a "one-armed bandit", with the goal of maximizing the collected reward relative to the maximal possible reward, without a priori knowledge of the optimal slot machine. In MT, pulling an arm corresponds to proposing a translation, and rewards correspond to user feedback on translation quality. Bandit learners can be seen as one-state Markov Decision Processes (MDPs), which connects them to reinforcement learning: proposing a translation corresponds to choosing an action.
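The casino model can be made concrete with a small simulation. The following is a minimal sketch, not part of the task infrastructure: it assumes Bernoulli slot machines with made-up payout probabilities and uses epsilon-greedy, one common bandit strategy chosen here purely for illustration. The function name and parameters are hypothetical.

```python
import random


def epsilon_greedy(arm_probs, rounds, epsilon=0.1, seed=0):
    """Play `rounds` pulls on Bernoulli slot machines; return (total_reward, pull_counts)."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    means = [0.0] * len(arm_probs)          # running mean reward observed per arm
    total = 0.0
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(arm_probs))                       # explore
        else:
            arm = max(range(len(arm_probs)), key=means.__getitem__)   # exploit
        # Only the reward of the pulled arm is observed (partial feedback).
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total, counts


arm_probs = [0.2, 0.5, 0.8]                 # assumed payout rates, hidden from the gambler
total, counts = epsilon_greedy(arm_probs, rounds=5000)
regret = max(arm_probs) * 5000 - total      # shortfall vs. always playing the best arm
print(counts, round(regret, 1))
```

The gambler never sees what the other machines would have paid in a given round, which mirrors the MT setting: only the one proposed translation is judged.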

Online Learning Protocol

Bandit learning follows an online learning protocol in which, at each iteration, the learner receives a source sentence, predicts a translation, and receives a reward in the form of a task loss evaluation of the predicted translation. The learner does not know what the correct prediction looks like, nor what would have happened if it had predicted differently.

For t = 1, ..., T do
  1. Receive source sentence
  2. Predict translation
  3. Receive feedback to predicted translation
  4. Update system

Online interaction takes place through an AWS-hosted service that provides source sentences to the learner (step 1) and returns feedback (step 3) on the translation predicted by the learner (step 2). The learner updates its parameters using the feedback (step 4) and continues to the next example.
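The four protocol steps can be sketched as a client loop. This is an illustration under stated assumptions only: `ToyFeedbackService` is a hypothetical in-memory stand-in for the hosted service, its method names are invented and do not reflect the real client API, and the reward is a toy token-overlap score. `LexiconLearner` is likewise a placeholder whose update step only tracks statistics.

```python
class ToyFeedbackService:
    """Hypothetical in-memory stand-in for the hosted service (not the real client API)."""

    def __init__(self, parallel_data):
        self._data = list(parallel_data)    # (source, reference) pairs; references stay hidden
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._data)

    def get_source(self):
        return self._data[self._pos][0]

    def send_translation(self, translation):
        """Return a toy reward in [0, 1] and advance to the next sentence."""
        reference = set(self._data[self._pos][1].split())
        self._pos += 1
        return len(set(translation.split()) & reference) / len(reference)


class LexiconLearner:
    """Toy learner: word-by-word lookup; the update only tracks the average reward."""

    def __init__(self, lexicon):
        self.lexicon = dict(lexicon)
        self.num_updates = 0
        self.avg_reward = 0.0

    def predict(self, source):
        return " ".join(self.lexicon.get(word, word) for word in source.split())

    def update(self, reward):
        # A real system would adjust its model parameters here.
        self.num_updates += 1
        self.avg_reward += (reward - self.avg_reward) / self.num_updates


def run_online_learning(service, learner):
    rewards = []
    while service.has_next():
        source = service.get_source()                    # 1. receive source sentence
        translation = learner.predict(source)            # 2. predict translation
        reward = service.send_translation(translation)   # 3. receive feedback
        learner.update(reward)                           # 4. update system
        rewards.append(reward)
    return rewards


data = [("das haus ist klein", "the house is small"),
        ("das haus ist alt", "the house is old")]
service = ToyFeedbackService(data)
learner = LexiconLearner({"das": "the", "haus": "house", "ist": "is", "klein": "small"})
rewards = run_online_learning(service, learner)
print(rewards, learner.avg_reward)
```

Keeping the service behind a small interface like this makes it possible to swap the stand-in for the actual client code without touching the learning loop.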


For training seed systems, out-of-domain parallel data is restricted to the German-English Europarl, NewsCommentary, CommonCrawl and Rapid data from the News Translation (constrained) task; monolingual English data from the constrained task is also allowed. The out-of-domain system should be tuned on the 'newstest2016-deen' development set. It is recommended to use the same pre-processing as for the in-domain data (see below).

The in-domain data sequence for online learning will be e-commerce data provided by Amazon, pre-processed with Moses' scripts (removing non-printing characters, replacing and normalizing unicode punctuation, lowercasing, pre-tokenizing and tokenizing). Since the data comes from a substantially different domain, expect a large number of out-of-vocabulary terms. These data can only be accessed via the service. No reference translations will be revealed; the service returns only feedback on submitted translations.

Simulated real-valued reward feedback will be based on a combination of several quality models, including automatic measures with respect to human references (pre-processed in the same way), and will be normalized to the range [0,1] ('very bad' to 'excellent'). Feedback can only be accessed via the service. Only one feedback request is allowed per source sentence.
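To illustrate how such feedback could be simulated locally, here is a rough sketch. The actual quality models and their combination are not disclosed, so the two scoring functions, their weights, and the class name below are all assumptions made for illustration. The sketch only mirrors the properties stated above: a weighted combination of models, a result in [0,1], and a single feedback request per source sentence.

```python
from collections import Counter


def unigram_precision(hyp, ref):
    """Clipped fraction of hypothesis tokens that also occur in the reference."""
    if not hyp:
        return 0.0
    matches = Counter(hyp) & Counter(ref)
    return sum(matches.values()) / len(hyp)


def length_ratio(hyp, ref):
    """1.0 for equal lengths, smaller the more the lengths differ."""
    if not hyp or not ref:
        return 0.0
    return min(len(hyp), len(ref)) / max(len(hyp), len(ref))


class SimulatedFeedback:
    """Hypothetical simulator; the models and weights are stand-ins, not the task's."""

    def __init__(self, references, models):
        self._references = references       # sentence id -> hidden reference
        self._models = models               # list of (weight, scoring function)
        self._answered = set()

    def feedback(self, sentence_id, translation):
        if sentence_id in self._answered:
            raise ValueError("feedback already given for this source sentence")
        self._answered.add(sentence_id)
        hyp = translation.lower().split()
        ref = self._references[sentence_id].lower().split()
        total_weight = sum(weight for weight, _ in self._models)
        score = sum(weight * model(hyp, ref) for weight, model in self._models) / total_weight
        return min(1.0, max(0.0, score))    # clip to [0, 1]


simulator = SimulatedFeedback(
    {0: "the house is small", 1: "the house is old"},
    [(3, unigram_precision), (1, length_ratio)],
)
print(simulator.feedback(0, "the home is little"))
print(simulator.feedback(1, "the house is old"))
```

A second call to `simulator.feedback(0, ...)` raises a `ValueError`, reflecting the one-request-per-sentence rule.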


Three AWS-hosted services will be provided:
  1. Mock service to test client API: Will sample from a tiny in-domain dataset and simply return BLEU as feedback.
  2. Development service to tune algorithms and hyperparameters: Will sample from a larger in-domain dataset. Several runs will be allowed and evaluation results will be communicated to the participants.
  3. Online Learning service: Will sample from a very large in-domain dataset. Participants will have to consume a fixed number of samples during the allocated online learning period to be eligible for final evaluation. Feedback will be parameterized differently from the development service.

The respective data samples will be the same for all participants.


The following main evaluation metrics will be used:

Note that all evaluations are done during online learning and not in a separate offline testing phase.

How to Participate

  1. Pick your favourite MT system.
  2. Train an out-of-domain model on allowed data.
  3. Register for the task via email and receive further instructions on how to access the service.
  4. Wrap client code snippets around your MT system (all registered participants receive access to a GitHub repository with code examples).
  5. Setup: Test the in-domain training procedure with the mock service and ensure that your client sends translations and receives feedback.
  6. Tune: Find a clever strategy and good hyperparameters to learn from weak feedback (e.g. by simulating weak feedback from parallel data, or by using the development service).
  7. Train your in-domain model by starting from your out-of-domain model, submitting translations to the online learning service, receiving feedback and updating your model from this feedback.
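As an illustration of the tuning step, weak feedback can be simulated from parallel data and used to drive a simple bandit update. The sketch below is one possible approach and not the organizers' method: it assumes a fixed list of candidate translations, a softmax policy with one weight per candidate, a toy token-coverage reward, and a score-function (REINFORCE-style) gradient step. All names and hyperparameters are hypothetical.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def simulated_reward(hypothesis, reference):
    """Toy feedback from parallel data: fraction of reference tokens covered."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    return len(hyp & ref) / len(ref)


def train(candidates, reference, steps=5000, lr=0.1, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(candidates)       # one score per candidate translation
    for _ in range(steps):
        probs = softmax(weights)
        # Sample one translation; only this one receives feedback.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = simulated_reward(candidates[i], reference)
        # Score-function gradient of the expected reward: r * (1[j == i] - p_j).
        for j in range(len(weights)):
            weights[j] += lr * reward * ((1.0 if j == i else 0.0) - probs[j])
    return softmax(weights)


candidates = ["the house is small", "the home is little", "a building tiny"]
probs = train(candidates, reference="the house is small")
print([round(p, 3) for p in probs])
```

The reference is used only inside `simulated_reward`, so the learner sees the same kind of signal it would get from the service. Sampling-based updates like this are noisy, and the learning rate and number of steps are exactly the kind of hyperparameters worth tuning before the online learning phase.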

Contact / Questions

Feel free to contact us with any questions about the task or API at bandit_wmt@cl.uni-heidelberg.de. We also encourage participants to raise technical issues or general questions via the repository's issues page.


Pavel Danchenko, Amazon Development Center Berlin, Germany
Hagen Fuerstenau, Amazon Development Center Berlin, Germany
Julia Kreutzer, Heidelberg University, Germany
Stefan Riezler, Heidelberg University, Germany
Artem Sokolov, Heidelberg University and Amazon Development Center Berlin, Germany
Kellen Sunderland, Amazon Development Center Berlin, Germany
Witold Szymaniak, Amazon Development Center Berlin, Germany