This document provides instructions for downloading the WMT22 General MT task datasets for the constrained track using mtdata.

1. Setup

pip install mtdata==0.3.7
# pip install  # Install from develop branch

2. Get Recipes File

Config file for the CONSTRAINED track (missing the datasets behind registration: CzEng2.0 and CCMT):


By default, the recipe file has to be in the current directory (where mtdata is invoked) and its name has to match the *.yml glob. If you would like to place all your recipe YML files in a specific directory, then export MTDATA_RECIPES=/path/to/dir
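
The directory-based setup can be sketched as follows. The location `recipes` under the current directory and the `RECIPES_DIR` variable are assumptions chosen for illustration; only `MTDATA_RECIPES` comes from the text above.

```shell
# Hypothetical location: any directory works as long as it holds the recipe *.yml files.
RECIPES_DIR="${RECIPES_DIR:-$PWD/recipes}"

# Create the directory if needed; recipe files would then be copied into it.
mkdir -p "$RECIPES_DIR"

# Point mtdata at the directory for the rest of this shell session.
export MTDATA_RECIPES="$RECIPES_DIR"
echo "MTDATA_RECIPES=$MTDATA_RECIPES"
```

Because the variable is exported only for the current session, it would need to go into a shell profile to persist across logins.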

If you are considering participating in the UNCONSTRAINED track, then any data is allowed. For example, you may use the following config file, which contains a larger set of corpora.


3. List Available Recipes

$ mtdata list-recipe | cut -f1 | grep wmt22

All wmt22* ids are loaded from the *.yml file.

4. Download Recipes

Download a Recipe
# example: wmt22-csen
mtdata get-recipe -ri wmt22-csen -o wmt22-csen
Download All Recipes
for ri in wmt22-{csen,deen,jaen,ruen,zhen,frde,hren,liven,uken,ukcs,sahru}; do
  mtdata get-recipe -ri "$ri" -o "$ri"
done
  1. Two datasets listed on the WMT 22 page, CzEng2.0 and CCMT, require login and will not be downloaded by this tool.

  2. Newstest 2021 is not supported yet. See current status (#116)

Usage: mtdata get-recipe
$  mtdata get-recipe  -h
usage: mtdata get-recipe [-h] -ri RECIPE_ID [-f] [-j N_JOBS] [--merge | --no-merge] [--compress] [-dd] [-dt] -o OUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  -ri RECIPE_ID, --recipe-id RECIPE_ID
                        Recipe ID (default: None)
  -f, --fail-on-error   Fail on error (default: False)
  -j N_JOBS, --n-jobs N_JOBS
                        Number of worker jobs (processes) (default: 1)
  --merge               Merge train into a single file (default: True)
  --no-merge            Do not Merge train into a single file (default: False)
  --compress            Keep the files compressed (default: False)
  -dd, --dedupe, --drop-dupes
                        Remove duplicate (src, tgt) pairs in training (if any); valid when --merge. Not recommended for large datasets. (default: False)
  -dt, --drop-tests     Remove dev/test sentences from training sets (if any); valid when --merge (default: False)
  -o OUT_DIR, --out OUT_DIR
                        Output directory name (default: None)
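
As an illustration of combining the options listed above, here is a small wrapper sketch. The function name `fetch_recipe` and the `MTDATA` override variable are hypothetical helpers, not part of mtdata; the flags themselves (`-ri`, `-o`, `-j`, `--merge`, `-dt`) are taken from the help text. The choice of 4 worker jobs is arbitrary.

```shell
# Hypothetical helper: download one recipe into a directory named after it.
# Set MTDATA=echo to print the arguments instead of running mtdata (dry run).
fetch_recipe() {
  # $1: recipe id, e.g. wmt22-csen
  # -j 4    : use 4 worker processes
  # --merge : merge train into a single file (already the default, stated for clarity)
  # -dt     : remove dev/test sentences from the training set; valid when --merge
  ${MTDATA:-mtdata} get-recipe -ri "$1" -o "$1" -j 4 --merge -dt
}
```

Calling `fetch_recipe wmt22-csen` would then run the download, while `MTDATA=echo fetch_recipe wmt22-csen` only prints the arguments that would be passed.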

5. Add/Customize a Recipe

Here is an example recipe:

- id: wmt22-deen  # (1)
  langs: deu-eng
  desc: WMT 22 General MT
  dev:  # (2)
    - Statmt-newstest_deen-2020-deu-eng
    - Statmt-newstest_ende-2020-eng-deu
  test:  # (2)
    #- Statmt-newstest_deen-2021-deu-eng
    #- Statmt-newstest_ende-2021-eng-deu
  train:  # (3)
    - Statmt-europarl-10-deu-eng
    - ParaCrawl-paracrawl-9-eng-deu
    - Statmt-commoncrawl_wmt13-1-deu-eng
    - Statmt-news_commentary-16-deu-eng
    - Statmt-wikititles-3-deu-eng
    - Tilde-rapid-2019-deu-eng # - Tilde-rapid-2016-deu-eng
    - Facebook-wikimatrix-1-deu-eng
  1. id has to be unique.

  2. dev and test are optional. Each can be a single dataset (i.e. a string) or a list of datasets (i.e. a list of strings).

  3. train is required.
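
A custom recipe following these rules could be written out as in the sketch below. The file name `my-recipes.yml`, the recipe id `my-deen-small`, and its description are hypothetical; the dataset IDs are reused from the example above. The recipe uses a single string for dev and omits test, which the notes above allow.

```shell
# Write a hypothetical custom recipe into the current directory.
# The file name only needs to match the *.yml glob.
cat > my-recipes.yml <<'EOF'
- id: my-deen-small
  langs: deu-eng
  desc: Small custom German-English recipe
  dev: Statmt-newstest_deen-2020-deu-eng
  train:
    - Statmt-europarl-10-deu-eng
    - Statmt-news_commentary-16-deu-eng
EOF
```

If mtdata is invoked from the same directory, `mtdata list-recipe` should then be expected to list the new id alongside the existing ones, provided the id does not collide with another recipe.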

6. Issues / Bugs

Please report them using GitHub issues at .