Shared Task: Efficiency

The efficiency task measures the latency, throughput, memory consumption, and model size of machine translation systems on CPUs and GPUs. Participants provide their own code and models, while data and hardware are standardized. This is a continuation of the WNGT 2020 Efficiency Shared Task.

Important dates

Docker images should be submitted (as a URL and a sha512sum) to <wmt at kheafield.com> by July 28, 2021 (anywhere on Earth), which is the same as noon UTC on July 29, 2021. We follow the main workshop for paper deadlines, and data has already been released for the news task.

Translation model

Systems should translate from English to German following the constrained condition of the 2021 news task. Using another group's constrained model is permissible with citation.

Hardware

There are GPU and CPU conditions. The GPU is one A100 via Oracle Cloud BM.GPU4.8 (but we will limit your Docker to one GPU) and the CPU is an Intel Ice Lake via Oracle Cloud BM.Optimized3.36. We went with Oracle Cloud because Ice Lake is generally available there, while it is still in preview on other providers.

Latency and Throughput

Participants can choose to submit for latency, throughput, or ideally both.

Latency will be measured on the full GPU or one CPU core (with the rest of the CPU idle and an affinity constraint limiting your system to one core). The test harness will provide your system with one sentence on standard input and flush, then wait for your system to write a translation to its standard output (and flush) before providing the next sentence. The latency script is an example harness, though if systems are fast enough, we may rewrite it in C++.
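
The interaction above can be sketched as follows. This is only an illustration, not the real latency.py: `cat` stands in for "/run.sh CPU-1 latency" (it echoes each line back unbuffered, like a trivial translation system), and file names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch of the latency harness: send one sentence, flush, and block until
# the translation comes back before sending the next sentence.
# "cat" is a stand-in for the actual /run.sh invocation.
set -euo pipefail
coproc SYSTEM { cat; }
printf 'hello world\nsecond sentence\n' > harness_input.txt
while IFS= read -r sentence; do
  start_ns=$(date +%s%N)                       # timestamp before sending
  printf '%s\n' "$sentence" >&"${SYSTEM[1]}"   # one sentence, flushed
  IFS= read -r translation <&"${SYSTEM[0]}"    # wait for the translation
  end_ns=$(date +%s%N)
  printf '%s\n' "$translation"
  printf 'latency: %d ns\n' "$((end_ns - start_ns))" >&2
done < harness_input.txt > harness_output.txt
```

The point of the per-sentence handshake is that batching across sentences is impossible: each translation must be produced before the next sentence arrives.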

Throughput will be measured on the GPU or entire 36-core CPU machine. If there is interest in a single-core CPU throughput task, we can also run that.

Measurement

Each system will be run with 1 million lines of raw English input, where each line has at most 150 space-separated words (though your tokenizer will probably break that into more tokens). We will measure the following: wall-clock run time (loading included; see below), peak RAM consumption (and GPU RAM where applicable), model size on disk, and translation quality.

Participants may not use the WMT21 source or target sentences in preparing their submission (though if you submitted to the news task, your quota of 7 Ocelot submissions still applies).
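
As a sanity check on the input format described above (each line has at most 150 space-separated words), something like the following could be run over the input; the two-line sample here stands in for the real test file.

```shell
# Verify that no line of the (illustrative) sample exceeds 150
# space-separated words; awk's NF counts whitespace-separated fields.
printf 'one two three\nhello world\n' > sample_input.txt
awk 'NF > 150 { printf "line %d has %d words\n", NR, NF; bad = 1 } END { exit bad }' sample_input.txt \
  && echo "all lines within the 150-word limit"
# prints: all lines within the 150-word limit
```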

Results will be reported in a table showing all metrics. The presentation will include a series of Pareto frontiers comparing quality with each of the efficiency metrics. We welcome participants optimizing any of the metrics.

We will not be subtracting loading time from run times. The large input is intended to amortize loading time.

What is a model?

We will report model size on disk, which means we need to define a model as distinct from code. The model includes everything derived from data: all model parameters, vocabulary files, BPE configuration if applicable, quantization parameters or lookup tables where applicable, and hyperparameters like embedding sizes. You may compress your model using standard tools (gzip, bz2, xz, etc.) and the compressed size will be reported. Code can include simple rule-based tokenizer scripts and hard-coded model structure that could plausibly be used for another language pair. If we suspect that your model is hidden in code, we may ask you to provide another model of comparable size for a surprise language pair with reasonable quality.
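
Packaging the model files for size reporting could look like the following sketch; the file names are placeholders, and gzip is one of the accepted compression tools.

```shell
# Bundle illustrative model files and compress with gzip; the size of the
# compressed archive is what would be reported. Names are placeholders.
mkdir -p model
printf 'weights placeholder\n' > model/params.bin
printf 'vocab placeholder\n'   > model/vocab.txt
tar -czf model.tar.gz model
wc -c < model.tar.gz   # compressed size in bytes
```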

Docker submission

Competitors should submit a Docker image with all of the software and model files necessary to perform translation.

/run.sh $hardware $task <input >output runs translation. The $hardware argument will be "GPU", "CPU-1" (single CPU thread, no hyperthreads), or "CPU-ALL" (all CPU cores). The $task argument will be "latency" or "throughput". The input and output files, which will not necessarily have those names, are UTF-8 plain text separated by UNIX newlines. Each line of input should be translated to one line of output. For the latency task, we will actually run /wmt/latency.py /run.sh CPU-1 latency <input >output (or the same with GPU instead).
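
A minimal /run.sh respecting this contract might look like the following sketch, where `tr a-z A-Z` stands in for a real decoder (uppercasing is obviously not translation, but it maps each input line to one output line):

```shell
# Write an illustrative /run.sh; "tr a-z A-Z" is a stand-in for the decoder.
cat > run.sh <<'EOF'
#!/bin/sh
hardware="$1"   # GPU, CPU-1, or CPU-ALL
task="$2"       # latency or throughput
case "$hardware" in
  GPU|CPU-1|CPU-ALL) exec tr a-z A-Z ;;   # stdin to stdout, line for line
  *) echo "unknown hardware: $hardware" >&2; exit 1 ;;
esac
EOF
chmod +x run.sh
printf 'hello efficiency task\n' | ./run.sh CPU-1 throughput
# prints: HELLO EFFICIENCY TASK
```

A real script would dispatch on $hardware and $task to pick devices, thread counts, and batching, but the calling convention (two arguments, stdin in, stdout out) is the same.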

As an example, here is the single CPU throughput condition:
# Load the image and capture its name from the "Loaded image: <name>" output
image_name="$(docker load -i "${image_file_path}" | cut -d " " -f 3)"
# Start a detached container pinned to CPU core 0, with swap disabled
container_id="$(docker run -itd --cpuset-cpus=0 ${opt_memory} --memory-swap=0 "${image_name}" /bin/sh)"
# Time the translation; input and output are redirected around docker exec
(time docker exec -i "${container_id}" /run.sh CPU-1 throughput) <input.txt >"${result_directory}/run.stdout" 2>"${result_directory}/run.stderr"

In the CPU-ALL condition, your Docker container will be able to control CPU affinity, so numactl and taskset will work (provided, of course, that you include them in your container).
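
For instance, an entry point inside the container could pin work to chosen cores like this sketch; `echo` stands in for the real decoder, and taskset ships with util-linux.

```shell
# Pin a stand-in command to core 0; in the CPU-ALL condition you could
# spread worker processes across cores 0-35 the same way.
taskset -c 0 echo "pinned to core 0"
# prints: pinned to core 0
```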

Participants in past editions should note that the arguments to /run.sh have changed.

Multiple submissions are encouraged. You can submit multiple Docker containers and indicate which conditions to run them under. Please include your team's name in the name of the Docker file.

Post your Docker image online and send its URL and the sha512sum of the file to wmt at kheafield.com. If you need a place to upload to instead, contact us.
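
Computing the checksum is a one-liner; in this sketch a placeholder file stands in for the exported image, and the file name is illustrative.

```shell
# A placeholder stands in for the saved Docker image tarball
# (normally produced with something like: docker save <image> | gzip > myteam.tar.gz).
printf 'placeholder image contents\n' > myteam.tar.gz
sha512sum myteam.tar.gz   # the 128-hex-digit hash goes in the submission email
```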

Contact

Kenneth Heafield
wmt at kheafield dot com

Sponsors

European Union
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303.
Intel
Intel Corporation has supported the organization of this task.
Oracle for Research
Oracle has contributed cloud credits under the Oracle for Research program.
Microsoft
Microsoft is supporting human evaluation.