
WMT 2015

The WMT 2015 evaluation campaign is a good excuse to build baseline systems that use all the best known methods.

The somewhat original contribution is the use of word classes for all components, i.e.,

  • language model (done before)
  • operation sequence model (done before, see paper)
  • sparse features (new!)
  • reordering model (new!)

This can be seen as a last desperate attempt to avoid the inevitable onslaught of neural networks by emulating one of their benefits: the pooling of evidence in more generalized representations.

It is also a useful exercise in how to run such large-scale experiments on the cluster.

Lessons on cluster usage

Lesson 1: Store each language pair on its own disk.

I built machine translation systems for 8 language pairs (not doing Finnish). I started out storing them all on /export/b10, which was fine at first, but I then ran into serious trouble when running many (10-20) processes that all access this disk heavily.

Decoding in particular requires reading typically 50GB of model files (mostly LM) from disk. While other processes write to the same disk at the same time (building translation tables), the disk becomes extremely slow, which is quite noticeable even on the command line (starting vi takes 1 minute...).
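
A minimal sketch of what this means in practice; only /export/b10 is from the text above, the other disk names and directory names are made up for illustration, and working-dir is the standard [GENERAL] setting in the EMS config:

  # one disk per language pair
  mkdir -p /export/b10/wmt15-de-en
  mkdir -p /export/b11/wmt15-en-de
  mkdir -p /export/b12/wmt15-cs-en

  # then point each EMS config at its own disk, e.g. in the de-en config:
  # [GENERAL]
  # working-dir = /export/b10/wmt15-de-en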

Lesson 2: Not learned yet. Maybe spread decoding out over multiple processes, or not.

Using a Grid Engine cluster allows decoder runs to be distributed onto multiple machines. This seemed to work at first, but I then ran into problems with starting up the Moses processes - at crunch time it took up to 5 hours to load the models.

There is no final verdict on this, since it is related to Lesson 1. Maybe it is possible to have 5 processes using 20 cores each to run the decoder, but there will be a load time / decoding time tradeoff: more processes mean a longer load time for each process, but faster decoding. A sketch of such a setup is given below.
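
For concreteness, a sketch of the 5-processes-with-20-cores-each variant in the EMS config, using the standard jobs and decoder-settings options; the numbers are just the example from above, not a recommendation:

  [EVALUATION]
  # split each test set into 5 separate decoder jobs on the cluster
  jobs = 5
  # each job runs a multi-threaded Moses instance on 20 cores
  decoder-settings = "-threads 20"
  decode:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"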

Lesson 3: Cluster configuration in EMS's config

Some steps require a lot of memory or multiple CPUs. This needs to be properly communicated to GridEngine.

Run experiment.perl with the -cluster switch and have the following settings in your config (a sketch of how these look in the config file follows the list):

  • [GENERAL] qsub-settings = "-l 'arch=*64'"
  • [LM] train:qsub-settings = "-l 'arch=*64,mem_free=30G,ram_free=30G'"
  • [INTERPOLATED-LM] interpolate:qsub-settings = "-l 'arch=*64,mem_free=100G,ram_free=100G'"
  • [TRAINING] run-giza:qsub-settings = "-l 'arch=*64,mem_free=10G,ram_free=10G' -pe smp 9"
  • [TRAINING] run-giza-inverse:qsub-settings = "-l 'arch=*64,mem_free=10G,ram_free=10G' -pe smp 9"
  • [TUNING] set jobs to an appropriate number (maybe just 1)
  • [TUNING] tune:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"
  • [EVALUATION] set jobs to an appropriate number (maybe just 1)
  • [EVALUATION] decode:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"
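
For orientation, a sketch of how a few of these settings sit in the config file and how the experiment is then launched; the config file name is made up, the experiment.perl switches shown are the standard ones:

  # excerpt from e.g. config.wmt15-de-en
  [GENERAL]
  qsub-settings = "-l 'arch=*64'"

  [LM]
  train:qsub-settings = "-l 'arch=*64,mem_free=30G,ram_free=30G'"

  [TUNING]
  jobs = 1
  tune:qsub-settings = "-l 'arch=*64,mem_free=50G,ram_free=50G' -pe smp 20"

  # run it on the cluster
  experiment.perl -config config.wmt15-de-en -cluster -exec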

Experiments

Start with hill-climbing to the desired setup. Note: mkcls for many classes gets very slow. Worst case: 2000 classes for the 1 billion word French-English parallel corpus take about a month (GIZA++ takes even longer, so what gives).
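
For reference, a stand-alone mkcls run along these lines would look roughly as follows; the corpus file name and the single optimization run (-n1) are assumptions, and more optimization runs are slower still:

  # 2000 word classes over the French side of the parallel corpus (file name assumed)
  mkcls -c2000 -n1 -pcorpus.tok.fr -Vcorpus.tok.fr.classes opt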

Scores are BLEU, with the length ratio in parentheses.

Language Pair | Baseline              | brown60/200/600 OSM+LM+Sparse | +Reorder                  | +brown2000
[4] de-en     | [4-1] 27.16 (1.011)   |                               |                           | [4-4] 27.46 (1.011)
[5] en-de     | [5-1] 20.41 (1.003)   | [5-2] 20.82 (1.004) +.41      | [5-3] 20.87 (1.004) +.05  | [5-4] 20.89 (1.006) +.02
[6] cs-en     | [6-1] 26.44 (1.026)   | [6-2] 26.69 (1.026) +.25      |                           | [6-4] 26.92 (1.028)
[7] en-cs     | [7-1] 18.96 (0.996)   | [7-2] 19.51 (0.995) +.55      | [7-3] 19.86 (0.990) +.35  | [7-4] 19.76 (0.991) -.10
[8] fr-en     | [8-1] 31.67 (1.030)   |                               |                           |
[9] en-fr     | [9-1] 31.22 (0.995)   |                               |                           |
[10] ru-en    | [10-1] 24.39 (1.026)  | [10-2] 24.69 (1.023) +.30     | [10-3] 24.65 (1.024) -.04 | [10-4] 24.83 (1.020) +.18
[11] en-ru    | [11-1] 19.37 (0.996)  | [11-2] 20.11 (0.995) +.74     | [11-3] 20.25 (0.996) +.14 | [11-4] 20.28 (0.997) +.03