This directory contains development data that may be useful for developing automatic evaluation metrics for statistical machine translation.  The files in the segment-rankings/ subdirectory contain segment-level rankings for each of the data conditions in the Workshop on Statistical Machine Translation at ACL 2007 (http://www.statmt.org/wmt07/).  The files look like this:

% head -18 segment-rankings/cs-en.nc-test

1 cu = uedin
1 cu > pctranslator2007
1 uedin > pctranslator2007
1 umd > cu
1 umd > pctranslator2007
1 umd > uedin
14    cu > pctranslator2007
14    cu > uedin
14    cu > umd
14    uedin = umd
14    uedin > pctranslator2007
14    umd > pctranslator2007
16    cu = pctranslator2007
16    cu = uedin
16    cu = umd
16    pctranslator2007 = uedin
16    pctranslator2007 = umd
16    uedin = umd

The number indicates the segment being judged (indexed from 1, not 0).  The information following the segment number gives the relative ranking of a pair of systems: ">" means the first system was judged better than the second, and "=" means the two were judged equal.  For instance, on the first segment the cu system was better than the pctranslator2007 system, equal to the uedin system, and worse than the umd system.  The system translations are provided in the submissions/ subdirectory.  Here are the translations produced by the aforementioned four systems:

% head -1 submissions/cs-en/*
==> submissions/cs-en/wmt07.cu.nc-test.cs-en <==
Racially divided Europe

==> submissions/cs-en/wmt07.pctranslator2007.nc-test.cs-en <==
Racially fission Europe

==> submissions/cs-en/wmt07.uedin.nc-test.cs-en <==
A racially divided Europe

==> submissions/cs-en/wmt07.umd.nc-test.cs-en <==
A Racially Divided Europe

The corresponding reference segment is contained in the reference/ subdirectory:

% head -1 reference/nc-test2007.en 
Europe's Divided Racial House

The source segment is in the source/ subdirectory:

% head -1 source/nc-test2007.cs 
Rasově rozdělená Evropa
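
For readers who want to load this data programmatically, here is a minimal Python sketch of how one line of a segment-rankings file might be parsed and how a segment number maps onto a line of a submission, reference, or source file.  It is not part of the distributed scripts.  The names Ranking, parse_ranking_line, and segment_text are made up for illustration, and the sketch assumes that the fields are separated by whitespace and that each segment occupies exactly one line of the text files.

```python
from typing import List, NamedTuple, Optional


class Ranking(NamedTuple):
    """One pairwise judgement (hypothetical container, not from the scripts)."""
    segment: int    # 1-based segment number
    system_a: str
    relation: str   # ">" or "=" in the files shown above; "<" is also accepted
    system_b: str


def parse_ranking_line(line: str) -> Optional[Ranking]:
    """Parse a line such as '1 cu > pctranslator2007'; return None if blank."""
    fields = line.split()  # assumption: whitespace-separated fields
    if not fields:
        return None
    if len(fields) != 4 or fields[2] not in {">", "=", "<"}:
        raise ValueError(f"unexpected ranking line: {line!r}")
    segment, system_a, relation, system_b = fields
    return Ranking(int(segment), system_a, relation, system_b)


def segment_text(lines: List[str], segment: int) -> str:
    """Return the text for a 1-based segment number from a list of lines."""
    return lines[segment - 1]


if __name__ == "__main__":
    print(parse_ranking_line("1 cu > pctranslator2007"))
    print(segment_text(["Racially divided Europe"], 1))
```

Because segment numbers start at 1, segment_text subtracts one before indexing into the list of lines.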

The rankings were produced by running a command like the following (shown here for the English-Czech News Commentary condition) over the raw judgements file, which is available at http://www.statmt.org/wmt07/judgements.gz

    zcat judgements.gz | scripts/extract_segment_rank.perl | grep "WMT07 English-Czech News Commentary" | sort -n | cut -f1,3 > rankings/en-cz.nc-test

When there were multiple judgements for a pair of systems for a single segment, the script took the majority judgement.  The method for collecting these relative rankings of each segment is described in:

@InProceedings{callisonburch-EtAl:2007:WMT,
  author    = {Callison-Burch, Chris  and  Fordyce, Cameron  and  Koehn, Philipp  and  Monz, Christof  and  Schroeder, Josh},
  title     = {(Meta-) Evaluation of Machine Translation},
  booktitle = {Proceedings of the Second Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2007},
  address   = {Prague, Czech Republic},
  publisher = {Association for Computational Linguistics},
  pages     = {136--158},
  url       = {http://www.aclweb.org/anthology/W/W07/W07-0218}
}
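
The majority step mentioned above can be illustrated with a short Python sketch.  This is not the code in scripts/extract_segment_rank.perl; the function name majority_relation is hypothetical, and the treatment of ties (returning None) is an assumption, since this README does not say how the script resolves them.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_relation(relations: Iterable[str]) -> Optional[str]:
    """Return the most frequent relation (e.g. '>' or '=') among several
    judgements of the same system pair on the same segment.

    Returns None when there are no judgements or when the top two
    relations are tied (assumption: the real script may behave differently).
    """
    counts = Counter(relations)
    if not counts:
        return None
    ranked = counts.most_common()  # sorted by count, highest first
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent relations
    return ranked[0][0]


if __name__ == "__main__":
    print(majority_relation([">", ">", "="]))
    print(majority_relation([">", "="]))
```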

The system-rankings/ subdirectory contains system rankings that are based on the total number of times that one system's segments are ranked higher than another's.  Here's how those scores were calculated:

    zcat ~/Downloads/judgements.gz | perl scripts/calculate_system_rank.perl
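
As a rough illustration of the counting idea, the Python sketch below tallies how often each system is ranked higher than another in a list of pairwise judgements.  It is not a reimplementation of scripts/calculate_system_rank.perl, whose exact scoring is not documented here.  The function name count_wins and the (system, relation, system) tuple layout are assumptions, and the sample data is the first segment of the cs-en.nc-test listing shown earlier.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_wins(judgements: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count, per system, the pairwise judgements in which it was ranked higher."""
    wins: Counter = Counter()
    for system_a, relation, system_b in judgements:
        # Register both systems so that systems with no wins still appear.
        wins[system_a] += 0
        wins[system_b] += 0
        if relation == ">":
            wins[system_a] += 1
        elif relation == "<":
            wins[system_b] += 1
        # "=" contributes no win to either system.
    return wins


# Judgements for segment 1 of segment-rankings/cs-en.nc-test.
sample = [
    ("cu", "=", "uedin"),
    ("cu", ">", "pctranslator2007"),
    ("uedin", ">", "pctranslator2007"),
    ("umd", ">", "cu"),
    ("umd", ">", "pctranslator2007"),
    ("umd", ">", "uedin"),
]

if __name__ == "__main__":
    # Sort by number of wins (descending), then by system name.
    for system, n in sorted(count_wins(sample).items(), key=lambda kv: (-kv[1], kv[0])):
        print(system, n)
```

On this one-segment sample, umd has three wins, cu and uedin have one each, and pctranslator2007 has none.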

If you have any questions about any of this data, feel free to contact Chris Callison-Burch (http://cs.jhu.edu).