SIGIR 2020 Tutorial: Searching the Web for Cross-lingual Web Data

While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.

Ahmed El-Kishky¹, Philipp Koehn², Holger Schwenk¹

Facebook AI¹, Johns Hopkins University²

Outline

  1. Preliminaries of mining the web for parallel data
  2. Web crawling and multilingual corpora
  3. Cross-lingual representations
  4. Parallel Document Retrieval
  5. Document Level Parallel Sentence Retrieval
  6. Global Parallel Sentence Retrieval
  7. Parallel Sentence Filtering

Tutorial

[Paper], [Slides]

Code

[FAISS], [LASER], [BiCleaner], [MUSE]

Data

[WikiMatrix]

Publications

Presenters


Ahmed El-Kishky is a Research Scientist at Facebook AI, where he works on developing automated methods for obtaining machine translation training data. Before that, El-Kishky received his PhD from the University of Illinois at Urbana-Champaign, where he was supported by the National Science Foundation Graduate Research Fellowship (NSF-GRF) and the National Defense Science and Engineering Graduate (NDSEG) Fellowship. El-Kishky has published papers and given tutorials at venues such as KDD, VLDB, SIGMOD, WWW, WSDM, ICDM, and BIGDATA.

Philipp Koehn is a professor in the Department of Computer Science at Johns Hopkins University. He is recognized worldwide for his leading research on data-driven methods for long-standing, real-world challenges in machine translation and machine learning, and for their applications. Koehn authored the textbooks Statistical Machine Translation and Neural Machine Translation. He serves on the editorial boards of multiple journals, among them Transactions of the Association for Computational Linguistics; Machine Translation Journal; Artificial Intelligence Review; Computation, Corpora, Cognition; and ACM Transactions on Asian and Low-Resource Language Information Processing. Koehn is president of the ACL Special Interest Group on Machine Translation, which has organized a series of ACL workshops and conferences on machine translation (WMT) since 2005.

Holger Schwenk is a research scientist at Facebook Artificial Intelligence Research Paris. He received his PhD in Computer Science from the University Paris in 1996, and prior to joining Facebook in 2015, he was a professor of computer science at the University of Le Mans, where he led a large group on statistical machine translation. Over his career, Schwenk has authored papers in top machine learning and natural language processing venues such as ACL, NAACL, EMNLP, and NeurIPS.