Training Data for Transliteration

Since ready-made collections of transliteration examples typically do not exist, significant effort has gone into collecting such data.

Transliteration Training Data is the main subject of 43 publications. 18 are discussed here.

Publications

Training data may be collected from parallel corpora (Lee and Chang, 2003; Lee et al., 2004) or mined from comparable data such as news streams (Klementiev and Roth, 2006; Klementiev and Roth, 2006b). Training data for transliteration may also be obtained from monolingual text in which the transliterated spelling of a foreign name is followed by its native form in parentheses (Lin et al., 2004; Chen and Chen, 2006; Lin et al., 2008); this is common, for instance, for unusual English names in Chinese text. Such acquisition may be improved by bootstrapping, that is, by iteratively extracting high-confidence pairs and improving the matching model (Sherif and Kondrak, 2007). Sproat et al. (2006) mine name transliterations from comparable corpora, also using phonetic correspondences. Tao et al. (2006) additionally exploit the temporal distributions of name mentions, and Yoon et al. (2007) use the Winnow algorithm and a classifier to bootstrap the acquisition process. Cao et al. (2007) use a perceptron classifier with various features, including the prior likelihood that a Chinese character is part of a transliteration. Large monolingual corpus resources such as the web are used for validation (Al-Onaizan and Knight, 2002; Al-Onaizan and Knight, 2002b; Qu and Grefenstette, 2004; Kuo et al., 2006; Yang et al., 2008). Of course, training data may also be created manually, possibly aided by an active learning component that suggests the most valuable new examples (Goldwasser and Roth, 2008).
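The parenthesis pattern mentioned above can be illustrated with a minimal Python sketch. It is not the method of any cited paper: the function name `extract_candidates`, the `max_chars` window size, and the restriction to the basic CJK Unified Ideographs block are all assumptions made for illustration. Because the left boundary of the Chinese name is unknown, the sketch returns every suffix of the preceding character window as a candidate, leaving the choice to a downstream matching model.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as "Chinese".
CJK = r"\u4e00-\u9fff"


def extract_candidates(text, max_chars=6):
    """Find 'CJK(Latin)' patterns and return (candidate_suffixes, latin_name).

    max_chars is a hypothetical window size: how many CJK characters before
    the opening parenthesis may belong to the transliterated name.
    """
    pattern = re.compile(
        r"([%s]{1,%d})\s*[(（]\s*([A-Za-z][A-Za-z .'\-]*[A-Za-z])\s*[)）]"
        % (CJK, max_chars)
    )
    results = []
    for m in pattern.finditer(text):
        window, name = m.group(1), m.group(2)
        # The true name boundary is unknown, so keep every suffix of the window.
        suffixes = [window[i:] for i in range(len(window))]
        results.append((suffixes, name))
    return results


if __name__ == "__main__":
    sample = "美国前总统克林顿(Clinton)昨天访问了北京。"
    for suffixes, name in extract_candidates(sample):
        print(name, suffixes)
```

A bootstrapping approach of the kind described above would then score each candidate suffix against the Latin name, keep only high-confidence pairs, and retrain the scoring model on them.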

Benchmarks

Discussion

New Publications

You, Gae-won and Cha, Young-rok and Kim, Jinhan and Hwang, Seung-won (2013): Enriching Entity Translation Discovery using Selective Temporality, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
@InProceedings{you-EtAl:2013:Short,
author = {You, Gae-won and Cha, Young-rok and Kim, Jinhan and Hwang, Seung-won},
title = {Enriching Entity Translation Discovery using Selective Temporality},
booktitle = {Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
month = {August},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {201--205},
url = {http://www.aclweb.org/anthology/P13-2036},
year = 2013
}
You et al. (2013)
Kunchukuttan, Anoop and Bhattacharyya, Pushpak (2015): Data representation methods and use of mined corpora for Indian language transliteration, Proceedings of the Fifth Named Entity Workshop
@InProceedings{kunchukuttan-bhattacharyya:2015:NEWS2015,
author = {Kunchukuttan, Anoop and Bhattacharyya, Pushpak},
title = {Data representation methods and use of mined corpora for Indian language transliteration},
booktitle = {Proceedings of the Fifth Named Entity Workshop},
month = {July},
address = {Beijing, China},
publisher = {Association for Computational Linguistics},
pages = {78--82},
url = {http://www.aclweb.org/anthology/W15-3912},
year = 2015
}
Kunchukuttan and Bhattacharyya (2015)
Richardson, John and Nakazawa, Toshiaki and Kurohashi, Sadao (2013): Robust Transliteration Mining from Comparable Corpora with Bilingual Topic Models, Proceedings of the Sixth International Joint Conference on Natural Language Processing
@InProceedings{richardson-nakazawa-kurohashi:2013:IJCNLP,
author = {Richardson, John and Nakazawa, Toshiaki and Kurohashi, Sadao},
title = {Robust Transliteration Mining from Comparable Corpora with Bilingual Topic Models},
booktitle = {Proceedings of the Sixth International Joint Conference on Natural Language Processing},
month = {October},
address = {Nagoya, Japan},
publisher = {Asian Federation of Natural Language Processing},
pages = {261--269},
url = {http://www.aclweb.org/anthology/I13-1030},
year = 2013
}
Richardson et al. (2013)
Yufeng Chen and Chengqing Zong and Keh-Yih Su (2013): A Joint Model to Identify and Align Bilingual Named Entities, Computational Linguistics
@Article{CL:2013-2001,
author = {Yufeng Chen and Chengqing Zong and Keh-Yih Su},
title = {A Joint Model to Identify and Align Bilingual Named Entities},
journal = {Computational Linguistics},
volume = {39},
number = {2},
url = {http://aclweb.org/anthology-new/J/J13/J13-2001.pdf},
year = 2013
}
Chen et al. (2013)
El-Kahki, Ali and Darwish, Kareem and Abdul-Wahab, Mohamed and Taei, Ahmed (2012): Transliteration Mining Using Large Training and Test Sets, Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{elkahki-EtAl:2012:NAACL-HLT,
author = {El-Kahki, Ali and Darwish, Kareem and Abdul-Wahab, Mohamed and Taei, Ahmed},
title = {Transliteration Mining Using Large Training and Test Sets},
booktitle = {Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Montr\'{e}al, Canada},
publisher = {Association for Computational Linguistics},
pages = {243--252},
url = {http://www.aclweb.org/anthology/N12-1025},
year = 2012
}
El-Kahki et al. (2012)
Munro, Robert and Manning, Christopher D. (2012): Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text, Proceedings of the 4th Named Entity Workshop (NEWS) 2012
@InProceedings{munro-manning:2012:NEWS2012,
author = {Munro, Robert and Manning, Christopher D.},
title = {Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text},
booktitle = {Proceedings of the 4th Named Entity Workshop (NEWS) 2012},
month = {July},
address = {Jeju, Korea},
publisher = {Association for Computational Linguistics},
pages = {21--29},
url = {http://www.aclweb.org/anthology/W12-4403},
year = 2012
}
Munro and Manning (2012)
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut (2012): A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
@InProceedings{sajjad-fraser-schmid:2012:ACL2012,
author = {Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut},
title = {A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining},
booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
month = {July},
address = {Jeju Island, Korea},
publisher = {Association for Computational Linguistics},
pages = {469--477},
url = {http://www.aclweb.org/anthology/P12-1049},
year = 2012
}
Sajjad et al. (2012)
Walid Aransa and Holger Schwenk and Loic Barrault (2012): Semi-supervised transliteration mining from parallel and comparable corpora, Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)
@inproceedings{iwslt12:Aransa,
author = {Walid Aransa and Holger Schwenk and Loic Barrault},
title = {Semi-supervised transliteration mining from parallel and comparable corpora},
url = {http://www.mt-archive.info/IWSLT-2012-Aransa.pdf},
pages = {185--192},
booktitle = {Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT)},
location = {Hong Kong},
year = 2012
}
Aransa et al. (2012)
Chang, Ming-Wei and Goldwasser, Dan and Roth, Dan and Tu, Yuancheng (2009): Unsupervised Constraint Driven Learning For Transliteration Discovery, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
@InProceedings{chang-EtAl:2009:NAACLHLT09,
author = {Chang, Ming-Wei and Goldwasser, Dan and Roth, Dan and Tu, Yuancheng},
title = {Unsupervised Constraint Driven Learning For Transliteration Discovery},
booktitle = {Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
month = {June},
address = {Boulder, Colorado},
publisher = {Association for Computational Linguistics},
pages = {299--307},
url = {http://www.aclweb.org/anthology/N/N09/N09-1034},
year = 2009
}
Chang et al. (2009)
Yang, Fan and Zhao, Jun and Liu, Kang (2009): A Chinese-English Organization Name Translation System Using Heuristic Web Mining and Asymmetric Alignment, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
@InProceedings{yang-zhao-liu:2009:ACLIJCNLP,
author = {Yang, Fan and Zhao, Jun and Liu, Kang},
title = {A Chinese-English Organization Name Translation System Using Heuristic Web Mining and Asymmetric Alignment},
booktitle = {Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP},
month = {August},
address = {Suntec, Singapore},
publisher = {Association for Computational Linguistics},
pages = {387--395},
url = {http://www.aclweb.org/anthology/P/P09/P09-1044},
year = 2009
}
Yang et al. (2009)
You, Gae-won and Hwang, Seung-won and Song, Young-In and Jiang, Long and Nie, Zaiqing (2010): Mining Name Translations from Entity Graph Mapping, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
@InProceedings{you-EtAl:2010:EMNLP,
author = {You, Gae-won and Hwang, Seung-won and Song, Young-In and Jiang, Long and Nie, Zaiqing},
title = {Mining Name Translations from Entity Graph Mapping},
booktitle = {Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing},
month = {October},
address = {Cambridge, MA},
publisher = {Association for Computational Linguistics},
pages = {430--439},
url = {http://www.aclweb.org/anthology/D/D10/D10-1042},
year = 2010
}
You et al. (2010)
Ji, Heng (2009): Mining Name Translations from Comparable Corpora by Creating Bilingual Information Networks, Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora
@InProceedings{ji:2009:BUCC,
author = {Ji, Heng},
title = {Mining Name Translations from Comparable Corpora by Creating Bilingual Information Networks},
booktitle = {Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora},
month = {August},
address = {Singapore},
publisher = {Association for Computational Linguistics},
pages = {34--37},
url = {http://www.aclweb.org/anthology/W/W09/W09-3107},
year = 2009
}
Ji (2009)
Chen, Yufeng and Zong, Chengqing and Su, Keh-Yih (2010): On Jointly Recognizing and Aligning Bilingual Named Entities, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
@InProceedings{chen-zong-su:2010:ACL,
author = {Chen, Yufeng and Zong, Chengqing and Su, Keh-Yih},
title = {On Jointly Recognizing and Aligning Bilingual Named Entities},
booktitle = {Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {631--639},
url = {http://www.aclweb.org/anthology/P10-1065},
year = 2010
}
Chen et al. (2010)
Udupa, Raghavendra and Saravanan, K and Kumaran, A and Jagarlamudi, Jagadeesh (2009): MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora, Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
@InProceedings{udupa-EtAl:2009:EACL,
author = {Udupa, Raghavendra and Saravanan, K and Kumaran, A and Jagarlamudi, Jagadeesh},
title = {{MINT}: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora},
booktitle = {Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)},
month = {March},
address = {Athens, Greece},
publisher = {Association for Computational Linguistics},
pages = {799--807},
url = {http://www.aclweb.org/anthology/E09-1091},
year = 2009
}
Udupa et al. (2009)
Kumaran, A and Khapra, Mitesh M. and Li, Haizhou (2010): Report of NEWS 2010 Transliteration Mining Shared Task, Proceedings of the 2010 Named Entities Workshop
@InProceedings{kumaran-mkhapra-li:2010:NEWS1,
author = {Kumaran, A and Khapra, Mitesh M. and Li, Haizhou},
title = {Report of {NEWS} 2010 Transliteration Mining Shared Task},
booktitle = {Proceedings of the 2010 Named Entities Workshop},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {21--28},
url = {http://www.aclweb.org/anthology/W10-2403},
year = 2010
}
Kumaran et al. (2010)
Kumaran, A and Khapra, Mitesh M. and Li, Haizhou (2010): Whitepaper of NEWS 2010 Shared Task on Transliteration Mining, Proceedings of the 2010 Named Entities Workshop
@InProceedings{kumaran-mkhapra-li:2010:NEWS2,
author = {Kumaran, A and Khapra, Mitesh M. and Li, Haizhou},
title = {Whitepaper of {NEWS} 2010 Shared Task on Transliteration Mining},
booktitle = {Proceedings of the 2010 Named Entities Workshop},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {29--38},
url = {http://www.aclweb.org/anthology/W10-2404},
year = 2010
}
Kumaran et al. (2010)
Li, Haizhou and Kumaran, A and Zhang, Min and Pervouchine, Vladimir (2010): Report of NEWS 2010 Transliteration Generation Shared Task, Proceedings of the 2010 Named Entities Workshop
@InProceedings{li-EtAl:2010:NEWS1,
author = {Li, Haizhou and Kumaran, A and Zhang, Min and Pervouchine, Vladimir},
title = {Report of {NEWS} 2010 Transliteration Generation Shared Task},
booktitle = {Proceedings of the 2010 Named Entities Workshop},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {1--11},
url = {http://www.aclweb.org/anthology/W10-2401},
year = 2010
}
Li et al. (2010)
Li, Haizhou and Kumaran, A and Zhang, Min and Pervouchine, Vladimir (2010): Whitepaper of NEWS 2010 Shared Task on Transliteration Generation, Proceedings of the 2010 Named Entities Workshop
@InProceedings{li-EtAl:2010:NEWS2,
author = {Li, Haizhou and Kumaran, A and Zhang, Min and Pervouchine, Vladimir},
title = {Whitepaper of {NEWS} 2010 Shared Task on Transliteration Generation},
booktitle = {Proceedings of the 2010 Named Entities Workshop},
month = {July},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {12--20},
url = {http://www.aclweb.org/anthology/W10-2402},
year = 2010
}
Li et al. (2010)
Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut (2011): An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
@InProceedings{sajjad-fraser-schmid:2011:ACL-HLT2011,
author = {Sajjad, Hassan and Fraser, Alexander and Schmid, Helmut},
title = {An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {430--439},
url = {http://www.aclweb.org/anthology/P11-1044},
year = 2011
}
Sajjad et al. (2011)
El Kahki, Ali and Darwish, Kareem and Saad El Din, Ahmed and Abd El-Wahab, Mohamed and Hefny, Ahmed and Ammar, Waleed (2011): Improved Transliteration Mining Using Graph Reinforcement, Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
@InProceedings{elkahki-EtAl:2011:EMNLP,
author = {El Kahki, Ali and Darwish, Kareem and Saad El Din, Ahmed and Abd El-Wahab, Mohamed and Hefny, Ahmed and Ammar, Waleed},
title = {Improved Transliteration Mining Using Graph Reinforcement},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
month = {July},
address = {Edinburgh, Scotland, UK},
publisher = {Association for Computational Linguistics},
pages = {1384--1393},
url = {http://www.aclweb.org/anthology/D11-1128},
year = 2011
}
El Kahki et al. (2011)
Freeman, Andrew and Condon, Sherri and Ackerman, Christopher (2006): Cross Linguistic Name Matching in English and Arabic, Proceedings of the Human Language Technology Conference of the NAACL, Main Conference
@InProceedings{freeman-condon-ackerman:2006:HLT-NAACL06-Main,
author = {Freeman, Andrew and Condon, Sherri and Ackerman, Christopher},
title = {Cross Linguistic Name Matching in {English} and {Arabic}},
booktitle = {Proceedings of the Human Language Technology Conference of the NAACL, Main Conference},
month = {June},
address = {New York City, USA},
publisher = {Association for Computational Linguistics},
pages = {471--478},
url = {http://www.aclweb.org/anthology/N/N06/N06-1060},
year = 2006
}
Freeman et al. (2006)
Wu, Jian-Cheng and Chang, Jason S. (2007): Learning to Find English to Chinese Transliterations on the Web, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
@InProceedings{wu-chang:2007:EMNLP-CoNLL2007,
author = {Wu, Jian-Cheng and Chang, Jason S.},
title = {Learning to Find {E}nglish to {C}hinese Transliterations on the Web},
booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
pages = {996--1004},
url = {http://www.aclweb.org/anthology/D/D07/D07-1106},
year = 2007
}
Wu and Chang (2007)
Jong-Hoon Oh and Hitoshi Isahara (2008): Hypothesis Selection in Machine Transliteration: A Web Mining Approach, Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)
@inproceedings{Oh:2008:IJCNLP,
author = {Jong-Hoon Oh and Hitoshi Isahara},
title = {Hypothesis Selection in Machine Transliteration: A Web Mining Approach},
url = {http://www.mt-archive.info/IJCNLP-2008-Oh.pdf},
googlescholar = {15247339783523356815},
booktitle = {Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP)},
year = 2008
}
Oh and Isahara (2008)
Chengguo Jin and Seung-Hoon Na and Dong-Il Kim and Jong-Hyeok Lee (2008): Automatic Extraction of English-Chinese Transliteration Pairs using Dynamic Window and Tokenizer, Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing
@inproceedings{Jin:2008:IJCNLP,
author = {Chengguo Jin and Seung-Hoon Na and Dong-Il Kim and Jong-Hyeok Lee},
title = {Automatic Extraction of {E}nglish-{C}hinese Transliteration Pairs using Dynamic Window and Tokenizer},
url = {http://oldsite.aclweb.org/anthology-new/I/I08/I08-4002.pdf},
googlescholar = {14103457912353076560},
booktitle = {Proceedings of the Sixth SIGHAN Workshop on {Chinese} Language Processing},
year = 2008
}
Jin et al. (2008)
Jin-Shea Kuo and Haizhou Li and Chih-Lung Lin (2008): Mining Transliterations from Web Query Results: An Incremental Approach, Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing
@inproceedings{Kuo2:2008:IJCNLP,
author = {Jin-Shea Kuo and Haizhou Li and Chih-Lung Lin},
title = {Mining Transliterations from Web Query Results: An Incremental Approach},
url = {http://www.mt-archive.info/IJCNLP-2008-Kuo-2.pdf},
googlescholar = {14247836374749958932},
booktitle = {Proceedings of the Sixth SIGHAN Workshop on {Chinese} Language Processing},
year = 2008
}
Kuo et al. (2008)

MT Research Survey Wiki

A Comprehensive Survey of Neural and Statistical Machine Translation Research Publications
