Efforts to move neural machine translation towards models based on linguistic insight include adding linguistic annotation at the word level or modelling syntactic or semantic structure.
Linguistic Annotation is the main subject of 19 publications.
Wu et al. (2012) propose to use factored representations of words (using lemma, stem, and part of speech), with each factor encoded in a one-hot vector, in the input to a recurrent neural network language model. Sennrich and Haddow (2016) use such representations in the input and output of neural machine translation models, demonstrating better translation quality.
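A factored input of this kind can be pictured as one one-hot vector per factor, concatenated into a single input vector. The following is only a minimal sketch under stated assumptions: the toy vocabularies, the `<unk>` fallback, and the function names are hypothetical and are not taken from the cited papers.

```python
# Hypothetical toy vocabularies, one per factor (illustrative only).
FACTOR_VOCABS = {
    "lemma": ["<unk>", "house", "run"],
    "stem": ["<unk>", "hous", "run"],
    "pos": ["<unk>", "NOUN", "VERB"],
}


def one_hot(value, vocab):
    """Return a one-hot list for `value`; unknown values map to index 0."""
    index = vocab.index(value) if value in vocab else 0
    return [1.0 if i == index else 0.0 for i in range(len(vocab))]


def encode_factored_word(factors):
    """Concatenate the one-hot vectors of all factors of one word."""
    vector = []
    for name, vocab in FACTOR_VOCABS.items():
        vector.extend(one_hot(factors.get(name, "<unk>"), vocab))
    return vector


# Example: the word form "houses" described by its three factors.
word = {"lemma": "house", "stem": "hous", "pos": "NOUN"}
vector = encode_factored_word(word)
```

Here `vector` has one active position per factor. In a neural model, each factor would more typically be mapped to its own embedding, with the embeddings concatenated in the same way.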
Huck et al. (2017) segment words based on morphological principles: separating prefixes and suffixes and splitting compounds, showing superior performance compared to the data-driven byte-pair encoding. Burlot et al. (2017) also detach morphemes from the lemma, but replace them with tags that indicate their morphological features. Tamchyna et al. (2017) use the same method for Czech, but with deterministic tags, avoiding a disambiguation post-editing step.
Nadejde et al. (2017) add syntactic CCG tags to each output word, thus encouraging the model to also produce proper syntactic structure alongside a fluent sequence of words.
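One way to attach a tag to each output word is to interleave tags and words in the target sequence at training time and strip the tags again after decoding. The sketch below assumes this interleaving scheme for illustration; the `CCG:` marker, the function names, and the example tags are hypothetical.

```python
TAG_PREFIX = "CCG:"  # hypothetical marker that distinguishes tags from words


def interleave_tags(words, tags):
    """Build a target sequence in which each word is preceded by its tag."""
    if len(words) != len(tags):
        raise ValueError("need exactly one tag per word")
    sequence = []
    for word, tag in zip(words, tags):
        sequence.append(TAG_PREFIX + tag)
        sequence.append(word)
    return sequence


def strip_tags(sequence):
    """Recover the plain word sequence from a decoded, tagged output."""
    return [token for token in sequence if not token.startswith(TAG_PREFIX)]


target = interleave_tags(["She", "sleeps"], ["NP", "S\\NP"])
plain = strip_tags(target)
```

With this layout the decoder has to commit to a syntactic category before emitting each word, while the final translation is obtained by simply discarding the tag tokens.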
Pu et al. (2017)
first train a word sense disambiguation model based WordNet senses, based on their sense description and then use it to augment the input sentence with sense tags. Gonzales et al. (2017)
also perform word sense disambiguation and enrich the input with sense embeddings and semantically related words from previous input text.
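Augmenting the input with sense tags can be sketched as follows. The disambiguator here is a simplified Lesk-style word-overlap heuristic, swapped in as a stand-in and not the trained model from the cited work; the tiny sense inventory, the sense labels, and the `word|sense` output format are all hypothetical.

```python
# Hypothetical toy sense inventory: word -> {sense label: sense description}.
SENSES = {
    "bank": {
        "bank%finance": "institution that accepts deposits and lends money",
        "bank%river": "sloping land beside a river or lake",
    },
}


def disambiguate(word, context):
    """Pick the sense whose description shares the most words with the context."""
    candidates = SENSES.get(word)
    if not candidates:
        return None
    context_words = set(context) - {word}

    def overlap(item):
        _, description = item
        return len(context_words & set(description.split()))

    best_sense, _ = max(candidates.items(), key=overlap)
    return best_sense


def add_sense_tags(tokens):
    """Append a sense tag to every token that has an entry in the inventory."""
    context = [token.lower() for token in tokens]
    tagged = []
    for token in tokens:
        sense = disambiguate(token.lower(), context)
        tagged.append(f"{token}|{sense}" if sense else token)
    return tagged


tagged = add_sense_tags("he sat on the bank of the river".split())
```

The tagged tokens would then replace the plain tokens in the source sentence, so that the translation model sees different input symbols for different senses of the same word.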
- Hashimoto and Tsuruoka (2017)
- Bastings et al. (2017)
- Eriguchi et al. (2016)
- Martínez et al. (2016)
- Zhang et al. (2016)
- Yamagishi et al. (2016)
- Chen et al. (2017)
- Li et al. (2017)
- Wu et al. (2017)
- Aharoni and Goldberg (2017)
- Eriguchi et al. (2017)