A Curriculum in Neural Machine Translation

Everything you need to know about NMT as of May 2019.

2019 is the year for language

As of writing this post (May 2019), convolutional neural networks (CNNs) remain the state of the art in tasks related to image recognition and object detection. Most landmark work in deep learning produced up until last year focused on image datasets, and we did quite well: computers have matched or out-performed humans on benchmark image recognition tasks since the mid-2010s. Only recently, however, have we approached a similar level of human-quality performance for tasks involving language (sequence) translation. For these tasks, encoder-decoder architectures with attention mechanisms, such as Transformer-style networks, are now state of the art.

There are a lot of papers out there describing experiments using attention mechanisms. Much like the many variants on gradient descent, people have found new and absurdly complex ways to eke out more performance and accuracy from their models by using more and more complex mathematical models. However, the most salient attention mechanisms originate from a handful of landmark papers. I’ve done my best to remove the noise and list the essential material below.

Here it is…

Below are my favorite primary and secondary sources on NMT separated by topic. I’ve outlined the material such that someone could follow this curriculum starting from the basics. These are necessarily in order:

Fundamentals:

Tl;dr:

Each word in a vocabulary may be represented as a vector.
Before deep learning, language modeling relied on methods such as Markov models.
Language models can operate at the character level, n-gram level, sentence level, or even paragraph level.
LSTMs are a building block of recurrent neural networks, which do well on sequential data such as time series.
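
To make the Markov-model point concrete, here is a minimal sketch of a word-level bigram language model in plain Python. The toy corpus, the `<s>`/`</s>` boundary markers, and the function name `train_bigram_model` are my own illustrative choices, not something taken from the sources in this curriculum.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count word-to-next-word transitions and normalize them into probabilities."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Hypothetical boundary markers so the model also learns sentence starts and ends.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, next_counts in counts.items():
        total = sum(next_counts.values())
        model[prev] = {word: n / total for word, n in next_counts.items()}
    return model

# Toy corpus, for illustration only.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
print(model["the"])  # P(cat | the) = 2/3, P(dog | the) = 1/3
```

Each word's next-word distribution depends only on the current word, which is the Markov assumption. Neural language models drop that restriction by carrying a learned state across the whole sequence.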

Sequence-to-Sequence Networks for Neural Machine Translation (NMT):

Tl;dr: Encoder-decoder networks of LSTMs perform fairly well in language translation.

Sequence-to-Sequence + Attention:

Tl;dr: LSTMs are not good at encoding long-range dependencies. Attention mitigates this by letting the model focus on the relevant parts of the full source sequence when translating each individual word.
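
As a rough illustration of the idea, here is a sketch of a single attention step in plain Python. I am assuming simple dot-product scoring, which is only one of several scoring functions used in the literature, and the vectors are made-up toy values rather than real encoder or decoder states.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step using dot-product scoring."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of the encoder states.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy values: three source positions, two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 5.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

The decoder state here lines up with the second and third source positions, so those two receive nearly all of the weight. The decoder gets a fresh context vector like this at every output step instead of relying on one fixed summary of the source sentence.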

Google NMT:

Tl;dr: Google was able to reach state-of-the-art results using an encoder-decoder network of stacked LSTMs with residual connections, a bidirectional layer at the bottom of the encoder, and an attention layer in between.
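
The residual connections are the easiest piece to show in code. Below is a minimal sketch in which plain Python functions stand in for LSTM layers; the layer functions and the name `stack_with_residuals` are hypothetical placeholders, not Google's implementation.

```python
def stack_with_residuals(x, layers):
    """Apply each layer and add its input back to its output (a residual connection)."""
    for layer in layers:
        out = layer(x)
        x = [o + xi for o, xi in zip(out, x)]
    return x

# Placeholder "layers": simple functions standing in for LSTM layers.
def double(v):
    return [2.0 * a for a in v]

def zero(v):
    return [0.0 for _ in v]

print(stack_with_residuals([1.0, -2.0], [zero, zero]))  # [1.0, -2.0]: the input passes through
print(stack_with_residuals([1.0, -2.0], [double]))      # [3.0, -6.0]: 2x + x
```

Because every layer only has to learn a correction on top of its input, signals and gradients have a direct path through a deep stack, which makes deep stacks easier to train.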

Transformer:

Tl;dr: The Transformer is a network that drops LSTMs entirely but keeps the encoder-decoder structure, built solely from stacked attention layers with dense (feed-forward) layers in between. One notable element of the Transformer is its suitability for GPUs, since the most compute-intensive parts have been packed into GEMM (general matrix multiply) operations.
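
For reference, the scaled dot-product attention at the core of the Transformer, as defined in "Attention Is All You Need" (Vaswani et al., 2017), can be written as below. The symbols are that paper's, not anything introduced earlier in this post: Q, K, and V are the query, key, and value matrices, and d_k is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

Both the product of Q with the transpose of K and the product of the softmax output with V are matrix-matrix multiplications, which is where the GEMM-friendliness comes from.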

The Evolved Transformer and Beyond:

Tl;dr:

Deeper self-attention mechanisms have been developed.
Using evolutionary search algorithms, So, Liang, and Le discovered a Transformer-style architecture (the Evolved Transformer) that matched previous Transformer results with less overhead.
BERT is a massive Transformer-style architecture made by Google to work really well on their TPUs.
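
To give a feel for what evolutionary search means here, below is a toy sketch of tournament-selection evolution in plain Python. It is not the procedure from the Evolved Transformer paper: the bit-string genome is a hypothetical stand-in for a set of architecture choices, and `fitness` stands in for the validation score of a trained model.

```python
import random

def evolve(fitness, genome_length, population_size=20, generations=50,
           tournament_size=3, seed=0):
    """Toy tournament-selection search over bit-string genomes."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Sample a small tournament of distinct individuals.
        idx = rng.sample(range(population_size), tournament_size)
        best = max(idx, key=lambda i: fitness(population[i]))
        worst = min(idx, key=lambda i: fitness(population[i]))
        # Mutate a copy of the tournament winner by flipping one bit.
        child = list(population[best])
        pos = rng.randrange(genome_length)
        child[pos] = 1 - child[pos]
        # The child replaces the weakest individual in the tournament.
        population[worst] = child
    return max(population, key=fitness)

# Hypothetical fitness: the number of 1 bits in the genome.
best = evolve(fitness=sum, genome_length=8, generations=200)
print(best, sum(best))
```

In real architecture search each fitness evaluation means training and evaluating a model, so nearly all of the cost sits in that one function call.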

Some more really great resources:

Thanks for reading.

-Justin

May 31, 2019 · blog