Justin Selig

A Curriculum in Neural Machine Translation

Everything you need to know about NMT as of May 2019.

2019 is the year for language

As of writing this post (May 2019), convolutional neural networks (CNNs) remain the state of the art in image recognition and object detection, while Generative Adversarial Networks (GANs) lead in image generation. Most landmark work in deep learning produced up until last year focused on image datasets, and we did quite well. Computers have been able to outperform humans on benchmark image recognition tasks since the mid-2010s. However, only recently have we approached human-quality performance for tasks involving language (sequence) translation. For these tasks, encoder-decoder architectures with attention mechanisms, such as Transformer-style networks, are now state of the art.

There are a lot of papers out there describing experiments using attention mechanisms. Much as they did with variants of gradient descent, people have found new and absurdly complex ways to eke out more performance and accuracy from their models by using increasingly elaborate mathematical machinery. However, the most salient attention mechanisms originate from a handful of landmark papers. I’ve done my best to remove the noise and list the essential material below.

Here it is…

Below are my favorite primary and secondary sources on NMT, separated by topic. I’ve outlined the material so that someone could follow this curriculum starting from the basics, so the topics are meant to be read in order:


  1. Word Embeddings
  2. Language Modeling
  3. Long Short-Term Memory (LSTM)


Sequence-to-Sequence Networks for Neural Machine Translation (NMT):

  1. Seq2seq
  2. Reference Code & Walkthrough

Tl;dr: Encoder-decoder networks of LSTMs perform fairly well in language translation.

Sequence-to-Sequence + Attention:

  1. Luong et al.
  2. Bahdanau et al.

Tl;dr: LSTM encoder-decoders struggle with long-range dependencies because the whole source sentence must be squeezed into a single fixed-length vector. Attention fixes this by allowing the model to pay attention to the relevant parts of the full source sequence when translating individual words.
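To make the idea concrete, here is a minimal sketch of dot-product attention for a single decoder step, in plain Python. It is an illustration only, not the code from either paper: the function names (`attend`, `softmax`, `dot`) and the toy vectors are my own assumptions, and real models add learned projections (Bahdanau et al. use a small feed-forward scoring network rather than a plain dot product).

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    decoder_state: list of floats (hypothetical current decoder hidden state)
    encoder_states: list of equally sized float lists, one per source word
    """
    # Score every source position against the current decoder state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    # Normalize the scores into a probability distribution over source words.
    weights = softmax(scores)
    # The context vector is the attention-weighted average of encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy example: three source positions, two-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

In this toy example the second and third source positions line up with the decoder state, so they receive most of the weight, and the context vector is pulled toward their encoder states. The decoder would then use that context vector, alongside its own state, to predict the next target word.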

Google NMT:

  1. GNMT
  2. MLPerf Reference Code

Tl;dr: Google was able to reach state-of-the-art results using an encoder-decoder network of stacked LSTMs with residual connections, a bidirectional first encoder layer, and an attention layer connecting the encoder to the decoder.

Transformer:

  1. Attention is All You Need (AIAYN)
  2. MLPerf Reference Code

Tl;dr: The Transformer drops LSTMs entirely but keeps the encoder-decoder structure, building both halves from stacked attention layers with dense (feed-forward) layers in between. One notable property of the Transformer is its suitability for GPUs, since the most compute-intensive parts reduce to GEMM (general matrix multiply) operations.
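The GEMM point is easiest to see in the scaled dot-product attention defined in AIAYN, softmax(QKᵀ / √d_k)·V. The sketch below writes it out in plain Python with a hand-rolled `matmul` so the two matrix multiplies are explicit. The helper names and toy matrices are assumptions for illustration; a real implementation would use a GPU-backed matrix library and add multiple heads, masking, and learned projections, all of which are left out here.

```python
import math

def matmul(a, b):
    # Naive GEMM: (n x k) times (k x m) -> (n x m).
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def softmax_rows(m):
    # Apply a numerically stable softmax to each row independently.
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(q, k, v):
    """q: (n_queries x d_k), k: (n_keys x d_k), v: (n_keys x d_v)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))              # first GEMM: Q times K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                # one distribution per query
    return matmul(weights, v)                     # second GEMM: weights times V

# Toy example: two queries attending over three key/value pairs.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = scaled_dot_product_attention(q, k, v)
```

Every query is handled in the same pair of matrix multiplies, with no step-by-step recurrence as in an LSTM, which is why the computation maps so well onto GPU hardware.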

The Evolved Transformer and Beyond:

  1. Transformer XL
  2. The Evolved Transformer
  3. BERT


Some more really great resources:

Thanks for reading.

