"the key idea that enabled NMT models to outperform classic phrase-based MT systems"
Xu et al 2015, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
Reading from memory - output memory state most similar to input state
Writing to memory - creating readable states
Fougnie 2008, "The Relationship between Attention and Working Memory"
Weighted linear combination(s) of encoded input
Bahdanau et al 2015, "Neural Machine Translation by Jointly Learning to Align and Translate"
Step 1. Compute a score that measures how relevant each encoder hidden state is to the current decoder hidden state:
Dot product $$ score(\mathbf h_{i-1}^{(d)}, \mathbf h_j^{(e)}) = \mathbf h_{i-1}^{(d)}\cdot \mathbf h_j^{(e)} $$
Weighted dot product $$ score(\mathbf h_{i-1}^{(d)}, \mathbf h_j^{(e)}) = (\mathbf h_{i-1}^{(d)})^T \mathbf W_s \mathbf h_j^{(e)} $$
This gives one number for each encoder state $j$, relating it to the current ($i$th) decoder state.
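As a concrete illustration, a minimal NumPy sketch of the two scoring functions; the names `dot_product_scores`, `bilinear_scores`, `h_dec_prev`, `H_enc`, and `W_s` are assumptions for this example, not from the original.

```python
import numpy as np

def dot_product_scores(h_dec_prev, H_enc):
    """score(h_{i-1}^{(d)}, h_j^{(e)}) = h_{i-1}^{(d)} . h_j^{(e)} for every encoder position j.
    h_dec_prev: previous decoder hidden state, shape (d,)
    H_enc: encoder hidden states stacked as rows, shape (T_enc, d)
    Returns one score per encoder state, shape (T_enc,)."""
    return H_enc @ h_dec_prev

def bilinear_scores(h_dec_prev, H_enc, W_s):
    """score(h_{i-1}^{(d)}, h_j^{(e)}) = (h_{i-1}^{(d)})^T W_s h_j^{(e)} for every j.
    W_s: learned weight matrix, shape (d_dec, d_enc), so the decoder and encoder
    states may have different dimensionalities.
    Returns one score per encoder state, shape (T_enc,)."""
    return H_enc @ (W_s.T @ h_dec_prev)
```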
Step 2. Use a softmax over the scores to normalize over the encoder states
\begin{align} \alpha_{ij} &= softmax(score(\mathbf h_{i-1}^{(d)}, \mathbf h_j^{(e)})) \\ &= \frac{\exp(score(\mathbf h_{i-1}^{(d)}, \mathbf h_j^{(e)}))}{\sum_k\exp(score(\mathbf h_{i-1}^{(d)}, \mathbf h_k^{(e)}))} \end{align}
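A minimal sketch of this normalization step, continuing the naming assumptions above; the maximum score is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def attention_weights(scores):
    """alpha_{ij}: softmax over the encoder positions j.
    scores: shape (T_enc,), one score per encoder state.
    Returns non-negative weights that sum to 1, shape (T_enc,)."""
    exp_scores = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_scores / exp_scores.sum()
```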
Step 3. Context vector = linear combination of encoder state vectors, weighted by normalized scores

$$ \mathbf c_i = \sum_j \alpha_{ij} \mathbf h_j^{(e)} $$

Encoder-decoder network with attention. Computing the value for $\mathbf h_i^{(d)}$ is based on the previous hidden state, the previous word generated, and the current context vector $\mathbf c_i$. This context vector is derived from the attention computation based on comparing the previous decoder hidden state to all of the encoder hidden states.
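Putting the three steps together, a toy sketch of one attention step during decoding, reusing the `dot_product_scores`, `bilinear_scores`, and `attention_weights` helpers sketched above (shapes and names are assumptions for illustration).

```python
import numpy as np

def attention_context(h_dec_prev, H_enc, W_s=None):
    """Compute the context vector c_i for the current decoder step.
    Step 1: score the previous decoder state against every encoder state.
    Step 2: normalize the scores with a softmax to get alpha_{ij}.
    Step 3: form c_i as the alpha-weighted sum of the encoder states."""
    if W_s is None:
        scores = dot_product_scores(h_dec_prev, H_enc)
    else:
        scores = bilinear_scores(h_dec_prev, H_enc, W_s)
    alphas = attention_weights(scores)
    return alphas @ H_enc  # c_i, shape (d_enc,)

# Toy usage: 5 encoder states and a decoder state, all of dimension 8.
rng = np.random.default_rng(0)
H_enc = rng.normal(size=(5, 8))
h_dec_prev = rng.normal(size=8)
c_i = attention_context(h_dec_prev, H_enc)
print(c_i.shape)  # (8,)
```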