
LSTM Encoder-Decoder Translator with TensorFlow

Sequence-to-sequence learning with an LSTM encoder-decoder, adapted from Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.

Project Overview

Sequence to Sequence Learning with Keras (Beta) – Authored by Hayson Cheung (hayson.cheung@mail.utoronto.ca).
Adapted from the seminal work of Ilya Sutskever, Oriol Vinyals, and Quoc V. Le in their paper Sequence to Sequence Learning with Neural Networks (NIPS 2014).

Overview of the LSTM Encoder-Decoder

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) designed to handle sequential data. Unlike traditional RNNs, LSTMs mitigate the vanishing gradient problem by using gating mechanisms that control the flow of information.
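As a concrete starting point, here is a minimal Keras sketch that runs a single LSTM layer over a batch of dummy sequences and exposes both the per-step outputs and the final hidden and cell states. The dimensions (batch of 2, 5 timesteps, 8 features, 16 units) are illustrative placeholders, not the project's actual settings.

```python
# Minimal sketch: one Keras LSTM layer over a batch of dummy sequences.
import numpy as np
import tensorflow as tf

x = np.random.rand(2, 5, 8).astype("float32")   # (batch, timesteps, features)
lstm = tf.keras.layers.LSTM(16, return_sequences=True, return_state=True)
outputs, h_T, C_T = lstm(x)                      # per-step outputs plus final states

print(outputs.shape)  # (2, 5, 16) -- hidden state at every timestep
print(h_T.shape)      # (2, 16)    -- final hidden state
print(C_T.shape)      # (2, 16)    -- final cell state
```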

Encoders and Decoders in LSTM

Below is a figure of the LSTM Encoder-Decoder model. The encoder processes the input sequence and produces a context vector, which the decoder uses, together with its own hidden state and the previously generated token, to generate the output sequence.

Figure: LSTM Encoder-Decoder architecture for sequence-to-sequence learning.
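The Keras wiring for this architecture looks roughly like the sketch below, which mirrors the standard Keras seq2seq recipe with one-hot token inputs. The vocabulary sizes and latent dimension are placeholder values, not the project's actual configuration.

```python
# Sketch of the encoder-decoder wiring shown in the figure (training-time model,
# using teacher forcing: the decoder sees the ground-truth previous token).
import tensorflow as tf

num_encoder_tokens, num_decoder_tokens, latent_dim = 1000, 1000, 256

# Encoder: consume the source sequence, keep only the final states (the context).
encoder_inputs = tf.keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = tf.keras.layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: generate the target sequence, initialised with the encoder's states.
decoder_inputs = tf.keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = tf.keras.layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```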

Encoding Process

In an LSTM Encoder-Decoder framework, the encoder processes the input sequence $$ X = (x_1, x_2, ..., x_T) $$ and compresses it into a fixed-size latent vector $$ h_T $$, also known as the **context vector**. This latent space representation captures the semantics of the entire sequence.

The LSTM cell is governed by the following equations:

\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
\[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
\[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ h_t = o_t \odot \tanh(C_t) \]

Here, $$ f_t, i_t, o_t $$ represent the forget, input, and output gates, while $$ C_t $$ is the cell state that stores long-term dependencies. The hidden state $$ h_T $$ at the final timestep encodes the full input sequence.
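For readers who prefer code to notation, the following NumPy sketch transcribes these gate equations for a single timestep. The weight matrices `W`, `U` and biases `b` are assumed to be supplied by the caller (e.g. from a trained model); they are not values from this project.

```python
# Direct NumPy transcription of the LSTM gate equations for one timestep.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by 'f', 'i', 'c', 'o'."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    C_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                            # new cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    h_t = o_t * np.tanh(C_t)                                      # new hidden state
    return h_t, C_t
```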

Latent Space Representation

The hidden state $$ h_T $$ and cell state $$ C_T $$ together form a **latent representation** of the input sequence. This latent space is crucial because it acts as a bottleneck that forces the model to learn a compact, meaningful representation of the input before generating output.

Decoding Process

The **decoder** receives the latent representation and generates an output sequence $$ Y = (y_1, y_2, ..., y_T') $$. At each timestep, it takes the previous output token and hidden state as input:

\[ s_t = \text{LSTM}(y_{t-1}, s_{t-1}) \]
\[ p(y_t \mid y_1, ..., y_{t-1}, X) = \text{softmax}(W_s s_t + b_s) \]

This probability distribution is used to select the next token in the output sequence.
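A common way to turn this distribution into text is greedy decoding: at each step, pick the highest-probability token and feed it back in as the next decoder input. The sketch below follows the standard Keras character-level seq2seq recipe; `encoder_model`, `decoder_model`, the token index dictionaries, and the tab/newline start and end markers are assumptions borrowed from that recipe, not details of this project.

```python
# Greedy decoding sketch: argmax over the softmax at each step, feeding the
# predicted token back in until the end-of-sequence marker (or a length cap).
import numpy as np

def decode_sequence(input_seq, encoder_model, decoder_model,
                    target_token_index, reverse_target_index,
                    num_decoder_tokens, max_decoder_seq_length):
    states = encoder_model.predict(input_seq)        # [h_T, C_T] from the encoder

    # Start with the start-of-sequence token ("\t" here), one-hot encoded.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index["\t"]] = 1.0

    decoded = []
    for _ in range(max_decoder_seq_length):
        output_tokens, h, c = decoder_model.predict([target_seq] + states)
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))   # greedy choice
        sampled_char = reverse_target_index[sampled_index]
        if sampled_char == "\n":                                   # end-of-sequence
            break
        decoded.append(sampled_char)

        # Feed the sampled token back in and carry the states forward.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_index] = 1.0
        states = [h, c]

    return "".join(decoded)
```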

Limitations & Future Improvements

While this approach is effective, using a single context vector $$ h_T $$ to encode the entire sequence can be a bottleneck. To improve performance, modern architectures introduce **attention mechanisms**, which allow the decoder to focus on different parts of the input sequence dynamically.

Further Improvements: Attention Mechanisms

One key limitation of the basic sequence-to-sequence model is that it relies on a fixed-dimensional context vector to summarize the entire input sequence. This can be problematic for long input sequences, as the context vector may not capture all relevant information.

To address this, attention mechanisms were introduced in later works, allowing the decoder to focus on different parts of the input sequence at each time step. The Bahdanau Attention (Neural Machine Translation by Jointly Learning to Align and Translate) model proposed a soft alignment method where the decoder dynamically attends to different encoder hidden states.

This idea was further extended in Luong Attention (Effective Approaches to Attention-based Neural Machine Translation), which introduced different types of attention mechanisms. Attention has since become a foundational concept in deep learning for NLP, leading to the development of Transformer models.
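As a rough illustration of the idea (not part of the original paper or this project's code), Keras ships `AdditiveAttention` and `Attention` layers implementing Bahdanau-style and Luong-style (dot-product) scoring. The sketch below bolts additive attention onto the encoder-decoder from earlier, with the same placeholder dimensions; each decoder timestep now gets its own context vector computed over all encoder outputs instead of a single fixed one.

```python
# Hedged sketch: Bahdanau-style (additive) attention on top of the LSTM
# encoder-decoder. The encoder now returns all of its per-step outputs so the
# decoder can attend over them at every timestep.
import tensorflow as tf

latent_dim, num_encoder_tokens, num_decoder_tokens = 256, 1000, 1000

encoder_inputs = tf.keras.Input(shape=(None, num_encoder_tokens))
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True, return_state=True)(encoder_inputs)

decoder_inputs = tf.keras.Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True, return_state=True)(
        decoder_inputs, initial_state=[state_h, state_c])

# One context vector per decoder timestep, scored against all encoder timesteps.
context = tf.keras.layers.AdditiveAttention()([decoder_outputs, encoder_outputs])
concat = tf.keras.layers.Concatenate()([decoder_outputs, context])
outputs = tf.keras.layers.Dense(num_decoder_tokens, activation="softmax")(concat)

model = tf.keras.Model([encoder_inputs, decoder_inputs], outputs)
```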

Suggested Reading