Project Overview
Sequence to Sequence Learning with Keras (Beta) – Authored by Hayson Cheung (hayson.cheung@mail.utoronto.ca).
Adapted from the seminal work of Ilya Sutskever, Oriol Vinyals, and Quoc V. Le in their paper
Sequence to Sequence Learning with Neural Networks (NIPS 2014).
Overview: The LSTM Encoder-Decoder
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) designed to handle sequential data. Unlike traditional RNNs, LSTMs mitigate the vanishing gradient problem by using gating mechanisms that control the flow of information.
Encoders and Decoders in LSTMs
Below is a figure of the LSTM Encoder-Decoder model. The encoder processes the input sequence and generates a context vector, which is then fed into the decoder. The decoder generates the output sequence based on the encoder's context vector, its own hidden state, and the previously generated output.

LSTM Encoder-Decoder architecture for sequence-to-sequence learning.
Encoding Process
In an LSTM Encoder-Decoder framework, the encoder processes the input sequence $$ X = (x_1, x_2, ..., x_T) $$ and compresses it into a fixed-size latent vector $$ h_T $$, also known as the **context vector**. This latent space representation captures the semantics of the entire sequence.
The LSTM cell is governed by the following equations:
\[ f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
\[ i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
\[ o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ h_t = o_t \odot \tanh(C_t) \]

Here, $$ f_t, i_t, o_t $$ represent the forget, input, and output gates, while $$ C_t $$ is the cell state that stores long-term dependencies. The hidden state $$ h_T $$ at the final timestep encodes the full input sequence.
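To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM cell step. The weight matrices `W_*`, `U_*` and biases `b_*` are illustrative placeholders, not the parameters Keras actually learns internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM cell update following the equations above.

    params: dict of illustrative matrices W_* (hidden x input),
    U_* (hidden x hidden), and bias vectors b_* (hidden,).
    """
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev + params["b_c"])  # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                                               # new cell state
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                                                         # new hidden state
    return h_t, C_t
```

Iterating `lstm_step` over $$ x_1, ..., x_T $$ and keeping the final $$ (h_T, C_T) $$ is exactly what the encoder does.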
Latent Space Representation
The hidden state $$ h_T $$ and cell state $$ C_T $$ together form a **latent representation** of the input sequence. This latent space is crucial because it acts as a bottleneck that forces the model to learn a compact, meaningful representation of the input before generating output.
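As a sketch of how this latent representation is obtained in Keras (using hypothetical names such as `latent_dim` and `num_encoder_tokens` for illustration), `return_state=True` exposes the final hidden and cell states:

```python
from tensorflow.keras.layers import Input, LSTM

latent_dim = 256           # size of the latent representation (illustrative)
num_encoder_tokens = 1000  # input vocabulary size (illustrative)

# Encoder: reads the one-hot input sequence and keeps only its final states.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)  # h_T and C_T
encoder_states = [state_h, state_c]                 # the latent representation
```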
Decoding Process
The **decoder** receives the latent representation and generates an output sequence $$ Y = (y_1, y_2, ..., y_{T'}) $$. At each timestep, it takes the previous output token and hidden state as input:
\[ s_t = \text{LSTM}(y_{t-1}, s_{t-1}) \]
\[ p(y_t | y_1, ..., y_{t-1}, X) = \text{softmax}(W_s s_t + b_s) \]

This probability distribution is used to select the next token in the output sequence.
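Continuing the encoder sketch above (and reusing its `latent_dim`, `encoder_inputs`, and `encoder_states`), a minimal decoder can be wired up as follows; `num_decoder_tokens` is again an illustrative placeholder:

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

num_decoder_tokens = 1000  # output vocabulary size (illustrative)

# Decoder: starts from the encoder's latent states and predicts one token
# distribution per timestep (teacher forcing during training).
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

# Full training model: (input sequence, shifted target sequence) -> token distributions.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```

The softmax output at each timestep corresponds to $$ p(y_t | y_1, ..., y_{t-1}, X) $$ in the equations above.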
Limitations & Future Improvements
While this approach is effective, using a single context vector $$ h_T $$ to encode the entire sequence can be a bottleneck. To improve performance, modern architectures introduce **attention mechanisms**, which allow the decoder to focus on different parts of the input sequence dynamically.
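As an illustrative extension (not part of this project's current code), Keras ships a dot-product `Attention` layer that lets the decoder attend over all encoder hidden states instead of only $$ h_T $$; the sketch below assumes the same placeholder dimensions as before:

```python
from tensorflow.keras.layers import Input, LSTM, Dense, Attention, Concatenate
from tensorflow.keras.models import Model

latent_dim = 256           # illustrative
num_encoder_tokens = 1000  # illustrative
num_decoder_tokens = 1000  # illustrative

# Encoder now returns the full sequence of hidden states, not just the last one.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                         return_state=True)(encoder_inputs)

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(decoder_inputs,
                                                initial_state=[state_h, state_c])

# Dot-product (Luong-style) attention: queries are decoder states,
# keys/values are the encoder's per-timestep hidden states.
context = Attention()([decoder_outputs, encoder_outputs])
combined = Concatenate()([decoder_outputs, context])
outputs = Dense(num_decoder_tokens, activation="softmax")(combined)

model = Model([encoder_inputs, decoder_inputs], outputs)
```

Because the decoder can now weight every encoder timestep dynamically, the model no longer has to squeeze the entire input sequence through the single fixed-size context vector.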