# Recurrent Neural Networks

## April 15, 2020 — 18 min

Recurrent neural networks are very famous deep learning networks which are applied to sequence data: time series forecasting, speech recognition, sentiment classification, machine translation, Named Entity Recognition, etc..
The use of feedforward neural networks on sequence data raises two majors problems:

• Input & outputs can have different lengths in different examples
• MLPs do not share features learned across different positions of the data sample

In this article we will discover the mathematics behind the success of RNNs as well as some special types of cells such as LSTMs and GRUs. We will finally dig into the encoder-decoder architectures combined with attention mechanisms.

## Table of content

1. Notation
2. RNN model
3. Different types of RNNs
5. Encoder & Decoder architecture
6. Attention mechanisms

## 1. Notation

As an illustration, we will consider the task of Named Entity Recognition which consists of locating and identifying the named entity such as proper names: We denote:

• $x^{(i)}_t$: the $t^{th}$ element of the sequence example $x^{(i)}$
• $(x^{(i)})$: the i-th input sequence $x^{(i)}$
• $N_x^{(i)}$: the length of $x^{(i)}$
• $y^{(i)}_t$: the $t^{th}$ element of the output $y^{(i)}$
• $(y^{(i)})$: the i-th output sequence $y^{(i)}$
• $N_y^{(i)}$: the length of $y^{(i)}$

When dealing with non-numerical data, text for instance, it is very important to encode it to numerical vectors: this operation is called embedding. One of the most famous ways to encode text is Bert which was developed by Google.

## 2. RNN model

RNNs represent a special case of neural networks where the parameters of the model as well as the operations performed are the same throughout the architecture. The network performs the same task for each element in a sequence whose output depends on the input and the previous state of the memory.
The graph below shows a neural network of neurons having a single layer of hidden memory: ### Equations

The variables in the architecture are:

• $x_t$: the input at time t
• $h_t$: the state of the memory at time t
• $y_t$: the output at time t

where:

$h_t=\phi(U.x_t+W.h_{t-1})$ and $y_t = \psi(V.h_t)$

$h_{-1}$ is randomly initialized, $\phi$ and $\psi$ are non-linear functions, $U$, $V$, and $W$ are parameters of the various linear regressions, preceding the nonlinear activations.
It is important to note that they are the same throughout the architecture.

### Applications

Recurrent neural networks have significantly improved sequential models, in particular :

• NLP Tasks, Modeling and Text Generation
• Translation machine
• Voice recognition

We summarize the above applications in the following table:

Application Description Input Output Example
NLP Text modelisation and generation $x_t$ is the t-th embedded word of a given text $(y_t)$ can be the answer to the question $(x_t)$ Chatbots, text classification, …
Translation Machine Necessity to go through the entire sentence before its translation $x_t$ is the t-th embedded word of the native text $(y_t)$ is the translation of the native text $(x_t)$ Google Translation, DeepL, …
Speech recognition Recognize the content of a given recording $x_t$ is the value of the audio file’s spectogram at the instant t $(y_t)$ is the text or the command extracted from $(x_t)$ Google Home, Alexa, …

### Learning algorithm

As in classical neural networks, learning in the case of recurrent networks is done by optimizing a cost function with respect to $U$, $V$ and $W$. In other words, we aim to find the best parameters that give the best prediction/approximation $\hat{y_i}$, starting from the input $x_i$, of the real value $y_i$.
For this, we define an objective function called the loss function and denoted J which quantifies the distance between the real and the predicted values on the overall training set.
We minimize J following two major steps:

• Forward Propagation: we propagate the data through the network either in entirely or in batches, and we calculate the loss function on this batch which is nothing but the sum of the errors committed at the predicted output for the different rows.
• Backward Propagation Through Time: consists of calculating the gradients of the cost function with respect to the different parameters, then apply a descent algorithm to update them. It is called BPTT, since the gradients at each output depend both on the elements of the same instant and the state of the memory at the previous instant.

We iter the same process a number of times called epoch number. After defining the architecture, the learning algorithm is written as follows:

• Initialization of the model parameters, a step equivalent to injecting noise into the model.
• For k=1,2…N: (N is the number of epochs)

• Perform forward propagation:

• $\forall i$, Compute the predicted value of $(x_t^{(i)})$ through the neural network: $(\hat{y}_t^{(i)})$
• Evaluate the function : $J(\theta)=\frac{1}{m}\sum_{i=1}^m\sum_{t=1}^{N_y^{(i)}} \mathcal{L}(\hat{y}_t^{(i)}, y_t^{(i)})$ where m is the size of the training set, θ the model parameters and $\mathcal{L}$ the cost${}^{(*)}$ function
• Perform backpropagation through time:

• Apply a descent method to update the parameters : $\theta=:G(\theta)$

${}^{(*)}$ The cost function $\mathcal{L}$ evaluates the distances between the real and predicted value on a single point.

#### Forward propagation

Let us consider the prediction of the output of a single sequence, denoted $(x^{(j)}_t)$, through the neural network At each moment $t$, we compute:

$h^{(j)}_t=\phi(U.x_t^{(j)}+W.h_{t-1}^{(j)})\text{ and } y_t^{(j)} = \psi(V.h_t^{(j)})$

Until reaching the end of the sequence. Again, the parameters $U$, $W$ and $V$ remain the same all along the neural network. When dealing with a $m$-row data set, repeating these operations separately for each line is very costly. Hence, we truncate the dataset in order to have sequences described in the same timeline, i.e $\forall i, N_x^{(i)}=N_x$ and $N_y^{(i)}=N_y$ and We can use linear algebra to parallelize it as follows:

$H^{(j)}_t=\phi(U.X_t^{(j)}+W.H_{t-1}^{(j)})\text{ and } Y_t^{(j)} = \psi(V.H_t^{(j)})$

Where

$H^{(j)}_t=\begin{bmatrix} h^{(j)}_t \end{bmatrix}_{j\in [1,m]} \\ Y_t^{(j)}=\begin{bmatrix} y_t^{(j)} \end{bmatrix}_{j\in [1,m]}$

#### Backpropagation Through Time

The backpropagation is the second step of the learning, which consists of injecting the error committed in the prediction (forward) phase into the network and update its parameters to perform better on the next iteration. Hence, the optimization of the function $J$, usually through a descent method.
For simplification reasons we will consider that: $h_t=U.x_t+W.h_{t-1}$ and $y_t = V.h_t$

For a given sequence $(x_t)$, we perform the forward propagation and compute the following cost:

$L(U,W,V)=\sum_{t=1}^{N_y} \mathcal{L}(\hat{y}_t, y_t)$

and compute the gradient of L defined as follows:

$\nabla L(U,W,V)={}^T \begin{bmatrix} \frac{\partial L}{\partial U} & \frac{\partial L}{\partial W} & \frac{\partial L}{\partial V} \end{bmatrix}$

We have:

$\frac{\partial L}{\partial V}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}\frac{d\hat{y}_t}{dV}$

$\rightarrow \frac{\partial L}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.h_t$

and

$\frac{\partial L}{\partial W}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial W}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.\frac{d\hat{y}_t}{dh_t}.\frac{dh_t}{dW}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\frac{dh_t}{dW}$

$\frac{\partial L}{\partial U}=\sum_{t=1}^{N_y} \frac{\partial \mathcal{L}(\hat{y}_t, y_t)}{\partial U}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.\frac{d\hat{y}_t}{dh_t}.\frac{dh_t}{dU}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\frac{dh_t}{dU}$

We know that:

$h_{t+1}=U.x_{t+1}+W.h_{t}\rightarrow \frac{\partial h_{t+1}}{\partial h_t}=W^T\rightarrow \frac{\partial h_{N}}{\partial h_t}=(W^T)^{N-t}$

Then,

$\frac{dh_t}{dW}=\sum_{k=1}^t(W^T)^{t-j}h_j$ $\frac{dh_t}{dU}=\sum_{k=1}^t(W^T)^{t-j}x_j$

Finally:

$\frac{\partial L}{\partial U}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\sum_{k=1}^t(W^T)^{t-j}x_j$
$\frac{\partial L}{\partial W}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.V.\sum_{k=1}^t(W^T)^{t-j}h_j$
$\frac{\partial L}{\partial V}=\sum_{t=1}^{N_y}\frac{d\mathcal{L}(\hat{y}_t, y_t)}{d\hat{y}_t}.h_t$

We can now apply a descend method as detailed is my previous article.

### Memory problem

There are several fields where we are interested in predicting the evolution of a time serie based on its history: music, finance, emotions…etc. The intrinsic recurrent networks described above, called ‘Vanilla’, suffer from a weak memory unable to take into account several elements of the past in the prediction of the future.
With this in mind, various extensions of the RNNs have been designed to trim the internal memory: bi-directional neural networks, LSTM cells, attention mechanisms…etc Memory enlargement can be crucial in certain fields such as finance where one seeks to memorize as much history as possible in order to predict a financial series.
The learning phase in RNN might also suffer from gradient vanishing or gradient exploding problems since the gradient of the cost function includes the term $(W^T)^{t-j}$ which affects its memorizing capacity.

## 3. Different types of RNNs

There are several extensions for classic or ‘Vanilla’ recurrent neural networks, these extensions have been designed to increase the memory capacity of the network along with the features extraction capacity.
The illustration below summarizes the different extensions: There exist other types of RNNs that have a specifically designed hidden layer which we will discuss in the next chapter.

## 4. Advanced types of cells

### GRU

GRU (Gated Recurrent Unit) cells allow the recurrent network to save more historical information for a better prediction. It introduces an update gate which determines the quantity of information to keep from the past as well as a reset gate which sets the quantity of information to forget.
The graph bellow schematizes the GRU cell: #### Equations

We define the equations in the GRU cell as follows:

• Update gate: $z_t=\phi(W^{zx}x_{t}+W^{zh}h_{t-1})$
• Reset gate: $r_t=\phi(W^{rx}x_{t}+W^{rh}h_{t-1})$
• Current memory content: $c_t=tanh(W^{cx}x_{t}+W^{ch}r_th_{t-1})$
• Memory state: $h_t=(1-z_t)h_{t-1}+z_t.c_t$

$\phi$ is a non-linear integer function and the parameters $W^{.}$ are learned by the model.

### LSTM

LSTMs (Long Short Term Memory) were also introduced to overcome the problem of short memory, they have 4 times more memory than Vanilla RNNs. This model uses the notion of gates and has three :

• Input gate i: controls the flow of incoming information.
• Forget gate f: Controls the amount of information from the previous memory state.
• Output gate o: controls the flow of outgoing information

The graph below shows the operation of the LSTM cell: When the input and output doors are closed, activation is blocked in the memory cell.

#### Equations

We define the equations in the LSTM cell as follows:

• Input gate: $i_t=\phi(W^{ix}x_t+W^{ih}h_{t-1})$
• Forget gate: $f_t=\phi(W^{fx}x_t+W^{fh}h_{t-1})$
• Output gate: $o_t=\phi(W^{ox}x_t+W^{oh}h_{t-1})$
• Input node: $g_t=tanh(W^{gx}x_t+W^{gh}h_{t-1})$
• Hidden node: $s_t=s_{t-1}.f_t+g_t.i_t$
• Memomy state: $h_t=tanh(s_t).o_t$

$\phi$ is a non-linear integer function and the parameters $W^{.}$ are learned by the model.

#### Pros & Cons

We can summarize the advantages and disadvantages of LSTM cells in 4 main points:

• They are able to model long-term sequence dependencies.
• They are more robust to the problem of short memory than ‘Vanilla’ RNNs since the definition of the internal memory is changed from $h_t=f(Ux_t+Wh_{t-1})$ to $h_t=tanh(s_{t-1}.f_t+g_t.i_t).o_t$

• They increase the computing complexity compared to the RNN with the introduction of more parameters to learn.
• The memory required is higher than the one of ‘Vanilla’ RNNs due to the presence of several memory cells.

## 5. Encoder & Decoder architecture

It is a sequential model consisting of two main parts:

• Encoder: the first part of the model processes the sequence and then returns at the end an encoding vector of the whole series called context vector which summarizes the information of the different inputs.
• Decoder: the context vector is then taken as the input to the decoder in order to make the predictions.

The diagram below illustrates the architecture of the model : The encoder can be considered as a dimension reduction tool, matter of fact, the context vector $e_n$ is nothing else than the encoding of the input vectors $(in_0, in_1,...in_n)$, the sum of the sizes of these vectors is much larger than that of en, hence the notion of dimension reduction.

## 6. Attention mechanisms

Attention mechanisms were introduced to address the problem of memory limitation, and mainly answer the following two questions:

• What weight (importance) $α_j$ is given to each output $e_j$ of the encoder ?
• How can we overcome the limited memory of the encoder in order to be able to ‘remember’ more of the encoding process?

The mechanism inserts itself between the encoder and the decoder and helps the decoder to significantly select the encoded inputs that are important for each step of the decoding process $out_i$ as follows: ### Mathematical formalism

Keeping the same notation as before, we set $α_{i,j}$ as the attention given by the output $i$, noted $out_i$, to the vector $e_j$. The attention is computed via a neural network which takes as inputs the vectors $(e_0, e_1, ..., e_n)$ and the previous memory state $h_{i-1}$, it is given by :

$\alpha_{i,j}=\frac{exp(b_{i,j})}{\sum_{j=0}^nexp(b_{i,j})}$

where $b_{i,j}=a(e_j, h_{i-1})=v\tanh(w_1e_j+w_2h_{i-1}))$, $a$ is the neural network and $w_{1/2}$ are learned by the architecture.
Finally, the input of the decoder is the weighted linear combination of the encoder’s outputs:

$c_i=\sum_{j=0}^n\alpha_{i,j}e_j$

Considering the definition of $\alpha_{i,j}$, we have:

$\forall i,~\sum_{j=0}^n\alpha_{i,j}=1$

### Application: Translation Machine

The use of an attention mechanism makes it possible to visualize and interpret what the model is doing internally, especially at the time of prediction.
For example, by plotting a ‘heatmap’ of the attention matrix of a translation system, we can see the words in the first language on which the model focuses to translate each word into the second language: As illustrated above, when translating a word into English, the system focuses in particular on the corresponding French word.

### Overlay of LSTM & Attention Mechanism

It is relevant to combine the two methods to improve internal memory, since the first one allows more elements of the past to be taken into account and the second one chooses to pay careful attention to them at the time of prediction.
The output $c_t$ of the attention mechanism is the new input of the LSTM cell, so the system of equations becomes as follows :

• Input gate: $i_t=\phi(W^{ic}c_t+W^{ih}h_{t-1})$
• Forget gate: $f_t=\phi(W^{fc}c_t+W^{fh}h_{t-1})$
• Output gate: $o_t=\phi(W^{oc}c_t+W^{oh}h_{t-1})$
• Input node: $g_t=tanh(W^{gc}c_t+W^{gh}h_{t-1})$
• Hidden output: $s_t=s_{t-1}.f_t+g_t.i_t$
• Prediction output: $h_t=tanh(s_t).o_t$

$\phi$ is a non-linear integer function and the parameters W and b are learned by the model.

• Z.Lipton, J.Berkowitz, C.Elkan, A Critical Review of Recurrent Neural Networks for Sequence Learning, arXiv: 1506.00019v4, 2015.
• H.Salehinejad, S.Sankar, J.Barfett, E.Colak, S.Valaee, Recent Advances in Recurrent Neural Networks, arXiv: 1801.01078v3, 2018.
• Y.Baveye, C.Chamaret, E.Dellandréa, L.Chen, Affective Video Content Analysis: A Multidisciplinary Insight, HAL Id: hal-01489729, 2017.
• A.Azzouni, G.Pujolle, A Long Short-Term Memory Recurrent Neural Network Framework for Network Traffic Matrix Prediction, arXiv: 1705.05690v3, 2017.
• Y.G.Cinar, H.Mirisaee, P.Goswami, E.Gaussier, A.Ait-Bachir, V.Strijov, Time Series Forecasting using RNNs: an Extended Attention Mechanism to Model Periods and Handle Missing Values, arXiv: 1703.10089v1, 2017.
• K.Xu, L.Wu, Z.Wang, Y.Feng, M.Witbrock, V.Sheinin, Graph2Seq: Graph to Sequence Learning with Attention-Based Neural Networks, arXiv: 1804.00823v3, 2018.
• Rose Yu, Yaguang Li, Cyrus Shahabi, Ugur Demiryurek, Yan Liu, Deep Learning: A Generic Approach for Extreme Condition Traffic Forecasting, Universit’e de Californie Sud, 2017.