
RNN

The Unreasonable Effectiveness of Recurrent Neural Networks

A recurrent neural network can be thought of as multiple copies of the same network, each passing a different message to a successor.

An unrolled recurrent neural network.
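For reference, the recurrence the vanilla RNN cell computes at each time step (one common parameterization; the weight names here are my own):

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \quad y_t = W_{hy} h_t + b_y$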

Problems:

RNN training

BPTT (Backpropagation Through Time)

During backpropagation the RNN is unrolled (no separate copies of earlier weights are kept, so the weight parameters are identical across all time steps); the gradient is propagated backward from the most recent computation step to the preceding steps, and then in turn through all the computation steps of the previous time step, and so on…
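A minimal PyTorch sketch of BPTT (a toy example of my own, not taken from any of the linked articles): the same weights are reused at every time step, and calling backward() on the final loss sends gradients back through every unrolled step.

```python
import torch

# Toy unrolled RNN: the SAME weights W_xh and W_hh are applied at every time step.
T, input_size, hidden_size = 5, 3, 4
W_xh = torch.randn(input_size, hidden_size, requires_grad=True)
W_hh = torch.randn(hidden_size, hidden_size, requires_grad=True)

x = torch.randn(T, input_size)   # a sequence of T inputs
h = torch.zeros(hidden_size)     # initial hidden state

for t in range(T):               # forward pass = unrolling the network in time
    h = torch.tanh(x[t] @ W_xh + h @ W_hh)

loss = h.sum()                   # dummy loss on the final hidden state
loss.backward()                  # BPTT: gradients flow back through all T steps
print(W_hh.grad.shape)           # shared-weight gradients accumulated over every step
```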

Vanishing/exploding gradients

During backpropagation in deep neural networks, the vanishing gradient problem arises from saturating activation functions such as sigmoid and tanh, and the exploding gradient problem arises from large weights.

"One bad apple spoils the barrel": backpropagation is the chain rule (a product of gradients), so a single problematic factor is enough to spoil the whole product:

Derivation: Derivatives for Common Neural Network Activation Functions - The Clever Machine

Causes

  1. The unrolled RNN is usually very, very deep

  2. The RNN gradient contains many repeated identical terms (see the numeric sketch after this list):

    non-recurrent:

    $w_1 \cdot \alpha_1 \cdot w_2 \cdot \alpha_2 \cdots w_d \cdot \alpha_d$

    RNN:

    $w \cdot \alpha_1 \cdot w \cdot \alpha_2 \cdots w \cdot \alpha_d$
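A quick numeric illustration with toy numbers of my own: because the same factor $w \cdot \alpha$ appears at every one of the $d$ steps, the overall product shrinks or grows exponentially with $d$.

```python
# Toy illustration: the same factor repeated d times in the gradient product.
# |w * alpha| < 1  ->  the product vanishes;  |w * alpha| > 1  ->  it explodes.
d = 50                                  # number of unrolled time steps
for w_times_alpha in (0.9, 1.1):
    print(w_times_alpha, "->", w_times_alpha ** d)
# 0.9 ** 50 ≈ 5.2e-03 (vanishing), 1.1 ** 50 ≈ 1.2e+02 (exploding)
```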

Strategies

Alternative activation functions (ReLU)

Batch Normalization

The root cause of vanishing gradients with activations like sigmoid and tanh is inputs that are too large in magnitude (>= 5), where the functions saturate; normalizing the data keeps most values within [-4, 4], which greatly alleviates the vanishing gradient problem.
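A quick check of this claim, using the standard derivative $\tanh'(x) = 1 - \tanh^2(x)$:

```python
import math

# tanh'(x) = 1 - tanh(x)^2 collapses once |x| is large, which is why keeping
# inputs roughly inside [-4, 4] (e.g. via normalization) helps.
for x in (0.0, 2.0, 4.0, 5.0):
    print(x, 1.0 - math.tanh(x) ** 2)
# ≈ 1.0, 7.1e-02, 1.3e-03, 1.8e-04
```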


Residual Connections

Truncated BPTT

Gradient clipping
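A minimal PyTorch sketch combining the last two strategies, truncated BPTT and gradient clipping (the model, data, and hyperparameters here are toy assumptions of mine):

```python
import torch
import torch.nn as nn

# Minimal sketch (toy model and data): truncated BPTT + gradient clipping.
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(4, 100, 8)   # batch of 4 sequences, 100 time steps, 8 features
y = torch.randn(4, 100, 1)
h = None
chunk = 20                   # truncation length

for start in range(0, x.size(1), chunk):   # truncated BPTT: walk the sequence in chunks
    out, h = rnn(x[:, start:start + chunk], h)
    h = h.detach()                         # cut the graph so gradients stop at the chunk boundary
    loss = nn.functional.mse_loss(head(out), y[:, start:start + chunk])
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # clip the global gradient norm
    opt.step()
```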

LSTM


The repeating module in a standard RNN contains a single layer.

An LSTM neural network.

The repeating module in an LSTM contains four interacting layers.

Structure

cell state

acts as a transport highway that carries relevant information all the way down the sequence chain

(think of this as the memory of the network)

hidden state

short-term memory (the output of the previous time step)

cell state: long-term memory

hidden state: short-term memory

gates

Each gate is composed of a sigmoid layer plus a pointwise multiplication; the sigmoid outputs values between 0 and 1, deciding how much of each component to let through.

forget gate: a sigmoid over $[h_{t-1}, x_t]$ that decides how much of the previous cell state to keep.

input gate: decides which new candidate values are written into the cell state.

combining f + i: the cell-state update, forget part of the old state, then add the gated new candidate.

output gate: decides which parts of the (tanh-squashed) cell state become the new hidden state.

activations: sigmoid for the three gates, tanh for the candidate values and for the cell state at the output.
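For reference, the standard LSTM step (notation as in the "Understanding LSTM Networks" post linked below; $\sigma$ is the sigmoid, $\odot$ element-wise multiplication):

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$

$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$

$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$

$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$

$h_t = o_t \odot \tanh(C_t)$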

How the LSTM alleviates the vanishing gradient problem

How LSTM networks solve the problem of vanishing gradients - by Nir Arbel - DataDrivenInvestor

Principle:
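In short (my paraphrase of the linked article's argument): the cell state is updated additively, $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, so the error flowing along the cell-state path is scaled by the forget gate rather than repeatedly multiplied by the same weight matrix and squashed by an activation. Ignoring the dependence of the gates on $C_{t-1}$,

$\frac{\partial C_t}{\partial C_{t-1}} \approx f_t$

so as long as the forget gate stays close to 1, gradients can travel across many time steps without vanishing.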

How to optimize

GA(遗传算法)优化LSTM神经网络-CSDN博客

Choosing the right Hyperparameters for a simple LSTM using Keras - Towards Data Science

LSTM 如何优化? - 知乎


Illustrated Guide to LSTM’s and GRU’s: A step by step explanation - YouTube

Understanding LSTM Networks – colah’s blog

GRU

A lightweight variant of the LSTM

The hidden state takes over the role of the cell state; both long-term and short-term memory are stored in the hidden state.

A gated recurrent unit neural network.

reset gate: forgets part of the previous hidden state

update gate: combines the previous hidden state with the candidate hidden state to produce the output
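One common way to write the GRU step (symbol names are mine; $\sigma$ is the sigmoid, $\odot$ element-wise multiplication):

$z_t = \sigma(W_z [h_{t-1}, x_t])$  (update gate)

$r_t = \sigma(W_r [h_{t-1}, x_t])$  (reset gate)

$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$  (candidate hidden state)

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$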

Illustrated Guide to LSTM’s and GRU’s: A step by step explanation - YouTube

Understanding GRU Networks - Towards Data Science

encoder-decoder

encoder: input -> feature vector (feature representations)

decoder: feature vector -> output

In simple terms, the ENCODER compresses the data to retain the important information, and the DECODER performs the final task.

training

The encoder is trained jointly with the decoder. There are no labels (hence unsupervised): the loss function measures the difference between the actual input and the reconstructed input, and the optimizer trains both encoder and decoder to lower this reconstruction loss.

Once trained, the encoder produces a feature vector for an input that the decoder can use to reconstruct that input, keeping the features that matter most so the reconstruction is recognizable as the original.

It is important to know that in actual applications, people usually do not try to reconstruct the input itself, but rather want to map/translate/associate inputs to certain outputs, for example translating French sentences to English.
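A minimal PyTorch sketch of this encoder-decoder (seq2seq) pattern using GRUs; the class, sizes, and vocabularies here are toy assumptions of mine, not from the linked answers:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder compresses the source sequence into a
    context vector, which the decoder unrolls into the target sequence."""

    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: source tokens -> final hidden state (the "feature vector").
        _, context = self.encoder(self.src_emb(src))
        # Decoder: starts from the context and consumes the (shifted) target tokens.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)
        return self.out(dec_out)          # logits over the target vocabulary

# Toy usage: batch of 2 source sequences (length 7) -> target sequences (length 5).
model = Seq2Seq(src_vocab=100, tgt_vocab=90)
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 90, (2, 5))
logits = model(src, tgt)                  # shape: (2, 5, 90)
loss = nn.functional.cross_entropy(logits.reshape(-1, 90), tgt.reshape(-1))
```

In a real translation model the decoder would be fed shifted target tokens during training (teacher forcing) and would generate step by step at inference time.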

What is an Encoder/Decoder in Deep Learning? - Quora

What is an encoder decoder model? - by Nechu BM - Towards Data Science