Weight Initialization

Weight initialization for neural networks is an active area of research. Common methods include Glorot (also called Xavier) initialization and He initialization.

https://arxiv.org/pdf/1502.01852.pdf [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (2015)]

http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf [Understanding the difficulty of training deep feedforward neural networks (2010)]

https://arxiv.org/pdf/1312.6120.pdf [Exact solutions to the nonlinear dynamics of learning in deep linear neural networks (2014)]

https://arxiv.org/abs/1511.06422 [All you need is a good init (2015)]

 

In all of these papers, people use either Normal or Uniform initialization. There is a simple relationship between the two: a uniform distribution U(-a, a) has variance a^2 / 3, so to match a normal distribution with standard deviation sigma you set the uniform limit to a = sigma * sqrt(3) (equivalently, sigma = a / sqrt(3)).
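As a quick numerical check (a minimal NumPy sketch; the specific sigma is just an example), both draws end up with essentially the same standard deviation:

```python
import numpy as np

# Match a normal init with standard deviation sigma using a uniform init:
# Var(U(-a, a)) = a^2 / 3, so a = sigma * sqrt(3) gives the same variance.
sigma = 0.05
a = sigma * np.sqrt(3)

rng = np.random.default_rng(0)
w_normal = rng.normal(0.0, sigma, size=1_000_000)
w_uniform = rng.uniform(-a, a, size=1_000_000)

print(w_normal.std(), w_uniform.std())  # both ~0.05
```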

 

To summarize all non-iterative approaches (fan_in is the number of input units, fan_out the number of output units; "Uniform" means sampling from U(-limit, limit); a NumPy sketch follows the table):

Activation | Normal | Uniform
Sigmoid, Tanh, Linear, Softmax, others (Glorot / Xavier) | std = sqrt(2 / (fan_in + fan_out)) | limit = sqrt(6 / (fan_in + fan_out))
ReLU, PReLU, ELU and derivatives (He) | std = sqrt(2 / fan_in) | limit = sqrt(6 / fan_in)
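Here is a minimal NumPy sketch of the two rows above (framework-agnostic; libraries ship their own built-in equivalents, e.g. glorot_uniform in Keras and kaiming_normal_ in PyTorch):

```python
import numpy as np

def glorot_init(fan_in, fan_out, uniform=True, rng=None):
    """Glorot / Xavier initialization for sigmoid, tanh, linear, softmax layers."""
    rng = rng or np.random.default_rng()
    if uniform:
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, uniform=False, rng=None):
    """He initialization for ReLU, PReLU, ELU and similar layers."""
    rng = rng or np.random.default_rng()
    if uniform:
        limit = np.sqrt(6.0 / fan_in)
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_init(784, 256)   # e.g. the first ReLU layer of an MNIST MLP
print(W.std())          # ~ sqrt(2 / 784) ≈ 0.0505
```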

 

 

https://github.com/keras-team/keras/issues/52 showed that using a normal distribution could in fact reduce accuracy on MNIST (though only by a small 0.4%).

 

For convolutional filters (see https://stackoverflow.com/questions/42670274/how-to-calculate-fan-in-and-fan-out-in-xavier-initialization-for-neural-networks), since each channel gets its own filter, fan_in = c_in * h * w and fan_out = c_out * h * w, where c_in and c_out are the numbers of input and output channels, and h and w denote the kernel (filter) height and width. A 3D kernel just adds an extra multiplicative term for its depth.
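A minimal sketch of those fan computations, assuming the kernel is stored as (out_channels, in_channels, h, w) (the layout PyTorch uses; other frameworks order the axes differently):

```python
import numpy as np

def conv_fans(shape):
    """fan_in / fan_out for a 2D conv kernel of shape (c_out, c_in, h, w)."""
    c_out, c_in, h, w = shape
    receptive_field = h * w
    return c_in * receptive_field, c_out * receptive_field

def he_normal_conv(shape, rng=None):
    """He-normal initialization for a ReLU conv layer."""
    rng = rng or np.random.default_rng()
    fan_in, _ = conv_fans(shape)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

W = he_normal_conv((64, 3, 3, 3))  # 64 filters over 3 input channels, 3x3 kernels
print(conv_fans(W.shape))          # (27, 576)
```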

 

Another popular method is Orthogonal Initialization: draw a matrix from a standard normal, then orthogonalize it via the QR decomposition. To correct for variance scaling on top of this, LSUV (Layer-Sequential Unit-Variance initialization) was proposed: starting from orthogonal weights, it iteratively rescales each layer in turn so that the layer's output variance on a batch of data is close to one.
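A minimal NumPy sketch of both steps, assuming a plain fully-connected ReLU network (the batch size, tolerance and iteration cap are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def orthogonal_init(fan_in, fan_out, rng=None):
    """Draw a standard-normal matrix and orthogonalize it with a QR decomposition."""
    rng = rng or np.random.default_rng()
    a = rng.normal(size=(max(fan_in, fan_out), min(fan_in, fan_out)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # sign fix so Q is uniformly distributed
    return q if fan_in >= fan_out else q.T

def lsuv(weights, x, tol=0.01, max_iters=10):
    """Layer-Sequential Unit-Variance: rescale each layer so its output
    variance on the batch x is ~1, processing layers in order."""
    h = x
    for W in weights:
        for _ in range(max_iters):
            var = (h @ W).var()
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)            # scale the weights in place
        h = np.maximum(h @ W, 0.0)       # ReLU output feeds the next layer
    return weights

rng = np.random.default_rng(0)
sizes = [784, 512, 256]                  # ReLU layers only (see the warning below)
weights = [orthogonal_init(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(512, 784))          # stand-in for a batch of real inputs
weights = lsuv(weights, x)
```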

 

DO NOT use LSUV with Sigmoid, Tanh, Linear, Softmax or the other Glorot-style activations: the iterative variance correction can fail to converge, so LSUV has to bail out (terminate early) when these activations are used!

[Figure: training accuracy of LSUV vs. Glorot initialization over time]

LSUV sadly only improves accuracy by minuscule amounts. In fact, in my own testing, LSUV actually makes training worse in the long run! You can see that at the start of training LSUV has higher accuracy than Glorot; however, as training goes on, LSUV ends up with a lower final accuracy than Glorot.

 

BUT DON'T GET FOOLED! In fact, if you run LSUV and Glorot side by side multiple times (say 230 runs, randomizing the random seed each time), LSUV actually wins!

 

Copyright Daniel Han 2024. Check out Unsloth!