**Weight initialization when using ReLUs**

I'm interested in whether there are any practical recommendations, or publications, on weight initialization when using rectified linear units. I know that in practice (probably due to the AlexNet paper), everyone just initializes everything with 0.01-stddev Gaussian noise.

With ReLUs there's no worry that a poor initialization will cause vanishing gradients through unit saturation, but changing the scale of the weights definitely affects the scale of the activations on the forward pass (which can matter when they terminate in a sigmoid or softmax output layer). In addition, the scales (singular values?) of the weight matrices determine to what extent vanishing/exploding gradients occur during backprop.

Personally, I've only run into problems when using layers with relatively small fan-in/fan-out, and then I just have to increase the scale of the initialization.

Is there math that can guide us here? Or is the answer still just 0.01? ;)
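There is some simple math behind the question. As an illustration (my own sketch, not from the thread): for a linear layer y = Wx with i.i.d. zero-mean weights of variance sigma^2 and fan-in n, each output has Var(y_i) ≈ n · sigma^2 · Var(x_j), so the forward-pass activation scale grows like sqrt(fan_in) · sigma. This is why a fixed 0.01 std behaves differently depending on layer width:

```python
# Sketch: empirically check Var(y_i) ~= fan_in * sigma^2 for unit-variance
# inputs and 0.01-std Gaussian weights. All sizes here are arbitrary
# illustration values, not taken from the thread.
import numpy as np

rng = np.random.default_rng(0)
fan_in = 4096
sigma = 0.01

x = rng.standard_normal((fan_in, 10000))        # unit-variance inputs
W = rng.normal(0.0, sigma, size=(256, fan_in))  # 0.01-std Gaussian init

y = W @ x
print(float(np.var(y)))          # empirical output variance
print(fan_in * sigma**2)         # predicted value, 0.4096 here
```

For fan_in = 4096 the activations already have ~0.64 std before any nonlinearity, while a narrow layer with fan_in = 100 would give ~0.1 std from the same 0.01 init, consistent with the fan-in-dependent problems mentioned above.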


- Anyway, for most practical purposes, I've found the torch defaults to work well.

For conv layers:

stdv = 1/math.sqrt(self.kW*self.kH*self.nInputPlane)

For linear layers:

stdv = 1./math.sqrt(inputSize)

(Oct 8, 2014)

- And I had really slow initial learning using Alex's 0.01 std and Gaussian distribution for large-scale networks, compared to torch's default initialization. I frankly don't see any reason why it should be a Gaussian at all, and I don't understand why anyone should think 0.01 is good either. (Oct 8, 2014)
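The torch defaults quoted above amount to drawing weights uniformly from [-stdv, stdv] with stdv = 1/sqrt(fan-in). A rough numpy translation (my sketch, not the torch source; function names are mine):

```python
# Sketch of the torch-style default init: uniform in [-stdv, stdv],
# stdv = 1/sqrt(fan_in). Shapes follow the (out, in) and
# (out, in, kH, kW) conventions; exact layout is an assumption here.
import math
import numpy as np

def linear_init(input_size, output_size, rng):
    stdv = 1.0 / math.sqrt(input_size)
    return rng.uniform(-stdv, stdv, size=(output_size, input_size))

def conv_init(n_input_plane, kh, kw, n_output_plane, rng):
    stdv = 1.0 / math.sqrt(kw * kh * n_input_plane)
    return rng.uniform(-stdv, stdv, size=(n_output_plane, n_input_plane, kh, kw))

rng = np.random.default_rng(0)
W = linear_init(1024, 256, rng)
print(float(W.std()))  # roughly stdv / sqrt(3), since Var(uniform) = stdv^2 / 3
```

Note the resulting std is stdv/sqrt(3), so for fan-in 1024 it is about 0.018: not far from 0.01 in absolute terms, but, unlike a fixed constant, it shrinks automatically as fan-in grows.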
- +Soumith Chintala, thanks for your input. I'm curious what your doubts are, specifically, with Gaussian initialization, other than the empirical observation that uniform begins to learn faster. (Oct 8, 2014)
- I want my neurons to compete equally from the moment training starts. I don't really see why there is a need to introduce such a huge bias (with a Gaussian) in the neuron weights such that I might actively help a subset of neurons dominate a particular output. All of this only matters, of course, until the network actually starts learning! (Oct 8, 2014)
- How much of the 'speedup' is due to using the uniform distribution instead of the Gaussian, though, and how much of it is due to choosing the right scale (i.e., variance)? Is there any literature on this? Your intuitive explanation makes some sense, but some hard numbers would be nice as well, to show that this is actually what happens :) (Oct 9, 2014)
- +Sander Dieleman there's definitely scope to sit down and do properly designed experiments on a few sets of problems. I don't have the time :) (Oct 9, 2014)