Weight initialization when using ReLUs

I'm interested in whether there are any practical recommendations for, or publications discussing, weight initialization when using rectified linear units.  I know that in practice (probably due to the AlexNet paper), everyone just initializes everything with Gaussian noise with a stddev of 0.01.
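For concreteness, the convention I mean is roughly the following (a NumPy sketch; the layer sizes are just made-up placeholders):

import numpy as np

def init_gaussian(fan_in, fan_out, std=0.01):
    # The "default" recipe: zero-mean Gaussian weights with a fixed std of 0.01,
    # and zero biases.
    W = np.random.randn(fan_in, fan_out) * std
    b = np.zeros(fan_out)
    return W, b

# e.g. a fully connected layer with 1024 inputs feeding 512 ReLU units
W, b = init_gaussian(1024, 512)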

With ReLUs there's no worry that a poor initialization will cause vanishing gradients through unit saturation, but the scale of the weights definitely affects the scale of the activations on the forward pass (which can matter when the network terminates in a sigmoid or softmax output layer).  In addition, the scales (singular values?) of the weight matrices determine to what extent gradients vanish or explode during backprop.
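As a back-of-the-envelope check on the forward-pass point: with i.i.d. zero-mean weights of standard deviation s and unit-variance inputs, each pre-activation has variance of roughly fan_in * s^2, so a fixed std of 0.01 can make the activation scale collapse (or blow up) with depth, depending on the layer widths.  A quick NumPy sketch (the widths and depth here are arbitrary):

import numpy as np

np.random.seed(0)

def relu(x):
    return np.maximum(x, 0.0)

# Push unit-variance inputs through a stack of ReLU layers initialized with a
# fixed 0.01-std Gaussian and watch the activation scale shrink layer by layer.
fan_in, depth, std = 1024, 10, 0.01
x = np.random.randn(100, fan_in)
for layer in range(depth):
    W = np.random.randn(fan_in, fan_in) * std
    x = relu(x.dot(W))
    print("layer %d: activation std = %.2e" % (layer + 1, x.std()))

With these widths the per-layer factor fan_in * s^2 is only about 0.1, so the activations die out; with much wider layers (or a larger std) the same calculation pushes the scale the other way.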

Personally, I've only run into problems when using layers with relatively small fan-in/out, and then I just have to increase the scale of the initialization.  
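One way to make that fix systematic would be to tie the std to the fan-in, e.g. std = k / sqrt(fan_in) for some hand-picked constant k (just a guess on my part, not something I've seen recommended in a paper):

import numpy as np

def init_fan_in_scaled(fan_in, fan_out, k=1.0):
    # Scale the std by 1/sqrt(fan_in) so narrow layers get a proportionally
    # larger initialization; k = 1.0 is an arbitrary choice for illustration.
    std = k / np.sqrt(fan_in)
    return np.random.randn(fan_in, fan_out) * std

# A narrow layer (fan_in = 32) ends up with std ~ 0.18 instead of a fixed 0.01.
W_narrow = init_fan_in_scaled(32, 64)
print(W_narrow.std())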

Is there math that can guide us here?  Or is the answer still just 0.01? ;)