One useful neural network trick I picked up from Geoff Hinton's arXiv paper on dropout is to put a hard inequality constraint on the norm of the weight vector going into each hidden unit, rather than adding a term to the objective function that penalizes the sum of the squares of the weights.
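
Concretely, the constraint amounts to a projection step after each weight update: if the L2 norm of a unit's incoming weight vector exceeds some threshold, rescale the vector back down to that threshold; otherwise leave it alone. Here's a minimal NumPy sketch of that idea. The function name, the one-column-per-unit layout, and the threshold of 3.0 are my own illustrative choices, not anything prescribed by the paper.

    import numpy as np

    def project_max_norm(W, max_norm=3.0):
        # W: weight matrix with one column of incoming weights per hidden unit.
        # Columns whose L2 norm exceeds max_norm are rescaled back onto the
        # constraint surface; columns already inside the ball are untouched.
        norms = np.linalg.norm(W, axis=0, keepdims=True)
        scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
        return W * scale

    # Inside a training loop, the projection runs right after each update:
    #   W -= learning_rate * grad_W
    #   W = project_max_norm(W, max_norm=3.0)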

This wasn't the main focus of his paper, and I'm not sure whether it has been used elsewhere, but I think it's a very important trick.

A very nice property of this approach is that it never drives the weights of a hidden unit toward zero. As long as the norm is below the constraint value, the constraint has no effect at all. In the context of stochastic optimization, like stochastic gradient descent, I find this very important. Classical weight decay exerts constant pressure pushing the weights toward zero, while the main objective function (log likelihood or whatever you're optimizing) fluctuates from step to step. That makes it easy for weight decay to throw away interesting information in your model on steps where the main objective provides no incentive for the weights to stay far from zero.
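
To make the contrast explicit, here is roughly what the two update rules look like side by side, reusing the project_max_norm sketch above; learning_rate, weight_decay, and grad_W are stand-in names for this illustration, not anything from the paper.

    # L2 weight decay: every step shrinks W toward zero, even when the
    # data gradient currently says nothing about these weights.
    W -= learning_rate * (grad_W + weight_decay * W)

    # Hard norm constraint: a no-op while each unit's norm is under the
    # threshold; it only kicks in when a norm would grow past it.
    W -= learning_rate * grad_W
    W = project_max_norm(W, max_norm=3.0)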

I've tried this constraint on several tasks spanning different optimization methods (stochastic gradient descent versus nonlinear conjugate gradient, training with dropout versus without it) and different model/objective function types (MLPs for classification, various unsupervised learning techniques), and in every case I saw improvements from getting rid of weight decay and using the constraint approach instead.