One useful neural network trick I picked up from Geoff Hinton's arXiv paper on dropout is using a hard inequality constraint on the norm of the weight vector going into each hidden unit, rather than adding a term to the objective function penalizing the sum of the squares of the weights.

This wasn't the main focus of his paper, and I'm not sure if it was employed elsewhere, but I think it's a very important trick.

A very nice property of this approach is that it will never drive the weights for a hidden unit to be near zero. As long as the norm is less than the constraint value, the constraint has no effect. I find that in the context of stochastic optimization, like stochastic gradient descent, this is very important. Classical weight decay exerts a constant pressure to move the weights near zero, while the main objective function (log likelihood or whatever you're optimizing) varies constantly. This makes it easy for weight decay to throw out interesting information from your model on steps when the main objective function doesn't provide any incentive for the weights to remain far from zero.

I've tried using this constraint on several tasks involving different optimization methods (stochastic gradient descent versus nonlinear conjugate gradient descent, training with dropout versus not training with dropout) and different model/objective function types (MLPs for classification, various unsupervised learning techniques) and found improvements in all of them by getting rid of weight decay and using the constraint approach instead.
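The constraint described above is usually enforced by projection: after each parameter update, rescale any incoming weight vector whose norm exceeds the limit back onto the constraint boundary, and leave all other weights untouched. Here is a minimal NumPy sketch of that projection step; the function name and the constraint value `c=3.0` are illustrative choices, not anything from the paper.

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Project each column of W (the incoming weight vector of one hidden
    unit) onto the ball of L2 norm at most c. Columns whose norm is already
    below c are left exactly as they are, so the constraint exerts no
    pressure toward zero, unlike classical weight decay."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Scale factor is 1 inside the ball, c / norm outside it.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Typical use, applied after each stochastic gradient step:
#   W = max_norm_project(W - learning_rate * grad_W, c=3.0)
```

Because the projection is a no-op whenever the constraint is inactive, it composes cleanly with any update rule (plain SGD, momentum, conjugate gradient line searches), which matches the observation that it helps across different optimizers.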


- I just use cross validation. You can monitor the constraint value during training to see if it's ever active. If it's never active, you know it's not having any effect, and you can use the plot of the norm over time to pick a constraint value that should have an effect, then try that value as your next point in your hyperparameter search. (Apr 4, 2013)
- Is it only for the weights, or also for the biases? (Apr 5, 2013)
- I've only ever tried applying it to the weights. Constraining biases can cause a lot of problems. For example, if you train an RBM with binary units on MNIST, a lot of the input pixels are always black. The correct way to represent that is to make the biases on the visible units very negative. If you constrain the biases to be small, then the hidden units need to inhibit the background pixels. That's a much more complicated way of getting black pixels, and much more likely to overfit. (Apr 5, 2013)
- Thanks +Ian Goodfellow. (Apr 5, 2013)
- +Ian Goodfellow, I recently experimented with weight norm constraints, and it was really helpful to have your quick suggestion on how to easily implement them in Theano. Since this post is pretty old, I'm curious if your feelings and experiments still favor their use, or if you've moved on to a new strategy for regularizing the weights. (Oct 11, 2014)
- I still use max col norm constraints when possible. (Oct 11, 2014)