How about this:

If we have a single repeated image in the training set, the best way to get the least reconstruction error is to copy that image directly onto the weight vectors, rescaling the intensity so that the result can pass through the nonlinear function without problems (for example, without saturating it). Gradient descent carries out this copying process iteratively.
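To make the single-image case concrete, here is a minimal sketch. Everything specific in it is my assumption, not something fixed by the argument above: one hidden unit, tied encoder/decoder weights, a sigmoid hidden activation, a linear decoder, squared error, a random 16-pixel "image", and the chosen learning rate. It only checks whether the learned weight vector ends up pointing in the direction of the image.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(16)                   # one flattened 4x4 "image", values in [0, 1)
w = 0.01 * rng.standard_normal(16)   # weights of a single hidden unit (tied weights)
b = 0.0                              # hidden bias


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


lr = 0.05                            # learning rate (assumed value)
for _ in range(5000):
    h = sigmoid(w @ x + b)           # hidden activation, a scalar
    r = h * w - x                    # residual of the reconstruction h * w
    dz = (r @ w) * h * (1.0 - h)     # gradient w.r.t. the pre-activation
    w -= lr * (h * r + dz * x)       # decoder term + encoder term (tied weights)
    b -= lr * dz

# Cosine similarity between the learned filter and the training image.
cosine = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
print(f"cosine(w, x) = {cosine:.4f}")
```

In this toy setting the filter is a rescaled copy of the image: the direction matches, and the scale is whatever makes the sigmoid output times the weights equal the image.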

Now, we know that the training data doesn't consist of a single image. Each image tries to copy itself onto the weight vectors, so each image is in effect pulling the weight vectors towards itself.

This pulling process is repeated over and over through the epochs, slowly and iteratively, through gradient descent.
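The "pulling" can be checked numerically in a small sketch. It uses one hidden unit with tied weights, a sigmoid, a linear decoder, and squared error; the helper `sgd_step`, the five random images, and the learning rate are all hypothetical choices of mine. Starting from the same small weights each time, it takes one gradient step on each image separately and measures how much the projection of the weights onto that image changes.

```python
import numpy as np

rng = np.random.default_rng(1)
images = rng.random((5, 16))          # five different flattened 4x4 "images"
w = 0.01 * rng.standard_normal(16)    # small initial weights, one hidden unit
b = 0.0
lr = 0.05                             # learning rate (assumed value)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sgd_step(w, b, x, lr):
    """One gradient step on the squared reconstruction error of image x."""
    h = sigmoid(w @ x + b)
    r = h * w - x                     # reconstruction residual
    dz = (r @ w) * h * (1.0 - h)      # gradient w.r.t. the pre-activation
    return w - lr * (h * r + dz * x), b - lr * dz


gains = []
for x in images:
    w_new, _ = sgd_step(w, b, x, lr)  # each step starts from the same w
    gains.append(w_new @ x - w @ x)   # change in the projection of w onto x

print(np.round(gains, 3))
```

A positive gain means the step moved the weights towards that image. Over a full epoch these individual pulls are applied one after another (or averaged in a batch), which is the tug-of-war described above.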

However, some regions of the image do repeat a lot across the training data (factors of variation). These regions, like the whole image in the first example, somehow manage to retain a copy on the weight vectors. These are the blob-like regions on the filters.

On top of that, we have many hidden units, and an image can be reconstructed by combining the filter responses of many hidden units. Therefore, the optimization doesn't need to capture all the variation through one hidden unit. So the variations will be distributed among the units, and we will have different blob-like filters corresponding to different hidden units.
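Here is a rough sketch of the multi-unit case, again under assumptions of my own: 12-pixel 1-D "images" built by switching three non-overlapping blobs on or off at random, three hidden units, tied weights, a sigmoid, a linear decoder, and full-batch gradient descent. The names `blobs` and `loss_and_grads` are made up for this example. It trains the model and prints the filters so you can look at how the repeated regions end up spread over the units.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three repeated regions ("blobs") covering pixels 0-3, 4-7 and 8-11.
blobs = np.zeros((3, 12))
for k in range(3):
    blobs[k, 4 * k:4 * k + 4] = 1.0

# Each training image switches every blob on independently with probability 0.5.
on = rng.random((200, 3)) < 0.5
X = on.astype(float) @ blobs              # shape (200, 12)

K = 3                                     # number of hidden units (assumed)
W = 0.01 * rng.standard_normal((K, 12))   # one filter per row, tied weights
b = np.zeros(K)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grads(W, b, X):
    H = sigmoid(X @ W.T + b)              # hidden responses, shape (N, K)
    R = H @ W - X                         # reconstruction residuals, shape (N, 12)
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    dZ = (R @ W.T / len(X)) * H * (1.0 - H)
    grad_W = H.T @ R / len(X) + dZ.T @ X  # decoder term + encoder term
    return loss, grad_W, dZ.sum(axis=0)


initial_loss, _, _ = loss_and_grads(W, b, X)
for _ in range(5000):
    _, grad_W, grad_b = loss_and_grads(W, b, X)
    W -= 0.02 * grad_W                    # learning rate is an assumed value
    b -= 0.02 * grad_b
final_loss, _, _ = loss_and_grads(W, b, X)

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print(np.round(W, 2))                     # rows are the learned filters
```

Nothing here guarantees a clean one-blob-per-unit split; which unit picks up which region depends on the initialization, and units can share a region. The point is only that the reconstruction is a sum over units, so no single filter has to hold all the variation.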