Profile

Behrang Mehrparvar
Works at University of Houston
Attends University of Houston
89 followers · 85,600 views

Stream

The mechanism is, unfortunately, exactly this. #TheFutureIsNow
1 comment on original post
 
Visualizing hidden units for qualitative interpretation is an interesting problem. Visualizing the weights tells us how sensitive the activation is to each input feature, but only if the unit were a linear map. Adding nonlinearity makes things more complicated: the activation is not only more sensitive to some features than others, it is also sensitive to a specific range of values of each feature. So we cannot say that the activation is more sensitive to feature A than to feature B; it depends on which ranges of them we are comparing. In my opinion, then, visualizing the weights cannot serve as a qualitative interpretation of a hidden unit either. Any comments, ideas, or opinions? Is there any other way to interpret what a hidden unit is capturing?
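For concreteness, here is a toy sketch of the point about value ranges (my own illustration, using a single sigmoid unit rather than any particular network): the gradient of the activation with respect to the input is sigmoid'(w·x)·w, so how sensitive the unit is to a feature depends on where in input space you evaluate it, not just on the weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, 1.0])                        # feature 0 carries the larger weight

def sensitivity(x):
    z = w @ x
    return sigmoid(z) * (1.0 - sigmoid(z)) * w  # dh/dx, one entry per feature

print(sensitivity(np.array([0.1, 0.1])))   # near the linear regime: sensitivities follow the weights
print(sensitivity(np.array([3.0, 0.1])))   # saturated regime: even the heavily weighted feature barely moves the activation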
5 comments
 
I am glad that my question triggered this line of thought. I have been thinking about the same question as well. A hidden unit's activation indicates a direction of variation. Visualization helps us investigate the "relative sensitivity" of that variation (or pattern in the input) with respect to each raw feature (pixel). If a pixel has a high value in the visualization image, it means that the direction of variation corresponding to that hidden unit is highly sensitive to that pixel; in other words, a little change in that pixel's value results in a bigger jump in the activation of the hidden unit. The activation itself is the intensity of the corresponding pattern detected in the input space.
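To make the "relative sensitivity" reading concrete, here is a small autograd sketch (the weights and the unit index are placeholders, not anyone's actual model): the gradient of one hidden activation with respect to the input gives a per-pixel sensitivity map for that input.

import torch

torch.manual_seed(0)
W = torch.randn(100, 784)                 # hypothetical weights: 100 hidden units over 28x28 inputs
x = torch.rand(784, requires_grad=True)   # one input image, flattened

h = torch.sigmoid(W @ x)                  # hidden activations
h[7].backward()                           # gradient of one chosen unit w.r.t. every pixel

saliency = x.grad.abs().reshape(28, 28)   # per-pixel sensitivity map for this particular input
print(saliency.max().item(), saliency.mean().item())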
 
If we have the network, the representation of a single sample, and the error for that sample, can we reconstruct that input sample exactly, without error? If so, how?
2 comments
 
Most of the layers (fully connected, convolutional, ...) aren't invertible either.
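As a tiny illustration of why (mine, not the commenter's): ReLU and narrowing linear layers both map distinct inputs to the same output, so the information needed for exact reconstruction is already gone after one such layer.

import numpy as np

relu = lambda v: np.maximum(v, 0.0)
print(relu(np.array([-1.0, 2.0])), relu(np.array([-5.0, 2.0])))   # two different inputs, same output

W = np.array([[1.0, 1.0]])                                        # a 2 -> 1 linear layer
print(W @ np.array([3.0, -1.0]), W @ np.array([1.0, 1.0]))        # again identical outputs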
 
OK, here is the problem: on one hand I have backpropagation, which I can understand completely. On the other hand I can see the visual images of features that we get from, say, MNIST. I just cannot see the connection. Why does backpropagation result in such well-shaped features? I cannot accept it as just a black box, and whatever I read is a high-level introduction to what is happening. Can anyone suggest a paper that gets deep into this?
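(For anyone unfamiliar with the pictures I mean: they are usually produced by reshaping each first-layer weight vector of an MNIST network into a 28x28 image. A sketch, with a random matrix standing in for the trained weights:)

import numpy as np
import matplotlib.pyplot as plt

W1 = np.random.randn(64, 784)                  # placeholder for trained first-layer weights

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, w in zip(axes.ravel(), W1):
    ax.imshow(w.reshape(28, 28), cmap="gray")  # one "feature image" per hidden unit
    ax.axis("off")
plt.show()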
12 comments
 
Actually, my gravity-based example may be counterintuitive for non-linear problems. Water potentially moves uphill. Streams may conditionally join and split in non-river-like ways, depending on the source of the flow. The idea of backpropagation eroding the path that was correctly trained, though, I think may still be a helpful analogy.
 
Isn't sparsity at odds with a "distributed" representation?
31 comments
 
Referring to the paper below, it seems they are two opposing concepts.

https://www.cs.toronto.edu/~hinton/absps/rgbn.pdf
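A toy way to see the tension (my framing, not from the paper): a local code uses one unit per item, a dense distributed code uses most units for every item, and a sparse distributed code sits in between, with few units active per item but each unit still reused across many items.

import numpy as np

rng = np.random.default_rng(0)
n_items, n_units = 50, 50

local  = np.eye(n_items, n_units)                              # one dedicated unit per item
dense  = (rng.random((n_items, n_units)) > 0.1).astype(float)  # ~90% of units active per item
sparse = (rng.random((n_items, n_units)) > 0.9).astype(float)  # ~10% of units active per item

for name, code in [("local", local), ("dense", dense), ("sparse", sparse)]:
    print(name,
          "units active per item:", code.sum(axis=1).mean(),
          "items per unit:", code.sum(axis=0).mean())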

 
Why does depth help in CNNs? What does "convolution of a convolution" mean? Why not just a single convolution/pooling layer and many fully connected layers on top of it?
4 comments
 
Second that. The second conv layer has 3D filters, so a "single" filter is being applied to multiple feature maps from the first conv layer. Sometimes the third dimension spans all the feature maps, like Sander said; sometimes combinations are chosen (see page 8 of Yann LeCun's seminal work, "Gradient-Based Learning Applied to Document Recognition").
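A shape-level sketch of "convolution of a convolution" (layer sizes here are just LeNet-like examples, not anyone's actual network): the second layer's filters extend over all of the first layer's feature maps, so they detect patterns of patterns.

import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=6,  kernel_size=5)   # 6 filters over the input image
conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)   # each filter here is 6 x 5 x 5

x = torch.randn(1, 1, 32, 32)
h1 = torch.relu(conv1(x))          # shape (1, 6, 28, 28): first-layer feature maps
h2 = torch.relu(conv2(h1))         # shape (1, 16, 24, 24): features of features
print(conv2.weight.shape)          # torch.Size([16, 6, 5, 5]) -- the "3D filters"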
 
There is a very interesting claim in this paper [Intriguing properties of neural networks], though the paper doesn't explain it fully. Can anyone elaborate on the concept, please, or refer me to related papers?

"... This puts into question the conjecture that neural networks disentangle variation factors across coordinates. Generally, it seems that it is the entire space of activations, rather than the individual units, that contains the bulk of the semantic information. ..."
19 comments
 
Here is a quote from Andrej Karpathy about this issue: "it is more appropriate to think of multiple ReLU neurons as the basis vectors of some space that represents in image patches. In other words, the visualization is showing the patches at the edge of the cloud of representations, along the (arbitrary) axes that correspond to the filter weights. This can also be seen by the fact that neurons in a ConvNet operate linearly over the input space, so any arbitrary rotation of that space is a no-op."

You can find more here: http://cs231n.github.io/understanding-cnn/
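A quick numerical check of the rotation argument (my own illustration): rotate the space a linear layer reads from and rotate the weights with it, and every activation is unchanged, so the individual unit axes carry no privileged meaning.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 64))                    # 10 "units" reading a 64-d activation space
a = rng.standard_normal(64)                          # some activation vector

Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # a random rotation (orthogonal matrix)

print(np.allclose(W @ a, (W @ Q.T) @ (Q @ a)))       # True: the rotation is a no-op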
 
Is there any specific paper that claims that hidden units in higher layers are more class-specific? Is this just a hypothesis or assumption, or is there any proof for it?
10 comments
 
I believe you will find this paper relevant.
http://arxiv.org/abs/1312.6199
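One simple, hand-rolled way to probe the claim empirically (my own suggestion, not from the paper): measure, per unit, how much its mean activation on its preferred class exceeds its average activation, and compare that selectivity index across layers.

import numpy as np

def class_selectivity(activations, labels):
    """activations: (n_samples, n_units), non-negative; labels: (n_samples,) class ids."""
    classes = np.unique(labels)
    per_class_mean = np.stack([activations[labels == c].mean(axis=0) for c in classes])
    best = per_class_mean.max(axis=0)              # response to the preferred class, per unit
    rest = per_class_mean.mean(axis=0)             # average response across classes, per unit
    return (best - rest) / (best + rest + 1e-8)    # ~0 = class-agnostic, near 1 = class-specific

# usage idea: compute this for layer-1 and for layer-5 activations and compare the distributions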
 
Regarding visualizing activations of higher-layer hidden units, one method is to maximize the activation. Has anyone tried this method, mentioned in [Visualizing Higher-Layer Features of a Deep Network]? Why do they set a norm constraint on the image there? Doesn't the hyper-parameter setting affect the activation? Is there a better method for visualizing activations of hidden units?
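For reference, a sketch of activation maximization with that norm constraint (the network here is a stand-in, and rho and the step size are exactly the hyper-parameters being asked about): gradient ascent on the input, projected back onto the sphere ||x|| = rho after each step.

import torch

torch.manual_seed(0)
net = torch.nn.Sequential(                      # placeholder for a trained network
    torch.nn.Linear(784, 256), torch.nn.Sigmoid(),
    torch.nn.Linear(256, 64), torch.nn.Sigmoid())

unit, rho, lr = 3, 10.0, 0.5
x = torch.randn(784)
x = rho * x / x.norm()                          # start on the sphere ||x|| = rho

for _ in range(200):
    x = x.detach().requires_grad_(True)
    act = net(x)[unit]                          # activation of the chosen higher-layer unit
    act.backward()
    with torch.no_grad():
        x = x + lr * x.grad                     # gradient ascent on the input
        x = rho * x / x.norm()                  # project back onto the norm constraint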
5 comments
 
Also, the samples lying on ||x|| = p are completely random. By maximizing the activation under this constraint, how do we expect to find a meaningful interpretation?
 
Is saturation good?
Does saturation imply sparsity, or vice versa?
Does the continuous "value" of the hidden units really matter, or is binary good enough?
5 comments
 
They are pushing the nonlinearity into its flat region, where the gradient is zero (Figure 1 of the paper) and gradient-based learning does not happen.
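A quick look at that flat region (toy numbers, not from the paper): once a sigmoid unit saturates, its gradient is essentially zero, so backpropagation can no longer move it.

import numpy as np

sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"pre-activation {z:5.1f}: activation {sigmoid(z):.4f}, gradient {dsigmoid(z):.6f}")
# the gradient falls from 0.25 at z = 0 to about 0.000045 at z = 10 -- the unit is stuck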
 
Can anyone explain the co-adaptation between units mentioned in the dropout paper, in terms of back-propagation? What exactly does it mean?
9 comments
 
It encourages sparsity. It is similar in some ways to dropout, in that you are encouraging only some units to be active, but there is no other constraint to keep the units from co-adapting except the fact that you want to encourage solutions with fewer activations, which should force the network into learning simpler relationships. But dropout seems to work much better in practice.
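For contrast, a minimal sketch of (inverted) dropout at training time, the mechanism the comment is comparing against: each unit is randomly silenced, so no unit can rely on a specific set of co-active partners being present.

import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    if not training:
        return h                              # at test time the full layer is used
    mask = rng.random(h.shape) >= p_drop      # which units survive this forward pass
    return h * mask / (1.0 - p_drop)          # rescale so the expected activation is unchanged

h = np.array([0.2, 1.3, 0.0, 0.7, 2.1])
print(dropout(h))                             # a different random subset is zeroed on every call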
 
I trained an auto-encoder and then randomly re-initialized some filters (weight vectors) after some epochs. Surprisingly, after continuing training, I realized that those filters (at least visually) learn the same patterns as before. Can anyone explain this?
10 comments
 
The definition of "some" filters is ambiguous. You can see it in the cost if you decompose W into (W + dW(some filters)); here dW is a function of which filters you pick. You can actually vary "some" over various permissible ranges and see how much the network can tolerate. You can also shuffle all hidden nodes randomly and see how far it goes; that, I guess, would be the extreme case.

Only after doing these kinds of experiments could we conclude what is really happening. Once you have the empirical results, then it is worth your time to do a theoretical analysis, I believe. Why waste time if the empirical results don't show that something is really happening? But a minimal version would be tuning "some" and scaling dW.
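A sketch of how that experiment could be run end to end (sizes, data, and the reset indices are all placeholders): snapshot the chosen filters, re-initialize them, keep training, then compare.

import torch

torch.manual_seed(0)
enc = torch.nn.Linear(784, 64)
dec = torch.nn.Linear(64, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train(batches, n_epochs):
    for _ in range(n_epochs):
        for x in batches:                                  # each x is a (batch, 784) tensor
            loss = ((dec(torch.sigmoid(enc(x))) - x) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

batches = [torch.rand(32, 784) for _ in range(100)]        # stand-in for MNIST batches
train(batches, n_epochs=5)

reset = [3, 10, 42]                                        # the "some filters" being re-initialized
before = enc.weight[reset].detach().clone()
with torch.no_grad():
    enc.weight[reset] = torch.randn(len(reset), 784) * 0.01

train(batches, n_epochs=5)                                 # continue training after the reset
after = enc.weight[reset].detach()
print(torch.nn.functional.cosine_similarity(before, after, dim=1))   # do the old patterns come back?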
Basic Information
Gender
Male
Work
Occupation
PhD Student
Employment
  • University of Houston
    TA, 2011 - present
Education
  • University of Houston
    PhD, 2011 - present
  • KIAU
    BS Computer Engineering, 2003 - 2008
  • IUST
    MS, 2008 - 2011