Caffe CNN: diversity of filters within a conv layer [closed] - neural-network

I have the following theoretical questions regarding the conv layer in a CNN. Imagine a conv layer with 6 filters (conv1 layer and its 6 filters in the figure).
1) What guarantees the diversity of learned filters within a conv layer? (I mean, how does the learning (optimization) process make sure that it does not learn the same or similar filters?)
2) Is diversity of filters within a conv layer a good thing or not? Is there any research on this?
3) During learning (the optimization process), is there any interaction between the filters of the same layer? If yes, how?

1.
Assuming you train your net with SGD (or a similar backprop variant), the random initialization of the weights encourages diversity: the gradient of the loss with respect to each randomly initialized filter is usually different, so the updates "pull" the filters in different directions, resulting in diverse filters.
However, there is nothing that guarantees diversity. In fact, sometimes filters become tied to each other (see GrOWL and references therein) or drop to zero.
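To make the role of random initialization concrete, here is a minimal sketch under simplifying assumptions: a toy model with two "filters" (plain weight vectors), a tanh activation, fixed and identical downstream weights `v`, and a single training example. None of these names or values come from the question; they are only for illustration.

```python
import math
import random

def train_two_filters(w, v, x, target, lr=0.1, steps=50):
    """Gradient descent on 0.5 * (y - target)**2 with y = sum_k v[k] * tanh(w[k] . x)."""
    w = [list(row) for row in w]  # copy so the caller's lists are untouched
    for _ in range(steps):
        pre = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        act = [math.tanh(p) for p in pre]
        err = sum(vk * a for vk, a in zip(v, act)) - target
        for k, row in enumerate(w):
            # dL/dw[k] = err * v[k] * (1 - tanh^2(w[k] . x)) * x
            scale = err * v[k] * (1.0 - act[k] ** 2)
            for i in range(len(row)):
                row[i] -= lr * scale * x[i]
    return w

x = [1.0, -2.0, 0.5]
v = [1.0, 1.0]  # identical downstream weights, kept fixed for simplicity

# Identical start: both filters receive the same gradient at every step.
same = train_two_filters([[0.1, 0.2, 0.3]] * 2, v, x, target=1.0)

# Random start: the gradients differ, so the filters evolve differently.
rng = random.Random(0)
rand_init = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(2)]
diff = train_two_filters(rand_init, v, x, target=1.0)

print(same[0] == same[1])  # whether the identically initialized filters are still equal
print(diff[0] != diff[1])  # whether the randomly initialized filters are still distinct
```

In this toy setting, identical filters can never separate because nothing breaks the symmetry between them, which is the flip side of the argument above.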
2.
Of course you want your filters to be as diverse as possible, so that they capture many different aspects of your data. Suppose your first layer only had filters responding to vertical edges: how would your net cope with classes containing horizontal edges (or other types of textures)?
Moreover, if several of your filters are the same, why compute the same responses twice? This is highly inefficient.
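One rough way to check for such duplicated filters is to compare the flattened filters pairwise with cosine similarity. The sketch below is only an illustration: the function names, the 0.95 threshold and the toy 3x3 filters are assumptions, not something taken from Caffe.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two flattened filters."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def redundant_pairs(filters, threshold=0.95):
    """Return index pairs whose filters point in (almost) the same direction."""
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(filters), 2)
            if cosine(a, b) >= threshold]

# Three flattened 3x3 filters: two vertical-edge detectors and one horizontal-edge detector.
vertical = [-1, 0, 1, -1, 0, 1, -1, 0, 1]
vertical_2 = [-2, 0, 2, -2, 0, 2, -2, 0, 2]  # same pattern, different scale
horizontal = [-1, -1, -1, 0, 0, 0, 1, 1, 1]

pairs = redundant_pairs([vertical, vertical_2, horizontal])
print(pairs)
```

The two vertical-edge filters are flagged as a redundant pair, while the horizontal one is orthogonal to both.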
3.
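With "out-of-the-box" optimizers, nothing explicitly couples the filters of a layer: each filter is updated using only the gradient of the loss with respect to its own weights (linearity of the gradient). The filters still influence each other indirectly, because later layers combine their outputs. However, one can use more sophisticated loss functions or regularization methods to make them explicitly dependent.
For instance, group Lasso regularization can force some of the filters to zero while keeping the others informative.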
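As a sketch of how group Lasso treats a filter as a unit, the snippet below computes the penalty (the sum of per-filter L2 norms) and applies the standard group soft-thresholding step. Treating each filter as one group, and the names `group_lasso_penalty` and `group_soft_threshold`, are assumptions made for illustration.

```python
import math

def group_lasso_penalty(filters, lam):
    """lam * sum of the L2 norms of the filters (one group per filter)."""
    return lam * sum(math.sqrt(sum(w * w for w in f)) for f in filters)

def group_soft_threshold(filters, step):
    """Proximal step for the penalty above: shrink each filter's norm by `step`,
    zeroing the whole filter when its norm is not larger than `step`."""
    out = []
    for f in filters:
        norm = math.sqrt(sum(w * w for w in f))
        scale = max(0.0, 1.0 - step / norm) if norm > 0 else 0.0
        out.append([scale * w for w in f])
    return out

filters = [[3.0, 4.0], [0.3, 0.4]]  # toy filters with norms 5.0 and 0.5
penalty = group_lasso_penalty(filters, lam=0.1)
shrunk = group_soft_threshold(filters, step=1.0)
print(penalty, shrunk)
```

The weak filter is set to zero as a whole, while the strong one is only scaled down; this is the sense in which the regularizer acts on entire filters rather than on individual weights.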


In this case, what's better: classification or clustering? [closed]

I collected data from different sources (FB, Twitter, LinkedIn) and put it into a structured format. As a result, I now have a CSV file with 10,000 rows (one per person) containing their names, ages, interests and buying habits.
I'm really stuck on this step: CLASSIFICATION or CLUSTERING. For classification, I don't really have predefined classes or a model to classify my users with.
For clustering, I started by calculating similarities and running KMeans, but I still can't get the result I want. How can I decide which to choose before moving on to the next step, collaborative filtering?
First, you have to understand that clustering is a pre-processing task. The idea in clustering is to identify objects with similar properties and group them. You can think of it as cattle herding: the herder gathers loose cattle (read: data points) into groups.
Note: the partitioning family of clustering algorithms includes k-means, k-modes, k-prototypes, etc. K-means works only for numerical data, k-modes works only for categorical data, and k-prototypes works for mixed numerical and categorical data.
Question: is the data preprocessed? If the answer is no, you may try the following steps:
Is the data (column values) all categorical (=text) format or numerical or mixed?
a. If all categorical, then discretize, bin or interval-scale them.
b. If mixed, then discretize, bin or interval-scale the categorical values only.
c. Perform missing value and outlier treatment for both numerical and categorical data. This will help in retaining maximum variance as well as reduce dimensionality.
d. Normalize the numerical values to a median of zero.
Now apply a suitable clustering algorithm (based on your problem) to find patterns. Once you have found the patterns, you can label them. Once the patterns are labelled, a classification algorithm can be used to assign any new incoming data point to the appropriate class.
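The workflow above (preprocess, cluster, then assign new points) can be sketched in a few lines. This is a toy illustration, not a recommendation: the two numeric features are hypothetical, the k-means is a plain Lloyd's iteration with the first k points as initial centroids, and a nearest-centroid rule stands in for the later classification step.

```python
import math
import statistics

def median_center(rows):
    """Subtract the per-column median so every column has median zero."""
    medians = [statistics.median(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, medians)] for row in rows], medians

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [[statistics.fmean(col) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

# Hypothetical numeric features per person: (age, purchases per month).
raw = [[22, 1], [58, 9], [25, 2], [61, 8], [23, 1], [60, 10]]
data, medians = median_center(raw)
centroids = kmeans(data, k=2)
labels = [nearest(p, centroids) for p in data]

# A new person is preprocessed the same way, then assigned to the closest cluster.
new_person = [v - m for v, m in zip([59, 9], medians)]
new_label = nearest(new_person, centroids)
print(labels, new_label)
```

With real data you would also need to handle categorical columns (for example with k-modes or k-prototypes, as noted above) and choose k by experiment.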

How to choose the number of filters in each Convolutional Layer? [closed]

When building a convolutional neural network, how do you determine the number of filters used in each convolutional layer? I know that there is no hard rule about the number of filters, but from your experience, papers you have read, etc., is there an intuition or observation about the number of filters to use?
For instance (I'm just making this up as example):
use more/less filters as the network gets deeper.
use larger/smaller filter with large/small kernel size
If the object of interest in the image is large/small, use ...
As you said, there are no hard rules for this.
But you can get inspiration from VGG16 for example.
It doubles the number of filters from one block of conv layers to the next (64, 128, 256, 512), up to a maximum of 512.
For the kernel size, I usually keep 3x3 or 5x5.
But, you can also take a look at Inception by Google.
They use several kernel sizes in parallel, then concatenate the results. Very interesting.
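To get a feel for what these choices cost, here is a small sketch that lists VGG-style filter counts and counts the parameters of a single conv layer. The helper names and the default arguments are assumptions for illustration only.

```python
def vgg_style_filters(n_blocks, first=64, cap=512):
    """Filter count per block: double after every block until `cap` is reached."""
    return [min(first * 2 ** i, cap) for i in range(n_blocks)]

def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one conv layer with square kernels."""
    return kernel * kernel * c_in * c_out + c_out

widths = vgg_style_filters(5)               # VGG16 has five conv blocks
first_layer = conv_params(3, 3, widths[0])  # 3x3 kernels on an RGB input

# Same channel counts, different kernel sizes.
params_3x3 = conv_params(3, 64, 64)
params_5x5 = conv_params(5, 64, 64)
print(widths, first_layer, params_3x3, params_5x5)
```

The parameter count grows linearly with the number of filters but quadratically with the kernel side, which is one reason small kernels are a popular default.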
As far as I know, there is no fixed depth for the convolutional layers. Just several suggestions:
In CS231n they mention that using 3 x 3 or 5 x 5 filters with a stride of 1 or 2 is a widely used practice.
How many of them: Depends on the dataset. Also, consider using fine-tuning if the data is suitable.
How will the dataset affect the choice? That is a matter of experiment.
What are the alternatives? Have a look at the Inception and ResNet papers for approaches which are close to the state of the art.

Questions about word embedding(word2vec) [closed]

I am trying to understand the word2vec (word embedding) architecture, and I have a few questions about it:
First, why is the word2vec model considered a log-linear model? Is it because it uses a softmax at the output layer?
Second, why does word2vec remove the hidden layer? Is it just because of computational complexity?
Third, why does word2vec not use an activation function (as compared to NNLM, the Neural Network Language Model)?
First, why is the word2vec model a log-linear model? Because it uses a softmax at the output layer?
Exactly, softmax is a log-linear classification model. The intent is to obtain values at the output that can be interpreted as a posterior probability distribution.
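A short sketch of why softmax is called log-linear: the log-probability of each output is its linear score minus a shared normalizer, so the log-odds between two outputs equal the difference of their scores. The scores below are made-up numbers.

```python
import math

def log_softmax(scores):
    """log p_i = s_i - log(sum_j exp(s_j)), computed in a numerically stable way."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

scores = [2.0, 0.5, -1.0]  # hypothetical linear scores, one per output word
log_p = log_softmax(scores)

# Log-odds between two outputs depend only on the difference of their linear scores.
gap = log_p[0] - log_p[1]
total = sum(math.exp(lp) for lp in log_p)
print(gap, total)
```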
Second, why does word2vec remove the hidden layer? Is it just because of computational complexity?
Third, why doesn't word2vec use an activation function, compared to NNLM (Neural Network Language Model)?
I think your second and third question are linked in the sense that an extra hidden layer and an activation function would make the model more complex than necessary. Note that while no activation is explicitly formulated, we could consider it to be a linear classification function. It appears that the dependencies that the word2vec models try to model can be achieved with a linear relation between the input words.
Adding a non-linear activation function allows the neural network to represent more complex functions, which could in turn fit the input to something more complex that does not retain the dependencies word2vec seeks.
Also note that linear outputs don't saturate which facilitates gradient-based learning.
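For illustration, a CBOW-style forward pass can be written without any hidden non-linearity: average the context vectors, take linear scores against the output vectors, and apply a softmax. This is a toy sketch with a made-up 4-word vocabulary and 2-dimensional embeddings, not the reference word2vec implementation (which also uses tricks such as negative sampling or hierarchical softmax).

```python
import math

def cbow_probs(context_ids, emb_in, emb_out):
    """CBOW-style forward pass: average input vectors, linear scores, softmax."""
    dim = len(emb_in[0])
    # Projection step: a plain average, no activation function.
    h = [sum(emb_in[i][d] for i in context_ids) / len(context_ids) for d in range(dim)]
    # One linear score per vocabulary word.
    scores = [sum(hd * od for hd, od in zip(h, out)) for out in emb_out]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 4-word vocabulary with 2-dimensional embeddings (values are made up).
emb_in = [[0.1, 0.3], [0.2, -0.1], [-0.4, 0.2], [0.0, 0.5]]
emb_out = [[0.3, 0.1], [-0.2, 0.4], [0.1, 0.1], [0.5, -0.3]]
probs = cbow_probs([0, 2], emb_in, emb_out)
print(probs)
```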

Bag of Words Representation [closed]

I would like to implement a bag of words representation for my project. I computed the codebook of visual words for the images from their features and descriptors. Then I obtained cluster centers using k-means. For the bag of words representation part, we are asked to use the manually labeled segments provided as part of the dataset. In the dataset, there are three different binary masks for each image. Are those binary masks the labeled segments? If so, how should I use the computed visual words?
The bag of words approach provides a concise representation of an image or a part of an image. That representation is typically used as an input to a classification algorithm which is used to estimate the class to which the image data belongs. Typically, the classifier is a supervised learning method which will require pairs (descriptor, label) from some training set during the training process. In your case, the descriptor is the BOW representation of the image data from your training set. Then, during testing you will feed the BOW descriptor of new image data to the classifier to infer the class.
From what I understand, the fact that you have three different masks for the images, means that you also have three classes. Then, each mask will tell you which part of an image should be considered image data belonging to a particular class. This is your training data.
Under that assumption, you should extract the parts of the images that correspond to each mask, compute the BOW representation for those image parts (separately for each mask) and use those with the mask number as a label to train the classifier.
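A minimal sketch of that step, assuming keypoints are given as (row, col) positions, descriptors as plain lists of floats, and the codebook as the k-means cluster centers. All names and the toy values are hypothetical.

```python
import math

def bow_histogram(descriptors, codebook):
    """Count how many descriptors fall nearest to each visual word, then L1-normalize."""
    counts = [0] * len(codebook)
    for d in descriptors:
        word = min(range(len(codebook)), key=lambda k: math.dist(d, codebook[k]))
        counts[word] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

def descriptors_in_mask(keypoints, descriptors, mask):
    """Keep descriptors whose (row, col) keypoint lies on a nonzero mask pixel."""
    return [d for (r, c), d in zip(keypoints, descriptors) if mask[r][c]]

codebook = [[0.0, 0.0], [1.0, 1.0]]  # two toy visual words
keypoints = [(0, 0), (0, 1), (1, 1)]
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8]]
mask = [[1, 0],
        [0, 1]]  # binary mask for one class

hist = bow_histogram(descriptors_in_mask(keypoints, descriptors, mask), codebook)
print(hist)
```

Each (histogram, mask number) pair would then be one training example for the classifier.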
This will later allow you to use, for example, a sliding window approach to classify parts of a test image as belonging to one of the 3 classes used during training. That would be a simple case of a detection problem.
I am not sure I understood your problem correctly, but I hope that this will help you move forward a bit.

Can I use neural network in this case? [closed]

Can I use neural networks, an SVM, etc. if my output is a vector of 27,680 values in which all are zero and just one is one?
I mean, is it right to do this?
When I use an SVM I get this error:
Error using seqminopt>seqminoptImpl (line 198)
No convergence achieved within maximum number of iterations.
SVMs are usually binary classifiers. Basically, that means they separate your data points into two groups, which signals whether a data point does or doesn't belong to a class. Common strategies for solving multi-class problems with SVMs are one-vs-rest and one-vs-one. In the case of one-vs-rest, you would train one classifier per class, which would be 27,680 for you. In the case of one-vs-one, you would train (K over 2) = (K(K-1))/2 classifiers, so in your case around 383 million. As you can see, both numbers are rather high, so I would be pessimistic about your chances of successfully solving your problem with SVMs.
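The two counts can be checked with a couple of lines; `K` is the number of classes from the question, and for K = 27,680 the one-vs-one count comes out at roughly 383 million.

```python
import math

K = 27680                     # number of classes from the question
one_vs_rest = K               # one binary classifier per class
one_vs_one = math.comb(K, 2)  # one classifier per unordered pair of classes: K(K-1)/2
print(one_vs_rest, one_vs_one)
```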
Nevertheless, you can try to increase the maximum number of iterations, as described in another Stack Overflow thread. Maybe it still works.
You can use neural nets for your task, and a 1-of-K output is nothing unusual. However, even with only one hidden layer of 500 neurons (and using the input and output vector sizes mentioned in your comment) you will have (27680*2*500) + (500*27680) = 41,520,000 weights in your network. So I would expect rather long training times (although a Google employee would probably laugh at these numbers). You will also most likely need a lot of training examples, unless your input is really simple.
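The weight count is easy to reproduce. The layer sizes below are the ones quoted above; the bias terms are an addition that the figure in the text leaves out.

```python
n_in, n_hidden, n_out = 27680 * 2, 500, 27680  # layer sizes quoted above
weights = n_in * n_hidden + n_hidden * n_out   # connection weights only
biases = n_hidden + n_out                      # not included in the quoted figure
print(weights, biases, weights + biases)
```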
As an alternative you might look into Decision Trees/Random Forests, Naive Bayes or kNN.