Questions about word embedding(word2vec) [closed] - neural-network

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am trying to understand the word2vec (word embedding) architecture, and I have a few questions about it:
First, why is the word2vec model considered a log-linear model? Is it because it uses a softmax at the output layer?
Second, why does word2vec remove the hidden layer? Is it just because of computational complexity?
Third, why does word2vec not use an activation function, as compared to NNLM (Neural Network Language Model)?

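first, why is the word2vec model a log-linear model? because it uses a softmax at the output layer?
Essentially, yes. The softmax is a log-linear classification model: the log of each output probability is a linear function of the softmax input, up to a normalising constant. The intent is to obtain values at the output that can be interpreted as a posterior probability distribution.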
second, why does word2vec remove the hidden layer? is it just because of computational complexity?
third, why doesn't word2vec use an activation function, compared to NNLM (Neural Network Language Model)?
I think your second and third questions are linked, in the sense that an extra hidden layer and an activation function would make the model more complex than necessary. Note that while no activation is explicitly formulated, we can consider it to be a linear (identity) function. It appears that the dependencies that the word2vec models try to capture can be expressed with a linear relation between the input words.
Adding a non-linear activation function allows the neural network to represent more complex functions, which could in turn lead it to fit the input to something more complex that doesn't retain the dependencies word2vec seeks.
Also note that linear outputs don't saturate, which facilitates gradient-based learning.
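To make the log-linear point concrete, here is a minimal sketch of a skip-gram style forward pass in plain Python. The vocabulary size, embedding dimension, and the names `W_in`, `W_out`, and `forward` are assumptions chosen for illustration; this is not the original word2vec implementation. The projection layer is just a row lookup with no activation, so the log-probability of each output word is a linear score minus a normalising constant.

```python
import math
import random

random.seed(0)
V, D = 5, 3  # hypothetical vocabulary size and embedding dimension

# Input (projection) and output weight matrices, randomly initialised.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

def forward(center):
    """Return (scores, log_probs) over the vocabulary for one input word."""
    h = W_in[center]  # projection layer: a lookup, no non-linearity
    # Output scores are plain dot products, i.e. linear in h.
    scores = [sum(w * x for w, x in zip(row, h)) for row in W_out]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
    log_probs = [s - log_z for s in scores]  # log p = linear score - log Z
    return scores, log_probs

scores, log_probs = forward(center=2)
probs = [math.exp(lp) for lp in log_probs]
```

The difference `scores[j] - log_probs[j]` is the same for every word `j`, which is exactly what "log-linear" means here: taking the log of the softmax output recovers the linear scores up to a shared constant.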

Related

Which activation function should be used at intermediate Layers in a Time Series Prediction Task [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I want to build a time series prediction model using LSTM.
Which activation function should be used at the intermediate layers?
Is a linear activation function good for the final (output) layer?
I am normalising my input data to the range (0, 1) and inverting the normalisation after prediction.
Here is my model:
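from keras.models import Sequential  # or tensorflow.keras.models
from keras.layers import LSTM, Dense, Activation
# input_n, n_features and output_n are assumed to be defined earlier
model = Sequential()
model.add(LSTM(32, input_shape=(input_n, n_features), return_sequences=True, activation='relu'))
# input_shape is only needed on the first layer; later layers infer it
model.add(LSTM(32, return_sequences=True, activation='relu'))
# Dense acts on every time step because return_sequences=True above
model.add(Dense(output_n))
model.add(Activation("linear"))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()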
Here I have used 'relu' in the intermediate layers and a linear activation at my output layer.
Is this approach correct, or should I also try tanh and sigmoid in the intermediate layers?
What will happen if I do not use any activation function in the intermediate layers? Will the LSTM take care of this?
After all, an LSTM already has tanh and sigmoid activation functions for its internal gate calculations.
Word of warning: this is my subjective impression, which is mostly (but not completely) backed by scientific research.
I can confirm that ReLU and its variants (PReLU, Leaky ReLU, etc.) have produced the best results for me in the past.
Which of those variants will produce the best results for you is probably best determined by trying them out, if you can afford to do so.
ReLU is generally a good default activation function for deep learning models, and it adds non-linearity. Note that it does not normalize anything: its output lies in [0, ∞), not in [0, 1].
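As a quick check of the output ranges of the activations discussed here, the following sketch evaluates ReLU, sigmoid, and tanh on a few sample inputs. The helper names and the sample values are assumptions for illustration only; the point is that ReLU is unbounded above, while sigmoid and tanh squash their input into a fixed interval.

```python
import math

def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]  # arbitrary sample inputs

relu_out = [relu(x) for x in inputs]        # non-negative, unbounded above
sigmoid_out = [sigmoid(x) for x in inputs]  # strictly between 0 and 1
tanh_out = [math.tanh(x) for x in inputs]   # strictly between -1 and 1
```

This matters when the targets are scaled to (0, 1): a ReLU layer can still emit values above 1, so any bounding of the prediction has to come from the data scaling or the output layer, not from ReLU itself.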

Caffe CNN: diversity of filters within a conv layer [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I have the following theoretical questions regarding the conv layer in a CNN. Imagine a conv layer with 6 filters (conv1 layer and its 6 filters in the figure).
1) What guarantees the diversity of learned filters within a conv layer? (I mean, how does the learning (optimization process) make sure that it does not learn the same (or similar) filters?)
2) Is diversity of filters within a conv layer a good thing or not? Is there any research on this?
3) During the learning (optimization process), is there any interaction between the filters of the same layer? If yes, how?
1.
Assuming you are training your net with SGD (or a similar backprop variant), the random initialization of the weights encourages them to be diverse: since the gradient of the loss w.r.t. each random filter is usually different, the gradient will "pull" the weights in different directions, resulting in diverse filters.
However, there is nothing that guarantees diversity. In fact, sometimes filters become tied to each other (see GrOWL and references therein) or drop to zero.
2.
Of course you want your filters to be as diverse as possible to capture all sorts of different aspects of your data. Suppose your first layer will only have filters responding to vertical edges, how is your net going to cope with classes containing horizontal edges (or other types of textures)?
Moreover, if you have several filters that are the same, why compute the same responses twice? This is highly inefficient.
3.
Using "out-of-the-box" optimizers, the filters of each layer are learned independently of each other (linearity of gradient). However, one can use more sophisticated loss functions or regularization methods to make them dependent.
For instance, group Lasso regularization can force some of the filters to zero while keeping the others informative.
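The group Lasso idea mentioned above can be sketched as a penalty term that treats each filter as one group. This is a minimal illustration in plain Python, not the GrOWL method and not any particular framework's regularizer; the function name, the strength `lam`, and the toy filter values are assumptions.

```python
import math

def group_lasso_penalty(filters, lam=0.01):
    """Sum of the L2 norms of the filters (one group per filter), times lam."""
    return lam * sum(math.sqrt(sum(w * w for w in f)) for f in filters)

# Three hypothetical 2x2 filters, flattened to lists of weights.
filters = [
    [3.0, 4.0, 0.0, 0.0],  # L2 norm 5
    [0.0, 0.0, 0.0, 0.0],  # a filter already driven to zero adds nothing
    [1.0, 2.0, 2.0, 0.0],  # L2 norm 3
]

# Added to the task loss during training.
penalty = group_lasso_penalty(filters, lam=0.1)
```

Because the norm is taken per filter rather than per weight, the penalty tends to push whole filters to zero instead of scattering individual zeros across all filters.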

How to compare different models configurations [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am implementing a neural network model for text classification. I am trying different configurations of RNN and LSTM networks.
My question: how should I compare these configurations? Should I compare the models using the training set accuracy, the validation accuracy, or the test set accuracy?
I will explain how I finally compared my different RNN models.
First of all, I used my CPU for model training. This ensured that I got the same model parameters on each run, since GPU computations can be non-deterministic.
Secondly, I used the same tf seed for each run, to make sure that the random values generated in each run were the same.
Finally, I used the validation accuracy to tune my hyper-parameters. On each run I tried a different combination of parameters, and I chose the model with the highest validation accuracy as my best model.
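The selection step described above can be sketched as follows. The configuration names and all accuracy figures are made-up placeholders, not results from any real run, and `select_best` is a hypothetical helper. The idea is to pick the configuration by validation accuracy only, and to look at the test accuracy of that single chosen model afterwards.

```python
# Placeholder numbers for illustration only; not measured results.
results = {
    "rnn_64":   {"val_acc": 0.81, "test_acc": 0.79},
    "lstm_64":  {"val_acc": 0.84, "test_acc": 0.85},
    "lstm_128": {"val_acc": 0.86, "test_acc": 0.83},
}

def select_best(results):
    """Return the configuration name with the highest validation accuracy."""
    return max(results, key=lambda name: results[name]["val_acc"])

best = select_best(results)               # chosen on validation data only
final_test = results[best]["test_acc"]    # reported once, after selection
```

In this toy table another configuration happens to score higher on the test set, but choosing it for that reason would turn the test set into a second validation set and make the reported number optimistic.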

How to choose the number of filters in each Convolutional Layer? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
When building a convolutional neural network, how do you determine the number of filters used in each convolutional layer? I know that there is no hard rule about the number of filters, but from your experience, papers you have read, etc., is there an intuition or observation about the number of filters to use?
For instance (I'm just making this up as example):
use more/less filters as the network gets deeper.
use larger/smaller filter with large/small kernel size
If the object of interest in the image is large/small, use ...
As you said, there are no hard rules for this.
But you can get inspiration from VGG16 for example.
It doubles the number of filters from one block of conv layers to the next (64, 128, 256, 512), up to a maximum of 512.
For the kernel size, I usually keep 3x3 or 5x5.
But, you can also take a look at Inception by Google.
They use several kernel sizes in parallel, then concatenate the results. Very interesting.
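The VGG16 pattern can be written as a small schedule: start from a base width and double it per block until a cap is reached. The function name and its parameters are assumptions for illustration; the defaults reproduce the VGG16 block widths.

```python
def vgg_style_filters(n_blocks, base=64, cap=512):
    """Number of filters per conv block: doubles each block, capped at `cap`."""
    return [min(base * 2 ** i, cap) for i in range(n_blocks)]

filters_per_block = vgg_style_filters(5)  # [64, 128, 256, 512, 512]
```

For a smaller dataset or input size, the same shape of schedule with a lower `base` (for example 16 or 32) is a reasonable starting point to experiment from.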
As far as I am concerned there is no fixed depth for the convolutional layers. Just several suggestions:
In CS231n they mention that using 3 x 3 or 5 x 5 filters with a stride of 1 or 2 is a widely used practice.
How many of them: depends on the dataset. Also, consider using fine-tuning if the data is suitable.
How will the dataset affect the choice? A matter of experiment.
What are the alternatives? Have a look at the Inception and ResNet papers for approaches which are close to the state of the art.

Single Neuron Neural Network - Types of Questions? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Can anybody think of a real(ish) world example of a problem that can be solved by a single neuron neural network? I'm trying to think of a trivial example to help introduce the concepts.
Using a single neuron for classification is basically logistic regression, as Gordon pointed out.
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more metric (interval or ratio scale) independent variables. (statisticssolutions)
This is a good case to apply logistic regression:
Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent. (ats)
For a single neuron network, I find solving logic functions a good example. Assuming, say, a sigmoid neuron, you can demonstrate how the network solves the AND and OR functions, which are linearly separable, and how it fails to solve the XOR function, which is not.
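The logic-function demonstration can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: a single sigmoid neuron with two inputs, trained by full-batch gradient descent on the cross-entropy loss, with zero initial weights; the function names, learning rate, and epoch count are choices made here, not taken from the answer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, epochs=5000, lr=1.0):
    """Fit one sigmoid neuron (w1, w2, b) by batch gradient descent."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in data:
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y  # d(cross-entropy)/dz
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def accuracy(params, data):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    w1, w2, b = params
    hits = sum((sigmoid(w1 * x1 + w2 * x2 + b) >= 0.5) == bool(y)
               for (x1, x2), y in data)
    return hits / len(data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(train_neuron(AND), AND)  # separable: all four correct
or_acc = accuracy(train_neuron(OR), OR)     # separable: all four correct
xor_acc = accuracy(train_neuron(XOR), XOR)  # not separable: cannot reach 1.0
```

The decision boundary of a single neuron is a straight line in the input plane, so no choice of weights can classify all four XOR points correctly, no matter how long it trains.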