I was wondering how LSTM works in Keras.
Let's take an example.
My maximum sentence length is 3 words.
Example: 'how are you'
I vectorize each word into a vector of length 4, so the input has shape (3, 4).
Now, I want to use an LSTM to do translation (just as an example).
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(1, input_shape=(3, 4), return_sequences=True))
model.summary()
I'm going to have an output shape of (3,1) according to Keras.
Layer (type)                 Output Shape              Param #
=================================================================
lstm_16 (LSTM)               (None, 3, 1)              24
=================================================================
Total params: 24
Trainable params: 24
Non-trainable params: 0
_________________________________________________________________
And this is what I don't understand.
Each unit of an LSTM (with return_sequences=True, so that the output at every timestep is returned) should give me a vector of shape (timesteps, x),
where timesteps is 3 in this case and x is the size of my word vectors (here, 4).
So why do I get an output shape of (3, 1)?
I searched everywhere, but can't figure it out.
Your interpretation of what the LSTM should return is not right. The output dimensionality doesn't need to match the input dimensionality. Concretely, the first argument of keras.layers.LSTM corresponds to the dimensionality of the output space, and you're setting it to 1.
In other words, setting:
model.add(LSTM(k, input_shape=(3,4), return_sequences=True))
will result in a (None, 3, k) output shape, where None stands for the batch size.
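To make the shape argument concrete, here is a minimal NumPy sketch of an LSTM forward pass. It is not the Keras implementation: the function name lstm_forward, the random weights, and the gate ordering are assumptions chosen for illustration. It shows that the output width is set by the number of units (1 here), not by the input width (4), and it reproduces the 24 parameters reported by model.summary().

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, kernel, recurrent_kernel, bias, units):
    """Run a plain LSTM over x of shape (timesteps, input_dim).

    Returns the hidden state at every step, shape (timesteps, units),
    which is what return_sequences=True exposes.
    """
    h = np.zeros(units)
    c = np.zeros(units)
    outputs = []
    for x_t in x:
        # One fused projection producing the four gate pre-activations.
        z = x_t @ kernel + h @ recurrent_kernel + bias
        i, f, g, o = np.split(z, 4)  # gate order is an assumption of this sketch
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

timesteps, input_dim, units = 3, 4, 1
rng = np.random.default_rng(0)
kernel = rng.normal(size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(size=(units, 4 * units))
bias = np.zeros(4 * units)

x = rng.normal(size=(timesteps, input_dim))
out = lstm_forward(x, kernel, recurrent_kernel, bias, units)
n_params = kernel.size + recurrent_kernel.size + bias.size

print(out.shape)  # (3, 1)
print(n_params)   # 24
```

The parameter count follows 4 * (units * (input_dim + units) + units), which gives 4 * (1 * (4 + 1) + 1) = 24 for this layer.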
Consider the hypothetical neural network here
$o_1$ is the output of neuron 1.
$o_2$ is the output of neuron 2.
$w_1$ is the weight of connection between 1 and 3.
$w_2$ is the weight of connection between 2 and 3.
So the input to neuron 3 is $i =o_1w_1 +o_2w_2$
Let the activation function of neuron 3 be the sigmoid function,
$f(x) = \dfrac{1}{1+e^{-x}}$ and the threshold value of neuron 3 be $\theta$.
Therefore, output of neuron 3 will be $f(i)$ if $i\geq\theta$ and $0$ if $i\lt\theta$.
Am I correct?
Thresholds are used for binary neurons (I forget the technical name), whereas biases are used for sigmoid (and pretty much all modern) neurons. Your understanding of the threshold is correct, but it applies to neurons whose output is either 1 or 0, which is not very useful for learning (optimization). With a sigmoid neuron, you would simply add the bias (previously the threshold, moved to the other side of the equation), so your output would be f(weight * input + bias). All the sigmoid function is doing (for the most part) is limiting your output to a value between 0 and 1.
I don't think this is the place to ask this sort of question; you will find a lot of NN resources online. For your simple case, each link has a weight, so basically the input of neuron 3 is:
Neuron3Input = Neuron1Output * WeightOfLinkNeuron1To3 + Neuron2Output * WeightOfLinkNeuron2To3 + bias
Then, to get the output, just use the activation function. Neuron3Output = F_Activation(Neuron3Input)
O3 = F(O1 * W1 + O2 * W2 + Bias)
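As a small sketch of the difference between the two kinds of neuron discussed here, the snippet below compares a thresholded binary unit with a sigmoid unit that uses a bias. The function names and all numeric values are hypothetical, chosen only for illustration; the bias is set to -theta to show the threshold moving to the other side of the inequality.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def threshold_neuron(o1, o2, w1, w2, theta):
    """Binary unit: outputs 1 when the weighted sum reaches theta, else 0."""
    i = o1 * w1 + o2 * w2
    return 1 if i >= theta else 0

def sigmoid_neuron(o1, o2, w1, w2, bias):
    """Sigmoid unit: O3 = F(O1 * W1 + O2 * W2 + Bias)."""
    return sigmoid(o1 * w1 + o2 * w2 + bias)

# Hypothetical values chosen only for illustration.
o1, o2, w1, w2 = 0.5, 0.8, 1.0, -0.5
theta = 0.2
bias = -theta  # the threshold moved to the other side of the inequality

binary_out = threshold_neuron(o1, o2, w1, w2, theta)
smooth_out = sigmoid_neuron(o1, o2, w1, w2, bias)
print(binary_out, smooth_out)
```

With bias = -theta, the sigmoid output crosses 0.5 exactly where the binary unit switches from 0 to 1, but it changes smoothly, which is what makes gradient-based learning possible.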
I have trained an RNN in Keras. Now, I want to get the values of the trained weights:
from keras.models import Sequential
from keras.layers import SimpleRNN

model = Sequential()
model.add(SimpleRNN(27, return_sequences=True, input_shape=(None, 27), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.get_weights()
This gives me 2 arrays of shape (27,27) and 1 array of shape (27,1). I don't understand what these arrays mean. Also, I expected 2 more arrays, of shape (27,27) and (27,1), that compute the hidden state activation 'a'. How can I get these weights?
The arrays returned by model.get_weights() directly correspond to the weights used by SimpleRNNCell. They include:
The kernel matrix of size (input_shape[-1], units). In your case, input_shape=(None, 27) and units=27, so it's (27, 27). The kernel gets multiplied by the input.
The recurrent_kernel matrix of size (units, units), which also happens to be (27, 27). This matrix gets multiplied by the previous state.
The bias array of shape (units,) == (27,).
These arrays correspond to the standard formula:
# W = kernel
# U = recurrent_kernel
# B = bias
output = new_state = act(W * input + U * state + B)
Note that the Keras implementation uses a single bias vector, and the output at each step is the new state itself, so all in all there are exactly three arrays.
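The formula can be sketched in NumPy as follows. This is only an illustration, not the Keras source: the random arrays stand in for the three arrays returned by model.get_weights(), the function name simple_rnn_forward is made up, and tanh is assumed as the activation (the model in the question uses softmax instead). The products are written as input @ kernel to match the (input_dim, units) kernel shape.

```python
import numpy as np

units, input_dim, timesteps = 27, 27, 5
rng = np.random.default_rng(0)

# Stand-ins for the three arrays returned by model.get_weights().
kernel = rng.normal(scale=0.1, size=(input_dim, units))        # W
recurrent_kernel = rng.normal(scale=0.1, size=(units, units))  # U
bias = np.zeros(units)                                         # B

def simple_rnn_forward(x, kernel, recurrent_kernel, bias, act=np.tanh):
    """Apply new_state = act(input @ W + state @ U + B) at each timestep."""
    state = np.zeros(recurrent_kernel.shape[0])
    outputs = []
    for x_t in x:
        state = act(x_t @ kernel + state @ recurrent_kernel + bias)
        outputs.append(state)  # the output at each step is the new state
    return np.stack(outputs)

x = rng.normal(size=(timesteps, input_dim))
out = simple_rnn_forward(x, kernel, recurrent_kernel, bias)
print(out.shape)  # (5, 27)
```

Since the state is returned directly as the output, no additional (27, 27) matrix or bias is involved.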
In the snippet:
criterion = nn.CrossEntropyLoss()
raw_loss = criterion(output.view(-1, ntokens), targets)
the output size is torch.Size([5, 5, 8967]), the targets size is torch.Size([25]), and ntokens is 8967.
After modifying the code, my output size is torch.Size([5, 8967]) and my targets size is torch.Size([25]), which raises dimensionality issues when computing the loss.
Is it sensible to increase the output size of the Linear layer that produces the output by a factor of 5, so that I can later reshape the output to torch.Size([5, 5, 8967])?
The problem with increasing the size of the tensor is that ntokens can become quite large and I can easily run out of memory because of that. Is there an alternative approach?
You should do something like this:
import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable

ntokens = 8000
output = Variable(torch.randn(5, 5, ntokens))
targets = Variable(torch.from_numpy(np.random.randint(0, ntokens, size=25)))
criterion = nn.CrossEntropyLoss()
loss = criterion(output.view(-1, ntokens), targets)
print(loss)
This prints:
Variable containing:
9.4613
[torch.FloatTensor of size 1]
Here, I am assuming output contains predictions of next word for 5 sentences (minibatch size is 5) and each sentence is of length 5 (sequence length is 5). 8000 is the vocabulary size, so your model is predicting a probability distribution over the entire vocabulary.
Now you can compute the loss for each predicted word: output.view(-1, ntokens) has shape (25, ntokens), which matches the 25 targets.
Please note, CrossEntropyLoss expects input to contain scores for each class. So, input has to be a 2D Tensor of size (minibatch, C), and the target has to be a 1D tensor of size minibatch holding a class index (0 to C-1) for each example.
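To illustrate the shape requirement without PyTorch, here is a rough NumPy sketch of the same computation. The helper cross_entropy is written for this example and is not a library function, and the vocabulary is shrunk to 100 (an assumption) to keep the demo light. It flattens (batch, seq_len, ntokens) scores to (batch * seq_len, ntokens) and averages the negative log-likelihood over the 25 targets.

```python
import numpy as np

def cross_entropy(scores, targets):
    """Mean negative log-likelihood of targets under softmax(scores).

    scores: (N, C) raw class scores; targets: (N,) integer class indices.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

batch, seq_len, ntokens = 5, 5, 100  # small vocabulary to keep the demo light
rng = np.random.default_rng(0)
output = rng.normal(size=(batch, seq_len, ntokens))
targets = rng.integers(0, ntokens, size=batch * seq_len)

flat = output.reshape(-1, ntokens)  # counterpart of output.view(-1, ntokens)
loss = cross_entropy(flat, targets)
print(flat.shape, targets.shape, loss)
```

The flattening is only a reshape of existing data, so it does not require a larger output layer or extra memory for the scores.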
I'm using the example network of an MLP with 2 hidden layers and two dropout layers.
My load_data() function returns 400 rows of 20 features, and my label dataset is just 400 rows of one variable. These are split into X_train, X_test, y_train, and y_test, with some taken out for validation.
My Lasagne input layer is:
l_in = lasagne.layers.InputLayer(shape=(None, 20), input_var=input_var)
and my train function is:
train_fn = theano.function([input_var, target_var], loss, updates=updates, allow_input_downcast=True)
My program fails at around this line:
train_err += train_fn(inputs, targets)
with the error:
'Wrong number of dimensions: expected 1, got 2 with shape (20, 1).')
I understand the (20, 1), as I passed in twenty values on the input side and 1 value on the label side, but I thought Theano automatically flattened each array?
What can I do to fix this?
Any help would be appreciated!
The inputs that you pass to train_fn() should be an ndarray with shape (n, 20), where n is the number of examples in your minibatch. The targets should be an ndarray with shape (n,) (note that shapes (1, n) and (n, 1) won't work). Double-check that the arrays you actually pass to the function match these shapes.
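If the labels were loaded as a column of shape (n, 1), which the error message suggests but which is an assumption here, one way to get the expected shape is to flatten them before calling train_fn(). The arrays below are random placeholders, not the asker's data.

```python
import numpy as np

n = 20  # minibatch size, matching the (20, 1) in the error message
inputs = np.random.rand(n, 20).astype(np.float32)  # shape (20, 20)
targets_2d = np.random.randint(0, 2, size=(n, 1))  # shape (20, 1), a column

targets = targets_2d.ravel()  # shape (20,), a 1-D vector
print(inputs.shape, targets_2d.shape, targets.shape)
```

The flattened targets array is what would then be passed as the second argument to train_fn().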
I guess my question is very simple, but anyway...
I've created a neural network using
net = newff(entry_borders, [20, 10], {'logsig', 'logsig'}, 'traingdx');
where entry_borders is a 50x2 array: [(0,1), (0,1), ...]
It should be a network with 50 inputs, one hidden layer, and 10 outputs, shouldn't it?
But when I run this:
test_result = sim(net, zeros(50));
disp(test_result);
I get a 10x50 matrix in test_result (instead of 10 scalar values). What's that? I'm not asking about the training process, which is why the code is so silly...
zeros(50) gives you a 50x50 matrix, so it is treated as 50 examples (each of dimension 50), which gives 50 predictions (each of size 10). To simulate a single example, pass a column vector such as zeros(50, 1).
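The column-per-example convention can be sketched in NumPy. This is not the toolbox code: the network is collapsed into a single hypothetical weight matrix W with random values, purely to show how the output shape follows the number of input columns.

```python
import numpy as np

n_inputs, n_outputs = 50, 10
rng = np.random.default_rng(0)
W = rng.normal(size=(n_outputs, n_inputs))  # hypothetical single-layer weights

def forward(X):
    """Map each column of X (one example per column) to an output column."""
    return 1.0 / (1.0 + np.exp(-(W @ X)))  # logistic activation, like 'logsig'

many = forward(np.zeros((50, 50)))  # counterpart of MATLAB's zeros(50)
one = forward(np.zeros((50, 1)))    # counterpart of MATLAB's zeros(50, 1)
print(many.shape, one.shape)        # (10, 50) (10, 1)
```

Note that NumPy's np.zeros(50) is a 1-D vector, unlike MATLAB's zeros(50), which is why the shapes are spelled out explicitly here.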