Consider the following Keras code for building an LSTM model.
model = Sequential()
model.add(LSTM(30, input_dim=22,return_sequences=True, init = 'glorot_uniform'))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='Nadam')
model.fit(train3d, trainY3d, nb_epoch=100, batch_size=8)
As one can see, there is only 1 LSTM layer. The trained model has 14 weight arrays. If you check this link, you will notice that the LSTM layer has 4 triples (input gate, forget gate, cell state, output gate), each consisting of W and U (parameter matrices) and b (bias vector). The dense layer adds 2 more arrays, W and b.
My question is: for this case of a 1-layer LSTM, is there a way to attribute 100% of Y to the impact of each input feature X_i, for all i?
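The count of 14 weight arrays mentioned above can be checked with a small tally. This is only a sketch: the shapes assume the standard LSTM formulation with separate per-gate weights, and the dictionary keys (W_input, U_forget, and so on) are hypothetical labels, not Keras attribute names.

```python
# Tally the weight arrays of one LSTM layer followed by a time-distributed Dense(1).
# Shapes assume the standard LSTM formulation with separate weights per gate.
input_dim, units, dense_out = 22, 30, 1

gates = ["input", "forget", "cell", "output"]
shapes = {}
for gate in gates:
    shapes["W_" + gate] = (input_dim, units)  # input-to-hidden weights
    shapes["U_" + gate] = (units, units)      # hidden-to-hidden weights
    shapes["b_" + gate] = (units,)            # bias vector
shapes["W_dense"] = (units, dense_out)
shapes["b_dense"] = (dense_out,)


def size(shape):
    """Number of scalar entries in an array of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


n_arrays = len(shapes)
n_params = sum(size(s) for s in shapes.values())
print(n_arrays, n_params)
```

The 4 gate triples give 12 arrays and the dense layer adds 2 more.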
I have 7 classes within my training examples (labeled 1-7). I'm running logistic regression and I want to create my ROC curve for each of my classes.
To train my model and make a prediction, I have the following code:
Theta = zeros(k, n+1); %initialize theta
[Theta, costs] = gradientDescent(Theta, @(t)(CostFunc(t, X, Y, lambda)),...
    @(t)(DerivOfCostFunc(t, X, Y, lambda)), alpha, iter_num);
%Make prediction with trained model
[scores,prediction] = predict(Theta, X_test); %X_test is the design matrix (ones on the first col)
Within the predict script, I have
scores = g(X*all_theta'); %this is the sigmoid function
[p_max, IndexOfMax]=max(scores, [], 2);
prediction = IndexOfMax;
Note that scores is an m by k matrix, where m is the number of training examples and k is the number of classes. prediction is an m by 1 vector with values from 1 to 7, giving the predicted class.
To create the ROC curve, for class 3 for example,
classNum=3;
for i=1:size(scores,1)
temp=scores(i,:);
diffscore(i,:)=temp(classNum)-max([temp(:,1:classNum-1),temp(:,classNum+1:end)]);
end
This last part I did because I read that I had to establish my class 3 as positive and the others as negative.
At last, I made my curve with the following code:
[xROC,yROC,~,auc] = perfcurve(y_test,diffscore,classNum);
%y_test contains my true labels, m by 1 column vector
However, when running the ROC curve for each of my classes, I get the same plot for all. They all have an AUC of 1. Based on some analysis, I know this is not correct but can't figure out in which part of the code I went wrong! Is there additional code I should add or should I need to modify any of my existing code?
I have a state-space model where:
A is 4x4 matrix, B is 4x1 matrix, C is 1x4 matrix.
I want to simulate that model in Simulink. Simple, right? So I made a model as shown in this image.
Why am I getting only one output? Shouldn't I get a 4x1 output, and therefore four outputs?
Analyzing the state space model consisting of system of matrix equations:
dx = A*x + B*u
y = C*x + D*u
We can see that size of y (the output) is determined by the number of rows in C and D matrices (number of rows in both matrices must be equal).
In your case size(C) = [1,4], that is the number of rows is 1 so you have only one output.
If you want to extract the whole state you can set C = eye(4) and modify D so that size(D) = [4,1] (as you have 4 outputs now and 1 input).
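The shape argument above can be checked numerically. The following is a minimal NumPy sketch, not a Simulink model: the values in A, B, x and u are made-up placeholders, and only the shapes matter.

```python
import numpy as np

# Placeholder system: the values are illustrative only, the shapes are the point.
A = np.zeros((4, 4))
B = np.ones((4, 1))
x = np.arange(4.0).reshape(4, 1)  # state vector (4x1)
u = np.array([[2.0]])             # single input (1x1)

dx = A @ x + B @ u                # state derivative is always 4x1

# Original setup: C is 1x4 and D is 1x1, so there is one output.
C1 = np.array([[1.0, 0.0, 0.0, 0.0]])
D1 = np.zeros((1, 1))
y1 = C1 @ x + D1 @ u

# Full-state output: C = eye(4) and D is 4x1, so there are four outputs.
C4 = np.eye(4)
D4 = np.zeros((4, 1))
y4 = C4 @ x + D4 @ u

print(y1.shape, y4.shape)
```

With C = eye(4) and a zero D, the output y4 is simply the state x itself.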
I am following this tutorial for building a neural network:
https://www.kaggle.com/antmarakis/another-neural-network-from-scratch
I do not understand the train part of this code where 1 is appended to the input feature vector.
def Train(X, Y, lr, weights):
    layers = len(weights)
    for i in range(len(X)):
        x, y = X[i], Y[i]
        x = np.matrix(np.append(1, x))  # Augment feature vector
        activations = ForwardPropagation(x, weights, layers)
        weights = BackPropagation(y, activations, weights, layers)
    return weights
Any help in understanding this would be appreciated.
Forward propagation multiplies the input by the weights and adds a bias term. The equation is
y = X*W + b. This can be written as a single matrix product, y = [1, X] * [b; W], where * stands for matrix multiplication, [1, X] is X with a one prepended, and [b; W] is W with b stacked on top as an extra row.
In the code, the weights and biases appear to have been combined into a single weight matrix W, and x is turned into an augmented vector by prepending a one to it.
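The equivalence can be verified numerically. This is a small sketch with made-up shapes (3 features, 2 output units); the names W_aug and x_aug are hypothetical and do not come from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))  # one sample with 3 features
W = rng.normal(size=(3, 2))  # weights for 2 output units
b = rng.normal(size=(1, 2))  # bias for 2 output units

# Explicit bias term
y_explicit = x @ W + b

# Augmented form: prepend a 1 to x (as np.append(1, x) does in Train)
# and put the bias in the first row of the weight matrix to match it.
x_aug = np.append(1, x).reshape(1, -1)
W_aug = np.vstack([b, W])
y_augmented = x_aug @ W_aug

print(np.allclose(y_explicit, y_augmented))
```

Because the one sits in the first position of the augmented vector, the bias has to occupy the first row of the combined matrix.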
I am trying to implement neural network with RELU.
input layer -> 1 hidden layer -> relu -> output layer -> softmax layer
Above is the architecture of my neural network.
I am confused about backpropagation of this relu.
For derivative of RELU, if x <= 0, output is 0.
if x > 0, output is 1.
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Can someone explain the backpropagation of my neural network architecture 'step by step'?
if x <= 0, output is 0. if x > 0, output is 1
The ReLU function is defined as f(x) = max(0,x), i.e. for x > 0 the output is x and otherwise it is 0.
So for the derivative f '(x) it's actually:
if x < 0, output is 0. if x > 0, output is 1.
The derivative f '(0) is not defined. So it's usually set to 0 or you modify the activation function to be f(x) = max(e,x) for a small e.
Generally: A ReLU is a unit that uses the rectifier activation function. That means it works exactly like any other hidden unit, but instead of tanh(x), sigmoid(x) or whatever activation you use, you use f(x) = max(0,x).
If you have written code for a working multilayer network with sigmoid activation it's literally 1 line of change. Nothing about forward- or back-propagation changes algorithmically. If you haven't got the simpler model working yet, go back and start with that first. Otherwise your question isn't really about ReLUs but about implementing a NN as a whole.
If you have a layer made out of a single ReLU, like your architecture suggests, then yes, you kill the gradient at 0. During training, the ReLU will return 0 to your output layer, which will either return 0 or 0.5 if you're using logistic units, and the softmax will squash those. So a value of 0 under your current architecture doesn't make much sense for the forward propagation part either.
See for example this. What you can do is use a "leaky ReLU", which has a small slope, such as 0.01, for inputs at or below 0 instead of a flat zero.
I would reconsider this architecture however, it doesn't make much sense to me to feed a single ReLU into a bunch of other units then apply a softmax.
Here is a good example that uses ReLU to implement XOR (reference: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html):
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
# N is batch size(sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 2, 30, 1
# XOR input and output data
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 0.002
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)  # using ReLU as the activation function
    y_pred = h_relu.dot(w2)
    # Compute and print loss
    loss = np.square(y_pred - y).sum()  # loss function
    loss_col.append(loss)
    print(t, loss, y_pred)
    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)  # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)  # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0  # the derivative of ReLU
    grad_w1 = x.T.dot(grad_h)
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
plt.plot(loss_col)
plt.show()
For more about the derivative of ReLU, see: http://kawahara.ca/what-is-the-derivative-of-relu/
So when you calculate the gradient, does that mean I kill gradient descent if x <= 0?
Yes! If the weighted sum of the inputs and bias of the neuron (activation function input) is less than zero and the neuron uses the Relu activation function, the value of the derivative is zero during backpropagation and the input weights to this neuron do not change (not updated).
Can someone explain the backpropagation of my neural network architecture 'step by step'?
A simple example can show one step of backpropagation. This example covers a complete process of one step. But you can also check only the part that related to Relu. This is similar to the architecture introduced in question and uses one neuron in each layer for simplicity. The architecture is as follows:
f and g represent Relu and sigmoid, respectively, and b represents bias.
Step 1:
First, the output is calculated:
This merely represents the output calculation. "z" and "a" represent the sum of the input to the neuron and the output value of the neuron activating function, respectively.
So h is the estimated value. Suppose the real value is y.
Weights are now updated with backpropagation.
The new weight is obtained by calculating the gradient of the error function relative to the weight, and subtracting this gradient from the previous weight, ie:
In backpropagation, the gradient of the last neuron(s) of the last layer is first calculated. A chain derivative rule is used to calculate:
The three general terms used above are:
The difference between the actual value and the estimated value
Neuron output square
And the derivative of the activator function, given that the activator function in the last layer is sigmoid, we have this:
And the above statement does not necessarily become zero.
Now we go to the second layer. In the second layer we will have:
It consisted of 4 main terms:
The difference between the actual value and the estimated value.
Neuron output square
The sum of the loss derivatives of the connected neurons in the next layer
A derivative of the activator function and since the activator function is Relu we will have:
if z2<=0 (z2 is the input of Relu function):
Otherwise, it's not necessarily zero:
So if the input to the neuron is less than or equal to zero, the loss derivative is always zero and the weights will not update.
*To repeat: the weighted sum of the neuron's inputs must be less than or equal to zero for the gradient to be killed.
The example given is a very simple example to illustrate the backpropagation process.
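One such step can also be traced in code. The sketch below is an illustration under assumptions, not a reproduction of the figures above: it assumes one ReLU neuron feeding one sigmoid neuron, a squared-error loss E = 0.5 * (h - y)^2, and made-up weights. All function and variable names are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def backprop_step(x, y, w_hidden, b_hidden, w_out, b_out, lr=0.1):
    """One forward and backward pass for x -> ReLU neuron -> sigmoid neuron."""
    # Forward pass
    z_hidden = w_hidden * x + b_hidden
    a_hidden = relu(z_hidden)
    z_out = w_out * a_hidden + b_out
    h = sigmoid(z_out)

    # Backward pass for E = 0.5 * (h - y)**2
    delta_out = (h - y) * h * (1.0 - h)       # sigmoid derivative is h * (1 - h)
    relu_grad = 1.0 if z_hidden > 0 else 0.0  # ReLU derivative, taking 0 at z = 0
    delta_hidden = delta_out * w_out * relu_grad

    grad_w_hidden = delta_hidden * x
    grad_w_out = delta_out * a_hidden

    new_w_hidden = w_hidden - lr * grad_w_hidden
    new_w_out = w_out - lr * grad_w_out
    return grad_w_hidden, grad_w_out, new_w_hidden, new_w_out


# Active ReLU: z_hidden = 0.5 * 1.0 + 0.1 > 0, so the hidden weight gets a gradient.
active = backprop_step(x=1.0, y=1.0, w_hidden=0.5, b_hidden=0.1, w_out=0.8, b_out=0.0)
# Inactive ReLU: z_hidden = -0.5 * 1.0 + 0.1 < 0, so the hidden weight stays unchanged.
inactive = backprop_step(x=1.0, y=1.0, w_hidden=-0.5, b_hidden=0.1, w_out=0.8, b_out=0.0)
print(active[0], inactive[0])
```

In the inactive case the ReLU output is also zero, so the gradient for the output weight vanishes as well; only the output bias (not updated in this sketch) would still receive a gradient.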
Yes, the original ReLU function has the problem you describe.
So a modified version was later introduced, called leaky ReLU.
In essence, leaky ReLU tilts the horizontal part of the function by a very small amount. For more information, watch this:
An explanation of activation methods, and an improved ReLU, on YouTube
Additionally, here you can find an implementation in caffe framework: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/relu_layer.cpp
The negative_slope parameter specifies whether to "leak" the negative part by multiplying it by the slope value rather than setting it to 0. Setting this parameter to zero gives the classical version.
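The same idea can be sketched in NumPy. This is not the Caffe implementation; the function names are hypothetical, and only the parameter name negative_slope is borrowed from the description above.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    """Pass positive inputs through and scale the rest by negative_slope."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, negative_slope * x)


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative: 1 for positive inputs, negative_slope otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, negative_slope)


z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))                      # leaky version
print(leaky_relu_grad(z))                 # gradient is never exactly zero
print(leaky_relu(z, negative_slope=0.0))  # classical ReLU
```

Because the gradient for non-positive inputs is negative_slope rather than zero, the weights feeding such a unit still receive a small update.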
I'm very new to machine learning. I've read about the hidden Markov model functions in MATLAB's Statistics Toolbox, and I want to use them to classify a given sequence of signals. I have 3D coordinates in a matrix P, i.e. [501x3], and I want to train a model based on that. Every complete trajectory ends at a specific set of points, i.e. at (0,0,0), where it achieves its target.
What is the appropriate pseudocode/approach for my scenario?
My Pseudocode:
501x3 matrix P is Emission matrix where each co-ordinate is state
random NxN transition matrix values (but i'm confused in it)
generating test sequence using the function hmmgenerate
train using hmmtrain(sequence,old_transition,old_emission)
give final transition and emission matrix to hmmdecode with an unknown sequence to give the probability (confusing also)
EDIT 1:
In a nutshell, I want to classify 10 classes of trajectories, each of size [501x3], with an HMM. I want to sample 50 rows, i.e. [50x3], from each trajectory in order to build the model. However, I have murphyk's HMM toolbox for such random sequences.
Here is a general outline of the approach to classifying d-dimensional sequences using hidden Markov models:
1) Training:
For each class k:
prepare an HMM model. This includes initializing the following:
a transition matrix: Q-by-Q matrix, where Q is the number of states
a vector of prior probabilities: Q-by-1 vector
the emission model: in your case the observations are 3D points, so you could use a multivariate normal distribution (with specified mean vector and covariance matrix) or a Gaussian mixture model (a set of MVN distributions combined using mixture coefficients)
after properly initializing the above parameters, you train the HMM model, feeding it the set of sequences belonging to this class (EM algorithm).
2) Prediction
Next to classify a new sequence X:
you compute the log-likelihood of the sequence using each model log P(X|model_k)
then you pick the class that gave the highest probability. This is the class prediction.
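The prediction step can be sketched in NumPy as follows. This is an illustration under strong assumptions: there is no EM training here, each state has a single Gaussian emission, and the two class models use made-up parameters. The helper names (hmm_loglik, make_model and so on) are hypothetical and do not belong to any toolbox.

```python
import numpy as np


def log_gaussian(X, mean, cov):
    """Log density of each row of X under a multivariate normal."""
    d = X.shape[1]
    diff = X - mean
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def logsumexp(a, axis):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def hmm_loglik(X, prior, transmat, means, covs):
    """Forward algorithm in log space: returns log P(X | model)."""
    Q = len(prior)
    # log_b[t, q] = log density of observation t under state q
    log_b = np.column_stack([log_gaussian(X, means[q], covs[q]) for q in range(Q)])
    log_alpha = np.log(prior) + log_b[0]
    for t in range(1, len(X)):
        # Sum over previous state i of alpha[i] * transmat[i, j], in log space
        log_alpha = logsumexp(log_alpha[:, None] + np.log(transmat), axis=0) + log_b[t]
    return logsumexp(log_alpha, axis=0)


def make_model(means):
    """Hand-specified model: uniform prior and transitions, identity covariances."""
    Q, d = means.shape
    return dict(prior=np.full(Q, 1.0 / Q),
                transmat=np.full((Q, Q), 1.0 / Q),
                means=means,
                covs=np.stack([np.eye(d)] * Q))


# One made-up model per class; in practice these would come from training.
models = [make_model(np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])),
          make_model(np.array([[5.0, 5.0, 5.0], [6.0, 6.0, 6.0]]))]

X_new = np.array([[5.1, 4.9, 5.0], [5.8, 6.1, 6.0], [6.2, 5.9, 6.1]])
logliks = [hmm_loglik(X_new, **m) for m in models]
predicted_class = int(np.argmax(logliks))
print(logliks, predicted_class)
```

Working in log space avoids numerical underflow on long sequences such as 501-point trajectories.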
As I mentioned in the comments, the Statistics Toolbox only implements discrete-observation HMM models, so you will have to find another library or implement the code yourself. Kevin Murphy's toolboxes (HMM toolbox, BNT, PMTK3) are popular choices in this domain.
Here are some answers I posted in the past using Kevin Murphy's toolboxes:
Issue in training hidden markov model and usage for classification
Simple example/use-case for a BNT gaussian_CPD
The above answers are somewhat different from what you are trying to do here, but it's a good place to start.
The task is to build and train a hidden Markov model, here using murphyk's HMM toolbox, with the following quantities:
O = number of coefficients in an observation vector
Q = number of states
T = number of vectors in a sequence
nex = number of sequences
M = number of mixtures
Demo Code (from murphyk's toolbox):
O = 8; %Number of coefficients in a vector
T = 420; %Number of vectors in a sequence
nex = 1; %Number of sequences
M = 1; %Number of mixtures
Q = 6; %Number of states
data = randn(O,T,nex);
% initial guess of parameters
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
if 0
    Sigma0 = repmat(eye(O), [1 1 Q M]);
    % Initialize each mean to a random data point
    indices = randperm(T*nex);
    mu0 = reshape(data(:,indices(1:(Q*M))), [O Q M]);
    mixmat0 = mk_stochastic(rand(Q,M));
else
    [mu0, Sigma0] = mixgauss_init(Q*M, data, 'full');
    mu0 = reshape(mu0, [O Q M]);
    Sigma0 = reshape(Sigma0, [O O Q M]);
    mixmat0 = mk_stochastic(rand(Q,M));
end
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 5);
loglik = mhmm_logprob(data, prior1, transmat1, mu1, Sigma1, mixmat1);