In the deeplearning.ai course by Andrew Ng on Coursera, the following gradients are given:
Here, a[l] = g[l](z[l]).
I fail to understand the gradients obtained here.
If:
a[l] = g[l](z[l]),
then:
da[l] = ∂g[l](z[l])/∂z[l] * dz[l]
=> da[l] = g[l]'(z[l]) * dz[l]
But the first formula given in the slide is different. What am I doing wrong?
Please watch Andrew Ng's video "Backpropagation intuition (optional)" from Week 3, "Shallow neural networks", to understand the backpropagation derivative.
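One common source of this confusion (assuming the slide follows the course's usual convention, since the slide itself is not reproduced here): in these lectures, dv is shorthand for the derivative of the cost with respect to v, i.e. dv = ∂L/∂v, rather than the differential of v. Under that reading, the slide's formula is just the chain rule applied to the cost:

$$
dz^{[l]} \;=\; \frac{\partial \mathcal{L}}{\partial z^{[l]}}
\;=\; \frac{\partial \mathcal{L}}{\partial a^{[l]}}\cdot\frac{\partial a^{[l]}}{\partial z^{[l]}}
\;=\; da^{[l]} \cdot g^{[l]\prime}\!\left(z^{[l]}\right),
$$

so da[l] is an input that multiplies g[l]'(z[l]) to produce dz[l], rather than a quantity computed from dz[l].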
Background: I am working on a problem similar to the nonlinear logistic regression described in [1] (my problem is more complicated, but [1] is enough for the rest of this post). Comparing my results with those obtained in parallel with an R package, I got similar coefficients, but a log-likelihood of (very approximately) opposite sign.
Hypothesis: The LogLikelihood reported by fitnlm in Matlab is in fact the negative log-likelihood. (Note that this would consequently impair the BIC and AIC computed by Matlab.)
Reasoning: in [1], the same problem is solved through two different approaches. ML approach: define the negative log-likelihood and optimize it with fminsearch. GLS approach: use fitnlm.
The negative log-likelihood after the ML approach is: 380
The negative log-likelihood after the GLS approach is: -406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is multiplying by (-1) enough, or is some further correction needed?
Self-contained code:
%copy-pasting code from [1]
myf = @(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = @(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = @(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = @(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also note that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood (treating it as the log-likelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
This is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember that the normal density has a factor of 1/sqrt(2*pi), so taking the log of that gives -log(2*pi)/2. The minus sign comes from there and is part of the log-likelihood; the property value is not the negative log-likelihood.
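To spell out where that term comes from (a standard identity, written here for the unweighted case): the log of a normal density is

$$
\log \mathcal{N}(x \mid \mu, \sigma^2) \;=\; -\tfrac{1}{2}\log(2\pi) \;-\; \log\sigma \;-\; \frac{(x-\mu)^2}{2\sigma^2},
$$

so summing over the observations contributes the -model.NumObservations*log(2*pi)/2 piece of the quoted expression.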
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
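As a rough back-of-the-envelope check of that last point (this calculation is not in the original answer, and it ignores the weights used by fitnlm): the observed proportions y = z/n sit on a grid with spacing 1/n, so each density value converts to an approximate probability via P(y_i) ≈ f(y_i)/n. Over the N = 200 observations this shifts the normal log-likelihood by N*log(n):

$$
\log L_{\text{discrete}} \;\approx\; \log L_{\text{density}} - N\log n \;\approx\; 406 - 200\log 50 \;\approx\; -376,
$$

which is already in the same ballpark as the binomial log-likelihood of roughly -380 from the ML approach.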
I've been trying to fit a sine curve with a Keras (Theano backend) model using pymc3. I've been using this [http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/] as a reference point.
A Keras implementation alone, fit using optimization, does a good job; however, Hamiltonian Monte Carlo and variational sampling from pymc3 are not fitting the data. The trace is stuck where the prior is initialized, and when I move the prior the posterior moves to the same spot. The posterior predictive of the Bayesian model in cell 59 barely captures the sine wave, whereas the non-Bayesian fitted model gets it nearly perfect in cell 63. I created a notebook here: https://gist.github.com/tomc4yt/d2fb694247984b1f8e89cfd80aff8706 which shows the code and the results.
Here is a snippet of the model below...
class GaussWeights(object):
    def __init__(self):
        self.count = 0

    def __call__(self, shape, name='w'):
        return pm.Normal(
            name, mu=0, sd=.1,
            testval=np.random.normal(size=shape).astype(np.float32),
            shape=shape)

def build_ann(x, y, init):
    with pm.Model() as m:
        i = Input(tensor=x, shape=x.get_value().shape[1:])
        m = i
        m = Dense(4, init=init, activation='tanh')(m)
        m = Dense(1, init=init, activation='tanh')(m)
        sigma = pm.Normal('sigma', 0, 1, transform=None)
        out = pm.Normal('out',
                        m, 1,
                        observed=y, transform=None)
    return out
with pm.Model() as neural_network:
    likelihood = build_ann(input_var, target_var, GaussWeights())
    # v_params = pm.variational.advi(
    #     n=300, learning_rate=.4
    # )
    # trace = pm.variational.sample_vp(v_params, draws=2000)
    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.HamiltonianMC(scaling=start)
    trace = pm.sample(1000, step, progressbar=True)
The model contains normal noise with a fixed std of 1:
out = pm.Normal('out', m, 1, observed=y)
but the dataset does not. It is only natural that the posterior predictive does not match the dataset; they were generated in very different ways. To make it more realistic you could add noise to your dataset and then estimate sigma:
mu = pm.Deterministic('mu', m)
sigma = pm.HalfCauchy('sigma', beta=1)
pm.Normal('y', mu=mu, sd=sigma, observed=y)
What you are doing right now is similar to taking the output from the network and adding standard normal noise.
A couple of unrelated comments:
out is not the likelihood; it is just the observed dataset again.
If you use HamiltonianMC instead of NUTS, you need to set the step size and the integration time yourself; the defaults are not usually useful (see the sketch below).
It seems Keras changed in 2.0, and this way of combining pymc3 and Keras no longer works.
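For reference, a minimal sketch of those two sampling options on the neural_network model defined above (parameter names as in pymc3 3.x; the numeric values are illustrative, not tuned):

with neural_network:
    # Option 1: NUTS, the pm.sample default, tunes the step size automatically
    trace = pm.sample(1000)

    # Option 2: plain HMC with a hand-chosen step size and integration time
    step = pm.HamiltonianMC(step_scale=0.04, path_length=2.0)
    trace_hmc = pm.sample(1000, step=step)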
I am trying to implement the answer selection model shown below in Keras, based on this paper.
I understand how to implement the embedding, bi-LSTM and pooling steps in the above flow.
But how do I implement the merge step to compute the cosine similarity, and the loss function, in Keras?
The loss function is defined as:
L = max{0, M - cosine(q,a+) + cosine(q,a-)}
where,
M = constant margin
q = question
a+ = correct answer
a- = wrong answer
Update 1:
After going through a few blogs, this is how I implemented it.
#build model
input_question = Input(shape=(max_len, embedding_dim))
input_sentence = Input(shape=(max_len, embedding_dim))
question_lstm = Bidirectional(LSTM(64))
sentence_lstm = Bidirectional(LSTM(64))
encoded_question = question_lstm(input_question)
encoded_sentence = sentence_lstm(input_sentence)
cos_distance = merge([encoded_question, encoded_sentence], mode='cos', dot_axes=1)
cos_distance = Reshape((1,))(cos_distance)
cos_similarity = Lambda(lambda x: 1-x)(cos_distance)
predictions = Dense(1, activation='sigmoid')(cos_similarity)
model = Model([input_question, input_sentence], [predictions])
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
With the above implementation, I am still not able to figure out how to implement the hinge loss. Please help.
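One way the ranking hinge loss could be sketched in Keras 2 (an illustrative sketch, not from the thread: the margin value, the three-input question/good-answer/bad-answer setup, and the trick of letting the model output its own loss are all assumptions):

from keras.layers import Input, LSTM, Bidirectional, Dot, Lambda
from keras.models import Model
import keras.backend as K

margin = 0.2  # M, an assumed value

input_question = Input(shape=(max_len, embedding_dim))
input_answer_good = Input(shape=(max_len, embedding_dim))
input_answer_bad = Input(shape=(max_len, embedding_dim))

question_lstm = Bidirectional(LSTM(64))
answer_lstm = Bidirectional(LSTM(64))

encoded_question = question_lstm(input_question)
encoded_good = answer_lstm(input_answer_good)
encoded_bad = answer_lstm(input_answer_bad)

# Dot with normalize=True computes cosine similarity between the encodings
cos_good = Dot(axes=-1, normalize=True)([encoded_question, encoded_good])
cos_bad = Dot(axes=-1, normalize=True)([encoded_question, encoded_bad])

# hinge ranking loss: max(0, M - cosine(q, a+) + cosine(q, a-))
loss_out = Lambda(lambda x: K.maximum(0.0, margin - x[0] + x[1]))([cos_good, cos_bad])

model = Model([input_question, input_answer_good, input_answer_bad], loss_out)
# the model output already is the loss, so the compiled loss just passes it through
model.compile(optimizer='adam', loss=lambda y_true, y_pred: y_pred)

At training time you would feed (question, correct answer, wrong answer) triples together with a dummy target (e.g. an array of zeros), since the network output itself is the quantity being minimized.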
I know it's over-fitting the training data set, yet I don't know how to change the parameters to avoid this.
I have tried changing the BoxConstraint among 1e0, 1e1 and 1e10 and got the same result.
tTargets = ones(size(trainTargets,1),1);
tTargets(trainTargets(:,2)==1)=-1;
svmModel = fitcsvm(trainData, ...
tTargets,...
'KernelFunction','rbf',...
'BoxConstraint',1e0);
[Group, score] = predict(svmModel, trainData);
tTargets = ones(size(trainTargets,1),1);
tTargets(trainTargets(:,2)==1)=-1;
svmTrainError = sum(tTargets ~= Group)/size(trainTargets,1);
[Group, score] = predict(svmModel, testData);
tTargets = ones(size(testTargets,1),1);
tTargets(testTargets(:,2)==1)=-1;
svmTestError = sum(tTargets ~= Group)/size(testTargets,1);
I hope someone can help with this
Thanks,
I found out that I was using a large C for training. This made the separation on the training data really good, but not on the test data.
Changing C to a smaller value (1e-2) made my code run faster, and now I have comparable overall accuracy on training and testing.
Thank you!!!!
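For completeness, a small sketch of choosing C by cross-validation instead of by hand. This is shown with scikit-learn purely for illustration (in MATLAB the same idea is available through crossval/kfoldLoss on the fitcsvm model); trainData and tTargets stand for the same features and +1/-1 labels used above:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(trainData, tTargets)          # pick the C with the best held-out accuracy
print(search.best_params_, search.best_score_)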
I'm trying to optimize an image reconstruction algorithm using a genetic algorithm. I took the initial population size as 10. I have an input image and 10 reconstructed images. The fitness function is the difference between these two. That is:
fitness_1 = inputimage - reconstructedimage_1;
fitness_2 = inputimage - reconstructedimage_2;
:
:
fitness_10 = inputimage - reconstructedimage_10;
I want to choose the individual with the best fitness among them. But my fitness result is an image (a matrix of intensity values). So how can I get a single fitness value for each individual, for doing crossover in the next stage?
Please help. Thanks in advance.
You need to define a function which measures the quality of the match as a single scalar value. Actually you have a choice here - anything which could measure the closeness in a more-or-less-continuous manner would work. However, probably the simplest is mean squared error of each pixel value in the image.
Here's how I might do this for your first reconstruction:
fitness_1 = abs(inputimage - reconstructedimage_1).^2;
fitness_1 = sum( fitness_1(:) ) / numel( fitness_1 );
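If it helps, here is the same idea vectorized over the whole population, sketched in NumPy (the array names and random data below are hypothetical stand-ins, just to make the snippet runnable):

import numpy as np

# stand-ins: one target image and a population of 10 reconstructions
rng = np.random.default_rng(0)
input_image = rng.random((64, 64))
reconstructions = rng.random((10, 64, 64))

errors = reconstructions - input_image        # broadcasts over the population
fitness = np.mean(errors ** 2, axis=(1, 2))   # one MSE per individual
best_index = np.argmin(fitness)               # smaller MSE = closer match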