Backpropagation formula seems to be unimplementable as is - matlab

I've been working on getting some proficiency on backpropagation, and have run across the standard mathematical formula for doing this. I implemented a solution which seemed to work properly (and passed the relevant test with flying colours).
However ... the actual solution (implemented in MATLAB, and using vectorization) is at odds with the formula in two important respects.
The formula looks like this:
delta-layer2 = (Theta-layer2 transpose) x delta-layer3 dot x gprime(-- not important right now)
The working code looks like this:
% d3 is delta3, d2 is delta2, Theta2 is minus the bias column
% dimensions: d3--[5000x10], d2--[5000x25], Theta2--[10x25]
d3 = (a3 - y2);
d2 = (d3 * Theta2) .* gPrime(z2);
I can't reconcile what I implemented with the mathematical formula, on two counts:
The working code reverses the terms in the first part of the expression;
The working code does not transpose Theta-layer2, but the formula does.
How can this be? The dimensions of the individual matrices don't seem to allow for any other working combination.
Josh

This isn't a wrong question, I don't not why those downvotes; the implementation of a backpropagation algorithm is not intuitive as it appears. I'm not so great in math and I've never used MATLAB ( usually c ), so I avoided to answer this question first, but it deserve it.
First of all we have to do some simplifications.
1° we will use only a in_Data set so: vector in_Data[N] ( in the case below N = 2 ) ( If we succeed whit only a pat is not difficult extend it in a matrix ).
2° we will use this structure: 2 I, 2 H, 2 O ( I we succeed whit this; we will succeed with all ) this Network ( that I've stolen from: this blog )
Let's start: we know that for update the weights:
note: M=num_pattern, but we have previous declare in_data as vector, so you can delete the sum in the formula above and the matrix in the formula below. So this is your new formula:
we will study 2 connections: w1 and w5. Let's write the derivative:
let's code them: ( I really don't know MATLAB so I'll write a pseudocode )
vector d[num_connections+num_output_neurons] // num derivatives = num connections whitout count bias there are 8 connections. ; +2 derivative of O)
vector z[num_neurons] // z is the output of each neuron.
vector w[num_connections] // Yes a Vector! we have previous removed matrix and the sum.
// O layer
d[10] = (a[O1] - y[O1]); // Start from last to calculate the error.
d[9] = (a[O2] - y[O2]);
// H -> O layer
for i=5; i<=8; i++ ( Hidden to Out layer connections){
d[i] = (d)*g_prime(z[i])
}
// I -> H layer
for i=1; i<=8 i++ (Input to Hidden layer connections){
for j=1; i<=num_connection_from_neuron i++ (Take for example d[1] it depends on how many connections have H1 versus Outputs){
d[i] = d1 + (d[j+4]*w[j+4] )
}
d[i] = d[i]*g_prime(z[i]);
}
If you need to extend it in a Matrix write it in a comment that I'll extend the code.
And so you have found all the derivatives. Maybe this is not what you are exactly searching. I'm not even sure that all what I wrote is correct (I hope it is) I will try to code backpropagation in these days so I will able to correct errors if there are. I hope this will be a little helpful; better than nothing.
Best regards, Marco.

Related

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The logLikelihood given by fitnlm in Matlab is in fact the negative LogLikelihood. (Note that this impairs consequently the BIC and AIC computation by Matlab)
Reasonning: in [1], the same problem is solved through two different approaches. ML-approach/ By defining the negative LogLikelihood and making an optimization with fminsearch. GLS-approach/ By using fitnlm.
The negative LogLikelihood after the ML-approach is:380
The negative LogLikelihood after the GLS-approach is:-406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = #(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = #(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = #(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = #(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also not that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
this is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.

Why is this the correct way to do a cost function for a neural network?

So after beating my head against the wall for a few hours, I looked online for a solution to my problem, and it worked great. I just want to know what caused the issue with the way I was originally going about it.
here are some more details. The input is a 20x20px image from the MNIST datset, and there are 5000 samples, so X, or A1 is 5000x400. There are 25 nodes in the single hidden layer. The output is a one hot vector of 0-9 digits. y (not Y, which is the one hot encoding of y) is a 5000x1 vector with the value of 1-10.
Here was my original code for the cost function:
Y = zeros(m, num_labels);
for i = 1:m
Y(i, y(i)) = 1;
endfor
H = sigmoid(Theta2*[ones(1,m);sigmoid(Theta1*[ones(m, 1) X]'))
J = (1/m) * sum(sum((-Y*log(H]))' - (1-Y)*log(1-H]))')))
But then I found this:
A1 = [ones(m, 1) X];
Z2 = A1 * Theta1';
A2 = [ones(size(Z2, 1), 1) sigmoid(Z2)];
Z3 = A2*Theta2';
H = A3 = sigmoid(Z3);
J = (1/m)*sum(sum((-Y).*log(H) - (1-Y).*log(1-H), 2));
I see that this may be slightly cleaner, but what functionally causes my original code to get 304.88 and the other to get ~ 0.25? Is it the element wise multiplication?
FYI, this is the same problem as this question if you need the formal equation written out.
Thanks for any help I can get! I really want to understand where I'm going wrong
Transfer from the comments:
With a quick look, in J = (1/m) * sum(sum((-Y*log(H]))' - (1-Y)*log(1-H]))'))) there is definetely something going on with the parenthesis, but probably on how you pasted it here, not with the original code as this would throw an error when you run it. If I understand correctly and Y, H are matrices, then in your 1st version Y*log(H) is matrix multiplication while in the 2nd version Y.*log(H) is an entrywise multiplication (not matrix-multiplication, just c(i,j)=a(i,j)*b(i,j) ).
Update 1:
In regards to your question in the comment.
From the first screenshot, you represent each value yk(i) in the entry Y(i,k) of the Y matrix and each value h(x^(i))k as H(i,k). So basically, for each i,k you want to compute Y(i,k) log(H(i,k)) + (1-Y(i,k)) log(1-H(i,k)). You can do it for all the values together and store the result in matrix C. Then C = Y.*log(H) + (1-Y).*log(1-H) and each C(i,k) has the above mentioned value. This is an operation .* because you want to do the operation for each element (i,k) of each matrix (in contrast to multiplying the matrices which is totally different). Afterwards, to get the sum of all the values inside the 2D dimensional matrix C, you use the octave function sum twice: sum(sum(C)) to sum both columnwise and row-wise (or as # Irreducible suggested, just sum(C(:))).
Note there may be other errors as well.

random process modeling in signal processing

i would like to understand one thing and please help me to clarify it,sometimes it is necessary to represent given data by sum of complex exponentials with additive white noise,let us consider following model using sinusoidal model
clear all;
A1=24;
A2=23;
A3=23;
A4=23;
A5=10;
f1=11.01;
f2=11.005;
f3= 10;
f4=10.9
phi=2*pi*(rand(1,4)-0.5);
t=0:0.01:2.93;
k=1:1:294;
x=rand([1,length(t)]);
y(k)=A1.*sin(2*pi*f1*t+phi(1))+A2.*cos(2*pi*f2*t+phi(2))+A3.*sin(2*pi*f3*t+phi(3))+A4.*cos(2*pi*f4*t+phi(4))+A5.*x;
[pxx,f]=periodogram(y,[],[],100);
plot(f,pxx)
there phases are distributed uniformly in range of [-pi pi],but my main question is related following fact.to represent data as linear combination of complex exponentials with phases uniformly distributed in [-pi pi] interval, should we generate these phases outside of sampling or at each sampling process we should generate new list of phases?please help me to clarify this things
As I stated in the my comment, I don't really understand what you're asking. But, I will answer this as if you had asked it on codereview.
The following is not good practice in MATLAB:
A1=24;
A2=23;
A3=23;
A4=23;
A5=10;
There are very few cases (if any), where you actually need such variable names. Instead, the following would be much better:
A = [24 23 23 23 10];
Now, if you want to use A1, you do A(1) instead.
These two lines:
t=0:0.01:2.93;
k=1:1:294;
They are of course the same size (1x294), but when you do it that way, it's easy to get it wrong. You will of course get errors later on if they're not the same size, so it's nice to make sure that you have it correct on the first try, thus using linspace might be a good idea. The following line will give you the same t as the line above. This way it's easier to be sure you have exactly 294 elements, not 293, 295 or 2940 (it is sometimes easy to miss).
t = linspace(0,2.93,294);
Not really important, but k = 1:1:294 can be simplified to k = 1:294, as the default step size is 1.
The syntax .*, is used for element-wise operations. That is, if you want to multiply each element of a vector (or matrix) with the corresponding element in another one. Using it when multiplying vectors with scalars is therefore unnecessary, * is enough.
Again, not an important point, but x=rand([1,length(t)]); is simpler written x=rand(1, length(t)); (without brackets).
You don't need the index k in y(k) = ..., as k is continuous, starting at 1, with increments of 1. This is the default behavior in MATLAB, thus y = ... is enough. If, however, you only wanted to fill in every other number between 1 and 100, you could do y(1:2:100).
This is far from perfect, but in my opinion big step in the right direction:
A = [24 23 23 23 10];
f = [11.01 11.005 10 10.9]; % You might want to use , as a separator here
phi = 2*pi*(rand(1,4)-0.5);
t = linspace(0,2.93,294);
x = rand(1, length(t));
w = 2*pi*f; % For simplicity
y = A(1)*sin(w(1)*t+phi(1)) + A(2)*cos(w(2)*t+phi(2)) + ...
A(3)*sin(w(3)*t+phi(3)) + A(4)*cos(w(4)*t+phi(4))+A(5)*x;
Another option would be:
z = [sin(w(1)*t+phi(1)); cos(w(2)*t+phi(2)); sin(w(3)*t+phi(3)); ...
cos(w(4)*t+phi(4)); x];
y = A.*z;
This will give you the same y as the first one. Having the same w, t and phi as above, the following will also give you the same results:
c = bsxfun(#times,w,t') + kron(phi,ones(294,1));
y = sum(bsxfun(#times,A,[sin(c(:,1)), cos(c(:,2)), sin(c(:,3)), cos(c(:,4)), x']),2)';
I hope something in here might help you some in your further work. And maybe I actually answered your question. =)

Finding local maximum between two peak or points using MATLAB

I have intensity points which is marked as pink in above plot, and these are stored in variable and is given as
intensity_info =[ 35.9349
46.4465
46.4790
45.7496
44.7496
43.4790
42.5430
41.4351
40.1829
37.4114
33.2724
29.5447
26.8373
24.8171
24.2724
24.2487
23.5228
23.5228
24.2048
23.7057
22.5228
22.0000
21.5210
20.7294
20.5430
20.2504
20.2943
21.0219
22.0000
23.1096
25.2961
29.3364
33.4351
37.4991
40.8904
43.2706
44.9798
47.4553
48.9324
48.6855
48.5210
47.9781
47.2285
45.5342
34.2310 ];
I also have information of point A, B and C which is calculated by :
[maxtab, mintab] = peakdet(intensity_info, 1); % maxtab has A and B information and
% mintab has C information
peakdet.m matlab code can be found here: (http://www.billauer.co.il/peakdet.html). I want to calculate point D (where there is sight increase in intensity value i.e. if we come down from point A intensity decreases but at point D there is slight increase in intensity). As seen from graph below point C can also lie in the left of point D and in this case if we come down from point B intensity decrease and at D there is slight increase in intensity. Intensity values for below graph below is given as:
intensity_info =[29.3424
39.4847
43.7934
47.4333
49.9123
51.4772
52.1189
51.6601
48.8904
45.0000
40.9561
36.5868
32.5904
31.0439
29.9982
27.9579
26.6965
26.7312
28.5631
29.3912
29.7496
29.7715
29.7294
30.2706
30.1847
29.7715
29.2943
29.5667
31.0877
33.5228
36.7496
39.7496
42.5009
45.7934
49.1847
52.2048
53.9123
54.7276
54.9781
55.0000
54.9781
54.7276
53.9342
51.4246
38.2512];
and Point A ,B and C calculated in same manner as above.
How can I calculate point D in these cases?
I'm not MATLAB literate, but if the 'tabs' are subtables, then perhaps you can manipulate them to create other subtables... something like (I repeat, illiterate)
left_of_graph = part of graph from A to C
right_of_graph = part of graph from C to B
left_delta = some fraction of the difference between A's y-value and C's y-value
right_delta = some fraction of the difference between C's y-value and B's y-value
[left_maxtab,left_mintab] = peakdet(left_of_graph,left_delta)
[right_maxtab,right_mintab] = peakdet(right_of_graph,right_delta)
I do have some experience with peak analysis, so I will say this will help, not answer, the problem. You can find all the peaks you want, but that doesn't mean the data is worth squinting at. Noise and imperfect resolution are real. Happy hunting!
PS You might also scan for all points higher than both neighbors in the whole band. That is gauranteed to miss none of the 'true' maxima, but to give you more 'false' maxima than you can count (although your data looks pretty smooth!).
The solution your looking for is a numerical method based on iterations,in your particular case bisection is the one who best suits cause others like uniform sequential search do not take an interval as an input.
This is the implementation for bisection:
function [ a, b, L ] = Bisection( f, a, b, e, d )
%[a,b] is the interval for your local maxima; e is the error for the result and d is the step(dx).
L = b - a;
while(L > e)
xa = ((a + b)/2) - d/2;
xb = ((a + b)/2) + d/2;
ya = subs(f,xa);
yb = subs(f,xb);
if(ya < yb)
a = xa;
else
b = xb;
end
L = b - a;
end
end
The method before is pretty efficient and straightforward to use, although there are others even better(in performance) like Fibonacci and Gold Section methods.
Cheers.
I have found the alternative solution. The extrema.m helped in finding Point D in above two garphs. The extrema.m can be downloaded from (http://www.mathworks.com/matlabcentral/fileexchange/12275-extrema-m-extrema2-m) and used in following way to find point D:
[ymax,imax,ymin,imin] = extrema(intensity_info);
figure;plot(x,intensity_info,x(imax),ymax,'g.',x(imin),ymin,'r.');

Decrete Fourier Transform in Matlab

I am asked to write an fft mix radix in matlab, but before that I want to let to do a discrete Fourier transform in a straight forward way. So I decide to write the code according to the formula defined as defined in wikipedia.
[Sorry I'm not allowed to post images yet]
http://en.wikipedia.org/wiki/Discrete_Fourier_transform
So I wrote my code as follows:
%Brutal Force Descrete Fourier Trnasform
function [] = dft(X)
%Get the size of A
NN=size(X);
N=NN(2);
%====================
%Declaring an array to store the output variable
Y = zeros (1, N)
%=========================================
for k = 0 : (N-1)
st = 0; %the dummy in the summation is zero before we add
for n = 0 : (N-1)
t = X(n+1)*exp(-1i*2*pi*k*n/N);
st = st + t;
end
Y(k+1) = st;
end
Y
%=============================================
However, my code seems to be outputting a result different from the ones from this website:
http://www.random-science-tools.com/maths/FFT.htm
Can you please help me detect where exactly is the problem?
Thank you!
============
Never mind it seems that my code is correct....
By default the calculator in the web link applies a window function to the data before doing the FFT. Could that be the reason for the difference? You can turn windowing off from the drop down menu.
BTW there is an FFT function in Matlab