Gradient descent in linear regression goes wrong - MATLAB

I want to use a linear model to fit a set of sin(x) data, but the loss grows with every iteration. Is there a problem with my code below? (I am using the gradient descent method.)
Here is my code in Matlab
m = 20;
rate = 0.1;
x = linspace(0, 2*pi, m);
y = sin(x);                      % compute targets before augmenting x with the bias row
x = [ones(1, length(x)); x];
w = rand(1, 2);
total_loss = [];                 % must be initialized before the loop
for i = 1:500
    h = w*x;
    loss = sum((h-y).^2)/m/2     % no semicolon: prints the loss each iteration
    total_loss = [total_loss loss];
    gradient = (h-y)*x'./m;
    w = w - rate.*gradient;
end
Here is the data I want to fit (plot omitted: sin(x) sampled over [0, 2*pi]).

There isn't a problem with your code. Within your current framework, if your data can be described by y = m*x + b, then this code is more than adequate. I ran it through a few tests where I defined an equation of a line and added Gaussian random noise to it (amplitude = 0.1, mean = 0, std. dev. = 1).
However, one problem I will mention is your sinusoidal data itself. You define a domain of [0, 2*pi], over which sin is not monotonic: multiple x values map to the same y value (for example, x = pi/4 and x = 3*pi/4 both map to sqrt(2)/2, while x = 5*pi/4 maps to -sqrt(2)/2). This variability does not bode well for a straight-line fit, so one suggestion is to restrict your domain to something like [0, pi]. Another reason it probably doesn't converge is that the learning rate you chose is too high; I'd set it to something low, like 0.01. As you mentioned in your comments, you already figured that out!
However, if you want to fit non-linear data using linear regression, you will have to include higher-order terms to account for the variability. Try including second- and/or third-order terms. This can be done simply by modifying your x matrix like so:
x = [ones(1,length(x)); x; x.^2; x.^3];
If you recall, the hypothesis function can be represented as a summation of linear terms:
h(x) = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn
In our case, each theta term weights one term of our polynomial: x2 would be x^2 and x3 would be x^3. Therefore, we can still use the definition of gradient descent for linear regression here.
I'm also going to control the random generation seed (via rng) so that you can produce the same results I have gotten:
clear all;
close all;
rng(123123);
total_loss = [];
m = 20;
x = linspace(0,pi,m); %// Change
y = sin(x);
w = rand(1,4); %// Change
rate = 0.01; %// Change
x = [ones(1,length(x)); x; x.^2; x.^3]; %// Change - Second and third order terms
for i=1:500
    h = w*x;
    loss = sum((h-y).^2)/m/2;
    total_loss = [total_loss loss];
    % same gradient expression as before, but x now carries the higher-order rows;
    % it sums over all examples, so this is batch gradient descent
    gradient = (h-y)*x'./m;
    w = w - rate.*gradient;
end
If we try this, we get the following for w (your parameters):
>> format long g;
>> w
w =
Columns 1 through 3
0.128369521905694 0.819533906064327 -0.0944622478526915
Column 4
-0.0596638117151464
My final loss after this point is:
loss =
0.00154350916582836
This means that our fitted polynomial (with rounded coefficients) is:
y = 0.128 + 0.820x - 0.094x^2 - 0.060x^3
If we plot this curve together with your sinusoidal data, this is what we get:
xval = x(2,:);
plot(xval, y, xval, polyval(fliplr(w), xval))
legend('Original', 'Fitted');

Related

Correct frequency axis using FFT

How can I get the correct frequency vector to plot using the FFT of MATLAB?
My problem:
N = 64;
n = 0:N-1;
phi1 = 2*(rand-0.5)*pi;
omega1 = pi/6;
phi2 = 2*(rand-0.5)*pi;
omega2 = 5*pi/6;
w = randn(1,N); % noise
x = 2*exp(1i*(n*omega1+phi1))+4*sin(n*omega2+phi2);
h = rectwin(N).';
x = x.*h;
X = abs(fft(x));
Normally I'd do this :
f = Fs/Nsamples*(0:Nsamples/2-1); % Prepare freq data for plot
The problem is that this time I do not have an Fs (sampling frequency).
How can I do it correctly in this case?
If you don't have an Fs, simply set it to 1 (as in one sample per sample). This is the typical solution I've always used and seen everybody else use. Your frequencies will then run from 0 to 1 (or -0.5 to 0.5), without units; this will be recognized by everyone as meaning "periods per sample".
Edit
From your comment I conclude that you are interested in radial frequencies. In that case, you want to set your plot's x-axis to
omega = 2*pi*f;
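Putting it together, a minimal sketch reusing the N and X from the question, and plotting only the one-sided magnitude spectrum:
Fs = 1;                 % one sample per sample (normalized)
f = Fs/N*(0:N/2-1);     % frequency axis in periods per sample
omega = 2*pi*f;         % radial frequency in radians per sample
plot(omega, X(1:N/2));
xlabel('\omega (rad/sample)');
ylabel('|X(\omega)|');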

[Octave] Using fminunc does not always give a consistent solution

I am trying to find the coefficients of an equation that models the step response of a motor, which is of the form 1-e^x. The equation I'm using for the model is of the form
a(1)*t^2 + a(2)*t^3 + a(3)*t^4 + ...
(It is derived in a research paper used to solve for motor parameters.)
Sometimes using fminunc to find the coefficients works out fine and I get a good result that matches the training data fairly well. Other times the returned coefficients are horrible (orders of magnitude larger than what the output should be). This happens especially once I start using higher-order terms: any model that uses x^8 or higher (x^9, x^10, x^11, etc.) always produces bad results.
Since it works sometimes, I can't see why my implementation would be wrong. I have tried fminunc both with and without supplying the gradients, and there is no difference. I've looked into other functions for solving for the coefficients, like polyfit, but polyfit requires terms for every power from 1 up to the highest-order term, whereas my model's lowest power is 2.
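As a side note, since the model is linear in its coefficients, a direct least-squares solve is a useful sanity check against whatever fminunc returns. A minimal sketch, assuming the X and Y built in the code below:
theta_ls = X \ Y   % backslash solves the same least-squares problem directly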
Here is the main code:
clear;
%Overall Constants
max_power = 7;
%Loads in data
%data = load('TestData.txt');
load testdata.mat
%Sets data into variables
indep_x = data(:,1); Y = data(:,2);
%number of data points
m = length(Y);
%X is a matrix of the independent variable raised to powers 2..max_power
exps = 2:max_power;
X_prime = repmat(indep_x, 1, max_power-1); %Repeats columns of the indep var
X = bsxfun(@power, X_prime, exps);
%Initializes theta to rand vals
init_theta = rand(max_power-1,1);
%Sets up options for fminunc
options = optimset( 'MaxIter', 400, 'Algorithm', 'quasi-newton');
%fminunc minimizes the output of the cost function by changing the theta parameters
[theta, cost] = fminunc(#(t)(costFunction(t, X, Y)), init_theta, options)
%Plot the fit against the data
Y_line = X * theta;
figure;
hold on; plot(indep_x, Y, 'or');
hold on; plot(indep_x, Y_line, 'bx');
And here is costFunction:
function [J, Grad] = costFunction (theta, X, Y)
%# of training examples
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based on the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
%Computes the gradients for each theta t
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t));
end
endfunction
Any help or advice would be appreciated.
Try adding regularization to your costFunction:
function [J, Grad] = costFunction (theta, X, Y, lambda)
m = length(Y);
%Initialize Cost and Grad-Vector
J = 0;
Grad = zeros(size(theta));
%Produces an output based on the current values of theta
model_output = X * theta;
%Computes the squared error for each example then adds them to get the total error
squared_error = (model_output - Y).^2;
J = (1/(2*m)) * sum(squared_error);
% Regularization
J = J + lambda*sum(theta(2:end).^2)/(2*m);
%Computes the gradients for each theta t
regularizator = lambda*theta/m;
% overwrite the 1st element, i.e. the one corresponding to theta zero
% (note: your X has no bias column, so here this exempts the t^2 coefficient)
regularizator(1) = 0;
for t = 1:size(theta, 1)
Grad(t) = (1/m) * sum((model_output-Y) .* X(:, t)) + regularizator(t);
end
endfunction
The regularization parameter lambda controls how strongly large coefficients are penalized (it is not a learning rate). Start with lambda = 1. The greater the value of lambda, the stronger the penalty and the more slowly the fit will adapt to the data; increase lambda if the behavior you describe persists. You may also need to increase the number of iterations if lambda gets high.
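With the extra lambda argument, the fminunc call in the main script passes it through the anonymous function, e.g.:
lambda = 1;
[theta, cost] = fminunc(@(t)(costFunction(t, X, Y, lambda)), init_theta, options)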
You may also consider normalizing your data, and some heuristic for initializing theta - setting all theta to 0.1 may be better than random. If nothing else, it will provide better reproducibility from training to training.
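A minimal sketch of what that normalization could look like (z-scoring each column of X before the fminunc call; mu_X and sigma_X are names introduced here for illustration):
mu_X = mean(X);
sigma_X = std(X);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu_X), sigma_X); % z-score each feature column
init_theta = 0.1 * ones(max_power-1, 1); % fixed initialization for reproducibility
[theta, cost] = fminunc(@(t)(costFunction(t, X_norm, Y)), init_theta, options)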

Multivariate normal distribution in MATLAB, probability area

I have 2 arrays: one with x-coordinates, the other with y-coordinates.
Both are normally distributed, as the result of a Monte Carlo simulation. I know how to find the sigma and mu for each array, and to get a 95% confidence interval:
[mu,sigma]=normfit(x_array);
hist(x_array);
x=norminv([0.025 0.975],mu,sigma)
However, the two arrays are correlated with each other. To plot the probability distribution of the combined arrays, I use the multivariate normal distribution. In MATLAB this gives me:
[MuX,SigmaX]=normfit(x_array);
[MuY,SigmaY]=normfit(y_array);
mu = [MuX MuY];
Sigma=cov(x_array,y_array);
x1 = MuX-4*SigmaX:5:MuX+4*SigmaX; x2 = MuY-4*SigmaY:5:MuY+4*SigmaY;
[X1,X2] = meshgrid(x1,x2);
F = mvnpdf([X1(:) X2(:)],mu,Sigma);
F = reshape(F,length(x2),length(x1));
surf(x1,x2,F);
caxis([min(F(:))-.5*range(F(:)),max(F(:))]);
set(gca,'Ydir','reverse')
xlabel('x0-as'); ylabel('y0-as'); zlabel('Probability Density');
So far so good. Now I want to calculate the 95% probability area. I'm looking for a function like mndinv, analogous to norminv. However, such a function doesn't exist in MATLAB, which makes sense because there are endless possibilities... Does somebody have a tip on how to get a 95% probability area? Thanks in advance.
For the bivariate case you can add the ellipse that encloses 95% of the probability mass. This ellipse is uniquely identified; for a proof, see the first source listed below.
% Suppose you know the distribution params, or you got them from normfit()
mu = [3, 7];
sigma = [1, 2.5
2.5 9];
% X/Y values for plotting grid
x = linspace(mu(1)-3*sqrt(sigma(1)), mu(1)+3*sqrt(sigma(1)),100);
y = linspace(mu(2)-3*sqrt(sigma(end)), mu(2)+3*sqrt(sigma(end)),100);
% Z values
[X1,X2] = meshgrid(x,y);
Z = mvnpdf([X1(:) X2(:)],mu,sigma);
Z = reshape(Z,length(y),length(x));
% Plot
h = pcolor(x,y,Z);
set(h,'LineStyle','none')
hold on
% Add level set
alpha = 0.05;
r = sqrt(-2*log(alpha)); % radius of the (1-alpha) ellipse: r^2 = chi2inv(1-alpha, 2) = -2*log(alpha)
rho = sigma(2)/sqrt(sigma(1)*sigma(end));
M = [sqrt(sigma(1)) rho*sqrt(sigma(end))
0 sqrt(sigma(end)-sigma(end)*rho^2)];
theta = 0:0.1:2*pi;
f = bsxfun(@plus, r*[cos(theta)', sin(theta)']*M, mu);
plot(f(:,1), f(:,2),'--r')
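As a quick numerical sanity check of the coverage (reusing the Z, x, y, alpha, and sigma above; on the boundary of the (1-alpha) ellipse the pdf of a bivariate normal equals alpha/(2*pi*sqrt(det(sigma)))):
dx = x(2) - x(1); dy = y(2) - y(1);
level = alpha / (2*pi*sqrt(det(sigma))); % pdf value on the 95% ellipse boundary
mass = sum(Z(Z >= level)) * dx * dy      % should come out close to 0.95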
Sources
https://upload.wikimedia.org/wikipedia/commons/a/a2/Cumulative_function_n_dimensional_Gaussians_12.2013.pdf
https://en.wikipedia.org/wiki/Multivariate_normal_distribution
To get the numerical value of F above which the top part lies, use top5 = prctile(F(:), 95). This returns the value of F that separates the bottom 95% of the values from the top 5%.
Then you can get just the top 5% with
Ftop = (F > top5) .* F; % zero out everything below the 95th percentile
%// optional: Ftop(Ftop==0)=NaN;
surf(x1,x2,Ftop,'LineStyle','none');
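To draw just the boundary of that region instead, a single contour at the same level works (a sketch reusing F, x1, x2 from the question):
figure;
contour(x1, x2, F, [top5 top5], '--r'); % one contour line at the 95th-percentile level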

Testing for Unimodal (Unimodality) or Bimodal (Bimodality) Distribution in MATLAB

Is there a way in MATLAB to check whether the histogram distribution is unimodal or bimodal?
EDIT
Do you think Hartigan's Dip Statistic would work? I tried passing an image to it and got the value 0. What does that mean?
And when passing an image, does it test the distribution of the image histogram over the gray levels?
Thanks.
Here is a script using Nic Price's implementation of Hartigan's Dip Test to identify unimodal distributions. The tricky point was calculating xpdf, which is not a probability density function but rather a sorted sample.
p_value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. In this case the null hypothesis is that the distribution is unimodal.
close all; clear all;
function [x2, n, b] = compute_xpdf(x)
  x2 = reshape(x, 1, numel(x));
  [n, b] = hist(x2, 40);
  % This is definitely not a probability density function;
  % it is the sorted sample
  x2 = sort(x2);
  % downsampling to speed up computations
  x2 = interp1(1:length(x2), x2, 1:1000:length(x2));
end
nboot = 500;
sample_size = [256 256];
% Unimodal
sample2d = normrnd(0.0, 10.0, sample_size);
[xpdf, n, b] = compute_xpdf(sample2d);
[dip, p_value, xlow, xup] = HartigansDipSignifTest(xpdf, nboot);
figure;
subplot(1,2,1);
bar(b, n) % bin centers on the x-axis, counts on the y-axis
title(sprintf('Probability of unimodal %.2f', p_value))
% Bimodal
sample2d = sign(sample2d) .* (abs(sample2d) .^ 0.5);
[xpdf, n, b] = compute_xpdf(sample2d);
[dip, p_value, xlow, xup] = HartigansDipSignifTest(xpdf, nboot);
subplot(1,2,2);
bar(b, n) % bin centers on the x-axis, counts on the y-axis
title(sprintf('Probability of unimodal %.2f', p_value))
print -dpng modality.png
There are many different ways to do what you are asking. In the most literal sense, "bimodal" means there are two peaks. Usually, though, you want the two peaks to be separated by some reasonable distance and to each contain a reasonable proportion of the total counts. Only you know what is "reasonable" for your situation, but the following approach might help.
1. Create a histogram of the intensities.
2. Form the cumulative distribution with cumsum.
3. For different values of the "cut" between distributions (25%, 30%, 50%, ...), compute the mean and standard deviation of the two distributions (above and below the cut).
4. Compute the distance between the means divided by the sum of the standard deviations of the two distributions.
5. That quantity will be at a maximum at the "best cut".
You have to decide what size of that quantity represents "bimodal" for you. Here is some code that demonstrates what I am talking about. It generates bimodal distributions of different degrees of separation - two Gaussians, with increasing delta between them (in steps the size of one standard deviation). I compute the quantity described above and plot it for a range of values of delta. I then fit a parabola through this curve over a range corresponding to +- 1 sigma of the entire distribution. As you can see, as the distribution becomes more bimodal, two things happen:
The curvature of this curve flips (it goes from a valley to a peak)
The maximum increases (it is about 1.33 for a Gaussian).
You can look at these quantities for some of your own distributions, and decide where you want to put the cutoff.
% test for bimodal distribution
close all
for delta = 0:10:50
    a1 = randn(100,100) * 10 + 25;
    a2 = randn(100,100) * 10 + 25 + delta;
    a3 = [a1(:); a2(:)];
    [h hb] = hist(a3, 0:100);
    cs = cumsum(h);
    llimi = find(cs < 0.2 * max(cs(:)));
    ulimi = find(cs > 0.8 * max(cs(:)));
    llim = hb(llimi(end));
    ulim = hb(ulimi(1));
    cuts = linspace(llim, ulim, 20);
    dmean = mean(a3);
    dstd = std(a3);
    for ci = 1:numel(cuts)
        d1 = a3(a3 < cuts(ci));
        d2 = a3(a3 >= cuts(ci));
        m(ci, 1) = mean(d1);
        m(ci, 2) = mean(d2);
        s(ci, 1) = std(d1);
        s(ci, 2) = std(d2);
    end
    q = (m(:, 2) - m(:, 1)) ./ sum(s, 2);
    figure;
    plot(cuts, q);
    title(sprintf('delta = %d', delta))
    % compute curvature of plot around mean:
    xlims = dmean + [-1 1] * dstd;
    indx = find(cuts < xlims(2) & cuts > xlims(1)); % element-wise &, not &&, on vectors
    pf = polyfit(cuts(indx), q(indx), 2);
    qpk = polyval(pf, dmean); % use a new name; m already holds the means
    fprintf(1, 'coefficients: a = %.2e, peak = %.2f\n', pf(1), qpk);
end
Output values:
coefficients: a = 1.37e-03, peak = 1.32
coefficients: a = 1.01e-03, peak = 1.34
coefficients: a = 2.85e-04, peak = 1.45
coefficients: a = -5.78e-04, peak = 1.70
coefficients: a = -1.29e-03, peak = 2.08
coefficients: a = -1.58e-03, peak = 2.48
[Sample plots of q versus cut for each delta, and the histogram for delta = 40, omitted.]

Stochastic gradient descent implementation - MATLAB

I'm trying to implement stochastic gradient descent in MATLAB. I followed the algorithm exactly, but I'm getting VERY VERY large w (coefficients) for the prediction/fitting function. Do I have a mistake in the algorithm?
The algorithm:
x = 0:0.1:2*pi; % x-axis
n = size(x,2);
r = -0.2+(0.4).*rand(n,1); % random noise to be added to the sin(x) function
t = zeros(1,n);
y = zeros(1,n);
for i=1:n
    t(i) = sin(x(i)) + r(i); % adding the noise
    y(i) = sin(x(i)); % the function without noise
end
f = randi(n, 20, 1); % random indexes (round(1+rand(20,1)*n) can exceed n)
h = x(f); % choosing random x points
k = t(f); % choosing random y points
m = size(h,2); % length of the h vector
scatter(h,k,'Red'); % drawing the training points (with noise)
%scatter(x,t,2);
hold on;
plot(x,sin(x)); % plotting the sin function
w = [0.3 1 0.5]; % starting point of w
a = 0.05; % learning rate "alpha"
% ---------------- ALGORITHM ---------------------
for i=1:20
    v = [1 h(i) h(i).^2]; % feature (X) vector
    e = ((w*v') - k(i)).*v; % (prediction - observation) times the features
    w = w - a*e; % updating w
end
hold on;
l = 0:1:6;
g = w(1)+w(2)*l+w(3)*(l.^2);
plot(l,g,'Yellow'); % drawing the prediction function
If you use too big a learning rate, SGD is likely to diverge.
The learning rate should decay toward zero over the iterations.
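A minimal sketch of such a decaying step size, applied to the loop from the question (the a0/i schedule is one common choice, introduced here for illustration):
a0 = 0.05;
w = [0.3 1 0.5]; % same starting point as in the question
for i = 1:20
    a = a0 / i; % step size decays toward zero (Robbins-Monro style)
    v = [1 h(i) h(i).^2];
    e = ((w*v') - k(i)).*v;
    w = w - a*e;
end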
Typically, if w ends up with very large values, there is overfitting. I didn't really look at your code carefully, but I think what is missing is a proper regularization term, which prevents the training from overfitting. Also, here:
e = ((w*v') - k(i)).*v;
shouldn't v be the gradient of the predicted value with respect to w? Check it against the algorithm you are following, replace it if needed, and see how the fit looks after doing so.
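For what it's worth, a sketch of the update with an L2 regularization term added (lambda is an illustrative value, not from the original post; the bias term is left unpenalized):
lambda = 0.01; % illustrative regularization strength
w = [0.3 1 0.5];
for i = 1:20
    v = [1 h(i) h(i).^2];
    e = ((w*v') - k(i)).*v + lambda*[0 w(2:3)]; % squared-error gradient plus L2 penalty
    w = w - a*e;
end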