Calculating Empirical Risk using LIBSVM and MATLAB

I am trying to understand how to calculate empirical risk using MATLAB and LIBSVM's MATLAB bindings. I have outcomes Y (1x100) labeled as either -1 or +1, and 10-dimensional observations given by X (100x10). I then call svmtrain to get my model. The empirical risk is given by the following equation:
R_emp(alpha) = 1/(2N) * sum_{i=1..N} |y_i - f(x_i, alpha)|
Based on the values I receive from svmtrain, how do I get f(xi, alpha)?
Here is what I have so far:
params = sprintf('-s 0 -t 0 -c %d', C);
%X1 and Y1 are values I generate
m1 = svmtrain(Y1, X1, params);
y = diag(Y1(m1.sv_indices));
x = X1(m1.sv_indices, :);
alpha = m1.sv_coef;
w = alpha'*y*x;

The empirical risk is just the ratio of misclassified training data points: y_i is the actual label and f(x_i, alpha) is the label predicted for x_i by the SVM trained with support-vector coefficients alpha. The formula assumes that the labels are either +1 or -1.
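A minimal sketch of that computation with LIBSVM's MATLAB interface, reusing m1, X1 and Y1 from the question (svmpredict is LIBSVM's prediction routine and returns the predicted labels, i.e. f(xi, alpha)):
[pred, ~, ~] = svmpredict(Y1, X1, m1);         % predicted labels on the training set
emp_risk = mean(pred(:) ~= Y1(:));             % ratio of misclassified training points
% equivalently, with labels in {-1, +1}:
% emp_risk = sum(abs(Y1(:) - pred(:))) / (2 * numel(Y1));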

Related

Cross validation and ROC curve using Matlab: how to plot the mean ROC curve?

I am using k-fold cross validation with k = 10, so I have 10 ROC curves.
I would like to average these curves. I can't simply average the Y-axis values returned by perfcurve, because the returned vectors are not all the same size.
[X1,Y1,T1,AUC1] = perfcurve(t_test(1),resp(1),1);
.
.
.
[X10,Y10,T10,AUC10] = perfcurve(t_test(10),resp(10),1);
How to solve this? How can I plot the average curve of the 10 ROC curves?
So, you have k curves with different numbers of points, all bounded to the [0, 1] interval in both dimensions. First, calculate interpolated values for each curve at a common set of query points. You then have new curves with a fixed number of points and can compute their mean. The interp1 function does the interpolation.
%% generating sample data
k = 10;
X = cell(k, 1);
Y = cell(k, 1);
hold on;
for i=1:k
n = 10+randi(10);
X{i} = sort([0 1 rand(1, n)]);
Y{i} = sort([0 1 rand(1, n)].^.5);
end
%% Calculating interpolations
% location of query points
X2 = linspace(0, 1, 50);
n = numel(X2);
% initializing values for different curves at different query points
Y2 = zeros(k, n);
for i=1:k
% finding interpolated values for i-th curve
Y2(i, :) = interp1(X{i}, Y{i}, X2);
end
% finding the mean
meanY = mean(Y2, 1);
Notice that the choice of interpolation method can affect your results. ROC curves, for example, are stairs-like data. To find the exact values on such curves, you should use previous-neighbour interpolation instead of linear interpolation, which is interp1's default:
Y2(i, :) = interp1(X{i}, Y{i}, X2); % linear
Y3(i, :) = interp1(X{i}, Y{i}, X2, 'previous');
This choice visibly affects the final averaged curve.
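As a final step (not shown above), the averaged curve can be plotted against the query points; a minimal sketch reusing X2 and meanY from the code above (the axis labels assume the curves are ROC curves):
figure; hold on; box on;
plot(X2, meanY, 'b-', 'LineWidth', 2);   % averaged ROC curve
xlabel('False positive rate');
ylabel('True positive rate');
title('Mean ROC curve over k folds');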
I solved it using MATLAB's perfcurve. For that, I passed cell arrays of per-fold vectors (each 1xn) as the "labels" and "scores" arguments. perfcurve then treats them as the results of the k folds and returns the average ROC curve and its confidence interval, as well as the AUC and its confidence interval.
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);
t_test_list and resp_list are lists (cell arrays) of size 1xk (k is the number of folds), and each element of the lists is a 1xn vector with the scores or labels. Inside each fold:
resp = nnet(x_test(i));
t_test_act = t_test(i);
resp has a 2xn format (n is the number of predicted samples); there are two classes.
t_test_act contains the labels of the current test set; it also has a 2xn format and is composed of 0s and 1s (each column holds one 1 and one 0, indicating the true class of the sample).
resp_list{i} = resp(1,:);           % scores
t_test_list{i} = t_test_act(1,:);   % labels
[X1,Y1,T1,AUC1] = perfcurve(t_test_list,resp_list,1);
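Putting the pieces together, a sketch of the complete per-fold loop described above (assuming x_test and t_test are cell arrays holding each fold's test inputs and 2xn one-hot targets, and nnet is the trained network; the curly-brace indexing is an assumption):
k = 10;                                  % number of folds
resp_list   = cell(1, k);
t_test_list = cell(1, k);
for i = 1:k
    resp = nnet(x_test{i});              % 2xn scores for the current fold
    t_test_act = t_test{i};              % 2xn one-hot labels for the current fold
    resp_list{i}   = resp(1, :);         % scores for the positive class
    t_test_list{i} = t_test_act(1, :);   % labels for the positive class
end
[X1,Y1,T1,AUC1] = perfcurve(t_test_list, resp_list, 1);   % mean ROC curve with confidence bounds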

Command `gmdistribution` in Matlab?

I have a question on the Matlab command gmdistribution to generate draws from mixtures of Gaussians.
Consider the following code to draw from a mixture of two bivariate normals
clear
rng default
P=10^4; %number draws
%First component (X1,X2)
v=1;
mu_a = [0,2];
sigma_a = [v,0;0,v];
%Second component (Y1,Y2)
mu_b = [0,4];
sigma_b = [v,0;0,v];
MU = [mu_a;mu_b];
SIGMA = cat(3,sigma_a,sigma_b);
w = ones(1,2)/2; %equal weight 0.5
obj = gmdistribution(MU,SIGMA,w);
%Draws of the mixture (R1,R2)
R = random(obj,P);%nx2
We know that (R1, R2) may be correlated. Indeed, we can show that
cov(R1, R2) = 1/4*cov(X1, Y2) + 1/4*cov(X2, Y1)
because
cov(W1, W2) = E(W1*W2) - E(W1)*E(W2)
            = 1/4*E(X1*X2) + 1/4*E(X1*Y2) + 1/4*E(Y1*X2) + 1/4*E(Y1*Y2)
              - [1/2*E(X1) + 1/2*E(Y1)]*[1/2*E(X2) + 1/2*E(Y2)]
            = 1/4*cov(X1, Y2) + 1/4*cov(Y1, X2)
However, if I check their correlation
corr(R(:,1), R(:,2))
I get almost zero (0.0024)
I checked many other values of MU and SIGMA, but I couldn't find any case with a correlation noticeably far from 0. Is this just a coincidence, or does the command gmdistribution impose that (X1,X2) is independent of (Y1,Y2)?
We can best illustrate the problem with a figure. To make the effect more visible, I decreased the variance of both components from 1 to 0.2 (v = 0.2). If we then draw some realisations from the mixture model, we get a scatterplot with two "blobs", each corresponding to one component: one centred at (0,2), the other at (0,4).
Now, the linear correlation coefficient tells us how much W2 increases when W1 increases by one. But there is no such trend in the realisations: when W1 increases, W2 neither increases nor decreases systematically.
This is because both components have the same mean (0) in W1. If that is not the case, e.g. mu_a = [0,2] and mu_b = [2,5], then it is clearly visible in the scatterplot that when W1 is high, chances are that W2 is also very high. This leads to a high positive correlation of about 0.87. Summing up: if either mu_a(1) == mu_b(1) or mu_a(2) == mu_b(2), then the correlation will be near zero.
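A quick numerical check of this (a sketch reusing the question's setup, with v = 0.2 as in this answer's figures; the variable names here are illustrative):
v = 0.2;
SIGMA = cat(3, [v 0; 0 v], [v 0; 0 v]);
w = [0.5 0.5];
P = 1e4;
% component means share the first coordinate -> correlation near zero
obj_same = gmdistribution([0 2; 0 4], SIGMA, w);
R_same = random(obj_same, P);
corr(R_same(:,1), R_same(:,2))   % close to 0
% component means differ in both coordinates -> strong positive correlation
obj_diff = gmdistribution([0 2; 2 5], SIGMA, w);
R_diff = random(obj_diff, P);
corr(R_diff(:,1), R_diff(:,2))   % roughly 0.87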

Check whether predicted values follow the Gaussian distribution or not using MATLAB?

I have used a Gaussian process for my prediction. Now assume that the predicted values are stored in x, of size 1900x1. I want to check whether their distribution follows a Gaussian (normal) distribution or not. I need this in order to compare against the distributions of predicted values from other methods such as NN and KNN, to judge which one follows a smooth Gaussian/normal distribution most closely.
How can I do this? It would be better if I could get the result as numerical data. The code is written as follows:
m = mean(ypred); % mean of ypred
s = std(ypred); % stdev of ypred
pd = makedist('Normal','mu',m,'sigma',s); % make probability distribution with mu = m and sigma = s
[h,p] = kstest(ypred,'CDF',pd); % calculate probability that it is a normal distribution
The ypred values are the output obtained from MATLAB's fitrgp; a sample of ypred values was attached to the question. The second figure is a residual QQ-plot of the measured and predicted values.
You can perform a one-sample Kolmogorov-Smirnov test:
x = 1 + 2.*randn(1000,1); % just some random normal distributed data, replace it with your actual 1900x1 vector.
m = mean(x); % mean of x
s = std(x); % stdev of x
pd = makedist('Normal','mu',m,'sigma',s); % make probability distribution with mu = m and sigma = s
[h,p] = kstest(x,'CDF',pd); % calculate probability that it is a normal distribution
Here p is the p-value of the test and h = 1 if the null hypothesis is rejected at a significance level of 0.05. Since the null hypothesis is "the data follow a normal distribution", h = 0 means that the data are consistent with a normal distribution.
Since x in this example was sampled from a normal distribution, most likely h = 0 and p > 0.05. If you run the above code with
x = 1 + 2.*rand(1000,1); % sampled from a uniform distribution
h will most likely be 1 and p < 0.05. Of course you can write the whole thing as a one-liner to avoid creating m, s and pd.
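A sketch of such a one-liner, equivalent to the code above:
[h, p] = kstest(x, 'CDF', makedist('Normal', 'mu', mean(x), 'sigma', std(x)));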

Plot SVM margins using MATLAB and libsvm

I am using libsvm to classify two-dimensional, linearly non-separable data. I am able to train the SVM and obtain w and b using libsvm. Using this information I can plot the decision boundary, along with the support vectors, but I am not sure how to plot the margins, using the information that libsvm gives me.
Below is my code:
model = svmtrain(Y,X, '-s 0 -t 0 -c 100');
w = model.SVs' * model.sv_coef;
b = -model.rho;
if (model.Label(1) == -1)
w = -w; b = -b;
end
y_hat = sign(w'*X' + b);
sv = full(model.SVs);
% plot support vectors
plot(sv(:,1),sv(:,2),'ko', 'MarkerSize', 10);
% plot decision boundary
plot_x = linspace(min(X(:,1)), max(X(:,1)), 30);
plot_y = (-1/w(2))*(w(1)*plot_x + b);
plot(plot_x, plot_y, 'k-', 'LineWidth', 1)
It depends on what you mean by "the margins". It also depends on which SVM formulation you are talking about (separable or non-separable), but since you mentioned libsvm I'll assume you mean the more general, non-separable version.
The term "margin" can refer to the Euclidean distance from the separating hyperplane to the hyperplane defined by wx+b=1 (or wx+b=-1). This distance is given by 1/norm(w).
"Margin" can also refer to the margin of a specific sample x, which is the Euclidean distance of x from the separating hyperplane. It is given by
(wx+b)/norm(w)
Note that this is a signed distance: it is negative or positive depending on which side of the hyperplane the point x resides on. You can draw it as a line from the point, perpendicular to the hyperplane.
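For example, with the question's variables (X is n-by-2 and w is the 2-by-1 vector computed above), the signed margin of every training sample is, as a sketch:
sample_margin = (X*w + b) / norm(w);   % one signed distance per row of X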
Another interesting value is the slack variable xi, which is the "algebraic" distance (not Euclidean) of a support vector from the "hard" margin defined by wx+b=+1 (or -1). It is positive only for support vectors, and if a point is not a support vector, its xi equals 0. More compactly:
xi = max(0, 1 - y*(w'*x+b))
where y is the label.
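Putting this together with the question's plotting code, a sketch of how to draw the two margin hyperplanes wx+b = +1 and wx+b = -1 and compute the slack variables (reusing w, b, plot_x, X and Y from the question):
% margin hyperplanes wx + b = +1 and wx + b = -1
margin_pos = (-1/w(2)) * (w(1)*plot_x + b - 1);
margin_neg = (-1/w(2)) * (w(1)*plot_x + b + 1);
plot(plot_x, margin_pos, 'k--', 'LineWidth', 1);
plot(plot_x, margin_neg, 'k--', 'LineWidth', 1);
% slack variables: zero except for points that violate their margin
xi = max(0, 1 - Y .* (X*w + b));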

Random samples from Lognormal distribution

I have a parameter X that is lognormally distributed with mean 15 and standard deviation 0.48. For a Monte Carlo simulation in MATLAB, I want to generate 40,000 samples from this distribution. How can this be done in MATLAB?
To generate an MxN matrix of lognormally distributed random numbers with parameters mu and sigma, use lognrnd (Statistics Toolbox):
result = lognrnd(mu,sigma,M,N);
If you don't have the Statistics Toolbox, you can equivalently use randn and then take the exponential. This exploits the fact that, by definition, the logarithm of a lognormal random variable is a normal random variable:
result = exp(mu+sigma*randn(M,N));
The parameters mu and sigma of the lognormal distribution are the mean and standard deviation of the associated normal distribution. To see how the mean and standard deviation of the lognormal distribution are related to the parameters mu and sigma, see the lognrnd documentation.
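For the question's specific numbers (lognormal mean 15, standard deviation 0.48, 40,000 samples), a sketch of that conversion, assuming the stated mean and standard deviation refer to the lognormal variable itself:
m = 15; sd = 0.48; v = sd^2;
mu = log(m^2 / sqrt(v + m^2));    % mu of the associated normal distribution
sigma = sqrt(log(1 + v/m^2));     % sigma of the associated normal distribution
samples = lognrnd(mu, sigma, 40000, 1);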
To generate random samples, you need the inverse CDF. Once you have it, generating samples is nothing more than my_icdf(rand(n, m)).
First get the CDF (by integrating the PDF) and then invert that function to obtain the inverse CDF.
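For the lognormal case the inverse CDF is already available in the Statistics Toolbox as logninv, so this approach reduces to the following sketch (mu and sigma being the parameters of the associated normal distribution):
samples = logninv(rand(40000, 1), mu, sigma);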
You can convert between the mean and variance of the lognormal distribution and the parameters (mu, sigma) of the associated normal (Gaussian) distribution using the standard conversion formulas (implemented below).
The approach below uses the Probability Distribution Objects introduced in MATLAB 2013a. More specifically, it uses the makedist, random, and pdf functions.
% Notation
% if X ~ Lognormal(mu,sigma) then E[X] = m & Var(X) = v
m = 15; % Target mean for Lognormal distribution
v = 0.48; % Target variance Lognormal distribution
getLmuh = @(m,v) log(m/sqrt(1+(v/(m^2))));
getLvarh = @(m,v) log(1 + (v/(m^2)));
mu = getLmuh(m,v);
sigma = sqrt(getLvarh(m,v));
% Generate Random Samples
pd = makedist('Lognormal',mu,sigma);
X = random(pd,1000,1); % Generates a 1000 x 1 vector of samples
You can verify the correctness via the mean and var functions and the distribution object:
>> mean(pd)
ans =
15
>> var(pd)
ans =
0.4800
Generating samples via the inverse transform is also made easy using the icdf (inverse CDF) function.
% Alternate way to generate X~Lognormal(mu,sigma)
U = rand(1000,1); % U ~ Uniform(0,1)
X = icdf(pd,U); % Inverse Transform
The following graphic was generated by the code below (MATLAB 2018a).
Xrng = [0:.01:20]';
figure, hold on, box on
h(1) = histogram(X,'DisplayName','Random Sample (N = 1000)');
h(2) = plot(Xrng,pdf(pd,Xrng),'b-','DisplayName','Theoretical PDF');
legend('show','Location','northwest')
title('Lognormal')
xlabel('X')
ylabel('Probability Density Function')
% Options
h(1).Normalization = 'pdf';
h(1).FaceColor = 'k';
h(1).FaceAlpha = 0.35;
h(2).LineWidth = 2;