The perfcurve function in Matlab falsely, asserts AUC=1 when two records are clearly misclassified for reasonable cutoff values.
If I run the same data through a confusion matrix with cutoff 0.5, the accuracy is rightfully below 1.
The MWE contains data from one of my folds. I noticed the problem because I saw perfect auc with less than perfect accuracy in my results.
I use Matlab 2016a and Ubuntu 16.4 64bit.
% These are the true classes in one of my test-set folds
classes = transpose([ones(1,9) 2*ones(1,7)])
% These are predictions from my classifier
% Class 1 is very well predicted
% Class 2 has two records predicted as class 1 with threshold 0.5
confidence = transpose([1.0 1.0 1.0 1.0 0.9999 1.0 1.0...
1.0 1.0 0.0 0.7694 0.0 0.9917 0.0 0.0269 0.002])
positiveClass = 1
% Nevertheless, the AUC yields a perfect 1
% I understand why X, Y, T have less values than classes and confidence
% Identical records are dealt with by one point on the ROC curve
[X,Y,T,AUC] = perfcurve(classes, confidence, positiveClass)
% The confusion matrix for comparison
threshold = 0.5
confus = confusionmat(classes,(confidence<threshold)+1)
accuracy = trace(confus)/sum(sum(confus))
This simply means that there is another cutoff where the separation is perfect.
Try:
threshold = 0.995
confus = confusionmat(classes,(confidence<threshold)+1)
accuracy = trace(confus)/sum(sum(confus))
Related
I am trying to fit a distribution to a discrete dataset.
The possible outcomes are A = [1 3 4 5 9 10] with a respective probability of prob
prob = [0.2 0.15 0.1 0.05 0.35 0.15];
I have used makedist to find the distribution
pd = makedist('Multinomial','probabilities', prob);
I wonder if there is a way to include the outcomes 1 to 10 from A in the distribution, such that I can calculate the mean and the variance of the possible outcomes with var(pd), mean(pd). Up till now the mean is 3.65, but my aim is it to have mean(pd) = 5.95, which is the weighted sum of the possible outcomes . Thanks in advance.
The solution is pretty easy. The possible outcomes of a multinomial distrubution are defined by a sequence of values starting at 1 and ending at numel(prob). From this official documentation page:
Create a multinomial distribution object for a distribution with three
possible outcomes. Outcome 1 has a probability of 1/2, outcome 2 has a
probability of 1/3, and outcome 3 has a probability of 1/6.
pd = makedist('Multinomial','probabilities',[1/2 1/3 1/6])
Basically, your vector of possible outcomes includes a few values associated to a null (signally 0) probability. Thus, define your distribution as follows in order to obtain the expected result:
p = [0.20 0.00 0.15 0.10 0.05 0.00 0.00 0.00 0.35 0.15];
pd = makedist('Multinomial','probabilities',p);
mean(pd) % 5.95
var(pd) % 12.35
I would like to create 500 ms of band-limited (100-640 Hz) white Gaussian noise with a (relatively) flat frequency spectrum. The noise should be normally distributed with mean = ~0 and 99.7% of values between ± 2 (i.e. standard deviation = 2/3). My sample rate is 1280 Hz; thus, a new amplitude is generated for each frame.
duration = 500e-3;
rate = 1280;
amplitude = 2;
npoints = duration * rate;
noise = (amplitude/3)* randn( 1, npoints );
% Gaus distributed white noise; mean = ~0; 99.7% of amplitudes between ± 2.
time = (0:npoints-1) / rate
Could somebody please show me how to filter the signal for the desired result (i.e. 100-640 Hz)? In addition, I was hoping somebody could also show me how to generate a graph to illustrate that the frequency spectrum is indeed flat.
I intend on importing the waveform to Signal (CED) to output as a form of transcranial electrical stimulation.
The following is Matlab implementation of the method alluded to by "Some Guy" in a comment to your question.
% In frequency domain, white noise has constant amplitude but uniformly
% distributed random phase. We generate this here. Only half of the
% samples are generated here, the rest are computed later using the complex
% conjugate symmetry property of the FFT (of real signals).
X = [1; exp(i*2*pi*rand(npoints/2-1,1)); 1]; % X(1) and X(NFFT/2) must be real
% Identify the locations of frequency bins. These will be used to zero out
% the elements of X that are not in the desired band
freqbins = (0:npoints/2)'/npoints*rate;
% Zero out the frequency components outside the desired band
X(find((freqbins < 100) | (freqbins > 640))) = 0;
% Use the complex conjugate symmetry property of the FFT (for real signals) to
% generate the other half of the frequency-domain signal
X = [X; conj(flipud(X(2:end-1)))];
% IFFT to convert to time-domain
noise = real(ifft(X));
% Normalize such that 99.7% of the times signal lies between ±2
noise = 2*noise/prctile(noise, 99.7);
Statistical analysis of around a million samples generated using this method results in the following spectrum and distribution:
Firstly, the spectrum (using Welch method) is, as expected, flat in the band of interest:
Also, the distribution, estimated using histogram of the signal, matches the Gaussian PDF quite well.
I can't get my mind around the concept of how to calculate bias and variance from a random set.
I have created the code to generate a random normal set of numbers.
% Generate random w, x, and noise from standard Gaussian
w = randn(10,1);
x = randn(600,10);
noise = randn(600,1);
and then extract the y values
y = x*w + noise;
After that I split my data into a training (100) and test (500) set
% Split data set into a training (100) and a test set (500)
x_train = x([ 1:100],:);
x_test = x([101:600],:);
y_train = y([ 1:100],:);
y_test = y([101:600],:);
train_l = length(y_train);
test_l = length(y_test);
Then I calculated the w for a specific value of lambda (1.2)
lambda = 1.2;
% Calculate the optimal w
A = x_train'*x_train+lambda*train_l*eye(10,10);
B = x_train'*y_train;
w_train = A\B;
Finally, I am computing the square error:
% Compute the mean squared error on both the training and the
% test set
sum_train = sum((x_train*w_train - y_train).^2);
MSE_train = sum_train/train_l;
sum_test = sum((x_test*w_train - y_test).^2);
MSE_test = sum_test/test_l;
I know that if I create a vector of lambda (I have already done that) over some iterations I can plot the average MSE_train and MSE_test as a function of lambda, where then I will be able to verify that large differences between MSE_test and MSE_train indicate high variance, thus overfit.
But, what I want to do extra, is to calculate the variance and the bias^2.
Taken from Ridge Regression Notes at page 7, it guides us how to calculate the bias and the variance.
My questions is, should I follow its steps on the whole random dataset (600) or on the training set? I think the bias^2 and the variance should be calculated on the training set. Also, in Theorem 2 (page 7 again) the bias is calculated by the negative product of lambda, W, and beta, the beta is my original w (w = randn(10,1)) am I right?
Sorry for the long post, but I really want to understand how the concept works in practice.
UPDATE 1:
Ok, so following the previous paper didn't generate any good results. So, I took the standard form of Ridge Regression Bias-Variance which is:
Based on that, I created (I used the test set):
% Bias and Variance
sum_bias=sum((y_test - mean(x_test*w_train)).^2);
Bias = sum_bias/test_l;
sum_var=sum((mean(x_test*w_train)- x_test*w_train).^2);
Variance = sum_var/test_l;
But, after 200 iterations and for 10 different lambdas this is what I get, which is not what I expected.
Where in fact, I was hoping for something like this:
sum_bias=sum((y_test - mean(x_test*w_train)).^2); Bias = sum_bias/test_l
Why have you squared the difference between y_test and y_predicted = x_test*w_train?
I don't believe your formula for bias is correct. In your question, the 'bias term' above in blue is the bias^2 however surely your formula is neither the bias nor the bias^2 since you have only squared the residuals, not the entire bias?
I have a time varying signal (time,amplitude) and a measured frequency sensitivity (frequency,amplitude conversion factor (Mf)).
I know that if I use the center frequency of my time signal to select the amplitude conversion factor (e.g. 0.0312) for my signal I get a max. converted amplitude value of 1.4383.
I have written some code to deconvolve the time varying signal and known sensitivity (i.e. for all frequencies).
where Pt is the output/converted amplitude and Mf is amplitude conversion factor data and fft(a) is the fft of the time varying signal (a).
I take the real part of the fft(a):
xdft = fft(a);
xdft = xdft(1:length(x)/2+1); % only retaining the positive frequencies
freq = Fs*(0:(L/2))/L;
where Fs is sampling frequency and L is length of signal.
convS = real(xdft).*Mf;
assuming Mf is magnitude = real (I don't have phase info). I also interpolate
Mf=interp1(freq_Mf,Mf_in,freq,'cubic');
so at the same interrogation points as freq.
I then reconstruct the signal in time domain using:
fftRespI=complex(real(convS),imag(xdft));
pt = ifft(fftRespI,L,'symmetric')
where I use the imaginary part of the fft(a).
The reconstructed signal shape looks correct but the amplitude of the signal is not.
If I set all values of Mf = 0.0312 for f=0..N MHz I expect a max. converted amplitude value of ~1.4383 (similar to if I use the center frequency) but I get 13.0560.
How do I calibrate the amplitude axis? i.e. how do I correctly multiply fft(a) by Mf?
Some improved understanding of the y axis of the abs(magnitude) and real FFT would help me I think...
thanks
You need to re-arrange the order of your weights Mf to match the MATLAB's order for the frequencies. Let's say you have a signal
N=10;
x=randn(1,N);
y=fft(x);
The order of the frequencies in the output y is
[0:1:floor(N/2)-1,floor(-N/2):1:-1] = [0 1 2 3 4 -5 -4 -3 -2 -1]
So if your weights
Mf = randn(1,N)
are defined in this order: Mf == [0 1 2 3 4 5 6 7 8 9], you will have to re-arrange
Mfshift = [Mf(1:N/2), fliplr(Mf(2:(N/2+1)))];
and then you can get your filtered output
z = ifft(fft(x).*Mshift);
which should come out real.
I have a question with regards to Portfolio Optimization in Matlab. Is there a way to plot and obtain the values in the IN-efficent frontier (the bottom locus of points that envelopes the feasible solutions as opposed to the efficient frontier which envelopes the top portion)?
if ~exist('quadprog')
msgbox('The Optimization Toolbox(TM) is required to run this example.','Product dependency')
return
end
returns = [0.1 0.15 0.12];
STDs = [0.2 0.25 0.18];
correlations = [ 1 0.3 0.4
0.3 1 0.3
0.4 0.3 1 ];
% Converting to correlation and STD to covariance matrix
covariances = corr2cov(STDs , correlations);
% Calculating and Plotting Efficient Frontier
portopt(returns , covariances , 20)
% random Portfolio Generation
weights = exprnd(1,1000,3);
total = sum(weights , 2);
total = total(:,ones(3,1));
weights = weights./total;
[portRisk , portReturn] = portstats(returns , covariances , weights);
hold on
plot(portRisk , portReturn , '.r')
title('Mean-Variance Efficient Frontier and Random Portfolios')
hold off
Is there a way/command to obtain the return/risk/weight of the lower envelope of the feasible solution in the same way that the efficient frontier can be calculated?
Thanks in advance!