MATLAB Kolmogorov-Smirnov Test

I'm using MATLAB to analyze some neuroscience data, and I made an interspike interval distribution and fit an exponential to it. Then, I wanted to check this fit using a Kolmogorov-Smirnov test with MATLAB.
The spike data is stored in a variable spikes, a 111-by-1 cell array in which each cell holds the vector of spike times for one trial. The number of spikes in each trial varies. For example, spikes{1} is a [1x116 double], meaning there are 116 spikes in the first trial; the next trial has 115 spikes, then 108, etc.
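For concreteness, a minimal sketch of how all the interspike intervals can be collected from this cell array (assuming each cell holds sorted spike times):
isis = cellfun(@diff, spikes, 'UniformOutput', false); % per-trial interspike intervals
alldiffs = [isis{:}];                                  % one row vector of all ISIs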
Now, I understand that kstest in MATLAB takes a couple of parameters, the first being the data, so I collected all the interspike intervals into a row vector alldiffs. I want to set my CDF to that of the fitted exponential:
test_cdf = [transpose(alldiffs), transpose(1-exp(-alldiffs*firingrate))];
Note that the theoretical exponential density (with which I fit the data) is r*exp(-r*t), where r is the firing rate; I get a firing rate of about 0.2. Now, when I put this all together, I run kstest:
[h,p] = kstest(alldiffs, 'CDF', test_cdf)
However, the result is a p-value on the order of 1.4455e-126. I've tried redoing test_cdf with another of the methods in the MathWorks documentation:
test_cdf = [transpose(alldiffs), cdf('exp', transpose(alldiffs), 1/firingrate)];
This gives the exact same result! Is the fit just horrible? I don't know why I get such low p-values. Please help!
I would post an image of the fit, but I don't have enough reputation.
P.S. If there is a better place to post this, let me know and I'll repost.

Here is an example with fake data and yet another way to create the CDF:
>> data = exprnd(.2, 100, 1);  % column vector; exprnd(.2, 100) would return a 100x100 matrix
>> test_cdf = makedist('Exponential', 'mu', .2);
>> [h, p] = kstest(data, 'CDF', test_cdf)
h =
0
p =
0.3418
However, why are you doing a KS Test?
All models are wrong, some are useful.
No neuron is perfectly a Poisson process, and with enough data you'll always have a significantly non-exponential ISI distribution, as measured by a KS test. That doesn't mean you can't make the simplifying assumption of an exponential ISI, depending on what phenomena you're trying to model.
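One caveat worth adding: the rate here was estimated from the same data being tested, which makes the standard KS critical values conservative. A hedged alternative is a Lilliefors-type test, whose critical values account for the estimated parameter (requires the Statistics Toolbox):
% Lilliefors test against the exponential family; the mean is estimated
% from alldiffs internally, and the critical values correct for this:
[h, p] = lillietest(alldiffs, 'Distribution', 'exponential')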

Related

Different results using filter object and z,p,k implementation

I'm comparing the output of digital filtering using a MATLAB filter object and the zero-pole-gain method, and they are not quite the same. I'd appreciate it if anyone could point me in the right direction and explain what exactly MATLAB does, because it seems to use neither the b/a coefficients nor the zero-pole-gain method. Here are my results and the test script (download data.txt):
srate=64;
freqrange=[0.4 3.5];
file='data.txt';
load(file);
file=split(file,'.');
file=file{1};
data=eval(file);
st=srate*60;
ed=srate*60*2;
data=data(st:ed);%1 minute data
m=numel(data);
x=data;
R=0.1;%10% of signal
Nr=50;
NR=min(round(m*R),Nr);%At most 50 points
x1=2*x(1)-flipud(x(2:NR+1));%maintain continuity in level and slope
x2=2*x(end)-flipud(x(end-NR:end-1));
x=[x1;x;x2];
[xx,dbp]=bandpass(x,freqrange,srate);%same result if use filter(dbp,xx)
data_fil=xx(NR+1:end-NR);
[z,p,k]=dbp.zpk;
[sos, g] = zp2sos(z, p, k);
xx=filtfilt(sos, g, x);
data_fil_zpk=xx(NR+1:end-NR);
f=figure('Numbertitle','off','Name','test_filters','units','normalized','outerposition',[0 0 1 1],'menubar','none','Visible','on');
plot([data data_fil data_fil_zpk])
legend('raw','dbp filter','zpk filter')
title('raw vs dbp vs zpk')
error=norm(data_fil-data_fil_zpk)/norm(data_fil)*100
d1=fvtool(dbp);
set(d1,'Name','MR_dbp');
d2=fvtool(sos);
set(d2,'Name','MR_zpk');
and here is the result I obtain:
As you can see, the filtered signals are almost, but not exactly, the same. At this point I'm lost as to what exactly MATLAB does.
Here you can see the magnitude response plots:
They have the same shape, so both are filtering, but in the z-p-k case the maximum magnitude is around 60 dB while the other is at 0 dB. Why does this happen? Does that mean the zpk filter is amplifying those frequencies?
Finally, you can see the PSD plots for both filtered signals (first dbp, second zpk):
Once again we can see the filter is filtering, but you can also see a greater value in the first peak of the second plot (the zpk filter)...
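A hedged guess at the ~60 dB offset: zp2sos returns the overall gain g separately from the sos matrix, and fvtool(sos) plots the unscaled sections, while filtfilt(sos, g, x) does apply g, which is why the filtered signals still agree. Folding g into the first section before plotting should line the two magnitude responses up (a sketch using the sos and g from the script above):
% Fold the scalar gain g into the numerator of the first second-order
% section so fvtool displays the scaled response:
sos_scaled = sos;
sos_scaled(1,1:3) = g*sos_scaled(1,1:3);
d3 = fvtool(sos_scaled);
set(d3,'Name','MR_zpk_scaled');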

comparing generated data to measured data

We have measured data whose distribution type we managed to determine (Gamma), along with its parameters (A, B).
We then generated n samples (10000) from the same distribution with the same parameters, in the same range (between 18.5 and 59), using a for loop:
for i=1:1:10000
    tot=makedist('Gamma','A',11.8919,'B',2.9927);
    tot=truncate(tot,18.5,59);
    W(i,:)=random(tot,1,1);
end
Then we tried to fit the generated data using:
h1=histfit(W);
After this we tried to plot the Gamma curve to compare the two curves on the same figure, using:
hold on
h2=histfit(W,[],'Gamma');
h2(1).Visible='off';
The problem is that the two curves are shifted, as shown in the following figures (Figure 1 is the generated data from the previous code, and Figure 2 is without truncating the generated data).
Does anyone know why?
Thanks in advance.
By default histfit fits a normal probability density function (PDF) on the histogram. I'm not sure what you were actually trying to do, but what you did is:
% fit a normal PDF
h1=histfit(W); % this is equal to h1 = histfit(W,[],'normal');
% fit a gamma PDF
h2=histfit(W,[],'Gamma');
Obviously that will result in different fits, because a normal PDF is not a gamma PDF. What you see is simply that the gamma PDF fits the data better, because you sampled the data from that distribution.
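If the intent was to compare the histogram of W against the true truncated gamma curve, a minimal sketch is to normalize the histogram to a density and overlay pdf(tot, x) directly (assumes the W and tot from the question):
% Density-normalized histogram with the truncated gamma PDF on top:
histogram(W, 'Normalization', 'pdf');
hold on
xg = linspace(18.5, 59, 200);
plot(xg, pdf(tot, xg), 'r', 'LineWidth', 1.5);
hold off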
If you want to check whether the data follows a certain distribution you can also use a KS-test. In your case
% check if the data follows the distribution specified in tot
[h, p] = kstest(W, 'CDF', tot)
If the data follows the gamma distribution, then h = 0 and p > 0.05; otherwise h = 1 and p < 0.05.
Now some general comments on your code:
Please look up preallocation of memory; it will speed up loops greatly. E.g.
W = zeros(10000,1);
for i=1:1:10000
    tot=makedist('Gamma','A',11.8919,'B',2.9927);
    tot=truncate(tot,18.5,59);
    W(i,:)=random(tot,1,1);
end
Also,
tot=makedist('Gamma','A',11.8919,'B',2.9927);
tot= truncate(tot,18.5,59);
does not depend on the loop index and can therefore be moved in front of the loop to speed things up further. It is also good practice to avoid using i as a loop variable, since it shadows the imaginary unit.
But you can actually skip the whole loop, because random() can return multiple samples at once:
tot=makedist('Gamma','A',11.8919,'B',2.9927);
tot= truncate(tot,18.5,59);
W =random(tot,10000,1);

Resample factors are too large

I have a large vector of recorded data which I need to resample. The problem I encounter is that when using resample, I get the following error:
??? Error using ==> upfirdn at 82
The product of the downsample factor Q and the upsample factor P must be less than 2^31.
Now, I understand why this is happening - my two sampling rates are very close together, so the integer factors need to be quite large (something like 73999/74000). Unfortunately this means that the appropriate filter can't be created by MATLAB. I also tried resampling just up, with the intention of then resampling down, but there is not enough memory to do this to even 1 million samples of data (mine is 93M).
What other methods could I use to properly resample this data?
An interpolated polyphase FIR filter can be used to interpolate just the new set of sample points without using an upsampling+downsampling process.
But if performance is completely unimportant, here's a Quick and Dirty windowed-Sinc interpolator in Basic.
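In MATLAB terms, a hedged sketch of the same idea is to evaluate the signal directly at the new sample instants with interp1; because the two rates are nearly equal, skipping the anti-aliasing filter loses almost nothing (the 73999/74000 rates come from the question, x is stand-in data):
x = randn(1e6, 1);                     % stand-in for the recorded vector
fs_old = 74000; fs_new = 73999;        % nearly equal rates, as in the question
t_old = (0:numel(x)-1).'/fs_old;       % original sample instants
t_new = (0:1/fs_new:t_old(end)).';     % new sample grid
y = interp1(t_old, x, t_new, 'pchip'); % interpolate at the new instants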
Here's my code, I hope it helps:
function resig = resamplee(sig,upsample,downsample)
% Resample sig by the ratio upsample/downsample, splitting the signal
% recursively whenever the factor product would exceed upfirdn's 2^31
% limit. In the recursive calls, upsample plays the role of the target
% output length and downsample the current input length.
if upsample*downsample<2^31
    resig = resample(sig,upsample,downsample);
else
    sig1half=sig(1:floor(length(sig)/2));
    sig2half=sig(floor(length(sig)/2)+1:end); % +1 avoids duplicating the middle sample
    resig1half=resamplee(sig1half,floor(upsample/2),length(sig1half));
    resig2half=resamplee(sig2half,upsample-floor(upsample/2),length(sig2half));
    resig=[resig1half;resig2half];
end

Data fitting for time dependent matrix sets in Matlab

I have 6 data sets; each data set is a 576-by-576 matrix. Each set of data represents measurements taken at 30-second intervals, e.g. set1 at t = 0, set2 at t = 30, ..., set6 at t = 150 seconds.
You can look at these sets as frames if you will. I need to take the first data point (1,1) from each data set -> (1,1,0), (1,1,30), (1,1,60), (1,1,90), (1,1,120), (1,1,150) and, based on those 6 points, find a fitting formula, then assign that general solution to the first spot of my solution matrix SM(1,1). I need to do this for every data point in the 6 sets until I have a 576-by-576 solution matrix.
If everything goes right, I should be able to plot SM(0s)=set1, SM(30s)=set2, etc., but not only that: SM(45) should return a prediction of the measurements at t = 45, and so on and so forth. The purpose is to have one matrix that can predict the data fluctuation from t = 0 to 150 seconds.
Additional information:
1. Each data point is independent from the rest of the data points in the same set.
2. It is a non-linear fit.
3. All values are real.
Does MATLAB have an optimization tool for this kind of problem?
Should I treat the problem as a 1D data fit and create a for loop that does the job 576^2 times?
(I don't even know where to begin)
Feel free to ask or edit anything if I wasn't clear enough. I am not sure that I've chosen the most precise title for this kind of problem. Thanks
Update:
Based on Guddu's answer I came up with this:
%% Loading data matrix A
A(:,:,1) = abs(set1);
A(:,:,2) = abs(set2);
A(:,:,3) = abs(set3);
A(:,:,4) = abs(set4);
A(:,:,5) = abs(set5);
A(:,:,6) = abs(set6);
%% Creating Solution Matrix B
t=0:30:150;
SM=zeros([576 576 150]);
for i=1:576
    for j=1:576
        y=squeeze(A(i,j,1:6));         % time series for pixel (i,j)
        f=fit(t',y,'smoothingspline'); % fit over the 6 time points
        data=feval(f,1:150);           % evaluate at t = 1..150 s
        SM(i,j,:)=data;
    end
end
%% Plotting Frame at t=45
figure(1);
imshow(SM(:,:,45),[])
I am not sure if this is the most efficient way to do it, but it works. I am open to new ideas or suggestions. Thanks
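A hedged alternative that avoids calling fit 576^2 times: interp1 interpolates every column of a matrix in one call, so the whole volume can be done at once (a sketch assuming the A built above; note 'spline' interpolates through the points rather than smoothing like 'smoothingspline'):
t = 0:30:150;
Acols = reshape(permute(A,[3 1 2]), 6, []);  % time down the rows, one column per pixel
SMcols = interp1(t, Acols, 1:150, 'spline'); % 150-by-(576*576) interpolated values
SM = permute(reshape(SMcols, 150, 576, 576), [2 3 1]); % back to 576-by-576-by-150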
I would suggest making your main data matrix of size (6,576,576) -> a. The first point (1,1) from each dataset would then be a(1,1,1), a(2,1,1), a(3,1,1), ..., a(6,1,1). As you have said, each point (i,j) in each dataset is independent from every other point (k,l), so I would suggest treating each position (i,j) across all datasets separately from the other positions. So it would be looping 576*576 times. The code could be something like this:
t=0:30:150;
for i=1:576
    for j=1:576
        datavec=squeeze(a(1:6,i,j)); % Select (i,j) point from all 6 frames
        % do the curve fitting and save in SM(i,j)
    end
end
I am just curious what kind of non-linear function you want to fit to 6 points. This might not be the answer you want, but it was too long to put in a comment.

How to get level of fitness of data to a distribution by using probplot() in Matlab?

I have 2 sets of data of float numbers, set A and set B. Both of them are matrices of size 40-by-40. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in MATLAB to plot the probability of one set; however, I do not know how to quantify how good the fit to the distribution is.
In Python, when people use probplot, a parameter, R^2, shows how well the data matches the normal distribution: the closer the R^2 value is to 1, the better the fit. Thus, I can simply use that function to compare two sets of data by their R^2 values. However, because of a machine problem, I cannot use Python on my current machine. Is there a parameter or function similar to the R^2 value in MATLAB?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit (sse, rsquare, dfe, adjrsquare, rmse) can be done using the function fit. More info here...
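For example, a hedged sketch of reproducing something like scipy probplot's R^2: fit a straight line to the sorted data against theoretical normal quantiles and read rsquare from fit's goodness-of-fit output (A here stands for one of your 40-by-40 matrices, flattened to a vector):
x = sort(A(:));                     % order statistics, 1600-by-1
p = ((1:numel(x))' - 0.5)/numel(x); % plotting positions
q = norminv(p);                     % theoretical normal quantiles
[fobj, gof] = fit(q, x, 'poly1');   % straight-line probability-plot fit
gof.rsquare                         % R^2; closer to 1 means closer to normal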
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (that does, alas, require the Statistics Toolbox - but you have this if you have the student version):
Given that you are testing for (univariate) Normality, not multivariate normality, consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
% Build sample data
A = randn(1600, 1);
B = rand(1600, 1);
% Perform the Jarque-Bera test on each sample
[ANormal, ~, AStat] = jbtest(A);
[BNormal, ~, BStat] = jbtest(B);
% Compare test statistics: smaller means closer to Normal
if AStat < BStat
    disp('A is closer to normal');
else
    disp('B is closer to normal');
end
As a little bonus of doing things this way, ANormal and BNormal tell you whether you can reject or fail to reject the null hypothesis that the sample in A or B comes from a normal distribution! Specifically, if ANormal is 0, then you fail to reject the null (i.e., the test statistic indicates that A is probably drawn from a Normal). If ANormal is 1, then the test rejects the null, so the data in A is probably not generated from a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)