Resample factors are too large - matlab

I have a large vector of recorded data which I need to resample. The problem I encounter is that when using resample, I get the following error:
??? Error using ==> upfirdn at 82
The product of the downsample factor Q and the upsample factor P must be less than 2^31.
Now, I understand why this is happening - my two sampling rates are very close together, so the integer factors need to be quite large (something like 73999/74000). Unfortunately this means that the appropriate filter can't be created by MATLAB. I also tried resampling just up, with the intention of then resampling down, but there is not enough memory to do this to even 1 million samples of data (mine is 93M).
What other methods could I use to properly resample this data?

An interpolated polyphase FIR filter can be used to interpolate just the new set of sample points without using an upsampling+downsampling process.
But if performance is completely unimportant, a quick-and-dirty windowed-sinc interpolator will also do the job (the one I have handy is written in Basic).
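Here is a rough MATLAB sketch of that windowed-sinc idea (a hedged sketch only: sincResample, pOverQ and halfWidth are made-up names, sinc comes from the Signal Processing Toolbox, and a plain loop like this will be slow on 93M samples):
function y = sincResample(x, pOverQ, halfWidth)
% Interpolate x (assumed uniformly sampled) at a new rate pOverQ times the
% original rate, using a Hann-windowed sinc kernel of about +/- halfWidth samples.
x    = x(:).';                              % work with a row vector
tOut = 0 : 1/pOverQ : length(x) - 1;        % output times, in input-sample units
y    = zeros(size(tOut));
for k = 1:numel(tOut)
    c = tOut(k);                            % where the new sample falls
    m = max(0, floor(c)-halfWidth) : min(length(x)-1, floor(c)+halfWidth+1);
    d = c - m;                              % distances to the nearby input samples
    w = 0.5 + 0.5*cos(pi*d/(halfWidth+1));  % Hann taper keeps the kernel finite
    y(k) = sum(x(m+1) .* sinc(d) .* w);     % x is 1-based, the times are 0-based
end
end
The point is that each output sample is built directly from a handful of nearby input samples, so the huge intermediate upsampled signal never has to exist.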

Here's my code, I hope it helps:
function resig = resamplee(sig, upsample, downsample)
% Recursively split the signal until resample's P*Q < 2^31 limit is satisfied.
if upsample*downsample < 2^31
    resig = resample(sig, upsample, downsample);
else
    % Split the signal in two and share the upsample factor between the halves,
    % using each half's own length as the downsample factor.
    sig1half = sig(1:floor(length(sig)/2));
    sig2half = sig(floor(length(sig)/2)+1:end);   % +1 avoids duplicating the middle sample
    resig1half = resamplee(sig1half, floor(upsample/2), length(sig1half));
    resig2half = resamplee(sig2half, upsample-floor(upsample/2), length(sig2half));
    resig = [resig1half; resig2half];
end

Related

Dimensionality reduction using PCA - MATLAB

I am trying to reduce dimensionality of a training set using PCA.
I have come across two approaches.
[V, U, eigen] = pca(train_x);   % V: component coefficients (loadings), U: scores, eigen: eigenvalues
eigen_sum = 0;
for lamda = 1:length(eigen)
    eigen_sum = eigen_sum + eigen(lamda,1);
    if (eigen_sum/sum(eigen) >= 0.90)   % stop once 90% of the variance is explained
        break;
    end
end
train_x = train_x*V(:, 1:lamda);
Here, I simply use the eigenvalue matrix to reconstruct the training set with a smaller number of features, determined by the principal components that describe 90% of the original set's variance.
The alternate method that I found is almost exactly the same, save the last line, which changes to:
train_x=U(:,1:lamda);
In other words, we take the training set as the principal component representation of the original training set up to some feature lamda.
Both of these methods seem to yield similar results (out-of-sample test error), but there is a difference, however minuscule it may be.
My question is, which one is the right method?
The answer depends on your data, and what you want to do.
Using your variable names: generally speaking, one would expect the outputs of pca to satisfy
U = train_x * V
But this is only true if your data is normalized, specifically if you have already removed the mean from each component. If not, what one can expect is
U = train_x * V - mean(train_x * V)
And in that regard, whether you want to remove or keep the mean of your data before processing it depends on your application.
It's also worth noting that even if you remove the mean before processing, there might still be some small difference, but it will be around floating-point precision error:
((train_x * V) - U) ./ U ~~ 1.0e-15
and this error can be safely ignored.
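As a quick numerical check of that point (a sketch, assuming train_x is an observations-by-features matrix):
train_x_c = bsxfun(@minus, train_x, mean(train_x));   % remove the mean of each column
[V, U]    = pca(train_x_c);                           % coefficients and scores of the centred data
norm(train_x_c*V - U, 'fro')                          % ~1e-12 or smaller, i.e. numerically zero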

Matlab Kolmogorov-Smirnov Test

I'm using MATLAB to analyze some neuroscience data, and I made an interspike interval distribution and fit an exponential to it. Then, I wanted to check this fit using a Kolmogorov-Smirnov test with MATLAB.
The data for the neuron spikes is just stored in a vector of spikes. The spikes vector is a 111 by 1 vector, where each entry is another vector. Each entry in this spikes vector represents a trial. The number of spikes in each trial varies. For example, spikes{1} is a [1x116 double], meaning there are 116 spikes. The next has 115 spikes, then 108, etc.
Now, I understand that the kstest in MATLAB takes a couple of parameters. You enter the data in the first one, so I took all the interspike intervals and created a row vector alldiffs which stores all the interspike intervals. I want to set my CDF to that for an exponential function fit:
test_cdf = [transpose(alldiffs), transpose(1-exp(-alldiffs*firingrate))];
Note that the theoretical exponential (with which I fit the data) is r*exp(-r*t), where r is the firing rate. I get a firing rate of about 0.2. Now, when I put this all together, I run the kstest:
[h,p] = kstest(alldiffs, 'CDF', test_cdf)
However, the result is a p value on the order of 1.4455e-126. I've tried redoing the test_cdf with another of the methods on Mathworks' website documentation:
test_cdf = [transpose(alldiffs), cdf('exp', transpose(alldiffs), 1/firingrate)];
This gives the exact same result! Is the fit just horrible? I don't know why I get such low p-values. Please help!
I would post an image of the fit, but I don't have enough reputation.
P.S. If there is a better place to post this, let me know and I'll repost.
Here is an example with fake data and yet another way to create the CDF:
>> data = exprnd(.2, 100, 1);          % column vector of 100 exponential samples (kstest needs a vector)
>> test_cdf = makedist('Exponential', 'mu', .2);
>> [h, p] = kstest(data, 'CDF', test_cdf)
h =
0
p =
0.3418
However, why are you doing a KS Test?
All models are wrong, some are useful.
No neuron is perfectly a Poisson process, and with enough data you'll always find a significantly non-exponential ISI distribution, as measured by a KS test. That doesn't mean you can't make the simplifying assumption of an exponential ISI, depending on what phenomena you're trying to model.
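As a toy illustration of that point (made-up numbers, not your data): ISIs drawn from a distribution that is only slightly non-exponential are still firmly rejected once the sample gets large.
isi = gamrnd(1.05, 5, 1e5, 1);                        % shape 1.05 instead of exactly 1
[h, p] = kstest(isi, 'CDF', makedist('Exponential', 'mu', 1.05*5))
% h is typically 1 with a tiny p here, even though an exponential model of
% these ISIs would be perfectly serviceable for many analyses.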

How to build an ARMAX model in Matlab

I'm trying to build an ARMAX model which predicts reservoir water elevation as a function of previous elevations and an upstream inflow. My data is on a timestep of roughly 0.041 days, but it does vary slightly, and I have 3643 time series points. I've tried using the basic armax Matlab command, but am getting this error:
Error using armax (line 90)
Operands to the || and && operators must be convertible to logical scalar values.
The code I'm trying is:
data = iddata(y,x,[],'SamplingInstants',JDAYs)
m1 = armax(data, [30 30 30 1])
where y is a vector of elevations that starts like y = [135.780 135.800 135.810 135.820 135.820 135.830]', x is a vector of flowrates that starts like x = [238.865 238.411 238.033 237.223 237.223 233.828]', and JDAYs is a vector of timestamps that starts like JDAYs = [122.604 122.651 122.688 122.729 122.771 122.813]'.
I'm new to this model type and the system identification toolbox, so I'm having issues figuring out what's causing that error. The Matlab examples aren't very helpful...
I hope this is not getting to you too late.
Checking your code, I see that you are using a parameter called SamplingInstants. I'm not sure the armax function works with it. Actually, I am sure: I have tried it several times, and no, it doesn't. It also doesn't seem to be a well-documented option for armax, or for other methods.
The ARX, ARMAX, and other models are based on linear discrete systems in the Z-transform formalism; that is, they assume your system has been sampled at a regular rate. This is not a law, of course, but it is the standard framework when dealing with linear (and also non-linear) systems, and most industrial control and acquisition systems still work with a regular sampling rate.
Try to get inside the ARMAX standard setting, like this:
y=[135.780 135.800 135.810 135.820 135.820 135.830 .....]';
x=[238.865 238.411 238.033 237.223 237.223 233.828 .....]';
%JDAYs=[122.604 122.651 122.688 122.729 122.771 122.813 .....]';
JDAYs=122.601+[0:length(y)-1]*0.0418;   % regular grid at the average ~0.042-day step
data = iddata(y,x,[],'SamplingInstants',JDAYs);
m1 = armax(data, [30 30 30 1])
And this will always work. Just make sure that x and y are long enough to allow proper estimation of all the free coefficients, greater than mean(4*orders) for armax to work (in this case, greater than 121), and ideally greater than 10*mean(4*orders) for the armax algorithm to solve your problem properly, and time-variant enough to avoid reaching ill-conditioned solutions.
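As a quick sanity check in that spirit (a sketch, not an official System Identification Toolbox requirement), you can count the coefficients armax has to estimate and compare that with the length of your series:
orders = [30 30 30 1];              % the [na nb nc nk] passed to armax
nCoeff = sum(orders(1:3));          % A, B and C polynomial coefficients: 90 here
fprintf('y has %d samples for %d coefficients (aim for 10x or more)\n', ...
        length(y), nCoeff);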
Good Luck ;)...

fit function of Matlab is really slow

Why is the fit function from Matlab so slow? I'm trying to fit a gauss4 model so I can get the means of the Gaussians.
Here's my plot; I want to get the means from the blue data and the red data. I'm fitting Gaussians there, but this function is really slow.
Is there an alternative?
fa = fit(fn', facm', 'gauss4');
acm = [fa.b1 fa.b2 fa.b3 fa.b4];
a_cm = sort(acm, 'ascend');
I would apply some of the options available with fit. These include smoothing, by setting SmoothingParam (your data is quite noisy; applying a time-domain filter instead may also help), and setting your initial parameter estimates with StartPoint. Your fits may also fail to converge because your tolerances (TolFun, TolX) are set too low, although from inspection of your fits that does not appear to be the case; in fact the opposite is more likely, and you probably want to increase MaxIter and/or MaxFunEvals.
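For instance, here is a hedged sketch of setting those options for 'gauss4' (the StartPoint values below are placeholders, not numbers derived from your data):
opts = fitoptions('gauss4');                           % default NonlinearLeastSquares options
opts.StartPoint  = [1 10 2  1 20 2  1 30 2  1 40 2];   % [a1 b1 c1 a2 b2 c2 ...] guesses
opts.MaxIter     = 2000;                               % allow more iterations
opts.MaxFunEvals = 5000;                               % and more function evaluations
fa   = fit(fn', facm', 'gauss4', opts);
a_cm = sort([fa.b1 fa.b2 fa.b3 fa.b4], 'ascend');      % peak centres, as in the question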
To figure out how to get going you can also try the Spectr-O-Matic toolbox. It requires Matlab 7.12. It includes a script called GaussFit.m to fit gauss4 to data, but it also uses the fit routine and provides examples on how to set and get parameters.
Note that smoothing will of course broaden your peaks, but you can subtract the contribution after the fact. The effect on the mean should not be deleterious, on the contrary, since you are presumably removing noise this should be more accurate.
In general, a function will be faster if you apply it to a shorter series. Hence, if speedup is really important, you could downsample.
For example, if you have a vector that you want to downsample by a factor of 2 (you may need to make sure its length is divisible by 2 first):
n = 2;
x = sin(0.01:0.01:pi);
x_downsampled = x(1:n:end) + x(2:n:end);   % sum each pair of neighbouring samples
You will now see that x_downsampled is much smaller (and should thus be easier to process), but will still have the same shape. In your case I think this is sufficient.
To see what you got, compare the original with the downsampled version:
plot(x); hold on; plot(x_downsampled)
Now you can simply process x_downsampled and map your solution, for example
f = find(x_downsampled == max(x_downsampled));
location_of_maximum = f * n;
Needless to say this should be done in combination with the most efficient options that the fit function has to offer.

function parameters in matlab wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is Kaleidagraph, and this should be seen as kind of the 'gold standard'; the program whose fitting I'm trying to adjust is Matlab.
I was trying to be smart and wrote some code (a lot, at least for me), and the goal of that code was the following:
1. Taking an individual longitudinal datafile
2. curve fitting this data on a non-parametric model using lsqcurvefit
3. obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters the two programs generate, there is a huge difference. Kaleidagraph stays close to its original starting values; Matlab wanders off, and the parameters sometimes grow by a factor of 1000. The graphs, however, stay more or less the same in both cases, and both fit the data well. Still, it would be lovely to know how to make the Matlab curve fitting more 'conservative', so it stays near its original starting values.
validFitPersons = true(nbValidPersons,1);
for i = 1:nbValidPersons
    personalData = data{validPersons(i),3};
    personalData = personalData(personalData(:,1) >= minAge,:);
    % Fit a specific model for all valid persons
    try
        opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
        [personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
    catch
        x = 1;   % placeholder so the loop carries on when a fit fails
    end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
    % y = a.*(1-1/(1+(b_1(t+e))^c_1+(b_2(t+e))^c_2+(b_3(t+e))^c_3))
    heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).*(ages+params(8))).^params(5)+(params(3).*(ages+params(8))).^params(6)+(params(4).*(ages+params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
    % Define initial values
    if strcmpi('male',gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
I've tried to mimic the curve-fitting process in Kaleidagraph as closely as possible. I found that it uses the Levenberg-Marquardt algorithm, which is what I've selected. However, the results still vary, and I don't have any more clues about how I can change this.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file. Since there are different models, it would be interesting if I could put restrictions on how far my starting values can wander off.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would try to check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution from straying too far from the initial point, you can use upper and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
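For example, a hedged sketch of how this could look with your variables (the 0.5/1.5 factors are an arbitrary choice, and bounds require an algorithm that supports them, such as 'trust-region-reflective'):
lb   = 0.5*initialValues;                    % keep each parameter within +/-50%
ub   = 1.5*initialValues;                    % of its starting value
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'trust-region-reflective');
personalParams = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2), lb, ub, opts);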
Cheers
You state:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is I have 5 models with different parameters and different starting values (the second part of my code) and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem is really with the model you have shown, which results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
Edit:
To drive the point home regarding the problem with the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
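If you want to inspect this yourself, here is a hedged sketch of how such a parameter correlation matrix can be computed from lsqcurvefit's Jacobian output (not the code used for the figures above; variable names follow the question):
[p, resnorm, resid, ~, ~, ~, J] = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2));
J     = full(J);                                         % Jacobian at the solution
covP  = resnorm/(numel(resid) - numel(p)) * inv(J'*J);   % approximate parameter covariance
corrP = covP ./ (sqrt(diag(covP)) * sqrt(diag(covP))');  % scale covariance to correlations
imagesc(abs(corrP)); colorbar                            % entries near 1 flag overfitting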