How do I compare two weighted regressions in MATLAB?

I've been using MATLAB as a statistics tool. I like how much I can customise and code myself.
I was delighted to find that it's fairly straightforward to do a weighted linear regression in MATLAB. As a slightly silly example, I can load the "carbig" data file and compare horsepower vs. mileage for US cars against cars from other countries, while deciding that I only fully trust the 8-cylinder cars.
load carbig
w = (Cylinders==8) + 0.5*(Cylinders~=8); % weight 1 if 8 cylinders, 0.5 otherwise
for i = 1:length(org)
    o(i,1) = strcmp(org(i,:), org(1,:)); % strcmp compares one row at a time; o==1 marks US cars
end
x1 = Horsepower(o==1);
x2 = Horsepower(o==0);
y1 = MPG(o==1);
y2 = MPG(o==0);
w1 = w(o==1);
w2 = w(o==0);
lm1 = fitlm(x1, y1, 'Weights', w1)
lm2 = fitlm(x2, y2, 'Weights', w2)
This way, data from 8-cylinder cars counts as a full data point, and data from 3-, 4-, 5- and 6-cylinder cars counts as half a data point.
The problem is that the obvious way to compare the two regressions is ANCOVA, which MATLAB has a function for:
aoctool(Horsepower,MPG,o)
This function compares linear regressions on the two groups, but I haven't found an obvious way to include weights.
I suspect I could dig into what the ANCOVA does and include the weights manually. Is there an easier solution?
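One possible shortcut (a sketch, not verified against aoctool's output): fitlm accepts categorical predictors and observation weights, so fitting a single weighted model with a Horsepower-by-group interaction amounts to a weighted ANCOVA. The table setup below is an assumption for illustration:
tbl = table(Horsepower, MPG, categorical(o), 'VariableNames', {'Horsepower','MPG','Group'});
% Separate slope and intercept per group, with observation weights
lmFull = fitlm(tbl, 'MPG ~ Horsepower*Group', 'Weights', w)
% Common slope for both groups; compare against lmFull to test for different slopes
lmSame = fitlm(tbl, 'MPG ~ Horsepower + Group', 'Weights', w)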

I figured that if I give the "trusted" measurements weight 2 and the "untrusted" measurements weight 1, then for regression purposes that's the same thing as adding one extra identical measurement for each trusted one. Setting the weights to 1 and 0.5 should do the same thing. I can do this with a script.
That also inflates the degrees of freedom quite a bit, so I manually set the degrees of freedom to sum(w) - rank instead of n - rank.
x = [];
y = [];
g = [];
w = (Cylinders==8) + 0.5*(Cylinders~=8);
df = sum(w); % weighted "sample size"; the rank gets subtracted later
% Duplicate each observation in proportion to its weight:
% weight 1 -> two copies, weight 0.5 -> one copy
for i = 1:length(w)
    while w(i) > 0
        x = [x; Horsepower(i)];
        y = [y; MPG(i)];
        g = [g; o(i)];
        w(i) = w(i) - 0.5;
    end
end
I then copied the aoctool.m file (edit aoctool) and inserted the value of df somewhere in the new file. It isn't elegant, but it seems to work.
edit aoctool.m
%(insert new df somewhere. Save as aoctool2.m)
aoctool2(x,y,g)

Related

Resample factors are too large

I have a large vector of recorded data which I need to resample. The problem I encounter is that when using resample, I get the following error:
??? Error using ==> upfirdn at 82
The product of the downsample factor Q and the upsample factor P must be less than 2^31.
Now, I understand why this is happening - my two sampling rates are very close together, so the integer factors need to be quite large (something like 73999/74000). Unfortunately this means that the appropriate filter can't be created by MATLAB. I also tried resampling up first, with the intention of then resampling down, but there is not enough memory to do this for even 1 million samples of data (mine is 93M).
What other methods could I use to properly resample this data?
An interpolated polyphase FIR filter can be used to compute just the new set of sample points directly, without an explicit upsample-then-downsample process.
But if performance is completely unimportant, here's a quick-and-dirty windowed-sinc interpolator in Basic.
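For a ratio this close to 1 (73999/74000), a minimal MATLAB sketch of the direct-interpolation idea, using interp1 in place of a proper polyphase filter (variable names are assumptions; no anti-alias filter is added, which is tolerable only because the rate change is tiny):
% Evaluate the signal directly at the new sample instants
P = 73999; Q = 74000;                           % desired rate change P/Q
t     = (0:length(sig)-1).';                    % original sample times (unit spacing)
tNew  = (0:floor(length(sig)*P/Q)-1).' * (Q/P); % new sample instants
resig = interp1(t, sig, tNew, 'spline');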
Here's my code, I hope it helps:
function resig = resamplee(sig, upsample, downsample)
% Recursively split the signal until resample's 2^31 factor limit is met.
% Note the calling convention: 'upsample' is the target number of output
% samples and 'downsample' is the current number of input samples.
if upsample*downsample < 2^31
    resig = resample(sig, upsample, downsample);
else
    mid = floor(length(sig)/2);
    sig1half = sig(1:mid);
    sig2half = sig(mid+1:end);
    resig1half = resamplee(sig1half, floor(upsample/2), length(sig1half));
    resig2half = resamplee(sig2half, upsample-floor(upsample/2), length(sig2half));
    resig = [resig1half; resig2half]; % assumes column vectors
end
end
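With that calling convention, the initial call for the 73999/74000 rate change in the question would look something like this (assuming sig is a column vector):
resig = resamplee(sig, round(length(sig)*73999/74000), length(sig));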

Create different `randperm` numbers in loops

Suppose that we have this structure:
for i=1:x1
    Out = randperm(40);
    Out_Final = Out; % placeholder: divide 'Out' into 10 parts and select the parts needed
    for j=1:x2
        % Process on `Out_Final`
    end
end
I'm using the outer loop (for i=1:x1) to repeat the main process (for j=1:x2) and average the outputs to get more robust results. I don't want randperm to produce identical (or nearly identical) outputs; I want its output to differ as much as possible between calls in the (for i=1:x1) loop.
How can I do that in MATLAB R2014a?
The randomness algorithms used by randperm are very good, so don't worry about that.
However, if you draw 10 random numbers from 1 to 10, you are likely to see some values more often than others.
If you REALLY don't want this, you should probably not focus on randomly selecting the numbers, but on selecting the numbers in a way that they are nicely spread out throughout their possible range. (That is a quite different problem to solve.)
To address your comment:
The rng function allows you to create reproducible results; make sure to check doc rng for examples.
In your case it seems you actually don't want to reset the rng on each iteration: reseeding with the same seed would make every iteration produce the same permutation.
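A minimal sketch of the intended pattern (x1 and the processing step are placeholders from the question):
rng(42);                % seed once, outside the loop, for reproducibility
for i=1:x1
    Out = randperm(40); % a fresh, independent permutation on every iteration
    % ... process Out ...
end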

Function parameters in MATLAB wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with 2 different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is KaleidaGraph, and this should be seen as kind of the 'gold standard'; the program I'm trying to move to is MATLAB.
I was trying to be smart and wrote some code (a lot, at least for me), with the following goal:
1. Take an individual longitudinal data file
2. Curve-fit this data to a non-parametric model using lsqcurvefit
3. Obtain figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters both programs generate, there is a huge difference. KaleidaGraph stays close to its original starting values; MATLAB wanders off and some parameters get larger by a factor of 1000. The graphs nevertheless stay more or less the same in both situations, and both fit the data well. Still, it would be lovely to know how to make the MATLAB curve fit more 'conservative', i.e., located nearer its original starting values.
validFitPersons = true(nbValidPersons,1);
for i=1:nbValidPersons
    personalData = data{validPersons(i),3};
    personalData = personalData(personalData(:,1)>=minAge,:);
    % Fit a specific model for all valid persons
    try
        opts = optimoptions(@lsqcurvefit, 'Algorithm', 'levenberg-marquardt');
        [personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
    catch
        x = 1; % placeholder: skip persons whose fit fails
    end
end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
    % y = a*(1 - 1/(1 + (b1*(t+e))^c1 + (b2*(t+e))^c2 + (b3*(t+e))^c3))
    heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).*(ages+params(8))).^params(5) ...
        +(params(3).*(ages+params(8))).^params(6) ...
        +(params(4).*(ages+params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
    % Define initial values
    if strcmpi('male',gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
I've tried to mimic the curve-fitting process in KaleidaGraph as closely as possible. There I found they use the Levenberg-Marquardt algorithm, which I've selected in MATLAB as well. However, the results still vary, and I don't have any more clues about what to change.
Some extra adjustments:
The idea for this code was the following:
I'm trying to compare different fitting models (they are designed for this purpose). So I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve-fitting file. Since there are different models, it would be interesting if I could put restrictions on how far my starting values can wander off.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would try to check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution from straying too far from the initial point, you can use lower and upper bounds on each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x, so that the solution is always in the range lb ≤ x ≤ ub. (Note that bound constraints are not supported by the 'levenberg-marquardt' algorithm, so you would use lsqcurvefit's default trust-region-reflective algorithm for this.)
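A minimal sketch of that idea, assuming you are happy to confine each parameter to within 50% of its starting value (the 50% factor is an arbitrary illustration):
% Confine each parameter to within 50% of its initial value (illustrative choice)
lb = initialValues - 0.5*abs(initialValues);
ub = initialValues + 0.5*abs(initialValues);
opts = optimoptions(@lsqcurvefit, 'Algorithm', 'trust-region-reflective'); % bounds need this algorithm
personalParams = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2), lb, ub, opts);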
Cheers
You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem is really that the model you have shown results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
edit
To drive the point home regarding the problem with the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form (figures omitted here):
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
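If you want to reproduce that kind of diagnostic yourself, one way is to estimate the parameter correlation matrix from the Jacobian that lsqcurvefit returns (a sketch; variable names follow the question's code):
% Sketch: parameter correlation matrix from the fit's Jacobian
[params,resnorm,residual,~,~,~,J] = lsqcurvefit(heightModel, initialValues, ...
    personalData(:,1), personalData(:,2));
J   = full(J);                                 % lsqcurvefit returns a sparse Jacobian
mse = resnorm/(numel(residual)-numel(params)); % estimate of the residual variance
C   = mse*inv(J.'*J);                          % approximate parameter covariance
R   = C./sqrt(diag(C)*diag(C).');              % parameter correlation matrix
imagesc(abs(R)); colorbar                      % values near 1 flag correlated parameters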

Suppress kinks in a plot in MATLAB

I have a CSV file which contains data like below (the 1st row is the header):
Element,State,Time
Water,Solid,1
Water,Solid,2
Water,Solid,3
Water,Solid,4
Water,Solid,5
Water,Solid,2
Water,Solid,3
Water,Solid,4
Water,Solid,5
Water,Solid,6
Water,Solid,7
Water,Solid,8
Water,Solid,7
Water,Solid,6
Water,Solid,5
Water,Solid,4
Water,Solid,3
The same pattern is repeated with the State "Solid" replaced by Liquid and Gas.
Moreover, the Element "Water" can be replaced by some other element too.
The Time values are integers here, in seconds (to simplify), but can be any real number.
Additionally, there might be some comment lines starting with # in between the file.
Problem statement: I want to eliminate the first dip in the Time values and smooth it out using some quadratic, cubic or other polynomial interpolation. [Please notice the first change from 5 -> 2 -> ... -> 8: I want to replace these numbers with intermediate values giving a gradual/smooth increase from 5 to 8.]
And I wish this to be done for all combinations of Elements and States.
Is this possible through some sort of coding in MATLAB etc.?
Any pointers will be helpful!
Thanks in advance :)
You can use the interp1 function for 1D-interpolation. The syntax is
yi = interp1(x,y,xi,method)
where x are your original coordinates, y are your original values, xi are the coordinates at which you want the values to be interpolated and yi are the interpolated values. method can be 'spline' (cubic spline interpolation), 'pchip' (piecewise cubic Hermite) and others (see the documentation for details).
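A concrete (hypothetical) illustration on the Time column from the question, treating the row index as the x-coordinate and interpolating across the dip:
% Sketch: replace the dip (rows 6-9, found here by inspection) with interpolated values
time = [1 2 3 4 5 2 3 4 5 6 7 8 7 6 5 4 3].'; % Time column from the question
idx  = (1:numel(time)).';                     % row positions
bad  = (6:9).';                               % rows belonging to the dip
good = setdiff(idx, bad);
time(bad) = interp1(good, time(good), bad, 'spline'); % smooth rise instead of the 5 -> 2 drop
In practice you would detect the dip programmatically (e.g. by looking for a negative difference in an otherwise increasing stretch) rather than hard-coding the rows.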
You have a lot of options here; it really depends on the nature of your data, but I would start off with a simple moving-average (MA) filter (which replaces each data point with the average of the neighboring data points), and see where that takes me. It's easy to implement, and fine-tuning the MA span a couple of times on some sample data is usually enough.
http://www.mathworks.se/help/curvefit/smoothing-data.html
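A one-line sketch of that, using the Curve Fitting Toolbox's smooth function (the 5-point span is an assumption to tune):
timeSmooth = smooth(time, 5, 'moving'); % 5-point moving average of the Time column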
I would not try to fit a polynomial to the entire data set unless I really needed to compress it (but to do so you can use the polyfit function).

How to get the level of fit of data to a distribution by using probplot() in MATLAB?

I have 2 sets of float-valued data, set A and set B. Both are matrices of size 40*40. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in MATLAB to plot the probabilities for one set, but I do not know how to quantify how good the fit to the distribution is.
In Python, when people use probplot, a parameter, R^2, shows how well the data fit the normal distribution: the closer the R^2 value is to 1, the better the fit. Thus, I can simply compare two sets of data by their R^2 values. However, because of some machine problem, I cannot use Python on my current machine. Is there a parameter or function in MATLAB similar to the R^2 value?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit, i.e., sse, rsquare, dfe, adjrsquare, rmse, can be done using the function fit. More info here...
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (that does, alas, require the Statistics Toolbox - but you have this if you have the student version):
Given that your data are univariate Normal (not multivariate normal), consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
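For reference, the Jarque-Bera statistic is computed from the sample skewness S and sample kurtosis K as JB = (n/6)*(S^2 + (K-3)^2/4), so departures from the Normal's skewness (0) and kurtosis (3) both inflate it.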
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
%# Build sample data
A = randn(1600, 1); %# normally distributed
B = rand(1600, 1);  %# uniformly distributed
%# Perform JB test
[ANormal, ~, AStat] = jbtest(A);
[BNormal, ~, BStat] = jbtest(B);
%# Display result
if AStat < BStat
    disp('A is closer to normal');
else
    disp('B is closer to normal');
end
As a little bonus of doing things this way, ANormal and BNormal tell you whether you can reject or fail to reject the null hypothesis that the sample in A or B comes from a normal distribution! Specifically, if ANormal is 0, then you fail to reject the null (i.e., the test statistic indicates that A is probably drawn from a Normal). If ANormal is 1, then the test rejects the null, so the data in A is probably not generated from a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)