Multiple comparison for repeated measures ANOVA in matlab

I want to find possible differences between different conditions. I have n subjects, and for each subject I have a mean value for every condition. The values vary a lot between subjects, which is why I want to perform a repeated measures ANOVA to control for that.
My within-subject factor would then be the condition, and I don't have any between-subjects factor.
So far I have the following code:
%% create simulated numbers
meanPerf = randn(20,3);
%% create a table array with the mean performance for every condition
tableData = table(meanPerf(:,1),meanPerf(:,2),meanPerf(:,3),'VariableNames',{'meanPerf1','meanPerf2','meanPerf3'})
tableInfo = table([1,2,3]','VariableNames',{'Conditions'})
%% fit repeated measures model to the table data
repMeasModel = fitrm(tableData,'meanPerf1-meanPerf3~1','WithinDesign',tableInfo);
%% perform repeated measures anova to check for differences
ranovaTable = ranova(repMeasModel)
My first question is: Am I doing this correctly?
The second question is: How can I perform a post hoc analysis to find out which of the conditions are significantly different from each other?
I tried using:
multcompare(ranovaTable,'Conditions');
but that produced the following error:
Error using internal.stats.parseArgs (line 42)
Wrong number of arguments.
I am using Matlab 2015b.
Would be great if you could help me out. I think I'm losing my mind over this.
Best,
Phill

I was trying the same thing using Matlab R2016a, and I get the following multcompare error message: "STATS must be a stats output structure from ANOVA1, ANOVA2, ANOVAN, AOCTOOL, KRUSKALWALLIS, or FRIEDMAN.".
However, this discussion was helpful for me:
https://www.mathworks.com/matlabcentral/answers/140799-3-way-repeated-measures-anova-pairwise-comparisons-using-multcompare
You might try something like:
multcompare(repMeasModel,'Factor1','By','Factor2')
I believe you'll need to define those factors in the within design (WithinDesign) of your model, too.
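Building on that, a minimal sketch of the corrected workflow for the single within-subject factor in the question could look like this (the within-design variable is made categorical so that multcompare treats Conditions as a factor rather than a covariate):
% within-design table with Conditions as a categorical factor
tableInfo = table(categorical([1 2 3]'),'VariableNames',{'Conditions'});
% fit the repeated measures model across the three response columns
repMeasModel = fitrm(tableData,'meanPerf1-meanPerf3~1','WithinDesign',tableInfo);
% omnibus test for a difference between conditions
ranovaTable = ranova(repMeasModel);
% post hoc pairwise comparisons between the levels of Conditions
postHocTable = multcompare(repMeasModel,'Conditions');
Note that this multcompare is the RepeatedMeasuresModel method, so it has to be called on the fitted model, not on the ranova output table, which is what produced the errors quoted above.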

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so, I use the Titanic dataset to train a GLM that predicts the 'Fare' column of that dataset.
I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is multicollinearity among your variables: one of the variables can be predicted by a linear combination of the others. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
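As a rough check for such collinear columns, you could look at the pairwise correlations within the assembled feature vector; a sketch, assuming Spark 2.2+ and the features column from the question:
from pyspark.ml.stat import Correlation

# Pearson correlation matrix of the assembled feature vector;
# off-diagonal entries close to +/-1 point at (near-)collinear columns
corr_matrix = Correlation.corr(training_df, "features").head()[0].toArray()
print(corr_matrix)
Perfect collinearity involving more than two columns will not always show up as a pairwise correlation of exactly 1, so this is only a first screen.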
It is documented in the GeneralizedLinearRegressionModel docs that a summary may not be available for a model.
However, you can do an initial check to avoid the error with hasSummary, a public boolean attribute of the model (in Pyspark it is a property, so it is read without parentheses).
Using it as
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the Pyspark source code, and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure your input values for the one hot encoder start from 0.
One mistake I made that caused the summary not to be created: I passed quarter values (1, 2, 3, 4) directly to the one hot encoder and got a vector of length 4 in which one column was always 0; that constant-zero column makes the design matrix rank-deficient. I converted quarter to 0, 1, 2, 3 and the problem was solved.
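One way to get 0-based category indices before one-hot encoding is StringIndexer; a minimal sketch (the column names quarter and quarter_idx are only illustrative):
from pyspark.ml.feature import StringIndexer

# StringIndexer assigns category indices 0..k-1, so a downstream one-hot
# encoder will not produce a column that is zero for every row
indexer = StringIndexer(inputCol="quarter", outputCol="quarter_idx")
indexed_df = indexer.fit(training_df).transform(training_df)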

Parfor works with test data but not real data

I am using Matlab 2016a. I have four matrices of size 2044x1572x84 and am trying to regress each column of each matrix to produce a new 2044x1572 matrix of regression coefficients. I need to use parfor; a for loop would take way too long.
When I run the code below with test data (e.g. using rand to make four 50x50x40 matrices), it executes with no errors. However, when I run the same code on a cluster with the full 2044x1572x84 matrices, I get a transparency violation error from the table call: Error using table (line 247): Transparency violation error. I've tried modifying the table code to fix this but only get a suite of other errors.
I'm unsure how to fix the error in this case, particularly given that the success of the code seems to be dependent on the size of the input data. I'm not particularly familiar with parfor, and any feedback on what I may be doing wrong would be greatly appreciated.
COEFF_LST = ones(2044,1572);
parfor i = 1:2044
    for j = 1:1572
        ZZ = squeeze(ARRAY_DETREND_L2_LST(i,j,:));
        XX = squeeze(ARRAY_DETREND_L2_ONDVI(i,j,:));
        YY = squeeze(ARRAY_DETREND_WB_85(i,j,:));
        LL = squeeze(ARRAY_DETREND_L2_CNDVI(i,j,:));
        T = table(ZZ,XX,YY,LL,'VariableNames',{'LST','ONDVI','DROUGHT','NDVI'});
        lm = fitlm(T);
        array = table2array(lm.Coefficients);
        COEFF_LST(i,j) = array(3,1);
    end
end
The table constructor uses inputname under certain circumstances - that can cause transparency violations inside parfor. I realise it's inconvenient, but perhaps you could try "hiding" the table call inside a separate function. I.e.
parfor ...
    T = myTableBuilder(ZZ,XX,...);
end

function t = myTableBuilder(varargin)
t = table(varargin{:});
end
In this case I'm getting a transparency error from table, so a simple solution that works is to not use table at all.
The code would then be:
Predictor_Matrix = horzcat(ZZ,XX,YY);
lm = fitlm(Predictor_Matrix,LL);   % LL (NDVI) is the response, as in the table-based version
This works on a cluster without throwing any errors.
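Put back into the loop from the question, the table-free version might look like this (a sketch that keeps the original array names; coeffs(3) is the same entry the original array(3,1) extracted):
COEFF_LST = ones(2044,1572);
parfor i = 1:2044
    for j = 1:1572
        ZZ = squeeze(ARRAY_DETREND_L2_LST(i,j,:));
        XX = squeeze(ARRAY_DETREND_L2_ONDVI(i,j,:));
        YY = squeeze(ARRAY_DETREND_WB_85(i,j,:));
        LL = squeeze(ARRAY_DETREND_L2_CNDVI(i,j,:));
        lm = fitlm([ZZ XX YY],LL);            % predictors as a plain matrix, response last
        coeffs = lm.Coefficients.Estimate;    % (1) intercept, (2) ZZ, (3) XX, (4) YY
        COEFF_LST(i,j) = coeffs(3);           % coefficient of the second predictor, as before
    end
end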

data driven curve - MATLAB

I have several sets of data that I want to fit, but not all of them look the same (some look like a Gaussian with one peak, some like two Gaussians with two peaks, or Lorentzians). I wanted to try this method:
http://www.mathworks.com/matlabcentral/fileexchange/31562-data-driven-fitting-with-matlab/content/fitit.m
but the program given is not complete, so I cannot use it as-is (there is no line that defines 'train' and 'test'). I am rewriting it so that it suits my data (based on the given code and the demo). I was able to find the best fit, but I am also trying to use the bootstrap technique to find confidence intervals. My data are xdata and ydata; they are sorted and duplicates have been removed before I use them in my program.
cpart = cvpartition(size(xdata,1),'k',10);
tr_x = xdata(training(cpart,1));
tr_y = ydata(training(cpart,1));
tst_x = xdata(test(cpart,1));
tst_y = ydata(test(cpart,1));
all_span = linspace(0.01,0.99,99);
s = zeros(length(all_span),1);
for k = 1:length(all_span)
    f = @(tr_x,tr_y,tst_x,tst_y) norm(tst_y - mylowess(tr_x,tr_y,tst_x,all_span(k)))^2;
    s(k) = sum(crossval(f,xdata,ydata,'partition',cpart));
end
[~,mj] = min(s);
n_span = all_span(mj); % n_span is the optimal span
function ys = mylowess(x1,y1,xs,span)
ys1 = smooth(x1,y1,span,'loess');
ys = interp1(x1,ys1,xs,'linear',NaN);
if any(isnan(ys))
    ys(xs<x1(1)) = ys1(1);
    ys(xs>x1(end)) = ys1(end);
end
So up to this point I understand the program and I have managed to find the optimal span. I want to find the confidence intervals but so far I was not able to make it work.
When I type:
NB = length(xdata);
f = @(xdata,ydata) mylowess(xdata,ydata,xdata,n_span);
yboot2 = bootstrp(NB,f,xdata,ydata)';
I get the following error
Error using griddedInterpolant
The grid vectors are not strictly monotonic increasing.
Error in interp1 (line 186)
F = griddedInterpolant(X,V,method);
Error in mylowess (line 26)
ysmooth=interp1(xdata,ysmooth1,xinput,'linear',NaN);
As I said before, there are no duplicates in xdata and I had already sorted xdata before using it in the program. Can anyone see the mistake I am making? Or is there an easier way to get the confidence intervals?
Thank you for your help.
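For reference, once the bootstrap itself runs without error, the confidence intervals are usually taken as pointwise percentiles across the replicates; a minimal sketch, assuming yboot2 is the matrix from the bootstrp call above (one smoothed curve per column):
ci_lower = prctile(yboot2,2.5,2);    % 2.5th percentile across replicates (dimension 2)
ci_upper = prctile(yboot2,97.5,2);   % 97.5th percentile across replicates
plot(xdata,ydata,'k.',xdata,ci_lower,'b--',xdata,ci_upper,'b--')  % data with a 95% pointwise band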

matlab zplane function: handles of vectors

I'm interested in understanding the zeros that a given filter produces, with the ultimate goal of identifying which frequencies are passed by high/low-pass filters. My idea is that finding the lowest-valued zero of a filter will identify the passband, for an LPF specifically. I'm attempting to use the [hz,hp,ht] = zplane(z,p) function to do so.
The description for that function reads "returns vectors of handles to the zero lines, hz". Could someone help me with what a vector of a handle is and what I do with one to be able to find the various zeros?
For example, a simple 5-point running average filter:
runavh = (1/5) * ones(1,5);
using zplane(runavh) gives an acceptable pole/zero plot, but running the [hz,hp,ht] = zplane(z,p) function results in hz=175.1075. I don't know what this number represents or how to use it.
Many thanks.
Using the get command, you can find out things about the data.
For example, type G=get(hz) to get a list of properties of the zero lines. Then the XData is given by G.XData, i.e. X=G.XData.
Alternatively, you can pull out only the data you want:
X=get(hz,'XData')
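Putting that together for the 5-point running-average filter from the question (a sketch; tf2zpk is part of the Signal Processing Toolbox and is one way to get the zero/pole vectors that zplane expects):
runavh = (1/5)*ones(1,5);       % 5-point running-average filter
[z,p,k] = tf2zpk(runavh,1);     % zeros and poles of its transfer function
[hz,hp,ht] = zplane(z,p);       % hz is a handle to the line object holding the zero markers
zeroRe = get(hz,'XData');       % real parts of the plotted zeros
zeroIm = get(hz,'YData');       % imaginary parts of the plotted zeros
plottedZeros = complex(zeroRe,zeroIm)   % should match z, up to ordering
Zeros that lie on the unit circle at angle w suppress the frequency w, which is what shapes the stopband of the filter.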
Hope that helps.

function parameters in matlab wander off after curve fitting

First, a little background: I'm a psychology student, so my background in coding isn't on par with you guys :-)
My problem is as follows, and the most important observation is that curve fitting with two different programs gives completely different results for my parameters, although my graphs stay the same. The main program we have used to fit my longitudinal data is kaleidagraph, and this should be seen as kind of the 'gold standard'; the program I'm trying to get the same results from is matlab.
I was trying to be smart and wrote some code (a lot at least for me) and the goal of that code was the following:
1. Taking an individual longitudinal datafile
2. curve fitting this data on a non-parametric model using lsqcurvefit
3. obtaining figures and the points where f' and f'' are zero
This all worked well (woohoo :-)), but when I started comparing the function parameters the two programs generate, there is a huge difference. The kaleidagraph fit stays close to its original starting values; matlab wanders off and sometimes ends up larger by a factor of 1000. The graphs, however, stay more or less the same in both situations, and both fit the data well. Still, it would be lovely to know how to make the matlab curve fitting more 'conservative', i.e. located nearer to its original starting values.
validFitPersons = true(nbValidPersons,1);
for i = 1:nbValidPersons
    personalData = data{validPersons(i),3};
    personalData = personalData(personalData(:,1)>=minAge,:);
    % Fit a specific model for all valid persons
    try
        opts = optimoptions(@lsqcurvefit,'Algorithm','levenberg-marquardt');
        [personalParams,personalRes,personalResidual] = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
    catch
        x = 1;
    end
Above is the part of the code I've written to fit the data files to a specific model.
Below is an example of a non-parametric model I use, with its function parameters.
elseif strcmpi(model,'jpa2')
    % y = a*(1 - 1/(1 + (b1*(t+e))^c1 + (b2*(t+e))^c2 + (b3*(t+e))^c3))
    heightModel = @(params,ages) abs(params(1).*(1-1./(1+(params(2).*(ages+params(8))).^params(5)+(params(3).*(ages+params(8))).^params(6)+(params(4).*(ages+params(8))).^params(7))));
    modelStrings = {'a','b1','b2','b3','c1','c2','c3','e'};
    % Define initial values
    if strcmpi('male',gender)
        initialValues = [176.76 0.339 0.1199 0.0764 0.42287 2.818 18.52 0.4363];
    else
        initialValues = [161.92 0.4173 0.1354 0.090 0.540 2.87 14.281 0.3701];
    end
I've tried to mimic the curve fitting process in kaleidagraph as closely as possible. There I found that it uses the Levenberg-Marquardt algorithm, which I've selected. However, the results still vary and I don't have any more clues about how I can change this.
Some extra clarification:
The idea behind this code is the following:
I'm trying to compare different fitting models (they are designed for this purpose). So what I do is: I have 5 models with different parameters and different starting values (the second part of my code), and then I have the general curve fitting file. Since there are different models, it would be interesting if I could put restrictions on how far my starting values can wander off.
Anyone any idea how this could be done?
Anybody willing to help a psychology student?
Cheers
This is a common issue when dealing with non-linear models.
If I were you, I would check whether you can remove some parameters from the model in order to simplify it.
If you really want to keep your solution not too far from the initial point, you can use upper bounds and lower bounds for each variable:
x = lsqcurvefit(fun,x0,xdata,ydata,lb,ub)
defines a set of lower and upper bounds on the design variables in x so that the solution is always in the range lb ≤ x ≤ ub.
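For the code in the question, that could look roughly like this (a sketch: the +/-50% window is purely illustrative, and bound constraints are handled by the default trust-region-reflective algorithm rather than by 'levenberg-marquardt'):
% keep every parameter within, say, 50% of its starting value (illustrative window)
lb = initialValues - 0.5*abs(initialValues);
ub = initialValues + 0.5*abs(initialValues);
% bounds require the trust-region-reflective algorithm
opts = optimoptions(@lsqcurvefit,'Algorithm','trust-region-reflective');
personalParams = lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),lb,ub,opts);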
Cheers
You state:
I'm trying to compare different fitting models (they are designed for
this purpose). So what I do is I have 5 models with different
parameters and different starting values ( the second part of my code)
and next I have the general curve fitting file.
You will presumably compare the statistics from fits with different models, to see whether reductions in the fitting error are unlikely to be due to chance. You may want to rely on that comparison to pick the model that not only fits your data suitably but is also simplest (which is often referred to as the principle of parsimony).
The problem really lies with the model you have shown, which results in correlated parameters and therefore overfitting, as mentioned by @David. Again, this should be resolved when you compare different models and find that some do just as well (statistically speaking) even though they involve fewer parameters.
Edit:
To drive the point home regarding the problem with the choice of model, here are (1) the results of a trial fit using simulated data and (2) the correlation matrix of the parameters in graphical form:
Note that absolute values of the correlation close to 1 indicate strongly correlated parameters, which is highly undesirable. Note also that the trend in the data is practically linear over a long portion of the dataset, which implies that 2 parameters might suffice over that stretch, so using 8 parameters to describe it seems like overkill.
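For reference, one way to compute such a parameter correlation matrix yourself is from the Jacobian that lsqcurvefit can return (a sketch; variable names follow the question's code, and the covariance is the usual linearised approximation):
% request the Jacobian as the last output of lsqcurvefit
[personalParams,resnorm,residual,~,~,~,J] = ...
    lsqcurvefit(heightModel,initialValues,personalData(:,1),personalData(:,2),[],[],opts);
J = full(J);                                    % lsqcurvefit returns a sparse Jacobian
dof = numel(residual) - numel(personalParams);  % residual degrees of freedom
covP = inv(J'*J)*resnorm/dof;                   % approximate parameter covariance
sd = sqrt(diag(covP));
corrP = covP./(sd*sd');                         % parameter correlation matrix
Off-diagonal entries of corrP close to +/-1 correspond to the strongly correlated parameter pairs discussed above.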