In order to identify important variables in a dataset, I am starting by fitting a univariate logistic regression model for each variable. However, for some categorical variables, some categories contain only a few observations (the total number of observations is 1155; there are no missing values). Running the code
modelspec1='Y ~ X';
B1 = fitglm(table,modelspec1,'Distribution','binomial');
returns the warning
Warning: Iteration limit reached.
In glmfit (line 324)
In GeneralizedLinearModel/fitter (line 575)
In classreg.regr.FitObject/doFit (line 94)
In GeneralizedLinearModel.fit (line 882)
In fitglm (line 142)
and the estimate of beta becomes around 100, with a huge standard error and a p-value close to 1. I have tried to increase the number of iterations using
opts = statset('glmfit');
opts.MaxIter = 10000; % default value for glmfit is 100.
but it did not help. I really need to get some kind of an estimate (just dropping a category is sadly not an option). How can I fix the problem?
Related
I've been running a variation of the doseResponse function downloaded from here to generate dose-response sigmoid curves. However, I've had trouble with one of my datasets generating a linear curve instead. By running the following code, I get the following error and produce the following graph. I also uploaded the data called dose2.csv and resp2.csv to google drive here. Does anyone know how I can fix this? Thanks.
Code to generate graph
% Plotting Dose-Response Curve
response = resp2;
dose = dose2;
% Deal with 0 dosage by using it to normalise the results.
normalised=0;
if (sum(dose(:)==0)>0)
%compute mean control response
controlResponse=mean(response(dose==0));
%remove controls from dose/response curve
response=response(dose~=0)/controlResponse;
dose=dose(dose~=0);
normalised=1;
end
%hill equation sigmoid
sigmoid=@(beta,x)beta(1)+(beta(2)-beta(1))./(1+(x/beta(3)).^beta(4));
%calculate some rough guesses for initial parameters
minResponse=min(response);
maxResponse=max(response);
midResponse=mean([minResponse maxResponse]);
minDose=min(dose);
maxDose=max(dose);
%fit the curve and compute the values
%[coeffs,r,J]=nlinfit(dose,response,sigmoid,[minResponse maxResponse midResponse 1]); % nlinfit doesn't work as well
beta_new = lsqcurvefit(sigmoid,[minResponse maxResponse midResponse 1],dose,response);
[coeffs,r,J]=nlinfit(dose,response,sigmoid, beta_new);
ec50=coeffs(3);
hillCoeff=coeffs(4);
%plot the fitted sigmoid
xpoints=logspace(log10(minDose),log10(maxDose),1000);
semilogx(xpoints,sigmoid(coeffs,xpoints),'Color',[1 0 0],'LineWidth',2)
hold on
%notate the EC50
text(ec50,mean([coeffs(1) coeffs(2)]),[' \leftarrow ' sprintf('EC_{50}=%0.2g',ec50)],'FontSize',20,'Color',[1 0 0]);
%plot mean response for each dose with standard error
doses=unique(dose);
meanResponse=zeros(1,length(doses));
stdErrResponse=zeros(1,length(doses));
for i=1:length(doses)
responses=response(dose==doses(i));
meanResponse(i)=mean(responses);
stdErrResponse(i)=std(responses)/sqrt(length(responses));
%stdErrResponse(i)=std(responses);
end
errorbar(doses,meanResponse,stdErrResponse,'o','Color',[1 0 0],'LineWidth',2,'MarkerSize',12)
Warning Message
Solver stopped prematurely.
lsqcurvefit stopped because it exceeded the function evaluation limit,
options.MaxFunctionEvaluations = 4.000000e+02.
Warning: Iteration limit exceeded. Returning results from final iteration.
Graph (looking to generate a sigmoid curve not linear)
You also need to choose a better initial value than [minResponse maxResponse midResponse 1] for lsqcurvefit. Don't simply start from the minimum or maximum of the given values. Instead, use the model equation itself to estimate the coefficients.
Take the sigmoid model sigmoid=@(beta,x)beta(1)+(beta(2)-beta(1))./(1+(x/beta(3)).^beta(4)). For positive beta(4), the equation approaches beta(1) as x goes to inf and beta(2) as x goes to 0 (the two swap for negative beta(4)). Either way, beta(1) and beta(2) are the two plateaus, so the initial estimates minResponse, maxResponse, and midResponse seem reasonable enough. Your problem actually lies in the initial estimate of 1 for beta(4). beta(4) can be roughly estimated from the slope of your log graph. From my rough sketch it was around 1/4, so the initial estimate of 1 was too large for convergence.
beta_new = lsqcurvefit(sigmoid,[minResponse maxResponse midResponse 1/4],dose,response);
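The plateau behaviour of the Hill equation can be checked numerically before fitting. The following is a minimal sketch using illustrative parameter values (they are assumptions, not values fitted to the posted data). Which plateau is reached at high dose depends on the sign of beta(4); the sketch uses a positive exponent.

```matlab
% Hill-type sigmoid from the question
sigmoid = @(beta,x) beta(1)+(beta(2)-beta(1))./(1+(x/beta(3)).^beta(4));

% Illustrative parameters only (assumed): plateaus 0.9 and 0, EC50 = 50, slope 0.5
beta = [0.9 0 50 0.5];

lowDose  = sigmoid(beta, 1e-12);    % x -> 0:   approaches beta(2) when beta(4) > 0
highDose = sigmoid(beta, 1e12);     % x -> inf: approaches beta(1) when beta(4) > 0
atEC50   = sigmoid(beta, beta(3));  % x = beta(3): midway between the plateaus

fprintf('low dose: %.6f, high dose: %.6f, at EC50: %.6f\n', lowDose, highDose, atEC50);
```

Evaluating the model like this at a few doses also shows quickly whether a candidate initial guess produces a curve anywhere near the data.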
I've been searching for a way to create dose response curves from data within MATLAB. So far, I've been running the doseResponse function downloaded from here. However, I've been having trouble generating a sigmoidal curve that fits my data. By running the following code, I get the following error and produce the following graph. I also uploaded my data to google drive here. Does anyone know how I can fix this? Or does anyone know of a better way to produce dose response curves? Thanks.
Code
doseResponse(drug_conc,resp_vals)
xlabel('5HT Dose (nm)','FontSize',20)
ylabel('Normalized Response to Stimuli (\DeltaF / F)','FontSize',20)
Warning Message
Warning: Rank deficient, rank = 3, tol = 1.196971e-11.
> In nlinfit>LMfit (line 587)
In nlinfit (line 284)
In doseResponse (line 47)
Warning: Rank deficient, rank = 1, tol = 3.802134e-12.
> In nlinfit>LMfit (line 587)
In nlinfit (line 284)
In doseResponse (line 47)
Warning: Some columns of the Jacobian are effectively zero at the solution, indicating that the model is insensitive to some of its parameters.
That may be because those parameters are not present in the model, or otherwise do not affect the predicted values. It may also be due to
numerical underflow in the model function, which can sometimes be avoided by choosing better initial parameter values, or by rescaling or
recentering. Parameter estimates may be unreliable.
> In nlinfit (line 381)
In doseResponse (line 47)
ans =
-5.2168e+07
Most rank-deficiency problems in nonlinear curve fitting come from poorly chosen initial values. You may refer to John's answer for why this matters. In the doseResponse function you used, the sigmoid curve fitting is coded as below.
%calculate some rough guesses for initial parameters
minResponse=min(response);
maxResponse=max(response);
midResponse=mean([minResponse maxResponse]);
minDose=min(dose);
maxDose=max(dose);
%fit the curve and compute the values
[coeffs,r,J]=nlinfit(dose,response,sigmoid, [minResponse maxResponse midResponse 1]);
The initial coefficients for the sigmoid are estimated from the min and max values of the given dose and response. I'm not sure whether that is the convention for dose-response plots. If it is not, and you have access to the Optimization Toolbox, you can use lsqcurvefit to estimate better initial values.
lsqcurvefit(sigmoid, [minResponse maxResponse midResponse 1], dose, response);
> [0.898627445275206,-0.0795383479232744,57.8616285607284,0.537847599487817]
lsqcurvefit can do two things that the Statistics Toolbox functions cannot: (1) accept bounds on the parameters, and (2) fit matrix dependent variables (https://kr.mathworks.com/matlabcentral/answers/131109-parameter-estimation-nlinfit-vs-fitnlm), and can therefore provide a better estimate. Afterward, simply pass its result to nlinfit as the initial value.
%fit the curve and compute the values
beta_new = [0.898627445275206,-0.0795383479232744,57.8616285607284,0.537847599487817];
[coeffs,r,J]=nlinfit(dose,response,sigmoid, beta_new);
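The bounds feature mentioned above can be sketched as follows. This is only an illustration on synthetic data: the dose grid, trueBeta, the noise level, and the bound values are all assumptions, not taken from the posted dataset. Keeping beta(3) positive avoids complex values from (x/beta(3)).^beta(4), and raising MaxFunctionEvaluations addresses the "exceeded the function evaluation limit" message.

```matlab
sigmoid = @(beta,x) beta(1)+(beta(2)-beta(1))./(1+(x/beta(3)).^beta(4));

% Synthetic stand-in data (assumed values, for illustration only)
rng(0);
dose     = logspace(-1, 3, 25).';
trueBeta = [0.9 0 50 0.6];
response = sigmoid(trueBeta, dose) + 0.01*randn(size(dose));

% Initial guess and bounds: EC50 stays inside the dose range, slope stays positive
beta0 = [max(response) min(response) median(dose) 1];
lb    = [-Inf -Inf min(dose) 0.05];
ub    = [ Inf  Inf max(dose) 10];

opts = optimoptions('lsqcurvefit', 'Display', 'off', ...
                    'MaxFunctionEvaluations', 5000);
betaFit = lsqcurvefit(sigmoid, beta0, dose, response, lb, ub, opts);

% betaFit can then be passed to nlinfit as its starting point
disp(betaFit);
```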
I am using [h,p,ksstat,cv] = kstest(x,'cdf',y); in MATLAB to find ksstat and the p-value. My x is x(1,1:10) = [0.16;1.21;4.41;0.09;0.64;0.36;0.04;6.76;0.04;0.49] and my y = chi2cdf(x,9); is the cdf I am testing against. However, I get this error:
Error using kstest (line 160)
Hypothesized CDF matrix must have 2 columns.
Normally I would have [h,p,ksstat,cv] = kstest(x,'cdf',y); where
y = makedist('ChiSquared'); but as you may know, makedist has no chi-squared distribution, so I am not sure how to get around this issue. Any suggestions are greatly appreciated.
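I think you should write the following, which forces x and y into columns so that the CDF matrix has exactly 2 columns:
[h,p,ksstat,cv] = kstest(x,'cdf',[x(:) y(:)]);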
as the documentation says:
When CDF is a matrix, column 1 contains a set of possible x values, and column 2 contains the corresponding hypothesized cumulative distribution function values G(x).
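As a minimal sketch of the two-column form, using the x values from the question stored as a column vector. The second call is an alternative that is not part of the original answer: it relies on the fact that a chi-squared distribution with 9 degrees of freedom is a gamma distribution with shape 9/2 and scale 2, which makedist does support.

```matlab
% Sample from the question, as a column vector
x = [0.16;1.21;4.41;0.09;0.64;0.36;0.04;6.76;0.04;0.49];

% Hypothesized CDF evaluated at the sample points
y = chi2cdf(x, 9);

% Two-column matrix form: column 1 = x values, column 2 = CDF values
[h1, p1, ks1] = kstest(x, 'CDF', [x y]);

% Alternative (assumption, not from the answer): chi-squared(9) as Gamma(9/2, 2)
pd = makedist('Gamma', 'a', 9/2, 'b', 2);
[h2, p2, ks2] = kstest(x, 'CDF', pd);

fprintf('matrix form: ksstat = %.4f, object form: ksstat = %.4f\n', ks1, ks2);
```

Both calls test against the same hypothesized distribution, so they should agree up to rounding.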
I'm trying to build a Gaussian Mixture Model using random initializations and compare the results with one using Kmeans initializations. However, I have difficulty creating the initial covariance matrix. I randomly selected 10 data points from my data set of 2500 data points (each "point" is actually an image), and used them as the means. Then I'm trying to create the covariance matrix from each of these random points.
Here's what I have.
% Randomly initialize GMM parameters
rng(1);
rand_index = randperm(2500);
Mu = data(:,rand_index(1:10));
for i = 1 : 10
Sigma(:,:,i) = cov(Mu);
Pxi(:,i) = mvnpdf(data', Mu(:,i)', Sigma(:,:,i));
end
data is a 50x2500 matrix. I keep getting an error because my Sigma is of the wrong size, or it's not positive definite, etc.
For example, the code above gave the error
Error using mvnpdf (line 116)
SIGMA must be a square matrix with size equal to the number of columns in X, or a row vector with length equal to the number of
columns in X.
If I use
Sigma(:,:,i) = cov([Mu(:,i) Mu(:,i)]');
I get the error
Error using mvnpdf (line 129)
SIGMA must be a square, symmetric, positive definite matrix.
How should I create this covariance matrix?
I assume what you are experiencing is not happening at every run. This is a numerical instability that you can avoid using a simple technique:
%Add a tiny variance to avoid numerical instability
Sigma(:,:,i) = cov([Mu(:,i) Mu(:,i)]');
D = size(Sigma,1);
Sigma(:,:,i) = Sigma(:,:,i) + 1E-5.*diag(ones(D,1));
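For context on the first error: cov treats rows as observations, so cov(Mu) on a 50x10 matrix returns a 10x10 matrix, while mvnpdf needs a 50x50 covariance here. The sketch below shows one alternative initialization that is different from the answer above: every component starts from the covariance of the whole dataset plus the same small ridge. The random data is only a stand-in for the real 50x2500 matrix, and K and the ridge size are assumptions.

```matlab
rng(1);
data = randn(50, 2500);              % stand-in for the real 50x2500 data matrix
K = 10;                              % number of mixture components

% Pick K random data points as initial means (50xK)
idx = randperm(size(data, 2), K);
Mu  = data(:, idx);

% Shared initial covariance: columns of data' are the 50 variables
D      = size(data, 1);
Sigma0 = cov(data') + 1e-5 * eye(D); % 50x50, ridge keeps it positive definite
Sigma  = repmat(Sigma0, [1 1 K]);    % one copy per component

% Density of every point under every component
Pxi = zeros(size(data, 2), K);
for i = 1:K
    Pxi(:, i) = mvnpdf(data', Mu(:, i)', Sigma(:, :, i));
end
```

With only 10 selected points there is not enough information to estimate a full 50x50 covariance per component, which is why this sketch falls back to the global covariance as a starting value.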
I'm trying to find optimized parameters for a model defined by an implicit function to fit a dataset using fsolve and lsqcurvefit. I have defined 3 functions in separate m-files: first one is a definition for the implicit function in 4 parameters to be defined, second one uses fsolve to find the roots of the implicit function defined and the third one uses lsqcurvefit to find optimized values for the four parameters. I naturally need to define good enough initial values for the parameters, but having tried various reasonable combinations, lsqcurvefit always runs for some 20-30 iterations (matlab prints out the vector values calculated with the solution found by fsolve after each iteration) and then prints
No solution found.
fsolve stopped because the problem appears regular as measured by the gradient,
but the vector of function values is not near zero as measured by the
default value of the function tolerance.
<stopping criteria details>
??? Error using ==> lsqcurvefit at 253
Function value and YDATA sizes are incommensurate.
Error in ==> optimointi at 5
z = lsqcurvefit('laske_i',parametrit,V_vektori,I_vektori_mitattu,[],[],options);
I can't see how "Function value and YDATA sizes are incommensurate." suddenly, as the iteration first runs for 20-30 times. The values printed after each iteration are pretty much full of zeros (good fit), but the last few 'explode' from 0 to 1 (with a coefficient of several powers of ten). Any help on the error appreciated!
The error lies in how fsolve was being called. fsolve itself worked, but I had to add a for-loop that calls fsolve once for each element of the domain vector, so that the result is also a vector. Without that, the function value and YDATA sizes really were incommensurate.
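The per-element approach can be sketched like this. The implicit function implicitF, the parameter vector p, and the voltage grid are hypothetical stand-ins (the asker's actual model is not shown); the point is only that the model handle returns one solved value per element of the input, so its size matches YDATA.

```matlab
% Hypothetical implicit model F(I, V, p) = 0 (a stand-in, not the asker's function)
implicitF = @(I, V, p) I - p(1) * (exp(p(2) * (V - p(3) * I)) - 1);

fsolveOpts = optimoptions('fsolve', 'Display', 'off');

% Model for lsqcurvefit: solve the implicit equation once per element of V.
% arrayfun returns an array with the same size as V.
model = @(p, V) arrayfun(@(v) fsolve(@(Ik) implicitF(Ik, v, p), 0, fsolveOpts), V);

V = linspace(0, 0.5, 20).';   % assumed domain vector (column)
p = [1e-3 5 10];              % assumed parameter values
I = model(p, V);

% The output must match the shape of the measured data passed as YDATA
disp(size(I));
```

A handle like model can then be passed to lsqcurvefit in place of the function name, together with measured data of the same size as V.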
In my case, the lsqcurvefit error "Function value and YDATA sizes are incommensurate" was caused by the vector I was using as ydata. It was quite a silly thing, actually.
The vector must be in column form, y=[1;2;3], not row form, y=[1 2 3]. That was causing the problem in lsqcurvefit, because the xdata were in column form too.
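A minimal sketch of that fix, with made-up data and a simple linear model (both are assumptions for illustration). Forcing both vectors into columns with (:) guarantees that the model output and ydata have the same shape.

```matlab
% Made-up data, deliberately stored as row vectors
xdata = [1 2 3 4];
ydata = [2.1 3.9 6.2 7.8];

% Force both into column form so the model output matches ydata
xdata = xdata(:);
ydata = ydata(:);

% Simple linear model used only for illustration
model = @(p, x) p(1) * x + p(2);

opts = optimoptions('lsqcurvefit', 'Display', 'off');
pFit = lsqcurvefit(model, [1 0], xdata, ydata, [], [], opts);
disp(pFit);
```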