Short description
I am trying to run a (GLM) regression in Matlab (using the fitglm function) where one of the regressors is a categorical variable. However, instead of adding an intercept and dropping the first level, I would like to include each level of the categorical variable and exclude the constant term.
Motivation
I know that theoretically the results are the same either way, but I have two reasons against estimating the model with a constant and interpreting the dummy level coefficients differently:
The smaller problem is that I am running many regressions as part of a larger estimation procedure using different subsets of a large dataset, and the available levels of my categorical variable might not be the same across the regressions. In the end I would like to compare the estimated coefficients for the levels. This can be solved with some additional code/hacking, but it would not be an elegant solution.
The bigger problem is that there are orders of magnitude of difference between the coefficients of the levels: some of them are extremely small. If such a level gets used as the base level, I am afraid that it might cause numerical/optimization problems.
Tried approaches
I tried subclassing the GeneralizedLinearModel class but unfortunately it is marked as final. Class composition also does not work as I cannot even inherit from the parent of the GeneralizedLinearModel class. Modifying Matlab's files is not an option as I use a shared Matlab installation.
The only idea I could come up with is using dummyvar or something similar to turn my categorical variable into a set of dummies, and then using these dummy variables in the regression. AFAIK this is how Matlab works internally, but by taking this approach I lose the user-friendliness of dealing with categorical variables.
P.S. This question was also posted on MatlabCentral at this link.
As there seems to be no built-in way to do this, I am posting a short function that I wrote to get the job done.
I have a helper function to convert the categorical variable into an array of dummies:
function dummyTable = convert_to_dummy_table(catVar)
    % Expand the categorical variable into one indicator (dummy) column per level
    dummyTable = array2table(dummyvar(catVar));
    % Name the columns "<variableName>_<level>", the same pattern Matlab uses
    % when it expands a categorical predictor during estimation
    varName = inputname(1);
    levels = categories(catVar)';
    dummyTable.Properties.VariableNames = strcat(varName, '_', levels);
end
The usage is quite simple. If you have a table T with some continuous explanatory variables X1, X2, X3, a categorical explanatory variable C and a response variable Y, then instead of using
M = fitglm(T, 'Distribution', 'binomial', 'Link', 'logit', 'ResponseVar', 'Y')
which would fit a logit model using k - 1 levels for the categorical variable and an intercept, one would do
estTable = [T(:, {'X1', 'X2', 'X3', 'Y'}), convert_to_dummy_table(T.C)]
M = fitglm(estTable, 'Distribution', 'binomial', 'Link', 'logit', ...
'ResponseVar', 'Y', 'Intercept', false)
It is not as nice and readable as the default way of handling categorical variables, but it has the advantage that the names of the dummy variables are identical to the names that Matlab automatically assigns during estimation using a categorical variable. Therefore the Coefficients table of the resulting M object is easy to parse or understand.
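For instance (a minimal sketch, assuming C has a hypothetical level called 'foo'), an estimate and p-value can be read directly from that table by row name:
% 'foo' is a placeholder level name; coefficient row names follow the
% "<variable>_<level>" pattern produced by convert_to_dummy_table
est  = M.Coefficients{'C_foo', 'Estimate'};
pval = M.Coefficients{'C_foo', 'pValue'};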
Related
I have a multivariable linear optimization problem that I could use some guidance with in finding an optimal function/code approach (Matlab). My problem is as follows:
I have a set of observed data, I'll call this d(i), which is a 5000x1 vector (# of rows may change).
I have 10 - 100 sets of simulated data, the number of sets is a number I decide on. Each of these sets is also a 5000x1 vector (again, # of rows may change). I'll call these c1(i), c2(i), etc.
I would like to fit the simulated data sets to the observed data set with this equation:
sf1*c1(i) + sf2*c2(i) + sf3*c3(i) + sf4*c4(i) + ... = d(i) + error
In this equation, I would like to solve for all of the scale factors (sf) (non-negative constants) and the error. I am assuming I need to set initial values for all the scale factors for this problem to work. I have looked into things like lsqnonneg, but I am unclear on whether that function can solve or optimize for this many variables per equation.
See above - I have also manually input the values of some scale factors and I can get a pretty good fit to the data by hand, but this is impractical for large quantities of simulated data sets.
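For what it's worth, here is a minimal sketch of the non-negative least-squares route (assuming the simulated sets are stacked as columns of a matrix A); lsqnonneg needs no initial guesses and handles this many variables without trouble:
A = [c1, c2, c3];        % 5000 x (number of simulated sets); add more columns as needed
sf = lsqnonneg(A, d);    % non-negative least squares: minimizes ||A*sf - d|| with sf >= 0
err = d - A*sf;          % remaining error of the fit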
did you try looking at https://www.mathworks.com/help/stats/linear-regression.html?s_tid=CRUX_lftnav ?
Instead of using c1, c2, ..., c100 as separate vectors, concatenate them into a single matrix with one column per simulated set, say A = [c1, c2, ..., c100] (5000 x 100); this will make life easier.
Then look, for example, at ridge regression:
Ans= ridge(d,A,k)
where k is the regularization parameter that can be found by cross-validation:
[U,s,V] = svd(A, 'econ');
k=gcv(U,diag(s),d,'tsvd');
see the function gcv here https://www.mathworks.com/matlabcentral/fileexchange/52-regtools
I want to fit a non-linear function (the variable is T) to experimental data.
Here I used lsqcurvefit, but I don't know the exact principle of this function, and I also don't know how to define the model function for it (so the script I wrote is bad).
For this, how can I modify the script below?
(Even without using lsqcurvefit, I don't care; chi-square, least-squares, any form is okay.)
Code is as follows (3 parts):
%script 1
x=coefficient1
y=coefficient2
z=coefficient3
A1=x*0.321*T
A2=(y/0.2)+0.5*T
A3=(z+0.3)*0.17/T
%script 2
global A1
global A2
global A3
Result=(A1+0.3)*A2+0.3
%script 3
global Result
Sample = readmatrix('experimentaldata');
XX = Sample(1,:)';
YY = Sample(2,:)';
xdata = linspace(min(XX), max(XX), 2000);
ydata = interp1(XX, YY, xdata);
Fitting = lsqcurvefit(Result, T, xdata, ydata, 250, 2000)
I'm afraid I may have misunderstood your script (including the bizarre global variables and coefficients...), but so far it seems to me that you want to fit your data to a quadratic polynomial (treating T as the single variable and x, y, z as constants; inserting A1 and A2 into your Result gives a quadratic polynomial):
%Result = (x*0.321*T+0.3)*((y/0.2)+0.5*T)+0.3
Result = @(c,T) c(1)*T.^2 + c(2)*T + c(3);
Here, your final equation form will be as above; c stands for your constants, and T stands for your variable (or xdata).
After you clarify your equation form, you should set the initial values of the constants c. In lsqcurvefit, as given in its documentation, the final answer is found by iterating until the error (or Euclidean distance) becomes smaller than a given tolerance. Therefore setting a valid initial value is critical here; if not, the answer may converge to a local minimum (you can refer to John's excellent answer about this issue here: curve fitting error: nlinfit rank deficient).
c0=[0.1, 0.2, 0.3];
But in this case I'll just give arbitrary initial values, as above.
Afterwards, the answer (the vector c) is ready to be calculated:
Fitting = lsqcurvefit(Result, c0, xdata, ydata)
Other MATLAB non-linear fitting functions, including nlinfit, have a similar workflow; you can find more detailed information in their documentation.
But personally, I recommend using MATLAB's polyfit if your final equation form is a polynomial; then you don't have to worry about tuning initial values...
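A minimal sketch of the polyfit route (assuming the interpolated xdata/ydata from the scripts above):
p = polyfit(xdata, ydata, 2);    % fit a quadratic: p(1)*T^2 + p(2)*T + p(3)
yfit = polyval(p, xdata);        % evaluate the fitted polynomial at the data points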
I am taking an Econometrics course, and have been trying to use Python rather than the proprietary STATA and EVIEWS the assignments are set in.
In one of the questions, I have consumption data over time. I am asked to compute it in two ways.
The first way is to estimate a model of the form consumption = A*exp(B*t); the second way is to take logs of both sides and run ordinary OLS on log(consumption) = alpha + B*t.
I know how to do the second way. However, when I try to do the first way it goes wrong. Using statsmodels, I can exponentiate the time data (after normalising), but this calculates a regression of the form consumption = A*exp(t) + B, which is not what I want (I want to specify where the parameters go). In sklearn I could find polynomial regression, but not exponential.
Then I found scipy.curve_fit
However this seems to have two problems:
(1) It seems to rely on initial guesses for the parameters, which means my output will end up being different from the proprietary software (whereas output for things like OLS is the same). I assume the initial guesses mean some iterative solution is used, which is helpful for very weird and wonderful functions, but I would expect fairly standard results to hold for exponential regression.
(2) Every time I try to implement it, it just returns the guess parameters.
Here is my code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

consumption_data = pd.read_csv("......\consumption.csv")

def func(x, a, b):
    return a * np.exp(b * x)

xdata = consumption_data.YEAR
ydata = consumption_data.CONSUMPTION
ydata = (ydata - 1948) / 100
popt, pcov = curve_fit(func, xdata, ydata, (1, 1))
print(popt)
plt.plot(xdata, func(xdata, *popt), 'g--')
The scipy.optimize code is basically just copy-pasted from their tutorial
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
short answer: use statsmodels GLM
statsmodels does not have nonlinear least squares. The best python library for that is lmfit https://pypi.org/project/lmfit/
curve_fit, lmfit and nonlinear least squares algorithms in general find an iterative solution to the optimization problem. Even when we have to provide starting values, the solution is in many cases the same across packages up to the convergence tolerance, e.g. 1e-5 or 1e-6.
Many standard models in statistics and econometrics have a single global maximum with well behaved data. However, in other cases like mixture models, there might be many local optima and the estimation might converge to one of them.
To the specific case:
consumption = A exp(B t)
can be rewritten as
consumption = exp(a + B t),  with a = log(A)
So this is just a single index model or a generalized linear model with an exponential mean function.
The general version has the expectation of the dependent variable as a nonlinear function of a linear combination of the explanatory variables:
E(y | x) = g(x b)
This can be estimated in statsmodels with GLM, using the Gaussian family and the log link.
Aside: in econometrics there is a literature on using Poisson quasi-likelihood as an estimator for exponential-mean models instead of taking the log of the dependent variable.
Poisson usually uses the log-link function as in the above.
However, using GLM allows us to use log-link, i.e. exponential mean function, with any of the supported distribution families. The main difference is in the underlying variance assumption. Gaussian assumes constant variance, Poisson assumes that the variance is proportional to the mean and Gamma assumes that the variance is quadratic in the mean.
If we use a robust sandwich covariance estimator for parameter inference, then standard errors and inference are correct even if the variance function is misspecified.
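A minimal sketch of that approach (hypothetical data; assumes a reasonably recent statsmodels):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
t = np.arange(60, dtype=float)                       # hypothetical time index
y = 2.0 * np.exp(0.03 * t) + rng.normal(scale=0.1, size=t.size)  # hypothetical consumption

X = sm.add_constant(t)                               # columns [1, t] -> E(y) = exp(a + B*t)
family = sm.families.Gaussian(link=sm.families.links.Log())
res = sm.GLM(y, X, family=family).fit(cov_type="HC1")  # robust (sandwich) standard errors
print(res.params)                                    # params[0] = a = log(A), params[1] = B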
I am trying to perform a multiple linear regression in MATLAB using the regress function, and I am using a number of different variables that involve different scales and units. I assume the answer to this question is yes, but should I normalize each variable before running the regression? I'm not sure if MATLAB does so automatically. Thanks for the help!
Yes, you should. If you want to normalize to the range 0 to 1, you could use the mat2gray function (assuming vector is your list of values).
norm_vect = mat2gray(vector);
This function is meant to convert a matrix into an intensity image, but it works well if you don't want to write your own. You can also use a simple normalization like:
norm_vect = zeros(size(vector));   % preallocate the output
for i = 1:length(vector)
    norm_vect(i) = (vector(i) - min(vector)) / (max(vector) - min(vector));
end
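The loop can also be replaced by a single vectorized expression (or, on newer MATLAB releases, by normalize(vector, 'range')):
norm_vect = (vector - min(vector)) ./ (max(vector) - min(vector));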
Thanks in advance for the help.
I am trying to create a repeated measures model. I have a table with several response variables and several predictor variables. I created a model using Wilkinson notation then passed my table with the model to fitrm.
mdl = fitrm(t,model);
I am getting the following error.
Error using RepeatedMeasuresModel.fit (line 1331)
The between-subjects design must have full column rank.
Error in fitrm (line 67)
s = RepeatedMeasuresModel.fit(ds,model,varargin{:});
What exactly does this error mean? When I change my model to 'response ~ 1' (in addition to a few others I have found) fitrm runs just fine. I am fairly certain that my problem then has to be either with my formulation of the model, or (what I think is the problem) there is something wrong with my table.
The error itself hints to me that the rank of my table (if I were to convert it to a matrix) is not full. However, I have checked to make sure that neither my rows nor my columns are linearly dependent. In particular, I isolated two features, let's call them x and y, that when present give me the above error. When I use either
model = 'response ~ y' %or
model = 'response ~ x'
I get no error. When I use
model = 'response ~ y + x'
I get the above error. y and x are linearly independent. What could possibly be going on here?
Note, I cannot provide much more information than what I have already provided because of the nature of the data and model (medical data).
Edit:
A majority of my features are categorical numbers. For example, x may be 1, 2, or 3, where the values of x are categorical. I determined the linear independence of my features by treating all predictors as numerical values. My thought is that if they are linearly independent in numerical form, they will certainly be so when represented in a different format, for example ASCII. However, it may be possible that the binary representations of two feature sets are linearly dependent. I consider this highly unlikely, though, considering the size of my dataset.
When I ran my code treating all categorical variables as numeric variables I did not receive an error. This may mean that this is an internal MATLAB error. Could it be possible that my features are linearly dependent when represented as strings, categorical, or nominal values? (These are the ways I have attempted to classify my categorical variables to test whether this was indeed the problem; the MATLAB wiki indicates that these variable types are all treated equally as categorical variables.)
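One way to see what the between-subjects design roughly looks like is to rebuild it with dummy coding and check its rank; a minimal sketch (hypothetical variable names x and y in table t, first level of each categorical dropped as a reference):
% Expand each categorical predictor into indicator columns, drop the first
% (reference) level of each, and prepend the intercept column; if rank(X) is
% smaller than size(X,2), the design is rank deficient even though the raw
% numeric codes for x and y are linearly independent.
dx = dummyvar(categorical(t.x));
dy = dummyvar(categorical(t.y));
X = [ones(height(t),1), dx(:,2:end), dy(:,2:end)];
fprintf('columns = %d, rank = %d\n', size(X,2), rank(X));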