Thanks in advance for the help.
I am trying to create a repeated measures model. I have a table with several response variables and several predictor variables. I created a model using Wilkinson notation then passed my table with the model to fitrm.
mdl = fitrm(t,model);
I am getting the following error.
Error using RepeatedMeasuresModel.fit (line 1331)
The between-subjects design must have full column rank.
Error in fitrm (line 67)
s = RepeatedMeasuresModel.fit(ds,model,varargin{:});
What exactly does this error mean? When I change my model to 'response ~ 1' (and a few others I have found), fitrm runs just fine. I am fairly certain that the problem is either with my formulation of the model or, what I think is more likely, with something in my table.
The error itself hints to me that the rank of my table (if I were to convert it to a matrix) is not full. However, I have checked that neither my rows nor my columns are linearly dependent. In particular, I isolated two features, let's call them x and y, that when both present give me the above error. When I use either
model = 'response ~ y' %or
model = 'response ~ x'
I get no error. When I use
model = 'response ~ y + x'
I get the above error. y and x are linearly independent. What could possibly be going on here?
Note, I cannot provide much more information than what I have already provided because of the nature of the data and model (medical data).
Edit:
A majority of my features are categorical numbers. For example, x may be 1, 2, or 3, where the values of x are categorical. I determined the linear independence of my features by treating all predictors as numerical values. My thought is that if they are linearly independent in numerical form, they will certainly be so when represented in a different format, for example ASCII. However, it may be possible that the binary representations for two feature sets are linearly dependent. I consider this highly unlikely, though, considering the size of my dataset.
When I ran my code treating all categorical variables as numeric variables, I did not receive an error. This may mean that this is an internal MATLAB error. Could it be possible that my features are linearly dependent when represented as strings, categorical, or nominal values? (These are the ways I have attempted to classify my categorical variables to test whether this was indeed the problem; the MATLAB wiki indicates that these variable types are all treated equally as categorical variables.)
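The scenario described above, two predictors that look linearly independent as numeric vectors yet break the design once treated as categorical, can arise when x and y are aliased (every level of one co-occurs with exactly one level of the other). A minimal sketch with hypothetical data, not the actual table:

```python
# Hypothetical aliased predictors: x and y induce the same grouping of rows,
# even though neither numeric vector is a scalar multiple of the other.
x = [1, 1, 2, 2, 3, 3]
y = [2, 2, 3, 3, 1, 1]

def dummies(v):
    """Full one-hot encoding: one 0/1 column per distinct level."""
    levels = sorted(set(v))
    return [[1 if vi == lvl else 0 for vi in v] for lvl in levels]

dx, dy = dummies(x), dummies(y)

# The two dummy encodings contain exactly the same columns, so a design
# holding both (plus an intercept) cannot have full column rank.
print(sorted(dx) == sorted(dy))  # True
```

Since fitrm builds its between-subjects design from dummy-coded categoricals, a numeric rank check on the raw table will not catch this kind of aliasing; cross-tabulating the two features (e.g., with crosstab(x, y)) is one way to spot it.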
Related
I am trying to build a multiple linear regression in MATLAB with 20 predictors, which are categorical with 4 levels each. I am using the function "regress", like this (these are not the actual variables):
X = [ones(size(x1)) x1 x2 x3...x20];
[b,bint,r,rint] = regress(Y, X);
Before this, I transformed the vectors x1,x2...x20 in categorical variables with dummyvar.
I get a lot of 0's in the b coefficients, along with this warning:
Warning: X is rank deficient to within machine precision.
In the dummyvar documentation it is mentioned:
To use the dummy variables in a regression model, you must either delete a column (to create a reference group) or fit a regression model with no intercept term.
I tried leaving out the intercept ones(size(x1)) and I get the same warning.
I would appreciate any input on how to solve this.
Try to simplify the problem down to the minimum working example, and then post that here, so we can reproduce it and help you through. See https://en.wikipedia.org/wiki/Rank_(linear_algebra)
for examples of rank deficiency.
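To make the rank deficiency concrete: with full dummy coding, the columns for each categorical predictor sum to the all-ones (intercept) column. A small sketch with a hypothetical 4-level predictor:

```python
# Sketch of why [ones, dummyvar(x1), dummyvar(x2), ...] is rank deficient:
# the dummy columns of each variable sum to the all-ones column.
x1 = [1, 3, 2, 2, 4, 1]  # hypothetical 4-level predictor

def dummies(v):
    """Full one-hot encoding: one 0/1 column per distinct level."""
    levels = sorted(set(v))
    return [[1 if vi == lvl else 0 for vi in v] for lvl in levels]

cols = dummies(x1)
row_sums = [sum(c[i] for c in cols) for i in range(len(x1))]
print(row_sums)  # [1, 1, 1, 1, 1, 1] -> equals the intercept column
```

This is why the dummyvar documentation says to delete one column per variable. With 20 dummy-coded predictors, each group of columns sums to the same ones vector, so the design stays rank deficient even after removing the intercept; you need a reference level for each predictor (or use a fitting function such as fitlm with categorical variables, which handles the coding automatically).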
Say I have a problem with D outputs and isotopic data. I would like to use independent noise for each output dimension of a multi-output GP model (Intrinsic Coregionalisation Model) in gpflow, which is the most general case.
I have seen some example of using multi-output GPs in GPflow, like this notebook and this question
However, it seems for the GPR model class in gpflow, the likelihood variance ($\Sigma$) is still one number instead of D numbers even if a product kernel (i.e. Kernel * Coregionalization) is specified.
Is there any way to achieve that?
Just like you can augment X with a column that designates for each data point (row) which output it relates to (the column is specified by the active_dims keyword argument to the Coregion kernel; note that it is zero-based indexing), you can augment Y with a column to specify different likelihoods (the SwitchedLikelihood is hard-coded to require the index to be in the last column of Y) - there is an example (Demo 2) in the varying noise notebook in the GPflow tutorials. You just have to combine the two, use a Coregion kernel and a SwitchedLikelihood, and augment both X and Y with the same column indicating outputs!
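The augmentation described above can be sketched with plain arrays (the data here is made up; only the index-column layout matters):

```python
# Hypothetical 2-output dataset: inputs and observations for each output
X1, Y1 = [0.1, 0.4], [1.2, 0.7]   # output 0
X2, Y2 = [0.2, 0.9], [3.4, 2.8]   # output 1

# Augment X with an output-index column (read by Coregion via active_dims)
X_aug = [[x, 0] for x in X1] + [[x, 1] for x in X2]
# Augment Y with the same index column (read by SwitchedLikelihood,
# which requires it to be the last column of Y)
Y_aug = [[y, 0] for y in Y1] + [[y, 1] for y in Y2]
```

X_aug then feeds a Coregion kernel whose active_dims points at column 1, and the last column of Y_aug selects which of the per-output Gaussian likelihoods inside the SwitchedLikelihood applies to each row.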
However, as plain GPR only works with a Gaussian likelihood, the GPR model has been hard-coded for a Gaussian likelihood. It would certainly be possible to write a version of it that can deal with different Gaussian likelihoods for the different outputs, but you would have to do it all manually in the _build_likelihood method of a new model (incorporating the stitching code from the SwitchedLikelihood).
It would be much easier to simply use a VGP model that can handle any likelihood - for Gaussian likelihoods the optimisation problem is very simple and should be easy to optimise using ScipyOptimizer.
Short description
I am trying to run a (GLM) regression in Matlab (using the fitglm function) where one of the regressors is a categorical variable. However instead of adding an intercept and dropping the first level, I would like to include each level of the categorical variable and exclude the constant term.
Motivation
I know that, theoretically, the results are the same either way, but I have two reasons against estimating the model with a constant and interpreting the dummy-level coefficients differently:
The smaller problem is that I am running many regressions as part of a larger estimation procedure using different subsets of a large dataset, and the available levels of my categorical variable might not be the same across the regressions. In the end I would like to compare the estimated coefficients for the levels. This could be solved with some additional code/hacking, but it would not be an elegant solution.
The bigger problem is that there are orders of magnitude of difference between the coefficients of the levels: some of them are extremely small. If such a level gets used as the base level, I am afraid that it might cause numerical or optimization problems.
Tried approaches
I tried subclassing the GeneralizedLinearModel class, but unfortunately it is marked as final. Class composition also does not work, as I cannot even inherit from the parent of the GeneralizedLinearModel class. Modifying MATLAB's files is not an option, as I use a shared MATLAB installation.
The only idea I could come up with is using dummyvar or something similar to turn my categorical variable into a set of dummies, and then using these dummy variables in the regression. AFAIK this is how Matlab works internally, but by taking this approach I lose the user-friendliness of dealing with categorical variables.
P.S. This question was also posted on MatlabCentral at this link.
As there seems to be no built-in way to do this, I am posting a short function that I wrote to get the job done.
I have a helper function to convert the categorical variable into an array of dummies:
function dummyTable = convert_to_dummy_table(catVar)
    % One 0/1 column per level of the categorical variable
    dummyTable = array2table(dummyvar(catVar));
    % Name the columns <variableName>_<level>, matching the names MATLAB
    % assigns when it dummy-codes a categorical predictor itself
    varName = inputname(1);
    levels = categories(catVar)';
    dummyTable.Properties.VariableNames = strcat(varName, '_', levels);
end
The usage is quite simple. If you have a table T with some continuous explanatory variables X1, X2, X3, a categorical explanatory variable C and a response variable Y, then instead of using
M = fitglm(T, 'Distribution', 'binomial', 'Link', 'logit', 'ResponseVar', 'Y')
which would fit a logit model using k - 1 levels for the categorical variable and an intercept, one would do
estTable = [T(:, {'X1', 'X2', 'X3', 'Y'}), convert_to_dummy_table(T.C)]
M = fitglm(estTable, 'Distribution', 'binomial', 'Link', 'logit', ...
'ResponseVar', 'Y', 'Intercept', false)
It is not as nice and readable as the default way of handling categorical variables, but it has the advantage that the names of the dummy variables are identical to the names that Matlab automatically assigns during estimation using a categorical variable. Therefore the Coefficients table of the resulting M object is easy to parse or understand.
I have created an auto-encoder neural network in MATLAB. I have quite large inputs at the first layer, which I have to reconstruct through the network's output layer. I cannot use the large inputs as they are, so I convert them to the range [0, 1] using MATLAB's sigmf function. It gives me a value of 1.000000 for all the large values. I have tried changing the display format, but it does not help.
Is there a workaround to using large values with my auto encoder?
The process of converting your inputs to the range [0,1] is called normalization; however, as you noticed, the sigmf function is not adequate for this task. This link may be useful to you.
Suppose that your inputs are given by a matrix of N rows and M columns, where each row represents an input pattern and each column is a feature. If your first column is:
vec =
-0.1941
-2.1384
-0.8396
1.3546
-1.0722
Then you can convert it to the range [0,1] using:
%# get max and min
maxVec = max(vec);
minVec = min(vec);
%# normalize to 0...1
vecNormalized = ((vec-minVec)./(maxVec-minVec))
vecNormalized =
0.5566
0
0.3718
1.0000
0.3052
As @Dan indicates in the comments, another option is to standardize the data. The goal of this process is to scale the inputs to have mean 0 and variance 1. In this case, you need to subtract the mean value of the column and divide by the standard deviation:
meanVec = mean(vec);
stdVec = std(vec);
vecStandardized = (vec-meanVec)./ stdVec
vecStandardized =
0.2981
-1.2121
-0.2032
1.5011
-0.3839
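Both transforms are easy to sanity-check outside MATLAB; here is a quick Python equivalent (statistics.stdev is the sample standard deviation, matching MATLAB's std):

```python
from statistics import mean, stdev

vec = [-0.1941, -2.1384, -0.8396, 1.3546, -1.0722]

# min-max normalization to [0, 1]
lo, hi = min(vec), max(vec)
normalized = [(v - lo) / (hi - lo) for v in vec]

# standardization: zero mean, unit variance (sample std, like MATLAB's std)
m, s = mean(vec), stdev(vec)
standardized = [(v - m) / s for v in vec]

print([round(v, 4) for v in normalized])    # [0.5566, 0.0, 0.3718, 1.0, 0.3052]
print([round(v, 4) for v in standardized])  # [0.2981, -1.2121, -0.2032, 1.5011, -0.3839]
```

The printed values match the MATLAB outputs shown above.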
Before I give you my answer, let's think a bit about the rationale behind an auto-encoder (AE):
The purpose of an auto-encoder is to learn, in an unsupervised manner, something about the underlying structure of the input data. How does an AE achieve this goal? If it manages to reconstruct the input signal from its output signal (which is usually of lower dimension), it means that it did not lose information and effectively managed to learn a more compact representation.
In most examples, it is assumed, for simplicity, that both the input signal and the output signal range in [0..1]. Therefore, the same non-linearity (sigmf) is applied both for obtaining the output signal and for reconstructing the inputs from the outputs.
Something like
output = sigmf( W*input + b ); % compute output signal
reconstruct = sigmf( W'*output + b_prime ); % notice the different constant b_prime
Then the AE learning stage tries to minimize the reconstruction error || input - reconstruct ||.
However, who said the reconstruction non-linearity must be identical to the one used for computing the output?
In your case, the assumption that the inputs range in [0..1] does not hold. Therefore, it seems that you need to use a different non-linearity for the reconstruction, one that agrees with the actual range of your inputs.
If, for example, your inputs range in (0..inf), you may consider using exp or ().^2 as the reconstruction non-linearity. You may use polynomials of various degrees, log, or whatever function you think may fit the spread of your input data.
Disclaimer: I never actually encountered such a case and have not seen this type of solution in literature. However, I believe it makes sense and at least worth trying.
When trying to fit Naive Bayes:
training_data = sample;
target_class = K8;
% train model
nb = NaiveBayes.fit(training_data, target_class);
% prediction
y = nb.predict(cluster3);
I get an error:
??? Error using ==> NaiveBayes.fit>gaussianFit at 535
The within-class variance in each feature of TRAINING
must be positive. The within-class variance in feature
2 5 6 in class normal. are not positive.
Error in ==> NaiveBayes.fit at 498
obj = gaussianFit(obj, training, gindex);
Can anyone shed light on this and how to solve it? Note that I have read a similar post here, but I am not sure what to do. It seems as if it's trying to fit based on columns rather than rows; the class variance should be based on the probability of each row belonging to a specific class. If I delete those columns then it works, but obviously this isn't what I want to do.
Assuming that there is no bug anywhere in your code (or in the NaiveBayes code from MathWorks), and again assuming that your training_data is in the form NxD, where there are N observations and D features, then features 2, 5, and 6 are constant (zero variance) within at least one class. This can happen if you have relatively small training data and a high number of classes, in which case a single class may be represented by only a few observations. Since NaiveBayes by default treats every feature as normally distributed, it cannot work with a feature that has zero variance within a single class: there is no way for NaiveBayes to find the parameters of a normal distribution fitted to that feature for that specific class (note: the default for 'Distribution' is 'normal').
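You can reproduce the check that NaiveBayes.fit performs yourself: compute the variance of each feature within each class and look for zeros. A sketch with made-up data (the labels and values below are hypothetical):

```python
from statistics import pvariance

# Hypothetical training set: each entry is (feature vector, class label)
data = [([1.0, 0.0, 2.1], 'normal'),
        ([1.2, 0.0, 2.3], 'normal'),
        ([0.9, 0.0, 1.8], 'normal'),
        ([2.0, 1.0, 0.5], 'attack'),
        ([2.2, 1.5, 0.7], 'attack')]

bad = []  # (class, feature index) pairs with zero within-class variance
for cls in sorted({c for _, c in data}):
    rows = [x for x, c in data if c == cls]
    for j in range(len(rows[0])):
        if pvariance([r[j] for r in rows]) == 0:
            bad.append((cls, j))

print(bad)  # feature 1 is constant (all zeros) within class 'normal'
```

Any (class, feature) pair reported here would trigger the error above; the usual remedies are removing the offending features, adding a tiny jitter to them, or choosing a different 'Distribution'.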
Take a look at the nature of your features. If they do not seem to follow a normal distribution within each class, then 'normal' is not the option you want to use. Maybe your data is closer to a multinomial model 'mn':
nb = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn');