GMM in MATLAB gives different results for the same file

I constructed a Gaussian Mixture Model in Matlab with a dataset:
model = gmdistribution.fit(data,M,'Replicates',5);
with M = 3 Gaussian components. I tested new data with:
[P, l] = posterior(model,new_data);
I ran the program several times and didn't get the same result. Each run produces different log-likelihood values. I use the log-likelihood to make decisions, and this value for the same data (new_data) differs for each run. What does it depend on? How can I resolve this problem?

First, assuming that you're using a newish version of Matlab, the gmdistribution.fit documentation indicates that the fit method is deprecated and that fitgmdist should be used instead; see the fitgmdist documentation for an example.
Second, the documentation for gmdistribution.fit indicates that if the 'Replicates' option is larger than 1, the 'randSample' start method will be used to produce the initial parameters. This may be the cause (or at least one of the causes) of your observed variability.
Finally, you can also try calling rng before gmdistribution.fit to set the seed of the global random number stream (assuming the function doesn't use its own stream internally). Alternatively, you can specify an 'Options' parameter built with statset:
seed = 1;
s = RandStream('mt19937ar','Seed',seed);
opts = statset('Streams',s);
model = gmdistribution.fit(data,M,'Replicates',5,'Options',opts);
I can't test this fully myself – see the gmdistribution class documentation for further details.
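For example, here is a minimal sketch (untested, and assuming a release where fitgmdist is available) that seeds the global stream before fitting, so that repeated runs reproduce the same fit and therefore the same log-likelihoods:
% Seed the global random number stream so the 'Replicates' restarts are repeatable.
rng(1);
model = fitgmdist(data, M, 'Replicates', 5);
% Posterior probabilities and the negative log-likelihood for new observations
% should now be identical from run to run.
[P, nlogL] = posterior(model, new_data);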

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following closely the Spark documentation. I do get a working model (which I call glm_fare) but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is perfect multicollinearity among your variables, i.e. one of the variables can be written as a linear combination of the others. Consequently, the individual effect of each variable cannot be identified.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
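As a quick, hypothetical check (not part of the original answer), you can test whether the feature matrix is rank-deficient once the features have been collected to the driver as a NumPy array X, which is feasible for a small dataset like Titanic:
import numpy as np

# If the rank is lower than the number of columns, some feature is an exact
# linear combination of the others and the Hessian will be singular.
rank = np.linalg.matrix_rank(X)
if rank < X.shape[1]:
    print("Feature matrix is rank-deficient; drop one of the collinear columns.")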
The GeneralizedLinearRegressionModel docs note that a training summary may not be available for every model.
However, you can do an initial check to avoid the error with hasSummary, a public boolean member of the model (exposed as a property in PySpark).
Using it as
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the Pyspark source code, and to the GeneralizedLinearRegressionTrainingSummary class source code where the error is thrown.
Make sure the input values you feed to the one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I passed quarter values (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. After converting quarter to 0, 1, 2, 3, the problem was solved.
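A minimal sketch of that recoding step (untested; the column names quarter, quarter_idx, and quarter_vec are hypothetical, and the estimator API shown assumes Spark 3.x):
from pyspark.sql import functions as F
from pyspark.ml.feature import OneHotEncoder

# Shift quarter from 1..4 down to 0..3 so no encoder column is permanently zero.
df = training_df.withColumn("quarter_idx", (F.col("quarter") - 1).cast("double"))
encoder = OneHotEncoder(inputCols=["quarter_idx"], outputCols=["quarter_vec"])
df = encoder.fit(df).transform(df)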

How can I optimize machine learning hyperparameters to be reused in multiple models?

I have a number of datasets, to each of which I want to fit a Gaussian process regression model. The default hyperparameters selected by fitrgp seem subjectively to produce less-than-ideal models. Enabling hyperparameter optimisation tends to result in a meaningful improvement but occasionally produces extreme overfitted values and is a computationally hungry process which prohibits an optimization for every model anyway.
Since fitrgp simply wraps bayesopt for its hyperparameter optimization, is it possible to call bayesopt directly to minimize some aggregate of the loss for multiple models (say, the mean) rather than the loss for one model at a time?
For example, if each dataset is contained in a cell array of tables tbls, I want to find a single value for sigma which can be imposed in calls to fitrgp for each table:
gprMdls = cellfun(@(tbl) {fitrgp(tbl,'ResponseVarName', 'Sigma',sigma)}, tbls);
Where numel(tbls) == 1 the process would be equivalent to:
gprMdl = fitrgp(tbls{1},'ResponseVarName', 'OptimizeHyperparameters','auto');
sigma = gprMdl.Sigma;
but this implementation doesn't naturally extend to a result where a single Sigma value is optimized for multiple models.
I managed this in the end by directly intervening in the built-in optimization routines.
By placing a breakpoint at the start of bayesopt (via edit bayesopt) and calling fitrgp with a single input dataset, I was able to determine from the Function Call Stack that the objective function used by bayesopt is constructed with a call to classreg.learning.paramoptim.createObjFcn. I also captured and stored the remaining input arguments to bayesopt to ensure my function call would be exactly analogous to one constructed by fitrgp.
Placing a breakpoint at the start of classreg.learning.paramoptim.createObjFcn and making a fresh call to fitrgp I was able to capture and store the input arguments to this function, so I could then create objective functions for different tables of predictors.
For my cell array of tables tbls, and all other variables kept as named in the captured createObjFcn scope:
objFcns = cell(size(tbls));
for ii = 1:numel(tbls)
    objFcns{ii} = classreg.learning.paramoptim.createObjFcn( ...
        BOInfo, FitFunctionArgs, tbls{ii}, Response, ...
        ValidationMethod, ValidationVal, Repartition, Verbose);
end
An overall objective function can then be constructed by taking the mean of the objective functions for each dataset:
objFcn = @(varargin) mean(cellfun(@(f) f(varargin{:}), objFcns));
I was then able to call bayesopt with this objFcn along with the remaining arguments captured from the original call. This produced a set of hyperparameters as required and they seem to perform well for all datasets.
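For reference, the shape of that final call looks roughly like the following sketch, where optimVars and capturedArgs are placeholders for the variable descriptions and name-value arguments captured from the original call (not their real names):
% optimVars and capturedArgs stand in for the arguments captured earlier.
results = bayesopt(objFcn, optimVars, capturedArgs{:});
best = bestPoint(results);   % table containing the jointly optimized hyperparameters
% e.g. sigma = best.Sigma;   % column name depends on the captured variable descriptions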

What does 'maxfev' do in iPython Notebook?

I was using the curve_fit function to find two coefficients and could not get a result until I altered something called maxfev to a much larger value. The error said that 'maxfev=600 has been reached', so I took a guess, added maxfev=10000 to my curve_fit call, and this seemed to work.
My question is: what is maxfev? What does it do, how does it work, and how has this affected my data?
The function curve_fit is a wrapper around leastsq (both from the scipy.optimize library). The parameter you are adjusting, maxfev, is the maximum number of function evaluations allowed, i.e. how many times the parameters of the model you are trying to fit may be altered and re-evaluated while the program searches for a local minimum of the squared residuals (see the example below).
data = [(1,0),(2,1),(3,2),(4,3)...]
model = a*x+b
Let us assume that you initialize a and b to 0. The program evaluates the fit once, gets an array of squared residuals back, then alters a or b and runs it again. This repeats until optimal values for a and b are found (yielding the lowest sum of squares, which here should be a=1 and b=-1).
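A runnable version of that toy example (a minimal sketch; the maxfev value is arbitrary):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

xdata = np.array([1, 2, 3, 4])
ydata = np.array([0, 1, 2, 3])

# p0 gives the starting guesses for a and b; maxfev caps how many function
# evaluations the underlying least-squares routine may spend.
popt, pcov = curve_fit(model, xdata, ydata, p0=[0, 0], maxfev=10000)
print(popt)  # approximately [ 1. -1.]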
The fact that your program cannot find the optimal values after 600 evaluations is usually a sign that the model is a poor fit for the data, or that the starting guesses are far from the solution.
PS: Your problem has nothing to do with the IPython Notebook.

Performing additional validation in LIBSVM matlab

I have been working with LIBSVM in MATLAB for a while to do prediction. I have a dataset of which I use 75% for training, 15% for finding the best parameters, and the remainder for testing. The code is given below.
trainX and trainY are the input and output training instances
testValX and testValY are the validation dataset I use
for j = 1:100
    for jj = 1:10
        model(j,jj) = svmtrain(trainY,trainX,...
            ['-s 3 -t 2 -c ' num2str(j) ' -p 0.001 -g ' num2str(jj) '-v 5']);
        [predicted_label, ~, ~] = svmpredict(testValY,...
            testValX,model(j,jj));
        MSE(j,jj) = sum(((predicted_label-testValY).^2)/2);
    end
end
[min_val,min_indi] = min(MSE(:));
best_predicted_model_rbf(i) = model(min_indi);
My question here is whether this is correct. I am creating a model matrix with different values of c and g. I use the -v option, which is key here. From the trained models I predict on the validation dataset and thereby compute the mean squared error. Using this MSE I pick the best c and g. Since I am using -v, which returns the cross-validated output, is the procedure I follow correct?
First, I think there is a slight problem with the code shown, which is that num2str(jj) '-v 5']); doesn't have a space before the -v. That may cause that flag to not be read. In the other question, you stated that this 'sometimes returns a model', which is what would happen if that flag was not read. If the flag is read, you should only get a number, not a model, when the '-v' flag is used.
Second, it looks like you are doing two different things here, either one of which would be reasonable on its own. Calling svmtrain with '-v' runs cross validation on the training set. That shouldn't return a model, it should just return an mse estimate. You could use these estimates to determine which parameter setting was best, and then train one model with that setting on all of the training data.
Anyway, next you call svmpredict(y,x,model) on a hold-out validation set, testValX, but having called svmtrain with '-v', model should just be a scalar at this point. In order for this call to run correctly, you have to get the model from svmtrain without '-v', so that it is a struct. The rest of what you are doing makes sense for this case, in which you are doing hold-out validation using testValX.
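A hedged sketch of that hold-out variant, with '-v' dropped so svmtrain returns a model struct (untested; note the space before every flag):
for j = 1:100
    for jj = 1:10
        opts = sprintf('-s 3 -t 2 -c %d -p 0.001 -g %d', j, jj);
        % Without '-v', svmtrain returns a model struct usable by svmpredict.
        model(j,jj) = svmtrain(trainY, trainX, opts);
        predicted_label = svmpredict(testValY, testValX, model(j,jj));
        MSE(j,jj) = mean((predicted_label - testValY).^2);
    end
end
[~, min_indi] = min(MSE(:));
best_model = model(min_indi);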

Getting the solver type and step size (for fixed step solvers)

We are trying to integrate a simulation model into Simulink as a block. We have a custom continuous block which loads an M-file containing the functions Derivatives, Outputs, etc.
My question is: is there a way to find out which solver is currently used, and with which parameters? Our model won't be able to support variable-step solvers and I would like to give a warning. Similarly, the model requires the fixed step size for initialization.
Thanks in advance.
You can get the current solver name using
get_param('modelName', 'SolverName');
Some of the other common solver parameters are
AbsTol
FixedStep
InitialStep
ZcThreshold
ExtrapolationOrder
MaxStep
MinStep
RelTol
SolverMode
You can find other parameters you may wish to query by opening the .mdl file in your favorite text editor and digging through it.
If I'm understanding your use case correctly, you are trying to determine the type of solver (and other solver params) for the top-level simulink system containing your block.
I think the following should give you what you want:
get_param(bdroot, 'SolverType'); % Returns 'Variable-step' or 'Fixed-step'
get_param(bdroot, 'FixedStep');  % Returns the fixed step size
Notice that for purposes of generality/reusability, this uses bdroot to identify the top-level system (rather than explicitly specifying the name of this system).
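Putting the two queries together, here is a minimal sketch (untested) of the initialization check the question asks for:
% Warn when the top-level model uses a variable-step solver; otherwise read
% the fixed step size (note that 'FixedStep' is returned as a string, e.g. 'auto').
solverType = get_param(bdroot, 'SolverType');
if strcmp(solverType, 'Variable-step')
    warning('This block only supports fixed-step solvers.');
else
    fixedStep = str2double(get_param(bdroot, 'FixedStep'));
end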
If you want to find out more about other model parameters that you can get and set, I would check out the documentation on model parameters.
Additionally, I'm interested to know why it is that your model doesn't support a variable-step solver?