Matlab Weka Interface AdaBoost Issues: Out of Bounds Exception - matlab

I'm doing some cross-validation using a Matlab Weka Interface that I got from file exchange. My loop structure seems to work fine for Weka's Logistic classifier. However, when I try to do the exact same thing for AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24
classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem for me, since in most situations the training set is greater than the test set in size.
Is there something different about how I should format my inputs when using Adaboost rather than Logistic? Any information you can give regarding this problem would be so helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);

I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect it to work only when M < N - which is what you describe - and even then, it wouldn't give a meaningful result.

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following closely the Spark documentation. I do get a working model (which I call glm_fare) but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
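To make the invertibility point concrete, here is a minimal sketch in plain Python (no Spark involved). It builds the Gram matrix X^T X for a toy design matrix in which one column is exactly twice another, and shows that its determinant is zero, so it cannot be inverted. The data and the helper names gram and det are made up for illustration.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows (hypothetical helper)."""
    cols = list(zip(*X))
    return [
        [sum(Fraction(a) * Fraction(b) for a, b in zip(ci, cj)) for cj in cols]
        for ci in cols
    ]


def det(M):
    """Determinant via Gaussian elimination, using exact fractions."""
    M = [row[:] for row in M]
    n = len(M)
    d = Fraction(1)
    for i in range(n):
        # Find a row at or below i with a non-zero entry in column i.
        pivot = next((r for r in range(i, n) if M[r][i] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: the matrix is singular
        if pivot != i:
            M[i], M[pivot] = M[pivot], M[i]
            d = -d
        d *= M[i][i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [a - f * b for a, b in zip(M[r], M[i])]
    return d


# Toy columns: intercept, x1, x2. In the first matrix x2 = 2 * x1 (collinear).
X_collinear = [[1, 1, 2], [1, 2, 4], [1, 3, 6], [1, 4, 8]]
X_independent = [[1, 1, 2], [1, 2, 3], [1, 3, 7], [1, 4, 5]]

det_collinear = det(gram(X_collinear))
det_independent = det(gram(X_independent))
print(det_collinear == 0, det_independent > 0)
```

Dropping either x1 or x2 from the collinear matrix leaves a Gram matrix with a non-zero determinant, which is the fix suggested above.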
The GeneralizedLinearRegressionModel docs note that a summary may not be available for a model.
However, you can do an initial check to avoid the error:
glm_fit.hasSummary, which is a public boolean property in PySpark, so it is accessed without parentheses.
Use it as
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the Pyspark source code
and the GeneralizedLinearRegressionTrainingSummary class source code and where the error is thrown
Make sure the input values for the one-hot encoder start from 0.
One mistake I made that caused the summary not to be created: I passed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
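The effect described above can be sketched in plain Python. The helper one_hot_drop_last is hypothetical and only mimics the convention assumed here: category indices start at 0 and the last category is dropped (encoded as all zeros). It is not Spark's implementation.

```python
def one_hot_drop_last(index, num_categories):
    """One-hot encode a 0-based index, dropping the last category (assumed convention)."""
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec


quarters = [1, 2, 3, 4]

# Raw values 1..4 are treated as indices, so index 0 is a category that never occurs.
raw = [one_hot_drop_last(q, max(quarters) + 1) for q in quarters]
# Shifting to 0..3 removes the unused slot.
shifted = [one_hot_drop_last(q - 1, max(quarters)) for q in quarters]

raw_col_sums = [sum(col) for col in zip(*raw)]
shifted_col_sums = [sum(col) for col in zip(*shifted)]
print(raw_col_sums, shifted_col_sums)
```

In the raw encoding the first column is zero for every row, which adds a constant (and therefore collinear) feature to the regression.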

Calling function and getting - not enough input arguments, even though syntax is correct

I'm teaching myself classification. I read and understood the MATLAB online help for the simple LDA classifier, which uses the Fisher iris dataset.
I have now moved on to SVM. But even though I use the exact syntax from the help page, I get an error of either not enough or too many input arguments.
I have trained my SVM classifier using svmtrain via the command:
SVMStruct = svmtrain(training,labels);
Where training is a 207 by 900 training matrix. There are 207 samples and 900 HoG descriptors or features. Similarly labels is a 207 by 1 column vector consisting of either +1 or -1 for their respective samples.
I then wanted to test it and see if this works by calling:
Group = svmclassify(SVMStruct,sample,'Showplot',true)
Where sample is a 2 by 900 matrix containing 2 test samples. I was expecting to get +1 and -1 as these are what the test samples should be labelled. But I get the error:
Too many input arguments.
And when I use the command
Group = svmclassify(SVMStruct,sample)
I get the error
Not enough input arguments.
You might have shadowed the svmclassify function.
Try
>> which svmclassify
to verify that you are actually calling the right function.
If you have shadowed the function (that is, created a different function with the same name svmclassify) and it is located higher in your path, then you'll need to rename your own function and run svmclassify again.

How to build an ARMAX model in Matlab

I'm trying to build an ARMAX model which predicts reservoir water elevation as a function of previous elevations and an upstream inflow. My data is on a timestep of roughly 0.041 days, but it does vary slightly, and I have 3643 time series points. I've tried using the basic armax Matlab command, but am getting this error:
Error using armax (line 90)
Operands to the || and && operators must be convertible to
logical scalar values.
The code I'm trying is:
data = iddata(y,x,[],'SamplingInstants',JDAYs)
m1 = armax(data, [30 30 30 1])
where y is a vector of elevations that starts like y=[135.780 135.800 135.810 135.820 135.820 135.830]', x is a vector of flowrates that starts like x=[238.865 238.411 238.033 237.223 237.223 233.828]', and JDAYs is a vector of timestamps that starts like JDAYs=[122.604 122.651 122.688 122.729 122.771 122.813]'.
I'm new to this model type and the system identification toolbox, so I'm having issues figuring out what's causing that error. The Matlab examples aren't very helpful...
I hope this is not reaching you too late.
Checking your code, I see that you are using a parameter called SamplingInstants. I'm fairly sure armax does not work with it: I have tried several times, and it doesn't. It also does not seem to be a well documented option for armax or for the other estimation methods.
The ARX, ARMAX, and other models are based on linear discrete-time systems from the Z-transform formalism; that is, one usually assumes that the system has been sampled at a regular rate. This is not a law, of course, but it is the standard framework when dealing with linear (and also nonlinear) systems, and most industrial control and acquisition systems still work with a regular sampling rate.
Try to stay within the standard ARMAX setting, like this:
y=[135.780 135.800 135.810 135.820 135.820 135.830 .....]';
x=[238.865 238.411 238.033 237.223 237.223 233.828 .....]';
%JDAYs=[122.604 122.651 122.688 122.729 122.771 122.813 .....]';
JDAYs=122.601+(0:length(y)-1)'*4.18; % regularly spaced instants, as a column vector
data = iddata(y,x,[],'SamplingInstants',JDAYs);
m1 = armax(data, [30 30 30 1])
This should work. Just make sure that x and y are long enough to allow proper estimation of all the free coefficients: longer than mean(4*orders) for ARMAX to work at all (in this case, longer than 121 samples), and preferably longer than 10*mean(4*orders) for the ARMAX algorithm to solve the problem properly. The data should also vary enough over time to avoid ill-conditioned solutions.
Good Luck ;)...
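Since the answer above relies on regular sampling, one option is to check whether the timestamps are evenly spaced and, if not, to interpolate the series onto a uniform grid before estimation. The following is a minimal sketch in plain Python, not MATLAB. The helper names is_regular and resample_linear are hypothetical, linear interpolation is an assumption (other schemes may suit the data better), and the numbers are the first six samples from the question.

```python
def is_regular(t, rel_tol=1e-6):
    """True if consecutive timestamps differ by (almost) the same step."""
    steps = [b - a for a, b in zip(t, t[1:])]
    mean_step = sum(steps) / len(steps)
    return all(abs(s - mean_step) <= rel_tol * mean_step for s in steps)


def resample_linear(t, y, n):
    """Linearly interpolate (t, y) onto n evenly spaced instants from t[0] to t[-1]."""
    step = (t[-1] - t[0]) / (n - 1)
    grid = [t[0] + k * step for k in range(n)]
    out = []
    j = 0
    for g in grid:
        # Advance to the segment [t[j], t[j+1]] that contains g.
        while j < len(t) - 2 and t[j + 1] < g:
            j += 1
        w = (g - t[j]) / (t[j + 1] - t[j])
        out.append(y[j] + w * (y[j + 1] - y[j]))
    return grid, out


# First samples from the question: timestamps (days) and elevations.
jdays = [122.604, 122.651, 122.688, 122.729, 122.771, 122.813]
elev = [135.780, 135.800, 135.810, 135.820, 135.820, 135.830]

print(is_regular(jdays))  # the raw timestamps are not evenly spaced
grid, elev_uniform = resample_linear(jdays, elev, len(jdays))
print(is_regular(grid))
```

The same grid would be used for the input series x, so that both signals share one sampling interval.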

LIBSVM in MATLAB/Octave - what's the output of libsvmread?

The second output of the libsvmread command is a set of features for each given training example.
For example, in the following MATLAB command:
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
This second variable (heart_scale_inst) holds content in a form that I don't understand, for example:
<1, 1> -> 0.70833
What is the meaning of it? How is it to be used (I can't plot it, the way it is)?
PS. If anyone could please recommend a good LIBSVM tutorial, I'd appreciate it. I haven't found anything useful and the README file isn't very clear... Thanks.
The definitive tutorial for LIBSVM beginners is called A Practical Guide to SVM Classification. It is available from the site of the authors of LIBSVM.
The second value returned is called the instance matrix. It is a matrix, call it M, where M(1,:) holds the features of data point 1, and so on. The matrix is sparse, which is why it prints out oddly. If you want to see it in full, print full(M).
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
with heart_scale_label and heart_scale_inst you should be able to train an SVM by issuing:
mod = svmtrain(heart_scale_label,heart_scale_inst,'-c 1 -t 0');
I strongly suggest you read the guide linked above to learn how to set the c parameter (and, for an RBF kernel, the gamma parameter), but the line above is how you would train with that data.
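To illustrate what the sparse instance matrix represents, here is a minimal sketch in plain Python rather than MATLAB. LIBSVM data files store one example per line as a label followed by index:value pairs with 1-based indices, and omitted indices are zeros. The helper names and the sample line are made up for illustration; in MATLAB, full(M) does the dense conversion for you.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into a label and a sparse feature dict."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)  # indices are 1-based in the file
    return label, features


def to_dense(features, num_features):
    """Expand the sparse dict into a full row, filling missing entries with 0."""
    return [features.get(i, 0.0) for i in range(1, num_features + 1)]


# Hypothetical line: feature 2 is absent, so it is an implicit zero.
label, sparse_row = parse_libsvm_line("+1 1:0.5 3:-1 4:0.25")
dense_row = to_dense(sparse_row, 4)
print(label, sparse_row, dense_row)
```

An entry printed as <1, 1> -> 0.70833 therefore reads as: row 1 (first example), column 1 (first feature), value 0.70833.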
I think it is the probability with which test case has been predicted to heart_scale label category

Kevin Murphy's HMM MATLAB toolbox assertion error

I am working on a project that needs to use hidden Markov models. I downloaded Kevin Murphy's toolbox. I have some problems with its usage. On the toolbox webpage, he says that the first input of dhmm_em and dhmm_logprob is symbol sequence data. In the examples, row vectors are given as data. But when I give my symbol sequence as a row vector, I get this error:
??? Error using ==> assert at 9
assertion violated:
Error in ==> fwdback at 105
assert(approxeq(sum(alpha(:,t)),1))
Error in ==> dhmm_logprob at 17
[alpha, beta, gamma, ll] = fwdback(prior, transmat, obslik, 'fwd_only', 1);
Error in ==> mainCourseProject at 110
loglik(train_act) = dhmm_logprob(orderedSymbols, hmm{train_act}.prior, hmm{train_act}.trans, hmm{act}.emiss);
However, before giving this error, the code works for some symbol vectors. When I give my data as a column vector, the functions work fine, with no errors. So why exactly am I getting this error?
You might say that I should be giving not single vectors but sets of vectors. I also tried collecting my feature vectors in a struct and giving row vectors that way, but nothing changed; I still get the assertion error.
By the way, my symbol sequence does not contain any zeros, and I am doing almost everything the same way as shown in the examples, so I would be grateful if anyone could help me.
I'm not sure, but from the function call stack shown above, shouldn't the last argument be hmm{train_act}.emiss instead of hmm{act}.emiss?
In other words, when you compute the log-probability of a sequence, you should pass components that belong to the same HMM (transition matrix, emission matrix, and prior probabilities).
By the way, the ASSERT in the code is a sanity check that a vector of probabilities should sum to 1. Oftentimes, when working with very small values (log-probabilities), numerical stability issues can creep in... You could edit the APPROXEQ function to relax the comparison a bit, by giving it a bigger margin of error
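The sanity check described above can be sketched in plain Python. This is an illustration of one normalized forward step and an approximate-equality test with an adjustable tolerance, not the toolbox's code. The names forward_step and approx_eq and all the numbers are hypothetical.

```python
def forward_step(alpha_prev, transmat, obs_lik):
    """One forward-algorithm step; returns the normalized alpha and the normalizer."""
    n = len(alpha_prev)
    unnorm = [
        obs_lik[j] * sum(alpha_prev[i] * transmat[i][j] for i in range(n))
        for j in range(n)
    ]
    total = sum(unnorm)
    if total == 0:
        # The observation is impossible under every state: nothing to normalize.
        return unnorm, total
    return [u / total for u in unnorm], total


def approx_eq(a, b, tol=1e-9):
    """Approximate equality with an explicit margin of error."""
    return abs(a - b) <= tol


alpha_prev = [0.6, 0.4]
transmat = [[0.7, 0.3], [0.2, 0.8]]

# A symbol that at least one state can emit: the normalized alpha sums to 1.
alpha_ok, _ = forward_step(alpha_prev, transmat, [0.5, 0.1])
print(approx_eq(sum(alpha_ok), 1.0))

# A symbol with zero emission probability in every state: the check fails.
alpha_bad, _ = forward_step(alpha_prev, transmat, [0.0, 0.0])
print(approx_eq(sum(alpha_bad), 1.0))
```

The second case shows that relaxing the tolerance only helps with rounding error. If the sum is far from 1, the inputs themselves (for example an emission matrix that does not match the sequence's symbols) are the more likely cause.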
This error message and the code it refers to are human-readable. An assertion is a guard put in by the programmer to ensure that certain conditions are met. In this case, what is the condition? approxeq(sum(alpha(:,t)),1). I'd venture to say that approxeq wants the values to be approximately equal, so this boils down to: sum(alpha(:,t)) must be approximately equal to 1.
Without knowing anything about the code, I'd also guess that these refer to probabilities. The probabilities of a node's edges must sum to one. Hopefully this starts you down a productive debugging path. If you can't figure out what's wrong with your input that produces this condition, start wading into the code a bit to see where this alpha vector comes from, and how it ended up invalid.