Partitioning a dataset into train, test and validation subsets (Matlab)

I'm new to Matlab and I'm trying to achieve the following:
I have one dataset of 7000+ entries. The goal is to train a classification tree (fitctree) on this data. I've separated the data into a matrix of observations (predictors) and a matrix of classes (class). To partition the data I'm using cvpartition. Everything works fine up to this point.
Problem: I want to create three subsets of the data: 1 training set, 1 validation set and 1 testing set. I want to train the tree using the training set, and validate its performance using the validation set. After tweaking the parameters I want to run the final test on the test data partition.
To partition the data I tried creating a cvpartition, which works, e.g.
cvpart = cvpartition(class, 'k', 10);
and then performing another cvpartition on that testing set, separating it into another two sets:
cvpart2 = cvpartition(cvpart.TestSize, 'k', 10);
Sadly, when validating the performance of the tree this doesn't seem to work. When I skip the second cvpartition and validate the performance on the test set of cvpart, the model performs perfectly.
Update: after days I found that it seems to work when using it in this way:
cvpart2 = cvpartition(cvpart.TrainSize, 'k', 10);
Would anyone care to explain to me why it works this way, but not when using the test set?
Hope you guys can help me out;)
Kind regards.
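For readers who only need the three-way split itself, here is a minimal sketch of the idea. It is written in Python rather than Matlab and does not mimic cvpartition: the function name three_way_split, the 70/15/15 proportions and the fixed seed are all assumptions made for illustration, and the split is not stratified by class.

```python
import random

def three_way_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Return disjoint train/validation/test index lists covering range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, then slice
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

# 7000 matches the dataset size mentioned in the question.
train_idx, val_idx, test_idx = three_way_split(7000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The point of the sketch is that all three index sets refer to rows of the original dataset, so the tree is fitted on the training rows, tuned against the validation rows, and scored once on the test rows.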

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following closely the Spark documentation. I do get a working model (which I call glm_fare) but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
from pyspark.ml.regression import GeneralizedLinearRegression

glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
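To make the invertibility argument concrete, here is a small sketch in plain Python. It uses the Gram matrix X^T X as a stand-in for the matrix being inverted (an assumption: in a GLM the actual matrix also carries observation weights), and the two toy datasets are made up for illustration.

```python
def gram_matrix(rows):
    """Compute X^T X for a list of equal-length feature rows."""
    n_features = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n_features)]
            for i in range(n_features)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Third column equals the sum of the first two (perfect multicollinearity).
collinear = [[1, 2, 3], [2, 1, 3], [4, 0, 4], [0, 5, 5]]
# Third column is not a linear combination of the first two.
independent = [[1, 2, 0], [2, 1, 1], [4, 0, 2], [0, 5, 7]]

det_collinear = det3(gram_matrix(collinear))
det_independent = det3(gram_matrix(independent))
print(det_collinear, det_independent)
```

A zero determinant means the matrix has a zero eigenvalue and cannot be inverted, which is the situation described above.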
The GeneralizedLinearRegressionModel docs do mention that a summary may not be available for a model.
However, you can do an initial check to avoid the error:
glm_fit.hasSummary, which is a public boolean property in PySpark (so it is used without parentheses).
Use it as
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the PySpark source code
and the GeneralizedLinearRegressionTrainingSummary class source code, where the error is thrown.
Make sure the input values for your one-hot encoder start from 0.
One mistake I made that caused the summary not to be created: I fed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
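A plain-Python imitation of what seems to happen here. This is not Spark's API: the helper names are hypothetical, and it assumes the encoder treats each value as a category index, infers the number of categories as the largest index plus one, and drops the last category.

```python
def one_hot_drop_last(index, num_categories):
    """Index-based one-hot vector with the last category dropped."""
    vector = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1
    return vector

def encode_column(indices):
    """Encode a column, inferring the category count as max index + 1."""
    num_categories = max(indices) + 1
    return [one_hot_drop_last(i, num_categories) for i in indices]

quarters = [1, 2, 3, 4, 1, 2]
raw = encode_column(quarters)                        # indices 1..4 -> 5 categories
shifted = encode_column([q - 1 for q in quarters])   # indices 0..3 -> 4 categories

# Position 0 is never set when the indices start at 1.
first_column_raw = [row[0] for row in raw]
print(first_column_raw)
```

Under those assumptions the unshifted quarters produce a constant all-zero column, which carries no information and would make the design matrix rank deficient.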

Using Weka NaiveBayes with Matlab

I created a NaiveBayes model in Weka. I exported the model to disk. I now want to inject this model into MATLAB 2018, so that I can check how it performs via some data that I am receiving.
I load my model in MATLAB, by stating something like this:
loadedModel = weka.core.SerializationHelper.read('myweka.model');
I then create a Weka Instance object, and let it contain this data:
instance = infrequent,low,high,medium-high,high,medium,medium,low,low
If I run these two commands:
loadedModel.distributionForInstance(instance)
loadedModel.classifyInstance(instance)
I see the following output:
0.0001
0.9999
1
This is odd to me because if I observe the same record in WEKA ui, I see the same instance with probabilities 0.993 and 0.007, classified as '2'. (I can load the same model multiple times from disk in WEKA, and reproduce this behavior, which is correct) After further investigation, I noticed that regardless of the sequence of attributes my Instance object has, I always get the same probability output and the same classification by invoking the model via MATLAB.
There are some posts on the net that share the same problem, like these:
Always getting the same output
Weka - Classifier returns the same distribution for any input
However, the recommended solution to call 'instance.setClassMissing()' did not solve my issue. Is there anything I am missing, or can try to do in order to further troubleshoot the issue?
Does your test instance have the same structure as your training set? If not, you need to provide the same structure.
Weka indexes nominal attributes and stores the indices internally, so the order of the nominal attribute values in the training file is important. For example, if your attribute is mapped as low=>0, high=>1 in training, you need to map it the same way in your test set. Usually this is achieved by serializing the training header with the model.
Sample code for creating train header:
Instances trainHeader = new Instances(instances, 0);
trainHeader.setClassIndex(instances.classIndex());
When creating a new instance set its dataset:
Instance instance = ...
instance.setDataset(trainHeader);
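Why the header matters can be shown with a small sketch in plain Python. This is not Weka code: build_header and encode are hypothetical helpers that only mimic the label-to-index mapping described above.

```python
def build_header(attribute_values):
    """Map each nominal label to its index, in declaration order."""
    return {label: index for index, label in enumerate(attribute_values)}

def encode(labels, header):
    """Convert nominal labels to the internal indices defined by a header."""
    return [header[label] for label in labels]

train_header = build_header(["low", "medium", "high"])
other_header = build_header(["high", "low", "medium"])  # same labels, new order

sample = ["low", "high", "medium"]
train_encoded = encode(sample, train_header)
other_encoded = encode(sample, other_header)
print(train_encoded)
print(other_encoded)
```

The same labels end up as different numbers under the two headers, so a model trained against one header would read an instance built against the other as a different record.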

Target Encoding in Python Model

I made a model in python and this uses target encoding. I used a dataset with 25000 rows and this gets divided into training and test data sets. The model is really working fine. However, I now want to run the model on totally fresh data - say just one row of data in an excel file. I need to know the code for it and will really appreciate it if someone can help. I am somewhat new to python. Here is the part of code I have written to create the training and test data sets from 25000 rows and train the model on training and predict on the test. However, I need the code that runs this model that uses target encoding to predict fresh data. If I need to post more code for greater clarity please let me know.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(train_x.values, train_y.values)
pred_train = rf.predict(train_x.values)
pred = rf.predict(test_x.values)
Thanks
You might want to have a look at the comment section in this notebook, here:
"After we apply target encoding on the train data and target. We can have the result for one category like column A has a,b,c. Then we calculate the mean of each a,b,c in column A and apply it to the test data.We then apply it to test using the pd.merge function."

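The approach quoted in the answer above can be sketched roughly as follows, in plain Python without pandas. The function names, the toy column and the fallback to the global mean for unseen categories are all assumptions made for illustration, not something stated in the question or the notebook.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training data only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def apply_target_encoding(categories, means, global_mean):
    """Encode categories with stored means; unseen values get the global mean."""
    return [means.get(c, global_mean) for c in categories]

train_col = ["a", "b", "a", "c", "b", "a"]
train_y = [1, 0, 1, 0, 1, 0]
means, global_mean = fit_target_encoding(train_col, train_y)

# Fresh data: "b" was seen in training, "d" was not.
fresh_rows = ["b", "d"]
encoded = apply_target_encoding(fresh_rows, means, global_mean)
print(encoded)
```

The key design point is that the means are computed once from the training data and stored, so a single fresh row is encoded by lookup and the encoded values can then be passed to the trained model's predict.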
How to use `crossval` in matlab for a Leave one Out Validation method

I have been reading the documentation (here and here), but it's really unclear to me and I don't see how to use crossval in practice to do a leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated the chlorophyll rate with different indices. Then I did a linear regression between those indices and the chlorophyll rate measured in the field. Now I want to validate them. One of my estimations is a column with 22 entries, so I want to use 21 of them for training and 1 for testing, and loop 22 times so that every entry has been used as the test point.
But I don't know where I should put the regression model. If my regression is Y=aX+b,
do I reuse the a and b calculated before for the training part, or do I fit a new linear regression on the training part and then see what the test value will be with that?
I am not sure I fully understood how to build a leave-one-out model.
Then I want to know the result of the test by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out, so this is my script:
First I loaded my data and the linear regression function
X=indicesCha_without_Cloud(:,3);
y=Cha_g_m2t_without_Cloud(:,3);
testval=@(XTRAIN,ytrain,XTEST)Linear_regression_indices(XTRAIN,ytrain,XTEST);
where in my case fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function [ Linear_regression_indices ] = Linear_regression_indices(XTRAIN,ytrain,XTEST )
Linear_regression_indices=(polyval(polyfit(XTRAIN,ytrain,1),XTEST));
end
There are 2 ways to do it and they both give the same result:
one is simply using the crossval function
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
this will do as many folds as there are data points, each time using one data point as Xtest
the second one is using cvpartition
From the documentation: "c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of folds equals the number of observations."
c = cvpartition(y,'LeaveOut');
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
then the RMSE can be easily calculated
RMSE=sqrt(cvMse);
RMSE2=sqrt(cvMse2);
then I simply get my answer, in my case RMSE=0.3548
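For anyone who wants to see what the leave-one-out loop does under the hood, here is a minimal sketch in plain Python. It is not Matlab's crossval; the function names and the six data points are made up for illustration. Note that the line is refitted on the training points in every fold, rather than reusing the a and b from the full dataset.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def leave_one_out_rmse(xs, ys):
    """Refit the line n times, each time predicting the held-out point."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        squared_errors.append((a * xs[i] + b - ys[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data, roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8, 11.0]
rmse = leave_one_out_rmse(xs, ys)
print(rmse)
```

With 22 entries the loop would run 22 times, each time training on 21 points and testing on the remaining one, and the RMSE is the square root of the mean squared test error across those folds.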

How to use Decision Tree Classification Matlab?

I have data in form of rows and columns where rows represent a record and column represents its attributes.
I also have the labels (classes) for those records.
I know the concept of decision trees and I would like to use Matlab to classify unseen records with a decision tree.
How can this be done? I followed this link but it's not giving me the correct output:
Decision Tree in Matlab
Essentially I want to construct a decision tree based on training data and then predict the labels of my testing data using that tree. Can someone please give me a good and working example for this ?
I used the following code to achieve it, and it is working correctly:
function DecisionTreeClassifier(trainingFile, testingFile, labelsFile, outputFile)
    training = csvread(trainingFile);
    labels = csvread(labelsFile);
    testing = csvread(testingFile);
    tree = ClassificationTree.fit(training,labels)
    prediction = predict(tree, testing)
    csvwrite(outputFile, prediction)
ClassificationTree.fit will be removed in a future release. Use fitctree instead.