How to use Decision Tree Classification Matlab? - matlab

I have data in the form of rows and columns, where each row represents a record and each column represents one of its attributes.
I also have the labels (classes) for those records.
I know the concept of decision trees, and I would like to use Matlab to classify unseen records with a decision tree.
How can this be done? I followed this link, but it's not giving me the correct output:
Decision Tree in Matlab
Essentially, I want to construct a decision tree from training data and then predict the labels of my testing data using that tree. Can someone please give me a good working example of this?

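I used the following code to achieve it, and it works correctly:
function DecisionTreeClassifier(trainingFile, testingFile, labelsFile, outputFile)
    % Load the training records, their labels, and the unseen test records.
    training = csvread(trainingFile);
    labels = csvread(labelsFile);
    testing = csvread(testingFile);
    % Build the decision tree from the training data.
    tree = ClassificationTree.fit(training, labels);
    % Predict a label for each test record and save the predictions.
    prediction = predict(tree, testing);
    csvwrite(outputFile, prediction);
end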

ClassificationTree.fit will be removed in a future release. Use fitctree instead.

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so, I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary, I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The training code was as follows:
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol='prediction',
    family='gamma',
    link='log',
    weightCol='wght',
    maxIter=20
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of the eigenvalues is 0, which occurs when there is multicollinearity in your variables. This means that one of the variables can be predicted with a linear combination of the other variables. Consequently, the effect of each of the variables cannot be identified with any significance.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
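To make the invertibility point concrete, here is a minimal plain-Python sketch (not Spark code). It assumes a tiny made-up design matrix whose third column is the sum of the first two, and checks that the matrix X^T X has determinant zero, whereas a matrix with independent columns does not. With positive observation weights, the same rank deficiency should carry over to the weighted matrix used when fitting a GLM. The helper names gram and det3 are made up for this illustration.

```python
def gram(rows):
    """Return X^T X for a design matrix given as a list of rows."""
    n_cols = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
            for i in range(n_cols)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Made-up data: the third column equals the sum of the first two.
collinear = [[1, 2, 3], [2, 1, 3], [0, 4, 4], [3, 3, 6]]
# Same first two columns, but an independent third column.
independent = [[1, 2, 5], [2, 1, 0], [0, 4, 1], [3, 3, 2]]

det_collinear = det3(gram(collinear))
det_independent = det3(gram(independent))
print(det_collinear, det_independent)
```

A zero determinant means the matrix cannot be inverted, which is the situation described above.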
The GeneralizedLinearRegressionModel docs note that a model may have no summary available.
However, you can do an initial check to avoid the error:
glm_fit.hasSummary, which is a public boolean property (so it is read without parentheses).
Use it like this:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the Pyspark source code
and to the GeneralizedLinearRegressionTrainingSummary class source code, where the error is thrown.
Make sure the input values for your one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I passed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
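The effect can be imitated in a few lines of plain Python. This is a simplified stand-in and not Spark's encoder: it assumes the number of categories is inferred as the largest index plus one and that the last category is dropped, which is my understanding of the default behaviour. The function names are made up for the illustration.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a 0-based category index as a 0/1 list, dropping the last category."""
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec


def encode_column(indices):
    # Assumption: the category count is inferred as max index + 1.
    num_categories = max(indices) + 1
    return [one_hot_drop_last(i, num_categories) for i in indices]


def all_zero_columns(rows):
    """Return the positions of columns that are 0 in every row."""
    return [j for j in range(len(rows[0])) if all(r[j] == 0 for r in rows)]


quarters = [1, 2, 3, 4, 1, 2, 3, 4]
raw = encode_column(quarters)                       # indices 1..4 used directly
shifted = encode_column([q - 1 for q in quarters])  # indices 0..3

print(len(raw[0]), all_zero_columns(raw))
print(len(shifted[0]), all_zero_columns(shifted))
```

With the raw quarters, position 0 is never set, so that feature column is constant zero. After shifting to 0..3, no column is empty.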

Target Encoding in Python Model

I built a model in Python that uses target encoding. I used a dataset of 25000 rows, which is divided into training and test sets. The model works fine. However, I now want to run the model on completely fresh data, say just one row of data in an Excel file. I need to know the code for this and would really appreciate it if someone could help. I am somewhat new to Python. Here is the part of the code I wrote to create the training and test sets from the 25000 rows, train the model on the training set, and predict on the test set. What I need is the code that runs this model, which uses target encoding, on fresh data. If I need to post more code for greater clarity, please let me know.
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(train_x.values, train_y.values)
pred_train = rf.predict(train_x.values)
pred = rf.predict(test_x.values)
Thanks
You might want to have a look at the comment section in this notebook:
"After we apply target encoding on the train data and target. We can have the result for one category like column A has a,b,c. Then we calculate the mean of each a,b,c in column A and apply it to the test data.We then apply it to test using the pd.merge function."

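A minimal sketch of that idea in plain Python, assuming a single categorical column and a numeric target. The per-category means are learned once from the training data, stored, and then reused for any fresh row. The function names, the sample data, and the fallback to the global mean for unseen categories are my own assumptions, not something taken from the notebook.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn category -> mean target from training data, plus the global mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode new categories with the stored means; unseen ones get the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Hypothetical training column A and its binary target.
train_col_a = ["a", "b", "a", "c", "b", "a"]
train_y = [1, 0, 1, 0, 1, 1]
mapping, global_mean = fit_target_encoding(train_col_a, train_y)

# Fresh rows: "b" was seen in training, "z" was not.
fresh_rows = ["b", "z"]
encoded = apply_target_encoding(fresh_rows, mapping, global_mean)
print(encoded)
```

The encoded values would then be passed to rf.predict together with the other features, in the same column order as during training. The important part is that the mapping comes from the training data only and is never recomputed on the fresh rows.
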
Gaussian mixture model in pyspark

I have gone through the link https://spark.apache.org/docs/latest/mllib-clustering.html regarding fitting a GMM in pyspark. I have carried out the same operation successfully in Python, but after several iterations I am still unable to run it in pyspark.
My questions are as follows:
1. The link mentioned above and another example of fitting a GMM in pyspark that I checked both take a txt file with no column headings. I have a csv with 17 columns. The code is:
data = sc.textFile("..path/mydata.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.strip().split(' ')]))
This worked, but when I try to fit GaussianMixture.train with a specified number of components, it does not work.
2. If the data used in the examples has no column headings, how can I judge which column comes from which distribution and how the change in pattern appears?
3. How can I get a heat-map from this, so that whenever new data comes in I can use my trained model's heat-map to judge the distribution pattern of the new test data and point out the mismatches?
Thanks.
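One possible cause, assuming the file really is comma-separated with a header row: splitting each line on a space leaves the whole line as a single string, and float() cannot convert that. Because Spark transformations are lazy, such a parsing problem would only surface once training actually runs, which could explain why the map step appeared to work. Below is a plain-Python sketch (not Spark code) of the parsing that is needed. The three-column sample and the function name are made up.

```python
import csv
import io


def parse_csv_rows(lines):
    """Split comma-separated lines into a header and rows of floats."""
    reader = csv.reader(lines)
    header = next(reader)  # keep the column names, but out of the numeric data
    rows = [[float(x) for x in record] for record in reader if record]
    return header, rows


# Stand-in for the real file: a hypothetical 3-column CSV with a header row.
sample = io.StringIO("height,weight,age\n1.5,60.0,30\n1.8,82.5,41\n")
header, rows = parse_csv_rows(sample)
print(header)
print(rows)
```

Keeping the header separately also answers part of the column question: position i in each parsed row corresponds to header[i].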

How to use `crossval` in matlab for a Leave one Out Validation method

I have been reading the documentation (here and here), but it is really unclear to me, and I don't see how to use crossval in practice to do a leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated the chlorophyll rate with different indices. Then I did a linear regression between those indices and the chlorophyll rate measured in the field. Now I want to validate them. One of my estimates is a column with 22 entries, so I want to use 21 of them for training and 1 for testing, and loop 22 times so that every entry has been used as the test once.
But I don't know where I should put the regression model. If my regression is Y=aX+b,
do I reuse the a and b calculated before for the training part, or do I fit a new linear regression on the training part and then see what the test value is with that?
I am not sure I fully understand how to build a leave-one-out model.
Then I want to assess the test results by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out. This is my script:
First I loaded my data and the linear regression function:
X=indicesCha_without_Cloud(:,3);
y=Cha_g_m2t_without_Cloud(:,3);
testval=@(XTRAIN,ytrain,XTEST)Linear_regression_indices( XTRAIN,ytrain,XTEST);
where in my case fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function yfit = Linear_regression_indices(XTRAIN, ytrain, XTEST)
    % Fit a first-degree polynomial on the training data and evaluate it at XTEST.
    yfit = polyval(polyfit(XTRAIN, ytrain, 1), XTEST);
end
There are 2 ways to do it, and they both give the same result.
The first simply uses the crossval function:
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
This does as many folds as there are data points, using one data point as XTEST each time.
The second uses cvpartition:
c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of folds equals the number of observations. link
c = cvpartition(y,'LeaveOut');
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
Then the RMSE can easily be calculated:
RMSE=sqrt(cvMse);
RMSE2=sqrt(cvMse2);
That gives me my answer; in my case RMSE = 0.3548.
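For readers who want to see what the leave-one-out loop does under the hood, here is a plain-Python sketch (not a translation of crossval). It assumes a simple regression y = a*x + b. For each observation, the line is refitted on the other n-1 points, so a and b are recomputed in every fold rather than reused from a fit on the full data, and the held-out point is then predicted. The data and function names are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def leave_one_out_rmse(xs, ys):
    """Refit on n-1 points, predict the held-out point, and return the RMSE."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (a * xs[i] + b)) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
exact = [2.0 * x + 1.0 for x in xs]   # points lying exactly on a line
noisy = [1.1, 2.9, 5.2, 6.8, 9.1]     # made-up noisy measurements

print(leave_one_out_rmse(xs, exact))
print(leave_one_out_rmse(xs, noisy))
```

The RMSE is the square root of the mean of the squared held-out errors, which matches taking sqrt of the 'mse' value returned by crossval.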

Partitioning dataset into train, test and validate subset (Matlab)

I'm new to Matlab and I'm trying to achieve the following:
I have one dataset of 7000+ entries. The goal is to train a classification tree (fitctree) on this data. I've separated the data into a matrix with observations (predictors) and a matrix with classes (class). To partition the data I'm using cvpartition. Everything works fine up to this point.
Problem: I want to create three subsets of the data: 1 training set, 1 validation set, and 1 testing set. I want to train the tree using the training set and validate its performance using the validation set. After tweaking the parameters, I want to run the final test on the test data partition.
To partition the data I tried creating a cvpartition, which works, e.g.
cvpart = cvpartition(class, 'k', 10);
and then performing another cvpartition on that testing set, separating it into another two sets:
cvpart2 = cvpartition(cvpart.TestSize, 'k', 10);
Sadly, when validating the performance of the tree, this doesn't seem to work. When I skip the second cvpartition and validate the performance on the test set of cvpart, the model performs perfectly.
Update: after days I found that it seems to work when used this way:
cvpart2 = cvpartition(cvpart.TrainSize, 'k', 10);
Would anyone care to explain why it works this way, but not when using the test set?
Hope you guys can help me out ;)
Kind regards.
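As far as I understand, TestSize and TrainSize of a k-fold partition are vectors with one size per fold, so passing them to cvpartition may not partition the original observations at all. A simpler route to three subsets is to shuffle the row indices once and cut the shuffled list in two places. Here is a plain-Python sketch of that idea. The 70/15/15 proportions, the seed, and the function name are assumptions for illustration. In Matlab, a comparable approach would apply a holdout partition twice.

```python
import random


def three_way_split(n, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train, validation, and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(7000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The tree is then trained on the rows in train_idx, tuned against val_idx, and evaluated once at the end on test_idx. Every row lands in exactly one of the three subsets.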