Invalid labels for classification logistic regression model in pyspark databricks - pyspark

I am using the Spark ML library for a classification problem with logistic regression.
I have vectorized the input features and created a training dataset and a test dataset.
While fitting the model I get an "invalid labels" error.
The training dataset is shown below, where my input features column is Independent_features and my target column is Category_con.

Use the names label and features instead of Independent_features and Category_con when creating your vectors.
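For illustration, a minimal sketch of two ways to do that (the column names Independent_features and Category_con come from the question; everything else is assumed):

from pyspark.ml.classification import LogisticRegression

# Option 1: rename the existing columns to the default names the estimator expects.
train_df = (train_df
            .withColumnRenamed('Independent_features', 'features')
            .withColumnRenamed('Category_con', 'label'))

# Option 2: keep the original names and point the estimator at them explicitly.
lr = LogisticRegression(featuresCol='Independent_features', labelCol='Category_con')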

For the labels, you would need to reduce them to just 3 categories; from the error message it looks like you currently have 6. You can use conditional replacement to group (bin) the categories, as below:
from pyspark.sql.functions import when, col

# firstCondition/secondCondition and firstLabel/secondLabel/lastLabel are placeholders
# for your category values and the new class ids you map them to.
train_df = train_df.withColumn(
    'label',
    when(col('Category_con') == firstCondition, firstLabel)
    .when(col('Category_con') == secondCondition, secondLabel)
    .otherwise(lastLabel)
)
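A quick sanity check and fit, assuming the binning above has produced a numeric label column and the assembled vector column is named features:

from pyspark.ml.classification import LogisticRegression

# Verify the binned labels: this should list exactly the three class values.
train_df.select('label').distinct().show()

# Fit the logistic regression on the assembled vector column and the new label column.
lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10)
model = lr.fit(train_df)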

Related

How to use multiple labels as targets in Neural Net Pattern Recognition Toolbox?

I am trying to use the Neural Net Pattern Recognition toolbox in MATLAB to recognize different classes in my dataset. I have a 21392 x 4 table; I would like to use columns 1-3 as predictors, and the 4th column holds the labels with 14 different categories (strings like Angry, Sad, Happy, Neutral, etc.). It seems that the Neural Net Pattern Recognition toolbox, unlike the MATLAB Classification Learner toolbox, doesn't allow me to import the table and automatically extract the predictors and responses from it. Moreover, I am unable to specify the inputs and targets to the neural network manually, as that option isn't showing up.
I looked into examples like the Iris, Wine and Cancer datasets, but all of them have only 2-3 output classes (encoded in binary like 000, 010, 011, etc.), and their labels are not strings, unlike mine (Angry, Sad, Happy, Neutral, etc., 14 different classes in total). I would like to know how I can use my table as input to the Neural Net Pattern Recognition toolbox, or otherwise, any way in which I can extract the data from my table and use it in the toolbox. I am new to using the toolbox, so any help in this regard would be highly appreciated. Thanks!
The first step in using the Neural Net Pattern Recognition Toolbox is to convert the table to a numeric array, since neural networks work only with numeric arrays, not other data types directly. Assuming the table is named my_table, it can be converted to a numeric array using
my_table_array = table2array(my_table);
From my_table_array, the inputs (predictors) and outputs (targets) can be extracted. Note that both need to be transposed, since the toolbox expects data in column format (each column is one data point and each row is a feature). This can be done with:
inputs = inputs'; % now of dimensions 3x21392
labels = labels'; % now of dimensions 1x21392
The string labels can be converted to numeric one-hot targets using categorical followed by ind2vec:
my_table_vector = ind2vec(double(categorical(labels)));
Now my_table_vector (the final targets) and inputs (the final predictors) can be fed to the neural network and used for classification/prediction of the target labels.

How to specify multiple columns in xgboost.trainWithDataframe when using spark?

According to the API doc on xgboost.com, it seems that I can set only one column as the "featureCol".
As with any ML Estimator on Spark, this one expects inputCol to be a Vector of assembled features. Before you apply the Estimator, you should use the tools from org.apache.spark.ml.feature to extract, transform and assemble the feature vector.
You can check How to vectorize DataFrame columns for ML algorithms? for an example Pipeline.
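For illustration, a minimal PySpark sketch of assembling several raw columns into one vector column before handing it to an estimator (the column names are made up; the same idea applies via the Scala API in org.apache.spark.ml.feature):

from pyspark.ml.feature import VectorAssembler

# Combine the raw numeric columns into a single vector column named "features";
# that one column is what you then pass as the estimator's featureCol.
assembler = VectorAssembler(inputCols=['col_a', 'col_b', 'col_c'], outputCol='features')
assembled_df = assembler.transform(df)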

libsvm in matlab for multiclass prediction of test data

When I use libsvm in MATLAB for multiclass classification, the svmpredict command also takes the testing labels. As I don't have labels for the test set, is it possible to predict them somehow using libsvm in MATLAB?
Yes, just provide a meaningless label vector. The only use of the labels is so the prediction function can report some statistics. They are not actually required for prediction in any way.

Libsvm vs Weka (WLSVM)

I have to deal with an unbalanced dataset (95% of records in the negative class and 5% in the positive class). I developed a model using a decision tree and the Weka framework. Now I'd like to try SVM with Libsvm to get better results. I'm trying to use Libsvm for MATLAB and the Libsvm Weka wrapper, and I'd like to know how to compare the results I get from them. In Weka a model is built from the whole dataset and afterwards a 10-fold cross-validation is performed. How can I do that with Libsvm? From the Libsvm FAQ I learned that CV is used only to find the best kernel parameters, not during train/predict, so what is the exact sequence of actions I should perform in MATLAB to obtain similar results in order to compare them with Weka?

svm classification

I am a beginner in MATLAB and am doing my programming project in Digital Image Processing, i.e. magnetic resonance image classification using wavelet features + SVM + PCA + ANN. I ran the example SVM classification from the MATLAB tool and modified it to fit my requirements. I am facing problems storing more than one feature in an input vector and giving new input to the SVM. Please help.
Simply feed the multidimensional feature data to the svmtrain(Training, Group) function as the Training parameter (Training can be a matrix where each column represents a separate feature). After that, use svmclassify(SVMStruct, Sample) to classify the test data.