Model Probability Calibration in PySpark

I am using PySpark to implement a churn classification model for a business problem, and the dataset I have is imbalanced. So when I train the model, I randomly select a subset with equal numbers of 1's and 0's.
Then I applied the model to real-time data, and the numbers of predicted 1's and 0's were, predictably, roughly equal.
Now I need to calibrate my trained model, but I couldn't find a way to do it in PySpark. Does anyone have an idea how to calibrate a model in PySpark, maybe something like scikit-learn's CalibratedClassifierCV?
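PySpark ships no direct counterpart to CalibratedClassifierCV, but for the undersampling setup described above there is a well-known closed-form prior correction you can apply to the balanced model's scores. A minimal sketch in plain Python (the function name is mine; `beta` is assumed to be the fraction of majority-class rows you kept when balancing):

```python
def calibrate_undersampled(p_s, beta):
    """Map a probability from a model trained on undersampled data
    back to the original class distribution.

    p_s  : probability predicted by the model trained on the balanced sample
    beta : fraction of majority-class (label 0) rows kept while balancing
    """
    # Bayes prior correction: odds are scaled by beta, then renormalised
    return beta * p_s / (beta * p_s + 1.0 - p_s)
```

With beta = 1 (no undersampling) the formula is the identity; for heavy undersampling it pushes the balanced model's probabilities back down towards the true base rate. An alternative is Platt scaling: fit a simple logistic regression on a held-out set's scores against the true labels.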

Related

Adding noise to dataset with binary columns

I am trying to add noise to the Telco dataset in order to compare prediction accuracy levels using neural networks.
I have converted most of the dataset to binary variables (0 or 1). I want to add noise to the entire dataset before training neural networks on it. Would the only option be to flip them, or is there another way?
This is the dataset:
https://www.kaggle.com/datasets/blastchar/telco-customer-churn
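For binary columns, flipping is indeed the natural noise model. A minimal sketch in plain Python (the function name and the choice of flip probability are mine): each 0/1 entry is inverted independently with probability p, so small values such as p = 0.05 perturb the data while keeping most of the signal.

```python
import random

def flip_noise(rows, p, seed=42):
    """Return a copy of binary rows where each 0/1 entry is
    flipped independently with probability p."""
    rng = random.Random(seed)
    return [[(1 - v) if rng.random() < p else v for v in row]
            for row in rows]
```

For continuous columns (e.g. MonthlyCharges, before binarisation) you could instead add small Gaussian noise, which gives a second noise regime to compare against.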

Classification Using Weka did not give any result for precision , Fmeasure and MCC

I have a dataset with some categorical values and some discrete values, and it is imbalanced. I split the dataset into 60% training data and 40% test data using the Resample filter available in Weka. To balance the dataset I am using the SMOTE technique. After that I used Random Forest to classify it.
The result is
Now I cannot understand what the "?" in the result means. Secondly, why is there no value for False Positive and True Positive? Does that mean the dataset is still biased towards the No class even after applying SMOTE?
Note: I applied SMOTE only on the training data, not on the test data.
It would be helpful if someone could clarify my doubts.
That was asked on the Weka mailing list before (2019-07-26, "How can I explain the tag '?' in the performance of the model"). Here is Eibe's answer:
It means the statistic could not be computed. For example, precision for class “High” cannot be computed because the classifier did not assign any instances to that class. This means the denominator in the calculation for precision is zero.
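In code, the "?" corresponds to a zero denominator; a minimal sketch of the precision case:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP). Undefined when the classifier
    assigned no instances to the class, i.e. TP + FP == 0 --
    this is the case Weka prints as '?'."""
    return None if tp + fp == 0 else tp / (tp + fp)
```

So a "?" for a class does not by itself mean SMOTE failed, only that the classifier never predicted that class on the test set.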

How to make a hybrid model (LSTM and Ensemble) in MATLAB

I am working on CO2 prediction in MATLAB. My dataset has 3787 samples (including the test and validation sets), and I am trying to predict CO2 values that have a standard deviation of 179.60. I have 15 predictors and 1 response. Among the predictors I have two types of data (1. sequential numeric data such as temperature and humidity, 2. conditions, i.e. yes/no), so I have decided to use two types of networks to train my model:
1) LSTM - for the sequential data
2) Ensemble or SVM - for the yes/no data
3) Combine the two models and predict the response variable.
How can I achieve this? Can anyone help me?
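The rest of this thread uses Python, so here is a language-agnostic sketch of step 3 only: blending two models' predictions with least-squares weights learned on a validation set. The arrays stand in for the LSTM and SVM/ensemble outputs (names and synthetic data are mine); the same blend is straightforward to reproduce in MATLAB.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)                           # response (stand-in for CO2)
pred_lstm = y + rng.normal(scale=0.5, size=100)    # stand-in for LSTM predictions
pred_svm = y + rng.normal(scale=0.8, size=100)     # stand-in for SVM/ensemble predictions

# Step 3: learn blend weights (plus an intercept) by least squares
A = np.column_stack([pred_lstm, pred_svm, np.ones_like(y)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
blended = A @ w

mse = lambda p: float(np.mean((y - p) ** 2))
```

Because the least-squares fit can always reproduce either base model alone (weights 1, 0, 0 or 0, 1, 0), the blended in-sample error is never worse than either base model's.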

sample weights in pyspark decision trees

Do you know if there is some way to set sample weights for the DecisionTreeClassifier algorithm in PySpark (2.0+)?
Thanks in advance!
There's currently no hyperparameter in the PySpark DecisionTree or DecisionTreeClassifier class for specifying class weights (usually required for a biased dataset, or where a true prediction of one class matters more than the other).
It may be added in a future update, and you can track the progress in the JIRA here.
There is a Git branch that has already implemented this; it is not officially available yet, but you can use this pull request for now:
https://github.com/apache/spark/pull/16722
You have not specified your exact scenario or why you need weights, but the suggested workarounds for now are:
1. Undersampling the dataset
If your dataset is heavily biased, you can perform a random undersample of the class that has a very high frequency.
2. Force-fitting the weights
Not an elegant approach, but it works. You can repeat the rows of each class in proportion to its weight.
E.g., for binary classification, if you need a weight of 1:2 for the (0/1) classes, you can repeat every row with label 1 twice.
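A plain-Python sketch of that row-repetition idea (the helper name is mine; in PySpark itself a 1:2 weighting can be had by unioning the label-1 rows back onto the DataFrame once more):

```python
def force_fit_weights(rows, labels, weights):
    """Emulate integer class weights by repeating each (row, label)
    pair weights[label] times, e.g. weights={0: 1, 1: 2}."""
    out_rows, out_labels = [], []
    for row, label in zip(rows, labels):
        out_rows.extend([row] * weights[label])
        out_labels.extend([label] * weights[label])
    return out_rows, out_labels
```

Note this only supports integer weight ratios, and it inflates the training set size accordingly.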

KNN giving highest accuracy with K=1?

I am using Weka's IBk for performing classification on text (tweets). I convert the training and test data to a vector space, and when I perform the classification on the test data, the best result comes from K=1. The training and test data are separate from each other. Why does K=1 give the best accuracy?
Because you are classifying by proximity in a vector space: at k=1, the class of the single closest vector decides the prediction, so a very close match wins outright. At larger k (e.g. k=5), the prediction is the majority class among the k neighbours, so more distant points can outvote that close match.
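The effect is easy to reproduce with a toy sketch (plain Python, 1-D distances; the helper is mine): two very close same-class neighbours decide the vote at k=1, but are outvoted by the more numerous class at k=5.

```python
def knn_predict(train, labels, x, k):
    """Majority vote among the k nearest 1-D training points to x."""
    nearest = sorted(range(len(train)), key=lambda i: abs(train[i] - x))[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Two 'A' points close to the query, four 'B' points farther away
train = [0.0, 0.1, 1.0, 1.1, 1.2, 1.3]
labels = ["A", "A", "B", "B", "B", "B"]
```

Here `knn_predict(train, labels, 0.05, 1)` returns "A" while k=5 returns "B": the majority class has drowned out the close match, which is exactly what happens with sparse text vectors and an imbalanced vocabulary.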