Should I check class balance/imbalance on the training dataset only or on the whole dataset? - classification

I'm confused about one step in classification modeling: should I check whether the dataset is imbalanced on the training data only or on the whole dataset? Thank you!
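One way to see whether the choice matters for a given dataset is to compute the class shares on the whole dataset and on each split and compare them. The following is only a minimal sketch: the labels are made up for illustration, the 80/20 split ratio is an assumption, and `class_shares` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter


def class_shares(labels):
    """Return each class's fraction of the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: 90 negatives and 10 positives (made up for illustration).
labels = [0] * 90 + [1] * 10

# Plain random 80/20 split (assumed ratio); the split is not stratified,
# so the class shares of the two parts can drift away from the whole.
rng = random.Random(0)
indices = list(range(len(labels)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train_labels = [labels[i] for i in indices[:cut]]
test_labels = [labels[i] for i in indices[cut:]]

whole_shares = class_shares(labels)
train_shares = class_shares(train_labels)
test_shares = class_shares(test_labels)

print("whole:", whole_shares)
print("train:", train_shares)
print("test: ", test_shares)
```

If the train and test shares differ noticeably from the whole-dataset shares, a stratified split would keep them aligned.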

Related

Model Probability Calibration in Pyspark

I am using PySpark to implement a churn classification model for a business problem, and the dataset I have is imbalanced. So when I train the model, I randomly select a subset with equal numbers of 1's and 0's.
Then I applied the model to real-time data, and the numbers of predicted 1's and 0's were, unsurprisingly, about equal.
Now I need to calibrate my trained model, but I couldn't find a way to do it in PySpark. Does anyone have an idea how to calibrate a model in PySpark, maybe with something like CalibratedClassifierCV?
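For the situation described (training on a balanced subsample of an imbalanced dataset), one commonly used adjustment is a prior-shift correction of the predicted probabilities. This is not a PySpark API and not a replacement for CalibratedClassifierCV; it is a plain-Python sketch that assumes all class-1 rows were kept and class-0 rows were kept at a known rate `beta`. The function name and the example numbers are hypothetical.

```python
def correct_for_undersampling(p_sampled, beta):
    """Map a class-1 probability from a model trained on undersampled data
    back to the original class prior.

    p_sampled: probability of class 1 predicted by the balanced-data model.
    beta: fraction of class-0 rows kept when building the training set
          (assumption: every class-1 row was kept).
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)


# Hypothetical numbers: 50 churners and 950 non-churners in the original data,
# of which 50 non-churners were kept to balance the training set.
beta = 50 / 950
for p in (0.1, 0.5, 0.9):
    print(p, "->", round(correct_for_undersampling(p, beta), 4))
```

With these numbers a balanced-model score of 0.5 maps back to the original 5% churn rate. The correction only undoes the sampling; it assumes the model's scores were otherwise reasonable probabilities on the balanced data.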

How to project PCA features from train data to test data in Spark Scala?

I read this link that explains: Anomaly detection with PCA in Spark
But what's the code to extract PCA features and project them from the training data to the test data?
From what I understood, we have to use the same set of features for the train and test data.
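The principle behind the question can be sketched without Spark: the mean and the principal directions are estimated from the training rows only, and the test rows are then centered and projected with those same values. The sketch below is plain Python rather than Scala, keeps a single component found by power iteration, and uses made-up data; `fit_pca_1d` and `project` are hypothetical helpers. In Spark ML the same pattern corresponds to fitting the `PCA` estimator on the training DataFrame and calling `transform` on both DataFrames with the resulting model.

```python
import math


def fit_pca_1d(rows, iterations=200):
    """Fit a one-component PCA on the training rows.

    Returns (mean, unit vector along the direction of largest variance).
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the training rows.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, component):
    """Center rows with the TRAINING mean and project onto the component."""
    return [sum((r[j] - mean[j]) * component[j] for j in range(len(mean)))
            for r in rows]


# Made-up two-dimensional data for illustration.
train = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
test = [[6.0, 6.0], [0.0, 0.0]]

mean, component = fit_pca_1d(train)          # fitted on train only
train_scores = project(train, mean, component)
test_scores = project(test, mean, component)  # same mean and component reused
print(train_scores)
print(test_scores)
```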

SVM prediction does not predict OK although the support vectors are valid

I have the following unlabeled training set (fig 1) in which I am trying to detect the outliers. I have come up with a procedure to label the data with 0 (normal data) and 1 (outlier), and I want to train an SVM on it.
I followed these instructions to train the SVM model, but when I try to predict the labels of the same data I trained the SVM on, it does not predict any (fig 2)!
fig 1: the support vectors after training
fig 2: the prediction of the SVM model on the same data it was trained on
The output of prediction is not supposed to look like this!
The code I have used for prediction is:
out = predict(model, data');
Question:
What is wrong with my approach?
For what it's worth, I have found the answer to my question and now it's working fine.
The prediction came out right after I switched to a non-linear kernel, but I don't know why that made the difference.

Neural network for pattern recognition

I would like help figuring out which kind of problem I am dealing with (pattern recognition or time series forecasting) and finding the NN architecture best suited to it.
In my problem, I have many finite sets of two-dimensional data (learning sets).
Let N be the size of the data set I want to calculate using the NN.
I want my NN to learn these data so that, given the first m data points of a data set, it gives me the remaining N-m data points.
I think it's rather a pattern recognition problem, so which NN architecture is best suited for this kind of problem?
Thank you.
As far as I have understood your problem, you have a dataset with N rows. You want to train your network using the first M rows, and then you want your NN to predict the remaining N-M rows.
Typically, this is what we do in forecasting (time series prediction): we train our model with historical data and try to predict future values.
So, in your case, the top M rows could be the training data in the training phase. During the model accuracy evaluation phase, the future values could be your N-M rows.
Typically, recurrent networks are best suited for temporal data because they can take the ordering of the data into account.
ENCOG also provides a special dataset for temporal data, and you can use it for your problem.
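The split described above can be sketched as follows. This is plain Python, not ENCOG code; the series values, M, and the window width are made up for illustration, and `chronological_split` and `make_windows` are hypothetical helpers. The windowing step shows one common way to turn the first M rows into (input, target) pairs for a network.

```python
def chronological_split(rows, m):
    """Split ordered rows into the first m (training) and the rest (evaluation)."""
    if not 0 < m < len(rows):
        raise ValueError("m must be between 1 and len(rows) - 1")
    return rows[:m], rows[m:]


def make_windows(rows, width):
    """Build (previous `width` rows, next row) pairs for supervised training."""
    return [(rows[i:i + width], rows[i + width])
            for i in range(len(rows) - width)]


# Made-up two-dimensional series with N = 10 rows.
series = [(float(t), 2.0 * t + 1.0) for t in range(10)]
M = 7

train_rows, held_out_rows = chronological_split(series, M)
train_pairs = make_windows(train_rows, width=3)

print(len(train_rows), "training rows,", len(held_out_rows), "rows to predict")
print("first training pair:", train_pairs[0])
```

No shuffling is done before the split, so the evaluation rows always come after the training rows in time.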

Clustering of data - Pre-processing of data

These days I am using some clustering algorithm and I just wanted to ask a question related to this field. Maybe those who are working in this field already have this answer.
During clustering I need some training data, which I am going to cluster. The number of iterations (e.g. for the K-Means algorithm) depends on the amount of training data (number of vectors). Is there any method to find the most important data within the training data? What I mean is: instead of training K-Means with all the data, maybe there is a method to find just the important vectors (those that affect the clusters most) and use these "important" vectors (from the training data) to train the algorithm.
I hope you understood me.
Thank You for reading and trying to answer.
"Training" and "Test" data is a concept from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
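A small experiment illustrates the sampling suggestion. The sketch below is a plain-Python Lloyd's k-means with farthest-first initialisation (an arbitrary choice made here so the run is deterministic), applied to made-up data with two well-separated groups; the sample size of 200 out of 2000 points is also an assumption.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return sorted(centroids)


# Made-up data: two Gaussian groups of 1000 points each.
rng = random.Random(42)
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in [(0.0, 0.0), (10.0, 10.0)]
          for _ in range(1000)]
sample = rng.sample(points, 200)  # 10% random sample

full_centroids = kmeans(points, 2)
sample_centroids = kmeans(sample, 2)
print("full:  ", full_centroids)
print("sample:", sample_centroids)
```

How large a sample is "large enough" depends on the number of clusters and how well separated they are; this toy case is deliberately easy.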