Should I check data balance/imbalance on training dataset only or whole dataset?

Should I check data balance/imbalance on training dataset only or whole dataset? - classification

I'm confused about classification modeling: should I check whether the dataset is imbalanced on the training data only or on the whole dataset? Thank you!
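The thread records no answer, so the following is only a sketch of how the two checks compare, not a recommendation from the original poster. It counts class proportions on the whole dataset and on each part of a stratified split. The helper names (`class_proportions`, `stratified_split`) and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter

def class_proportions(labels):
    """Return the share of each class among the given labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}

def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test while keeping each class's share the same."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy labels (made up): 10% positives, 90% negatives
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

whole = class_proportions(labels)
train = class_proportions([labels[i] for i in train_idx])
test = class_proportions([labels[i] for i in test_idx])
print(whole, train, test)
```

With a stratified split the three sets of proportions agree, so the check gives the same picture either way. With a purely random split they can differ, and the training proportions are the ones the model is actually fitted on.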

Related

Model Probability Calibration in Pyspark

I am using PySpark to implement a churn classification model for a business problem, and the dataset I have is imbalanced. So when I train the model, I randomly select a subset with equal numbers of 1's and 0's.
Then I applied the model to real-time data, and the numbers of predicted 1's and 0's were, unsurprisingly, equal.
Now I need to calibrate my trained model, but I couldn't find a way to do it in PySpark. Does anyone have an idea how to calibrate a model in PySpark, maybe with something like CalibratedClassifierCV?
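One adjustment sometimes used in this situation is sketched below in plain Python; it is not a PySpark API and not an answer from the thread. It assumes the majority class (0) was randomly undersampled and every minority example (1) was kept. In that case a score from the balanced model can be mapped back to the original class prior in closed form. The function name and `beta` (the fraction of class-0 rows that were kept) are hypothetical. The correction only undoes the prior shift, so it assumes the scores are otherwise well calibrated on the balanced sample.

```python
def correct_undersampled_probability(p_sampled, beta):
    """Map a positive-class probability from the balanced model back to the original prior.

    Undersampling negatives at rate beta multiplies the positive odds by 1/beta,
    so the original odds are beta * (p_sampled / (1 - p_sampled)).
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1.0)

# Assumed example: originally 1 positive per 10 negatives, and negatives were
# undersampled until the classes were equal, so beta = 0.1.
beta = 0.1
corrected = [correct_undersampled_probability(p, beta) for p in (0.1, 0.5, 0.9)]
print(corrected)  # a balanced-model score of 0.5 maps back to 1/11, the original positive rate
```

The same arithmetic could be applied to the positive-class probability of each prediction. A method such as Platt scaling or isotonic regression, fitted on a held-out set with the original class ratio, is the other usual route.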

How to project PCA features from train data to test data in spark scala?

I read this link that explains it: Anomaly detection with PCA in Spark
But what is the code to extract PCA features from the training data and project the test data onto them?
From what I understood, we have to use the same set of features for the train and test data.
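The idea of reusing one fitted projection can be sketched without Spark. The plain-Python example below learns the mean and the first principal component from the training rows only, then applies that same mean and component to the test rows. The helper names, the toy data, and the use of power iteration are assumptions for illustration. In Spark the analogous pattern is to fit the PCA estimator on the training DataFrame and call `transform` on the test DataFrame with the resulting model.

```python
def fit_first_component(rows, iters=200):
    """Learn the mean and top principal component from training rows via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Arbitrary start vector; assumed not to be orthogonal to the top component.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def project(rows, mean, component):
    """Project rows onto a component using the mean learned from the training data."""
    return [sum((r[j] - mean[j]) * component[j] for j in range(len(mean)))
            for r in rows]

# Toy data (made up): points on the line y = x
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
test = [(4.0, 4.0)]

mean, component = fit_first_component(train)   # fitted on train only
train_scores = project(train, mean, component)
test_scores = project(test, mean, component)   # test reuses train's mean and component
print(test_scores)
```

The key point is that nothing is refitted on the test rows: they are centered with the training mean and projected onto the training component.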

SVM prediction does not predict OK although the support vectors are valid

I have the following unlabeled training set (fig 1) in which I am trying to detect outliers. I have come up with a procedure to label the data with 0 (normal data) and 1 (outlier), and I want to train an SVM on it.
I followed these instructions to train the SVM model, but when I try to predict the labels of the same data I trained the SVM on, it does not predict any (fig 2)!
fig 1: the support vectors after training
fig 2: the prediction of the SVM model on the same data it was trained with
The output of prediction is not supposed to look like this!
The code I have used for prediction is:
out = predict(model, data');
Question:
What is wrong with my approach?

For what it's worth, I have found the answer to my question and now it's working fine.
It works after using a non-linear kernel, but I don't know why.

Neural network for pattern recognition

I would like help figuring out which kind of problem I am dealing with (pattern recognition or time series forecasting) and finding the NN architecture best suited to it.
In my problem, I have many finite sets of two-dimensional data (learning sets).
Let N be the size of the data set I want to calculate using the NN.
I want my NN to learn these data so that, given the first m data points of a data set, it gives me the remaining N-m data points.
I think it's rather a pattern recognition problem, so which NN architecture is best suited for this kind of problem?
Thank you.

As far as I have understood your problem, you have a dataset with N rows. You want to train your network using the first M rows, and then you want your NN to predict the remaining N-M rows.
Typically, in forecasting (time series prediction), we do this kind of thing: we train our model with historical data and try to predict future values.
So, in your case, the top M rows could be the training data in the training phase, and during the model accuracy evaluation phase, the future values could be your N-M rows.
Typically, recurrent networks are best suited for temporal data because they can take the ordering of the data into account.
ENCOG also provides a special dataset for temporal data, and you can use it for your problem.
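The split described above can be sketched in a few lines of plain Python. This is an illustration only and does not use ENCOG. The function names, the toy series, and the window length are assumptions. The sliding-window step is one common way to turn an ordered series into (input, next value) pairs for a forecasting model.

```python
def chronological_split(series, m):
    """Keep the first m rows for training and the remaining rows for evaluation."""
    return series[:m], series[m:]

def sliding_windows(series, window):
    """Turn an ordered series into (input window, next value) training pairs."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]

# Toy two-dimensional series (made up), N = 10 rows
series = [(t, t * t) for t in range(10)]

train, holdout = chronological_split(series, m=7)   # first M rows vs. last N-M rows
pairs = sliding_windows(train, window=3)

print(len(train), len(holdout), len(pairs))
print(pairs[0])
```

The rows are never shuffled, so the holdout always lies after the training rows in the ordering, which is what makes it a stand-in for "future" values.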

Clustering of data - Pre-processing of data

These days I am using some clustering algorithms, and I wanted to ask a question related to this field. Maybe those who work in this field already have the answer.
During clustering I need some training data, which I am going to cluster. The number of iterations (e.g. for the K-Means algorithm) depends on the amount of training data (the number of vectors). Is there any method to find the most important data in the training data? What I mean is: instead of training K-Means with all the data, maybe there is a method to find just the important vectors (those that affect the clusters most) and use these "important" vectors (from the training data) to train the algorithm.
I hope you understood me.
Thank you for reading and trying to answer.

"Training" and "test" data are concepts from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
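The sampling suggestion can be tried on synthetic data. The sketch below runs a plain Lloyd's-algorithm k-means on a full set of points and on a 5% random sample, then compares the centroids. The `kmeans` helper, the two synthetic blobs, and the sample size are assumptions for illustration; how large a sample is "large enough" depends on the data.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on 2-D tuples; returns the centroids sorted."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append((sum(x for x, _ in members) / len(members),
                                      sum(y for _, y in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return sorted(centroids)

# Synthetic data (made up): two well-separated blobs of 5000 points each
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5000)]

sample = rng.sample(points, 500)  # 5% random sample

full_centroids = kmeans(points, k=2)
sample_centroids = kmeans(sample, k=2)
print(full_centroids)
print(sample_centroids)
```

On data like this the two sets of centroids land close together while the sampled run touches twenty times fewer points per iteration.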
