How does mllib weight the classes internally for unbalanced datasets?

How does mllib weight the classes internally for unbalanced datasets? - pyspark

I have a dataframe with 1% positive classes (1's) and 99% negatives (0's) and I am working with a Logistic Regression in Pyspark. I rode here about dealing with unbalanced datasets, and the solution is to add a weightCol, as it says in the answer provided in the link, in order to tell the model to focus more on the 1's, as there are less.
I've tried it and it works well, but I don't know how mllib balances the data internally. Someone has a clue ? I don't like working with "black boxes" I can't comprehend.

from the Spark documentation it says
We implemented two algorithms to solve logistic regression: mini-batch gradient descent and L-BFGS. We recommend L-BFGS over mini-batch gradient descent for faster convergence.
You can check the LBFGS.scala to see the how optimization algorithm updates the weights after each iteration.

Related

When to use PCA for dimensionality reduction?

I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!

I want to share my opinion
I think training set 700 means, your data is < 1k.
I'm even surprised that svm performs 83%.
Even MNIST dataset is considered to be small (60.000 training - 10.000 test). Your data is much-much smaller.
You try to reduce your small data even smaller using pca. So what will svm learns? There is no discriminating samples left?
If I were you I would test using random-forest classifier. Random-forest might even perform better.
Even if you balanced your data, it is small data.
I believe using SMOTE will not improve the result. If your data consist of images then you could use ImageDataGenerator for replicating your data. Though I'm not sure matlab contains ImageDataGenerator.
You will use PCA, when you have lots of samples. Yet the samples are not directly effecting the accuracy but they are the components of data.
For instance: Let's consider handwritten digit classification data.
From above can we say each pixel is directly effecting the accuracy?
The answer is no? Above the black pixels are not important for the accuracy, therefore to remove them we use pca.
If you want a detailed explanation with a python example. Check out my other answer

SVM Matlab classification

I'm approaching a 4 class classification problem, it's not particularly unbalanced, no missing features a lot of observation.. It seems everything good but when I approach the classification with fitcecoc it classifies everything as part of the first class. I try. to use fitclinear and fitcsvm on one vs all decomposed data but gaining the same results. Do you have any clue about the reason of that problem ?

Here are a few recommendations:
Have you normalized your data? SVM is sensitive to the features being
from different scales.
Save the mean and std you obtain during the training and use
those values during the prediction phase for normalizing the test
samples.
Change the C value and see if that changes the results.
I hope these help.

Good MSE doesnt imply good prediction in logistic regression?

I am writing some code for regularized logistic regression. I observe this interesting phenomena and wonder if it is a normal thing or just my code is wrong.
For loss function, I am using the logistic loss function ( maximize likelihood of binary variables). To make prediction, predicted probabilities are obtained for new observations and AUC is used to find the best thresh-hold.
The funny thing is, I often run into cases where an estimated parameter has far better MSE (deviance) than another parameter on new observations but the prediction is worse ( a lot worse). So it seems to me that mean square error might not have anything to do with the prediction performance ( like the case with linear regression). Anyone saw the same thing?

Good results with NN, not with SVM; cause for concern?

I have painstakingly gathered data for a proof-of-concept study I am performing. The data consists of 40 different subjects, each with 12 parameters measured at 60 time intervals and 1 output parameter being 0 or 1. So I am building a binary classifier.
I knew beforehand that there is a non-linear relation between the input-parameters and the output so a simple perceptron of Bayes classifier would be unable to classify the sample. This assumption proved correct after initial tests.
Therefore I went to neural networks and as I hoped the results were pretty good. An error of about 1-5% is generally the result. The training is done by using 70% as training and 30% as evaluation. Running the complete dataset again (100%) through the model I was very happy with the results. The following is a typical confusion matrix (P = positive, N = negative):
P N
P 13 2
N 3 42
So I am happy and with the notion that I used a 30% for evaluation I am confident that I am not fitting noise.
Therefore I resolved to SVM for a double check and the SVM was unable to converge to a good solution. Most of the time the solutions are terrible (say 90% error...). Maybe I am not fully aware of SVM's or the implementations are not correct, but it troubles me because I thought that when NN provide a good solution, SVM's are most of the time better in seperating the data due to their maximum-margin hyperplane.
What does this say of my result? Am I fitting noise? And how do I know if this is a correct result?
I am using Encog for the calculations but the NN results are comparable to home-grown NN models I made.

If it is your first time to use SVM, I strongly recommend you to take a look at A Practical Guide to Support Vector Classication, by authors of a famous SVM package libsvm. It gives a list of suggestions to train your SVM classifier.
Transform data to the format of an SVM package
Conduct simple scaling on the data
Consider the RBF kernel
Use cross-validation to nd the best parameter C and γ
Use the best parameter C and γ
to train the whole training set
Test
In short, try scaling your data and carefully choosing the kernal plus the parameters.

Optimization of Neural Network input data

I'm trying to build an app to detect images which are advertisements from the webpages. Once I detect those I`ll not be allowing those to be displayed on the client side.
Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.
But in that dataset no. of attributes are very high. In fact one of the mentors of the project told me that If you train the Neural Network with that many attributes, it'll take lots of time to get trained. So is there a way to optimize the input dataset? Or I just have to use that many attributes?

1558 is actually a modest number of features/attributes. The # of instances(3279) is also small. The problem is not on the dataset side, but on the training algorithm side.
ANN is slow in training, I'd suggest you to use a logistic regression or svm. Both of them are very fast to train. Especially, svm has a lot of fast algorithms.
In this dataset, you are actually analyzing text, but not image. I think a linear family classifier, i.e. logistic regression or svm, is better for your job.
If you are using for production and you cannot use open source code. Logistic regression is very easy to implement compared to a good ANN and SVM.
If you decide to use logistic regression or SVM, I can future recommend some articles or source code for you to refer.

If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)
You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (use standard statistical methods for this) Are some features redundant? (Principal component analysis is a good stating point for this.) Don't expect the artificial neural network to do that work for you.
Also: remeber Duda&Hart's famous "no-free-lunch-theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda&Hart&Storks's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)

aplly a seperate ANN for each category of features
for example
457 inputs 1 output for url terms ( ANN1 )
495 inputs 1 output for origurl ( ANN2 )
...
then train all of them
use another main ANN to join results