Neural networks for an imbalanced dataset - MATLAB

I have a very imbalanced dataset of 186,219 rows by 6 dimensions, with 132 true positives against 186,087 false positives. What types of neural network would you recommend trying? The spreadsheet IPDC_algorithm_training_dataset in my Google Drive contains my training dataset. If a feature's value in the output tab is 100, that feature is a true positive; if its value is 0, that feature is a false positive.
I am tied to MATLAB at the moment, so it would be more convenient for me to use MATLAB for this problem.

With a dataset that imbalanced you have limited options. If you trained a neural network on the entire dataset, it'd achieve 99.9% accuracy just by always predicting false positives. You need to deal with that imbalance in some way, such as discarding (vast swathes of) false positive samples or weighting your loss function to account for the imbalance. With an imbalance as extreme as this, you'd probably need to apply both (along with regularisation to prevent overfitting the remaining data).
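Here is one way those two remedies might look in code – a minimal sketch in Python/Keras, chosen only to keep the example compact (MATLAB's toolboxes have analogous class/error-weighting mechanisms, though the exact functions differ). The stand-in data, the 20:1 undersampling ratio, the layer sizes and the L2 penalty are illustrative assumptions only:

```python
# Hedged sketch: random undersampling of the majority class plus a class-weighted
# loss. The arrays below are synthetic stand-ins with the question's class counts
# (186,087 negatives vs 132 positives); all hyperparameters are placeholders.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.normal(size=(186219, 6)).astype("float32")   # stand-in for the 6-dimensional features
y = np.zeros(186219, dtype="int64")
y[:132] = 1                                           # 132 true positives

# 1) discard most of the negatives (keep e.g. 20 negatives per positive)
pos_idx = np.where(y == 1)[0]
neg_idx = rng.choice(np.where(y == 0)[0], size=20 * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, neg_idx])
X_bal, y_bal = X[keep], y[keep]

# 2) weight the loss so the remaining positives still count more
n_pos, n_neg = (y_bal == 1).sum(), (y_bal == 0).sum()
class_weight = {0: 1.0, 1: n_neg / n_pos}

# small MLP with L2 regularisation, as suggested above
model = tf.keras.Sequential([
    tf.keras.Input(shape=(6,)),
    tf.keras.layers.Dense(16, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])
model.fit(X_bal, y_bal, epochs=10, batch_size=64,
          class_weight=class_weight, verbose=0)
```

The key points are that most negatives are discarded before training and the loss still up-weights the few positives that remain; track precision and recall rather than accuracy.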
In terms of what network type to use, you could try a basic MLP (multi-layer perceptron), at least as a baseline – there's no point building a more complicated architecture, with more parameters to train, on such a limited dataset.
In reality, you'd probably be better off using a shallow learning algorithm, such as boosted trees or naive Bayes, or getting more data to enable use of a neural network. If new data is likely to remain this imbalanced, you'd need a very large amount of extra data.
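For comparison, here is a hedged sketch of such a shallow baseline – gradient-boosted trees with per-sample weights – reusing the toy X and y arrays from the sketch above; the split, weighting scheme and hyperparameters are again illustrative assumptions:

```python
# Hedged sketch: a shallow-learning baseline (gradient-boosted trees) on the
# imbalanced data, with per-sample weights standing in for a class-weighted loss.
# X and y are the synthetic stand-in arrays from the previous sketch.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# up-weight the rare positives so each class contributes roughly equally to the loss
w = np.where(y_tr == 1, (y_tr == 0).sum() / (y_tr == 1).sum(), 1.0)

clf = HistGradientBoostingClassifier(max_iter=200, random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w)

# accuracy is meaningless here; report precision/recall on the positive class instead
p, r, f, _ = precision_recall_fscore_support(y_te, clf.predict(X_te), average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Accuracy is deliberately not reported: with roughly 0.07% positives it says almost nothing, whereas precision and recall on the positive class do.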

Related

Dimensionality reduction, normalization, resampling, k-fold CV... In what order?

In Python, I am working on a binary classification problem: fraud detection for travel insurance. Here are the characteristics of my dataset:
It contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
Majority unlabeled: out of 40,000 samples, 33,000 are unlabeled.
Highly imbalanced: out of 7,000 labeled samples, only 800 (11%) are positive (fraud).
The metrics are precision, recall and the F2 score. We focus more on avoiding false negatives, therefore high recall is appreciated. As preprocessing, I oversampled the positive cases using SMOTE-NC, which also takes categorical variables into account.
After trying several approaches, including semi-supervised learning with self-training and label propagation/label spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on testing). I understand that precision is good on the training data because it is resampled, and low on the test data because it directly reflects the class imbalance there. But this precision score is unacceptably low, so I want to fix it.
So, to simplify the model, I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD (Factor Analysis of Mixed Data).
Question 1: In what order should I do normalization, FAMD, k-fold cross-validation and resampling? Is my approach below correct?
Question 2: The package prince does not seem to have methods such as fit or transform as in sklearn, so I cannot do the 3rd step described below. Are there other good packages that provide fit and transform for FAMD? And is there any other good way to reduce dimensionality for this kind of dataset?
My approach (sketched in code after this list):
1. Make k folds and hold one of them out for validation; use the rest for training
2. Fit normalization on the training folds and transform the validation fold
3. Fit FAMD on the training folds, and transform both the training and validation data
4. Resample only the training data using SMOTE-NC
5. Train whatever model it is and evaluate on the validation fold
6. Repeat steps 2-5 k times and take the average of the precision, recall and F2 scores
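A minimal sketch of that loop, assuming scikit-learn and imbalanced-learn. X and y here are synthetic stand-ins for the 7,000 labeled, encoded samples, PCA stands in for FAMD purely as a runnable placeholder (swap in whichever FAMD implementation you settle on), and the numeric-column indices are made up for illustration:

```python
# Hedged sketch of the per-fold ordering in steps 1-5 above: everything is
# fit on the training folds only, and only the training folds are resampled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, fbeta_score
from imblearn.over_sampling import SMOTE

# stand-in for the 7,000 labeled samples with 50 encoded features, ~11% positive
X, y = make_classification(n_samples=7000, n_features=50, weights=[0.89], random_state=0)
num_cols = list(range(4))      # assumption: the first 4 columns are the numeric ones

scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):                       # step 1: make folds
    X_tr, X_val = X[train_idx].copy(), X[val_idx].copy()
    y_tr, y_val = y[train_idx], y[val_idx]

    scaler = StandardScaler().fit(X_tr[:, num_cols])             # step 2: fit on training folds only
    X_tr[:, num_cols] = scaler.transform(X_tr[:, num_cols])
    X_val[:, num_cols] = scaler.transform(X_val[:, num_cols])

    reducer = PCA(n_components=10).fit(X_tr)                     # step 3: stand-in for FAMD
    X_tr, X_val = reducer.transform(X_tr), reducer.transform(X_val)

    # step 4: resample ONLY the training folds; the reduced components are
    # continuous, so plain SMOTE is used here (SMOTE-NC belongs in the original
    # mixed numeric/categorical space, i.e. before the reduction)
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)      # step 5: any model
    y_pred = clf.predict(X_val)
    scores.append((precision_score(y_val, y_pred),
                   recall_score(y_val, y_pred),
                   fbeta_score(y_val, y_pred, beta=2)))

print(np.mean(scores, axis=0))                                    # step 6: average over folds
```

One wrinkle with the listed order: once the features are reduced to continuous components, plain SMOTE applies; if you want SMOTE-NC specifically, it would have to run before the reduction.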
*I would also appreciate any kind of advice on my overall approach to this problem.
Thanks!

TensorFlow: Binary classification accuracy

In the context of binary classification, I use a neural network with 1 hidden layer using a tanh activation function. The input comes from a word2vec model and is normalized.
The classifier accuracy is between 49% and 54%.
I used a confusion matrix to get a better understanding of what's going on. I am studying the impact of the number of input features and the number of neurons in the hidden layer on the accuracy.
What I can observe from the confusion matrix is that, depending on the parameters, the model sometimes predicts most of the rows as positive and sometimes most of them as negative.
Any suggestions as to why this happens? And what other factors (besides input size and hidden layer size) might impact the accuracy of the classification?
Thanks
It's a bit hard to guess given the information you provide.
Are the labels balanced (50% positives, 50% negatives)? If so, this would mean your network is not training at all, as your performance roughly corresponds to random guessing. Is there maybe a bug in the preprocessing? Or is the task too difficult? What is the training set size?
I don't believe that the number of neurons is the issue, as long as it's reasonable, i.e. hundreds or a few thousand.
Alternatively, you can try another loss function, namely cross entropy, which is standard for multi-class classification and can also be used for binary classification:
https://www.tensorflow.org/api_docs/python/nn/classification#softmax_cross_entropy_with_logits
Hope this helps.
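For reference, here is a minimal sketch of that suggestion in current TensorFlow/Keras (the linked API path is from an older TF release): a one-hidden-layer tanh network trained with cross-entropy in its binary form. The data and all sizes below are placeholders, not values taken from the question:

```python
# Hedged sketch: 1-hidden-layer tanh network with a binary cross-entropy loss
# computed from logits. The arrays are random stand-ins for the word2vec features.
import numpy as np
import tensorflow as tf

X = np.random.randn(10000, 300).astype("float32")     # stand-in for word2vec vectors
y = (np.random.rand(10000) > 0.5).astype("float32")   # balanced 0/1 labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(300,)),
    tf.keras.layers.Dense(200, activation="tanh"),
    tf.keras.layers.Dense(1),   # raw logit; the loss applies the sigmoid internally
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              # threshold 0.0 on logits corresponds to probability 0.5
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])
model.fit(X, y, epochs=5, batch_size=128, validation_split=0.1, verbose=0)
```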
The data set is well balanced, 50% positive and negative.
The training set shape is (411426, X).
The test set shape is (68572, X).
X is the number of features coming from word2vec, and I tried values between 100 and 300.
I have 1 hidden layer, and the number of neurons I tested varied between 100 and 300.
I also tested with much smaller feature/neuron sizes: 2-20 features and 10 neurons in the hidden layer.
I also use cross entropy as the cost function.

Accuracy of Neural Network Output - MATLAB ANN Toolbox

I'm relatively new to the MATLAB ANN Toolbox. I am training the NN for pattern recognition with a 3x8670 target matrix containing 1s and 0s, using one hidden layer with 40 neurons and otherwise default settings. When I get the simulated output for a new set of inputs, the values are around 0 and 1. I then arrange them in descending order and choose a fixed number (which is known to me) out of the 8,670 observations to be 1 and the rest to be zero.
Every time I run the program, the first row of the simulated output always has close to 100% accuracy, but the following rows don't exhibit the same kind of accuracy.
Is there a logical explanation in general? I understand that answering this query conclusively might require understanding the program and the problem, but it is made up of too many functions to explain clearly. Can I make some changes to the training to get consistent output?
If you have any suggestions please share it with me.
Thanks,
Nishant
Your problem statement is not clear to me. For example, what do you mean by: "I then arrange them in descending order and choose a fixed number ..."?
As I understand it, you did not get appropriate output from your NN compared to the real target, i.e. your NN output differs from the target. If so, there are several possibilities to consider:
How do you divide the training/test/validation sets for the training phase? The largest share should be assigned to training (around 75%), with the rest split between test and validation.
What does your training data set look like? Does it cover most of the scenarios you expect? If your training data is not reasonably similar to your test data (e.g. the test set contains new records/samples that did not appear, even approximately, in the training phase), those samples act as outliers and an NN cannot handle them efficiently; you would need a clustering approach rather than an NN classification approach. In that case the NN output is out of range and cannot provide the accuracy you need. An NN works well when there is no large difference between the training and test data sets; otherwise it is not appropriate.
Sometimes you have an appropriate training data set, but the problem is the training itself. In that case you need other types of models, because feed-forward NNs such as the MLP do not handle compact, poorly separated regions of data very well; you need stronger function approximators such as RBF networks or SVMs.
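To make that last point concrete, here is a small, hedged comparison of an MLP against an RBF-kernel SVM, written with scikit-learn purely for brevity; the data is synthetic 3-class filler, and the 40-neuron hidden layer is the only detail borrowed from the question:

```python
# Hedged sketch: compare a 40-neuron MLP against an RBF-kernel SVM baseline
# on synthetic 3-class data. Everything here is a placeholder, not the asker's setup.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_classes=3, n_clusters_per_class=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(40,), max_iter=1000, random_state=0))
svm = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", C=1.0, gamma="scale"))

for name, clf in [("MLP", mlp), ("RBF-SVM", svm)]:
    clf.fit(X_tr, y_tr)
    print(name, "test accuracy:", round(clf.score(X_te, y_te), 3))
```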

Issues with neural network

I am having some issues with a neural network. I am using a non-linear activation function for the hidden layer and a linear function for the output layer. Adding more neurons to the hidden layer should increase the capacity of the NN and make it fit the training data better, i.e. have less error on the training data.
However, I am seeing a different phenomenon: adding more neurons decreases the accuracy of the neural network even on the training set.
Here is the graph of the mean absolute error with increasing number of neurons. The accuracy on the training data is decreasing. What could be the cause of this?
Could it be that the nntool I am using in MATLAB splits the data randomly into training, test and validation sets to check generalization, instead of using cross-validation?
Also, as I add neurons I see lots of negative output values, while my targets are supposed to be positive. Could that be another issue?
I am not able to explain the behavior of NN here. Any suggestions? Here is the link to my data consisting of the covariates and targets
https://www.dropbox.com/s/0wcj2y6x6jd2vzm/data.mat
I am unfamiliar with nntool but I would suspect that your problem is related to the selection of your initial weights. Poor initial weight selection can lead to very slow convergence or failure to converge at all.
For instance, notice that as the number of neurons in the hidden layer increases, the number of inputs to each neuron in the output layer also increases (one per hidden unit). Say you are using a logistic (sigmoid) activation in your hidden layer (whose output is always positive) and pick your initial weights from a uniform distribution over a fixed interval. Then as the number of hidden units increases, the total input to each neuron in the output layer also increases, because there are more incoming connections. With a very large number of hidden units, your initial solution may become very large and result in poor convergence.
Of course, how this all behaves depends on your activation functions, the distribution of the data and how it is normalized. I would recommend looking at Efficient BackProp by Yann LeCun for some excellent advice on normalizing your data and selecting initial weights and activation functions.
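A tiny numpy sketch of that effect, under assumed dimensions (100 inputs, sigmoid hidden units, weights drawn uniformly from [-0.5, 0.5]): the typical magnitude of the summed input to an output neuron grows with the number of hidden units when the initial weights come from a fixed interval, but stays roughly constant when they are scaled by 1/sqrt(fan_in), in the spirit of the initialization advice in Efficient BackProp:

```python
# Hedged sketch: magnitude of an output neuron's pre-activation at initialization,
# with fixed-interval vs fan-in-scaled initial output weights. All sizes are assumed.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)                          # one input sample (100 assumed inputs)

for n_hidden in (10, 100, 1000):
    W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, 100))
    h = 1.0 / (1.0 + np.exp(-W1 @ x))             # sigmoid hidden activations, all positive

    draws = rng.uniform(-0.5, 0.5, size=(200, n_hidden))          # 200 random output-weight vectors
    fixed = np.abs(draws @ h).mean()                              # fixed-interval initial weights
    scaled = np.abs((draws / np.sqrt(n_hidden)) @ h).mean()       # fan-in-scaled initial weights
    print(f"{n_hidden:5d} hidden units  fixed: {fixed:6.2f}  scaled: {scaled:6.2f}")
```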

What's usual success rate for neural network models?

I am building a system with an NN trained for classification.
I am interested in the error rates of the systems you have built.
A classic example from the UCI ML repository is the Iris data set.
An NN trained on it is almost perfect, with an error rate of 0-1%; however, it is a very basic dataset.
My network has the following structure: 80 inputs, 208 hidden neurons, 2 outputs.
My result is an 8% error rate on the testing dataset.
Basically, in this question I want to ask about the various results you have encountered in your work, in papers, etc.
Addition 1:
the error rate is of course on the testing data, not the training data, so it is a completely new dataset for the network
Addition 2 (from my comment under the question):
My new results: 1,200 entries, 900 training, 300 testing. 85 in Class1, 1,115 in Class2. Of the 85 Class1 entries, 44 are in the testing set. Error rate: 6%. That is not so bad, because 44 is ~15% of 300, so I am 2.5 times better than the trivial baseline.
Model performance is completely problem-specific. Even among situations with similar quality and volumes of development data, with identical target variable definitions, performance can vary substantially. Obviously, the more similar the problem definitions, the more likely the performance of different models are to match.
Another thing to consider is the difference between technical performance and business performance. In some applications, an accuracy of 52% is tremendously profitable, whereas in other areas an accuracy of 98% would be hopelessly low.
Let me also add that besides what Predictor mentions, measuring your performance on the training set is usually useless as a guide to determine how your classifier would perform on previously unseen data. Many times with relatively simple classifiers you can get 0% error rate on the training set without learning anything useful (this is called overfitting).
What is more commonly used (and more helpful in determining how your classifier works) is either held-out data or cross-validation; better still, separate your data into three sets: training, validation and testing.
Also, it is very hard to get a sense of how well a classifier works from a single threshold, reporting only true positives + true negatives. People tend to also evaluate false positives and false negatives and plot ROC curves to see and evaluate the tradeoff. So, in saying "2.5 times better", you should be clear that you are comparing to a classifier that classifies everything as Class2, which is a pretty weak baseline.
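As a hedged illustration of that evaluation style, here is a small scikit-learn sketch comparing a network against the trivial majority-class baseline using error rate, ROC AUC and the full confusion matrix. The synthetic data only mimics the rough class proportions from the question (85 vs 1,115 in 1,200 samples), and the 80-208-2 architecture is approximated with MLPClassifier:

```python
# Hedged sketch: evaluate a classifier against the "everything is Class2" baseline
# with more than a single accuracy number. Data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

# ~7% minority class (like Class1), 80 features, 1,200 samples, 300 held out for testing
X, y = make_classification(n_samples=1200, n_features=80, weights=[0.93], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=300, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
net = MLPClassifier(hidden_layer_sizes=(208,), max_iter=1000, random_state=0).fit(X_tr, y_tr)

for name, clf in [("majority baseline", baseline), ("NN", net)]:
    y_prob = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: error rate {1 - clf.score(X_te, y_te):.3f}, "
          f"ROC AUC {roc_auc_score(y_te, y_prob):.3f}")
    print(confusion_matrix(y_te, clf.predict(X_te)))
```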
See for example this paper:
Danilo P. Mandic and Jonathon A. Chambers (2000). Towards the Optimal Learning Rate for Backpropagation. Neural Processing Letters 11: 1–5.