I am new to MATLAB and I can't find a solution to my problem...
What is the problem?
I have to create a neural network using MATLAB that will have almost 25k inputs and 10 outputs. There are also 300 patterns to learn.
When I was reading info about neural networks in MATLAB I saw that all input/learning data are in one matrix. That's fine for XOR or something similarly small. Then I realized that I would have to create a matrix that contains 25,000 * 300 elements (7.5 million integers).
1) Is there any possibility that I can expand the matrix by adding new rows (learning patterns)?
2) Or maybe there is something like:
learnPatternMatrix1 = [1, 2, 3, ..., 25000];
perfectOutputMatrix1 = [1, 2, 3, ... , 10];
network.addPattern(learnPatternMatrix1, perfectOutputMatrix1);
network.addPattern(learnPatternMatrix2, perfectOutputMatrix2);
% ...
network.addPattern(learnPatternMatrix300, perfectOutputMatrix300);
network.learn()?
Thanks for help ;)
I'm sorry, I don't have an answer for making MATLAB deal with a matrix of that size. I do have some comments which may be relevant to the problem, however.
Neural networks are, like most machine learning algorithms, unlikely to perform well when there is a large number of features (inputs) compared to the number of data points. Unless you have an order of magnitude or two more data points than the 25,000 features you describe, this approach may not work. You seem to have only 300 cases. Even support vector machines, supposedly robust to this problem, are unlikely to perform well under these conditions.
In the case of not having enough data for the number of features, you can think of it as guaranteed overfitting, as each data point will be uniquely situated and widely separated in feature space.
Have you considered feature reduction? That would solve your MATLAB problem, and it is likely to improve the performance of your ANN.
Related
I'm new to neural networks and I have this problem:
I have a dataset with 300 rows and 33 columns. Each row has 3 more columns for the results.
I'm trying to use an MLP to train a model so that when I have a new row, it estimates those 3 result columns.
I can easily reduce the error during training to 0.001, but when I use cross-validation it keeps estimating very poorly.
It estimates correctly if I use the same entries it was trained on, but if I use other values that weren't used for training, the results are very wrong.
I'm using two hidden layers with 20 neurons each, so my architecture is [33 20 20 3].
For the activation function I'm using the bipolar sigmoid function.
Do you have any suggestions on what I could try changing to improve this?
Overfitting
As mentioned in the comments, this perfectly describes overfitting.
I strongly suggest reading the Wikipedia article on overfitting, as it describes the causes well, but I'll summarize some key points here.
Model complexity
Overfitting often happens when your model is needlessly complex for the problem. I don't know anything about your dataset, but I'm guessing [33 20 20 3] is more parameters than necessary for the prediction.
Try running your cross-validation methods again, this time with either fewer layers or fewer nodes per layer. Right now you are using 33*20 + 20*20 + 20*3 = 1120 parameters (weights) to make your prediction; is that really necessary?
Regularization
A common solution to overfitting is regularization. The driving principle is KISS (keep it simple, stupid).
By applying an L1 regularizer to your weights, you express a preference for solving the problem with the smallest number of nonzero weights. The network will pull many weights to 0, as they aren't needed.
By applying an L2 regularizer to your weights, you express a preference for keeping all the weights small. Practically this means your weight values will be smaller numbers, and are less likely to be able to "memorize" the data.
What are L1 and L2? These are types of vector norms. L1 is the sum of the absolute values of your weights. L2 is the square root of the sum of squares of your weights (L3 is the cube root of the sum of the absolute values of the weights cubed, L4 ...).
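As a rough illustration of what these penalty terms look like when added to a loss, here is a minimal NumPy sketch; the names weights, data_loss, and lam are made up for this example, and note that frameworks often use the squared L2 norm in practice:
import numpy as np

weights = np.random.randn(33, 20)           # hypothetical weight matrix of one layer
data_loss = 0.42                            # placeholder value for the unregularized loss
lam = 1e-3                                  # regularization strength, needs tuning

l1_penalty = np.sum(np.abs(weights))        # L1 norm: sum of absolute values
l2_penalty = np.sqrt(np.sum(weights ** 2))  # L2 norm: sqrt of the sum of squares

loss_l1 = data_loss + lam * l1_penalty      # tends to push many weights exactly to 0
loss_l2 = data_loss + lam * l2_penalty      # tends to keep all weights small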
Distortions
Another commonly used technique is to augment your training data with distorted versions of your training samples. This only makes sense with certain types of data. For instance, images can be rotated, scaled, shifted, have Gaussian noise added, etc. without dramatically changing the content of the image.
By adding distortions, your network will no longer simply memorize your data, but will also learn to handle things that merely look similar to your data. The number 1 rotated 2 degrees still looks like a 1, so the network should be able to learn from both.
Only you know your data. If this is something that can be done with your data (even just adding a little Gaussian noise to each feature), then maybe this is worth looking into. But do not use this blindly without considering the implications it may have on your dataset.
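For tabular features, the "little Gaussian noise" idea might look like the following minimal NumPy sketch; the array shapes and the 5%-of-standard-deviation noise scale are assumptions, not something from the original post:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 33))           # hypothetical training features (300 samples, 33 inputs)
y = rng.integers(0, 2, size=(300, 3))    # hypothetical targets

# Distorted copies keep their labels; features are jittered by a small fraction of each column's std.
noise_scale = 0.05 * X.std(axis=0)
X_aug = np.concatenate([X, X + rng.normal(scale=noise_scale, size=X.shape)])
y_aug = np.concatenate([y, y])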
Careful analysis of data
I put this last because it is an indirect response to the overfitting problem. Check your data before pumping it through a black-box algorithm (like a neural network). Here are a few questions worth answering if your network doesn't work (a small sketch of a couple of these checks follows the list):
Are any of my features strongly correlated with each other?
How do baseline algorithms perform? (Linear regression, logistic regression, etc.)
How are my training samples distributed among classes? Do I have 298 samples of one class and 1 sample of the other two?
How similar are my samples within a class? Maybe I have 100 samples for this class, but all of them are the same (or nearly the same).
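Here is a small, non-authoritative NumPy sketch of a couple of those checks; the file names and the 0.95 correlation threshold are made up for illustration:
import numpy as np

X = np.loadtxt("features.csv", delimiter=",")  # hypothetical 300 x 33 feature matrix
y = np.loadtxt("labels.csv", delimiter=",")    # hypothetical class labels

# 1) Strongly correlated feature pairs.
corr = np.corrcoef(X, rowvar=False)            # feature-by-feature correlation matrix
i, j = np.where(np.triu(np.abs(corr) > 0.95, k=1))
print("highly correlated feature pairs:", list(zip(i.tolist(), j.tolist())))

# 2) How training samples are distributed among classes.
labels, counts = np.unique(y, return_counts=True)
print("samples per class:", dict(zip(labels.tolist(), counts.tolist())))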
I am working on a classification problem with 2 labels: 0 and 1. My training dataset is very imbalanced (and so will be the test set, considering my problem).
The proportion of the imbalanced dataset is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100,000 samples for label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for random forests. However, I was wondering whether a NN could be efficient at handling this kind of imbalance with such a large dataset?
Also, as I am using TensorFlow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation?
Thanks for your help!
Paul
Update:
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) This weekend I tried the 1st option, increasing the cost for the positive label. Actually, with a less imbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall balance.
However, for my situation,
It seems to be very sensitive to the alpha value. With alpha = 250, which is the imbalance ratio of the dataset, I get a precision of 0.006 and a recall of 0.83, but the model is predicting way more 1s than it should: around 50% of the predictions are label '1'...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too, as I did it manually for now: tf.nn.weighted_cross_entropy_with_logits.
2) I will try to rebalance the dataset, but I am afraid that I will lose a lot of information doing that, as I have millions of samples but only ~100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways to deal with an imbalanced dataset:
Online sampling, as mentioned above. In each iteration you sample a class-balanced batch from the training set (a small sketch follows this list).
Re-weight the cost of the two classes. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
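A minimal sketch of the first option, assuming the training data lives in NumPy arrays X and y with labels 0/1; the function name and batch size are made up for illustration:
import numpy as np

def balanced_batch(X, y, batch_size, rng):
    # Sample a batch containing (roughly) as many positives as negatives.
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos_idx, size=half, replace=True),               # oversample the rare class
        rng.choice(neg_idx, size=batch_size - half, replace=False)  # plenty of negatives to pick from
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Usage: X_batch, y_batch = balanced_batch(X_train, y_train, 256, np.random.default_rng(0))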
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say the dominant class is labelled negative (neg) for the softmax and the other one positive (pos) (for hinge loss you could do exactly the same):
L = L_{neg} + L_{pos}  =>  L = L_{neg} + \alpha * L_{pos}
With \alpha greater than 1.
In TensorFlow, for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:
cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))  # alpha up-weights the positive term
What's more, by digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
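For what it's worth, here is a minimal sketch of how that function could be called; the labels, logits, and the pos_weight of 250 (the assumed imbalance ratio) are illustrative values, not taken from the question:
import tensorflow as tf

labels = tf.constant([0., 0., 1., 0.])         # 0/1 targets, shape [batch]
logits = tf.constant([-2.3, -1.1, 0.4, -3.0])  # raw network outputs, shape [batch]

# Argument order: labels (called targets in older TF versions), logits, pos_weight.
# pos_weight > 1 makes errors on the rare positive class more costly.
loss = tf.nn.weighted_cross_entropy_with_logits(labels, logits, 250.0)
loss_mean = tf.reduce_mean(loss)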
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance, maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size, though, so you always get at least one of the rare samples in each batch.
Yes, a neural network could help in your case. There are at least two approaches to such a problem:
Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might help more than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it will start to adjust itself to increase its performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your scores during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may also reweight your classes in the cost function (as mentioned in one of the other answers). The important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I use a neural network with 3 layers for a categorization problem: 1) ~2k neurons 2) ~2k neurons 3) 20 neurons. My training set consists of 2 examples, and most of the inputs in each example are zeros. For some reason, after the backpropagation training the network gives virtually the same output for both examples (which is either valid for only one of the examples, or has 1.0 for the outputs where one of the examples has 1s). It comes to this state after the first epoch and doesn't change much afterwards, even if the learning rate is a minimal double value. I use sigmoid as the activation function.
I thought it could be something wrong with my code, so I used the AForge open-source library, and it seems to suffer from the same issue.
What might be the problem here?
Solution: I removed one layer and decreased the number of neurons in the hidden layer to 800.
2000 by 2000 by 20 is huge. That's approximately 4 million weights to determine, meaning the algorithm has to search a 4-million-dimensional space. Any optimization algorithm will be totally at a loss in this case. I'm assuming you're using gradient descent, which is not even that powerful, so likely the algorithm is stuck in a local optimum somewhere in this gigantic search space.
Simplify your model!
Added:
And please also describe in more detail what you're trying to do. Do you really have only 2 training examples? That's like trying to categorize 2 points using a 4-million-dimensional plane. It doesn't make sense to me.
You mentioned that most of the inputs are zero. To reduce the size of your search space, try removing redundancy in your training examples. For instance, if
trainingExample[0].inputValue[i] == trainingExample[1].inputValue[i]
then x.inputValue[i] carries no information-bearing data for the NN.
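A small NumPy sketch of that pruning step (the random matrix is just a stand-in for the real training data, which I have not seen):
import numpy as np

X = np.random.rand(2, 2000)              # hypothetical: one row per training example, one column per input
X[:, ::3] = 0.0                          # pretend many inputs are identical zeros in both examples

informative = np.any(X != X[0], axis=0)  # columns that actually differ between the examples
X_reduced = X[:, informative]            # only these inputs carry information the NN can use
print(X.shape, "->", X_reduced.shape)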
Also, perhaps it's not clear, but two training examples seems like very few.
I have been playing around with the SVM and I have stumbled upon something interesting.
It might be something I may be doing wrong, hence the post for comments and clarification.
I have data set of around 3000 x 30.
Each value is in the range -100 to 100. Plus, they are not integers; they are floating point numbers, and they are not evenly distributed.
It's like,
the numbers are -99.659, -99.758, -98.234, and then we won't have anything until something like -1.234, -1.345 and so on.
So even though the range is big, the data is clustered around at some points and they usually differ by fraction values.
(I thought, and from what my reading and understanding go, this shouldn't ideally affect the SVM classification accuracy. Please correct me if I am wrong, and do comment on whether I am right or wrong.)
My labels for the classification are 0 and 1.
So then I took test data of 30 x 30 and tried to test my SVM.
I am getting an accuracy of somewhere around 50% when the kernel_function is mlp.
With the other methods, I simply get 0s and NaNs as the result, which is weird, as no 1s were in the output, and I didn't understand the NaNs in the output labels.
So mlp was basically giving me the best results, and even that was just 50%.
I have then used 'QP' as the method with 'mlp' as the kernel_function, and the code has been running for about 8 hours now. I don't suppose something as small as 3400 x 30 should take that much time.
So the question really is, is the SVM a wrong choice for the data I have? (As asked above).
Or is there something I am missing out that is causing the accuracy to drop significantly?
Also, I know the input data is not screwed up, because I tested the same data using a neural network and I was able to get very good accuracy.
Is there a way to make the SVM work? Because, from what I have read on the internet, an SVM should generally work better than a neural network on this kind of labelling problem.
It sounds like you might be having some numerical stability problems that are being caused by the small size of the data clusters (although I'm not sure why that would be: it really shouldn't). SVM as an algorithm shouldn't care about the distributions you are describing; in fact, it should do a pretty good job under normal circumstances when presented with something so distinctly separated.
One thing to investigate is if any of your columns are very strongly correlated. Really strongly correlated column groups should be replaced by a single column for performance reasons and I have seen implementations that become numerically unstable when faced with almost perfect correlation in columns.
While independent features are nice, they are not necessary for the algorithm; after all, we are saying in advance that we do not know which features contribute what to the data. Are you scaling your data? Also, 30 data points is perhaps a little small for a test set. Can we see your code?
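Since scaling usually matters a lot for SVMs, especially with features that range from -100 to 100 but cluster in narrow bands, here is a minimal standardization sketch; the arrays are stand-ins, since I have not seen the actual data:
import numpy as np

X_train = np.random.uniform(-100, 100, size=(3000, 30))  # stand-in for the training features
X_test = np.random.uniform(-100, 100, size=(30, 30))     # stand-in for the test features

# Standardize each column using statistics from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
sigma[sigma == 0] = 1.0                  # guard against constant columns

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma    # apply the same transform to the test data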
I have 200 samples, each of which has 60 features. I use PCA to find the principal components. I use a neural network and also try k-nearest neighbors; however, the classification results are not good. I don't mind taking out some samples, but how can I tell which samples destroy my classification results? I know I can try them one by one, but that would be very inefficient. Please help.
Instead of throwing out some samples you need to throw out some attributes.
PCA computes a matrix with d x d entries. With 60 attributes, this matrix has 3600 entries. You have only 200 samples to estimate the contents of this matrix - no wonder the result is pretty much random. You need fewer variables and more data.
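One hedged way to act on this is to keep only the attributes most related to the label before doing anything else; a minimal scikit-learn sketch with made-up arrays (the choice of k = 10 is arbitrary):
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(200, 60)             # stand-in for the 200 samples with 60 attributes
y = np.random.randint(0, 2, size=200)   # stand-in for the class labels

# Keep only the k attributes most related to the label (univariate F-test)
# instead of feeding all 60 into PCA or the classifier.
selector = SelectKBest(f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print("kept attribute indices:", np.flatnonzero(selector.get_support()))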
This is a classical machine learning problem. There is always a risk with such a high number of features (in your case 60) and only 200 samples. Please check whether you have features which are redundant. Let me give an example.
Imagine we have to predict housing prices from the following features:
1. Size in m²
2. Number of bedrooms
3. House age
4. Size in ft²
Please note that here features number 1 and number 4 both give the same information, so they are redundant. At first this does not look that disturbing, but if you have data like that, it's better to remove those features.
Therefore, I would recommend that you look first at your features and then at your data. For more details, have a look at the Machine Learning class (by Prof. Ng) from Stanford, available on Coursera.