The number of hidden nodes in autoencoder with small number of features - neural-network

I have a data set which have 2 features and 10000 samples. I would like to convert(integrate) these two features into one feature, for further analysis. So I want to use feature extraction method. As the relationship between two features are not linear, I want to use methods other than conventional PCA.
Because the number of samples are much larger than that of features, I think autoencoder is a good way for feature extraction. But the input feature is only 2, then the shape of autoencoder will be only 2-1-2, which is a linear extraction.
Is it possible to set hidden nodes more than the number of inputs and make stacked autoencoder, such as 2-16-8-1-8-16-2 nodes?
Also, it a good choice to use autoencoder for this kind of data integration? If not, are there any better solutions?

Why would this be a linear extraction? If you use any non-regularity in the hidden and output layer you will get a non-linear relationship between them. Your encoding will in essential be sigmoid(Ax + b).
If you truly want to make your network more complex I would suggest using multiple 2 neuron layers before the single neuron layer. So something like this 2 - 2 - 2 - 1 - 2 - 2 - 2 nodes. I do not see any reason why you would need to make it larger.

Related

Is it possible to use evaluation metrics (like NDCG) as a loss function?

I am working on a Information Retrieval model called DPR which is a basically a neural network (2 BERTs) that ranks document, given a query. Currently, This model is trained in binary manners (documents are whether related or not related) and uses Negative Log Likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (like 3 grades: relevant, somehow relevant, not relevant). I have to change the loss function because currently, I can only assign 1 positive target for each query (DPR uses pytorch NLLLoss) and this is not what I need.
I was wondering if I could use a evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. I mean, the whole point of a loss function is to tell how off our prediction is and NDCG is doing the same.
So, can I use such metrics in place of loss function with some modifications? In case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning to rank approach instead of the more standard pairwise loss function.
In pairwise loss, the network is provided with example pairs (rel, non-rel) and the ground-truth label is a binary one (say 1 if the first among the pair is relevant, and 0 otherwise).
In the listwise learning approach, however, during training you would provide a list instead of a pair and the ground-truth value (still a binary) would indicate if this permutation is indeed the optimal one, e.g. the one which maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification of the permutations.
For more details, refer to this paper.
Obviously, the network instead of taking features as input may take BERT vectors of queries and the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from 2 docs (pairwise training), for listwise training u need to feed in vectors from say 5 documents.

Using neural networks (MLP) for estimation

Im new with NN and i have this problem:
I have a dataset with 300 rows and 33 columns. Each row has 3 more columns for the results.
Im trying to use MLP for trainning a model so that when i have a new row, it estimates those 3 result columns.
I can easily reduce the error during trainning to 0.001 but when i use cross validation it keep estimating very poorly.
It estimates correctly if i use the same entry it used to train, but if i use another values that werent used for trainning the results are very wrong
Im using two hidden layers with 20 neurons each, so my architecture is [33 20 20 3]
For activation function im using biporlarsigmoid function.
Do you guys have some suggestion on where i could try to change to improve this?
Overfitting
As mentioned in the comments, this perfectly describes overfitting.
I strongly suggest reading the wikipedia article on overfitting, as it well describes causes, but I'll summarize some key points here.
Model complexity
Overfitting often happens when you model is needlessly complex for the problem. I don't know anything about your dataset, but I'm guessing [33 20 20 3] is more parameters than necessary for predicting.
Try running your cross-validation methods again, this time with either fewer layers, or fewer nodes per layer. Right now you are using 33*20 + 20*20 + 20*3 = 1120 parameters (weights) to make your prediction, is this necessary?
Regularization
A common solution to overfitting is regularization. The driving principle is KISS (keep it simple, stupid).
By applying an L1 regularizer to your weights, you keep preference for the smallest number of weights to solve your problem. The network will pull many weights to 0 as they aren't need.
By applying an L2 regularizer to your weights, you keep preference for lower rank solutions to your problem. This means that your network will prefer weights matrices that span lower dimensions. Practically this means your weights will be smaller numbers, and are less likely to be able to "memorize" the data.
What is L1 and L2? These are types of vector norms. L1 is the sum of the absolute value of your weights. L2 is the sqrt of the sum of squares of your weights. (L3 is the cubed root of the sum of cubes of weights, L4 ...).
Distortions
Another commonly used technique is to augment your training data with distorted versions of your training samples. This only makes sense with certain types of data. For instance images can be rotated, scaled, shifted, add gaussian noise, etc. without dramatically changing the content of the image.
By adding distortions, your network will no longer memorize your data, but will also learn when things look similar to your data. The number 1 rotated 2 degrees still looks like a 1, so the network should be able to learn from both of these.
Only you know your data. If this is something that can be done with your data (even just adding a little gaussian noise to each feature), then maybe this is worth looking into. But do not use this blindly without considering the implications it may have on your dataset.
Careful analysis of data
I put this last because it is an indirect response to the overfitting problem. Check your data before pumping it through a black-box algorithm (like a neural network). Here are a few questions worth answering if your network doesn't work:
Are any of my features strongly correlated with each other?
How do baseline algorithms perform? (Linear regression, logistic regression, etc.)
How are my training samples distributed among classes? Do I have 298 samples of one class and 1 sample of the other two?
How similar are my samples within a class? Maybe I have 100 samples for this class, but all of them are the same (or nearly the same).

How is the number of hidden and output neurons calculated for a neural network?

I'm very new to neural networks but I am trying to create one for optical character recognition. I have 100 images of every number from 0-9 in the size of 24x14. The number of inputs for the neural network are 336 but I don't know how to get the number of hidden neurons and output neurons.
How do I calculate it?
While for the output neurons the number should be equal to the number of classes you want to discriminate, for the hidden layer, the size is not so straight forward to set and it is mainly dependent on the trade-off between complexity of the model and generalization capabilities (see https://en.wikipedia.org/wiki/Artificial_neural_network#Computational_power).
The answers to this question can help:
training feedforward neural network for OCR
The number of output neurons is simply your number of classes (unless you only have 2 classes and are not using the one-hot representation, in which case you can make do with just 2 output neuron).
The number of hidden layers, and subsequently number of hidden neurons is not as straightforward as you might think as a beginner. Every problem will have a different configuration that will work for it. You have to try multiple things out. Just keep this in mind though:
The more layers you add, the more complex your calculations become and hence, the slower your network will train.
One of the best and easiest practices is to keep the number of hidden neurons fixed in each layer.
Keep in mind what hidden neurons in each layer mean. The input layer is your starting features and each subsequent hidden layer is what you do with those features.
Think about your problem and the features you are using. If you are dealing with images, you might want a large number of neurons in your first hidden layer to break apart your features into smaller units.
Usually you results would not vary much when you increase the number of neurons to a certain extent. And you'll get used to this as you practice more. Just keep in mind the trade-offs you are making
Good luck :)

How can I add concurrency to neural network processing?

The basics of neural networks, as I understand them, is there are several inputs, weights and outputs. There can be hidden layers that add to the complexity of the whole thing.
If I have 100 inputs, 5 hidden layers and one output (yes or no), presumably, there will be a LOT of connections. Somewhere on the order of 100^5. To do back propagation via gradient descent seems like it will take a VERY long time.
How can I set up the back propagation in a way that is parallel (concurrent) to take advantage of multicore processors (or multiple processors).
This is a language agnostic question because I am simply trying to understand structure.
If you have 5 hidden layers (assuming with 100 nodes each) you have 5 * 100^2 weights (assuming the bias node is included in the 100 nodes), not 100^5 (because there are 100^2 weights between two consecutive layers).
If you use gradient descent, you'll have to calculate the contribution of each training sample to the gradient, so a natural way of distributing this across cores would be to spread the training sample across the cores and sum the contributions to the gradient in the end.
With backpropagation, you can use batch backpropagation (accumulate weight changes from several training samples before updating the weights, see e.g. https://stackoverflow.com/a/11415434/288875 ).
I would think that the first option is much more cache friendly (updates need to be merged only once between processors in each step).

Neural Network Input Bias in MATLAB

In Matlab (Neural Network Toolbox + Image Processing Toolbox), I have written a script to extract features from images and construct a "feature vector". My problem is that some features have more data than others. I don't want these features to have more significance than others with less data.
For example, I might have a feature vector made up of 9 elements:
hProjection = [12,45,19,10];
vProjection = [3,16,90,19];
area = 346;
featureVector = [hProjection, vProjection, area];
If I construct a Neural Network with featureVector as my input, the area only makes up 10% of the input data and is less significant.
I'm using a feed-forward back-propogation network with a tansig transfer function (pattern-recognition network).
How do I deal with this?
When you present your input data to the network, each column of your feature vector is fed to the input layer as an attribute by itself.
The only bias you have to worry about is the scale of each (ie: we usually normalize the features to the [0,1] range).
Also if you believe that the features are dependent/correlated, you might want to perform some kind of attribute selection technique. And in your case it depends one the meaning of the hProj/vProj features...
EDIT:
It just occurred to me that as an alternative to feature selection, you can use a dimensionality reduction technique (PCA/SVD, Factor Analysis, ICA, ...). For example, factor analysis can be used to extract a set of latent hidden variables upon which those hProj/vProj depends on. So instead of these 8 features, you can get 2 features such that the original 8 are a linear combination of the new two features (plus some error term). Refer to this page for a complete example