Can a single-layer neural network with more than ten inputs separate data into two categories? - matlab

I have pattern data with 12 categories, and I want to separate these data into two categories. Is it possible to do this with a single-layer neural network with 12 input values plus a bias term? I have implemented it in MATLAB, but I am unsure what the best range of initial weight values and a suitable learning rate would be. Can you please guide me on these points?

Is a single layer enough?
Whether a single hidden layer suffices to correctly label your input data depends on the complexity of your data. You should empirically try different topologies (combinations of layers and number of neurons) until you discover a setting that works for you.
What are the best weight ranges?
The recommended weight range depends on the activation function you intend to use. For the sigmoid function, the range is a small interval centered around 0, e.g. [-0.1, 0.1].
What is the ideal learning rate?
The learning rate is often set to a small value such as 0.03, but if the data is easily learned by your network you can often increase the rate drastically, e.g. to 0.3. Check out this discussion on how learning rates affect the learning process.
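To make the three answers above concrete, here is a minimal sketch of a single sigmoid neuron with 12 inputs and a bias, using the suggested initial weight range [-0.1, 0.1] and learning rate 0.03. It is written in plain Python rather than MATLAB, and the toy data, the function names and the number of epochs are assumptions made for illustration only. A single layer like this can only learn a linear boundary, so it works here because the toy labels are linearly separable.

```python
import math
import random


def train_single_layer(samples, labels, lr=0.03, init_range=0.1, epochs=200, seed=0):
    """Train one sigmoid neuron (weights plus bias) with stochastic gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    # Small initial weights centred on 0, drawn from [-init_range, init_range].
    weights = [rng.uniform(-init_range, init_range) for _ in range(n_inputs)]
    bias = rng.uniform(-init_range, init_range)
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            out = 1.0 / (1.0 + math.exp(-z))
            # For a sigmoid output with cross-entropy loss, dLoss/dz = out - target.
            error = out - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the neuron's pre-activation is positive, else class 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z > 0 else 0


# Toy data (an assumption): 12 inputs in [-1, 1]; the class is 1 when the
# first six inputs sum to more than the last six.
data_rng = random.Random(1)
samples = [[data_rng.uniform(-1, 1) for _ in range(12)] for _ in range(200)]
labels = [1 if sum(x[:6]) > sum(x[6:]) else 0 for x in samples]

weights, bias = train_single_layer(samples, labels)
accuracy = sum(predict(weights, bias, x) == y for x, y in zip(samples, labels)) / len(samples)
```

If the training accuracy stays low on your own data with settings like these, that is a hint that the classes are not linearly separable and a hidden layer is worth trying.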
A side note
You should search the Web for a few pointers and tips, and post more focused questions on StackOverflow.


Use a trained neural network to imitate its training data

I'm in the early stages of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun, so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient ascent on the network's output to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
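The gradient-based search described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `score` is a tiny fixed network with made-up weights standing in for the trained classifier, the input is a 3-dimensional vector rather than real text, and the gradient is estimated by finite differences instead of backpropagation.

```python
import math
import random

# Hypothetical stand-in for a trained classifier: fixed, made-up weights.
HIDDEN_W = [[0.8, -0.5, 0.3], [-0.4, 0.9, 0.2]]
HIDDEN_B = [0.1, -0.2]
OUT_W = [1.5, -1.0]
OUT_B = 0.05


def score(x):
    """Return the classifier's 'looks like training data' score in (0, 1)."""
    hidden = [math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(HIDDEN_W, HIDDEN_B)]
    z = OUT_B + sum(w * h for w, h in zip(OUT_W, hidden))
    return 1.0 / (1.0 + math.exp(-z))


def numeric_gradient(f, x, eps=1e-5):
    """Estimate the gradient of f at x with central finite differences."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def ascend(f, x, step=0.5, iterations=100):
    """Climb f from x; a step is kept only if it improves the score."""
    best = f(x)
    for _ in range(iterations):
        grad = numeric_gradient(f, x)
        candidate = [xi + step * g for xi, g in zip(x, grad)]
        value = f(candidate)
        if value > best:
            x, best = candidate, value
        else:
            step *= 0.5  # overshoot: shrink the step and try again
    return x, best


rng = random.Random(0)
start = [rng.uniform(-1, 1) for _ in range(3)]
start_score = score(start)
found, found_score = ascend(score, start)
```

With real prose the hard part is that words are discrete, so the optimised vector still has to be mapped back to text somehow; the sketch does not address that.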
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Neural Network - Working with an imbalanced dataset

I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The class ratio is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100,000 samples for label '1'.
Given the large number of training samples, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a NN could handle this kind of imbalance efficiently on a large dataset?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha value. With alpha = 250, which is the imbalance ratio of the dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should: around 0.50 of the predictions are label '1'.
With alpha = 100, the model predicts only '0'. I guess I'll have to do some tuning of this alpha parameter :/
I'll also take a look at this TF function, as I did it manually for now: tf.nn.weighted_cross_entropy_with_logits
2) I will try to rebalance the dataset, but I am afraid that I will lose a lot of information doing that, as I have millions of samples but only ~100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways to handle an imbalanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
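As a rough illustration of the first option, the sketch below draws class-balanced batches of sample indices. The function name, the 50/50 split and the toy labels are assumptions; in a real pipeline the indices would be used to gather rows from the training set.

```python
import random


def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield lists of sample indices, half from class 1 and half from class 0."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    half = batch_size // 2
    for _ in range(n_batches):
        # Rare positives are drawn with replacement so they can be reused;
        # negatives are drawn without replacement within one batch.
        batch = rng.choices(pos, k=half) + rng.sample(neg, batch_size - half)
        rng.shuffle(batch)
        yield batch


# Toy labels at a 1000:4 ratio (an assumption matching the question).
labels = [1] * 4 + [0] * 1000
batches = list(balanced_batches(labels, batch_size=8, n_batches=5))
```

Because every batch is balanced, the model sees the positives far more often than their true frequency, so predicted probabilities will be skewed toward class 1 unless they are corrected afterwards.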
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or hinge loss you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say that the dominant class is labelled negative (neg) for softmax and the other positive (pos) (for hinge loss you could do exactly the same):
$L = L_{neg} + L_{pos} \;\Rightarrow\; L = L_{neg} + \alpha L_{pos}$
with $\alpha$ greater than 1.
In TensorFlow, for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:
cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))
What is more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
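To show what the re-weighting does to individual examples, here is a small framework-free sketch. It uses a single sigmoid output instead of the two-column softmax layout above, which is a simplification I am assuming for brevity. As far as I can tell, the formula matches the one documented for tf.nn.weighted_cross_entropy_with_logits, with `pos_weight` playing the role of alpha.

```python
import math


def weighted_cross_entropy(logit, label, pos_weight):
    """Sigmoid cross-entropy with the positive-class term scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1.0 - p))


alpha = 250.0  # the class ratio from the question, used here as the weight

# An equally confident mistake on each class:
loss_missed_positive = weighted_cross_entropy(-2.0, 1, alpha)  # true 1, predicted ~0.12
loss_missed_negative = weighted_cross_entropy(2.0, 0, alpha)   # true 0, predicted ~0.88
```

The missed positive costs alpha times as much as the equally wrong negative, which is why a large alpha pushes the model toward predicting 1 and why the value usually needs tuning.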
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - neural network could help in your case. There are at least two approaches to such problem:
Leave your dataset unchanged but decrease the batch size and the number of epochs. Apparently this might help more than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts to adjust itself to increase performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your scores during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k * original_percentage_of_class_k / balanced_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool.

In what order should we tune hyperparameters in Neural Networks?

I have a quite simple ANN using Tensorflow and AdamOptimizer for a regression problem and I am now at the point to tune all the hyperparameters.
For now, I saw many different hyperparameters that I have to tune :
Learning rate : initial learning rate, learning rate decay
The AdamOptimizer needs 4 arguments (learning-rate, beta1, beta2, epsilon) so we need to tune them - at least epsilon
Number of iterations
Lambda, the L2-regularization parameter
Number of neurons, number of layers
Kind of activation function for the hidden layers and for the output layer
Dropout parameter
I have 2 questions :
1) Do you see any other hyperparameter I might have forgotten ?
2) For now, my tuning is quite "manual" and I am not sure I am doing everything in a proper way.
Is there a special order to tune the parameters ? E.g learning rate first, then batch size, then ...
I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not independent ? Should we then tune them together ?
Is there any paper or article which talks about properly tuning all the parameters in a special order ?
Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve looks completely weird to me: its cost decreases much more slowly than the others, but it got stuck at a lower accuracy rate. Is it possible that the model is stuck in a local minimum?
For the learning rate, I used the decay :
LR(t) = LRI/sqrt(epoch)
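As a quick worked example of this decay, assuming epochs are counted from 1 and an initial rate LRI of 0.01 (both are my assumptions, not values from the experiments):

```python
import math


def decayed_learning_rate(initial_lr, epoch):
    """LR(t) = LRI / sqrt(epoch), with epochs counted from 1."""
    return initial_lr / math.sqrt(epoch)


# The rate halves by epoch 4 and drops by a factor of 10 by epoch 100.
schedule = [decayed_learning_rate(0.01, epoch) for epoch in (1, 4, 100)]
```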
Thanks for your help !
My general order is:
Batch size, as it will largely affect the training time of future experiments.
Architecture of the network:
Number of neurons in the network
Number of layers
Rest (dropout, L2 reg, etc.)
I'd assume that the optimal values of
learning rate and batch size
learning rate and number of neurons
number of neurons and number of layers
strongly depend on each other. I am not an expert in that field though.
As for your hyperparameters:
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
You could also tune the initial weight values (see All you need is a good init). Although, the Xavier initializer seems to be a good way to prevent having to tune the weight inits.
I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
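To show where the values quoted above enter, here is a minimal scalar sketch of the Adam update using eps = 1e-8, beta1 = 0.9, beta2 = 0.999 and a learning rate of 0.001. The toy objective, the function name and the step count are assumptions for illustration; in practice the framework's optimizer does this for every weight.

```python
import math


def adam_minimize(grad_fn, x, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=20000):
    """Minimise a scalar function with Adam, given a function for its gradient."""
    m = 0.0  # running mean of the gradient
    v = 0.0  # running mean of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x=0.0)
```

Note that Adam keeps two extra values (m and v) per parameter, which is where the memory comparison with RMSProp comes from.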
Get Tensorboard running. Plot the error there. You'll need to create subdirectories in the path where TB looks for the data to plot. I do that subdir creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
For parameters that are less important you can probably just pick a reasonable value and stick with it.
Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.
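The random-sampling procedure above could look roughly like this. Everything named here is hypothetical: `fake_validation_error` stands in for "train a model with these settings and return its validation error", and the ranges are placeholders. Sampling the learning rate on a log scale is my own choice, not something stated above.

```python
import math
import random


def random_search(evaluate, space, n_trials=50, seed=0):
    """Sample hyperparameters at random and keep the setting with the lowest error."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {
            # Log-uniform sampling: each order of magnitude is equally likely.
            "learning_rate": 10 ** rng.uniform(*space["log10_learning_rate"]),
            "batch_size": rng.choice(space["batch_size"]),
            "dropout": rng.uniform(*space["dropout"]),
        }
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fake_validation_error(params):
    """Hypothetical stand-in for training a model; lowest near lr = 1e-3, dropout = 0."""
    return (math.log10(params["learning_rate"]) + 3) ** 2 + params["dropout"]


space = {
    "log10_learning_rate": (-5.0, -1.0),
    "batch_size": [32, 64, 128, 256],
    "dropout": (0.0, 0.5),
}
best_params, best_score = random_search(fake_validation_error, space)
```

Logging every trial, not just the best one, is what makes the later analysis of which settings worked possible.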
I don't know of any tool specific to TensorFlow, but the best strategy is to start with basic hyperparameter values, such as a learning rate of 0.01 or 0.001 and a weight_decay of 0.005 or 0.0005, and then tune them. Doing it manually will take a lot of time; if you are using Caffe, the following is the best option: it takes the hyperparameters from a set of input values and gives you the best set.
For more information, you can follow this tutorial as well:
For the number of layers, I suggest you first build a smaller network and increase the amount of data; once you have sufficient data, increase the model complexity.
Before you begin:
Set batch size to maximal (or maximal power of 2) that works on your hardware. Simply increase it until you get a CUDA error (or system RAM usage > 90%).
Set regularizers to low values.
The architecture and exact numbers of neurons and layers - use known architectures as inspirations and adjust them to your specific performance requirements: more layers and neurons -> possibly a stronger, but slower model.
Then, if you want to do it one by one, I would go like this:
Tune learning rate in a wide range.
Tune other parameters of the optimizer.
Tune regularizers (dropout, L2 etc).
Fine tune learning rate - it's the most important hyper-parameter.

K-means analysis on the KDD Cup 99 dataset

What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup dataset and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inferences can be drawn from these graphs. Another is that, if I have chosen the wrong axes, which features should I have chosen instead?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be made using this data and plots?
Just to give you some domain knowledge: the KDD Cup dataset contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their unique 'fingerprint' (a unique combination of different feature values) that separates them from good ones.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
You can try k-means clustering to initially cluster the normal and bad connections. Also, the bad connections fall into 4 main categories themselves. So, you can try k = 5, where one cluster will capture the good ones and the other 4 the 4 malicious categories. Look at the first section of the tasks page for details.
You can also check if some dimensions in your dataset are highly correlated. If so, then you can use something like PCA to reduce the number of dimensions. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
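For reference, the clustering step mentioned above works roughly as in this plain-Python sketch. The 2-D toy points and k = 2 are assumptions chosen so the result is easy to check; for the KDD data you would pass full feature vectors and k = 5. The farthest-first initialisation is a deterministic choice made for this sketch, not part of standard k-means.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's k-means on a list of equal-length tuples."""
    # Farthest-first initialisation: start from the first point, then keep
    # adding the point that is farthest from all centres chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        assignments = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                       for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


# Two well-separated toy groups (made-up values, not KDD data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = kmeans(points, k=2)
```

Since k-means uses Euclidean distance, features on very different scales (byte counts versus rates in [0, 1]) should be normalised first, otherwise the large-valued features dominate.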
What should be the correct chosen feature?
This is hard to tell. Currently the data is very high dimensional, so I don't think visualizing two or three of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest:
Use all the dimensions for training and testing the model. This will give you a measure of the best performance.
Then try removing one dimension at a time to see how much the performance is affected. For example, you remove the dimension 'srv_serror_rate' from your data and the model performance comes out to be almost the same. Then you know this dimension is not giving you any important info about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.
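The three steps above can be sketched as a simple backward-elimination loop. The `evaluate` callback, the tolerance and the toy scoring function are all hypothetical; in practice `evaluate` would train and test your model on the given feature subset.

```python
def backward_elimination(features, evaluate, tolerance=0.01):
    """Drop features one at a time while performance stays near the full-feature baseline."""
    selected = list(features)
    baseline = evaluate(selected)  # step 1: performance with all dimensions
    removed_any = True
    while removed_any and len(selected) > 1:
        removed_any = False
        for feature in list(selected):
            trial = [f for f in selected if f != feature]
            # Step 2: remove one dimension and see whether performance holds.
            if evaluate(trial) >= baseline - tolerance:
                selected = trial
                removed_any = True
                break  # step 3: start again with the reduced set
    return selected


# Hypothetical scoring: pretend only two features carry any signal.
USEFUL = {"count": 0.5, "serror_rate": 0.4}


def fake_accuracy(selected):
    """Stand-in for training and testing a model on the selected features."""
    return sum(USEFUL.get(f, 0.0) for f in selected)


features = ["count", "serror_rate", "srv_count", "srv_serror_rate", "dst_host_count"]
kept = backward_elimination(features, fake_accuracy)
```

Each pass retrains the model once per remaining feature, so this gets expensive with many dimensions; the correlation check and PCA mentioned earlier can shrink the feature set first.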

Artificial Neural Networks: Choosing initial neurons

How is the initial structure (Neurons and connections between them) chosen? My book only states that we give the connection random weights in the beginning before we train the network.
I think that we would add neurons during the training like this:
Start with a completely empty network
The first value I generate during the training will not exist
Add a neuron to correspond to this value, with a random weight
What you are after is a self-organizing ANN. Usually, the way the connections are organized is man-made into a model that the developer thinks will have sufficient power to perform the necessary computation. You can of course start with a random selection of nodes with random connections, but the evolution of such a network will probably take a lot longer than that of a standard two- or three-layer network.
So, yes, you are right in that you would use a similar approach when doing a self-organizing network. Keep track of two sets of genetic algorithms, one for the structure and one for the weights (or combine the two in some devious way) and evolve as you please.
I do not believe the question is about self-organising or GA-evolved ANNs. It sounds more like it is about the most common ANN: a perceptron (single or multi-layer), in which case the structure of the network, i.e. the number of layers and the size of each layer, must be chosen by hand at the beginning. A simple initial rule of thumb for initialising the weights is to pick uniformly random values between -1.0 and 1.0.
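As a small illustration of that rule of thumb, the sketch below builds the weight tables for a hand-chosen structure and fills them with uniform random values in [-1.0, 1.0]. The layer sizes, the function name and the choice to store the bias as an extra column are assumptions made for this example.

```python
import random


def init_network(layer_sizes, low=-1.0, high=1.0, seed=0):
    """Create random weights for a fully connected network with the given layer sizes."""
    rng = random.Random(seed)
    weights = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One row per neuron in the next layer; the extra column is the bias weight.
        layer = [[rng.uniform(low, high) for _ in range(n_in + 1)] for _ in range(n_out)]
        weights.append(layer)
    return weights


# Hand-chosen structure: 12 inputs, one hidden layer of 5 neurons, 1 output.
network = init_network([12, 5, 1])
```

The structure is fixed before training starts; training then only changes the numbers inside these tables, not their shape.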