Handling binary target variables: encoding, activation function and loss function in classification

When we have non-ordinal binary class labels, say ['black', 'black', 'white'], there are two scenarios:
One can either encode the labels as integers ([0, 0, 1]), or
encode them with one-hot encoding ([[1, 0], [1, 0], [0, 1]]).
It should not matter for binary cases. However, some say that if the labels are encoded as integers (via e.g. LabelEncoder()), this would imply that 'black' and 'white' have an order (white > black), when they are actually nominal. My first question is: is this concern valid? Should we strictly use one-hot encoding when handling nominal targets?
Second, does the encoding strategy change whether we use softmax or sigmoid in the output layer? When using softmax, there would be 2 output nodes and the loss function should be sparse categorical cross-entropy, whereas for sigmoid there would be 1 node and binary cross-entropy, respectively, right?
As you can see, I am a beginner, and it's confusing that different tutorials favor different settings for binary classification. Currently I am using LabelEncoder() and encoding my targets as 1D integers only, with 1 node and a sigmoid activation in the output layer and binary cross-entropy as the loss. I've tried the other mentioned settings combinatorially, but the results vary and this was the best. I am concerned that it's wrong and that I should have used one-hot encoding instead.
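For reference, here is a minimal sketch of the three combinations discussed above (assuming scikit-learn and Keras; the toy labels and features are made up). For two classes, the sigmoid/1-node and softmax/2-node setups are mathematically equivalent; what changes is which cross-entropy variant matches the target format:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from tensorflow import keras

    labels = np.array(['black', 'black', 'white'])   # nominal targets
    X = np.random.rand(3, 4)                         # made-up features

    # Integer targets, 1 sigmoid node, binary cross-entropy.
    y_int = LabelEncoder().fit_transform(labels)     # -> [0, 0, 1]
    model_a = keras.Sequential([
        keras.layers.Dense(8, activation='relu', input_shape=(4,)),
        keras.layers.Dense(1, activation='sigmoid')])
    model_a.compile(optimizer='adam', loss='binary_crossentropy')

    # Integer targets, 2 softmax nodes, sparse categorical cross-entropy.
    model_b = keras.Sequential([
        keras.layers.Dense(8, activation='relu', input_shape=(4,)),
        keras.layers.Dense(2, activation='softmax')])
    model_b.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    # One-hot targets, 2 softmax nodes, categorical cross-entropy.
    y_onehot = keras.utils.to_categorical(y_int)     # -> [[1, 0], [1, 0], [0, 1]]
    model_c = keras.Sequential([
        keras.layers.Dense(8, activation='relu', input_shape=(4,)),
        keras.layers.Dense(2, activation='softmax')])
    model_c.compile(optimizer='adam', loss='categorical_crossentropy')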

Related

Rescale Input Feature in a Neural Network

I'm reading through the Make Your Own Neural Network book, and in the example showing how to classify handwritten digits, the text says that the input color values, which are in the range from 0 to 255, will be rescaled to the much smaller range between 0.01 and 1.0. A few questions on this!
What speaks against using the actual range, which is 0 to 255? What would rescaling bring me?
Does this mean that if I rescale my training set and train my model with this rescaled data, I should then also use rescaled test data?
Any arguments please?
Rescaling the data will lead to faster convergence when using methods like gradient descent. Also, when your dataset has features with highly varying magnitudes, methods that rely on Euclidean distance can give poor results. To avoid this, scaling the features to the range between 0.0 and 1.0 is a wise choice.
For the second question: yes, you should rescale the test data as well, using the same transformation you applied to the training data.
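As a concrete illustration, here is a minimal sketch with NumPy (the pixel arrays are made up; the (x / 255) * 0.99 + 0.01 mapping follows the convention used in the book to avoid exact zeros):

    import numpy as np

    def rescale_pixels(x):
        # Map raw 0-255 pixel values into the range [0.01, 1.0].
        return (x / 255.0) * 0.99 + 0.01

    X_train_raw = np.array([[0, 128, 255]], dtype=float)   # made-up training pixels
    X_test_raw = np.array([[64, 200, 10]], dtype=float)    # made-up test pixels

    X_train = rescale_pixels(X_train_raw)   # -> [[0.01, 0.507, 1.0]]
    X_test = rescale_pixels(X_test_raw)     # the SAME transformation is applied to the test data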

Normalization before data split in Neural Network

I am trying to run an MLP regressor with one hidden layer on my dataset. I am standardizing my data, but I want to be clear about whether it matters if I do the standardization before or after splitting the dataset into training and test sets. I want to know if there will be any difference in my prediction values if I carry out standardization before the data split.
Yes and no. If the mean and variance of the training and test set are different, standardization can lead to a different outcome.
That being said, a good training and test set should be similar enough that the data points are distributed in a similar way, and post-split standardization should then give the same results.
You should absolutely do it before splitting.
Imagine having [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as your inputs, which get split into [1, 2, 3, 4, 5, 7, 9, 10] for training and [6, 8] for testing.
It's immediately clear that the min-max ranges, as well as the mean and standard deviation, of the two samples are completely different, so by applying standardization "post-split" you are completely scrambling the relationship between the values in the first and the second set.
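A quick NumPy sketch of that example makes the difference in numbers visible (the statistics in the comments are approximate):

    import numpy as np

    data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
    train = np.array([1, 2, 3, 4, 5, 7, 9, 10], dtype=float)
    test = np.array([6, 8], dtype=float)

    # Standardizing before the split: one mean/std for everything.
    z_all = (data - data.mean()) / data.std()       # mean 5.5, std ~2.87

    # Standardizing after the split: each subset gets its own mean/std.
    z_train = (train - train.mean()) / train.std()  # mean 5.125, std ~3.06
    z_test = (test - test.mean()) / test.std()      # mean 7.0, std 1.0

    # The same raw value ends up with very different z-scores in the two schemes.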

Do I have to use a Scale-Layer after every BatchNorm Layer?

I am using Caffe, specifically pycaffe, to create my neural network. I noticed that I have to use a BatchNorm layer to get a positive result. I am using the Kappa score as my evaluation metric.
I have now seen several different locations for the BatchNorm layers in my network. But I also came across the Scale layer, which is not in the Layer Catalogue but is often mentioned together with the BatchNorm layer.
Do you always need to put a Scale layer after a BatchNorm layer, and what does it do?
From the original batch normalization paper by Ioffe & Szegedy: "we make sure that the transformation inserted in the network can represent the identity transform." Without the Scale layer after the BatchNorm layer, that would not be the case because the Caffe BatchNorm layer has no learnable parameters.
I learned this from the Deep Residual Networks git repo; see item 6 under disclaimers and known issues there.
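For reference, a minimal pycaffe NetSpec sketch of the usual pattern (the layer names and toy convolution are made up; the point is the Scale layer with bias_term=True, which provides the learnable gamma and beta that Caffe's BatchNorm lacks):

    import caffe
    from caffe import layers as L

    n = caffe.NetSpec()
    n.data = L.Input(shape=dict(dim=[1, 3, 224, 224]))
    n.conv1 = L.Convolution(n.data, num_output=16, kernel_size=3, pad=1)
    # Normalizes activations; no learnable parameters in Caffe's BatchNorm.
    n.bn1 = L.BatchNorm(n.conv1, in_place=True)
    # Learnable per-channel scale (gamma) and shift (beta).
    n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)
    n.relu1 = L.ReLU(n.scale1, in_place=True)

    print(n.to_proto())   # emits the corresponding prototxt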
In general, you will get no benefit from a scale layer juxtaposed with batch normalization. Each is a linear transformation. Where BatchNorm translates and rescales the data so that the new distribution has a mean of 0 and variance of 1, Scale compresses the entire range into a specified interval, typically [0, 1]. Since they're both linear transformations, if you do them in sequence, the second will entirely undo the work of the first.
They also deal somewhat differently with outliers. Consider a set of data: ten values, five each of -1 and +1. BatchNorm will not change this at all: it already has mean 0 and variance 1. For consistency, let's specify the same interval for Scale, [-1, 1], which is also a popular choice.
Now, add an outlier of, say 99 to the mix. Scale will transform the set to the range [-1, 1] so that there are now five -1.00 values, one +1.00 value (the former 99), and five values of -0.96 (formerly +1).
BatchNorm worries about the mean and standard deviation, not the max and min values. The new mean is +9; the S.D. is 28.48 (rounding everything to 2 decimal places). The numbers will be scaled to roughly five values each of -0.35 and -0.28, and one value of 3.16.
Whether one scaling works better than the other depends much on the skew and scatter of your distribution. I prefer BatchNorm, as it tends to differentiate better in dense regions of a distribution.
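The arithmetic above is easy to reproduce with a small NumPy sketch (here "Scale" means min-max rescaling to [-1, 1], as in the example, not Caffe's learnable Scale layer):

    import numpy as np

    x = np.array([-1.0] * 5 + [1.0] * 5 + [99.0])

    # Min-max rescaling to [-1, 1].
    minmax = -1 + 2 * (x - x.min()) / (x.max() - x.min())
    print(np.round(minmax, 2))            # five -1.0, five -0.96, one 1.0

    # BatchNorm-style standardization: zero mean, unit variance.
    standardized = (x - x.mean()) / x.std()
    print(x.mean(), np.round(x.std(), 2)) # 9.0, 28.48
    print(np.round(standardized, 2))      # roughly five -0.35, five -0.28, one 3.16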

Neural Network theory to implementation mix up

I'm looking to create a neural network for the first time in MATLAB. As such, I'm just a little confused and need some quick guidance. Below is an image:
The thing I'm currently having trouble with / need verification on is the values that are generated by my hidden layer and passed to my output layer: are these values 0s and 1s? I.e., from u0 to unh, do these nodes output 0s and 1s, or values in between 0 and 1 like 0.8, 0.4, etc.? Another question: should my output node then output a value between 0 and 1, so that an error can be computed and used in the back propagation?
Like I said, it's my first time doing this, so I just need some guidance.
Not quite: the output of the hidden layer is like that of any other layer, and each node gives a value within a range. The output of any node in a neural network is thus usually restricted to the [0, 1] or the [-1, 1] range. Your output node will similarly output a range of values, but that range is oftentimes thresholded to snap to 0 or 1 for simplicity of interpretation.
This, however, doesn't mean that the outputs are linearly distributed. Usually you have a sigmoid, or some other non-linear activation, which spreads more information through the middle range, [-0.5, 0.5], than evenly across the domain. Sometimes specialty functions are used to detect certain patterns, such as sinusoids, though generally this is rarer and usually unnecessary.
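To make that concrete, here is a tiny NumPy forward pass (the weights are made up; sigmoid activations throughout), showing that hidden and output nodes emit real values in (0, 1), which can be thresholded afterwards if a hard 0/1 decision is wanted:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = rng.random(3)                     # made-up input vector
    W_hidden = rng.normal(size=(4, 3))    # 4 hidden nodes
    W_out = rng.normal(size=(1, 4))       # 1 output node

    hidden = sigmoid(W_hidden @ x)        # values like 0.8, 0.4, ... not just 0 or 1
    output = sigmoid(W_out @ hidden)      # a value in (0, 1), used directly for the error
    prediction = (output > 0.5).astype(int)  # optional thresholding for interpretation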

ANN multiple vs single outputs

I recently started studying ANNs, and there is something I've been trying to figure out that I can't seem to find an answer to (probably because it's too trivial or because I'm searching for the wrong keywords...).
When do you use multiple outputs instead of a single output? I guess in the simplest case of 1/0 classification it's easiest to use the "sign" as the output activation function. But in which cases do you use several outputs? Is it when you have, for instance, a multiclass classification problem, so you want to classify something as, say, A, B, or C, and you choose one output neuron for each class? How do you determine which class it belongs to?
In a classification context, there are a couple of situations where using multiple output units can be helpful: multiclass classification, and explicit confidence estimation.
Multiclass
For the multiclass case, as you wrote in your question, you typically have one output unit in your network for each class of data you're interested in. So if you're trying to classify data as one of A, B, or C, you can train your network on labeled data, but convert all of your "A" labels to [1 0 0], all your "B" labels to [0 1 0], and your "C" labels to [0 0 1]. (This is called a "one-hot" encoding.) You also probably want to use a logistic activation on your output units to restrict their activation values to the interval (0, 1).
Then, when you're training your network, it's often useful to optimize a "cross-entropy" loss (as opposed to a somewhat more intuitive Euclidean distance loss), since you're basically trying to teach your network to output the probability of each class for a given input. Often one uses a "softmax" (also sometimes called a Boltzmann) distribution to define this probability.
For more info, please check out http://www.willamette.edu/~gorr/classes/cs449/classify.html (slightly more theoretical) and http://deeplearning.net/tutorial/logreg.html (more aimed at the code side of things).
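A minimal sketch of that recipe (assuming Keras; the three-class labels and features below are made up):

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(6, 5)                    # made-up data: 6 samples, 5 features
    y = np.array([0, 1, 2, 0, 1, 2])            # classes A/B/C encoded as 0/1/2
    y_onehot = keras.utils.to_categorical(y)    # A -> [1, 0, 0], B -> [0, 1, 0], C -> [0, 0, 1]

    model = keras.Sequential([
        keras.layers.Dense(10, activation='relu', input_shape=(5,)),
        keras.layers.Dense(3, activation='softmax')])  # one output unit per class
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(X, y_onehot, epochs=5, verbose=0)

    probs = model.predict(X)            # per-class probabilities for each sample
    pred_class = probs.argmax(axis=1)   # index of the most probable class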
Confidence estimation
Another cool use of multiple outputs is to use one output as a standard classifier (e.g., just one output unit that generates a 0 or 1), and a second output to indicate the confidence that this network has in its classification of the input signal (e.g., another output unit that generates a value in the interval (0, 1)).
This could be useful if you trained up a separate network on each of your A, B, and C classes of data, but then also presented data to the system later that came from class D (or whatever) -- in this case, you'd want each of the networks to indicate that they were uncertain of the output because they've never seen something from class D before.
Have a look at the softmax layer, for instance. The maximum output of this layer is your class, and it has a nice theoretical justification.
To be concise: you take the previous layer's output and interpret it as a vector in m-dimensional space. After that you fit K Gaussians to it, which share covariance matrices. If you model this and write out the equations, it amounts to a softmax layer. For more details see "Machine Learning: A Probabilistic Perspective" by Kevin Murphy.
This is just an example of using the last layer for multiclass classification. You can also use multiple outputs for something else. For instance, you can train an ANN to "compress" your data, that is, to compute a function from an N-dimensional to an M-dimensional space that minimizes the loss of information (this model is called an autoencoder).
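A minimal autoencoder sketch along those lines (assuming Keras; the choice of N = 20 input dimensions compressed to M = 3 is made up):

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(100, 20)   # made-up 20-dimensional data

    autoencoder = keras.Sequential([
        keras.layers.Dense(3, activation='relu', input_shape=(20,)),   # encoder: N -> M
        keras.layers.Dense(20, activation='linear')])                  # decoder: M -> N
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(X, X, epochs=10, verbose=0)   # target = input: minimize reconstruction error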