Neural Network theory to implementation mix up - matlab

I'm looking to create a neural network for the first time in MATLAB. As such I'm just a little confused and need some quick guidance. Below is an image:
The problem I'm currently having (or that needs verification) is with the values generated by my hidden layer that move on to my output layer: are these values 0's and 1's? That is, from u0 to unh, do these nodes output only 0's and 1's, or values in between 0 and 1 like 0.8, 0.4, etc.? My other question is about my output node: should it output a value between 0 and 1, so that an error can be found and used in the back propagation?
Like I said it's my first time doing this so I just need some guidance.

Not quite. The output of the hidden layer is like that of any other layer: each node gives a value within a range, not a bare 0 or 1. The output of any node in a neural network is usually restricted to the [0, 1] or the [-1, 1] range. Your output node will similarly output a value in a range, but that range is often thresholded to snap to 0 or 1 for simplicity of interpretation.
This, however, doesn't mean that the outputs are linearly distributed. Usually you have a sigmoid, or some other non-linear, activation which spreads more information through the middle ([-0.5, 0.5]) of the range rather than evenly across the domain. Sometimes specialty functions are used to detect certain patterns, such as sinusoids, though generally this is rarer and usually unnecessary.
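For intuition, here is a minimal sketch in plain Python/NumPy (not the MATLAB toolbox; the inputs and weights are made up for illustration) of what a single hidden node computes: a weighted sum squashed by a sigmoid, which gives a value strictly between 0 and 1, followed by the kind of thresholding often applied only at the output node.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs and weights for one hidden node
x = np.array([0.2, 0.7, 0.1])
w = np.array([0.5, -1.3, 0.8])
b = 0.1

hidden_out = sigmoid(np.dot(w, x) + b)
print(hidden_out)                # ~0.35, not just 0 or 1

# The final output node is often thresholded for interpretation:
prediction = 1 if hidden_out >= 0.5 else 0
```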

Related

Do I have to use a Scale-Layer after every BatchNorm Layer?

I am using Caffe, in detail pycaffe, to create my neural network. I noticed that I have to use a BatchNorm layer to get a positive result. I am using the kappa score as my result metric.
I have now seen several different placements of the BatchNorm layers in my network. But I also came across the Scale layer, which is not in the layer catalogue yet is often mentioned together with the BatchNorm layer.
Do you always need to put a Scale layer after a BatchNorm layer, and what does it do?
From the original batch normalization paper by Ioffe & Szegedy: "we make sure that the transformation inserted in the network can represent the identity transform." Without the Scale layer after the BatchNorm layer, that would not be the case because the Caffe BatchNorm layer has no learnable parameters.
I learned this from the Deep Residual Networks git repo; see item 6 under disclaimers and known issues there.
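As an illustration, here is how the BatchNorm-then-Scale pairing is commonly written with pycaffe's NetSpec (a sketch; the layer names and the preceding convolution are made up for the example):

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 3, 32, 32]))
n.conv1 = L.Convolution(n.data, num_output=16, kernel_size=3)

# Caffe's BatchNorm layer only normalizes; it has no learnable parameters.
n.bn1 = L.BatchNorm(n.conv1, in_place=True)
# Scale with bias_term=True supplies the learnable gamma and beta, so the
# pair together can represent the identity transform.
n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)

print(n.to_proto())
```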
In general, you will get no benefit from a Scale layer juxtaposed with batch normalization. Each is a linear transformation. Where BatchNorm shifts and rescales the data so that the new distribution has a mean of 0 and a variance of 1, Scale compresses the entire range into a specified interval, typically [0, 1]. Since they're both linear transformations, if you do them in sequence, the second will entirely undo the work of the first.
They also deal somewhat differently with outliers. Consider a set of data: ten values, five each of -1 and +1. BatchNorm will not change this at all: it already has mean 0 and variance 1. For consistency, let's specify the same interval for Scale, [-1, 1], which is also a popular choice.
Now add an outlier of, say, 99 to the mix. Scale will map the set onto the range [-1, 1], so that there are now five -1.00 values, one +1.00 value (the former 99), and five values of -0.96 (formerly +1).
BatchNorm cares about the mean and standard deviation, not the max and min values. The new mean is +9; the standard deviation is 28.48 (rounding everything to 2 decimal places). The numbers will be rescaled to roughly five values each of -0.35 and -0.28, and one value of 3.16.
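To check these numbers, a quick NumPy sketch (using the population standard deviation, as batch normalization does over a batch):

```python
import numpy as np

x = np.array([-1.0] * 5 + [1.0] * 5 + [99.0])

# Min-max scaling into [-1, 1] (the Scale behaviour described above)
scaled = -1 + 2 * (x - x.min()) / (x.max() - x.min())
print(np.round(scaled, 2))             # five -1.0, five -0.96, one 1.0

# Batch normalization: subtract the mean, divide by the std
print(x.mean(), np.round(x.std(), 2))  # 9.0, 28.48
print(np.round((x - x.mean()) / x.std(), 2))  # five -0.35, five -0.28, one 3.16
```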
Whether one scaling works better than the other depends greatly on the skew and scatter of your distribution. I prefer BatchNorm, as it tends to differentiate better in dense regions of a distribution.

Neural Network to identify Seven-Segment Numerals

I am studying machine learning and I am working on my first neural network as a project for one of my classes. I am programming the network in Java. The point of the network is to identify a seven-segment numeral (like on a regular digital clock). The network does not actually have to be linked to any real sensors; it just needs to work in theory, based on inputs of 0's and 1's in text form (not binary) which correspond to a hypothetical sensor matrix laid across the top of the number.
My question is, what sort of output am I looking to get?
Will the binary output just correspond to the same sort of matrix as the input, or is it supposed to represent the input number in binary, such as returning 111 for 7?
If it does just return another matrix, what is the point of the network?
The input for a seven-segment numeral would be a (1 X 7) vector, with 1 for segments that are on and 0 for segments that are off.
As for the output, you don't specify what you want it to be, so let's assume you want it to tell you "which digit is the screen showing". Since there are 10 digits (0 through 9), you have 10 possible answers. The output would be a (1 X 10) vector, with each position corresponding to one of the digits. Its value represents how confident the network is that this is the correct answer (typically the output values lie in [0, 1], but it depends on your setup). Ideally you would want the network to return a vector with 1 in one position and zeros in all the others.
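As a concrete sketch of these two encodings (the a-g segment ordering is an assumption; any fixed ordering works as long as it is consistent):

```python
# Segment order: a, b, c, d, e, f, g
SEGMENTS = {
    0: [1, 1, 1, 1, 1, 1, 0],
    1: [0, 1, 1, 0, 0, 0, 0],
    7: [1, 1, 1, 0, 0, 0, 0],
    # ... remaining digits omitted for brevity
}

def one_hot(digit, num_classes=10):
    # Target vector: 1 at the digit's index, 0 everywhere else
    v = [0] * num_classes
    v[digit] = 1
    return v

x = SEGMENTS[7]   # network input, the (1 x 7) segment vector
t = one_hot(7)    # training target, the (1 x 10) confidence vector
print(x, t)
```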
Note, however, that in this case a classifier is not useful. A classification algorithm generalizes from what it has seen in the past. So it would be useful for handwriting recognition, because even if the same person writes the same digit twice, it is not exactly the same. In your case, each digit is identical across all 7-segment displays, so your network is not really learning; rather, it is memorizing the input.

sigmoid - back propagation neural network

I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to start small and learn on a simple problem first.
I created a network using back propagation: an input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), and an output layer (1 node), which uses sigmoid as the activation function for all layers. I'm trying to test it first on a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-infinity, +infinity). So when I pass these values to my network, my error function would be something like (target - network output). Would that be correct or accurate, given that I'm taking the difference between the network output (which ranges from 0 to 1) and the target output (which can be a large number)?
I've read that the solution is to normalise first, but I'm not really sure how to do this. Should I normalise both the input and target output values before feeding them to the network? Which normalisation function is best to use, since I have read about several different methods? After getting the optimized weights and using them to test some data, I'm getting an output value between 0 and 1 because of the sigmoid function. Should I revert the computed values to the un-normalised/original form? Or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and am not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input to the interval (0, 1). It's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to use a linear activation after your last hidden layer (which in your case is also your first :)).
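A minimal sketch of that change in NumPy (the weights here are hypothetical placeholders): the hidden layer keeps a bounded nonlinearity, while the output layer is left linear so the network can produce any real value of c.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)  # hidden layer: bounded nonlinearity
    return W2 @ h + b2        # output layer: linear, so unbounded (regression)

# Hypothetical weights for a 2-2-1 network like the one described
W1, b1 = np.ones((2, 2)), np.zeros(2)
W2, b2 = np.ones((1, 2)), np.zeros(1)
print(forward(np.array([3.0, 4.0]), W1, b1, W2, b2))
```

With a sigmoid on the last line instead, the network could never output a c larger than 1, no matter what a and b are.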
I would advise you not to use the sigmoid function as the activation in your hidden layers either. It's much better to use tanh or ReLU nonlinearities. A detailed explanation (as well as some useful tips if you want to keep sigmoid as your activation) can be found here.
It's also important to understand that the architecture of your network is not suitable for the task you are trying to solve. You can get a sense of what different networks can learn here.
As for normalization: the main reason to normalize your data is to avoid giving any spurious prior knowledge to your network. Consider two variables: age and income. The first varies from, e.g., 5 to 90; the second from, e.g., 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of the random initialization). Now suppose you are trying to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how to normalize your input data. One is to squash all inputs into the [0, 1] interval. Another is to transform every variable to have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
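Here is a sketch of both rules of thumb in NumPy (applied column-wise, assuming one feature per column):

```python
import numpy as np

def min_max(X):
    # Squash each column into the [0, 1] interval
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def z_score(X):
    # Give each column mean 0 and standard deviation 1
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[5.0, 1000.0], [40.0, 25000.0], [90.0, 100000.0]])
print(min_max(X))
print(z_score(X))
```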
When it comes to the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple-regression case), but it's not as crucial as it is for the inputs.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them on the training set only, and then apply the same parameters to the training, validation, and test sets alike.
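For example, continuing with the z-score rule (a sketch; the arrays here are dummy stand-ins for your real a, b, c data): store the training-set statistics, reuse them on the test set, and invert them on the network's output to get c back in its original units.

```python
import numpy as np

# Dummy targets standing in for your real c values
y_train = np.array([5.0, 13.0, 25.0, 41.0])
y_test = np.array([10.0, 17.0])

# Statistics computed on the training set ONLY
t_mean, t_std = y_train.mean(), y_train.std()

y_train_norm = (y_train - t_mean) / t_std  # targets used during training
y_test_norm = (y_test - t_mean) / t_std    # same parameters reused

# After prediction, undo the normalization to recover real values
net_output = y_test_norm                   # stand-in for the network's output
c_pred = net_output * t_std + t_mean
print(c_pred)                              # back on the original scale
```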

Interpret the output of neural network in matlab

I have built a neural network model with 3 classes. I understand that the ideal output for a classification process is boolean: 1 for one class and 0 for the other classes. For example, the best classification result for data belonging to the first class would be the vector [1, 0, 0]. But the output on the testing data will not be like that; instead it will be real numbers like [2.4, -1, 0.6]. So how do I interpret this result? How do I decide which class the testing data belongs to?
I have tried taking the absolute value and turning the maximum element into 1 and the others into zeros. Is this correct?
Learner.
It appears your neural network is badly designed.
Regardless of your structure (the number of input, hidden, and output layers), when you are doing a multi-class classification problem you must ensure that each of your output neurons evaluates an individual class, that is, that each of them has a bounded output, in this case between 0 and 1. Almost any of the standard squashing functions on the output layer will accomplish this.
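For instance, here is one common way to bound and interpret the raw scores from the question (a sketch; softmax is one standard choice, and argmax then picks the class):

```python
import numpy as np

raw = np.array([2.4, -1.0, 0.6])  # unbounded scores from the question

# Softmax squashes the scores into (0, 1) and makes them sum to 1,
# so each output reads as a confidence for its class
p = np.exp(raw) / np.sum(np.exp(raw))
print(np.round(p, 2))             # [0.83 0.03 0.14]

# Pick the class with the highest score -- no absolute value needed,
# since a large negative score means "confidently NOT this class"
print(int(np.argmax(raw)))        # class 0
```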
Nevertheless, for the neural network to work properly, you must keep firmly in mind that every single neuron path from input to output operates as a classifier: it defines a region of your input space that will be classified together.
Under this framework, every single neuron has a directly interpretable meaning in the non-linear expansion the NN is defining, particularly when there are few hidden layers. This is ensured by the general expression of a neural network:
Y_out = F_n(Y_(n-1) * w_n - t_n)
...
Y_1 = F_0(Y_in * w_0 - t_0)
For example, with radial basis neurons, i.e. F_n(Y_n) = sqrt(sum_i (Y_(n,i) - R_(n,i))^2) with w_n = 1 (identity):
Y_(n+1) = sqrt(sum_i (Y_(n,i) - R_(n,i))^2)
a classification into dn-dimensional spherical clusters (dn being the dimension of layer n-1) is induced from the first layer; similarly, elliptical clusters can be induced. When two radial-basis layers are stacked under that structure of spherical/elliptical clusters, unions and intersections of spherical/elliptical clusters are induced; three layers give unions and intersections of those, and so on.
When using linear neurons, i.e. F_n = (.) (identity), linear classifiers are induced: the input space is divided by dn-dimensional hyperplanes. Adding a second layer induces unions and intersections of hyperplanes; three layers give unions and intersections of those, and so on.
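As a small sketch of the two neuron types just described (NumPy; the center R, weights w, and threshold t are hypothetical parameters):

```python
import numpy as np

def radial_basis_neuron(y, R):
    # Distance to a center R: small inside the sphere around R, large
    # outside, hence the spherical decision regions described above
    return np.sqrt(np.sum((y - R) ** 2))

def linear_neuron(y, w, t):
    # Signed distance to the hyperplane w . y = t, hence the
    # hyperplane decision regions described above
    return np.dot(w, y) - t

y = np.array([0.5, 1.0])
print(radial_basis_neuron(y, R=np.array([0.0, 1.0])))    # 0.5
print(linear_neuron(y, w=np.array([1.0, -1.0]), t=0.0))  # -0.5
```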
Hence, you can see that the number of neurons per layer is the number of classifiers needed per class. So if the geometry of the space is, to put it really graphically, two clusters for class A, one cluster for class B, and three clusters for class C, you will need at least six neurons per layer. Thus, if you have no prior expectations about that geometry, you can consider as a very rough approximation about n to n^2 neurons per class per layer as a minimum. This number can be increased or decreased according to the topology of the classification.
Finally, the best advice here, for n outputs (classes) and r inputs, is:
Have r good classifier neurons on the first layer, radial or linear, to segment the space according to your expectations,
Have n to n^2 neurons per layer, or as the difficulty of your problem demands,
Have 2-3 layers, and only increase this number after getting clear results,
Have n thresholding neurons on the last layer, only one layer, as a continuous function from 0 to 1 (apply the crisp thresholding in code).
Cheers...

Artificial neural network presented with unclassified inputs

I am trying to classify portions of time series data using a feed-forward neural network with 20 neurons in a single hidden layer and 3 outputs corresponding to the 3 events I would like to be able to recognize. There are many other things I could classify in the data (obviously), but I don't really care about them for the time being. Neural network creation and training have been performed using MATLAB's neural network toolbox for pattern recognition, as this is a classification problem.
To do this, I sequentially populate a moving window and then feed the window into the neural network. The issue I have is that I am obviously not able to classify and train on every possible shape the time series takes on. Because of this, I often get windows filled with data that look very different from the windows I used to train the neural network, but I still get outputs near 1.
Essentially, the 3 things I trained the ANN with are windows from 20 different data sets corresponding to three shapes: steady state, a curve that starts with a negative slope and levels off to 0 slope (essentially the left half of a parabola that opens upward), and a curve that starts at 0 slope and quickly declines (the right half of a parabola that opens downward).
Am I incorrect in thinking that if I input data that doesn't correspond to any of the items I trained the ANN with, it should output values near 0 for all outputs?
Or is it likely due to the fact that these basically cover all the bases of steady state, increasing and decreasing, despite large differences in slope, and therefore something is always classified?
I guess I just need a nudge in the right direction.
Neural network output values
A neural network cannot guarantee specific output values for inputs (or expected outputs) that were never presented during the training period.
A neural network will not consistently output 0 for untrained input values.
A solution is to simply also train the network on an array of input values that should result in the network outputting 0.
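A sketch of that idea (NumPy; the window size, class count, and random "junk" windows are made-up placeholders): append negative examples whose target vector is all zeros, so the network learns an explicit "none of the above" response.

```python
import numpy as np

rng = np.random.default_rng(0)
window_len, n_classes = 20, 3

# Hypothetical labelled windows for the 3 known events
X_known = rng.normal(size=(300, window_len))
y_known = np.eye(n_classes)[rng.integers(0, n_classes, 300)]

# Negative examples: shapes that match none of the events,
# paired with an all-zero target vector
X_junk = rng.normal(size=(100, window_len)) * 5
y_junk = np.zeros((100, n_classes))

X_train = np.vstack([X_known, X_junk])
y_train = np.vstack([y_known, y_junk])
```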