Deep Belief Network inference: hidden layers need random number generator? - neural-network

I am learning about Deep Belief Networks (DBNs) and Restricted Boltzmann Machines (RBMs).
When training a DBN (CD-1, greedy, layer-wise), should the inputs to the second, third, and nth RBMs be stochastic binary values (0 or 1) rather than probabilities?
As for the inference process in a DBN, are the hidden units also stochastic binary rather than probabilities? Can sigmoid(Σ(W*v) + b) be used directly as the input to the layer immediately above? Or do I need a random number generator to obtain stochastic binary states for the hidden units, and then use those h values as inputs to the layer above?
Could someone please explain this to me?
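For concreteness, here is a minimal NumPy sketch (my own illustration; the function name and shapes are assumptions, not from any DBN library) of the two alternatives being asked about, i.e. passing the probabilities up versus sampling binary states from them:

    import numpy as np

    rng = np.random.default_rng(0)

    def up_pass(v, W, b, sample=True):
        """One RBM up-pass from visible activities v to hidden activities.

        v : (n_visible,) data vector or output of the layer below
        W : (n_visible, n_hidden) weights; b : (n_hidden,) hidden biases
        """
        p_h = 1.0 / (1.0 + np.exp(-(v @ W + b)))  # sigmoid(v*W + b)
        if sample:
            # stochastic binary states: h_j = 1 with probability p_h[j]
            return (rng.random(p_h.shape) < p_h).astype(float)
        return p_h  # mean-field alternative: pass the probabilities up

A common convention (see e.g. Hinton's practical guide to training RBMs) is to use sampled binary hidden states during CD training, but to propagate the real-valued probabilities (sample=False) for a deterministic up-pass at inference time, sampling only if you want to average over several stochastic passes.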

Related

Neural Network XOR not converging

I have tried to implement a neural network in Java by myself to act as an XOR gate, and it has sort of worked. About 20% of the time when I train it, the weights converge to produce a good enough output (RMS < 0.05), but the other 80% of the time they don't.
The neural network (can be seen here) is composed of 2 inputs (+ 1 bias), 2 hidden units (+ 1 bias) and 1 output unit. The activation function I used was the sigmoid function
1 / (1 + e^-x)
which maps the input values to between 0 and 1. The learning algorithm used is stochastic gradient descent with RMS as the cost function. The bias neurons have a constant output of 1. I have tried changing the learning rate between 0.1 and 0.01, but that doesn't seem to fix the problem.
I had the network track the weights and RMS and plotted them on a graph. There are basically three different behaviours the weights can show; I can only post one of the three:
two (or more) weights diverge in different directions
Of the other two behaviours, one is the weights converging to a good value and the other is a random wiggle of one weight.
I don't know if this is just something that happens or if there is some way to fix it, so please tell me if you know anything.
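Not having the original code, here is a self-contained NumPy sketch of the same 2-2-1 sigmoid network trained with SGD from repeated random initializations (all names and constants are mine, chosen for illustration); it reproduces the behaviour described above, with only some restarts reaching a low RMS:

    import numpy as np

    rng = np.random.default_rng(42)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # XOR training set: inputs and targets
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    def train_once(lr=0.5, epochs=10_000):
        # 2-2-1 network; each bias is folded in as an extra weight row
        W1 = rng.uniform(-1, 1, (3, 2))  # (2 inputs + bias) -> 2 hidden
        W2 = rng.uniform(-1, 1, (3, 1))  # (2 hidden + bias) -> 1 output
        for _ in range(epochs):
            i = rng.integers(len(X))             # stochastic: one sample
            x = np.append(X[i], 1.0)             # input + bias unit
            h = np.append(sigmoid(x @ W1), 1.0)  # hidden + bias unit
            y = sigmoid(h @ W2)
            # backprop for squared error with sigmoid units
            d_out = (y - T[i]) * y * (1 - y)
            d_hid = (W2[:2] @ d_out) * h[:2] * (1 - h[:2])
            W2 -= lr * np.outer(h, d_out)
            W1 -= lr * np.outer(x, d_hid)
        preds = sigmoid(np.append(sigmoid(np.c_[X, np.ones(4)] @ W1),
                                  np.ones((4, 1)), axis=1) @ W2)
        return np.sqrt(np.mean((preds - T) ** 2))  # final RMS

    rms = [train_once() for _ in range(20)]
    print(sum(r < 0.05 for r in rms), "of 20 restarts converged")

Whether a given restart converges depends almost entirely on the random initial weights: for a network this small, XOR's error surface has flat regions and poor local solutions, so some fraction of random starts stalling is expected, and the usual fix is simply to restart with new random weights (or to use momentum or more hidden units).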

Backpropagation and training set for dummies

I'm at the very beginning of studying neural networks, but my scarce skills or lack of intelligence do not allow me to understand from popular articles how to correctly prepare a training set for the backpropagation training method (or what its limitations are). For example, I want to train the simplest two-layer perceptron to solve XOR with backpropagation (e.g. modify the random initial weights of the 4 synapses in the first layer and the 4 in the second). The simple XOR function has two inputs and one output: {0,0}=>0, {0,1}=>1, {1,0}=>1, {1,1}=>0. But neural network theory says that "backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient". Does this mean that backpropagation can't be applied if, in the training set, the number of inputs is not strictly equal to the number of outputs, and that this restriction cannot be avoided? Or does it mean that, if I want to use backpropagation for classification tasks such as XOR (i.e. where the number of inputs is bigger than the number of outputs), the theory says it is always necessary to rework the training set into matching shapes (input => desired output): {0,0}=>{0,0}, {0,1}=>{1,1}, {1,0}=>{1,1}, {1,1}=>{0,0}?
Thanks for any help in advance!
Does this mean that backpropagation can't be applied if, in the training set, the number of inputs is not strictly equal to the number of outputs
If you mean the output is "the class" in a classification task, then I don't think so.
backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient
I think it means that every input should have a desired output, not that each input needs a distinct output.
In a real-life problem like handwritten digit classification (MNIST), there are around 50,000 training examples (inputs), but they are classed into only 10 digits.
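To make this concrete, here is how the XOR training set would actually be laid out for backpropagation (a minimal sketch in NumPy): one desired output per input pattern, with no requirement that the output vector be as wide as the input vector.

    import numpy as np

    # Each row of X is one input pattern; the row of T with the same
    # index is its desired output. Widths may differ: 2 inputs, 1 output.
    X = np.array([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]], dtype=float)
    T = np.array([[0],
                  [1],
                  [1],
                  [0]], dtype=float)

    assert len(X) == len(T)  # what backprop requires: one target per example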

How do I take a trained neural network and implement it in another system?

I have trained a feedforward neural network in Matlab. Now I have to implement this neural network in C (or simulate the model in Matlab using the underlying mathematical equations, without calling the toolbox functions directly). How do I do that? I know that I have to take the weights, biases and activation function. What else is required?
There is no point in representing it as a mathematical function because it won't save you any computations.
Indeed, all you need is the weights, biases, activation function and your architecture. Since it is a simple feedforward network as you said, you need to implement some kind of matrix multiplication and addition in C, and you'll also need to implement the activation function. After that, you're ready to go: your feedforward NN is ready to be implemented. If the C code will not be used for training, it won't be necessary to implement the backpropagation algorithm in C.
A feedforward layer would be implemented as follows:
Output = Activation_function(Input * weights + bias)
Where,
Input: (1 x number_of_input_parameters_for_this_layer)
Weights: (number_of_input_parameters_for_this_layer x number_of_neurons_for_this_layer)
Bias: (1 x number_of_neurons_for_this_layer)
Output: (1 x number_of_neurons_for_this_layer)
The output of a layer is the input to the next layer.
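As a sketch of those shapes (in Python/NumPy for brevity; the same structure translates line-by-line into C as a loop-based matrix multiply and add):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def layer_forward(inp, W, b, activation=sigmoid):
        """One feedforward layer: Output = activation(Input * Weights + Bias).

        inp : (1, n_in)      W : (n_in, n_units)
        b   : (1, n_units)   returns (1, n_units)
        """
        return activation(inp @ W + b)

    # chaining layers: the output of one layer is the input to the next
    # h = layer_forward(x, W1, b1)
    # y = layer_forward(h, W2, b2)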
After some days of searching, I have found the following webpage to be very useful: http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
[Figure: a simple feedforward neural network, taken from the above website.]
In this figure, the circles denote the inputs to the network. The circles labeled “+1” are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. In this example, the neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.
The mathematical equations representing this feedforward network are
(a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1))
(a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2))
(a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3))
(h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1))
This neural network has parameters (W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where we write W^{(l)}_{ij} to denote the parameter (or weight) associated with the connection between unit j in layer l and unit i in layer l+1. (Note the order of the indices.) Also, b^{(l)}_i is the bias associated with unit i in layer l+1.
So, from the trained model, as Mido mentioned in his answer, we have to take the input weight matrix W^{(1)}, the layer weight matrix W^{(2)}, the biases, the hidden layer transfer function and the output layer transfer function. After this, use the above equations to estimate the output h_{W,b}(x). A popular choice for a regression problem is the tan-sigmoid transfer function in the hidden layer and a linear transfer function in the output layer.
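Putting that together for the regression setup just described (tan-sigmoid hidden layer, linear output layer), here is a minimal sketch of the full forward pass; the weight matrices are stored transposed relative to the tutorial's W^{(l)} convention so that plain matrix products apply:

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        """h_{W,b}(x) for the 3-3-1 regression case: tanh hidden, linear output.

        x : (1, 3) input row; W1 : (3, 3); b1 : (1, 3); W2 : (3, 1); b2 : (1, 1)
        (W1, W2 stored input-by-output, i.e. the transpose of W^{(1)}, W^{(2)})
        """
        a2 = np.tanh(x @ W1 + b1)  # hidden activations a^{(2)} (tan-sigmoid)
        return a2 @ W2 + b2        # linear output layer: h_{W,b}(x)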
For those who use Matlab, these links are highly useful:
try to simulate neural network in Matlab by myself
Neural network in MATLAB
Programming a Basic Neural Network from scratch in MATLAB

Hyper-parameters of Gaussian Processes for Regression

I know a Gaussian Process Regression model is mainly specified by its covariance function, and the free hyper-parameters act as the 'weights' of the model. But could anyone explain what the two hyper-parameters (length-scale & amplitude) in the covariance function represent (since they are not 'real' parameters)? I'm a little confused about the 'actual' meaning of these two parameters.
Thank you for your help in advance. :)
First off, I would like to point out that there are infinitely many kernels that could be used in a Gaussian process. One of the most common, however, is the RBF kernel (also referred to as the squared exponential, the exponentiated quadratic, etc.). For the simple 1-D case, this kernel has the form
(k(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2 l^2}\right))
Here l is the length scale and sigma is the amplitude (or variance) parameter; note they go under different names depending on the source. Effectively, the length scale rescales the distance between x and x', so it controls how quickly two points stop looking similar as they move apart, and hence how smooth or wiggly the function is. The amplitude parameter controls the vertical scale over which the function varies. These are related but not the same.
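A minimal sketch of that 1-D RBF kernel (variable names mine), which makes the roles of the two hyper-parameters easy to play with:

    import numpy as np

    def rbf_kernel(x1, x2, length_scale=1.0, sigma=1.0):
        """1-D squared-exponential kernel: sigma^2 * exp(-(x1-x2)^2 / (2 l^2)).

        length_scale : how fast correlation decays with |x1 - x2|
        sigma        : output scale (amplitude) of the function values
        """
        return sigma**2 * np.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

    # shrinking length_scale -> points decorrelate faster -> wigglier samples
    # growing sigma          -> larger vertical spread of the sampled functions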
The Kernel Cookbook gives a nice little description and compares the RBF kernel to other commonly used kernels.

Neural Network with softmax activation

edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each class (there are 5 different classes). Naturally, the outputs of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations:
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to minimise the error and adjust the hidden nodes' and output nodes' weights. I am certain that I have this correct for the sigmoid. I am less certain with softmax (or whether I can use gradient descent at all). After a bit of researching, I couldn't find the answer and decided to compute the derivative myself, obtaining softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit; the derivative of softmax provided by the toolkit returns a square matrix of size n×n, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret that matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of backpropagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is implemented incorrectly, but I have no idea how to fix this.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
(-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn}))
where log is the natural logarithm, N denotes the number of training examples and K the number of classes (and thus the number of units in the output layer). t_{kn} denotes the binary coding (0 or 1) of the k-th class in the n-th training example, and y_{kn} is the corresponding network output.
Showing that the gradient is correct might be a good exercise; I haven't done it myself, though.
To your problem: you can check whether your gradient is correct by numerical differentiation. Say you have a function f and implementations of f and f'. Then the following should hold:
(f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2))
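As an illustration, here is a minimal sketch (my own, not from any toolkit) that applies exactly this check to the softmax/cross-entropy gradient claimed above:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # shift for numerical stability
        return e / e.sum()

    def cross_entropy(z, t):
        return -np.sum(t * np.log(softmax(z)))

    z = np.random.randn(5)       # pre-activations of 5 output units
    t = np.eye(5)[2]             # one-hot target: class 2

    analytic = softmax(z) - t    # the claimed gradient: output - target

    eps = 1e-6
    numeric = np.array([
        (cross_entropy(z + eps * np.eye(5)[i], t)
         - cross_entropy(z - eps * np.eye(5)[i], t)) / (2 * eps)
        for i in range(5)
    ])

    print(np.max(np.abs(analytic - numeric)))  # should be tiny, ~1e-9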
Please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, the /mydesire/neural folder has several softmax classifiers, some with a softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced Dynamic-System Simulation, Wiley 2007
GAK
Look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
The softmax derivative is: dy_i/dz_i = y_i * (1.0 - y_i).
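Note that this y_i * (1.0 - y_i) expression is only the diagonal of the full Jacobian, which is the n×n matrix the MATLAB toolkit returns (as noted in the question); the complete matrix is diag(y) - y yᵀ, with off-diagonal entries dy_i/dz_j = -y_i * y_j. A minimal sketch:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.random.randn(5)
    y = softmax(z)

    J = np.diag(y) - np.outer(y, y)       # full Jacobian dy_i/dz_j
    diag = y * (1.0 - y)                  # the dy_i/dz_i formula above

    print(np.allclose(np.diag(J), diag))  # True: diagonal matches y*(1-y)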