I want the network output to be within the range (-0.5, 0.5), i.e. the sigmoid output shifted by -0.5.
It's a simple operation, but it doesn't seem simple to do in Caffe.
Is adding another layer (that does nothing but subtracting 0.5 from the input) to Caffe the only way to do that?
You can use a Power layer, which computes (shift + scale * x) ^ power, so a shift of -0.5 (with scale and power left at 1) gives you exactly the subtraction you want.
Check this.
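For example, a minimal pycaffe NetSpec sketch (untested; the blob names and the input shape are my own invention):

import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 1, 28, 28]))
n.sig = L.Sigmoid(n.data)
# Power computes (shift + scale * x) ^ power, so this yields sigmoid(x) - 0.5
n.shifted = L.Power(n.sig, power=1.0, scale=1.0, shift=-0.5)
print(str(n.to_proto()))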
I was reading this interesting article on convolutional neural networks. It showed this image, explaining that for every receptive field of 5x5 pixels/neurons, a value for a hidden neuron is calculated.
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.
So max-pooling is applied.
With multiple convolutional layers, it looks something like this:
But my question is, this whole architecture could be built with perceptrons, right?
For every convolutional layer, one perceptron is needed, with layers:
input_size = 5x5;
hidden_size = 10 (for example);
output_size = 1;
Then for every receptive field in the original image, the 5x5 area is fed into a perceptron to output the value of a neuron in the hidden layer. So basically, doing this for every receptive field:
So the same perceptron is used 24x24 times to construct the hidden layer, because:
"we're going to use the same weights and bias for each of the 24×24 hidden neurons."
And this works for the hidden layer to the pooling layer as well, input_size = 2x2; output_size = 1;. And in the case of a max-pool layer, it's just a max() function on an array.
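To make this concrete, here is a minimal NumPy sketch of the idea (sizes follow the article: 28x28 input, 5x5 receptive fields, 2x2 pooling; the sigmoid activation is my own assumption):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

image = np.random.rand(28, 28)   # input image
w = np.random.randn(5, 5)        # the one shared perceptron's weights
b = np.random.randn()            # and its shared bias

# "Run this perceptron for every receptive field": 24x24 positions.
hidden = np.zeros((24, 24))
for i in range(24):
    for j in range(24):
        hidden[i, j] = sigmoid(np.sum(w * image[i:i+5, j:j+5]) + b)

# Max-pooling is just max() over each non-overlapping 2x2 field -> 12x12.
pooled = hidden.reshape(12, 2, 12, 2).max(axis=(1, 3))

This is exactly what a convolutional layer computes; it is just written as one shared perceptron applied at every position.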
and then finally:
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons.
which is a perceptron again.
So my final architecture looks like this:
-> 1 perceptron for every convolutional layer/feature map
-> run this perceptron for every receptive field to create feature map
-> 1 perceptron for every pooling layer
-> run this perceptron for every field in the feature map to create a pooling layer
-> finally, input the values of the pooling layer into a regular all-to-all (fully connected) perceptron
Or am I overlooking something? Or is this already how they are programmed?
The answer very much depends on what exactly you call a Perceptron. Common options are:
Complete architecture. Then no, simply because it's by definition a different NN.
A model of a single neuron, specifically y = 1 if (w.x + b) > 0 else 0, where x is the input of the neuron, w and b are its trainable parameters, and w.x denotes the dot product. Then yes, you can force a bunch of these perceptrons to share weights and call it a CNN. You'll find variants of this idea being used in binary neural networks.
A training algorithm, typically associated with the Perceptron architecture. This doesn't really apply to the question, because the learning algorithm is in principle orthogonal to the architecture. That said, you cannot really use the Perceptron algorithm for anything with hidden layers, which suggests the answer is no in this case.
The loss function associated with the original Perceptron. This notion of Perceptron is orthogonal to the problem at hand: your loss function with a CNN is given by whatever you try to do with your whole model. You could conceivably use it, but it is non-differentiable, so good luck :-)
A sidenote rant: you can see people refer to feed-forward, fully-connected NNs with hidden layers as "Multilayer Perceptrons" (MLPs). This is a misnomer; there are no Perceptrons in MLPs, see e.g. this discussion on Wikipedia -- unless you go explore some really weird ideas. It would make more sense to call these networks Multilayer Linear Logistic Regression, because that's what they used to be composed of, up until about six years ago.
I am currently trying to use Neural Network to make regression predictions.
However, I don't know the best way to handle this, as I have read that there are 2 different ways to do regression predictions with a NN.
1) Some websites/articles suggest adding a final layer which is linear.
http://deeplearning4j.org/linear-regression.html
I think my final layers would look like:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get a prediction which is just the mean over the batch. And this is the case whether I use tanh or sigmoid as the penultimate layer.
2) Some other websites/articles suggest scaling the output to a [-1,1] or [0,1] range and using tanh or sigmoid as the final layer (a sketch of both options follows).
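To make the two options concrete, here is a minimal Keras sketch of what I mean (my own toy data; layer sizes are arbitrary):

import numpy as np
from keras import layers, models

X = np.random.rand(1000, 3)
y = X.sum(axis=1) * 10.0                 # some real-valued regression target

# Option 1: linear (identity) output layer, raw targets.
m1 = models.Sequential()
m1.add(layers.Dense(64, activation='tanh', input_shape=(3,)))
m1.add(layers.Dense(1))                  # no activation = identity
m1.compile(optimizer='sgd', loss='mse')
m1.fit(X, y, epochs=10, verbose=0)

# Option 2: sigmoid output layer, targets scaled into [0, 1].
lo, hi = y.min(), y.max()
m2 = models.Sequential()
m2.add(layers.Dense(64, activation='tanh', input_shape=(3,)))
m2.add(layers.Dense(1, activation='sigmoid'))
m2.compile(optimizer='sgd', loss='mse')
m2.fit(X, (y - lo) / (hi - lo), epochs=10, verbose=0)
preds = m2.predict(X)[:, 0] * (hi - lo) + lo   # scale predictions back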
Are these 2 solutions acceptable? Which one should one prefer?
Thanks,
Paul
I would prefer the second case, in which we use normalization and a sigmoid function as the output activation, and then scale the normalized output values back to their actual range. This is because, in the first case, to output large values (and actual target values are large in many problems), the weights mapping from the penultimate layer to the output layer would have to be large. Thus, for faster convergence, the learning rate has to be made larger. But this may also cause the learning of the earlier layers to diverge, since we are using a larger learning rate. Hence, it is advised to work with normalized target values, so that the weights are small and they learn quickly.
In short, the first method learns slowly, or may diverge if a larger learning rate is used; the second method is comparatively safer to use and learns quickly.
I just wanted to test how good can neural network approximate multiplication function (regression task).
I am using Azure Machine Learning Studio. I have 6500 samples, 1 hidden layer (I have tested 5/30/100 neurons per hidden layer), no normalization, and default parameters: Learning rate - 0.005, Number of learning iterations - 200, The initial learning weights diameter - 0.1, The momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time, boosted Decision forest regression shows a very good approximation.
What am I doing wrong? This task should be very easy for NN.
The multiplication function's large gradients probably force the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before learning and multiply back afterwards.
2) Use log-normalization, which turns multiplication into addition (a sketch follows the formula below):
m = x*y => ln(m) = ln(x) + ln(y).
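For instance, a toy Keras sketch of approach 2 (my own example; it assumes strictly positive inputs, since ln is only defined there):

import numpy as np
from keras import layers, models

x = np.random.uniform(0.1, 10.0, size=(10000, 2))
m = x[:, 0] * x[:, 1]

# Learn addition in log space: ln(m) = ln(x) + ln(y).
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1))
model.compile(optimizer='adam', loss='mse')
model.fit(np.log(x), np.log(m), epochs=10, batch_size=32, verbose=0)

# Exponentiate to undo the log-normalization.
pred = np.exp(model.predict(np.log(np.array([[3.0, 4.0]]))))  # should be close to 12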
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1).
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can involve arbitrarily small or large values, and if you pass a large number as input you can get saturation, which loses information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
This is how I do multiplication with a neural network:
import numpy as np
from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
# ReLU on the output is fine here: inputs are in [0, 1), so products are non-negative.
model.add(layers.Dense(1, activation='relu'))

data = np.random.random((10000, 2))             # pairs (a, b) drawn from [0, 1)
results = np.asarray([a * b for a, b in data])  # targets: a * b

model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
print(model.predict(np.array([[0.8, 0.5]])))    # should be close to 0.4
It works.
"Two approaches: divide by constant, or make log normalization"
I tried both approaches. Certainly, log normalization works, since, as you rightly point out, it forces an implementation of addition. Dividing by a constant -- or similarly normalizing across any range -- does not seem to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely up to a certain range of numbers. This is the gist link.
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0
In my project, one of my objectives is to find outliers in aeronautical engine data. I chose to use the Replicator Neural Network to do so, and read the following report on it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3366&rep=rep1&type=pdf). I am having a slight issue understanding the step-wise function (page 4, figure 3) and the prediction values that result from it.
A replicator neural network is best described in the above report, but as background: the one I have built works by having the same number of outputs as inputs, and 3 hidden layers with the following activation functions:
Hidden layer 1 = tanh sigmoid, S1(θ) = tanh(θ)
Hidden layer 2 = step-wise, S2(θ) = 1/2 + 1/(2(k−1)) Σ_{j=1}^{k−1} tanh[a3(θ − j/N)]
Hidden layer 3 = tanh sigmoid, S1(θ) = tanh(θ)
Output layer 4 = normal sigmoid, S3(θ) = 1/(1 + e^(−θ))
I have implemented the algorithm and it seems to be training (since the mean squared error decreases steadily during training). The only thing I don't understand is how predictions are made once the middle layer with the step-wise activation function is applied, since it causes the 3 middle nodes' activations to become specific discrete values (e.g. my last activations on the 3 middle nodes were 1.0, -1.0, 2.0). These values get forward-propagated, and I get very similar or exactly the same predictions every time.
The section in the report on pages 3-4 best describes the algorithm, but I have no idea what I have to do to fix this, and I don't have much time either :(
Any help would be greatly appreciated.
Thank you
I'm facing the problem of implementing this algorithm, and here is my insight into the problem that you might have had: the middle layer, by utilizing a step-wise function, is essentially performing clustering on the data. Each middle-layer neuron transforms the data into a discrete number which could be interpreted as a coordinate in a grid system. Imagine we use two neurons in the middle layer with step-wise values ranging from -2 to +2 in increments of 1. This way we define a 5x5 grid where each set of features will be placed. The more steps you allow, the more grid cells; the more grid cells, the more "clusters" you have.
This all sounds well and good. After all, we are compressing the data into a smaller (dimensional) representation, which is then used to try to reconstruct the original input.
This step-wise function, however, has a big problem in itself: back-propagation does not work (in theory) with step-wise functions. You can find more about this in this paper. In this last paper they suggest switching the step-wise function for a ramp-like function. That is, to have an almost infinite number of clusters.
Your problem might be directly related to this. Try switching the step-wise function with a ramp-wise one and measure how the error changes throughout the learning phase.
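For intuition, here is a quick NumPy sketch (my own; it assumes N = k steps in the report's formula) of the step-wise activation, and of how lowering a3 relaxes it toward a ramp:

import numpy as np

def staircase(theta, k=4, a3=100.0):
    # S2(theta) = 1/2 + 1/(2(k-1)) * sum_{j=1}^{k-1} tanh(a3 * (theta - j/k))
    j = np.arange(1, k)
    return 0.5 + np.tanh(a3 * (theta - j / k)).sum() / (2 * (k - 1))

print(staircase(0.4, a3=100.0))  # hard steps: output snaps to discrete levels
print(staircase(0.4, a3=5.0))    # softer: ramp-like and usable with back-propagation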
By the way, do you have any of this code available anywhere for other researchers to use?
edit:
A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations:
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to minimize the error in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain with softmax (or whether I can use gradient descent at all); after a bit of researching, I couldn't find the answer and decided to compute the derivative myself, obtaining softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit; the derivative of softmax provided by the toolkit returns a square matrix of size nxn, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret that output matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is incorrectly implemented, but I have no idea how to fix it.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
http://archive.ics.uci.edu/ml/datasets/Heart+Disease
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
E = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn})
where log is the natural logarithm, N denotes the number of training examples, and K the number of classes (and thus units in the output layer). t_kn denotes the binary coding (0 or 1) of the k'th class in the n'th training example, and y_kn the corresponding network output.
Showing that the gradient is correct might be a good exercise; I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2)
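As a quick illustration, a NumPy sketch (my own) that applies exactly this check to verify the output - target claim for the cross-entropy above:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.random.randn(5)
t = np.eye(5)[2]                      # one-hot target for class 2

analytic = softmax(z) - t             # the claimed gradient: output - target
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e, t) - cross_entropy(z - eps * e, t)) / (2 * eps)
    for e in np.eye(5)
])
print(np.allclose(analytic, numeric))  # True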
Please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program.
For the Windows version, the /mydesire/neural folder has several softmax classifiers, some with a softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced Dynamic-System Simulation, Wiley, 2007
GAK
look at the link:
http://www.youtube.com/watch?v=UOt3M5IuD5s
the softmax derivative is: dyi/dzi = yi * (1.0 - yi) for the diagonal terms; the off-diagonal terms are dyi/dzj = -yi * yj.
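For reference, a small NumPy check (my own) of the full Jacobian, J[i][j] = yi * (delta_ij - yj), whose diagonal is the formula above:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y = softmax(np.random.randn(5))
J = np.diag(y) - np.outer(y, y)                # full Jacobian dy_i/dz_j
print(np.allclose(np.diag(J), y * (1.0 - y)))  # True: diagonal matches yi * (1 - yi)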