ANN: Perceptron performance in determining point positions - swift

So I'm starting with machine learning and Artificial Neural Networks, and I found the article The Nature of Code, which introduces Artificial Neural Networks and the idea of a Perceptron.
Throughout the article, they show you how to create a Perceptron that is able to discriminate between points positioned above or below a line defined by the function:
f(x) = 2x + 1
I developed my own Perceptron in Swift and used Xcode Playgrounds to illustrate my Perceptron's performance.
The perceptron takes 3 inputs: x, y and bias (always 1). The weights of the 3 inputs are generated at random, and after some training they are adjusted.
This first graphic shows the value of the first weight over trainings. As you can see, the value stabilizes at the end, and this is a proof that the Perceptron learned to discriminate points:
The second graphic represents the function line and all the training points (selected at random). The green points are the ones that the Perceptron predicted well, whereas the red ones are the wrong predictions:
As you can see, almost all of the red dots are situated in the inverse function:
f(x) = -2x - 1
My question is why this line of error dots appears. I thought that at a certain point all the weights would stabilize and the Perceptron's performance would reach 100%, but it never does. Is this because of a bug in my code, or do ANNs always have this small margin of error?
Any explanation is welcome, although keep in mind that I'm a newbie at ML and ANNs.
Thank you very much.
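The setup described in the question (inputs x, y and a constant bias of 1, random initial weights, labels given by f(x) = 2x + 1) can be sketched as follows. This is a minimal Python sketch, not the asker's Swift code: the class name, learning rate, point ranges and number of passes are all assumptions chosen for illustration.

```python
import random


def target_line(x):
    # The line from the question: f(x) = 2x + 1
    return 2 * x + 1


def label(x, y):
    # +1 if the point lies above the line, -1 otherwise
    return 1 if y > target_line(x) else -1


class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.01, rng=None):
        rng = rng or random.Random()
        # Weights start at random values, as in the question
        self.weights = [rng.uniform(-1, 1) for _ in range(n_inputs)]
        self.learning_rate = learning_rate

    def predict(self, inputs):
        total = sum(w * i for w, i in zip(self.weights, inputs))
        return 1 if total > 0 else -1

    def train(self, inputs, desired):
        # Classic perceptron rule: shift each weight by lr * error * input
        error = desired - self.predict(inputs)
        self.weights = [w + self.learning_rate * error * i
                        for w, i in zip(self.weights, inputs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
perceptron = Perceptron(3, rng=rng)
points = [(rng.uniform(-10, 10), rng.uniform(-20, 20)) for _ in range(2000)]

for _ in range(20):  # several passes over the same training points
    for x, y in points:
        perceptron.train([x, y, 1.0], label(x, y))  # third input is the bias

accuracy = sum(perceptron.predict([x, y, 1.0]) == label(x, y)
               for x, y in points) / len(points)
print(accuracy)
```

Printing the accuracy after each pass, rather than only at the end, is one way to see whether the remaining errors cluster near the line or elsewhere.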


Meaning of Bias with zero inputs in Perceptron at ANNs

I'm a student in a graduate computer science program. Yesterday we had a lecture about neural networks.
I think I understood the specific parts of a perceptron in neural networks, with one exception. I have already done my research on the bias in a perceptron, but I still didn't get it.
As far as I know, the bias lets me shift the sum over the weighted inputs of a perceptron, so that the neuron fires only if the sum minus a specific bias is bigger than the threshold of the activation function (sigmoid).
But on the presentation slides, my professor mentioned something like this:
The bias is added to the perceptron to avoid issues where all inputs
could be equal to zero - no multiplicative weight would have an effect
I can't figure out the meaning behind this sentence, or why it is important that the sum over all weighted inputs can't be equal to zero. If all inputs are equal to zero, there should be no impact on the perceptrons in the next hidden layer, right? Furthermore, this perceptron's output is a static value for backpropagation and has no influence on changing the weights at the perceptron.
Or am I wrong?
Does anyone have a solution for that?
Thanks in advance.
A bias is essentially an offset.
Imagine the simple case of a single perceptron, with a relationship between the input and the output, say:
y = 2x + 3
Without the bias term, the perceptron could match the slope (often called the weight) of "2", meaning it could learn:
y = 2x
but it could not match the "+ 3" part.
Although this is a simple example, this logic scales to neural networks in general. The neural network can capture nonlinear functions, but often it needs an offset to do so.
What you asked
What your professor said is another good example of why an offset would be needed. Imagine all the inputs to a perceptron are 0. A perceptron's output is the sum of each of the inputs multiplied by a weight. This means that each weight is being multiplied by 0, then added together. Therefore, the result will always be 0.
With a bias, however, the output could still retain a value.
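Both points of this answer can be checked with a few lines of Python. The sketch below is only an illustration: the function name `neuron` and the weight values are made up, and no activation function is applied, matching the simplified description above.

```python
def neuron(inputs, weights, bias=0.0):
    # Weighted sum of the inputs plus an optional offset (the bias)
    return sum(w * x for w, x in zip(weights, inputs)) + bias


weights = [0.7, -1.3, 2.5]      # arbitrary example weights
zero_inputs = [0.0, 0.0, 0.0]

# With all inputs at zero, no choice of weights can change the output...
out_no_bias = neuron(zero_inputs, weights)
# ...but a bias still lets the neuron produce a non-zero value.
out_with_bias = neuron(zero_inputs, weights, bias=3.0)

# The y = 2x + 3 example: weight 2 gives the slope, bias 3 gives the offset.
y_at_5 = neuron([5.0], [2.0], bias=3.0)

print(out_no_bias, out_with_bias, y_at_5)
```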

Can't approximate simple multiplication function in neural network with 1 hidden layer

I just wanted to test how well a neural network can approximate the multiplication function (a regression task).
I am using Azure Machine Learning Studio. I have 6500 samples and 1 hidden layer
(I have tested 5, 30 and 100 neurons per hidden layer), with no normalization and the default parameters:
learning rate 0.005, number of learning iterations 200, initial learning weight 0.1,
momentum 0 [description]. I got extremely bad accuracy, close to 0.
At the same time boosted Decision forest regression shows very good approximation.
What am I doing wrong? This task should be very easy for NN.
The large gradient of the multiplication function probably forces the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We just divide everything before learning and multiply afterwards.
2) Use log-normalization. It turns multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
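The two approaches can be illustrated without any network at all, just to show the arithmetic that the preprocessing relies on. This is a sketch with hypothetical helper names; note that the log identity only holds for strictly positive x and y, and that the scaling constant c is an arbitrary choice.

```python
import math


def scale_down(x, y, c):
    # Approach 1: divide the inputs by a constant before learning
    return x / c, y / c


def scale_up(scaled_product, c):
    # The product of two scaled inputs is off by a factor of c * c
    return scaled_product * c * c


def multiply_via_logs(x, y):
    # Approach 2: ln(x * y) = ln(x) + ln(y), valid only for x, y > 0
    if x <= 0 or y <= 0:
        raise ValueError("the log trick requires x > 0 and y > 0")
    return math.exp(math.log(x) + math.log(y))


xs, ys = scale_down(300.0, 700.0, 1000.0)   # both now lie in [0, 1]
restored = scale_up(xs * ys, 1000.0)        # back on the original scale
product = multiply_via_logs(3.0, 7.0)

print(restored, product)
```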
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
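The saturation point in the checklist above can be made concrete by looking at the derivative of the logistic sigmoid, which is s(z) * (1 - s(z)). The sketch below uses hypothetical function names and arbitrary sample values; it only shows how quickly the gradient vanishes as the pre-activation grows, which is what happens when unnormalized products are fed in.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    # Derivative of the logistic function: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)


# Small pre-activations give a useful gradient; large ones give almost none.
for z in [0.0, 2.0, 5.0, 20.0]:
    print(z, sigmoid(z), sigmoid_grad(z))

# A raw input such as 30 * 40 = 1200 pushes the unit fully into saturation.
saturated_output = sigmoid(30.0 * 40.0)
saturated_grad = sigmoid_grad(30.0 * 40.0)
```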
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models
# One hidden layer of 150 ReLU units and a single output unit for the product
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))
# Training data: pairs of numbers in [0, 1) and their products
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])
model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict(np.array([[0.8, 0.5]]))
It works.
"Two approaches: divide by constant, or make log normalization"
I tried both approaches. Certainly, log normalization works since, as you rightly point out, it forces an implementation of addition. Dividing by a constant, or similarly normalizing across any range, does not seem to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation in the ability of a neural network to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely until a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0

Trying to find object coordinates (x,y) in image, my neural network seems to optimize error without learning [closed]

I generate images of a single coin pasted over a white background of size 200x200. The coin is randomly chosen among 8 euro coin images (one for each coin) and has:
random rotation;
random size (between fixed bounds);
random position (so that the coin is not cropped).
Here are two examples (center markers added): Two dataset examples
I am using Python + Lasagne. I feed the color image into the neural network that has an output layer of 2 linear neurons fully connected, one for x and one for y.
The targets associated to the generated coin images are the coordinates (x,y) of the coin center.
I have tried (from Using convolutional neural nets to detect facial keypoints tutorial):
Dense layer architecture with various numbers of layers and units (500 max);
Convolution architecture (with 2 dense layers before output);
Sum or mean of squared difference (MSE) as loss function;
Target coordinates in the original range [0,199] or normalized [0,1];
Dropout layers between layers, with dropout probability of 0.2.
I always used simple SGD, tuning the learning rate trying to have a nice decreasing error curve.
I found that as I train the network, the error decreases until a point where the output is always the center of the image. It looks like the output is independent of the input. It seems that the network output is the average of the targets I give. This behavior looks like a simple minimization of the error since the positions of the coins are uniformly distributed on the image. This is not the wanted behavior.
I have the feeling that the network is not learning but is just trying to optimize the output coordinates to minimize the mean error against the targets. Am I right? How can I prevent this? I tried to remove the bias of the output neurons, because I thought maybe I'm just modifying the bias and all other parameters are being set to zero, but this didn't work.
Is it possible for a neural network alone to perform well at this task?
I have read that one can also train a net for present/not present binary classification and then scan the image to find possible locations of objects. But I just wondered if it was possible just using the forward computation of a neural net.
Question : How can I prevent this [overfitting without improvement to test scores]?
What needs to be done is to re-architect your neural net. A neural net just isn't going to do a good job at predicting an X and Y coordinate. It can, though, create a heat map of where it detects a coin; said another way, you could have it turn your color picture into a "coin-here" probability map.
Why? Neurons have a good ability to be used to measure probability, not coordinates. Neural nets are not the magic machines they are sold to be but instead really do follow the program laid out by their architecture. You'd have to lay out a pretty fancy architecture to have the neural net first create an internal space representation of where the coins are, then another internal representation of their center of mass, then another to use the center of mass and the original image size to somehow learn to scale the X coordinate, then repeat the whole thing for Y.
Easier, much easier, is to create a coin detector Convolution that converts your color image to a black and white image of probability-a-coin-is-here matrix. Then use that output for your custom hand written code that turns that probability matrix into an X/Y coordinate.
Question : Is it possible for a neural network alone to perform well at this task?
A resounding YES, so long as you set up the right neural net architecture (like the above), but it would probably be much easier to implement and faster to train if you broke the task into steps and only applied the Neural Net to the coin detection step.
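The hand-written step mentioned in this answer, turning a probability map into an X/Y coordinate, could be as simple as a probability-weighted average of pixel positions. The sketch below assumes the map is a list of rows of non-negative values; the function name and the tiny example map are hypothetical, and other choices (for example taking the arg-max pixel) would also be reasonable.

```python
def center_of_mass(prob_map):
    # prob_map[y][x] holds the "a coin is here" probability for pixel (x, y)
    total = sum(sum(row) for row in prob_map)
    if total == 0:
        return None  # nothing detected anywhere
    cx = sum(p * x for row in prob_map for x, p in enumerate(row)) / total
    cy = sum(p * y for y, row in enumerate(prob_map) for p in row) / total
    return cx, cy


# A toy 4x5 map with all probability mass on two neighbouring pixels in row 1
prob_map = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
center = center_of_mass(prob_map)
print(center)
```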

Understanding Matlab Pattern Recognition Neural Network Plots

I am currently doing a project on vehicle classification. It is almost finished now, but I am confused about several of the plots I get from my neural network.
I used 230 images [90=Hatchbacks,90=Sedans,50=SUVs] for classification on 80 feature points.
Thus my vInput was a [80x230] matrix and my vTarget was [3x230] matrix
Classifier works well but I don't understand these plots or if they are abnormal or not.
My neural Network
Then I clicked these 4 plots in the PLOT section and got these sequentially.
Performance Plot
Training State
Confusion Plot
Receiver Operating Characteristic Plot
I know that is a lot of images, but I know nothing about them.
In the MATLAB documentation they just train the system and plot the graphs.
So could someone please briefly explain them to me, or show me some good links to learn about them?
The first two plots show training statistics.
Performance Plot shows the mean squared error dynamics for all your datasets on a logarithmic scale. Training MSE is always decreasing, so it's the validation and test MSE you should be interested in. Your plot shows perfect training.
Training State shows some other training statistics.
Gradient is the value of the backpropagation gradient at each iteration, on a logarithmic scale. 5e-7 means that you reached the bottom of a local minimum of your goal function.
Validation fails are iterations in which the validation MSE increased. A lot of fails means overtraining, but in your case it's OK. MATLAB automatically stops training after 6 fails in a row.
The other two plots show the results of your network simulation after training.
Confusion Plot. In your case it's 100% accurate. Green cells represent correct answers and red cells represent all types of incorrect answers.
For example, you may read the first one (training set) as: "59 samples from class 1 were correctly classified as class 1, 13 samples from class 2 were correctly classified as class 2, and 6 samples from class 3 were correctly classified as class 3".
Receiver Operating Characteristic Plot shows the same thing in a different way, using ROC curves.
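The way a confusion plot is read can be reproduced with a few lines of Python. This sketch uses the training-set counts quoted above (59, 13 and 6 correctly classified samples) and assumes rows are true classes and columns are predicted classes; that orientation and the function names are assumptions for illustration, not MATLAB's API.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    # matrix[t][p] counts samples of true class t predicted as class p
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    # Diagonal cells (the green ones) are correct answers
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Training set from the example: every sample is classified correctly
true_labels = [0] * 59 + [1] * 13 + [2] * 6
predicted_labels = list(true_labels)

matrix = confusion_matrix(true_labels, predicted_labels, 3)
print(matrix, accuracy(matrix))
```

Any non-zero off-diagonal entry would correspond to a red cell in the plot.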

Neural Network with softmax activation

A more pointed question:
What is the derivative of softmax to be used in my gradient descent?
This is more or less a research project for a course, and my understanding of NN is very/fairly limited, so please be patient :)
I am currently in the process of building a neural network that attempts to examine an input dataset and output the probability/likelihood of each classification (there are 5 different classifications). Naturally, the sum of all output nodes should add up to 1.
Currently, I have two layers, and I set the hidden layer to contain 10 nodes.
I came up with two different types of implementations
Logistic sigmoid for hidden layer activation, softmax for output activation
Softmax for both hidden layer and output activation
I am using gradient descent to find local maximums in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain about softmax (or whether I can use gradient descent at all). After a bit of research I couldn't find the answer, so I decided to compute the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit: the derivative of softmax provided by the toolkit returned a square matrix of size n x n, where the diagonal coincides with the softmax'(x) that I calculated by hand, and I am not sure how to interpret the output matrix.
I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.
My conclusions:
I am fairly certain that my gradient descent is incorrectly done, but I have no idea how to fix this.
Perhaps I am not using enough hidden nodes
Perhaps I should increase the number of layers
Any help would be greatly appreciated!
The dataset I am working with can be found here (processed Cleveland):
The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:
(- \sum^N_{n=1}\sum^K_{k=1} t_{kn} log(y_{kn}))
where log is the natural logarithm, N denotes the number of training examples and K the number of classes (and thus units in the output layer). t_kn denotes the binary coding (0 or 1) of the k'th class in the n'th training example, and y_kn the corresponding network output.
Showing that the gradient is correct might be a good exercise; I haven't done it myself, though.
To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:
(f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2))
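A numerical gradient check of this kind can be sketched in plain Python for a single training example. The code compares the analytic gradient of the cross-entropy error with respect to the softmax inputs (output - target) against the central difference (f(x + eps) - f(x - eps)) / (2 * eps). The function names, the example values and the step size are assumptions chosen for illustration.

```python
import math


def softmax(z):
    # Subtracting the maximum keeps exp() from overflowing
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, t):
    # Error for one example: -sum_k t_k * log(y_k), with y = softmax(z)
    y = softmax(z)
    return -sum(tk * math.log(yk) for tk, yk in zip(t, y))


def analytic_grad(z, t):
    # Gradient of the error with respect to z: output - target
    return [yk - tk for yk, tk in zip(softmax(z), t)]


def numeric_grad(f, z, eps=1e-5):
    # Central difference, one coordinate at a time
    grad = []
    for i in range(len(z)):
        plus, minus = list(z), list(z)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


z = [0.5, -1.2, 2.0, 0.1, -0.3]   # example pre-activations for 5 classes
t = [0, 0, 1, 0, 0]               # one-hot target

analytic = analytic_grad(z, t)
numeric = numeric_grad(lambda v: cross_entropy(v, t), z)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If the two gradients disagree by much more than the step size squared, the analytic derivative (or its implementation) is the likely culprit.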
please look at for the open-source Desire simulation program.
For the Windows version, /mydesire/neural folder has several softmax classifiers, some with softmax-specific gradient-descent algorithm.
In the examples, this works nicely for a simple character-recognition task.
See also
Korn, G.A.: Advanced dynamic-system Simulation, Wiley 2007
look at the link:
the softmax derivative is: dyi/dzi= yi * (1.0 - yi);
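The formula above gives only the diagonal of the full softmax derivative. Because every output depends on every input, the complete derivative is an n x n Jacobian with entries dy_i/dz_j = y_i * (delta_ij - y_j), which is consistent with a toolbox returning a square matrix whose diagonal matches y_i * (1 - y_i). The sketch below uses hypothetical function names and example values.

```python
import math


def softmax(z):
    # Subtracting the maximum keeps exp() from overflowing
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(z):
    # J[i][j] = dy_i / dz_j = y_i * (delta_ij - y_j)
    y = softmax(z)
    n = len(y)
    return [[y[i] * ((1.0 if i == j else 0.0) - y[j]) for j in range(n)]
            for i in range(n)]


z = [1.0, 2.0, 3.0]
y = softmax(z)
jacobian = softmax_jacobian(z)

# The diagonal reproduces the element-wise formula y_i * (1 - y_i)
diagonal = [jacobian[i][i] for i in range(len(z))]
elementwise = [yi * (1.0 - yi) for yi in y]
print(diagonal, elementwise)
```

Each row of the Jacobian sums to zero, which reflects the constraint that the softmax outputs always add up to 1.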