If we have a neural network and train it with desired outputs such as:
if case A, the output will be 0.04
if case B, then 0.08
if case C, then 0.12, and so on up to 1
If we get an actual output of 0.06 when applying the network, how do we interpret it? Should it count as case A or case B?
That will really depend on your thresholding strategy.
First of all, you have to choose a threshold between each pair of adjacent target categories. You can:
either set arbitrary thresholds, which can be the midpoints (e.g., 0.06 between the targets 0.04 and 0.08) or really anything else,
or compute thresholds that minimize the classification error, for example by averaging the best-performing threshold values over several test runs.
Then you have to choose what to do when an output value falls exactly on a threshold. That is really up to you: you can classify it "to the left", "to the right", or even make your network say that it can't classify the input. Keep in mind, though, that in most cases this is quite unlikely to happen; an output will often end up close to a threshold but rarely exactly on it.
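As a minimal Python sketch (assuming the 25 target values 0.04, 0.08, ..., 1.00 from the question), nearest-target classification with midpoint thresholds could look like this:

targets = [k * 0.04 for k in range(1, 26)]          # 0.04, 0.08, ..., 1.00

def classify(output):
    # Midpoint thresholds are equivalent to picking the nearest target value;
    # on an exact tie, min() keeps the first (lower) target, i.e. classifies "to the left".
    return min(targets, key=lambda t: abs(t - output))

print(classify(0.06))   # sits on the 0.04/0.08 midpoint, so 0.04 under this tie rule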
Cheers,
Dolma
The current task is to attempt a one-to-one replication of Matlab image-stabilization code with OpenCV for a real-time application (analyze stabilized frames, no video output). I'm currently prototyping in Python, but the end goal is C++ and CUDA functions. Yes, I know there are better ways. Right now the powers that be think my transformation matrices should produce results within 0.1 pixels.
The issue seems to be replicating Matlab's matchFeatures function running with default arguments (sum of squares, exhaustive, ratio 0.6, threshold 1%, SURF features). It seems to have both a 0.6 ratio test (easily replicated with knnMatch, k=2 and a simple for loop) and a 1%-of-ideal-match threshold filter. I actually fired up the C coder, which gave me some hints along with the actual documentation. A normalized match distance can range from 0 (ideal) to 4. The threshold filter is thus set at 0.04. Of course the C coder output is so obfuscated it might as well be assembly language.
Just filtering the match results (DMatch::distance) to keep those less than 0.04 isn't the answer. I can tell from the C code that some sort of normalization is going on to produce the matchMetrics in Matlab, but I don't understand the underlying math. Can anyone shed some light on the match distance normalization process?
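Not an answer to the normalization question, but for reference, a rough Python/OpenCV sketch of the ratio-test half described above (knnMatch with k=2 and ratio 0.6); the 1% threshold step is deliberately left out since that is exactly the open question, and des1/des2 are assumed to be precomputed float32 SURF descriptors:

import cv2

def ratio_test_matches(des1, des2, ratio=0.6):
    # Exhaustive matcher with Euclidean distance; note that Matlab's SSD metric is the
    # squared L2 distance, so the numeric scales of the two toolboxes differ.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    # Matlab's additional 1% MatchThreshold filter is omitted here, because how the raw
    # distances are normalized to the 0..4 range is precisely the open question.
    return good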
Is it better to have:
1 output neuron that outputs a value between 0 and 15, which would be my final value,
or
16 output neurons that each output a value between 0 and 1, representing the probability of that value?
Example: We want to predict the grade (ranging from 0 to 15) a student gets from the number of hours they studied and their IQ.
TL;DR: I think your problem would be better framed as a regression task, so use one output neuron, but it is worth trying both.
I don't quite like the broadness of your question in contrast to the very specific answers, so I am going to go a little deeper and explain what the proper formulation should be.
Before we start, we should clarify the two big tasks that classical Artificial Neural Networks perform:
Classification
Regression
They are inherently very different from one another; in short, classification tries to put a label on your input (e.g., the input image shows a dog), whereas regression tries to predict a numerical value (e.g., the input data corresponds to a house with an estimated worth of US$1.5 million).
Obviously, you can see that predicting a single numerical value (trivially) requires only one output neuron. Note, however, that this is only true for this specific example. There are other regression use cases in which you do not want your output to be a single point (i.e. 0-dimensional), but rather 1D or 2D.
A common example is image colorization, which, interestingly enough, we can also frame as a classification problem. The provided link shows examples of both. In this case you would obviously have to regress (or classify) every pixel, which leads to more than one output neuron.
Now, to get to your actual question, I want to elaborate a little more on the reasoning why one-hot encoded outputs (i.e. output with as many channels as classes) are preferred for classification tasks over a single neuron.
Since we could argue that a single neuron is enough to predict the class value, we have to understand why it is problematic to get to a specific class that way.
Categorical vs Ordinal vs Interval Variables
One of the main problems is the type of your variable. In your case, there exists a clear order (15 is better than 14 is better than 13, etc.), and even an interval ordering (at least on paper), since the difference between a 15 and 13 is the same as between 14 and 12, although some scholars might argue against that ;-)
Thus, your target is an interval variable and could, in theory, be used for regression. More on that later. But consider, for example, a variable that describes whether an image depicts a cat (0), a dog (1), or a car (2). Now, arguably, we cannot even order these values (is a car > a dog, or a car < a dog?), nor can we say that there exists an "equal distance" between a cat and a dog (similar, since both are animals?) and between a cat and a car (arguably more different from each other). Thus, it becomes really hard to interpret a single output value of the network. Say an input image results in an output of 1.4.
Does this now still correspond to a dog, or is this closer to a car? But what if the image actually depicts a car that has properties of a cat?
On the other hand, having 3 separate neurons that reflect the probabilities of the different classes eliminates that problem, since each one can represent a relatively "undisturbed" probability.
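To make that concrete, here is a minimal numpy sketch (the three logit values are invented purely for illustration) showing how three output neurons plus a softmax yield one probability per class instead of a single ambiguous scalar like 1.4:

import numpy as np

# Three output neurons (cat, dog, car) followed by a softmax; the logits are placeholders.
logits = np.array([2.1, 1.9, -0.5])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(dict(zip(["cat", "dog", "car"], probs.round(3))))
# An ambiguous cat/dog image can show up as high cat AND dog probability,
# which a single scalar output like 1.4 cannot express.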
How to Choose a Loss Function
The other problem is how to backpropagate through the network in the previous example. Classically, classification tasks make use of cross-entropy (CE) loss, whereas regression uses mean squared error (MSE) as a measure. The two are inherently different, and especially the combination of CE and softmax leads to very convenient (and numerically stable) gradients.
Arguably, you could apply rounding to get from 1.4 to a discrete class value (in this case, 1) and then use CE loss, but that might lead to numerical instability; MSE, on the other hand, will never give you a "clear class value", only a regressed estimate.
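As a rough numpy sketch of the two setups for the 16-grade example (all numbers below are made up; in a real network the logits and the regressed value would come from the last layer):

import numpy as np

target = 14                                   # true grade, chosen for illustration

# Classification head: 16 logits -> softmax -> cross-entropy against the one-hot target.
logits = np.random.randn(16)                  # placeholder network outputs
p = np.exp(logits - logits.max())
p /= p.sum()
ce_loss = -np.log(p[target])

# Regression head: a single output neuron -> mean squared error against the raw grade.
y_hat = 13.4                                  # placeholder network output
mse_loss = (y_hat - target) ** 2
print(ce_loss, mse_loss)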
In the end, the question boils down to: do I have a classification or a regression problem? In your case, I would argue that both approaches could work reasonably well. A (classification) network might not recognize the correlation between the different output classes; e.g., a student who has a high likelihood for class 14 basically has zero probability of scoring a 3 or lower. On the other hand, regression might not be able to predict the results accurately for other reasons.
If you have the time, I would highly encourage you to try both approaches. For now, considering the interval type of your target, I would personally go with a regression task, and use rounding after you have trained your network and can make accurate predictions.
It is better to have a single output neuron for each class (except for binary classification). This makes it easier to expand an existing design. A simple example is a network for recognizing the digits 0 through 9 that later has to be extended to hexadecimal digits 0 through F.
After reading a few papers on Neuro Evolution, more specifically NEAT, I realised that there was very little information regarding how you should weight each synapse at the start of the Neural Network. I understand that at the start, using NEAT, all the input neurons are connected to the output neuron, and then evolution takes place from there. However, should you weight each synapse randomly at the start, or simply set each one to 1?
It doesn't really matter a lot; what matters most is how you mutate the weights of the connections in a genome.
However, setting the weights of each genome's connections to random values is best: it acts like a small random search in the 'right' direction. If you set all the weights to the same value for every genome, the weights across genomes will be extremely similar. Keep in mind that a genome has a lot of connections, and with a mutation rate of 0.3 and two mutation options, for example, only 15% of the population will have at least one different weight after just one generation.
So make it something random, like random() * .2 - .1 (uniformly distributed over [-0.1, 0.1]). Just figure out what values work best for you.
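A minimal Python sketch of that suggestion (the number of connections is just an example):

import random

# Give every connection in a genome a small random weight, uniform over [-0.1, 0.1],
# instead of a constant value such as 1.0.
def init_weights(num_connections):
    return [random.random() * 0.2 - 0.1 for _ in range(num_connections)]

genome_weights = init_weights(8)   # e.g. 2 inputs fully connected to 4 outputs
print(genome_weights)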
Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
% Train a one-class SVM (-s 2) with an RBF kernel (-t 2), gamma = 1e-5 and nu = 0.01
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
% Predict on the same Y; od contains the +1 (in-class) / -1 (outlier) labels
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning about 1% of the observations are outliers). The reason I ask is that in other questions regarding one-class SVMs, people appear to use their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
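For reference, a rough numpy sketch of that simpler percentile check (the synthetic Y below is only a stand-in for the real target vector):

import numpy as np

Y = np.random.randn(1000)                      # stand-in for the real, already-scaled target vector

# Flag the upper and lower 0.5-percentiles of Y as outliers.
lo, hi = np.percentile(Y, [0.5, 99.5])
outlier_mask = (Y < lo) | (Y > hi)
Y_clean = Y[~outlier_mask]
print(outlier_mask.mean())                     # fraction flagged, roughly 0.01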
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace "can" by "should".
So when you say that "the resulting model does yield performance somewhat in line with what I would expect" ... of course it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as outliers.
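To illustrate the point, here is a small Python sketch using scikit-learn's OneClassSVM (an assumption on my part; the original code uses the LIBSVM MATLAB interface, but both expose the same nu parameter):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 1))                 # stand-in for the real target vector

# nu = 0.01 tells the model that about 1% of the training data may (and should) be rejected.
model = OneClassSVM(kernel="rbf", gamma=1e-5, nu=0.01).fit(Y)
pred = model.predict(Y)                        # +1 = in-class, -1 = outlier
print((pred == -1).mean())                     # close to 0.01 on the training data, by construction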
Since a lot of the sites I found on Google use mathematical notation, and I have no idea what any of it means, I want to make a feedforward neural network like this:
        n1
i1             n3
        n2             o1
i2             n4
        n3
Now can someone explain to me how to find the value of o1? How is it possible to make a neuron active when none of its inputs are active?
If none of the inputs are live, then you won't get anything out of the output.
It's been a long time since I spent time on this, but back in the day we'd add noise to the equation. This can take the form of inputs that are always on, or of adding a small random amount to each input before feeding it to the neural network.
Interestingly, the use of noise in neural networks has been shown to have a biological analog. If you're trying to hear something and you add in a bit of white noise, it becomes easier to hear. The same goes for seeing.
As for your initial question: how to find the value of o1 depends on ...
The formula used throughout the neural network.
The values of n1 to n4.
The inputs.
http://www.cheshireeng.com/Neuralyst/nnbg.htm
Has some basic info on the maths.
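To make the list above concrete, here is a hedged Python sketch of one possible forward pass for a 2-input, 4-hidden-neuron, 1-output network; the sigmoid activation and all weight values are assumptions, since the question does not fix a formula:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(i1, i2, w_hidden, w_out):
    hidden = [sigmoid(w[0] * i1 + w[1] * i2) for w in w_hidden]   # activations of n1..n4
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))     # o1

w_hidden = [(0.5, -0.3), (0.1, 0.8), (-0.7, 0.2), (0.4, 0.4)]     # placeholder weights into n1..n4
w_out = [0.6, -0.2, 0.3, 0.5]                                     # placeholder weights into o1
print(forward(1.0, 0.0, w_hidden, w_out))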
Since the question isn't really clear to me... I'll say this in case it's what you're looking for:
Oftentimes a bias neuron is added to the input and hidden layers to allow for the case you're mentioning. This extra neuron is always active and is used to handle the case when all other neurons in the layer are inactive.
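A tiny sketch of that idea (the weights are placeholders): an always-on bias input with its own weight lets a neuron produce a non-zero activation even when all regular inputs are 0.

import math

def neuron(i1, i2, w1, w2, w_bias):
    # The constant 1.0 is the always-on bias neuron.
    return 1.0 / (1.0 + math.exp(-(w1 * i1 + w2 * i2 + w_bias * 1.0)))

print(neuron(0.0, 0.0, 0.5, -0.3, 0.8))   # roughly 0.69, despite all-zero inputs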
This question is a good example of why "neural networks" do such an amazingly poor job of emulating the behavior of real-world neurons. Most real neurons have an intrinsic (or "natural") rate at which they fire action potentials, with no input from pre-synaptic neurons. The effect of pre-synaptic neurons is almost always to speed up or slow down this intrinsic firing rate, not to produce a single action potential in the post-synaptic neuron.
Why don't "neural networks" typically model this phenomenon? I don't know - you'd have to ask the people for whom "the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing".