Precision Recall for binary classification problems - classification

I have seen in the internet when people talk about binary classification problems they only report one precision and one recall for the entire model. (Sure it makes sense to report one accuracy). This makes sense if your objective is cancer and non-cancer therefore people who already have cancer and are identified as cancer are true positive. However consider consider you want to classify people are man or woman and you have the confusion matrix as follow :
man 4 2
woman 3 1
In this case it really makes no sense to have one precision and recall for the entire system. The precision and recall for men are different from women. I am wondering how many precision and recall are described in this scenario since you can look at true positive from women perspective as well(This is different from disease case) ?

Related

How are objective functions represented in SAT solvers?

SAT solvers can be used to solve the traveling salesman problem, where the sum of costs between chosen edges is important. I understand that you then ask the solver to go again finding a lesser cost. How is that maximum represented in conjunctive normal form?
The idea is to translate pseudo-Boolean constraints into CNF and many different encodings are available. A naive encoding is to enumerate all partial models that would result in a higher cost (which is exponential).
Note: I am co-author of the following publication:
You may find the PBLib useful as it provides different encodings.

Is it better to have 1 or 10 output neurons?

Is it better to have:
1 output neuron that outputs a value between 0 and 15 which would be my ultimate value
or
16 output neurons that output a value between 0 and 1 which represents the propability for this value?
Example: We want to find out the grade (ranging from 0 to 15) a student gets by inputing the number of hours he learned and his IQ.
TL;DR: I think your problem would be better framed as a regression task, so use one ouptut neuron, but it is worth to try both.
I don't quite like the broadness of your question in contrast to the very specific answers, so I am going to go a little deeper and explain what exactly should be the proper formulation.
Before we start, we should clarify the two big tasks that classical Artificial Neural Networks perform:
Classification
Regression
They are inherently very different from one another; in short, Classification tries to put a label on your input (e.g., the input image shows a dog), whereas regression tries to predict a numerical value (e.g., the input data corresponds to a house that has an estimated worth of 1.5 million $US).
Obviously, you can see that predicting the numerical value requires (trivially) only one output value. Also note that this is only true for this specific example. There could be other regression usecases, in which you want your output to have more than 0 dimensions (i.e. a single point), but instead be 1D, or 2D.
A common example would for example be Image Colorization, which we can interestingly enough also frame as a classification problem. The provided link shows examples for both. In this case you would obviously have to regress (or classify) every pixel, which leads to more than one output neuron.
Now, to get to your actual question, I want to elaborate a little more on the reasoning why one-hot encoded outputs (i.e. output with as many channels as classes) are preferred for classification tasks over a single neuron.
Since we could argue that a single neuron is enough to predict the class value, we have to understand why it is problematic to get to a specific class that way.
Categorical vs Ordinal vs Interval Variables
One of the main problems is the type of your variable. In your case, there exists a clear order (15 is better than 14 is better than 13, etc.), and even an interval ordering (at least on paper), since the difference between a 15 and 13 is the same as between 14 and 12, although some scholars might argue against that ;-)
Thus, your target is an interval variable, and could thus be in theory used to regress on it. More on that later. But consider for example a variable that describes whether the image depicts a cat (0), dog (1), or car (2). Now, arguably, we cannot even order the variables (is a car > dog, or car < dog?), nor can we say that there exists an "equal distance" between a cat and a dog (similar, since both are animals?) or a cat and a car (arguably more different from each other). Thus, it becomes really hard to interpret a single output value of the network. Say an input image results in the output of, say, 1.4.
Does this now still correspond to a dog, or is this closer to a car? But what if the image actually depicts a car that has properties of a cat?
On the other hand, having 3 separate neurons that reflect the different probabilities of each class eliminate that problem, since each one can depict a relatively "undisturbed" probability.
How to Loss Function
The other problem is the question how to backpropagate through the network in the previous example. Classically, classification tasks make use of Cross-Entropy Loss (CE), whereas regression uses Mean Squared Error (MSE) as a measure. Those two are inherently different, and especially the combination of CE and Softmax lead to very convenient (and stable) derivations.
Arguably, you could apply rounding to get from 1.4 to a concise class value (in that case, 1) and then use CE loss, but that would maybe lead to numerically instability; MSE on the other hand will never give you a "clear class value", but more a regressed estimate.
In the end, the question boils down to: Do I have a classification or regression problem. In your case, I would argue that both approaches could work reasonably well. A (classification) network might not recognize the correlation between the different output classes; i.e. a student that has a high likelihood for class 14 basically has zero probability of scoring a 3 or lower. On the other hand, regression might not be able to accurately predict the results for other reasons.
If you have the time, I would highly encourage you to try both approaches. For now, considering the interval type of your target, I would personally go with a regression task, and use rounding after you have trained your network and can make accurate predictions.
It is better to have a single neuron for each class (except binary classification). This allows for better design in terms of expanding upon an existing design. A simple example is creating a network for recognizing digits 0 through 9, but then changing the design to hex from 0 through F.

Choose training and test set for MLP and Hopfield network

I have a question regarding the choice of the training and the test set for a Multilayer Perceptron (MLP) and a Hopfield network.
For example, assume that we got 100 patterns of the digits 0-9 given in a bitmap format. 10 of them are perfect digits while the other 90 are distorted. Which of these patterns will be used for the training set and which for the test set? The goal is to classify the digits.
I suppose for the Hopfield network the perfect digits will be used as the training set, but what about the MLP? One approach I thought of was to take for example 70 of the distorted digits and use them as the training set along with the corresponding perfect digits as their intended targets. Is this approach correct?
Disclaimer: I have not worked with Hopfield Networks before, so I trust you in your statements about it, but it should not be of that great relevance for the answer, anyways.
I am also assuming that you want to classify the digits, which is something you don't explicitly state in your question.
As for a proper split: Aside from the fact that that little training data is generally not a feasible amount to get decent results for a MLP (even for a simple task such as digit classification), it is unlikely that you will be able to "pre-label" your training data in terms of quality in most real-world scenarios. You should therefore always assume that the data you are processing is inherently noisy. A good example for this is also the fact that data augmentation is frequently used to enrich your training corpus. Since data augmentation can consist of such simple changes as
added noise
minor rotations
horizontal/vertical flipping (the latter only makes so much sense for digits, though)
can improve your accuracy, it goes to show that visual quality and quantity for training are two very different things. Of course, it is not per se true that quantity alone will solve your problem (although research indicates that it is at least a good idea to use very much data)
Further, what you judge to be a good representation might be very much different from the network's perspective (although for labeling digits it might be rather easy to tell). A decent strategy is therefore to simply perform a random sampling for your training/test split.
Something I like to do when preprocessing a dataset is, when done splitting, to check whether every class is somewhat evenly represented in the splits, so you won't overfit.
Similarly, I would argue that having clean/high quality images of digits in both your test and training set might make the most sense, since you want to both be able to recognize a high quality number, as well as a sloppily written digit, and then test whether you can actually recognize it (with your test set).

Why use Crossover in Neural Network training?

Why specifically is it used?
I know it increases variation which may help explore the problem space, but how much does it increase the probability of finding the optimal solution/configuration in time? And does it do anything else advantageous?
And does it necessarily always help, or are there instances in which it would increase the time taken to find the optimal solution?
As Patrick Trentin said, crossover improve the speed of convergence, because it allows to combine good genes that are already found in the population.
But, for neuro-evolution, crossover is facing the "permutation problem", also known as "the competing convention problem". When two parents are permutations of the same network, then, except in rare cases, their offspring will always have a lower fitness. Because the same part of the network is copied in two different locations, and so the offspring is losing viable genes for one of these two locations.
for example the networks A,B,C,D and D,C,B,A that are permutations of the same network. The offspring can be:
A,B,C,D (copy of parent 1)
D,C,B,A (copy of parent 2)
A,C,B,D OK
A,B,C,A
A,B,B,A
A,B,B,D
A,C,B,A
A,C,C,A
A,C,C,D
D,B,C,A OK
D,C,B,D
D,B,B,A
D,B,B,D
D,B,C,D
D,C,C,A
D,C,C,D
So, for this example, 2/16 of the offspring are copies of the parents. 2/16 are combinations without duplicates. And 12/16 have duplicated genes.
The permutation problem occurs because networks that are permutations one of the other have the same fitness. So, even for an elitist GA, if one is selected as parent, the other will also often be selected as parent.
The permutations may be only partial. In this case, the result is better than for complete permutations, but the offspring will, in a lot of cases, still have a lower fitness than the parents.
To avoid the permutation problem, I heard about similarity based crossover, that compute similarity of neurons and their connected synapses, doing the crossing-over between the most similar neurons instead of a crossing-over based on the locus.
When evolving topology of the networks, some NEAT specialists think the permutation problem is part of a broader problem: "the variable lenght genome problem". And NEAT seems to avoid this problem by speciation of the networks, when two networks are too differents in topology and weights, they aren't allowed to mate. So, NEAT algorithm seems to consider permuted networks as too different, and doesn't allow them to mate.
A website about NEAT also says:
However, in another sense, one could say that the variable length genome problem can never be "solved" because it is inherent in any system that generates different constructions that solve the same problem. For example, both a bird and a bat represent solutions to the problem of flight, yet they are not compatible since they are different conventions of doing the same thing. The same situation can happen in NEAT, where very different structures might arise that do the same thing. Of course, such structures will not mate, avoiding the serious consequence of damaged offspring. Still, it can be said that since disparate representations can exist simultaneously, incompatible genomes are still present and therefore the problem is not "solved." Ultimately, it is subjective whether or not the problem has been solved. It depends on what you would consider a solution. However, it is at least correct to say, "the problem of variable length genomes is avoided."
Edit: To answer your comment.
You may be right for similarity based crossover, I'm not sure it totally avoids the permutation problem.
About the ultimate goal of crossover, without considering the permutation problem, I'm not sure it is useful for the evolution of neural networks, but my thought is: if we divide a neural network in several parts, each part contributes to the fitness, so two networks with a high fitness may have different good parts. And combining these parts should create an even better network. Some offspring will of course inherit the bad parts, but some other offspring will inherit the good parts.
Like Ray suggested, it could be useful to experiment the evolution of neural networks with and without crossover. As there is randomness in the evolution, the problem is to run a large number of tests, to compute the average evolution speed.
About evolving something else than a neural network, I found a paper that says an algorithm using crossover outperforms a mutation-only algorithm for solving the all-pairs shortest path problem (APSP).
Edit 2:
Even if the permutation problem seems to be only applicable to some particular problems like neuro-evolution, I don't think we can say the same about crossover, because maybe we are missing something about the problems that don't seem to be suitable for crossover.
I found a free version of the paper about similarity based crossover for neuro-evolution, and it shows that:
an algorithm using a naive crossover performs worse than a mutation-only algorithm.
using similarity based crossover it performs better than a mutation-only algorithm for all tested cases.
NEAT algorithm sometimes performs better than a mutation-only algorithm.
Crossover is complex and I think there is a lack of studies that compare it with mutation-only algorithms, maybe because its usefulness highly depends:
of its engineering, in function of particular problems like the permutation problem. So of the type of crossover we use (similarity based, single point, uniform, edge recombination, etc...).
And of the mating algorithm. For example, this paper shows that a gendered genetic algorithm strongly outperforms a non-gendered genetic algorithm for solving the TSP. For solving two other problems, the algorithm doesn't strongly outperforms, but it is better than the non-gendered GA. In this experiment, males are selected on their fitness, and females are selected on their ability to produce a good offspring. Unfortunately, this study doesn't compare the results with a mutation-only algorithm.

Basic intuition for neural networks?

There are lots of "introduction to neural networks" articles online, but most are an introduction to the math of artificial neural networks and not an introduction to the actual underlying concepts (even though they should be one and the same). How does a simple network of artificial neurons actually work?
This answer is roughly based on the beginning of "Neural Networks and Deep Learning" by M. A. Nielsen which is definitely worth reading - it's online and free.
The fundamental idea behind all neural networks is this: Each neuron in a neural network makes a decision. Once you understand how they do that, everything else will make sense. Let’s walk through a simple situation which will help us arrive at that understanding.
Let’s say you are trying to decide whether or not to wear a hat today. There are a number of factors which will affect your decision, and perhaps the most important ones are:
Is it sunny?
Do I have a hat to wear?
Would a hat suit my outfit?
For simplicity, we’ll assume these are the only three factors that you’re weighing up during this decision. Forgetting about neural networks for a second, let’s just try to build a ‘decision maker’ to help us answer this question.
First, we can see each question has a certain level of importance, and so we’ll need to use this relative importance of each question, along with the corresponding answer to each question, to make our decision.
Secondly, we’ll need to have some component which interprets each (yes or no) answer along with its importance to produce the final answer. This sounds simple enough to put into an equation, right? Let’s do it. We simply decide how important each factor is and multiply that importance (or ‘weight’) by the answer to the question (which can be 0 or 1):
3a + 5b + 2c > 6
The numbers 3, 5 and 2 are the ‘weights’ of question a, b and c, respectively. a, b and c, themselves can be either zero (the answer to the question was ‘no’), or one (the answer to the question was ‘yes’). If the above equation is true, then the decision is to wear a hat, and if it is false, the decision is to not wear a hat. The equation says that we’ll only wear a hat if the sum of our weights multiplied by our factors is greater than some threshold value. Above, I chose a threshold value of 6. If you think about it, this means that if I don’t have a hat to wear (b=0), no matter what the other answers are, I won’t be wearing a hat today. That is,
3a + 2c > 6
is never true, since a and c are only either 0 or 1. This makes sense – our simple decision model tells us not to wear a hat if we don’t have one! So the weights of 3, 5 and 2, and the threshold value of 6 seem like a good choices for our simple “should I wear a hat” decision-maker. It also means that, as long as I have a hat to wear, the sun shining (a=1) OR the hat suiting my outfit (c=1) is enough to make me wear a hat today. That is,
5 + 3 > 6 and 5 + 2 > 6
are both true. Good! You can see that by adjusting the weighting of each factor and the threshold, and by adding more factors, we can adjust our ‘decision maker’ to approximately model any decision-making process. What we have just demonstrated is the functionality of a simple neuron (a decision-maker!). Let’s put the above equation into ‘neuron-form’:
A neuron which processes 3 factors: a, b, c, with corresponding importance weightings of 3, 5, 2, and with a decision threshold of 6.
The neuron has 3 input connections (the factors) and 1 output connection (the decision). Each input connection has a weighting which encodes the importance of that connection. If the weighting of that connection is low (relative to the other weights), then it won’t have much effect on the decision. If it’s high, the decision will heavily depend on it.
This is great, we’ve got a fully working neuron that weights inputs and makes decisions. So here’s the next though: What if the output (our decision) was fed into the input of another neuron? That neuron would be using our decision about our hat to make a more abstract decision. And what if the inputs a, b and c are themselves the outputs of other neurons which compute lower-level decisions? We can see that neural networks can be interpreted as networks which compute decisions about decisions, leading from simple input data to more and more complex ‘meta-decisions’. This, to me, is an incredible concept. All the complexity of even the human brain can be modelled using these principles. From the level of photons interacting with our cone-cells right up to our pondering of the meaning of life, it’s just simple little decision-making neurons.
Below is a diagram of a simple neural network which essentially has 3 layers of abstraction:
A simple neural network with 2 inputs and 2 outputs.
As an example, the above inputs could be 2 infrared distance sensors, and the outputs might control control the on/off switch for 2 motors which drive the wheels of a robot.
In our simple hat example, we could pick the weights and the threshold quite easily, but how do we pick the weights and thresholds in this example so that, say, the robot can follow things that move? And how do we know how many neurons we need to solve this problem? Could we solve it with just 1 neuron, maybe 2? Or do we need 20? And how do we organise them? In layers? Modules? These questions are the questions in the field of neural networks. Techniques such as ‘backpropagation’ and (more recently) ‘neuroevolution’ are used effectively to answer some of these troubling questions, but these are outside the scope of this introduction – Wikipedia and Google Scholar and free online textbooks like “Neural Networks and Deep Learning” by M. A. Nielsen are great places to start learning about these concepts.
Hopefully you now have some intuition for how neural networks work, but if you’re interested in actually implementing a neural network there are a few optimisations and extensions to our concept of a neuron which.will make our neural nets more efficient and effective.
Firstly, notice that if we set the threshold value of the neuron to zero, we can always adjust the weightings of the inputs to account for this – only, we’ll also need to allow negative values for our weights. This is great since it removes one variable from our neuron. So we’ll allow negative weights and from now on we won’t need to worry about setting a threshold – it’ll always be zero.
Next, we’ll notice that the weights of the input connections are all relative to one-another, so we can actually normalise these to a value between -1 and 1. Cool. That simplifies things a little.
We can make a further, more substantial improvement to our decision-maker by realising that the inputs themselves (a, b and c in the above example) need not just be 0 or 1. For example, what if today is really sunny? Or maybe there’s scattered clouds, do it’s intermittently sunny? We can see that by allowing values between 0 and 1, our neuron gets more information and can therefore make a better decision – and the good news is, we don’t need to change anything in our neuron model!
So far, we’ve allowed the neuron to accept inputs between 0 and 1, and we’ve normalised the weights between -1 and 1 for convenience.
The next question is: why do we need such certainty in our final decision (i.e. the output of the neuron)? Why can’t it, like the inputs, also be a value between 0 and 1? If we did allow this, the decision of whether or not to wear a hat would become a level of certainty that wearing a hat is the right choice. But if this is a good idea, why did I introduce a threshold at all? Why not just directly pass on the sum of the weighted inputs to the output connection? Well, because, for reasons beyond the scope of this simple introduction to neural networks, it turns out that a neural network works better if the neurons are allowed to make something like an ‘educated guess’, rather than just presenting a raw probability. A threshold gives the neurons a slight bias toward certainty and allows them to be more ‘assertive’, and doing so makes neural networks more efficient. So in that sense, a threshold is good. But the problem with a threshold is that it doesn’t let us know when the neuron is uncertain about its decision – that is, if the sum of the weighted inputs is very close to the threshold, the neuron makes a definite yes/no answer where a definite yes/no answer is not ideal.
So how can we overcome this problem? Well it turns out that if we replace our “greater than zero” condition with a continuous function (called an ‘activation function’), then we can choose non-binary and non-linear reactions to the neuron’s weighted inputs. Let’s first look at our original “greater than zero” condition as a function:
‘Step’ function representing the original neuron’s ‘activation function’.
In the above activation function, the x-axis represents the sum of the weighted inputs and the y-axis represents the neuron’s output. Notice that even if the inputs sum to 0.01, the output is a very certain 1. This is not ideal, as we’ve explained earlier. So we need another activation function that only has a bias towards certainty. Here’s where we welcome the ‘sigmoid’ function:
The ‘sigmoid’ function; a more effective activation function for our artificial neural networks.
Notice how it looks like a halfway point between a step function (which we established as too certain) and a linear x=y line that we’d expect from a neuron which just outputs the raw probability that some some decision is correct. The equation for this sigmoid function is:
where x is the sum of the weighted inputs.
And that’s it! Our new-and-improved neuron does the following:
Takes multiple inputs between 0 and 1.
Weights each one by a value between -1 and 1.
Sums them all together.
Puts that sum into the sigmoid function.
Outputs the result!
It's deceptively simple, but by combining these simple decision-makers together and finding ideal connection weights, we can make arbitrarily complex decisions and calculations which stretch far beyond what our biological brains allow.