How is the bias node integrated in NEAT? - neural-network

In NEAT you can add a special bias input node that is always active. Regarding the implementation of such a node, there is not much information in the original paper. Now I want to know how the bias node should behave, if there is a consensus at all.
So the question is:
Do connections from the bias node come about during evolution, and can they be split for new nodes just like regular connections, or does the bias node always have connections to all non-input nodes?

To answer my own question: on the NEAT Users Page, Kenneth O. Stanley explains why the bias in NEAT is used as an extra input neuron:
Why does NEAT use a bias node instead of having a bias parameter in each node?
Mainly because not all nodes need a bias. Thus, it would unnecessarily enlarge the search space to be searching for a proper bias for every node in the system. Instead, we let evolution decide which nodes need biases by connecting the bias node to those nodes. This issue is not a major concern; it could work either way. You can easily code a bias into every node and try that as well.
My best guess is therefore that the BIAS input is treated like any other input in NEAT, with the difference that it is always active.
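For illustration, here is a minimal Python sketch of that interpretation, where the bias node is just another input whose activation is fixed at 1.0 during evaluation. The genome structure (input_ids, bias_id, evaluation_order, connections) is made up for this example and is not from the NEAT paper:

import math
from types import SimpleNamespace

# Made-up minimal genome: two inputs, one bias node (id 2), one output (id 3).
# The only "bias" the output receives is the evolved connection from node 2.
Conn = lambda i, o, w: SimpleNamespace(in_node=i, out_node=o, weight=w, enabled=True)
genome = SimpleNamespace(
    input_ids=[0, 1], bias_id=2, output_ids=[3], evaluation_order=[3],
    connections=[Conn(0, 3, 0.7), Conn(1, 3, -1.2), Conn(2, 3, 0.5)],
)

def evaluate(genome, inputs):
    # Activations for real input nodes come from the environment;
    # the bias node is just an extra input that is always 1.0.
    activations = dict(zip(genome.input_ids, inputs))
    activations[genome.bias_id] = 1.0
    # Hidden/output nodes are assumed to be stored in feed-forward order.
    for node_id in genome.evaluation_order:
        total = sum(activations[c.in_node] * c.weight
                    for c in genome.connections
                    if c.enabled and c.out_node == node_id)
        activations[node_id] = 1.0 / (1.0 + math.exp(-total))   # sigmoid
    return [activations[n] for n in genome.output_ids]

print(evaluate(genome, [0.3, 0.9]))

A node that evolution never connected to the bias node simply receives no bias term, which is exactly the behaviour Stanley describes.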

How best to deal with "None of the above" in Image Classification?

This seems to be a fundamental question which some of you out there must have an opinion on. I have an image classifier implemented in CNTK with 48 classes. If the image does not match any of the 48 classes very well, then I'd like to be able to conclude that it was not among these 48 image types. My original idea was simply that if the highest output of the final Softmax layer was low, I would be able to conclude that the test image matched none well. While I occasionally see this occur, in most testing, Softmax still produces a very high (and mistaken) result when handed an 'unknown image type'. But maybe my network is overfit, and if it wasn't, my original idea would work fine. What do you think? Is there any way to define a 49th class called 'none-of-the-above'?
You really do have these two options: thresholding the posterior probabilities (softmax values), and adding a garbage class.
In my area (speech), both approaches have their place:
If "none of the above" inputs are of the same nature as the "above" (e.g. non-grammatical inputs), thresholding works fine. Note that the posterior probability for a class is equal to one minus an estimate of the error rate for choosing this class. Rejecting anything with posterior < 50% would be rejecting all cases where you are more likely wrong than right. As long as your none-of-the-above classes are of similar nature, the estimate may be accurate enough to make this correct for them as well.
If "none of the above" inputs are of similar nature but your number of classes is very small (e.g. 10 digits), or if the inputs are of a totally different nature (e.g. a sound of a door slam or someone coughing), thresholding typically fails. Then, one would train a "garbage model." In our experience, it is OK to include the training data for the correct classes. Now the none-of-the-above class may match a correct class as well. But that's OK as long as the none-of-the-above class is not overtrained--its distribution will be much flatter, and thus even if it matches a known class, it will match it with a lower score and thus not win against the actual known class' softmax output.
In the end, I would use both. Definitely use a threshold (to catch the cases that the system can rule out) and use a garbage model, which I would just train on whatever you have. I would expect that including the correct examples in training will not harm, even if it is the only data you have (please check the paper Anton posted for whether that applies to images as well). It may also make sense to try to synthesize data, e.g. by randomly combining patches from different images.
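As a rough illustration of the thresholding half of this (plain Python/NumPy; the 0.5 threshold, the class names, and the logits are placeholders, not recommendations):

import numpy as np

def classify_with_rejection(logits, class_names, threshold=0.5):
    # Convert the network's final-layer logits into posterior probabilities.
    exp = np.exp(logits - np.max(logits))    # subtract max for numerical stability
    posteriors = exp / exp.sum()
    best = int(np.argmax(posteriors))
    if posteriors[best] < threshold:
        # The most likely class is still more likely wrong than right:
        # treat the image as "none of the above".
        return "none-of-the-above", float(posteriors[best])
    return class_names[best], float(posteriors[best])

print(classify_with_rejection(np.array([2.0, 1.0, 0.5]), ["cat", "dog", "car"]))

The garbage-class option needs no extra inference code at all: it is simply a 49th output that you train like the others.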
I agree with you that this is a key question, but I am not aware of much work in that area either.
There's one recent paper by Zhang and LeCun that addresses the question for image classification in particular. They use large quantities of unlabelled data to create an additional "none of the above" class. The catch, though, is that in some cases their unlabelled data is not completely unlabelled, and they have means of removing "unlabelled" images that are actually in one of their labelled classes. Having said that, the authors report that apart from solving the "none of the above" problem, they even see performance gains on their test sets.
As for fitting something post-hoc, just by looking at the outputs of the softmax, I can't provide any pointers.

How to make simulated electric components behave nicely?

I'm making a simple electric circuit simulator. It will (at least initially) only feature batteries, wires and resistors in series and parallel. However, I'm at a loss as to how best to simulate said circuit.
Specifically, I will have batteries and resistors with two contact points each, and wires that go between two contact points. I assume that each component will have a field for its resistance, the current through it and the voltage across it (current and voltage will, of course, be signed). Each component is given a resistance, and the batteries are given a voltage. The goal of the simulation is to assign correct values to all the other fields in real time as the player connects and disconnects components and wires.
These are the requirements:
It must be correct, including Ohm's and Kirchhoff's laws (I'm modeling real world circuits, and there is little point if the model does something completely different)
It must be numerically stable (we can't have uncontrolled oscillations or something just because two neighbouring resistors can't make up their minds together)
It should stabilize relatively quickly for, let's say, fewer than 30 components (having to wait a few seconds before the values are correct doesn't really satisfy "real time", but I really don't plan on using it for more than 10 or maybe 20 components)
The optimal formulation for me (how I envision this in my head) would be if I could assign a script to each component that took care of that component only, possibly by communicating field values with neighbouring components, and each component script works in parallel and adjusts as is needed
I only see problems here and no solutions. The biggest problem, I think, is Kirchhoff's voltage law (going around any sub-circuit, the voltages across all components, including signs, add up to 0), because that's a global law (it says something about a whole circuit and not just a single component / connection point). There is a mathematical reformulation saying that there exists a potential function on the points in the circuit (for instance, the voltage measured against the + pole of the battery), which is a bit more local, but I still don't see how to let a component know how much the voltage / potential drops across it.
Kirchhoff's current law (the net current flow into an intersection is 0) might also be trouble. It seems to force me to make intersections into separate objects to enforce it. I originally thought that I could just let each component have two lists (a left list and a right list) containing every other component that is connected to it at that point, but that might not make KCL easily enforceable.
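For what it's worth, the potential-function reformulation mentioned above is exactly what classical nodal analysis exploits: pick a reference node, write KCL at every other node in terms of unknown node potentials, and solve one small linear system, after which KVL holds automatically. A minimal NumPy sketch with an invented circuit (one battery treated as two fixed potentials, the rest resistors) might look like this; it is only an illustration of the standard technique, not a real-time component-by-component scheme:

import numpy as np

# Hypothetical circuit: resistors as (node_a, node_b, ohms); the battery fixes
# the potentials of node 0 (minus, reference) and node 1 (plus).
resistors = [(1, 2, 100.0), (2, 0, 200.0), (2, 3, 300.0), (3, 0, 300.0)]
fixed = {0: 0.0, 1: 9.0}            # known node potentials (volts)
num_nodes = 4

unknown = [n for n in range(num_nodes) if n not in fixed]
index = {n: i for i, n in enumerate(unknown)}
G = np.zeros((len(unknown), len(unknown)))   # conductance matrix
I = np.zeros(len(unknown))                   # currents injected from fixed nodes

# Kirchhoff's current law at each unknown node: the currents leaving it sum to 0.
for a, b, ohms in resistors:
    g = 1.0 / ohms
    for n, m in ((a, b), (b, a)):
        if n in index:
            G[index[n], index[n]] += g
            if m in index:
                G[index[n], index[m]] -= g
            else:
                I[index[n]] += g * fixed[m]

potentials = dict(fixed)
for n, v in zip(unknown, np.linalg.solve(G, I)):
    potentials[n] = v

# Voltage across and current through each resistor then follow from Ohm's law.
for a, b, ohms in resistors:
    v = potentials[a] - potentials[b]
    print(f"R {a}-{b}: {v:.3f} V, {v / ohms * 1000:.3f} mA")

Solving the whole system at once like this sidesteps the stability worries of letting components negotiate values iteratively, and for 10-30 components the solve is effectively instant.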
I know there are circuit simulators out there, and they must have solved this exact problem somehow. I just can't find an explanation because if I try googling it, I only find the already made simulators and no explanations anywhere.

Bayesian network and fuzzy logic

Can anyone give me an example of a Bayesian network and fuzzy logic being used in intrusion detection?
I'm struggling to figure out how it can be used. And any code on it?
Thanks guys.
The exact details will depend upon whether you're talking about a burglar alarm type situation (sensor readings) or something fancier involving security guards and sharks with lasers. Either way, the principle is the same.
You start with root nodes describing the basic things that affect intrusion, e.g.,
Sensor detected motion (true/false)
Shark smelt blood (true/false)
Temperature (too low/just right/too high)
Security guard is asleep
...
any other things you can think of.
You assign a probability to each state of each root node.
P(Security guard is asleep) = 0.25
Then you define child nodes that depend upon those root nodes, e.g., Security guard heard noise would depend upon Security guard is asleep.
You assign conditional probabilities for each state of the child nodes, given each state of its parents.
P(Security guard heard noise|Security guard is asleep) = 0.05
P(Security guard heard noise|Security guard is not asleep) = 0.5
Eventually, you'll want to get to an outcome like Burglary has been foiled.
Once you have your network node set up, you can evaluate it, and calculate the probability of different outcomes happening.
Next you add evidence. So if you know your shark smelt blood, that node gets set to a particular value and you can reevaluate the network to see how probabilities have changed.
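As a toy illustration of the mechanics, using only the made-up numbers above (plain Python, no library), you can evaluate the child node with no evidence by marginalising over its parent, and then see how evidence changes things via Bayes' rule:

# Root node prior
p_asleep = 0.25

# Conditional probability table for the child node
p_heard_given_asleep = 0.05
p_heard_given_awake = 0.5

# Evaluate with no evidence: marginalise over the parent's states.
p_heard = (p_heard_given_asleep * p_asleep
           + p_heard_given_awake * (1 - p_asleep))
print(f"P(heard noise) = {p_heard:.4f}")                         # 0.3875

# Add evidence: suppose we observe that the guard DID hear a noise,
# and ask how likely it is that he was asleep (Bayes' rule).
p_asleep_given_heard = p_heard_given_asleep * p_asleep / p_heard
print(f"P(asleep | heard noise) = {p_asleep_given_heard:.4f}")   # ~0.0323

A real network with many nodes uses an inference engine to do these sums for you, which is where the software below comes in.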
In terms of software, the Bayes Net toolbox is well regarded.

Can a virtual machine be implemented as a neural network?

Disclaimer: I'm not a mathematical genius, nor do I have any experience with writing neural networks. So, please, forgive whatever idiotic things I happen to say here. ;)
I've always read about neural networks being used for machine learning, but while experimenting with writing simple virtual machines, I began to wonder if they could be applied in another way.
Specifically, can a virtual machine be created as a neural network? If so, how would it work (feel free to use an abstract description here, if you have to)?
I've heard of the Joycean Machine, but I can't find any information other than very, very vague explanations.
EDIT: What I'm looking for here is an explanation of exactly how a neural network-based VM would interpret assembly. How would inputs be handled, etc? Would each individual input be a memory address? Let's brainstorm!
You really made my day buddy...
Since an already trained neural network won't be much different from a regular state machine, there is no point in writing a neural network VM for a deterministic instruction set.
It might be interesting to train such a VM with multiple instruction sets or an unknown set. However, I doubt it would be practical to carry out such training, and even a 99% correct interpreter would be of little use for conventional bytecode.
The only use of a neural network VM I can think of is executing a program that contains fuzzy logic constructs or AI algorithm heuristics.
Some silly stack machine example to demonstrate the idea:
push [x1]
push [y1] ;start coord
push [x2]
push [y2] ;end coord
pushmap [map] ;some struct
stepastar ;push the next step of A* heuristics to accumulator and update the map
pop ;do something with it and pop
stepastar ;next step again
... ;stack top is a map
reward ;we liked the coordinate. reinforce the heuristic
stepastar
... ;stack top is a map
punish ;we didn't like the next coordinate. try something different
There is no explicit heuristic here. Just assume we keep all state in *map, including the heuristic algorithm.
You can see it looks silly and is not completely context-sensitive, but a neural network is of no value if it doesn't learn online.
Of course. With a rather complex network no doubt.
Much of the parsing of bytecodes/opcodes is pattern matching, which neural networks excel at.
You could certainly do this with a neural network - I could easily see learning the correct state transitions for a given piece of bytecode (there is a rough sketch after the two lists below).
Input could be something like:
Value at top of stack
Value in current accumulator
Byte code at current instruction pointer
Byte value at current data pointer
Previous flags
Output could be something like:
Change to instruction pointer
Change to data pointer
Change to accumulator
Stack operation (push, pop, or nothing)
Memory operation (read to accumulator, write accumulator or nothing)
New flags
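Purely as a sketch of the shape of such a network (NumPy, with arbitrary layer sizes, untrained random weights, and a made-up numeric encoding of the state; the real work would be producing training data from an accurate conventional interpreter):

import numpy as np

rng = np.random.default_rng(0)

# State vector: [top of stack, accumulator, opcode byte, data byte, flags]
state_size, hidden_size, output_size = 5, 64, 6
W1 = rng.normal(scale=0.1, size=(hidden_size, state_size))
W2 = rng.normal(scale=0.1, size=(output_size, hidden_size))

def step(state):
    # One "instruction cycle": map the current machine state to the outputs
    # listed above (delta IP, delta DP, delta accumulator, stack op,
    # memory op, new flags). These weights are random here; they would
    # have to be trained on execution traces to mean anything.
    hidden = np.tanh(W1 @ state)
    return W2 @ hidden

state = np.array([3.0, 0.0, 0x10, 7.0, 0.0])   # made-up example state
print(step(state))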
However - I'm not sure why you would want to do this in the first place. A neural network would be much less efficient (and potentially make mistakes unless you trained it well enough) compared to just executing the bytecode directly. You'd probably need to write an accurate bytecode evaluator anyway just to create enough training data....
Also, in my experience neural networks tend to be good at pattern recognition but very bad at learning logical operations (like binary addition or XORs) once you get beyond a certain scale (i.e. more than a few bits). So depending on the complexity of your instruction set, the network could take a very large amount of time to train.

Dijkstra algorithm for iPhone

It is possible to easily use the GPS functionality in the iPhone since SDK 3.0, but it is explicitly forbidden to use Google Maps.
This has two implications, I think:
You will have to provide maps yourself
You will have to calculate the shortest routes yourself.
I know that calculating the shortest route has puzzled mathematicians for ages, but both TomTom and Google are doing a great job, so that issue seems to have been solved.
Searching on the 'net, not being a mathematician myself, I came across Dijkstra's algorithm. Is there anyone of you who has successfully used this algorithm in a Maps-like app on the iPhone?
Would you be willing to share it with me/the community?
Would this be the right approach, or are there other options?
Thank you so much for your consideration.
I do not believe Dijkstra's algorithm would be useful for real-world mapping because, as Tom Leys said (I would comment on his post, but lack the rep to do so), it requires a single starting point. If the starting point changes, everything must be recalculated, and I would imagine this would be quite slow on a device like the iPhone for a significantly large data set.
Dijkstra's algorithm is for finding the shortest path to all nodes (from a single starting node). Game programmers use a directed search such as A*. Where Dijkstra processes the node that is closest to the starting position first, A* processes the one that is estimated to be nearest to the end position.
The way this works is that you provide a cheap "estimate" function from any given position to the end point. A good example is how far a bird would fly to get there. A* adds this to the current distance from the start for each node and then chooses the node that seems to be on the shortest path.
The better your estimate, the shorter the time it will take to find a good path. If this time is still too long, you can do a path find on a simple map and then another on a more complex map to find the route between the places you found on the simple map.
Update
After much searching, I have found an article on A* for you to read.
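A minimal sketch of the idea in Python (heapq-based, with straight-line distance as the cheap "as the bird flies" estimate; the graph, coordinates and costs are placeholders):

import heapq, math

def a_star(graph, coords, start, goal):
    # graph: {node: [(neighbour, edge_cost), ...]}, coords: {node: (x, y)}
    def estimate(n):                       # straight-line distance to the goal
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)

    open_set = [(estimate(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_set:
        _, cost, node, path = heapq.heappop(open_set)
        if node == goal:
            return path, cost
        for neighbour, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                heapq.heappush(open_set, (new_cost + estimate(neighbour),
                                          new_cost, neighbour, path + [neighbour]))
    return None, float("inf")

graph = {"A": [("B", 2.0), ("C", 4.0)], "B": [("C", 1.0)], "C": []}
coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0)}
print(a_star(graph, coords, "A", "C"))     # (['A', 'B', 'C'], 3.0)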
Dijkstra's algorithm is O(m log n) for n nodes and m edges (for a single path) and is efficient enough to be used for network routing. This means that it's efficient enough to be used for a one-off computation.
Briefly, Dijkstra's algorithm works like this (a runnable sketch follows after the steps):
Take the start node
Assign it a depth of zero
Insert it into a priority queue at its depth key
Repeat:
Pop the node with the lowest depth from the priority queue
Record the node that you came from so you can track the path back
Mark the node as having been visited
If this node is the destination:
Break
For each neighbour:
If the neighbour has not previously been visited:
Calculate depth as depth of current node + distance to neighbour
Insert neighbour into the priority queue at the calculated depth.
Return the destination node and list of the nodes through which it was reached.
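The same steps in runnable Python, using heapq as the priority queue (the graph format is an assumption for this sketch):

import heapq

def dijkstra(graph, start, destination):
    # graph: {node: [(neighbour, distance), ...]}
    queue = [(0.0, start)]          # priority queue keyed on depth
    depth = {start: 0.0}
    came_from = {}
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue                # stale queue entry, already settled
        visited.add(node)
        if node == destination:
            break
        for neighbour, distance in graph.get(node, []):
            new_depth = d + distance
            if neighbour not in visited and new_depth < depth.get(neighbour, float("inf")):
                depth[neighbour] = new_depth
                came_from[neighbour] = node     # record where we came from
                heapq.heappush(queue, (new_depth, neighbour))

    if destination not in depth:
        return None, float("inf")   # unreachable
    # Walk back through the recorded predecessors to recover the path.
    path, node = [destination], destination
    while node != start:
        node = came_from[node]
        path.append(node)
    return list(reversed(path)), depth[destination]

graph = {"A": [("B", 2.0), ("C", 4.0)], "B": [("C", 1.0)], "C": []}
print(dijkstra(graph, "A", "C"))    # (['A', 'B', 'C'], 3.0)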
Contrary to popular belief, Dijkstra's algorithm is not necessarily an all-pairs shortest path calculator, although it can be adapted to do this.
You would have to get a graph of the streets and intersections with the distances between the intersections. If you had this data you could use Dijkstra's algorithm to compute a shortest route.
If you look at the technology TomTom calls 'IQ Routes', they measure actual speed and travel time per road stretch per time of day. This makes the arrival time more accurate, so the expected arrival time is more fact-based: http://www.tomtom.com/page/iq-routes
Calculating a route using the A* algorithm is plenty fast enough on an iPhone with offline map data. I have experience of doing this commercially. I use the A* algorithm as documented on Wikipedia, and I keep the road network in memory and re-use it; once it's loaded, routing even over a large area like Spain or the western half of Canada is practically instant.
I take data from OpenStreetMap or elsewhere and convert it into a directed graph, assuming (which is the right way to do it according to those who know) that any two roads sharing a point with the same ID are joined. I assign weights to different types of roads based on expected speeds, and if a portion of a road is one-way I create only a single arc; two-way roads get two arcs, one in each direction. That's pretty much the whole thing, apart from some ad-hoc code to prevent dangerous turns and to implement routing restrictions.
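A rough sketch of that conversion step in Python (the road segment format and the speed table are invented for illustration; real OSM data needs far more care):

import math

# Hypothetical input: each road is a list of (point_id, lat, lon), a type tag
# and a one-way flag. Points sharing the same ID join roads together.
expected_speed_kmh = {"motorway": 110, "primary": 80, "residential": 40}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometres.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def build_graph(roads):
    graph = {}   # {point_id: [(neighbour_id, travel_time_hours), ...]}
    for points, road_type, one_way in roads:
        speed = expected_speed_kmh.get(road_type, 50)
        for (a, lat1, lon1), (b, lat2, lon2) in zip(points, points[1:]):
            km = haversine_km(lat1, lon1, lat2, lon2)
            weight = km / speed                     # travel time as the arc weight
            graph.setdefault(a, []).append((b, weight))
            if not one_way:                         # two-way roads get a reverse arc
                graph.setdefault(b, []).append((a, weight))
    return graph

roads = [([("a", 52.0, 4.0), ("b", 52.0, 4.1)], "primary", False)]
print(build_graph(roads))

The resulting graph plugs straight into an A* or Dijkstra routine like the ones shown earlier.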
This was discussed earlier here: What algorithms compute directions from point a to point b on a map?
Have a look at CloudMade. They offer a free service for iPhone and iPad that allows navigation based on your current location. It is built on OpenStreetMap and has some nifty features, like making your own map style. It is a little slow from time to time, but it's totally free.