Neural networks, how do they look in coding?


Basically, I know the concept of a neural network and what it is, but I can't figure out how it looks when you code it or how you store the data. I went through many tutorials that I found on Google but couldn't find any piece of code, just concepts and algorithms.
Can anyone give me a piece of code for a simple neural network, something like "Hello World!"?

What you mainly need is an object representing a single neuron, together with its associations to other neurons (which represent synapses) and their weights.
A single neuron in a typical OOP language will be something like
class Synapse
{
    Neuron sending;
    Neuron receiving;
    float weight;
}
class Neuron
{
    ArrayList<Synapse> toSynapses;
    ArrayList<Synapse> fromSynapses;
    Function threshold;
}
where threshold represents the function that is applied to the weighted sum of the inputs to decide whether the neuron activates and propagates the signal.
Of course, you will then need a specific algorithm that operates on this data structure to feed the net forward or to back-propagate the learning.
The simplest thing you could start implementing would be a simple perceptron; you can find some information here.
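To make the perceptron suggestion concrete, here is a minimal sketch in Python rather than the Java-style classes above. It assumes a single neuron with a step threshold and the classic perceptron learning rule; the names train_perceptron, predict and AND_SAMPLES are made up for this illustration, and the learning rate and epoch count are arbitrary choices.

```python
# Minimal single-neuron perceptron (illustrative sketch, not a library API).

def predict(weights, bias, inputs):
    # Weighted sum of the inputs followed by a step threshold.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    # Start from zero weights; the number of weights matches the input size.
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Perceptron rule: nudge each weight by error * input.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Truth table for logical AND, used as a toy training set.
AND_SAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

if __name__ == "__main__":
    weights, bias = train_perceptron(AND_SAMPLES)
    for inputs, target in AND_SAMPLES:
        print(inputs, "->", predict(weights, bias, inputs), "expected", target)
```

A single perceptron like this can only learn linearly separable functions such as AND or OR; something like XOR needs at least one hidden layer.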

You said you're already familiar with neural networks, but since there are many different types of neural networks of differing complexity (convolutional, Hebbian, Kohonen maps, etc.), I'll go over a simple feed-forward neural network again, just to make sure we're on the same page.
A basic neural network consists of the following things:
- Neurons
  - Input Neuron(s)
  - Hidden Neurons (optional)
  - Output Neuron(s)
- Links between Neurons (sometimes called synapses in analogy to biology)
- An activation function
The Neurons have an activation value. When you evaluate a network, the input nodes' activation is set to the actual input. The links from the input nodes lead to nodes closer to the output, usually to one or more layers of hidden nodes. At each neuron, the input activation is processed using an activation function. Different activation functions can be used, and sometimes they even vary within the neurons of a single network.
The activation function turns the activation of the neuron into its output. The early experiments usually used a simple threshold function (e.g. activation > 0.5 ? 1 : 0); nowadays a sigmoid function is often used.
The output of the activation function is then propagated over the links to the next nodes. Each link has an associated weight it applies to its input.
Finally, the output of the network is extracted from the activation of the output neuron(s).
I've put together a very simple (and very verbose...) example here. It's written in Ruby and computes AND, which is about as simple as it gets.
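For readers who cannot follow the link, the same idea can be sketched in a few lines of Python. This is not the Ruby example mentioned above; it is a separate minimal sketch that assumes a single output neuron with a sigmoid activation, and the weights (10, 10) and bias (-15) are hand-picked assumptions that happen to produce AND.

```python
import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward(inputs, weights, bias):
    # Each link multiplies its input by its weight; the neuron sums
    # the results and passes the sum through the activation function.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

# Hand-picked parameters (assumption): only 1 AND 1 pushes the sum above 0.
AND_WEIGHTS = [10.0, 10.0]
AND_BIAS = -15.0

def and_gate(a, b):
    # Read the network output as a boolean by thresholding at 0.5.
    return 1 if feed_forward([a, b], AND_WEIGHTS, AND_BIAS) > 0.5 else 0

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, "AND", b, "=", and_gate(a, b))
```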
A much trickier question is how to actually create a network that does something useful. The trivial network of the example was created manually, but that is infeasible with more complex problems. There are two approaches I am aware of, with the most common being backpropagation. Less used is neuroevolution, where the weights of the links are determined using a genetic algorithm.

AI-Junkie has a great tutorial on (A)NNs and they have the code posted there.
Here is a neuron (from ai-junkie):
struct SNeuron
{
    //the number of inputs into the neuron
    int m_NumInputs;
    //the weights for each input
    vector<double> m_vecWeight;
    //ctor
    SNeuron(int NumInputs);
};
Here is a neuron layer (ai-junkie):
struct SNeuronLayer
{
    //the number of neurons in this layer
    int m_NumNeurons;
    //the layer of neurons
    vector<SNeuron> m_vecNeurons;
    SNeuronLayer(int NumNeurons, int NumInputsPerNeuron);
};
Like I mentioned before... you can find all of the code with the ai-junkie (A)NN tutorial.

This is the NuPIC programmer's guide. NuPIC is the framework that implements their theory (HTM), which is based on the structure and operation of the neocortex.
This is how they define HTM
HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click through patterns on the web.
This is a simple net with Numenta's NuPIC 1.5:
from nupic.network import *
from nupic.network.simpledatainterface import WideDataInterface

def TheNet():
    net = SimpleHTM(
        levelParams=[
            {  # Level 0
            },
            {  # Level 1
                'levelSize': 8, 'bottomUpOut': 8,
                'spatialPoolerAlgorithm': 'gaussian',
                'sigma': 0.4, 'maxDistance': 0.05,
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            {  # Level 2
                'levelSize': 4, 'bottomUpOut': 4,
                'spatialPoolerAlgorithm': 'product',
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            {  # Level 3
                'levelSize': 1,
                'spatialPoolerAlgorithm': 'product',
                'mapperAlgorithm': 'sumProp'
            },
        ],
    )
    Data = WideDataInterface('Datos/__Categorias__.txt', 'Datos/Datos_Entrenamiento%d.txt', numDataFiles=8)
    net.createNetwork(Data)
    net.train(Data)

if __name__ == '__main__':
    print "Creating HTM Net..."
    TheNet()

Related

Predictions using Convolutional Neural Networks and DL4J

This is my first time working with DL4J (Deep Learning for Java) and also my first Convolutional Neural Network. My goal is to use the Convolutional Neural Network to give me some predicted values about an image. I gathered and labelled my images myself. The labels or expected outputs consist of two numbers between 0 and 1 (I just wrote them in the file name, like 0.01x0.87.jpg).
Now I can't find any way to use the DataSetIterator Class which DL4J uses so that I can also set my label values.
Is there a simple way to tell DL4J that I want to train my Network to recognize that image 0.01x0.01.jpg should spit out the values 0.01 and 0.01?

What you want to do is usually known as regression. In contrast to classification where you want to either have a 0 or 1 output, in regression any value can be the target.
In your case, you will likely want to use a network architecture that uses either a sigmoid (which forces your values to be between 0 and 1) or an identity (which keeps the values as is, i.e. allows for them to be outside of the 0 to 1 range) activation function.
As you have two values that you are trying to predict, you will have to also define that you are using two outputs.
So much for your model architecture.
For data loading, you can use the ImageRecordReader, but also pass it a PathMultiLabelGenerator of your own. When you implement the PathMultiLabelGenerator interface, you will get the full path of the image as a string, and you can do whatever you want with it, like for example remove the file ending, split on x and parse your filename into a list of DoubleWritable. DoubleWritable is just a simple wrapper class for double so creating that is as easy as just instantiating it by passing the actual value to the constructor.
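The filename-parsing step described above is language-agnostic, so here is a rough sketch of it in Python rather than Java. In DL4J this logic would live inside your PathMultiLabelGenerator implementation and return DoubleWritable objects instead of plain floats; the function name labels_from_filename is made up for this illustration.

```python
import os

def labels_from_filename(path):
    # "/some/dir/0.01x0.87.jpg" -> "0.01x0.87.jpg"
    name = os.path.basename(path)
    # Drop the file ending: "0.01x0.87.jpg" -> "0.01x0.87"
    stem, _extension = os.path.splitext(name)
    # Split on "x" and parse each part as a floating-point label.
    return [float(part) for part in stem.split("x")]

if __name__ == "__main__":
    print(labels_from_filename("/data/images/0.01x0.87.jpg"))
```

Note that this assumes every file name follows the 0.01x0.87.jpg pattern exactly; a name with an extra "x" or a non-numeric part would raise a ValueError.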
To create a dataset iterator you can now follow the documentation on RecordReaderDataSetIterator.

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using the Back Propagation Algorithm. I've taken the image and reduced it to 5x7 size, so I got 35 pixels and trained the network using those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. I completed the training successfully and got the weights that I needed, but now I'm stuck. I have my test set and I know I should feed it forward through the network, but I don't know what to do exactly. My test set will be 4 samples of 1x35. My output layer has 10 neurons. How exactly do I distinguish the characters from the output that I will get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.

One vs All
A common approach for testing these types of neural networks is the "one-vs-all" approach. We view each of the output nodes as its own classifier that gives the probability of the sample being that class vs not being that class.
For instance, if your network outputs [1, 0, ..., 0], then the sample has a high probability of being class 1 vs not being class 1, a low probability of being class 2 vs not being class 2, etc.
Ties
In the case of a tie, it is common (in research) to have a random function break the tie. If you get [1, 1, 1, ..., 1] then the function would pick a number from 1-10 and that is your prediction. In practice, sometimes an expert system is used to break ties. Perhaps class 1 is more expensive than class 2, so we break the tie in favor of class 2.
Steps
So the steps are:
1. Split dataset into test/train set
2. Train weights on train set
3. Pass test set forward through the neural network
4. For each sample, choose the argmax (the output with highest value) as your prediction
5. In case of tie, choose randomly between all tying classes
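The last two steps (argmax with a random tie-break) can be sketched like this in Python. The function name predict_class is made up for this illustration, and the returned index is 0-based, so add 1 if your classes are numbered 1-10.

```python
import random

def predict_class(outputs, rng=random):
    # Find the highest activation among the output neurons.
    best = max(outputs)
    # Collect every output index that reaches that value (ties included).
    tied = [index for index, value in enumerate(outputs) if value == best]
    # With a single winner this just returns it; otherwise pick one at random.
    return rng.choice(tied)

if __name__ == "__main__":
    print(predict_class([0.1, 0.05, 0.9, 0.2, 0.0, 0.1, 0.3, 0.0, 0.1, 0.2]))
    print(predict_class([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))  # random tie-break
```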
Aside
In your particular case, I imagine that implementing this strategy will result in a network that barely beats random accuracy (10%).
I would suggest reconsidering the network architecture.
If you look at your 5x7 images, can you tell what number that image was originally? It seems likely that scaling the image down to this size loses so much information that the network cannot distinguish between classes.
Debugging
From what you've described I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) throwing away too much information? Check this by manually inspecting a few of the images and seeing if you can tell what the image should be.
Does your one-hot algorithm work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see a (roughly) monotonic decrease in your loss function while training. Try printing the loss that you are optimizing at every step (or every few steps). Or, for a very simple gut check, print the mean squared error: the mean of (P-Y)^2, where P is the prediction and Y is the target.
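Two of these checks are easy to sketch in Python. The helper names one_hot and mean_squared_error are made up for this illustration, and one_hot assumes classes are numbered from 1, matching the 1 -> [1, 0, 0, ..., 0] example above.

```python
def one_hot(class_number, num_classes):
    # Class numbers start at 1 here (assumption), so class 1 sets index 0.
    vector = [0.0] * num_classes
    vector[class_number - 1] = 1.0
    return vector

def mean_squared_error(predicted, target):
    # Average of (P - Y)^2 over all output neurons.
    squared = [(p - y) ** 2 for p, y in zip(predicted, target)]
    return sum(squared) / len(squared)

if __name__ == "__main__":
    target = one_hot(1, 10)
    print(target)
    # Print this every few training steps; it should trend downwards.
    print(mean_squared_error([0.8] + [0.1] * 9, target))
```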

What is the policy gradient when multiple actions are possible?

I am trying to program a reinforcement learning algorithm using policy gradients, as inspired by Karpathy's blog article. Karpathy's example has only two actions UP or DOWN, so a single output neuron is sufficient (high activation=UP, low activation=DOWN). I want to extend this to multiple actions, so I believe I need a softmax activation function on the output layer. However, I am not certain about what the gradient for the output layer should be.
If I was using a cross-entropy loss function with the softmax activation in a supervised learning context, the gradient for output neuron i is simply:
g[i] = a[i] - target[i]
where target[i] = 1 for the desired action and 0 for all others.
To use this for reinforcement learning I would multiply g[i] by the discounted reward before back-propagating.
However, it seems that reinforcement learning uses negative log-likelihood as the loss instead of cross-entropy. How does that change the gradient?

Note: something that I think will get you on the right track:
The negative log likelihood is also known as the multiclass cross-entropy (Pattern Recognition and Machine Learning).
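Here is a small Python sketch of that equivalence, assuming a softmax output layer and a one-hot target for the action that was taken. With a one-hot target the two losses are the same number, so the gradient with respect to the pre-softmax activations stays a[i] - target[i], scaled by the discounted reward. All function names are made up for this illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    # Multiclass cross-entropy against a (one-hot) target vector.
    return -sum(t * math.log(p) for p, t in zip(probs, target))

def negative_log_likelihood(probs, action):
    # Negative log-probability of the action that was actually taken.
    return -math.log(probs[action])

def output_gradient(probs, action, discounted_reward=1.0):
    # Gradient of discounted_reward * NLL with respect to the logits:
    # (a[i] - target[i]) scaled by the reward signal.
    return [discounted_reward * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]

if __name__ == "__main__":
    probs = softmax([1.0, 2.0, 0.5])
    print(cross_entropy(probs, [0.0, 1.0, 0.0]))
    print(negative_log_likelihood(probs, 1))
    print(output_gradient(probs, 1, discounted_reward=0.5))
```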
EDIT: misread the question. I thought this was talking about Deep Deterministic Policy Gradients
It would depend on your domain, but with a softmax, you are getting a probability across all output nodes. To me that doesn't really make sense in most domains when you think about DDPG. For example, if you are controlling the extension of robotic arms and legs, it wouldn't make sense to have limb extension measured as [.25, .25, .25, .25], if you wanted to have all limbs extended. In this case, .25 could mean fully extended, but what happens if the vector of outputs is [.75,.25,0,0]? So in this way, you could have a separate sigmoid function from 0 to 1 for all action nodes, where then you could represent it as [1,1,1,1] for all arms being extended. I hope that makes sense.
Since the actor network is what determines the actions in DDPG, we could then represent our network like this for our robot (rough keras example):
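from keras.layers import Input, Dense
from keras.models import Model

state = Input(shape=[your_state_shape])
hidden_layer = Dense(30, activation='relu')(state)
all_limbs = Dense(4, activation='sigmoid')(hidden_layer)  # one independent sigmoid per limb
model = Model(inputs=state, outputs=all_limbs)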
Then, your critic network will have to account for the action dimensions.
from keras.layers import Input, Dense, Add

state = Input(shape=[your_state_shape])
action = Input(shape=[4])
state_hidden = Dense(30, activation='relu')(state)
state_hidden_2 = Dense(30, activation='linear')(state_hidden)
action_hidden = Dense(30, activation='linear')(action)
combined = Add()([state_hidden_2, action_hidden])  # element-wise sum of the two branches
squasher = Dense(30, activation='relu')(combined)
output = Dense(4, activation='linear')(squasher) #number of actions
Then you can use your target functions from there. Note, I don't know if this is working code, as I haven't tested it, but hopefully you get the idea.
Source: https://arxiv.org/pdf/1509.02971.pdf
Awesome blog on this with TORCS (not created by me): https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
In the above blog, they also show how to use different output functions, such as one tanh and two sigmoid functions for actions.

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher so don't have a math background and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
1. How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
2. If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
3. Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
4. I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
5. What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1 - or would it output fractions that I would then have to roundup or rounddown to finalize the prediction?
6. Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid).
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!

First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
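One common technique for getting a representative sample is stratified sampling: group the records by some bucket, then draw the same fraction from every bucket. I am naming it here as one option, not as the one right answer. The sketch below is plain Python with made-up names (stratified_split, score_bucket), and it only stratifies on the output score, not on the 21 inputs.

```python
import random
from collections import defaultdict

def stratified_split(records, key, train_fraction=0.8, seed=0):
    # Group the records by bucket, e.g. by 100-point SAT score band.
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    train, test = [], []
    for bucket in groups.values():
        # Shuffle inside each bucket, then take the same share from each one.
        rng.shuffle(bucket)
        cut = round(len(bucket) * train_fraction)
        train.extend(bucket[:cut])
        test.extend(bucket[cut:])
    return train, test

def score_bucket(score):
    # Hypothetical bucketing: 200-299 -> 2, 300-399 -> 3, ..., 800 -> 8.
    return score // 100

if __name__ == "__main__":
    scores = [200 + 10 * (i % 61) for i in range(610)]  # toy data
    train, test = stratified_split(scores, score_bucket)
    print(len(train), len(test))
```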
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Neural Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the coefficient that will be found by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
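For reference, the z-score normalization asked about in the question is simple to compute by hand if you want to compare results with and without it. This is a minimal sketch using only the Python standard library; the function name z_scores is made up, and it uses the population standard deviation, which is an assumption (some tools use the sample standard deviation instead).

```python
import statistics

def z_scores(values):
    # z = (value - mean) / standard deviation, computed per input column.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / stdev for value in values]

if __name__ == "__main__":
    print(z_scores([55, 70, 85, 100]))  # a test scored on 1-100
    print(z_scores([11, 14, 17, 20]))   # a test scored on 1-20
```

After this transformation both columns are on the same scale, regardless of whether the original test was scored out of 100 or out of 20.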
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of Neurons.
According to this answer, a middle layer should contain at most twice the amount of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use (at most) 20 middle neurons per layer).
Does that mean that all data will be modeled within that amount of Neurons?
So if, for example, one wants to do anything from modeling weather (a million input nodes from data from different weather stations) to simple OCR (of scanned text with a resolution of 1000x1000DPI) one would need the same amount of nodes?
PS.
My last question was closed. Is there another SE site where these kinds of questions are on topic?

You will likely have overfitting of your data (aka high variance). Think of it like this: the more neurons and layers you have, the more parameters you have to fit your data.
Remember that for a first-layer node the equation becomes Z = sigmoid(sum(W*x))
For a second-layer node it becomes Z2 = sigmoid(sum(W2*Z))
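Those two equations can be written out in a few lines of Python. This is only a sketch: the weight values are arbitrary assumptions, bias terms are left out to match the equations as written, and the names layer_forward, W1 and W2 are made up for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(weights, inputs):
    # One row of weights per node: node output = sigmoid(sum(W * inputs)).
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

# Arbitrary example weights: 2 inputs -> 2 hidden nodes -> 1 output node.
W1 = [[0.5, -0.5], [1.0, 1.0]]
W2 = [[1.0, -1.0]]

if __name__ == "__main__":
    x = [1.0, 0.0]
    Z = layer_forward(W1, x)    # first layer:  Z  = sigmoid(sum(W1 * x))
    Z2 = layer_forward(W2, Z)   # second layer: Z2 = sigmoid(sum(W2 * Z))
    print(Z, Z2)
    # Every extra node or layer adds weights, i.e. more parameters to fit.
    print(sum(len(row) for row in W1) + sum(len(row) for row in W2))
```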
Look into the machine learning class taught at Stanford. It's a great online course and a good tool to use as a reference.

More than two hidden layers can be useful in certain architectures such as cascade correlation (Fahlman and Lebiere 1990) and in special applications, such as the two-spirals problem (Lang and Witbrock 1988) and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade Correlation Learning Architecture," NIPS2, 524-532.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to handwritten ZIP code recognition", Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" on comp.ai.neural-nets's FAQ for more information.
