Neural networks for Julia - first steps

I would like to use neural networks and related concepts with Julia. The idea is to start with Tic-Tac-Toe, then look at more general m,n,k-games. I would like to stay flexible for experiments with different networks, layer sizes, numbers of hidden layers, etc.
A simple way to represent the board is an m * n matrix with +1, -1, 0 for “X”, “O”, “empty cell”. The network should therefore have an input layer of m * n units, at least one hidden layer, and an output layer of m * n units giving the probability of the next move per cell.
I had a look at the available resources in https://juliapackages.com/c/ai
Which package do you propose to start with? What about ease of use and performance? (I don’t need a GPU backend)

Related

Can't approximate simple multiplication function in neural network with 1 hidden layer

I just wanted to test how well a neural network can approximate the multiplication function (a regression task).
I am using Azure Machine Learning Studio. I have 6500 samples and 1 hidden layer
(I have tested 5 / 30 / 100 neurons in the hidden layer), with no normalization and the default parameters:
learning rate - 0.005, number of learning iterations - 200, initial learning weights - 0.1,
momentum - 0 [description]. I got extremely bad accuracy, close to 0.
At the same time, boosted decision forest regression gives a very good approximation.
What am I doing wrong? This task should be very easy for a NN.
The large gradient of the multiplication function probably forces the net almost immediately into some horrifying state where all its hidden nodes have zero gradient.
We can use two approaches:
1) Divide by a constant. We simply divide everything before learning and multiply afterwards.
2) Make log-normalization. It makes multiplication into addition:
m = x*y => ln(m) = ln(x) + ln(y).
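To illustrate the log-normalization idea, here is a minimal sketch (not the setup from the question) that fits a plain linear model on log-transformed inputs. The data range, sample count, and variable names are assumptions made for the example, and it only applies to strictly positive inputs.

```python
import numpy as np

# Assumed toy data: strictly positive inputs so the logarithm is defined.
rng = np.random.default_rng(0)
xy = rng.uniform(0.1, 10.0, size=(1000, 2))
m = xy[:, 0] * xy[:, 1]

# In log space the target is linear: ln(m) = 1*ln(x) + 1*ln(y) + 0.
A = np.column_stack([np.log(xy), np.ones(len(xy))])
coef, *_ = np.linalg.lstsq(A, np.log(m), rcond=None)
w_x, w_y, bias = coef

# Undo the transform to get predictions of the product itself.
pred = np.exp(A @ coef)
```

The fitted weights should come out close to 1, 1 and 0, which is the sense in which the transform turns multiplication into addition.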
Some things to check:
Your output layer should have a linear activation function. If it's sigmoidal, it won't be able to represent values outside its range (e.g. -1 to 1)
You should use a loss function that's appropriate for regression (e.g. squared error)
If your hidden layer uses sigmoidal activation functions, check that you're not saturating them. Multiplication can work on arbitrarily small/large values. And, if you pass a large number as input you can get saturation, which will lose information. If using ReLUs, make sure they're not getting stuck at 0 on all examples (although activations will generally be sparse on any given example).
Check that your training procedure is working as intended. Plot the error over time during training. How does it look? Are your gradients well behaved or are they blowing up? One source of problems can be the learning rate being set too high (unstable error, exploding gradients) or too low (very slow progress, error doesn't decrease quickly enough).
This is how I do multiplication with neural network:
import numpy as np
from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Dense(150, activation='relu', input_shape=(2,)))
model.add(layers.Dense(1, activation='relu'))
data = np.random.random((10000, 2))
results = np.asarray([a * b for a, b in data])
model.compile(optimizer='sgd', loss='mae')
model.fit(data, results, epochs=1, batch_size=1)
model.predict(np.array([[0.8, 0.5]]))
It works.
"Two approaches: divide by constant, or make log normalization"
I tried both approaches. Certainly, log normalization works since, as you rightly point out, it forces an implementation of addition. Dividing by a constant, or similarly normalizing across any range, does not seem to succeed in my extensive testing.
The log approach is fine, but if you have two datasets with a set of inputs and a target y value where:
In dataset one the target is consistently a sum of two of the inputs
In dataset two the target is consistently the product of two of the inputs
Then it's not clear to me how to design a neural network which will find the target y in both datasets using backpropagation. If this isn't possible, then I find it a surprising limitation on the ability of a neural network to find "an approximation to any function". But I'm new to this game, and my expectations may be unrealistic.
Here is one way you could approximate the multiplication function using one hidden layer. It uses a sigmoidal activation in the hidden layer, and it works quite nicely up to a certain range of numbers. This is the gist link
m = x*y => ln(m) = ln(x) + ln(y), but only if x, y > 0

How can I add concurrency to neural network processing?

The basics of neural networks, as I understand them, are that there are several inputs, weights and outputs. There can be hidden layers that add to the complexity of the whole thing.
If I have 100 inputs, 5 hidden layers and one output (yes or no), presumably, there will be a LOT of connections. Somewhere on the order of 100^5. To do back propagation via gradient descent seems like it will take a VERY long time.
How can I set up the back propagation in a way that is parallel (concurrent), to take advantage of multicore processors (or multiple processors)?
This is a language agnostic question because I am simply trying to understand structure.
If you have 5 hidden layers (assuming with 100 nodes each) you have 5 * 100^2 weights (assuming the bias node is included in the 100 nodes), not 100^5 (because there are 100^2 weights between two consecutive layers).
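The count can be checked with a small sketch. The helper name `count_weights` and the layer list are assumptions for illustration. Unlike the estimate above, it also counts the 100 weights into the single output node, and optionally separate bias terms.

```python
def count_weights(layer_sizes, include_bias=True):
    """Count parameters of a fully connected feed-forward network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out      # one weight per connection
        if include_bias:
            total += fan_out           # one bias per receiving node
    return total

# 100 inputs, 5 hidden layers of 100 nodes each, 1 output.
layers = [100] + [100] * 5 + [1]
n_weights = count_weights(layers, include_bias=False)
n_params = count_weights(layers, include_bias=True)
print(n_weights, n_params, 100 ** 5)
```

Either way the total is in the tens of thousands, far below 100^5.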
If you use gradient descent, you'll have to calculate the contribution of each training sample to the gradient, so a natural way of distributing this across cores would be to spread the training sample across the cores and sum the contributions to the gradient in the end.
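A minimal sketch of that idea, assuming a linear model with squared error as a stand-in for the network (the function names, shard count, and data are all made up for the example). Each worker computes the gradient contribution of its shard, and the contributions are summed once at the end.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, X, y):
    """Summed squared-error gradient contribution of one shard."""
    residual = X @ w - y
    return X.T @ residual

def parallel_gradient(w, X, y, n_workers=4):
    # Spread the training samples across the workers.
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(shard_gradient, [w] * n_workers, X_shards, y_shards))
    # Merge once: sum the contributions and average over all samples.
    return sum(partials) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))
y = rng.normal(size=1000)
w = np.zeros(100)

g_parallel = parallel_gradient(w, X, y)
g_serial = shard_gradient(w, X, y) / len(y)
```

The sharded result matches the single-pass gradient, so the parallel version changes only how the work is scheduled. Threads are used here just to keep the sketch short. A real setup might use processes or a framework's own data parallelism instead.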
With backpropagation, you can use batch backpropagation (accumulate weight changes from several training samples before updating the weights, see e.g. https://stackoverflow.com/a/11415434/288875 ).
I would think that the first option is much more cache friendly (updates need to be merged only once between processors in each step).

Replicator Neural Network for outlier detection, Step-wise function causing same prediction

In my project, one of my objectives is to find outliers in aeronautical engine data. I chose to use a Replicator Neural Network for this and read the following report on it (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.12.3366&rep=rep1&type=pdf). I am having trouble understanding the step-wise function (page 4, figure 3) and the prediction values that result from it.
The replicator neural network is best explained in the above report, but as background: the network I have built has the same number of outputs as inputs and 3 hidden layers, with the following activation functions:
Hidden layer 1 = tanh sigmoid, S1(θ) = tanh(θ)
Hidden layer 2 = step-wise, S2(θ) = 1/2 + 1/(2(k − 1)) Σ_j tanh[a3(θ − j/N)] (summed over j)
Hidden layer 3 = tanh sigmoid, S1(θ) = tanh(θ)
Output layer 4 = normal sigmoid, S3(θ) = 1/(1 + e^(−θ))
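To see what the step-wise activation of hidden layer 2 does, here is a rough sketch. The summation range j = 1..k−1 and the values k = N = 4 and a3 = 100 are assumptions chosen for illustration, not values confirmed by the question.

```python
import numpy as np

def stepwise(theta, k=4, N=4, a3=100.0):
    """Staircase activation: 1/2 + 1/(2(k-1)) * sum_j tanh(a3 * (theta - j/N)).

    Assumes j runs from 1 to k-1. A large a3 makes each tanh nearly a hard step.
    """
    theta = np.asarray(theta, dtype=float)
    j = np.arange(1, k)  # j = 1, ..., k-1
    steps = np.tanh(a3 * (theta[..., None] - j / N))
    return 0.5 + steps.sum(axis=-1) / (2 * (k - 1))

# Inputs below, between, and above the step positions 0.25, 0.5, 0.75.
levels = stepwise(np.array([-1.0, 0.375, 0.625, 2.0]))
print(levels)
```

Under these assumptions the function maps any input onto one of k nearly discrete levels between 0 and 1, which is why many different inputs end up with identical middle-layer activations.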
I have implemented the algorithm and it seems to be training (the mean squared error decreases steadily during training). The only thing I don't understand is how predictions are made once the middle layer with the step-wise activation function is applied. It causes the activations of the 3 middle nodes to become specific discrete values (e.g. my last activations on the 3 middle nodes were 1.0, -1.0, 2.0). These values are then forward propagated, and I get very similar or exactly the same predictions every time.
The section on pages 3-4 of the report best describes the algorithm, but I have no idea what I have to do to fix this, and I don't have much time either :(
Any help would be greatly appreciated.
Thank you
I'm facing the problem of implementing this algorithm and here is my insight into the problem that you might have had: The middle layer, by utilizing a step-wise function, is essentially performing clustering on the data. Each layer transforms the data into a discrete number which could be interpreted as a coordinate in a grid system. Imagine we use two neurons in the middle layer with step-wise values ranging from -2 to +2 in increments of 1. This way we define a 5x5 grid where each set of features will be placed. The more steps you allow, the more grids. The more grids, the more "clusters" you have.
This all sounds good. After all, we are compressing the data into a lower-dimensional representation, which is then used to try to reconstruct the original input.
This step-wise function, however, has a big problem in itself: back-propagation does not work (in theory) with step-wise functions. You can find more about this in this paper. In this last paper they suggest replacing the step-wise function with a ramp-like function, that is, having an almost infinite number of clusters.
Your problem might be directly related to this. Try switching the step-wise function with a ramp-wise one and measure how the error changes throughout the learning phase.
By the way, do you have any of this code available anywhere for other researchers to use?

Matlab Neural Network Advice

I am currently working on a project to optimize heater performance using the MATLAB neural network tool. I read the manuals and followed the guidance in the MATLAB manual.
I have configured the network and tested it. What I need help with is two points:
1. Am I on the right track? Is my network correct? I need expert advice.
2. I need to optimize the performance of the heater. I have defined my function, but I don't have a clue how to integrate the network into the optimization of the function.
my network is as follows
3 inputs x1 x2 x3
one out put
load input1
load input2
load input3
x1 = importdata('input1.txt');  % similarly for the other inputs and the output
[x1n,x1min,x1max] = norm_nn(x1);  % I wrote my own normalization function
IN=[x1n x2n x3n]';
OUT=[y1n]';
INTRAIN = IN(:,1:1307);
OUTTRAIN = OUT(:,1:1307);
INTEST =IN(:,1308 : 1634);
OUTTEST = OUT(:,1308:1634);
NETWORKNet1 = newff(IN,OUT,[20 20 20], {'tansig' 'tansig' }, 'trainbr');
NETWORKNet1 = init(NETWORKNet1);
NETWORKNet1 = train(NETWORKNet1,INTRAIN,OUTTRAIN);
YtestNet1 = sim(NETWORKNet1,INTEST);
y1testd=denorm_nn7(YtestNet1(1,:),y1min,y1max);
e1=er8(y1testd,y1(1308:1634));
save Net1
I used 1634 data points, split into training (80%) and test (20%) sets.
Here is some advice:
(A) Use feedforwardnet as newff is deprecated
(B) Plot the training, test data and the network result to make it easier to visualize what's going on.
(C) By writing [20 20 20] your network has 3 hidden layers. The vast majority of problems require only 1 hidden layer. Only if all other avenues have been exhausted should you move to multiple hidden layers.
(D) Test the network (ie, the sim command) on the training data first. This is an 'easy' test for a neural network and should be working first before you move on. Then you can test it with the test data (which the network was not trained on). This will show if the network has generalized the shape of the data it is trying to learn.
Validation is also another important factor which helps the network to generalize. If you look at the matlab neural network training window (nntraintool) and click 'performance', one of the graphs should be labelled 'validation'.
Regarding your specific questions:
1. Is my network correct? - difficult to say without seeing the dataset.
2. Optimizing performance of the heater - on a simple level you would have a single output neuron, a number between 0 and 1 which denotes heater performance. The input neurons then contain any other parameters involved.
But the network can only predict what the performance will be for a given combination of inputs. It won't be able to tell you which inputs will give you the maximum output. For only 3 inputs, with low resolution / granularity, you could try an exhaustive / brute force search. Otherwise, look into genetic algorithms to quickly find a good solution.
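The brute force search could be sketched as follows. The function `predict` is a hypothetical stand-in for the trained network (here just a simple quadratic so the example runs), and the grid resolution and the [0, 1] input range are assumptions.

```python
import itertools
import numpy as np

def predict(x1, x2, x3):
    # Hypothetical stand-in for the trained network's output on
    # normalized inputs. Replace with a call to the real model.
    return -((x1 - 0.3) ** 2 + (x2 - 0.7) ** 2 + (x3 - 0.5) ** 2)

def grid_search(model, n_steps=11):
    """Evaluate the model on every grid point and keep the best one."""
    grid = np.linspace(0.0, 1.0, n_steps)
    best = max(itertools.product(grid, repeat=3), key=lambda p: model(*p))
    return best, model(*best)

best_inputs, best_value = grid_search(predict)
print(best_inputs, best_value)
```

With 3 inputs and 11 steps per input this is only 1331 evaluations. The cost grows as n_steps to the power of the number of inputs, which is why a smarter search is suggested for larger problems.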

Few questions about kohonen neural network

I have a big data set (time series, about 50 parameters/values). I want to use a Kohonen network to group similar data rows. I've read a bit about Kohonen neural networks and I understand the idea, but:
I don't know how to implement a Kohonen network with so many dimensions. I found an example on CodeProject, but only with a 2- or 3-dimensional input vector. When I have 50 parameters, shall I create 50 weights in my neurons?
I don't know how to update the weights of the winning neuron (how do I calculate the new weights?).
My English is not perfect and I don't understand everything I read about Kohonen networks, especially the descriptions of variables in formulas. That's why I'm asking.
One should distinguish the dimensionality of the map, which is usually low (e.g. 2 in the common case of a rectangular grid) and the dimensionality of the reference vectors which can be arbitrarily high without problems.
Look at http://www.psychology.mcmaster.ca/4i03/demos/competitive-demo.html for a nice example with 49-dimensional input vectors (7x7 pixel images). The Kohonen map in this case has the form of a one-dimensional ring of 8 units.
See also http://www.demogng.de for a Java simulator for various Kohonen-like networks, including ring-shaped ones like the one at McMaster. The reference vectors there are all 2-dimensional, but only for easier display. They could have arbitrarily high dimensions without any change in the algorithms.
Yes, each neuron would need 50 weights. However, these types of networks are usually low dimensional as described in this self-organizing map article. I have never seen them use more than a few inputs.
You have to use an update formula. From the same article: Wv(s + 1) = Wv(s) + Θ(u, v, s) α(s)(D(t) - Wv(s))
Yes, you'll need 50 inputs for each neuron.
You basically do a linear interpolation between the neurons and the target (input) vector, using W(s + 1) = W(s) + Θ() * α(s) * (Input(t) - W(s)), with Θ being your neighbourhood function.
You should also update all your neurons, not only the winner.
Which function you use as a neighbourhood function depends on your actual problem.
A common property of such a function is that it has the value 1 when i=k and falls off with the Euclidean distance. Additionally, it shrinks with time (in order to localize clusters).
Simple neighbourhood functions include linear interpolation (up to a "maximum distance") or a Gaussian function.
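The update rule can be sketched for 50-dimensional inputs as follows. The 5x5 map, the Gaussian neighbourhood, and the values of alpha and sigma are assumptions picked for illustration. In practice both alpha and sigma would shrink over time.

```python
import numpy as np

def som_update(weights, grid_pos, x, alpha, sigma):
    """One step of W(s+1) = W(s) + theta * alpha * (x - W(s)) for all neurons."""
    # The winner is the neuron whose weight vector is closest to the input.
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    # Gaussian neighbourhood: 1 at the winner, falling off with map distance.
    grid_dist = np.linalg.norm(grid_pos - grid_pos[winner], axis=1)
    theta = np.exp(-grid_dist ** 2 / (2 * sigma ** 2))
    # Every neuron moves toward the input, scaled by its neighbourhood value.
    weights += alpha * theta[:, None] * (x - weights)
    return winner

rng = np.random.default_rng(0)
rows, cols, dim = 5, 5, 50  # 2-D map, 50 weights per neuron
grid_pos = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
weights = rng.random((rows * cols, dim))
x = rng.random(dim)

before = np.linalg.norm(weights - x, axis=1)
winner = som_update(weights, grid_pos, x, alpha=0.5, sigma=1.0)
after = np.linalg.norm(weights - x, axis=1)
```

Note that the map stays 2-dimensional while each neuron carries a 50-dimensional weight vector. Only the distance computations see the higher dimension.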