Neural networks - Finding output from two distinct input patterns

I have two distinct types of input patterns (with an unknown relationship between them), and I need to design a neural network that produces an output based on both of these patterns. However, I am unsure how to design such a network.
I am a newbie to NNs, but I am trying to read as much as I can. As far as I understand my problem, there are two input matrices of order, say, 6*1 and an output matrix of order 6*1. So how should I start? Is it OK to use backpropagation and a single hidden layer?
e.g.:
Input 1   Input 2   Output
0.59      1         0.7
0.70      1         0.4
0.75      1         0.5
0.83      0         0.6
0.91      0         0.8
0.94      0         0.9
How do I decide the order (dimensions) of the weight matrix, and which transfer function to use?
Please help. Any link pertaining to this will also do. Thanks.

The simplest thing to try is to concatenate the 2 input vectors. This way you'll have 1 input vector of length 12, and this becomes a "text-book" learning problem from R^{12} to R^{6}.
The downside of this is that you lose the information that each set of 6 inputs comes from a different source, but from your description it doesn't sound like you know much about these sources. Anyway, if you have any special knowledge of the 2 sources, you can apply some pre-processing (like subtracting the mean or dividing by the standard deviation) to each source to make them more similar, though most learning algorithms should also work OK without it.
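For concreteness, here is a minimal NumPy sketch of the concatenation plus per-source standardization idea; the arrays x1 and x2 (and their made-up scales) are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(5.0, 2.0, size=(100, 6))   # source 1, hypothetical scale
x2 = rng.normal(0.0, 0.5, size=(100, 6))   # source 2, hypothetical scale

# Per-source preprocessing: subtract the mean and divide by the
# standard deviation, so neither source dominates due to its units.
def standardize(x):
    return (x - x.mean(axis=0)) / x.std(axis=0)

# One input vector of length 12 per sample: the problem becomes a
# textbook mapping from R^12 to R^6 (targets omitted here).
X = np.hstack([standardize(x1), standardize(x2)])   # shape (100, 12)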
As for which algorithm to try, I think the canonical order is: linear machines (perceptron), then SVM, then multi-layer networks (trained with backprop). The reason is that the more powerful the machine you use, the better your chances of fitting the training set, but the lower your chances of fitting the "true" pattern (overfitting).
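A hedged scikit-learn sketch of that progression; since the example targets are continuous, the regression counterparts of those models are used, and X and y are placeholders rather than the poster's data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 12)   # placeholder inputs (concatenated sources)
y = np.random.rand(100)       # placeholder continuous target

# Try models in order of increasing capacity; watch for overfitting.
for model in (LinearRegression(), SVR(), MLPRegressor(max_iter=2000)):
    scores = cross_val_score(model, X, y, cv=5)   # R^2 by default
    print(type(model).__name__, scores.mean())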

Related

Neural Network Architecture for Binary Sequential Data

I want to develop a neural network to generate samples of sequential binary data. For example, I give my network a stream of binary data: 1 0 0 0 0 1 1 1 0 0 1 (11 digits). I want my generator to be able to output a similar structure to my data. Given the previous example, I want something along the lines of 0 0 1 1 1 0 1 0 0 0 0 (11 digits). From input to output, there is similar structure in the data.
My current approach is using a GAN with an LSTM to decipher patterns. This doesn't work too well.
Obviously I would like to generate far longer streams of data, but the concept is the same. Does anyone have any suggestions on what type of model to use? I know this is a really unconventional optimization problem, but I feel like this is a necessary step in breaking down my problem.
Lastly, it might help to think about the problem like this: if I were to create a simulator to model some environment, my binary string could represent days of rain vs. no rain. Evidently, I want to generate data that is believable and matches patterns similar to the actual data.
EDIT:
I am also open to any ideas on modeling in general, like maybe using Markov chains, etc.
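Picking up the Markov-chain idea, here is a minimal first-order sketch on the 11-digit example from the question; the chain order and the Laplace smoothing are assumptions.

import numpy as np

data = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]   # the example stream

# Estimate transition probabilities P(next | current) from the data,
# with add-one (Laplace) smoothing so no transition has probability 0.
counts = np.ones((2, 2))
for prev, nxt in zip(data, data[1:]):
    counts[prev, nxt] += 1
trans = counts / counts.sum(axis=1, keepdims=True)

# Sample a new sequence with similar local (first-order) structure.
rng = np.random.default_rng(42)
seq = [data[0]]
for _ in range(len(data) - 1):
    seq.append(int(rng.choice(2, p=trans[seq[-1]])))
print(seq)

A higher-order chain (conditioning on the last k digits) would capture longer motifs at the cost of more parameters.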

Activation function to get day of week

I'm writing a program to predict when something will happen. I don't know which activation function to use to get the output as a day of the week (1-7).
I tried the sigmoid function, but then I need to input the predicted day and it outputs its probability; I don't want it to work this way.
I expect the activation function to return values from 0 to infinity. Is ReLU the best activation function for this task?
EDIT:
Also, what if I wanted to output more than 7 days? For example, x will happen on the 9th day from today, or the 15th day from today, etc. I'm looking for a dynamic way to do this.
What you are trying to do is solve a classification problem with a regression approach. That's unconventional, to say the least.
You can use any activation function you want and define your output as you want, e.g. linear or ReLU with an output range from 1 to 7, or something between -1 (or 0) and 1 like tanh or sigmoid, and then map the output (-1 -> 1; -0.3 -> 2; ...).
The problem for you will be that you get a floating-point number as a result. So your model not only has to learn how to classify correctly, but also how to predict the (almost) exact number you want in your output neuron. That makes the problem more complicated than it has to be. With a model like that, it is also likely that for some outlier data points you will get unexpected return values like 0, -1 or 8. What do you do then?
To sum it up: listen to @venkata krishnan, use softmax with seven output neurons, and map this result to a number between 1 and 7 outside the neural network if you have to.
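A minimal Keras sketch of that setup; the feature count, layer sizes and training data are placeholders, not from the question.

import numpy as np
import tensorflow as tf

n_features = 10   # hypothetical input size
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(n_features,)),
    tf.keras.layers.Dense(7, activation='softmax'),   # one neuron per day
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with labels 0..6 internally, then map predictions back to 1..7
# outside the network, as suggested above.
x = np.random.rand(100, n_features).astype('float32')
y = np.random.randint(0, 7, size=100)
model.fit(x, y, epochs=5, verbose=0)
day = model.predict(x[:1], verbose=0).argmax(axis=1) + 1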
EDIT
What comes to my mind after reading the comments again would be a mix of what you want and what you should do.
You could try to make the second-to-last layer a 7-neuron softmax layer and map its output to a single neuron in the last layer.
I have neither tried that nor read about anything like it, so I can't tell you whether it's a good idea (likely not), but you might consider it worth a try.
I want to add to the point of @venkata krishnan, who raises a valid point in your problem setting. You will find an answer to your original question further down, but I strongly suggest you read the following comment first.
Generally, you want to discern between categorical, ordinal and interval variables. I have given a relatively lengthy explanation in a different answer on Stack Overflow; it might be helpful for understanding this concept in more detail.
In your scenario, you mostly want an understanding of "how wrong" you are. Of course, it is perfectly reasonable to do what you are doing and interpret the output as an interval variable, and therefore assume an ordering (and a distance) between different values.
What is problematic, though, is that you are assuming a continuous space for a discrete variable. E.g., it does not make any sense to interpret an output of 4.3, since you can only decide between 4 (Friday, assuming you start numbering your days at 0) and 5 (Saturday). Any value in between would have to be rounded, which is perfectly fine - until you want to perform backpropagation on this loss.
It is problematic because you are essentially introducing a non-convex and non-continuous function, no matter how you "round" your values. To exemplify this, suppose you round to the nearest number; then, at the value of 4.5, you would see a sudden jump in the loss, which is non-differentiable and will therefore give your optimizer a hard time, potentially limiting the convergence of your system.
If, instead, you use several output neurons, as suggested by @venkata krishnan, you might lose the information of distance (how many days you are off) on paper, but you can of course still interpret your loss in any way you like. This is certainly the better option for a discrete-valued variable.
To answer your original question: I personally would make sure that your loss function is bounded both above and below, as you could otherwise get undefined/inconsistent loss values that might lead to subpar optimization. One way to do this is to re-scale a sigmoid function (the co-domain of sigmoid over R is (0, 1)). You can then just multiply your output by 6 to get a value range of [0, 6], which (after rounding) covers all the values you want.
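As a tiny numerical sketch of that re-scaling (z stands for whatever the output neuron produces; the values below are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.1, 4.2])        # hypothetical raw outputs
day_index = np.rint(6 * sigmoid(z))   # bounded in [0, 6], then rounded
print(day_index)                      # -> [0. 3. 6.]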
As far as I know, there is no such thing as an activation function that will yield 0 to infinity. You can use 7 output nodes with a softmax activation function, which will return probabilities. There is another solution which may work: you can use 3 output nodes with a binary activation function, which will return either 0 or 1. That means you can have 8 different outputs with only 3 nodes: 000, 001, 010, 011, 100, 101, 110 and 111. You can use 7 of them.
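A small sketch of that 3-node binary encoding; the activations are made up, and thresholding at 0.5 is an assumption if the nodes are sigmoid rather than hard-binary.

import numpy as np

outputs = np.array([0.9, 0.2, 0.7])    # three output-node activations
bits = (outputs > 0.5).astype(int)     # -> [1, 0, 1]
day = int(bits @ np.array([4, 2, 1]))  # binary 101 -> 5
print(day)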

Clockwork RNN (CW-RNN) and TanH activation function

Thanks again for taking the time to answer this post!
Quick question:
In a Clockwork Recurrent Neural Network (CW-RNN) (here is the documentation), it seems to me that a tanh-activated output layer would suffer from an extreme load from so many connections and always output values near 1. Or am I mistaken?
Let's say we have a CW-RNN with, for the hidden layer, 4 modules of 50 neurons each.
Even if these modules have different clock rates, obviously more than one module will sometimes activate and output something other than 0; thus, the output layer may output 1 because it receives so many inputs.
Is there anything I am missing in the CW-RNN concept ?
I know about weight initialization, but I'm just wondering if I am missing a piece of the puzzle.
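For what it's worth, a small NumPy check of that worry; the sizes and distributions are assumptions. Summing many unscaled inputs does saturate tanh, while scaling the weights by 1/sqrt(n), as common initialization schemes do, keeps the pre-activation in the responsive range.

import numpy as np

rng = np.random.default_rng(0)
n = 200                      # e.g. 4 modules x 50 neurons
h = rng.uniform(0, 1, n)     # hidden activations, several modules active

w = rng.normal(0, 1, n)      # naive N(0, 1) weights
print(np.tanh(h @ w))                  # large pre-activation: often near +/-1
print(np.tanh(h @ (w / np.sqrt(n))))   # scaled init: stays well inside (-1, 1)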

Neural networks and the XOR function

I'm playing with a neural network I implemented myself: it's a trivial feed-forward network, with RPROP as the learning algorithm being the only "plus" compared to the basic design.
The network scores decently when I test it against MNIST or when I attempt image compression, but when I try to model something as simple as the XOR function, it sometimes gets trapped in a local minimum during learning and outputs the following truth table:
0 XOR 0 = 1.4598413968251171e-171
1 XOR 0 = 0.9999999999999998
0 XOR 1 = 0.9999999999999998
1 XOR 1 = 0.5
Often the result after training is correct, but sometimes 1 XOR 1 outputs 0.5 instead of 0 as it should. It does not always happen with XOR(1,1); it happens with other inputs as well. Since the XOR function is a classic in the backpropagation literature, I wonder what's happening here, especially given that my network appears to learn more complex (but perhaps less non-linear) tasks just fine.
My wild guess is that something is wrong with the biases.
Any hint?
Note 1: the network layout above is 2|3|1, but things do not change much when I use more hidden units; certain learning attempts still go wrong.
Note 2: I put the implementation into a Gist: https://gist.github.com/antirez/e45939b918868b91ec6fea1d1938db0d
The problem was due to a bug in my implementation: the bias unit of the NN immediately before the output unit was not computed correctly. After fixing the code, the XOR function is always computed correctly.
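For reference, a minimal NumPy sketch of a 2|3|1 sigmoid network for XOR with explicit bias terms at both layers (plain gradient descent rather than RPROP; the learning rate, seed and iteration count are assumptions, and some seeds can still land in the local minimum the question describes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 1, (2, 3)), np.zeros(3)
W2, b2 = rng.normal(0, 1, (3, 1)), np.zeros(1)

lr = 2.0
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)       # hidden layer, shape (4, 3)
    out = sigmoid(h @ W2 + b2)     # output layer, shape (4, 1)
    d_out = (out - y) * out * (1 - out)   # MSE gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)   # the bias feeding the output unit
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(np.round(out, 3))   # should approach [0, 1, 1, 0]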

Merge sensor data for clustering/neural net usage

I have several datasets, i.e. matrices that have 2 columns: one with a MATLAB date number and a second with a double value. Here is an example excerpt from one of them:
>> S20_EavesN0x2DEAir(1:20,:)
ans =
1.0e+05 *
7.345016409722222 0.000189375000000
7.345016618055555 0.000181875000000
7.345016833333333 0.000177500000000
7.345017041666667 0.000172500000000
7.345017256944445 0.000168750000000
7.345017465277778 0.000166875000000
7.345017680555555 0.000164375000000
7.345017888888889 0.000162500000000
7.345018104166667 0.000161250000000
7.345018312500001 0.000160625000000
7.345018527777778 0.000158750000000
7.345018736111110 0.000160000000000
7.345018951388888 0.000159375000000
7.345019159722222 0.000159375000000
7.345019375000000 0.000160625000000
7.345019583333333 0.000161875000000
7.345019798611111 0.000162500000000
7.345020006944444 0.000161875000000
7.345020222222222 0.000160625000000
7.345020430555556 0.000160000000000
Now that I have those different sensor values, I need to get them together into one matrix so that I can perform clustering, neural networks and so on. The only problem is that the sensor data was taken with slightly different timings or timestamps, and there is nothing I can do about that from a data-collection point of view.
My first thought was interpolation, to make one sensor data set fit another, but that seems like a messy approach, and I was wondering whether I am missing something: a toolbox or function that would let me do this more quickly without fiddling around. To complicate things even more, the number of sensors grew over time, so I am looking at different start dates as well.
Does anyone have a good idea on how to go about this? Thanks.
I think your first thought about interpolation was the correct one, at least if you plan to use NNs. Another option would be to use approaches designed to deal with missing data, like http://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory for example.
It's hard to give an answer for the clustering part, because I have no idea what you're looking for in the data.
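For the interpolation route, a minimal NumPy sketch that resamples every sensor onto one common time grid; it assumes each matrix has timestamps in column 0 and values in column 1, and the grid size is arbitrary.

import numpy as np

def to_common_grid(sensors, n_points=500):
    # Use the overlap window shared by all sensors: the latest start
    # and the earliest end, which also handles different start dates.
    t0 = max(s[0, 0] for s in sensors)
    t1 = min(s[-1, 0] for s in sensors)
    grid = np.linspace(t0, t1, n_points)
    # One column of linearly interpolated values per sensor.
    values = np.column_stack(
        [np.interp(grid, s[:, 0], s[:, 1]) for s in sensors])
    return grid, values

Calling grid, V = to_common_grid([m1, m2, m3]) would then give aligned rows ready for clustering or an NN.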
For the neural network, besides interpolating, there are at least two other methods that come to mind:
- training separate networks for each matrix
- feeding them all together to the same network, with a flag specifying which matrix the data comes from, i.e. something like: input (timestamp, flag_m1, flag_m2, ..., flag_mN) => target (value), where the flag_m* columns are mutually exclusive boolean values, i.e. flag_mK is 1 iff the line comes from matrix K, 0 otherwise (sketched below).
These are the only things I can safely say with the amount of information you provided.
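For concreteness, the flag-based encoding from the second suggestion could look like this sketch; the list `matrices` of two-column (timestamp, value) arrays stands in for the poster's datasets.

import numpy as np

def build_flagged_dataset(matrices):
    # One row per measurement: (timestamp, flag_m1, ..., flag_mN),
    # where exactly one flag is 1, marking the source matrix.
    n_sources = len(matrices)
    rows, targets = [], []
    for k, m in enumerate(matrices):
        flags = np.zeros(n_sources)
        flags[k] = 1.0
        for t, v in m:
            rows.append(np.concatenate(([t], flags)))
            targets.append(v)
    return np.array(rows), np.array(targets)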