I am using an LSTM layer to multiplex among several memory cells. That said, having several input options, I want to feed only one of them to the hidden layer. I arranged the input to LSTM in such a way, so it would select an apropriate cell based on the input_gate, forget_gate, and output_gate I pass to it in addition to the cell_input.
However, it seems that the LSTM layer transforms the values of the memory cells, while I would expect it to pass them to the output as-is.
For example, I am passing the following input, which I printed in groups corresponding to input_gate, forget_gate, cell_input, and output_gate for convenience:
ig: [ 0. 1. 0. 0. 0. 0.]
fg: [ 0. 0. 0. 0. 0. 0.]
ci: [ 0.5 0.5 0.5 0.5 0.5 0. ]
og: [ 1. 1. 0. 0. 0. 1.]
I want the LSTM layer to only pass ci[0], ci[1], and ci[5] to the output as the og group indicates.
However, what I see in the output buffer is different:
LSTM out: [ 0.16597414 0.23799096 0.1135163 0.1135163 0.1135163 0.]
While not absolutely meaningless to me (the 0-th and 1-th entries are slightly greater than the rest) this output is not the [.5 .5 0. 0. 0.] that I expected.
From what I learned about LSTM, it doesn't seem that there is any transition function from the memory cells to the actual output.
Silly question, of course: the output is clamped by a sigmoid.
Related
I am trying to understand the stoachstic uniform selection algorithm as described in he docs: https://se.mathworks.com/help/gads/genetic-algorithm-options.html
The ga default selection function, Stochastic uniform, lays out a line in which each parent corresponds to a section of the line of length proportional to its scaled value. The algorithm moves along the line in steps of equal size. At each step, the algorithm allocates a parent from the section it lands on. The first step is a uniform random number less than the step size.
For myself the above docs can be interpreted in two ways:
Either a random number x will be picked initially and all subsequent "steps" are simply multiple of it.
Step size: 1
Random x e.g. 0.5
Location on line: 0.5, 1, 1.5, 2, 2.5
The algorithm moves along the line in fixed steps and additionally a random x < the fixed size is added every time.
Fixed Step size: 1
Random x varies all the time but < 1
Location on line: 1.1, 2.3, 3.2, 4.5, 5.1
Number 1 faces the issue that if the random value chosen is too small only the most fit individual will be selected as we don't move along the line at all. So is the 2nd interpretation correct?
As far as I understood the scaled fitness values sum up to the count of parents that will be generated, therefore isn't the step size always 1 since we can fit exactly as many steps * parents needed on the line?
Here's a third interpretation for your consideration. The algorithm moves along the line in fixed steps. However, the starting point is less than the step size.
Fixed step size: 1
Randomly chosen start: 0.32
Locations on line: 0.32 1.32 2.32 3.32 4.32 5.32
By using a fixed step size, the algorithm knows exactly how many parents will be selected. For example, if the line is 100 units long, and the step size is 1, then exactly 100 parents will be selected. But which parents are selected is determined by the random starting point.
This assumes that there are multiple parents to choose from in each interval of length 1. And the most fit individual has a scaled length less than 1.
I am currently trying to understand the ANN that I created for an assignment that essentially takes gray scale (0-150)images (120x128) and determines whether the person is Male or Female. It works for the most part. I am treating this like a boolean problem where the output(Male = 1, Female = 0). I am able to get the ANN to correctly identify Male or Female. However the outputs I am getting for the Males are (0.3-0.6) depending on the run. Should I be getting the value ~1 out?
I am using a sigmoid unit 1/(1+e^-y) and have tried to take the inverse. I have tried this using 5 - 60 hidden units on 1 layer and tried 2 outputs with flip flop results. I want to understand this so that I can apply this to a non-boolean problem. ie If I want a numerical output how would I got about doing that or am I using the wrong machine learning technique?
You can use binary function at the output with some threshold. Assuming, you have assigned 0 for female and 1 for male in training, while testing you will get values in between 0 and 1 and also some times below 0 and above 1......So to make a decision at the output value just add threshold of 0.5 and check output value, if it is less than 0.5 then estimated class is female and if it is equal to or greater than 0.5 then estimated class is male.
Hi my question is a bit long please bare and read it till the end.
I am working on a project with 30 participants. We have two type of data set (first data set has 30 rows and 160 columns , and second data set has the same 30 rows and 200 columns as outputs=y and these outputs are independent), what i want to do is to use the first data set and predict the second data set outputs.As first data set was rectangular type and had high dimension i have used factor analysis and now have 19 factors that cover up to 98% of the variance. Now i want to use these 19 factors for predicting the outputs of the second data set.
I am using neuralnet and backpropogation and everything goes well and my results are really close to outputs.
My questions :
1- as my inputs are the factors ( they are between -1 and 1 ) and my outputs scale are between 4 to 10000 and integer , should i still scaled them before running neural network ?
2-I scaled the data ( both input and outputs ) and then predicted with neuralnet , then i check the MSE error it was so high like 6000 while my prediction and real output are so close to each other. But if i rescale the prediction and outputs then check The MSE its near zero. Is it unbiased to rescale and then check the MSE ?
3- I read that it is better to not scale the output from the beginning but if i just scale the inputs all my prediction are 1. Is it correct to not to scale the outputs ?
4- If i want to plot the ROC curve how can i do it. Because my results are never equal to real outputs ?
Thank you for reading my question
[edit#1]: There is a publication on how to produce ROC curves using neural network results
http://www.lcc.uma.es/~jja/recidiva/048.pdf
1) You can scale your values (using minmax, for example). But only scale your training data set. Save the parameters used in the scaling process (in minmax they would be the min and max values by which the data is scaled). Only then, you can scale your test data set WITH the min and max values you got from the training data set. Remember, with the test data set you are trying to mimic the process of classifying unseen data. Unseen data is scaled with your scaling parameters from the testing data set.
2) When talking about errors, do mention which data set the error was computed on. You can compute an error function (in fact, there are different error functions, one of them, the mean squared error, or MSE) on the training data set, and one for your test data set.
4) Think about this: Let's say you train a network with the testing data set,and it only has 1 neuron in the output layer . Then, you present it with the test data set. Depending on which transfer function (activation function) you use in the output layer, you will get a value for each exemplar. Let's assume you use a sigmoid transfer function, where the max and min values are 1 and 0. That means the predictions will be limited to values between 1 and 0.
Let's also say that your target labels ("truth") only contains discrete values of 0 and 1 (indicating which class the exemplar belongs to).
targetLabels=[0 1 0 0 0 1 0 ];
NNprediction=[0.2 0.8 0.1 0.3 0.4 0.7 0.2];
How do you interpret this?
You can apply a hard-limiting function such that the NNprediction vector only contains the discreet values 0 and 1. Let's say you use a threshold of 0.5:
NNprediction_thresh_0.5 = [0 1 0 0 0 1 0];
vs.
targetLabels =[0 1 0 0 0 1 0];
With this information you can compute your False Positives, FN, TP, and TN (and a bunch of additional derived metrics such as True Positive Rate = TP/(TP+FN) ).
If you had a ROC curve showing the False Negative Rate vs. True Positive Rate, this would be a single point in the plot. However, if you vary the threshold in the hard-limit function, you can get all the values you need for a complete curve.
Makes sense? See the dependencies of one process on the others?
I have a lisp program on roulette wheel selection,I am trying to understand the theory behind it but I cannot understand anything.
How to calculate the fitness of the selected strng?
For example,if I have a string 01101,how did they get the fitness value as 169?
Is it that the binary coding of 01101 evaluates to 13,so i square the value and get the answer as 169?
That sounds lame but somehow I am getting the right answers by doing that.
The fitness function you have is therefore F=X^2.
The roulette wheel calculates the proportion (according to its fitness) of the whole that that individual (string) takes, this is then used to randomly select a set of strings for the next generation.
Suggest you read this a few times.
The "fitness function" for a given problem is chosen (often) arbitrarily keeping in mind that as the "fitness" metric rises, the solution should approach optimality. For example for a problem in which the objective is to minimize a positive value, the natural choice for F(x) would be 1/x.
For the problem at hand, it seems that the fitness function has been given as F(x) = val(x)*val(x) though one cannot be certain from just a single value pair of (x,F(x)).
Roulette-wheel selection is just a commonly employed method of fitness-based pseudo-random selection. This is easy to understand if you've ever played roulette or watched 'Wheel of Fortune'.
Let us consider the simplest case, where F(x) = val(x),
Suppose we have four values, 1,2,3 and 4.
This implies that these "individuals" have fitnesses 1,2,3 and 4 respectively. Now the probability of selection of an individual 'x1' is calculated as F(x1)/(sum of all F(x)). That is to say here, since the sum of the fitnesses would be 10, the probabilities of selection would be, respectively, 0.1,0.2,0.3 and 0.4.
Now if we consider these probabilities from a cumulative perspective the values of x would be mapped to the following ranges of "probability:
1 ---> (0.0, 0.1]
2 ---> (0.1, (0.1 + 0.2)] ---> (0.1, 0.3]
3 ---> (0.3, (0.1 + 0.2 + 0.3)] ---> (0.3, 0.6]
4 ---> (0.6, (0.1 + 0.2 + 0.3 + 0.4)] ---> (0.6, 1.0]
That is, an instance of a uniformly distributed random variable generated, say R lying in the normalised interval, (0, 1], is four times as likely to be in the interval corresponding to 4 as to that corresponding to 1.
To put it another way, suppose you were to spin a roulette-wheel-type structure with each x assigned a sector with the areas of the sectors being in proportion to their respective values of F(x), then the probability that the indicator will stop in any given sector is directly propotional to the value of F(x) for that x.
I made a neural network that also have Back Propagation.it has 5 nodes in input layer,6 nodes in hidden layer,1 node in output layer and have random weights and i use sigmoid as activation function.
i have two set of data for input.
for example :
13.5 22.27 0 0 0 desired value=0.02
7 19 4 7 2 desired value=0.03
now i train the network with 5000 iteration or iteration will stop if the error
value(desired - calculated output value) is less than or equal to 0.001.
the output value of first iteration for each input set is about 60 And it will decrease in each iteration.
now the problem is that the second set of inputs(that has desired value of 0.03),cause to stop iteration because of calculated output value of 3.001 but the first set of inputs did not arrived to desired value of it(that is 0.02) and its output is about 0.03 .
EDITED :
I used LMS algorithm andchanged the error threshold 0.00001 to find correct error value,but now output value of last iteration for both 0.03 and 0.02 desired value is between 0.023 and 0.027 and that is incorrect yet.
For your error value stop threshold, you should take the error on one epoch (Sum of every error of all your dataset) and not only on one member of you dataset. With this you will have to increase the value of your error threshold but it will force your neural network to do a good classification on all your example and not only on some example.