I wonder whether it is possible to use a NEAT-style network with unsupervised learning, making use of the Encog framework. I want to take advantage of its self-organization, since my system does not have seasonality characteristics. As far as I could discover, I have only seen examples of NEAT networks trained with supervision.
Disclaimer: My knowledge of both ML and Encog is low.
I believe that the "boxes" example is in fact a demonstration of unsupervised learning using Encog's NEAT capability.
To do unsupervised learning, implement the CalculateScore interface and pass that score evaluator to NEATUtil.constructNEATTrainer(pop, score) when you create the trainer.
In the example, BoxesScore implements that interface and calls out to TrialEvaluation to calculate fitness:
public double calculateFitness() {
    final double threshold = BoxesScore.EDGE_LEN * BoxesScore.SQR_LEN;
    double rmsd = Math.sqrt(this.accDistance / 75.0);
    double fitness;

    if (rmsd > threshold) {
        fitness = 0.0;
    } else {
        fitness = (((threshold - rmsd) * 100.0) / threshold) + (this.accRange / 7.5);
    }

    return fitness;
}
You'll see from the rest of the code that the fitness isn't computed from a hard-coded list of test cases and expected results.
Thus as long as you can define what "fitness" means for your solution, you can do unsupervised learning with Encog's NEAT implementation.
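For a concrete picture, here is a minimal, untested sketch of how the pieces fit together (assuming Encog 3.x for Java; the population size, stopping condition and the evaluateSelfOrganization helper are placeholders you would replace with your own):

import org.encog.ml.CalculateScore;
import org.encog.ml.MLMethod;
import org.encog.ml.ea.train.EvolutionaryAlgorithm;
import org.encog.neural.neat.NEATNetwork;
import org.encog.neural.neat.NEATPopulation;
import org.encog.neural.neat.NEATUtil;

// Hypothetical score object: "fitness" is whatever measure of self-organization
// makes sense for your system, computed without labelled examples.
public class MyScore implements CalculateScore {
    @Override
    public double calculateScore(MLMethod method) {
        NEATNetwork network = (NEATNetwork) method;
        return evaluateSelfOrganization(network); // placeholder for your own measure
    }

    @Override
    public boolean shouldMinimize() { return false; } // higher fitness is better

    @Override
    public boolean requireSingleThreaded() { return false; }
}

// ... somewhere in your training code:
NEATPopulation pop = new NEATPopulation(inputCount, outputCount, 500);
pop.reset();
EvolutionaryAlgorithm train = NEATUtil.constructNEATTrainer(pop, new MyScore());
for (int generation = 0; generation < maxGenerations; generation++) {
    train.iteration();
}
NEATNetwork best = (NEATNetwork) train.getCODEC().decode(train.getBestGenome());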
I'm new to TensorFlow and neural nets, and I have never used anything other than the JavaScript version of TensorFlow. Basically I'm experimenting and studying all this.
Reading the (Python) TensorFlow docs I saw that pruning can be done with tf.contrib.model_pruning, but, as far as I have found, there is nothing similar for TensorFlow.js. So I'd like to experiment a bit and implement at least a very simple / basic pruning method.
This "very simple / basic pruning method" could be something like removing from the hidden layers those neurons whose weights are very near 0. I would then train the model a bit more and see if I can recover the loss in accuracy.
I know I can access the weights with something like this:
const weights = model.layers.map(layer => {
  return layer.getWeights()[0].dataSync();
});
What I would like to know is whether it is actually possible to find and remove the units associated with those weights (and whether I can do this during training).
Thanks!
Edu
It is possible to set the weights on the model. The same way you retrieve the model weights using getWeights, you can use setWeights to change the weights of your model.
model.fit(x, y, {
  epochs: 1000,
  callbacks: {
    onEpochEnd: () => {
      // check your weights
      model.layers[0].getWeights();
      // set your weights
      model.layers[0].setWeights([tensors]);
    }
  }
});
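For example, a rough, untested sketch of magnitude pruning using that callback (this only zeroes individual small weights; actually removing whole units would mean building a new, smaller model and copying the surviving weights into it, and the threshold here is just a guess):

// Hypothetical helper: zero out kernel weights whose magnitude is below a threshold.
function pruneSmallWeights(model, threshold = 1e-3) {
  for (const layer of model.layers) {
    const params = layer.getWeights();      // e.g. [kernel, bias] for a dense layer
    if (params.length === 0) continue;
    const pruned = params.map((w, i) => {
      if (i !== 0) return w;                // leave biases untouched
      return tf.tidy(() =>
        tf.where(w.abs().greater(threshold), w, tf.zerosLike(w)));
    });
    layer.setWeights(pruned);
  }
}

// e.g. call it from onEpochEnd and keep training to recover accuracy:
// onEpochEnd: () => pruneSmallWeights(model, 1e-3)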
My goal is to test how well a Multilayer Perceptron classifies the 20 newsgroups data. I keep getting only 5% accuracy with this method but can obtain ~90% with other classification methods such as Naive Bayes and KNN. I'm sure I am doing it wrong, so here is my code in hopes that someone can point me in the right direction:
newsgroups_data.setClassIndex(newsgroups_data.numAttributes() - 1);
StringToWordVector filter = new StringToWordVector();
FilteredClassifier classifier = new FilteredClassifier();
classifier.setFilter(filter);
MultilayerPerceptron mlp = new MultilayerPerceptron();
mlp.setTrainingTime(300); //This alone takes an hour or more
mlp.setLearningRate(0.01);
mlp.setHiddenLayers("1");
mlp.setReset(false);
classifier.setClassifier(mlp);
classifier.buildClassifier(newsgroups_data);
Evaluation eval = new Evaluation(newsgroups_data);
mlp.setHiddenLayers("1")
means you want to use one hidden layer with one node in it (that is, you're setting up a neural network with just ONE hidden neuron in total).
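One hidden neuron cannot usefully separate 20 classes. As a rough, untested starting point (the sizes here are just guesses to tune against your data), something like this is more reasonable:

MultilayerPerceptron mlp = new MultilayerPerceptron();
// One hidden layer with 100 neurons; "100, 50" would give two hidden layers.
// Weka also accepts wildcards such as "a" = (attributes + classes) / 2,
// but with a bag-of-words input that can be very large and very slow.
mlp.setHiddenLayers("100");
mlp.setLearningRate(0.01);
mlp.setTrainingTime(300);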
In Caffe we have a decay_ratio which is usually set to 0.0005. Then all trainable parameters, e.g. the W matrix in FC6, are decayed by:
W = W * (1 - 0.0005)
after the gradient has been applied to them.
I have gone through many TensorFlow tutorial codes, but I do not see how people implement this weight decay to prevent numerical problems (very large absolute values).
In my experience, I often run into numerical problems after 100k iterations during training.
I have also gone through related questions on Stack Overflow, e.g.
How to set weight cost strength in TensorFlow?
However, the solution seems a little different from how it is implemented in Caffe.
Does anyone have similar concerns? Thank you.
The current answer is wrong in that it doesn't give you proper "weight decay as in cuda-convnet/caffe" but instead L2-regularization, which is different.
When using pure SGD (without momentum) as the optimizer, weight decay is the same thing as adding an L2-regularization term to the loss. When using any other optimizer, this is not true.
Weight decay (don't know how to TeX here, so excuse my pseudo-notation):
w[t+1] = w[t] - learning_rate * dw - weight_decay * w
L2-regularization:
loss = actual_loss + lambda * 1/2 * sum(||w||_2^2 for w in network_params)
Computing the gradient of the extra term in L2-regularization gives lambda * w and thus inserting it into the SGD update equation
dloss_dw = dactual_loss_dw + lambda * w
w[t+1] = w[t] - learning_rate * dw
gives the same as weight decay, but mixes lambda with the learning_rate. Any other optimizer, even SGD with momentum, gives a different update rule for weight decay than for L2-regularization! See the paper Fixing Weight Decay Regularization in Adam for more details. (Edit: AFAIK, this 1987 Hinton paper introduced "weight decay", literally as "each time the weights are updated, their magnitude is also decremented by 0.4%" on page 10.)
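To make the difference concrete, here is a tiny framework-free sketch (all numbers invented) of SGD with momentum under the two schemes; the two weight trajectories already differ after the first update:

lr, mu, lam = 0.1, 0.9, 0.01          # learning rate, momentum, decay strength
w_l2, v_l2 = 1.0, 0.0                 # weight / velocity under L2-regularization
w_wd, v_wd = 1.0, 0.0                 # weight / velocity under decoupled weight decay

for step in range(3):
    # L2-regularization: the lam * w term passes through the momentum buffer
    g = 2.0 * w_l2                    # pretend gradient of the actual loss (loss = w^2)
    v_l2 = mu * v_l2 + (g + lam * w_l2)
    w_l2 -= lr * v_l2

    # decoupled weight decay: the decay is applied directly to the weights
    g = 2.0 * w_wd
    v_wd = mu * v_wd + g
    w_wd -= lr * v_wd + lam * w_wd

    print(step, w_l2, w_wd)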
That being said, there doesn't seem to be support for "proper" weight decay in TensorFlow yet. There are a few issues discussing it, specifically because of the above paper.
One possible way to implement it is by writing an op that does the decay step manually after every optimizer step. A different way, which is what I'm currently doing, is using an additional SGD optimizer just for the weight decay, and "attaching" it to your train_op. Both of these are just crude work-arounds, though. My current code:
# In the network definition:
with arg_scope([layers.conv2d, layers.dense],
               weights_regularizer=layers.l2_regularizer(weight_decay)):
    # define the network.

loss = # compute the actual loss of your problem.
train_op = optimizer.minimize(loss, global_step=global_step)
if args.weight_decay not in (None, 0):
    with tf.control_dependencies([train_op]):
        sgd = tf.train.GradientDescentOptimizer(learning_rate=1.0)
        train_op = sgd.minimize(tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)))
This somewhat makes use of TensorFlow's provided bookkeeping. Note that the arg_scope takes care of appending an L2-regularization term for every layer to the REGULARIZATION_LOSSES graph-key, which I then all sum up and optimize using SGD which, as shown above, corresponds to actual weight-decay.
Hope that helps, and if anyone gets a nicer code snippet for this, or TensorFlow implements it better (i.e. in the optimizers), please share.
Edit: see also this PR which just got merged into TF.
This is a duplicate question:
How to define weight decay for individual layers in TensorFlow?
# Create your variables
weights = tf.get_variable('weights', collections=['variables'])

with tf.variable_scope('weights_norm') as scope:
    weights_norm = tf.reduce_sum(
        input_tensor=WEIGHT_DECAY_FACTOR * tf.pack(  # tf.pack was renamed tf.stack in TF 1.0
            [tf.nn.l2_loss(i) for i in tf.get_collection('weights')]
        ),
        name='weights_norm'
    )

# Add the weight decay loss to another collection called losses
tf.add_to_collection('losses', weights_norm)

# Add the other loss components to the collection losses
# ...

# To calculate your total loss
tf.add_n(tf.get_collection('losses'), name='total_loss')
You can set the weight-decay factor (the lambda) to whatever value you want; the above just adds the scaled L2 norm of the weights to the loss.
I have been going through the UFLDL tutorials. In the vectorized implementation of a simple neural net, the tutorials suggest that one way to do this would be to go through the entire training set at once instead of the iterative, per-example approach. In the backpropagation part, this would mean replacing:
gradW1 = zeros(size(W1));
gradW2 = zeros(size(W2));
for i=1:m,
  delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i));
  delta2 = W2'*delta3 .* fprime(z2(:,i));
  gradW2 = gradW2 + delta3*a2(:,i)';
  gradW1 = gradW1 + delta2*a1(:,i)';
end;
with
delta3 = -(y - h) .* fprime(z3);
delta2 = W2'*delta3 .* fprime(z2);
gradW2 = delta3*a2';
gradW1 = delta2*a1';
% apply the weight correction now that all
% gradients are computed
Please visit this page for information about the notation and the algorithm.
However, this implementation yielded abnormally large values inside gradW1 and gradW2. This seems to be a result of me not updating the weights as I process each training input (I tested this against another, earlier working implementation). Am I right about this? From reading the tutorials it seems that there is a way to make this work, but I can't come up with something that works.
Backpropagation can be implemented with two training schemes: batch and online. Initially you described the online training algorithm. Then you found and tried to implement the batch training algorithm, which sometimes has the side effect you describe. In your case it may be a good idea to split the training samples into smaller chunks (mini-batches) and learn on those, as sketched below.
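A rough Octave/MATLAB sketch of that idea, reusing the notation from the question (the batch size, the activation function f, fprime, the biases b1/b2 and the step size alpha are all assumptions you would adapt):

batchSize = 100;
for start = 1:batchSize:m
  idx = start:min(start+batchSize-1, m);
  a1 = x(:,idx);                                % inputs for this mini-batch
  z2 = W1*a1 + repmat(b1, 1, numel(idx));  a2 = f(z2);
  z3 = W2*a2 + repmat(b2, 1, numel(idx));  h  = f(z3);
  delta3 = -(y(:,idx) - h) .* fprime(z3);
  delta2 = (W2'*delta3) .* fprime(z2);
  % average the gradients over the mini-batch and update immediately
  W2 = W2 - alpha * (delta3*a2') / numel(idx);
  W1 = W1 - alpha * (delta2*a1') / numel(idx);
end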
I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form described in the paper to solve reinforcement learning problems by changing its weights such that the free energy of a network configuration equates to the reward signal given for that state-action pair.
To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. With enough time, this produces the action with the lowest free energy, and thus the highest reward for the given state.
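For reference, here is my own rough reading of that action-selection step as code (sigmoid units, variable names matching the code further down; the way the temperature enters is an assumption):

% Clamp the state bits, initialise the action bits at random, and run
% block Gibbs sampling over the hidden units and the action units only.
v = [state, rand(1, numactiondims) > .5];
for k = 1:cdsteps
  hprob = 1 ./ (1 + exp(-(v*vishid + hidbiases) / temp));
  h = hprob > rand(size(hprob));
  aprob = 1 ./ (1 + exp(-(h*vishid(numdims+1:end,:)' ...
                          + visbiases(numdims+1:end)) / temp));
  v(numdims+1:end) = aprob > rand(size(aprob));
end
action = v(numdims+1:end);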
Overview of the large action task:
Overview of the author's guidelines for implementation:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of
the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were
randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1
to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each
iteration was initialized with a random state. Each action selection consisted of 100 iterations of
Gibbs sampling.
Important omitted details:
Were bias units needed?
Was weight decay needed? And if so, L1 or L2?
Was a sparsity constraint needed for the weights and/or activations?
Was there modification of the gradient descent? (e.g. momentum)
What meta-parameters were needed for these additional mechanisms?
My implementation:
I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and thus omitted.
I played around with the various omitted mechanisms mentioned above, and got my best results by using:
softmax hidden units
momentum of .9 (.5 until 5th iteration)
bias units for the hidden and visible layers
a learning rate 1/100th of that listed by the authors.
l2 weight decay of .0002
But even with all of these modifications, my performance on the task was generally around an average reward of 28 after 12000 iterations.
Code for each iteration:
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if test
[negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
[negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
disp(poshidprobs);
disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if batch>5,
momentum=.9;
else
momentum=.5;
end;
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
if correct_action(:,(batch)) == correct_action(:,1)
reward_dataA = [reward_dataA reward];
Q_A = [Q_A Q];
else
reward_dataB = [reward_dataB reward];
Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
vishidinc = momentum*vishidinc + ...
epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
What I'm asking for:
So, if any of you can get this algorithm to work properly (the authors claim to average ~40 reward after 12000 iterations), I'd be extremely grateful.
If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.
I'm hoping that what the authors left out is indeed obvious to someone with more experience with energy-based learning than myself, in which case, simply point out what needs to be included in a working implementation.
The algorithm in the paper looks weird. They use a kind of Hebbian learning that increases connection strength, but there is no mechanism to decay the weights. In contrast, regular CD pushes the energy of incorrect fantasies up, balancing overall activity. I would speculate that you will need strong sparsity regularization and/or weight decay.
Bias units never hurt :)
Momentum and other fancy stuff may speed things up, but is usually not necessary.
Why softmax on the hidden units? Shouldn't it just be sigmoid?