I am trying to train a simple neural network with Pybrain. After training I want to confirm that the nn is working as intended, so I activate the same data that I used to train it with. However every activation outputs the same result. Am I misunderstanding a basic concept about neural networks or is this by design?
I have tried altering the number of hidden nodes, the hiddenclass type, the bias, the learning rate, the number of training epochs and the momentum, to no avail.
This is my code...
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
net = buildNetwork(2, 3, 1)
ds = SupervisedDataSet(2, 1)
ds.addSample([77, 78], 77)
ds.addSample([78, 76], 76)
ds.addSample([76, 76], 75)
trainer = BackpropTrainer(net, ds)
for epoch in range(0, 1000):
    error = trainer.train()
    if error < 0.001:
        break
print(net.activate([77, 78]))
print(net.activate([78, 76]))
print(net.activate([76, 76]))
This is an example of what the results can be... As you can see the output is the same even though the activation inputs are different.
[ 75.99893007]
[ 75.99893007]
[ 75.99893007]

I had a similar problem. I was able to improve the accuracy (i.e. get a different answer for each input) by doing the following:
Normalizing/standardizing the input and output of the neural network.
Doing this allows the network to determine its internal weights more accurately, which in turn lets it predict the answers.
Here's an article that explains it in more detail.

In the end I solved this by normalizing the data between 0 and 1 and also training until the error rate hit 0.00001. It takes much longer to train, but I do get accurate results now.
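For anyone landing here, this is roughly what normalizing between 0 and 1 can look like. It is only a minimal sketch using NumPy on the three samples from the question; the helper names fit_min_max, scale and unscale are made up for illustration and are not part of Pybrain.

```python
import numpy as np

# Training data copied from the question above.
X = np.array([[77, 78], [78, 76], [76, 76]], dtype=float)
y = np.array([[77], [76], [75]], dtype=float)

def fit_min_max(a):
    """Return the per-column minimum and range used to scale `a` into [0, 1]."""
    lo = a.min(axis=0)
    span = a.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return lo, span

def scale(a, lo, span):
    """Map values into [0, 1] using statistics from the training data."""
    return (a - lo) / span

def unscale(a, lo, span):
    """Map scaled values (e.g. network outputs) back to the original units."""
    return a * span + lo

x_lo, x_span = fit_min_max(X)
y_lo, y_span = fit_min_max(y)

X_scaled = scale(X, x_lo, x_span)
y_scaled = scale(y, y_lo, y_span)
```

The scaled rows would then be passed to ds.addSample instead of the raw values, and whatever net.activate returns would go through unscale(..., y_lo, y_span) to get back to the original range.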


Compute softmax using breeze

I am constructing a deep neural network from scratch and I want to implement the softmax function.
I am using breeze for that, but it is not working as expected.
The documentation is also poor, with very few examples, so it is difficult for me to understand how I should use it.
Here is an example:
I have an output array that contains 10 dimensions.
I also have my label array.
Z contains 10 rows with the weighted values.
My label array also contains 10 rows, and one of them is set to 1 to specify which row is the expected result.
lab(0) = 1
lab(1 to 9) = 0
my code :
def ComputeZ(ActivationFunction: String, z: Array[Double], label: Array[Double]): Array[Double] = {
  ActivationFunction match {
    case "SoftMax" => val t = softmax(z, label)
I was expecting a probability distribution that sums to 1 over the 10 rows, but it actually returns the same values as Z.
I don't know what I am doing wrong.
Thanks for your help.
Your question seems a little bit confusing to me. Creating a softmax from scratch has nothing to do with the label or the real output value. A softmax function is used to turn the output of a neural network into a valid probability distribution, and it is used in multiclass classification problems. Since you have a one-hot vector as the label, it seems that you want to implement a cross-entropy criterion, or some other error function that evaluates the divergence between the prediction distribution and the label distribution. That needs the predicted probability distribution (obtained by applying your softmax to the output layer) and the one-hot vector of the expected output.
I looked at the code of the softmax function in breeze, but I don't see a layer implementation, and it doesn't do what I was expecting. Keep in mind that you need a forward and a backward function.
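To make the distinction concrete, here is a minimal sketch in Python/NumPy (not breeze) of a softmax forward pass, a cross-entropy loss against a one-hot label, and the gradient used in the backward pass. The scores in z are made up; only the shape (10 values, label at index 0) follows the question.

```python
import numpy as np

def softmax(z):
    """Turn raw output-layer scores into a probability distribution."""
    shifted = z - np.max(z)  # subtracting the max keeps exp() from overflowing
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, one_hot, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.sum(one_hot * np.log(probs + eps))

z = np.arange(10, dtype=float)  # made-up weighted values of the output layer
label = np.zeros(10)
label[0] = 1.0                  # lab(0) = 1, lab(1 to 9) = 0

p = softmax(z)                  # forward pass: the label is not involved
loss = cross_entropy(p, label)  # the label only enters in the loss
grad_z = p - label              # backward pass: d(loss)/dz for softmax + cross-entropy
```

Note that softmax takes only the scores; the label appears in the loss and in the gradient, which is the split between forward and backward function mentioned above.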

NSLocalizedDescription = "The size of the output layer 'Identity' in the neural network does not match the number of classes in the classifier."

I just created a model that does binary classification and has a dense layer of 1 unit at the end, with sigmoid activation. However, I now get this error when I try to convert it to CoreML.
I tried changing the number of units to 2 and the activation to softmax, but it still didn't work.
import coremltools as ct
from PIL import Image
# 1. define input size
image_input = ct.ImageType(scale=1/255)
# 2. give classifier
classifier_config = ct.ClassifierConfig(class_labels=[0, 1])  # ERROR here
# 3. convert the model
coreml_model = ct.convert("mask_detection_model_surgical_mask.h5",
                          inputs=[image_input], classifier_config=classifier_config)
# 4. load and resize an example image
example_image = Image.open("Unknown3.jpg").resize((256, 256))
# make a prediction using Core ML
out_dict = coreml_model.predict({mymodel.input_names[0]: example_image})
# save to disk
coreml_model.save("FINALLY.mlmodel")
I found the answer to my question.
Use a softmax activation and 2 Dense units as the final layer, with either loss='binary_crossentropy' or loss='categorical_crossentropy'.
Good luck to hundreds of people who posted a similar question but received no answer.
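A likely reason this works (my reading of the error message, not something confirmed by the coremltools docs) is that the classifier config lists two class labels, so the converter wants one output per label. Nothing is lost by switching: a 2-unit softmax can express exactly the same probabilities as a single sigmoid unit. A small NumPy sketch with a made-up logit z:

```python
import numpy as np

def sigmoid(z):
    """Probability of class 1 from a single output unit."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    """Probabilities for each class from one output unit per class."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = 1.7  # made-up logit, for illustration only

p_sigmoid = sigmoid(z)                   # one number: P(class 1)
p_softmax = softmax(np.array([0.0, z]))  # two numbers: [P(class 0), P(class 1)]
```

Here p_softmax[1] equals p_sigmoid and p_softmax[0] equals 1 - p_sigmoid, so the two-unit output matches the two class labels while carrying the same information.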

pretrained densenet/vgg16/resnet50 + gp does not train on cifar10 data

I'm trying to train a hybrid model with a GP on top of a pre-trained CNN (Densenet, VGG and Resnet) on CIFAR10 data, mimicking the ex2 function in the gpflow documentation. But the test accuracy is always between 0.1 and 0.2, which generally means random guessing (the Wilson+2016 paper shows a hybrid model for CIFAR10 data should get an accuracy of 0.7). Could anyone give me a hint about what could be wrong?
I've tried the same code with simpler CNN models (2 conv layers or 4 conv layers) and both give reasonable results. I've tried different Keras applications (Densenet121, VGG16, ResNet50), but none of them works. I've also tried freezing the weights in the pre-trained models, and it still doesn't work.
def cnn_dn(output_dim):
    base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(32, 32, 3))
    bout = base_model.output
    fcl = GlobalAveragePooling2D()(bout)
    # for layer in base_model.layers:
    #     layer.trainable = False
    output = Dense(output_dim, activation='relu')(fcl)
    md = Model(inputs=base_model.input, outputs=output)
    return md
#add gp on top, reference:ex2() function in
# needs to slightly change the build graph part because keras variable
# sharing is not the same as tensorflow
## build graph
with tf.variable_scope('cnn'):
    f_X = tf.cast(md(X), dtype=float_type)
    f_Xtest = tf.cast(md(Xtest), dtype=float_type)
## predict
res=np.argmax(, feed_dict={Xtest:xts}),1).reshape(yts.shape)
correct = res == yts.astype(int)
I finally figured out that the solution is to train for more iterations. In the original code I used just 50 iterations, as in the ex2() function for MNIST data, and that is not enough for a more complicated network and CIFAR10 data. Adjusting some hyper-parameters (e.g. the learning rate and the activation function) also helps.

implementing a MLP model in keras for timeseries prediction but the model doesn't train well

I'm trying to come up with an MLP model for timeseries prediction, following this blog post. I have 138 timeseries with lookback_window=28 (split into 50127 timeseries for training and 24255 timeseries for validation). I need to predict the next value (timesteps=28, n_features=1). I started from a 3-layer network, but it didn't train well. I tried to make the network deeper by adding more layers and more hidden units, but it doesn't improve. In the picture, you can see the prediction result of the following model. Here is my model code:
inp = Input(batch_shape=(batch_size, lookback_window))
first_layer = Dense(1000, input_dim=28, activation='relu')(inp)
snd_layer = Dense(500)(first_layer)
thirs_layer = Dense(250)(snd_layer)
tmp = Dense(100)(thirs_layer)
tmp2 = Dense(50)(tmp)
tmp3 = Dense(25)(tmp2)
out = Dense(1)(tmp3)
model = Model(inp, out)
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(..., train_y,
                    validation_data=(validation_data, validation_y),
What am I missing? How can I improve it?
The main thing I noticed is that you are not using non-linearities in your layers: only the first Dense layer has an activation. I would use ReLUs for the hidden layers and a linear layer for the final layer, in case you want values larger than 1 or smaller than -1 to be possible. If you do not want them to be possible, use tanh. By increasing the data you make the problem harder, and therefore your mostly linear model is underfitting severely.
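To illustrate why stacking Dense layers without activations does not help, here is a small NumPy sketch (random made-up weights, not the model from the question) showing that two linear layers are exactly one linear layer, and that a ReLU in between breaks that equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 28))  # 5 made-up input windows of 28 values each

# Weights and biases of two Dense layers (28 -> 16 -> 1), random for illustration.
W1, b1 = rng.normal(size=(28, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 1)), rng.normal(size=1)

# Two stacked Dense layers with no activation in between.
two_layers = (x @ W1 + b1) @ W2 + b2

# A single Dense layer that computes exactly the same function.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# The same two layers with a ReLU after the first one.
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2

print(np.allclose(two_layers, one_layer))  # the linear stack collapses
print(np.allclose(with_relu, one_layer))   # the ReLU version does not
```

So the extra layers in the question add parameters but no extra expressive power until an activation is put between them.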
I managed to get better results by the following changes:
Using RMSprop instead of Adam with lr=0.001 and, as @TommasoPasini mentioned, adding ReLU activations to all Dense layers (except the last one). It improves the results a lot!
epochs= 3000 instead of 1000.
But now I think it is overfitting. Here are the plots of the results and the validation and train loss:

Feed Forward - Neural Networks Keras

I just wanted to check that my understanding of the input to the feed-forward neural network I have implemented in Keras is correct.
[[ 25.26000023 26.37000084 24.67000008 23.30999947]
[ 26.37000084 24.67000008 23.30999947 21.36000061]
[ 24.67000008 23.30999947 21.36000061 19.77000046]...]
So in the data above it is a time window of 4 inputs in an array. My input layer is
model.add(Dense(4, input_dim=4, activation='sigmoid'))
model.fit(..., trainY, nb_epoch=10000, verbose=2, batch_size=4)
and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch? and does the batch_size need to be 4 in order for this time window to work?
Thanks John
and batch_size is 4, in theory when I call the fit function will the function go over all these inputs in each nb_epoch?
Yes, each epoch is one iteration over all training samples.
and does the batch_size need to be 4 in order for this time window to work?
No, these are completely unrelated things. A batch is simply a subset of your training data that is used to compute an approximation of the true gradient of the cost function. The bigger the batch, the closer you get to the true gradient (and to original Gradient Descent), but training gets slower. The closer you get to 1, the more stochastic and noisy the approximation becomes (and the closer you are to Stochastic Gradient Descent). The fact that your batch_size matches your data dimensionality is just an odd coincidence and has no meaning.
Let me put this in a more general setting: what you do in gradient descent with an additive loss function (which neural nets usually use) is go against the gradient, which is
grad_theta 1/N SUM_i=1^N loss(x_i, pred(x_i), y_i|theta) =
= 1/N SUM_i=1^N grad_theta loss(x_i, pred(x_i), y_i|theta)
where loss is some loss function over your pred (prediction) as compared to y_i.
In the batch-based scenario, the rough idea is that you do not need to go over all examples, but can instead use some strict subset, like batch = {(x_1, y_1), (x_5, y_5), (x_89, y_89) ... }, and use an approximation of the gradient of the form
1/|batch| SUM_(x_i, y_i) in batch: grad_theta loss(x_i, pred(x_i), y_i|theta)
As you can see, this is not related in any sense to the space where the x_i live, so there is no connection with the dimensionality of your data.
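The formulas above can be sketched numerically. This is only an illustration with NumPy on made-up data, using a linear model and squared error as the loss (both are my assumptions, chosen to keep the gradient short); the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 4  # 1000 made-up samples, 4 features (like the window of 4)
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=N)

theta = np.zeros(d)  # current parameters

def grad(Xb, yb, theta):
    """Mean gradient of 0.5 * (x . theta - y)^2 over the given examples."""
    residual = Xb @ theta - yb
    return Xb.T @ residual / len(yb)

# True gradient: 1/N SUM_i grad_theta loss(...)
full = grad(X, y, theta)

def batch_grad(batch_size):
    """Approximate gradient: 1/|batch| SUM over a random subset."""
    idx = rng.choice(N, size=batch_size, replace=False)
    return grad(X[idx], y[idx], theta)

for batch_size in (1, 4, 32, 256):
    g = batch_grad(batch_size)
    # The gradient always has shape (d,), whatever the batch size is.
    print(batch_size, g.shape, np.linalg.norm(g - full))
```

The batch size only changes how many examples are averaged; the gradient has one entry per parameter in every case, which is why batch_size and the window length of 4 are independent choices.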
Let me explain this with an example:
When you have 32 training examples and you call model.fit with a batch_size of 4, the neural network will be presented with 4 examples at a time, but one epoch will still be defined as one complete pass over all 32 examples. So in this case the network will go through 4 examples at a time and will, theoretically at least, call the forward pass (and the backward pass) 32 / 4 = 8 times per epoch.
In the extreme case when your batch_size is 1, that is plain old stochastic gradient descent. When your batch_size is greater than 1 but smaller than the training set, it's usually called mini-batch gradient descent, and when it equals the whole training set, it's (full) batch gradient descent.
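The arithmetic can be checked with a few lines of plain Python. This is a sketch of how one epoch is cut into batches, not Keras' actual implementation; iterate_batches is a made-up helper, and it ignores shuffling (which changes the order of the examples but not the number of batches).

```python
import math

def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs that together cover one epoch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size = 32, 4
batches = list(iterate_batches(n_samples, batch_size))
updates_per_epoch = len(batches)  # 32 / 4 = 8 forward/backward passes

# With batch_size = 1 there is one update per example.
sgd_updates = len(list(iterate_batches(n_samples, 1)))

# When the batch size does not divide the data evenly, the last batch is smaller.
uneven = list(iterate_batches(10, 4))
```

In general the number of updates per epoch is math.ceil(n_samples / batch_size).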