I am learning NN and Keras. My test data is something like this:
Result, HomeWinPossibility, DrawPossibility, AwayWinPossibility
[['AwayWin' 0.41 0.28 0.31]
['HomeWin' 0.55 0.25 0.2]
['AwayWin' 0.17 0.21 0.62]
.....
Here is my model:
model = Sequential()
model.add(Dense(16, input_shape=(3,)))
model.add(Activation('sigmoid'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=["accuracy"])
model.fit(train_X, train_y_ohe, epochs=100, batch_size=1, verbose=1)
The output from fit is:
Epoch 1/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9151 - acc: 0.5737
Epoch 2/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9181 - acc: 0.5474
Epoch 3/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9111 - acc: 0.5526
Epoch 100/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9130 - acc: 0.5579
So why is the loss not going down, as it does in the NN tutorials I have read? Is it because the data I provided is just noise, so the NN can't find any clue, or is something wrong with my model?
As the acc is always around 0.55 (so about 55%), does it mean the NN actually achieved better than random guessing (> 33%)? If so, why did it already reach an accuracy of 0.57 in the first epoch?
So why is the loss not going down, as in some NN tutorials I read?
There might be many reasons, all depending on your data. Here are some things you could adjust:
You have a very low batch size. Although some data might actually respond to this, a batch size of 1 is too small in most cases, not to get started on the redundancy of the setup you show when using a batch size of 1. Batch size depends a lot on how much, and what kind of, data you have, but try somewhere around 20-30 if you have sufficient data.
Try different activation functions (but always keep softmax or sigmoid in the last layer, because you want numbers between 0 and 1).
Increase the number of units in the first and/or second layer (if you have enough data).
Try to set the learning rate (lr) for the Adam optimizer: model.compile(optimizer=keras.optimizers.Adam(lr=0.001), ...)
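Putting these suggestions together, a minimal sketch might look like the following (the layer sizes, batch size and learning rate are guesses to tune, not values known to work for this data):
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Sketch only: larger batch size, ReLU hidden layers, more units,
# explicit Adam learning rate. All numbers are starting points to tune.
model = Sequential([
    Dense(32, activation='relu', input_shape=(3,)),
    Dense(16, activation='relu'),
    Dense(3, activation='softmax'),   # softmax output for the 3 classes
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),  # older Keras versions use lr=0.001
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_X, train_y_ohe, epochs=100, batch_size=25, verbose=1)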
Is it because the data I provided is just noise?
If your data is pure noise across classes, then very probably, given that there are roughly the same number of datapoints in each class, the accuracy would be around 33%, since the network would essentially just guess at random.
As the acc is always around 0.55 (so about 55%), does it mean the NN actually achieved better than random guessing (33%)?
Not necessarily. The accuracy is a measure of how many samples were correctly classified. Say the validation data (conventionally the part of the dataset that the accuracy is calculated on) only contains data from one class. Then, if the NN classifies everything into that one class, the validation accuracy would be 100%!
That means that if you don't have the same number of datapoints from each class, accuracy alone is not to be trusted! A much better measure in cases where you have unbalanced datasets is e.g. the AUC (area under the ROC curve) or the F1 score, which takes false positives into account as well.
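If you want to check this, a rough sketch with scikit-learn (assuming test_X holds the held-out features and test_y holds integer class labels, not one-hot vectors; both names are placeholders):
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Sketch only: test_X / test_y are placeholder names for your held-out data.
y_prob = model.predict(test_X)                 # softmax outputs, shape (n, 3)
y_pred = np.argmax(y_prob, axis=1)             # hard class predictions

print(roc_auc_score(test_y, y_prob, multi_class='ovr'))   # one-vs-rest AUC
print(f1_score(test_y, y_pred, average='macro'))          # macro-averaged F1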
I would recommend that you look into the theory behind this. Just blindly running around will probably be very annoying because you'd have a very hard time getting good results. And even if you got good results, they might often not be as good as you think. One place to read would be Ian Goodfellow's book on deep learning.
Related
I have a sample neural network and am trying to see how much it would cost me to run it on a server and how long it would take to train if, for example, I add 3 more layers with around 4000,3000,2000 nodes in each layer respectively.
I understand that from a high level perspective the network needs to
Feed the inputs through the network and get the results (which in turn runs Sigmoid), which I guess happens in constant time (even though the output may not be constant or even linear!)
Run Adam to optimize the weights/biases, which I guess also happens in linear time, since it is like gradient descent and only differs in how it manages the learning rate!
Update the weights/biases, which is constant!
I can't find a calculator to use and estimate the computation needed and I'm thinking of making one if I can get a good understanding of different variables in a neural network!
This is the code for my Tensorflow model:
const model = tf.sequential();
model.add(tf.layers.flatten({inputShape: [4317, 5]}));
model.add(tf.layers.dense({units: 1000, activation: 'sigmoid'}));
model.add(tf.layers.dense({units: 4316, activation: 'sigmoid'}));
const optimizer = tf.train.adam();
model.compile({
  optimizer: optimizer,
  loss: 'meanSquaredError'
});
And here is the network summary printed by Tensorflow
_________________________________________________________________
Layer (type) Output shape Param #
=================================================================
flatten_Flatten1 (Flatten) [null,21585] 0
_________________________________________________________________
dense_Dense1 (Dense) [null,1000] 21586000
_________________________________________________________________
dense_Dense2 (Dense) [null,4316] 4320316
=================================================================
Total params: 25906316
Trainable params: 25906316
Non-trainable params: 0
What if I change the activation functions to linear or ReLU?
I have a laptop with 16 GB of memory and 3.2 GHz 8-core ARMv8-A (M1 chip) and it looks like the laptop is taking about a minute to train a batch of 32 inputs.
With N inputs, each weight is used O(N) times per round of training, so with M weights you get roughly O(N*M) training time per round. It doesn't really matter where those weights sit in your network; even for recurrent layers (GRU, RNN, LSTM) this stays true.
Where things break down is that you can't let M go to infinity (which is how big-O works), because in that case your network training won't converge anymore. Effectively, it would be O(infinity).
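As a back-of-the-envelope sketch (plain Python, no framework needed), you can reproduce the parameter count from the summary above and see how the proposed extra layers scale it; the FLOP figure is a crude forward-pass estimate, not a benchmark:
# Rough cost sketch for a stack of dense layers; layer_sizes starts from the
# flattened input (4317*5 = 21585 features).
def dense_cost(layer_sizes):
    params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    flops_forward = 2 * params        # multiply-adds per example, very rough
    return params, flops_forward

print(dense_cost([4317 * 5, 1000, 4316]))                  # ~25.9M params, matches the summary
print(dense_cost([4317 * 5, 4000, 3000, 2000, 4316]))      # with the proposed extra layers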
While running my model, the time for each epoch is 0 sec. Does that mean there is something wrong with my model?
Probably training is just fast.
These epoch times are entirely possible when using GPUs, though of course only on smaller datasets/architectures.
Say you are training MNIST with a fairly straightforward network (a LeNet-like CNN); these times are pretty normal.
From scanning the loss/acc, something is clearly happening, since the numbers are not constant.
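If you want to double-check that the epochs are really doing work, here is a small sketch of a per-epoch timing callback for tf.keras (the fit call is commented out and uses placeholder names):
import time
from tensorflow import keras

# Sketch only: logs wall-clock time per epoch so the "0s" in the progress bar
# can be compared against an explicit measurement.
class EpochTimer(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self._t0 = time.time()
    def on_epoch_end(self, epoch, logs=None):
        print(f"epoch {epoch} took {time.time() - self._t0:.3f}s")

# model.fit(x_train, y_train, epochs=10, callbacks=[EpochTimer()])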
I designed a CNN to detect motor movements from EEG.
Input size (EEG data): 18x64, i.e. 18 electrodes and 64 samples per epoch.
layers = [
    imageInputLayer([18 64 1])                 % input layer assumed from the stated 18x64 EEG input
    convolution2dLayer([1 4], 10)              % 10 filters of size 1x4
    reluLayer()
    maxPooling2dLayer([1 2], 'Stride', [1 2])
    dropoutLayer(0.1)
    convolution2dLayer([4 1], 20)              % 20 filters of size 4x1
    reluLayer()
    maxPooling2dLayer([2 1], 'Stride', [2 1])
    dropoutLayer(0.1)
    fullyConnectedLayer(2)
    dropoutLayer(0.2)
    softmaxLayer()
    classificationLayer()];
Data from 8 subjects. I trained the network using 7 subjects and tested it on the left-out subject, and did the same for all 8 subjects (basically the leave-one-out method). Training accuracy was 96-98% and so was validation accuracy. For some subjects the testing accuracy was 100%, and for a few others it was 98-99%. Is this a case of overfitting, or is this result reliable?
Thanks for your time and help.
Venkat
If testing performance is that good, then it is not an issue of overfitting. Overfitting hurts generalization, but if your model performs well on test data, it is working on unseen data and has generalized well.
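For reference, the leave-one-subject-out protocol from the question, sketched in Python with scikit-learn since the MATLAB pipeline itself isn't shown (X, y, groups and build_model are placeholders):
from sklearn.model_selection import LeaveOneGroupOut

# Sketch only: X is (n_trials, ...), y the labels, groups the subject ID (1..8)
# for each trial. One fold per held-out subject.
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    clf = build_model()                          # whatever model you train per fold
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
print(scores)                                    # one test accuracy per subject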
I'm training a network which has multiple losses and both creating and feeding the data into my network using a generator.
I've checked the structure of the data and it looks fine generally, and it also trains pretty much as expected the majority of the time. However, at a random epoch almost every time, the training loss for every prediction suddenly jumps from, say,
# End of epoch 3
loss: 2.8845
to
# Beginning of epoch 4
loss: 1.1921e-07
I thought it could be the data; however, from what I can tell the data is generally fine. It's even more suspicious because this happens at a random epoch (could it be because of a random data point chosen during SGD?) but then persists throughout the rest of training. That is, if at epoch 3 the training loss decreases to 1.1921e-07, it will continue this way in epoch 4, epoch 5, etc.
However, there are times when it reaches epoch 5 and hasn't done this yet and then might do it at epoch 6 or 7.
Is there any viable reason outside of the data that could cause this? Could a few bad data points even cause this so fast?
Thanks
EDIT:
Results:
300/300 [==============================] - 339s - loss: 3.2912 - loss_1: 1.8683 - loss_2: 9.1352 - loss_3: 5.9845 -
val_loss: 1.1921e-07 - val_loss_1: 1.1921e-07 - val_loss_2: 1.1921e-07 - val_loss_3: 1.1921e-07
The next epochs after this all have training loss 1.1921e-07.
Not entirely sure how satisfactory this is as an answer, but my findings seem to show that using multiple categorical_crossentropy losses together results in a very unstable network. Swapping them out for other loss functions fixes the problem with the data remaining unchanged. (It may also be worth noting that 1.1921e-07 is roughly float32 machine epsilon, which suggests the loss is being clipped to its numerical minimum rather than reflecting a genuine fit.)
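For context, this is roughly what a multi-output model with per-output losses looks like in Keras (functional API; the heads and sizes below are placeholders, not the original architecture), which makes it easy to swap a single head's loss while leaving the data untouched:
from tensorflow import keras
from tensorflow.keras import layers

# Sketch only: three placeholder output heads, each with its own loss entry.
inp = keras.Input(shape=(64,))
h = layers.Dense(128, activation='relu')(inp)
out1 = layers.Dense(10, activation='softmax', name='out1')(h)
out2 = layers.Dense(10, activation='softmax', name='out2')(h)
out3 = layers.Dense(10, activation='softmax', name='out3')(h)

model = keras.Model(inp, [out1, out2, out3])
model.compile(optimizer='adam',
              loss={'out1': 'categorical_crossentropy',   # swap any of these for
                    'out2': 'categorical_crossentropy',   # another loss to compare
                    'out3': 'categorical_crossentropy'})  # training stability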
I have two distinct (unknown relationship) types of input patterns and I need to design a neural network where I would get an output based on both these patterns. However, I am unsure of how to design such a network.
I am a newbie with NNs, but I am trying to read as much as I can. In my problem, as far as I can understand, there are two input matrices of order, say, 6*1 and an output matrix of order 6*1. So how should I start with this? Is it OK to use backpropagation and a single hidden layer?
e.g.->
Input 1 Input 2 Output
0.59 1 0.7
0.70 1 0.4
0.75 1 0.5
0.83 0 0.6
0.91 0 0.8
0.94 0 0.9
How do I decide the order of the weight matrix and the transfer function?
Please help. Any link pertaining to this will also do. Thanks.
The simplest thing to try is to concatenate the 2 input vectors. This way you'll have 1 input vector of length 12, and this becomes a "text-book" learning problem from R^{12} to R^{6}.
The downside of this is that you lose the information that each set of 6 inputs comes from a different source, but from your description it doesn't sound like you know much about these sources. Anyway, if you have any special knowledge of the two sources, you can apply some pre-processing (like subtracting the mean, or dividing by the standard deviation) to each source to make them more similar, but most learning algorithms should also work OK without it.
As for which algorithm to try, I think the canonical order is: linear machines (perceptron), then SVM, then multi-layer networks (trained with backprop). The reason is that the more powerful the machine you use, the better your chances of fitting the training set, but the lower your chances of fitting the "true" pattern (overfitting).
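If you do go the multi-layer-network route, a minimal sketch of the concatenation idea above (input1, input2 and target are placeholder NumPy arrays of shape (n_samples, 6); layer sizes and epoch count are arbitrary):
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Sketch only: stack the two 6-dim inputs into one 12-dim vector per sample
# and fit a small MLP from R^12 to R^6.
X = np.concatenate([input1, input2], axis=1)     # shape (n_samples, 12)

model = keras.Sequential([
    layers.Dense(16, activation='tanh', input_shape=(12,)),
    layers.Dense(6)                               # linear output for regression
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, target, epochs=200, verbose=0)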