2-Dense neural network accuracy optimization - neural-network

Below you can see the dataset I use ("sorted_output") to construct an ANN with two hidden dense layers, one input layer, and one output layer. My question is: why am I getting extremely low accuracy (62.5%)? I have the feeling that, since both my input data (columns A-U) and my output data (column V) are in binary form, this should lead me to 100% accuracy. Am I wrong?
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load the dataset (comma-separated; columns A-U are features, column V is the label)
dataset = numpy.loadtxt("sorted_output.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:21]
Y = dataset[:,21]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model: 21 inputs -> 12 -> 10 -> 1 (sigmoid for the binary output)
model = Sequential()
model.add(Dense(12, input_dim=21, kernel_initializer='orthogonal', activation='relu'))
model.add(Dense(10, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='orthogonal', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)

The accuracy of a network depends on many factors, so it is difficult to say exactly why it is so low in your case. It really depends on the underlying data distribution and on how good your network is at capturing the relevant information during training. Binary inputs and outputs alone do not guarantee that the mapping is learnable, so there is no reason to expect 100% accuracy.
I suggest you monitor the training and validation loss and check whether the model is overfitting the training data. If it is, you may need some kind of regularization to improve generalization. Otherwise, you can increase the depth of the network and check whether the results improve.
This is by no means an exhaustive list of approaches. Sometimes changing the optimizer also helps, depending on how your data is distributed.
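As a concrete illustration of those suggestions, here is a minimal sketch (not your exact setup; the dropout rate and the switch to the adam optimizer are assumptions to experiment with) of adding regularization and a different optimizer to a model like yours:
from keras.models import Sequential
from keras.layers import Dense, Dropout

# a variant of the original architecture with dropout regularization
model = Sequential()
model.add(Dense(12, input_dim=21, activation='relu'))
model.add(Dropout(0.2))  # assumed rate; tune against the validation loss
model.add(Dense(10, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# adam often converges faster than plain sgd on small tabular problems
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# monitor both training and validation loss to spot overfitting
history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
                    epochs=150, batch_size=10, verbose=1)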

Related

When to use layernorm/batch norm?

Where should you splice the normalization when designing a network? E.g. if you have a stacked Transformer or Attention network, does it make sense to normalize any time after you have a dense layer?
What the original paper tries to explain is that Batch Normalization reduces overfitting.
Where should you splice the normalization when designing a network?
Set the normalization early, on the inputs. Unbalanced inputs with extreme values can cause instability.
Normalizing only the outputs will not prevent the inputs from causing that instability all over again.
Here is a little code example that shows what BN does:
import torch
import torch.nn as nn
m = nn.BatchNorm1d(100, affine=False)
input = 1000*torch.randn(3, 100)
print(input)
output = m(input)
print(output)
print(output.mean()) # should be ~ 0
print(output.std()) # should be ~ 1
Does it make sense to normalize any time after you have a dense layer?
Yes, you may do so, as matrix multiplication may produce extreme values. The same applies after convolution layers, because these are also matrix multiplications, though the effect is usually less intense than for a dense (nn.Linear) layer. If you print the resnet model, for instance, you will see that a batch norm is placed after every conv layer, like this:
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
To print the full resnet you may use this:
import torchvision.models as models
r = models.resnet18()
print(r)
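As a small sketch of the same pattern for dense layers (the layer sizes here are assumptions chosen just for illustration), you can place a BatchNorm1d right after each nn.Linear:
import torch
import torch.nn as nn

# a toy MLP with batch norm inserted after the dense (Linear) layer
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.BatchNorm1d(64),   # normalizes the pre-activation features
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 100)  # a dummy batch of 32 samples
print(model(x).shape)     # torch.Size([32, 10])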

Predict a 2D matrix from an image with keras keeping spatial information

I want to train a CNN to predict 100x100x1 matrices (heatmaps) from 224x224x3 images using Keras. My idea is to fine-tune the pretrained networks that Keras provides (ResNet, Xception, VGG16, etc.).
The first step is then to substitute the pretrained top layers with ones that meet my problem constraints. I am trying to predict 100x100x1 heatmap images whose values range from 0 to 1, so I want the output of my network to be a 100x100x1 matrix. I believe that if I use Flatten and then a Dense layer of 1000x1x1 I will be losing spatial information, which I don't want (right?).
I want my code to be flexible, so that it can run independently of which pretrained architecture is being used (I have to run many experiments). Therefore I want to stack a Dense layer that connects to every unit of whatever layer comes before it (which will depend on the pretrained architecture I am using).
Some answers relate to the fully convolutional approach, but that is not what I mean here. Both my X and Y have fixed shapes (224x224x3 and 100x100x1 respectively).
My problem is that I don't know how to stack the new layer(s) in such a way that the predictions/outputs of the net are 100x100x1 matrices.
As has been suggested in the answers, I am trying to add a 100x100x1 Dense layer, but I can't seem to get it working.
If, for example, I do it like this:
x = self.base_model.output
predictions = keras.layers.Dense(input_shape = (None, 100,100), units= 1000, activation='linear')(x)
self.model = keras.models.Model(input=self.base_model.input, output=predictions)
Then I got this when I start training:
ValueError: Error when checking target: expected dense_1 to have 4 dimensions, but got array with shape (64, 100, 100)
The Y of the network are indeed batches of shape (64, 100, 100).
Any suggestions?
Also, which loss function should I use? As has been suggested in the answers, I could use mse, but I wonder: is there any loss function that is able to capture the spatial structure of my desired 100x100x1 output?
Thanks in advance.
EDIT:
I semi-solved my problem thanks to @ncasas' answer:
I just added some deconvolutional layers until I got an output similar to 100x100x1. This is not what I wanted in the first place, since this implementation is not agnostic to the pretrained architecture it is built on top of. For Xception with input_shape = (224, 224, 3), these top layers give an output of 80x80x1:
x = self.base_model.output
x = keras.layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((3, 3))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
predictions = keras.layers.Conv2D(filters=1,
                                  kernel_size=(3, 3),
                                  activation='sigmoid',
                                  padding='same')(x)
self.model = keras.models.Model(input=self.base_model.input, output=predictions)
where self.base_model is keras.applications.Xception(weights = 'imagenet', include_top = False, input_shape = (224, 224, 3))
I am finally using mse as loss function and it works just fine.
What you are describing is multidimensional linear regression plus transfer learning.
In order to reuse the first layers of a trained Keras model, you can follow this post from the Keras blog, in section "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute". For your case, the only difference is that:
For the layer before the last one, you should probably have something larger than 256.
The last layer would be a 10000-unit Dense layer with linear activation (i.e. no activation at all). You can either reshape your expected outputs from 100x100 to 10000, or add an extra Reshape layer to the network to get a 100x100 output.
Keep in mind that between the convolutional part of the network and the multilayer perceptron part (i.e. the final Dense layer(s)) there must be a Flatten layer to place the obtained activation maps into a single vector (search the linked post for "Flatten"); the error you receive is because of that.
If you don't want to flatten the activation patterns, you may want to directly use deconvolutions in your last layers. For that, you can take a look at the keras autoencoder tutorial, at section "Convolutional autoencoder".
The usual loss function used for regression problems is mean squared error (MSE). It does not make sense to use cross entropy for regression, as explained here.
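As a minimal sketch of the Dense-plus-Reshape approach described above (the 256-unit intermediate layer and the choice of VGG16 as the base are assumptions for illustration, not part of the original question), the new top could look like this:
import keras

# assumed pretrained base; the question also considers ResNet/Xception
base_model = keras.applications.VGG16(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3))

x = base_model.output
x = keras.layers.Flatten()(x)                        # collapse the conv feature maps
x = keras.layers.Dense(256, activation='relu')(x)    # assumed intermediate size
x = keras.layers.Dense(100 * 100, activation='linear')(x)  # one unit per heatmap pixel
predictions = keras.layers.Reshape((100, 100, 1))(x)        # back to the 100x100x1 target

model = keras.models.Model(inputs=base_model.input, outputs=predictions)
model.compile(optimizer='adam', loss='mse')           # MSE, as suggested for regression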
You should find this paper helpful. Simply replace the fully connected layers with convolutional layers. Instead of a single prediction for the entire image, the result will be a heatmap of predictions for smaller portions of the image.
You should use the categorical_crossentropy loss function.

Accuracy goes down during epoch keras

I've tried to write a neural network but the accuracy doesn't change from epoch to epoch. I'm using Keras and I can watch the accuracy change as each epoch is evaluated: it starts low, goes up a bit, then drops back down to the exact same value each time (see the example output). I've tried changing the batch size, the learning rate, and the data a bit, but every time it does the same thing, just perhaps with a different accuracy value. I've also tried different optimizers. Any help is appreciated. (Also, I was able to get an MNIST example working.)
model = Sequential()
model.add(Dense(1000, input_dim=100, kernel_initializer='uniform', activation='relu'))
model.add(Dense(len(history), kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
opt = SGD(lr=1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(X, Y, epochs=100, batch_size=50, verbose=1)
scores = model.evaluate(X, Y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
Since you only have a single neuron in your output layer, I assume you're doing regression and not classification.
If that's the case, then you should change your loss function to 'mse' and also remove the activation from your output layer, because the sigmoid function will squash your output between 0 and 1.
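A minimal sketch of that change, assuming the task really is regression (the layer sizes and the opt and history variables are kept from the question):
# same architecture, but with a linear output and MSE loss for regression
model = Sequential()
model.add(Dense(1000, input_dim=100, kernel_initializer='uniform', activation='relu'))
model.add(Dense(len(history), kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform'))  # no activation: linear output
model.compile(loss='mse', optimizer=opt, metrics=['mae'])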

Reconstruct Original Data using Denoising AutoEncoder

Sometimes the raw data doesn't contain sufficient information, as is often the case with biological experimental data. I have a gene expression dataset of size 100*1000. I want to use a Denoising AutoEncoder to get a reconstructed output of the same size (100*1000). How would that be possible?
Here you can find an interesting article about autoencoders. The denoising case is also mentioned; I hope it answers your question:
https://medium.com/a-year-of-artificial-intelligence/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a#.2jdcn3ctk
In case anyone ever stumbles over this post and wonders how to code a denoising autoencoder, here is a simple example:
import numpy as np
import tensorflow as tf
# Generate a 100x1000 dataset
x_train = np.random.rand(100, 1000)
# Add noise to the data
noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
# Clip the values to [0, 1]
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
# Define the input layer
inputs = tf.keras.layers.Input(shape=(1000,))
# Define the encoder
encoded = tf.keras.layers.Dense(100, activation='relu')(inputs)
# Define the decoder
decoded = tf.keras.layers.Dense(1000, activation='sigmoid')(encoded)
# Define the autoencoder model
autoencoder = tf.keras.models.Model(inputs, decoded)
# Compile the model
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
# Train the model
autoencoder.fit(x_train_noisy, x_train, epochs=100, batch_size=32)
Note:
You have to replace x_train with your own data.
x_train has to be noise-free (otherwise the denoising autoencoder won't work, since it has no clean reference).
You can add additional layers to your encoder and decoder (see the sketch after these notes).
You should play around with the hyperparameters (number of neurons in the individual layers, loss function, optimizer, epochs, batch_size) to see what works best for you; preferably, run an optimiser such as grid search to find the best values for them.
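For example, a deeper variant might look like this (the layer sizes of 256 and 64 are assumptions to illustrate the idea, not tuned values; this reuses the imports from the snippet above):
# a deeper encoder/decoder stack for the same 1000-dimensional input
inputs = tf.keras.layers.Input(shape=(1000,))
encoded = tf.keras.layers.Dense(256, activation='relu')(inputs)
encoded = tf.keras.layers.Dense(64, activation='relu')(encoded)   # bottleneck
decoded = tf.keras.layers.Dense(256, activation='relu')(encoded)
decoded = tf.keras.layers.Dense(1000, activation='sigmoid')(decoded)
deep_autoencoder = tf.keras.models.Model(inputs, decoded)
deep_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')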
And here are a couple of links to other sources on autoencoders:
Machine Learning Mastery
Keras Blog

Is it desirable to scale data for skflow.TensorFlowDNNClassifier?

My colleagues and this question on Cross Validated say you should transform data to zero mean and unit variance for neural networks. However, my performance was slightly worse with scaling than without.
I tried using:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

steps = 5000

def exp_decay(global_step):
    return tf.train.exponential_decay(
        learning_rate=0.1, global_step=global_step,
        decay_steps=steps, decay_rate=0.01)

random.seed(42)  # to sample data the same way
classifier = skflow.TensorFlowDNNClassifier(
    hidden_units=[150, 150, 150],
    n_classes=2,
    batch_size=128,
    steps=steps,
    learning_rate=exp_decay)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Did I do something wrong or is scaling not necessary?
Usually scaling benefits linear models and models without regularization the most. For example, a simple mean squared error loss (like in TensorFlowLinearRegressor) without regularization won't work very well on unscaled data.
In your case you are using a classifier that applies a softmax, and you are using a DNN, so scaling is not strictly needed. DNNs themselves can model rescaling (via the bias and the weights on each feature in the first layer) if that's a useful thing to do.
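To illustrate that last point, here is a small sketch (a toy numpy example, not part of the original answer) showing that a first layer can absorb standardization into its own weights and bias, since w*(x - mu)/sigma + b = (w/sigma)*x + (b - w*mu/sigma):
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(5, 3) * 100          # unscaled inputs
mu, sigma = x.mean(axis=0), x.std(axis=0)

w = rng.randn(3, 4)               # first-layer weights
b = rng.randn(4)                  # first-layer bias

# output of the layer applied to standardized inputs
out_scaled = ((x - mu) / sigma) @ w + b

# the same output from rescaled weights/bias applied to the raw inputs
w_eq = w / sigma[:, None]
b_eq = b - (mu / sigma) @ w
out_raw = x @ w_eq + b_eq

print(np.allclose(out_scaled, out_raw))  # True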