tensorflow model has different results than the same model in skflow (optimizer) - neural-network

I'm using tensorflow to replicate a neural network for the MNIST dataset, previously programmed in skflow. Here is the model in skflow:
import tensorflow.contrib.learn as skflow
from sklearn import metrics
from sklearn.datasets import fetch_mldata
from sklearn.cross_validation import train_test_split
mnist = fetch_mldata('MNIST original')
train_dataset, test_dataset, train_labels, test_labels = train_test_split( mnist.data, mnist.target, test_size=10000, random_state=42)
classifier = skflow.TensorFlowDNNClassifier(hidden_units=[1200, 1200], n_classes=10, optimizer="SGD", learning_rate=0.01, batch_size=128, steps=1000)
classifier.fit(train_dataset, train_labels)
score = metrics.accuracy_score(test_labels, classifier.predict(test_dataset))
print("Accuracy: %f" % score)
This model get 0.950600 of accuracy.
But the model replicated in tensorflow gets nan in the loss fuction and fails to improve (I think it's not related with Tensorflow NaN bug? since I'm using tf.nn.softmax_cross_entropy_with_logits).
I can't figure out why, since the setup of the model in tensorflow is the same than in the model in skflow. The only thing I'm unsure if it's the same, is on how skflow initializes the weights of the network, I searched that part in the code of skflow but I have not found it.
Here is the code in tensorflow:
import numpy as np
import tensorflow as tf
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
num_labels = len(np.unique(mnist.target))
num_pixels = mnist.data.shape[1]
#reshape labels to one hot encoding
labels = (np.arange(num_labels) == mnist.target[:, None]).astype(np.float32)
#create train_dataset of 60000 and test_dataset of 10000 elem
train_dataset, test_dataset, train_labels, test_labels = train_test_split(mnist.data, labels, test_size=10000, random_state=42)
def accuracy(predictions, labels):
return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0])
batch_size = 128
graph = tf.Graph()
with graph.as_default():
# Input data.
tf_train_dataset = tf.placeholder(tf.float32,
shape=(batch_size, num_pixels))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
tf_test_dataset = tf.cast(tf.constant(test_dataset), tf.float32)
w_hidden = tf.Variable(tf.truncated_normal([num_pixels, 1200]))
b_hidden = tf.Variable(tf.zeros([1200]))
hidden = tf.nn.relu(tf.matmul(tf_train_dataset, w_hidden) + b_hidden)
w_hidden_2 = tf.Variable(tf.truncated_normal([1200, 1200]))
b_hidden_2 = tf.Variable(tf.zeros([1200]))
hidden2 = tf.nn.relu(tf.matmul(hidden, w_hidden_2) + b_hidden_2)
w = tf.Variable(tf.truncated_normal([1200, num_labels]))
b = tf.Variable(tf.zeros([num_labels]))
logits = tf.matmul(hidden2, w) + b
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits, tf_train_labels))
# Optimizer.
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
# Predictions for the training, and test data.
train_prediction = tf.nn.softmax(logits)
test_prediction = tf.nn.softmax(tf.matmul(tf.nn.relu(tf.matmul(tf.nn.relu(tf.matmul(tf_test_dataset, w_hidden) + b_hidden), w_hidden_2) + b_hidden_2), w) + b)
num_steps = 1001
with tf.Session(graph=graph) as session:
tf.initialize_all_variables().run()
print("Initialized")
for step in range(num_steps):
# Pick an offset within the training data, which has been randomized.
offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
# Generate a minibatch.
batch_data = train_dataset[offset:(offset + batch_size), :]
batch_labels = train_labels[offset:(offset + batch_size), :]
# Prepare a dictionary telling the session where to feed the minibatch.
feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
_, l, predictions = session.run( [optimizer, loss, train_prediction], feed_dict=feed_dict)
if (step % 100 == 0):
print("Minibatch loss at step %d: %f" % (step, l))
print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
I'm clueless on what might be the issue. Any suggestions?
Edited 1: As I was suggested, I tried replacing tf.Variable calls with tf.get_variable("w_hidden", [num_pixels, 1200]), but I got Nans.
Also, I used skflow.ops.dnn op for doing the layers and used my own loss and etc, and still got Nans.
Edited 2: Turns out it is not a problem of weight initialization. It seems that the gradients are too unstable (in the tensorflow model) and that lead the loss to become NaN. As in Adding multiple layers to TensorFlow causes loss function to become Nan, I slowed the learning rate by an order of magnitude, and it worked out.
Now what I don't understand is what differs between the SGD optimizer of skflow and the one above. Or what is the explanation that they "seem" equal, but they need different learning rates?

Initialization in skflow relies on tf.get_variable default initialization - uniform_unit_scaling_initializer (see this for detailed description).
You can try replacing your tf.Variable calls with something like tf.get_variable("w_hidden", [num_pixels, 1200]).
Alternative, is to start with using skflow.ops.dnn op that will do the layers for you but you still do your own loss and etc.
Also please let me know if you there a clear usecase that forced you to rewrite things in pure TensorFlow instead of using skflow - I would love to address it. You can always write custom model via passing model_fn into TensorFlowEstimator and still use training / batching / saving and etc functionality.

Related

Precison issue with sigmoid activation function for Tensorflow/Keras 2.3.1

Assuming the following forward pass in a classic ANN
(Based on https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/):
Now let's use a sigmoid activation on that, I get:
So far so good, now let's check the result of this calculation in python:
1 / (1+ math.exp(-0.3775)) # ... = 0.5932699921071872, OK
However this is double precision and since Keras uses float32 let's calculate the same thing but with float32, I get:
w1 = np.array(0.15, dtype=np.float32)
i1 = np.array(0.05, dtype=np.float32)
w2 = np.array(0.2, dtype=np.float32)
i2 = np.array(0.1, dtype=np.float32)
b1 = np.array(0.35, dtype=np.float32)
k1 = np.array(1, dtype=np.float32)
k2= np.array(1, dtype=np.float32)
n1 = w1 * i1 + w2 * i2 + b1 * k1
np.array(k2 / (1 + np.exp(-n1)), dtype=np.float32) # --->array(0.59327, dtype=float32)
Ok so normally I should be able to get 0.59327 for our out_{h1} using Keras float32 framework.
Let's now try to get this result using Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
import numpy as np
model = Sequential()
model.add(Dense(2,activation='sigmoid')) # Build only one layer (enough for example)
model.compile(optimizer = SGD(lr = 0.5),loss='mse') # Don't really care for this example, simply for compilation
input = np.array([[0.05, 0.10]]) # Set input, same as the ones provided in the example
output = np.array([[0.01, 0.99]]) # Don't really care for this example since we want to check activation only
weights = [np.array([[0.15, 0.20 ], # Required for set_weights
[0.25, 0.30]], dtype=np.float32),
np.array([0.35, 0.35], dtype=np.float32)]
model.build(input.shape) # Required for set_weights
model.set_weights(weights) # Set the weights to be the same as the one provided in the example
model.predict(np.array([[0.05, 0.10]], dtype=np.float32)) # This can be seen as out_{h1}
# array([[0.5944759 , 0.59628266]], dtype=float32)
# NOK: 0.5944759 != 0.59327
Can someone explain to me why I get 0.5944759 instead of 0.59327? The result seem far from the expected ouput and if possible provide an example of calculation and/or a way to get the expected output of 0.59327.
Please note this example was done using:
tensorflow 2.3.1
numpy 1.18.5
python 3.8.12
Thx for your help.
Tensorflow perform the Dense layer row to column:
It multiplies each row of your input with a column in the weights matrix, You need to transpose your weights matrix, if you do it you will get the correct result.
I validated using your code just modified:
weights = [np.array([[0.15, 0.20], # Required for set_weights
[0.2, 0.30]], dtype=np.float32),
np.array([0.35, 0.35], dtype=np.float32)]
To (pay attention to the transpose operator T:
weights = [np.array([[0.15, 0.20], # Required for set_weights
[0.2, 0.30]], dtype=np.float32).T,
np.array([0.35, 0.35], dtype=np.float32)]

Tensorflow - keras: bad performance for simple curve fitting task

I'm trying to implement a very simple one layered MLP for a toy regression problem with one variable (dimension = 1) and one target (dimension = 1). It's a simple curve fitting problem with zero noise.
Matlab\Deep Learning Toolbox
Using levenberg-marquardt backpropagation on a MLP with a single hidden layer with 100 neurons and hyperbolic tangent activation I got pretty decent performance with almost zero effort:
MSE = 7.18e-08
Plotting the predictions and the targets I get a very precise fitting.
Python\Tensorflow\Keras
With the same network settings I used in matlab there's almost no training. No matter how hard I try to tune the training parameters or switch the optimizer.
MSE = 0.12900154
In this case the plot of the predictions is a curve that is not even able to follow the oscillations of the target curve.
I can obtain something better using RELU activations for the hidden layer but we're still far:
MSE = 0.0582045
This is the code I used in Python:
# IMPORT LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
# IMPORT DATASET FROM CSV FILE, SHUFFLE TRAINING SET
# AND MAKE NUMPY ARRAY FOR TRAINING (DATA ARE ALREADY NORMALIZED)
dataset_path = "C:/Users/Rob/Desktop/Learning1.csv"
Learning_Dataset = pd.read_csv(dataset_path
, comment='\t',sep=","
,skipinitialspace=False)
Learning_Dataset = Learning_Dataset.sample(frac = 1) # SHUFFLING
test_dataset_path = "C:/Users/Rob/Desktop/Test1.csv"
Test_Dataset = pd.read_csv(test_dataset_path
, comment='\t',sep=","
,skipinitialspace=False)
Learning_Target = Learning_Dataset.pop('Target')
Test_Target = Test_Dataset.pop('Target')
Learning_Dataset = np.array(Learning_Dataset,dtype = "float32")
Test_Dataset = np.array(Test_Dataset,dtype = "float32")
Learning_Target = np.array(Learning_Target,dtype = "float32")
Test_Target = np.array(Test_Target,dtype = "float32")
# DEFINE SIMPLE MLP MODEL
inputs = tf.keras.layers.Input(shape=(1,))
x = tf.keras.layers.Dense(100, activation='relu')(inputs)
y = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs=inputs, outputs=y)
# TRAIN MODEL
opt = tf.keras.optimizers.RMSprop(learning_rate = 0.001,
rho = 0.9,
momentum = 0.0,
epsilon = 1e-07,
centered = False)
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=100)
model.compile(optimizer = opt,
loss = 'mse',
metrics = ['mse'])
model.fit(Learning_Dataset,
Learning_Target,
epochs=500,
validation_split = 0.2,
verbose=0,
callbacks=[early_stop],
shuffle = False,
batch_size = 100)
# INFERENCE AND CHECK ACCURACY
Predictions = model.predict(Test_Dataset)
Predictions = Predictions.reshape(10000)
print(np.square(np.subtract(Test_Target,Predictions)).mean()) # MSE
plt.plot(Test_Dataset,Test_Target,'o',Test_Dataset,Predictions,'o')
plt.legend(('Target','Model Prediction'))
plt.show()
What am i doing wrong?
Thanks

pytorch linear regression given wrong results

I implemented a simple linear regression and I’m getting some poor results. Just wondering if these results are normal or I’m making some mistake.
I tried different optimizers and learning rates, I always get bad/poor results
Here is my code:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable
class LinearRegressionPytorch(nn.Module):
def __init__(self, input_dim=1, output_dim=1):
super(LinearRegressionPytorch, self).__init__()
self.linear = nn.Linear(input_dim, output_dim)
def forward(self,x):
x = x.view(x.size(0),-1)
y = self.linear(x)
return y
input_dim=1
output_dim = 1
if torch.cuda.is_available():
model = LinearRegressionPytorch(input_dim, output_dim).cuda()
else:
model = LinearRegressionPytorch(input_dim, output_dim)
criterium = nn.MSELoss()
l_rate =0.00001
optimizer = torch.optim.SGD(model.parameters(), lr=l_rate)
#optimizer = torch.optim.Adam(model.parameters(),lr=l_rate)
epochs = 100
#create data
x = np.random.uniform(0,10,size = 100) #np.linspace(0,10,100);
y = 6*x+5
mu = 0
sigma = 5
noise = np.random.normal(mu, sigma, len(y))
y_noise = y+noise
#pass it to pytorch
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = Variable(x_data).cuda()
target = Variable(y_data).cuda()
else:
inputs = Variable(x_data)
target = Variable(y_data)
for epoch in range(epochs):
#predict data
pred_y= model(inputs)
#compute loss
loss = criterium(pred_y, target)
#zero grad and optimization
optimizer.zero_grad()
loss.backward()
optimizer.step()
#if epoch % 50 == 0:
# print(f'epoch = {epoch}, loss = {loss.item()}')
#print params
for name, param in model.named_parameters():
if param.requires_grad:
print(name, param.data)
There are the poor results :
linear.weight tensor([[1.7374]], device='cuda:0')
linear.bias tensor([0.1815], device='cuda:0')
The results should be weight = 6 , bias = 5
Problem Solution
Actually your batch_size is problematic. If you have it set as one, your targetneeds the same shape as outputs (which you are, correctly, reshaping with view(-1, 1)).
Your loss should be defined like this:
loss = criterium(pred_y, target.view(-1, 1))
This network is correct
Results
Your results will not be bias=5 (yes, weight will go towards 6 indeed) as you are adding random noise to target (and as it's a single value for all your data points, only bias will be affected).
If you want bias equal to 5 remove addition of noise.
You should increase number of your epochs as well, as your data is quite small and network (linear regression in fact) is not really powerful. 10000 say should be fine and your loss should oscillate around 0 (if you change your noise to something sensible).
Noise
You are creating multiple gaussian distributions with different variations, hence your loss would be higher. Linear regression is unable to fit your data and find sensible bias (as the optimal slope is still approximately 6 for your noise, you may try to increase multiplication of 5 to 1000 and see what weight and bias will be learned).
Style (a little offtopic)
Please read documentation about PyTorch and keep your code up to date (e.g. Variable is deprecated in favor of Tensor and rightfully so).
This part of code:
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = Tensor(x_data).cuda()
target = Tensor(y_data).cuda()
else:
inputs = Tensor(x_data)
target = Tensor(y_data)
Could be written succinctly like this (without much thought):
inputs = torch.from_numpy(x).float()
target = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
inputs = inputs.cuda()
target = target.cuda()
I know deep learning has it's reputation for bad code and fatal practice, but please do not help spreading this approach.

non-linear neural network regression - quadratic function is not being estimated correctly

I have mostly used ANNs for classification and only recently started to try them out for modeling continuous variables. As an exercise I generated a simple set of (x, y) pairs where y = x^2 and tried to train an ANN to learn this quadratic function.
The ANN model:
This ANN has 1 input node (ie. x), 2 hidden layers each with 2 nodes in each layer, and 1 output node. All four hidden nodes use the non-linear tanh activation function and the output node has no activation function (since it is regression).
The Data:
For the training set I randomly generated 100 numbers between (-20, 20) for x and computed y=x^2. For the testing set I randomly generated 100 numbers between (-30, 30) for x and also computed y=x^2. I then transformed all x so that they are centered around 0 and their min and max are approximately around -1.5 and 1.5. I also transformed all y similarly but made their min and max about -0.9 and 0.9. This way, all the data falls within that mid range of the tanh activation function and not way out at the extremes.
The Problem:
After training the ANN in Keras, I am seeing that only the right half of the polynomial function is being learned, and the left half is completely flat. Does anyone have any ideas why this may be happening? I tried playing around with different scaling options, as well as hidden layer specifications but no luck on that left side.
Thanks!
Attached is the code I used for everything and the image shows the plot of the scaled training x vs the predicted y. As you can see, only half of the parabola is recovered.
import numpy as np, pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
seed = 10
n = 100
X_train = np.random.uniform(-20, 20, n)
Y_train = X_train ** 2
X_test = np.random.uniform(-30, 30, n)
Y_test = X_test ** 2
#### Scale the data
x_cap = max(abs(np.array(list(X_train) + list(X_test))))
y_cap = max(abs(np.array(list(Y_train) + list(Y_test))))
x_mean = np.mean(np.array(list(X_train) + list(X_test)))
y_mean = np.mean(np.array(list(Y_train) + list(Y_test)))
X_train2 = (X_train-x_mean) / x_cap
X_test2 = (X_test-x_mean) / x_cap
Y_train2 = (Y_train-y_mean) / y_cap
Y_test2 = (Y_test-y_mean) / y_cap
X_train2 = X_train2 * (1.5 / max(X_train2))
Y_train2 = Y_train2 * (0.9 / max(Y_train2))
# define base model
def baseline_model1():
# create model
model1 = Sequential()
model1.add(Dense(2, input_dim=1, kernel_initializer='normal', activation='tanh'))
model1.add(Dense(2, input_dim=1, kernel_initializer='normal', activation='tanh'))
model1.add(Dense(1, kernel_initializer='normal'))
# Compile model
model1.compile(loss='mean_squared_error', optimizer='adam')
return model1
np.random.seed(seed)
estimator1 = KerasRegressor(build_fn=baseline_model1, epochs=100, batch_size=5, verbose=0)
estimator1.fit(X_train2, Y_train2)
prediction = estimator1.predict(X_train2)
plt.scatter(X_train2, prediction)
enter image description here
You should also consider adding more width to you hidden layer. I changed from 2 to 5 and got a very good fit. I also used more epochs as suggested from rvinas
Your network is very sensible to the initial parameters. The following will help:
Change your kernel_initializer to glorot_uniform. Your network is very small and glorot_uniform will work better in consonance with the tanh activations. Glorot uniform will encourage your weights to be initially within a more reasonable range (since it takes into account the fan-in and fan-out of each layer).
Train your model for more epochs (i.e. 1000).

Merging two tensors by convolution in Keras

I'm trying to convolve two 1D tensors in Keras.
I get two inputs from other models:
x - of length 100
ker - of length 5
I would like to get the 1D convolution of x using the kernel ker.
I wrote a Lambda layer to do it:
import tensorflow as tf
def convolve1d(x):
y = tf.nn.conv1d(value=x[0], filters=x[1], padding='VALID', stride=1)
return y
x = Input(shape=(100,))
ker = Input(shape=(5,))
y = Lambda(convolve1d)([x,ker])
model = Model([x,ker], [y])
I get the following error:
ValueError: Shape must be rank 4 but is rank 3 for 'lambda_67/conv1d/Conv2D' (op: 'Conv2D') with input shapes: [?,1,100], [1,?,5].
Can anyone help me understand how to fix it?
It was much harder than I expected because Keras and Tensorflow don't expect any batch dimension in the convolution kernel so I had to write the loop over the batch dimension myself, which requires to specify batch_shape instead of just shape in the Input layer. Here it is :
import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
from keras import Input, Model
from keras.layers import Lambda
def convolve1d(x):
input, kernel = x
output_list = []
if K.image_data_format() == 'channels_last':
kernel = K.expand_dims(kernel, axis=-2)
else:
kernel = K.expand_dims(kernel, axis=0)
for i in range(batch_size): # Loop over batch dimension
output_temp = tf.nn.conv1d(value=input[i:i+1, :, :],
filters=kernel[i, :, :],
padding='VALID',
stride=1)
output_list.append(output_temp)
print(K.int_shape(output_temp))
return K.concatenate(output_list, axis=0)
batch_input_shape = (1, 100, 1)
batch_kernel_shape = (1, 5, 1)
x = Input(batch_shape=batch_input_shape)
ker = Input(batch_shape=batch_kernel_shape)
y = Lambda(convolve1d)([x,ker])
model = Model([x, ker], [y])
a = np.ones(batch_input_shape)
b = np.ones(batch_kernel_shape)
c = model.predict([a, b])
In the current state :
It doesn't work for inputs (x) with multiple channels.
If you provide several filters, you get as many outputs, each being the convolution of the input with the corresponding kernel.
From given code it is difficult to point out what you mean when you say
is it possible
But if what you mean is to merge two layers and feed merged layer to convulation, yes it is possible.
x = Input(shape=(100,))
ker = Input(shape=(5,))
merged = keras.layers.concatenate([x,ker], axis=-1)
y = K.conv1d(merged, 'same')
model = Model([x,ker], y)
EDIT:
#user2179331 thanks for clarifying your intention. Now you are using Lambda Class incorrectly, that is why the error message is showing.
But what you are trying to do can be achieved using keras.backend layers.
Though be noted that when using lower level layers you will lose some higher level abstraction. E.g when using keras.backend.conv1d you need to have input shape of (BATCH_SIZE,width, channels) and kernel with shape of (kernel_size,input_channels,output_channels). So in your case let as assume the x has channels of 1(input channels ==1) and y also have the same number of channels(output channels == 1).
So your code now can be refactored as follows
from keras import backend as K
def convolve1d(x,kernel):
y = K.conv1d(x,kernel, padding='valid', strides=1,data_format="channels_last")
return y
input_channels = 1
output_channels = 1
kernel_width = 5
input_width = 100
ker = K.variable(K.random_uniform([kernel_width,input_channels,output_channels]),K.floatx())
x = Input(shape=(input_width,input_channels)
y = convolve1d(x,ker)
I guess I have understood what you mean. Given the wrong example code below:
input_signal = Input(shape=(L), name='input_signal')
input_h = Input(shape=(N), name='input_h')
faded= Lambda(lambda x: tf.nn.conv1d(input, x))(input_h)
You want to convolute each signal vector with different fading coefficients vector.
The 'conv' operation in TensorFlow, etc. tf.nn.conv1d, only support a fixed value kernel. Therefore, the code above can not run as you want.
I have no idea, too. The code you given can run normally, however, it is too complex and not efficient. In my idea, another feasible but also inefficient way is to multiply with the Toeplitz matrix whose row vector is the shifted fading coefficients vector. When the signal vector is too long, the matrix will be extremely large.