Fitting a neural network with ReLUs to polynomial functions - neural-network

Out of curiosity I am trying to fit a neural network with rectified linear units to polynomial functions.
For example, I would like to see how easy (or difficult) it is for a neural network to come up with an approximation for the function f(x) = x^2 + x. The following code should be able to do it, but seems to not learn anything. When I run
using Base.Iterators: repeated
ENV["JULIA_CUDA_SILENT"] = true
using Flux
using Flux: throttle
using Random
f(x) = x^2 + x
x_train = shuffle(1:1000)
y_train = f.(x_train)
x_train = hcat(x_train...)
m = Chain(
    Dense(1, 45, relu),
    Dense(45, 45, relu),
    Dense(45, 1),
    softmax
)
function loss(x, y)
    Flux.mse(m(x), y)
end
evalcb = () -> @show(loss(x_train, y_train))
opt = ADAM()
@show loss(x_train, y_train)
dataset = repeated((x_train, y_train), 50)
Flux.train!(loss, params(m), dataset, opt, cb = throttle(evalcb, 10))
println("Training finished")
@show m([20])
it returns
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
loss(x_train, y_train) = 2.0100101f14
Training finished
m([20]) = Float32[1.0]
Does anyone see how I could make the network fit f(x) = x^2 + x?

A couple of things are wrong with your attempt, mostly to do with how you use the optimizer and treat your input; nothing is wrong with Julia or Flux. The solution provided below does learn, but is by no means optimal.
It makes no sense to have a softmax output activation on a regression problem. Softmax is used in classification problems where the outputs of your model represent probabilities and therefore should lie in the interval (0,1). Your polynomial clearly takes values outside this interval. Moreover, softmax over a single output unit always returns 1, which is why m([20]) = Float32[1.0] no matter what the network has learned. It is usual to have a linear output activation in regression problems like these, which in Flux means that no activation should be defined on the output layer.
The shape of your data matters. train! computes gradients for loss(d...) where d is a batch in your data. In your case a minibatch consists of 1000 samples, and this same batch is repeated 50 times. Neural nets are often trained with smaller batch sizes, but a larger sample set. In the code I provided all batches consist of different data.
For training neural nets it is generally advised to normalize your input. Your input takes values from 1 to 1000. My example applies a simple linear transformation to bring the input data into a suitable range.
Normalization can also apply to the output. If the outputs are large, this can result in (too) large gradients and weight updates. Another approach is to lower the learning rate a lot.
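To illustrate the point about softmax above, here is a minimal sketch in plain Python (not Flux; the helper name softmax is just for illustration). It shows that a softmax over one output unit is constant, so no amount of training can change the prediction.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# One output unit: the result is 1.0 regardless of the raw value.
for raw in (-3.0, 0.0, 20.0, 420.0):
    print(raw, softmax([raw]))

# Several output units: the values are squashed into (0, 1) and sum to 1.
print(softmax([1.0, 2.0, 3.0]))
```

Because the output is constant, the gradient of the loss with respect to the weights is zero, which is consistent with the loss staying unchanged in the question.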
using Flux
using Flux: @epochs
using Random
normalize(x) = x/1000
function generate_data(n)
    f(x) = x^2 + x
    xs = reduce(hcat, rand(n)*1000)
    ys = f.(xs)
    (normalize(xs), normalize(ys))
end
batch_size = 32
num_batches = 10000
# A generator draws a fresh batch on every iteration; Iterators.repeated would reuse one batch.
data_train = (generate_data(batch_size) for _ in 1:num_batches)
data_test = generate_data(100)
model = Chain(Dense(1,40, relu), Dense(40,40, relu), Dense(40, 1))
loss(x,y) = Flux.mse(model(x), y)
opt = ADAM()
ps = Flux.params(model)
Flux.train!(loss, ps, data_train, opt , cb = () -> @show loss(data_test...))
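The effect of normalizing inputs and outputs can be sketched in plain Python. This is only an illustration under my own assumptions: the scaling constants x_scale and y_scale (the largest value in the training set) and the helper names are hypothetical and differ from the division by 1000 used in the answer's code.

```python
def f(x):
    return x**2 + x

xs = list(range(1, 1001))
ys = [f(x) for x in xs]

# Hypothetical scaling constants: divide by the largest value in the training set.
x_scale = max(xs)
y_scale = max(ys)

xs_norm = [x / x_scale for x in xs]
ys_norm = [y / y_scale for y in ys]

def mse_of_zero_prediction(targets):
    """Mean squared error of a model that always predicts 0."""
    return sum(t * t for t in targets) / len(targets)

print("raw loss scale:       ", mse_of_zero_prediction(ys))
print("normalized loss scale:", mse_of_zero_prediction(ys_norm))

def denormalize(y_norm):
    """Map a prediction made in normalized units back to the original units."""
    return y_norm * y_scale
```

With raw targets the initial loss is many orders of magnitude above 1, so gradients and weight updates are correspondingly large; after scaling, the loss starts below 1. Predictions made in normalized units have to be mapped back with denormalize before they are compared with f(x).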

Related

pytorch linear regression given wrong results

I implemented a simple linear regression and I’m getting some poor results. Just wondering if these results are normal or I’m making some mistake.
I tried different optimizers and learning rates, I always get bad/poor results
Here is my code:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.autograd import Variable
class LinearRegressionPytorch(nn.Module):
    def __init__(self, input_dim=1, output_dim=1):
        super(LinearRegressionPytorch, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
    def forward(self,x):
        x = x.view(x.size(0),-1)
        y = self.linear(x)
        return y
input_dim=1
output_dim = 1
if torch.cuda.is_available():
    model = LinearRegressionPytorch(input_dim, output_dim).cuda()
else:
    model = LinearRegressionPytorch(input_dim, output_dim)
criterium = nn.MSELoss()
l_rate =0.00001
optimizer = torch.optim.SGD(model.parameters(), lr=l_rate)
#optimizer = torch.optim.Adam(model.parameters(),lr=l_rate)
epochs = 100
#create data
x = np.random.uniform(0,10,size = 100) #np.linspace(0,10,100);
y = 6*x+5
mu = 0
sigma = 5
noise = np.random.normal(mu, sigma, len(y))
y_noise = y+noise
#pass it to pytorch
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Variable(x_data).cuda()
    target = Variable(y_data).cuda()
else:
    inputs = Variable(x_data)
    target = Variable(y_data)
for epoch in range(epochs):
    #predict data
    pred_y= model(inputs)
    #compute loss
    loss = criterium(pred_y, target)
    #zero grad and optimization
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    #if epoch % 50 == 0:
    #    print(f'epoch = {epoch}, loss = {loss.item()}')
#print params
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.data)
These are the poor results:
linear.weight tensor([[1.7374]], device='cuda:0')
linear.bias tensor([0.1815], device='cuda:0')
The results should be weight = 6 , bias = 5
Problem Solution
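Actually, your batch_size is problematic. If you have it set as one, your target needs the same shape as the outputs (which you are, correctly, reshaping with view(-1, 1)). Here pred_y has shape (100, 1) while target has shape (100,), so the loss broadcasts the two against each other instead of comparing them element by element.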
Your loss should be defined like this:
loss = criterium(pred_y, target.view(-1, 1))
This network is correct
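Why the shape mismatch matters can be sketched without PyTorch. The two functions below are a plain-Python emulation of what happens when a column of shape (n, 1) meets a vector of shape (n,); the function names are hypothetical and this is not PyTorch's implementation.

```python
def mse_elementwise(preds, targets):
    """MSE when predictions and targets have the same shape."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mse_broadcast(preds, targets):
    """MSE when a column of shape (n, 1) is broadcast against a vector of shape (n,).

    Every prediction is compared with every target, giving n * n terms.
    """
    n = len(preds)
    return sum((p - t) ** 2 for p in preds for t in targets) / (n * n)

preds = [1.0, 2.0, 3.0]
targets = [1.0, 2.0, 3.0]

print(mse_elementwise(preds, targets))  # 0.0: the predictions are perfect
print(mse_broadcast(preds, targets))    # greater than 0 for the same perfect predictions
```

In the broadcast case even perfect predictions are penalized, so the optimizer is not minimizing the intended loss. Reshaping the target with view(-1, 1) restores the element-by-element comparison.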
Results
Your results will not be bias=5 (yes, weight will go towards 6 indeed) as you are adding random noise to target (and as it's a single value for all your data points, only bias will be affected).
If you want bias equal to 5 remove addition of noise.
You should increase number of your epochs as well, as your data is quite small and network (linear regression in fact) is not really powerful. 10000 say should be fine and your loss should oscillate around 0 (if you change your noise to something sensible).
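As a cross-check that does not depend on the optimizer, the learning rate or the number of epochs, the data can be compared against the closed-form least-squares fit. The sketch below uses plain Python with the same slope, intercept and noise level as the question; the function name fit_line and the fixed seed are my own assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(100)]
clean = [6 * x + 5 for x in xs]
noisy = [y + rng.gauss(0, 5) for y in clean]

print("without noise:", fit_line(xs, clean))
print("with noise:   ", fit_line(xs, noisy))
```

Whatever the closed-form fit returns on the noisy data is the best a linear model trained with MSE can do on that sample, so it is a useful target to compare the trained weight and bias against.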
Noise
You are creating multiple gaussian distributions with different variations, hence your loss would be higher. Linear regression is unable to fit your data and find sensible bias (as the optimal slope is still approximately 6 for your noise, you may try to increase multiplication of 5 to 1000 and see what weight and bias will be learned).
Style (a little offtopic)
Please read documentation about PyTorch and keep your code up to date (e.g. Variable is deprecated in favor of Tensor and rightfully so).
This part of code:
x_data = torch.from_numpy(x).float()
y_data = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = Tensor(x_data).cuda()
    target = Tensor(y_data).cuda()
else:
    inputs = Tensor(x_data)
    target = Tensor(y_data)
Could be written succinctly like this (without much thought):
inputs = torch.from_numpy(x).float()
target = torch.from_numpy(y_noise).float()
if torch.cuda.is_available():
    inputs = inputs.cuda()
    target = target.cuda()
I know deep learning has its reputation for bad code and fatal practice, but please do not help spread this approach.

Keras Custom loss function to pass argument (numpy array containing Noise learnt) with same batch size as of y_true and y_pred

I have implemented a custom loss function which takes in additional Noise (a numpy array), as illustrated below:
def custom_rcae_loss(self):
    N = self.Noise
    lambda_val = self.lamda[0]
    mue = self.mue
    self.batchNo += 1
    index = self.batchNo
    def custom_rcae(y_true, y_pred):
        if(N.ndim >1):
            term1 = keras.losses.mean_squared_error(y_true, (y_pred + N ))
The issue is that y_pred has shape (batch_size, 28, 28, 1).
How can I make sure my Noise has the same shape as y_pred, since I would like to compute (y_pred + Noise)?
For instance, if my input has 5983 samples and the batch size is 128, the samples do not split into equally sized batches.
How can we address this issue in Keras, so that Noise always has the same shape as y_pred?
Looking forward to suggestions and hints
Thanks in advance
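For reference, the batch arithmetic in the example works out as sketched below in plain Python. The helper batch_slices is hypothetical, and the sketch assumes batches are taken in order without shuffling, which the question does not state.

```python
def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs covering num_samples in order."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)

slices = list(batch_slices(5983, 128))
sizes = [stop - start for start, stop in slices]

print(len(slices))   # 47 batches in total
print(sizes[0])      # 128 samples in every full batch
print(sizes[-1])     # 95 samples in the final, shorter batch
```

Under that assumption, a Noise array of shape (5983, 28, 28, 1) sliced with the same (start, stop) pairs would match y_pred in every batch, including the last one with 95 samples.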

Merging two tensors by convolution in Keras

I'm trying to convolve two 1D tensors in Keras.
I get two inputs from other models:
x - of length 100
ker - of length 5
I would like to get the 1D convolution of x using the kernel ker.
I wrote a Lambda layer to do it:
import tensorflow as tf
def convolve1d(x):
    y = tf.nn.conv1d(value=x[0], filters=x[1], padding='VALID', stride=1)
    return y
x = Input(shape=(100,))
ker = Input(shape=(5,))
y = Lambda(convolve1d)([x,ker])
model = Model([x,ker], [y])
I get the following error:
ValueError: Shape must be rank 4 but is rank 3 for 'lambda_67/conv1d/Conv2D' (op: 'Conv2D') with input shapes: [?,1,100], [1,?,5].
Can anyone help me understand how to fix it?
It was much harder than I expected, because Keras and TensorFlow don't expect any batch dimension in the convolution kernel. I had to write the loop over the batch dimension myself, which requires specifying batch_shape instead of just shape in the Input layer. Here it is:
import numpy as np
import tensorflow as tf
import keras
from keras import backend as K
from keras import Input, Model
from keras.layers import Lambda
def convolve1d(x):
    input, kernel = x
    output_list = []
    # Static batch size, known because the Input layers are built with batch_shape
    batch_size = K.int_shape(input)[0]
    if K.image_data_format() == 'channels_last':
        kernel = K.expand_dims(kernel, axis=-2)
    else:
        kernel = K.expand_dims(kernel, axis=0)
    for i in range(batch_size): # Loop over batch dimension
        output_temp = tf.nn.conv1d(value=input[i:i+1, :, :],
                                   filters=kernel[i, :, :],
                                   padding='VALID',
                                   stride=1)
        output_list.append(output_temp)
        print(K.int_shape(output_temp))
    return K.concatenate(output_list, axis=0)
batch_input_shape = (1, 100, 1)
batch_kernel_shape = (1, 5, 1)
x = Input(batch_shape=batch_input_shape)
ker = Input(batch_shape=batch_kernel_shape)
y = Lambda(convolve1d)([x,ker])
model = Model([x, ker], [y])
a = np.ones(batch_input_shape)
b = np.ones(batch_kernel_shape)
c = model.predict([a, b])
In the current state :
It doesn't work for inputs (x) with multiple channels.
If you provide several filters, you get as many outputs, each being the convolution of the input with the corresponding kernel.
From the given code it is difficult to tell what you mean when you ask "is it possible". But if you mean merging two layers and feeding the merged layer to a convolution, then yes, it is possible.
x = Input(shape=(100,))
ker = Input(shape=(5,))
merged = keras.layers.concatenate([x,ker], axis=-1)
y = K.conv1d(merged, 'same')
model = Model([x,ker], y)
EDIT:
@user2179331 thanks for clarifying your intention. You are using the Lambda class incorrectly, which is why the error message appears.
But what you are trying to do can be achieved using keras.backend functions.
Note, though, that when using lower-level functions you lose some higher-level abstraction. E.g. when using keras.backend.conv1d the input needs shape (BATCH_SIZE, width, channels) and the kernel needs shape (kernel_size, input_channels, output_channels). So in your case let us assume that x has 1 channel (input channels == 1) and y has the same number of channels (output channels == 1).
So your code can now be refactored as follows
from keras import backend as K
from keras.layers import Input
def convolve1d(x,kernel):
    y = K.conv1d(x,kernel, padding='valid', strides=1,data_format="channels_last")
    return y
input_channels = 1
output_channels = 1
kernel_width = 5
input_width = 100
ker = K.variable(K.random_uniform([kernel_width,input_channels,output_channels]),K.floatx())
x = Input(shape=(input_width,input_channels))
y = convolve1d(x,ker)
I guess I have understood what you mean. Given the wrong example code below:
input_signal = Input(shape=(L), name='input_signal')
input_h = Input(shape=(N), name='input_h')
faded= Lambda(lambda x: tf.nn.conv1d(input, x))(input_h)
You want to convolve each signal vector with a different vector of fading coefficients.
The convolution operations in TensorFlow, e.g. tf.nn.conv1d, only support a fixed kernel that is shared by the whole batch. Therefore, the code above cannot run as you want.
I do not have a good solution either. The code you gave runs correctly, but it is complex and inefficient. Another feasible but also inefficient way is to multiply by the Toeplitz matrix whose rows are shifted copies of the fading coefficients vector. When the signal vector is long, this matrix becomes extremely large.
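The Toeplitz idea can be sketched in plain Python (no TensorFlow; the function names are hypothetical). Like tf.nn.conv1d, the sketch computes a sliding dot product without flipping the kernel.

```python
def toeplitz_rows(kernel, signal_length):
    """Build the (L - N + 1) x L matrix whose i-th row is the kernel shifted right by i."""
    n = len(kernel)
    rows = []
    for shift in range(signal_length - n + 1):
        row = [0.0] * signal_length
        row[shift:shift + n] = kernel
        rows.append(row)
    return rows

def conv1d_valid(signal, kernel):
    """'valid' sliding dot product computed as a matrix-vector product."""
    matrix = toeplitz_rows(kernel, len(signal))
    return [sum(m * s for m, s in zip(row, signal)) for row in matrix]

# Same sizes as in the question: a signal of length 100 and a kernel of length 5.
out = conv1d_valid([1.0] * 100, [1.0] * 5)
print(len(out), out[0])  # 96 outputs, each equal to 5.0

# Each sample in a batch can use its own kernel by building one matrix per sample.
batch = [([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]),
         ([4.0, 3.0, 2.0, 1.0], [0.5, 0.5, 0.0])]
print([conv1d_valid(sig, ker) for sig, ker in batch])
```

For a signal of length L and a kernel of length N the matrix has (L - N + 1) * L entries, most of them zero, which is the memory cost mentioned above.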

Fitting a sine wave with Keras and PYMC3 yields unexpected results

I've been trying to fit a sine curve with a keras (theano backend) model using pymc3. I've been using this [http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learning/] as a reference point.
A Keras implementation alone fit using optimization does a good job, however Hamiltonian Monte Carlo and Variational sampling from pymc3 is not fitting the data. The trace is stuck at where the prior is initiated. When I move the prior the posterior moves to the same spot. The posterior predictive of the bayesian model in cell 59 is barely getting the sine wave, whereas the non-bayesian fit model gets it near perfect in cell 63. I created a notebook here: https://gist.github.com/tomc4yt/d2fb694247984b1f8e89cfd80aff8706 which shows the code and the results.
Here is a snippet of the model below...
class GaussWeights(object):
    def __init__(self):
        self.count = 0
    def __call__(self, shape, name='w'):
        return pm.Normal(
            name, mu=0, sd=.1,
            testval=np.random.normal(size=shape).astype(np.float32),
            shape=shape)
def build_ann(x, y, init):
    with pm.Model() as m:
        i = Input(tensor=x, shape=x.get_value().shape[1:])
        m = i
        m = Dense(4, init=init, activation='tanh')(m)
        m = Dense(1, init=init, activation='tanh')(m)
        sigma = pm.Normal('sigma', 0, 1, transform=None)
        out = pm.Normal('out',
                        m, 1,
                        observed=y, transform=None)
    return out
with pm.Model() as neural_network:
    likelihood = build_ann(input_var, target_var, GaussWeights())
    # v_params = pm.variational.advi(
    #     n=300, learning_rate=.4
    # )
    # trace = pm.variational.sample_vp(v_params, draws=2000)
    start = pm.find_MAP(fmin=scipy.optimize.fmin_powell)
    step = pm.HamiltonianMC(scaling=start)
    trace = pm.sample(1000, step, progressbar=True)
The model contains normal noise with a fixed std of 1:
out = pm.Normal('out', m, 1, observed=y)
but the dataset does not. It is only natural that the posterior predictive does not match the dataset, because the two were generated in very different ways. To make it more realistic you could add noise to your dataset, and then estimate sigma:
mu = pm.Deterministic('mu', m)
sigma = pm.HalfCauchy('sigma', beta=1)
pm.Normal('y', mu=mu, sd=sigma, observed=y)
What you are doing right now is similar to taking the output from the network and adding standard normal noise.
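The mismatch can be made concrete with a small simulation in plain Python (no pymc3). It is an idealized sketch: it assumes the network mean is exactly the sine, and the helper residual_std and the fixed seed are my own choices.

```python
import math
import random

rng = random.Random(0)  # fixed seed for repeatability

xs = [2 * math.pi * i / 2000 for i in range(2000)]
clean = [math.sin(x) for x in xs]                   # noise-free data, as in the question
predictive = [y + rng.gauss(0, 1) for y in clean]   # mean plus N(0, 1) noise, as the model assumes

def residual_std(values, means):
    """Standard deviation of values around the given means."""
    n = len(values)
    return math.sqrt(sum((v - m) ** 2 for v, m in zip(values, means)) / n)

print("spread of the data around the sine:        ", residual_std(clean, clean))
print("spread of predictive draws around the sine:", residual_std(predictive, clean))
```

The data sit exactly on the curve while the predictive draws scatter around it with a standard deviation close to 1, which is comparable to the amplitude of the sine itself. Estimating sigma, as suggested above, lets the model shrink that spread to match the data.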
A couple of unrelated comments:
out is not the likelihood, it is just the dataset again.
If you use HamiltonianMC instead of NUTS, you need to set the step size and the integration time yourself. The defaults are not usually useful.
Seems like keras changed in 2.0 and this way of combining pymc3 and keras does not seem to work anymore.

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on 2 datasets created from the audio files:
The first set is created via wavread by extracting vectors from each file: the set is an 834 by 48116 matrix. Each training example is a 48116-element vector of the wav's frequencies.
The second set is created by extracting the frequencies of 3 formants of the vowels, where each formant (feature) has its own frequency range (for example, the F1 range is 500-1500Hz, F2 is 1500-2000Hz and so on). Each training example is a 3-vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;
grad = ((h-y)'*X)/m;
theta_partial = theta;
theta_partial(1) = 0;
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
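For readers who want to check the formulas outside Matlab, the same regularized cost and gradient can be sketched in plain Python for a single one-vs-all classifier. The function name lr_cost and the toy data are hypothetical; the first column of X is assumed to be the bias feature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_cost(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient.

    X is a list of rows (first entry of each row is the bias feature 1.0),
    y is a list of 0/1 labels, lam is the regularization strength.
    """
    m = len(y)
    h = [sigmoid(sum(t * x for t, x in zip(theta, row))) for row in X]
    cost = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
                for yi, hi in zip(y, h)) / m
    # The bias term theta[0] is excluded from the penalty.
    cost += lam / (2 * m) * sum(t * t for t in theta[1:])
    grad = [sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
            for j in range(len(theta))]
    for j in range(1, len(theta)):
        grad[j] += lam / m * theta[j]
    return cost, grad

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
y = [1, 0, 1]
cost, grad = lr_cost([0.0, 0.0], X, y, lam=0.1)
print(cost, grad)
```

A quick sanity check is that with theta set to all zeros the cost equals log(2) for any dataset, since every prediction is 0.5.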
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels,
    [theta] = fmincg(@(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
    all_theta(c, :) = theta';
end
where num_labels = 8, lambda(regularization) is 0.1
With the first set, MaxIter = 50, and I get ~99.8% classification accuracy.
With the second set and MaxIter=50, the accuracy is poor - 62.589928
I thought about increasing MaxIter to a larger value to improve the performance, however, even at a ridiculous amount of iterations, the result doesn't go higher than 66.546763. Changing of the regularization value (lambda) doesn't seem to influence the results in any better way.
What could be the problem? I am new to machine learning and I can't seem to catch what exactly causes this drastic difference. The only reason that obviously stands out for me is that the first set's examples are very long vectors, hence, larger amount of features, and the second set's examples are represented by short 3-vectors. Is this data not enough to classify the second set? If so, what can be done about it to achieve better classification results for the second set?