3-layered Neural network doesn't learn properly - neural-network

I'm trying to implement a neural network with 3 layers in Python, but anything with more than 2 layers is difficult for me. The problem with this one is that its output gets stuck at 0.5 and it does not learn, and I have no idea where it went wrong. Thank you to anyone with the patience to explain the error to me. (I hope the code makes sense.)
import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))
def reduce(x):
    return x*(1-x)
for justanumber in range(1000):
for i in range(len(l0)):
print l2
PS. I know that it might be a piece of trash as a script but that is why I asked for assistance

Your computations are not fully correct. For example, reduce is called on l1_err and l2_err, whereas it should be called on l1 and l2.
You are performing stochastic gradient descent. With so few parameters, it oscillates hugely. In this case, use full-batch gradient descent instead.
The bias units are not present, although technically you can still learn without bias.
I tried to rewrite your code with minimal changes. I have commented your lines to show the changes.
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))
def reduce(x):
    return x*(1-x)
l0=np.array ([np.array([1,1,0,0]),
output=np.array ([[0],[1],[1],[0],[1]]);
final_err = list ();
gamma = 0.05
maxiter = 100000
for justanumber in range(maxiter):
    syn0_del = np.zeros_like (syn0);
    syn1_del = np.zeros_like (syn1);
    l2_err_sum = 0;
    for i in range(len(l0)):
        this_data = l0[i,np.newaxis];
        l2_delta=np.dot (reduce(l2), l2_err)
        l1_err=np.dot (syn1, l2_delta)
        l1_delta=np.dot(reduce(l1), l1_err)
        # Accumulate gradient for this point for layer 1
        syn1_del += np.matmul(l2_delta, l1).T;
        # Accumulate gradient for this point for layer 0
        syn0_del += np.matmul(l1_delta, this_data).T;
        # The error for this datapoint. Mean sum of squares
        l2_err_sum += np.mean (l2_err ** 2);
    l2_err_sum /= l0.shape[0]; # Mean sum of squares
    syn0 += gamma * syn0_del;
    syn1 += gamma * syn1_del;
    print ("iter: ", justanumber, "error: ", l2_err_sum);
    final_err.append (l2_err_sum);
# Predicting
l1=sigmoid(np.matmul(l0,syn0))[:]# 1 x d * d x 4 = 1 x 4;
l2=sigmoid(np.matmul(l1,syn1))[:] # 1 x 4 * 4 x 1 = 1 x 1
print ("Predicted: \n", l2)
print ("Actual: \n", output)
plt.plot (np.array (final_err));
plt.show ();
The output I get is:
So the network was able to predict all the toy training examples. (Note that with real data you would not want to fit the training data as closely as possible, as that leads to overfitting.) You may get slightly different results, as the weight initialisations differ between runs. Also, as a rule of thumb, try initialising the weights in [-0.01, +0.01] when you have no problem-specific knowledge about a better initialisation.
Here is the convergence plot.
Note that you do not need to actually iterate over each example, instead you can do matrix multiplication at once, which is much faster. Also, the above code does not have bias units. Make sure you have bias units when you re-implement the code.
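The vectorised, full-batch form with bias units could look roughly like the sketch below. It is not the code from the question: the XOR-style data, the hidden size of 4, the learning rate, the seed and the add_bias helper are all assumptions chosen for illustration. The names sigmoid, syn0, syn1, l1, l2 and gamma follow the code above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(a):
    # Hypothetical helper: append a constant column of ones as the bias input.
    return np.hstack([a, np.ones((a.shape[0], 1))])

rng = np.random.default_rng(0)

# Assumed toy data (XOR); the data in the question is not reproduced here.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

n_hidden = 4
syn0 = rng.uniform(-1, 1, size=(X.shape[1] + 1, n_hidden))  # (2 + bias) x 4
syn1 = rng.uniform(-1, 1, size=(n_hidden + 1, 1))           # (4 + bias) x 1
gamma = 0.5
errors = []

for _ in range(5000):
    # Forward pass on the whole batch at once.
    a0 = add_bias(X)            # 4 x 3
    l1 = sigmoid(a0 @ syn0)     # 4 x 4
    a1 = add_bias(l1)           # 4 x 5
    l2 = sigmoid(a1 @ syn1)     # 4 x 1

    l2_err = y - l2
    errors.append(float(np.mean(l2_err ** 2)))

    # Backward pass: the sigmoid derivative is taken on the activations
    # (l2, l1), not on the error terms.
    l2_delta = l2_err * l2 * (1 - l2)
    l1_err = l2_delta @ syn1[:-1].T   # skip the bias row of syn1
    l1_delta = l1_err * l1 * (1 - l1)

    # Full-batch update: gradients of all examples are summed by the matmul.
    syn1 += gamma * (a1.T @ l2_delta)
    syn0 += gamma * (a0.T @ l1_delta)

print("first error:", errors[0], "last error:", errors[-1])
print("predictions:\n", l2)
```

The inner loop over examples disappears because each matrix product sums the per-example contributions; whether the network fits the data exactly still depends on the initialisation and on the number of iterations.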
I would recommend going through Raul Rojas' Neural Networks: A Systematic Introduction, Chapters 4, 6 and 7. Chapter 7 explains how to implement deeper networks in a simple way.


GPflow change point kernel issue with multiple dimensions

I'm following the tutorial here for implementing a change point kernel in GPflow.
However, I have 3 inputs and 1 output, and I would like the changepoint kernel to act on the first input dimension only, with other standard kernels on the other two input dimensions. I'm getting the following error:
InvalidArgumentError: Incompatible shapes: [2000,3,1] vs. [3,2000,1] [Op:Mul] name: mul/
Below is a minimum working example. Could anyone please let me know where I'm going wrong?
gpflow version 2.0.0.rc1
import pandas as pd
import gpflow
from gpflow.utilities import print_summary
df_all = pd.read_csv(
# Training dataset in numpy format
X = df_all[['X1', 'X2', 'X3']].to_numpy()
Y1 = df_all['Y'].to_numpy().reshape(-1, 1)
# Changepoint kernel only on first dimension and standard kernels for the other two dimensions
base_k1 = gpflow.kernels.Matern32(lengthscale=0.2, active_dims=[0])
base_k2 = gpflow.kernels.Matern32(lengthscale=2., active_dims=[0])
k1 = gpflow.kernels.ChangePoints(
[base_k1, base_k2], [.4], steepness=5)
k2 = gpflow.kernels.Matern52(lengthscale=[1., 1.], active_dims=[1, 2])
k_all = k1+k2
m1 = gpflow.models.GPR(data=(X, Y1), kernel=k_all, mean_function=None)
opt = gpflow.optimizers.Scipy()
def objective_closure():
    return -m1.log_marginal_likelihood()
opt_logs = opt.minimize(objective_closure, m1.trainable_variables,
The correct answer would be to move the active_dims=[0] from the base_k* kernels to the ChangePoints() kernel,
k1 = gpflow.kernels.ChangePoints([base_k1, base_k2], [0.4], steepness=5, active_dims=[0])
but this is currently not supported in GPflow 2, which is a bug. I've opened an issue on GitHub and will update this answer once it's fixed (if you feel up to having a go at fixing this bug, feel free to open a pull request; help is always welcome!).

pinv(H) is not equal to pinv(H'*H)*H'

I'm testing the y = SinC(x) function with a single hidden layer feedforward neural network (SLFN) with 20 neurons.
With an SLFN, the output weight (OW) of the output layer can be described by
OW = pinv(H)*T
After adding a regularization parameter gamma, this becomes
OW = pinv(I/gamma+H'*H)*H'*T
As gamma -> Inf, pinv(H'*H)*H'*T == pinv(H)*T, and also pinv(H'*H)*H' == pinv(H).
But when I calculate pinv(H'*H)*H' and pinv(H), I find a huge difference between the two when the number of neurons is over 5 (under 5, they are equal or almost the same).
For example, when H is a 10*10 matrix with cond(H) = 21137561386980.3 and rank(H) = 10,
H = [0.736251410036783 0.499731137079796 0.450233920602169 0.296610970576716 0.369359425954153 0.505556211442208 0.502934880027889 0.364904559142718 0.253349959726753 0.298697900877265;
0.724064281864009 0.521667364351399 0.435944895257239 0.337878535128756 0.364906002569385 0.496504064726699 0.492798607017131 0.390656915261343 0.289981152837390 0.307212326718916;
0.711534656474153 0.543520341487420 0.421761457948049 0.381771374416867 0.360475582262355 0.487454209236671 0.482668250979627 0.417033287703137 0.329570921359082 0.315860145366824;
0.698672860220896 0.565207057974387 0.407705930918082 0.427683127210120 0.356068794706095 0.478412571446765 0.472552121296395 0.443893207685379 0.371735862991355 0.324637323886021;
0.685491077062637 0.586647027111176 0.393799811411985 0.474875155650945 0.351686254239637 0.469385056318048 0.462458480695760 0.471085139463084 0.415948455902421 0.333539494486324;
0.672003357663056 0.607763454504209 0.380063647372632 0.522520267708374 0.347328559602877 0.460377531907542 0.452395518357816 0.498449772544129 0.461556360076788 0.342561958147251;
0.658225608290477 0.628484290731116 0.366516925684188 0.569759064961507 0.342996293691614 0.451395814182317 0.442371323528726 0.525823695636816 0.507817005881821 0.351699689941632;
0.644175558300583 0.648743139215935 0.353177974096445 0.615761051907079 0.338690023332811 0.442445652121229 0.432393859824045 0.553043275759248 0.553944175102542 0.360947346089454;
0.629872705346690 0.668479997764613 0.340063877672496 0.659781468051379 0.334410299080102 0.433532713184646 0.422470940392161 0.579948548513999 0.599160649563718 0.370299272759337;
0.615338237874436 0.687641820315375 0.327190410302607 0.701205860709835 0.330157655029498 0.424662569229062 0.412610204098877 0.606386924575225 0.642749594844498 0.379749516620049];
T=[-0.806458764562879 -0.251682808380338 -0.834815868451399 -0.750626822371170 0.877733363571576 1 -0.626938984683970 -0.767558933097629 -0.921811074815239 -1]';
There is a huge difference between pinv(H'*H)*H*T and pinv(H)*T, where
pinv(H'*H)*H*T = [-4803.39093243484 3567.08623820149 668.037919243849 5975.10699147077
1709.31211566970 -1328.53407325092 -1844.57938928594 -22511.9388736373
-2377.63048959478 31688.5125271114]';
pinv(H)*T = [-19780274164.6438 -3619388884.32672 -76363206688.3469 16455234.9229156
-135982025652.153 -93890161354.8417 283696409214.039 193801203.735488
-18829106.6110445 19064848675.0189]'.
I also find that if I round H with round(H,2), pinv(H'*H)*H*T and pinv(H)*T return the same answer. So I guess one of the reasons might be floating-point precision inside MATLAB.
But since cond(H) is large, any small change in H may result in a large difference in the inverse of H, so I think rounding may not be a good way to test this. As Cris Luengo mentioned, with a large condition number, numerical imprecision will affect the accuracy of the inverse.
In my test, I use 1000 training samples with inputs in [-10,10] and noise in [-0.2,0.2]; the test samples are noise free. 20 neurons are selected. OW = pinv(H)*T can give reasonable results for SinC training, while the performance of OW = pinv(H'*H)*T is worse. I then tried to increase the precision of H'*H with pinv(vpa(H'*H)), but there was no significant improvement.
Does anyone know how to solve this?
After some research, the answer is that ELM (extreme learning machine) is very sensitive to scaling and to the activation function.
Please refer to this paper for details: https://dl.acm.org/citation.cfm?id=2797143.2797161
And this paper: https://ieeexplore.ieee.org/document/8533625 demonstrates a novel algorithm to improve the performance of ELM with respect to scaling.
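The conditioning issue raised in the question can also be seen numerically. The NumPy sketch below is an illustration under assumptions, not a reproduction of the MATLAB experiment: H is a hypothetical 10x10 matrix built with singular values from 1 down to 1e-10, and T is random. Since the singular values of H'*H are the squares of those of H, its condition number is roughly cond(H) squared, which goes beyond double precision here, so a tolerance-based pseudo-inverse discards the smallest directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Hypothetical ill-conditioned H = U * diag(s) * V' with cond(H) of about 1e10.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -10, n)
H = U @ np.diag(s) @ V.T
T = rng.standard_normal((n, 1))

# Normal-equations matrix: its singular values are s**2 (down to 1e-20).
G = H.T @ H

cond_H = np.linalg.cond(H)
rank_H = np.linalg.matrix_rank(H)   # full rank in floating point
rank_G = np.linalg.matrix_rank(G)   # numerically rank deficient

# pinv drops singular values below its tolerance, so the two routes differ.
w_direct = np.linalg.pinv(H) @ T
w_normal = np.linalg.pinv(G) @ H.T @ T

print("cond(H):", cond_H)
print("rank(H):", rank_H, "rank(H'H):", rank_G)
print("norm via pinv(H):   ", np.linalg.norm(w_direct))
print("norm via pinv(H'H)H':", np.linalg.norm(w_normal))
```

In this sketch the normal-equations route behaves like an implicit truncation (a form of regularization), which is one plausible reason the two weight vectors in the question differ so much in magnitude. It does not contradict the scaling explanation in the answer above.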

Pytorch: NN function approximator, 2 in 1 out

[Please be aware of the Edit History below, as the major problem statement has changed.]
We are trying to implement a neural network in PyTorch that approximates a function f(x,y)=z. There are two real numbers as input and one as output, so we want 2 nodes in the input layer and one in the output layer. We constructed a test set of 5050 samples and had pretty good results for that task in Keras with the TensorFlow backend, using 3 hidden layers with a node configuration of 2(in) - 4 - 16 - 4 - 1(out), ReLU activation functions on all hidden layers, and linear activations on the input and output layers.
Now in PyTorch we tried to implement a similar network, but our loss literally explodes: it changes in the first few steps and then converges to some value around 10^7. In Keras we had an error of around 10 percent. We already tried different network configurations without any improvement. Could someone have a look at our code and suggest changes?
To explain: tr_data is a list containing 5050 2*1 numpy arrays, which are the inputs for the network. tr_labels is a list containing 5050 numbers, which are the outputs we want to learn. loadData() just loads those two lists.
import torch.nn as nn
import torch.nn.functional as F
DIM_IN = 2
class Net(nn.Module):
    def __init__(self):
        #super(Net, self).__init__()
        self.hidden1 = nn.Linear(DIM_IN, DIM_HIDDEN_1)
        self.hidden2 = nn.Linear(DIM_HIDDEN_1, DIM_HIDDEN_2)
        self.hidden3 = nn.Linear(DIM_HIDDEN_2, DIM_HIDDEN_3)
        self.out = nn.Linear(DIM_HIDDEN_3, DIM_OUT)
    def forward(self, x):
        x = F.relu(self.hidden1(x))
        x = F.tanh(self.hidden2(x))
        x = F.tanh(self.hidden3(x))
        x = self.out(x)
        return x
model = Net()
loss_fn = nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARN_RATE)
tr_data,tr_labels = loadData()
tr_data_torch = torch.zeros(BATCH_SIZE, DIM_IN)
tr_labels_torch = torch.zeros(BATCH_SIZE, DIM_OUT)
for i in range(BATCH_SIZE):
    tr_data_torch[i] = torch.from_numpy(tr_data[i])
    tr_labels_torch[i] = tr_labels[i]
for t in range(EPOCH_NUM):
    labels_pred = model(tr_data_torch)
    loss = loss_fn(labels_pred, tr_labels_torch)
    #print(t, loss.item())
I have to say, those are our first steps in Pytorch, so please forgive me if there are some obvious, dumb mistakes. I appreciate any help or hint,
Thank you!
EDIT 1 ------------------------------------------------------------------
Following the comments and answers, we improved our code. The loss now has reasonable values for the first time, around 250. Our new class definition looks like:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hidden1 = nn.Sequential(nn.Linear(DIM_IN, DIM_HIDDEN_1), nn.ReLU())
        self.hidden2 = nn.Sequential(nn.Linear(DIM_HIDDEN_1, DIM_HIDDEN_2), nn.ReLU())
        self.hidden3 = nn.Sequential(nn.Linear(DIM_HIDDEN_2, DIM_HIDDEN_3), nn.ReLU())
        self.out = nn.Linear(DIM_HIDDEN_3, DIM_OUT)
    def forward(self, x):
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.hidden3(x)
        x = self.out(x)
        return x
and the loss function:
loss_fn = nn.MSELoss(size_average=True, reduce=True)
As we stated before, we already had far more satisfying results in Keras with the TensorFlow backend. The loss was around 30 with a similar network configuration. I share the essential parts(!) of our Keras code here:
model = Sequential()
model.add(Dense(4, activation="linear", input_shape=(2,)))
model.add(Dense(16, activation="relu"))
model.add(Dense(4, activation="relu"))
model.add(Dense(1, activation="linear" ))
model.compile ( loss="mean_squared_error", optimizer="adam", metrics=["mse"] )
history=model.fit ( np.array(tr_data), np.array(tr_labels), \
validation_data = ( np.array(val_data), np.array(val_labels) ),
batch_size=50, epochs=200, callbacks = [ cbk ] )
Thank you already for all the help! If anybody still has suggestions to improve the network, we would be happy to hear them. As somebody already asked for the data, we want to share a pickle file here:
together with the code to access it:
import pickle
tr_data=pickle.load ( f )
tr_labels=pickle.load ( f )
val_data=pickle.load ( f )
val_labels=pickle.load ( f )
It may be worth looking at the differences between torch.nn and torch.nn.functional (see here). Essentially, your backpropagation graph might not be executed 100% correctly due to a different specification.
As pointed out by previous commenters, I would suggest defining your layers including the activations. My personal favorite way is to use nn.Sequential(), which allows you to chain multiple operations together, like so:
self.hidden1 = nn.Sequential(nn.Linear(DIM_IN, DIM_HIDDEN_1), nn.ReLU())
and then simply call self.hidden1 later (without wrapping it in F.relu()).
May I also ask why you do not call the commented-out super(Net, self).__init__() (which is the generally recommended way)?
Additionally, if that does not fix the problem, could you share the Keras code for comparison?
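One more point that may matter when comparing the loss values quoted in the question: with size_average=False the squared errors are summed over the batch, while with size_average=True (and in Keras' mean_squared_error) they are averaged. In current PyTorch these flags are expressed through the reduction argument ('sum' or 'mean'). The plain NumPy sketch below only illustrates the arithmetic; the data and the constant error of 0.5 are hypothetical.

```python
import numpy as np

n_samples = 5050  # number of training samples mentioned in the question
rng = np.random.default_rng(0)

targets = rng.normal(size=(n_samples, 1))
preds = targets + 0.5  # hypothetical: every prediction is off by 0.5

squared_errors = (preds - targets) ** 2

loss_sum = float(squared_errors.sum())    # what a summed MSE reports
loss_mean = float(squared_errors.mean())  # what an averaged MSE reports

print("summed loss:", loss_sum)
print("mean loss:  ", loss_mean)
```

The summed value is the mean multiplied by the number of samples in the batch, so a summed loss and an averaged loss are not directly comparable.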

Why does suppressing weights improve Tensorflow neural net performance?

I have a 2-layer non-convolutional network in Tensorflow, using tanh as the activation function. I understand that weights should be initialized with a truncated normal distribution divided by sqrt(nInputs) e.g.:
weightsLayer1 = tf.Variable(tf.div(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]), math.sqrt(nInputUnits)))
Being a bit of a bumbling newbie in NN and Tensorflow, I mistakenly implemented this as 2 lines, only to make it more readable:
weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]))
weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits))
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step. However, to my surprise, the "incorrect" implementation consistently yields better performance, on both the training and the test/evaluation datasets. I thought that the incorrect 2-line implementation should be a train wreck, since it is recomputing (suppressing) weights to values other than those chosen by the optimizer, which I would expect to wreak havoc in the optimization process, but it actually improves it. Does anyone have any explanation for this? I am using the Tensorflow Adam optimizer.
Update 2016.6.22 - updated the 2nd code block above.
You are right that weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits)) is executed at each step. But that does NOT mean that the values in the weight variable are scaled down by sqrt(nInputUnits) in each step. This line is not an in-place operation that affects the values stored in the variable. It computes a new tensor, holding the values in the variable divided by sqrt(nInputUnits), and that tensor, I assume, then goes into the rest of your computation graph. This does not interfere with the optimizer. You are still defining a valid computation graph, just with a somewhat arbitrary scaling of the weights. The optimizer can still compute the gradients with respect to this variable (it will back-propagate through your division operation) and create the corresponding update operations.
In terms of the model that you are defining, the two versions are totally equivalent. For any set of values of weightsLayer1 in the original model (where you don't do the division), you can simply scale them up by sqrt(nInputUnits) and you will get the identical results with your second model. The two represent exactly the same model class, if you will.
Why does one work better than the other? Your guess is as good as mine. If you have done the same division for all your variables, you have effectively divided your learning rate by sqrt(nInputUnits). This smaller learning rate might have been beneficial to the problem at hand.
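The learning-rate remark can be made slightly more precise with a short derivation. This is a sketch under assumptions that are not in the original answer: V denotes the stored variable, W = V/c the tensor actually used by the layer, c = sqrt(nInputUnits), L the loss, \eta the learning rate, and Adam's epsilon term is ignored.

```latex
\begin{aligned}
W &= \frac{V}{c}, \qquad c = \sqrt{n_{\text{in}}}, \\
\frac{\partial L}{\partial V} &= \frac{1}{c}\,\frac{\partial L}{\partial W}, \\
\text{plain gradient descent:}\quad
V \leftarrow V - \eta\,\frac{\partial L}{\partial V}
&\;\Longrightarrow\;
W \leftarrow W - \frac{\eta}{c^{2}}\,\frac{\partial L}{\partial W}, \\
\text{Adam (roughly):}\quad
\Delta V \approx -\eta\,\frac{\hat m}{\sqrt{\hat v}}
&\;\Longrightarrow\;
\Delta W = \frac{\Delta V}{c} \approx -\frac{\eta}{c}\,\frac{\hat m}{\sqrt{\hat v}}.
\end{aligned}
```

Under these assumptions, Adam's normalized step \hat m / \sqrt{\hat v} is unchanged when the gradient is rescaled by 1/c, so the step seen by W shrinks by about a factor of c, which matches the statement above. With plain gradient descent the factor would be c^2 instead.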
Edit: I think the fact that you give the same name to the variable and the newly created tensor causes confusion. When you do
A = tf.Variable(1.0)
A = tf.mul(A, 2.0)
# Do something with A
then the second line creates a new tensor (as discussed above) and you re-bind the name (and it is only a name) A to that new tensor. For the graph being defined, the naming is absolutely irrelevant. The following code defines the same graph:
A = tf.Variable(1.0)
B = tf.mul(A, 2.0)
# Do something with B
Maybe this becomes clear if you execute the following code:
A = tf.Variable(1.0)
print A
B = A
A = tf.mul(A, 2.0)
print A
print B
The output is
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
Tensor("Mul:0", shape=(), dtype=float32)
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
The first time you print A it tells you that A is a variable object. After executing A = tf.mul(A, 2.0) and printing A again, you can see that the name A is now bound to a tf.Tensor object. However, the variable still exists, as can be seen by looking at the object behind the name B.
This is what the single line of code does:
t = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
This creates a Tensor with shape [ nInputUnits, nUnitsHiddenLayer1 ], initialized from a truncated normal distribution with a standard deviation of 1.0 (1.0 is the default stddev value).
t1 = tf.div( t, math.sqrt( nInputUnits ) )
This divides all values in t by math.sqrt( nInputUnits ).
Your two lines of code do exactly the same thing. On the first line and the second line all values are divided by math.sqrt( nInputUnits ).
As for your statement:
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step.
EDIT: my mistake.
Indeed you are right, they are divided by math.sqrt( nInputUnits ) at every execution, but not reinitialized! The important point is where you put tf.Variable().
Here both lines are executed only once, at initialization:
weightsLayer1 = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
weightsLayer1 = tf.Variable( tf.div( weightsLayer1, math.sqrt( nInputUnits ) ) )
and here the second line is performed at every step:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] ) )
weightsLayer1 = tf.div( weightsLayer1, math.sqrt( nInputUnits ) )
Why does the second yield better results? It looks like some kind of normalization to me, but somebody more knowledgeable should verify that.
You can write it more readably like this:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ], stddev = 1. / math.sqrt( nInputUnits ) ) )
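To see why the 1/sqrt(nInputUnits) scaling is used at all, here is a small NumPy sketch. It is only an illustration under assumptions: it uses an ordinary (not truncated) normal distribution as a stand-in for tf.truncated_normal, and the layer sizes and unit-variance inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden, n_samples = 256, 64, 2000

# Hypothetical inputs with roughly unit variance per feature.
x = rng.standard_normal((n_samples, n_inputs))

# Weights with stddev 1.0 versus stddev 1/sqrt(n_inputs).
w_unit = rng.standard_normal((n_inputs, n_hidden))
w_scaled = rng.normal(0.0, 1.0 / np.sqrt(n_inputs), size=(n_inputs, n_hidden))

# Spread of the pre-activations feeding the tanh units.
std_unit = float((x @ w_unit).std())      # about sqrt(n_inputs) = 16
std_scaled = float((x @ w_scaled).std())  # about 1

print("pre-activation std, stddev=1:        ", std_unit)
print("pre-activation std, stddev=1/sqrt(n):", std_scaled)
```

With unit-stddev weights the pre-activations are so large that tanh would saturate for most inputs, whereas the scaled weights keep them in the range where tanh is still sensitive.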

Trying to balance my dataset through sample_weight in scikit-learn

I'm using RandomForest for classification, and I have an unbalanced dataset: 5830 no, 1006 yes. I am trying to balance my dataset with class_weight and sample_weight, but I can't.
My code is:
X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
But I don't get any improvement in my TPR, FPR and ROC metrics when using class_weight and sample_weight.
Why? Am I doing anything wrong?
However, if I use the function balanced_subsample, my metrics improve greatly:
def balanced_subsample(x,y,subsample_size):
class_xs = []
min_elems = None
for yi in np.unique(y):
elems = x[(y == yi)]
class_xs.append((yi, elems))
if min_elems == None or elems.shape[0] < min_elems:
min_elems = elems.shape[0]
use_elems = min_elems
if subsample_size < 1:
use_elems = int(min_elems*subsample_size)
xs = []
ys = []
for ci,this_xs in class_xs:
if len(this_xs) > use_elems:
x_ = this_xs[:use_elems]
y_ = np.empty(use_elems)
xs = np.concatenate(xs)
ys = np.concatenate(ys)
return xs,ys
My new code is:
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
This is not a full answer yet, but hopefully it'll help get there.
First some general remarks:
To debug this kind of issue it is often useful to have a deterministic behavior. You can pass the random_state attribute to RandomForestClassifier and various scikit-learn objects that have inherent randomness to get the same result on every run. You'll also need:
import numpy as np
import random
for your balanced_subsample function to behave the same way on every run.
Don't grid search on n_estimators: more trees is always better in a random forest.
Note that sample_weight and class_weight have a similar objective: actual sample weights will be sample_weight * weights inferred from class_weight.
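As a rough illustration of how the two kinds of weights combine, the NumPy sketch below computes class weights with the n_samples / (n_classes * bincount(y)) heuristic that scikit-learn documents for class_weight='balanced', and multiplies them with per-sample weights. The label vector is rebuilt from the class counts in the question, and the uniform sample_weight is a hypothetical placeholder.

```python
import numpy as np

# Labels rebuilt from the counts in the question: 5830 "no" (0), 1006 "yes" (1).
y = np.array([0] * 5830 + [1] * 1006)

counts = np.bincount(y)
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weight = len(y) / (n_classes * counts)

# Hypothetical per-sample weights (all ones here); the effective weight of
# each sample is the product of its sample weight and its class weight.
sample_weight = np.ones(len(y))
effective_weight = sample_weight * class_weight[y]

print("class weights:", class_weight)
print("total weight of class 0:", effective_weight[y == 0].sum())
print("total weight of class 1:", effective_weight[y == 1].sum())
```

With these weights both classes carry the same total weight, which is what "balancing" means here; adding a further manual factor of 8 on top would over-weight the minority class.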
Could you try:
Using subsample_size=1 in your balanced_subsample function. Unless there's a particular reason not to do so, we're better off comparing the results on a similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.
EDIT: Reading your comment again I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus makes more false positives (while also getting more of those right of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.
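The trade-off can be illustrated with a tiny NumPy example. Everything in it is hypothetical (the labels, the scores and the two thresholds), and lowering the decision threshold is used only as a simple stand-in for putting more weight on class 1.

```python
import numpy as np

# Hypothetical ground truth (8 negatives, 4 positives) and classifier scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60,
                   0.30, 0.45, 0.70, 0.90])

def tpr_fpr(threshold):
    """Return (TPR, FPR) when predicting class 1 for scores >= threshold."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(tpr), float(fpr)

strict = tpr_fpr(0.5)    # conservative about predicting class 1
lenient = tpr_fpr(0.25)  # eager to predict class 1

print("threshold 0.50 -> TPR, FPR:", strict)
print("threshold 0.25 -> TPR, FPR:", lenient)
```

In this toy data the eager setting finds every positive but also flags most of the negatives, so both rates rise together.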
There is an imbalanced-learn API that helps with oversampling/undersampling data, which might be useful in this situation. You can pass your training set into one of its methods and it will output the oversampled data for you. See the simple example below:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
x_oversampled, y_oversampled = ros.fit_resample(orig_x_data, orig_y_data)
Here is the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html
Hope this helps!