How to check via callbacks if alpha is decreasing? + How to load all cores during training? - callback

I'm training doc2vec, and using callbacks trying to see if alpha is decreasing over training time using this code:
class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''
    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0
        os.makedirs(self.path_prefix, exist_ok=True)
    def on_epoch_end(self, model):
        savepath = get_tmpfile(
            '{}_epoch{}.model'.format(self.path_prefix, self.epoch)
        )
        model.save(savepath)
        print(
            "Model alpha: {}".format(model.alpha),
            "Model min_alpha: {}".format(model.min_alpha),
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch"
        )
        self.epoch += 1
def train():
workers = multiprocessing.cpu_count()*4
model = Doc2Vec(
vec_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
min_count=10, dm=1, hs=1, negative=0, workers=workers,
"HS", model.hs, "Negative", model.negative, "Epochs",
model.epochs, "Workers: ", model.workers, "Model alpha:
While training, I see that alpha does not change over time: on each callback, alpha = 0.03.
Is it possible to check whether alpha is decreasing? Or is it really not decreasing at all during training?
One more question:
How can I benefit from all my cores while training doc2vec?
As we can see, no core is loaded beyond roughly 30%.

The model.alpha property only holds the initially configured starting alpha; it is not updated to the effective learning rate during training.
So, even if the value is being decreased properly (and I expect that it is), you wouldn't see it in the logging you've added.
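To see roughly what the effective learning rate should be at each epoch boundary, you can compute it yourself inside the callback. The sketch below assumes the linear decay from alpha to min_alpha that gensim documents; the helper name effective_alpha is hypothetical, and the real decay happens continuously within each epoch, so these are only approximate checkpoints.

```python
def effective_alpha(start_alpha, min_alpha, progress):
    """Linearly interpolate the learning rate.

    progress is the fraction of training completed, clamped to [0, 1].
    """
    progress = min(max(progress, 0.0), 1.0)
    return start_alpha - (start_alpha - min_alpha) * progress


# Values taken from the question's Doc2Vec configuration.
START_ALPHA = 0.03
MIN_ALPHA = 0.00025
EPOCHS = 20

# Approximate learning rate at the end of each epoch.
schedule = [
    effective_alpha(START_ALPHA, MIN_ALPHA, epoch / EPOCHS)
    for epoch in range(EPOCHS + 1)
]

for epoch, alpha in enumerate(schedule):
    print("after epoch {:2d}: alpha ~ {:.5f}".format(epoch, alpha))
```

Inside on_epoch_end you could print effective_alpha(model.alpha, model.min_alpha, self.epoch / model.epochs) instead of model.alpha.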
Separate observations about your code:
In gensim versions at least through 3.5.0, maximum training throughput is most often reached with some value for workers between 3 and the number of cores, but usually not the full number of cores (if that is higher than 12) and certainly nothing larger. So workers=multiprocessing.cpu_count()*4 is likely to be much slower than what you could achieve with a lower number.
If your corpus is large enough to support 600-dimensional vectors and to discard words with fewer than min_count=10 examples, negative sampling may work faster and get better results than the hs mode. (The pattern in published work seems to be to prefer negative sampling with larger corpora.)


Transferring arrays/classes/records between locales

In a typical N-Body simulation, at the end of each epoch, each locale would need to share its own portion of the world (i.e. all bodies) to the rest of the locales. I am working on this with a local-view approach (i.e. using on Loc statements). I encountered some strange behaviours that I couldn't make sense out of, so I decided to make a test program, in which things got more complicated. Here's the code to replicate the experiment.
proc log(args...?n) {
writeln("[locale = ",, "] [",, "] => ", args);
const max: int = 50000;
record stuff {
var x1: int;
var x2: int;
proc init() {
this.x1 =;
this.x2 =;
class ctuff {
var x1: int;
var x2: int;
proc init() {
this.x1 =;
this.x2 =;
class wrapper {
// The point is that total size (in bytes) of data in `r`, `c` and `a` are the same here, because the record and the class hold two ints per index.
var r: [{1..max / 2}] stuff;
var c: [{1..max / 2}] owned ctuff?;
var a: [{1..max}] int;
proc init() {
this.a =;
proc test() {
var wrappers: [LocaleSpace] owned wrapper?;
coforall loc in LocaleSpace {
on Locales[loc] {
wrappers[loc] = new owned wrapper();
// rest of the experiment further down.
Two interesting behaviours happen here.
1. Moving data
Now, each instance of wrapper in the array wrappers should live in its own locale. Specifically, the references (wrappers) will live in locale 0, but the internal data (r, c, a) should live in the respective locale. So we try to move some data from locale 1 to locale 3, as such:
on Locales[3] {
var timer: Timer;
var local_stuff = wrappers[1]!.r;
log("get r from 1", timer.elapsed());
on Locales[3] {
var timer: Timer;
var local_c = wrappers[1]!.c;
log("get c from 1", timer.elapsed());
on Locales[3] {
var timer: Timer;
var local_a = wrappers[1]!.a;
log("get a from 1", timer.elapsed());
Surprisingly, my timings show that
Regardless of the size (const max), the time to send the int array and the record array stays constant, which doesn't make sense to me. I even checked with chplvis, and the size of the GET actually increases, but the time stays the same.
The time to send the class field increases with the size, which makes sense, but it is quite slow and I don't know which case to trust here.
2. Querying the locales directly.
To demystify the problem, I also query the locale of some variables directly. First, we query the data, which we expect to live in locale 2, from locale 2:
on Locales[2] {
var wrappers_ref = wrappers[2]!; // This is always 1 GET from 0, okay.
And the result is:
[locale = 2] [2020-12-26T19:36:26.834472] => (array, 2, 2)
[locale = 2] [2020-12-26T19:36:26.894779] => (record, 2, 2, 2)
[locale = 2] [2020-12-26T19:36:27.023112] => (class, 2, 2, 2)
Which is expected. Yet, if we query the locale of the same data on locale 1, then we get:
[locale = 1] [2020-12-26T19:34:28.509624] => (array, 2, 2)
[locale = 1] [2020-12-26T19:34:28.574125] => (record, 2, 2, 1)
[locale = 1] [2020-12-26T19:34:28.700481] => (class, 2, 2, 2)
Implying that wrappers_ref.r[1] lives in locale 1, even though it should clearly be on locale 2. My only guess is that by the time the locale query is executed, the data (i.e. the .x of the record) has already been moved to the querying locale (1).
So all in all, the second part of the experiment led to a secondary question, whilst not answering the first part.
NOTE: all experiments were run with -nl 4 in the chapel/chapel-gasnet docker image.
Good observations, let me see if I can shed some light.
As an initial note, any timings taken with the gasnet Docker image should be taken with a grain of salt since that image simulates the execution across multiple nodes using your local system rather than running each locale on its own compute node as intended in Chapel. As a result, it is useful for developing distributed memory programs, but the performance characteristics are likely to be very different than running on an actual cluster or supercomputer. That said, it can still be useful for getting coarse timings (e.g., your "this is taking a much longer time" observation) or for counting communications using chplvis or the CommDiagnostics module.
With respect to your observations about timings, I also observe that the array-of-class case is much slower, and I believe I can explain some of the behaviors:
First, it's important to understand that any cross-node communications can be characterized using a formula like alpha + beta*length. Think of alpha as representing the basic cost of performing the communication, independent of length. This represents the cost of calling down through the software stack to get to the network, putting the data on the wire, receiving it on the other side, and getting it back up through the software stack to the application there. The precise value of alpha will depend on factors like the type of communication, choice of software stack, and physical hardware. Meanwhile, think of beta as representing the per-byte cost of the communication where, as you intuit, longer messages necessarily cost more because there's more data to put on the wire, or potentially to buffer or copy, depending on how the communication is implemented.
In my experience, the value of alpha typically dominates beta for most system configurations. That's not to say that it's free to do longer data transfers, but that the variance in execution time tends to be much smaller for longer vs. shorter transfers than it is for performing a single transfer versus many. As a result, when choosing between performing one transfer of n elements vs. n transfers of 1 element, you'll almost always want the former.
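To make the alpha + beta*length model concrete, here is a small arithmetic sketch, written in Python purely for illustration. The ALPHA and BETA values are made-up placeholders, not measurements of any network, and the 8-byte element size is an assumption.

```python
def transfer_cost(n_bytes, alpha, beta):
    """Cost of one communication: fixed overhead plus per-byte cost."""
    return alpha + beta * n_bytes


# Hypothetical constants, chosen only so that alpha dominates beta.
ALPHA = 1e-6   # seconds of fixed overhead per communication
BETA = 1e-9    # seconds per byte on the wire

N_ELEMENTS = 50_000
ELEMENT_BYTES = 8  # assumed size of one int

# One bulk transfer of all elements: a single alpha is paid.
one_bulk = transfer_cost(N_ELEMENTS * ELEMENT_BYTES, ALPHA, BETA)

# One transfer per element: alpha is paid N_ELEMENTS times.
many_small = N_ELEMENTS * transfer_cost(ELEMENT_BYTES, ALPHA, BETA)

print("one transfer of n elements :", one_bulk)
print("n transfers of one element :", many_small)
print("slowdown factor            :", many_small / one_bulk)
```

With these placeholder numbers the per-element strategy is more than a hundred times slower, which is the effect described above.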
To investigate your timings, I bracketed your timed code portions with calls to the CommDiagnostics module as follows:
...code to time here...
and found, as you did with chplvis, that the number of communications required to localize the array of records or array of ints was constant as I varied max, for example:
This is consistent with what I'd expect from the implementation: That for an array of value types, we perform a fixed number of communications to access array meta-data, and then communicate the array elements themselves in a single data transfer to amortize the overheads (avoid paying multiple alpha costs).
In contrast, I found that the number of communications for localizing the array of classes was proportional to the size of the array. For example, for the default value of 50,000 for max, I saw:
I believe the reason for this distinction relates to the fact that c is an array of owned classes, in which only a single class variable can "own" a given ctuff object at a time. As a result, when copying the elements of array c from one locale to another, you're not just copying raw data, as with the record and integer cases, but also performing an ownership transfer per element. This essentially requires setting the remote value to nil after copying its value to the local class variable. In our current implementation, this seems to be done using a remote get to copy the remote class value to the local one, followed by a remote put to set the remote value to nil, hence, we have a get and put per array element, resulting in O(n) communications rather than O(1) as in the previous cases. With additional effort, we could potentially have the compiler optimize this case, though I believe it will always be more expensive than the others due to the need to perform the ownership transfer.
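The difference in communication counts can be mimicked with a toy counter, again in Python rather than Chapel. This is only a model of the behaviour described above, not of the Chapel runtime: the class CountingRemoteArray and both copy functions are hypothetical, and the fixed metadata communications are ignored.

```python
class CountingRemoteArray:
    """Toy stand-in for an array living on another locale."""

    def __init__(self, values):
        self._values = list(values)
        self.gets = 0
        self.puts = 0

    def get(self, i):
        self.gets += 1
        return self._values[i]

    def put(self, i, value):
        self.puts += 1
        self._values[i] = value

    def bulk_get(self):
        # All elements travel in a single communication.
        self.gets += 1
        return list(self._values)


def copy_values(remote):
    """Value types (ints, records): one bulk transfer."""
    return remote.bulk_get()


def transfer_ownership(remote, n):
    """Owned classes: get each element, then set the remote slot to nil."""
    local = []
    for i in range(n):
        local.append(remote.get(i))
        remote.put(i, None)
    return local


n = 1000
value_array = CountingRemoteArray(range(n))
owned_array = CountingRemoteArray(range(n))

copy_values(value_array)
transfer_ownership(owned_array, n)

print("value copy    :", value_array.gets, "gets,", value_array.puts, "puts")
print("owned transfer:", owned_array.gets, "gets,", owned_array.puts, "puts")
```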
I tested the hypothesis that owned classes were resulting in the additional overhead by changing your ctuff objects from being owned to unmanaged, which removes any ownership semantics from the implementation. When I do this, I see a constant number of communications, as in the value cases:
I believe this represents the fact that once the language has no need to manage the ownership of the class variables, it can simply transfer their pointer values in a single transfer again.
Beyond these performance notes, it's important to understand a key semantic difference between classes and records when choosing which to use. A class object is allocated on the heap, and a class variable is essentially a reference or pointer to that object. Thus, when a class variable is copied from one locale to another, only the pointer is copied, and the original object remains where it was (for better or worse). In contrast, a record variable represents the object itself, and can be thought of as being allocated "in place" (e.g., on the stack for a local variable). When a record variable is copied from one locale to the other, it's the object itself (i.e., the record's fields' values) which are copied, resulting in a new copy of the object itself. See this SO question for further details.
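As a loose analogy only (Python has no records in Chapel's sense), the reference-versus-value distinction looks like this. Body is a hypothetical type; plain assignment plays the role of copying a class variable, and copy.copy plays the role of copying a record.

```python
import copy
from dataclasses import dataclass


@dataclass
class Body:
    x1: int
    x2: int


original = Body(1, 2)

# Class-like copy: only the reference is copied, the object stays put.
class_like = original

# Record-like copy: the field values are copied into a new object.
record_like = copy.copy(original)

# Mutating the original is visible through the reference, not the copy.
original.x1 = 99

print(class_like.x1)   # sees the change made through `original`
print(record_like.x1)  # still holds the value from copy time
```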
Moving on to your second observation, I believe that your interpretation is correct, and that this may be a bug in the implementation (I need to stew on it a bit more to be confident). Specifically, I think you're correct that what's happening is that wrappers_ref.r[1].x1 is being evaluated, with the result being stored in a local variable, and that the locale query is being applied to the local variable storing the result rather than to the original field. I tested this theory by taking a ref to the field and then printing the locale of that ref, as follows:
ref x1loc = wrappers_ref.r[1].x1;
and that seemed to give the right result. I also looked at the generated code which seemed to indicate that our theories were correct. I don't believe that the implementation should behave this way, but need to think about it a bit more before being confident. If you'd like to open a bug against this on Chapel's GitHub issues page, for further discussion there, we'd appreciate that.

How to deal with overfitting with simple (X,Y) data in MLPRegressor

Dealing with low amounts of data, and dealing with overfitting w/ Folding[GridSearchCV]
I am completely stumped as to how to get better estimations from my model. When I run my code, I get negative accuracies. How can I improve cross_val_score (or testing scores, or whatever you want to call it) so that I can predict values more reliably?
I tried adding more data (from 50 to 200+).
I tried random parameters (and realized this was a naive approach).
I also tried cleaning my data with StandardScaler on the features.
Anyone have any suggestions?
from sklearn.neural_network import MLPRegressor
from sklearn import preprocessing
import requests
import json
from calendar import monthrange
import numpy as np
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import scale
r =requests.get('')
y = json.loads(r.text)
#print(y["Monthly Adjusted Time Series"].keys())
keysInResultSet = y["Weekly Adjusted Time Series"].keys()
featuresListTemp = []
labelsListTemp = []
count = 0;
for i in keysInResultSet:
count = count + 1;
#print(y["Monthly Adjusted Time Series"][i])
tmpList = []
strValue = y["Weekly Adjusted Time Series"][i]["5. adjusted close"]
numValue = float(strValue)
print("TOTAL SET")
arrTestInput = []
arrTestOutput = []
print("SCALING SET")
X_train = np.array(featuresListTemp)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
product_model = MLPRegressor()
#10.0 ** -np.arange(1, 10)
#todo : once found general settings, iterate through some more seeds to find one that can be used on the training
parameters = {'learning_rate': ['constant','adaptive'],'solver': ['lbfgs','adam'], 'tol' : 10.0 ** -np.arange(1, 4), 'verbose' : [True], 'early_stopping': [True], 'activation' : ['tanh','logistic'], 'learning_rate_init': 10.0 ** -np.arange(1, 4), 'max_iter': [4000], 'alpha': 10.0 ** -np.arange(1, 4), 'hidden_layer_sizes':np.arange(1,11), 'random_state':np.arange(1, 3)}
clf = GridSearchCV(product_model, parameters, n_jobs=-1)
clf.fit(X_train_scaled, labelsListTemp)
print(clf.score(X_train_scaled, labelsListTemp))
best_params = clf.best_params_
newPM = MLPRegressor(hidden_layer_sizes=((best_params['hidden_layer_sizes'])), #try reducing the layer size / increasing it and playing around with resultFit variable
solver=best_params['solver'], #non scaled input
scores = cross_val_score(newPM, X_train_scaled, labelsListTemp, cv=10, scoring='neg_mean_absolute_error')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Output from line 63 and down
0.9142644531564619 {'activation': 'logistic', 'alpha': 0.001, 'early_stopping': True, 'hidden_layer_sizes': 7, 'learning_rate':
'constant', 'learning_rate_init': 0.1, 'max_iter': 4000,
'random_state': 2, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Accuracy: -21.91 (+/- 58.89) [ -32.87854574 -105.0632913
-22.89836453 -7.33154414 -22.38773819 -3.3786339 -1.7658796 -3.78002866 -4.78734308 -14.81212738]
{'activation': 'logistic', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 30, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 2, 'solver': 'lbfgs', 'tol': 0.1, 'verbose': True}
{'activation': 'tanh', 'alpha': 0.01, 'early_stopping': True, 'hidden_layer_sizes': 99, 'learning_rate': 'constant', 'learning_rate_init': 0.1, 'max_iter': 4000, 'random_state': 1, 'solver': 'lbfgs', 'tol': 0.01, 'verbose': True}
Both configurations stated above work for the sample set. Thanks all; please let me know if there are any questions.
This can be solved by scaling all your other parameters down to a more limited set, i.e. instead of 10.0 ** -np.arange(1, 3) use 10.0 ** -np.arange(1, 2). Once you have performed your first grid search, start fixing the parameters whose best value you already know. This is very hard to do, but one candidate is learning_rate='constant': I noticed that all my best fits ended up with a constant learning rate, regardless of the other parameters.
This is mostly a time optimization, but it also helps with overfitting as you increase the number of nodes in the network. The idea is that you want to improve the fit by some degree without losing too much of the generalization properties of the true function.
You should start your grid search by making sure that the number of hidden nodes is somewhere between the number of input nodes and the number of output nodes.
Once you find a decent fit, you can improve the fit by increasing the number of nodes. You must take care not to add so many nodes that you lose the generalization power of the true function. Before you even start thinking about scaling up, you must reduce the complexity of the parameter grid, so that your second grid search runs on an increased number of nodes with more general parameters.
In other words, the second grid search reuses the more general parameters found by the initial search while increasing the number of network nodes.
I know this is confusing but it's what helped me fit this decently.
For anyone struggling I would try to
0) generalize after performing a search and getting a decent model
1) use generalization on second search with increased nodes
2) play with alpha parameter while scaling up (the rest of the parameters you can generalize)
3) add a few different seeds or remove them depending on the situation
4) While changing tol will alter fit it is also highly dependent on the number of iterations. For that reason, depending on the case, a reasonable number might be .01 or .001 (reasonable depending on how many iterations you want to wait for a given result/ opportunity to converge) If the tol is set too low, you will run out of iterations as each epoch will never get a chance to stop early.
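To get a feel for the time saved by pruning the grid between the two searches, you can count the combinations before fitting anything. The sketch below uses the varying parameters from the question's grid; the second-pass grid (fixed learning_rate, wider hidden_layer_sizes of 10, 20 and 30) and the cv=5 fold count are assumptions made for illustration, and grid_size is a hypothetical helper.

```python
from math import prod


def grid_size(param_grid):
    """Number of parameter combinations a full grid search would try."""
    return prod(len(values) for values in param_grid.values())


# Varying parameters from the question's first grid search.
first_pass = {
    "learning_rate": ["constant", "adaptive"],
    "solver": ["lbfgs", "adam"],
    "tol": [1e-1, 1e-2, 1e-3],
    "activation": ["tanh", "logistic"],
    "learning_rate_init": [1e-1, 1e-2, 1e-3],
    "alpha": [1e-1, 1e-2, 1e-3],
    "hidden_layer_sizes": list(range(1, 11)),
    "random_state": [1, 2],
}

# Assumed second pass: fix learning_rate, search over larger networks.
second_pass = dict(
    first_pass,
    learning_rate=["constant"],
    hidden_layer_sizes=[10, 20, 30],
)

CV_FOLDS = 5  # assumption; each combination is fitted once per fold

for name, grid in (("first pass", first_pass), ("second pass", second_pass)):
    combos = grid_size(grid)
    print(name, "->", combos, "combinations,", combos * CV_FOLDS, "fits")
```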

3-layered Neural network doesn't learn properly

So, I'm trying to implement a neural network with 3 layers in Python; however, I am not the brightest person, so anything with more than 2 layers is kind of difficult for me. The problem with this one is that it gets stuck at .5 and does not learn, and I have no clue where it went wrong. Thank you to anyone with the patience to explain the error to me. (I hope the code makes sense.)
import numpy as np
def sigmoid(x):
return 1/(1+np.exp(-x))
def reduce(x):
return x*(1-x)
for justanumber in range(1000):
for i in range(len(l0)):
print l2
PS. I know that it might be a piece of trash as a script but that is why I asked for assistance
Your computations are not fully correct. For example, reduce is called on l1_err and l2_err, whereas it should be called on l1 and l2.
You are performing stochastic gradient descent. With so few parameters, it oscillates hugely; in this case, use full-batch gradient descent.
The bias units are not present, although technically you can still learn without bias.
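The first point can be checked numerically. reduce(a) = a*(1-a) is the derivative of the sigmoid expressed in terms of the sigmoid's output, so it has to be fed the activation, not the error. The sketch below is a standalone check using only the standard library; the input 0.7 and the error 0.3 are arbitrary illustrative values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reduce(activation):
    """Sigmoid derivative, written in terms of the sigmoid's output."""
    return activation * (1.0 - activation)


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to verify the formula."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 0.7                      # arbitrary pre-activation value
activation = sigmoid(x)
error = 0.3                  # arbitrary output error

analytic = reduce(activation)
numeric = numerical_derivative(sigmoid, x)

# Correct delta: error scaled by the slope at the activation.
correct_delta = error * reduce(activation)
# What the question's code effectively did: reduce applied to the error.
wrong_delta = reduce(error)

print("analytic slope:", analytic, "numeric slope:", numeric)
print("correct delta :", correct_delta, "wrong delta:", wrong_delta)
```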
I tried to rewrite your code with minimal changes. I have commented your lines to show the changes.
import matplotlib.pyplot as plt
import numpy as np
def sigmoid(x):
return 1/(1+np.exp(-x))
def reduce(x):
return x*(1-x)
l0=np.array ([np.array([1,1,0,0]),
output=np.array ([[0],[1],[1],[0],[1]]);
final_err = list ();
gamma = 0.05
maxiter = 100000
for justanumber in range(maxiter):
syn0_del = np.zeros_like (syn0);
syn1_del = np.zeros_like (syn1);
l2_err_sum = 0;
for i in range(len(l0)):
this_data = l0[i,np.newaxis];
#l2_delta=reduce(l2_err) (reduce(l2), l2_err) (syn1, l2_delta)
#l1_delta=reduce(l1_err), l1_err)
# Accumulate gradient for this point for layer 1
syn1_del += np.matmul(l2_delta, l1).T;
# Accumulate gradient for this point for layer 0
syn0_del += np.matmul(l1_delta, this_data).T;
# The error for this datapoint. Mean sum of squares
l2_err_sum += np.mean (l2_err ** 2);
l2_err_sum /= l0.shape[0]; # Mean sum of squares
syn0 += gamma * syn0_del;
syn1 += gamma * syn1_del;
print ("iter: ", justanumber, "error: ", l2_err_sum);
final_err.append (l2_err_sum);
# Predicting
l1=sigmoid(np.matmul(l0,syn0))[:]# 1 x d * d x 4 = 1 x 4;
l2=sigmoid(np.matmul(l1,syn1))[:] # 1 x 4 * 4 x 1 = 1 x 1
print ("Predicted: \n", l2)
print ("Actual: \n", output)
plt.plot (np.array (final_err)); plt.show ();
The output I get is:
Therefore the network was able to predict all the toy training examples. (Note that with real data you would not want to fit the training data this closely, as that leads to overfitting.) Note that you may get slightly different results, as the weight initialisations differ. Also, as a rule of thumb, try initialising the weights in [-0.01, +0.01] when you have no problem-specific knowledge that suggests a better initialisation.
Here is the convergence plot.
Note that you do not need to iterate over each example; instead, you can do the matrix multiplication for all examples at once, which is much faster. Also, the above code does not have bias units. Make sure you have bias units when you re-implement the code.
I would recommend going through Raul Rojas' Neural Networks: A Systematic Introduction, Chapters 4, 6 and 7. Chapter 7 will tell you how to implement deeper networks in a simple way.

Keras ImageDataGenerator Slow

I am looking for the best approach to train on larger-than-memory-data in Keras and currently noticing that the vanilla ImageDataGenerator tends to be slower than I would hope.
I have two networks training on the Kaggle cats vs dogs dataset (25000 images):
1) this approach is exactly the code from:
2) same as (1) but using an ImageDataGenerator instead of loading into memory the data
Note: for below, "preprocessing" means resizing, scaling, flattening
I find the following on my gtx970:
For network 1, it takes ~0s per epoch.
For network 2, it takes ~36s per epoch if the preprocessing is done in the data generator.
For network 2, it takes ~13s per epoch if preprocessing is done in a first-pass outside of the data generator.
Is this likely the speed limit for ImageDataGenerator (13s seems like the usual 10-100x difference between disk and ram...)? Are there approaches/mechanisms better suited for training on larger-than-memory-data when using Keras?
e.g. Perhaps there is way to get the ImageDataGenerator in Keras to save its processed images after the first epoch?
I assume you already might have solved this, but nevertheless...
Keras image preprocessing has the option of saving the results by setting the save_to_dir argument in the flow() or flow_from_directory() function:
As I understand it, the problem is that augmented images are used only once in a model's training cycle, not even across several epochs. So it's a huge waste of GPU cycles while the CPU is struggling.
I found the following solution:
I generate as many augmentations in RAM as I can
I use them for training across a frame of epochs, 10 to 30, whatever it takes to get noticeable convergence
after that I generate a new batch of augmented images (by implementing on_epoch_end) and the process goes on.
This approach keeps the GPU busy most of the time while still benefiting from data augmentation. I use a custom Sequence subclass to generate augmentations and fix class imbalance at the same time.
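Stripped of the image handling, the frame-of-epochs idea reduces to the following sketch. It is a plain-Python stand-in that only mimics the __len__, __getitem__ and on_epoch_end methods of a Keras Sequence; the class FramedAugmentationSequence, the jitter function and all parameter values are hypothetical.

```python
import random


class FramedAugmentationSequence:
    """Regenerates its augmented data once every frame_length epochs."""

    def __init__(self, samples, augment_fn, batch_size=2, frame_length=3, seed=0):
        self.samples = list(samples)
        self.augment_fn = augment_fn
        self.batch_size = batch_size
        self.frame_length = frame_length
        self.rng = random.Random(seed)
        self.epoch = 0
        self.generations = 0
        self._generate_data()

    def _generate_data(self):
        # The expensive CPU work happens here, not on every batch.
        self.data = [self.augment_fn(s, self.rng) for s in self.samples]
        self.generations += 1

    def __len__(self):
        # Number of batches per epoch (ceiling division).
        return -(-len(self.data) // self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        return self.data[start:start + self.batch_size]

    def on_epoch_end(self):
        self.epoch += 1
        if self.epoch % self.frame_length == 0:
            self._generate_data()


def jitter(value, rng):
    """Toy augmentation: add a little noise to a number."""
    return value + rng.uniform(-0.1, 0.1)


seq = FramedAugmentationSequence([1.0, 2.0, 3.0, 4.0, 5.0], jitter)
for _ in range(7):                 # simulate 7 training epochs
    for i in range(len(seq)):
        batch = seq[i]             # cheap: data is already in RAM
    seq.on_epoch_end()

print("epochs:", seq.epoch, "data generations:", seq.generations)
```

Within a frame every epoch reads the same in-memory data, so batch retrieval is just slicing; the augmentation cost is paid only at frame boundaries.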
EDIT: adding some code to clarify the idea
from pyutilz.string import read_config_file
from tqdm.notebook import tqdm
from gc import collect
import numpy as np
import tensorflow
import random
import cv2
class StoppingFromFile(tensorflow.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
if read_config_file('control.ini','ML','stop',globals()):
if stop is not None:
if stop==True or stop=='True':
logging.warning(f'Model should be stopped according to the control file')
self.model.stop_training = True
class AugmentedBalancedSequence(tensorflow.keras.utils.Sequence):
def __init__(self, images_and_classes:dict,input_size:tuple,class_sizes:list, augmentations_fn:object, preprocessing_fn:object, batch_size:int=10,
num_class_samples=100, frame_length:int=5, aug_p:float=0.1,aug_pipe_p:float=0.2,is_validation:bool=False,
From a dict of file paths grouped by class label, creates each N epochs augmented balanced training set.
If current class is too scarce, ensures that current frame has no duplicate final images.
If it's rich enough, ensures that current frame has no duplicate base images.
"""'Got {len(images_and_classes)} classes.')
self.batch_size = batch_size
self.epoch = 0
#print(f'got frame_length={self.frame_length}')
def __len__(self):
return int(np.ceil(len(self.images)/ float(self.batch_size)))
def __getitem__(self, idx):
a=idx * self.batch_size;b=a+self.batch_size
return self.images[a:b],self.labels[a:b]
def on_epoch_end(self):
import ast
self.epoch += 1
import pathlib
p = pathlib.Path(fname)
if p.is_file():
with open (fname) as f:
for var,val in mydict.items():
if hasattr(self,var):
converted = val #ast.literal_eval(val)
if converted is not None:
if getattr(self, var)!=converted:
setattr(self, var, converted)
print(f'{var} became {val}')
except Exception as e:
if self.epoch % self.frame_length == 0:
#print('generating data...')
def _add_sample(self,image,label):
from random import random
if self.disk_saving_prob>0:
if random()<self.disk_saving_prob:
if self.cur_example_file>self.disk_example_nfiles:
Path(r'example_images/').mkdir(parents=True, exist_ok=True)
if self.preprocessing_fn:
def _generate_data(self):'Generating new set of augmented data...')
#del self.images
#del self.labels
if self.num_class_samples:
if self.images is None:
self.indices=np.random.choice(expected_length, expected_length, replace=False)
#for each class
for label,images in tqdm(self.images_and_classes.items()):
if self.num_class_samples is None:
#Just all native samples without augmentations
for image in images:
#if there are enough native samples
if len(images)>=self.num_class_samples:
#randomly select samples of this class which will participate in this frame of epochs
indices=np.random.choice(len(images), self.num_class_samples, replace=False)
#apply albumentations pipeline to selected samples
for idx in indices:
if not self.is_validation:
# Randomly pick next image from existing. try applying augmentation pipeline (with maxed out probability) till we get num_class_samples DIFFERENT images
while len(hashes)<self.num_class_samples:
if self.is_validation and norig<len(images):
#just include all originals first
if next_hash not in hashes or (self.is_validation and norig<=len(images)):
#print(f'Adding orig {norig} out of {self.num_class_samples}, hashes={hashes}')
if next_hash in hashes:
#self.images=self.images[indices];self.labels=self.labels[indices]'Generated {self.img_sent} samples ({nartificial} artificial)')
once I have images and classes loaded,
train_datagen = AugmentedBalancedSequence(images_and_classes=images_and_classes_train,
augmentations_fn=get_albumentations_pipeline,aug_p=AUG_P,aug_pipe_p=AUG_PIPE_P,preprocessing_fn=preprocess_input, batch_size=BATCH_SIZE,frame_length=FRAME_LENGTH,disk_saving_prob=0.05)
val_datagen = AugmentedBalancedSequence(images_and_classes=images_and_classes_val,
augmentations_fn=get_albumentations_pipeline,preprocessing_fn=preprocess_input, batch_size=BATCH_SIZE,frame_length=FRAME_LENGTH,is_validation=True)
and after the model is instantiated, I do,epochs=600,verbose=1,

Why does suppressing weights improve Tensorflow neural net performance?

I have a 2-layer non-convolutional network in Tensorflow, using tanh as the activation function. I understand that weights should be initialized with a truncated normal distribution divided by sqrt(nInputs) e.g.:
weightsLayer1 = tf.Variable(tf.div(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]), math.sqrt(nInputUnits)))
Being a bit of a bumbling newbie in NN and Tensorflow, I mistakenly implemented this as 2 lines only to make it more readable:
weightsLayer1 = tf.Variable(tf.truncated_normal([nInputUnits, nUnitsHiddenLayer1]))
weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits))
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step. However, to my surprise, the "incorrect" implementation consistently yields better performance, both in train and test/evaluation datasets. I thought that the incorrect 2-line implementation should be a train wreck, since it is recomputing (suppressing) weights to values other than those chosen by the optimizer, which I would expect to wreak havoc in the optimization process, but it actually improves it. Does anyone have any explanation for this? I am using the Tensorflow adam optimizer.
Update 2016.6.22 - updated the 2nd code block above.
You are right that weightsLayer1 = tf.div(weightsLayer1, math.sqrt(nInputUnits)) is executed at each step. But that does NOT mean that the values in the weight variable are scaled down by sqrt(nInputUnits) in each step. This line is not an in-place operation that affects the values stored in the variable. It computes a new tensor, holding the values in the variable divided by sqrt(nInputUnits), and that tensor, I assume, then goes into the rest of your computation graph. This does not interfere with the optimizer. You are still defining a valid computation graph, just with a somewhat arbitrary scaling of the weights. The optimizer can still compute the gradients with respect to this variable (it will back-propagate through your division operation) and create the corresponding update operations.
In terms of the model that you are defining, the two versions are totally equivalent. For any set of values of weightsLayer1 in the original model (where you don't do the division), you can simply scale them up by sqrt(nInputUnits) and you will get the identical results with your second model. The two represent exactly the same model class, if you will.
Why does one work better than the other? Your guess is as good as mine. If you have done the same division for all your variables, you have effectively divided your learning rate by sqrt(nInputUnits). This smaller learning rate might have been beneficial to the problem at hand.
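How large the effective slowdown is depends on the optimizer. The one-dimensional sketch below, using a made-up squared-error loss and hypothetical values throughout, applies the chain rule to w = v / c for plain gradient descent, where the step seen by the scaled weight shrinks by c squared. An optimizer such as Adam normalises the gradient magnitude, so there the shrink is closer to a factor of c, which is in line with the estimate above.

```python
import math


def loss_grad_w(w, x, y):
    """Derivative of the toy loss (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


# All values below are illustrative placeholders.
c = math.sqrt(16)        # plays the role of sqrt(nInputUnits)
x, y = 1.5, 2.0          # a single training example
lr = 0.1                 # learning rate
w0 = 0.5                 # effective weight used by the model

# Direct parameterisation: the optimizer updates w itself.
step_direct = lr * loss_grad_w(w0, x, y)

# Scaled parameterisation: the optimizer updates v, the model uses v / c.
v0 = w0 * c
grad_v = loss_grad_w(v0 / c, x, y) / c    # chain rule through the division
v1 = v0 - lr * grad_v
step_scaled = w0 - v1 / c                 # resulting change in the effective weight

print("step on w, direct:", step_direct)
print("step on w, scaled:", step_scaled)
print("ratio            :", step_direct / step_scaled)
```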
Edit: I think the fact that you give the same name to the variable and the newly created tensor causes confusion. When you do
A = tf.Variable(1.0)
A = tf.mul(A, 2.0)
# Do something with A
then the second line creates a new tensor (as discussed above) and you re-bind the name (and it is only a name) A to that new tensor. For the graph being defined, the naming is absolutely irrelevant. The following code defines the same graph:
A = tf.Variable(1.0)
B = tf.mul(A, 2.0)
# Do something with B
Maybe this becomes clear if you execute the following code:
A = tf.Variable(1.0)
print A
B = A
A = tf.mul(A, 2.0)
print A
print B
The output is
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
Tensor("Mul:0", shape=(), dtype=float32)
<tensorflow.python.ops.variables.Variable object at 0x7ff025c02bd0>
The first time you print A it tells you that A is a variable object. After executing A = tf.mul(A, 2.0) and printing A again, you can see that the name A is now bound to a tf.Tensor object. However, the variable still exists, as can be seen by looking at the object behind the name B.
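The same name-rebinding behaviour can be reproduced without TensorFlow. In the sketch below, Variable, Tensor and mul are toy stand-ins written for this illustration, not the TensorFlow classes; the point is only that assignment rebinds a name and leaves the original object untouched.

```python
class Variable:
    """Toy stand-in for a trainable variable holding a value."""

    def __init__(self, value):
        self.value = value


class Tensor:
    """Toy stand-in for a graph node computed from another object."""

    def __init__(self, source, factor):
        self.source = source
        self.factor = factor

    def evaluate(self):
        return self.source.value * self.factor


def mul(a, factor):
    # Builds a new node; it does not modify `a` in place.
    return Tensor(a, factor)


A = Variable(1.0)
B = A                 # second name for the same Variable object
A = mul(A, 2.0)       # rebinds the name A to a new Tensor

print(type(A).__name__)   # Tensor
print(type(B).__name__)   # Variable
print(B.value)            # 1.0, the stored value was never scaled
print(A.evaluate())       # 2.0, computed from the variable on demand
```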
This is what the single line of code does:
t = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
Creates a Tensor with shape [ nInputUnits, nUnitsHiddenLayer1 ], initialized from a truncated normal distribution with a standard deviation of 1.0 (the default stddev value).
t1 = tf.div( t, math.sqrt( nInputUnits ) )
Divides all values in t by math.sqrt( nInputUnits ).
Your two lines of code do exactly the same thing. On the first line and the second line all values are divided by math.sqrt( nInputUnits ).
As for your statement:
I now know that this is wrong and that the 2nd line causes the weights to be recomputed at each learning step.
EDIT my mistake
Indeed you are right, they are divided by math.sqrt( nInputUnits ) at every execution, but not reinitialized! The important point is where you put tf.Variable().
Here both lines are only initialized once:
weightsLayer1 = tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] )
weightsLayer1 = tf.Variable( tf.div( weightsLayer1, math.sqrt( nInputUnits ) ) )
and here the second line is performed at every step:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ] ) )
weightsLayer1 = tf.div( weightsLayer1, math.sqrt( nInputUnits ) )
Why does the second yield better results? It looks like some kind of normalization to me, but somebody more knowledgeable should verify that.
You can write it more readably like this:
weightsLayer1 = tf.Variable( tf.truncated_normal( [ nInputUnits, nUnitsHiddenLayer1 ], stddev = 1. / math.sqrt( nInputUnits ) ) )