Trying to use DistributedDataParallel on a GAN but getting a runtime error about an in-place operation

I am trying to train a GAN on a machine with 3 GPUs using DistributedDataParallel.
Before wrapping my models in DDP everything works fine, but once I wrap them, I get the following RuntimeError:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 5; expected version 4 instead.
I cloned every tensor involved in the gradient computation to try to get rid of the in-place operation (if there is one), but I could not find it.
The part of the code with the problem is as follows:
Tensor = torch.cuda.FloatTensor
# ----------
# Training
# ----------
def train_gan(rank, world_size, opt):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    if rank == 0:
        get_dataloader(rank, opt)
    dist.barrier()
    print(f"Rank {rank}/{world_size} training process passed data download barrier.\n")
    dataloader = get_dataloader(rank, opt)
    # Loss function
    adversarial_loss = torch.nn.BCELoss()
    # Initialize generator and discriminator
    generator = Generator()
    discriminator = Discriminator()
    # Initialize weights
    generator.apply(weights_init_normal)
    discriminator.apply(weights_init_normal)
    generator.to(rank)
    discriminator.to(rank)
    generator_d = DDP(generator, device_ids=[rank])
    discriminator_d = DDP(discriminator, device_ids=[rank])
    # Optimizers
    # Since we are computing the average of several batches at once (an effective batch size of
    # world_size * batch_size) we scale the learning rate to match.
    optimizer_G = torch.optim.Adam(generator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    losses = []
    for epoch in range(opt.n_epochs):
        for i, (imgs, _) in enumerate(dataloader):
            # Adversarial ground truths
            valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False).to(rank)
            fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False).to(rank)
            # Configure input
            real_imgs = Variable(imgs.type(Tensor)).to(rank)
            # -----------------
            # Train Generator
            # -----------------
            optimizer_G.zero_grad()
            # Sample noise as generator input
            z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], opt.latent_dim)))).to(rank)
            # Generate a batch of images
            gen_imgs = generator_d(z)
            # Loss measures generator's ability to fool the discriminator
            g_loss = adversarial_loss(discriminator_d(gen_imgs), valid)
            g_loss.backward()
            optimizer_G.step()
            # ---------------------
            # Train Discriminator
            # ---------------------
            optimizer_D.zero_grad()
            # Measure discriminator's ability to classify real from generated samples
            real_loss = adversarial_loss(discriminator_d(real_imgs), valid)
            fake_loss = adversarial_loss(discriminator_d(gen_imgs.detach()), fake)
            d_loss = ((real_loss + fake_loss) / 2).to(rank)
            d_loss.backward()
            optimizer_D.step()

I encountered a similar error when trying to train a GAN with DistributedDataParallel.
I noticed the problem was coming from BatchNorm layers in my discriminator.
Indeed, DistributedDataParallel synchronizes the batchnorm buffers at each forward pass (see the doc), thereby modifying them in place, which causes problems if you have multiple forward passes in a row.
Converting my BatchNorm layers to SyncBatchNorm did the trick for me:
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator = DDP(discriminator, device_ids=[rank])
You probably want to do it anyway when using DistributedDataParallel.
Alternatively, if you don't want to use SyncBatchNorm, you can set the broadcast_buffers parameter to False, but I don't think you really want to do that, as it means your batch norm stats will not be synchronized among processes.
discriminator = DDP(discriminator, device_ids=[rank], broadcast_buffers=False)
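For completeness, a minimal sketch of how the first option slots into the train_gan function from the question (same Generator/Discriminator and setup as above):
discriminator = Discriminator()
discriminator.apply(weights_init_normal)
discriminator.to(rank)
# convert the BatchNorm layers to SyncBatchNorm before wrapping the model in DDP
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator_d = DDP(discriminator, device_ids=[rank])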

Related

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set that contains the following columns: outcome (this is the outcome that we want to predict), and raw (a column that consists of text). I want to develop an ML model that will predict the outcome from the raw column. I have trained an ML model in Databricks using the following pipeline:
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict the values for the new data set, I get an error message saying that the column "outcome" cannot be found. I suspect that this is because some stages in the pipeline use this column as an input (the indexer and inverter stages are used to convert the outcome column to numbers and then back to string labels).
My question is this: how can I load a saved model and use it to predict values when the original pipeline contains stages that use this column as an input?
Instead of using
model.write().overwrite().save("/FileStore/project")
you have to write it like this:
model.write().overwrite().save("/FileStore/project/model.sav")
and then, for loading, you use:
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I have found a solution to the problem and will post it here so that anyone who faces the same problem can benefit from it. The solution was simply to extract the stages that I want to use for prediction and assign them back to the loaded model, like so:
model = PipelineModel.load("/FileStore/project")
stages1 = []
stages1 += [model.stages[0]]
stages1 += [model.stages[2]]
stages1 += [model.stages[3]]
stages1 += [model.stages[4]]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second stage ([1]) because it is the indexer. Once I do this, I can predict values even when the "outcome" column is not available.
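A slightly more defensive variant of the same idea (just a sketch, assuming the fitted StringIndexer is the only StringIndexerModel stage in the pipeline) filters the loaded stages by type instead of by position:
from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexerModel

model = PipelineModel.load("/FileStore/project")
# drop the fitted indexer, the only stage that needs the "outcome" column
model.stages = [s for s in model.stages if not isinstance(s, StringIndexerModel)]
score_output_df = model.transform(score_this)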

Pytorch minibatching keeps model from training

I am trying to classify sequences by a binary feature. I have a dataset of sequence/label pairs and am using a simple one-layer LSTM to classify each sequence. Before I implemented minibatching, I was getting reasonable accuracy on a test set (80%), and the training loss would go from 0.6 to 0.3 (averaged).
I implemented minibatching, using parts of this tutorial: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
However, now my model won’t do better than 70-72% (70% of the data has one label) with batch size set to 1 and all other parameters exactly the same. Additionally, the loss starts out at 0.0106 and quickly gets really really small, with no significant change in results. I feel like the results between no batching and batching with size 1 should be the same, so I probably have a bug, but for the life of me I can’t find it. My code is below.
Training code (one epoch):
for i in t:
    model.zero_grad()
    # prep inputs
    last = i + self.params['batch_size']
    last = last if last < len(train_data) else len(train_data)
    batch_in, lengths, batch_targets = self.batch2TrainData(train_data[shuffled][i:last], word_to_ix, label_to_ix)
    iters += 1
    # forward pass.
    tag_scores = model(batch_in, lengths)
    # compute loss, then do backward pass, then update gradients
    loss = loss_function(tag_scores, batch_targets)
    loss.backward()
    # Clip gradients: gradients are modified in place
    nn.utils.clip_grad_norm_(model.parameters(), 50.0)
    optimizer.step()
Functions:
def prep_sequence(self, seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# transposes batch_in
def zeroPadding(self, l, fillvalue=0):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

# Returns padded input sequence tensor and lengths
def inputVar(self, batch_in, word_to_ix):
    idx_batch = [self.prep_sequence(seq, word_to_ix) for seq in batch_in]
    lengths = torch.tensor([len(idxs) for idxs in idx_batch])
    padList = self.zeroPadding(idx_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns all items for a given batch of pairs
def batch2TrainData(self, batch, word_to_ix, label_to_ix):
    # sort by dec length
    batch = batch[np.argsort([len(x['turn']) for x in batch])[::-1]]
    input_batch, output_batch = [], []
    for pair in batch:
        input_batch.append(pair['turn'])
        output_batch.append(pair['label'])
    inp, lengths = self.inputVar(input_batch, word_to_ix)
    output = self.prep_sequence(output_batch, label_to_ix)
    return inp, lengths, output
Model:
class LSTMClassifier(nn.Module):
    def __init__(self, params, vocab_size, tagset_size, weights_matrix=None):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = params['hidden_dim']
        if weights_matrix is not None:
            self.word_embeddings = nn.Embedding.from_pretrained(weights_matrix)
        else:
            self.word_embeddings = nn.Embedding(vocab_size, params['embedding_dim'])
        self.lstm = nn.LSTM(params['embedding_dim'], self.hidden_dim, bidirectional=False)
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(self.hidden_dim, tagset_size)

    def forward(self, batch_in, lengths):
        embeds = self.word_embeddings(batch_in)
        packed = nn.utils.rnn.pack_padded_sequence(embeds, lengths)
        lstm_out, _ = self.lstm(packed)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
        tag_space = self.hidden2tag(outputs)
        tag_scores = F.log_softmax(tag_space, dim=0)
        return tag_scores[-1]
For anyone else with a similar issue, I got it to work. I removed the log_softmax calculation, so this:
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
becomes this:
tag_space = self.hidden2tag(outputs)
return tag_space[-1]
I also changed NLLLoss to CrossEntropyLoss (not shown above), and initialized CrossEntropyLoss with no parameters (i.e. no ignore_index).
I am not certain why these changes were necessary (the docs even say that NLLLoss should be run after a log_softmax layer), but they got my model working and brought my loss back to a reasonable range (~0.5).
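For what it's worth, nn.CrossEntropyLoss is documented as combining LogSoftmax and NLLLoss, with the softmax taken over the class dimension (dim=1), whereas the forward above applied log_softmax over dim=0; that difference may explain the behaviour. A quick standalone check of the equivalence (just a sketch, not the question's model):
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)            # (batch, num_classes)
targets = torch.tensor([0, 2, 1, 1])

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))        # True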

Inputs to Encoder-Decoder LSTMCell/RNN Network

I'm creating an LSTM encoder-decoder network using Keras, following the code provided here: https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction. The only change I made is to replace the GRUCell with an LSTMCell. Basically, both the encoder and the decoder consist of 2 layers of 35 LSTMCells. The layers are stacked on top of (and combined with) each other using an RNN layer.
The LSTMCell returns 2 states whereas the GRUCell returns 1 state. This is where I am encountering an error, as I do not know how to code for the 2 returned states of the LSTMCell.
I have created two models: first, an encoder-decoder model, and second, a prediction model. I am not encountering any problems in the encoder-decoder model, but I am encountering problems in the decoder of the prediction model.
The error I am getting is:
ValueError: Layer rnn_4 expects 9 inputs, but it received 3 input tensors. Input received: [<tf.Tensor 'input_4:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'input_11:0' shape=(?, 35) dtype=float32>, <tf.Tensor 'input_12:0' shape=(?, 35) dtype=float32>]
This error happens when this line below, in the prediction model, is run:
decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)
The section of code this fits into is:
encoder_predict_model = keras.models.Model(encoder_inputs,
                                           encoder_states)

decoder_states_inputs = []

# Read layers backwards to fit the format of initial_state
# For some reason, the states of the model are ordered backwards (state of the first layer at the end of the list)
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # One state for GRU, but two states for LSTMCell
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))

decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)

decoder_outputs = decoder_outputs_and_states[0]
decoder_states = decoder_outputs_and_states[1:]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_predict_model = keras.models.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
Could somebody help me with the for loop above, and with the initial states I should be passing to the decoder after that?
I had a similar error and solved it by doing just what the comment says: appending another Input tensor:
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # Two states for the LSTMCell, so append two Input tensors per layer
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
That solved the problem for me.

How to implement an exponentially decaying learning rate in Keras by following the global step

Look at the following example
# encoding: utf-8
import numpy as np
import pandas as pd
import random
import math
from keras import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam, RMSprop
from keras.callbacks import LearningRateScheduler
X = [i*0.05 for i in range(100)]
def step_decay(epoch):
    initial_lrate = 1.0
    drop = 0.5
    epochs_drop = 2.0
    lrate = initial_lrate * math.pow(drop,
                                     math.floor((1 + epoch) / epochs_drop))
    return lrate

def build_model():
    model = Sequential()
    model.add(Dense(32, input_shape=(1,), activation='relu'))
    model.add(Dense(1, activation='linear'))
    adam = Adam(lr=0.5)
    model.compile(loss='mse', optimizer=adam)
    return model

model = build_model()
lrate = LearningRateScheduler(step_decay)
callback_list = [lrate]

for ep in range(20):
    X_train = np.array(random.sample(X, 10))
    y_train = np.sin(X_train)
    X_train = np.reshape(X_train, (-1, 1))
    y_train = np.reshape(y_train, (-1, 1))
    model.fit(X_train, y_train, batch_size=2, callbacks=callback_list,
              epochs=1, verbose=2)
In this example, the LearningRateScheduler does not change the learning rate at all, because in each iteration of ep, fit() is called with epochs=1, so the epoch index passed to the scheduler is always the same. Thus the learning rate is just constant (1.0, according to step_decay). In fact, instead of setting epochs > 1 directly, I have to use an outer loop as shown in the example, and inside each loop I just run 1 epoch. (This is the case when I implement deep reinforcement learning, instead of supervised learning.)
My question is how to set an exponentially decaying learning rate in my example, and how to get the learning rate in each iteration of ep.
You can actually pass two arguments to the LearningRateScheduler.
According to Keras documentation, the scheduler is
a function that takes an epoch index as input (integer, indexed from
0) and current learning rate and returns a new learning rate as output
(float).
So, basically, simply replace your initial_lrate with a function parameter, like so:
def step_decay(epoch, lr):
    # initial_lrate = 1.0  # no longer needed
    drop = 0.5
    epochs_drop = 2.0
    lrate = lr * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate
The actual function you implement is not exponential decay (as you mention in your title) but a staircase function.
Also, you mention that your learning rate does not change inside your loop. That's true because you set model.fit(..., epochs=1, ...) and epochs_drop = 2.0 at the same time. I am not sure whether this is your desired behaviour; you are providing a toy example, so it's hard to tell.
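As for reading the learning rate back in each iteration of ep, one option (a sketch using the Keras backend; the loop is the one from the question) is to query the optimizer after each fit() call:
from keras import backend as K

for ep in range(20):
    X_train = np.array(random.sample(X, 10))
    y_train = np.sin(X_train).reshape(-1, 1)
    X_train = X_train.reshape(-1, 1)
    model.fit(X_train, y_train, batch_size=2, callbacks=callback_list,
              epochs=1, verbose=2)
    print('lr after iteration %d: %f' % (ep, K.get_value(model.optimizer.lr)))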
I would like to add the more common case where you don't mix a for loop with fit() and just provide a different epochs parameter in your fit() function. In this case you have the following options:
First of all, Keras provides decay functionality itself in its predefined optimizers. For example, in your case with Adam() the actual code is:
lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
which is not exactly exponential either, and it is somewhat different from TensorFlow's. Also, it is only applied when decay > 0.0, obviously.
To follow the TensorFlow convention of exponential decay, you should implement:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
Depending on your needs, you could choose to implement a Callback subclass and define a function within it (see below), or use LearningRateScheduler, which is actually exactly this with some checking: a Callback subclass which updates the learning rate at the end of each epoch.
If you want finer handling of your learning rate policy (per batch, for example), you would have to implement your own subclass, since as far as I know there is no built-in callback for this task. The good part is that it's super easy:
Create a subclass:
class LearningRateExponentialDecay(Callback):
and add the __init__() function, which will initialize your instance with all the needed parameters and also create a global_step variable to keep track of the iterations (batches):
def __init__(self, init_learning_rate, decay_rate, decay_steps):
    self.init_learning_rate = init_learning_rate
    self.decay_rate = decay_rate
    self.decay_steps = decay_steps
    self.global_step = 0
Finally, add the actual function inside the class:
def on_batch_begin(self, batch, logs=None):
    # decay from the initial learning rate so the decay is not compounded on itself;
    # note that ** (not ^) is exponentiation in Python
    decayed_learning_rate = self.init_learning_rate * self.decay_rate ** (self.global_step / self.decay_steps)
    K.set_value(self.model.optimizer.lr, decayed_learning_rate)
    self.global_step += 1
The really cool part is that if you want the above subclass to update the learning rate every epoch instead, you can use on_epoch_begin(self, epoch, logs=None), which nicely has epoch as a parameter in its signature. This case is even easier, as you can skip the global step altogether (no need to keep track of it now, unless you want a fancier way to apply your decay) and use epoch in its place.
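Putting the pieces above together, a minimal end-to-end sketch of the per-batch version (the parameter values below are only illustrative):
from keras import backend as K
from keras.callbacks import Callback

class LearningRateExponentialDecay(Callback):
    def __init__(self, init_learning_rate, decay_rate, decay_steps):
        super(LearningRateExponentialDecay, self).__init__()
        self.init_learning_rate = init_learning_rate
        self.decay_rate = decay_rate
        self.decay_steps = decay_steps
        self.global_step = 0

    def on_batch_begin(self, batch, logs=None):
        # lr = lr0 * decay_rate ** (global_step / decay_steps), the TensorFlow-style schedule
        decayed = self.init_learning_rate * self.decay_rate ** (float(self.global_step) / self.decay_steps)
        K.set_value(self.model.optimizer.lr, decayed)
        self.global_step += 1

# usage with the model from the question (illustrative hyperparameters):
lr_decay = LearningRateExponentialDecay(init_learning_rate=0.5, decay_rate=0.96, decay_steps=100)
model.fit(X_train, y_train, batch_size=2, epochs=1, callbacks=[lr_decay], verbose=2)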

Keras deep autoencoder prediction is inaccurate

I am using a Keras deep autoencoder to reproduce my sparse matrix of [360, 6860] dimension. Each row is the count of trigrams for a protein sequence. The matrix has 2 classes of proteins, but I want the network to be ignorant of that initially, that is why I am using an autoencoder. I am following the keras blog autoencoder tutorial for this.
This is my code:
# this is the size of our encoded representations
encoding_dim = 32
input_img = Input(shape=(6860,))
encoded = Dense(128, activation='relu', activity_regularizer=regularizers.activity_l1(10e-5))(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(6860, activation='sigmoid')(decoded)
autoencoder = Model(input=input_img, output=decoded)
# this model maps an input to its encoded representation
encoder = Model(input=input_img, output=encoded)
# create a placeholder for an encoded (32-dimensional) input
encoded_input_1 = Input(shape=(32,))
encoded_input_2 = Input(shape=(64,))
encoded_input_3 = Input(shape=(128,))
# retrieve the last layer of the autoencoder model
decoder_layer_1 = autoencoder.layers[-3]
decoder_layer_2 = autoencoder.layers[-2]
decoder_layer_3 = autoencoder.layers[-1]
# create the decoder model
decoder_1 = Model(input = encoded_input_1, output = decoder_layer_1(encoded_input_1))
decoder_2 = Model(input = encoded_input_2, output = decoder_layer_2(encoded_input_2))
decoder_3 = Model(input = encoded_input_3, output = decoder_layer_3(encoded_input_3))
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                nb_epoch=100,
                batch_size=40,
                shuffle=True,
                validation_data=(x_test, x_test))
My validation set dimension is [80, 6860]. The problem is that if I use the decoder to predict from the test set, my predictions are really off. For example, if I predict with the following code:
# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder_1.predict(encoded_imgs)
decoded_imgs = decoder_2.predict(decoded_imgs)
decoded_imgs = decoder_3.predict(decoded_imgs)
print x_test[3, np.where(x_test[3, :] != 0)[0]]
print (decoded_imgs[3, np.where(x_test[3, :] != 0)[0]])
the non-zero values of a single row of my test set are:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
For the same row, the autoencoder's predictions at the same indices are:
[ 0.04615583 0.04613763 0.10268984 0.00286385 0.0030572 0.02551027
0.00552908 0.09686473 0.02554915 0.0082816 0.02254158 0.01127195
0.00305908 0.17113154 0.01140419 0.03370495 0.00515486 0.02614204
0.00558715 0.02835727 0.0029659 0.01425297 0.00834536 0.04502939
0.02260707 0.01131396 0.00561662 0.01131314 0.00493734 0.00265232
0.0056083 0.01724379 0.06099484 0.03738695 0.01128869 0.01995548
0.00562622 0.00556281 0.01732991 0.03142899 0.05339266 0.04778111
0.00292415 0.02264618 0.01419865 0.00550648 0.00836777 0.01139715]
Now, at first I thought maybe I could use some kind of thresholding to get the 1's from these values, but they seem pretty random. For a single row, for the first 50 zero-valued entries of my test set, my autoencoder predicts:
[ 0.14251608 0.00118295 0.00118732 0.00304095 0.031255 0.00108441
0.0201351 0.00853934 0.00558488 0.00281343 0.00296877 0.00109651
0.01129742 0.00827519 0.0170884 0.01417614 0.01714166 0.00549215
0.00099755 0.00558552 0.00829634 0.01988331 0.00092845 0.00294271
0.01429107 0.01137067 0.01137967 0.01121876 0.00491931 0.00562285
0.0055124 0.01720702 0.0142925 0.00553411 0.00551252 0.00281541
0.01145663 0.002876 0.00555185 0.00525392 0.01421779 0.00273949
0.01698892 0.02529835 0.0112521 0.01130333 0.00554186 0.00291986
0.00554437 0.01144382]
How can I improve the predictions? What am I doing wrong here? I must say that the data is hugely sparse. If you want you can download the toy data from here. Please, let me know if you have any questions.
One of the most important reasons is probably that your training data size is just too small. You have a fully connected network, so with 7 layers (including input and output) the number of parameters is huge, close to 1.8M, while you only have 360 training samples. So the parameters are basically undertrained.
You can improve things in two ways. One is of course to get more training data. The other is to follow the CNN example in the later part of the tutorial; CNNs have become popular precisely because they can greatly reduce the number of parameters.
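As a rough check of the parameter count mentioned above (a back-of-the-envelope sketch using the layer sizes from the question):
# Dense layer parameters = n_in * n_out weights + n_out biases
sizes = [6860, 128, 64, 32, 64, 128, 6860]
n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
print(n_params)  # 1783916, i.e. ~1.8M trainable parameters for only 360 training rows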