How can I get all outputs of the last transformer encoder in a pretrained BERT model and not just the CLS token output? - neural-network

I'm using PyTorch and this is the model from the Hugging Face Transformers library:
from transformers import BertTokenizerFast, BertForSequenceClassification

bert = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=int(data['class'].nunique()),
                                                     output_attentions=False,
                                                     output_hidden_states=False)
and in the forward function I'm building, I'm calling:
x1, x2 = self.bert(sent_id, attention_mask=mask)
Now, as far as I know, x2 is the CLS output (which is the output of the first transformer encoder), but then again, I don't think I understand the output of the model.
What I want is the output of all 12 transformer encoder layers.
How can I do that in PyTorch?

Ideally, if you want to look into the outputs of all the layers, you should use BertModel and not BertForSequenceClassification, since BertForSequenceClassification builds on top of BertModel and merely adds a linear classification layer.
from transformers import BertModel
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
### Add your code to map the model to device, data to device, and obtain input_ids and mask
sequence_output, pooled_output = my_bert_model(ids, attention_mask=mask)
# sequence_output has shape (batch_size, sequence_length, 768)
sequence_output contains the outputs for all tokens from the last layer of the BERT model.
In order to obtain the outputs of all the transformer encoder layers, you can use the following:
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
sequence_output, pooled_output, all_layer_output = my_bert_model(ids, attention_mask=mask, output_hidden_states=True)
all_layer_output is a tuple containing the output of the embeddings layer plus the outputs of all 12 encoder layers. Each element in the tuple has shape (batch_size, sequence_length, 768).
Hence, to get the sequence of outputs at layer 5, use all_layer_output[5]; all_layer_output[0] contains the outputs of the embeddings layer.

This is detailed in the docs: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.
from transformers import BertModel, BertConfig

# "xxx" is a placeholder for your checkpoint name, e.g. "bert-base-uncased"
config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)

outputs = model(inputs)
print(len(outputs))        # 3
hidden_states = outputs[2]
print(len(hidden_states))  # 13: the embeddings plus 12 encoder layers
embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]
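Note that recent versions of transformers return a ModelOutput object instead of a plain tuple, so the hidden states are accessed by attribute. A minimal sketch of the same idea against the newer API (the example sentence is arbitrary):
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

batch = tokenizer(["an example sentence"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)

last_layer = outputs.last_hidden_state  # (batch_size, sequence_length, 768)
all_layers = outputs.hidden_states      # tuple of 13: embeddings + 12 encoder layers
stacked = torch.stack(all_layers[1:])   # (12, batch_size, sequence_length, 768)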

Related

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set that contains the following columns: outcome (this is the outcome that we want to predict), and raw (a column that consists of text). I want to develop an ML model that will predict the outcome from the raw column. I have trained an ML model in Databricks using the following pipeline:
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict the values for the new data set, I get an error message saying that the column "outcome" cannot be found. I suspect this is because some stages in the pipeline take this column as input (the indexer and inverter stages are used to convert the outcome column to numbers and then back to string labels).
My question is this: how can I load a saved model and use it to predict values when the original pipeline contains stages that take this column as input?
Instead of using
model.write().overwrite().save("/FileStore/project")
you have to write it like this:
model.write().overwrite().save("/FileStore/project/model.sav")
and then load it with:
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I have found a solution to the problem and will post it here so that anyone facing the same problem can benefit from it. The solution was simply to extract the stages that I want to use for prediction and assign them back to the model, like this:
model = PipelineModel.load("/FileStore/project")
stages1 = []
stages1 += [model.stages[0]]
stages1 += [model.stages[2]]
stages1 += [model.stages[3]]
stages1 += [model.stages[4]]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second stage (model.stages[1]) because it contains the indexer. Once I do this, I can predict values even when the "outcome" column is not available.
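A slightly more robust variant of the same idea, as a sketch: filter the fitted stages by type rather than by position, so the code keeps working if the order of the stages changes. This assumes the only stage that needs the "outcome" column is the fitted StringIndexer (a StringIndexerModel):
from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexerModel

model = PipelineModel.load("/FileStore/project")

# Drop the fitted StringIndexer, which is the only stage that requires
# the "outcome" column missing from the scoring data.
model.stages = [s for s in model.stages if not isinstance(s, StringIndexerModel)]

score_output_df = model.transform(score_this)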

Trying to use Distributed data parallel on GANs but getting runtime error about an inplace operation

I am trying to train a GAN on a machine with 3 GPUs using distributed data parallel.
Before wrapping my models in DDP everything works fine, but after I wrap them, I get the following runtime error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 5; expected version 4 instead.
I cloned every tensor related to the gradient to rule out an inplace operation (if there is one), but I could not find it.
The part of the code with the problem is as follows:
Tensor = torch.cuda.FloatTensor
# ----------
# Training
# ----------
def train_gan(rank, world_size, opt):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    if rank == 0:
        get_dataloader(rank, opt)
    dist.barrier()
    print(f"Rank {rank}/{world_size} training process passed data download barrier.\n")
    dataloader = get_dataloader(rank, opt)
    # Loss function
    adversarial_loss = torch.nn.BCELoss()
    # Initialize generator and discriminator
    generator = Generator()
    discriminator = Discriminator()
    # Initialize weights
    generator.apply(weights_init_normal)
    discriminator.apply(weights_init_normal)
    generator.to(rank)
    discriminator.to(rank)
    generator_d = DDP(generator, device_ids=[rank])
    discriminator_d = DDP(discriminator, device_ids=[rank])
    # Optimizers
    # Since we are computing the average of several batches at once (an effective batch size of
    # world_size * batch_size) we scale the learning rate to match.
    optimizer_G = torch.optim.Adam(generator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    losses = []
    for epoch in range(opt.n_epochs):
        for i, (imgs, _) in enumerate(dataloader):
            # Adversarial ground truths
            valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False).to(rank)
            fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False).to(rank)
            # Configure input
            real_imgs = Variable(imgs.type(Tensor)).to(rank)
            # -----------------
            #  Train Generator
            # -----------------
            optimizer_G.zero_grad()
            # Sample noise as generator input
            z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], opt.latent_dim)))).to(rank)
            # Generate a batch of images
            gen_imgs = generator_d(z)
            # Loss measures generator's ability to fool the discriminator
            g_loss = adversarial_loss(discriminator_d(gen_imgs), valid)
            g_loss.backward()
            optimizer_G.step()
            # ---------------------
            #  Train Discriminator
            # ---------------------
            optimizer_D.zero_grad()
            # Measure discriminator's ability to classify real from generated samples
            real_loss = adversarial_loss(discriminator_d(real_imgs), valid)
            fake_loss = adversarial_loss(discriminator_d(gen_imgs.detach()), fake)
            d_loss = ((real_loss + fake_loss) / 2).to(rank)
            d_loss.backward()
            optimizer_D.step()
I encountered a similar error when trying to train a GAN with DistributedDataParallel.
I noticed the problem was coming from the BatchNorm layers in my discriminator.
Indeed, DistributedDataParallel synchronizes the batch norm buffers (the running statistics) at each forward pass (see the doc), thereby modifying them in place, which causes problems if you run multiple forward passes in a row.
Converting my BatchNorm layers to SyncBatchNorm did the trick for me:
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator = DDP(discriminator, device_ids=[rank])
You probably want to do it anyway when using DistributedDataParallel.
Alternatively, if you don't want to use SyncBatchNorm, you can set the broadcast_buffers parameter to False; but I don't think you really want to do that, as it means your batch norm statistics will not be synchronized across processes.
discriminator = DDP(discriminator, device_ids=[rank], broadcast_buffers=False)
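Applied to the training function from the question, the conversion has to happen before the model is moved to its device and wrapped in DDP. A minimal sketch of just the discriminator setup (you can convert the generator the same way if it also contains BatchNorm layers):
discriminator = Discriminator()
discriminator.apply(weights_init_normal)

# Replace every BatchNorm layer with SyncBatchNorm before moving the
# model to its device and wrapping it in DDP.
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator.to(rank)
discriminator_d = DDP(discriminator, device_ids=[rank])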

Inputs to Encoder-Decoder LSTMCell/RNN Network

I'm creating an LSTM encoder-decoder network using Keras, following the code provided here: https://github.com/LukeTonin/keras-seq-2-seq-signal-prediction. The only change I made is to replace the GRUCell with an LSTMCell. Basically, both the encoder and the decoder consist of 2 layers of 35 LSTMCells. The layers are stacked on top of (and combined with) each other using an RNN layer.
The LSTMCell returns 2 states whereas the GRUCell returns 1 state. This is where I am encountering an error, as I do not know how to code for the 2 returned states of the LSTMCell.
I have created two models: first, an encoder-decoder model; second, a prediction model. I am not encountering any problems in the encoder-decoder model, but I am encountering problems in the decoder of the prediction model.
The error I am getting is:
ValueError: Layer rnn_4 expects 9 inputs, but it received 3 input tensors. Input received: [<tf.Tensor 'input_4:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'input_11:0' shape=(?, 35) dtype=float32>, <tf.Tensor 'input_12:0' shape=(?, 35) dtype=float32>]
This error happens when the line below, in the prediction model, is run:
decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)
The section of code this fits into is:
encoder_predict_model = keras.models.Model(encoder_inputs,
                                           encoder_states)

decoder_states_inputs = []

# Read layers backwards to fit the format of initial_state
# For some reason, the states of the model are ordered backwards (state of the first layer at the end of the list)
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # One state for GRU, but two states for LSTMCell
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))

decoder_outputs_and_states = decoder(
    decoder_inputs, initial_state=decoder_states_inputs)

decoder_outputs = decoder_outputs_and_states[0]
decoder_states = decoder_outputs_and_states[1:]

decoder_outputs = decoder_dense(decoder_outputs)

decoder_predict_model = keras.models.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
Could somebody help me with the for loop above, and with the initial states I should pass to the decoder after that?
I had a similar error and solved it just by doing what the comment says, appending a second Input tensor for each layer:
# If instead of a GRU you were using an LSTM Cell, you would have to append two Input tensors since the LSTM has 2 states.
for hidden_neurons in layers[::-1]:
    # Two states for the LSTMCell: hidden state and cell state
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
That solved the problem for me.
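For context: each LSTM layer carries two state tensors (the hidden state h and the cell state c), so the RNN built from LSTMCells expects twice as many state inputs as the GRU version. A quick sanity check of the fixed loop, assuming the two-layer setup from the question (layers = [35, 35] is my assumption):
import keras

layers = [35, 35]  # assumption: two layers of 35 LSTMCells, as in the question

decoder_states_inputs = []
for hidden_neurons in layers[::-1]:
    # one Input for the hidden state h and one for the cell state c
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))
    decoder_states_inputs.append(keras.layers.Input(shape=(hidden_neurons,)))

print(len(decoder_states_inputs))  # 4 state inputs: (h, c) for each of the 2 layers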

LSTM neural network with two sources of data

I have the following configuration: one LSTM network that receives text as n-grams of size 2. Below is a simple schematic:
After some tests, I noticed that for some classes I get a significant increase in accuracy when I use n-grams of size 3. Now I want to train a new LSTM network with both n-gram sizes at the same time, as in the following schematic:
How can I provide the data and build this model using Keras to perform this task?
I assume you already have a function to split words into n-grams, as you already have the 2-gram and 3-gram models working? Therefore I just construct a one-sample example with the word "cool" for a working example. I had to use an embedding for my example, as an LSTM layer with 26^3 = 17576 input nodes was a little too much for my computer to handle. I expect you did the same in your 3-gram code?
Below is a complete working example:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# c->2 o->14 o->14 l->11
np_2_gram_in = np.array([[26*2+14, 26*14+14, 26*14+11]])          # co, oo, ol
np_3_gram_in = np.array([[26**2*2+26*14+14, 26**2*14+26*14+11]])  # coo, ool
np_output = np.array([[1]])
output_shape=1
lstm_2_gram_embedding = 128
lstm_3_gram_embedding = 192
inputs_2_gram = Input(shape=(None,))
em_input_2_gram = Embedding(output_dim=lstm_2_gram_embedding, input_dim=26**2)(inputs_2_gram)
lstm_2_gram = LSTM(lstm_2_gram_embedding)(em_input_2_gram)
inputs_3_gram = Input(shape=(None,))
em_input_3_gram = Embedding(output_dim=lstm_3_gram_embedding, input_dim=26**3)(inputs_3_gram)
lstm_3_gram = LSTM(lstm_3_gram_embedding)(em_input_3_gram)
concat = concatenate([lstm_2_gram, lstm_3_gram])
output = Dense(output_shape,activation='sigmoid')(concat)
model = Model(inputs=[inputs_2_gram, inputs_3_gram], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit([np_2_gram_in, np_3_gram_in], [np_output], epochs=5)
model.predict([np_2_gram_in,np_3_gram_in])
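In case you don't already have the encoding step, here is a hypothetical helper (word_to_ngram_indices is my name, not from the question) that maps a lowercase word to the base-26 n-gram indices used in the hard-coded example above, assuming the scheme a=0 ... z=25:
import numpy as np

def word_to_ngram_indices(word, n):
    """Encode a lowercase word as a batch of base-26 n-gram indices (a=0 ... z=25)."""
    letters = [ord(c) - ord('a') for c in word]
    indices = []
    for i in range(len(letters) - n + 1):
        idx = 0
        for letter in letters[i:i + n]:
            idx = idx * 26 + letter  # base-26 positional encoding
        indices.append(idx)
    return np.array([indices])

print(word_to_ngram_indices("cool", 2))  # [[ 66 378 375]], same as np_2_gram_in
print(word_to_ngram_indices("cool", 3))  # [[1730 9839]], same as np_3_gram_in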

How to apply LSTM-autoencoder to variant-length time-series data?

I read about the LSTM autoencoder in this tutorial: https://blog.keras.io/building-autoencoders-in-keras.html, and paste the corresponding Keras implementation below:
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
In this implementation, they fix the input to be of shape (timesteps, input_dim), which means the length of the time-series data is fixed to timesteps. If I remember correctly, RNNs/LSTMs can handle time-series data of variable lengths, and I am wondering if it is possible to modify the code above somehow to accept data of any length?
Thanks!
You can use shape=(None, input_dim).
But the RepeatVector will need some hacking, taking the dimensions directly from the input tensor. (The code below works with the TensorFlow backend; I'm not sure about Theano.)
import keras.backend as K
from keras.layers import Lambda

def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])    # matrix of ones, shaped as (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)  # latent vars, shaped as (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)

decoded = Lambda(repeat)([inputs, encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)
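Putting the pieces together, a variable-length version of the tutorial's autoencoder could look like the sketch below; latent_dim and input_dim are placeholders for whatever values you were using:
from keras.layers import Input, LSTM, Lambda
from keras.models import Model
import keras.backend as K

latent_dim = 32  # assumption: pick your latent size
input_dim = 1    # assumption: feature dimension of each timestep

inputs = Input(shape=(None, input_dim))  # None = variable number of timesteps
encoded = LSTM(latent_dim)(inputs)

def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])    # (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)  # (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)

decoded = Lambda(repeat)([inputs, encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
sequence_autoencoder.compile(optimizer='adam', loss='mse')
Note that within a single batch all sequences still have to share the same length (a numpy batch is rectangular), so in practice you either pad each batch or group same-length sequences together and feed the batches one by one.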