LSTM neural network with two sources of data

I have the following configuration: one LSTM network that receives a text as n-grams of size 2. Below is a simple schematic:
After some tests, I noticed that for some classes I get a significant increase in accuracy when I use n-grams of size 3. Now I want to train a new LSTM network with both n-gram sizes at the same time, like in the following schematic:
How can I provide the data and build this model in Keras?

I assume you already have a function to split words into n-grams, since you already have the 2-gram and 3-gram models working? Therefore I just construct a one-sample example for the word "cool". I had to use an embedding layer for my example, as an LSTM layer with 26^3 = 17576 input nodes was a little too much for my computer to handle. I expect you did the same in your 3-gram code?
Below is a complete working example:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# c->2 o->14 o->14 l->11
np_2_gram_in = np.array([[26*2+14,26*14+14,26*14+11]])#co,oo,ol
np_3_gram_in = np.array([[26**2*2+26*14+14,26**2*14+26*14+11]])#coo,ool
np_output = np.array([[1]])
output_shape=1
lstm_2_gram_embedding = 128
lstm_3_gram_embedding = 192
inputs_2_gram = Input(shape=(None,))
em_input_2_gram = Embedding(output_dim=lstm_2_gram_embedding, input_dim=26**2)(inputs_2_gram)
lstm_2_gram = LSTM(lstm_2_gram_embedding)(em_input_2_gram)
inputs_3_gram = Input(shape=(None,))
em_input_3_gram = Embedding(output_dim=lstm_3_gram_embedding, input_dim=26**3)(inputs_3_gram)
lstm_3_gram = LSTM(lstm_3_gram_embedding)(em_input_3_gram)
concat = concatenate([lstm_2_gram, lstm_3_gram])
output = Dense(output_shape,activation='sigmoid')(concat)
model = Model(inputs=[inputs_2_gram, inputs_3_gram], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit([np_2_gram_in, np_3_gram_in], [np_output], epochs=5)
model.predict([np_2_gram_in,np_3_gram_in])
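In case it helps with feeding real data: below is a minimal sketch (my own, not from the question) of how the splitting and batching could look. word_to_ngram_ids is a hypothetical helper, and note that the 0 used for padding collides with the id of "aa"/"aaa", so a real pipeline would reserve an extra index or use masking:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def word_to_ngram_ids(word, n):
    # Map each n-gram of lowercase letters to a unique integer in [0, 26**n).
    ids = []
    for i in range(len(word) - n + 1):
        code = 0
        for ch in word[i:i + n]:
            code = code * 26 + (ord(ch) - ord('a'))
        ids.append(code)
    return ids

words = ["cool", "keras"]
# pad_sequences pads shorter id lists with 0 so each input array is rectangular
batch_2_gram = pad_sequences([word_to_ngram_ids(w, 2) for w in words])
batch_3_gram = pad_sequences([word_to_ngram_ids(w, 3) for w in words])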


How to know that the token ids in a gensim pre-trained word2vec will match the ids of a tokenizer's vocabulary

I am building a PyTorch BiLSTM that uses pre-trained gensim word2vec embeddings. I first used an nn.Embedding layer that was trained with the model from scratch, but I decided to use pre-trained word2vec embeddings to improve accuracy.
My model follows a simple BiLSTM architecture, where the first layer is the embedding layer, followed by the BiLSTM layer(s), and lastly two feed-forward layers.
import torch
import gensim
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

word2vec = gensim.models.Word2Vec.load('path_to_word2vec/wikipedia_cbow_100')
weights = torch.FloatTensor(word2vec.wv.vectors)

class BiLSTM_model(torch.nn.Module):
    def __init__(self, max_features, embedding_dim, hidden_dim, num_layers, lstm_dropout):
        # max_features is the vocabulary size (number of tokens/words).
        super().__init__()
        # self.embeddings = nn.Embedding(max_features, embedding_dim, padding_idx=0)
        self.embeddings = nn.Embedding.from_pretrained(weights)
        self.lstm = nn.LSTM(word2vec.wv.vector_size,
                            hidden_dim,
                            batch_first=True,
                            bidirectional=True,
                            num_layers=num_layers,
                            dropout=lstm_dropout)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(hidden_dim * 2, 64)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, config['num_classes'])

    def forward(self, input):
        embeddings_out = self.embeddings(input)
        lstm_out, (hidden, cell) = self.lstm(embeddings_out)
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(hidden)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        final_out = self.fc2(drop)
        return final_out
I use a Keras tokenizer to tokenize the text and obtain the token ids.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
## Tokenize the sentences
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
Finally, I use a standard training loop with an optimizer and a loss function. The code runs fine, but there are no performance gains from using the pre-trained embeddings.
I suspect that this has to do with the token ids not matching between the keras.preprocessing.text tokenizer and the gensim pre-trained embeddings. My question is: how do I confirm (or rule out) this inconsistency, and if it is the case, how do I handle the issue?
Note: I am using custom word2vec embeddings for the Arabic language. You can find the embeddings here.
After looking into jhso's comment, it seems that the solution to this problem is to use word2vec.wv.index2word, which returns the vocabulary (words) as a list sorted in an order that reflects each word's embedding index.
For example, the following code:
pretrained_embedding = gensim.models.Word2Vec.load('path/to/embedding')
word_vectors = pretrained_embedding.wv
for i in range(4):
    print(f"{i}: '{word_vectors.index2word[i]}'")
will print:
0: 'this'
1: 'is'
2: 'an'
3: 'example'
where the token 'this' has id 0, and so on.
You then use word2vec.wv.index2word as the input to the keras.preprocessing.text.Tokenizer object's .fit_on_texts() method, as follows:
vocabulary = pretrained_embedding.wv.index2word
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(vocabulary)
This should preserve the order of token ids between the gensim word2vec model and the Keras tokenizer (note that Keras reserves index 0 for padding, so its ids are shifted by one).
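To actually confirm (or rule out) a mismatch, a quick sanity check along these lines should work (a sketch assuming the older gensim API with index2word, as used above):
# Keras' Tokenizer starts its word_index at 1 (0 is reserved for padding),
# so after fitting on the ordered vocabulary we expect: keras id == word2vec id + 1
mismatches = 0
for w2v_id, word in enumerate(word_vectors.index2word[:1000]):
    if tokenizer.word_index.get(word) != w2v_id + 1:
        mismatches += 1
print(f'{mismatches} mismatches in the first 1000 words')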

How can I get all outputs of the last transformer encoder in a pretrained BERT model, and not just the CLS token output?

I'm using PyTorch and this is the model from the Hugging Face transformers link:
from transformers import BertTokenizerFast, BertForSequenceClassification
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=int(data['class'].nunique()),
                                                     output_attentions=False,
                                                     output_hidden_states=False)
and in the forward function I'm building, I'm calling x1, x2 = self.bert(sent_id, attention_mask=mask)
Now, as far as I know, x2 is the CLS output (which is the output of the first transformer encoder), but then again, I don't think I understand the output of the model.
But I want the outputs of all 12 transformer encoder layers.
How can I do that in PyTorch?
Ideally, if you want to look into the outputs of all the layers, you should use BertModel and not BertForSequenceClassification, since BertForSequenceClassification builds on BertModel and just adds a linear layer on top of the BERT model.
from transformers import BertModel
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
### Add your code to map the model to device, data to device, and obtain input_ids and mask
sequence_output, pooled_output = my_bert_model(ids, attention_mask=mask)
# sequence_output has shape (batch_size, sequence_length, 768) and contains the
# outputs for all tokens from the last layer of the BERT model.
In order to obtain the outputs of all the transformer encoder layers, you can use the following:
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
sequence_output, pooled_output, all_layer_output = my_bert_model(ids, attention_mask=mask, output_hidden_states=True)
all_layer_output is a tuple containing the output of the embedding layer plus the outputs of all the encoder layers. Each element in the tuple has shape (batch_size, sequence_length, 768).
Hence, to get the sequence of outputs at layer 5, you can use all_layer_output[5]; all_layer_output[0] contains the outputs of the embedding layer.
This is detailed in the docs: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.
from transformers import BertModel, BertConfig
config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
outputs = model(inputs)
print(len(outputs)) # 3
hidden_states = outputs[2]
print(len(hidden_states)) # 13
embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]
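As a side note, in newer versions of transformers (v4+) the model returns a ModelOutput object rather than a plain tuple, so the same values would be accessed roughly like this (a sketch; ids and mask as above):
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
outputs = model(input_ids=ids, attention_mask=mask)
sequence_output = outputs.last_hidden_state  # (batch_size, sequence_length, 768)
pooled_output = outputs.pooler_output        # (batch_size, 768)
hidden_states = outputs.hidden_states        # tuple of 13: embeddings + 12 encoder layers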

Pytorch: NN function approximator, 2 in 1 out

[Please be aware of the edit history below, as the major problem statement has changed.]
We are trying to implement a neural network in PyTorch that approximates a function f(x,y) = z. So there are two real numbers as input and one as output; we therefore want 2 nodes in the input layer and one in the output layer. We constructed a set of 5050 samples and had pretty good results for that task in Keras with the TensorFlow backend, using 3 hidden layers in a 2(in)-4-16-4-1(out) configuration, ReLU activation functions on all hidden layers, and linear activations on the input and output.
Now in PyTorch we tried to implement a similar network, but our loss function literally explodes: it changes in the first few steps and then converges to some value around 10^7. In Keras we had an error of around 10 percent. We already tried different network configurations without any improvement. Maybe someone could have a look at our code and suggest a change?
To explain: tr_data is a list containing 5050 2×1 NumPy arrays, which are the inputs for the network. tr_labels is a list containing 5050 numbers, which are the outputs we want to learn. loadData() just loads those two lists.
import torch
import torch.nn as nn
import torch.nn.functional as F

BATCH_SIZE = 5050
DIM_IN = 2
DIM_HIDDEN_1 = 4
DIM_HIDDEN_2 = 16
DIM_HIDDEN_3 = 4
DIM_OUT = 1
LEARN_RATE = 1e-4
EPOCH_NUM = 500

class Net(nn.Module):
    def __init__(self):
        #super(Net, self).__init__()
        super().__init__()
        self.hidden1 = nn.Linear(DIM_IN, DIM_HIDDEN_1)
        self.hidden2 = nn.Linear(DIM_HIDDEN_1, DIM_HIDDEN_2)
        self.hidden3 = nn.Linear(DIM_HIDDEN_2, DIM_HIDDEN_3)
        self.out = nn.Linear(DIM_HIDDEN_3, DIM_OUT)

    def forward(self, x):
        x = F.relu(self.hidden1(x))
        x = F.tanh(self.hidden2(x))
        x = F.tanh(self.hidden3(x))
        x = self.out(x)
        return x

model = Net()
loss_fn = nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARN_RATE)

tr_data, tr_labels = loadData()
tr_data_torch = torch.zeros(BATCH_SIZE, DIM_IN)
tr_labels_torch = torch.zeros(BATCH_SIZE, DIM_OUT)
for i in range(BATCH_SIZE):
    tr_data_torch[i] = torch.from_numpy(tr_data[i])
    tr_labels_torch[i] = tr_labels[i]

for t in range(EPOCH_NUM):
    labels_pred = model(tr_data_torch)
    loss = loss_fn(labels_pred, tr_labels_torch)
    #print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
I have to say, these are our first steps in PyTorch, so please forgive me if there are some obvious, dumb mistakes. I appreciate any help or hints.
Thank you!
EDIT 1 ------------------------------------------------------------------
Following the comments and answers, we improved our code. The loss function now reaches reasonable values for the first time, around 250. Our new class definition looks like:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        #super().__init__()
        self.hidden1 = nn.Sequential(nn.Linear(DIM_IN, DIM_HIDDEN_1), nn.ReLU())
        self.hidden2 = nn.Sequential(nn.Linear(DIM_HIDDEN_1, DIM_HIDDEN_2), nn.ReLU())
        self.hidden3 = nn.Sequential(nn.Linear(DIM_HIDDEN_2, DIM_HIDDEN_3), nn.ReLU())
        self.out = nn.Linear(DIM_HIDDEN_3, DIM_OUT)

    def forward(self, x):
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.hidden3(x)
        x = self.out(x)
        return x
and the loss function:
loss_fn = nn.MSELoss(size_average=True, reduce=True)
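For reference, the size_average and reduce arguments have since been deprecated in PyTorch; on a current release the equivalent call would presumably be:
loss_fn = nn.MSELoss(reduction='mean')  # same behavior as size_average=True, reduce=True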
As we stated before, we already had far more satisfying results in Keras with the TensorFlow backend; there the loss was around 30, with a similar network configuration. I share the essential parts(!) of our Keras code here:
model = Sequential()
model.add(Dense(4, activation="linear", input_shape=(2,)))
model.add(Dense(16, activation="relu"))
model.add(Dense(4, activation="relu"))
model.add(Dense(1, activation="linear" ))
model.summary()
model.compile ( loss="mean_squared_error", optimizer="adam", metrics=["mse"] )
history=model.fit ( np.array(tr_data), np.array(tr_labels), \
validation_data = ( np.array(val_data), np.array(val_labels) ),
batch_size=50, epochs=200, callbacks = [ cbk ] )
Thank you already for all the help! If anybody still has suggestions to improve the network, we would be happy to hear them. As somebody already asked for the data, we want to share a pickle file here:
https://mega.nz/#!RDYxSYLY!P4a9mEDtZ7A5Bl7ZRjRk8EzLXQt2gyURa3wN3NCWFPA
together with the code to access it:
import pickle
f=open("data.pcl","rb")
tr_data=pickle.load ( f )
tr_labels=pickle.load ( f )
val_data=pickle.load ( f )
val_labels=pickle.load ( f )
f.close()
It might be worth looking into the differences between torch.nn and torch.nn.functional (see here). Essentially, your backpropagation graph might not be executed exactly as intended due to a different specification.
As pointed out by previous commenters, I would suggest defining your layers including the activations. My personal favorite way is to use nn.Sequential(), which allows you to chain multiple operations together, like so:
self.hidden1 = nn.Sequential(nn.Linear(DIM_IN, DIM_HIDDEN_1), nn.ReLU())
and then simply calling self.hidden1 later (without wrapping it in F.relu()).
May I also ask why you do not call the commented super(Net, self).__init__() (which is the generally recommended way)?
Additionally, if that does not fix the problem, could you share your Keras code for comparison?
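For completeness, here is a minimal sketch of the whole network written in that style, mirroring the 2-4-16-4-1 Keras layout from the question (my sketch, untested on your data):
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 4), nn.ReLU(),
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 4), nn.ReLU(),
    nn.Linear(4, 1),  # linear output, as in the Keras regression model
)
loss_fn = nn.MSELoss(reduction='mean')  # averaged, like Keras' mean_squared_error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)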

How to apply LSTM-autoencoder to variant-length time-series data?

I read about the LSTM autoencoder in this tutorial: https://blog.keras.io/building-autoencoders-in-keras.html, and paste the corresponding Keras implementation below:
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
In this implementation, they fixed the input to be of shape (timesteps, input_dim), which means the length of the time-series data is fixed to timesteps. If I remember correctly, an RNN/LSTM can handle time-series data of variable length, and I am wondering if it is possible to modify the code above somehow to accept data of any length?
Thanks!
You can use shape=(None, input_dim).
But RepeatVector will need some hacking, taking the dimensions directly from the input tensor. (The code works with the TensorFlow backend; not sure about Theano.)
import keras.backend as K
from keras.layers import Lambda

def repeat(x):
    stepMatrix = K.ones_like(x[0][:,:,:1]) #matrix of ones, shaped as (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1],axis=1) #latent vars, shaped as (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix,latentMatrix)

decoded = Lambda(repeat)([inputs,encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)
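Putting the pieces together, the complete variable-length autoencoder would look roughly like this (a sketch; input_dim and latent_dim are example values):
from keras.layers import Input, LSTM, Lambda
from keras.models import Model
import keras.backend as K

input_dim = 8    # example value
latent_dim = 16  # example value

inputs = Input(shape=(None, input_dim))  # None = any number of timesteps
encoded = LSTM(latent_dim)(inputs)

def repeat(x):
    # repeat the latent vector once per timestep, reading the step count off the input tensor
    stepMatrix = K.ones_like(x[0][:, :, :1])      # (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)    # (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)  # (batch, steps, latent_dim)

decoded = Lambda(repeat)([inputs, encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)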

Keras ImageDataGenerator Slow

I am looking for the best approach to train on larger-than-memory data in Keras, and I am currently noticing that the vanilla ImageDataGenerator tends to be slower than I would hope.
I have two networks training on the Kaggle cats vs. dogs dataset (25000 images):
1) this approach is exactly the code from: http://www.pyimagesearch.com/2016/09/26/a-simple-neural-network-with-python-and-keras/
2) same as (1) but using an ImageDataGenerator instead of loading into memory the data
Note: for below, "preprocessing" means resizing, scaling, flattening
I find the following on my gtx970:
For network 1, it takes ~0s per epoch.
For network 2, it takes ~36s per epoch if the preprocessing is done in the data generator.
For network 2, it takes ~13s per epoch if preprocessing is done in a first-pass outside of the data generator.
Is this likely the speed limit for ImageDataGenerator (13s seems like the usual 10-100x difference between disk and RAM...)? Are there approaches/mechanisms better suited for training on larger-than-memory data when using Keras?
e.g. Perhaps there is a way to get the ImageDataGenerator in Keras to save its processed images after the first epoch?
Thanks!
I assume you might already have solved this, but nevertheless...
Keras image preprocessing has the option of saving the results by setting the save_to_dir argument of the flow() or flow_from_directory() function:
https://keras.io/preprocessing/image/
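As a rough sketch of what that looks like (the directory paths here are made up):
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
# save_to_dir writes every batch the generator yields to disk,
# so the processed/augmented images can be inspected or reused later
flow = datagen.flow_from_directory(
    'data/train',                  # hypothetical input directory
    target_size=(150, 150),
    batch_size=32,
    save_to_dir='data/augmented',  # directory must already exist
    save_prefix='aug',
    save_format='jpeg')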
In my understanding, the problem is that augmented images are used only once in a training cycle of a model, not even across several epochs. So it's a huge waste of GPU cycles while the CPU is struggling. I found the following solution:
- I generate as many augmentations in RAM as I can;
- I use them for training across a frame of epochs, 10 to 30, whatever it takes to get noticeable convergence;
- after that I generate a new batch of augmented images (by implementing on_epoch_end) and the process goes on.
This approach keeps the GPU busy most of the time, while still benefiting from data augmentation. I use a custom Sequence subclass to generate the augmentations and fix class imbalance at the same time.
EDIT: adding some code to clarify the idea
from pyutilz.string import read_config_file
from tqdm.notebook import tqdm
from pathlib import Path
from gc import collect
import numpy as np
import tensorflow
import logging
import random
import json
import cv2

class StoppingFromFile(tensorflow.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # read_config_file is expected to inject 'stop' into globals()
        if read_config_file('control.ini','ML','stop',globals()):
            if stop is not None:
                if stop==True or stop=='True':
                    logging.warning(f'Model should be stopped according to the control file')
                    self.model.stop_training = True
class AugmentedBalancedSequence(tensorflow.keras.utils.Sequence):
    def __init__(self, images_and_classes:dict, input_size:tuple, class_sizes:list, augmentations_fn:object,
                 preprocessing_fn:object, batch_size:int=10, num_class_samples=100, frame_length:int=5,
                 aug_p:float=0.1, aug_pipe_p:float=0.2, is_validation:bool=False,
                 disk_saving_prob:float=.01, disk_example_nfiles:int=50):
        """
        From a dict of file paths grouped by class label, creates an augmented, balanced training set every N epochs.
        If the current class is too scarce, ensures that the current frame has no duplicate final images.
        If it is rich enough, ensures that the current frame has no duplicate base images.
        """
        logging.info(f'Got {len(images_and_classes)} classes.')
        self.disk_example_nfiles=disk_example_nfiles; self.disk_saving_prob=disk_saving_prob; self.cur_example_file=0
        self.images_and_classes=images_and_classes
        self.num_class_samples=num_class_samples
        self.augmentations_fn=augmentations_fn
        self.preprocessing_fn=preprocessing_fn
        self.is_validation=is_validation
        self.frame_length=frame_length
        self.batch_size = batch_size
        self.class_sizes=class_sizes
        self.input_size=input_size
        self.aug_pipe_p=aug_pipe_p
        self.aug_p=aug_p
        self.images=None
        self.epoch = 0
        #print(f'got frame_length={self.frame_length}')
        self._generate_data()

    def __len__(self):
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        a = idx * self.batch_size; b = a + self.batch_size
        return self.images[a:b], self.labels[a:b]
    def on_epoch_end(self):
        import ast
        self.epoch += 1
        mydict = {}

        import pathlib
        fname = 'control.json'
        p = pathlib.Path(fname)
        if p.is_file():
            try:
                with open(fname) as f:
                    mydict = json.load(f)
                for var, val in mydict.items():
                    if hasattr(self, var):
                        converted = val #ast.literal_eval(val)
                        if converted is not None:
                            if getattr(self, var) != converted:
                                setattr(self, var, converted)
                                print(f'{var} became {val}')
            except Exception as e:
                logging.error(str(e))

        if self.epoch % self.frame_length == 0:
            #print('generating data...')
            self._generate_data()
    def _add_sample(self, image, label):
        from random import random
        idx = self.indices[self.img_sent]

        if self.disk_saving_prob > 0:
            if random() < self.disk_saving_prob:
                self.cur_example_file += 1
                if self.cur_example_file > self.disk_example_nfiles:
                    self.cur_example_file = 1
                Path(r'example_images/').mkdir(parents=True, exist_ok=True)
                cv2.imwrite(f'example_images/test{self.cur_example_file}.jpg', cv2.cvtColor(image, cv2.COLOR_RGB2BGR))

        if self.preprocessing_fn:
            self.images[idx] = self.preprocessing_fn(image)
        else:
            self.images[idx] = image

        self.labels[idx] = label
        self.img_sent += 1
    def _generate_data(self):
        logging.info('Generating new set of augmented data...')
        collect()
        #del self.images
        #del self.labels
        #collect()

        if self.num_class_samples:
            expected_length = len(self.images_and_classes) * self.num_class_samples
        else:
            expected_length = sum(self.class_sizes.values())

        if self.images is None:
            self.images = np.empty((expected_length,) + (self.input_size[1],) + (self.input_size[0],) + (3,))
            self.labels = np.empty((expected_length), np.int32)

        self.indices = np.random.choice(expected_length, expected_length, replace=False)
        self.img_sent = 0
        collect()

        relaxed_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=self.aug_pipe_p)
        maxed_out_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=1.0)

        #for each class
        x, y = [], []
        nartificial = 0
        for label, images in tqdm(self.images_and_classes.items()):
            if self.num_class_samples is None:
                #just all native samples, without augmentations
                for image in images:
                    self._add_sample(image, label)
            else:
                #if there are enough native samples
                if len(images) >= self.num_class_samples:
                    #randomly select samples of this class which will participate in this frame of epochs
                    indices = np.random.choice(len(images), self.num_class_samples, replace=False)
                    #apply the albumentations pipeline to the selected samples
                    for idx in indices:
                        if not self.is_validation:
                            self._add_sample(relaxed_augmentation_pipeline(image=images[idx])['image'], label)
                        else:
                            self._add_sample(images[idx], label)
                else:
                    #randomly pick the next image from the existing ones; keep applying the augmentation
                    #pipeline (with maxed-out probability) until we get num_class_samples DIFFERENT images
                    hashes = set()
                    norig = 0
                    while len(hashes) < self.num_class_samples:
                        if self.is_validation and norig < len(images):
                            #just include all originals first
                            image = images[norig]
                        else:
                            image = maxed_out_augmentation_pipeline(image=random.choice(images))['image']
                        next_hash = np.sum(image)
                        if next_hash not in hashes or (self.is_validation and norig <= len(images)):
                            #print(f'Adding orig {norig} out of {self.num_class_samples}, hashes={hashes}')
                            self._add_sample(image, label)
                            if next_hash in hashes:
                                norig += 1
                                hashes.add(norig)
                            else:
                                hashes.add(next_hash)
                                nartificial += 1

        #self.images=self.images[indices];self.labels=self.labels[indices]
        logging.info(f'Generated {self.img_sent} samples ({nartificial} artificial)')
Once I have the images and classes loaded, I create the generators:
train_datagen = AugmentedBalancedSequence(images_and_classes=images_and_classes_train,
                                          input_size=INPUT_SIZE, class_sizes=class_sizes_train, num_class_samples=UPSCALE_SAMPLES,
                                          augmentations_fn=get_albumentations_pipeline, aug_p=AUG_P, aug_pipe_p=AUG_PIPE_P,
                                          preprocessing_fn=preprocess_input, batch_size=BATCH_SIZE,
                                          frame_length=FRAME_LENGTH, disk_saving_prob=0.05)
val_datagen = AugmentedBalancedSequence(images_and_classes=images_and_classes_val,
                                        input_size=INPUT_SIZE, class_sizes=class_sizes_val, num_class_samples=None,
                                        augmentations_fn=get_albumentations_pipeline, preprocessing_fn=preprocess_input,
                                        batch_size=BATCH_SIZE, frame_length=FRAME_LENGTH, is_validation=True)
and after the model is instantiated, I do:
model.fit(train_datagen, epochs=600, verbose=1,
          validation_data=(val_datagen.images, val_datagen.labels), validation_batch_size=BATCH_SIZE,
          callbacks=[checkpointer, StoppingFromFile()], validation_freq=1)