How to apply LSTM-autoencoder to variant-length time-series data? - neural-network

I read about the LSTM autoencoder in this tutorial: https://blog.keras.io/building-autoencoders-in-keras.html, and pasted the corresponding Keras implementation below:
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model
inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
In this implementation, they fixed the input to be of shape (timesteps, input_dim), which means the length of the time-series data is fixed to timesteps. If I remember correctly, an RNN/LSTM can handle time-series data of variable lengths, so I am wondering if it is possible to modify the code above somehow to accept data of any length?
Thanks!

You can use shape=(None, input_dim).
But the RepeatVector will need some hacking, taking the dimensions directly from the input tensor. (The code works with TensorFlow; not sure about Theano.)
import keras.backend as K
from keras.layers import Lambda

def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])      # matrix of ones, shaped (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)    # latent vars, shaped (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)  # shaped (batch, steps, latent_dim)

decoded = Lambda(repeat)([inputs, encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)
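Putting the two pieces together, a minimal sketch of the full variable-length autoencoder could look like this (it reuses latent_dim and input_dim from the question; note that each training batch must still be padded to a common length, although different batches may have different lengths):
from keras.layers import Input, LSTM, Lambda
from keras.models import Model
import keras.backend as K

inputs = Input(shape=(None, input_dim))           # None = variable number of timesteps
encoded = LSTM(latent_dim)(inputs)                # encode the whole sequence into one vector

def repeat(x):
    stepMatrix = K.ones_like(x[0][:, :, :1])      # (batch, steps, 1)
    latentMatrix = K.expand_dims(x[1], axis=1)    # (batch, 1, latent_dim)
    return K.batch_dot(stepMatrix, latentMatrix)  # (batch, steps, latent_dim)

decoded = Lambda(repeat)([inputs, encoded])
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)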

Related

How to know whether the token ids in a gensim pre-trained word2vec model will match the ids of a tokenizer's vocabulary

I am building a PyTorch BiLSTM that uses pre-trained gensim word2vec embeddings. I first used an nn.Embedding layer that was trained with the model from scratch, but I decided to use pre-trained word2vec embeddings to improve accuracy.
My model follows a simple BiLSTM architecture: the first layer is the embedding layer, followed by BiLSTM layer(s), and lastly two feed-forward layers.
import torch
import gensim
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

word2vec = gensim.models.Word2Vec.load('path_to_word2vec/wikipedia_cbow_100')
weights = torch.FloatTensor(word2vec.wv.vectors)

class BiLSTM_model(torch.nn.Module):
    def __init__(self, max_features, embedding_dim, hidden_dim, num_layers, lstm_dropout):
        # max_features is the vocabulary size (number of tokens/words).
        super().__init__()
        # self.embeddings = nn.Embedding(max_features, embedding_dim, padding_idx=0)
        self.embeddings = nn.Embedding.from_pretrained(weights)
        self.lstm = nn.LSTM(word2vec.wv.vector_size,
                            hidden_dim,
                            batch_first=True,
                            bidirectional=True,
                            num_layers=num_layers,
                            dropout=lstm_dropout)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(hidden_dim * 2, 64)
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(64, config['num_classes'])

    def forward(self, input):
        embeddings_out = self.embeddings(input)
        lstm_out, (hidden, cell) = self.lstm(embeddings_out)
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        rel = self.relu(hidden)
        dense1 = self.fc1(rel)
        drop = self.dropout(dense1)
        final_out = self.fc2(drop)
        return final_out
I use a Keras tokenizer to tokenize the text and obtain the token ids.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
## Tokenize the sentences
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)
Finally, I use a standard training loop with an optimizer and a loss function. The code runs fine, but there are no performance gains from using the pre-trained embeddings.
I suspect this has to do with the token ids not matching between the keras.preprocessing.text tokenizer and the gensim pre-trained embeddings. My question is: how do I confirm (or rule out) this inconsistency, and if it is the case, how do I handle the issue?
Note: I am using custom word2vec embeddings for the Arabic language. You can find the embeddings here.
After looking into jhso's comment, it seems that the solution to this problem is to use word2vec.wv.index2word, which returns the vocabulary (words) as a list ordered by each word's embedding index.
For example, the following code:
pretrained_embedding = gensim.models.Word2Vec.load('path/to/embedding')
word_vectors = pretrained_embedding.wv
for i in range(0, 4):
    print(f"{i}: '{word_vectors.index2word[i]}'")
will print:
0: 'this'
1: 'is'
2: 'an'
3: 'example'
where the token 'this' has id 0, and so on.
You then use word2vec.wv.index2word as input to the keras.preprocessing.text.Tokenizer object's .fit_on_texts() method, as follows:
vocabulary = pretrained_embedding.wv.index2word
tokenizer = Tokenizer(num_words=config['max_features'])
tokenizer.fit_on_texts(vocabulary)
This should preserve the token ids between the gensim word2vec model and the Keras tokenizer.
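To directly address the "how do I confirm this" part of the question, a minimal check could look like the sketch below (my own addition, assuming the gensim 3.x wv.vocab API that matches the index2word usage above):
# Compare the Keras tokenizer ids with the gensim indices for a few words.
# Note: Keras Tokenizer ids start at 1 while gensim indices start at 0, so even with
# matching order there is an off-by-one to account for in the embedding matrix.
for word, keras_id in list(tokenizer.word_index.items())[:10]:
    if word in word_vectors.vocab:
        print(f"{word}: keras_id={keras_id}, gensim_id={word_vectors.vocab[word].index}")
    else:
        print(f"{word}: not in the pre-trained vocabulary")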

How can I get all outputs of the last transformer encoder in a BERT pretrained model and not just the CLS token output?

I'm using pytorch and this is the model from huggingface transformers link:
from transformers import BertTokenizerFast, BertForSequenceClassification
bert = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=int(data['class'].nunique()),
                                                     output_attentions=False,
                                                     output_hidden_states=False)
and in the forward function I'm building, I'm calling x1, x2 = self.bert(sent_id, attention_mask=mask)
Now, as far as I know, x2 is the CLS output (which is the output of the first transformer encoder), but then again, I don't think I understand the output of the model.
However, I want the outputs of all 12 transformer encoders.
How can I do that in pytorch ?
Ideally, if you want to look into the outputs of all the layers, you should use BertModel and not BertForSequenceClassification, because BertForSequenceClassification inherits from BertModel and just adds a linear layer on top of the BERT model.
from transformers import BertModel
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
### Add your code to map the model to device, data to device, and obtain input_ids and mask
sequence_output, pooled_output = my_bert_model(ids, attention_mask=mask)
# sequence_output has the following shape: (batch_size, sequence_length, 768), which contains output for all tokens in the last layer of the BERT model.
sequence_output contains output for all tokens in the last layer of the BERT model.
In order to obtain the outputs of all the transformer encoder layers, you can use the following:
my_bert_model = BertModel.from_pretrained("bert-base-uncased")
sequence_output, pooled_output, all_layer_output = my_bert_model(ids, attention_mask=mask, output_hidden_states=True)
all_layer_output is a tuple containing the output of the embedding layer plus the outputs of all the encoder layers. Each element in the tuple has shape (batch_size, sequence_length, 768).
Hence, to get the sequence of outputs at layer 5, you can use all_layer_output[5], since all_layer_output[0] contains the embedding outputs.
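As a small illustration of that indexing convention (this snippet is my own addition, assuming the tuple-style outputs shown above):
import torch

layer5_output = all_layer_output[5]   # hidden states after encoder layer 5 (index 0 is the embeddings)
print(layer5_output.shape)            # torch.Size([batch_size, sequence_length, 768])

# The last entry of the tuple is the final encoder layer, i.e. the same values as sequence_output.
print(torch.equal(all_layer_output[-1], sequence_output))  # True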
This is detailed in the docs: https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.
from transformers import BertModel, BertConfig
config = BertConfig.from_pretrained("xxx", output_hidden_states=True)
model = BertModel.from_pretrained("xxx", config=config)
outputs = model(inputs)
print(len(outputs)) # 3
hidden_states = outputs[2]
print(len(hidden_states)) # 13
embedding_output = hidden_states[0]
attention_hidden_states = hidden_states[1:]

Saving and retrieving the parameters of a Gpflow model

I am currently implementing an algorithm with GPflow using GPR. I wanted to save the parameters after the GPR training and load the model for testing. Does anyone know the command?
GPflow has a page with tips & tricks now. You can follow the link, where you will find the answer to your question. But I'm going to paste an MWE here as well:
Let's say you want to store a GPR model; you can do it with gpflow.Saver():
import numpy as np
import gpflow
from pathlib import Path

kernel = gpflow.kernels.RBF(1)
x = np.random.randn(100, 1)
y = np.random.randn(100, 1)
model = gpflow.models.GPR(x, y, kernel)

filename = "/tmp/gpr.gpflow"
path = Path(filename)
if path.exists():
    path.unlink()

saver = gpflow.saver.Saver()
saver.save(filename, model)
To load it back you have to use either this solution:
import tensorflow as tf

with tf.Graph().as_default() as graph, tf.Session().as_default():
    model_copy = saver.load(filename)
or if you want to load the model in the same session where you stored it before, you need to apply some tricks:
ctx_for_loading = gpflow.saver.SaverContext(autocompile=False)
model_copy = saver.load(filename, context=ctx_for_loading)
model_copy.clear()
model_copy.compile()
UPDATE 1 June 2020:
GPflow 2.0 doesn't provide custom saver. It relies on TensorFlow checkpointing and tf.saved_model. You can find examples here: GPflow intro.
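For GPflow 2.x, a minimal sketch of that checkpointing route could look like the following (my own sketch based on standard tf.train.Checkpoint usage, not taken from the linked intro; the checkpoint directory is an arbitrary example):
import numpy as np
import tensorflow as tf
import gpflow

X = np.random.randn(100, 1)
Y = np.random.randn(100, 1)
model = gpflow.models.GPR(data=(X, Y), kernel=gpflow.kernels.SquaredExponential())

# GPflow 2 models are tf.Modules, so plain TensorFlow checkpointing works.
ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/gpr_ckpt", max_to_keep=3)
manager.save()  # store the current parameter values

# Later: rebuild the same model structure and restore the parameters.
ckpt.restore(manager.latest_checkpoint)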
One option that I employ for GPflow models is to just save and load the trainables. It assumes you have a function that builds and compiles the model.
I show this below by saving the variables to an HDF5 file.
import h5py

def _load_model(model, load_file):
    """
    Load a model's trainable parameters from an HDF5 file.
    """
    vars = {}

    def _gather(name, obj):
        if isinstance(obj, h5py.Dataset):
            vars[name] = obj[...]

    with h5py.File(load_file, 'r') as f:
        f.visititems(_gather)
    model.assign(vars)

def _save_model(model, save_file):
    vars = model.read_trainables()
    with h5py.File(save_file, 'w') as f:
        for name, value in vars.items():
            f[name] = value
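A hypothetical usage sketch, assuming build_model(x, y) is your own function that rebuilds and compiles the same GPR model:
# Train, then persist only the trainable parameters.
model = build_model(x, y)
# ... run optimisation ...
_save_model(model, "/tmp/gpr_trainables.h5")

# Later (e.g. at test time): rebuild the same model and restore the parameters.
model_copy = build_model(x, y)
_load_model(model_copy, "/tmp/gpr_trainables.h5")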

LSTM neural network with two sources of data

I have the following configuration: one LSTM network that receives a text as n-grams of size 2.
After some tests, I noticed that for some classes I get a significant increase in accuracy when I use n-grams of size 3. Now I want to train a new LSTM neural network with both n-gram sizes at the same time.
How can I provide the data and build this model using Keras?
I assume you already have a function to split words into n-grams, as you already have the 2-gram and 3-gram models working? Therefore I just construct a one-sample example of the word "cool" as a working example. I had to use an embedding layer for my example, as an LSTM layer with 26^3 = 17576 input nodes was a little too much for my computer to handle. I expect you did the same in your 3-gram code?
Below is a complete working example:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
import numpy as np
# c->2 o->14 o->14 l->11
np_2_gram_in = np.array([[26*2+14,26*14+14,26*14+11]])#co,oo,ol
np_3_gram_in = np.array([[26**2*2+26*14+14,26**2*14+26*14+26*11]])#coo,ool
np_output = np.array([[1]])
output_shape=1
lstm_2_gram_embedding = 128
lstm_3_gram_embedding = 192
inputs_2_gram = Input(shape=(None,))
em_input_2_gram = Embedding(output_dim=lstm_2_gram_embedding, input_dim=26**2)(inputs_2_gram)
lstm_2_gram = LSTM(lstm_2_gram_embedding)(em_input_2_gram)
inputs_3_gram = Input(shape=(None,))
em_input_3_gram = Embedding(output_dim=lstm_3_gram_embedding, input_dim=26**3)(inputs_3_gram)
lstm_3_gram = LSTM(lstm_3_gram_embedding)(em_input_3_gram)
concat = concatenate([lstm_2_gram, lstm_3_gram])
output = Dense(output_shape,activation='sigmoid')(concat)
model = Model(inputs=[inputs_2_gram, inputs_3_gram], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit([np_2_gram_in, np_3_gram_in], [np_output], epochs=5)
model.predict([np_2_gram_in,np_3_gram_in])
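To move from this single hand-built sample to real data, one option (a sketch under the assumption that you already have lists of 2-gram and 3-gram id sequences plus labels; the variable names below are mine) is to pad each input separately and pass both arrays to fit:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# two_gram_ids / three_gram_ids: lists of integer id sequences, one pair per text (assumed to exist)
batch_2_gram = pad_sequences(two_gram_ids, padding='post')
batch_3_gram = pad_sequences(three_gram_ids, padding='post')
labels = np.array(label_list)

model.fit([batch_2_gram, batch_3_gram], labels, epochs=5, batch_size=32)
If you pad with zeros like this, you may also want mask_zero=True on the Embedding layers (and id 0 reserved for padding) so that the padded positions are ignored.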

How do I export a graph to Tensorflow Serving so that the input is b64?

I have a Keras graph with a float32 tensor of shape (?, 224, 224, 3) that I want to export to TensorFlow Serving, in order to make predictions through the REST API. The problem is that I cannot send tensors as input, only base64-encoded strings, as that is a limitation of the REST API. That means that when exporting the graph, the input needs to be a string that gets decoded.
How can I "inject" the new input to be converted to the old tensor, without retraining the graph itself? I have tried several examples [1][2].
I currently have the following code for exporting:
image = tf.placeholder(dtype=tf.string, shape=[None], name='source')

signature = predict_signature_def(inputs={'image_bytes': image},
                                  outputs={'output': model.output})
I somehow need to find a way to convert image to model.input, or a way to get the model output to connect to image.
Any help would be greatly appreciated!
You can use tf.decode_base64:
image = tf.placeholder(dtype=tf.string, shape=[None], name='source')
image_b64decoded = tf.decode_base64(image)

signature = predict_signature_def(inputs={'image_bytes': image_b64decoded},
                                  outputs={'output': model.output})
EDIT:
If you need to use tf.image.decode_image, you can get it to work with multiple inputs using tf.map_fn:
image = tf.placeholder(dtype=tf.string, shape=[None], name='source')
image_b64decoded = tf.decode_base64(image)
image_decoded = tf.map_fn(tf.image.decode_image, image_b64decoded, dtype=tf.uint8)
This will work as long as the images have all the same dimensions, of course. However, the result is a tensor with completely unknown shape, because tf.image.decode_image can output a different number of dimensions depending on the type of image. You can either reshape it or use another tf.image.decode_* call so at least you have a known number of dimensions in the tensor.
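The wiring of the decoded tensor back into the model is not shown above; a minimal sketch of one way to finish it (my own assumptions: the inputs are 224x224 JPEGs, the model expects float32 values in [0, 1], and predict_signature_def is imported as in the question) could be:
# Continuing from the code above: decode each base64 JPEG and stack them into a batch.
# Assumes every input is already a 224x224 JPEG so tf.map_fn can stack the results.
image_decoded = tf.map_fn(lambda b: tf.image.decode_jpeg(b, channels=3),
                          image_b64decoded, dtype=tf.uint8)
image_decoded.set_shape([None, 224, 224, 3])
image_float = tf.image.convert_image_dtype(image_decoded, tf.float32)  # uint8 -> floats in [0, 1]

# Calling the Keras model on this tensor re-applies its layers to it, so the exported
# output now depends on the string placeholder instead of the original model.input.
new_output = model(image_float)
signature = predict_signature_def(inputs={'image_bytes': image},
                                  outputs={'output': new_output})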
Creating an export_model may be an easier way. One example from tensorflow.org:
The Keras model with a float32 tensor of shape (?, 224, 224, 3):
model = ...
Define a function to preprocess the b64 image:
def preprocess_input(base64_input_bytes):
    def decode_bytes(img_bytes):
        img = tf.image.decode_jpeg(img_bytes, channels=3)
        img = tf.image.resize(img, (224, 224))
        img = tf.image.convert_image_dtype(img, tf.float32)
        return img

    base64_input_bytes = tf.reshape(base64_input_bytes, (-1,))
    return tf.map_fn(lambda img_bytes: decode_bytes(img_bytes),
                     elems=base64_input_bytes,
                     fn_output_signature=tf.float32)
Export a serving model
serving_inputs = tf.keras.layers.Input(shape=(), dtype=tf.string, name='b64_input_bytes')
serving_x = tf.keras.layers.Lambda(preprocess_input, name='decode_image_bytes')(serving_inputs)
serving_x = model(serving_x)
serving_model = tf.keras.Model(serving_inputs, serving_x)
tf.saved_model.save(serving_model, serving_model_path)
Serving
import json
import requests

data = json.dumps({"signature_name": "serving_default",
                   "instances": [{"b64_input_bytes": {"b64": b64str_1}},
                                 {"b64_input_bytes": {"b64": b64str_2}}]})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/{model_name}:predict',
                              data=data, headers=headers)
predictions = json.loads(json_response.text)['predictions']