How to use caffe to classify text? - neural-network

I'm using the Rotten Tomatoes dataset to train my net. It's divided into two groups, positive and negative examples. How can I configure my CNN in Caffe to predict whether a given text is a positive or a negative example?
I already formatted the data; each sentence has a size of 56 words. But the following config does not give me even a satisfactory result.
import caffe
from caffe import layers as L, params as P

# batch_size, db_path, mean and n_classes are defined elsewhere.
n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB,
                         source=db_path,
                         transform_param=dict(scale=1 / mean),
                         ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, pad=1,
                        param=dict(lr_mult=1), num_output=10,
                        weight_filler=dict(type='xavier'))
n.pool1 = L.Pooling(n.conv1, kernel_size=n_classes,
                    stride=2, pool=P.Pooling.MAX)
n.ip1 = L.InnerProduct(n.pool1, num_output=100,
                       weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.ip1, in_place=True)
n.ip2 = L.InnerProduct(n.relu1, num_output=n_classes,
                       weight_filler=dict(type='xavier'))
n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
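Not shown above is the step that turns the NetSpec into a prototxt for the solver; a minimal sketch, assuming the spec is simply written out (the file name is illustrative):
# Serialise the NetSpec to a prototxt that the solver's net field can point at.
with open('train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))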
My dataset is split across two text files, one containing the positive examples and the other containing the negative examples (Polarity dataset v1.1). To organize my data I take the length of the longest sentence (59 words), and if a sentence is shorter than 59 words I pad it with extra text. I adapted this from this code. For example, let's pretend that the longest sentence has 3 words:
data = 'abc def ghijkl. mnopqrst uvwxyz. abcd.'
##
#In this data I have 3 sentences:
##
sentence_one = 'abc def ghijkl'
sentence_two = 'mnopqrst uvwxyz'
sentence_three = 'abcd'
sentence_one is the longest (3 words), so to format the other two sentences I did the following:
sentence_two = 'mnopqrst uvwxyz <PAD>'
sentence_three = 'abcd <PAD> <PAD>'
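In code, that padding step can be sketched roughly like this (the sentences list and its tokens are illustrative):
# Pad every tokenised sentence to the length of the longest one with a <PAD> token.
sentences = [['abc', 'def', 'ghijkl'], ['mnopqrst', 'uvwxyz'], ['abcd']]
max_len = max(len(s) for s in sentences)
padded = [s + ['<PAD>'] * (max_len - len(s)) for s in sentences]
# padded[1] == ['mnopqrst', 'uvwxyz', '<PAD>'], padded[2] == ['abcd', '<PAD>', '<PAD>']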
I saved each positive and negative sentence to a Caffe Datum and stored it in LMDB:
datum = caffe.proto.caffe_pb2.Datum()
datum.channels = 1
datum.height = 59  # biggest sentence
datum.width = 1
datum.label = label  # 0 or 1
datum.data = sentence.tobytes()  # sentence is a NumPy array here, hence .tobytes()
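The snippet above builds each Datum but not the LMDB commit; a rough sketch of that step, assuming the Datum objects are collected in a list called datums and db_path is the LMDB directory that the L.Data layer points at (names are illustrative):
import lmdb

env = lmdb.open(db_path, map_size=1 << 30)  # map_size: an upper bound on the DB size
with env.begin(write=True) as txn:
    for i, datum in enumerate(datums):
        # zero-padded keys keep the entries in insertion order
        txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())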
Using my Datum database and the Caffe configuration above I get poor accuracy (less than 3 percent). What am I doing wrong?

Related

Pytorch LSTM - generating a sentence word by word?

I'm trying to implement a neural network to generate sentences (image captions), and I'm using Pytorch's LSTM (nn.LSTM) for that.
The input I want to feed during training is of size batch_size * seq_size * embedding_size, where seq_size is the maximal size of a sentence. For example: 64*30*512.
After the LSTM there is one FC layer (nn.Linear).
As far as I understand, this type of network works with a hidden state (h, c in this case) and predicts the next word each time.
My question is: during training, do we have to manually feed the sentence word by word to the LSTM in the forward function, or does the LSTM know how to do it itself?
My forward function looks like this:
def forward(self, features, caption, h=None, c=None):
    batch_size = caption.size(0)
    caption_size = caption.size(1)
    no_hc = False
    if h is None and c is None:
        no_hc = True
        h, c = self.init_hidden(batch_size)
    embeddings = self.embedding(caption)
    output = torch.empty((batch_size, caption_size, self.vocab_size)).to(device)
    for i in range(caption_size):  # go over the words in the sentence
        if i == 0:
            lstm_input = features.unsqueeze(1)
        else:
            lstm_input = embeddings[:, i-1, :].unsqueeze(1)
        out, (h, c) = self.lstm(lstm_input, (h, c))
        out = self.fc(out)
        output[:, i, :] = out.squeeze()
    if no_hc:
        return output
    return output, h, c
(took inspiration from here)
The output of the forward here is of size batch_size * seq_size * vocab_size, which is good because it can be compared with the original caption of size batch_size * seq_size in the loss function.
The question is whether this for loop inside the forward, which feeds the words one after the other, is really necessary, or whether I can somehow feed the entire sentence at once and get the same results.
(I saw some examples that do that, for example this one, but I'm not sure if they're really equivalent.)
The answer is: the LSTM knows how to do it on its own. You do not have to feed each word in manually, one by one.
An intuitive way to understand it is that the shape of the batch you pass in contains seq_length (batch.shape[1]), from which the LSTM decides the number of words in each sentence. The words are passed through the LSTM cell internally, generating the hidden states h and c.
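For illustration, a minimal sketch of what the vectorised forward could look like, assuming the nn.LSTM was built with batch_first=True (which the unsqueeze(1) calls in the original forward suggest) and that torch and the module's attributes are in scope; it builds the same teacher-forcing inputs as the loop and runs all time steps in one call:
def forward(self, features, caption, h=None, c=None):
    batch_size = caption.size(0)
    no_hc = h is None and c is None
    if no_hc:
        h, c = self.init_hidden(batch_size)
    embeddings = self.embedding(caption)
    # step 0 input is the image features, steps 1..T-1 are embeddings of words 0..T-2
    lstm_input = torch.cat([features.unsqueeze(1), embeddings[:, :-1, :]], dim=1)
    out, (h, c) = self.lstm(lstm_input, (h, c))   # all time steps at once
    output = self.fc(out)                         # (batch, seq_size, vocab_size)
    return output if no_hc else (output, h, c)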

Pytorch minibatching keeps model from training

I am trying to classify sequences by a binary feature. I have a dataset of sequence/label pairs and am using a simple one-layer LSTM to classify each sequence. Before I implemented minibatching, I was getting reasonable accuracy on a test set (80%), and the training loss would go from 0.6 to 0.3 (averaged).
I implemented minibatching, using parts of this tutorial: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
However, now my model won’t do better than 70-72% (70% of the data has one label) with batch size set to 1 and all other parameters exactly the same. Additionally, the loss starts out at 0.0106 and quickly gets really really small, with no significant change in results. I feel like the results between no batching and batching with size 1 should be the same, so I probably have a bug, but for the life of me I can’t find it. My code is below.
Training code (one epoch):
for i in t:
    model.zero_grad()
    # prep inputs
    last = i + self.params['batch_size']
    last = last if last < len(train_data) else len(train_data)
    batch_in, lengths, batch_targets = self.batch2TrainData(train_data[shuffled][i:last], word_to_ix, label_to_ix)
    iters += 1
    # forward pass
    tag_scores = model(batch_in, lengths)
    # compute loss, then do backward pass, then update gradients
    loss = loss_function(tag_scores, batch_targets)
    loss.backward()
    # Clip gradients: gradients are modified in place
    nn.utils.clip_grad_norm_(model.parameters(), 50.0)
    optimizer.step()
Functions:
def prep_sequence(self, seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# transposes batch_in
def zeroPadding(self, l, fillvalue=0):
    return list(itertools.zip_longest(*l, fillvalue=fillvalue))

# Returns padded input sequence tensor and lengths
def inputVar(self, batch_in, word_to_ix):
    idx_batch = [self.prep_sequence(seq, word_to_ix) for seq in batch_in]
    lengths = torch.tensor([len(idxs) for idxs in idx_batch])
    padList = self.zeroPadding(idx_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lengths

# Returns all items for a given batch of pairs
def batch2TrainData(self, batch, word_to_ix, label_to_ix):
    # sort by decreasing length
    batch = batch[np.argsort([len(x['turn']) for x in batch])[::-1]]
    input_batch, output_batch = [], []
    for pair in batch:
        input_batch.append(pair['turn'])
        output_batch.append(pair['label'])
    inp, lengths = self.inputVar(input_batch, word_to_ix)
    output = self.prep_sequence(output_batch, label_to_ix)
    return inp, lengths, output
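To make the shape convention explicit, here is a tiny illustration (plain Python lists with made-up indices) of the transpose that zeroPadding performs, which the model below relies on:
import itertools

# Two index sequences of lengths 3 and 1 become max_len rows of batch_size columns.
idx_batch = [[4, 7, 2], [9]]
padded = list(itertools.zip_longest(*idx_batch, fillvalue=0))
# padded == [(4, 9), (7, 0), (2, 0)]  ->  shape (max_len, batch_size),
# which matches the seq-first layout that nn.LSTM expects by default.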
Model:
class LSTMClassifier(nn.Module):
    def __init__(self, params, vocab_size, tagset_size, weights_matrix=None):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = params['hidden_dim']
        if weights_matrix is not None:
            self.word_embeddings = nn.Embedding.from_pretrained(weights_matrix)
        else:
            self.word_embeddings = nn.Embedding(vocab_size, params['embedding_dim'])
        self.lstm = nn.LSTM(params['embedding_dim'], self.hidden_dim, bidirectional=False)
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(self.hidden_dim, tagset_size)

    def forward(self, batch_in, lengths):
        embeds = self.word_embeddings(batch_in)
        packed = nn.utils.rnn.pack_padded_sequence(embeds, lengths)
        lstm_out, _ = self.lstm(packed)
        outputs, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
        tag_space = self.hidden2tag(outputs)
        tag_scores = F.log_softmax(tag_space, dim=0)
        return tag_scores[-1]
For anyone else with a similar issue, I got it to work. I removed the log_softmax calculation, so this:
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
becomes this:
tag_space = self.hidden2tag(outputs)
return tag_space[-1]
I also changed NLLLoss to CrossEntropyLoss (not shown above), and initialized CrossEntropyLoss with no parameters (i.e., no ignore_index).
I am not certain why these changes were necessary (the docs even say that NLLLoss should be run after a log_softmax layer), but they got my model working and brought my loss back to a reasonable range (~0.5).
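For reference, nn.CrossEntropyLoss combines a log-softmax over the class dimension with NLLLoss, so applying it to the raw tag_space scores is equivalent to the documented log_softmax + NLLLoss pairing; the likely culprit in the original code is that log_softmax was taken over dim=0, which in pad_packed_sequence's default seq-first layout is the time dimension rather than the tag dimension. A small self-contained check of the equivalence:
import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.randn(8, 2)                  # raw tag_space values: (batch, n_tags)
targets = torch.randint(0, 2, (8,))

ce = nn.CrossEntropyLoss()(scores, targets)
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), targets)
print(ce.item(), nll.item())                # the two values match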

Keras deep autoencoder prediction is inaccurate

I am using a Keras deep autoencoder to reproduce my sparse matrix of dimension [360, 6860]. Each row is the count of trigrams for a protein sequence. The matrix has 2 classes of proteins, but I want the network to be ignorant of that initially, which is why I am using an autoencoder. I am following the Keras blog autoencoder tutorial for this.
This is my code:
# this is the size of our encoded representations
encoding_dim = 32
input_img = Input(shape=(6860,))
encoded = Dense(128, activation='relu', activity_regularizer=regularizers.activity_l1(10e-5))(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(6860, activation='sigmoid')(decoded)
autoencoder = Model(input=input_img, output=decoded)
# this model maps an input to its encoded representation
encoder = Model(input=input_img, output=encoded)
# create a placeholder for an encoded (32-dimensional) input
encoded_input_1 = Input(shape=(32,))
encoded_input_2 = Input(shape=(64,))
encoded_input_3 = Input(shape=(128,))
# retrieve the last layer of the autoencoder model
decoder_layer_1 = autoencoder.layers[-3]
decoder_layer_2 = autoencoder.layers[-2]
decoder_layer_3 = autoencoder.layers[-1]
# create the decoder model
decoder_1 = Model(input = encoded_input_1, output = decoder_layer_1(encoded_input_1))
decoder_2 = Model(input = encoded_input_2, output = decoder_layer_2(encoded_input_2))
decoder_3 = Model(input = encoded_input_3, output = decoder_layer_3(encoded_input_3))
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                nb_epoch=100,
                batch_size=40,
                shuffle=True,
                validation_data=(x_test, x_test))
My validation set has dimension [80, 6860]. The problem is that if I use the decoder to predict from the test set, my predictions are really off. For example, if I predict with the following code:
# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder_1.predict(encoded_imgs)
decoded_imgs = decoder_2.predict(decoded_imgs)
decoded_imgs = decoder_3.predict(decoded_imgs)
print x_test[3, np.where(x_test[3, :] != 0)[0]]
print (decoded_imgs[3, np.where(x_test[3, :] != 0)[0]])
the non-zero values of a single row of my test set are:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
for the same row, the autoencoder's predictions at the same indices are:
[ 0.04615583 0.04613763 0.10268984 0.00286385 0.0030572 0.02551027
0.00552908 0.09686473 0.02554915 0.0082816 0.02254158 0.01127195
0.00305908 0.17113154 0.01140419 0.03370495 0.00515486 0.02614204
0.00558715 0.02835727 0.0029659 0.01425297 0.00834536 0.04502939
0.02260707 0.01131396 0.00561662 0.01131314 0.00493734 0.00265232
0.0056083 0.01724379 0.06099484 0.03738695 0.01128869 0.01995548
0.00562622 0.00556281 0.01732991 0.03142899 0.05339266 0.04778111
0.00292415 0.02264618 0.01419865 0.00550648 0.00836777 0.01139715]
Now, at first I thought maybe I could use some kind of thresholding to recover the 1's from these values, but they seem pretty random. For a single row, for the first 50 zero values of my test set, my autoencoder predicts:
[ 0.14251608 0.00118295 0.00118732 0.00304095 0.031255 0.00108441
0.0201351 0.00853934 0.00558488 0.00281343 0.00296877 0.00109651
0.01129742 0.00827519 0.0170884 0.01417614 0.01714166 0.00549215
0.00099755 0.00558552 0.00829634 0.01988331 0.00092845 0.00294271
0.01429107 0.01137067 0.01137967 0.01121876 0.00491931 0.00562285
0.0055124 0.01720702 0.0142925 0.00553411 0.00551252 0.00281541
0.01145663 0.002876 0.00555185 0.00525392 0.01421779 0.00273949
0.01698892 0.02529835 0.0112521 0.01130333 0.00554186 0.00291986
0.00554437 0.01144382]
How can I improve the predictions? What am I doing wrong here? I must say that the data is hugely sparse. If you want you can download the toy data from here. Please, let me know if you have any questions.
One of the most important reasons is probably that your training data size is just too small. You have a fully connected network, and thus with 7 layers (including input and output) the number of parameters is huge, close to 1.8M. You only have 360 training samples, so the parameters are basically untrained.
You can improve your work in two ways. One is of course to get more training data. The second is to follow the CNN example in the later part of the tutorial. CNNs have been popular partly because they can greatly reduce the number of parameters.
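A quick back-of-the-envelope check of that parameter count (each Dense layer contributes in_dim * out_dim weights plus out_dim biases):
# Layer widths from the model above: 6860 -> 128 -> 64 -> 32 -> 64 -> 128 -> 6860.
dims = [6860, 128, 64, 32, 64, 128, 6860]
total = sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))
print(total)  # 1783916 trainable parameters, against only 360 training rows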

How to apply word2vec for k-means clustering?

I am new to word2vec. Using this method, I am trying to form some clusters based on words extracted by word2vec from scientific publications' abstracts. To this end, I first retrieved sentences from the abstracts via stanfordNLP and put each sentence on its own line in a text file. Then the text file required by deeplearning4j word2vec was ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        //System.out.println(sentence.toLowerCase());
        return sentence.toLowerCase();
    }
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words with their related vector values in each row, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file was used to form some clusters via k-means in Spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2, 101).map(_.toDouble))).cache()

val numberOfClusters = 10
val numberOfInterations = 100

// We use the KMeans object provided by MLLib to run
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)
modell.clusterCenters.foreach(println)

// Get cluster index for each buyer Id
val AltCompByCluster = rawData.map {
  row =>
    (modell.predict(Vectors.dense(row.split(' ').slice(2, 101)
      .map(_.toDouble))), row.split(',').slice(0, 1).head)
}
AltCompByCluster.foreach(println)
As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I checked my clusters, no obvious common words appeared. That is, I could not get reasonable clusters as I expected. Given this bottleneck, I have a few questions:
1) In some tutorials for word2vec I have seen that no data cleaning is done. In other words, prepositions etc. are left in the text. So how should I apply a cleaning procedure when using word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so, which neural network (convolutional, recursive, recurrent) method would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

IDL and MATLAB getting strange values from NetCDF file

I have a NetCDF file which contains data representing total precipitation across the globe over several months (so it's stored in a three-dimensional array). I first checked, in both XConv and ncdump, that the data and the way it was formed are sensible. All looks sensible: values vary from very small (~10^-10, which makes sense, as this is model data and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MATLAB. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers such as 5000. Strangely, looking at a plot of the data in MATLAB with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
  address = '/Users/levyadmin/Downloads/'
  file_base = 'output'
  ncid = ncdf_open(address + file_base + '.nc')
  MONTHS = ['january','february','march','april','may','june','july','august','september','october','november','december']
  varid_field = ncdf_varid(ncid, "tp")
  varid_lon = ncdf_varid(ncid, "longitude")
  varid_lat = ncdf_varid(ncid, "latitude")
  varid_time = ncdf_varid(ncid, "time")
  ncdf_varget, ncid, varid_field, total_precip
  ncdf_varget, ncid, varid_lat, lats
  ncdf_varget, ncid, varid_lon, lons
  ncdf_varget, ncid, varid_time, time
  ncdf_close, ncid
  lats = reform(lats)
  lons = reform(lons)
  time = reform(time)
  total_precip = reform(total_precip)
  total_precip = total_precip*1000. ;put in mm
  noLats = (size(lats))(1)
  noLons = (size(lons))(1)
  noMonths = (size(time))(1)
  ; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
  av_precip = fltarr(noLons,noLats,12)
  for month=0, 11 do begin
    year = 0
    while ( (year*12) + month lt noMonths ) do begin
      av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
      year++
    endwhile
    av_precip(*,*,month) = av_precip(*,*,month)/year
  endfor
  fname = address + file_base + '.dat'
  OPENW, 1, fname
  PRINTF, 1, 'longitude'
  PRINTF, 1, lons
  PRINTF, 1, 'latitude'
  PRINTF, 1, lats
  for month=0, 11 do begin
    PRINTF, 1, MONTHS(month)
    PRINTF, 1, av_precip(*,*,month)
  endfor
  CLOSE, 1
END
Does anyone have any ideas why I'm getting such strange values in MATLAB and IDL?!
AH! Found the answer. NetCDF files use an offset and a scale factor for the data to keep the file size to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump and hard-coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code...?
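The same attributes can be read programmatically rather than hard-coded: in IDL the NCDF_ATTGET routine reads per-variable attributes, which are conventionally named scale_factor and add_offset. As an illustration only, a minimal Python sketch with the netCDF4 package doing the equivalent lookup dynamically (the file and variable names mirror the example above, so treat it as a sketch rather than a drop-in for the IDL program):
from netCDF4 import Dataset

ds = Dataset('output.nc')
var = ds.variables['tp']
var.set_auto_maskandscale(False)          # keep the raw packed values
scale = var.getncattr('scale_factor')     # standard NetCDF packing attributes
offset = var.getncattr('add_offset')
total_precip = var[:] * scale + offset    # same unpacking formula as above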