How to apply word2vec for k-means clustering? - scala

I am new to word2vec. I am trying to form clusters of words taken from the abstracts of scientific publications, using the vectors that word2vec produces for them. To this end, I first extracted sentences from the abstracts with stanfordNLP and wrote each sentence on its own line of a text file. That gave me the text file that deeplearning4j's word2vec requires as input.
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    public String preProcess(String sentence) {
        return sentence.toLowerCase();
    }
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
// The builder's hyperparameter calls were not preserved in this post
Word2Vec vec = new Word2Vec.Builder()
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file in which each row contains a word followed by its vector values, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file has been used to form some clusters via k-means in spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfIterations = 100
// We use the KMeans object provided by MLlib to run the clustering
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)
// Get the cluster index for each row
val AltCompByCluster = rawData.map { row =>
  modell.predict(Vectors.dense(row.split(' ').slice(2,101).map(_.toDouble)))
}
With the Scala code above, I obtained 10 clusters based on the word vectors produced by word2vec. However, when I inspected the clusters, the words in each one had nothing obvious in common. That is, I could not get the reasonable clusters I expected. Given this problem, I have a few questions:
1) In some word2vec tutorials I have seen that no data cleaning is done. In other words, prepositions etc. are left in the text. So how should I clean the text when applying word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so, which type of neural network (convolutional, recursive, recurrent) would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.
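On question 1, one possible cleaning step (an assumption about what may help, not a confirmed fix) is to drop numeric and punctuation tokens from the exported vector file and to length-normalize the remaining vectors before k-means, so that Euclidean distance behaves like cosine distance. A minimal Python sketch, assuming the `word v1 v2 ...` line format shown above; `is_word`, `normalize`, and `load_vectors` are hypothetical helper names:

```python
import math
import re

# Keep tokens made of letters, optionally joined by hyphens (e.g. "meta-analysis").
WORD_RE = re.compile(r"^[a-z]+(?:-[a-z]+)*$")

def is_word(token):
    return bool(WORD_RE.match(token))

def normalize(vec):
    # Scale the vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def load_vectors(lines):
    # Parse "word v1 v2 ..." rows, skipping tokens that are not plain words.
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        token, values = parts[0], parts[1:]
        if is_word(token):
            vectors[token] = normalize([float(v) for v in values])
    return vectors

sample = [
    "pills 3.0 4.0",
    "= 0.1 0.2",
    "3.55 0.5 0.5",
    "meta-analysis 0.0 2.0",
]
vectors = load_vectors(sample)
print(sorted(vectors))
```

The filtered, normalized vectors could then be written back out and clustered as before. Whether this improves the clusters depends on the corpus.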


Pytorch LSTM - generating sentence- word by word?

I'm trying to implement a neural network that generates sentences (image captions), and I'm using PyTorch's LSTM (nn.LSTM) for that.
The training input has size batch_size * seq_size * embedding_size, where seq_size is the maximum sentence length. For example, 64*30*512.
After the LSTM there is one FC layer (nn.Linear).
As far as I understand, this type of network works with a hidden state (h, c in this case) and predicts the next word at each step.
My question is: during training, do we have to feed the sentence to the LSTM word by word manually in the forward function, or does the LSTM know how to do that itself?
My forward function looks like this:
def forward(self, features, caption, h=None, c=None):
    batch_size = caption.size(0)
    caption_size = caption.size(1)
    no_hc = False
    if h is None and c is None:
        no_hc = True
        h, c = self.init_hidden(batch_size)
    embeddings = self.embedding(caption)
    output = torch.empty((batch_size, caption_size, self.vocab_size)).to(device)
    for i in range(caption_size):  # go over the words in the sentence
        if i == 0:
            lstm_input = features.unsqueeze(1)
        else:
            lstm_input = embeddings[:, i-1, :].unsqueeze(1)
        out, (h, c) = self.lstm(lstm_input, (h, c))
        out = self.fc(out)
        output[:, i, :] = out.squeeze(1)
    if no_hc:
        return output
    return output, h, c
(took inspiration from here)
The output of forward here has size batch_size * seq_size * vocab_size, which is good because it can be compared with the original caption of size batch_size * seq_size in the loss function.
The question is whether this for loop inside forward, which feeds the words one after the other, is really necessary, or whether I can feed the entire sentence at once and get the same results.
(I saw some example that do that, for example this one, but I'm not sure if it's really equivalent)
The answer is that the LSTM does this on its own. You do not have to feed each word manually, one by one.
An intuitive way to see this: the batch you pass in has a sequence dimension (batch.shape[1] when the LSTM is created with batch_first=True), and the LSTM uses it to determine how many steps to run. Each word is passed through the LSTM cell in turn, which updates the hidden state h and the cell state c.
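The equivalence can be illustrated without PyTorch. The sketch below uses a toy scalar recurrence as a stand-in for the LSTM cell (it is not nn.LSTM, and `step` and `run_sequence` are hypothetical names). Running the whole sequence in one call gives the same outputs as feeding the items one at a time while carrying the state along:

```python
import math

def step(x, h, w_x=0.5, w_h=0.9):
    """One recurrent update: the new state depends on the input and the previous state."""
    return math.tanh(w_x * x + w_h * h)

def run_sequence(xs, h0=0.0):
    """Process a whole sequence in one call; the loop over time steps lives inside."""
    outputs, h = [], h0
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs, h

sequence = [1.0, -0.5, 0.25, 2.0]

# Option A: hand over the full sequence at once.
all_at_once, final_a = run_sequence(sequence)

# Option B: feed one item at a time, passing the state along manually.
one_by_one, h = [], 0.0
for x in sequence:
    out, h = run_sequence([x], h)
    one_by_one.extend(out)

print(all_at_once == one_by_one and final_a == h)
```

For the forward function in the question, the single-call version would correspond to building one input sequence from the image features followed by all caption embeddings except the last, and passing that sequence to the LSTM once.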

Text mining : Bad predictions of toxic comments using Word2Vec

I have a dataset containing sentences and boolean columns (0 or 1) to classify the type of the comment (toxic|severe_toxic|obscene|threat|insult|identity_hate).
You can download the dataset here :
I filtered the words with spacy to keep only the useful ones. I kept adjectives, adverbs, verbs and nouns using this function:
def filter_words(words):
    vec = []
    conditions = ('ADV', 'NOUN', 'ADJ', 'VERB')
    for token in nlp(words):
        if not token.is_stop and token.pos_ in conditions:
            # The appended value was lost from the post; token.text is an assumption
            vec.append(token.text)
    return vec
Then I converted the dataframe to a parquet file to speed up processing.
I ended up with a dataframe which looks like this :
I ran Word2Vec on this DF to create a features column, and then used a RandomForestClassifier on those features to check how well a model can predict toxicity.
Here is the code :
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import Word2Vec
from pyspark.sql.functions import *
from pyspark.sql.types import IntegerType

word2vec = Word2Vec(inputCol="vector_words", outputCol="features")
model = word2vec.fit(sentences)
result = model.transform(sentences)
result = result.withColumn("toxic", result["toxic"].cast(IntegerType()))
rf = RandomForestClassifier(labelCol="toxic", featuresCol="features")
result = result.dropna()
(trainingSet, testSet) = result.randomSplit([0.7, 0.3])
model_toxic = rf.fit(trainingSet)
predictions = model_toxic.transform(testSet)
The problem is that only 16 comments are predicted as toxic, of which 13 really are toxic, while there are about 4000 toxic comments in the set.
I don't understand why. Is it because the filter I applied to the words is too restrictive (I don't see why it would be), or is it because the parameters of my Word2Vec and RandomForestClassifier are not tuned well enough?
I'm new to pyspark and I couldn't find any information about poorly performing models; people on the internet mostly seem happy with their results. Any help would be appreciated.
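The numbers in the question can be restated as precision and recall, which makes the failure mode easier to see. A small Python sketch using the counts quoted above (16 predicted toxic, 13 of them correct, and roughly 4000 toxic comments, taken here as an approximate figure for the evaluated set); `precision_recall` is a hypothetical helper:

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    # Precision: share of predicted positives that are correct.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    # Recall: share of actual positives that were found.
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall

precision, recall = precision_recall(13, 16, 4000)
print(f"precision={precision:.4f} recall={recall:.5f}")
```

High precision combined with very low recall means the classifier almost always predicts the non-toxic class. That pattern often appears when one class is much rarer than the other, though whether that is the cause here would need checking against the actual class counts.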

What is the correct svmlight input format in Mallet?

I am using Mallet with the SVMLight input format to do classification using the NaiveBayes classifier, but I get a NumberFormatException. I'm wondering how I can use string features with SVMLight. As I read in the guideline, the features can also be strings.
Can anyone tell me what is wrong with my code or input?
Here is my code:
public void trainMalletNaiveBayes() throws Exception {
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new SvmLight2FeatureVectorAndLabel());
    pipes.add(new PrintInputAndTarget());
    SerialPipes pipe = new SerialPipes(pipes);
    // prepare training instances
    InstanceList trainingInstanceList = new InstanceList(pipe);
    trainingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/featureFiles_svm.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
    // prepare test instances
    InstanceList testingInstanceList = new InstanceList(pipe);
    testingInstanceList.addThruPipe(new CsvIterator(new FileReader("/tmp/test_set.csv"), "^(\\S*)[\\s,]*(.*)$", 2, 1, -1));
    ClassifierTrainer trainer = new NaiveBayesTrainer();
    Classifier classifier = trainer.train(trainingInstanceList);
}
And here is the first three lines of my input file:
No f1:NP f2:NN f3:1 f4:1 f5:0 f6:0 f7:0 f8:0.0 f9:1 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:NN f17:NOTHING
No f1:NP f2:NN f3:8 f4:4 f5:0 f6:0 f7:1 f8:4.127134385045092 f9:8 f10:true f11:false f12:false f13:false f14:false f15:ROOT f16:DT f17:NOTHING
Yes f1:NP f2:NN f3:4 f4:3 f5:0 f6:0 f7:0 f8:0.0 f9:4 f10:true f11:false f12:false f13:false f14:false f15:NP f16:DT f17:NN
The first column is the label of the instance, and the rest of the line holds the features and their values. For example, NN is the POS tag of the head word of a phrase.
I get the exception for the NN (NumberFormatException: For input string: "NN"). I'm wondering why it has no problem with the NP that comes before it, but stops at the NN.
All features need to have numeric values. For booleans you can use true=1 and false=0. You would also have to modify f1:NP to f1_NP=1.
The reason it's not dying on the NP is that the SvmLight2FeatureVectorAndLabel class is expecting to parse an entire line (label and data), but the code is reading the file with a CsvIterator that is splitting off the first element as a label.
The class uses this code for an iterator:
new SelectiveFileLineIterator (fileReader, "^\\s*#.+")
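The conversion described in the answer (booleans to 1/0, categorical values folded into the feature name) can be done as a preprocessing step before the file reaches Mallet. A minimal Python sketch; `to_numeric_features` and `to_svmlight_line` are hypothetical helpers, and writing each pair as `name:value` is an assumption that follows the layout of the input file in the question:

```python
def to_numeric_features(line):
    # Split "label f1:v1 f2:v2 ..." into the label and numeric (name, value) pairs.
    label, *pairs = line.split()
    features = []
    for pair in pairs:
        name, value = pair.split(":", 1)
        if value in ("true", "false"):
            features.append((name, 1.0 if value == "true" else 0.0))
            continue
        try:
            features.append((name, float(value)))
        except ValueError:
            # Categorical value: fold it into the feature name, e.g. f1:NP -> f1_NP.
            features.append((f"{name}_{value}", 1.0))
    return label, features

def to_svmlight_line(line):
    label, features = to_numeric_features(line)
    return " ".join([label] + [f"{name}:{value}" for name, value in features])

converted = to_svmlight_line("No f1:NP f2:NN f3:8 f8:4.127134385045092 f10:true f11:false")
print(converted)
```

Each distinct categorical value becomes its own binary feature, so `f2:NN` and `f2:DT` end up as the separate features `f2_NN` and `f2_DT`.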

How to use caffe to classify text?

I'm using the Rotten Tomatoes dataset to train my net. It is divided into two groups, positive and negative examples. How can I configure my CNN in caffe to predict whether a given text is a positive or a negative example?
I have already formatted the data so that each sentence has a size of 56 words. But the following config does not give me a satisfactory result.
n = caffe.NetSpec()
# The source= argument (LMDB path) was lost from this post; lmdb_path is a placeholder
n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB, source=lmdb_path,
                         transform_param=dict(scale=1 / mean), ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, pad=1,
                        param=dict(lr_mult=1), num_output=10)
n.pool1 = L.Pooling(n.conv1, kernel_size=n_classes,
                    stride=2, pool=P.Pooling.MAX)
n.ip1 = L.InnerProduct(n.pool1, num_output=100)
n.relu1 = L.ReLU(n.ip1, in_place=True)
n.ip2 = L.InnerProduct(n.relu1, num_output=n_classes)
n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
My dataset is divided into two text files, one containing the positive examples and the other containing the negative examples (Polarity dataset v1.1). To organize my data, I take the length of the longest sentence (59 words), and if a sentence is shorter than 59 words I pad it with extra tokens. I adapted from this code. For example, let's pretend that the longest sentence has 3 words:
data = 'abc def ghijkl. mnopqrst uvwxyz. abcd.'
#In this data I have 3 sentences:
sentence_one = 'abc def ghijkl'
sentence_two = 'mnopqrst uvwxyz'
sentence_three = 'abcd'
sentence_one is the longest (3 words), so to format the other two sentences I did the following:
sentence_two = 'mnopqrst uvwxyz <PAD>'
sentence_three = 'abcd <PAD> <PAD>'
I saved each positive and negative sentence to a caffe datum and stored it in lmdb:
datum = caffe.proto.caffe_pb2.Datum()
datum.channels = 1
datum.height = 59  # biggest sentence
datum.width = 1
datum.label = label  # 0 or 1
datum.data = sentence.tobytes()
Using my datum database and the above caffe's configuration I get a poor accuracy (less than 3 percent). What am I doing wrong?
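The padding step described above, together with the word-to-integer mapping that a fixed-size numeric input would need, can be sketched in plain Python. The `<PAD>` token comes from the question; `pad`, `build_vocab`, and `encode` are hypothetical helpers, and reserving id 0 for padding is an assumption:

```python
PAD = "<PAD>"

def pad(tokens, length):
    # Append PAD tokens until the sentence reaches the target length.
    return tokens + [PAD] * (length - len(tokens))

def build_vocab(sentences):
    # Assign each distinct token an integer id; id 0 is reserved for padding.
    vocab = {PAD: 0}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, length):
    return [vocab[token] for token in pad(tokens, length)]

sentences = [s.split() for s in ["abc def ghijkl", "mnopqrst uvwxyz", "abcd"]]
max_len = max(len(tokens) for tokens in sentences)
vocab = build_vocab(sentences)
encoded = [encode(tokens, vocab, max_len) for tokens in sentences]
print(encoded)
```

Every encoded sentence has the same length, so each one could be stored as a fixed-size array, for example with the height of 59 used in the datum above.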

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train on an existing data set (Seq[Words] with corresponding categories), and use the trained model to filter another dataset by category similarity.
I am trying to train on a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So I am now trying to use TFIDF, passing that output into a RowMatrix and computing the similarities. But I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus : RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right until here). I tried filtering the similarities i and j down to the parts of my query's TFIDF and end up with an empty collection.
The gist is that I want to train on a corpus of data and find what category it falls in. The above code is at least trying to get it down to one category and checking if I can get a prediction from that at least....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a term, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is the similarity between a query and the corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on a corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper:
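import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }).toCoordinateMatrix.toBlockMatrix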
First let's convert the query to an RDD and compute TF:
val query: RDD[Seq[String]] = ???
val queryTf = new HashingTF().transform(query)
Next we can apply IDF model and convert result to matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply the two, you get a matrix with the number of rows equal to the number of documents in the query and the number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by the product of the magnitudes, but you should be able to handle that.
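The normalization step is small enough to show on plain lists. A minimal Python sketch of cosine similarity, where the dot product is divided by the product of the two vector magnitudes; the vectors are made-up toy values and `dot` and `cosine_similarity` are hypothetical helpers, not Spark APIs:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define the similarity with an all-zero vector as 0
    return dot(a, b) / (norm_a * norm_b)

query = [1.0, 0.0, 2.0]
corpus = [[2.0, 0.0, 4.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
similarities = [cosine_similarity(query, doc) for doc in corpus]
best = max(range(len(corpus)), key=lambda i: similarities[i])
print(best, similarities)
```

In the distributed setting the same idea applies per entry: each dot product in the result matrix is divided by the norm of its query row times the norm of its corpus row.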
There are two problems here. First, it is rather expensive. Second, I am not sure it is really useful. To reduce the cost you could apply a dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement
NaiveBayes (...) seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)