mllib KMeans shows random behavior - Scala

I am using KMeans from Spark 1.6.1 (Scala) and observe random behavior.
From my understanding the only random part is the initialization of the initial centers, which I have addressed.
The experiment goes as follows: I run KMeans once and get a model - the first time, the centers are initialized randomly. After I get the model I run the following code:
//val latestModel: KMeansModel was trained earlier
val km:KMeans = new KMeans()
km.setK(numberOfClusters)
km.setMaxIterations(20)
if (previousModel != null) {
  if (latestModel.k == numberOfClusters) {
    logger.info("Using cluster centers from previous model")
    km.setInitialModel(latestModel) // Set initial cluster centers
  }
}
kmeansModel = KMeans.train(dataAfterPCA, numberOfClusters, 20)
println("Run#1")
kmeansModel.clusterCenters.foreach(t => println(t))
kmeansModel = KMeans.train(dataAfterPCA, numberOfClusters, 20)
println("Run#2")
kmeansModel.clusterCenters.foreach(t => println(t))
As you can see, I used the centers from latestModel and observed the printed output.
The cluster centers are different:
Run#1
[0.20910608631141306,0.2008812839967183,0.27863526709646663,0.17173268189352492,0.4068108508134425,1.5978368739711135,-0.03644171546864227,-0.034547377483902755,-0.30757069112989693,-0.04681453873202328,-0.03432819320158391,-0.0229510885384198,0.16155254061277455]
[-0.9986167379861676,-0.4228356715735266,-0.9797043073290139,-0.48157892793353135,-0.7818198908298358,-0.3991524190947045,-0.09142025949212684,-0.034547396992719734,-0.4149601436468508,-0.04681453873202326,56.38182990388363,-0.027308795774228338,-0.8567167533956337]
[0.40443230723847245,0.40753014996762926,0.48063940183378684,0.37038231765864527,0.615561235153811,-0.1546334408565992,1.1517155044090817,-0.034547396992719734,0.17947924999402878,22.44497403279252,-0.04625456310989393,-0.027308795774228335,0.3521192856019467]
[0.44614142085922764,0.39183992738993073,0.5599716298711428,0.31737580128115594,0.8674951428275776,0.799192261554564,1.090005738447001,-0.034547396992719734,-0.10481785131247881,-0.04681453873202326,-0.04625456310989393,41.936484571639795,0.4864344010461224]
[0.3506753428299332,0.3395786568210998,0.45443729624612045,0.3115089688709545,0.4762387976829325,11.3438592782776,0.04041394221229458,-0.03454735647587367,1.0065342405811888,-0.046814538732023264,-0.04625456310989393,-0.02730879577422834,0.19094114706893608]
[0.8238890515931856,0.8366348841253755,0.9862410283176735,0.7635549199270218,1.1877685458769478,0.7813626284105487,38.470668704621396,-0.03452467509554947,-0.4149294724823659,-0.04681453873202326,1.2214455451195836,-0.0212002267696445,1.1580099782670004]
[0.21425069771110813,0.22469514482272127,0.30113774986108593,0.182605001533264,0.4637631333393578,0.029033109984974183,-0.002029301682406235,-0.03454739699271971,2.397309416381941,0.011941957462594896,-0.046254563109893905,-0.018931196565979497,0.35297479589140024]
[-0.6546798328639079,-0.6358370654999287,-0.7928424675098332,-0.5071485895971765,-0.7400917528763642,-0.39717704681705857,-0.08938412993092051,-0.02346229974103403,-0.40690957159820434,-0.04681453873202331,-0.023692354206657835,-0.024758557139368385,-0.6068025631839297]
[-0.010895214450242299,-0.023949109470308646,-0.07602949287623037,-0.018356772906618683,-0.39876455727035937,-0.21260655806916112,-0.07991736890951397,-0.03454278343886248,-0.3644711133467814,-0.04681453873202319,-0.03250578362850749,-0.024761896110663685,-0.09605183996736125]
[0.14061295519424166,0.14152409771288327,0.1988841951819923,0.10943684592384875,0.3404665467004296,-0.06397788416055701,0.030711112793548753,0.044173951636969355,-0.08950950493941498,-0.039099833378049946,-0.03265898863536165,-0.02406954910363843,0.16029254891067157]
Run#2
[0.11726347529467256,0.11240236056044385,0.145845029386598,0.09061870140058333,0.15437020046635777,0.03499211466800115,-0.007112193875767524,-0.03449302405046689,-0.20652827212743696,-0.041880871009984943,-0.042927843040582066,-0.024409659630584803,0.10595250123068904]
[-0.9986167379861676,-0.4228356715735266,-0.9797043073290139,-0.48157892793353135,-0.7818198908298358,-0.3991524190947045,-0.09142025949212684,-0.034547396992719734,-0.4149601436468508,-0.04681453873202326,56.38182990388363,-0.027308795774228338,-0.8567167533956337]
[0.40443230723847245,0.40753014996762926,0.48063940183378684,0.37038231765864527,0.615561235153811,-0.1546334408565992,1.1517155044090817,-0.034547396992719734,0.17947924999402878,22.44497403279252,-0.04625456310989393,-0.027308795774228335,0.3521192856019467]
[0.44614142085922764,0.39183992738993073,0.5599716298711428,0.31737580128115594,0.8674951428275776,0.799192261554564,1.090005738447001,-0.034547396992719734,-0.10481785131247881,-0.04681453873202326,-0.04625456310989393,41.936484571639795,0.4864344010461224]
[0.056657434641233205,0.03626919750209713,0.1229690343482326,0.015190756508711958,-0.278078039715814,-0.3991255672375599,0.06613236052364684,28.98230095429352,-0.4149601436468508,-0.04681453873202326,-0.04625456310989393,-0.027308795774228338,-0.31945629161893124]
[0.8238890515931856,0.8366348841253755,0.9862410283176735,0.7635549199270218,1.1877685458769478,0.7813626284105487,38.470668704621396,-0.03452467509554947,-0.4149294724823659,-0.04681453873202326,1.2214455451195836,-0.0212002267696445,1.1580099782670004]
[-0.17971932675588306,-7.925508727413683E-4,-0.08990036350145142,-0.033456211225756705,-0.1514393713761394,-0.08538399305051374,-0.09132371177664707,-0.034547396992719734,-0.19858350916572132,-0.04681453873202326,4.873470425033645,-0.023394262810850164,0.15064661243568334]
[-0.4488579509785471,-0.4428314704219248,-0.5776049270843375,-0.3580559344350086,-0.6787807800457122,-0.378841125619109,-0.08742047856626034,-0.027746008987067004,-0.3951588549839565,-0.046814538732023264,-0.04625456310989399,-0.02448638761790114,-0.4757072927512256]
[0.2986301685357443,0.2895405124404614,0.39435230210861016,0.2549716029318805,0.5238783183359862,5.629286423487358,0.012002410566794644,-0.03454737293733725,0.1657346440290886,-0.046814538732023264,-0.03653898382838679,-0.025149508122450703,0.2715302163354414]
[0.2072253546037051,0.21958064267615496,0.29431697644435456,0.17741927849917147,0.4521349932664591,-0.010031680919536882,3.9433761322307554E-4,-0.03454739699271971,2.240412962951767,0.005598926623403161,-0.046254563109893905,-0.018412129948368845,0.33990882056156724]
I am trying to understand what the source of this random behavior is and how it can be avoided; I couldn't find anything in the Git source either.
Any ideas/suggestions? Stable behavior is mandatory for me.

It's normal. Each time you train the model, it will randomly initialize the parameters. If you set the number of iterations high enough, the runs will converge to similar centers.
You should call run() on the configured KMeans instance (km.run(dataAfterPCA)) rather than the static KMeans.train(...), which builds its own KMeans internally and never sees your initial model.
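A minimal sketch (assuming dataAfterPCA and latestModel from the question; the seed value is arbitrary and only matters when no initial model is set):
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

val km = new KMeans()
  .setK(numberOfClusters)
  .setMaxIterations(20)
  .setSeed(12345L) // fixes the random / k-means|| initialization when no initial model is given
if (latestModel != null && latestModel.k == numberOfClusters) {
  km.setInitialModel(latestModel) // start from the previous centers instead of random ones
}
val kmeansModel: KMeansModel = km.run(dataAfterPCA) // uses the configuration above, unlike KMeans.train(...)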

Related

How to apply word2vec for k-means clustering?

I am new to word2vec. By applying this method, I am trying to form clusters based on words extracted by word2vec from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via stanfordNLP and put each sentence on its own line in a text file. The text file required by deeplearning4j word2vec was then ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms or brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I have run word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        //System.out.println(sentence.toLowerCase());
        return sentence.toLowerCase();
    }
});
// Split on white spaces in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());
log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();
log.info("Fitting Word2Vec model....");
vec.fit();
log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words with their corresponding vector values, one per row, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file has been used to form some clusters via k-means in Spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfInterations = 100
//We use KMeans object provided by MLLib to run
val modell = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)
modell.clusterCenters.foreach(println)
//Get cluster index for each buyer Id
val AltCompByCluster = rawData.map { row =>
  (modell.predict(Vectors.dense(row.split(' ').slice(2, 101).map(_.toDouble))),
    row.split(',').slice(0, 1).head)
}
AltCompByCluster.foreach(println)
As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I checked my clusters, no obvious common words appeared. That is, I could not get reasonable clusters as I expected. Given this bottleneck, I have a few questions:
1) In some word2vec tutorials I have seen that no data cleaning is done. In other words, prepositions etc. are left in the text. So how should I apply a cleaning procedure when applying word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use word2vec word vectors as input to neural networks? If so which neural network (convolutional, recursive, recurrent) method would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

How do I view the datapoints that are added to a cluster after applying K-Means algorithm?

I have implemented the k-means algorithm in Scala as follows.
def clustering(clustnum: Int, iternum: Int, parsedData: RDD[org.apache.spark.mllib.linalg.Vector]): Unit = {
  val clusters = KMeans.train(parsedData, clustnum, iternum)
  println("The Cluster centers of each column for " + clustnum + " clusters and " + iternum + " iterations are:- ")
  clusters.clusterCenters.foreach(println)
  val predictions = clusters.predict(parsedData)
  predictions.collect()
}
I know how to print the cluster centers of each cluster, but is there a function in Scala which prints which rows have been assigned to which cluster?
The data I am working with contains rows of float values, with each row having an ID. It has around 34 columns and around 200 rows. I am working with Spark in Scala.
I need to be able to see the result.
As in: Id_1 is in cluster 1, and so on.
Edit: I was able to do this much:
println(clustnum+" clusters and "+iternum+" iterations ")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
It prints each row together with the cluster ID it is assigned to.
The row is shown as a string and the cluster ID is printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
But is there some way to just print the row ID and cluster ID only?
Would using dataframes help me here?
You can use the predict() function of KMeansModel.
Have a look at the documentation: http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel
In your code:
KMeans.train(parsedData, clustnum, iternum)
returns a KMeansModel object.
So, you can do this:
val predictions = clusters.predict(parsedData)
and get a MapPartitionsRDD as the result.
predictions.collect()
gives you an Array with the cluster index assignments.
println(clustnum+" clusters and "+iternum+" iterations ")
val vectorsAndClusterIdx = parsedData.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}
vectorsAndClusterIdx.collect().foreach(println)
This seems to solve my problem. It prints each row together with the cluster ID it is assigned to.
The row is shown as a string and the cluster ID is printed after it:
([1.0,1998.0,1.0,1.0,1.0,1.0,14305.0,39567.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],4)
([2.0,1998.0,1.0,1.0,2.0,1.0,185.0,2514.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
([3.0,1998.0,1.0,1.0,2.0,2.0,27.0,272.0,1998.0,23.499,25.7,27.961,29.04,28.061,26.171,24.44,24.619,24.529,24.497,23.838,22.322,1998.0,0.0,0.007,0.007,96.042,118.634,61.738,216.787,262.074,148.697,216.564,49.515,28.098],0)
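To print only the row ID and the cluster index, a minimal sketch (assuming the ID is the first element of each row, as in the sample output above):
val idAndCluster = parsedData.map { point =>
  val id = point(0).toLong // first column of the row is the ID
  (id, clusters.predict(point)) // (rowId, clusterIndex)
}
idAndCluster.collect().foreach { case (id, c) => println(s"Id_$id is in cluster $c") }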

Gaussian Mixture Model in scala spark 1.5.1 weights are always uniformly distributed

I implemented the default GMM model provided in MLlib for my algorithm.
I am repeatedly finding that the resulting weights are always equal, no matter how many clusters I initiate. Is there any specific reason why the weights are not being adjusted? Am I implementing it wrong?
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions
var colnames = df.columns
for (x <- colnames) {
  if (df.select(x).dtypes(0)._2.equals("StringType") || df.select(x).dtypes(0)._2.equals("LongType")) {
    df = df.drop(x)
  }
}
colnames = df.columns
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
var normalizer= new Normalizer().setInputCol("features").setOutputCol("normalizedfeatures").setP(2.0)
var normalizedOutput = normalizer.transform(output)
var temp = normalizedOutput.select("normalizedfeatures")
var outputs = temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("normalizedfeatures"))
var gmm = new GaussianMixture().setK(2).setMaxIterations(10000).setSeed(25).run(outputs)
Output code:
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
As a result, all points are being predicted into the same cluster.
var ol=gmm.predict(outputs).toDF
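One quick check (a minimal sketch, assuming outputs is the RDD[Vector] built above) is to count how many points land in each component:
val componentCounts = gmm.predict(outputs) // RDD[Int] of component indices
  .map(c => (c, 1L))
  .reduceByKey(_ + _)
  .collect()
  .sortBy(_._1)
componentCounts.foreach { case (k, n) => println(s"component $k: $n points") }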
I am also having this issue. The weights and the Gaussians are always the same. It seems independent of K.
My code is pretty simple. My data is 39-dimensional vectors of doubles. I just train like this...
val gmm = new GaussianMixture().setK(2).run(vectors)
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}
I tried KMeans, and it worked as expected. So I thought this has to be a bug with GaussianMixture.
But then I tried clustering just the first dimension, and it worked. Now I think it must be an EM issue with too little data... except I have lots.
Any GMM experts out there? How much data does one need for GaussianMixture with 39 dimensions?
Or is this a bug after all?

KMeans|| for sentiment analysis on Spark

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using word2vec and KMeans clustering. From word2vec I've got a collection of 20k word vectors in 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel implementation the algorithm took 3 hours! But with the random initialization strategy it took about 8 minutes.
What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000
maxIterations was 20
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}",data.count())
log.info("==================Train process started==================");
val clusterSize = modelSize/5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)
time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
And the Spark context initialization is here:
val conf = new SparkConf()
  .setAppName("SparkPreProcessor")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")
  .set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
Also, a few updates about running this program: I'm running it inside IntelliJ IDEA and I don't have a real Spark cluster, but I thought that a personal machine can serve as one.
I saw that the program hangs inside this loop from the Spark source file LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}
Initialisation using KMeans.K_MEANS_PARALLEL is more complicated than random initialisation. However, it shouldn't make such a big difference. I would recommend investigating whether it is the parallel initialisation algorithm that takes too much time (it should actually be more efficient than KMeans itself).
For information on profiling see:
http://spark.apache.org/docs/latest/monitoring.html
If it is not the initialisation that takes up the time, there is something seriously wrong. However, using random initialisation shouldn't be any worse for the final result (just less efficient!).
Actually, when you use KMeans.K_MEANS_PARALLEL to initialise, you should get reasonable results with 0 iterations. If this is not the case, there might be some regularities in the distribution of the data which send KMeans off track. Hence, if you haven't distributed your data randomly, you could also change that. However, such an impact would surprise me given a fixed number of iterations.
I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same - the problem is that the parallel KMeans initialization runs in N parallel runs, but it's still extremely slow for a small amount of data. My solution is to continue using random initialization.
Data size approximately: 4k clusters for 21k 100-dimensional vectors.
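For reference, switching to random initialization with a fixed seed looks like this (a minimal sketch reusing the kmeans and data values from the question; the seed value is arbitrary):
kmeans.setInitializationMode(KMeans.RANDOM)
kmeans.setSeed(42L) // fixed seed for reproducibility (available in newer MLlib versions)
val randomInitModel: KMeansModel = kmeans.run(data)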

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained dataset to filter another dataset using category similarity.
I am trying to train a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So, I am now trying to use TF-IDF, passing that output into a RowMatrix and computing the similarities. But I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus: RDD[Vector] = ??? // TF-IDF vectors computed from the corpus
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right up to here). I tried filtering the similarities i and j down to the parts of my query's TF-IDF and ended up with an empty collection.
The gist is that I want to train on a corpus of data and find what category it falls in. The above code is at least trying to get it down to one category and checking if I can get a prediction from that....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is the similarity between the query and the corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper:
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
  rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
).toCoordinateMatrix.toBlockMatrix
First, let's convert the query to an RDD and compute TF:
val query: Seq[String] = ??? // terms of the query (a single document here)
// wrap in an RDD of one document so HashingTF returns an RDD[Vector]
val queryTf = new HashingTF().transform(sc.parallelize(Seq(query)))
Next we can apply the IDF model and convert the result to a matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply the two, you get a matrix with the number of rows equal to the number of docs in the query and the number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by the product of the magnitudes, but you can probably handle that.
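A minimal sketch of that division (assuming queryTfidf and rddOfTfidfFromCorpus are the RDD[Vector]s used to build the two block matrices, and that they are small enough for their norms to be collected to the driver):
import org.apache.spark.mllib.linalg.Vectors

// L2 norms of each row, keyed by the same indices zipWithIndex assigned in toBlockMatrix
val queryNorms = queryTfidf.map(v => Vectors.norm(v, 2.0)).zipWithIndex.map(_.swap).collectAsMap()
val corpusNorms = rddOfTfidfFromCorpus.map(v => Vectors.norm(v, 2.0)).zipWithIndex.map(_.swap).collectAsMap()

val cosineSimilarities = dotProducts.toCoordinateMatrix.entries.map { e =>
  val denom = queryNorms(e.i) * corpusNorms(e.j)
  (e.i, e.j, if (denom == 0.0) 0.0 else e.value / denom)
}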
There are two problems here. First of all, it is rather expensive. Moreover, I am not sure if it is really useful. To reduce the cost you can apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement
NaiveBayes (...) seems to only work with the data you have, so it's predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)
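Note that predictions here is an RDD[Int] with one cluster index per query document; calling model.predict(rddOfTfidfFromCorpus) the same way gives the assignment of every corpus document, so matching indices tell you which corpus documents fall into the same cluster as the query.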