KMeans|| for sentiment analysis on Spark - scala

I'm trying to write a sentiment analysis program based on Spark. To do this I'm using Word2Vec and KMeans clustering. From Word2Vec I got a collection of 20k word vectors in a 100-dimensional space, and now I'm trying to cluster this vector space. When I run KMeans with the default parallel initialization (k-means||), the algorithm ran for 3 hours! With the random initialization strategy it took about 8 minutes.
What am I doing wrong? I have a MacBook Pro with a 4-core processor and 16 GB of RAM.
K ~= 4000
maxIterations was 20
var vectors: Iterable[org.apache.spark.mllib.linalg.Vector] =
  model.getVectors.map(entry => new VectorWithLabel(entry._1, entry._2.map(_.toDouble)))
val data = sc.parallelize(vectors.toIndexedSeq).persist(StorageLevel.MEMORY_ONLY_2)
log.info("Clustering data size {}", data.count())
log.info("==================Train process started==================")

val clusterSize = modelSize / 5
val kmeans = new KMeans()
kmeans.setInitializationMode(KMeans.K_MEANS_PARALLEL)
kmeans.setK(clusterSize)
kmeans.setRuns(1)
kmeans.setMaxIterations(50)
kmeans.setEpsilon(1e-4)

time = System.currentTimeMillis()
val clusterModel: KMeansModel = kmeans.run(data)
And the Spark context initialization is here:
val conf = new SparkConf()
.setAppName("SparkPreProcessor")
.setMaster("local[4]")
.set("spark.default.parallelism", "8")
.set("spark.executor.memory", "1g")
val sc = SparkContext.getOrCreate(conf)
Also, a few updates about running this program: I'm running it inside IntelliJ IDEA, and I don't have a real Spark cluster. But I thought a personal machine could act as a Spark cluster.
I saw that the program hangs inside this loop in Spark's LocalKMeans.scala:
// Initialize centers by sampling using the k-means++ procedure.
centers(0) = pickWeighted(rand, points, weights).toDense
for (i <- 1 until k) {
  // Pick the next center with a probability proportional to cost under current centers
  val curCenters = centers.view.take(i)
  val sum = points.view.zip(weights).map { case (p, w) =>
    w * KMeans.pointCost(curCenters, p)
  }.sum
  val r = rand.nextDouble() * sum
  var cumulativeScore = 0.0
  var j = 0
  while (j < points.length && cumulativeScore < r) {
    cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
    j += 1
  }
  if (j == 0) {
    logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
      s" Using duplicate point for center k = $i.")
    centers(i) = points(0).toDense
  } else {
    centers(i) = points(j - 1).toDense
  }
}

Initialisation using KMeans.K_MEANS_PARALLEL is more complicated than random initialisation. However, it shouldn't make such a big difference. I would recommend investigating whether it really is the parallel initialisation that takes so much time (it should actually be more efficient than the k-means run itself).
For information on profiling see:
http://spark.apache.org/docs/latest/monitoring.html
If it is not the initialisation that takes up the time, there is something seriously wrong. However, using random initialisation shouldn't make the final result any worse (it is just less efficient!).
Actually, when you use KMeans.K_MEANS_PARALLEL to initialise, you should already get reasonable results with 0 iterations. If this is not the case, there might be some regularities in the distribution of the data which send KMeans off track. Hence, if you haven't distributed your data randomly, you could also change that. However, such an impact would surprise me given a fixed number of iterations.
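For example, one rough way to check where the time goes (a hedged sketch, not from the original post): run with a single Lloyd iteration so the measurement is dominated by the initialization step, once per mode. The k value of 4000 is taken from the question; adjust to your data.
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def timeInit(mode: String, data: RDD[Vector], k: Int): Long = {
  val km = new KMeans()
    .setInitializationMode(mode) // KMeans.K_MEANS_PARALLEL or KMeans.RANDOM
    .setK(k)
    .setMaxIterations(1)         // keep the Lloyd iterations negligible
  val start = System.currentTimeMillis()
  km.run(data)
  System.currentTimeMillis() - start
}

println(s"k-means|| init took ${timeInit(KMeans.K_MEANS_PARALLEL, data, 4000)} ms")
println(s"random init took ${timeInit(KMeans.RANDOM, data, 4000)} ms")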

I've run Spark on AWS with 3 slaves (c3.xlarge) and the result is the same: the problem is that the parallel KMeans initialization runs the algorithm in N parallel runs, but it is still extremely slow for a small amount of data. My solution is to continue using random initialization.
Data size, approximately: 4k clusters for 21k 100-dimensional vectors.
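For reference, a sketch of that workaround with the same settings as in the question, only the initialization mode changed (untested against the original data):
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Same configuration as above, but with random initialization instead of k-means||.
val kmeans = new KMeans()
  .setInitializationMode(KMeans.RANDOM) // instead of KMeans.K_MEANS_PARALLEL
  .setK(clusterSize)
  .setRuns(1)
  .setMaxIterations(50)
  .setEpsilon(1e-4)
val clusterModel: KMeansModel = kmeans.run(data)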

Related

spark No space left on device when working on extremely large data

The following is my Scala Spark code:
val vertex = graph.vertices
val edges = graph.edges.map(v => (v.srcId, v.dstId)).toDF("key", "value")
var FMvertex = vertex.map(v => (v._1, HLLCounter.encode(v._1)))
var encodedVertex = FMvertex.toDF("keyR", "valueR")
var Degvertex = vertex.map(v => (v._1, 0.toLong))
var lastRes = Degvertex
// calculate FM of the next step
breakable {
  for (i <- 1 to MaxIter) {
    var N_pre = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    var adjacency = edges.join(
      encodedVertex, // FMvertex.toDF("keyR", "valueR"),
      $"value" === $"keyR"
    ).rdd.map(r => (r.getAs[VertexId]("key"), r.getAs[Array[Byte]]("valueR"))).reduceByKey((a, b) => HLLCounter.Union(a, b))
    FMvertex = FMvertex.union(adjacency).reduceByKey((a, b) => HLLCounter.Union(a, b))
    // update vertex encoding
    encodedVertex = FMvertex.toDF("keyR", "valueR")
    var N_curr = FMvertex.map(v => (v._1, HLLCounter.decode(v._2)))
    lastRes = N_curr
    var middleAns = N_curr.union(N_pre).reduceByKey((a, b) => Math.abs(a - b)) //.mapValues(x => x._1 - x._2)
    if (middleAns.values.sum() == 0) {
      println(i)
      break
    }
    Degvertex = Degvertex.join(middleAns).mapValues(x => x._1 + i * x._2) //.map(identity)
  }
}
val res = Degvertex.join(lastRes).mapValues(x => x._1.toDouble / x._2.toDouble)
return res
It uses several helper functions that I defined in Java:
import net.agkn.hll.HLL;
import com.google.common.hash.*;
import com.google.common.hash.Hashing;
import java.io.Serializable;

public class HLLCounter implements Serializable {
  private static int seed = 1234567;
  private static HashFunction hs = Hashing.murmur3_128(seed);
  private static int log2m = 15;
  private static int regwidth = 5;

  public static byte[] encode(Long id) {
    HLL hll = new HLL(log2m, regwidth);
    Hasher myhash = hs.newHasher();
    hll.addRaw(myhash.putLong(id).hash().asLong());
    return hll.toBytes();
  }

  public static byte[] Union(byte[] byteA, byte[] byteB) {
    HLL hllA = HLL.fromBytes(byteA);
    HLL hllB = HLL.fromBytes(byteB);
    hllA.union(hllB);
    return hllA.toBytes();
  }

  public static long decode(byte[] bytes) {
    HLL hll = HLL.fromBytes(bytes);
    return hll.cardinality();
  }
}
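For readers unfamiliar with the helper, here is a rough usage sketch from Scala (not part of the original post; the IDs are illustrative and HyperLogLog estimates are approximate):
// Each vertex starts as a sketch of just itself; neighbourhood sketches are
// merged with Union, and decode returns the estimated distinct count.
val a: Array[Byte] = HLLCounter.encode(1L)
val b: Array[Byte] = HLLCounter.encode(2L)
val merged: Array[Byte] = HLLCounter.Union(a, b)
println(HLLCounter.decode(merged)) // approximately 2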
This code is used for calculating Effective Closeness on a large graph, using the HyperLogLog package.
The code works fine when I run it on a graph with about ten million vertices and a hundred million edges. However, when I run it on a graph with thousands of millions of vertices and billions of edges, after several hours of running on the cluster it fails with:
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 91 in stage 29.1 failed 4 times, most recent failure: Lost task 91.3 in stage 29.1 (TID 17065, 9.10.135.216, executor 102): java.io.IOException: : No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
Can anybody help me? I only started using Spark a few days ago. Thank you for helping.
Xiaotian, you state "The shuffle read and shuffle write is about 1TB. I do not need those intermediate values or RDDs". This statement suggests that you are not familiar with Apache Spark, or possibly with the algorithm you are running. Please let me explain.
When adding three numbers, you have to make a choice about which two numbers to add first, for example (a+b)+c or a+(b+c). Once that choice is made, there is a temporary intermediate value held for the sum within the parentheses. It is not possible to continue the computation across all three numbers without that intermediate value.
The RDD is a space-efficient data structure. Each "new" RDD represents a set of operations across an entire data set. Some RDDs represent a single operation, like "add five", while others represent a chain of operations, like "add five, then multiply by six, then subtract seven". You cannot discard an RDD without discarding some portion of your mathematical algorithm.
At its core, Apache Spark is a scatter-gather algorithm. It distributes a data set to a number of worker nodes, where that data set is part of a single RDD that gets distributed, along with the needed computations. At this point in time, the computations are not yet performed. As the data is requested from the computed form of the RDD, the computations are performed on-demand.
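To make this laziness concrete, here is a hedged sketch (the input path and the numbers are purely illustrative, not from the question): declaring the transformations only builds up the chain, and nothing runs until an action is called.
// Illustrative only: transformations are recorded, not executed.
val raw      = sc.textFile("hdfs:///some/numbers.txt")  // hypothetical input
val plusFive = raw.map(_.trim.toLong + 5)               // an "add five" RDD
val chained  = plusFive.map(x => x * 6 - 7)             // "multiply by six, subtract seven"

// Only an action such as count (or collect, saveAsTextFile, ...) triggers the
// distributed computation described by the chain above.
println(chained.count())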
Occasionally, it is not possible to finish a computation on a single worker without knowing some of the intermediate values from other workers. Some cross-communication always happens between the head node (which distributes the data to the various workers, then collects and aggregates their results) and the workers; but, depending on how the algorithm is structured, it can also occur mid-computation (especially in algorithms that groupBy or join data slices).
You have an algorithm that requires shuffling, in such a way that a single node cannot collect the results from all of the other nodes, because that single node doesn't have enough RAM to hold the intermediate values coming from the others.
In short, you have an algorithm that can't scale to accommodate the size of your data set with the hardware you have available.
At this point, you need to go back to your Apache Spark algorithm and see if it is possible to:
Tune the partitions in the RDD to reduce the cross-talk (partitions that are too small might require more cross-talk during shuffling, as a fully connected inter-transfer grows at O(N^2); partitions that are too big might run out of RAM within a compute node). See the sketch after this list.
Restructure the algorithm so that full shuffling is not required (sometimes you can reduce in stages, going through more reduction phases, each combining less data).
Restructure the algorithm so that shuffling is not required at all (it is possible, but unlikely, that the algorithm is simply mis-written, and factoring it differently can avoid requesting remote data from a node's perspective).
If the problem is in collecting the results, rewrite the algorithm to return the results not to the head node's console, but to a distributed file system that can accommodate the data (like HDFS).
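A hedged sketch of the first and last points, reusing variable names from the question (the partition count and output path are made up and need tuning for your data):
// Illustrative: explicitly control the number of partitions before the
// shuffle-heavy join/reduceByKey steps; 2000 is a placeholder, not a recommendation.
val repartitionedFM = FMvertex.repartition(2000)

// Illustrative: write the final result to a distributed file system instead of
// collecting it on the driver.
res.saveAsTextFile("hdfs:///output/effective-closeness")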
Without the nuts and bolts of your Apache Spark program, access to your data set, and access to your Spark cluster and its logs, it's hard to know which one of these common approaches would benefit you the most, so I listed them all.
Good Luck!

mllib KMeans show random behavior

I am using KMeans from Spark MLlib 1.6.1 (Scala) and I observe random behavior.
From my understanding, the only random part is the initial center initialization, which I have addressed.
The experiment goes as follows: I run KMeans once and get a model; the first time, the centers are initialized randomly. After I get the model I run the following code:
//val latestModel: KMeansModel was trained earlier
val km: KMeans = new KMeans()
km.setK(numberOfClusters)
km.setMaxIterations(20)
if (previousModel != null) {
  if (latestModel.k == numberOfClusters) {
    logger.info("Using cluster centers from previous model")
    km.setInitialModel(latestModel) // set initial cluster centers
  }
}
kmeansModel = KMeans.train(dataAfterPCA, numberOfClusters, 20)
println("Run#1")
kmeansModel.clusterCenters.foreach(t => println(t))
kmeansModel = KMeans.train(dataAfterPCA, numberOfClusters, 20)
println("Run#2")
kmeansModel.clusterCenters.foreach(t => println(t))
As you can see, I use the centers from latestModel and observe the printed output.
The cluster centers are different:
Run#1
[0.20910608631141306,0.2008812839967183,0.27863526709646663,0.17173268189352492,0.4068108508134425,1.5978368739711135,-0.03644171546864227,-0.034547377483902755,-0.30757069112989693,-0.04681453873202328,-0.03432819320158391,-0.0229510885384198,0.16155254061277455]
[-0.9986167379861676,-0.4228356715735266,-0.9797043073290139,-0.48157892793353135,-0.7818198908298358,-0.3991524190947045,-0.09142025949212684,-0.034547396992719734,-0.4149601436468508,-0.04681453873202326,56.38182990388363,-0.027308795774228338,-0.8567167533956337]
[0.40443230723847245,0.40753014996762926,0.48063940183378684,0.37038231765864527,0.615561235153811,-0.1546334408565992,1.1517155044090817,-0.034547396992719734,0.17947924999402878,22.44497403279252,-0.04625456310989393,-0.027308795774228335,0.3521192856019467]
[0.44614142085922764,0.39183992738993073,0.5599716298711428,0.31737580128115594,0.8674951428275776,0.799192261554564,1.090005738447001,-0.034547396992719734,-0.10481785131247881,-0.04681453873202326,-0.04625456310989393,41.936484571639795,0.4864344010461224]
[0.3506753428299332,0.3395786568210998,0.45443729624612045,0.3115089688709545,0.4762387976829325,11.3438592782776,0.04041394221229458,-0.03454735647587367,1.0065342405811888,-0.046814538732023264,-0.04625456310989393,-0.02730879577422834,0.19094114706893608]
[0.8238890515931856,0.8366348841253755,0.9862410283176735,0.7635549199270218,1.1877685458769478,0.7813626284105487,38.470668704621396,-0.03452467509554947,-0.4149294724823659,-0.04681453873202326,1.2214455451195836,-0.0212002267696445,1.1580099782670004]
[0.21425069771110813,0.22469514482272127,0.30113774986108593,0.182605001533264,0.4637631333393578,0.029033109984974183,-0.002029301682406235,-0.03454739699271971,2.397309416381941,0.011941957462594896,-0.046254563109893905,-0.018931196565979497,0.35297479589140024]
[-0.6546798328639079,-0.6358370654999287,-0.7928424675098332,-0.5071485895971765,-0.7400917528763642,-0.39717704681705857,-0.08938412993092051,-0.02346229974103403,-0.40690957159820434,-0.04681453873202331,-0.023692354206657835,-0.024758557139368385,-0.6068025631839297]
[-0.010895214450242299,-0.023949109470308646,-0.07602949287623037,-0.018356772906618683,-0.39876455727035937,-0.21260655806916112,-0.07991736890951397,-0.03454278343886248,-0.3644711133467814,-0.04681453873202319,-0.03250578362850749,-0.024761896110663685,-0.09605183996736125]
[0.14061295519424166,0.14152409771288327,0.1988841951819923,0.10943684592384875,0.3404665467004296,-0.06397788416055701,0.030711112793548753,0.044173951636969355,-0.08950950493941498,-0.039099833378049946,-0.03265898863536165,-0.02406954910363843,0.16029254891067157]
Run#2
[0.11726347529467256,0.11240236056044385,0.145845029386598,0.09061870140058333,0.15437020046635777,0.03499211466800115,-0.007112193875767524,-0.03449302405046689,-0.20652827212743696,-0.041880871009984943,-0.042927843040582066,-0.024409659630584803,0.10595250123068904]
[-0.9986167379861676,-0.4228356715735266,-0.9797043073290139,-0.48157892793353135,-0.7818198908298358,-0.3991524190947045,-0.09142025949212684,-0.034547396992719734,-0.4149601436468508,-0.04681453873202326,56.38182990388363,-0.027308795774228338,-0.8567167533956337]
[0.40443230723847245,0.40753014996762926,0.48063940183378684,0.37038231765864527,0.615561235153811,-0.1546334408565992,1.1517155044090817,-0.034547396992719734,0.17947924999402878,22.44497403279252,-0.04625456310989393,-0.027308795774228335,0.3521192856019467]
[0.44614142085922764,0.39183992738993073,0.5599716298711428,0.31737580128115594,0.8674951428275776,0.799192261554564,1.090005738447001,-0.034547396992719734,-0.10481785131247881,-0.04681453873202326,-0.04625456310989393,41.936484571639795,0.4864344010461224]
[0.056657434641233205,0.03626919750209713,0.1229690343482326,0.015190756508711958,-0.278078039715814,-0.3991255672375599,0.06613236052364684,28.98230095429352,-0.4149601436468508,-0.04681453873202326,-0.04625456310989393,-0.027308795774228338,-0.31945629161893124]
[0.8238890515931856,0.8366348841253755,0.9862410283176735,0.7635549199270218,1.1877685458769478,0.7813626284105487,38.470668704621396,-0.03452467509554947,-0.4149294724823659,-0.04681453873202326,1.2214455451195836,-0.0212002267696445,1.1580099782670004]
[-0.17971932675588306,-7.925508727413683E-4,-0.08990036350145142,-0.033456211225756705,-0.1514393713761394,-0.08538399305051374,-0.09132371177664707,-0.034547396992719734,-0.19858350916572132,-0.04681453873202326,4.873470425033645,-0.023394262810850164,0.15064661243568334]
[-0.4488579509785471,-0.4428314704219248,-0.5776049270843375,-0.3580559344350086,-0.6787807800457122,-0.378841125619109,-0.08742047856626034,-0.027746008987067004,-0.3951588549839565,-0.046814538732023264,-0.04625456310989399,-0.02448638761790114,-0.4757072927512256]
[0.2986301685357443,0.2895405124404614,0.39435230210861016,0.2549716029318805,0.5238783183359862,5.629286423487358,0.012002410566794644,-0.03454737293733725,0.1657346440290886,-0.046814538732023264,-0.03653898382838679,-0.025149508122450703,0.2715302163354414]
[0.2072253546037051,0.21958064267615496,0.29431697644435456,0.17741927849917147,0.4521349932664591,-0.010031680919536882,3.9433761322307554E-4,-0.03454739699271971,2.240412962951767,0.005598926623403161,-0.046254563109893905,-0.018412129948368845,0.33990882056156724]
I am trying to understand the source of this random behavior and how it can be avoided; I couldn't find anything in the Git source either.
Any ideas/suggestions? Having stable behavior is mandatory for me.
It's normal. Each time you train the model, it will randomly initialize the parameters. If you set the number of iterations big enough, the runs will converge to similar centers.
You should use km.run(dataAfterPCA) (the run method on your configured km instance), not the static KMeans.train(), which ignores the initial model you set.
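A sketch of that fix, reusing the variable names from the question (this is my reading of the answer, not verified against your data):
// The static KMeans.train(...) builds a fresh KMeans object internally and
// therefore ignores setInitialModel; calling run on the configured instance does not.
kmeansModel = km.run(dataAfterPCA)
println("Run#1")
kmeansModel.clusterCenters.foreach(println)

kmeansModel = km.run(dataAfterPCA)
println("Run#2")
kmeansModel.clusterCenters.foreach(println)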

Different result returned using Scala Collection par in a series of runs

I have tasks that I want to execute concurrently, and each task takes a substantial amount of memory, so I have to execute them in batches of 2 to conserve memory.
def runme(n: Int = 120) = (1 to n).grouped(2).toList.flatMap { tuple =>
  tuple.par.map { x =>
    println(s"Running $x")
    val s = (1 to 100000).toList // intentionally to make the JVM allocate a sizeable chunk of memory
    s.sum.toLong
  }
}

val result = runme()
println(result.size + " => " + result.sum)
The result I expected was 120 => 84609924480, but the output was rather random. The returned collection size differed from execution to execution; most of the time some results were missing, even though, judging by the console output, all the tasks were executed. I thought flatMap waits for the parallel executions inside map to complete before returning the combined result. What should I do to always get the right result using par? Thanks
Just for the record: changing the underlying collection in this case shouldn't change the output of your program. The problem is related to this known bug. It is fixed as of 2.11.6, so if you use that (or a higher) Scala version, you should not see the strange behavior.
And about the overflow: I still think that your expected value is wrong. You can check that the sum is overflowing, because the list contains Ints (which are 32-bit) while the total sum exceeds the Int range. You can check it with the following snippet:
val n = 100000
val s = (1 to n).toList                 // your original code
val yourValue = s.sum.toLong            // your original code
val correctValue = 1L * n * (n + 1) / 2 // use the math formula
var bruteForceValue = 0L                // in case you don't trust math :) It's Long because of 0L
for (i <- 1 to n) bruteForceValue += i  // iterate through the range
println(s"yourValue = $yourValue")
println(s"correctvalue = $correctValue")
println(s"bruteForceValue = $bruteForceValue")
which produces the output
yourValue = 705082704
correctvalue = 5000050000
bruteForceValue = 5000050000
Cheers!
Thanks @kaktusito.
It worked after I changed the grouped list to a Vector or Seq, i.e. from (1 to n).grouped(2).toList.flatMap{... to (1 to n).grouped(2).toVector.flatMap{...

Memory efficient way of union a sequence of RDDs from Files in Apache Spark

I'm currently trying to train a set of Word2Vec vectors on the UMBC WebBase corpus (around 30 GB of text in 400 files).
I often run into out-of-memory situations, even on machines with 100 GB of RAM or more. I run Spark in the application itself. I tried to tweak things a little, but I am not able to perform this operation on more than 10 GB of textual data. The clear bottleneck of my implementation is the union of the previously computed RDDs; that is where the out-of-memory exception comes from.
Maybe one of you has the experience to come up with a more memory-efficient implementation than this:
object SparkJobs {
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")
  val sc = new SparkContext(conf)

  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)
    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par
    var i = 0
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    props.setProperty("nthreads", "2")
    val pipeline = new StanfordCoreNLP(props)
    // preprocess files in parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map(file => {
      // preprocess each line of the file
      println(file.getName() + "-" + file.getTotalSpace())
      val rdd_lines: Iterator[Option[Seq[String]]] = for (line <- Source.fromFile(file, "utf-8").getLines) yield {
        // performs some preprocessing like tokenization, stop word filtering etc.
        processWebBaseLine(pipeline, line)
      }
      val filtered_rdd_lines = rdd_lines.filter(line => line.isDefined).map(line => line.get).toList
      println(s"File $i done")
      i = i + 1
      sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_ONLY_SER)
    })

    val rdd_file = sc.union(training_data_raw.seq)
    val starttime = System.currentTimeMillis()
    println("Start Training")
    val word2vec = new Word2Vec()
    word2vec.setVectorSize(100)
    val model: Word2VecModel = word2vec.fit(rdd_file)
    println("Training time: " + (System.currentTimeMillis() - starttime))
    ModelUtil.storeWord2VecModel(model, Config.WORD2VEC_MODEL_PATH)
  }
}
Like Sarvesh points out in the comments, it is probably too much data for a single machine. Use more machines. We typically see the need for 20–30 GB of memory to work with a file of 1 GB. By this (extremely rough) estimate you'd need 600–800 GB of memory for the 30 GB input. (You can get a more accurate estimate by loading a part of the data.)
As a more general comment, I'd suggest you avoid using rdd.union and sc.parallelize. Instead, use sc.textFile with a wildcard to load all the files into a single RDD.
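A hedged sketch of that suggestion (it reuses path and processWebBaseLine from the question; the "*.txt" pattern is a placeholder, and the CoreNLP pipeline is built per partition here because it is not serializable):
import java.util.Properties
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

val corpus: RDD[Seq[String]] =
  sc.textFile(path + "/*.txt")                  // one RDD over all 400 files
    .mapPartitions { lines =>
      val props = new Properties()
      props.setProperty("annotators", "tokenize, ssplit")
      val pipeline = new StanfordCoreNLP(props)  // built once per partition
      lines.flatMap(line => processWebBaseLine(pipeline, line)) // Option -> 0 or 1 results
    }

val model = new Word2Vec().setVectorSize(100).fit(corpus)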
Have you tried getting Word2Vec vectors from a smaller corpus? I mention this because I was running the Spark Word2Vec implementation on a much smaller corpus and ran into problems because of this issue: http://mail-archives.apache.org/mod_mbox/spark-issues/201412.mbox/%3CJIRA.12761684.1418621192000.36769.1418759475999#Atlassian.JIRA%3E
So for my use case that issue made the Spark Word2Vec implementation a bit useless. Thus I used Spark for massaging my corpus, but not for actually getting the vectors.
As others suggested, stay away from calling rdd.union.
Also, I think .toList will probably gather every line and collect it on your driver machine (the one used to submit the task); this is probably why you are running out of memory. You should totally avoid turning the data into a List!

Adding immutable Vectors

I am trying to work more with Scala's immutable collections, since they are easy to parallelize, but I struggle with some newbie problems. I am looking for a way to (efficiently) create a new Vector from an operation. To be precise, I want something like
val v : Vector[Double] = RandomVector(10000)
val w : Vector[Double] = RandomVector(10000)
val r = v + w
I tested the following:
// 1)
val r: Vector[Double] = (v.zip(w)).map { t: (Double, Double) => t._1 + t._2 }

// 2)
val vb = new VectorBuilder[Double]()
var i = 0
while (i < v.length) {
  vb += v(i) + w(i)
  i = i + 1
}
val r = vb.result
Both take really long compared to working with an Array:
[Vector Zip/Map ] Elapsed time 0.409 msecs
[Vector While Loop] Elapsed time 0.374 msecs
[Array While Loop ] Elapsed time 0.056 msecs
// with warm-up (10000) and avg. over 10000 runs
Is there a better way to do it? I think the zip/map/reduce approach has the advantage that it can run in parallel as soon as the collections support it.
Thanks
Vector is not specialized for Double, so you're going to pay a sizable performance penalty for using it. If you are doing a simple operation, you're probably better off using an array on a single core than a Vector or other generic collection on the entire machine (unless you have 12+ cores). If you still need parallelization, there are other mechanisms you can use, such as using scala.actors.Futures.future to create instances that each do the work on part of the range:
val a = Array(1, 2, 3, 4, 5, 6, 7, 8)
(0 to 4).map(_ * (a.length / 4)).sliding(2).map(i => scala.actors.Futures.future {
  var s = 0
  var j = i(0)
  while (j < i(1)) {
    s += a(j)
    j += 1
  }
  s
}).map(_()).sum // _() applies the future--blocks until it's done
Of course, you'd need to use this on a much longer array (and on a machine with four cores) for the parallelization to improve things.
You should use lazily built collections when you chain more than one higher-order method:
v1.view zip v2 map { case (a,b) => a+b }
If you don't use a view or an iterator, each method creates a new intermediate immutable collection, even when it is not needed.
Immutable code probably won't be as fast as mutable code, but the lazy collection will improve the execution time of your code a lot.
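A minimal sketch of the difference (the sizes are arbitrary, not from the question): the strict version materialises an intermediate Vector of pairs for zip before map runs; the view-based version does not.
import scala.util.Random

val v1 = Vector.fill(10000)(Random.nextDouble())
val v2 = Vector.fill(10000)(Random.nextDouble())

// Strict: zip allocates a full Vector[(Double, Double)] that is thrown away after map.
val strict = v1.zip(v2).map { case (a, b) => a + b }

// Lazy: zip and map are fused on the view; only toVector materialises a collection.
val fused = (v1.view zip v2).map { case (a, b) => a + b }.toVector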
Arrays are not type-erased, Vectors are. Basically, the JVM gives Array an advantage over other collections when handling primitives, and that advantage cannot be overcome. Scala's specialization might reduce it, but, given its cost in code size, it can't be used everywhere.