I have a Spark application which cleans and prepares a data set and then applies a K-means clustering algorithm to it. Afterwards, some metrics of the resulting clusters are calculated.
Naturally, computing the K-means clusters takes a long time. When debugging the metric calculations I cannot iterate quickly on my code, because the clusters are recomputed on every execution. How do I solve this?
Ideas I have are:
Writing a unit test for the metric calculation methods, but it would be cumbersome to mock realistic cluster data by hand.
Serialising the computed K-means cluster data to disk (a sketch of this follows the code below).
Any help is appreciated.
Code for reference:
def main(args: Array[String]): Unit = {
  // -- start of long-running execution
  val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
  val raw = rawPostings(lines)
  val grouped = groupedPostings(raw)
  val scored = scoredPostings(grouped)
  val vectors = vectorPostings(scored)
  // assert(vectors.count() == 2121822, "Incorrect number of vectors: " + vectors.count())
  val means = kmeans(sampleVectors(vectors), vectors, debug = true)
  // -- end of long-running execution
  val results = clusterResults(means, vectors) // <-- this method operates on the result of the previous ops
  printResults(results)
}
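One way to act on the second idea is to persist the expensive intermediates once and reload them on later runs, so that only clusterResults is re-executed while debugging. A minimal sketch, assuming LangIndex and HighScore are aliases for Int and using placeholder cache paths:

val vectorsPath = "target/cache/vectors"
val meansPath = "target/cache/means"

val (means, vectors) =
  if (new java.io.File(vectorsPath).exists()) {
    // Fast path: reload the previously computed results from disk.
    val vectors = sc.objectFile[(Int, Int)](vectorsPath).cache()
    val means = sc.objectFile[(Int, Int)](meansPath).collect()
    (means, vectors)
  } else {
    // Slow path: compute once, then save for the next debugging iteration.
    val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
    val vectors = vectorPostings(scoredPostings(groupedPostings(rawPostings(lines)))).cache()
    vectors.saveAsObjectFile(vectorsPath)
    val means = kmeans(sampleVectors(vectors), vectors, debug = true)
    sc.parallelize(means.toSeq).saveAsObjectFile(meansPath)
    (means, vectors)
  }

val results = clusterResults(means, vectors)
printResults(results)

Deleting the cache directory forces a full recompute. Note that a plain vectors.persist() only helps within a single run, whereas writing to disk lets you skip the K-means step across runs.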
Implementation of clusterResults:
def clusterResults(means: Array[(Int, Int)], vectors: RDD[(LangIndex, HighScore)]): Array[(String, Double, Int, Int)] = {
  // -- Note that means is quite expensive to compute
  val closest = vectors.map(p => (findClosest(p, means), p)) // -- (Int, (LangIndex, HighScore))
  val closestGrouped = closest.groupByKey()                  // -- (Int, Iterable[(LangIndex, HighScore)])
  vectors.take(3).foreach(println)
  val median = closestGrouped.mapValues { vs =>
    // Predef.identity is equivalent to x => x, so groupBy(identity) buckets equal language indices together
    val langId: Int = vs.map(_._1).groupBy(identity).maxBy(_._2.size)._1 // most common language in the cluster
    val langLabel: String = langs(langId / langSpread)
    // percent of the questions in the most common language; note the .toDouble to avoid integer division
    val langPercent: Double = vs.map(_._1).count(_ == langId).toDouble / vs.size
    val clusterSize: Int = vs.size
    // median of the scores (the original line was truncated; this is one straightforward way to compute it)
    val sortedScores = vs.map(_._2).toIndexedSeq.sorted
    val medianScore: Int =
      if (sortedScores.length % 2 == 1) sortedScores(sortedScores.length / 2)
      else (sortedScores(sortedScores.length / 2 - 1) + sortedScores(sortedScores.length / 2)) / 2
    (langLabel, langPercent, clusterSize, medianScore)
  }
  median.collect().map(_._2).sortBy(_._4)
}
Related
Goal
I have a mutable Map[Long, Long] with millions of entries. I need to make many iterations of updates with millions of updates. I would like to do this as fast as possible.
Background
Currently, the fastest method is to use a single threaded mutable.LongMap[Long]. This type is optimized for Long types as the key.
Other map types appear to be slower -- but I may have implemented them incorrectly, as I was trying to do the updates concurrently and/or in parallel without success. It is possible that the updates are not actually happening in parallel, or that updating a map in parallel is simply not possible in Scala.
In order of fastest to slowest:
LongMap[Long] (from above)
TrieMap[Long, Long]
ParTrieMap[Long, Long]
HashMap[Long, Long]
ParHashMap[Long, Long]
ParMap[Long, Long]
It is OK if a faster method is not mutable, but I do not think this will be the case. A mutable map is probably best for this use case.
Code to generate test data and time the test
import java.util.Calendar

import scala.collection.mutable

object DictSpeedTest2 {
  //helper constants
  val million: Long = 1000000
  val billion: Long = million * 1000

  //config
  val garbageCollectionWait = 3
  val numEntries: Long = million * 10 //may need to increase JVM memory with something like: -Xmx32g
  val maxValue: Long = billion * million // = 1000000000000000L (max Long is 9223372036854775807L)

  def main(args: Array[String]): Unit = {
    //generate random data; initial entries in a; updates in b
    val a = genData(numEntries, maxValue, seed = 1000)
    val b = genData(numEntries, maxValue, seed = 9999)

    //initialization
    val dict = new mutable.LongMap[Long]()
    a.foreach(x => dict += (x._1 -> x._2))

    //run and time test
    println("start test: " + Calendar.getInstance().getTime)
    val start = System.currentTimeMillis
    b.foreach(x => dict += (x._1 -> x._2)) //updates
    val end = System.currentTimeMillis

    //print runtime
    val durationInSeconds = (end - start).toFloat / 1000 + "s"
    println("end test: " + Calendar.getInstance().getTime + " -- " + durationInSeconds)
  }

  def genData(n: Long, max: Long, seed: Long): Array[(Long, Long)] = {
    val r = scala.util.Random
    r.setSeed(seed) //deterministic generation of arrays
    val a = new Array[(Long, Long)](n.toInt)
    // bounded Long draws in [0, max); the original used r.nextInt(), which ignored the max parameter
    a.map(_ => (math.abs(r.nextLong() % max), math.abs(r.nextLong() % max)))
  }
}
Current timings
LongMap[Long] with the above code completes in the following times on my 2018 MacBook Pro:
~3.5 seconds with numEntries = 10 million
~100 seconds with numEntries = 100 million
If you are not limited to Scala/Java maps, then for exceptional performance you can pick a third-party library that has maps specialized for Long/Long key/value pairs.
Here is a not-too-outdated overview of such libraries, with benchmark results for Int/Int pairs.
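As an illustration, here is a sketch of swapping a primitive-specialised third-party map into the benchmark above; fastutil's Long2LongOpenHashMap is one such option (the dependency coordinates below are an assumption to be checked against the current release):

// Assumes fastutil is on the classpath, e.g. libraryDependencies += "it.unimi.dsi" % "fastutil" % "<version>"
import it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap

val dict = new Long2LongOpenHashMap(numEntries.toInt) // pre-size to avoid rehashing during the load
a.foreach { case (k, v) => dict.put(k, v) } // initial entries
b.foreach { case (k, v) => dict.put(k, v) } // the timed updates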
I have used MinHashLSH with approximateSimilarityJoin in Scala and Spark 2.4 to find edges in a network (link prediction based on document similarity). My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and F1 score decrease. Everything I have read about this algorithm suggests I have an issue.
I have tried different numbers of hash tables and different Jaccard similarity thresholds, but I have the exact same problem: the accuracy decreases rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow goes like this: I concatenate all the text columns of my dataframe (title, authors, journal and abstract) and then tokenize the concatenated column into words. Then I use a CountVectorizer to transform this "bag of words" into vectors. Next, I feed this column into MinHashLSH with some number of hash tables, and finally I do an approximateSimilarityJoin to find similar "papers" which are under my given threshold. My implementation is the following.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._
object lsh {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors
// val cores=args(0).toInt
// val partitions=args(1).toInt
// val hashTables=args(2).toInt
// val limit = args(3).toInt
// val threshold = args(4).toDouble
val cores="*"
val partitions=1
val hashTables=16
val limit = 1000
val jaccardDistance = 0.89
val master = "local["+cores+"]"
val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
val sc = ss.sparkContext
val inputFile = "resources/data/node_information.csv"
println("reading from input file: " + inputFile)
println
val schemaStruct = StructType(
StructField("id", IntegerType) ::
StructField("pubYear", StringType) ::
StructField("title", StringType) ::
StructField("authors", StringType) ::
StructField("journal", StringType) ::
StructField("abstract", StringType) :: Nil
)
// Read the contents of the csv file in a dataframe. The csv file contains a header.
var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
papers = papers.repartition(partitions) // repartition returns a new DataFrame; without reassignment the call has no effect
println("papers.rdd.getNumPartitions: " + papers.rdd.getNumPartitions)
import ss.implicits._
// Read the original graph edges, ground trouth
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
val fields = line.split("\t")
(fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id").cache()
val originalGraphCount = originalGraphDF.count()
println("Ground truth count: " + originalGraphCount )
val nullAuthor = ""
val nullJournal = ""
val nullAbstract = ""
papers = papers.na.fill(nullAuthor, Seq("authors"))
papers = papers.na.fill(nullJournal, Seq("journal"))
papers = papers.na.fill(nullAbstract, Seq("abstract"))
papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
papers.show(false)
val filteredGt= originalGraphDF.as("g").join(papers.as("p"),(
$"g.nodeA_id" ===$"p.id") || ($"g.nodeB_id" ===$"p.id")
).select("g.nodeA_id","g.nodeB_id").distinct().cache()
filteredGt.show()
val filteredGtCount = filteredGt.count()
println("Filtered GroundTruth count: "+ filteredGtCount)
//TOKENIZE
val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")
println("Setting pipeline stages...")
val stages = Array(
tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
// rTitle, rAuthors, rJournal, rAbstract
)
val pipeline = new Pipeline()
pipeline.setStages(stages)
println("Transforming dataframe\n")
val model = pipeline.fit(papers)
papers = model.transform(papers)
println(papers.count())
papers.show(false)
papers.printSchema()
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
val joinedDf = papers.withColumn(
"paper_data",
udf_join_cols(
papers("pubYear_words"),
papers("title_words"),
papers("authors_words"),
papers("journal_words"),
papers("abstract_words")
)
).select("id", "paper_data").cache()
joinedDf.show(5,false)
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
vectorizedDf.show()
val mh = new MinHashLSH().setNumHashTables(hashTables)
.setInputCol("features").setOutputCol("hashValues")
val mhModel = mh.fit(vectorizedDf)
mhModel.transform(vectorizedDf).show()
vectorizedDf.createOrReplaceTempView("vecDf")
println("MinHashLSH.getHashTables: "+mh.getNumHashTables)
val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
dfA.show(false)
val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
dfB.show(false)
val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
println("Predictions:")
val predictionsCount = predictionsDF.count()
predictionsDF.show()
println("Predictions count: "+predictionsCount)
predictionsDF.createOrReplaceTempView("predictions")
val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
pairs.show(false)
val totalPredictions = pairs.count()
println("Properties:\n")
println("Threshold: "+threshold+"\n")
println("Hahs tables: "+hashTables+"\n")
println("Ground truth: "+filteredGtCount)
println("Total edges found: "+totalPredictions +" \n")
println("EVALUATION PROCESS STARTS\n")
println("Calculating true positives...\n")
val truePositives = filteredGt.as("g").join(pairs.as("p"),
($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
).cache().count()
println("True Positives: "+truePositives+"\n")
println("Calculating false positives...\n")
val falsePositives = predictionsCount - truePositives
println("False Positives: "+falsePositives+"\n")
println("Calculating true negatives...\n")
val pairsPerTwoCount = (limit *(limit - 1)) / 2
val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
println("True Negatives: "+trueNegatives+"\n")
val falseNegatives = filteredGtCount - truePositives
println("False Negatives: "+falseNegatives)
val truePN = (truePositives+trueNegatives).toFloat
println("TP + TN sum: "+truePN+"\n")
val sum = (truePN + falseNegatives+ falsePositives).toFloat
println("TP +TN +FP+ FN sum: "+sum+"\n")
val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy+"\n")
val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat
val f1Score = 2*(recall*precision)/(recall+precision).toFloat
println("F1 score: "+f1Score+"\n")
ss.stop()
} // end of main
} // end of object lsh
I forgot to mention that I am running this code on a cluster with 40 cores and 64 GB of RAM. Note that approximate similarity join (Spark's implementation) works with the JACCARD DISTANCE and not with the JACCARD INDEX, so the threshold I provide is the Jaccard distance, which in my case is jaccardDistance = 1 - threshold (where threshold is the Jaccard index).
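Just to make that conversion explicit, a tiny sketch using the value from the code above:

val jaccardIndexThreshold = 0.11                 // the minimum Jaccard similarity I actually want
val jaccardDistance = 1 - jaccardIndexThreshold  // = 0.89, the distance threshold passed to approxSimilarityJoin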
I was expecting to get higher accuracy and F1 score as I increase the number of hash tables. Do you have any idea what my issue is?
Thank you all in advance!
There are multiple visible problems here, and probably more hidden ones, so just to enumerate a few:
LSH is not really a classifier, and attempting to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is a big if).
If the problem were to be framed as a classification problem, it should be treated as multi-label classification (each paper can cite or be cited by multiple sources), not multi-class classification, hence simple accuracy is not meaningful.
Even if it were a classification problem and could be evaluated as such, your calculations don't include the actual negatives, i.e. the pairs which don't meet the threshold of the approxSimilarityJoin.
Also, setting the threshold to 1 restricts the join to either exact matches or hash collisions, hence the preference towards LSH settings with higher collision rates.
Additionally:
The text-processing approach you took is rather pedestrian and favours non-specific features (remember that you are not optimizing for your actual goal, but for text similarity).
Such an approach, especially one treating all fields as equal, discards the majority of the useful information in the set, primarily, but not limited to, temporal relationships.
I would like to ask for help identifying which part of my code is inefficient. I am comparing the QuickSort algorithm with the CountingSort algorithm, assuming that the number of elements in an Array[Byte] is less than 16.
However, CountingSort takes much longer than QuickSort in all the tests I have performed sequentially. I then wanted to test this code in Spark to compute the Median Filter, but the distributed execution times are consistent with the sequential ones: QuickSort is always faster than CountingSort, even for smaller arrays.
Evidently something in my code is slowing down the final processing.
This is the code:
def Histogram(Input: Array[Byte]): Array[Int] = {
  val result = Array.ofDim[Int](256)
  val range = Input.distinct.map(x => x & 0xFF)
  val mx = Input.map(x => x & 0xFF).max
  for (h <- range)
    result(h) = Input.count(x => (x & 0xFF) == h)
  result.slice(0, mx + 1)
}

def CummulativeSum(Input: Array[Int]): Array[Long] =
  Input.map(x => x.toLong).scanLeft(0.toLong)(_ + _).drop(1)

def CountingSort(Input: Array[Byte]): Array[Byte] = {
  val hist = Histogram(Input)
  val cum = CummulativeSum(hist)
  val Output = Array.fill[Byte](Input.length)(0)
  for (i <- Input.indices) {
    Output(cum(Input(i) & 0xFF).toInt - 1) = Input(i)
    cum(Input(i) & 0xFF) -= 1
  }
  Output
}
You can build your histogram without traversing the input quite so many times.
def histogram(input: Array[Byte]): Array[Int] = {
  val inputMap: Map[Int, Array[Byte]] = input.groupBy(_ & 0xFF)
                                             .withDefaultValue(Array())
  Array.tabulate(inputMap.keys.max + 1)(inputMap(_).length)
}
I'm not sure if this is much faster, but it is certainly more concise.
def countingSort(input: Array[Byte]): Array[Byte] =
  histogram(input).zipWithIndex.flatMap { case (v, x) => Seq.fill(v)(x.toByte) }
My tests show it produces the same results but there could be edge conditions that I've missed.
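If it helps, here is a rough sketch of how the two sorts could be timed outside Spark (JMH would give more trustworthy numbers; the array size and iteration count are arbitrary choices for illustration):

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val rnd = new scala.util.Random(42)
val small = Array.fill(16)(rnd.nextInt(128).toByte) // non-negative bytes so signed and unsigned orderings agree

time("CountingSort")((1 to 100000).foreach(_ => CountingSort(small)))
time("Arrays.sort (dual-pivot quicksort)")((1 to 100000).foreach(_ => java.util.Arrays.sort(small.clone())))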
I am new to Spark, but I am attempting to produce network clusters from user-supplied tags or attributes. First I use the Jaccard MinHash algorithm to produce similarity scores, and then I run them through the power iteration clustering algorithm. But as soon as it starts, there is no CPU activity, and it runs for a long time with zero progress. I am wondering how to configure the cluster or change the code to get this to run. Below is my code:
//about 10,000 rows of (id, 100 tags in binary form)
val data = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("inferSchema","true").load("gs://data/*.csv")
val columnNames = data.columns
val tags = columnNames.slice(1, columnNames.size)
//put tags in a vector
val assembler = new VectorAssembler().setInputCols(tags).setOutputCol("attributes")
val newData = assembler.transform(data).select("userID","attributes")
val mh = new MinHashLSH().setNumHashTables(5).setInputCol("attributes").setOutputCol("values")
val modelMINHASH = mh.fit(goodData) // note: goodData is not defined in this snippet; presumably newData (perhaps after some filtering not shown)
// Approximate nearest neighbor search
val fullData = modelMINHASH.approxSimilarityJoin(newData , newData , 0.9).filter("datasetA.userID < datasetB.userID")
var explodeDF = fullData.withColumn("id", fullData("datasetA.userID")).withColumn("id2", fullData("datasetB.userID")).select("id","id2","distCol")
val temp = explodeDF.rdd
val newRDD = temp.map(x => (x.getAs[Integer]("id").longValue(),x.getAs[Integer]("id2").longValue(),1-x.getAs[Double]("distCol"))).cache()
//this is where the code haults and I see no progress
val modelPIC = new PowerIterationClustering().setK(16).setMaxIterations(5).run(newRDD)
val clusters = modelPIC.assignments
I'm programming a K-means algorithm in Spark-Scala.
My model predicts which cluster each point belongs to.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data into ten classes using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format:
[no title, numberOfCluster (1,2,3,..10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This is the input format expected by a Python executable that pretty-prints the result.
But my best effort has only got me this
(you can see that the leading indices come out in the wrong order: 68, 384, ...):
var i = 0
val c = sc.parallelize(data.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a Spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")
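A small variant, in case you want to reuse the prediction RDD from the question instead of calling predict point by point. This is just a sketch; it relies on parsedData and prediction lining up element for element, which should hold here because prediction is computed as a direct map over parsedData:

// Pair each point with its predicted cluster by zipping the two RDDs, then format as "cluster x y".
val labelled = parsedData.zip(prediction).map { case (v, cluster) =>
  s"$cluster ${v(0)} ${v(1)}"
}
labelled.take(5).foreach(println)
labelled.saveAsTextFile("/some/other/output/path")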