Spark MinHashLSH Never Progresses - scala

I am new to spark but I am attempting to produce network clusters using user supplied tags or attributes. First I am using the jaccard minhash algorithm to produce similarity scores then running it through power iteration clustering algorithm but as soon as it starts there is no CPU activity and will run for some time with zero progress. Wondering how to configure the cluster or change the code to get this to run. Below is my code
//about 10,000 rows of (id, 100 tags in binary form)
val data = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("inferSchema","true").load("gs://data/*.csv")
val columnNames = data.columns
val tags = columnNames.slice(1, columnNames.size)
//put tags in a vector
val assembler = new VectorAssembler().setInputCols(tags).setOutputCol("attributes")
val newData = assembler.transform(data).select("userID","attributes")
val mh = new MinHashLSH().setNumHashTables(5).setInputCol("attributes").setOutputCol("values")
val modelMINHASH = mh.fit(goodData)
// Approximate nearest neighbor search
val fullData = modelMINHASH.approxSimilarityJoin(newData , newData , 0.9).filter("datasetA.userID < datasetB.userID")
var explodeDF = fullData.withColumn("id", fullData("datasetA.userID")).withColumn("id2", fullData("datasetB.userID")).select("id","id2","distCol")
val temp = explodeDF.rdd
val newRDD = temp.map(x => (x.getAs[Integer]("id").longValue(),x.getAs[Integer]("id2").longValue(),1-x.getAs[Double]("distCol"))).cache()
//this is where the code haults and I see no progress
val modelPIC = new PowerIterationClustering().setK(16).setMaxIterations(5).run(newRDD)
val clusters = modelPIC.assignments

Related

Increase of hash tables in MinHashLSH, decreases accuracy and f1

I have used MinHashLSH with approximateSimilarityJoin with Scala and Spark 2.4 to find edges between a network. Link prediction based on document similarity. My problem is that while I am increasing the hash tables in the MinHashLSH, my accuracy and F1 score are decreasing. All that I have already read for this algorithm shows me that I have an issue.
I have tried a different number of hash tables and I have provided different numbers of Jaccard similarity thresholds but I have the same exact problem, the accuracy is decreasing rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow goes on like this: I am concatenating all the text columns of my dataframe, which includes title, authors, journal and abstract and next I am tokenizing the concatenated column into words. Then I am using a CountVectorizer to transform this "bag of words" into vectors. Next, I am providing this column in MinHashLSH with some hash tables and finaly I am doing an approximateSimilarityJoin to find similar "papers" which are under my given threshold. My implementation is the following.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._
object lsh {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors
// val cores=args(0).toInt
// val partitions=args(1).toInt
// val hashTables=args(2).toInt
// val limit = args(3).toInt
// val threshold = args(4).toDouble
val cores="*"
val partitions=1
val hashTables=16
val limit = 1000
val jaccardDistance = 0.89
val master = "local["+cores+"]"
val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
val sc = ss.sparkContext
val inputFile = "resources/data/node_information.csv"
println("reading from input file: " + inputFile)
println
val schemaStruct = StructType(
StructField("id", IntegerType) ::
StructField("pubYear", StringType) ::
StructField("title", StringType) ::
StructField("authors", StringType) ::
StructField("journal", StringType) ::
StructField("abstract", StringType) :: Nil
)
// Read the contents of the csv file in a dataframe. The csv file contains a header.
// var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
papers.repartition(partitions)
println("papers.rdd.getNumPartitions"+papers.rdd.getNumPartitions)
import ss.implicits._
// Read the original graph edges, ground trouth
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
val fields = line.split("\t")
(fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id").cache()
val originalGraphCount = originalGraphDF.count()
println("Ground truth count: " + originalGraphCount )
val nullAuthor = ""
val nullJournal = ""
val nullAbstract = ""
papers = papers.na.fill(nullAuthor, Seq("authors"))
papers = papers.na.fill(nullJournal, Seq("journal"))
papers = papers.na.fill(nullAbstract, Seq("abstract"))
papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
papers.show(false)
val filteredGt= originalGraphDF.as("g").join(papers.as("p"),(
$"g.nodeA_id" ===$"p.id") || ($"g.nodeB_id" ===$"p.id")
).select("g.nodeA_id","g.nodeB_id").distinct().cache()
filteredGt.show()
val filteredGtCount = filteredGt.count()
println("Filtered GroundTruth count: "+ filteredGtCount)
//TOKENIZE
val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")
println("Setting pipeline stages...")
val stages = Array(
tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
// rTitle, rAuthors, rJournal, rAbstract
)
val pipeline = new Pipeline()
pipeline.setStages(stages)
println("Transforming dataframe\n")
val model = pipeline.fit(papers)
papers = model.transform(papers)
println(papers.count())
papers.show(false)
papers.printSchema()
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
val joinedDf = papers.withColumn(
"paper_data",
udf_join_cols(
papers("pubYear_words"),
papers("title_words"),
papers("authors_words"),
papers("journal_words"),
papers("abstract_words")
)
).select("id", "paper_data").cache()
joinedDf.show(5,false)
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
vectorizedDf.show()
val mh = new MinHashLSH().setNumHashTables(hashTables)
.setInputCol("features").setOutputCol("hashValues")
val mhModel = mh.fit(vectorizedDf)
mhModel.transform(vectorizedDf).show()
vectorizedDf.createOrReplaceTempView("vecDf")
println("MinHashLSH.getHashTables: "+mh.getNumHashTables)
val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
dfA.show(false)
val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
dfB.show(false)
val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
println("Predictions:")
val predictionsCount = predictionsDF.count()
predictionsDF.show()
println("Predictions count: "+predictionsCount)
predictionsDF.createOrReplaceTempView("predictions")
val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
pairs.show(false)
val totalPredictions = pairs.count()
println("Properties:\n")
println("Threshold: "+threshold+"\n")
println("Hahs tables: "+hashTables+"\n")
println("Ground truth: "+filteredGtCount)
println("Total edges found: "+totalPredictions +" \n")
println("EVALUATION PROCESS STARTS\n")
println("Calculating true positives...\n")
val truePositives = filteredGt.as("g").join(pairs.as("p"),
($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
).cache().count()
println("True Positives: "+truePositives+"\n")
println("Calculating false positives...\n")
val falsePositives = predictionsCount - truePositives
println("False Positives: "+falsePositives+"\n")
println("Calculating true negatives...\n")
val pairsPerTwoCount = (limit *(limit - 1)) / 2
val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
println("True Negatives: "+trueNegatives+"\n")
val falseNegatives = filteredGtCount - truePositives
println("False Negatives: "+falseNegatives)
val truePN = (truePositives+trueNegatives).toFloat
println("TP + TN sum: "+truePN+"\n")
val sum = (truePN + falseNegatives+ falsePositives).toFloat
println("TP +TN +FP+ FN sum: "+sum+"\n")
val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy+"\n")
val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat
val f1Score = 2*(recall*precision)/(recall+precision).toFloat
println("F1 score: "+f1Score+"\n")
ss.stop()
I forget to tell you that I am running this code in a cluster with 40 cores and 64g of RAM. Note that approximate similarity join (Spark's implementation) works with JACCARD DISTANCE and not with JACCARD INDEX. So I provide as a similarity threshold the JACCARD DISTANCE which for my case is jaccardDistance = 1 - threshold. (threshold = Jaccard Index ).
I was expecting to get higher accuracy and f1 score while I am increasing the hash tables. Do you have any idea about my issue?
Thank all of you in advance!
There are multiple visible problems here, and probably more hidden, so just to enumerate a few:
LSH is not really a classifier and attempt to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is big if).
If the problem was to be framed as classification problem it should be treated as multi-label classification (each paper can cite or be cited by multiple sources) not multi-class classification, hence simple accuracy is not meaningful.
Even if it was a classification and could be evaluated as such your calculations don't include actual negatives, which don't meet the threshold of the approxSimilarityJoin
Also setting threshold to 1 restricts joins to either exact matches or cases of hash collisions - hence preference towards LSH with higher collisions rates.
Additionally:
Text processing approach you took is rather pedestrian and prefers non-specific features (remember you don't optimize your actual goal, but text similarity).
Such approach, especially treating everything as equal, discards majority of useful information in the set primarily, but not limited to, temporal relationships..

infinite centroid for kmeans spark scala

(i think i am almost sure what the answer is)
here is my code:
val fileName = """file:///home/user/data/csv/sessions_sample.csv"""
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(fileName)
// calculate input for kmeans
val input1 = df.select("id", "duration", "ip_dist", "txr1", "txr2", "txr3", "txr4").na.fill(3.0)
val input2 = input1.map(r => (r.getInt(0), Vectors.dense((1 until r.size - 1).map{ i => r.getDouble(i)}.toArray[Double])))
val input3 = input2.toDF("id", "features")
// initiate kmeans
val kmeans = new KMeans().setK(100).setSeed(1L).setFeaturesCol("features").setPredictionCol("prediction")
val model = kmeans.fit(input3)
val model = kmeans.fit(input3.select("features"))
// Make predictions
val predictions = model.transform(input3.select("features"))
val predictions = model.transform(input3)
val evaluator = new ClusteringEvaluator()
// i get an error when i run this line
val silhouette = evaluator.evaluate(predictions)
java.lang.AssertionError: assertion failed: Number of clusters must be
greater than one. at scala.Predef$.assert(Predef.scala:170) at
org.apache.spark.ml.evaluation.SquaredEuclideanSilhouette$.computeSilhouetteScore(ClusteringEvaluator.scala:416)
at
org.apache.spark.ml.evaluation.ClusteringEvaluator.evaluate(ClusteringEvaluator.scala:96)
... 49 elided
But my centroids look like this:
model.clusterCenters.foreach(println)
[3217567.1300936914,145.06533614203505,Infinity,Infinity,Infinity]
i think that beceause some centers are infinite => kmeans is unstable => silhouette measure goes wrong.
But it still doesnt answer why, if i try to change k, any k > 1 so far, i have an error saying "Number of clusters must be greater than one".
please advice.
I once saw the same message. The root cause is that every data is the same (my data is generated by a program) so of course there is only one cluster. BTW, I did not check its centers so I am not sure whether my case is the same as yours.

XGBoost failing after using windowing functions on label column

I have successfully trained an XGBoost model where trainDF is a dataframe hacing two columns: features and label where we have 11k 1s and 57M 0's (unbalanced dataset). Everything works fine.
val udnersample = 0.1
// Undersampling of 0's -- choosing 10%
val training1 = output1.filter($"datestr" < end_period1 &&
$"label" === 1)
val training0 = output1.filter($"datestr" < end_period1 &&
$"label" === 0).sample(
false, undersample)
val training = training0.unionAll(training1)
val traindDF = training.select("label",
"features").toDF("label", "features")}
val paramMap = List("eta" -> 0.05,
"max_depth" -> 6,
"objective" -> "binary:logistic").toMap
val num_trees = 400
val num_cores = 200
val XGBModel = XGBoost.trainWithDataFrame(trainDF,
paramMap,
num_trees,
num_cores,
useExternalMemory = true)
Then, I want to change the y label with some windowing, so that in each group, I can predict y label earlier.
val sum_label = "sum_label"
val label_window_length = 19
val sliding_window_label = Window.partitionBy("id").orderBy(
asc("timestamp")).rowsBetween(0, label_window_length)
val training_source = output1.filter($"datestr" <
end_period1).withColumn(
sum_label, sum($"label").over(sliding_window_label)).drop(
"label").withColumnRenamed(sum_label, "label")
val training1 = training_source.filter(col("label") === 1)
val training0 = training_source.filter(col("label") === 0).sample(false, 0.099685)
val training = training0.unionAll(training1)
val traindDF = training.select("label",
"features").toDF("label", "features")}
The result has 57M 0's and 214k 1's (soughly the same number of rows though). No NAs in "label" column of trainDF and the type is still double (nullable=true). Then xgboost fails:
Name: ml.dmlc.xgboost4j.java.XGBoostError
Message: XGBoostModel training failed
StackTrace: at ml.dmlc.xgboost4j.scala.spark.XGBoost$.postTrackerReturnProcessing(XGBoost.scala:316)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithRDD(XGBoost.scala:293)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:138)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.train(XGBoostEstimator.scala:35)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:118)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:169)
I can include the logs as needed. My confusion is that using the windowing function and literally not changing any other setting, causes XGB to fail. I would appreciate any thoughts on this.
It turns out that saving the table traindDF in hive and reloading it into Spark solves the problem:
traindDF.write.mode("overwrite").saveAsTable("database.tablename")
Then, you can easily load the table:
val traindDF = spark.sql("""select * from database.tablename""")
This trick solved the problem. It seems like spark windowing function is a bit unstable and saving the result into a hive table makes it work.
A better way to do this is using windowing functions in hive instead of Spark.

Spark ML VectorAssembler() dealing with thousands of columns in dataframe

I was using spark ML pipeline to set up classification models on really wide table. This means that I have to automatically generate all the code that deals with columns instead of literately typing each of them. I am pretty much a beginner on scala and spark. I was stuck at the VectorAssembler() part when I was trying to do something like following:
val featureHeaders = featureHeader.collect.mkString(" ")
//convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => (s"$quote$x$quote"))
//count the element in headers
val featureHeader_cnt = featureHeaders.split(",").toList.length
// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.feature.StringIndexer
val labelIndexer = new StringIndexer().
setInputCol("target").
setOutputCol("indexedLabel")
val featureAssembler = new VectorAssembler().
setInputCols(featureSIArray).
setOutputCol("features")
val convpipeline = new Pipeline().
setStages(Array(labelIndexer, featureAssembler))
val myFeatureTransfer = convpipeline.fit(df)
Apparently it didn't work. I am not sure what should I do to make the whole thing more automatic or ML pipeline does not take that many columns at this moment(which I doubt)?
I finally figured out one way, which is not very pretty. It is to create vector.dense for the features, and then create data frame out of this.
import org.apache.spark.mllib.regression.LabeledPoint
val myDataRDDLP = inputData.map {line =>
val indexed = line.split('\t').zipWithIndex
val myValues = indexed.filter(x=> {x._2 >1770}).map(x=>x._1).map(_.toDouble)
val mykey = indexed.filter(x=> {x._2 == 3}).map(x=>(x._1.toDouble-1)).mkString.toDouble
LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
You shouldn't use quotes (s"$quote$x$quote") unless column names contain quotes. Try
val featureAssembler = new VectorAssembler().
setInputCols(featureArray).
setOutputCol("features")
For pyspark, you can first create a list of the column names:
df_colnames = df.columns
Then you can use that in vectorAssembler:
assemble = VectorAssembler(inputCols = df_colnames, outputCol = 'features')
df_vectorized = assemble.transform(df)

converting RDD to vector with fixed length file data

I'm new to Spark + Scala and still developing my intuition. I have a file containing many samples of data. Every 2048 lines represents a new sample. I'm attempting to convert each sample into a vector and then run through a k-means clustering algorithm. The data file looks like this:
123.34 800.18
456.123 23.16
...
When I'm playing with a very small subset of the data, I create an RDD from the file like this:
val fileData = sc.textFile("hdfs://path/to/file.txt")
and then create the vector using this code:
val freqLineCount = 2048
val numSamples = 200
val freqPowers = fileData.map( _.split(" ")(1).toDouble )
val allFreqs = freqPowers.take(numSamples*freqLineCount).grouped(freqLineCount)
val lotsOfVecs = allFreqs.map(spec => Vectors.dense(spec) ).toArray
val lotsOfVecsRDD = sc.parallelize( lotsOfVecs ).cache()
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(lotsOfVecsRDD, numClusters, numIterations)
The key here is that I can call .grouped on an array of strings and it returns an array of arrays with the sequential 2048 values. That is then trivial to convert to vectors and run it through the KMeans training algo.
I'm attempting to run this code on a much larger data set and running into java.lang.OutOfMemoryError: Java heap space errors. Presumably because I'm calling the take method on my freqPowers variable and then performing some operations on that data.
How would I go about achieving my goal of running KMeans on this data set keeping in mind that
each data sample occurs every 2048 lines in the file (so the file should be parsed somewhat sequentially)
this code needs to run on a distributed cluster
I need to not run out of memory :)
thanks in advance
You can do something like:
val freqLineCount = 2048
val freqPowers = fileData.flatMap(_.split(" ")(1).toDouble)
// Replacement of your current code.
val groupedRDD = freqPowers.zipWithIndex().groupBy(_._2 / freqLineCount)
val vectorRDD = groupedRDD.map(grouped => Vectors.dense(grouped._2.map(_._1).toArray))
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(vectorRDD, numClusters, numIterations)
The replacing code uses zipWithIndex() and division of longs to group RDD elements into freqLineCount chunks. After the grouping, the elements in question are extracted into their actual vectors.