Spark LSH approxSimilarityJoin taking too much time - scala

Spark LSH approxSimilarityJoin is taking too much time:
val column="name"
val"id", "name", "duns_number", "country_id") 1.7 million record
val new_df_1="index", "name", "duns_number", "country_id") 0.7 million record
val n_gram = new NGram()
val n_gram_df = n_gram.transform(new_df)
val n_gram_df_1=n_gram.transform(new_df_1)
val validateEmptyVector = udf({ v: Vector => v.numNonzeros > 0 }, DataTypes.BooleanType)
val vectorModeler: CountVectorizerModel = new CountVectorizer()
val vectorizedProductsDF = vectorModeler.transform(n_gram_df)
.select(col("id"), col(column), col("tokenize"),col("duns_number"),col("country_id"))
val vectorizedProductsDF_1 = vectorModeler.transform(n_gram_df_1)
val minLshConfig = new MinHashLSH().setNumHashTables(3)
val lshModel =
val transform_1=lshModel.transform(vectorizedProductsDF)
val transform_2=lshModel.transform(vectorizedProductsDF_1)
val result=lshModel.approxSimilarityJoin(transform_1,transform_2,0.42).toDF
The last line of code (approxSimilarityJoin) takes too long, and in the Stages view the last few tasks get stuck.
I tried with 13 executors with 4 cores each and


My Scala Spark code doesn't work, although it works fine in PySpark

I am running linear regression in PySpark with multiple parameters and it works fine, but if I run the same code with the same configuration in Scala, it gives me the error below when fitting the model.
Please ask questions and I will add any details you need; I have been stuck on this for the last month.
19/12/06 23:59:35 ERROR cluster.YarnScheduler: Lost executor 6 on Container killed by YARN for exceeding memory limits. 1.5 GB of 1.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
One more thing: in Scala, if I fit the linear regression model directly, without multiple parameters, it also works.
val spark = (SparkSession.builder
.config("spark.driver.memory", "3g")
.config("spark.executor.memory", "20g")
.config("spark.executor.cores", "3")
.getOrCreate())
val df1 = spark.sql("select * from Table_4kCols_10k_rows") //some 3GB hive table
val list_cols =df1.columns
val featureCols = list_cols.filter(_!= "dormant_flag")
val df2=df1.withColumnRenamed("dormant_flag","label")
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val df = assembler.transform(df2)
val seed = 5043
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val fittedLR = lr.fit(train) // This works fine
val stages = Array(lr)
val pipeline = new Pipeline().setStages(stages)
val params = new ParamGridBuilder().addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).addGrid(lr.regParam, Array(0.1, 2.0)).build()
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("prediction").setLabelCol("label")
val tvs = new TrainValidationSplit().setTrainRatio(0.75).setEstimatorParamMaps(params).setEstimator(pipeline).setEvaluator(evaluator)
val tvsFitted = tvs.fit(train) // This gives an error; PySpark works fine

How to speed up windowing a text file and machine learning over the windows in Spark

I'm trying to use Spark to learn multiclass logistic regression on a windowed text file. I first create windows and explode them into $"word_winds", then move the center word of each window into $"word". To fit the LogisticRegression model, I convert each distinct word into a class ($"label"), which is what the model learns. I count the distinct labels to prune those with too few samples (at most minF).
The problem is that some part of the code is very, very slow, even for small input files (you can use some README file to test the code). From Googling, I found that some users have experienced slowness when using explode. They suggest some modifications to the code in order to get a 2x speedup. However, I think that with a 100 MB input file this wouldn't be sufficient. Please suggest something different, probably a way to avoid the actions that slow down the code. I'm using Spark 2.4.0 and sbt 1.2.8 on a 24-core machine.
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
object SimpleApp {
def main(args: Array[String]) {
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val in_file = "sample.txt"
val stratified = true
val wsize = 7
val ngram = 3
val minF = 2
val windUdf = udf{s: String => s.sliding(ngram).toList.sliding(wsize).toList}
val get_mid = udf{s: Seq[String] => s(s.size/2)}
val rm_punct = udf{s: String => s.replaceAll("""([\p{Punct}|¿|\?|¡|!]|\p{C}|\b\p{IsLetter}{1,2}\b)\s*""", "")}
// Read and remove punctuation
var df = spark.read.text(in_file)
.withColumn("value", rm_punct($"value"))
// Creating windows and explode them, and get the center word into $"word"
df = df.withColumn("char_nGrams", windUdf('value))
.withColumn("word_winds", explode($"char_nGrams"))
.withColumn("word", get_mid('word_winds))
val indexer = new StringIndexer().setInputCol("word")
df =
val hashingTF = new HashingTF().setInputCol("word_winds")
df = hashingTF.transform(df)
val idf = new IDF().setInputCol("freqFeatures")
df =
// Remove word whose freq is less than minF
var counts = df.groupBy("label").count
.filter(col("count") > minF)
.withColumn("id", monotonically_increasing_id())
var filtro = df.groupBy("label").count.filter(col("count") <= minF)
df = df.join(filtro, Seq("label"), "leftanti")
var dfs = if(stratified){
// Create stratified sample 'dfs'
var revs = counts.orderBy(asc("count")).select("count")
.withColumn("id", monotonically_increasing_id())
revs = revs.withColumnRenamed("count", "ascc")
// Weight the labels linearly, with NORMALIZED weights inversely proportional ("ascc") to word frequency
counts = counts.join(revs, Seq("id"), "inner").withColumn("weight", col("ascc")/df.count)
val minn ="weight").agg(min("weight")).first.getDouble(0) - 0.01
val maxx ="weight").agg(max("weight")).first.getDouble(0) - 0.01
counts = counts.withColumn("weight_n", (col("weight") - minn) / (maxx - minn))
counts = counts.withColumn("weight_n", when(col("weight_n") > 1.0, 1.0)
var fractions ="label", "weight_n") => (x(0), x(1)
df.stat.sampleBy("label", fractions, 36L).select("features", "word_winds", "word", "label")
}else{ df }
dfs = dfs.checkpoint()
val lr = new LogisticRegression().setRegParam(0.01)
val Array(tr, ts) = dfs.randomSplit(Array(0.7, 0.3), seed = 12345)
val training ="word_winds", "features", "label", "word")
val test ="word_winds", "features", "label", "word")
val model =
def mapCode(m: scala.collection.Map[Any, String]) = udf( (s: Double) =>
m.getOrElse(s, "")
var labels ="label", "word").distinct.rdd
.map(x => (x(0), x(1).asInstanceOf[String]))
var predictions = model.transform(test)
predictions = predictions.withColumn("pred_word", mapCode(labels)($"prediction"))
Since your data is fairly small, it might help to use coalesce before explode. Sometimes it can be inefficient to have too many nodes, especially if there is a lot of shuffling in your code.
Like you said, it does seem like a lot of people have issues with explode. I looked at the link you provided but no one mentioned trying flatMap instead of explode.
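To illustrate the flatMap idea, here is a minimal sketch on plain Scala collections (not Spark code). It reuses the windowing logic of windUdf and get_mid from the question and emits each window together with its center n-gram in one pass. The names WindowSketch, windows and windowsWithCenter are hypothetical, and whether the same shape is faster than explode on a Dataset is an assumption you would need to measure.

```scala
object WindowSketch {
  // Character n-grams of a line, grouped into sliding windows of `wsize` n-grams
  // (same expression as windUdf in the question).
  def windows(line: String, ngram: Int, wsize: Int): List[List[String]] =
    line.sliding(ngram).toList.sliding(wsize).toList

  // One (window, center n-gram) pair per window, produced in a single flatMap
  // instead of building a nested column and exploding it afterwards.
  def windowsWithCenter(lines: Seq[String], ngram: Int, wsize: Int): Seq[(List[String], String)] =
    lines.flatMap { line =>
      windows(line, ngram, wsize).map(w => (w, w(w.size / 2)))
    }
}
```

For example, WindowSketch.windowsWithCenter(Seq("abcdefghij"), 3, 3) produces six windows, the first being List("abc", "bcd", "cde") with center "bcd".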

Increasing the number of hash tables in MinHashLSH decreases accuracy and F1

I have used MinHashLSH with approxSimilarityJoin in Scala and Spark 2.4 to find edges in a network, i.e. link prediction based on document similarity. My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and F1 score decrease. Everything I have read about this algorithm tells me that something is wrong.
I have tried different numbers of hash tables and different Jaccard similarity thresholds, but I get exactly the same problem: the accuracy decreases rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow is as follows: I concatenate all the text columns of my dataframe (title, authors, journal and abstract) and then tokenize the concatenated column into words. Then I use a CountVectorizer to transform this "bag of words" into vectors. Next, I feed this column to MinHashLSH with some number of hash tables, and finally I run approxSimilarityJoin to find similar "papers" that fall under my given threshold. My implementation is the following.
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._
object lsh {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors
// val cores=args(0).toInt
// val partitions=args(1).toInt
// val hashTables=args(2).toInt
// val limit = args(3).toInt
// val threshold = args(4).toDouble
val cores="*"
val partitions=1
val hashTables=16
val limit = 1000
val jaccardDistance = 0.89
val master = "local["+cores+"]"
val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
val sc = ss.sparkContext
val inputFile = "resources/data/node_information.csv"
println("reading from input file: " + inputFile)
val schemaStruct = StructType(
StructField("id", IntegerType) ::
StructField("pubYear", StringType) ::
StructField("title", StringType) ::
StructField("authors", StringType) ::
StructField("journal", StringType) ::
StructField("abstract", StringType) :: Nil)
// Read the contents of the csv file in a dataframe. The csv file contains a header.
// var papers ="header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
var papers ="header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
import ss.implicits._
// Read the original graph edges (ground truth)
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
val fields = line.split("\t")
(fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id").cache()
val originalGraphCount = originalGraphDF.count()
println("Ground truth count: " + originalGraphCount )
val nullAuthor = ""
val nullJournal = ""
val nullAbstract = ""
papers =, Seq("authors"))
papers =, Seq("journal"))
papers =, Seq("abstract"))
papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
val filteredGt="g").join("p"),(
$"g.nodeA_id" ===$"") || ($"g.nodeB_id" ===$"")
val filteredGtCount = filteredGt.count()
println("Filtered GroundTruth count: "+ filteredGtCount)
val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")
println("Setting pipeline stages...")
val stages = Array(
tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
// rTitle, rAuthors, rJournal, rAbstract
)
val pipeline = new Pipeline().setStages(stages)
println("Transforming dataframe\n")
val model = pipeline.fit(papers)
papers = model.transform(papers)
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
val joinedDf = papers.withColumn(
).select("id", "paper_data").cache(),false)
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
val mh = new MinHashLSH().setNumHashTables(hashTables)
val mhModel =
println("MinHashLSH.getHashTables: "+mh.getNumHashTables)
val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
val predictionsCount = predictionsDF.count()
println("Predictions count: "+predictionsCount)
val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
val totalPredictions = pairs.count()
println("Threshold: "+threshold+"\n")
println("Hash tables: "+hashTables+"\n")
println("Ground truth: "+filteredGtCount)
println("Total edges found: "+totalPredictions +" \n")
println("Calculating true positives...\n")
val truePositives ="g").join("p"),
($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
println("True Positives: "+truePositives+"\n")
println("Calculating false positives...\n")
val falsePositives = predictionsCount - truePositives
println("False Positives: "+falsePositives+"\n")
println("Calculating true negatives...\n")
val pairsPerTwoCount = (limit *(limit - 1)) / 2
val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
println("True Negatives: "+trueNegatives+"\n")
val falseNegatives = filteredGtCount - truePositives
println("False Negatives: "+falseNegatives)
val truePN = (truePositives+trueNegatives).toFloat
println("TP + TN sum: "+truePN+"\n")
val sum = (truePN + falseNegatives+ falsePositives).toFloat
println("TP +TN +FP+ FN sum: "+sum+"\n")
val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy+"\n")
val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat
val f1Score = 2*(recall*precision)/(recall+precision).toFloat
println("F1 score: "+f1Score+"\n")
I forgot to mention that I am running this code on a cluster with 40 cores and 64 GB of RAM. Note that Spark's approximate similarity join works with Jaccard distance, not the Jaccard index. So the similarity threshold I provide is the Jaccard distance, which in my case is jaccardDistance = 1 - threshold (threshold = Jaccard index).
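The relationship between the two quantities can be sketched in plain Scala (no Spark). This is only an illustration on sets of tokens. JaccardSketch and its method names are hypothetical, and treating two empty sets as identical is an assumption of the sketch.

```scala
object JaccardSketch {
  // Jaccard index: |A ∩ B| / |A ∪ B|. Two empty sets are treated as identical here (assumption).
  def jaccardIndex[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // Jaccard distance is the complement of the index.
  def jaccardDistance[A](a: Set[A], b: Set[A]): Double =
    1.0 - jaccardIndex(a, b)

  // Convert a similarity (index) threshold into the distance threshold a distance-based join expects.
  def distanceThreshold(indexThreshold: Double): Double =
    1.0 - indexThreshold
}
```

With this convention, a pair that shares 2 of 4 distinct tokens has index 0.5 and distance 0.5, and an index threshold of 0.11 corresponds to the distance threshold of 0.89 used in the code above.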
I was expecting to get higher accuracy and F1 score as I increase the number of hash tables. Do you have any idea what my issue is?
Thank you all in advance!
There are multiple visible problems here, and probably more hidden, so just to enumerate a few:
LSH is not really a classifier, and attempting to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is a big if).
If the problem were framed as a classification problem, it should be treated as multi-label classification (each paper can cite or be cited by multiple sources), not multi-class classification; hence simple accuracy is not meaningful.
Even if it were classification and could be evaluated as such, your calculations don't include the actual negatives, which don't meet the threshold of approxSimilarityJoin.
Also, setting the threshold to 1 restricts joins to either exact matches or cases of hash collisions, hence the preference towards LSH with higher collision rates.
The text processing approach you took is rather pedestrian and prefers non-specific features (remember that you are not optimizing your actual goal, but text similarity).
Such an approach, especially treating everything as equal, discards the majority of useful information in the set, primarily, but not limited to, temporal relationships.
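To make the evaluation concerns above a little more concrete, here is a minimal sketch in plain Scala (no Spark) that scores predicted pairs against ground-truth edges using precision, recall and F1 instead of accuracy. PairMetrics, edge and score are hypothetical names. The sketch assumes edges are undirected and that a self-join may report a pair in both orders as well as self-pairs, so it normalizes and filters them before counting.

```scala
object PairMetrics {
  // Normalize an undirected edge so (a, b) and (b, a) compare equal.
  def edge(a: Int, b: Int): (Int, Int) = if (a <= b) (a, b) else (b, a)

  final case class Scores(tp: Int, fp: Int, fn: Int,
                          precision: Double, recall: Double, f1: Double)

  def score(predicted: Set[(Int, Int)], truth: Set[(Int, Int)]): Scores = {
    // Drop self-pairs and collapse both orderings of the same pair.
    val p = predicted.collect { case (a, b) if a != b => edge(a, b) }
    val t = truth.map { case (a, b) => edge(a, b) }
    val tp = (p intersect t).size
    val fp = p.size - tp
    val fn = t.size - tp
    val precision = if (p.isEmpty) 0.0 else tp.toDouble / p.size
    val recall = if (t.isEmpty) 0.0 else tp.toDouble / t.size
    val f1 =
      if (precision + recall == 0.0) 0.0
      else 2 * precision * recall / (precision + recall)
    Scores(tp, fp, fn, precision, recall, f1)
  }
}
```

Counting false positives from the raw join output, without this normalization, would count duplicated and self-matched pairs as errors, which is one way the reported numbers can get worse as more hash tables surface more candidate pairs.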

Building a random forest in spark, explanation?

I have a data frame df with the following structure:
amount gender_num marital_num
10000 1 1
20000 1 2
1400 2 1
Let's say I am building an ML model to predict the column 'gender_num' in Spark using a random forest.
I am doing the following:
val df1 = df.withColumn("loan_amount", 'loan_amount.cast("Double")).withColumn("gender_num", 'gender_num.cast("String")).
withColumn("marital_num", 'marital_num.cast("String"))
val labeled = => LabeledPoint(df1.gender_num, Vectors.dense(df1.loan_amount, df1.marital_num)))
val numClasses = 7
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(labeled, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
My code is failing at the second step:
138: error: value gender_num is not a member of org.apache.spark.sql.DataFrame
I would really appreciate it if someone could explain this to me. The documentation is very hard to follow; newbie here!
That's because you are using R-like syntax on the DataFrame.
You should access row's data like this:
val labeled = { row => LabeledPoint(row(1).toDouble, Vectors.dense(row(0).toDouble, row(1).toDouble))}
You can also create case class and use Dataset syntax:
case class ParsedData (amount : Double, gender_num : Int, marital_num : Int)
val labeled = df1.as[ParsedData].map(row => LabeledPoint(row.gender_num, Vectors.dense(row.amount, row.marital_num)))

How to improve join performance when broadcast variable is used?

I am new to Spark. I have two RDDs: one is 9 GB (400 million lines) (RDD1) and the other is 110 KB (4 million lines) (RDD2). I use RDD2 as a broadcast variable to reduce shuffling. My code works, but the reduceByKey part is still too slow.
I have been playing around with partition numbers. If I set 10,000 partitions for both RDDs, it starts to spill, so I increased it to 20K, 30K and 100K. It stopped spilling, but it is extremely slow. I also tried set("spark.akka.frameSize","1000"), but that did not help. How could I improve this code?
Here is my code:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[*]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\.txt",30000)...RDD1
val emp_new = sc.textFile("\\.txt",10000)...RDD2
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
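For readers following the logic, here is a minimal sketch of what the code above computes, written on plain Scala collections rather than RDDs. The small side becomes an in-memory lookup map (the role of the broadcast variable), each record of the large side is expanded into "v1-v2" strings, and the strings are counted (the role of reduceByKey). BroadcastJoinSketch, lookup and joinAndCount are hypothetical names, and the sketch says nothing about performance on a cluster.

```scala
object BroadcastJoinSketch {
  // Small side: group values by key once, like groupByKey.collectAsMap.
  def lookup(small: Seq[(String, String)]): Map[String, Seq[String]] =
    small.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Large side: emit one "v1-v2" string per match, then count occurrences.
  // Keys missing from the lookup map produce nothing, as with getOrElse(k, Iterable()).
  def joinAndCount(big: Seq[(String, String)],
                   small: Map[String, Seq[String]]): Map[String, Int] =
    big.iterator
      .flatMap { case (k, v1) =>
        small.getOrElse(k, Seq.empty).iterator.map(v2 => s"$v1-$v2")
      }
      .foldLeft(Map.empty[String, Int]) { (acc, key) =>
        acc.updated(key, acc.getOrElse(key, 0) + 1)
      }
}
```

Note that the number of emitted strings is the sum, over large-side records, of the number of matching small-side values, so a few very common keys can dominate the work in the counting step.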