Performance of BucketedRandomProjectionLSH (org.apache.spark.ml.feature.BucketedRandomProjectionLSH) - scala

Hi, I am using the BucketedRandomProjectionLSH algorithm (bucket length 2, 3 hash tables) to find similar people in a dataset of ~300,000 records. I create a sparse vector of bigrams for each record (1296 dimensions per vector) and run an approximate similarity self-join on the dataset, which, as mentioned, is not very large.
On a 3-node Spark cluster (Master: m3.xlarge, Core: 2 x m4.4xlarge), it takes ~7 hours to complete.
The performance is far too slow, so I am looking for benchmarks that someone may have produced for this algorithm. Additionally, any guidance on how to tune this algorithm would be really helpful.
Here is the code snippet for your reference:
val rdd=sc.loadFromMongoDB(ReadConfig(Map("uri" -> "mongodb://localhost:27017/Single.master","readPreference.name" -> "secondaryPreferred")))
val aggregatedRdd = rdd.withPipeline(Seq(Document.parse("{$unwind:'$sources'}"),Document.parse("{$project:{_id:0,id:'$sources._id',val:{$toLower:{$concat:['$sources.first_name','$sources.middle_name','$sources.last_name',{$substr:['$sources.gender',0,1]},'$sources.dob','$sources.address.street','$sources.address.city','$sources.address.state','$sources.address.zip','$sources.phone','$sources.email']}}}}")))
val fDF=aggregatedRdd.map(line=>line.values()).map(ll=>bigramMap(ll.toArray)).toDF("id","idx","keys")
val columnNames = Seq("idx","keys")
val result = fDF.select(columnNames.head, columnNames.tail: _*)
val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2)
  .setNumHashTables(3)
  .setInputCol("keys")
  .setOutputCol("values")
val model = brp.fit(result)
val outDD = model.approxSimilarityJoin(result, result, 100)
  .filter("datasetA.idx < datasetB.idx")
  .select(col("datasetA.idx").alias("idA"), col("datasetB.idx").alias("idB"), col("distCol"))

I tried BucketedRandomProjectionLSH on 10,000,000 records.
It took about 3 hours.
The only thing I did beforehand was cache the DataFrame:
df.persist()
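
For what it's worth, here is a minimal tuning sketch based on the question's code; the parameter values below are illustrative assumptions, not recommendations. Persisting the feature DataFrame keeps the self-join from recomputing the upstream Mongo pipeline, a larger bucket length puts more candidates into each bucket, more hash tables improves recall at extra cost, and a tighter distance threshold shrinks the number of candidate pairs that have to be scored.
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.storage.StorageLevel

// Cache the input once so fit() and the self-join reuse it.
val features = result.persist(StorageLevel.MEMORY_AND_DISK)

val brpTuned = new BucketedRandomProjectionLSH()
  .setBucketLength(4)   // illustrative: larger bucket length -> more items per bucket
  .setNumHashTables(3)  // more tables -> better recall, more computation
  .setInputCol("keys")
  .setOutputCol("values")

val tunedModel = brpTuned.fit(features)

// A smaller threshold than 100 cuts down the candidate pairs to compare.
val candidatePairs = tunedModel
  .approxSimilarityJoin(features, features, 10.0, "distCol")
  .filter("datasetA.idx < datasetB.idx")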

Related

spark cogroup/join KeyValueGroupedDataset with Dataset

I have 2 datasets.
The first has a number of rows with unique keys:
ds1
key val1 val2
1 a 1
2 a 2
3 b 3
4 c 3
In the second, the same key can occur many times:
ds2
key val1 val2
1 x x
1 x g
2 u h
5 i j
I need to join them, but the logic inside is too complicated for a simple join so instead I decided to use cogroup and iterate over the data.
val ds1 = df1.as[ds1].groupByKey(_.key)
val ds2 = df2.as[ds2].groupByKey(_.key)
ds2.cogroup(ds1)(
  (k: String, ds2: Iterator[ds2], ds1: Iterator[ds1]) => {
    // some logic
  }
)
The problem is that I don't actually need to group ds1, because I know it holds unique keys, but cogroup won't accept the Dataset otherwise. I know there is fullOuterJoin in the RDD API, but as far as I know it has worse performance.
val rdd1 = df1.as[ds1].rdd.map(x => (x.key, x))
val rdd2 = df2.as[ds2].rdd.groupBy(_.key)
rdd2.fullOuterJoin(rdd1)
Would it actually affect the performance? What alternatives are there if so?
I'm using Spark 2.2.
In Spark, performance depends mostly on how much data you are processing; remember that Spark is a computation engine, so the better you feed data to the executors, the better the performance will be.
A join is meant for simple queries, whereas cogroup groups the two DataFrames together. There are different ways of improving performance, but in your case you could create two separate DataFrames and then do a simple join (if your DataFrames are big enough), although cogroup performs the grouping within the same executor, which is why its performance is usually better.
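For reference, here is a minimal sketch of the plain typed join mentioned above. The case classes Ds1 and Ds2 and the spark session are assumptions standing in for the question's types; joinWith keeps both sides as objects, so the complicated per-key logic can still run in ordinary Scala code afterwards.
import spark.implicits._

// Illustrative stand-ins for the question's row types.
case class Ds1(key: String, val1: String, val2: String)
case class Ds2(key: String, val1: String, val2: String)

val left  = df1.as[Ds1]
val right = df2.as[Ds2]

// Typed full outer join: each element is a (Ds1, Ds2) pair, with null on the
// side that has no matching key, so the custom logic can run in a plain map.
val joined = left
  .joinWith(right, left("key") === right("key"), "full_outer")
  .map { case (l, r) =>
    // place the complicated per-key logic here; placeholder output below
    val key = if (l != null) l.key else r.key
    (key, l != null, r != null)
  }
Since ds1 holds unique keys, each pair carries at most one left-hand object, which is exactly the property that made the groupByKey on ds1 feel unnecessary.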

Spark, applying filters on DataFrame(or RDD) multiple times without redundant evaluations

I have a Spark DataFrame whose parent RDD chain requires heavy evaluation.
val df: Dataset[(String, Any)] = someMethodCalculatingDF()
val out1 = df.filter(_._1 == "Key1").map(_._2).collect()
val out2 = df.filter(_._1 == "Key2").map(_._2)
out1 is very small (one or two rows per partition) and is collected for further use.
out2 is a DataFrame and will be used to generate another RDD that will be materialized later.
So df will be evaluated twice, which is heavy.
Caching could be a solution, but in my application it won't work, because the data could be really, really BIG and would overflow memory.
Is there any genius :) who could suggest another way to bypass the redundant evaluations?
This is actually a scenario that occurs in our cluster on a daily basis, and from our experience the following methodology works best for us.
When we need to use the same calculated DataFrame twice (on different branches), we proceed as follows:
The calculation phase is heavy and results in a rather small DataFrame -> cache it.
The calculation phase is light and results in a big DataFrame -> let it be calculated twice.
The calculation phase is heavy and results in a big DataFrame -> write it to disk (HDFS or S3) and split the job at that point into two separate batch jobs (see the sketch after this list). That way you don't repeat the heavy calculation and you don't shred your cache (which would most likely spill to disk anyway).
The calculation phase is light and results in a small DataFrame -> your life is good and you can go home :).
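A minimal sketch of the third case (heavy calculation, big result); the path, the spark session variable, and the assumption that df can be written with its default tuple column names _1/_2 are all placeholders, and Parquet is just one convenient format:
// Hypothetical checkpoint-to-storage pattern; the path is a placeholder.
val checkpointPath = "hdfs:///tmp/df_checkpoint.parquet"

// First batch job: pay for the heavy lineage exactly once and materialize it.
df.write.mode("overwrite").parquet(checkpointPath)

// Second batch job (or later in the same job): both branches read the cheap copy.
val dfStored = spark.read.parquet(checkpointPath)
val out1 = dfStored.filter(dfStored("_1") === "Key1").select("_2").collect()
val out2 = dfStored.filter(dfStored("_1") === "Key2").select("_2")
Reading the stored copy back is cheap compared to re-running the heavy lineage, and unlike a cache it survives executor loss.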
I'm not familiar with the Dataset API, so I will write a solution using the RDD API.
val rdd: RDD[(String, Int)] = ???
//First way
val both: collection.Map[String, Iterable[Int]] = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2")
  .groupByKey().collectAsMap()
//Second way
val smallCached = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2").cache()
val out1 = smallCached.filter(_._1 == "Key1").map(_._2).collect()
val out2 = smallCached.filter(_._1 == "Key2").map(_._2).collect()

Apply function to subset of Spark Datasets (Iteratively)

I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid, apply a sampling function called "sample()" (which takes three inputs) to each square of the grid, and then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in a map-reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
val latMin = row._1
val latMax = latMin + 0.0001
val lonMin = row._2
val lonMax = lonMin + 0.0001
val filterDF = featuresDS.filter($"Latitude" > latMin).filter($"Latitude" < latMax).filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
val sampleDS = filterDF.sample(false, 0.05, 1234)
(sampleDS)
})
val output = sampleDataRDD.reduce(_ union _)
I've tried various ways of dealing with this, such as converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists
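
The NullPointerException most likely comes from using featuresDS inside an RDD transformation (a Dataset cannot be used within another distributed operation). One way around it, sketched below under assumptions (the 0.0001 cell size and the Latitude/Longitude columns come from the question, while the 5% random filter is a stand-in for the real sample() function), is to key every row by its grid cell and apply the per-cell logic in a single groupByKey pass:
import org.apache.spark.sql.functions.floor

// Key each feature row by its grid cell, then apply the per-cell logic with
// groupByKey + flatMap instead of nesting a Dataset inside rdd.map.
val cellSize = 0.0001

val sampledPerCell = featuresDS
  .withColumn("cellLat", floor($"Latitude" / cellSize))
  .withColumn("cellLon", floor($"Longitude" / cellSize))
  .rdd
  .keyBy(row => (row.getAs[Long]("cellLat"), row.getAs[Long]("cellLon")))
  .groupByKey()
  .flatMap { case (_, rowsInCell) =>
    // Stand-in for the custom sample(): keep roughly 5% of each cell's rows.
    val rng = new scala.util.Random(1234)
    rowsInCell.filter(_ => rng.nextDouble() < 0.05)
  }
If individual cells can hold a lot of rows, groupByKey can be memory-hungry; a keyed sampleByKey or a per-row hash-based filter avoids collecting a whole cell onto one executor.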

Scala: two sliding more efficiently

I am working on the Quick Start of Apache Spark and was wondering about the efficiency of transformations on collections. I would like to know how to improve the following code:
// Variable initialisation
val N = 300.0
val input = (0.0 to N-1 by 1.0).toArray
val firstBigDivi = 100
val windowDuration = 6
val windowStep = 3
// Process
val windowedInput = input
  .sliding(firstBigDivi, firstBigDivi).toArray                  // First, a big division
  .map(arr => arr.sliding(windowDuration, windowStep).toArray)  // Second, divide each division
Is there another way to do the same thing more efficiently? I think this code iterates twice over the input array (which could be an issue for big collections); is that right?
sliding creates an Iterator, so mapping over it is "cheap". You do have a superfluous .toArray between sliding and map, though. It suffices to write:
val windowedInputIt = input.
sliding(firstBigDivi,firstBigDivi) //First, a big division
.map(arr=>arr.sliding(windowDuration,windowStep).toArray)
Then you can evaluate that iterator into an Array by writing
val windowedInput = windowedInputIt.toArray
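As a quick sanity check (using the N = 300, firstBigDivi = 100, windowDuration = 6 setup from the question), the result has the same shape as the original double-toArray version; just keep in mind that windowedInputIt is an Iterator, so it can only be traversed once before it is exhausted.
println(windowedInput.length)                    // 3 big divisions (300 / 100)
println(windowedInput.head.length)               // number of windows in the first division
println(windowedInput.head.head.mkString(", "))  // 0.0, 1.0, 2.0, 3.0, 4.0, 5.0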

converting RDD to vector with fixed length file data

I'm new to Spark + Scala and still developing my intuition. I have a file containing many samples of data. Every 2048 lines represents a new sample. I'm attempting to convert each sample into a vector and then run through a k-means clustering algorithm. The data file looks like this:
123.34 800.18
456.123 23.16
...
When I'm playing with a very small subset of the data, I create an RDD from the file like this:
val fileData = sc.textFile("hdfs://path/to/file.txt")
and then create the vector using this code:
val freqLineCount = 2048
val numSamples = 200
val freqPowers = fileData.map( _.split(" ")(1).toDouble )
val allFreqs = freqPowers.take(numSamples*freqLineCount).grouped(freqLineCount)
val lotsOfVecs = allFreqs.map(spec => Vectors.dense(spec) ).toArray
val lotsOfVecsRDD = sc.parallelize( lotsOfVecs ).cache()
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(lotsOfVecsRDD, numClusters, numIterations)
The key here is that I can call .grouped on an array of strings and it returns an array of arrays with the sequential 2048 values. That is then trivial to convert to vectors and run it through the KMeans training algo.
I'm attempting to run this code on a much larger data set and am running into java.lang.OutOfMemoryError: Java heap space errors, presumably because I'm calling the take method on my freqPowers variable and then performing some operations on that data.
How would I go about achieving my goal of running KMeans on this data set keeping in mind that
each data sample occurs every 2048 lines in the file (so the file should be parsed somewhat sequentially)
this code needs to run on a distributed cluster
I need to not run out of memory :)
thanks in advance
You can do something like:
val freqLineCount = 2048
val freqPowers = fileData.map(_.split(" ")(1).toDouble)
// Replacement of your current code.
val groupedRDD = freqPowers.zipWithIndex().groupBy(_._2 / freqLineCount)
val vectorRDD = groupedRDD.map { case (_, grouped) =>
  // sort by the original line index so the 2048 values keep their file order
  Vectors.dense(grouped.toSeq.sortBy(_._2).map(_._1).toArray)
}
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(vectorRDD, numClusters, numIterations)
The replacement code uses zipWithIndex() and integer division of the index to group the RDD elements into chunks of freqLineCount. After the grouping, the values are sorted back into their original order and extracted into the actual vectors.
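As a small follow-up sketch (assuming the mllib KMeans and the vectorRDD from above), the trained model can then assign each 2048-sample vector to a cluster and report the clustering cost:
// Cluster index for every sample vector.
val assignments = vectorRDD.map(v => clusters.predict(v))

// Within-set sum of squared errors; lower means tighter clusters.
val wssse = clusters.computeCost(vectorRDD)
println(s"WSSSE = $wssse")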