Reducing with a Bloom filter - Scala

I would like to get fast approximate set membership, based on a String-valued function applied to a large Spark RDD of String Vectors (~1B records). Basically the idea would be to reduce into a Bloom filter. This Bloom filter could then be broadcast to the workers for further use.
More specifically, I currently have
rdd: RDD[Vector[String]]
f: Vector[String] => String
val uniqueVals = rdd.map(f).distinct().collect()
val uv = sc.broadcast(uniqueVals)
But uniqueVals is too large to be practical, and I would like to replace it with something of smaller (and known) size, i.e. a bloom filter.
My questions:
Is it possible to reduce into a Bloom filter, or do I have to collect first and then construct it in the driver?
Is there a mature Scala/Java Bloom filter implementation available that would be suitable for this?

Yes, Bloom filters can be reduced, because they have some nice properties (they form a monoid). This means you can do all the aggregation operations in parallel, effectively doing just one pass over the data: construct a BloomFilter for each partition and then reduce those BloomFilters together to get a single BloomFilter that you can query with contains.
There are at least two implementations of BloomFilter in Scala, and both seem to be mature projects (I haven't actually used them in production). The first one is Breeze and the second one is Twitter's Algebird. Both contain implementations of different sketches and a lot more.
This is an example how to do that with Breeze:
import breeze.util.BloomFilter
val nums = List(1 to 20: _*).map(_.toString)
val rdd = sc.parallelize(nums, 5)
val bf = rdd.mapPartitions { iter =>
  val bf = BloomFilter.optimallySized[String](10000, 0.001)
  iter.foreach(i => bf += i)
  Iterator(bf)
}.reduce(_ | _)
println(bf.contains("5")) // true
println(bf.contains("31")) // false

Related

Spark, applying filters on DataFrame(or RDD) multiple times without redundant evaluations

I have a Spark DataFrame whose chain of parent RDDs requires heavy evaluation.
val df: DataFrame[(String, Any)] = someMethodCalculatingDF()
val out1 = df.filter(_._1 == "Key1").map(_._2).collect()
val out2 = df.filter(_._1 == "Key2").map(_._2)
out1 is very small data (one or two rows in each partition) and is collected for further use.
out2 is a DataFrame and will be used to generate another RDD that will be materialized later.
So df will be evaluated twice, which is expensive.
Caching could be a solution, but in my application it won't work, because the data could be really, really big; memory would overflow.
Is there any genius :) who could suggest another way to bypass the redundant evaluation?
This is actually a scenario that occurs in our cluster on a daily basis. From our experience, the following methodology works best for us.
When we need to use the same calculated DataFrame twice (on different branches), we do as follows:
The calculation phase is heavy and results in a rather small DataFrame -> cache it.
The calculation phase is light and results in a big DataFrame -> let it be calculated twice.
The calculation phase is heavy and results in a big DataFrame -> write it to disk (HDFS or S3) and split the job at that point into two separate batch jobs (a sketch follows this list). That way you don't repeat the heavy calculation and you don't thrash your cache (which would probably end up using the disk anyway).
The calculation phase is light and results in a small DataFrame -> your life is good and you can go home :).
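For the third case, here is a minimal sketch of the write-and-split approach (the Parquet format, the path, and the SparkSession named spark are illustrative assumptions, not part of the original answer):
// First batch job: materialize the heavy result instead of caching it.
df.write.parquet("hdfs:///tmp/heavy_intermediate")
// Second batch job: pick up from the materialized data, so the heavy calculation is never repeated.
val reloaded = spark.read.parquet("hdfs:///tmp/heavy_intermediate")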
I'm not familiar with the Dataset API, so I'll write a solution using the RDD API.
val rdd: RDD[(String, Int)] = ???
//First way
val both: Map[String, Iterable[Int]] = rdd
  .filter(e => e._1 == "Key1" || e._1 == "Key2")
  .groupByKey().collectAsMap()
//Second way
val smallCached = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2").cache()
val out1 = smallCached.filter(_._1 == "Key1").map(_._2).collect()
val out2 = smallCached.filter(_._1 == "Key2").map(_._2).collect()

Spark Dataset equivalent for scala's "collect" taking a partial function

Regular scala collections have a nifty collect method which lets me do a filter-map operation in one pass using a partial function. Is there an equivalent operation on spark Datasets?
I'd like it for two reasons:
syntactic simplicity
it reduces filter-map style operations to a single pass (although in spark I am guessing there are optimizations which spot these things for you)
Here is an example to show what I mean. Suppose I have a sequence of options and I want to extract and double just the defined integers (those in a Some):
val input = Seq(Some(3), None, Some(-1), None, Some(4), Some(5))
Method 1 - collect
input.collect {
  case Some(value) => value * 2
}
// List(6, -2, 8, 10)
The collect makes this quite neat syntactically and does one pass.
Method 2 - filter-map
input.filter(_.isDefined).map(_.get * 2)
I can carry this kind of pattern over to spark because datasets and data frames have analogous methods.
But I don't like this so much because isDefined and get seem like code smells to me. There's an implicit assumption that map is receiving only Somes. The compiler can't verify this. In a bigger example, that assumption would be harder for a developer to spot and the developer might swap the filter and map around for example without getting a syntax error.
Method 3 - fold* operations
input.foldRight[List[Int]](Nil) {
  case (nextOpt, acc) => nextOpt match {
    case Some(next) => next * 2 :: acc
    case None => acc
  }
}
I haven't used spark enough to know if fold has an equivalent so this might be a bit tangential.
Anyway, the pattern match, the fold boilerplate, and the rebuilding of the list all get jumbled together and it's hard to read.
So overall I find the collect syntax the nicest and I'm hoping spark has something like this.
The answers here are incorrect, at least with the current version of Spark.
RDDs do in fact have a collect method that takes a partial function and applies a filter & map to the data. This is completely different from the parameterless .collect() method. See the Spark source code RDD.scala # line 955:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
This does not materialize the data from the RDD, as opposed to the parameterless .collect() method in RDD.scala # line 923:
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
In the documentation, notice how the
def collect[U](f: PartialFunction[T, U]): RDD[U]
method does not have a warning associated with it about the data being loaded into the driver's memory:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD#collect[U](f:PartialFunction[T,U])(implicitevidence$29:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]
It's very confusing for Spark to have these overloaded methods doing completely different things.
edit: My mistake! I misread the question; we're talking about Datasets, not RDDs. Still, the accepted answer says that
"as the Spark documentation points out, however, 'this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.'"
Which is incorrect! The data is not loaded into the driver's memory when calling the partial function version of .collect() - only when calling the parameterless version. Calling .collect(partial_function) should have about the same performance as calling .filter() and .map() sequentially, as shown in the source code above.
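To make the difference concrete, here is a small usage sketch (the sample data is illustrative): the partial-function overload returns another RDD, while the zero-argument overload pulls everything to the driver.
val opts = sc.parallelize(Seq(Some(3), None, Some(4)))
// Stays distributed: filter + map in one pass, the result is an RDD[Int]
val doubled = opts.collect { case Some(v) => v * 2 }
// Materializes on the driver: the result is an Array[Option[Int]]
val everything = opts.collect()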
Just for the sake of completeness:
The RDD API does have such a method, so it's always an option to convert a given Dataset / DataFrame to RDD, perform the collect operation and convert back, e.g.:
val dataset = Seq(Some(1), None, Some(2)).toDS()
val dsResult = dataset.rdd.collect { case Some(i) => i * 2 }.toDS()
However, this will probably perform worse than using a map and filter on the Dataset (for the reason explained in #stefanobaghino's answer).
As for DataFrames, this particular example (using Option) is somewhat misleading, as the conversion into a DataFrame actually does the "flattening" of Options into their values (or null for None), so the equivalent expression would be:
val dataframe = Seq(Some(1), None, Some(2)).toDF("opt")
dataframe.withColumn("opt", $"opt".multiply(2)).filter(not(isnull($"opt")))
Which, I think, suffers less from your concerns of having the map operation "assume" anything about its input.
The collect method defined over RDDs and Datasets is used to materialize the data in the driver program.
Despite not having something akin to the Collections API collect method, your intuition is right: since both operations are evaluated lazily, the engine has the opportunity to optimize the operations and chain them so that they are performed with maximum locality.
For the use case you mentioned in particular, I would suggest you take flatMap into consideration, which works on both RDDs and Datasets:
// Assumes the usual spark-shell environment
// sc: SparkContext, spark: SparkSession
val collection = Seq(Some(1), None, Some(2), None, Some(3))
val rdd = sc.parallelize(collection)
val dataset = spark.createDataset(rdd)
// Both operations will yield `Array(2, 4, 6)`
rdd.flatMap(_.map(_ * 2)).collect
dataset.flatMap(_.map(_ * 2)).collect
// You can also express the operation in terms of a for-comprehension
(for (option <- rdd; n <- option) yield n * 2).collect
(for (option <- dataset; n <- option) yield n * 2).collect
// The same approach is valid for traditional collections as well
collection.flatMap(_.map(_ * 2))
for (option <- collection; n <- option) yield n * 2
EDIT
As correctly pointed out in another question, RDDs actually have the collect method that transforms an RDD by applying a partial function just like it happens in normal collections. As the Spark documentation points out, however, "this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory."
I just wanted to extend stefanobaghino's answer by including an example of a for-comprehension with a case class, as many use cases for this will probably involve case classes.
Also, Options are monads, which makes the accepted answer very simple in this case, since the for-comprehension neatly drops the None values; that approach wouldn't extend to non-monads like case classes:
case class A(b: Boolean, i: Int, d: Double)
// Third field values below are arbitrary; only 'b' and 'i' matter for this example.
val collection = Seq(A(true, 3, 0.0), A(false, 10, 0.0), A(true, -1, 0.0))
val rdd = ...
val dataset = ...
// Select out and double all the 'i' values where 'b' is true:
for {
  A(b, i, _) <- dataset
  if b
} yield i * 2
You can always create your own extension method:
implicit class DatasetOps[T](ds: Dataset[T]) {
  def collectt[U](pf: PartialFunction[T, U])(implicit enc: Encoder[U]): Dataset[U] = {
    ds.flatMap(pf.lift(_))
  }
}
such that:
// e.g. val ds = Seq(1, 2, 3).toDS()
ds.collectt { case x if x % 2 == 1 => x * 3 }
// result: a Dataset containing 3 and 9
Note that I've unfortunately not been able to name it collect (thus the awful suffix t) as the signature would otherwise (I think) clash with the existing Dataset#collect method that transforms a Dataset into an Array.

What's the performance impact of converting between `DataFrame`, `RDD` and back?

While my first instinct is to use DataFrames for everything, it's just not possible -- some operations are clearly easier and / or better performing as RDD operations, not to mention certain APIs like GraphX only work on RDDs.
I seem to be spending a lot of time these days converting back and forth between DataFrames and RDDs -- so what's the performance hit? Take RDD.checkpoint -- there's no DataFrame equivalent, so what happens under the hood when I do:
val df = Seq((1,2),(3,4)).toDF("key","value")
val rdd = df.rdd.map(...)
val newDf = rdd.map(r => (r.getInt(0), r.getInt(1))).toDF("key","value")
Obviously, this is a trivially small example, but it would be great to know what happens behind the scenes in the conversion.
Let's look at df.rdd first. This is defined as:
lazy val rdd: RDD[Row] = {
  // use a local variable to make sure the map closure doesn't capture the whole DataFrame
  val schema = this.schema
  queryExecution.toRdd.mapPartitions { rows =>
    val converter = CatalystTypeConverters.createToScalaConverter(schema)
    rows.map(converter(_).asInstanceOf[Row])
  }
}
So first, it runs queryExecution.toRdd, which basically prepares the execution plan based on the operators used to build up the DataFrame, and computes an RDD[InternalRow] that represents the outcome of the plan.
Next these InternalRows (which are only for internal use) of that RDD will be mapped to normal Rows. This will entail the following for each row:
override def toScala(row: InternalRow): Row = {
  if (row == null) {
    null
  } else {
    val ar = new Array[Any](row.numFields)
    var idx = 0
    while (idx < row.numFields) {
      ar(idx) = converters(idx).toScala(row, idx)
      idx += 1
    }
    new GenericRowWithSchema(ar, structType)
  }
}
So it loops over all elements, converts them to 'Scala' space (from Catalyst space), and creates the final row with them. toDF will pretty much do these things in reverse.
This will all indeed have some impact on your performance. How much depends on how complex these operations are compared to the things you do with the data. The bigger possible impact, however, is that Spark's Catalyst optimizer can only optimize the operations between the conversions to and from RDDs, rather than the execution plan as a whole. It would be interesting to see which operations you have trouble with; I find most things can be done using basic expressions or UDFs. Using modules that only work on RDDs is a very valid use case though!
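To illustrate the "basic expressions or UDFs" point (my own sketch, not part of the answer above), the per-row work from the question could stay inside a single Catalyst plan instead of round-tripping through an RDD; the doubling UDF is just a hypothetical stand-in for whatever rdd.map(...) was doing:
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
val doubleValue = udf((x: Int) => x * 2)
val df = Seq((1, 2), (3, 4)).toDF("key", "value")
val newDf = df.withColumn("value", doubleValue(col("value"))) // no RDD conversion, one optimized plan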

Comparing Subsets of an RDD

I’m looking for a way to compare subsets of an RDD intelligently.
Let's say I had an RDD with key/value pairs of type (Int -> T). I eventually need to say "compare all values of key 1 with all values of key 2, and compare the values of key 3 to the values of key 5 and key 7"; how would I go about doing this efficiently?
The way I’m currently thinking of doing it is by creating a List of filtered RDDs and then using RDD.cartesian()
def filterSubset[T] = (b: Int, r: RDD[(Int, T)]) => r.filter { case (name, _) => name == b }

val keyPairs: Seq[(Int, Int)] // all key pairs
val rddPairs = keyPairs.map {
  case (a, b) =>
    filterSubset(a, r).cartesian(filterSubset(b, r))
}
rddPairs.map{ whatever I want to compare… }
I would then iterate the list and perform a map on each of the RDDs of pairs to gather the relational data that I need.
What I can't tell about this idea is whether it would be extremely inefficient to set up possibly hundreds of map jobs and then iterate through them. In that case, would Spark's lazy evaluation optimize the data shuffling between all of the maps? If not, can someone please recommend a possibly more efficient way to approach this issue?
Thank you for your help
One way you can approach this problem is to replicate and partition your data to reflect the key pairs you want to compare. Let's start by creating two maps from the actual keys to the temporary keys we'll use for replication and joins:
def genMap(keys: Seq[Int]) = keys
  .zipWithIndex.groupBy(_._1)
  .map { case (k, vs) => (k -> vs.map(_._2)) }

val left = genMap(keyPairs.map(_._1))
val right = genMap(keyPairs.map(_._2))
Next we can transform data by replicating with new keys:
def mapAndReplicate[T: ClassTag](rdd: RDD[(Int, T)], map: Map[Int, Seq[Int]]) = {
  rdd.flatMap { case (k, v) => map.getOrElse(k, Seq()).map(x => (x, (k, v))) }
}
val leftRDD = mapAndReplicate(rddPairs, left)
val rightRDD = mapAndReplicate(rddPairs, right)
Finally we can cogroup:
val cogrouped = leftRDD.cogroup(rightRDD)
And compare / filter pairs:
cogrouped.values.flatMap { case (xs, ys) =>
  for {
    (kx, vx) <- xs
    (ky, vy) <- ys
    if cosineSimilarity(vx, vy) <= threshold
  } yield ((kx, vx), (ky, vy))
}
Obviously, in its current form this approach is limited. It assumes that the values for an arbitrary pair of keys fit into memory, and it requires a significant amount of network traffic. Still, it should give you some idea of how to proceed.
Another possible approach is to store data in the external system (for example database) and fetch required key-value pairs on demand.
Since you're trying to find similarity between elements, I would also consider a completely different approach. Instead of naively comparing key-by-key, I would try to partition the data using a custom partitioner which reflects the expected similarity between documents. It is far from trivial in general, but it should give much better results.
Using DataFrames you can easily do the cartesian operation using a join:
dataframe1.join(dataframe2, dataframe1("key")===dataframe2("key"))
It will probably do exactly what you want, but efficiently.
If you don't know how to create a DataFrame, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-dataframes
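To sketch how that join could be wired up for the key-pair comparisons in the question (the column names, the example key pairs, spark.implicits._, and the String value type are my assumptions for illustration):
import spark.implicits._

val data = rdd.toDF("key", "value")                                 // rdd: RDD[(Int, String)]
val pairs = Seq((1, 2), (3, 5), (3, 7)).toDF("leftKey", "rightKey") // the key pairs to compare

// Attach each left-hand row to the pairs it participates in, then join in the matching right-hand rows.
val leftSide = data.join(pairs, data("key") === pairs("leftKey"))
val rightSide = data.toDF("rightKey2", "rightValue")
val compared = leftSide.join(rightSide, leftSide("rightKey") === rightSide("rightKey2"))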

Scala ParArray Sorting

How to sort in ascending order a ParArray collection such as
ParArray(1,3,2)
or else, which parallel collections might be more suitable for this purpose?
Update
How to implement a parallel algorithm on ParArray that may prove more efficient than casting to a non-parallel collection for sequential sorting?
How to implement a parallel algorithm on ParArray that may prove more efficient than casting to a non-parallel collection for sequential sorting?
My first observation would be that there doesn't seem to be much of a performance penalty for "converting" parallel arrays to sequential and back:
import scala.util.Random

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  val diff: Long = t1 - t0
  println(s"Elapsed time: ${diff * 1.0 / 1E9}s")
  result
}

def main(args: Array[String]): Unit = {
  val size: Int = args.headOption.map(_.toInt).getOrElse(1000000)
  val input = Array.fill(size)(Random.nextInt())
  val arrayCopy: Array[Int] = Array.ofDim(size)
  input.copyToArray(arrayCopy)
  time { input.sorted }
  val parArray = arrayCopy.par
  val result = time { parArray.seq.sorted.toArray.par }
}
gives
> run 1000000
[info] Running Runner 1000000
Elapsed time: 0.344659236s
Elapsed time: 0.321363896s
For all array sizes I tested, the results are very similar and usually slightly in favor of the second expression. So in case you were worried that converting to sequential collections and back will kill the performance gains you achieved on other operations - I don't think you should be.
When it comes to utilizing Scala's parallel collections to achieve parallel sorting that in some cases would perform better than the default - I don't think there's an obvious good way of doing that, but it wouldn't hurt to try:
What I thought should work is splitting the input array into as many subarrays as you have cores in your machine (preferably without any unnecessary copying) and sorting the parts concurrently. Afterwards one might merge (as in merge sort) the parts together. Here's how the code might look:
val maxThreads = 8 // for simplicity we're not configuring the thread pool explicitly
val groupSize: Int = size / maxThreads + 1
val ranges: IndexedSeq[(Int, Int)] = (0 until maxThreads).map(i => (i * groupSize, (i + 1) * groupSize))
time {
  // parallelizing sorting for each range
  ranges.par.foreach { case (from, to) =>
    input.view(from, to).sortWith(_ < _)
  }
  // TODO merge the parts together
}
Unfortunately there's an old bug that prevents us from doing anything fun with views. There doesn't seem to be any Scala built-in mechanism (other than views) for sorting just a part of a collection. This is why I tried coding my own merge sort algorithm with the signature def mergeSort(a: Array[Int], r: Range): Unit to use it as described above. Unfortunately it turned out to be more than 4 times slower than Scala's Array.sorted method, so I don't think it can be used to gain efficiency over the standard sequential approach.
If I understand your situation correctly, your dataset fits in memory, so using something like Hadoop and MapReduce would be premature. What you might try, though, is Apache Spark: other than adding a dependency, you don't need to set up any cluster or install anything for Spark to use all cores of your machine in a basic configuration. Its RDDs are ideologically similar to Scala's parallel collections, but with additional functionality, and they (in a way) support parallel sorting.
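A minimal local-mode sketch of that Spark route (the local[*] master and the sample data are my assumptions, not part of the original answer):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("parallel-sort"))
// sortBy shuffles and sorts using all available cores in local[*] mode
val sorted: Array[Int] = sc.parallelize(Seq(3, 1, 2)).sortBy(identity).collect()
sc.stop()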
If you build your Scala project against Java 8, there is the new Arrays.parallelSort you can use:
import scala.collection.parallel.mutable.ParArray
import scala.reflect.ClassTag

def sort[T <: Comparable[T]](parArray: ParArray[T])(implicit c: ClassTag[T]): ParArray[T] = {
  // Or, to prevent copying, parArray.seq.array.asInstanceOf[Array[T]] might work?
  val array = new Array[T](parArray.size)
  parArray.copyToArray(array)
  java.util.Arrays.parallelSort(array)
  ParArray.createFromCopy(array)
}
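As a quick usage note (my addition): for primitive arrays there is no Comparable bound to worry about, because Arrays.parallelSort has dedicated primitive overloads:
val xs = Array(5, 3, 1, 4, 2)
java.util.Arrays.parallelSort(xs) // sorts in place via the int[] overload
// xs is now Array(1, 2, 3, 4, 5)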
There are no parallel sorting algorithms available in the Scala standard library. For this reason, the parallel collections don't provide sorted, sortBy, or sortWith methods. You will have to convert to an appropriate sequential class (e.g. with toArray) before sorting.
If your data can fit in memory, then a single-threaded in-memory sort is fast enough. If you need to load a lot of data from disk or HDFS, then you can do the sort on a distributed system like Hadoop or Spark.
import scala.collection.immutable.TreeSet
import scala.collection.parallel.ParIterable

// Note: a TreeSet drops duplicates, so this is only a faithful sort for distinct inputs.
def parallelSort[A: Ordering](seq: ParIterable[A]): TreeSet[A] = {
  seq.aggregate[TreeSet[A]](TreeSet.empty[A])(
    (set, a) => set + a,
    (set1, set2) => set1 ++ set2)
}
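A quick usage check of the sketch above (with illustrative values):
import scala.collection.parallel.mutable.ParArray
val sorted: TreeSet[Int] = parallelSort(ParArray(3, 1, 2)) // TreeSet(1, 2, 3)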