Finding the least subsets in an RDD in Spark - Scala

I have an RDD of BitSets like:
Array[scala.collection.mutable.BitSet] = Array(BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10, 11), BitSet(1, 2, 3, 4), BitSet(8, 9, 10), BitSet(1, 2, 3))
I want an RDD of (BitSet(1, 2), BitSet(1, 7), BitSet(8, 9, 10)).
That is, I need the least subsets: every BitSet that does not contain another BitSet of the RDD as a subset.
Here, among (1, 2), (1, 2, 3) and (1, 2, 3, 4) the least subset is (1, 2), and
(1, 7) neither contains nor is contained in any other BitSet, so (1, 7) is also present in the result.

This problem can be solved using aggregation:
import scala.collection.mutable.BitSet

val rdd = sc.parallelize(
  Seq(BitSet(1, 2),
      BitSet(1, 7),
      BitSet(8, 9, 10, 11),
      BitSet(1, 2, 3, 4),
      BitSet(8, 9, 10),
      BitSet(1, 2, 3)))
val result = rdd.aggregate(Seq[BitSet]())(accumulate, merge)
Result:
List(BitSet(8, 9, 10), BitSet(1, 7), BitSet(1, 2))
accumulate is a function that collects data on each Spark partition:
def accumulate(acc: Seq[BitSet], elem: BitSet): Seq[BitSet] = {
  // drop the sets already collected that are supersets of elem
  val subsets = acc.filterNot(elem.subsetOf)
  // keep elem only if it is not itself a superset of a collected set
  if (notSuperSet(subsets)(elem)) elem +: subsets else subsets
}
merge combines the results of the partitions:
def merge(left: Seq[BitSet], right: Seq[BitSet]): Seq[BitSet] = {
  val leftSubsets = left.filter(notSuperSet(right))
  val rightSubsets = right.filter(notSuperSet(leftSubsets))
  leftSubsets ++ rightSubsets
}
and notSuperSet is a helper predicate that checks that a new set is not a superset of any of the other sets:
def notSuperSet(subsets: Seq[BitSet])(set: BitSet): Boolean =
  subsets.forall(!_.subsetOf(set))
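To check the aggregation logic without a cluster, here is a minimal sketch that imitates rdd.aggregate on a plain Scala collection. The object name MinimalSets, the function minimal and the partitionSize parameter are assumptions made for this illustration: each group produced by grouped stands in for one Spark partition.

```scala
import scala.collection.mutable.BitSet

object MinimalSets {
  // True when `set` is not a superset of any set in `subsets`.
  def notSuperSet(subsets: Seq[BitSet])(set: BitSet): Boolean =
    subsets.forall(s => !s.subsetOf(set))

  // Per-partition step: drop collected supersets of `elem`,
  // then keep `elem` unless it is a superset of something collected.
  def accumulate(acc: Seq[BitSet], elem: BitSet): Seq[BitSet] = {
    val subsets = acc.filterNot(s => elem.subsetOf(s))
    if (notSuperSet(subsets)(elem)) elem +: subsets else subsets
  }

  // Combines the partial results of two partitions.
  def merge(left: Seq[BitSet], right: Seq[BitSet]): Seq[BitSet] = {
    val leftSubsets = left.filter(notSuperSet(right))
    val rightSubsets = right.filter(notSuperSet(leftSubsets))
    leftSubsets ++ rightSubsets
  }

  // Hypothetical local stand-in for rdd.aggregate(Seq[BitSet]())(accumulate, merge):
  // every chunk of `partitionSize` elements plays the role of one partition.
  def minimal(data: Seq[BitSet], partitionSize: Int): Set[BitSet] =
    data.grouped(partitionSize)
      .map(_.foldLeft(Seq.empty[BitSet])(accumulate))
      .foldLeft(Seq.empty[BitSet])(merge)
      .toSet
}
```

Returning a Set makes the comparison independent of element order, which may differ depending on how the data is partitioned.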


Spark: Split stream of elements to stream of lists of elements

I'd like to split an RDD into sequences of elements separated by a delimiter.
Say I have this RDD:
val data = Array(1, 2, 0, 4, 5, 0, 4, 5, 6, 1, 2, 0, 4)
val distData = sc.parallelize(data)
I'd like to transform it into the following RDD:
(Seq(1, 2), Seq(4, 5), Seq(4, 5, 6, 1, 2), Seq(4))
And I'd like to preserve the "laziness" of that RDD, that is, I want to be able to take(n) without evaluating the whole data.
I was able to solve it using an accumulator and groupByKey, but unfortunately it doesn't preserve laziness because groupByKey needs the whole RDD to be evaluated.
val seqId = sc.longAccumulator("Sequence id")
val result = distData
  .flatMap { e =>
    if (e == 0) {
      seqId.add(1) // a delimiter starts a new sequence
      None
    } else
      Some((seqId.value, e))
  }
  .groupByKey()
  .map { case (id, e) => e.toSeq }
Any ideas how to solve this problem in a lazy way?
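To make the requested behaviour concrete, here is a minimal sketch of a lazy splitter on a plain Scala Iterator. It is not a Spark solution: the object name SplitByDelimiter and the function split are assumptions for this illustration, and if it were applied per partition (for example inside mapPartitions), a sequence that crosses a partition boundary would be cut in two.

```scala
object SplitByDelimiter {
  // Lazily groups the elements of `it` into runs separated by `delim`.
  // Empty runs (leading, trailing or consecutive delimiters) are skipped.
  def split[A](it: Iterator[A], delim: A): Iterator[Seq[A]] =
    new Iterator[Seq[A]] {
      private val source = it.buffered

      // Consume the delimiters that precede the next run.
      private def skipDelims(): Unit =
        while (source.hasNext && source.head == delim) source.next()

      def hasNext: Boolean = {
        skipDelims()
        source.hasNext
      }

      def next(): Seq[A] = {
        skipDelims()
        if (!source.hasNext) throw new NoSuchElementException("no more runs")
        val run = Vector.newBuilder[A]
        // Read only up to the next delimiter, nothing beyond it.
        while (source.hasNext && source.head != delim) run += source.next()
        run.result()
      }
    }
}
```

Because each run is built only when next() is called, take(n) on the result reads just as much input as the first n runs need, which is why the sketch also works on an infinite iterator.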

Spark Scala - Split columns into multiple rows

Following up on the question that I posted here:
Spark Mllib - Scala
I have another question. Is it possible to transform a dataset like this:
Into this:
Basically I want to discover all the relationships between the movies. Is it possible to do this?
My current code is:
val input = sc.textFile("PATH")
val raw = input.map(_.split(",")).collect()
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)
Assume that input is a multi-line string.
scala> val raw = input.split("\n").map(_.split(","))
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))
The following approach discards one-element arrays, Array(1) in your example.
scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))
This can be fixed by appending the filtered raw collection.
scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))
The order of the combinations is not relevant, I believe.
SparkContext.textFile returns an RDD of lines, so it could be plugged in as:
val raw = input.map(_.split(","))
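The two steps above (pairs for longer lines, single-element lines kept as they are) can also be folded into one function per line. The following is a minimal sketch on plain Scala collections; the object name MoviePairs and the function expand are assumptions for this illustration. A function of this shape could be passed to flatMap on the RDD of lines.

```scala
object MoviePairs {
  // Turns one comma-separated line into all 2-element combinations.
  // A line with a single element is kept as a one-element list.
  def expand(line: String): List[List[String]] = {
    val ids = line.split(",").toList
    if (ids.length == 1) List(ids)
    else ids.combinations(2).toList
  }
}
```

For example, List("2,1,3", "1", "3,6,8").flatMap(MoviePairs.expand) yields the pairs of the first and third line plus List("1").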

RDD intersections

I have a query about intersection between two RDDs.
My first RDD has a list of elements like this:
A = List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)
And the second RDD is like this:
B = (1,2,3,4,5,6,8,9)
(I could store B in memory as a Set but not the first one.)
I would like to do an intersection of each element in A with B
How can I do this in Scala?
val result = rdd.map(x => x.intersect(B))
Both B and the elements of rdd have to have the same type (in this case List[Int]). Also, note that if B is big but still fits in memory, you would probably want to broadcast it, as described in the Spark documentation on broadcast variables.
scala> val B = List(1,2,3,4,5,6,8,9)
B: List[Int] = List(1, 2, 3, 4, 5, 6, 8, 9)
scala> val rdd = sc.parallelize(Seq(List(1,2,3,4), List(4,5,6), List(8,3,1),List(1,6,8,9,2)))
rdd: org.apache.spark.rdd.RDD[List[Int]] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> rdd.map(x => x.intersect(B)).collect.mkString("\n")
res3: String =
List(1, 2, 3, 4)
List(4, 5, 6)
List(8, 3, 1)
List(1, 6, 8, 9, 2)
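Since the question mentions that B could be stored in memory as a Set, here is a minimal sketch of the per-element step using a Set lookup instead of List.intersect. The object name IntersectWithSet and the function keepKnown are assumptions for this illustration. Note that filter keeps every occurrence of a duplicated element, whereas intersect takes multiplicities into account.

```scala
object IntersectWithSet {
  // Keeps, in their original order, the elements of `xs` that are contained in `b`.
  def keepKnown(xs: List[Int], b: Set[Int]): List[Int] =
    xs.filter(b.contains)

  // With Spark this could be plugged in roughly as follows (not executed here):
  //   val bBroadcast = sc.broadcast(B.toSet)
  //   val result = rdd.map(x => keepKnown(x, bBroadcast.value))
}
```

A Set makes each membership test cheap, which matters when B is large.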

How to filter an RDD according to a function based on another RDD in Spark?

I am a beginner with Apache Spark. I want to filter an RDD, keeping all groups whose total weight is larger than a constant value. The "weight" map is also an RDD. Here is a small demo: the groups to be filtered are stored in "groups", and the constant value is 12:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
val wm = weights.toArray.toMap
def isheavy(inp: String): Boolean = {
  val allw = inp.split(",").map(wm(_)).sum
  allw > 12
}
val result = groups.filter(isheavy)
When the input data is very large (> 10 GB, for example), I always encounter a "java heap out of memory" error. I suspected it was caused by "weights.toArray.toMap", because it converts a distributed RDD into a Java object in the JVM. So I tried to filter with the RDD directly:
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
def isheavy(inp: String): Boolean = {
  val items = inp.split(",")
  val wm = items.map(x => weights.filter(_._1 == x).first._2)
  wm.sum > 12
}
val result = groups.filter(isheavy)
When I ran result.collect after loading this script into the Spark shell, I got a "java.lang.NullPointerException". Someone told me that manipulating an RDD inside another RDD causes a NullPointerException, and suggested that I put the weights into Redis.
So how can I get "result" without converting "weights" to a Map or putting it into Redis? Is there a way to filter an RDD based on another map-like RDD without the help of an external datastore service?
Assume your groups are unique. Otherwise, first make them unique with distinct, etc.
If groups or weights is small, it should be easy. If both groups and weights are huge, you can try this, which may be more scalable but also looks complicated.
val groups = sc.parallelize(List("a,b,c,d", "b,c,e", "a,c,d", "e,g"))
val weights = sc.parallelize(Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)))
// map groups to (a, (a,b,c,d)), (b, (a,b,c,d)), (c, (a,b,c,d)) ...
val g1 = groups.flatMap(s => s.split(",").map(x => (x, Seq(s))))
// j will be (a, ((a,b,c,d), 3)) ...
val j = g1.join(weights)
// k will be ((a,b,c,d), 3), ((a,b,c,d), 2) ...
val k = j.map(x => (x._2._1, x._2._2))
// l will be ((a,b,c,d), (3,2,5,1)) ...
val l = k.groupByKey()
// keep only the groups whose weights sum to more than 12
val m = l.filter(x => x._2.sum > 12)
// we only need the original list
val result = m.map(x => x._1)
// don't do this in a real product, otherwise all results go to the driver; use saveAsTextFile, etc. instead
scala> result.foreach(println)
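The pipeline above can be traced on plain Scala collections. The following is a minimal sketch, not Spark code: the object name HeavyGroups, the function heavy and the limit parameter are assumptions for this illustration, and a Map lookup stands in for the distributed join.

```scala
object HeavyGroups {
  // Returns the groups whose item weights sum to more than `limit`, sorted by name.
  def heavy(groups: Seq[String], weights: Seq[(String, Int)], limit: Int): Seq[String] = {
    val weightOf = weights.toMap

    // Like the flatMap step: one (item, group) pair per item of each group.
    val itemToGroup: Seq[(String, String)] =
      groups.flatMap(g => g.split(",").toSeq.map(item => (item, g)))

    // Stand-in for the join: attach the weight of each item,
    // dropping items without a weight (as an inner join would).
    val groupWeight: Seq[(String, Int)] =
      itemToGroup.collect { case (item, g) if weightOf.contains(item) => (g, weightOf(item)) }

    // Like groupByKey followed by the sum filter.
    groupWeight
      .groupBy(_._1)
      .collect { case (g, ws) if ws.map(_._2).sum > limit => g }
      .toSeq
      .sorted
  }
}
```

With the demo data and a limit of 12, "a,b,c,d" sums to 11 and "a,c,d" to 9, so only "b,c,e" (16) and "e,g" (15) remain.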
The "java heap out of memory" error occurs because Spark uses its spark.default.parallelism property to determine the number of splits, which by default is the number of cores available.
// From CoarseGrainedSchedulerBackend.scala
override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
When the input becomes large and you have limited memory, you should increase the number of splits.
You can do something as follows:
val input = List("a,b,c,d", "b,c,e", "a,c,d", "e,g")
val splitSize = 10000 // specify some number of elements that fit in memory.
val numSplits = (input.size / splitSize) + 1 // has to be > 0.
val groups = sc.parallelize(input, numSplits) // specify the # of splits.
val weights = Array(("a", 3), ("b", 2), ("c", 5), ("d", 1), ("e", 9), ("f", 4), ("g", 6)).toMap
def isHeavy(inp: String) = inp.split(",").map(weights(_)).sum > 12
val result = groups.filter(isHeavy)
You may also consider increasing executor memory size using spark.executor.memory.

Scala: How to sort an array within a specified range of indices?

And I have a comparison function "compr" already in the code to compare two values.
I want something like this:
Sorting.stableSort(arr[i,j] , compr)
where arr[i,j] is a range of elements in the array.
Take the slice as a view, sort and copy it back (or take a slice as a working buffer).
scala> val vs = Array(3,2,8,5,4,9,1,10,6,7)
vs: Array[Int] = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
scala> vs.view(2,5).toSeq.sorted.copyToArray(vs,2)
scala> vs
res31: Array[Int] = Array(3, 2, 4, 5, 8, 9, 1, 10, 6, 7)
Outside the REPL, the extra .toSeq isn't needed:
scala 2.13.8> val vs = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
val vs: Array[Int] = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7)
scala 2.13.8> vs.view.slice(2,5).sorted.copyToArray(vs,2)
val res0: Int = 3
scala 2.13.8> vs
val res1: Array[Int] = Array(3, 2, 4, 5, 8, 9, 1, 10, 6, 7)
Split the array into three parts, sort the middle part and then concatenate them. It's not the most efficient way, but this is FP, who cares about performance =)
// l is the input sequence; FROM (inclusive) and UNTIL (exclusive) bound the range to sort
val (first, rest) = l.splitAt(FROM)
val (sortingPart, lastPart) = rest.splitAt(UNTIL - FROM)
val sorted = first ++ sortingPart.sorted ++ lastPart
Something like this:
def stableSort[T](x: Seq[T], i: Int, j: Int, comp: (T, T) => Boolean): Seq[T] = {
  // sorts the elements at indices i (inclusive) until j (exclusive)
  x.take(i) ++ x.slice(i, j).sortWith(comp) ++ x.drop(j)
}
def comp: (Int, Int) => Boolean = { case (x1, x2) => x1 < x2 }
val x = Array(1, 9, 5, 6, 3)
stableSort(x, 1, 4, comp)
// > res0: Seq[Int] = ArrayBuffer(1, 5, 6, 9, 3)
If your class implements Ordering it would be less cumbersome.
This should be as good as you can get without reimplementing the sort. It creates just one extra array with the size of the slice to be sorted.
def stableSort[K: reflect.ClassTag](xs: Array[K], from: Int, to: Int, comp: (K, K) => Boolean): Unit = {
  val tmp = xs.slice(from, to)          // copy of the range to sort
  scala.util.Sorting.stableSort(tmp, comp)
  tmp.copyToArray(xs, from)             // write the sorted range back in place
}
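As a usage illustration, here is a minimal self-contained sketch of the same slice, sort and copy-back idea, plus a variant for Array[Int] that relies on java.util.Arrays.sort with an index range and therefore needs no temporary array. The object name SortRange and the function names sortRange and sortIntRange are assumptions for this illustration.

```scala
import scala.reflect.ClassTag
import scala.util.Sorting

object SortRange {
  // Stably sorts xs(from) until xs(until) (exclusive) in place,
  // using the comparison function `lt`; the rest of the array is untouched.
  def sortRange[K: ClassTag](xs: Array[K], from: Int, until: Int)(lt: (K, K) => Boolean): Unit = {
    val tmp = xs.slice(from, until)
    Sorting.stableSort(tmp, lt)
    tmp.copyToArray(xs, from)
    ()
  }

  // For primitive Int arrays the JDK can sort an index range directly.
  def sortIntRange(xs: Array[Int], from: Int, until: Int): Unit =
    java.util.Arrays.sort(xs, from, until)
}
```

For instance, with vs = Array(3, 2, 8, 5, 4, 9, 1, 10, 6, 7), calling SortRange.sortRange(vs, 2, 5)(_ < _) reorders only the elements 8, 5, 4 at indices 2 to 4.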