Get all possible combinations of 3 values from n possible elements - scala

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails throwing an OutOfMemoryError. Is there any alternative approach which is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
select("item_id").distinct.cache
items.take(1) // Compute cache before join
val itemCombinations = items.select($"item_id".alias("id_A")).
join(
items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
join(
items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))

The approach seems OK but might generate quite some overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at a time, without significant memory overhead.
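As a minimal sketch of the local approach, assuming 30 distinct integer ids stand in for the collected item_id values (the data here is made up for illustration):

```scala
// Hypothetical stand-in for items.collect: 30 distinct ids.
val localItems: Vector[Int] = (1 to 30).toVector

// combinations(3) returns a lazy Iterator, so nothing is materialised
// until it is consumed. Each call below builds a fresh iterator.
val firstThree: List[Vector[Int]] = localItems.combinations(3).take(3).toList
val total: Int = localItems.combinations(3).size

println(firstThree) // the first few combinations
println(total)      // C(30, 3) = 4060
```

With only 4060 results for n = 30, the whole set fits comfortably in driver memory if it does need to be materialised.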
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there), by dropping to RDD level. This is basically the same expression as the join in the question:
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd
val combinations = items.cartesian(items).filter{case(x,y) => x<y}.cartesian(items).filter{case ((x,y),z) => y<z}
Running equivalent code on my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528
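To sanity-check the filter logic without a cluster, here is a small sketch on a plain Scala list. The list is a local stand-in for the RDD, and the nested generators play the role of the two cartesian calls; the data is an assumption for illustration:

```scala
// Local stand-in for the RDD: 10 distinct values.
val items: List[Int] = (1 to 10).toList

// Same shape as cartesian + filter: keep pairs with x < y,
// then extend each pair with every z where y < z.
val triples: List[(Int, Int, Int)] =
  for {
    x <- items
    y <- items if x < y
    z <- items if y < z
  } yield (x, y, z)

// For distinct values this matches the number of 3-element combinations.
val expected: Int = items.combinations(3).size

println(triples.size) // 120
println(expected)     // 120
```

Note that the strict comparisons drop ties, so with duplicate values in the input (as in the random data above) the count will differ from a plain n-choose-3.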

Related

Scala: two sliding more efficiently

I am working on the Quick Start of Apache Spark. I was wondering about efficiency of transformations on collections. I would like to know how to improve the following code:
// Variable initialisation
val N = 300.0
val input = (0.0 to N-1 by 1.0).toArray
val firstBigDivi = 100
val windowDuration = 6
val windowStep = 3
// Process
val windowedInput = input.
sliding(firstBigDivi,firstBigDivi).toArray. //First, a big division
map(arr=>arr.sliding(windowDuration,windowStep).toArray)//Second, divide the division
Is there a more efficient way to do the same thing? I think this code iterates over the input array twice (which could be an issue for big collections). Is that right?
sliding creates an Iterator, so mapping over it is "cheap". You have a superfluous .toArray between sliding and map, though. This suffices:
val windowedInputIt = input.
sliding(firstBigDivi,firstBigDivi) //First, a big division
.map(arr=>arr.sliding(windowDuration,windowStep).toArray)
Then you can evaluate that iterator into an Array by writing
val windowedInput = windowedInputIt.toArray
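A self-contained sketch of the same idea, assuming the parameter values from the question (300 elements, blocks of 100, windows of 6 with step 3). Array.tabulate is used here only to build the input without relying on a Double range:

```scala
// Input as in the question: 0.0, 1.0, ..., 299.0
val input: Array[Double] = Array.tabulate(300)(_.toDouble)

// sliding returns an Iterator; map on it is lazy, so no intermediate
// array of blocks is built before the inner windows are computed.
val windowedInputIt: Iterator[Array[Array[Double]]] =
  input
    .sliding(100, 100)                      // first, a big division
    .map(block => block.sliding(6, 3).toArray) // second, divide the division

// Forcing the iterator evaluates everything in one pass over the blocks.
val windowedInput: Array[Array[Array[Double]]] = windowedInputIt.toArray

println(windowedInput.length)                // 3 blocks
println(windowedInput(0)(0).mkString(", "))  // first window of the first block
```

Since the size and step of the outer call are equal, grouped(100) would express the same split.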

Merge multiple RDD generated in loop

I am calling a function in scala which gives an RDD[(Long,Long,Double)] as its output.
def helperfunction(): RDD[(Long, Long, Double)]
I call this function in loop in another part of the code and I want to merge all the generated RDDs. The loop calling the function looks something like this
for (i <- 1 to n){
val tOp = helperfunction()
// merge the generated tOp
}
What I want to do is something similar to what StringBuilder would do for you in Java when you wanted to merge the strings. I have looked at techniques of merging RDDs, which mostly point to using union function like this
RDD1.union(RDD2)
But this requires both RDDs to be generated before taking their union. I thought of initializing a var RDD1 outside the for loop to accumulate the results, but I am not sure how to initialize a blank RDD of type [(Long,Long,Double)]. Also, I am just starting out with Spark, so I am not even sure if this is the most elegant way to solve this problem.
Instead of using vars, you can use functional programming paradigms to achieve what you want:
val rdd = (1 to n).map(x => helperfunction()).reduce(_ union _)
Also, if you still need to create an empty RDD, you can do it using:
val empty = sc.emptyRDD[(Long, Long, Double)]
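One thing to watch with reduce is that it throws on an empty collection (n = 0), whereas a fold starting from an empty value does not. A rough sketch of the difference, using a plain Seq as a local stand-in for the RDD and ++ in place of union; the helperfunction body and its Int parameter are made up for illustration:

```scala
// Hypothetical helper: returns one row per call. On Spark this would
// return an RDD[(Long, Long, Double)] instead of a Seq.
def helperfunction(i: Int): Seq[(Long, Long, Double)] =
  Seq((i.toLong, i * 10L, i.toDouble))

// Fold from an empty value, the local analogue of starting from
// sc.emptyRDD[(Long, Long, Double)] and applying union.
def mergeAll(n: Int): Seq[(Long, Long, Double)] =
  (1 to n)
    .map(helperfunction)
    .foldLeft(Seq.empty[(Long, Long, Double)])(_ ++ _)

println(mergeAll(3)) // three rows, in call order
println(mergeAll(0)) // empty, no exception
```

On Spark, SparkContext also has a union method that takes a sequence of RDDs, which avoids building a long chain of pairwise unions.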
You're correct that this might not be the optimal way to do this, but we would need more info on what you're trying to accomplish with generating a new RDD with each call to your helper function.
You could define one RDD prior to the loop, assign it to a var, and then update it inside your loop. Here's an example:
val rdd = sc.parallelize(1 to 100)
val rdd_tuple = rdd.map(x => (x.toLong, (x*10).toLong, x.toDouble))
var new_rdd = rdd_tuple
println("Initial RDD count: " + new_rdd.count())
for (i <- 2 to 4) {
new_rdd = new_rdd.union(rdd_tuple)
}
println("New count after loop: " + new_rdd.count())

How to get a subset of a RDD?

I am new to Spark. If I have an RDD consisting of key-value pairs, what is an efficient way to return a subset of this RDD containing only the keys that appear more than a certain number of times in the original RDD?
For example, if my original data RDD is like this:
val dataRDD=sc.parallelize(List((1,34),(5,3),(1,64),(3,67),(5,0)),3)
I want to get a new RDD containing only the keys that appear more than once in dataRDD. The new RDD should contain these tuples: (1,34),(5,3),(1,64),(5,0). How can I get this new RDD? Thank you very much.
Count keys and filter infrequent:
val counts = dataRDD.keys.map((_, 1)).reduceByKey(_ + _)
val infrequent = counts.filter(_._2 == 1)
If the number of infrequent values is too large to be handled in memory, you can use PairRDDFunctions.subtractByKey:
dataRDD.subtractByKey(infrequent)
otherwise a broadcast variable:
val infrequentKeysBd = sc.broadcast(infrequent.keys.collect.toSet)
dataRDD.filter{ case(k, _) => !infrequentKeysBd.value.contains(k)}
If the number of frequent keys is very low, you can filter frequent keys and use a broadcast variable as above:
val frequent = counts.filter(_._2 > 1)
val frequentKeysBd = sc.broadcast(frequent.keys.collect.toSet) // As before
dataRDD.filter{case(k, _) => frequentKeysBd.value.contains(k)}
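To check the logic against the example from the question, here is a sketch on plain Scala collections. The List is a local stand-in for dataRDD, groupBy replaces reduceByKey, and an ordinary Set replaces the broadcast variable:

```scala
// Example data from the question, as a local list.
val data: List[(Int, Int)] = List((1, 34), (5, 3), (1, 64), (3, 67), (5, 0))

// Count occurrences per key (local analogue of map + reduceByKey).
val counts: Map[Int, Int] =
  data.groupBy(_._1).map { case (k, vs) => k -> vs.size }

// Keys that appear more than once (what would be broadcast on Spark).
val frequentKeys: Set[Int] =
  counts.collect { case (k, c) if c > 1 => k }.toSet

// Keep only the pairs whose key is frequent, preserving input order.
val result: List[(Int, Int)] =
  data.filter { case (k, _) => frequentKeys.contains(k) }

println(result) // List((1,34), (5,3), (1,64), (5,0))
```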

Joining process with broadcast variable ends up endless spilling

I am joining two RDDs from text files in standalone mode. One has 400 million (9 GB) rows, and the other has 4 million (110 KB).
3-grams doc1 3-grams doc2
ion - 100772C111 ion - 200772C222
on - 100772C111 gon - 200772C222
n - 100772C111 n - 200772C222
... - .... ... - ....
ion - 3332145654 on - 58898874
mju - 3332145654 mju - 58898874
... - .... ... - ....
In each file, doc numbers (doc1 or doc2) appear one under the other. As a result of the join, I would like to get the number of common 3-grams between the docs, e.g.
(100772C111-200772C222,2) --> There are two common 3-grams, which are 'ion' and ' n'
The server on which I run my code has 128 GB RAM and 24 cores. I set my IntelliJ configurations - VM options part with -Xmx64G
Here is my code for this:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[4]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.6").set("spark.storage.memoryFraction", "0.4")
.set("spark.executor.memory","40g")
.set("spark.driver.memory","40g")
val sc = new SparkContext(conf)
val emp = sc.textFile("\\doc1.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_new = sc.textFile("\\doc2.txt").map(line => (line.split("\t")(3),line.split("\t")(1))).distinct()
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
val joined = emp.mapPartitions(iter => for {
(k, v1) <- iter
v2 <- emp_newBC.value.getOrElse(k, Iterable())
} yield (s"$v1-$v2", 1))
val olsun = joined.reduceByKey((a,b) => a+b)
olsun.map(x => x._1 + "\t" + x._2).saveAsTextFile("...\\out.txt")
As seen, my key values change during the join with the broadcast variable, so it seems I need to repartition the joined values, which is very expensive. As a result, I ended up with far too much spilling, and the job never finished. I think 128 GB of memory should be sufficient. As far as I understand, using a broadcast variable should reduce shuffling significantly. So what is wrong with my application?
Thanks in advance.
EDIT:
I have also tried spark's join function as below:
var joinRDD = emp.join(emp_new);
val kkk = joinRDD.map(line => (line._2,1)).reduceByKey((a, b) => a + b)
but this again ended up with too much spilling.
EDIT2:
val conf = new SparkConf().setAppName("abdulhay").setMaster("local[12]").set("spark.shuffle.spill", "true")
.set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6")
.set("spark.executor.memory","50g")
.set("spark.driver.memory","50g")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\wos14.txt").map{line => val s = line.split("\t"); (s(5),s(0))}//.distinct()
val emp_new = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - scala\\fwo_word.txt").map{line => val s = line.split("\t"); (s(3),s(1))}//.distinct()
val cog = emp_new.cogroup(emp)
val skk = cog.flatMap {
case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
(l1.toSeq ++ l2.toSeq).combinations(2).map { case Seq(x, y) => if (x < y) ((x, y),1) else ((y, x),1) }.toList
}
val com = skk.countByKey()
I would not use broadcast variables. When you say:
val emp_newBC = sc.broadcast(emp_new.groupByKey.collectAsMap)
Spark is first moving the ENTIRE dataset into the master node, a huge bottleneck and prone to produce memory errors on the master node. Then this piece of memory is shuffled back to ALL nodes (lots of network overhead), bound to produce memory issues there too.
Instead, join the RDDs themselves using join (see description here)
Also figure out whether you have too few keys. For joining, Spark basically needs to load all the data for a given key into memory, and if your keys are too few that might still be too big a partition for any given executor.
A separate note: reduceByKey will repartition anyway.
EDIT: ---------------------
Ok, given the clarifications, and assuming that the number of 3-grams per doc# is not too big, this is what I would do:
Key both files by 3-gram to get (3-gram, doc#) tuples.
cogroup both RDDs, that gets you the 3gram key and 2 lists of doc#
Process those in a single scala function, output a set of all unique permutations of (doc-pairs).
Then do countByKey or countByKeyApprox to get a count of the number of distinct 3-grams for each doc pair.
Note: you can skip the .distinct() calls with this one. Also, you should not split every line twice. Replace line => (line.split("\t")(3), line.split("\t")(1)) with line => { val s = line.split("\t"); (s(3), s(1)) }
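The steps above can be sketched on plain Scala collections to see what each stage produces. This is only a local illustration: the doc ids and 3-grams are made up, groupBy stands in for cogroup, and the final groupBy stands in for countByKey. On Spark the same shape would use cogroup, flatMap and countByKey on the RDDs.

```scala
// Step 1: (3-gram, doc#) tuples for each file (hypothetical sample data).
val file1: Seq[(String, String)] =
  Seq(("ion", "A1"), ("on", "A1"), (" n", "A1"), ("mju", "A2"))
val file2: Seq[(String, String)] =
  Seq(("ion", "B1"), ("gon", "B1"), (" n", "B1"), ("mju", "B2"))

// Step 2: group doc ids by 3-gram on each side (local analogue of cogroup).
// distinct makes a 3-gram repeated within one doc count only once.
def docsByGram(rows: Seq[(String, String)]): Map[String, Seq[String]] =
  rows.groupBy(_._1).map { case (gram, rs) => gram -> rs.map(_._2).distinct }

val grams1 = docsByGram(file1)
val grams2 = docsByGram(file2)

// Step 3: for every shared 3-gram, emit each (doc1, doc2) pair once.
val docPairs: Seq[(String, String)] =
  for {
    (gram, docs1) <- grams1.toSeq
    d1 <- docs1
    d2 <- grams2.getOrElse(gram, Seq.empty[String])
  } yield (d1, d2)

// Step 4: count shared 3-grams per doc pair (local analogue of countByKey).
val commonCounts: Map[(String, String), Int] =
  docPairs.groupBy(identity).map { case (pair, occ) => pair -> occ.size }

println(commonCounts) // A1/B1 share "ion" and " n"; A2/B2 share "mju"
```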
EDIT 2:
You also seem to be tuning your memory badly. For instance, using .set("spark.shuffle.memoryFraction", "0.4").set("spark.storage.memoryFraction", "0.6") leaves basically no memory for task execution (since they add up to 1.0). I should have seen that sooner but was focused on the problem itself.
Check the tuning guides here and here.
Also, if you are running it on a single machine, you might try with a single, huge executor (or even ditch Spark completely), as you don't need overhead of a distributed processing platform (and distributed hardware failure tolerance, etc).

Merging two RDDs based on common key and then outputting

I'm fairly new to Scala and Spark and functional programming in general, so forgive me if this is a pretty basic question.
I'm merging two CSV files, so I got a lot of inspiration from this: Merge the intersection of two CSV files with Scala
although that is just Scala code and I wanted to write it in Spark to handle much larger CSV files.
This part of the code I think I've got right:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val broadcastedIntersection = sc.broadcast(intersectionOfKeys.collect.toSet)
And this is where I'm a little lost. I have a set of keys (intersectionOfKeys) that are present in both of my RDDs, and I have two RDDs that contain [Key, String] maps. If they were plain maps I could just do:
val output = broadcastedIntersection.value map (key => map1(key) + ", " + map2(key))
but that syntax isn't working.
Please let me know if you need any more information about the CSV files or what I'm trying to accomplish. Also, I'd love any syntactical and/or idiomatic comments on my code as well if you all notice anything incorrect.
Update:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val intersections = map1.join(map2)
intersections take NumOutputs foreach println
This code worked and did what I needed to do, but I was wondering if there were any modifications or performance implications of using join. I remember reading somewhere that join is typically really expensive and time consuming because all the data needs to be sent to all the distributed workers.
I think hveiga is correct, a join would be simpler:
val csv1KV = csv1.map(line=>(GetInput1Key(line), line))
val csv2KV = csv2.map(line=>(GetInput2Key(line), line))
val joined = csv1KV join csv2KV
joined.mapValues(lineTuple => lineTuple._1 + ", " + lineTuple._2)
This is more performant AND readable as far as I can see as you would need to join the two sets together at some point, and your way relies on a single machine mentality where you would have to pull each collection in to make sure you are requesting the line from all partitions. Note that I used mapValues, which at least keeps your sets hash partitioned and cuts down on network noise.
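For intuition about what the keyed join computes, here is a small local sketch on plain Scala sequences. The sample lines, the keyOf helper and the choice of the first column as key are assumptions for illustration; on Spark the join and mapValues calls above do the same thing across partitions.

```scala
// Hypothetical CSV lines; the key is assumed to be the first column.
val csv1: Seq[String] = Seq("1,alice", "2,bob", "3,carol")
val csv2: Seq[String] = Seq("2,london", "3,paris", "4,rome")

// Hypothetical key extractor, standing in for GetInput1Key / GetInput2Key.
def keyOf(line: String): String = line.split(",", -1)(0)

// Index the second file by key, then look up each line of the first file.
// Lines whose key is missing on either side drop out (inner join).
val csv2ByKey: Map[String, Seq[String]] = csv2.groupBy(keyOf)

val merged: Seq[String] =
  for {
    left  <- csv1
    right <- csv2ByKey.getOrElse(keyOf(left), Seq.empty[String])
  } yield left + ", " + right

merged.foreach(println)
```

Only keys present in both inputs survive, which is why computing the intersection of keys separately is not needed.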