RDD SUM on Lists alternative - scala

I have this contrived example:
val rdd = sc.parallelize(List(
  ("A", List(1, 1)),
  ("B", List(2, 2, 2, 200)),
  ("C", List(3, 3))))
and can do this to tally the overall sum of the RDD:
rdd.map(_._2.sum).sum
or
rdd.flatMapValues(identity).values.sum
Can I compute the overall sum in a single step, taking into account that the values are a List, Array, etc.? Or are these two approaches the basic way of summing overall, which necessarily has to be a two-step process?

To my understanding, both of your solutions are correct.
There are some other options, however. For instance, here is an elegant way of doing the same thing:
rdd.flatMap(_._2).sum
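For reference, a minimal runnable check (assuming a spark-shell session where sc is in scope) shows that all three expressions return the same total:
val rdd = sc.parallelize(List(
  ("A", List(1, 1)),
  ("B", List(2, 2, 2, 200)),
  ("C", List(3, 3))))
rdd.map(_._2.sum).sum                  // 214.0
rdd.flatMapValues(identity).values.sum // 214.0
rdd.flatMap(_._2).sum                  // 214.0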

Best way to find min of element in tuple

I have an Array of 3-tuples in Scala, like this: (a:Int, b:Int, val:Double), and I need to return an Array which, for each pair (a, b), has the minimum val. It is clear to me how to do this by going through a Map:
a.map(t => ((t._1, t._2), t._3)).groupBy(_._1).mapValues(_.map(_._2).min).toArray
but I would like to avoid maps for the purposes of memory optimization. Is there a clean way to do this without Map?
Try groupMapReduce (available since Scala 2.13); it does the same thing, but in a single pass:
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _)
Runnable example:
val tuples = List(
  (0, 0, 58),
  (1, 1, 100),
  (0, 0, 1),
  (0, 1, 42),
  (1, 0, 123),
  (1, 0, 3),
  (0, 1, 2),
  (1, 1, 4)
)
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _).foreach(println)
Output:
((0,0),1)
((1,1),4)
((1,0),3)
((0,1),2)
Compared to your solution, it should strictly decrease the load on the GC, because it doesn't generate any intermediate lists for the grouped values, nor for the mapped grouped values produced by the _.map(_._2) step in your original solution.
It does not completely eliminate the intermediate Map, but unless you can provide a more efficient structure for storing the values (such as a 2-D array, by limiting the possible as and bs to be relatively small and strictly positive), the use of a Map seems more or less unavoidable.
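Since the question asks for an Array rather than a Map, the result can be converted at the end, for example (the order of entries coming out of a Map is not guaranteed):
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _).toArray
// Array(((0,0),1), ((1,1),4), ((1,0),3), ((0,1),2)), in some order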

Intersection of Two HashMap (HashMap<Integer,HashSet<Integer>>) RDDs in Scala for Spark

I am working in Scala, programming Spark on a standalone machine (a PC running Windows 10). I am a newbie and don't have experience programming in Scala and Spark, so I will be very thankful for any help.
Problem:
I have a HashMap, hMap1, whose values are HashSets of Integer entries (HashMap&lt;Integer, HashSet&lt;Integer&gt;&gt;). I then store its values (i.e., many HashSet values) in an RDD. The code is as below:
val rdd1 = sc.parallelize(Seq(hMap1.values()))
Now I have another HashMap, hMap2, of the same type, i.e., HashMap&lt;Integer, HashSet&lt;Integer&gt;&gt;. Its values are also stored in an RDD as
val rdd2 = sc.parallelize(Seq(hMap2.values()))
I want to know how I can intersect the values of hMap1 and hMap2.
For example:
Input:
the data in rdd1 = [2, 3], [1, 109], [88, 17]
and data in rdd2 = [2, 3], [1, 109], [5,45]
Output:
so the output = [2, 3], [1, 109]
Problem statement
My understanding of your question is the following:
Given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?
Sample data
Two RDDs generated by
val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))
Possible solution
If my understanding of the problem statement is correct, you could use rdd1.intersection(rdd2) if your RDDs are as I thought. This is what I tried in a spark-shell with Spark 2.2.0:
rdd1.intersection(rdd2).collect
which yielded the output:
Array(Set(2, 3), Set(1, 109))
This works because Spark can compare elements of type Set[Integer], but note that this is not generalisable to any Set[MyObject] unless you define the equality contract for MyObject.
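If you are starting from Java HashMaps as in the question, one way to build an RDD of this shape is to convert each HashSet value into a Scala Set before parallelizing. A rough sketch (sc is assumed to come from a spark-shell; the sample hMap1 below is made up to mirror the question's structure):
import scala.collection.JavaConverters._ // scala.jdk.CollectionConverters._ on Scala 2.13+

// hypothetical stand-in for the question's hMap1
val hMap1 = new java.util.HashMap[Integer, java.util.HashSet[Integer]]()
val s = new java.util.HashSet[Integer]()
s.add(2); s.add(3)
hMap1.put(1, s)

// one Set[Int] record per HashSet value, rather than one record holding the whole values() collection
val rdd1 = sc.parallelize(hMap1.values().asScala.toSeq.map(_.asScala.map(_.intValue).toSet))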

Getting unique values of pairs in an RDD when the order within the pair is irrelevant

I have an RDD with the values of
a,b
a,c
a,d
b,a
c,d
d,c
d,e
what I need is an RDD that contains the reciprocal pairs, but just one of each. It would have to be:
a,b or b,a
c,d or d,c
I was thinking they could be added to a list and looped through to find the opposite pair; if one exists, filter the first value out and delete the reciprocal pair. I am thinking there must be a way of using Scala functions like join or case, but I am having difficulty understanding them.
If you don't mind the order within each pair changing (e.g., (a,b) becoming (b,a)), there is a simple and easy-to-parallelize solution. The examples below use numbers, but the pairs can be anything, as long as the values are comparable.
In vanilla Scala:
List(
  (2, 1),
  (3, 2),
  (1, 2),
  (2, 4),
  (4, 2)
).map { case (a, b) => if (a > b) (a, b) else (b, a) }.toSet
This will result in:
res1: Set[(Int, Int)] = Set((2, 1), (3, 2), (4, 2))
In Spark RDD the above can be expressed as:
sc.parallelize((2, 1) :: (3, 2) :: (1, 2) :: (2, 4) :: (4, 2) :: Nil)
  .map { case (a, b) => if (a > b) (a, b) else (b, a) }
  .distinct()
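As a quick check (using the sample data above; the ordering of distinct() output is not guaranteed), collecting the result gives the same pairs as the vanilla Scala version:
sc.parallelize((2, 1) :: (3, 2) :: (1, 2) :: (2, 4) :: (4, 2) :: Nil)
  .map { case (a, b) => if (a > b) (a, b) else (b, a) }
  .distinct().collect()
// Array((2,1), (3,2), (4,2)), in some order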

Scala: sliding(N,N) vs grouped(N)

I have lately found myself using sliding(n,n) when I need to iterate over collections in groups of n elements without re-processing any of them. I was wondering whether it would be more correct to iterate over those collections using grouped(n). My question is whether there is any special reason, in terms of performance, to prefer one or the other for this specific case.
val listToGroup = List(1,2,3,4,5,6,7,8)
listToGroup: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8)
listToGroup.sliding(3,3).toList
res0: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6), List(7, 8))
listToGroup.grouped(3).toList
res1: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6), List(7, 8))
The reason to use sliding instead of grouped really only applies when you want the 'windows' to have a length different from the amount you slide by (that is, using sliding(m, n) where m != n):
listToGroup.sliding(2,3).toList
//returns List(List(1, 2), List(4, 5), List(7, 8))
listToGroup.sliding(4,3).toList
//returns List(List(1, 2, 3, 4), List(4, 5, 6, 7), List(7, 8))
As som-snytt points out in a comment, there's not going to be any performance difference, as both of them are implemented within Iterator as returning a new GroupedIterator. However, it's simpler to write grouped(n) than sliding(n, n), and your code will be cleaner and more obvious in its intended behavior, so I would recommend grouped(n).
As an example for where to use sliding, consider this problem where grouped simply doesn't suffice:
Given a list of numbers, find the sublist of length 4 with the greatest sum.
Now, putting aside the fact that a dynamic programming approach can produce a more efficient result, this can be solved as:
def maxLengthFourSublist(list: List[Int]): List[Int] = {
  list.sliding(4, 1).maxBy(_.sum)
}
If you were to use grouped here, you wouldn't get all the sublists, so sliding is more appropriate.
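For instance, with a small made-up input, the sliding version picks the window with the largest sum:
maxLengthFourSublist(List(1, -2, 3, 4, 5, -1, 2))
// returns List(3, 4, 5, -1), whose sum of 11 is the largest among the length-4 windows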

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I have searched for a solution for a long time but haven't found a correct algorithm.
Using Spark RDDs in Scala, how could I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], knowing that I can't use collect or other methods that may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I get:
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like :
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would cost too much to do it using filter on each key A, B, C of RDD[(Key, Value)], but I don't know if calling filter as many times as there are different keys would be efficient (of course not, but maybe using cache?).
Thank you
You should use code like this (Python):
rdd = sc.parallelize([("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
    # filter the cached RDD down to this key and keep only the values
    out = rdd.filter(lambda x: x[0] == key).map(lambda kv: kv[1])
    out.saveAsNewAPIHadoopFile(...)
One RDD cannot be a part of another RDD, so you have no option but to just collect the keys and transform their related values into separate RDDs. In my example you iterate over the cached RDD, which is OK and will work fast.
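For reference, a rough Scala equivalent of the loop above (a sketch only; saveAsTextFile stands in for the question's saveAsNewAPIHadoopFile, and the output path is made up):
val rdd = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6))).cache()
val keys = rdd.keys.distinct.collect()
for (key <- keys) {
  // one filtering pass over the cached RDD per distinct key
  rdd.filter(_._1 == key).values.saveAsTextFile(s"output/$key")
}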
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]], consider using a MultipleTextOutputFormat, similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to take only one pass over the RDD after the shuffle, and you get the same result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supports pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be much slower.
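For illustration, a sketch of that pattern using the old Hadoop mapred API (the class name RDDKeyBasedOutput and the output path are made up; adapt the key and value types to your data):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// route each record to a file named after its key and drop the key from the output line
class RDDKeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.toString
}

val pairs = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)))
pairs.mapValues(_.toString)
  .saveAsHadoopFile("output/by-key", classOf[String], classOf[String], classOf[RDDKeyBasedOutput])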
This is my simple test code.
val test_RDD = sc.parallelize(List(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map { v =>
  var result_list: List[Int] = Nil
  for (i <- v._2) {
    result_list ::= i
  }
  (v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)))
val nil_list: List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
  (acc, value) => value :: acc,
  (acc1, acc2) => acc1 ::: acc2)
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))
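As a side note, the same per-key lists (up to the order of elements within each list) can also be obtained more compactly with groupByKey plus mapValues:
val result3 = test_RDD.groupByKey().mapValues(_.toList)
result3.take(3)
// e.g. Array((A,List(1, 2, 3)), (B,List(4, 5)), (C,List(6))); element order within each list may vary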