How to understand reduceByKey(_:::_) in Spark Scala [closed] - scala

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I can't understand this Spark example in Scala:
val result = rdd.map(x => ((x._1, x._2), List(x._3))).reduceByKey(_ ::: _)

::: is a method for concatenating two Scala Lists. For example:
List(1, 2) ::: List(3) // List(1, 2, 3)
_ ::: _ is a shortcut for the binary function (l1, l2) => l1 ::: l2
Given a PairRDD (i.e. RDD[(K, V)]), method reduceByKey takes a function (V, V) => V to perform reduction on values of type V for each key of type K.
With a PairRDD of type RDD[(K, List[U])], one can perform a (l1, l2) => l1 ::: l2 reduction on the List[U] values for each key, as shown in the following example:
val rdd = sc.parallelize(Seq(
('x', 1, "a"),
('x', 1, "b"),
('y', 2, "c"),
('y', 2, "d"),
('y', 2, "e")
))
val pairRDD = rdd.map(r => ((r._1, r._2), List(r._3))) // RDD[((Char, Int), List[String])]
val result = pairRDD.reduceByKey(_ ::: _)
result.collect
// Array[((Char, Int), List[String])] = Array(
// ((y, 2), List(c, d, e)),
// ((x, 1), List(a, b))
// )
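To see what the reducer does without a Spark cluster, the same data can be pushed through plain Scala collections. This is only a local sketch: `groupBy` followed by `reduce` stands in for `reduceByKey`, and the names `concat`, `records`, `pairs` and `reduced` are made up for illustration.

```scala
// Plain Scala collections only; no Spark dependency is needed here.
// `_ ::: _` written out as an explicitly typed binary function.
val concat: (List[String], List[String]) => List[String] = _ ::: _

val records = Seq(
  ('x', 1, "a"),
  ('x', 1, "b"),
  ('y', 2, "c"),
  ('y', 2, "d"),
  ('y', 2, "e")
)

// Same shape as pairRDD: the key is (Char, Int), the value is a one-element List.
val pairs = records.map(r => ((r._1, r._2), List(r._3)))

// Local stand-in for reduceByKey: collect the values per key,
// then combine each group pairwise with `concat`.
val reduced: Map[(Char, Int), List[String]] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(concat) }
```

Spark documents reduceByKey as expecting an associative and commutative function. `:::` is associative but not commutative, so on a real cluster the order of elements inside each resulting List may differ from run to run.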

Related

spark iterate consecutive elements in list to rows

I have a spark rdd with a column like
List(1, 3, 4, 8)
List(2, 3)
List(1, 5, 6)
I would like to get a new RDD in which each pair of consecutive elements in a list becomes a row, like
(1, 3)
(3, 4)
(4, 8)
(2, 3)
(1, 5)
(5, 6)
How can I achieve this with scala?
Consider writing a complementary (plain Scala) function with signature List[Int] => List[(Int, Int)] that produces the desired result for a single list, and then passing this function to your RDD's flatMap method.
This complementary function may look like this:
def makeTuples(l: List[Int],
               acc: List[(Int, Int)] = List.empty): List[(Int, Int)] =
  l match {
    case Nil | _ :: Nil => acc.reverse
    case a :: b :: rest => makeTuples(b :: rest, (a, b) :: acc)
  }
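As a possible alternative to the recursive helper, a list can be zipped with itself shifted by one position. The sketch below uses a plain Scala List of Lists in place of the RDD, and the name `consecutivePairs` is hypothetical. On an RDD the same function would be passed to flatMap in the same way.

```scala
// Pairs every element with its successor; empty and single-element lists yield Nil.
def consecutivePairs(l: List[Int]): List[(Int, Int)] = l.zip(l.drop(1))

// Local stand-in for the RDD rows from the question.
val rows = List(List(1, 3, 4, 8), List(2, 3), List(1, 5, 6))

// flatMap flattens the per-list results into one sequence of rows.
val pairRows: List[(Int, Int)] = rows.flatMap(consecutivePairs)
```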

Spark scala faster way to groupbykey and sort rdd values [duplicate]

This question already has answers here:
take top N after groupBy and treat them as RDD
(4 answers)
Closed 4 years ago.
I have an RDD in which each row has the format (key, (int, double)).
I would like to transform it into (key, ((int, double), (int, double), ...)),
where the values in the new RDD are the top N value pairs sorted by the double.
So far I have come up with the solution below, but it is really slow. It works fine on a smaller RDD, but the RDD is now too big and the job runs forever.
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
I wonder if there are better and faster ways to do this?
The most efficient way is probably aggregateByKey:
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ??? //TODO
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
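One way the two TODO helpers could be filled in is sketched below. It is an assumption-laden sketch rather than the answer's intended implementation: it allocates a new array on every call instead of updating the array in place, it sorts the whole array instead of doing a sorted insert or merge, and `n` is an assumed example value. The local check replaces the RDD with two plain sequences standing in for partitions.

```scala
type V = (Int, Double)
val n = 2 // assumed number of values to keep per key

// Adds one value, then keeps at most n entries ordered by the Double, largest first.
def addToSortedArray(arr: Array[V], newValue: V): Array[V] =
  (arr :+ newValue).sortWith(_._2 > _._2).take(n)

// Combines two partial results and again keeps only the n largest entries.
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] =
  (arr1 ++ arr2).sortWith(_._2 > _._2).take(n)

// Local check: fold each "partition" separately, then merge the partial results,
// which mirrors how seqOp and combOp are applied by aggregateByKey.
val partition1 = Seq((1, 10.0), (4, 40.0), (3, 30.0))
val partition2 = Seq((5, 50.0), (2, 20.0))
val top1 = partition1.foldLeft(Array.empty[V])(addToSortedArray)
val top2 = partition2.foldLeft(Array.empty[V])(addToSortedArray)
val mergedTop = mergeSortedArrays(top1, top2)
```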
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the more performant reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
  // Reorders l so that the element with the smallest Double is at the head
  def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = l match {
    case Nil => l
    case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
      if (x._2 <= acc.head._2) x :: acc else acc :+ x
    )
  }
  // Replaces the current minimum (the head) with e when e is larger
  def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = l match {
    case h :: t if e._2 > h._2 => bigHead(e :: t)
    case _ => l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0))))

Flink WindowFunction Fold

I create a sliding window and want to collect all the elements that enter each window into a list.
This is the relevant chunk of the code:
.map(x => ((x.pickup.get.latitude, x.pickup.get.longitude), (x.dropoff.get.latitude, x.dropoff.get.longitude)))
.windowAll(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
.fold(List[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
I hope to create a List whose elements are tuples, but this does not work.
I tried this and it works:
val l2 : List[((Int, Int), (Int, Int))] = List(((1, 1), (2, 2)))
val newl2 = l2 :+ ((3, 3), (4, 4))
How can I do this?
Thanks so much
The first argument of the fold function needs to be an initial value, not a type. Changing the last line to:
.fold(List.empty[((Long, Long), (Long, Long))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
should do the trick.
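The same point can be seen with foldLeft on ordinary Scala collections, which takes its initial value in the first argument list in the same way. This is a sketch with plain collections rather than the Flink API. The aliases `Point` and `Trip` and the sample values are invented for illustration, and Double is used only because the question's own attempt uses it.

```scala
type Point = (Double, Double)
type Trip = (Point, Point)

val trips: Seq[Trip] = Seq(
  ((1.0, 2.0), (3.0, 4.0)),
  ((5.0, 6.0), (7.0, 8.0))
)

// List.empty[Trip] is an actual (empty) value of type List[Trip],
// so it can seed the fold; a bare type such as List[Trip] cannot.
val collected: List[Trip] =
  trips.foldLeft(List.empty[Trip])((acc, v) => acc :+ v)
```

The element type given to List.empty has to match the type of the elements being folded.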

The operation about Merge two tuples

I have two tuples like these:
1st tuple:((A,1),(B,3),(D,5)......)
2nd tuple:((A,3),(B,1),(E,6)......)
I need a function that merges the two tuples into this:
((A,1,3),(B,3,1),(D,5,0),(E,0,6)......)
If the first tuple contains a key that is not in the second tuple, set the value to 0, and vice versa. How could I code this function in scala?
Let's say you get the input in the format
val tuple1: List[(String, Int)] = List(("A",1),("B",3),("D",5),("E",0))
val tuple2: List[(String, Int)] = List(("A",3),("B",1),("D",6))
You can write a merge function as
def merge(tuple1: List[(String, Int)], tuple2: List[(String, Int)]) =
{
  val map1 = tuple1.toMap
  val map2 = tuple2.toMap
  map1.map{ case (k,v) =>
    (k, v, map2.getOrElse(k, 0))
  }
}
On calling the function
merge(tuple1,tuple2)
You will get the output as
res0: scala.collection.immutable.Iterable[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0))
Please let me know if that answers your question.
val t1= List(("A",1),("B",3),("D",5),("E",0))
val t2= List(("A",3),("B",1),("D",6),("H",5))
val t3 = t2.filter{ case (k,v) => !t1.exists{case (k1,_) => k1==k} }.map{case (k,_) => (k,0)}
val t4 = t1.filter{ case (k,v) => !t2.exists{case (k1,_) => k1==k} }.map{case (k,_) => (k,0)}
val t5=(t1 ++ t3).sortBy{case (k,v) => k}
val t6=(t2 ++ t4).sortBy{case (k,v) => k}
t5.zip(t6).map{case ((k,v1),(_,v2)) => (k,v1,v2) }
res: List[(String, Int, Int)] = List(("A", 1, 3), ("B", 3, 1), ("D", 5, 6), ("E", 0, 0), ("H", 0, 5))
Here is what is happening:
t3 and t4 hold the keys that are missing from t1 and t2 respectively, each paired with a zero value.
t5 and t6 are the sorted, completed lists (t1 with t3, and t2 with t4). Finally, they are zipped together and transformed into the desired output.
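Note that the first merge above iterates only over the keys of tuple1, so a key that appears only in tuple2 would be dropped. A shorter sketch that covers both directions looks up every key from the union of the two key sets. The name `mergeAll` is hypothetical, and the sample data follows the question.

```scala
def mergeAll(left: List[(String, Int)],
             right: List[(String, Int)]): List[(String, Int, Int)] = {
  val leftMap = left.toMap
  val rightMap = right.toMap
  // Every key from either side, in sorted order; a missing value defaults to 0.
  (leftMap.keySet ++ rightMap.keySet).toList.sorted
    .map(k => (k, leftMap.getOrElse(k, 0), rightMap.getOrElse(k, 0)))
}

val mergedTuples = mergeAll(
  List(("A", 1), ("B", 3), ("D", 5)),
  List(("A", 3), ("B", 1), ("E", 6))
)
```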

Reduce/fold over scala sequence with grouping

In Scala, given an Iterable of pairs, say Iterable[(String, Int)],
is there a way to accumulate or fold over the ._2s based on the ._1s? For example, in the following list, add up all the numbers that come after A and, separately, the numbers after B:
List(("A", 2), ("B", 1), ("A", 3))
I could do this in 2 steps with groupBy
val mapBy1 = list.groupBy( _._1 )
for ((key,sublist) <- mapBy1) yield (key, sublist.foldLeft(0) (_+_._2))
but then I would be allocating the sublists, which I would rather avoid.
You could build the Map as you go and convert it back to a List after the fact.
listOfPairs.foldLeft(Map[String,Int]().withDefaultValue(0)){
case (m,(k,v)) => m + (k -> (v + m(k)))
}.toList
You could do something like:
list.foldLeft(Map[String, Int]()) {
case (map, (k,v)) => map + (k -> (map.getOrElse(k, 0) + v))
}
You could also use groupBy with mapValues:
list.groupBy(_._1).mapValues(_.map(_._2).sum).toList
res1: List[(String, Int)] = List((A,5), (B,1))
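On Scala 2.13 or later (an assumption about the version in use), groupMapReduce groups, maps and reduces in one call without building the intermediate sublists that groupBy creates. A minimal sketch, with the value name `sums` chosen for illustration:

```scala
val list = List(("A", 2), ("B", 1), ("A", 3))

// Key by the first component, keep the second component, and add values per key.
val sums: Map[String, Int] = list.groupMapReduce(_._1)(_._2)(_ + _)
```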