I can't understand this Spark with Scala example:
val result = rdd.map(x => ((x._1, x._2), List(x._3))).reduceByKey(_ ::: _)
::: is a method for concatenating two Scala Lists. For example:
List(1, 2) ::: List(3) // List(1, 2, 3)
_ ::: _ is a shortcut for the binary function (l1, l2) => l1 ::: l2
Given a PairRDD (i.e. RDD[(K, V)]), method reduceByKey takes a function (V, V) => V to perform reduction on values of type V for each key of type K.
With a PairRDD of type RDD[(K, List[U])], one can perform a (l1, l2) => l1 ::: l2 reduction on the List[U] values for each key, as shown in the following example:
val rdd = sc.parallelize(Seq(
('x', 1, "a"),
('x', 1, "b"),
('y', 2, "c"),
('y', 2, "d"),
('y', 2, "e")
))
val pairRDD = rdd.map(r => ((r._1, r._2), List(r._3))) // RDD[((Char, Int), List[String])]
val result = pairRDD.reduceByKey(_ ::: _)
result.collect
// Array[((Char, Int), List[String])] = Array(
// ((y, 2), List(c, d, e)),
// ((x, 1), List(a, b))
// )
I have a spark rdd with a column like
List(1, 3, 4, 8)
List(2, 3)
List(1, 5, 6)
I would like to get a new rdd with consecutive elements in each list to rows, like
(1, 3)
(3, 4)
(4, 8)
(2, 3)
(1, 5)
(5, 6)
How can I achieve this with scala?
Consider:
- using a complementary (plain Scala) function with signature List[Int] => List[(Int, Int)] to achieve the desired result for a single list, and
- passing this function to your RDD's flatMap method.
This complementary function may look like this:
def makeTuples(l: List[Int],
               acc: List[(Int, Int)] = List.empty): List[(Int, Int)] =
  l match {
    case Nil | _ :: Nil => acc.reverse
    case a :: b :: rest => makeTuples(b :: rest, (a, b) :: acc)
  }
This question already has answers here:
take top N after groupBy and treat them as RDD
(4 answers)
Closed 4 years ago.
I have an RDD where each row has the format (key, (int, double)).
I would like to transform the RDD into (key, ((int, double), (int, double) ...) ),
where the values in the new RDD are the top N value pairs sorted by the double.
So far I came up with the solution below, but it's really slow and runs forever. It works fine with a smaller RDD, but now the RDD is too big.
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
I wonder if there are better and faster ways to do this?
The most efficient way is probably aggregateByKey
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ??? //TODO
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
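One possible way to fill in those TODOs, as a simple sketch: it allocates new arrays instead of reusing the input (as the original TODO comments suggest for efficiency), and it assumes a top-N size n is in scope:

```scala
val n = 2 // assumed top-N size for this sketch

type V = (Int, Double)

// Add a value, then keep only the n largest, sorted descending by the Double.
def addToSortedArray(arr: Array[V], newValue: V): Array[V] =
  (newValue +: arr).sortBy(-_._2).take(n)

// Merge two arrays, then keep only the n largest, sorted descending.
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] =
  (arr1 ++ arr2).sortBy(-_._2).take(n)
```

Since each per-key array never holds more than n elements, every step sorts at most n + 1 (or 2n) values, which stays cheap even on a large RDD.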
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the better-performing reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
  // Rearranges a list so that the element with the smallest Double is at the head.
  def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = l match {
    case Nil => l
    case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
      if (x._2 <= acc.head._2) x :: acc else acc :+ x
    )
  }
  def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
    if (e._2 > l.head._2) bigHead(e :: l.tail) else l
  }
  list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0))))
I create a sliding window and hope to recursively pack all the elements that enter that window period.
This is a chunk of the code:
.map(x => ((x.pickup.get.latitude, x.pickup.get.longitude), (x.dropoff.get.latitude, x.dropoff.get.longitude)))
.windowAll(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
.fold(List[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
I hope to create a List in which the elements are tuples, but this does not work.
I tried this and it works:
val l2 : List[((Int, Int), (Int, Int))] = List(((1, 1), (2, 2)))
val newl2 = l2 :+ ((3, 3), (4, 4))
How can I do this?
Thanks so much
The first argument of the fold function needs to be the initial value, not the type. Changing the last line into:
.fold(List.empty[((Double, Double), (Double, Double))]) {(acc, v) => acc :+ ((v._1._1, v._1._2), (v._2._1, v._2._2))}
should do the trick.
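The distinction is visible in plain Scala: List[T] alone names a type, while List.empty[T] is an actual (empty) value, which is what fold needs as its zero element. A minimal sketch with made-up data:

```scala
// Hypothetical pickup/dropoff coordinate pairs standing in for the window elements.
val events = List(((1.0, 1.0), (2.0, 2.0)), ((3.0, 3.0), (4.0, 4.0)))

// Fold starting from an empty-list *value*, appending each element.
val packed = events.foldLeft(List.empty[((Double, Double), (Double, Double))]) {
  (acc, v) => acc :+ v
}
// packed now contains every element in arrival order
```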
Suppose there are two tuples like these:
1st tuple: ((A,1),(B,3),(D,5)......)
2nd tuple: ((A,3),(B,1),(E,6)......)
And the function should merge the two tuples into this:
((A,1,3),(B,3,1),(D,5,0),(E,0,6)......)
If the first tuple contains a key that is not in the second tuple, set the value to 0, and vice versa. How could I code this function in Scala?
Let's say you get the input in the format:
val tuple1: List[(String, Int)] = List(("A",1),("B",3),("D",5),("E",0))
val tuple2: List[(String, Int)] = List(("A",3),("B",1),("D",6))
You can write a merge function over the union of the two key sets (so that keys present in only one input get a 0, covering the "vice versa" case too) as:
def merge(tuple1: List[(String, Int)], tuple2: List[(String, Int)]) = {
  val map1 = tuple1.toMap
  val map2 = tuple2.toMap
  (map1.keySet ++ map2.keySet).toList.sorted.map { k =>
    (k, map1.getOrElse(k, 0), map2.getOrElse(k, 0))
  }
}
On calling the function
merge(tuple1, tuple2)
you will get the output as
res0: List[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0))
Please let me know if that answers your question.
val t1 = List(("A",1),("B",3),("D",5),("E",0))
val t2 = List(("A",3),("B",1),("D",6),("H",5))
val t3 = t2.filter{ case (k,_) => !t1.exists{ case (k1,_) => k1 == k } }.map{ case (k,_) => (k,0) }
val t4 = t1.filter{ case (k,_) => !t2.exists{ case (k1,_) => k1 == k } }.map{ case (k,_) => (k,0) }
val t5 = (t1 ++ t3).sortBy{ case (k,_) => k }
val t6 = (t2 ++ t4).sortBy{ case (k,_) => k }
t5.zip(t6).map{ case ((k,v1),(_,v2)) => (k,v1,v2) }
res: List[(String, Int, Int)] = List((A,1,3), (B,3,1), (D,5,6), (E,0,0), (H,0,5))
In terms of what's happening here:
t3 and t4 find the keys missing from t1 and t2 respectively and add them with a zero value.
t5 and t6 sort the unified lists (t1 with t3, and t2 with t4). Lastly they are zipped together and transformed into the desired output.
In Scala, given an Iterable of pairs, say Iterable[(String, Int)],
is there a way to accumulate or fold over the ._2s based on the ._1s? For example, in the following, add up all the numbers that come after "A" and, separately, the numbers after "B":
List(("A", 2), ("B", 1), ("A", 3))
I could do this in 2 steps with groupBy
val mapBy1 = list.groupBy( _._1 )
for ((key,sublist) <- mapBy1) yield (key, sublist.foldLeft(0) (_+_._2))
but then I would be allocating the sublists, which I would rather avoid.
You could build the Map as you go and convert it back to a List after the fact.
listOfPairs.foldLeft(Map[String,Int]().withDefaultValue(0)){
case (m,(k,v)) => m + (k -> (v + m(k)))
}.toList
You could do something like:
list.foldLeft(Map[String, Int]()) {
case (map, (k,v)) => map + (k -> (map.getOrElse(k, 0) + v))
}
You could also use groupBy with mapValues:
list.groupBy(_._1).mapValues(_.map(_._2).sum).toList
res1: List[(String, Int)] = List((A,5), (B,1))