How to Sum a part of a list in RDD - scala

I have an RDD, and I would like to sum a part of the list.
(key, element2 + element3)
(1, List(2.0, 3.0, 4.0, 5.0)), (2, List(1.0, -1.0, -2.0, -3.0))
output should look like this,
(1, 7.0), (2, -3.0)
Thanks

You can map and indexing on the second part:
yourRddOfTuples.map(tuple => {val list = tuple._2; list(1) + list(2)})
Update after your comment, convert it to Vector:
yourRddOfTuples.map(tuple => {val vs = tuple._2.toVector; vs(1) + vs(2)})
Or if you do not want to use conversions:
yourRddOfTuples.map(_._2.drop(1).take(2).sum)
This skips the first element (.drop(1)) from the second element of the tuple (.map(_._2), takes the next two (.take(2)) (might be less if you have less) and sums them (.sum).

You can map the key-list pair to obtain the 2nd and 3rd list elements as follows:
val rdd = sc.parallelize(Seq(
(1, List(2.0, 3.0, 4.0, 5.0)),
(2, List(1.0, -1.0, -2.0, -3.0))
))
rdd.map{ case (k, l) => (k, l(1) + l(2)) }.collect
// res1: Array[(Int, Double)] = Array((1,7.0), (2,-3.0))

Related

Spark scala faster way to groupbykey and sort rdd values [duplicate]

This question already has answers here:
take top N after groupBy and treat them as RDD
(4 answers)
Closed 4 years ago.
I have a rdd with format of each row (key, (int, double))
I would like to transform the rdd into (key, ((int, double), (int, double) ...) )
Where the the values in the new rdd is the top N values pairs sorted by the double
So far I came up with the solution below but it's really slow and runs forever, it works fine with smaller rdd but now the rdd is too big
val top_rated = test_rated.partitionBy(new HashPartitioner(4)).sortBy(_._2._2).groupByKey()
.mapValues(x => x.takeRight(n))
I wonder if there are better and faster ways to do this?
The most efficient way is probably aggregateByKey
type K = String
type V = (Int, Double)
val rdd: RDD[(K, V)] = ???
//TODO: implement a function that adds a value to a sorted array and keeps top N elements. Returns the same array
def addToSortedArray(arr: Array[V], newValue: V): Array[V] = ???
//TODO: implement a function that merges 2 sorted arrays and keeps top N elements. Returns the first array
def mergeSortedArrays(arr1: Array[V], arr2: Array[V]): Array[V] = ??? //TODO
val result: RDD[(K, Array[(Int, Double)])] = rdd.aggregateByKey(zeroValue = new Array[V](0))(seqOp = addToSortedArray, combOp = mergeSortedArrays)
Since you're interested only in the top-N values in your RDD, I would suggest that you avoid sorting across the entire RDD. In addition, use the more performing reduceByKey rather than groupByKey if at all possible. Below is an example using a topN method, borrowed from this blog:
def topN(n: Int, list: List[(Int, Double)]): List[(Int, Double)] = {
def bigHead(l: List[(Int, Double)]): List[(Int, Double)] = list match {
case Nil => list
case _ => l.tail.foldLeft( List(l.head) )( (acc, x) =>
if (x._2 <= acc.head._2) x :: acc else acc :+ x
)
}
def update(l: List[(Int, Double)], e: (Int, Double)): List[(Int, Double)] = {
if (e._2 > l.head._2) bigHead((e :: l.tail)) else l
}
list.drop(n).foldLeft( bigHead(list.take(n)) )( update ).sortWith(_._2 > _._2)
}
val rdd = sc.parallelize(Seq(
("a", (1, 10.0)), ("a", (4, 40.0)), ("a", (3, 30.0)), ("a", (5, 50.0)), ("a", (2, 20.0)),
("b", (3, 30.0)), ("b", (1, 10.0)), ("b", (4, 40.0)), ("b", (2, 20.0))
))
val n = 2
rdd.
map{ case (k, v) => (k, List(v)) }.
reduceByKey{ (acc, x) => topN(n, acc ++ x) }.
collect
// res1: Array[(String, List[(Int, Double)])] =
// Array((a,List((5,50.0), (4,40.0))), (b,List((4,40.0), (3,30.0)))))

Scala - Reduce list of tuples by key

I have list of tuples which contains userId and point. I want to combine or reduce this list by the key.
val points: List[(Int, Double)] = List(
(1, 1.0),
(2, 3.2),
(4, 2.0),
(1, 4.0),
(2, 6.8)
)
The expected result should look like:
List((1, 5.0), (2, 10.0), (4, 2.0))
I tried with groupBy and mapValue, but got an error:
val aggrPoint: Map[Int, Double] = incomes.groupBy(_._1).mapValues(seq => seq.reduce(_._2 + _._2))
Error:(16, 180) type mismatch;
found : Double
required: (Int, Double)
What am I doing wrong, and is there a idiomatic way to achieve this?
P.S) I found that in Spark aggregateByKey does this job. But, is there a built-in method in Scala?
What am I doing wrong, and is there a idiomatic way to achieve this?
let's go step by step to see what are you doing wrong. (I am going to use REPL)
first of all lets define the points
scala> val points: List[(Int, Double)] = List(
| (1, 1.0),
| (2, 3.2),
| (4, 2.0),
| (1, 4.0),
| (2, 6.8)
| )
points: List[(Int, Double)] = List((1,1.0), (2,3.2), (4,2.0), (1,4.0), (2,6.8))
As you can see that you have List[Tuple2[Int, Double]] so when you do groupBy and mapValues as
scala> points.groupBy(_._1).mapValues(seq => println(seq))
List((2,3.2), (2,6.8))
List((4,2.0))
List((1,1.0), (1,4.0))
res1: scala.collection.immutable.Map[Int,Unit] = Map(2 -> (), 4 -> (), 1 -> ())
You can see that seq object is of List[Tuple2[Int, Double]] again but only contains the grouped tuples as list.
So when you apply seq.reduce(_._2 + _._2), the reduce function takes two inputs of Tuple2[Int, Double] but the output is Double only which doesn't match for the next iteration on seq as the expected input is Tuple2[Int, Double]. Thats the main issue. All you have to do is match the input and output types for reduce function
One way would be to match Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)})
res6: scala.collection.immutable.Map[Int,(Int, Double)] = Map(2 -> (2,10.0), 4 -> (4,2.0), 1 -> (1,5.0))
But this isn't your desired output, so you can extract the double value from the reduced Tuple2[Int, Double] as
scala> points.groupBy(_._1).mapValues(seq => seq.reduce{(x,y) => (x._1, x._2 + y._2)}._2)
res8: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
or you can just use map before you apply reduce function as
scala> points.groupBy(_._1).mapValues(seq => seq.map(_._2).reduce(_ + _))
res3: scala.collection.immutable.Map[Int,Double] = Map(2 -> 10.0, 4 -> 2.0, 1 -> 5.0)
I hope the explanation is clear enough to understand your mistake and you must have understood how a reduce function works
You can map the tuples in the mapValues to their 2nd elements then sum them as follows:
points.groupBy(_._1).mapValues( _.map(_._2).sum ).toList
// res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))
Using collect
points.groupBy(_._1).collect{
case e => e._1 -> e._2.map(_._2).sum
}.toList
//res1: List[(Int, Double)] = List((2,10.0), (4,2.0), (1,5.0))

Counting number of occurrences of Array element in a RDD

I have a RDD1 with Key-Value pair of type [(String, Array[String])] (i will refer to them as (X, Y)), and a Array Z[String].
I'm trying for every element in Z to count how many X instances there are that have Z in Y. I want my output as ((X, Z(i)), #ofinstances).
RDD1= ((A, (2, 3, 4), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
then i want to get:
(((A, 4), 2), ((B, 4), 1))
Hope that made sense.
As you can see over, i only want an element if there is atleast one occurence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit]
Im not sure if what i'm asking for is even possible, or if i have to do it an other way.
So it is just another word count
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient convert z to Set:
val zs = z.toSet
val result = rdd
.flatMapValues(_.filter(zs contains _).distinct)
.map((_, 1))
.reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
filters out values that occur in z and deduplicates.
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)

Scala select n-th element of the n-th element of an array / RDD

I have the following RDD:
val a = List((3, 1.0), (2, 2.0), (4, 2.0), (1,0.0))
val rdd = sc.parallelize(a)
Ordering the tuple elements by their right hand component in ascending order I would like to:
Pick the 2nd smallest result i.e. (3, 1.0)
Select the left hand element i.e. 3
The following code does that but it is so ugly and inefficient that I was wondering if someone could suggest something better.
val b = ((rdd.takeOrdered(2).zipWithIndex.map{case (k,v) => (v,k)}).toList find {x => x._1 == 1}).map(x => x._2).map(x=> x._1)
Simply:
implicit val ordering = scala.math.Ordering.Tuple2[Double, Int]
rdd.map(_.swap).takeOrdered(2).max.map { case (k, v) => v }
Sorry, I don't know spark, but maybe a standard method from Scala collections would work?:
rdd.sortBy {case (k,v) => v -> k}.apply(2)
val a = List((3, 1.0), (2, 2.0), (4, 2.0), (1,0.0))
val rdd = sc.parallelize(a)
rdd.map(_.swap).takeOrdered(2).max._2

sort two lists by their first element and zip them in scala

val descrList = cursorReal.interfaceInfo.interfaces.map {
case values => (values.ifIndex , values.ifName , values.ifType)
}
val ipAddressList = cursorReal.interfaceIpAndIndex.filter(x=> (!x.ifIpAddress.equalsIgnoreCase("0"))).map {
case values => (values.ifIndex,values.ifIpAddress)
}
For instance,
val descrList =
List((12,"VoIP-Null0",1), (8,"FastEthernet6",6), (19,"Vlan11",53),
(4,"FastEthernet2",6), (15,"Vlan1",53), (11,"GigabitEthernet0",6),
(9,"FastEthernet7",6), (22,"Vlan20",53), (13,"Wlan-GigabitEthernet0",6),
(16,"Async1",1), (5,"FastEthernet3",6), (10,"FastEthernet8",6),
(21,"Vlan12",53), (6,"FastEthernet4",6), (1,"wlan-ap0",24),
(17,"Virtual-Template1",131), (14,"Null0",1), (20,"Vlan10",53),
(2,"FastEthernet0",6), (18,"NVI0",1), (7,"FastEthernet5",6),
(29,"Virtual-Access7",131), (3,"FastEthernet1",6), (28,"Virtual-Access6",131))
val ipAddressList = List((21,"192.168.12.1"), (19,"192.168.11.1"),
(11,"104.36.252.115"), (20,"192.168.10.1"),
(22,"192.168.20.1"))
In both the lists first element is index and i have to merge these two list index wise . It means
(21,"192.168.12.1") this ipAddress should merge with (21,"Vlan12",53) and form new list like below (21,"Vlan12",53,"192.168.12.1").
scala> descrList map {case (index, v1, v2) =>
(index, v1, v2, ipAddressList.toMap.getOrElse(index, "empty"))}
res0: List[(Int, String, Int, String)] = List(
(12,VoIP-Null0,1,empty), (8,FastEthernet6,6,empty), (19,Vlan11,53,192.168.11.1),
(4,FastEthernet2,6,empty), (15,Vlan1,53,empty), (11,GigabitEthernet0,6,104.36.252.115),
(9,FastEthernet7,6,empty), (22,Vlan20,53,192.168.20.1), (13,Wlan-GigabitEthernet0,6,empty),
(16,Async1,1,empty), (5,FastEthernet3,6,empty), (10,FastEthernet8,6,empty),
(21,Vlan12,53,192.168.12.1), (6,FastEthernet4,6,empty), (1,wlan-ap0,24,empty), (17,Virtual-
Template1,131,empty), (14,Null0,1,empty), (20,Vlan10,53,192.168.10.1), (2,FastEthernet0,6,empty),
(18,NVI0,1,empty), (7,FastEthernet5,6,empty), (29,Virtual-Access7,131,empty),
(3,FastEthernet1,6,empty), (28,Virtual-Access6,131,empty))
First, I would suggest you produce a Map instead of a List. A Map by nature has an indexer, and in your case this would be the ifIndex value.
Once you have Maps in place, you can use something like this (sample from this other SO Best way to merge two maps and sum the values of same key?)
From Rex Kerr:
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
Or like this from Matthew Farwell:
(map1.keySet ++ map2.keySet).map (i=> (i,map1.getOrElse(i,0) + map2.getOrElse(i,0))}.toMap
If you cannot use Maps for whatever reason, then look into your existing project libraries. If you have Scalaz, then you have some tools already available.
Scalaz: https://github.com/scalaz/scalaz
If you have Slick, you also have some nice tools to directly use.
Slick: http://slick.typesafe.com/docs/
Consider first converting decrList to a Map, like this,
val a = (for ( (k,v1,v2) <- descrList) yield k -> (v1,v2)).toMap
Then we can look up keys for ipAddressList and agglomerate elements into a new tuple, as follows,
for ( (k,ip) <- ipAddressList ; v = a.getOrElse(k,("none","none")) ) yield (k,v._1,v._2,ip)
Hence, for ipAddressList,
res: List((21,Vlan12,53,192.168.12.1), (19,Vlan11,53,192.168.11.1),
(11,GigabitEthernet0,6,104.36.252.115), (20,Vlan10,53,192.168.10.1),
(22,Vlan20,53,192.168.20.1))
Given the data:
val descrList =
List((12, "VoIP-Null0", 1), (8, "FastEthernet6", 6), (19, "Vlan11", 53),
(4, "FastEthernet2", 6), (15, "Vlan1", 53), (11, "GigabitEthernet0", 6),
(9, "FastEthernet7", 6), (22, "Vlan20", 53), (13, "Wlan-GigabitEthernet0", 6),
(16, "Async1", 1), (5, "FastEthernet3", 6), (10, "FastEthernet8", 6),
(21, "Vlan12", 53), (6, "FastEthernet4", 6), (1, "wlan-ap0", 24),
(17, "Virtual-Template1", 131), (14, "Null0", 1), (20, "Vlan10", 53),
(2, "FastEthernet0", 6), (18, "NVI0", 1), (7, "FastEthernet5", 6),
(29, "Virtual-Access7", 131), (3, "FastEthernet1", 6), (28, "Virtual-Access6", 131))
val ipAddressList = List((21, "192.168.12.1"), (19, "192.168.11.1"),
(11, "104.36.252.115"), (20, "192.168.10.1"),
(22, "192.168.20.1"))
Merge and sort:
val addrMap = ipAddressList.toMap
val output = descrList
.filter(x => addrMap.contains(x._1))
.map(x => x match { case (i, a, b) => (i, a, b, addrMap(i)) })
.sortBy(_._1)
output foreach println
Output:
(11,GigabitEthernet0,6,104.36.252.115)
(19,Vlan11,53,192.168.11.1)
(20,Vlan10,53,192.168.10.1)
(21,Vlan12,53,192.168.12.1)
(22,Vlan20,53,192.168.20.1)