reduceByKey of List[Int] - scala

Suppose I have an RDD[(String, List[Int])], e.g. ("David", List(60, 70, 80)), ("John", List(70, 80, 90)). How can I use reduceByKey in Scala to calculate the average of each List[Int]? In the end, I want another RDD that looks like ("David", 70), ("John", 80).

Something based on reduceByKey doesn't directly work here because of its type signature:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
V in your case is List[Int], so you'd still end up with an RDD[(String, List[Int])].
A workaround is to keep the values as List[Int] through the reduce and compute the average afterwards:
val rddAvg: RDD[(String, Int)] =
  rdd1
    .reduceByKey(_ ++ _)                                  // merge all lists that share a key
    .mapValues(numbers => numbers.sum / numbers.length)   // then average
You could as well attempt something based on aggregateByKey: this function can return a different result type and would do the trick in one step.
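For example, a minimal aggregateByKey sketch (assuming a key may appear several times, each with a List[Int] value): keep a (sum, count) pair per key and divide at the end.
val rddAvg2: RDD[(String, Int)] =
  rdd1
    .aggregateByKey((0, 0))(
      (acc, numbers) => (acc._1 + numbers.sum, acc._2 + numbers.length), // fold one List[Int] into the (sum, count) accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)                               // merge accumulators across partitions
    )
    .mapValues { case (sum, count) => sum / count }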
Later edit: I dropped the example with groupByKey, as it is performance-wise inferior to reduceByKey or aggregateByKey for a use case like computing an average.

val data1 = List(("David", List(60, 70, 80)), ("John", List(70, 80, 90)))
val rdd1 = sc.parallelize(data1)
rdd1.mapValues(value => value.sum.toDouble / value.size).collect().foreach(println)

Related

Applying distinct on an RDD considering the whole key-value pair, not just the key

I have 2 pair RDDs on which I am doing a union to give a third RDD.
But the resulting RDD has tuples which are repeated:
rdd3 = {(1,2) , (3,4) , (1,2)}
I want to remove duplicate tuples from rdd3, but only when both the key and the value of a tuple are the same.
How can I do that?
You can directly invoke the Spark Scala API:
def distinct(): RDD[T]
Remember that it is a generic method with a type parameter.
If you invoke it on your RDD of type RDD[(Int, Int)], it will give you the distinct pairs of type (Int, Int) in your RDD, just as they are.
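A minimal sketch matching the question's rdd3 (the sample values are taken from the example above):
val rdd3 = sc.parallelize(Seq((1, 2), (3, 4), (1, 2)))
val deduped = rdd3.distinct()   // a tuple is dropped only when both key and value match an earlier one
deduped.collect()               // Array((1,2), (3,4)), in some order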
If you want to see the internals of this method, here is its implementation:
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
You can use distinct, for example:
val data = sc.parallelize(
  Seq(
    ("Foo", "41", "US", "3"),
    ("Foo", "39", "UK", "1"),
    ("Bar", "57", "CA", "2"),
    ("Bar", "72", "CA", "2"),
    ("Baz", "22", "US", "6"),
    ("Baz", "36", "US", "6"),
    ("Baz", "36", "US", "6")
  )
)
Remove the duplicates:
val distinctData = data.distinct()
distinctData.collect

groupByKey error: wrong number of parameters

I am kind of new to Spark and to programming, and I am not able to understand how to deal with an RDD of type rdd.RDD[(Int, Iterable[Double])] = ShuffledRDD[10] at groupByKey. I am interested in learning groupByKey in Spark, and I have a filtered RDD:
scala> p.first
res11: (Int, Double) = (1,299.98)
I got the result above; after applying groupByKey instead of reduceByKey I now have an RDD of type (Int, Iterable[Double]), and I want to get a result like (Int, sum(Double)).
I have tried this but got an error:
scala> val price = g.map((a,b) => (a, sum(b)))
<console>:33: error: wrong number of parameters; expected = 1
val price = g.map((a,b) => (a, sum(b)))
Please suggest how to do this and help me understand it.
The error occurs because map expects a function of one parameter (the tuple), not two. Use g.mapValues(_.sum), which is short for g.map { case (k, v) => (k, v.sum) }.
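A minimal end-to-end sketch, assuming g was produced by groupByKey on an RDD[(Int, Double)] (the sample values below are made up):
val p = sc.parallelize(Seq((1, 100.0), (1, 200.0), (2, 50.0)))
val g = p.groupByKey()            // RDD[(Int, Iterable[Double])]
val price = g.mapValues(_.sum)    // RDD[(Int, Double)] with one sum per key
price.collect().foreach(println)  // (1,300.0) and (2,50.0), in some order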

Break tuple in RDD to two tuples

I have an RDD[(String, (Iterable[Int], Iterable[Coordinate]))]
What I would like to do is break up the Iterable[Int] so that each element becomes its own tuple of the form (String, Int, Iterable[Coordinate]).
For an example, I would like to transform:
('a',<1,2,3>,<(45.34,32.33),(45.36,32.34)>)
('b',<1>,<(46.64,32.66),(46.67,32.71)>)
to
('a',1,<(45.34,32.33),(45.36,32.34)>)
('a',2,<(45.34,32.33),(45.36,32.34)>)
('a',3,<(45.34,32.33),(45.36,32.34)>)
('b',1,<(46.64,32.66),(46.67,32.71)>)
How is this done in Scala?
Try flatMap:
rdd.flatMap { case (v, (i1, i2)) => i1.map(i => (v, i, i2)) }
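A minimal sketch with made-up sample data, using (Double, Double) pairs in place of the question's Coordinate type:
val rdd = sc.parallelize(Seq(
  ("a", (Seq(1, 2, 3), Seq((45.34, 32.33), (45.36, 32.34)))),
  ("b", (Seq(1), Seq((46.64, 32.66), (46.67, 32.71))))
))
// one output row per element of the first Iterable; the coordinates are repeated for each
val flattened = rdd.flatMap { case (v, (i1, i2)) => i1.map(i => (v, i, i2)) }
flattened.collect().foreach(println)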

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in Spark. For that I need a Tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD, and the nifty underscore notation of reduceByKey for conciseness.
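A small illustration with made-up data: a Map is an Iterable of key/value pairs, so flattening an RDD[Map[String, Int]] yields an RDD[(String, Int)] that reduceByKey can work on.
val pairs = sc.parallelize(Seq(Map("a" -> 1), Map("a" -> 1, "b" -> 1)))
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
counts.collect().foreach(println)   // (a,2) and (b,1), in some order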

Scala Spark - Reduce RDD by adding multiple values per key

I have a Spark RDD in the format (String, (Int, Int)) and I would like to add the Int values together to create a (String, Int) RDD.
This is an example of an element in my RDD:
res13: (String, (Int, Int)) = (9D4669B432A0FD,(1,1))
I would like to end up with an RDD of (String, Int), e.g. (9D4669B432A0FD,2).
You should just map each value to the sum of the elements of the inner pair:
yourRdd.map(pair => (pair._1, pair._2._1 + pair._2._2))
#marios suggested the following nicer syntax in an edit:
Or if you want to make it a bit more readable:
yourRdd.map { case (str, (x1, x2)) => (str, x1 + x2) }
Gabor Bakos' answer is correct if the keys are unique. But if you have multiple identical keys and want to reduce them to unique keys, then use reduceByKey.
Example:
val data = Array(("9888wq",(1,2)),("abcd",(1,1)),("abcd",(3,2)),("9888wq",(4,2)))
val rdd = sc.parallelize(data)
val result = rdd.map(x => (x._1, x._2._1 + x._2._2)).reduceByKey((x, y) => x + y)
result.foreach(println)
Output:
(9888wq,9)
(abcd,7)