Group By error 33 : wrong number of parameter - scala

I am kind of new to Spark and programming too and i am not able to understand how to deal with Rdd of type rdd.RDD[(Int, Iterable[Double])] = ShuffledRDD[10] at groupByKey . I am little bit interested in learning groupByKey in spark and i have a filtered RDD
scala> p.first
res11: (Int, Double) = (1,299.98)
I go the above result after applying GroupByKEy instead of reduceByKey now i have rdd of type (Int, Iterable[Double]) and i want to get result like (Int , sum (Double)).
I have tried this but got the error.
scala> val price = g.map((a,b) => (a, sum(b)))
<console>:33: error: wrong number of parameters; expected = 1
val price = g.map((a,b) => (a, sum(b)))
Please suggest and help me in this to understand it

g.mapValues(_.sum), which is short for g.map { case (k, v) => (k, v.sum) }

Related

reduceByKey of List[Int]

Suppose I have RDD(String,List[Int]), i.e. ("David",List(60,70,80)),("John",List(70,80,90)). How can I use reduceByKey in scala to calculate average of List[Int]. In the end, I want to have another RDD which is like ("David",70),("John",80)
Something based on reduceByKey doesn't directly look good because of it's type signature:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
V in your case is List[Int] so you're getting a RDD[(String, List[Int])].
A workaround is use a List of one element, the actual average:
val rddAvg: RDD[(String, Int)] =
rdd1
.reduceByKey { case (key, numbers) => List(numbers.sum / numbers.length) }
.mapValues(_.head)
You could as well attempt something based on aggregateByKey: this function can return a different result type and would do the trick in one step.
Later edit: I dropped the example with groupByKey as it is performance wise inferior to reduceByKey or aggregateByKey for a use-case like computing an average
val data1 = List(("David", List(60, 70, 80)), ("John", List(70, 80, 90)))
val rdd1 = sc.parallelize(data1)
print(rdd1.mapValues(value => value.sum.toDouble / value.size).collect)

Getting the mode from an RDD

I would like to get the mode (the most common number) from an rdd using Spark + Scala.
I can get it doing the following but I think it could be a better way to calculate this. The most important thing is if more than one value has the same number of repetition, I need to return both of them.
Let's see my example code:
val l = List(3,4,4,3,3,7,7,7,9)
val rdd = spark.sparkContext.parallelize(l)
val grouped = rdd.map (e => (e, 1)).groupBy(_._1).map(e=> (e._1, e._2.size))
val maxRep = grouped.collect().maxBy(_._2)._2
val mode = grouped.filter(e => e._2 == maxRep).map(e => e._1).collect
And the result is right:
Array[Int] = Array(3, 7)
but is there a better way to do this? I mean considering the performance because the original RDD would be much bigger than this.
This should work and be a little bit more efficient.
(only if you are sure the total number of elements is small)
val counted = rdd.countByValue()
val max = counted.valuesIterator.max
val maxElements = count.collect { case (k, v) if (v == max) => k }
If there could be many elements, consider this alternative which is memory safe.
val counted = rdd.map(x => (x, 1L)).reduceByKey(_ + _).cache()
val max = counted.values.max
val maxElements = counted.map { case (k, v) => (v, k) }.lookup(max)
How about get the max key-value pair from a double groupBy? This works even better for bigger data size.
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max
// res1: (Int, Iterable[(Int, Int)]) = (3,CompactBuffer((3,3), (7,3)))
To get the element
rdd.groupBy(identity).mapValues(_.size).groupBy(_._2).max._2.map(_._1)
// res4: Iterable[Int] = List(3, 7)
The first groupBy will get element into (element -> count) with type Map[Int, Long], the second groupBy will group (element -> count) by count, like (count -> Iterable((element, count)), then simply max to get the key-value pair with the maximum key value, which is the count.

Scala Spark - Reduce RDD by adding multiple values per key

I have a Spark RDD that is in the format of (String, (Int, Int)) and I would like to add the Int values together to create a (String, Int) map.
This is an example of an element in my RDD:
res13: (String, (Int, Int)) = (9D4669B432A0FD,(1,1))
I would like to end with an RDD of (String, Int) = (9D4669B432A0FD,2)
You should just map the values to the sum of the second pair:
yourRdd.map(pair => (pair._1, pair._2._1 + pair._2._2))
#marios suggested the following nicer syntax in an edit:
Or if you want to make it a bit more readable:
yourRdd.map{case(str, (x1,x2)) => (str, x1+x2)}
Gabor Bakos answer is correct if there are unique keys. But If you have multiple identical keys and if you want to reduce it to unique keys then use reduceByKey.
Example:
val data = Array(("9888wq",(1,2)),("abcd",(1,1)),("abcd",(3,2)),("9888wq",(4,2)))
val rdd= sc.parallelize(data)
val result = rdd.map(x => (x._1,(x._2._1+x._2._2))).reduceByKey((x,y) => x+y)
result.foreach(println)
Output :
(9888wq,9)
(abcd,7)

Can reduceBykey be used to change type and combine values - Scala Spark?

In code below I'm attempting to combine values:
val rdd: org.apache.spark.rdd.RDD[((String), Double)] =
sc.parallelize(List(
(("a"), 1.0),
(("a"), 3.0),
(("a"), 2.0)
))
val reduceByKey = rdd.reduceByKey((a , b) => String.valueOf(a) + String.valueOf(b))
reduceByValue should contain (a , 1,3,2) but receive compile time error :
Multiple markers at this line - type mismatch; found : String required: Double - type mismatch; found : String
required: Double
What determines the type of the reduce function? Can the type not be converted?
I could use groupByKey to achieve same result but just want to understand reduceByKey.
No, given an rdd of type RDD[(K,V)], reduceByKey will take an associative function of type (V,V) => V.
If we want to apply a reduction that changes the type of the values to another arbitrary type, then we can use aggregateByKey:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)
Using the zeroValue and the seqOp function, it provides a fold-like operation at the map side while the associate function combOp combines the results of the seqOp to the final result, much like reduceByKey would do.
As we can appreciate from the signature, while the collection values are of type V the result of aggregateByKey will be of an arbitrary type U
Applied to the example above, aggregateByKey would look like this:
rdd.aggregateByKey("")({case (aggr , value) => aggr + String.valueOf(value)}, (aggr1, aggr2) => aggr1 + aggr2)
The problem with your code is that your Value type mismatch. You can achieve the same output with reduceByKey, provided you changed the type of value in your RDD.
val rdd: org.apache.spark.rdd.RDD[((String), String)] =
sc.parallelize(List(
("a", "1.0"),
("a", "3.0"),
("a", "2.0")
))
val reduceByKey = rdd.reduceByKey((a , b) => a.concat(b))
Here is the same example. As long as the function you pass to reduceByKey takes two parameter of the type Value( Double in your case ) and returns a single parameter of the same type, your reduceByKey will work.

Summing items within a Tuple

Below is a data structure of List of tuples, ot type List[(String, String, Int)]
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurences of each Int value associated with each id. So above data structure should be converted to List((id1,a,3) , (id2,a,1))
This is what I have come up with but I'm unsure how to group similar items within a Tuple :
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is of type spark obj RDD , I'm using a List in this example for testing but same solution should be compatible with an RDD . I'm using a List for local testing purposes.
Update : based on following code provided by maasg :
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend slightly to get into format I expect which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]
:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupedByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets as it does not replicate the id's in the cummulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
#vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will >always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the below function, when used with groupBy will give you a list with keys being the ids.
(Sorry, I don't have access to an Scala compiler, so I can't test)
def f(tupule: A) :String = {
return tupule._1
}
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give a RDD[(String, String, Int)]. By using reduceByKey it means the summation will paralellize, i.e. for very large groups it will be distributed and summation will happen on the map side. Think about the case where there are only 10 groups but billions of records, using .sum won't scale as it will only be able to distribute to 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) can just use .mapValues(v =>(v_1, v._2, v.map(_._3).sum))
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,value.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}