aggregateByKey method not working in Spark RDD - Scala

Below is my sample data:
1,Siddhesh,43,32000
1,Siddhesh,12,4300
2,Devil,10,1000
2,Devil,10,3000
2,Devil,11,2000
I created a pair RDD to perform combineByKey and aggregateByKey operations. Below is my code:
val rd = sc.textFile("file:///home/cloudera/Desktop/details.txt")
  .map(line => line.split(","))
  .map(p => ((p(0).toString, p(1).toString), (p(3).toLong, p(2).toString.toInt)))
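On the sample data this produces pairs of the form shown below (an illustrative REPL check; the res number is arbitrary):
scala> rd.collect
res0: Array[((String, String), (Long, Int))] = Array(((1,Siddhesh),(32000,43)), ((1,Siddhesh),(4300,12)), ((2,Devil),(1000,10)), ((2,Devil),(3000,10)), ((2,Devil),(2000,11)))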
Above, I paired the first two columns as the key and the remaining two columns as the value. Now I want only the distinct values of the 3rd column (the second element of the value tuple), which I was able to do with combineByKey. Below is my code:
val reduced = rd.combineByKey(
  (x: (Long, Int)) => (x._1, Set(x._2)),
  (x: (Long, Set[Int]), y: (Long, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Long, Set[Int]), y: (Long, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)
scala> reduced.foreach(println)
((1,Siddhesh),(36300,Set(43, 12)))
((2,Devil),(6000,Set(10, 11)))
Now I map it so that, for each key, I get the sum of the values together with the count of distinct values.
scala> val newRdd=reduced.map(p=>(p._1._1,p._1._2,p._2._1,p._2._2.size))
scala> newRdd.foreach(println)
(1,Siddhesh,36300,2)
(2,Devil,6000,2)
Here, for Devil, the last value is 2: the dataset has the value 10 twice for the 'Devil' records, and since I used a Set the duplicate is eliminated. So now I tried the same with aggregateByKey. Below is my code, which gives an error:
val rd = sc.textFile("file:///home/cloudera/Desktop/details.txt")
  .map(line => line.split(","))
  .map(p => ((p(0).toString, p(1).toString), (p(3).toString.toInt, p(2).toString.toInt)))
I converted the value column from Long to Int because the 0 in the zero value was throwing an error during initialization.
scala> val reducedByAggKey = rd.aggregateByKey((0,0))(
| (x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
| (x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
| )
<console>:36: error: type mismatch;
found : scala.collection.immutable.Set[Int]
required: Int
(x:(Int,Set[Int]),y:(Int,Int))=>(x._1+y._1,x._2+y._2),
^
<console>:37: error: type mismatch;
found : scala.collection.immutable.Set[Int]
required: Int
(x:(Int,Set[Int]),y:(Int,Set[Int]))=>{(x._1+y._1,x._2++y._2)}
^
And as suggested by Leo, below is my code with error:
scala> val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
| (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
| (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, y._2++ x._2)
| )
<console>:36: error: overloaded method value + with alternatives:
(x: Double)Double <and>
(x: Float)Float <and>
(x: Long)Long <and>
(x: Int)Int <and>
(x: Char)Int <and>
(x: Short)Int <and>
(x: Byte)Int <and>
(x: String)String
cannot be applied to (Set[Int])
(x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, y._2+x._2),
^
So where am I making a mess here? Please correct me.

If I understand your requirement correctly, to get the full count rather than the distinct count, use List instead of Set for the aggregations. As to the problem with your aggregateByKey, it's due to the incorrect type of the zeroValue, which should be (0, List.empty[Int]) (it would have been (0, Set.empty[Int]) had you stuck with Set):
val reduced = rdd.aggregateByKey((0, List.empty[Int]))(
  (x: (Int, List[Int]), y: (Int, Int)) => (x._1 + y._1, y._2 :: x._2),
  (x: (Int, List[Int]), y: (Int, List[Int])) => (x._1 + y._1, y._2 ::: x._2)
)
reduced.collect
// res1: Array[((String, String), (Int, List[Int]))] =
// Array(((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(12, 43))))
val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))
newRdd.collect
// res2: Array[(String, String, Int, Int)] =
// Array((2,Devil,6000,3), (1,Siddhesh,36300,2))
Note that the Set to List change would apply to your combineByKey code as well if you want the full count instead of distinct count.
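For what it's worth, a minimal sketch of that Set-to-List change applied to your combineByKey version (assuming the original rd pair RDD; the name reducedList is just illustrative):
val reducedList = rd.combineByKey(
  (x: (Long, Int)) => (x._1, List(x._2)),                                        // create combiner: (sum, values)
  (x: (Long, List[Int]), y: (Long, Int)) => (x._1 + y._1, y._2 :: x._2),         // fold a value in
  (x: (Long, List[Int]), y: (Long, List[Int])) => (x._1 + y._1, y._2 ::: x._2)   // merge combiners
)
// e.g. ((2,Devil),(6000,List(11, 10, 10))), ((1,Siddhesh),(36300,List(12, 43)))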
[UPDATE]
For the distinct count per your comment, simply stay with Set, with the zeroValue set to (0, Set.empty[Int]). Note also the operand order in the functions: x._2 + y._2 adds an Int to a Set[Int], whereas y._2 + x._2 (as in the snippet you tried) attempts Int + Set[Int], which is what triggered the overloaded method + error:
val reduced = rdd.aggregateByKey((0, Set.empty[Int]))(
  (x: (Int, Set[Int]), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2),
  (x: (Int, Set[Int]), y: (Int, Set[Int])) => (x._1 + y._1, x._2 ++ y._2)
)
reduced.collect
// res3: Array[((String, String), (Int, scala.collection.immutable.Set[Int]))] =
// Array(((2,Devil),(6000,Set(10, 11))), ((1,Siddhesh),(36300,Set(43, 12))))
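And, same as in your combineByKey flow, the mapping step then gives the per-key sum together with the distinct count (illustrative output):
val newRdd = reduced.map(p => (p._1._1, p._1._2, p._2._1, p._2._2.size))
newRdd.collect
// Array((2,Devil,6000,2), (1,Siddhesh,36300,2))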

Related

Reduce by inspecting tuple element

I want to find the Tuple with the largest second element:
mylist.reduce { (x, y) => {
if (y._1 > x._1) y
else x
}}
Where x and y are of type Tuple3[DenseVector[Int], Double, PipelineModel].
I get the error that > cannot be resolved. What's up with that? Using foldLeft and providing a zero element did not help either.
Can I write the code nicer? (It doesn't look so nice, I think.)
In a triplet (a, b, c), triplet._2 gives you the second element:
_1 gives the first element
_2 gives the second element
_3 gives the third element
Tuple accessors are not zero-based; they start at _1.
scala> val triplet = (1, 2, 3)
triplet: (Int, Int, Int) = (1,2,3)
scala> triplet._1
res0: Int = 1
scala> triplet._2
res1: Int = 2
scala> triplet._3
res2: Int = 3
Answer 1:
In your case triplet._1 gives the first element of the triplet (a Tuple3), which is a DenseVector[Int], and you cannot use > on it. That's why > is not resolved.
Answer 2:
maxBy
l.maxBy(_._2)
Scala REPL
scala> val l = List((1, 2, 3), (0, 0, 1))
l: List[(Int, Int, Int)] = List((1,2,3), (0,0,1))
scala> l.maxBy(_._2)
res1: (Int, Int, Int) = (1,2,3)
Reduce
l.reduce { (x, y) => if (x._2 > y._2) x else y }
Scala REPL
scala> val l = List((1, 2, 3), (0, 0, 1))
l: List[(Int, Int, Int)] = List((1,2,3), (0,0,1))
scala> l.reduce { (x, y) => if (x._2 > y._2) x else y }
res3: (Int, Int, Int) = (1,2,3)

Spark mapValues vs. map

So I saw this question on Stack Overflow asked by another user, and I tried to write the code myself as I am practicing Scala and Spark:
The question was to find the per-key average from a list:
Assuming the list is: ( (1,1), (1,3), (2,4), (2,3), (3,1) )
The code was:
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  map{ case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
So basically the above code creates an RDD of type RDD[(Int, (Int, Int))], where the first Int is the key and the value is an (Int, Int) pair whose first Int is the sum of all the values with the same key and whose second Int is the number of times the key appeared.
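For reference, a minimal runnable sketch of that intermediate result on the sample list (assuming the input RDD is built with sc.parallelize; names are illustrative):
// assuming an existing SparkContext sc
val input = sc.parallelize(Seq((1, 1), (1, 3), (2, 4), (2, 3), (3, 1)))
val sumsAndCounts = input.combineByKey(
  (v) => (v, 1),                                                                   // create combiner: (sum, count)
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),                                // fold a value into (sum, count)
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)   // merge partial results
)
// sumsAndCounts.collect: e.g. Array((1,(4,2)), (2,(7,2)), (3,(1,1)))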
I understand what is going on but for some reason when I rewrite the code like this:
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  mapValues(value: (Int, Int) => (value._1 / value._2))
result.collectAsMap().map(println(_))
When I use mapValues instead of map with the case keyword, the code doesn't work. It gives an error saying error: not found: type /. What is the difference between using map with case and using mapValues? I thought mapValues would just take the value (which in this case is an (Int, Int)), return a new value, and leave the key of the key-value pair unchanged.
try
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).
  mapValues(value => (value._1 / value._2))
result.collectAsMap().map(println(_))
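If you want to keep the explicit parameter type, the lambda parameter needs its own parentheses; without them, value: (Int, Int) => ... is parsed as a type ascription, which is where not found: type / comes from. A minimal sketch of just that last step (assuming combined is the RDD[(Int, (Int, Int))] produced by the combineByKey above; the name is hypothetical):
// combined: RDD[(Int, (Int, Int))] from the combineByKey above (hypothetical name)
val result = combined.mapValues((value: (Int, Int)) => value._1 / value._2)
result.collectAsMap().map(println(_))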
Never mind, I found a good article about my problem: http://danielwestheide.com/blog/2012/12/12/the-neophytes-guide-to-scala-part-4-pattern-matching-anonymous-functions.html
If anyone else has the same problem, that article explains it well!
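For anyone landing here: the takeaway from that article is that { case ... } is a pattern-matching anonymous function, and mapValues accepts one too, so you can destructure the value pair without naming a tuple parameter. A minimal sketch (again assuming the hypothetical combined RDD from above):
// combined: RDD[(Int, (Int, Int))] from the combineByKey above (hypothetical name)
val averages = combined.mapValues { case (sum, count) => sum / count.toFloat }
averages.collectAsMap().map(println(_))
// e.g. (1,2.0), (2,3.5), (3,1.0)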

scala list of anonymous functions

I'd like to create a list of function literals and (a) avoid pre-defining them and (b) use shorthand syntax. I'm failing at the moment.
def g = (x: Int) => x + 1 //pre-defined
def h = (x: Int) => x + 2
List(g,h) //succeeds
List( (x: Int) => x + 1, (x: Int) => x + 2) ) //fails
^';' expected but ')' found.
Please clarify: do you know the types of your functions in advance?
If yes, then you can explicitly specify type of your list:
# List[Int=>Int](x => x + 1, x => x + 2)
res20: List[Int => Int] = List(<function1>, <function1>)
Or even shorter:
# List[Int=>Int](_ + 1, _ + 2)
res21: List[Int => Int] = List(<function1>, <function1>)
If you want List type to be inferred, try following syntax:
# List({ x: Int => x + 1}, { x: Int => x + 2 })
res22: List[Int => Int] = List(<function1>, <function1>)

Why can Scala compiler not infer Stream type operations?

Let's say I want to have a Stream of squares. A simple way to declare it would be:
scala> def squares(n: Int): Stream[Int] = n * n #:: squares(n + 1)
But doing so, yields an error:
<console>:8: error: overloaded method value * with alternatives:
(x: Double)Double <and>
(x: Float)Float <and>
(x: Long)Long <and>
(x: Int)Int <and>
(x: Char)Int <and>
(x: Short)Int <and>
(x: Byte)Int
cannot be applied to (scala.collection.immutable.Stream[Int])
def squares(n: Int): Stream[Int] = n * n #:: squares(n + 1)
^
So, why can't Scala infer the type of n, which is obviously an Int? Can someone please explain what's going on?
It's just a precedence issue. Your expression is being interpreted as n * (n #:: squares(n + 1)), which is clearly not well-typed (hence the error).
You need to add parentheses:
def squares(n: Int): Stream[Int] = (n * n) #:: squares(n + 1)
Incidentally, this isn't an inference problem, because the types are known (i.e., n is known to be of type Int, so it need not be inferred).
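A quick check of the parenthesized version in the REPL (illustrative session):
scala> def squares(n: Int): Stream[Int] = (n * n) #:: squares(n + 1)
squares: (n: Int)Stream[Int]

scala> squares(1).take(5).toList
res0: List[Int] = List(1, 4, 9, 16, 25)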

Scala: Defining a function to be the correct type

I've been playing around with Scala code and have come up against a compiler error which I don't understand. The code generates a vector of pairs of Ints and then tries to filter it.
val L = for (x <- (1 to 5)) yield (x, x * x)
val f = (x: Int, y: Int) => x > 3
println(L.filter(f))
The compiler complains about trying to use f as an argument for the filter method with the compiler error message being:
error: type mismatch;
found : (Int, Int) => Boolean
required: ((Int, Int)) => Boolean
How do I define the function f correctly to satisfy the required function type? I tried to add extra parentheses around (x: Int, y: Int) but this gave:
error: not a legal formal parameter
val f = ((x: Int, y: Int)) => x > 3
^
f has type Function2[Int, Int, Boolean]. L's type is IndexedSeq[Tuple2[Int, Int]] and so filter expects a function of type Function1[Tuple2[Int, Int], Boolean]. Every FunctionN[A, B, .., R] trait has a method tupled, which returns a function of type Function1[TupleN[A, B, ..], R]. You can use it here to transform f to the type expected by L.filter.
println(L.filter(f.tupled))
> Vector((4,16), (5,25))
Alternatively you can redefine f to be a Function1[Tuple2[Int, Int], Boolean] as follows and use it directly.
val f = (t: (Int, Int)) => t._1 > 3
println(L.filter(f))
> Vector((4,16), (5,25))
val f = (xy: (Int, Int)) => xy._1 > 3
println (L.filter (f))
If you do
val f = (x: Int, y: Int) => x > 3
you define a function which takes two ints, which is not the same as a function which takes a pair of ints as parameter.
Compare:
scala> val f = (x: Int, y: Int) => x > 3
f: (Int, Int) => Boolean = <function2>
scala> val f = (xy: (Int, Int)) => xy._1 > 3
f: ((Int, Int)) => Boolean = <function1>
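For completeness, a pattern-matching anonymous function also satisfies the expected ((Int, Int)) => Boolean type, so you can keep named components without rewriting f (an illustrative REPL check; the res number is arbitrary):
scala> L.filter { case (x, y) => x > 3 }
res2: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((4,16), (5,25))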
If you don't want to rewrite your function to explicitly use a Tuple2 (as suggested by missingfaktor and user unknown), you can define an implicit method to do the conversion automatically. This leaves the function f untouched (you aren't forced to always call it with a Tuple2 parameter) and keeps it easier to understand, because you still use the identifiers x and y.
implicit def fun2ToTuple[A, B, Res](f: (A, B) => Res): ((A, B)) => Res =
  (t: (A, B)) => f(t._1, t._2)
val L = for (x <- (1 to 5)) yield (x, x * x)
val f = (x: Int, y: Int) => x > 3
val g = (x: Int, y: Int) => x % 2 > y % 3
L.filter(f) //> Vector((4,16), (5,25))
L.filter(g) //> Vector((3,9))
f(0,1) //> false
f((4,2)) //> true
Now every Function2 can also be used as a Function1 with a Tuple2 as parameter, because the implicit method converts the function when needed.
For functions with more than two parameters, the implicit defs look similar:
implicit def fun3ToTuple[A, B, C, Res](f: (A, B, C) => Res): ((A, B, C)) => Res =
  (t: (A, B, C)) => f(t._1, t._2, t._3)
implicit def fun4ToTuple[A, B, C, D, Res](f: (A, B, C, D) => Res): ((A, B, C, D)) => Res =
  (t: (A, B, C, D)) => f(t._1, t._2, t._3, t._4)
...