Tuple element is not getting converted to float during reduceByKey - scala

I am preparing for CCA175 and am using the oldest available version of Spark, Spark 1.3.0.
As shown below, I am converting the element to Float while mapping, but the reduce step shows a compile-time error.
scala> val revenuePerDay = ordersJoinOrderItems.map(x => (x._2._1, (x._1, (x._2._2).toFloat)))
revenuePerDay: org.apache.spark.rdd.RDD[(String, (Int, Float))] = MapPartitionsRDD[21] at map at <console>:31
After mapping I can see that it is mapped as Float but when I am running the below command it is showing an error:
scala> revenuePerDay.reduceByKey((x,y) => x._2._2 + y._2._2)
<console>:34: error: value _2 is not a member of Float
revenuePerDay.reduceByKey((x,y) => x._2._2 + y._2._2)
^

PairRDDFunctions.reduceByKey works on a pair of values:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Since your tuples are of the form (String, (Int, Float)), the key (String) isn't passed to the function at all; only the values are.
reduceByKey expects a function of type (V, V) => V. Here V is (Int, Float), but your function takes two (Int, Float) values and returns a Float, so the types don't line up.
Instead, we'll need to use the more verbose PairRDDFunctions.combineByKey:
revenuePerDay.combineByKey[Float](_._2, (acc, x) => acc + x._2, (x, y) => x + y)
Or, you can use the slightly simpler PairRDDFunctions.aggregateByKey:
revenuePerDay.aggregateByKey(0F)((acc, x) => acc + x._2, (x, y) => x + y)
Edit
Another suggestion, by @zero323, is to use mapValues with reduceByKey:
revenuePerDay.mapValues(_._2).reduceByKey(_ + _)
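For context, here is how the whole flow fits together on a small made-up RDD of the same shape (a sketch, not the actual CCA175 data set; ordersJoinOrderItems is assumed to be keyed by order id with a (date, subtotal) value, and sc is the shell's SparkContext):
// Hypothetical stand-in for the joined RDD: (orderId, (orderDate, subtotal))
val ordersJoinOrderItems = sc.parallelize(Seq(
  (1, ("2014-07-25", "100.00")),
  (2, ("2014-07-25", "20.00")),
  (3, ("2014-07-26", "50.00"))))

val revenuePerDay = ordersJoinOrderItems
  .map(x => (x._2._1, (x._1, x._2._2.toFloat)))  // (date, (orderId, subtotal))
  .mapValues(_._2)                                // keep only the subtotal
  .reduceByKey(_ + _)                             // sum subtotals per date

revenuePerDay.collect()
// e.g. Array((2014-07-25,120.0), (2014-07-26,50.0))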

Related

How to get minimum value for each distinct key using ReduceByKey() in Scala

I have a flatMap that returns the sequence Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)). Now I need to use reduceByKey() on the sequence I got from the flatMap to find the minimum value for each key.
I tried using .reduceByKey(a,min(b)) and .reduceByKey((a, b) => if (a._1 < b._1) a else b), but neither of them works.
This is my code:
for (i <- 1 to 5) {
  var graph = graph.flatMap { in =>
    in match { case (x, y, zs) => (x, y) :: zs.map(z => (z, y)) }
  }.reduceByKey((a, b) => if (a._1 < b._1) a else b)
}
For each distinct key the flatMap generates, I need to get the minimum value for that key. E.g., if the flatMap generates Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)), the reduceByKey() should generate (20,1),(22,1),(23,6),(24,6).
Here is the signature of reduceByKey:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Basically, given an RDD of key/value pairs, you need to provide a function that reduces two values (and not the entire pair) into one. Therefore, you can use it as follows:
val rdd = sc.parallelize(Seq((20,6),(22,6),(23,6),(24,6),(20,1),(22,1)))
val result = rdd.reduceByKey((a, b) => if (a < b) a else b)
result.collect
// Array[(Int, Int)] = Array((24,6), (20,1), (22,1), (23,6))
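Since the values here are plain Ints, a slightly more direct variant (a sketch on the same rdd) is to pass math.min instead of writing the comparison by hand:
rdd.reduceByKey(math.min(_, _))
// same result: the minimum value per key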

Accessing Scala pair

Consider the following function definition:
def foo(l: List[(Char, Int)])
The following expression is valid
l.map(t => t._2 + t._1)
Is there a way to access the elements of the pair by name?
I have tried the following, but it does not compile:
l.map((c: Char, x: Int) => c + x)
There is no way to unpack a tuple with round brackets; you'll need curly ones (which introduce a pattern-matching anonymous function):
l.map { case (c, x) => c + x }
In Scala 3 (formerly Dotty), you can unpack it as follows:
l.map((c, x) => c + x)
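As a side note (not part of the original answer), in Scala 2 you can also adapt a plain two-argument function with Function.tupled:
// Function.tupled turns a (Char, Int) => Int into a ((Char, Int)) => Int
l.map(Function.tupled((c: Char, x: Int) => c + x))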

RDD Aggregate in spark

I am an Apache Spark learner and have come across the RDD action aggregate, which I have no clue how it functions. Can someone explain in detail, step by step, how we arrive at the result below for the code here?
RDD input = {1, 2, 3, 3}
RDD aggregate call:
rdd.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2))
Output: {9, 4}
Thanks
If you are not sure what is going on, it is best to follow the types. Omitting the implicit ClassTag for brevity, we start with something like this:
abstract class RDD[T] extends Serializable with Logging
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
If you ignore all the additional parameters, you'll see that aggregate is a function which maps from RDD[T] to U. It means that the type of the values in the input RDD doesn't have to be the same as the type of the output value. So it is clearly different from, for example, reduce:
def reduce(func: (T, T) ⇒ T): T
or fold:
def fold(zeroValue: T)(op: (T, T) => T): T
Like fold, aggregate requires a zeroValue. How do you choose it? It should be an identity (neutral) element with respect to combOp.
You also have to provide two functions:
seqOp which maps from (U, T) to U
combOp which maps from (U, U) to U
Based on these signatures alone, you should already see that only seqOp can access the raw data. It takes some value of type U and another of type T, and returns a value of type U. In your case it is a function with the following signature:
((Int, Int), Int) => (Int, Int)
At this point you probably suspect it is used for some kind of fold-like operation.
The second function takes two arguments of type U and returns a value of type U. As stated before, it doesn't touch the original data and can operate only on the values already processed by seqOp. In your case this function has the following signature:
((Int, Int), (Int, Int)) => (Int, Int)
So how can we get all of that together?
First, each partition is aggregated using the standard Iterator.aggregate, with zeroValue, seqOp and combOp passed as z, seqop and combop respectively. Since the InterruptibleIterator used internally doesn't override aggregate, it is executed as a simple foldLeft(zeroValue)(seqOp).
Next, the partial results collected from each partition are aggregated using combOp.
Let's assume that the input RDD has three partitions with the following distribution of values:
Iterator(1, 2)
Iterator(3, 3)
Iterator()
You can expect that execution, ignoring absolute order, will be equivalent to something like this:
val seqOp = (x: (Int, Int), y: Int) => (x._1 + y, x._2 + 1)
val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
Seq(Iterator(1, 2), Iterator(3, 3), Iterator())
.map(_.foldLeft((0, 0))(seqOp))
.reduce(combOp)
foldLeft for a single partition can look like this:
Iterator(1, 2).foldLeft((0, 0))(seqOp)
Iterator(2).foldLeft((1, 1))(seqOp)
(3, 2)
and over all partitions
Seq((3,2), (6,2), (0,0))
which, combined, will give you the observed result:
(3 + 6 + 0, 2 + 2 + 0)
(9, 4)
In general, this is a common pattern you will find all over Spark: you pass a neutral value, a function used to process values per partition, and a function used to merge partial aggregates from different partitions. Some other examples include (a small aggregateByKey sketch follows the list):
aggregateByKey
User Defined Aggregate Functions
Aggregators on Spark Datasets.
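As an illustration of the same pattern applied per key (a sketch with made-up data, not from the question), here is aggregateByKey computing a (sum, count) pair for each key:
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 3)))

val sumAndCountPerKey = pairs.aggregateByKey((0, 0))(
  (acc, v)     => (acc._1 + v, acc._2 + 1),               // seqOp: fold values within a partition
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) // combOp: merge per-partition results

sumAndCountPerKey.collect()
// e.g. Array((a,(3,2)), (b,(6,2)))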
Here is my understanding for your reference:
Imagine you have two nodes: one takes the input of the first two list elements {1, 2}, and the other takes {3, 3}. (The partitioning here is only for convenience.)
At the first node:
For "(x, y) => (x._1 + y, x._2 + 1)", the first x is (0, 0) as given and y is your first element 1, so you get (0 + 1, 0 + 1); then comes your second element y = 2, giving (1 + 2, 1 + 1), which is (3, 2).
At the second node, the same procedure happens in parallel, and you'll have (6, 2).
"(x, y) => (x._1 + y._1, x._2 + y._2)" tells you how to merge the two nodes' results, and you'll get (9, 4).
One thing worth noticing is that the zero value (0, 0) is actually folded into the result numPartitions + 1 times (once per partition, plus once when the per-partition results are merged), so a non-neutral zero value skews the answer. For example, in a run with four partitions:
scala> rdd.aggregate((1, 1))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
res1: (Int, Int) = (14, 9)
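To check how many partitions your own run uses (a sketch; the default depends on your environment's parallelism):
val rdd = sc.parallelize(Seq(1, 2, 3, 3))
rdd.partitions.length  // e.g. 4, which is why the zero value above is folded in 5 times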

Scala type mismatch when adding elements to hashmap

I am representing a graph's adjacency list in Scala in the variable a.
val a = new HashMap[Int, Vector[Tuple2[Int, Int]]] withDefaultValue Vector.empty
for (i <- 1 to N) {
  val Array(x, y, r) = readLine.split(" ").map(_.toInt)
  a(x) += new Tuple2(y, r)
  a(y) += new Tuple2(x, r)
}
I am reading each edge in turn (x and y are nodes, while r is the cost of the edge). After reading it, I am adding it to the adjacency list.
However, when adding the tuples containing a neighbouring node and a cost to the HashMap, I get:
Solution.scala:17: error: type mismatch;
found : (Int, Int)
required: String
a(x) += new Tuple2(y, r)
I don't understand why it wants String. I haven't specified String anywhere.
Vector has no += (or +) method, so the compiler rewrites a(x) += ... as a(x) = a(x) + ..., and the only + it can find is the implicit String concatenation, which expects a String on its right-hand side; that is where the required: String comes from.
You would probably want to do something like: a.update(x, a.getOrElse(x, Vector()) :+ ((y, r))), and similarly for y.
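Applied to the loop from the question, that looks roughly like this (a sketch, keeping the same HashMap a and input format):
for (_ <- 1 to N) {
  val Array(x, y, r) = readLine.split(" ").map(_.toInt)
  a.update(x, a.getOrElse(x, Vector()) :+ ((y, r)))  // append the neighbour/cost pair, then write back
  a.update(y, a.getOrElse(y, Vector()) :+ ((x, r)))
}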
Also, you are writing Java code in Scala. It compiles, but amounts to abuse of the language :/
Consider doing something like this next time:
val a = (1 to N)
  .map { _ => readLine.split(" ").map(_.toInt) }
  .flatMap { case Array(x, y, r) =>
    Seq(x -> (y, r), y -> (x, r))
  }
  .groupBy(_._1)
  .mapValues { _.map(_._2) }
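To see the shape of the result without reading from stdin, here is the same pipeline on a hypothetical two-edge input:
val edges = Seq("1 2 7", "2 3 4")  // hypothetical stand-in for the readLine input

val adjacency = edges
  .map(_.split(" ").map(_.toInt))
  .flatMap { case Array(x, y, r) => Seq(x -> (y, r), y -> (x, r)) }
  .groupBy(_._1)
  .mapValues(_.map(_._2))
// adjacency: e.g. Map(1 -> Seq((2,7)), 2 -> Seq((1,7), (3,4)), 3 -> Seq((2,4)))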

SortByValue for an RDD of tuples

Recently I was asked (in a class assignment) to find the top 10 occurring words inside an RDD. I submitted my assignment with a working solution, which looks like this:
wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .map { case (x, y) => (y, x) }
  .sortByKey(false)
  .map { case (x, y) => (y, x) }
  .take(10)
So basically, I swap the tuple, sort by key, and then swap again. Then finally take 10. I don't find the repeated swapping very elegant.
So I wonder if there is a more elegant way of doing this.
I searched and found some people using Scala implicits to convert the RDD into a Scala Sequence and then doing the sortByValue, but I don't want to convert RDD to a Scala Seq, because that will kill the distributed nature of the RDD.
So is there a better way?
How about this:
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(Ordering.by(-1 * _._2))
or a little bit more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
  def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(WordCountPairsOrdering)
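If you prefer to keep an explicit sort, RDD.sortBy lets you sort directly on the count and avoids the double swap (a sketch; unlike takeOrdered it sorts the whole RDD before take):
wordsRdd
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)  // sort by count, descending
  .take(10)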