Is it possible to use reduceByKey((x, y, z) => ...)? - scala

Is it possible to have a reduceByKey of the form reduceByKey((x, y, z) => ...)?
Because I have an RDD:
RDD[((String, String, Double), (Double, Double, scala.collection.immutable.Map[String,Double]))]
I want to reduce by key, and I tried this operation:
reduceByKey((x, y, z) => (x._1 + y._1 + z._1, x._2 + y._2 + z._2, (((x._3)++y._3)++z._3)))
and it shows me an error message: missing parameter type
I tested before with two elements and it worked, but with 3 I really don't know where my error is. What is the way to do that?

Here's what you're missing: reduceByKey is telling you that you have a key-value pairing. Conceptually there can only ever be two items in a pair; it's part of what makes a pair a pair. Hence, the function you pass to reduceByKey can only ever take two arguments. So, no, you can't directly have a function of arity 3, only of arity 2.
Here's how I'd handle your situation:
reduceByKey { (left, right) =>
  // both arguments are values of the same type, here (Double, Double, Map[String, Double])
  val (d1, d2, names1) = left
  val (d3, d4, names2) = right
  (d1 + d3, d2 + d4, names1 ++ names2)
}
However, let me make one slight suggestion: use a case class for your value. It's easier to grok and is essentially equivalent to your 3-tuple.
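For example, a minimal sketch of that suggestion (Agg is a hypothetical name; it mirrors the (Double, Double, Map[String, Double]) value in your RDD):

case class Agg(d1: Double, d2: Double, names: Map[String, Double]) {
  // combines two values the same way as the tuple version above
  def combine(other: Agg): Agg =
    Agg(d1 + other.d1, d2 + other.d2, names ++ other.names)
}

// rdd: RDD[((String, String, Double), Agg)]
rdd.reduceByKey(_ combine _)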

If you look at the reduceByKey function on PairRDDFunctions, it looks like this:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Hence, it's not possible to have it work on three arguments.
However, you can wrap your 3-tuple into a model and still keep your first string as the key, making your RDD an RDD[(String, your-model)], and now you can aggregate the model in whatever way you want.
Hope this helps.
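A hedged sketch of that wrapping, reusing the hypothetical Agg case class from the previous answer:

val modeled = rdd.mapValues { case (d1, d2, names) => Agg(d1, d2, names) }
val reduced = modeled.reduceByKey(_ combine _)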

Related

Why is scala.collection.immutable.List[Object] not GenTraversableOnce[?]

Simple question, and sorry if this is a stupid one as I am just beginning in Scala. I am getting a type mismatch error that says:
found : (AnyRef, org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable) => List[Object]
required: ((AnyRef, org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable)) => scala.collection.GenTraversableOnce[?]
But according to this post (I have a Scala List, how can I get a TraversableOnce?), a scala.collection.immutable.List is an Iterable and therefore also a GenTraversableOnce. And yet this error seems to indicate otherwise. And furthermore, when I actually look at the link in the accepted answer of that post, I don't see any reference to the word "traversable".
If the problem has to do with my inner class not being correct, then I have to say this error is extremely uninformative, since requiring that the inner class be of type "?" is obviously a vacuous statement ... Any help in understanding this would be appreciated.
Function2[X, Y, Z] is not the same thing as Function1[(X, Y), Z].
Compare these two definitions:
val f: ((Int, Int)) => Int = xy => xy._1 + xy._2
val f: (Int, Int) => Int = (x, y) => x + y
The first could also be written with a pattern-matching, that first decomposes the tuple:
val f: ((Int, Int)) => Int = { case (x, y) => x + y }
This is exactly what the error message asks you to do: provide a unary function that takes a tuple as its argument, not a binary function. Note that there is the tupled method, which does exactly that.
The return types of the functions are mostly irrelevant here; the compiler doesn't get to unify them, because it fails on the types of the inputs.
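A quick sketch of tupled:

val g: (Int, Int) => Int = (x, y) => x + y   // Function2[Int, Int, Int]
val h: ((Int, Int)) => Int = g.tupled        // Function1[(Int, Int), Int]

List((1, 2), (3, 4)).map(h)                  // List(3, 7): map wants a unary function on pairs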
Also related:
Same story with eta-expansions: Why does my implementation of Haskell snd not compile in Scala

Conditionally using .reverse in the same line with Scala

I have a composition of combinators in Scala, and the last one is .top, which I could use as .top(num)(Ordering[(Int, Int)].reverse) depending on a boolean parameter.
How do I implement this composition of combinators to use or not use .reverse depending on the boolean parameter, in the same line? I mean, without creating another val to indicate whether .reverse is used?
val mostPopularHero = sparkContext
.textFile("resource/marvel/Marvel-graph.txt") // build up superhero co-apperance data
.map(countCoOccurrences) // convert to (hero ID, number of connections) RDD
.reduceByKey((x, y) => x + y) // combine entries that span more than one line
.map(x => (x._2, x._1)) // flip it from (hero ID, number of connections) to (number of connections, hero ID)
.top(num)(Ordering[(Int, Int)].reverse)
Solution 0
As nicodp has already pointed out, if you have a boolean variable b in scope, you can simply replace the expression
Ordering[(Int, Int)]
by an if-expression
if (b) Ordering[(Int, Int)] else Ordering[(Int, Int)].reverse
I have to admit that this is the shortest and clearest solution I could come up with.
However... I didn't quite like that the expression Ordering[(Int, Int)] appears in the code twice. It doesn't really matter in this case, because it's short, but what if the expression were a bit longer? Apparently, even Ruby has something for such cases.
So, I tried to come up with some ways to not repeat the subexpression Ordering[(Int, Int)]. The nicest solution would be if we had a default Id-monad implementation in the standard library, because then we could simply wrap the one value in pure, and then map it using the boolean.
But there is no Id in standard library. So, here are a few other proposals, just for the case that the expression in question becomes longer:
Solution 1
You can use blocks as expressions in Scala, so you can replace the above Ordering[(Int, Int)] by:
{val x = Ordering[(Int, Int)]; if (b) x else x.reverse}
Update: Wait! This is shorter than the version with repetition! ;)
Solution 2
Define a function that conditionally reverses an ordering, declare Ordering[(Int, Int)] as the type of its argument, and then, instead of re-typing Ordering[(Int, Int)] as an expression, use implicitly:
((x: Ordering[(Int, Int)]) => if (b) x else x.reverse)(implicitly)
Solution 3
We don't have Id, but we can abuse constructors and eliminators of other functors. For example, one could wrap the complex expression in a List or Option, then map it, then unpack the result. Here is a variant with Some:
Some(Ordering[(Int, Int)]).map{ x => if(b) x else x.reverse }.get
Ideally, this would have been Id instead of Some. Notice that Solution 1 does something similar with the default ambient monad.
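As an aside: on Scala 2.13+, the pipe extension method from scala.util.chaining expresses the same wrap-map-unwrap in one step (a sketch, assuming 2.13):

import scala.util.chaining._

// equivalent to Solution 3, without the Option detour
val ord = Ordering[(Int, Int)].pipe(x => if (b) x else x.reverse)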
Solution 4
Finally, if the above pattern occurs more than once in your code, it might be worth it to introduce some extra syntax to deal with it:
implicit class ReversableOrderingOps[X](ord: Ordering[X]) {
  def reversedIf(b: Boolean): Ordering[X] = if (b) ord.reverse else ord
}
Now you can define orderings like this:
val myConditionHolds = true
val myOrd = Ordering[(Int, Int)] reversedIf myConditionHolds
or use it in your lengthy expression directly:
val mostPopularHero = sparkContext
.textFile("resource/marvel/Marvel-graph.txt")
.map(countCoOccurrences)
.reduceByKey((x, y) => x + y)
.map(x => (x._2, x._1))
.top(num)(Ordering[(Int, Int)] reversedIf myConditionHolds)
I'm not quite sure if you have access to the boolean parameter here or not, but you can work this out as follows:
.top(num)(if (booleanParameter) Ordering[(Int, Int)].reverse else Ordering[(Int, Int)])

Can someone explain this Scala aggregate function with two initial values

I am very new to Scala. This is a problem I was trying to solve in Spark, which also uses Scala for performing operations on RDDs.
Till now, I have only seen aggregate functions with a single initial value (i.e. some-input.aggregate(initial-value)((acc, value) => acc + value)), but this program has two initial values (0,0).
As per my understanding this program is for calculating the running average and keeping track of the count so far.
val result = input.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I know that in foldLeft / aggregate we supply an initial value, so that in case of an empty collection we get the default value, and both have an accumulator and a value part.
But in this case, we have two initial values, and the accumulator is accessing tuple values. Where is this tuple defined?
Can someone please explain this whole program line by line.
but this program has two initial values (0,0).
They aren't two parameters, they're one Tuple2:
input.aggregate((0, 0))
The value passed to aggregate is surrounded by an additional pair of round brackets, (( )): the inner pair is the tuple literal, which is syntactic sugar for Tuple2.apply. This is where you're seeing the tuple come from.
If you look at the method definition (I'm assuming this is RDD.aggregate), you'll see it takes a single parameter in the first argument list:
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
(implicit arg0: ClassTag[U]): U
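To make the behavior concrete, here is a small runnable sketch (assuming a local SparkContext named sc):

val input = sc.parallelize(Seq(1, 2, 3, 4))

val result = input.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),           // seqOp: folds each element into a per-partition (sum, count)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) // combOp: merges the per-partition (sum, count) pairs

val avg = result._1 / result._2.toDouble  // result == (10, 4), so avg == 2.5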

Scala - Operation in case (x,y)=> x++y

I am new to Scala. I was reading some code and was not able to understand it. Can someone please help me understand the code below?
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def intersectByKey[K: ClassTag, V: ClassTag](rdd1: RDD[(K, V)], rdd2: RDD[(K, V)]): RDD[(K, V)] = {
  rdd1.cogroup(rdd2).flatMapValues {
    case (Nil, _) => None
    case (_, Nil) => None
    case (x, y)   => x ++ y
  }
}
What does the line below mean? How will it be evaluated?
case (x, y) => x++y
Thanks
rdd1.cogroup(rdd2) returns a value of type RDD[(K, (Iterable[V], Iterable[V]))].
So, in case (x, y), both x and y are Iterable[V].
Iterable overloads the ++ operator with concatenation: it returns an iterable with all of x's values followed by all of y's values.
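A tiny sketch of what that means:

val x: Iterable[Int] = Iterable(1, 2)
val y: Iterable[Int] = Iterable(10)
x ++ y  // Iterable(1, 2, 10): concatenation, with duplicates kept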
The function cogroup returns an RDD[(K, (Seq[V], Seq[W]))].
So the value is of type Tuple2. When you use flatMapValues, it will flatMap over the values, which are of type Seq.
++ for Seqs means concatenating them, resulting in a combined Seq.
case (x, y) means that you're using pattern matching. In your case, if neither of the values of your tuple is Nil, the function will return x ++ y.
The advantage of using flatMapValues in this case is that it flattens the result, thereby dropping all the None values.
You can check out the documentation here. Also if you're not sure what exactly Pattern matching or flatmaps are check this for pattern matching and this for flatmap
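A short sketch of that flattening behavior, assuming a local SparkContext named sc:

val pairs = sc.parallelize(Seq(("a", 1), ("b", -1)))
pairs.flatMapValues(v => if (v > 0) Some(v) else None).collect()
// Array(("a", 1)) -- the None case simply disappears from the result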

Adding Constant to RDD

I have a really stupid question. I know that an RDD is immutable, but is there any way to add a column of constants to an RDD?
More specifically, I have an RDD[(a: String, b: String)], and I wish to append a column of 1's so that I have an RDD[(a: String, b: String, c: Int)].
The reason is that I want to use the reduceByKey function to process these strings, and an arbitrary Int (that will be constantly updated) will help the function in reducing.
The solution in Scala is simply to use map:
rdd.map( t => (t._1, t._2, 1))
Or
rdd.map{ case (a, b) => (a, b, 1)}
You can easily do it with the map function. Here's an example in Python (note that tuple-unpacking lambdas like lambda (a, b): ... are Python 2 only; the version below also works on Python 3):
rdd.map(lambda ab: (ab[0], ab[1], 1))
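A hypothetical follow-up (back in Scala) showing why the constant helps: keying by the pair lets reduceByKey tally the 1's into counts:

rdd.map { case (a, b) => ((a, b), 1) }
   .reduceByKey(_ + _)  // ((a, b), number of occurrences of that pair)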