Using reduceByKey in Apache Spark (Scala)

I have a list of tuples of type (user id, name, count).
For example:
val x = sc.parallelize(List(
  ("a", "b", 1),
  ("a", "b", 1),
  ("c", "b", 1),
  ("a", "d", 1)
))
I'm attempting to reduce this collection so that, for each user id, every name is counted.
So the above val x is converted to:
(a,ArrayBuffer((d,1), (b,2)))
(c,ArrayBuffer((b,1)))
Here is the code I am currently using:
val byKey = x.map({case (id,uri,count) => (id,uri)->count})
val grouped = byKey.groupByKey
val count = grouped.map{case ((id,uri),count) => ((id),(uri,count.sum))}
val grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey
grouped2.foreach(println)
I'm attempting to use reduceByKey, as it performs faster than groupByKey.
How can reduceByKey be used instead of the above code to produce the same mapping?

Following your code:
val byKey = x.map({case (id,uri,count) => (id,uri)->count})
You could do:
val reducedByKey = byKey.reduceByKey(_ + _)
scala> reducedByKey.collect.foreach(println)
((a,d),1)
((a,b),2)
((c,b),1)
PairRDDFunctions[K,V].reduceByKey takes an associative reduce function that can be applied to the type V of the RDD[(K,V)]. In other words, you need a function f[V](e1:V, e2:V) : V. In this particular case, with sum on Ints: (x:Int, y:Int) => x+y, or _ + _ in short underscore notation.
For the record: reduceByKey performs better than groupByKey because it attempts to apply the reduce function locally before the shuffle/reduce phase. groupByKey will force a shuffle of all elements before grouping.
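To get back to the exact shape asked for in the question (one entry per user id), the reduced pairs can then be regrouped; a minimal sketch building on reducedByKey above:
val regrouped = reducedByKey
  .map { case ((id, uri), count) => (id, (uri, count)) }
  .groupByKey()
regrouped.collect.foreach(println)
// (a,CompactBuffer((d,1), (b,2)))   -- the concrete Iterable type and ordering may vary by Spark version
// (c,CompactBuffer((b,1)))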

Your origin data structure is: RDD[(String, String, Int)], and reduceByKey can only be used if data structure is RDD[(K, V)].
val kv = x.map(e => e._1 -> e._2 -> e._3) // kv is RDD[((String, String), Int)]
val reduced = kv.reduceByKey(_ + _) // reduced is RDD[((String, String), Int)]
val kv2 = reduced.map(e => e._1._1 -> (e._1._2 -> e._2)) // kv2 is RDD[(String, (String, Int))]
val grouped = kv2.groupByKey() // grouped is RDD[(String, Iterable[(String, Int)])]
grouped.foreach(println)

The syntax (from the Java API) is:
reduceByKey(func: Function2[V, V, V]): JavaPairRDD[K, V]
which says: for each key in the RDD it takes the values (which are necessarily of the same type), applies the operation provided as the function, and returns a value of the same type as in the parent RDD.
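A tiny Scala-side illustration of the same contract (a sketch; assumes a SparkContext named sc is available):
// values of type Int are reduced pairwise to a single Int per key
val pairs = sc.parallelize(Seq(("k1", 2), ("k1", 3), ("k2", 5)))
val summed = pairs.reduceByKey(_ + _)   // RDD[(String, Int)]
summed.collect.foreach(println)         // (k1,5), (k2,5)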

Related

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(Int, Int)] after import

I created this RDD:
scala> val data=sc.textFile("sparkdata.txt")
Then I am trying to return the content of the file:
scala> data.collect
I am splitting the existing data into individual words using:
scala> val splitdata = data.flatMap(line => line.split(" "));
scala> splitdata.persist()
scala> splitdata.collect;
Now, I am doing the map-reduce operation:
scala> val mapdata = splitdata.map(word => (word,1));
scala> mapdata.collect;
scala> val reducedata = mapdata.reduceByKey(_+_);
To get the result:
scala> reducedata.collect;
When I want to display the first 10 rows:
splitdata.groupByKey(identity).count().show(10)
I get the following error :
<console>:38: error: value groupByKey is not a member of org.apache.spark.rdd.RDD[String]
splitdata.groupByKey(identity).count().show(10)
^
<console>:38: error: missing argument list for method identity in object Predef
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `identity _` or `identity(_)` instead of `identity`.
splitdata.groupByKey(identity).count().show(10)
^
Similar to reduceByKey(), groupByKey() is a method for pair RDDs of type RDD[(K, V)], rather than for general RDDs. While reduceByKey() uses a provided binary function to reduce an RDD[(K, V)] to another RDD[(K, V)], groupByKey() transforms an RDD[(K, V)] into an RDD[(K, Iterable[V])]. To further transform the Iterable[V] by key, one would typically apply mapValues() (or flatMapValues()) with a provided function.
For example:
val rdd = sc.parallelize(Seq(
"apple", "apple", "orange", "banana", "banana", "orange", "apple", "apple", "orange"
))
rdd.map((_, 1)).reduceByKey(_ + _).collect
// res1: Array[(String, Int)] = Array((apple,4), (banana,2), (orange,3))
rdd.map((_, 1)).groupByKey().mapValues(_.sum).take(2)
// res2: Array[(String, Int)] = Array((apple,4), (banana,2))
In case you're interested only in getting the count of groups after applying groupByKey():
rdd.map((_, 1)).groupByKey().count()
// res3: Long = 3
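Applied to the asker's own variables, a sketch (note that show() is a DataFrame method, so take() is used to display results from a plain RDD):
// display the first 10 word counts; reducedata is the pair RDD built above
reducedata.take(10).foreach(println)
// equivalent via groupByKey on the pair RDD (works, but shuffles more data)
mapdata.groupByKey().mapValues(_.sum).take(10).foreach(println)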

Scala - Future.sequence on Tuples

I have a Seq of Tuples:
val seqTuple: Seq[(String, Future[String])] = Seq(("A", Future("X")), ("B", Future("Y")))
and I want to get:
val futureSeqTuple: Future[Seq[(String, String)]] = Future(Seq(("A", "X"), ("B", "Y")))
I know I can do:
val futureSeq: Future[Seq[String]] = Future.sequence(seqTuple.map(_._2))
but I am losing the first String in the Tuple.
What is the best way to get a Future[Seq[(String, String)]]?
Use the futures inside the tuples to map each tuple to a future of a tuple first,
then sequence:
Future.sequence(
  seqTuple.map { case (s1, fut_s2) => fut_s2.map { s2 => (s1, s2) } }
)
Step by step, from inner terms to outer terms:
The inner map converts Future("X") to Future(("A", "X")).
The outer map thus converts each ("A", Future("X")) into a Future(("A", "X")), giving you a Seq[Future[(String, String)]].
Now you can use sequence on that to obtain Future[Seq[(String, String)]].
The answer given here works fine, but I think Future.traverse would work more succinctly here:
Future.traverse(seqTuple) {
case (s1, s2Future) => s2Future.map{ s2 => (s1, s2) }
}
Future.traverse maps and sequences in one step, so the intermediate Seq[Future[(String, String)]] never has to be built explicitly :)
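For completeness, a minimal self-contained sketch of the traverse approach (the global ExecutionContext and Await are used purely for demonstration):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val seqTuple: Seq[(String, Future[String])] = Seq(("A", Future("X")), ("B", Future("Y")))

// traverse maps each (key, future) pair to a Future of a (key, value) pair and sequences in one pass
val result: Future[Seq[(String, String)]] =
  Future.traverse(seqTuple) { case (s1, s2Future) => s2Future.map(s2 => (s1, s2)) }

Await.result(result, 1.second)   // List((A,X), (B,Y))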

Scala convert Iterable of Tuples to Array of Just Tuple._2

Say I have an Iterable[(Int, String)]. How do I get an array of just the "values"? That is, how do I convert from Iterable[(Int, String)] => Array[String]? The "keys" or "values" do not have to be unique, and that's why I put them in quotation marks.
iterable.map(_._2).toArray
_._2: take the second element of the tuple represented by the input variable (_), whose name we don't care about.
Simply:
val iterable: Iterable[(Int, String)] = Iterable((1, "a"), (2, "b"))
val values = iterable.toArray.map(_._2)
Simply map the iterable and extract the second element (tuple._2):
scala> val iterable: Iterable[(Int, String)] = Iterable((100, "Bring me the horizon"), (200, "Porcupine Tree"))
iterable: Iterable[(Int, String)] = List((100,Bring me the horizon), (200,Porcupine Tree))
scala> iterable.map(tuple => tuple._2).toArray
res3: Array[String] = Array(Bring me the horizon, Porcupine Tree)
In addition to the already suggested map, you might want to build the array while mapping from tuple to string, instead of converting afterwards, as it saves an iteration:
import scala.collection
val values: Array[String] = iterable.map(_._2)(collection.breakOut)
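Note that breakOut was removed in Scala 2.13; there, a similar single-pass result can be obtained by going through an iterator (a sketch):
// the iterator is lazy, so the array is built directly without an intermediate collection
val values2: Array[String] = iterable.iterator.map(_._2).toArray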

Spark 1.5.1, Scala 2.10.5: how to expand an RDD[Array[String], Vector]

I am using Spark 1.5.1 with Scala 2.10.5
I have an RDD[(Array[String], Vector)]. For each element of the RDD:
I want to take each String in the Array[String] and combine it with the Vector to create a tuple (String, Vector); this step will lead to the creation of several tuples from each element of the initial RDD.
The goal is to end up with an RDD of tuples: RDD[(String, Vector)] containing all the tuples created in the previous step.
Thanks
Consider this:
rdd.flatMap { case (arr, vec) => arr.map( (s) => (s, vec) ) }
(Using flatMap gets you an RDD[(String, Vector)] as output, as opposed to a map, which would give you an RDD[Array[(String, Vector)]].)
Have you tried this?
// rdd: RDD[(Array[String], Vector)] - initial RDD
val new_rdd = rdd
  .flatMap {
    case (array: Array[String], vec: Vector) => array.map(str => (str, vec))
  }
Toy example (I'm running it in spark-shell):
val rdd = sc.parallelize(Array((Array("foo", "bar"), 100), (Array("one", "two"), 200)))
val new_rdd = rdd
  .map {
    case (array: Array[String], vec: Int) => array.map(str => (str, vec))
  }
  .flatMap(arr => arr)
new_rdd.collect
res14: Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))
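For comparison, the single flatMap from the first answer gives the same result on the toy data without building the intermediate arrays (a sketch):
val one_pass = rdd.flatMap { case (array, vec) => array.map(str => (str, vec)) }
one_pass.collect   // Array((foo,100), (bar,100), (one,200), (two,200))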

Bind extra information to a future sequence

Say I have been given a list of futures, each one linked to a key, such as:
val seq: Seq[(Key, Future[Value])]
And my goal is to produce a list of key value tuples once all futures have completed:
val complete: Seq[(Key, Value)]
I am wondering if this can be achieved using a sequence call. For example, I know I can do the following:
val complete = Future.sequence(seq.map(_._2)).onComplete {
  case Success(s) => s
  case Failure(NonFatal(e)) => Seq()
}
But this only returns a sequence of Value objects, and I lose the pairing between Key and Value. The problem is that Future.sequence expects a sequence of Futures.
How could I augment this to maintain the key/value pairing in my complete sequence?
Thanks
Des
How about transforming your Seq[(Key, Future[Value])] to Seq[Future[(Key, Value)]] first?
val seq: Seq[(Key, Future[Value])] = // however your implementation is
val futurePair: Seq[Future[(Key, Value)]] = for {
  (key, value) <- seq
} yield value.map(v => (key, v))
Now you can use sequence to get Future[Seq[(Key, Value)]].
val complete: Future[Seq[(Key, Value)]] = Future.sequence(futurePair)
Just a different expression of the other answer, using unzip and zip (this works because Future.sequence preserves element order, so the keys can safely be zipped back with the results).
scala> val vs = Seq(("one",Future(1)),("two",Future(2)))
vs: Seq[(String, scala.concurrent.Future[Int])] = List((one,scala.concurrent.impl.Promise$DefaultPromise#4e38d975), (two,scala.concurrent.impl.Promise$DefaultPromise#35f8a9d3))
scala> val (ks, fs) = vs.unzip
ks: Seq[String] = List(one, two)
fs: Seq[scala.concurrent.Future[Int]] = List(scala.concurrent.impl.Promise$DefaultPromise#4e38d975, scala.concurrent.impl.Promise$DefaultPromise#35f8a9d3)
scala> val done = (Future sequence fs) map (ks zip _)
done: scala.concurrent.Future[Seq[(String, Int)]] = scala.concurrent.impl.Promise$DefaultPromise#56913163
scala> done.value
res0: Option[scala.util.Try[Seq[(String, Int)]]] = Some(Success(List((one,1), (two,2))))
or maybe save on zippage:
scala> val done = (Future sequence fs) map ((ks, _).zipped)
done: scala.concurrent.Future[scala.runtime.Tuple2Zipped[String,Seq[String],Int,Seq[Int]]] = scala.concurrent.impl.Promise$DefaultPromise#766a52f5
scala> done.value.get.get.toList
res1: List[(String, Int)] = List((one,1), (two,2))