Replace groupByKey() with reduceByKey() - scala

This is a follow-up question from here. I am trying to implement k-means based on this implementation. It works well, but I would like to replace groupByKey() with reduceByKey(), and I am not sure how (I am not worried about performance at this point). Here is the relevant minified code:
val data = sc.textFile("dense.txt").map(
  t => (t.split("#")(0), parseVector(t.split("#")(1)))).cache()
val read_mean_centroids = sc.textFile("centroids.txt").map(
  t => (t.split("#")(0), parseVector(t.split("#")(1))))
var centroids = read_mean_centroids.takeSample(false, K, 42).map(x => x._2)
do {
  var closest = read_mean_centroids.map(p => (closestPoint(p._2, centroids), p._2))
  var pointsGroup = closest.groupByKey() // <-- THE VICTIM :)
  var newCentroids = pointsGroup.mapValues(ps => average(ps.toSeq)).collectAsMap()
  ..
Notice that println(newCentroids) will give:
Map(23 -> (-6.269305E-4, -0.0011746404, -4.08004E-5), 8 -> (-5.108732E-4, 7.336348E-4, -3.707591E-4), 17 -> (-0.0016383086, -0.0016974678, 1.45..
and println(closest):
MapPartitionsRDD[6] at map at kmeans.scala:75
Relevant question: Using reduceByKey in Apache Spark (Scala).
Some documentation:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(func: (V, V) ⇒ V, numPartitions: Int): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
Merge the values for each key using an associative reduce function.
def groupByKey(): RDD[(K, Iterable[V])]
Group the values for each key in the RDD into a single sequence.

You could use aggregateByKey() (which is a bit more natural than reduceByKey()) like this to compute newCentroids:
val newCentroids = closest.aggregateByKey((Vector.zeros(dim), 0L))(
  (agg, v) => (agg._1 += v, agg._2 + 1L),
  (agg1, agg2) => (agg1._1 += agg2._1, agg1._2 + agg2._2)
).mapValues(agg => agg._1 / agg._2).collectAsMap
For this to work you will need to compute the dimensionality of your data, i.e. dim, but you only need to do this once. You could probably use something like val dim = data.first._2.length.
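If you want to stay with reduceByKey() specifically, a common pattern is to pair each point with a count of 1, reduce, and divide at the end. A minimal sketch, assuming parseVector returns a vector type that supports + between vectors and division by a scalar (for instance breeze.linalg.DenseVector[Double]):
val newCentroids = closest
  .mapValues(v => (v, 1L))                            // pair each point with a count of 1
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // per centroid index: sum of points, total count
  .mapValues { case (sum, n) => sum / n.toDouble }    // mean = summed vector / count
  .collectAsMap()
This avoids materialising the grouped points the way groupByKey() does, at the cost of carrying the running count alongside the sum.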

Related

How to properly iterate over Array[String]?

I have a function in Scala to which I send arguments; I use it like this:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, f(v))}
My function f is:
val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

def f(v: Array[String]): Int = {
  val parsedDates = v.map(LocalDate.parse(_, formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}
And this is the error I get:
found : Iterable[Array[String]]
required: Array[String]
I already tried using:
val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, for (date <- v) f(date))}
But I get massive errors.
Just to get a better picture, data in concat is:
1974,1974-06-22
1966,1966-07-20
1954,1954-06-19
1994,1994-06-27
1954,1954-06-26
2006,2006-07-04
2010,2010-07-07
1990,1990-06-30
...
It is of type RDD[String].
How can I properly iterate over that and get a single Int from that function f?
The RDD types at each step of your pipeline are:
concat.map(_.split(",")) gives an RDD[Array[String]]
for instance Array("1954", "1954-06-19")
concat.map(_.split(",")).keyBy(_(0)) gives RDD[(String, Array[String])]
for instance ("1954", Array("1954", "1954-06-19"))
concat.map(_.split(",")).keyBy(_(0)).groupByKey() gives RDD[(String, Iterable[Array[String]])]
for instance ("1954", Iterable(Array("1954", "1954-06-19"), Array("1954", "1954-06-26")))
Thus when you map at the end, the type of values is Iterable[Array[String]].
Since your input is "1974,1974-06-22", the solution could consist of replacing your keyBy transformation with a map:
input.map(_.split(",")).map(x => x(0) -> x(1)).groupByKey().map{case (k, v) => (k, f(v))}
Indeed, .map(x => x(0) -> x(1)) (instead of .map(x => x(0) -> x), which is what keyBy(_(0)) is syntactic sugar for) provides as the value the second element of the split array instead of the array itself, giving RDD[(String, String)] at this second step rather than RDD[(String, Array[String])].
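Note that after this change the grouped values are Iterable[String] rather than Array[String], so f has to be adjusted accordingly. A minimal sketch of the whole pipeline under that assumption:
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)

// f now takes the Iterable[String] of date strings produced by groupByKey
def f(dates: Iterable[String]): Int = {
  val parsed = dates.map(LocalDate.parse(_, formatter))
  parsed.max.getDayOfYear - parsed.min.getDayOfYear
}

val evega = concat
  .map(_.split(","))
  .map(x => x(0) -> x(1))                     // key by year, keep only the date string
  .groupByKey()                               // RDD[(String, Iterable[String])]
  .map { case (k, dates) => (k, f(dates)) }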

Spark Rdd - using sortBy with multiple column values

After grouping my dataset, it looks like this:
(AD_PRES,1)
(AD_VP,2)
(FI_ACCOUNT,5)
(FI_MGR,1)
(IT_PROG,5)
(PU_CLERK,5)
(PU_MAN,1)
(SA_MAN,5)
(ST_CLERK,20)
(ST_MAN,5)
Here I want to sort by key descending and value ascending, so I tried the below lines of code.
emp_data.map(s => (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
  case (x, y) => (x, y.toList.size)
}).sortBy(s => (s._1, s._2))(Ordering.Tuple2(Ordering.String.reverse, Ordering.Int.reverse))
It is causing the below exception:
not enough arguments for expression of type (implicit ord: Ordering[(String, Int)], implicit ctag: scala.reflect.ClassTag[(String, Int)])org.apache.spark.rdd.RDD[(String, Int)]. Unspecified value parameter ctag.
RDD.sortBy takes both ordering and class tags as implicit arguments.
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
You cannot just provide a subset of these and expect things to work. Instead you can provide a block-local implicit ordering:
{
  implicit val ord = Ordering.Tuple2[String, Int](Ordering.String.reverse, Ordering.Int.reverse)

  emp_data.map(s => (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
    case (x, y) => (x, y.toList.size)
  }).sortBy(s => (s._1, s._2))
}
though you should really use reduceByKey rather than groupByKey in such a case.
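For example, a minimal sketch of the reduceByKey variant, assuming (as in the question) that emp_data has a JOB_ID field and that only the per-key count is needed; Ordering.Int is left unreversed here so that values sort ascending, as stated in the question:
{
  implicit val ord: Ordering[(String, Int)] =
    Ordering.Tuple2(Ordering.String.reverse, Ordering.Int)

  emp_data
    .map(s => (s.JOB_ID, 1))        // one record per employee
    .reduceByKey(_ + _)             // count per JOB_ID without building the full groups
    .sortBy(s => (s._1, s._2))      // key descending, count ascending via the implicit ordering
}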

Scala: Difference between Map.map and Map.transform? Why Map.map requires pattern matching in its parameter?

For an immutable Map,
val original = Map("A"->1, "B"->2)
I can either use
original.map { case (k, v) => (k, v + 1) }
Or
original.transform((_, v) => v + 1)
to transform the values.
But why does the map() method require case pattern matching while transform() doesn't? Is it because these methods are defined in different implicit types?
Someone has marked my question as a duplicate of another question, Difference between mapValues and transform in Map. It is not the same: I am asking about Map.map, not Map.mapValues. I am also asking about the different ways of using the two methods.
With the map method you can change (I don't want to use the word transform here) the whole Map, converting it to another Map, a List, etc.
val m = Map(1->"a")
m.map { case (k,v) => (k+1) -> (v + 1) } // Map(2 -> a1)
m.map { case (k,v) => k+v } // List(1a)
With the transform method you can change only the values, taking their keys into account:
m.transform { case (k, v) => v + 1 } // Map(1 -> a1)
transform takes a function that has two values as inputs, the first being the key and the second the value. Pattern matching is not needed, since the two values are passed in individually.
On the other hand, the function passed to map takes a single tuple containing the key and value of the element as its input. Pattern matching is used to break this tuple into its components. You don't have to use pattern matching, but that would mean working with the tuple object instead of its contents.
The difference is in the function they receive. As you can see in the API
def transform[W, That](f: (K, V) ⇒ W)(implicit bf: CanBuildFrom[Map[K, V], (K, W), That]): That
def map[B](f: (A) ⇒ B): Map[B]
transform's function receives the key and the value as two separate arguments (f: (K, V) ⇒ W), while map's function receives a single value (f: (A) ⇒ B), where A here is the (key, value) tuple.
So if you want to treat the key and the value separately and in an easy-to-read way, you should use the case keyword.
You can also do something like this, but it is way less readable:
original.map(r => (r._1, r._2+1))
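A small self-contained sketch contrasting the three call shapes (standard library only):
val original = Map("A" -> 1, "B" -> 2)

// map: the function receives one argument, the (key, value) tuple
val viaCase  = original.map { case (k, v) => (k, v + 1) }  // pattern match the tuple
val viaTuple = original.map(kv => (kv._1, kv._2 + 1))      // or access the tuple fields directly

// transform: the function receives two arguments, key and value
val viaTransform = original.transform((_, v) => v + 1)

// all three yield Map(A -> 2, B -> 3)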

How to merge two LinkedHashMaps[Int, ListBuffer[Int]] in Scala?

I found this method:
def merge[K, V](maps: Seq[Map[K, V]])(f: (K, V, V) => V): Map[K, V] = {
  maps.foldLeft(Map.empty[K, V]) { case (merged, m) =>
    m.foldLeft(merged) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(existing) => acc.updated(k, f(k, existing, v))
        case None => acc.updated(k, v)
      }
    }
  }
}
but it gives me a type mismatch error if I use it like this:
val mergeMsg = (map1: LinkedHashMap[Int, ListBuffer[Int]], map2: LinkedHashMap[Int, ListBuffer[Int]]) => {
  val ms = Seq(map1, map2)
  merge(ms.map(_.mapValues(List(_)))){ (_, v1, v2) => v1 ++ v2 }
}
The error says:
"Type mismatch, expected: mutable.Seq[Mutable.Map[NotInferedK, NotInferedV]], actual: mutable.Seq[Map[Int, List[ListBuffer[Int]]]]"
How can I solve this? I know it's something simple, but I'm new to Scala.
The problem is that you are passing merge a sequence of mutable LinkedHashMaps, while the function requires a sequence of immutable Maps.
You need to convert your LinkedHashMaps to the correct type first. The simplest way to do this is probably to call .toMap before you perform the mapValues.
merge(ms.map(_.toMap.mapValues(List(_)))){(_, v1, v2) => v1 ++ v2}
Update
Alternatively, the method signature for merge can be changed to explicitly use scala.collection.Map. By default, Map refers to scala.collection.immutable.Map.
def merge[K, V](maps: Seq[scala.collection.Map[K, V]])(f: (K, V, V) => V): scala.collection.Map[K, V]
val mergeMsg = (map1: LinkedHashMap[Int, ListBuffer[Int]],
                map2: LinkedHashMap[Int, ListBuffer[Int]]) => {
  val ms = Seq(map1.toMap, map2.toMap)
  merge(ms)((_, lb1, lb2) => lb1 ++ lb2)
}
So the type just needs to be converted to Map.
The k is not used when updating, so we use _ instead.
lb1 and lb2 are the ListBuffers.
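Putting it together, a minimal self-contained sketch of this second approach (the sample LinkedHashMaps here are made up for illustration):
import scala.collection.mutable.{LinkedHashMap, ListBuffer}

def merge[K, V](maps: Seq[scala.collection.Map[K, V]])(f: (K, V, V) => V): scala.collection.Map[K, V] =
  maps.foldLeft(Map.empty[K, V]) { case (merged, m) =>
    m.foldLeft(merged) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(existing) => acc.updated(k, f(k, existing, v))
        case None           => acc.updated(k, v)
      }
    }
  }

val map1 = LinkedHashMap(1 -> ListBuffer(1, 2), 2 -> ListBuffer(3))
val map2 = LinkedHashMap(1 -> ListBuffer(4), 3 -> ListBuffer(5))

val maps: Seq[scala.collection.Map[Int, ListBuffer[Int]]] = Seq(map1, map2)

// values with the same key are concatenated:
// Map(1 -> ListBuffer(1, 2, 4), 2 -> ListBuffer(3), 3 -> ListBuffer(5))
val merged = merge(maps)((_, lb1, lb2) => lb1 ++ lb2)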

Scala - Join List of tuples by Key

I am looking for a way to join two lists of tuples in Scala to get the same result that Apache Spark gives me using the join function.
Example:
Given two lists of tuples such as:
val l1 = List((1,1),(1,2),(2,1),(2,2))
l1: List[(Int, Int)] = List((1,1), (1,2), (2,1), (2,2))
val l2 = List((1,(1,2)), (2,(2,3)))
l2: List[(Int, (Int, Int))] = List((1,(1,2)), (2,(2,3)))
What is the best way to join both lists by key to get the following result?
l3: List[(Int,(Int,(Int,Int)))] = ((1,(1,(1,2))),(1,(2,(1,2))),(2,(1,(2,3))),(2,(2,(2,3))))
You can use a for comprehension and take advantage of the backtick syntax in pattern matching. That is, it will match only when a key from the first list is the same as one in the second list (`k` means the key in the tuple must be equal to the value of k).
val res = for {
  (k, v1) <- l1
  (`k`, v2) <- l2
} yield (k, (v1, v2))
I hope you find this helpful.
You might want to do something like this:
val l3=l1.map(tup1 => l2.filter(tup2 => tup1._1==tup2._1).map(tup2 => (tup1._1, (tup1._2, tup2._2)))).flatten
It matches the same keys, creates sublists, and then combines the list of lists with flatten.
This results to:
List((1,(1,(1,2))), (1,(2,(1,2))), (2,(1,(2,3))), (2,(2,(2,3))))
Try something like this:
val l2Map = l2.toMap
val l3 = l1.flatMap { case (k, v1) => l2Map.get(k).map(v2 => (k, (v1, v2))) }
which can be rewritten in a more general form using implicits:
package some.package

import scala.collection.TraversableLike
import scala.collection.generic.CanBuildFrom

package object collection {
  implicit class PairTraversable[K, V, C[A] <: TraversableLike[A, C[A]]](val seq: C[(K, V)]) {
    def join[V2, C2[A] <: TraversableLike[A, C2[A]]](other: C2[(K, V2)])
        (implicit canBuildFrom: CanBuildFrom[C[(K, V)], (K, (V, V2)), C[(K, (V, V2))]]): C[(K, (V, V2))] = {
      val otherMap = other.toMap
      seq.flatMap { case (k, v1) => otherMap.get(k).map(v2 => (k, (v1, v2))) }
    }
  }
}
and then simply:
import some.package.collection.PairTraversable
val l3 = l1.join(l2)
This solution converts the second sequence to a map (so it consumes some additional memory), but it is much faster than the solutions in the other answers (compare them for large collections, e.g. 10000 elements; on my laptop it is 5 ms vs 2500 ms).
A little late. This solution will give you back the original size of l1 and return None for values missing in l2 (a left join instead of an inner join).
val m2 = l2.map { case (k, v) => k -> v }.toMap

val res2 = l1.map { case (k, v) =>
  val v2 = m2.get(k)
  (k, (v, v2))
}