Spark Rdd - using sortBy with multiple column values - scala

After grouping my dataset, it looks like this
Here I want to sort by key descending and by value ascending, so I tried the lines of code below.
=> (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
case (x, y) => (x, y.toList.size)
}).sortBy(s => (s._1, s._2))(Ordering.Tuple2(Ordering.String.reverse, Ordering.Int.reverse))
It causes the exception below.
not enough arguments for expression of type (implicit ord: Ordering[(String, Int)], implicit ctag: scala.reflect.ClassTag[(String, Int)])org.apache.spark.rdd.RDD[(String, Int)]. Unspecified value parameter ctag.

RDD.sortBy takes both ordering and class tags as implicit arguments.
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
You cannot just provide a subset of these and expect things to work. Instead you can provide block local implicit ordering:
implicit val ord = Ordering.Tuple2[String, Int](Ordering.String.reverse, Ordering.Int.reverse)
=> (s.JOB_ID, s.FIRST_NAME.concat(",").concat(s.LAST_NAME))).groupByKey().map({
case (x, y) => (x, y.toList.size)
}).sortBy(s => (s._1, s._2))
Though you should really use reduceByKey rather than groupByKey in such a case.
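The effect of a combined tuple ordering can be checked on a local collection without Spark. The sketch below is only an illustration: the job IDs and counts are made-up sample data, and List.sortBy stands in for RDD.sortBy (it takes only the Ordering implicitly, so no ClassTag is involved). It sorts keys descending and counts ascending, as the question describes.

```scala
object TupleOrderingSketch {
  // Made-up (JOB_ID, count) pairs standing in for the grouped RDD contents.
  val counts: List[(String, Int)] =
    List(("IT_PROG", 5), ("AD_VP", 2), ("IT_PROG", 1), ("SA_REP", 30))

  // Key descending, value ascending. An implicit in lexical scope takes
  // precedence over the default tuple ordering from the Ordering companion.
  implicit val keyDescValueAsc: Ordering[(String, Int)] =
    Ordering.Tuple2(Ordering.String.reverse, Ordering.Int)

  // No explicit ordering argument: the implicit above is picked up.
  val sorted: List[(String, Int)] = counts.sortBy(identity)
}
```

With an RDD the same implicit val would let sortBy resolve both the Ordering and the ClassTag on its own, which is the point of the answer above.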


Sort list of String based on its value in given map

I have one String list
val str = List("A","B","C","D")
and given:
val map = Map(("A"->3),("B"->1),("C"->10),("D"->5))
To sort the str list by the values in the given map, I tried str.sortBy(map). It gives me the error "A" key is not found. Could someone please help me understand what I am doing wrong?
It should work as it is. Let's explain why. The signature of sortBy is:
def sortBy[B](f: A => B)(implicit ord: Ordering[B]): C = sorted(ord on f)
Therefore when you do str.sortBy(map), sortBy expects to get String => Int. str.sortBy(map) is equivalent to:
str.sortBy(s => map(s))
Note that Map extends MapOps (in Scala 2.13; in Scala 2.12 it is MapLike). MapOps (and MapLike) exposes an apply method, which takes (in your case) a String and returns an Int:
def apply(key: K): V = get(key) match {
  case None => default(key)
  case Some(value) => value
}
Hence writing str.sortBy(map) is the same as:
str.sortBy(s => map.apply(s))
which is the same as str.sortBy(s => map(s)), because map(s) is shorthand for map.apply(s).
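To make the equivalence concrete, here is a small sketch using the list and map from the question. The withDefaultValue variant at the end is an assumption about what one might want for keys that are missing from the map; it is not part of the original answer.

```scala
object SortByMapSketch {
  val str = List("A", "B", "C", "D")
  val map = Map("A" -> 3, "B" -> 1, "C" -> 10, "D" -> 5)

  // Map[String, Int] is itself a String => Int, so it can be passed directly.
  val sorted: List[String] = str.sortBy(map)

  // A key that is missing from the map makes apply throw
  // NoSuchElementException, the kind of error reported in the question.
  val failsOnMissingKey: Boolean =
    scala.util.Try(List("A", "Z").sortBy(map)).isFailure

  // Assumption: unknown keys should sort last instead of failing.
  val lenient: List[String] =
    List("Z", "C", "A").sortBy(map.withDefaultValue(Int.MaxValue))
}
```

If the original error appears with the exact data shown, the list probably contains a key that differs from the map's keys, for example through whitespace or case.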

Supplying a code block as one of multiple method parameters

Consider these overloaded groupBy signatures:
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}
def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy(f, new HashPartitioner(numPartitions))
}
A correct/working invocation of the former is as follows:
val groupedRdd = df.rdd.groupBy{ r => r.getString(r.fieldIndex("centroidId"))}
But I am unable to determine how to add the second parameter. Here is the obvious attempt - which gives syntax errors:
val groupedRdd = df.rdd.groupBy{ r => r.getString(r.fieldIndex("centroidId")),
I had also tried (also with syntax errors) :
val groupedRdd = df.rdd.groupBy({ r => r.getString(r.fieldIndex("centroidId"))},
By the way, here is an approach that does work, but I am looking for the inline syntax:
def func(r: Row) = r.getString(r.fieldIndex("centroidId"))
val groupedRdd = df.rdd.groupBy( func _, nPartitions)
Since this is a generic method with type parameters T, K, Scala sometimes can't infer what types those should be from the context. In such cases you can help it by providing type annotation like this:
df.rdd.groupBy({ r: Row => r.getString(r.fieldIndex("centroidId")) }, nPartitions)
This is also the reason why this approach works:
def func(r: Row) = r.getString(r.fieldIndex("centroidId"))
val groupedRdd = df.rdd.groupBy(func _, nPartitions)
This fixes the type for r to be a Row similarly to the approach above.
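The same calling pattern can be tried without Spark. In the sketch below, Tiny and Rec are hypothetical stand-ins (not Spark classes) that only mimic the two overloaded groupBy signatures; the point is the call site, where the function literal carries an explicit parameter type and is followed by the second argument.

```scala
object GroupByDemo {
  // Hypothetical record type standing in for a Row with a centroidId column.
  final case class Rec(centroidId: String, value: Double)

  // Hypothetical stand-in for an RDD with two overloaded groupBy methods.
  final class Tiny[T](items: Seq[T]) {
    def groupBy[K](f: T => K): Map[K, Seq[T]] = items.groupBy(f)

    // numPartitions is only validated here; a local Seq has no partitions.
    def groupBy[K](f: T => K, numPartitions: Int): Map[K, Seq[T]] = {
      require(numPartitions > 0, "numPartitions must be positive")
      items.groupBy(f)
    }
  }

  val data = new Tiny(Seq(Rec("c1", 1.0), Rec("c2", 2.0), Rec("c1", 3.0)))

  // Typed function literal plus a second argument, all inline.
  val grouped: Map[String, Seq[Rec]] = data.groupBy((r: Rec) => r.centroidId, 2)
}
```

Writing the parameter as (r: Rec) in parentheses keeps the literal valid in both Scala 2 and Scala 3.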

Type issues when using flatMapGroups

I am translating spark-1.6 rdd to spark-2.x datasets
The original code was:
val sample_data : Dataset[(Int, Array[Double])]
val samples : Array[Array[Array[Double]]] = sample_data.rdd
.groupBy(x => x._1)
.map(x => {
val (id: Int, points: Iterable[(Int, Array[Double])]) = x
val data1 = points.map(x => x._2).toArray
The sample_data.rdd no longer works so I am trying to do the same operations using datasets. The new approach uses flatMapGroups
val sample_data : Dataset[(Int, Array[Double])]
val samples : Array[Array[Array[Double]]] = sample_data
.groupByKey(x => x._1)
.flatMapGroups ( (id: Int, points: Iterable[(Int, Array[Double])]) =>
Iterator(, y:Array[Double]) => y)).toList
The error given is:
Error:(36, 25) overloaded method value map with alternatives: [B,
That](f: ((Int, Array[Double])) => B)(implicit bf:
Array[Double])],B,That])That [B](f: ((Int, Array[Double])) =>
B)Iterator[B] cannot be applied to ((Int, Array[Double]) =>
Iterator(, y:Array[Double])
=> y)).toList
Can you please provide an example of how to use flatMapGroups and how to understand the given error?
points is actually an Iterator, but you are declaring it as an Iterable, so the compiler is telling you to make it an Iterator.
This is what you are trying to do:
val samples: Array[Array[Array[Double]]] = sample_data
.flatMapGroups((id: Int, points: Iterator[(Int, Array[Double])]) =>
Rewrapping in an Iterator serves no purpose, so you can just use mapGroups like so:
.mapGroups((_, points) =>
However, in both cases there is no encoder for an Array[Array[_]].
So either implement the implicit Encoder yourself (using the existing Encoders as a reference), or stick to the RDD interface.
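The shape of the callback is easier to see on a local collection. The sketch below uses a hypothetical helper, flatMapGroupsLocal, which is not Spark's API and only mirrors the idea that each group arrives as an Iterator of the full elements and the callback returns an Iterator of results. The sample data is made up.

```scala
object FlatMapGroupsSketch {
  // Hypothetical local stand-in for groupByKey(key).flatMapGroups(f).
  // It is NOT Spark's API; it only mirrors the shape of the callback:
  // the group arrives as an Iterator and the result is an Iterator too.
  def flatMapGroupsLocal[T, K, U](data: Seq[T])(key: T => K)(
      f: (K, Iterator[T]) => Iterator[U]): List[U] =
    data.groupBy(key).toList.flatMap { case (k, group) => f(k, group.iterator) }

  // Made-up sample data of the same element type as in the question.
  val sampleData: Seq[(Int, Array[Double])] =
    Seq((1, Array(1.0, 2.0)), (2, Array(3.0)), (1, Array(4.0)))

  // One output element per group: the id and the points of that group.
  // Arrays are converted to Lists so results can be compared structurally.
  val samples: List[(Int, List[List[Double]])] =
    flatMapGroupsLocal(sampleData)(_._1) { (id, points) =>
      Iterator((id, points.map(_._2.toList).toList))
    }.sortBy(_._1)
}
```

Because the callback here emits exactly one element per group, a mapGroups-style function would express the same thing more directly, as the answer notes.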

subsets manipulation on vectors in spark scala

I have an RDD curRdd of the form
res10: org.apache.spark.rdd.RDD[(scala.collection.immutable.Vector[(Int, Int)], Int)] = ShuffledRDD[102]
with curRdd.collect() producing the following result.
Array((Vector((5,2)),1), (Vector((1,1)),2), (Vector((1,1), (5,2)),2))
Here the key is a vector of pairs of Ints and the value is a count.
Now, I want to convert it into another RDD of the same form RDD[(scala.collection.immutable.Vector[(Int, Int)], Int)] by percolating down the counts.
That is, (Vector((1,1), (5,2)),2) will contribute its count of 2 to any key which is a subset of it, so (Vector((5,2)),1) becomes (Vector((5,2)),3).
For the example above, our new RDD will have
(Vector((5,2)),3), (Vector((1,1)),4), (Vector((1,1), (5,2)),2)
How do I achieve this? Kindly help.
First you can introduce a subsets operation for Seq:
implicit class SubSetsOps[T](val elems: Seq[T]) extends AnyVal {
  def subsets: Vector[Seq[T]] = elems match {
    case Seq() => Vector(elems)
    case elem +: rest => {
      val recur = rest.subsets
      recur ++ recur.map(elem +: _)
    }
  }
}
The empty subset will always be the first element in the result vector, so you can omit it with .tail
Now your task is a fairly obvious map-reduce, which is flatMap followed by reduceByKey in terms of RDD:
val result = curRdd
.flatMap { case (keys, count) => keys.subsets.tail.map(_ -> count) }
.reduceByKey(_ + _)
This implementation could introduce new sets in the result. If you would like to keep only those that were present in the original collection, you can join the result with the original:
val result = curRdd
.flatMap { case (keys, count) => keys.subsets.tail.map(_ -> count) }
.reduceByKey(_ + _)
.join(curRdd map identity[(Seq[(Int, Int)], Int)])
.map { case (key, (v, _)) => (key, v) }
Note that the map identity step is needed to convert the key type from Vector[_] to Seq[_] in the original RDD. You can instead modify the SubSetsOps definition, substituting all occurrences of Seq[T] with Vector[T], or change the definition in the following way (hardcoded to scala.collection):
import scala.collection.SeqLike
import scala.collection.generic.CanBuildFrom
implicit class SubSetsOps[T, F[e] <: SeqLike[e, F[e]]](val elems: F[T]) extends AnyVal {
  def subsets(implicit cbf: CanBuildFrom[F[T], T, F[T]]): Vector[F[T]] = elems match {
    case Seq() => Vector(elems)
    case elem +: rest => {
      val recur = rest.subsets
      recur ++ recur.map(elem +: _)
    }
  }
}
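The whole percolation can be checked on a local collection with the data from the question. This is a sketch under assumptions: a plain subsets function over Vector replaces the implicit class (which also avoids CanBuildFrom, unavailable in Scala 2.13), groupBy plus sum stands in for reduceByKey, and a Set membership filter stands in for the join.

```scala
object PercolateCounts {
  type Key = Vector[(Int, Int)]

  // All subsets of xs with element order preserved; the empty subset is first.
  def subsets[T](xs: Vector[T]): Vector[Vector[T]] =
    if (xs.isEmpty) Vector(Vector.empty[T])
    else {
      val recur = subsets(xs.tail)
      recur ++ recur.map(xs.head +: _)
    }

  // The data shown in the question.
  val cur: Seq[(Key, Int)] = Seq(
    (Vector((5, 2)), 1),
    (Vector((1, 1)), 2),
    (Vector((1, 1), (5, 2)), 2))

  // Keys present in the original data (plays the role of the join).
  val original: Set[Key] = cur.map(_._1).toSet

  val result: Map[Key, Int] = cur
    .flatMap { case (keys, count) => subsets(keys).tail.map(_ -> count) } // drop empty subset
    .filter { case (key, _) => original(key) }
    .groupBy(_._1)
    .map { case (key, pairs) => key -> pairs.map(_._2).sum }
}
```

Note that keys are compared as Vectors, so two keys with the same pairs in a different order would be treated as different sets. The number of subsets also doubles with every element of a key, so this only suits short keys.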

Sorting keys in an RDD

I need to sort the keys in an RDD, but there is no natural sorting order (not ascending or descending). I wouldn't even know how to write a Comparator to do it. Say I had a map of apples, pears, oranges, and grapes, I want to sort by oranges, apples, grapes, and pears.
Any ideas on how to do this in Spark/Scala? Thanks!
In Scala, you need to look for the Ordering[T] trait rather than the Comparator interface -- mostly a cosmetic difference so that the focus is on the attribute of the data rather than a thing which compares two instances of the data. Implementing the trait requires that the compare(T,T) method be defined. A very explicit version of the enumerated comparison could be:
object fruitOrdering extends Ordering[String] {
  def compare(lhs: String, rhs: String): Int = (lhs, rhs) match {
    case ("orange", "orange") => 0
    case ("orange", _) => -1
    case ("apple", "orange") => 1
    case ("apple", "apple") => 0
    case ("apple", _) => -1
    case ("grape", "orange") => 1
    case ("grape", "apple") => 1
    case ("grape", "grape") => 0
    case ("grape", _) => -1
    case ("pear", "orange") => 1
    case ("pear", "apple") => 1
    case ("pear", "grape") => 1
    case ("pear", "pear") => 0
    case ("pear", _) => -1
    case _ => 0
  }
}
Or, to slightly adapt zero323's answer:
object fruitOrdering2 extends Ordering[String] {
  private val values = Seq("orange", "apple", "grape", "pear")
  // generate the map based off of indices so we don't have to worry about human error during updates
  private val ordinalMap = values.zipWithIndex.toMap.withDefaultValue(Int.MaxValue)
  def compare(lhs: String, rhs: String): Int = ordinalMap(lhs).compare(ordinalMap(rhs))
}
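An index-based ordering like this can be tried on plain Scala collections before wiring it into Spark. The sketch below re-creates the same idea under a different name (FruitOrder, chosen here only for illustration) and passes it explicitly to sorted and sortBy, which on local collections take just the Ordering.

```scala
object FruitOrderSketch {
  // Enumerated order; anything not listed gets the largest rank and sorts last.
  object FruitOrder extends Ordering[String] {
    private val rank: Map[String, Int] =
      Seq("orange", "apple", "grape", "pear").zipWithIndex.toMap
        .withDefaultValue(Int.MaxValue)

    def compare(lhs: String, rhs: String): Int =
      Integer.compare(rank(lhs), rank(rhs))
  }

  // Sorting plain strings with the custom ordering.
  val fruits: List[String] =
    List("grape", "apple", "pear", "orange", "kiwi").sorted(FruitOrder)

  // Sorting pairs by key with the same ordering.
  val pairs: List[(String, Double)] =
    List(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)).sortBy(_._1)(FruitOrder)
}
```

Unlike the fully enumerated version, which returns 0 for any unknown pair, this variant places unknown names after all known ones.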
Now that you have an instance of Ordering[String], you need to tell the sortBy method to use this ordering rather than the built-in one. If you look at RDD#sortBy you'll see the full signature is
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
That implicit Ordering[K] in the second parameter list is normally looked up by the compiler for pre-defined orderings -- that's how it knows what the natural ordering should be. Any implicit parameter, however, can be given an explicit value instead. Note that if you supply one implicit value then you need to supply all, so in this case we also need to provide the ClassTag[K]. That's always generated by the compiler but can be easily explicitly generated using scala.reflect.classTag.
Specifying all of that, the invocation would look like:
import scala.reflect.classTag
rdd.sortBy { case (key, _) => key }(fruitOrdering, classTag[String])
That's still pretty messy, though, isn't it? Luckily we can use implicit classes to take away a lot of the cruft. Here's a snippet that I use fairly commonly:
package com.example.spark
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
package object implicits {
  implicit class RichSortingRDD[A : ClassTag](underlying: RDD[A]) {
    def sorted(implicit ord: Ordering[A]): RDD[A] =
      underlying.sortBy(identity)(ord, implicitly[ClassTag[A]])
    def sortWith(fn: (A, A) => Int): RDD[A] = {
      val ord = new Ordering[A] { def compare(lhs: A, rhs: A): Int = fn(lhs, rhs) }
      sorted(ord)
    }
  }
  implicit class RichSortingPairRDD[K : ClassTag, V](underlying: RDD[(K, V)]) {
    def sortByKey(implicit ord: Ordering[K]): RDD[(K, V)] =
      underlying.sortBy { case (key, _) => key } (ord, implicitly[ClassTag[K]])
    def sortByKeyWith(fn: (K, K) => Int): RDD[(K, V)] = {
      val ord = new Ordering[K] { def compare(lhs: K, rhs: K): Int = fn(lhs, rhs) }
      sortByKey(ord)
    }
  }
}
And in action:
import com.example.spark.implicits._
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
rdd.sortByKey.collect // Natural ordering by default
// Array[(String, Double)] = Array((apple,5.0), (grape,0.3), (orange,5.6))
rdd.sortWith(_._2 compare _._2).collect // sort by the value instead
// Array[(String, Double)] = Array((grape,0.3), (apple,5.0), (orange,5.6))
If the only way you can describe the order is enumeration then simply enumerate:
val order = Map("orange" -> 0L, "apple" -> 1L, "grape" -> 2L, "pear" -> 3L)
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
val sorted = rdd.sortBy{case (key, _) => order.getOrElse(key, Long.MaxValue)}
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
There is a sortBy method in Spark which allows you to define an arbitrary ordering and whether you want ascending or descending. E.g.
scala> val rdd = sc.parallelize(Seq ( ("a", 1), ("z", 7), ("p", 3), ("a", 13) ))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[331] at parallelize at <console>:70
scala> rdd.sortBy( _._2, ascending = false) .collect.mkString("\n")
res34: String =
scala> rdd.sortBy( _._1, ascending = false) .collect.mkString("\n")
res35: String =
scala> rdd.sortBy
def sortBy[K](f: T => K, ascending: Boolean, numPartitions: Int)(implicit ord: scala.math.Ordering[K], ctag: scala.reflect.ClassTag[K]): RDD[T]
The last part tells you what the signature of sortBy is. The ordering used in previous examples is by the first and second part of the pair.
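For comparison, the same two descending sorts can be written against a local List with the pairs from the REPL session above. This is only a sketch: local collections have no ascending flag, so reversing the Ordering plays that role here, and List.sortBy is stable whereas Spark makes no promise about the relative order of equal keys.

```scala
object DescendingSortSketch {
  val pairs = List(("a", 1), ("z", 7), ("p", 3), ("a", 13))

  // Counterpart of rdd.sortBy(_._2, ascending = false)
  val byValueDesc: List[(String, Int)] = pairs.sortBy(_._2)(Ordering.Int.reverse)

  // Counterpart of rdd.sortBy(_._1, ascending = false)
  val byKeyDesc: List[(String, Int)] = pairs.sortBy(_._1)(Ordering.String.reverse)
}
```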
Edit: answered too quickly, without checking your question, sorry... Anyway, you would define your ordering like in your example:
def myord(fruit: String) = fruit match {
  case "oranges" => 1
  case "apples" => 2
  case "grapes" => 3
  case "pears" => 4
  case _ => 5
}
val rdd = sc.parallelize(Seq("apples", "oranges" , "pears", "grapes" , "other") )
Then, the result of ordering would be:
scala> rdd.sortBy[Int](myord, ascending = true).collect.mkString("\n")
res1: String =
I don't know about spark, but with pure Scala collections that would be
For example,
val l: List[String] = List("the", "big", "bang")
val sortedByFirstLetter = l.sortBy(_.head)
// List(big, bang, the)