I have an RDD curRdd of the form
res10: org.apache.spark.rdd.RDD[(scala.collection.immutable.Vector[(Int, Int)], Int)] = ShuffledRDD[102]
with curRdd.collect() producing the following result.
Array((Vector((5,2)),1), (Vector((1,1)),2), (Vector((1,1), (5,2)),2))
Here the key is a vector of pairs of ints and the value is a count.
Now, I want to convert it into another RDD of the same form RDD[(scala.collection.immutable.Vector[(Int, Int)], Int)] by percolating down the counts.
That is, (Vector((1,1), (5,2)),2) contributes its count of 2 to every key that is a subset of it, so (Vector((5,2)),1) becomes (Vector((5,2)),3).
For the example above, our new RDD will have
(Vector((5,2)),3), (Vector((1,1)),4), (Vector((1,1), (5,2)),2)
How do I achieve this? Kindly help.
First, you can introduce a subsets operation for Seq:
implicit class SubSetsOps[T](val elems: Seq[T]) extends AnyVal {
  def subsets: Vector[Seq[T]] = elems match {
    case Seq() => Vector(elems)
    case elem +: rest => {
      val recur = rest.subsets
      recur ++ recur.map(elem +: _)
    }
  }
}
The empty subset will always be the first element in the result vector, so you can omit it with .tail
Now your task is a fairly obvious map-reduce, which is flatMap followed by reduceByKey in terms of RDD:
val result = curRdd
  .flatMap { case (keys, count) => keys.subsets.tail.map(_ -> count) }
  .reduceByKey(_ + _)
This implementation could introduce new sets in the result. If you would like to keep only those that were present in the original collection, you can join the result with the original:
val result = curRdd
  .flatMap { case (keys, count) => keys.subsets.tail.map(_ -> count) }
  .reduceByKey(_ + _)
  .join(curRdd map identity[(Seq[(Int, Int)], Int)])
  .map { case (key, (v, _)) => (key, v) }
Note that the map identity step is needed to convert the key type from Vector[_] to Seq[_] in the original RDD. You can instead modify the SubSetsOps definition by substituting all occurrences of Seq[T] with Vector[T], or change the definition in the following (hardcode scala.collection) way:
// SeqLike and CanBuildFrom exist in Scala 2.12 and earlier; they were removed in 2.13
import scala.collection.SeqLike
import scala.collection.generic.CanBuildFrom
implicit class SubSetsOps[T, F[e] <: SeqLike[e, F[e]]](val elems: F[T]) extends AnyVal {
  def subsets(implicit cbf: CanBuildFrom[F[T], T, F[T]]): Vector[F[T]] = elems match {
    case Seq() => Vector(elems)
    case elem +: rest => {
      val recur = rest.subsets
      recur ++ recur.map(elem +: _)
    }
  }
}
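To check the logic without a cluster, the same flatMap / reduce-by-key / keep-original-keys pipeline can be sketched on plain Scala collections. This is only an illustration: PercolateSketch, subsets and percolate are names made up here, groupBy stands in for reduceByKey, and a Set lookup stands in for the join.

```scala
object PercolateSketch {
  // All sub-sequences of `elems`, order preserved; the empty one comes first.
  def subsets[T](elems: Vector[T]): Vector[Vector[T]] =
    elems.foldRight(Vector(Vector.empty[T])) { (elem, acc) =>
      acc ++ acc.map(elem +: _)
    }

  // Local stand-in for flatMap + reduceByKey + join on the RDD.
  def percolate(data: Seq[(Vector[(Int, Int)], Int)]): Map[Vector[(Int, Int)], Int] = {
    val originalKeys = data.map(_._1).toSet
    data
      .flatMap { case (key, count) => subsets(key).tail.map(_ -> count) } // drop the empty subset
      .groupBy(_._1)
      .collect { case (key, pairs) if originalKeys(key) => key -> pairs.map(_._2).sum }
  }
}
```

For the data in the question, PercolateSketch.percolate returns the counts 3, 4 and 2 for Vector((5,2)), Vector((1,1)) and Vector((1,1), (5,2)) respectively.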
I am still learning the basics of Scala, so please bear with me. Is there a way to use the fold method to print only the names beginning with "A"?
object Scala {
  val names: List[String] = List("Adam", "Mick", "Ann")
  def main(args: Array[String]): Unit = {
    println(names.foldLeft("my list of items starting with A: ")(_ + _))
  }
}
Have a look at the signature of foldLeft
def foldLeft[B](z: B)(op: (B, A) => B): B
z is the initial value
op is a function taking two arguments, namely accumulated result so far B, and the next element to be processed A
returns the accumulated result B
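To make the roles of z and op concrete, here is a minimal hand-written version for List. It is a sketch for illustration only, not the standard library's implementation, and myFoldLeft is a made-up name.

```scala
import scala.annotation.tailrec

object FoldSketch {
  // z is the result so far; op folds the next element into it.
  @tailrec
  def myFoldLeft[A, B](xs: List[A], z: B)(op: (B, A) => B): B = xs match {
    case Nil          => z                                  // nothing left: return the accumulator
    case head :: tail => myFoldLeft(tail, op(z, head))(op)  // fold head in, continue with the rest
  }
}
```

For example, FoldSketch.myFoldLeft(List(1, 2, 3), 0)(_ + _) walks through 0, 1, 3 and finally returns 6.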
Now consider this concrete implementation
val names: List[String] = List("Adam", "Mick", "Ann")
val predicate: String => Boolean = str => str.startsWith("A")
names.foldLeft(List.empty[String]) { (accumulated: List[String], next: String) =>
  if (predicate(next)) accumulated.prepended(next) else accumulated
}
z = List.empty[String]
op = (accumulated: List[String], next: String) => if (predicate(next)) accumulated.prepended(next) else accumulated
Usually we would write this inlined and rely on type inference so we do not have to write out full types all the time, so it becomes
names.foldLeft(List.empty[String]) { (acc, next) =>
  if (next.startsWith("A")) next :: acc else acc
}
// val res1: List[String] = List(Ann, Adam)
One of the key ideas when working with List is to always prepend an element instead of appending
names.foldLeft(List.empty[String]) { (accumulated: List[String], next: String) =>
  if (predicate(next)) accumulated.appended(next) else accumulated
}
because prepending is much more efficient. However note how this makes the accumulated result in reverse order, so
List(Ann, Adam)
instead of perhaps required
List(Adam, Ann)
so oftentimes we perform one last traversal by calling reverse like so
names.foldLeft(List.empty[String]) { (acc, next) =>
  if (next.startsWith("A")) next :: acc else acc
}.reverse
// val res1: List[String] = List(Adam, Ann)
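As a side note, two other ways to get the names in their original order, sketched with made-up value names: foldRight visits the elements from the right, so prepending already yields the original order, and filter expresses the same selection directly. mkString then builds the line the question wanted to print.

```scala
object StartsWithA {
  val names: List[String] = List("Adam", "Mick", "Ann")

  // foldRight processes "Ann" first and "Adam" last, so no reverse is needed.
  val viaFoldRight: List[String] =
    names.foldRight(List.empty[String]) { (next, acc) =>
      if (next.startsWith("A")) next :: acc else acc
    }

  // The same selection without an explicit fold.
  val viaFilter: List[String] = names.filter(_.startsWith("A"))

  // Prefix, separator, suffix.
  val line: String = viaFoldRight.mkString("my list of items starting with A: ", ", ", "")
}
```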
The answer from @Mario Galic is a good one and should be accepted. (It's the polite thing to do.)
Here's a slightly different way to filter for starts-with-A strings.
val names: List[String] = List("Adam", "Mick", "Ann")
println(names.foldLeft("my list of items starting with A: "){
  case (acc, s"A$nme") => acc + s"A$nme "
  case (acc, _ ) => acc
})
//output: "my list of items starting with A: Adam Ann " (note the trailing space)
I am looking for a way to join two lists of tuples in Scala that gives the same result as the join function in Apache Spark.
Given two lists of tuples such as:
val l1 = List((1,1),(1,2),(2,1),(2,2))
l1: List[(Int, Int)] = List((1,1), (1,2), (2,1), (2,2))
val l2 = List((1,(1,2)), (2,(2,3)))
l2: List[(Int, (Int, Int))] = List((1,(1,2)), (2,(2,3)))
What is the best way to join both lists by key to get the following result?
l3: List[(Int,(Int,(Int,Int)))] = List((1,(1,(1,2))),(1,(2,(1,2))),(2,(1,(2,3))),(2,(2,(2,3))))
You can use a for comprehension and take advantage of backticks in the pattern. A backquoted identifier matches only when the value equals the existing variable, so (`k`, v2) matches only the entries of the second list whose key equals the k taken from the first list.
val res = for {
  (k, v1) <- l1
  (`k`, v2) <- l2
} yield (k, (v1, v2))
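If the backtick pattern looks like magic, the comprehension can be written out roughly like this: a flatMap over the first list and a collect over the second, with the same backquoted pattern doing the filtering. This is a sketch of an equivalent expression, not the exact desugaring the compiler produces; the object and value names are made up.

```scala
object BacktickJoinSketch {
  val l1 = List((1, 1), (1, 2), (2, 1), (2, 2))
  val l2 = List((1, (1, 2)), (2, (2, 3)))

  // collect keeps only the entries of l2 whose key equals the current k.
  val joined: List[(Int, (Int, (Int, Int)))] =
    l1.flatMap { case (k, v1) =>
      l2.collect { case (`k`, v2) => (k, (v1, v2)) }
    }
}
```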
I hope you find this helpful.
You might want to do something like this:
val l3 = l1.map(tup1 => l2.filter(tup2 => tup1._1 == tup2._1).map(tup2 => (tup1._1, (tup1._2, tup2._2)))).flatten
It matches entries with the same key, creates a sublist for each element of l1, and then combines the list of lists with flatten.
This results in:
List((1,(1,(1,2))), (1,(2,(1,2))), (2,(1,(2,3))), (2,(2,(2,3))))
Try something like this:
val l2Map = l2.toMap
val l3 = l1.flatMap { case (k, v1) => l2Map.get(k).map(v2 => (k, (v1, v2))) }
which can be rewritten in a more general form using implicits:
package some.pkg // placeholder name; `package` itself is a reserved word and cannot be used unquoted
import scala.collection.TraversableLike
import scala.collection.generic.CanBuildFrom
package object collection {
  implicit class PairTraversable[K, V, C[A] <: TraversableLike[A, C[A]]](val seq: C[(K, V)]) {
    def join[V2, C2[A] <: TraversableLike[A, C2[A]]](other: C2[(K, V2)])
        (implicit canBuildFrom: CanBuildFrom[C[(K, V)], (K, (V, V2)), C[(K, (V, V2))]]): C[(K, (V, V2))] = {
      val otherMap = other.toMap
      seq.flatMap { case (k, v1) => otherMap.get(k).map(v2 => (k, (v1, v2))) }
    }
  }
}
and then simply:
import some.pkg.collection.PairTraversable
val l3 = l1.join(l2)
This solution converts the second sequence to a map (so it consumes some additional memory), but it is much faster than the solutions in the other answers (compare them on large collections, e.g. 10000 elements; on my laptop it is 5ms vs 2500ms).
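One caveat with toMap: it keeps only the last value for each key, so if the second list has duplicate keys you lose pairs that Spark's join would emit. A possible variant, sketched here for plain Lists only, indexes the second list with groupBy so every value per key survives while the lookup stays cheap. JoinSketch and its join method are made-up names.

```scala
object JoinSketch {
  def join[K, V, W](left: List[(K, V)], right: List[(K, W)]): List[(K, (V, W))] = {
    // key -> all values for that key, in their original order
    val index: Map[K, List[W]] =
      right.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
    for {
      (k, v) <- left
      w      <- index.getOrElse(k, Nil) // keys missing on the right yield nothing (inner join)
    } yield (k, (v, w))
  }
}
```

With l1 and l2 from the question this gives the same l3 as above; with duplicate keys on the right, each left element is paired with every matching value.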
A little late. This solution keeps every element of l1 and yields None for keys missing from l2 (a left join instead of an inner join).
val m2 = l2.map { case (k, v) => (k -> v) }.toMap
val res2 = l1.map { case (k, v) =>
  val v2 = m2.get(k)
  (k, (v, v2))
}
I'm extending Spark's AccumulableParam[mutable.HashMap[Int,Long], Int] with Scala for some experiments. Part of this is to define the method def addInPlace(t1: mutable.HashMap[Int,Long], t2: mutable.HashMap[Int,Long]): mutable.HashMap[Int,Long].
What I want to do:
import scala.collection.mutable
def addInPlace(t1: mutable.HashMap[Int,Long], t2: mutable.HashMap[Int,Long]):
    mutable.HashMap[Int,Long] = {
  t1 ++ t2.map { case (s, c) => (s, c + t1.getOrElse(s, 0L)) }
}
I get the error:
Expression of type mutable.Map[Int, Long] doesn't conform to selected type mutable.HashMap[Int, Long]
In this case the ++ operator returns Map instead of HashMap, even though both terms t1 and t2.map {...} are of type HashMap[Int, Long].
So my question is, how to make ++ return a HashMap instead, or how to convert the resulting Map to a HashMap.
An ugly way to do it is to use asInstanceOf[mutable.HashMap[Int, Long]] on the result map:
def addInPlace(t1: mutable.HashMap[Int,Long], t2: mutable.HashMap[Int,Long]): mutable.HashMap[Int,Long] = {
  val result = t1 ++ t2.map { case (s, c) => (s, c + t1.getOrElse(s, 0L)) }
  result.asInstanceOf[mutable.HashMap[Int, Long]]
}
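Another option, if mutating the first map is acceptable (the method name suggests it is, but that is an assumption about your use case), is to skip ++ entirely and update t1 directly. That way no new map is built and no cast is needed. A minimal sketch, independent of Spark, with MergeSketch as a made-up wrapper object:

```scala
import scala.collection.mutable

object MergeSketch {
  // Adds every count from t2 into t1 and returns t1 itself.
  def addInPlace(t1: mutable.HashMap[Int, Long],
                 t2: mutable.HashMap[Int, Long]): mutable.HashMap[Int, Long] = {
    t2.foreach { case (key, count) =>
      t1.update(key, t1.getOrElse(key, 0L) + count)
    }
    t1
  }
}
```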
I need to sort the keys in an RDD, but there is no natural sorting order (not ascending or descending). I wouldn't even know how to write a Comparator to do it. Say I had a map of apples, pears, oranges, and grapes, I want to sort by oranges, apples, grapes, and pears.
Any ideas on how to do this in Spark/Scala? Thanks!
In Scala, you need to look for the Ordering[T] trait rather than the Comparator interface -- mostly a cosmetic difference so that the focus is on the attribute of the data rather than a thing which compares two instances of the data. Implementing the trait requires that the compare(T,T) method be defined. A very explicit version of the enumerated comparison could be:
object fruitOrdering extends Ordering[String] {
  def compare(lhs: String, rhs: String): Int = (lhs, rhs) match {
    case ("orange", "orange") => 0
    case ("orange", _) => -1
    case ("apple", "orange") => 1
    case ("apple", "apple") => 0
    case ("apple", _) => -1
    case ("grape", "orange") => 1
    case ("grape", "apple") => 1
    case ("grape", "grape") => 0
    case ("grape", _) => -1
    case ("pear", "orange") => 1
    case ("pear", "apple") => 1
    case ("pear", "grape") => 1
    case ("pear", "pear") => 0
    case ("pear", _) => -1
    case _ => 0
  }
}
Or, to slightly adapt zero323's answer:
object fruitOrdering2 extends Ordering[String] {
  private val values = Seq("orange", "apple", "grape", "pear")
  // generate the map based off of indices so we don't have to worry about human error during updates
  private val ordinalMap = values.zipWithIndex.toMap.withDefaultValue(Int.MaxValue)
  def compare(lhs: String, rhs: String): Int = ordinalMap(lhs).compare(ordinalMap(rhs))
}
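An ordering like this can be sanity-checked on a plain Scala List before it goes anywhere near Spark. The sketch below rebuilds the same index-based idea with Ordering.by; FruitOrderingSketch and the extra "kiwi" entry are made up for the example.

```scala
object FruitOrderingSketch {
  private val rank: Map[String, Int] =
    Seq("orange", "apple", "grape", "pear").zipWithIndex.toMap

  // Unknown fruits get Int.MaxValue and therefore sort last.
  val fruitOrdering: Ordering[String] =
    Ordering.by((fruit: String) => rank.getOrElse(fruit, Int.MaxValue))

  val sortedFruits: List[String] =
    List("pear", "kiwi", "apple", "orange", "grape").sorted(fruitOrdering)
  // List(orange, apple, grape, pear, kiwi)
}
```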
Now that you have an instance of Ordering[String], you need to tell the sortBy method to use this ordering rather than the built-in one. If you look at the signature for RDD#sortBy you'll see the full signature is
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
That implicit Ordering[K] in the second parameter list is normally looked up by the compiler for pre-defined orderings -- that's how it knows what the natural ordering should be. Any implicit parameter, however, can be given an explicit value instead. Note that if you supply one implicit value then you need to supply all, so in this case we also need to provide the ClassTag[K]. That's always generated by the compiler but can be easily explicitly generated using scala.reflect.classTag.
Specifying all of that, the invocation would look like:
import scala.reflect.classTag
rdd.sortBy { case (key, _) => key }(fruitOrdering, classTag[String])
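The "supply one, supply all" rule is easier to see on a small non-Spark method with the same kind of second parameter list. sortedArray below is a hypothetical helper written only for this illustration.

```scala
import scala.reflect.{ClassTag, classTag}

object ExplicitImplicitsSketch {
  // Hypothetical helper shaped like sortBy's implicit parameter list.
  def sortedArray[K](xs: Seq[K])(implicit ord: Ordering[K], ctag: ClassTag[K]): Array[K] =
    xs.sorted(ord).toArray(ctag)

  // Both implicits resolved by the compiler: natural ordering.
  val implicitCall: Array[Int] = sortedArray(Seq(3, 1, 2))

  // Overriding the ordering means the ClassTag has to be passed as well.
  val explicitCall: Array[Int] = sortedArray(Seq(3, 1, 2))(Ordering[Int].reverse, classTag[Int])
}
```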
That's still pretty messy, though, isn't it? Luckily we can use implicit classes to take away a lot of the cruft. Here's a snippet that I use fairly commonly:
package com.example.spark
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
package object implicits {
  implicit class RichSortingRDD[A : ClassTag](underlying: RDD[A]) {
    def sorted(implicit ord: Ordering[A]): RDD[A] =
      underlying.sortBy(identity)(ord, implicitly[ClassTag[A]])
    def sortWith(fn: (A, A) => Int): RDD[A] = {
      val ord = new Ordering[A] { def compare(lhs: A, rhs: A): Int = fn(lhs, rhs) }
      sorted(ord)
    }
  }
  implicit class RichSortingPairRDD[K : ClassTag, V](underlying: RDD[(K, V)]) {
    def sortByKey(implicit ord: Ordering[K]): RDD[(K, V)] =
      underlying.sortBy { case (key, _) => key } (ord, implicitly[ClassTag[K]])
    def sortByKeyWith(fn: (K, K) => Int): RDD[(K, V)] = {
      val ord = new Ordering[K] { def compare(lhs: K, rhs: K): Int = fn(lhs, rhs) }
      sortByKey(ord)
    }
  }
}
And in action:
import com.example.spark.implicits._
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
rdd.sortByKey.collect // Natural ordering by default
// Array[(String, Double)] = Array((apple,5.0), (grape,0.3), (orange,5.6))
rdd.sortWith(_._2 compare _._2).collect // sort by the value instead
// Array[(String, Double)] = Array((grape,0.3), (apple,5.0), (orange,5.6))
If the only way you can describe the order is enumeration then simply enumerate:
val order = Map("orange" -> 0L, "apple" -> 1L, "grape" -> 2L, "pear" -> 3L)
val rdd = sc.parallelize(Seq(("grape", 0.3), ("apple", 5.0), ("orange", 5.6)))
val sorted = rdd.sortBy{case (key, _) => order.getOrElse(key, Long.MaxValue)}
// Array[(String, Double)] = Array((orange,5.6), (apple,5.0), (grape,0.3))
There is a sortBy method in Spark which allows you to define an arbitrary ordering and whether you want ascending or descending. E.g.
scala> val rdd = sc.parallelize(Seq ( ("a", 1), ("z", 7), ("p", 3), ("a", 13) ))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[331] at parallelize at <console>:70
scala> rdd.sortBy( _._2, ascending = false) .collect.mkString("\n")
res34: String =
scala> rdd.sortBy( _._1, ascending = false) .collect.mkString("\n")
res35: String =
scala> rdd.sortBy
def sortBy[K](f: T => K, ascending: Boolean, numPartitions: Int)(implicit ord: scala.math.Ordering[K], ctag: scala.reflect.ClassTag[K]): RDD[T]
The last part tells you what the signature of sortBy is. The ordering used in previous examples is by the first and second part of the pair.
Edit: answered too quickly, without checking your question, sorry... Anyway, you would define your ordering like in your example:
def myord(fruit: String) = fruit match {
  case "oranges" => 1
  case "apples" => 2
  case "grapes" => 3
  case "pears" => 4
  case _ => 5
}
val rdd = sc.parallelize(Seq("apples", "oranges" , "pears", "grapes" , "other") )
Then, the result of ordering would be:
scala> rdd.sortBy[Int](myord, ascending = true).collect.mkString("\n")
res1: String =
I don't know about spark, but with pure Scala collections that would be
For example,
val l: List[String] = List("the", "big", "bang")
val sortedByFirstLetter = l.sortBy(_.head)
// List(big, bang, the)
I have a Scala IndexedSeq[(Int, Future[Long])].
I would like to fill out this function:
def getMininumIfCountIsPositive(distances: IndexedSeq[(Int, Future[Long])]): Future[Option[Int]] = {
If there is no element where the Long is greater than 0, it should return a Future of None. If there are elements where the Long is greater than 0, it should return a Future of the minimum associated Int.
This is what I've got right now:
Future.sequence(distances.map {
  case (index, count) => count.map(index -> _)
}) map {
  s =>
    Option(s.filter(_._2 > 0).minBy(_._1)._1)
}
But, I don't know how to handle the case where there are no elements that pass the filter, or where Futures have failed.
Map your sequence of (Int, Future[Long]) to a sequence of Future[(Int, Long)]:
val futureOfSequence = a map { (b: (Int, Future[Long])) => b._2 map (c => (b._1, c)) }
Then use Future.sequence to convert that sequence of Future[(Int, Long)] to a Future[IndexedSeq[(Int, Long)]]:
val sequenceOfFuture = Future.sequence(futureOfSequence)
Now you can map that Future to your Future[Option[Int]]:
val finalResult = sequenceOfFuture map { (iSeq: IndexedSeq[(Int, Long)]) => /* your logic goes here */ }
Here is an efficient version, derived from the one in the question:
Future.traverse(distances) {
  case (index, count) => count.map(index -> _)
} map { _.foldLeft(None: Option[Int]) {
    case (a, (_, x)) if x <= 0 => a
    case (None, (i, _)) => Some(i)
    case (Some(ai), (i, _)) => Some(ai min i)
  }
}
Future.traverse lets us combine the Future.sequence and map operations together. The foldLeft combines all the logic from filter and minBy and produces the appropriate Option.
Both Future.traverse and Future.sequence produce a failed future if any of the futures they are built from fails, so you already have proper failure handling.
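For anyone who wants to try this end to end, here is a small self-contained sketch along the same lines. It swaps the foldLeft for collect plus reduceOption, which is a different formulation of the same logic, uses the global execution context, and blocks with Await only to show the results; all of those are choices made for the example. MinPositiveSketch and minIndexWithPositiveCount are made-up names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object MinPositiveSketch {
  def minIndexWithPositiveCount(distances: IndexedSeq[(Int, Future[Long])]): Future[Option[Int]] =
    Future
      .traverse(distances) { case (index, count) => count.map(index -> _) }
      .map { pairs =>
        pairs
          .collect { case (index, count) if count > 0 => index } // indices with a positive count
          .reduceOption(_ min _)                                  // None when nothing qualifies
      }

  // Blocking helper for the demo only.
  def run(distances: IndexedSeq[(Int, Future[Long])]): Try[Option[Int]] =
    Try(Await.result(minIndexWithPositiveCount(distances), 5.seconds))
}
```

A successful input yields Success(Some(index)) or Success(None); if any of the input futures fails, the combined future fails too and run returns a Failure.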
Rather long-winded..
def get(a: IndexedSeq[(Int, Future[Long])]): Future[Option[Int]] = {
  Future.sequence( // Convert the Seq[Future] to Future[Seq]
    a.map { case (index, f) =>
      f.map(l => (index, l)) // map each Future to be paired with its index
        .recover { case _: Throwable => (0, 0L) } // recover failed Futures as (0, 0) since they'll be thrown out anyway
    }
  ).map { seq =>
    seq.filter(_._2 > 0)     // Remove non-positives first
      .map(_._1)             // Take the indices
      .reduceOption(_ min _) // Minimum index, or None if nothing is left
  }
}
trait Test2 {
  import scala.concurrent.Future
  import scala.concurrent.Future.{traverse, successful}
  implicit def context: scala.concurrent.ExecutionContext
  def logic(in: IndexedSeq[(Int, Long)]): Option[Int]
  def getMininumIfCountIsPositive(a: IndexedSeq[(Int, Future[Long])]): Future[Option[Int]] = {
    traverse(a) { case (i, f) => successful(i).zip(f) } map(logic)
  }
}