Removing the Some in a left join RDD in Spark - Scala

I'm running a left join on a Spark RDD, but sometimes I get an output like this:
(k, (v, Some(w)))
or
(k, (v, None))
How do I make it give me back just
(k, (v, (w)))
or
(k, (v, ()))
Here is how I'm combining the two files:
def formatMap3(left: String = "", right: String = "")(m: String = "") = {
  val items = m.map(k => s"$k")
  s"$left$items$right"
}
val combPrdGrp = custPrdGrp3.leftOuterJoin(cmpgnPrdGrp3)
val combPrdGrp2 = combPrdGrp.groupByKey
val combPrdGrp3 = combPrdGrp2.map { case (n, list) =>
  val formattedPairs = list.map { case (a, b) => s"$a $b" }
  s"$n ${formattedPairs.mkString}"
}

If you're just interested in getting formatted output without the Somes/Nones, then something like this should work:
val combPrdGrp3 = combPrdGrp2.map { case (n, list) =>
  val formattedPairs = list.map {
    case (a, Some(b)) => s"$a $b"
    case (a, None)    => s"$a ()"
  }
  s"$n ${formattedPairs.mkString}"
}
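Equivalently, a minimal sketch under the same assumptions that collapses both cases with getOrElse (substituting the literal string "()" for missing values):
val combPrdGrp3 = combPrdGrp2.map { case (n, list) =>
  // None is rendered as the literal string "()" to match the desired output.
  val formattedPairs = list.map { case (a, b) => s"$a ${b.getOrElse("()")}" }
  s"$n ${formattedPairs.mkString}"
}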
If you have other uses in mind then you probably need to provide more details.

The leftOuterJoin() function in Spark returns tuples containing the join key, the left RDD's value, and an Option of the right RDD's value. To extract the value from the Option, simply call getOrElse() on the right-hand value in the resulting RDD. As an example:
scala> val rdd1 = sc.parallelize(Array(("k1", 4), ("k4", 7), ("k8", 10), ("k6", 1), ("k7", 4)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[13] at parallelize at <console>:21
scala> val rdd2 = sc.parallelize(Array(("k5", 4), ("k4", 3), ("k0", 2), ("k6", 5), ("k1", 6)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[14] at parallelize at <console>:21
scala> val rdd_join = rdd1.leftOuterJoin(rdd2).map { case (a, (b, c: Option[Int])) => (a, (b, (c.getOrElse()))) }
rdd_join: org.apache.spark.rdd.RDD[(String, (Int, AnyVal))] = MapPartitionsRDD[18] at map at <console>:25
scala> rdd_join.take(5).foreach(println)
...
(k4,(7,3))
(k6,(1,5))
(k7,(4,()))
(k8,(10,()))
(k1,(4,6))
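Note that calling getOrElse() with an empty argument list actually passes the Unit value () as the default, which is why the element type widens to AnyVal above. If you would rather keep everything as Int, a minimal sketch with an assumed sentinel default of 0 (hypothetical; pick whatever marker suits your data):
// Supplying a typed default keeps the value type as Int instead of AnyVal.
val rddJoinTyped = rdd1.leftOuterJoin(rdd2).map { case (a, (b, c)) => (a, (b, c.getOrElse(0))) }
// rddJoinTyped: org.apache.spark.rdd.RDD[(String, (Int, Int))]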

Related

Reverse a Map which has A Set as its value using HOF

I am trying to reverse a map that has a String as the key and a set of numbers as its value.
My goal is to create a list that contains a tuple of a number and a list of the strings that had that number in their value set.
I have this so far:
def flipMap(toFlip: Map[String, Set[Int]]): List[(Int, List[String])] = {
  toFlip.flatMap(_._2).map(x => (x, toFlip.keys.toList)).toList
}
but it only assigns every String to every Int.
val map = Map(
  "A" -> Set(1, 2),
  "B" -> Set(2, 3)
)
should produce:
List((1, List(A)), (2, List(A, B)), (3, List(B)))
but is producing:
List((1, List(A, B)), (2, List(A, B)), (3, List(A, B)))
This works too, but it's not exactly what you might need, and you may need some conversions to get the exact data type you want:
// Note: updatedWith requires Scala 2.13+.
toFlip.foldLeft(Map.empty[Int, Set[String]]) {
  case (acc, (key, numbersSet)) =>
    numbersSet.foldLeft(acc) { (updatingMap, newNumber) =>
      updatingMap.updatedWith(newNumber) {
        case Some(existingSet) => Some(existingSet + key)
        case None              => Some(Set(key))
      }
    }
}
I used Set to avoid duplicate key insertions in the inner List, and used Map instead of the outer List for better lookup.
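A quick usage sketch, assuming the fold above is wrapped in a hypothetical flip(toFlip: Map[String, Set[Int]]) helper:
flip(Map("A" -> Set(1, 2), "B" -> Set(2, 3)))
// Map(1 -> Set(A), 2 -> Set(A, B), 3 -> Set(B))
// Apply .view.mapValues(_.toList).toList if you need the List[(Int, List[String])] shape.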
You can do something like this:
def flipMap(toFlip: Map[String, Set[Int]]): List[(Int, List[String])] =
  toFlip
    .toList
    .flatMap { case (key, values) =>
      values.map(value => value -> key)
    }
    .groupMap(_._1)(_._2)
    .view
    .mapValues(_.distinct)
    .toList
Note, I personally would return a Map instead of a List
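For reference, a sketch of that Map-returning variant (the name flipMapToMap is made up here), still assuming Scala 2.13+ for groupMap:
def flipMapToMap(toFlip: Map[String, Set[Int]]): Map[Int, Set[String]] =
  toFlip
    .toList
    .flatMap { case (key, values) => values.map(value => value -> key) }
    .groupMap(_._1)(_._2)
    .view
    .mapValues(_.toSet)
    .toMap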
Or, if you have cats in scope:
import cats.implicits._ // provides combineAll

def flipMap(toFlip: Map[String, Set[Int]]): Map[Int, Set[String]] =
  toFlip.view.flatMap { case (key, values) =>
    values.map(value => Map(value -> Set(key)))
  }.toList.combineAll
// both Scala 2 & Scala 3
scala> map.flatten { case (k, s) => s.map(v => (k, v)) }.groupMapReduce { case (k, v) => v } { case (k, v) => List(k) } { _ ++ _ }
val res0: Map[Int, List[String]] = Map(1 -> List(A), 2 -> List(A, B), 3 -> List(B))
// Scala 3 only
scala> map.flatten((k, s) => s.map(v => (k, v))).groupMapReduce((k, v) => v)((k, v) => List(k))(_ ++ _)
val res1: Map[Int, List[String]] = Map(1 -> List(A), 2 -> List(A, B), 3 -> List(B))

Transforming one record into multiple records

If the format of the input is
(x1,(a,b,c,List(key1, key2)))
(x2,(a,b,c,List(key3)))
and I would like to achieve this output
(key1,(a,b,c,x1))
(key2,(a,b,c,x1))
(key3,(a,b,c,x2))
Here is the code:
var hashtags = joined_d.map(x => (x._1, (x._2._1._1, x._2._2, x._2._1._4, getHashTags(x._2._1._4))))
var hashtags_keys = hashtags.map(x =>
  if (x._2._4.size == 0) (x._1, (x._2._1, x._2._2, x._2._3, 0))
  else x._2._4.map(y => (y, (x._2._1, x._2._2, x._2._3, 1))))
The function getHashTags() returns a list. If the list is not empty, we want to use each element in the list as the new key. How should I work around this issue?
With rdd created as:
val rdd = sc.parallelize(
  Seq(
    ("x1", ("a", "b", "c", List("key1", "key2"))),
    ("x2", ("a", "b", "c", List("key3")))
  )
)
You can use flatMap like this:
rdd.flatMap{ case (x, (a, b, c, list)) => list.map(k => (k, (a, b, c, x))) }.collect
// res12: Array[(String, (String, String, String, String))] =
// Array((key1,(a,b,c,x1)),
// (key2,(a,b,c,x1)),
// (key3,(a,b,c,x2)))
Here's one way to do it:
val rdd = sc.parallelize(Seq(
  ("x1", ("a", "b", "c", List("key1", "key2"))),
  ("x2", ("a", "b", "c", List("key3")))
))

val rdd2 = rdd.flatMap {
  case (x, (a, b, c, l)) => l.map((_, (a, b, c, x)))
}

rdd2.collect
// res1: Array[(String, (String, String, String, String))] = Array((key1,(a,b,c,x1)), (key2,(a,b,c,x1)), (key3,(a,b,c,x2)))
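One caveat for both answers: flatMap silently drops any record whose key list is empty. If those records must be kept, as the 0/1 flag in the question's code suggests, a sketch using an assumed sentinel key "<none>" (hypothetical) is:
val rdd3 = rdd.flatMap {
  case (x, (a, b, c, Nil))  => Seq(("<none>", (a, b, c, x))) // keep empty-list records under a sentinel key
  case (x, (a, b, c, list)) => list.map(k => (k, (a, b, c, x)))
}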

Spark/Scala: Expand a list of (List[String], String) tuples

Basically, this question is only about Scala.
How can I do the following transformation given an RDD with elements of the form
(List[String], String) => (String, String)
e.g.
([A,B,C], X)
([C,D,E], Y)
to
(A, X)
(B, X)
(C, X)
(C, Y)
(D, Y)
(E, Y)
So
scala> val l = List(List('a, 'b, 'c) -> 'x, List('c, 'd, 'e) -> 'y)
l: List[(List[Symbol], Symbol)] = List((List('a, 'b, 'c),'x), (List('c, 'd, 'e),'y))
scala> l.flatMap { case (innerList, c) => innerList.map(_ -> c) }
res0: List[(Symbol, Symbol)] = List(('a,'x), ('b,'x), ('c,'x), ('c,'y),
('d,'y), ('e,'y))
With Spark you can solve your problem with:
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    val input = Seq((List("A", "B", "C"), "X"), (List("C", "D", "E"), "Y"))

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(input)
    val result = rdd.flatMap {
      case (list, label) =>
        list.map((_, label))
    }

    result.foreach(println)
  }
}
This will output:
(C,Y)
(D,Y)
(A,X)
(B,X)
(E,Y)
(C,X)
I think that the RDD flatMapValues suits this case best.
val input = List((List("A", "B", "C"), "X"), (List("C", "D", "E"), "Y"))
val rdd = sc.parallelize(input)
val output = rdd.map(x => (x._2, x._1)).flatMapValues(x => x)
which pairs each label with every value in its list, resulting in an RDD of pairs (X,A), (X,B), (X,C), (Y,C), (Y,D), (Y,E). Note the key and value are swapped relative to the requested output; add another map(_.swap) if you need (A,X) instead.
val l = (List(1, 2, 3), "A")
val result = l._1.map((_, l._2))
println(result)
Will give you:
List((1,A), (2,A), (3,A))
Using beautiful for comprehensions and making the parameters generic
def convert[F, S](input: (List[F], S)): List[(F, S)] = {
  for {
    x <- input._1
  } yield {
    (x, input._2)
  }
}
A sample call:
convert((List(1, 2, 3), "A"))
will give you
List((1,A), (2,A), (3,A))

Groupby like Python's itertools.groupby

In Python I'm able to group consecutive elements with the same key by using itertools.groupby:
>>> items = [(1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4)]
>>> import itertools
>>> list(key for key,it in itertools.groupby(items, lambda tup: tup[0]))
[1, 2, 3, 1]
Scala has groupBy as well, but it produces different result - a map pointing from key to all the values found in the iterable with the specified key (not the consecutive runs with the same key):
scala> val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
items: List[(Int, Int)] = List((1,2), (1,5), (1,3), (2,9), (3,7), (1,5), (1,4))
scala> items.groupBy {case (key, value) => key}
res0: scala.collection.immutable.Map[Int,List[(Int, Int)]] = Map(2 -> List((2,9)), 1 -> List((1,2), (1,5), (1,3), (1,5), (1,4)), 3 -> List((3,7)))
What is the most eloquent way of achieving the same as with Python itertools.groupby?
If you just want to throw out sequential duplicates, you can do something like this:
def unchain[A](items: Seq[A]) = if (items.isEmpty) items else {
  items.head +: (items zip items.drop(1)).collect { case (l, r) if r != l => r }
}
That is, just compare the list to a version of itself shifted by one place, and only keep the items which are different. It's easy to add a same: (A, A) => Boolean parameter to the method and use !same(l, r) if you want custom behavior for what counts as the same (e.g. compare just by key).
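For example, a usage sketch against the question's data:
val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
unchain(items.map(_._1))
// List(1, 2, 3, 1)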
If you want to keep the duplicates, you can use Scala's groupBy to get a very compact (but inefficient) solution:
def groupSequential[A](items: Seq[A])(same: (A, A) => Boolean) = {
  val ns = (items zip items.drop(1))
    .scanLeft(0) { (n, cc) => if (same(cc._1, cc._2)) n else n + 1 } // bump the group index when the run breaks
  (ns zip items).groupBy(_._1).toSeq.sortBy(_._1).map(_._2.map(_._2))
}
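A usage sketch, comparing just by key as suggested:
groupSequential(items)((a, b) => a._1 == b._1).map(_.head._1)
// List(1, 2, 3, 1)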
Using List.span, like this
def keyMultiSpan(l: List[(Int, Int)]): List[List[(Int, Int)]] = l match {
  case Nil => List()
  case h :: t =>
    val ms = l.span(_._1 == h._1)
    ms._1 :: keyMultiSpan(ms._2)
}
Hence let
val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
and so
keyMultiSpan(items).map { _.head._1 }
res: List(1, 2, 3, 1)
Update
A more readable syntax as suggested by @Paul, an implicit class for possibly neater usage, and type parameterisation for generality:
implicit class RichSpan[A, B](val l: List[(A, B)]) extends AnyVal {
  def keyMultiSpan(): List[List[(A, B)]] = l match {
    case Nil => List()
    case h :: t =>
      val (f, r) = l.span(_._1 == h._1)
      f :: r.keyMultiSpan()
  }
}
Thus, use it as follows,
items.keyMultiSpan.map { _.head._1 }
res: List(1, 2, 3, 1)
Here is a succinct but inefficient solution:
def pythonGroupBy[T, U](items: Seq[T])(f: T => U): List[List[T]] = {
  items.foldLeft(List[List[T]]()) {
    case (Nil, x) => List(List(x))
    case (g :: gs, x) if f(g.head) == f(x) => (x :: g) :: gs
    case (gs, x) => List(x) :: gs
  }.map(_.reverse).reverse
}
And here is a better one, that only invokes f on each element once:
def pythonGroupBy2[T, U](items: Seq[T])(f: T => U): List[List[T]] = {
  if (items.isEmpty)
    List(List())
  else {
    val state = (List(List(items.head)), f(items.head))
    items.tail.foldLeft(state) { (state, x) =>
      val groupByX = f(x)
      state match {
        case (g :: gs, groupBy) if groupBy == groupByX => ((x :: g) :: gs, groupBy)
        case (gs, _) => (List(x) :: gs, groupByX)
      }
    }._1.map(_.reverse).reverse
  }
}
Both solutions fold over items, building up a list of groups as they go. pythonGroupBy2 also keeps track of the value of f for the current group. At the end, we have to reverse each group and the list of groups in order to get the correct order.
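For example, against the question's items:
val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
pythonGroupBy(items)(_._1).map(_.head._1)
// List(1, 2, 3, 1)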
Try:
val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
val res = compress(items.map(_._1))
// res: List[Int] = List(1, 2, 3, 1)
/** Eliminate consecutive duplicates of list elements. */
def compress[T](l: List[T]): List[T] = l match {
  case head :: next :: tail if (head == next) => compress(next :: tail)
  case head :: tail => head :: compress(tail)
  case Nil => List()
}
/** Tail-recursive version. */
def compress[T](input: List[T]): List[T] = {
  def comp(remaining: List[T], l: List[T], last: Any): List[T] = {
    remaining match {
      case Nil => l
      case head :: tail if head == last => comp(tail, l, head)
      case head :: tail => comp(tail, head :: l, head)
    }
  }
  comp(input, Nil, Nil).reverse
}
Here, compress is the solution to one of the 99 Problems in Scala.
Hmm, I couldn't find something out of the box, but this will do it:
def groupz[T](list: List[T]): List[T] = {
  list match {
    case Nil => Nil
    case x :: Nil => List(x)
    case x :: xs if (x == xs.head) => groupz(xs)
    case x :: xs => x :: groupz(xs)
  }
}

// now let's add this functionality to the List class
implicit def addPythonicGroupToList[T](list: List[T]) = new { def pythonGroup = groupz(list) }
and now you can do:
val items = List((1, 2), (1, 5), (1, 3), (2, 9), (3, 7), (1, 5), (1, 4))
items.map(_._1).pythonGroup
res1: List[Int] = List(1, 2, 3, 1)
Here is a simple solution that I used for a problem I stumbled on at work. In this case I didn't care too much about space, so I did not worry about efficient iterators. I used an ArrayBuffer to accumulate the results.
(Don't use this with enormous amounts of data.)
Sequential GroupBy
import scala.collection.mutable.ArrayBuffer

object Main {

  /** Returns consecutive keys and groups from the iterable. */
  def sequentialGroupBy[A, K](items: Seq[A], f: A => K): ArrayBuffer[(K, ArrayBuffer[A])] = {
    val result = ArrayBuffer[(K, ArrayBuffer[A])]()

    if (items.nonEmpty) {
      // Iterate, keeping track of when the key changes value.
      var bufKey: K = f(items.head)
      var buf: ArrayBuffer[A] = ArrayBuffer()

      for (elem <- items) {
        val key = f(elem)
        if (key == bufKey) {
          buf += elem
        } else {
          val group: (K, ArrayBuffer[A]) = (bufKey, buf)
          result += group
          bufKey = key
          buf = ArrayBuffer(elem)
        }
      }

      // Append last group.
      val group: (K, ArrayBuffer[A]) = (bufKey, buf)
      result += group
    }

    result
  }

  def main(args: Array[String]): Unit = {
    println("\nExample 1:")
    sequentialGroupBy[Int, Int](
      Seq(1, 4, 5, 7, 9, 8, 16),
      x => x % 2
    ).foreach(println)

    println("\nExample 2:")
    sequentialGroupBy[String, Boolean](
      Seq("pi", "nu", "rho", "alpha", "xi"),
      x => x.length > 2
    ).foreach(println)
  }
}
Running the above code results in the following:
Example 1:
(1,ArrayBuffer(1))
(0,ArrayBuffer(4))
(1,ArrayBuffer(5, 7, 9))
(0,ArrayBuffer(8, 16))
Example 2:
(false,ArrayBuffer(pi, nu))
(true,ArrayBuffer(rho, alpha))
(false,ArrayBuffer(xi))

What's the idiomatic way to map producing 0 or 1 results per entry?

What's the idiomatic way to call map over a collection producing 0 or 1 result per entry?
Suppose I have:
val data = Array("A", "x:y", "d:e")
What I'd like as a result is:
val target = Array(("x", "y"), ("d", "e"))
(drop anything without a colon, split on colon and return tuples)
So in theory I think I want to do something like:
val attempt1 = data.map { arg =>
  arg.split(":", 2) match {
    case Array(l, r) => (l, r)
    case _ => (None, None)
  }
}.filter(_._1 != None)
What I'd like to do is avoid the need for the catch-all case and get rid of the filter.
I could do this by pre-filtering (but then I have to examine each string twice):
val attempt2 = data.filter(arg => arg.contains(":")).map { arg =>
  val Array(l, r) = arg.split(":", 2)
  (l, r)
}
Last, I could use Some/None and flatMap, which does get rid of the need to filter, but is it what most Scala programmers would expect?
val attempt3 = data.flatMap { arg =>
  arg.split(":", 2) match {
    case Array(l, r) => Some((l, r))
    case _ => None
  }
}
It seems to me like there'd be an idiomatic way to do this in Scala, is there?
With a Regex extractor and collect :-)
scala> val R = "(.+):(.+)".r
R: scala.util.matching.Regex = (.+):(.+)
scala> Array("A", "x:y", "d:e") collect {
| case R(a, b) => (a, b)
| }
res0: Array[(String, String)] = Array((x,y), (d,e))
Edit:
If you want a map, you can do:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) }.toMap
x: Map[String,String] = Map(x -> y, d -> e)
If performance is a concern, you can use collection.breakOut (available up to Scala 2.12) as shown below to avoid the creation of an intermediate array:
scala> val x: Map[String, String] = Array("A", "x:y", "d:e").collect { case R(a, b) => (a, b) } (collection.breakOut)
x: Map[String,String] = Map(x -> y, d -> e)
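On Scala 2.13 and later, where collection.breakOut was removed, a sketch of the equivalent using a view to avoid materialising the intermediate collection:
val x: Map[String, String] =
  Array("A", "x:y", "d:e").view.collect { case R(a, b) => (a, b) }.toMap
// Map(x -> y, d -> e)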