Flatten value in paired RDD in spark - scala

I have a paired RDD that looks like
(a1, (a2, a3))
(b1, (b2, b3))
...
I want to flatten the values to obtain
(a1, a2, a3)
(b1, b2, b3)
...
Currently I'm doing
rddData.map(x => (x._1, x._2._1, x._2._2))
Is there a better way of performing the conversion? The above solution gets ugly if the value contains many elements instead of just two.

When I'm trying to avoid all the ugly underscore number stuff that comes with tuple manipulation I like to use case notation:
rddData.map { case (a, (b, c)) => (a, b, c) }
You can also give your variables meaningful names to make your code self-documenting, and the curly braces mean you have fewer nested parentheses.
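As a small illustration of the point about names and nesting, here is a sketch on a plain List standing in for the RDD (the record shape and the names userId, city, age and score are made up for the example). RDD.map should accept the same function literal.

```scala
object NamedPatternDemo {
  // Hypothetical nested records: (key, ((v1, v2), v3)); a List stands in for the RDD.
  val records: List[(String, ((String, Int), Double))] =
    List(("u1", (("Oslo", 31), 0.5)), ("u2", (("Lima", 27), 0.9)))

  // One pattern names every field and flattens the nesting in a single step.
  val flat: List[(String, String, Int, Double)] =
    records.map { case (userId, ((city, age), score)) => (userId, city, age, score) }
}
```

The pattern has to mirror the nesting exactly, which is why this only works when the structure is known at compile time.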
EDIT:
The map { case ... } pattern is pretty compact and can be used for surprisingly deeply nested tuples as long as the structure is known at compile time. If you absolutely, positively cannot know the structure of the tuple at compile time, then here is some hacky, slow code that can probably flatten any arbitrarily nested tuple, as long as there are no more than 22 elements in total (Scala's largest tuple class is Tuple22). It works by recursively converting each element of the tuple to a list, flatMap-ing the results into a single list, then using scary reflection to convert the list back into a tuple, as seen here.
// Recursively expand every nested Product (tuples, but also case classes) into one flat list.
def flatten(b: Product): List[Any] =
  b.productIterator.toList.flatMap {
    case x: Product => flatten(x)
    case y          => List(y)
  }
// Build scala.TupleN reflectively from a list of N elements (N must be between 1 and 22).
def toTuple(as: List[Any]): Product = {
  val tupleClass = Class.forName("scala.Tuple" + as.size)
  tupleClass.getConstructors.apply(0)
    .newInstance(as.map(_.asInstanceOf[AnyRef]): _*)
    .asInstanceOf[Product]
}
rddData.map(t => toTuple(flatten(t)))
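To try the reflective approach without a Spark context, here is a sketch that repeats the two helpers inside an object and runs them on one hand-made nested tuple (the object name and sample value are only for illustration). One caveat worth knowing: anything that is a Product gets expanded, and that includes case classes, Option and non-empty List values, not just tuples.

```scala
object FlattenTupleDemo {
  // Recursively expand anything that is a Product into a flat list of leaves.
  def flatten(p: Product): List[Any] =
    p.productIterator.toList.flatMap {
      case nested: Product => flatten(nested)
      case leaf            => List(leaf)
    }

  // Look up scala.TupleN by arity and call its constructor reflectively.
  def toTuple(as: List[Any]): Product = {
    val tupleClass = Class.forName("scala.Tuple" + as.size)
    tupleClass.getConstructors.apply(0)
      .newInstance(as.map(_.asInstanceOf[AnyRef]): _*)
      .asInstanceOf[Product]
  }

  val nested = ("a1", ("a2", (3, 4.0)))
  // Statically this is only a Product; at runtime it is a Tuple4.
  val flat: Product = toTuple(flatten(nested))
}
```

Note that the static type is lost: the result is a Product, so downstream code has to cast or pattern match to get the fields back.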

There is no better way. The 1st answer is equivalent to:
val abc2 = xyz.map{ case (k, v) => (k, v._1, v._2) }
which is equivalent to your own example.

Related

Fold method using List as accumulator

To find the prime factors of a number I was using this piece of code:
def primeFactors(num: Long): List[Long] = {
  val exists = (2L to math.sqrt(num).toLong).find(num % _ == 0)
  exists match {
    case Some(d) => d :: primeFactors(num / d)
    case None    => List(num)
  }
}
but then I found a cool and more functional approach to solve this using this code:
def factors(n: Long): List[Long] = (2 to math.sqrt(n).toInt)
  .find(n % _ == 0).fold(List(n))(i => i.toLong :: factors(n / i))
Earlier I was using foldLeft or fold simply to get the sum of a list or for other simple calculations, but here I can't seem to understand how fold works and how it breaks out of the recursive function. Can somebody please explain how fold works here?
Option's fold
If you look at the signature of Option's fold function, it takes two parameters:
def fold[B](ifEmpty: => B)(f: A => B): B
It applies f to the value of the Option if it is not empty. If the Option is empty, it simply returns the result of ifEmpty (this is the termination condition for the recursion).
So in your case, i => i.toLong :: factors(n / i) is f, which is evaluated if the Option is not empty, while List(n) is the termination condition.
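To see the two branches in isolation, here is a minimal sketch (object name and sample values are just for illustration). It applies fold to a non-empty and an empty Option, then repeats the factors function from the question so the recursion can be checked on a concrete number.

```scala
object OptionFoldDemo {
  // Non-empty: f is applied to the wrapped value.
  val whenDefined: Int = Some(20).fold(0)(_ + 1)          // 21
  // Empty: f is never called, ifEmpty is returned.
  val whenEmpty: Int = Option.empty[Int].fold(0)(_ + 1)   // 0

  // factors from the question: find returns None once n has no divisor
  // up to sqrt(n), and that None is what ends the recursion with List(n).
  def factors(n: Long): List[Long] =
    (2 to math.sqrt(n.toDouble).toInt)
      .find(n % _ == 0)
      .fold(List(n))(i => i.toLong :: factors(n / i))

  val factorsOf12: List[Long] = factors(12)               // List(2, 2, 3)
}
```

For 12 the calls go factors(12), factors(6), factors(3); the last one searches an empty range, gets None, and returns List(3).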
fold used for collection / iterators
The other fold that you are talking about, for getting the sum of a collection, comes from TraversableOnce and has a signature like:
def foldLeft[B](z: B)(op: (B, A) => B): B
Here, z is the starting value (for a sum it is 0) and op is a binary operator that is applied to the accumulated result and each element of the collection, from left to right.
So both folds differ in their implementation.
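For comparison with the Option version, here is a sketch of the collection foldLeft on a small list (the values are arbitrary). scanLeft is used only to make the intermediate accumulator values visible.

```scala
object FoldLeftSumDemo {
  val xs = List(1, 2, 3, 4)

  // z = 0 is the starting value; op adds each element to the running total.
  val sum: Int = xs.foldLeft(0)((acc, x) => acc + x)            // ((((0+1)+2)+3)+4) = 10

  // scanLeft exposes the intermediate accumulator values of the same fold.
  val steps: List[Int] = xs.scanLeft(0)((acc, x) => acc + x)    // List(0, 1, 3, 6, 10)

  // Subtraction makes the strict left-to-right order visible: (((0-1)-2)-3)-4.
  val diff: Int = xs.foldLeft(0)((acc, x) => acc - x)           // -10
}
```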

Why does Scala require pattern variables to be linear?

Scala requires pattern variables to be linear, i.e. a pattern variable may not occur more than once in a pattern. Thus, this example does not compile:
def tupleTest(tuple: (Int, Int)) = tuple match {
  case (a, a) => a
  case _      => -1
}
But you can use two pattern variables and a guard to check equality instead:
def tupleTest(tuple: (Int, Int)) = tuple match {
  case (a, b) if a == b => a
  case _                => -1
}
So why does Scala require pattern variables to be linear? Are there any cases that can not be transformed like this?
Edit
It is easy to transform the first example into the second (Scala to Scala). Of all occurrences of a variable v in the pattern, take the one that is evaluated first and bind it to v. For each other occurrence, introduce a new variable with a name that is not used in the current scope. For each of those variables v', add a guard v == v'. This is what a programmer would write by hand (so the efficiency is the same). Is there any problem with this approach? I'd like to see an example that cannot be transformed like this.
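Applied by hand to a variable that would occur three times, the transformation described above might look like the following sketch (allSame and the fresh names a1 and a2 are made up for the example):

```scala
object LinearizePatternDemo {
  // Hypothetical non-linear pattern one might want to write: case (a, a, a) => a
  // Rewritten as described: keep the first occurrence, rename the others,
  // and add one equality guard per renamed variable.
  def allSame(t: (Int, Int, Int)): Int = t match {
    case (a, a1, a2) if a == a1 && a == a2 => a
    case _                                  => -1
  }
}
```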
Because case (a, b) is basically assigning _._1 to val a and _._2 to val b (at least you can view it like that). In the case of case (a, a), you cannot assign both _._1 and _._2 to val a.
Actually, what you want to do would have looked like
case (a, `a`) => ???
as Scala uses backticks to match against an existing identifier. Unfortunately that still doesn't work, because a only becomes visible after the => (it would have been fun though; I also hate writing case (a, b) if a == b =>). The reason is probably just that it is harder to write a compiler that supports it.
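For reference, this is how backticks behave when the identifier is already in scope before the match. A minimal sketch, with expected and check as made-up names:

```scala
object StableIdentifierDemo {
  val expected = 7   // hypothetical value already in scope before the match

  def check(t: (Int, Int)): String = t match {
    // `expected` is compared with ==; it does not bind a new variable.
    case (`expected`, other) => s"first is $expected, second is $other"
    // Without backticks a lowercase name is a fresh binding that matches anything.
    case (anything, _)       => s"first is $anything"
  }
}
```

The difference from the (a, `a`) attempt is only scoping: here the compared value exists outside the pattern.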

Scala - Create a tuple from another

I am trying to find out how, using the map operator, I can create a tuple with a different number of fields from an existing one.
Namely, if I have a tuple in the form (String, Int1, Int2) I want to create a tuple with 2 fields in the form of (String, Int1 + Int2), where the first field will be the same as in the original tuple and the second field will be the addition of the 2nd and the 3rd field of the original tuple.
Using pattern matching:
tuple match { case (a, b, c) => (a, b + c) }
map is not a member of a tuple. If the tuple is the element type of a collection, map can be used on the collection:
collection map { case (a, b, c) => (a, b + c) }
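Both forms side by side on some made-up sample data (rows, one and all are names chosen for this sketch):

```scala
object TupleSumDemo {
  // Hypothetical sample data of type (String, Int, Int).
  val rows = List(("x", 1, 2), ("y", 10, 20))

  // Single tuple: destructure it with match.
  val one: (String, Int) = rows.head match { case (s, i1, i2) => (s, i1 + i2) }

  // Collection of tuples: the same case clause is passed to map.
  val all: List[(String, Int)] = rows.map { case (s, i1, i2) => (s, i1 + i2) }
}
```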
I suggest using Shyamendra Solanki's answer; however, for the sake of completeness I wanted to point out that Shapeless provides map and flatMap methods for tuples.
For a single tuple of type (String,Int,Int), in addition to extracting values with pattern matching as mentioned by @Shyamendra Solanki, note the accessor methods _1, _2 and _3; hence
def create(t : (String,Int,Int)) = (t._1, t._2+t._3)
is also a feasible approach, though perhaps not so readable.
For a given collection of tuples, consider also for comprehensions where
def create(xs : Seq[(String,Int,Int)]) = for ((s,i1,i2) <- xs) yield (s, i1+i2 )
conveys the desired semantics.

for-expression to flatMap Conversion

The following for-expression seems intuitive to me. Take each item in List(1), then map over List("a"), and then return a List[(Int, String)].
scala> val x = for {
| a <- List(1)
| b <- List("a")
| } yield (a,b)
x: List[(Int, String)] = List((1,a))
Now, converting it to a flatMap, it seems less clear to me. If I understand correctly, I need to call flatMap first since I'm taking the initial List(1), and then applying a function to convert from A => List[B].
scala> List(1).flatMap(a => List("a").map(b => (a,b) ))
res0: List[(Int, String)] = List((1,a))
After using the flatMap, it seemed necessary to use a map since I needed to go from A => B.
But, as the number of items increases in the for-expression (say 2 to 3 items), how do I know whether to use a map or flatMap when converting from for-expression to flatMap?
When desugaring a for comprehension, you flatMap every generator except the last one, which you map. So if you have three items:
for {
  a <- List("a")
  b <- List("b")
  c <- List("c")
} yield (a, b, c)
It would be the same as:
List("a").flatMap(a => List("b").flatMap(b => List("c").map(c => (a, b, c))))
If you look at the signature of flatMap, its argument is a function A => M[B]. So as we add generators to the for comprehension we need to flatMap them in, since each one contributes another M[B]. When we get to the last generator there is nothing left to add, so we use map, since we just want to go from A => B. Hope that makes sense; if not, you should watch some of the videos in the Reactive Programming class on Coursera, as they go over this quite a bit.
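A quick way to convince yourself is to write both forms and compare them. The sketch below uses arbitrary sample lists with more than one element each, so the nesting order is visible in the result:

```scala
object ForDesugarDemo {
  val viaFor: List[(Int, String, Boolean)] =
    for {
      a <- List(1, 2)
      b <- List("a", "b")
      c <- List(true)
    } yield (a, b, c)

  // Every generator but the last becomes flatMap; the last one becomes map.
  val viaFlatMap: List[(Int, String, Boolean)] =
    List(1, 2).flatMap(a => List("a", "b").flatMap(b => List(true).map(c => (a, b, c))))
}
```

Both values come out as List((1,a,true), (1,b,true), (2,a,true), (2,b,true)).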

difference between foldLeft and reduceLeft in Scala

I have learned the basic difference between foldLeft and reduceLeft
foldLeft:
initial value has to be passed
reduceLeft:
takes first element of the collection as initial value
throws exception if collection is empty
Is there any other difference?
Is there a specific reason to have two methods with similar functionality?
A few things to mention here before giving the actual answer:
Your question doesn't have anything to do with left, it's rather about the difference between reducing and folding
The difference is not the implementation at all, just look at the signatures.
The question doesn't have anything to do with Scala in particular, it's rather about the two concepts of functional programming.
Back to your question:
Here is the signature of foldLeft (could also have been foldRight for the point I'm going to make):
def foldLeft [B] (z: B)(f: (B, A) => B): B
And here is the signature of reduceLeft (again the direction doesn't matter here)
def reduceLeft [B >: A] (f: (B, A) => B): B
These two look very similar and thus caused the confusion. reduceLeft is a special case of foldLeft (which by the way means that you sometimes can express the same thing by using either of them).
When you call reduceLeft say on a List[Int] it will literally reduce the whole list of integers into a single value, which is going to be of type Int (or a supertype of Int, hence [B >: A]).
When you call foldLeft say on a List[Int] it will fold the whole list (imagine rolling a piece of paper) into a single value, but this value doesn't have to be even related to Int (hence [B]).
Here is an example:
def listWithSum(numbers: List[Int]) = numbers.foldLeft((List.empty[Int], 0)) {
  (resultingTuple, currentInteger) =>
    (currentInteger :: resultingTuple._1, currentInteger + resultingTuple._2)
}
This method takes a List[Int] and returns a Tuple2[List[Int], Int], i.e. (List[Int], Int). It calculates the sum and returns a tuple containing the list of integers and its sum. By the way, the list is returned backwards, because we used foldLeft instead of foldRight.
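To make the remark about the reversed list concrete, here is a sketch that runs the method next to a foldRight variant (listWithSumOrdered is a name made up for this comparison):

```scala
object ListWithSumDemo {
  def listWithSum(numbers: List[Int]): (List[Int], Int) =
    numbers.foldLeft((List.empty[Int], 0)) { (acc, n) => (n :: acc._1, n + acc._2) }

  // foldRight visits elements from the right, so prepending keeps the original order.
  def listWithSumOrdered(numbers: List[Int]): (List[Int], Int) =
    numbers.foldRight((List.empty[Int], 0)) { (n, acc) => (n :: acc._1, n + acc._2) }

  val reversed = listWithSum(List(1, 2, 3))          // (List(3, 2, 1), 6)
  val ordered  = listWithSumOrdered(List(1, 2, 3))   // (List(1, 2, 3), 6)
}
```

Note the argument order of the function flips between the two: (accumulator, element) for foldLeft and (element, accumulator) for foldRight.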
Watch One Fold to rule them all for a more in depth explanation.
reduceLeft is just a convenience method. It is equivalent to
list.tail.foldLeft(list.head)(_)
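The equivalence can be checked directly. This sketch uses subtraction on arbitrary numbers, so that a wrong evaluation order would show up in the result:

```scala
object ReduceViaFoldDemo {
  val list = List(4, 8, 15, 16)

  val viaReduce: Int = list.reduceLeft(_ - _)
  // Same computation spelled out: the head seeds the fold over the tail.
  val viaFold: Int = list.tail.foldLeft(list.head)(_ - _)
}
```

Both evaluate ((4 - 8) - 15) - 16, which is -35.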
foldLeft is more generic: you can use it to produce something completely different from what you originally put in, whereas reduceLeft can only produce a result of the same type as, or a supertype of, the collection's element type. For example:
List(1,3,5).foldLeft(0) { _ + _ }
List(1,3,5).foldLeft(List[String]()) { (a, b) => b.toString :: a }
The foldLeft will apply the closure with the last folded result (first time using initial value) and the next value.
reduceLeft, on the other hand, first applies the closure to the first two values of the list, then combines each remaining value with the cumulative result. See:
List(1,3,5).reduceLeft { (a, b) => println("a " + a + ", b " + b); a + b }
If the list is empty foldLeft can present the initial value as a legal result. reduceLeft on the other hand does not have a legal value if it can't find at least one value in the list.
For reference, reduceLeft throws the following exception when applied to an empty container:
java.lang.UnsupportedOperationException: empty.reduceLeft
Reworking the code to use
myList.foldLeft("") { (a, b) => a + b } // assuming myList: List[String]; the accumulator must be a String for a + b to type-check
is one potential option. Another is to use the reduceLeftOption variant, which returns an Option-wrapped result.
myList reduceLeftOption { (a, b) => a + b } match {
  case None    => // handle no result as necessary
  case Some(v) => println(v)
}
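The three behaviours on an empty list can be put side by side. A sketch, assuming a List[String] and using Try only to capture the exception instead of crashing:

```scala
import scala.util.Try

object EmptyListDemo {
  val empty = List.empty[String]

  // foldLeft falls back to its initial value.
  val folded: String = empty.foldLeft("")(_ + _)               // ""

  // reduceLeft has nothing to start from and throws UnsupportedOperationException.
  val reduced: Try[String] = Try(empty.reduceLeft(_ + _))

  // reduceLeftOption turns the same situation into None.
  val safe: String = empty.reduceLeftOption(_ + _).getOrElse("<empty>")
}
```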
The basic reason they are both in the Scala standard library is probably that they are both in the Haskell standard library (called foldl and foldl1). If reduceLeft weren't there, it would quite often be defined as a convenience method in different projects.
From Functional Programming Principles in Scala (Martin Odersky):
The function reduceLeft is defined in terms of a more general function, foldLeft.
foldLeft is like reduceLeft but takes an accumulator z, as an additional parameter, which is returned when foldLeft is called on an empty list:
(List(x1, ..., xn) foldLeft z)(op) = (...(z op x1) op ...) op xn
[as opposed to reduceLeft, which throws an exception when called on an empty list.]
The course (see lecture 5.5) provides abstract definitions of these functions, which illustrates their differences, although they are very similar in their use of pattern matching and recursion.
abstract class List[T] { ...
  def reduceLeft(op: (T, T) => T): T = this match {
    case Nil     => throw new Error("Nil.reduceLeft")
    case x :: xs => (xs foldLeft x)(op)
  }
  def foldLeft[U](z: U)(op: (U, T) => U): U = this match {
    case Nil     => z
    case x :: xs => (xs foldLeft op(z, x))(op)
  }
}
Note that foldLeft returns a value of type U, which is not necessarily the same as the list's element type T, whereas reduceLeft returns a value of the element type T.
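Since the class above is abstract and elided, here is a runnable sketch of the same two definitions written as stand-alone functions over the standard scala List (the object name and the two sample calls are my own additions, not from the course):

```scala
object FoldDefinitionsDemo {
  // reduceLeft seeds the fold with the head, so it needs a non-empty list.
  def reduceLeft[T](xs: List[T])(op: (T, T) => T): T = xs match {
    case Nil     => throw new Error("Nil.reduceLeft")
    case y :: ys => foldLeft(ys, y)(op)
  }

  // foldLeft threads the accumulator z through the list from left to right.
  @annotation.tailrec
  def foldLeft[T, U](xs: List[T], z: U)(op: (U, T) => U): U = xs match {
    case Nil     => z
    case y :: ys => foldLeft(ys, op(z, y))(op)
  }

  val total: Int = reduceLeft(List(1, 2, 3))(_ + _)                         // 6
  // U (Int) differs from T (String) here, which reduceLeft cannot express.
  val lengths: Int = foldLeft(List("a", "bb"), 0)((n, s) => n + s.length)   // 3
}
```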
To really understand what you are doing with fold/reduce, check this: http://wiki.tcl.tk/17983
It is a very good explanation. Once you get the concept of fold, reduce will come together with the answer above:
list.tail.foldLeft(list.head)(_)
Scala 2.13.3, Demo:
val names = List("Foo", "Bar")
println("ReduceLeft: "+ names.reduceLeft(_+_))
println("ReduceRight: "+ names.reduceRight(_+_))
println("Fold: "+ names.fold("Other")(_+_))
println("FoldLeft: "+ names.foldLeft("Other")(_+_))
println("FoldRight: "+ names.foldRight("Other")(_+_))
outputs:
ReduceLeft: FooBar
ReduceRight: FooBar
Fold: OtherFooBar
FoldLeft: OtherFooBar
FoldRight: FooBarOther