How do `map` and `reduce` methods work in Spark RDDs? - scala

The following code is from the quick start guide of Apache Spark.
Can somebody explain to me what the "line" variable is and where it comes from?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Also, how does a value get passed into a and b?
Link to the quick start guide: http://spark.apache.org/docs/latest/quick-start.html

First, according to your link, textFile is created as
val textFile = sc.textFile("README.md")
such that textFile is an RDD[String], meaning it is a resilient distributed dataset of Strings. The API for accessing it is very similar to that of regular Scala collections.
So now what does this map do?
Imagine you have a list of Strings and want to convert that into a list of Ints, representing the length of each String.
val stringlist: List[String] = List("ab", "cde", "f")
val intlist: List[Int] = stringlist.map( x => x.length )
The map method expects a function, one that goes from String => Int. With that function, each element of the list is transformed, so the value of intlist is List( 2, 3, 1 ).
Here, we have created an anonymous function from String => Int, namely x => x.length. One can write the function even more explicitly as
stringlist.map( (x: String) => x.length )
If you write the function explicitly like that, you can also give it a name:
val stringLength: (String => Int) = {
  x => x.length
}
val intlist = stringlist.map( stringLength )
So, here it is absolutely evident that stringLength is a function from String to Int.
Remark: In general, map is what makes up a so-called Functor. While you provide a function from A => B, map of the functor (here List) allows you to use that function also to go from List[A] => List[B]. This is called lifting.
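As a small sketch of that idea (double and liftToList are just illustrative names):
val double: Int => Int = _ * 2
// map "lifts" double from Int => Int to List[Int] => List[Int]
val liftToList: List[Int] => List[Int] = _.map(double)
liftToList(List(1, 2, 3)) // List(2, 4, 6)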
Answers to your questions
What is the "line" variable?
As mentioned above, line is the input parameter of the function line => line.split(" ").size
More explicitly:
(line: String) => line.split(" ").size
Example: If line is "hello world", the function returns 2.
"hello world"
=> Array("hello", "world") // split
=> 2 // size of Array
How does a value get passed into a and b?
reduce also expects a function from (A, A) => A, where A is the type of your RDD. Let's call this function op.
What does reduce do? Example:
List( 1, 2, 3, 4 ).reduce( (x,y) => x + y )
Step 1: op( 1, 2 ) will be the first evaluation: x is 1 and y is 2.
Step 2: op( op( 1, 2 ), 3 ) - take the next element 3: x is op(1, 2) = 3 and y is 3.
Step 3: op( op( op( 1, 2 ), 3 ), 4 ) - take the next element 4: x is op(op(1, 2), 3) = op(3, 3) = 6 and y is 4.
The result here is the sum of the list elements, 10.
Remark: In general reduce calculates
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
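As a minimal sketch of that recurrence for a plain List (not Spark's actual implementation; Spark reduces each partition in parallel, which is why op should be associative and commutative):
def myReduce[A](xs: List[A])(op: (A, A) => A): A =
  xs.tail.foldLeft(xs.head)(op) // builds op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)

myReduce(List(1, 2, 3, 4))(_ + _) // 10, as in the example above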
Full example
First, textFile is an RDD[String], say
textFile
"hello Tyth"
"cool example, eh?"
"goodbye"
textFile.map(line => line.split(" ").size)
2
3
1
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
3
The steps here, recalling `(a, b) => if (a > b) a else b`:
- op( op(2, 3), 1) evaluates to op(3, 1), since op(2, 3) = 3
- op( 3, 1 ) = 3
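Since the RDD API mirrors the Scala collections API, you can reproduce the same pipeline on a plain List in the REPL:
List("hello Tyth", "cool example, eh?", "goodbye")
  .map(line => line.split(" ").size)
  .reduce((a, b) => if (a > b) a else b) // 3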

map and reduce are methods of the RDD class, whose interface is similar to that of Scala collections.
What you pass to the map and reduce methods are actually anonymous functions (with one parameter in map, and with two parameters in reduce). map calls the provided function for every element (a line of text in this context) the RDD has.
Maybe you should read an introduction to Scala collections first.
You can read more about RDD class API here:
https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.rdd.RDD

What the map function does is take a list of values and apply a function to each of them, similar to the map function in Python, if you are familiar with it.
Also, a file is like a list of Strings (not exactly, but that is how it is iterated).
Let's consider this is your file.
val list_a: List[String] = List("first line", "second line", "last line")
Now let's see how map function works.
We need two things: the list of values, which we already have, and a function that we want to map over those values. Let's consider a really simple function for understanding:
val myprint = (arg: String) => println(arg)
This function simply takes a single String argument and prints it to the console.
myprint("hello world")
hello world
If we map this function over your list, it is going to print all the lines:
list_a.map(myprint)
We can write an anonymous function as mentioned below as well, which does the same thing.
list_a.map(arg=>println(arg))
In your case, line is bound to each line of the file in turn. You can name the argument whatever you like; for example, in the above example, if I change arg to line it works without any issue:
list_a.map(line=>println(line))
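A side note: for a pure side effect like printing, foreach is the more idiomatic method, because map here also builds a List[Unit] that is immediately thrown away:
list_a.foreach(line => println(line))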

Related

Scala - create a new list and update particular element from existing list

I am new to Scala and new to OOP too. How can I update a particular element in a list while creating a new list?
val numbers= List(1,2,3,4,5)
val result = numbers.map(_*2)
I need to update the third element only -> multiply it by 2. How can I do that by using map?
You can use zipWithIndex to map the list into a list of tuples, where each element is accompanied by its index. Then, using map with pattern matching, you single out the third element (index = 2):
val numbers = List(1,2,3,4,5)
val result = numbers.zipWithIndex.map {
  case (v, i) if i == 2 => v * 2
  case (v, _) => v
}
// result: List[Int] = List(1, 2, 6, 4, 5)
Alternatively, you can use patch, which replaces a sub-sequence with a provided one:
numbers.patch(from = 2, patch = Seq(numbers(2) * 2), replaced = 1)
I think the clearest way of achieving this is by using updated(index: Int, elem: Int). For your example, it could be applied as follows:
val result = numbers.updated(2, numbers(2) * 2)
list.zipWithIndex creates a list of pairs, with the original element on the left and its index in the list on the right (indices are 0-based, so the "third element" is at index 2).
val result = numbers.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}
This creates an intermediate list holding the pairs, and then maps over it to do your transformation. A slightly more efficient approach is to use an iterator. Iterators are 'lazy', so, rather than creating an intermediate container, the iterator generates the pairs one by one and sends them straight to the .map:
val result = numbers.iterator.zipWithIndex.map {
  case (n, 2) => n * 2
  case (n, _) => n
}.toList
First and foremost, Scala is a functional language, not only OOP. You can update any element of a list through the method updated; see the following example for details:
Signature: updated(index, value)
val numbers = List(1,2,3,4,5)
print(numbers.updated(2,10))
Here the 1st argument is the index and the 2nd argument is the new value. Note that updated does not modify the list in place; it returns a new list:
List(1, 2, 10, 4, 5)
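To see the immutability explicitly, with the same numbers list as in the question:
val numbers = List(1, 2, 3, 4, 5)
val result = numbers.updated(2, numbers(2) * 2)
// result:  List(1, 2, 6, 4, 5)
// numbers: List(1, 2, 3, 4, 5) -- unchanged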

flatMap with a map in scala

Why doesn't this work:
val m = Map( 1-> 2, 2-> 4, 3 ->6)
def h(k: Int, v: Int) = if (v > 2) Some(k->v) else None
m.flatMap { case(k,v) => h(k,v) }
m.flatMap { (k,v) => h(k,v) }
The one with the case statement gives me:
res1: scala.collection.immutable.Map[Int,Int] = Map(2 -> 4, 3 -> 6)
but the other one fails with: missing parameter type v; expected: Int, actual: (Int, Int).
The case keyword signifies pattern matching, so the Tuple2 (a Map is an Iterable of Tuple2 elements) that you are flatMapping "over" gets decomposed into k and v. (The fact that flatMap works when the h function produces an Option rather than a Map or Iterable is the Scala collections library being perhaps overly permissive.)
Without the case keyword, you are providing a function that requires two arguments, but flatMap needs a function that accepts a single argument (a Tuple2). So the second version does not typecheck.
For the second one, you can do this if you don't want to use case:
m.flatMap { x => h(x._1, x._2) } // x is the (key, value) pair here (one element of the map); the key and value are accessed as _1 and _2 respectively
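Another option, if you prefer not to spell out _1 and _2, is to tuple the function itself; this sketch relies on standard eta-expansion of the method h:
m.flatMap((h _).tupled) // (h _) turns the method into a function value; .tupled makes it accept one (Int, Int) pair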

Scala - access collection members within map or flatMap

Suppose that I use a sequence of various maps and/or flatMaps to generate a sequence of collections. Is it possible to access information about the "current" collection from within any of those methods? For example, without knowing anything specific about the functions used in the previous maps or flatMaps, and without using any intermediate declarations, how can I get the maximum value (or length, or first element, etc.) of the collection upon which the last map acts?
List(1, 2, 3)
.flatMap(x => f(x) /* some unknown function */)
.map(x => x + ??? /* what is the max element of the collection? */)
Edit for clarification:
In the example, I'm not looking for the max (or whatever) of the initial List. I'm looking for the max of the collection after the flatMap has been applied.
By "without using any intermediate declarations" I mean that I do not want to use any temporary collections en route to the final result. So, the example by Steve Waldman below, while giving the desired result, is not what I am seeking. (I include this condition is mostly for aesthetic reasons.)
Edit for clarification, part 2:
The ideal solution would be some magic keyword or syntactic sugar that lets me reference the current collection:
List(1, 2, 3)
.flatMap(x => f(x))
.map(x => x + theCurrentList.max)
I'm prepared to accept the fact, however, that this simply is not possible.
Maybe just define the list as a val, so you can name it? I don't know of any facility built into map(...) or flatMap(...) that would help.
val myList = List(1, 2, 3)
myList
.flatMap(x => f(x) /* some unknown function */)
.map(x => x + myList.max /* what is the max element of the List? */)
Update: By this approach at least, if you have multiple transformations and want to see the transformed version, you'd have to name that. You could get away with
val myList = List(1, 2, 3).flatMap(x => f(x) /* some unknown function */)
myList.map(x => x + myList.max /* what is the max element of the List? */)
Or, if there will be multiple transformations, get in the habit of naming the stages.
val rawList = List(1, 2, 3)
val smordified = rawList.flatMap(x => f(x) /* some unknown function */)
val maxified = smordified.map(x => x + smordified.max /* what is the max element of the List? */)
maxified
Update 2: Watch it work in the REPL, even with heterogeneous types:
scala> def f( x : Int ) : Vector[Double] = Vector(x * math.random, x * math.random )
f: (x: Int)Vector[Double]
scala> val rawList = List(1, 2, 3)
rawList: List[Int] = List(1, 2, 3)
scala> val smordified = rawList.flatMap(x => f(x) /* some unknown function */)
smordified: List[Double] = List(0.40730853571901315, 0.15151641399798665, 1.5305929709857609, 0.35211231420067435, 0.644241939254793, 0.15530230501048903)
scala> val maxified = smordified.map(x => x + smordified.max /* what is the max element of the List? */)
maxified: List[Double] = List(1.937901506704774, 1.6821093849837476, 3.0611859419715217, 1.8827052851864352, 2.1748349102405538, 1.6858952759962498)
scala> maxified
res3: List[Double] = List(1.937901506704774, 1.6821093849837476, 3.0611859419715217, 1.8827052851864352, 2.1748349102405538, 1.6858952759962498)
It is possible, but not pretty, and not likely something you want if you are doing it for "aesthetic reasons."
import scala.math.max
def f(x: Int): Seq[Int] = ???
List(1, 2, 3).
  flatMap(x => f(x) /* some unknown function */).
  foldRight((List[Int](), List[Int]())) {
    case (x, (xs, Nil))       => ((x :: xs), List.fill(xs.size + 1)(x))
    case (x, (xs, xMax :: _)) => ((x :: xs), List.fill(xs.size + 1)(max(x, xMax)))
  }.
  zipped.
  map {
    case (x, xMax) => x + xMax
  }

// Or alternately, a slightly more efficient version using Streams.
List(1, 2, 3).
  flatMap(x => f(x) /* some unknown function */).
  foldRight((List[Int](), Stream[Int]())) {
    case (x, (xs, Stream())) =>
      ((x :: xs), Stream.continually(x))
    case (x, (xs, curXMax #:: _)) =>
      val newXMax = max(x, curXMax)
      ((x :: xs), Stream.continually(newXMax))
  }.
  zipped.
  map {
    case (x, xMax) => x + xMax
  }
Seriously though, I just took this on to see if I could do it. While the code didn't turn out as bad as I expected, I still don't think it's particularly readable. I'd discourage using this over something similar to Steve Waldman's answer. Sometimes, it's simply better to just introduce a val, rather than being dogmatic about it.
You could define a mapWithSelf (resp. flatMapWithSelf) operation along these lines and add it as an implicit enrichment to the collection. For List it might look like:
// Scala 2.13 APIs
object Enrichments {
  implicit class WithSelfOps[A](val lst: List[A]) extends AnyVal {
    def mapWithSelf[B](f: (A, List[A]) => B): List[B] =
      lst.map(f(_, lst))
    def flatMapWithSelf[B](f: (A, List[A]) => IterableOnce[B]): List[B] =
      lst.flatMap(f(_, lst))
  }
}
The enrichment basically fixes the value of the collection before the operation and threads it through. It should be possible to generify this (at least for the strict collections), though it would look a little different in 2.12 vs. 2.13+.
Usage would look like
import Enrichments._
val someF: Int => IterableOnce[Int] = ???
List(1, 2, 3)
  .flatMap(someF)
  .mapWithSelf { (x, lst) =>
    x + lst.max
  }
So at the usage site, it's aesthetically pleasant. Note that if you're computing something which traverses the list, you'll be traversing the list every time (leading to a quadratic runtime). You can get around that with some mutability or by just saving the intermediate list after the flatMap.
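For example, since the enrichment above targets Scala 2.13, you can also bind the intermediate list without a named val via scala.util.chaining (a sketch reusing the hypothetical someF from above):
import scala.util.chaining._

List(1, 2, 3)
  .flatMap(someF)
  .pipe { lst =>
    val m = lst.max // computed once, avoiding the quadratic re-traversal
    lst.map(_ + m)
  }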
One somewhat-simple way of referencing prior output within the current map/collect operation is to use a named reference outside the map, then reference it from within the map block:
var prevOutput = ... // starting value of whatever is referenced within the map
myValues.map { v =>
  prevOutput = ... // expression that references the prior `prevOutput` (and typically `v`)
  prevOutput // return the computed value for the map to collect
}
This draws attention to the fact that we're referencing prior elements while building the new sequence.
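For instance, a running maximum built with this pattern (a small illustrative sketch):
var runningMax = Int.MinValue
val maxSoFar = List(3, 1, 4, 1, 5).map { v =>
  runningMax = math.max(runningMax, v)
  runningMax // each output element is the max of everything seen so far
}
// maxSoFar: List(3, 3, 4, 4, 5)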
This would be messier, though, if you wanted to reference arbitrarily previous values, not just the previous one.

Scala : placeholder inside tuple

I've played a bit with placeholders and found a strange case:
val integers = Seq(1, 2)
val f = (x:Int) => x + 1
integers.map((_, f(_)))
which returns
Seq[(Int, Int => Int)] = List((1,<function1>), (2,<function1>))
I was expecting
Seq[(Int, Int)] = List((1, 2), (2, 3))
If I make the following changes, everything works as expected:
integers.map(i => (i, f(i)))
Any idea why the function f is not applied during the mapping ?
The underscore stands in for the passed argument only once. So in integers.map((_, f(_))), the 1st _ is a value from integers, but the 2nd _ has the stand-alone meaning of a "partially applied function".
If your anonymous function takes 2 (or more) arguments, then you can use 2 (or more) underscores, but each stands in for its passed argument only once.
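For example (an illustrative snippet):
val add: (Int, Int) => Int = _ + _ // each _ stands in for a different parameter, in order
add(3, 4) // 7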
The Scala compiler can't read your mind, so the _ placeholder syntax is only useful in very simple expressions.
In your example:
integers.map((_, f(_)))
it evaluates the f(_) as a standalone sub-expression, so you end up with something equivalent to this:
x => (x, y => f(y))
Even if the compiler didn't treat f(_) as its own sub-expression, the result would not be the same as what you say you want:
integers.map(i => (i, f(i)))
You want both instances of _ to be treated as the same argument, which is not how _ works. Each occurrence of _ in an expression is always treated as a unique argument.
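You can see the expansion of f(_) directly, given the f from the question:
val g: Int => Int = f(_) // equivalent to y => f(y)
g(5) // 6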

Putting two placeholders inside flatMap in Spark Scala to create Array

I am applying flatMap to a Scala array and creating another array from it:
val x = sc.parallelize(Array(1,2,3,4,5,6,7))
val y = x.flatMap(n => Array(n,n*100,42))
println(y.collect().mkString(","))
1,100,42,2,200,42,3,300,42,4,400,42,5,500,42,6,600,42,7,700,42
But I am trying to use placeholder "_" in the second line of the code where I create y in the following way:
scala> val y = x.flatMap(Array(_,_*100,42))
<console>:26: error: wrong number of parameters; expected = 1
val y = x.flatMap(Array(_,_*100,42))
^
This is not working. Could someone explain what to do in such cases if I want to use a placeholder?
In Scala, the number of placeholders in a lambda indicates the arity of the lambda's parameter list.
So the last line is expanded as
val y = x.flatMap((x1, x2) => Array(x1, x2*100, 42))
Long story short, you can't use a placeholder to refer twice to the same element.
You have to use named parameters in this case.
val y = x.flatMap(x => Array(x, x*100, 42))
You can only use the _ placeholder once per parameter. (In your case, the flatMap method takes a single-argument function, but two underscores tell the compiler to expect two arguments, which is not going to work.) Naming the parameter
val y = x.flatMap(i => Array(i, i * 100, 42))
should do the trick. A pattern-matching variant like { case (i1, i2) => ... } only applies when the elements are tuples; here they are plain Ints, so the named parameter is the way to go.