Putting two placeholders inside flatMap in Spark Scala to create Array - scala

I am applying flatMap on a scala array and create another array from it:
val x = sc.parallelize(Array(1,2,3,4,5,6,7))
val y = x.flatMap(n => Array(n,n*100,42))
println(y.collect().mkString(","))
1,100,42,2,200,42,3,300,42,4,400,42,5,500,42,6,600,42,7,700,42
But I tried to use the placeholder "_" in the second line of the code, where I create y, in the following way:
scala> val y = x.flatMap(Array(_,_*100,42))
<console>:26: error: wrong number of parameters; expected = 1
val y = x.flatMap(Array(_,_*100,42))
^
This does not work. Could someone explain what to do in such cases if I want to use a placeholder?

In Scala, the number of placeholders in a lambda determines the number of its parameters: each _ stands for a separate parameter.
So the last line is expanded as
val y = x.flatMap((x1, x2) => Array(x1, x2*100, 42))
Long story short, you can't use a placeholder to refer twice to the same element.
You have to use named parameters in this case.
val y = x.flatMap(x => Array(x, x*100, 42))
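To see the rule in isolation, here is a minimal sketch on a plain Scala List, so it runs without a SparkContext. The names nums, named, multiply and doubled are made up for illustration and are not part of the question.

```scala
// A plain List stands in for the RDD; the placeholder rules are the same.
val nums = List(1, 2, 3)

// Named parameter: n can be used as often as needed.
val named = nums.flatMap(n => List(n, n * 100, 42))

// Two placeholders mean a two-parameter function: (a, b) => a * b.
val multiply: (Int, Int) => Int = _ * _

// One placeholder means a one-parameter function: n => n * 2.
val doubled = nums.map(_ * 2)

println(named.mkString(","))   // 1,100,42,2,200,42,3,300,42
println(multiply(3, 4))        // 12
println(doubled.mkString(",")) // 2,4,6
```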

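You can use the _ placeholder only once per parameter. (In your case, flatMap takes a function of a single argument, but two placeholders tell the compiler to expect a function of two arguments, which is not going to work.) For the RDD[Int] in the question, name the parameter: x.flatMap(i => Array(i, i*100, 42)). If the elements were pairs, e.g. an RDD[(Int, Int)], then
val y = x.flatMap(i => Array(i._1, i._2*100,42))
should do the trick.
val y = x.flatMap { case (i1, i2) => Array(i1, i2*100,42) }
should also work (and is probably more readable).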

Related

Broadcasting of scala options on sub-elements [duplicate]

I am new to scala, please help me with the below question.
Can we call map method on an Option? (e.g. Option[Int].map()?).
If yes then could you help me with an example?
Here's a simple example:
val x = Option(5)
val y = x.map(_ + 10)
println(y)
This will result in Some(15).
If x were None instead, y would also be None.
Yes:
val someInt = Some (2)
val noneInt:Option[Int] = None
val someIntRes = someInt.map (_ * 2) //Some (4)
val noneIntRes = noneInt.map (_ * 2) //None
See docs
You can view an Option as a collection that contains exactly 0 or 1 items. Mapping over the collection gives you a container with the same number of items, each being the result of applying the mapping function to the corresponding item in the original.
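The "collection of 0 or 1 items" view can be checked directly. This is only a sketch, and the value names are made up for illustration:

```scala
val some: Option[Int] = Some(5)
val none: Option[Int] = None

// Converting to a List makes the 0-or-1 element view explicit.
val someList = some.toList // List(5)
val noneList = none.toList // List()

// map keeps the number of items: one stays one, zero stays zero.
val mappedSome = some.map(_ + 10)     // Some(15)
val mappedNone = none.map(_ + 10)     // None
val mappedList = someList.map(_ + 10) // List(15)

println(mappedSome)
println(mappedNone)
```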
Sometimes it's more convenient to use fold instead of mapping Options. Consider the example:
scala> def printSome(some: Option[String]) = some.fold(println("Nothing provided"))(println)
printSome: (some: Option[String])Unit
scala> printSome(Some("Hi there!"))
Hi there!
scala> printSome(None)
Nothing provided
Inside fold you can work with the real value, e.g. map it or do whatever you want, and the default (the first argument list) is what gets evaluated when Option#isEmpty is true.
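For comparison, here is a small sketch of fold next to the equivalent map followed by getOrElse. The functions describe and describeViaMap are hypothetical names used only for this illustration.

```scala
// fold: first argument list is the result for None, second handles the value.
def describe(name: Option[String]): String =
  name.fold("Nothing provided")(n => s"Hello, $n")

// The same logic written as map followed by getOrElse.
def describeViaMap(name: Option[String]): String =
  name.map(n => s"Hello, $n").getOrElse("Nothing provided")

println(describe(Some("Ann"))) // Hello, Ann
println(describe(None))        // Nothing provided
```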

What does an underscore after a scala method call mean?

The scala documentation has a code example that includes the following line:
val numberFunc = numbers.foldLeft(List[Int]())_
What does the underscore after the method call mean?
It's a partially applied function. You only provide the first parameter to foldLeft (the initial value), but you don't provide the second one; you postpone it for later. In the docs you linked they do it in the next line, where they define squares:
val numberFunc = numbers.foldLeft(List[Int]())_
val squares = numberFunc((xs, x) => xs:+ x*x)
See that (xs, x) => xs:+ x*x, that's the missing second parameter which you omitted while defining numberFunc. If you had provided it right away, then numberFunc would not be a function - it would be the computed value.
So basically the whole thing can also be written as a one-liner in the curried form:
val squares = numbers.foldLeft(List[Int]())((xs, x) => xs:+ x*x)
However, if you want to be able to reuse foldLeft over and over again, with the same collection and initial value but a different function every time, then it's very convenient to define a separate numberFunc (as they did in the docs) and reuse it with different functions, e.g.:
val squares = numberFunc((xs, x) => xs:+ x*x)
val cubes = numberFunc((xs, x) => xs:+ x*x*x)
...
Note that the compiler error message is pretty straightforward in case you forget the underscore:
Error: missing argument list for method foldLeft in trait LinearSeqOptimized
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `foldLeft _` or `foldLeft(_)(_)` instead of `foldLeft`.
val numberFunc = numbers.foldLeft(List[Int]())
EDIT: Haha I just realized that they did the exact same thing with cubes in the documentation.
I don't know if it helps but I prefer this syntax
val numberFunc = numbers.foldLeft(List[Int]())(_)
Then numberFunc is basically a delegate bound to an instance method (the instance being numbers) that is still waiting for a parameter, which in the Scala documentation example later turns out to be a lambda expression.
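As a rough sketch, assuming Scala 2 syntax as in the question, the trailing underscore can be put side by side with the same function written as an explicit lambda. The list contents and the name numberFuncExplicit are made up for illustration.

```scala
val numbers = List(1, 2, 3, 4)

// Trailing underscore: turn the half-applied method into a function value.
val numberFunc = numbers.foldLeft(List[Int]()) _

// The same thing spelled out as a lambda over the missing argument.
val numberFuncExplicit =
  (op: (List[Int], Int) => List[Int]) => numbers.foldLeft(List[Int]())(op)

val squares = numberFunc((xs, x) => xs :+ x * x)
val cubes = numberFuncExplicit((xs, x) => xs :+ x * x * x)

println(squares) // List(1, 4, 9, 16)
println(cubes)   // List(1, 8, 27, 64)
```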

Scala : placeholder inside tuple

I've played a bit with placeholders and found a strange case:
val integers = Seq(1, 2)
val f = (x:Int) => x + 1
integers.map((_, f(_)))
which returns
Seq[(Int, Int => Int)] = List((1,<function1>), (2,<function1>))
I was expecting
Seq[(Int, Int)] = List((1, 2), (2, 3))
If I make the following changes, everything works as expected :
integers.map(i => (i, f(i)))
Any idea why the function f is not applied during the mapping ?
The underscore stands in for the passed argument only once. So in integers.map((_, f(_))) the 1st _ is a value from integers but the 2nd _ has the stand-alone meaning of "partially applied function".
If your anonymous function takes 2 (or more) arguments then you can use 2 (or more) underscores, but each stands in for its passed argument only once.
The Scala compiler can't read your mind, so the _ placeholder syntax is only useful in very simple expressions.
In your example:
integers.map((_, f(_)))
it evaluates the f(_) as a standalone sub-expression, so you end up with something equivalent to this:
x => (x, y => f(y))
Even if the compiler didn't treat f(_) as its own sub-expression, the result would not be the same as what you say you want:
integers.map(i => (i, f(i)))
You want both instances of _ to be treated as the same argument, which is not how _ works. Each occurrence of _ in an expression is always treated as a unique argument.
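Spelling the expansion out with explicit lambdas may make the difference easier to see. This is a sketch; withFunctions, pairs and appliedLater are names invented for the illustration.

```scala
val integers = Seq(1, 2)
val f = (x: Int) => x + 1

// What integers.map((_, f(_))) amounts to: the inner f(_) is its own function.
val withFunctions: Seq[(Int, Int => Int)] =
  integers.map(x => (x, (y: Int) => f(y)))

// What was intended: one named parameter used twice.
val pairs: Seq[(Int, Int)] = integers.map(i => (i, f(i)))

// The functions in withFunctions still have to be applied to get values.
val appliedLater = withFunctions.map { case (i, g) => (i, g(i)) }

println(pairs) // List((1,2), (2,3))
```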

How do `map` and `reduce` methods work in Spark RDDs?

The following code is from the quick start guide of Apache Spark.
Can somebody explain to me what the "line" variable is and where it comes from?
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
Also, how does a value get passed into a,b?
Link to the QSG http://spark.apache.org/docs/latest/quick-start.html
First, according to your link, textFile is created as
val textFile = sc.textFile("README.md")
so textFile is an RDD[String], meaning it is a resilient distributed dataset of Strings. Its API is very similar to that of regular Scala collections.
So now what does this map do?
Imagine you have a list of Strings and want to convert that into a list of Ints, representing the length of each String.
val stringlist: List[String] = List("ab", "cde", "f")
val intlist: List[Int] = stringlist.map( x => x.length )
The map method expects a function, here one that goes from String => Int. With that function, each element of the list is transformed, so the value of intlist is List( 2, 3, 1 ).
Here, we have created an anonymous function from String => Int, namely x => x.length. One can write the function more explicitly as
stringlist.map( (x: String) => x.length )
Written even more explicitly, you can first define the function as a value:
val stringLength : (String => Int) = {
x => x.length
}
val intlist = stringlist.map( stringLength )
So here it is absolutely evident that stringLength is a function from String to Int.
Remark: In general, map is what makes up a so-called Functor. When you provide a function from A => B, the functor's map (here List's) lets you use that function to go from List[A] => List[B] as well. This is called lifting.
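As an illustration of lifting, the same String => Int function can be handed to map on different containers. This is only a sketch, and the sample values are made up.

```scala
val stringLength: String => Int = _.length

// One function, lifted by each container's own map.
val listLengths = List("ab", "cde", "f").map(stringLength) // List[Int]
val optionLength = Option("hello").map(stringLength)       // Option[Int]
val vectorLengths = Vector("x", "yz").map(stringLength)    // Vector[Int]

println(listLengths)   // List(2, 3, 1)
println(optionLength)  // Some(5)
println(vectorLengths) // Vector(1, 2)
```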
Answers to your questions
What is the "line" variable?
As mentioned above, line is the input parameter of the function line => line.split(" ").size
More explicit
(line: String) => line.split(" ").size
Example: If line is "hello world", the function returns 2.
"hello world"
=> Array("hello", "world") // split
=> 2 // size of Array
How does a value get passed into a,b?
reduce also expects a function, this time one from (A, A) => A, where A is the element type of your RDD. Let's call this function op.
What does reduce do? Example:
List( 1, 2, 3, 4 ).reduce( (x,y) => x + y )
Step 1: op( 1, 2 ) is the first evaluation. Start with 1 and 2, that is, x is 1 and y is 2.
Step 2: op( op( 1, 2 ), 3 ). Take the next element, 3: x is op( 1, 2 ) = 3 and y is 3.
Step 3: op( op( op( 1, 2 ), 3 ), 4 ). Take the next element, 4: x is op( op( 1, 2 ), 3 ) = op( 3, 3 ) = 6 and y is 4.
The result here is the sum of the list elements, 10.
Remark: In general reduce calculates
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
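The left-nested evaluation can be reproduced on a plain List. In this sketch reduceLeft is used to pin the order explicitly, and scanLeft is added only to expose the intermediate values of x. On an RDD, Spark combines elements across partitions in no fixed order, which is why its reduce expects op to be associative and commutative.

```scala
val op = (x: Int, y: Int) => x + y

// reduceLeft fixes the left-to-right nesting described in the steps.
val viaReduce = List(1, 2, 3, 4).reduceLeft(op)

// The same computation written out by hand.
val byHand = op(op(op(1, 2), 3), 4)

// scanLeft shows the running value of x after each step.
val steps = List(2, 3, 4).scanLeft(1)(op)

println(viaReduce) // 10
println(steps)     // List(1, 3, 6, 10)
```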
Full example
First, textFile is an RDD[String], say
textFile
"hello Tyth"
"cool example, eh?"
"goodbye"
textFile.map(line => line.split(" ").size)
2
3
1
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
3
Steps here, recalling that op is `(a, b) => if (a > b) a else b`:
- op( op(2, 3), 1) evaluates to op(3, 1), since op(2, 3) = 3
- op( 3, 1 ) = 3
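The full example can be tried without Spark by letting a plain List stand in for the RDD. That substitution is an assumption of this sketch, and lines, wordCounts and maxWords are illustrative names.

```scala
// A List[String] stands in for the RDD[String] here.
val lines = List("hello Tyth", "cool example, eh?", "goodbye")

// map: one word count per line.
val wordCounts = lines.map(line => line.split(" ").size)

// reduce: keep the larger of the two values at every step.
val maxWords = wordCounts.reduce((a, b) => if (a > b) a else b)

println(wordCounts) // List(2, 3, 1)
println(maxWords)   // 3
```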
Map and reduce are methods of the RDD class, which has an interface similar to Scala collections.
What you pass to map and reduce are anonymous functions (with one parameter for map, and two parameters for reduce). textFile calls the provided function for every element it has (a line of text in this context).
Maybe you should read an introduction to Scala collections first.
You can read more about RDD class API here:
https://spark.apache.org/docs/1.2.1/api/scala/#org.apache.spark.rdd.RDD
What the map function does is take a collection of values and apply a function to each of them. It is similar to the map function in Python, if you are familiar with it.
Also, a file is like a list of Strings (not exactly, but that's how it's iterated).
Let's consider this is your file.
val list_a: List[String] = List("first line", "second line", "last line")
Now let's see how map function works.
We need two things: a list of values, which we already have, and a function to apply to these values. Let's consider a really simple function for understanding.
val myprint = (arg:String)=>println(arg)
This function simply takes a single String argument and prints it on the console.
myprint("hello world")
hello world
If we map this function over your list, it's going to print all the lines
list_a.map(myprint)
We can write an anonymous function as mentioned below as well, which does the same thing.
list_a.map(arg=>println(arg))
In your case, line stands for each line of the file in turn. You can change the argument name as you like. For example, if I change arg to line in the example above, it works without any issue
list_a.map(line=>println(line))
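One detail worth noting: map always builds a new collection from the function's results, so mapping println yields a list of Unit values. A small sketch (listA, printed and lengths are illustrative names) contrasts that with foreach, which exists for side effects, and with a map that returns something useful:

```scala
val listA = List("first line", "second line", "last line")

// map collects the results of println, which are all Unit.
val printed: List[Unit] = listA.map(line => println(line))

// foreach runs the function only for its side effect.
listA.foreach(println)

// map with a function that returns a value.
val lengths = listA.map(_.length)

println(lengths) // List(10, 11, 9)
```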

Scala code analyzer targets case variable names that are identical to the outer matched variables - "suspicious shadowing"

In the following code snippet in which the outer match vars (x,y) are case matched by (xx,yy):
scala> val (x,y) = (1,2)
x: Int = 1
y: Int = 2
scala> (x,y) match {
| case (xx:Int, yy:Int) => println(s"x=$x xx=$xx")
| }
x=1 xx=1
We could have also written that code as follows:
scala> (x,y) match {
| case (x:Int, y:Int) => println(s"x=$x y=$y")
| }
x=1 y=2
In this latter case the Scala Code Analyzers will inform us:
Suspicious shadowing by a Variable Pattern
OK. But is there any situation where we could end up actually misusing the inner variable (x or y) in place of the original outer match variables?
It seems this is purely stylistic? No actual possibility for bugs? If there is, I would be interested to learn what the bugs could be.
This could be confusing:
val x = Some(1)
val y = Some(2)
(x, y) match {
case (Some(x), Some(y)) => println(s"x=$x y=$y")
}
x and y have different types depending on whether you are inside or outside of the match. If this code wasn't using simply Option, and was several lines longer, it could be rather difficult to reason about.
Could any bugs arise from this? None that I can think of that aren't horribly contrived. You could, for example, mistake one for another.
val list = List(1,2,3)
list match {
case x :: y :: list => list // List(3) and not List(1,2,3)
case x :: list => list // Nil: only reached when the outer list has exactly 1 element
case _ => list // Returns the outer list when empty
}
Not to mention what a horrible mess that is. Within the match, list sometimes refers to an inner symbol, and sometimes the outer list.
It's just code that's unnecessarily complicated to read and understand, there are no special bugs that could happen.
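To make the readability point concrete, here is a sketch of the shadowed match next to a version with distinct names. The name rest and the returned tuple are choices made for this illustration only.

```scala
val list = List(1, 2, 3)

// Shadowed: inside the first case, `list` is the tail bound by the pattern.
val shadowed = list match {
  case _ :: _ :: list => list
  case _              => list
}

// Distinct names keep both the outer list and the tail available.
val renamed = list match {
  case _ :: _ :: rest => (list, rest)
  case _              => (list, Nil)
}

println(shadowed) // List(3)
println(renamed)  // (List(1, 2, 3),List(3))
```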