Apache-Spark : What is map(_._2) shorthand for? - scala

I read a project's source code, found:
val sampleMBR = inputMBR.map(_._2).sample
inputMBR is a tuple.
the function map's definition is :
map[U classTag](f:T=>U):RDD[U]
it seems that map(_._2) is the shorthand for map(x => (x._2)).
Anyone can tell me rules of those shorthand ?

The _ syntax can be a bit confusing. When _ is used on its own it represents an argument in the anonymous function. So if we working on pairs:
map(_._2 + _._2) would be shorthand for map(x, y => x._2 + y._2). When _ is used as part of a function name (or value name) it has no special meaning. In this case x._2 returns the second element of a tuple (assuming x is a tuple).

collection.map(_._2) emits a second component of the tuple. Example from pure Scala (Spark RDDs work the same way):
scala> val zipped = (1 to 10).zip('a' to 'j')
zipped: scala.collection.immutable.IndexedSeq[(Int, Char)] = Vector((1,a), (2,b), (3,c), (4,d), (5,e), (6,f), (7,g), (8,h), (9,i), (10,j))
scala> val justLetters = zipped.map(_._2)
justLetters: scala.collection.immutable.IndexedSeq[Char] = Vector(a, b, c, d, e, f, g, h, i, j)

Two underscores in '_._2' are different.
First '_' is for placeholder of anonymous function; Second '_2' is member of case class Tuple.
Something like:
case class Tuple3 (_1: T1, _2: T2, _3: T3)
{...}

I have found the solutions.
First the underscore here is as placeholder.
To make a function literal even more concise, you can use underscores
as placeholders for one or more parameters, so long as each parameter
appears only one time within the function literal.
See more about underscore in Scala at What are all the uses of an underscore in Scala?.

The first '_' is referring what is mapped to and since what is mapped to is a tuple you might call any function within the tuple and one of the method is '_2' so what below tells us transform input into it's second attribute.

Related

Folding lists in scala

Folding list in scala using /: and :\ operator
I tried to to look at different sites and they only talk about foldRight and foldLeft functions.
def sum(xs: List[Int]): Int = (0 /: xs) (_ + _)
sum(List(1,2,3))
res0: 6
The code segment works as described. But I am not able to completely understand the method definition. What I understand is that the one inside the first parenthesis -> 0 /: xs where /: is a right associate operator. The object is xs and the parameter is 0. I am not sure about the return type of the operation (most probably it would be another list?). The second part is a functional piece which sums its two parameters. But I don't understand what object invokes it ? and the name of function. Can someone please help me to understand.
The signature of :/ is
/:[B](z: B)(op: (B, A) ⇒ B): B
It is a method with multiple argument lists, so when it is just invoked with on argument (i.e. 0 /: xs in your case) the return type is (op: (B, A) ⇒ B): B. So you have to pass it a method with 2 parameters ( _ + _ ) that is used to combine the elements of the list starting from z.
This method is usually called foldLeft:
(0 /: xs)(_ + _) is the same as xs.foldLeft(0)(_ + _)
You can find more details here: https://www.scala-lang.org/api/2.12.3/scala/collection/immutable/List.html
Thanks #HaraldGliebe & #LuisMiguelMejíaSuárez for your great responses. I am enlightened now!. I am just summarisig the answer here which may benefit others who read this thread.
"/:" is actually the name of the function which is defined inside the List class. The signature of the function is: /:[B](z: B)(op: (B, A) ⇒ B): B --> where B is the type parameter, z is the first parameter; op is the second parameter which is of functional type.
The function follows curried version --> which means we can pass less number of parameters than that of the actual number. If we do that,
the partially applied function is stored in a temporary variable; we can then use the temporary variable to pass the remaining parameters.
If supplied with all parameters, "/:" can be called as: x./:(0)(_+_) where x is val/var of List type. OR "/:" can be called in two steps which are given as:
step:1 val temp = x./:(0)(_) where we pass only the first parameter. This results in a partially applied function which is stored in the temp variable.
step:2 temp(_+_) here using the partially applied function temp is passed with the second (final) parameter.
If we decide to follow the first style ( x./:(0)(_+_) ), calling the first parameter can be written in operator notion which is: x /: 0
Since the method name ends with a colon, the object will be pulled from right side. So x /: 0 is invalid and it has to be written as 0 /: x which is correct.
This one is equivalent to the temp variable. On following 0 /: x, second parameter also needs to be passed. So the whole construct becomes: (0/:x)(_+_)
This is how the definition of the function sum in the question, is interpreted.
We have to note that when we use curried version of the function in operator notion, we have to supply all the parameters in a single go.
That is: (0 /: x) (_) OR (0 /: x) _ seems throwing syntax errors.

Scala : placeholder inside tuple

I've played a bit with placeholder and found a strange case :
val integers = Seq(1, 2)
val f = (x:Int) => x + 1
integers.map((_, f(_)))
which returns
Seq[(Int, Int => Int)] = List((1,<function1>), (2,<function1>))
I was expecting
Seq[(Int, Int)] = List((1, 2), (2, 3))
If I make the following changes, everything works as expected :
integers.map(i => (i, f(i)))
Any idea why the function f is not applied during the mapping ?
The underscore stands in for the passed argument only once. So in integers.map((_, f(_))) the 1st _ is a value from integers but the 2nd _ has the stand-alone meaning of "partially applied function".
If your anonymous function takes 2 (or more) arguments then you can use 2 (or more) underscores, but each stands in for its passed argument only once.
The Scala compiler can't read your mind, so the _ placeholder syntax is only useful in very simple expressions.
In your example:
integers.map((_, f(_)))
it evaluates the f(_) as a standalone sub-expression, so you end up with something equivalent to this:
x => (x, y => f(y))
Even if the compiler didn't treat f(_) as its own sub-expression, the result would not be the same as what you say want:
integers.map(i => (i, f(i)))
You want both instances of _ to be treated as the same argument, which is not how _ works. Each occurrence of _ in an expression is always treated as a unique argument.

Difference between these two function formats

I am working on spark and not an expert in scala. I have got the two variants of map function. Could you please explain the difference between them.?
first variant and known format.
first variant
val.map( (x,y) => x.size())
Second variant -> This has been applied on tuple
val.map({case (x, y) => y.toString()});
The type of val is RDD[(IntWritable, Text)]. When i tried with first function, it gave error as below.
type mismatch;
found : (org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text) ⇒ Unit
required: ((org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text)) ⇒ Unit
When I added extra parenthesis it said,
Tuples cannot be directly destructured in method or function parameters.
Well you say:
The type of val is RDD[(IntWritable, Text)]
so it is a tuple of arity 2 with IntWritable and Text as components.
If you say
val.map( (x,y) => x.size())
what you're doing is you are essentially passing in a Function2, a function with two arguments to the map function. This will never compile because map wants a function with one argument. What you can do is the following:
val.map((xy: (IntWritable, Text)) => xy._2.toString)
using ._2 to get the second part of the tuple which is passed in as xy (the type annotation is not required but makes it more clear).
Now the second variant (you can leave out the outer parens):
val.map { case (x, y) => y.toString() }
this is special scala syntax for creating a PartialFunction that immediately matches on the tuple that is passed in to access the x and y parts. This is possible because PartialFunction extends from the regular Function1 class (Function1[A,B] can be written as A => B) with one argument.
Hope that makes it more clear :)
I try this in repl:
scala> val l = List(("firstname", "tom"), ("secondname", "kate"))
l: List[(String, String)] = List((firstname,tom), (secondname,kate))
scala> l.map((x, y) => x.size)
<console>:9: error: missing parameter type
Note: The expected type requires a one-argument function accepting a 2-Tuple.
Consider a pattern matching anonymous function, `{ case (x, y) => ... }`
l.map((x, y) => x.size)
maybe can give you some inspire.
Your first example is a function that takes two arguments and returns a String. This is similar to this example:
scala> val f = (x:Int,y:Int) => x + y
f: (Int, Int) => Int = <function2>
You can see that the type of f is (Int,Int) => Int (just slightly changed this to be returning an int instead of a string). Meaning that this is a function that takes two Int as arguments and returns an Int as a result.
Now the second example you have is a syntactic sugar (a shortcut) for writing something like this:
scala> val g = (k: (Int, Int)) => k match { case (x: Int, y: Int) => x + y }
g: ((Int, Int)) => Int = <function1>
You see that the return type of function g is now ((Int, Int)) => Int. Can you spot the difference? The input type of g has two parentheses. This shows that g takes one argument and that argument must be a Tuple[Int,Int] (or (Int,Int) for short).
Going back to your RDD, what you have is an Collection of Tuple[IntWritable, Text] so the second function will work, whereas the first one will not work.

Questions about placeholders in Scala

Consider the following definition in Scala:
val f = ((_: Int) + 1).toString()
The code assigns to f the string representation of the function literal _ + 1, which is quite natural, except that this is not i want. i intended to define a function that accepts an int argument, increments it by 1, and returns its string format.
To disambiguate, i have to write a lambda expression with explicit parameters:
val g = (x: Int) => (x + 1).toString()
So can i conclude that the placeholder syntax is not suitable for complex function literals?
Or is there some rule that states the scope of the function literal? It seems placeholders cannot be nested in parentheses(except the ones needed for defining its type) within the function literal
Yes, you are right in thinking that placeholder syntax is not useful beyond simple function literals.
For what it's worth, the particular case you mentioned can be written with a conjunction of placeholder syntax and function composition.
scala> val f = ((_: Int) + 1) andThen (_.toString)
f: Int => java.lang.String = <function1>
scala> f(34)
res14: java.lang.String = 35
It seems placeholders cannot be nested in parentheses(except the ones needed for defining its type) within the function literal
This is correct. See the rules for the placeholder syntax and a use case example.

What does param: _* mean in Scala?

Being new to Scala (2.9.1), I have a List[Event] and would like to copy it into a Queue[Event], but the following Syntax yields a Queue[List[Event]] instead:
val eventQueue = Queue(events)
For some reason, the following works:
val eventQueue = Queue(events : _*)
But I would like to understand what it does, and why it works? I already looked at the signature of the Queue.apply function:
def apply[A](elems: A*)
And I understand why the first attempt doesn't work, but what's the meaning of the second one? What is :, and _* in this case, and why doesn't the apply function just take an Iterable[A] ?
a: A is type ascription; see What is the purpose of type ascriptions in Scala?
: _* is a special instance of type ascription which tells the compiler to treat a single argument of a sequence type as a variable argument sequence, i.e. varargs.
It is completely valid to create a Queue using Queue.apply that has a single element which is a sequence or iterable, so this is exactly what happens when you give a single Iterable[A].
This is a special notation that tells the compiler to pass each element as its own argument, rather than all of it as a single argument. See here.
It is a type annotation that indicates a sequence argument and is mentioned as an "exception" to the general rule in section 4.6.2 of the language spec, "Repeated Parameters".
It is useful when a function takes a variable number of arguments, e.g. a function such as def sum(args: Int*), which can be invoked as sum(1), sum(1,2) etc. If you have a list such as xs = List(1,2,3), you can't pass xs itself, because it is a List rather than an Int, but you can pass its elements using sum(xs: _*).
For Python folks:
Scala's _* operator is more or less the equivalent of Python's *-operator.
Example
Converting the scala example from the link provided by Luigi Plinge:
def echo(args: String*) =
for (arg <- args) println(arg)
val arr = Array("What's", "up", "doc?")
echo(arr: _*)
to Python would look like:
def echo(*args):
for arg in args:
print "%s" % arg
arr = ["What's", "up", "doc?"]
echo(*arr)
and both give the following output:
What's
up
doc?
The Difference: unpacking positional parameters
While Python's *-operator can also deal with unpacking of positional parameters/parameters for fixed-arity functions:
def multiply (x, y):
return x * y
operands = (2, 4)
multiply(*operands)
8
Doing the same with Scala:
def multiply(x:Int, y:Int) = {
x * y;
}
val operands = (2, 4)
multiply (operands : _*)
will fail:
not enough arguments for method multiply: (x: Int, y: Int)Int.
Unspecified value parameter y.
But it is possible to achieve the same with scala:
def multiply(x:Int, y:Int) = {
x*y;
}
val operands = (2, 4)
multiply _ tupled operands
According to Lorrin Nelson this is how it works:
The first part, f _, is the syntax for a partially applied function in which none of the arguments have been specified. This works as a mechanism to get a hold of the function object. tupled returns a new function which of arity-1 that takes a single arity-n tuple.
Futher reading:
stackoverflow.com - scala tuple unpacking