Spark Scala Method Signature in DataFrame API - scala

Hi I am trying to understand scala more and I think I am a little lost with this method signature.
explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame
First off, what is the "<:" supposed to mean in the square brakcets? Are A and B supposed to be parameter types? But Column is the argument type.
Secondly, it looks like it does a lambda function from (Row) to Traversable[A] but I haven't seen a lambda yet that doesn't have the left side argument on the right side argument at least once.
Also, I 'm not 100 percent usre why it has the implicit arg0: piece
Thanks in advance!

what is the "<:" supposed to mean in the square brackets?
<: means subtype in scala, so here it means the A is a subtype of Product. It acts like a kind of upper bound which limits the type that can be passed here to be subtype of Product
Are A and B supposed to be parameter types? But Column is the argument
type
A is not the parameter type, it is a parameter by itself, which is called type parameter. It is a little bit confusing but basically it means you can pass any type that is a subtype of product to this position and use the type parameter inside the function. This makes the function more generic because it can handle different types at the same time and you don't have to write separate functions for different types;
it looks like it does a lambda function from (Row) to Traversable[A]
f: (row) => Traversable[A] is another parameter which in this case is a function type which accepts (row) and return Traversable[A]. By this definition, explode can accept a function as a parameter, in which case is a lambda expression;
To illustrate the last case:
def sum(x: Int, y: Int)(f: Int => Int) = f(x) + f(y)
sum: (x: Int, y: Int)(f: Int => Int)Int
sum(2,3)(x => 2*x)
res2: Int = 10
In conclusion, the function explode accepts three parameters in total, the first one A is a type parameter. The second and the third are real arguments of the function, with Input being of type Column as you have noticed and f being of type (row) => Traversable[A] which is a function type.

Related

Can wildcard anonymous functions be used when parameters are repeating in the result expression?

Find Square of an integer 'x'.
Without Placeholder
var square = (x:Int) => x*x
square(3) gives desired output 9.
With Placeholder var square = (_:Int)*(_:Int)
square(3) gives Error
Not enough arguments for method apply: (v1: Int, v2: Int)Int in trait Function2.
Unspecified value parameter v2.
Internally what is happening?
No, each occurrence of _ represents the next argument in the argument list for the function.
(_:Int)*(_:Int) is a function that takes two Int arguments and multiplies them.

Scala currying example in the tutorial is confusing me

I'm reading a bit about Scala currying here and I don't understand this example very much:
def foldLeft[B](z: B)(op: (B, A) => B): B
What is the [B] in square brackets? Why is it in brackets? The B after the colon is the return type right? What is the type?
It looks like this method has 2 parameter lists: one with a parameter named z and one with a parameter named op which is a function.
op looks like it takes a function (B, A) => B). What does the right side mean? It returns B?
And this is apparently how it is used:
val numbers = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val res = numbers.foldLeft(0)((m, n) => m + n)
print(res) // 55
What is going on? Why wasn't the [B] needed when called?
In Scala documentation that A type (sometimes A1) is often the placeholder for a collection's element type. So if you have...
List('c','q','y').foldLeft( //....etc.
...then A becomes the reference for Char because the list is a List[Char].
The B is a placeholder for a 2nd type that foldLeft will have to deal with. Specifically it is the type of the 1st parameter as well as the type of the foldLeft result. Most of the time you actually don't need to specify it when foldLeft is invoked because the compiler will infer it. So, yeah, it could have been...
numbers.foldLeft[Int](0)((m, n) => m + n)
...but why bother? The compiler knows it's an Int and so does anyone reading it (anyone who knows Scala).
(B, A) => B is a function that takes 2 parameters, one of type B and one of type A, and produces a result of type B.
What is the [B] in square brackets?
A type parameter (that's also known as "generics" if you've seen something like Java before)
Why is it in brackets?
Because type parameters are written in brackets, that's just Scala syntax. Java, C#, Rust, C++ use angle brackets < > for similar purposes, but since arrays in Scala are accessed as arr(idx), and (unlike Haskell or Python) Scala does not use [ ... ] for list comprehensions, the square brackets could be used for type parameters, and there was no need for angular brackets (those are more difficult to parse anyway, especially in a language which allows almost arbitrary names for infix and postfix operators).
The B after the colon is the return type right?
Right.
What is the type?
Ditto. The return type is B.
It looks like this method has 2 parameter lists: one with a parameter named z and one with a parameter named op which is a function.
This method has a type parameter B and two argument lists for value arguments, correct. This is done to simplify the type inference: the type B can be inferred from the first argument z, so it does not have to be repeated when writing down the lambda-expression for op. This wouldn't work if z and op were in the same argument list.
op looks like it takes a function (B, A) => B.
The type of the argument op is (B, A) => B, that is, Function2[B, A, B], a function that takes a B and an A and returns a B.
What does the right side mean? It returns B?
Yes.
What is going on?
m acts as accumulator, n is the current element of the list. The fold starts with integer value 0, and then accumulates from left to right, adding up all numbers. Instead of (m, n) => m + n, you could have written _ + _.
Why wasn't the [B] needed when called?
It was inferred from the type of z. There are many other cases where the generic type cannot be inferred automatically, then you would have to specify the return type explicitly by passing it as an argument in the square brackets.
This is what is called polymorphism. The function can work on multiple types and you sometimes want to give a parameter of what type will be worked with. Basically the B is a type parameter and can either be given explicitly as a type, which would be Int and then it should be given in square brackets or implicitly in parentheses like you did with the 0. Read about polymorphism here Scala polymorphism

Difference between these two function formats

I am working on spark and not an expert in scala. I have got the two variants of map function. Could you please explain the difference between them.?
first variant and known format.
first variant
val.map( (x,y) => x.size())
Second variant -> This has been applied on tuple
val.map({case (x, y) => y.toString()});
The type of val is RDD[(IntWritable, Text)]. When i tried with first function, it gave error as below.
type mismatch;
found : (org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text) ⇒ Unit
required: ((org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text)) ⇒ Unit
When I added extra parenthesis it said,
Tuples cannot be directly destructured in method or function parameters.
Well you say:
The type of val is RDD[(IntWritable, Text)]
so it is a tuple of arity 2 with IntWritable and Text as components.
If you say
val.map( (x,y) => x.size())
what you're doing is you are essentially passing in a Function2, a function with two arguments to the map function. This will never compile because map wants a function with one argument. What you can do is the following:
val.map((xy: (IntWritable, Text)) => xy._2.toString)
using ._2 to get the second part of the tuple which is passed in as xy (the type annotation is not required but makes it more clear).
Now the second variant (you can leave out the outer parens):
val.map { case (x, y) => y.toString() }
this is special scala syntax for creating a PartialFunction that immediately matches on the tuple that is passed in to access the x and y parts. This is possible because PartialFunction extends from the regular Function1 class (Function1[A,B] can be written as A => B) with one argument.
Hope that makes it more clear :)
I try this in repl:
scala> val l = List(("firstname", "tom"), ("secondname", "kate"))
l: List[(String, String)] = List((firstname,tom), (secondname,kate))
scala> l.map((x, y) => x.size)
<console>:9: error: missing parameter type
Note: The expected type requires a one-argument function accepting a 2-Tuple.
Consider a pattern matching anonymous function, `{ case (x, y) => ... }`
l.map((x, y) => x.size)
maybe can give you some inspire.
Your first example is a function that takes two arguments and returns a String. This is similar to this example:
scala> val f = (x:Int,y:Int) => x + y
f: (Int, Int) => Int = <function2>
You can see that the type of f is (Int,Int) => Int (just slightly changed this to be returning an int instead of a string). Meaning that this is a function that takes two Int as arguments and returns an Int as a result.
Now the second example you have is a syntactic sugar (a shortcut) for writing something like this:
scala> val g = (k: (Int, Int)) => k match { case (x: Int, y: Int) => x + y }
g: ((Int, Int)) => Int = <function1>
You see that the return type of function g is now ((Int, Int)) => Int. Can you spot the difference? The input type of g has two parentheses. This shows that g takes one argument and that argument must be a Tuple[Int,Int] (or (Int,Int) for short).
Going back to your RDD, what you have is an Collection of Tuple[IntWritable, Text] so the second function will work, whereas the first one will not work.

Scala Function.tupled and Function.untupled equivalent for variable arity, or, calling variable arity function with tuple

I was trying to do some stuff last night around accepting and calling a generic function (i.e. the type is known at the call site, but potentially varies across call sites, so the definition should be generic across arities).
For example, suppose I have a function f: (A, B, C, ...) => Z. (There are actually many such fs, which I do not know in advance, and so I cannot fix the types nor count of A, B, C, ..., Z.)
I'm trying to achieve the following.
How do I call f generically with an instance of (A, B, C, ...)? If the signature of f were known in advance, then I could do something involving Function.tupled f or equivalent.
How do I define another function or method (for example, some object's apply method) with the same signature as f? That is to say, how do I define a g for which g(a, b, c, ...) type checks if and only if f(a, b, c, ...) type checks? I was looking into Shapeless's HList for this. From what I can tell so far, HList at least solves the "representing an arbitrary arity args list" issue, and also, Shapeless would solve the conversion to and from tuple issue. However, I'm still not sure I understand how this would fit in with a function of generic arity, if at all.
How do I define another function or method with a related type signature to f? The biggest example that comes to mind now is some h: (A, B, C, ...) => SomeErrorThing[Z] \/ Z.
I remember watching a conference presentation on Shapeless some time ago. While the presenter did not explicitly demonstrate these things, what they did demonstrate (various techniques around abstracting/genericizing tuples vs HLists) would lead me to believe that similar things as the above are possible with the same tools.
Thanks in advance!
Yes, Shapeless can absolutely help you here. Suppose for example that we want to take a function of arbitrary arity and turn it into a function of the same arity but with the return type wrapped in Option (I think this will hit all three points of your question).
To keep things simple I'll just say the Option is always Some. This takes a pretty dense four lines:
import shapeless._, ops.function._
def wrap[F, I <: HList, O](f: F)(implicit
ftp: FnToProduct.Aux[F, I => O],
ffp: FnFromProduct[I => Option[O]]
): ffp.Out = ffp(i => Some(ftp(f)(i)))
We can show that it works:
scala> wrap((i: Int) => i + 1)
res0: Int => Option[Int] = <function1>
scala> wrap((i: Int, s: String, t: String) => (s * i) + t)
res1: (Int, String, String) => Option[String] = <function3>
scala> res1(3, "foo", "bar")
res2: Option[String] = Some(foofoofoobar)
Note the appropriate static return types. Now for how it works:
The FnToProduct type class provides evidence that some type F is a FunctionN (for some N) that can be converted into a function from some HList to the original output type. The HList function (a Function1, to be precise) is the Out type member of the instance, or the second type parameter of the FnToProduct.Aux helper.
FnFromProduct does the reverse—it's evidence that some F is a Function1 from an HList to some output type that can be converted into a function of some arity to that output type.
In our wrap method, we use FnToProduct.Aux to constrain the Out of the FnToProduct instance for F in such a way that we can refer to the HList parameter list and the O result type in the type of our FnFromProduct instance. The implementation is then pretty straightforward—we just apply the instances in the appropriate places.
This may all seem very complicated, but once you've worked with this kind of generic programming in Scala for a while it becomes more or less intuitive, and we'd of course be happy to answer more specific questions about your use case.

type inference in fold left one-liner?

I was trying to reverse a List of Integers as follows:
List(1,2,3,4).foldLeft(List[Int]()){(a,b) => b::a}
My question is that is there a way to specify the seed to be some List[_] where the _ is the type automatically filled in by scala's type-inference mechanism, instead of having to specify the type as List[Int]?
Thanks
Update: After reading a bit more on Scala's type inference, I found a better answer to your question. This article which is about the limitations of the Scala type inference says:
Type information in Scala flows from function arguments to their results [...], from left to right across argument lists, and from first to last across statements. This is in contrast to a language with full type inference, where (roughly speaking) type information flows unrestricted in all directions.
So the problem is that Scala's type inference is rather limited. It first looks at the first argument list (the list in your case) and then at the second argument list (the function). But it does not go back.
This is why neither this
List(1,2,3,4).foldLeft(Nil){(a,b) => b::a}
nor this
List(1,2,3,4).foldLeft(List()){(a,b) => b::a}
will work. Why? First, the signature of foldLeft is defined as:
foldLeft[B](z: B)(f: (B, A) => B): B
So if you use Nil as the first argument z, the compiler will assign Nil.type to the type parameter B. And if you use List(), the compiler will use List[Nothing] for B.
Now, the type of the second argument f is fully defined. In your case, it's either
(Nil.type, Int) => Nil.type
or
(List[Nothing], Int) => List[Nothing]
And in both cases the lambda expression (a, b) => b :: a is not valid, since its return type is inferred to be List[Int].
Note that the bold part above says "argument lists" and not "arguments". The article later explains:
Type information does not flow from left to right within an argument list, only from left to right across argument lists.
So the situation is even worse if you have a method with a single argument list.
The only way I know how is
scala> def foldList[T](l: List[T]) = l.foldLeft(List[T]()){(a,b) => b::a}
foldList: [T](l: List[T])List[T]
scala> foldList(List(1,2,3,4))
res19: List[Int] = List(4, 3, 2, 1)
scala> foldList(List("a","b","c"))
res20: List[java.lang.String] = List(c, b, a)