The declaration of Spark's org.apache.spark.sql.functions.countDistinct is:
def countDistinct(columnName: String, columnNames: String*): Column
def countDistinct(expr: Column, exprs: Column*): Column
Each overload takes variable arguments, but only after a single leading String/Column parameter. So I cannot write code like this:
val id1sArr = id1.split(",").map(col(_))
df.agg(countDistinct(id1sArr: _*))
So my questions are:
Why does the varargs function countDistinct take a single String/Column first? What are the advantages and disadvantages of this style of declaration?
How do I adapt to this declaration if I want to pass a variable number of arguments?
The answer to why the declaration has a single String/Column as the first argument is that
countDistinct requires at least one argument. With a declaration such as countDistinct(columnNames: String*), a call with zero arguments would be allowed.
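A tiny sketch of the difference (atLeastOne is a hypothetical method, not part of Spark):

def atLeastOne(first: String, rest: String*): Int = 1 + rest.length

// atLeastOne()        // does not compile: not enough arguments
atLeastOne("a")        // ok: 1
atLeastOne("a", "b")   // ok: 2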
As to how to pass a list of arguments, simply write:
df.agg(countDistinct(id1sArr.head, id1sArr.tail: _*))
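Putting it all together, a minimal sketch (the column names in id1 and the DataFrame df stand in for the question's setup):

import org.apache.spark.sql.functions.{col, countDistinct}

val id1 = "a,b,c"                        // comma-separated column names
val id1sArr = id1.split(",").map(col(_)) // Array[Column]

// head fills the required first parameter; tail is expanded as varargs
df.agg(countDistinct(id1sArr.head, id1sArr.tail: _*))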
Related
Why can't varargs be passed as another function's varargs without : _*?
object Main {
  def main(s: Array[String]): Unit = {
    def someFunction(varars: String*) = {
      someOtherFunction(varars)     // Compilation ERROR
      someOtherFunction(varars: _*) // Works, but why?
    }
    def someOtherFunction(someOtherVarars: String*): Unit = {
    }
  }
}
It's because varars is a single argument: a sequence of strings (note that I'm not writing Array[String], because inside the method it is a Seq[String], not a Java array), whereas by looking at the signature def someOtherFunction(someOtherVarars: String*): Unit, we can tell that someOtherFunction takes multiple arguments of type String each. You cannot simply pass the sequence as one parameter to someOtherFunction; you need to "unfold" it first.
In other words, for the sequence to be passed to someOtherFunction, it has to be marked as a sequence argument. It would not make much sense to be able to pass both varars and varars(1) in the same argument position. This is described in SLS §4.6.2.
"varargs"parameter means it can take any number of strings as an argument(i.e., a varargs field).
def someFunction(varars: String*): Seq[String] = {
  varars
}
If you define the above method and check the type of varars inside the body, you will see it has become Seq[String]. But when you pass it to another method that expects varargs, the types mismatch: it is now a Seq[String] and has to be converted back to variable arguments using varars: _*.
The : _* ascription tells the compiler to pass each element of the sequence as a separate argument, instead of passing the whole sequence as a single argument.
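A minimal sketch of the expansion (sum is a hypothetical method):

def sum(xs: Int*): Int = xs.sum  // inside the body, xs is a Seq[Int]

val nums = Seq(1, 2, 3)
// sum(nums)   // does not compile: found Seq[Int], required Int
sum(nums: _*)  // expands the sequence: equivalent to sum(1, 2, 3)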
I have a method with an implicit parameter, and I get an error when I convert it to a function in two cases:
1:
def action(implicit i:Int) = i + " in action"
val f = action _
Then I get a StackOverflowError.
2:
def action(implicit i:Int) = i + " in action"
val f = action(_)
Then I get an error: missing parameter type.
I must write it like this:
val f = (i: Int) => action(i)
That works. And if the parameter of action is not implicit, all of these cases work. How can this be explained, and what am I missing?
If you specify a parameter to a function to be implicit, you are inviting the compiler to supply the value of that parameter for you. So how does the compiler find those values? It looks for values of the same type (Int in your case) that have been declared as implicit values in a variety of scopes.
(For simplicity, I'll just use a local scope in this example, but you might want to read up on this topic. Programming in Scala, 3rd Ed is a good first step.)
Note that the names of the implicit values are ignored and have no bearing on proceedings, the compiler only looks at the types of implicit values. If multiple implicit values with the required type are found in the same scope, then the compiler will complain about ambiguous implicit values.
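For example, a minimal sketch of that failure mode (f, a, and b are hypothetical names):

def f(implicit i: Int) = i

implicit val a: Int = 1
implicit val b: Int = 2
// f  // does not compile: ambiguous implicit values:
//    // both value a and value b match type Int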
For example, the following provides a function with an implicit parameter and a default value for that parameter within the current scope:
def greetPerson(name: String)(implicit greeting: String) = s"$greeting $name!"
implicit val defaultGreeting: String = "Hello" // Implicit value to be used for greeting argument.
val y = greetPerson("Bob") // Equivalent to greetPerson("Bob")(defaultGreeting).
val z = greetPerson("Fred")("Hi")
Note that y is just a String value of "Hello Bob!", and z is a string with the value "Hi Fred!"; neither of them are functions.
Also note that greetPerson is a curried function. This is because implicit parameters cannot be mixed with regular, non-implicit parameters in the same parameter list.
In general, it's bad practice to use common types (Int, Boolean, String, etc.) as values for implicit parameters. In a big program, there might be a lot of different implicit values in your scope, and you might pick up an unexpected value. For that reason, it's standard practice to wrap such values in a case class instead.
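For instance, a minimal sketch of that convention (the Greeting wrapper is an illustrative name of my own):

final case class Greeting(text: String)

def greetPerson(name: String)(implicit greeting: Greeting) =
  s"${greeting.text} $name!"

implicit val defaultGreeting: Greeting = Greeting("Hello")

greetPerson("Bob")  // "Hello Bob!" -- no risk of picking up a stray implicit String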
If you're trying to create a value that supplies some of the arguments of another function (that is, a partially applied function), then that would look something like this:
def greetPerson(greeting: String, name: String) = s"$greeting $name!"
val sayHello = greetPerson("Hello", _: String)
val y = sayHello("Bob") // "Hello Bob!"
val sayHi = greetPerson("Hi", _: String)
val z = sayHi("Fred") // "Hi Fred!"
In both cases, we're creating partially applied functions (sayHi and sayHello) that call greetPerson with the greeting parameter specified, but which allow us to specify the name parameter. Both sayHello and sayHi are still only values, but their values are partially applied functions rather than constants.
Depending upon your circumstances, I think the latter case may suit you better...
I would also read up on how the underscore character (_) is used in Scala. In a partially applied function declaration, it corresponds to the arguments that will be provided later. But it has a lot of other uses too. I think there's no alternative to reading up on Scala and learning how and when to use them.
I'd like to call a function returned by a function with an implicit parameter, simply and elegantly. This doesn't work:
case class A(n: Int)

def resolveA(implicit a: A): String => String = { prefix =>
  s"$prefix a=$a"
}

implicit val a: A = A(1)
println(resolveA("-->")) // won't compile
I've figured out what's going on: Scala sees the ("-->") and thinks it's an attempt to explicitly fill in the implicit parameter list. I want to pass that as the prefix argument, but Scala sees it as the a argument.
I've tried some alternatives, like putting an empty parameter list () before the implicit one, but so far I've always been stopped by the fact that Scala thinks the argument to the returned function is an attempt to fill in the implicit parameter list of resolveA.
What's a nice way to do what I'm trying to do here, even if it's not as nice as the syntax I tried above?
Another option would be to use the apply method of the String => String function returned by resolveA. This way the compiler won't confuse the parameter lists, and it is a little shorter than writing implicitly[A].
scala> resolveA.apply("-->")
res3: String = --> a=A(1)
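For reference, the longer spelling that fills the implicit parameter list explicitly looks like this:

resolveA(implicitly[A])("-->")  // "--> a=A(1)"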
I've looked around and found several other examples of this, but I don't really understand from those answers what's actually going on.
I'd like to understand why the following code fails to compile:
val df = readFiles(sqlContext).
  withColumn("timestamp", udf(UDFs.parseDate _)($"timestamp"))
Giving the error:
Error:(29, 58) not enough arguments for method udf: (implicit evidence$2: reflect.runtime.universe.TypeTag[java.sql.Date], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction.
Unspecified value parameter evidence$3.
withColumn("timestamp", udf(UDFs.parseDate _)($"timestamp")).
^
Whereas this code does compile:
val parseDate = udf(UDFs.parseDate _)
val df = readFiles(sqlContext).
  withColumn("timestamp", parseDate($"timestamp"))
Obviously I've found a "workaround" but I'd really like to understand:
What this error really means. The info I have found on TypeTags and ClassTags has been really difficult to understand. I don't come from a Java background, which perhaps doesn't help, but I think I should be able to grasp it…
If I can achieve what I want without a separate function definition
The error message is a bit misleading indeed; the reason for it is that the function udf takes an implicit parameter list but you are passing an actual parameter. Since I don't know much about Spark, and since the udf signature is a bit convoluted, I'll try to explain what is going on with a simplified example.
In practice udf is a function that, given some explicit parameters and an implicit parameter list, gives you another function. Let's define the following function: given a pivot of type T for which we have an implicit Ordering, it will give us a function that splits a sequence in two, one part containing the elements smaller than the pivot and the other containing the elements that are bigger:
def buildFn[T](pivot: T)(implicit ev: Ordering[T]): Seq[T] => (Seq[T], Seq[T]) = ???
Let's leave out the implementation as it's not important. Now, if I do the following:
val elements: Seq[Int] = ???
val (small, big) = buildFn(10)(elements)
I will make the same kind of mistake that you are showing in your code: the compiler will think that I am explicitly passing elements as the implicit parameter list, and this won't compile. The error message of my example will be somewhat different from yours because in my case the number of parameters I am mistakenly passing for the implicit parameter list matches the expected one, so the error will be about types not lining up.
Instead, if I write it as:
val elements: Seq[Int] = ???
val fn = buildFn(10)
val (small, big) = fn(elements)
In this case the compiler will correctly pass the implicit parameters to the function. I don't know of any way to circumvent this problem, unless you want to pass the actual implicit parameters explicitly, but I find that quite ugly and not always practical; for reference, this is what I mean:
val elements: Seq[Int] = ???
val (small, big) = buildFn(10)(implicitly[Ordering[Int]])(elements)
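For completeness, a runnable sketch of buildFn (the partition-based body is my own assumption; the example above deliberately leaves it out):

def buildFn[T](pivot: T)(implicit ev: Ordering[T]): Seq[T] => (Seq[T], Seq[T]) =
  xs => xs.partition(x => ev.lt(x, pivot))  // strictly smaller vs. the rest

val elements = Seq(12, 3, 45, 8, 21)
val fn = buildFn(10)             // Ordering[Int] is resolved here
val (small, big) = fn(elements)  // small = List(3, 8), big = List(12, 45, 21)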
Could someone shed more light on the following piece of Scala code, which is not fully clear to me? I have the following function defined:
def ids(ids: String*) = {
  _builder.ids(ids: _*)
  this
}
Then I am trying to pass a variable argument list to this function as follows:
def searchIds(kind: KindOfThing, adIds: String*) = {
  ...
  ids(adIds)
}
Firstly, the ids(adIds) call doesn't work, which is a bit odd at first, as the error message says: Type mismatch, expected: String, actual: Seq[String]. So a variable argument list cannot be passed directly where individual String arguments are expected.
To fix this, use the trick ids(adIds: _*).
I am not 100% sure how : _* works; could someone shed some light on it?
If I remember correctly, : means that the operation is applied to the right argument instead of the left, and _ means "apply" to the passed element, ...
I checked the String and Seq scaladocs but wasn't able to find a : _* method.
Could someone explain this?
Thanks
You should look at your method definitions:
def ids(ids: String*)
Here you're saying that this method takes a variable number of strings, e.g.:
def ids(id1: String, id2: String, id3: String, ...)
Then the second method:
def searchIds(kind: KindOfThing, adIds:String*)
This also takes a variable number of strings, which are packaged into a Seq[String], so adIds is actually a Seq. But your first method ids doesn't take a Seq; it takes N strings. That's why ids(adIds: _*) works.
: _* is called the splat operator; what it's doing is splatting the Seq back into N separate Strings.
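With that in mind, the fixed call site from the question looks like this (a sketch, keeping the question's own names):

def searchIds(kind: KindOfThing, adIds: String*) = {
  // adIds is a Seq[String] here; splat it back into N String arguments:
  ids(adIds: _*)
}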