Scala DataFrame selectExpr: accepting one parameter or two?

I am trying to understand someone's Scala code (which was built and runs fine) that has:
// df is of type DataFrame
df.selectExpr("*", clause)
While looking at this link for DataFrame: https://spark.apache.org/docs/1.6.1/api/scala/#org.apache.spark.sql.DataFrame,
the syntax for selectExpr has this signature below, which seems to accept only one parameter:
def selectExpr(exprs: String*): DataFrame
So why does the code I mentioned above pass in two parameters instead of one?
And what is "String*"? The docs show it is of type "scala.Predef.String", but it is hard to find a clear example online explaining the use of "String*" as a type.
Thanks for the help.

An asterisk after a type name is just the Scala way to declare repeated parameters (see SLS §4.6.2), which are very similar to varargs in Java.
So method declaration
def selectExpr(exprs: String*): DataFrame = { /*...*/ }
is roughly equivalent to Java version
public DataFrame selectExpr(String... exprs) { /*...*/ }
and creates a method that accepts from zero to probably-as-many-as-you'll-ever-want String arguments.
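For example, all of the calls below compile against that single repeated-parameter signature (df and the column names here are hypothetical):
df.selectExpr("colA")                                 // one expression
df.selectExpr("*", "colA + 1 AS colAPlusOne")         // several expressions, as in the code you quoted
val clauses = Seq("*", "colA + 1 AS colAPlusOne")
df.selectExpr(clauses: _*)                            // a whole Seq expanded into the repeated parameter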

Related

Spark Scala: functional difference in notation using $?

Is there a functional difference between the following two expressions? The result looks the same to me, but I'm curious if there's an unknown unknown. What does the $ symbol indicate, and how is it read?
df1.orderBy($"reasonCode".asc).show(10, false)
df1.orderBy(asc("reasonCode")).show(10, false)
Those two statements are equivalent and will lead to identical results.
The $ notation is specific to Spark's Scala API and refers to the implicit StringToColumn class, whose $ string interpolator turns the subsequent string "reasonCode" into a Column:
implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = {
    new ColumnName(sc.s(args: _*))
  }
}
In Scala Spark you have many ways to select a column. I have written down a full list of syntax varieties in another answer on select specific columns from spark dataframe.
Using different notations does not have any impact on performance, as they are all translated to the same plan by Spark's Catalyst optimizer.
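For illustration, assuming spark is your SparkSession and df1 is the DataFrame from the question, the notations below all resolve to the same column and produce the same plan:
import org.apache.spark.sql.functions.{asc, col}
import spark.implicits._   // brings the $ interpolator (StringToColumn) into scope

df1.orderBy($"reasonCode".asc)
df1.orderBy(asc("reasonCode"))
df1.orderBy(col("reasonCode").asc)
df1.orderBy(df1("reasonCode").asc)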

What are Untyped Scala UDF and Typed Scala UDF? What are their differences?

I've been using Spark 2.4 for a while and just started switching to Spark 3.0 in the last few days. After switching, I got this error when running udf((x: Int) => x, IntegerType):
Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
These solutions are proposed by Spark itself, and after googling for a while I got to the Spark Migration Guide page:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
source: Spark Migration Guide
I notice that my usual way of using the functions.udf API, which is udf(AnyRef, DataType), is called an UnTyped Scala UDF, and the proposed solution, which is udf(AnyRef), is called a Typed Scala UDF.
To my understanding, the first one looks more strictly typed than the second one, since the first one has its output type explicitly defined and the second one does not, hence my confusion about why it's called UnTyped.
Also, the function passed to udf, which is (x: Int) => x, clearly has its input type defined, so why does Spark claim You're using untyped Scala UDF, which does not have the input type information?
Is my understanding correct? Even after more intensive searching I still can't find any material explaining what an UnTyped Scala UDF and a Typed Scala UDF are.
So my questions are: What are they? What are their differences?
In a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas in an untyped Scala UDF, it doesn't.
When creating a typed Scala UDF, the argument and output types of the UDF are inferred from the function's argument and output types, whereas when creating an untyped Scala UDF, there is no type inference at all, either for arguments or output.
What can be confusing is that when creating a typed UDF, the types are inferred from the function and not explicitly passed as arguments. To be more explicit, you can write the typed UDF creation as follows:
val my_typed_udf = udf[Int, Int]((x: Int) => x)
Now, let's look at the two points you raised.
To my understanding, the first one (e.g. udf(AnyRef, DataType)) looks more strictly typed than the second one (e.g. udf(AnyRef)), since the first one has its output type explicitly defined and the second one does not, hence my confusion about why it's called UnTyped.
According to the Spark functions Scaladoc, the signatures of the udf functions that turn a function into a UDF are actually, for the first one:
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
And for the second one:
def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction
So the second one is actually more typed than the first one, as it takes into account the type of the function passed as argument, whereas the first one erases it.
That's why, with the first one, you need to specify the return type: Spark needs this information but can't infer it from the function passed as argument, since its type is erased. With the second one, the return type is inferred from the function passed as argument.
Also, the function passed to udf, which is (x: Int) => x, clearly has its input type defined, so why is Spark claiming You're using untyped Scala UDF, which does not have the input type information?
What is important here is not the function itself, but how Spark creates a UDF from this function.
In both cases, the function to be transformed into a UDF has its input and return types defined, but those types are erased and not taken into account when creating the UDF with udf(AnyRef, DataType).
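To make the contrast concrete, here is a minimal sketch of both variants (the column name value and the +1 closure are just placeholders; the untyped line is commented out because Spark 3.0 rejects it by default):
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Typed: input and return types are inferred from the closure's type signature,
// so Spark has full type information, including the primitive Int argument.
val typedInc = udf((x: Int) => x + 1)

// Untyped: the closure is passed as AnyRef and only the return DataType is given,
// so the input type is unknown. Throws AnalysisException in Spark 3.0 unless
// spark.sql.legacy.allowUntypedScalaUDF is set to true.
// val untypedInc = udf((x: Int) => x + 1, IntegerType)

// df.select(typedInc($"value"))  // hypothetical usage on a column named "value"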
This doesn't answer your original question about what the different UDFs are, but if you want to get rid of the error, in Python you can include this line in your script: spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true").

Dataframe: Adding prefix to all columns in Scala

val prefix = "ABC"
val renamedColumns = df.columns.map(c=> df(c).as(s"$prefix$c"))
val dfNew = df.select(renamedColumns: _*)
Hi,
I am fairly new to Scala, and the code above works perfectly to add a prefix to all columns. Can someone please explain the breakdown of how it works?
The second line above will return the columns aliased as col1 as ABCcol1, col2 as ABCcol2, ... etc.
I have trouble understanding what the third line is doing, especially the : _* at the end.
thanks for your help in advance.
The third line is an example of Scala's syntactic sugar. Essentially, Scala has ways to shorten exactly what you are typing, and you have discovered the dreaded :_*.
There are two portions to this small bit: the : and the _* serve two different purposes. The : is type ascription, which tells the compiler "treat this expression as having this type". The _* is the "type" being ascribed: it tells the compiler to expand the sequence into a repeated parameter (varargs, like in Java; good resource here). This lets you pass a method a sequence whose number of elements you do not know in advance.
In your example, you are creating a variable called renamedColumns from the columns of your original dataframe, with the new string appendage. Although you may know just how many columns are in your df, Scala does not. When you create dfNew, you are running a select statement on that and passing in your new column names, of which there could be an arbitrary number.
Essentially, you do not know in advance how many columns you may have, so you expand your array with : _* into the repeated parameter, letting the number of arguments be arbitrary.
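As a standalone illustration of the mechanism (greet here is just a toy method, not part of your code), any method with a repeated parameter can be fed either individual arguments or a whole sequence expanded with : _* :
def greet(names: String*): String = names.mkString("Hello ", ", ", "!")

greet("col1", "col2")            // arguments listed one by one
val cols = Seq("col1", "col2", "col3")
greet(cols: _*)                  // the same method, fed a sequence of unknown length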

Pass List[String] to function that takes f(args: String*) scala

I need to read in specific parquet files with spark, I know this can be done like so:
sqlContext
  .read
  .parquet("s3://bucket/key", "s3://bucket/key")
Right now I have a List[String] with all these S3 paths in it, but I don't know how to pass it programmatically to the parquet function in Scala. There are way too many files to do it manually; any ideas how to get the files into the parquet function programmatically?
I've answered a similar question about repeated parameters earlier, here.
As @Dima mentioned, you are looking for the splat operator, because .parquet expects repeated arguments:
sqlContext.read.parquet(listOfStrings:_*)
More on repeated arguments in the Scala Language Specification, section 4.6.2.
Although that is the spec for Scala 2.9, this part hasn't changed.
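For reference, DataFrameReader.parquet is declared as parquet(paths: String*), so the call looks like this (the bucket paths below are placeholders):
val paths: List[String] = List("s3://bucket/key1", "s3://bucket/key2")
val df = sqlContext.read.parquet(paths: _*)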

Spark toDF cannot resolve symbol after importing sqlContext implicits

I'm working on writing some unit tests for my Scala Spark application
In order to do so I need to create different dataframes in my tests. So I wrote a very short DFsBuilder code that basically allows me to add new rows and eventually create the DF. The code is:
class DFsBuilder[T](private val sqlContext: SQLContext, private val columnNames: Array[String]) {
  var rows = new ListBuffer[T]()

  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }

  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames: _*) // UPDATE: added : _* because it was accidentally removed in the original question
  }
}
However, the toDF call doesn't compile, failing with cannot resolve symbol toDF.
I wrote this builder code with generics since I need to create different kinds of DFs (different numbers of columns and different column types). The way I would like to use it is to define some case class in the unit test and use it with the builder.
I know this issue somehow relates to the fact that I'm using generics (probably some kind of type erasure issue), but I can't quite put my finger on exactly what the problem is.
And so my questions are:
Can anyone show me where the problem is? And also hopefully how to fix it
If this issue cannot be solved this way, could someone perhaps offer another elegant way to create dataframes? (I prefer not to pollute my unit tests with the creation code)
I obviously googled this issue first, but only found examples where people forgot to import sqlContext.implicits, or something about a case class being out of scope, which is probably not the same issue I'm having.
Thanks in advance
If you look at the signatures of toDF and of SQLImplicits.localSeqToDataFrameHolder (which is the implicit function used) you'll be able to detect two issues:
Type T must be a subtype of Product (the parent trait of all case classes, tuples, etc.), and you must provide an implicit TypeTag for it. To fix this, change the declaration of your class to:
class DFsBuilder[T <: Product : TypeTag](...) { ... }
The argument expected by toDF is not an Array; it's a "repeated parameter" (like Java's "varargs", see section 4.6.2 here), so you have to expand the array into individual arguments:
rows.toList.toDF(columnNames: _*)
With these two changes, your code compiles (and works).
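Putting it together, a minimal sketch of the fixed builder plus hypothetical usage (the Person case class and the column names are made up for illustration):
import scala.collection.mutable.ListBuffer
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{DataFrame, SQLContext}

class DFsBuilder[T <: Product : TypeTag](private val sqlContext: SQLContext,
                                         private val columnNames: Array[String]) {
  private val rows = new ListBuffer[T]()

  def add(row: T): DFsBuilder[T] = {
    rows += row
    this
  }

  def build(): DataFrame = {
    import sqlContext.implicits._
    rows.toList.toDF(columnNames: _*)
  }
}

// Hypothetical usage in a test:
// case class Person(name: String, age: Int)
// val df = new DFsBuilder[Person](sqlContext, Array("name", "age"))
//   .add(Person("Alice", 30))
//   .add(Person("Bob", 25))
//   .build()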