Why does $ not work with values of type String (and only with the string literals directly)?

I have the following object which mimics an enumeration:
object ColumnNames {
val JobSeekerID = "JobSeekerID"
val JobID = "JobID"
val Date = "Date"
val BehaviorType = "BehaviorType"
}
Then I want to group a DF by a column. The following does not compile:
userJobBehaviourDF.groupBy($(ColumnNames.JobSeekerID))
If I change it to
userJobBehaviourDF.groupBy($"JobSeekerID")
It works.
How can I use $ and ColumnNames.JobSeekerID together to do this?

$ is a Scala feature called string interpolator.
Starting in Scala 2.10.0, Scala offers a new mechanism to create strings from your data: String Interpolation. String Interpolation allows users to embed variable references directly in processed string literals.
Spark leverages string interpolators in Spark SQL to convert $"col name" into a column.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type $"hello"
org.apache.spark.sql.ColumnName
ColumnName type is a subtype of Column type and that's why you can use $-prefixed strings as column references where values of Column type are expected.
import org.apache.spark.sql.Column
val c: Column = $"columnName"
scala> :type c
org.apache.spark.sql.Column
How can I use $ and ColumnNames.JobSeekerID together to do this?
You cannot.
You should either define the column names in the "enumeration" as Column values using $ directly (which changes their type from String to Column), or use the col or column functions wherever a Column is required.
col(colName: String): Column Returns a Column based on the given column name.
column(colName: String): Column Returns a Column based on the given column name.
$s Elsewhere
What's interesting is that Spark MLlib also uses $ to read ML parameters, as in $(param), but in this case $ is just a regular method.
protected final def $[T](param: Param[T]): T = getOrDefault(param)
It's also worth mentioning that (another) $ string interpolator is used in Catalyst DSL to create logical UnresolvedAttributes that could be useful for testing or Spark SQL internals exploration.
import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
String Interpolator in Scala
The string interpolator is resolved at compile time, so $ must be followed directly by a string literal; anything else fails to compile.
$ is akin to the s string interpolator:
Prepending s to any string literal allows the usage of variables directly in the string.
Scala provides three string interpolation methods out of the box: s, f and raw and you can write your own interpolator as Spark did.
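To illustrate how an interpolator such as $ can be written, here is a minimal sketch in plain Scala with no Spark dependency. FakeColumn, Interpolators, StringToFakeColumn and InterpolatorDemo are hypothetical names invented for this example; Spark's own implicit is, as far as I know, defined along similar lines but returns a ColumnName.

```scala
// FakeColumn is a hypothetical stand-in for Spark's Column so this runs without Spark.
final case class FakeColumn(name: String)

object Interpolators {
  // An implicit class over StringContext is what makes the $"..." syntax available.
  implicit class StringToFakeColumn(val sc: StringContext) {
    // The compiler rewrites $"a${x}b" into StringContext("a", "b").$(x).
    def $(args: Any*): FakeColumn = FakeColumn(sc.s(args: _*))
  }
}

object InterpolatorDemo {
  import Interpolators._

  val name = "JobSeekerID"

  // A plain literal: the whole text becomes the column name.
  val fromLiteral: FakeColumn = $"JobSeekerID"

  // A value embedded in the literal, the same way the s interpolator handles it.
  val fromValue: FakeColumn = $"$name"

  // $(name) would not compile here: $ is a method on StringContext,
  // not a function that is in scope on its own.
}
```

In this sketch a value can still be embedded inside the literal, as in $"$name", because the prefix always applies to a string literal; calling $ like an ordinary function is what fails.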

You can only use $ with string literals. If you want to use ColumnNames, you can do either of the following:
userJobBehaviourDF.groupBy(userJobBehaviourDF(ColumnNames.JobSeekerID))
userJobBehaviourDF.groupBy(col(ColumnNames.JobSeekerID))
From the Spark Docs for Column, here are different ways of representing a column:
df("columnName") // On a specific `df` DataFrame.
col("columnName") // A generic column not yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
Hope this helps!

Related

Selecting columns of Dataframe in Spark Scala

If you want to select the first column of a dataframe this can be done:
df.select(df.columns(0))
df.columns(0) returns a string, so by giving the name of the column, the select is able to get the column correctly.
Now, suppose I want to select the first 3 columns of the dataset, this is what I would intuitively do:
df.select(df.columns.split(0,3):_*)
The _* operator would, to my understanding, pass the array of strings as varargs, which would be the same as passing (df.columns(0), df.columns(1), df.columns(2)) to select. However, this doesn't work, and it is necessary to do this instead:
import org.apache.spark.sql.functions.col
df.select(df.columns.split(0,3).map(i => col(i)):_*)
What is going on?
I think in the question you meant slice instead of split.
And as for your question,
df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs, e.g. def select(cols: String*).
But there can only be one such function defined - no overloading here is possible.
Example on why it's not possible to define two different functions with same vararg-parameter declaration:
def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3
And in Spark, that one function is defined not for Strings but for Columns:
def select(cols: Column*)
For Strings, the method is declared like this:
def select(col: String, cols: String*)
I suggest you stick to Columns, like you do now, but with some syntactic sugar:
df.select(df.columns.slice(0,3).map(col):_*)
Or if there's a need to pass column names as Strings, then you can use selectExpr:
df.selectExpr(df.columns.slice(0,3):_*)
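The overload behaviour described above can be reproduced in plain Scala without Spark. This is only a sketch: Col, Selector and VarargsDemo are hypothetical names, and the two select methods merely mirror the shape of the signatures quoted in the answer.

```scala
// Col is a hypothetical stand-in for Spark's Column so this runs without Spark.
final case class Col(name: String)

object Selector {
  // Same parameter shapes as the two select signatures quoted above.
  def select(cols: Col*): String =
    cols.map(_.name).mkString("cols: ", ",", "")

  def select(col: String, cols: String*): String =
    (col +: cols).mkString("names: ", ",", "")
}

object VarargsDemo {
  val names: Array[String] = Array("a", "b", "c", "d")

  // Mapping to Col first matches the Col* overload.
  val viaCols: String = Selector.select(names.slice(0, 3).map(Col(_)): _*)

  // The String overload needs one explicit first argument before the varargs.
  val viaNames: String = Selector.select(names.head, names.slice(1, 3): _*)

  // Selector.select(names.slice(0, 3): _*) does not compile:
  // neither overload takes String* as its only parameter.
}
```

With the String-based signature, the head-and-tail form is the way to pass a whole array of names, since the first name has to be supplied as a separate argument.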

pyspark aws glue UDF multi parameter function? [duplicate]

I was thinking if it was possible to create an UDF that receives two arguments a Column and another variable (Object,Dictionary, or any other type), then do some operations and return the result.
Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
("Hayek", 60, 3000.00),
("Mises", 60, 1000.0)],
["name", "age", "balance"])
comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, and actually I'm trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but actually I want to know if it is achievable using an UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to an UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
Pass argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())
df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?

Create a new column in a Spark DataFrame using a var with constant value

I am trying to define a new column in a Spark DataFrame using a constant defined as a var. I'm using Zeppelin - in the initial cell, it starts with
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")
The file contains a column named birth_year; I want to create a new column named age defined as $year - birth_year, where birth_year is a string column. I'm not quite clear on how to do this when the input argument to a UDF is a parameter. I've done a couple hours of searching and created a UDF, but I got an error message whose principal part is
<console>:71: error: type mismatch;
found : Int
required: org.apache.spark.sql.Column
spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")
and a caret directly under 'year'.
I suspect that $year does not map into a variable of the same length as birth_year; I've seen the lit() function that appears to work for strings - does it work with integer values as well, or is there another function for this purpose?
I tried the following:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")
Any suggestions welcome - thanks in advance for any help.
You can't use year directly as an input to the UDF, since it expects columns to operate on. To create a column with a constant value, use lit(). You can call the UDF as follows:
df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))
However, it is preferable to use Spark's built-in functions whenever possible. In this case you do not need a UDF. Simply do:
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
This should be much faster.

Changing several Spark DataFrame column types, dynamically and configurable

I'm new to Spark and Scala.
We have an external data source feeding us JSON. This JSON has quotes around all values, including number and boolean fields, so by the time I get it into my DataFrame all the columns are strings. The end goal is to convert these JSON records into properly typed Parquet files.
There are approximately 100 fields, and I need to change several of the types from string to int, boolean, or bigint (long). Further, each DataFrame we process will only have a subset of these fields, not all of them. So I need to be able to handle subsets of columns for a given DataFrame, compare each column to a known list of column types, and cast certain columns from string to int, bigint, and boolean depending on which columns appear in the DataFrame.
Finally, I need the list of column types to be configurable because we'll have new columns in the future and may want to get rid of or change old ones.
So, here's what I have so far:
// first I convert to all lower case for column names
val df = dfIn.toDF(dfIn.columns map(_.toLowerCase): _*)
// Big mapping to change types
// TODO how would I make this configurable?
// I'd like to drive this list from an external config file.
val dfOut = df.select(
  df.columns.map {
    ///// Boolean
    case a @ "a" => df(a).cast(BooleanType).as(a)
    case b @ "b" => df(b).cast(BooleanType).as(b)
    ///// Integer
    case i @ "i" => df(i).cast(IntegerType).as(i)
    case j @ "j" => df(j).cast(IntegerType).as(j)
    // Bigint to Double
    case x @ "x" => df(x).cast(DoubleType).as(x)
    case y @ "y" => df(y).cast(DoubleType).as(y)
    case other => df(other)
  }: _*
)
Is this a good efficient way to transform this data to having the types I want in Scala?
I could use some advice on how to drive this off an external 'config' file where I could define the column types.
My question evolved into this question. Good answer given there:
Spark 2.2 Scala DataFrame select from string array, catching errors
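One possible way to make the mapping configurable, sketched here under assumptions rather than taken from the linked answer, is to keep the column types in a Map and cast by type name. ConfigurableCasts, columnTypes and castColumns are hypothetical names; the map is hard-coded for illustration and would have to be loaded from whatever config format is actually in use.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object ConfigurableCasts {
  // Hypothetical config: lower-case column name -> Spark SQL type name.
  // In practice this map could be read from an external file.
  val columnTypes: Map[String, String] = Map(
    "a" -> "boolean",
    "b" -> "boolean",
    "i" -> "int",
    "j" -> "int",
    "x" -> "double",
    "y" -> "double"
  )

  // Casts every column of df that is listed in types;
  // all other columns pass through unchanged.
  def castColumns(df: DataFrame, types: Map[String, String]): DataFrame =
    df.select(df.columns.map { name =>
      types.get(name) match {
        case Some(typeName) => col(name).cast(typeName).as(name)
        case None           => col(name)
      }
    }: _*)
}
```

Because the function iterates over df.columns, entries in the map for columns that a given DataFrame does not contain are simply ignored, which fits the requirement that each DataFrame carries only a subset of the fields. Column names containing dots would need escaping with backticks, which this sketch does not handle.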

Aggregate a Spark data frame using an array of column names, retaining the names

I would like to aggregate a Spark data frame using an array of column names as input, and at the same time retain the original names of the columns.
df.groupBy($"id").sum(colNames:_*)
This works but fails to preserve the names. Inspired by the answer found here, I unsuccessfully tried this:
df.groupBy($"id").agg(sum(colNames:_*).alias(colNames:_*))
error: no `: _*' annotation allowed here
It works to take a single element like
df.groupBy($"id").agg(sum(colNames(2)).alias(colNames(2)))
How can I make this happen for the entire array?
Just provide a sequence of columns with aliases:
val colNames: Seq[String] = ???
val exprs = colNames.map(c => sum(c).alias(c))
df.groupBy($"id").agg(exprs.head, exprs.tail: _*)