Selecting columns of Dataframe in Spark Scala

If you want to select the first column of a DataFrame, this can be done:
df.select(df.columns(0))
df.columns(0) returns a string, so by giving the name of the column, the select is able to get the column correctly.
Now, suppose I want to select the first 3 columns of the dataset. This is what I would intuitively do:
df.select(df.columns.split(0,3):_*)
To my understanding, the _* operator would pass the array of strings as varargs, so it would be the same as passing (df.columns(0), df.columns(1), df.columns(2)) to the select statement. However, this doesn't work and it is necessary to do this:
import org.apache.spark.sql.functions.col
df.select(df.columns.split(0,3).map(i => col(i)):_*)
What is going on?

I think in the question you meant slice instead of split.
And as for your question,
df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs, e.g. def select(cols: String*).
But only one such function can be defined: overloading on the vararg element type alone is not possible, because after erasure both variants would have the same signature.
Example of why it's not possible to define two different functions with the same vararg-parameter shape (in the REPL the second definition simply replaces the first):
def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3
And in Spark, that one function is defined not for Strings but for Columns:
def select(cols: Column*)
For Strings, the method is declared like this:
def select(col: String, cols: String*)
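If you did want to keep working with Strings through that overload, you would have to split off the first name yourself. A small sketch of that call shape (names is just a local helper value here):
val names = df.columns.slice(0, 3)
df.select(names.head, names.tail: _*) // head fills the leading String parameter, the rest are passed as varargs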
I suggest you stick to Columns, as you do now, but with some syntactic sugar:
df.select(df.columns.slice(0,3).map(col):_*)
Or if there's a need to pass column names as Strings, then you can use selectExpr:
df.selectExpr(df.columns.slice(0,3):_*)
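Since selectExpr parses each string as a SQL expression, it also accepts more than bare column names. For example (the column names here are only illustrative):
df.selectExpr("cast(age as int) as age", "upper(name) as name")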

Related

pyspark aws glue UDF multi parameter function? [duplicate]

I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (an object, a dictionary, or any other type), then does some operations and returns the result.
I attempted to do this but got an exception, so I was wondering whether there is any way to avoid this problem.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])

comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, while I'm actually trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but I want to know if it is achievable using a UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to a UDF is interpreted as a column / column name. If you want to pass a literal you have two options:
Pass argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
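For comparison, the same lit-based pattern in Scala would look roughly like this; a sketch that assumes an equivalent Scala DataFrame df with a name column:
import org.apache.spark.sql.functions.{col, lit, udf}

// Hypothetical Scala counterpart of the comparator above
val comparatorUDF = udf((c: String, n: String) => c == n)
df.where(comparatorUDF(col("name"), lit("Bonsanto"))).show()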

overloaded method value select with alternatives

I'm trying to select several columns and cast all of them, but I receive this error:
overloaded method value select with alternatives:
  (col: String, cols: String*)org.apache.spark.sql.DataFrame
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.Column, org.apache.spark.sql.Column, String)
the code is this:
val result = df.select(
  col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType),
  col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType),
  col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}"))
The last part of the error message means that the compiler cannot find a select method whose signature fits your code: select(Column, Column, String).
The compiler did find two candidate methods, but neither fits:
select(col: String, cols: String*)
select(cols: Column*)
(the * means "any number of")
This, I am sure of.
However, I don't understand why you get that error with the code you've given, which is actually select(Column, Column, Column) and fits the select(cols: Column*) API. For some reason the compiler considers the last argument to be a String; maybe a parenthesis is misplaced somewhere, as illustrated below.
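For illustration only (made-up column names, not the asker's actual code): if the last argument loses its col(...) wrapper, for example because a closing parenthesis ended up in the wrong place, select receives (Column, Column, String) and fails with exactly this error:
val wrong = df.select(
  col("a").cast(IntegerType),
  col("b").cast(IntegerType),
  s"c.d.e") // a plain String, not a Column, so neither select overload applies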
What I do in such cases is split the code to validate the types:
val col1: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType)
val col2: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType)
val col3: Column = col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}")
val result = df.select(col1, col2, col3)
and check that it compiles.

Why does $ not work with values of type String (and only with the string literals directly)?

I have the following object which mimics an enumeration:
object ColumnNames {
  val JobSeekerID = "JobSeekerID"
  val JobID = "JobID"
  val Date = "Date"
  val BehaviorType = "BehaviorType"
}
Then I want to group a DF by a column. The following does not compile:
userJobBehaviourDF.groupBy($(ColumnNames.JobSeekerID))
If I change it to
userJobBehaviourDF.groupBy($"JobSeekerID")
It works.
How can I use $ and ColumnNames.JobSeekerID together to do this?
$ is a Scala feature called a string interpolator.
Starting in Scala 2.10.0, Scala offers a new mechanism to create strings from your data: String Interpolation. String Interpolation allows users to embed variable references directly in processed string literals.
Spark leverages string interpolators in Spark SQL to convert $"col name" into a column.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type $"hello"
org.apache.spark.sql.ColumnName
The ColumnName type is a subtype of Column, which is why you can use $-prefixed strings as column references wherever values of type Column are expected.
import org.apache.spark.sql.Column
val c: Column = $"columnName"
scala> :type c
org.apache.spark.sql.Column
How can I use $ and ColumnNames.JobSeekerID together to do this?
You cannot.
You should either map the column names (in the "enumerator") to the Column type using $ directly (that would require changing their types to Column), or use the col or column functions where Columns are required (a sketch of the first option follows the signatures below).
col(colName: String): Column Returns a Column based on the given column name.
column(colName: String): Column Returns a Column based on the given column name.
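A minimal sketch of the first option, redefining the "enumerator" to hold Columns; this assumes a SparkSession value named spark is in scope so that spark.implicits._ provides $:
import org.apache.spark.sql.Column
import spark.implicits._

object ColumnNames {
  val JobSeekerID: Column = $"JobSeekerID"
  val JobID: Column = $"JobID"
}

userJobBehaviourDF.groupBy(ColumnNames.JobSeekerID)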
$s Elsewhere
What's interesting is that Spark MLlib uses $-prefixed strings for ML parameters, but in this case $ is just a regular method.
protected final def $[T](param: Param[T]): T = getOrDefault(param)
It's also worth mentioning that (another) $ string interpolator is used in Catalyst DSL to create logical UnresolvedAttributes that could be useful for testing or Spark SQL internals exploration.
import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
String Interpolator in Scala
A string interpolator is resolved at compile time and is applied to a string literal, so either $ prefixes a literal or the code fails to compile; you cannot call it like an ordinary function on a String value.
$ is akin to the s string interpolator:
Prepending s to any string literal allows the usage of variables directly in the string.
Scala provides three string interpolation methods out of the box: s, f and raw and you can write your own interpolator as Spark did.
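As a sketch of what "your own interpolator" could look like, here is a tiny illustrative interpolator that builds a Column (this is not part of Spark's API):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object ColumnInterpolation {
  implicit class ColumnInterpolator(val sc: StringContext) extends AnyVal {
    // c"..." builds a Column from the processed string, similar in spirit to Spark's $"..."
    def c(args: Any*): Column = col(sc.s(args: _*))
  }
}
// After import ColumnInterpolation._ the expression c"JobSeekerID" is equivalent to col("JobSeekerID")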
You can only use $ with string literals. If you want to use ColumnNames, you can do it as below:
userJobBehaviourDF.groupBy(userJobBehaviourDF(ColumnNames.JobSeekerID))
userJobBehaviourDF.groupBy(col(ColumnNames.JobSeekerID))
From the Spark Docs for Column, here are different ways of representing a column:
df("columnName") // On a specific `df` DataFrame.
col("columnName") // A generic column no yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
Hope this helps!

Changing several Spark DataFrame column types, dynamically and configurable

I'm new to Spark and Scala.
We have an external data source feeding us JSON. This JSON has quotes around all values, including number and boolean fields, so by the time I get it into my DataFrame all the columns are strings. The end goal is to convert these JSON records into properly typed Parquet files.
There are approximately 100 fields, and I need to change several of the types from string to int, boolean, or bigint (long). Further, each DataFrame we process will only have a subset of these fields, not all of them. So I need to be able to handle subsets of columns for a given DataFrame, compare each column to a known list of column types, and cast certain columns from string to int, bigint, and boolean depending on which columns appear in the DataFrame.
Finally, I need the list of column types to be configurable because we'll have new columns in the future and may want to get rid of or change old ones.
So, here's what I have so far:
// first I convert to all lower case for column names
val df = dfIn.toDF(dfIn.columns map(_.toLowerCase): _*)
// Big mapping to change types
// TODO how would I make this configurable?
// I'd like to drive this list from an external config file.
val dfOut = df.select(
  df.columns.map {
    ///// Boolean
    case a @ "a" => df(a).cast(BooleanType).as(a)
    case b @ "b" => df(b).cast(BooleanType).as(b)
    ///// Integer
    case i @ "i" => df(i).cast(IntegerType).as(i)
    case j @ "j" => df(j).cast(IntegerType).as(j)
    // Bigint to Double
    case x @ "x" => df(x).cast(DoubleType).as(x)
    case y @ "y" => df(y).cast(DoubleType).as(y)
    case other => df(other)
  }: _*
)
Is this a good, efficient way to transform this data into the types I want in Scala?
I could use some advice on how to drive this off an external 'config' file where I could define the column types.
My question evolved into this question. Good answer given there:
Spark 2.2 Scala DataFrame select from string array, catching errors
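For reference, one way to make the mapping configurable is to drive the casts from a name-to-type map and fall back to leaving the column unchanged. This is only a sketch with hypothetical column names; in practice the map would be built from an external config file:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{BooleanType, DataType, DoubleType, IntegerType}

val typeMap: Map[String, DataType] = Map(
  "a" -> BooleanType, "b" -> BooleanType,
  "i" -> IntegerType, "j" -> IntegerType,
  "x" -> DoubleType, "y" -> DoubleType)

def castColumns(df: DataFrame, types: Map[String, DataType]): DataFrame =
  df.select(df.columns.map { c =>
    types.get(c).map(t => df(c).cast(t).as(c)).getOrElse(df(c))
  }: _*)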

Aggregate a Spark data frame using an array of column names, retaining the names

I would like to aggregate a Spark data frame using an array of column names as input, and at the same time retain the original names of the columns.
df.groupBy($"id").sum(colNames:_*)
This works but fails to preserve the names. Inspired by the answer found here, I unsuccessfully tried this:
df.groupBy($"id").agg(sum(colNames:_*).alias(colNames:_*))
error: no `: _*' annotation allowed here
It works for a single element, like
df.groupBy($"id").agg(sum(colNames(2)).alias(colNames(2)))
How can I make this happen for the entire array?
Just provide a sequence of columns with aliases:
import org.apache.spark.sql.functions.sum

val colNames: Seq[String] = ???
val exprs = colNames.map(c => sum(c).alias(c))
df.groupBy($"id").agg(exprs.head, exprs.tail: _*)