Create a new column in a Spark DataFrame using a var with constant value - scala

I am trying to define a new column in a Spark DataFrame using a constant defined as a var. I'm using Zeppelin; the initial cell starts with:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")
The file contains a column named birth_year; I want to create a new column named age, defined as $year - birth_year, where birth_year is a string column. I'm not quite clear on how to do this when one of the input arguments to a UDF is a parameter rather than a column. After a couple of hours of searching I created a UDF, but I got an error message whose principal part is:
<console>:71: error: type mismatch;
found : Int
required: org.apache.spark.sql.Column
spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")
and a caret directly under 'year'.
I suspect that $year does not map to a column of the same length as birth_year; I've seen the lit() function, which appears to work for strings - does it work with integer values as well, or is there another function for this purpose?
I tried the following:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")
Any suggestions welcome - thanks in advance for any help.

You can't use year directly as an input to the UDF, since it expects columns to operate on. To create a column with a constant value, use lit(). You can call the UDF as follows:
df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))
However, it is always preferable to use Spark's built-in functions when possible. In this case you do not need a UDF at all. Simply do:
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
This should be much faster.
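Putting it together, here is a minimal sketch of the whole Zeppelin paragraph using the built-in approach (the parquet path is the placeholder from the question):
%spark
import org.apache.spark.sql.functions._

var year: Int = 2016

val df = spark.read.parquet("<path/to/file>")
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
  .createOrReplaceTempView("tmp")
Since the constant never changes here, a val would also do; lit() works the same either way.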

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date.
I tried different methods, but I cannot pass the column in directly, and expr gives an error like:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a pyspark question, in Python you can do the following:
from pyspark.sql import functions as F

df_a3.withColumn("date_offset", F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, assuming there is an extra column named type, the following can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()

In a Scala notebook on Apache Spark Databricks how do you correctly cast an array to type decimal(30,0)?

I am trying to cast an array as Decimal(30,0) for use in a select dynamically as:
WHERE array_contains(myArrayUDF(), someTable.someColumn)
However when casting with:
val arrIds = someData.select("id").withColumn("id", col("id")
.cast(DecimalType(30, 0))).collect().map(_.getDecimal(0))
Databricks accepts that, however the resulting signature already looks wrong:
intArrSurrIds: Array[java.math.BigDecimal] = Array(2181890000000,...) // ie, a BigDecimal
Which results in the below error:
Error in SQL statement: AnalysisException: cannot resolve.. due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<decimal(38,18)>, decimal(30,0)]
How do you correctly cast as decimal(30,0) in Spark Databricks Scala notebook instead of decimal(38,18) ?
Any help appreciated!
You can make arrIds an Array[Decimal] using the code below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Decimal, DecimalType}
val arrIds = someData.select("id")
.withColumn("id", col("id").cast(DecimalType(30, 0)))
.collect()
.map(row => Decimal(row.getDecimal(0), 30, 0))
However, it will not solve your problem, because you lose the precision and scale once you create your user-defined function, as I explain in this answer.
To solve your problem, you need to cast the column someTable.someColumn to Decimal with the same precision and scale as the UDF's return type. So your WHERE clause should be:
WHERE array_contains(myArrayUDF(), cast(someTable.someColumn as Decimal(38, 18)))
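For what it's worth, the decimal(38,18) in the error message comes from how Spark infers UDF return types. Here is a small, illustrative sketch (the UDF below is just a stand-in for myArrayUDF, not the one from the question):
import org.apache.spark.sql.functions.udf

// Spark's reflection maps java.math.BigDecimal (and Decimal) return types to
// decimal(38,18), so the UDF is declared as array<decimal(38,18)> regardless of
// how the values were created.
val sampleArrayUDF = udf(() => Array(new java.math.BigDecimal("2181890000000")))
spark.range(1).select(sampleArrayUDF().as("ids")).printSchema()
// the element type should print as decimal(38,18), not decimal(30,0)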

pyspark udf with parameter

I need to convert a PySpark DataFrame column checkin_time from milliseconds to a timezone-adjusted timestamp; the timezone information is in another column, tz_info.
I tried the following:
def tz_adjust(x, tz_info):
    if tz_info:
        y = col(x) + col(tz_info)
        return from_unixtime(col(y) / 1000)
    else:
        return from_unixtime(col(x) / 1000)

def udf_tz_adjust(tz_info):
    return udf(lambda l: tz_adjust(l, tz_info))
When applying this UDF to the column
df.withColumn('checkin_time',udf_tz_adjust('time_zone')(col('checkin_time')))
got some error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pass the second column as a parameter to the UDF?
Thanks.
IMHO, what you are doing is a combination of a UDF and a partial function, which could get tricky. I don't think you need a UDF at all for your purpose. You can do the following:
#not tested
from pyspark.sql.functions import *
df.withColumn('checkin_time',
              when(col('tz_info').isNotNull(),
                   from_unixtime((col('checkin_time') + col('tz_info')) / 1000))
              .otherwise(from_unixtime(col('checkin_time') / 1000)))
UDFs have their own serde inefficiencies, which are even worse with Python because of the extra overhead of converting Scala data types into Python data types.

change a dataframe row value with dynamic number of columns spark scala

I have a dataframe (containing 10 columns) for which I want to change the value of a row (for the last column only). I have written the following code for this:
val newDF = spark.sqlContext.createDataFrame(WRADF.rdd.map(r => {
  Row(r.get(0), r.get(1),
      r.get(2), r.get(3),
      r.get(4), r.get(5),
      r.get(6), r.get(7),
      r.get(8), decrementCounter(r))
}), WRADF.schema)
I want to change the value of a row for the 10th column only (for which I wrote the decrementCounter() function), but the above code only works for dataframes with 10 columns. I don't know how to rewrite this code so that it can run on dataframes with a different number of columns. Any help will be appreciated.
Don't do something like this. Define a UDF instead:
import org.apache.spark.sql.functions.udf
val decrementCounter = udf((x: T) => ...) // adjust types and content to your requirements
df.withColumn("someName", decrementCounter($"someColumn"))
I think a UDF will be a better choice, because it can be applied using the column name itself.
For more on udf you can take a look here : https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
For your code just use this :
import org.apache.spark.sql.functions.udf
val decrementCounterUDF = udf(decrementCounter _)
df.withColumn("columnName", decrementCounterUDF($"columnName"))
What this does is apply the decrementCounter function to each and every value of the column columnName.
I hope this helps, cheers!
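Here is a hedged sketch of the suggested approach, assuming the 10th column is an Int named counter and that decrementCounter can be rewritten to take the column's value instead of the whole Row (the names and types below are placeholders):
import org.apache.spark.sql.functions.udf

// Placeholder logic; adjust the type and body to your actual decrementCounter.
def decrementCounter(x: Int): Int = x - 1

val decrementCounterUDF = udf(decrementCounter _)

// Only the one column is touched, so this works for any number of columns.
val newDF = WRADF.withColumn("counter", decrementCounterUDF($"counter"))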

Why does $ not work with values of type String (and only with the string literals directly)?

I have the following object, which mimics an enumeration:
object ColumnNames {
  val JobSeekerID = "JobSeekerID"
  val JobID = "JobID"
  val Date = "Date"
  val BehaviorType = "BehaviorType"
}
Then I want to group a DF by a column. The following does not compile:
userJobBehaviourDF.groupBy($(ColumnNames.JobSeekerID))
If I change it to
userJobBehaviourDF.groupBy($"JobSeekerID")
It works.
How can I use $ and ColumnNames.JobSeekerID together to do this?
$ is a string interpolator, which relies on a Scala feature called String Interpolation.
Starting in Scala 2.10.0, Scala offers a new mechanism to create strings from your data: String Interpolation. String Interpolation allows users to embed variable references directly in processed string literals.
Spark leverages string interpolators in Spark SQL to convert $"col name" into a column.
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
scala> :type $"hello"
org.apache.spark.sql.ColumnName
ColumnName type is a subtype of Column type and that's why you can use $-prefixed strings as column references where values of Column type are expected.
import org.apache.spark.sql.Column
val c: Column = $"columnName"
scala> :type c
org.apache.spark.sql.Column
How can I use $ and ColumnNames.JobSeekerID together to do this?
You cannot.
You should either map the column names (in the "enumeration") to the Column type using $ directly (that would require changing their types to Column), or use the col or column functions where Columns are required.
col(colName: String): Column Returns a Column based on the given column name.
column(colName: String): Column Returns a Column based on the given column name.
$s Elsewhere
What's interesting is that Spark MLlib uses $-prefixed strings for ML parameters, but in this case $ is just a regular method.
protected final def $[T](param: Param[T]): T = getOrDefault(param)
It's also worth mentioning that (another) $ string interpolator is used in Catalyst DSL to create logical UnresolvedAttributes that could be useful for testing or Spark SQL internals exploration.
import org.apache.spark.sql.catalyst.dsl.expressions._
scala> :type $"hello"
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
String Interpolator in Scala
The string interpolator feature works (is resolved to a proper value) at compile time so either it is a string literal or it's going to fail.
$ is akin to the s string interpolator:
Prepending s to any string literal allows the usage of variables directly in the string.
Scala provides three string interpolation methods out of the box: s, f and raw and you can write your own interpolator as Spark did.
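As a small illustration of that last point, you could define your own interpolator that produces a Column, similar to what Spark does internally (the name c below is made up for this sketch; Spark's own $ comes in via spark.implicits._):
import org.apache.spark.sql.{Column, ColumnName}

implicit class ColInterpolator(sc: StringContext) {
  // Builds a Column from the interpolated string, much like Spark's $ interpolator.
  def c(args: Any*): Column = new ColumnName(sc.s(args: _*))
}

val helloCol: Column = c"hello"  // same idea as $"hello" above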
You can only use $ with string literals (values). If you want to use ColumnNames, you can do it as below:
userJobBehaviourDF.groupBy(userJobBehaviourDF(ColumnNames.JobSeekerID))
userJobBehaviourDF.groupBy(col(ColumnNames.JobSeekerID))
From the Spark Docs for Column, here are different ways of representing a column:
df("columnName") // On a specific `df` DataFrame.
col("columnName") // A generic column no yet associated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
Hope this helps!