pyspark udf with parameter - pyspark

I need to convert a PySpark DataFrame column, checkin_time, from milliseconds to a timezone-adjusted timestamp. The timezone information is in another column, tz_info.
I tried the following:
def tz_adjust(x, tz_info):
    if tz_info:
        y = col(x) + col(tz_info)
        return from_unixtime(col(y)/1000)
    else:
        return from_unixtime(col(x)/1000)
def udf_tz_adjust(tz_info):
    return udf(lambda l: tz_adjust(l, tz_info))
When applying this UDF to the column
df.withColumn('checkin_time',udf_tz_adjust('time_zone')(col('checkin_time')))
I got this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pass the second column as a parameter to the UDF?
Thanks.

IMHO, what you are doing combines a UDF with a partial function, which can get tricky. The error itself comes from calling col and from_unixtime inside the UDF body: those functions build column expressions on the driver and cannot run on the executors, where no SparkContext exists. I don't think you need a UDF at all for this purpose. You can do the following:
# not tested
from pyspark.sql.functions import col, from_unixtime, when
df.withColumn('checkin_time', when(col('tz_info').isNotNull(), from_unixtime(((col('checkin_time') + col('tz_info')) / 1000).cast('long'))).otherwise(from_unixtime((col('checkin_time') / 1000).cast('long'))))
A UDF also has its own serialization and deserialization overhead, which is even worse in Python because of the extra cost of converting data between JVM and Python data types.
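If a UDF is still wanted, one option is to give it two arguments and pass both columns, keeping the body plain Python with no calls to col or from_unixtime. The sketch below is only an illustration: it assumes tz_info holds an offset in milliseconds (the question adds it directly to checkin_time), it formats the result in UTC, and the parameter names checkin_ms and tz_offset_ms are made up for the example.

```python
from datetime import datetime, timezone

def tz_adjust(checkin_ms, tz_offset_ms):
    """Shift a millisecond epoch value by an offset and format it as a string."""
    if checkin_ms is None:
        return None
    # Treat a missing offset as zero (assumption: offsets are in milliseconds).
    shifted_ms = checkin_ms + (tz_offset_ms or 0)
    # Drop the millisecond part and interpret the result as seconds since the epoch.
    ts = datetime.fromtimestamp(shifted_ms // 1000, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:%M:%S")

print(tz_adjust(1500000000000, -18000000))  # 2017-07-13 21:40:00
print(tz_adjust(1500000000000, None))       # 2017-07-14 02:40:00

# In PySpark the same function could be wrapped as a two-column UDF (not run here):
#   from pyspark.sql.functions import udf, col
#   from pyspark.sql.types import StringType
#   tz_adjust_udf = udf(tz_adjust, StringType())
#   df.withColumn('checkin_time', tz_adjust_udf(col('checkin_time'), col('tz_info')))
```

Because both values arrive as ordinary Python objects for each row, nothing in the function body touches the Spark JVM, which is what triggered the '_jvm' error in the original attempt.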

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I cannot pass the column as an argument, and expr raises the error:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime inside expr, I have to use this approach, which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
df_a3.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: As per the comment below, let's assume there is an extra column named type; based on it, the following code can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()
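As a quick sanity check on what the day-offset expression should produce for the sample row (offset 50), the same arithmetic can be done in plain Python. This is only a sketch for verifying expected values outside Spark; BASE_DATE and add_offset_days are names invented for the example.

```python
from datetime import date, timedelta

# The fixed date from the question, written as a Python date.
BASE_DATE = date(2000, 1, 1)

def add_offset_days(offset_days):
    """Return BASE_DATE shifted forward by offset_days days."""
    return BASE_DATE + timedelta(days=offset_days)

print(add_offset_days(50))  # 2000-02-20
```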

pyspark aws glue UDF multi parameter function? [duplicate]

I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (object, dictionary, or any other type), performs some operations, and returns the result.
I attempted to do this but got an exception, so I would like to know whether there is a way to avoid this problem.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])
comparatorUDF = udf(lambda c, n: c == n, BooleanType())
df.where(comparatorUDF(col("name"), "Bonsanto")).show()
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns
name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, while I'm actually trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but I want to know whether this is achievable using a UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#| name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Everything that is passed to a UDF is interpreted as a column or column name. If you want to pass a literal, you have two options:
Pass the argument using currying:
def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())
df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal and the current implementation:
from pyspark.sql.functions import lit
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types see How to add a constant column in a Spark DataFrame?
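The currying option relies on an ordinary Python closure: the outer function captures the literal, and the inner function is what gets applied to each value. A minimal Spark-free sketch of that mechanism follows; make_comparator and is_bonsanto are illustrative names, not part of any library.

```python
def make_comparator(n):
    """Return a one-argument function that compares its input with n."""
    def compare(c):
        # n is captured from the enclosing call, so it is not passed per value.
        return c == n
    return compare

is_bonsanto = make_comparator("Bonsanto")
names = ["Bonsanto", "Hayek", "Mises"]
matches = [name for name in names if is_bonsanto(name)]
print(matches)  # ['Bonsanto']
```

In the UDF version, the inner function is the lambda handed to udf, and the captured value is serialized along with it, which is why it has to be serializable.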

In a Scala notebook on Apache Spark Databricks how do you correctly cast an array to type decimal(30,0)?

I am trying to cast an array as Decimal(30,0) for use in a select dynamically as:
WHERE array_contains(myArrayUDF(), someTable.someColumn)
However when casting with:
val arrIds = someData.select("id").withColumn("id", col("id")
.cast(DecimalType(30, 0))).collect().map(_.getDecimal(0))
Databricks accepts that; however, the resulting signature already looks wrong:
intArrSurrIds: Array[java.math.BigDecimal] = Array(2181890000000,...) // ie, a BigDecimal
Which results in the below error:
Error in SQL statement: AnalysisException: cannot resolve.. due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<decimal(38,18)>, decimal(30,0)]
How do you correctly cast as decimal(30,0) in Spark Databricks Scala notebook instead of decimal(38,18) ?
Any help appreciated!
You can make arrIds an Array[Decimal] using the code below:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{Decimal, DecimalType}
val arrIds = someData.select("id")
  .withColumn("id", col("id").cast(DecimalType(30, 0)))
  .collect()
  .map(row => Decimal(row.getDecimal(0), 30, 0))
However, this will not solve your problem, because the precision and scale are lost once you create your user-defined function, as I explain in this answer.
To solve your problem, cast the column someTable.someColumn to Decimal with the same precision and scale as the UDF's return type. Your WHERE clause should then be:
WHERE array_contains(myArrayUDF(), cast(someTable.someColumn as Decimal(38, 18)))

Call function on Dataframe's columns has error TypeError: Column is not iterable

I am using Databricks with Spark 2.4, and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
SELECT
SK_ID_Site, Designation_Site
FROM db_xxx.t_xxx
ORDER BY SK_ID_Site
limit 2"""
lvResult = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is call a function for each row in my DataFrame. I would like to pass columns as parameters and get a result.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark DataFrame.
There are already built-in functions that do the job for you.
from pyspark.sql import functions as F
a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need more complex logic that the built-in functions cannot express, you can use a UDF, but it may significantly hurt performance, so check for an existing built-in function before writing your own UDF.

Create a new column in a Spark DataFrame using a var with constant value

I am trying to define a new column in a Spark DataFrame using a constant defined as a var. I'm using Zeppelin - in the initial cell, it starts with
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
spark.read.parquet("<path/to/file>")
The file contains a column named birth_year; I want to create a new column named age defined as $year - birth_year, where birth_year is a string column. I'm not quite clear on how to do this when the input argument to a UDF is a parameter. I've done a couple hours of searching and created a UDF, but I got an error message whose principal part is
<console>:71: error: type mismatch;
found : Int
required: org.apache.spark.sql.Column
spark.read.parquet("path/to/file").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge(year, col("birth_year"))).createOrReplaceTempView("tmp")
and a caret directly under 'year'.
I suspect that $year does not map into a variable of the same length as birth_year; I've seen the lit() function that appears to work for strings - does it work with integer values as well, or is there another function for this purpose?
I tried the following:
%spark
import org.apache.spark.sql.functions._
var year : Int = 2016
def createAge = udf((yr : Int, dob : Int) => {yr - dob})
spark.read.parquet("<path/to/file>").withColumn("birth_year", $"birth_year" cast "Int").withColumn("age", createAge($"year", col("birth_year"))).createOrReplaceTempView("tmp")
Any suggestions welcome - thanks in advance for any help.
You can't use year directly as an input to the UDF, since it expects columns to operate on. To create a column with a constant value, use lit(), which works for integers as well as strings. You can call the UDF as follows:
df.withColumn("age", createAge(lit(year), $"birth_year".cast("int")))
However, it is preferable to use Spark's built-in functions whenever possible. In this case, you do not need a UDF. Simply do:
df.withColumn("age", lit(year) - $"birth_year".cast("int"))
This should be much faster.