I am trying to create a function/UDF for currency conversions that I can maybe reuse in a Spark notebook. It requires a SQL statement; is it possible to add it to the UDF like so? If not, what can I do?
Something like:
def curexch(from_cur, exch_rt, exdate, to_cur):
    if from_cur == 'USD':
        return 1
    else:
        return (select excrt from exchange_rate_tbl a where exdate >= a.exdate and to_cur = a.to_cur)
Do I have to add # to the reference of the join?
As suggested by Skin and as per this Microsoft document, you can create UDFs in Azure Databricks, and here is the sample code for it.
Register a function as a UDF:
def squared(s):
    return s * s
spark.udf.register("squaredWithPython", squared)
You can even set your own return type for the UDF; the default return type is StringType.
from pyspark.sql.types import LongType
def squared_typed(s):
    return s * s
spark.udf.register("squaredWithPython", squared_typed, LongType())
I was creating a UDF based on the function below:
def return_output(column):
    return {'features': {'site': 'a.com', 'test': column, 'test_vocab': ['a', 'b', 'c']}}
but I am not sure how to define the return type.
One example for column would be {"sentence": [0,1,2], "another_one": [0,1,2]}, so the final output would look like the below:
{'features': {'home_page': 'a.com', 'test': {"sentence": [0,1,2], "another_one": [0,1,2]}, 'test_vocab': ['a', 'b', 'c']}}
How am I supposed to define the return type for this output?
This looks very JSON-like, so the correct type should be StructType. You can read more about it here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.types.StructType.html#structtype
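For example, a schema for the output described above might look roughly like the following sketch. The field names (site, test, test_vocab) are taken from the function in the question (adjust site to home_page if that is the intended key), and test is modelled as a map from strings to arrays of integers to match {"sentence":[0,1,2],"another_one":[0,1,2]}:
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType, MapType

# Sketch of a StructType matching {'features': {'site': ..., 'test': {...}, 'test_vocab': [...]}}
output_schema = StructType([
    StructField("features", StructType([
        StructField("site", StringType()),
        StructField("test", MapType(StringType(), ArrayType(IntegerType()))),
        StructField("test_vocab", ArrayType(StringType()))
    ]))
])

# A dict returned by the Python function is mapped onto the struct by field name
return_output_udf = udf(return_output, output_schema)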
I am analysing the following piece of code:
from pyspark.sql.functions import udf, col, desc
from pyspark.sql.types import FloatType

def error(value, pred):
    return abs(value - pred)

udf_MAE = udf(lambda value, pred: error(value=value, pred=pred), FloatType())
I know a UDF is a user-defined function, but I don't understand what that means, because udf wasn't defined anywhere previously in the code?
User Defined Functions (UDFs) are useful when you need to define logic specific to your use case and when you need to encapsulate that solution for reuse. They should only be used when there is no clear way to accomplish a task using built-in functions. (Azure Databricks)
Create your function (after you have made sure there is no built-in function that performs a similar task):
def greatingFunc(name):
    return f'hello {name}!'
Then you need to register your function as a UDF by designating the following:
A name for access in Python (myGreatingUDF)
The function itself (greatingFunc)
The return type for the function (StringType)
from pyspark.sql.types import StringType
myGreatingUDF = spark.udf.register("myGreatingUDF", greatingFunc, StringType())
Now you can call your UDF anytime you need it:
guest = 'John'
print(myGreatingUDF(guest))
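Because spark.udf.register also makes the function available to the SQL engine, you can call it from a SQL query as well. A small sketch, assuming a hypothetical temporary view named guests with a name column:
# Hypothetical data just to demonstrate calling the registered UDF from SQL
df = spark.createDataFrame([("John",), ("Jane",)], ["name"])
df.createOrReplaceTempView("guests")
spark.sql("SELECT name, myGreatingUDF(name) AS greeting FROM guests").show()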
Say I have a DataFrame which contains a column (called colA) which is a Seq of Row. I want to append a new field to each record of colA. (And the new field is derived from the existing record, so I have to write a UDF.)
How should I write this udf?
I have tried to write a UDF which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is that the UDF cannot return Seq[Row]; the exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:
val schema = ArrayType(DoubleType)
val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the field limit for tuples/case classes).
This is an old question; I just wanted to update it according to the newer versions of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking, and the behavior might be different from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set the spark.sql.legacy.allowUntypedScalaUDF configuration to true.
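That configuration can be set at runtime; for example (the same call works from both the Scala and Python APIs):
spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")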
Another solution is to use a case class instead of a schema. For example:
case class Foo(field1: String, field2: String)
val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map {
    x => // do something with x and convert to Foo
  }
}
val myUdf = udf(convertFunction)
I have an issue: I want to check if an array of strings contains a string present in another column. I am currently using the below code, which is giving an error.
.withColumn("is_designer_present", when(array_contains(col("list_of_designers"),$"dept_resp"),1).otherwise(0))
error:
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.ColumnName dept_resp
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
You can write a udf function to get your job done:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => array.contains(str))
df.withColumn("is_designer_present", when(stringContains(col("list_of_designers"), $"dept_resp"),1).otherwise(0))
You can return the appropriate value from the udf function itself, so that you don't have to use the when function:
import org.apache.spark.sql.functions._
def stringContains = udf((array: collection.mutable.WrappedArray[String], str: String) => if (array.contains(str)) 1 else 0)
df.withColumn("is_designer_present", stringContains(col("list_of_designers"), $"dept_resp"))
With Spark 1.6 you can wrap your array_contains() as a string inside the expr() function:
import org.apache.spark.sql.functions.expr
.withColumn("is_designer_present",
when(expr("array_contains(list_of_designers, dept_resp)"), 1).otherwise(0))
This form of array_contains inside the expr can accept a column as the second argument.
I know that this is a somewhat old question, but I found myself in a similar predicament and found the following solution. It uses Spark native functions (so it doesn't suffer from UDF-related performance regressions), and it does not rely on string expressions (which are hard to maintain).
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.ArrayContains

def array_contains_column(arrayColumn: Column, valueColumn: Column): Column = {
  new Column(ArrayContains(arrayColumn.expr, valueColumn.expr))
}
// ...
df.withColumn(
  "is_designer_present",
  when(
    array_contains_column(col("list_of_designers"), col("dept_resp")),
    1
  ).otherwise(0)
)
You can do it without UDFs using explode.
.withColumn("exploCol", explode($"dept_resp"))
.withColumn("aux", when($"exploCol" === col("list_of_designers"), 1).otherwise(0))
.drop("exploCol")
.groupBy($"dep_rest") //all cols except aux
.agg(sum($"aux") as "result")
And there you go, if result > 0 then "dept_rest" contains the value.
How can I use a UDF which works great in Spark, like
sparkSession.sql("select * from chicago where st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom)").show
taken from http://www.geomesa.org/documentation/user/spark/sparksql.html,
via Spark's more typesafe Scala DataFrame API?
If you have created a function, you can register the created UDF using:
sparkSession.sqlContext.udf.register("yourFunctionName", yourFunction)
I hope this helps.
@Oliviervs, I think he's looking for something different. I think Georg wants to use the UDF by its string name in the select API of the DataFrame. For example:
val squared = (s: Long) => {
  s * s
}
spark.udf.register("square", squared)
df.select(getUdf("square", col("num")).as("newColumn")) // something like this
The question at hand is whether there exists a function called getUdf that could be used to retrieve a UDF registered via its string name. Georg, is that right?