Use Spark SQL UDF in DataFrame API - Scala

How can I use a UDF that works well in Spark SQL, such as
sparkSession.sql("select * from chicago where st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom)").show
(taken from http://www.geomesa.org/documentation/user/spark/sparksql.html),
via Spark's more type-safe Scala DataFrame API?

If you have created a function, you can register it as a UDF under a name of your choice:
sparkSession.udf.register("yourFunctionName", yourFunction)
I hope this helps.

Oliviervs, I think he's looking for something different. I think Georg wants to refer to the UDF by its registered name in the select API of the DataFrame. For example:
val squared = (s: Long) => {
s * s
}
spark.udf.register("square", squared)
df.select(getUdf("square", col("num")).as("newColumn")) // something like this
The question at hand is whether a function like getUdf exists that retrieves a registered UDF by its name. Georg, is that right?
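A minimal sketch of two ways to get this effect without the hypothetical getUdf: spark.udf.register returns a UserDefinedFunction that can be applied to columns directly, and functions.expr parses a SQL expression string, so the UDF can be called by its registered name. The local SparkSession, the sample data and the names viaHandle and viaName are assumptions for illustration, and the code is written as a script (for example for spark-shell).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("udf-by-name").getOrCreate()
import spark.implicits._

val df = Seq(1L, 2L, 3L).toDF("num")

val squared = (s: Long) => s * s

// register makes the function visible to SQL under the name "square"
// and also returns a handle usable in the DataFrame API
val squareUdf = spark.udf.register("square", squared)

// Option 1: apply the returned handle to a Column
val viaHandle = df.select(squareUdf(col("num")).as("newColumn"))

// Option 2: refer to the UDF by its registered name inside a SQL expression
val viaName = df.select(expr("square(num)").as("newColumn"))

viaHandle.show()
viaName.show()
```

Assuming the GeoMesa functions are registered in the session, as they must be for the SQL query in the question to work, the same approach would presumably look like df.where(expr("st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom)")).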

Related

I would like to use a UDF in synapse analytics

I am trying to create a function/UDF for currency conversions that I can reuse in a Spark notebook. It requires a SQL statement; is it possible to add it to the UDF like so? If not, what can I do?
Something like:
def curexch(from_cur, exch_rt, exdate, to_cur):
    if from_cur == 'USD':
        return 1
    else:
        return (select excrt from exchange_rate_tbl a where exdate >= a.exdate and to_cur = a.to_cur)
Do I have to add # to the reference of the join?
As suggested by Skin, and as per this Microsoft document, you can create a UDF in Azure Functions; here is some sample code for it.
Register a function as a UDF:
def squared(s):
    return s * s
spark.udf.register("squaredWithPython", squared)
You can also set the return type of your UDF; the default return type is StringType.
from pyspark.sql.types import LongType
def squared_typed(s):
    return s * s
spark.udf.register("squaredWithPython", squared_typed, LongType())

How to map a dataframe of datetime strings in spark to a dataframe of booleans?

I want to basically check whether every value in my dataframe of dates is the correct format "MM/dd/yy".
val df: DataFrame = spark.read.csv("----")
However, whenever I apply the function map:
df.map(x => right_format(x)).show()
and try to show this new dataframe/dataset, I'm getting a nonserializable error.
Does anyone know why?
I've tried to debug by using the intellij debugger, but to no avail.
Expected results: dataframe of boolean values
Actual results: Nonserializable error.
Does the non-Serializable error say something like SparkContext is non serializable?
Map runs in a distributed manner, and Spark will attempt to serialize and send right_format function def to all the nodes.
It looks like right_format is defined in the same scope as objects such as your SparkContext instance (for example, is all this in your main() method call?).
To get around this, I think you could do one of two things:
1. Define right_format() within the map block:
df.map(x => {
  def right_format(elem: Row): Boolean = {...}
  right_format(x)
}).show()
2. Define a standalone object (or a trait) of helper functions that includes the definition of right_format.
Referencing a standalone object does not capture the enclosing scope, so Spark no longer has to serialize non-serializable things such as the SparkContext. This should solve the issue that you're facing.
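The second option can be sketched as follows. The object name DateChecks and the method name rightFormat are hypothetical, and the check assumes the value arrives as a plain String; it uses java.time with the "MM/dd/yy" pattern from the question.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Top-level object: each executor JVM initializes its own copy, so the
// closure passed to map does not drag in anything from the driver's scope.
object DateChecks {
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yy")

  // True when the string parses with the MM/dd/yy pattern.
  // A null input fails inside Try and therefore yields false.
  def rightFormat(s: String): Boolean =
    Try(LocalDate.parse(s, fmt)).isSuccess
}
```

With import spark.implicits._ in scope, a call could then look like df.map(row => DateChecks.rightFormat(row.getString(0))).show(), assuming the dates are in the first column. Note that the default resolver is lenient about the day of month (for example 02/30 is adjusted rather than rejected), so this only checks the format, not strict calendar validity.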

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, this is not possible with the agg transformation: it expects a TypedColumn, which is constructed from a Column using the as method, so you have to start from a non-type-safe expression. If somebody knows a solution, I would be interested to see it...
If you need type-safe aggregation, you can use one of the approaches below:
mapGroups - where you can implement a Scala function responsible for aggregating the Iterator
implement your custom Aggregator as suggested above
The first approach needs less code, so here is a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size
ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
In my opinion, Spark's type-safe Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I thought it could be a good idea to implement a simple-to-use type-safe aggregation API for Spark Datasets.
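The second approach, a custom Aggregator, could look roughly like the sketch below. The class name DistinctBy, the sample rows and the local SparkSession are assumptions for illustration. The buffer is a plain Set[String] encoded with Kryo, so every distinct value of a group is held in memory.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)

// Counts the distinct strings that `choose` extracts from the values of a group.
class DistinctBy[T](choose: T => String) extends Aggregator[T, Set[String], Long] {
  def zero: Set[String] = Set.empty
  def reduce(acc: Set[String], value: T): Set[String] = acc + choose(value)
  def merge(a: Set[String], b: Set[String]): Set[String] = a ++ b
  def finish(acc: Set[String]): Long = acc.size.toLong
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("typed-agg").getOrCreate()
import spark.implicits._

val input = Seq(
  SomeType("a", 1, 1, "p1"),
  SomeType("a", 1, 1, "p2"),
  SomeType("b", 1, 1, "p3"),
  SomeType("c", 2, 2, "p4")
).toDS()

val chooseDistinguisher: SomeType => String = _.id

// toColumn yields a TypedColumn[SomeType, Long], which is what agg expects
val counts = input
  .groupByKey(s => Key(s.x, s.y))
  .agg(new DistinctBy(chooseDistinguisher).toColumn)

counts.show()
```

This keeps the call site typed end to end: no column name for the grouped values is needed, because the Aggregator receives the SomeType objects themselves.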
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}
val colName = "id"
ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))

Convert Dataframe back to RDD of case class in Spark

I am trying to convert a DataFrame of multiple case classes to an RDD of these case classes, but I can't find any solution. This WrappedArray has driven me crazy :P
For example, suppose I have the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
(randomClass2("a",Seq(randomClass("a1",1.1),randomClass("a2",1.1))),randomClass3("aa","aaa")),
(randomClass2("b",Seq(randomClass("b1",1.2),randomClass("b2",1.2))),randomClass3("bb","bbb")),
(randomClass2("c",Seq(randomClass("c1",3.2),randomClass("c2",1.2))),randomClass3("cc","Ccc"))))
val aDF = anRDD.toDF()
Given aDF, how can I get anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
A Spark DataFrame / Dataset[Row] represents data as Row objects, using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }

Apache Spark: get elements of Row by name

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff:
def foo(r: Row) = {
val ix = (0 until r.schema.length).map( i => r.schema(i).name -> i).toMap
val field1 = r.getString(ix("field1"))
val field2 = r.getLong(ix("field2"))
...
}
dataframe.map(foo)
I figure there must be a better way - this is pretty verbose, it requires creating this extra structure, and it also requires knowing the types explicitly, which if incorrect, will produce a runtime exception rather than a compile-time error.
You can use getAs from org.apache.spark.sql.Row, passing the expected type as a type parameter:
r.getAs[String]("field1")
r.getAs[Long]("field2")
See the documentation of getAs(java.lang.String fieldName) for more details.
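As a rough, self-contained sketch of name-based access: the sample DataFrame, the local SparkSession and the Record case class below are assumptions for illustration. Note that getAs[T] is an unchecked cast, so a wrong type still fails only at runtime; converting to a typed Dataset moves the field names and types into a case class that the compiler can check.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a throwaway local session, only for this illustration
val spark = SparkSession.builder().master("local[*]").appName("row-by-name").getOrCreate()
import spark.implicits._

val dataframe = Seq(("a", 1L), ("b", 2L)).toDF("field1", "field2")

// Name-based lookup on Row; the type parameter is not verified at compile time
def foo(r: Row): (String, Long) =
  (r.getAs[String]("field1"), r.getAs[Long]("field2"))

val pairs = dataframe.map(r => foo(r)).collect()

// Typed alternative: field access is checked by the compiler
case class Record(field1: String, field2: Long)

val lengths = dataframe.as[Record]
  .map(rec => rec.field1.length + rec.field2)
  .collect()
```

The typed variant still checks the schema against the case class only when as[Record] is analyzed at runtime, but after that point a misspelled field or a wrong type is a compile error.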
This is not supported at this time in the Scala API. The closest you have is this JIRA titled "Support converting DataFrames to typed RDDs"