Spark (python) - explain the difference between user defined functions and simple functions - pyspark

I am a Spark beginner. I am using Python and Spark dataframes. I just learned about user defined functions (UDFs), which one has to register before using them.
Question: in what situation do you want to create a udf vs. just a simple (Python) function?
Thank you so much!

Your code will be neater if you use UDFs, because udf() takes a function and a return type (defaulting to StringType if none is given) and creates a column expression, which means you can write nice things like:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_function_udf = udf(my_function, DoubleType())
myDf.withColumn("function_output_column", my_function_udf("some_input_column"))
This is just one example of how you can use a UDF to treat a function as a column. UDFs also make it easy to introduce things like lists or maps into your function logic via a closure, which is explained very well here.

Related

Writing a function that can operate on RDD and Seq in Scala

I am trying to write functions that can receive both Spark RDD and Scala native Seq, so that I can showcase the performance difference of the two approaches. However, I couldn't figure out a common type or interface for the aforesaid function parameters. Let's imagine something simple like computing the mean using a map operation. Both RDD and Seq have this operation. I've tried using the type Either[RDD[Int], Seq[Int]] but it just doesn't typecheck :/.
Any pointer would be much appreciated :)
Basically, you can't. They don't share any common superclass besides AnyRef, I guess. Their map functions have completely different signatures (parameters, etc.) even though they share a name (and purpose).
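To make that concrete, here is a minimal sketch (not from the original answer; the object name MeanBenchmark is made up) that simply provides two overloads with the same logic, since there is no useful common supertype to abstract over:
import org.apache.spark.rdd.RDD

object MeanBenchmark {
  // Plain Scala collections version
  def mean(xs: Seq[Int]): Double =
    xs.map(_.toDouble).sum / xs.size

  // RDD version: same shape, but map/sum/count are the RDD operations
  def mean(xs: RDD[Int]): Double =
    xs.map(_.toDouble).sum() / xs.count()
}
The duplication is the point: the two map methods only share a name, not a signature, so overloading is about as close as you can get to a single entry point.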

Add custom functions with optimisations (hence not as UDF)

I am struggling with the optimisation of my custom functions, which are currently passed as UDFs. We take transformations configurably, in a format like the one below, and hence cannot explicitly code the transformation logic per setting.
transforms: [
{col: "id", expr: """ cast(someCustomFunction(aColumn) as string) """}
{col: "date", expr: """ date_format(cast(unix_timestamp(someColumn, "yyyyMMddHHmmss") as Timestamp), "yyyyMMdd") """}
],
I have registered someCustomFunction, but I want to optimise this by somehow not creating it as a UDF, since Spark treats UDFs as black boxes. I want to know what is the best approach for achieving this (and then sleeping peacefully):
1. Extending Catalyst optimiser rules does not help, since there is no logical optimisation I can supply beforehand.
2. Column functions: if I use them, where/how do I register them (if there is a way to register them at all)? See the sketch below for what I mean.
3. Custom transformations: since I pass strings of unknown transformations, how do I actually use custom transforms? (Code will help.)
4. Registering custom functions beforehand, like those in the o.a.s.sql.functions package: all the entities in this package are protected or private. Do I have to copy all the Spark code locally, add my functions, and have my application use my local Spark build (I hope not)? If not, what is the right way to extend spark-sql to incorporate my functions?
5. Is there some other, much easier way that I have missed?
I have been grappling with this for 3 days, hence any help (preferably with a code sample) would be a giant karmic brownie.
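For reference, a minimal sketch of what "column functions" (option 2 above) usually means: a plain Scala function that builds a Column out of built-in, Catalyst-optimisable functions. The body here is made up purely for illustration:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// An ordinary Scala function returning a Column expression. Because it is
// composed of built-in functions, Catalyst can still optimise it, but it is
// called from code (e.g. withColumn) rather than registered for use inside
// SQL strings.
def someCustomFunction(c: Column): Column =
  upper(trim(c)).cast("string")

// Usage from code:
// df.withColumn("id", someCustomFunction(col("aColumn")))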

How to use Window aggregates on strongly typed Spark Datasets?

I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but I'm struggling to maintain the type info when using Window functions.
case class Measurement(nb: Long, x: Double)
ds being a Dataset[Measurement], I would like to do something like
ds.map{ m => (m, sum($"x").over(Window.orderBy($"nb"))) }
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum",sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?
As you are adding a new column to your dataset, I guess there is no choice but to use the dataframe.as[NewType] method.
More information can be found here: How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found in this blog article: Window Functions in Spark SQL by Databricks.
You can use the as[U] method to convert a DataFrame (or Dataset[Row]) to a Dataset[U].
For your special case, you could define a case class matching the new schema, for example:
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb"))).as[MeasurementWithCumSum]
Hope it helps.
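Putting that together, a self-contained sketch might look like the following (the sample data and the case class name MeasurementWithCumSum are illustrative, not from the original answer):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

case class Measurement(nb: Long, x: Double)
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Measurement(1L, 1.0), Measurement(2L, 2.5), Measurement(3L, 4.0)).toDS()

// withColumn drops the static type (giving a Dataset[Row] / DataFrame);
// as[...] restores it by matching the new schema (nb, x, cumsum) by name.
val withCumSum = ds
  .withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
  .as[MeasurementWithCumSum]

withCumSum.show()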

How to sortBy in Spark using a function?

For example, I want to sort by using the difference of two values in the tuple. How could I do that in Spark?
For example, I want something like the following.
rdd.sortBy(_._2._1 - _._2._2)
You can't use underscore more than once or it will be interpreted as two different arguments (and the expected function should only have one). Instead, name the argument and use it twice:
rdd.sortBy(r => r._2._1 - r._2._2)
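To see it in context, here is a minimal runnable sketch (the sample data is made up for illustration):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

// (key, (value1, value2)) pairs; we sort by the difference value1 - value2
val rdd = sc.parallelize(Seq(("a", (10, 3)), ("b", (5, 4)), ("c", (8, 1))))

// sortBy takes a single-argument function, so name the argument instead of
// using the underscore placeholder twice
val sorted = rdd.sortBy(r => r._2._1 - r._2._2)

sorted.collect().foreach(println)
// ("b",(5,4)) comes first, since 5 - 4 = 1 is the smallest difference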

Should I use partial functions for database calls?

As per my understanding, partial functions are functions that are defined for a subset of input values.
So should I use partial functions for DAOs? For example:
getUserById(userId: Long): User
There is always some input userId which does not exist in the db, so can I say the function is not defined for it, and lift it when I call it?
If yes, where do I stop? Should I use partial functions for all methods which are not defined for some inputs, say for null?
PartialFunction is used when a function is undefined for some elements of the input data (the input data may be a Seq, etc.).
For your case, Option is the better choice: it says that the returned data may be absent:
getUserById(userId: Long): Option[User]
I would avoid using partial functions at all, because Scala makes it very easy to call a partial function as though it were a total function. Instead it's better to use a function that returns Option, as #Sergey suggests; that way the "partial-ness" is always explicit.
Idiomatic Scala does not use null, so I wouldn't worry about methods which are not defined for null, but it's certainly worth returning Option for methods which are only defined for some of their possible input values. Better still, though, is to only accept suitable types as input. E.g. if you have a function that's only valid for non-empty lists, it should take a (scalaz) NonEmptyList as input rather than a List.
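To illustrate the Option-based approach, here is a minimal sketch (the in-memory map stands in for a real database, and the names are made up):
// Option-returning DAO: absence is part of the return type, so every caller
// has to handle the missing case.
case class User(id: Long, name: String)

object UserDao {
  private val table = Map(1L -> User(1L, "alice"), 2L -> User(2L, "bob"))

  def getUserById(userId: Long): Option[User] = table.get(userId)
}

// Callers deal with the missing case explicitly:
UserDao.getUserById(1L).map(_.name).getOrElse("unknown")   // "alice"
UserDao.getUserById(42L).map(_.name).getOrElse("unknown")  // "unknown"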
Idiomatic scala does not use null so I wouldn't worry about methods which are not defined for null, but certainly it's worth returning Option for methods which are only defined for some of their possible input values. Better still, though, is to only accept suitable types as input. E.g. if you have a function that's only valid for non-empty lists, it should take (scalaz) NonEmptyList as input rather than List.