I am struggling to optimise my custom functions, which are currently passed in as UDFs. We take transformations configurably, via a format like the one below, so we cannot hard-code transformation logic per setting.
transforms: [
{col: "id", expr: """ cast(someCustomFunction(aColumn) as string) """}
{col: "date", expr: """ date_format(cast(unix_timestamp(someColumn, "yyyyMMddHHmmss") as Timestamp), "yyyyMMdd") """}
],
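A minimal sketch of how such a config is typically wired up (df, the Transform case class, and applyTransforms are illustrative names, not part of our actual code; the point is that every expression string goes through Spark's SQL parser, so it can only reference built-ins or registered functions):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// illustrative config entry matching the format above
case class Transform(col: String, expression: String)

// each configured expression becomes a column via expr(...)
def applyTransforms(df: DataFrame, transforms: Seq[Transform]): DataFrame =
  transforms.foldLeft(df) { (acc, t) => acc.withColumn(t.col, expr(t.expression)) }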
I have registered someCustomFunction, but I want to optimise this by somehow not creating it as a UDF, since Spark treats UDFs as black boxes. I want to know the best approach for achieving this (and then sleeping peacefully):
Extending Catalyst optimiser rules does not help, since there is no logical optimisation I can supply beforehand.
Column functions: if I use them, where/how do I register them (if there is a way to register them at all)?
Custom transformations: since I pass strings of unknown transformations, how do I actually use custom transforms? (Code would help; see the sketch below.)
Registering custom functions beforehand, like those in the o.a.s.sql.functions package. All the entities in this package are protected or private. Do I have to copy all the Spark code locally, add my functions, and have my application use my local Spark build (I hope not)? If not, what is the right way to extend spark-sql to incorporate my functions?
Is there some other much easier way that I have missed?
I have been grappling with this for 3 days hence any help (preferably with a code sample) would be a giant Karmic brownie.
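On the "column functions" and "custom transformations" points above, a minimal sketch of what both look like without UDFs (someCustomColumnFunction and withId are made-up names, and the body is a placeholder built from built-ins):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// a "column function": plain Scala over Column, so Catalyst only ever sees
// built-in expressions and there is no UDF black box
def someCustomColumnFunction(c: Column): Column = upper(trim(c))

// a "custom transformation" is just DataFrame => DataFrame, chained via .transform
def withId(df: DataFrame): DataFrame =
  df.withColumn("id", someCustomColumnFunction(col("aColumn")).cast("string"))

// usage: df.transform(withId)
// caveat: neither form can be referenced from an expr(...)/SQL string, which is
// exactly the configuration-driven constraint described above.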
Related
I am trying to write functions that can receive both a Spark RDD and a native Scala Seq, so that I can showcase the performance difference between the two approaches. However, I couldn't figure out a common type or interface for the aforementioned function parameters. Let's imagine something simple like computing the mean using a map operation. Both RDD and Seq have this operation. I've tried using the type Either[RDD[Int], Seq[Int]], but it just doesn't typecheck :/.
Any pointer would be very appreciated :)
Basically, you can't. They don't share any common superclass, besides AnyRef I guess. Their map functions have completely different signatures (parameters etc.) even though they share a name (and purpose).
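That said, one common workaround under that constraint is to accept an Either and branch explicitly; a rough sketch (sc is assumed to be a live SparkContext):
import org.apache.spark.rdd.RDD

def mean(data: Either[RDD[Int], Seq[Int]]): Double = data match {
  case Left(rdd)  => rdd.map(_.toDouble).mean()          // distributed path
  case Right(seq) => seq.map(_.toDouble).sum / seq.size  // local path
}

// mean(Left(sc.parallelize(1 to 100)))
// mean(Right(1 to 100))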
I'm working with scalax to generate a graph of my Spark operations. I have a custom library that generates this graph; let me show a sample:
val DAGWithoutGet = createGraphFromOps(ops)
val DAGWithGet = createGraphFromOps(ops).get
The return type of DAGWithoutGet is
scala.util.Try[scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge]],
and, for DAGWithGet is
scalax.collection.Graph[typeA, scalax.collection.GraphEdge.DiEdge].
Here, typeA is a project-related class representing a single Spark operation, not relevant in the context of this question. (For context only: what my custom library does is, essentially, generate a map of dependencies between those operations, creating a big Map object and calling Graph(myBigMap: _*) to generate the graph.)
As far as I know, calling .get at this point of my code or later should not make any difference, but that is not what I'm seeing.
Calling DAGWithoutGet.get.nodes has a return type of scalax.collection.Graph[typeA,DiEdge]#NodeSetT,
while calling DAGWithGet.nodes returns DAGWithGet.NodeSetT.
When I extract one of those nodes (using the .find method), I receive scalax.collection.Graph[typeA,DiEdge]#NodeT and DAGWithGet.NodeT types, respectively. Much to my dismay, even the methods available in each case are different - I cannot use pathTo (which happens to be what I want) or withSubgraph on the former, only on the latter.
My question, then, after this relatively complex example: what is going on here? Why does extracting the value from the Try construct at different moments lead to different types, one path-dependent and the other not? Or, if that isn't correct, what might I be missing here?
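For illustration only, the distinction boils down to stable identifiers versus type projections; a self-contained sketch against scala-graph 1.x, with an Int graph standing in for the typeA one:
import scalax.collection.Graph
import scalax.collection.GraphEdge.DiEdge
import scalax.collection.GraphPredef._
import scala.util.Try

val tried: Try[Graph[Int, DiEdge]] = Try(Graph(1 ~> 2, 2 ~> 3))

val g = tried.get                  // stable identifier
val nodes: g.NodeSetT = g.nodes    // path-dependent types; g.NodeT offers pathTo etc.
// tried.get.nodes is only typed as Graph[Int, DiEdge]#NodeSetT (a type projection),
// because tried.get is not a stable path.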
I am a Spark beginner. I am using Python and Spark dataframes. I just learned about user defined functions (UDFs), which one has to register first in order to use them.
Question: in what situation do you want to create a udf vs. just a simple (Python) function?
Thank you so much!
Your code will be neater if you use UDFs: udf takes a function and a return type (defaulting to string if omitted) and creates a column expression, which means you can write nice things like:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_function_udf = udf(my_function, DoubleType())
myDf.withColumn("function_output_column", my_function_udf("some_input_column"))
This is just one example of how you can use a UDF to treat a function as a column. They also make it easy to introduce stuff like lists or maps into your function logic via a closure, which is explained very well here
I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but I'm struggling to maintain the type info when using Window functions.
case class Measurement(nb:Long,x:Double)
ds being a Dataset[Measurement], I would like to do something like
ds.map{m => (m, sum($"x").over(Window.orderBy($"nb")))}
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum",sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?
Since you are adding a new column to your dataset, I guess there is no choice but to use the dataframe.as[NewType] method.
More information can be found here: How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found on this blog article Window Functions in Spark SQL by Databricks
You can use the as[U] method to convert a DataFrame (or Dataset[Row]) to a Dataset[U].
For your special case, note that there is no Encoder for Column, so the target type has to match the widened schema; for example (the result case class name is just an illustration):
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb"))).as[MeasurementWithCumSum]
Hope it helps
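Putting the two answers together, a minimal runnable sketch (MeasurementWithCumSum is just an example name for the widened row type):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

case class Measurement(nb: Long, x: Double)
case class MeasurementWithCumSum(nb: Long, x: Double, cumsum: Double)

val spark = SparkSession.builder.appName("window-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Measurement(1, 1.0), Measurement(2, 2.5), Measurement(3, 4.0)).toDS()

// window function on the untyped side, then back to a typed Dataset
val withCumSum = ds
  .withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
  .as[MeasurementWithCumSum]

withCumSum.show()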
Scala's Play framework claims that Anorm, and writing your own SQL, is better than ORMs. One of the reasons given is that most often you only want to transfer data between the database and the frontend as JSON anyway. However, most tutorials, and even the Play documentation, give examples of parsing SQL's returned values into case classes, only to parse them again into JSON. We still have an object-relational mapping anyway, or am I missing a point?
In my database there is a table with 33 columns. Declaring a case class takes me 33 lines; declaring a parser with the ~ operator takes another 33. Using a case statement to create an object, another 66! Seriously, what am I doing wrong? Is there any shortcut? In Django the same thing takes only 33 lines.
If you're using Anorm within a Play application, then mapping your case class into a JSON object (assuming it has fairly normal apply and unapply functions defined for it, which most do) should be pretty much as simple as defining an implicit which uses the Scala 2.10+ macro-based Json-inception methods... so all you actually need is a definition like this:
implicit val myCaseFormats = Json.format[MyCaseClass]
where 'MyCaseClass' is the name of your case type. You could even bake this into the parser combinator you use for de-serialising row-sets back from the database...that would dramatically clean up your code and cut down the amount of code you have to write.
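A minimal sketch of that pattern (the two-field case class stands in for the 33-column one):
import play.api.libs.json.Json

case class MyCaseClass(id: Long, name: String)         // stand-in for the 33-column class
implicit val myCaseFormats = Json.format[MyCaseClass]  // macro-generated Reads + Writes

val js = Json.toJson(MyCaseClass(1L, "example"))       // {"id":1,"name":"example"}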
See here for details on the Json macros:
https://www.playframework.com/documentation/2.1.1/ScalaJsonInception
I use this quite extensively in a pretty large code-base and it does make things quite clean.
In terms of your parsers for Anorm, remember that you don't have to produce a case-class instance as a result of a parse...you can actually return anything you like, which could just be an indexed sequence of your column values (if you're using something like Shapeless to allow for mixed-type lists etc...) or some other structure.
You do have macro support in Anorm as well, so the parsers for your case classes can be one-liners like this:
import anorm.{Macro, RowParser}
val parser = Macro.namedParser[MyCaseClass]
If you want to do something custom (such as parsing directly to a JsValue), then you have the flexibility to just hand-craft a more elaborate parser.
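For example, a hand-crafted parser that goes straight from a row to a JsValue, skipping the case class entirely (table and column names are made up):
import anorm._
import anorm.SqlParser.get
import play.api.libs.json.{JsValue, Json}

val jsonRow: RowParser[JsValue] =
  get[Long]("id") ~ get[String]("name") map {
    case id ~ name => Json.obj("id" -> id, "name" -> name)
  }

// usage, inside a DB.withConnection { implicit c => ... } block:
// SQL("select id, name from users").as(jsonRow.*)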
HTH