How to use Window aggregates on strongly typed Spark Datasets? - scala

I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but I'm struggling to maintain the type information when using Window functions.
case class Measurement(nb:Long,x:Double)
With ds being a Dataset[Measurement], I would like to do something like
ds.map{m => (m, sum($"x").over(Window.orderBy($"nb")))}
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum",sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?

Since you are adding a new column to your Dataset, I guess there is no choice but to use the dataFrame.as[NewType] method.
More information can be found here: How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found in this blog article: Window Functions in Spark SQL by Databricks

You can use the as[U] method to convert a DataFrame (or Dataset[Row]) to a Dataset[U].
For your specific case:
ds.withColumn("cumsum",sum($"x").over(Window.orderBy($"nb"))).as[(Long, Double, Double)]
Note that after withColumn the columns are nb, x and cumsum, so the target type has to match those flattened columns; .as[(Measurement, Column)] cannot work because there is no Encoder for Column.
Hope it helps
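Putting the answer together, here is a minimal runnable sketch (assuming Spark 2.x in local mode; the tuple type must match the flattened columns nb, x and cumsum, since Column itself has no Encoder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

case class Measurement(nb: Long, x: Double)

object CumSumExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cumsum").getOrCreate()
    import spark.implicits._

    val ds = Seq(Measurement(1L, 1.0), Measurement(2L, 2.0), Measurement(3L, 3.0)).toDS()

    // withColumn drops the type info (Dataset[Row]); .as[...] restores it.
    // After withColumn the columns are nb, x and cumsum, so the target
    // type is a tuple of the flattened columns:
    val typed = ds
      .withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
      .as[(Long, Double, Double)]

    typed.show() // the cumsum column holds the running sum: 1.0, 3.0, 6.0
    spark.stop()
  }
}
```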

Related

Writing a function that can operate on RDD and Seq in Scala

I am trying to write functions that can receive both a Spark RDD and a native Scala Seq, so that I can showcase the performance difference between the two approaches. However, I couldn't figure out a common type or interface for the aforementioned function parameters. Let's imagine something simple like computing the mean using a map operation. Both RDD and Seq have this operation. I've tried using the type Either[RDD[Int], Seq[Int]], but it just doesn't typecheck :/.
Any pointer would be very appreciated :)
Basically, you can't. They don't share any common superclass - besides AnyRef, I guess. Their map functions have completely different signatures (parameters etc.), even though they share a name (and purpose).
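That said, for the benchmarking use case in the question, the Either approach can be made to typecheck by pattern matching on the branches instead of trying to call map on the Either itself. A sketch, assuming a SparkContext is available for the RDD branch:

```scala
import org.apache.spark.rdd.RDD

// Compute the mean of either a Spark RDD or a plain Scala Seq.
// Either has no shared `map` over both element containers, but pattern
// matching lets each branch use its own API:
def mean(data: Either[RDD[Int], Seq[Int]]): Double = data match {
  case Left(rdd)  => rdd.map(_.toDouble).mean() // DoubleRDDFunctions.mean
  case Right(seq) => seq.map(_.toDouble).sum / seq.size
}
```

For example, mean(Right(Seq(1, 2, 3))) returns 2.0; the Left branch requires a running SparkContext.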

Using the Dataset API with Spark Scala

Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is very simple: I have to perform all the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is this the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I am doing transformations on orderItemFile1, I am able to access fields with _.order_id for filter/map operations, but the same is not working with groupBy.
case class orderItemDetails (order_id_order_item:Int, item_desc:String,qty:Int, sale_value:Int)
val orderItemFile1=ss.read.format("csv")
.option("header",true)
.option("inferSchema",true)
.load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]
orderItemFile1.filter(_.order_id_order_item>100) //Works Fine
orderItemFile1.map(_.order_id_order_item.toInt)// Works Fine
//Error. Inside groupBy I am unable to access it as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)
//The line below works, but how does this provide the compile-time safety promised
//by the Dataset API? I can pass any wrong column name here and it will be
//caught only at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
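For reference, here is a typed groupByKey sketch, reusing orderItemDetails and orderItemFile1 from the question. Unlike the string-based groupBy, a misspelled field name here is a compile-time error:

```scala
// groupByKey keeps the lambda-based, compile-time-checked Dataset style.
// Here we sum sale_value per order id:
val totalsPerOrder = orderItemFile1
  .groupByKey(_.order_id_order_item)
  .mapGroups { (orderId, items) => (orderId, items.map(_.sale_value).sum) }
```

groupByKey returns a KeyValueGroupedDataset, and mapGroups produces a typed Dataset again, so the type info is preserved end to end.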

Add custom functions with optimisations (hence not as UDF)

I am struggling with the optimisation of my custom functions, which are currently being passed as UDFs. We take transformations configurably via a format like the one below, and hence cannot explicitly code the transformation logic per setting.
transforms: [
{col: "id", expr: """ cast(someCustomFunction(aColumn) as string) """}
{col: "date", expr: """ date_format(cast(unix_timestamp(someColumn, "yyyyMMddHHmmss") as Timestamp), "yyyyMMdd") """}
],
I have registered someCustomFunction, but I want to optimise this by somehow not creating it as a UDF, since Spark treats UDFs as black boxes. I want to know the best approach for achieving this (and then sleeping peacefully):
Extending Catalyst optimiser rules does not help, since there is no logical optimisation I can provide beforehand.
Column functions: if I use them, where/how do I register them (if there is a way to register them at all)?
Custom transformations: since I pass strings of unknown transformations, how do I actually use custom transforms (code would help)?
Registering custom functions beforehand, like those in the o.a.s.sql.functions package. All the entities in this package are protected or private. Do I have to copy all the Spark code locally, add my functions, and have my application use my local Spark build (I hope not)? If not, what is the right way to extend spark-sql to incorporate my functions?
Is there some other, much easier way that I have missed?
I have been grappling with this for 3 days, so any help (preferably with a code sample) would be a giant karmic brownie.
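On the Column functions option: they need no registration at all when the transformations are built in code rather than parsed from SQL strings. A sketch of a hypothetical someCustomFunction (the trim/upper-case logic here is an invented placeholder) composed from built-ins, which Catalyst can optimise because it never leaves the expression tree:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{trim, upper}

// Hypothetical logic: suppose someCustomFunction trims and upper-cases.
// Because it is composed from built-in expressions, it stays visible
// to the Catalyst optimiser, unlike an opaque UDF:
def someCustomFunction(c: Column): Column = upper(trim(c))

// Usage in the DataFrame DSL (no registration step needed):
// df.withColumn("id", someCustomFunction(df("aColumn")).cast("string"))
```

The limitation is exactly the one the question runs into: a Column function cannot be referenced from a SQL expression string like the expr values in the config above, so for string-driven configs some form of registered function is still required.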

Spark (python) - explain the difference between user defined functions and simple functions

I am a Spark beginner. I am using Python and Spark DataFrames. I just learned about user defined functions (UDFs), which one has to register first in order to use them.
Question: in what situations do you want to create a UDF vs. just a simple (Python) function?
Thank you so much!
Your code will be neater if you use UDFs: udf takes a function and the return type (defaulting to string if omitted) and creates a column expression, which means you can write nice things like:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

my_function_udf = udf(my_function, DoubleType())
myDf = myDf.withColumn("function_output_column", my_function_udf("some_input_column"))
This is just one example of how you can use a UDF to treat a function as a column. UDFs also make it easy to introduce things like lists or maps into your function logic via a closure, which is explained very well here

NoSuchElementException in machine learning pipeline using Scala

I am trying to implement an ML pipeline in Spark using Scala, based on the sample code available on the Spark website. I am converting my RDD[LabeledPoint] into a data frame using the functions available in the SQLContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter, such as Vector[Double], Vector[String], etc. You just need to specify the type of data your vector will store.
As a side note: the single-argument overloaded version of createDataFrame() you use seems to be experimental, which is worth keeping in mind if you are planning to use it in a long-term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.
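A minimal sketch of a pipeline with non-empty stages, following the example in the spark.ml documentation; the tokenizer/hashing/logistic-regression stages here are placeholders to substitute with your own transformers and estimator:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Placeholder stages from the spark.ml docs example; adapt the input
// and output column names to your own schema:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Without setStages the pipeline is empty and fit() has nothing to run:
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// val model = pipeline.fit(training_df)
```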