Using the Dataset API with Spark Scala - scala

Hi, I am very new to Spark/Scala and trying to implement some functionality. My requirement is very simple: I have to perform all the operations using the Dataset API.
Question 1:
I converted the CSV into a case class. Is this the correct way of converting a DataFrame to a Dataset? Am I doing it correctly?
Also, when I do transformations on orderItemFile1, I can access the fields in filter/map operations with _.order_id_order_item, but the same does not work with groupBy.
case class orderItemDetails(order_id_order_item: Int, item_desc: String, qty: Int, sale_value: Int)

val orderItemFile1 = ss.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]

orderItemFile1.filter(_.order_id_order_item > 100) // Works fine
orderItemFile1.map(_.order_id_order_item.toInt)    // Works fine

// Error. Inside groupBy I am unable to access the field as _.order_id_order_item. Why so?
orderItemFile1.groupBy(_.order_id_order_item)

// The line below works, but how does this give the compile-time safety promised by the
// Dataset API? I can pass a wrong column name here and it will be caught only at run time.
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))

Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.
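For illustration, a minimal sketch of groupByKey on the Dataset above (assuming the orderItemDetails case class and the SparkSession ss from the question); the key selector is a plain Scala function, so a wrong field name fails at compile time:

import ss.implicits._

// Typed grouping: the key is extracted with a compile-time-checked lambda,
// and the aggregation runs per key over the typed values.
val qtyPerOrder = orderItemFile1
  .groupByKey(_.order_id_order_item)
  .mapGroups((orderId, items) => (orderId, items.map(_.qty).sum))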

Related

How to compare 2 datasets in scala?

I am creating unit tests for a Scala application using ScalaTest. I have the actual and expected results as Datasets. When I verified manually, both the data and the schema match between the actual and expected datasets.
Actual Dataset = actual_ds
Expected Dataset = expected_ds
When I execute the command below, it returns false:
assert(actual_ds.equals(expected_ds))
Could anyone suggest what the reason could be? And is there any other inbuilt function to compare datasets in Scala?
Use one of the libraries designed for Spark tests: spark-fast-tests, spark-testing-base, spark-test.
They are quite easy to use, and with their help it is easy to compare two datasets with a formatted message in the output.
You may start with spark-fast-tests (you can find usage examples in its README) and check the others if it does not suit your needs (for example, if you need different output formatting).
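As a rough sketch of how spark-fast-tests is typically wired into a ScalaTest suite (based on its README; the package, trait, and method names may differ between versions, and actual_ds/expected_ds are assumed to be built inside the test):

import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.scalatest.funsuite.AnyFunSuite

class MyDatasetSpec extends AnyFunSuite with DatasetComparer {
  test("actual matches expected") {
    // Fails with a readable diff of schemas/rows when the datasets differ.
    assertSmallDatasetEquality(actual_ds, expected_ds)
  }
}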
That .equals() is Java's Object.equals (reference equality), so it is expected that the assert fails.
I would start testing two datasets with:
assert(actual_ds.schema == expected_ds.schema)
assert(actual_ds.count() == expected_ds.count())
And then check this question: DataFrame equality in Apache Spark
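For a plain-Spark comparison without extra libraries, a minimal sketch (it assumes the column types support except, i.e. no map-type columns, and it ignores duplicate-row counts):

// Same schema, same row count, and no rows present on one side only.
assert(actual_ds.schema == expected_ds.schema)
assert(actual_ds.count() == expected_ds.count())
// Dataset.isEmpty is available from Spark 2.4; use count() == 0 on older versions.
assert(actual_ds.except(expected_ds).isEmpty && expected_ds.except(actual_ds).isEmpty)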

Add custom functions with optimisations (hence not as UDF)

I am struggling with optimisation of my custom functions, which are currently passed as UDFs. We take transformations configurably via a format like the one below, and hence cannot explicitly code the transformation logic per setting.
transforms: [
{col: "id", expr: """ cast(someCustomFunction(aColumn) as string) """}
{col: "date", expr: """ date_format(cast(unix_timestamp(someColumn, "yyyyMMddHHmmss") as Timestamp), "yyyyMMdd") """}
],
I have registered someCustomFunction, but I want to optimise this by somehow not creating it as a UDF, since Spark treats UDFs as black boxes. I want to know what the best approach is for achieving this (and then sleeping peacefully):
Extending Catalyst optimiser rules does not help, since there is no logical optimisation I can give beforehand.
Column functions: if I use them, where/how do I register them (if there is a way to register them at all)? See the sketch after this question.
Custom transformations: since I pass strings of unknown transformations, how do I actually use custom transforms (code will help)?
Registering custom functions beforehand, like those in the o.a.s.sql.functions package. All the entities in this package are protected or private. Do I have to copy all the Spark code locally, add my functions, and have my application use my local Spark build (I hope not)? If not, what is the right way to extend spark-sql to incorporate my functions?
Is there some other, much easier way that I have missed?
I have been grappling with this for 3 days, hence any help (preferably with a code sample) would be a giant karmic brownie.
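For reference, a hedged sketch of what the Column-function option looks like next to the current UDF approach; the real logic of someCustomFunction is not shown in the question, so the bodies below are placeholders, and df stands for whatever DataFrame the transforms are applied to:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Current approach: a registered UDF. It can be referenced from SQL expression
// strings like the configured expr, but Catalyst treats it as a black box.
spark.udf.register("someCustomFunction", (s: String) => if (s == null) null else s.trim.toUpperCase)

// Column-function style: composed from built-in functions, so Catalyst can see
// through and optimise it, but it cannot be called from an SQL string; it has
// to be applied in code.
def someCustomFunctionCol(c: Column): Column = upper(trim(c))

df.withColumn("id", someCustomFunctionCol(col("aColumn")).cast("string"))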

How to use Window aggregates on strongly typed Spark Datasets?

I'm slowly trying to adapt to the new (strongly typed) Dataset[U] from Spark 2.x, but struggling to maintain the type info when using Window functions.
case class Measurement(nb: Long, x: Double)
With ds being a Dataset[Measurement], I would like to do something like
ds.map{ m => (m, sum($"x").over(Window.orderBy($"nb"))) }
But this will not work, as it gives me a Dataset[(Measurement, Column)] instead of a Dataset[(Measurement, Double)].
Using withColumn gives me a Dataset[Row], so I'm losing the type info:
ds.withColumn("cumsum",sum($"x").over(Window.orderBy($"nb")))
So, is there a better way to use Window functions on strongly typed Datasets?
As you are adding a new column to your dataset, I guess there is no choice but to use the dataframe.as[NewType] method.
More information can be found here How to add a column to Dataset without converting from a DataFrame and accessing it?
More information on Window functions can be found on this blog article Window Functions in Spark SQL by Databricks
You can use the as[U] method to convert a Dataframe (or Dataset[Row]) to a Dataset[U]
For your special case, the target type has to have an Encoder and match the new flat schema (nb, x, cumsum); there is no Encoder for org.apache.spark.sql.Column, so a case class works better (see the sketch below):
ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb"))).as[MeasurementWithCumsum]
Hope it helps
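Putting the pieces together, a minimal sketch of the withColumn + as[...] approach; MeasurementWithCumsum is a helper case class introduced here (it is not part of the original question), and ds is the Dataset[Measurement] from the question:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

case class Measurement(nb: Long, x: Double)
case class MeasurementWithCumsum(nb: Long, x: Double, cumsum: Double)

// withColumn drops back to an untyped Dataset[Row]; as[...] restores a typed view,
// matching columns to the case class fields by name.
val withCumsum: Dataset[MeasurementWithCumsum] =
  ds.withColumn("cumsum", sum($"x").over(Window.orderBy($"nb")))
    .as[MeasurementWithCumsum]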

No such element exception in machine learning pipeline using scala

I am trying to implement an ML pipeline in Spark using Scala, and I used the sample code available on the Spark website. I am converting my RDD[LabeledPoint] into a DataFrame using the functions available in the SQLContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter, such as Vector[Double], Vector[String], etc. You just need to specify the type of data your vector will store.
As a side note: the single-argument overloaded version of createDataFrame() you use seems to be experimental. Keep that in mind if you are planning to use it for a long-term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.
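For illustration, a minimal sketch of a non-empty pipeline along the lines of the spark.ml documentation example; the stages here are placeholders, so substitute the ones your pipeline actually needs, and training_df is the DataFrame from the question:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Without setStages the pipeline is empty, so there is nothing for fit() to do.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training_df)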

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I just launch it locally.
I've got a NullPointerException in these few lines. The first println works well, and the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
And it does not matter which method of the 'graph' object I call; 'graph' itself is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
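If the value needed on the executors is something bigger than a single count, an explicit broadcast variable makes the shipping explicit; a hedged sketch of the same workaround using one (sc is the SparkContext):

// Compute the driver-side value once, broadcast it, and read it inside the closure.
val inDegreeCount = sc.broadcast(graph.inDegrees.count)
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + inDegreeCount.value)
  42 // placeholder vertex value, as in the question
}).vertices.collect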
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.