How to compare 2 datasets in scala? - scala

I am creating unit tests for a Scala application using Scala Test. I have the actual and expected results as Dataset . When I verified manually both the data and schema matches between actual and expected datasets.
Actual Dataset= actual_ds
Expected Dataset = expected_ds
when I execute below command ,it returns False.
assert(actual_ds.equals(expected_ds))
Could anyone suggest what could be the reason. And is there any other inbuilt function to compare the datasets in scala?

Use one of libraries designed for spark tests, spark-fast-tests , spark-testing-base, spark-test
They are quite ease to use and with their help its easy to compare two datasets with formatted message on output
You may start with spark-fast-tests (you can find usage in readme file) and check others if it does not suite your needs (fro example if you need other output formatting)

That .equals() is from Java Object .equals so it's correct that the assert fails.
I would start testing two datasets with:
assert actual_ds.schema == expected_ds.schema
assert actual_ds.count() == expected_ds.count()
And then checking this question: DataFrame equality in Apache Spark

Related

Using DATASet Api with Spark Scala

Hi I am very new to spark/Scala and trying to implement some functionality.My requirement is very simple.I have to perform all the the operations using DataSet API.
Question1:
I converted the csv in form a case Class?Is it correct way of converting data frame to DataSet??Am I doing it correctly?
Also when I am trying to do transformation on orderItemFile1,for filter/map operation I am able to access with _.order_id.But same is not happening with groupBy
case class orderItemDetails (order_id_order_item:Int, item_desc:String,qty:Int, sale_value:Int)
val orderItemFile1=ss.read.format("csv")
.option("header",true)
.option("infersSchema",true)
.load("src/main/resources/Order_ItemData.csv").as[orderItemDetails]
orderItemFile1.filter(_.order_id_order_item>100) //Works Fine
orderItemFile1.map(_.order_id_order_item.toInt)// Works Fine
//Error .Inside group By I am unable to access it as _.order_id_order_item. Why So?
orderItemFile1.groupBy(_.order_id_order_item)
//Below Works..But How this will provide compile time safely as committed
//by DataSet Api.I can pass any wrong column name also here and it will be //caught only on run time
orderItemFile1.groupBy(orderItemFile1("order_id_order_item")).agg(sum(orderItemFile1("item_desc")))
Perhaps the functionality you're looking for is #groupByKey. See example here.
As for your first question, basically yes, you're reading a CSV into a Dataset[A] where A is a case class you've declared.

Spark (python) - explain the difference between user defined functions and simple functions

I am a Spark beginner. I am using Python and Spark dataframes. I just learned about user defined functions (udf) that one has to register first in order to use it.
Question: in what situation do you want to create a udf vs. just a simple (Python) function?
Thank you so much!
Your code will be neater if you use UDFs, because it will take a function, and the correct return type (defaults to string if empty), and create a column expression, which means you can write nice things like:
my_function_udf = udf(my_function, DoubleType())
myDf.withColumn("function_output_column", my_function_udf("some_input_column"))
This is just one example of how you can use a UDF to treat a function as a column. They also make it easy to introduce stuff like lists or maps into your function logic via a closure, which is explained very well here

How to write scala unit tests to compare spark dataframes?

Purpose - Checking if a dataframe generated by spark and a manually created dataframe are the same.
Earlier implementation which worked -
if (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - true
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am running the program via the spark-shell.
Newer Implementation which doesn't work -
assert (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - false
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am using the assert method of scalatest instead, but the returned result is not returning as true.
Why try to use the new implementation when previous method worked? To have sbt use scalatest to always run the test file via sbt test or while compiling.
The same code to compare spark dataframes when run via the spark-shell, gives the correct output but shows an error when run using scalatest in sbt.
The two programs are effectively the same but the results are different. What could be the problem?
Tests for compare dataframes exists in Spark Core, example:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Libraries with tests shared code (SharedSQLContext, ect) present in central Maven repo, you can include them in project, and use "checkAnswer" methods for compare dataframes.
I solved the issue by using this as a dependency https://github.com/MrPowers/spark-fast-tests .
Another solution would be to iterate over the members of the dataframe individually and compare them.

how to use forAll in scalatest to generate only one object of a generator?

Im working with scalatest and scalacheck, alsso working with FeatureSpec.
I have a generator class that generate object for me that looks something like this:
object InvoiceGen {
def myObj = for {
country <- Gen.oneOf(Seq("France", "Germany", "United Kingdom", "Austria"))
type <- Gen.oneOf(Seq("Communication", "Restaurants", "Parking"))
amount <- Gen.choose(100, 4999)
number <- Gen.choose(1, 10000)
valid <- Arbitrary.arbitrary[Boolean]
} yield SomeObject(country, type, "1/1/2014", amount,number.toString, 35, "something", documentTypeValid, valid, "")
Now, I have the testing class which works with FeatureSpec and everything that I need to run the tests.
In this class I have scenarios, and in each scenario I want to generate a different object.
The thing is from what I understand is that to generate object is better to use forAll func, but for all will not sure to bring you an object so you can add minSuccessful(1) to make sure you get at list 1 obj....
I did it like this and it works:
scenario("some scenario") {
forAll(MyGen.myObj, minSuccessful(1)) { someObject =>
Given("A connection to the system")
loginActions shouldBe 'Connected
When("something")
//blabla
Then("something should happened")
//blabla
}
}
but im not sure exactly what it means.
What I want is to generate an invoice each scenario and do some actions on it...
im not sure why i care if the generation work or didnt work...i just want a generated object to work with.
TL;DR: To get one object, and only one, use myObj.sample.get. Unless your generator is doing something fancy that's perfectly safe and won't blow up.
I presume that your intention is to run some kind of integration/acceptance test with some randomly generated domain object—in other words (ab-)use scalacheck as a simple data generator—and you hope that minSuccessful(1) would ensure that the test only runs once.
Be aware that this is not the case!. scalacheck will run your test multiple times if it fails, to try and shrink the input data to a minimal counterexample.
If you'd like to ensure that your test runs only once you must use sample.
However, if running the test multiple times is fine, prefer minSuccessful(1) to "succeed fast" but still profit from minimized counterexamples in case the test fails.
Gen.sample returns an option because generators can fail:
ScalaCheck generators can fail, for instance if you're adding a filter (listingGen.suchThat(...)), and that failure is modeled with the Option type.
But:
[…] if you're sure that your generator never will fail, you can simply call Option.get like you do in your example above. Or you can use Option.getOrElse to replace None with a default value.
Generally if your generator is simple, i.e. does not use generators that could fail and does not use any filters on its own, it's perfectly safe to just call .get on the option returned by .sample. I've been doing that in the past and never had problems with it. If your generators frequently return None from .sample they'd likely make scalacheck fail to successfully generate values as well.
If all that you want is a single object use Gen.sample.get.
minSuccessful has a very different meaning: It's the minimal number of successful tests that scalacheck runs—which by no means implies
that scalacheck takes only a single value out of the generator, or
that the test runs only once.
With minSuccessful(1) scalacheck wants one successful test. It'll take samples out of the generator until the test runs at least once—i.e. if you filter the generated values with whenever in your test body scalacheck will take samples as long as whenever discards them.
If the test passes scalacheck is happy and won't run the test a second time.
However if the test fails scalacheck will try and produce a minimal example to fail the test. It'll shrink the input data and run the test as long as it fails and then provides you with the minimized counter example rather than the actual input that triggered the initial failure.
That's an important property of property testing as it helps you to discover bugs: The original data is frequently too large to lend itself for debugging. Minimizing it helps you discover the piece of input data that actually triggers the failure, i.e. corner cases like empty strings that you didn't think of.
I think the way you want to use Scalacheck (generate only one object and execute the test for it) defeats the purpose of property-based testing. Let me explain a bit in detail:
In classical unit-testing, you would generate your system under test, be it an object or a system of dependent objects, with some fixed data. This could e.g. be strings like "foo" and "bar" or, if you needed a name, you would use something like "John Doe". For integers and other data, you can also randomly choose some values.
The main advantage is that these are "plain" values—you can directly see them in the code and correlate them with the output of a failed test. The big disadvantage is that the tests will only ever run with the values you specified, which in turn means that your code is also only tested with these values.
In contrast, property-based testing allows you to just describe how the data should look like (e.g. "a positive integer", "a string of maximum 20 characters"). The testing framework will then—with the help of generators—generate a number of matching objects and execute the test for all of them. This way, you can be more sure that your code will actually be correct for different inputs, which after all is the purpose of testing: to check if your code does what it should for the possible inputs.
I never really worked with Scalacheck, but a colleague explained it to me that it also tries to cover edge-cases, e.g. putting in a 0 and MAX_INT for a positive integer, or an empty string for the aforementioned string with max. 20 characters.
So, to sum it up: Running a property-based test only once for one generic object is the wrong thing to do. Instead, once you have the generator infrastructure in place, embrace the advantage you then have and let your code be checked a lot more times!

No such element exception in machine learning pipeline using scala

I am trying to implement an ML pipeline in Spark using Scala and I used the sample code available on the Spark website. I am converting my RDD[labeledpoints] into a data frame using the functions available in the SQlContext package. It gives me a NoSuchElementException:
Code Snippet:
Error Message:
Error at the line Pipeline.fit(training_df)
The type Vector you have inside your for-loop (prob: Vector) takes a type parameter; such as Vector[Double], Vector[String], etc. You just need to specify the type you data your vector will store.
As a site note: The single argument overloaded version of createDataFrame() you use seems to be experimental. In case you are planning to use it for some long term project.
The pipeline in your code snippet is currently empty, so there is nothing to be fit. You need to specify the stages using .setStages(). See the example in the spark.ml documentation here.