How to write Scala unit tests to compare Spark dataframes?

Purpose: checking whether a dataframe generated by Spark and a manually created dataframe are the same.
Earlier implementation, which worked:
if (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - true
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am running the program via the spark-shell.
Newer implementation, which doesn't work:
assert (da.except(ds).count() != 0 && ds.except(da).count != 0)
Boolean returned - false
Where da and ds are the generated dataframe and the created dataframe respectively.
Here I am using the assert method of ScalaTest instead, but the assertion does not evaluate to true.
Why try the new implementation when the previous method worked? So that sbt always runs the test file through ScalaTest, via sbt test or while compiling.
The same code to compare Spark dataframes gives the correct output when run via the spark-shell, but shows an error when run using ScalaTest in sbt.
The two programs are effectively the same, yet the results are different. What could be the problem?

Tests that compare dataframes exist in Spark Core, for example:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala
Libraries with the shared test code (SharedSQLContext, etc.) are available in the central Maven repository; you can include them in your project and use the checkAnswer methods to compare dataframes.
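A minimal sketch of that approach, assuming Spark 2.x: the spark-core, spark-catalyst and spark-sql test jars need to be on the test classpath, and QueryTest / SharedSQLContext are internal test utilities whose names may change between Spark versions, so treat this only as a starting point.

// build.sbt (assumed): add the "tests" classifier artifacts for spark-core, spark-catalyst and spark-sql
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

class GeneratedDataFrameSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("generated dataframe contains the expected rows") {
    // stand-in for the dataframe produced by the code under test
    val da = Seq(("a", 1), ("b", 2)).toDF("letter", "number")
    // checkAnswer fails the test with a readable diff if the rows differ
    checkAnswer(da, Seq(Row("a", 1), Row("b", 2)))
  }
}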

I solved the issue by adding https://github.com/MrPowers/spark-fast-tests as a dependency.
Another solution would be to iterate over the members of the dataframe individually and compare them.
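For illustration, a sketch of a ScalaTest suite using spark-fast-tests; the DataFrameComparer trait and assertSmallDataFrameEquality come from that library's README, while the dependency coordinates, version placeholder and class names below are assumptions to adapt to your project (AnyFunSuite requires ScalaTest 3.1+, use FunSuite on older versions).

// build.sbt (assumed coordinates; check the spark-fast-tests README for the current ones)
// libraryDependencies += "com.github.mrpowers" %% "spark-fast-tests" % "<version>" % "test"

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class DataFrameComparisonSpec extends AnyFunSuite with DataFrameComparer {

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("dataframe-comparison-test")
    .getOrCreate()

  import spark.implicits._

  test("generated dataframe equals the manually created dataframe") {
    val da = Seq(("a", 1), ("b", 2)).toDF("letter", "number") // stands in for the generated dataframe
    val ds = Seq(("a", 1), ("b", 2)).toDF("letter", "number") // stands in for the expected dataframe

    // fails with a formatted row/schema diff when the two dataframes differ
    assertSmallDataFrameEquality(da, ds)
  }
}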

Related

How to compare 2 datasets in Scala?

I am creating unit tests for a Scala application using ScalaTest. I have the actual and expected results as Datasets. When I verified manually, both the data and the schema match between the actual and expected datasets.
Actual Dataset = actual_ds
Expected Dataset = expected_ds
When I execute the command below, it returns false.
assert(actual_ds.equals(expected_ds))
Could anyone suggest what the reason could be? And is there any other inbuilt function to compare datasets in Scala?
Use one of the libraries designed for Spark tests: spark-fast-tests, spark-testing-base, spark-test.
They are quite easy to use, and with their help it is easy to compare two datasets and get a formatted message on the output.
You may start with spark-fast-tests (you can find usage examples in its README) and check the others if it does not suit your needs (for example, if you need different output formatting).
That .equals() is Java's Object.equals (Dataset does not override it, so it compares references), so it is expected that the assert fails.
I would start testing two datasets with:
assert(actual_ds.schema == expected_ds.schema)
assert(actual_ds.count() == expected_ds.count())
And then check this question: DataFrame equality in Apache Spark
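As a rough sketch of a dependency-free content check on top of those two asserts (built on Dataset.except, which removes duplicates, so differences in duplicate counts are not detected; isEmpty needs Spark 2.4+, on older versions use count() == 0):

import org.apache.spark.sql.Dataset

def assertSameContent[T](actual: Dataset[T], expected: Dataset[T]): Unit = {
  // schemas must match before a row comparison is meaningful
  assert(actual.schema == expected.schema, "schemas differ")
  // the symmetric difference must be empty in both directions
  assert(actual.except(expected).isEmpty, "actual contains rows missing from expected")
  assert(expected.except(actual).isEmpty, "expected contains rows missing from actual")
}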

How to develop and test python transforms code locally?

What is the recommended way to develop and test python transforms code locally, given that the input datasets fit into memory of the local machine?
The simplest way, which doesn't require you to mock the transforms package, is to extract your logic into a pure Python (PySpark) function that receives dataframes as input and returns a dataframe, e.g.:
# yourtransform.py
from transforms.api import transform_df, Input, Output
from my_business_logic import magic_super_complex_computation


@transform_df(
    Output("/foo/bar/out_dataset"),
    input1=Input("/foo/bar/input1"),
    input2=Input("/foo/bar/input2"))
def my_transform(input1, input2):
    return magic_super_complex_computation(input1, input2)
You can now import magic_super_complex_computation in your test and test it with just PySpark, e.g.:
from my_business_logic import magic_super_complex_computation


def test_magic_super_complex_computation(spark_context):
    df1 = load_my_data_as_df(spark_context, "input1")
    df2 = load_my_data_as_df(spark_context, "input2")
    result = magic_super_complex_computation(df1, df2).collect()
    assert len(result) == 123
Do note that this requires you to provide a valid Spark context as a fixture in your pytest setup (or whatever testing framework you are using).

Scala dataframe - where is the Spark/Scala dataframe source code for explode on GitHub?

As explained in this article, explode is slow in Scala 2.11.8 with Spark 2.0.2, and without moving to a higher Spark version the alternative methods to improve it are also slow. Since the issue has been fixed in later versions of Spark, one approach would be to copy the fixed source code. While looking for the source code, I found a reference to explode in functions, but I do not know how to trace the function further. How would I find the source code of the working explode in the newer Spark source, so I can use it instead of the current version of explode?
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala is the link I think you're looking for.
I was able to find it by expanding all of the import org.apache._ imports within https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala, after seeing that the explode function there is just def explode(e: Column): Column = withExpr { Explode(e.expr) }.
If you wanted to import the underlying Explode expression, I believe the direct import would be import org.apache.spark.sql.catalyst.expressions.Explode.
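For illustration, a rough sketch of wrapping the Catalyst Explode expression yourself, mirroring the one-liner in functions.scala; this assumes a Spark 2.x build where the Column(Expression) constructor is accessible from your code, and the helper name myExplode is made up for the example:

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Explode

// mirrors functions.explode: wrap the Catalyst Explode expression in a Column
def myExplode(e: Column): Column = new Column(Explode(e.expr))

// usage sketch: df.select(myExplode(df("arrayColumn")))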

java.lang.NullPointerException after df.na.fill("Missing") in Scala?

I've been trying to learn/use Scala for machine learning, and to do that I need to convert string variables to an index of dummies.
The way I've done it is with StringIndexer in Scala. Before running it I used df.na.fill("Missing") to replace missing values. Even after I run that I still get a NullPointerException.
Is there something else I should be doing or checking? I used printSchema to filter only the string columns and get the list of columns I needed to run StringIndexer on.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val newDf1 = reweight.na.fill("Missing")
val cat_cols = Array("highest_tier_nm", "day_of_week", "month",
  "provided", "docsis", "dwelling_type_grp", "dwelling_type_cd", "market",
  "bulk_flag")
val transformers: Array[org.apache.spark.ml.PipelineStage] = cat_cols
  .map(cname => new StringIndexer()
    .setInputCol(cname)
    .setOutputCol(s"${cname}_index"))
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers
val categorical = new Pipeline().setStages(stages)
val cat_reweight = categorical.fit(newDf)
Normally when using machine learning you train the model with one part of the data and then test it with another part, and there are two different methods to reflect this. You have only used fit(), which is equivalent to training a model (or a pipeline).
This means that your cat_reweight is not a dataframe; it is a PipelineModel. A PipelineModel has a transform() method that takes data in the same format as the one used for training and gives a dataframe as output. In other words, you should add .transform(newDf1) after fit(newDf1).
Another possible issue is that in your code you have used fit(newDf) instead of fit(newDf1). Make sure the correct dataframe is used for both the fit() and transform() methods, otherwise you will get a NullPointerException.
It works for me when running locally; however, if you still get an error you could try to cache() after replacing the nulls and then perform an action to make sure all transformations are done.
Hope it helps!
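A brief sketch of how the last line might look with that fix applied (variable names follow the question's code):

// fit on the cleaned dataframe, then transform that same dataframe
val categoricalModel = categorical.fit(newDf1)           // a PipelineModel
val cat_reweight = categoricalModel.transform(newDf1)    // a DataFrame with the *_index columns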

Why does Spark application work in spark-shell but fail with "org.apache.spark.SparkException: Task not serializable" in Eclipse?

In order to load a file (delimited by |) into a DataFrame, I have developed the following code:
val file = sc.textFile("path/file/")
val rddFile = file.map(a => a.split("\\|")).map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
val dfInsumos = rddFile.toDF()
The case class used for the creation of my DataFrame is defined as follows:
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly. But when I execute my program in Eclipse, it throws org.apache.spark.SparkException: Task not serializable.
Is there something missing inside the Scala class that I'm using and running with Eclipse? Or what could be the reason that my functions work correctly in the spark-shell but not in my Eclipse app?
I have done some functional tests using spark-shell, and my code works fine, generating the DataFrame correctly.
That's because spark-shell takes care of creating an instance of SparkContext for you. spark-shell then makes sure that references to SparkContext are not from "sensitive places".
But when I execute my program in Eclipse, it throws org.apache.spark.SparkException: Task not serializable.
Somewhere in your Spark application you hold a reference to org.apache.spark.SparkContext that is not serializable, and so it holds your Spark computation back from being serialized and sent across the wire to the executors.
As @T. Gawęda has mentioned in a comment:
I think that ArchivoProcesar is a nested class, and as a nested class it has a reference to the outer class that has a property of type SparkContext
So while copying the code from spark-shell to Eclipse you added some additional lines that you don't show, thinking that they are not necessary, which happens to be quite the contrary. Find the places where you create and reference SparkContext and you will find the root cause of your issue.
I can see that the Spark processing happens inside the ValidacionInsumos class that the main method uses. I think the affecting method is LeerInsumosAValidar, which does the map transformation, and that's where you should look for the answer.
Your case class must have public scope. You can't have ArchivoProcesar nested inside a class.
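For illustration, a rough sketch of that advice: define the case class at the top level (not nested in a class that holds a SparkContext), so the closures passed to map do not capture the enclosing, non-serializable instance. The object and method names below are made up for the example:

import org.apache.spark.sql.{DataFrame, SparkSession}

// top-level case class: it carries no reference to an enclosing (non-serializable) instance
case class ArchivoProcesar(nombre_insumo: String, tipo_rep: String, validado: String, Cargado: String)

object ArchivoLoader {
  def load(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    spark.sparkContext
      .textFile(path)
      .map(_.split("\\|"))
      .map(x => ArchivoProcesar(x(0), x(1), x(2), x(3)))
      .toDF()
  }
}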