Check equality for two Spark DataFrames in Scala - scala

I'm new to Scala and am having problems writing unit tests.
I'm trying to compare and check equality for two Spark DataFrames in Scala for unit testing, and realized that there is no easy way to check equality for two Spark DataFrames.
The C++ equivalent code would be (assuming that the DataFrames are represented as two-dimensional arrays in C++):
int expected[10][2];
int result[10][2];
for (int row = 0; row < 10; row++) {
    for (int col = 0; col < 2; col++) {
        if (expected[row][col] != result[row][col]) return false;
    }
}
return true;
The actual test would involve testing for equality based on the data types of the columns of the DataFrames (testing with precision tolerance for floats, etc).
There does not seem to be an easy way to loop over all the elements of a DataFrame in Scala. Other solutions for checking the equality of two DataFrames, such as df1.except(df2), do not work in my case because I need to support equality with a tolerance for floats and doubles.
Of course, I could try to round all the elements beforehand and compare the results afterwards, but I would like to see if there are any other solutions that would allow me to iterate through the DataFrames to check for equality.

import org.scalatest.{BeforeAndAfterAll, FeatureSpec, Matchers}
outDf.collect() should contain theSameElementsAs (dfComparable.collect())
// or (note: order matters!)
// outDf.except(dfComparable).toDF().count should be(0)
outDf.except(dfComparable).count should be(0)

If you want to check whether two DataFrames are equal for testing purposes, you can use the subtract() method of the DataFrame (available in PySpark since version 1.3; the Scala API calls the same operation except()).
You can then check whether the difference of the two DataFrames is empty.
e.g. df1.subtract(df2).count() == 0
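An empty one-directional difference only shows that the first side has no rows the second side lacks. As a minimal sketch on plain Scala collections (SameRows is a hypothetical helper, not a Spark API), checking both directions looks like this:

```scala
object SameRows {
  // True when both sequences hold the same rows with the same
  // multiplicities, regardless of order. Seq.diff is a multiset difference.
  def sameRows[A](left: Seq[A], right: Seq[A]): Boolean =
    left.diff(right).isEmpty && right.diff(left).isEmpty
}
```

With DataFrames, the analogous test would be that both df1.except(df2) and df2.except(df1) are empty, keeping in mind that except works on distinct rows and so ignores duplicate counts.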

Assuming that you have a fixed number of columns and rows, one solution is to join both DataFrames by row index (in case you do not have IDs for the records), and then iterate directly over the final DataFrame [which has all the columns of both DataFrames].
Something like this:
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
| col1| col2| col3|
|1.20000001| 1.21| 1.2|
| 2.1111| 2.3| 22.2|
| 3.2|2.330000001| 2.333|
| 2.2444| 2.344|2.3331|
| col1| col2| col3|
| 1.2| 1.21| 1.2|
|2.1111| 2.3| 22.2|
| 3.2| 2.33| 2.333|
Added row index
| col1| col2| col3|row|
|1.20000001| 1.21| 1.2| 0|
| 2.1111| 2.3| 22.2| 1|
| 3.2|2.330000001| 2.333| 2|
| 2.2444| 2.344|2.3331| 3|
| col1| col2| col3|row|
| 1.2| 1.21| 1.2| 0|
|2.1111| 2.3| 22.2| 1|
| 3.2| 2.33| 2.333| 2|
|2.2444|2.344|2.3331| 3|
Combined DF
|row| col1| col2| col3| col1| col2| col3|
| 0|1.20000001| 1.21| 1.2| 1.2| 1.21| 1.2|
| 1| 2.1111| 2.3| 22.2|2.1111| 2.3| 22.2|
| 2| 3.2|2.330000001| 2.333| 3.2| 2.33| 2.333|
| 3| 2.2444| 2.344|2.3331|2.2444|2.344|2.3331|
This is how you can do that:
import org.apache.spark.sql.functions.monotonically_increasing_id

val finaldf1 = df1.withColumn("row", monotonically_increasing_id())
val finaldf2 = df2.withColumn("row", monotonically_increasing_id())
println("Added row index")
val joinedDfs = finaldf1.join(finaldf2, "row")
println("Combined DF")
val tolerance = 0.001
def isInValidRange(a: Double, b: Double): Boolean = {
  // The body was missing from the post; an absolute-difference check is assumed here.
  math.abs(a - b) <= tolerance
}
// Column 0 is the join key "row"; columns 1-3 come from df1 and 4-6 from df2.
joinedDfs.take(10).foreach(row => {
  assert( isInValidRange(row.getDouble(1), row.getDouble(4)) , "Col1 validation. Row %s".format(row.getLong(0)+1))
  assert( isInValidRange(row.getDouble(2), row.getDouble(5)) , "Col2 validation. Row %s".format(row.getLong(0)+1))
  assert( isInValidRange(row.getDouble(3), row.getDouble(6)) , "Col3 validation. Row %s".format(row.getLong(0)+1))
})
Note: asserts are not serializable, so a workaround is to use take() to bring the rows to the driver and avoid errors.
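The tolerance check does not have to be written once per column. As a minimal sketch (ApproxRows, cellsMatch and rowsMatch are hypothetical names, not part of Spark or ScalaTest), rows that are already on the driver can be compared cell by cell, assuming both sides are in the same row order:

```scala
object ApproxRows {
  // Floating-point cells match when they differ by at most `tol`;
  // every other type (strings, ints, nulls) is compared with ==.
  def cellsMatch(a: Any, b: Any, tol: Double): Boolean = (a, b) match {
    case (x: Double, y: Double) => math.abs(x - y) <= tol
    case (x: Float, y: Float)   => math.abs(x - y) <= tol
    case _                      => a == b
  }

  // Rows must have the same count and width, and every cell pair must match.
  def rowsMatch(expected: Seq[Seq[Any]], actual: Seq[Seq[Any]], tol: Double): Boolean =
    expected.length == actual.length &&
      expected.zip(actual).forall { case (e, r) =>
        e.length == r.length && e.zip(r).forall { case (x, y) => cellsMatch(x, y, tol) }
      }
}
```

In a test, the inputs could come from something like df.collect().map(_.toSeq).toSeq after sorting both DataFrames by the same key, provided the data is small enough to collect.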


Is there an efficient way to return Array[Int] from a spark Dataframe without using collect()

I have a dataframe something like this.
|-- key1: string (nullable = true)
|-- value1: string (nullable = true)
| E1| 1|
| E3| 0|
| E4| 1|
| E2| 0|
I convert the "value1" column to an Array[Int] using the collect() function, as shown below. But this is not an efficient solution: it takes 10-15 seconds, because the DataFrame holds a lot of data and it is collected to the driver in every Spark Streaming cycle.
val data = Seq(("E1","1"), ("E3","0"), ("E4","1"), ("E2","0")) // rows taken from the table above
val columns = Seq("key1", "value1")
import spark.implicits._
val df = data.toDF(columns:_*)
val ordered_df = df.orderBy("key1").select("value1").collect().map(_(0)).toList
Output :
So, what is an efficient way to return an Array[Int] from the above DataFrame without using collect()?

How can I split a column containing array of some struct into separate columns?

I have the following scenarios:
case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])
val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
  entity("2",List(attribute("home","hyd")))) // second entity reconstructed from the output below
val df = entities.toDF()
| id| attr|
| 1|[[name,sasha], [d...|
| 2| [[home,hyd]]|
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
what I want to produce is
| id| name | home |
| 1| sasha |del |
| 2| null |hyd |
How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.
My main motive is to do a groupBy on different attributes, so I want to bring the data into the above-mentioned format.
I looked into the explode functionality. It breaks down a list into separate rows, and I don't want that. I want to create more columns from the array of attributes.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns
That can easily be reduced to "PySpark converting a column of type 'map' to multiple columns in a dataframe" or "How to get keys and values from MapType column in SparkSQL DataFrame". First convert attr to map<string, string>:
import org.apache.spark.sql.functions.{explode, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
| id| name|home|
| 1|sasha| del|
| 2| null| hyd|
Less efficient but more concise variant is to explode and pivot
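val result = df
  .select($"id", explode(map_from_entries($"attr")))
  .groupBy($"id").pivot("key").agg(first($"value")) // reconstructed: the lines after select were lost; also needs org.apache.spark.sql.functions.first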
| id|home| name|
| 1| del|sasha|
| 2| hyd| null|
but in practice I'd advise against it.
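The map-based approach boils down to three steps: turn each attribute list into a map, collect the distinct keys, then look every key up per row. A minimal sketch of that logic on plain Scala collections, without Spark (Attribute, Entity and WideFormat are hypothetical names used only for illustration):

```scala
case class Attribute(key: String, value: String)
case class Entity(id: String, attr: List[Attribute])

object WideFormat {
  // Returns the sorted distinct keys and, for each entity, its id together
  // with one Option per key (None where the entity lacks that attribute).
  def widen(entities: List[Entity]): (List[String], List[(String, List[Option[String]])]) = {
    val keys = entities.flatMap(_.attr.map(_.key)).distinct.sorted
    val rows = entities.map { e =>
      val lookup = e.attr.map(a => a.key -> a.value).toMap
      (e.id, keys.map(lookup.get))
    }
    (keys, rows)
  }
}
```

The None entries play the role of the null cells in the DataFrame output.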

How to combine 2 different dataframes together?

I have 2 DataFrames:
Users (~29.000.000 entries)
|-- userId: string (nullable = true)
Impressions (~1000 entries)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- id: string (nullable = true)
I want to walk through all the Users and attach to each User 1 Impression from these ~1000 entries. So actually at each ~1000th User the Impression would be the same, then the loop on the Impressions would start from the beginning and assign the same ~1000 impressions for the next ~1000 users.
At the end I want to have a DataFrame with the combined data. Also the Users dataframe could be reused by adding the columns of the Impressions or a newly created one would work also as a result.
You have any ideas, which would be a good solution here?
What I would do is use the old trick of adding a monotonically increasing ID to both DataFrames, then create a new column on your LARGER DataFrame (Users) which contains the modulo of each row's ID and the size of the smaller DataFrame.
This new column then provides a rolling matching key against the items in the Impressions DataFrame.
This is a minimal example (tested) to give you the idea. Obviously this will also work if you have 1000 impressions to join against:
var users = Seq("user1", "user2", "user3", "user4", "user5", "user6", "user7", "user8", "user9").toDF("users")
var impressions = Seq("a", "b", "c").toDF("impressions").withColumn("id", monotonically_increasing_id())
var cnt = impressions.count
users=users.withColumn("id", monotonically_increasing_id())
.withColumn("mod", $"id" mod cnt)
.join(impressions, $"mod"===impressions("id"))
.drop("mod") // assumed: the output below has no "mod" column
users.show()
|users| id|impressions| id|
|user1| 0| a| 0|
|user2| 1| b| 1|
|user3| 2| c| 2|
|user4| 3| a| 0|
|user5| 4| b| 1|
|user6| 5| c| 2|
|user7| 6| a| 0|
|user8| 7| b| 1|
|user9| 8| c| 2|
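One caveat with the modulo key: monotonically_increasing_id guarantees unique, increasing IDs but not consecutive ones once the data spans several partitions, so the distribution over impressions can become uneven on a large DataFrame. A consecutive index (for example from RDD.zipWithIndex) avoids that. The intended assignment itself can be sketched on plain Scala collections (RoundRobin is a hypothetical helper, not a Spark API):

```scala
object RoundRobin {
  // Pair every user with one impression, cycling through the impressions
  // in order and starting over when they run out.
  def assign[U, I](users: Seq[U], impressions: Seq[I]): Seq[(U, I)] = {
    require(impressions.nonEmpty, "need at least one impression")
    val imps = impressions.toIndexedSeq // constant-time lookup by index
    users.zipWithIndex.map { case (user, idx) => (user, imps(idx % imps.length)) }
  }
}
```

This is only meant to show the index arithmetic; for 29 million users the join-based version is the one to run on the cluster.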
Sketch of idea:
Add monotonically increasing id to both dataframes Users and Impressions via
val indexedUsersDF = usersDf.withColumn("index", monotonically_increasing_id())
val indexedImpressionsDF = impressionsDf.withColumn("index", monotonically_increasing_id())
(see spark dataframe :how to add a index Column )
Determine number of rows in Impressions via count and store as int, e.g.
val numberOfImpressions = ...
Apply UDF to index-column in indexedUsersDF that computes the modulo in a separate column (e.g. moduloIndex)
val moduloIndexedUsersDF =
Join moduloIndexedUsersDF and indexedImpressionsDF on

Spark Scala replace Dataframe blank records to "0"

I need to replace the blank records in a field of my DataFrame with "0".
Here is my code -->
import sqlContext.implicits._
case class CInspections (business_id:Int, score:String, date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,raw_inspections(1), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
I am using case class and then converting to Dataframe. But I need "score" as Int as I have to perform some operations and sort it.
But if I declare it as score:Int then I am getting error for blank values.
java.lang.NumberFormatException: For input string: "" 
|business_id|score| date| type1|
| 10| |20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| |20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
| 10| 98|20121114|Routine - Unsched...|
| 10| |20120920|Reinspection/Foll...|
| 17| |20140425|Reinspection/Foll...|
I need the score field as Int because the query below sorts it as a String, not as an Int, and gives the wrong result:
sqlContext.sql("""select raw_inspectionsDF.score from raw_inspectionsDF where score <>"" order by score""").show()
| 100|
| 100|
| 100|
An empty string can't be converted to an Integer. You need to make score nullable so that a missing field is represented as null. You can try the following:
import scala.util.{Try, Success, Failure}
1) Define a customized parse function which returns None, if the string can't be converted to an Int, in your case empty string;
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(x) => None
  }
}
2) Define the score field in your case class to be an Option[Int] type;
case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)
val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
3) Use the customized parseScore function to parse the score field;
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
  CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)),
    raw_inspections(2), raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)
| 1| null| a| b|
| 2| 3| s| k|
4) After parsing the file correctly, you can easily replace null values with 0 using the fill function of na (for example raw_inspectionsDF.na.fill(0)):
| 1| 0| a| b|
| 2| 3| s| k|
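The parsing step can also be written more compactly. The following is a minimal sketch rather than the answer's exact code: ScoreParsing and scoreOrZero are hypothetical names, and trimming whitespace before parsing is an added assumption.

```scala
import scala.util.Try

object ScoreParsing {
  // Some(n) for a valid integer, None for blank or malformed input.
  def parseScore(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Variant that substitutes 0 right away instead of keeping the value nullable.
  def scoreOrZero(s: String): Int = parseScore(s).getOrElse(0)
}
```

Keeping the Option in the case class and filling nulls later preserves the difference between a missing score and a real score of 0; scoreOrZero gives that up in exchange for a plain Int column.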

Assign label to categorical data in a table in PySpark

I want to assign the label to the categorical numbers in a dataframe below using pyspark sql.
In the MARRIAGE column 1=Married and 2=Unmarried. In the EDUCATION Column 1=Grad and 2=Undergrad
Current Dataframe:
| 1| 2| 87|
| 1| 1| 123|
| 2| 2| 3|
| 2| 1| 8|
Resulting Dataframe:
|Married |Grad | 87|
|Married |UnderGrad| 123|
|UnMarried|Grad | 3|
|UnMarried|UnderGrad| 8|
Is it possible to assign the labels using a single UDF and withColumn()? Is there any way to do the assignment in a single UDF by passing the whole DataFrame and keeping the column names as they are?
I can think of a solution that operates on each column using separate UDFs, as below, but I can't figure out whether there's a way to do both together.
from pyspark.sql import functions as F
def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"
def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"
assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)
df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)
One UDF can produce only one column. But that column can be a struct, so a single UDF can apply labels to both MARRIAGE and EDUCATION. See the code below:
from pyspark.sql.types import *
from pyspark.sql import Row
udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])
marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}
def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])
assign_labels_udf = F.udf(assign_labels, udf_result)
df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()
|-- MARRIAGE: long (nullable = true)
|-- EDUCATION: long (nullable = true)
|-- Total: long (nullable = true)
|-- labels: struct (nullable = true)
| |-- MARRIAGE: string (nullable = true)
| |-- EDUCATION: string (nullable = true)
But as you see, it's not replacing the original columns, it's just adding a new one. To replace them you will need to use withColumn twice and then drop labels.