DataFrame equality in Apache Spark - scala

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.
Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?
The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for the equivalence/equality on a meaningful test data set.

Scala (see below for PySpark)
The spark-fast-tests library has two methods for making DataFrame comparisons (I'm the creator of the library):
The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison
def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
The assertLargeDataFrameEquality method compares DataFrames spread on multiple machines (the code is basically copied from spark-testing-base)
def assertLargeDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  try {
    actualDF.rdd.cache
    expectedDF.rdd.cache

    val actualCount = actualDF.rdd.count
    val expectedCount = expectedDF.rdd.count
    if (actualCount != expectedCount) {
      throw new DataFrameContentMismatch(countMismatchMessage(actualCount, expectedCount))
    }

    val expectedIndexValue = zipWithIndex(actualDF.rdd)
    val resultIndexValue = zipWithIndex(expectedDF.rdd)

    val unequalRDD = expectedIndexValue
      .join(resultIndexValue)
      .filter {
        case (idx, (r1, r2)) =>
          !(r1.equals(r2) || RowComparer.areRowsEqual(r1, r2, 0.0))
      }

    val maxUnequalRowsToShow = 10
    assertEmpty(unequalRDD.take(maxUnequalRowsToShow))
  } finally {
    actualDF.rdd.unpersist()
    expectedDF.rdd.unpersist()
  }
}
assertSmallDataFrameEquality is faster for small DataFrame comparisons and I've found it sufficient for my test suites.
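For reference, here's a minimal, hypothetical usage in a test, assuming a SparkSession named spark is in scope:
import spark.implicits._

val actualDF = Seq(("a", 1), ("b", 2)).toDF("letter", "number")
val expectedDF = Seq(("a", 1), ("b", 2)).toDF("letter", "number")

// Passes silently when the schemas and collected rows match; throws a
// DataFrameSchemaMismatch or DataFrameContentMismatch otherwise.
assertSmallDataFrameEquality(actualDF, expectedDF)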
PySpark
Here's a simple function that returns true if the DataFrames are equal:
def are_dfs_equal(df1, df2):
    if df1.schema != df2.schema:
        return False
    if df1.collect() != df2.collect():
        return False
    return True
or simplified
def are_dfs_equal(df1, df2):
    return (df1.schema == df2.schema) and (df1.collect() == df2.collect())
You'll typically perform DataFrame equality comparisons in a test suite and will want a descriptive error message when the comparisons fail (a True / False return value doesn't help much when debugging).
Use the chispa library to access the assert_df_equality method that returns descriptive error messages for test suite workflows.

There are some standard approaches in the Apache Spark test suites; however, most of these involve collecting the data locally, so they are likely not suitable if you want to do equality testing on large DataFrames.
Check the schemas first; then you could intersect the two DataFrames into df3 and verify that the counts of df1, df2 and df3 are all equal (however, this only works if there are no duplicate rows; if there are differing duplicate rows, this method could still return true).
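For illustration, a minimal Scala sketch of that idea (the function name is made up, and it assumes no duplicate rows):
import org.apache.spark.sql.DataFrame

// Sketch: check the schema, then compare the counts of df1, df2 and their intersection.
// Only valid when neither DataFrame contains duplicate rows.
def sameContents(df1: DataFrame, df2: DataFrame): Boolean = {
  if (df1.schema != df2.schema) return false
  val df3 = df1.intersect(df2)
  df1.count == df2.count && df2.count == df3.count
}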
Another option is to get the underlying RDDs of both DataFrames, map each row to (Row, 1), do a reduceByKey to count the occurrences of each Row, cogroup the two resulting RDDs, and then do a regular aggregate, returning false if any of the iterators are not equal.
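And a rough sketch of that RDD-based variant (the names are illustrative, not from the original answer):
import org.apache.spark.sql.DataFrame

// Count the occurrences of each Row on both sides, cogroup the results, and
// check that the per-row counts match everywhere.
def rddBasedEquality(df1: DataFrame, df2: DataFrame): Boolean = {
  val counts1 = df1.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  val counts2 = df2.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  counts1.cogroup(counts2).filter {
    case (_, (c1, c2)) => c1.sum != c2.sum
  }.isEmpty()
}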

I don't know about idiomatic, but I think you can get a robust way to compare DataFrames as you describe as follows. (I'm using PySpark for illustration, but the approach carries across languages.)
a = spark.range(5)
b = spark.range(5)
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
This approach correctly handles cases where the DataFrames may have duplicate rows, rows in different orders, and/or columns in different orders.
For example:
a = spark.createDataFrame([('nick', 30), ('bob', 40)], ['name', 'age'])
b = spark.createDataFrame([(40, 'bob'), (30, 'nick')], ['age', 'name'])
c = spark.createDataFrame([('nick', 30), ('bob', 40), ('nick', 30)], ['name', 'age'])
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
c_prime = c.groupBy(sorted(c.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
assert a_prime.subtract(c_prime).count() != 0
This approach is quite expensive, but most of the expense is unavoidable given the need to perform a full diff. And this should scale fine as it doesn't require collecting anything locally. If you relax the constraint that the comparison should account for duplicate rows, then you can drop the groupBy() and just do the subtract(), which would probably speed things up notably.

Java:
assert resultDs.union(answerDs).distinct().count() == resultDs.intersect(answerDs).count();

There are 4 options, depending on whether or not you have duplicate rows.
Let's say we have two DataFrames, z1 and z2. Options 1 and 2 are good for rows without duplicates. You can try these in spark-shell.
Option 1: do except directly
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
def isEqual(left: DataFrame, right: DataFrame): Boolean = {
  if (left.columns.length != right.columns.length) return false // column counts don't match
  if (left.count != right.count) return false // record counts don't match
  return left.except(right).isEmpty && right.except(left).isEmpty
}
Option 2: generate row hash by columns
def createHashColumn(df: DataFrame): Column = {
  val colArr = df.columns
  md5(concat_ws("", colArr.map(col(_)): _*))
}
val z1SigDF = z1.select(col("index"), createHashColumn(z1).as("signature_z1"))
val z2SigDF = z2.select(col("index"), createHashColumn(z2).as("signature_z2"))
val joinDF = z1SigDF.join(z2SigDF, z1SigDF("index") === z2SigDF("index")).where($"signature_z1" =!= $"signature_z2").cache
// should be 0
joinDF.count
Option 3: use groupBy (for DataFrames with duplicate rows)
val z1Grouped = z1.groupBy(z1.columns.map(c => z1(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val z2Grouped = z2.groupBy(z2.columns.map(c => z2(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val inZ1NotInZ2 = z1Grouped.except(z2Grouped).toDF()
val inZ2NotInZ1 = z2Grouped.except(z1Grouped).toDF()
// both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Option 4: use exceptAll, which should also work for data with duplicate rows
// Source Code: https://github.com/apache/spark/blob/50538600ec/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2029
val inZ1NotInZ2 = z1.exceptAll(z2).toDF()
val inZ2NotInZ1 = z2.exceptAll(z1).toDF()
// same here, both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show

Try doing the following:
df1.except(df2).isEmpty

A scalable and easy way is to diff the two DataFrames and count the non-matching rows:
df1.diff(df2).where($"diff" =!= "N").count
If that number is not zero, then the two DataFrames are not equivalent.
The diff transformation is provided by spark-extension.
It identifies Inserted (I), Changed (C), Deleted (D) and uN-changed (N) rows.
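As an optional extra check (this assumes the default "diff" column name used above), you could also break the differences down by category:
// Per-category counts of Inserted / Changed / Deleted / uN-changed rows.
df1.diff(df2).groupBy($"diff").count().show()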

You can do this using a little bit of deduplication in combination with a full outer join. The advantage of this approach is that it does not require you to collect results to the driver, and that it avoids running multiple jobs.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// Generate some random data.
def random(n: Int, s: Long) = {
  spark.range(n).select(
    (rand(s) * 10000).cast("int").as("a"),
    (rand(s + 5) * 1000).cast("int").as("b"))
}
val df1 = random(10000000, 34)
val df2 = random(10000000, 17)
// Move all the keys into a struct (to make handling nulls easy), deduplicate the given dataset
// and count the rows per key.
def dedup(df: Dataset[Row]): Dataset[Row] = {
  df.select(struct(df.columns.map(col): _*).as("key"))
    .groupBy($"key")
    .agg(count(lit(1)).as("row_count"))
}
// Deduplicate the inputs and join them using a full outer join. The result can contain
// the following things:
// 1. Both keys are not null (and thus equal), and the row counts are the same. The dataset
// is the same for the given key.
// 2. Both keys are not null (and thus equal), and the row counts are not the same. The dataset
// contains the same keys.
// 3. Only the right key is not null.
// 4. Only the left key is not null.
val joined = dedup(df1).as("l").join(dedup(df2).as("r"), $"l.key" === $"r.key", "full")
// Summarize the differences.
val summary = joined.select(
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" === $"l.row_count", 1)).as("left_right_same_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" =!= $"l.row_count", 1)).as("left_right_different_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNull, 1)).as("left_only"),
  count(when($"l.key".isNull && $"r.key".isNotNull, 1)).as("right_only"))
summary.show()

try {
    return ds1.union(ds2)
            .groupBy(columns(ds1, ds1.columns()))
            .count()
            .filter("count % 2 > 0")
            .count() == 0;
} catch (Exception e) {
    return false;
}

Column[] columns(Dataset<Row> ds, String... columnNames) {
    List<Column> l = new ArrayList<>();
    for (String cn : columnNames) {
        l.add(ds.col(cn));
    }
    return l.stream().toArray(Column[]::new);
}
The columns method is just a helper and can be replaced by any method that returns the grouping Columns.
Logic:
Union both datasets; if the columns don't match, the union throws an exception and hence false is returned.
If the columns match, then groupBy on all columns and add a count column. Now, if the two datasets are equal, every row's count is a multiple of 2 (even for duplicate rows).
Check whether any row has a count not divisible by 2; those are the extra rows. (A rough Scala rendering of the same check is sketched below.)
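For illustration, here is a hedged Scala sketch of the same check (not the answer's original Java; the function name is made up, and count() == 0 is used in place of isEmpty for older Spark versions):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: union the two DataFrames, group by every column, and look for any
// group whose count is not a multiple of 2.
def unionParityCheck(ds1: DataFrame, ds2: DataFrame): Boolean = {
  try {
    ds1.union(ds2)
      .groupBy(ds1.columns.map(col): _*)
      .count()
      .filter("count % 2 > 0")
      .count() == 0
  } catch {
    case _: Exception => false
  }
}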

Related

Filter dataframe into 3 buckets

I have a Scala Spark Dataset, ds, and two functions, isTypeA() and isTypeB(), which take rows in that Dataset and return whether that row should be classified as A or B respectively. They can both return true for the same row, in which case I want to classify that row as A. Finally, I want C to be the rows that are neither A nor B. I would like to save these as 3 separate Datasets.
I can do this by using filter and calling the functions multiple times:
val a = ds.filter(isTypeA(_))
val b = ds.filter(row => !isTypeA(row) && isTypeB(row))
val c = ds.filter(row => !isTypeA(row) && !isTypeB(row))
but is there a more efficient way to do it?

How to convert multidimensional array to dataframe using Spark in Scala?

This is my first time using Spark or Scala, so I am a newbie. I have a 2D array, and I need to convert it to a dataframe. The sample data is a joined table in the form of rectangle (double), point (a, b) also doubles, and a boolean of whether or not the point lies within the rectangle. My end goal is to return a dataframe with the name of the rectangle and how many times it appears where ST_Contains is true. Since the query returns all the instances where it is true, I am simply trying to sort by rectangle (they are named as doubles) and count each occurrence. I put that in an array and then try to convert it to a dataset. Here is some of my code and what I have tried:
// Join two datasets (not my code)
spark.udf.register("ST_Contains",(queryRectangle:String, pointString:String)=>(HotzoneUtils.ST_Contains(queryRectangle, pointString)))
val joinDf = spark.sql("select rectangle._c0 as rectangle, point._c5 as point from rectangle,point where ST_Contains(rectangle._c0,point._c5)")
joinDf.createOrReplaceTempView("joinResult")
// MY CODE
// above join gets a view with rectangle, point, and true. so I need to loop through and count how many for each rectangle
//sort by rectangle asc first
joinDf.orderBy("rectangle")
var a = Array.ofDim[String](1, 2)
for (row <- joinDf.rdd.collect) {
  var count = 1
  var previous_r = -1.0
  var r = row.mkString(",").split(",")(0).toDouble
  var p = row.mkString(",").split(",")(1).toDouble
  var c = row.mkString(",").split(",")(2).toDouble
  if (previous_r != -1) {
    if (previous_r == r) {
      //add another to the count
      count = count + 1
    }
    else {
      //stick the result in an array
      a ++= Array(Array(previous_r.toString, count.toString))
    }
  }
  previous_r = r
}
//create dataframe from array and return it
val df = spark.createDataFrame(a).toDF()
But I keep getting this error:
inferred type arguments [Array[String]] do not conform to method createDataFrame's type parameter bounds [A <: Product]
val df = spark.createDataFrame(a).toDF()
I also tried it without the .toDf() portion and still no luck. I tried it without the createDataFrame command and just the .toDf but that did not work either.
A few things here:
createDataFrame has multiple variations and the one you end up trying is probably:
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
An Array[String] is not a Seq[A <: Product]: String is not a Product.
The fastest approach I can think of is to go via a Seq and then a DataFrame:
import spark.implicits._

Array("some string")
  .toSeq
  .toDF
or parallelize the Array[String] into an RDD[String] and then create the DataFrame.
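A hedged sketch of that parallelize route, assuming a SparkSession named spark (the column name "value" is made up):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Parallelize the Array[String] into an RDD, wrap each String in a Row, and
// supply an explicit schema.
val data = Array("some string", "another string")
val rowRDD = spark.sparkContext.parallelize(data).map(Row(_))
val schema = StructType(Seq(StructField("value", StringType, nullable = false)))
val df = spark.createDataFrame(rowRDD, schema)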
The toDF() after createDataFrame adds nothing; createDataFrame already returns a DataFrame (if it worked).

Spark find previous value on each iteration of RDD

I have the following code:
val rdd = sc.cassandraTable("db", "table").select("id", "date", "gpsdt").where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  {
    val previousRow = (records - 1)th row
    val currentRow = records
    // Some calculation based on both rows
  }
}
So, the idea is to get the previous/next row on each iteration of the RDD. I want to calculate some field on the current row based on the value present in the previous row. Thanks,
EDIT II: I misunderstood the question; the approach below gives tumbling window semantics, but a sliding window is needed. Assuming this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note however that this is using a DeveloperAPI.
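A toy example of what the sliding windows look like (unrelated to the Cassandra data above, just to show the shape):
import org.apache.spark.mllib.rdd.RDDFunctions._

// Pair each element with its predecessor in a sorted RDD.
val sorted = sc.parallelize(Seq(1, 3, 7, 8)).sortBy(identity)
val pairs = sorted.sliding(2).map { case Array(prev, curr) => (prev, curr) }
pairs.collect() // Array((1,3), (3,7), (7,8))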
Alternatively, you can:
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
RDD joins should be inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null.
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated as if (k % 2 == 0) k / 2 else (k - 1) / 2; this gives you a key that has the same value for two successive keys. Then you could just group by it.
But to reiterate: in most cases there is no really sensible notion of "previous" for RDDs (depending on partitioning, data source, etc.).
EDIT: So now that you have a zipWithIndex and an ordering in your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] and can do
rdd.map( kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ( (kv._1 - 1) / 2, kv._2 ) ).groupByKey.foreach (/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.

Tried/failed to replace null values with means in spark dataframe

Update: I was wrong, the error stems from the VectorAssembler, not the random forest, or it comes from both. But the error/issue is the same. When I use the df_noNulls dataframe in the VectorAssembler, it says it cannot vectorize the columns because there are null values.
I've looked at other answers for this question and liberated/borrowed/stolen the answer code to try to get this to work. My end goal is RF/GB/other ML modeling, which do not take kindly to null values. I've put together the following code to pull all numeric columns, get each column's mean, then create a new dataframe that joins the two and replaces all the nulls with the mean. When I then try to create a vector of the numeric columns as the "features" part of the random forest, it returns an error that says "Values to assemble cannot be null".
val numCols = DF.schema.fields filter {
  x => x.dataType match {
    case x: org.apache.spark.sql.types.DoubleType => true
    case x: org.apache.spark.sql.types.IntegerType => true
    case x: org.apache.spark.sql.types.LongType => true
    case _ => false
  }
} map { x => x.name }
//NUMCOLS NOW IS AN ARRAY OF ALL NUMERIC COLUMN NAMES
val numDf = DF.select(numCols.map(col): _*)
//NUMDF IS A DATAFRAME OF ALL NUMERIC COLUMNS
val means = numDf.agg(numDf.columns.map(c => (c -> "avg")).toMap)
//CREATES A DATAFRAME OF MEANS OF ALL NUMERIC VARIABLES
means.persist()
//PERSIST TABLE 'MEANS' FOR JOINING --BROADCAST ALSO WORKS BUT I WAS GETTING MEMORY ISSUES WITH IT SO I SWITCHED IT
val exprs = numDf.columns.map(c => coalesce(col(c), col(s"avg($c)")).alias(c))
//EXPRS CREATES FUNCTION TO REPLACE NULLS WITH MEANS
val df_noNulls = DF.crossJoin(means).select(exprs: _*)
df_noNulls should now be a dataframe of only the numeric columns with no null values, the nulls having been replaced with the column means. Yet when trying to make a vector of all the values (minus the label/target) I get the "Values to assemble cannot be null" error. I've attached a screenshot of the error in case that might help. It also says it failed to execute a user defined function.
I know I've been asking a lot of questions about scala here recently, sorry about that, I'm just really trying to learn to do this. Below is the rest of the code to the RF step in case the mistake is there somewhere:
val num_feat = numCols.filter(! _.contains("call"))
val features=num_feat
val featureAssembler = new VectorAssembler().setInputCols(features).setOutputCol("features")
val reweight_vector = featureAssembler.transform(df_noNulls)
val rf50 = new RandomForestClassifier().setSeed(9).setLabelCol("call_ind").setFeaturesCol("features").setNumTrees(500).setMaxBins(100).fit(reweight_vector)
I am guessing that the cause for this is a column that is entirely null - in that case, the average would be null too. To avoid that, you can simply add another "fallback" in the coalesce expression, using a literal 0 for example:
val exprs = numDf.columns.map(c => coalesce(col(c), col(s"avg($c)"), lit(0.0)).alias(c))
With the rest of the code unchanged, this should ensure none of the values in df_noNulls is null.

Dropping the first and last row of an RDD with Spark

I'm reading in a text file using spark with sc.textFile(fileLocation) and need to be able to quickly drop the first and last row (they could be a header or trailer). I've found good ways of returning the first and last row, but no good one for removing them. Is this possible?
One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:
// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()
// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}
Do note that this might be rather costly in terms of performance (if you cache the RDD, you use up memory; if you don't, you read the RDD twice). So, if you have any way of identifying these records based on their contents (e.g. if you know all records but these should contain a certain pattern), using filter would probably be faster.
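For example, a sketch assuming the header and trailer lines can be recognized by known prefixes (the prefixes here are hypothetical):
// Keep everything that is neither the header nor the trailer record.
val result = rdd.filter { line =>
  !line.startsWith("HEADER") && !line.startsWith("TRAILER")
}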
This might be a lighter version:
val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1)
  else if (idx == partitions - 1) iter.sliding(2).map(_.head)
  else iter
}
scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
Here is my take on it. It may require an action (count), but it always gives the expected results, independent of the number of partitions.
val rddRowCount = rdd.count()
val rddWithIndices = rdd.zipWithIndex()
val filteredRddWithIndices = rddWithIndices.filter(eachRow =>
  if (eachRow._2 == 0) false
  else if (eachRow._2 == rddRowCount - 1) false
  else true
)
val finalRdd = filteredRddWithIndices.map(eachRow => eachRow._1)