Filter dataframe into 3 buckets - scala

I have a Scala Spark Dataset, ds, and two functions, isTypeA() and isTypeB(), which take rows of that Dataset and return whether the row should be classified as A or B respectively. They can both return true for the same row, in which case I want to classify that row as A. Finally, I want C to be the rows that are neither A nor B. I would like to save these as 3 separate Datasets.
I can do this by using filter and calling the functions multiple times:
val a = ds.filter(isTypeA(_))
val b = ds.filter(row => !isTypeA(row) && isTypeB(row))
val c = ds.filter(row => !isTypeA(row) && !isTypeB(row))
but is there a more efficient way to do it?
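One modest improvement, assuming ds is not already cached and fits in the available memory, is to cache it so the three filters do not each recompute (or re-read) the source. This is only a sketch of that idea, not a claim that it is the optimal single-pass classification:
// Cache the source Dataset so all three filters reuse the same computed partitions.
ds.cache()
val a = ds.filter(row => isTypeA(row))
val b = ds.filter(row => !isTypeA(row) && isTypeB(row))
val c = ds.filter(row => !isTypeA(row) && !isTypeB(row))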

Related

Spark find previous value on each iteration of RDD

I have the following code:
val rdd = sc.cassandraTable("db", "table")
  .select("id", "date", "gpsdt")
  .where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { record =>
  val previousRow = ??? // pseudocode: the row at (index - 1) in the sorted RDD
  val currentRow = record
  // Some calculation based on both rows
}
So, the idea is to get the previous/next row on each iteration over the RDD. I want to calculate some field on the current row based on the value present in the previous row. Thanks.
EDIT II: I misunderstood the question; the "OLD STUFF" below shows how to get tumbling-window semantics, but a sliding window is needed. Assuming this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note however that this is using a DeveloperAPI.
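For completeness, here is a hedged sketch of how the sliding(2) output could be consumed to get (previous, current) pairs; sortedRDD and the per-pair calculation are stand-ins for the question's data:
import org.apache.spark.mllib.rdd.RDDFunctions._
// sliding(2) yields an RDD of length-2 arrays of consecutive elements,
// so each array holds a row together with the row that follows it.
val pairs = sortedRDD.sliding(2).map { case Array(previousRow, currentRow) =>
  // some calculation based on both rows; here we simply keep them as a pair
  (currentRow, previousRow)
}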
Alternatively, you can:
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
RDD joins are inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null.
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column that gets calculated the following way: if (k % 2 == 0) k / 2 else (k - 1) / 2. This should give you a key that has the same value for two successive keys. Then you could just group by.
But to reiterate, in most cases there is no really sensible notion of "previous" for RDDs (it depends on partitioning, the data source, etc.).
EDIT: Now that you have a zipWithIndex and an ordering in your set, you can do what I mentioned above. You now have an RDD[(Long, YourData)] and can do:
rdd.map(kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2)).groupByKey.foreach(/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey() followed by a reduce.
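To illustrate that last point with a hedged sketch (the pair RDD and the sum are invented for the example, reusing the question's sc):
// A small pair RDD just for illustration.
val pairRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// groupByKey shuffles every value across the network, then reduces afterwards.
val viaGroupByKey = pairRdd.groupByKey().mapValues(_.sum)
// reduceByKey applies the same combine function map-side before the shuffle.
val viaReduceByKey = pairRdd.reduceByKey(_ + _)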

How can I use Spark to check if two HDFS datasets are equal?

For example, I have the outputs of two Spark jobs: a: part-00000 part-00001 ... part-00099 and b: part-00000 part-00001 ... part-00099.
Is there an easy way to test whether a equals b regardless of line order? Note that the Spark partition order is not the same, so part-00000 in a and b might differ even if a equals b.
You could calculate the intersection of the two DataFrames (the common lines) and check its size:
val df1 = spark.read.parquet("file1")
val df2 = spark.read.parquet("file2")
val equal = df1.count == df2.count && df2.count == df1.intersect(df2).count
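If the outputs are plain text part files rather than Parquet, the same idea applies with a text read; a hedged sketch (the HDFS paths are placeholders):
// spark.read.text gives a DataFrame with a single "value" column, one row per line.
val a = spark.read.text("hdfs:///path/to/a")
val b = spark.read.text("hdfs:///path/to/b")
val equal = a.count == b.count && a.count == a.intersect(b).count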

How to filter a sorted RDD by taking top N rows

I have 2 key-value pair RDDs, A and B, that I work with. Let's say that B has 10000 rows and I have sorted B by its values:
B = B0.map(_.swap).sortByKey().map(_.swap)
I need to take the top 5000 from B and use that to join with A. I know I could do:
B1 = B.take(5000)
or
B1 = B.zipWithIndex().filter(_._2 < 5000).map(_._1)
It seems that both will trigger computation. Since B1 is just an intermediate result, I would like to have it not trigger real computation. Is there a better way to achieve that?
As far as I know, there is no other way to achieve that using RDDs, but you can leverage DataFrames to achieve the same thing.
First, convert your RDD to a DataFrame.
Then limit the DataFrame to the top 5000 rows.
Then you can get the new RDD back from the DataFrame.
Up to this point, no computation will be triggered by Spark.
Below is a sample proof of concept.
import org.apache.spark.sql.SparkSession

def main(arg: Array[String]): Unit = {
  // assuming a local SparkSession for this proof of concept
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val a =
    Array(
      Array("key_1", "value_1"),
      Array("key_2", "value_2"),
      Array("key_3", "value_3"),
      Array("key_4", "value_4"),
      Array("key_5", "value_5")
    )

  val rdd = spark.sparkContext.makeRDD(a)

  val df = rdd.map({
    case Array(key, value) => PairRdd(key, value)
  }).toDF()

  val dfWithTop = df.limit(3)
  val rddWithTop = dfWithTop.rdd

  // up to this point no computation has been triggered
  // rddWithTop.take(100) would trigger computation
}

case class PairRdd(key: String, value: String)

Get range of Dataframe Row

So I've loaded a dataframe from a parquet file. This dataframe now contains an unspecified number of columns. The first column is a Label, and the following are features.
I want to save each row in the dataframe as a LabeledPoint.
So far I'm thinking:
val labeledPoints: RDD[LabeledPoint] = df.map { row => LabeledPoint(row.getInt(0), Vectors.dense(row.getDouble(1), row.getDouble(2))) }
It's easy to get the column indexes, but when handling a lot of columns this won't scale. I'd like to be able to load the entire row starting from index 1 (since index 0 is the label) into a dense vector.
Any ideas?
This should do the trick:
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

df.map {
  row: Row =>
    // column 0 is the label; everything from index 1 onwards goes into the feature vector
    val data = for (index <- 1 until row.length) yield row.getDouble(index)
    val vector = new DenseVector(data.toArray)
    new LabeledPoint(row.getInt(0), vector)
}

DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.
Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?
The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for the equivalence/equality on a meaningful test data set.
Scala (see below for PySpark)
The spark-fast-tests library has two methods for making DataFrame comparisons (I'm the creator of the library):
The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison
def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
The assertLargeDataFrameEquality method compares DataFrames spread on multiple machines (the code is basically copied from spark-testing-base)
def assertLargeDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  try {
    actualDF.rdd.cache
    expectedDF.rdd.cache

    val actualCount = actualDF.rdd.count
    val expectedCount = expectedDF.rdd.count
    if (actualCount != expectedCount) {
      throw new DataFrameContentMismatch(countMismatchMessage(actualCount, expectedCount))
    }

    val expectedIndexValue = zipWithIndex(actualDF.rdd)
    val resultIndexValue = zipWithIndex(expectedDF.rdd)

    val unequalRDD = expectedIndexValue
      .join(resultIndexValue)
      .filter {
        case (idx, (r1, r2)) =>
          !(r1.equals(r2) || RowComparer.areRowsEqual(r1, r2, 0.0))
      }

    val maxUnequalRowsToShow = 10
    assertEmpty(unequalRDD.take(maxUnequalRowsToShow))
  } finally {
    actualDF.rdd.unpersist()
    expectedDF.rdd.unpersist()
  }
}
assertSmallDataFrameEquality is faster for small DataFrame comparisons and I've found it sufficient for my test suites.
PySpark
Here's a simple function that returns true if the DataFrames are equal:
def are_dfs_equal(df1, df2):
    if df1.schema != df2.schema:
        return False
    if df1.collect() != df2.collect():
        return False
    return True
or, simplified:
def are_dfs_equal(df1, df2):
    return (df1.schema == df2.schema) and (df1.collect() == df2.collect())
You'll typically perform DataFrame equality comparisons in a test suite and will want a descriptive error message when the comparisons fail (a True / False return value doesn't help much when debugging).
Use the chispa library to access the assert_df_equality method that returns descriptive error messages for test suite workflows.
There are some standard ways in the Apache Spark test suites; however, most of these involve collecting the data locally, so if you want to do equality testing on large DataFrames they are likely not a suitable solution.
Check the schema first, then you could take the intersection into df3 and verify that the counts of df1, df2 and df3 are all equal (however, this only works if there are no duplicate rows; if the DataFrames have different duplicate rows this method could still return true).
Another option would be to get the underlying RDDs of both DataFrames, map them to (Row, 1), do a reduceByKey to count the occurrences of each Row, cogroup the two resulting RDDs, and then do a regular aggregate, returning false if any of the iterators are not equal.
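A minimal, hedged sketch of that second idea (the function name is made up, and it assumes both DataFrames share the same column order so Row equality is meaningful):
import org.apache.spark.sql.DataFrame

// Compare two DataFrames as multisets of rows without collecting them to the driver.
def sameRowCounts(df1: DataFrame, df2: DataFrame): Boolean = {
  val counts1 = df1.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  val counts2 = df2.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
  counts1.cogroup(counts2).filter {
    // a mismatch is any row whose occurrence counts differ (or are missing on one side)
    case (_, (c1, c2)) => c1.toSeq != c2.toSeq
  }.isEmpty()
}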
I don't know about idiomatic, but I think you can get a robust way to compare DataFrames as you describe as follows. (I'm using PySpark for illustration, but the approach carries across languages.)
a = spark.range(5)
b = spark.range(5)
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
This approach correctly handles cases where the DataFrames may have duplicate rows, rows in different orders, and/or columns in different orders.
For example:
a = spark.createDataFrame([('nick', 30), ('bob', 40)], ['name', 'age'])
b = spark.createDataFrame([(40, 'bob'), (30, 'nick')], ['age', 'name'])
c = spark.createDataFrame([('nick', 30), ('bob', 40), ('nick', 30)], ['name', 'age'])
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
c_prime = c.groupBy(sorted(c.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
assert a_prime.subtract(c_prime).count() != 0
This approach is quite expensive, but most of the expense is unavoidable given the need to perform a full diff. And this should scale fine as it doesn't require collecting anything locally. If you relax the constraint that the comparison should account for duplicate rows, then you can drop the groupBy() and just do the subtract(), which would probably speed things up notably.
Java:
assert resultDs.union(answerDs).distinct().count() == resultDs.intersect(answerDs).count();
There are 4 options, depending on whether or not you have duplicate rows.
Let's say we have two DataFrames, z1 and z2. Options 1 and 2 are good for rows without duplicates. You can try these in spark-shell.
Option 1: do except directly
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column

def isEqual(left: DataFrame, right: DataFrame): Boolean = {
  if (left.columns.length != right.columns.length) return false // column counts don't match
  if (left.count != right.count) return false // record counts don't match
  return left.except(right).isEmpty && right.except(left).isEmpty
}
Option 2: generate row hash by columns
def createHashColumn(df: DataFrame): Column = {
  val colArr = df.columns
  md5(concat_ws("", colArr.map(col(_)): _*))
}

val z1SigDF = z1.select(col("index"), createHashColumn(z1).as("signature_z1"))
val z2SigDF = z2.select(col("index"), createHashColumn(z2).as("signature_z2"))
val joinDF = z1SigDF
  .join(z2SigDF, z1SigDF("index") === z2SigDF("index"))
  .where($"signature_z1" =!= $"signature_z2")
  .cache
// should be 0
joinDF.count
Option 3: use groupBy (for DataFrames with duplicate rows)
val z1Grouped = z1.groupBy(z1.columns.map(c => z1(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val z2Grouped = z2.groupBy(z2.columns.map(c => z2(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val inZ1NotInZ2 = z1Grouped.except(z2Grouped).toDF()
val inZ2NotInZ1 = z2Grouped.except(z1Grouped).toDF()
// both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Option 4: use exceptAll, which should also work for data with duplicate rows
// Source Code: https://github.com/apache/spark/blob/50538600ec/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2029
val inZ1NotInZ2 = z1.exceptAll(z2).toDF()
val inZ2NotInZ1 = z2.exceptAll(z1).toDF()
// same here, both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Try doing the following:
df1.except(df2).isEmpty
A scalable and easy way is to diff the two DataFrames and count the non-matching rows:
df1.diff(df2).where($"diff" =!= "N").count
If that number is not zero, then the two DataFrames are not equivalent.
The diff transformation is provided by spark-extension.
It identifies Inserted, Changed, Deleted and uN-changed rows.
You can do this using a little bit of deduplication in combination with a full outer join. The advantage of this approach is that it does not require you to collect results to the driver, and that it avoids running multiple jobs.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

// Generate some random data.
def random(n: Int, s: Long) = {
  spark.range(n).select(
    (rand(s) * 10000).cast("int").as("a"),
    (rand(s + 5) * 1000).cast("int").as("b"))
}

val df1 = random(10000000, 34)
val df2 = random(10000000, 17)

// Move all the keys into a struct (to make handling nulls easy), deduplicate the given dataset
// and count the rows per key.
def dedup(df: Dataset[Row]): Dataset[Row] = {
  df.select(struct(df.columns.map(col): _*).as("key"))
    .groupBy($"key")
    .agg(count(lit(1)).as("row_count"))
}

// Deduplicate the inputs and join them using a full outer join. The result can contain
// the following things:
// 1. Both keys are not null (and thus equal), and the row counts are the same. The dataset
//    is the same for the given key.
// 2. Both keys are not null (and thus equal), and the row counts are not the same. The dataset
//    contains the same keys.
// 3. Only the right key is not null.
// 4. Only the left key is not null.
val joined = dedup(df1).as("l").join(dedup(df2).as("r"), $"l.key" === $"r.key", "full")

// Summarize the differences.
val summary = joined.select(
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" === $"l.row_count", 1)).as("left_right_same_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" =!= $"l.row_count", 1)).as("left_right_different_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNull, 1)).as("left_only"),
  count(when($"l.key".isNull && $"r.key".isNotNull, 1)).as("right_only"))

summary.show()
try {
    return ds1.union(ds2)
            .groupBy(columns(ds1, ds1.columns()))
            .count()
            .filter("count % 2 > 0")
            .count() == 0;
} catch (Exception e) {
    return false;
}

Column[] columns(Dataset<Row> ds, String... columnNames) {
    List<Column> l = new ArrayList<>();
    for (String cn : columnNames) {
        l.add(ds.col(cn));
    }
    return l.stream().toArray(Column[]::new);
}
The columns method is just a helper and can be replaced by any method that returns the columns to group by.
Logic:
Union both datasets; if the columns don't match, it will throw an exception and hence return false.
If the columns match, then groupBy on all columns and add a count column. Now, if the datasets are equal, every row appears with a count that is a multiple of 2 (even duplicate rows).
Check whether any row has a count not divisible by 2; those are the extra rows.