Spark find previous value on each iteration of RDD - scala

I have the following code:
val rdd = sc.cassandraTable("db", "table")
  .select("id", "date", "gpsdt")
  .where("id=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
val rddcopy = rdd.sortBy(row => row.get[String]("gpsdt"), false).zipWithIndex()
rddcopy.foreach { records =>
  // val previousRow = <the row at (index - 1)>  // pseudocode: this is what I need
  val currentRow = records
  // Some calculation based on both rows
}
So the idea is to get the previous/next row on each iteration of the RDD. I want to calculate some field on the current row based on the value present in the previous row. Thanks.

EDIT II: I misunderstood the question; the answer below gives tumbling-window semantics, but a sliding window is needed. Considering this is a sorted RDD,
import org.apache.spark.mllib.rdd.RDDFunctions._
sortedRDD.sliding(2)
should do the trick. Note however that this is using a DeveloperAPI.
Alternatively, you can do:
val l = sortedRdd.zipWithIndex.map(kv => (kv._2, kv._1))
val r = sortedRdd.zipWithIndex.map(kv => (kv._2-1, kv._1))
val sliding = l.join(r)
RDD joins should be inner joins (IIRC), thus dropping the edge cases where the tuples would be partially null.
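To make this concrete, here is a minimal sketch (my addition) of consuming the sliding(2) variant; sortedRDD is assumed to be the RDD from the question sorted by "gpsdt":
import org.apache.spark.mllib.rdd.RDDFunctions._
// each window is an Array of two consecutive rows: window(0) is the previous row, window(1) the current one
val withPrevious = sortedRDD.sliding(2).map { window =>
  val previousRow = window(0)
  val currentRow = window(1)
  // put your calculation based on both rows here; returning the pair for illustration
  (previousRow, currentRow)
}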
OLD STUFF:
How do you identify the previous row? RDDs do not have any sort of stable ordering by themselves. If you have an incrementing dense key, you could add a new column calculated the following way: if (k % 2 == 0) k / 2 else (k - 1) / 2. This gives you a key that has the same value for two successive keys. Then you could just group by it.
But to reiterate: in most cases there is no really sensible notion of "previous" for RDDs (it depends on partitioning, data source, etc.).
EDIT: now that you have zipWithIndex and an ordering in your set, you can do what I mentioned above. After swapping each pair (zipWithIndex puts the index second) you have an RDD[(Long, YourData)] and can do
rdd.map( kv => if (kv._1 % 2 == 0) (kv._1 / 2, kv._2) else ((kv._1 - 1) / 2, kv._2) ).groupByKey.foreach(/* your stuff here */)
If you reduce at any point, consider using reduceByKey rather than groupByKey().reduce.
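As a rough sketch of that tip (my addition; indexedRdd and combine are hypothetical placeholders for your index-keyed RDD and your pairwise calculation):
// pair up consecutive indices and reduce each pair directly, instead of grouping
indexedRdd
  .map { case (k, v) => (if (k % 2 == 0) k / 2 else (k - 1) / 2, v) }
  .reduceByKey((a, b) => combine(a, b))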

Related

Order Spark RDD based on ordering in another RDD

I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small since you talk about broadcasting it. In that case you are right, broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
  Seq("C","B","F","K"),
  Seq("B","A","Z","M"),
  Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
  .map( _.sortBy(x => {
    val index = ordering_br.value.indexOf(x)
    if (index == -1) Int.MaxValue else index
  }))
Note that indexOf returns -1 if the element is not found in the list. If we left it as is, all non-found elements would end up at the beginning. I understand that you want them at the end, so I replace -1 with some big number.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)

How do I read the value of a continuous index from spark RDD

I have a problem in Spark/Scala: I want to get the first value of each consecutive series of keys. I created a new RDD like this:
[(a,1),(a,2),(a,3),(a,4),(b,1),(b,2),(a,3),(a,4),(a,5),(b,8),(b,9)]
I want to fetch the result like this:
[(a,1),(b,1),(a,3),(b,8)]
How can I do this in Scala with an RDD?
As mentioned in comments, in order to be able to use the order of the elements in an RDD, you'd have to somehow represent this order in the data itself. For that purpose exactly, zipWithIndex was created - the index is added to the data; Then, with some manipulation (join on an RDD with modified indices) we can get what you need:
// add index to RDD:
val withIndex = rdd.zipWithIndex().map(_.swap)
// create another RDD with indices increased by one, to later join each element with the previous one
val previous = withIndex.map { case (index, v) => (index + 1, v) }
// join RDDs, filter out those where previous "key" is identical
val result = withIndex.leftOuterJoin(previous).collect {
  case (i, (left, None)) => (i, left) // keep the first element in the RDD
  case (i, (left, Some((key, _)))) if left._1 != key => (i, left) // keep only elements where the previous key is different
}.sortByKey().values // if you want to preserve the original order...
result.collect().foreach(println)
// (a,1)
// (b,1)
// (a,3)
// (b,8)

How to split a spark dataframe with equal records

I am using df.randomSplit() but it is not splitting into equal numbers of rows. Is there any other way I can achieve it?
In my case I needed balanced (equal sized) partitions in order to perform a specific cross validation experiment.
For that you usually:
Randomize the dataset
Apply modulus operation to assign each element to a fold (partition)
After this step you will have to extract each partition using filter; AFAIK there is still no transformation to separate a single RDD into many.
Here is some code in Scala; it only uses standard Spark operations, so it should be easy to adapt to Python:
val npartitions = 3
// `rdd`, `seed` and `m_classIndex` are assumed to be defined elsewhere in your program
val foldedRDD = rdd
  // Pair each instance with a deterministic pseudo-random number
  .zipWithIndex
  .map( t => (t._1, t._2, new scala.util.Random(t._2 * seed).nextInt()) )
  // Random ordering
  .sortBy( t => (t._1(m_classIndex), t._3) )
  // Assign each instance to a fold
  .zipWithIndex
  .map( t => (t._1, t._2 % npartitions) )
val balancedRDDList =
  for (f <- 0 until npartitions)
    yield foldedRDD.filter( _._2 == f )

DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.
Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?
The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for the equivalence/equality on a meaningful test data set.
Scala (see below for PySpark)
The spark-fast-tests library has two methods for making DataFrame comparisons (I'm the creator of the library):
The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison
def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
The assertLargeDataFrameEquality method compares DataFrames spread on multiple machines (the code is basically copied from spark-testing-base)
def assertLargeDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  try {
    actualDF.rdd.cache
    expectedDF.rdd.cache
    val actualCount = actualDF.rdd.count
    val expectedCount = expectedDF.rdd.count
    if (actualCount != expectedCount) {
      throw new DataFrameContentMismatch(countMismatchMessage(actualCount, expectedCount))
    }
    val expectedIndexValue = zipWithIndex(actualDF.rdd)
    val resultIndexValue = zipWithIndex(expectedDF.rdd)
    val unequalRDD = expectedIndexValue
      .join(resultIndexValue)
      .filter {
        case (idx, (r1, r2)) =>
          !(r1.equals(r2) || RowComparer.areRowsEqual(r1, r2, 0.0))
      }
    val maxUnequalRowsToShow = 10
    assertEmpty(unequalRDD.take(maxUnequalRowsToShow))
  } finally {
    actualDF.rdd.unpersist()
    expectedDF.rdd.unpersist()
  }
}
assertSmallDataFrameEquality is faster for small DataFrame comparisons and I've found it sufficient for my test suites.
PySpark
Here's a simple function that returns true if the DataFrames are equal:
def are_dfs_equal(df1, df2):
    if df1.schema != df2.schema:
        return False
    if df1.collect() != df2.collect():
        return False
    return True
or simplified
def are_dfs_equal(df1, df2):
    return (df1.schema == df2.schema) and (df1.collect() == df2.collect())
You'll typically perform DataFrame equality comparisons in a test suite and will want a descriptive error message when the comparisons fail (a True / False return value doesn't help much when debugging).
Use the chispa library to access the assert_df_equality method that returns descriptive error messages for test suite workflows.
There are some standard ways in the Apache Spark test suites; however, most of these involve collecting the data locally, and if you want to do equality testing on large DataFrames then that is likely not a suitable solution.
Check the schema first, then you could take the intersection into df3 and verify that the counts of df1, df2 and df3 are all equal (however, this only works if there aren't duplicate rows; if there are different duplicate rows this method could still return true). A minimal sketch of this check follows.
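A minimal sketch of that check (my addition), assuming df1 and df2 have no duplicate rows and their schemas have already been compared:
// rows present in both DataFrames
val df3 = df1.intersect(df2)
// with no duplicates, equal counts of df1, df2 and df3 imply the same set of rows
val equivalent = df1.count() == df2.count() && df2.count() == df3.count()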
Another option would be getting the underlying RDDs of both DataFrames, mapping each row to (Row, 1), doing a reduceByKey to count the number of each Row, then cogrouping the two resulting RDDs and doing a regular aggregate, returning false if any of the iterators are not equal. A hedged sketch of this idea follows.
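Here is a hedged sketch of that idea (my addition, not code from the original answer); it assumes the schemas already match and that exact Row equality is acceptable:
import org.apache.spark.sql.DataFrame
// count the multiplicity of each distinct row
def rowCounts(df: DataFrame) = df.rdd.map(row => (row, 1L)).reduceByKey(_ + _)
// cogroup the two count RDDs and check that every row appears with the same
// multiplicity on both sides; fold(true) keeps empty inputs from throwing
def sameRows(df1: DataFrame, df2: DataFrame): Boolean =
  rowCounts(df1).cogroup(rowCounts(df2))
    .map { case (_, (left, right)) => left.toSeq == right.toSeq }
    .fold(true)(_ && _)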
I don't know about idiomatic, but I think you can get a robust way to compare DataFrames as you describe as follows. (I'm using PySpark for illustration, but the approach carries across languages.)
a = spark.range(5)
b = spark.range(5)
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
This approach correctly handles cases where the DataFrames may have duplicate rows, rows in different orders, and/or columns in different orders.
For example:
a = spark.createDataFrame([('nick', 30), ('bob', 40)], ['name', 'age'])
b = spark.createDataFrame([(40, 'bob'), (30, 'nick')], ['age', 'name'])
c = spark.createDataFrame([('nick', 30), ('bob', 40), ('nick', 30)], ['name', 'age'])
a_prime = a.groupBy(sorted(a.columns)).count()
b_prime = b.groupBy(sorted(b.columns)).count()
c_prime = c.groupBy(sorted(c.columns)).count()
assert a_prime.subtract(b_prime).count() == b_prime.subtract(a_prime).count() == 0
assert a_prime.subtract(c_prime).count() != 0
This approach is quite expensive, but most of the expense is unavoidable given the need to perform a full diff. And this should scale fine as it doesn't require collecting anything locally. If you relax the constraint that the comparison should account for duplicate rows, then you can drop the groupBy() and just do the subtract(), which would probably speed things up notably.
Java:
assert resultDs.union(answerDs).distinct().count() == resultDs.intersect(answerDs).count();
There are 4 options, depending on whether you have duplicate rows or not.
Let's say we have two DataFrames, z1 and z2. Options 1 and 2 are good for rows without duplicates. You can try these in spark-shell.
Option 1: do except directly
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Column
def isEqual(left: DataFrame, right: DataFrame): Boolean = {
  if (left.columns.length != right.columns.length) return false // column counts don't match
  if (left.count != right.count) return false // record counts don't match
  return left.except(right).isEmpty && right.except(left).isEmpty
}
Option 2: generate row hash by columns
def createHashColumn(df: DataFrame) : Column = {
val colArr = df.columns
md5(concat_ws("", (colArr.map(col(_))) : _*))
}
val z1SigDF = z1.select(col("index"), createHashColumn(z1).as("signature_z1"))
val z2SigDF = z2.select(col("index"), createHashColumn(z2).as("signature_z2"))
val joinDF = z1SigDF.join(z2SigDF, z1SigDF("index") === z2SigDF("index")).where($"signature_z1" =!= $"signature_z2").cache
// should be 0
joinDF.count
Option 3: use groupBy (for DataFrames with duplicate rows)
val z1Grouped = z1.groupBy(z1.columns.map(c => z1(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val z2Grouped = z2.groupBy(z2.columns.map(c => z2(c)).toSeq : _*).count().withColumnRenamed("count", "recordRepeatCount")
val inZ1NotInZ2 = z1Grouped.except(z2Grouped).toDF()
val inZ2NotInZ1 = z2Grouped.except(z1Grouped).toDF()
// both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Option 4: use exceptAll, which should also work for data with duplicate rows
// Source Code: https://github.com/apache/spark/blob/50538600ec/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2029
val inZ1NotInZ2 = z1.exceptAll(z2).toDF()
val inZ2NotInZ1 = z2.exceptAll(z1).toDF()
// same here, both should be size 0
inZ1NotInZ2.show
inZ2NotInZ1.show
Try doing the following:
df1.except(df2).isEmpty
A scalable and easy way is to diff the two DataFrames and count the non-matching rows:
df1.diff(df2).where($"diff" =!= "N").count
If that number is not zero, then the two DataFrames are not equivalent.
The diff transformation is provided by spark-extension.
It identifies Inserted, Changed, Deleted and uN-changed rows.
You can do this using a little bit of deduplication in combination with a full outer join. The advantage of this approach is that it does not require you to collect results to the driver, and that it avoids running multiple jobs.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
// Generate some random data.
def random(n: Int, s: Long) = {
  spark.range(n).select(
    (rand(s) * 10000).cast("int").as("a"),
    (rand(s + 5) * 1000).cast("int").as("b"))
}
val df1 = random(10000000, 34)
val df2 = random(10000000, 17)
// Move all the keys into a struct (to make handling nulls easy), deduplicate the given dataset
// and count the rows per key.
def dedup(df: Dataset[Row]): Dataset[Row] = {
  df.select(struct(df.columns.map(col): _*).as("key"))
    .groupBy($"key")
    .agg(count(lit(1)).as("row_count"))
}
// Deduplicate the inputs and join them using a full outer join. The result can contain
// the following things:
// 1. Both keys are not null (and thus equal), and the row counts are the same. The dataset
// is the same for the given key.
// 2. Both keys are not null (and thus equal), and the row counts are not the same. The dataset
// contains the same keys.
// 3. Only the right key is not null.
// 4. Only the left key is not null.
val joined = dedup(df1).as("l").join(dedup(df2).as("r"), $"l.key" === $"r.key", "full")
// Summarize the differences.
val summary = joined.select(
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" === $"l.row_count", 1)).as("left_right_same_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNotNull && $"r.row_count" =!= $"l.row_count", 1)).as("left_right_different_rc"),
  count(when($"l.key".isNotNull && $"r.key".isNull, 1)).as("left_only"),
  count(when($"l.key".isNull && $"r.key".isNotNull, 1)).as("right_only"))
summary.show()
try {
    return ds1.union(ds2)
        .groupBy(columns(ds1, ds1.columns()))
        .count()
        .filter("count % 2 > 0")
        .count() == 0;
} catch (Exception e) {
    return false;
}

Column[] columns(Dataset<Row> ds, String... columnNames) {
    List<Column> l = new ArrayList<>();
    for (String cn : columnNames) {
        l.add(ds.col(cn));
    }
    return l.stream().toArray(Column[]::new);
}
The columns method is supplementary and can be replaced by any method that returns the grouping columns as a Column[] (or Seq of columns).
Logic:
Union both the datasets; if the columns don't match, it will throw an exception and hence return false.
If the columns match, then groupBy on all columns and add a count column. Now, if the datasets are equal, every row has a count that is a multiple of 2 (even for duplicate rows).
Check whether any row has a count not divisible by 2; those are the extra rows. (A rough Scala equivalent of the same trick is sketched below.)
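For reference, a rough Scala equivalent of the same trick (my sketch, not part of the original answer):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def equalByUnionParity(ds1: DataFrame, ds2: DataFrame): Boolean =
  try {
    ds1.union(ds2)
      .groupBy(ds1.columns.map(col): _*) // group on all columns
      .count()
      .filter("count % 2 > 0") // any odd count reveals a mismatch
      .count() == 0
  } catch {
    case _: Exception => false // union throws if the schemas don't match
  }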

transform rdd into pairRDD

This is a newbie question.
Is it possible to transform an RDD like (key,1,2,3,4,5,5,666,789,...) with a dynamic dimension into a pairRDD like (key, (1,2,3,4,5,5,666,789,...))?
I feel like it should be super-easy but I cannot get how to.
The point of doing it is that I would like to sum all the values, but not the key.
Any help is appreciated.
I am using Spark 1.2.0
EDIT: Enlightened by the answer, I'll explain my use case in more depth. I have N (unknown at compile time) different pair RDDs of (key, value) that have to be joined and whose values must be summed up. Is there a better way than the one I was thinking of?
First of all, if you just want to sum all the integers except the first one, the simplest way would be:
val rdd = sc.parallelize(List(1, 2, 3))
rdd.cache()
val first = rdd.first()
val result = rdd.sum() - first
On the other hand, if you want to have access to the index of the elements, you can use the RDD zipWithIndex method like this:
val indexed = rdd.zipWithIndex()
indexed.cache()
// separate the first element (the key) from the rest
val result = (indexed.first()._1, indexed.filter(_._2 != 0))
But in your case this feels like overkill.
One more thing I would add: it looks like questionable design to put the key as the first element of your RDD. Why not instead use pairs of (key, rdd) in your driver program (a small sketch is below)? It's quite hard to reason about the order of elements in an RDD, and I can't think of a natural situation in which the key is computed as the first element of an RDD (of course I don't know your use case, so I can only guess).
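A small sketch of what I mean by keeping the key in the driver (names are purely illustrative):
// one (key, values) pair per dataset, with the key held in the driver program
val keyedRdds: Seq[(String, org.apache.spark.rdd.RDD[Int])] =
  Seq(("key", sc.parallelize(List(1, 2, 3, 4, 5, 5, 666, 789))))
// sum the values of each RDD without ever touching the key
val sums = keyedRdds.map { case (key, values) => (key, values.sum()) }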
EDIT
If you have one RDD of key-value pairs and you want to sum them by key, then just do:
val result = rdd.reduceByKey(_ + _)
If you have many RDDs of key-value pairs, you can just union them before reducing:
val list = List(pairRDD0, pairRDD1, pairRDD2)
// another pairRDD arrives at runtime
val newList = anotherPairRDD0 :: list
val pairRDD = newList.reduce(_ union _)
val resultSoFar = pairRDD.reduceByKey(_ + _)
// another pairRDD arrives at runtime
val result = resultSoFar.union(anotherPairRDD1).reduceByKey(_ + _)
EDIT
I edited the example. As you can see, you can add an additional RDD whenever it comes up at runtime. This is because reduceByKey returns an RDD of the same type, so you can iterate this operation (of course you will have to consider performance).