SubtractByKey and keep rejected values - scala

I was playing around with spark and I am getting stuck with something that seems foolish.
Let's say we have two RDD:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
if I am doing rdd1.substrackByKey(rdd2) , I will get {(1, 2)} wich is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD, is there a prebuilt function in spark or an elegant way to do this?
Please keep in mind that I am new with Spark, any help will be appreciated, thanks.

As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want to do can be expressed as Union - Intersection.
Here is how you can do this on spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect

I think I claim this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded (== records with a matching key in r2) and remaining (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala collection sense of split a collection into two depending on a predicate) or we could caclculate subtracted and discarded in one pass

You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
But you won't be keeping the values, just running another subtraction.

Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). I think one way you get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))

First use substractByKey() and then subtract
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)

Related

How to get Rdd values that exists in array?

I have Rdd[(Int, Double)]
and an array[Int] and i want to get a new Rdd[(Int, Double)] with only those Int that exist in the array too.
E.g if my array is [0, 1, 2] and my rdd is (1, 4.2), (5, 4.3), i want to get as output rdd only the (1, 4.2)
I am thinking about using filter with a function that iterates the array, do the comparison and returns true/false but i am not sure if it is the logic of spark.
Something like:
val newrdd = rdd.filter(x => f(x._1, array))
where
f(x:Int, y:Array[In]): Boolean ={
val z = false
for (a<-0 to y.length-1){
if (x == y(a)){
z = true
z}
z
}
//Input rdd
val rdd = sc.parallelize(Seq((1,4.2),(5,4.3)))
//array, convert to rdd
val arrRdd = sc.parallelize(Array(0,1,2))
//convert rdd and arrRdd to dataframe
val arrDF = arrRdd.toDF()
val df = rdd.toDF()
//do join and again convert it to rdd
df.join(arrDF,df.col("_1") === arrDF.col("value"),"leftsemi").rdd.collect
//output Array([1,4.2])
Try this:
rdd.filter(x => Array(0,1,2).contains(x._1)).collect.foreach(println)
Output:
(1,4.2)
val acceptableValues = array.toSet
rdd.filter { case (x, _) => acceptableValues(x) }

Getting all Keys associated with the max value in Spark

I have a RDD, and i want to find all Keys which have the max values.
So if i have
( ((A), 5), ((B), 4), ((C), 5)) )
then i want to return
( ((A), 5), ((C), 5)) )
Edit; MaxBy only gives out one key, so i dont think that will work.
I have tried
newRDD = oldRDD.sortBy(._2, false).filter{._2 == _.first}ยจ
and
newRDD = oldRDD.filter{_._2 == _.maxBy}
Where i know _.first and _.MaxBy wont work, but are supposed to get the maxValue from the oldRDD. My problem in every solution i try is that i cant accsess the maxValue inside a filter. I also belive the 2nd "solution" i tried is much quicker than the first since sortBy is not really necessary.
Here is an answer. The logic is pretty simple:
val rdd = sc.parallelize(Seq(("a", 5), ("b", 4), ("c", 5)))
// first get maximum value
val maxVal = rdd.values.max
// now filter to those elements with value==max value
val rddMax = rdd.filter { case (_, v) => v == maxVal }
rddMax.take(10)
I'm not familiar with spark/RDD. In plain Scala, I would do:
scala> val max = ds.maxBy (_._2)._2
max: Int = 5
scala> ds.filter (_._2 == max)
res207: List[(String, Int)] = List((A,5), (C,5))
Setup was:
scala> val (a, b, c) = ("A", "B", "C")
a: String = A
b: String = B
c: String = C
scala> val ds = List ( ((a), 5), ((b), 4), ((c), 5))
ds: List[(String, Int)] = List((A,5), (B,4), (C,5))

Accessing a specific element of an Array RDD in apache-spark scala

I have a RDD that is containing an array of key,value pairs. I want to get an element with key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I have tried to apply filter on it but getting error.
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key,value pair with specific key (say 4) as Array((4,lion))
The data is always coming in the form of an array of key,value pair
use p._1 instead of p(0).
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
//display rdd
println(filterRdd.collect().toList)
List((4,lion))
There's a lookup method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)

Concatenate String to each element of a List in a Spark dataframe with Scala

I have two columns in a Spark dataframe: one is a String, and the other is a List of Strings. How do I create a new column that is the concatenation of the String in column one with each element of the list in column 2, resulting in another list in column 3.
For example, if column 1 is "a", and column 2 is ["A","B"], I'd like the output in column 3 of the dataframe to to be ["aA","aB"].
So far, I have:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
val multiplierUDF = udf(multiplier)
val df2 = df1
.withColumn("col3", multiplierUDF(df1("col1"),df1("col2")))
which gives aWrappedArray(A,B)
I suggest you try your udf functions outside of spark, and get them working for local variables first. If you do:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
multiplier("a", Seq("A", "B"))
// output
res1: String = aList(A, B)
You'll see multiplier doesn't do what you want.
I think you're looking for:
val multiplier = (x1: String, x2: Seq[String]) => x2.map(x1+_)
multiplier("a", Seq("A", "B"))
//output
res2: Seq[String] = List(aA, aB)
I think you should redefine your UDF to something similar to my function append
val a = Seq("A", "B")
val p = "a"
def append(init: String, tails: Seq[String]) = tails.map(x => init + x)
append(p, a)
//res1: Seq[String] = List(aA, aB)

Spark RDD tuple transformation

I'm trying to transform an RDD of tuple of Strings of this format :
(("abc","xyz","123","2016-02-26T18:31:56"),"15") TO
(("abc","xyz","123"),"2016-02-26T18:31:56","15")
Basically seperating out the timestamp string as a seperate tuple element. I tried following but it's still not clean and correct.
val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))
However, it results in
(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")
The expected output I'm looking for is
(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")
This way I can access the elements using r._1, r._2 (the timestamp string) and r._3 in a seperate map operation.
Any hints/pointers will be greatly appreciated.
Vector.toString will include the String 'Vector' in its result. Instead, use Vector.mkString(",").
Example:
scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> xs.toString
res25: String = Vector(1, 2, 3)
scala> xs.mkString
res26: String = 123
scala> xs.mkString(",")
res27: String = 1,2,3
However, if you want to be able to access (abc,xyz,123) as a Tuple and not as a string, you could also do the following:
val res = rdd.map{
case ((a:String,b:String,c:String,ts:String),d:String) => ((a,b,c),ts,d)
}