Why is the result of Spark reduceByKey not consistent? - scala

I am trying to count the number of occurrences of each line via Spark, using Scala.
Following is my input:
1 vikram
2 sachin
3 shobit
4 alok
5 akul
5 akul
1 vikram
1 vikram
3 shobit
10 ashu
5 akul
1 vikram
2 sachin
7 vikram
Now I create two separate RDDs as follows:
val f1 = sc.textFile("hdfs:///path to above data file")
val m1 = f1.map( s => (s.split(" ")(0),1) ) //creating a tuple (key,1)
//now if I create an RDD as
val rd1 = m1.reduceByKey((a,b) => a+b )
rd1.collect().foreach(println)
//I get a proper output, i.e. it gives the correct output every time
//output: (4,1) (2,2) (7,1) (5,3) (3,2) (1,4) (10,1)
//but if I create an RDD as
val rd2 = m1.reduceByKey((a,b) => a+1 )
rd2.collect().foreach(println)
//I get an inconsistent result, i.e. sometimes I get this (WRONG)
//output: (4,1) (2,2) (7,1) (5,2) (3,2) (1,2) (10,1)
//and sometimes I get this as output (CORRECT)
//output: (4,1) (2,2) (7,1) (5,3) (3,2) (1,4) (10,1)
I am unable to understand why this is happening and where to use what. I have also tried creating an RDD as
val m2 = f1.map(s => (s,1))
val rd3 = m2.reduceByKey((a,b) => a+1 )
// The same issue occurs here with a+1, but everything works fine with a+b

reduceByKey assumes that the passed function is commutative and associative (as the docs clearly state).
And your first function, (a, b) => a + b, is, but (a, b) => a + 1 isn't.
WHY?
For one thing - reduceByKey applies the supplied function to each partition, and then to the combined results of all partitions. In other words, b isn't always 1, so using a+1 simply isn't correct.
Think of the following scenario - the input contains 4 records, split into two partitions:
(aa, 1)
(aa, 1)
(aa, 1)
(cc, 1)
reduceByKey(f) on this input might be calculated as follows:
val intermediate1 = f((aa, 1), (aa, 1))
val intermediate2 = f((aa, 1), (cc, 1))
val result = f(intermediate2, intermediate1)
Now, let's follow this with f = (a, b) => a + b
val intermediate1 = f((aa, 1), (aa, 1)) // (aa, 2)
val intermediate2 = f((aa, 1), (cc, 1)) // (aa, 1), (cc, 1)
val result = f(intermediate2, intermediate1) // (aa, 3), (cc, 1)
And with f = (a, b) => a + 1:
val intermediate1 = f((aa, 1), (aa, 1)) // (aa, 2)
val intermediate2 = f((aa, 1), (cc, 1)) // (aa, 1), (cc, 1)
// this is where it goes wrong:
val result = f(intermediate2, intermediate1) // (aa, 2), (cc, 1)
The main thing is that the order of the intermediate calculations isn't guaranteed and may change between executions; for the latter, non-associative function this means the results are sometimes wrong.
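If the goal really is to add 1 per record regardless of the value, one safe option is aggregateByKey, which keeps the within-partition step separate from the cross-partition merge. A hedged sketch (not from the original answer), assuming the m1 RDD of (key, 1) pairs from the question:
// seqOp increments once per record within a partition;
// combOp adds the partial counts coming from different partitions
val rd4 = m1.aggregateByKey(0)(
  (acc, _) => acc + 1,
  (x, y) => x + y
)
rd4.collect().foreach(println)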

The function (a, b) => (a + 1) fails to be associative.
The associative law says,
f(a ,f(b , c)) = f(f(a , b), c)
Assume the records below:
a = (x, 1)
b = (x, 1)
c = (x, 1)
Applying the function (a, b) => (a + 1):
f(a ,f(b , c)) = (x , 2)
But,
f(f(a , b), c) = (x , 3)
Hence, it is not associative and is not applicable for reduceByKey.
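You can reproduce the failure locally on the values alone (a quick sketch):
val f = (a: Int, b: Int) => a + 1
f(1, f(1, 1)) // 2 -- f(a, f(b, c))
f(f(1, 1), 1) // 3 -- f(f(a, b), c), so associativity fails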

Related

Scala: different outputs of logically identical programs

val a = List(1, 2, 3, 4, 5)
val b = a.grouped(2).filter(_.length == 2).map(x => (x(0), x(1)))
//b.foreach(x => println(x))
val r = b.foldLeft((0, 0)) {
  case ((m, n), (x, y)) => (m + x, n + y)
}
println(r)
The program gives the correct output (4, 6). But when I uncomment the foreach statement above, it outputs (0, 0). What's wrong here?
With val b = a.grouped(2).filter(_.length == 2).map(x => (x(0), x(1))), b's type is Iterator:
scala> :type b
Iterator[(Int, Int)]
So once you have iterated over b with b.foreach(x => println(x)), the iterator b is empty, since an Iterator can only be traversed once.
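A minimal sketch of a fix: materialize the iterator into a List before the first traversal, so it can be traversed again:
val a = List(1, 2, 3, 4, 5)
val b = a.grouped(2).filter(_.length == 2).map(x => (x(0), x(1))).toList
b.foreach(println) // safe: b is now a List, not a one-shot Iterator
val r = b.foldLeft((0, 0)) { case ((m, n), (x, y)) => (m + x, n + y) }
println(r) // (4,6)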

Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]

Is it possible in Spark to implement the '.combinations' function from Scala collections?
/** Iterates over combinations.
 *
 *  @return   An Iterator which traverses the possible n-element combinations of this $coll.
 *  @example  `"abbbc".combinations(2) = Iterator(ab, ac, bb, bc)`
 */
For example, how can I get from RDD[X] to RDD[List[X]] or RDD[(X, X)] for combinations of size = 2? And let's assume that all values in the RDD are unique.
Cartesian product and combinations are two different things: the cartesian product will create an RDD of size rdd.size() ^ 2, while combinations will create an RDD of size rdd.size() choose 2.
val rdd = sc.parallelize(1 to 5)
val combinations = rdd.cartesian(rdd).filter{ case (a, b) => a < b }
combinations.collect()
Note this will only work if an ordering is defined on the elements of the list, since we use <. This one only works for choosing two, but it can easily be extended by making sure the relationship a < b holds for all a and b in the sequence.
This is supported natively by a Spark RDD with the cartesian transformation.
e.g.:
val rdd = sc.parallelize(1 to 5)
val cartesian = rdd.cartesian(rdd)
cartesian.collect
Array[(Int, Int)] = Array((1,1), (1,2), (1,3), (1,4), (1,5),
(2,1), (2,2), (2,3), (2,4), (2,5),
(3,1), (3,2), (3,3), (3,4), (3,5),
(4,1), (4,2), (4,3), (4,4), (4,5),
(5,1), (5,2), (5,3), (5,4), (5,5))
As discussed, cartesian will give you n^2 elements of the cartesian product of the RDD with itself.
This algorithm computes the combinations (n, 2) of an RDD without having to compute the n^2 elements first. (String is used as the type; generalizing to a type T takes some plumbing with ClassTags that would obscure the purpose here.)
This is probably less time efficient than cartesian + filtering, due to the iterative count and take actions that force the computation of the RDD, but it is more space efficient, as it calculates only the C(n,2) = n!/(2!*(n-2)!) = n*(n-1)/2 elements instead of the n^2 of the cartesian product.
import org.apache.spark.rdd._

def combs(rdd: RDD[String]): RDD[(String, String)] = {
  val count = rdd.count // computed once, instead of re-running the count action
  if (count < 2) {
    sc.makeRDD[(String, String)](Seq.empty)
  } else if (count == 2) {
    val values = rdd.collect
    sc.makeRDD[(String, String)](Seq((values(0), values(1))))
  } else {
    val elem = rdd.take(1)
    val elemRdd = sc.makeRDD(elem)
    val subtracted = rdd.subtract(elemRdd)
    val comb = subtracted.map(e => (elem(0), e))
    comb.union(combs(subtracted))
  }
}
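A usage sketch, assuming the spark-shell's sc (note that the order of the emitted pairs is not guaranteed):
val letters = sc.parallelize(Seq("a", "b", "c"))
combs(letters).collect()
// e.g. Array((a,b), (a,c), (b,c)) -- C(3, 2) = 3 pairs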
This creates all combinations (n, 2) and works for any RDD, without requiring any ordering on the elements of the RDD.
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}
a._2 and b._2 are the indices, while a._1 and b._1 are the elements of the original RDD.
Example:
Note that no ordering is defined on the maps here.
val m1 = Map('a' -> 1, 'b' -> 2)
val m2 = Map('c' -> 3, 'a' -> 4)
val m3 = Map('e' -> 5, 'c' -> 6, 'b' -> 7)
val rdd = sc.makeRDD(Array(m1, m2, m3))
val rddWithIndex = rdd.zipWithIndex
rddWithIndex.cartesian(rddWithIndex).filter{case(a, b) => a._2 < b._2}.map{case(a, b) => (a._1, b._1)}.collect
Output:
Array((Map(a -> 1, b -> 2),Map(c -> 3, a -> 4)), (Map(a -> 1, b -> 2),Map(e -> 5, c -> 6, b -> 7)), (Map(c -> 3, a -> 4),Map(e -> 5, c -> 6, b -> 7)))

Play Scala - groupBy remove repetitive values

I apply the groupBy function to my List collection; however, I want to remove the repetitive values in the value part of the Map. Here is the initial List collection:
PO_ID PRODUCT_ID RETURN_QTY
1 1 10
1 1 20
1 2 30
1 2 10
When I apply groupBy to that List, it will produce something like this:
(1, 1) -> (1, 1, 10),(1, 1, 20)
(1, 2) -> (1, 2, 30),(1, 2, 10)
What I really want is something like this:
(1, 1) -> (10),(20)
(1, 2) -> (30),(10)
So, is there any way to remove the repetitive part in the Map's values [(1,1),(1,2)]?
For
val a = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
consider
a.groupBy( v => (v._1,v._2) ).mapValues( _.map (_._3) )
which delivers
Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Note that mapValues operates over the lists of triplets obtained from groupBy, and the inner map extracts the third element of each triplet.
Is it easier to pull the tuple apart first?
scala> val ts = Seq( (1,1,10), (1,1,20), (1,2,30), (1,2,10) )
ts: Seq[(Int, Int, Int)] = List((1,1,10), (1,1,20), (1,2,30), (1,2,10))
scala> ts map { case (a,b,c) => (a,b) -> c }
res0: Seq[((Int, Int), Int)] = List(((1,1),10), ((1,1),20), ((1,2),30), ((1,2),10))
scala> ((Map.empty[(Int, Int), List[Int]] withDefaultValue List.empty[Int]) /: res0) { case (m, (k,v)) => m + ((k, m(k) :+ v)) }
res1: scala.collection.immutable.Map[(Int, Int),List[Int]] = Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))
Guess not.
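As a side note, on Scala 2.13 or later (a sketch under that assumption), groupMap combines the grouping and the value projection into a single call:
val a = Seq((1, 1, 10), (1, 1, 20), (1, 2, 30), (1, 2, 10))
// group by (PO_ID, PRODUCT_ID), keep only RETURN_QTY
a.groupMap(v => (v._1, v._2))(_._3)
// Map((1,1) -> List(10, 20), (1,2) -> List(30, 10))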

Comparing items in two lists

I have two lists : List(1,1,1) , List(1,0,1)
I want to get the following :
A count of every position that has a 1 in the first list and a 0 at the same position in the second list, and vice versa.
In the above example this would be 1, 0, since the first list contains a 1 at the middle position and the second list contains a 0 at the same (middle) position.
A count of every position where a 1 is in the first list and a 1 is also in the second list.
In the above example this is two, since there are two matching 1's in the corresponding lists. I can get this using the intersect method of class List.
I am just looking for an answer to point 1 above. I could use an iterative approach to count the items, but is there a more functional method?
Here is the entire code :
class Similarity {
  def getSimilarity(number1: List[Int], number2: List[Int]) = {
    val num: List[Int] = number1.intersect(number2)
    println("P is " + num.length)
  }
}

object HelloWorld {
  def main(args: Array[String]) {
    val s = new Similarity
    s.getSimilarity(List(1, 1, 1), List(1, 0, 1))
  }
}
For the first one:
scala> val a = List(1,1,1)
a: List[Int] = List(1, 1, 1)
scala> val b = List(1,0,1)
b: List[Int] = List(1, 0, 1)
scala> a.zip(b).filter(x => x._1==1 && x._2==0).size
res7: Int = 1
For the second:
scala> a.zip(b).filter(x => x._1==1 && x._2==1).size
res7: Int = 2
You can count all combinations easily and have it in a map with
def getSimilarity(number1: List[Int], number2: List[Int]) = {
  // sorry for the 1-liner, explanation follows
  val countMap = (number1 zip number2) groupBy (identity) mapValues { _.length }
  countMap // return the map, so callers can look combinations up
}
/*
* Example
* number1 = List(1,1,0,1,0,0,1)
* number2 = List(0,1,1,1,0,1,1)
*
* countMap = Map((1,0) -> 1, (1,1) -> 3, (0,1) -> 2, (0,0) -> 1)
*/
The trick is a common one
// zip the elements pairwise
(number1 zip number2)
/* List((1,0), (1,1), (0,1), (1,1), (0,0), (0,1), (1,1))
*
* then group together with the identity function, so pairs
* with the same elements are grouped together and the key is the pair itself
*/
.groupBy(identity)
/* Map( (1,0) -> List((1,0)),
* (1,1) -> List((1,1), (1,1), (1,1)),
* (0,1) -> List((0,1), (0,1)),
* (0,0) -> List((0,0))
* )
*
* finally you count the pairs mapping the values to the length of each list
*/
.mapValues(_.length)
/* Map( (1,0) -> 1,
 *      (1,1) -> 3,
 *      (0,1) -> 2,
 *      (0,0) -> 1
 *    )
 */
Then all you need to do is look up in the map.
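For example, with the countMap from the worked example above (a sketch):
countMap.getOrElse((1, 0), 0) // 1: positions with 1 in number1, 0 in number2
countMap.getOrElse((0, 1), 0) // 2: positions with 0 in number1, 1 in number2
countMap.getOrElse((1, 1), 0) // 3: positions with 1 in both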
a.zip(b).filter(x => x._1 != x._2).size
Almost the same solution as the one proposed by Jatin, except that you can use List.count for better readability:
def getSimilarity(l1: List[Int], l2: List[Int]) =
  l1.zip(l2).count { case (x, y) => x != y }
You can also use foldLeft. Assuming the lists contain only 0s and 1s:
a.zip(b).foldLeft(0)( (x,y) => if (y._1 + y._2 == 1) x + 1 else x )
1) You could zip the 2 lists to get a list of (Int, Int), collect only the pairs (1, 0) and (0, 1), replace (1, 0) with 1 and (0, 1) with -1, and take the sum. If the counts of (1, 0) and (0, 1) are the same, the sum will equal 0:
val (l1, l2) = (List(1,1,1) , List(1,0,1))
(l1 zip l2).collect{
case (1, 0) => 1
case (0, 1) => -1
}.sum == 0
You could use the view method to prevent the creation of intermediate collections (see the sketch after point 2).
2) You could use filter and length to get the count of elements matching some condition:
(l1 zip l2).filter{ _ == (1, 1) }.length
(l1 zip l2).collect{ case (1, 1) => () }.length
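And the view variant promised in point 1, sketched so that no intermediate collection is materialized (evaluation is forced only by sum):
(l1.view zip l2.view).collect {
  case (1, 0) => 1
  case (0, 1) => -1
}.sum == 0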

Multiple yields in sequence comprehension?

I'm trying to learn Scala and tried to write a sequence comprehension that extracts unigrams, bigrams and trigrams from a sequence. E.g., [1,2,3,4] should be transformed to (not Scala syntax)
[1; _,1; _,_,1; 2; 1,2; _,1,2; 3; 2,3; 1,2,3; 4; 3,4; 2,3,4]
In Scala 2.8, I tried the following:
def trigrams[T](tokens : Seq[T]) = {
  var t1 : Option[T] = None
  var t2 : Option[T] = None
  for (t3 <- tokens) {
    yield t3
    yield (t2, t3)
    yield (t1, t2, Some(t3))
    t1 = t2
    t2 = t3
  }
}
But this doesn't compile as, apparently, only one yield is allowed in a for-comprehension (no block statements either). Is there any other elegant way to get the same behavior, with only one pass over the data?
You can't have multiple yields in a for loop because for loops are syntactic sugar for the map (or flatMap) operations:
for (i <- collection) yield( func(i) )
translates into
collection map {i => func(i)}
Without a yield at all
for (i <- collection) func(i)
translates into
collection foreach {i => func(i)}
So the entire body of the for loop is turned into a single closure, and the presence of the yield keyword determines whether the function called on the collection is map or foreach (or flatMap). Because of this translation, the following are forbidden:
Using imperative statements next to a yield to determine what will be yielded.
Using multiple yields
(Not to mention that your proposed version will return a List[Any], because the tuples and the 1-grams are all of different types. You probably want to get a List[List[Int]] instead.)
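Note that multiple generators, unlike multiple yields, are fine, because they desugar into nested flatMap/map calls ending in a single yield. A small sketch:
val xs = List(1, 2)
val ys = List("a", "b")
// a comprehension with two generators...
for (i <- xs; j <- ys) yield (i, j)
// ...is rewritten by the compiler into nested flatMap/map calls:
xs.flatMap(i => ys.map(j => (i, j)))
// both give List((1,a), (1,b), (2,a), (2,b))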
Try the following instead (which puts the n-grams in the order they appear):
val basis = List(1,2,3,4)
val slidingIterators = 1 to 4 map (basis sliding _)
for {
  onegram <- basis
  ngram <- slidingIterators if ngram.hasNext
} yield ngram.next
or
import scala.collection.mutable.ListBuffer

val basis = List(1, 2, 3, 4)
val slidingIterators = 1 to 4 map (basis sliding _)
val first = slidingIterators.head
val buf = new ListBuffer[List[Int]]
while (first.hasNext)
  for (i <- slidingIterators)
    if (i.hasNext)
      buf += i.next
If you prefer the n-grams to be in length order, try:
val basis = List(1,2,3,4)
1 to 4 flatMap { basis sliding _ toList }
scala> val basis = List(1, 2, 3, 4)
basis: List[Int] = List(1, 2, 3, 4)
scala> val nGrams = (basis sliding 1).toList ::: (basis sliding 2).toList ::: (basis sliding 3).toList
nGrams: List[List[Int]] = ...
scala> nGrams foreach (println _)
List(1)
List(2)
List(3)
List(4)
List(1, 2)
List(2, 3)
List(3, 4)
List(1, 2, 3)
List(2, 3, 4)
I guess I should have given this more thought.
def trigrams[T](tokens : Seq[T]) : Seq[(Option[T], Option[T], T)] = {
  var t1 : Option[T] = None
  var t2 : Option[T] = None
  for (t3 <- tokens) yield {
    val tri = (t1, t2, t3)
    t1 = t2
    t2 = Some(t3)
    tri
  }
}
Then extract the unigrams and bigrams from the trigrams. But can anyone explain to me why 'multi-yields' are not permitted, and if there's any other way to achieve their effect?
val basis = List(1, 2, 3, 4)
val nGrams = basis.map(x => (x)) :::
  (for (a <- basis; b <- basis) yield (a, b)) :::
  (for (a <- basis; b <- basis; c <- basis) yield (a, b, c))
nGrams: List[Any] = ...
nGrams foreach (println(_))
1
2
3
4
(1,1)
(1,2)
(1,3)
(1,4)
(2,1)
(2,2)
(2,3)
(2,4)
(3,1)
(3,2)
(3,3)
(3,4)
(4,1)
(4,2)
(4,3)
(4,4)
(1,1,1)
(1,1,2)
(1,1,3)
(1,1,4)
(1,2,1)
(1,2,2)
(1,2,3)
(1,2,4)
(1,3,1)
(1,3,2)
(1,3,3)
(1,3,4)
(1,4,1)
(1,4,2)
(1,4,3)
(1,4,4)
(2,1,1)
(2,1,2)
(2,1,3)
(2,1,4)
(2,2,1)
(2,2,2)
(2,2,3)
(2,2,4)
(2,3,1)
(2,3,2)
(2,3,3)
(2,3,4)
(2,4,1)
(2,4,2)
(2,4,3)
(2,4,4)
(3,1,1)
(3,1,2)
(3,1,3)
(3,1,4)
(3,2,1)
(3,2,2)
(3,2,3)
(3,2,4)
(3,3,1)
(3,3,2)
(3,3,3)
(3,3,4)
(3,4,1)
(3,4,2)
(3,4,3)
(3,4,4)
(4,1,1)
(4,1,2)
(4,1,3)
(4,1,4)
(4,2,1)
(4,2,2)
(4,2,3)
(4,2,4)
(4,3,1)
(4,3,2)
(4,3,3)
(4,3,4)
(4,4,1)
(4,4,2)
(4,4,3)
(4,4,4)
You could try a functional version without assignments:
def trigrams[T](tokens : Seq[T]) = {
  val s1 = tokens.map { Some(_) }
  val s2 = None +: s1
  val s3 = None +: s2
  s1 zip s2 zip s3 map {
    case ((t1, t2), t3) => (List(t1), List(t1, t2), List(t1, t2, t3))
  }
}
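A usage sketch showing the shape of the result (most recent token first, padded with None):
trigrams(List(1, 2, 3))
// List(
//   (List(Some(1)), List(Some(1), None),    List(Some(1), None, None)),
//   (List(Some(2)), List(Some(2), Some(1)), List(Some(2), Some(1), None)),
//   (List(Some(3)), List(Some(3), Some(2)), List(Some(3), Some(2), Some(1))))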