Partitions are traversed multiple times, invalidating Accumulator consistency - scala

We are trying to use accumulators to count the elements of RDDs without forcing a .count() on them, for efficiency reasons. We are aware that tasks can fail and re-run, which would invalidate the accumulator's value, so we also count the number of times each partition has been traversed in order to detect this.
The problem is that partitions are being traversed multiple times even though:
- we cache the RDD in memory after applying the logic below,
- no tasks are failing and no executors are dying,
- there is plenty of memory (no RDD eviction).
The code we use:
val count: LongAccumulator                         // total element count
val partitionTraverseCounts: List[LongAccumulator] // one traversal counter per partition

def increment(): Unit = count.add(1)

def incrementTimesCalled(partitionIndex: Int): Unit =
  partitionTraverseCounts(partitionIndex).add(1)

def incrementForPartition[T](index: Int, it: Iterator[T]): Iterator[T] = {
  // record one traversal of this partition, then count each element as it is consumed
  incrementTimesCalled(index)
  it.map { x =>
    increment()
    x
  }
}
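For reference, a minimal sketch of how the two accumulator fields might be registered with the SparkContext (the accumulator names and the use of getNumPartitions here are our assumptions, not from the question):

import org.apache.spark.util.LongAccumulator

// Hypothetical registration; sc is the SparkContext, rdd is the RDD being counted.
val count: LongAccumulator = sc.longAccumulator("element-count")
val partitionTraverseCounts: List[LongAccumulator] =
  List.tabulate(rdd.getNumPartitions)(i => sc.longAccumulator(s"partition-$i-traversals"))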
How we use the above:
rdd.mapPartitionsWithIndex(safeCounter.incrementForPartition)
We have a 50 partition RDD, and we frequently see odd traverse counts:
traverseCounts: List(2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2)
As you can see, some partitions are traversed twice, while others are traversed only once.
To confirm no task failures:
cat job.log | grep -i task | grep -i fail
To confirm no memory issues:
cat job.log | grep -i memory
We see that every matching log line reports multiple GB of memory free.
We also don't see any errors or exceptions.
Question:
Why is Spark traversing a cached RDD multiple times?
Is there any way to disable this?
See also:
https://issues.apache.org/jira/browse/SPARK-40048

Related

Printing specific output in Scala

I have the following array of arrays, representing cycles in a graph, that I want to print in the format below.
scala> result.collect
Array[Array[Long]] = Array(Array(0, 1, 4, 0), Array(1, 5, 2, 1), Array(1, 4, 0, 1), Array(2, 3, 5, 2), Array(2, 1, 5, 2), Array(3, 5, 2, 3), Array(4, 0, 1, 4), Array(5, 2, 3, 5), Array(5, 2, 1, 5))
0:0->1->4;
1:1->5->2;1->4->0;
2:2->3->5;2->1->5;
3:3->5->2;
4:4->0->1;
5:5->2->3;5->2->1;
How can I do this? I have tried a for loop with if statements, as in other languages, but the ifs in Scala's for comprehensions are guards used for filtering, so I cannot use if/else to handle two different cases.
Example Python pseudocode:
for i, array in enumerate(result):
    if array[i] == array[i + 1]:
        ...  # print one thing
    else:
        ...  # print the other thing
I also tried result.groupBy to make it easier to print, but then the arrays no longer display properly.
Array[(Long, Iterable[Array[Long]])] = Array((4,CompactBuffer([J#3677a08a)), (0,CompactBuffer([J#695fd7e)), (1,CompactBuffer([J#50b0f441, [J#142efc4d)), (3,CompactBuffer([J#1fd66db2)), (5,CompactBuffer([J#36811d3b, [J#61c4f556)), (2,CompactBuffer([J#2eba1b7, [J#2efcf7a5)))
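(Side note, not from the original post: the [J@... strings are just the default toString of Array[Long]; the grouped data is intact. Converting the inner arrays to Lists makes it readable:)

result.groupBy(_.head).mapValues(_.map(_.toList)).collect.foreach(println)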
Is there a way to nicely print the output needed in Scala?
This should do it:
result
  .groupBy(_.head)
  .toArray
  .sortBy(_._1)
  .map { case (node, cycles) =>
    val paths = cycles.map { cycle =>
      cycle
        .init // drop last node
        .mkString("->")
    }
    s"$node:${paths.mkString(";")}"
  }
  .mkString(";\n")
This is the output for the sample input you provided:
0:0->1->4;
1:1->5->2;1->4->0;
2:2->3->5;2->1->5;
3:3->5->2;
4:4->0->1;
5:5->2->3;5->2->1

use pyspark to shuffle randomly selected columns

I was trying to:
1. randomly select a few columns from the DataFrame
2. shuffle the values of the columns selected in step 1
3. add the shuffled columns from step 2 back to the DataFrame
The code is as follows:
# Step 0: create data frame using list and tuple
df = sqlContext.createDataFrame([
    ("user1", 0, 1, 0, 1, 0, 1, 1, 0, 1, 0),
    ("user2", 1, 1, 0, 1, 0, 1, 1, 1, 1, 0),
    ("user3", 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
    ("user4", 0, 1, 0, 1, 1, 1, 1, 1, 0, 0),
    ("user5", 1, 1, 1, 1, 0, 1, 0, 1, 1, 0),
    ("user6", 0, 1, 0, 1, 1, 1, 1, 0, 1, 0)
], ["ID", "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"])
df.show()
The DataFrame is:
+-----+--+--+--+--+--+--+--+--+--+--+
|   ID|x0|x1|x2|x3|x4|x5|x6|x7|x8|x9|
+-----+--+--+--+--+--+--+--+--+--+--+
|user1| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0|
|user2| 1| 1| 0| 1| 0| 1| 1| 1| 1| 0|
|user3| 1| 1| 1| 1| 0| 0| 0| 1| 1| 0|
|user4| 0| 1| 0| 1| 1| 1| 1| 1| 0| 0|
|user5| 1| 1| 1| 1| 0| 1| 0| 1| 1| 0|
|user6| 0| 1| 0| 1| 1| 1| 1| 0| 1| 0|
+-----+--+--+--+--+--+--+--+--+--+--+
import random
from pyspark.sql import functions as F
# define features
feature = [x for x in df.columns if x not in ['ID']]
# Step 1: random select a few columns from the DataFrame
random.seed(123)
random_col = random.sample(feature, 2)
print(random_col)
Step 1 works well. The randomly selected features are 'x0' and 'x4'.
# shuffle the randomly selected columns to create random noise features
for i in range(0, 2):
    # Step 2: shuffle the values of the column selected in step 1
    rnd_df = df.select(random_col[i]).orderBy(F.rand(i)).withColumnRenamed(random_col[i], 'rnd_col').rnd_col
    # Step 3: add these columns from step 2 back to the DataFrame
    df = df.withColumn('random' + str(i + 1), rnd_df)
Step 2 works well. But Step 3 fails with the following error. Does anyone know how to solve this problem?

Operate on neighbor elements in RDD in Spark

Say I have a collection:
List(1, 3,-1, 0, 2, -4, 6)
It's easy to make it sorted as:
List(-4, -1, 0, 1, 2, 3, 6)
Then I can construct a new collection by computing 6 - 3, 3 - 2, 2 - 1, 1 - 0, and so on, like this:
for (i <- 0 to list.length - 2) yield {
  list(i + 1) - list(i)
}
and get a vector:
Vector(3, 1, 1, 1, 1, 3)
That is, I want to subtract each element from the one that follows it.
But how can I implement this on an RDD in Spark?
I know that for the collection:
List(-4, -1, 0, 1, 2, 3, 6)
there will be several partitions, each of which is ordered. Can I do a similar operation on each partition and then combine the per-partition results?
The most efficient solution is to use the sliding method:
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(Seq(1, 3, -1, 0, 2, -4, 6))
  .sortBy(identity)
  .sliding(2)
  .map { case Array(x, y) => y - x }
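Collecting this RDD should reproduce the differences from the local example above:

rdd.collect()  // Array(3, 1, 1, 1, 1, 3)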
Suppose you have something like
val seq = sc.parallelize(List(1, 3, -1, 0, 2, -4, 6)).sortBy(identity)
Let's create the first collection with the index as the key, as Ton Torres suggested:
val original = seq.zipWithIndex.map(_.swap)
Now we can build a collection shifted by one element.
val shifted = original.map { case (idx, v) => (idx - 1, v) }.filter(_._1 >= 0)
Next we can calculate the needed differences, ordered by descending index:
val diffs = original.join(shifted)
  .sortBy(_._1, ascending = false)
  .map { case (idx, (v1, v2)) => v2 - v1 }
So
println(diffs.collect.toSeq)
shows
WrappedArray(3, 1, 1, 1, 1, 3)
Note that you can skip the sortBy step if reversing is not critical.
Also note that for a local collection this could be computed much more simply:
val elems = List(1, 3, -1, 0, 2, -4, 6).sorted
(elems.tail, elems).zipped.map(_ - _).reverse
But in the case of an RDD, the zip method requires that both collections have the same number of elements in each partition. So if you implemented tail like
val tail = seq.zipWithIndex().filter(_._2 > 0).map(_._1)
then tail.zip(seq) would not work, since both collections need the same element count per partition, and here one element from each partition would have to travel to the previous partition.

Example of usage of a monoid for distributed computation with spark

I have user hobby data (RDD[Map[String, Int]]) like:
("food" -> 3, "music" -> 1),
("food" -> 2),
("game" -> 5, "twitch" -> 3, "food" -> 3)
I want to calculate stats from it and represent them as a Map[String, Array[Int]], where the array size is 5, like:
("food" -> Array(0, 1, 2, 0, 0),
"music" -> Array(1, 0, 0, 0, 0),
"game" -> Array(0, 0, 0, 0, 1),
"twitch" -> Array(0, 0, 1, 0 ,0))
foldLeft seems to be the right tool, but an RDD has no foldLeft, and the data is too big to convert to a List/Array just to use it. How can I do this?
The trick is to replace the Array in your example with a class that holds the statistic you want for some part of the data and that can be combined with another instance of the same statistic (covering another part of the data) to give the statistic for the whole data.
For instance, if you have a statistic covering the data 3, 3, 2 and 5, I gather it would look something like (0, 1, 2, 0, 1), and if you have another instance covering the data 3, 4, 4 it would look like (0, 0, 1, 2, 0). Now all you have to do is define a + operation that lets you combine them: (0, 1, 2, 0, 1) + (0, 0, 1, 2, 0) = (0, 1, 3, 2, 1), covering the data 3, 3, 2, 5 and 3, 4, 4.
Let's just do that, and call the class StatMonoid:
case class StatMonoid(flags: Seq[Int] = Seq(0, 0, 0, 0, 0)) {
  def +(other: StatMonoid) =
    new StatMonoid((0 to 4).map { idx => flags(idx) + other.flags(idx) })
}
This class holds a sequence of 5 counters and defines a + operation that lets it be combined with other instances.
We also need a convenience method to build it; this could be a constructor in StatMonoid, a factory in the companion object, or just a plain method, as you prefer:
def stat(value: Int): StatMonoid = value match {
  case 1 => new StatMonoid(Seq(1, 0, 0, 0, 0))
  case 2 => new StatMonoid(Seq(0, 1, 0, 0, 0))
  case 3 => new StatMonoid(Seq(0, 0, 1, 0, 0))
  case 4 => new StatMonoid(Seq(0, 0, 0, 1, 0))
  case 5 => new StatMonoid(Seq(0, 0, 0, 0, 1))
  case _ => throw new RuntimeException(s"illegal init value: $value")
}
This allows us to easily compute an instance of the statistic covering a single piece of data, for example:
scala> stat(4)
res25: StatMonoid = StatMonoid(List(0, 0, 0, 1, 0))
And it also allows us to combine them together simply by adding them:
scala> stat(1) + stat(2) + stat(2) + stat(5) + stat(5) + stat(5)
res18: StatMonoid = StatMonoid(Vector(1, 2, 0, 0, 3))
Now to apply this to your example, let's assume we have the data you mention as an RDD of Map:
val rdd = sc.parallelize(List(Map("food" -> 3, "music" -> 1), Map("food" -> 2), Map("game" -> 5, "twitch" -> 3, "food" -> 3)))
All we need to do to find the stat for each hobby is to flatten the data into (hobby, value) tuples, transform each value into an instance of StatMonoid as above, and finally combine them all per hobby:
import org.apache.spark.rdd.PairRDDFunctions

rdd.flatMap(_.toList).mapValues(stat).reduceByKey(_ + _).collect
Which yields:
res24: Array[(String, StatMonoid)] = Array((game,StatMonoid(List(0, 0, 0, 0, 1))), (twitch,StatMonoid(List(0, 0, 1, 0, 0))), (music,StatMonoid(List(1, 0, 0, 0, 0))), (food,StatMonoid(Vector(0, 1, 2, 0, 0))))
Now, for the side story: if you wonder why I call the class StatMonoid, it's simply because... it is a monoid :D, and a very common and handy one, called the product monoid. In short, monoids are just thingies that can be combined with each other in an associative fashion; they come up all the time when developing with Spark since they naturally define operations that can be executed in parallel on the distributed workers and then gathered together into a final result.
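As a quick illustration (my own sketch, not from the original answer) of the monoid laws that make this work: the all-zero statistic is the identity element and + is associative, which is what lets reduceByKey combine partial results from different partitions in any grouping.

// Sketch only: informal check of the monoid laws for StatMonoid.
val zero = StatMonoid()           // Seq(0, 0, 0, 0, 0) acts as the identity
assert(stat(3) + zero == stat(3)) // identity law
assert((stat(1) + stat(2)) + stat(5) == stat(1) + (stat(2) + stat(5))) // associativity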

Sized generators in scalacheck

The User Guide of the ScalaCheck project mentions sized generators. The example code
def matrix[T](g: Gen[T]): Gen[Seq[Seq[T]]] = Gen.sized { size =>
  val side = scala.Math.sqrt(size).asInstanceOf[Int] // little change to prevent compile-time exception
  Gen.vectorOf(side, Gen.vectorOf(side, g))
}
explained nothing to me. After some exploration I understood that the length of the generated sequence does not depend on the actual size of the generator (there is a resize method in the Gen object that "Creates a resized version of a generator" according to the API docs; maybe that means something different?).
val g = Gen.choose(1, 5)
val g2 = Gen.resize(15, g)
println(matrix(g).sample)  // (1)
println(matrix(g2).sample) // (2)
// (1) and (2) produce Seqs of the same length
Could you explain what I have missed and give some examples of how you use sized generators in test code?
The vectorOf generator (which has since been replaced by listOf) generates lists with a size that depends (linearly) on the size parameter that ScalaCheck sets when it evaluates a generator. When ScalaCheck tests a property it increases this size parameter for each test, resulting in properties that are tested with larger and larger lists (if listOf is used).
If you create a matrix generator by just using the listOf generator in a nested fashion, you will get matrices whose size depends on the square of the size parameter. Hence, when using such a generator in a property, you might end up with very large matrices, since ScalaCheck increases the size parameter for each test run. However, if you use the resize generator combinator in the way it is done in the ScalaCheck User Guide, your final matrix size depends linearly on the size parameter, resulting in nicer performance when testing your properties.
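A rough sketch of that contrast using the newer listOf/listOfN API (my own code, not from the answer): the naive nested version grows with the square of the size parameter, while taking the square root of the size keeps the total cell count roughly linear.

import org.scalacheck.Gen

// Naive nesting: roughly size^2 cells in total.
def naiveMatrix[T](g: Gen[T]): Gen[List[List[T]]] =
  Gen.listOf(Gen.listOf(g))

// User-Guide style: side = sqrt(size), so roughly size cells in total.
def squareMatrix[T](g: Gen[T]): Gen[List[List[T]]] = Gen.sized { size =>
  val side = math.sqrt(size.toDouble).toInt
  Gen.listOfN(side, Gen.listOfN(side, g))
}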
You should really not have to use the resize combinator very often. If you need to generate lists bounded by some specific size, it's much better to do something like the example below instead, since there is no guarantee that the listOf/containerOf generators really use the size parameter the way you expect.
def genBoundedList[T](maxSize: Int, g: Gen[T]): Gen[List[T]] = {
  Gen.choose(0, maxSize) flatMap { sz => Gen.listOfN(sz, g) }
}
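For instance (a usage sketch of my own, assuming the genBoundedList above and a small value generator):

// Lists of at most 10 values drawn from Gen.choose(1, 5)
val genSmall: Gen[List[Int]] = genBoundedList(10, Gen.choose(1, 5))
genSmall.sample // e.g. Some(List(3, 1, 4)); never more than 10 elements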
The vectorOf method that you use is deprecated; you should use the listOf method instead. It generates a list of random length, where the maximum length is limited by the size of the generator. You should therefore resize the generator that actually produces the list if you want control over the maximum number of elements generated:
scala> val g1 = Gen.choose(1,5)
g1: org.scalacheck.Gen[Int] = Gen()
scala> val g2 = Gen.listOf(g1)
g2: org.scalacheck.Gen[List[Int]] = Gen()
scala> g2.sample
res19: Option[List[Int]] = Some(List(4, 4, 4, 4, 2, 4, 2, 3, 5, 1, 1, 1, 4, 4, 1, 1, 4, 5, 5, 4, 3, 3, 4, 1, 3, 2, 2, 4, 3, 4, 3, 3, 4, 3, 2, 3, 1, 1, 3, 2, 5, 1, 5, 5, 1, 5, 5, 5, 5, 3, 2, 3, 1, 4, 3, 1, 4, 2, 1, 3, 4, 4, 1, 4, 1, 1, 4, 2, 1, 2, 4, 4, 2, 1, 5, 3, 5, 3, 4, 2, 1, 4, 3, 2, 1, 1, 1, 4, 3, 2, 2))
scala> val g3 = Gen.resize(10, g2)
g3: java.lang.Object with org.scalacheck.Gen[List[Int]] = Gen()
scala> g3.sample
res0: Option[List[Int]] = Some(List(1))
scala> g3.sample
res1: Option[List[Int]] = Some(List(4, 2))
scala> g3.sample
res2: Option[List[Int]] = Some(List(2, 1, 2, 4, 5, 4, 2, 5, 3))