Distributed process updating a global/single variable in Spark - Scala

I'm having trouble processing a vast amount of data on a cluster.
The code:
val (sumZ, batchSize) = data.rdd.repartition(4)
  .treeAggregate((0L, 0L))(
    seqOp = (c, v) => {
      // c: (z, count), v: the next record
      val step = this.update(c, v)
      (step._1, c._2 + 1)
    },
    combOp = (c1, c2) => {
      // c: (z, count)
      (c1._1 + c2._1, c1._2 + c2._2)
    })
val finalZ = sumZ / 4
As you can see in the code, my current approach is to process the data partitioned into 4 chunks (x0, x1, x2, x3), making the work on each chunk independent. Each process generates an output (z0, z1, z2, z3), and the final value of z is the average of these 4 results.
This approach works, but the precision (and the computing time) is affected by the number of partitions.
My question is whether there is a way of generating a "global" z that is updated from every process (partition).

TL;DR There is not. Spark doesn't have shared memory with synchronized access, so no true global access can exist.
The only form of "shared" writable variable in Spark is an Accumulator. It allows write-only access with a commutative and associative update function.
Since its implementation is equivalent to reduce / aggregate:
Each partition has its own copy, which is updated locally.
After a task is completed, the partial results are sent to the driver and combined with the "global" instance.
it won't resolve your problem.
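For completeness, a minimal sketch of an Accumulator in action (Spark 2.x API, using the built-in LongAccumulator). Tasks can only add to it, and its value is readable only on the driver once an action has completed:
// Write-only from the executors' point of view: tasks may only call add().
val count = sc.longAccumulator("count")
sc.parallelize(1 to 100, numSlices = 4).foreach(_ => count.add(1))
// Readable only here on the driver, after the action has finished.
// No task ever saw a "global" running total while computing.
println(count.value) // 100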

Related

Calculating Requests Per Minute From Timestamps in RDD during mapping

I am currently trying to enrich data for machine learning with requests per minute. The data is stored in a Kafka topic, and on application start the whole content of the topic is loaded and processed - therefore, to my knowledge, it is not possible to use any of Spark Streaming's window operations, as all the data arrives at the same time.
My approach was to try the following:
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2 // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = kMeansInformationRdd.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).count()
  (duration, rpm)
})
However, with this approach I get the following exception:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
Is there a way to achieve what I want to do?
If you need any more information just drop me a comment and I will update what you need.
EDIT:
Broadcasting an RDD does not work. Broadcasting the collected Array does not result in acceptable performance.
The following will execute, but it is horribly slow and therefore not really an option:
val collected = kMeansInformationRdd.collect()
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2 // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = collected.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
  (duration, rpm)
})
UPDATE:
This code at least gets the job done way faster - but as far as I can see, it still gets slower the higher the requests per minute are, as the filtered array grows. Interestingly, it gets slower towards the end, and I cannot figure out why. If someone sees the issue - or performance problems that could be improved in general - I would be happy if you let me know.
kMeansInformationRdd = kMeansInformationRdd.cache()
kMeansInformationRdd = kMeansInformationRdd.sortBy(_._2, true) // sortBy returns a new RDD; the result must be kept
var kMeansFeatureArray: Array[(String, Long, Long)] = Array()
var buffer: collection.mutable.Map[String, Array[Long]] = collection.mutable.Map()
var counter = 0
kMeansInformationRdd.collect.foreach(x => {
  val ts = x._2
  val identifier = x._1 // make sure the identifier actually represents the entity that receives the traffic - e.g. the machine (IP?), not only the endpoint
  var bufferInstance = buffer.getOrElse(identifier, Array[Long]())
  bufferInstance = bufferInstance ++ Array(ts)
  bufferInstance = bufferInstance.filter(p => p > ts - 1000) // note: a 1000 ms window; ts - 60000 would match the per-minute window used above
  buffer.put(identifier, bufferInstance)
  val rpm = bufferInstance.size.toLong
  kMeansFeatureArray = kMeansFeatureArray ++ Array((identifier, x._3, rpm)) // identifier, duration, rpm
  counter = counter + 1
  if (counter % 10000 == 0) {
    println(counter)
    println((identifier, x._3, rpm))
  }
})
val kMeansFeatureRdd = sc.parallelize(kMeansFeatureArray)
The code given in the EDIT section is not correct: that is not how a variable is broadcast in Spark. The correct way is as follows:
val collected = sc.broadcast(kMeansInformationRdd.collect())
val kMeansFeatureRdd = kMeansInformationRdd.map(x => {
  val begin = x._2 // Long - unix timestamp millis
  val duration = x._3 // Long
  val rpm = collected.value.filter(y => (x._2 - 60000 <= y._2 && x._2 >= y._2)).size
  (duration, rpm)
})
Of course, you can use global variables instead of sc.broadcast, but that is not recommended. Why?
The difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING it with sc.broadcast() is this:
When the external variable is used directly, Spark sends a copy of the serialized variable together with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally around 10 times the number of executors, so when the variable (say, an array) is large enough (more than 10-20 KB), the former can cost a lot of time in network transmission and cause frequent GC, which slows Spark down. Hence it is suggested that large variables (>10-20 KB) be broadcast explicitly.
When the external variable is used directly, it is not persisted; it ends with the task and thus cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it. Thus an sc.broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested.
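A side-by-side sketch of the two options (the lookup array and the numbers RDD here are hypothetical):
val lookup: Array[Long] = Array.fill(1000000)(0L) // some large driver-side array
val numbers = sc.parallelize(1L to 1000L)
// Captured directly: the array is serialized into the closure and shipped with EVERY task.
numbers.map(x => x + lookup.length).count()
// Broadcast: shipped once per executor and cached there across tasks and stages.
val bc = sc.broadcast(lookup)
numbers.map(x => x + bc.value.length).count()
bc.unpersist() // release the executor copies once the variable is no longer needed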

Spark: flatMap/reduceByKey seems to be quite slow with Long keys on some distributions

I'm using Spark to process some corpora and I need to count the occurrence of each 2-gram. I started with counting tuples (wordID1, wordID2) and it worked fine, except for the large memory usage and GC overhead caused by the substantial number of small tuple objects. Then I tried packing each pair of Ints into a Long; the GC overhead did reduce greatly, but the run time also increased several times.
I ran some small experiments with random data on different distributions. It seems that the performance issue only occurs on exponential distributed data.
import scala.util.Random

// lines of word IDs
val data = (1 to 5000).par.map({ _ =>
  (1 to 1000) map { _ => (-1000 * Math.log(Random.nextDouble)).toInt }
}).seq

// count Tuples, fast
sc.parallelize(data).flatMap { line =>
  val first = line.iterator
  val second = line.iterator.drop(1)
  for (pair <- first zip second)
    yield (pair, 1L)
}.reduceByKey { _ + _ }.count()

// count Longs, slow
sc.parallelize(data).flatMap { line =>
  val first = line.iterator
  val second = line.iterator.drop(1)
  for ((a, b) <- first zip second)
    yield ((a.toLong << 32) | b, 1L)
}.reduceByKey { _ + _ }.count()
The job is split into two stages, flatMap() and count(). When counting Tuple2s, flatMap() takes about 6s and count() about 2s, while when counting Longs, flatMap() takes 18s and count() takes 10s.
It doesn't make sense to me, as Longs should impose less overhead than Tuple2s. Does Spark have some specialization for Long keys which happens to perform even worse for some specific distributions?
Thanks to @SarveshKumarSingh's hint, I finally solved the problem. It is not Spark's specialization for Long that triggers the issue, but Java's, and Spark doesn't address it properly.
Java's hashCode() for Long is quite simple - it just XORs the two 32-bit halves of the value, (value ^ (value >>> 32)).toInt - and Spark's default HashPartitioner simply assigns keys to partitions by their hashCode() value modulo the number of partitions. This makes Spark's default partitioning quite sensitive to the distribution of the Long keys, especially when the number of partitions is relatively small. And in my case, the situation deteriorates because the Long keys are constructed by concatenating pairs of Ints.
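In other words, the default placement of a Long key reduces to roughly the following (a simplified sketch of Long.hashCode() combined with HashPartitioner's non-negative modulo):
def defaultPartition(key: Long, numPartitions: Int): Int = {
  val hash = (key ^ (key >>> 32)).toInt // Java's Long.hashCode(): XOR of the two 32-bit halves
  val rawMod = hash % numPartitions
  rawMod + (if (rawMod < 0) numPartitions else 0) // HashPartitioner keeps the modulo non-negative
}
Note that for a key built as (a.toLong << 32) | b the hash collapses to a ^ b, so any structure shared by a and b carries straight through into the partition assignment.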
The solution is quite straightforward: we just need to somehow "scramble" the keys so that keys with similar frequencies are spread evenly.
The simplest way is to map each key to another unique value with some perfect hash function and convert it back when the original key is required. This approach involves only small code changes, but might not perform very well. I achieved performance similar to the count-by-tuple approach using the following mappings:
val newKey = oldKey * 6364136223846793005L + 1442695040888963407L
val oldKey = (newKey - 1442695040888963407L) * -4568919932995229531L
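If you would rather derive than trust the magic constants, the multiplicative inverse of any odd multiplier modulo 2^64 can be computed with a few Newton steps; Long arithmetic on the JVM already wraps modulo 2^64, so no BigInt is needed. A sketch:
def invMod2to64(a: Long): Long = {
  require((a & 1L) == 1L, "only odd values are invertible modulo 2^64")
  var x = a              // a is its own inverse modulo 8, i.e. correct to 3 bits
  for (_ <- 0 until 5)   // each Newton step doubles the number of correct bits
    x = x * (2L - a * x)
  x
}
invMod2to64(6364136223846793005L) // should yield -4568919932995229531L, the constant above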
A more effective way is to substitute the default HashPartitioner. I used the following partitioner between flatMap and reduceByKey and achieved a two-fold performance boost on real-world data.
val prevRDD = // ... the flatMap stage above ...
val nParts = prevRDD.partitioner match {
  case Some(p) => p.numPartitions
  case None => prevRDD.partitions.size
}
prevRDD.partitionBy(new Partitioner {
  override def getPartition(key: Any): Int = {
    val rawMod = longHash(key.asInstanceOf[Long]) % numPartitions
    rawMod + (if (rawMod < 0) numPartitions else 0)
  }
  override def numPartitions: Int = nParts
}).reduceByKey { _ + _ }

// the 64-bit mix function from MurmurHash3 (note the unsigned shifts, >>>)
def longHash(v: Long): Int = {
  var k = v
  k ^= k >>> 33
  k *= 0xff51afd7ed558ccdL
  k ^= k >>> 33
  k *= 0xc4ceb9fe1a85ec53L
  k ^= k >>> 33
  k.toInt
}

How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a CSV file like this:
x, y, z
x, y
x
x, y, c, f
x, z
I want to make a map of items vs their count. This is the code I wrote :
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()
  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}
When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2-node cluster with 2 cores each and 8 partitions. The CSV file has 170,000 lines.
If you just want a unique item count, then I suppose you can take the following approach:
val data: RDD[Array[Item]] = ???
val itemFrequency = data
  .flatMap(arr =>
    arr.map(item => (item, 1))
  )
  .reduceByKey(_ + _)
Do not provide a partitioner while reducing: passing one that differs from the data's existing partitioning forces another shuffle. Just keep the partitioning it already had.
Also... do not collect the distributed data into a local in-memory object like a Map.
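If something genuinely small is needed on the driver afterwards, pull back only that part. For example, the 20 most frequent items (a hypothetical follow-up, not part of the question):
// Only the 20 winners travel to the driver, not the whole frequency map.
val top20 = itemFrequency.top(20)(Ordering.by { case (_, count) => count })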

Scala distributed execution of function objects

Given the following function objects,
val f : Int => Double = (i:Int) => i + 0.1
val g1 : Double => Double = (x:Double) => x*10
val g2 : Double => Double = (x:Double) => x/10
val h : (Double,Double) => Double = (x:Double,y:Double) => x+y
and for instance 3 remote servers or nodes (IP xxx.xxx.xxx.1, IP 2 and IP 3), how to distribute the execution of this program,
val fx = f(1)
val g1x = g1( fx )
val g2x = g2( fx )
val res = h ( g1x, g2x )
so that
fx is computed in IP 1,
g1x is computed in IP 2,
g2x is computed in IP 3,
res is computed in IP 1
Can Scala's Akka or Apache Spark provide a simple approach to this?
Update
RPC (Remote Procedure Call) with Finagle, as suggested by @pkinsky, may be a feasible choice.
Consider load-balancing policies as the mechanism for selecting a node for execution - at the least, an any-free-available-node policy.
I can speak for Apache Spark. It can do what you are looking for with the code below, but it's not designed for this kind of parallel computation. It is designed for parallel computation where you also have a large amount of data distributed across many machines. So the solution looks a bit silly: we distribute a single integer across a single machine, for example (for f(1)).
Also, Spark is designed to run the same computation on all the data, so running g1() and g2() in parallel goes a bit against the design. (It's possible, but not elegant, as you will see.)
// Distribute the input (1) across 1 machine.
val rdd1 = sc.parallelize(Seq(1), numSlices = 1)
// Run f() on the input, collect the results and take the first (and only) result.
val fx = rdd1.map(f(_)).collect.head
// The next stage's input will be (1, fx), (2, fx), distributed across 2 machines.
val rdd2 = sc.parallelize(Seq((1, fx), (2, fx)), numSlices = 2)
// Run g1() on one machine, g2() on the other.
val gxs = rdd2.map {
  case (1, x) => g1(x)
  case (2, x) => g2(x)
}.collect
val g1x = gxs(0)
val g2x = gxs(1)
// Same deal for h() as for f(). The input is (g1x, g2x), distributed to 1 machine.
val rdd3 = sc.parallelize(Seq((g1x, g2x)), numSlices = 1)
val res = rdd3.map { case (g1x, g2x) => h(g1x, g2x) }.collect.head
You can see that Spark code is based around the concept of RDDs. An RDD is like an array, except it's partitioned across multiple machines. sc.parallelize() creates such a parallel collection from a local collection. For example rdd2 in the above code will be created from the local collection Seq((1, fx), (2, fx)) and split across two machines. One machine will have Seq((1, fx)), the other will have Seq((2, fx)).
Next we do a transformation on the RDD. map is a common transformation that creates a new RDD of the same length by applying a function to each element. (Same as Scala's map.) The map we run on rdd2 will replace (1, x) with g1(x) and (2, x) with g2(x). So on one machine it will cause g1() to run, while on the other g2() will run.
Transformations run lazily, only when you want to access the results. The methods that access the results are called actions. The most straightforward example is collect, which downloads the contents of the entire RDD from the cluster to the local machine. (It is exactly the opposite of sc.parallelize().)
You can try and see all this if you download Spark, start bin/spark-shell, and copy your function definitions and the above code into the shell.
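As an aside, the dependency structure itself (f first, then g1 and g2 in parallel, then h) is easy to express with plain Scala Futures. The sketch below runs everything locally; placing each step on a specific node would require an RPC layer such as the Finagle option mentioned in the question:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val resF =
  Future(f(1)).flatMap { fx =>             // would run on IP 1
    Future(g1(fx)).zip(Future(g2(fx)))     // IP 2 and IP 3, in parallel
  }.map { case (g1x, g2x) => h(g1x, g2x) } // back on IP 1

val res = Await.result(resF, 10.seconds)   // roughly 11.11 for the functions above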

How to sort an RDD in Scala Spark?

Reading the docs for the Spark method sortByKey:
sortByKey([ascending], [numTasks]) When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
Is it possible to return just the top N results? Instead of returning all results, return only the top 10. I could convert the sorted collection to an Array and use the take method, but since this is an O(N) operation, is there a more efficient way?
If you only need the top 10, use rdd.top(10). It avoids sorting, so it is faster.
rdd.top makes one parallel pass through the data, collecting the top N from each partition in a heap, then merges the heaps. It is an O(rdd.count) operation. Sorting would be O(rdd.count log rdd.count) and incur a lot of data transfer: it does a shuffle, so all of the data would be transmitted over the network.
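The same per-partition idea can be written out by hand. A sketch for an RDD[Int] (rdd.top itself uses a bounded priority queue per partition rather than a full sort):
def topN(rdd: org.apache.spark.rdd.RDD[Int], n: Int): Array[Int] =
  rdd
    .mapPartitions(it => Iterator(it.toArray.sorted.takeRight(n))) // top n within each partition
    .reduce((a, b) => (a ++ b).sorted.takeRight(n))                // merge partitions, keep top n
    .sorted(Ordering[Int].reverse)                                 // present in descending order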
Most likely you have already perused the source code:
class OrderedRDDFunctions {
  // <snip>
  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.size): RDD[P] = {
    val part = new RangePartitioner(numPartitions, self, ascending)
    val shuffled = new ShuffledRDD[K, V, P](self, part)
    shuffled.mapPartitions(iter => {
      val buf = iter.toArray
      if (ascending) {
        buf.sortWith((x, y) => x._1 < y._1).iterator
      } else {
        buf.sortWith((x, y) => x._1 > y._1).iterator
      }
    }, preservesPartitioning = true)
  }
}
And, as you say, the entire dataset must go through the shuffle stage, as seen in the snippet.
However, your concern about subsequently invoking take(K) may not be so accurate. This operation does NOT cycle through all N items:
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the
 * results from that partition to estimate the number of additional partitions needed to satisfy
 * the limit.
 */
def take(num: Int): Array[T] = {
So then, it would seem:
O(myRdd.take(K)) << O(myRdd.sortByKey()) ~= O(myRdd.sortByKey().take(k))
(at least for small K) << O(myRdd.sortByKey().collect())
Another option, at least from PySpark 1.2.0, is the use of takeOrdered.
In ascending order:
rdd.takeOrdered(10)
In descending order:
rdd.takeOrdered(10, lambda x: -x)
Top k values for k,v pairs:
rdd.takeOrdered(10, lambda (k, v): -v)
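The Scala equivalents, for a hypothetical rdd: RDD[Int] and pairs: RDD[(String, Int)] (note the tuple-parameter lambda above is Python 2 syntax):
// Ascending order:
rdd.takeOrdered(10)
// Descending order:
rdd.takeOrdered(10)(Ordering[Int].reverse)
// Top 10 values for (k, v) pairs:
pairs.takeOrdered(10)(Ordering.by { case (_, v) => -v })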