Spark aggregateByKey - sum and running average in the same call - Scala

I am learning Spark and do not have experience with Hadoop.
Problem
I am trying to calculate the sum and average in the same call to aggregateByKey.
Let me share what I have tried so far.
Set up the data:
val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)
Attempt to calculate the average in the same call to aggregateByKey. This does not work.
val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
  aggregateByKey(zeroValue1)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      val newAverage = newSum / newCount
      (newCount, newSum, newAverage)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
      val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
      (newCount1, newSum1, newAverage1)
    }
  ).
  collect.
  foreach(println)
Result: Prints a different average each time
First time: (1,(4,100,70.0))
Second time: (1,(4,100,52.0))
Just do the sum first, and then calculate the average in a separate operation. This works.
val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
  aggregateByKey(zeroValue2)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      (newCount, newSum)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      (newCount1, newSum1)
    }
  ).
  map(rec => {
    val category = rec._1
    val count = rec._2._1
    val sum = rec._2._2
    (category, count, sum, sum / count)
  }).
  collect.
  foreach(println)
Prints the same result every time:
(1,4,100,25)
I think I understand the difference between seqOp and combOp. Given that an operation can split data across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and combOp then combines the results coming from different partitions. Please correct me if this is wrong.
However, there is something very basic that I am not understanding. It looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.

The computation related to your average aggregation in seqOp:
val newAverage = newSum/newCount
and in combOp:
val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
is incorrect.
Let's say the first three elements are in one partition and the last element in another. Your seqOp would generate the (count, sum, average) tuples as follows:
Partition #1: [20, 25, 10]
--> (1, 20, 20/1)
--> (2, 45, 45/2)
--> (3, 55, 55/3)
Partition #2: [45]
--> (1, 45, 45/1)
Next, the cross-partition combOp would combine the 2 tuples from the two partitions to give:
((55 * 3) + (45 * 1)) / 4
// Result: 52
As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
Your second approach works because the average is, by definition, the total sum divided by the total count, so it is best computed only after the sum and count have both been fully aggregated.
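As a minimal sketch of that idea (assuming the categoryPricesRdd from the question), you can keep just (count, sum) in the aggregation and derive the average afterwards with mapValues, still in a single pass over the data:

val sumCountByKey = categoryPricesRdd.
  aggregateByKey((0, 0))(
    (acc, price) => (acc._1 + 1, acc._2 + price), // seqOp: fold one price into (count, sum)
    (a, b) => (a._1 + b._1, a._2 + b._2)          // combOp: merge per-partition (count, sum)
  )

// the average is derived only once count and sum are final; toDouble avoids integer division
val avgByKey = sumCountByKey.mapValues { case (count, sum) => (count, sum, sum.toDouble / count) }

avgByKey.collect.foreach(println) // e.g. (1,(4,100,25.0))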

Related

How to mapConcat() and then fold() in a nested Flow before the upstream completes

Basically I'm trying to achieve the following scenario, which I consider a very simple use case, but as I'm new to akka-streams, I can't get it right.
At a certain stage of my stream graph, I split an element up into N elements using the mapConcat function, then process each of them in a nested flow in parallel and afterwards fold them together again. The number N of sub-elements emitted by mapConcat is not known in advance and can range from zero to hundreds of elements. However, the fold function, as stated in the docs, only completes when the upstream completes, whereas I need a fan-in stage that completes once all elements split off by the mapConcat stage have been processed.
A minimal example would be something like this:
Source(1 to 10)
  .via(
    Flow[Int].map(el => el * el))
  .via(
    Flow[Int]
      .mapConcat(el => Set(el + 1, el + 2, el + 3))
      .map(el => el * el)
      .fold(0)((all, cur) => all + cur)
  )
  .runForeach(println)
For an interactive example see: https://scastie.scala-lang.org/hPdKEF4QS0yZazWMfFvuEQ
The fold operation folds all values, so it prints one value. How can I construct the flow in order to get 10 results?
You can get your 10 elements by grouping by N (N: the number of results emitted by mapConcat):
Source(1 to 10)
  .via(
    Flow[Int].map(el => el * el))
  .via(
    Flow[Int]
      .mapConcat(el => Set(el + 1, el + 2, el + 3))
      .map(el => el * el)
      .grouped(3)
      .map(s => s.sum))
  .runForeach(println)
But mapConcat does not create a substream; it just takes a single element and emits N, all on the same stream.
If, on the other hand, you really want to create substreams, I'd suggest having a look at akka-doc#substream.
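As a rough sketch of that per-element substream idea (this is my addition, not part of the original answer), flatMapConcat can turn each element into its own sub-Source whose fold completes independently, so one result is emitted per upstream element even when the number of sub-elements varies:

Source(1 to 10)
  .map(el => el * el)
  .flatMapConcat { el =>
    // each element becomes its own sub-source; its fold completes when that sub-source does
    Source(List(el + 1, el + 2, el + 3))
      .map(e => e * e)
      .fold(0)(_ + _)
  }
  .runForeach(println) // emits 10 folded sums, one per original element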
mapConcat applies a function to every element of the input stream and concatenates the results into the output stream. While it has a slightly simpler signature, it operates like flatMapConcat and is probably best illustrated by the corresponding diagram in the Akka Stream documentation.
With that picture in mind, to review the flow content per Source element you can use grouped to group the mapConcat-ed data, as shown below:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._

implicit val system = ActorSystem("system")
implicit val materializer = ActorMaterializer()

Source(1 to 10).
  via(
    Flow[Int].map(el => el * el)).
  via(
    Flow[Int].
      mapConcat(el => Set(el + 1, el + 2, el + 3)).
      map(el => el * el).
      grouped(3).
      map { g =>
        val sum = g.reduce(_ + _)
        println(s"$g: sum = $sum")
        sum
      }.
      fold(0)(_ + _)
  ).
  runForeach(println)
// res1: scala.concurrent.Future[akka.Done] = Future(<not completed>)
// Vector(4, 9, 16): sum = 29
// Vector(25, 36, 49): sum = 110
// Vector(100, 121, 144): sum = 365
// Vector(289, 324, 361): sum = 974
// Vector(676, 729, 784): sum = 2189
// Vector(1369, 1444, 1521): sum = 4334
// Vector(2500, 2601, 2704): sum = 7805
// Vector(4225, 4356, 4489): sum = 13070
// Vector(6724, 6889, 7056): sum = 20669
// Vector(10201, 10404, 10609): sum = 31214
// 80759
[UPDATE]
If the Iterable generated by mapConcat varies in size per Source element, grouped will no longer be helpful. In some cases, you might be able to move any post-mapConcat transformations into the body of the function taken by mapConcat, like in the following example:
import java.util.concurrent.ThreadLocalRandom

Source(1 to 10).
  via(
    Flow[Int].map(el => el * el)).
  via(
    Flow[Int].
      mapConcat { el =>
        val list = for (i <- 1 to ThreadLocalRandom.current.nextInt(1, 4))
          yield el + i
        val list2 = list.map(e => e * e)
        val sum = list2.reduce(_ + _)
        println(s"$list2: sum = $sum")
        list2
      }.
      fold(0)(_ + _)
  ).
  runForeach(println)
// res2: scala.concurrent.Future[akka.Done] = Future(<not completed>)
// Vector(4): sum = 4
// Vector(25, 36, 49): sum = 110
// Vector(100, 121, 144): sum = 365
// Vector(289): sum = 289
// Vector(676, 729): sum = 1405
// Vector(1369): sum = 1369
// Vector(2500, 2601, 2704): sum = 7805
// Vector(4225, 4356): sum = 8581
// Vector(6724, 6889, 7056): sum = 20669
// Vector(10201, 10404): sum = 20605
// 61202

Spark Split RDD into chunks and concatenate

I have a relatively simple problem.
I have a large Spark RDD[String] (containing JSON). In my use case I want to group (concatenate) N strings together into a new RDD[String], so that it will have the size oldRDD.size / N.
pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldn't find anything.
Thanks!
Here you have to use the zipWithIndex function and then compute each element's group from its index.
For example, with index = 3 and n (group size) = 2: 3 / 2 = 1 (integer division), so the element falls into the second group (0-based group index 1).
val n = 3
val newRDD1 = oldRDD.zipWithIndex() // creates tuples (element, index)
  // map to tuple (group, content)
  .map(x => (x._2 / n, x._1))
  // merge the contents of each group
  .reduceByKey(_ + ", " + _)
  // remove the key
  .map(x => x._2)
One note: the order used by zipWithIndex is the RDD's internal (partition) order. It may not correspond to any business-level ordering, so you must check whether that order is acceptable in your case; if not, sort the RDD first and then use zipWithIndex.
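A small usage sketch against the question's example data (my addition; the separator and chunk size of the hypothetical someTransformation are folded in as constants):

val oldRDD = sc.parallelize(Seq("""{"id": 1}""", """{"id": 2}""", """{"id": 3}""", """{"id": 4}"""))
val n = 2 // elements per chunk
val newRDD = oldRDD.zipWithIndex()
  .map { case (json, idx) => (idx / n, json) }
  .reduceByKey(_ + "," + _)
  .map(_._2)
newRDD.collect().foreach(println)
// expected, subject to the ordering caveat above:
// {"id": 1},{"id": 2}
// {"id": 3},{"id": 4}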

How to compute cumulative sum using Spark

I have an RDD of (String, Int) which is sorted by key:
val data = Array(("c1",6), ("c2",3),("c3",4))
val rdd = sc.parallelize(data).sortByKey()
Now I want the value for the first key to start at zero, and each subsequent key to be the sum of the values of all previous keys.
E.g.: c1 = 0, c2 = c1's value, c3 = (c1 value + c2 value), c4 = (c1 + .. + c3 value)
expected output:
(c1,0), (c2,6), (c3,9)...
Is it possible to achieve this ?
I tried it with map but the sum is not preserved inside the map.
var sum = 0
val t = keycount.map { x =>
  val temp = sum
  sum = sum + x._2
  (x._1, temp)
}
Compute partial results for each partition:
val partials = rdd.mapPartitionsWithIndex((i, iter) => {
  val (keys, values) = iter.toSeq.unzip
  val sums = values.scanLeft(0)(_ + _)
  Iterator((keys.zip(sums.tail), sums.last))
})
Collect the partial sums:
val partialSums = partials.values.collect
Compute cumulative sum over partitions and broadcast it:
val sumMap = sc.broadcast(
  (0 until rdd.partitions.size)
    .zip(partialSums.scanLeft(0)(_ + _))
    .toMap
)
Compute final results:
val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
  val offset = sumMap.value(i)
  if (iter.isEmpty) Iterator()
  else iter.next.map { case (k, v) => (k, v + offset) }.toIterator
})
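One note I would add (not part of the original answer): as written, this produces running totals that include the current element, i.e. (c1,6), (c2,9), (c3,13) for the question's data. To get the exclusive totals (c1,0), (c2,6), (c3,9) that the question asks for, zip the keys with sums.init instead of sums.tail; the rest of the pipeline stays the same:

val partialsExclusive = rdd.mapPartitionsWithIndex((i, iter) => {
  val (keys, values) = iter.toSeq.unzip
  val sums = values.scanLeft(0)(_ + _)
  // sums.init pairs each key with the sum of the values *before* it;
  // sums.last is still the total of the whole partition
  Iterator((keys.zip(sums.init), sums.last))
})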
Spark has built-in support for Hive ANALYTICS/WINDOWING functions, and the cumulative sum can be computed easily using those functions.
See the Hive wiki on ANALYTICS/WINDOWING functions.
Example:
Assuming you have a sqlContext object:
val datardd = sqlContext.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("d", 5), ("d", 6)))
import sqlContext.implicits._

// Register as a temporary view named "test"
datardd.toDF("id", "val").createOrReplaceTempView("test")

// Calculate the cumulative sum
sqlContext.sql("select id, val, " +
  "SUM(val) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from test").show()
This approach causes the warning below. If an executor runs out of memory, tune the job's memory parameters accordingly to work with a huge dataset.
WARN WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation
I hope this helps.
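As a rough sketch of how to avoid that warning (my addition, using a hypothetical grouping column grp that is not in the example above), include a partitioning column in the window; note that the cumulative sum then restarts within each group:

sqlContext.sql("select id, val, " +
  "SUM(val) over ( partition by grp order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from test").show()
// grp is hypothetical here; the test view registered above has no such column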
Here is a solution in PySpark. Internally it's essentially the same as @zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.
import numpy as np

def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)

    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    partition_cumsums = list(np.cumsum(partition_sums))
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)

    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd

# test for correctness by summing numbers, with and without Spark
rdd = sc.range(10000, numSlices=10).sortBy(lambda x: x)
cumsums, values = zip(*cumsum(rdd, lambda x: x).collect())
assert all(cumsums == np.cumsum(values))
I came across a similar problem and implemented @Paul's solution. I wanted to do a cumsum on an integer frequency table sorted by key (the integer), and there was a minor problem with np.cumsum(partition_sums), the error being unsupported operand type(s) for +=: 'int' and 'NoneType'.
If the range is big enough, the probability of each partition having something is high enough (no None values). However, if the range is much smaller than the count, and the number of partitions remains the same, some of the partitions will be empty. Here is the modified solution:
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)

    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    # partition_cumsums = list(np.cumsum(partition_sums))
    # ---- from here are the changed lines
    partition_sums = [x if x is not None else 0 for x in partition_sums]
    temp = np.cumsum(partition_sums)
    partition_cumsums = list(temp)
    # ----
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)

    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd

# test on a random integer frequency table
import pandas as pd  # needed for pd.DataFrame below

x = np.random.randint(10, size=1000)
D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(), columns=['D']))
c = D.groupBy('D').count().orderBy('D')
c_rdd = c.rdd.map(lambda x: x['count'])
cumsums, values = zip(*cumsum(c_rdd, lambda x: x).collect())
You may want to try using window functions with rowsBetween. Hope it is still helpful.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // assuming a SparkSession named spark; needed for $ and toDF outside the shell

val data = Array(("c1", 6), ("c2", 3), ("c3", 4))
val df = sc.parallelize(data).sortByKey().toDF("c", "v")

val w = Window.orderBy("c")
// rowsBetween(Window.unboundedPreceding, -1) sums all preceding rows, excluding the current one
val r = df.select($"c", sum($"v").over(w.rowsBetween(Window.unboundedPreceding, -1)).alias("cs"))
display(r)

Efficient countByValue of each column Spark Streaming

I want to find the countByValue of each column in my data. I can find countByValue() for each column (e.g. 2 columns for now) on a basic batch RDD as follows:
scala> val double = sc.textFile("double.csv")
scala> val counts = sc.parallelize((0 to 1).map(index => {
         double.map(x => {
           val token = x.split(",")
           math.round(token(index).toDouble)
         }).countByValue()
       }))
scala> counts.take(2)
res20: Array[scala.collection.Map[Long,Long]] = Array(Map(2 -> 5, 1 -> 5), Map(4 -> 5, 5 -> 5))
Now I want to do the same with DStreams. I have a windowedDStream and want to countByValue on each column. My data has 50 columns. I have done it as follows:
val windowedDStream = myDStream.window(Seconds(2), Seconds(2)).cache()
ssc.sparkContext.parallelize((0 to 49).map(index => {
  val counts = windowedDStream.map(x => {
    val token = x.split(",")
    math.round(token(index).toDouble)
  }).countByValue()
  counts.print()
}))
val topCounts = counts.map . . . . will not work
I get correct results with this; the only issue is that I want to apply more operations on counts, and counts is not available outside the map.
You misunderstand what parallelize does. You think that when you give it a Seq of two elements, those two elements will be calculated in parallel. That is not the case, and it would be impossible for it to be the case.
What parallelize actually does is create an RDD from the Seq that you provided.
To try to illuminate this, consider that this:
val countsRDD = sc.parallelize((0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
})
Is equivalent to this:
val counts = (0 to 1).map { index =>
  double.map { x =>
    val token = x.split(",")
    math.round(token(index).toDouble)
  }.countByValue()
}
val countsRDD = sc.parallelize(counts)
By the time parallelize runs, the work has already been performed. parallelize cannot retroactively make it so that the calculation happened in parallel.
The solution to your problem is to not use parallelize. It is entirely pointless.
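As a rough sketch of one way to keep the per-column counts available for further operations (my addition, not part of the original answer), you can emit ((columnIndex, roundedValue), 1) pairs and reduce by key, so the result stays a DStream that further transformations can be chained onto:

val perColumnCounts = windowedDStream
  .flatMap { line =>
    val tokens = line.split(",")
    // one ((column, roundedValue), 1) pair per column of the line
    tokens.indices.map(i => ((i, math.round(tokens(i).toDouble)), 1L))
  }
  .reduceByKey(_ + _) // DStream[((Int, Long), Long)]: count per (column, value) pair

// further operations can now be chained, for example regrouping by column:
// perColumnCounts.map { case ((col, value), count) => (col, (value, count)) } ...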

Using Streams for iteration in Scala

SICP says that iterative processes (e.g. Newton's method for square root calculation, calculating pi, etc.) can be formulated in terms of streams.
Does anybody use streams in Scala to model iterations?
Here is one way to produce the stream of approximations of pi:
val naturals = Stream.from(0)            // 0, 1, 2, ...
val odds = naturals.map(_ * 2 + 1)       // 1, 3, 5, ...
val oddInverses = odds.map(1.0d / _)     // 1/1, 1/3, 1/5, ...
val alternations = Stream.iterate(1)(-_) // 1, -1, 1, ...
val products = (oddInverses zip alternations)
  .map(ia => ia._1 * ia._2)              // 1/1, -1/3, 1/5, ...

// Computes a stream representing the cumulative sum of another one
def sumUp(s: Stream[Double], acc: Double = 0.0d): Stream[Double] =
  Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))

val pi = sumUp(products).map(_ * 4.0)    // Approximations of pi.
Now, say you want the 200th iteration:
scala> pi(200)
resN: Double = 3.1465677471829556
...or the 300000th:
scala> pi(300000)
resN : Double = 3.14159598691202
Streams are extremely useful when you are doing a sequence of recursive calculations where a single result depends on previous results, such as calculating pi. Here's a simpler example: consider the classic recursive algorithm for calculating Fibonacci numbers (1, 2, 3, 5, 8, 13, ...):
def fib(n: Int): Int = n match {
  case 0 => 1
  case 1 => 2
  case _ => fib(n - 1) + fib(n - 2)
}
One of the main points of this code is that, while very simple, it is extremely inefficient. fib(100) almost crashed my computer! Each recursion branches into two calls, and you are essentially calculating the same values many times.
Streams allow you to do dynamic programming in a recursive fashion, where once a value is calculated, it is reused every time it is needed again. To implement the above using streams:
val naturals: Stream[Int] = Stream.cons(0, naturals.map(_ + 1))
val fibs: Stream[Int] = naturals.map {
  case 0 => 1
  case 1 => 2
  case n => fibs(n - 1) + fibs(n - 2)
}

fibs(1)   // 2
fibs(2)   // 3
fibs(3)   // 5
fibs(100) // 1445263496
Whereas the recursive solution runs in O(2^n) time, the Streams solution runs in O(n^2) time. Since you only need the last 2 generated members, you can easily optimize this using Stream.drop so that the stream size doesn't overflow memory.
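As a small sketch of the same memoization idea (a common alternative formulation, my addition rather than part of the original answer), the stream can also be defined by zipping it with its own tail, so each value is computed once from the two before it instead of via repeated indexed lookups:

// uses the same 1, 2, 3, 5, ... convention as the fib above
lazy val fibs2: Stream[Int] = 1 #:: 2 #:: fibs2.zip(fibs2.tail).map { case (a, b) => a + b }

fibs2(3) // 5
fibs2(5) // 13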