How to combine `count` and `sum` computations for the same source - scala

There is a stream of integers:
val source = Source(List(1,2,3,4,5))
Is it possible to get the (count, sum) result from the source? For the above example it would be (5, 15).
I guess I should use flows and combine them:
val countFlow = Flow[Int].fold(0)((c, _) => c + 1)
val sumFlow = Flow[Int].fold(0)((s, e) => s + e)
How do I apply the above flows to the source? Or is there another way?

Final Total
The Flow that you presented is almost correct for getting a final value after the source is exhausted:
case class Data(sum : Int = 0, count : Int = 0)
val updateData : (Data, Int) => Data =
  (data, i) => Data(data.sum + i, data.count + 1)
val zeroData = Data()
val countAndSum = Flow[Int].fold(zeroData)(updateData)
This Flow can then be combined with a Sink.head to get the final result:
val result : Future[Data] =
  source
    .via(countAndSum)
    .runWith(Sink.head[Data])
Intermediate Values
If you want a "running counter", e.g. you want all of the intermediate Data values, then you can use Flow.scan instead of fold:
val intermediateCountAndSum =
  Flow[Int].scan(zeroData)(updateData)
And you can "drain" these Data values into a Sink.seq:
val intermediateResult : Future[Seq[Data]] =
  source
    .via(intermediateCountAndSum)
    .runWith(Sink.seq[Data])
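For the example source this completes with six elements, because scan also emits the initial zeroData. A quick check, assuming an implicit materializer and execution context are in scope:
intermediateResult.foreach(println)
// Vector(Data(0,0), Data(1,1), Data(3,2), Data(6,3), Data(10,4), Data(15,5))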

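Another option is to broadcast the source to the two original flows and zip their single fold results together in a graph: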
val graph = Source.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val fanOut = builder.add(Broadcast[Int](2))
  val zip = builder.add(Zip[Int, Int]())
  source ~> fanOut ~> countFlow ~> zip.in0
            fanOut ~> sumFlow   ~> zip.in1
  SourceShape(zip.out)
})
graph.runWith(Sink.last)

If the source emits the whole list as a single element (e.g. Source.single(List(1,2,3,4,5))), you can simply do the following:
Source.single(List(1,2,3,4,5)).map(list => (list.length, list.reduceLeft(_ + _)))
I hope it's helpful.
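If the source emits individual Ints, as in the question, a minimal one-pass sketch is a single fold sink (assuming an implicit materializer is in scope):
val countAndSum: Future[(Int, Int)] =
  source.runWith(Sink.fold((0, 0)) { case ((count, sum), e) => (count + 1, sum + e) })
// completes with (5, 15) for Source(List(1,2,3,4,5))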

case class Stats(sum: Int, count: Int) {
  def add(el: Int): Stats = this.copy(sum = sum + el, count = count + 1)
}
object Stats {
  def empty: Stats = Stats(0, 0)
}
val countFlow = Flow[Int].fold(Stats.empty)((stats, e) => stats add e)
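A minimal way to run it, assuming an implicit ActorSystem/materializer is in scope:
val stats: Future[Stats] = source.via(countFlow).runWith(Sink.head)
// completes with Stats(15, 5) for Source(List(1,2,3,4,5))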

Related

Akka Stream - Parallel Processing with Partition

I'm looking for a way to implement/use a fan-out stage that takes 1 input and broadcasts to N outputs in parallel, the difference being that I want to partition the elements.
Example: one input element may be emitted to 4 different outputs, and another to 2 other outputs, depending on some function f
source ~> partitionWithBroadcast // Outputs to some subset of [0,3] outputs
partitionWithBroadcast(0) ~> ...
partitionWithBroadcast(1) ~> ...
partitionWithBroadcast(2) ~> ...
partitionWithBroadcast(3) ~> ...
I was searching the Akka documentation but couldn't find any flow that would be suitable.
Any ideas?
What comes to mind is a FanOutShape with filters attached to each output. NOTE: I am not using the standard Partition operator because it emits to just 1 output. The question asks to emit to any of the connected outputs. E.g.:
def createPartial[E](partitioner: E => Set[Int]) = {
  GraphDSL.create[FanOutShape4[E, E, E, E, E]]() { implicit builder =>
    import GraphDSL.Implicits._

    val flow = builder.add(Flow.fromFunction((e: E) => (e, partitioner(e))))
    val broadcast = builder.add(Broadcast[(E, Set[Int])](4))
    val flow0 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(0)).map(_._1))
    val flow1 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(1)).map(_._1))
    val flow2 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(2)).map(_._1))
    val flow3 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(3)).map(_._1))

    flow.out ~> broadcast.in
    broadcast.out(0) ~> flow0.in
    broadcast.out(1) ~> flow1.in
    broadcast.out(2) ~> flow2.in
    broadcast.out(3) ~> flow3.in

    new FanOutShape4[E, E, E, E, E](flow.in, flow0.out, flow1.out, flow2.out, flow3.out)
  }
}
The partitioner is a function that maps an element from upstream to the set of output indices that should receive it. The graph pairs each element with its computed partition set, then broadcasts that pair. A flow attached to each output of the Broadcast keeps only the elements that the partitioner assigned to that output.
Then use it e.g. as:
implicit val system: ActorSystem = ActorSystem()
implicit val ec = system.dispatcher

def partitioner(s: String) = (0 to 3).filter(s(_) == '*').toSet

val src = Source(immutable.Seq("*__*", "**__", "__**", "_*__"))
val sink0 = Sink.seq[String]
val sink1 = Sink.seq[String]
val sink2 = Sink.seq[String]
val sink3 = Sink.seq[String]

def toFutureTuple[X](f0: Future[X], f1: Future[X], f2: Future[X], f3: Future[X]) =
  f0.zip(f1).zip(f2).map(t => (t._1._1, t._1._2, t._2)).zip(f3).map(t => (t._1._1, t._1._2, t._1._3, t._2))

val g = RunnableGraph.fromGraph(
  GraphDSL.create(src, sink0, sink1, sink2, sink3)((_, f0, f1, f2, f3) => toFutureTuple(f0, f1, f2, f3)) {
    implicit builder => (in, o0, o1, o2, o3) => {
      import GraphDSL.Implicits._
      val part = builder.add(createPartial(partitioner))
      in ~> part.in
      part.out0 ~> o0
      part.out1 ~> o1
      part.out2 ~> o2
      part.out3 ~> o3
      ClosedShape
    }
  })
val result = Await.result(g.run(), 10.seconds)
println("0: " + result._1.mkString(" "))
println("1: " + result._2.mkString(" "))
println("2: " + result._3.mkString(" "))
println("3: " + result._4.mkString(" "))
// Prints:
//
// 0: *__* **__
// 1: **__ _*__
// 2: __**
// 3: *__* __**
First, implement your function to create the Partition:
def partitionFunction4[A](func: A => Int)(implicit builder: GraphDSL.Builder[NotUsed]) = {
  // partition with 4 output ports
  builder.add(Partition[A](4, inputElement => func(inputElement)))
}
Then, create another function that builds a Sink with a log function, which will be used to print each element to the console:
def stream[A](log: A => Unit) =
  Flow.fromFunction[A, A](el => {
    log(el)
    el
  }).to(Sink.ignore)
Connect all the elements in the graph function:
def graph[A](src: Source[A, NotUsed])
            (func4: A => Int, log: Int => A => Unit) = {
  RunnableGraph
    .fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val partition4 = partitionFunction4(func4)
      /** Four sinks **/
      val flowSet0 = (0 to 3).map(in => log(in))
      src ~> partition4.in
      partition4.out(0) ~> stream(flowSet0(0))
      partition4.out(1) ~> stream(flowSet0(1))
      partition4.out(2) ~> stream(flowSet0(2))
      partition4.out(3) ~> stream(flowSet0(3))
      ClosedShape
    })
    .run()
}
Create a Source that emits five Int elements. The partition function is "element % 4". Depending on the result of this function, each element will be redirected to the corresponding output:
val source1: Source[Int, NotUsed] = Source(0 to 4)
graph[Int](source1)(f1 => f1 % 4,
  in => {
    el =>
      println(s"Stream ${in} element ${el}")
  })
Obtaining as a result:
Stream 0 element 0
Stream 1 element 1
Stream 2 element 2
Stream 3 element 3
Stream 0 element 4

Evaluating multiple filters in a single pass

I have the RDD below and I need to perform a series of filters on the same dataset to derive different counters and aggregates.
Is there a way I can apply these filters and compute the aggregates in a single pass, avoiding having Spark go over the same dataset multiple times?
val res = df.rdd.map(row => {
  // ............... Generate data here for each row.......
})
res.persist(StorageLevel.MEMORY_AND_DISK)
val all = res.count()
val stats1 = res.filter(row => row.getInt(1) > 0)
val stats1Count = stats1.count()
val stats1Agg = stats1.map(r => r.getInt(1)).mean()
val stats2 = res.filter(row => row.getInt(2) > 0)
val stats2Count = stats2.count()
val stats2Agg = stats2.map(r => r.getInt(2)).mean()
You can use aggregate:
case class Stats(count: Int = 0, sum: Int = 0) {
  def mean: Double = sum.toDouble / count
  def +(s: Stats): Stats = Stats(count + s.count, sum + s.sum)
  def add(n: Int): Stats = if (n > 0) copy(count + 1, sum + n) else this
}
val (stats1, stats2) = res.aggregate(Stats() -> Stats())(
  { case ((s1, s2), row) => (s1 add row.getInt(1), s2 add row.getInt(2)) },
  { case ((a1, a2), (b1, b2)) => (a1 + b1, a2 + b2) }
)
val (stats1Count, stats1Agg, stats2Count, stats2Agg) = (stats1.count, stats1.mean, stats2.count, stats2.mean)
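If the data is already a DataFrame, a similar single-pass result can also be obtained with conditional aggregation. A sketch, where c1 and c2 are placeholder column names for whatever row.getInt(1) and row.getInt(2) refer to:
import org.apache.spark.sql.functions._

val aggRow = df.agg(
  count(lit(1)).as("all"),
  count(when(col("c1") > 0, 1)).as("stats1Count"),
  avg(when(col("c1") > 0, col("c1"))).as("stats1Agg"),
  count(when(col("c2") > 0, 1)).as("stats2Count"),
  avg(when(col("c2") > 0, col("c2"))).as("stats2Agg")
).head()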

Akka Streams - Combine different Sources

I have an object that builds different Flows. Each flow has filters that may discard values, so the final result may contain only a subset of the original source.
The code:
object RawFlowGeneratorByVehicle {
  val deviceEventFilter = (de : DeviceEvent) => de.isValidPosition : Boolean

  def buildSpeedFlow(vehicles : List[Vehicle]) : VEHICLERAWFLOW = {
    Flow[DeviceEvent].filter(deviceEventFilter)
      .groupBy(vehicles.length, de => de.getModemId)
      .reduce((a, b) => if (a.getGenerationDate >= b.getGenerationDate) a else b)
      .mergeSubstreams
      .map(de => VehicleFlowResult(de.getModemId, "Speed", de.getSpeed))
  }

  def buildCountFlow(vehicles: List[Vehicle], maxSpeed : Double) : VEHICLERAWFLOW = {
    Flow[DeviceEvent].filter(deviceEventFilter)
      .groupBy(vehicles.length, de => de.getModemId)
      .filter(de => de.getSpeed > maxSpeed)
      .map(_ -> 1)
      .reduce((l, r) => (l._1, l._2 + r._2))
      .mergeSubstreams
      .map(a => VehicleFlowResult(a._1.getModemId, "SpeedCount", a._2))
  }
  //...Other flows
}
After building the flows, they are merged in a graph, and the final result is a CSV file. This is the object with the graph:
object RunnableFlows {
  def rawGraph(in: Source[DeviceEvent, NotUsed], flows: List[VEHICLERAWFLOW]): Source[VehicleFlowResult, NotUsed] = {
    val g = Source.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
      import GraphDSL.Implicits._
      val bcast = builder.add(Broadcast[DeviceEvent](flows.length))
      val merge = builder.add(Merge[VehicleFlowResult](flows.length))
      in ~> bcast ~> flows.head ~> merge
      for (curFlow <- flows.tail) {
        bcast ~> curFlow ~> merge
      }
      SourceShape(merge.out)
    })
    g
  }
}
The flows may emit different numbers of elements, so I don't know how to merge/concat/zip them to generate a CSV with the same number of rows as the vehicles list (this list doesn't have duplicate values), setting default values when a specific vehicle doesn't pass the filters of the flows (one possible approach is sketched below, after the sample output).
The CSV must look something like this:
imei;name;event;value
aaa;vehicle1;Event1;100
aaa;vehicle1;Event2;100
bbb;vehicle2;DefaultEvent;defaultValue
ccc;vehicle3;Event5;89
Thanks!!
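Not from the original thread, but one way to guarantee a row per vehicle is to fold the merged results into a Map keyed by imei and fill in defaults for vehicles that produced nothing. A rough sketch, assuming an implicit materializer and execution context, and hypothetical accessors imei on VehicleFlowResult and getImei on Vehicle:
def withDefaults(results: Source[VehicleFlowResult, NotUsed],
                 vehicles: List[Vehicle]): Future[List[VehicleFlowResult]] =
  results
    .runFold(Map.empty[String, List[VehicleFlowResult]]) { (acc, r) =>
      // group the real results by the vehicle's imei (field name assumed)
      acc.updated(r.imei, r :: acc.getOrElse(r.imei, Nil))
    }
    .map { byImei =>
      vehicles.flatMap { v =>
        // emit a default row for vehicles that did not pass any filter
        byImei.getOrElse(v.getImei, List(VehicleFlowResult(v.getImei, "DefaultEvent", 0.0)))
      }
    }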

Iterating over cogrouped RDD

I have used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation the joined object would look like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  for (tupleB <- listB) {
    val localResults = ListBuffer[Double]()
    val itemToTest = Set(tupleB._1)
    val tempList = ListBuffer[(Int, Double)]()
    for (tupleA <- listA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      val i = (tupleA._1, tValue)
      tempList += i
    }
    val sortList = tempList.sortWith(_._2 > _._2).slice(0, 20).map(i => i._1)
    val intersect = sortList.toSet.intersect(itemToTest)
    if (intersect.size > 0)
      localResults += 1.0
    else localResults += 0.0
    val normalized = sum(localResults.toList) / localResults.size
    globalBuffer += normalized
  }
})
//method sum
def sum(xs: List[Double]): Double = {//do the sum}
At the end of this I was expecting joined to be a list of double values, but when I looked at it, it was Unit. Also, I realize this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  val localResults = listB.map {
    case (intBValue, longBValue) =>
      val itemToTest = intBValue // it's always one element
      val tempList = listA.map {
        case (intAValue, longAValue) =>
          (intAValue, someFunctionReturnDouble(longBValue, longAValue))
      }
      val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1)
      if (sortList.toSet.contains(itemToTest)) { 1.0 } else { 0.0 }
      // no real need to convert to a set for 20 elements, by the way
  }
  sum(localResults) / localResults.size
})
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD.) Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
  val iterA = x._2._1
  val iterB = x._2._2
  var count, positiveCount = 0
  val tempList = ListBuffer[(Int, Double)]()
  for (tupleB <- iterB) {
    tempList.clear
    for (tupleA <- iterA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      tempList += ((tupleA._1, tValue))
    }
    val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
    if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
    count += 1
  }
  positiveCount.toDouble / count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
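For example (collect materializes the RDD on the driver, which is fine for a modest number of groups):
val proportions: Array[Double] = joined.collect()
proportions.foreach(println)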

How to extract value from a scala Future

How do I perform a reduce/fold operation on the Seq and then get the final value?
I'm performing an operation (in this case a Redis call) that returns a Future. I'm processing the Future (results) using a map operation.
The map operation returns a Future[Seq[Any]] type.
res0: scala.concurrent.Future[Seq[Any]] = scala.concurrent.impl.Promise$DefaultPromise#269f8f79
Now I want to perform some operations(fold/reduce) on this Seq and then get a final value. How can I achieve this?
implicit val akkaSystem = akka.actor.ActorSystem()
val redisClient = RedisClient()
val sentimentZSetKey = "dummyzset"
val currentTimeStamp = System.currentTimeMillis()
val end = Limit(currentTimeStamp)
val start = Limit(currentTimeStamp - 60 * 100000)
val results = redisClient.zrangebyscoreWithscores(sentimentZSetKey, start, end)

implicit val formats = DefaultFormats
import org.json4s._
import org.json4s.native.JsonMethods._
import org.json4s.DefaultFormats

results.map {
  seq => seq.map {
    element => element match {
      case (byteString, value) => {
        val p = byteString.decodeString("UTF-8")
        try {
          val ph = parse(p).extract[MyClass]
          ph
        } catch {
          case e: Exception => println(e.getMessage)
        }
      }
      case _ =>
    }
  }
}
Blocking is discouraged when using futures in Scala, but it can be done with Await. Since you want to further transform the sequence, you are better off using functional composition (map, flatMap, for comprehensions).
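A minimal sketch of both options, assuming results is the Future[Seq[...]] from above (the reduction here just counts the elements; replace the fold with whatever aggregation you actually need):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Functional composition: keep working inside the Future, no blocking
val reduced: Future[Int] = results.map(seq => seq.foldLeft(0)((acc, _) => acc + 1))
reduced.foreach(total => println(s"total: $total"))

// Blocking as a last resort, e.g. at the very end of a small script
val total: Int = Await.result(reduced, 10.seconds)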