Count number of elements in Akka Streams - scala

I'm trying to transform a Source of Scala entities into a Source of ByteString via Alpakka's CsvFormatting, and also count the number of elements in the initial stream. Could you suggest the best way to count the initialSource elements while keeping the result as a ByteString Source?
val initialSource: Source[SomeEntity, NotUsed] = Source.fromPublisher(publisher)
val csvSource: Source[ByteString, NotUsed] = initialSource
.map(e => List(e.firstName, e.lastName, e.city))
.via(CsvFormatting.format())

To count the elements in a stream, one must run the stream. One approach is to broadcast the stream elements to two sinks: one sink is the result of the main processing, the other sink simply counts the number of elements. Here is a simple example, which uses a graph to obtain the materialized values of both sinks:
val sink1 = Sink.foreach(println)
val sink2 = Sink.fold[Int, ByteString](0)((acc, _) => acc + 1)
val g = RunnableGraph.fromGraph(GraphDSL.create(sink1, sink2)((_, _)) { implicit builder =>
  (s1, s2) =>
    import GraphDSL.Implicits._
    val broadcast = builder.add(Broadcast[ByteString](2))
    val source: Source[ByteString, NotUsed] =
      Source(1 to 10)
        .map(i => List(i.toString))
        .via(CsvFormatting.format())
    source ~> broadcast.in
    broadcast.out(0) ~> s1
    broadcast.out(1) ~> s2
    ClosedShape
}) // RunnableGraph[(Future[Done], Future[Int])]
val (fut1, fut2) = g.run()
fut2 onComplete {
case Success(count) => println(s"Number of elements: $count")
case Failure(_) =>
}
In the above example, the first sink just prints the stream elements and has a materialized value of type Future[Done]. The second sink does a fold operation to count the stream elements and has a materialized value of type Future[Int]. The following is printed:
ByteString(49, 13, 10)
ByteString(50, 13, 10)
ByteString(51, 13, 10)
ByteString(52, 13, 10)
ByteString(53, 13, 10)
ByteString(54, 13, 10)
ByteString(55, 13, 10)
ByteString(56, 13, 10)
ByteString(57, 13, 10)
ByteString(49, 48, 13, 10)
Number of elements: 10
Another option for sending stream elements to two different sinks, while retaining their respective materialized values, is to use alsoToMat:
val sink1 = Sink.foreach(println)
val sink2 = Sink.fold[Int, ByteString](0)((acc, _) => acc + 1)
val (fut1, fut2) = Source(1 to 10)
.map(i => List(i.toString))
.via(CsvFormatting.format())
.alsoToMat(sink1)(Keep.right)
.toMat(sink2)(Keep.both)
.run() // (Future[Done], Future[Int])
fut2 onComplete {
case Success(count) => println(s"Number of elements: $count")
case Failure(_) =>
}
This produces the same result as the graph example described earlier.
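For the original question, where the count should cover initialSource and the result must remain a Source[ByteString], one possible sketch is to attach the counting sink with alsoToMat before the formatting step. It assumes Akka 2.6 or later (the implicit ActorSystem supplies the materializer). The SomeEntity case class, the sample data, and the plain string formatting that stands in for CsvFormatting.format() are illustrative assumptions, not part of the question.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.util.ByteString

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object CountInitialElements {
  // Hypothetical stand-in for the question's SomeEntity.
  final case class SomeEntity(firstName: String, lastName: String, city: String)

  // Runs the stream and returns (number of entities, concatenated CSV bytes).
  def run(entities: List[SomeEntity]): (Int, ByteString) = {
    implicit val system: ActorSystem = ActorSystem("count-example")
    try {
      val initialSource = Source(entities)

      // Side sink that only counts the entities passing through.
      val counter = Sink.fold[Int, SomeEntity](0)((acc, _) => acc + 1)

      // Still a Source of ByteString; the count is its materialized value.
      val csvSource: Source[ByteString, Future[Int]] =
        initialSource
          .alsoToMat(counter)(Keep.right)
          // Plain string formatting stands in for CsvFormatting.format().
          .map(e => ByteString(s"${e.firstName},${e.lastName},${e.city}\r\n"))

      // Any sink can consume csvSource; here the bytes are simply concatenated.
      val (countF, bytesF) =
        csvSource
          .toMat(Sink.fold[ByteString, ByteString](ByteString.empty)(_ ++ _))(Keep.both)
          .run()

      (Await.result(countF, 5.seconds), Await.result(bytesF, 5.seconds))
    } finally system.terminate()
  }

  def main(args: Array[String]): Unit = {
    val (count, csv) = run(List(
      SomeEntity("Ada", "Lovelace", "London"),
      SomeEntity("Alan", "Turing", "Wilmslow")))
    println(s"Number of elements: $count")
    print(csv.utf8String)
  }
}
```

As in the examples above, the count only becomes available once the stream has been run to completion; the difference is that the counter sits upstream of the CSV formatting, so it counts entities rather than formatted lines.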

Related

Akka - convert Flow into Collection or Publisher

I'm trying to split an Akka Source into two separate ones.
val requestFlow = Flow[BodyPartEntity].to(Sink.seq) // convert to Seq[BodyPartEntity]
val dataFlow = Flow[BodyPartEntity].to(Sink.asPublisher(fanout = false)) // convert to Publisher[BodyPartEntity]
implicit class EitherSourceExtension[L, R, Mat](source: Source[FormData.BodyPart, Mat]) {
def partition(left: Sink[BodyPartEntity, NotUsed], right: Sink[BodyPartEntity, NotUsed]): Graph[ClosedShape, NotUsed] = {
GraphDSL.create() { implicit builder =>
import akka.stream.scaladsl.GraphDSL.Implicits._
val partition = builder.add(Partition[FormData.BodyPart](2, element => if (element.getName == "request") 0 else 1))
source ~> partition.in
partition.out(0).map(_.getEntity) ~> left
partition.out(1).map(_.getEntity) ~> right
ClosedShape
}
}
}
How can I convert requestFlow into a Seq[BodyPartEntity] and dataFlow into a Publisher[BodyPartEntity]?
You could use a BroadcastHub for this. From the documentation:
A BroadcastHub can be used to consume elements from a common producer by a dynamic set of consumers.
Simplified code:
val runnableGraph: RunnableGraph[Source[Int, NotUsed]] =
Source(1 to 5).toMat(
BroadcastHub.sink(bufferSize = 4))(Keep.right)
val fromProducer: Source[Int, NotUsed] = runnableGraph.run()
// Process the messages from the producer in two independent consumers
fromProducer.runForeach(msg => println("consumer1: " + msg))
fromProducer.runForeach(msg => println("consumer2: " + msg))

Akka Stream - Parallel Processing with Partition

I'm looking for a way to implement or use a fan-out stage that takes one input and broadcasts to N outputs in parallel, except that I want to partition the elements: each element should go only to a subset of the outputs.
Example: one input element may be emitted to 4 different outputs, while another may be emitted to 2 other outputs, depending on some function f.
source ~> partitionWithBroadcast // Outputs to some subset of [0,3] outputs
partitionWithBroadcast(0) ~> ...
partitionWithBroadcast(1) ~> ...
partitionWithBroadcast(2) ~> ...
partitionWithBroadcast(3) ~> ...
I searched the Akka documentation but couldn't find any suitable flow.
Any ideas?
What comes to mind is a FanOutShape with filters attached to each output. NOTE: I am not using the standard Partition operator because it emits to just 1 output. The question asks to emit to any of the connected outputs. E.g.:
def createPartial[E](partitioner: E => Set[Int]) = {
  GraphDSL.create[FanOutShape4[E,E,E,E,E]]() { implicit builder =>
    import GraphDSL.Implicits._
    val flow = builder.add(Flow.fromFunction((e: E) => (e, partitioner(e))))
    val broadcast = builder.add(Broadcast[(E, Set[Int])](4))
    val flow0 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(0)).map(_._1))
    val flow1 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(1)).map(_._1))
    val flow2 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(2)).map(_._1))
    val flow3 = builder.add(Flow[(E, Set[Int])].filter(_._2.contains(3)).map(_._1))
    flow.out ~> broadcast.in
    broadcast.out(0) ~> flow0.in
    broadcast.out(1) ~> flow1.in
    broadcast.out(2) ~> flow2.in
    broadcast.out(3) ~> flow3.in
    new FanOutShape4[E,E,E,E,E](flow.in, flow0.out, flow1.out, flow2.out, flow3.out)
  }
}
The partitioner is a function that maps an upstream element to the set of output indices that should receive it. The graph first pairs each element with that set, then broadcasts the resulting tuple. A flow attached to each output of the Broadcast keeps only the elements that the partitioner assigned to that output and drops the set again.
Then use it e.g. as:
implicit val system: ActorSystem = ActorSystem()
implicit val ec = system.dispatcher
def partitioner(s: String) = (0 to 3).filter(s(_) == '*').toSet
val src = Source(immutable.Seq("*__*", "**__", "__**", "_*__"))
val sink0 = Sink.seq[String]
val sink1 = Sink.seq[String]
val sink2 = Sink.seq[String]
val sink3 = Sink.seq[String]
def toFutureTuple[X](f0: Future[X], f1: Future[X], f2: Future[X], f3: Future[X]) = f0.zip(f1).zip(f2).map(t => (t._1._1,t._1._2,t._2)).zip(f3).map(t => (t._1._1,t._1._2,t._1._3,t._2))
val g = RunnableGraph.fromGraph(GraphDSL.create(src, sink0, sink1, sink2, sink3)((_,f0,f1,f2,f3) => toFutureTuple(f0,f1,f2,f3)) { implicit builder =>
(in, o0, o1, o2, o3) => {
import GraphDSL.Implicits._
val part = builder.add(createPartial(partitioner))
in ~> part.in
part.out0 ~> o0
part.out1 ~> o1
part.out2 ~> o2
part.out3 ~> o3
ClosedShape
}
})
val result = Await.result(g.run(), 10.seconds)
println("0: " + result._1.mkString(" "))
println("1: " + result._2.mkString(" "))
println("2: " + result._3.mkString(" "))
println("3: " + result._4.mkString(" "))
// Prints:
//
// 0: *__* **__
// 1: **__ _*__
// 2: __**
// 3: *__* __**
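The routing performed by createPartial can also be checked without running a stream. The following is a collection-based sketch of the same tag, broadcast, and filter logic; the object name RoutingModel and the helper route are hypothetical and not part of Akka.

```scala
object RoutingModel {
  // Same partitioner as in the answer: positions 0..3 holding '*' select the outputs.
  def partitioner(s: String): Set[Int] = (0 to 3).filter(s(_) == '*').toSet

  // Collection-based stand-in for the Broadcast + filter + map stages.
  def route[E](elements: Seq[E], outputs: Int)(partitioner: E => Set[Int]): Vector[Seq[E]] = {
    // Corresponds to Flow.fromFunction: pair each element with its target outputs.
    val tagged = elements.map(e => (e, partitioner(e)))
    // Corresponds to the per-output filter and map.
    Vector.tabulate(outputs) { i =>
      tagged.filter(_._2.contains(i)).map(_._1)
    }
  }

  def main(args: Array[String]): Unit = {
    val routed = route(Seq("*__*", "**__", "__**", "_*__"), outputs = 4)(partitioner)
    routed.zipWithIndex.foreach { case (elems, i) =>
      println(s"$i: ${elems.mkString(" ")}")
    }
    // Prints:
    // 0: *__* **__
    // 1: **__ _*__
    // 2: __**
    // 3: *__* __**
  }
}
```

Because every output sees every element and filters independently, an element whose set is empty is dropped, and an element whose set contains all indices is delivered everywhere.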
First, implement your function to create the Partition:
def partitionFunction4[A](func: A => Int)(implicit builder: GraphDSL.Builder[NotUsed]) = {
// partition with 4 output ports
builder.add(Partition[A](4, inputElement => func(inputElement)))
}
Then, create another function that builds a Sink from a log function, which is used to print each element to the console:
def stream[A](log: A => Unit) = Flow.fromFunction[A, A](el => {
log(el)
el
} ).to(Sink.ignore)
Connect all the elements in the graph function:
def graph[A](src: Source[A, NotUsed])
            (func4: A => Int, log: Int => A => Unit) = {
  RunnableGraph
    .fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._
      val partition4 = partitionFunction4(func4)
      /** Four sinks, one log function per output port **/
      val flowSet0 = (0 until 4).map(in => log(in))
      src ~> partition4.in
      partition4.out(0) ~> stream(flowSet0(0))
      partition4.out(1) ~> stream(flowSet0(1))
      partition4.out(2) ~> stream(flowSet0(2))
      partition4.out(3) ~> stream(flowSet0(3))
      ClosedShape
    })
    .run()
}
Create a Source that emits five Int elements. The partitioning function is "element % 4". Depending on the result of this function, each element is routed to the corresponding output:
val source1: Source[Int, NotUsed] = Source(0 to 4)
graph[Int](source1)(f1 => f1 % 4,
in => {
el =>
println(s"Stream ${in} element ${el}")
})
This produces:
Stream 0 element 0
Stream 1 element 1
Stream 2 element 2
Stream 3 element 3
Stream 0 element 4

Schedule computation concurrently for all elements of the fs2.Stream

I have an fs2.Stream consisting of some elements (possibly infinitely many), and I want to schedule some computation for all elements of the stream concurrently with each other. Here is what I tried:
implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)
val stream = for {
id <- fs2.Stream.emits(List(1, 2)).covary[IO]
_ <- fs2.Stream.awakeEvery[IO](1.second)
_ <- fs2.Stream.eval(IO(println(id)))
} yield ()
stream.compile.drain.unsafeRunSync()
The program output looks like
1
1
1
etc...
which is not what I expected. I'd like to interleave the scheduled computations for all of the elements of the original stream, without waiting for the first inner stream to terminate (which never happens, because the scheduling is infinite).
val str = for {
id <- Stream.emits(List(1, 5, 7)).covary[IO]
res = timer.sleep(id.second) >> IO(println(id))
} yield res
val stream = str.parEvalMapUnordered(5)(identity)
stream.compile.drain.unsafeRunSync()
or
val stream = Stream.emits(List(1, 5, 7))
.map { id =>
Stream.eval(timer.sleep(id.second) >> IO(println(id))) }
.parJoinUnbounded
stream.compile.drain.unsafeRunSync()
According to the hints given by @KrzysztofAtłasik and @LuisMiguelMejíaSuárez, here is the solution I just came up with:
val originalStream = fs2.Stream.emits(List(1, 2))
val scheduledComputation = originalStream.covary[IO].map({ id =>
fs2.Stream.awakeEvery[IO](1.second).evalMap(_ => IO.delay(println(id)))
}).fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream)).flatten
The solution that @KrzysztofAtłasik proposed in the comments, interleaving id <- fs2.Stream.emits(List(1, 2)).covary[IO] and _ <- fs2.Stream.awakeEvery[IO](1.second), also works, but it does not allow each element to be scheduled in its own way.
To schedule the elements concurrently, each one every id seconds, it is possible to do the following:
val scheduleEachElementIndividually = originalStream.covary[IO].map({ id =>
//id.seconds
fs2.Stream.awakeEvery[IO](id.second).evalMap(_ => IO.delay(println(id)))
}).fold(fs2.Stream.empty.covaryAll[IO, Unit])((result, stream) => result.merge(stream)).flatten
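The parJoinUnbounded suggestion and the awakeEvery scheduling above can be combined, which avoids the manual fold over merge. This is a minimal sketch assuming fs2 2.x with cats-effect 2, as in the question; the object name, the helper ticks, and the shortened tick period (id * 100 milliseconds, so the example finishes quickly) are illustrative choices.

```scala
import cats.effect.{ContextShift, IO, Timer}
import fs2.Stream

import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

object InterleavedSchedules {
  implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
  implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

  // One infinite stream per id, emitting the id every id * 100 milliseconds.
  def ticks(id: Int): Stream[IO, Int] =
    Stream.awakeEvery[IO]((id * 100).millis).map(_ => id)

  // parJoinUnbounded runs all inner streams concurrently and merges their output,
  // so an infinite inner stream does not block the ones that follow it.
  val interleaved: Stream[IO, Int] =
    Stream.emits(List(1, 2)).covary[IO].map(ticks).parJoinUnbounded

  // Take a finite prefix so that the example terminates.
  def firstFive(): List[Int] = interleaved.take(5).compile.toList.unsafeRunSync()

  def main(args: Array[String]): Unit = println(firstFive())
}
```

The exact order of the emitted ids depends on timing, but both ids show up in the prefix, unlike in the for-comprehension, where flatMap only moves on to the next id after the previous inner stream has finished.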

Akka Streams scala DSL and Op-Rabbit

I have started using Akka Streams and Op-Rabbit and am a bit confused.
I need to split the stream based on a predicate and then combine them much like I have done when creating graphs and using the Partition and Merge.
I have been able to do things like this using the GraphDSL.Builder, but can't seem to get it to work with AckedSource/Flow/Sink.
The graph would look like:
                        | --> flow1 --> |
source --> partition -->|               | --> flow3 --> sink
                        | --> flow2 --> |
I'm not sure if splitWhen is what I should use because I always need exactly 2 flows.
This is a sample that does not do the partitioning and does not use the GraphDSL.Builder:
def splitExample(source: AckedSource[String, SubscriptionRef],
queueName: String)
(implicit actorSystem: ActorSystem): RunnableGraph[SubscriptionRef] = {
val toStringFlow: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]]
.map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val printFlow1: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow1 processing $s")
tup
})
val printFlow2: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]]
.map[AckTup[String]](tup => {
val (p, s) = tup
println(s"flow2 processing $s")
tup
})
source
.map(Message.queue(_, queueName))
.via(AckedFlow(toStringFlow))
// partition if string.length < 10
.via(AckedFlow(printFlow1))
.via(AckedFlow(printFlow2))
.to(AckedSink.ack)
}
This is the code that I can't seem to get working:
import GraphDSL.Implicits._
def buildModelAcked(source: AckedSource[String, SubscriptionRef] , queueName: String)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
import GraphDSL.Implicits._
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
import GraphDSL.Implicits._
source.map(Message.queue(_, queueName)) ~> AckedFlow(toStringFlow) ~> AckedSink.ack
// source.map(Message.queue(_, queueName)).via(AckedFlow(toStringFlow)).to(AckedSink.ack)
ClosedShape
}}
The compiler can't resolve the ~> operator.
So my questions are:
Is there an example project that uses the scala dsl to build graphs of Acked/Source/Flow/Sink?
Is there an example project that partitions and merges that is similar to what I am trying to do here?
Keep in mind the following definitions when dealing with the acked-stream library:
AckedSource[Out, Mat] is a wrapper for Source[AckTup[Out], Mat]
AckedFlow[In, Out, Mat] is a wrapper for Flow[AckTup[In], AckTup[Out], Mat]
AckedSink[In, Mat] is a wrapper for Sink[AckTup[In], Mat]
AckTup[T] is an alias for (Promise[Unit], T)
The classic flow combinators operate on the T part of the AckTup.
The .acked combinator completes the Promise[Unit] of an AckedFlow.
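The definitions above can be pictured with a small standard-library model. This is only a simplified illustration of the idea of carrying a Promise[Unit] next to each payload; the object AckTupModel and the helpers mapPayload and ack are hypothetical and are not the acked-stream implementation.

```scala
import scala.concurrent.Promise

object AckTupModel {
  // Alias as described above: a promise to acknowledge, paired with the payload.
  type AckTup[T] = (Promise[Unit], T)

  // What a payload-level map amounts to in this model:
  // transform the T part and carry the same promise along.
  def mapPayload[A, B](tup: AckTup[A])(f: A => B): AckTup[B] =
    (tup._1, f(tup._2))

  // What acknowledging amounts to in this model: complete the promise.
  def ack[T](tup: AckTup[T]): T = {
    tup._1.trySuccess(())
    tup._2
  }

  def main(args: Array[String]): Unit = {
    val promise = Promise[Unit]()
    val incoming: AckTup[Array[Byte]] = (promise, "hello".getBytes("UTF-8"))
    val asString = mapPayload(incoming)(bytes => new String(bytes, "UTF-8"))
    println(s"acked before: ${promise.isCompleted}")
    println(s"payload: ${ack(asString)}")
    println(s"acked after: ${promise.isCompleted}")
  }
}
```

The point of the model is that the promise travels unchanged through every transformation, so a message is only acknowledged when the final stage completes it.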
The GraphDSL edge operator (~>) will work against a bunch of Akka predefined shapes (see the code for GraphDSL.Implicits), but it won't work against custom shapes defined by the acked-stream lib.
You have two ways out:
you define your own ~> implicit operator, along the lines of the ones in GraphDSL.Implicits
you unwrap the acked stages to obtain standard stages. You are able to access the wrapped stage using .wrappedRepr - available on AckedSource, AckedFlow and AckedSink.
Based on Stefano Bonetti's excellent direction, here is a possible solution:
Graph:
                      |--> short --|
rabbitMq --> before --|            |--> after
                      |--> long  --|
Solution:
val before: Flow[AckTup[Message], AckTup[String], NotUsed] = Flow[AckTup[Message]].map[AckTup[String]](tup => {
val (p,m) = tup
(p, new String(m.data))
})
val short: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"short: $s")
tup
})
val long: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"long: $s")
tup
})
val after: Flow[AckTup[String], AckTup[String], NotUsed] = Flow[AckTup[String]].map[AckTup[String]](tup => {
val (p, s) = tup
println(s"all $s")
tup
})
def buildSplitGraph(source: AckedSource[String, SubscriptionRef]
, queueName: String
, splitLength: Int)(implicit actorSystem: ActorSystem): Graph[ClosedShape, Future[Done]] = {
GraphDSL.create(Sink.ignore) { implicit builder: GraphDSL.Builder[Future[Done]] => s =>
val toShort = 0
val toLong = 1
// junctions
val split = builder.add(Partition[AckTup[String]](2, (tup: AckTup[String]) => {
val (p, s) = tup
if (s.length < splitLength) toShort else toLong
}
))
val merge = builder.add(Merge[AckTup[String]](2))
//graph
val beforeSplit = source.map(Message.queue(_, queueName)).wrappedRepr ~> AckedFlow(before).wrappedRepr
beforeSplit ~> split
// must do short, then long since the split goes in that order
split ~> AckedFlow(short).wrappedRepr ~> merge
split ~> AckedFlow(long).wrappedRepr ~> merge
// after the last AckedFlow, be sure to '.acked' so that the message will be removed from the queue
merge ~> AckedFlow(after).acked ~> s
ClosedShape
}}
As Stefano Bonetti said, the key was to use the .wrappedRepr associated with the AckedFlow and then to use the .acked combinator as the last step.

Split RDD into RDD's with no repeating values

I have an RDD of pairs, as below:
(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)
I would like to split it into two RDDs such that the first contains only pairs with non-repeating values (across both keys and values) and the second contains the remaining, omitted pairs.
So from the above we could get two RDDs:
1) (105,918)
(502,516)
2) (105,757) - Omitted as 105 is already included in 1st RDD
(105,137) - Omitted as 105 is already included in 1st RDD
(516,816) - Omitted as 516 is already included in 1st RDD
(350,502) - Omitted as 502 is already included in 1st RDD
Currently I am using a mutable Set variable to track the elements already selected after coalescing the original RDD to a single partition :
val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations
.filter(p => {
if(!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
evalCombinations += p._1;evalCombinations += p._2; true
} else
false
})
This approach is limited by the memory of the executor on which the operations run. Is there a better, more scalable solution for this?
Here is a version that requires more memory on the driver.
import org.apache.spark.rdd._
import org.apache.spark._
def getUniq(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {
val keys = rdd.keys.distinct
val values = rdd.values.distinct
// these are the keys which appear in value part also.
val both = keys.intersection(values)
val bBoth = sc.broadcast(both.collect.toSet)
// remove those key-value pairs which have value which is also a key.
val uKeys = rdd.filter(x => !bBoth.value.contains(x._2))
.reduceByKey{ case (v1, v2) => v1 } // keep uniq keys
uKeys.map{ case (k, v) => (v, k) } // swap key, value
.reduceByKey{ case (v1, v2) => v1 } // keep uniq value
.map{ case (k, v) => (v, k) } // correct placement
}
def getPartitionedRDDs(rdd: RDD[(Int, Int)], sc: SparkContext) = {
val r = getUniq(rdd, sc)
val remaining = rdd subtract r
val set = r.flatMap { case (k, v) => Array(k, v) }.collect.toSet
val a = remaining.filter{ case (x, y) => !set.contains(x) &&
!set.contains(y) }
val b = getUniq(a, sc)
val part1 = r union b
val part2 = rdd subtract part1
(part1, part2)
}
val rdd = sc.parallelize(Array((105,918),(105,757),(502,516),
(105,137),(516,816),(350,502)))
val (first, second) = getPartitionedRDDs(rdd, sc)
// first.collect: ((516,816), (105,918), (350,502))
// second.collect: ((105,137), (502,516), (105,757))
val rdd1 = sc.parallelize(Array((839,841),(842,843),(840,843),
(839,840),(1,2),(1,3),(4,3)))
val (f, s) = getPartitionedRDDs(rdd1, sc)
//f.collect: ((839,841), (1,2), (840,843), (4,3))
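To sanity-check results like the ones above on small inputs, the sequential greedy pass from the question can be written against plain collections, together with a check that a candidate first part has no repeated numbers. This is a local sketch that assumes the data fits in memory; the object GreedySplit and its helpers are hypothetical names, not Spark API.

```scala
object GreedySplit {
  type Pair = (Int, Int)

  // Sequential greedy pass: keep a pair only if neither of its numbers
  // has already been used by a previously kept pair.
  def split(pairs: Seq[Pair]): (Vector[Pair], Vector[Pair]) = {
    val (kept, omitted, _) =
      pairs.foldLeft((Vector.empty[Pair], Vector.empty[Pair], Set.empty[Int])) {
        case ((k, o, seen), p @ (a, b)) =>
          if (seen(a) || seen(b)) (k, o :+ p, seen)
          else (k :+ p, o, seen + a + b)
      }
    (kept, omitted)
  }

  // True when no number occurs more than once across all keys and values.
  // A pair whose key equals its value counts as a repeat here.
  def hasNoRepeats(pairs: Seq[Pair]): Boolean = {
    val all = pairs.flatMap { case (a, b) => Seq(a, b) }
    all.distinct.size == all.size
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq((105, 918), (105, 757), (502, 516), (105, 137), (516, 816), (350, 502))
    val (first, second) = split(sample)
    println(first)  // Vector((105,918), (502,516))
    println(second) // Vector((105,757), (105,137), (516,816), (350,502))
    // The Spark version above returns a different but also repeat-free first part.
    println(hasNoRepeats(Seq((516, 816), (105, 918), (350, 502)))) // true
  }
}
```

The greedy result depends on the order of the input, which is why the Spark version, whose ordering is not fixed, can legitimately return a different first part for the same data.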