Akka Streams custom merge - scala

I'm new to akka-streams and not sure how to approach this problem.
I have 3 source streams that are sorted by a sequence ID. I want to group together the values that have the same ID. Values in each stream may be missing or duplicated. If one stream is a faster producer than the rest, it should be backpressured.
case class A(id: Int)
case class B(id: Int)
case class C(id: Int)
case class Merged(as: List[A], bs: List[B], cs: List[C])
import akka.stream._
import akka.stream.scaladsl._
val as = Source(List(A(1), A(2), A(3), A(4), A(5)))
val bs = Source(List(B(1), B(2), B(3), B(4), B(5)))
val cs = Source(List(C(1), C(1), C(3), C(4)))
val merged = ???
// value 1: Merged(List(A(1)), List(B(1)), List(C(1), C(1)))
// value 2: Merged(List(A(2)), List(B(2)), Nil)
// value 3: Merged(List(A(3)), List(B(3)), List(C(3)))
// value 4: Merged(List(A(4)), List(B(4)), List(C(4)))
// value 5: Merged(List(A(5)), List(B(5)), Nil)
// (end of stream)
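For what it's worth, since the inputs are already sorted by ID, one way to approach this without a custom stage is to wrap the three element types in a common envelope, merge the streams in ID order with the built-in mergeSorted, and group consecutive elements that share an ID with statefulMapConcat. mergeSorted is an ordinary backpressured fan-in, so a faster producer gets slowed down. This is only a sketch I have not run; the Elem wrapper, the End sentinel, and the grouping logic are illustrative additions, not from the original post:

import akka.NotUsed

// Hypothetical envelope so the three streams share one element type.
sealed trait Elem { def id: Int }
final case class WrapA(a: A) extends Elem { def id = a.id }
final case class WrapB(b: B) extends Elem { def id = b.id }
final case class WrapC(c: C) extends Elem { def id = c.id }
// Sentinel with the largest possible ID, used only to flush the final group.
case object End extends Elem { def id = Int.MaxValue }

implicit val byId: Ordering[Elem] = Ordering.by(_.id)

val ea: Source[Elem, NotUsed] = as.map(WrapA(_))
val eb: Source[Elem, NotUsed] = bs.map(WrapB(_))
val ec: Source[Elem, NotUsed] = cs.map(WrapC(_))

val merged: Source[Merged, NotUsed] =
  ea.mergeSorted(eb).mergeSorted(ec)   // still sorted by id, all three inputs backpressured
    .concat(Source.single(End))        // so the last group is emitted too
    .statefulMapConcat { () =>
      var buffer = List.empty[Elem]    // elements of the group currently being collected
      elem =>
        if (buffer.nonEmpty && buffer.head.id != elem.id) {
          // the ID changed: emit the completed group and start a new one
          val group = buffer.reverse
          buffer = if (elem == End) Nil else List(elem)
          List(Merged(
            group.collect { case WrapA(a) => a },
            group.collect { case WrapB(b) => b },
            group.collect { case WrapC(c) => c }))
        } else {
          if (elem != End) buffer = elem :: buffer
          Nil
        }
    }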

This question is old, but I was trying to find a solution for it and only found pointers in the right direction on the Lightbend forum, not a working example. So I decided to implement one and post my example here.
I created 3 sources, sourceA, sourceB, and sourceC, which emit events at different speeds using .throttle(). Then I created a RunnableGraph where I merge the sources using Merge and send the output to my WindowGroupEventFlow Flow, which groups events over a window of a fixed number of elements. This is the graph:
sourceA ~> mergeShape.in(0)
sourceB ~> mergeShape.in(1)
sourceC ~> mergeShape.in(2)
mergeShape.out ~> windowFlowShape ~> sinkShape
The classes that I am using on the sources are these:
object Domain {
sealed abstract class Z(val id: Int, val value: String)
case class A(override val id: Int, override val value: String = "A") extends Z(id, value)
case class B(override val id: Int, override val value: String = "B") extends Z(id, value)
case class C(override val id: Int, override val value: String = "C") extends Z(id, value)
case class ABC(override val id: Int, override val value: String) extends Z(id, value)
}
and this is the WindowGroupEventFlow Flow that I created to group the events:
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import scala.collection.mutable

// step 0: define the shape
class WindowGroupEventFlow(maxBatchSize: Int) extends GraphStage[FlowShape[Domain.Z, Domain.Z]] {
// step 1: define the ports and the component-specific members
val in = Inlet[Domain.Z]("WindowGroupEventFlow.in")
val out = Outlet[Domain.Z]("WindowGroupEventFlow.out")
// step 3: create the logic
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
// mutable state
val batch = new mutable.Queue[Domain.Z]
var count = 0
// step 4: implement the grouping logic in the handlers
setHandler(in, new InHandler {
override def onPush(): Unit = {
try {
val nextElement = grab(in)
batch.enqueue(nextElement)
count += 1
// If window finished we have to dequeue all elements
if (count >= maxBatchSize) {
println("************ window finished - dequeuing elements ************")
var result = Map[Int, Domain.Z]()
val list = batch.dequeueAll(_ => true).to[collection.immutable.Iterable]
list.foreach { tuple =>
if (result.contains(tuple.id)) {
val abc = result.get(tuple.id)
val value = abc.get.value + tuple.value
val z: Domain.Z = Domain.ABC(tuple.id, value)
result += (tuple.id -> z)
} else {
val z: Domain.Z = Domain.ABC(tuple.id, tuple.value)
result += (tuple.id -> z)
}
}
val finalResult: collection.immutable.Iterable[Domain.Z] = result.map(p => p._2)
emitMultiple(out, finalResult)
count = 0
} else {
pull(in) // send demand upstream signal, asking for another element
}
} catch {
case e: Throwable => failStage(e)
}
}
})
setHandler(out, new OutHandler {
override def onPull(): Unit = {
pull(in)
}
})
}
// step 2: construct a new shape
override def shape: FlowShape[Domain.Z, Domain.Z] = FlowShape[Domain.Z, Domain.Z](in, out)
}
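For comparison, a roughly equivalent windowed grouping can be sketched with built-in operators instead of a custom GraphStage. This is an untested sketch under the same Domain types, not what the answer above uses:

import akka.NotUsed
import akka.stream.scaladsl.Flow

// Untested sketch: buffer a window of 5 events and concatenate the values of
// the events that share an id, mirroring what the GraphStage above does.
val windowedGrouping: Flow[Domain.Z, Domain.Z, NotUsed] =
  Flow[Domain.Z]
    .grouped(5)                          // one window per 5 incoming events
    .mapConcat { window =>
      window
        .groupBy(_.id)
        .map { case (id, zs) => Domain.ABC(id, zs.map(_.value).mkString) }
        .toList
    }

One behavioural difference: grouped also emits a final, possibly smaller window when the stream completes, whereas the custom stage above keeps a partial last window unemitted.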
and this is how I run everything:
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import scala.concurrent.duration._
import scala.language.postfixOps

object WindowGroupEventFlow {
def main(args: Array[String]): Unit = {
run()
}
def run() = {
implicit val system = ActorSystem("WindowGroupEventFlow")
import Domain._
val sourceA = Source(List(A(1), A(2), A(3), A(1), A(2), A(3), A(1), A(2), A(3), A(1))).throttle(3, 1 second)
val sourceB = Source(List(B(1), B(2), B(1), B(2), B(1), B(2), B(1), B(2), B(1), B(2))).throttle(2, 1 second)
val sourceC = Source(List(C(1), C(2), C(3), C(4))).throttle(1, 1 second)
// Step 1 - set up the fundamentals of the stream graph
val windowRunnableGraph = RunnableGraph.fromGraph(
GraphDSL.create() { implicit builder =>
import GraphDSL.Implicits._
// Step 2 - create shapes
val mergeShape = builder.add(Merge[Domain.Z](3))
val windowEventFlow = Flow.fromGraph(new WindowGroupEventFlow(5))
val windowFlowShape = builder.add(windowEventFlow)
val sinkShape = builder.add(Sink.foreach[Domain.Z](x => println(s"sink: $x")))
// Step 3 - tying up the components
sourceA ~> mergeShape.in(0)
sourceB ~> mergeShape.in(1)
sourceC ~> mergeShape.in(2)
mergeShape.out ~> windowFlowShape ~> sinkShape
// Step 4 - return the shape
ClosedShape
}
)
// run the graph and materialize it
val graph = windowRunnableGraph.run()
}
}
You can see in the output how elements with the same ID get grouped together:
sink: ABC(1,ABC)
sink: ABC(2,AB)
************ window finished - dequeuing elements ************
sink: ABC(3,A)
sink: ABC(1,BA)
sink: ABC(2,CA)
************ window finished - dequeuing elements ************
sink: ABC(2,B)
sink: ABC(3,AC)
sink: ABC(1,BA)
************ window finished - dequeuing elements ************
sink: ABC(2,AB)
sink: ABC(3,A)
sink: ABC(1,BA)

Related

generate list of case class with int field without repeat

I want to generate a List of some class which contains several fields. One of them is of type Int and must not repeat across the list. Could you help me write the code?
I tried the following:
case class Person(name: String, age: Int)
implicit val genPerson: Gen[Person] =
for {
name <- arbitrary[String]
age <- Gen.posNum[Int]
} yield Person(name, age)
implicit val genListOfPerson: Gen[scala.List[Person]] = Gen.listOfN(3, genPerson)
The problem is that I get Person instances with equal ages.
If you're requiring that no two Persons in the generated list have the same age, you can do something like this:
implicit def IntsArb: Arbitrary[Int] = Arbitrary(Gen.choose[Int](0, Int.MaxValue))
implicit val StringArb: Arbitrary[String] = Arbitrary(Gen.listOfN(5, Gen.alphaChar).map(_.mkString))
implicit val PersonGen = Arbitrary(Gen.resultOf(Person.apply _))
implicit val PersonsGen: Arbitrary[List[Person]] =
Arbitrary(Gen.listOfN(3, PersonGen.arbitrary).map { persons =>
val grouped: Map[Int, List[Person]] = persons.groupBy(_.age)
grouped.values.map(_.head).toList // .toList to get back a List; taking head is safe because groupBy never produces empty groups
})
Note that this will return a List with no duplicate ages but there's no guarantee that the list will have size 3 (it is guaranteed that the list will be nonempty, with size at most 3).
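As a quick, hypothetical sanity check of that guarantee, a ScalaCheck property over the generator above could assert exactly this:

import org.scalacheck.Prop.forAll

// Hypothetical property: ages are pairwise distinct and the size is between 1 and 3.
val uniqueAgesProp = forAll(PersonsGen.arbitrary) { persons =>
  persons.map(_.age).distinct.size == persons.size &&
    persons.nonEmpty && persons.size <= 3
}
// uniqueAgesProp.check()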
If having a list of size 3 is important, at the risk of generation failing if the "dice are against you", you can have something like:
def uniqueAges(persons: List[Person], target: Int): Gen[List[Person]] = {
val grouped: Map[Int, List[Person]] = persons.groupBy(_.age)
val uniquelyAged = grouped.values.map(_.head).toList
val n = uniquelyAged.size
if (n == target) Gen.const(uniquelyAged)
else {
val existingAges = grouped.keySet
val genPerson = PersonGen.arbitrary.retryUntil { p => !existingAges(p.age) }
Gen.listOfN(target - n, genPerson)
.flatMap(l => uniqueAges(l, target - n))
.map(_ ++ uniquelyAged)
}
}
implicit val PersonsGen: Arbitrary[List[Person]] =
Arbitrary(Gen.listOfN(3, PersonGen.arbitrary).flatMap(l => uniqueAges(l, 3)))
You can also do it as follows; with ages drawn from the full 0 to Int.MaxValue range, duplicates are extremely unlikely (though not strictly impossible):
implicit def IntsArb: Arbitrary[Int] = Arbitrary(Gen.choose[Int](0, Int.MaxValue))
implicit val StringArb: Arbitrary[String] = Arbitrary(Gen.listOfN(5, Gen.alphaChar).map(_.mkString))
implicit val PersonGen = Arbitrary(Gen.resultOf(Person.apply _))
implicit val PersonsGen: Arbitrary[List[Person]] = Arbitrary(Gen.listOfN(3, PersonGen.arbitrary))
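Another option, offered only as a sketch and not part of either answer above: keep the simple generators and regenerate the whole list until the ages happen to be pairwise distinct. With ages drawn from 0 to Int.MaxValue, retries will be extremely rare:

import org.scalacheck.Gen

// Hypothetical variant: retryUntil discards lists that contain a repeated age.
val genThreeDistinct: Gen[List[Person]] =
  Gen.listOfN(3, PersonGen.arbitrary)
    .retryUntil(ps => ps.map(_.age).distinct.size == ps.size)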

Debug a custom Pipeline Transformer in Flink

I am trying to implement a custom Transformer in Flink following the indications in its documentation, but when I try to execute it, it seems the fit operation is never called. Here is what I've done so far:
class InfoGainTransformer extends Transformer[InfoGainTransformer] {
import InfoGainTransformer._
private[this] var counts: Option[collection.immutable.Vector[Map[Key, Double]]] = None
// here setters for params, as Flink does
}
object InfoGainTransformer {
// ====================================== Parameters =============================================
// ...
// ==================================== Factory methods ==========================================
// ...
// ========================================== Operations =========================================
implicit def fitLabeledVectorInfoGain = new FitOperation[InfoGainTransformer, LabeledVector] {
override def fit(instance: InfoGainTransformer, fitParameters: ParameterMap, input: DataSet[LabeledVector]): Unit = {
val counts = collection.immutable.Vector[Map[Key, Double]]()
input.map {
v =>
v.vector.map {
case (i, value) =>
println("INSIDE!!!")
val key = Key(value, v.label)
val cval = counts(i).getOrElse(key, .0)
counts(i) + (key -> cval)
}
}
}
}
implicit def fitVectorInfoGain[T <: Vector] = new FitOperation[InfoGainTransformer, T] {
override def fit(instance: InfoGainTransformer, fitParameters: ParameterMap, input: DataSet[T]): Unit = {
input
}
}
implicit def transformLabeledVectorsInfoGain = {
new TransformDataSetOperation[InfoGainTransformer, LabeledVector, LabeledVector] {
override def transformDataSet(
instance: InfoGainTransformer,
transformParameters: ParameterMap,
input: DataSet[LabeledVector]): DataSet[LabeledVector] = input
}
}
implicit def transformVectorsInfoGain[T <: Vector : BreezeVectorConverter : TypeInformation : ClassTag] = {
new TransformDataSetOperation[InfoGainTransformer, T, T] {
override def transformDataSet(instance: InfoGainTransformer, transformParameters: ParameterMap, input: DataSet[T]): DataSet[T] = input
}
}
}
Then I tried to use it in two ways:
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures()
val mlr = MultipleLinearRegression()
val gain = InfoGainTransformer().setK(2)
// Construct the pipeline
val pipeline = scaler
.chainTransformer(polyFeatures)
.chainTransformer(gain)
.chainPredictor(mlr)
val r = pipeline.predict(dataSet map (_.vector))
r.print()
And only my transformer:
pipeline.fit(dataSet)
In both cases, when I set a breakpoint inside fitLabeledVectorInfoGain, for example on the line input.map, the debugger stops there; but if I also set a breakpoint inside the nested map, for example below println("INSIDE!!!"), it never stops there.
Does anyone know how I could debug this custom transformer?
It seems it's working now. I think the problem was that I wasn't implementing the FitOperation correctly, because nothing was being saved in the instance state. This is the implementation now:
implicit def fitLabeledVectorInfoGain = new FitOperation[InfoGainTransformer, LabeledVector] {
override def fit(instance: InfoGainTransformer, fitParameters: ParameterMap, input: DataSet[LabeledVector]): Unit = {
// val counts = collection.immutable.Vector[Map[Key, Double]]()
val r = input.map {
v =>
v.vector.foldLeft(Map.empty[Key, Double]) {
case (m, (i, value)) =>
println("INSIDE fit!!!")
val key = Key(value, v.label)
val cval = m.getOrElse(key, .0) + 1.0
m + (key -> cval)
}
}
instance.counts = Some(r)
}
}
Now the debugger hits all the breakpoints correctly and the TransformOperation is also being called.
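A likely reason the breakpoint was never hit in the first version: Flink DataSet transformations are lazy, and the original fit built input.map { ... } and then discarded the result, so that map never became part of an executed plan. One way to verify this while debugging is to force execution of the intermediate DataSet, for example (a hypothetical snippet, assuming the usual org.apache.flink.api.scala._ import is in scope):

// count() (like collect() or print()) is eager: it triggers execution of the
// DataSet program, so breakpoints inside upstream map lambdas are actually reached.
val r = input.map { v => /* per-element logic */ v }
println(s"fit saw ${r.count()} elements")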

akka streams stops after parallelism

I tried to build a small PDF parser with akka-streams (and my still limited understanding of it) and Apache's pdfbox.
One thing I don't really get: the stream stops after exactly the number of elements given as parallelism to mapAsync.
So if a PDF document has 20 pages and the parallelism is set to 5, the first 5 pages get processed and the rest are ignored; if it is set to 20, everything works fine. Does anybody have an idea what I'm doing wrong?
class PdfParser(ws: WSClient, conf: Configuration, parallelism: Int) {
implicit val system = ActorSystem("image-parser")
implicit val materializer = ActorMaterializer()
def documentPages(doc: PDDocument, key: String): Iterator[Page] = {
val pages: util.List[_] = doc.getDocumentCatalog.getAllPages
val pageList = (for {
i ← 0 until pages.size()
page = pages.get(i)
} yield Page(page, s"$key-$i.jpg")).toIterator
pageList
}
val pageToImage: Flow[Page, Image, NotUsed] = Flow[Page].map { p ⇒
val img = p.content.asInstanceOf[PDPage].convertToImage()
Image(img, p.name)
}
val imageToS3: Flow[Image, String, NotUsed] = Flow[Image].mapAsync(parallelism) { i ⇒
val s3 = S3.fromConfiguration(ws, conf)
val bucket = s3.getBucket("elsa-essays")
val baos = new ByteArrayOutputStream()
ImageIO.write(i.content, "jpg", baos)
val res = bucket add BucketFile(i.name, "image/jpeg", baos.toByteArray)
res.map { _ ⇒
"uploaded"
}.recover {
case e: S3Exception ⇒ e.message
}
}
val sink: Sink[String, Future[String]] = Sink.head[String]
def parse(path: Path, key: String): Future[String] = {
val stream: InputStream = new FileInputStream(path.toString)
val doc = PDDocument.load(stream)
val source = Source.fromIterator(() ⇒ documentPages(doc, key))
val runnable: RunnableGraph[Future[String]] = source.via(pageToImage).via(imageToS3).toMat(sink)(Keep.right)
val res = runnable.run()
res.map { s ⇒
doc.close()
stream.close()
s
}
}
}
The problem is in your Sink. Sink.head returns just one element from your materialized stream and then completes. So the question is rather why more than one value gets processed at all when mapAsync(>1) is used in the stream materialization; maybe it's because more than one actor is pushing values downstream.
In any case, change your sink to something like:
val sink: Sink[String, Future[String]] = Sink.fold("")((a, b) => b ++ a)
and it will work.
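If you actually want every per-page result rather than just a signal that the stream finished, a Sink.seq based variant also drains the whole stream. This is only a sketch reusing the names from the question, not part of the original answer:

import akka.stream.scaladsl.{Keep, RunnableGraph, Sink}
import scala.concurrent.Future

// Hypothetical alternative: collect every upload result, so the materialized
// Future only completes once all pages have passed through imageToS3.
val seqSink: Sink[String, Future[Seq[String]]] = Sink.seq[String]
val runnable: RunnableGraph[Future[Seq[String]]] =
  source.via(pageToImage).via(imageToS3).toMat(seqSink)(Keep.right)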

How to avoid to broadcast a large lookup table in Spark

Can you help me avoid broadcasting a large lookup table? I have a table with measurements:
Measurement Value
x1 5.1
x2 8.9
x1 9.1
x3 4.4
x2 2.1
...
And a list of pairs:
P1 P2
x1 x2
x2 x3
...
The task is to get all values for both elements of every pair and feed them into a magic function. This is how I solved it, by broadcasting the large table with the measurements:
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
// create data frames
val dfm = sqc.createDataFrame(measurements)
val dfc = sqc.createDataFrame(candidates)
// broadcast lookup table
val lookup = sc.broadcast(dfm.rdd.map(r => (r(0), r(1))).collect())
// udf: run magic test with every candidate
val magic: ((String, String) => Double) = (c1: String, c2: String) => {
val lt = lookup.value
val c1v = lt.filter(_._1 == c1).map(_._2).map(_.asInstanceOf[Double])
val c2v = lt.filter(_._1 == c2).map(_._2).map(_.asInstanceOf[Double])
new Foo().magic(c1v, c2v)
}
val sq1 = udf(magic)
val dfks = dfc.withColumn("magic", sq1(col("c1"), col("c2")))
As you can guess, I'm not really happy with this solution. For every pair I filter the lookup table twice, which is neither fast nor elegant. I'm using Spark 1.6.1.
An alternative would be to use RDDs and a join. Not sure which is better in terms of performance, though.
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
val rdm = sc.parallelize(measurements).map(r => (r.measurement, r.value)).groupByKey().cache()
val rdc = sc.parallelize(candidates).map(r => (r.c1, r.c2)).cache()
val firstColJoin = rdc.join(rdm).values
val secondColJoin = firstColJoin.join(rdm).values
secondColJoin.map { case (c1v, c2v) => new Foo().magic(c1v, c2v) }
Thank you for all the comments. I read them, did some research, and studied zero323's posts.
My current solution uses two joins and a UserDefinedAggregateFunction:
object GroupValues extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("x", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = ArrayType(DoubleType)
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[Double])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[Double](0) :+ input.getDouble(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0))
}
def evaluate(buffer: Row) = buffer.getSeq[Double](0)
}
// join data for candidate one
val j1 = dfc.join(dfm, dfc("c1") === dfm("measurement"))
// aggregate all c1 values to an array
val j1v = j1.groupBy(col("c1"), col("c2")).agg(GroupValues(col("value"))
.alias("c1-values"))
// join data for candidate two
val j2 = j1v.join(dfm, j1v("c2") === dfm("measurement"))
// aggregate all c2 values to an array
val j2v = j2.groupBy(col("c1"), col("c2"), col("c1-values"))
.agg(GroupValues(col("value")).alias("c2-values"))
The next step would be to use collect_list instead of a UserDefinedAggregateFunction.
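For reference, that collect_list variant might look roughly like this (a sketch I have not run; on Spark 1.6 collect_list still requires a HiveContext):

import org.apache.spark.sql.functions.{col, collect_list}

// Hypothetical sketch of the collect_list version of the two aggregations above.
val j1v = dfc.join(dfm, dfc("c1") === dfm("measurement"))
  .groupBy(col("c1"), col("c2"))
  .agg(collect_list(col("value")).alias("c1-values"))

val j2v = j1v.join(dfm, j1v("c2") === dfm("measurement"))
  .groupBy(col("c1"), col("c2"), col("c1-values"))
  .agg(collect_list(col("value")).alias("c2-values"))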

How to create Source and push elements to it manually?

I want to create a custom StatefulStage which should work like the groupBy method and emit Source[A, Unit] elements, but I don't understand how to create an instance of Source[A, Unit] and push an incoming element to it. Here is a stub:
class GroupBy[A, Mat]() extends StatefulStage[A, Source[A, Unit]] {
override def initial: StageState[A, Source[A, Unit]] = new StageState[A, Source[A, Unit]] {
override def onPush(elem: A, ctx: Context[Source[A, Unit]]): SyncDirective = {
val source: Source[A, Unit] = ... // Need to create source here
// and push `elem` to `source` here
emit(List(source).iterator, ctx)
}
}
}
You can use the following snippet to test the GroupBy flow (it should print events from the created streams):
case class Tick()
case class Event(timestamp: Long, sessionUid: String, traffic: Int)
implicit val system = ActorSystem()
import system.dispatcher
implicit val materializer = ActorMaterializer()
var rnd = Random
rnd.setSeed(1)
val eventsSource = Source
.tick(FiniteDuration(0, SECONDS), FiniteDuration(1, SECONDS), () => Tick)
.map {
case _ => Event(System.currentTimeMillis / 1000, s"session-${rnd.nextInt(5)}", rnd.nextInt(10) * 10)
}
val flow = Flow[Event]
.transform(() => new GroupByUntil)
.map {
(source) => source.runForeach(println)
}
eventsSource
.via(flow)
.runWith(Sink.ignore)
.onComplete(_ => system.shutdown())
Can anybody explain to me how to do it?
UPDATE:
I wrote the following onPush method based on this answer, but it didn't print events. As I understand it, I can push an element to a source only when it is running as part of a flow, but I want to run the flow outside of GroupBy in the test snippet. If I run the flow inside GroupBy as in this example, then it will process events and send them to Sink.ignore. I think this is the reason why my test snippet didn't print events.
override def onPush(elem: A, ctx: Context[Source[A, Unit]]): SyncDirective = {
val source: Source[A, ActorRef] = Source.actorRef[A](1000, OverflowStrategy.fail)
val flow = Flow[A].to(Sink.ignore).runWith(source)
flow ! elem
emit(List(source.asInstanceOf[Source[A, Unit]]).iterator, ctx)
}
So, how do I fix it?
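For anyone looking at this with a newer Akka Streams version: one way to create a Source and push elements to it by hand is Source.queue; materializing it returns a queue handle you can offer elements to. A minimal hypothetical sketch (assumes an implicit ActorSystem/materializer in scope, as in the test snippet above):

import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Keep, Sink, Source}

// Materializing the stream yields the queue handle and the sink's completion future.
val (queue, done) =
  Source.queue[Int](bufferSize = 16, OverflowStrategy.backpressure)
    .toMat(Sink.foreach(println))(Keep.both)
    .run()

queue.offer(1)   // returns a Future[QueueOfferResult]
queue.offer(2)
queue.complete() // completes the source; `done` completes once the sink is finished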