Scala RestartSink Future

I'm trying to re-create functionality similar to Akka Streams' [RestartSink][1].
I've come up with the code below. However, since I only return a SinkShape instead of a Sink, I'm having trouble specifying that the stage should materialize a Future[Done] instead of NotUsed. I'm only able to have it return [MessageActionPair, NotUsed] instead of the desired [MessageActionPair, Future[Done]]. I'm still learning my way around this framework, so I'm sure I'm missing something small. I tried calling Source.toMat(RestartWithBackoffSink...), but that doesn't give the desired result either.
private final class RestartWithBackoffSink(
    sourcePool: Seq[SqsEndpoint],
    minBackoff: FiniteDuration,
    maxBackoff: FiniteDuration,
    randomFactor: Double) extends GraphStage[SinkShape[MessageActionPair]] { self ⇒

  val in = Inlet[MessageActionPair]("RoundRobinRestartWithBackoffSink.in")

  override def shape = SinkShape(in)

  override def createLogic(inheritedAttributes: Attributes) = new RestartWithBackoffLogic(
    "Sink", shape, minBackoff, maxBackoff, randomFactor, onlyOnFailures = false) {

    override protected def logSource = self.getClass

    override protected def startGraph() = {
      val sourceOut = createSubOutlet(in)
      Source.fromGraph(sourceOut.source).runWith(createSink(getEndpoint))(subFusingMaterializer)
    }

    override protected def backoff() = {
      setHandler(in, new InHandler {
        override def onPush() = ()
      })
    }

    private def createSink(endpoint: SqsEndpoint): Sink[MessageActionPair, Future[Done]] = {
      SqsAckSink(endpoint.queue.url)(endpoint.client)
    }

    def getEndpoint: SqsEndpoint = {
      if (isTimedOut) {
        index = (index + 1) % sourcePool.length
        restartCount = 0
      }
      sourcePool(index)
    }

    backoff()
  }
}
This fails to compile, since the types don't match:
def withBackoff[T](minBackoff: FiniteDuration, maxBackoff: FiniteDuration, randomFactor: Double, sourcePool: Seq[SqsEndpoint]): Sink[MessageActionPair, Future[Done]] = {
  Sink.fromGraph(new RestartWithBackoffSink(sourcePool, minBackoff, maxBackoff, randomFactor))
}

By extending GraphStage[SinkShape[MessageActionPair]] you are defining a stage with no materialized value, or more precisely, a stage that materializes to NotUsed.
You have to decide whether your stage can materialize into anything meaningful. More on materialized values for stages here.
If so: you have to extend GraphStageWithMaterializedValue[SinkShape[MessageActionPair], Future[Done]] and properly override the createLogicAndMaterializedValue function. More guidance can be found in the docs; a minimal sketch follows the snippet below.
If not: you can change your types as per below.
def withBackoff[T](minBackoff: FiniteDuration, maxBackoff: FiniteDuration, randomFactor: Double, sourcePool: Seq[SqsEndpoint]): Sink[MessageActionPair, NotUsed] = {
  Sink.fromGraph(new RestartWithBackoffSink(sourcePool, minBackoff, maxBackoff, randomFactor))
}
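If you do want the Future[Done] materialized value, here is a minimal, self-contained sketch of the GraphStageWithMaterializedValue approach. The stage name and the promise-based completion wiring are my own assumptions, not the RestartSink internals; it simply consumes elements and completes the future when the stream ends.

import akka.Done
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}

// Sketch only: a sink stage that materializes a Future[Done], completed when
// the upstream finishes and failed when the upstream fails.
final class MySinkStage[A] extends GraphStageWithMaterializedValue[SinkShape[A], Future[Done]] {
  val in: Inlet[A] = Inlet[A]("MySinkStage.in")
  override val shape: SinkShape[A] = SinkShape(in)

  override def createLogicAndMaterializedValue(inheritedAttributes: Attributes): (GraphStageLogic, Future[Done]) = {
    val done = Promise[Done]()
    val logic = new GraphStageLogic(shape) {
      override def preStart(): Unit = pull(in)
      setHandler(in, new InHandler {
        override def onPush(): Unit = { grab(in); pull(in) } // consume and request the next element
        override def onUpstreamFinish(): Unit = { done.trySuccess(Done); completeStage() }
        override def onUpstreamFailure(ex: Throwable): Unit = { done.tryFailure(ex); failStage(ex) }
      })
    }
    (logic, done.future)
  }
}

Sink.fromGraph(new MySinkStage[MessageActionPair]) then has the desired type Sink[MessageActionPair, Future[Done]].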

Related

GraphStage with shape of 2-in 2-out

I need to write a custom GraphStage that has two input ports, and two output ports. This GraphStage will allow two otherwise independent flows to affect each other. What shape could I use for that? FanOutShape2 has two outputs and FanInShape2 has two inputs, but how can I have a shape that has both? Somehow combine (inherit from) both? Use BidiFlow? Make my own?
Answering this myself, since this has been solved by the helpful guys on discuss.lightbend.com, see https://discuss.lightbend.com/t/graphstage-with-shape-of-2-in-and-2-out/4160/3
The answer to this question is to simply use BidiShape. Despite the suggestive name, the logic behind a BidiShape by no means has to be bi-directional (it's obvious in retrospect, but I was thrown off by this).
Some code that can be used for reference if anybody is in a similar situation, where they have to do something based on two inputs, with the possibility to push to two outputs:
class BiNoneCounter[T]() extends GraphStage[BidiShape[Option[T], Option[Int], Option[T], Option[Int]]] {
  private val leftIn = Inlet[Option[T]]("BiNoneCounter.in1")
  private val rightIn = Inlet[Option[T]]("BiNoneCounter.in2")
  private val leftOut = Outlet[Option[Int]]("BiNoneCounter.out1")
  private val rightOut = Outlet[Option[Int]]("BiNoneCounter.out2")

  override val shape = BidiShape(leftIn, leftOut, rightIn, rightOut)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) {
    private var grabNextPush = false

    val inHandler = new InHandler {
      override def onPush(): Unit = {
        if (grabNextPush) {
          (grab(leftIn), grab(rightIn)) match {
            case (left, right) => // do stuff here
          }
        }
        grabNextPush = !grabNextPush
      }
    }

    val outHandler = (inlet: Inlet[Option[T]]) => new OutHandler {
      override def onPull(): Unit = {
        pull(inlet)
      }
    }

    setHandler(leftOut, outHandler(leftIn))
    setHandler(rightOut, outHandler(rightIn))
    setHandler(leftIn, inHandler)
    setHandler(rightIn, inHandler)
  }
}
Can be used like this:
sourceOne ~> bidi.in1
bidi.out1 ~> sinkOne
sourceTwo ~> bidi.in2
bidi.out2 ~> sinkTwo
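For completeness, the ~> wiring above lives inside a GraphDSL block. A rough sketch of how the stage could be embedded in a runnable graph; the sources, sinks and the Int element type are placeholders of mine:

import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}

object BidiExample extends App {
  implicit val system: ActorSystem = ActorSystem("bidi-example")

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    // Add the custom stage to the graph and wire both of its sides.
    val bidi = b.add(new BiNoneCounter[Int]())

    Source(List(Some(1), None, Some(3))) ~> bidi.in1
    bidi.out1 ~> Sink.foreach[Option[Int]](println)
    Source(List(Some(2), Some(4), None)) ~> bidi.in2
    bidi.out2 ~> Sink.foreach[Option[Int]](println)

    ClosedShape
  })

  graph.run() // on Akka 2.6+ the implicit ActorSystem provides the materializer
}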

How to unit test BroadcastProcessFunction in flink when processElement depends on broadcasted data

I implemented a Flink stream with a BroadcastProcessFunction. From processBroadcastElement I get my model, and I apply it to my events in processElement.
I can't find a way to unit test my stream, since I haven't found a way to ensure the model is dispatched before the first event.
I would say there are two ways of achieving this:
1. Find a solution so that the model is pushed into the stream first
2. Have the broadcast state filled with the model prior to the execution of the stream, so that it is restored
I may have missed something, but I have not found a simple way to do this.
Here is a simple unit test with my issue:
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import org.scalatest.Matchers._
import org.scalatest.{BeforeAndAfter, FunSuite}

import scala.collection.mutable

class BroadCastProcessor extends BroadcastProcessFunction[Int, (Int, String), String] {
  import BroadCastProcessor._

  override def processElement(value: Int,
                              ctx: BroadcastProcessFunction[Int, (Int, String), String]#ReadOnlyContext,
                              out: Collector[String]): Unit = {
    val broadcastState = ctx.getBroadcastState(broadcastStateDescriptor)
    if (broadcastState.contains(value)) {
      out.collect(broadcastState.get(value))
    }
  }

  override def processBroadcastElement(value: (Int, String),
                                       ctx: BroadcastProcessFunction[Int, (Int, String), String]#Context,
                                       out: Collector[String]): Unit = {
    ctx.getBroadcastState(broadcastStateDescriptor).put(value._1, value._2)
  }
}

object BroadCastProcessor {
  val broadcastStateDescriptor: MapStateDescriptor[Int, String] =
    new MapStateDescriptor[Int, String]("int_to_string", classOf[Int], classOf[String])
}

class CollectSink extends SinkFunction[String] {
  import CollectSink._

  override def invoke(value: String): Unit = {
    values += value
  }
}

object CollectSink { // must be static
  val values: mutable.MutableList[String] = mutable.MutableList[String]()
}

class BroadCastProcessTest extends FunSuite with BeforeAndAfter {
  before {
    CollectSink.values.clear()
  }

  test("add_elem_to_broadcast_and_process_should_apply_broadcast_rule") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val dataToProcessStream = env.fromElements(1)
    val ruleToBroadcastStream = env.fromElements(1 -> "1", 2 -> "2", 3 -> "3")
    val broadcastStream = ruleToBroadcastStream.broadcast(BroadCastProcessor.broadcastStateDescriptor)

    dataToProcessStream
      .connect(broadcastStream)
      .process(new BroadCastProcessor)
      .addSink(new CollectSink())

    // execute
    env.execute()

    CollectSink.values should contain("1")
  }
}
Update thanks to David Anderson
I went for the buffer solution. I defined a process function for the synchronization:
class SynchronizeModelAndEvent(modelNumberToWaitFor: Int) extends CoProcessFunction[Int, (Int, String), Int] {
  val eventBuffer: mutable.MutableList[Int] = mutable.MutableList[Int]()
  var modelEventsNumber = 0

  override def processElement1(value: Int, ctx: CoProcessFunction[Int, (Int, String), Int]#Context, out: Collector[Int]): Unit = {
    if (modelEventsNumber < modelNumberToWaitFor) {
      eventBuffer += value
      return
    }
    out.collect(value)
  }

  override def processElement2(value: (Int, String), ctx: CoProcessFunction[Int, (Int, String), Int]#Context, out: Collector[Int]): Unit = {
    modelEventsNumber += 1
    if (modelEventsNumber >= modelNumberToWaitFor) {
      eventBuffer.foreach(event => out.collect(event))
    }
  }
}
And so I need to add it to my stream:
dataToProcessStream
  .connect(ruleToBroadcastStream)
  .process(new SynchronizeModelAndEvent(3))
  .connect(broadcastStream)
  .process(new BroadCastProcessor)
  .addSink(new CollectSink())
Thanks
There isn't an easy way to do this. You could have processElement buffer all of its input until the model has been received by processBroadcastElement. Or run the job once with no event traffic and take a savepoint once the model has been broadcast. Then restore that savepoint into the same job, but with its event input connected.
By the way, the capability you are looking for is often referred to as "side inputs" in the Flink community.
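A sketch of the first suggestion (buffering inside the BroadcastProcessFunction itself) could look like the following. The class name is mine, it reuses the broadcastStateDescriptor from the test above, and it assumes that a single broadcast element means the model has arrived and that an operator-local buffer is good enough for a unit test; a production version would keep the buffer in checkpointed state.

import org.apache.flink.api.common.state.ReadOnlyBroadcastState
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector
import scala.collection.mutable

class BufferingBroadCastProcessor extends BroadcastProcessFunction[Int, (Int, String), String] {
  import BroadCastProcessor._

  // Events seen before the model arrived; local to the subtask, fine for a test.
  private val pending = mutable.MutableList[Int]()
  private var modelReceived = false

  override def processElement(value: Int,
                              ctx: BroadcastProcessFunction[Int, (Int, String), String]#ReadOnlyContext,
                              out: Collector[String]): Unit = {
    if (!modelReceived) pending += value // hold events back until the model is there
    else emit(value, ctx.getBroadcastState(broadcastStateDescriptor), out)
  }

  override def processBroadcastElement(value: (Int, String),
                                       ctx: BroadcastProcessFunction[Int, (Int, String), String]#Context,
                                       out: Collector[String]): Unit = {
    val state = ctx.getBroadcastState(broadcastStateDescriptor)
    state.put(value._1, value._2)
    modelReceived = true
    pending.foreach(emit(_, state, out)) // flush everything buffered so far
    pending.clear()
  }

  private def emit(value: Int, state: ReadOnlyBroadcastState[Int, String], out: Collector[String]): Unit =
    if (state.contains(value)) out.collect(state.get(value))
}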
Thanks to David Anderson and Matthieu, I wrote this generic CoProcessFunction that applies the requested delay to the event stream:
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.util.Collector
import scala.collection.mutable

class SynchronizeEventsWithRules[A, B](rulesToWait: Int) extends CoProcessFunction[A, B, A] {
  val eventBuffer: mutable.MutableList[A] = mutable.MutableList[A]()
  var processedRules = 0

  override def processElement1(value: A, ctx: CoProcessFunction[A, B, A]#Context, out: Collector[A]): Unit = {
    if (processedRules < rulesToWait) {
      println("1 item buffered")
      println(rulesToWait + "--" + processedRules)
      eventBuffer += value
      return
    }
    eventBuffer.clear()
    println("send input to output without buffering:")
    out.collect(value)
  }

  override def processElement2(value: B, ctx: CoProcessFunction[A, B, A]#Context, out: Collector[A]): Unit = {
    processedRules += 1
    println("1 rule processed, processedRules: " + processedRules)
    if (processedRules >= rulesToWait && eventBuffer.length > 0) {
      println("send buffered data to output")
      eventBuffer.foreach(event => out.collect(event))
      eventBuffer.clear()
    }
  }
}
Unfortunately, it does not help in my case, because the subject under test is a KeyedBroadcastProcessFunction, which makes delaying the event data irrelevant. Because of that I also tried applying a flatMap that makes the rule stream n times larger (n being the number of CPUs), so that the resulting event stream would always be in sync with the rule stream and arrive after it, but that did not help either.
In the end I came to this simple solution. It is of course not deterministic, but given the nature of parallelism and concurrency, the problem itself is not deterministic either.
If we set delayMilis large enough (> 100), the result is deterministic:
val delayMilis = 100
val synchronizedInput = inputEventStream.map(x => {
  Thread.sleep(delayMilis)
  x
}).keyBy(_.someKey)
You can also change the mapping function to the following, to apply the delay only to the first element:
package util

import org.apache.flink.api.common.functions.MapFunction

class DelayEvents[T](delayMilis: Int) extends MapFunction[T, T] {
  var delayed = false

  override def map(value: T): T = {
    if (!delayed) {
      delayed = true
      Thread.sleep(delayMilis)
    }
    value
  }
}

val delayMilis = 100
val synchronizedInput = inputEventStream.map(new DelayEvents(delayMilis)).keyBy(_.someKey)

Use config value from MainActor class in other class

I'm using Akka in my project and pull config values in my MainActor class. I want to be able to use the commit, author, tag and buildId values inside another file in order to build an Avro response, but I can't simply make MainActor the parent class of my Avro response interface. Is there a workaround?
My MainActor class
class MainActor extends Actor with ActorLogging with ConfigComponent with ExecutionContextComponent with DatabaseComponent with DefaultCustomerProfiles {
  override lazy val config: Config = context.system.settings.config
  override implicit lazy val executionContext: ExecutionContext = context.dispatcher
  override val db: Database = Database.fromConfig(config.getConfig("com.ojolabs.customer-profile.database"))

  private val avroServer = context.watch {
    val binding = ReflectiveBinding[CustomerService.Async](customerProfileManager)
    val host = config.getString("com.ojolabs.customer-profile.avro.bindAddress")
    val port = config.getInt("com.ojolabs.customer-profile.avro.port")
    context.actorOf(AvroServer.socketServer(binding, host, port))
  }

  val commit = config.getString("com.ojolabs.customer-profile.version.commit")
  val author = config.getString("com.ojolabs.customer-profile.version.author")
  val tag = config.getString("com.ojolabs.customer-profile.version.tag")
  val buildId = config.getString("com.ojolabs.customer-profile.version.buildId")

  override def postStop(): Unit = {
    db.close()
    super.postStop()
  }

  // This toplevel actor does nothing by default
  override def receive: Receive = Actor.emptyBehavior
}
The class I want to pull values into
trait DefaultCustomerProfiles extends CustomerProfilesComponent {
  self: DatabaseComponent with ExecutionContextComponent =>

  lazy val customerProfileManager = new CustomerService.Async {
    import db.api._

    override def customerById(id: String): Future[AvroCustomer] = {
      db.run(Customers.byId(UUID.fromString(id)).result.headOption)
        .map(_.map(AvroConverters.toAvroCustomer).orNull)
    }

    override def customerByPhone(phoneNumber: String): Future[AvroCustomer] = {
      db.run(Customers.byPhoneNumber(phoneNumber).result.headOption)
        .map(_.map(AvroConverters.toAvroCustomer).orNull)
    }

    override def findOrCreate(phoneNumber: String, creationReason: String): Future[AvroCustomer] = {
      db.run(Customers.findOrCreate(phoneNumber, creationReason)).map(AvroConverters.toAvroCustomer)
    }

    override def createEvent(customerId: String, eventType: String, version: Double, data: String, metadata: String): Future[AvroCustomerEvent] = {
      val action = CustomerEvents.create(
        UUID.fromString(customerId),
        eventType,
        Json.parse(data),
        version,
        Json.parse(metadata)
      )
      db.run(action).map(AvroConverters.toAvroEvent)
    }

    override def getVersion(): Version = {
      // This is where I want to build the version response from the config values
    }
  }
}
Create another trait that defines the values, and mix it in with your MainActor and DefaultCustomerProfiles traits.
trait AvroConfig {
  self: ConfigComponent =>

  val commit = config.getString("com.ojolabs.customer-profile.version.commit")
  val author = config.getString("com.ojolabs.customer-profile.version.author")
  val tag = config.getString("com.ojolabs.customer-profile.version.tag")
  val buildId = config.getString("com.ojolabs.customer-profile.version.buildId")
}
I think what you really need is an Akka Extension, which enables you to add features, like custom config, to your Akka system in an elegant way. This way, you would have access to those config values within all your actors from the actor system. As an example, check out this nice blog post.
As for the other class from your example, you should pass the values in as parameters - it shouldn't be concerned with retrieving and parsing the config itself.
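A minimal sketch of such an extension, reusing the config paths from the question (the VersionInfo name is my own; the wiring follows the standard Akka Extension pattern):

import akka.actor.{ExtendedActorSystem, Extension, ExtensionId, ExtensionIdProvider}
import com.typesafe.config.Config

// Sketch: an Akka Extension that exposes the version values to every actor in the system.
class VersionInfo(config: Config) extends Extension {
  val commit: String = config.getString("com.ojolabs.customer-profile.version.commit")
  val author: String = config.getString("com.ojolabs.customer-profile.version.author")
  val tag: String = config.getString("com.ojolabs.customer-profile.version.tag")
  val buildId: String = config.getString("com.ojolabs.customer-profile.version.buildId")
}

object VersionInfo extends ExtensionId[VersionInfo] with ExtensionIdProvider {
  override def lookup = VersionInfo
  override def createExtension(system: ExtendedActorSystem) = new VersionInfo(system.settings.config)
}

Inside any actor you can then write VersionInfo(context.system).commit, and elsewhere VersionInfo(system).commit, without tying the class to MainActor.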

Asynchronous Iterable over remote data

There is some data that I have pulled from a remote API, for which I use a Future-style interface. The data is structured as a linked-list. A relevant example data container is shown below.
case class Data(information: Int) {
  def hasNext: Boolean = ??? // Implemented
  def next: Future[Data] = ??? // Implemented
}
Now I'm interested in adding some functionality to the data class, such as map, foreach, reduce, etc. To do so I want to implement some form of IterableLike such that it inherits these methods.
Given below is the trait Data may extend, such that it gets this property.
trait AsyncIterable[+T]
  extends IterableLike[Future[T], AsyncIterable[T]]
{
  def hasNext: Boolean
  def next: Future[T]

  // How to implement?
  override def iterator: Iterator[Future[T]] = ???
  override protected[this] def newBuilder: mutable.Builder[Future[T], AsyncIterable[T]] = ???
  override def seq: TraversableOnce[Future[T]] = ???
}
It should be a non-blocking implementation, which when acted on, starts requesting the next data from the remote data source.
It is then possible to do cool stuff such as
case class Data(information: Int) extends AsyncIterable[Data]

val data = Data(1) // And more, of course

// Asynchronously print all the information.
data.foreach(data => println(data.information))
It is also acceptable for the interface to be different. But the result should in some way represent asynchronous iteration over the collection. Preferably in a way that is familiar to developers, as it will be part of an (open source) library.
In production I would use one of the following:
Akka Streams
Reactive Extensions
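For instance, with Akka Streams the linked list from the question could be exposed as a Source via Source.unfoldAsync. This is only a sketch assuming the Data class above; the dataSource name is mine:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import scala.concurrent.Future

def dataSource(first: Data)(implicit system: ActorSystem): Source[Data, NotUsed] = {
  import system.dispatcher
  Source.unfoldAsync[Option[Data], Data](Some(first)) {
    case Some(d) if d.hasNext => d.next.map(n => Some((Some(n), d))) // emit d, remember its successor
    case Some(d)              => Future.successful(Some((None, d)))  // emit the last element
    case None                 => Future.successful(None)             // end of stream
  }
}

// Usage: dataSource(Data(1)).runForeach(d => println(d.information))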
For private tests I would implement something similar to the following (explanations are below).
I have modified your Data a little bit:
abstract class AsyncIterator[T] extends Iterator[Future[T]] {
  def hasNext: Boolean
  def next(): Future[T]
}
For it we can implement this Iterable:
class AsyncIterable[T](sourceIterator: AsyncIterator[T])
  extends IterableLike[Future[T], AsyncIterable[T]]
{
  private def stream(): Stream[Future[T]] =
    if (sourceIterator.hasNext) { sourceIterator.next #:: stream() } else { Stream.empty }

  val asStream = stream()

  override def iterator = asStream.iterator
  override def seq = asStream.seq
  override protected[this] def newBuilder = throw new UnsupportedOperationException()
}
And we can see it in action using the following code:
object Example extends App {
  val source = "Hello World!"

  val iterator1 = new DelayedIterator[Char](100L, source.toCharArray)
  new AsyncIterable(iterator1).foreach(_.foreach(print)) // prints 1 char per 100 ms

  pause(2000L)

  val iterator2 = new DelayedIterator[String](100L, source.toCharArray.map(_.toString))
  new AsyncIterable(iterator2).reduceLeft((fl: Future[String], fr) =>
    for (l <- fl; r <- fr) yield { println(s"$l+$r"); l + r }) // prints 1 line per 100 ms

  pause(2000L)

  def pause(duration: Long) = { println("->"); Thread.sleep(duration); println("\n<-") }
}

class DelayedIterator[T](delay: Long, data: Seq[T]) extends AsyncIterator[T] {
  private val dataIterator = data.iterator
  private var nextTime = System.currentTimeMillis() + delay

  override def hasNext = dataIterator.hasNext

  override def next = {
    val thisTime = math.max(System.currentTimeMillis(), nextTime)
    val thisValue = dataIterator.next()
    nextTime = thisTime + delay
    Future {
      val now = System.currentTimeMillis()
      if (thisTime > now) Thread.sleep(thisTime - now) // Your implementation will be better
      thisValue
    }
  }
}
Explanation
AsyncIterable uses Stream because it is calculated lazily and it is simple.
Pros:
- simplicity
- multiple calls to the iterator and seq methods return the same iterable with all items
Cons:
- could lead to memory overflow, because the stream keeps all previously obtained values
- the first value is fetched eagerly during the creation of AsyncIterable
DelayedIterator is a very simplistic implementation of AsyncIterator; don't blame me for the quick and dirty code here.
It still seems strange to me to have a synchronous hasNext and an asynchronous next().
Using Twitter's Spool I've implemented a working example. To implement the spool I modified the example in the documentation.
import com.twitter.concurrent.Spool
import com.twitter.util.{Await, Return, Promise}
import scala.concurrent.{ExecutionContext, Future}

trait AsyncIterable[+T <: AsyncIterable[T]] { self: T =>
  def hasNext: Boolean
  def next: Future[T]

  def spool(implicit ec: ExecutionContext): Spool[T] = {
    def fill(currentPage: Future[T], rest: Promise[Spool[T]]) {
      currentPage foreach { cPage =>
        if (hasNext) {
          val nextSpool = new Promise[Spool[T]]
          rest() = Return(cPage *:: nextSpool)
          fill(next, nextSpool)
        } else {
          val emptySpool = new Promise[Spool[T]]
          emptySpool() = Return(Spool.empty[T])
          rest() = Return(cPage *:: emptySpool)
        }
      }
    }

    val rest = new Promise[Spool[T]]
    if (hasNext) {
      fill(next, rest)
    } else {
      rest() = Return(Spool.empty[T])
    }
    self *:: rest
  }
}
Data is the same as before, and now we can use it.
// Cool stuff
implicit val ec = scala.concurrent.ExecutionContext.global
val data = Data(1) // And others

// Print all the information asynchronously
val fut = data.spool.foreach(data => println(data.information))
Await.ready(fut)
It will throw an exception on the second element, because the implementation of next was not provided.

Override how a BsonRecord fills in its fields

I would like to extend the BsonRecord class to handle some of its fields when they are filled in. I'm trying to do it by overriding the setFieldsFrom... methods, but it doesn't seem to work.
Here is the code I have:
trait NodeBsonRecord[MyType <: BsonRecord[MyType]] extends BsonRecord[MyType] {
  self: MyType =>

  override def setFieldsFromDBObject(dbo: DBObject) = {
    super.setFieldsFromDBObject(dbo)
    println("setFieldsFromDBObject")
  }

  override def setFieldsFromJSON(json: String) = {
    val out = super.setFieldsFromJSON(json)
    println("setFieldsFromJSON")
    out
  }

  override def setFieldsFromJsonString(json: String) = {
    val out = super.setFieldsFromJsonString(json)
    println("setFieldsFromJsonString")
    out
  }

  override def setFieldsFromJValue(jval: JValue) = {
    val out = super.setFieldsFromJValue(jval)
    println("setFieldsFromJValue")
    out
  }

  override def setFieldsFromReq(req: Req) = {
    val out = super.setFieldsFromReq(req)
    println("setFieldsFromReq")
    out
  }
}
So when I request a record (using MongoRecord.find()), I expect to see one of the "setFieldsFrom..." messages printed, but nothing is printed out.
Can anybody tell me how to do this?
Mongo seems to use setFieldsFromDBObject in BsonMetaRecord (the meta object) as part of find; it iterates through each field and calls setFromAny, so the record-level setFieldsFrom... overrides above are never hit.
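If that is the case, hooking in at the meta level might work instead. This is only a sketch under that assumption; the Node names are hypothetical, and the exact signature of the meta-level setFieldsFromDBObject should be checked against your Lift version:

// Hypothetical record using the trait from the question.
class Node extends BsonRecord[Node] with NodeBsonRecord[Node] {
  def meta = Node
}

object Node extends Node with BsonMetaRecord[Node] {
  // Assumption: find() goes through this meta-level method, so the hook fires here.
  override def setFieldsFromDBObject(inst: Node, dbo: DBObject): Unit = {
    super.setFieldsFromDBObject(inst, dbo)
    println("meta-level setFieldsFromDBObject")
  }
}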