I need to build the following graph:
val graph = getFromTopic1 ~> doSomeWork ~> writeToTopic2 ~> commitOffsetForTopic1
but trying to implement it in Reactive Kafka has me down a rabbit hole. And that seems wrong, because this strikes me as a relatively common use case: I want to move data between Kafka topics while guaranteeing At Least Once Delivery Semantics (ALOS).
Now it's no problem at all to write in parallel:
val fanOut = Broadcast(2)
val graph = getFromTopic1 ~> doSomeWork ~> fanOut ~> writeToTopic2
                                           fanOut ~> commitOffsetForTopic1
This code works because writeToTopic2 can be implemented with ReactiveKafka#publish(..), which returns a Sink. But then I lose ALOS guarantees and thus data when my app crashes.
So what I really need is to write a Flow that writes to a Kafka topic. I have tried using Flow.fromSinkAndSource(..) with a custom GraphStage, but I run up against type issues for the data flowing through; for example, what gets committed in commitOffsetForTopic1 should not be included in writeToTopic2, meaning that I have to keep a wrapper object containing both pieces of data all the way through. But this conflicts with the requirement that writeToTopic2 accept a ProducerMessage[K,V]. My latest attempt to resolve this ran into private and final classes in the reactive-kafka library (extending/wrapping/replacing the underlying SubscriptionActor).
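To make that type conflict concrete, here is a sketch (Envelope and offsetToCommit are my names for illustration, not reactive-kafka API):

// The wrapper has to survive the write stage so the commit stage still sees the offset.
final case class Envelope[K, V](record: ProducerMessage[K, V], offsetToCommit: Long)

// Needed: a Flow that publishes the record and passes the envelope through:
//   Flow[Envelope[K, V], Envelope[K, V], _]
// Available: ReactiveKafka#publish(..), which is only a Sink[ProducerMessage[K, V], _].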
I don't really want to maintain a fork to make this happen. What am I missing? Why is this so hard? Am I somehow trying to build a pathological graph node, or is this use case an oversight... or is there something completely obvious that I have missed in the docs and source code I've been digging through?
Current version is 0.10.1. I can add more detailed information about any of my many attempts upon request.
Related
We are developing a pipeline in Apache Flink (DataStream API) that needs to send its messages to an external system using API calls. Sometimes such an API call will fail; in that case our message needs some extra treatment (and/or a retry).
We had a few options for doing this:
We map() our stream through a function that does the API call and returns its result, so we can act upon failures subsequently (this was my original idea, and why I did this: flink scala map with dead letter queue)
We write a custom sink function that does the same.
However, both options have problems, I think:
With the map() approach I won't be able to get exactly-once (or at-most-once, which would also be fine) semantics, since Flink is free to re-execute pieces of the pipeline after recovering from a crash in order to bring the state up to date.
With the custom sink approach I can't get a stream of failed API calls for further processing: a sink is a dead end from the Flink app's point of view.
Is there a better solution to this problem?
The async I/O operator is designed for this scenario. It's a better starting point than a map.
There's also been recent work to develop a generic async sink; see FLIP-171. This has been merged into master and will be released as part of Flink 1.15.
One of those should be your best way forward. Whatever you do, don't do blocking I/O in your user functions: it causes backpressure and often leads to performance problems and checkpoint failures.
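For illustration, a minimal sketch of the async I/O operator in the Scala DataStream API, assuming the payload is a String, a hypothetical non-blocking callExternalApi client, and inputStream standing for your upstream DataStream[String]; failures are wrapped in Left so they can be routed onward instead of failing the job:

import java.util.concurrent.TimeUnit
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

class ApiCallFunction extends AsyncFunction[String, Either[String, String]] {
  implicit lazy val ec: ExecutionContext = ExecutionContext.global

  override def asyncInvoke(input: String, resultFuture: ResultFuture[Either[String, String]]): Unit =
    callExternalApi(input).onComplete {
      case Success(response) => resultFuture.complete(Iterable(Right(response)))
      case Failure(_)        => resultFuture.complete(Iterable(Left(input))) // keep the failed record for later handling
    }

  // Hypothetical non-blocking client call; replace with your actual API client.
  private def callExternalApi(payload: String): Future[String] = ???
}

// At most 100 in-flight calls, each timing out after 5 seconds.
val enriched = AsyncDataStream.unorderedWait(inputStream, new ApiCallFunction, 5, TimeUnit.SECONDS, 100)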
I started to learn Akka and came across a challenge for which I can't find an easy solution, despite having waded through the documentation and related Stack Overflow questions:
Building on the Client-Side Websocket Support example on the Akka website, I am using as the basis the following code snippet in Scala:
val flow: Flow[Message, Message, Future[Done]] =
  Flow.fromSinkAndSourceMat(printSink, Source.maybe)(Keep.left)

val (upgradeResponse, closed) =
  Http().singleWebSocketRequest(WebSocketRequest("ws://localhost/ws"), flow)
The use case I have is a client (printSink) consuming a continuous stream from the websocket server. The communication is uni-directional only, thus there is no need for a source.
My question is then as follows:
I need to regularly force a re-connection to the websocket server, and for that I need to disconnect first. But for the life of me, I can't find a way to do a simple disconnect.
In a somewhat opposite scenario, I need to keep the websocket connection alive and "swap out" the sink. Is this even possible, i.e. without creating another websocket connection?
For question 1 (forcing a disconnect from the client), this should work:
val flow: Flow[Message, Message, (Future[Done], Promise[Option[Message]])] =
  Flow.fromSinkAndSourceMat(
    printSink,
    Source.maybe
  )(Keep.both)

val (upgradeResponse, (closed, disconnect)) =
  Http().singleWebSocketRequest(WebSocketRequest("ws://localhost/ws"), flow)
disconnect can then be completed with None to disconnect:
disconnect.success(None)
For question 2, my intuition is that that sort of dynamic stream operation would seem to require a custom stream operator (i.e. one level below the Graph DSL and two levels below the "normal" scaladsl/javadsl). I don't have a huge amount of direct experience there, to be honest.
I've been reading about Akka Streams in the last couple of days, and I've been working with Rx libraries in Scala for the last couple of months. To me there seems to be some overlap in what these two libraries have to offer. RxScala was a bit easier to get started with, understand, and use. For example, here is a simple use case where I'm using Scala's Rx library to connect to a Kafka topic and wrap it up in an Observable, so that I can have subscribers getting those messages.
val consumerStream = consumer.createMessageStreamsByFilter(topicFilter(topics), 1, keyDecoder, valueDecoder).head
val observableConsumer = Observable.fromIterator(consumerStream).map(_.message())
This is quite simple and neat. Any clues on how I should get started with Akka Streams? I want to use the same example above, where I emit events from the Source. I will later have a Flow and a Sink; finally, in my main class, I will combine these three to run the application data flow.
Any suggestions?
So here is what I came up with:
val kafkaStreamItr = consumer.createMessageStreamsByFilter(topicFilter(topics), 1, keyDecoder, valueDecoder).head
Source.fromIterator(() => kafkaStreamItr).map(_.message)
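To wire it up later, a minimal sketch of combining the three pieces, assuming the decoded message value is a String (the uppercasing Flow and println Sink are placeholders; on older Akka versions you need an implicit ActorMaterializer in scope, as shown):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("kafka-stream")
implicit val materializer: ActorMaterializer = ActorMaterializer()

val source = Source.fromIterator(() => kafkaStreamItr).map(_.message)
val flow   = Flow[String].map(_.toUpperCase) // placeholder transformation
val sink   = Sink.foreach[String](println)   // placeholder sink

source.via(flow).runWith(sink)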
I am attempting to implement a message processing pipeline using actors. The steps of the pipeline include functions such as reading, filtering, augmentation and, finally, storage into a database.
Something similar to this: http://sujitpal.blogspot.nl/2013/12/akka-content-ingestion-pipeline-part-i.html
The issue is that the reading, filtering and augmentation steps are much faster than the storage step, which results in a congested store actor and an unreliable system.
I am considering the following option: have the store actor pull the processed, ready-to-store messages. Is this a good option? Better suggestions?
Thank you
You may consider several options:
If the order of messages doesn't matter, just execute every storage operation inside a separate actor (or Future). This makes all storage operations run in parallel; I recommend using a separate thread pool for that. If some messages are amendments to others, or participate in the same transaction, you may create a separate actor per messageId/transactionId to avoid pessimistic/optimistic locking problems (don't forget to kill such actors at transaction end or by timeout).
Use bounded mailboxes (back-pressure): new messages from your input are blocked while older ones are still unprocessed (for example, you can block the receiving thread until the message is acknowledged by the last actor in the chain). This moves the responsibility to the source system. It works well with JMS durables: messages are stored reliably on the JMS broker side until your system has finally processed them. See the sketch after this list.
Combine the previous two.
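A minimal sketch of the bounded-mailbox option (the mailbox name, capacity, and timeout are illustrative and must be configured in application.conf):

// In application.conf (illustrative values):
//   bounded-mailbox {
//     mailbox-type = "akka.dispatch.BoundedMailbox"
//     mailbox-capacity = 1000
//     mailbox-push-timeout-time = 10s
//   }
import akka.actor.{Actor, ActorSystem, Props}

class StoreActor extends Actor {
  def receive: Receive = {
    case msg => // write msg to the database here
  }
}

val system = ActorSystem("pipeline")
// Once the queue is full, senders block for up to the push timeout (back-pressure).
val store = system.actorOf(Props[StoreActor].withMailbox("bounded-mailbox"), "store")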
I am using an approach similar to this: Akka Work Pulling Pattern (source code here: WorkPullingPattern.scala). It has the advantage that it works both locally and with Akka Cluster. Plus, the whole approach is fully asynchronous, with no blocking at all.
If your processed "objects" won't all fit into memory, or one of the steps is slow, it is an awesome solution. If you spawn N workers, then N "tasks" will be processed at a time. It might be a good idea to put the "steps" into BalancingPools as well, with parallelism N (or less).
I have no idea whether your processing "pipeline" is sequential or not, but if it is: just a couple of hours ago I developed a type-safe abstraction based on the above plus the Shapeless library. A glimpse at the code, before it was merged with WorkPullingPattern, is here: Pipeline.
It takes any pipeline of functions (of properly matching signatures), spawns them in BalancingPools, creates Workers and links them to a master actor which can be used for scheduling the tasks.
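For reference, a minimal sketch of a BalancingPool of workers (StepWorker is a hypothetical worker actor; the pool shares one mailbox, so idle routees pick up pending tasks):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.BalancingPool

class StepWorker extends Actor {
  def receive: Receive = {
    case task => // process one pipeline step here
  }
}

val system  = ActorSystem("pipeline")
// N = 5 workers pulling from a single shared mailbox.
val workers = system.actorOf(BalancingPool(5).props(Props[StepWorker]), "step-workers")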
The new Akka Streams (still in beta) has back-pressure. It's designed to solve this problem.
You could also use receive pipeline on actors:
import akka.actor.Actor
import akka.contrib.pattern.ReceivePipeline
import akka.contrib.pattern.ReceivePipeline.Inner

class PipelinedActor extends Actor with ReceivePipeline {
  // Increment
  pipelineInner { case i: Int ⇒ Inner(i + 1) }
  // Double
  pipelineInner { case i: Int ⇒ Inner(i * 2) }

  def receive: Receive = { case any ⇒ println(any) }
}

actor ! 5 // prints 12 = (5 + 1) * 2
http://doc.akka.io/docs/akka/2.4/contrib/receive-pipeline.html
It suits your needs best, as you have small pipelining tasks before/after the actor processes the message. It is blocking code, but that is fine for your case, I believe.
I am a new starter with Flink. I have a requirement to read data from Kafka, conditionally enrich the data (if a record belongs to category X) by calling some API, and write it to S3.
I made a hello-world Flink application with the above logic, which works like a charm.
But the API which I am using to enrich doesn't have a 100% uptime SLA, so I need to design something with retry logic.
Following are the options that I found:
Option 1) Retry exponentially until I get a response from the API, but this will block the queue, so I don't like it.
Option 2) Use one more topic (called topic-failure) and publish the message to it if the API is down. This way it won't block the actual main queue. I will need one more worker to process the data from the topic-failure queue. Again, this queue has to be used as a circular queue if the API is down for a long time: read a message from topic-failure, try to enrich it, and if that fails, push it back to topic-failure and consume the next message.
I prefer option 2, but it doesn't look like an easy task to accomplish. Is there any standard Flink approach available for implementing option 2?
This is a rather common problem that occurs when migrating away from microservices. The proper solution would be to have the lookup data also in Kafka, or in some DB that could be integrated into the same Flink application as an additional source.
If you cannot do that (for example, the API is external or the data cannot easily be mapped to a data store), both approaches are viable, and they have different advantages.
1) Will allow you to retain the order of input events. If your downstream application expects ordering, then you need to retry.
2) The common term is dead letter queue (although it is more often used for invalid records). There are two easy ways to integrate it in Flink: either have a separate source, or use a topic pattern/list with one source.
Your topology would look like this:
Kafka Source      -\              Async IO        /-> Filter good -> S3 sink
                    +-> Union ->  with timeout   -+
Kafka Source dead -/             (for API call!)   \-> Filter bad -> Kafka sink dead
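A hedged Scala sketch of that topology (the consumer/sink instances and EnrichFunction are illustrative; EnrichFunction would be an AsyncFunction emitting Right(enriched) on success and Left(original) when the API call fails, along the lines of the async I/O sketch earlier):

import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hypothetical Kafka consumers/sinks; wire in your actual connector instances.
val mainStream: DataStream[String] = env.addSource(mainKafkaConsumer)       // main topic
val deadStream: DataStream[String] = env.addSource(deadLetterKafkaConsumer) // topic-failure

// Union both inputs, then enrich via async I/O with a timeout on the API call.
val results: DataStream[Either[String, String]] =
  AsyncDataStream.unorderedWait(mainStream.union(deadStream), new EnrichFunction, 5, TimeUnit.SECONDS)

results.filter(_.isRight).map(_.right.get).addSink(s3Sink)            // Filter good -> S3 sink
results.filter(_.isLeft).map(_.left.get).addSink(deadLetterKafkaSink) // Filter bad -> Kafka sink dead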