We have a stream of events being pulled from Kafka using fs2-kafka, and we finish processing when the events are newer than a given deadline or the offset is at the end of the partition (this does not matter too much for the question, but it gives a bit of context about the program).
Ideally we would want to log when those conditions are met, but takeWhile and takeThrough require pure functions O => Boolean.
Our stream is:
partitionStream
.takeWhile(shouldProcess(processingDelay, _))
.takeThrough(o => !atLogEndOffset(o, endOffsets))
Obviously we could do a log.info inside shouldProcess and atLogEndOffset, but that would mean side-effecting inside a pure function, something we would prefer not to do.
What approach would be better, without calling the functions twice (once for logging and once for the condition evaluation)?
Thanks!
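One possible direction (a minimal sketch, not part of the question; the Event type, shouldProcess and logInfo below are hypothetical stand-ins, and cats-effect IO is assumed): evaluate the predicate once per element inside evalMap, log only when it flips to false, and cut the stream on the cached Boolean, so nothing is computed twice and no side effect hides inside a pure predicate. The takeThrough condition on endOffsets could be handled the same way.

import cats.effect.IO
import fs2.Stream

object TakeWhileLogged {
  // Hypothetical stand-ins for the question's event type, predicate and logger.
  final case class Event(ts: Long)
  def shouldProcess(deadline: Long, e: Event): Boolean = e.ts < deadline
  def logInfo(msg: String): IO[Unit] = IO(println(msg))

  // Evaluate the predicate once per element, log only when it becomes false, and use
  // the cached Boolean to cut the stream: no second evaluation needed.
  def takeWhileLogged(deadline: Long)(in: Stream[IO, Event]): Stream[IO, Event] =
    in.evalMap { e =>
        val keep = shouldProcess(deadline, e)
        (if (keep) IO.unit else logInfo(s"deadline reached at $e")).map(_ => e -> keep)
      }
      .takeWhile { case (_, keep) => keep }
      .map { case (e, _) => e }
}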
My scenario is something like this.
1. I have a vector consisting of a large number of reports that need to be sent using a REST API call.
2. I am using Future.traverse on the vector mentioned in 1.
3. Since the vector is huge, this fails with "max open requests exceeded".
One initial solution I could think of is to increase the max-open-requests setting. But the problem is that I do not know beforehand how many reports need to be sent.
Can someone please suggest an alternative, such as limiting the parallelism of Future.traverse?
Since you tagged this question with akka, I'm assuming that you are using akka-http for the calls. You could use akka-streams to make the requests in batches and avoid overflowing your connection pool, something like:
Source(reportsVector)
.grouped(safeValue)
.mapAsync(1)(reps => Future.traverse(reps)(x => ...)) //do your stuff
.mapConcat(identity)
.runWith(Sink.seq)
The example will execute safeValue concurrent calls at a time and collect all the results into a collection that is returned when the entire stream is done. You can also play with other operators such as sliding and splitWhen to better fit your use case, and you can tune safeValue and the mapAsync concurrency as well. Notice that the source of this stream is a known vector (reportsVector), but it could just as well be an unknown, finite stream of reports.
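For completeness, a self-contained sketch of the same batching idea; sendReport is a placeholder for the real akka-http call, and the report type, safeValue and the result handling are illustrative assumptions:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object BatchedSend extends App {
  implicit val system: ActorSystem = ActorSystem("reports")
  import system.dispatcher

  val reportsVector: Vector[String] = Vector.tabulate(1000)(i => s"report-$i")
  val safeValue = 32 // keep below akka.http.host-connection-pool.max-open-requests

  def sendReport(r: String): Future[String] = Future.successful(s"sent $r") // placeholder

  val done: Future[Seq[String]] =
    Source(reportsVector)
      .grouped(safeValue)                                        // batches of at most safeValue
      .mapAsync(1)(batch => Future.traverse(batch)(sendReport))  // one batch in flight at a time
      .mapConcat(identity)                                       // flatten back to single results
      .runWith(Sink.seq)

  done.foreach { results =>
    println(s"sent ${results.size} reports")
    system.terminate()
  }
}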
We're having an issue where, upon doing a groupBy --> reduce --> toStream, partial reduce values are being sent downstream when a commit happens during the reduce. So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
So here is our case in more detail:
msg --> leftJoin
leftJoin --> flatMap //break msg into parts so we can join again downstream
flatMap --> leftJoin
leftJoin --> groupByKey
groupByKey --> reduce
reduce --> toStream
toStream --> to
Currently, we've come up with a very ugly fix for this, which involves adding an index and an "out of" count to each message created during the flatMap phase; we then filter out any message emitted by the reduce where index != "out of". My feeling is we're not doing something right here, or are looking at it the wrong way. Please advise on the correct way of doing this.
Thanks.
So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
If I understand your description correctly, this is actually intended behavior. For one, it's a tradeoff between processing latency (where you want to see update records as soon as you have a new piece of input data) vs. coalescing multiple update records into fewer or even just a single update record.
The default behavior of Kafka Streams is to favor lower processing latency. That is, it will not wait for "all input data to have arrived" before sending downstream updates. Rather, it will send updates once new data has arrived. Some background information is described at https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/.
Today, you have two main knobs to change/tune this default behavior: (1) Kafka Streams record caches (for the DSL) and (2) the configured commit interval (you already mentioned this).
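As a rough illustration of those two knobs (the property values below are just examples, not recommendations):

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

object StreamsTuning {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reduce-example")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  // (1) a larger record cache coalesces more updates per key before they are forwarded
  props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, "10485760") // 10 MB
  // (2) a longer commit interval flushes the caches (and emits updates) less often
  props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "30000") // 30 seconds
}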
Moving forward, the Kafka community has also been working on a new feature that will allow you to specify that you want only a single, final update record to be sent (rather than what you described as "partial" updates). This new feature, in case you are interested, is described in the Kafka Improvement Proposal KIP-328: Ability to suppress updates for KTables. It is actively being worked on, but it is unlikely to be finished in time for the upcoming Kafka v2.1 release in October.
Currently, we've come up with a very ugly fix for this, which involves adding an index and an "out of" count to each message created during the flatMap phase; we then filter out any message emitted by the reduce where index != "out of". My feeling is we're not doing something right here, or are looking at it the wrong way. Please advise on the correct way of doing this.
In short, in stream processing you should embrace the nature of how streaming works. In general, you will only have partial/incomplete knowledge of the world, so to speak, or rather: you only know what you have observed thus far. So, at any given point in time, you must deal with the possibility that more data will arrive that you still have to process.
A typical situation is having to deal with late-arriving data, where your application logic must decide whether to still integrate and process this data (quite likely) or to discard it (sometimes the way it needs to be).
Going back to your example:
So if there are 65 keys to be reduced [...]
How would one know it's 65, and not 100 or 28, and so on? One can only tell that: "Thus far, at this point in time, I have received 65. So, what do I do? Do I reduce those 65 because I believe that's all the input? Or do I wait some seconds/minutes/hours longer because there might be 35 more to arrive, but this will mean that I will not send an update/answer downstream until this waiting time has elapsed (which results in higher processing latency)?"
In your situation, I would ask: Why do you consider the streaming behavior of how/when updates are being sent a problem? Perhaps it's because you have a downstream system or application that doesn't know how to handle such streaming updates?
Does that make any sense? Again, the above is based on my understanding of what you described as being the issue.
I am getting started with Akka streams; I am trying to create a stream that reads data from a web service and then persists it in S3.
I was wondering: if I define the Sink for the persistence using the Sink.fold method (in order to gather information about the persisted elements), are the elements sent to the sink going to be processed one after another, or in parallel?
It's a basic question, but I wasn't able to find a definitive answer in the docs.
Since Sink.fold needs the result from the previous elements to combine it with the next one, it's necessarily sequential.
It's more of a Sink.foldLeft, actually.
In other words, if you have a, b as elements and you fold them using f, you need acc = f(zero, a) to be able to compute f(acc, b). So, until the processing for a is done, b cannot be processed.
From the api doc:
A Sink that will invoke the given function for every received element, giving it its previous output (or the given zero value) and the element as input. The returned java.util.concurrent.CompletionStage will be completed with value of the final function evaluation when the input stream ends, or completed with Failure if there is a failure signaled in the stream.
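A tiny illustration of that sequential behavior (made-up numbers; Akka 2.6 style, where the implicit ActorSystem provides the materializer):

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

object FoldDemo extends App {
  implicit val system: ActorSystem = ActorSystem("fold-demo")
  import system.dispatcher

  // The fold sees elements strictly in order, one at a time, because each step needs
  // the accumulator produced by the previous one: f(0,1)=1, f(1,2)=3, ... => 15
  val summed: Future[Int] =
    Source(1 to 5).runWith(Sink.fold[Int, Int](0)((acc, n) => acc + n))

  summed.foreach { total => println(total); system.terminate() }
}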
I'm writing an application that reads relatively large text files, validates and transforms the data (each line in a text file is its own item, and there are around 100M items per file) and creates some kind of output. There already exists a multithreaded Java application (using a BlockingQueue between the reading/processing/persisting tasks), but I want to implement a Scala application that does the same thing.
Akka seems to be a very popular choice for building concurrent applications. Unfortunately, due to the asynchronous nature of actors, I still don't understand what a single actor can or can't do, e.g. if I can use actors as traditional workers that do some sort of calculation.
Several pieces of documentation say that actors should never block, and I understand why. But the given examples of blocking code always only mention things like blocking file/network IO: things that make the actor wait for a short period of time, which is of course a bad thing.
But what if the actor is "blocking" because it actually does something useful instead of waiting? In my case, the processing and transformation of a single line/item of text takes 80ms, which is quite a long time (pure processing, no IO involved). Can this work be done by an actor directly, or should I use a Future instead (but then, if I have to use Futures anyway, why use Akka in the first place?)?
The Akka docs and examples show that work can be done directly by actors. But it seems that the authors only do very simplistic work (such as calling filter on a String or incrementing a counter, and that's it). I don't know if they do this to keep the docs simple and concise or because you really should not do more than that within an actor.
How would you design an Akka-based application for my use case (reading a text file, processing every line, which takes quite some time, and eventually persisting the result)? Or is this the kind of problem that is not suited to Akka?
It all depends on the type of actor.
I use this rule of thumb: if you don't need to talk to this actor and this actor does not have any other responsibilities, then it's ok to block in it doing actual work. You can treat it as a Future and this is what I would call a "worker".
If you block in an actor that is not a leaf node (a worker), i.e. a work distributor, then the whole system will slow down.
There are a few patterns that involve work pulling/pushing or an actor-per-request model. Either of those could be a fit for your application. You can have a manager that creates an actor for each piece of work; when the work is finished, the actor sends the result back to the manager and dies. You can also keep an actor alive and ask it to do more work. You can also combine actors and Futures.
Sometimes you want to be able to talk to a worker, if your processing is more complex and involves multiple stages. In that case a worker can delegate work to yet another actor or to a Future.
To sum up: don't block in manager/work-distribution actors. It's OK to block in workers if that does not slow your system down.
Disclaimer: by blocking I mean doing actual work, not busy waiting, which is never OK.
Doing computations that take 100ms is fine in an actor. However, you need to make sure to properly deal with backpressure. One way would be to use the work-pulling pattern, where your CPU-bound actors request new work whenever they are ready instead of having new work items pushed to them in messages.
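A bare-bones sketch of that work-pulling idea with classic actors (all names are made up, and toUpperCase stands in for the expensive per-line processing):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

object Protocol {
  case object GimmeWork
  final case class Work(line: String)
  final case class Result(out: String)
  case object NoMoreWork
}

class Worker(master: ActorRef) extends Actor {
  import Protocol._
  override def preStart(): Unit = master ! GimmeWork
  def receive: Receive = {
    case Work(line) =>
      val out = line.toUpperCase // stand-in for the expensive computation
      master ! Result(out)
      master ! GimmeWork         // pull the next item only when ready
    case NoMoreWork =>
      context.stop(self)
  }
}

class Master(lines: Iterator[String]) extends Actor {
  import Protocol._
  def receive: Receive = {
    case GimmeWork =>
      if (lines.hasNext) sender() ! Work(lines.next())
      else sender() ! NoMoreWork
    case Result(_) => () // persist/aggregate results here
  }
}

object WorkPullingMain extends App {
  val system = ActorSystem("work-pulling")
  val master = system.actorOf(Props(new Master(Iterator("line 1", "line 2", "line 3"))), "master")
  (1 to 4).foreach(i => system.actorOf(Props(new Worker(master)), s"worker-$i"))
}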
That said, your problem description sounds like a processing pipeline that might benefit from using a higher level abstraction such as akka streams. Basically, produce a stream of file names to be processed and then use transformations such as map to get the desired result. I have something like this in production that sounds pretty similar to your problem description, and it works very well provided the data used by the individual processing chunks is not too large.
Of course, a stream will also be materialized into a number of actors. But the high-level interface will be more type-safe and easier to reason about.
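A rough sketch of such a streams-based pipeline (the file name and the transform and persist steps are placeholders, not from the answer):

import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.scaladsl.{FileIO, Framing, Sink}
import akka.util.ByteString

object LinePipeline extends App {
  implicit val system: ActorSystem = ActorSystem("line-pipeline")
  import system.dispatcher

  def transform(line: String): String = line.toUpperCase // stand-in for the ~80 ms processing
  def persist(item: String): Unit = ()                   // stand-in for writing the output

  val done =
    FileIO.fromPath(Paths.get("items.txt")) // "items.txt" is a made-up file name
      .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 4096, allowTruncation = true))
      .map(_.utf8String)
      .map(transform)                       // CPU-bound step; backpressure keeps memory bounded
      .runWith(Sink.foreach(persist))

  done.onComplete(_ => system.terminate())
}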
My use case is to identify, in real time rather than with batch jobs, entities from which expected events have not been received after X amount of time. For example:
If we have received a PaymentInitiated event at time T but didn't receive any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck along with the details of the PaymentInitiated event.
How can I model such use cases in Apache Storm, given that it is a rolling time period X relative to each event rather than a fixed time interval?
Thanks,
Harish
For Storm, you would need to put all your logic into your UDF code using the low-level Java API (I doubt that Trident is helpful). I never worked with Samza and cannot provide any help for it (or judge which system would be the better fit for your problem).
In Storm, for example, you could assign a timestamp to each tuple in Spout.nextTuple(), and buffer all tuples of incomplete payments within a Bolt in ascending order of timestamp. Each time Bolt.execute() is called, you can compare the timestamp of the new tuple with the head (i.e., the oldest tuple) of your queue. If the input tuple has a timestamp larger than head-T plus X, you know that your head tuple has timed out and you can raise your trigger for it.
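A plain-Scala sketch of that buffering logic, kept independent of the Storm classes; in a real topology it would live inside your Bolt and be driven from Bolt.execute(). The event shape, field values and completion check are assumptions:

import scala.collection.mutable

object PaymentTimeout {
  final case class PaymentEvent(paymentId: String, eventType: String, timestamp: Long)

  final class PaymentTimeoutTracker(timeoutMs: Long, onStuck: PaymentEvent => Unit) {
    // pending PaymentInitiated events, insertion order == ascending timestamp
    private val pending = mutable.LinkedHashMap.empty[String, PaymentEvent]

    def onEvent(e: PaymentEvent): Unit = {
      e.eventType match {
        case "PaymentInitiated" => pending.put(e.paymentId, e)
        case _                  => pending.remove(e.paymentId) // Failed/Aborted/Succeeded completes it
      }
      // Each incoming event advances "current time": any head entry older than
      // (e.timestamp - timeoutMs) has timed out and triggers PaymentStuck.
      while (pending.nonEmpty && pending.head._2.timestamp < e.timestamp - timeoutMs) {
        val (_, stuck) = pending.head
        onStuck(stuck)
        pending.remove(stuck.paymentId)
      }
    }
  }
}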
Of course, you need to use fieldsGrouping() to ensure that all tuples belonging to the same payment are processed by the same Bolt instance. You might also need to somewhat order the incoming bolt tuples by timestamp, or use more advanced time-out logic to deal with out-of-order tuples (with regard to increasing timestamps).
Depending on your latency requirements and input stream rate, you might also use "tick tuples" to trigger the comparison of the head tuple with this dummy tick tuple. Or, as an even stricter implementation, do all this logic directly in Spout.nextTuple() (if you know that all tuples of a payment go through the same Spout instance).