I have a service that requires an operation to be performed on an entity status 10 minutes after a specific event has been invoked. During the 10 minute wait, it has to listen for other events that would stop this event altogether so the operation would not be executed.
At any point in time, there could be multiple instances of entities to be handled.
Is there any suggestion as to what possible ways of implementation to create this service?
I have something similar in one of my projects that handles users presence management. The rough idea is like this:
1 You have a timer with tick precision that you need (i use 15
second)
2 Some input event triggers monitoring of an entity and puts entity into the execution queue with time of adding
3 If any canceling event occurs > you remove corresponding entities from the execution queue
4 On timer ticks you check time of adding for each iteam in executionQueue and execute operation for entities that are in there for 10 mins
PS.
you might want to consider using multi-threading / TPL for operations execution.
Related
I have an application which listens to a stream of events. These events tend to come in chunks: 10 to 20 of them within the same second, with minutes or even hours of silence between them. These events are processed and result in an aggregate state, and this updated state is sent further downstream.
In pseudo code, it would look something like this:
kafkaSource()
.mapAsync(1)((entityId, event) => entityProcessor(entityId).process(event)) // yields entityState
.mapAsync(1)(entityState => submitStateToExternalService(entityState))
.runWith(kafkaCommitterSink)
The thing is that the downstream submitStateToExternalService has no use for 10-20 updated states per second - it would be far more efficient to just emit the last one and only handle that one.
With that in mind, I started looking if it wouldn't be possible to not emit the state after processing immediately, and instead wait a little while to see if more events are coming in.
In a way, it's similar to conflate, but that emits elements as soon as the downstream stops backpressuring, and my processing is actually fast enough to keep up with the events coming in, so I can't rely on backpressure.
I came across groupedWithin, but this emits elements whenever the window ends (or the max number of elements is reached). What I would ideally want, is a time window where the waiting time before emitting downstream is reset by each new element in the group.
Before I implement something to do this myself, I wanted to make sure that I didn't just overlook a way of doing this that is already present in akka-streams, because this seems like a fairly common thing to do.
Honestly, I would make entityProcessor into an cluster sharded persistent actor.
case class ProcessEvent(entityId: String, evt: EntityEvent)
val entityRegion = ClusterSharding(system).shardRegion("entity")
kafkaSource()
.mapAsync(parallelism) { (entityId, event) =>
entityRegion ? ProcessEvent(entityId, event)
}
.runWith(kafkaCommitterSink)
With this, you can safely increase the parallelism so that you can handle events for multiple entities simultaneously without fear of mis-ordering the events for any particular entity.
Your entity actors would then update their state in response to the process commands and persist the events using a suitable persistence plugin, sending a reply to complete the ask pattern. One way to get the compaction effect you're looking for is for them to schedule the update of the external service after some period of time (after cancelling any previously scheduled update).
There is one potential pitfall with this scheme (it's also a potential issue with a homemade Akka Stream solution to allow n > 1 events to be processed before updating the state): what happens if the service fails between updating the local view of state and updating the external service?
One way you can deal with this is to encode whether the entity is dirty (has state which hasn't propagated to the external service) in the entity's state and at startup build a list of entities and run through them to have dirty entities update the external state.
If the entities are doing more than just tracking state for publishing to a single external datastore, it might be useful to use Akka Persistence Query to build a full-fledged read-side view to update the external service. In this case, though, since the read-side view's (State, Event) => State transition would be the same as the entity processor's, it might not make sense to go this way.
A midway alternative would be to offload the scheduling etc. to a different actor or set of actors which get told "this entity updated it's state" and then schedule an ask of the entity for its current state with a timestamp of when the state was locally updated. When the response is received, the external service is updated, if the timestamp is newer than the last update.
Using Fixed Windows in Apache Beam. The watermark is set by the event time.
Some data may arrive out of order and cause the window to close.
How can a trigger be defined in Java to occur say 2 minutes after the last data was seen?
It's not entire clear what behavior you expect. One question is what do you expect to happen if the data arrives within the two minutes? Do you want to restart the two minutes interval, don't restart it, re-emit the data or not?
Looks like the trigger you are trying to describe is something along these lines:
wait until the watermark passed the end of window, in event time;
wait for additional 2 minutes in processing time;
emit the data;
If in step 2 it was event time, i.e. you wanted to re-emit the window if a late element arrives that fits within window + 2min, then you could use withAllowedLateness(). Though it sounds different from what you want, because it can keep re-emitting the window contents every time a matching late element arrives.
With processing-time in step 2 this is not possible in general with basic triggers that are available in Beam. You can probably achieve a behavior you want if you manually manage state and timers in your own ParDo, e.g. you can watch for the incoming elements, keep track on them in the state, and then on timer emit what you want. This can become very complicated and might still be not flexible enough for your specific use case.
One of the major problems is that there is no good way to define processing time triggers in Beam in general. It would be complicated to define a general mechanism of working with timers in this manner. For example, when you want to express "wait for 2 minutes", the framework needs to understand in relation to what these two minutes are, when to start the timer, so you need a mechanism to express that as well. And with composition, continuation and other complications this doesn't seem easy to reason about. So it's not in the framework in this general form.
In order to implement only the "wait for 2 minutes after the last element was seen in the window", the framework has to watch for it and set the timer. Technically it is possible to do something like this but doesn't seem like anyone has done it yet.
There seems to be only one meaningful processing time trigger available in Beam but it's not generic enough and doesn't do what you want. You can look at composite triggers like AfterFirst or AfterAll but they likely won't help you without a better general processing time trigger.
I decided against using Beam and implemented the solution in Kafka Streams.
I basically grouped by, then used fixed windows and the aggregated the result.
The "grace" on the window allows data to arrive late.
KGroupedStream<Long, OxyStreamItem> grouped = input.groupByKey();
TimeWindowedKStream<Long, OxyStreamItem> windowed =
grouped.windowedBy(
TimeWindows.of(WIN_SIZE)
.advanceBy(WIN_SIZE)
.grace(Duration.ofSeconds(5L)));
return windowed
.aggregate(
makeInitializer(),
makeAggregator(),
Materialized
.<Long, Aggregate, WindowStore<Bytes, byte[]>>as("tmp")
.withValueSerde(new AggregateSerde()))
.suppress(
Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.map(calculateAvg());
I want to try out Azure Functions with the following project:
Triggered by time (every 30 minutes) my initial function1 puts some data in a queue1.
This queue1 triggers another function2 that calls an external REST API, modifies the response and puts the results in another queue3.
This queue3 starts another functions3 doing the rest.
My problem is that the REST API has a rate limiting. So if my function1 puts 100 items in the queue1 and the function2 is called 100 times parallel, my API calls will be blocked. I therefore need some kind of throttling.
How would you achieve that? I could tell function2 to wait a specific time and then add the item back to queue1, but since everything is parallel I might run in to a deadlock?
Thanks in advance for ideas!
I would recommend that you take a look at Azure Durable Functions here. The Durable Functions framework allows you to orchestrate complex workflows and manage state. In your example, you could use Durable Functions to work around the rate limiting issue
is called 100 times parallel
You can limit this (to some extent) by configuring host.json:
batchSize for Storage Queues
maxConcurrentCalls for Service Bus
If that's not enough, you could do something more sofisticated:
Function 1 knows how many items it has to process, so it could calculate the "ideal" distribution of those for the next 30 minutes
When adding messages to queue1, it could set the time when the message should be picked up (ScheduledEnqueueTimeUtc for Service Bus or initialVisibilityDelay for CloudQueue)
Function2 will be called "on schedule" which should prevent throttling, if the total amount of messages is not too high
I have a requirement to
read a Database table
process the data (dataCleanser)
to, increase the throughput, I implemented EJB timers (non persistent) that wake up every 5 minutes (10 of them) and do the above work.
The problem is 'back pressure', the dataCleanser can sometimes take like 12 minutes (makes an external API call) and when this happens, Websphere reports a hung thread.
In those cases, I would like to decrease the number of timers (say from 10 to 5) programatically.
I can do that only if the timer comes back reports its status of successful execution or exception or timeout
that way I can control the back pressure.
is there anyway to do that in Websphere 8?
To ask the question in another way
- can the EJB timers(with transaction_not_supported) invoke another EJB that have transaction timeouts?
- can those timeouts be caught in the calling timer code?
if that is not possible, what are the downsides of using the plain old infinite loop with sleep and then invoke and EJB(in turn calls dataCleanser) with transaction timeout?
one of the downside, I know is this becomes single threaded and I do not how to make 10 parallel executions like I would do with timers.
I had some similar issues with scheduling, I decided to re-schedule callbacks programmatically each time my logic ended, based on the results of processing. You can have the timer service injected:
#Resource
private TimerService timerService;
This can be in a superclass if there are multiple EJBs that need scheduling, with methods like:
protected void reschedule(long millis) {
timerService.createTimer(millis, null);
}
This way you can control beans individually. I would not try to make them control each other, since that would become difficult in a cluster with multiple JVMs.
Have been trying to google this but getting a bit stuck.
Let's say we have a class that fires an event, and that event could be fired by several threads at the same time.
Using Observable.FromEventPattern, we create an Observable, and subscribe to that event. How exactly does Rx manage multiple those events being fired at once? Let's say we have 3 events fired in quick succession on different threads. Does it queue them internally, and then call the Subscribe delegate synchronously for each one? Let's say we were subscribing on a thread pool, can we still guarantee the Subscriptions would be processed separately in time?
Following on from that, let's say for each event, we want to perform an action, but it's a method that's potentially not thread safe, so we only want one thread to be in this method at a time. Now I see we can use an EventLoop Scheduler, and presumably we wouldn't need to implement any locking on the code?
Also, would observing on the Current Thread be an option? Is Current Thread the thread that the event was fired from, or the event the subscription was set up on? i.e. Is that current thread guaranteed to always be the same or could be have 2 threads running ending up in the method at the same time?
Thx
PS: I put an example together but I always seem to end up on the samethread in my subscrive method, even when I ObserveOn the threadpool, which is confusing :S
PSS: From doing a few more experiments, it seems that if no Schedulers are specified, then RX will just execute on whatever thread the event was fired on, meaning it processes several concurrently. As soon as I introduce a scheduler, it always runs things consecutively, no matter what the type of the scheduler is. Strange :S
According to the Rx Design Guidelines, an observable should never call OnNext of an observer concurrently. It will always wait for the current call to complete before making the next call. All Rx methods honor this convention. And, more importantly, they assume you also honor this convention. When you violate this condition, you may encounter subtle bugs in the behavior of your Observable.
For those times when you have source data that does not honor this convention (ie it can produce data concurrently), they provide Synchronize.
Observable.FromEventPattern assumes you will not be firing concurrent events and so does nothing to prevent concurrent downstream notifications. If you plan on firing events from multiple threads, sometimes concurrently, then use Synchronize() as the first operation you do after FromEventPattern:
// this will get you in trouble if your event source might fire events concurrently.
var events = Observable.FromEventPattern(...).Select(...).GroupBy(...);
// this version will protect you in that case.
var events = Observable.FromEventPattern(...).Synchronize().Select(...).GroupBy(...);
Now all of the downstream operators (and eventually your observer) are protected from concurrent notifications, as promised by the Rx Design Guidelines. Synchronize works by using a simple mutex (aka the lock statement). There is no fancy queueing or anything. If one thread attempts to raise an event while another thread is already raising it, the 2nd thread will block until the first thread finishes.
In addition to the recommendation to use Synchronize, it's probably worth having a read of the Intro to Rx section on scheduling and threading. It Covers the different schedulers and their relationship to threads, as well as the differences between ObserveOn and SubscribeOn, etc.
If you have several producers then there are RX methods for combining them in a threadsafe way
For combining streams of the same type of event into a single stream
Observable.Merge
For combining stream of different types of events into a single stream using a selector to transform the latest value on each stream into a new value.
Observable.CombineLatest
For example combining stock prices from different sources
IObservable<StockPrice> source0;
IObservable<StockPrice> source1;
IObservable<StockPrice> combinedSources = source0.Merge(source1);
or create balloons at the current position every time there is a click
IObservable<ClickEvent> clicks;
IObservable<Position> position;
IObservable<Balloons> balloons = clicks
.CombineLatest
( positions
, (click,position)=>new Balloon(position.X, position.Y)
);
To make this specifically relevant to your question you say there is a class which combines events from different threads. Then I would use Observable.Merge to combine the individual event sources and expose that as an Observable on your main class.
BTW if your threads are actually tasks that are firing events to say they have completed here is an interesting patterns
IObservable<Job> jobSource;
IObservable<IObservable<JobResult>> resultTasks = jobSource
.Select(job=>Observable.FromAsync(cancelationToken=>DoJob(token,job)));
IObservable<JobResult> results = resultTasks.Merge();
Where what is happening is you are getting a stream of jobs in. From the jobs you are creating a stream of asynchronous tasks ( not running yet ). Merge then runs the tasks and collects the results. It is an example of a mapreduce algorithm. The cancellation token can be used to cancel running async tasks if the observable is unsubscribed from (ie canceled )