How to retry hot observable? - system.reactive

Rx has a great function, Observable.Buffer, but there is a problem with it in real life.
Scenario: an application sends a stream of events to a database. Inserting events one-by-one is expensive, so we need to batch them. I want to use Observable.Buffer for this. But inserting into the DB has a small probability of failure (deadlocks, timeouts, downtime, etc.).
I could add retry logic to the batching function itself, but that would go against the Rx idea of composability. Observable.Retry does not cut it, because it re-subscribes to the "hot" source, which means that the failed batch is lost.
Are there functions, which I can compose to achieve desired effect, or do I need to implement my own extension? I would like something like this:
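_inputBuffer = new BufferBlock<int>();
_inputBuffer.AsObservable().  // assumed source: the BufferBlock exposed as an observable
    Buffer(TimeSpan.FromSeconds(10), 1000).
    Do(batch => SqlSaveBatch(batch)).
    Subscribe();  // assumed terminal call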
To make it perfect, I would also like to control what happens when OnComplete is called while the retry buffer still holds unsaved batches, and be able to perform some actions (send an error email, save the data to the local file system, etc.).

When a save to the database fails and needs to be retried, it's not really the stream or the events that are in error, it's an action taken against an event.
I would structure your code more like this:
IDisposable subscription =
    _inputBuffer.AsObservable().  // assumed source, taken from the question
    Buffer(TimeSpan.FromSeconds(10), 1000).
    Subscribe(
        batch => SqlSaveBatchWithRetryLogic(batch),
        () => YourOnCompleteAction());
You can provide the retry logic inside of SqlSaveBatchWithRetryLogic()
Handle OnComplete of the events inside YourOnCompleteAction()
You can elect to dispose the subscription from within SqlSaveBatchWithRetryLogic() if you fail to save a batch.
This also removes the Do side effect.
I would be careful about this approach though - you need to watch the retry logic. You have no back-pressure (way to slow down the input). So if you have any kind of back-off/retry you are risking the queue backing up and filling memory. If you start seeing batches consistently at the count limit, you are probably in trouble! You may want to implement a counter to monitor the outstanding items.
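As a rough illustration of what the retry logic inside SqlSaveBatchWithRetryLogic() could look like, here is a minimal Python sketch (not Rx and not C#). The names save_batch_with_retry, on_give_up and flaky_save are made up for this example, and the exponential back-off is just one possible policy. The on_give_up hook is where a failed batch could be written to a local file or reported by email.

```python
import time


def save_batch_with_retry(batch, save, max_attempts=3, base_delay=0.5,
                          on_give_up=None, sleep=time.sleep):
    """Try save(batch) up to max_attempts times with exponential back-off.

    Returns True on success. After the last failure the batch is handed to
    on_give_up (hypothetical hook: write to disk, send an email, ...) and
    False is returned.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            save(batch)
            return True
        except Exception as error:  # deadlock, timeout, downtime, ...
            if attempt == max_attempts:
                if on_give_up is not None:
                    on_give_up(batch, error)
                return False
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False


# Usage with a save function that fails twice before succeeding.
calls = {"n": 0}


def flaky_save(batch):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")


delays = []  # record the requested delays instead of really sleeping
ok = save_batch_with_retry([1, 2, 3], flaky_save, sleep=delays.append)
print(ok, calls["n"], delays)  # True 3 [0.5, 1.0]
```

Note that while this function is sleeping, new batches keep arriving, which is exactly the memory risk described above; the total back-off time should stay well below the buffer interval.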


Akka-streams time based grouping

I have an application which listens to a stream of events. These events tend to come in chunks: 10 to 20 of them within the same second, with minutes or even hours of silence between them. These events are processed and result in an aggregate state, and this updated state is sent further downstream.
In pseudo code, it would look something like this:
.mapAsync(1)((entityId, event) => entityProcessor(entityId).process(event)) // yields entityState
.mapAsync(1)(entityState => submitStateToExternalService(entityState))
The thing is that the downstream submitStateToExternalService has no use for 10-20 updated states per second - it would be far more efficient to just emit the last one and only handle that one.
With that in mind, I started looking if it wouldn't be possible to not emit the state after processing immediately, and instead wait a little while to see if more events are coming in.
In a way, it's similar to conflate, but that emits elements as soon as the downstream stops backpressuring, and my processing is actually fast enough to keep up with the events coming in, so I can't rely on backpressure.
I came across groupedWithin, but this emits elements whenever the window ends (or the max number of elements is reached). What I would ideally want, is a time window where the waiting time before emitting downstream is reset by each new element in the group.
Before I implement something to do this myself, I wanted to make sure that I didn't just overlook a way of doing this that is already present in akka-streams, because this seems like a fairly common thing to do.
Honestly, I would make entityProcessor into a cluster-sharded persistent actor.
case class ProcessEvent(entityId: String, evt: EntityEvent)
val entityRegion = ClusterSharding(system).shardRegion("entity")
// later, in the stream definition:
.mapAsync(parallelism) { case (entityId, event) =>
  entityRegion ? ProcessEvent(entityId, event)
}
With this, you can safely increase the parallelism so that you can handle events for multiple entities simultaneously without fear of mis-ordering the events for any particular entity.
Your entity actors would then update their state in response to the process commands and persist the events using a suitable persistence plugin, sending a reply to complete the ask pattern. One way to get the compaction effect you're looking for is for them to schedule the update of the external service after some period of time (after cancelling any previously scheduled update).
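To make the "schedule, and cancel any previously scheduled update" behaviour concrete, here is a small Python sketch of its semantics only. It is not Akka code, and the function name debounce and the millisecond timestamps are assumptions made for the example. Each new event replaces the pending state and restarts the waiting time, so only the last state of a burst is emitted.

```python
def debounce(events, quiet_period):
    """Collapse bursts: emit only the last value of each burst.

    events: iterable of (timestamp_ms, value), sorted by timestamp.
    A burst ends when the gap to the next event is at least quiet_period,
    i.e. every new event resets the waiting time.
    Returns a list of (emit_time_ms, value).
    """
    emitted = []
    pending = None  # (timestamp, value) of the latest not-yet-emitted event
    for timestamp, value in events:
        if pending is not None and timestamp - pending[0] >= quiet_period:
            # The timer started by the previous event fired before this one.
            emitted.append((pending[0] + quiet_period, pending[1]))
        pending = (timestamp, value)  # the newer state replaces the older one
    if pending is not None:
        emitted.append((pending[0] + quiet_period, pending[1]))
    return emitted


# Two bursts, one minute apart; quiet period of one second.
events = [(0, "s1"), (200, "s2"), (500, "s3"), (60000, "s4"), (60100, "s5")]
print(debounce(events, quiet_period=1000))
# [(1500, 's3'), (61100, 's5')]
```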
There is one potential pitfall with this scheme (it's also a potential issue with a homemade Akka Stream solution to allow n > 1 events to be processed before updating the state): what happens if the service fails between updating the local view of state and updating the external service?
One way you can deal with this is to encode whether the entity is dirty (has state which hasn't propagated to the external service) in the entity's state and at startup build a list of entities and run through them to have dirty entities update the external state.
If the entities are doing more than just tracking state for publishing to a single external datastore, it might be useful to use Akka Persistence Query to build a full-fledged read-side view to update the external service. In this case, though, since the read-side view's (State, Event) => State transition would be the same as the entity processor's, it might not make sense to go this way.
A midway alternative would be to offload the scheduling etc. to a different actor or set of actors which get told "this entity updated its state" and then schedule an ask of the entity for its current state with a timestamp of when the state was locally updated. When the response is received, the external service is updated, if the timestamp is newer than the last update.

Asynchronous or synchronous pull for counting stream data in Pub/Sub?

I would like to count the number of messages in the last hour (last hour referring to a timestamp field in the message data).
I currently have code that counts the messages synchronously (I am using Google Cloud Pub/Sub synchronous pull), but I noticed it takes quite long.
My code will repeatedly poll the subscription for a predefined (I set it to 100+) number of times so that I am sure there are no more messages in the last hour that are coming in out of order.
This is not an acceptable design because it means the user has to wait for 5-10 mins for the service to count the messages when they want the metric!
Are there best practices in Pub Sub design for solving this kind of problem?
This seems like a simple problem to solve (count the number of events in the last X timeframe) so I thought there might be.
Will asynchronous design help? How would an async design work? I am not too sure about the async and Python future concept (I am using GCP Pub/Sub's Python client library).
I would try to capture the messages differently. My solution is based on logging and BigQuery. The idea is to write a log entry, for example "message received with timestamp xxxxx", to filter on this log pattern, and to sink the result into BigQuery.
Then, when a user asks, you simply query BigQuery and count the messages in the desired time range. You also gain the ability to change the time frame, to keep a history, and so on.
For writing this log, there are 2 solutions:
Cheaper but not really recommended: the process which consumes the message logs it as it processes it. However, you are then dependent on an external service, and this service has 2 responsibilities: its own work, and this log (for metrics). Not SOLID. Maybe it can be the role of the publisher instead, with a log like this: "message published at XXXX". However, this implies that all the publishers or all the subscribers are on GCP.
Better: plug in a function, the cheapest one (128 MB of memory), that simply handles the message and writes the log.
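A minimal sketch of such a logging function in Python could look like the following. It assumes a Pub/Sub-triggered function that receives the message as an event dict with base64-encoded data, and a JSON message body with an ISO 8601 "timestamp" field; the field name, the log shape and the helper names are assumptions for illustration. count_in_window is only a local stand-in for the count you would run in BigQuery over the sunk log entries.

```python
import base64
import json
from datetime import datetime, timedelta, timezone


def log_line_for(event):
    """Build the log line for one Pub/Sub message.

    Assumes the body is JSON with an ISO 8601 "timestamp" field
    (hypothetical field name; adapt it to your message schema).
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return json.dumps({"kind": "message_received",
                       "message_ts": payload["timestamp"]})


def handle_message(event, context=None):
    """Entry point of the function: it only writes the log line."""
    print(log_line_for(event))


def count_in_window(log_lines, now, window=timedelta(hours=1)):
    """Local stand-in for the BigQuery count over the logged timestamps."""
    count = 0
    for line in log_lines:
        ts = datetime.fromisoformat(json.loads(line)["message_ts"])
        if now - window < ts <= now:
            count += 1
    return count


def make_event(ts):
    """Demo helper: wrap a timestamp the way a Pub/Sub event would."""
    body = json.dumps({"timestamp": ts}).encode("utf-8")
    return {"data": base64.b64encode(body).decode("ascii")}


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
lines = [log_line_for(make_event(ts)) for ts in (
    "2024-01-01T11:30:00+00:00",
    "2024-01-01T10:30:00+00:00",  # older than one hour, not counted
    "2024-01-01T11:59:59+00:00",
)]
print(count_in_window(lines, now))  # 2
```

Because the count is based on the timestamp inside the message, out-of-order delivery no longer matters: a late message simply lands in the right time range the next time the query runs.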

How often put() is triggered in Kafka Connect sink tasks?

Can I control the intervals at which the put() method of my Kafka Connect Sink tasks is triggered? What is the expected behavior of the Kafka Connect framework in this respect? Ideally, I would like to specify, for example, "don't call me unless you have X new records/Y new bytes, or Z milliseconds passed since the last invocation". This could potentially make the batching logic within the sink task simpler (quoting the documentation, "in many cases internal buffering will be useful so an entire batch of records can be sent at once, reducing the overhead of inserting events into the downstream data store").
Today, put from a SinkTask is only called when deliverMessages is invoked in a WorkerSinkTask. The good news is that the only time deliverMessages happens is within poll so you should have some control over how often you poll for new records by overriding consumer properties.
If you want to do internal buffering, you could have a look at how the HDFSConnector is handling this in its implementation of SinkTask. However, right now, Connect will immediately put any records that get returned by the poll.
All of that said, if you are really looking to batch messages before they hit the downstream system, you might consider looking into the configuration properties which control how often flush() is invoked.
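The "X new records or Z milliseconds" rule from the question can be handled by internal buffering in the task itself. The following Python sketch shows only the buffering logic; a real connector would implement it in Java inside its SinkTask. The class BatchBuffer, its method names and the injected clock are assumptions made for this example, and it assumes that put is invoked regularly, even when a poll returned nothing.

```python
import time


class BatchBuffer:
    """Collect records and release them when a size or age limit is hit."""

    def __init__(self, max_records, max_age_seconds, clock=time.monotonic):
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.records = []
        self.oldest = None  # arrival time of the first buffered record

    def put(self, new_records):
        """Accept whatever one poll returned (possibly nothing).

        Returns a batch to write downstream, or None while still buffering.
        """
        now = self.clock()
        if new_records and not self.records:
            self.oldest = now
        self.records.extend(new_records)
        too_many = len(self.records) >= self.max_records
        too_old = (bool(self.records)
                   and now - self.oldest >= self.max_age_seconds)
        if too_many or too_old:
            return self.flush()
        return None

    def flush(self):
        """Hand over everything buffered so far (also useful on shutdown)."""
        batch, self.records, self.oldest = self.records, [], None
        return batch


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
buf = BatchBuffer(max_records=3, max_age_seconds=10, clock=lambda: fake_now[0])
print(buf.put(["a"]))        # None
print(buf.put(["b", "c"]))   # ['a', 'b', 'c']
print(buf.put(["d"]))        # None
fake_now[0] = 11.0
print(buf.put([]))           # ['d']
```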

Using many consumers in SQS Queue

I know that it is possible to consume an SQS queue using multiple threads. I would like to guarantee that each message will be consumed once. I know that it is possible to change the visibility timeout of a message, e.g., to make it equal to my processing time. If my process spends more time than the visibility timeout (e.g. a slow connection), another thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example - there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost - messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window
The simplest is probably a Guava cache, but would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
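The steps above can be sketched for a single process as follows. This is a minimal in-memory illustration in Python, not the Guava cache or database mentioned above; the class name Deduplicator, its methods and the injected clock are assumptions made for the example. The lock is what makes the check-and-insert atomic.

```python
import threading
import time


class Deduplicator:
    """In-memory "InProgress"/"Completed" registry with per-state timeouts."""

    def __init__(self, in_progress_ttl, completed_ttl, clock=time.monotonic):
        self.in_progress_ttl = in_progress_ttl
        self.completed_ttl = completed_ttl
        self.clock = clock
        self.lock = threading.Lock()
        self.in_progress = {}  # message_id -> expiry time
        self.completed = {}    # message_id -> expiry time

    def _alive(self, table, message_id, now):
        expiry = table.get(message_id)
        if expiry is not None and expiry <= now:
            del table[message_id]  # the entry timed out
            return False
        return expiry is not None

    def try_start(self, message_id):
        """Return True if the caller may process this message."""
        now = self.clock()
        with self.lock:  # check-and-insert must be atomic
            if self._alive(self.in_progress, message_id, now):
                return False  # another thread is working on it
            if self._alive(self.completed, message_id, now):
                return False  # already handled
            self.in_progress[message_id] = now + self.in_progress_ttl
            return True

    def mark_completed(self, message_id):
        now = self.clock()
        with self.lock:
            self.completed[message_id] = now + self.completed_ttl
            self.in_progress.pop(message_id, None)


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
dedup = Deduplicator(in_progress_ttl=30, completed_ttl=600,
                     clock=lambda: fake_now[0])
print(dedup.try_start("m1"))  # True: first time this id is seen
print(dedup.try_start("m1"))  # False: still in progress
dedup.mark_completed("m1")
print(dedup.try_start("m1"))  # False: completed, inside the window
fake_now[0] = 601.0
print(dedup.try_start("m1"))  # True: the deduplication window has passed
```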
Some notes
Keep in mind that chances of duplicate without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least as long as 2x your SQS message visibility timeout; there are reduced chances of duplication after that (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, "InProgress" will eventually expire, and another thread could process this message (either after the SQS visibility timeout also expires or because SQS had a duplicate in it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
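A minimal sketch of the unique-constraint idea, using Python's built-in sqlite3 module as a stand-in for whatever database you actually use. The table name received_messages and the function names are made up for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS received_messages ("
        " message_id TEXT PRIMARY KEY,"  # unique constraint on the ID
        " body TEXT NOT NULL)"
    )
    return conn


def record_if_new(conn, message_id, body):
    """Return True if the message was stored, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO received_messages (message_id, body)"
                " VALUES (?, ?)",
                (message_id, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the ID is already in the table


conn = open_store()
print(record_if_new(conn, "m1", "hello"))  # True
print(record_if_new(conn, "m1", "hello"))  # False
```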
The AWS SQS API doesn't automatically "consume" the message when you read it through the API. Developers need to make the call to delete the message themselves.
SQS does have a feature called "redrive policy" as part of the "Dead Letter Queue Settings". You just set the maximum receive count to 1. If the consuming process crashes, a subsequent read of the same message will put the message into the dead letter queue.
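For reference, the redrive policy is a queue attribute whose value is a JSON string. The sketch below only builds that attribute in Python; the ARN is a placeholder and the helper name redrive_attributes is made up for the example.

```python
import json


def redrive_attributes(dead_letter_queue_arn, max_receive_count=1):
    """Build the queue attributes that enable a redrive policy."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dead_letter_queue_arn,
            "maxReceiveCount": str(max_receive_count),
        })
    }


# Placeholder ARN, not a real queue.
attrs = redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:my-dead-letter-queue")
print(attrs["RedrivePolicy"])

# With boto3 this dict would be passed to the source queue, for example:
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
# where sqs is a boto3 SQS client and queue_url is your queue's URL.
```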
The SQS queue visibility timeout can be set to up to 12 hours. If you have a special need beyond that, you need to implement a process that stores the message handle in a database so that it can be inspected.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This can be done with a ScheduledExecutorService that schedules a Runnable after half the initial visibility time. The code snippet below creates a VisibilityTimeExtender and runs it first after half of visibilityTime and then repeatedly with a period of half the visibility time. (To guarantee that the message can be processed, the time should be extended by visibilityTime/2 on each run.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.
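The same heartbeat pattern, sketched in Python for comparison. The class name VisibilityExtender and the extend callback are assumptions for this example; in a real consumer the callback would call the SQS API, for instance boto3's change_message_visibility with the message's receipt handle.

```python
import threading
import time


class VisibilityExtender:
    """Call extend() every `period` seconds until stopped."""

    def __init__(self, extend, period):
        self.extend = extend  # hypothetical callback that extends the timeout
        self.period = period
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # wait() returns False on timeout (keep extending) and True once
        # stop() has been called (leave the loop).
        while not self._stop.wait(self.period):
            self.extend()

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()


# Usage: the period would be visibility_time / 2 in a real consumer.
calls = []
extender = VisibilityExtender(lambda: calls.append(time.monotonic()),
                              period=0.05).start()
time.sleep(0.3)   # stands in for processing the message
extender.stop()   # after this, delete the message from the queue
print(len(calls) >= 2)  # True
```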

Event sourcing: Write event before or after updating the model

I'm reasoning about event sourcing and often I arrive at a chicken and egg problem. Would be grateful for some hints on how to reason around this.
If I execute all I/O-bound processing async (i.e. writing to the event log) then how do I handle, or sometimes even detect, failures?
I'm using Akka Actors so processing is sequential for each event/message. I do not have any database at this time, instead I would persist all the events in an event log and then keep an aggregated state of all the events in a model stored in memory. Queries are all against this model, you can consider it to be a cache.
Creating a new user:
1. Validate that the user does not exist in model
2. Persist event to journal
3. Update model (in memory)
If step 3 breaks I still have persisted my event so I can replay it at a later date. If step 2 breaks I can handle that as well gracefully.
This is fine, but since step 2 is I/O-bound I figured that I should do I/O in a separate actor to free up the first actor for queries:
Updating a user while allowing queries (A0 = Front end/GUI actor, A1 = Processor Actor, A2 = IO-actor, E = event bus).
(A0->E->A1) Event is published to update user 'U1'. Validate that the user 'U1' exists in model
(A1->A2) Persist event to journal (separate actor)
(A0->E->A1->A0) Query for user 'U1' profile
(A2->A1) Event is now persisted, continue to update model
(A0->E->A1->A0) Query for user 'U1' profile (now returns fresh data)
This is appealing since queries can be processed while I/O is churning along at its own pace.
But now I can cause myself all kinds of problems: two incompatible commands (delete and then update) could both be persisted to the event log and crash on me when replayed at a later date, since I do the validation before persisting the event and only then update the model.
My aim is to have a simple reasoning around my model (since an Actor processes messages sequentially, single threaded) but not be waiting for I/O-bound updates when querying. I get the feeling I'm modeling a database, which in itself might be a problem.
If things are unclear please write a comment.
Asynchronous I/O can coexist with transactional updates. If you send an "ACK" or "NACK" after the command, then you can tell whether it has happened or not. In a distributed or truly asynchronous model, it is likely that the "NACK" will come from both explicit failures and time-outs.
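A minimal sketch of the ACK/NACK idea in plain Python rather than Akka. The class UserProcessor, its method names and the send_to_journal callback are hypothetical. Rejecting a command while another event for the same user is still in flight is an extra assumption of this sketch, added to avoid the delete-then-update problem from the question; it is not something the ACK/NACK protocol gives you by itself.

```python
class UserProcessor:
    """Validates commands, persists asynchronously, updates on ACK only."""

    def __init__(self, send_to_journal):
        self.send_to_journal = send_to_journal  # hypothetical async writer
        self.model = {}    # in-memory state that queries read
        self.pending = {}  # command_id -> event awaiting ACK/NACK
        self.failed = []   # events the journal rejected or timed out on

    def handle_update(self, command_id, user_id, profile):
        # Validate against the model and against in-flight events, so an
        # incompatible command cannot slip in behind a pending one.
        in_flight = any(e["user_id"] == user_id
                        for e in self.pending.values())
        if user_id not in self.model or in_flight:
            return False
        event = {"user_id": user_id, "profile": profile}
        self.pending[command_id] = event
        self.send_to_journal(command_id, event)  # returns immediately
        return True

    def on_ack(self, command_id):
        event = self.pending.pop(command_id)
        self.model[event["user_id"]] = event["profile"]  # now durable

    def on_nack(self, command_id):
        self.failed.append(self.pending.pop(command_id))  # model untouched

    def query(self, user_id):
        return self.model.get(user_id)


# Usage: the "journal" here just records which commands it was sent.
outbox = []
proc = UserProcessor(lambda command_id, event: outbox.append(command_id))
proc.model["U1"] = "old"
print(proc.handle_update("c1", "U1", "new"))  # True
print(proc.query("U1"))                       # old (not yet persisted)
proc.on_ack("c1")
print(proc.query("U1"))                       # new
```

Queries stay fast because they never wait for the journal, and the model only changes once the write is known to have succeeded.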