Hazelcast - queue reading

I have the following situation in my project: there are two different types of services, A1, A2, A3 and B1, B2, B3.
The A's are just multiple instances of service A, and the same goes for the B's.
My question is:
I publish an element to a queue in Hazelcast and want that element to be processed by exactly one instance of A and one instance of B. A topic cannot be used, because a topic broadcasts to all instances, and likewise an ItemListener on a queue fires on every instance of A and B. Is it possible to do this with a single queue? Or is there a different approach for dealing with such a situation?

Hazelcast doesn't provide a concept of consumer groups for topics.
If your items are small you can publish them to two queues (or more, in case you have services A, B, C...).
If your items are large and duplicating them would cause too much overhead, you can put each item into a map and publish only a reference to it into each queue.
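As a rough sketch of the second approach (Hazelcast 4.x+ APIs; the map/queue names and payload type are placeholders, not from your setup):

```java
import com.hazelcast.collection.IQueue;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

import java.util.UUID;

public class ReferencePublisher {
    public static void main(String[] args) throws InterruptedException {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        IMap<String, byte[]> payloads = hz.getMap("payloads");
        IQueue<String> queueA = hz.getQueue("work-a"); // polled only by A instances
        IQueue<String> queueB = hz.getQueue("work-b"); // polled only by B instances

        // Store the large item once, then publish just its key to both queues.
        String id = UUID.randomUUID().toString();
        payloads.put(id, new byte[1024 * 1024]);
        queueA.put(id);
        queueB.put(id);

        // On each A instance (and likewise each B instance), a worker loop
        // like this gives consume-once semantics per service type:
        String key = queueA.take();       // blocking take: only one A gets each key
        byte[] item = payloads.get(key);
        // ... process item, then remove it from the map once both sides are done
    }
}
```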

Related

How can an event-sourced entity subscribe to state changes in another entity?

I have an event-sourced entity (C) that needs to change its state in response to state changes in another entity of a different type (P). The logic for whether C's state should actually change is quite complex, and the data to compute it lives in C; moreover, many instances of C should listen to one instance of P, and the set of instances grows over time, so I'd rather have the C's pull from a stream, knowing the ID of P, than have P keep track of the IDs of all the C's and push to them.
I am thinking of doing something such as:
1. Tag a projection of P's events
2. Have a Subscribe(P.id) command that gets sent to C
3. If C is not already subscribed to a P (it can only subscribe to one, and the subscription shouldn't change), fire an event Subscribed(P.id)
4. In response to the event, use Akka Persistence Query to materialize the stream of events tagged in step 1, map them to commands, and run the stream asynchronously with a sink that sends them to my event-sourced entity reference
Having a stream run in the event handler seems a bit of an anti-pattern, though. I am wondering if there's a better/more supported way to do this without the upstream having to know about the downstream. I decided against Akka pub-sub because it does at-most-once delivery, and I'd like to avoid using Kafka if possible.
You definitely don't want to run the stream in the event handler: the event handler should never side effect.
Assuming that you would like a C to get events from times when that C was not running (including before that C had ever run), a stream should be run for each C. Since the subscription will be to one particular P, I'd seriously consider not tagging at all, but instead using the eventsByPersistenceId stream to get all of P's events and ignoring the ones that aren't of interest. In the stream, you translate those events into commands in C's API, including the offset in P's event stream with each command, and send them to C. For at-least-once delivery, a mapAsync with an ask is useful: C persists an event recording that it processed the offset, which makes the command idempotent, as C can simply acknowledge any command whose offset is less than or equal to the high-water offset in its state.
This stream gets kicked off by the command handler after successfully persisting a Subscribed(P.id) event (in this case starting from offset 0), and kicked off again after the persistent actor is rehydrated if its state shows it's subscribed (in this case starting from one plus the high-water offset).
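A rough sketch of that stream using the Akka Java API (the read journal plugin ID, the isInteresting filter, and C's command protocol here are assumptions, not prescribed):

```java
import akka.Done;
import akka.NotUsed;
import akka.actor.typed.ActorRef;
import akka.actor.typed.ActorSystem;
import akka.actor.typed.javadsl.AskPattern;
import akka.persistence.query.EventEnvelope;
import akka.persistence.query.PersistenceQuery;
import akka.persistence.query.javadsl.EventsByPersistenceIdQuery;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

import java.time.Duration;

public class SubscriptionStream {

    // Hypothetical command protocol for C; not from the question.
    interface CCommand {}
    static final class ApplyPEvent implements CCommand {
        final long offset; final Object event; final ActorRef<Done> replyTo;
        ApplyPEvent(long offset, Object event, ActorRef<Done> replyTo) {
            this.offset = offset; this.event = event; this.replyTo = replyTo;
        }
    }

    // Started by C's command handler after persisting Subscribed(P.id)
    // (fromOffset = 0), or on recovery when the state shows an active
    // subscription (fromOffset = high-water offset + 1).
    static void run(ActorSystem<Void> system, ActorRef<CCommand> c,
                    String pPersistenceId, long fromOffset) {
        EventsByPersistenceIdQuery queries = PersistenceQuery.get(system)
            .getReadJournalFor(EventsByPersistenceIdQuery.class,
                "akka.persistence.query.journal.leveldb"); // plugin id: an assumption

        Source<EventEnvelope, NotUsed> events =
            queries.eventsByPersistenceId(pPersistenceId, fromOffset, Long.MAX_VALUE);

        events
            .filter(env -> isInteresting(env.event()))  // drop P events C doesn't care about
            .mapAsync(1, env -> AskPattern.ask(         // mapAsync(1) preserves ordering
                c,
                (ActorRef<Done> replyTo) ->
                    new ApplyPEvent(env.sequenceNr(), env.event(), replyTo),
                Duration.ofSeconds(5),
                system.scheduler()))
            .runWith(Sink.ignore(), system);            // restart/backoff handling omitted
    }

    static boolean isInteresting(Object pEvent) { return true; } // placeholder filter
}
```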
The rationale for not using tagging here arises from the assumption that the number of P's events C isn't interested in is smaller than the number of events carrying the tag from P's that C isn't subscribed to (note that for most of the persistence plugins, the more tags there are, the more overhead there is: a tag used by only one particular instance of an entity is often not a good idea). If the tag in question is rarely seen, this assumption might not hold, and eventsByTag with filtering by ID could be useful.
This does of course have the downside of running a discrete stream for every C: depending on how many C's are subscribed to a given P, the overhead may be substantial, and the streams for subscribers which are already caught up will be especially wasteful. In that scenario, responsibility for delivering commands to the subscribed C's for a given P can be moved to an actor: the only real change is that instead of running the stream itself, C confirms that it is subscribed by asking the actor that feeds events from P.
Because this approach is a marked step up in complexity (especially around managing when C's join and drop out of the shared "caught-up" stream), I'd tend to recommend starting with the stream-per-C approach and moving to the shared stream later. It's also worth noting that there can be multiple shared streams: in fact, I'd tend to make shared streams per-ActorSystem (e.g. a "node singleton" per P of interest) so as not to involve remoting. The transition isn't difficult to make: from C's perspective there's no real difference between adapted commands coming from a stream it started and commands coming from a stream run by some other actor.

Tracking an expected set of Kafka events

Say I have N cities and each will report its temperature for the hour (H) by producing Kafka events. I have a complex model I want to run, but I want to ensure it doesn't attempt to kick off before all N are read.
Say they are being produced in batches. I understand that, to ensure at-least-once consumption, if a consumer fails mid-batch it will pick up again at the front of the batch. I have built this into my model by counting unique cities (if a city is sent multiple times, it overwrites the existing record).
My current plan is to set it up as follows:
An application creates an initial event which says "Expect these N cities to report for H o'clock".
The events are persisted (in db, Redis, etc) by another application. After writing, it produces an event which states how many unique cities have been reported in total so far for H.
Some process matches the initial "Expect N" events with "N Written" events. It alerts the rest of the system that the data set for H is ready for creating the model when they are equal.
Does this problem have a name and are there common patterns or libraries available to manage it?
Does the solution as outlined have glaring holes or overcomplicate the issue?
What you're describing sounds like an Aggregator, described in Gregor Hohpe and Bobby Woolf's "Enterprise Integration Patterns" as:
a special Filter that receives a stream of messages and identifies messages that are correlated. Once a complete set of messages has been received [...], the Aggregator collects information from each correlated message and publishes a single, aggregated message to the output channel for further processing.
This could be done on top of Kafka Streams, using its built-in aggregation, or with a stateful service like you suggested.
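If you go the Kafka Streams route, a minimal sketch of the aggregation might look like this (the topic names, the "hour:city" key layout, and the hardcoded expected count are assumptions; in your design N would come from the "Expect N" event):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class CityAggregator {
    static final long EXPECTED_CITIES = 50; // the "N" from the expect-event, hardcoded here

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumes "city-temps" is keyed "hour:city", so a re-reported city
        // overwrites its previous record and deduplication comes for free.
        KTable<String, String> latest =
            builder.table("city-temps", Consumed.with(Serdes.String(), Serdes.String()));

        // Regroup by hour and count distinct cities; KGroupedTable.count()
        // handles overwrites correctly (subtracts the old key, adds the new).
        KTable<String, Long> reported = latest
            .groupBy((key, temp) -> KeyValue.pair(key.split(":")[0], temp),
                     Grouped.with(Serdes.String(), Serdes.String()))
            .count();

        // Emit a readiness event once all expected cities have reported for H.
        reported.toStream()
            .filter((hour, count) -> count != null && count >= EXPECTED_CITIES)
            .to("hour-ready", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```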
One other suggestion: designing processes like this with event-driven choreography can be tricky. I have seen strong engineering teams fail to deliver similar solutions because they dove into the deep end without first learning to swim. If your scale demands it and your organization is already primed for event-driven distributed architecture, then go for it; if not, consider an orchestration-based alternative (for example, AWS Step Functions, Airflow, or another workflow orchestration tool). These are much easier to reason about and debug.

High Scalability Question: How to sync data across multiple microservices

I have the following use case:
Assume you have two microservices, AccountManagement and ActivityReporting, that process event U.
When a user registers, an event U containing the user information is published to a broker for the two microservices to process.
The AccountManagement and ActivityReporting microservices are each replicated across two instances for performance and scalability reasons.
Each microservice instance has a consumer listening on the broker topic. A topic was chosen so that both AccountManagement and ActivityReporting can process U concurrently.
However, I want only one instance of AccountManagement to process event U, and only one instance of ActivityReporting to process event U.
Please share your experience implementing a consume-once-per-application-group broker system, as this would effectively solve the problem.
If all your consumer listeners, even ones from different instances, have the same group.id property, then only one of them will receive each message. You need to set this property when you initialise the consumer. So in your case you will need one group.id for AccountManagement and another for ActivityReporting.
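For example (broker address and topic name are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AccountManagementConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every AccountManagement instance uses this same group.id, so each
        // record is delivered to exactly one instance of the group.
        // ActivityReporting instances would use e.g. "activity-reporting",
        // so that service gets its own copy of every record.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "account-management");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-registered"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Processing event U: " + record.value());
                }
            }
        }
    }
}
```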
I would recommend Cadence Workflow, which is a much more powerful solution for microservice orchestration.
It offers a lot of advantages over using queues for your use case:
Built-in exponential retries with an unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed within a configured interval.
Support for long-running heartbeating operations
Ability to implement complex task dependencies. For example, chaining of calls, or compensation logic in case of unrecoverable failures (SAGA)
Complete visibility into the current state of the update. With queues, for example, all you know is whether there are some messages in a queue, and you need an additional DB to track the overall progress; with Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.
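As a rough illustration of the programming model (the workflow and activity names are invented, and the options shown are just one plausible configuration), a Cadence Java workflow orchestrating the two updates might look like:

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.common.RetryOptions;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

import java.time.Duration;

interface UserRegistrationWorkflow {
    @WorkflowMethod
    void registerUser(String userId);
}

interface RegistrationActivities {
    @ActivityMethod
    void updateAccountManagement(String userId);

    @ActivityMethod
    void updateActivityReporting(String userId);
}

class UserRegistrationWorkflowImpl implements UserRegistrationWorkflow {
    private final RegistrationActivities activities = Workflow.newActivityStub(
        RegistrationActivities.class,
        new ActivityOptions.Builder()
            .setScheduleToCloseTimeout(Duration.ofHours(1))
            .setRetryOptions(new RetryOptions.Builder()
                .setInitialInterval(Duration.ofSeconds(1))
                .setBackoffCoefficient(2.0)   // exponential backoff between retries
                .setMaximumAttempts(100)
                .build())
            .build());

    @Override
    public void registerUser(String userId) {
        // Each activity call is retried per the options above, and every
        // state transition is recorded by Cadence, so the progress of the
        // update is fully visible without an extra tracking DB.
        activities.updateAccountManagement(userId);
        activities.updateActivityReporting(userId);
    }
}
```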

Should Kafka event-carried state transfer systems be implemented using a GlobalKTable for local queries?

Event-carried state transfer removes the need to make remote calls to query information from other services.
Let's assume a practical case:
We have a customer service that publishes CustomerCreated/CustomerUpdated events to a customer Kafka topic.
A shipping service listens to an order topic.
When an OrderCreated event is read by the shipping service, it needs access to the customer's address. Instead of making a REST call to the customer service, the shipping service will already have the customer information available locally, kept in a KTable/GlobalKTable with persistent storage.
My questions are about how we should implement this: we want this system to be resilient and scalable so there will be more than one instance of the customer and shipping services, meaning there will also be more than one partition for the customer and order topics.
We could run into scenarios like this: an OrderCreated(orderId=1, userId=7, ...) event is read by a shipping service instance, but if that instance uses a KTable to keep and access the local user information, userId=7 may not be there, because the partition that handles that userId could have been assigned to the other shipping service instance.
Offhand this problem could be solved using a GlobalKTable so that all shipping service instances have access to the whole range of customers.
Is this (GlobalKTable) the recommended approach to implement that pattern?
Is it a problem to replicate the whole customer dataset in every shipping service instance when the number of customers is very large?
Can this/should this case be implemented using KTable in some way?
You can solve this problem with either a GlobalKTable or a KTable. The former data structure is replicated, so the whole table is available on every node (and uses more storage). The latter is partitioned, so the data is spread across the various nodes. This has the side effect that, as you say, the partition that handles the userId may not also handle the corresponding customer. You solve this problem by repartitioning one of the streams so they are co-partitioned.
So in your example you need to enrich Order events with Customer information in the Shipping Service. You can either:
a) Use a GlobalKTable of Customer information and join to that on each node
b) Use a KTable of Customer information and perform the same operation, but before doing the enrichment you must rekey using the selectKey() operator to ensure the data is co-partitioned (i.e. the same keys will be on the same node). You also need the same number of partitions in the Customer and Orders topics.
The Inventory Service Example in the Confluent Microservices Examples does something similar. It rekeys the stream of orders so they are partitioned by productId, then joins to a KTable of Inventory (also keyed by productId).
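For illustration, a sketch of both variants (topic names, String serdes, and the order-value format are simplifications, not from your setup):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class ShippingTopology {
    // Hypothetical helper: pull the customer id out of an order value.
    static String customerId(String order) { return order.split("\\|")[1]; }

    public static void globalKTableVariant(StreamsBuilder builder) {
        // (a) GlobalKTable: every shipping instance holds all customers,
        // so no co-partitioning is required.
        GlobalKTable<String, String> customers =
            builder.globalTable("customer", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> orders =
            builder.stream("order", Consumed.with(Serdes.String(), Serdes.String()));

        orders.join(customers,
                (orderId, order) -> customerId(order),  // map each order to its customer key
                (order, customer) -> order + " -> ship to " + customer)
            .to("order-with-address");
    }

    public static void kTableVariant(StreamsBuilder builder) {
        // (b) KTable: customers are partitioned, so orders must first be
        // rekeyed by customerId (selectKey forces a repartition), and both
        // topics must have the same partition count.
        KTable<String, String> customers =
            builder.table("customer", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> orders =
            builder.stream("order", Consumed.with(Serdes.String(), Serdes.String()));

        orders.selectKey((orderId, order) -> customerId(order))
            .join(customers, (order, customer) -> order + " -> ship to " + customer)
            .to("order-with-address");
    }
}
```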
Regarding your individual questions:
Is GlobalKTable the recommended approach to implement that pattern?
Both work. The GKTable has a longer worst-case reload time if your service loses storage for whatever reason. The KTable will have a slightly greater latency as data has to be repartitioned, which means writing the data out to Kafka and reading it back again.
Is it a problem to replicate the whole customer dataset in every shipping service instance when the number of customers is very large?
The main difference is the aforementioned worst-case reload time. Technically, GlobalKTable and KTable also have slightly different semantics (a GlobalKTable loads fully on startup, a KTable loads incrementally based on event time), but that's not strictly relevant to this problem.
Can this/should this case be implemented using KTable in some way?
See above.
See also: Microservice Examples, Quick start, Blog Post.

Is there a way of assigning an int number to different instances of stateless services?

I'm building a solution where we'll have a (Service Fabric) stateless service deployed to K instances. This service is tasked with some workload (like querying) and I want to split the workload between the instances as evenly as I can. I also want this to be a dynamic solution: if I decide to go from K instances to N instances tomorrow, the workload should automatically be distributed across the N instances. I don't have any partitions specified for this service.
As an example -
Let's say I'd like to query a database to retrieve a particular chunk of the records. I have 5 nodes, and I want each of these 5 nodes to retrieve a different 1/5th of the set of records. This can be achieved through some query logic like (row_id % N == K), where N is the total number of instances and K is the unique instance number.
I was hoping to leverage FabricRuntime.GetNodeContext().NodeId - but this returns a guid which is not overly useful.
I'm looking for a way to deterministically say this is instance number M out of N (I need to be able to number the instances 1..N) so I can set my querying logic accordingly. One of the requirements is that if an instance goes down, crashes, etc. and SF automatically restarts it, it should still identify as the same instance number, so that two or more nodes don't query the same set of results.
What is the best way of solving this problem? Is there a solution which involves pure configuration through ApplicationManifest.xml or ServiceManifest.xml?
There is no out-of-the-box solution for your problem, but it can easily be done in many different ways.
The simplest way is using the Queue-Based Load Leveling pattern in conjunction with Competing Consumers pattern.
It consists of creating a queue and adding the work to it; each instance takes one message at a time to process the work, and if an instance goes down and the message is not processed, the message goes back to the queue and another instance picks it up.
This way you don't have to worry about the number of instances running, failures, and so on.
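A minimal sketch of the competing-consumers loop (the WorkItem type and the in-memory BlockingQueue are stand-ins; a real deployment would use a durable queue such as Azure Service Bus or Storage Queues):

```java
import java.util.concurrent.BlockingQueue;

// Competing-consumers worker: run one of these per service instance.
class Worker implements Runnable {
    private final BlockingQueue<WorkItem> queue;

    Worker(BlockingQueue<WorkItem> queue) { this.queue = queue; }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            WorkItem item = null;
            try {
                item = queue.take();   // each item is delivered to exactly one instance
                process(item);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // With a real broker you would abandon the message instead,
                // letting its visibility timeout return it to the queue.
                if (item != null) queue.offer(item);
            }
        }
    }

    private void process(WorkItem item) { /* query this item's slice of records */ }
}

class WorkItem { String payload; }
```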
Regarding the work being put in the queue, it depends on whether you want to do batch processing or process item by item.
Item by item, you put one message in the queue for each item to be processed. This is a simple way to handle the work, and each instance processes one message at a time, or multiple messages in parallel.
In batch, you put a message that represents a list of items to be processed, and each instance processes that batch until it's completed. This is a bit trickier, because you might have to track the progress of the work being done, so that in case of failure the next attempt can continue from where it stopped.
The queue approach is a reactive design: the work needs to be put in the queue to trigger the processing. If you want a proactive approach and need to keep track of which work goes to whom, you are probably better off using some other approach, like a leasing mechanism, where each instance acquires a lease that belongs to it until it releases the lease. This is more suitable when you work with partitioned data or some other mechanism where you can easily split the load.
Regarding the issue with the ID, an option would be the InstanceId of the replica you are on, which you can reach via StatelessService.Context.InstanceId. It is not a sequential ID, but a random number. It is better than using the node ID, because you might have multiple partitions on the same node and their IDs would conflict with each other.
If you decide to use named partitions, you could encode an order in the partition name instead, so that each partition has a sequential name.
It's worth mentioning that Service Fabric has a limitation that doesn't allow a service to have multiple replicas on the same node; because of this limitation you might have to design your services with it in mind, otherwise you won't be able to scale out once the limit is reached. Also, the same thread has some discussion about approaches to processing multiple distributed items that might give you some ideas.