Stateless & stateful worker scenario - Celery

I have one scenario:
Say I have a worker in a distributed system whose job is to accept a task, apply business logic to that task, and pass the result on to some other service.
Say there can be 3 types of requests, i.e. the worker is able to differentiate between requests and apply the corresponding business logic to each. Is the worker stateful or stateless in this scenario?
From my observation, the worker is stateless: it doesn't save any information about the task, it doesn't care what operations were applied to the task earlier, nor does it care what will happen to it in the future. So there are no state-sharing issues. The worker just cares about processing the task with the corresponding business logic. The business logic is, for example, formatting the data, or parsing and converting the data so that it becomes consumable in the system.

Your workers are stateless. They don't hold any information in memory except the bare minimum required to send and receive data from other services. If a worker crashes, another worker can replace it seamlessly without needing to sync information from a persistent data store.
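For illustration, a minimal sketch of such a stateless worker as a Celery task; the broker URL, request types, and handler logic below are assumptions, not from the question:

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker URL

# Hypothetical handlers, one per request type the worker can differentiate.
def format_data(payload):
    return payload.strip().lower()

def parse_data(payload):
    return payload.split(',')

HANDLERS = {'format': format_data, 'parse': parse_data}

@app.task
def process(request_type, payload):
    # Everything the task needs arrives in the message itself;
    # the worker keeps no state between invocations.
    return HANDLERS[request_type](payload)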


Invoking a Cloud Run endpoint from within itself

Assume there is a Flask web server with two routes, deployed as a Cloud Run service on GKE.
from flask import Flask
app = Flask(__name__)

@app.route('/cpu_intensive', methods=['POST'], endpoint='cpu_intensive')
def cpu_intensive():
    # TODO: some actions, CPU intensive
    pass

@app.route('/batch_request', methods=['POST'], endpoint='batch_request')
def batch_request():
    # TODO: invoke cpu_intensive for each item in the batch
    pass
A "batch_request" is a batch of many same structured requests - each one is highly CPU intensive and handled by the function "cpu_intensive". No reasonable machine can handle a large batch and thus it needs to be paralleled across multiple replicas.
The deployment is configured that every instance can handle only 1 request at a time, so when multiple requests arrive CloudRun will replicate the instance.
I would like to have a service with these two endpoints, one to accept "batch_requests" and only break them down to smaller requests and another endpoint to actually handle a single "cpu_intensive" request. What is the best way for "batch_request" break down the batch to smaller requests and invoke "cpu_intensive" so that CloudRun will scale the number of instances?
make http request to localhost - doesn't work since the load balancer is not aware of these calls.
keep the deployment URL in a conf file and make a network call to it?
Other suggestions?
With more detail, it's now much clearer!
You have 2 responsibilities:
One to split -> many requests can be handled in parallel; not compute intensive.
One to process -> each request must be processed on a dedicated instance because of the compute-intensive work.
If your split service performs internal calls (to localhost, for example), you will stay on the same instance and parallelize nothing (you just multi-thread the same request on the same instance).
So, for this, you need 2 services:
one to split, which can accept several concurrent requests
one to process, and this time you need to set the concurrency parameter to 1 to be sure to accept only one request at a time.
To improve your design, and if the batch processing can be asynchronous (I mean, the split process doesn't need to know when the batch process is over), you can add Pub/Sub or Cloud Tasks in the middle to decouple the 2 parts, as sketched below.
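As a sketch of that decoupled variant, the split service could publish one Pub/Sub message per item; the project ID, topic name, and message shape below are illustrative assumptions:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'cpu-intensive-items')  # assumed names

def split_and_publish(batch):
    # One message per item; a push subscription (or the process service
    # pulling) then fans the work out across instances.
    for item in batch:
        future = publisher.publish(topic_path, data=json.dumps(item).encode('utf-8'))
        future.result()  # wait until Pub/Sub has accepted the message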
And if the processing requires more than 4 CPUs / 4 GB of memory, or takes more than 1 hour, use Cloud Run on GKE rather than Cloud Run managed.
Last word: if you don't use Pub/Sub, the best way is to set the batch-process URL in an environment variable of your split service so that it knows where to send the work.
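A minimal sketch of that environment-variable approach, assuming the split service fans out with plain HTTP calls (the PROCESS_URL variable and request shape are illustrative):

import os
from concurrent.futures import ThreadPoolExecutor
import requests

PROCESS_URL = os.environ['PROCESS_URL']  # hypothetical env var set on the split service

def process_one(item):
    # The call goes through the public endpoint, so Cloud Run sees the
    # load and can scale the process service out to more instances.
    resp = requests.post(f'{PROCESS_URL}/cpu_intensive', json=item)
    resp.raise_for_status()
    return resp.json()

def split_batch(batch):
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(process_one, batch))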
I believe for this use case it's much better to use GKE rather than Cloud Run. You can create two Kubernetes deployments, one for the batch_request app and one for the cpu_intensive app. The second one will be used as a worker for the batch_request app and will scale on demand when there are more requests to the batch_request app. I believe this is called a master-worker architecture, in which you separate your app's front end from intensive work or batch jobs.

High Scalability Question: How to sync data across multiple microservices

I have the following use case:
Assume you have two microservices, AccountManagement and ActivityReporting, that process event U.
When a user registers, event U containing the user information will be published to a broker for the two microservices to process.
The AccountManagement and ActivityReporting microservices are each replicated across two instances for performance and scalability reasons.
Each microservice instance has a consumer listening on the broker topic. A topic is used so that both AccountManagement and ActivityReporting can process U concurrently.
However, I want only one instance of AccountManagement to process event U, and one instance of ActivityReporting to process event U.
Please share your experience implementing a consume-once-per-application-group broker setup, as this would effectively solve the problem.
If all your consumer listeners, even those from different instances, have the same group.id property, then only one of them will receive each message. You need to set this property when you initialise the consumer. So in your case you will need one group.id for AccountManagement and another for ActivityReporting.
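A minimal sketch with the kafka-python client; the topic name, broker address, and group IDs are assumptions for illustration:

from kafka import KafkaConsumer

def handle_user_registered(payload):
    print('processing', payload)  # hypothetical business logic for event U

# Every AccountManagement instance uses the same group.id, so Kafka delivers
# each message to exactly one instance within that group; ActivityReporting
# runs the same code with its own group.id and also sees every message.
consumer = KafkaConsumer(
    'user-events',                       # assumed topic carrying event U
    bootstrap_servers='localhost:9092',  # assumed broker address
    group_id='account-management',
)

for message in consumer:
    handle_user_registered(message.value)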
I would recommend Cadence Workflow, which is a much more powerful solution for microservice orchestration.
It offers a lot of advantages over using queues for your use case:
Built-in exponential retries with an unlimited expiration interval
Failure handling. For example, it allows executing a task that notifies another service if both updates couldn't succeed within a configured interval.
Support for long-running, heartbeating operations
Ability to implement complex task dependencies. For example, chaining of calls, or compensation logic in case of unrecoverable failures (SAGA)
Complete visibility into the current state of the update. For example, when using queues all you know is whether there are messages in a queue, and you need an additional DB to track overall progress. With Cadence every event is recorded.
Ability to cancel an update in flight.
See the presentation that goes over the Cadence programming model.

Microservice data replication patterns

In a microservice architecture, we usually have two ways for 2 microservices to communicate. Let's say service A needs to get information from service B. The first option is a remote call, usually synchronous over HTTPS, so service A queries an API hosted by service B.
The second option is adopting an event-driven architecture, where the state of service B can be published and consumed by service A in an asynchronous way. Using this model, service A can update its own database with the information from service B's events, and all queries are made locally against this database. This approach has the advantage of better decoupling of microservices, from development through operations. But it comes with some disadvantages related to data replication.
The first one is the high consumption of disk space, since the same data can reside in the databases of every microservice that needs it. But the second one is worse in my opinion: data can become stale if service B can't process its subscription as fast as needed, or it may not be available to service A at the moment it's created in service B, given the eventual consistency of the model.
Let's say we're using Kafka as an event hub, and its topics are configured with 7 days of data retention. Service A is kept in sync as service B publishes its state. After two weeks, a new service C is deployed and its database needs to be enriched with all the information that service B holds. We can only get partial information from the Kafka topics, since the oldest events are gone. My question is: what patterns can we use to achieve this enrichment of the microservice's database (besides asking service B to republish all its current state to the event hub)?
There are 2 options:
You can enable log compaction in Kafka for an individual topic. That will keep the most recent value for a given key, discarding older updates. This saves space and also holds more data than the normal mode for a given retention period.
Assuming you take a backup of service B's DB on a daily basis, on introduction of a new service C you first create the initial state of C from the latest backup of B, and then replay the Kafka topic events from the particular offset that represents the data after the backup; see the sketch below.
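A hedged sketch of that replay step with kafka-python, assuming the offset was recorded when the backup of service B was taken (topic, partition, and offset values are illustrative):

from kafka import KafkaConsumer, TopicPartition

def apply_to_service_c_db(payload):
    print('applying', payload)  # hypothetical upsert into service C's database

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')  # assumed broker
partition = TopicPartition('service-b-state', 0)              # assumed topic/partition
backup_offset = 123456                                        # offset captured at backup time

consumer.assign([partition])
consumer.seek(partition, backup_offset)

for message in consumer:
    apply_to_service_c_db(message.value)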
Your concern is valid, but at the same time the microservices approach is give and take. You get loose coupling at the cost of an individual database for each service. There is no single right answer in microservices architecture; it really depends on what you are trying to achieve.
According to the CAP theorem you have to compromise between consistency and availability, and in most cases we go with eventual consistency. If your service A is not consistent with B, it eventually will be; that's the trade-off, made at the cost of availability.
Another thing regarding microservices is that you only keep a reference to data from another service, and maybe a very limited amount of the actual data, but definitely not much. And even that only if replicating the data makes your service independent and autonomous; if you can't achieve that even after replicating the data, then there is no point. E.g. your shipping service holds the complete history of order transitions, but your booking service only has the latest status of the order (e.g. in transit, on board, etc.). The user goes to booking and you show the current status of the order. If the user clicks for details, you fetch the full order-transition history from the shipping microservice. Now if at some point your shipping service goes down and your user comes to check the status, you at least have the current order status, even though you can't show the details, because the order status is replicated in the booking service.
Regarding new services joining the system at a later stage, event sourcing is the pattern to use for these kinds of scenarios. It's a complex pattern, but it will bring your newly added services to the state you want them to be in. You basically save all your events in an event store and replay them to attain the current state of the system, pre-populating service C's database with those events.
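A minimal, illustrative sketch of that replay idea; the in-memory list stands in for a durable event store, and all names are hypothetical:

from dataclasses import dataclass

@dataclass
class Event:
    entity_id: str
    kind: str      # e.g. 'UserRegistered', 'AddressChanged'
    payload: dict

EVENT_STORE = []  # stand-in for a durable event store

def replay(events):
    # Fold over the full history to rebuild current state for a new service.
    state = {}
    for event in events:
        entity = state.setdefault(event.entity_id, {})
        entity.update(event.payload)  # naive apply; real handlers vary by kind
    return state

# Service C pre-populates its database from the complete event history:
service_c_state = replay(EVENT_STORE)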

Stateful or Stateless service for processing servicebus queues

I have a session-enabled Azure Service Bus queue. I need some form of service that can read messages from the queue, process them, and save the result (in memory for later retrieval). We are using Azure Service Fabric in our current architecture. I have a few questions about which to choose: a stateful or a stateless service.
If I use a stateful service, then based on the documentation my understanding is that the service will run on 1 primary node (assuming 1 partition) and 2 active secondary nodes. That means if I have a 10-node Service Fabric cluster, this stateful service will primarily utilize only one node (VM).
So if I add a listener to this stateful service to read messages from the queue, the service on the primary node will read messages while the remaining 9 nodes won't be utilized. Is this correct?
Whereas if I use a stateless service, I can create instances on all 10 nodes, and all of them could listen for messages in the queue and process them in parallel. However, I will lose the option to save the results.
Please advise.
So if I add a listener to this stateful service to read messages from the queue, the service on the primary node will read messages while the remaining 9 nodes won't be utilized. Is this correct?
That is correct. In the stateful service scenario, only the primary replica will have its listener running and doing the work. Other replicas can be used in read-only mode, but they would not be writing anything into reliable collections.
Whereas if I use a stateless service, I can create instances on all 10 nodes, and all of them could listen for messages in the queue and process them in parallel.
Exactly. Stateless service instances can perform their work in parallel, and no state is persisted. That's also the reason why there are no reliable collections available for this Service Fabric model.
However, I will lose the option to save the results.
Not necessarily true. You could still save your data in a centralized/shared DB, just as you'd do with stateless solutions in the past (for example Cloud Services, or an Azure Web App).
What you should ask yourself is what problem you are solving. If you have data sharding, stateful makes more sense. If you don't have data sharding and/or you need to scale your processing power out rather than up, stateless is a better approach.

How to run something on each node in Service Fabric

In a Service Fabric application, using Actors or Services, what would the design be if you wanted to make sure that your block of code runs on each node?
My first idea would be that it has to be a service with instance count set to -1, but what about cases where you had set it to 3 instances? How would you design the service to ensure that it runs some operation on each instance?
My own idea would be to have an Actor with state controlling the operations that need to run; it would iterate over services using a ServiceProxy to call methods on each instance. But that's just a naive idea, and I don't know whether it's possible or the proper way to do it.
Some background info
Only Stateless services can be given a -1 for instance count. You can't use a ServiceProxy to target a specific instance.
Stateful services are deployed using 1 or more partitions (data shards). Partition count is configured in advance, as part of the service deployment and can't be changed automatically. For instance if your cluster is scaled out, partitions aren't added automatically.
Autonomous workers
Maybe you can invert the control flow by running stateless services (on all nodes) and having them query a 'repository' for work items. The repository could be a stateful service that stores work items in a queue.
This way, adding more instances (scaling out the cluster) increases throughput without code modification. The stateless service instances become autonomous workers.
(as opposed to an intelligent orchestrator Actor); see the sketch below.
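A language-agnostic sketch of that pull-based worker loop, written in Python purely for illustration; in Service Fabric the repository would be a stateful service exposing a reliable queue, and the client below is hypothetical:

import time

class WorkRepositoryClient:
    # Hypothetical client for the stateful 'repository' service.
    def __init__(self):
        self._queue = ['item-1', 'item-2', 'item-3']  # stand-in for a reliable queue

    def try_dequeue(self):
        return self._queue.pop(0) if self._queue else None

def worker_loop(repo):
    # Each stateless instance runs this loop independently; adding
    # instances (scaling out) simply drains the shared queue faster.
    while True:
        item = repo.try_dequeue()
        if item is None:
            time.sleep(1.0)  # back off while the queue is empty
            continue
        print('processing', item)  # hypothetical work

if __name__ == '__main__':
    worker_loop(WorkRepositoryClient())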