I am exploring Cadence and have a question on failure recovery. I understand that workflows are fault tolerant in case of workflow worker failure (workflow history is maintained). I couldn't find the same guarantees for activity workers. Example: say an activity makes an RPC call to service A, which changes some remote object state; now, let's assume that the call succeeded but the activity worker is lost before notifying the Cadence service. In this case, would Cadence schedule the activity again on a new worker?
I understand that the above may not be a problem if Service A is idempotent. What are the recommendations for handling the above scenario in Cadence if Service A is not idempotent?
Cadence by default doesn't retry activities. So in the scenario of an activity worker failure, the workflow gets a timeout error and can handle it according to its business logic. For non-idempotent activities this is usually done by running compensating activities.
Cadence also supports automatic retries for idempotent activities. This is done by providing a retry policy when invoking an activity.
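For illustration, here is a minimal sketch of attaching a retry policy to an activity invocation using the Cadence Java client. The workflow and activity interfaces and the callServiceA method are made up for this example, and builder names may vary slightly between client versions:

```java
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.common.RetryOptions;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

import java.time.Duration;

public class RetryExample {

    // Hypothetical activity interface: one activity that calls Service A.
    public interface ServiceAActivities {
        void callServiceA(String input);
    }

    // Hypothetical workflow interface.
    public interface MyWorkflow {
        @WorkflowMethod
        void run(String input);
    }

    public static class MyWorkflowImpl implements MyWorkflow {

        // Retry policy: only safe because the Service A call is assumed idempotent.
        private final ActivityOptions options = new ActivityOptions.Builder()
                .setScheduleToCloseTimeout(Duration.ofMinutes(5))
                .setRetryOptions(new RetryOptions.Builder()
                        .setInitialInterval(Duration.ofSeconds(1))
                        .setBackoffCoefficient(2.0)
                        .setMaximumAttempts(5)
                        .build())
                .build();

        private final ServiceAActivities activities =
                Workflow.newActivityStub(ServiceAActivities.class, options);

        @Override
        public void run(String input) {
            // If the activity worker crashes before reporting completion, the
            // activity times out and is re-scheduled according to the retry policy.
            activities.callServiceA(input);
        }
    }
}
```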
Cadence is a fault-tolerant, stateful code platform. How does Cadence handle faults in various failure conditions?
There are all kinds of failures in distributed systems, and Cadence provides various options for handling them.
Here is my list. It may not be complete, but I will try to add more as I think of them.
activity
Activity failure and retry. See https://cadenceworkflow.io/docs/concepts/activities/#timeouts
Also note that a long-running activity can recover from checkpoints via "heartbeat".
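As a rough illustration of heartbeat-based checkpointing with the Java client (the file-processing activity and its helpers are invented for this sketch, and the exact shape of getHeartbeatDetails may differ between client versions):

```java
import com.uber.cadence.activity.Activity;

// Hypothetical long-running activity implementation; it would be registered
// with a worker against a matching activity interface.
public class FileProcessingActivityImpl {

    public void processLargeFile(String fileName) {
        // On a retried attempt, resume from the checkpoint recorded by the
        // previous attempt's heartbeats instead of starting over.
        int startAt = Activity.getHeartbeatDetails(Integer.class).orElse(0);

        int total = totalChunks(fileName);
        for (int chunk = startAt; chunk < total; chunk++) {
            processChunk(fileName, chunk);
            // Record progress; this also lets the server detect a dead worker
            // via the heartbeat timeout and re-schedule the activity.
            Activity.heartbeat(chunk + 1);
        }
    }

    // Placeholders standing in for real chunked file processing.
    private int totalChunks(String fileName) { return 100; }
    private void processChunk(String fileName, int chunk) { /* ... */ }
}
```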
workflow
By design of the event-sourcing model, a workflow can recover to the point where it left off when a worker crashed. See https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism
A workflow can also have a retry policy, like an activity, to retry on failure automatically: https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
In certain scenarios the failure is caused by a bad code change that leads to wrong states. Cadence provides a "reset" tool to reset a workflow to any point in time.
See https://cadenceworkflow.io/docs/cli/#reset-and-restart
On top of reset, Cadence also allows you to reset by deployment. This is useful for resetting a large number of workflows (e.g., millions).
Cadence server cluster
Both activity and workflow workers are stateless.
The Cadence server is a highly available and scalable service that provides the durability.
The durability comes from the underlying design and the persistence storage (either Cassandra, MySQL, or Postgres).
In a single-cluster setup, the Cadence service runs as a set of independent shards. The whole cluster consists of multiple hosts, and any failed host can be replaced by another.
Cadence also provides cross-data-center replication for much higher availability: https://cadenceworkflow.io/docs/concepts/cross-dc-replication/#global-domains-architecture
Currently on application startup I'm deploying a single verticle and calling createHttpServer(serverOptions).
I've set up a request().connection().closeHandler to handle the closed-connection event, primarily so that when clients decide to cancel their request, we halt our execution of that request.
However, when I set up that handler in that same verticle, the closeHandler code only seems to execute once any synchronous code has finished executing and we're waiting on databases to respond via Futures and asynchronous handlers.
If instead of that, I deploy a worker verticle for each new HTTP request, it properly interrupts execution to execute the closeHandler code.
As I understand it, the HttpServer is already supposed to handle scalability of requests on its own since it can handle many at once without deploying new verticles. Essentially, this sounds like a hacky workaround that may affect our thread loads or things of that nature once our application is in full swing. So my questions are:
Is this the right way of doing this?
If not, what is the correct method or paradigm to follow?
How do you cancel the execution of a verticle from within the verticle itself, inside that closeHandler? And by cancel execution, I mean including any Futures waiting to be completed.
Why does closeHandler only execute asynchronously when using this multiple-verticle approach? Using the normal way and simply executing requests with the allotted thread pool postpones closeHandler's execution until the event loop finishes its queue; we need this to happen asynchronously.
I think you need to understand Vert.x better. Vert.x does not start and stop a thread per request. Verticles are long-lived and each handles multiple events during its lifetime, but never concurrently. Also, you should not deploy worker (or non-worker) verticles per request.
What you do is deploy a pool of verticles (worker and non-worker), and Vert.x divides the load between them. An HTTP server is placed in front; it receives requests and forwards them to the verticle(s) to be handled.
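To make that concrete, here is a minimal sketch (class name and port are arbitrary) of deploying several instances of one verticle; each instance starts an HTTP server on the same port and Vert.x distributes incoming connections among them:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.DeploymentOptions;
import io.vertx.core.Vertx;

public class ServerVerticle extends AbstractVerticle {

    @Override
    public void start() {
        vertx.createHttpServer()
                .requestHandler(req -> {
                    // Handle the request here.
                    req.response().end("ok");
                })
                .listen(8080);
    }

    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        // Deploy a pool of instances; Vert.x divides the load between them.
        vertx.deployVerticle(ServerVerticle.class.getName(),
                new DeploymentOptions().setInstances(4));
    }
}
```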
To stop processing a request, you need to keep a flag somewhere that is set when the connection is closed. Then you can check it in your processing code and stop. Just don't forget to clear the flag at the beginning of each request.
Deploying or undeploying verticles doesn't affect the thread count. Vert.x uses thread pools of a limited size.
Undeploying verticles is a means to downscale your service. Ideally, you shouldn't undeploy verticles at all. Deploying or undeploying does have a performance impact.
closeHandler, as I mentioned previously, is a callback method to release resources.
Vert.x's Future doesn't provide a means of cancellation. The reason is that even Java's Future.cancel() is a cooperative operation.
As a fix, passing a reference to an AtomicBoolean as suggested above and checking it before every synchronous step is probably the best way. You will still be blocked by synchronous operations, though.
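A sketch of that flag-based, cooperative cancellation in Vert.x 4 style (the database steps are placeholders invented for the example):

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;

import java.util.concurrent.atomic.AtomicBoolean;

// A per-request flag is set by closeHandler and checked before each step of
// the processing pipeline, so we stop early when the client goes away.
public class CancellableRequestVerticle extends AbstractVerticle {

    @Override
    public void start() {
        vertx.createHttpServer()
                .requestHandler(req -> {
                    AtomicBoolean closed = new AtomicBoolean(false);
                    req.connection().closeHandler(v -> closed.set(true));

                    queryDatabase()                       // placeholder async step
                            .compose(rows -> {
                                if (closed.get()) {
                                    // Client went away: stop the pipeline early.
                                    return Future.<String>failedFuture("request cancelled");
                                }
                                return enrich(rows);      // placeholder async step
                            })
                            .onSuccess(result -> req.response().end(result))
                            .onFailure(err -> {
                                if (!closed.get()) {
                                    req.response().setStatusCode(500).end();
                                }
                            });
                })
                .listen(8080);
    }

    // Placeholder async steps standing in for real database calls.
    private Future<String> queryDatabase() { return Future.succeededFuture("rows"); }
    private Future<String> enrich(String rows) { return Future.succeededFuture(rows); }
}
```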
I'd like to have more technical details about the underlying mechanisms of calling Actors in Azure Service Fabric, which I can't easily find online. Actors are known for their single-threaded scope, so until one of their method executions has fully completed, no other clients are allowed to call them.
To be more specific, I need to know what happens if an Actor is stuck for a while on a job initiated by one client call. How long are other clients supposed to wait until the job gets done? Seconds, minutes, hours?
Is there any time-out mechanism, and if so, is it somehow configurable?
What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does the ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
There are quite a few SO answers with details about actor mechanisms, and more in the docs; I can point you to a few:
This one does not exactly answer your question, but I described a bit how the locking works: Start a thread inside an Azure Service Fabric actor?
Q: Is there any time-out mechanism, and if so, is it somehow configurable?
Yes, there is a timeout; I have answered that here: Acquisition of turn based concurrency lock for actor '{actorName}' timed out after {time}
The configuration docs are located here.
Q: What happens if the node where the actor is located crashes? Would the client receive an immediate error, or does the ActorProxy somehow handle this situation and redirect the call to a newly created instance of the Actor on a healthy node?
Generally there is always a replica available when one replica goes down; new requests will start moving to the new replica once SF promotes a secondary replica to primary.
Regarding communication: by default, SF Actors use .NET Remoting the same way as Reliable Services, and the behaviour is described very well here. In summary, it retries transient failures; if the client can't connect to the service (Actor), it will retry until it reaches the connection timeout.
From the docs:
The service proxy handles all failover exceptions for the service partition it is created for. It re-resolves the endpoints if there are failover exceptions (non-transient exceptions) and retries the call with the correct endpoint. The number of retries for failover exceptions is indefinite. If transient exceptions occur, the proxy retries the call.
The Actor docs have more info; in summary, there are two points to keep in mind:
Message delivery is best effort.
Actors may receive duplicate messages from the same client.
That means that if a transient failure occurs while delivering a message, the proxy will retry even though the message may have already been delivered, causing duplicate messages.
I'm new to Service Fabric Reliable Actors technology and trying to figure out best practices for this specific scenario:
Let's say we have some legacy code that we want to call into new code built on SF Reliable Actors. Actors of a certain type, "ActorExecutor", are going to asynchronously call some third-party service that could sometimes get stuck for a pretty long time, longer than the actor's calling client is ready to wait, or even experience prolonged underlying communication issues. We do not want the client (legacy code) to get blocked by any sort of issue in ActorExecutor; it does not expect to receive any value or status back from the actor. Should we use an SF ReliableQueue for that? Should we use some sort of actor broker to receive requests from the client and store them in a queue: Client -> ActorBroker -> ActorExecutor? Could reminders be helpful here?
One more question in this regard: given that it's possible for many thousands of actors to be stuck in an incomplete third-party call at the same time, and we want to reactivate them and repeat their very last call, should we write a new tool for that? In NServiceBus you can create an error queue in MSMQ where all messages that failed ('unable to process') land, and then we were able to simply re-process them anytime in the future. From my understanding, there is no such thing in Service Fabric, and it's something we would need to build on our own.
An event-driven approach can help you here. Instead of waiting for the Actor to return from the call to a service, you can enqueue a task on it, requesting it to perform some action. The service-calling Actor would then function autonomously, processing items from its task queue. This allows it to perform retries and error handling. After a successful call, a new event can notify the rest of the system.
Maybe this project can help you to get started.
edits:
At this time, I don't believe you can use reliable collections in Actors. So a queue inside the state of an Actor is a regular (read-only) collection.
Process the queue using an Actor Timer. Don't use the thread pool, as it's not persistent and won't survive crashes and Actor garbage collection.
I was looking at two scenarios: A is OK, but about B I'm not sure.
Scenario A: Simulate application restart after commit, before dispatch
1. Start EventStore
2. Commit change
3. Event not dispatched
4. Stop EventStore
5. Start EventStore
The committed event is sent again in step 5. This works fine, and I can also see this in the dispatcher code.
Scenario B: Simulate bus error
1. Start EventStore
2. Commit change 1
3. Exception in dispatcher
4. Commit change 2
5. Dispatch OK
In this case I cannot find the behavior, and I also wonder whether it is a valid case: it could only happen if there was a bug in the bus code.
Is there a trigger that will retry the dispatch, or do I need to write code to handle this, or is my reasoning faulty?
Your assessment of Scenario A is correct: in a failure condition such as an application or machine restart or process termination, when the process starts up again it will discover the undispatched commits and push them to the dispatcher.
Scenario B is somewhat more tricky. The issue is that the EventStore is not a bus, so the question of how to handle errors in the bus isn't something that can be handled in the EventStore itself. Furthermore, because there are a number of bus implementations, I don't want to couple the EventStore to any particular implementation. Some users may not even use a message bus; they may decide to use RPC calls instead.
The question that you really have then is, how should bus failures--and by extension, queue failures--be handled? The EventStore has an interface IPublishCommits. When an event is committed it's then pushed to a dispatcher. The dispatchers are simply responsible for marking an event as dispatched once it has been properly and successfully handled by the implementation of IPublishCommits.
The best way to handle transient bus and queue failures would be to implement the circuit breaker pattern in your IPublishCommits implementation that retries until things start working again. For bigger issues, such as serialization failures, you may want to log some kind of critical failure that will notify an administrator immediately. Again, the sticky problem in all of this is that the EventStore cannot know about all of the specifics of your situation.
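Since NEventStore is .NET, the real thing would be a C# class implementing IPublishCommits; purely to illustrate the circuit-breaker idea, here is a language-agnostic sketch (written in Java, with all names hypothetical) of wrapping a publisher so that a misbehaving bus fails fast and the commit stays undispatched for a later retry:

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative circuit breaker around a commit publisher. All types here are
// made up; the real implementation would live in your IPublishCommits class.
public class CircuitBreakerPublisher {

    public interface CommitPublisher {
        void publish(Object commit) throws Exception;  // e.g. push onto the bus
    }

    private enum State { CLOSED, OPEN }

    private final CommitPublisher inner;
    private final int failureThreshold;
    private final Duration openInterval;

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    public CircuitBreakerPublisher(CommitPublisher inner, int failureThreshold, Duration openInterval) {
        this.inner = inner;
        this.failureThreshold = failureThreshold;
        this.openInterval = openInterval;
    }

    public void publish(Object commit) throws Exception {
        if (state == State.OPEN) {
            if (Instant.now().isBefore(openedAt.plus(openInterval))) {
                // Fail fast while the bus is known to be down; the commit is
                // never marked as dispatched and can be retried later.
                throw new IllegalStateException("bus unavailable, circuit open");
            }
            state = State.CLOSED;   // half-open: allow a trial call through
        }
        try {
            inner.publish(commit);
            consecutiveFailures = 0;
        } catch (Exception e) {
            if (++consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = Instant.now();
            }
            throw e;   // do not mark the commit as dispatched
        }
    }
}
```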