Automatic heartbeat of Cadence workflow/activity - cadence-workflow

We have registered the activities with the auto-heartbeat configuration EnableAutoHeartBeat: true and also set the activity option HeartbeatTimeout: 15Min in the activity implementation.
Do we still need to explicitly send heartbeats using activity.heartbeat(), or is it automatically taken care of by the go-client library?
If it's automatic, what will happen if the activity is waiting on an external API response with, say, a >15-minute delay?
What happens to the activity heartbeat if the worker executing the activity crashes or is killed?
Will Cadence retry the activities on heartbeat failures?

No, you don't need to; the SDK takes care of it with this config.
The auto heartbeat sends a heartbeat at a fixed interval, which is 80% of the heartbeat timeout (15 minutes in your case), so the activity won't time out as long as the activity worker is still alive.
You should still use a smaller heartbeat timeout, though; 10~20s is ideal.
The activity will fail with a "heartbeat timeout" error.
Yes, if you have set a retry policy.
See my other answer about retry policies:
How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?
Example
Let's say your activity implementation waits on an AWS SDK API call for up to 2 hours (the maximum API timeout configured).
You should still use 10-20s for the heartbeat timeout, and 2 hours for the activity's start-to-close timeout (see the sketch after this example).
The heartbeat timeout is for detecting that the host is no longer alive, so the activity can be restarted as early as possible.
Imagine this case: the API takes 2 hours, and the activity worker gets restarted during those 2 hours.
If the heartbeat timeout is 15 minutes, Cadence will only retry this activity after 15 minutes.
If the heartbeat timeout is 10s, Cadence can retry it after 10s, because it detects the heartbeat timeout within 10 seconds.
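For reference, here is a minimal Go sketch of this setup against the cadence go-client. The workflow/activity names and durations are illustrative, and the EnableAutoHeartBeat field is spelled as in the question:

package sample

import (
    "context"
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/activity"
    "go.uber.org/cadence/workflow"
)

// Long-running activity that waits on an external API (e.g. a 2-hour AWS SDK call).
func CallExternalAPI(ctx context.Context, input string) (string, error) {
    // ... call the external API here ...
    return "done", nil
}

func init() {
    // Auto-heartbeat: the go-client heartbeats in the background (at ~80% of
    // HeartbeatTimeout) while this activity runs on a live worker.
    activity.RegisterWithOptions(CallExternalAPI, activity.RegisterOptions{
        Name:                "CallExternalAPI",
        EnableAutoHeartBeat: true, // field name as spelled in the question
    })
}

func MyWorkflow(ctx workflow.Context) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Hour,    // how long one attempt may run
        HeartbeatTimeout:    15 * time.Second, // small, so a dead worker is detected quickly
        RetryPolicy: &cadence.RetryPolicy{ // needed so a heartbeat-timeout failure is retried
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: 3 * time.Hour, // overall budget across attempts
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)
    var out string
    return workflow.ExecuteActivity(ctx, CallExternalAPI, "input").Get(ctx, &out)
}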

Related

How to set proper timeout values for Cadence activities (local and regular activities, with or without retry)?

There are many timeout values:
For local activity:
ScheduleToClose timeout
For regular activity without retry:
ScheduleToStart timeout
ScheduleToClose timeout
StartToClose timeout
Heartbeat timeout
And then more values in retryOptions:
ExpirationInterval
InitialInterval
BackoffCoefficient
MaximumInterval
MaximumAttempts
And retryOptions can be applied to either a localActivity or a regular activity.
How do I use them together, and with what expectations?
TL;DR
The easiest way of using timeouts (a Go sketch of these settings follows this list):
Regular Activity with retry:
Use StartToClose as the timeout of each attempt
Leave ScheduleToStart and ScheduleToClose empty
If StartToClose is large (like 10 mins), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API regularly inside the activity.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty.
Regular Activity without retry:
Use ScheduleToClose for the overall timeout
Leave ScheduleToStart and StartToClose empty
If ScheduleToClose is large (like 10 mins), set the Heartbeat timeout to a smaller value like 10s, and call the heartbeat API regularly inside the activity.
LocalActivity without retry: Use ScheduleToClose for the overall timeout
LocalActivity with retry:
Use ScheduleToClose as the timeout of each attempt.
Use retryOptions.InitialInterval, retryOptions.BackoffCoefficient, retryOptions.MaximumInterval to control backoff.
Use retryOptions.ExpirationInterval as the overall timeout of all attempts.
Leave retryOptions.MaximumAttempts empty.
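A hedged Go sketch of the two "Regular Activity" recommendations above, using the go-client's workflow.ActivityOptions and cadence.RetryPolicy (the durations are placeholders):

package sample

import (
    "time"

    "go.uber.org/cadence"
    "go.uber.org/cadence/workflow"
)

// Regular activity WITH retry: per-attempt timeout plus an overall expiration.
func optionsWithRetry() workflow.ActivityOptions {
    return workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute, // timeout of each attempt
        HeartbeatTimeout:    10 * time.Second, // only needed because StartToClose is large
        // ScheduleToStartTimeout / ScheduleToCloseTimeout intentionally left empty
        RetryPolicy: &cadence.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    time.Minute,
            ExpirationInterval: time.Hour, // overall timeout across all attempts
            // MaximumAttempts intentionally left empty
        },
    }
}

// Regular activity WITHOUT retry: a single overall timeout is enough.
func optionsWithoutRetry() workflow.ActivityOptions {
    return workflow.ActivityOptions{
        ScheduleToCloseTimeout: 10 * time.Minute, // overall timeout
        HeartbeatTimeout:       10 * time.Second, // only needed because the timeout is large
    }
}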
More TL;DR
Because activities should be idempotent, every activity should set a retry policy.
Temporal sets an infinite retry policy for every activity by default; Cadence should do the same, IMO.
iWF also sets a default infinite retry for State APIs to match Temporal activities.
What and Why
Basics without Retry
Things are easier to understand in a world without retry, because that's where Cadence started.
ScheduleToClose timeout is the overall end-to-end timeout from a workflow's perspective.
ScheduleToStart timeout is the time allowed for an activity worker to pick up and start an activity. If it's exceeded, the activity returns a ScheduleToStart timeout error/exception to the workflow.
StartToClose timeout is the time allowed for an activity to run. If it's exceeded, a StartToClose timeout error is returned to the workflow.
Requirements and defaults:
Either ScheduleToClose is provided, or both ScheduleToStart and StartToClose are provided.
If only ScheduleToClose is provided, ScheduleToStart and StartToClose default to it.
If only ScheduleToStart and StartToClose are provided, then ScheduleToClose = ScheduleToStart + StartToClose.
All of them are capped by workflowTimeout. (e.g., if workflowTimeout is 1 hour, setting ScheduleToClose to 2 hours still yields 1 hour: ScheduleToClose = Min(ScheduleToClose, workflowTimeout))
So why do they all exist?
You may notice that ScheduleToClose is only useful when ScheduleToClose < ScheduleToStart + StartToClose, because if ScheduleToClose >= ScheduleToStart + StartToClose, the ScheduleToClose timeout is already enforced by the combination of the other two and becomes meaningless.
So the main use case of ScheduleToClose being less than the sum of the other two is when people want to limit the overall timeout of the activity but allow more time for ScheduleToStart or StartToClose. This is an extremely rare use case.
The main use case for distinguishing ScheduleToStart from StartToClose is that the workflow may need to do some special handling for a ScheduleToStart timeout error. This is also a very rare use case.
That's why, in the TL;DR, I recommend using only ScheduleToClose and leaving the other two empty: only in rare cases would you need them. If you can't think of the use case, you don't need them.
LocalActivity doesn't have ScheduleToStart/StartToClose because it's started directly inside the workflow worker, without server-side scheduling involved.
Heartbeat timeout
Heartbeating is very important for long-running activities, to prevent them from getting stuck. Not only bugs can cause an activity to get stuck; regular deployments, host restarts, and failures can too. Without heartbeats, the Cadence server can't know whether or not the activity is still being worked on. See more details here: Solutions to fix stuck timers / activities in Cadence/SWF/StepFunctions
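As a rough Go sketch of manual heartbeating inside a long-running activity (the activity name and the chunked work are made up; activity.RecordHeartbeat is the go-client call for reporting liveness and progress):

package sample

import (
    "context"
    "time"

    "go.uber.org/cadence/activity"
)

// ProcessLargeFile heartbeats as it works, so the Cadence server can detect a
// dead worker within roughly one HeartbeatTimeout and reschedule the activity.
func ProcessLargeFile(ctx context.Context, chunks []string) error {
    for i, chunk := range chunks {
        // Report progress; the recorded details are available to the next
        // attempt (as heartbeat details) if this attempt is retried.
        activity.RecordHeartbeat(ctx, i)

        if err := processChunk(ctx, chunk); err != nil {
            return err
        }
    }
    return nil
}

func processChunk(ctx context.Context, chunk string) error {
    // Placeholder for real work; keep each unit well under the heartbeat timeout.
    time.Sleep(100 * time.Millisecond)
    return nil
}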
RetryOptions and Activity with Retry
First of all, RetryOptions here is for server-side backoff retry, meaning that the retry is managed automatically by Cadence without interacting with the workflow. Because the retry is managed by Cadence, the activity has to be specially handled in Cadence history: the started event cannot be written until the activity is closed. Here is some reference: Why an activity task is scheduled but not started?
In fact, a workflow can do client-side retry on its own, meaning the workflow manages the retry logic. You can write your own retry function, or use a helper function in the SDK, like Workflow.retry in Cadence-java-client. Client-side retry shows all start events immediately, but a single retried activity produces many events in the history. It's not recommended because of the performance impact.
So what do the options mean:
ExpirationInterval:
It replaces the ScheduleToClose timeout and becomes the actual overall timeout of the activity across all attempts.
It's also capped to the workflow timeout, like the other three timeout options: ScheduleToClose = Min(ScheduleToClose, workflowTimeout)
The timeout of each attempt is StartToClose, but StartToClose defaults to ScheduleToClose as explained above.
ScheduleToClose will be extended to ExpirationInterval: ScheduleToClose = Max(ScheduleToClose, ExpirationInterval), and this happens before ScheduleToClose is used as the default for ScheduleToStart and StartToClose.
InitialInterval: the interval before the first retry
BackoffCoefficient: self-explanatory
MaximumInterval: the maximum interval between retries
MaximumAttempts: the maximum number of attempts. If set together with ExpirationInterval, retrying stops when either one is exceeded.
Requirements and defaults:
Either MaximumAttempts or ExpirationInterval is required. ExpirationInterval is set to workflowTimeout if not provided.
Since ExpirationInterval is always there, and in fact more useful, I would recommend just using ExpirationInterval unless you really need MaximumAttempts. Most of the time MaximumAttempts is harder to use, because it's easy to mess up in combination with BackoffCoefficient (e.g., when BackoffCoefficient > 1, the end-to-end time can become really large if you're not careful; the sketch below shows how quickly the intervals add up).
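To see how fast that grows, here is a tiny Go sketch of the standard exponential-backoff arithmetic (not the server's exact algorithm, and ignoring any MaximumInterval cap):

package main

import "fmt"

func main() {
    initial := 1.0     // InitialInterval, in seconds
    coefficient := 2.0 // BackoffCoefficient
    attempts := 10     // MaximumAttempts

    total, interval := 0.0, initial
    for i := 1; i < attempts; i++ { // backoff happens between attempts
        total += interval
        interval *= coefficient
    }
    // Prints 511: the backoff alone is 1+2+4+...+256 seconds (~8.5 minutes),
    // before counting the time spent inside each attempt.
    fmt.Printf("total backoff across %d attempts: %.0f seconds\n", attempts, total)
}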

How should I pick ScheduleToStartTimeout and StartToCloseTimeout values for ActivityOptions

There are four different timeout options in the ActivityOptions, and two of those are mandatory without any default values: ScheduleToStartTimeout and StartToCloseTimeout.
What considerations should be made when selecting values for these timeouts?
As mentioned in the question, there are four different timeout options in ActivityOptions, and the differences between them may not be super clear to a new Cadence user. Let’s first briefly explain what those are:
ScheduleToStartTimeout: This configuration specifies the maximum duration between the time the activity is scheduled by a workflow and the time it's picked up by an activity worker to start executing it. In other words, it configures the time a task spends in the queue.
StartToCloseTimeout: This one specifies the maximum time taken by an activity worker from the time it fetches a task until it reports the completion of it to the Cadence server.
ScheduleToCloseTimeout: This configuration specifies an end-to-end timeout duration for an activity, from the time it is scheduled by the workflow until it is completed by an activity worker.
HeartbeatTimeout: If your activity is a heartbeating activity, this configuration basically specifies the maximum duration the Cadence server would wait for a heartbeat before assuming the activity worker has failed.
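For illustration, a minimal Go sketch mapping these four options onto the go-client's workflow.ActivityOptions (the durations are only placeholders, not recommendations):

package sample

import (
    "time"

    "go.uber.org/cadence/workflow"
)

func exampleActivityOptions() workflow.ActivityOptions {
    return workflow.ActivityOptions{
        ScheduleToStartTimeout: time.Minute,      // max time waiting in the task queue
        StartToCloseTimeout:    5 * time.Minute,  // max time from pickup to reported completion
        ScheduleToCloseTimeout: 6 * time.Minute,  // end-to-end limit, from scheduling to completion
        HeartbeatTimeout:       10 * time.Second, // max gap between heartbeats
    }
}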
How to select a proper timeout value
Picking the StartToCloseTimeout is fairly straightforward when you know what it does. Essentially, you should make this long enough so that the activity can complete under normal circumstances. Therefore, you should account for everything that can affect the time taken by an activity worker: the latency of your downstream dependencies (i.e. services, networking, etc.). On the other hand, you should aim to keep this value as small as is feasible to make your end-to-end system more responsive. If you can't make this timeout less than a couple of minutes (ideally 1 minute or less), you should consider setting a HeartbeatTimeout and implementing heartbeating in your activity.
ScheduleToCloseTimeout is also easy to understand, but it is more common to face issues caused by picking a less-than-ideal value here. Therefore, it's worth taking a moment to pay some extra attention to this configuration.
Basically, you should consider everything that can create a backlog in the activity task queue. Some common events that contribute to a backlog are:
Reduced worker pool throughput due to deployments, maintenance or network-related issues.
Down-stream latency spikes that would increase the time it takes to complete each activity task, which then reduces the throughput of the worker pool.
A significant spike in the number of workflow instances that schedule the activity; especially if one of the upstream services is also an asynchronous queue/stream processor which can create its own backlog and suddenly start processing it at a very high volume.
Ideally, no activity should time out while waiting in the task queue, especially if the queue is backed up and the activity is configured to be retried, because the retries would add more activity tasks to the queue, making it harder to recover from the backlog or even making it worse. On the other hand, there are many use cases where business requirements really do limit the total time the system can take to process an activity. Therefore, it's usually not a bad idea to aim for as high a ScheduleToCloseTimeout value as the business requirements allow. Depending on your use case, it might not make sense to keep your activity in the queue for more than a few minutes, or it might be perfectly fine to keep it there for several days before timing out.

How does JMeter start sending requests to the server

If Threads: 100, Ramp-up: 1 and Loop count: 1 is the configuration, how will JMeter start sending requests to the server?
Will requests be sent at 1 req/sec, or will all requests be sent to the server at once?
JMeter will send requests as fast as it can, to wit:
It will start all threads (virtual users) you define in Thread Group within the ramp-up period (in your case - 100 threads in 1 second)
Each thread (virtual user) will start executing the Samplers present in the Thread Group from top to bottom (or according to the Logic Controllers)
When there are no more samplers to execute or loops to iterate, the thread will be shut down
When there are no more active threads left - JMeter test will end.
With regards to requests per second - it mostly depends on your application response time, i.e.
if you have 100 virtual users and response time is 1 second - you will get 100 requests/second
if you have 100 virtual users and response time is 2 seconds - you will get 50 requests/second
if you have 100 virtual users and response time is 500 milliseconds - you will get 200 requests/second
etc.
I would recommend increasing (and decreasing) the load gradually; this way you will be able to correlate the increasing load with throughput/response time/number of errors, while releasing all threads at once will not tell you the full story (unless you're doing a form of spike testing, in which case consider using a Synchronizing Timer).
JMeter's ramp-up period set as 1 means to start all 100 threads in 1 second.
These aren't recommended settings, as described below:
The ramp-up period tells JMeter how long to take to "ramp-up" to the full number of threads chosen. If 10 threads are used, and the ramp-up period is 100 seconds, then JMeter will take 100 seconds to get all 10 threads up and running. Each thread will start 10 (100/10) seconds after the previous thread was begun. If there are 30 threads and a ramp-up period of 120 seconds, then each successive thread will be delayed by 4 seconds.
Ramp-up needs to be long enough to avoid too large a work-load at the start of a test, and short enough that the last threads start running before the first ones finish (unless one wants that to happen).
Start with Ramp-up = number of threads and adjust up or down as needed.
See also Can i set ramp up period 0 in JMeter?
Bear in mind that with a low ramp-up and many threads, you may be limited by local resources, so your results may be a measurement of client capability rather than of the server.

Distributed timer service

I am looking for a distributed timer service. Multiple remote client services should be able to register for callbacks (via REST APIs) after specified intervals. The length of an interval can be 1 minute, and I can live with an error margin of around 1 minute. The number of such callbacks can go up to 100,000 for now, but I will need to scale up later. I have been looking at schedulers like Quartz, but I am not sure they are a fit for the problem. With Quartz, I would probably have to save the callback requests in a DB and poll every minute for overdue requests across 100,000 rows. I am not sure that will scale. Are there any out-of-the-box solutions around? If not, how do I go about building one?
One more option to consider is a message queue, where you publish a message with a scheduled delay so that consumers consume it after that delay.
Amazon SQS Delay Queues
Delay queues let you postpone the delivery of new messages in a queue for the specified number of seconds. If you create a delay queue, any message that you send to that queue is invisible to consumers for the duration of the delay period. You can use the CreateQueue action to create a delay queue by setting the DelaySeconds attribute to any value between 0 and 900 (15 minutes). You can also change an existing queue into a delay queue using the SetQueueAttributes action to set the queue's DelaySeconds attribute.
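Here is a short Go sketch of the SQS delay-queue approach with aws-sdk-go (the queue URL and payload are placeholders; note the 900-second cap on DelaySeconds):

package sample

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/sqs"
)

// scheduleCallback publishes a message that stays invisible to consumers for
// delaySeconds (0-900), so whoever polls the queue fires the callback after the delay.
func scheduleCallback(queueURL, payload string, delaySeconds int64) error {
    svc := sqs.New(session.Must(session.NewSession()))
    _, err := svc.SendMessage(&sqs.SendMessageInput{
        QueueUrl:     aws.String(queueURL),
        MessageBody:  aws.String(payload),
        DelaySeconds: aws.Int64(delaySeconds),
    })
    return err
}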
Scheduling Messages with RabbitMQ
https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/
A user can declare an exchange with the type x-delayed-message and then publish messages with the custom header x-delay expressing in milliseconds a delay time for the message. The message will be delivered to the respective queues after x-delay milliseconds.
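And a hedged Go sketch of the delayed-message-exchange approach with the amqp091-go client (exchange and routing-key names are made up; the plugin must be enabled on the broker):

package sample

import amqp "github.com/rabbitmq/amqp091-go"

// publishDelayed declares an x-delayed-message exchange and publishes a message
// that the plugin holds for delayMillis before routing it to the bound queues.
func publishDelayed(ch *amqp.Channel, body []byte, delayMillis int64) error {
    if err := ch.ExchangeDeclare(
        "callbacks", "x-delayed-message", // name, exchange type provided by the plugin
        true, false, false, false, // durable, autoDelete, internal, noWait
        amqp.Table{"x-delayed-type": "direct"}, // how messages are routed after the delay
    ); err != nil {
        return err
    }
    return ch.Publish("callbacks", "callback.due", false, false, amqp.Publishing{
        Headers: amqp.Table{"x-delay": delayMillis}, // delay in milliseconds
        Body:    body,
    })
}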
Out of the box solution
RocketMQ meets your requirements since it supports scheduled messages:
Scheduled messages differ from normal messages in that they won't be delivered until a provided time later.
You can register your callbacks by sending such messages:
Message message = new Message("TestTopic", "callback-payload".getBytes()); // the body carries your callback payload
message.setDelayTimeLevel(3); // level 3 = the 3rd entry of the broker's messageDelayLevel config (10s by default)
producer.send(message);
And then, listen to this topic to deal with your callbacks:
consumer.subscribe("TestTopic", "*");
consumer.registerMessageListener(new MessageListenerConcurrently() {...})
It does well in almost every way, except that the DelayTimeLevel options can only be defined before the RocketMQ server starts. This means that if your MQ server is configured with messageDelayLevel=1s 5s 10s, you simply cannot register a callback with delayIntervalTime=3s.
DIY
Quartz plus storage can build such a callback service, as you mentioned, but I don't recommend storing callback data in a relational DB: you want high TPS, and it's hard to build a distributed service on top of one without locks and transactions, which add complexity to the DB code.
I suggest storing callback data in Redis instead, because it performs better than a relational DB and its ZSET data structure suits this scenario well (see the sketch below).
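A rough Go sketch of that Redis ZSET approach with go-redis (the key name, polling loop and claim logic are illustrative; a production version would claim due members atomically, e.g. with a Lua script):

package sample

import (
    "context"
    "strconv"
    "time"

    "github.com/go-redis/redis/v8"
)

const callbackKey = "delayed:callbacks"

// schedule stores a callback payload with its due time as the sorted-set score.
func schedule(ctx context.Context, rdb *redis.Client, payload string, fireAt time.Time) error {
    return rdb.ZAdd(ctx, callbackKey, &redis.Z{
        Score:  float64(fireAt.Unix()),
        Member: payload,
    }).Err()
}

// pollDue fetches members whose due time has passed and fires them; run it in a loop (e.g. every few seconds).
func pollDue(ctx context.Context, rdb *redis.Client, fire func(payload string)) error {
    now := strconv.FormatInt(time.Now().Unix(), 10)
    due, err := rdb.ZRangeByScore(ctx, callbackKey, &redis.ZRangeBy{Min: "0", Max: now}).Result()
    if err != nil {
        return err
    }
    for _, payload := range due {
        // ZRem acts as a crude claim so each payload fires at most once per poller.
        if removed, _ := rdb.ZRem(ctx, callbackKey, payload).Result(); removed == 1 {
            fire(payload) // e.g. POST to the registered REST callback URL
        }
    }
    return nil
}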
I once developed a timed callback service based on Redis and Dubbo; it provides some more useful features. Maybe you can get some ideas from it: https://github.com/joooohnli/delay-callback

Timeout configurations in Curator

I create a Curator client as follows:
RetryPolicy retryPolicy = new RetryNTimes(3, 1000);
CuratorFramework client = CuratorFrameworkFactory.newClient(zkConnectString,
        15000, // sessionTimeoutMs
        15000, // connectionTimeoutMs
        retryPolicy);
When running my client program I simulate a network partition by bringing down the NIC that Curator is using to communicate with Zookeeper. I have a few questions based on the behavior that I am seeing:
I see a ConnectionStateManager - State change: SUSPENDED message after 10 seconds. Is the amount of time until Curator enters the SUSPENDED state configurable, based on a percentage of the other timeout values, or always 10 seconds?
I do not receive any notification after the configured 15-second session timeout has passed since the last successful heartbeat. I do see a ZooKeeper - Session: 0x14adf3f01ef0001 closed message in the log, however this does not appear to trickle up as an event that I can capture or listen on. Am I missing something here?
I eventually receive a ConnectionStateManager - State change: LOST message almost two minutes after the connection loss. Why so long?
If my goal is to use an InterProcessMutex as a means of preventing split-brain in an HA scenario, it seems that the safest approach is for the lock holder to assume that it has lost the lock when the SUSPENDED message is received, since it is entirely possible that Zookeeper has released the lock unbeknownst to it on the other side of the network partition. Is this a typical/sane approach?
It depends on which version of Curator you're using (note: I'm the main author of Curator)...
In Curator 2.x, the LOST state means that a retry policy has been exhausted. It does not mean that the session has been lost. In ZooKeeper, the session is only determined to be lost once the connection to the ensemble is repaired. So, you get SUSPENDED when Curator sees the first "Disconnected" message. Then, when an operation fails because the retry policy gives up, you get LOST.
In Curator 3.x the meaning of LOST was changed. In 3.x when the "Disconnected" is received Curator starts an internal timer. When the timer passes the negotiated session timeout Curator calls getTestable().injectSessionExpiration() and posts a LOST state change.
Correct. Assume leadership has been lost on SUSPENDED and LOST.
This is the way the Apache Curator recipes work.
You may want to use Apache Curator rather than implementing your own algorithm.
https://curator.apache.org/curator-recipes/index.html
Regarding the first question: ZooKeeper has a variable called MAX_SEND_PING_INTERVAL, which is 10 seconds, so it's always 10 seconds in your situation. The code is in the ClientCnxn class:
// 1000 (1 second) is to prevent a race condition missing to send the second ping;
// also make sure not to send too many pings when readTimeout is small
int timeToNextPing = readTimeout / 2 - clientCnxnSocket.getIdleSend() -
        ((clientCnxnSocket.getIdleSend() > 1000) ? 1000 : 0);
// send a ping request either when the time is due or when no packet has been sent out within MAX_SEND_PING_INTERVAL
if (timeToNextPing <= 0 || clientCnxnSocket.getIdleSend() > MAX_SEND_PING_INTERVAL) {
    sendPing();
    clientCnxnSocket.updateLastSend();
} else {
    if (timeToNextPing < to) {
        to = timeToNextPing;
    }
}