Is it safe to invoke a workflow activity from defer() within the workflow code? - cadence-workflow

func MyWorkflow(ctx workflow.Context) (retErr error) {
	log := workflow.GetLogger(ctx)
	log.Info("starting my workflow")
	defer func() {
		if retErr != nil {
			DoActivityCleanupError(ctx, ..)
		} else {
			DoActivityCleanupNoError(ctx, ...)
		}
	}() // note: the deferred func literal must be invoked with "}()", not "}"
	err := DoActivityA(ctx, ...)
	if err != nil {
		return err
	}
	...
	err = DoActivityB(ctx, ...)
	if err != nil {
		return err
	}
	return nil
}
Basically there are catch-all cleanup activities, ActivityCleanupNoError and ActivityCleanupError, that we want to execute whenever the workflow exits -- particularly in the error case, since we don't want to call ActivityCleanupError in every single error return.
Does this work with distributed decision making? For example, if ownership of workflow decisions moves from one worker to another, will that trigger the defer on the original worker?
Bonus question: Does the logger enforce logging only once per workflow run, even if decisions are moved from one worker to another? Would you expect to see the log line appear in both workers' logs, or is there magic behind the scenes to prevent this from happening?

Yes, it is.
But it's not obvious why it is safe. Here is how to reach that conclusion:
1. In no-sticky mode (without the sticky cache), the Cadence SDK always re-executes the workflow code to make (collect) workflow decisions, and then releases all the goroutines/stacks. When releasing them, the defer is executed, which means the cleanup-activity code path does run -- HOWEVER, those decisions are ignored, so this does not affect correctness.
2. In sticky mode, if the workflow is not closing, the Cadence SDK simply blocks somewhere in the workflow code; if the workflow is actually closing, the defer is executed and the decisions are collected.
3. When the sticky cache (goroutines/stacks) is evicted, exactly the same thing happens as in case 1, so it is also safe.
Does the logger enforce logging only once per workflow run, even if decisions are moved from one worker to another? Would you expect to see the log line appear in both workers' logs, or is there magic behind the scenes to prevent this from happening?
Each log line will only appear in the worker that actually executed the code while making the decision -- in other words, in non-replay mode. That's the only magic :)

Related

Reusing context.WithTimeout in deferred function

The below code snippet (reduced for brevity) from MongoDB's Go quickstart blog post creates context.WithTimeout at the time of connecting with the database and reuses the same for the deferred Disconnect function, which I think is buggy.
func main() {
	client, _ := mongo.NewClient(options.Client().ApplyURI("<ATLAS_URI_HERE>"))
	ctx, _ := context.WithTimeout(context.Background(), 10*time.Second)
	_ = client.Connect(ctx)
	defer client.Disconnect(ctx)
}
My train of thought:
context.WithTimeout sets a deadline in UNIX time at the point it is created.
So, passing it to Connect makes sense, as we want to cancel the process of establishing the connection if it exceeds the time limit (i.e., the derived UNIX time).
Now, passing the same ctx to the deferred Disconnect, which will most probably be called much later, means the ctx's deadline will already be in the past: the context is already expired when the function starts executing. This is not what is expected and breaks the logic, as the doc for Disconnect says:
If the context expires via cancellation,
deadline, or timeout before the in use connections have returned, the in use
connections will be closed, resulting in the failure of any in flight read
or write operations.
Please tell me if and how I am wrong and/or missing something.
Your understanding is correct.
In the example it is sufficient because the example just connects to the database, performs some example operation (e.g. lists databases), then main() ends, so running the deferred disconnect with the same context will cause no trouble (the example will/should run well under 10 seconds).
In "real-world" applications this won't be the case of course. So you will likely not use the same context for connecting and disconnecting (unless that context has no timeout).

How can I version Cadence workflows?

Cadence workflows are required to be deterministic, which means that a workflow is expected to produce the exact same results if it’s executed with the same input parameters.
When I learned the requirement above as a new Cadence user, I wondered how I can maintain workflows in the long run when determinism-breaking changes are required.
An example scenario: you have a workflow that executes Activity1 and Activity2 consecutively, and then you need to change the order of these activities so that the workflow executes Activity2 before Activity1. There are many other ways to make determinism-breaking changes like this, and I wanted to understand how to handle them.
This is especially important in cases where the workflows can run for long durations such as days, weeks, or even months!
Apparently, this is one of the most common questions a new Cadence developer asks. Cadence workflows are required to be deterministic algorithms. If a workflow algorithm isn't deterministic, Cadence workers are at risk of hitting nondeterministic-workflow errors when they try to replay the history (i.e., during worker failure recovery).
There are two ways to solve this problem:
1. Creating a brand-new workflow: This is the most naive approach to versioning workflows, and it is as simple as it sounds: anytime you need to change your workflow's algorithm, you make a copy of the original workflow, edit it the way you want, give it a new name like MyWorkflow_V2, and start using it for all new instances going forward. If your workflow is not very long-living, the existing workflows will "drain out" at some point and you'll be able to delete the old version altogether. On the other hand, this approach can turn into a maintenance nightmare very quickly, for obvious reasons.
2. Using the GetVersion() API to fork workflow logic: The Cadence client has a function named GetVersion, which tells you which version of the workflow is currently running. You can use the value it returns to decide which version of your workflow algorithm to run. In other words, your workflow contains both the old and the new algorithm side by side, and you pick the right one for each workflow instance to ensure that it runs deterministically.
Below is an example of the GetVersion() based approach. Let’s assume you want to change the following line in your workflow:
err = workflow.ExecuteActivity(ctx, foo).Get(ctx, nil)
to
err = workflow.ExecuteActivity(ctx, bar).Get(ctx, nil)
This is a breaking change, since it runs the bar activity instead of foo. If you simply make that change without worrying about determinism, your workflows will fail whenever they need to replay, stuck with a nondeterministic-workflow error. The correct way to make this change is to update the workflow as follows:
v := GetVersion(ctx, "fooChange", DefaultVersion, 1)
if v == DefaultVersion {
	err = workflow.ExecuteActivity(ctx, foo).Get(ctx, nil)
} else {
	err = workflow.ExecuteActivity(ctx, bar).Get(ctx, nil)
}
The GetVersion function accepts 4 parameters:
ctx is the standard workflow context object.
"fooChange" is a human-readable change ID naming the semantic change you are making to your workflow algorithm that breaks determinism.
DefaultVersion is a constant that simply means Version 0; in other words, the very first version. It's passed as the minSupportedVersion parameter to the GetVersion function.
1 is the maxSupportedVersion that can be handled by your current workflow code. In this case, our algorithm can support workflow versions from DefaultVersion to Version 1 (inclusive).
When a new instance of this workflow reaches the GetVersion() call above for the first time, the function returns the maxSupportedVersion parameter, so you run the latest version of your workflow algorithm. It also records that version number in the workflow history (internally known as a Marker Event) so that it is remembered in the future. When replaying this workflow later, the Cadence client will keep returning the same version number even if you pass a different maxSupportedVersion parameter (i.e., if your workflow has gained even more versions).
If the GetVersion call is encountered during a history replay and the history doesn’t have a marker event that was logged earlier, the function will return DefaultVersion, with the assumption that the “fooChange” had never existed in the context of this workflow instance.
In case you need to make one more breaking change in the same step of your workflow, you simply need to change the code above like this:
v := GetVersion(ctx, "fooChange", DefaultVersion, 2) // Note the new max version
if v == DefaultVersion {
	err = workflow.ExecuteActivity(ctx, foo).Get(ctx, nil)
} else if v == 1 {
	err = workflow.ExecuteActivity(ctx, bar).Get(ctx, nil)
} else { // This is the Version 2 logic
	err = workflow.ExecuteActivity(ctx, baz).Get(ctx, nil)
}
When you are comfortable dropping support for Version 0, you change the code above like this:
v := GetVersion(ctx, "fooChange", 1, 2) // DefaultVersion is no longer supported
if v == 1 {
	err = workflow.ExecuteActivity(ctx, bar).Get(ctx, nil)
} else {
	err = workflow.ExecuteActivity(ctx, baz).Get(ctx, nil)
}
After this change, if your workflow code runs for an old workflow instance with the DefaultVersion version, Cadence client will raise an error and stop the execution.
Eventually, you’ll probably want to get rid of all previous versions and support only the latest one. One option is to drop the GetVersion call and the if statement altogether and keep a single line of code that does the right thing. However, it’s actually better to keep the GetVersion() call in there, for two reasons:
1. GetVersion() gives you a better idea of what went wrong if your worker attempts to replay the history of an old workflow instance. Instead of investigating the root cause of a mysterious nondeterministic-workflow error, you’ll know that the failure is caused by workflow versioning at this location.
2. If you need to make more breaking changes to the same step of your workflow algorithm, you’ll be able to reuse the same change ID and continue following the same pattern as above.
Considering the two reasons above, you should update your workflow code as follows when it’s time to drop support for all old versions:
GetVersion(ctx, "fooChange", 2, 2) // This acts like an assertion to give you a proper error
err = workflow.ExecuteActivity(ctx, baz).Get(ctx, nil)

How does a mutex guarantee ownership in FreeRTOS?

I'm playing with mutexes in FreeRTOS on an ESP32. In some documents I have read that a mutex guarantees ownership, meaning that if a thread (let's name it task_A) locks a critical resource (takes the token), other threads (task_B and task_C) will stay on hold, waiting for that resource to be unlocked by the same thread that locked it (task_A). I tried to test that by setting up the other tasks (task_B and task_C) to give a token before doing anything, and just after that to try to take a token from the mutex holder -- which surprisingly worked without showing any kind of error.
To verify and display how things work, I created a display function that reads events published (set and cleared) by each task (when a task is waiting it sets the waiting bit, when it's working it sets the working bit, etc., you get the idea), plus a simple printf() in case of an error in the take or give calls (xSemaphoreTake != true and xSemaphoreGive != true).
I can't use debug mode because I don't have any kind of microcontroller debugger.
This is an example of what I'm trying to do:
I created many tasks, and each one calls this function at a different time with a different setup.
void vVirtualResource(int taskId, int runTime_ms){
	int delay_tick = 10;
	int currentTime_tick = 0;
	int stopTime_tick = runTime_ms/portTICK_PERIOD_MS;
	if(xSemaphoreGive(xMutex)!=true){
		printf("Something wrong in giving first mutex's token in task id: %d\n", taskId);
	}
	while(xSemaphoreTake(xMutex, 10000/portTICK_PERIOD_MS) != true){
		vTaskDelay(1000/portTICK_PERIOD_MS);
	}
	// notify that the task with <<task id>> is currently running and using this resource
	switch (taskId)
	{
	case 1:
		xEventGroupClearBits(xMutexEvent, EVENTMASK_MUTEXTSK1);
		xEventGroupSetBits(xMutexEvent, EVENTRUN_MUTEXTSK1);
		break;
	case 2:
		xEventGroupClearBits(xMutexEvent, EVENTMASK_MUTEXTSK2);
		xEventGroupSetBits(xMutexEvent, EVENTRUN_MUTEXTSK2);
		break;
	case 3:
		xEventGroupClearBits(xMutexEvent, EVENTMASK_MUTEXTSK3);
		xEventGroupSetBits(xMutexEvent, EVENTRUN_MUTEXTSK3);
		break;
	default:
		break;
	}
	// start running the resource
	while(currentTime_tick<stopTime_tick){
		vTaskDelay(delay_tick);
		currentTime_tick += delay_tick;
	}
	// gives back the token
	if(xSemaphoreGive(xMutex)!=true){
		printf("Something wrong in giving mutex's token in task id: %d\n", taskId);
	}
}
You will notice that the very first task to run prints the first error message, because it can't give a token while the mutex holder still has one; that's normal, so I just ignore it.
I hope someone can explain to me, using code, how a mutex guarantees ownership in FreeRTOS. At first I didn't use the initial xSemaphoreGive call and it worked fine, but that doesn't prove it guarantees anything -- or maybe I'm not coding it right.
Thank you.
Your example is quite convoluted, and I also don't see the actual code of task_A, task_B or task_C, so I'll try to explain with a simpler example which hopefully shows how a mutex guarantees resource ownership.
The general approach to working with mutexes is the following:
void doWork()
{
	// attempt to take mutex
	if(xSemaphoreTake(mutex, WAIT_TIME) == pdTRUE)
	{
		// mutex taken - do work
		...
		// release mutex
		xSemaphoreGive(mutex);
	}
	else
	{
		// failed to take mutex for 'WAIT_TIME' amount of time
	}
}
The doWork function above may be called by multiple threads at the same time, so it needs to be protected. This pattern repeats for every function on the given resource that needs protection. If the resource is more complex, a good approach is to guard the top-most functions that are callable by threads, and then, if the mutex is successfully taken, call the internal functions that do the actual work.
The ownership guarantee you speak about is the fact that there may not be more than one context (threads, but also interrupts) that are under the if(xSemaphoreTake(mutex, WAIT_TIME) == pdTRUE) statement. In other words, if one context successfully takes the mutex, it is guaranteed that no other context will be able to also take it, unless the original context releases it with xSemaphoreGive first.
Now as for your scenario - while it is not entirely clear to me how it's supposed to work, I can see two issues with your code:
xSemaphoreGive at the beginning of the function - don't do that. A mutex is "given" by default, and you're not supposed to "give" it unless you were the one who "took" it first. Only ever put an xSemaphoreGive after a successful xSemaphoreTake, and nowhere else.
This code block:
while(xSemaphoreTake(xMutex, 10000/portTICK_PERIOD_MS) != true){
	vTaskDelay(1000/portTICK_PERIOD_MS);
}
If you need to wait longer for the mutex - specify a longer timeout. If you want to wait indefinitely, simply specify the longest possible time (0xFFFFFFFF, i.e. portMAX_DELAY). In your scenario, you poll for the mutex every 10 s and then delay for 1 s, during which the mutex isn't actually checked; that means there will be cases where the current thread has to wait almost a full second after the mutex is released by another thread before it starts doing work. Waiting for a mutex is already done optimally by the RTOS - it wakes the highest-priority task currently waiting for the mutex as soon as it's released - so there's no need to do more than necessary.
If I were to give advice on how to fix your example: simplify it and don't do more than needed, such as the extra calls to xSemaphoreGive or implementing your own wait loop for the mutex. Isolate the portion of code that performs the work into a separate function that makes a single call to xSemaphoreTake at the very top and a single call to xSemaphoreGive only if xSemaphoreTake succeeds. Then call this function from different threads to test whether it works.

Is code within DispatchQueue.global.async executed serially?

I have some code that looks like this:
DispatchQueue.global(qos: .userInitiated).async {
    self.fetchProjects()
    DispatchQueue.main.async {
        self.constructMenu()
    }
}
My question is: are the blocks within the global block executed serially? When I add print statements, they always execute in the same order, but I'm not sure if I'm just getting lucky, because the documentation says:
Tasks submitted to the returned queue are scheduled concurrently with respect to one another.
I wonder if anyone can shed any light on this?
EDIT:
Apologies, I don't think I made the question clear. I would like for the method constructMenu to only be called once fetchProjects has completed. From what I can tell (by logging print statements) this is the case.
But I'm not really sure why that's the case if what Apple's documentation above says (where each task is scheduled concurrently) is true.
Is code within an async block always executed serially, or is the fact that the code seems to execute serially a result of using DispatchQueue.main or is it just 'luck' and at some point constructMenu will actually return before fetchProjects?
I would like for the method constructMenu to only be called once fetchProjects has completed. From what I can tell (by logging print statements) this is the case.
Yes, this is the case.
But I'm not really sure why that's the case if what Apple's documentation above says (where each task is scheduled concurrently) is true.
Apple’s documentation is saying that two separate dispatches may run concurrently with respect to each other.
Consider:
DispatchQueue.global(qos: .userInitiated).async {
    foo()
}
DispatchQueue.global(qos: .userInitiated).async {
    bar()
}
In this case, foo and bar may end up running at the same time. This is what Apple means by “Tasks submitted to the returned queue are scheduled concurrently.”
But consider:
DispatchQueue.global(qos: .userInitiated).async {
    foo()
    bar()
}
In this case, bar will not run until we return from foo.
Is code within an async block always executed serially, or is the fact that the code seems to execute serially a result of using DispatchQueue.main or is it just ‘luck’ and at some point constructMenu will actually return before fetchProjects?
No luck involved. It will never reach the DispatchQueue.main.async line until you return from fetchProjects.
There is one fairly major caveat, though. This assumes that fetchProjects won’t return until the fetch is done. That means that fetchProjects better not be initiating any asynchronous processes of its own (i.e. no network requests). If it does, you probably want to supply it with a completion handler, and put the call to constructMenu in that completion handler.
Yes, blocks of code are submitted as blocks, and those blocks run sequentially. So for your example:
DispatchQueue.global(qos: .userInitiated).async {
    self.fetchProjects()
    DispatchQueue.main.async {
        self.constructMenu()
    }
}
fetchProjects() must complete before constructMenu is enqueued. There is no magic here. The block between {...} is submitted as a block, and at some point in the future it will be executed. The pieces of the block are not considered in any granular way: execution starts at the top, fetchProjects, and then the next line of code is executed, DispatchQueue.main.async, which accepts as a parameter another block. The compiler doesn't know anything about these blocks. It just passes them to functions, and those functions put them on queues.
DispatchQueue.global is a concurrent queue, which means separate tasks submitted to it may run at the same time. If you need tasks to run serially, create a custom serial queue:
let serial = DispatchQueue(label: "com.queueName")
serial.sync {
    ///
}

Safety of running assertions in a separate execution context

How isolated are different execution contexts from each other? Say we have two execution contexts, ec1 and ec2, both used on the same code path implementing some user journey. If, say, starvation and crashes start happening in ec2, would ec1 remain unaffected?
For example, consider the following scenario, where we want to make sure the user was charged only once by running an assertion inside a Future:
chargeUserF andThen { case _ =>
  getNumberOfChargesF map { num => assert(num == 0) }
    .andThen { case Failure(e) => logger.error("User charged more than once! Fix ASAP!", e) }
}
Here getNumberOfChargesF is not necessary to fulfil the user's request; it is just a side concern where we assert on the expected state of the database after it was mutated by chargeUserF. Because it is not necessary, I feel uneasy adding it to the main business logic, out of fear it could break that logic in some way. If I run getNumberOfChargesF on a different execution context from the one chargeUserF uses, can I assume that issues such as starvation, blocking, etc. caused by getNumberOfChargesF will not affect the main business logic?
Each execution context has its own thread pool, so, yeah... kinda.
They are "independent" in the sense that if one runs out of threads, the other might still keep going; however, they do share the same underlying resource (the CPU), so if one maxes it out, the other will obviously be affected.
They are also affected by each other's side effects. For example, the way your code is written, chargeUser and getNumberOfCharges happen in parallel, and there is no telling which one will finish first. So, if I am guessing the semantics right, the number of charges may end up being either 0 or 1 fairly randomly, depending on whether the previous future has completed or not.