Uber Cadence Local Activity vs Side Effect

What are some key differences between Local Activities and Side Effects? On the surface the two appear similar, with Local Activity looking like a superset.
When should a user prefer Side Effect over Local Activity?

SideEffect doesn't support any error handling. If it fails, it essentially blocks the workflow execution (by panicking the decision task). It also executes in the same goroutine as the workflow code.
LocalActivity executes in a separate goroutine and supports error handling, including automatic retries through RetryOptions.
So use SideEffect only for very short-lived operations that are not expected to fail, or whose failure may acceptably block the workflow execution. UUID generation is a good example of such an operation.
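To make the contrast concrete, here is a minimal sketch using the Cadence Java client (the Go client's workflow.SideEffect and workflow.ExecuteLocalActivity are the analogous APIs). The MyActivities interface and its fetchConfig method are hypothetical:

import com.uber.cadence.activity.LocalActivityOptions;
import com.uber.cadence.common.RetryOptions;
import com.uber.cadence.workflow.Workflow;
import java.time.Duration;
import java.util.UUID;

// Inside a workflow implementation:

// SideEffect: no error handling, no retries; keep it short and unlikely to fail.
UUID id = Workflow.sideEffect(UUID.class, () -> UUID.randomUUID());

// LocalActivity: runs outside the workflow thread and supports automatic
// retries through RetryOptions.
LocalActivityOptions options = new LocalActivityOptions.Builder()
    .setScheduleToCloseTimeout(Duration.ofSeconds(10))
    .setRetryOptions(new RetryOptions.Builder()
        .setInitialInterval(Duration.ofMillis(100))
        .setMaximumAttempts(3)
        .build())
    .build();
MyActivities activities = Workflow.newLocalActivityStub(MyActivities.class, options);
String result = activities.fetchConfig(); // retried up to 3 times on failure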

Related

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues on the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains, for all domains the single BE server is registered as a worker.
What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the meantime, nothing is visible in the backend. However, during this time, if I check the pollers in the Cadence web client, I see that the task list is there but is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, since nothing can make progress on the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is this managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow fails (although I would expect that to generate something in the backend log from the Cadence client), or that at that point the worker is already removed from the list and that is what causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know about how a worker or task list can lose its ability to be a decision handler
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So it will also happen if, for some other reason, the worker stops polling for decision tasks.
As I understand when the stickiness is lost, cadence will check for another worker to replay the workflow and continue
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design flaw; we are working on separating them).
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and the worker accepted it, but didn't respond back within the timeout.
It's very likely that there is a bug in your query handler logic. The bug causes the decision worker to crash (which means the Cadence Java client also has a bug: user code crashing shouldn't crash the worker). A query task then loops over all instances of your worker pool and finally crashes all of your decision workers.
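If that is the case, a defensive step while you investigate is to make sure the query handler can never throw into the framework. A minimal, hypothetical sketch with the Cadence Java client (the interface, state field, and messages are made up for illustration):

import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.WorkflowMethod;

public interface MyWorkflow {
    @WorkflowMethod
    void run();

    @QueryMethod
    String currentState();
}

public class MyWorkflowImpl implements MyWorkflow {
    private String state = "STARTED";

    @Override
    public void run() {
        state = "RUNNING";
        // ... workflow logic ...
        state = "DONE";
    }

    // Query handlers should be fast, read-only, and non-throwing.
    @Override
    public String currentState() {
        try {
            return state; // real handlers often compute derived state, which can throw
        } catch (RuntimeException e) {
            return "query failed: " + e.getMessage();
        }
    }
}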

How to stop or suspend and restart the "service delay" or "delay" blocks from the agent-based diagram?

By following your advice I’m constructing small models to learn how to use AnyLogic and build my simulation.
I need a discrete-event diagram interacting with an agent-based one, where the agent-based part represents a "service process". Based on a previous recommendation it was straightforward to trigger the agent-based activity, but I cannot stop, suspend, or delay the "delay" block. I tried to use the "until stopDelay() is called" option but could not make it work, and I also tested a cyclic event inside the discrete-event agent, without success. I am considering that maybe my approach is not correct and I need a different strategy to stop the discrete-event process while the agent-based process is running; however, since the agent-based part attempts to simulate some human behaviour, I'm interested in the time variations this could cause in the discrete-event process. So my question is: how do I stop or suspend the "service delay" or "delay" blocks and restart them from the agent-based diagram?
If you just need to store an entity somewhere until the agent process is done, I would recommend using a 'wait' block instead of a 'delay'. The whole point of a delay is to have a timed exit, so suspending it doesn't align with its intended use case. You can read more about the 'wait' block here.
I found, in the Job Shop model example, some blocks using stopDelayForAll() inside an "if" code block. I noticed it was using a parameter, so I made some changes; the code I'm using, which works, is this:
if (Inqueue >= queCap)
    delay.stopDelayForAll();
"Inqueue" is a variable capturing data from the delay block and queCap is a parameter telling the queue block capacity.

Apache Beam CloudBigtableIO read/write error handling

We have a Java-based Dataflow pipeline which reads from Bigtable and, after some processing, writes data back to Bigtable. We use CloudBigtableIO for these purposes.
I am trying to wrap my head around failure handling in CloudBigtableIO. I haven't found any references/documentation on how errors are handled inside and outside CloudBigtableIO.
CloudBigtableIO has a bunch of options in BigtableOptionsFactory which specify timeouts, gRPC codes to retry on, and retry limits.
1. google.bigtable.grpc.retry.max.scan.timeout.retries - is this the retry limit only for scan operations, or does it cover Mutation operations as well? If it is scan-only, how many retries are done for Mutation operations, and is that configurable?
2. google.bigtable.grpc.retry.codes - do these codes enable retries for both Scan and Mutate operations?
3. Customizing these options only enables retries; could there be cases where CloudBigtableIO reads less data than requested without failing the pipeline?
4. When mutating a few million records, I think it is possible to get errors beyond the retry limits. What happens to such mutations? Do they simply fail? How do we handle them in the pipeline? BigQueryIO has a function that collects failures and provides a way to retrieve them through a side output; why doesn't CloudBigtableIO have such a function?
We occasionally get DEADLINE_EXCEEDED errors while writing mutations, but the logs are not clear about whether the mutations were retried and eventually succeeded or the retries were exhausted. I do see RetriesExhaustedWithDetailsException, but that is of no use if we are not able to handle the failures.
Are these failures thrown back to the preceding step in the Dataflow pipeline if the preceding step and the CloudBigtableIO write are fused? With bulk mutations enabled it is not really clear how failures are thrown back to preceding steps.
For question 1, I believe google.bigtable.mutate.rpc.timeout.ms would correspond to mutation operations, though the Javadoc notes that the feature is experimental. google.bigtable.grpc.retry.codes allows you to add retry codes beyond those set by default (the defaults include DEADLINE_EXCEEDED, UNAVAILABLE, ABORTED, and UNAUTHENTICATED).
You can see an example of the configuration getting set for mutation timeouts here: https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-client-core-parent/bigtable-hbase/src/test/java/com/google/cloud/bigtable/hbase/TestBigtableOptionsFactory.java#L169
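For reference, these keys can be passed through the connector's configuration builder. A minimal sketch, assuming the Beam CloudBigtableIO connector; the project, instance, and table IDs and the chosen values are hypothetical:

import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;

CloudBigtableTableConfiguration config =
    new CloudBigtableTableConfiguration.Builder()
        .withProjectId("my-project")
        .withInstanceId("my-instance")
        .withTableId("my-table")
        // mutation RPC timeout (noted as experimental in the Javadoc)
        .withConfiguration("google.bigtable.mutate.rpc.timeout.ms", "60000")
        // number of retries after a SCAN timeout
        .withConfiguration("google.bigtable.grpc.retry.max.scan.timeout.retries", "5")
        .build();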
google.bigtable.grpc.retry.max.scan.timeout.retries:
It only sets the number of times to retry after a SCAN timeout.
Regarding retries on mutation operations
This is how Bigtable handles operation failures.
Regarding your question about handling errors in the pipeline:
I see that you are already aware of RetriesExhaustedWithDetailsException. Keep in mind that in order to retrieve the detailed exception for each failed request, you have to call RetriesExhaustedWithDetailsException#getCauses().
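A minimal sketch of pulling those details out when a write fails (the try block stands in for wherever your code applies mutations):

import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;

try {
    // ... apply mutations ...
} catch (RetriesExhaustedWithDetailsException e) {
    // One entry per failed request: the row it targeted, the server
    // addressed, and the underlying cause.
    for (int i = 0; i < e.getNumExceptions(); i++) {
        System.err.printf("row=%s host=%s cause=%s%n",
                e.getRow(i), e.getHostnamePort(i), e.getCause(i));
    }
}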
As for the failures, Google documentation states:
" Append and Increment operations are not suitable for retriable batch
programming models, including Hadoop and Cloud Dataflow, and are
therefore not supported inputs to CloudBigtableIO.writeToTable.
Dataflow bundles, or a group of inputs, can fail even though some of
the inputs have been processed. In those cases, the entire bundle will
be retried, and previously completed Append and Increment operations
would be performed a second time, resulting in incorrect data."
Some documentation that you may consider helpful:
Cloud Bigtable Client Libraries
Source code for Cloud Bigtable HBase client for Java
Sample code for use with Cloud BigTable
Hope you find the above helpful.

UVM shared variables

I have a question regarding UVM. Suppose I have a DUT with two interfaces, each with its own agent, generating transactions on the same clock. These transactions are handled with analysis imports (and write functions) on the scoreboard. My problem is that both of these transactions read/modify shared variables of the scoreboard.
My questions are:
1) Do I have to guarantee mutual exclusion explicitly through a semaphore? (I suppose yes.)
2) Is this, in general, a correct way to proceed?
3) And the main problem: can the order of execution somehow be fixed?
Depending on that order, the values of the shared variables can change, generating inconsistency. Moreover, that order is fixed by the specification.
Thanks in advance.
While SystemVerilog tasks and functions do run concurrently, they do not run in parallel. It is important to understand the difference between parallelism and concurrency; it has been explained well here.
So while a SystemVerilog task or function may execute concurrently with another task or function, it does not actually run at the same time (in the same run-time context). The SystemVerilog scheduler keeps a list of all the tasks and functions that need to run at the same simulation time, and it executes them one by one (sequentially) on the same processor (concurrency), not together on multiple processors (parallelism). As a result, mutual exclusion is implicit and you do not need semaphores on that account.
The sequence in which two such concurrent functions execute is not deterministic, but it is repeatable. So when you execute a testbench multiple times on the same simulator, the sequence of execution will be the same. But two different simulators (or two versions of the same simulator) could execute these functions in a different order.
If the specification requires a certain order of execution, you need to ensure that order by making one of these tasks/functions wait on the other. In your scoreboard example, since you are using analysis ports, you will have two "write" functions (perhaps declared via the uvm_analysis_imp_decl macro) executing concurrently. To ensure an order (since functions cannot wait), you can fork join_none threads and make one thread wait on the other by introducing an event that is triggered at the conclusion of the first thread; the other thread waits for this event at its start.
This is a pretty difficult problem to address. If you get 2 transactions in the same time step, you have to be able to process them regardless of the order in which they get sent to your scoreboard. You can't know for sure which monitor will get triggered first. The only thing you can do is collect the transactions and at the end of the time step do your modeling/checking/etc.
Semaphores only help you if you have concurrent threads that take (simulation) time that are trying to access a shared resource. If you get things from an analysis port, then you get them in 0 time, so semaphores won't help you here.
So to my understanding, the answer is: the compiler/vendor/UVM cannot ensure the order of execution. If you need to enforce an order between things that happen in the same time step, you need to use a semaphore correctly to make it work the way you want.
Another thing: only you yourself know which one must execute after the other when they occur at the same simulation time.
This is a classical race condition, where the result depends upon the actual thread order...
First of all, you have to decide whether the write race is problematic for you and/or whether there is a priority order in this case. If you don't care, the last access wins.
If the access isn't atomic, you might need a semaphore to ensure that only one access is handled at a time and the next waits until the first has finished.
You can also try to control the order by changing the structure, by introducing thread ordering (wait_order), or, if possible, by removing timing altogether (instead of directly operating on the data you receive, you store it for some time and operate on it later).

Is it required to use spin_lock inside tasklets?

As far as I know, inside an interrupt handler there is no need for synchronization: the interrupt handler cannot run concurrently with itself; in short, preemption is disabled in the ISR. However, I have a question regarding tasklets. As I understand it, tasklets run in interrupt context, so in my opinion there is no need for a spinlock inside a tasklet function. However, I am not sure about this. Can somebody please explain? Thanks for your replies.
If data is shared between the top half and the bottom half, then use a lock. Simple rules for locking: locks are meant to protect data, not code.
1. What to protect?
2. Why to protect it?
3. How to protect it.
Two tasklets of the same type do not ever run simultaneously. Thus, there is no need to protect data used only within a single type of tasklet. If the data is shared between two different tasklets, however, you must obtain a normal spin lock before accessing the data in the bottom half. You do not need to disable bottom halves because a tasklet never preempts another running tasklet on the same processor.
For synchronization between code running in process context (A) and code running in softirq context (B), we need to use special locking primitives. We must use spinlock operations augmented with deactivation of bottom-half handlers on the current processor in (A), and only basic spinlock operations in (B). Using spinlocks makes sure that we don't have races between multiple CPUs, while deactivating the softirqs makes sure that we don't deadlock if the softirq is scheduled on the same CPU where we already acquired a spinlock. (c) Kernel docs