UpdateTaskList operation failed with Cadence matching-service - cadence-workflow

The other day we ran into some issues with our Cadence setup. One of our machine instances saw its CPU usage climb to 90%, and all of the inbound workflow executions were stuck in the "Scheduled" state. After checking the logs, we noticed that the matching service was throwing the following error:
{
"level": "error",
"ts": "2021-03-20T14:41:55.130Z",
"msg": "Operation failed with internal error.",
"service": "cadence-matching",
"error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
"metric-scope": 34,
"logging-call-at": "persistenceMetricClients.go:872",
"stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"
}
After restarting the workflow, everything went back to normal, but we're still trying to figure out what happened. We were not running any heavy workload at the time of this event; it just happened suddenly. Our main suspicion is that the matching service lost connectivity to the Cassandra database during this event, and only after we restarted it was it able to reconnect. But this is just a hypothesis at the moment.
What might have caused this problem, and is there a way to prevent it from happening in the future? Maybe some dynamic config that we're missing?
PS: Cadence version is 0.18.3

This is a known issue in gocql that can be caused by several things:
1. Cassandra is overloaded and some nodes are not responsive. You may think your load is small, but the best way to check is through the Cadence metrics/dashboard; there is a section about persistence metrics. If this is the problem, you can tune the rate limiting to protect your Cassandra: matching.persistenceGlobalMaxQPS acts as a global rate limiter that overrides matching.persistenceMaxQPS (see the dynamic config sketch below).
2. A network issue or a bug in gocql. It's really frustrating. We recently decided to do the refreshing in this PR; hopefully this will be mitigated in the next release.
Also, if a matching node is running hot, you are probably hitting the limit of a single task list. If so, consider enabling the scalable tasklist feature.
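For reference, with a file-based dynamic config the overrides mentioned above might look roughly like the sketch below. This is only a sketch: the values are illustrative, the file path depends on your deployment, and the partition settings matter only if you enable the scalable tasklist feature; check the dynamic config reference for your Cadence version.

# config/dynamicconfig/production.yaml (path is illustrative)
matching.persistenceGlobalMaxQPS:
- value: 3000          # global cap across all matching hosts; size it to what Cassandra can absorb
  constraints: {}
matching.persistenceMaxQPS:
- value: 3000          # per-host cap; overridden by the global limiter above when both are set
  constraints: {}
# Only if you enable the scalable tasklist feature for a hot task list
# (these can also be scoped per domain/task list via constraints):
matching.numTasklistWritePartitions:
- value: 4
  constraints: {}
matching.numTasklistReadPartitions:
- value: 4
  constraints: {}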

Related

Discrepancy between Redshift data api DescribeStatement status and console status

I'm loading data into Redshift, which usually takes about an hour when successful, but it sometimes seems to time out at random. I keep getting a "STARTED" status from DescribeStatement calls for my query, but when I look in the console it says the query was ABORTED and rolled back via an "Undoing 1 transactions on table ..." statement. I'm not finding any errors in STL_LOAD_ERRORS related to the query, or anything useful in STL_UTILITYTEXT for that transaction, though the STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to update to a "FAILED" or "ABORTED" status when this occurred, but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via the Redshift Data API? I'm currently thinking of checking STL_UNDONE after a specified time (rough sketch of that idea below), but was hoping there's a better solution.
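Roughly what I have in mind, as a minimal sketch using the AWS SDK for Java v2 Redshift Data API; the statement id, deadline, and poll interval are illustrative, and the STL_UNDONE fallback would still need a separate database connection:

// Sketch: poll DescribeStatement, and if the status is still STARTED well past the
// expected load time, fall back to checking STL_UNDONE for a silent rollback.
import java.time.Duration;
import java.time.Instant;
import software.amazon.awssdk.services.redshiftdata.RedshiftDataClient;
import software.amazon.awssdk.services.redshiftdata.model.DescribeStatementRequest;
import software.amazon.awssdk.services.redshiftdata.model.DescribeStatementResponse;
import software.amazon.awssdk.services.redshiftdata.model.StatusString;

public class LoadMonitor {
    public static void main(String[] args) throws InterruptedException {
        String statementId = args[0];              // id returned by ExecuteStatement
        Duration deadline = Duration.ofHours(2);   // well past the normal ~1h load time
        Instant start = Instant.now();

        try (RedshiftDataClient client = RedshiftDataClient.create()) {
            while (true) {
                DescribeStatementResponse resp = client.describeStatement(
                        DescribeStatementRequest.builder().id(statementId).build());
                StatusString status = resp.status();

                if (status == StatusString.FINISHED) {
                    System.out.println("Load finished");
                    return;
                }
                if (status == StatusString.FAILED || status == StatusString.ABORTED) {
                    System.out.println("Load failed/aborted: " + resp.error());
                    return;
                }
                // Still SUBMITTED/PICKED/STARTED: past the deadline, treat it as suspect
                // and go check STL_UNDONE over a regular DB connection.
                if (Duration.between(start, Instant.now()).compareTo(deadline) > 0) {
                    System.out.println("Still " + status + " after deadline; check STL_UNDONE");
                    return;
                }
                Thread.sleep(60_000);              // poll once a minute
            }
        }
    }
}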
Statement timeout seems like a likely cause. What you are describing sounds like the connection was closed out from under the executing statement. There are a number of places this timeout can come from, but common ones are the cluster configuration and the WLM configuration.
Another possibility is a network timeout. Database connections stay open for the entirety of the session, but when a statement is in flight there is no activity on the connection. Some network equipment sees this, assumes something is wrong, and closes the connection, which closes the session, which aborts the transaction in flight.
If your issue is caused by the connection closing, you may be able to line things up in stl_sessions. There is info in there about timeouts, and you can also see whether the time the session closed is right when the query aborted (see the sketch after this answer).
This is just one area that could be causing your issue, but it is more common than people think.
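One way to do that lining-up, as a sketch over JDBC: the connection details are placeholders, it needs the Redshift (or PostgreSQL) JDBC driver on the classpath, and the stl_query/stl_sessions column names should be double-checked against the system-table docs for your cluster version.

// Sketch: list aborted queries from the last day together with the end time of the
// session that ran them. If session_end is essentially the same moment as query_end,
// the abort was most likely the connection/session closing, not the query itself.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SessionAbortCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:redshift://my-cluster.xxxx.us-east-1.redshift.amazonaws.com:5439/dev";
        String sql =
            "SELECT q.query, q.pid, q.endtime AS query_end, s.endtime AS session_end " +
            "FROM stl_query q " +
            "JOIN stl_sessions s " +
            "  ON s.process = q.pid " +
            " AND q.starttime BETWEEN s.starttime AND s.endtime " +
            "WHERE q.aborted = 1 " +
            "  AND q.endtime > dateadd(day, -1, getdate()) " +
            "ORDER BY q.endtime DESC";

        try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("query=%d pid=%d query_end=%s session_end=%s%n",
                        rs.getLong("query"), rs.getLong("pid"),
                        rs.getTimestamp("query_end"), rs.getTimestamp("session_end"));
            }
        }
    }
}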
So after escalating to AWS support, it was confirmed that there was a bug on their end, related to Data API autoscaling that was sometimes scaling down without waiting for outstanding tasks to complete. There's a temporary fix in place to avoid this happening while they implement a long-term solution, which should hopefully be rolled out by the end of this month, June 2022.

How to fix a circuit breaker timed out exception on spawning a persistent actor in Akka Scala?

Has anyone faced a circuit breaker timeout on Akka for persistent systems?
I believe this happens when Akka fails to reload saved snapshots of events. I tried bypassing this reload by tweaking the conf and adding snapshot-store-plugin-fallback.snapshot-is-optional = true to it. However, that doesn't do the trick either.
I am using a jdbc-journal with jdbc-snapshot-store as the snapshot-store.plugin.
Any help on this is appreciated!
Supervisor StopSupervisor saw failure: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Circuit Breaker Timed out.","context":"default","exception":"akka.persistence.typed.internal.JournalFailureException: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Circuit Breaker Timed out.
I also get this warning after the exception block.
{"logger":"org.mariadb.jdbc.HostAddress","message":"Aurora recommended connection URL must only use cluster end-point like \"jdbc:mariadb:aurora://xx.cluster-yy.zz.rds.amazonaws.com\". Using end-point permit auto-discovery of new replicas","context":"default"}
"Circuit Breaker Timed out" is not the actual exception; it's rather a symptom of another exception, the source exception. I would suggest looking at the database server logs for the actual exception that happened.
As Debasish Ghosh (whose book on Functional and Reactive Domain Modeling I highly recommend) notes, the circuit breaker exception is just a symptom of another failure.
From the warning, it seems likely that you've configured your journal and/or snapshots to point to a specific Aurora replica rather than the cluster end-point. If that specific replica isn't available, connections to it will fail and eventually trip the circuit breaker; the effect of that is to make attempts to read from or write to the journal fail fast (so as not to consume excessive resources and to give the DB a chance to recover if the failures were due to overload).
So for this specific issue, I suggest using the Aurora cluster end-point in your configuration, which should make an individual replica being unavailable a non-issue.
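For akka-persistence-jdbc, that would look roughly like the following sketch; the cluster end-point, credentials, and Slick profile are placeholders, and the exact keys depend on the plugin version you're running:

# application.conf (sketch; adjust to your akka-persistence-jdbc version)
akka.persistence {
  journal.plugin = "jdbc-journal"
  snapshot-store.plugin = "jdbc-snapshot-store"
}

jdbc-journal {
  slick = ${slick}
}
jdbc-snapshot-store {
  slick = ${slick}
}

slick {
  profile = "slick.jdbc.MySQLProfile$"   # Aurora MySQL/MariaDB via the MariaDB driver
  db {
    # Point at the *cluster* end-point, not an individual replica instance
    url = "jdbc:mariadb:aurora://my-db.cluster-abc123.us-east-1.rds.amazonaws.com:3306/akka"
    user = "akka"
    password = "change-me"
    driver = "org.mariadb.jdbc.Driver"
  }
}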

Uber Cadence task list management

We are using Uber Cadence and periodically we run into issues in the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains, and for all of them the single BE server is registered as a worker.
What is visible in the logs is that sometimes, during a query, Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend, meanwhile, nothing is visible. However, during this time, if I check the pollers on the Cadence web client, I see that the task list is there, but it is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, because nothing can make progress on the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is this managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow is not possible (although I think that would generate something in the backend log from the Cadence client), or that at that point the worker has already been removed from the list and that causes the timeout?
Any help would be more than welcome! Thanks!
Does anyone know how a worker or task list can lose its ability to be a decision handler?
This will happen when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So apparently it will also happen if, for some reason, the worker stops polling for decision tasks.
As I understand it, when stickiness is lost, Cadence will look for another worker to replay the workflow and continue it
Yes, as long as there is another worker polling for decision tasks. Note that Query tasks are considered one of the decision task types. (This is a design flaw; we are working on separating them.)
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the Query task to a worker, and the worker accepted it, but it didn't respond back within the timeout.
It's very likely that there is a bug in your Query handler logic. The bug caused the decision worker to crash (which means the Cadence Java client also has a bug, since user code crashing shouldn't crash the worker). A query task then looped over all instances of your worker pool and eventually crashed all your decision workers.
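To illustrate the kind of query handler that avoids this, here is a sketch against the Cadence Java client; the workflow interface and names are made up for the example. A query handler should only return state the workflow already holds in memory, and must not block, perform I/O, call activities, or throw:

// Sketch (hypothetical workflow): the query method only returns state the workflow
// already holds in memory, so it cannot hang or crash the decision/query task the
// way a slow or throwing handler can.
import com.uber.cadence.workflow.QueryMethod;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

public class OrderWorkflowExample {

    public interface OrderWorkflow {
        @WorkflowMethod
        void processOrder(String orderId);

        @QueryMethod
        String getStatus();   // must be fast and side-effect free
    }

    public static class OrderWorkflowImpl implements OrderWorkflow {
        private String status = "STARTED";

        @Override
        public void processOrder(String orderId) {
            status = "VALIDATING";
            // ... activities would be invoked here ...
            Workflow.sleep(java.time.Duration.ofMinutes(1));   // stand-in for real work
            status = "COMPLETED";
        }

        @Override
        public String getStatus() {
            // Do NOT perform I/O, call activities, block, or throw here; a handler that
            // does can make the query time out or, per the answer above, take down the
            // decision worker if it crashes.
            return status;
        }
    }
}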

Ensemble runtime global lock

When I try to start up my Ensemble production, I get the following error:
ERROR ErrCanNotAcquireRuntimeLock: Could not acquire Ensemble runtime
global lock within timeout '10'
I figured I would disable all the services, processes and operations and restart them individually to see which one is causing the error; however, any action I take on the production takes a very long time and then comes back with the same error.
Googling the issue did not yield much. Any ideas?
You should check the contents of your lock table while the production is not running -- it's likely that you have a job (or multiple jobs) that still have locks on the core Ensemble runtime globals. If you can identify the OS-level process(es) and can work out what they are actually doing, you should be able to terminate the OS processes. In both cases, you should perform this detection and termination from within Ensemble. You should be able to use the System Management Portal for both actions, or you can use the ^LOCKTAB and ^JOBEXAM CHUI utilities in the %SYS namespace to track this down.
If you can restart the Ensemble server, the lock table should be cleared. This, however, doesn't help you find the cause of the problem.

Does MongoDB fail silently if I don't check error codes?

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, and with NONE no exceptions at all. If you want to be informed of exceptions, you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as this goes against the intention. The connection which sent the write operation may already be closed or reused, so the error can't be reported back through it. Furthermore, only your actual code knows what to do when a write fails. As MongoDB doesn't offer some remote callback to asynchronously inform you about updates, you'll have to wait until the write has finished to a given stage.
So the fastest, but least reliable, of these is SAFE, where the write has only reached the server's memory. JOURNAL gives you the assurance that it was written at least to the journal on disk. With FSYNC you'll have those changes persisted in your database files on disk. REPLICA means at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these two don't differ).
The only way I see to get something like asynchronous behaviour is to have a separate thread that performs all write operations synchronously. That thread could handle the actual update, together with a class which is called in case of a failure to perform the operations needed to handle that failure. But I don't think that this is good application design.
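With a current MongoDB Java driver the same idea looks roughly like the sketch below; the legacy modes named above map onto today's unacknowledged/acknowledged/journaled/majority write concerns, and the connection string, database, and collection names are placeholders:

// Sketch: an acknowledged write surfaces failures as exceptions, while an
// unacknowledged one is fire-and-forget and reports nothing back.
import com.mongodb.MongoWriteConcernException;
import com.mongodb.MongoWriteException;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernDemo {
    public static void main(String[] args) throws Exception {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");

            // Fire-and-forget: errors (beyond basic connectivity) go unnoticed.
            orders.withWriteConcern(WriteConcern.UNACKNOWLEDGED)
                  .insertOne(new Document("_id", 1).append("total", 10));

            // Acknowledged + journaled + majority: failures surface as exceptions.
            try {
                orders.withWriteConcern(WriteConcern.MAJORITY.withJournal(true))
                      .insertOne(new Document("_id", 1).append("total", 10)); // duplicate _id
            } catch (MongoWriteException e) {
                System.err.println("Write failed: " + e.getError().getMessage());
            } catch (MongoWriteConcernException e) {
                System.err.println("Write concern not satisfied: " + e.getWriteConcernError());
            }
        }
    }
}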
Yes, depending on the error, it can fail silently if you don't check the returned error code. You have to wait for the write to be acknowledged in order to check for errors. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."