How to fix a "Circuit Breaker Timed out" exception when spawning a persistent actor in Akka (Scala)?

Has anyone faced a circuit breaker timeout on Akka for persistent actors?
I believe this happens when Akka fails to reload saved snapshots of events. I tried bypassing this reload by tweaking the conf and adding snapshot-store-plugin-fallback.snapshot-is-optional = true to it. However, that doesn't do the trick either.
I am using jdbc-journal as the journal plugin and jdbc-snapshot-store as the snapshot-store.plugin.
Any help on this is appreciated!
The logged failure is:
Supervisor StopSupervisor saw failure: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Circuit Breaker Timed out.
akka.persistence.typed.internal.JournalFailureException: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Exception during recovery. Last known sequence number [0]. PersistenceId [UCPersistenceId], due to: Circuit Breaker Timed out.
I also get this warning after the exception block.
{"logger":"org.mariadb.jdbc.HostAddress","message":"Aurora recommended connection URL must only use cluster end-point like \"jdbc:mariadb:aurora://xx.cluster-yy.zz.rds.amazonaws.com\". Using end-point permit auto-discovery of new replicas","context":"default"}

"Circuit Breaker Timed out" is not the actual exception; it's rather a symptom of another exception, the source exception. I would suggest looking at the database server logs for the actual exception that occurred.

As @Debasish-Ghosh (whose book on Functional and Reactive Domain Modeling I highly recommend) notes, the circuit breaker exception is just a symptom of another failure.
From the warning, it seems likely that you've configured your journal and/or snapshot store to point at a specific Aurora replica rather than the cluster end-point. If that specific replica isn't available, connections to it will fail and eventually trip the circuit breaker; the effect of the breaker is that attempts to read from or write to the journal fail fast, so as not to consume excessive resources and to give the DB a chance to recover if the failures were due to overload.
So for this specific issue, I suggest using the Aurora cluster end-point in your configuration, which should make the unavailability of any single replica a non-issue.
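For example, a minimal application.conf sketch (this assumes akka-persistence-jdbc's 3.x-style shared slick block; the URL, credentials, profile, and timeouts are placeholders to adapt, and the circuit-breaker keys are the standard journal-plugin-fallback ones):

jdbc-journal {
  slick = ${slick}
  # If recovery legitimately takes longer than the default 10s, the
  # breaker's call timeout can also be raised here.
  circuit-breaker {
    max-failures = 10
    call-timeout = 30s
    reset-timeout = 60s
  }
}
jdbc-snapshot-store {
  slick = ${slick}
}
slick {
  profile = "slick.jdbc.MySQLProfile$"
  db {
    # The cluster end-point, as the warning recommends, not a single replica.
    url = "jdbc:mariadb:aurora://xx.cluster-yy.zz.rds.amazonaws.com/mydb"
    driver = "org.mariadb.jdbc.Driver"
    user = ${?DB_USER}
    password = ${?DB_PASSWORD}
  }
}

Note that raising the breaker timeout only buys time; if the end-point is wrong, fixing the URL is the actual cure.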

Related

Uber Cadence task list management

We are using Uber Cadence, and periodically we run into issues in the production environment.
The setup is the following:
One Java 14 BE with Cadence client 2.7.5
Cadence service version 0.14.1 with Postgres DB
There are multiple domains; the single BE server is registered as a worker for all of them.
What is visible in the logs is that sometimes during a query Cadence seems to lose stickiness to the BE service:
"msg":"query direct through matching failed on sticky, clearing sticky before attempting on non-sticky","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
...
In the backend, meanwhile, nothing is visible. However, during this time, if I check the pollers on the Cadence web client, I see that the task list is there, but it is no longer considered a decision handler (http://localhost:8088/domains/mydomain/task-lists/mytasklist/pollers). Because of this, pretty much the whole environment is dead, because there is nothing that can progress with the decision. The only option is to restart the backend service and let it re-register as a worker.
At this point the investigation is stuck, so some help would be appreciated.
Does anyone know how a worker or task list can lose its ability to be a decision handler? Is it managed by Cadence, e.g. based on how many errors the worker generates? I was not able to find anything about this.
As I understand it, when the stickiness is lost, Cadence will look for another worker to replay the workflow and continue it (in my case this will be the same worker, as there is only one). Is it possible that replaying the flow is not possible (although I think that would generate something in the backend log from the Cadence client), or is the worker already removed from the list at that point, causing the time-out?
Any help would be more than welcome! Thanks!
Does anyone know how a worker or task list can lose its ability to be a decision handler?
This happens when the worker stops polling for decision tasks. For example, if you configure the worker to poll only for activity tasks, it will show up like that. So it will also happen if, for some other reason, the worker stops polling for decision tasks; see the sketch below.
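For illustration, a minimal Scala sketch against the 2.x Java client's Worker.Factory API (the workflow trait, domain, and class names are placeholders, and newer client versions have moved this API around, so treat it as a shape rather than exact code):

import com.uber.cadence.worker.Worker
import com.uber.cadence.workflow.WorkflowMethod

// Hypothetical workflow interface and implementation.
trait MyWorkflow {
  @WorkflowMethod
  def run(input: String): Unit
}
class MyWorkflowImpl extends MyWorkflow {
  override def run(input: String): Unit = ()
}

object DecisionWorker extends App {
  val factory = new Worker.Factory("mydomain")
  val worker = factory.newWorker("mytasklist")

  // Registering workflow implementation types is what makes this worker
  // poll for decision tasks; registering only activities would leave it
  // visible as an activity-only poller.
  worker.registerWorkflowImplementationTypes(classOf[MyWorkflowImpl])
  factory.start()
}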
As I understand it, when the stickiness is lost, Cadence will look for another worker to replay the workflow and continue
Yes, as long as there is another worker polling for decision tasks. Note that query tasks are considered one of the decision task types (this is a design mistake; we are working on separating them).
From your logs:
"msg":"query directly though matching on non-sticky failed","service":"cadence-history","shard-id":1,"address":"10.1.1.111:7934"..."error":"code:deadline-exceeded message:timeout"
This means that Cadence dispatched the query task to a worker, and the worker accepted it, but didn't respond back within the timeout.
It's very likely that there is a bug in your query handler logic. The bug causes the decision worker to crash (which means the Cadence Java client also has a bug: user code crashing shouldn't crash the worker). A query task then loops over all instances of your worker pool and finally crashes all your decision workers.
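One way to guard against that while you investigate is to make the query handler defensive, so a bug in the query logic returns an error value instead of throwing and failing the decision task. A hedged Scala sketch (the interface and state here are illustrative, not your actual code):

import com.uber.cadence.workflow.{QueryMethod, WorkflowMethod}

trait OrderWorkflow {
  @WorkflowMethod
  def process(orderId: String): Unit

  @QueryMethod
  def status(): String
}

class OrderWorkflowImpl extends OrderWorkflow {
  private var state = "STARTED"

  override def process(orderId: String): Unit = {
    // ... workflow logic that updates `state` ...
    state = "DONE"
  }

  private def renderStatus(): String = state // real query logic goes here

  // Query handlers should be fast, read-only, and never throw: wrap the
  // real logic so an unexpected bug yields a safe answer instead of
  // failing the query/decision task.
  override def status(): String =
    try renderStatus()
    catch { case e: Exception => s"QUERY_FAILED: ${e.getMessage}" }
}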

UpdateTaskList operation failed with Cadence matching-service

The other day we ran into some issues with our Cadence setup. The CPU usage on one of our machine instances climbed to 90%, and all of the inbound workflow executions were stuck in the "Scheduled" state. After checking the logs, we noticed that the matching service was throwing the following error:
{
"level": "error",
"ts": "2021-03-20T14:41:55.130Z",
"msg": "Operation failed with internal error.",
"service": "cadence-matching",
"error": "InternalServiceError{Message: UpdateTaskList operation failed. Error: gocql: no hosts available in the pool}",
"metric-scope": 34,
"logging-call-at": "persistenceMetricClients.go:872",
"stacktrace": "github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).updateErrorMetric\n\t/cadence/common/persistence/persistenceMetricClients.go:872\ngithub.com/uber/cadence/common/persistence.(*taskPersistenceClient).UpdateTaskList\n\t/cadence/common/persistence/persistenceMetricClients.go:855\ngithub.com/uber/cadence/service/matching.(*taskListDB).UpdateState\n\t/cadence/service/matching/db.go:103\ngithub.com/uber/cadence/service/matching.(*taskReader).persistAckLevel\n\t/cadence/service/matching/taskReader.go:277\ngithub.com/uber/cadence/service/matching.(*taskReader).getTasksPump\n\t/cadence/service/matching/taskReader.go:156"
}
After restarting the workflow, everything went back to normal, but we're still trying to figure out what happened. We were not running any heavy workload at the moment of this event; it just happened suddenly. Our main suspicion is that the matching service lost connectivity to the Cassandra database during this event, and only after we restarted it was it able to reconnect. But this is just a hypothesis at this moment.
What might the cause of this problem have been, and is there a way to prevent it from happening in the future? Maybe some dynamic config that we're missing?
PS: The Cadence version is 0.18.3.
This is a known issue with gocql that can be caused by several things:
1. Cassandra is overloaded and some nodes are not responsive. You may think your load is small, but the best way to check is through the Cadence metrics/dashboard; there is a section about persistence metrics. If this is the problem, you may tune the rate limiting to protect your Cassandra: matching.persistenceGlobalMaxQPS acts as a global rate limiter that overrides matching.persistenceMaxQPS.
2. A network issue or some bug in gocql. It's really frustrating. We recently decided to do the refreshing in this PR; hopefully this will get mitigated in the next release.
Also, if a matching node is running hot, you are probably hitting the limit of a single tasklist. If so, consider enabling the scalable tasklist feature. A sketch of the relevant dynamic config follows.
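For reference, a hedged sketch of what that might look like in the file-based dynamic config (the QPS value, partition counts, and constraint keys are illustrative; verify the property names against the docs for your version):

matching.persistenceGlobalMaxQPS:
- value: 3000
  constraints: {}
matching.numTasklistWritePartitions:
- value: 4
  constraints:
    taskListName: mytasklist
matching.numTasklistReadPartitions:
- value: 4
  constraints:
    taskListName: mytasklist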

Apache Beam CloudBigtableIO read/write error handling

We have a Java-based Dataflow pipeline which reads from Bigtable and, after some processing, writes data back to Bigtable. We use CloudBigtableIO for these purposes.
I am trying to wrap my head around failure handling in CloudBigtableIO. I haven't found any references/documentation on how errors are handled inside and outside CloudBigtableIO.
CloudBigtableIO has a bunch of options in BigtableOptionsFactory which specify timeouts, gRPC codes to retry on, and retry limits.
1. google.bigtable.grpc.retry.max.scan.timeout.retries - is this the retry limit for scan operations only, or does it include mutation operations as well? If it is just for scans, how many retries are done for mutation operations? Is that configurable?
2. google.bigtable.grpc.retry.codes - do these codes enable retries for both scan and mutate operations?
3. Customizing the options would only enable retries; could there be cases where CloudBigtableIO reads less data than requested but does not fail the pipeline?
4. When mutating a few million records, I think it is possible that we get errors beyond the retry limits. What happens to such mutations? Do they simply fail? How do we handle them in the pipeline? BigQueryIO has a mechanism that collects failures and provides a way to retrieve them through a side output; why doesn't CloudBigtableIO have such a mechanism?
We occasionally get DEADLINE_EXCEEDED errors while writing mutations, but the logs are not clear on whether the mutations were retried and succeeded, or the retries were exhausted. I do see RetriesExhaustedWithDetailsException, but that is of no use if we are not able to handle the failures.
5. Are these failures thrown back to the preceding step in the Dataflow pipeline if the preceding step and the CloudBigtableIO write are fused? With bulk mutations enabled, it is not really clear how the failures are thrown back to preceding steps.
For question 1, I believe google.bigtable.mutate.rpc.timeout.ms would be what corresponds to mutation operations, though the Javadoc notes that the feature is experimental. google.bigtable.grpc.retry.codes allows you to add additional codes to retry on that are not set by default (the defaults include DEADLINE_EXCEEDED, UNAVAILABLE, ABORTED, and UNAUTHENTICATED).
You can see an example of the configuration getting set for mutation timeouts here: https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-client-core-parent/bigtable-hbase/src/test/java/com/google/cloud/bigtable/hbase/TestBigtableOptionsFactory.java#L169
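For illustration, those keys can also be passed straight through the Beam configuration builder; a sketch (the project, instance, table, and values are placeholders, and google.bigtable.mutate.rpc.timeout.ms remains experimental):

import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration

object BigtableWriteConfig {
  val config: CloudBigtableTableConfiguration =
    new CloudBigtableTableConfiguration.Builder()
      .withProjectId("my-project")
      .withInstanceId("my-instance")
      .withTableId("my-table")
      // Overall timeout for a single mutation RPC.
      .withConfiguration("google.bigtable.mutate.rpc.timeout.ms", "60000")
      // Extra gRPC status codes to retry, beyond the defaults listed above.
      .withConfiguration("google.bigtable.grpc.retry.codes", "RESOURCE_EXHAUSTED")
      .build()
}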
google.bigtable.grpc.retry.max.scan.timeout.retries:
It is only for setting the number of times to retry after a SCAN timeout.
Regarding retries on mutation operations:
This is how Bigtable handles operation failures.
Regarding your question about handling errors in the pipeline
I see that you are already aware of RetriesExhaustedWithDetailsException. Keep in mind that in order to retrieve the detailed exceptions for each failed request, you have to call RetriesExhaustedWithDetailsException#getCauses().
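A sketch of unpacking it (the accessors are HBase's RetriesExhaustedWithDetailsException API; what you then do with the failed rows, e.g. route them to a dead-letter sink, is up to your pipeline):

import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException

def logFailedMutations(e: RetriesExhaustedWithDetailsException): Unit =
  // One entry per failed mutation: its cause, the row it belonged to,
  // and the server it was sent to.
  for (i <- 0 until e.getNumExceptions) {
    val cause = e.getCause(i)        // underlying Throwable for this row
    val row = e.getRow(i)            // the mutation that failed
    val host = e.getHostnamePort(i)  // "host:port" it was sent to
    println(s"row=$row host=$host cause=${cause.getMessage}")
  }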
As for the failures, the Google documentation states:
"Append and Increment operations are not suitable for retriable batch programming models, including Hadoop and Cloud Dataflow, and are therefore not supported inputs to CloudBigtableIO.writeToTable. Dataflow bundles, or a group of inputs, can fail even though some of the inputs have been processed. In those cases, the entire bundle will be retried, and previously completed Append and Increment operations would be performed a second time, resulting in incorrect data."
Some documentation that you may consider helpful:
Cloud Bigtable Client Libraries
Source code for Cloud Bigtable HBase client for Java
Sample code for use with Cloud BigTable
Hope you find the above helpful.

How to handle client response during Transient Exception retrying?

Context
I'm developing a REST API that, as you might expect, is backed by multiple external cross-network services, APIs, and databases. It's very possible that a transient failure will be encountered at any point, for which the operation should be retried. My question is: during that retry operation, how should my API respond to the client?
Suppose a client is POSTing a resource and my server encounters a transient exception when attempting to write to the database. Using the Retry Pattern, perhaps combined with the Circuit Breaker Pattern, my server-side code should retry the operation with randomized linear/exponential back-off. The client would obviously be left waiting during that time, which is not something we want. A sketch of the retry loop I have in mind follows.
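For illustration, the server-side retry looks roughly like this (the operation, limits, and transience test are placeholders of my own):

import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Retry `op` with randomized exponential back-off ("full jitter");
// `isTransient` decides whether a failure is worth retrying at all.
@tailrec
def retry[A](op: () => A,
             isTransient: Throwable => Boolean,
             attempt: Int = 1,
             maxAttempts: Int = 5,
             baseDelayMs: Long = 100): A =
  Try(op()) match {
    case Success(a) => a
    case Failure(e) if isTransient(e) && attempt < maxAttempts =>
      val cap = baseDelayMs * (1L << attempt)          // exponential ceiling
      Thread.sleep((Random.nextDouble() * cap).toLong) // randomized delay
      retry(op, isTransient, attempt + 1, maxAttempts, baseDelayMs)
    case Failure(e) => throw e
  }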
Questions
Where does the client fit into the retry operation?
Should I perhaps provide an isTransient: true indicator in the JSON response and leave the client to retry?
Should I leave retrying to the server and respond with a message and status code indicating that the server is actively retrying the request, and then have the client poll for updates? How would you determine the polling interval in that case without overloading the server? Or should the server respond via a web socket instead, so the client need not poll?
What happens if there is an unexpected server crash during the retry operation? Obviously, when the server recovers, it won't "remember" the fact that it was retrying an operation unless that fact was persisted somewhere. I suppose that's a non-critical issue that would just cause further unnecessary complexity if I attempted to solve it.
I'm probably over-thinking the issue, but while there is a lot of documentation about implementing transient exception retry logic, seldom have I come across resources that discuss how to leave the client "pending" during that time.
Note: I realize that similar questions have been asked, but my queries are more specific, for I'm specifically interested in the different options for where the client fits into a given retry operation, how the client should react in those cases, and what happens should a crash occur that interrupts a retry sequence.
Thank you very much.
There are some rules for retries:
Always create an idempotency key, so the server can recognize that a request is a retry of an earlier operation (see the sketch after this list).
If your operation is complex and you want to wrap a REST call with retries, you must ensure that duplicate requests cause no side effects (resume from the failure point; don't re-execute the code that already succeeded).
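A minimal sketch of the idempotency-key rule (the in-memory map stands in for whatever store you use; a real service would persist the key together with the write itself):

import java.util.concurrent.ConcurrentHashMap

object IdempotentPosts {
  // Maps the client-supplied Idempotency-Key (e.g. a UUID header) to the
  // response produced the first time the operation ran.
  private val processed = new ConcurrentHashMap[String, String]()

  def handlePost(idempotencyKey: String, create: () => String): String =
    // `create` runs only for unseen keys, so a retried request gets the
    // original result back without repeating the side effect.
    processed.computeIfAbsent(idempotencyKey, _ => create())
}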
Personally, I think the client should not know that you are retrying something, and of course isTransient: true should not be part of the resource.
Warning: before adding a retry policy to anything, you must check its side effects; putting retry policies everywhere is bad practice.

Does MongoDB fail silently if I don't check error codes?

I'm wondering whether any persistence failure will go undetected if I don't check error codes. If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors; with NONE, no exceptions at all. If you want to be informed of exceptions, you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as that is against the intention. The connection which sent the write operation may already be closed or reused, so the error can't be sent back through it. Furthermore, only your actual code knows what to do if a write fails. As MongoDB doesn't offer some remote procedure call to asynchronously inform you of the outcome of updates, you'll have to wait until the write has finished to a given stage.
So the fastest, but most unreliable, of the error-reporting modes is SAFE, where the write has only happened in memory. JOURNAL gives you the assurance that it was written at least to the journal on disk. With FSYNC you'll have those changes persisted in your db on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these don't differ).
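For illustration, here is a sketch using the current Java driver from Scala; the legacy modes above map onto today's WriteConcern constants (NONE is now UNACKNOWLEDGED, SAFE is ACKNOWLEDGED, JOURNAL is JOURNALED), and the connection details are placeholders:

import com.mongodb.{MongoException, WriteConcern}
import com.mongodb.client.MongoClients
import org.bson.Document

object WriteConcernDemo extends App {
  val client = MongoClients.create("mongodb://localhost:27017")
  val coll = client.getDatabase("test").getCollection("docs")
    // Errors surface as exceptions only with an acknowledged write concern.
    .withWriteConcern(WriteConcern.MAJORITY)

  try coll.insertOne(new Document("_id", 1))
  catch {
    // With WriteConcern.UNACKNOWLEDGED (the old NONE) this write would be
    // fire-and-forget and any failure would stay silent.
    case e: MongoException => println(s"write failed: ${e.getMessage}")
  }
}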
The only chance I see to have something like asynchronous behavior is to have a separate thread which performs all write operations synchronously. That thread could handle the actual update, together with a class which is called in case of a failure to perform the operations needed to handle that failure. But I don't think that this is good application design.
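Still, a sketch of that design, for what it's worth (the job type and the commented usage are illustrative):

import java.util.concurrent.LinkedBlockingQueue

final case class WriteJob(run: () => Unit, onFailure: Throwable => Unit)

object SingleWriter {
  val queue = new LinkedBlockingQueue[WriteJob]()

  private val writer = new Thread(() => {
    while (true) {
      val job = queue.take() // blocks until a write is submitted
      try job.run()
      catch { case e: Exception => job.onFailure(e) } // caller-supplied handling
    }
  })
  writer.setDaemon(true)
  writer.start()
}

// Callers enqueue and move on; failures still reach their handler, e.g.:
// SingleWriter.queue.put(WriteJob(() => saveToMongo(doc), e => log("write failed: " + e)))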
Yes, depending on the error, it can fail silently if you don't check the returned error code. Waiting for the result is necessary for error checking. Your only other option would be for your app to occasionally tell the user, "Oops, remember when I acted like I saved your data a moment ago? Well, not really."