Excessive number of TaskCanceledExceptions - Service Fabric

I've just started looking at Service Fabric, and was hoping someone could help with a problem I am having.
We have started noticing an excessive number of TaskCanceledExceptions being thrown back to RunAsync. This seems to be caused by Service Fabric cancelling the CancellationToken that was passed into RunAsync.
I understand there are legitimate reasons for Service Fabric cancelling the token, such as the service being restarted or the primary being demoted, but we are seeing hundreds of thousands of these exceptions in some months (and hardly any in others).
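For context, the exceptions surface wherever an await inside RunAsync observes the cancelled token; the shape is roughly the sketch below (a stateless service is shown, and the loop body is a placeholder rather than our real code):

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;

internal sealed class Worker : StatelessService
{
    public Worker(StatelessServiceContext context) : base(context) { }

    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        try
        {
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();

                // Placeholder for the service's real work.
                // Task.Delay throws a TaskCanceledException if the token is cancelled mid-wait.
                await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
            }
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Expected shutdown path: Service Fabric cancelled the token
            // (restart, primary demotion, upgrade, node deactivation, ...),
            // so return promptly rather than treating it as an application error.
        }
    }
}
```

Handling it this way treats the cancellation as a normal shutdown; the puzzle is why the token is being cancelled so often.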
My questions are:
What scenarios could cause large numbers of cancellation tokens to be cancelled?
Service Fabric must log the reason it cancelled a given token somewhere. Where does Service Fabric log this kind of information?

Related

What happens if Dapr fails?

I would like to know what happens if Dapr fails. For example, if my service's sidecar or even the Dapr control plane fails, what is the expected behavior of my application?
Also, is there any way for me to simulate these failure cases?
Context:
In my application I have a service that uses Dapr, but in a non-critical way. Therefore, I would like to ensure that it continues to run normally even if its sidecar or Dapr itself fails.
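Something like the sketch below is what I have in mind: a Dapr failure is logged and swallowed so the request still succeeds (DaprClient comes from the Dapr .NET SDK; the pub/sub component and topic names are made up):

```csharp
using System;
using System.Threading.Tasks;
using Dapr.Client;

public sealed class AuditPublisher
{
    private readonly DaprClient _dapr = new DaprClientBuilder().Build();

    // Publishes an audit event through the sidecar, but never lets a sidecar
    // or Dapr failure break the caller. "pubsub" and "audit-events" are made-up names.
    public async Task TryPublishAsync(object auditEvent)
    {
        try
        {
            await _dapr.PublishEventAsync("pubsub", "audit-events", auditEvent);
        }
        catch (Exception ex) // e.g. the sidecar is down or unreachable
        {
            // Non-critical path: log and continue instead of failing the request.
            Console.WriteLine($"Audit event dropped, Dapr unavailable: {ex.Message}");
        }
    }
}
```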
Very good question without a straightforward answer, but I'll share how I look at it.
Dapr can be used with monolithic, legacy applications, for example for migration and modernization purposes, but it is more commonly used with distributed applications. In a distributed application, there are many more components that can fail: database, transparent proxy (Envoy), ingress proxy, message broker, producer, consumer... In that regard, Dapr is no different and it can fail, but there are a few reasons why that is less likely to happen:
Dapr is like a technical microservice: it has no business logic, and your app interacts with it over explicit APIs. That makes it harder for a failure in the sidecar to spread to your app.
If the sidecar is exploited, it is harder to gain control of the application; the sidecar acts as a security boundary.
As a popular open-source project, Dapr has many eyes and users on it, so new bugs are more likely to be found and fixed early.
If a bug does slip through, upgrading Dapr is much easier than upgrading a library. You can upgrade the Dapr control plane with little to no disruption to your app, and then upgrade selected sidecars (a canary release, if you want). I've also done plenty of middleware/library patching and upgrades, and I know how much more work the latter is in comparison.
Each sidecar is co-located with its app, so any hardware or network failure is likely to affect both the app and the sidecar, rather than the sidecar only.
With Dapr, you get many resiliency and observability benefits OOTB. See my blog on this topic here. It is more likely to improve the reliability of your app than reduce it.
When you follow best practices and enable Kubernetes health checks and resource constraints, Kubernetes will deal with a failing sidecar. Dapr can even detect the health status of your app and stop interacting with it until it recovers.
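As a rough illustration of that last point, the app only needs to expose a health endpoint, and the same endpoint can back both the Kubernetes probes and Dapr's app health checks (a minimal ASP.NET Core sketch; the path is whatever you configure in your probes):

```csharp
// Minimal ASP.NET Core app with a health endpoint. Point the Kubernetes
// liveness/readiness probes (and, if enabled, Dapr's app health check) at it.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/healthz", () => Results.Ok("healthy"));

app.Run();
```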
In the end, if there is a bug in Dapr, it may fail. But that can happen with a library implementing Dapr-like features too. With Dapr, you can isolate the failure and upgrade faster, without changing, rebuilding, or retesting a single line of application code; from the perspective of this question, that is the difference.
Disclaimer: I work for a company building products for running Dapr, and I'm highly biased on this topic.

Response of a request hangs and resolves after a while

I have an issue with a backend service in a microservice architecture running on Kubernetes.
Sometimes when I make a request to the service from Postman or the frontend, it hangs for a very long time until it times out, whether the request is big or small, and this behaviour lasts for a minute or so before it resolves on its own and requests respond at normal speed again.
My team is still figuring out what the problem is. We suspect the source of the problem could be Kubernetes or PostgreSQL.
I hope someone has some insight into this. Thanks!

Informatica BDM - How to re-try "REST Web Service Consumer"?

I have an Informatica BDM system (note: Big Data Management, not PowerCenter) and am having a problem with dropped connections when communicating with a third-party web service. This fails the REST web service transformation, which in turn kills our batch job.
Rather than fail the entire job, I would like each REST call to potentially be retried several times first.
I looked at the documentation, but I see no option to set a re-try on the REST Web Service Consumer transformation. Did I miss it? Or does one have to construct a re-try loop around it in some other way?
Bad news, but Informatica BDM does not yet offer a retry option. The need for this feature has already been raised by many users, so the option may appear in an upcoming release.
For now, we can only keep track of all the requests and responses manually and hope for the best.
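If the retry really has to happen somewhere, one workaround is to put the third-party call behind a small wrapper or proxy service that does the retrying, and have the mapping call that instead. A generic retry-with-exponential-backoff sketch, not an Informatica feature (the URL, attempt count and delays are placeholders):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class RetryingRestClient
{
    private static readonly HttpClient Http = new HttpClient();

    // Calls the third-party endpoint, retrying transient failures with
    // exponential backoff before giving up and letting the error propagate.
    public static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                var response = await Http.GetAsync(url);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
            catch (Exception ex) when (attempt < maxAttempts &&
                                       (ex is HttpRequestException || ex is TaskCanceledException))
            {
                // Dropped connection or timeout: back off and try again.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}
```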

What do 'Renews' and 'Renews threshold' mean in Eureka

I'm new to Eureka and I see this information on the home page of my Eureka server (localhost:8761/). I didn't find any explanation in the official docs about 'Renews' and 'Renews threshold'. Could anyone please explain these terms? Thanks!
Hope it helps:
Renews: the number of heartbeat renewals the server has received from clients
Renews threshold: the minimum number of renews (per minute) the server expects to receive; it drives Eureka's "self-preservation mode". If "Renews" falls below "Renews threshold", self-preservation mode is switched on.
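To make the numbers concrete (assuming stock defaults, where each client renews every 30 seconds and the threshold factor is 0.85): with 10 registered instances the server expects about 10 × 2 = 20 renews per minute, so "Renews threshold" would be shown as roughly 20 × 0.85 = 17, and self-preservation kicks in if the renews received in the last minute drop below 17.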
self-preservation mode:
When the Eureka server comes up, it tries to get all of the instance registry information from a neighboring node. If there is a problem getting the information from a node, the server tries all of the peers before it gives up. If the server is able to successfully get all of the instances, it sets the renewal threshold that it should be receiving based on that information. If at any time the renewals fall below the percent configured for that value (below 85% within 15 minutes), the server stops expiring instances to protect the current instance registry information.
At Netflix, the above safeguard is called self-preservation mode and is primarily used as a protection in scenarios where there is a network partition between a group of clients and the Eureka server. In these scenarios, the server tries to protect the information it already has. In the case of a mass outage, this may cause clients to get instances that do not exist anymore. The clients must make sure they are resilient to the Eureka server returning an instance that is non-existent or unresponsive. The best protection in these scenarios is to time out quickly and try other servers.
For more details, please refer to the Eureka wiki.

MSMQ messages bound for clustered MSMQ instance get stuck in outgoing queues

We have clustered MSMQ for a set of NServiceBus services, and everything runs great until it doesn't. Outgoing queues on one server start filling up, and pretty soon the whole system is hung.
More details:
We have a clustered MSMQ instance shared between servers N1 and N2. The only other clustered resources are services that operate directly on the clustered queues as local queues, i.e. the NServiceBus distributors.
All of the worker processes live on separate servers, Services3 and Services4.
For those unfamiliar with NServiceBus, work goes into a clustered work queue managed by the distributor. Worker apps on Services3 and Services4 send "I'm Ready for Work" messages to a clustered control queue managed by the same distributor, and the distributor responds by sending a unit of work to the worker process's input queue.
At some point, this process can get completely hung. Here is a picture of the outgoing queues on the clustered MSMQ instance when the system is hung:
If I fail over the cluster to the other node, it's like the whole system gets a kick in the pants. Here is a picture of the same clustered MSMQ instance shortly after a failover:
Can anyone explain this behavior, and what I can do to avoid it, to keep the system running smoothly?
Over a year later, it seems that our issue has been resolved. The key takeaways seem to be:
Make sure you have a solid DNS system so when MSMQ needs to resolve a host, it can.
Only create one clustered instance of MSMQ on a Windows Failover Cluster.
When we set up our Windows Failover Cluster, we made the assumption that it would be bad to "waste" resources on the inactive node, and so, having two quasi-related NServiceBus clusters at the time, we made a clustered MSMQ instance for Project1, and another clustered MSMQ instance for Project2. Most of the time, we figured, we would run them on separate nodes, and during maintenance windows they would co-locate on the same node. After all, this was the setup we have for our primary and dev instances of SQL Server 2008, and that has been working quite well.
At some point I began to grow dubious about this approach, especially since failing over each MSMQ instance once or twice seemed to always get messages moving again.
I asked Udi Dahan (author of NServiceBus) about this clustered hosting strategy, and he gave me a puzzled expression and asked "Why would you want to do something like that?" In reality, the Distributor is very light-weight, so there's really not much reason to distribute them evenly among the available nodes.
After that, we decided to take everything we had learned and recreate a new Failover Cluster with only one MSMQ instance. We have not seen the issue since. Of course, making sure this problem is solved would be proving a negative, and thus impossible. It hasn't been an issue for at least 6 months, but who knows, I suppose it could fail tomorrow! Let's hope not.
Maybe your servers were cloned and thus share the same Queue Manager ID (QMId).
MSMQ uses the QMId as a hash for caching the addresses of remote machines. If more than one machine in your network has the same QMId, you can end up with stuck or missing messages.
Check out the explanation and solution in this blog post: Link
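If you want to verify this, compare the QMId on each machine; a quick sketch that reads it from the registry (the path below is the usual MSMQ location, so double-check it for your Windows version):

```csharp
using System;
using Microsoft.Win32;

// Prints this machine's MSMQ Queue Manager ID so it can be compared across servers;
// cloned machines that kept the same MSMQ installation will show the same GUID.
class ShowQmId
{
    static void Main()
    {
        using (var key = Registry.LocalMachine.OpenSubKey(
            @"SOFTWARE\Microsoft\MSMQ\Parameters\MachineCache"))
        {
            var raw = key?.GetValue("QMId");
            var qmId = raw is byte[] bytes && bytes.Length == 16
                ? new Guid(bytes).ToString()
                : raw?.ToString() ?? "<not found>";
            Console.WriteLine($"{Environment.MachineName}: QMId = {qmId}");
        }
    }
}
```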
How are your endpoints configured to persist their subscriptions?
What if one (or more) of your services encounters an error and is restarted by the failover cluster manager? In that case, this service would never again receive any of the "I'm Ready for Work" messages from the other services.
When you fail over to the other node, I guess all your services send these messages again and, as a result, everything starts working again.
To test this behavior do the following.
Stop and restart all your services.
Stop only one of the services.
Restart the stopped service.
If your system does not hang, repeat this with each service in turn.
If your system now hangs again, check your configuration. In this scenario, at least one of your services (if not all) is losing its subscriptions between restarts. If you have not done so already, persist the subscriptions in a database.
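As an illustration, with a v5-era NServiceBus API (adjust to whichever version you are actually running) switching subscription storage to a database looks roughly like this:

```csharp
using System;
using NServiceBus;
using NServiceBus.Persistence;

class EndpointConfig
{
    static void Main()
    {
        // Store subscriptions (and other persistence) in a database via the
        // NServiceBus.NHibernate package instead of in memory, so they survive
        // service restarts. The exact API shape varies by NServiceBus version.
        var busConfiguration = new BusConfiguration();
        busConfiguration.EndpointName("Samples.MyEndpoint");        // placeholder name
        busConfiguration.UsePersistence<NHibernatePersistence>();   // needs NServiceBus.NHibernate
        // The database connection string is read from app.config
        // (connection string name "NServiceBus/Persistence").

        using (var bus = Bus.Create(busConfiguration).Start())
        {
            // Run the endpoint until shutdown.
            Console.ReadLine();
        }
    }
}
```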