Recently our Azure Service Fabric cluster health went bad after a deployment. The deployment itself succeeded, but the cluster health degraded due to a code issue and the deployment did not roll back. Only by looking into Service Fabric Explorer did we realize the cluster had gone bad.
Is there a way to get an email alert when the Service Fabric cluster health goes bad?
Scenarios where Service Fabric failed:
In one case, a single service went bad (shown in red) and was consuming a lot of memory, which in turn caused other services, and eventually the whole cluster, to go bad. I had to log into the scale set to see which service was consuming most of the memory.
In another case we added an additional reliable collection to an existing stateful service, and that caused a failure.
In each of these cases I had to look at Service Fabric Explorer and then go to each scale set to see the actual error message.
Related
Running Pods with Workload Identity causes a Google credential error when autoscaling starts.
My application is configured with Workload Identity to use Google Pub/Sub, and a HorizontalPodAutoscaler is set to scale the pods up to 5 replicas.
The problem arises when the autoscaler creates new replicas of the pod: GKE's metadata server does not respond for a few seconds, and then after 5 to 10 seconds the errors stop.
Here is the error log after a pod is created by the autoscaler:
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
WARNING:google.auth._default:Authentication failed using Compute Engine authentication due to unavailable metadata server
Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started
What exactly is the problem here?
When I read the Workload Identity docs, they say:
"The GKE metadata server takes a few seconds to start to run on a newly created Pod"
I think the problem is related to this, but is there a solution for this kind of situation?
Thanks
There is no specific solution other than to ensure your application can cope with this. Kubernetes uses DaemonSets to launch per-node apps like the metadata intercept server but as the docs clearly tell you, that takes a few seconds (noticing the new node, scheduling the pod, pulling the image, starting the container).
You can use an initContainer to prevent your application from starting until some script returns, which can just try to hit a GCP API until it works. But that's probably more work than just making your code retry when those errors happen.
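If you go the retry route, here is a minimal sketch of what that could look like, assuming a Python application using google-auth (which is what the log output above suggests); the attempt count and delay are arbitrary:

    import time

    import google.auth
    from google.auth.exceptions import DefaultCredentialsError

    def wait_for_credentials(max_attempts=10, delay_seconds=2):
        # Application Default Credentials raise DefaultCredentialsError while the
        # GKE metadata server on a fresh node is still coming up, so just retry.
        for attempt in range(1, max_attempts + 1):
            try:
                credentials, project = google.auth.default()
                return credentials, project
            except DefaultCredentialsError:
                if attempt == max_attempts:
                    raise
                time.sleep(delay_seconds)

    credentials, project = wait_for_credentials()
    # ... create the Pub/Sub client with these credentials as usual ...

The initContainer variant is the same idea, just done before your main container starts instead of inside your application code.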
I have some doubts about deploying Spring Cloud Gateway (the replacement for Zuul) with Kubernetes and achieving zero downtime. I'm completely new to Kubernetes and a bit lost with quite a lot of the concepts.
We would like to use Spring Cloud Gateway to verify the JWT. I've also read that an incoming call should first go through the gateway, then through Ribbon discovery, and finally to the REST services.
The application has very strict zero-downtime requirements. My question is: what happens when I need to redeploy the gateway for some reason? Is it possible to achieve zero downtime if it is the first component in the chain and there is constant traffic and requests in my system?
Is there any other component I should set up in order to achieve this? Users accessing my REST services shouldn't be disconnected abruptly.
Kubernetes Deployments use a rolling update model to achieve zero-downtime deploys: new pods are brought up and allowed to become ready, then added to the rotation, then old ones are shut down, and the process repeats as needed.
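To make that concrete, here is an illustrative sketch of a rolling-update Deployment for the gateway, written with the Kubernetes Python client (the name, image, port, and health path are assumptions, not taken from the question). The important knobs are maxUnavailable=0, so old gateway pods keep serving until a replacement is ready, and a readiness probe, so traffic only shifts to pods that actually respond:

    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="gateway"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "gateway"}),
            # Surge one extra pod and never take a serving pod away before its
            # replacement has passed the readiness probe.
            strategy=client.V1DeploymentStrategy(
                type="RollingUpdate",
                rolling_update=client.V1RollingUpdateDeployment(max_surge=1, max_unavailable=0),
            ),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "gateway"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="gateway",
                            image="example/spring-cloud-gateway:1.0",  # assumed image name
                            ports=[client.V1ContainerPort(container_port=8080)],
                            readiness_probe=client.V1Probe(
                                http_get=client.V1HTTPGetAction(path="/actuator/health", port=8080),
                                initial_delay_seconds=10,
                                period_seconds=5,
                            ),
                        )
                    ]
                ),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

The same thing is more commonly written directly as a YAML manifest; the pieces that matter for zero downtime are the RollingUpdate strategy and the readiness probe.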
I was testing things with a non-prod, on-premises Service Fabric cluster and I stopped the process for one of my services on one of my nodes (to see how it would respond).
That caused the explorer to show the following warning:
There was an error during CodePackage activation. The service host terminated with exit code:1
Service Fabric recovered well. It made a new instance of the process and kept going.
But I can't seem to see how to clear out the warning from the Explorer. In my non-prod env, I can just drop and redeploy the app.
But in prod I can't do that. Especially not to just get rid of a warning message!
Is there a way to tell Service Fabric, "thanks for telling me that, I know about it now and you can forget about it and carry on?"
I have set up a service that uses RegisterServiceNotificationFilterAsync to get notified of service change events. It works as intended: when a service goes down, the event fires.
But it happens AFTER the service has gone offline, which means several requests could already have failed against the now-offline service before I get it pulled out of my load balancer pool.
Sometimes Service Fabric can only react to a service going offline. For example, if someone pulls the plug on a server node, Service Fabric clearly can't tell me in advance that the service is going offline.
But many times, a threshold is reached and it is Service Fabric itself that kills the service (and starts a new one).
Is there a way to know BEFORE Service Fabric kills a service? (So I have time to update my load balancer.)
(In case it matters, I am running Service Fabric on premises.)
You can only be notified of a shutdown from inside the service itself. RegisterServiceNotificationFilterAsync is based on endpoint changes from the naming service.
If it's a reliable service, you get events for these scenarios: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-lifecycle
For guest executables, the signal Service Fabric sends is Ctrl+C. For containers it's "docker stop".
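So for a guest executable, the earliest you can react is when that Ctrl+C arrives: handle it, drain the instance (for example, pull it out of your load balancer pool), and then exit. A minimal sketch, assuming a Python guest executable and a hypothetical deregister_from_load_balancer() helper:

    import signal
    import sys
    import time

    def deregister_from_load_balancer():
        # Hypothetical: call your load balancer's API to remove this instance
        # from its pool and wait for in-flight requests to finish.
        pass

    def handle_shutdown(signum, frame):
        # Service Fabric sends Ctrl+C (SIGINT) to a guest executable when it
        # closes the service, so this is your drain window.
        deregister_from_load_balancer()
        sys.exit(0)

    signal.signal(signal.SIGINT, handle_shutdown)

    while True:
        time.sleep(1)  # placeholder for the service's real work loop

For a reliable service, the equivalent hook is the cancellation token passed to RunAsync (covered by the lifecycle link above).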
We have a bunch of OWIN-based Web API services hosted in an Azure Service Fabric cluster. All of these services are mapped to different ports in the associated load balancer. There are two out-of-the-box probes created with the cluster: FabricGatewayProbe and FabricHttpGatewayProbe. We added our port rules and used FabricGatewayProbe in them.
For some reason, these service endpoints seem to go to sleep after a period of inactivity, because clients of those services start timing out. We tried adjusting the load balancer idle timeout period to 30 minutes (the maximum). That seems to help immediately, but only for a brief period, and then we are back to timeout errors.
Where else should I be looking for resolution of this problem?
So, further to our comments, I agree that the documentation is open to interpretation, but after doing some testing I can confirm the following:
When creating a new cluster via the portal, you get a 1:1 relationship of rule to probe, and I have been able to reproduce your issue by modifying one of my existing ARM templates to reuse an existing probe the way you have.
On reflection this makes sense: a probe is effectively bound to a service, so if you attempt to share one probe across rules on different ports, how will the load balancer know whether each of the services is actually up? Also, Service Fabric (depending on your instance count settings) will move services between nodes.
So if you have two services on different ports sharing the same probe, and they end up on different nodes, the service not using the probe's port will get the error that the request took too long to respond.
A little long-winded, so hopefully a quick illustration will help show what I mean.