How to simulate a hung or deadlocked state in an ASP.NET web API app? - kubernetes

I have an ASP.NET Core 3.1 web API app. Can I write an endpoint that, when called, puts the entire application into a hung or deadlocked state? Or a background task with some code that makes this happen?
I am trying to test the Kubernetes liveness probe, for which I wrote a health check endpoint in the ASP.NET Core web API application.
Thanks.

If it's for testing some kind of hung situation, you can do something like making the thread/process sleep. Make sure that the sleep period is greater than the timeoutSeconds configuration of the livenessProbe.
Or, if you want to simulate the app being down, you can also return an HTTP status >= 400 (e.g. 500 / Internal Server Error) in the API response, so that the livenessProbe will consider the app unhealthy.
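A minimal sketch of what such test endpoints could look like in ASP.NET Core 3.1. The route names (hang, break, health) and the _healthy flag are made up for illustration; they are not part of any framework. Note that Thread.Sleep only blocks the one request thread it runs on, so to truly wedge the whole app you would need to call the hang endpoint concurrently enough to exhaust the thread pool:

```csharp
[ApiController]
public class HealthController : ControllerBase
{
    // Hypothetical toggle; in a real app this could live in a registered singleton.
    private static volatile bool _healthy = true;

    // Simulate a hang: block this request thread longer than the probe's timeoutSeconds.
    [HttpGet("hang")]
    public IActionResult Hang()
    {
        System.Threading.Thread.Sleep(System.TimeSpan.FromMinutes(10));
        return Ok();
    }

    // Flip the flag so subsequent /health calls report failure.
    [HttpGet("break")]
    public IActionResult Break()
    {
        _healthy = false;
        return Ok();
    }

    // The liveness endpoint: 200 while healthy, 500 otherwise.
    [HttpGet("health")]
    public IActionResult Health() =>
        _healthy ? Ok("Healthy") : StatusCode(500, "Unhealthy");
}
```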
livenessProbe also has a failureThreshold, which indicates how many failures to tolerate before Kubernetes terminates the pod. If you want the pod to be terminated after just one failure, set it to 1.
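Putting the settings above together, a livenessProbe along these lines (the /health path and port are assumptions, adjust to your app) would restart the pod after a single failed check:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 2   # a handler sleeping longer than this counts as a failure
  failureThreshold: 1 # terminate the pod after the first failure
```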

Related

Service Fabric upgrades keep active connections alive

I am trying to upgrade an application deployed to service fabric.
How can I only upgrade nodes that have no active connections and wait for the busy nodes to finish before upgrading them?
Most of the time, you don't really have to worry about upgrades at the node level, as the SF runtime handles this internally when configured in Monitored mode. This is what we've been using, with a high level of success, and we never really had to do much. It also fit our requirement that all upgrade domains (nodes) have to match our health state policies before being considered healthy.
If you want more advanced control over your upgrades, like using request draining, have a look at the info mentioned here. But to be honest, we've been quite happy just using Monitored mode and investigating failures when they occur. We had some apps with a long-running background task hosted as a stateful actor that occasionally failed an upgrade, and almost always it was due to an issue in the background task itself rather than anything to do with Service Fabric.
Service Fabric knew when no active connections or background tasks were running and would then upgrade the nodes, and we could actually see nodes that were temporarily 'stuck' waiting for an active background task to finish.

ECS + ALB - My applications only respond a few times

I've developed two Spring Boot applications as microservices, and I've used ECS to deploy them into containers.
To do this, I followed the official pet clinic example (https://github.com/aws-samples/amazon-ecs-java-microservices/tree/master/3_ECS_Java_Spring_PetClinic_CICD).
Everything seems to work correctly, but when I make a request to the ALB I very often receive a 502 or 503 HTTP error, and only a few times do I see the correct response from the applications.
Can someone help me?
Thanks in advance.
You receive a 502 when you have no healthy task running, and a 503 when a task is starting or restarting.
All of this means that your task got stopped and your cluster then restarted it, so you should find out what makes your task fail.
It can be something directly in your code that makes it crash, or it can be the health check defined in your target group that is failing.
First, look at your task in the AWS ECS console and see what error the task reports when it's stopped.
But since you are able to make requests for some time before it fails, I'm pretty sure your problem comes from the health check. So go to the target group used by the task (in the AWS EC2 console) and make sure the configured health check path exists and returns a 200 status code.
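If it helps, the same check can be done from the AWS CLI. The ARN and the health check path below are placeholders; verify them against your own setup (Spring Boot apps often expose health at /actuator/health, but only if the actuator dependency is enabled):

```shell
# See why targets are currently failing health checks
aws elbv2 describe-target-health --target-group-arn <target-group-arn>

# Point the health check at a path that actually returns 200
aws elbv2 modify-target-group \
    --target-group-arn <target-group-arn> \
    --health-check-path /actuator/health \
    --matcher HttpCode=200
```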

Health checks in blocking (synchronous) web-frameworks

I have a blocking (synchronous) web framework running under uWSGI. The task is to prepare a /health endpoint for Kubernetes to ensure that the pod is alive. There is no issue with the endpoint itself; the issue is its synchronous nature. For example, if I define 8 processes in uWSGI and the web application is processing 8 heavy requests, a call to /health will be queued, and depending on the timeouts, Kubernetes may not receive a response within the expected period and decide to kill/restart the pod. Of course I could run another web service on a different port, but that would require code changes and increase deployment complexity. Maybe I'm missing something and it's possible to define an exclusive worker in uWSGI that processes the /health endpoint in a non-blocking way? Thanks in advance!
WSGI is an inherently synchronous protocol. There has been some work to create a new async-friendly replacement called ASGI, but it's only implemented by the Django Channels project AFAIK. While you can mount a sync-mode WSGI app in an async server (generally using threads), you can't go the other way. So you could, for example, write some twisted.web code that grabs requests to /health and handles them natively via Twisted and hands off everything else to the WSGI container. I don't know of anything off the shelf for this though.
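One pragmatic workaround in the same spirit, not uWSGI-specific, is to answer /health from a dedicated listener that never shares worker processes with the heavy requests, e.g. a tiny stdlib HTTP server on its own thread started from the application process. This does reintroduce a second port, which the question hoped to avoid, but it needs no extra deployment unit; the port and function name here are illustrative:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health with 200 regardless of how busy the main workers are."""

    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging; probes fire frequently.
        pass

def start_health_server(port=9001):
    """Start the health listener on a daemon thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

The Kubernetes livenessProbe would then point at this port instead of the main application port.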

Warmup services on upgrade in Service Fabric

We are wondering if there is a built-in way to warm up services as part of the service upgrades in Service Fabric, similar to the various ways you could warm up e.g. IIS based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery etc.) before being considered as started and available for other services to contact. This warmup should be part of the upgrade domain processing so the upgrade process should wait for the warmup to be completed and the service reported as OK/Ready.
How are others handling such scenarios, controlling the process for signalling to the service fabric that the specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit Health events yourself. For instance, you can report health 'Unknown' while warming up. And adjust your health policy (HealthCheckWaitDurationSec) to check this.
Reporting health can help. You can't report Unknown; you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact the upgrade. To clear the Error, your service can report health state Ok with RemoveWhenExpired=true and a low TTL (read more on how to report).
You must increase HealthCheckRetryTimeout based on the maximum warm-up time. Otherwise, if a health check is performed while the cluster is evaluated to Error, the upgrade will fail (and roll back or pause, per your policy).
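For reference, these timeouts are passed when starting a monitored upgrade; a sketch with placeholder names and durations:

```powershell
# -HealthCheckWaitDurationSec: fixed wait (can be set to the minimum warm-up time)
# -HealthCheckRetryTimeoutSec: must cover the maximum warm-up time, otherwise an
#   Error health state during warm-up fails the upgrade
Start-ServiceFabricApplicationUpgrade `
    -ApplicationName "fabric:/MyApp" `
    -ApplicationTypeVersion "2.0.0" `
    -Monitored `
    -FailureAction Rollback `
    -HealthCheckWaitDurationSec 60 `
    -HealthCheckRetryTimeoutSec 1200
```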
So, the order of events is:
your service reports Error - "Warming up in progress"
upgrade waits for fixed HealthCheckWaitDurationSec (you can set this to min time to warm up)
upgrade performs health checks: if the service hasn't yet warmed up, the health state is Error, so the upgrade retries until either HealthCheckRetryTimeout is reached or your service is no longer in Error (warm-up completed and your service cleared the Error).
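As a sketch of the reporting side, using the operator cmdlets (a service would normally do the equivalent in code through its health-reporting APIs; the service name, source ID, and property below are placeholders):

```powershell
# While warming up: report Error so upgrade health checks fail and retry
Send-ServiceFabricServiceHealthReport `
    -ServiceName "fabric:/MyApp/MyService" `
    -SourceId "WarmupWatcher" `
    -HealthProperty "Warmup" `
    -HealthState Error

# When warm-up completes: clear the Error with a short-lived Ok report
Send-ServiceFabricServiceHealthReport `
    -ServiceName "fabric:/MyApp/MyService" `
    -SourceId "WarmupWatcher" `
    -HealthProperty "Warmup" `
    -HealthState Ok `
    -TimeToLiveSec 30 `
    -RemoveWhenExpired
```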

Gracefully draining sessions from Dropwizard application using Kubernetes

I have a Dropwizard application that holds short-lived sessions (phone calls) in memory. There are good reasons for this, and we are not changing this model in the near term.
We will be setting up Kubernetes soon and I'm wondering what the best approach would be for handling shutdowns / rolling updates. The process will need to look like this:
Remove the DW node from the load balancer so no new sessions can be started on this node.
Wait for all remaining sessions to be completed. This should take a few minutes.
Terminate the process / container.
It looks like kubernetes can handle this if I make step 2 a "preStop hook"
http://kubernetes.io/docs/user-guide/pods/#termination-of-pods
My question is, what will the preStop hook actually look like? Should I set up a DW "task" (http://www.dropwizard.io/0.9.2/docs/manual/core.html#tasks) that just waits until all sessions are completed, and curl it from Kubernetes? Or should I put a bash script in the Docker container with the DW app that polls some session-count service until none are left, and execute that?
Assume you don't use the preStop hook, and a pod deletion request has been issued.
The API server processes the deletion request and modifies the pod object.
Endpoint controller observes the change and removes the pod from the list of endpoints.
On the node, a SIGTERM signal is sent to your container/process.
Your process should trap the signal and drain all existing requests. Note that this step should not take longer than the terminationGracePeriodSeconds defined in your pod spec.
Alternatively, you can use the preStop hook which blocks until all the requests are drained. Most likely, you'll need a script to accomplish this.
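The preStop variant could be sketched like this in the pod spec. The session-count endpoint, image name, and grace period are placeholders for whatever the Dropwizard app actually exposes (e.g. via a DW task or admin resource):

```yaml
spec:
  # Upper bound for the whole shutdown: preStop hook plus SIGTERM handling.
  terminationGracePeriodSeconds: 600
  containers:
  - name: dropwizard-app
    image: my-registry/dropwizard-app:latest
    lifecycle:
      preStop:
        exec:
          # Hypothetical drain loop: poll a session-count endpoint until it reaches 0.
          command:
          - /bin/sh
          - -c
          - until [ "$(curl -s localhost:8081/session-count)" = "0" ]; do sleep 5; done
```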