.NET Core / Kubernetes - SIGTERM, clean shutdown

I'm trying to verify that shutdown is completing cleanly on Kubernetes, with a .NET Core 2.0 app.
I have an app which can run in two "modes" - one using ASP.NET Core and one as a kind of worker process. Both use a console logger plus a JSON logger (whose output ends up in Elasticsearch via a Filebeat sidecar container), and both indicate startup and shutdown progress.
Additionally, I have console output which writes directly to stdout when a SIGTERM or Ctrl-C is received and shutdown begins.
Locally, the app works flawlessly - I get the direct console output, then the logger output flowing to stdout on Ctrl+C (on Windows).
My experiment scenario:
App deployed to a Kubernetes cluster on Google Cloud (using helm, though I imagine that doesn't make a difference)
Using kubectl logs -f to stream logs from the specific container
Killing the pod from the Google Cloud console site, or deleting the resources via helm delete
Dockerfile is FROM microsoft/dotnet:2.1-aspnetcore-runtime and has ENTRYPOINT ["dotnet", "MyAppHere.dll"], so not wrapped in a bash process or anything
Not specifying a terminationGracePeriodSeconds, so I assume it defaults to 30 seconds
Observing output returned
Results:
The API pod log streaming showed just the immediate console output, "[SIGTERM] Stop signal received", but not the other console logger output about the shutdown process
The worker pod log streaming showed a little more - the same console output plus some of the console logger output about the shutdown process
The JSON logs didn't seem to pick up any of the shutdown log output
My conclusions:
1. I don't know whether Kubernetes is allowing the process to complete before terminating it, or is just issuing SIGTERM and then killing things very quickly. I think it should be waiting, but then why don't I see the complete console logger output?
2. I don't know whether stdout log streaming is cut off at some point before the process finally terminates.
3. I would guess that the JSON output doesn't come through to ES because Filebeat running in the sidecar terminates even if there's outstanding content in its files still to send
I would like to know:
Can anyone advise on points 1 and 2 above?
Any ideas for a way to allow a little extra time or leeway for the sidecar to send stuff up - a pod container termination order, a delay on shutdown for that container, etc.? (A sketch of the kind of delay I mean follows below.)
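To illustrate, I'm imagining something like a preStop sleep on the sidecar to hold it open while Filebeat drains - a sketch only, where the container name, image tag, and timings are all made up:

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: filebeat-sidecar
    image: docker.elastic.co/beats/filebeat:6.4.2
    lifecycle:
      preStop:
        exec:
          # The preStop hook runs to completion before this container
          # receives SIGTERM, giving Filebeat time to ship pending lines.
          command: ["/bin/sh", "-c", "sleep 15"]
```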

SIGTERM does indeed signal termination. The less obvious part is that when the SIGTERM handler returns, everything is considered finished.
The fix is to not return from the SIGTERM handler until the app has finished shutting down - for example, by using a ManualResetEvent and Wait()ing on it in the handler.
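A minimal sketch of that pattern for the worker-style mode, assuming .NET Core 2.x on Linux where SIGTERM surfaces as AppDomain.ProcessExit (DoWork is a hypothetical stand-in for the real loop body):

```csharp
using System;
using System.Threading;

class Program
{
    private static readonly ManualResetEventSlim ShutdownRequested = new ManualResetEventSlim(false);
    private static readonly ManualResetEventSlim ShutdownComplete = new ManualResetEventSlim(false);

    static void Main()
    {
        AppDomain.CurrentDomain.ProcessExit += (sender, e) =>
        {
            Console.WriteLine("[SIGTERM] Stop signal received");
            ShutdownRequested.Set();
            // Do NOT return yet: once this handler returns, the runtime
            // considers shutdown finished and the process dies.
            ShutdownComplete.Wait();
        };

        // Main work loop: run until a stop is requested.
        while (!ShutdownRequested.IsSet)
        {
            DoWork();
            ShutdownRequested.Wait(TimeSpan.FromSeconds(1)); // idle between iterations
        }

        Console.WriteLine("Shutting down: flushing loggers, disposing resources...");
        // ... real cleanup goes here ...
        ShutdownComplete.Set(); // now the ProcessExit handler may return
    }

    static void DoWork() { /* hypothetical unit of work */ }
}
```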

I've started to look into this for my own purposes and have come across your question over a year after it was posted... This is a bit late, but have you tried GraceTerm?
There is an associated NuGet package for this.
From the description...
Graceterm middleware provides an implementation to ensure graceful shutdown of ASP.NET Core applications. The basic concept is: after the application receives a SIGTERM (a signal asking it to terminate), Graceterm will hold it alive until all pending requests are completed or a timeout occurs.
I haven't personally tried this yet, but it does look promising.
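Based on its description, the wiring would presumably look something like the following - the UseGraceterm extension name is my assumption from the package docs, so verify it against the project README:

```csharp
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Register Graceterm early so it can track in-flight requests
        // before they reach the rest of the pipeline.
        app.UseGraceterm();

        app.UseMvc();
    }
}
```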

Try adding STOPSIGNAL SIGINT to your Dockerfile, for example:
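Based on the Dockerfile from the question (the COPY step is a placeholder for however the app gets into the image):

```dockerfile
FROM microsoft/dotnet:2.1-aspnetcore-runtime
WORKDIR /app
COPY ./publish .
# Ask Docker (and therefore Kubernetes, which honors the image's stop
# signal) to send SIGINT instead of SIGTERM; .NET Core surfaces SIGINT
# via Console.CancelKeyPress - the same path that works locally on Ctrl+C.
STOPSIGNAL SIGINT
ENTRYPOINT ["dotnet", "MyAppHere.dll"]
```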

Related

Airflow: what do `airflow webserver`, `airflow scheduler` and `airflow worker` exactly do?

I've been working with Airflow for a while now; it was set up by a colleague. Lately I've run into several errors, which require me to understand in more depth how to fix certain things within Airflow.
I do understand what the 3 processes are, I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I see somewhere afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I for example run airflow webserver, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? How else should I get out of the webserver screen then?
Each process does what it is built to do while it is running (the webserver provides a UI, the scheduler determines when things need to be run, and the workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start the processes to do stuff. i.e. starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it is starting a Python Flask app, and the webserver runs only as long as that process does; if you kill the command, it goes down.
All three have to be running for Airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever have one scheduler running, but if you were to run two processes of airflow webserver (ignoring port conflicts), you would then have two separate HTTP servers running using the same metadata database. Workers are a little different, in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in console. If you are running them as a daemon or background process, you can check what processes are running on the server.
If you ctrl+c you are sending a signal to kill the process. Ideally for a production Airflow cluster, you should have some supervisor monitoring the processes and ensuring that they are always running. Locally you can either run the commands in the foreground of separate shells, minimize them, and just keep them running when you need them, or run them as background daemons with the -D argument, e.g. airflow webserver -D, as shown below.
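For example (assuming the default ports and a worker-based executor):

```sh
# Start each Airflow process as a background daemon (-D):
airflow webserver -D    # serves the UI (a Flask app), port 8080 by default
airflow scheduler -D    # determines which tasks are due and queues them
airflow worker -D       # picks up queued tasks and executes them

# Since they are daemonized, confirm they are running with e.g.:
ps aux | grep airflow
```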

How can I log each time Kubernetes kills a process due to resource use?

Is there any way to log this action / event?
It seems that if a process is using too many resources and is not the main process, it gets a SIGKILL signal. I'd like to log this action.
You can use a preStop lifecycle hook to run a script when the container is being terminated.
Use that script to send a notification or to log info to stdout.
Later, when the pod gets killed, we can still get the older logs by using the -p flag with kubectl logs.
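A sketch of such a hook - the pod and container names are made up, and note that a hook's own output isn't captured in the container log by itself, which is why this redirects to PID 1's stdout:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:latest
    lifecycle:
      preStop:
        exec:
          # Runs while the container is being terminated; redirecting to
          # /proc/1/fd/1 makes the message appear in `kubectl logs`.
          command: ["/bin/sh", "-c", "echo 'container is terminating' > /proc/1/fd/1"]
```

Afterwards, kubectl logs my-app -p retrieves the logs of the previous (killed) container instance.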

cf stop command does not perform graceful shutdown on bluemix

I have a Node app in Bluemix which holds some transaction cache in memory, and I would like to flush this cache to the DB before the application goes down. So I have the appropriate event handlers to intercept SIGTERM/SIGINT signals, and all works fine from my laptop; however, it seems that the cf stop command does not perform a graceful shutdown.
Unfortunately, there is no clear documentation around this topic; in one place in the Cloud Foundry app lifecycle docs they do mention that first a SIGTERM is issued and then it waits for 10 seconds, etc., but I'm not seeing this happening. Probably a bug on their side. https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html
Has anyone noticed this issue, and does anyone have a workaround?
CF is sending the SIGTERM first, but because of how the app is started by intermediate processes, it's not being correctly propagated to your app.
As a workaround, disable App Management by setting the CF environment variable BLUEMIX_APP_MGMT_INSTALL=false and prefix your app's start command in your package.json file with 'exec' (e.g. exec node app.js).
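That is, something like this in package.json (app.js standing in for your real entry point):

```json
{
  "name": "my-app",
  "scripts": {
    "start": "exec node app.js"
  }
}
```

The exec makes the shell that launches the start command replace itself with the node process, so the SIGTERM that CF sends reaches your app directly rather than a wrapper shell. The environment variable can be set with cf set-env <app-name> BLUEMIX_APP_MGMT_INSTALL false, followed by a restage.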

Wakanda Server solution.quitServer() sequence of operations

I have already read the thread:
Wakanda Server scripted clean shutdown
This does not address my question.
We are running Wakanda Server 11.197492.
We want an automated, orderly, ensured shut-down of Wakanda Server - no matter which version we are running.
Before we give the "shutdown" command, we will stop inbound traffic for 1 to 2 minutes, to ensure that no httpHandlers are running when we shut-down.
We have scripted a single SharedWorker process to look for the "shutdown" command, and execute solution.quitServer().
At this time no other SharedWorker processes are running, and no active threads should be executing. This will likely not always be the case.
When this is executed, is a "solution quit" guaranteed?
Is solution.quitServer() the best way to initiate an automated solution shutdown?
Will there be a better way?
Is there a way to know if any of the Solution's Projects are currently executing threads prior to shutting down?
If more than 1 Project issues a solution.quitServer() method within a few seconds of each other, will that be a problem?
solution.quitServer() is probably not the best way to shutdown your server as it will be deprecated in the next major release.
I would recommend sending a SIGKILL instead, as you point out in your question:
Wakanda Server scripted clean shutdown
Some fixes were made in v1.1.0 to safely close Wakanda Server after a kill.
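For what it's worth, from a shell that would be something along these lines - the process match pattern is a guess, so verify it first with pgrep -fl wakanda on your server:

```sh
# Send SIGKILL (the signal referred to above) to the Wakanda Server process.
pkill -KILL -f wakanda
```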

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Ever since I started queueing over 1,000 jobs at once, Gearman hasn't worked properly...
The problem is that when I submit the jobs in background mode, I can see from the monitoring page (Gearman monitor) that the jobs were correctly queued,
but the queue is drained right after (within a few seconds) without the jobs being delivered to the worker.
In the end, the jobs are never executed by the worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely and reinstalling Gearman as well as the PHP library. (I'm using 1 CentOS and 1 Ubuntu machine with the PHP Gearman library; the versions are 0.34 and 1.0.2.)
But no luck yet... the job server is just misbehaving, as I explained above.
What should I do for now?
Can I check the workers' state, or see and monitor the whole process from the queueing of the jobs to their delivery to the worker?
When I tried gearmand with an option like gearmand -vvvv, it never prints anything on the screen while I register a worker with the server and run a job with client code (PHP).
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.