taskKillGracePeriodSeconds is not working for DC/OS Marathon Application?

We have set up a DC/OS (version 1.9) cluster on AWS nodes. We create the Marathon application definition with "taskKillGracePeriodSeconds" set to 60, and we catch SIGTERM in our application to shut it down gracefully. But this is not working: on Scale Down / Destroy, Marathon kills the application immediately instead of waiting 60 seconds as expected. We do get the SIGTERM callback, but the application is killed right after it. We have also tried starting the Mesos slave agents with the following attributes set in the file /var/lib/dcos/mesos-slave-common:
MESOS_ATTRIBUTES=executor_shutdown_grace_period:60secs;docker_stop_timeout:60secs
but this does not help either.
The DC/OS cluster agents run centos-release-7-2.1511.el7.centos.2.10.x86_64.
Has anybody been able to use taskKillGracePeriodSeconds successfully?
Please help us work this out.
Thanks.

Are you using Docker containers?
As far as I remember, there was a problem with forwarding the SIGTERM signal when using process groups (= containers).
To test this on your cluster, can you deploy an app with the following command, using just the Mesos containerizer and a taskKillGracePeriodSeconds of 10 seconds?
trap "echo ' killing' && sleep 5 && echo 'test' && sleep 100" SIGTERM && sleep 100000
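A minimal sketch of the app definition this test could use, built as a Python dict and printed as JSON for Marathon. The app id, cpus, and mem values are placeholders chosen for illustration, not values from the question. It assumes that leaving out the "container" section makes Marathon run the command under the Mesos containerizer.

```python
import json

# The shell command from the answer above: on SIGTERM it logs, sleeps 5 s,
# logs again, then keeps sleeping, so a 10 s grace period should be visible.
CMD = (
    "trap \"echo ' killing' && sleep 5 && echo 'test' && sleep 100\" SIGTERM"
    " && sleep 100000"
)


def build_app(app_id="/sigterm-test", grace_seconds=10):
    """Return a Marathon app definition as a plain dict.

    app_id, cpus and mem are hypothetical values for this sketch.
    No "container" key is set (assumption: Marathon then uses the
    Mesos containerizer rather than Docker).
    """
    return {
        "id": app_id,
        "cmd": CMD,
        "cpus": 0.1,
        "mem": 32,
        "instances": 1,
        "taskKillGracePeriodSeconds": grace_seconds,
    }


if __name__ == "__main__":
    # Print JSON that can be saved to a file and submitted to Marathon.
    print(json.dumps(build_app(), indent=2))
```

If the grace period is honoured, the task's stdout should show " killing" and, about five seconds later, "test" before the task is finally killed.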

Related

Upgrade container for long running jobs in kubernetes

I have long running workers running in kubernetes - more than 5 hours. I want to update the container without interrupting the long running jobs. I want any newly started work off the queue to start with the new version of the release but I don't want to interrupt the currently running work.
BTW I'm not actually using Jobs, I'm using Deployments with workers that get work off a redis queue.
What is the best way to do a release without killing the long running work? Options I have considered:
Have a huge timeout for SIGTERM
preStop hooks?
Another container in the pod that checks for the latest version and updates once work is done?
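The first option (a long SIGTERM timeout) only helps if the worker reacts to SIGTERM by finishing its current job and not taking another one. A rough sketch of that worker side follows; the Worker class and the fetch_job / run_job callables are hypothetical stand-ins for the real Redis queue code, and returning None from fetch_job simply means "queue empty" in this demo.

```python
import signal


class Worker:
    """Runs jobs one at a time and stops between jobs once asked to."""

    def __init__(self, fetch_job, run_job):
        self.fetch_job = fetch_job  # hypothetical: returns a job or None
        self.run_job = run_job      # hypothetical: processes one job
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        # Usable directly as a signal handler: only sets a flag, so the
        # job that is currently running is never interrupted.
        self.stopping = True

    def run(self):
        """Process jobs until stopped or the queue is empty; return the count."""
        done = 0
        while not self.stopping:
            job = self.fetch_job()
            if job is None:
                break
            self.run_job(job)
            done += 1
        return done


if __name__ == "__main__":
    queue = ["job-1", "job-2", "job-3"]
    worker = Worker(lambda: queue.pop(0) if queue else None, print)
    signal.signal(signal.SIGTERM, worker.request_stop)
    print("processed", worker.run())
```

For this to work in Kubernetes, the pod's terminationGracePeriodSeconds would have to be longer than the longest job, because the container is killed with SIGKILL once that period expires.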

.NET Core / Kubernetes - SIGTERM, clean shutdown

I'm trying to verify that shutdown is completing cleanly on Kubernetes, with a .NET Core 2.0 app.
I have an app which can run in two "modes" - one using ASP.NET Core and one as a kind of worker process. Both use Console and JSON-which-ends-up-in-Elasticsearch-via-Filebeat-sidecar-container logger output which indicate startup and shutdown progress.
Additionally, I have console output which writes directly to stdout when a SIGTERM or Ctrl-C is received and shutdown begins.
Locally, the app works flawlessly - I get the direct console output, then the logger output flowing to stdout on Ctrl+C (on Windows).
My experiment scenario:
App deployed to GCS k8s cluster (using helm, though I imagine that doesn't make a difference)
Using kubectl logs -f to stream logs from the specific container
Killing the pod from GCS cloud console site, or deleting the resources via helm delete
Dockerfile is FROM microsoft/dotnet:2.1-aspnetcore-runtime and has ENTRYPOINT ["dotnet", "MyAppHere.dll"], so not wrapped in a bash process or anything
Not specifying a terminationGracePeriodSeconds, so I guess it defaults to 30 seconds
Observing output returned
Results:
The API pod log streaming showed just the immediate console output, "[SIGTERM] Stop signal received", not the other Console logger output about shutdown process
The worker pod log streaming showed a little more - the same console output and some Console logger output about shutdown process
The JSON logs didn't seem to pick any of the shutdown log output
My conclusions:
I don't know if Kubernetes is allowing the process to complete before terminating it, or just issuing SIGTERM and then killing things very quickly. I think it should be waiting, but then why is the console logger output incomplete?
I don't know if console output is cut off at some point during stdout log streaming, before the process finally terminates.
I would guess that the JSON logs don't reach Elasticsearch because Filebeat running in the sidecar terminates even if there is outstanding data in files to send.
I would like to know:
Can anyone advise on points 1,2 above?
Any ideas for a way to allow a little extra time or leeway for the sidecar to send stuff up, like a pod container termination order, delay on shutdown for that container, etc?

SIGTERM does indeed signal termination. The less obvious part is that when the SIGTERM handler returns, everything is considered finished.
The fix is to not return from the SIGTERM handler until the app has finished shutting down. For example, using a ManualResetEvent and Wait()ing it in the handler.
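The same shape can be sketched in Python as an analogy only; it is not the .NET mechanism, and in Python returning from a signal handler does not by itself end the process. Two events are used: the handler sets stop_requested and then blocks on shutdown_complete, which the worker thread sets once its cleanup is done. All names here are made up for the sketch, and the 25 second timeout is an assumption chosen to stay under the 30 second default grace period.

```python
import signal
import threading
import time

stop_requested = threading.Event()     # set by the handler
shutdown_complete = threading.Event()  # set by the worker after cleanup


def worker():
    # Stand-in for the application's real work loop.
    while not stop_requested.is_set():
        time.sleep(0.05)
    # Stand-in for flushing logs, closing connections, etc.
    time.sleep(0.2)
    shutdown_complete.set()


def on_sigterm(signum, frame):
    # Ask the worker to stop, then do not return until it has finished
    # (or the timeout, kept below the pod's grace period, runs out).
    stop_requested.set()
    shutdown_complete.wait(timeout=25)


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, on_sigterm)
    t = threading.Thread(target=worker)
    t.start()
    t.join()
```

The point of the timeout is that a stuck cleanup cannot hold the handler forever; the orchestrator would kill the process anyway once its own grace period ends.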

I've started to look into this for my own purposes and have come across your question over a year after it was posted... This is a bit late, but have you tried GraceTerm?
There is an associated NuGet package for this.
From the description...
Graceterm middleware provides implementation to ensure graceful shutdown of AspNet Core applications. The basic concept is: After application received a SIGTERM (a signal asking it to terminate), Graceterm will hold it alive till all pending requests are completed or a timeout occur.
I haven't personally tried this yet, but it does look promising.

Try adding STOPSIGNAL SIGINT to your Dockerfile.

Which process to kill to stop a Kafka Connect Worker?

I want to kill my Kafka Connect distributed worker, but I am unable (or I do not know how) to determine which process running on Linux is that worker.
When running
ps aux | grep worker
I do see a lot of worker processes, but I am unsure which is the Connect worker and which are standard non-Connect workers.
It is true that only one of these processes was started yesterday, and I suspect that it is the one, but that would obviously not be a sufficient condition in all cases, for example if the whole Kafka cluster had been brought online yesterday. So, in general, how can I determine which process is a Kafka Connect worker?
What is the foolproof method here?

If the other worker processes are not related to Connect, you can find the Connect process by searching for the properties file that you passed when starting the Connect worker:
ps aux | grep connect-distributed.properties
There is no kill script for Connect workers. You have to run the kill command with SIGTERM to stop the worker process gracefully.
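The two steps above (find the process by its properties file, then send SIGTERM) could be scripted roughly as follows. This is a sketch, assuming a Linux ps aux layout where the command is the eleventh column; the function names are made up for illustration.

```python
import os
import signal
import subprocess


def find_pids(ps_output, needle):
    """Return PIDs from `ps aux` text whose command line contains needle."""
    pids = []
    for line in ps_output.splitlines():
        # ps aux: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or something unexpected
        if needle in fields[10]:
            pids.append(int(fields[1]))
    return pids


def stop_connect_worker(needle="connect-distributed.properties"):
    """Send SIGTERM to every process matching needle; return the PIDs."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    pids = [pid for pid in find_pids(out, needle) if pid != os.getpid()]
    for pid in pids:
        os.kill(pid, signal.SIGTERM)  # graceful stop, same as plain `kill`
    return pids


if __name__ == "__main__":
    print("sent SIGTERM to:", stop_connect_worker())
```

Matching on the properties file name only works if that name is unique to the Connect worker on the host; if it was started with a differently named file, the needle has to be changed to match.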

Airflow: what do `airflow webserver`, `airflow scheduler` and `airflow worker` exactly do?

I've been working with Airflow for a while now; it was set up by a colleague. Lately I have run into several errors, which require me to understand in more depth how to fix certain things within Airflow.
I do understand what the 3 processes are, I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I somewhere see afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I for example run airflow webserver, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? How else should I get out of the webserver screen then?

Each process does what they are built to do while they are running (webserver provides a UI, scheduler determines when things need to be run, and workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start the processes to do stuff, i.e. starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it starts a Python Flask app. While that process is running, the webserver is running; if you kill the command, it goes down.
All three have to be running for Airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever have one scheduler running, but if you were to run two processes of airflow webserver (ignoring port conflicts), you would then have two separate HTTP servers running against the same metadata database. Workers are a little different in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in console. If you are running them as a daemon or background process, you can check what processes are running on the server.
If you press Ctrl+C, you are sending a signal to kill the process. Ideally, for a production Airflow cluster, you should have some supervisor monitoring the processes and ensuring that they are always running. Locally, you can either run the commands in the foreground of separate shells, minimize them, and keep them running while you need them, or run them as background daemons with the -D argument, e.g. airflow webserver -D.

Gracefully update running celery pod in Kubernetes

I have a Kubernetes cluster running Django, Celery, RabbitMQ and Celery Beat. I have several periodic tasks spaced out throughout the day (so as to keep server load down). There are only a few hours when no tasks are running, and I want to limit my rolling updates to those times, without having to track it manually. So I'm looking for a solution that will allow me to fire off a script or task of some sort that will monitor the Celery server and trigger a rolling update once there's a window in which no tasks are actively running. There are two possible ways I thought of doing this, but I'm not sure which is best, nor how to implement either one.
Run a script (bash or otherwise) that checks up on the Celery server every few minutes, and initiates the rolling-update if the server is inactive
Increment the Celery app name before each update (in the Beat run command, the Celery run command, and in the celery.py config file), create a new Celery pod, rolling-update the Beat pod, and then delete the old Celery pod 12 hours later (a reasonable time span for all running tasks to finish)
Any thoughts would be greatly appreciated.
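The first idea (poll Celery, then update when idle) might look roughly like the sketch below. It relies on Celery's inspect API, where app.control.inspect().active() returns a mapping of worker name to currently executing tasks, or None when no worker replies. The broker URL and deployment name are hypothetical, and kubectl rollout restart is used here as one possible way to trigger the update rather than the command from the question.

```python
import subprocess
import time


def is_idle(active):
    """True only if at least one worker replied and none is running a task."""
    if not active:
        return False  # None or {}: no reply, so idleness is unknown
    return all(len(tasks) == 0 for tasks in active.values())


def wait_until_idle(get_active, poll_seconds=300, sleep=time.sleep):
    """Poll get_active() until every worker reports no active tasks."""
    while not is_idle(get_active()):
        sleep(poll_seconds)


def rolling_restart(deployment):
    # Assumption: the Celery workers run as a Deployment with this name.
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{deployment}"],
        check=True,
    )


if __name__ == "__main__":
    from celery import Celery  # imported here so the helpers stay importable

    app = Celery(broker="amqp://guest@rabbitmq//")  # hypothetical broker URL
    wait_until_idle(lambda: app.control.inspect().active())
    rolling_restart("celery-worker")  # hypothetical deployment name
```

There is still a race between the idle check and the update: a periodic task could start in between, so a graceful worker shutdown with a sufficient termination grace period would still be needed as a safety net.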
