I am creating a deployment and a pod in Lens, where a specific program is launched. This program prints logs and sends a heartbeat every 10 minutes. From time to time, the heartbeat just stops without any exception and the program stops working, but the Pod does not restart; it keeps running as if nothing happened. Has anyone faced this problem?
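Kubernetes only restarts a container when its main process exits or when a liveness probe fails, so a program that hangs without exiting leaves the Pod in a Running state. One thing worth checking is a liveness probe against whatever the heartbeat exposes; the sketch below is only an illustration, and the name, image, endpoint, port and timings are assumptions, not values from the original setup:

```yaml
# Minimal sketch of a liveness probe; endpoint, port and timings are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: heartbeat-app                                    # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/heartbeat-app:latest   # hypothetical image
      livenessProbe:
        httpGet:
          path: /healthz       # assumed health endpoint exposed by the program
          port: 8080           # assumed port
        initialDelaySeconds: 60
        periodSeconds: 60      # probe every minute
        failureThreshold: 15   # ~15 minutes without a healthy response restarts the container
```

If the program exposes no HTTP endpoint, an exec or TCP probe against something the heartbeat already touches (for example the age of a heartbeat file) would serve the same purpose.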
Related
We have applications that work with Kafka (MSK). We noticed that once a pod starts to shut down (during autoscaling or a deployment), the app container loses all active connections: the SIGTERM signal causes Kuma to close all connections immediately, which causes data loss due to unfinished sessions that don't get closed gracefully on the app side, and after that we receive connection errors to the Kafka brokers.
Does anyone have an idea how to make Kuma wait for some time after it gets the SIGTERM signal, to let the sessions close gracefully?
Or maybe a way to let the app know about the shutdown before Kuma does?
Or any other idea?
This is a known issue that is being fixed in the upcoming 1.7 release: https://github.com/kumahq/kuma/pull/4229
While running the standard Airflow examples with Airflow 2.1.2, DAGs are taking a long time to complete. This happens on every DAG run, but only when triggering from the Airflow GUI; it isn't a problem when running a task as a test from the Airflow command line. Watching the scheduler log as it runs shows that after a task runs, the DagFileProcessorManager has to be restarted before the scheduler continues to the next tasks, which takes 1 to 2 minutes. The restart happens after an absence of heartbeat responses, with this error:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use PostgreSQL instead of SQLite.
(2) Switch from SequentialExecutor to LocalExecutor.
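For reference, a minimal sketch of what that looks like in airflow.cfg for Airflow 2.1.x (the connection string is a placeholder; in this version both keys live under [core]):

```ini
# airflow.cfg
[core]
executor = LocalExecutor
# placeholder connection string; adjust user, password, host and database name
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
```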
Just to add to that: we had other similar reports, so we decided to show a very clear warning for this case in the UI (it will be released in the next version):
https://github.com/apache/airflow/pull/17133
This kind of issue has occurred several times.
The log shows that after Celery had been serving for a long time, there was a stall of about 50 seconds. I don't know what happened during that "break".
But it works properly after restarting Celery, and this task normally takes less than 10 seconds.
I suppose something in the Celery backend was overloaded and was released after the restart.
But I have no idea what the real problem is, and I can hardly figure out such an issue.
When I upgraded Concourse from 3.4.0 to 3.5.0, all running jobs suddenly changed their state from running to errored. I can now see the string 'no workers' at the start of their logs. Starting the jobs manually, or having them triggered by the next changes, didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what BOSH did at the time, and I saw that this change of job states took place all at once while either the web or the db VM was being upgraded (I don't know which one). I am pretty sure that the worker VMs had not been touched by BOSH yet.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service long enough for all the workers to expire. Workers continuously heartbeat, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.
I want to implement automatic service restarting for several Tomcat applications that take a long time to start, sometimes over 10 minutes.
Mainly, the check would verify that the application responds over HTTP with a valid response.
Still, that is not the problem; the problem is how to prevent this uptime check from failing while the service is under maintenance, scheduled or not.
I don't want the service to be started if it was stopped manually with `service appname stop`.
I considered creating .maintenance files on stop or restart actions of the daemon and checking for them before triggering an automated restart.
So far the only problem I wasn't able to solve properly is how to detect that the app has finished starting up, so the .maintenance file can be removed and the automatic restart works as intended.
Note that an init.d script is not supposed to wait, so the daemon should start a background command that solves this problem.
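A minimal sketch of such a background check, assuming a hypothetical health URL and .maintenance path (both placeholders, not from the original setup); the init.d start/restart action would launch it with something like `nohup ... &` and return immediately:

```python
#!/usr/bin/env python3
"""Poll the application until it answers over HTTP, then clear the .maintenance flag.

Launched in the background by the init.d start/restart action, for example:
    nohup /usr/local/bin/clear_maintenance.py &
The URL and file path below are placeholders for illustration.
"""
import os
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/appname/health"   # hypothetical health endpoint
MAINTENANCE_FILE = "/var/run/appname.maintenance"     # flag created by the stop/restart action
POLL_INTERVAL = 15          # seconds between checks
STARTUP_TIMEOUT = 20 * 60   # give up after 20 minutes (the app can take >10 minutes to start)


def app_is_up() -> bool:
    """Return True once the application answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def main() -> None:
    deadline = time.time() + STARTUP_TIMEOUT
    while time.time() < deadline:
        if app_is_up():
            # Startup finished: remove the flag so the uptime check is armed again.
            try:
                os.remove(MAINTENANCE_FILE)
            except FileNotFoundError:
                pass
            return
        time.sleep(POLL_INTERVAL)
    # Timed out: leave the .maintenance file in place so the monitor does not
    # restart a deployment that is still (or permanently) down.


if __name__ == "__main__":
    main()
```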