Kuber container launched in lens stopping working unexpectedly - kubernetes

I am creating development and pod in lens. There specific program is launched. This program prints logs and has heartbeat every 10 minutes. From time to time, this heartbeat just stops, without any exception and programs stops working, but Pod does not restart, it continues working, like nothing happened. Has anyone faced the problem?

Related

SIGTERM signal arrives first to kuma and stops all active application connections immediately

we have applications that work with Kafka (MSK), we noticed that once pod is starting to shutdown (during autoscaling or deployment) the app container loses all active connections and the SIGTERM signal causes Kuma to close all connections immediately which cause data loss due to unfinished sessions (which doesn’t get closed gracefully) on the app side and after that we receive connection errors to the kafka brokers,
is anyone have an idea how to make Kuma wait some time once it gets the SIGTERM signal to let the sessions close gracefully?
or maybe a way to let the app know before the kuma about the shutsown?
or any other idea ?
This is known issue getting fixed in the coming 1.7 release: https://github.com/kumahq/kuma/pull/4229

Airflow example DAGs take a long time in GUI and the scheduler is showing this: DagFileProcessorManager (PID=...) last sent a heartbeat

While running standard Airflow examples with airflow 2.1.2, DAGs are taking a long time to complete. On every DAG run, this problem occurs. The problem happens when running from the airflow GUI. It isn't a problem when running as a test from the airflow command line. Looking at scheduler log as it runs, this is what is apparent: after a task runs, apparently the DagFileProcessorManager has to be restarted for it to continue to the next tasks, which take 1 to 2 minutes. The restart happens after the absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use postgresql instead of sqlite.
(2) Switch from SequentialExecutor to LocalExecutor.
Just to add to that - we had other similar reports and we decided to make a very clear warning in such case in the UI (will be released in the next version):
https://github.com/apache/airflow/pull/17133

task executed slowly after celery serving for long time but still work quickly after celery restart

This kind of issues occurs several times
The log shows that when celery serves for a long time, there was a stuck about 50s. I dont know what happened in that "break"
But it works properly after restarting celery and indeed this task only consumes less than 10s
I suppose there was something overloaded in the backend of celery then be released after restart.
But I have no idea what's the real problem and hardly come up with such issue

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of concourse from 3.4.0 to 3.5.0, suddenly all running jobs changed their state from running to errored. I can see the string 'no workers' appearing at the start of their log now. Starting the jobs manually or triggered by the next changes didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM it's possible that it was out of service for long enough that all workers expired. Workers continuously heartbeat and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished but if scheduling happened before they heartbeats, that would cause those errors.

How to properly check if a slow starting java tomcat application is running so you can restart it?

I want to implement a automatic service restarting for several tomcat applications, applications that do take a lot of time to start, even over 10 minutes.
Mainly the test would check if the application is responding on HTTP with a valid response.
Still, this is not the problem, the problem is how to prevent this uptime check to fail while the service is under maintenance, scheduled or not.
I don't want for this service to be started if it was stopped manually, with `service appname stop".
I considered creating .maintenance files on stop or restart actions of the daemon and checking for them before triggering an automated restart.
So far the only problem that I wasn't able to properly solve was, how to detect that the app finished starting up and remove the .maintenance file, so the automatic restart would work properly.
Note, an init.d script is not supposed to wait, so the daemon should start a background command that solves this problem.