Airflow Webserver Shutting down - webserver

my airflow webserver shut down abruptly around the same timing about 16:37 GMT.
My airflow scheduler runs fine (no crash) tasks still run.
There is not much error except.
Handling signal: ttou
Worker exiting (pid: 118711)
ERROR - No response from gunicorn master within 120 seconds
ERROR - Shutting down webserver
Handling signal: term
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Shutting down: Master
Is it a cause of memory?
My cfg setting for webserver is standard.
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
Update:
Ok its doesn't crash everyday but today I have gunicorn unable to restart log.
ERROR - [0/0] Some workers seem to have died and gunicorn did not restart them as expected
Update: 30 October 2020
[CRITICAL] WORKER TIMEOUT (pid:108237)
I am getting this, I have increased timeout to 240, twice the default value.
Anyone know why this keep arising?

Related

What does "Pool is now shutting down as requested" mean when using host connection pools

I have a few streams that wake every min or so and pulling some docs from the DB and performing some actions and in the end sending messages to SNS.
The tick interval is every 1 min currently.
Every few minutes I see this error info in the log:
[INFO] [06/04/2020 07:50:32.326] [default-akka.actor.default-dispatcher-5] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool is now shutting down as requested.
[INFO] [06/04/2020 07:51:32.666] [default-akka.actor.default-dispatcher-15] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
What does it mean? Did someone have it before? 443 was worrying me.
Akka http connection pools are terminated by akka automatically if not used for a certain time (default is 30 seconds). This can be configured and set to infinite if needed.
The pools are re-created on next use but this takes some time, so the request initiating the creation will be "blocked" till the pool is re-created.
From documentation.
The time after which an idle connection pool (without pending requests) will automatically terminate itself. Set to infinite to completely disable idle timeouts.
The config parameter that controls it is
akka.http.host-connection-pool.idle-timeout
The log message points to the config parameter too
Pool shutting down because akka.http.host-connection-pool.idle-timeout
triggered after 30 seconds.

Postgresql WalReceiver process waits on connecting master regardless of "connect_timeout"

I am trying to deploy an automated high-available PostgreSQL cluster on kubernetes. In cases of master failover or temporary failures in master, standby loses streaming replication connection and when retrying, it takes a long time until it gets failed and retries.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to master and all the replicas connect to this service for replication). I've tried setting configs like connect_timeout and keepalive in primary_conninfo of recovery.conf and wal_receiver_timeout in postgresql.conf of standby but I could not make any progress with them.
In the first place when master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activities I found out that WalReceiver proccess stucks in LibPQWalReceiverConnect wait_event (state 2) but timeout is way longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then, It fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
In the next try, It successfully connects the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when stuck event occurs (state 2), and when I do, It starts the process again and connects and then streams normally (jumps to state 4).
After checking netstat, I also found that there is a connection with SYN_SENT state to the old master in the walreceiver process (in failover case).
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.

Flink: HA mode killing leading jobmanager terminating standby jobmanagers

I am trying to get Flink to run in HA mode using Zookeeper, but when I try to test it by killing the leader JobManager all my standby jobmanagers get killed too.
So instead of a standby jobmanager taking over as the new Leader, they all get killed which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
ad011.local:8081
ad011.local:8082
ad011.local:8083
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I run flink by using start-cluster.sh script I see my 3 JobManagers running, and going to the WebUI they all point to ad011.local:8081, which is the leader. Which is okay I guess?
I then try to test the failover by killing the leader using kill and then all my other standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink#ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService#72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink#ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6132
Any help would be appreciated.
Solved it by running my cluster using ./bin/start-cluster.sh instead of using service files (which calls the same script), the service file kills the other jobmanagers apparently.

Mesos-marthon cluster issue Could not determine the current leader

I am new in Mesos and Marathon services. I have setup 3 master and 3 slave server as per www.digitalocean.com. Configured as it is in master servers as well as slaves. Finally I done setup of Mesos, Marathon, Zookeeper and Chronos. Mesos is able to listing with 5050, Marathon is 8080 and Chronos 4400. After few hours my Marthon instances are showing like Error 503
HTTP ERROR: 503
Problem accessing /. Reason:
Could not determine the current leader
Powered by Jetty:// 9.3.z-SNAPSHOT.
But mesos is working fine. Every time i am facing this problem and if i restart the marathon service and zookeeper service its working fine.
Marathon
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Zookeeper
2016-06-15 03:41:13,797 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.4.78:38339
2016-06-15 03:41:13,798 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running

Gunicorn socket disappears

Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise
gunicorn (version 19.1.1)
nginx version: nginx/1.1.19
My gunicorn conf:
bind = ["unix:///tmp/someproj1.sock", "unix:///tmp/someproj2.sock"]
pythonpath = "/home/deploy/someproj/someproj"
workers = 5
worker_class = "eventlet"
worker_connections = 25
timeout = 3600
graceful_timeout = 3600
We started getting 502s at around 2PM yesterday in our dev env. This was in the Nginx error log:
connect() to unix:///tmp/someproj1.sock failed (2: No such file or directory) while connecting to upstream"
Both gunicorn sockets were missing from /tmp.
At 11:55AM today I ran ps -eo pid,cmd,etime|grep gunicorn to get the uptime:
4156 gunicorn: master [myproj.    22:53:54
4161 gunicorn: worker [myproj.    22:53:54
4162 gunicorn: worker [myproj.    22:53:54
4163 gunicorn: worker [myproj.    22:53:54
4164 gunicorn: worker [myproj.    22:53:54
4165 gunicorn: worker [myproj.    22:53:53
5207 grep --color=auto gunicorn        00:00
So gunicorn and all its workers have been running uninterrupted since ~1:01PM yesterday. The Nginx access log confirm that requests were successfully being served for about an hour after gunicorn was started. Then it seems for some reason both gunicorn sockets disappeared, and gunicorn continued running without writing any error logs.
Any ideas on what could cause that? Or how to fix it?
It turns out this was indeed a bug, where eventlet workers would delete the socket when they themselves would restart.
The fix has already been merged into master branch, but it has unfortunately not been released yet (version 19.3 still has the problem).