What does "Pool is now shutting down as requested" mean when using host connection pools - scala

I have a few streams that wake every minute or so, pull some documents from the DB, perform some actions, and finally send messages to SNS.
The tick interval is currently every 1 minute.
Every few minutes I see this INFO message in the log:
[INFO] [06/04/2020 07:50:32.326] [default-akka.actor.default-dispatcher-5] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool is now shutting down as requested.
[INFO] [06/04/2020 07:51:32.666] [default-akka.actor.default-dispatcher-15] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
What does it mean? Has anyone seen this before? The 443 was worrying me.

Akka HTTP connection pools are terminated automatically by Akka if they are not used for a certain time (the default is 30 seconds). This can be configured, and set to infinite if needed.
The pool is re-created on next use, but this takes some time, so the request initiating the re-creation will be "blocked" until the pool is ready.
From the documentation:
The time after which an idle connection pool (without pending requests) will automatically terminate itself. Set to infinite to completely disable idle timeouts.
The config parameter that controls it is
akka.http.host-connection-pool.idle-timeout
The log message points to the config parameter too:
Pool shutting down because akka.http.host-connection-pool.idle-timeout
triggered after 30 seconds.
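If the repeated teardown and re-creation is undesirable, the timeout can be raised or disabled in application.conf; a minimal sketch (the setting name is taken from the log message above):

```
# application.conf — keep the host connection pool alive indefinitely,
# or raise the value (e.g. 5 minutes) instead of disabling it entirely
akka.http.host-connection-pool.idle-timeout = infinite
```

Disabling the timeout trades a small amount of held resources for avoiding the re-creation latency on the first request after each idle period.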

Related

Airflow Webserver Shutting down

My Airflow webserver shut down abruptly at around 16:37 GMT.
My Airflow scheduler runs fine (no crash) and tasks still run.
There is not much in the logs except:
Handling signal: ttou
Worker exiting (pid: 118711)
ERROR - No response from gunicorn master within 120 seconds
ERROR - Shutting down webserver
Handling signal: term
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Shutting down: Master
Is it caused by memory pressure?
My webserver settings in airflow.cfg are standard:
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
Update:
OK, it doesn't crash every day, but today I have a log showing gunicorn was unable to restart the workers:
ERROR - [0/0] Some workers seem to have died and gunicorn did not restart them as expected
Update: 30 October 2020
[CRITICAL] WORKER TIMEOUT (pid:108237)
I am getting this even though I have increased the timeout to 240, twice the default value.
Does anyone know why this keeps arising?

Postgresql WalReceiver process waits on connecting master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In cases of master failover or temporary master failures, the standby loses the streaming replication connection, and when retrying it takes a long time to fail and retry again.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master; all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalives in primary_conninfo of recovery.conf, and wal_receiver_timeout in postgresql.conf on the standby, but I could not make any progress with them.
In the first place, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity I found out that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is far longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when the stuck event occurs (state 2); when I do, the process starts again, connects, and then streams normally (jumps to state 4).
After checking netstat, I also found that the walreceiver process has a connection in SYN_SENT state to the old master (in the failover case).
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
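For completeness, the libpq keepalive parameters the asker mentions belong in primary_conninfo on the standby; a sketch (the host name comes from the question, while the user and the keepalive values are illustrative):

```
# recovery.conf on the standby (PostgreSQL 10); user is a hypothetical example
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3'
```

The keepalive settings help detect a dead established connection, while the tcp_syn_retries sysctl above is what bounds the time spent in SYN_SENT when the connection cannot be established at all.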

Why is my NiFi PublishKafka processor only working with previous versions?

I am using Kafka 2, and for some reason the only NiFi processors that will correctly publish my messages to Kafka are PublishKafka (0_9) and PublishKafka_0_10. The later versions don't push my messages through, which is odd because, again, I'm running Kafka 2.1.1.
For more information: when I try to run my FlowFile through the later PublishKafka processors, I get a timeout exception that repeats voluminously:
2019-03-11 16:05:34,200 ERROR [Timer-Driven Process Thread-7] o.a.n.p.kafka.pubsub.PublishKafka_2_0 PublishKafka_2_0[id=6d7f1896-0169-
1000-ca27-cf7f86f22694] PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-
cf7f86f22694] failed to process session due to
org.apache.kafka.common.errors.TimeoutException: Timeout expired while
initializing transactional state in 5000ms.; Processor Administratively
Yielded for 1 sec: org.apache.kafka.common.errors.TimeoutException:
Timeout expired while initializing transactional state in 5000ms.
org.apache.kafka.common.errors.TimeoutException: Timeout expired while
initializing transactional state in 5000ms.
2019-03-11 16:05:34,201 WARN [Timer-Driven Process Thread-7]
o.a.n.controller.tasks.ConnectableTask Administratively Yielding
PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] due to uncaught
Exception: org.apache.kafka.common.errors.TimeoutException: Timeout
expired while initializing transactional state in 5000ms.
My processor settings are as follows (screenshot not included); all other configurations are defaults.
Any ideas on why this is happening?

Apache Modcluster failed to drain active sessions

I am using JBoss EAP 6.2 as the web application server and Apache mod_cluster for load balancing.
Whenever I try to undeploy my web application, I get the following warning:
14:22:16,318 WARN [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000025: Failed to drain 2 remaining active sessions from default-host:/starrassist within 10.000000 seconds
14:22:16,319 INFO [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000021: All pending requests drained from default-host:/starrassist in 10.002000 seconds
After that it takes forever to undeploy, and the EAP server group and the node on which the application is deployed become unresponsive.
The only workaround is to restart the entire EAP server. My question is: is there an attribute I can set in EAP or mod_cluster so that active sessions beyond a maximum timeout expire on their own?
To control the timeout for stopping a context, you can use the following configuration value:
Stop Context Timeout
The amount of time, measured in units specified by
stopContextTimeoutUnit, for which to wait for clean shutdown of a
context (completion of pending requests for a distributable context;
or destruction/expiration of active sessions for a non-distributable
context).
CLI Command:
/profile=full-ha/subsystem=modcluster/mod-cluster-config=configuration/:write-attribute(name=stop-context-timeout,value=10)
Ref: Configure the mod_cluster Subsystem
Likewise if you are using JDK 8 take a look at this issue: Draining pending requests fails with Oracle JDK8
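If you prefer editing the server XML directly instead of using the CLI, the same attribute sits on the mod-cluster-config element; a sketch (namespace version and sibling attributes are illustrative and should be checked against your EAP 6.2 profile):

```
<subsystem xmlns="urn:jboss:domain:modcluster:1.2">
    <mod-cluster-config advertise-socket="modcluster" connector="ajp"
                        stop-context-timeout="10"/>
</subsystem>
```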

JBoss shutdown takes forever

I am having issues stopping JBoss. Most of the time, when I execute the shutdown, it stops the server in a couple of seconds.
But sometimes it takes forever to stop and I have to kill the process.
Whenever the shutdown takes long, the scheduler was running, and in the logs I see:
2014-07-14 19:19:29,124 INFO [org.springframework.scheduling.quartz.SchedulerFactoryBean] (JBoss Shutdown Hook) Shutting down Quartz Scheduler
2014-07-14 19:19:29,124 INFO [org.quartz.core.QuartzScheduler] (JBoss Shutdown Hook) Scheduler scheduler_$_s608203at1vl07 shutting down.
2014-07-14 19:19:29,124 INFO [org.quartz.core.QuartzScheduler] (JBoss Shutdown Hook) Scheduler scheduler_$_s608203at1vl07 paused.
and nothing after that.
Make sure the Quartz scheduler thread and all threads in its thread pool are marked as daemon threads so that they do not prevent the JVM from exiting.
This can be achieved by setting the following Quartz properties respectively:
org.quartz.scheduler.makeSchedulerThreadDaemon=true
org.quartz.threadPool.makeThreadsDaemons=true
While it is safe to mark the scheduler thread as a daemon thread, you should think before you mark your thread pool threads as daemon threads, because when the JVM exits, these "worker" threads can be in the middle of executing some logic that you do not want to abort abruptly. If that is the case, you can have your jobs implement the org.quartz.InterruptableJob interface and implement a JVM shutdown hook somewhere in your application that interrupts all currently executing jobs (the list of which can be obtained from the org.quartz.Scheduler API).
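The shutdown-hook pattern described above can be sketched in Scala with only the standard library; InterruptableWork and running are hypothetical stand-ins for org.quartz.InterruptableJob and Scheduler.getCurrentlyExecutingJobs():

```scala
// Stdlib-only sketch of the interrupt-on-shutdown pattern. In a real
// application the job would implement org.quartz.InterruptableJob and the
// hook would iterate over scheduler.getCurrentlyExecutingJobs(); the names
// InterruptableWork and running here are illustrative stand-ins.
object ShutdownInterruptDemo {
  trait InterruptableWork { def interrupt(): Unit }

  // Registry of currently executing jobs (Quartz maintains this for you).
  private val running =
    scala.collection.mutable.ListBuffer.empty[InterruptableWork]

  def register(work: InterruptableWork): Unit = synchronized { running += work }

  // Interrupt everything currently running; this is what the JVM
  // shutdown hook invokes on exit.
  def interruptAll(): Unit = synchronized { running.foreach(_.interrupt()) }

  // Install the hook once at application startup.
  def installHook(): Unit =
    Runtime.getRuntime.addShutdownHook(new Thread(() => interruptAll()))

  // Simulate the hook firing so the effect is observable without exiting.
  def demo(): Boolean = {
    var interrupted = false
    register(() => interrupted = true)
    installHook()
    interruptAll()
    interrupted
  }
}
```

With the two daemon-thread properties set, a hook like this is only needed when worker threads must finish or roll back in-flight work cleanly rather than being cut off at JVM exit.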