Getting error - Timeout after [300] seconds waiting for service container stability, I have increased the timeout still getting the same error
I have a Hyperledger Fabric network, version 2.4.4, running on Kubernetes. The peers and other components run behind an Istio ingress. The chaincode runs in a dind (docker-in-docker) container and connects to the peer through its URL. The problem is that the chaincode connection is dropped after a few minutes. Below are the logs:
2022-07-14T04:31:13.057Z info [c-api:lib/handler.js] [assetschannel-ddc183b4] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
2022-07-14T04:33:04.197Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: Connection dropped\n at Object.callErrorFromStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/@grpc/grpc-js/build/src/call-stream.js:187:78\n at processTicksAndRejections (node:internal/process/task_queues:78:11)"
I set the following environment variables on the peer pod to keep the connection alive:
CORE_CHAINCODE_KEEPALIVE: 60000
CORE_PEER_KEEPALIVE_CLIENT_INTERVAL: 600s
CORE_PEER_KEEPALIVE_CLIENT_TIMEOUT: 2s
CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL: 20s
CORE_PEER_KEEPALIVE_MININTERVAL: 15s
but this did not resolve the issue.
Any suggestions would be appreciated.
It appears to be an issue with the AWS ELB. Its idle timeout was set to 60s, which broke the connection between the chaincode and the peer whenever there was no traffic between them. Increasing the idle timeout fixed the issue.
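As a rough illustration of why the keepalive settings in the question did not help: a keepalive can only hold a connection open through a load balancer if it fires more often than the balancer's idle timeout. The sketch below is a hypothetical helper, not part of Fabric or AWS tooling. It assumes a bare number such as 60000 is in seconds, and it assumes the pings actually count as traffic at the load balancer, which depends on the balancer type.

```python
import re

def to_seconds(value: str) -> float:
    """Parse durations like '600s', '2m', '500ms'; a bare number is ASSUMED to be seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m|h)?\s*", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = float(match.group(1)), match.group(2) or "s"
    return number * {"ms": 0.001, "s": 1, "m": 60, "h": 3600}[unit]

def intervals_exceeding_idle_timeout(settings: dict, idle_timeout_s: float) -> list:
    """Return the names of keepalive intervals that are not shorter than the idle timeout."""
    return sorted(
        name for name, value in settings.items()
        if to_seconds(value) >= idle_timeout_s
    )

# Interval values copied from the question; the 60 s idle timeout is from the answer.
settings = {
    "CORE_CHAINCODE_KEEPALIVE": "60000",
    "CORE_PEER_KEEPALIVE_CLIENT_INTERVAL": "600s",
    "CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL": "20s",
}
print(intervals_exceeding_idle_timeout(settings, 60))
```

Under these assumptions, the chaincode keepalive and the client interval are both far longer than 60s, so an idle chaincode connection would be cut before either one fires.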
My Airflow webserver shut down abruptly around the same time, at about 16:37 GMT.
My Airflow scheduler runs fine (no crash) and tasks still run.
There is not much in the log except:
Handling signal: ttou
Worker exiting (pid: 118711)
ERROR - No response from gunicorn master within 120 seconds
ERROR - Shutting down webserver
Handling signal: term
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Shutting down: Master
Could memory be the cause?
My cfg settings for the webserver are standard.
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
Update:
OK, it doesn't crash every day, but today I have a log saying gunicorn was unable to restart its workers.
ERROR - [0/0] Some workers seem to have died and gunicorn did not restart them as expected
Update: 30 October 2020
[CRITICAL] WORKER TIMEOUT (pid:108237)
I am getting this even though I have increased the timeout to 240, twice the default value.
Does anyone know why this keeps happening?
I have a few streams that wake up every minute or so, pull some docs from the DB, perform some actions, and finally send messages to SNS.
The tick interval is currently 1 minute.
Every few minutes I see these info messages in the log:
[INFO] [06/04/2020 07:50:32.326] [default-akka.actor.default-dispatcher-5] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool is now shutting down as requested.
[INFO] [06/04/2020 07:51:32.666] [default-akka.actor.default-dispatcher-15] [default/Pool(shared->https://sns.eu-west-1.amazonaws.com:443)] Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
What does it mean? Has anyone seen this before? The 443 was worrying me.
Akka HTTP connection pools are terminated by Akka automatically if they are not used for a certain time (the default is 30 seconds). This can be configured, and set to infinite if needed.
The pools are re-created on next use, but this takes some time, so the request that triggers the creation is "blocked" until the pool is re-created.
From the documentation:
The time after which an idle connection pool (without pending requests) will automatically terminate itself. Set to infinite to completely disable idle timeouts.
The config parameter that controls it is
akka.http.host-connection-pool.idle-timeout
The log message points to the config parameter too:
Pool shutting down because akka.http.host-connection-pool.idle-timeout triggered after 30 seconds.
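To see why the message shows up on nearly every tick, compare the 1 minute tick interval from the question with the 30 second idle timeout. The sketch below is a simplified model, not Akka's implementation: the hypothetical function `count_idle_shutdowns` just counts the gaps between consecutive requests that are longer than the idle timeout.

```python
def count_idle_shutdowns(request_times_s, idle_timeout_s):
    """Count gaps between consecutive requests that exceed the idle timeout."""
    ordered = sorted(request_times_s)
    return sum(
        1 for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > idle_timeout_s
    )

ticks = [0, 60, 120, 180]  # one request per minute, as in the question
print(count_idle_shutdowns(ticks, 30))  # 3: the pool idles out between every pair of ticks
print(count_idle_shutdowns(ticks, 90))  # 0: a timeout longer than the tick keeps the pool alive
```

In this model, any idle timeout longer than the tick interval avoids the repeated shutdown and re-creation.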
I have a JBoss cluster composed of two JBoss EAP 6.3.3 instances. In some cases, because of an application bug, both instances raise an exception and I must restart them (node1 and node2).
In that situation, when I restart node1, for instance, and node2 is not reachable because it is stalled, node1 starts deploying my app WAR and logs the following exception:
ERROR [org.jboss.msc.service.fail] [] (ServerService Thread Pool -- 107) MSC000001:
Failed to start service jboss.persistenceunit."app.war#persistencename":
org.jboss.msc.service.StartException in service jboss.persistenceunit."app.war#persistencename":
org.infinispan.CacheException: Unable to invoke method public void
org.infinispan.statetransfer.StateTransferManagerImpl.start() throws
java.lang.Exception on object of type StateTransferManagerImpl
at org.jboss.as.jpa.service.PersistenceUnitServiceImpl$1.run(PersistenceUnitServiceImpl.java:103)
...
Caused by: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending
message to node2/hibernate
at org.infinispan.util.Util.rewrapAsCacheException(Util.java:542)
...
Caused by: org.jgroups.TimeoutException: timeout sending
message to node2/hibernate
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:392)
After that, JBoss logs that the deployment of the WAR has failed.
If I restart node2 too, then node1 starts without problems and deploys the WAR successfully.
Why does the deployment stop if one node of the cluster is not reachable?
I am using JBoss EAP 6.2 as the web application server and Apache mod_cluster for load balancing.
Whenever I try to undeploy my web application, I get the following warning:
14:22:16,318 WARN [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000025: Failed to drain 2 remaining active sessions from default-host:/starrassist within 10.000000.1 seconds
14:22:16,319 INFO [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000021: All pending requests drained from default-host:/starrassist in 10.002000.1 seconds
It takes forever to undeploy, and the EAP server group and node on which the application is deployed become unresponsive.
The only workaround is to restart the entire EAP server. My question is: is there an attribute I can set in EAP or mod_cluster so that active sessions expire once they exceed a maximum timeout?
To control the timeout to stop a context you can use the following configuration value:
Stop Context Timeout
The amount of time, measured in units specified by
stopContextTimeoutUnit, for which to wait for clean shutdown of a
context (completion of pending requests for a distributable context;
or destruction/expiration of active sessions for a non-distributable
context).
CLI Command:
/profile=full-ha/subsystem=modcluster/mod-cluster-config=configuration/:write-attribute(name=stop-context-timeout,value=10)
Ref: Configure the mod_cluster Subsystem
Also, if you are using JDK 8, take a look at this issue: Draining pending requests fails with Oracle JDK8