Where can I configure the timeout of JBoss cluster node drop?

We have a 4-node cluster. When one of the nodes has a long GC pause, the node is dropped from the cluster, generating the following log:
2012-06-14 03:27:48,277 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener#6225352b got new view [10.164.218.18:7910|10] [10.164.218.18:7910, 10.164.107.69:7910, 10.164.107.65:7910], old view is [10.164.218.14:7910|9] [10.164.218.14:7910, 10.164.218.18:7910, 10.164.107.69:7910, 10.164.107.65:7910]
2012-06-14 03:27:48,277 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] I am (10.164.218.18:7910)
2012-06-14 03:27:48,998 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging is failing over for failed node 52. If there are many messages to reload this may take some time...
I would like to configure the timeout of the node drop. It seems to be 2 minutes in my case and I would like to increase it, but I can't find where to configure it.
Where can I configure the timeout of JBoss cluster node drop?

I found it: it is in oracle-persistence-service.xml. You need to adjust the configuration in the
ControlChannelConfig section. I think it is the 'timeout' of the 'FD' protocol.
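For reference, a sketch of what that part of the ControlChannelConfig stack typically looks like (protocol names from the JGroups 2.x stack used by JBoss Messaging; the values are illustrative, so check your own file). A node is suspected after roughly timeout x max_tries milliseconds without a heartbeat response, so raising either value gives a node more time to survive a long GC pause:

<attribute name="ControlChannelConfig">
   <config>
      <!-- ... transport and discovery protocols ... -->
      <FD_SOCK/>
      <!-- failure detection: suspect a member after ~10 s x 5 missed heartbeats -->
      <FD timeout="10000" max_tries="5" shun="true"/>
      <VERIFY_SUSPECT timeout="1500"/>
      <!-- ... remaining protocols ... -->
   </config>
</attribute>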

Related

[ERROR][o.o.a.a.AlertIndices] info deleteOldIndices

I am running an OpenSearch 2.3 cluster, and in the log I can see the following error:
[2023-02-13T03:37:44,711][ERROR][o.o.a.a.AlertIndices ] [opensearch-node1] info deleteOldIndices
What triggers this error? I've never set up any alerts, and in the past on the same cluster I used to have some ISM policies for the indices, but I removed them all. Can this be linked to the error I am seeing?
Thanks.

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak Docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp (also tried udp)
The instances' IPs are correctly resolved from the Route 53 private namespace and they discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure-detection messages suspecting/unsuspecting the other instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
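As an aside, the ~60 s in the heartbeat message above corresponds to the timeout of the FD_ALL protocol in the JGroups stack. If you need more headroom while debugging, it can be tuned in the jgroups subsystem of standalone-ha.xml; a sketch with illustrative values (tuning this does not fix an underlying connectivity problem):

<protocol type="FD_ALL">
    <!-- suspect a member after 60 s without a heartbeat; heartbeats are sent every 15 s -->
    <property name="timeout">60000</property>
    <property name="interval">15000</property>
</protocol>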
On the Infinispan (cache) side, everything seems to proceed correctly, but I am not sure. Every cache is "rebalanced" and each rebalance seems to end with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any ideas? Anyone successful at configuring this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In the current Keycloak Docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack on Amazon ECS will "partially" work: the discovery works and the cluster forms, but the Infinispan cache won't be able to sync between instances, which produces erratic errors. There is probably a way to make it work as-is, but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image, in the file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <-- set stack to tcp
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed in the security group between your Amazon ECS tasks.
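For reference, 7600 is the default jgroups-tcp port defined in the socket-binding-group of standalone-ha.xml, with the FD_SOCK failure-detection binding next to it (both are worth mirroring in the security group):

<socket-binding name="jgroups-tcp" interface="private" port="7600"/>
<socket-binding name="jgroups-tcp-fd" interface="private" port="57600"/>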

Concurrent Timeout exception on starting JBoss WildFly 9.0.2 server

I am new to JBoss. When I try to deploy a .war file on the server, the following exception is printed on the console:
6:38:04,388 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
("core-service" => "management"),
("management-interface" => "http-interface")
]'
16:38:05,642 INFO [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-4) WFLYJCA0019: Stopped Driver service with driver-name = Aerobay.war_com.mysql.jdbc.Driver_5_1
16:38:09,548 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler#5f88823f for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:396)
at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1384)
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1332)
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1292)
at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1180)
at org.jboss.as.controller.AbstractOperationContext.handleContainerStabilityFailure(AbstractOperationContext.java:964)
at org.jboss.as.controller.AbstractOperationContext.doCompleteStep(AbstractOperationContext.java:590)
at org.jboss.as.controller.AbstractOperationContext.completeStepInternal(AbstractOperationContext.java:354)
at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:330)
at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1183)
at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:453)
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:327)
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:313)
at org.jboss.as.server.ServerService.boot(ServerService.java:384)
at org.jboss.as.server.ServerService.boot(ServerService.java:359)
at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:271)
at java.lang.Thread.run(Thread.java:745)
Thanks in advance for the help !
I had the same problem when I tried to deploy the WAR file on my Red Hat JBoss EAP 7.0.
But the server was integrated into my IDE (Eclipse Neon) and the problem only occurred in debug mode.
I was able to solve the problem by removing all breakpoints, and after that I started the server again.
Try increasing the timeout by adding the Java option jboss.as.management.blocking.timeout. You can do it in bin/standalone.conf.bat (depending on how you start WildFly) by adding the line:
set "JAVA_OPTS=%JAVA_OPTS% -Djboss.as.management.blocking.timeout=600"
Change the number if it's not enough.
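If you'd rather keep the setting in the server configuration than in the launch script, the same system property can be set in standalone.xml (a sketch; the <system-properties> element goes right after </extensions>, and the value is in seconds):

<system-properties>
    <!-- allow 10 minutes for service container stability instead of the default 300 s -->
    <property name="jboss.as.management.blocking.timeout" value="600"/>
</system-properties>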
Increasing the timeout doesn't solve the root cause of the problem. You need to find out what is blocking and fix that; only in some cases is increasing the timeout the actual solution.
In most cases, increasing resources is a bad way to solve issues. I had this case: WildFly took a long time to boot. I increased the timeout to 600 and got rid of the error, but I was still stuck with the annoyingly slow boot time.
2018-03-26 07:50:36,523 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("path" => "xxxxxxxxxxxxxxxx")]'
Finally I checked the cause of the block and found it was network host resolution (NAS storage defined as a path in WildFly).
I jumped to the network settings and found that my local DNS was not set properly. I used the local DNS instead of the public DNS and the blocking issue was gone. Hope this helps.
Regards,
Sleem
When I tried to debug and started the server in debug mode, I got the following error:
16:19:50,096 ERROR [org.jboss.as.controller.management-operation] (management-handler-thread - 1) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'deploy' at address '[("deployment" => "ViprWeb.war")]'
16:19:50,096 ERROR [org.jboss.as.server] (management-handler-thread - 1) JBAS015870
16:20:00,117 ERROR [org.jboss.as.controller.management-operation] (management-handler-thread - 1) JBAS013413: Timeout after [5000] seconds waiting for service container stability while finalizing an operation.
I removed all my breakpoints and restarted my JBoss server, and that resolved the issue.
Just increase the timeout in standalone.conf.bat:
set "JAVA_OPTS=%JAVA_OPTS% -Djboss.as.management.blocking.timeout=600"
It worked for me.
I had the same problem running a "dockerized" application locally; it turned out that increasing the resources fixed the issue. What I finally settled on:
CPUs: 4
Memory: 8GB
Swap: 2GB
Same problem, with NetBeans, but I had no breakpoints.
Running JBoss from the command line helped me:
Stop JBoss
Close NetBeans
Open a command line
Go to the jboss folder > bin
Type: standalone.bat (this starts JBoss)
Open NetBeans
It worked fine!
Hope it'll help someone else.
I've been facing the same problem recently with WildFly 18 and 21, trying to run a WAR file containing JSR-352 batch jobs that worked fine on WildFly 14.
Increasing the timeout did not solve the situation; it only prolonged the time before the TimeoutException was thrown, no matter the value (e.g. 5, 10 or 20 minutes).
I've just found that turning off the microprofile-metrics-smallrye subsystem seems to be a possible solution.
After commenting out this line in the standalone.xml file, the WAR deployed successfully and much faster (in about 2 minutes):
<subsystem xmlns="urn:wildfly:microprofile-metrics-smallrye:2.0" security-enabled="false" exposed-subsystems="*" prefix="${wildfly.metrics.prefix:wildfly}"/>
I am having a problem with Keycloak server 15.0.2:
WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler#410c55ac for operation add-deployer-chains at address [] failed
I am using MySQL 5.7 with the Connector/J 8.0 jar.
I had the same problem. Then I killed the Kaspersky process and it helped!
I tackled a similar problem and only succeeded by undeploying the apps. This gave WildFly a clean environment to restart and bring up the management and HTTP services. Then deploy the apps/WARs and identify what got you into this state.
In my case it was transactions that wanted to recover; deleting those from the DB solved the problem so it did not re-occur.

How to check that clustering is working between two different JBoss servers

I configured a cluster between two different JBoss servers using the multicast method.
Both servers connect when I start them.
After about a day, I get the following messages; errors start to show for the clustering in server.log:
05:28:17,447 ERROR [org.hornetq.core.server] (Thread-11905 (HornetQ-client-global-threads-377807954)) HQ224037:
cluster connection Failed to handle message: java.lang.IllegalStateException:
Cannot find binding for d7c1004f-b1a1-4160-8888-c38175ac45d599cf0dfe-5f30-11e4-bd7e-556a35fb9ec6 on
ClusterConnectionImpl#538608327[nodeUUID=930dee51-5f30-11e4-9695-ef52e2a27831, connector=TransportConfiguration(name=netty,
factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-29-250-191, address=jms,
server=HornetQServerImpl::serverUUID=930dee51-5f30-11e4-9695-ef52e2a27831]
at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doConsumerCreat
05:28:17,411 ERROR [org.hornetq.core.server] (Thread-11439
(HornetQ-remoting-threads-HornetQServerImpl::serverUUID=99cf0dfe-5f30-11e4-bd7e-556a35fb9ec6-136247994-702467456))
HQ224016: Caught exception: HornetQException[errorType=QUEUE_EXISTS message=HQ119019:
Queue already exists 7a8b46d5-a038-4efd-900e-4c041c2c121f]
at org.hornetq.core.server.impl.HornetQServerImpl.createQueue(HornetQServerImpl.java:1811)
[hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
How can I verify the cluster between the two servers? Is there any procedure or workaround available?
Red Hat provides a McastReceiverTest Java client test utility; further information on its use can be found at https://access.redhat.com/solutions/123073
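The utility ships with JGroups, so a smoke test looks roughly like the following (a sketch: adjust the jar path, multicast address, and port to match your cluster configuration). Start the receiver on one node and the sender on the other; lines typed into the sender should appear on the receiver if multicast works between the hosts.

# on node 1
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 239.255.100.100 -port 5555
# on node 2
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 239.255.100.100 -port 5555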

How to enable JBoss server TRACE log?

I have a web application running in JBoss AS 4.2.2.
I observed that the JBoss server automatically shuts down, and the following exception appears in server.log:
14:20:38,048 INFO [Server] Runtime shutdown hook called, forceHalt: true
14:20:38,049 INFO [Server] JBoss SHUTDOWN: Undeploying all packages
I want to enable TRACE for org.jboss.system.server.Server in jboss-log4j.xml, to hopefully get some more info when the server shuts down.
Please let me know how to enable TRACE for org.jboss.system.server.Server in jboss-log4j.xml.
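A minimal sketch of the kind of category entry jboss-log4j.xml accepts (placed alongside the existing <category> elements; JBoss AS 4.2.2 ships a log4j version with native TRACE support):

<category name="org.jboss.system.server.Server">
    <priority value="TRACE"/>
</category>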
I was able to add TRACE for the server log, and I could see the following output when JBoss AS shuts down automatically:
2010-06-09 19:07:46,631 DEBUG [org.jboss.wsf.stack.jbws.RequestHandlerImpl] END handleRequest: jboss.ws:context=hpnp_lqs,endpoint=APIWebService
2010-06-09 19:07:46,631 DEBUG [org.jboss.ws.core.soap.MessageContextAssociation] popMessageContext: org.jboss.ws.core.jaxws.handler.SOAPMessageContextJAXWS#3290a11e (Thread http-0.0.0.0-8080-1)
2010-06-09 19:07:55,895 INFO [org.jboss.system.server.Server] Runtime shutdown hook called, forceHalt: true
2010-06-09 19:07:55,895 TRACE [org.jboss.system.server.Server] Shutdown caller:
java.lang.Throwable: Here
at org.jboss.system.server.ServerImpl$ShutdownHook.shutdown(ServerImpl.java:1017)
at org.jboss.system.server.ServerImpl$ShutdownHook.run(ServerImpl.java:996)
2010-06-09 19:07:55,895 INFO [org.jboss.system.server.Server] JBoss SHUTDOWN: Undeploying all packages
If anybody has any clue about what might be the cause of the automatic shutdown, please help me.
Thanks!
There's a JBoss wiki page listing log output for various shutdown causes. It looks like yours was caused by a Ctrl-C. I assume you would have known if you hit Ctrl-C, though.
On Unix-type servers, Ctrl-C generates a TERM signal, which could also come from someone or some script running as your jboss user or as root executing "kill <jboss pid>". If you're on Linux, I'd take a look at this question about the OOM killer.
One possible cause for this behaviour is console logout. We have observed this with our own server.
In brief, by default the Sun JVM listens to the event of the console user logging out, and shuts itself down automatically when that happens. To disable this, start the JVM with the -Xrs parameter.
See here for more details (look for Mysterious shutdowns).
One possible cause for a forced shutdown is if the virtual machine is out of memory.
I had this problem several years ago when a colleague implemented some very nasty bulk loading of objects from a database, which caused JBoss to shut down on certain requests.
Try searching for "memory" or similar keywords in the log file and/or monitor the memory usage of the server.