This does not happen every time, but it happens often.
Server group with 3 clustered nodes
WildFly 16
I deploy the .war from the admin UI. It is picked up fine on one server:
2020-02-26 07:21:12,951 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0003: alp-esb-app02:servicedesk-02 elected as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
2020-02-26 07:21:13,115 INFO [org.jboss.as.server] (ServerService Thread Pool -- 26) WFLYSRV0010: Deployed "Now-1.11-SNAPSHOT.war" (runtime-name : "Now-1.11-SNAPSHOT.war")
2020-02-26 07:21:14,133 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
But when I disable/re-enable the deployment, or deploy the next time, the same election logs appear on two servers.
And there is a scheduler which then runs twice, corrupting the database with duplicates.
I need to redeploy again and again, checking the logs each time, until only one server is elected.
Project Structure:
webapp -> META-INF -> singleton-deployment.xml
<?xml version="1.0" encoding="UTF-8"?>
<singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
The scheduler starts like:
import java.util.concurrent.TimeUnit;
import javax.ejb.AccessTimeout;
import javax.ejb.Singleton;
import javax.ejb.Startup;
@Startup
@Singleton
@AccessTimeout(value = 30, unit = TimeUnit.MINUTES)
public class SnowPollerNew {
Any suggestions as to why it sometimes runs fine and sometimes does not?
Is it linked to JGroups, or to communication between the cluster nodes?
You need to ensure that the servers are forming the cluster correctly.
Also, I remember some issues (WFLY-11619) with the singleton election.
I would suppose that this is not reproducible on WildFly 18.
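One quick way to verify that is to read the JGroups channel view from the management CLI on each node (a sketch; "ee" is the default channel name and may differ in your configuration):

$JBOSS_HOME/bin/jboss-cli.sh --connect
# the view should list all three nodes; a partial view means the cluster has split
/subsystem=jgroups/channel=ee:read-attribute(name=view)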
Related
I have been running PAM 7.9/jBPM 7.48 for about a year under JBoss EAP 7.3. My jBPM KieServer persists using SQL Server. I deployed the KieServer repeatedly yesterday, but deploying today fails.
2021-12-16 15:25:53,645 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'full-replace-deployment' at address '[]'
2021-12-16 15:26:03,649 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0190: Step handler org.jboss.as.server.deployment.DeploymentHandlerUtil$4#74e289e9 for operation full-replace-deployment at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:523)
at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1518)
I have already set the property to increase the deployment timeout, but it still complains about a 5-second timeout that must be controlled by another property:
2021-12-16 13:40:47,039 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0349: Timeout after [5] seconds waiting for service container stability while finalizing an operation. Process must be restarted. Step that first updated the service container was 'deploy' at address '[("deployment" => "kie-server.war")]'
I have changed the logging level to trace in order to gather all the information I can. How else can I debug or solve this issue?
There are two factors that may be contributing to this, but I don't have a good approach for addressing them:
There was a Windows Update yesterday (likely due to the recent Log4j exploit).
Some people at my company are having problems connecting to the SQL Server database. I am not seeing log messages about the KieServer being unable to connect to the DB, but when it cannot reach the DB, the KieServer fails to start.
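A quick way to rule out basic connectivity from the application host is a one-off query with the SQL Server command-line tools (hostname and credentials here are placeholders):

# succeeds only if the database is reachable and accepting logins
sqlcmd -S my-db-host -U kie_user -P '<password>' -Q "SELECT 1"

If this fails, the deployment timeout is likely just a symptom of the KieServer waiting on the database.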
I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances' IPs are correctly resolved from the Route 53 private namespace and they discover each other without any problem (x.x.x.138 starts first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure-detection messages suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan (cache) side, everything seems to proceed correctly, but I am not sure. Every cache is "rebalanced", and each rebalance seems to end with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any ideas? Has anyone been successful at configuring this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down, not because of the "Initial state transfer timed out .." error, but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.
In the current Keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will "partially" work: the discovery will work and the cluster will form, but the Infinispan cache won't be able to sync between instances, which produces erratic errors. There is probably a way to make it work as-is, but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <!-- set stack to tcp -->
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed in your security group between your Amazon ECS tasks.
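If you would rather script that edit than make it by hand before committing, something like the following mirrors the same workaround (a sketch; it assumes the stock channel element carries stack="udp", so inspect the file first and adjust the sed pattern if needed):

# start a throwaway container, patch the stack, commit it as a new image
docker run -d --name kc-tmp jboss/keycloak:6.0.1
docker exec kc-tmp sed -i 's/stack="udp"/stack="tcp"/' /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml
docker commit -m="TCP cluster stack" kc-tmp jboss/keycloak:6.0.1-tcp-cluster
docker rm -f kc-tmp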
So I have a rather strange issue with WildFly not starting...
If I clean standalone/deployments of everything but one .war file, WildFly starts perfectly. I can then add in all the other .war files (6 in total) and WildFly deploys them without issues.
However, if I have all the WAR files in there and start WildFly, it completely fails. It stays in a state where everything is set to .isdeploying for maybe 5 minutes, until everything gets set to .failed.
These are the logs I am getting from service wildfly status:
Feb 09 08:49:12 wildfly[2079]: /etc/init.d/wildfly: 3: /etc/default/wildfly: default: not found
Feb 09 08:49:12 wildfly[2079]: * Starting WildFly Application Server wildfly
Feb 09 08:49:43 wildfly[2079]: ...done.
Feb 09 08:49:43 wildfly[2079]: * WildFly Application Server hasn't started within the timeout allowed
Feb 09 08:49:43 wildfly[2079]: * please review file "/var/log/wildfly/console.log" to see the status of the service
Has anyone seen anything like this before?
After looking around, I found this just before it undeployed everything:
ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
    ("core-service" => "management"),
    ("management-interface" => "http-interface")
]'
But I am still not sure what it means...
This happened to me too, starting on WildFly 11 and above, IIRC.
Are you trying to access the public or management IP while the server is booting? Basically, you have to wait until the server has started before accessing those IPs.
My workaround was to use the marker files that the deployment scanner checks.
https://docs.jboss.org/author/display/WFLY/Application+deployment#Applicationdeployment-MarkerFiles
Before you start WildFly, put a .skipdeploy marker file next to each .war you want to skip. Then, once the server has started, delete those files to let WildFly start the deployments. You can automate this with a shell script called from your standalone.sh, as sketched below.
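A minimal sketch of such a script (it assumes a default install path and WildFly's "started" log message WFLYSRV0025; "first-app.war" is a hypothetical name for the one deployment you let through):

WILDFLY_HOME=/opt/wildfly
DEPLOYMENTS=$WILDFLY_HOME/standalone/deployments

# before starting: mark every WAR except one to be skipped
for war in "$DEPLOYMENTS"/*.war; do
  [ "$(basename "$war")" = "first-app.war" ] || touch "$war.skipdeploy"
done

"$WILDFLY_HOME/bin/standalone.sh" &

# wait for the "started" message, then release the remaining deployments
until grep -q 'WFLYSRV0025' "$WILDFLY_HOME/standalone/log/server.log" 2>/dev/null; do sleep 5; done
rm -f "$DEPLOYMENTS"/*.war.skipdeploy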
This error can indicate that your IP/port is already being used by another process.
Use the command below to check.
For Windows: netstat -aon | find "port number"
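On Linux, the equivalent check would be (the port number is whatever the server reports as unavailable, e.g. 8080 or 9990):

sudo ss -ltnp | grep 8080
# or, with classic net-tools:
sudo netstat -ltnp | grep 8080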
You can configure the jboss.as.management.blocking.timeout system property to tune the timeout (in seconds) for waiting for service container stability, as below:
...
</extensions>
<system-properties>
<property name="jboss.as.management.blocking.timeout" value="900"/>
</system-properties>
<management>
...
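The same property can also be set without editing the XML, either at startup or through jboss-cli (the value is in seconds):

# pass it on the command line at startup
./standalone.sh -Djboss.as.management.blocking.timeout=900

# or persist it via the management CLI
$JBOSS_HOME/bin/jboss-cli.sh --connect --command='/system-property=jboss.as.management.blocking.timeout:add(value=900)'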
Or, if it still doesn't work this way, collect a series of thread dumps during your startup period so we can see what it might be getting stuck on.
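A simple way to collect those thread dumps (a sketch; it assumes a JDK with jstack on the PATH and a single WildFly process, found via its jboss-modules.jar launcher):

PID=$(pgrep -f jboss-modules.jar)
# one dump every 10 seconds during the startup window
for i in 1 2 3 4 5; do
  jstack "$PID" > "threaddump_$i.txt"
  sleep 10
done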
I have a Camel application deployed on JBoss in a WAR file, with a Spring configuration for starting the Camel context.
It deploys and runs very nicely on JBoss EAP 7.0.0.GA.
When I change values in a property file that my application depends on and touch the WAR file, it normally redeploys the application. But in some cases it fails.
I get the following in the server.log:
2017-07-25 12:05:26.671 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Starting to graceful shutdown 12 routes (timeout 300 seconds)
2017-07-25 12:05:26.725 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="Camel (interfacedb) thread #2 - ShutdownTask" Waiting as there are still 4 inflight and pending exchanges to complete, timeout in 300 seconds. Inflights per route: [interfacePersistDirect = 1, route1 = 1, pullFromTransferEntityTable = 1, lastScheduledRun = 1]
...
2017-07-25 12:10:26.691 WARN class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Timeout occurred during graceful shutdown. Forcing the routes to be shutdown now. Notice: some resources may still be running as graceful shutdown did not complete successfully.
2017-07-25 12:10:26.691 WARN class=org.apache.camel.impl.DefaultShutdownStrategy thread="Camel (interfacedb) thread #2 - ShutdownTask" Interrupted while waiting during graceful shutdown, will force shutdown now.
2017-07-25 12:10:26.694 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Graceful shutdown of 12 routes completed in 300 seconds
After this, the application will not start again. JBoss reports the following in the myApp.war.failed file in the deployments folder:
"WFLYDS0022: Did not receive a response to the deployment operation within the allowed timeout period [600 seconds]. Check the server configuration file and the server logs to find more about the status of the deployment."
The application normally deploys much quicker than 600 seconds. I can touch the WAR file or delete the .failed file, which normally triggers a redeployment, but JBoss keeps writing the error above to the .failed file.
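For reference, this is essentially what I do to retrigger the scanner (paths abbreviated; myApp.war stands in for the real deployment name):

cd $JBOSS_HOME/standalone/deployments
rm -f myApp.war.failed        # clearing the failed marker lets the scanner try again
touch myApp.war.dodeploy      # or explicitly request a deploy via a .dodeploy marker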
The application starts normally if I restart the JBoss VM, but I would like to avoid restarting the other applications running on the JBoss instance.
Any suggestions?
I configured a cluster between two different JBoss servers using the multicast method.
Both servers connect when I start both JBoss servers.
After one day, I get the following messages.
Errors start to show up for the clustering in server.log:
05:28:17,447 ERROR [org.hornetq.core.server] (Thread-11905 (HornetQ-client-global-threads-377807954)) HQ224037: cluster connection Failed to handle message: java.lang.IllegalStateException: Cannot find binding for d7c1004f-b1a1-4160-8888-c38175ac45d599cf0dfe-5f30-11e4-bd7e-556a35fb9ec6 on ClusterConnectionImpl@538608327[nodeUUID=930dee51-5f30-11e4-9695-ef52e2a27831, connector=TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-29-250-191, address=jms, server=HornetQServerImpl::serverUUID=930dee51-5f30-11e4-9695-ef52e2a27831]
    at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doConsumerCreat

05:28:17,411 ERROR [org.hornetq.core.server] (Thread-11439 (HornetQ-remoting-threads-HornetQServerImpl::serverUUID=99cf0dfe-5f30-11e4-bd7e-556a35fb9ec6-136247994-702467456)) HQ224016: Caught exception: HornetQException[errorType=QUEUE_EXISTS message=HQ119019: Queue already exists 7a8b46d5-a038-4efd-900e-4c041c2c121f]
    at org.hornetq.core.server.impl.HornetQServerImpl.createQueue(HornetQServerImpl.java:1811) [hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
How can I ensure the cluster stays up between the two servers? Are there any procedures or workarounds available?
Red Hat provides a McastReceiverTest Java client test utility; further information on its use can be found at https://access.redhat.com/solutions/123073
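The JGroups test classes behind that utility can also be run directly to check multicast connectivity between the two hosts (a sketch; the multicast address and port are placeholders and must match your cluster configuration, and the path to the JGroups jar depends on your installation):

# on the receiving node
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 231.7.7.7 -port 45566
# on the sending node; lines typed here should appear on the receiver
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 231.7.7.7 -port 45566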