Timeout deploying an artifact that has deployed many times under JBoss EAP 7.3 - jboss

I have been running PAM 7.9/jBPM 7.48 for about a year under JBoss EAP 7.3. My jBPM KIE Server persists to SQL Server. I deployed the KIE Server repeatedly yesterday without problems, but deploying it today fails.
2021-12-16 15:25:53,645 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'full-replace-deployment' at address '[]'
2021-12-16 15:26:03,649 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0190: Step handler org.jboss.as.server.deployment.DeploymentHandlerUtil$4#74e289e9 for operation full-replace-deployment at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:523)
at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1518)
I have already set the property that increases the deployment timeout, but the server still complains about a 5-second timeout, which must be controlled by another property:
2021-12-16 13:40:47,039 ERROR [org.jboss.as.controller.management-operation] (DeploymentScanner-threads - 1) WFLYCTL0349: Timeout after [5] seconds waiting for service container stability while finalizing an operation. Process must be restarted. Step that first updated the service container was 'deploy' at address '[("deployment" => "kie-server.war")]'
I have changed the logging level to trace in order to gain all the information I can. How else can I debug / solve this issue?
There are two factors that may be contributing to this, but I don't have a good approach for addressing either of them:
There was a Windows Update yesterday (likely due to the recent Log4j exploit).
Some people at my company are having problems connecting to the SQL Server database. I am not seeing log messages about the KIE Server being unable to connect to the DB, but when it cannot reach the DB, the KIE Server fails to start.
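One way to rule the database in or out is to test, from the JBoss host itself, whether the SQL Server port accepts TCP connections at all. The following is only a minimal sketch: the function name can_connect is made up for illustration, and the host and port have to come from your own datasource URL (1433 is merely the usual SQL Server default, not something taken from the logs above).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, can_connect("your-db-host", 1433) with a placeholder host name. A False result only shows that the port is unreachable from this machine; a True result does not prove that authentication or the JDBC driver works.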

Related

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak Docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instance IPs are correctly resolved from the Route53 private namespace and the instances discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I see a series of failure-detection tasks suspecting and unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any ideas? Has anyone successfully configured this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" on every second request, which tells me that there is indeed a cache problem.
In the current Keycloak Docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will "partially" work: discovery works and the cluster forms, but the Infinispan cache cannot sync between instances, which produces erratic errors. There is probably a way to make it work "as-is", but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image, in the file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <!-- set stack to tcp -->
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed in your security group between your Amazon ECS tasks.

Wildfly not starting properly

So I have a rather strange issue with WildFly not starting...
If I clean standalone/deployments of everything but one .war file, WildFly starts perfectly. I can then add all the other .war files (6 in total) and WildFly deploys them without issues.
However, if all the .war files are in there when I start WildFly, it fails completely. Everything stays in the .isdeploying state for maybe 5 minutes until everything gets set to failed.
The logs that I am getting from service wildfly status:
Feb 09 08:49:12 wildfly[2079]: /etc/init.d/wildfly: 3: /etc/default/wildfly: default: not found
Feb 09 08:49:12 wildfly[2079]: * Starting WildFly Application Server wildfly
Feb 09 08:49:43 wildfly[2079]: ...done.
Feb 09 08:49:43 wildfly[2079]: * WildFly Application Server hasn't started within the timeout allowed
Feb 09 08:49:43 wildfly[2079]: * please review file "/var/log/wildfly/console.log" to see the status of the service
Has anyone seen anything like this before?
After looking around I found this just before it undeployed everything:
ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
("core-service" => "management"),
("management-interface" => "http-interface")
]'
But I am still not sure what it means...
This happened to me too, starting with WildFly 11, IIRC.
Are you trying to access the public or management IP while the server is booting? Basically, you have to wait until the server has started before accessing those IPs.
My workaround was to use the marker files that the deployment scanner checks.
https://docs.jboss.org/author/display/WFLY/Application+deployment#Applicationdeployment-MarkerFiles
Before you start WildFly, put a .skipdeploy file in place for each .war you want to skip. Then, once the server has started, you only have to delete that file to let WildFly start the deployment. You can do this by writing a shell script and calling it from your standalone.sh.
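The marker-file workaround can also be scripted in a few lines. This is only a sketch under assumptions: the function names are made up, and it assumes the usual naming scheme in which the marker is the archive name plus the .skipdeploy suffix (for example app.war.skipdeploy) in the scanned deployments directory.

```python
from pathlib import Path


def add_skip_markers(deployments):
    """Create <name>.war.skipdeploy next to every .war in the directory."""
    markers = []
    for war in sorted(Path(deployments).glob("*.war")):
        marker = war.with_name(war.name + ".skipdeploy")
        marker.touch()  # an empty file is enough for a marker
        markers.append(marker)
    return markers


def remove_skip_markers(deployments):
    """Delete all .skipdeploy markers and return how many were removed."""
    removed = 0
    for marker in Path(deployments).glob("*.skipdeploy"):
        marker.unlink()
        removed += 1
    return removed
```

The idea would be to call add_skip_markers on standalone/deployments before starting the server and remove_skip_markers once the boot has finished. How you detect that the boot has finished is left open here.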
This error shows that your IP/port is being used by another process.
Use the command below to check it.
On Windows: netstat -aon | find "port number"
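If you would rather check from a script than read netstat output, a rough equivalent is to try binding the port yourself. This is a sketch, not a replacement for netstat: port_is_free is a made-up name, it only tests one address (127.0.0.1 by default), and it does not tell you which process holds the port.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if host:port can be bound, i.e. nothing else is holding it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use".
            return False
```

For the management http-interface in the error above you would check whatever port your configuration assigns to it (9990 on a default WildFly install), and do so while the server is stopped.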
You can configure the jboss.as.management.blocking.timeout system property to tune the timeout (in seconds) for waiting for service container stability, as below:
...
</extensions>
<system-properties>
<property name="jboss.as.management.blocking.timeout" value="900"/>
</system-properties>
<management>
...
Or, if it still doesn't work this way, collect a series of thread dumps during your startup period so we can see what it might be getting stuck on.
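Collecting the dumps by hand is easy to get wrong under time pressure, so here is a small sketch that takes several in a row. It assumes the JDK's jstack tool is on the PATH and that you already know the server's process ID (for example from jps); the function name, file naming and default intervals are arbitrary choices for illustration.

```python
import subprocess
import time
from pathlib import Path


def collect_thread_dumps(pid, out_dir, count=5, interval=10.0,
                         command=("jstack", "-l")):
    """Run `command <pid>` `count` times, `interval` seconds apart.

    Each dump is written to its own file; the list of files is returned.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = []
    for i in range(count):
        # check=True raises if the dump tool fails (e.g. wrong PID).
        result = subprocess.run([*command, str(pid)],
                                capture_output=True, text=True, check=True)
        target = out / f"threaddump-{pid}-{i:02d}.txt"
        target.write_text(result.stdout)
        files.append(target)
        if i < count - 1:
            time.sleep(interval)
    return files
```

Comparing consecutive dumps usually shows which thread stays in the same stack frame for the whole startup window, which is the one worth posting.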

War with spring configured Camel context will not redeploy on JBoss

I have a Camel application deployed on JBoss in a WAR file with a spring configuration for starting the Camel context.
It deploys and runs very nicely on JBoss EAP 7.0.0.GA.
If I want to change values in a property file that my application depends on and touch the war file, it normally redeploys the application. But in some cases it fails.
I get the following in the server.log:
2017-07-25 12:05:26.671 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Starting to graceful shutdown 12 routes (timeout 300 seconds)
2017-07-25 12:05:26.725 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="Camel (interfacedb) thread #2 - ShutdownTask" Waiting as there are still 4 inflight and pending exchanges to complete, timeout in 300 seconds. Inflights per route: [interfacePersistDirect = 1, route1 = 1, pullFromTransferEntityTable = 1, lastScheduledRun = 1]
...
2017-07-25 12:10:26.691 WARN class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Timeout occurred during graceful shutdown. Forcing the routes to be shutdown now. Notice: some resources may still be running as graceful shutdown did not complete successfully.
2017-07-25 12:10:26.691 WARN class=org.apache.camel.impl.DefaultShutdownStrategy thread="Camel (interfacedb) thread #2 - ShutdownTask" Interrupted while waiting during graceful shutdown, will force shutdown now.
2017-07-25 12:10:26.694 INFO class=org.apache.camel.impl.DefaultShutdownStrategy thread="ServerService Thread Pool -- 74" Graceful shutdown of 12 routes completed in 300 seconds
After this the application will not start again. JBoss reports the following in the myApp.war.failed file in the deployments folder.
"WFLYDS0022: Did not receive a response to the deployment operation within the allowed timeout period [600 seconds]. Check the server configuration file and the server logs to find more about the status of the deployment."
The application normally deploys a lot quicker than 600 seconds. I can touch the war file or delete the .failed file, which normally triggers a redeployment, but JBoss keeps giving me the error above in the .failed file.
The application starts normally if I restart the JBoss VM, but I would like to avoid restarting the other applications running on the JBoss instance.
Any suggestions?

Concurrent Timeout exception on starting Jboss Wildfly 9.02 server

I am new to JBoss. When I try to deploy a .war file on the server, the following exception is printed on the console:
16:38:04,388 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
("core-service" => "management"),
("management-interface" => "http-interface")
]'
16:38:05,642 INFO [org.jboss.as.connector.deployers.jdbc] (MSC service thread 1-4) WFLYJCA0019: Stopped Driver service with driver-name = Aerobay.war_com.mysql.jdbc.Driver_5_1
16:38:09,548 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler#5f88823f for operation {"operation" => "add-deployer-chains","address" => []} at address [] failed handling operation rollback -- java.util.concurrent.TimeoutException: java.util.concurrent.TimeoutException
at org.jboss.as.controller.OperationContextImpl.waitForRemovals(OperationContextImpl.java:396)
at org.jboss.as.controller.AbstractOperationContext$Step.handleResult(AbstractOperationContext.java:1384)
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeInternal(AbstractOperationContext.java:1332)
at org.jboss.as.controller.AbstractOperationContext$Step.finalizeStep(AbstractOperationContext.java:1292)
at org.jboss.as.controller.AbstractOperationContext$Step.access$300(AbstractOperationContext.java:1180)
at org.jboss.as.controller.AbstractOperationContext.handleContainerStabilityFailure(AbstractOperationContext.java:964)
at org.jboss.as.controller.AbstractOperationContext.doCompleteStep(AbstractOperationContext.java:590)
at org.jboss.as.controller.AbstractOperationContext.completeStepInternal(AbstractOperationContext.java:354)
at org.jboss.as.controller.AbstractOperationContext.executeOperation(AbstractOperationContext.java:330)
at org.jboss.as.controller.OperationContextImpl.executeOperation(OperationContextImpl.java:1183)
at org.jboss.as.controller.ModelControllerImpl.boot(ModelControllerImpl.java:453)
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:327)
at org.jboss.as.controller.AbstractControllerService.boot(AbstractControllerService.java:313)
at org.jboss.as.server.ServerService.boot(ServerService.java:384)
at org.jboss.as.server.ServerService.boot(ServerService.java:359)
at org.jboss.as.controller.AbstractControllerService$1.run(AbstractControllerService.java:271)
at java.lang.Thread.run(Thread.java:745)
Thanks in advance for the help!
I had the same problem when I tried to deploy the WAR file on my Red Hat JBoss EAP 7.0.
But the server was integrated into my IDE (Eclipse Neon), and the problem only occurred in debug mode.
I was able to solve the problem by removing all breakpoints and then starting the server again.
Try increasing the timeout by setting the system property jboss.as.management.blocking.timeout as a Java option. You can do this in bin/standalone.conf.bat (depending on how you configure WildFly) by adding the line:
set "JAVA_OPTS=%JAVA_OPTS% -Djboss.as.management.blocking.timeout=600"
Change the number if it's not enough.
Increasing the timeout doesn't solve the root cause of the problem. You need to find out what is causing the block and fix that. In some cases, the solution may indeed be to increase the timeout.
In most cases, increasing resources is a bad way to solve issues. I had this case: WildFly took a long time to boot. I increased the timeout to 600, which solved the timeout error, but WildFly's boot time was still annoyingly long.
2018-03-26 07:50:36,523 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[("path" => "xxxxxxxxxxxxxxxx")]'
Finally I looked into the cause of the block and found that it was due to network host name resolution (NAS storage defined as a path in WildFly).
I went to the network settings and found that my local DNS was not set properly. I configured the local DNS instead of the public DNS and the blocking issue was gone. Hope this helps.
Regards
Sleem
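A quick way to test for the slow name resolution described in the answer above is to time a lookup of the host in question. This is a minimal sketch; time_resolution is a made-up helper, and the host name you pass in (for instance the NAS host behind a configured path) is yours to supply.

```python
import socket
import time


def time_resolution(host):
    """Resolve `host` once; return (elapsed_seconds, sorted addresses or None)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, None)
        # Each entry's sockaddr starts with the resolved IP address.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = None  # the name could not be resolved
    return time.monotonic() - start, addresses
```

A lookup that takes several seconds, or that fails only after a long wait, points at the DNS configuration rather than at WildFly itself.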
When I tried to debug and started the server in debug mode, I got the following error:
16:19:50,096 ERROR [org.jboss.as.controller.management-operation] (management-handler-thread - 1) JBAS013412: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'deploy' at address '[("deployment" => "ViprWeb.war")]'
16:19:50,096 ERROR [org.jboss.as.server] (management-handler-thread - 1) JBAS015870
16:20:00,117 ERROR [org.jboss.as.controller.management-operation] (management-handler-thread - 1) JBAS013413: Timeout after [5000] seconds waiting for service container stability while finalizing an operation.
I removed all my breakpoints and restarted my JBoss server, and that resolved the issue.
Just increase the timeout in standalone.conf.bat.
Set it like this: set "JAVA_OPTS=%JAVA_OPTS% -Djboss.as.management.blocking.timeout=600"
It worked for me.
I had the same problem running a "dockerized" application locally - turns out increasing the resources fixed the issue. What I finally settled on:
CPUs: 4
Memory: 8GB
Swap: 2GB
Same problem with NetBeans, but I had no breakpoints.
Running JBoss from the command line helped me:
Stop JBoss
Close NetBeans
Open a command line
Go to the JBoss folder > bin
Type: standalone.bat (this starts JBoss)
Open NetBeans
Worked fine!
Hope it'll help someone else.
I've been facing the same problem recently with WildFly 18 and 21, trying to run a WAR file containing JSR-352 batch jobs that worked fine on WildFly 14.
Increasing the timeout did not resolve the situation; it only prolonged the time before the TimeoutException was thrown, no matter the value (e.g. 5, 10 or 20 minutes).
I've just found that turning off the microprofile-metrics-smallrye subsystem seems to be a possible solution.
After commenting out this line in the standalone.xml file, the WAR deployed successfully and much faster (in about 2 minutes):
<subsystem xmlns="urn:wildfly:microprofile-metrics-smallrye:2.0" security-enabled="false" exposed-subsystems="*" prefix="${wildfly.metrics.prefix:wildfly}"/>
I am having a problem with Keycloak server 15.0.2:
WFLYCTL0190: Step handler org.jboss.as.server.DeployerChainAddHandler$FinalRuntimeStepHandler#410c55ac for operation add-deployer-chains at address [] failed
I am using MySQL 5.7 with the jconnect 8.0 jar.
I had the same problem. Then I killed the Kaspersky process and it helped!
I tackled a similar problem and only succeeded by undeploying the apps. This gave WildFly a clean environment to restart and bring up the management and HTTP services. Then deploy the apps/WARs again and identify what got you into this state.
In my case it was transactions that wanted to recover; deleting those from the DB stopped the problem from recurring.

MSDTC (Distributed Transaction Coordinator) Stops working. Error code -1073737669

I cannot start the Distributed Transaction Coordinator service.
It stopped working a few days ago.
When I try to start the service:
Registry properties:
RPC (as a test, the values here were changed to the opposite and back, without any result):
Windows logs\application logs:
53283
A MS DTC component has encountered an internal error. The process is being terminated. Error Specifics: DtcSystemShutdown (d:\w7rtm\com\complus\dtc\dtc\msdtc\src\msdtc.cpp#2539): Shutting down with an error
4111
The MS DTC service is stopping.
4102
DTC Security Configuration values (OFF = 0 and ON = 1): Network Administration of Transactions = 1,
Network Clients = 1,
Inbound Distributed Transactions using Native MSDTC Protocol = 1,
Outbound Distributed Transactions using Native MSDTC Protocol = 1,
Transaction Internet Protocol (TIP) = 0,
XA Transactions = 1,
SNA LU 6.2 Transactions = 1
Could not initialize the MS DTC Transaction Manager.
4356
Failed to initialize the MS DTC Communication Manager. Error Specifics: hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\ccm.cpp:2117, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
4358
The MS DTC Connection Manager is unable to register with RPC to use one of LRPC, TCP/IP, or UDP/IP. Please ensure that RPC is configured properly. If "ServerTcpPort" registry key is configured(DWORD value under the HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC for local DTC instance or under cluster hive for clustered DTC instance), please verify if the configured port is valid and the port is not already in use by a different component. Error Specifics:hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\iomgrsrv.cpp:2523, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
4156
String message: RPC raised an exception with a return code RPC_S_INVALIDA_ARG..
I found that we can use the -resetlog command, but this does not resolve my problem:
Firewall is disabled.
Try deleting the key HKLM\Software\Microsoft\Rpc\Internet from the registry.
To get around this issue, I had to copy the log file (which I had accidentally deleted) back to the location specified in the Local DTC Log Information settings.