JBoss EAP HA singleton deployment on two server groups

I'm trying to use the HA singleton on JBoss EAP 6.4. I have two hosts, app-05 and app-06, and two server groups, ADM and DEV. Both server groups use their own profiles based on the standard HA profile, and both use the ha-sockets socket binding group.
+------------------+--------+--------+
| Group v / Host > | app-05 | app-06 |
+------------------+--------+--------+
| ADM              | ADM_01 | ADM_02 |
+------------------+--------+--------+
| DEV              | DEV_01 | DEV_02 |
+------------------+--------+--------+
When I assign an application containing the HA singleton to one of the groups, the result is as I would expect. The app-05 server starts:
INFO [stdout] -------------------------------------------------------------------
INFO [stdout] GMS: address=app-05:DEV_01/singleton, cluster=singleton, physical address=10.7.0.131:55400
INFO [stdout] -------------------------------------------------------------------
INFO [o.i.r.t.j.JGroupsTransport] ISPN000094: Received new cluster view: [app-05:DEV_01/singleton|0] [app-05:DEV_01/singleton]
INFO [o.i.r.t.j.JGroupsTransport] ISPN000079: Cache local address is app-05:DEV_01/singleton, physical addresses are [10.7.0.131:55400]
INFO [o.j.a.clustering] JBAS010238: Number of cluster members: 1
and is elected as the HA singleton provider:
INFO [o.j.a.c.infinispan] JBAS010281: Started default cache from singleton container
INFO [o.j.a.c.singleton] JBAS010342: app-05:DEV_01/singleton elected as the singleton provider of the jboss.test.ha.singleton service
INFO [o.j.a.c.singleton] JBAS010340: This node will now operate as the singleton provider of the jboss.test.ha.singleton service
then the app-06 server starts and forms a cluster with the first:
INFO [stdout] -------------------------------------------------------------------
INFO [stdout] GMS: address=app-06:DEV_02/singleton, cluster=singleton, physical address=10.7.0.132:55400
INFO [stdout] -------------------------------------------------------------------
INFO [o.i.r.t.j.JGroupsTransport] ISPN000094: Received new cluster view: [app-05:DEV_01/singleton|1] [app-05:DEV_01/singleton, app-06:DEV_02/singleton]
INFO [o.i.r.t.j.JGroupsTransport] ISPN000079: Cache local address is app-06:DEV_02/singleton, physical addresses are [10.7.0.132:55400]
INFO [o.j.a.clustering] JBAS010238: Number of cluster members: 2
and ultimately the second server, app-06, is elected:
INFO [o.j.a.c.infinispan] JBAS010281: Started default cache from singleton container
INFO [o.j.a.c.singleton] JBAS010342: app-06:DEV_02/singleton elected as the singleton provider of the jboss.test.ha.singleton service
INFO [o.j.a.c.singleton] JBAS010340: This node will now operate as the singleton provider of the jboss.test.ha.singleton service
But when I assign the same application to the second server group, strange things happen: all FOUR servers form a single cluster:
INFO [o.i.r.t.j.JGroupsTransport] ISPN000094: Received new cluster view: [app-05:DEV_01/singleton|2] [app-05:DEV_01/singleton, app-06:DEV_02/singleton, app-06:ADM_02/singleton, app-05:ADM_01/singleton]
INFO [o.j.a.c.singleton] JBAS010342: app-05:DEV_01/singleton elected as the singleton provider of the jboss.test.ha.singleton service
INFO [o.j.a.c.singleton] JBAS010340: This node will now operate as the singleton provider of the jboss.test.ha.singleton service
Why do servers from different groups, using different profiles, see each other? And why do all of them form a single cluster with one HA singleton?
I would expect two independent clusters, each with its own instance of the HA singleton.

Do you use different JGroups multicast addresses for your groups? Two different server groups correspond to two different clusters. For a JBoss cluster to work correctly alongside other clusters in the same network, you need to isolate it so that it does not interfere with the others. This is done by setting different values for the jgroups-udp and messaging-group socket bindings in the socket binding group, for example via the system properties jboss.default.multicast.address and jboss.messaging.group.address. The multicast addresses should be in the range 224.0.0.0 to 239.255.255.255.
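For example, you can set those properties per server group from the management CLI of the domain controller; the standard ha-sockets bindings reference them in their multicast-address expressions. A sketch (the address values are arbitrary picks from the allowed range, not anything mandated):

# jboss-cli script, connected to the domain controller
/server-group=ADM/system-property=jboss.default.multicast.address:add(value=230.0.1.1)
/server-group=ADM/system-property=jboss.messaging.group.address:add(value=231.7.1.1)
/server-group=DEV/system-property=jboss.default.multicast.address:add(value=230.0.1.2)
/server-group=DEV/system-property=jboss.messaging.group.address:add(value=231.7.1.2)

After a restart the ADM and DEV servers no longer see each other's JGroups traffic, so you get the two independent clusters you expected, each electing its own singleton provider.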

Related

Wildfly : Singleton Deployment on Cluster | Elects two servers in Server Group

This does not happen every time, but it happens often.
Three clusters of server groups
WildFly 16
I deploy the .war from the UI. It picks up fine on one server:
2020-02-26 07:21:12,951 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0003: alp-esb-app02:servicedesk-02 elected as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
2020-02-26 07:21:13,115 INFO [org.jboss.as.server] (ServerService Thread Pool -- 26) WFLYSRV0010: Deployed "Now-1.11-SNAPSHOT.war" (runtime-name : "Now-1.11-SNAPSHOT.war")
2020-02-26 07:21:14,133 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
But when I disable/re-enable it, or deploy the next time, the same logs show up on two servers.
And there is a scheduler, which then runs twice and corrupts the database with duplicates.
I have to redeploy again and again and check until the logs look fine, i.e. only one server is elected.
Project Structure:
webapp -> META-INF -> singleton-deployment.xml
<?xml version="1.0" encoding="UTF-8"?>
<singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
Scheduler starts like:
import java.util.concurrent.TimeUnit;
import javax.ejb.AccessTimeout;
import javax.ejb.Singleton;
import javax.ejb.Startup;

@Startup
@Singleton
@AccessTimeout(value = 30, unit = TimeUnit.MINUTES)
public class SnowPollerNew {
Any suggestion why it sometimes runs fine and sometimes does not?
Is it linked to JGroups, or to communication between the two clusters?
You need to ensure that the servers are actually forming the cluster correctly.
Also, I remember some issues with the singleton election (WFLY-11619).
I would suppose that this is no longer reproducible with WildFly 18.
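Two servers both logging "elected as the singleton provider" is the classic symptom of a split cluster: the nodes did not form one group, so each side elects its own provider. Independent of fixing the JGroups communication, you can make a split harmless by requiring a quorum before any election. A sketch against the singleton subsystem's default policy (CLI on standalone-ha; prefix with /profile=... in a domain; the value 2 is an assumption for a two-node group):

# refuse to elect a provider unless at least 2 members are in the cluster
/subsystem=singleton/singleton-policy=default:write-attribute(name=quorum, value=2)

The trade-off: while only one node is up, nothing is elected, so the scheduler stays off entirely instead of running twice.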

Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache)

I am trying to deploy a cluster of 2 Keycloak docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instances' IPs are correctly resolved from the Route 53 private namespace, and they discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure-detection messages suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan (cache) side, everything seems to occur correctly, but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all the ports are wide open between these 2 instances, both for TCP and UDP.
Any idea? Has anyone been successful at configuring this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" on every second query, telling me that there is indeed a cache problem.
In the current Keycloak docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack on Amazon ECS will "partially" work: the discovery will work and the cluster will form, but the Infinispan cache won't be able to sync between instances, which produces erratic errors. There is probably a way to make it work "as-is", but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <-- set stack to tcp
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed in the security group between your Amazon ECS tasks.
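If you would rather not commit a running container, the same edit can be baked into a derived image with a Dockerfile (a sketch; it assumes the stock standalone-ha.xml in the 6.0.1 image declares the channel as stack="udp"):

FROM jboss/keycloak:6.0.1
# switch the JGroups channel from the default udp stack to tcp
RUN sed -i 's|<channel name="ee" stack="udp"|<channel name="ee" stack="tcp"|' \
    /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml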

ejabberd clustering problems and solutions

Setup Details
2 ejabberd nodes with PostgreSQL as the database (OS: Ubuntu 16.04)
I am trying to cluster the two ejabberd nodes as described in
https://docs.ejabberd.im/admin/guide/clustering/
After starting the master node, the following steps were performed on the slave node:
copy .erlang.cookie to the slave node
copy ejabberd.yml from the master to the slave.
The slave started successfully but shows the error below.
=====Error=========
Eshell V9.2 (abort with ^G)
(ejabberd@gim-Veriton-M6650G)1> 18:29:41.856 [notice] Changed loghwm of /usr/local/var/log/ejabberd/error.log to 100
18:29:41.856 [notice] Changed loghwm of /usr/local/var/log/ejabberd/ejabberd.log to 100
18:29:41.857 [info] Application lager started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.860 [info] Application crypto started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.865 [info] Application sasl started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.871 [info] Application asn1 started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.871 [info] Application public_key started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.880 [info] Application ssl started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.881 [info] Application p1_utils started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.883 [info] Application fast_yaml started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.888 [info] Application fast_tls started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.892 [info] Application fast_xml started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.895 [info] Application stringprep started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.899 [info] Application xmpp started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.903 [info] Application cache_tab started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.910 [info] Application eimp started on node 'ejabberd@gim-Veriton-M6650G'
18:29:41.910 [info] Loading configuration from /usr/local/etc/ejabberd/ejabberd.yml
18:29:41.913 [error] CRASH REPORT Process <0.67.0> with 0 neighbours exited with reason: no case clause matching <<>> in ejabberd_config:get_config_option_key/2 line 473 in application_master:init/4 line 134
18:29:41.913 [info] Application ejabberd exited with reason: no case clause matching <<>> in ejabberd_config:get_config_option_key/2 line 473
(ejabberd@gim-Veriton-M6650G)1>
I've tried re-creating the mnesia DB as well, but it didn't help.
ejabberdctl status shows that ejabberd is not running on that node.
Can someone please look into the issue and help?
Finally I found the solution to the problem.
The issue was with the node names: the master's node name was a fully qualified name, but the slave's node name had no domain. I also added both node names to the /etc/hosts file.
For ejabberd clustering, please refer to the steps below and the worked example that follows them.
Before starting, configure proper entries in the /etc/hosts files of both nodes, i.e. the nodes should resolve each other using their host names.
Set the ejabberd node name in the ejabberdctl.cfg file (ERLANG_NODE); the two nodes should have different node names.
1. Configure ejabberd on the master node with a proper node name (either an FQDN or just a name of your convenience).
2. Configure the slave node with the same config as the master, i.e. both nodes should have the same configuration in the ejabberd.yml file.
3. Copy .erlang.cookie from the master node to the slave; the ejabberd user should be able to read the cookie file.
4. Start the master node in live mode (ejabberdctl live).
5. Start the slave node in live mode.
6. Check the cookie value in the Erlang console of both nodes using the command erlang:get_cookie(). Both nodes should have the same value.
7. If both nodes have the same value, execute "ejabberdctl --no-timeout join_cluster ejabberd@nodename" on the slave; change ejabberd@nodename according to your environment. In my case I ran ejabberd as the 'ejabberd' user with the node name ejabberd@cluster-node1 (if you want, you can also use an FQDN like ejabberd@example.com).
8. If the above command executed without any error, the nodes are in a cluster.
9. Confirm the cluster from the Erlang console of either node using the command mnesia:info(). You will find the node details under "running_db_nodes".
10. Hurray, you are done!
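Putting the pieces together, a minimal sketch (the host names, addresses and node names are assumptions matching the example above):

# /etc/hosts on both nodes
10.0.0.1  cluster-node1
10.0.0.2  cluster-node2

# ejabberdctl.cfg, master and slave respectively
ERLANG_NODE=ejabberd@cluster-node1
ERLANG_NODE=ejabberd@cluster-node2

# on the slave, once both nodes are up in live mode
ejabberdctl --no-timeout join_cluster ejabberd@cluster-node1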
For load balancing the cluster you can use HAProxy; see the sketch below, and refer to https://blog.onefellow.com/post/76702632637/haproxy-and-ejabberd for details.
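A minimal haproxy.cfg excerpt for balancing XMPP client connections might look like this (node names and ports are assumptions):

listen xmpp
    bind *:5222
    mode tcp
    balance leastconn
    server node1 cluster-node1:5222 check
    server node2 cluster-node2:5222 check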
I've not done load balancing with any hardware load balancer; I still need to look into that.
If anyone has done it, please do post here.

WildFly content is obsolete and will be removed

I am trying to deploy an EAR on an empty WildFly server using a script containing this CLI command:
deploy path/app.ear --server-groups=myServerGroup
However, it seems that WildFly removes the deployment. Indeed, in the wildfly/domain/log/host-controller.log file, I see:
2017-03-23 18:00:29,173 INFO [org.jboss.as.repository] (management-handler-thread - 1) WFLYDR0001: Content added at location /opt/jboss/wildfly/domain/data/content/e1/42b70f3e3e5127f67b742db96d522c4602a779/content
2017-03-23 18:01:42,262 INFO [org.jboss.as.repository] (External Management Request Threads -- 5) WFLYDR0001: Content added at location /opt/jboss/wildfly/domain/data/content/b3/ea005efe4d3f22d006db850ac1c88b0a470b3a/content
2017-03-23 18:10:15,028 INFO [org.jboss.as.repository] (Host Controller Service Threads - 32) WFLYDR0009: Content /opt/jboss/wildfly/domain/data/content/e1/42b70f3e3e5127f67b742db96d522c4602a779 is obsolete and will be removed
2017-03-23 18:10:15,034 INFO [org.jboss.as.repository] (Host Controller Service Threads - 32) WFLYDR0002: Content removed from location /opt/jboss/wildfly/domain/data/content/e1/42b70f3e3e5127f67b742db96d522c4602a779/content
Moreover, if I run the command manually, the deployment works.
When does WildFly consider content obsolete?
Why would my deployment be considered obsolete?
UPDATE
Another example:
2017-05-23 16:06:48,493 INFO [org.jboss.as.repository] (management-handler-thread - 3) WFLYDR0001: Content added at location /opt/jboss/wildfly/domain/data/content/57/f5a4fe985f3528222053b0404654ead3502fa9/content
2017-05-23 16:13:05,281 INFO [org.jboss.as.repository] (Host Controller Service Threads - 3) WFLYDR0009: Content /opt/jboss/wildfly/domain/data/content/57/f5a4fe985f3528222053b0404654ead3502fa9 is obsolete and will be removed
2017-05-23 16:13:05,288 INFO [org.jboss.as.repository] (Host Controller Service Threads - 3) WFLYDR0002: Content removed from location /opt/jboss/wildfly/domain/data/content/57/f5a4fe985f3528222053b0404654ead3502fa9/content
Content is considered obsolete when it is not referenced by any deployment through its hash and it is more than 10 minutes old. That matches the logs above: the content added at 18:00:29 is removed at 18:10:15, so the scripted deployment apparently never ended up referencing the content it uploaded, while the manual run did.
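To check whether the uploaded content ended up referenced, you can inspect the deployment from the domain controller CLI (a sketch; app.ear and myServerGroup are the names used in the question):

# does the deployment reference a content hash?
/deployment=app.ear:read-attribute(name=content)
# is the deployment assigned to the server group?
/server-group=myServerGroup/deployment=app.ear:read-resource

If the second command fails, the content was uploaded to the repository but never assigned to a server group, which is exactly the unreferenced state the cleanup removes after 10 minutes.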

Singleton cluster actor is not starting up

The following cluster singleton is not starting up.
commander = system.actorOf(
  ClusterSingletonManager.props(
    Commander.props(this),
    terminationMessage = PoisonPill.getInstance,
    settings = ClusterSingletonManagerSettings.create(system).withRole("commander")
  ),
  name = "Commander")
No error messages are thrown.
Logs are:
[INFO] [08/03/2016 11:43:58.656] [ScalaTest-run-running-ClusterSuite] [akka.remote.Remoting] Starting remoting
[INFO] [08/03/2016 11:43:59.007] [ScalaTest-run-running-ClusterSuite] [akka.remote.Remoting] Remoting started; listening on addresses :[akka.tcp://galaxyFarFarAway@127.0.0.1:59592]
[INFO] [08/03/2016 11:43:59.035] [ScalaTest-run-running-ClusterSuite] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - Starting up...
[INFO] [08/03/2016 11:43:59.218] [ScalaTest-run-running-ClusterSuite] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - Registered cluster JMX MBean [akka:type=Cluster]
[INFO] [08/03/2016 11:43:59.218] [ScalaTest-run-running-ClusterSuite] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - Started up successfully
[INFO] [08/03/2016 11:43:59.247] [galaxyFarFarAway-akka.actor.default-dispatcher-2] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - Metrics will be retreived from MBeans, and may be incorrect on some platforms. To increase metric accuracy add the 'sigar.jar' to the classpath and the appropriate platform-specific native libary to 'java.library.path'. Reason: java.lang.ClassNotFoundException: org.hyperic.sigar.Sigar
[INFO] [08/03/2016 11:43:59.257] [galaxyFarFarAway-akka.actor.default-dispatcher-2] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - Metrics collection has started successfully
[INFO] [08/03/2016 11:43:59.268] [galaxyFarFarAway-akka.actor.default-dispatcher-3] [akka.cluster.Cluster(akka://galaxyFarFarAway)] Cluster Node [akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - No seed-nodes configured, manual cluster join required
Disconnected from the target VM, address: '127.0.0.1:59574', transport: 'socket'
The configuration is:
akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
    default-dispatcher {
      throughput = 10
    }
  }
  cluster {
    roles = [commander]
  }
  remote {
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 0
    }
  }
}
// note: this line must sit outside the akka { } block, otherwise its path becomes akka.akka.extensions
akka.extensions = ["akka.cluster.metrics.ClusterMetricsExtension"]
When I debug the code of the Commander class, the constructor is never even called. When I omit the ClusterSingletonManager and just create the actor with Props, however, it does work and the Commander actor is created.
I sense an incorrect configuration behind this issue. Do you have any remarks about this?
You've sensed quite right: you haven't specified the seed node configuration for the Akka clustering. You can see this in the last line of the log:
[akka.tcp://galaxyFarFarAway@127.0.0.1:59592] - No seed-nodes configured, manual cluster join required
Because you haven't specified any seed nodes in the configuration file, Akka will wait for you to specify them programmatically. You can specify the seed nodes in the config like this:
akka.cluster.seed-nodes = [
  "akka.tcp://yourClusterSystem@127.0.0.1:2551",
  "akka.tcp://yourClusterSystem@127.0.0.1:2552"
]
Alternatively, you can call the joinSeedNodes method to join the cluster programmatically, as sketched below. In both cases, you have to specify at least one seed node that is reachable. The actor system itself can also act as a seed node.
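A minimal sketch of the programmatic join (the system name is taken from your logs; joining the system's own address, as done here, is the single-node test case):

import akka.actor.ActorSystem
import akka.cluster.Cluster

val system = ActorSystem("galaxyFarFarAway")
val cluster = Cluster(system)

// the first reachable entry in the list is joined; using selfAddress
// makes this node form a cluster on its own. With real seed nodes,
// pass their akka.tcp addresses instead.
cluster.joinSeedNodes(List(cluster.selfAddress))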
Once the seed nodes have been specified and the actor system has joined the cluster, Akka features that depend on clustering (cluster singletons, sharding, etc.) will boot up. This is why you can launch an ordinary actor, but not the singleton.
For more information on setting up seed nodes, see the Akka cluster documentation.