JBoss clustering: GMS join times out

I have JBoss 5.1.0.
We have configured JBoss for clustering, but we do not actually use clustering while developing or testing. Still, in order to launch the project I have to type the following:
./run.sh -c all -g uniqueclustername -b 0.0.0.0 -Djboss.messaging.ServerPeerID=1 -Djboss.service.binding.set=ports-01
While JBoss is starting, I see something like this in the console:
17:24:45,149 WARN [GMS] join(172.24.224.7:60519) sent to 172.24.224.2:61247 timed out (after 3000 ms), retrying
17:24:48,170 WARN [GMS] join(172.24.224.7:60519) sent to 172.24.224.2:61247 timed out (after 3000 ms), retrying
17:24:51,172 WARN [GMS] join(172.24.224.7:60519)
Here 172.24.224.7 is my local IP, while 172.24.224.2 is the IP of another developer in our room (and JBoss is stopped on that machine).
So it tries to join the other node, or something like that (I'm not very familiar with how JBoss acts in clusters), and as a result the application does not start.
What may the problem be? How can I avoid this join attempt?

You can probably fix this by specifying
-Djgroups.udp.ip_ttl=0
in your startup. This sets the IP time-to-live on the JGroups packets to zero, so they never get anywhere, and the cluster will never form. We use this in dev here to stop the various developer machines from forming a cluster. There's no need to specify a unique cluster name.
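For example, the startup command from the question would become something like this (a sketch; keep whatever other flags your setup needs, and the -g option can be dropped since a unique cluster name is no longer necessary):
./run.sh -c all -b 0.0.0.0 -Djgroups.udp.ip_ttl=0 -Djboss.messaging.ServerPeerID=1 -Djboss.service.binding.set=ports-01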
I'm assuming you need to do clustering in production, is that right? Could you just use the default configuration instead of all? This would remove the clustering stuff altogether.

While setting up the server, keeping the host name as localhost and passing --host=localhost instead of an IP address will solve the problem. That makes the server start in non-clustered mode.
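For instance, assuming the same all profile as in the question (a sketch; --host is the long form of run.sh's -b option, but verify against your version):
./run.sh -c all --host=localhost -Djboss.messaging.ServerPeerID=1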

Related

docker swarm - connections from wildfly to postgres randomly hang

I'm experiencing a weird problem when deploying a docker stack (compose file).
I have a three node docker swarm - master and two workers.
All machines are CentOS 7.5 with kernel 3.10.0 and docker 18.03.1-ce.
Most things run on the master, one of which is a wildfly (v9.x) application server.
On one of the workers is a postgres database.
After deploying the stack things work normally, but after a while (or maybe after a specific action in the web app) requests start to hang.
Running netstat -ntp inside the wildfly container shows 52 bytes stuck in the Send-Q:
tcp 0 52 10.0.0.72:59338 10.0.0.37:5432 ESTABLISHED -
On the postgres side the connection is also in ESTABLISHED state, but the send and receive queues are 0.
It's always exactly 52 bytes. I read somewhere that ACK packets with timestamps are also 52 bytes. Is there any way I can verify that?
We have the following sysctl tunables set:
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_timestamps = 0
The first three were needed because of this.
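For completeness, they were applied roughly like this (a sketch; the persistent file name is arbitrary):
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3
sysctl -w net.ipv4.tcp_timestamps=0
# or persist in e.g. /etc/sysctl.d/99-keepalive.conf and reload with: sysctl --system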
All services in the stack are connected to the same default network that docker creates.
Now if I move the postgres service onto the same host as the wildfly service, the problem doesn't seem to surface. Likewise, if I declare a separate network for postgres and attach it only to the services that need the database (and to the database itself), the problem also doesn't show up.
Has anyone come across a similar issue? Can anyone provide any pointers on how I can debug the problem further?
Turns out this is a known issue with pooled connections in swarm when services run on different nodes.
Basically the workaround is to set the above tunables and enable TCP keepalive on the socket. See here and here for more details.
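A minimal sketch of enabling keepalive on the database sockets, assuming the PostgreSQL JDBC driver (which supports a tcpKeepAlive URL parameter) and a WildFly datasource; the JNDI name, host, and database are placeholders:
<datasource jndi-name="java:jboss/datasources/AppDS" pool-name="AppDS">
  <connection-url>jdbc:postgresql://postgres:5432/appdb?tcpKeepAlive=true</connection-url>
  <driver>postgresql</driver>
</datasource>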

Why can't my Zookeeper server rejoin the Quorum?

I have three servers in my quorum. They are running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. One of them was restarted a couple days ago due to a deploy, and since then has not been able to join the quorum. Some lines in the logs that stick out are:
2014-03-03 18:44:40,995 [myid:1] - INFO [main:QuorumPeer#429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
and:
2014-03-03 18:44:41,233 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager#190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection#774] - Notification time out: 400
Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading zookeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper server's logs (the ones still in the quorum) when trying to connect to the first server. What I'm not sure of is if the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: Is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!
This is a ZooKeeper bug: Server is unable to join quorum after connection broken to other peers.
Restarting the leader solves this issue.
Everyone has this problem when pods or hosts rejoin the cluster with different IPs while keeping the same ID. For a host, the IP can change because the config specifies, say, 0.0.0.0 or a domain name. So follow these instructions:
1. Stop all servers, and in the config use
server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678
not DNS names. Use your LAN IP as the identifier.
2. Start your servers; it should work.
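For context, those server.N lines sit in zoo.cfg alongside the usual settings; a sketch, with placeholder paths and the example IPs above:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=10.x.x.x:1234:5678
server.2=10.x.x.y:1234:5678
server.3=10.x.x.z:1234:5678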

Ganglia Web - Hosts Up and Hosts Down Issue

I have set up Ganglia (Ganglia Core 3.6.0 and Ganglia Web 3.5.10) to monitor my cluster.
When gmond is restarted on one machine, metrics from all the other gmond machines also stop, i.e. I am not able to see metrics being published from other machines in Ganglia Web. I can also see Hosts up drop to 0 and Hosts down go to 13 (the total number of machines). As time passes, Hosts up comes back to 13.
Am I missing something? Can someone help me?
If it's always the same machine, it is probably your gmond 'end-point'. The gmetad daemon queries only one gmond (no redundancy); if that one goes down, everybody seems to go down.
If there is redundancy (e.g. more than one host in a data source), you can expect some lag when the first one goes down, because of the number of TCP queries made before it times out.
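A sketch of what a redundant data source looks like in gmetad.conf (the cluster name, polling interval, and hosts are placeholders):
data_source "my cluster" 15 node1:8649 node2:8649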

Heavy multicast traffic created by Jboss 5 AS cluster

We have a JBoss 5 AS cluster consisting of 2 nodes using multicast. Everything works fine and the servers are able to discover each other and form a cluster,
but the problem is that these servers generate heavy multicast traffic, which affects the network performance of other servers sharing the same network.
I am new to JBoss clustering. Is there any way to use unicast (point-to-point) instead of multicast, or to configure multicast so that it isn't a problem for the rest of the network? Can you refer me to some documentation, blog posts, or similar that can help me get rid of this problem?
I didn't get any answers here, but this might be of help to someone in the future. We managed to resolve it as follows.
Set the following TTL property for JBoss in the startup script:
-Djgroups.udp.ip_ttl=1
This restricts multicast messages to 1 hop. It will not reduce the amount of network traffic between the clustered JBoss nodes, but it will prevent the traffic from spreading beyond them.
If you have other servers in the same subnet that are affected by the flooding problem, then
you might have to switch to the TCP stack and use unicast instead of multicast:
-Djboss.default.jgroups.stack=tcp
There are also more clustering configuration files in the JBoss deploy directory that you should look at:
server/production/deploy/cluster/jboss-cache-manager.sar/META-INF/jboss-cache-manager-jboss-beans.xml
and other conf files in the JGroups config.
If multicast is not an option, or for some reason doesn't work due to the network topology, you can use unicast.
To use unicast clustering instead of UDP multicast, open your profile, look into the file jgroups-channelfactory-stacks.xml, and locate the stack named "tcp". That stack still uses UDP, but only for multicast discovery. If that low volume of UDP traffic is alright, you don't need to change it. If it isn't, or multicast doesn't work at all, you will need to configure the TCPPING protocol and set its initial_hosts to tell it where to look for cluster members; see the sketch below.
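A sketch of what the TCPPING entry in that stack might look like, using JGroups 2.x syntax as shipped with JBoss 5 (the IPs and ports here are placeholders; check the defaults in your own jgroups-channelfactory-stacks.xml):
<TCPPING timeout="3000"
         initial_hosts="${jgroups.tcpping.initial_hosts:192.168.0.1[7600],192.168.0.2[7600]}"
         port_range="1"
         num_initial_members="2"/>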
Afterwards, you will need to tell JBoss Cache to use this stack. Open jboss-cache-manager-jboss-beans.xml, where a stack is defined for each cache. You can either change it there from udp to tcp, or simply use the corresponding property when starting the AS; just add:
-Djboss.default.jgroups.stack=tcp

how to disable HA-JNDI on jboss-4.0.3sp1

My test bed is 2 servers which both run a service based on jboss-4.0.3sp1. They are configured as a cluster and have HA-JNDI running between the 2 nodes.
Due to a framework change, I need to shut down the service on one node. How can we shut down HA-JNDI?
I cannot update cluster-service.xml to remove the HA-JNDI definition, as that causes an application start-up error.
thanks,
Emre
Here is an excerpt from the JBoss Clustering documentation:
The java.naming.provider.url JNDI setting can now
accept a list of urls separated by a comma. Example:
java.naming.provider.url=server1:1100,server2:1100,server3:1100,server4:1100
When initialising, the JNP client code will try to get in touch with each
server from the list, one after the other, stopping as soon as one server
has been reached.
So set it to a server that is up.
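A sketch of a client-side jndi.properties that uses this, assuming the standard JBoss JNP context factory (the server names are placeholders):
java.naming.factory.initial=org.jnp.interfaces.NamingContextFactory
java.naming.factory.url.pkgs=org.jboss.naming:org.jnp.interfaces
java.naming.provider.url=server1:1100,server2:1100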
I hope this helps.