Issue with WildFly cluster setup - wildfly

I have downloaded WildFly 21.0.1.Final and deployed it on 2 different machines on the same network. I haven't modified/updated the configuration. I tried to start the application servers using the following commands, but the cluster is not forming:
./bin/standalone.sh -c standalone-full-ha.xml -b 10.1xx.2.15 -u 230.0.0.1 & (on 1st node)
./bin/standalone.sh -c standalone-full-ha.xml -b 10.1xx.2.16 -u 230.0.0.1 & (on 2nd node)
Basically I am starting both with the same multicast address, but the nodes are not discovering each other. Both are in the same network/subnet. We were able to form a cluster with JBoss 4.
The logs on the respective nodes:
1st Node
[org.infinispan.CLUSTER] (ServerService Thread Pool -- 91) ISPN000094: Received new cluster view for channel ejb: [10.1xx.2.15|0] (1) [10.1xx.2.15]
2nd Node
[org.infinispan.CLUSTER] (ServerService Thread Pool -- 91) ISPN000094: Received new cluster view for channel ejb: [10.1xx.2.16|0] (1) [10.1xx.2.16]
Any help/idea is much appreciated.

WildFly's -b option sets the bind address of the public interface, but not the private interface, which is the default interface used for clustering (JGroups).
e.g.
./bin/standalone.sh -c standalone-full-ha.xml -b 10.1xx.2.15 -bprivate=10.1xx.2.15 -u 230.0.0.1
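And likewise on the second node, substituting that machine's own address for both -b and -bprivate (a sketch mirroring the command above):
./bin/standalone.sh -c standalone-full-ha.xml -b 10.1xx.2.16 -bprivate=10.1xx.2.16 -u 230.0.0.1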

How do I get a multi-node Keycloak cluster running with docker containers (no k8/swarm/etc)?

I have three EC2 instances in AWS:
Instance A - Docker with an nginx container - private IP address 1.2.3.4
Instances B and C - Docker with Keycloak containers - private IP addresses 1.2.3.5 and 1.2.3.6
RDS instance running MySQL 8 - host foo.us-east-1.rds.amazonaws.com
All are in the same VPC. Instances B and C are in different subnets (different availability zones), but can communicate with each other via ports 80 and 7600.
The Docker containers launch without issue with the following command:
docker run \
--name test-node-1 \
-e DB_PORT=3306 \
-e PROXY_ADDRESS_FORWARDING=true \
-e DB_VENDOR=mysql \
-e DB_DATABASE=keycloak \
-e DB_ADDR=foo.us-east-1.rds.amazonaws.com \
-e KEYCLOAK_STATISTICS=all \
-e DB_USER=keycloak \
-e KEYCLOAK_USER=kcuser \
-e DB_PASSWORD=... \
-e KEYCLOAK_PASSWORD=... \
-p 80:8080 \
-p 7600:7600 \
jboss/keycloak:16.1.0
Both containers launch fine, but they aren't talking to each other.
Adding the following three environment variables
-e JGROUPS_DISCOVERY_EXTERNAL_IP=1.2.3.5 \
-e JGROUPS_DISCOVERY_PROTOCOL=TCPPING \
-e JGROUPS_DISCOVERY_PROPERTIES='1.2.3.5[7600],1.2.3.6[7600]' \
causes Keycloak to crash on startup:
=========================================================================
Using MySQL database
=========================================================================
17:01:35,028 INFO [org.jboss.modules] (CLI command executor) JBoss Modules version 2.0.0.Final
17:01:35,124 INFO [org.jboss.msc] (CLI command executor) JBoss MSC version 1.4.13.Final
17:01:35,134 INFO [org.jboss.threads] (CLI command executor) JBoss Threads version 2.4.0.Final
17:01:35,267 INFO [org.jboss.as] (MSC service thread 1-2) WFLYSRV0049: Keycloak 16.1.0 (WildFly Core 18.0.0.Final) starting
...
17:01:43,320 INFO [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
17:01:43,322 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Keycloak 16.1.0 (WildFly Core 18.0.0.Final) started in 3261ms - Started 49 of 79 services (31 services are lazy, passive or on-demand)
The batch executed successfully
17:01:43,560 INFO [org.jboss.as] (MSC service thread 1-1) WFLYSRV0050: Keycloak 16.1.0 (WildFly Core 18.0.0.Final) stopped in 21ms
Setting JGroups discovery to TCPPING with properties {1.2.3.5[7600],1.2.3.6[7600]}
That last log line hangs for a few seconds, and then the process crashes. Note that it's the FIRST instance that crashes (I never get to launching the second one), so I don't think it's a matter of communication/firewall/etc., but ports 80 and 7600 are open.
I'm using the jboss/keycloak Docker image v16.1 from Docker Hub.
The container will need a TCPPING.cli script, or the appropriate modifications made to standalone-ha.xml. The following TCPPING.cli file worked for me (mounted into the docker container with -v $(pwd)/TCPPING.cli:/opt/jboss/tools/cli/jgroups/discovery/TCPPING.cli):
embed-server --server-config=standalone-ha.xml --std-out=echo
batch
/subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=authenticationSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=offlineSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=loginFailures:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=jgroups/stack=udp:remove()
/subsystem=jgroups/stack=tcp/protocol=MPING:remove()
/subsystem=jgroups/stack=tcp/protocol=$keycloak_jgroups_discovery_protocol:add(add-index=0, properties={"initial_hosts"=>$keycloak_jgroups_discovery_protocol_properties})
/subsystem=jgroups/channel=ee:write-attribute(name=stack, value="tcp")
/subsystem=jgroups/stack=tcp/transport=TCP/property=external_addr/:add(value=${env.JGROUPS_DISCOVERY_EXTERNAL_IP:127.0.0.1})
run-batch
stop-embedded-server
Note that this is different from what is recommended in https://www.keycloak.org/2019/05/keycloak-cluster-setup - specifically the line
/subsystem=jgroups/stack=tcp/protocol=$keycloak_jgroups_discovery_protocol:add(add-index=0, properties={"initial_hosts"=>$keycloak_jgroups_discovery_protocol_properties})
I also changed the JGROUPS_DISCOVERY_PROPERTIES env var to only be the first server (e.g. -e JGROUPS_DISCOVERY_PROPERTIES=1.2.3.5[7600]) - each server in the cluster should just need to check with the master in order to join.
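Putting it together, a hedged sketch of what the full launch might look like with the script above mounted in (reusing the image, IPs, ports, and credential placeholders from the question; adjust for your environment):
docker run \
--name test-node-1 \
-e DB_VENDOR=mysql \
-e DB_ADDR=foo.us-east-1.rds.amazonaws.com \
-e DB_PORT=3306 \
-e DB_DATABASE=keycloak \
-e DB_USER=keycloak \
-e DB_PASSWORD=... \
-e KEYCLOAK_USER=kcuser \
-e KEYCLOAK_PASSWORD=... \
-e PROXY_ADDRESS_FORWARDING=true \
-e KEYCLOAK_STATISTICS=all \
-e JGROUPS_DISCOVERY_EXTERNAL_IP=1.2.3.5 \
-e JGROUPS_DISCOVERY_PROTOCOL=TCPPING \
-e JGROUPS_DISCOVERY_PROPERTIES='1.2.3.5[7600]' \
-v $(pwd)/TCPPING.cli:/opt/jboss/tools/cli/jgroups/discovery/TCPPING.cli \
-p 80:8080 \
-p 7600:7600 \
jboss/keycloak:16.1.0
On the second node, JGROUPS_DISCOVERY_EXTERNAL_IP would be 1.2.3.6 while the discovery properties still point at 1.2.3.5[7600].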

JBoss/Keycloak cluster waits 1 minute before voting for coordinator

I have a 3-node Keycloak cluster. If one node goes down, JBoss starts coordinator selection after about 1 minute. Is it possible to decrease this timeout to reduce downtime? How can I configure the failed-node detection timeout?
[root@keycloak-01 ~]# date; systemctl stop keycloak
Tue May 25 11:35:46 MSK 2021
11:36:58,629 INFO [org.infinispan.CLUSTER] (VERIFY_SUSPECT.TimerThread-43,ejb,keycloak-02) ISPN000094: Received new cluster view for channel ejb: [keycloak-02|24] (2) [keycloak-02, keycloak-03]
11:36:58,630 INFO [org.infinispan.CLUSTER] (VERIFY_SUSPECT.TimerThread-43,ejb,keycloak-02) ISPN100001: Node keycloak-01 left the cluster
11:36:58,772 INFO [org.infinispan.CLUSTER] (non-blocking-thread--p7-t1) [Context=quartz] ISPN100007: After merge (or coordinator change), recovered members [keycloak-01, keycloak-02, keycloak-03] with topology id 104
11:36:58,774 INFO [org.infinispan.CLUSTER] (non-blocking-thread--p7-t1) [Context=quartz] ISPN100008: Updating cache members list [keycloak-02, keycloak-03], topology id 105
11:36:58,808 INFO [org.infinispan.CLUSTER] (non-blocking-thread--p15-t2) [Context=offlineClientSessions] ISPN100002: Starting rebalance with members [keycloak-02, keycloak-03], phase READ_OLD_WRITE_ALL, topology id 106
I suggest paying closer attention to 'Failure detection' (FD and FD_ALL) in the JGroups docs. I solved my task with the following:
<protocol type="FD_ALL">
    <property name="timeout">5000</property>
    <property name="interval">3000</property>
    <property name="timeout_check_interval">2000</property>
</protocol>
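If you prefer applying this from the CLI instead of editing standalone-ha.xml by hand, a hedged sketch using jboss-cli (assuming the udp stack; adjust the stack name if you run tcp, and verify the exact resource paths against your WildFly/EAP version, since newer versions expose protocol properties differently):
/subsystem=jgroups/stack=udp/protocol=FD_ALL/property=timeout:add(value=5000)
/subsystem=jgroups/stack=udp/protocol=FD_ALL/property=interval:add(value=3000)
/subsystem=jgroups/stack=udp/protocol=FD_ALL/property=timeout_check_interval:add(value=2000)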

Can't change Kafka broker-id in Incubator Helm chart?

I have one Zookeeper server (say xx.xx.xx.xxx:2181) running separately on a GCP Compute Engine VM.
I have 3 GKE clusters, all in different regions, on which I am trying to install Kafka broker nodes so that all nodes connect to the one Zookeeper server (xx.xx.xx.xxx:2181).
I installed the Zookeeper server on the VM following this guide with zookeeper properties looking like below:
dataDir=/tmp/data
clientPort=2181
maxClientCnxns=0
initLimit=5
syncLimit=2
tickTime=2000
# list of servers
server.1=0.0.0.0:2888:3888
I am using this Incubator Helm Chart to deploy the brokers on GKE clusters.
As per the README.md, I am trying to install with the command below:
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --name my-kafka \
--set replicas=1,zookeeper.enabled=false,configurationOverrides."broker\.id"=1,configurationOverrides."zookeeper\.connect"="xx.xx.xx.xxx:2181" \
incubator/kafka
Error
When I deploy as described above on all three GKE clusters, only one of the brokers connects to the Zookeeper server and the other two pods just restart indefinitely.
When I check the Zookeeper log (on the VM), it looks something like below:
...
[2019-10-30 14:32:30,930] INFO Accepted socket connection from /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2019-10-30 14:32:30,936] INFO Client attempting to establish new session at /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:30,938] INFO Established session 0x100009621af0057 with negotiated timeout 6000 for client /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:32,335] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0xc zxid:0x422 txntype:-1 reqpath:n/a Error Path:/config/users Error:KeeperErrorCode = NodeExists for /config/users (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:34,472] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0x14 zxid:0x424 txntype:-1 reqpath:n/a Error Path:/brokers/ids/0 Error:KeeperErrorCode = NodeExists for /brokers/ids/0 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,126] INFO Processed session termination for sessionid: 0x100009621af0057 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,127] INFO Closed socket connection for client /xx.xx.xx.xxx:54978 which had sessionid 0x100009621af0057 (org.apache.zookeeper.server.NIOServerCnxn)
[2019-10-30 14:36:49,123] INFO Expiring session 0x100009621af003b, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
...
I am sure I have created firewall rules to open the necessary ports, and that is not a problem because one of the broker nodes is able to connect (the one that connects first).
To me, it seems like the broker IDs are not getting changed for some reason, and that is why Zookeeper is rejecting the connections.
I say this because kubectl logs pod/my-kafka-n outputs something like below:
...
[2019-10-30 19:56:24,614] INFO [SocketServer brokerId=0] Shutdown completed (kafka.network.SocketServer)
...
[2019-10-30 19:56:24,627] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
...
As we can see, the logs above show brokerId=0 for the pods in all 3 clusters.
However, when I do kubectl exec -ti pod/my-kafka-n -- env | grep BROKER, I can see that the environment variable KAFKA_BROKER_ID is set to 1, 2 and 3 on the different brokers, as I configured.
What am I doing wrong? What is the correct way to change the kafka-broker id or to make all brokers connect to one Zookeeper instance?
"make all brokers connect to one Zookeeper instance?"
Seems like you are doing that okay via the configurationOverrides option. That'll deploy all pods with the same configuration.
That being said, the broker ID should not be the same per pod. If you inspect the StatefulSet YAML, it appears that the broker ID is calculated based on the POD_NAME variable.
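For illustration, a hedged sketch of the kind of startup logic such charts typically use (names and paths here are illustrative, not copied from the chart): the ordinal is stripped off the StatefulSet pod name at container start and passed to Kafka as broker.id, so each single-replica StatefulSet ends up with broker.id 0 regardless of the override.
# Illustrative only: derive the broker ID from the StatefulSet pod ordinal,
# e.g. my-kafka-0 -> 0, my-kafka-1 -> 1
BROKER_ID=${POD_NAME##*-}
exec kafka-server-start.sh /etc/kafka/server.properties --override broker.id=${BROKER_ID}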
Sidenote
"3 GKE clusters all in different regions on which I am trying to install Kafka broker nodes so that all nodes connect to one Zookeeper server"
It's not clear to me how you would be able to deploy to 3 separate clusters in one API call. But this architecture isn't recommended by the Kafka, Zookeeper, or Kubernetes communities unless these regions are "geographically close".

Mesos masters keep restarting

I have 3 Mesos masters with version 0.26.0, set up with a quorum of 2. When I start them, they keep restarting even before I bring up any frameworks or slaves.
Here are the errors I'm seeing:
F0322 19:36:56.009903 51459 master.cpp:1368] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
E0322 19:37:18.300568 41095 process.cpp:1911] Failed to shutdown socket with fd 26: Transport endpoint is not connected
There's no firewall running.
I start them with supervisord and the following command:
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 --work_dir=/tmp/mesos/work/int --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos
Zookeeper is up and running fine with 3 nodes. It's in use for other projects and has no issues at all with them.
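As a reference point, the "Failed to perform fetch within 1mins" message above matches the master's registry fetch timeout, which defaults to one minute. A hedged sketch of the same start command with that timeout raised (the --registry_fetch_timeout flag is taken from the Mesos master flags documentation; verify it against 0.26.0 before relying on it):
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 \
  --registry_fetch_timeout=3mins \
  --work_dir=/tmp/mesos/work/int \
  --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos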

Apache mod_cluster failed to drain active sessions

I am using JBoss EAP 6.2 as the web application server and Apache mod_cluster for load balancing.
Whenever I try to undeploy my web application, I get the following warning:
14:22:16,318 WARN [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000025: Failed to drain 2 remaining active sessions from default-host:/starrassist within 10.000000.1 seconds
14:22:16,319 INFO [org.jboss.modcluster] (ServerService Thread Pool -- 136) MODCLUSTER000021: All pending requests drained from default-host:/starrassist in 10.002000.1 seconds
After that, it takes forever to undeploy, and the EAP server group and node in which the application is deployed become unresponsive.
The only workaround is to restart the entire EAP server. My question is: is there an attribute I can set in EAP or mod_cluster so that active sessions beyond a maximum timeout expire on their own?
To control the timeout for stopping a context, you can use the following configuration value:
Stop Context Timeout
The amount of time, measured in units specified by stopContextTimeoutUnit, for which to wait for clean shutdown of a context (completion of pending requests for a distributable context; or destruction/expiration of active sessions for a non-distributable context).
CLI Command:
/profile=full-ha/subsystem=modcluster/mod-cluster-config=configuration/:write-attribute(name=stop-context-timeout,value=10)
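If you are running a standalone server rather than a managed domain, the same attribute can presumably be set without the profile prefix (a sketch; a reload may be needed for it to take effect):
/subsystem=modcluster/mod-cluster-config=configuration/:write-attribute(name=stop-context-timeout,value=10)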
Ref: Configure the mod_cluster Subsystem
Likewise, if you are using JDK 8, take a look at this issue: Draining pending requests fails with Oracle JDK8