Keycloak cluster fails on Amazon ECS (org.infinispan.commons.CacheException: Initial state transfer timed out for cache) - wildfly

I am trying to deploy a cluster of 2 Keycloak Docker images (6.0.1) on Amazon ECS (Fargate) using the built-in ECS Service Discovery mechanism (using DNS_PING).
Environment:
JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING
JGROUPS_DISCOVERY_PROPERTIES=dns_query=my.services.internal,dns_record_type=A
JGROUPS_TRANSPORT_STACK=tcp <---(also tried udp)
The instance IPs are correctly resolved from the Route53 private namespace, and the instances discover each other without any problem (x.x.x.138 is started first, then x.x.x.76).
Second instance:
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: entries collected from DNS (in 3 ms): [x.x.x.76:0, x.x.x.138:0]
[org.jgroups.protocols.dns.DNS_PING] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending discovery requests to hosts [x.x.x.76:0, x.x.x.138:0] on ports [55200 .. 55200]
[org.jgroups.protocols.pbcast.GMS] (ServerService Thread Pool -- 58) ip-x.x.x.76: sending JOIN(ip-x-x-x-76) to ip-x-x-x-138
And on the first instance:
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN000094: Received new cluster view for channel ejb: [ip-x-x-x-138|1] (2) [ip-x-x-x-138, ip-172-x-x-x-76]
[org.infinispan.remoting.transport.jgroups.JGroupsTransport] (thread-8,ejb,ip-x-x-x-138) Joined: [ip-x-x-x-76], Left: []
[org.infinispan.CLUSTER] (thread-8,ejb,ip-x-x-x-138) ISPN100000: Node ip-x-x-x-76 joined the cluster
[org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger-12,ejb,ip-x-x-x-76) ip-x-x-x-76: pingable_mbrs=[ip-x-x-x-138, ip-x-x-x-76], ping_dest=ip-x-x-x-138
So it seems we have a working cluster. Unfortunately, the second instance ends up failing with the following exception:
Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache work on ip-x-x-x-76
Before this occurs, I am seeing a bunch of failure detection tasks suspecting/unsuspecting the opposite instance:
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) haven't received a heartbeat from ip-x-x-x-76 for 60016 ms, adding it to suspect list
[org.jgroups.protocols.FD_ALL] (Timer runner-1,null,null) ip-x-x-x-138: suspecting [ip-x-x-x-76]
[org.jgroups.protocols.FD_ALL] (thread-9,ejb,ip-x-x-x-138) Unsuspecting ip-x-x-x-76
[org.jgroups.protocols.FD_SOCK] (thread-9,ejb,ip-x-x-x-138) ip-x-x-x-138: broadcasting unsuspect(ip-x-x-x-76)
On the Infinispan side (cache), everything seems to occur correctly but I am not sure. Every cache is "rebalanced" and each "rebalance" seems to end up with, for example:
[org.infinispan.statetransfer.StateConsumerImpl] (transport-thread--p24-t2) Finished receiving of segments for cache offlineSessions for topology 2.
It feels like it's a connectivity issue, but all ports are wide open between these 2 instances, for both TCP and UDP.
Any ideas? Has anyone successfully configured this on ECS (Fargate)?
UPDATE 1
The second instance was initially shutting down not because of the "Initial state transfer timed out .." error but because the health check was taking longer than the configured grace period. Nonetheless, with 2 healthy instances, I receive "404 - Not Found" once every 2 queries, telling me that there is indeed a cache problem.

In the current Keycloak Docker image (6.0.1), the default stack is UDP. According to this, version 7.0.0 will default to TCP and will also introduce a variable to toggle the stack (JGROUPS_TRANSPORT_STACK).
Using the UDP stack in Amazon ECS will only "partially" work: discovery works and the cluster forms, but the Infinispan cache cannot sync between instances, which produces erratic errors. There is probably a way to make it work "as-is", but I don't see anything blocked between the instances when checking the VPC Flow Logs.
A workaround is to switch to TCP by modifying the JGroups stack directly in the image in file /opt/jboss/keycloak/standalone/configuration/standalone-ha.xml:
<subsystem xmlns="urn:jboss:domain:jgroups:6.0">
<channels default="ee">
<channel name="ee" stack="tcp" cluster="ejb"/> <!-- set stack to tcp -->
</channels>
Then commit the new image:
docker commit -m="TCP cluster stack" CONTAINER_ID jboss/keycloak:6.0.1-tcp-cluster
Tag and push the image to Amazon ECR, and make sure port 7600 is allowed in the security group between your Amazon ECS tasks.
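To confirm that the security group really lets TCP traffic through on 7600, a small connectivity probe run from one task against the other can help. The following is only a minimal sketch using the standard library: the class name PortCheck and the 3-second timeout are my own choices, and it only proves that a TCP connection can be opened, not that JGroups itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortCheck {
    /** Returns true if a TCP connection to host:port can be opened within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical usage: java PortCheck <peer-ip> [port]
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 7600; // 7600 as in the answer above
        System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 3000));
    }
}
```

If this reports the peer as unreachable while the other Keycloak task is up, the security group or network ACL is the first thing to re-check.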

Related

"SchemaRegistryException: Failed to get Kafka cluster ID" for LOCAL setup

I downloaded the .tz archive (I am on a Mac) for Confluent version 7.0.0 from the official Confluent site and was following the setup for LOCAL (1 node). Kafka and ZooKeeper start fine, but the Schema Registry keeps failing. (Note: I am behind a corporate VPN.)
The exception message in the SchemaRegistry logs is:
[2021-11-04 00:34:22,492] INFO Logging initialized @1403ms to org.eclipse.jetty.util.log.Slf4jLog (org.eclipse.jetty.util.log)
[2021-11-04 00:34:22,543] INFO Initial capacity 128, increased by 64, maximum capacity 2147483647. (io.confluent.rest.ApplicationServer)
[2021-11-04 00:34:22,614] INFO Adding listener: http://0.0.0.0:8081 (io.confluent.rest.ApplicationServer)
[2021-11-04 00:35:23,007] ERROR Error starting the schema registry (io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication)
io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryException: Failed to get Kafka cluster ID
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1488)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.<init>(KafkaSchemaRegistry.java:166)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.initSchemaRegistry(SchemaRegistryRestApplication.java:71)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryRestApplication.configureBaseApplication(SchemaRegistryRestApplication.java:90)
at io.confluent.rest.Application.configureHandler(Application.java:271)
at io.confluent.rest.ApplicationServer.doStart(ApplicationServer.java:245)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:73)
at io.confluent.kafka.schemaregistry.rest.SchemaRegistryMain.main(SchemaRegistryMain.java:44)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:180)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.kafkaClusterId(KafkaSchemaRegistry.java:1486)
... 7 more
My schema-registry.properties file has bootstrap URL set to
kafkastore.bootstrap.servers=PLAINTEXT://localhost:9092
I saw some posts saying that the SchemaRegistry may be unable to connect to the Kafka cluster URL because of the localhost address. I am fairly new to Kafka and basically just need this local setup to run a Git repo that uses some Kafka topics, so my questions are:
How can I fix this? (I am behind a corporate VPN, but I figured that shouldn't affect this.)
Do I even need the SchemaRegistry?
I ended up just going with the local Docker setup instead. The only change I had to make to the Docker Compose YAML was the schema-registry port: I changed it to 8082 or 8084 (I don't remember exactly), just a port that is not used by any other Confluent service listed in docker-compose.yaml. My local setup is working fine now.

Wildfly : Singleton Deployment on Cluster | Elects two servers in Server Group

This does not happen every time, but it happens often.
3 Clusters of Server Group
Wildfly 16
I deploy the .war from the UI. The election works fine and picks one server:
2020-02-26 07:21:12,951 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0003: alp-esb-app02:servicedesk-02 elected as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
2020-02-26 07:21:13,115 INFO [org.jboss.as.server] (ServerService Thread Pool -- 26) WFLYSRV0010: Deployed "Now-1.11-SNAPSHOT.war" (runtime-name : "Now-1.11-SNAPSHOT.war")
2020-02-26 07:21:14,133 INFO [org.wildfly.clustering.server] (LegacyDistributedSingletonService - 1) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."Now-1.11-SNAPSHOT.war".installer service
But when I disable and re-enable it, or deploy it again, the same logs show up on two servers.
There is also a scheduler that then runs twice, which corrupts the database with duplicates.
I have to redeploy repeatedly and check the logs until only one server is elected.
Project Structure:
webapp -> META-INF -> singleton-deployment.xml
<?xml version="1.0" encoding="UTF-8"?>
<singleton-deployment xmlns="urn:jboss:singleton-deployment:1.0"/>
The scheduler starts like this:
@Startup
@Singleton
@AccessTimeout(value = 30, unit = TimeUnit.MINUTES)
public class SnowPollerNew {
Any suggestion as to why it sometimes runs fine but often does not?
Is it linked to JGroups, or to communication between the two clusters?
You need to ensure that the servers are building the cluster correctly.
Also, I remember some issues (WFLY-11619) with the singleton election.
I would suppose that this is not reproducible with WildFly 18.

How to check cluster working between two different JBoss Server

I configured a cluster between two different JBoss servers using the multicast method.
Both servers connect to each other when I start them.
After one day, I start getting the following messages.
Errors start to show up for the clustering in server.log:
05:28:17,447 ERROR [org.hornetq.core.server] (Thread-11905 (HornetQ-client-global-threads-377807954)) HQ224037:
cluster connection Failed to handle message: java.lang.IllegalStateException:
Cannot find binding for d7c1004f-b1a1-4160-8888-c38175ac45d599cf0dfe-5f30-11e4-bd7e-556a35fb9ec6 on
ClusterConnectionImpl@538608327[nodeUUID=930dee51-5f30-11e4-9695-ef52e2a27831, connector=TransportConfiguration(name=netty,
factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5445&host=172-29-250-191, address=jms,
server=HornetQServerImpl::serverUUID=930dee51-5f30-11e4-9695-ef52e2a27831]
at org.hornetq.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doConsumerCreat
05:28:17,411 ERROR [org.hornetq.core.server] (Thread-11439
(HornetQ-remoting-threads-HornetQServerImpl::serverUUID=99cf0dfe-5f30-11e4-bd7e-556a35fb9ec6-136247994-702467456))
HQ224016: Caught exception: HornetQException[errorType=QUEUE_EXISTS message=HQ119019:
Queue already exists 7a8b46d5-a038-4efd-900e-4c041c2c121f]
at org.hornetq.core.server.impl.HornetQServerImpl.createQueue(HornetQServerImpl.java:1811)
[hornetq-server-2.3.1.Final-redhat-1.jar:2.3.1.Final-redhat-1]
How can I verify that the cluster between the two servers is working? Is there a procedure or workaround available?
Red Hat provides a McastReceiverTest Java test utility; further information on its use is available at https://access.redhat.com/solutions/123073
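If that utility is not at hand, the same kind of check can be approximated with the JDK alone. The following is a rough sketch, not the Red Hat or JGroups tool: the class name McastCheck, the command-line arguments and the 30-second wait are assumptions of mine, and the group address and port must be taken from your own JBoss configuration. The idea is to start it in receive mode on one server and in send mode on the other.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class McastCheck {
    /** True if the given literal or host name maps to a multicast address. */
    static boolean isMulticast(String address) throws Exception {
        return InetAddress.getByName(address).isMulticastAddress();
    }

    /** Sends a single datagram to the multicast group (default TTL, so local subnet only). */
    static void send(String group, int port, String text) throws Exception {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(group), port));
        }
    }

    /** Joins the group and waits up to timeoutMs for one datagram; returns null on timeout. */
    static String receiveOne(String group, int port, int timeoutMs) throws Exception {
        try (MulticastSocket socket = new MulticastSocket(port)) {
            // null lets the OS pick the default network interface
            socket.joinGroup(new InetSocketAddress(InetAddress.getByName(group), port), null);
            socket.setSoTimeout(timeoutMs);
            DatagramPacket packet = new DatagramPacket(new byte[1024], 1024);
            try {
                socket.receive(packet);
            } catch (SocketTimeoutException e) {
                return null;
            }
            String text = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
            return packet.getAddress().getHostAddress() + ": " + text;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: java McastCheck receive|send <group> <port>
        if (args.length < 3 || !isMulticast(args[1])) {
            System.err.println("usage: McastCheck receive|send <multicast-group> <port>");
            return;
        }
        int port = Integer.parseInt(args[2]);
        if (args[0].equals("send")) {
            send(args[1], port, "hello from " + InetAddress.getLocalHost().getHostName());
            System.out.println("sent");
        } else {
            String msg = receiveOne(args[1], port, 30_000);
            System.out.println(msg == null ? "nothing received within 30 s" : "received from " + msg);
        }
    }
}
```

If the receiver never prints a message while the sender runs on the other server, multicast traffic is probably not passing between the two hosts, which would be worth ruling out before looking at HornetQ itself.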

Cassandra Channel has been closed

We have a small test cluster with 3 nodes on Amazon. Everything seems to work with cqlsh. But when I try to debug my app from my laptop (outside of Amazon, of course), I get 'Channel has been closed' errors, and it keeps retrying forever. I know it's likely caused by the config in cassandra.yaml, as my Eclipse console shows some private IPs. I have tried many different approaches but still get the same problem. I would appreciate any input on this. How do I get rid of the private IPs 10.251.x.x on the client?
Here is some context:
Versions:
[cqlsh 4.0.1 | Cassandra 2.0.4 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
cassandra-driver-core-2.0.0-rc1.jar
In cassandra.yaml:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "54.203.x.x,54.203.x.y"
listen_address: 10.251.a.b
broadcast_address: 54.203.x.x
native_transport_port: 9042
endpoint_snitch: Ec2MultiRegionSnitch
In Eclipse console:
DEBUG [main] (ControlConnection.java:145) - [Control connection] Successfully connected to /54.203.x.x
DEBUG [Cassandra Java Driver worker-0] (Session.java:379) - Adding /54.203.x.x to list of queried hosts
DEBUG [Cassandra Java Driver worker-1] (Session.java:379) - Adding /10.251.a.c to list of queried hosts
DEBUG [Cassandra Java Driver worker-1] (Connection.java:103) - [/10.251.a.c-1] Error connecting to /10.251.a.c (connection timed out: /10.251.a.c:9042)
DEBUG [Cassandra Java Driver worker-1] (Session.java:390) - Error creating pool to /10.251.a.c ([/10.251.a.c] Cannot connect)
DEBUG [Cassandra Java Driver worker-1] (Cluster.java:1064) - /10.251.a.c is down, scheduling connection retries
DEBUG [New I/O worker #4] (Connection.java:194) - Defuncting connection to /10.251.a.c
com.datastax.driver.core.TransportException: [/10.251.a.b] Channel has been closed
at com.datastax.driver.core.Connection$Dispatcher.channelClosed(Connection.java:548)
...
It seems that your Java driver is using auto discovery by calling "describe cluster" to get a list of all nodes in your cluster. In AWS using Ec2Snitch, that yields private IPs, which obviously won't work from outside of AWS. There is a discussion on this topic here:
https://datastax-oss.atlassian.net/browse/JAVA-145
The last comment got my attention. It says you can do something with the driver's LoadBalancingPolicy to limit the nodes. I hope this includes specifying the exact IPs so that it does not auto discover.
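To make the "limit the nodes" idea concrete, here is a plain-Java sketch of the filtering step only. It deliberately does not use the DataStax driver API: HostWhitelist and its methods are hypothetical names, and the addresses are documentation placeholders. In a real client the same check would have to live inside whatever load balancing policy hook the driver version offers.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HostWhitelist {
    private final Set<String> allowedIps;

    HostWhitelist(List<String> allowedIps) {
        this.allowedIps = new HashSet<>(allowedIps);
    }

    /** Keeps only the discovered hosts whose IP is explicitly allowed. */
    List<InetSocketAddress> filter(List<InetSocketAddress> discovered) {
        List<InetSocketAddress> kept = new ArrayList<>();
        for (InetSocketAddress host : discovered) {
            if (host.getAddress() != null
                    && allowedIps.contains(host.getAddress().getHostAddress())) {
                kept.add(host);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder addresses: one public contact point, one private IP from auto discovery.
        HostWhitelist whitelist = new HostWhitelist(Arrays.asList("203.0.113.10"));
        List<InetSocketAddress> discovered = Arrays.asList(
                new InetSocketAddress("203.0.113.10", 9042),
                new InetSocketAddress("10.251.0.3", 9042));
        System.out.println("hosts the client would try: " + whitelist.filter(discovered));
    }
}
```

With a filter like this in place, the private 10.251.x.x addresses returned by discovery would simply never be handed to the connection pool, so the client would stop retrying them.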

Where can I configure the timeout of JBoss cluster node drop?

We have a 4-node cluster. When one of the nodes has a long GC pause, the node is dropped from the cluster and the following log is generated:
2012-06-14 03:27:48,277 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] org.jboss.messaging.core.impl.postoffice.GroupMember$ControlMembershipListener@6225352b got new view [10.164.218.18:7910|10] [10.164.218.18:7910, 10.164.107.69:7910, 10.164.107.65:7910], old view is [10.164.218.14:7910|9] [10.164.218.14:7910, 10.164.218.18:7910, 10.164.107.69:7910, 10.164.107.65:7910]
2012-06-14 03:27:48,277 INFO [org.jboss.messaging.core.impl.postoffice.GroupMember] I am (10.164.218.18:7910)
2012-06-14 03:27:48,998 INFO [org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging is failing over for failed node 52. If there are many messages to reload this may take some time...
I would like to configure the timeout of the node drop. It seems to be 2 minutes in my case and I would like to increase it, but I can't find where to configure it.
Where can I configure the timeout of JBoss cluster node drop?
I found it: it is in oracle-persistence-service.xml. You need to adjust the configuration in the
ControlChannelConfig section. I think it is the 'timeout' attribute of the 'FD' protocol.