Infinispan (Red Hat Data Grid) in OpenShift, with WebSphere Liberty - kubernetes

We're trying to use Red Hat Data Grid (RHDG)/Infinispan in our OCP (4.5.36) cluster. We have the latest official RHDG Operator installed and a Cache-type cluster defined (which is apparently a Kubernetes StatefulSet).
I've then configured a WebSphere Liberty container/Deployment to try to use that Infinispan cluster for its sessions, as described in https://github.com/WASdev/ci.docker#session-caching.
Both the Infinispan cluster and the Liberty Deployment are in the same Project/namespace.
However, the Liberty container fails to connect, and the Infinispan containers are reporting several warnings of their own.
The Liberty container "client" log:
INFINISPAN_SERVICE_NAME(original): session-infinispan
INFINISPAN_SERVICE_NAME(normalized): SESSION_INFINISPAN
INFINISPAN_HOST: 172.30.137.86
INFINISPAN_PORT: 11222
INFINISPAN_USER: developer
INFINISPAN_PASS: <redacted>
Launching defaultServer (WebSphere Application Server 21.0.0.3/wlp-1.0.50.cl210320210309-1101) on Eclipse OpenJ9 VM, version 1.8.0_282-b08 (en_US)
[AUDIT ] CWWKE0001I: The server defaultServer has been launched.
[AUDIT ] CWWKE0100I: This product is licensed for development, and limited production use. The full license terms can be viewed here: https://public.dhe.ibm.com/ibmdl/export/pub/software/websphere/wasdev/license/base_ilan/ilan/21.0.0.3/lafiles/en.html
[AUDIT ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/defaults/keystore.xml
[AUDIT ] CWWKG0093A: Processing configuration drop-ins resource: /opt/ibm/wlp/usr/servers/defaultServer/configDropins/overrides/infinispan-client-sessioncache.xml
[AUDIT ] CWWKZ0058I: Monitoring dropins for applications.
[AUDIT ] CWWKT0016I: Web application available (default_host): http://payment-engine-6dcc5b6d5-jclx2:9080/payment/
[ERROR ] ISPN004007: Exception encountered. Retry 10 out of 10
org.infinispan.client.hotrod.exceptions.TransportException:: ISPN004071: Connection to 172.30.137.86/172.30.137.86:11222 was closed while waiting for response.
[ERROR ] SESN0307E: An exception occurred when initializing the cache. The exception is: org.infinispan.client.hotrod.exceptions.TransportException:: org.infinispan.client.hotrod.exceptions.TransportException:: ISPN004071: Connection to 172.30.137.86/172.30.137.86:11222 was closed while waiting for response.
at org.infinispan.client.hotrod.impl.transport.netty.ActivationHandler.exceptionCaught(ActivationHandler.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:300)
...
What looks like the relevant part of the Infinispan container log:
03:40:18,628 WARN (SINGLE_PORT-ServerIO-4-2) [io.netty.handler.ssl.ApplicationProtocolNegotiationHandler] [id: 0xc39380c8, L:/10.254.0.248:11222 ! R:/10.254.2.65:32986] TLS handshake failed: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: a0061e21000003ffffffff0f0000
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1254)
(Actually, there are several Infinispan startup WARNs, mostly about deprecated capabilities, but this is the only one with a stack trace, so I'm jumping to the conclusion that it might be "the culprit".)
Also, the Infinispan Service definition (screenshot omitted here) confirms that the IP and port match what the Liberty container is using.

Working through this on the Infinispan chat, it does appear that the SSL/TLS setup is incorrect or incomplete.
I had attempted to remove encryption from the existing Infinispan cluster, but either I didn't restart the right components or the setting can't be changed after the fact. Deleting the cluster and recreating it with encryption disabled, though, allowed the Liberty communication to work.
The following CR YAML works:
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: session-infinispan
spec:
  replicas: 1
  service:
    type: Cache
  security:
    endpointEncryption:
      type: None
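For reference, applying the CR is just (the filename here is my own choice):
oc apply -f session-infinispan.yaml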
Now to pursue what's missing from the Liberty setup to use SSL correctly. The Infinispan chat conversation says that this Liberty XML setup from the official image:
<server>
    <featureManager>
        <feature>sessionCache-1.0</feature>
    </featureManager>
    <httpSessionCache libraryRef="InfinispanLib">
        <properties infinispan.client.hotrod.server_list="${INFINISPAN_HOST}:${INFINISPAN_PORT}"/>
        <properties infinispan.client.hotrod.marshaller="org.infinispan.commons.marshall.JavaSerializationMarshaller"/>
        <properties infinispan.client.hotrod.java_serial_whitelist=".*"/>
        <properties infinispan.client.hotrod.auth_username="${INFINISPAN_USER}"/>
        <properties infinispan.client.hotrod.auth_password="${INFINISPAN_PASS}"/>
        <properties infinispan.client.hotrod.auth_realm="default"/>
        <properties infinispan.client.hotrod.sasl_mechanism="DIGEST-MD5"/>
        <properties infinispan.client.hotrod.auth_server_name="infinispan"/>
    </httpSessionCache>
    <httpSessionCache enableBetaSupportForInfinispan="true"/> <!-- TODO remove once no longer gated -->
    <library id="InfinispanLib">
        <fileset dir="${shared.resource.dir}/infinispan" includes="*.jar"/>
    </library>
</server>
needs the following properties added:
# Encryption
infinispan.client.hotrod.sni_host_name=$SERVICE_HOSTNAME
# Path to the TLS certificate.
# Clients automatically generate trust stores from certificates.
infinispan.client.hotrod.trust_store_path=/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
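In the server.xml syntax above, those would land as two more properties elements inside httpSessionCache; a sketch (here SERVICE_HOSTNAME is assumed to be an environment variable holding the Infinispan service's DNS name, e.g. session-infinispan.<namespace>.svc):
<httpSessionCache libraryRef="InfinispanLib">
    <!-- existing properties from above, plus: -->
    <properties infinispan.client.hotrod.sni_host_name="${SERVICE_HOSTNAME}"/>
    <properties infinispan.client.hotrod.trust_store_path="/var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt"/>
</httpSessionCache>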

Related

Consul agent on kubernetes, on node or pod?

I deployed an AWS EKS cluster via Terraform. I also deployed Consul following HashiCorp's tutorial, and I see the nodes in Consul's UI.
Now I'm wondering how all the Consul agents will know about the pods I deploy? I deploy something and it's not shown anywhere in Consul.
I can't find any documentation on how to register pods (services) with Consul via the node's Consul agent; do I need to configure that somewhere? Should I skip the node's agent and register the service straight from the pod? HashiCorp discourages this since it may increase resource utilization depending on how many pods one deploys on a given node. But then how does the node's agent know about my services deployed on that node?
Moreover, when I deploy a pod on a node, SSH into the node, and install Consul, the Consul agent can't find the Consul server (unlike the node itself, which can find it).
EDIT:
Bottom line is I can't find WHERE to add the configuration. If I execute ON THE POD:
consul members
It works properly and I get:
Node Address Status Type Build Protocol DC Segment
consul-consul-server-0 10.0.103.23:8301 alive server 1.10.0 2 full <all>
consul-consul-server-1 10.0.101.151:8301 alive server 1.10.0 2 full <all>
consul-consul-server-2 10.0.102.112:8301 alive server 1.10.0 2 full <all>
ip-10-0-101-129.ec2.internal 10.0.101.70:8301 alive client 1.10.0 2 full <default>
ip-10-0-102-175.ec2.internal 10.0.102.244:8301 alive client 1.10.0 2 full <default>
ip-10-0-103-240.ec2.internal 10.0.103.245:8301 alive client 1.10.0 2 full <default>
ip-10-0-3-223.ec2.internal 10.0.3.249:8301 alive client 1.10.0 2 full <default>
But if I execute:
# consul agent -datacenter=voip-full -config-dir=/etc/consul.d/ -log-file=log-file -advertise=$(wget -q -O - http://169.254.169.254/latest/meta-data/local-ipv4)
I get the following error:
==> Starting Consul agent...
Version: '1.10.1'
Node ID: 'f10070e7-9910-06c7-0e12-6edb6cc4c9b9'
Node name: 'ip-10-0-3-223.ec2.internal'
Datacenter: 'voip-full' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
Cluster Addr: 10.0.3.223 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-08-16T18:23:06.936Z [WARN] agent: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.936Z [WARN] agent: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.946Z [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
2021-08-16T18:23:06.947Z [WARN] agent.auto_config: Node name "ip-10-0-3-223.ec2.internal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
2021-08-16T18:23:06.948Z [INFO] agent.client.serf.lan: serf: EventMemberJoin: ip-10-0-3-223.ec2.internal 10.0.3.223
2021-08-16T18:23:06.948Z [INFO] agent.router: Initializing LAN area manager
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2021-08-16T18:23:06.950Z [WARN] agent.client.serf.lan: serf: Failed to re-join any previously known node
2021-08-16T18:23:06.950Z [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2021-08-16T18:23:06.951Z [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2021-08-16T18:23:06.951Z [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set `telemetry { disable_compat_1.9 = true }` to disable them.
2021-08-16T18:23:06.953Z [INFO] agent: started state syncer
2021-08-16T18:23:06.953Z [INFO] agent: Consul agent running!
2021-08-16T18:23:06.953Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:06.954Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2021-08-16T18:23:34.169Z [WARN] agent.router.manager: No servers available
2021-08-16T18:23:34.169Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
So where to add the config?
I also tried adding a Service in k8s pointing to the pod, but the service doesn't come up in Consul's UI...
What do you guys recommend?
Thanks
Consul knows where these services are located because each service
registers with its local Consul client. Operators can register
services manually, configuration management tools can register
services when they are deployed, or container orchestration platforms
can register services automatically via integrations.
If you are planning to use the manual option, you have to register the service with Consul yourself.
Something like:
echo '{
  "service": {
    "name": "web",
    "tags": [
      "rails"
    ],
    "port": 80
  }
}' > ./consul.d/web.json
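After dropping the file into the agent's config directory, the agent needs to pick it up, either by restarting it or with a reload:
consul reload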
You can find a good example at: https://thenewstack.io/implementing-service-discovery-of-microservices-with-consul/
Also, this is a very nice document for detailed configuration of health checks and service discovery: https://cloud.spring.io/spring-cloud-consul/multi/multi_spring-cloud-consul-discovery.html
Official document: https://learn.hashicorp.com/tutorials/consul/get-started-service-discovery
BTW, I was finally able to figure out the issue.
consul-dns is not deployed by default; I had to deploy it manually, then forward all .consul requests from CoreDNS to consul-dns.
All is working now. Thanks!
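For anyone following along, the CoreDNS side of that forwarding is a stanza along these lines in the coredns ConfigMap (a sketch based on HashiCorp's Consul DNS tutorial; 10.100.0.10 is a made-up placeholder for the consul-dns Service's cluster IP):
consul:53 {
    errors
    cache 30
    forward . 10.100.0.10
}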

How to implement EJBTimer (persistent) in Open Liberty

Product name: Open Liberty
Product version: 20.0.0.7
Product edition: Open
Is it possible to implement persistent EJB timers on the filesystem-based default Derby DB, using embedded Derby?
I installed Derby in /tmp/derby and configured server.xml with the following, but I don't see any file being created under /tmp when I start the Open Liberty JVM. What am I missing in this approach?
<feature>ejbPersistentTimer-3.2</feature>
<library id="DerbyLib">
    <fileset dir="/tmp/derby/lib" includes="derby.jar"/>
</library>
<dataSource id="DefaultDerbyDatasource" jndiName="jdbc/defaultDatasource" statementCacheSize="10" transactional="false">
    <jdbcDriver libraryRef="DerbyLib"/>
    <properties.derby.embedded createDatabase="create" databaseName="/tmp/sample.ejbtimer.db" shutdownDatabase="false"/>
    <containerAuthData user="user1" password="derbyuser" />
</dataSource>
Check this book - http://www.redbooks.ibm.com/abstracts/sg248076.html?Open
In chapter "5.2.4 Developing applications using timers" you should find everything needed.
UPDATE based on comment:
If you look at the book and at the log, it shows:
[INFO ] CNTR4000I: The ITSOTimerApp.war EJB module in the ITSOTimerApp
application is starting.
[INFO ] CNTR0167I: The server is binding the com.ibm.itso.timers.TimerBean
interface of the TimerBean enterprise bean in the ITSOTimerApp.war module of
the ITSOTimerApp application. The binding location is:
java:global/ITSOTimerApp/TimerBean!com.ibm.itso.timers.TimerBean
[INFO ] DSRA8203I: Database product name : Apache Derby
[INFO ] DSRA8204I: Database product version : 10.8.2.3 - (1212722)
[INFO ] DSRA8205I: JDBC driver name : Apache Derby Embedded JDBC Driver
[INFO ] DSRA8206I: JDBC driver version : 10.8.2.3 - (1212722)
[INFO ] CNTR0219I: The server created 1 persistent automatic timer or timers
and 0 non-persistent automatic timer or timers for the ITSOTimerApp.war module.
TimerBean initialized
It creates the DB 'as needed', so if you don't have any persistent timer beans, the service will not be started and no DB will be created.
Liberty in general follows a lazy model and doesn't start unneeded services.
So create a sample application, and then your DB will be created. There is no need to create the database, or a connection to it, when nothing is requesting it.
In general, it is not advisable to use Derby Embedded database for persistent EJB timers due to limitations of Derby Embedded that all connections use the same class loader (implying the same JVM as well). This means you cannot leverage the failover capability (missedTaskThreshold setting) or even have multiple servers connected to the database at all. If you decide to use a Derby Embedded database, it means that you are limiting yourself to a single server. You can decide for yourself if that is acceptable based on what your needs are.
In the case of the example configuration you gave, it doesn't work because the EJB persistent timers feature in Liberty has no way of knowing that your dataSource, "DefaultDerbyDatasource" with jndiName "jdbc/defaultDatasource", is the data source that it ought to use. Also, it is incorrect to specify transactional="false" on the data source that you want EJB persistent timers to use, because EJB persistent timers are transactional in nature.
I assume that what you are intending to do is configure the Java EE default data source and expect EJB persistent timers to use it. That approach will work, except that when configuring the Java EE default data source, you need to specify the id as "DefaultDataSource".
Here is an example that switches your configured data source to the Java EE default data source and removes the transactional="false" config:
<library id="DerbyLib">
    <fileset dir="/tmp/derby/lib" includes="derby.jar"/>
</library>
<dataSource id="DefaultDataSource" jndiName="jdbc/defaultDatasource" statementCacheSize="10">
    <jdbcDriver libraryRef="DerbyLib"/>
    <properties.derby.embedded createDatabase="create" databaseName="/tmp/sample.ejbtimer.db" shutdownDatabase="false"/>
    <containerAuthData user="user1" password="derbyuser" />
</dataSource>
By default, the EJB persistent timers feature should create database tables once the application runs and the EJB module is used.
However, you may be able to verify the configuration prior to that point by running the ddlGen utility (after correcting the configuration as above):
https://www.ibm.com/docs/en/was-liberty/base?topic=line-running-ddlgen-utility
which gives you the opportunity to see the DDL that it will use, and optionally to run it manually (useful if you turned off automatic table creation via <databaseStore id="defaultDatabaseStore" createTables="false"/>).
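For reference, the invocation is along these lines (assuming the server is named defaultServer):
wlp/bin/ddlGen generate defaultServer
which should write the generated DDL files under the server's output directory in a ddl subdirectory.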

Restcomm - Solving SMSC GW 7.2 configuration failures

We configured the latest version (7.2) of the SMSC-GW to work on our server with its environment (Cassandra and such). However, after setting everything up, some failures are appearing (which did not appear in previous versions).
Firstly, when connecting the simulators and the gateway using the default settings (JSS7 <-> SMSC-GW <-> SMPP):
JSS7 is connected and sending, but no response is received.
SMPP is connected to the SMSC-GW and the ESME is bound. SMPP tries to send to SS7 but receives a response PDU packet failure from the SMSC-GW.
I tried configuring DB routing rules, but that did not work.
Also, the log in the SMSC-GW server is frequently displaying the following message:
16:00:28,504 INFO [SchedulerResourceAdaptor] (pool-56-thread-1) Not all SBB are running now: ServicesDownList=[smscTxSmppServerServiceState, smscRxSmppServerServiceState, smscTxSipServerServiceState, smscRxSipServerServiceState, smscTxHttpServerServiceState, moServiceState, homeRoutingServiceState, mtServiceState, alertServiceState, chargingServiceState, ]
And the JSS7 management console GUI is displaying output that looks wrong (screenshot omitted here).
So are these the source of the SMSC-GW failures?
UPDATE: I found this error in the server.log
2017-02-02 10:57:42,005 WARN [org.mobicents.slee.container.deployment.jboss.SleeContainerDeployerImpl] (SLEE-InternalDeployer-thread-1) SLEE DUs not deployed, due to missing dependencies: file:/home/coreteam/kitchensink/restcomm-smsc-7.2.109/jboss-5.1.0.GA/server/simulator/deploy/smsc-services-du-7.2.109.jar/
Followed by:
EventTypeID[name=org.mobicents.smsc.slee.services.smpp.server.events.SS7_SEND_MT,vendor=org.mobicents,version=1.0]
ResourceAdaptorTypeID[name=PersistenceResourceAdaptorType,vendor=org.mobicents,version=1.0]
ResourceAdaptorTypeID[name=SchedulerResourceAdaptorType,vendor=org.mobicents,version=1.0]
SipRA
EventTypeID[name=org.mobicents.smsc.slee.services.smpp.server.events.SS7_SEND_RSDS,vendor=org.mobicents,version=1.0]
SchedulerResourceAdaptor
PersistenceResourceAdaptor
EventTypeID[name=org.mobicents.smsc.slee.services.smpp.server.events.SMPP_SM,vendor=org.mobicents,version=1.0]
EventTypeID[name=org.mobicents.smsc.slee.services.smpp.server.events.SS7_SM,vendor=org.mobicents,version=1.0]
EventTypeID[name=org.mobicents.smsc.slee.services.smpp.server.events.SIP_SM,vendor=org.mobicents,version=1.0]
2017-02-02 14:41:17,450 WARN [org.mobicents.slee.container.deployment.jboss.DeploymentManager] (main) Unable to INSTALL smsc-services-du-7.3.0-SNAPSHOT.jar right now. Waiting for dependencies to be resolved.
Solved it quite a while ago, but thought I would share. I simply installed the missing SipRA dependency by adding the following to the deploy-config.xml file:
<ra-entity
    resource-adaptor-id="ResourceAdaptorID[name=JainSipResourceAdaptor,vendor=net.java.slee.sip,version=1.2]"
    entity-name="SipRA">
    <properties>
        <property name="javax.sip.PORT" type="java.lang.Integer" value="5060" />
    </properties>
    <ra-link name="SipRA" />
</ra-entity>
In the $JBOSS_HOME/server/profile_name/deploy/restcomm-slee directory.
I set the port to some other value since that number was already taken by some other service.
The smsc-services-du-7.2.109.jar then installed automatically the next time I ran the SMSC-GW.

NServiceBus Worker checking in with capacity of 0

I am using NServiceBus 3.3 and trying to get a new pre-prod environment set up.
It all works fine in production and in one of my existing pre-prod environments with my existing configurations.
But in my new environment, my workers are checking in with a capacity of 0. (They check in with a capacity of 1 in the working environments.)
Again, the configs are the same between the environments (except for machine names, of course).
Any idea why this could happen?
This is the output of my log (with Queue names and machine names changed):
2015-05-07 10:53:33,904 [1] INFO NServiceBus.Host [(null)] - Going to activate profile: NServiceBus.Distributor, NServiceBus.Host, Version=3.3.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c
2015-05-07 10:53:33,904 [1] INFO NServiceBus.Host [(null)] - Going to activate profile: NServiceBus.Production, NServiceBus.Host, Version=3.3.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c
2015-05-07 10:53:33,919 [1] INFO NServiceBus.Host [(null)] - Going to activate profile: NServiceBus.PerformanceCounters, NServiceBus.Host, Version=3.3.0.0, Culture=neutral, PublicKeyToken=9fc386479f8a226c
2015-05-07 10:53:33,935 [1] WARN Distributor.myFromQueue [(null)] - No transport configuration found so the distributor will default to one thread, for production scenarios you would want to adjust this setting
2015-05-07 10:53:33,950 [1] INFO Distributor.myFromQueue [(null)] - Endpoint configured to host the distributor, applicative input queue re routed to myFromQueue.worker#DistributorHost
2015-05-07 11:10:05,015 [Worker.13] INFO Distributor.myFromQueue [(null)] - Worker myFromQueue#WorkerMachine has started up, clearing previous reported capacity
2015-05-07 11:10:05,030 [Worker.13] INFO Distributor.myFromQueue [(null)] - Worker myFromQueue#WorkerMachine checked in with available capacity: 0
And this is the relevant part of my worker config file:
<MsmqTransportConfig NumberOfWorkerThreads="4" MaxRetries="5" />
<MessageForwardingInCaseOfFaultConfig ErrorQueue="error" />
<MasterNodeConfig Node="DistributorHost" />
<UnicastBusConfig>
    <MessageEndpointMappings>
        <add Messages="Bus.MyMessageAssembly" Endpoint="QueueForTheDistributor#DistributorHost" />
    </MessageEndpointMappings>
</UnicastBusConfig>
I had the same problem and was able to resolve it by deleting the queues (myfromqueue and myfromqueue.retries) on the worker agent. NServiceBus automatically recreated the queues and everything started processing again for me.
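In case it helps anyone scripting this, deleting the private queues can be done with System.Messaging from PowerShell; a rough sketch (the queue paths are my guess from the names in the log, adjust to your setup):
[Reflection.Assembly]::LoadWithPartialName("System.Messaging") | Out-Null
# delete the worker's input and retries queues; NServiceBus recreates them on startup
[System.Messaging.MessageQueue]::Delete('.\private$\myfromqueue')
[System.Messaging.MessageQueue]::Delete('.\private$\myfromqueue.retries')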

Wildfly 9 - mod_cluster on TCP

We are currently testing a move from WildFly 8.2.0 to WildFly 9.0.0.CR1 (or CR2 built from snapshot). The system is a cluster using mod_cluster and runs on VPSes, which in fact prevents it from using multicast.
On 8.2.0 we have been using the following mod_cluster configuration, which works well:
<mod-cluster-config proxy-list="1.2.3.4:10001,1.2.3.5:10001" advertise="false" connector="ajp">
    <dynamic-load-provider>
        <load-metric type="cpu"/>
    </dynamic-load-provider>
</mod-cluster-config>
Unfortunately, on 9.0.0 proxy-list was deprecated, and the server fails to start with an error. There is a terrible lack of documentation; however, after a couple of tries I discovered that proxy-list was replaced by proxies, a list of outbound-socket-binding references. Hence, the configuration looks like the following:
<mod-cluster-config proxies="mc-prox1 mc-prox2" advertise="false" connector="ajp">
    <dynamic-load-provider>
        <load-metric type="cpu"/>
    </dynamic-load-provider>
</mod-cluster-config>
And the following should be added into the appropriate socket-binding-group (full-ha in my case):
<outbound-socket-binding name="mc-prox1">
    <remote-destination host="1.2.3.4" port="10001"/>
</outbound-socket-binding>
<outbound-socket-binding name="mc-prox2">
    <remote-destination host="1.2.3.5" port="10001"/>
</outbound-socket-binding>
So far so good. After this, the httpd cluster starts registering the nodes. However, I am getting errors from the load balancer. When I look into /mod_cluster-manager, I see a couple of "Node REMOVED" lines, and there are also many errors like:
ERROR [org.jboss.modcluster] (UndertowEventHandlerAdapter - 1) MODCLUSTER000042: Error MEM sending STATUS command to node1/1.2.3.4:10001, configuration will be reset: MEM: Can't read node
In the log of mod_cluster there are the equivalent warnings:
manager_handler STATUS error: MEM: Can't read node
As far as I understand, the problem is that although WildFly/mod_cluster is able to connect to httpd/mod_cluster, it does not work the other way around. Unfortunately, even after extensive effort, I am stuck.
Could someone help with setting up mod_cluster for WildFly 9.0.0 without advertising? Thanks a lot.
I ran into the Node REMOVED issue too.
I managed to solve it by using the following as the instance-id:
<subsystem xmlns="urn:jboss:domain:undertow:2.0" instance-id="${jboss.server.name}">
I hope this will help someone else too ;)
There is no need for any unnecessary effort or uneasiness about static proxy configuration. Each WildFly distribution comes with XSD schemas that describe the XML subsystem configuration. For instance, with WildFly 9.x, it's:
WILDFLY_DIRECTORY/docs/schema/jboss-as-mod-cluster_2_0.xsd
It says:
<xs:attribute name="proxies" use="optional">
    <xs:annotation>
        <xs:documentation>List of proxies for mod_cluster to register with defined by outbound-socket-binding in socket-binding-group.</xs:documentation>
    </xs:annotation>
    <xs:simpleType>
        <xs:list itemType="xs:string"/>
    </xs:simpleType>
</xs:attribute>
The following setup works out of the box:
Download wildfly-9.0.0.CR1.zip or build with ./build.sh from sources.
Let's assume you have 2 boxes: an Apache HTTP Server with mod_cluster acting as a load-balancing proxy, and your WildFly server acting as a worker. Make sure both servers can access each other on both the MCMP-enabled VirtualHost's address and port (Apache HTTP Server side) and on the WildFly AJP and HTTP connector side. The common mistake is to bind WildFly to localhost; it then reports its address as localhost to the Apache HTTP Server residing on a different box, which makes it impossible for it to contact the WildFly server back. The communication is bidirectional.
This is my configuration diff from the default wildfly-9.0.0.CR1.zip.
328c328
< <mod-cluster-config advertise-socket="modcluster" connector="ajp" advertise="false" proxies="my-proxy-one">
---
> <mod-cluster-config advertise-socket="modcluster" connector="ajp">
384c384
< <subsystem xmlns="urn:jboss:domain:undertow:2.0" instance-id="worker-1">
---
> <subsystem xmlns="urn:jboss:domain:undertow:2.0">
435c435
< <socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:102}">
---
> <socket-binding-group name="standard-sockets" default-interface="public" port-offset="${jboss.socket.binding.port-offset:0}">
452,454d451
< <outbound-socket-binding name="my-proxy-one">
< <remote-destination host="10.10.2.4" port="6666"/>
< </outbound-socket-binding>
456c453
< </server>
---
> </server>
Changes explanation
proxies="my-proxy-one" - the outbound socket binding name; there could be more of them listed here.
instance-id="worker-1" - the name of the worker, a.k.a. JVMRoute.
offset - you can ignore this; it's just for my test setup. The offset does not apply to outbound socket bindings.
<outbound-socket-binding name="my-proxy-one"> - the IP and port of the VirtualHost in the Apache HTTP Server configuration containing the EnableMCPMReceive directive.
Conclusion
Generally, these MEM read / node error messages are related to network problems, e.g. WildFly can contact Apache, but Apache cannot contact WildFly back. Last but not least, it could happen that the Apache HTTP Server's configuration uses the PersistSlots directive and some substantial environment configuration change took place, e.g. a switch from mpm_prefork to mpm_worker. In this case, the MEM read error messages are not related to WildFly, but to the cached slotmem files in HTTPD/cache/mod_cluster that need to be deleted.
I'm certain it's network in your case though.
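If you do hit the PersistSlots case, the cleanup amounts to something like this (the cache path is illustrative; it depends on your httpd layout):
rm -f /path/to/httpd/cache/mod_cluster/*
followed by an httpd restart so mod_cluster rebuilds its slot memory.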
After a couple of weeks I got back to the problem and found the solution. The problem was - of course - in the configuration and had nothing to do with the particular version of WildFly. More specifically:
There were three nodes in the domain and three servers on each node. All nodes were launched with the following property:
-Djboss.node.name=nodeX
...where nodeX is the name of the particular node. However, this meant that all three servers on a node got the same name, which is exactly what confused the load balancer.
As soon as I removed this property, everything started to work.
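For anyone who does want explicit names: in domain mode, each server entry in host.xml can carry its own value instead of one per host; a sketch (the server and node names here are made up):
<servers>
    <server name="server-one" group="main-server-group">
        <system-properties>
            <!-- unique per server, so the load balancer sees distinct JVMRoutes -->
            <property name="jboss.node.name" value="node1-server1"/>
        </system-properties>
    </server>
    <server name="server-two" group="main-server-group">
        <system-properties>
            <property name="jboss.node.name" value="node1-server2"/>
        </system-properties>
    </server>
</servers>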