Pymodbus RTU [Errno 25] Inappropriate ioctl for device - pyserial

I'm trying to get a Modbus RTU client on my Raspberry Pi 4 to talk to the synchronous Modbus server from the pymodbus examples (https://pymodbus.readthedocs.io/en/latest/source/example/synchronous_client.html). I set up the server like this:
# RTU:
StartSerialServer(context, framer=ModbusRtuFramer, identity=identity,
port='/dev/ttyAMA0', timeout=.5, baudrate=9600)
What my client does:
from pymodbus.pdu import ModbusRequest
from pymodbus.client.sync import ModbusSerialClient
from pymodbus.transaction import ModbusRtuFramer
import time
import logging
### Logs
FORMAT = ('%(asctime)-15s %(threadName)-15s '
          '%(levelname)-8s %(module)-15s:%(lineno)-8s %(message)s')
logging.basicConfig(format=FORMAT)
log = logging.getLogger()
log.setLevel(logging.DEBUG)
### Modbus RTU stuff
client = ModbusSerialClient(method='rtu', port='/dev/ttyAMA0', baudrate=9600, timeout=.5, parity='N')
client.connect()
client.write_registers(1, 1, unit=1)
client.close()
What I see in server console:
2020-03-24 00:08:14,189 MainThread DEBUG sync :46 Client Connected [/dev/ttyAMA0:/dev/ttyAMA0]
2020-03-24 00:08:14,190 MainThread DEBUG sync :580 Started thread to serve client
Client logs:
2020-03-24 00:09:53,937 MainThread DEBUG transaction :115 Current transaction state - IDLE
2020-03-24 00:09:53,938 MainThread DEBUG transaction :120 Running transaction 1
2020-03-24 00:09:53,939 MainThread DEBUG transaction :219 SEND: 0x1 0x10 0x0 0x1 0x0 0x1 0x2 0x0 0x1 0x66 0x41
2020-03-24 00:09:53,939 MainThread DEBUG sync :75 New Transaction state 'SENDING'
2020-03-24 00:09:53,940 MainThread DEBUG transaction :238 Transaction failed. ([Errno 25] Inappropriate ioctl for device)
2020-03-24 00:09:53,941 MainThread DEBUG rtu_framer :235 Frame - [b''] not ready
2020-03-24 00:09:53,941 MainThread DEBUG transaction :394 Getting transaction 1
2020-03-24 00:09:53,942 MainThread DEBUG transaction :193 Changing transaction state from 'PROCESSING REPLY' to 'TRANSACTION_COMPLETE'
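The SEND line in the client log can be checked by hand. The sketch below recomputes the Modbus RTU CRC-16 (reflected polynomial 0xA001, initial value 0xFFFF, low byte transmitted first) over the first nine bytes of that frame and compares it with the trailing 0x66 0x41. The function name modbus_crc16 is made up for this example and is not part of pymodbus.

```python
def modbus_crc16(payload: bytes) -> bytes:
    """Return the Modbus RTU CRC-16 of payload as two bytes, low byte first."""
    crc = 0xFFFF
    for byte in payload:
        crc ^= byte
        for _ in range(8):
            # Shift right; apply the reflected polynomial when a 1 bit falls out.
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc.to_bytes(2, "little")


# Frame copied from the SEND line of the client log.
frame = bytes([0x01, 0x10, 0x00, 0x01, 0x00, 0x01, 0x02, 0x00, 0x01, 0x66, 0x41])
body, crc = frame[:-2], frame[-2:]

print("unit id:", body[0])
print("function code:", hex(body[1]))  # 0x10 is Write Multiple Registers
print("start address:", int.from_bytes(body[2:4], "big"))
print("register count:", int.from_bytes(body[4:6], "big"))
print("CRC matches:", modbus_crc16(body) == crc)
```

For this frame the CRC comparison comes out True, so the request itself is framed correctly. That points away from the payload and toward the serial port handling as the place to look.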
I'm really confused by the error message:
[Errno 25] Inappropriate ioctl for device
Up to that point the logs show what I expected (it looks like I can connect, and the SEND bytes are correct), but I can't work out what causes that error. On the server side nothing appears in the terminal window. I used
sudo raspi-config
to disable the login shell and kernel messages on the serial port.
Any idea what the issue could be? Thanks.
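Errno 25 is ENOTTY on Linux: the kernel returns it when a terminal-specific ioctl is issued on a file descriptor whose driver does not support that request. As a rough diagnostic, assuming a Linux host, the sketch below names the errno and checks whether a device path answers one basic terminal ioctl (tcgetattr). The helper names are made up for this example. The log does not show which ioctl pymodbus or pyserial actually issued, so a pass here does not rule out the error.

```python
import errno
import os
import termios


def describe_errno(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    return f"{errno.errorcode.get(code, 'UNKNOWN')}: {os.strerror(code)}"


def accepts_tty_ioctls(path: str) -> bool:
    """Return True if path can be opened and answers tcgetattr()."""
    try:
        # O_NOCTTY: do not make the device our controlling terminal.
        # O_NONBLOCK: do not wait for a carrier signal on open.
        fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    except OSError:
        return False
    try:
        termios.tcgetattr(fd)
        return True
    except termios.error:
        # Raised with errno 25 when the descriptor is not a terminal device.
        return False
    finally:
        os.close(fd)


print(describe_errno(25))
for device in ("/dev/null", "/dev/ttyAMA0"):
    print(device, accepts_tty_ioctls(device))
```

/dev/null is included as a device that is known not to be a terminal, so it gives a reference result to compare against the serial port.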

debugging with OpenOCD reports Protocol error with Rcmd

I'm working on establishing a debug connection to a Renesas RZ/G2L MPU.
My OpenOCD connection appears to launch fine (and I can connect to it afterwards with gdb from the shell using (gdb) target remote localhost:3333):
Open On-Chip Debugger 0.12.0-rc2+dev-00989-g9501b263e-dirty (2022-12-12-17:03)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
G2L
G2L - 0 CA57(s), 2 CA55(s), 0 CA53(s), 0 CR7(s), 1 CM33(s)
Boot Core - CA55
r9a07g044l.cpu
SMP targets: r9a07g044l.a55.0 r9a07g044l.a55.1
init_reset
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : J-Link V11 compiled Dec 5 2022 13:50:41
Info : Hardware version: 11.00
Info : VTarget = 1.812 V
Info : clock speed 4000 kHz
Info : JTAG tap: r9a07g044l.cpu tap/device found: 0x6ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x6)
Info : r9a07g044l.a55.0: hardware has 6 breakpoints, 4 watchpoints
Info : starting gdb server for r9a07g044l.a55.0 on 3333
Info : Listening on port 3333 for gdb connections
Info : starting gdb server for r9a07g044l.m33 on 3334
Info : Listening on port 3334 for gdb connections
Info : gdb port disabled
I'm trying to hook it up with Eclipse to debug Flash Writer, which should allow me to bring up the system. I created a debug configuration, set the init commands under Startup (per instructions from Renesas) and set the launch target to localhost:3333. On launching a debug session from Eclipse, however, I get the following error:
Error in final launch sequence:
Failed to execute MI command:
source /home/mistywest/git/rzg2_flash_writer/gdb_smarc_g2l_flash_writer
Error message from debugger back end:
/home/mistywest/git/rzg2_flash_writer/gdb_smarc_g2l_flash_writer:12: Error in sourced command file:
Protocol error with Rcmd
and on the OpenOCD console:
Info : accepting 'gdb' connection on tcp/3333
Info : r9a07g044l.a55.0 cluster 0 core 0 multi core
r9a07g044l.a55.0 halted in AArch64 state due to debug-request, current mode: EL3H
cpsr: 0x400003cd pc: 0x3a94
MMU: disabled, D-Cache: disabled, I-Cache: enabled
Info : New GDB Connection: 1, Target r9a07g044l.a55.0, state: halted
Warn : Prefer GDB command "target extended-remote :3333" instead of "target remote :3333"
Error: JTAG scan chain interrogation failed: all zeroes
Error: Check JTAG interface, timings, target power, etc.
Error: Trying to use configured scan chain anyway...
Error: r9a07g044l.cpu: IR capture error; saw 0x00 not 0x01
Warn : Bypassing JTAG setup events due to errors
Error: Invalid ACK (0) in DAP response
Info : Deferring arp_examine of r9a07g044l.a55.1
Info : Use arp_examine command to examine it manually!
Info : Deferring arp_examine of r9a07g044l.m33
Info : Use arp_examine command to examine it manually!
Error: Invalid ACK (0) in DAP response
Error: Debug regions are unpowered, an unexpected reset might have happened
Error: JTAG-DP STICKY ERROR
Error: Could not initialize the APB-AP
Info : dropped 'gdb' connection
I'm running this from a Fedora 36 host, if it matters.

Zookeeper error: Exception causing close of session 0x0 due to java.io.IOException: Len error

We have properly configured ZooKeeper and Kafka cluster nodes. A manual test that creates a topic and sends a message on it passes. But when I run a test from a piece of test equipment that creates a topic over the MQTT protocol, I get:
Exception causing close of session 0x0 due to java.io.IOException: Len error 271056900
[myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1008] - Closed socket connection for client /192.18.0.1:15659 (no session established for client).
Can someone give me some hint on how to solve this issue?
It looks like you are exceeding jute.maxbuffer. Try increasing it.
If you are using docker-compose, this helped me:
environment:
  KAFKA_OPTS: -Djute.maxbuffer=500000000
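As a cross-check on the number in the log: ZooKeeper reads the first four bytes of each incoming packet as a big-endian length. The sketch below converts 271056900 back into those four bytes. They come out as 0x10 0x28 0x00 0x04, which is also how an MQTT CONNECT packet begins (packet type 0x10, remaining length 40, then the two-byte length of the protocol name "MQTT"). That suggests, though the post does not confirm it, that the test equipment is sending MQTT traffic straight to port 2181, in which case raising jute.maxbuffer would only hide the problem. The function name is illustrative.

```python
import struct


def len_prefix_bytes(reported_len: int) -> bytes:
    """Return the four big-endian bytes that were parsed as a packet length."""
    return struct.pack(">i", reported_len)


raw = len_prefix_bytes(271056900)
print(raw.hex(" "))  # 10 28 00 04

# Read the same bytes as the start of an MQTT CONNECT packet.
packet_type = raw[0] >> 4        # MQTT control packet type; 1 means CONNECT
remaining_length = raw[1]        # a single length byte only when below 128
name_length = int.from_bytes(raw[2:4], "big")  # length of the protocol name
print(packet_type, remaining_length, name_length)  # 1 40 4
```

If this reading is right, the thing to check is which host and port the MQTT client is configured to use, since it should be talking to an MQTT broker or bridge rather than to ZooKeeper.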

Leader election with Curator and Zookeeper

I am running 3 instances of ZooKeeper and the config is this:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper1
clientPort=2181
maxClientCnxns=1000
server.1=127.0.0.1:2888:3888
server.2=127.0.0.1:2889:3889
server.3=127.0.0.1:2890:3890
I am using the leader election example code given here:
https://git-wip-us.apache.org/repos/asf?p=curator.git;a=tree;f=curator-examples/src/main/java/leader;h=73b547eadb98995c0ccbd06a5b76d0741ffef263;hb=HEAD
The code runs fine with TestingServer but when I change connection string to : "127.0.0.1:2181,127.0.0.1:2182,127.0.0.1:2183", I get the exceptions:
[main-SendThread(127.0.0.1:2183)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 127.0.0.1/127.0.0.1:2183. Will not attempt to authenticate using SASL (unknown error)
[main-SendThread(127.0.0.1:2183)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /127.0.0.1:56111, server: 127.0.0.1/127.0.0.1:2183
[main-SendThread(127.0.0.1:2183)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 127.0.0.1/127.0.0.1:2183, sessionid = 0x3521552283c0000, negotiated timeout = 40000
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
[main-SendThread(127.0.0.1:2183)] INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x3521552283c0000, likely server has closed socket, closing socket connection and attempting reconnect
[main-EventThread] INFO org.apache.curator.framework.imps.EnsembleTracker - New config event received: null
[main-EventThread] ERROR org.apache.curator.framework.imps.CuratorFrameworkImpl - Background exception was not retry-able or retry gave up
java.lang.NullPointerException
at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:163)
at org.apache.curator.framework.imps.EnsembleTracker.access$200(EnsembleTracker.java:48)
at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:134)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:829)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:611)
at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:151)
at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:210)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:528)
[main-EventThread] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
What could be the issue?
I am hitting the same issue. I think it might be related to the ZooKeeper 3.5.1 ClientCnxn. Even after going back to Curator 2.6.0, I still see the same stack trace. A GET_CONFIG event is delivered without any event data.
My stack trace looks like this:
org.apache.curator.framework.imps.CuratorFrameworkImpl: Background exception was not retry-able or retry gave up
! java.lang.NullPointerException: null
! at java.io.ByteArrayInputStream.<init>(ByteArrayInputStream.java:106)
! at org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:163)
! at org.apache.curator.framework.imps.EnsembleTracker.access$200(EnsembleTracker.java:48)
! at org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:134)
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:829)
! at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:611)
! at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:151)
! at org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:210)
! at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:619)
! at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:528)
If you use ZooKeeper 3.5.1, curator-recipes 3.2.1 or later fixes this issue.

Debugging - GNU ARM Bare Metal Development

I am trying to debug a sample code given by Atmel. I have built the program successfully.
For debugging, I am using Eclipse, gdb, JLinkGDBServer and the onboard JTAG.
Although the program can be downloaded to the board and runs well, I can't debug it. Every time I launch a debug session, JLinkGDBServer terminates with an error.
Below are the messages shown in the console each time it terminates:
JLinkGDBServer
SEGGER J-Link GDB Server V4.96g Command Line Version
JLinkARM.dll V4.96g (DLL compiled Feb 6 2015 17:54:32)
-----GDB Server start settings-----
GDBInit file: none
GDB Server Listening port: 2331
SWO raw output listening port: 2332
Terminal I/O port: 2333
Accept remote connection: localhost only
Generate logfile: on
Verify download: on
Init regs on start: on
Silent mode: off
Single run mode: on
Target connection timeout: 5 ms
------J-Link related settings------
J-Link Host interface: USB
J-Link script: none
J-Link settings file: none
------Target related settings------
Target device: Cortex-A5
Target interface: JTAG
Target interface speed: 1000kHz
Target endian: little
Connecting to J-Link...
J-Link is connected.
Firmware: J-Link OB-SAM3U128 V1 compiled Nov 28 2014 10:24:11
Hardware: V1.00
S/N: 480300770
Checking target voltage...
Target voltage: 3.30 V
Listening on TCP/IP port 2331
Connecting to target...
J-Link found 1 JTAG device, Total IRLen = 4
JTAG ID: 0x4BA00477 (Cortex-A5)
Connected to target
Waiting for GDB connection...Connected to 127.0.0.1
Reading all registers
Read 4 bytes @ address 0x00000000 (Data = 0xE59FF070)
Target interface speed set to 1000 kHz
Resetting target
Halting target CPU...
...Target halted (PC = 0x00000000)
PC = 00000000, CPSR = 000001D3 (SVC mode, ARM FIQ dis. IRQ dis.)
R0 = 00000004, R1 = 0031E0C3, R2 = 00016AAD, R3 = 00016965
R4 = 0031FFA0, R5 = C0542A08, R6 = C0512000, R7 = C051DA90
USR: R8 =C051DD80, R9 =410FC051, R10=C0512000, R11 =0031FF94, R12 =003020C0
R13=BEBF5C70, R14=B6F12F1C
FIQ: R8 =9ABE0586, R9 =7E72A55E, R10=73DBFC6B, R11 =4F6717CF, R12 =05EDA809
R13=5AC81462, R14=24683958, SPSR=370D2C67
SVC: R13=0031FF80, R14=00300620, SPSR=000001D3
ABT: R13=C0542B4C, R14=C000DC80, SPSR=A0000193
IRQ: R13=00320000, R14=80000053, SPSR=80000053
UND: R13=C0542B58, R14=C000DB60, SPSR=60000093
Reading all registers
Select auto target interface speed (1000 kHz)
Flash breakpoints enabled
Semi-hosting enabled (VectorAddr = 0x08)
Semihosting I/O set to TELNET and GDB Client
Downloading 15488 bytes @ address 0x00300000 - Verified OK
Writing register (PC = 0x00300080)
GDB closed TCP/IP connection
Connected to 127.0.0.1
Reading all registers
Read 4 bytes @ address 0x00300080 (Data = 0xF1080100)
Resetting target
Writing register (PC = 0x00300000)
Writing register (PC = 0x00300000)
Starting target CPU...
GDB closed TCP/IP connection
arm-none-eabi-gdb
Warning: the current language does not match this frame.
The target endianness is set automatically (currently little endian)
Semihosting and SWV
SEGGER J-Link GDB Server V4.96g - Terminal output channel
Connection closed by the GDB server.
The following is my debugging configuration:
Under the Run Commands, the commands in the box are as below:
target remote localhost:2331
monitor reset
load
mon reg pc = 0x300000
mon reg pc = 0x300000
end
I don't know what the root cause is. I suspect that arm-none-eabi-gdb is what causes JLinkGDBServer to terminate with an exit code of -1.
Please help.
Edit 1
FYI, I am using SAMA5D3x-EK development board.
It is difficult to say, but it looks as if the processor is taking an exception while loading your code. I suggest verifying the load addresses of the sections in your ELF file and reviewing the linker settings of your project.
Hope it helps...
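One way to act on the suggestion above is to list the load addresses recorded in the ELF program headers and compare them with the RAM region the debugger downloads into (0x00300000 in the J-Link log). The sketch below parses a 32-bit little-endian ELF using only the standard library. The function names are made up for this example, and any region size passed to outside_region is an assumption you would take from your own linker script or datasheet, not a value from this post. arm-none-eabi-readelf -l or arm-none-eabi-objdump -h show the same information.

```python
import struct

# Layouts from the ELF32 specification, little-endian.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")  # 52-byte file header
ELF32_PHDR = struct.Struct("<IIIIIIII")            # 32-byte program header
PT_LOAD = 1


def load_segments(image: bytes):
    """Return (paddr, vaddr, memsz) for each PT_LOAD segment in the image."""
    fields = ELF32_HEADER.unpack_from(image, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a 32-bit little-endian ELF image")
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]

    segments = []
    for index in range(e_phnum):
        (p_type, _offset, vaddr, paddr,
         _filesz, memsz, _flags, _align) = ELF32_PHDR.unpack_from(
            image, e_phoff + index * e_phentsize)
        if p_type == PT_LOAD:
            segments.append((paddr, vaddr, memsz))
    return segments


def outside_region(segments, base: int, size: int):
    """Return the segments whose physical range is not fully inside the region."""
    return [seg for seg in segments
            if seg[0] < base or seg[0] + seg[2] > base + size]
```

For example, load_segments(open("app.elf", "rb").read()) returns the loadable segments, and outside_region(segments, 0x00300000, size) lists any that fall outside a RAM window of the given size. Here "app.elf" and size are placeholders.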

JBoss 4.2.2 nodes start to cluster then suspect each other

I have a website running with JBoss 4.2.2 on an existing Red Hat server. I'm setting up a second server so as to have a clustered pair (which will then be load-balanced). However, I can't get them to cluster successfully.
The existing server starts up JBoss with:
run.sh -c default -b 0.0.0.0
(I know the 'default' configuration doesn't support clustering out of the box - I'm using a modified version of it which includes clustering support.)
When I start the second JBoss instance with the same command, it forms its own cluster without noticing the first. Both use the same partition name and multicast address and port.
I tried the McastReceiverTest and McastSenderTest programs to check that the machines could communicate over multicast; they could.
I then noticed the info at http://docs.jboss.org/jbossas/docs/Clustering_Guide/beta422/html/ch07s07s07.html, saying that JGroups cannot bind to all interfaces, and instead binds to the default interface; so presumably it was binding to 127.0.0.1, and thereby not getting the messages through. So instead I set the instances to tell JGroups to use the internal IPs:
run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.131
run.sh -c default -b 0.0.0.0 -Djgroups.bind_addr=10.51.1.141
(.131 is the existing server, .141 is the new server).
The nodes now notice each other and form a cluster - at first. However, while trying to deploy the .ear, the server log says this:
2010-08-07 22:26:39,321 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:45,412 WARN [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48733; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:46294 (number=0)
2010-08-07 22:26:49,529 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=10.51.1.141:60365, coord_addr=10.51.1.141:60365, is_server=true]]
2010-08-07 22:26:52,092 WARN [org.jboss.cache.TreeCache] replication failure with method_call optimisticPrepare; id:18; Args: ( arg[0] = GlobalTransaction:<10.51.1.131:46294>:5421085 ...) exception org.jboss.cache.lock.TimeoutException: failure acquiring lock: fqn=/Yudu_ear,Yudu-ejb_jar,Yudu-ejbPU/com/yudu/ejb/entity, caller=GlobalTransaction:<10.51.1.131:46294>:5421085, lock=read owners=[GlobalTransaction:<10.51.1.131:46294>:5421081] (activeReaders=1, activeWriter=null, waitingReaders=0, waitingWriters=1, waitingUpgrader=0)
...and the .ear fails to deploy.
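One detail in the excerpt above is easy to miss: the FD messages refer to two different ports on 10.51.1.131 (46294 and 48733). A small sketch like the following groups FD log lines by peer address so that such differences stand out. The regular expression and function name are illustrative and only cover the line shapes shown in this post. What the differing ports mean (for instance, separate JGroups channels) is not something the log settles.

```python
import re
from collections import defaultdict

# Capture the message text and the first ip:port that follows the FD logger name.
FD_LINE = re.compile(
    r"\[org\.jgroups\.protocols\.FD\] (?P<msg>.*?)(?P<peer>\d+\.\d+\.\d+\.\d+:\d+)"
)

# Lines copied from the server log in the post.
SAMPLE = """\
2010-08-07 22:26:39,321 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:45,412 WARN [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48733; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:46294 (own address=10.51.1.141:47629)
2010-08-07 22:26:49,324 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:46294 (number=0)
"""


def fd_events_by_peer(log_text: str) -> dict:
    """Map each peer address to the list of FD messages that mention it first."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = FD_LINE.search(line)
        if match:
            events[match.group("peer")].append(match.group("msg").strip())
    return dict(events)


for peer, messages in fd_events_by_peer(SAMPLE).items():
    print(peer, messages)
```

Running the same function over the full server.log from both nodes would show whether each side's heartbeats and SUSPECT messages refer to the address the other side actually reports as its own.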
If I change CacheMode in ejb3-entity-cache-service.xml from REPL_SYNC to LOCAL, the .ear deploys correctly, although of course the entity cache replication then doesn't happen. However, the log still shows interesting signs of the same problem.
It looks like:
first the new node finds the existing one and forms a cluster
then the FD checks fail, and after a set number of failures the new node splits off from the cluster and forms its own cluster of one
then it finds it again, re-clusters and this time the FD checks work.
Relevant bits of the log file:
2010-08-07 23:47:07,423 INFO [org.jgroups.protocols.UDP] socket information: local_addr=10.51.1.141:35666, mcast_addr=228.1.2.3:45566, bind_addr=/10.51.1.141, ttl=2 sock: bound to 10.51.1.141:35666, receive buffer size=131071, send buffer size=131071 mcast_recv_sock: bound to 0.0.0.0:45566, send buffer size=131071, receive buffer size=131071 mcast_send_sock: bound to 10.51.1.141:59196, send buffer size=131071, receive buffer size=131071
2010-08-07 23:47:07,431 DEBUG [org.jgroups.protocols.UDP] created unicast receiver thread
2010-08-07 23:47:09,445 DEBUG [org.jgroups.protocols.pbcast.GMS] initial_mbrs are [[own_addr=10.51.1.131:48888, coord_addr=10.51.1.131:48888, is_server=true]]
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] election results: {10.51.1.131:48888=1}
2010-08-07 23:47:09,446 DEBUG [org.jgroups.protocols.pbcast.GMS] sending handleJoin(10.51.1.141:35666) to 10.51.1.131:48888
2010-08-07 23:47:09,751 DEBUG [org.jgroups.protocols.pbcast.GMS] [10.51.1.141:35666]: JoinRsp=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666] [size=2]
2010-08-07 23:47:09,752 DEBUG [org.jgroups.protocols.pbcast.GMS] new_view=[10.51.1.131:48888|61] [10.51.1.131:48888, 10.51.1.141:35666]
...
2010-08-07 23:47:10,047 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 2
2010-08-07 23:47:10,047 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 1
...
2010-08-07 23:47:20,034 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,037 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:30,038 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=0)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:47:40,040 DEBUG [org.jgroups.protocols.FD] heartbeat missing from 10.51.1.131:48888 (number=1)
...
2010-08-07 23:48:19,758 WARN [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:48:20,054 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48888 (own address=10.51.1.141:35666)
2010-08-07 23:48:20,055 DEBUG [org.jgroups.protocols.FD] [10.51.1.141:35666]: received no heartbeat ack from 10.51.1.131:48888 for 6 times (60000 milliseconds), suspecting it
2010-08-07 23:48:20,058 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[10.51.1.131:48888]] to group
...
2010-08-07 23:48:21,691 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing 10.51.1.131:48888 from received_msgs (not member anymore)
2010-08-07 23:48:21,691 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (127.0.0.1:1099) received membershipChanged event:
2010-08-07 23:48:21,691 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 0 ([])
2010-08-07 23:48:21,691 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
2010-08-07 23:48:21,691 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([127.0.0.1:1099])
...
2010-08-07 23:49:59,793 WARN [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:09,796 WARN [org.jgroups.protocols.FD] I was suspected by 10.51.1.131:48888; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
2010-08-07 23:50:19,144 DEBUG [org.jgroups.protocols.FD] Recevied Ack. is invalid (was from: 10.51.1.131:48888),
...
2010-08-07 23:50:21,791 DEBUG [org.jgroups.protocols.pbcast.GMS] new=[10.51.1.131:48902], suspected=[], leaving=[], new view: [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
...
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] view=[10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,792 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=10.51.1.141:35666] view is [10.51.1.141:35666|63] [10.51.1.141:35666, 10.51.1.131:48902]
2010-08-07 23:50:21,822 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 63, delta: 1) : [127.0.0.1:1099, 127.0.0.1:1099]
2010-08-07 23:50:21,822 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] membership changed from 1 to 2
...
2010-08-07 23:50:31,825 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to 10.51.1.131:48902 (own address=10.51.1.141:35666)
2010-08-07 23:50:31,832 DEBUG [org.jgroups.protocols.FD] received ack from 10.51.1.131:48902
But I'm at a loss to understand why the FD checks fail the first time round; and although it eventually seems to cluster with the other node, the initial failure seems to be enough to mess up the deployment when it tries to share entity state, and thereby prevent it from actually working in a useful way.
If anyone can shed light on this I'll be hugely grateful!
I think that before you move on to JBoss 4.2.3 (which is probably a good place to be eventually) or building a new configuration (I agree with @skaffman about pruning being easier than adding), you might want to try the following:
On 10.51.1.131:
run.sh -c default -b 10.51.1.131 -Djgroups.bind_addr=10.51.1.131
On 10.51.1.141:
run.sh -c default -b 10.51.1.141 -Djgroups.bind_addr=10.51.1.141
According to all the documentation I can find on this, the -b parameter is the server instance bind address, and having them be different might be creating some significant schizophrenia for JGroups. I had a four-server clustered environment working successfully for over three years, and that was part of the recommended configuration from RH/JBoss (we had a support contract, and got help from Bela Ban).