Why is a FIX connection dropping heartbeats when both endpoints are on the same server? - quickfix

We have two applications using QuickFIX engine, both are running in the same machine.
Sometimes we see that the session ends due to lack of heartbeats.
How can it be since both are running on the same machine?

FIX heartbeat mechanism has nothing to do with the fact where applications communicating using FIX protocol run. If you see the session being dropped due to lack of heartbeat then you have to determine which session did not send heartbeat (it will also fail to respond to «Test Request» message, if any) and why did that happen. Possible reasons are:
Server and client have different heartbeat interval settings and server does not honor client's heartbeat interval (field #108 in «Logon» message) and test request/response logic is broken (or turned off).
Underlying transport errors (i.e. TCP/IP errors or UDP packet drops).
Other software/hardware bugs.
Something else.
Hope it helps. Good Luck!

Related

Why is my TCP socket showing connected but not responding?

I have a program using a bi-directional TCP socket to send messages from the host PC to a VLinx ethernet-to-serial converter and then on to a PLC via RS-232. During heavy traffic the socket will intermittently stop communicating although all soft tests of the connection show that it is connected, active and writeable. I suspect that something is interrupting the connection causing the socket to close with out FIN/ACK. How can I test to see where this disconnect might be occuring?
The program itself is written in VB6 and uses Catalyst SocketTools/SocketWrench as opposed to the standard Winsock library. The methodology, properties and code seem to be sound since the same setup works reliably at two other sites. It's just this one site in particular where this problem occurs. It only happens during production when there is traffic on the network and can lose connection anywhere between 20 - 100 times per 10-hour day.
There are redundant tests in place to catch this loss of communication and keep the system running. We have tests on ACK messages, message queue size, time between transmissions (tokens on 2s interval), etc. Typically, the socket will not be unresponsive for more than 30 seconds before it is caught, closed and re-established which works properly >99% of the time.
Previously I had enabled the SocketTools logging capabilities which did not capture any relevant information. Most recently I have tried to have the system ping the VLinx on the first sign of a missed message (2.5 seconds). Those pings have always been successful, meaning that if there is a momentary loss of connection at a switch or AP it does not stay disconnected for long.
I do not have access to the network hardware aside from the PC and VLinx that we own. The facility's IT is also not inclined to help track these kinds of things down because they work on a project-based model.
Does anyone have any suggestions what I can do to try and determine where the problem is occurring so that I can then try to come up with a permanent solution to this issue rather than the band-aid of reconnecting multiple times per day?
A tool like Wireshark may be helpful in seeing what's going on at the network level. The logging facility in SocketTools/SocketWrench can only report what's going on at the API level, and it sounds like whatever the underlying problem is occurs at a lower level in the TCP stack.
If this is occurring after periods of relative inactivity, followed by a burst of activity, one thing you could try doing is enabling keep-alive and see if that makes any difference.

How to detect when socket connection is lost?

I have a script (I don't have the code example here at the moment but I used IO::Async) which connects to socket on a remote server and listens. Client usually just listens for new data.
Problem is that the client is not able to detect if network problems occur and the socket connection is gone.
I used IO::Async and I also tried it with IO::Socket. Handle is always "connected" after the initial connection is established.
If the network connection is established again the socket connection is naturally still lost because the script has no idea that it should reconnect.
I was thinking to create some kind of "keepAlive" which "pings" (syswrite) the socket every X seconds (if nothing new came through socket) to check whether the connection is still there.
Is this the correct way to do it or is there maybe an another more creative or cleaner solution?
You can set the SO_KEEPALIVE socket option which, for TCP, sends periodic keepalive messages, and may help detect this condition. If this is detected, you will be delivered an EOF condition (most likely causing the containing IO::Async::Stream to fire on_read_eof).
For a better solution you might consider some sort of application-level keepalive message, such as IRC's PING command.
The short answer is there is no default way to automatically detect a dropped socket in perl.
Your approach of pinging would probably work pretty well; you could run a continuous thread in the background that sends ping requests and if it doesn't receive a response the main thread can be notified and a reconnect should be issued.
If you want to get messy you can work with select() to detect keep alive messages; however this may require some OS configuration depending upon your platform.
See this thread for more details: http://www.perlmonks.org/?node_id=566568

Dropping a streaming HTTP connection as soon as possible after losing connection

So what we're trying to achieve is maintaining a vast number of concurrent connections from mobile devices to our Erlang HTTP server. Mobile devices of course can have have pretty intermittent connections, so we're looking to drop dead connections as soon as possible to avoid their overhead.
Now, I'm not sure at what level we should be detecting dead connections. TCP has keepalive packets, which require an ACK. So ideally we'd send a keepalive packet ever 15 seconds, and if we didn't receive the ACK within the next 15 seconds then we'd drop the connection. However, I've no idea if this is even possible in Erlang. Also, I think there's the possibility that some NATs, wi-fi routers and mobile networks are ACKing the keepalives for a certain amount of time, correct me if I'm wrong. Is that the case, and if so is there any TCP-level alternative way of doing 'heartbeats'?
We've also tried an application-level heartbeat - sending a \n down the HTTP stream. However, even with all applicable Erlang options set, including send_timeout, we're not getting any error for about 5 minutes under certain circumstances, such as, say, the mobile device straying too far from its wi-fi router.
How best can we implement a streaming HTTP connection that the server will drop as soon as possible after losing contact? Any help'd be much appreciated!
You can add a specific watchdog for HTTP connection. Watchdog will have configurable timeout that will be reset after each operation (read or write) on connection. And if there were no operations on socket within specified timeout - connection is closed.
This approach will eliminate the problem of stale connections (connections perfectly healthy but without any I/O activity). And if clients is out of coverage - connection will last only up to specified timeout. Also no keep-alive mechanism is needed when using watchdog approach.
The only drawback is that server will not detect broken connections immediately but will instead wait timeout specified in connection watchdog.
Isac's comment answered it for me - configuring the socket keep alive timeout at the machine level.
See http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

JMX RMI agent fault tolerance mechanisms

I am using the JMX-RMI agent for message passing. I have a java program which sends a message having a name/id to a set of listener/listeners.Based on the message received by the listeners the client side programs behave accordingly.This piece works fine but I would like to know what kind of fault tolerance is built in into the JMX RMI agent.
If the listener stops accidentally, does JMX restart it or logs the error somewhere,what if the message queue on either side is full. Any documentation which explains the underlying architecture of JMX RMI or the in built fault tolerance mechanism will be appreciated. If it doesn't have any fault tolerance mechanisms, what would be a good way of doing it.
Thanks Much
I am assuming your client side listeners are using the standard javax.management.remote connectors. Without some customization, I would say you can implement some straightforward Fault Detection. For Fault Tolerance you're probably looking at some sort of clustering solution.
There are two layers of connectivity you need to be concerned about:
The MBeanServerConnection itself. In other words, if the whole server side JVM terminates, your client side processes need to know.
While the server JVM and the subsidiary MBeanServerConnection may continue to be available, the "hosted", the listener/client message forwarder service itself may stop/fail/stall.
For #1, the client processes can register a NotificationListener with the JMXConnector using the addConnectionNotificationListener method. Your local connection will then emit JMXConnectionNotifications on all the following events:
A new client connection has been opened.
A client connection has been closed.
A client connection has failed unexpectedly.
A client connection has potentially lost notifications. This notification only appears on the client side.
This way, your clients will know when a connection to the server has been established and lost.
For #2, it's a bit more specific to your application, but perhaps you can adapt a simple pattern like this:
When your listener/forwarder service starts, emit a start-notification. When it stops,emit a stopped-notification. The two categories of listeners that would register for these notifications would be:
The clients, so they know the service has started/stopped.
A server side "watcher" that can listen for a "stop" and restart the service.
Is that more-or-less what you were thinking of ?

Best socket options for client and sever that continuously transfer data

I am using Java (although I think the socket options is implement in most languages) to implement a client and server. The server sends data to the client for processing which the client acknowledges. On another port the client then sends the results of the processing back to the server. When it comes to options such as
SO_LINGER
SO_KEEPALIVE
SO_NODELAY
SO_REUSEADDRESS
SO_SENDBUFFER
SO_RECBUFFER
TCP_NODELAY
We have noticed that the connection between the client and server occasionally breaks. There will be a timeout on the send or the receive. When this happens will kill the socket and open a new one to continue.
What would be the best options to set in terms of the above scenario and is there anything that we could do from our side (programmatically or options-wise) to try minimize the amount of times the connection is dropped. We are using normal TCP/IP.
UPDATE:
The bounty on this ends soon. I haven't had a satisfactory answer yet so it is still open. I think everyone is missing the point of the quest. What is the best practice with regards to the options above for sockets that continuously chat. I have already got a ping packet in that if there is no work to be done (hardly ever the scenario) the normal message is sent with no inner elements so there is always processing.
Strictly speaking, you don't need any of these socket options:
* SO_LINGER
You need to set SO_LINGER only if your application still has outstanding packets to send when close(2) or shutdown(2) has been called. Not really applicable for your application.
* SO_KEEPALIVE
Sending keepalive-pings every two hours would really only help very long-lived but -very- quiet connections going through stateful firewalls with very long session timeouts. (Two hours between pings is entirely too long to be practical in today's Internet.)
* SO_NODELAY
This (presumably an alias for TCP_NODELAY) disables Nagle's algorithm, which is just a small-packet-avoidance problem. Perhaps Nagle is getting in the way in your application, but it takes special sequences of packets to introduce 500ms delays into processing; it never just hangs connections.
* SO_REUSEADDRESS
Useful for all 'servers' that listen on well-known port numbers; use on 'clients' is almost always covering up some bug or other, but it is sometimes necessary if requests must come from a well-known port number.
* SO_SENDBUFFER
* SO_RECBUFFER
These buffer sizes influence the kernel-side buffer sizes maintained for receiving or sending data while your program (receive buffer) or the socket (send buffer) isn't yet ready to accept more data. If these are set too small, your application might not transfer data as smoothly as possible, reducing throughput, but it should not lead to any stalls if these are set smaller than optimal. Of course, too large may put unreasonable demands on kernel memory, but there should be a reasonable system-wide maximum allowed size.
* TCP_NODELAY
Disables Nagle. Not likely to do more than introduce 500ms delays if your application sends multiple small packets before attempting a blocking read.
Really, you shouldn't need to set any socket options.
Can you distill your code into something that could be pasted here and tested or inspected? I'm used to TCP sessions surviving for days or weeks without trouble, so this is pretty surprising.
First I think that this page is relevant, regarding half-open connections.
http://nitoprograms.blogspot.com/2009/05/detection-of-half-open-dropped.html
That being said, TCP is designed to hide connection problems, so you may often find yourself in cases where the connection is broken, but neither side thinks it is. You have addressed this partially by using timeouts and taking that as a sign the connection is broken.
Since you are writing the client and server, I would avoid relying on TCP to tell you when the connection is broken altogether. I would just have the server also acknowledge the receipt of the result from the client. Then both sides will expect immediate responses to their messages, and you can track which messages have been ack'd and set an appropriately small timeout for receiving the ack. This is not a timeout on the send or receive, but a timeout on the time between sending a message and receiving the ack for that message. Then you can set the timeout appropriately depending on the quality of your connection (e.g. very small if you are running on loopback, but large if running over wireless with a weak signal).
Regarding the options you list, you will want to use SO_REUSEADDRESS so that you won't be prevented from reopening the socket, for example if it hasn't finished closing from a previously killed process.
You probably have, but it is best to check the obvious....
Have you verified that it IS the socket that is timing out, and not your code? Sockets are fairly stable, and while there might be an issue somewhere, it seems more likely that it is in your code. I would use logs, timestamps, and synchronised clocks to be sure.
There may be an issue that you genuinely DO take a long time to do the calculation, so maybe adding a 'I'm still thinking about it' message to your protocol that gets sent regularly, to keep the connection alive?
Of course networks will drop out from time to time regardless of what you do, and it sounds like you are already handling that case nicely.
try these options
SO_LINGER - for specyfying when the Socket close s called while some unsent data in the queue
TCP_NODELAY - For non blocking datat transfer
I would strongly encourage you to use a ping/echo model between client and server, so that if no data is sent for x seconds a ping message needs to be send. A typical reason for a break might be a firewall, which shuts down socketss because of inactivity.
The typical issue where the TCP model fails are physical problems e.g. a pulled/broken cable and hangs on one side, where technically someone is listening until a queue overrun kicks in (which might never happen given your amount of data).
What are the chances the connection is going through a NAT firewall somewhere along the way? Stateful firewalls maintain a table of open connections so that packets belonging to an allowed connection can quickly pass through the system, without forcing firewall admins to write overly-complex rule sets.
The downside is that this table can grow immensely large, so it must be pruned as connections are closed or as they appear to have simply grown stale and died quietly. A connection that has gone silent for 20 minutes is usually quiet enough to reaped. (Which is really very quick, as the TCP KEEPALIVE is typically two hours, making it nearly useless in the face of NAT firewalls.)
So: is this going through a NAT firewall? Is the connection quiet for long stretches? If so, add a ping/pong to your protocol, and fire it every few minutes.