Does the goinstant connection automatically handle reconnection in case of disconnection?
I can't find any indication of this in the docs.
Yes, GoInstant reconnects automatically, but that appears to be a hole in our documentation.
Internally, the GoInstant client implements a capped Fibonacci back-off sequence (in milliseconds):
[100,100,200,300,500,800,1300,2100,3000,3000,3000,3000,3000,3000,3000,3000,3000,3000,3000,3000]
Each time the connection drops, that sequence of reconnects is attempted. Once it reaches the end of that sequence (approx 41 seconds later), the connection is considered dead and you get a goinstant.ConnectionError object passed via the .on('error') event of the Connection object.
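For reference, that capped Fibonacci progression is easy to reproduce if you want similar behaviour in your own code. A generic sketch (illustrative only, not GoInstant's actual source):

    import java.util.ArrayList;
    import java.util.List;

    // Capped Fibonacci back-off: 100, 100, 200, 300, 500, ... capped at 3000 ms.
    // With base=100, cap=3000 and 20 attempts this yields the sequence above (~41 s total).
    static List<Long> fibonacciBackoff(long baseMs, long capMs, int attempts) {
        List<Long> delays = new ArrayList<>();
        long previous = 0L;
        long current = baseMs;
        for (int i = 0; i < attempts; i++) {
            delays.add(Math.min(current, capMs));
            long next = previous + current;
            previous = current;
            current = next;
        }
        return delays;
    }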
The error classes are documented here: https://developers.goinstant.com/v1/javascript_api/connection/errors.html
Context
I'm developing a REST API that, as you might expect, is backed by multiple external cross-network services, APIs, and databases. It's very possible that a transient failure will be encountered at some point, in which case the operation should be retried. My question is: during that retry operation, how should my API respond to the client?
Suppose a client is POSTing a resource, and my server encounters a transient exception when attempting to write to the database. Using a combination of the Retry Pattern perhaps with the Circuit Breaker Pattern, my server-side code should attempt to retry the operation, following randomized linear/exponential back-off implementations. The client would obviously be left waiting during that time, which is not something we want.
Questions
Where does the client fit into the retry operation?
Should I perhaps provide an isTransient: true indicator in the JSON response and leave the client to retry?
Should I leave retrying to the server and respond with a message and status code indicative that the server is actively retrying the request and then have the client poll for updates? How would you determine the polling interval in that case without overloading the server? Or, should the server respond via a web socket instead so the client need not poll?
What happens if there is an unexpected server crash during the retry operation? Obviously, when the server recovers, it won't "remember" the fact that it was retrying an operation unless that fact was persisted somewhere. I suppose that's a non-critical issue that would just cause further unnecessary complexity if I attempted to solve it.
I'm probably over-thinking the issue, but while there is a lot of documentation about implementing transient exception retry logic, seldom have I come across resources that discuss how to leave the client "pending" during that time.
Note: I realize that similar questions have been asked, but my queries are more specific, for I'm specifically interested in the different options for where the client fits into a given retry operation, how the client should react in those cases, and what happens should a crash occur that interrupts a retry sequence.
Thank you very much.
There are some rules for retry:
always create an idempotency key so you can recognize that a request is a retry of an earlier operation.
if your operation is complex and you want to wrap a REST call with retry logic, you must ensure that duplicate requests cause no side effects (resume from the point of failure; don't re-execute the steps that already succeeded).
Personally, I think the client should not know that you are retrying something, and of course isTransient: true should not be part of the resource.
Warning: before adding a retry policy to anything, you must check its side effects; putting a retry policy everywhere is bad practice.
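To make the idempotency-key rule concrete, here is a rough sketch (class and method names are made up for illustration, and a real service would use a durable store rather than an in-memory map):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // The client generates an Idempotency-Key per logical operation and re-sends it on
    // every retry; the server records the first outcome under that key and replays it
    // for duplicates, so retries cause no additional side effects.
    public class IdempotentCreateHandler {
        private final Map<String, String> completed = new ConcurrentHashMap<>();

        public String handleCreate(String idempotencyKey, String requestBody) {
            // computeIfAbsent runs the create at most once per key, even under concurrency.
            return completed.computeIfAbsent(idempotencyKey, key -> createResource(requestBody));
        }

        private String createResource(String body) {
            // Placeholder for the actual database write / downstream call (retried with back-off).
            return "created:" + body.hashCode();
        }
    }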
We use Spring integration for TCP socket communication with the hardware.
The client would be sending a sequence number to uniquely identify a message.
My requirement is to store these sequence numbers, which are part of the socket message, and validate that no sequence number is repeated.
I went through IdempotentReceiver, and it sounds like what I want.
But I need a durable and fast mechanism to store them, so the state survives an unexpected shutdown of the service, while still using an in-memory cache for retrieving the latest sequence number.
Thank you in advance!
You can use PropertiesPersistingMetadataStore for the idempotent receiver:
The PropertiesPersistingMetadataStore is backed by a properties file and a PropertiesPersister.
By default, it only persists the state when the application context is closed normally. It implements Flushable so you can persist the state at will, by invoking flush().
See more in its JavaDocs.
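A rough sketch of using it by hand for sequence-number de-duplication (the directory and the surrounding class are placeholders; in a real application you would wire this as a bean, typically behind an IdempotentReceiverInterceptor):

    import org.springframework.integration.metadata.PropertiesPersistingMetadataStore;

    public class SequenceNumberDeduplicator {
        private final PropertiesPersistingMetadataStore store = new PropertiesPersistingMetadataStore();

        public SequenceNumberDeduplicator(String baseDirectory) {
            store.setBaseDirectory(baseDirectory);   // e.g. "/var/lib/myapp/metadata" (placeholder)
            try {
                store.afterPropertiesSet();          // loads any previously persisted state
            } catch (Exception e) {
                throw new IllegalStateException("Could not initialize metadata store", e);
            }
        }

        // Returns true the first time a sequence number is seen, false for repeats.
        public boolean accept(String sequenceNumber) {
            if (store.get(sequenceNumber) != null) {
                return false;                        // duplicate message
            }
            store.put(sequenceNumber, "seen");
            store.flush();                           // persist now, not only on context close
            return true;
        }
    }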
I looked through the FIX v4.2 spec, and it is not clear to me what the expected behavior should be when the TCP connection is lost in the middle of a session.
More specifically, suppose the current sequence number is 100 and at this point the TCP connection is lost. When either side tries to resume the session, should it re-send message number 100, or start a new session with a logon?
In describing FIX sessions, the spec says one session has one logon and one logout, but can span multiple physical connections. This leads me to think that when the TCP connection is lost, the resumption should not start with a logon message, but I am not positive about that.
Thanks in advance!
The FIX protocol does not define anything related to the transport layer. There were some documents on the official web site that suggest how it can be implemented on top of this or that transport, but they are only suggestions.
Therefore, the expected behavior in case of a TCP/IP disconnect depends on the implementation. For instance, it is possible to have a system that does not care about TCP/IP disconnects at all, which would make those details irrelevant. In that case, the expected behavior would be to continue sending and receiving messages after the connection is re-established, and of course to proceed with a "recovery" of lost messages, if any. In reality, though, I have never seen a system like that.
In practice, all systems treat TCP/IP disconnects as an implicit loss of the session and expect clients to send a logon upon re-connect.
When logging in, there are two options: a re-connecting session may send the next outgoing sequence number, or it may ask the server to reset the sequence (to 1). In the first case, the server side may send a logon acknowledgement if the sequence number is greater than or equal to what it expected, or close (or even reject) the session if the received sequence number is less than expected. Additionally, if the sequence number was greater than expected, the server will issue a resend request for the missed messages. The client session monitors the server's sequence numbers as well, and needs to request a re-transmission if it detects a gap (a received sequence number greater than expected). In the second case, if the server supports sequence reset, both incoming and outgoing sequences are reset to 1 and no messages are recovered.
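In rough pseudocode, the acceptor-side decision described above looks something like this (the field and method names are invented for illustration and are not any particular engine's API):

    // Illustrative only: what an acceptor typically does with the sequence number on a logon.
    class AcceptorLogonHandling {
        int expectedIncomingSeqNum;

        void onLogon(int receivedSeqNum, boolean resetRequested /* tag 141=Y */) {
            if (resetRequested) {
                expectedIncomingSeqNum = 1;      // both sides restart at 1, nothing is recovered
                sendLogonAck();
            } else if (receivedSeqNum >= expectedIncomingSeqNum) {
                sendLogonAck();
                if (receivedSeqNum > expectedIncomingSeqNum) {
                    // Gap detected: ask the counterparty to re-transmit the missed messages.
                    sendResendRequest(expectedIncomingSeqNum, receivedSeqNum - 1);
                }
            } else {
                sendLogout("MsgSeqNum too low, expecting " + expectedIncomingSeqNum);
            }
        }

        void sendLogonAck() { /* engine-specific */ }
        void sendResendRequest(int from, int to) { /* engine-specific */ }
        void sendLogout(String reason) { /* engine-specific */ }
    }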
In your case, if the connection is lost after sending a message with sequence number 100, the client would have to re-connect and send a logon with sequence number 101, and proceed from there. Alternatively, it could connect and reset the sequence, in which case some messages might get lost.
Also, don't forget to check the specifics of the venue you connect to. There can be very weird details that are not specified by the FIX protocol at all, or that even go against it. For instance, ICE is one of the silliest exchanges in this regard: it doesn't allow re-connecting within the first 15 seconds, and then, if clients cannot connect for 30 seconds, they should switch to a failover server. If failover happens, the sequence numbers are not kept intact, and clients are left with no choice but to reset them.
Hope it makes things a bit clearer for you. Good Luck!
If the transport layer is TCP/IP, I would expect the session initiator to:
Re-establish a socket connection
Send a new logon message
The sequence number to use on the logon message depends on the type of session and what has been agreed with the FIX session acceptor (see the spec for details). For sessions where there is no value in replaying any lost messages, e.g. market data feeds where the prices would be stale, it makes sense to send a logon message with sequence number 1 and tag 141=Y set (to reset the sequence numbers). For an orders session, where message replay might be required, the session initiator should generally log on with a sequence number one greater than the last message sent (and expect a logon response from the FIX session acceptor with a sequence number one greater than the last message received).
Unless you really need the message replay, it is cleaner and easier to reset the sequence numbers each time upon logon. This obviously depends on the FIX session acceptor (FIX server) supporting this. For things like STP feeds, I've found this to be far more reliable, and it is generally better for the application protocol to provide application-level replay facilities rather than relying on the brittleness of FIX session replay.
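With an engine such as QuickFIX/J, for example, that reset-on-logon behaviour is usually just configuration (a ResetOnLogon=Y session setting) and the engine builds the logon for you; built by hand, the message described above would look roughly like this (illustrative sketch):

    import quickfix.field.EncryptMethod;
    import quickfix.field.HeartBtInt;
    import quickfix.field.MsgSeqNum;
    import quickfix.field.ResetSeqNumFlag;
    import quickfix.fix42.Logon;

    // A FIX 4.2 Logon that asks the counterparty to reset sequence numbers:
    // 35=A, 34=1, 98=0 (no encryption), 108=30, 141=Y
    Logon logon = new Logon(new EncryptMethod(EncryptMethod.NONE_OTHER), new HeartBtInt(30));
    logon.set(new ResetSeqNumFlag(true));           // tag 141=Y
    logon.getHeader().setField(new MsgSeqNum(1));   // sequence restarts at 1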
While reading ZooKeeper's recipe for locks, I got confused. It seems that this recipe for distributed locks cannot guarantee that "at any snapshot in time no two clients think they hold the same lock". But since ZooKeeper is so widely adopted, if there were such a mistake in the reference documentation, someone would have pointed it out long ago, so what did I misunderstand?
Quoting the recipe for distributed locks:
Locks
Fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock. These can be implemented using ZooKeeper. As with priority queues, first define a lock node.
Call create( ) with a pathname of "locknode/guid-lock-" and the sequence and ephemeral flags set.
Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect).
If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number.
If exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
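For reference, a condensed Java sketch of those five steps using the plain org.apache.zookeeper client (error handling and the GUID prefix omitted; this is the naive recipe as written, not production code):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NaiveZkLock {
        private final ZooKeeper zk;
        private final String lockRoot;   // e.g. "/locknode"
        private String myNode;

        public NaiveZkLock(ZooKeeper zk, String lockRoot) {
            this.zk = zk;
            this.lockRoot = lockRoot;
        }

        public void lock() throws KeeperException, InterruptedException {
            // Step 1: create an ephemeral, sequential child node.
            myNode = zk.create(lockRoot + "/guid-lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            String mine = myNode.substring(myNode.lastIndexOf('/') + 1);
            while (true) {
                // Step 2: list children without a watch (avoids the herd effect).
                List<String> children = zk.getChildren(lockRoot, false);
                Collections.sort(children);
                int index = children.indexOf(mine);
                if (index == 0) {
                    return;                          // Step 3: lowest sequence number, lock held
                }
                // Steps 4-5: watch only the next-lowest node and wait for it to disappear.
                String previous = lockRoot + "/" + children.get(index - 1);
                final Object barrier = new Object();
                synchronized (barrier) {
                    if (zk.exists(previous, event -> {
                            synchronized (barrier) { barrier.notifyAll(); }
                        }) != null) {
                        barrier.wait();
                    }
                }
            }
        }
    }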
Consider the following case:
Client1 successfully acquired the lock (in step 3), with ZooKeeper node "locknode/guid-lock-0";
Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and is now watching "locknode/guid-lock-0";
Later, for some reason (say, network congestion), Client1 fails to send a heartbeat message to the ZooKeeper cluster on time, but Client1 is still working away, mistakenly assuming that it still holds the lock.
But ZooKeeper may think Client1's session has timed out, and then
delete "locknode/guid-lock-0",
send a notification to Client2 (or maybe send the notification first?),
but can not send a "session timeout" notification to Client1 in time (say, due to network congestion).
Client2 gets the notification, goes to step 2, gets the only node "locknode/guid-lock-1", which it created itself; thus, Client2 assumes it holds the lock.
But at the same time, Client1 assumes it holds the lock.
Is this a valid scenario?
The scenario you describe could arise. Client 1 thinks it has the lock, but in fact its session has timed out, and Client 2 acquires the lock.
The ZooKeeper client library will inform Client 1 that its connection has been disconnected (but the client doesn't know the session has expired until it reconnects to a server), so you can write code that assumes the lock has been lost if the client has been disconnected for too long. But the thread which uses the lock needs to check periodically that the lock is still valid, which is inherently racy.
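A rough sketch of that idea (all names here are made up; it only illustrates the "assume the lock is lost after being disconnected too long" heuristic, and it is still racy as noted above):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    // Register an instance as the ZooKeeper connection watcher.
    public class LockConnectionWatcher implements Watcher {
        private final long sessionTimeoutMs;
        private volatile boolean expired = false;
        private volatile long disconnectedSince = -1L;   // -1 means currently connected

        public LockConnectionWatcher(long sessionTimeoutMs) {
            this.sessionTimeoutMs = sessionTimeoutMs;
        }

        @Override
        public void process(WatchedEvent event) {
            switch (event.getState()) {
                case Disconnected:  disconnectedSince = System.currentTimeMillis(); break;
                case SyncConnected: disconnectedSince = -1L; break;   // session survived
                case Expired:       expired = true; break;            // lock is definitely gone
                default:            break;
            }
        }

        // The worker thread should call this periodically while doing protected work.
        public boolean probablyStillHoldsLock() {
            if (expired) {
                return false;
            }
            long since = disconnectedSince;
            return since < 0L || System.currentTimeMillis() - since < sessionTimeoutMs;
        }
    }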
...But ZooKeeper may think Client1's session has timed out, and then...
From the ZooKeeper documentation:
The removal of a node will only cause one client to wake up since each node is watched by exactly one client. In this way, you avoid the herd effect.
There is no polling or timeouts.
So I don't think the problem you describe arises. It looks to me as though there could be a risk of hanging locks if something happens to the clients that create them, but the scenario you describe should not arise.
From the Packt book ZooKeeper Essentials:
If there was a partial failure in the creation of znode due to connection loss, it's possible that the client won't be able to correctly determine whether it successfully created the child znode. To resolve such a situation, the client can store its session ID in the znode data field or even as a part of the znode name itself. As a client retains the same session ID after a reconnect, it can easily determine whether the child znode was created by it by looking at the session ID.
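A small sketch of that trick with the plain ZooKeeper API (simplified; in practice the check would happen after the client has reconnected, and the class and method names are made up):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class SessionAwareLockNode {
        // Embed the session ID in the node name so that, after a connection loss,
        // the client can tell whether its earlier create() actually succeeded.
        static String createLockNode(ZooKeeper zk, String lockRoot)
                throws KeeperException, InterruptedException {
            String prefix = "lock-" + Long.toHexString(zk.getSessionId()) + "-";
            try {
                return zk.create(lockRoot + "/" + prefix, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            } catch (KeeperException.ConnectionLossException e) {
                // Outcome is ambiguous: look for a child carrying our own session ID.
                for (String child : zk.getChildren(lockRoot, false)) {
                    if (child.startsWith(prefix)) {
                        return lockRoot + "/" + child;   // the earlier create did succeed
                    }
                }
                throw e;                                 // it really failed; the caller can retry
            }
        }
    }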
I am working with QuickFIX/J (FIX 4.2) to submit orders to an acceptor FIX engine. Basically I need help on two accounts:
When I first try to establish a connection with the acceptor, the acceptor rejects the initial Logon requests saying "Msg Seq No too Low". After this, my initiator keeps incrementing the outgoing sequence number by one, and when this sequence number matches the one expected by the acceptor engine, I get a stable connection. To speed up this process, I began to extract the expected sequence number from the reject message sent by the acceptor engine and changed the outgoing sequence number for my engine using
session.setNextTargetMsgSeqNum(expectedSeqNo).
However, later on, if my engine finds the incoming sequence number higher than expected, it sends a Resend Request. In response, the other party sends back a Sequence Reset message (35=4, 123=Y). After receiving this message, the expected incoming sequence number for my engine should automatically be set to the one received in the Sequence Reset message. But this does not happen, and my engine keeps sending resend requests with no change in the expected incoming sequence number.
The interesting thing is, I found this to work when I don't explicitly change the outgoing sequence number in the first place (using setNextTargetMsgSeqNum).
Why is my engine not showing expected behavior when it gets Sequence Reset Msg?
I have talked to the other party and they won't have ResetOnLogon=Y in their configuration. So every time my engine comes up, it often sends a Logon request with a sequence number lower than expected (starting from 1). Is there a better way to get the connection set up quickly? For example, can I somehow make my engine resume the sequence numbers from the point just before it went down? What should be the ideal approach?
So I am now persisting the messages in a file, which takes care of the sequence numbers. However, what is troubling again is that my QuickFIX/J initiator engine is not responding to Sequence Reset messages. There are no admin callbacks at all now.
I notice that the missing response to the Sequence Reset message happens almost always when I connect to the acceptor from one server, then close that session and use a different server to connect to the acceptor with the same session ID. Once the logon is accepted, I expect things to work fine. However, while the other engine sends a Sequence Reset to a particular number (a gap fill, basically), my FIX engine does not respond to it; that is, it does not reset its expected sequence number and keeps sending resend requests to the acceptor. Any help will be greatly appreciated!
For normal FIX session usage, you configure the session start and end times and let the engine manage the sequence numbers. For example, if your session is active from 8:00 AM to 4:30 PM then QuickFIX/J will automatically reset the outgoing and incoming sequence number to 1 the first time the engine is started after 8:00 AM (or at 8:00 AM if the engine is already started at that time).
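The schedule normally lives in the session configuration file; a sketch of the same thing done programmatically (CompIDs and times are placeholders):

    import quickfix.SessionID;
    import quickfix.SessionSettings;

    // Times are UTC by default; the engine resets both sequence numbers to 1
    // at the start of each session window.
    SessionSettings settings = new SessionSettings();
    SessionID sessionID = new SessionID("FIX.4.2", "MY_SENDER", "MY_TARGET");
    settings.setString(sessionID, "ConnectionType", "initiator");
    settings.setString(sessionID, "StartTime", "08:00:00");
    settings.setString(sessionID, "EndTime", "16:30:00");
    settings.setString(sessionID, "HeartBtInt", "30");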
(Question #1). You are correct that your engine should use the new incoming sequence number after the Sequence Reset. Given that this works properly for thousands of QuickFIX/J users, think about what you might be doing that would change that behavior. For example, do you have an admin message callback, and might it be throwing exceptions? Have you looked at your log files to see if there are any hints there?
(Question #2). If you are using a persistent MessageStore (FileStore, JdbcStore, etc.) then your outgoing sequence number will be available when you restart.
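For example, wiring an initiator with a FileStore looks roughly like this (the config file name and MyApplication are placeholders; the config needs a FileStorePath entry pointing at a writable directory):

    import quickfix.Application;
    import quickfix.DefaultMessageFactory;
    import quickfix.FileLogFactory;
    import quickfix.FileStoreFactory;
    import quickfix.LogFactory;
    import quickfix.MessageStoreFactory;
    import quickfix.SessionSettings;
    import quickfix.SocketInitiator;

    // Sequence numbers (and sent messages) are persisted under FileStorePath,
    // so they survive an engine restart.
    SessionSettings settings = new SessionSettings("initiator.cfg");
    Application application = new MyApplication();          // your quickfix.Application callbacks
    MessageStoreFactory storeFactory = new FileStoreFactory(settings);
    LogFactory logFactory = new FileLogFactory(settings);
    SocketInitiator initiator = new SocketInitiator(
            application, storeFactory, settings, logFactory, new DefaultMessageFactory());
    initiator.start();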