How does the FIX protocol handle a message sequence number overflow?

We are currently incorporating a FIX engine (using QuickFixJ) in our application. We will be the initiator and use trade capture reports to get informed on all trades happening on the platform.
The trading (and thus the FIX session) will be running 24/7 and we are currently looking into ways to handle this properly. Our concern is that at some point we will need to reset the message sequence numbers to avoid an overflow. We would ideally not want to reset the sequence number as we need to be sure that we catch every single trade. We are worried about the following scenario:
1. We send a SequenceReset message.
2. Our system crashes due to unrelated reasons.
3. The acceptor side sends us one or more TradeCaptureReport messages.
4. Only now does the acceptor side receive our SequenceReset message.
5. Our system has recovered and sends a ResendRequest message, with BeginSeqNo equal to 1 (because we have reset the message sequence number).
6. We do not get the TradeCaptureReport messages from step 3.
However, we have noticed that in case of a message sequence number overflow, neither our engine nor the acceptor side seems to be troubled by it.
The example I have tested is simply sending heartbeats which will overflow the sequence number (the SOH field delimiter is shown as |):
8=FIXT.1.1|9=131|35=A|34=1|49=INITIATOR|50=INITIATOR|52=20220901-15:26:03.403|56=ACCEPTOR|98=0|108=10|141=Y|553=INITIATOR|554=password|1137=9|10=224
8=FIXT.1.1|9=000102|35=A|49=ACCEPTOR|56=INITIATOR|34=1|57=INITIATOR|52=20220901-15:26:03.654|98=0|108=10|141=Y|1409=0|1137=9|10=212
8=FIXT.1.1|9=90|35=4|34=2|49=INITIATOR|50=INITIATOR|52=20220901-15:26:03.718|56=ACCEPTOR|36=2147483646|123=Y|10=038
8=FIXT.1.1|9=000070|35=0|49=ACCEPTOR|56=INITIATOR|34=2|57=INITIATOR|52=20220901-15:26:13.792|10=009
8=FIXT.1.1|9=79|35=0|34=2147483646|49=INITIATOR|50=INITIATOR|52=20220901-15:26:13.789|56=ACCEPTOR|10=044
8=FIXT.1.1|9=000070|35=0|49=ACCEPTOR|56=INITIATOR|34=3|57=INITIATOR|52=20220901-15:26:23.852|10=008
8=FIXT.1.1|9=79|35=0|34=2147483647|49=INITIATOR|50=INITIATOR|52=20220901-15:26:23.850|56=ACCEPTOR|10=035
8=FIXT.1.1|9=000070|35=0|49=ACCEPTOR|56=INITIATOR|34=4|57=INITIATOR|52=20220901-15:26:33.896|10=018
8=FIXT.1.1|9=80|35=0|34=-2147483648|49=INITIATOR|50=INITIATOR|52=20220901-15:26:33.892|56=ACCEPTOR|10=080
8=FIXT.1.1|9=000070|35=0|49=ACCEPTOR|56=INITIATOR|34=5|57=INITIATOR|52=20220901-15:26:43.933|10=012
8=FIXT.1.1|9=80|35=0|34=-2147483647|49=INITIATOR|50=INITIATOR|52=20220901-15:26:43.932|56=ACCEPTOR|10=075
Is this a feature of the FIX protocol or is it undefined behaviour (and just works coincidentally)? And if this doesn't work (or is discouraged), is there a best way to handle ongoing FIX sessions? We have not found any usable information and most exchanges we have seen simply reset once a day.

I think the title of the question should rather be "how does a FIX engine handle message sequence number overflow".
As per the FIX spec the sequence number is always positive: FIX datatypes
"Sequence of character digits without commas or decimals. Value must be positive."
I can only speak for QuickFIX/J: internally the sequence number is of type java.lang.Integer which means its maximum positive value is 2147483647.
So if QuickFIX/J (or any other engine) accepts or sends negative sequence numbers, that is clearly a bug.
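To make the overflow concrete, here is a minimal Python sketch of how a signed 32-bit counter wraps around. It only assumes the counter behaves like a Java int; the helper names `next_seq_int32` and `is_valid_seq_num` are made up for this illustration and are not part of QuickFIX/J.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit value


def next_seq_int32(seq: int) -> int:
    """Increment seq the way a signed 32-bit integer would, wrapping on overflow."""
    return (seq + 1 + 2**31) % 2**32 - 2**31


def is_valid_seq_num(seq: int) -> bool:
    """The FIX SeqNum datatype requires a positive value."""
    return seq > 0


seq = 2147483646
for _ in range(4):
    # Print each sequence number and whether the FIX datatype would allow it.
    print(seq, is_valid_seq_num(seq))
    seq = next_seq_int32(seq)
```

Starting from 2147483646, the next values are 2147483647, -2147483648 and -2147483647, which is the same progression as the initiator's heartbeats in the question. The last two fail the "must be positive" check.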
Maybe you should ask your exchange how other clients handle this. I think at some point they have a time window where sequence numbers can (and should) be reset.
I guess the exchange handles it as outlined here: FIX session 24-hour connectivity

Related

How old orders should be treated with a resend request

Occasionally my quickfixN engine loses connection with the exchange, and when it reconnects the exchange realises there are missing messages and asks for a resend. My engine then sends the messages.
However, the orders are often old, and I will often have subsequently sent an order cancellation request. Nevertheless, when the exchange executes the resent messages in order, there's a good chance the orders will be filled.
What is the correct way to deal with this problem? I.e., how can I tell the exchange not to execute these orders, or alternatively, how can I stop quickfixN from resending old orders?
I don't know if there is a universally "correct" way to handle this issue.
In our system, we always, always respond with a Gap Fill, i.e.
Exchange: "Hey, we're missing sequences 537 through 542!"
Us: "Don't worry about it. Expect sequence 545 next."
The 545 is not a typo—we may have already sent 543 and 544 while their Resend Request was in transmission.
This technique is expressly to avoid the kind of dilemmas you're facing. By refusing to send old messages, you at the very least retain control over your executions.
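As a rough illustration of the exchange above, this Python sketch builds the body fields of a SequenceReset in gap fill mode. The function names are hypothetical, the standard header and trailer (BeginString, BodyLength, CompIDs, SendingTime, CheckSum) are left out, and `|` stands in for the SOH delimiter. The tag usage reflects my understanding of the session layer, so check it against your counterparty's spec.

```python
SOH_DISPLAY = "|"  # printable stand-in for the SOH field delimiter


def gap_fill_fields(begin_seq_no: int, next_outgoing_seq: int) -> dict:
    """Fields for a SequenceReset-GapFill answering a Resend Request.

    begin_seq_no:      first sequence number the counterparty is missing
    next_outgoing_seq: the next sequence number we will actually send
    """
    if begin_seq_no < 1 or next_outgoing_seq <= begin_seq_no:
        raise ValueError("NewSeqNo must be greater than the start of the gap")
    return {
        35: "4",                # MsgType = SequenceReset
        34: begin_seq_no,       # MsgSeqNum = start of the gap being filled
        43: "Y",                # PossDupFlag, this stands in for earlier messages
        123: "Y",               # GapFillFlag = Y (gap fill, not a hard reset)
        36: next_outgoing_seq,  # NewSeqNo = next sequence number to expect
    }


def render(fields: dict) -> str:
    """Join tag=value pairs in insertion order for display."""
    return SOH_DISPLAY.join(f"{tag}={value}" for tag, value in fields.items())


# Exchange is missing 537 through 542; we have already sent 543 and 544.
print(render(gap_fill_fields(537, 545)))
```

For the example in the text this prints `35=4|34=537|43=Y|123=Y|36=545`.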
To illustrate a larger perspective, what we do is, when we initiate any action on an order, we flag the order as "in progress," meaning it cannot be actioned further in any way (amended/CFO'ed or cancelled). Only when we receive an ACK, i.e. an Execution Report, do we remove this state. So if the exchange misses a message pertaining to that order, that order simply ends up "stuck" (and gets highlighted as such on the front-end). Not ideal, but again, at least it's not out of control. The trader then simply re-enters the desired order. (Note that it's the very guarantee that we won't resend messages that enables a trader to safely re-enter orders.) With our system it's just try-again-and-move-on, without need for complex sequence-scenario resolving.
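The "in progress" flag can be sketched as a tiny state holder. This is only an illustration of the idea described above, not our actual code: the class `TrackedOrder` and its methods are hypothetical names, and the real system obviously tracks far more than one boolean.

```python
class TrackedOrder:
    """Hypothetical order wrapper that blocks further actions until an ACK arrives."""

    def __init__(self, order_id: str):
        self.order_id = order_id
        self.in_progress = False  # True while an action awaits its Execution Report

    def request(self, action: str) -> str:
        """Start an action (new, amend, cancel); refuse if one is already pending."""
        if self.in_progress:
            raise RuntimeError(f"order {self.order_id} is in progress; wait for the ACK")
        self.in_progress = True
        return f"{action} {self.order_id}"  # placeholder for sending the FIX message

    def on_execution_report(self) -> None:
        """An Execution Report for this order arrived, so it can be actioned again."""
        self.in_progress = False


order = TrackedOrder("A1")
order.request("cancel")
# If the exchange never sees the cancel, no Execution Report comes back and
# in_progress stays True: the order is "stuck" and can be flagged in the UI.
print(order.in_progress)
```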
Source: Work on an order entry system connecting to >10 Canadian exchanges, used by >50 Canadian brokers.

What's the expected behavior when TCP connection is lost?

I looked through the FIX v4.2 spec, and it is not clear to me what the expected behavior should be when the TCP connection is lost in the middle of a session.
More specifically, suppose the current sequence number is 100 and at this point the TCP connection is lost. When either side tries to resume the session, does it re-send message number 100, or does it start a new session with a logon?
In describing FIX session, the spec says one session has one logon and one logout, but could go across multiple physical connections. This leads me to think that when the TCP connection is lost, the resuming process should not be starting with a logon message, but I am not positive on that.
Thanks in advance!
The FIX protocol does not define anything related to the transport protocol. There were some documents on the official web site suggesting how it can be implemented on top of this or that protocol, but they are only suggestions.
Therefore, the expected behavior in case of a TCP/IP disconnect depends on the implementation. For instance, it is possible to have a system that does not care about TCP/IP disconnects at all, which would make those details irrelevant. In that case, the expected behavior would be to continue sending and receiving messages after the connection is re-established, and of course to proceed to a "recovery" of lost messages, if any. In reality, though, I have never seen a system like that.
In practice, all systems treat a TCP/IP disconnect as an implicit loss of session and expect clients to send a logon upon reconnecting.
When logging in, there are two options: a reconnecting session may send its next outgoing sequence number, or it may ask the server to reset the sequence (to 1). In the first case, the server may send a logon acknowledgement if the sequence number is greater than or equal to what it expected, or close (or even reject) the session if the received sequence number is lower than expected. Additionally, if the sequence number was greater than expected, the server will request a retransmission. The client session monitors the server's sequence numbers as well, and needs to request a retransmission if it detects a gap (received sequence number greater than expected). In the second case, if the server supports sequence reset, both incoming and outgoing sequence numbers are reset to 1 and no messages are recovered.
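The server-side decision described above can be summarised in a few lines of Python. This is a sketch under assumptions, not how any particular engine is written: `handle_logon` and the action strings are invented for illustration, and the reset branch assumes the server supports sequence reset.

```python
from typing import List


def handle_logon(received_seq: int, expected_seq: int,
                 reset_requested: bool = False) -> List[str]:
    """Return the actions a server might take for an incoming logon."""
    if reset_requested:
        # Second option: both directions restart at 1, nothing is recovered.
        return ["reset_sequences_to_1", "logon_ack"]
    if received_seq < expected_seq:
        # Lower than expected: the session is closed (or rejected).
        return ["disconnect"]
    actions = ["logon_ack"]
    if received_seq > expected_seq:
        # Higher than expected: messages were missed, ask for them again.
        actions.append(f"resend_request_from_{expected_seq}")
    return actions


print(handle_logon(101, 101))   # sequence matches
print(handle_logon(105, 101))   # gap detected
print(handle_logon(99, 101))    # too low
```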
In your case, if the connection is lost after sending a message with sequence number 100, the client would have to reconnect and send a logon with sequence number 101, and proceed from there. Alternatively, it can connect and reset the sequence, in which case some messages might get lost.
Also, don't forget to check the specifics of the venue you connect to. There could be very weird details that are not specified by the FIX protocol at all, or that even go against the FIX protocol. For instance, ICE (indeed one of the most brain-dead exchanges in general) is one of the silliest exchanges in this regard: it doesn't allow reconnecting within the first 15 seconds, and if clients cannot connect for 30 seconds, they should switch to a failover server. If failover happens, they fail to keep the sequence number intact, and clients are left with no choice but to reset the sequence number.
Hope it makes things a bit clearer for you. Good Luck!
If the transport layer is TCP/IP, I would expect the session initiator to:
1. Re-establish a socket connection
2. Send a new logon message
The sequence number to use on the logon message depends on the type of session and what has been agreed with the FIX session acceptor (see the spec for details). For sessions where there is no value in replaying any lost messages, e.g. market data feeds where the prices would be stale, it makes sense to send a logon message with sequence number 1 and set tag 141=Y (to reset the sequence numbers). For an orders session, where message replay might be required, the session initiator should generally log on with a sequence number one greater than the last message sent (and expect a logon response from the FIX session acceptor with a sequence number one greater than the last message received).
Unless you really need the message replay, it is cleaner and easier to reset the sequence numbers each time upon logon. This obviously depends on the FIX session acceptor (FIX server) support for this. For things like STP feeds, I've found this to be far more reliable and it is generally better for the application protocol to provide application level replay facilities rather than relying on the brittleness of FIX session replay.

FIX protocol sequence number

I have a few questions on the FIX protocol sequence number:
1. What is the benefit of setting ResetOnLogon=N?
2. Can both the initiator and the acceptor send a Resend Request?
3. How does the message sequence number help in session recovery/error handling?
1. ResetOnLogon controls whether sequence numbers are reset by the protocol on a logon message. Resetting keeps sequence numbers low, which can be useful. The sell side usually defines whether this should be done or not.
2. Yes. As long as an engine thinks that a message may have been lost, because sequence numbers are out of sync, it may request a resend.
3. If the sequence numbers of a message and its predecessor are out of sync, and the number is higher than expected, then the engine may assume that some messages have been lost in the connection. This means that it needs to recover these messages.
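A small Python sketch of that gap check on the receiving side might look like this. `IncomingSequence` is a made-up name, and the handling of a number that is lower than expected is my assumption (the answer above only covers the "higher than expected" case), so it simply raises here.

```python
from typing import Optional, Tuple


class IncomingSequence:
    """Tracks the next expected incoming sequence number and reports gaps."""

    def __init__(self, expected: int = 1):
        self.expected = expected

    def on_message(self, seq: int) -> Optional[Tuple[int, int]]:
        """Return the (first, last) missing numbers to request, or None if in sync."""
        if seq == self.expected:
            self.expected += 1      # in order: consume it and move on
            return None
        if seq > self.expected:
            # Messages expected..seq-1 never arrived; the caller would send a
            # Resend Request for them and hold back this message meanwhile.
            return (self.expected, seq - 1)
        # Assumption: a lower number is treated as an error by this sketch.
        raise ValueError(f"sequence number {seq} is lower than expected {self.expected}")


incoming = IncomingSequence(expected=10)
print(incoming.on_message(10))   # in sync
print(incoming.on_message(14))   # 11, 12 and 13 are missing
```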
If you have any more questions or want more information I would be happy to reply.
1. ResetOnLogon determines if sequence numbers should be reset when receiving a logon request. (Please find the documentation here: http://www.quickfixengine.org/quickfix/doc/html/configuration.html)
2. Yes, both can send a Resend Request, but you must follow the specs agreed between your side and the counterparty.
3. The message sequence numbers show that no messages were lost during the current session. If there is a mismatch, action must be taken to re-establish the correct sync between the two sides.

QuickFix Sequence Reset not working

I am working with QuickFIX/J (FIX 4.2) to submit orders to an acceptor FIX engine. Basically I need help on two counts:
1. When I first try to establish a connection with the acceptor, the acceptor rejects the initial Logon requests saying "Msg Seq No too Low". After this my initiator goes on incrementing the outgoing sequence number by one, and when this seq no. and the no. expected by the acceptor engine match, I get a stable connection. To speed this process up, I began to extract the expected seq no. from the reject message sent by the acceptor engine and changed the outgoing sequence no. for my engine using
session.setNextTargetMsgSeqNum(expectedSeqNo).
However, later on, if my engine finds an incoming sequence no. higher than expected, it sends a Resend Request. In response, the other party sends back a Sequence Reset msg (35=4, 123=Y). After receiving this msg, the incoming seq no. for my engine should automatically be set to the one it received in the Sequence Reset msg. But this does not happen, and my engine goes on sending Resend Requests with no change in the incoming seq no.
The interesting thing is that this works when I don't explicitly change the outgoing seq no. in the first place (using setNextTargetMsgSeqNum).
Why is my engine not showing the expected behavior when it gets a Sequence Reset msg?
2. I have talked to the other party and they won't have ResetOnLogon=Y in their configuration. So every time my engine comes up, it often sends a Logon request with a seq no. lower than expected (it starts from 1). Is there a better way to have the connection set up quickly? For example, can I somehow make my engine resume from the sequence no. it had just before it went down? What would be the ideal approach?
So I am now persisting the messages in a file which is taking care of sequence numbers. However, what is troubling again is, my quickfix initiator engine is not responding to Sequence Reset messages. There are no admin call backs at all now.
I notice that no response to sequence reset message is happening almost always when I am connecting to the acceptor from one server and then, closing that session, and using a different server to connect to the acceptor, using the same session id. Once the logon is accepted, I expect things to work fine. However, while the other engine sends sequence reset to a particular number (gap fill basically), my fix engine does not respond to it, meaning, it does not reset its expected sequence number and keeps on sending resend requests to the acceptor. Any help will be greatly appreciated!
For normal FIX session usage, you configure the session start and end times and let the engine manage the sequence numbers. For example, if your session is active from 8:00 AM to 4:30 PM then QuickFIX/J will automatically reset the outgoing and incoming sequence number to 1 the first time the engine is started after 8:00 AM (or at 8:00 AM if the engine is already started at that time).
(Question #1). You are correct that your engine should use the new incoming sequence number after the Sequence Reset. Given that this works properly for thousands of QuickFIX/J users, think about what you might be doing that would change that behavior. For example, do you have an admin message callback, and might it be throwing exceptions? Have you looked at your log files to see if there are any hints there?
(Question #2). If you are using a persistent MessageStore (FileStore, JdbcStore, etc.) then your outgoing sequence number will be available when you restart.
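The idea behind a persistent store can be shown with a toy Python version that writes both sequence numbers to disk on every change and reads them back on startup. This is not QuickFIX/J's FileStore or its file format; `FileSeqStore`, the JSON layout and the file name are all assumptions made for the illustration.

```python
import json
import os
import tempfile


class FileSeqStore:
    """Toy persistent store for the next outgoing and next expected incoming seq no."""

    def __init__(self, path: str):
        self.path = path
        self.next_sender_seq = 1
        self.next_target_seq = 1
        if os.path.exists(path):
            # Resume from whatever the previous run saved.
            with open(path) as f:
                saved = json.load(f)
            self.next_sender_seq = saved["next_sender_seq"]
            self.next_target_seq = saved["next_target_seq"]

    def _save(self) -> None:
        # Write to a temp file first so a crash cannot leave a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_sender_seq": self.next_sender_seq,
                       "next_target_seq": self.next_target_seq}, f)
        os.replace(tmp, self.path)

    def incr_sender(self) -> None:
        self.next_sender_seq += 1
        self._save()

    def incr_target(self) -> None:
        self.next_target_seq += 1
        self._save()


path = os.path.join(tempfile.mkdtemp(), "session.seq")
store = FileSeqStore(path)
store.incr_sender()
store.incr_sender()
restarted = FileSeqStore(path)   # simulates the engine coming back up
print(restarted.next_sender_seq, restarted.next_target_seq)
```

After the simulated restart the outgoing number continues at 3 instead of starting again from 1, which is what avoids the "Msg Seq No too Low" rejects.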

Why don't I get all the data with my non-blocking Perl socket?

I'm using Perl sockets in AIX 5.3, Perl version 5.8.2
I have a server written with Perl sockets. There is an option called "Blocking", which can be set to 0 or 1. When I use Blocking => 0, run the server, and have the client send data (5000 bytes), I am able to receive only 2902 bytes in one call. When I use Blocking => 1, I am able to receive all the bytes in one call.
Is this how sockets work or is it a bug?
This is a fundamental part of sockets - or rather, TCP, which is stream-oriented. (UDP is packet-oriented.)
You should never assume that you'll get back as much data as you ask for, nor that there isn't more data available. Basically more data can come at any time while the connection is open. (The read/recv/whatever call will probably return a specific value to mean "the other end closed the connection".)
This means you have to design your protocol to handle this - if you're effectively trying to pass discrete messages from A to B, two common ways of doing this are:
- Prefix each message with a length. The reader first reads the length, then keeps reading the data until it's read as much as it needs.
- Have some sort of message terminator/delimiter. This is trickier, as depending on what you're doing you may need to be aware of the possibility of reading the start of the next message while you're reading the first one. It also means "understanding" the data itself in the "reading" code, rather than just reading bytes arbitrarily. However, it does mean that the sender doesn't need to know how long the message is before starting to send.
(The other alternative is to have just one message for the whole connection - i.e. you read until the connection is closed.)
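Here is a minimal sketch of the length-prefix approach, written in Python rather than Perl for brevity. It assumes blocking sockets and a 4-byte big-endian length header, both of which are my choices for the example; with a non-blocking socket you would additionally have to wait for readability (e.g. with select) before each read. The function names are illustrative.

```python
import socket
import struct


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))   # may return fewer bytes than asked for
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a 4-byte big-endian length followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Read the length header, then read exactly that many payload bytes."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Local demonstration using a connected pair of sockets.
left, right = socket.socketpair()
send_message(left, b"x" * 5000)
print(len(recv_message(right)))
left.close()
right.close()
```

The loop in `recv_exact` is the important part: a single recv returning 2902 of 5000 bytes is normal, so the reader just asks again for the remainder.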
Blocking means that the socket waits until there is data before returning from a receive function. It's entirely possible there's a tiny wait at the end as well to try to fill the buffer before returning, or it could just be a timing issue. It's also entirely possible that the non-blocking implementation returns one packet at a time, regardless of whether more than one is available. In short, no, it's not a bug, but the specific "why" of it is the old cop-out "it's implementation specific".