I looked through FIX v4.2 spec, it is not clear to me what the expected behavior it should be when the TCP connection is lost in the middle of a session.
More specifically, suppose the current sequence number is 100 and at this point the TCP connection is lost, when either side tries to resume the session, it re-sends message number 100, or starts a new session with logon?
In describing FIX session, the spec says one session has one logon and one logout, but could go across multiple physical connections. This leads me to think that when the TCP connection is lost, the resuming process should not be starting with a logon message, but I am not positive on that.
Thanks in advance!
FIX protocol does not define anything related to the transport protocol. There were some documents on the official web site that only suggest how it can be implemented on top of this or that protocol, but only suggests.
Therefore, the expected behavior in case of TCP/IP disconnect depends on implementation. For instance, it is possible to have a system that does not care about TCP/IP disconnects at all, which would make those details irrelevant. In that case, the expected behavior would have been to continue sending receiving messages after connection is re-established, and of course proceed to a “recovery” of lost messages, if any. In reality, though, I have never seen a system like that.
In practice, all systems treat TCP/IP disconnects as implicit lose of session and expect clients to send a logon upon re-connect.
When logging in, there are two options — a re-connecting session may send the next outgoing sequence number or it may ask server to reset the sequence (to 1). In first case, the server side may send a logon acknowledgement if sequence is greater or equal to what it expected, or close (or even reject) the session if the received sequence number is less than expected. Additionally, if the sequence was greater than expected, server will issue a re-transmission. Client session monitors the sequence of the server as well, and needs to request a re-transmission if it detects a gap (received sequence is greater than expected). In the second case, if the server supports sequence reset, both in and out sequences are reset to 1 and no messages are recovered.
In your case, if connection is lost after sending a message with sequence number 100, client would have to re-connect and send a logon with sequence 101, and proceed from there. Alternatively, connect and reset the sequence, in which case some messages might get lost.
Also, don’t forget to check specifics of the venue you connect to. There could be very weird details that are not specified by the FIX protocol at all, or even those going against the FIX protocol. For instance, ICE (indeed one of the most brain-dead exchanges in general) is one of the silliest exchanges in this regard — it doesn’t allow re-connecting within first 15 seconds, and then if clients cannot connect for 30 seconds, they should switch to a failover server. If failover happens, they fail to keep the sequence number in tact, and clients are left no choice but reset the sequence number.
Hope it makes things a bit clearer for you. Good Luck!
If the transport layer is TCP/IP, I would expect the session initator to:
Re-establish a socket connection
Send a new logon message
The sequence number to use on the logon message depends on the type of session and what has been agreed with the FIX session acceptor (see the spec for details). For sessions where there is no value in replaying any lot messages e.g. market data feeds where the prices would be stale, it makes sense to send a logon message with sequence number 1 and set tag 141=Y (to reset the sequence numbers). For an orders session, where message replay might be required, the session initiator should generally logon with a sequence number of one greater than the last message sent (and expect a logon response from the FIX session acceptor with sequence number of 1 greater than the last message received).
Unless you really need the message replay, it is cleaner and easier to reset the sequence numbers each time upon logon. This obviously depends on the FIX session acceptor (FIX server) support for this. For things like STP feeds, I've found this to be far more reliable and it is generally better for the application protocol to provide application level replay facilities rather than relying on the brittleness of FIX session replay.
Related
quickfixj initiator getting Disconnecting: Encountered END_OF_STREAM while trying to logon to the acceptor. We are using vendor's fix engine as acceptor. and feedback from acceptor is that logon request for xxxx was not accepted, incoming too small, expect 305, received 27.
I read the quickfix documentation but didn't get it exactly what's the proper solution for the sequence number mismatch. I understand that if I am disconnected, my initiator will send an 35=4 for resend with initiator side seqnum asking acceptor to resend the messages and fill up the gap.
But in what case, if initiator is sending a lower seqnum will be rejected by acceptor and refuse the connection?
And what's the proper procedure to handle this kind of rejection and reconnect? In order to not loose any message, how should both side do the reset and fill the gap?
In case there is a break between the initiator and acceptor, what's the recommended solution to keep the messages in sync and not loosing any?
Due to the first sentence of your question I would like to show you an answer to the same error message Disconnecting: Encountered END_OF_STREAM. There is a blog post by bhageera quoted.
In the end the reason was pretty silly… the counterparty I was connecting to allows only 1 connection per user/password (i.e. session with those credentials) at a time. As it turns out there was another application using the same credentials against the same TargetCompID. As soon as that application was killed off, the current one logged in fine.
I searched for the cause of the bug for a while, until I realized that I had two initiators with the same credentials running on two different test environments.
According to default logic in QuickfixJ:
QuickfixJ manages 2 sequence number, expectedSeqNum to receive(targetSeqNum) and nextSeqNumber to sent.
Check the next expected target SeqNum against the received SeqNum.If a mismatch is detected, apply the following logic:
if lower than expected SeqNum, logout
if higher, send a resend request
In your case received was lower than expected so it gets disconnected.
Reason for receiving higher than expected SeqNum:
Receiver misses some message so it could be a normal scenario.
Reason for lower than expected SeqNum(Your case):
One of the counterparties resets its sequence number, which is not expected it should be agreed by both the counterparties.
In a normal scenario, whenever you miss the message you will receive a higher number and it would be managed by QuickFixJ.
I have read following on msdn about accept function:
https://msdn.microsoft.com/pl-pl/library/windows/desktop/ms737526(v=vs.85).aspx
When using the accept function, realize that the function may return
before connection establishment has traversed the entire distance
between sender and receiver. This is because the accept function
returns as soon as it receives a CONNECT ACK message; in ATM, a
CONNECT ACK message is returned by the next switch in the path as soon
as a CONNECT message is processed (rather than the CONNECT ACK being
sent by the end node to which the connection is ultimately
established). As such, applications should realize that if data is
sent immediately following receipt of a CONNECT ACK message, data loss
is possible, since the connection may not have been established all
the way between sender and receiver.
Could someone explain it in more details? What it has with SYN, SYN ACK? What's the problem here? So when such data loss can happen, and how to prevent it?
You're omitting an important paragraph on that page, right before your quote:
The following are important issues associated with connection setup,
and must be considered when using Asynchronous Transfer Mode (ATM)
with Windows Sockets 2
That is, it is only applicable when you use things like AF_ATM and SOCKADDR_ATM. It is not relevant for TCP which you seem to imply with:
What it has with SYN, SYN ACK
I am a newbie to networks and in particular TCP (I have been fooling a bit with UDP, but that's it).
I am developing a simple protocol based on exchanging messages between two endpoints. Those messages need to be certified, so I implemented a cryptographic layer that takes care of that. However, while UDP has a sound definition of a packet that constitutes the minimum unit that can get transferred at a time, the TCP protocol (as far as my understanding goes) is completely stream oriented.
Now, this puzzles me a bit. When exchanging messages, how can I tell where one starts and the other one ends? In principle, I can obviously communicate fixed length messages or first communicate the size of each message in some header. However, this can be subject to attacks: while of course it is going to be impossible to distort or determine the content of the communication, the above technique would make it easy to completely disrupt my communication just by adding a single byte in the middle.
Say that I need to transfer a message 1234567 bytes long. First of all, I communicate 4 bytes with an integer representing the size of the message. Okay. Then I start sending out the actual message. That message gets split in several packets, which get separately received. Now, an attacker just sends in an additional packet, faking it as if it was part of the conversation. It can just be one byte long: this completely destroys any synchronization mechanism I have implemented! The message has a spurious byte in the middle, and it doesn't successfully get decoded. Not only that, the last byte of the first message disrupts the alignment of the second message and so on: the connection is destroyed, and with a simple, simple attack! How likely and feasible is this attack anyway?
So I am wondering: what is the maximum data unit that can be transferred at once? I understand that to a call to send doesn't correspond a call to receive: the message can be split in different chunks. How can I group the packets together in some way so that I know that they get packed together? Is there a way to define an higher level message that gets reconstructed and aligned all together and triggers a single call to a receive-like function? If not, what other solutions can I find to keep my communication re-alignable even in presence of an attacker?
Basically it is difficult to control the way the OS divides the stream into TCP packets (The RFC defining TCP protocol states that TCP stack should allow the clients to force it to send buffered data by using push function, but it does not define how many packets this should generate. After all the attacker can modify any of them).
And these TCP packets can get divided even more into IP fragments during their way through the network (which can be opted-out by a 'Do not fragment' IP flag -- but this flag can cause that your packets are not delivered at all).
I think that your problem is not about introducing packets into a stream protocol, but about securing it.
IPSec could be very beneficial in your scenario, as it operates on the network layer.
It provides integrity for every packet sent, so any modification on-the-wire gets detected and the invalid packets are dropped. In case of TCP the dropped packets get re-transmitted automatically.
(Almost) everything is done automatically by the OS -- so yo do not need to worry about it (and make mistakes doing so).
The confidentiality can be assured as well (with the same advantage of not re-inventing the wheel).
IPSec should provide you a reliable transport protocol ontop of which you can use whatever framing format you like.
Another alternative is using SSL/TLS on top of TCP session which is less robust (as it does close the whole connection on integrity error).
Now, an attacker just sends in an additional packet, faking it as if it was part of the conversation. It can just be one byte long: this completely destroys any synchronization mechanism I have implemented!
Thwarting such an injection problem is dealt with by securing the stream. Create an encrypted stream and send your packets through that.
Of course the encrypted stream itself then has this problem; its messages can be corrupted. But those messages have secure integrity checks. The problem is detected, and the connection can be torn down and re-established to resynchronize it.
Also, some fixed-length synchronizing/framing bit sequence can be used between messages: some specific bit pattern. It doesn't matter if that pattern occurs inside messages by accident, because we only ever specifically look for that pattern when things go wrong (a corrupt message is received), otherwise we skip that sequence. If a corrupt message is received, we then receive bytes until we see the synchronizing pattern, and assume that whatever follows it is the start of a message (length followed by payload). If that fails, we repeat the process. When we receive a correct message, we reply to the peer, which will re-transmit anything we didn't get.
How likely and feasible is this attack anyway?
TCP connections are identified by four items: the source and destination IP, and source and destination port number. The attacker has to fake a packet which matches your stream in these four identifiers, and sneak that packet past all the routers and firewalls between that attacker and the receiving machine. The attacker also has to be in the right ballpark with regard to the TCP sequence number.
Basically, this is next to impossible for an attacker C to perpetrate against endpoints A and B which are both distant from C on the network. The fake source IP will be rejected long before C is able to reach its destination. It's more plausible as an inside job (which includes malware): C is close to A and B.
I have few question on FIX protocol sequence number:
What is the benefit of setting ResetOnLogon=N?
Does initiator and acceptor both can send Resend request?
How message sequence helps in session recovery/error handling?
it means that sequence numbers are reset by the protocol on a logon message. This keeps sequence numbers low which can be useful. The sell side usually defines whether this should be done or not.
Yes, as long as the engine thinks that, due to out of synch sequence numbers, a message may have been lost it may request a resend.
If sequence numbers are out of synch between a message and its predecessor, and the number is higher than expected then the engine may assume that some messages have been lost in the connection. This means that it needs to recover these meaasges.
If you have any more questions or want more information I would be happy to reply.
ResetOnLogon determines if sequence numbers should be reset when recieving a logon request. (please find documentation here: http://www.quickfixengine.org/quickfix/doc/html/configuration.html)
Yes, both can send a Resend Request, but you must follow the specs between your side and the counterparty.
The message sequence numbers tell that no messages were lost during the current session. If there is a mismatch, actions must be taken in order to establish the correct sync between the 2 sides.
I am working on QuickFix/J (FIX 4.2)to submit orders to an acceptor FIX engine. Basically I need help on two accounts:
When I first try to establish a connection with the acceptor, the acceptor rejects the initial Logon requests saying "Msg Seq No too Low". After this my initiator goes on incrementing the outgoing sequence number by one and when this seq no. and the no. expected by the acceptor engine match, I get a stable connection. To speed this process, I began to extract the expected seq. no. from the reject message sent by the acceptor engine and changed the outgoing sequence no. for my engine using
session.setNextTargetMsgSeqNum(expectedSeqNo).
However, later on, if my engine finds incoming sequence no. higher than expected, it sends a Resend request. In response, the other party sends back a Sequence Reset msg (35=4, 123=Y). Now after receiving this msg, incoming seq no. for my engine should be automatically set to the one it received from Seq Reset msg. But this does not happen and my engine goes on asking for messages resend request with no change in the incoming seq no.
Interesting thing is, I found this thing to work when I don't explicitly change the outgoing seq no in the first place (using setNextTargetMsgSeqNum).
Why is my engine not showing expected behavior when it gets Sequence Reset Msg?
I have talked to the other party and they won't have ResetOnLogon=Y in their configuration. So every time my engine comes up, it often sends Logon request with a seq no. lower than expected(starts from 1). Is there a better way to have the connection set up quickly? Like can I somehow make my engine use the sequence no. resuming from the point just before it went down? What should be the ideal approach?
So I am now persisting the messages in a file which is taking care of sequence numbers. However, what is troubling again is, my quickfix initiator engine is not responding to Sequence Reset messages. There are no admin call backs at all now.
I notice that no response to sequence reset message is happening almost always when I am connecting to the acceptor from one server and then, closing that session, and using a different server to connect to the acceptor, using the same session id. Once the logon is accepted, I expect things to work fine. However, while the other engine sends sequence reset to a particular number (gap fill basically), my fix engine does not respond to it, meaning, it does not reset its expected sequence number and keeps on sending resend requests to the acceptor. Any help will be greatly appreciated!
For normal FIX session usage, you configure the session start and end times and let the engine manage the sequence numbers. For example, if your session is active from 8:00 AM to 4:30 PM then QuickFIX/J will automatically reset the outgoing and incoming sequence number to 1 the first time the engine is started after 8:00 AM (or at 8:00 AM if the engine is already started at that time).
(Question #1). You are correct that your engine should use the new incoming sequence number after the Sequence Reset. Given that this works properly for thousands of QuickFIX/J users, think about what you might be doing that would change that behavior. For example, do you have an admin message callback and might it be throwing exceptions. Have you looked at your log files to see if there are any hints there?
(Question #2). If you are using a persistent MessageStore (FileStore, JdbcStore, etc.) then your outgoing sequence number will be available when you restart.