RDMA memory semantics for READ/WRITE operations and local memory operations - rdma

I am seeing unexpected results from RDMA reads that make me doubt my understanding of the RDMA read and write semantics.
I'm trying to implement message passing in RDMA in a manner similar to L5, but I am running into issues that look like memory tearing, which shouldn't be happening.
I have a struct that is a bit more complicated than what L5 has:
#include <atomic>
#include <cstdint>

struct Header {
    std::atomic<uint8_t> mailbox = 1;
    std::atomic<uint32_t> length;
    char data[128];
};
On the writing side, I do RDMA reads until I see a value of 1 in mailbox. Then I do an RDMA write of length + data, set mailbox to 0, and send mailbox with a second RDMA write.
On the reading side, I check for mailbox == 0, read the data, and set length to 0 and mailbox to 1.
When I do my RDMA reads I am occasionally seeing lengths != 0 along with mailbox values of 0. Since RDMA operations are supposed to happen in order, I do not understand how this is happening.

One possible explanation is that if you do an RDMA read targeting a whole struct Header, there is no guarantee about the order in which the target RDMA adapter will read memory to satisfy that read. This matters all the more because your struct is not aligned to a cacheline boundary (I'm guessing you're on x86, where cachelines are 64 bytes), so mailbox and length could end up in different cachelines.
I still don't really understand why it's surprising to see length != 0 and mailbox == 0 - isn't that the case where the reading side hasn't processed the mailbox at all? From what you wrote, the final state of the struct after the two RDMA writes from the writing side is exactly length != 0, mailbox == 0.
In any case, since as I described above an RDMA adapter is completely free to read the memory being accessed by RDMA read in any order, it's possible for RDMA read to return any mix of old and new data, no matter what order the CPU updates the fields in. For example if you had:
RDMA read comes in targeting struct Header
RDMA adapter reads cacheline with length field
CPU updates length to 0 then mailbox to 1
RDMA adapter reads cacheline with mailbox field
then the RDMA read would return length != 0, mailbox == 1. This is because the RDMA read operation does not participate in any memory barriers or other ordering, even though you declared your struct members as atomic.
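
To make the cacheline point concrete, here is a minimal layout sketch (the alignas/padding values and the static_asserts are my own illustration, not from the question). It only makes the field placement explicit and lets you poll the flag with a 1-byte read; it does not make a whole-struct RDMA read consistent, because the adapter can still read the separate cachelines at different times:

#include <atomic>
#include <cstdint>
#include <cstddef>

struct alignas(64) Header {
    // Flag polled remotely; kept alone in the first cacheline so a 1-byte
    // RDMA read of just this field cannot observe a torn value.
    std::atomic<uint8_t> mailbox{1};
    char pad[63];
    // Payload description and payload live in the following cachelines.
    std::atomic<uint32_t> length{0};
    char data[128];
};

static_assert(offsetof(Header, mailbox) == 0, "flag sits at the start of its own cacheline");
static_assert(offsetof(Header, length) == 64, "length starts in the next cacheline");
static_assert(sizeof(Header) % 64 == 0, "struct occupies whole cachelines");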

Related

RDMA read and write data placement/visibility semantics

I am trying to get more details on the RDMA read and write semantics (especially data placement semantics) and I would like to confirm my understanding with the experts here.
RDMA read:
Would the data be available/seen in the local buffer once the RDMA read completion is seen in the completion queue? Is the behavior the same if I am using GPU Direct DMA and the local address maps to GPU memory? Would the data be immediately available in GPU memory once the RDMA READ completion is seen in the completion queue? If it is not immediately available, what operation will ensure it?
RDMA Write with Immediate (or) RDMA Write + Send:
Can the remote host check for the presence of data in its memory after it has seen the Immediate data in its receive queue? And is the expectation/behavior going to change if the Write is to GPU memory (using GDR)?
RDMA read. Would the data be available/seen in the local buffer, once the RDMA read completion is seen in the completion queue?
Yes
Is the behavior the same, if I am using GPU Direct DMA and the local address maps to GPU memory?
Not necessarily. It is possible that the NIC has sent the data towards the GPU, but the GPU hasn't received it yet, while the RDMA read completion has already arrived at the CPU. The root cause of this is PCIe semantics, which allow reordering of writes to different destinations (CPU/GPU memory).
If it is not immediately available, what operation will ensure it?
To ensure data has arrived at the GPU, one may set a flag on the CPU following the RDMA completion and poll on this flag from GPU code. This works because the PCIe read issued by the GPU will "push" the NIC's DMA writes (according to PCIe ordering semantics).
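
As an illustration of that flag idea, here is a host-side-only sketch (assuming an already-established ibverbs completion queue and a flag location the GPU kernel can poll; publish_after_read_completion and gpu_visible_flag are names I made up, and the GPU-side spin loop itself would be CUDA code, only described in the comment):

#include <infiniband/verbs.h>
#include <atomic>

// Wait for the RDMA READ completion, then publish a flag for the GPU.
bool publish_after_read_completion(ibv_cq* cq, std::atomic<uint32_t>* gpu_visible_flag) {
    ibv_wc wc{};
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);       // spin until the READ completes
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return false;                       // completion error
    // The data may still be in flight towards GPU memory at this point.
    // Setting the flag here and having the GPU kernel spin on it forces the
    // GPU's PCIe read to "push" the NIC's earlier DMA writes, as described above.
    gpu_visible_flag->store(1, std::memory_order_release);
    return true;
}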
RDMA Write with Immediate (or) RDMA Write + Send:
Can the remote host check for the presence of data in its memory after it has seen the Immediate data in its receive queue? And is the expectation/behavior going to change if the Write is to GPU memory (using GDR)?
Yes, this works, but GDR suffers from the same issue as above with writes arriving out-of-order to GPU memory as compared to CPU memory, again due to PCIe ordering semantics. The RNIC cannot control PCIe and therefore it cannot force the "desired" semantics in either case.

What happens when a process tries to read more bytes than the other one sent

Two processes communicate via sockets, and Process A sends Process B 100 bytes.
Process B tries to read 150 bytes. Later, Process A sends 50 bytes.
What is the result of Process B's read?
Will Process B's read wait until it receives 150 bytes?
That depends on many factors, especially the type of socket, but also the timing.
Generally, however, the receive buffer size is considered a maximum. So, if a process executes a recv with a buffer size of 150, but the operating system has only received 100 bytes so far from the peer socket, usually the available 100 are delivered to the receiving process (and the return value of the system call will reflect that). It is the responsibility of the receiving application to go back and execute recv again if it is expecting more data.
Another related factor (which will not generally be the case with a short transfer like 150 bytes but definitely will if you're sending a megabyte, say) is that the sender's apparently "atomic" send of 1000000 bytes will not all be delivered in one packet to the receiving peer, so if the receiver has a corresponding recv with a 1000000 byte buffer, it's very unlikely that all the data will be received in one call. Again, it's the receiver's responsibility to continue calling recv until it has received all the data sent.
And it's generally the responsibility of the sender and receiver to somehow coordinate what the expected size is. One common way to do so is by including a fixed-length header at the beginning of each logical transmission telling the receiver how many bytes are to be expected.
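
For illustration, here is a minimal sketch of both conventions (assuming a blocking POSIX stream socket fd; the helper names are mine, not from the question): loop on recv() until the expected byte count has arrived, and frame each logical message with a fixed 4-byte length header.

#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <cstdint>
#include <cstddef>

// Keep calling recv() until exactly `len` bytes have arrived
// (or the peer closes the connection / an error occurs).
bool recv_all(int fd, void* buf, size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) return false;            // 0 = peer closed, <0 = error
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Read one logical message framed by a fixed-length (4-byte) length header.
bool recv_message(int fd, char* payload, size_t max_len) {
    uint32_t net_len;
    if (!recv_all(fd, &net_len, sizeof net_len)) return false;
    uint32_t len = ntohl(net_len);           // header sent in network byte order
    if (len > max_len) return false;         // protocol error / buffer too small
    return recv_all(fd, payload, len);
}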
Depends on what kind of socket it is. For a STREAM socket, the read will return either the amount of data currently available or the amount requested (whichever is less) and will only ever block (wait) if there is no data available.
So in this example, assuming the 100 bytes have (all) been transmitted and received into the receive buffer when B reads from the socket and the additional 50 bytes have not yet been transmitted, the read will return those 100 bytes and will not wait.
Note also, the dependency of all the data being transmitted and received -- when process A writes data to a socket it will not necessarily be sent immediately or all at once. Depending on the underlying transport, there's an MTU size and any write larger than that will be broken up. Smaller writes may also be delayed and combined with later writes to make up the MTU. So in your case the send of 100 bytes might be too large (and broken up), or might be too small and not be transmitted immediately.

Is send() function in TCP Guaranteed to arrive in order?

It is known that the return value of send() can be less than length, and it means that a part of the message, not the whole message, has arrived. I'm supposed to send 2 packets whose contents are "ABC" and "DEF" respectively, and their length is 3. I want to send "DEF" by send() after send() was called to transfer "ABC". However, there is a case where the return value of send() for "ABC" is less than its length, 3. I think there is a chance that messages are not delivered in order. For example, if the return value for "ABC" is 2, the received message would be "ABDEF".
Is send() function in TCP Guaranteed to arrive in order?
First of all, send() itself doesn't guarantee anything; send() only writes the data you want to send over the network to the socket's buffer. There it's segmented (placed in TCP segments) by the operating system, which manages the reliability of the transmission. If the underlying buffer is full, you'll get a return value that's lower than the number of bytes you wanted to write. This usually indicates that the buffer is not being emptied by the operating system fast enough, i.e. the rate at which you write data to the buffer is higher than the rate at which the data is being sent to the network (or read by the other party).
Second, TCP is a stream protocol, if you send() "ABC" and then "DEF", there's no guarantee about how that data will be segmented, it might end up in one packet, or in six packets. Exactly like writing data to a file.
Third, the network stack (TCP/IP implementation in the OS) guarantees in-order delivery, as well as the other nice things promised by TCP - reliability, congestion control, flow control, etc.
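
To make this concrete, a small sketch (assuming a blocking POSIX stream socket fd; send_all is my own helper name): if send() accepts fewer bytes than requested, keep sending the remainder; the stack then delivers everything in the order it was handed over, so "ABC" followed by "DEF" cannot arrive as "ABDEF".

#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>

bool send_all(int fd, const void* buf, size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) return false;             // error (check errno, retry on EINTR if desired)
        p += n;
        len -= static_cast<size_t>(n);
    }
    return true;
}

// Usage: send_all(fd, "ABC", 3); send_all(fd, "DEF", 3);  // receiver sees "ABCDEF"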

Is a successful send() "atomic"?

Does a successful call to send() with the number returned equal to the amount specified in the size parameter guarantee that no "partial sends" will occur?
Or is there some way that the OS might be interrupted while servicing the system call, send part of the data, wait for a possibly long time, then send the rest and return without notifying me with a smaller return value?
I'm not talking about a case where there is not enough room in the kernel buffer; I realize that I would then get a smaller return value and have to try again.
Update:
Based on the answers so far, my question could be rephrased as follows:
Is there any way for packets/data to be sent over the wire before the call to send() returns?
Does a successful call to send() with the number returned equal to the amount specified in the size parameter guarantee that no "partial sends" will occur?
No, it's possible that part of your data gets passed over the wire while another part only goes as far as being copied into the internal buffers of the local TCP stack. send() returns the number of bytes passed to the local TCP stack, not the number of bytes that get passed onto the wire (and even if the data reaches the wire, it might not reach the peer).
Or is there some way that the OS might be interrupted while servicing the system call, send part of the data, wait for a possibly long time, then send the rest and return without notifying me with a smaller return value?
As send() only returns the number of bytes passed into the local TCP stack, not whether send() actually sends anything, you can't really distinguish these two cases anyway. But yes, it's possible that only some data makes it over the wire. Even if there's enough space in the local buffer, the peer might not have enough space. If you send 2 bytes, but the peer only has room for 1 more byte, 1 byte might be sent and the other will reside in the local TCP stack until the peer has enough room again.
(That's an extreme example; most TCP stacks protect against sending such small segments of data at a time, but the same applies if you try to send 4k of data and the peer only has room for 3k).
I'm not talking about a case where there is not enough room in the kernel buffer; I realize that I would then get a smaller return value and have to try again
That will only happen if your socket is non-blocking. If it's blocking and the local buffers are full, send() will wait until there's room in the local buffers again (or it might return a short count if part of the data was delivered but an error occurred in the meantime).
Edit to answer:
Is there any way for packets/data to be sent over the wire before the call to send() returns?
Yes. That might happen for many reasons.
e.g.
The local buffers get filled up by that recent send() call, and you use blocking I/O.
The TCP stack sends your data over the wire but decides to schedule other processes to run before that sending process returns from send().
Though this depends on the protocol you are using, the general answer is no.
For TCP the data gets buffered inside the kernel and then sent out at the discretion of the TCP packetization algorithm, which is pretty hairy - it keeps multiple timers and minds the path MTU, trying to avoid IP fragmentation.
For UDP you can only assume this kind of "atomicity" if your datagram does not exceed the link frame size (the usual value is 1472 = 1500 bytes of Ethernet frame - 20 bytes of IP header - 8 bytes of UDP header). Otherwise your sending host will have to IP-fragment the datagram.
Then intermediate routers can still IP-fragment the passing packet if their outgoing link MTU is less than the packet size.
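
Spelling out the arithmetic (assuming plain IPv4 + UDP over a standard 1500-byte Ethernet link; the numbers change with IPv6, VLAN tags, jumbo frames, etc.):

#include <cstddef>

constexpr std::size_t kEthernetMtu  = 1500;
constexpr std::size_t kIpv4Header   = 20;   // assumes no IP options
constexpr std::size_t kUdpHeader    = 8;
constexpr std::size_t kMaxUnfragged = kEthernetMtu - kIpv4Header - kUdpHeader;  // 1472

static_assert(kMaxUnfragged == 1472, "datagrams above this size will be IP-fragmented");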

TCP Socket Transfer

A while back I had a question about why my socket sometimes received only 653 octets (for example) when I sent 1024 octets, and thanks to Rakis I understood: the OS allows reception to occur in arbitrarily sized chunks.
This time I need a confirmation :)
On any OS (well, GNU/Linux and Windows at least), in any language (I'm using Python here), if I send a packet of a random number of bytes, can be 2 bytes, can be 12000 bytes, let's say X, when I write socket.send(X), am I absolutely guaranteed that X will be FULLY received (regardless of any chunks the receiving OS divides it into) on the other end of the socket BEFORE I do another socket.send(any string)?
Or in other words if i have the code :
socket.send(X)
socket.send(Y)
Even if X > MTU, so it will be obliged to send multiple packets, does it wait until every packet is sent and acknowledged by the endpoint of the socket before sending Y? Well, writing that makes me believe that the answer is yes, it is guaranteed, and that this is exactly the purpose of setting a socket in blocking mode, but I want to be sure :D
Thanks in advance,
Nolhian
You are guaranteed that X will be received (at the application level) before Y, if it's a stream socket. If it's a datagram socket, no guarantees.
Depending on the networking implementation, it's possible that at a lower level, X will be sent, lost in transmission, then Y will be sent, then X will be re-sent because no acknowledgement was received.
Even in blocking mode, the socket.send(Y) can execute before X even makes it "onto the wire", because the OS will buffer network traffic.
No, you can't.
All you know is that the client will receive the data in order, assuming it does receive it all. There's no way of knowing (at the application level) whether the client has received all the data without having some sort of "ACK" at the application level protocol.
am I absolutely guaranteed that X will be FULLY received (regardless of any chunks the receiving OS divides it into) on the other end of the socket BEFORE I do another socket.send(any string)?
No. In general, more data may be sent without waiting for the receiving side, within certain limits: on the sending side, you will have a maximum amount of data you can enqueue for transmission until the client has acknowledged some receipt (but typically the client's OS will acknowledge and buffer quite a lot before it refuses further data until the application has processed some), after which the sending socket may start blocking. This limit:
- forces the application design to consider how to enqueue and buffer excessive amounts of data, rather than having naively written applications utilise excessive amounts of Operating System-provided buffer memory
- reduces retransmission rates when the receiving side is flooded with data too fast to process it
- avoids sending huge amounts of data despite the network connection having been lost
So, strictly speaking and for large transmissions, the sender should be designed to handle sockets blocked from further sends (either knowing it is ok to block in the attempt (perhaps due to a dedicated sending thread) or waiting until it is possible to send more via non-blocking sockets or select/poll).
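
As one possible shape of that, a small sketch of the non-blocking variant (assuming fd was opened with O_NONBLOCK; send_all_nonblocking is my own helper name): when send() reports EAGAIN/EWOULDBLOCK, wait with poll() for POLLOUT instead of spinning or blocking inside send().

#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstddef>

bool send_all_nonblocking(int fd, const void* buf, size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n > 0) { p += n; len -= static_cast<size_t>(n); continue; }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            // Send buffer full: wait until the kernel can accept more data.
            pollfd pfd{};
            pfd.fd = fd;
            pfd.events = POLLOUT;
            if (poll(&pfd, 1, -1) < 0) return false;
            continue;
        }
        return false;                        // real error
    }
    return true;
}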
Whatever retransmission and buffering may be required, what you CAN be sure of is that the receiving side will have to read all of "X" before it starts being given the subsequently sent data "Y" (unless it specifically asks to have it otherwise, e.g. Out Of Band data).
Depending on the type of socket that you use, you can, in some cases, have a guarantee that data will be received, but no feedback or confirmation of when it actually was.
Back to your question:
does it wait until every packet is sent and acknowledged by the endpoint of the socket before sending Y
So, you could say:
YES it does wait until it is sent, and
NO it does not wait for acknowledgment
A suggestion:
Since there are no auto-magic/built-in confirmations that your data was received, you could fairly easily implement your own logic for ACKnowledging that the message was received, which would basically come down to your custom communication protocol.
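
For example, a minimal application-level ACK sketch (reusing the send_all()/recv_all() helpers sketched earlier on this page; the 1-byte ACK value and the function names are arbitrary choices of mine):

#include <sys/types.h>
#include <sys/socket.h>
#include <cstdint>
#include <cstddef>

bool send_all(int fd, const void* buf, size_t len);   // as sketched above
bool recv_all(int fd, void* buf, size_t len);         // as sketched above

constexpr uint8_t kAck = 0x06;

// Sender: transmit the payload, then block until the peer confirms
// it has actually read (not merely received) everything.
bool send_with_ack(int fd, const void* buf, size_t len) {
    if (!send_all(fd, buf, len)) return false;
    uint8_t ack = 0;
    return recv_all(fd, &ack, 1) && ack == kAck;
}

// Receiver: after consuming `len` bytes of payload, tell the sender so.
bool recv_with_ack(int fd, void* buf, size_t len) {
    if (!recv_all(fd, buf, len)) return false;
    return send_all(fd, &kAck, 1);
}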