This paragraph if from UNP,
chapter 21.3 page 555
A host running an application that has joined some multicast group whose
corresponding Ethernet address just happens to be one that the interface
receives when it is programmed to receive 01:00:5e:00:01:01 (e.e., the
interface card performs imperfect filtering). This frame will be discarded
either by datalink layer or by the IP layer.
I just don't know which special case is the author talking about. Could you help me explain it clearly?
IN IPV4. A multicast Address (old class D) consists of 4 bits fixed for identifying it as multicast(1110), and the remaining 28 bits to Identify the group.
Since there are only 23 Bits available in a MAC Address (the high order 25 bits are fixed), when you map the lower order 23 bits of the multicast address into the lower order 23 bits of the mac you lose 5 bits of addressing information. So multiple Multicast addresses all have the same MAC address.
for example
237.138.0.1
238.138.0.1
239.138.0.1
all map to MAC address: 01:00:5e:0a:00:01 (There are more, this is just a subset to illustrate)
so if you join group 237.138.0.1, your ethernet card will start sending frames up the stack for that MAC. Since it is an imperfect match (since we discarded those 5 bits), the ethernet card will also send 238.138.0.1 and 239.138.0.1 up the stack as well. But since you are not interested in those frames they will be discard at Layer 2 (data link) or Layer 3 (Network) when they can be matched exactly.
So the special case is that if you have multiple multicast streams that occupy the same lower 23 bits of address space, all hosts on the network segment are going to have to process the packets higher up in the stack and thus do more work to tell if the packet they got is one they are interested in).
normally you just need to make sure when planning your multicast deployments, that you try to avoid overlapping addresses.
Related
This question was closed on the Networking SE because "questions about protocols above OSI layer-4 are off-topic here" so I'm trying here.
This may be a silly question, but if in QUIC we maintain separate sliding windows for each stream, why is there a need for sequencing even below the stream level?
It seems to me that an application will not receive the same data twice because we already sequence streams by themselves, and we also can acknowledge each stream separately without sequencing the packets themselves.
I presume you mean packet number, as there isn't a sequence number as such in QUIC.
If so, then the packet number used in QUIC performs a number of roles, such as being used within the ACK process, sequencing at the packet level, and also is used as part of the nonce input into the AEAD encryption process.
From https://datatracker.ietf.org/doc/html/rfc9001 5.3 AEAD
The nonce, N, is formed by combining the packet protection IV with the packet number.
From https://datatracker.ietf.org/doc/html/rfc9000 13.2.3. Managing ACK Ranges
A receiver SHOULD include an ACK Range containing the largest received packet number in every ACK frame.
I've read many stack overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different timings with a difference of 0.0001 seconds (eg. one packet arrives at 12.00.0001pm and another packet arrives at 12.00.0002pm, and so on..).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, the 100us is "forever," but for a 10 mbit interface, you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag to signal that the payload should be immediately delivered to the application. So if we hop into the Waybak
machine the answer would have been something like it depends on whether or not the PSH flag is set in the packets. However, there is generally no user space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
If we return from our trip in the Waybak machine to the modern age where 100 Gigabit interfaces and 24 core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting way faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a nic is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, wheres the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture and seen a jumbo frame and scratched your head because you sure thought the MTU was 1500, this is because this processing is at such a low level it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading, and in particular lets assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing it off to GRO to consolidate if it can, and then it gets handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
In Ethernet networks, the MAC layer is the first layer to detect the destination address of the received message.
my questions: is that means that the transceiver shall take a copy of each message on the bus and forward it to the MAC layer who will decide to accept that message or discard it? If so, this means that the MAC layer must have a very large buffers to save all that intended and non intended message. am I correct ?
The MAC layer does not typically have much buffering. It may not even be able to store a full packet. Packets instead stream through the MAC.
Packets enter and exit the MAC one flit at a time. It may take hundreds of cycles for a full packet to pass into a MAC depending on the size of the packet and the width of the interface. For example, a MAC with an 8-byte interface (8-byte flit size) will take 1000 cycles to receive an 8kB packet.
The MAC may only have 800 bytes of buffering. In that case, the packet will start coming out the other end after 100 cycles when only 10% of the packet has entered. In fact, many MACs have a latency well below 100 cycles.
Packets which are rejected on the basis of destination address stream in one side but nothing comes out the other side. The frames are simply forgotten/dropped as they arrive.
I understand that when sending ip messages around, each hop in the network path between be and my packet's destination will check if the next hop's MTU is bigger than the size of the packet I sent. If so, the packet will be fragmented and the two packets will be separately sent to the next hop, only to be reassembled at destination (or, in some cases, at the first NAT router encountered).
As far as I understand, this thing can be pretty bad, but I don't really understand why.
I understand that if the connection tends to drop a lot of packets, losing a single fragment means I have to resend the whole packet (this is actually the only thing I figured out myself)
Is there a chance that instead of being fragmented my packet will just be dropped?
How are packet fragments identified? Can I be 100% sure that they will be reassembled correctly? On example, if I send two ip packets of the same length nearly simultaneously to the same destination, how likely it is that fragments of the two will be swaped, like AAA, BBB reassembled into ABA, BAB?
In principle, if packets aren't dropped and fragments are reassembled correctly, actually using packet fragmentation seems like a good idea to save on local bandwidth and avoid having to send more and more headers instead of just one big packet.
Thank you
IP fragmentation can cause several problems:
1) Application layer loss is increased
As you mentioned, if a single fragment is dropped, the entire layer 4 packet will be lost. Thus, for a network with a small random packet loss rate, the application layer loss rate is increased by a factor approximately equal to the number of fragments for each layer 4 packet.
2) Not all networks handle fragmented packets
Some systems, such as Google's Compute Engine, do not reassemble fragmented packets.
3) Fragmentation can cause re-ordering
When routers split traffic down parallel paths, they may try to keep packets from the same flow on a single path. Because only the first fragment has layer 4 information like UDP/TCP port number, subsequent fragments may be routed down a different path, delaying assembly of the layer 4 packet and causing re-ordering.
4) Fragmentation can cause confusing behavior that is hard to debug
For example, if you send two UDP streams, A and B, from one source to a destination running Linux, the destination may discard packets from one of the streams. This is because by default, Linux "times out" fragment queues if more than 64 other fragments have been received from the same source. If stream A has a much higher data rate than stream B, 64 fragments from stream A may arrive in between the fragments from stream B, causing the B fragment to be dropped.
Thus, while IP fragmentation can reduce overhead by minimizing user headers, it may cause more trouble than it is worth.
To my knowledge, the only case where packets will be dropped rather than fragmented (barring cases where it would be dropped anyway), is packets which are marked "don't fragment". These packets are to be discarded rather than being fragmented.
Fragmented packets have identifier, fragment offset, and more fragments fields in their headers that, when combined, allow the destination host to reliably reassemble the packet upon receipt of all the fragments. The first fragment's offset is zero, and the last fragment has the more fragments flag set to zero. It is still possible (although very unlikely) to reassemble an incorrect packet if two packets' headers are mutated so their fragment offsets are exchanged, but their checksums are still valid. The probability of this happening is essentially zero. Bear in mind that IP does not provide any mechanism for ensuring the integrity of the data payload, only the integrity of the control information in the header.
Packet fragmentation necessarily wastes bandwidth because each fragment has a copy of [most of] the original datagram's header. Packets can be fragmented down to only 8 bytes per fragment, so we could have a maximum-sized packet at 60 + 65536 bytes fragmented into 60 * 8192 + 65536 bytes, yielding a payload increase of about 750% in the worst case. The only example I can come up with where you would come out ahead is if you fragmented a packet in order to send its fragments in parallel using some kind of Frequency Division Multiplexing scheme with the knowledge that the other channels are free. At that point, it still seems like it would require more work than would be saved to detect that circumstance and divide the packet rather than just sending it.
All the basic details about the mechanics of packet fragmentation in IP can be found in IETF RFC 791, if you're hungry for more information.
I am looking at a blocking call of send() and looking if there is a way to measure the times spent in the function while be able to know what event's occured during that time so that a qualitative analysis can be made about the connection speed etc.
With that, one of the first thing to know is at which layer does the function return success.
The send() API is going to return success nearly immediately IFF there is sufficient buffer space available to hold the data and the routing tables still show a way to route the packet to the peer. (It doesn't actually have to be able to reach the peer -- simply that the machine must have a next hop available...) If it needs to wait for buffer space to free up, it will. (Watching data being ACKed or sent on the wire ought to be easy with Wireshark.)
Incidentally, the OSI layers are imperfectly applied to the TCP/IP family of protocols; layers 1 and 2 fit very closely, layer 3 is roughly IP routing, layer 4 is roughly TCP, UDP, SCTP, ICMP, and so on. But layers 5, 6, 7 don't have real analogs -- SMTP over TLS might be considered a layer 7 or perhaps the SMTP is layer 7 and the TLS is layer 6 ... it all gets pretty fuzzy very quickly.
It's easier to just talk about the specific layer in the TCP/IP protocol stack that you're curious about. send() works with stream, datagram, and raw sockets, so it could straddle multiple layers of the stack -- you could use it to send TCP, UDP, SCTP, or ICMP packets, or scribble directly on the wire if you wish.