Give priority to the (software) time stamping thread in e1000e net driver - linux-device-driver

I am new to Linux programming. I have an Intel NIC that uses the e1000e driver, and I am working on precise timestamping of packets captured from the network. An interrupt is generated on every packet reception, and an interrupt handler registered by the driver queues each captured packet and timestamps it. My question is: if I want to use RT Linux, how can I increase the priority of the timestamping thread?
Is this possible?

Related

HW IO and CPU low jitter application

I have a hardware IO task (write/read serial messages) with a strict jitter requirement of less than 200 microseconds. I need to be able to isolate both a CPU core (or cores) and the hardware/interrupt handling.
I have tried 2 things that have helped but not gotten me all the way there.
Using <termios.h> to configure the tty device. Specifically setting VMIN=packet_size and VTIME=0
Isolcpus kernel argument in /etc/default/grub and running with taskset
I am still seeing upwards of 5 ms (5000 us) of jitter on my serial reads. I tested this same application with pseudo serial devices (created by socat) to eliminate the HW variable but am still seeing high jitter.
My test application right now just opens a serial connection, configures it, then does a while loop of writes/reads.
Could use advice on how to bring the jitter down to 200 us or less. I am considering moving to a dual-boot RTOS/Linux setup with shared memory, but would rather solve this on one OS.
Real Application description:
Receive message from USB serial
Get PTP (precision time protocol) time within 200 us of receiving the first bit
Write packet received along with timestamp to shared memory buffer shared with a python application: <timestamp, packet>.
Loop.
Also on another isolated HW/core:
Read some <timestamp, packet> from a shared memory buffer
Poll PTP time until <timestamp>
Transmit <packet> within 200 us of <timestamp> over USB serial
Loop
To reduce process latency/jitter:
Isolate some cores
/etc/default/grub... isolcpus=0
Never interrupt RT tasks
set /proc/sys/kernel/sched_rt_runtime_us to -1
Run high priority on isolated core
schedtool -a 0 -F -p 99 -n -20 -e $CMD
To reduce serial latency/jitter (see the sketch after this list)
File descriptor options O_SYNC
Ioctl ASYNC_LOW_LATENCY
Termios VMIN = message size and VTIME = 0
Use tcdrain after issuing write commands
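A minimal C sketch combining the serial-side settings above (O_SYNC on open, the ASYNC_LOW_LATENCY ioctl, VMIN/VTIME, and tcdrain after a write). The device path /dev/ttyUSB0 and the 32-byte message size are placeholders, not values from the post.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>
#include <linux/serial.h>

#define MSG_SIZE 32                     /* hypothetical fixed message size */

int main(void)
{
    int fd = open("/dev/ttyUSB0", O_RDWR | O_NOCTTY | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the serial driver for low-latency behaviour (ASYNC_LOW_LATENCY). */
    struct serial_struct ser;
    if (ioctl(fd, TIOCGSERIAL, &ser) == 0) {
        ser.flags |= ASYNC_LOW_LATENCY;
        ioctl(fd, TIOCSSERIAL, &ser);
    }

    /* Raw mode; read() returns only once a full message is buffered. */
    struct termios tio;
    tcgetattr(fd, &tio);
    tio.c_lflag &= ~(ICANON | ECHO | ISIG);
    tio.c_iflag &= ~(IXON | ICRNL | INLCR | ISTRIP);
    tio.c_oflag &= ~OPOST;
    tio.c_cc[VMIN]  = MSG_SIZE;         /* block until MSG_SIZE bytes arrive */
    tio.c_cc[VTIME] = 0;                /* no inter-byte timeout */
    tcsetattr(fd, TCSANOW, &tio);

    unsigned char buf[MSG_SIZE];
    ssize_t n = read(fd, buf, sizeof buf);   /* one full message */
    if (n > 0) {
        write(fd, buf, n);                   /* echo it back */
        tcdrain(fd);                         /* wait until it has left the UART */
    }
    close(fd);
    return 0;
}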

How exactly do socket receives work at a lower level (eg. socket.recv(1024))?

I've read many Stack Overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below for which I would like some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different times, 0.0001 seconds apart (e.g. one packet arrives at 12.00.0001 pm, the next at 12.00.0002 pm, and so on).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, 100 us is "forever," but for a 10 Mbit interface you can't even transmit the packets that fast, so I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "push" (PSH) flag, which signals that the payload should be delivered to the application immediately. So if we hop into the Wayback machine, the answer would have been something like: it depends on whether or not the PSH flag is set in the packets. However, there is generally no user-space API to control whether or not the flag is set; what generally happens is that for a single write that gets broken into several packets, the final packet has the PSH flag set. So the answer for a slow network and a weakling CPU might be that if it was a single write, the application would likely receive all 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second through fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm waits for the ACK of the first packet before sending anything less than a full frame.
If we return from our trip in the Wayback machine to the modern age, where 100 Gigabit interfaces and 24-core machines are common, our problems are very different, and you will have a hard time finding an explicit check for the PSH flag in the Linux kernel. What drives the design of the receive side is that networks are getting much faster while the packet size/MTU has stayed largely fixed, and CPU speed is flatlining while cores are abundant. Reducing per-packet overhead (including hardware interrupts) and distributing packets efficiently across multiple cores is imperative. At the same time, it is imperative to get the data from that 100+ Gigabit firehose up to the application as soon as possible: one hundred microseconds of data on such a NIC is a considerable amount of data to be holding onto for no reason.
I think one of the reasons there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow: it is much easier to trace the flow of packets to the NIC, and we are in full control of when a packet will be sent. On the receive side, packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher layers of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture, seen a jumbo frame, and scratched your head because you were sure the MTU was 1500, it is because this processing happens at such a low level that it occurs before netfilter gets its hands on the packet. This packet coalescing is part of a capability known as receive offloading. In particular, let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading); its purpose is to reduce per-packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off the ring buffer (as long as more data is coming in) and handing them off to GRO to consolidate if it can, and then they get handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
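Not part of the original answer, but for reference the knob is exposed per network device in sysfs; a tiny C sketch that reads it, with eth0 as a placeholder interface name:

#include <stdio.h>

int main(void)
{
    char buf[64];
    /* gro_flush_timeout is in nanoseconds; 0 means no timer-based flush. */
    FILE *f = fopen("/sys/class/net/eth0/gro_flush_timeout", "r");
    if (f) {
        if (fgets(buf, sizeof buf, f))
            printf("gro_flush_timeout = %s", buf);
        fclose(f);
    }
    return 0;
}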
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
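The question is framed in Python, but the pattern is the same in any language: keep calling the receive function until the expected amount of data has been accumulated. A minimal C sketch (the helper name recv_exactly and the connected socket sock are just illustrative):

#include <sys/types.h>
#include <sys/socket.h>

/* Keep calling recv() until `want` bytes have been collected (600 in the
 * question), the peer closes the connection, or an error occurs. */
ssize_t recv_exactly(int sock, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = recv(sock, buf + got, want - got, 0);
        if (n == 0)            /* peer closed the connection */
            break;
        if (n < 0)             /* error: check errno, retry on EINTR, ... */
            return -1;
        got += (size_t)n;      /* recv may return anywhere from 1..want bytes */
    }
    return (ssize_t)got;
}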

What is the difference between "Interrupt coalescing" and the "Nagle algorithm"?

Is the main difference the following?
Interrupt coalescing (ethtool -C eth1 rx-usecs 0) - coalesces received packets from different connections, i.e. increases bandwidth but increases receive latency
Nagle's algorithm (disabled via the TCP_NODELAY socket option) - coalesces sent packets from the same connection, i.e. increases bandwidth but increases send latency
Interrupt coalescing concerns the network driver: the idea is to avoid invoking the interrupt handler anew every time a network packet shows up. Instead, after receiving a packet, the NIC waits until M packets are received or until N microseconds have passed before generating an interrupt. Then the driver can process many packets at once. (Otherwise, with modern gigabit and 10-gigabit adapters, the processor would need to field hundreds of thousands or millions of interrupts per second, which can prevent the system from being able to accomplish much else.) As your link points out, there is (or at least may be) a cost of additional latency since the OS doesn't start processing a received packet at the earliest possible instant.
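Not part of the original answer, but for concreteness: the same rx-usecs / rx-frames knobs that ethtool -C sets can be queried programmatically through the SIOCETHTOOL ioctl; a C sketch assuming a device named eth1:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, "eth1", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("rx-usecs=%u rx-frames=%u\n",
               ec.rx_coalesce_usecs,            /* the N microseconds */
               ec.rx_max_coalesced_frames);     /* the M packets */
    else
        perror("SIOCETHTOOL");
    close(fd);
    return 0;
}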
Nagle's algorithm is focused on reducing the number of packets sent by coalescing payload data from multiple packets into one. The classic example is a telnet session. Without Nagle, every time you press a key, the system has to create an entire new packet (min 64 bytes on Ethernet) to send one byte.
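For reference, disabling Nagle is a per-socket option; a minimal C sketch (the function name is just for illustration, and sock is assumed to be a valid SOCK_STREAM socket):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int disable_nagle(int sock)
{
    int one = 1;
    /* With TCP_NODELAY set, small writes go out immediately instead of being
     * held back until the previous small packet has been acknowledged. */
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}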
So the intent of interrupt coalescing is to support greater bandwidth utilization, while the intent of Nagle's algorithm is actually to produce lower bandwidth (by sending fewer packets).

interrupt scheduling in .Net Micro Framework

.NET MF doesn't support preemptive interrupts. Once a process is completed or the 20 msec timer assigned by the scheduler times out, the interrupts can be processed.
Is there any way to change this 20 msec to a shorter time, or change the scheduler process and make it like an real-time scheduler?
Alternatively, assume the 20 msec delay before interrupt processing begins can be tolerated, but the exact time at which the interrupt occurred is a must-know factor. I think that with the time argument of an InterruptPort event handler, one can work backwards and determine the time at which the interrupt got queued.
However, what if a serial port is used and the time of data arrival at the port must be known? Is there any way to determine at what time data arrived at the serial port, or when its corresponding interrupt was queued by the framework? Thanks.
The .NET Micro Framework was designed to be used on memory-constrained microcontrollers without an MMU (Memory Management Unit). Even with its multithreading model and TCP/IP stack support, the .NET Micro Framework does not run on a classical real-time operating system structure.
If you need strict real-time capability, please check out Windows CE 6.0 R3 and Windows Compact 7, which are hard real-time operating systems and have native 32-bit real-time support.
Hope my answer helps you.

Polling vs Interrupt

I have a basic doubt regarding interrupts. Imagine a computer that does not have any interrupts: in order to do I/O, the CPU would have to poll* the keyboard for a key press, the mouse for a click, etc. at regular intervals. Now, if it has interrupts, the CPU will keep checking whether the interrupt line went high (or low) at regular intervals. So how are CPU cycles saved by using interrupts? As per my understanding, instead of checking the device we are now checking the interrupt line. Can someone explain what basic logic I am getting wrong?
*Here by polling I don't mean that the CPU is in a busy-wait. To quote Wikipedia "Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task"
@David Schwartz and @RKT are right: it doesn't take any CPU cycles to check the interrupt line.
Basically, the processor has a set of interrupt wires which are connected to a bunch of devices. When one of the devices has something to say, it turns its interrupt wire on, which triggers the processor (without the help of any software) to pause the execution of current instructions and start running a handler function.
Here's how it works. When the operating system boots, it registers a set of callbacks (a table of function pointers, actually) with the processor using a special instruction which takes the address of the first entry of the table. When interrupt N is triggered, the processor pulls the Nth entry from the table and runs the code at the location in memory it refers to. The code inside the function is written by the OS authors in assembly, but typically all it does is save the state of the stack and registers so that the current task can be resumed after the interrupt handler has run, and then call a higher-level common interrupt handler written in C that handles the logic of "if this is a page fault, do X", "if this is a keyboard interrupt, do Y", "if this is a system call, do Z", etc. Of course there are variations on this with different architectures and languages, but the gist of it is the same.
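A purely illustrative userland analogue of that "table of function pointers" (the vector numbers and handlers are made up, and real interrupt dispatch is done by the CPU in hardware, not by C code like this):

#include <stdio.h>

#define NUM_VECTORS 256

typedef void (*handler_fn)(void);

static void page_fault_handler(void) { puts("page fault"); }
static void keyboard_handler(void)   { puts("keyboard");   }

static handler_fn vector_table[NUM_VECTORS];   /* the registered callbacks */

static void dispatch(int n)     /* hardware indexes the table by vector number */
{
    if (vector_table[n])
        vector_table[n]();
}

int main(void)
{
    vector_table[14] = page_fault_handler;     /* vector numbers chosen arbitrarily */
    vector_table[33] = keyboard_handler;
    dispatch(33);                              /* "interrupt 33 was triggered" */
    return 0;
}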
The idea with software interrupts ("signals", in Unix parlance) is the same, except that the OS does the work of setting up the stack for the signal handler to run. The basic procedure is that the userland process registers signal handlers one at a time to the OS via a system call which takes the address of the handler function as an argument, then some time in the future the OS recognizes that it should send that process a signal. The next time that process is run, the OS will set its instruction pointer to the beginning of the handler function and save all its registers to somewhere the process can restore them from before resuming the execution of that process. Usually, the handler will have some sort of routing logic to alert the relevant bit of code that it received a signal. When the process finishes executing the signal handler, it restores the register state that existed previous to the signal handler running, and resumes execution where it left off. Hence, software interrupts are also more efficient than polling for learning about events coming from the kernel to this process (however this is not really a general-use mechanism since most of the signals have specific uses).
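A minimal C sketch of the userland side described above: register a handler for one signal via a system call, then block until it has run (SIGUSR1 is chosen arbitrarily):

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

static void on_sigusr1(int signo)
{
    (void)signo;
    got_signal = 1;            /* only async-signal-safe work belongs here */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);   /* the system call that registers the handler */

    printf("send SIGUSR1 to pid %d\n", (int)getpid());
    while (!got_signal)
        pause();                     /* returns after a handler has run */
    puts("handler ran; normal execution resumed here");
    return 0;
}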
It doesn't take any CPU cycles to check the interrupt line. It's done by dedicated hardware, not CPU instructions. The reason it's called an interrupt is because if the interrupt line is asserted, the CPU is interrupted.
"CPU is interrupted" : It will leave (put on hold) the normal program execution and then execute the ISR( interrupt subroutine) and again get back to execution of suspended program.
CPU come to know about interrupts through IRQ(interrupt request) and IF(interrupt flag)
Interrupt: An event generated by a device in a computer to get attention of the CPU.
Provided to improve processor utilization.
To handle an interrupt, there is an Interrupt Service Routine (ISR) associated with it.
To interrupt the processor, the device sends a signal on its IRQ line and continues doing so until the processor acknowledges the interrupt.
The CPU then performs a context switch by pushing the Program Status Word (PSW) and PC onto the control stack.
The CPU executes the ISR.
Polling, on the other hand, is the process where the computer waits for an external device and checks it for readiness.
The computer does nothing other than check the status of the device.
Polling is often used with low-level hardware.
Example: when a printer is connected via a parallel port, the computer waits until the next character has been received by the printer.
This process can be as small as reading a single byte.
There are two different methods (polling and interrupts) to service the I/O of a computer system. In polling, the CPU continuously stays busy checking whether input data has been presented by an I/O device; if so, it checks the source port of the corresponding device and the priority of that input in order to service it.
In the interrupt-driven approach, when data is presented by an I/O device, an interrupt is generated and the CPU checks the priority of that input to service it.
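An illustrative C sketch of the contrast, using a plain variable as a stand-in for a device status register (this is not real driver code; with a real device the handler would be invoked by hardware, not called from main):

#include <stdbool.h>
#include <stdio.h>

static volatile bool data_ready;        /* stands in for a device status bit */

static void read_byte(void) { puts("byte read from device"); }

/* Polling: the CPU spends its own cycles repeatedly asking "ready yet?". */
static void polled_io(void)
{
    while (!data_ready)
        ;                               /* busy-wait burns CPU cycles */
    read_byte();
}

/* Interrupt-driven: the CPU runs other work; dedicated hardware watches the
 * IRQ line and transfers control to the handler only when it is asserted. */
static void interrupt_handler(void)
{
    read_byte();
}

int main(void)
{
    data_ready = true;                  /* pretend the device just signalled */
    polled_io();
    interrupt_handler();
    return 0;
}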