Packet generation in PCI/PCIe devices

I have a few questions about PCI/PCIe packet generation and CRC generation/calculation. I have tried many searches but could not find a satisfactory answer. Please help me understand the points below.
1. How are packets (TLP, DLLP, and PLLP) formed in a PCI/PCIe system? For example, let's say the CPU generates a memory read/write from/to a PCIe device (here the device is mapped into memory). This request will be received by the PCI/PCIe Root Complex. The Root Complex will generate the TLP, and the DLLP and PLLP will be generated and appended to the TLP accordingly to form a PCI/PCIe packet. This packet will be claimed by one of the root ports based on the MMIO address ranges. Each port on the switch/endpoints generates the DLLP and PLLP and passes the packet over to the next device on the link, where it will be stripped and checked for errors.
Q.1 - Is it true that packet generation/checking is done entirely by hardware? What contribution does software make to packet generation, and to checking packets for errors on the receiving device?
Q.2 - How are the ECRC and LCRC generated for a packet? The LCRC is generated and checked at each PCI/PCIe device/port, while the ECRC is generated only once by the requester, which is the Root Complex in our example. So are ECRC/LCRC generation and checking done completely in hardware? Can someone please explain with an example how the CRC/ECRC is generated and checked from the moment the CPU issues a PCI read/write request?
Q.3 - When we say that the "Transaction Layer", "Data Link Layer", and "Physical Layer" generate the TLP, DLLP, and PLLP respectively, do these layers refer to hardware or software layers?
I think that if software came into play every time a packet or CRC is generated/checked, it would slow down the data transfer; hardware can do these tasks much faster.
Please correct me if I am wrong somewhere. I want to understand the above scenarios from a HW vs. SW point of view. Please help.
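(As a point of reference for Q.2: the LCRC and ECRC are both 32-bit CRCs over the polynomial 0x04C11DB7, the same one Ethernet uses, and the link hardware computes them with wide parallel XOR trees rather than in software. A simplified bit-serial software model of the same calculation is sketched below; the spec additionally defines exact bit ordering and output mapping, which this illustration omits.)

#include <stddef.h>
#include <stdint.h>

/* Simplified bit-serial model of the CRC-32 used for PCIe LCRC/ECRC
   (polynomial 0x04C11DB7, seed 0xFFFFFFFF). Real link hardware computes
   the same function one lane-width per clock with XOR trees; this sketch
   also omits the spec's exact bit-ordering and output-mapping rules. */
uint32_t pcie_crc32_model(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 24;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u
                                      : (crc << 1);
    }
    return crc;
}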

How exactly do socket receives work at a lower level (e.g. socket.recv(1024))?

I've read many Stack Overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below for which I would like some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different times, 0.0001 seconds apart (e.g. one packet arrives at 12.00.0001 pm, another at 12.00.0002 pm, and so on).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100-gigabit interface, 100 µs is "forever", but for a 10 Mbit interface you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "push" (PSH) flag, which signals that the payload should be delivered to the application immediately. So if we hop into the Wayback machine, the answer would have been something like: it depends on whether or not the PSH flag is set in the packets. However, there is generally no user-space API to control whether the flag is set. Generally what would happen is that, for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and a weakling CPU might be that if it was a single write, the application would likely receive all 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm, the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm waits for the ACK of the first packet before sending anything less than a full frame.
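For completeness, disabling Nagle's algorithm is a one-line socket option; a minimal C sketch, assuming fd is an already-connected TCP socket:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm so small writes go out immediately instead
   of being held back until outstanding data is ACKed. */
int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}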
If we return from our trip in the Wayback machine to the modern age, where 100-gigabit interfaces and 24-core machines are common, our problems are very different, and you will have a hard time finding an explicit check for the PSH flag in the Linux kernel. What is driving the design of the receive side is that networks are getting much faster while the packet size/MTU has remained largely fixed, and CPU clock speed is flatlining while cores are abundant. Reducing per-packet overhead (including hardware interrupts) and distributing packets efficiently across multiple cores is imperative. At the same time, it is imperative to get the data from that 100+ gigabit firehose up to the application ASAP: one hundred microseconds of data on such a NIC is a considerable amount of data to be holding onto for no reason.
I think one of the reasons there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow: it is much easier to trace the flow of packets to the NIC, and we are in full control of when a packet will be sent. On the receive side, packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
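The interrupt-then-poll pattern just described is Linux's NAPI mechanism. A heavily simplified driver sketch of its shape follows; all mynic_* names and the mynic_priv struct are hypothetical placeholders, and only the napi_* calls, container_of, and IRQ_HANDLED are real kernel API:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct mynic_priv {                /* hypothetical driver state */
    struct napi_struct napi;
    /* ring buffer bookkeeping, registers, ... */
};

void mynic_mask_irqs(struct mynic_priv *nic);              /* hypothetical */
void mynic_unmask_irqs(struct mynic_priv *nic);            /* hypothetical */
int  mynic_rx_drain(struct mynic_priv *nic, int budget);   /* hypothetical */

/* Hard IRQ handler: do almost nothing, just mask the NIC's interrupts
   and schedule the softirq poll loop. */
static irqreturn_t mynic_isr(int irq, void *dev_id)
{
    struct mynic_priv *nic = dev_id;

    mynic_mask_irqs(nic);          /* silence further hardware IRQs */
    napi_schedule(&nic->napi);     /* kick the softirq poll loop */
    return IRQ_HANDLED;
}

/* Softirq poll loop: drain the ring buffer, then re-enable interrupts
   once there is nothing left to read. */
static int mynic_poll(struct napi_struct *napi, int budget)
{
    struct mynic_priv *nic = container_of(napi, struct mynic_priv, napi);
    int done = mynic_rx_drain(nic, budget);

    if (done < budget) {
        napi_complete_done(napi, done);   /* leave polling mode */
        mynic_unmask_irqs(nic);           /* hardware IRQs back on */
    }
    return done;
}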
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux, which does the coalescing in software instead). If you have ever looked at a network capture, seen a jumbo frame, and scratched your head because you were sure the MTU was 1500, it is because this processing happens at such a low level that it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading. In particular, let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading); its purpose is to reduce the per-packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in), handing them off to GRO to consolidate if it can, and then they get handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. with ethtool -K eth0 gro off), which I think should get you 4 reads of 150 bytes for 4 packets arriving in order like this; but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application when the application is blocked on a read, as in your example.
The other knob you have, if GRO is enabled, is gro_flush_timeout, a per-NIC timeout in nanoseconds (exposed through sysfs) which can be, and I think defaults to, 0. If it is 0, I think your packets may get consolidated (there are many details here, including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to the four packets in your TCP flow. If it is non-zero, they may get consolidated if they arrive within gro_flush_timeout nanoseconds, though to be honest I don't know whether this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
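In other words, if you need an exact byte count, you loop. A typical C-level sketch of that pattern:

#include <sys/socket.h>
#include <sys/types.h>

/* Keep calling recv() until exactly `want` bytes have arrived.
   Returns 0 on success, -1 on error or if the peer closes early. */
int recv_all(int fd, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = recv(fd, buf + got, want - got, 0);
        if (n <= 0)
            return -1;   /* error, or connection closed by the peer */
        got += (size_t)n;
    }
    return 0;
}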

Can I write to an Input Register? (Modbus)

I've been working for 2 months on a Modbus project, and now I've run into a problem.
My client is asking me to write to an input register (address 30001 to 40000).
I thought that wasn't possible, because all the Modbus documentation says that registers 30001 to 40000 are read-only.
Is it even possible to write to those registers? Thanks in advance.
Both holding-register and input-register related function codes contain a 2-byte address field. This means that you can have 65536 input registers and 65536 holding registers in a device at the same time.
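To make the 2-byte address field concrete, here is a sketch of a Read Input Registers (function 04) request frame in C; the slave address and register values are just example choices:

#include <stdint.h>
#include <stdio.h>

/* Standard Modbus RTU CRC-16 (polynomial 0xA001, initial value 0xFFFF). */
static uint16_t modbus_crc16(const uint8_t *buf, int len)
{
    uint16_t crc = 0xFFFF;
    for (int i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
    }
    return crc;
}

int main(void)
{
    uint8_t req[8];
    req[0] = 0x11;       /* slave address (example) */
    req[1] = 0x04;       /* function: Read Input Registers */
    req[2] = 0x00;       /* start address, high byte */
    req[3] = 0x08;       /* start address, low byte: register 8 */
    req[4] = 0x00;       /* quantity, high byte */
    req[5] = 0x01;       /* quantity, low byte: read 1 register */

    /* The full 16-bit address field is why 65536 registers fit per table. */
    uint16_t crc = modbus_crc16(req, 6);
    req[6] = (uint8_t)(crc & 0xFF);   /* CRC low byte first in RTU */
    req[7] = (uint8_t)(crc >> 8);

    for (int i = 0; i < 8; i++)
        printf("%02X ", req[i]);
    printf("\n");
    return 0;
}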
If your client is developing the firmware of the slave, they can place holding registers into the 3xxxx - 4xxxx area. They don't need to follow the memory layout of the original Modicon devices.
If one can afford to diverge from the Modbus standard, it's even possible to increase the number of registers. In one of my projects, I considered using the Preset Single Register (06) function as a bank-select command. Of course, you can't call it Modbus anymore, but the master can still access the slave using a standard library or diagnostic tools.
You can't write to Input Contacts or Input Registers; there is no Modbus function to write to them, as they are read-only by definition.
Modbus is a protocol and in no way specifies where the values are stored, only how they are transmitted.
Currently there are devices that support 6-digit addresses and can therefore address up to 65536 registers per group.

PCIe Understanding

As this domain is new to me, I have some confusion understanding PCIe.
I was previously working on protocols like I2C, SPI, UART, and CAN, and most of these protocols have well-defined docs (a max of 300 pages).
In almost all of the protocols mentioned, from a software perspective, the application just has to write to a data register and the rest is taken care of by the hardware.
For example, in UART, we just load data into the data register and the data is sent out with a start bit, parity bit, and stop bit.
I have read a few things about PCIe online, and here is the understanding I have so far.
During system boot, the BIOS firmware figures out the memory space required by the PCIe device through a magic write-and-read procedure against the BAR in the PCIe device (endpoint).
Once it figures that out, it allocates an address space for the device in the system memory map (no actual RAM is used in the host; the memory resides only in the endpoint. The endpoint is memory-mapped into the host).
I see that PCIe has a few header fields that the BIOS firmware figures out during the bus-enumeration phase.
Now, if the host wants to set a bit in a configuration register located at address 0x10000004 (an address mapped to the endpoint), the host would do something like this (assume just one endpoint exists, with no branches):
*(volatile uint32_t *)0x10000004 |= (1u << Bit_pos);
1. How does the Root Complex know where to direct these requests, given that the BAR is in the endpoint?
Does the RC broadcast to all endpoints, and each endpoint then compares the address with the address programmed in its BAR to see whether it must accept it (like an acceptance filter in CAN)?
Does the RC add all the PCIe header-related info (the host just writes to the address)?
If the host writes to 0x10000004, will it write to the register at location 0x4 in the endpoint?
How does the host know the endpoint has been given an address space starting at 0x10000000?
Is the RC like a router?
The queries above relate only to reading or writing a config register in the endpoint.
The queries below relate to data transfer from the host to the endpoint.
1. Suppose the host asks the endpoint to save particular data present in the DRAM to an SSD; since the SSD is connected to the PCIe slot, will PCIe also perform DMA transfers?
That is, is there a special BAR in the endpoint that the host writes with the start address in DRAM of the data to be moved to the SSD, which in turn triggers the PCIe device to perform a DMA transfer from host to endpoint?
I am trying to understand PCIe relative to the other protocols I have worked on so far. This seems a bit new to me.
The RC is generally part of the CPU itself. It serves as a bridge that routes requests from the CPU downstream, and from the endpoint to the CPU upstream.
PCIe endpoints have Type 0 headers and bridges/switches have Type 1 headers. Type 1 headers have Base (minimum address) and Limit (maximum address) registers. Type 0 headers have BAR registers that are programmed during the enumeration phase.
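The enumeration-time BAR programming starts with the size probe the question calls a "magic write and read". A sketch of it follows, where the pci_cfg_* accessors are hypothetical stand-ins for real configuration-space reads and writes (port 0xCF8/0xCFC or ECAM MMIO in real firmware):

#include <stdint.h>

/* Hypothetical config-space accessors. */
uint32_t pci_cfg_read32(int bus, int dev, int fn, int off);
void pci_cfg_write32(int bus, int dev, int fn, int off, uint32_t val);

#define BAR0_OFFSET 0x10

/* Size requested by a 32-bit memory BAR. */
uint32_t bar0_size(int bus, int dev, int fn)
{
    uint32_t orig = pci_cfg_read32(bus, dev, fn, BAR0_OFFSET);

    /* Write all 1s; the device hard-wires its low address bits to 0,
       so the value read back encodes the size of the region. */
    pci_cfg_write32(bus, dev, fn, BAR0_OFFSET, 0xFFFFFFFFu);
    uint32_t probed = pci_cfg_read32(bus, dev, fn, BAR0_OFFSET);

    pci_cfg_write32(bus, dev, fn, BAR0_OFFSET, orig);  /* restore */

    probed &= ~0xFu;         /* mask the type/prefetch flag bits */
    return ~probed + 1;      /* two's complement yields the size */
}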
After the enumeration phase is complete and all the endpoints have their BARs programmed, the Base and Limit registers in the Type 1 headers of the RC and bridges/switches are programmed.
Ex: Assume a system that has only one endpoint, connected directly to the RC with no intermediate bridges/switches, whose BAR has the value 0xA00000.
If it requests 4 KB of address space in the CPU's memory map (MMIO), the RC would have its Base register as 0xA00000 and its Limit register as 0xAFFFFF (the window is always 1 MB aligned, even though the space requested by the endpoint is much less than 1 MB).
If the CPU writes to address 0xA00004, the RC looks at its Base and Limit registers to see whether the address falls in its range, and routes the packet downstream to the endpoint.
Endpoints use their BARs to decide whether or not they must accept the packets.
The RC, bridges, and switches use Base and Limit registers to route packets to the correct downstream port. A switch can have multiple downstream ports, and each port has its own Type 1 header whose Base and Limit registers are programmed with respect to the endpoints connected to that port. This is how packets are routed.
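The routing decision itself is simple range checking; a sketch, using the example values above:

#include <stdbool.h>
#include <stdint.h>

/* A downstream port's Type 1 memory window (values from the example). */
typedef struct {
    uint32_t mem_base;    /* Base register, 1 MB aligned */
    uint32_t mem_limit;   /* Limit register, inclusive   */
} port_window_t;

/* A bridge/switch forwards a memory request downstream iff the target
   address falls inside the port's base..limit window. */
static bool routes_downstream(const port_window_t *p, uint32_t addr)
{
    return addr >= p->mem_base && addr <= p->mem_limit;
}

int main(void)
{
    port_window_t port = { 0xA00000u, 0xAFFFFFu };
    return routes_downstream(&port, 0xA00004u) ? 0 : 1;  /* routes: yes */
}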
Data transfer between CPU memory and endpoints is done via PCIe Memory Writes. Each PCIe packet has a maximum payload of 4 KB, so if more than 4 KB has to be sent to the endpoint, it takes multiple Memory Writes.
Memory Writes are posted transactions (no completion from the endpoint is returned).

Omron PLC Ethernet card

I have an Ethernet card in an Omron PLC. Is there any way to do an automatic check to see whether the Ethernet card is working? If not, is there a manual way? For example, if the card were to go out on the PLC, it would give an error; but if the card just loses its connection to the server, it would NOT give an error. Any help on how to do this?
There are several types of errors you can check for, and the way you do this depends on the type of error. Things you can check:
ETN unit Error Status (found at PLC CIO address CIO 1500 + (25 x unit number) + 18)
What it reports: IP configuration, routing, DNS, mail, network services, and other errors.
See: Manual Section 8-2
The ETN unit also keeps an internal error log (manual section 8-3) that you can read out to your HMI software (if you use one) using FINS commands. This documents all manner of errors internal to the ETN unit.
There are also other memory reservations in the PLC for CPU bus devices (like the ETN unit) which provide basic status flags you can include in ladder logic to raise alarms, etc. (see section 4-3: Auxiliary Area Data).
These flags indicate, for example, whether the unit is initializing, has initialized successfully, is ready to execute network commands, and whether the last executed command completed OK or returned an error code (which can be read from the error log as above). They can indicate whether the PLC is unable to communicate properly with the ETN device.
You can implement a single-byte location that the server auto-increments each second. Then, every few seconds, your PLC logic checks whether the old reading is the same as the new reading; if it is, you trigger an alarm that the physical server (which is the communication client) is no longer communicating with the PLC Ethernet card.
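Expressed in C rather than ladder logic, the check might look like the sketch below; read_heartbeat_byte() is a hypothetical stand-in for reading the memory location the server increments:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: read the byte the server auto-increments every second. */
uint8_t read_heartbeat_byte(void);

static uint8_t last_value;
static int stale_checks;

/* Call every few seconds; returns false when the server has stopped
   incrementing the heartbeat and the comms alarm should be raised. */
bool server_link_alive(void)
{
    uint8_t now = read_heartbeat_byte();
    if (now == last_value) {
        if (++stale_checks >= 3)   /* tolerate a couple of misses */
            return false;
    } else {
        stale_checks = 0;
        last_value = now;
    }
    return true;
}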

AHCI Driver for my own OS

I have been writing a small AHCI driver for two weeks. I have read this article and Intel's Serial ATA Advanced Host Controller Interface (AHCI) 1.3 specification. There is an example that shows how to read sectors via DMA mode (osdev.org). I performed this operation (ATA_CMD_READ_DMA, 0xC8) successfully, but when I tried to write sectors (ATA_CMD_WRITE_DMA, 0xCA) to the device, the HBA set the error
Offset 30h: PxSERR – Port x Serial ATA Error – Handshake Error
(this is the decoding from the Intel AHCI specification). I don't understand why this happened. Please help me.
In addition, I have tried to issue the IDENTIFY (0xEC) command, but without success...
You asked this question nearly two months ago, so I'm not sure whether you've already figured this out. Please note that I'm writing from memory in terms of what must be done first, etc.; I may not have remembered everything, or remembered it accurately. You should reference the AHCI spec for everything. The methods for doing this are as varied as the programmers who have done it, which is why I'm not including complete code examples.
For starters, ensure that you've set the HBA state machine accordingly. You'll find references for the state machines supported by the HBA in that same AHCI 1.3 spec. In lieu of this, you should check a few registers.
Please note that all page numbers are given with respect to viewing in Adobe Acrobat and are 8 pages more than the numbers printed in the actual document.
From pages 24 and 25 of the spec, check GHC.IE and GHC.AE. These two will turn on interrupts and ensure that the HBA is working in AHCI mode. Another very important register to check is CAP.SSS (page 23). If this bit is high, the HBA supports staggered spin-up, meaning it will not perform protocol negotiation for any port on its own. Before you do the following, store the value of PxSIG (pages 35 and 36).
To actually spin up the port, you'll need to visit pages 33, 34, and 35 of the spec, which cover the PxCMD register. For each port supported by the HBA (check CAP.NP to know how many there are), you'll have to set the PxCMD.SUD bit high. After setting that bit, you'll want to poll PxSSTS (page 36) to check the state of the PHY. You can check CAP.ISS to know what speed you can expect to see "come alive" on PxSSTS.
After spinning up the port, check PxSIG (pages 35 & 36). The value should be different than when you started; I don't recall exactly what you can expect it to become, but it will be different. When communication is actually established, the device sends the host an initial FIS. Without this first FIS, the HBA is unable to communicate with the device. (It's from this first FIS that the HBA sets the correct bits in PxSIG.)
Finally, after all of this, you'll need to set PxCMD.FRE (page 34). This bit in the port command register enables FIS receive: it lets the HBA post FISes received from the device into the FIS receive area. If this bit is low, incoming FISes are ignored and you won't see responses to anything you send.
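Still, a minimal sketch of the spin-up sequence described above might look like this; the register layout follows the AHCI 1.3 port register map, but treat every offset and bit here as something to verify against the spec:

#include <stdint.h>

/* Port registers per the AHCI 1.3 register map (verify against spec). */
typedef volatile struct {
    uint32_t clb, clbu, fb, fbu, is, ie;
    uint32_t cmd;     /* PxCMD: ST bit 0, SUD bit 1, FRE bit 4 */
    uint32_t rsv0;
    uint32_t tfd, sig;
    uint32_t ssts;    /* PxSSTS: DET field in bits 3:0 */
    uint32_t sctl, serr, sact, ci;
} hba_port_t;

#define PXCMD_SUD  (1u << 1)         /* spin-up device       */
#define PXCMD_FRE  (1u << 4)         /* FIS receive enable   */
#define DET_ESTABLISHED 0x3          /* PHY comm established */

/* Spin up one port, wait for the PHY, then enable FIS receive. */
int port_spin_up(hba_port_t *port)
{
    port->cmd |= PXCMD_SUD;

    /* Poll PxSSTS.DET for an established link (crude busy-wait). */
    for (int i = 0; i < 1000000; i++)
        if ((port->ssts & 0xF) == DET_ESTABLISHED)
            break;
    if ((port->ssts & 0xF) != DET_ESTABLISHED)
        return -1;                   /* no device / PHY not up */

    port->cmd |= PXCMD_FRE;          /* allow received FISes in */
    return 0;
}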
As I said at the beginning, I'm not sure whether this answers your entire question, but I hope it gets you on the right track. I'm going from memory on the steps that must be taken to communicate effectively with a SATA device, and I may not have remembered everything in full detail.
I hope this helps you.