How does the PCIe Root Complex move a DMA transaction from a PCIe endpoint to host memory? - linux-device-driver

I have a very basic doubt: how does the PCIe Root Complex move a DMA transaction from a PCIe endpoint to host memory?
Suppose a PCIe EP (endpoint) wants to initiate a DMA write to host memory from its local memory.
The DMA read channel on the PCIe EP will read the data from its local memory, then the PCIe module in the EP converts this into PCIe TLP transactions and directs them to the PCIe Root Complex.
So my queries are:
How does the PCIe Root Complex come to know that it has to redirect this packet to host memory?
What is the hardware connection from the PCIe Root Complex to host memory? Is there a DMA engine in the PCIe Root Complex that writes this data to host memory?

The PCIe RC receives the TLP, applies an (optional) address translation function, and sends the packet to its user-side interface. After the PCIe RC there is usually IOMMU logic, which converts the PCIe (bus) address into a host physical address and checks permissions. For PCIe, the IOMMU keeps address translation tables in memory, one set per {bus, device, function} triple, or even per PASID (Process Address Space ID). The packet then carries the new physical address and goes to an interconnect, which usually supports cache coherency. The interconnect receives the packet from the IOMMU (the IOMMU acts as a master on the interconnect), and that interface node holds the system memory map, i.e. the information about where the addressed target is located within the interconnect. The system address map should be set up by firmware before the OS runs. (Usually there is also an interrupt controller, the Interrupt Translation Service on Arm systems, between the IOMMU and the interconnect; it intercepts MSIs, message signaled interrupts, and generates interrupts to the main interrupt controller.)
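On the host side, what makes this translation work for a driver is the DMA mapping it sets up before telling the endpoint where to write. A minimal sketch using the standard Linux DMA API; the endpoint register write at the end is hypothetical, since every device defines its own register layout:

#include <linux/dma-mapping.h>
#include <linux/pci.h>
#include <linux/slab.h>

/* Sketch: give a PCIe endpoint a DMA target in host memory.
 * The address returned by dma_map_single() is a bus/IOVA address;
 * the IOMMU (if present) translates it to host physical memory. */
static int setup_ep_dma_target(struct pci_dev *pdev, size_t len)
{
	void *buf = kmalloc(len, GFP_KERNEL);
	dma_addr_t bus_addr;

	if (!buf)
		return -ENOMEM;

	bus_addr = dma_map_single(&pdev->dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(&pdev->dev, bus_addr)) {
		kfree(buf);
		return -ENOMEM;
	}

	/* Tell the endpoint where to DMA. The register below is made up;
	 * real hardware defines its own register map. */
	/* iowrite32(lower_32_bits(bus_addr), ep_regs + EP_DMA_DST_LO); */

	return 0;
}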

Related

PCIe Understanding

As this domain is new to me, I have some confusion understanding PCIe.
I previously worked on protocols like I2C, SPI, UART and CAN, and most of these protocols have well defined docs (300 pages at most).
In almost all of those protocols, from a software perspective, the application just writes to a data register and the rest is taken care of by the hardware.
For example, in UART we just load data into the data register and the data is sent out with start, parity and stop bits.
I have read a few things about PCIe online and here is the understanding I have so far.
During system boot, the BIOS firmware figures out the memory space required by the PCIe device through a magic write-and-read-back procedure on the BARs in the PCIe device (endpoint); the arithmetic is sketched below.
Once it knows the size, it allocates an address space for the device in the system memory map (no actual RAM is used in the host; the memory resides only in the endpoint, and the endpoint is memory mapped into the host).
I see that PCIe has a few header fields that the BIOS firmware fills in during the bus enumeration phase.
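For concreteness, the sizing trick is: save the BAR, write all 1s to it, read it back, restore it, and compute the size from the bits that stayed zero. A minimal sketch of the arithmetic, assuming a 32-bit memory BAR:

#include <stdint.h>

/* BAR sizing during enumeration (32-bit memory BAR):
 * old = read BAR; write 0xFFFFFFFF; rb = read BAR; write old back.
 * The low 4 bits of a memory BAR are type/flag bits, not address bits. */
static uint32_t bar_size_from_readback(uint32_t rb)
{
    uint32_t mask = rb & ~0xFu;  /* keep only the writable address bits */
    return ~mask + 1u;           /* decoded size in bytes */
}

/* Example: if the read-back after writing all 1s is 0xFFFFF000,
 * the BAR decodes ~0xFFFFF000 + 1 = 0x1000 = 4 KB of space. */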
Now, if the host wants to set a bit in a configuration register located at address 0x10000004 (an address mapped for the endpoint), the host would do something like this (assume just one endpoint exists, with no branches):
*(volatile uint32_t *)0x10000004 |= (1u << Bit_pos);
1. How does the Root Complex know where to direct these writes, given that the BAR is in the endpoint?
Does the RC broadcast to all endpoints, with each endpoint comparing the address against the address programmed in its BAR to decide whether it must accept it or not (like an acceptance filter in CAN)?
Does the RC add all the PCIe header related info (the host just writes to the address)?
If the host writes to 0x10000004, will it write to the register at offset 0x4 in the endpoint?
How does the host know the endpoint has been given an address space starting at 0x10000000?
Is the RC like a router?
The above queries relate only to reading or writing a config register in the endpoint.
The following queries are about data transfer from the host to the endpoint.
1. Suppose the host asks the endpoint to save particular data present in DRAM to an SSD, and since the SSD is connected to the PCIe slot, will PCIe also perform DMA transfers?
For example, is there a special BAR in the endpoint that the host writes with a start address in DRAM to be moved to the SSD, which in turn triggers the endpoint to perform a DMA transfer from host to endpoint?
I am trying to understand PCIe relative to the other protocols I have worked on so far; this is all a bit new to me.
The RC is generally part of the CPU itself. It serves as a bridge that routes requests from the CPU downstream, and requests from the endpoints upstream to the CPU.
PCIe endpoints have Type 0 headers; bridges and switches have Type 1 headers. Type 1 headers have Base (minimum address) and Limit (maximum address) registers. Type 0 headers have BAR registers that are programmed during the enumeration phase.
After the enumeration phase is complete and all the endpoints have their BARs programmed, the Base and Limit registers in the Type 1 headers of the RC and the bridges/switches are programmed.
Example: assume a system that has only one endpoint connected directly to the RC with no intermediate bridges/switches, and whose BAR holds the value 0xA00000.
If it requests 4 KB of address space in the CPU's memory map (MMIO), the RC will have its Base register set to 0xA00000 and its Limit register to 0xAFFFFF (the window is always 1 MB aligned, even though the space requested by the endpoint is much less than 1 MB).
If the CPU writes to the register at 0xA00004, the RC looks at its Base and Limit registers to see whether the address falls in its range and routes the packet downstream to the endpoint (a small sketch of this range check follows below).
Endpoints use their BARs to decide whether they must accept a packet or not.
The RC, bridges and switches use Base and Limit registers to route packets to the correct downstream port. A switch can have multiple downstream ports, and each port has its own Type 1 header whose Base and Limit registers are programmed according to the endpoints connected behind that port. This is what is used for routing the packets.
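Conceptually, the routing decision at each Type 1 port boils down to a range check against its programmed window; a rough model, not tied to any real implementation:

#include <stdbool.h>
#include <stdint.h>

/* Conceptual model of Type 1 (bridge/switch/RC port) address routing:
 * a memory TLP is forwarded downstream if its address falls inside the
 * port's programmed Base/Limit window. */
struct bridge_window {
    uint64_t base;   /* e.g. 0xA00000 */
    uint64_t limit;  /* e.g. 0xAFFFFF, inclusive */
};

static bool routes_downstream(const struct bridge_window *w, uint64_t addr)
{
    return addr >= w->base && addr <= w->limit;
}

/* With base = 0xA00000 and limit = 0xAFFFFF, a CPU write to 0xA00004 is
 * claimed by this port and forwarded toward the endpoint, whose BAR then
 * decodes the address. */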
Data transfer between CPU memory and endpoints is done with PCIe Memory Writes. A single Memory Write TLP carries at most 4 KB of payload (the largest Max_Payload_Size the spec allows; links usually negotiate a smaller value such as 128 or 256 bytes), so anything larger is sent as multiple Memory Writes.
Memory Writes are posted transactions (no completion TLP is returned by the endpoint).
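On Linux, what triggers those Memory Writes from the host side is simply the CPU storing into the BAR-mapped region; a minimal sketch, assuming the register block lives in BAR 0:

#include <linux/io.h>
#include <linux/pci.h>

/* Sketch: map an endpoint's BAR 0 and copy a buffer into it.
 * Each store is turned into a posted PCIe Memory Write TLP by the Root
 * Complex hardware; software only performs ordinary MMIO stores. */
static int copy_to_endpoint(struct pci_dev *pdev, const void *src, size_t len)
{
	void __iomem *bar0 = pci_iomap(pdev, 0, 0);  /* 0 = map whole BAR */

	if (!bar0)
		return -ENOMEM;

	memcpy_toio(bar0, src, len);  /* becomes one or more Memory Writes */
	pci_iounmap(pdev, bar0);
	return 0;
}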

How does a PCIe endpoint remember its Bus/Device/Function number?

How does a PCIe endpoint claim configuration transactions, given that there is no register (in the Type 0 config space) defined by the PCIe specification that holds the Bus, Device and Function numbers?
The device must capture its bus and device numbers from the configuration writes it receives and store them for use in outgoing transactions (for example as the Requester/Completer ID in TLPs it originates). Since PCIe is actually point-to-point, not a shared bus, a device only receives configuration transactions that are intended for it.

Packet generation in PCI/PCIe devices

I have a few questions about PCI/PCIe packet generation and about CRC generation and checking. I have tried many searches but could not find a satisfactory answer. Please help me understand the points below.
1. How are packets (TLP, DLLP and PLLP) formed in a PCI/PCIe system? For example, say the CPU issues a memory read/write from/to a PCIe device (the device is mapped into memory). This request is received by the PCI/PCIe Root Complex. The Root Complex generates the TLP, and the DLLP and PLLP are generated and appended to the TLP accordingly to form a PCI/PCIe packet. This packet is claimed by one of the root ports based on the MMIO address ranges. Each port on a switch/endpoint generates the DLLP and PLLP and passes the packet over to the next device on the link, where it is stripped and checked for errors.
Q.1 - Is it true that packet generation and checking are done entirely by hardware? What contribution does software make to packet generation, and to checking packets for errors on the receiving device?
Q.2 - How are the ECRC and LCRC generated for a packet? The LCRC is generated and checked at every PCI/PCIe device/port, while the ECRC is generated only once, by the requester, which is the Root Complex in our example. So are ECRC/LCRC generation and checking done completely in hardware? Can someone explain with an example how the LCRC/ECRC are generated and checked from the moment the CPU issues a PCIe read/write request?
Q.3 - When we say that the Transaction Layer, Data Link Layer and Physical Link Layer generate the TLP, DLLP and PLLP respectively, do these layers refer to hardware or to software layers?
I think that if software had to get involved every time a packet or CRC is generated/checked, it would slow down the data transfer; the hardware can do these tasks much faster (see the sketch after these questions for the software-visible part).
Please correct me if I am wrong somewhere. I want to understand the above scenarios from a HW vs. SW point of view. Please help.
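As an aside on the HW vs. SW split for CRCs: the LCRC and ECRC themselves are generated and checked by hardware, and about the only piece software configures is whether ECRC generation/checking is enabled, via the AER extended capability. A hedged sketch, roughly what the Linux kernel's pci=ecrc=on option does, using constants from linux/pci_regs.h:

#include <linux/pci.h>

/* Sketch: turn on ECRC generation and checking for one device through
 * the Advanced Error Reporting (AER) extended capability. The CRC
 * computation itself happens entirely in hardware. */
static void enable_ecrc(struct pci_dev *dev)
{
	int aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
	u32 cap;

	if (!aer)
		return;  /* no AER capability: nothing to configure */

	pci_read_config_dword(dev, aer + PCI_ERR_CAP, &cap);
	if (cap & PCI_ERR_CAP_ECRC_GENC)
		cap |= PCI_ERR_CAP_ECRC_GENE;   /* generate ECRC on TLPs we send */
	if (cap & PCI_ERR_CAP_ECRC_CHKC)
		cap |= PCI_ERR_CAP_ECRC_CHKE;   /* check ECRC on TLPs we receive */
	pci_write_config_dword(dev, aer + PCI_ERR_CAP, cap);
}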

How does an OS find a peripheral's assigned address(es)?

OK, here's what I mean:
Let's say you want to write your own bootable code.
Further, your code is going to be really simple.
So simple, in fact, that it only consists of a single instruction.
Your bootable code is going to write a byte or word or double word or whatever to a register or RAM location on a peripheral device, not main RAM or a CPU register.
How do you find out what address(es) have been assigned to that peripheral memory location by the BIOS / UEFI?
Here's a more concrete example:
My bootable code's first and only instruction will write the number 11H to a register located on the sound card.
If the BIOS / UEFI initialization code did its job properly, that sound card register should be mapped into the CPU's memory space and/or IO space.
I need to find that address to accomplish that write.
How do I find it?
This is what real operating systems must do at some point.
When you open control panel / device manager in Windows, you see all the memory ranges for peripherals listed there.
At some point, Windows must have queried the BIOS / UEFI to find this data.
Again, how is this done?
EDIT:
Here is my attempt at writing this bootable assembly program:
BITS 16
ORG 7C00h            ; the BIOS loads a boot sector at 0000:7C00h, not 100h
start:
;I want to write a byte into a register on the sound card or NIC or
;whatever. So, I'm using a move instruction to accomplish that where X
;is the register's memory mapped or IO mapped address.
mov byte [X], 11h    ; X is still the unknown address I'm asking about
times 510 - ($ - $$) db 0
dw 0xaa55
What number do I put in for X? How do I find the address of this peripheral's register?
If you want to do this with one instruction, you can just get the address for the device from the Windows device manager. But if you want to do it the "proper" way, you need to scan the PCI bus to find the device you want to program, and then read the Base Address Registers (BARs) of the device to find its MMIO ranges. This is what Windows does; it doesn't query the BIOS.
To find the device that you want to access, scan the PCI bus looking for the device. Devices are addressed on the PCI bus by their "BDF" (short for Bus/ Device/ Function). Devices are identified by a Vendor ID and a Device ID assigned by the vendor.
Read offsets 0 and 2 of each BDF's configuration space to get the Vendor ID and Device ID. When you have found the device you want to program, read the appropriate 32-bit BAR value at an offset between 10h and 24h. You need to know which BAR contains the register you want to program, which is specific to the device you are using.
This article describes how to access PCI config space and has sample code in C showing how to scan the PCI bus (a minimal sketch also follows below): http://wiki.osdev.org/PCI
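For illustration, here is a bare-metal sketch of that scan using PCI configuration mechanism #1 (the 0xCF8/0xCFC I/O ports); the vendor and device IDs are placeholders you would replace with your card's values:

#include <stdint.h>

/* PCI configuration mechanism #1: write a config address to port 0xCF8,
 * then read the selected dword from port 0xCFC. */
static inline void pio_write32(uint16_t port, uint32_t val)
{
    __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint32_t pio_read32(uint16_t port)
{
    uint32_t v;
    __asm__ volatile ("inl %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
{
    uint32_t addr = (1u << 31) | ((uint32_t)bus << 16) |
                    ((uint32_t)dev << 11) | ((uint32_t)fn << 8) | (off & 0xFC);
    pio_write32(0xCF8, addr);
    return pio_read32(0xCFC);
}

/* Scan every bus/device/function for a vendor:device pair and return its
 * BAR0 (config offset 0x10); returns 0 if not found in this sketch. */
static uint32_t find_bar0(uint16_t want_vendor, uint16_t want_device)
{
    for (int bus = 0; bus < 256; bus++)
        for (int dev = 0; dev < 32; dev++)
            for (int fn = 0; fn < 8; fn++) {
                uint32_t id = pci_cfg_read32(bus, dev, fn, 0x00);
                if ((id & 0xFFFF) != want_vendor)   /* 0xFFFF = empty slot */
                    continue;
                if ((id >> 16) != want_device)
                    continue;
                return pci_cfg_read32(bus, dev, fn, 0x10);  /* BAR0 */
            }
    return 0;
}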

PCIe raw throughput test

I am doing a PCIe throughput test via a kernel module, and the resulting numbers are quite strange (write is 210 MB/s but read is just 60 MB/s for PCIe Gen1 x1). I would like to ask for your suggestions and corrections if there is anything wrong in my test configuration.
My test configuration is as follows:
One board is configured as the Root Port and one board is configured as the Endpoint. The PCIe link is Gen 1, width x1, MPS 128 B. Both boards run Linux.
At the Root Port side, we allocate a 4 MB memory buffer and map inbound PCIe memory transactions to this buffer.
At the Endpoint side, we do DMA reads/writes to the remote buffer and measure throughput. In this test the Endpoint is always the initiator of the transactions.
The test result is 214 MB/s for the EP write test but only 60 MB/s for the EP read test. The write throughput is reasonable for PCIe Gen1 x1, but the EP read throughput is far too low.
For the RP board, I also tested with a PCIe Ethernet card (e1000e) and got a maximum throughput of ~900 Mbps. In the Ethernet TX path, the Ethernet card (playing the Endpoint role) also issues EP read requests and achieves high throughput (~110 MB/s) with even smaller DMA transfers, so there must be something wrong with my DMA EP read configuration.
The DMA read test can be summarized with the pseudo code below (a more concrete sketch using the Linux DMA API follows after it):
dest_buffer = kmalloc(1MB)
memset(dest_buffer, 0, 1MB)
dest_phy_addr = dma_map_single(dest_buffer, 1MB, DMA_FROM_DEVICE)
source_phy_addr = outbound region of the Endpoint
get_time(t1)
loop 100 times:
    issue DMA read from source_phy_addr to dest_phy_addr
    wait for DMA read completion
get_time(t2)
throughput = (1MB * 100) / (t2 - t1)
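Here is a slightly more concrete version of the loop above using the standard Linux DMA and time APIs; the DMA engine programming itself is device specific, so ep_dma_read() below is a hypothetical stub standing in for our engine:

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/ktime.h>
#include <linux/slab.h>

#define BUF_SIZE   (1 << 20)   /* 1 MB per transfer */
#define ITERATIONS 100

/* Hypothetical, device-specific helper: program the EP's DMA engine to
 * read 'len' bytes from 'src' (remote PCIe/outbound address) into 'dst'
 * (local bus address) and block until the transfer completes. */
int ep_dma_read(struct device *dev, dma_addr_t dst, u64 src, size_t len);

static void measure_ep_read(struct device *dev, u64 source_pcie_addr)
{
	void *buf = kmalloc(BUF_SIZE, GFP_KERNEL);
	dma_addr_t dst;
	ktime_t t1, t2;
	int i;

	if (!buf)
		return;
	dst = dma_map_single(dev, buf, BUF_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, dst))
		goto out;

	t1 = ktime_get();
	for (i = 0; i < ITERATIONS; i++)
		ep_dma_read(dev, dst, source_pcie_addr, BUF_SIZE);
	t2 = ktime_get();

	pr_info("EP read: %d MB in %lld us\n",
		ITERATIONS * (BUF_SIZE >> 20), ktime_us_delta(t2, t1));

	dma_unmap_single(dev, dst, BUF_SIZE, DMA_FROM_DEVICE);
out:
	kfree(buf);
}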
Any recommendations and suggestions are appreciated. Thanks in advance!