I have a 2-byte SPI transaction in HID and USBXpress firmware on the C8051F320. The SPI routines are the same in both firmwares.
Running two back-to-back transactions, there is a 1ms delay between transactions in USBXpress and a 2ms delay using HID. The delays are consistent. Why is the HID slower and how can I make it 1ms? bInterval in HID is 1.
Going a bit on a lark here (no experience with USBXpress, and merely some experience with Microchip's USB stacks):
The HID stack would use two USB frames for a back-to-back transaction. If I recall correctly, there can never be two HID transactions outstanding (a transaction being one report request followed by one report response). The first request goes out in the first USB frame, the first response comes back in the second frame, and the second request can only happen in the third frame.
With USBXpress, one can relax that condition and make the next request before waiting for the completion of the previous one.
Can it be brought down to 1 ms? I'd suggest reading the HID specification to find out whether it's legal and, if so, how the host can be forced to handle two outstanding HID transactions.
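The frame-counting argument above can be sketched as a tiny model. This is only an illustration of the reasoning, assuming 1 ms full-speed USB frames and that a request and its response each occupy one frame. The function names are hypothetical and nothing here talks to real hardware.

```python
FRAME_MS = 1  # assumed full-speed USB frame period (bInterval = 1)

def serialized_start_frames(n_transactions):
    """Request in frame k, response in frame k+1, next request only after that."""
    return [2 * i for i in range(n_transactions)]

def pipelined_start_frames(n_transactions):
    """A new request may be issued every frame, without waiting for the response."""
    return [i for i in range(n_transactions)]

def gap_ms(start_frames):
    """Delay between the first two transaction starts, in milliseconds."""
    return (start_frames[1] - start_frames[0]) * FRAME_MS

print(gap_ms(serialized_start_frames(2)))  # HID-like behaviour
print(gap_ms(pipelined_start_frames(2)))   # USBXpress-like behaviour
```

Under these assumptions the serialized case gives a 2 ms gap and the pipelined case 1 ms, which matches the delays reported in the question.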
We have implemented our custom driver that uses DMA to copy a large amount of data from the FMC interface (an FPGA mapped to it) to the RAM using the STM32 mdma engine with 32 dma channels. The FPGA contains a small FIFO we want to copy the data from.
For very fast data acquisition the setup time for new DMA transactions becomes critical!
The first implementation used a workqueue to create the next DMA transaction. It could not be done directly from the "dma_completed" atomic context because of some necessary I/O that has to wait. This led to pauses of up to 5 ms between DMA transactions and to buffer overflows in the FPGA's FIFO.
As I am copying from a memory mapped region to RAM, I am using dmaengine_prep_dma_memcpy.
I implemented a number of improvements that reduced the pause between DMAs:
I am fusing DMA-mapped pages so that fewer DMA transaction entries have to be created and less DMA engine programming is necessary.
I am preparing the next DMA pages up front, so the next DMA transaction can be started directly from the "dma_completed" routine.
I am using a second DMA channel and toggle between the two when dma_completed is called. This allows setting up a second DMA while the first one is still running. Although the Linux DMA API allows this with one channel, the MDMA engine does not and ignores the added transactions.
Usually the pause is now lower than 1 ms. But there are spikes where the FIFO nearly overflows.
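As a rough way to reason about those spikes: the longest tolerable pause is the free FIFO space divided by the rate at which the FPGA fills it. The sketch below only shows that arithmetic. The FIFO depth and data rate are hypothetical placeholders, not values from this setup, and the function name is made up for illustration.

```python
def max_pause_ms(fifo_free_bytes, fill_rate_bytes_per_s):
    """Longest gap between DMA transfers before the FIFO overflows, in milliseconds."""
    return 1000.0 * fifo_free_bytes / fill_rate_bytes_per_s

# Hypothetical numbers for illustration only; substitute the real FIFO depth and data rate.
FIFO_DEPTH_BYTES = 64 * 1024
FILL_RATE_BYTES_PER_S = 50_000_000

budget = max_pause_ms(FIFO_DEPTH_BYTES, FILL_RATE_BYTES_PER_S)
for pause_ms in (0.5, 1.0, 5.0):
    print(f"pause {pause_ms} ms -> overflow: {pause_ms > budget}")
```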
Finally I tried to use dmaengine_prep_dma_cyclic. This would be perfect. A continuously running DMA with no need for a setup time between interrupts.
But this does not work. Or better: I do not get it to work...
The transaction created with dmaengine_prep_dma_cyclic does not want to start!
I am getting a new dma_cookie, and any status request to the channel returns "DMA_IN_PROGRESS". It never completes, and the completion callback is never called either.
Though dmaengine_prep_dma_memcpy works fine...
I think this is because of the difference between software vs hardware triggered DMA transactions.
Looking into stm32-mdma.c, I see that dmaengine_prep_dma_memcpy has its own setup routine, whereas dmaengine_prep_dma_cyclic uses stm32_mdma_set_xfer_param(), which always configures a HW request.
My very big big questions:
Is there a way to use dmaengine_prep_dma_cyclic for a MEMORY to MEMORY DMA transaction (software triggered)? This would be the perfect solution to my performance problem...
Are we missing some signals to connect the FPGA to the SoC? My FPGA programming colleague suspects a missing TSEL (trigger selection) setting. He suspects dmaengine_prep_dma_cyclic will work then.
If a minimum driver module source code example would help in getting better answers, I can provide one in short time. Please note that this is highly hardware specific. Other SOCs than STM32MP157F may have different behaviour.
Thanks for every feedback!
Bye Gunther
References:
https://wiki.st.com/stm32mpu/wiki/Dmaengine_overview
https://github.com/STMicroelectronics/linux/blob/v5.15-stm32mp/drivers/dma/stm32-mdma.c
I have a Nucleo-F446RE, and I'm trying to get the I2C working with an IMU I have (LSM6DS33). I am using STM32CubeMX and checked out all the example code for my board which is related to I2C. Specifically I'll be talking about their 'I2C_TwoBoards_ComIT' example, but all their examples which use the interrupt method have this same quirk. Here is a snippet of their code from main.c:
/* The board sends the message and expects to receive it back */
do
{
  /*##-2- Start the transmission process #####################################*/
  /* While the I2C in reception process, user can transmit data through
     "aTxBuffer" buffer */
  if(HAL_I2C_Master_Transmit_IT(&I2cHandle, (uint16_t)I2C_ADDRESS, (uint8_t*)aTxBuffer, TXBUFFERSIZE)!= HAL_OK)
  {
    /* Error_Handler() function is called in case of error. */
    Error_Handler();
  }

  /*##-3- Wait for the end of the transfer ###################################*/
  /* Before starting a new communication transfer, you need to check the current
     state of the peripheral; if it’s busy you need to wait for the end of current
     transfer before starting a new one.
     For simplicity reasons, this example is just waiting till the end of the
     transfer, but application may perform other tasks while transfer operation
     is ongoing. */
  while (HAL_I2C_GetState(&I2cHandle) != HAL_I2C_STATE_READY)
  {
  }

  /* When Acknowledge failure occurs (Slave don't acknowledge its address)
     Master restarts communication */
}
while(HAL_I2C_GetError(&I2cHandle) == HAL_I2C_ERROR_AF);
Under comment ##-3- they explain that unless we wait for the I2C state to be ready again, after sending a command, the next command will overwrite the previous one, so they use a while loop which waits for the I2C state to be 'ready' before continuing.
Isn't this a very inefficient way to use an interrupt, and no different from using the standard polling method? Both block the main code, so what's the purpose of the interrupt?
In my personal example, I want to collect the accelerometer/gyroscope data at the 1.66 kHz rate which the IMU is capable of. I use a 2 kHz timer to send an I2C command to read the acc/gyr data-ready register, and if the data is ready for either sensor I read their 6 bytes to get the x/y/z plane information. Using the polling method is too slow, as blocking the code at a rate of 2 kHz is inefficient, but the interrupt method doesn't seem to be any faster, as I still need to hang the system during the aforementioned while loop to check if I2C is ready for another command. What am I missing here?
Is this (the example you provided) an efficient way of doing things? No. Can blocking part be avoided? Yes. It's only a small example, a proof of concept, so there is some blocking in there. You should look deeper at why it is there and how can you implement what it does without blocking.
The point of that blocking part is to not start an I2C communication while another I2C communication is in progress. The problem is that while your line of code to send something over I2C has already been executed, the data is still being physically sent over the line, just because your MCU is much faster than I2C. You need to wait until I2C line is idle and available for transmission.
How to achieve that with interrupts and not waste cycles and processing time? Given that in your case you can easily estimate the amount of data per transmission, it is no problem to estimate how much time every transmission will take at your I2C speed. Since you're smartly and correctly using a timer to schedule regular transmissions, you should be able to set the timer in such a way that by the next timer interrupt, which will send data, your previous communication has already ended.
For example, if you set the timer to 1Hz to start transmission, you can obviously be sure that by the next interrupt all the communication has happened. You don't need to poll anything at all.
I don't see much point in I2C-polling the IC at 2 kHz if it produces data at 1.6 kHz. You will have uneven time periods between samples: some data will be very fresh, while some data will come with a little delay, plus there will be communication without data ready. It would be better to poll it at something like 1.5-1.6 kHz and just expect data to always be there. Of course, this assumes the communication fits into a 1.5 kHz period, which requires some napkin math.
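The napkin math could look something like the sketch below. It assumes a plain register read (slave address, register address, repeated slave address, then the data bytes), 9 clock periods per byte, and that both 6-byte blocks can be read in one 12-byte burst. START/STOP overhead and clock stretching are ignored, and the bus speeds are just the two common I2C modes, not values taken from the question.

```python
def i2c_read_time_us(n_data_bytes, scl_hz):
    """Approximate bus time for a register read of n_data_bytes, in microseconds.

    Counts 9 clock periods (8 bits + ACK) for each of: slave address (write),
    register address, slave address (read), and every data byte.
    START/STOP conditions and clock stretching are ignored.
    """
    bytes_on_wire = 3 + n_data_bytes
    return bytes_on_wire * 9 * 1_000_000 / scl_hz

SAMPLE_RATE_HZ = 1660                  # IMU output data rate from the question
period_us = 1_000_000 / SAMPLE_RATE_HZ

for scl_hz in (100_000, 400_000):      # assumed standard-mode and fast-mode clocks
    t = i2c_read_time_us(12, scl_hz)   # 6 bytes gyro + 6 bytes accel in one burst
    print(f"{scl_hz} Hz: {t:.1f} us, fits in {period_us:.1f} us period: {t < period_us}")
```

With these assumptions a 12-byte read takes about 337.5 µs at 400 kHz, which fits into the roughly 602 µs sample period, while at 100 kHz it takes 1350 µs and does not.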
I've read many stack overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different timings with a difference of 0.0001 seconds (eg. one packet arrives at 12.00.0001pm and another packet arrives at 12.00.0002pm, and so on..).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, 100 µs is "forever," but for a 10 Mbit interface, you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag to signal that the payload should be immediately delivered to the application. So if we hop into the Waybak machine the answer would have been something like it depends on whether or not the PSH flag is set in the packets. However, there is generally no user space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
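For reference, disabling Nagle's algorithm from Python looks roughly like this. It is a minimal sketch on an unconnected socket, just to show the option being set and read back. The variable names are arbitrary.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm so small writes are sent without waiting for outstanding ACKs.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Read the option back; a non-zero value means it is enabled.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```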
If we return from our trip in the Waybak machine to the modern age where 100 Gigabit interfaces and 24 core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting way faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a nic is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, whereas the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture and seen a jumbo frame and scratched your head because you sure thought the MTU was 1500, this is because this processing is at such a low level that it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading, and in particular let's assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing it off to GRO to consolidate if it can, and then it gets handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
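To check what the timeout is on a given machine, something like the following should work on Linux, assuming the value is exposed under `/sys/class/net/<iface>/gro_flush_timeout` as I believe it is on current kernels. The helper name is made up, and it simply returns None when the file is missing or unreadable.

```python
from pathlib import Path

def read_gro_flush_timeout(iface):
    """Return gro_flush_timeout (nanoseconds) for a Linux interface, or None if unavailable."""
    # Assumed sysfs location; not present on non-Linux systems or unknown interfaces.
    path = Path("/sys/class/net") / iface / "gro_flush_timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

print(read_gro_flush_timeout("lo"))
```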
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.
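In practice that means a caller who needs a fixed number of bytes loops over recv. Here is a minimal sketch; `recv_exactly` is a hypothetical helper name, and the demonstration uses a local socket pair instead of a real network connection.

```python
import socket

def recv_exactly(sock, n):
    """Call recv() repeatedly until exactly n bytes have arrived or the peer closes."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)   # returns as soon as at least 1 byte is available
        if not chunk:                  # empty bytes object: peer closed the connection
            raise ConnectionError(f"connection closed with {remaining} bytes still expected")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# Local demonstration: four writes of 150 bytes, as in the question.
a, b = socket.socketpair()
for _ in range(4):
    a.sendall(b"x" * 150)
data = recv_exactly(b, 600)
print(len(data))
a.close()
b.close()
```

How many recv calls the loop needs depends on timing and on the stack, which is exactly why the code should not assume any particular number.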
Does anyone have sample code for transferring data over SPI in DMA circular mode on an STM32 (16-bit)?
With my code, the master sends 16-bit data and receives the answer in the next cycle. But this transaction is done with a one-cycle delay.
SPI is supposed to work that way.
When the SPI data register is written the first time, it starts sending the data, and immediately signals the DMA controller that it's ready for the next data word. Now there are two data words down in the transmitter, when it has barely started receiving the first one. When the first outgoing word is completely transmitted, and the first incoming word is completely received (these happen almost simultaneously), SPI starts sending the second word already in the data register, signals the transmit DMA channel that it's ready for the third data word, about the same time it also signals the receiving channel that the first incoming data word is ready.
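The one-word offset can be illustrated with a small model. This is a simplified sketch of the effect at the protocol level, not of the STM32 DMA hardware: it assumes the slave can only shift out its reply to a command during the following word. The slave behaviour (reply = command + 1) and all names are hypothetical.

```python
def spi_exchange(commands, slave_reply, idle_word=0x0000):
    """Model a full-duplex 16-bit SPI stream.

    The word received during transfer k is the slave's reply to the command
    sent in transfer k-1, because the slave has to load its shift register
    before the clocks for that word begin.
    """
    received = []
    pending = idle_word                 # what the slave shifts out first
    for cmd in commands:
        received.append(pending)        # clocked in while cmd is clocked out
        pending = slave_reply(cmd) & 0xFFFF
    return received

# Hypothetical slave that answers each command with command + 1.
commands = [0x0010, 0x0020, 0x0030, 0x0000]   # last word is a dummy to flush the final reply
replies = spi_exchange(commands, lambda c: c + 1)
print([hex(w) for w in replies])
```

In this model the first received word is just the idle value, and a trailing dummy word is needed to collect the reply to the last real command.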
I am learning LDD3, the Interrupt Handling chapter. I want to double-check my understanding, and I also have a question about the logical relationship between these statements:
1. Although some devices can be controlled using nothing but their I/O regions (a char driver is the example, right??),
2. most real devices are a bit more complicated than that. Devices have to deal with the external world, which often includes things such as spinning disks, moving tape, wires to distant places, and so on. (understood)
3. Much has to be done in a time frame that is different from, and far slower than, that of the processor.
4. Since it is almost always undesirable to have the processor wait on external events, there must be a way for a device to let the processor know when something has happened.
Is the author trying to say that because of both the 3rd and the 4th condition, we use an interrupt handler? I always thought the 4th condition alone can lead to interrupt handling. Does the 3rd condition really matter here?
Thanks
They are related. I would have phrased it as "much can be done". A processor can go and handle a multitude of tasks while waiting for a response from some external device, if that device is a spinning disk, an I/O response, or some other mechanical thing.
If the device were much faster than the processor, then #4 wouldn't be an issue.