Why am I seeing more RFO (Read For Ownership) requests using REP MOVSB than with vmovdqa - x86-64

Check out Edit3.
I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That being said, AFAIK the only reason I see a reduction in RFO requests with rep movsb compared to a Temporal Store memcpy is better prefetching on loads and no prefetching on stores, NOT RFO requests being optimized out for full cache-line stores. This makes some sense, as we don't see RFO requests optimized out for vmovdqa with a zmm register, which we would expect if that were really the case for full cache-line stores. That being said, the lack of prefetching on stores and the lack of non-temporal writes make it hard to see how rep movsb has reasonable performance.
Edit: It is possible that the RFO requests from rep movsb differ from those for vmovdqa in that for rep movsb the request might not ask for data, just take the line in exclusive state. This could also be the case for stores with a zmm register. I don't see any perf metrics to test this, however. Does anyone know of any?
Questions
Why am I not seeing a reduction in RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
Why am I seeing more RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
These are two separate questions because I believe I should be seeing a reduction in RFO requests with rep movsb; but if that is not the case, should I be seeing an increase as well?
Background
CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I was trying to test out the number of RFO requests when using different methods of memcpy including:
Temporal Stores -> vmovdqa
Non-Temporal Stores -> vmovntdq
Enhanced REP MOVSB -> rep movsb
And I have been unable to see a reduction in RFO requests using rep movsb. In fact, I have been seeing more RFO requests with rep movsb than with Temporal Stores. This is counter-intuitive given that the consensus understanding seems to be that for Ivy Bridge and newer, rep movsb is able to avoid RFO requests and in turn save memory bandwidth:
In Enhanced REP MOVSB for memcpy:
When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
Avoiding the RFO request when it knows the entire cache line will be overwritten.
In What's missing/sub-optimal in this memcpy implementation?:
Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs a RFO into LLC, rep movsb does not
I wrote a simple test program to verify this but was unable to do so.
Test Program
#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
#define TEMPORAL 0
#define NON_TEMPORAL 1
#define REP_MOVSB 2
#define NONE_OF_THE_ABOVE 3
#define TODO 1
#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif
#define load(x) _mm256_load_si256((__m256i *)(x))
void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED); // mmap returns MAP_FAILED on error, not NULL
    return p;
}
void BENCH_ATTR
bench() {
    uint64_t len       = 64UL * (1UL << 22);
    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);
    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }
    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;
    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovdqa or vmovntdq
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif
    asm volatile("lfence\n\tmfence" : : : "memory");
    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}
int
main(int argc, char ** argv) {
    bench();
}
Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Test Data
Note: Data with less noise in edit2
TODO = TEMPORAL
583,912,867 cpu-cycles
9,352,817 l2_rqsts.all_rfo
188,343,479 offcore_requests_outstanding.cycles_with_demand_rfo
11,560,370 offcore_requests.demand_rfo
0.166557783 seconds time elapsed
0.044670000 seconds user
0.121828000 seconds sys
TODO = NON_TEMPORAL
560,933,296 cpu-cycles
7,428,210 l2_rqsts.all_rfo
123,174,665 offcore_requests_outstanding.cycles_with_demand_rfo
8,402,627 offcore_requests.demand_rfo
0.156790873 seconds time elapsed
0.032157000 seconds user
0.124608000 seconds sys
TODO = REP_MOVSB
566,898,220 cpu-cycles
11,626,162 l2_rqsts.all_rfo
178,043,659 offcore_requests_outstanding.cycles_with_demand_rfo
12,611,324 offcore_requests.demand_rfo
0.163038739 seconds time elapsed
0.040749000 seconds user
0.122248000 seconds sys
TODO = NONE_OF_THE_ABOVE
521,061,304 cpu-cycles
7,527,122 l2_rqsts.all_rfo
123,132,321 offcore_requests_outstanding.cycles_with_demand_rfo
8,426,613 offcore_requests.demand_rfo
0.139873929 seconds time elapsed
0.007991000 seconds user
0.131854000 seconds sys
Test Results
The baseline RFO request count, with just the setup but without the memcpy, is TODO = NONE_OF_THE_ABOVE with 7,527,122 RFO requests.
With TODO = TEMPORAL (using vmovdqa) we see 9,352,817 RFO requests. This is lower than TODO = REP_MOVSB (using rep movsb), which has 11,626,162 RFO requests: ~2 million more RFO requests with rep movsb than with Temporal Stores. The only case where I was able to see RFO requests avoided was TODO = NON_TEMPORAL (using vmovntdq), which has 7,428,210 RFO requests, about the same as the baseline, indicating none from the memcpy itself.
I played around with different sizes for the memcpy, thinking I might need to decrease / increase the size for rep movsb to trigger that optimization, but I have been seeing the same general results. For all sizes I tested, the number of RFO requests follows the order NON_TEMPORAL < TEMPORAL < REP_MOVSB.
Theories
[Unlikely] Something new on Icelake?
Edit: @PeterCordes was able to reproduce the results on Skylake.
I don't think this is an Icelake-specific thing, as the only change I could find in the Intel Manual on rep movsb for Icelake is:
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short
operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long.
Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID.(EAX=07H,
ECX=0H):EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
Which should not be playing a factor in the test program I am using, given that len is well above 128.
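As a quick sanity check, the FSRM bit can be read directly; a minimal sketch using GCC's <cpuid.h>, separate from the test program:
#include <cpuid.h>
#include <stdio.h>
// Minimal sketch: check CPUID.(EAX=07H, ECX=0H):EDX bit 4 (FSRM).
int main(void) {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        printf("FSRM: %s\n", (edx & (1u << 4)) ? "yes" : "no");
    return 0;
}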
[Likelier] My test program is broken
I don't see any issues, but this is a very surprising result. At the very least, I verified that the compiler is not optimizing out the tests here.
Edit: Fixed the build instructions to use g++ instead of gcc, and the file suffix from .c to .cc.
Edit2:
Back to C and GCC.
Better perf recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Numbers with better perf recipe (same trend but less noise):
TODO = TEMPORAL
161,214,341 cpu-cycles
1,984,998 l2_rqsts.all_rfo
61,238,129 offcore_requests_outstanding.cycles_with_demand_rfo
3,161,504 offcore_requests.demand_rfo
0.169413413 seconds time elapsed
0.044371000 seconds user
0.125045000 seconds sys
TODO = NON_TEMPORAL
142,689,742 cpu-cycles
3,106 l2_rqsts.all_rfo
4,581 offcore_requests_outstanding.cycles_with_demand_rfo
30 offcore_requests.demand_rfo
0.166300952 seconds time elapsed
0.032462000 seconds user
0.133907000 seconds sys
TODO = REP_MOVSB
150,630,752 cpu-cycles
4,194,202 l2_rqsts.all_rfo
54,764,929 offcore_requests_outstanding.cycles_with_demand_rfo
4,194,016 offcore_requests.demand_rfo
0.166844489 seconds time elapsed
0.036620000 seconds user
0.130205000 seconds sys
TODO = NONE_OF_THE_ABOVE
89,611,571 cpu-cycles
321 l2_rqsts.all_rfo
3,936 offcore_requests_outstanding.cycles_with_demand_rfo
19 offcore_requests.demand_rfo
0.142347046 seconds time elapsed
0.016264000 seconds user
0.126046000 seconds sys
Edit3: This may have to do with the earlier recipes hiding RFO events triggered by the L2 Prefetcher
I used the perf recipe @BeeOnRope suggested, which includes RFO events started by the L2 Prefetcher:
perf stat --all-user -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/ ./rfo_test
And the equivalent perf recipe without L2 Prefetch events:
perf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/ ./rfo_test
And got more reasonable results:
Tl;dr: with the prefetch-triggered events counted we see fewer RFO requests with rep movsb. But it does not appear that rep movsb actually avoids RFO requests; rather, it just touches fewer cache lines.
Data With and Without Prefetch Triggered Events Included
TODO =                   | Perf Event           | w/ Prefetching | w/o Prefetching | Difference
-------------------------|----------------------|----------------|-----------------|-----------
TEMPORAL                 | l2_rqsts_references  |     16,812,993 |       4,358,692 | 12,454,301
TEMPORAL                 | l2_rqsts_all_rfo     |     14,443,392 |       1,981,560 | 12,461,832
TEMPORAL                 | l2_rqsts_rfo_hit     |      1,297,932 |       1,038,243 |    259,689
TEMPORAL                 | l2_rqsts_rfo_miss    |     13,145,460 |         943,317 | 12,202,143
NON_TEMPORAL             | l2_rqsts_references  |      8,820,287 |       1,946,591 |  6,873,696
NON_TEMPORAL             | l2_rqsts_all_rfo     |      6,852,605 |             346 |  6,852,259
NON_TEMPORAL             | l2_rqsts_rfo_hit     |         66,845 |             317 |     66,528
NON_TEMPORAL             | l2_rqsts_rfo_miss    |      6,785,760 |              29 |  6,785,731
REP_MOVSB                | l2_rqsts_references  |     11,856,549 |       7,400,277 |  4,456,272
REP_MOVSB                | l2_rqsts_all_rfo     |      8,633,330 |       4,194,510 |  4,438,820
REP_MOVSB                | l2_rqsts_rfo_hit     |      1,394,372 |             546 |  1,393,826
REP_MOVSB                | l2_rqsts_rfo_miss    |      7,238,958 |       4,193,964 |  3,044,994
LOAD_ONLY_TEMPORAL       | l2_rqsts_references  |      6,058,269 |         619,924 |  5,438,345
LOAD_ONLY_TEMPORAL       | l2_rqsts_all_rfo     |      5,103,905 |             337 |  5,103,568
LOAD_ONLY_TEMPORAL       | l2_rqsts_rfo_hit     |        438,518 |             311 |    438,207
LOAD_ONLY_TEMPORAL       | l2_rqsts_rfo_miss    |      4,665,387 |              26 |  4,665,361
STORE_ONLY_TEMPORAL      | l2_rqsts_references  |      8,069,068 |         837,616 |  7,231,452
STORE_ONLY_TEMPORAL      | l2_rqsts_all_rfo     |      8,033,854 |         802,969 |  7,230,885
STORE_ONLY_TEMPORAL      | l2_rqsts_rfo_hit     |        585,938 |         576,955 |      8,983
STORE_ONLY_TEMPORAL      | l2_rqsts_rfo_miss    |      7,447,916 |         226,014 |  7,221,902
STORE_ONLY_REP_STOSB     | l2_rqsts_references  |      4,296,169 |       4,228,643 |     67,526
STORE_ONLY_REP_STOSB     | l2_rqsts_all_rfo     |      4,261,756 |       4,194,548 |     67,208
STORE_ONLY_REP_STOSB     | l2_rqsts_rfo_hit     |         17,337 |             309 |     17,028
STORE_ONLY_REP_STOSB     | l2_rqsts_rfo_miss    |      4,244,419 |       4,194,239 |     50,180
STORE_ONLY_NON_TEMPORAL  | l2_rqsts_references  |         99,713 |          36,112 |     63,601
STORE_ONLY_NON_TEMPORAL  | l2_rqsts_all_rfo     |         64,148 |             427 |     63,721
STORE_ONLY_NON_TEMPORAL  | l2_rqsts_rfo_hit     |         17,091 |             398 |     16,693
STORE_ONLY_NON_TEMPORAL  | l2_rqsts_rfo_miss    |         47,057 |              29 |     47,028
NONE_OF_THE_ABOVE        | l2_rqsts_references  |         74,074 |          27,656 |     46,418
NONE_OF_THE_ABOVE        | l2_rqsts_all_rfo     |         46,833 |             375 |     46,458
NONE_OF_THE_ABOVE        | l2_rqsts_rfo_hit     |         16,366 |             344 |     16,022
NONE_OF_THE_ABOVE        | l2_rqsts_rfo_miss    |         30,467 |              31 |     30,436
It seems most of the RFO differences boil down to prefetching. From Enhanced REP MOVSB for memcpy:
Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
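To illustrate the "prefetch exactly" point, here is a hedged sketch of what it might look like done in software in the vmovdqa loop above (the 8-line distance and the src_end bound are my choices, not anything rep movsb is documented to do):
// Hypothetical: software-prefetch the source a fixed distance ahead, but
// never past the end of the region, mimicking "exact" prefetching.
for (uint64_t src_end = src + len; dst < dst_end; dst += 64, src += 64) {
    if (src + 512 < src_end)
        _mm_prefetch((const char *)(src + 512), _MM_HINT_T0);
    __m256i lo = load(src);
    __m256i hi = load(src + 32);
    store(dst, lo);
    store(dst + 32, hi);
}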
Stores
It all appears to come down to rep movsb not prefetching store addresses, causing fewer lines to require an RFO request. With STORE_ONLY_REP_STOSB we can get a better idea of where the RFO requests are saved with rep movsb (assuming the two are implemented similarly). With prefetching events NOT counted, rep movsb has almost exactly the same number of RFO requests as rep stosb (~4.19 million, essentially one per destination cache line, since the 2^28-byte buffer is 4,194,304 lines) and the same breakdown of HITS / MISSES. It has roughly 3.2 million extra L2 references (7,400,277 vs 4,228,643), which are fair to attribute to the loads.
What's especially interesting in the STORE_ONLY_REP_STOSB numbers is that they barely change between the prefetch and non-prefetch recipes. This makes me think that rep stosb, at the very least, is NOT prefetching the store address. This also corresponds with the fact that we see almost no RFO_HITs and almost entirely RFO_MISSes. The Temporal Store memcpy, on the other hand, IS prefetching the store address, so the original numbers were skewed: they didn't count the store RFO requests from vmovdqa but counted all of them from rep movsb.
Another point of interest is that STORE_ONLY_REP_STOSB still has many RFO requests compared with STORE_ONLY_NON_TEMPORAL. This makes me think that rep movsb / rep stosb only saves RFO requests on stores because it is not making extra prefetches, while still using a temporal store that goes through the cache. One thing I am having a hard time reconciling is that the stores from rep movsb / rep stosb seem to neither prefetch nor be non-temporal, so every stored line still pays for an RFO, and I am unsure how it achieves comparable performance.
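For reference, the STORE_ONLY_REP_STOSB variant is not shown in the test program above; it was presumably something like this sketch (the zero fill value and the register constraints are my guesses, mirroring the rep movsb test):
// Hypothetical STORE_ONLY_REP_STOSB variant: fill dst with the byte in
// %al using rep stosb; src is never touched, so all traffic is stores.
uint64_t fill = 0;
asm volatile("rep stosb" : "+D"(dst), "+c"(len) : "a"(fill) : "memory");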
Loads
I think rep movsb is prefetching loads, and it is doing a better job of it than the standard vmovdqa loop. If you look at the diff between rep movsb w/ and w/o prefetch and the diff for LOAD_ONLY_TEMPORAL, you see about the same pattern, with the LOAD_ONLY_TEMPORAL numbers being about 20% higher for references but lower for hits. This would indicate that the vmovdqa loop is doing extra prefetches past the tail and prefetching less effectively. So rep movsb does a better job prefetching the load address (thus fewer total references and a higher hit rate).
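Similarly, a guess at the LOAD_ONLY_TEMPORAL variant (also not shown above); the XOR accumulator and empty asm sink are just my way of keeping the loads alive:
// Hypothetical LOAD_ONLY_TEMPORAL variant: only the loads, folded into a
// register so the compiler cannot drop them.
__m256i acc = _mm256_setzero_si256();
for (uint64_t src_end = src + len; src < src_end; src += 64) {
    acc = _mm256_xor_si256(acc, load(src));
    acc = _mm256_xor_si256(acc, load(src + 32));
}
asm volatile("" : : "x"(acc));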
Results
The following is what I am thinking from the data:
rep movsb does NOT optimize out RFO requests for a given load/store.
Maybe it's a different type of RFO request that does not require data to be sent, but I have been unable to find a counter to test this.
rep movsb does not prefetch stores and does not use non-temporal stores. It thus issues fewer RFO requests for stores because it doesn't pull in unnecessary lines with prefetching.
Possibly it is expecting the store buffer to hide the latency of getting the lines into cache, as it knows there is never a dependency on the stored value.
Possibly the heuristic is that a false invalidation of another core's data is too expensive, so it doesn't want to prefetch lines into E/M state.
I have a hard time reconciling this with "good performance"
rep movsb is prefetching loads and does so better than a normal temporal load loop.
Edit4:
Using a new perf recipe to measure uncore reads / writes:
perf stat -a -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" ./rfo_test
The idea is that if rep stosb is sending RFO-ND, then it should have about the same numbers as movntdq. This seems to be the case.
TODO = STORE_ONLY_REP_STOSB
24,251,861 data_reads
52,130,870 data_writes
TODO = STORE_ONLY_TEMPORAL
Note: this is done with vmovdqa ymm, (%reg). This is not a 64-byte store, so an RFO w/ data should be necessary. I did test this with vmovdqa32 zmm, (%reg) and saw about the same numbers. That means either 1) zmm stores are not optimized to skip the RFO in favor of an ItoM, or 2) these events are not indicative of what I think they are. Beware.
39,785,140 data_reads
35,225,418 data_writes
TODO = STORE_ONLY_NON_TEMPORAL
22,680,373 data_reads
51,057,807 data_writes
One thing that is strange is that while reads are lower for STORE_ONLY_NON_TEMPORAL and STORE_ONLY_REP_STOSB, writes are higher for both of them.
There is a real name for RFO-ND: ItoM.
RFO: for writes to part of a cache line. If the line is in 'I' state, data needs to be forwarded to the requester.
ItoM: for writes to a full cache line. If the line is in 'I' state, data does NOT need to be forwarded.
It's aggregated with RFO in OFFCORE_REQUESTS.DEMAND_RFO. Intel has a performance tool that seems to sample its value from an MSR, but it doesn't have support for ICL, and so far I am having trouble finding documentation for ICL. I need to investigate more into how to isolate it.
Edit5: The reason for fewer writes with STORE_ONLY_TEMPORAL earlier was zero-store elimination.
One issue with my measurement method is that the uncore_imc events aren't supported with the --all-user option. I changed the perf recipe a bit to try to mitigate this:
perf stat -D 1000 -C 0 -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" taskset -c 0 ./rfo_test
I pin rfo_test to core 0 and only collect stats on core 0. As well, I only start collecting stats after the first second and usleep in the benchmark until the 1-second mark after setup has completed. There is still some noise, so I included NONE_OF_THE_ABOVE, which is just the perf numbers from the setup / teardown of the benchmark.
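The usleep-until-the-1-second-mark logic is also not shown in the program above; it was presumably something along these lines (sketch; needs <sys/time.h> and <unistd.h>, names are mine):
// Sketch: sleep until ~1s after program start so `perf stat -D 1000`
// skips the mmap / page-in setup phase entirely.
static void sleep_until_us(struct timeval start, uint64_t target_us) {
    struct timeval now;
    gettimeofday(&now, NULL);
    uint64_t elapsed = 1000000UL * (now.tv_sec - start.tv_sec)
                       + (now.tv_usec - start.tv_usec);
    if (elapsed < target_us)
        usleep(target_us - elapsed);
}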
TODO = STORE_ONLY_REP_STOSB
2,951,318 data_reads
18,034,260 data_writes
TODO = STORE_ONLY_TEMPORAL
20,021,299 data_reads
18,048,681 data_writes
TODO = STORE_ONLY_NON_TEMPORAL
2,876,755 data_reads
18,030,816 data_writes
TODO = NONE_OF_THE_ABOVE
2,942,999 data_reads
1,274,211 data_writes

Related

QSPI connection on STM32 microcontrollers with other peripherals instead of Flash memories

I am starting a project which needs the QSPI protocol. The component I will use is a 16-bit ADC which supports QSPI with all combinations of clock phase and polarity. Unfortunately, I couldn't find a source on the internet showing QSPI on STM32 working with components other than Flash memories. Now, my question: can I use the STM32's QSPI peripheral to communicate with other devices that support QSPI, or is it only intended to be used with memories?
The ADC component I want to use is: ADS9224R (16-bit, 3MSPS)
Here is the image from the datasheet that illustrates that this device supports the full QSPI protocol.
Many thanks
page 33 of the datasheet
The STM32 QSPI can work in several modes. The Memory Mapped mode is specifically designed for memories. The Indirect mode, however, can be used for any peripheral. In this mode you can specify the format of the commands that are exchanged: presence of an instruction, of an address, of data, etc.
See register QUADSPI_CCR.
QUADSPI supports indirect mode, where for each data transaction you manually specify the command, the number of bytes in the address part, the number of data bytes, the number of lines used for each part of the communication, and so on. I don't know whether HAL supports all of that; it would probably be more efficient to work directly with the QUADSPI registers - there are simply too many levers and controls you need to set up, and if the library is missing something, things may not work as you want, and QUADSPI is pretty unpleasant to debug. Luckily, after the initial setup, you probably won't need to change very much in its settings.
In fact, some time ago, when I was learning QUADSPI, I wrote my own indirect read/write for a QUADSPI flash. Purely a demo program for myself. With a bit of tweaking it shouldn't be hard to adapt. From my personal experience, QUADSPI is a little hard at first; I spent a couple of weeks debugging it with a logic analyzer until I got it to work. Or maybe it was due to my general inexperience.
Below you can find one of my functions, which can be used after initial setup of QUADSPI. Other communication functions are around the same length. You only need to set some settings in a few registers. Be careful with the order of your register manipulations - there is no "start communication" flag/bit/command. Communication starts automatically when you set some parameters in specific registers. This is explicitly stated in the reference manual, QUADSPI section, which was the only documentation I used to write my code. There is surprisingly limited information on QUADSPI available on the Internet, even less with registers.
Here is a piece from my basic example code on registers:
void QSPI_readMemoryBytesQuad(uint32_t address, uint32_t length, uint8_t destination[]) {
    while (QUADSPI->SR & QUADSPI_SR_BUSY); // Make sure no operation is going on
    QUADSPI->FCR = QUADSPI_FCR_CTOF | QUADSPI_FCR_CSMF | QUADSPI_FCR_CTCF | QUADSPI_FCR_CTEF; // Clear all flags
    QUADSPI->DLR = length - 1U; // Set number of bytes to read
    QUADSPI->CR = (QUADSPI->CR & ~(QUADSPI_CR_FTHRES)) | (0x00 << QUADSPI_CR_FTHRES_Pos); // Set FIFO threshold to 1
    /*
     * Set communication configuration register
     * Functional mode:  Indirect read
     * Data mode:        4 Lines
     * Instruction mode: 4 Lines
     * Address mode:     4 Lines
     * Address size:     24 Bits
     * Dummy cycles:     6 Cycles
     * Instruction:      Quad Output Fast Read
     *
     * Then set the 24-bit address.
     */
    QUADSPI->CCR =
        (QSPI_FMODE_INDIRECT_READ << QUADSPI_CCR_FMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_DMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_IMODE_Pos) |
        (QIO_QUAD << QUADSPI_CCR_ADMODE_Pos) |
        (QSPI_ADSIZE_24 << QUADSPI_CCR_ADSIZE_Pos) |
        (0x06 << QUADSPI_CCR_DCYC_Pos) |
        (MT25QL128ABA1EW9_COMMAND_QUAD_OUTPUT_FAST_READ << QUADSPI_CCR_INSTRUCTION_Pos);
    QUADSPI->AR = (0xFFFFFF) & address;
    /* ---------- Communication Starts Automatically ---------- */
    while (QUADSPI->SR & QUADSPI_SR_BUSY) {
        if (QUADSPI->SR & QUADSPI_SR_FTF) {
            *destination = *((uint8_t*) &(QUADSPI->DR)); // Read a byte from data register, byte access
            destination++;
        }
    }
    QUADSPI->FCR = QUADSPI_FCR_CTOF | QUADSPI_FCR_CSMF | QUADSPI_FCR_CTCF | QUADSPI_FCR_CTEF; // Clear flags
}
It is a little crude, but it may be a good starting point for you, and it's well-tested and definitely works. You can find all my functions here (GitHub). Combine it with reading the QUADSPI section of the reference manual, and you should start to get a grasp of how to make it work.
Your job will be to determine what kind of commands, and in what format, you need to send to your QSPI slave device. That information is available in the device's datasheet. Make sure you send the command, the address, and every other part on the correct number of QUADSPI lines. For example, sometimes you need to have the command on 1 line and data on all 4, all in the same transaction. Make sure you set dummy cycles, if they are required for some operation. Pay special attention to how you read the data that you receive via QUADSPI. You can read it in 32-bit words at once (if the incoming data is a whole number of 32-bit words). In my case - in the function provided here - I read it by individual bytes, hence the scary-looking *destination = *((uint8_t*) &(QUADSPI->DR));, where I take the address of the data register, cast it to a pointer to uint8_t and dereference it. Otherwise, if you read DR just as QUADSPI->DR, your MCU reads a 32-bit word for every byte that arrives, and QUADSPI goes crazy and hangs and shows various errors and triggers FIFO threshold flags and stuff. Just be mindful of how you read that register.
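For example, a word-wise variant of the drain loop above might look like this sketch (my adaptation, assuming the transfer length is a multiple of 4 and FTHRES has been raised so FTF means at least 4 bytes are available):
// Sketch: word-wise drain of the RX FIFO when `length` is a multiple of 4.
uint32_t *dst32 = (uint32_t *) destination;
while (QUADSPI->SR & QUADSPI_SR_BUSY) {
    if (QUADSPI->SR & QUADSPI_SR_FTF) {
        *dst32++ = QUADSPI->DR; // 32-bit access pops four FIFO bytes at once
    }
}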

What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory?

In this paper, it is written that 8-byte sequential writes to Optane PM using clwb and ntstore have 90 ns and 62 ns latency, respectively, and that sequential reads take 169 ns.
But in my test with an Intel 5218R CPU, clwb is about 700 ns and ntstore is about 1200 ns. Of course, there are differences between my test method and the paper's, but the results are so much worse that it seems unreasonable. And my test is closer to actual usage.
During the test, did the Write Pending Queue of the CPU's iMC or the WC buffer in the Optane PM become the bottleneck, causing blockage and making the measured latency inaccurate? If this is the case, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!
The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are testing the average latency instead of the best-case latency, like the FAST20 paper does.
ntstore is more expensive than clwb, so its latency is higher. I guess it's a typo in your first paragraph.
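To see the distinction, a best-case-style measurement times one warm operation in an otherwise idle pipeline rather than a back-to-back loop; a rough sketch (not the paper's exact harness):
// Sketch: one-shot timing of store + clwb + sfence on a pre-touched line,
// serialized with rdtscp. Reports cycles; convert using the TSC frequency.
unsigned aux;
p[i] = 0x1111; // warm the line first
_mm_mfence();
uint64_t t0 = __rdtscp(&aux);
p[i] = 0x2222;
_mm_clwb(p + i);
_mm_sfence();
uint64_t t1 = __rdtscp(&aux); // t1 - t0 = best-case cycles for one flush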
Appended on 4/14:
Q: Are there tools to detect a possible bottleneck in the WPQ or buffers?
A: You can get a baseline when the PM is idle, and use this baseline to indicate a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Read two hardware counters from the performance monitoring unit (PMU) in the processor: 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which counts the accumulated number of WPQ entries at each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then calculate the queueing delay of the WPQ: UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
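With a perf build that knows the Cascade Lake uncore events, that might look like the line below (the event spellings are my guess at the perf names; check perf list on your machine):
perf stat -a -e unc_m_pmm_wpq_occupancy.all -e unc_m_pmm_wpq_inserts ./aep_test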
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. “The analysis of inter-process interference on a hybrid memory system.” Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.
https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it is the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as making it all the way to the Optane chips themselves; the Write Pending Queue (WPQ) of the memory controllers is part of the persistence domain on Cascade Lake Intel CPUs like yours; wikichip quotes an Intel image showing this.
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts. So a store + clwb + mfence loop would test the cache-cold case if you don't do something to load the line before the timed interval. (From the paper's description, I think they do.) Future CPUs will hopefully properly support clwb, but at least CSL got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring throughput of a loop, not latency of one store plus mfence itself in a previously-idle CPU pipeline.
Separately from that, rewriting the same line repeatedly seems to be slower than sequential writes, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of CPU's iMC or the WC buffer in the optane PM become the bottleneck, causing blockage, and the measured latency has been inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

MPU-6050 Burst Read Auto Increment

I'm trying to write a driver for the MPU-6050 and I'm stuck on how to proceed regarding reading the raw accelerometer/gyroscope/temperature readings. For instance, the MPU-6050 has the accelerometer X readings in 2 registers: ACCEL_XOUT[15:8] at address 0x3B and ACCEL_XOUT[7:0] at address 0x3C. Of course to read the raw value I need to read both registers and put them together.
BUT
In the description of the registers (in the register map and description sheet, https://invensense.tdk.com/wp-content/uploads/2015/02/MPU-6000-Register-Map1.pdf) it says that to guarantee readings from the same sampling instant I must use burst reads, because as soon as an idle I2C bus is detected, the sensor registers are refreshed with new data from a new sampling instant. The datasheet snippet shows the simple I2C burst read sequence.
However, this approach (to the best of my understanding) would only read the ACCEL_X registers from the same sampling instant if auto-increment were supported (such that the first DATA in the above sequence would be from ACCEL_XOUT[15:8] @ address 0x3B and the second DATA would be from ACCEL_XOUT[7:0] @ address 0x3C). But the datasheet (https://invensense.tdk.com/wp-content/uploads/2015/02/MPU-6000-Datasheet1.pdf) only mentions that I2C burst writes support the auto-increment feature. Without auto-increment on the I2C read side, how would I go about reading two different registers whilst maintaining the same sampling instant?
I also recognize that I could use the sensor's FIFO feature or the interrupt to accomplish what I'm after, but (for my own curiosity) I would like a solution that didn't rely on either.
I also had the same problem; it looks like the documentation on this topic is incomplete.
Reading single sample
I think you can burst read the ACCEL_*OUT_*, TEMP_OUT_* and GYRO_*OUT_* registers. In fact, I tried reading the data one register at a time, but I got frequent data corruption.
Then, just to try, I requested 6 bytes from ACCEL_XOUT_H, 6 bytes from GYRO_XOUT_H and 2 bytes from TEMP_OUT_H and... it worked! No more data corruption!
I think they simply forgot to mention this in the register map.
How to
Here is some example code that can work in the Arduino environment.
These are the functions that I use. They are not very safe, but they work for my project:
////////////////////////////////////////////////////////////////
inline void requestBytes(byte SUB, byte nVals)
{
    Wire.beginTransmission(SAD);
    Wire.write(SUB);
    Wire.endTransmission(false);
    Wire.requestFrom(SAD, nVals);
    while (Wire.available() == 0);
}
////////////////////////////////////////////////////////////////
inline byte getByte(void)
{
    return Wire.read();
}
////////////////////////////////////////////////////////////////
inline void stopRead(void)
{
    Wire.endTransmission(true);
}
////////////////////////////////////////////////////////////////
byte readByte(byte SUB)
{
    requestBytes(SUB, 1);
    byte result = getByte();
    stopRead();
    return result;
}
////////////////////////////////////////////////////////////////
void readBytes(byte SUB, byte* buff, byte count)
{
    requestBytes(SUB, count);
    for (int i = 0; i < count; i++)
        buff[i] = getByte();
    stopRead();
}
At this point, you can simply read the values in this way:
// ACCEL_XOUT_H
// burst read the registers using auto-increment:
byte data[6];
readBytes(ACCEL_XOUT_H, data, 6);
// convert the data:
acc_x = (data[0] << 8) | data[1];
// ...
Warning!
Looks like this cannot be done for other registers. For example, to read the FIFO_COUNT_* I have to do this (otherwise I get incorrect results):
uint16_t FIFO_size(void)
{
    byte bytes[2];
    // this does not work
    //readBytes(FIFO_COUNT_H, bytes, 2);
    bytes[0] = readByte(FIFO_COUNT_H);
    bytes[1] = readByte(FIFO_COUNT_L);
    return unisci_bytes(bytes[0], bytes[1]);
}
Reading the FIFO
Looks like the FIFO works differently: you can burst read by simply requesting multiple bytes from the FIFO_R_W register and the MPU6050 will give you the bytes in the FIFO without incrementing the register.
I found this example where they use I2Cdev::readByte(SAD, FIFO_R_W, buffer) to read a given number of bytes from the FIFO and if you look at I2Cdev::readByte() (here) it simply requests N bytes from the FIFO register:
// ... send FIFO_R_W and request N bytes ...
for(...; ...; count++)
data[count] = Wire.receive();
// ...
How to
This is simple since the FIFO_R_W does not auto-increment:
byte data[12];
void loop() {
// ...
readBytes(FIFO_R_W, data, 12); // <- replace 12 with your burst size
// ...
}
Warning!
Using FIFO_size() is very slow!
Also, my advice is to use a 400 kHz I2C frequency, which is the MPU6050's maximum speed.
Hope it helps ;)
As Luca says, the burst read semantics seem to differ depending on the register the read operation starts at.
Reading consistent samples
To read a consistent set of raw data values, you can use the method I2C.readRegister(int, ByteBuffer, int) with register number 59 (ACCEL_XOUT[15:8]) and a length of 14 to read all the sensor data (ACCEL, TEMP, and GYRO) in one operation and get consistent data.
Burst read of FIFO data
However, if you use the FIFO buffer of the chip, you can start the burst read with the same method signature on register 116 (FIFO_R_W) to read the given amount of data from the chip-internal FIFO buffer. Doing so, you must keep in mind that there is a limit on the number of bytes that can be read in one burst operation. If I'm interpreting https://github.com/joan2937/pigpio/blob/c33738a320a3e28824af7807edafda440952c05d/pigpio.c#L3914 right, a maximum of 31 bytes can be read in a single burst operation.
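So a full FIFO drain has to chunk its reads. A hedged sketch against pigpio's C API (i2cReadI2CBlockData; the 31-byte cap and register number 116 come from the discussion above):
#include <pigpio.h>
// Read `total` bytes from the MPU-6050 FIFO in chunks of at most 31 bytes;
// FIFO_R_W (register 116) does not auto-increment, so re-addressing is safe.
int read_fifo(int handle, char *buf, int total) {
    int done = 0;
    while (done < total) {
        int chunk = (total - done > 31) ? 31 : (total - done);
        if (i2cReadI2CBlockData(handle, 116, buf + done, chunk) < 0)
            return -1; // I2C error
        done += chunk;
    }
    return done;
}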

How to minimize latency when reading audio with ALSA?

When trying to acquire some signals in the frequency domain, I've encountered the issue of having snd_pcm_readi() take a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
Most of the time, snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds. However, every fourth or fifth call to snd_pcm_readi() takes approximately 0.028 seconds. This is a huge difference, and it causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried to experiment with the period size, but it is unclear to me what exactly it does, even after re-reading the documentation multiple times. I don't use an interrupt-driven design; I simply call snd_pcm_readi() and it blocks until it returns - with data.
I can only assume that the reason it blocks for a variable amount of time is that snd_pcm_readi() pulls data from the hardware buffer, which often happens to already have data readily available for transfer to the "application buffer" (which I'm maintaining). However, sometimes there is additional work to do in kernel space or on the hardware side, hence the function call takes longer to return in those cases.
What purpose does the "period size" serve when I'm not using an interrupt driven design? Can my problem be fixed at all by manipulation of the period size, or should I do something else?
I want each call to snd_pcm_readi() to take approximately the same amount of time. I'm not asking for a real-time-compliant API, which I don't imagine ALSA even attempts to be; however, a difference in function call time on the order of 500x (which is what I'm seeing!) is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.
Typically when reading and writing audio, the period size specifies how much data ALSA has reserved in DMA silicon. Normally the period size specifies your latency. So for example while you are filling a buffer for writing through DMA to the I2S silicon, one DMA buffer is already being written out.
If your period size is too small, then the CPU doesn't have time to write the audio out in the scheduled execution slot provided. Typically people aim for a minimum of 500 us or 1 ms of latency. If you are doing heavy forms of computation, then you may want to choose 5 ms or 10 ms of latency. You may choose even more latency if you are on a non-powerful embedded system.
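For capture, the period is chosen through the hw_params API. A minimal sketch requesting a 5 ms period at 48 kHz (standard ALSA calls on an already-opened capture handle pcm; error checks omitted):
#include <alsa/asoundlib.h>
// Sketch: ask for a 5 ms period so each blocking snd_pcm_readi() of one
// period's worth of frames waits a predictable ~5 ms.
snd_pcm_hw_params_t *hw;
snd_pcm_hw_params_alloca(&hw);
snd_pcm_hw_params_any(pcm, hw);
snd_pcm_hw_params_set_access(pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
snd_pcm_hw_params_set_format(pcm, hw, SND_PCM_FORMAT_S16_LE);
unsigned int rate = 48000;
snd_pcm_hw_params_set_rate_near(pcm, hw, &rate, 0);
snd_pcm_uframes_t period = 240; // 240 frames = 5 ms at 48 kHz
snd_pcm_hw_params_set_period_size_near(pcm, hw, &period, 0);
snd_pcm_hw_params(pcm, hw); // commit the configuration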
If you want to push the limit of the system, then you can request that the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One method for increasing priority, taken from the changeThreadPriority method of the gtkIOStream ALSA C++ OO classes, is like so:
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority){
    int ret;
    pthread_t thisThread = pthread_self(); // get the current thread
    struct sched_param origParams, params;
    int origPolicy, policy = SCHED_FIFO, newPolicy = 0;

    if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);

    if (priority < 0) // maximum priority
        params.sched_priority = sched_get_priority_max(policy);
    else
        params.sched_priority = priority;

    if (params.sched_priority > sched_get_priority_max(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
    if (params.sched_priority < sched_get_priority_min(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");
    if ((ret = pthread_setschedparam(thisThread, policy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");
    if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    if (policy != newPolicy)
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");
    printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
    return 0;
}

Gatling: Understanding rampUsersPerSec(minTPS) to maxTPS during seconds

I am examining some Scala code for Gatling where they inject transactions over a period of 20 seconds.
/* TPS = Transactions Per Second */
val minTps = Integer.parseInt(System.getProperty("minTps", "1"))
val maxTps = Integer.parseInt(System.getProperty("maxTps", "5"))
var rampUsersDurationInMinutes = Integer.parseInt(System.getProperty("rampUsersDurationInMinutes", "20"))
setUp(scn.inject(
  rampUsersPerSec(minTps) to maxTps during (rampUsersDurationInMinutes seconds)).protocols(tcilProtocol))
The same question was asked in What does rampUsersPerSec function really do? but never answered. I think that ideally the graph should look like this.
Could you please confirm whether I have correctly understood rampUsersPerSec?
block (ramp) 1 = 4 users +1
block (ramp) 2 = 12 users +2
block (ramp) 3 = 24 users +3
block (ramp) 4 = 40 users +4
block (ramp) 5 = 60 users +5
The results show that the request count is indeed 60. Is my calculation correct?
---- Global Information --------------------------------------------------------
> request count 60 (OK=38 KO=22 )
> min response time 2569 (OK=2569 KO=60080 )
> max response time 61980 (OK=61980 KO=61770 )
> mean response time 42888 (OK=32411 KO=60985 )
> std deviation 20365 (OK=18850 KO=505 )
> response time 50th percentile 51666 (OK=32143 KO=61026 )
> response time 75th percentile 60903 (OK=48508 KO=61371 )
> response time 95th percentile 61775 (OK=61886 KO=61725 )
> response time 99th percentile 61974 (OK=61976 KO=61762 )
> mean requests/sec 0.741 (OK=0.469 KO=0.272 )
---- Response Time Distribution ------------------------------------------------
rampUsersPerSec is an open workload model injection where you specify the rate at which users start the scenario. The Gatling documentation says that this injection profile:
Injects users from starting rate to target rate, defined in users per second, during a given duration. Users will be injected at regular intervals
So while I'm not sure that the example you provide is precisely correct in that Gatling uses a second as the 'regular interval' (it might be a smoother model), you are more or less correct. You specify a starting rate and a final rate, and Gatling works out all the intermediate injection rates for your duration.
Note that this says nothing about the number of concurrent users your simulation will generate - that is a function of the arrival rate (which you control) and the execution time (which you do not).
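As a sanity check on the totals: for a rate that ramps linearly from r0 to r1 over a duration d, the number of users injected is the area under the rate curve,
total = (r0 + r1) / 2 * d = (1 + 5) / 2 * 20 = 60
which matches the request count of 60 in the report above.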