What is the latency of `clwb` and `ntstore` on Intel's Optane Persistent Memory? - x86-64

In this paper, an 8-byte sequential write to Optane PM is reported to take 90 ns with clwb and 62 ns with ntstore, and a sequential read 169 ns.
But in my test on an Intel Xeon 5218R CPU, clwb takes about 700 ns and ntstore about 1200 ns. My test method of course differs from the paper's, but the results seem unreasonably bad, and my test is closer to actual usage.
During the test, did the Write Pending Queue (WPQ) in the CPU's iMC, or the WC buffer inside the Optane DIMM, become the bottleneck and stall things, making the measured latency inaccurate? If so, is there a tool to detect it?
#include "libpmem.h"
#include "stdio.h"
#include "x86intrin.h"
//gcc aep_test.c -o aep_test -O3 -mclwb -lpmem
int main()
{
size_t mapped_len;
char str[32];
int is_pmem;
sprintf(str, "/mnt/pmem/pmmap_file_1");
int64_t *p = pmem_map_file(str, 4096 * 1024 * 128, PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
if (p == NULL)
{
printf("map file fail!");
exit(1);
}
if (!is_pmem)
{
printf("map file fail!");
exit(1);
}
struct timeval start;
struct timeval end;
unsigned long diff;
int loop_num = 10000;
_mm_mfence();
gettimeofday(&start, NULL);
for (int i = 0; i < loop_num; i++)
{
p[i] = 0x2222;
_mm_clwb(p + i);
// _mm_stream_si64(p + i, 0x2222);
_mm_sfence();
}
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
printf("Total time is %ld us\n", diff);
printf("Latency is %ld ns\n", diff * 1000 / loop_num);
return 0;
}
Any help or correction is much appreciated!

The main reason is that repeatedly flushing the same cache line is delayed dramatically [1].
You are also measuring the average latency instead of the best-case latency reported by the FAST '20 paper.
ntstore is more expensive than clwb, so its latency is higher; I guess that's a typo in your first paragraph.
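For example, a minimal, hedged variation on the loop in the question (sketch only, untested on real PM hardware): stride by a full 64-byte cache line so each clwb flushes a different line, which separates the repeated-flush effect from the per-flush cost:

// Same timed loop, but each iteration touches a new 64-byte cache line,
// so no line is flushed twice. p, loop_num and the timing code are as in
// the question; 10000 * 64 bytes stays well inside the 512 MB mapping.
const int stride = 64 / sizeof(int64_t);   // 8 int64_t per cache line
for (int i = 0; i < loop_num; i++)
{
    p[i * stride] = 0x2222;      // one 8-byte store per cache line
    _mm_clwb(p + i * stride);    // flush a line that is never flushed again
    _mm_sfence();                // wait for it to reach the persistence domain
}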
Appended on 4.14:
Q: Are there tools to detect a possible bottleneck in the WPQ or the buffers?
A: You can collect a baseline while the PM is idle and compare against it to spot a possible bottleneck.
Tools:
Intel Memory Bandwidth Monitoring
Read two hardware counters from the processor's performance monitoring unit (PMU): 1) UNC_M_PMM_WPQ_OCCUPANCY.ALL, which accumulates the number of WPQ entries each cycle, and 2) UNC_M_PMM_WPQ_INSERTS, which counts how many entries have been inserted into the WPQ. Then compute the average queueing delay of the WPQ (in cycles) as UNC_M_PMM_WPQ_OCCUPANCY.ALL / UNC_M_PMM_WPQ_INSERTS. [2]
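For example, with a recent Linux perf on a Cascade Lake server that exposes the iMC uncore events (the symbolic event names below are an assumption; check the output of perf list on your machine for the exact spelling):

perf stat -a -e "unc_m_pmm_wpq_occupancy.all" -e "unc_m_pmm_wpq_inserts" ./aep_test

Dividing the occupancy count by the inserts count gives the average WPQ residency in cycles; compare that ratio against the same measurement taken while the PM is idle to see whether the WPQ is backing up during your test.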
[1] Chen, Youmin, et al. "Flatstore: An efficient log-structured key-value storage engine for persistent memory." Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 2020.
[2] Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.

https://www.usenix.org/system/files/fast20-yang.pdf describes what they're measuring: the CPU side of doing one store + clwb + mfence for a cached write (see footnote 1). So it's the CPU-pipeline latency of getting a store "accepted" into something persistent.
This isn't the same thing as making it all the way to the Optane chips themselves; the Write Pending Queues (WPQ) of the memory controllers are part of the persistence domain on Cascade Lake CPUs like yours (wikichip reproduces an Intel diagram showing this).
Footnote 1: Also note that clwb on Cascade Lake works like clflushopt - it just evicts the line. So a store + clwb + mfence loop would test the cache-cold case if you don't do something to load the line before the timed interval (from the paper's description, I think they do). Future CPUs will hopefully support clwb properly, but at least Cascade Lake got the instruction supported, so future libraries won't have to check CPU features before using it.
You're doing many stores, which will fill up any buffers in the memory controller or elsewhere in the memory hierarchy. So you're measuring throughput of a loop, not latency of one store plus mfence itself in a previously-idle CPU pipeline.
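To get closer to the paper's best-case number, you'd time a single persist in an otherwise idle pipeline. A minimal sketch of that idea (my own, not from the paper; persist_once_ns and tsc_ghz are hypothetical names, and rdtscp plus run-to-run noise make this only a rough lower bound; compile with gcc/g++ -O2 -mclwb):

// Hedged sketch: time ONE store + clwb + sfence with rdtscp.
// Assumes `line` points into the pmem mapping and the TSC base clock (GHz) is known.
#include <x86intrin.h>
#include <stdint.h>

static inline uint64_t persist_once_ns(volatile int64_t *line, double tsc_ghz)
{
    unsigned aux;

    *line = 0x1111;                 // plain store first: RFO the line into cache
                                    // so the timed section is the cache-hit case

    uint64_t t0 = __rdtscp(&aux);
    *line = 0x2222;                 // the single 8-byte store we want to time
    _mm_clwb((void *)line);         // push the line toward the iMC / WPQ
    _mm_sfence();                   // stall until the flush is accepted as persistent
    uint64_t t1 = __rdtscp(&aux);

    return (uint64_t)((double)(t1 - t0) / tsc_ghz);   // cycles -> ns at the TSC base clock
}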
Separate from that, rewriting the same line repeatedly seems to be slower than sequential write, for example. This Intel forum post reports "higher latency" for "flushing a cacheline repeatedly" than for flushing different cache lines. (The controller inside the DIMM does do wear leveling, BTW.)
Fun fact: later generations of Intel CPUs (perhaps CPL or ICX) will have even the caches (L3?) in the persistence domain, hopefully making clwb even cheaper. IDK if that would affect back-to-back movnti throughput to the same location, though, or even clflushopt.
During the test, did the Write Pending Queue of CPU's iMC or the WC buffer in the optane PM become the bottleneck, causing blockage, and the measured latency has been inaccurate?
Yes, that would be my guess.
If this is the case, is there a tool to detect it?
I don't know, sorry.

Related

Why am I seeing more RFO (Read For Ownership) requests using REP MOVSB than with vmovdqa

Check out Edit3.
I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That said, AFAIK I only see a reduction in RFO requests with rep movsb compared to a temporal-store memcpy because of better prefetching on loads and no prefetching on stores, NOT because RFO requests are optimized out for full-cache-line stores. This kind of makes sense, as we don't see RFO requests optimized out for vmovdqa with a zmm register, which we would expect if that were really the case for full-cache-line stores. That said, the lack of prefetching on stores and the lack of non-temporal writes make it hard to see how rep movsb achieves reasonable performance.
Edit: It is possible that the RFO requests from rep movsb differ from those for vmovdqa, in that rep movsb might not request data, just take the line in exclusive state. This could also be the case for stores with a zmm register. I don't see any perf metrics to test this, however. Does anyone know of any?
Questions
Why am I not seeing a reduction in RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
Why am I seeing more RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
Two separate questions, because I believe I should be seeing a reduction in RFO requests with rep movsb, but if that is not the case, should I be seeing an increase as well?
Background
CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I was trying to test out the number of RFO requests when using different methods of memcpy including:
Temporal Stores -> vmovdqa
Non-Temporal Stores -> vmovntdq
Enhanced REP MOVSB -> rep movsb
And I have been unable to see a reduction in RFO requests using rep movsb. In fact I have been seeing more RFO requests with rep movsb than with temporal stores. This is counter-intuitive, given that the consensus understanding seems to be that on Ivy Bridge and newer, rep movsb is able to avoid RFO requests and in turn save memory bandwidth:
In Enhanced REP MOVSB for memcpy:
When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
Avoiding the RFO request when it knows the entire cache line will be overwritten.
In What's missing/sub-optimal in this memcpy implementation?:
Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not
I wrote a simple test program to verify this but was unable to do so.
Test Program
#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif
#define load(x) _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);   /* mmap signals failure with MAP_FAILED, not NULL */
    return p;
}

void BENCH_ATTR
bench() {
    uint64_t len       = 64UL * (1UL << 22);
    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);

    for (uint64_t i = 0; i < len; i += 4096) {
        /* page in before testing. perf metrics appear to still come through */
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }

    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;

    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    /* test rep movsb */
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    /* test vmovntdq or vmovdqa */
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif
    asm volatile("lfence\n\tmfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}
Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Test Data
Note: Data with less noise in edit2
TODO = TEMPORAL
583,912,867 cpu-cycles
9,352,817 l2_rqsts.all_rfo
188,343,479 offcore_requests_outstanding.cycles_with_demand_rfo
11,560,370 offcore_requests.demand_rfo
0.166557783 seconds time elapsed
0.044670000 seconds user
0.121828000 seconds sys
TODO = NON_TEMPORAL
560,933,296 cpu-cycles
7,428,210 l2_rqsts.all_rfo
123,174,665 offcore_requests_outstanding.cycles_with_demand_rfo
8,402,627 offcore_requests.demand_rfo
0.156790873 seconds time elapsed
0.032157000 seconds user
0.124608000 seconds sys
TODO = REP_MOVSB
566,898,220 cpu-cycles
11,626,162 l2_rqsts.all_rfo
178,043,659 offcore_requests_outstanding.cycles_with_demand_rfo
12,611,324 offcore_requests.demand_rfo
0.163038739 seconds time elapsed
0.040749000 seconds user
0.122248000 seconds sys
TODO = NONE_OF_THE_ABOVE
521,061,304 cpu-cycles
7,527,122 l2_rqsts.all_rfo
123,132,321 offcore_requests_outstanding.cycles_with_demand_rfo
8,426,613 offcore_requests.demand_rfo
0.139873929 seconds time elapsed
0.007991000 seconds user
0.131854000 seconds sys
Test Results
The baseline RFO requests with just the setup but without the memcpy is in TODO = NONE_OF_THE_ABOVE with 7,527,122 RFO requests.
With TODO = TEMPORAL (using vmovdqa) we can see 9,352,817 RFO requests. This is lower than with TODO = REP_MOVSB (using rep movsb) which has 11,626,162 RFO requests. ~2 million more RFO requests with rep movsb than with Temporal Stores. The only case I was able to see RFO requests avoided was the TODO = NON_TEMPORAL (using vmovntdq) which has 7,428,210 RFO requests, about the same as the baseline indicating none from the memcpy itself.
I played around with different sizes for memcpy thinking I might need to decrease / increase the size for rep movsb to make that optimization but I have been seeing the same general results. For all sizes I tested I see the number of RFO requests in the following order NON_TEMPORAL < TEMPORAL < REP_MOVSB.
Theories
[Unlikely] Something new on Icelake?
Edit: @PeterCordes was able to reproduce the results on Skylake.
I don't think this is an Ice Lake-specific thing, as the only changes I could find in the Intel manual on rep movsb for Ice Lake are:
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID.(EAX=7H, ECX=0H):EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
Which should not be playing a factor in the test program I am using given that len is well above 128.
[Likelier] My test program is broken
I don't see any issues, but this is a very surprising result. At the very least, I verified that the compiler is not optimizing out the tests here.
Edit: Fixed build instructions to use G++ instead of GCC and file postfix from .c to .cc
Edit2:
Back to C and GCC.
Better perf recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Numbers with better perf recipe (same trend but less noise):
TODO = TEMPORAL
161,214,341 cpu-cycles
1,984,998 l2_rqsts.all_rfo
61,238,129 offcore_requests_outstanding.cycles_with_demand_rfo
3,161,504 offcore_requests.demand_rfo
0.169413413 seconds time elapsed
0.044371000 seconds user
0.125045000 seconds sys
TODO = NON_TEMPORAL
142,689,742 cpu-cycles
3,106 l2_rqsts.all_rfo
4,581 offcore_requests_outstanding.cycles_with_demand_rfo
30 offcore_requests.demand_rfo
0.166300952 seconds time elapsed
0.032462000 seconds user
0.133907000 seconds sys
TODO = REP_MOVSB
150,630,752 cpu-cycles
4,194,202 l2_rqsts.all_rfo
54,764,929 offcore_requests_outstanding.cycles_with_demand_rfo
4,194,016 offcore_requests.demand_rfo
0.166844489 seconds time elapsed
0.036620000 seconds user
0.130205000 seconds sys
TODO = NONE_OF_THE_ABOVE
89,611,571 cpu-cycles
321 l2_rqsts.all_rfo
3,936 offcore_requests_outstanding.cycles_with_demand_rfo
19 offcore_requests.demand_rfo
0.142347046 seconds time elapsed
0.016264000 seconds user
0.126046000 seconds sys
Edit3: This may have to do with whether RFO events triggered by the L2 prefetcher are counted.
I used the perf recipe @BeeOnRope made that includes RFO events started by the L2 prefetcher:
perf stat --all-user -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/ ./rfo_test
And the equivalent perf recipe without L2 prefetch events:
perf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/ ./rfo_test
And got more reasonable results:
TL;DR: with the prefetching numbers included, we see fewer RFO requests with rep movsb. But it does not appear that rep movsb actually avoids RFO requests; rather, it just touches fewer cache lines.
Data With and Without Prefetch Triggered Events Included

TODO                    | Perf Event          | w/ Prefetching | w/o Prefetching | Difference
------------------------|---------------------|----------------|-----------------|-----------
TEMPORAL                | l2_rqsts_references |       16812993 |         4358692 |   12454301
TEMPORAL                | l2_rqsts_all_rfo    |       14443392 |         1981560 |   12461832
TEMPORAL                | l2_rqsts_rfo_hit    |        1297932 |         1038243 |     259689
TEMPORAL                | l2_rqsts_rfo_miss   |       13145460 |          943317 |   12202143
NON_TEMPORAL            | l2_rqsts_references |        8820287 |         1946591 |    6873696
NON_TEMPORAL            | l2_rqsts_all_rfo    |        6852605 |             346 |    6852259
NON_TEMPORAL            | l2_rqsts_rfo_hit    |          66845 |             317 |      66528
NON_TEMPORAL            | l2_rqsts_rfo_miss   |        6785760 |              29 |    6785731
REP_MOVSB               | l2_rqsts_references |       11856549 |         7400277 |    4456272
REP_MOVSB               | l2_rqsts_all_rfo    |        8633330 |         4194510 |    4438820
REP_MOVSB               | l2_rqsts_rfo_hit    |        1394372 |             546 |    1393826
REP_MOVSB               | l2_rqsts_rfo_miss   |        7238958 |         4193964 |    3044994
LOAD_ONLY_TEMPORAL      | l2_rqsts_references |        6058269 |          619924 |    5438345
LOAD_ONLY_TEMPORAL      | l2_rqsts_all_rfo    |        5103905 |             337 |    5103568
LOAD_ONLY_TEMPORAL      | l2_rqsts_rfo_hit    |         438518 |             311 |     438207
LOAD_ONLY_TEMPORAL      | l2_rqsts_rfo_miss   |        4665387 |              26 |    4665361
STORE_ONLY_TEMPORAL     | l2_rqsts_references |        8069068 |          837616 |    7231452
STORE_ONLY_TEMPORAL     | l2_rqsts_all_rfo    |        8033854 |          802969 |    7230885
STORE_ONLY_TEMPORAL     | l2_rqsts_rfo_hit    |         585938 |          576955 |       8983
STORE_ONLY_TEMPORAL     | l2_rqsts_rfo_miss   |        7447916 |          226014 |    7221902
STORE_ONLY_REP_STOSB    | l2_rqsts_references |        4296169 |         4228643 |      67526
STORE_ONLY_REP_STOSB    | l2_rqsts_all_rfo    |        4261756 |         4194548 |      67208
STORE_ONLY_REP_STOSB    | l2_rqsts_rfo_hit    |          17337 |             309 |      17028
STORE_ONLY_REP_STOSB    | l2_rqsts_rfo_miss   |        4244419 |         4194239 |      50180
STORE_ONLY_NON_TEMPORAL | l2_rqsts_references |          99713 |           36112 |      63601
STORE_ONLY_NON_TEMPORAL | l2_rqsts_all_rfo    |          64148 |             427 |      63721
STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_hit    |          17091 |             398 |      16693
STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_miss   |          47057 |              29 |      47028
NONE_OF_THE_ABOVE       | l2_rqsts_references |          74074 |           27656 |      46418
NONE_OF_THE_ABOVE       | l2_rqsts_all_rfo    |          46833 |             375 |      46458
NONE_OF_THE_ABOVE       | l2_rqsts_rfo_hit    |          16366 |             344 |      16022
NONE_OF_THE_ABOVE       | l2_rqsts_rfo_miss   |          30467 |              31 |      30436
It seems most of the RFO differences boil down to prefetching. From Enhanced REP MOVSB for memcpy:
Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Stores
It all appears to come down to rep movsb not prefetching store addresses, causing fewer lines to require an RFO request. With STORE_ONLY_REP_STOSB we can get a better idea of where the RFO requests are saved with rep movsb (assuming the two are implemented similarly). With prefetching events NOT counted, rep movsb has about exactly the same number of RFO requests as rep stosb (and the same breakdown of hits/misses). It has roughly 2.5 million extra L2 references, which are fair to attribute to the loads.
What's especially interesting about the STORE_ONLY_REP_STOSB numbers is that they barely change between the prefetch and non-prefetch data. This makes me think that rep stosb, at the very least, is NOT prefetching the store address. This also matches the fact that we see almost no RFO hits and almost entirely RFO misses. Temporal-store memcpy, on the other hand, IS prefetching the store address, so the original numbers were skewed: they didn't count the store RFO requests from vmovdqa but counted all of them from rep movsb.
Another point of interest is that STORE_ONLY_REP_STOSB still has many RFO requests compared with STORE_ONLY_NON_TEMPORAL. This makes me think rep movsb / rep stosb only saves RFO requests on stores because it does not issue extra prefetches, but it still uses a temporal store that goes through the cache. One thing I am having a hard time reconciling is that the stores from rep movsb / rep stosb seem to neither prefetch nor use non-temporal stores, and they still incur an RFO, so I am unsure how it achieves comparable performance.
Loads
I think rep movsb is prefetching loads, and it is doing a better job of it than the standard vmovdqa loop. If you look at the diff between rep movsb with and without prefetch and the diff for LOAD_ONLY_TEMPORAL, you see about the same pattern, with the LOAD_ONLY_TEMPORAL numbers being about 20% higher for references but lower for hits. This would indicate that the vmovdqa loop is doing extra prefetches past the tail and prefetching less effectively. So rep movsb does a better job prefetching the load addresses (thus fewer total references and a higher hit rate).
Results
The following is what I am thinking from the data:
rep movsb does NOT optimize out RFO requests for a given load/store
Maybe it's a different type of RFO request that does not require data to be sent, but I have been unable to find a counter to test this.
rep movsb does not prefetch stores and does not use non-temporal stores. It thus uses fewer RFO requests for stores because it doesn't pull in unnecessary lines with prefetching.
Possibly it expects the store buffer to hide the latency of getting the lines into cache, since it knows there is never a dependency on the stored value.
Possibly the heuristic is that falsely invalidating another core's data is too expensive, so it doesn't want to prefetch lines into E/M state.
I have a hard time reconciling this with "good performance"
rep movsb is prefetching loads and does so better than a normal temporal load loop.
Edit4:
Using new perf recipe to measure uncore reads / writes:
perf stat -a -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" ./rfo_test
The idea is that if rep stosb is sending RFO-ND, then it should have about the same numbers as movntdq. This seems to be the case.
TODO = STORE_ONLY_REP_STOSB
24,251,861 data_reads
52,130,870 data_writes
TODO = STORE_ONLY_TEMPORAL
Note: this is done with vmovdqa ymm, (%reg). This is not a 64-byte store, so an RFO with data should be necessary. I did test this with vmovdqa32 zmm, (%reg) and saw about the same numbers. That means either 1) zmm stores are not optimized to skip the RFO in favor of an ItoM, or 2) these events are not indicative of what I think they are. Beware.
39,785,140 data_reads
35,225,418 data_writes
TODO = STORE_ONLY_NON_TEMPORAL
22,680,373 data_reads
51,057,807 data_writes
One thing that is strange is that while reads are lower for STORE_ONLY_NON_TEMPORAL and STORE_ONLY_REP_STOSB, writes are higher for both of them.
There is a real name for RFO-ND: ItoM.
RFO: for writes to part of a cache line. If the line is in 'I', data needs to be forwarded to the requester.
ItoM: for writes to a full cache line. If the line is in 'I', data does NOT need to be forwarded.
It's aggregated with RFO in OFFCORE_REQUESTS.DEMAND_RFO. Intel has a performance tool that seems to sample its value from an MSR, but it doesn't support ICL, and so far I am having trouble finding documentation for ICL. I need to investigate further how to isolate it.
Edit5: The reason for fewer writes with STORE_ONLY_TEMPORAL earlier was zero-store elimination.
One issue with my measurement method is that the uncore_imc events aren't supported with the --all-user option. I changed the perf recipe a bit to try to mitigate this:
perf stat -D 1000 -C 0 -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" taskset -c 0 ./rfo_test
I pin rfo_test to core 0 and only collect stats on core 0. I also only start collecting stats after the first second, and usleep in the benchmark until the one-second mark after setup has completed. There is still some noise, so I included NONE_OF_THE_ABOVE, which is just the perf numbers from the setup/teardown of the benchmark.
TODO = STORE_ONLY_REP_STOSB
2,951,318 data_reads
18,034,260 data_writes
TODO = STORE_ONLY_TEMPORAL
20,021,299 data_reads
18,048,681 data_writes
TODO = STORE_ONLY_NON_TEMPORAL
2,876,755 data_reads
18,030,816 data_writes
TODO = NONE_OF_THE_ABOVE
2,942,999 data_reads
1,274,211 data_writes

How to minimize latency when reading audio with ALSA?

When trying to acquire some signals in the frequency domain, I've encountered the issue of having snd_pcm_readi() take a wildly variable amount of time. This causes problems in the logic section of my code, which is time dependent.
I find that most of the time, snd_pcm_readi() returns after approximately 0.00003 to 0.00006 seconds. However, every fourth or fifth call to snd_pcm_readi() takes approximately 0.028 seconds. This is a huge difference, and causes the logic part of my code to fail.
How can I get a consistent time for each call to snd_pcm_readi()?
I've tried to experiment with the period size, but it is unclear to me what exactly it does even after re-reading the documentation multiple times. I don't use an interrupt driven design, I simply call snd_pcm_readi() and it blocks until it returns -- with data.
I can only assume that the reason it blocks for a variable amount of time, is that snd_pcm_readi() pulls data from the hardware buffer, which happens to already have data readily available for transfer to the "application buffer" (which I'm maintaining). However, sometimes, there is additional work to do in kernel space or on the hardware side, hence the function call takes longer to return in these cases.
What purpose does the "period size" serve when I'm not using an interrupt driven design? Can my problem be fixed at all by manipulation of the period size, or should I do something else?
I want each call to snd_pcm_readi() to take approximately the same amount of time. I'm not asking for a real-time-compliant API, which I don't imagine ALSA even attempts to be; however, a difference in call time on the order of 500x (which is what I'm seeing!) is a real problem.
What can be done about it, and what should I do about it?
I would present a minimal reproducible example, but this isn't easy in my case.
Typically when reading and writing audio, the period size specifies how much data ALSA reserves in the DMA hardware. Normally the period size determines your latency. So, for example, while you are filling one buffer for writing through DMA to the I2S hardware, another DMA buffer is already being written out.
If your period size is too small, the CPU doesn't have time to write the audio out in the scheduled execution slot it is given. Typically people aim for a minimum of 500 us or 1 ms of latency. If you are doing heavy computation, you may want to choose 5 ms or 10 ms of latency, and even more on a low-powered embedded system.
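As a rough illustration (a minimal sketch, not from the answer above; the device name "default", the 48 kHz rate, and the sizes are assumptions), the period and buffer sizes are negotiated through the hw_params API before capturing:

// Hedged sketch: request an explicit period/buffer size for a capture stream.
#include <alsa/asoundlib.h>

static int open_capture(snd_pcm_t **pcm)
{
    int err = snd_pcm_open(pcm, "default", SND_PCM_STREAM_CAPTURE, 0);
    if (err < 0) return err;

    snd_pcm_hw_params_t *hw;
    snd_pcm_hw_params_alloca(&hw);
    snd_pcm_hw_params_any(*pcm, hw);

    snd_pcm_hw_params_set_access(*pcm, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
    snd_pcm_hw_params_set_format(*pcm, hw, SND_PCM_FORMAT_S16_LE);
    snd_pcm_hw_params_set_channels(*pcm, hw, 2);

    unsigned int rate = 48000;
    snd_pcm_hw_params_set_rate_near(*pcm, hw, &rate, NULL);

    // One period is roughly how much data the hardware accumulates before a
    // blocking read can complete; the buffer holds several periods.
    snd_pcm_uframes_t period = 256;            // ~5.3 ms at 48 kHz
    snd_pcm_uframes_t buffer = period * 4;
    snd_pcm_hw_params_set_period_size_near(*pcm, hw, &period, NULL);
    snd_pcm_hw_params_set_buffer_size_near(*pcm, hw, &buffer);

    return snd_pcm_hw_params(*pcm, hw);        // commit the configuration
}

Reading exactly one period per snd_pcm_readi() call then makes each call block for roughly one period's worth of audio, rather than returning almost instantly when data happens to be buffered already and taking much longer when it isn't.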
If you want to push the limit of the system, then you can request the priority of the audio processing thread be increased. By increasing the priority of your thread, you ask the scheduler to process your audio thread before all other threads with lower priority.
One method for increasing the priority, taken from the gtkIOStream ALSA C++ OO classes (the changeThreadPriority method), is as follows:
/** Set the current thread's priority
\param priority <0 implies maximum priority, otherwise must be between sched_get_priority_max and sched_get_priority_min
\return 0 on success, error code otherwise
*/
static int changeThreadPriority(int priority){
    int ret;
    pthread_t thisThread = pthread_self(); // get the current thread
    struct sched_param origParams, params;
    int origPolicy, policy = SCHED_FIFO, newPolicy = 0;

    if ((ret = pthread_getschedparam(thisThread, &origPolicy, &origParams)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    printf("ALSA::Stream::changeThreadPriority : Current thread policy %d and priority %d\n", origPolicy, origParams.sched_priority);

    if (priority < 0) // maximum priority
        params.sched_priority = sched_get_priority_max(policy);
    else
        params.sched_priority = priority;

    if (params.sched_priority > sched_get_priority_max(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too high\n");
    if (params.sched_priority < sched_get_priority_min(policy))
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_PRIORITY_ERROR, "requested priority is too low\n");

    if ((ret = pthread_setschedparam(thisThread, policy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_setschedparam - are you su or do you have permission to set this priority?\n");
    if ((ret = pthread_getschedparam(thisThread, &newPolicy, &params)) != 0)
        return ALSA::ALSADebug().evaluateError(ret, "when trying to pthread_getschedparam\n");
    if (policy != newPolicy)
        return ALSA::ALSADebug().evaluateError(ALSA_SCHED_POLICY_ERROR, "requested scheduler policy is not correctly set\n");

    printf("ALSA::Stream::changeThreadPriority : New thread priority changed to %d\n", params.sched_priority);
    return 0;
}
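For example (a hedged usage note; exactly where the method lives depends on the gtkIOStream version), you would call it once from the audio thread before entering the capture loop:

// Hedged usage sketch: ask for maximum SCHED_FIFO priority for this thread
// before starting the blocking snd_pcm_readi() loop (usually needs root or
// appropriate rtprio limits in /etc/security/limits.conf).
changeThreadPriority(-1);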

High CPU and Memory Consumption on using boost::asio async_read_some

I have made a server that reads data from clients using boost::asio async_read_some. I have one handler function, and _ioService->poll() runs the event-processing loop to execute ready handlers. In the handler _handleAsyncReceive I deallocate the buf that was allocated in receiveDataAsync. bufferSize is 500.
Code is as follows:
bool
TCPSocket::receiveDataAsync( unsigned int bufferSize )
{
    char *buf = new char[bufferSize + 1];
    try
    {
        _tcpSocket->async_read_some( boost::asio::buffer( (void*)buf, bufferSize ),
                                     boost::bind( &TCPSocket::_handleAsyncReceive,
                                                  this,
                                                  buf,
                                                  boost::asio::placeholders::error,
                                                  boost::asio::placeholders::bytes_transferred ) );
        _ioService->poll();
    }
    catch (std::exception& e)
    {
        LOG_ERROR("Error Receiving Data Asynchronously");
        LOG_ERROR( e.what() );
        delete [] buf;
        return false;
    }
    // we don't delete buf here as it will be deleted by the callback _handleAsyncReceive
    return true;
}

void
TCPSocket::_handleAsyncReceive(char *buf, const boost::system::error_code& ec, size_t size)
{
    if (ec)
    {
        LOG_ERROR("Error occurred while sending data Asynchronously.");
        LOG_ERROR( ec.message() );
    }
    else if (size > 0)
    {
        buf[size] = '\0';
        LOG_DEBUG("Deleting Buffer");
        emit _asyncDataReceivedSignal( QString::fromLocal8Bit( buf ) );
    }
    delete [] buf;
}
The problem is that buffers are allocated at a much faster rate than they are deallocated, so memory usage grows rapidly; at some point it consumes all the memory and the system gets stuck. CPU usage is also around 90%. How can I reduce the memory and CPU consumption?
You have a memory leak. io_service::poll() does not guarantee that it will dispatch your _handleAsyncReceive; it can dispatch another event (e.g. an accept), so the memory at char *buf is lost. My guess is you are calling receiveDataAsync from a loop, but it doesn't matter - the leak exists in any case (just at a different speed).
It's better to follow the asio examples and work with the suggested patterns rather than invent your own.
You might consider using a wrap around buffer, which is also called a circular buffer. Boost has a template circular buffer version available. You can read about it here. The idea behind it is that when it becomes full, it circles around to the beginning where it will store things. You can do the same thing with other structures or arrays as well. For example, I currently use a byte array for this purpose in my application.
The advantage of using a dedicated large circular buffer to hold your messages is that you don't have to worry about creating and deleting memory for each new message that comes in. This avoids fragmentation of memory, which could become a problem.
To determine an appropriate size of the circular buffer, you need to think about the maximum number of messages that can come in and are in some stage of being processed simultaneously; multiply that number by the average size of the messages and then multiply by a fudge factor of perhaps 1.5. The average message size for my application is under 100 bytes. My buffer size is 1 megabyte, which would allow for at least 10,000 messages to accumulate without it affecting the wrap-around buffer. But, if more than 10,000 messages did accumulate without being completely processed, then the circular buffer would be unusable and the program would have to be restarted. I have been thinking about reducing the size of the buffer because the system would probably be dead long before it hit the 10,000 message mark.
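A minimal sketch of that idea with Boost's container (the sizes and the string payload are placeholders, not from the answer):

#include <boost/circular_buffer.hpp>
#include <string>

int main()
{
    // Fixed-capacity ring of raw bytes: no per-message new/delete, and when it
    // fills up the oldest bytes are overwritten instead of growing the heap.
    boost::circular_buffer<char> ring(1024 * 1024);   // ~1 MB, as in the answer

    std::string incoming = "example message";
    ring.insert(ring.end(), incoming.begin(), incoming.end());  // producer side

    // Consumer side: drop bytes once they have been processed.
    std::string consumed(ring.begin(), ring.begin() + 7);
    ring.erase_begin(7);
    return 0;
}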
As PSIAlt suggests, consider following the Boost.Asio examples and build upon their patterns for asynchronous programming.
Nevertheless, I would suggest considering whether multiple read calls need to be queued onto the same socket. If the application only allows for a single read operation to be pending on the socket, then resources are reduced:
There is no longer the scenario where there are an excessive amount of handlers pending in the io_service.
A single buffer can be preallocated and reused for each read operation. For example, the following asynchronous call chain only requires a single buffer, and allows for the concurrent execution of starting an asynchronous read operation while the previous data is being emitted on the Qt signal, as QString performs deep-copies.
TCPSocket::start()
{
receiveDataAsync(...) --.
} |
.---------------'
| .-----------------------------------.
v v |
TCPSocket::receiveDataAsync(...) |
{ |
_tcpSocket->async_read_some(_buffer); --. |
} | |
.-------------------------------' |
v |
TCPSocket::_handleAsyncReceive(...) |
{ |
QString data = QString::fromLocal8Bit(_buffer); |
receiveDataAsync(...); --------------------------'
emit _asyncDataReceivedSignal(data);
}
...
tcp_socket.start();
io_service.run();
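In code, the chain in the diagram might look roughly like this (a hedged sketch of the idea, not a drop-in implementation; _buffer is assumed to be a fixed-size member array on TCPSocket, and error handling is reduced to the essentials):

// Hedged sketch of the single-buffer read chain from the diagram above.
// Assumes TCPSocket has members:
//   boost::asio::ip::tcp::socket *_tcpSocket;
//   char _buffer[500];
void TCPSocket::receiveDataAsync()
{
    _tcpSocket->async_read_some(
        boost::asio::buffer(_buffer, sizeof(_buffer) - 1),
        boost::bind(&TCPSocket::_handleAsyncReceive, this,
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
}

void TCPSocket::_handleAsyncReceive(const boost::system::error_code& ec, size_t size)
{
    if (ec) return;                      // real code: log and decide whether to stop

    _buffer[size] = '\0';
    QString data = QString::fromLocal8Bit(_buffer);  // deep copy of the bytes

    receiveDataAsync();                  // re-arm the read on the same buffer
    emit _asyncDataReceivedSignal(data); // safe: data owns its own copy
}

Because QString::fromLocal8Bit() copies the bytes, re-arming the read before emitting the signal is safe even though both use the same _buffer.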
It is important to identify when and where the io_service's event loop will be serviced. Generally, applications are designed so that the io_service does not run out of work, and the processing thread is simply waiting for events to occur. Thus, it is fairly common to start setting up asynchronous chains, then process the io_service event loop at a much higher scope.
On the other hand, if it is determined that TCPSocket::receiveDataAsync() should process the event loop in a blocking manner, then consider using synchronous operations.

How can a usage counter in Solaris 10 /proc filesystem decrease?

I'm trying to determine the CPU utilization of specific LWPs in specific processes in Solaris 10 using data from the /proc filesystem. The problem I have is that sometimes a utilization counter decreases.
Here's the gist of it:
// we'll be reading from the file named /proc/<pid>/lwp/<lwpid>/lwpusage
// (needs <fcntl.h>, <unistd.h> and <procfs.h> for prusage_t on Solaris)
std::stringstream filename;
filename << "/proc/" << pid << "/lwp/" << lwpid << "/lwpusage";
int usage_fd = open(filename.str().c_str(), O_RDONLY);
// error checking
while (1)
{
    prusage_t usage;
    ssize_t readResult = pread(usage_fd, &usage, sizeof(prusage_t), 0);
    // error checking
    std::cout << "sec=" << usage.pr_stime.tv_sec
              << " nsec=" << usage.pr_stime.tv_nsec << std::endl;
    // wait
}
close(usage_fd);
The number of nanoseconds reported in the prusage_t struct is derived from timestamps recorded each time an LWP changes state. This feature is called microstate accounting. Sounds good, but every so often the "system call cpu time" counter decreases by roughly 1-10 milliseconds.
Update: it's not just the "system call cpu time" counter; I've since seen other counters decreasing as well.
Another curiosity is that it always seems to be exactly one sample that's bogus - never two near each other. All the other samples are monotonically increasing at the expected rate. This seems to rule out the possibility that the counter is somehow reset in the kernel.
Any clues as to what's going on here?
> uname -a
SunOS cdc-build-sol10u7 5.10 Generic_139556-08 i86pc i386 i86pc
If you are on a multicore machine, you might check whether this is occurring when the process is migrated from one processor core to a different one. If your processes are running, prstat will show the cpu on which they are running. To minimize lock contention, frequently updated data is sometimes updated in a processor-specific memory area and then synchronized with any copies of the data for other processors.
Just a guess: you might want to temporarily disable NTP and see if the problem still appears.

Will mmap use continuous memory? (on solaris)

I used mmap (just trying to understand how mmap works) to allocate 96k of anonymous memory, but it looks like it split the 96k into 64k and 32k. When I allocate 960k, it allocates only one chunk of 960k. When will Solaris split an allocation into several parts?
Code:
#define PROT PROT_READ | PROT_WRITE
#define MAP MAP_ANON | MAP_PRIVATE
if ((src = mmap(0, 88304, PROT, MAP, -1, 0)) == MAP_FAILED)
printf("mmap error for input");
if ((src = mmap(0, 983040, PROT, MAP, -1, 0)) == MAP_FAILED)
printf("mmap error for input");
if ((src = mmap(0, 98304, PROT, MAP, -1, 0)) == MAP_FAILED)
printf("mmap error for input");
Truss:
mmap(0x00000000, 88304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E900000
mmap(0x00000000, 983040, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E800000
mmap(0x00000000, 98304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0)
= 0xFFFFFFFF7E700000
Pmap:
FFFFFFFF7E700000      64 - - - rw---   [anon]
FFFFFFFF7E710000      32 - - - rw---   [anon]
FFFFFFFF7E800000     960 - - - rw---   [anon]
FFFFFFFF7E900000      64 - - - rw---   [anon]
FFFFFFFF7E910000      24 - - - rw---   [anon]
FFFFFFFF7EA00000      64 - - - rw---   [anon]
FFFFFFFF7EA10000      32 - - - rw---   [anon]
==> strange: the 96k allocation was broken into 2 parts (64k + 32k).
That is contiguous memory, you can tell by the addresses (F...700000 + 64K = F...710000) so I don't think you have to worry about that. I'm pretty certain that mmap is required to give you contiguous memory in your address space. It would be pretty useless otherwise since it only gives you one base address. With two non-contiguous blocks, there would be no way to find that second block.
So I guess your question is: why does this show up as two blocks in the pmap?
To which my answer would be, "Stuffed if I know". But I can make an intelligent guess which is the best anyone can hope for from me at this time of the morning (pre-coffee).
I would suggest that those blocks had been allocated before to another process (or two) and had been released back to the mmap memory manager. I can see a few possibilities for how that memory manager coalesces blocks to make bigger free blocks:
- it does it as soon as the memory is released (not the case, since your output shows that isn't happening);
- it does it periodically, and it hadn't got around to it before you requested your 96K block; or
- it doesn't bother at all, because it's smart enough to do it during the allocation of a block to you.
I suspect it's the latter, simply because the memory manager had no problems giving you two blocks for your request, so it's obviously built to handle it. The 960K block is probably not segmented because it came from a much bigger block.
Keep in mind this is speculation (informed, but still speculation). I've seen quite a bit of the internals of UNIX (real UNIXes, not that new kid on the block :-) but I've never had a need to delve into mmap.
I can't remember the term for it (stripes? slices? wedges? argh), but Solaris allocates different page sizes from pools of various sizes. This turns out to be somewhat more efficient than uniform page sizes, because it uses the memory mapping better. One of those sizes is 32K, another is 64K, and another is 1024K, I believe. To get 96K you got a 64K and a 32K; to get 960K you got most of a 1024K.
The core resource for this wizardry is the Solaris Internals book. Mine, unfortunately, is in a box in the garage at the moment.
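If you want to see which page sizes a mapping actually received, a hedged suggestion: Solaris pmap has an option that adds a page-size column to each segment in the listing, e.g.

pmap -s <pid>

which should show whether the 64K and 32K pieces are indeed backed by different page sizes.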
The answer depends on what you mean by contiguous. Solaris and all modern Unix and unix-like systems (probably all modern operating systems) will divide physical memory into pages, and the memory within a 'page' will be contiguous at the physical level. Most modern systems have a hardware MMU (Memory Management Unit) which will translate a virtual address to a physical address. So the mmap system call will return a contiguous virtual address space but that virtual address will be managed by an MMU which may use multiple pages depending on the size of the page(s) and the size of the memory mapping.
While all the virtual addresses within the mapping will be contiguous, and the addresses within a 'page' will also be physically contiguous, the pages themselves (and the transitions between pages) may not be anywhere near each other physically.