Memory throughput for strided memory accesses

Memory throughput for strided memory accesses - x86-64

I am measuring memory throughput and runtimes using the _mm256_i32gather_epi32 intrinsic. Here is the loop I use for testing:
for (int i = 0; i < len; i+=8) {
const __m256i* indexes_2 = reinterpret_cast<const __m256i*>(indexes_ptr + i);
__m256i index_reg = _mm256_loadu_si256(indexes_2);
__m256i values = _mm256_i32gather_epi32(data_ptr, index_reg, 4);
sum = _mm256_add_epi32(sum, values);
}
I use the index array (indexes_ptr) to control the access pattern into the data_ptr array. The data_ptr array is 256 MB in size, so it does not fit in the caches. Here are the index patterns I tested:
sequential - 0, 1, 2, 3, etc
stride 4 - 0, 4, 8, 12, etc
stride 16 - 0, 16, 32, 48, etc
stride 32
stride 64
stride 128
So, the intrinsic _mm256_i32gather_epi32 will load 8 values. In my system, the size of a cache line is 64 bytes, so:
sequential touches one cache line
stride 4 touches two cache lines
stride 16 touches eight cache lines
stride 64 touches eight cache lines
stride 128 touches eight cache lines
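The cache-line counts above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming that data_ptr is 64-byte aligned and that each gather starts at a cache-line boundary; the function name cache_lines_touched is made up for this illustration.

```python
CACHE_LINE_BYTES = 64   # cache-line size stated in the question
ELEMENT_BYTES = 4       # the gather loads 32-bit elements (scale = 4)
LANES = 8               # eight 32-bit lanes in a 256-bit register


def cache_lines_touched(stride, first_index=0):
    """Count the distinct cache lines read by one 8-lane gather."""
    lines = set()
    for lane in range(LANES):
        byte_offset = (first_index + lane * stride) * ELEMENT_BYTES
        lines.add(byte_offset // CACHE_LINE_BYTES)
    return len(lines)


for stride in (1, 4, 16, 32, 64, 128):
    print(f"stride {stride:3d}: {cache_lines_touched(stride)} cache line(s) per gather")
```

The count saturates at eight lines from stride 16 onwards, which is why the per-instruction line count alone cannot distinguish strides 16, 32, 64 and 128.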
My expectation is that strides 16, 64 and 128 will have similar runtimes and memory throughputs. This is, however, not the case. Here are the numbers:
sequential, 0.13 s, 16828.2607 MB/s
stride 4, 0.07 s, 17246.1914 MB/s
stride 16, 0.918406 s, 5205.1085 MB/s
stride 32, 1.650566 s, 4756.5279 MB/s
stride 64, 1.798604 s, 5440.2228 MB/s
stride 128, 2.186620 s, 4672.1329 MB/s
Where does the difference between strides 16, 32, 64 and 128 come from, since they all access exactly 8 cache lines with each instruction?


MATLAB: out of memory on a PC with a suitable configuration

I am stuck on a problem that I describe in detail below. For three days I have tried a lot of different things, and none of them worked.
If anyone has an idea of what to do, please let me know!
Here is the error message:
the call to "ft_selectdata" took 0 seconds
preprocessing
Out of memory. Type "help memory" for your options.
Error in ft_preproc_dftfilter (line 187)
tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
Error in ft_preproc_dftfilter (line 144)
filt = ft_preproc_dftfilter(filt, Fs, Fl(i), 'dftreplace', dftreplace, 'dftbandwidth',
dftbandwidth(i), 'dftneighbourwidth', dftneighbourwidth(i)); % enumerate all options
Error in preproc (line 464)
dat = ft_preproc_dftfilter(dat, fsample, cfg.dftfreq, optarg{:});
Error in ft_preprocessing (line 375)
[dataout.trial{i}, dataout.label, dataout.time{i}, cfg] = preproc(data.trial{i}, data.label,
data.time{i}, cfg, begpadding, endpadding);
Error in EEG_Prosocial_script (line 101)
data_intpl = ft_preprocessing(cfg, allData_preprosses);
187 tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
Here is some information from MATLAB about my computer and about the characteristics of the computation:
Maximum possible array: 7406 MB (7.766e+09 bytes)*
Memory available for all arrays: 7406 MB (7.766e+09 bytes) *
Memory used by MATLAB: 4195 MB (4.398e+09 bytes)
Physical Memory (RAM): 12206 MB (1.280e+10 bytes)
Limited by System Memory (physical + swap file) available.
K>> whos
Name     Size          Bytes       Class     Attributes
freqs    7753x1        62024       double
li       1x1           16          double    complex
time     1x1984512     15876096    double
Here is the configuration of the computer that fails to run the script (Alienware Aurora R4):
RAM: 4 GB free of 12 GB at 1.6 GHz: 2x (4 GB, 1600 MHz) + 2x (2 GB, 1600 MHz)
Intel Core i7-3820, 4 cores / 8 threads, 3.7 GHz, 1 CPU
NVIDIA GeForce GTX 690, 2 GB
RAM: Kingston KVT8FP HYC
Hard disk: Kingston SSD, 250 GB, SATA 3
The code works on this computer (Dell Inspiron 14-500):
RAM: 4 GB DDR4, 2666 MHz (1x 4 GB)
Intel® Core™ i5-8265U, 8th generation (6 MB cache, 3.9 GHz)
Intel® UHD Graphics 620
Hard disk: SATA 2.5", 500 GB, 5400 rpm
Thank you
Kind regards,

By computing freqs(:)*time you are trying to create a 7753 x 1984512 array, and you don't have enough memory for that: it needs about 123 gigabytes (1.2309e+11 bytes), whereas your computer has about 7e+09 bytes available.
See, for example, this smaller case:
f=rand(7,1);
t=rand(1,19);
size(f(:)*t)
ans =
7 19
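What you probably want to do is use a for loop over the time elements (or over blocks of them), so that only a small part of the result is in memory at any one time.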
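The size estimate above is easy to reproduce. The sketch below is plain arithmetic in Python, not MATLAB or FieldTrip code; the helper names and the 1 GiB block budget are assumptions chosen for illustration. It uses 8 bytes per real double and 16 bytes per complex double, matching the whos output in the question.

```python
def product_bytes(rows, cols, bytes_per_element=8):
    """Memory needed to hold a dense rows x cols array."""
    return rows * cols * bytes_per_element


def max_columns_per_block(rows, budget_bytes, bytes_per_element=16):
    """Largest number of time samples per block that fits in the budget."""
    return budget_bytes // (rows * bytes_per_element)


ROWS, COLS = 7753, 1984512          # numel(freqs), numel(time)

print(product_bytes(ROWS, COLS))        # freqs(:)*time as real doubles
print(product_bytes(ROWS, COLS, 16))    # exp(2*1i*pi*...) as complex doubles
print(max_columns_per_block(ROWS, 1 << 30))  # columns per 1 GiB block
```

Processing the time axis in blocks of that size (or smaller) keeps the temporary array bounded regardless of how long the recording is.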

Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

All benchmarks are run on either Icelake or Whiskey Lake (in the Skylake family).
Summary
I am seeing a strange phenomenon: when a loop transitions from running out of the Uop Cache to running out of the LSD (Loop Stream Detector), there is a spike in Branch Misses that can cause severe performance hits. I tested on both Icelake and Whiskey Lake by benchmarking a nested loop whose outer loop has a body large enough that the entire thing does not fit in the LSD, but whose inner loop is small enough to fit in the LSD.
Basically, once the inner loop reaches some iteration count, uop delivery seems to switch from idq.dsb_uops (Uop Cache) to lsd.uops (LSD), and at that point there is a large increase in branch-misses (without a corresponding jump in branches), causing a severe performance drop. Note: this only seems to occur for nested loops. Travis Down's Loop Test, for example, will not show any meaningful variation in branch misses. AFAICT this has something to do with when a loop transitions from running out of the Uop Cache to running out of the LSD.
Questions
What is happening when the loop transitions from running out of the
Uop Cache to running out of the LSD that causes this spike in
Branch Misses?
Is there a way to avoid this?
Benchmark
This is the minimum reproducible example I could come up with:
Note: If the .p2align statements are removed, both loops will fit in the LSD and there will not be a transition.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
}
Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>
Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Graphs
Edit: x label is N iterations of inner loop.
Below is a graph of Branch Misses, Branches, and LSD Uops.
Generally you can see that 1) there is no corresponding jump in Branches, 2) the number of added Branch Misses stabilizes at a constant, and 3) there is a strong relationship between Branch Misses and LSD Uops.
Icelake Graph:
Whiskey Lake Graph:
Below is a graph of Branch Misses, Cycles, and LSD Uops for Icelake only, as performance is not affected nearly as much on Whiskey Lake:
Analysis
Hard numbers below.
For Icelake, starting at N = 22 and finishing at N = 27, there is some fluctuation in the number of uops coming from the LSD vs the Uop Cache, and during that time Branch Misses increases by roughly 3 orders of magnitude, from 10^4 to 10^7. During this period Cycles also increases by a factor of 2. For all N > 27, Branch Misses stays at around 1.67 x 10^7 (roughly outer_N). For N = [17, 40], branches continues to increase only linearly.
The results for Whiskey Lake look different in that 1) the fluctuation begins at N = 35 and continues until N = 49, and 2) there is less of a performance impact and more fluctuation in the data. That being said, the increase in Branch Misses corresponding with the transition from uops being fed by the Uop Cache to being fed by the LSD still exists.
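As a sanity check on the counts discussed above, here is a minimal sketch of a branch-count model. It assumes that each outer iteration executes three branches besides the inner-loop jnz: the outer jnz plus one jmp for each of the two large .p2align paddings inside the outer loop. That assumption is an inference from the measured numbers, not something perf reports directly; the measured values are copied from the Icelake results in this post.

```python
OUTER_N = 1 << 24  # outer_N from the benchmark source


def expected_branches(inner_n, extra_per_outer=3):
    """Model: inner_n inner-loop branches plus 3 other branches per outer iteration."""
    return OUTER_N * (inner_n + extra_per_outer)


def relative_error(model, measured):
    return abs(model - measured) / measured


# (branches, branch-misses) measured on Icelake for a few inner-loop counts N
MEASURED = {1: (67129521, 1590), 40: (721441018, 16773492), 64: (1124094228, 16769606)}

for n, (branches, misses) in MEASURED.items():
    err = relative_error(expected_branches(n), branches)
    print(f"N={n:2d}: model error {err:.2e}, misses per outer iteration {misses / OUTER_N:.4f}")
```

Under this model the saturated miss count is about one mispredict per outer-loop iteration, which is the "roughly outer_N" plateau described in the analysis.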
Results
Data is mean result for 25 runs.
Icelake Results:
| N | cycles | branches | branch-misses | idq.ms_uops | idq.dsb_uops | lsd.uops |
|---|---|---|---|---|---|---|
| 1 | 33893260 | 67129521 | 1590 | 43163 | 115243 | 83908732 |
| 2 | 42540891 | 83908928 | 1762 | 49023 | 142909 | 100690381 |
| 3 | 50725933 | 100686143 | 1782 | 47656 | 142506 | 117440256 |
| 4 | 67533597 | 117461172 | 1655 | 52538 | 186123 | 134158311 |
| 5 | 68022910 | 134238387 | 1711 | 53405 | 204481 | 150954035 |
| 6 | 85543126 | 151018722 | 1924 | 62445 | 141397 | 167633971 |
| 7 | 84847823 | 167799220 | 1935 | 60248 | 160146 | 184563523 |
| 8 | 101532158 | 184570060 | 1709 | 60064 | 361208 | 201100179 |
| 9 | 101864898 | 201347253 | 1773 | 63827 | 459873 | 217780207 |
| 10 | 118024033 | 218124499 | 1698 | 59480 | 177223 | 234834304 |
| 11 | 118644416 | 234908571 | 2201 | 62514 | 422977 | 251503052 |
| 12 | 134627567 | 251678909 | 1679 | 57262 | 133462 | 268435650 |
| 13 | 285607942 | 268456135 | 1770 | 74070 | 285032524 | 315423 |
| 14 | 302717754 | 285233352 | 1731 | 74663 | 302101097 | 15953 |
| 15 | 321627434 | 302010569 | 81796 | 77831 | 319192830 | 1520819 |
| 16 | 337876736 | 318787786 | 71638 | 77056 | 335904260 | 1265766 |
| 17 | 353054773 | 335565563 | 1798 | 79839 | 352434780 | 15879 |
| 18 | 369800279 | 352344970 | 1978 | 79863 | 369229396 | 16790 |
| 19 | 386921048 | 369119438 | 1972 | 84075 | 385984022 | 16115 |
| 20 | 404248461 | 385896655 | 29454 | 85348 | 402790977 | 510176 |
| 21 | 421100725 | 402673872 | 37598 | 83400 | 419537730 | 729397 |
| 22 | 519623794 | 419451095 | 4447767 | 91209 | 431865775 | 97827331 |
| 23 | 702206338 | 436228323 | 12603617 | 109064 | 427880075 | 327661987 |
| 24 | 710626194 | 453005538 | 12316933 | 106929 | 432926173 | 344838509 |
| 25 | 863214037 | 469782765 | 14641887 | 121776 | 428085132 | 614871430 |
| 26 | 761037251 | 486559974 | 13067814 | 113011 | 438093034 | 418124984 |
| 27 | 832686921 | 503337195 | 16381350 | 113953 | 421924080 | 556915419 |
| 28 | 854713119 | 520114412 | 16642396 | 124448 | 420515666 | 598907353 |
| 29 | 869873144 | 536891629 | 16572581 | 119280 | 421188631 | 629696780 |
| 30 | 889642335 | 553668847 | 16717446 | 120116 | 420086570 | 668628871 |
| 31 | 906912275 | 570446064 | 16735759 | 126094 | 419970933 | 702822733 |
| 32 | 923023862 | 587223281 | 16706519 | 132498 | 420332680 | 735003892 |
| 33 | 940308170 | 604000498 | 16744992 | 124365 | 419945191 | 770185745 |
| 34 | 957075000 | 620777716 | 16726856 | 133675 | 420215897 | 802779119 |
| 35 | 974557538 | 637554932 | 16763071 | 134871 | 419764866 | 838012637 |
| 36 | 991110971 | 654332149 | 16772560 | 130903 | 419641144 | 872037131 |
| 37 | 1008489575 | 671109367 | 16757219 | 138788 | 419900997 | 904638287 |
| 38 | 1024971256 | 687886583 | 16772585 | 139782 | 419663863 | 938988917 |
| 39 | 1041404669 | 704669411 | 16776722 | 137681 | 419617131 | 972896126 |
| 40 | 1058594326 | 721441018 | 16773492 | 142959 | 419662133 | 1006109192 |
| 41 | 1075179100 | 738218235 | 16776636 | 141185 | 419601996 | 1039892900 |
| 42 | 1092093726 | 754995452 | 16776233 | 142793 | 419611902 | 1073373451 |
| 43 | 1108706464 | 771773224 | 16776359 | 139500 | 419610885 | 1106976114 |
| 44 | 1125413652 | 788549886 | 16764637 | 143717 | 419889127 | 1139628280 |
| 45 | 1142023614 | 805327103 | 16778640 | 144397 | 419558217 | 1174329696 |
| 46 | 1158833317 | 822104321 | 16765518 | 148045 | 419889914 | 1206833484 |
| 47 | 1175665684 | 838881537 | 16778437 | 148347 | 419562885 | 1241397845 |
| 48 | 1192454164 | 855658755 | 16778865 | 153651 | 419552747 | 1275006511 |
| 49 | 1210199084 | 872436025 | 16778287 | 152468 | 419599314 | 1307945613 |
| 50 | 1226321832 | 889213188 | 16778464 | 155552 | 419572344 | 1341893668 |
| 51 | 1242886388 | 905990406 | 16778745 | 155401 | 419559249 | 1375589883 |
| 52 | 1259559053 | 922767623 | 16778809 | 154847 | 419554082 | 1409206082 |
| 53 | 1276875799 | 939544839 | 16778460 | 162521 | 419576455 | 1442424993 |
| 54 | 1293113199 | 956322057 | 16778931 | 154913 | 419550955 | 1476316161 |
| 55 | 1310449232 | 973099274 | 16778534 | 157364 | 419578102 | 1509485876 |
| 56 | 1327022109 | 989876491 | 16778794 | 162881 | 419562403 | 1543193559 |
| 57 | 1344097516 | 1006653708 | 16778906 | 157486 | 419567545 | 1576414302 |
| 58 | 1362935064 | 1023430928 | 16778959 | 315120 | 419583132 | 1609691339 |
| 59 | 1381567560 | 1040208143 | 16778564 | 179997 | 419661259 | 1640660745 |
| 60 | 1394829416 | 1056985359 | 16778779 | 167613 | 419575969 | 1677034188 |
| 61 | 1411847237 | 1073762626 | 16778071 | 166332 | 419613028 | 1710194702 |
| 62 | 1428918439 | 1090539795 | 16778409 | 168073 | 419610487 | 1743644637 |
| 63 | 1445223241 | 1107317011 | 16778486 | 172446 | 419591254 | 1777573503 |
| 64 | 1461530579 | 1124094228 | 16769606 | 169559 | 419970612 | 1810351736 |
Whiskey Lake Results:
| N | cycles | branches | branch-misses | idq.dsb_uops | lsd.uops |
|---|---|---|---|---|---|
| 1 | 8332553879 | 35005847 | 37925 | 1799462 | 6019 |
| 2 | 8329926329 | 51163346 | 34338 | 1114352 | 5919 |
| 3 | 8357233041 | 67925775 | 32270 | 9241935 | 5555 |
| 4 | 8379609449 | 85364250 | 35667 | 18215077 | 5712 |
| 5 | 8394301337 | 101563554 | 33177 | 26392216 | 2159 |
| 6 | 8409830612 | 118918934 | 35007 | 35318763 | 5295 |
| 7 | 8435794672 | 135162597 | 35592 | 43033739 | 4478 |
| 8 | 8445843118 | 152636271 | 37802 | 52154850 | 5629 |
| 9 | 8459141676 | 168577876 | 30766 | 59245754 | 1543 |
| 10 | 8475484632 | 185354280 | 30825 | 68059212 | 4672 |
| 11 | 8493529857 | 202489273 | 31703 | 77386249 | 5556 |
| 12 | 8509281533 | 218912407 | 32133 | 84390084 | 4399 |
| 13 | 8528605921 | 236303681 | 33056 | 93995496 | 2093 |
| 14 | 8553971099 | 252439989 | 572416 | 99700289 | 2477 |
| 15 | 8558526147 | 269148605 | 29912 | 109772044 | 6121 |
| 16 | 8576658106 | 286414453 | 29839 | 118504526 | 5850 |
| 17 | 8591545887 | 302698593 | 28993 | 126409458 | 4865 |
| 18 | 8611628234 | 319960954 | 32568 | 136298306 | 5066 |
| 19 | 8627289083 | 336312187 | 30094 | 143759724 | 6598 |
| 20 | 8644741581 | 353730396 | 49458 | 152217853 | 9275 |
| 21 | 8685908403 | 369886284 | 1175195 | 161313923 | 7958903 |
| 22 | 8694494654 | 387336207 | 354008 | 169541244 | 2553802 |
| 23 | 8702920906 | 403389097 | 29315 | 176524452 | 12932 |
| 24 | 8711458401 | 420211718 | 31924 | 184984842 | 11574 |
| 25 | 8729941722 | 437299615 | 32472 | 194553843 | 12002 |
| 26 | 8743658904 | 453739403 | 28809 | 202074676 | 13279 |
| 27 | 8763317458 | 470902005 | 32298 | 211321630 | 15377 |
| 28 | 8788189716 | 487432842 | 37105 | 218972477 | 27666 |
| 29 | 8796580152 | 504414945 | 36756 | 228334744 | 79954 |
| 30 | 8821174857 | 520930989 | 39550 | 235849655 | 140461 |
| 31 | 8818857058 | 537611096 | 34142 | 648080 | 79191 |
| 32 | 8855038758 | 555138781 | 37680 | 18414880 | 70489 |
| 33 | 8870680446 | 571194669 | 37541 | 34596108 | 131455 |
| 34 | 8888946679 | 588222521 | 33724 | 52553756 | 80009 |
| 35 | 9256640352 | 604791887 | 16658672 | 132185723 | 41881719 |
| 36 | 9189040776 | 621918353 | 12296238 | 257921026 | 235389707 |
| 37 | 8962737456 | 638241888 | 1086663 | 109613368 | 35222987 |
| 38 | 9005853511 | 655453884 | 2059624 | 131945369 | 73389550 |
| 39 | 9005576553 | 671845678 | 1434478 | 143002441 | 51959363 |
| 40 | 9284680907 | 688991063 | 12776341 | 349762585 | 347998221 |
| 41 | 9049931865 | 705399210 | 1778532 | 174597773 | 72566933 |
| 42 | 9314836359 | 722226758 | 12743442 | 365270833 | 380415682 |
| 43 | 9072200927 | 739449289 | 1344663 | 205181163 | 61284843 |
| 44 | 9346737669 | 755766179 | 12681859 | 383580355 | 409359111 |
| 45 | 9117099955 | 773167996 | 1801713 | 235583664 | 88985013 |
| 46 | 9108062783 | 789247474 | 860680 | 250992592 | 43508069 |
| 47 | 9129892784 | 806871038 | 984804 | 268229102 | 51249366 |
| 48 | 9146468279 | 822765997 | 1018387 | 282312588 | 58278399 |
| 49 | 9476835578 | 840085058 | 13985421 | 241172394 | 809315446 |
| 50 | 9495578885 | 856579327 | 14155046 | 241909464 | 847629148 |
| 51 | 9537115189 | 873483093 | 15057500 | 238735335 | 932663942 |
| 52 | 9556102594 | 890026435 | 15322279 | 238194482 | 982429654 |
| 53 | 9589094741 | 907142375 | 15899251 | 234845868 | 1052080437 |
| 54 | 9609053333 | 923477989 | 16049518 | 233890599 | 1092323040 |
| 55 | 9628950166 | 940997348 | 16172619 | 235383688 | 1131146866 |
| 56 | 9650657175 | 957049360 | 16445697 | 231276680 | 1183699383 |
| 57 | 9666446210 | 973785857 | 16330748 | 233203869 | 1205098118 |
| 58 | 9687274222 | 990692518 | 16523542 | 230842647 | 1254624242 |
| 59 | 9706652879 | 1007946602 | 16576268 | 231502185 | 1288374980 |
| 60 | 9720091630 | 1024044005 | 16547047 | 230966608 | 1321807705 |
| 61 | 9741079017 | 1041285110 | 16635400 | 230873663 | 1362929599 |
| 62 | 9761596587 | 1057847755 | 16683756 | 230289842 | 1399235989 |
| 63 | 9782104875 | 1075055403 | 16299138 | 237386812 | 1397167324 |
| 64 | 9790122724 | 1091147494 | 16650471 | 229928585 | 1463076072 |
Edit:
2 things worth noting:
If I add padding to the inner loop so that it won't fit in the uop cache, I don't see this behavior until ~150 iterations.
Adding an lfence along with the padding in the outer loop changes the N threshold to 31.
edit2: Benchmark which clears branch history. The condition was reversed: it should be cmove, not cmovne. With the fixed version, any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^7). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that the LSD is in some way defeating the Branch Predictor (which is what I believe to be the case).
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"testl $3, %k[outer_loop_cnt]\n"
"movl $1000, %k[inner_loop_cnt]\n"
// NOTE: this was originally cmovne (condition reversed); it needs to be cmove.
"cmove %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}

This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD; see footnote 1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.
Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.
Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.
My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either or both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.
With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.
So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)
e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.
| N | count | rate | variance |
|---|---|---|---|
| 17 | 59,602 | 0.02% of all branches | +- 10.85% |
| 20 | 192,307 | 0.05% of all branches | ( +- 44.60% ) |
| 21 | 79,853 | 0.02% of all branches | ( +- 14.16% ) |
| 30 | 136,308 | 0.02% of all branches | ( +- 18.57% ) |
| 31..32 | similar to N=34 | | ( +- 2 or 3% ) |
| 33 | 22,415,089 | 3.71% of all branches | ( +- 0.11% ) |
| 34 | 91,483 | 0.01% of all branches | ( +- 2.36% ) |
| 35 (and 36..37 similar) | 98,806 | 0.02% of all branches | ( +- 2.75% ) |
| 38 | 33,517,630 | 4.87% of all branches | ( +- 0.05% ) |
| 39 | 102,077 | 0.01% of all branches | ( +- 1.96% ) |
| 40 | 33,458,267 | 4.64% of all branches | ( +- 0.06% ) |
| 41 | 116,241 | 0.02% of all branches | ( +- 6.86% ) |
| 42 | 22,376,562 | 2.96% of all branches | ( +- 0.01% ) |
| 43 | 116,713 | 0.02% of all branches | ( +- 5.25% ) |
| 44 | 174,834 | 0.02% of all branches | ( +- 35.08% ) |
| 45 | 124,555 | 0.02% of all branches | ( +- 5.36% ) |
| 46 | 135,838 | 0.02% of all branches | ( +- 9.95% ) |
These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.
Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)
With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.
Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.
This is surprising, and different from a loop without as much padding.

Reason
The spike in Branch Misses is caused by the inner loop running out of the LSD.
The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.
From the Intel Optimization Manual, page 86:
The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number of wasted slots during LSD.
Essentially, what is happening is that when loops with low enough iteration counts run out of the Uop Cache, they are perfectly predictable. But when they run out of the LSD, since the built-in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is: don't let nested loops execute out of the LSD. Note that the LSD only kicks in after ~[20, 25] iterations, so an inner loop with < 20 iterations will run optimally.
Benchmark
All benchmarks are run on Icelake.
The new benchmark is essentially the same as the one in the original post, but on the advice of @PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is that the fixed length means there is no change in how the branches may alias in the BHT (Branch History Table), while varying the number of nops sometimes defeats the LSD.
I used 124 bytes of nop padding so that the nop padding + the size of decl; jcc would be 128 bytes total.
The benchmark code is as follows:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask = (1023 << bht_shift);
#define NOP1 ".byte 0x90\n"
#define NOP2 ".byte 0x66,0x90\n"
#define NOP3 ".byte 0x0f,0x1f,0x00\n"
#define NOP4 ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5 ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6 ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7 ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8 ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9 ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10 ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11 ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
".p2align 10\n"
"2:\n"
// This is defined in "inner_nops.h" with the necessary padding.
INNER_NOPS
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
[ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
:);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
return 0;
}
Tests
I tested nop count = [0, 39].
Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:
#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1
To reach the full 128 byte padding.
Results
At nop count <= 32 the inner loop is able to run out of the LSD, and we consistently see elevated Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers, outer loop iterations = 2^24.
At nop count > 32 the loop has too many uops for the LSD and runs out of the Uop Cache. Here we do not see consistently elevated Branch Misses until Iterations becomes too large for its BHT entry to store its entire history.
nop count > 32 (No LSD)
Once there are too many nops for the LSD, the number of Branch Misses stays relatively low, with a few consistent spikes, until Iterations = 146, where Branch Misses spikes to the number of outer loop iterations (2^24 in this case) and remains constant. My guess is that this is the upper bound on the history the BHT is able to store.
Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:
All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.
The key point, however, is that for the most part, for Iterations = [20, 145], Branch Misses is relatively low, indicating that the entire inner loop is being predicted correctly.
nop count <= 32 (LSD)
This data is a bit more noisy, but all of the different nop counts follow roughly the same trend: Branch Misses initially spikes to within a factor of 2 of outer loop iterations at Iterations = [21, 25] (note that this is an increase of 2-3 orders of magnitude), at the same time as lsd.uops spikes by 4-5 orders of magnitude.
There is also a trend between nop count and the iteration value at which Branch Misses stabilizes at outer loop iterations, with a Pearson Correlation of 0.81. For nop count = [0, 32] the stabilization point is in the range Iterations = [15, 34].
Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:
Generally, with some noise, all of the different nop counts follow the same trend. They also follow the same trend when compared with lsd.uops.
Below is a table with the nop count and the Pearson Correlation between Branch Misses and lsd.uops and idq.dsb_uops, respectively.
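| nop | lsd | uop cache |
|---|---|---|
| 0 | 0.961 | -0.041 |
| 1 | 0.955 | -0.081 |
| 2 | 0.919 | -0.122 |
| 3 | 0.918 | -0.299 |
| 4 | 0.947 | -0.117 |
| 5 | 0.934 | -0.298 |
| 6 | 0.894 | -0.329 |
| 7 | 0.907 | -0.308 |
| 8 | 0.91 | -0.322 |
| 9 | 0.915 | -0.316 |
| 10 | 0.877 | -0.342 |
| 11 | 0.908 | -0.28 |
| 12 | 0.874 | -0.281 |
| 13 | 0.875 | -0.523 |
| 14 | 0.87 | -0.513 |
| 15 | 0.889 | -0.522 |
| 16 | 0.858 | -0.569 |
| 17 | 0.89 | -0.507 |
| 18 | 0.858 | -0.537 |
| 19 | 0.844 | -0.565 |
| 20 | 0.816 | -0.459 |
| 21 | 0.862 | -0.537 |
| 22 | 0.848 | -0.556 |
| 23 | 0.852 | -0.552 |
| 24 | 0.85 | -0.561 |
| 25 | 0.828 | -0.573 |
| 26 | 0.857 | -0.559 |
| 27 | 0.802 | -0.372 |
| 28 | 0.762 | -0.425 |
| 29 | 0.721 | -0.112 |
| 30 | 0.736 | -0.047 |
| 31 | 0.768 | -0.174 |
| 32 | 0.847 | -0.129 |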
This should generally indicate that there is a strong correlation between the LSD and Branch Misses, and no meaningful relationship between the Uop Cache and Branch Misses.
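For readers who want to reproduce this kind of number, here is a minimal sketch of the Pearson correlation computation. The function is a textbook implementation rather than the script used for the table above, and the sample pairs are a handful of (branch-misses, lsd.uops) rows taken from the Icelake results in the question (N = 17, 21, 22, 23, 27, 40), not the per-nop-count data behind the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# branch-misses and lsd.uops for N = 17, 21, 22, 23, 27, 40 (Icelake, from the question)
branch_misses = [1798, 37598, 4447767, 12603617, 16381350, 16773492]
lsd_uops = [15879, 729397, 97827331, 327661987, 556915419, 1006109192]

print(f"r = {pearson(branch_misses, lsd_uops):.3f}")
```

A value close to 1 on such a sample only says the two counters rise together; it does not by itself establish which one causes the other.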
Overall
Generally, I think it is clear that the inner loop executing out of the LSD is what is causing the Branch Misses, until Iterations becomes too large for the BHT entry's history. For the nop count = [33, 39] case, save the spikes described above, we don't see elevated Branch Misses, but we do for the nop count = [0, 32] case, and the only difference I can tell is the LSD.

Encoding Spotify URI to Spotify Codes

Spotify Codes are little barcodes that allow you to share songs, artists, users, playlists, etc.
They encode information in the different heights of the "bars". There are 8 discrete heights that the 23 bars can be, which means 8^23 different possible barcodes.
Spotify generates barcodes based on their URI schema. This URI spotify:playlist:37i9dQZF1DXcBWIGoYBM5M gets mapped to this barcode:
The URI has a lot more information (62^22) in it than the code. How would you map the URI to the barcode? It seems like you can't simply encode the URI directly. For more background, see my "answer" to this question: https://stackoverflow.com/a/62120952/10703868

The patent explains the general process; this is what I have found.
There is also a more recent patent.
When using the Spotify code generator the website makes a request to https://scannables.scdn.co/uri/plain/[format]/[background-color-in-hex]/[code-color-in-text]/[size]/[spotify-URI].
Using Burp Suite, when scanning a code through Spotify the app sends a request to Spotify's API: https://spclient.wg.spotify.com/scannable-id/id/[CODE]?format=json where [CODE] is the media reference that you were looking for. This request can be made through python but only with the [TOKEN] that was generated through the app as this is the only way to get the correct scope. The app token expires in about half an hour.
import requests
head={
"X-Client-Id": "58bd3c95768941ea9eb4350aaa033eb3",
"Accept-Encoding": "gzip, deflate",
"Connection": "close",
"App-Platform": "iOS",
"Accept": "*/*",
"User-Agent": "Spotify/8.5.68 iOS/13.4 (iPhone9,3)",
"Accept-Language": "en",
"Authorization": "Bearer [TOKEN]",
"Spotify-App-Version": "8.5.68"}
response = requests.get('https://spclient.wg.spotify.com:443/scannable-id/id/26560102031?format=json', headers=head)
print(response)
print(response.json())
Which returns:
<Response [200]>
{'target': 'spotify:playlist:37i9dQZF1DXcBWIGoYBM5M'}
So 26560102031 is the media reference for your playlist.
The patent states that the code is first detected and then possibly converted into 63 bits using a Gray table. For example 361354354471425226605 is encoded into 010 101 001 010 111 110 010 111 110 110 100 001 110 011 111 011 011 101 101 000 111.
However, the code sent to the API is 6875667268. I'm unsure how the media reference is generated, but this is the number used in the lookup table.
The reference contains the integers 0-9, compared to the Gray table of 0-7, implying that an algorithm using normal binary has been used. The patent talks about using a convolutional code and then the Viterbi algorithm for error correction, so this may be the output from that, which I believe is impossible to recreate without the states. However, I'd be interested if you can interpret the patent any better.
This media reference has 10 digits; however, others have 11 or 12.
Here are two more examples of the raw distances, the gray table binary and then the media reference:
1.
022673352171662032460
000 011 011 101 100 010 010 111 011 001 100 001 101 101 011 000 010 011 110 101 000
67775490487
2.
574146602473467556050
111 100 110 001 110 101 101 000 011 110 100 010 110 101 100 111 111 101 000 111 000
57639171874
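The conversion from raw distances to Gray-table binary in these examples can be sketched as follows. The table order (000, 001, 011, 010, 110, 111, 101, 100 for heights 0 to 7) is the Gray map quoted in another answer in this thread; the function name is made up for this illustration.

```python
# Gray code for bar heights 0..7 (index = height)
GRAY_TABLE = ["000", "001", "011", "010", "110", "111", "101", "100"]


def heights_to_bits(heights):
    """Map a string of bar heights (digits 0-7) to space-separated 3-bit groups."""
    return " ".join(GRAY_TABLE[int(h)] for h in heights)


print(heights_to_bits("574146602473467556050"))
```

Running it on the raw distances listed above reproduces the bit groups shown for each example.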
edit:
Some extra info:
There are some posts online describing how you can encode any text such as spotify:playlist:HelloWorld into a code however this no longer works.
I also discovered through the proxy that you can use the domain to fetch the album art of a track above the code. This suggests a closer integration of Spotify's API and this scannables URL than previously thought, as it not only stores the URIs and their codes but can also validate URIs and return updated album art.
https://scannables.scdn.co/uri/800/spotify%3Atrack%3A0J8oh5MAMyUPRIgflnjwmB

Your suspicion was correct - they're using a lookup table. For all of the fun technical details, the relevant patent is available here: https://data.epo.org/publication-server/rest/v1.0/publication-dates/20190220/patents/EP3444755NWA1/document.pdf

Very interesting discussion. Always been attracted to barcodes so I had to take a look. I did some analysis of the barcodes alone (didn't access the API for the media refs) and think I have the basic encoding process figured out. However, based on the two examples above, I'm not convinced I have the mapping from media ref to 37-bit vector correct (i.e. it works in case 2 but not case 1). At any rate, if you have a few more pairs, that last part should be simple to work out. Let me know.
For those who want to figure this out, don't read the spoilers below!
It turns out that the basic process outlined in the patent is correct, but lacking in details. I'll summarize below using the example above. I actually analyzed this in reverse, which is why I think the code description is basically correct except for step (1), i.e. I generated 45 barcodes and all of them matched this code.
1. Map the media reference as integer to 37 bit vector.
Something like: write the number in base 2, with the least significant bit
on the left and zero-padding on the right if necessary.
57639171874 -> 0100010011101111111100011101011010110
2. Calculate CRC-8-CCITT, i.e. generator x^8 + x^2 + x + 1
The following steps are needed to calculate the 8 CRC bits:
Pad with 3 bits on the right:
01000100 11101111 11110001 11010110 10110000
Reverse the bits within each byte:
00100010 11110111 10001111 01101011 00001101
Calculate CRC as normal (highest order degree on the left):
-> 11001100
Reverse CRC:
-> 00110011
Invert check:
-> 11001100
Finally append to step 1 result:
01000100 11101111 11110001 11010110 10110110 01100
3. Convolutionally encode the 45 bits using the common generator
polynomials (1011011, 1111001) in binary with puncture pattern
110110 (or 101, 110 on each stream). The result of step 2 is
encoded using tail-biting, meaning we begin the shift register
in the state of the last 6 bits of the 45 long input vector.
Prepend stream with last 6 bits of data:
001100 01000100 11101111 11110001 11010110 10110110 01100
Encode using first generator:
(a) 100011100111110100110011110100000010001001011
Encode using 2nd generator:
(b) 110011100010110110110100101101011100110011011
Interleave bits (abab...):
11010000111111000010111011110011010011110001...
1010111001110001000101011000010110000111001111
Puncture every third bit:
111000111100101111101110111001011100110000100100011100110011
4. Permute data by choosing indices 0, 7, 14, 21, 28, 35, 42, 49,
56, 3, 10..., i.e. incrementing 7 modulo 60. (Note: unpermute by
incrementing 43 mod 60).
The encoded sequence after permuting is
111100110001110101101000011110010110101100111111101000111000
5. The final step is to map back to bar lengths 0 to 7 using the
gray map (000,001,011,010,110,111,101,100). This gives the 20 bar
encoding. As noted before, add three bars: short one on each end
and a long one in the middle.
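As a sanity check, the worked numbers in steps 1, 2, 4 and 5 can be reproduced with a short standard-library sketch. This is only an illustration of the steps as described above, not the code from this answer: the helper names are made up here, the CRC is written out by hand instead of using a library, and step 3 (the convolutional code) is skipped by starting from the punctured bit string quoted in the example.

```python
# Worked example from the steps above: media reference 57639171874.
# All helper names are hypothetical; they only mirror the described steps.
GRAY_BITS = ["000", "001", "011", "010", "110", "111", "101", "100"]  # level -> bits

def to_bits_lsb_first(ref, length=37):
    """Step 1: binary digits of ref, least significant bit first."""
    return "".join(str((ref >> i) & 1) for i in range(length))

def crc8_check_bits(bits37):
    """Step 2: CRC-8 with generator x^8 + x^2 + x + 1, bit-reversed input
    bytes, and a reversed, inverted check value."""
    padded = bits37 + "0" * (-len(bits37) % 8)      # pad to a whole number of bytes
    data = [int(padded[i:i + 8][::-1], 2)           # reverse the bits of each byte
            for i in range(0, len(padded), 8)]
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):                          # plain MSB-first CRC, poly 0x07
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    check = format(crc, "08b")[::-1]                # reverse the CRC bits
    return "".join("1" if b == "0" else "0" for b in check)  # invert them

def permute(bits60):
    """Step 4: read indices 0, 7, 14, ... modulo 60."""
    return "".join(bits60[7 * i % 60] for i in range(60))

def unpermute(bits60):
    """Inverse of step 4: 7 * 43 = 301 = 1 (mod 60)."""
    return "".join(bits60[43 * i % 60] for i in range(60))

def to_levels(bits60):
    """Step 5: map each group of three bits to a bar length 0..7."""
    return [GRAY_BITS.index(bits60[i:i + 3]) for i in range(0, 60, 3)]

bits37 = to_bits_lsb_first(57639171874)
bits45 = bits37 + crc8_check_bits(bits37)
# Output of step 3, copied from the example above.
punctured = "111000111100101111101110111001011100110000100100011100110011"
permuted = permute(punctured)
levels = to_levels(permuted)
print(bits45)
print(permuted)
print(levels)
```

The three printed values can be compared against the end of step 2, the end of step 4, and the 20 bar levels of the second example (the 21-digit string without its middle 7).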
UPDATE: I've added a barcode (levels) decoder (assuming no errors) and an alternate encoder that follows the description above rather than the equivalent linear algebra method. Hopefully that is a bit more clear.
UPDATE 2: Got rid of most of the hard-coded arrays to illustrate how they are generated.
The linear algebra method defines the linear transformation (spotify_generator) and mask to map the 37 bit input into the 60 bit convolutionally encoded data. The mask is the result of the 8-bit inverted CRC being convolutionally encoded. The spotify_generator is a 37x60 matrix that implements the product of generators for the CRC (a 37x45 matrix) and convolutional codes (a 45x60 matrix). You can create the generator matrix from an encoding function by applying the function to each row of an identity matrix of the appropriate size. For example, the CRC generator is obtained by applying a CRC function, which adds 8 bits to each 37 bit data vector, to each row of a 37x37 identity matrix.
import numpy as np
import crccheck
# Utils for conversion between int, array of binary
# and array of bytes (as ints)
def int_to_bin(num, length, endian):
    if endian == 'l':
        return [num >> i & 1 for i in range(0, length)]
    elif endian == 'b':
        return [num >> i & 1 for i in range(length-1, -1, -1)]

def bin_to_int(bin,length):
    return int("".join([str(bin[i]) for i in range(length-1,-1,-1)]),2)

def bin_to_bytes(bin, length):
    b = bin[0:length] + [0] * (-length % 8)
    return [(b[i]<<7) + (b[i+1]<<6) + (b[i+2]<<5) + (b[i+3]<<4) +
            (b[i+4]<<3) + (b[i+5]<<2) + (b[i+6]<<1) + b[i+7] for i in range(0,len(b),8)]

# Return the circular right shift of an array by 'n' positions
def shift_right(arr, n):
    return arr[-n % len(arr):len(arr):] + arr[0:-n % len(arr)]

gray_code = [0,1,3,2,7,6,4,5]
gray_code_inv = [[0,0,0],[0,0,1],[0,1,1],[0,1,0],
[1,1,0],[1,1,1],[1,0,1],[1,0,0]]
# CRC using Rocksoft model:
# NOTE: this is not quite any of their predefined CRC's
# 8: number of check bits (degree of poly)
# 0x7: representation of poly without high term (x^8+x^2+x+1)
# 0x0: initial fill of register
# True: byte reverse data
# True: byte reverse check
# 0xff: Mask check (i.e. invert)
spotify_crc = crccheck.crc.Crc(8, 0x7, 0x0, True, True, 0xff)
def calc_spotify_crc(bin37):
    bytes = bin_to_bytes(bin37, 37)
    return int_to_bin(spotify_crc.calc(bytes), 8, 'b')

def check_spotify_crc(bin45):
    data = bin_to_bytes(bin45,37)
    return spotify_crc.calc(data) == bin_to_bytes(bin45[37:], 8)[0]

# Simple convolutional encoder
def encode_cc(dat):
    gen1 = [1,0,1,1,0,1,1]
    gen2 = [1,1,1,1,0,0,1]
    punct = [1,1,0]
    dat_pad = dat[-6:] + dat # 6 bits are needed to initialize
                             # register for tail-biting
    stream1 = np.convolve(dat_pad, gen1, mode='valid') % 2
    stream2 = np.convolve(dat_pad, gen2, mode='valid') % 2
    enc = [val for pair in zip(stream1, stream2) for val in pair]
    return [enc[i] for i in range(len(enc)) if punct[i % 3]]

# To create a generator matrix for a code, we encode each row
# of the identity matrix. Note that the CRC is not quite linear
# because of the check mask so we apply the lambda function to
# invert it. Given a 37 bit media reference we can encode by
# ref * spotify_generator + spotify_mask (mod 2)
_i37 = np.identity(37, dtype=bool)
crc_generator = [_i37[r].tolist() +
                 list(map(lambda x : 1-x, calc_spotify_crc(_i37[r].tolist())))
                 for r in range(37)]
spotify_generator = 1*np.array([encode_cc(crc_generator[r]) for r in range(37)], dtype=bool)
del _i37
spotify_mask = 1*np.array(encode_cc(37*[0] + 8*[1]), dtype=bool)
# The following matrix is used to "invert" the convolutional code.
# In particular, we choose a 45 vector basis for the columns of the
# generator matrix (by deleting those in positions equal to 2 mod 4)
# and then inverting the matrix. By selecting the corresponding 45
# elements of the convolutionally encoded vector and multiplying
# on the right by this matrix, we get back to the unencoded data,
# assuming there are no errors.
# Note: numpy does not invert binary matrices, i.e. GF(2), so we
# hard code the following 3 row vectors to generate the matrix.
conv_gen = [[0,1,0,1,1,1,1,0,1,1,0,0,0,1]+31*[0],
[1,0,1,0,1,0,1,0,0,0,1,1,1] + 32*[0],
[0,0,1,0,1,1,1,1,1,1,0,0,1] + 32*[0] ]
conv_generator_inv = 1*np.array([shift_right(conv_gen[(s-27) % 3],s) for s in range(27,72)], dtype=bool)
# Given an integer media reference, returns list of 20 barcode levels
def spotify_bar_code(ref):
    bin37 = np.array([int_to_bin(ref, 37, 'l')], dtype=bool)
    enc = (np.add(1*np.dot(bin37, spotify_generator), spotify_mask) % 2).flatten()
    perm = [enc[7*i % 60] for i in range(60)]
    return [gray_code[4*perm[i]+2*perm[i+1]+perm[i+2]] for i in range(0,len(perm),3)]

# Equivalent function but using CRC and CC encoders.
def spotify_bar_code2(ref):
    bin37 = int_to_bin(ref, 37, 'l')
    enc_crc = bin37 + calc_spotify_crc(bin37)
    enc_cc = encode_cc(enc_crc)
    perm = [enc_cc[7*i % 60] for i in range(60)]
    return [gray_code[4*perm[i]+2*perm[i+1]+perm[i+2]] for i in range(0,len(perm),3)]

# Given 20 (clean) barcode levels, returns media reference
def spotify_bar_decode(levels):
    level_bits = np.array([gray_code_inv[levels[i]] for i in range(20)], dtype=bool).flatten()
    conv_bits = [level_bits[43*i % 60] for i in range(60)]
    cols = [i for i in range(60) if i % 4 != 2] # columns to invert
    conv_bits45 = np.array([conv_bits[c] for c in cols], dtype=bool)
    bin45 = (1*np.dot(conv_bits45, conv_generator_inv) % 2).tolist()
    if check_spotify_crc(bin45):
        return bin_to_int(bin45, 37)
    else:
        print('Error in levels; Use real decoder!!!')
        return -1

An example:
>>> levels = [5,7,4,1,4,6,6,0,2,4,3,4,6,7,5,5,6,0,5,0]
>>> spotify_bar_decode(levels)
57639171874
>>> spotify_bar_code(57639171874)
[5, 7, 4, 1, 4, 6, 6, 0, 2, 4, 3, 4, 6, 7, 5, 5, 6, 0, 5, 0]
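The generator-matrix idea used above (encode each row of an identity matrix, and handle the inverted check bits with a separate mask) can be seen on a much smaller code. The following is a toy sketch only: the encoder, which appends an inverted parity bit to 4 data bits, is a hypothetical stand-in for the CRC plus convolutional code, and none of these names come from the code above.

```python
from itertools import product

def encode(bits):
    """Toy affine encoder (hypothetical): append an inverted parity bit."""
    return bits + [1 - sum(bits) % 2]

def encode_linear(bits):
    """Linear part of the same encoder: the parity bit without the inversion."""
    return bits + [sum(bits) % 2]

K = 4
identity = [[int(r == c) for c in range(K)] for r in range(K)]

# Encoding each identity row with the linear part gives a K x (K+1) generator.
generator = [encode_linear(row) for row in identity]

# The mask is what the full encoder produces for the all-zero input.
mask = encode([0] * K)

def encode_matrix(bits):
    """Compute bits * generator + mask over GF(2)."""
    return [(sum(b * g for b, g in zip(bits, column)) + m) % 2
            for column, m in zip(zip(*generator), mask)]

# The matrix form agrees with the direct encoder for every possible input.
all_match = all(encode_matrix(list(x)) == encode(list(x))
                for x in product([0, 1], repeat=K))
print(all_match)
```

The same split is what makes ref * spotify_generator + spotify_mask (mod 2) work: the matrix carries the linear part and the mask carries the constant contributed by the inverted check bits.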

Why is accessing every other cache line slower on x86, not matching Intel's documented cache behavior?

According to Intel's optimization manual, the L1 data cache is 32 KiB and 8-way associative with 64-byte lines. I have written the following micro-benchmark to test memory read performance.
I hypothesize that if we access only blocks that can fit in the 32 KiB cache, each memory access will be fast, but if we exceed that cache size, the accesses will suddenly be slower. When skip is 1, the benchmark accesses every line in order.
void benchmark(int bs, int nb, int trials, int skip)
{
    printf("block size: %d, blocks: %d, skip: %d, trials: %d\n", bs, nb, skip, trials);
    printf("total data size: %d\n", nb*bs*skip);
    printf("accessed data size: %d\n", nb*bs);
    uint8_t volatile data[nb*bs*skip];
    clock_t before = clock();
    for (int i = 0; i < trials; ++i) {
        for (int block = 0; block < nb; ++block) {
            data[block * bs * skip];
        }
    }
    clock_t after = clock() - before;
    double ns_per_access = (double)after/CLOCKS_PER_SEC/nb/trials * 1000000000;
    printf("%f ns per memory access\n", ns_per_access);
}
Again with skip = 1, the results match my hypothesis:
~ ❯❯❯ ./bm -s 64 -b 128 -t 10000000 -k 1
block size: 64, blocks: 128, skip: 1, trials: 10000000
total data size: 8192
accessed data size: 8192
0.269054 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 256 -t 10000000 -k 1
block size: 64, blocks: 256, skip: 1, trials: 10000000
total data size: 16384
accessed data size: 16384
0.278184 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 1
block size: 64, blocks: 512, skip: 1, trials: 10000000
total data size: 32768
accessed data size: 32768
0.245591 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 1024 -t 10000000 -k 1
block size: 64, blocks: 1024, skip: 1, trials: 10000000
total data size: 65536
accessed data size: 65536
0.582870 ns per memory access
So far, so good: when everything fits in L1 cache, the inner loop runs about 4 times per nanosecond, or a bit more than once per clock cycle. When we make the data too big, it takes significantly longer. This is all consistent with my understanding of how the cache should work.
Now let's access every other block by letting skip be 2.
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 2
block size: 64, blocks: 512, skip: 2, trials: 10000000
total data size: 65536
accessed data size: 32768
0.582181 ns per memory access
This violates my understanding! It would make sense for a direct-mapped cache, but since our cache is associative, I can't see why the lines should be conflicting with each other. Why is accessing every other block slower?
But if I set skip to 3, things are fast again. In fact, any odd value of skip is fast; any even value is slow. For example:
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 7
block size: 64, blocks: 512, skip: 7, trials: 10000000
total data size: 229376
accessed data size: 32768
0.265338 ns per memory access
~ ❯❯❯ ./bm -s 64 -b 512 -t 10000000 -k 12
block size: 64, blocks: 512, skip: 12, trials: 10000000
total data size: 393216
accessed data size: 32768
0.616013 ns per memory access
Why is this happening?
For completeness: I am on a mid-2015 MacBook Pro running macOS 10.13.4. My full CPU brand string is Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz. I am compiling with cc -O3 -o bm bm.c; the compiler is the one shipped with Xcode 9.4.1. I have omitted the main function; all it does is parse the command-line options and call benchmark.

The cache is not fully associative; it is set-associative, meaning that each address maps to a certain set, and the associativity only applies among lines that map to the same set.
By making the step equal to 2, you keep half of the sets out of the game, so the fact that you effectively access only 32 KiB doesn't matter: you only have 16 KiB available (the even sets, for example), so you still exceed your capacity and start thrashing (and end up fetching data from the next level).
When the step is 3, the problem is gone, since after wrapping around you can use all the sets. The same goes for any odd step, because it shares no factor with the power-of-two number of sets. Odd prime numbers are a common choice, which is why they are sometimes used for address hashing.
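To make the set arithmetic concrete, here is a small sketch that counts how many of the accessed lines fall into each set for the skip values from the question. It assumes the geometry quoted from Intel's manual (32 KiB, 8 ways, 64-byte lines, hence 64 sets), that the set index is simply the line number modulo 64, and that the array starts at a line boundary; a different base address only shifts which sets are used, not how many.

```python
from collections import Counter

LINE = 64                            # bytes per cache line
WAYS = 8                             # associativity
SETS = 32 * 1024 // (LINE * WAYS)    # 64 sets in a 32 KiB, 8-way cache

def lines_per_set(nb, bs, skip):
    """Count how many accessed lines land in each cache set."""
    counts = Counter()
    for block in range(nb):
        addr = block * bs * skip     # same indexing as the benchmark loop
        counts[(addr // LINE) % SETS] += 1
    return counts

for skip in (1, 2, 3, 7, 12):
    counts = lines_per_set(512, 64, skip)
    worst = max(counts.values())
    verdict = "fits" if worst <= WAYS else "thrashes"
    print(f"skip={skip:2d}: {len(counts):2d} sets used, "
          f"{worst:2d} lines per set -> {verdict}")
```

With 512 blocks, odd skips spread the lines over all 64 sets at 8 lines each, which is exactly the associativity, while even skips pile 16 or more lines into fewer sets.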

Meaning of the benchmark variables in snakemake

I included a benchmark directive to some of the rules in my snakemake workflow, and the resulting files have the following header:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load
The only documentation I've found mentions a "benchmark txt file (which will contain a tab-separated table of run times and memory usage in MiB)".
I can guess that columns 1 and 2 are two different ways of displaying the time taken to execute the rule (in seconds, and converted to hours, minutes and seconds).
io_in and io_out are likely related to disk read and write activity, but in what units are they measured?
What are the others? Is this documented somewhere?
Edit: Looking at the source code
I've found the following piece of code in /snakemake/benchmark.py, that might well be where the benchmark data come from:
def _update_record(self):
    """Perform the actual measurement"""
    # Memory measurements
    rss, vms, uss, pss = 0, 0, 0, 0
    # I/O measurements
    io_in, io_out = 0, 0
    # CPU seconds
    cpu_seconds = 0
    # Iterate over process and all children
    try:
        main = psutil.Process(self.pid)
        this_time = time.time()
        for proc in chain((main,), main.children(recursive=True)):
            meminfo = proc.memory_full_info()
            rss += meminfo.rss
            vms += meminfo.vms
            uss += meminfo.uss
            pss += meminfo.pss
            ioinfo = proc.io_counters()
            io_in += ioinfo.read_bytes
            io_out += ioinfo.write_bytes
            if self.bench_record.prev_time:
                cpu_seconds += proc.cpu_percent() / 100 * (
                    this_time - self.bench_record.prev_time)
        self.bench_record.prev_time = this_time
        if not self.bench_record.first_time:
            self.bench_record.prev_time = this_time
        rss /= 1024 * 1024
        vms /= 1024 * 1024
        uss /= 1024 * 1024
        pss /= 1024 * 1024
        io_in /= 1024 * 1024
        io_out /= 1024 * 1024
    except psutil.Error as e:
        return
    # Update benchmark record's RSS and VMS
    self.bench_record.max_rss = max(self.bench_record.max_rss or 0, rss)
    self.bench_record.max_vms = max(self.bench_record.max_vms or 0, vms)
    self.bench_record.max_uss = max(self.bench_record.max_uss or 0, uss)
    self.bench_record.max_pss = max(self.bench_record.max_pss or 0, pss)
    self.bench_record.io_in = io_in
    self.bench_record.io_out = io_out
    self.bench_record.cpu_seconds += cpu_seconds
So this seems to come from functionalities provided by psutil.

I will just leave this here for future reference.
This comes from reading through the snakemake >= 6.0.0 benchmark module and psutil's memory_info(), memory_full_info(), io_counters() and cpu_times(), as previously suggested:
| colname | type (unit) | description |
|---|---|---|
| s | float (seconds) | Running time in seconds |
| h:m:s | string (-) | Running time in hours, minutes, seconds format |
| max_rss | float (MB) | Maximum "Resident Set Size": the non-swapped physical memory a process has used |
| max_vms | float (MB) | Maximum "Virtual Memory Size": the total amount of virtual memory used by the process |
| max_uss | float (MB) | "Unique Set Size": the memory which is unique to a process and which would be freed if the process was terminated right now |
| max_pss | float (MB) | "Proportional Set Size": the amount of memory shared with other processes, accounted in a way that the amount is divided evenly between the processes that share it (Linux only) |
| io_in | float (MB) | The number of MB read (cumulative) |
| io_out | float (MB) | The number of MB written (cumulative) |
| mean_load | float (-) | CPU usage over time, divided by the total running time (first row) |
| cpu_time | float (-) | CPU time summed for user and system |
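For reference, a benchmark file with this header can be read with the standard library alone. The sketch below is illustrative only: the sample row is made up, the function name is hypothetical, and it assumes every column except h:m:s holds a plain number. The conversion back to bytes uses the 1024 * 1024 factor from the `_update_record` code quoted in the question.

```python
import csv
import io

# Hypothetical benchmark file content; the numbers are made up for illustration.
SAMPLE = (
    "s\th:m:s\tmax_rss\tmax_vms\tmax_uss\tmax_pss\tio_in\tio_out\tmean_load\n"
    "12.5\t0:00:12\t256.00\t1024.00\t200.00\t210.00\t1.50\t0.25\t95.0\n"
)

BYTES_PER_UNIT = 1024 * 1024   # divisor applied to the memory and I/O counters

def read_benchmark(text):
    """Parse a tab-separated benchmark table into a list of dicts."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({key: (value if key == "h:m:s" else float(value))
                     for key, value in row.items()})
    return rows

record = read_benchmark(SAMPLE)[0]
print(record["s"], "seconds")
print(int(record["max_rss"] * BYTES_PER_UNIT), "bytes resident at peak")
```

For a real file, `open(path, newline="")` can be passed to `csv.DictReader` in place of the `io.StringIO` object.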

Benchmarking in snakemake could certainly be better documented, but psutil is documented here:
get_memory_info()
Return a tuple representing RSS (Resident Set Size) and VMS (Virtual Memory Size) in bytes.
On UNIX RSS and VMS are the same values shown by ps.
On Windows RSS and VMS refer to "Mem Usage" and "VM Size" columns of taskmgr.exe.
psutil.disk_io_counters(perdisk=False)
Return system disk I/O statistics as a namedtuple including the following attributes:
read_count: number of reads
write_count: number of writes
read_bytes: number of bytes read
write_bytes: number of bytes written
read_time: time spent reading from disk (in milliseconds)
write_time: time spent writing to disk (in milliseconds)
The code you found confirms that all the memory usage and IO counts are reported in MiB (= bytes / (1024 * 1024)).
