2 bit branch predictor with two for loops - cpu-architecture

I got a 2 bit branch predictor, my starting state is weakly taken and I need to calculate the prediction accuracy:
for (int i = 0; i < 100; i++)
{
    for (int j = 0; j < 50; j++)
    {
        ...
    }
}
So with i = 0 we take the branch, so we are at i = 0 and j = 0 and set our predictor to strongly taken, right ? So if we iterate j now, does that mean we are not taking a new branch ? As we are still in the i = 0 branch, or does every iteration count as a new branch ?

Let's manually compile it into x86 assembly first for better understanding (any other would do too):
    mov ebx, 0        # this is our var i
.L0:
    # /------------ inner loop start -----------\
    mov eax, 0        # this is our var j
.L1:
    # ...
    add eax, 1
    cmp eax, 50
    jl .L1            # jump one
    # \------------ inner loop end -------------/
    add ebx, 1
    cmp ebx, 100
    jl .L0            # jump two
I think this code is pretty straightforward even if you're not familiar with assembly:
Set ebx to 0
jump two gets back here
Set eax to 0
jump one gets back here
Execute our loop code // ...
add 1 to eax
compare eax to 50 (this sets some bits in a flag register)
jump to label .L1: if eax is less than 50
add 1 to ebx
compare ebx to 100 (this sets some bits in a flag register)
jump to label .L0: if ebx is less than 100
End of the loops
So on the first iteration we arrive at jump one and predict it will be taken (the starting state is weakly taken). Since eax < 50 we take it and update the predictor to strongly taken. We do this another 48 times, for 49 correctly predicted taken branches. On the 50th iteration we don't jump because eax == 50. This is a single misprediction and we update to weakly taken.
Now we arrive at jump two for the first time. Assuming each branch has its own predictor entry, it also starts at weakly taken. Since ebx < 100 we take it and update it to strongly taken. Now we start all over with the inner loop by jumping to .L0. Jump one is back at weakly taken, which still predicts taken, so every pass through the inner loop again gives 49 correct predictions and 1 misprediction. We do this another 98 times. On the 100th iteration of the outer loop we don't jump because ebx == 100. This is a single misprediction and we update to weakly taken.
So we execute the inner loop 100 times with a single misprediction each, for a total of 100 mispredictions for jump one and 100 * 49 = 4900 correct predictions. The outer loop is executed only once; jump two has only 1 misprediction and 99 correct predictions. Overall that is 4999 correct predictions out of 5100 executed branches, or about 98%.
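The count above can be checked with a small simulation. This is only a sketch in Python under the same assumptions as the walkthrough: one independent 2-bit saturating counter per branch (no aliasing between the two jumps), both counters starting at weakly taken, and the loops compiled with the test at the bottom as in the assembly. The function and dictionary names are made up for illustration.

```python
def simulate_nested_loops(outer_n=100, inner_n=50, start_state=2):
    """Count correct predictions for the two loop branches.

    Counter states: 0 = strongly not taken, 1 = weakly not taken,
                    2 = weakly taken,       3 = strongly taken.
    """
    state = {"jump one": start_state, "jump two": start_state}
    correct = {"jump one": 0, "jump two": 0}
    total = {"jump one": 0, "jump two": 0}

    def branch(name, taken):
        predicted_taken = state[name] >= 2
        total[name] += 1
        if predicted_taken == taken:
            correct[name] += 1
        # Saturating update: move one step towards the actual outcome.
        if taken:
            state[name] = min(3, state[name] + 1)
        else:
            state[name] = max(0, state[name] - 1)

    for i in range(1, outer_n + 1):
        for j in range(1, inner_n + 1):
            branch("jump one", j < inner_n)   # jl .L1
        branch("jump two", i < outer_n)       # jl .L0
    return correct, total


correct, total = simulate_nested_loops()
accuracy = sum(correct.values()) / sum(total.values())
print(correct, total, round(accuracy, 4))
```

Changing start_state or the loop bounds shows how sensitive the result is to the initial counter value; with a real predictor that shares table entries between branches the numbers may differ.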

Related

Calculating jmp's from one segment to another in windows PE files

Assume I have a binary on my disk that I load into memory using VirtualAlloc and ReadFile.
If I want to follow a jmp instruction from one section to another, what do I need to add/subtract to get the destination address.
In other words, I want to know how IDA calculates the loc_140845BB8 from jmp loc_140845BB8
Example:
.text:000000014005D74E jmp loc_140845BB8
Jumps to the section seg007
seg007:0000000140845BB8 ; seg007:0000000140845BC4↓j
seg007:0000000140845BB8 and rbx, r14
PE info (seg007 is the section named "")
Segments are arbitrary; it jumps where it jumps, without regard for segments. The jump target is calculated as the signed 32-bit value following the 0xE9 JMP opcode, added to the address of where the next instruction would be (i.e. the location of the JMP + 5 bytes).
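import ida_ida
import ida_ua
import idc

def GetInsnLen(ea):
    # decode_insn returns the instruction length in bytes, or 0 on failure
    insn = ida_ua.insn_t()
    return ida_ua.decode_insn(insn, ea)

def MakeSigned(number, size):
    # Interpret the low `size` bits of `number` as a two's-complement value
    number = number & ((1 << size) - 1)
    return number if number < (1 << (size - 1)) else number - (1 << size)

def GetRawJumpTarget(ea):
    if ea is None:
        return None
    insnlen = GetInsnLen(ea)
    if not insnlen:
        return None
    # The rel32 displacement is the last 4 bytes of the instruction and is
    # relative to the address of the next instruction (ea + insnlen)
    result = MakeSigned(idc.get_wide_dword(ea + insnlen - 4), 32) + ea + insnlen
    if ida_ida.cvar.inf.min_ea <= result < ida_ida.cvar.inf.max_ea:
        return result
    return None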
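Outside of IDA the same arithmetic can be done on the raw instruction bytes. The following is a minimal sketch using only the standard library; rel32_jump_target is a hypothetical helper name, and the example bytes were computed backwards from the two addresses in the question for illustration, not read from the actual binary.

```python
import struct


def rel32_jump_target(insn_addr, insn_bytes):
    """Return the target of a 5-byte `E9 rel32` near jump.

    insn_addr is the virtual address of the jmp itself; insn_bytes are the
    five bytes of the instruction as they appear in memory.
    """
    if len(insn_bytes) != 5 or insn_bytes[0] != 0xE9:
        raise ValueError("expected a 5-byte E9 rel32 jmp")
    # "<i" = little-endian signed 32-bit displacement after the opcode byte
    (disp,) = struct.unpack_from("<i", insn_bytes, 1)
    # The displacement is relative to the address of the next instruction
    return insn_addr + len(insn_bytes) + disp


# jmp at 0x14005D74E with displacement 0x007E8465 lands on 0x140845BB8
print(hex(rel32_jump_target(0x14005D74E, bytes.fromhex("e965847e00"))))
```

A negative displacement works the same way: `E9 FB FF FF FF` at address 0x1000 jumps back to 0x1000 itself.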

Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

All benchmarks are run on either Icelake or Whiskey Lake (in the Skylake family).
Summary
I am seeing a strange phenomenon where it appears that when a loop
transitions from running out of the Uop Cache to running out of
the LSD (Loop Stream Detector) there is a spike in Branch
Misses that can cause severe performance hits. I tested on both
Icelake and Whiskey Lake by benchmarking a nested loop with the outer loop
having a sufficiently large body such that the entire thing did not fit in
the LSD itself, but with an inner loop small enough to fit in the
LSD.
Basically once the inner loop reaches some iteration count, decoding
seems to switch from idq.dsb_uops (Uop Cache) to lsd.uops (LSD)
and at that point there is a large increase in branch-misses
(without a corresponding jump in branches) causing a severe
performance drop. Note: This only seems to occur for nested
loops. Travis Down's Loop
Test for example will
not show any meaningful variation in branch misses. AFAICT this has
something to do with when a loop transitions from running out of the
Uop Cache to running out of the LSD.
Questions
What is happening when the loop transitions from running out of the
Uop Cache to running out of the LSD that causes this spike in
Branch Misses?
Is there a way to avoid this?
Benchmark
This is the minimum reproducible example I could come up with:
Note: If the .p2align statements are removed both loops will fit in
the LSD and there will not be a transition.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "movl %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align 10\n"
        "2:\n"
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
}
Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>
Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Graphs
Edit: x label is N iterations of inner loop.
Below is a graph of Branch Misses, Branches, and LSD Uops.
Generally you can see that 1) there is no corresponding jump in Branches, 2) the number of added Branch Misses stabilizes at a constant, and 3) there is a strong relationship between the Branch Misses and LSD Uops.
Icelake Graph:
Whiskey Lake Graph:
Below is a graph of Branch Misses, Cycles, and LSD Uops for Icelake only, as performance is not affected nearly as much on Whiskey Lake:
Analysis
Hard numbers below.
For Icelake starting at N = 22 and finishing at N = 27 there
is some fluctuation in the number of uops coming from the LSD vs
Uop Cache and during that time Branch Misses increases by
roughly 3 orders of magnitude from 10^4 -> 10^7. During this period
Cycles also increased by a factor of 2. For all N > 27
Branch Misses stays at around 1.67 x 10^7 (roughly outer_N). For N = [17, 40]
branches continues to only increase linearly.
The results for Whiskey Lake look different in that 1) the counts begin fluctuating at N = 35 and continue to fluctuate until N = 49, and 2) there is less of a
performance impact and more fluctuation in the data. That being said
the increase in Branch Misses corresponding with transitions from
uops being fed by the Uop Cache to being fed by the LSD still exists.
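The "branches only increases linearly" observation can be checked against the Icelake table in the Results section. The sketch below is in Python for convenience; the sample rows are copied from that table, while the per-outer-iteration overhead of 3 branches is a value fitted to the data here, not something stated in the post (it would be consistent with the outer jnz plus one jump over each of the two large .p2align paddings, but that is an assumption).

```python
OUTER_N = 1 << 24  # outer_N from the benchmark source

# N -> measured `branches`, copied from the Icelake results table
icelake_branches = {1: 67129521, 17: 335565563, 40: 721441018, 64: 1124094228}


def modeled_branches(n, per_outer_overhead=3):
    # N inner-loop branches plus a fixed overhead per outer iteration.
    # per_outer_overhead=3 is fitted to the data, see the note above.
    return OUTER_N * (n + per_outer_overhead)


relative_errors = {}
for n, measured in icelake_branches.items():
    relative_errors[n] = abs(measured - modeled_branches(n)) / measured
    print(n, measured, modeled_branches(n), f"{relative_errors[n]:.5f}")
```

If the model holds, the relative error stays tiny across the whole range, including the N values where branch-misses jump, which is what "no corresponding jump in branches" means.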
Results
Data is mean result for 25 runs.
Icelake Results:
N | cycles | branches | branch-misses | idq.ms_uops | idq.dsb_uops | lsd.uops
1 | 33893260 | 67129521 | 1590 | 43163 | 115243 | 83908732
2 | 42540891 | 83908928 | 1762 | 49023 | 142909 | 100690381
3 | 50725933 | 100686143 | 1782 | 47656 | 142506 | 117440256
4 | 67533597 | 117461172 | 1655 | 52538 | 186123 | 134158311
5 | 68022910 | 134238387 | 1711 | 53405 | 204481 | 150954035
6 | 85543126 | 151018722 | 1924 | 62445 | 141397 | 167633971
7 | 84847823 | 167799220 | 1935 | 60248 | 160146 | 184563523
8 | 101532158 | 184570060 | 1709 | 60064 | 361208 | 201100179
9 | 101864898 | 201347253 | 1773 | 63827 | 459873 | 217780207
10 | 118024033 | 218124499 | 1698 | 59480 | 177223 | 234834304
11 | 118644416 | 234908571 | 2201 | 62514 | 422977 | 251503052
12 | 134627567 | 251678909 | 1679 | 57262 | 133462 | 268435650
13 | 285607942 | 268456135 | 1770 | 74070 | 285032524 | 315423
14 | 302717754 | 285233352 | 1731 | 74663 | 302101097 | 15953
15 | 321627434 | 302010569 | 81796 | 77831 | 319192830 | 1520819
16 | 337876736 | 318787786 | 71638 | 77056 | 335904260 | 1265766
17 | 353054773 | 335565563 | 1798 | 79839 | 352434780 | 15879
18 | 369800279 | 352344970 | 1978 | 79863 | 369229396 | 16790
19 | 386921048 | 369119438 | 1972 | 84075 | 385984022 | 16115
20 | 404248461 | 385896655 | 29454 | 85348 | 402790977 | 510176
21 | 421100725 | 402673872 | 37598 | 83400 | 419537730 | 729397
22 | 519623794 | 419451095 | 4447767 | 91209 | 431865775 | 97827331
23 | 702206338 | 436228323 | 12603617 | 109064 | 427880075 | 327661987
24 | 710626194 | 453005538 | 12316933 | 106929 | 432926173 | 344838509
25 | 863214037 | 469782765 | 14641887 | 121776 | 428085132 | 614871430
26 | 761037251 | 486559974 | 13067814 | 113011 | 438093034 | 418124984
27 | 832686921 | 503337195 | 16381350 | 113953 | 421924080 | 556915419
28 | 854713119 | 520114412 | 16642396 | 124448 | 420515666 | 598907353
29 | 869873144 | 536891629 | 16572581 | 119280 | 421188631 | 629696780
30 | 889642335 | 553668847 | 16717446 | 120116 | 420086570 | 668628871
31 | 906912275 | 570446064 | 16735759 | 126094 | 419970933 | 702822733
32 | 923023862 | 587223281 | 16706519 | 132498 | 420332680 | 735003892
33 | 940308170 | 604000498 | 16744992 | 124365 | 419945191 | 770185745
34 | 957075000 | 620777716 | 16726856 | 133675 | 420215897 | 802779119
35 | 974557538 | 637554932 | 16763071 | 134871 | 419764866 | 838012637
36 | 991110971 | 654332149 | 16772560 | 130903 | 419641144 | 872037131
37 | 1008489575 | 671109367 | 16757219 | 138788 | 419900997 | 904638287
38 | 1024971256 | 687886583 | 16772585 | 139782 | 419663863 | 938988917
39 | 1041404669 | 704669411 | 16776722 | 137681 | 419617131 | 972896126
40 | 1058594326 | 721441018 | 16773492 | 142959 | 419662133 | 1006109192
41 | 1075179100 | 738218235 | 16776636 | 141185 | 419601996 | 1039892900
42 | 1092093726 | 754995452 | 16776233 | 142793 | 419611902 | 1073373451
43 | 1108706464 | 771773224 | 16776359 | 139500 | 419610885 | 1106976114
44 | 1125413652 | 788549886 | 16764637 | 143717 | 419889127 | 1139628280
45 | 1142023614 | 805327103 | 16778640 | 144397 | 419558217 | 1174329696
46 | 1158833317 | 822104321 | 16765518 | 148045 | 419889914 | 1206833484
47 | 1175665684 | 838881537 | 16778437 | 148347 | 419562885 | 1241397845
48 | 1192454164 | 855658755 | 16778865 | 153651 | 419552747 | 1275006511
49 | 1210199084 | 872436025 | 16778287 | 152468 | 419599314 | 1307945613
50 | 1226321832 | 889213188 | 16778464 | 155552 | 419572344 | 1341893668
51 | 1242886388 | 905990406 | 16778745 | 155401 | 419559249 | 1375589883
52 | 1259559053 | 922767623 | 16778809 | 154847 | 419554082 | 1409206082
53 | 1276875799 | 939544839 | 16778460 | 162521 | 419576455 | 1442424993
54 | 1293113199 | 956322057 | 16778931 | 154913 | 419550955 | 1476316161
55 | 1310449232 | 973099274 | 16778534 | 157364 | 419578102 | 1509485876
56 | 1327022109 | 989876491 | 16778794 | 162881 | 419562403 | 1543193559
57 | 1344097516 | 1006653708 | 16778906 | 157486 | 419567545 | 1576414302
58 | 1362935064 | 1023430928 | 16778959 | 315120 | 419583132 | 1609691339
59 | 1381567560 | 1040208143 | 16778564 | 179997 | 419661259 | 1640660745
60 | 1394829416 | 1056985359 | 16778779 | 167613 | 419575969 | 1677034188
61 | 1411847237 | 1073762626 | 16778071 | 166332 | 419613028 | 1710194702
62 | 1428918439 | 1090539795 | 16778409 | 168073 | 419610487 | 1743644637
63 | 1445223241 | 1107317011 | 16778486 | 172446 | 419591254 | 1777573503
64 | 1461530579 | 1124094228 | 16769606 | 169559 | 419970612 | 1810351736
Whiskey Lake Results:
N | cycles | branches | branch-misses | idq.dsb_uops | lsd.uops
1 | 8332553879 | 35005847 | 37925 | 1799462 | 6019
2 | 8329926329 | 51163346 | 34338 | 1114352 | 5919
3 | 8357233041 | 67925775 | 32270 | 9241935 | 5555
4 | 8379609449 | 85364250 | 35667 | 18215077 | 5712
5 | 8394301337 | 101563554 | 33177 | 26392216 | 2159
6 | 8409830612 | 118918934 | 35007 | 35318763 | 5295
7 | 8435794672 | 135162597 | 35592 | 43033739 | 4478
8 | 8445843118 | 152636271 | 37802 | 52154850 | 5629
9 | 8459141676 | 168577876 | 30766 | 59245754 | 1543
10 | 8475484632 | 185354280 | 30825 | 68059212 | 4672
11 | 8493529857 | 202489273 | 31703 | 77386249 | 5556
12 | 8509281533 | 218912407 | 32133 | 84390084 | 4399
13 | 8528605921 | 236303681 | 33056 | 93995496 | 2093
14 | 8553971099 | 252439989 | 572416 | 99700289 | 2477
15 | 8558526147 | 269148605 | 29912 | 109772044 | 6121
16 | 8576658106 | 286414453 | 29839 | 118504526 | 5850
17 | 8591545887 | 302698593 | 28993 | 126409458 | 4865
18 | 8611628234 | 319960954 | 32568 | 136298306 | 5066
19 | 8627289083 | 336312187 | 30094 | 143759724 | 6598
20 | 8644741581 | 353730396 | 49458 | 152217853 | 9275
21 | 8685908403 | 369886284 | 1175195 | 161313923 | 7958903
22 | 8694494654 | 387336207 | 354008 | 169541244 | 2553802
23 | 8702920906 | 403389097 | 29315 | 176524452 | 12932
24 | 8711458401 | 420211718 | 31924 | 184984842 | 11574
25 | 8729941722 | 437299615 | 32472 | 194553843 | 12002
26 | 8743658904 | 453739403 | 28809 | 202074676 | 13279
27 | 8763317458 | 470902005 | 32298 | 211321630 | 15377
28 | 8788189716 | 487432842 | 37105 | 218972477 | 27666
29 | 8796580152 | 504414945 | 36756 | 228334744 | 79954
30 | 8821174857 | 520930989 | 39550 | 235849655 | 140461
31 | 8818857058 | 537611096 | 34142 | 648080 | 79191
32 | 8855038758 | 555138781 | 37680 | 18414880 | 70489
33 | 8870680446 | 571194669 | 37541 | 34596108 | 131455
34 | 8888946679 | 588222521 | 33724 | 52553756 | 80009
35 | 9256640352 | 604791887 | 16658672 | 132185723 | 41881719
36 | 9189040776 | 621918353 | 12296238 | 257921026 | 235389707
37 | 8962737456 | 638241888 | 1086663 | 109613368 | 35222987
38 | 9005853511 | 655453884 | 2059624 | 131945369 | 73389550
39 | 9005576553 | 671845678 | 1434478 | 143002441 | 51959363
40 | 9284680907 | 688991063 | 12776341 | 349762585 | 347998221
41 | 9049931865 | 705399210 | 1778532 | 174597773 | 72566933
42 | 9314836359 | 722226758 | 12743442 | 365270833 | 380415682
43 | 9072200927 | 739449289 | 1344663 | 205181163 | 61284843
44 | 9346737669 | 755766179 | 12681859 | 383580355 | 409359111
45 | 9117099955 | 773167996 | 1801713 | 235583664 | 88985013
46 | 9108062783 | 789247474 | 860680 | 250992592 | 43508069
47 | 9129892784 | 806871038 | 984804 | 268229102 | 51249366
48 | 9146468279 | 822765997 | 1018387 | 282312588 | 58278399
49 | 9476835578 | 840085058 | 13985421 | 241172394 | 809315446
50 | 9495578885 | 856579327 | 14155046 | 241909464 | 847629148
51 | 9537115189 | 873483093 | 15057500 | 238735335 | 932663942
52 | 9556102594 | 890026435 | 15322279 | 238194482 | 982429654
53 | 9589094741 | 907142375 | 15899251 | 234845868 | 1052080437
54 | 9609053333 | 923477989 | 16049518 | 233890599 | 1092323040
55 | 9628950166 | 940997348 | 16172619 | 235383688 | 1131146866
56 | 9650657175 | 957049360 | 16445697 | 231276680 | 1183699383
57 | 9666446210 | 973785857 | 16330748 | 233203869 | 1205098118
58 | 9687274222 | 990692518 | 16523542 | 230842647 | 1254624242
59 | 9706652879 | 1007946602 | 16576268 | 231502185 | 1288374980
60 | 9720091630 | 1024044005 | 16547047 | 230966608 | 1321807705
61 | 9741079017 | 1041285110 | 16635400 | 230873663 | 1362929599
62 | 9761596587 | 1057847755 | 16683756 | 230289842 | 1399235989
63 | 9782104875 | 1075055403 | 16299138 | 237386812 | 1397167324
64 | 9790122724 | 1091147494 | 16650471 | 229928585 | 1463076072
Edit:
Two things worth noting:
If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.
Adding an lfence along with the padding in the outer loop changes the N threshold to 31.
Edit 2: Benchmark which clears branch history. The condition was reversed: it should be cmove, not cmovne. With the fixed version any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^7). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that the LSD is in some way defeating the Branch Predictor (which is what I believe to be the case).
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "testl $3, %k[outer_loop_cnt]\n"
        "movl $1000, %k[inner_loop_cnt]\n"
        // Fixed: the first version of this benchmark used cmovne here,
        // which reversed the condition. It needs to be cmove.
        "cmove %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align 10\n"
        "2:\n"
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD, see footnote 1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.
Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.
Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.
My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either or both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.
With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.
So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)
e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.
N | count | rate | variance
17 | 59,602 | 0.02% of all branches | +- 10.85%
20 | 192,307 | 0.05% of all branches | ( +- 44.60% )
21 | 79,853 | 0.02% of all branches | ( +- 14.16% )
30 | 136,308 | 0.02% of all branches | ( +- 18.57% )
31..32 | similar to N=34 | | ( +- 2 or 3% )
33 | 22,415,089 | 3.71% of all branches | ( +- 0.11% )
34 | 91,483 | 0.01% of all branches | ( +- 2.36% )
35 (and 36..37 similar) | 98,806 | 0.02% of all branches | ( +- 2.75% )
38 | 33,517,630 | 4.87% of all branches | ( +- 0.05% )
39 | 102,077 | 0.01% of all branches | ( +- 1.96% )
40 | 33,458,267 | 4.64% of all branches | ( +- 0.06% )
41 | 116,241 | 0.02% of all branches | ( +- 6.86% )
42 | 22,376,562 | 2.96% of all branches | ( +- 0.01% )
43 | 116,713 | 0.02% of all branches | ( +- 5.25% )
44 | 174,834 | 0.02% of all branches | ( +- 35.08% )
45 | 124,555 | 0.02% of all branches | ( +- 5.36% )
46 | 135,838 | 0.02% of all branches | ( +- 9.95% )
These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.
Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)
With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.
Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.
This is surprising, and different from a loop without as much padding.
Reason
The spike in Branch Misses is caused by the inner loop running out of the LSD.
The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.
From Intel Optimization Manual Page 86.
The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only
a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern
repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number
of wasted slots during LSD.
Essentially what is happening is that when low enough iteration counts run out of the Uop Cache they are perfectly predictable. But when they run out of the LSD since the built in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is don't let nested loops execute out of the LSD. Note that LSD only kicks in after ~[20, 25] iterations so an inner loop with < 20 iterations will run optimally.
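The "one extra branch miss per outer-loop iteration" claim can be sanity-checked against the Icelake numbers from the question. This is just a Python sketch of the arithmetic; the sample rows are copied from the question's Icelake results table and the helper name is made up.

```python
OUTER_N = 1 << 24  # outer_N in the benchmark, about 1.67 * 10^7

# N -> measured branch-misses, copied from the question's Icelake table
icelake_misses = {17: 1798, 21: 37598, 28: 16642396, 50: 16778464, 64: 16769606}


def misses_per_outer_iteration(misses):
    return misses / OUTER_N


for n, misses in icelake_misses.items():
    ratio = misses_per_outer_iteration(misses)
    print(f"N={n:2d}: {ratio:.4f} branch misses per outer iteration")
```

Before the transition the ratio is close to 0 (the inner-loop exit is predicted), and once the loop runs from the LSD it sits very close to 1, i.e. roughly one mispredict per exit of the inner loop.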
Benchmark
All benchmarks are run on either Icelake
The new benchmark is essentially the same as the one in the original post, but on the advice of @PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is that the fixed length means there is no change in how the branches may alias in the BHT (Branch History Table), while varying the number of nops sometimes defeats the LSD.
I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.
The benchmark code is as follows:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask = (1023 << bht_shift);
#define NOP1 ".byte 0x90\n"
#define NOP2 ".byte 0x66,0x90\n"
#define NOP3 ".byte 0x0f,0x1f,0x00\n"
#define NOP4 ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5 ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6 ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7 ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8 ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9 ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10 ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11 ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "movl %k[inner_N], %k[inner_loop_cnt]\n"
        ".p2align 10\n"
        "2:\n"
        // This is defined in "inner_nops.h" with the necessary padding.
        INNER_NOPS
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
          [ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
        :);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
    return 0;
}
Tests
I tested nop count = [0, 39].
Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:
#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1
To reach the full 128 byte padding.
Results
At nop count <= 32 the inner loop is able to run out of the LSD and we consistently see elevated Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers outer loop iterations = 2^24.
At nop count > 32 the loop has too many uops for the LSD and runs out of the Uop Cache. Here we do not see consistently elevated Branch Misses until Iterations becomes too large for its BHT entry to store its entire history.
nop count > 32 (No LSD)
Once there are too many nops for the LSD the number of Branch Misses stays relatively low, with a few consistent spikes, until Iterations = 146, where Branch Misses spikes to the number of outer loop iterations (2^24 in this case) and remains constant. My guess is that this is the upper bound on the history the BHT is able to store.
Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:
All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.
The key point, however, is that for the most part for Iterations = [20, 145] Branch Misses is relatively low, indicating that the entire inner loop is being predicted correctly.
nop count <= 32 (LSD)
This data is a bit more noisy, but all of the different nop counts follow roughly the same trend: Branch Misses initially spikes to within a factor of 2 of outer loop iterations at Iterations = [21, 25] (note this is 2-3 orders of magnitude), at the same time as lsd.uops spikes by 4-5 orders of magnitude.
There is also a trend between nop count and the iteration value at which Branch Misses stabilizes at outer loop iterations, with a Pearson Correlation of 0.81. For nop count = [0, 32] the stabilization point is in the range Iterations = [15, 34].
Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:
Generally, with some noise, all of the different nop counts follow the same trend. They also follow the same trend when compared with lsd.uops.
Below is a table with nop count and the Pearson Correlation between Branch Misses and lsd.uops and idq.dsb_uops respectively.
nop | lsd | uop cache
0 | 0.961 | -0.041
1 | 0.955 | -0.081
2 | 0.919 | -0.122
3 | 0.918 | -0.299
4 | 0.947 | -0.117
5 | 0.934 | -0.298
6 | 0.894 | -0.329
7 | 0.907 | -0.308
8 | 0.91 | -0.322
9 | 0.915 | -0.316
10 | 0.877 | -0.342
11 | 0.908 | -0.28
12 | 0.874 | -0.281
13 | 0.875 | -0.523
14 | 0.87 | -0.513
15 | 0.889 | -0.522
16 | 0.858 | -0.569
17 | 0.89 | -0.507
18 | 0.858 | -0.537
19 | 0.844 | -0.565
20 | 0.816 | -0.459
21 | 0.862 | -0.537
22 | 0.848 | -0.556
23 | 0.852 | -0.552
24 | 0.85 | -0.561
25 | 0.828 | -0.573
26 | 0.857 | -0.559
27 | 0.802 | -0.372
28 | 0.762 | -0.425
29 | 0.721 | -0.112
30 | 0.736 | -0.047
31 | 0.768 | -0.174
32 | 0.847 | -0.129
Which should generally indicate that there is a strong correlation between LSD and Branch Misses and no meaningful relationship between the Uop Cache and branch misses.
Overall
Generally I think it is clear that the inner loop executing out of the LSD is what is causing the Branch Misses, until Iterations becomes too large for the BHT entry's history. For the nop count = [33, 39] case, save for the spikes noted above, we don't see elevated Branch Misses, but we do for the nop count = [0, 32] case, and the only difference I can tell is the LSD.

How to change quantum in xv6? [duplicate]

Right now it seems that on every clock tick, the running process is preempted and forced to yield the processor. I have thoroughly investigated the code base and the only part of the code relevant to process preemption is below (in trap.c):
// Force process to give up CPU on clock tick.
// If interrupts were on while locks held, would need to check nlock.
if(myproc() && myproc() -> state == RUNNING && tf -> trapno == T_IRQ0 + IRQ_TIMER)
yield();
I guess that timing is specified in T_IRQ0 + IRQ_TIMER, but I can't figure out how these two can be modified, these two are specified in trap.h:
#define T_IRQ0 32 // IRQ 0 corresponds to int T_IRQ
#define IRQ_TIMER 0
I wonder how I can change the default RR scheduling time-slice (which is right now 1 clock tick; for example, make it 10 clock ticks)?
If you want a process to be executed for more time than the others, you can allow it more timeslices, without changing the timeslice duration.
To do so, you can add some extra_slice and current_slice in struct proc and modify the TIMER trap handler this way:
if(myproc() && myproc()->state == RUNNING &&
   tf->trapno == T_IRQ0+IRQ_TIMER)
{
    int current = myproc()->current_slice;
    if ( current )
        myproc()->current_slice = current - 1;
    else
        yield();
}
Then you just have to create a syscall to set extra_slice and modify the scheduler function to reset current_slice to extra_slice at process wakeup:
    // Switch to chosen process. It is the process's job
    // to release ptable.lock and then reacquire it
    // before jumping back to us.
    c->proc = p;
    switchuvm(p);
    p->state = RUNNING;
    p->current_slice = p->extra_slice;
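To see what this accounting does to the schedule, here is a small user-space simulation in Python (not xv6 code). It is a sketch that assumes a plain round-robin order, that every process is always runnable, and that the scheduler resets current_slice to extra_slice each time it dispatches a process, as in the snippet above. The function name run_ticks and the process names are made up.

```python
def run_ticks(extra_slices, n_ticks):
    """Return which process runs during each timer tick.

    extra_slices maps a process name to its extra_slice value, i.e. how
    many additional ticks it may keep the CPU before it has to yield.
    """
    names = list(extra_slices)
    schedule = []
    idx = 0
    current = names[idx]
    current_slice = extra_slices[current]       # reset by the scheduler
    for _ in range(n_ticks):
        schedule.append(current)                # process runs for one tick
        # Timer interrupt: same decision as the modified trap handler
        if current_slice:
            current_slice -= 1
        else:
            idx = (idx + 1) % len(names)        # yield(): next process, RR
            current = names[idx]
            current_slice = extra_slices[current]
    return schedule


print(run_ticks({"A": 0, "B": 2}, 8))
```

With extra_slice = 0 a process behaves exactly as in stock xv6 (one tick per turn), while extra_slice = 2 gives three consecutive ticks per turn.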
You can read lapic.c file:
void
lapicinit(void)
{
    ....
    // The timer repeatedly counts down at bus frequency
    // from lapic[TICR] and then issues an interrupt.
    // If xv6 cared more about precise timekeeping,
    // TICR would be calibrated using an external time source.
    lapicw(TDCR, X1);
    lapicw(TIMER, PERIODIC | (T_IRQ0 + IRQ_TIMER));
    lapicw(TICR, 10000000);
So, if you want the timer interrupt to be more spaced, change the TICR value:
lapicw(TICR, 10000000); //10 000 000
can become
lapicw(TICR, 100000000); //100 000 000
Warning: TICR is a 32-bit unsigned counter, so do not go over 4 294 967 295 (0xFFFFFFFF).
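The 32-bit limit puts a hard cap on how far the default value can be scaled. A quick Python sketch of the arithmetic, where scaled_ticr is a hypothetical helper and 10000000 is the default from lapic.c shown above:

```python
TICR_MAX = 0xFFFFFFFF        # TICR is a 32-bit unsigned counter
DEFAULT_TICR = 10_000_000    # value written by lapicinit() in stock xv6


def scaled_ticr(factor):
    """Return the TICR value for a time slice `factor` times the default."""
    value = DEFAULT_TICR * factor
    if not 0 < value <= TICR_MAX:
        raise ValueError(f"TICR value {value} does not fit in 32 bits")
    return value


print(scaled_ticr(10))               # the 10x example from above
print(TICR_MAX // DEFAULT_TICR)      # largest whole-number factor that fits
```

Keep in mind that this stretches every timer tick, so anything in the kernel that counts ticks slows down by the same factor.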

How to skip a line from execution in windbg everytime it hits?

Suppose I want to skip line 3 of function func every time it is called
int func() {
    int a = 10, b = 20;
    a = 25;
    b = 30;
    return a+b;
}
so every time it should be returning 40 (i.e. it doesn't execute the 3rd line, a = 25).
Is there any similar command in windbg like jump in gdb?
Again a very late answer, but if messing with assembly is not preferable,
set a conditional breakpoint to skip executing one line.
In the example below 401034 is the address of the line you do not want to execute,
so set a conditional breakpoint on that address to skip it:
bp 401034 "r eip = @eip + <size of current instruction>; gc"
The size is 7 in this case; gc = go from conditional breakpoint.
jmptest:\>dir /b
jmptest.c
jmptest:\>type jmptest.c
#include <stdio.h>
int func()
{
int a = 10 , b = 20;
a = 25;
b = 30;
return a+b;
}
int main (void)
{
int i , ret;
for (i= 0; i< 10; i++)
{
ret = func();
printf("we want 40 we get %d\n",ret);
}
return 0;
}
jmptest:\>cl /nologo /Zi jmptest.c
jmptest.c
jmptest:\>dir /b *.exe
jmptest.exe
jmptest:\>cdb -c "uf func;q" jmptest.exe | grep 401
00401020 55 push ebp
00401021 8bec mov ebp,esp
00401023 83ec08 sub esp,8
00401026 c745fc0a000000 mov dword ptr [ebp-4],0Ah
0040102d c745f814000000 mov dword ptr [ebp-8],14h
00401034 c745fc19000000 mov dword ptr [ebp-4],19h
0040103b c745f81e000000 mov dword ptr [ebp-8],1Eh
00401042 8b45fc mov eax,dword ptr [ebp-4]
00401045 0345f8 add eax,dword ptr [ebp-8]
00401048 8be5 mov esp,ebp
0040104a 5d pop ebp
0040104b c3 ret
jmptest:\>cdb -c "bp 401034 \"r eip = 0x40103b;gc\";g;q " jmptest.exe | grep want
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
we want 40 we get 40
jmptest:\>
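The target address in that breakpoint is just the address of the skipped instruction plus its length, which can be read off the opcode bytes in the uf listing. A small Python sketch that builds the command from one disassembly line; it assumes a 32-bit target (eip) and the "address bytes mnemonic" layout shown in the listing above, and skip_breakpoint_command is a made-up helper name.

```python
def skip_breakpoint_command(disasm_line):
    """Build a bp command that skips the instruction on one `uf` line."""
    # Example line: "00401034 c745fc19000000 mov dword ptr [ebp-4],19h"
    addr_text, opcode_hex = disasm_line.split()[:2]
    addr = int(addr_text, 16)
    insn_len = len(bytes.fromhex(opcode_hex))   # one byte per two hex digits
    return 'bp %x "r eip = 0x%x; gc"' % (addr, addr + insn_len)


line = "00401034 c745fc19000000  mov     dword ptr [ebp-4],19h"
print(skip_breakpoint_command(line))
```

For the line above this yields the same 0x40103b target that was typed by hand in the cdb session.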
If you're familiar with assembly, you can use the a command to change the assembly (i.e. turn the opcodes for "a = 25;" into all NOPs). This is what I typically do when I want to NOP out or otherwise change an instruction stream.
Occasionally people will rely on the fact that the byte code for the NOP instruction is 0x90 and use the e command to edit the memory (e.g. "ew @eip 0x9090", which overwrites two bytes; every byte of the instruction has to be covered this way). This gives the same result as using the a command.
Lastly, if you're hitting this operation infrequently and just want to manually skip the instruction you can use the, "Set Current Instruction" GUI operation:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542851(v=vs.85).aspx
There is a tutorial here that explains how to do this, you can set the offset so that it skips the line: http://cfc.kizzx2.com/index.php/tutorial-using-windbg-to-bypass-specific-functions-windbg-kung-fu-series/ and set the register eip to this value.
Also, you can set the breakpoint and put the command into the breakpoint to do the same: http://japrogbits.blogspot.co.uk/2010/01/using-breakpoints-to-skip-function-in.html and another blog: http://www.shcherbyna.com/?p=1234 and also you can use the .call to achieve the same: http://blogs.msdn.com/b/oldnewthing/archive/2007/04/27/2292037.aspx