When I try to run any C++ program in VS Code, an exception occurs and the program does not run:
Oops, something went wrong. Please report this bug with the details below.
Report on GitHub: https://github.com/lzybkr/PSReadLine/issues/new
-----------------------------------------------------------------------
Last 101 Keys:
c d Space " c : \ U s e r s \ U S E R
\ D o c u m e n t s \ p r o g r a m m i n g \ " Space ; Space i f Space ( $ ?
) Space { Space g + + Space c o d e 2 . c p p Space - o Space c o d e 2 Space
} Space ; Space i f Space ( $ ? ) Space { Space . \ c o d e 2 Space } Enter
Exception:
System.ArgumentOutOfRangeException: The value must be greater than or equal to zero and less than the console's buffer size in that dimension.
Parameter name: left
Actual value was -1.
at System.Console.SetCursorPosition(Int32 left, Int32 top)
at Microsoft.PowerShell.Internal.VirtualTerminal.set_CursorLeft(Int32 value)
at Microsoft.PowerShell.PSConsoleReadLine.ReallyRender(RenderData renderData, String defaultColor)
at Microsoft.PowerShell.PSConsoleReadLine.ForceRender()
at Microsoft.PowerShell.PSConsoleReadLine.Insert(Char c)
at Microsoft.PowerShell.PSConsoleReadLine.SelfInsert(Nullable`1 key, Object arg)
at Microsoft.PowerShell.PSConsoleReadLine.ProcessOneKey(ConsoleKeyInfo key, Dictionary`2 dispatchTable, Boolean ignoreIfNoAction, Object arg)
at Microsoft.PowerShell.PSConsoleReadLine.InputLoop()
at Microsoft.PowerShell.PSConsoleReadLine.ReadLine(Runspace runspace, EngineIntrinsics engineIntrinsics)
-----------------------------------------------------------------------
If I type the command manually in the terminal, I can run my C++ program.
I found some issues in the VS Code GitHub repository about this problem, but I could not understand any of the solutions.
All benchmarks are run on either Icelake or Whiskey Lake (in the Skylake family).
Summary
I am seeing a strange phenomenon: when a loop
transitions from running out of the Uop Cache to running out of
the LSD (Loop Stream Detector), there is a spike in Branch
Misses that can cause severe performance hits. I tested on both
Icelake and Whiskey Lake by benchmarking a nested loop whose outer loop
has a body large enough that the entire thing does not fit in
the LSD, but whose inner loop is small enough to fit in the
LSD.
Basically, once the inner loop reaches some iteration count, decoding
seems to switch from idq.dsb_uops (Uop Cache) to lsd.uops (LSD),
and at that point there is a large increase in branch-misses
(without a corresponding jump in branches), causing a severe
performance drop. Note: this only seems to occur for nested
loops. Travis Down's Loop Test, for example, will
not show any meaningful variation in branch misses. AFAICT this has
something to do with when a loop transitions from running out of the
Uop Cache to running out of the LSD.
Questions
What is happening when the loop transitions from running out of the
Uop Cache to running out of the LSD that causes this spike in
Branch Misses?
Is there a way to avoid this?
Benchmark
This is the minimum reproducible example I could come up with:
Note: If the .p2align statements are removed, both loops will fit in
the LSD and there will not be a transition.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "movl %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align 10\n"
        "2:\n"
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
}
Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>
Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Graphs
Edit: x label is N iterations of inner loop.
Below is a graph of Branch Misses, Branches, and LSD Uops.
Generally you can see that 1) there is no corresponding jump in Branches, 2) the number of added Branch Misses stabilizes at a constant, and 3) there is a strong relationship between Branch Misses and LSD Uops.
Icelake Graph:
Whiskey Lake Graph:
Below is a graph of Branch Misses, Cycles, and LSD Uops
for Icelake only, as performance is not affected nearly as much on Whiskey Lake:
Analysis
Hard numbers below.
For Icelake, starting at N = 22 and finishing at N = 27, there
is some fluctuation in the number of uops coming from the LSD vs the
Uop Cache, and during that time Branch Misses increases by
roughly 3 orders of magnitude, from 10^4 -> 10^7. During this period
Cycles also increases by a factor of 2. For all N > 27,
Branch Misses stays at around 1.67 x 10^7 (roughly outer_N). For N = [17, 40],
branches continues to increase only linearly.
The results for Whiskey Lake look different in that 1) the data begins fluctuating at N = 35 and continues to fluctuate until N = 49, and 2) there is less of a
performance impact and more fluctuation in the data. That being said,
the increase in Branch Misses corresponding with the transition from
uops being fed by the Uop Cache to being fed by the LSD still exists.
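As a rough sanity check on the statement that branches grows only linearly, the sketch below compares the Icelake branches column against a simple model. The model is an assumption and not something measured directly in the post: it charges each outer iteration N inner jnz, one outer jnz, and two jmp instructions that the assembler is assumed to emit to skip the padding of the two .p2align 10 directives. The names expected_branches and relative_error are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Outer-loop trip count used by the benchmark (1UL << 24). */
#define OUTER_N (UINT64_C(1) << 24)

/*
 * Expected retired branches for a given inner-loop count, ASSUMING each
 * outer iteration retires inner_n inner jnz, 1 outer jnz, and 2 jmps
 * (one per ".p2align 10" padding region). This is a hypothetical model.
 */
uint64_t expected_branches(uint64_t inner_n) {
    return OUTER_N * (inner_n + 3);
}

/* Relative difference between a measured and an expected count. */
double relative_error(uint64_t measured, uint64_t expected) {
    assert(expected != 0);
    double diff = (double)measured - (double)expected;
    if (diff < 0) {
        diff = -diff;
    }
    return diff / (double)expected;
}
```

With the Icelake table values, expected_branches(40) is 721420288 against a measured 721441018. The Whiskey Lake counts sit closer to outer_N * (N + 1), so the two extra jumps may depend on how that binary was assembled.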
Results
Data is mean result for 25 runs.
Icelake Results:
N | cycles | branches | branch-misses | idq.ms_uops | idq.dsb_uops | lsd.uops
1 | 33893260 | 67129521 | 1590 | 43163 | 115243 | 83908732
2 | 42540891 | 83908928 | 1762 | 49023 | 142909 | 100690381
3 | 50725933 | 100686143 | 1782 | 47656 | 142506 | 117440256
4 | 67533597 | 117461172 | 1655 | 52538 | 186123 | 134158311
5 | 68022910 | 134238387 | 1711 | 53405 | 204481 | 150954035
6 | 85543126 | 151018722 | 1924 | 62445 | 141397 | 167633971
7 | 84847823 | 167799220 | 1935 | 60248 | 160146 | 184563523
8 | 101532158 | 184570060 | 1709 | 60064 | 361208 | 201100179
9 | 101864898 | 201347253 | 1773 | 63827 | 459873 | 217780207
10 | 118024033 | 218124499 | 1698 | 59480 | 177223 | 234834304
11 | 118644416 | 234908571 | 2201 | 62514 | 422977 | 251503052
12 | 134627567 | 251678909 | 1679 | 57262 | 133462 | 268435650
13 | 285607942 | 268456135 | 1770 | 74070 | 285032524 | 315423
14 | 302717754 | 285233352 | 1731 | 74663 | 302101097 | 15953
15 | 321627434 | 302010569 | 81796 | 77831 | 319192830 | 1520819
16 | 337876736 | 318787786 | 71638 | 77056 | 335904260 | 1265766
17 | 353054773 | 335565563 | 1798 | 79839 | 352434780 | 15879
18 | 369800279 | 352344970 | 1978 | 79863 | 369229396 | 16790
19 | 386921048 | 369119438 | 1972 | 84075 | 385984022 | 16115
20 | 404248461 | 385896655 | 29454 | 85348 | 402790977 | 510176
21 | 421100725 | 402673872 | 37598 | 83400 | 419537730 | 729397
22 | 519623794 | 419451095 | 4447767 | 91209 | 431865775 | 97827331
23 | 702206338 | 436228323 | 12603617 | 109064 | 427880075 | 327661987
24 | 710626194 | 453005538 | 12316933 | 106929 | 432926173 | 344838509
25 | 863214037 | 469782765 | 14641887 | 121776 | 428085132 | 614871430
26 | 761037251 | 486559974 | 13067814 | 113011 | 438093034 | 418124984
27 | 832686921 | 503337195 | 16381350 | 113953 | 421924080 | 556915419
28 | 854713119 | 520114412 | 16642396 | 124448 | 420515666 | 598907353
29 | 869873144 | 536891629 | 16572581 | 119280 | 421188631 | 629696780
30 | 889642335 | 553668847 | 16717446 | 120116 | 420086570 | 668628871
31 | 906912275 | 570446064 | 16735759 | 126094 | 419970933 | 702822733
32 | 923023862 | 587223281 | 16706519 | 132498 | 420332680 | 735003892
33 | 940308170 | 604000498 | 16744992 | 124365 | 419945191 | 770185745
34 | 957075000 | 620777716 | 16726856 | 133675 | 420215897 | 802779119
35 | 974557538 | 637554932 | 16763071 | 134871 | 419764866 | 838012637
36 | 991110971 | 654332149 | 16772560 | 130903 | 419641144 | 872037131
37 | 1008489575 | 671109367 | 16757219 | 138788 | 419900997 | 904638287
38 | 1024971256 | 687886583 | 16772585 | 139782 | 419663863 | 938988917
39 | 1041404669 | 704669411 | 16776722 | 137681 | 419617131 | 972896126
40 | 1058594326 | 721441018 | 16773492 | 142959 | 419662133 | 1006109192
41 | 1075179100 | 738218235 | 16776636 | 141185 | 419601996 | 1039892900
42 | 1092093726 | 754995452 | 16776233 | 142793 | 419611902 | 1073373451
43 | 1108706464 | 771773224 | 16776359 | 139500 | 419610885 | 1106976114
44 | 1125413652 | 788549886 | 16764637 | 143717 | 419889127 | 1139628280
45 | 1142023614 | 805327103 | 16778640 | 144397 | 419558217 | 1174329696
46 | 1158833317 | 822104321 | 16765518 | 148045 | 419889914 | 1206833484
47 | 1175665684 | 838881537 | 16778437 | 148347 | 419562885 | 1241397845
48 | 1192454164 | 855658755 | 16778865 | 153651 | 419552747 | 1275006511
49 | 1210199084 | 872436025 | 16778287 | 152468 | 419599314 | 1307945613
50 | 1226321832 | 889213188 | 16778464 | 155552 | 419572344 | 1341893668
51 | 1242886388 | 905990406 | 16778745 | 155401 | 419559249 | 1375589883
52 | 1259559053 | 922767623 | 16778809 | 154847 | 419554082 | 1409206082
53 | 1276875799 | 939544839 | 16778460 | 162521 | 419576455 | 1442424993
54 | 1293113199 | 956322057 | 16778931 | 154913 | 419550955 | 1476316161
55 | 1310449232 | 973099274 | 16778534 | 157364 | 419578102 | 1509485876
56 | 1327022109 | 989876491 | 16778794 | 162881 | 419562403 | 1543193559
57 | 1344097516 | 1006653708 | 16778906 | 157486 | 419567545 | 1576414302
58 | 1362935064 | 1023430928 | 16778959 | 315120 | 419583132 | 1609691339
59 | 1381567560 | 1040208143 | 16778564 | 179997 | 419661259 | 1640660745
60 | 1394829416 | 1056985359 | 16778779 | 167613 | 419575969 | 1677034188
61 | 1411847237 | 1073762626 | 16778071 | 166332 | 419613028 | 1710194702
62 | 1428918439 | 1090539795 | 16778409 | 168073 | 419610487 | 1743644637
63 | 1445223241 | 1107317011 | 16778486 | 172446 | 419591254 | 1777573503
64 | 1461530579 | 1124094228 | 16769606 | 169559 | 419970612 | 1810351736
Whiskey Lake Results:
N | cycles | branches | branch-misses | idq.dsb_uops | lsd.uops
1 | 8332553879 | 35005847 | 37925 | 1799462 | 6019
2 | 8329926329 | 51163346 | 34338 | 1114352 | 5919
3 | 8357233041 | 67925775 | 32270 | 9241935 | 5555
4 | 8379609449 | 85364250 | 35667 | 18215077 | 5712
5 | 8394301337 | 101563554 | 33177 | 26392216 | 2159
6 | 8409830612 | 118918934 | 35007 | 35318763 | 5295
7 | 8435794672 | 135162597 | 35592 | 43033739 | 4478
8 | 8445843118 | 152636271 | 37802 | 52154850 | 5629
9 | 8459141676 | 168577876 | 30766 | 59245754 | 1543
10 | 8475484632 | 185354280 | 30825 | 68059212 | 4672
11 | 8493529857 | 202489273 | 31703 | 77386249 | 5556
12 | 8509281533 | 218912407 | 32133 | 84390084 | 4399
13 | 8528605921 | 236303681 | 33056 | 93995496 | 2093
14 | 8553971099 | 252439989 | 572416 | 99700289 | 2477
15 | 8558526147 | 269148605 | 29912 | 109772044 | 6121
16 | 8576658106 | 286414453 | 29839 | 118504526 | 5850
17 | 8591545887 | 302698593 | 28993 | 126409458 | 4865
18 | 8611628234 | 319960954 | 32568 | 136298306 | 5066
19 | 8627289083 | 336312187 | 30094 | 143759724 | 6598
20 | 8644741581 | 353730396 | 49458 | 152217853 | 9275
21 | 8685908403 | 369886284 | 1175195 | 161313923 | 7958903
22 | 8694494654 | 387336207 | 354008 | 169541244 | 2553802
23 | 8702920906 | 403389097 | 29315 | 176524452 | 12932
24 | 8711458401 | 420211718 | 31924 | 184984842 | 11574
25 | 8729941722 | 437299615 | 32472 | 194553843 | 12002
26 | 8743658904 | 453739403 | 28809 | 202074676 | 13279
27 | 8763317458 | 470902005 | 32298 | 211321630 | 15377
28 | 8788189716 | 487432842 | 37105 | 218972477 | 27666
29 | 8796580152 | 504414945 | 36756 | 228334744 | 79954
30 | 8821174857 | 520930989 | 39550 | 235849655 | 140461
31 | 8818857058 | 537611096 | 34142 | 648080 | 79191
32 | 8855038758 | 555138781 | 37680 | 18414880 | 70489
33 | 8870680446 | 571194669 | 37541 | 34596108 | 131455
34 | 8888946679 | 588222521 | 33724 | 52553756 | 80009
35 | 9256640352 | 604791887 | 16658672 | 132185723 | 41881719
36 | 9189040776 | 621918353 | 12296238 | 257921026 | 235389707
37 | 8962737456 | 638241888 | 1086663 | 109613368 | 35222987
38 | 9005853511 | 655453884 | 2059624 | 131945369 | 73389550
39 | 9005576553 | 671845678 | 1434478 | 143002441 | 51959363
40 | 9284680907 | 688991063 | 12776341 | 349762585 | 347998221
41 | 9049931865 | 705399210 | 1778532 | 174597773 | 72566933
42 | 9314836359 | 722226758 | 12743442 | 365270833 | 380415682
43 | 9072200927 | 739449289 | 1344663 | 205181163 | 61284843
44 | 9346737669 | 755766179 | 12681859 | 383580355 | 409359111
45 | 9117099955 | 773167996 | 1801713 | 235583664 | 88985013
46 | 9108062783 | 789247474 | 860680 | 250992592 | 43508069
47 | 9129892784 | 806871038 | 984804 | 268229102 | 51249366
48 | 9146468279 | 822765997 | 1018387 | 282312588 | 58278399
49 | 9476835578 | 840085058 | 13985421 | 241172394 | 809315446
50 | 9495578885 | 856579327 | 14155046 | 241909464 | 847629148
51 | 9537115189 | 873483093 | 15057500 | 238735335 | 932663942
52 | 9556102594 | 890026435 | 15322279 | 238194482 | 982429654
53 | 9589094741 | 907142375 | 15899251 | 234845868 | 1052080437
54 | 9609053333 | 923477989 | 16049518 | 233890599 | 1092323040
55 | 9628950166 | 940997348 | 16172619 | 235383688 | 1131146866
56 | 9650657175 | 957049360 | 16445697 | 231276680 | 1183699383
57 | 9666446210 | 973785857 | 16330748 | 233203869 | 1205098118
58 | 9687274222 | 990692518 | 16523542 | 230842647 | 1254624242
59 | 9706652879 | 1007946602 | 16576268 | 231502185 | 1288374980
60 | 9720091630 | 1024044005 | 16547047 | 230966608 | 1321807705
61 | 9741079017 | 1041285110 | 16635400 | 230873663 | 1362929599
62 | 9761596587 | 1057847755 | 16683756 | 230289842 | 1399235989
63 | 9782104875 | 1075055403 | 16299138 | 237386812 | 1397167324
64 | 9790122724 | 1091147494 | 16650471 | 229928585 | 1463076072
Edit:
2 things worth noting:
If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.
Adding an lfence along with the padding in the outer loop changes the N threshold to 31.
edit2: Benchmark which clears branch history. The condition was reversed: it should be cmove, not cmovne. With the fixed version, any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^7). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that the LSD is in some way defeating the Branch Predictor (what I believe to be the case).
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "testl $3, %k[outer_loop_cnt]\n"
        "movl $1000, %k[inner_loop_cnt]\n"
        // Corrected to cmove; the originally posted code used cmovne,
        // which had the condition reversed.
        "cmove %k[inner_N], %k[inner_loop_cnt]\n"
        // Extra align surrounding inner loop so that the entire thing
        // doesn't execute out of LSD.
        ".p2align 10\n"
        "2:\n"
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
        :);
}
This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD; see Footnote 1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.
Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.
Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.
My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either or both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.
With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.
So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)
e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.
N | count | rate | variance
17 | 59,602 | 0.02% of all branches | ( +- 10.85% )
20 | 192,307 | 0.05% of all branches | ( +- 44.60% )
21 | 79,853 | 0.02% of all branches | ( +- 14.16% )
30 | 136,308 | 0.02% of all branches | ( +- 18.57% )
31..32 | similar to N=34 | | ( +- 2 or 3% )
33 | 22,415,089 | 3.71% of all branches | ( +- 0.11% )
34 | 91,483 | 0.01% of all branches | ( +- 2.36% )
35 (and 36..37 similar) | 98,806 | 0.02% of all branches | ( +- 2.75% )
38 | 33,517,630 | 4.87% of all branches | ( +- 0.05% )
39 | 102,077 | 0.01% of all branches | ( +- 1.96% )
40 | 33,458,267 | 4.64% of all branches | ( +- 0.06% )
41 | 116,241 | 0.02% of all branches | ( +- 6.86% )
42 | 22,376,562 | 2.96% of all branches | ( +- 0.01% )
43 | 116,713 | 0.02% of all branches | ( +- 5.25% )
44 | 174,834 | 0.02% of all branches | ( +- 35.08% )
45 | 124,555 | 0.02% of all branches | ( +- 5.36% )
46 | 135,838 | 0.02% of all branches | ( +- 9.95% )
These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.
Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)
With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.
Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.
This is surprising, and different from a loop without as much padding.
Reason
The spike in Branch Misses is caused by the inner loop running out of the LSD.
The LSD causes an extra branch miss for low iteration counts because the "stop" condition on the LSD is a branch miss.
From Intel Optimization Manual Page 86.
The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only
a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern
repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number
of wasted slots during LSD.
Essentially what is happening is that when low enough iteration counts run out of the Uop Cache they are perfectly predictable. But when they run out of the LSD since the built in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is don't let nested loops execute out of the LSD. Note that LSD only kicks in after ~[20, 25] iterations so an inner loop with < 20 iterations will run optimally.
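To make the "one extra branch miss per outer iteration" reading concrete, here is a small sketch that divides a measured branch-misses count by the number of outer iterations. The function name misses_per_outer_iteration is made up for illustration, and the inputs used below are simply two rows of the Icelake table from the question (N = 50 for the LSD case, N = 12 for a low-miss case).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average number of branch misses charged to each outer-loop iteration.
 * A value near 1.0 is what one would expect if every exit from the inner
 * loop is mispredicted once; a value near 0 means exits are predicted.
 */
double misses_per_outer_iteration(uint64_t branch_misses,
                                  uint64_t outer_iterations) {
    assert(outer_iterations != 0);
    return (double)branch_misses / (double)outer_iterations;
}
```

With outer_N = 2^24, the N = 50 row (16778464 misses) gives a ratio just above 1, while the N = 12 row (1679 misses) gives a ratio of roughly 0.0001.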
Benchmark
All benchmarks are run on Icelake.
The new benchmark is essentially the same as the one in the original post, but on the advice of @PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is a fixed length, so that there is no change in how the branches may alias in the BHT (Branch History Table), but a varying number of nops to sometimes defeat the LSD.
I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.
The benchmark code is as follows:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask = (1023 << bht_shift);
#define NOP1 ".byte 0x90\n"
#define NOP2 ".byte 0x66,0x90\n"
#define NOP3 ".byte 0x0f,0x1f,0x00\n"
#define NOP4 ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5 ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6 ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7 ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8 ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9 ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10 ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11 ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
static void BENCH_ATTR
bench(uint64_t inner_N) {
    uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
    asm volatile(
        ".p2align 12\n"
        "movl %k[outer_N], %k[outer_loop_cnt]\n"
        ".p2align 6\n"
        "1:\n"
        "movl %k[inner_N], %k[inner_loop_cnt]\n"
        ".p2align 10\n"
        "2:\n"
        // This is defined in "inner_nops.h" with the necessary padding.
        INNER_NOPS
        "decl %k[inner_loop_cnt]\n"
        "jnz 2b\n"
        ".p2align 10\n"
        "decl %k[outer_loop_cnt]\n"
        "jnz 1b\n"
        : [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
          [ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
        : [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
          [ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
        :);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
    assert(argc > 1);
    uint64_t inner_N = atoi(argv[1]);
    bench(inner_N);
    return 0;
}
Tests
I tested nop count = [0, 39].
Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:
#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1
To reach the full 128 byte padding.
Results
At nop count <= 32 the inner loop is able to run out of the LSD, and we consistently see elevated Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers, outer loop iterations = 2^24.
At nop count > 32 the loop has too many uops for the LSD and runs out of the Uop Cache. Here we do not see consistently elevated Branch Misses until Iterations becomes too large for its BHT entry to store its entire history.
nop count > 32 (No LSD)
Once there are too many nops for the LSD, the number of Branch Misses stays relatively low, with a few consistent spikes, until Iterations = 146, where Branch Misses spikes to the number of outer loop iterations (2^24 in this case) and remains constant. My guess is that this is the upper bound on the history the BHT is able to store.
Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:
All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.
The key point, however, is that for the most part, for Iterations = [20, 145], Branch Misses is relatively low, indicating that the entire inner loop is being predicted correctly.
nop count <= 32 (LSD)
This data is a bit noisier, but all of the different nop counts follow roughly the same trend: Branch Misses initially spikes to within a factor of 2 of outer loop iterations at Iterations = [21, 25] (note this is 2-3 orders of magnitude), at the same time that lsd.uops spikes by 4-5 orders of magnitude.
There is also a trend between nop count and the iteration value at which Branch Misses stabilizes at outer loop iterations, with a Pearson Correlation of 0.81. For nop count = [0, 32] the stabilization point is in the range Iterations = [15, 34].
Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:
Generally, with some noise, all of the different nop counts follow the same trend. They also follow the same trend when compared with lsd.uops.
Below is a table with nop count and the Pearson Correlation between Branch Misses and lsd.uops and idq.dsb_uops respectively.
nop | lsd | uop cache
0 | 0.961 | -0.041
1 | 0.955 | -0.081
2 | 0.919 | -0.122
3 | 0.918 | -0.299
4 | 0.947 | -0.117
5 | 0.934 | -0.298
6 | 0.894 | -0.329
7 | 0.907 | -0.308
8 | 0.91 | -0.322
9 | 0.915 | -0.316
10 | 0.877 | -0.342
11 | 0.908 | -0.28
12 | 0.874 | -0.281
13 | 0.875 | -0.523
14 | 0.87 | -0.513
15 | 0.889 | -0.522
16 | 0.858 | -0.569
17 | 0.89 | -0.507
18 | 0.858 | -0.537
19 | 0.844 | -0.565
20 | 0.816 | -0.459
21 | 0.862 | -0.537
22 | 0.848 | -0.556
23 | 0.852 | -0.552
24 | 0.85 | -0.561
25 | 0.828 | -0.573
26 | 0.857 | -0.559
27 | 0.802 | -0.372
28 | 0.762 | -0.425
29 | 0.721 | -0.112
30 | 0.736 | -0.047
31 | 0.768 | -0.174
32 | 0.847 | -0.129
This should generally indicate that there is a strong correlation between the LSD and Branch Misses, and no meaningful relationship between the Uop Cache and Branch Misses.
Overall
Generally I think it is clear that the inner loop executing out of the LSD is what is causing the Branch Misses, until Iterations becomes too large for the BHT entry's history. For the nop count = [33, 39] case, save the spikes noted above, we don't see elevated Branch Misses, but we do for the nop count = [0, 32] case, and the only difference I can tell is the LSD.
How is the neg pseudo instruction implemented with only one sub?
I don't understand, as neg is R[rd] = -R[rs1]. But if I have sub, it is R[rs1] - something.
The "something" in this case is the zero register. But you're not subtracting that from the register, you're subtracting the register from that.
The:
neg rd, rs
pseudo-instruction is meant to put the negation of rs into rd. The
sub rd, zero, rs
instruction subtracts rs from zero, placing the result into rd.
rd := -rs ; example: -(42) -> -42
rd := 0 - rs ; 0 - 42 -> -42
Since -x is the same as 0 - x, they are equivalent.
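The equivalence can also be checked on fixed-width integers. The sketch below models a 32-bit register as uint32_t so that wrap-around is well defined in C; the function names sub_from_zero and neg are made up for illustration, and the 32-bit register width is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Models "sub rd, zero, rs": rd = 0 - rs, with 32-bit wrap-around. */
uint32_t sub_from_zero(uint32_t rs) {
    const uint32_t zero = 0; /* stands in for the hard-wired zero register */
    return zero - rs;
}

/* Models "neg rd, rs" as two's-complement negation: invert, then add 1. */
uint32_t neg(uint32_t rs) {
    return ~rs + 1u;
}
```

For any input the two functions return the same bit pattern, for example 42 becomes 0xFFFFFFD6, which is -42 when read as a signed 32-bit value.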
If you want a more comprehensive list of pseudo-instructions and what they map to, here is an image which details some, including the specific one you asked about:
Hello everyone,
I'm trying to do some calculations and plot the results, but it seems that these are too heavy for Maxima. When I try to calculate N1 and N2 the program crashes when parameter j is too high or when I try to plot them, the program displays the following error message: "Heap exhausted, game over." What should I do? I've seen some people saying to try to compile Maxima with ccl, but I don't know how to do it or if it will work.
I usually receive error messages like:
Message from maxima's stderr stream: Heap exhausted during garbage collection: 0 bytes available, 16 requested.
Gen Boxed Unboxed LgBox LgUnbox Pin Alloc Waste Trig WP GCs Mem-age
0 0 0 0 0 0 0 0 20971520 0 0 0,0000
1 0 0 0 0 0 0 0 20971520 0 0 0,0000
2 0 0 0 0 0 0 0 20971520 0 0 0,0000
3 16417 2 0 0 43 1075328496 707088 293986768 16419 1 0,8032
4 13432 21 0 1141 70 955593760 838624 2000000 14594 0 0,2673
5 0 0 0 0 0 0 0 2000000 0 0 0,0000
6 741 184 34 28 0 63259792 1424240 2000000 987 0 0,0000
7 0 0 0 0 0 0 0 2000000 0 0 0,0000
Total bytes allocated = 2094182048
Dynamic-space-size bytes = 2097152000
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = true
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 13884(tid 0000000001236360):
Heap exhausted, game over.
Here goes the code:
a: 80$;
b: 6*a$;
h1: 80$;
t: 2$;
j: 5$;
carga: 250$;
sig: -carga/2$;
n: 2*q*%pi/b$;
m: i*%pi/a$;
i: 2*p-1$;
i1: 2*p1-1$;
/*i1: p1$;*/
Φ: a/b$;
τ: cosh(x) - (x/sinh(x))$;
σ: sinh(x) - (x/cosh(x))$;
Ψ: sinh(x)/τ$;
Χ: cosh(x)/σ$;
Λ0: 1/(((i/2)^2+Φ^2*q^2)^2)$;
Λ1: sum((((i/2)^3*subst([x=(i*%pi/(2*Φ))],Ψ))/(((i/2)^2+Φ^2*q1^2)^2))*Λ0, p, 1, j)$;
Λ2: sum(((q1^3*subst([x=(q1*%pi*Φ)],Χ))/(((i/2)^2+Φ^2*q1^2)^2))*Λ1, q1, 1, j)$;
Λ3: sum((((i/2)^3*subst([x=(i*%pi/(2*Φ))],Ψ))/(((i/2)^2+Φ^2*q1^2)^2))*Λ2, p, 1, j)$;
Λ4: sum(((q1^3*subst([x=(q1*%pi*Φ)],Χ))/(((i/2)^2+Φ^2*q1^2)^2))*Λ3, q1, 1, j)$;
Λ5: sum((((i/2)^3*subst([x=(i*%pi/(2*Φ))],Ψ))/(((i/2)^2+Φ^2*q1^2)^2))*Λ4, p, 1, j)$;
Ζ0: sum(((q^3*subst([x=(q*%pi*Φ)],Χ))/(((i1/2)^2+Φ^2*q^2)^2))*Λ0, q, 1, j)$;
Ζ2: sum(((q^3*subst([x=(q*%pi*Φ)],Χ))/(((i1/2)^2+Φ^2*q^2)^2))*Λ2, q, 1, j)$;
Ζ4: sum(((q^3*subst([x=(q*%pi*Φ)],Χ))/(((i1/2)^2+Φ^2*q^2)^2))*Λ4, q, 1, j)$;
E: 200000$;
ν: 0.3$;
λ: (ν*E)/((1+ν)*(1-2*ν))$;
μ: E/(2*(1+ν))$;
a0: float(1/(b/2)*integrate(0, y, -(b/2), -h1/2)+1/b*integrate(sig, y, -h1/2, h1/2)+1/(b/2)*integrate(0, y, h1/2, (b/2)))$;
aq: float(1/(b/2)*integrate(0*cos(q*y*%pi/(b/2)), y, -(b/2), - h1/2)+1/(b/2)*integrate(sig*cos(q*y*%pi/(b/2)), y, -h1/2, h1/2)+1/(b/2)*integrate(0*cos(q*y*%pi/(b/2)), y, h1/2, (b/2)))$;
aq1: float(1/(b/2)*integrate(0*cos(q1*y*%pi/(b/2)), y, -(b/2), - h1/2)+1/(b/2)*integrate(sig*cos(q1*y*%pi/(b/2)), y, -h1/2, h1/2)+1/(b/2)*integrate(0*cos(q1*y*%pi/(b/2)), y, h1/2, (b/2)))$;
Bq: aq/((λ+μ)*subst([x=q*%pi*Φ],σ))+((16*Φ^4*q^2*(-1)^q)/((λ+μ)*%pi^2*subst([x=q*%pi*Φ],σ)))*sum(q1*aq1*(-1) ^q1*subst([x=q1*%pi*Φ],Χ)*(Λ1+(16*Φ^4/(%pi^2))*Λ3+((16*Φ^4/(%pi^2))^2)*Λ5), q1, 1, j)+(8*λ*Φ^3*q^2*(-1)^q*a0)/((λ+μ)*(λ+2*μ)*(%pi^3)*subst([x=q*%pi*Φ],σ))*sum(subst([x=i*%pi/(2*Φ)],Ψ)/(i/ 2)*(Λ0+(16*Φ^4/(%pi^2))*Λ2+((16*Φ^4/(%pi^2))^2)*Λ4), p, 1, j)$;
βp: -(2*λ*a0*(-1)^((i-1)/2))/((λ+μ)*(λ+2*μ)*(i/2)^2*%pi^2*subst([x=i*%pi/(2*Φ)],τ))-((32*λ*Φ^4*(i/2)^2*a0*(-1)^((i-1)/2))/((λ+μ)*(λ+2*μ)*%pi^2*subst([x=i*%pi/(2*Φ)],τ)))*sum(((subst([x=i1*%pi/(2*Φ)],Ψ))/(i1/2))*(Ζ0+Ζ2*((16*Φ^4)/%pi^2)+Ζ4*(((16*Φ^4)/%pi^2)^2)),p1,1,j)-((4*Φ*(i/2)^2*(-1)^((i-1)/2))/((λ+μ)*%pi*subst([x=i*%pi/(2*Φ)],τ)))*sum(q*aq*(-1)^q*subst([x=q*%pi*Φ],Χ)*(Λ0+Λ2*(16*Φ^4/%pi^2)+Λ4*(16*Φ^4/%pi^2)^2),q,1,j)$;
N1: (2*a0/a)*x+(λ+μ)*sum(Bq*((1+((n*a*sinh(n*a/2))/(2*cosh(n*a/2))))*sinh(n*x)-n*x*cosh(n*x))*cos(n*y),q,1,j)+(λ+μ)*sum(βp*((1-((m*b*cosh(m*b/2))/(2*sinh(m*b/2))))*cosh(m*y)+m*y*sinh(m*y))*sin(m*x),p,1,j)$;
N2: ((2*λ*a0)/(a*(λ+2*μ)))*x+(λ+μ)*sum(Bq*((1-((n*a*sinh(n*a/2))/(2*cosh(n*a/2))))*sinh(n*x)+n*x*cosh(n*x))*cos(n*y),q,1,j)+(λ+μ)*sum(βp*((1+((m*b*cosh(m*b/2))/(2*sinh(m*b/2))))*cosh(m*y)-m*y*sinh(m*y))*sin(m*x),p,1,j);
wxplot3d(N1, [x,-a/2,a/2], [y,-b/2,b/2])$;
wxplot3d(N2, [x,-a/2,a/2], [y,-b/2,b/2])$;
This is not a complete answer, since I don't know how this should work with wxMaxima: I would suggest that you ask the developers. However it's too long for a comment and I think might be useful to people, and it does answer the question of how you solve the heap-size limit for Maxima itself when using SBCL, at least when run on Linux or some other platform with a command-line.
As a note, I suspect that the underlying problem is not the heap size, but that the calculation is blowing up in some horrible way: the best fix is probably to understand what's blowing up and fix that. See Robert Dodier's answer, which is probably going to be a lot more helpful. However, if heap size is the problem, this is how you deal with it for Maxima.
The trick is that you can tell SBCL what the heap limit should be by passing it the --dynamic-space-size <MB> argument, and you can pass arguments through the maxima wrapper to do this.
Here is a transcript of Maxima, being run on Linux, with SBCL as a back end (this is a version built from source: the packaged version will I assume be the same):
$ maxima
Maxima 5.43.2 http://maxima.sourceforge.net
using Lisp SBCL 2.0.0
Distributed under the GNU Public License. See the file COPYING.
Dedicated to the memory of William Schelter.
The function bug_report() provides bug reporting information.
(%i1) :lisp (sb-ext:dynamic-space-size)
1073741824
So, on this system the default heap limit is 1GB (this is SBCL's default limit on the platform).
Now we can pass the -X <lisp options> aka --lisp-options=<lisp options> option to the maxima wrapper to pass the appropriate option through to sbcl:
$ maxima -X '--dynamic-space-size 2000'
Lisp options: (--dynamic-space-size 2000)
Maxima 5.43.2 http://maxima.sourceforge.net
using Lisp SBCL 2.0.0
Distributed under the GNU Public License. See the file COPYING.
Dedicated to the memory of William Schelter.
The function bug_report() provides bug reporting information.
(%i1) :lisp (sb-ext:dynamic-space-size)
2097152000
As you can see, this has roughly doubled the heap size (from 1073741824 to 2097152000 bytes).
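The byte counts printed in the two transcripts are consistent with the option value being read as megabytes of 1024 * 1024 bytes. A minimal sketch of that arithmetic follows; the function name megabytes_to_bytes is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a heap size given in megabytes (1024 * 1024 bytes) to bytes. */
uint64_t megabytes_to_bytes(uint64_t megabytes) {
    return megabytes * 1024u * 1024u;
}
```

Here 2000 gives 2097152000, matching the second transcript, and 1024 gives 1073741824, matching the default. The crash dump in the question also reports "Dynamic-space-size bytes = 2097152000", which suggests (this is an inference, not something the question states) that the failing session was already running with a 2000 MB limit.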
If someone knows the answer for wxMaxima then please do add an edit to this answer: I can't experiment with it because all my Linux VMs are headless.
Also not a complete answer here, but some more notes and pointers which I hope will help.
To make the problem easier for Maxima to digest, use only exact numbers (integers and ratios), and avoid float and numer. (Plotting functions will apply float and numer automatically.) I changed 0.3 to 3/10 and cut out the calls to float.
Also, try setting j to a smaller number (I tried j equal to 1) to try to work all the way through the problem before increasing it to 5 again.
Also, replace all sum and integrate with 'sum and 'integrate (i.e. noun expressions instead of verb expressions). Take a look at the summands and integrands to see if they look right. You can evaluate the sums and/or integrals or both via ev(expr, sum) or ev(expr, integrate) or ev(expr, nouns) to evaluate 'sum, 'integrate, or all noun expressions, respectively.
With j equal to 1, I get the following expression for N1:
(2500000*((-(13*cosh(%pi/6)
*((8503056*cosh(%pi/6)^2*sinh(3*%pi)^2)
/(9765625*%pi^4
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))^2
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))^2)
+(52488*cosh(%pi/6)*sinh(3*%pi))
/(15625*%pi^2*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi)))
+324/25))
/(120000*%pi^2*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))))
+(13*sinh(3*%pi)
*((2754990144*cosh(%pi/6)^3*sinh(3*%pi)^2)
/(244140625*%pi^4
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))^3
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))^2)
+(17006112*cosh(%pi/6)^2*sinh(3*%pi))
/(390625*%pi^2
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))^2
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi)))
+(104976*cosh(%pi/6))
/(625*(sinh(%pi/6)-%pi/(6*cosh(%pi/6))))))
/(22680000*%pi^2*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))^2)
+13/(35000*%pi^2*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))))
*sin((%pi*(2*p-1)*x)/80)
*((%pi*(2*p-1)*y*sinh((%pi*(2*p-1)*y)/80))/80
+(1-(3*%pi*(2*p-1)*cosh(3*%pi*(2*p-1)))
/sinh(3*%pi*(2*p-1)))
*cosh((%pi*(2*p-1)*y)/80)))
/13
+(2500000*((-(13*cosh(%pi/6)
*((344373768*cosh(%pi/6)^2*sinh(3*%pi)^3)
/(244140625*%pi^4
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
^2
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))
^3)
+(2125764*cosh(%pi/6)*sinh(3*%pi)^2)
/(390625*%pi^2
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))^2)
+(13122*sinh(3*%pi))
/(625*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi)))))
/(1620000*%pi^3*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))^2))
+(13*sinh(3*%pi)
*((8503056*cosh(%pi/6)^2*sinh(3*%pi)^2)
/(9765625*%pi^4
*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))^2
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi))^2)
+(52488*cosh(%pi/6)*sinh(3*%pi))
/(15625*%pi^2*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi)))
+324/25))
/(3780000*%pi^3*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))
*(cosh(3*%pi)-(3*%pi)/sinh(3*%pi)))
-13/(20000*%pi*(sinh(%pi/6)-%pi/(6*cosh(%pi/6)))))
*(((%pi*sinh(%pi/6))/(6*cosh(%pi/6))+1)
*sinh((%pi*x)/240)
-(%pi*x*cosh((%pi*x)/240))/240)*cos((%pi*y)/240))
/13-(25*x)/48$
Now in order to plot that, it should be a function of x and y only. However listofvars reports that it contains x, y, and p. Hmm. I see that βp has a summation over p1 but it contains Ζ0, which contains Λ0, which contains p. Is the summation over p1 supposed to be over p? Is the summand supposed to contain p1 instead of p?
Likewise it appears that N2, after evaluating the sums and integrals with j equal to 1, contains p.
Maybe you need to rework the formulas somewhat? I don't know what the correct form might be.
Does anyone know how to calculate a mod b on a Casio fx-991ES calculator? Thanks
This calculator does not have a modulo function. However, there is a fairly simple way to compute the modulo using the ab/c display mode (instead of the traditional d/c).
How to switch display mode to ab/c:
Go to settings (Shift + Mode).
Press arrow down (to view more settings).
Select ab/c (number 1).
Now do your calculation (in comp mode), like 50 / 3 and you will see 16 2/3, thus, mod is 2. Or try 54 / 7 which is 7 5/7 (mod is 5).
If you don't see any fraction then the mod is 0 like 50 / 5 = 10 (mod is 0).
The remainder fraction is shown in reduced form, so 60 / 8 will result in 7 1/2. Remainder is 1/2 which is 4/8 so mod is 4.
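To make the reduced-fraction caveat concrete, here is a minimal Python sketch that mimics what the ab/c display shows and then scales the reduced fraction back to the original divisor. The function name is made up for illustration, and it assumes positive integers only:

```python
from fractions import Fraction


def mod_from_mixed_fraction(a, b):
    """Mimic the ab/c display for positive integers a and b.

    Returns (whole part, reduced fractional part, remainder).
    """
    q = Fraction(a, b)                     # Fraction reduces automatically
    whole = q.numerator // q.denominator   # the integer shown on the display
    frac = q - whole                       # e.g. 1/2 for 60/8
    # The reduced denominator always divides b, so scale the numerator back up.
    remainder = frac.numerator * (b // frac.denominator)
    return whole, frac, remainder


print(mod_from_mixed_fraction(50, 3))  # whole 16, fraction 2/3, remainder 2
print(mod_from_mixed_fraction(60, 8))  # whole 7, fraction 1/2, remainder 4
```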
EDIT:
As #lawal correctly pointed out, this method is a little bit tricky for negative numbers because the sign of the result would be negative.
For example -121 / 26 = -4 17/26, thus, mod is -17 which is +9 in mod 26. Alternatively you can add the modulo base to the computation for negative numbers: -121 / 26 + 26 = 21 9/26 (mod is 9).
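The sign adjustment can be sketched in Python as follows. The helper names are made up for illustration, and math.fmod is used only because, like the calculator's mixed fraction, it keeps the sign of the dividend; a positive modulus is assumed:

```python
import math


def calculator_remainder(a, b):
    """Remainder carrying the sign of a, as read off the mixed fraction."""
    return int(math.fmod(a, b))


def nonnegative_mod(a, b):
    """Shift a negative remainder into the range 0..b-1 (b must be positive)."""
    r = calculator_remainder(a, b)
    return r + b if r < 0 else r


print(calculator_remainder(-121, 26))  # -17
print(nonnegative_mod(-121, 26))       # 9
```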
EDIT2: As #simpatico pointed out, this method will not work for numbers that exceed the calculator's precision. If you want to compute, say, 200^5 mod 391, then some tricks from algebra are needed. For example, using the rule
(A * B) mod C = ((A mod C) * B) mod C we can write:
200^5 mod 391 = (200^3 * 200^2) mod 391 = ((200^3 mod 391) * 200^2) mod 391 = 98
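The same rule, applied one multiplication at a time so that no intermediate value ever gets large, can be sketched in Python like this (power_mod is an illustrative name; Python's built-in three-argument pow does the same job):

```python
def power_mod(base, exponent, modulus):
    """Compute base**exponent mod modulus, reducing after every multiplication."""
    result = 1
    for _ in range(exponent):
        # (A * B) mod C == ((A mod C) * B) mod C, so reducing early is safe
        result = (result * base) % modulus
    return result


print(power_mod(200, 5, 391))  # 98
```

On the calculator this corresponds to reducing 200^2, then multiplying by 200 and reducing again, and so on, so every intermediate product stays below 391 * 200.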
As far as I know, that calculator does not offer mod functions.
You can, however, compute it by hand in a fairly straightforward manner.
Ex.
(1) 50 mod 3
(2) 50/3 = 16.66666667
(3) 16.66666667 - 16 = 0.66666667
(4) 0.66666667 * 3 = 2
Therefore 50 mod 3 = 2
Things to Note:
On line 3, we got the "minus 16" by looking at the result from line (2) and ignoring everything after the decimal. The 3 in line (4) is the same 3 from line (1).
Hope that Helped.
Edit
In some cases the last step may give something like x.99991 because of rounding; round that up to x+1.
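The four steps, including the final rounding, can be sketched in Python. The function name is made up for illustration, and the sketch assumes positive integers small enough that floating-point error stays well below 0.5:

```python
def mod_via_decimal(a, b):
    """Remainder of a / b using the fractional part of the decimal quotient."""
    quotient = a / b                # step (2): 50 / 3 = 16.666...
    whole = int(quotient)           # ignore everything after the decimal point
    fractional = quotient - whole   # step (3): 0.666...
    return round(fractional * b)    # step (4), rounded to the nearest integer


print(mod_via_decimal(50, 3))  # 2
print(mod_via_decimal(54, 7))  # 5
```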
You need the ÷R function: 10 ÷R 3 gives a quotient of 3 and a remainder of 1.
This displays both the remainder and the quotient.
There is a switch a^b/c
If you want to calculate
491 mod 12
then enter 491 press a^b/c then enter 12. Then you will get 40, 11, 12. Here the middle one will be the answer that is 11.
Similarly if you want to calculate 41 mod 12 then find 41 a^b/c 12. You will get 3, 5, 12 and the answer is 5 (the middle one). The mod is always the middle value.
You can calculate A mod B (for positive numbers) using this:
Pol( -Rec( 1/2πr , 2πr × A/B ) , Y ) ( πr - Y ) B
Then press [CALC], and enter your values for A and B, and any value for Y.
/ indicates using the fraction key, and r means radians ( [SHIFT] [Ans] [2] )
Type the division normally first, and then press Shift + S<->D.
Here's how I usually do it. For example, to calculate 1717 mod 2:
Take 1717 / 2. The answer is 858.5
Now take 858 and multiply it by the mod (2) to get 1716
Finally, subtract the number you got in the previous step (1716) from the original number (1717): 1717 - 1716 = 1.
So 1717 mod 2 is 1.
To sum up: multiply the digits before the decimal point by the modulus, then subtract the result from the original number.
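As a minimal Python sketch of this recipe (the function name is made up for illustration, and positive integers are assumed):

```python
def mod_by_subtraction(a, b):
    """Remainder of a / b: drop the decimals, multiply back, subtract."""
    whole = a // b          # the digits before the decimal point (858 for 1717 / 2)
    return a - whole * b    # 1717 - 858 * 2 = 1


print(mod_by_subtraction(1717, 2))  # 1
```

Because it uses only integer arithmetic, this variant avoids the rounding issue that can appear when multiplying the fractional part by the divisor.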
Note: Math error means a mod m = 0
It all falls back to the definition of modulus: It is the remainder, for example, 7 mod 3 = 1.
This is because 7 = 3(2) + 1, in which 1 is the remainder.
To do this process on a simple calculator do the following:
Take the dividend (7) and divide it by the divisor (3). Note the answer and discard all the decimals: for example 7/3 = 2.3333333, so only keep the 2. Now multiply this number by the divisor (3) and subtract the result from the original dividend.
So 2*3 = 6 and 7 - 6 = 1, thus 7 mod 3 = 1.
Calculate x/y (your actual numbers here), and press a b/c key, which is 3rd one below Shift key.
Simply divide the numbers. The calculator gives you both the decimal form and the mixed-fraction form; toggle between them using S<->D.
For example: 11/3 gives you 3.666667 and 3 2/3 (Swap using S<->D).
Here the '2' from 2/3 is your mod value.
Similarly 89/6 gives you 14.833333 and 14 5/6 (Swap using S<->D).
Here the '5' from 5/6 is your mod value.