Behavior of contradicting soft constraints - specman

I have a testcase in which the behavior seems wrong. I see that in all generations the num_of_red_shoes is high, while I would expect a more even distribution. What is the cause of this behavior and how can it be fixed?
<'
struct closet {
kind:[SMALL,BIG];
num_of_shoes:uint;
num_of_red_shoes:uint;
num_of_black_shoes:uint;
num_of_yellow_shoes:uint;
keep soft num_of_red_shoes < 10;
keep soft num_of_black_shoes < 10;
keep soft num_of_yellow_shoes < 10;
keep num_of_yellow_shoes + num_of_black_shoes + num_of_red_shoes == num_of_shoes;
when BIG closet {
keep num_of_shoes in [50..100];
};
};
extend sys {
closets[100]:list of BIG closet;
};
'>
Generation results:
item type kind num_of_sh* num_of_re* num_of_bl* num_of_ye*
---------------------------------------------------------------------------
0. BIG closet BIG 78 73 1 4
1. BIG closet BIG 67 50 8 9
2. BIG closet BIG 73 68 0 5
3. BIG closet BIG 73 66 3 4
4. BIG closet BIG 51 50 0 1
5. BIG closet BIG 78 76 1 1
6. BIG closet BIG 55 43 7 5
7. BIG closet BIG 88 87 1 0
8. BIG closet BIG 99 84 6 9
9. BIG closet BIG 92 92 0 0
10. BIG closet BIG 63 55 3 5
11. BIG closet BIG 59 50 9 0
12. BIG closet BIG 51 44 2 5
13. BIG closet BIG 82 76 1 5
14. BIG closet BIG 81 74 2 5
15. BIG closet BIG 97 93 2 2
16. BIG closet BIG 54 41 8 5
17. BIG closet BIG 55 44 5 6
18. BIG closet BIG 70 55 9 6
19. BIG closet BIG 63 57 1 5

When there are contradicting soft constraints, Specman does not randomize the softs which are enforced, but rather gives priority to the constraints which were written last. Since the soft on red shoes was first in the test, it is the one which is always overridden.
If the softs are known to be mutually exclusive (which is not the case here) you could use a simple flag to randomly choose which soft should hold. e.g. the code would look like this:
flag:uint[0..2];
keep soft read_only(flag==0) => num_of_red_shoes < 10;
keep soft read_only(flag==1) => num_of_black_shoes < 10;
keep soft read_only(flag==2) => num_of_yellow_shoes < 10;
However, since here there is no knowledge in advance how many softs are expected to hold (and it's possible and two or all three will be satisfied) a more complex solution should be made. Here is a code which does this randomization:
struct closet {
kind:[SMALL,BIG];
num_of_shoes:uint;
num_of_red_shoes:uint;
num_of_black_shoes:uint;
num_of_yellow_shoes:uint;
//replaces the original soft constraints (if a flag is true the correlating
//right-side implication will be enforced
soft_flags[3]:list of bool;
keep for each in soft_flags {
soft it == TRUE;
};
//this list is used to shuffle the flags so their enforcement will be done
//with even distribution
soft_indices:list of uint;
keep soft_indices.is_a_permutation({0;1;2});
keep soft_flags[soft_indices[0]] => num_of_red_shoes < 10;
keep soft_flags[soft_indices[1]] => num_of_black_shoes < 10;
keep soft_flags[soft_indices[2]] => num_of_yellow_shoes < 10;
keep num_of_yellow_shoes + num_of_black_shoes + num_of_red_shoes == num_of_shoes;
};

I'm not with Cadence, so I can't give you a definite answer. I think the solver will try to break as few constraints as possible and it just chooses the first one if finds (in your case the one for red shoes). Try changing the order and see if this changes (if the black constraint is first, I'd think you'll always get more black shoes).
As a solution, you could just remove the soft constraints when you have a big closet:
when BIG closet {
keep num_of_red_shoes.reset_soft();
keep num_of_black_shoes.reset_soft();
keep num_of_yellow_shoes.reset_soft();
keep num_of_shoes in [50..100];
};
If you want to randomly choose which one of them to disable (sometimes more than 10 red shoes, sometimes more than 10 black shoes, etc.), then you'll need a helper field:
when BIG closet {
more_shoes : [ RED, BLACK, YELLOW ];
keep more_shoes == RED => num_of_red_shoes.reset_soft();
keep more_shoes == BLACK => num_of_black_shoes.reset_soft();
keep more_shoes == YELLOW => num_of_yellow_shoes.reset_soft();
keep num_of_shoes in [50..100];
};
It depends on what you mean by "a more even distribution".

There is no way to satisfy all of your hard and soft constraints for a BIG closet. Therefore Specman attempts to find a solution by ignoring some of your soft constraints. The IntelliGen constraint solver doesn't ignore all of the soft constraints, but attempts to find a solution while still using a subset. This is explained in the "Specman Generation User Guide" (sn_igenuser.pdf):
[S]oft constraints that are loaded later are considered to have a higher priority than soft constraints loaded previously."
In this case that means that Specman discards the soft constraint on red shoes and since it can find a solution still obeying the other soft constraints it does not discard them.
If you combine all of your soft constraints into one, then you will probably get the result you were hoping for:
keep soft ((num_of_red_shoes < 10) and (num_of_black_shoes < 10) and
(num_of_yellow_shoes < 10));
There are advantages to giving later constraints priority: This means that using AOP you can add new soft constraints and they will get the highest priority.

For more distributed values, I would suggest the following.
I'm sure you can follow the intended logic too.
var1, var2 : uint;
keep var1 in [0..30];
keep var2 in [0..30];
when BIG closet {
keep num_of_shoes in [50..100];
keep num_of_yellow_shoes == (num_of_shoes/3) - 15 + var1;
keep num_of_black_shoes == (num_of_shoes - num_of_yellow_shoes)/2 - 15 + var2;
keep num_of_red_shoes == num_of_shoes - (num_of_yellow_shoes - num_of_black_shoes);
keep gen (var1, var2) before (num_of_shoes);
keep gen (num_of_shoes) before (num_of_yellow_shoes, num_of_black_shoes, num_of_red_shoes);
keep gen (num_of_yellow_shoes) before (num_of_black_shoes, num_of_red_shoes);
keep gen (num_of_black_shoes) before (num_of_red_shoes);
};

Related

Regular expression for MS SQL

I want arrange these numbers first two numbers into following categories i confuse because it repeats it self.
Thank you fo your help.
210690, 391910, 392490, 880390, 847321, 940290, 300420, 300410, 901890, 901890, 030269,080530, 630399
1-5
6-14
5
16-24
25-27
28-38
39-40
41-41
44-46
47-49
50-63
64-67
68-70
71
72-83
84-85
86-89
90-92
93
94-96
97
98-99

Why does a loop transitioning from having its uops fed by the Uop Cache to LSD cause a spike in branch-misses?

All benchmarks are run on either
Icelake
or Whiskey Lake (In Skylake Family).
Summary
I am seeing a strange phenomina where it appears that when a loop
transitions from running out of the Uop Cache to running out of
the LSD (Loop Stream Detector) there is a spike in Branch
Misses that can cause severe performance hits. I tested on both
Icelake and Whiskey Lake by benchmarking a nested loop with the outer loop
having a sufficiently large body s.t the entire thing did not fit in
the LSD itself, but with an inner loop small enough to fit in the
LSD.
Basically once the inner loop reaches some iteration count decoding
seems to switch for idq.dsb_uops (Uop Cache) to lsd.uops (LSD)
and at that point there is a large increase in branch-misses
(without a corresponding jump in branches) causing a severe
performance drop. Note: This only seems to occur for nested
loops. Travis Down's Loop
Test for example will
not show any meaningful variation in branch misses. AFAICT this has
something to do with when a loop transitions from running out of the
Uop Cache to running out of the LSD.
Questions
What is happening when the loop transitions from running out of the
Uop Cache to running out of the LSD that causes this spike in
Branch Misses?
Is there a way to avoid this?
Benchmark
This is the minimum reproducible example I could come up with:
Note: If the .p2align statements are removed both loops will fit in
the LSD and there will not be a transitions.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
}
Compile: gcc -O3 -march=native -mtune=native <filename>.c -o <filename>
Run Icelake: sudo perf stat -C 0 --all-user -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Run Whiskey Lake: sudo perf stat -C 0 -e cycles -e branches -e branch-misses -x, -e idq.ms_uops -e idq.dsb_uops -e lsd.uops taskset -c 0 ./<filename> <N inner loop iterations>
Graphs
Edit: x label is N iterations of inner loop.
Below is a graph of Branch Misses, Branches, and LSD Uops.
Generally you can see that 1) there is no corresponding jump in Branches. 2) that the number of added Branch Misses stabilizes at a constant. And 3) That there is a strong relationship between the Branch Misses and LSD Uops.
Icelake Graph:
Whiskey Lake Graph:
Below is a graph of Branch Misses, Cycles, and LSD Uops
for Icelake only as performance is not affected nearly as much on:
Analysis
Hard numbers below.
For Icelake starting at N = 22 and finishing at N = 27 there
is some fluctuation in the number of uops coming from the LSD vs
Uop Cache and during that time Branch Misses increases by
roughly 3 order of magnitude from 10^4 -> 10^7. During this period
Cycles also increased by a factor of 2. For all N > 27
Branch Misses stays at around 1.67 x 10^7 (roughly outer_loop_N). For N = [17, 40]
branches continues to only increase linearly.
The results for Whiskey Lake look different in that 1) N begins fluctuating at N = 35 and continues to fluctuate until N = 49. And 2) there is less of a
performance impact and more fluctuation in the data. That being said
the increase in Branch-Misses corresponding with transitions from
uops being fed by Uop Cache to being fed by LSD still exists.
Results
Data is mean result for 25 runs.
Icelake Results:
N
cycles
branches
branch-misses
idq.ms_uops
idq.dsb_uops
lsd.uops
1
33893260
67129521
1590
43163
115243
83908732
2
42540891
83908928
1762
49023
142909
100690381
3
50725933
100686143
1782
47656
142506
117440256
4
67533597
117461172
1655
52538
186123
134158311
5
68022910
134238387
1711
53405
204481
150954035
6
85543126
151018722
1924
62445
141397
167633971
7
84847823
167799220
1935
60248
160146
184563523
8
101532158
184570060
1709
60064
361208
201100179
9
101864898
201347253
1773
63827
459873
217780207
10
118024033
218124499
1698
59480
177223
234834304
11
118644416
234908571
2201
62514
422977
251503052
12
134627567
251678909
1679
57262
133462
268435650
13
285607942
268456135
1770
74070
285032524
315423
14
302717754
285233352
1731
74663
302101097
15953
15
321627434
302010569
81796
77831
319192830
1520819
16
337876736
318787786
71638
77056
335904260
1265766
17
353054773
335565563
1798
79839
352434780
15879
18
369800279
352344970
1978
79863
369229396
16790
19
386921048
369119438
1972
84075
385984022
16115
20
404248461
385896655
29454
85348
402790977
510176
21
421100725
402673872
37598
83400
419537730
729397
22
519623794
419451095
4447767
91209
431865775
97827331
23
702206338
436228323
12603617
109064
427880075
327661987
24
710626194
453005538
12316933
106929
432926173
344838509
25
863214037
469782765
14641887
121776
428085132
614871430
26
761037251
486559974
13067814
113011
438093034
418124984
27
832686921
503337195
16381350
113953
421924080
556915419
28
854713119
520114412
16642396
124448
420515666
598907353
29
869873144
536891629
16572581
119280
421188631
629696780
30
889642335
553668847
16717446
120116
420086570
668628871
31
906912275
570446064
16735759
126094
419970933
702822733
32
923023862
587223281
16706519
132498
420332680
735003892
33
940308170
604000498
16744992
124365
419945191
770185745
34
957075000
620777716
16726856
133675
420215897
802779119
35
974557538
637554932
16763071
134871
419764866
838012637
36
991110971
654332149
16772560
130903
419641144
872037131
37
1008489575
671109367
16757219
138788
419900997
904638287
38
1024971256
687886583
16772585
139782
419663863
938988917
39
1041404669
704669411
16776722
137681
419617131
972896126
40
1058594326
721441018
16773492
142959
419662133
1006109192
41
1075179100
738218235
16776636
141185
419601996
1039892900
42
1092093726
754995452
16776233
142793
419611902
1073373451
43
1108706464
771773224
16776359
139500
419610885
1106976114
44
1125413652
788549886
16764637
143717
419889127
1139628280
45
1142023614
805327103
16778640
144397
419558217
1174329696
46
1158833317
822104321
16765518
148045
419889914
1206833484
47
1175665684
838881537
16778437
148347
419562885
1241397845
48
1192454164
855658755
16778865
153651
419552747
1275006511
49
1210199084
872436025
16778287
152468
419599314
1307945613
50
1226321832
889213188
16778464
155552
419572344
1341893668
51
1242886388
905990406
16778745
155401
419559249
1375589883
52
1259559053
922767623
16778809
154847
419554082
1409206082
53
1276875799
939544839
16778460
162521
419576455
1442424993
54
1293113199
956322057
16778931
154913
419550955
1476316161
55
1310449232
973099274
16778534
157364
419578102
1509485876
56
1327022109
989876491
16778794
162881
419562403
1543193559
57
1344097516
1006653708
16778906
157486
419567545
1576414302
58
1362935064
1023430928
16778959
315120
419583132
1609691339
59
1381567560
1040208143
16778564
179997
419661259
1640660745
60
1394829416
1056985359
16778779
167613
419575969
1677034188
61
1411847237
1073762626
16778071
166332
419613028
1710194702
62
1428918439
1090539795
16778409
168073
419610487
1743644637
63
1445223241
1107317011
16778486
172446
419591254
1777573503
64
1461530579
1124094228
16769606
169559
419970612
1810351736
Whiskey Lake Results:
N
cycles
branches
branch-misses
idq.dsb_uops
lsd.uops
1
8332553879
35005847
37925
1799462
6019
2
8329926329
51163346
34338
1114352
5919
3
8357233041
67925775
32270
9241935
5555
4
8379609449
85364250
35667
18215077
5712
5
8394301337
101563554
33177
26392216
2159
6
8409830612
118918934
35007
35318763
5295
7
8435794672
135162597
35592
43033739
4478
8
8445843118
152636271
37802
52154850
5629
9
8459141676
168577876
30766
59245754
1543
10
8475484632
185354280
30825
68059212
4672
11
8493529857
202489273
31703
77386249
5556
12
8509281533
218912407
32133
84390084
4399
13
8528605921
236303681
33056
93995496
2093
14
8553971099
252439989
572416
99700289
2477
15
8558526147
269148605
29912
109772044
6121
16
8576658106
286414453
29839
118504526
5850
17
8591545887
302698593
28993
126409458
4865
18
8611628234
319960954
32568
136298306
5066
19
8627289083
336312187
30094
143759724
6598
20
8644741581
353730396
49458
152217853
9275
21
8685908403
369886284
1175195
161313923
7958903
22
8694494654
387336207
354008
169541244
2553802
23
8702920906
403389097
29315
176524452
12932
24
8711458401
420211718
31924
184984842
11574
25
8729941722
437299615
32472
194553843
12002
26
8743658904
453739403
28809
202074676
13279
27
8763317458
470902005
32298
211321630
15377
28
8788189716
487432842
37105
218972477
27666
29
8796580152
504414945
36756
228334744
79954
30
8821174857
520930989
39550
235849655
140461
31
8818857058
537611096
34142
648080
79191
32
8855038758
555138781
37680
18414880
70489
33
8870680446
571194669
37541
34596108
131455
34
8888946679
588222521
33724
52553756
80009
35
9256640352
604791887
16658672
132185723
41881719
36
9189040776
621918353
12296238
257921026
235389707
37
8962737456
638241888
1086663
109613368
35222987
38
9005853511
655453884
2059624
131945369
73389550
39
9005576553
671845678
1434478
143002441
51959363
40
9284680907
688991063
12776341
349762585
347998221
41
9049931865
705399210
1778532
174597773
72566933
42
9314836359
722226758
12743442
365270833
380415682
43
9072200927
739449289
1344663
205181163
61284843
44
9346737669
755766179
12681859
383580355
409359111
45
9117099955
773167996
1801713
235583664
88985013
46
9108062783
789247474
860680
250992592
43508069
47
9129892784
806871038
984804
268229102
51249366
48
9146468279
822765997
1018387
282312588
58278399
49
9476835578
840085058
13985421
241172394
809315446
50
9495578885
856579327
14155046
241909464
847629148
51
9537115189
873483093
15057500
238735335
932663942
52
9556102594
890026435
15322279
238194482
982429654
53
9589094741
907142375
15899251
234845868
1052080437
54
9609053333
923477989
16049518
233890599
1092323040
55
9628950166
940997348
16172619
235383688
1131146866
56
9650657175
957049360
16445697
231276680
1183699383
57
9666446210
973785857
16330748
233203869
1205098118
58
9687274222
990692518
16523542
230842647
1254624242
59
9706652879
1007946602
16576268
231502185
1288374980
60
9720091630
1024044005
16547047
230966608
1321807705
61
9741079017
1041285110
16635400
230873663
1362929599
62
9761596587
1057847755
16683756
230289842
1399235989
63
9782104875
1075055403
16299138
237386812
1397167324
64
9790122724
1091147494
16650471
229928585
1463076072
Edit:
2 things worth noting:
If I add padding to the inner loop so it won't fit in the uop cache I don't see this behavior until ~150 iterations.
Adding an lfence on with the padding in the outer loop changes N threshold to 31.
edit2: Benchmark which clears branch history The condition was reversed. It should be cmove not cmovne. With the fixed version any iteration count sees elevated Branch Misses at the same rate as above (1.67 * 10^9). This means the LSD is not itself causing Branch Misses, but leaves open the possibility that LSD is in some way defeating the Branch Predictor (what I believe to be the case).
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"testl $3, %k[outer_loop_cnt]\n"
"movl $1000, %k[inner_loop_cnt]\n"
THIS NEEDS TO BE CMOVE
"cmovne %k[inner_N], %k[inner_loop_cnt]\n"
// Extra align surrounding inner loop so that the entire thing
// doesn't execute out of LSD.
".p2align 10\n"
"2:\n"
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N)
:);
}
This may be a coincidence; similar misprediction effects happen on Skylake (with recent microcode that disables the LSD1): an inner-loop count of about 22 or 23 is enough to stop its IT-TAGE predictor from learning the pattern of 21 taken, 1 not-taken for the inner-loop branch, in exactly this simple nested loop situation, last time I tested.
Choosing that iteration threshold for when to lock the loop down into the LSD might make some sense, or be a side-effect of your 1-uop loop and the LSD's "unrolling" behaviour on Haswell and later to get multiple copies of tiny loops into the IDQ before locking it down, to reduce the impact of the loop not being a multiple of the pipeline width.
Footnote 1: I'm surprised your Whiskey Lake seems to have a working LSD; I thought LSD was still disabled in all Skylake derivatives, at least including Coffee Lake, which was launched concurrently with Whiskey Lake.
My test loop was two dec/jne loops nested simply, IIRC, but your code has padding after the inner loop. (Starting with a jmp because that's what a huge .p2align does.) This puts the two loop branches at significantly different addresses. Either of both of these differences may help them avoid aliasing or some other kind of interference, because I'm seeing mostly-correct predictions for many (but not all) counts much greater than 23.
With your test code on my i7-6700k, lsd.uops is always exactly 0 of course. Compared to your Whiskey Lake data, only a few inner-loop counts produce high mispredict rates, e.g. 40, but not 50.
So there might be some effect from the LSD on your WHL CPU, making it bad for some N values where SKL is fine. (Assuming their IT-TAGE predictors are truly identical.)
e.g. with perf stat ... -r 5 ./a.out on Skylake (i7-6700k) with microcode revision 0xe2.
N
count
rate
variance
17
59,602
0.02% of all branches
+- 10.85%
20
192,307
0.05% of all branches
( +- 44.60% )
21
79,853
0.02% of all branches
( +- 14.16% )
30
136,308
0.02% of all branches
( +- 18.57% )
31..32
similar to N=34
( +- 2 or 3% )
33
22,415,089
3.71% of all branches
( +- 0.11% )
34
91,483
0.01% of all branches
( +- 2.36% )
35 (and 36..37 similar)
98,806
0.02% of all branches
( +- 2.75% )
38
33,517,630
4.87% of all branches
( +- 0.05% )
39
102,077
0.01% of all branches
( +- 1.96% )
40
33,458,267
4.64% of all branches
( +- 0.06% )
41
116,241
0.02% of all branches
( +- 6.86% )
42
22,376,562
2.96% of all branches
( +- 0.01% )
43
116,713
0.02% of all branches
( +- 5.25% )
44
174,834
0.02% of all branches
( +- 35.08% )
45
124,555
0.02% of all branches
( +- 5.36% )
46
135,838
0.02% of all branches
( +- 9.95% )
These numbers are repeatable, it's not just system noise; the spikes of high mispredict counts are very real at those specific N values. Probably some effect of the size / geometry of the IT-TAGE predictor's tables.
Other counters like idq.ms_uops and idq.dsb_uops scale mostly as expected, although idq.ms_uops is somewhat higher in the ones with more misses. (That counts uops added to the IDQ while the MS-ROM is active, perhaps counting front-end work that happens while branch recovery is cleaning up the back-end? It's a very different counter from legacy-decode mite_uops.)
With higher mispredict rates, idq.dsb_uops is quite a lot higher, I guess because the IDQ gets discarded and refilled on mispredicts. Like 1,011,000,288 for N=42, vs. 789,170,126 for N=43.
Note the high variability for N=20, around that threshold near 23, but still a tiny overall miss rate, much lower than every time the inner loop exits.
This is surprising, and different from a loop without as much padding.
Reason
The cause of the spike in Branch Misses is caused by the inner loop running out of the LSD.
The reason the LSD causes an extra branch miss for low iteration counts is that the "stop" condition on the LSD is a branch miss.
From Intel Optimization Manual Page 86.
The loop is sent to allocation 5 µops per cycle. After 45 out of the 46 µops are sent, in the next cycle only
a single µop is sent, which means that in that cycle, 4 of the allocation slots are wasted. This pattern
repeats itself, until the loop is exited by a misprediction. Hardware loop unrolling minimizes the number
of wasted slots during LSD.
Essentially what is happening is that when low enough iteration counts run out of the Uop Cache they are perfectly predictable. But when they run out of the LSD since the built in stop condition for the LSD is a branch mispredict, we see an extra branch miss for every iteration of the outer loop. I guess the takeaway is don't let nested loops execute out of the LSD. Note that LSD only kicks in after ~[20, 25] iterations so an inner loop with < 20 iterations will run optimally.
Benchmark
All benchmarks are run on either Icelake
The new benchmark is essentially the same as the one in the origional post but at the advise of #PeterCordes I added a fixed byte size but varying number of nops in the inner loop. The idea is fixed length so that there is no change in how the branches may alias in the BHT (Branch History Table) but varying the number of nops to sometimes defeat the LSD.
I used 124 bytes of nop padding so that the nop padding + size of decl; jcc would be 128 bytes total.
The benchmark code is as follows:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifndef INNER_NOPS
#error "INNER_NOPS must be defined"
#endif
#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))
static const uint64_t outer_N = (1UL << 24);
static const uint64_t bht_shift = 4;
static const uint64_t bht_mask = (1023 << bht_shift);
#define NOP1 ".byte 0x90\n"
#define NOP2 ".byte 0x66,0x90\n"
#define NOP3 ".byte 0x0f,0x1f,0x00\n"
#define NOP4 ".byte 0x0f,0x1f,0x40,0x00\n"
#define NOP5 ".byte 0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP6 ".byte 0x66,0x0f,0x1f,0x44,0x00,0x00\n"
#define NOP7 ".byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00\n"
#define NOP8 ".byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP9 ".byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP10 ".byte 0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
#define NOP11 ".byte 0x66,0x66,0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00\n"
static void BENCH_ATTR
bench(uint64_t inner_N) {
uint64_t inner_loop_cnt, outer_loop_cnt, tmp;
asm volatile(
".p2align 12\n"
"movl %k[outer_N], %k[outer_loop_cnt]\n"
".p2align 6\n"
"1:\n"
"movl %k[inner_N], %k[inner_loop_cnt]\n"
".p2align 10\n"
"2:\n"
// This is defined in "inner_nops.h" with the necessary padding.
INNER_NOPS
"decl %k[inner_loop_cnt]\n"
"jnz 2b\n"
".p2align 10\n"
"decl %k[outer_loop_cnt]\n"
"jnz 1b\n"
: [ inner_loop_cnt ] "=&r"(inner_loop_cnt),
[ outer_loop_cnt ] "=&r"(outer_loop_cnt), [ tmp ] "=&r"(tmp)
: [ inner_N ] "ri"(inner_N), [ outer_N ] "i"(outer_N),
[ bht_mask ] "i"(bht_mask), [ bht_shift ] "i"(bht_shift)
:);
}
// gcc -O3 -march=native -mtune=native lsd-branchmiss.c -o lsd-branchmiss
int
main(int argc, char ** argv) {
assert(argc > 1);
uint64_t inner_N = atoi(argv[1]);
bench(inner_N);
return 0;
}
Tests
I tested nop count = [0, 39].
Note that nop count = 1 would not be only 1 nop in the inner loop but actually the following:
#define INNER_NOPS NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP10 NOP3 NOP1
To reach the full 128 byte padding.
Results
At nop count <= 32 the inner loop is able to run out of the LSD and we consistently see elevant Branch Misses when Iterations is large enough that it does so. Note that the elevated Branch Misses number corresponds 1-1 with the number of outer loop iterations. For these numbers outer loop iterations = 2^24
At nop count > 32 the loop has to many uops for the LSD and runs out of the Uop Cache. Here we do not see a consistent elevated Branch Misses until Iterations becomes to large for its BHT entry to store its entire history.
nop count > 32 (No LSD)
Once there are too many nops for the LSD the number of Branch Misses stays relatively low with a few consistent spikes until Iterations = 146 where Branch Misses spike to number of outer loop iterations (2 ^ 24 in this case) and remain constants. My guess is that is the upper bound on history the BHT is able to store.
Below is a graph of Branch Misses (Y) versus Iterations (X) for nop count = [33, 39]:
All of the lines follow the same patterns and have the same spikes. The large spikes to outer loop iterations before 146 are at Iterations = [42, 70, 79, 86, 88]. This is consistently reproducible. I am not sure what is special about these values.
The key point, however, is that for the most cast for Iterations = [20, 145] Branch Misses is relatively low indicating that the entire inner loop is being predicted correctly.
nop count <= 32 (LSD)
This data is a bit more noising bit all of the different nop count follow roughly the same trend of spiking initialing to within a factor of 2 of outer loop iterations Branch Misses at Iterations = [21, 25] (note this is 2-3 orders of magnitude) at the same time the lsd.oups spiked by 4-5 orders of magnitude.
There is also a trend between nop count and what iteration value Branch Misses stablize at outer loop iterations with a Pearson Correlation of 0.81. for nop count = [0, 32] the stablization point is in range iterations = [15, 34].
Below is a graph of Branch Misses (Y) versus Iterations (X) for nops = [0, 32]:
Generally, with some noise, all of the different nop count follow the same trend. As well they follow the same trend when compared with lsd.uops.
Below is a table with nop and the Pearson Correlation between Branch Misses and lsd.uop and idq.dsb_uops respectively.
nop
lsd
uop cache
0
0.961
-0.041
1
0.955
-0.081
2
0.919
-0.122
3
0.918
-0.299
4
0.947
-0.117
5
0.934
-0.298
6
0.894
-0.329
7
0.907
-0.308
8
0.91
-0.322
9
0.915
-0.316
10
0.877
-0.342
11
0.908
-0.28
12
0.874
-0.281
13
0.875
-0.523
14
0.87
-0.513
15
0.889
-0.522
16
0.858
-0.569
17
0.89
-0.507
18
0.858
-0.537
19
0.844
-0.565
20
0.816
-0.459
21
0.862
-0.537
22
0.848
-0.556
23
0.852
-0.552
24
0.85
-0.561
25
0.828
-0.573
26
0.857
-0.559
27
0.802
-0.372
28
0.762
-0.425
29
0.721
-0.112
30
0.736
-0.047
31
0.768
-0.174
32
0.847
-0.129
Which should generally indicate that there is a strong correlation between LSD and Branch Misses and no meaningful relationship between the Uop Cache and branch misses.
Overall
Generally I think it is clear that when the inner loop executing out of the LSD is what is causing the Branch Misses until Iterations becomes too large for the BHT entry's history. For the N = [33, 39] save the explained spikes we don't see elevated Branch Misses but we do for the N = [0, 32] case and the only difference I can tell is the LSD.

I dont understand this little perl code (if ...)

Can someone explain me this short pearl code?
$batstr2 = "empty" if( $status2 & 4 );
What say the if statement ?
Already answered many times, for the case if you don't know what is the Bitwise And, here is a small example:
perl -e 'print "dec\t bin\t&4\n";printf "%d\t%8b\t%-8b\n", $_, $_, ($_ & 4) for (0..8);'
prints:
dec bin &4
0 0 0
1 1 0
2 10 0
3 11 0
4 100 100
5 101 100
6 110 100
7 111 100
8 1000 0
as you can see, when the 3rb bit from right is 1 - the $num & 4 is true.
That's using the if as a statement modifier. It's roughly the same as
if ($status & 4) {
$batstr2 = "empty";
}
and exactly the same as
($status & 4) and ($batstr2 = "empty");
a variety of constructs can be used as statement modifiers, including: if, unless, while, until, for, when. These modifiers can't be stacked (foo() if $bar for #baz won't work), you are limited for one modifer per simple statement.
That's a bitwise and - http://perldoc.perl.org/perlop.html#Bitwise-And . $status2 is being used as a bit mask and it sets $batstr2 to 'empty' if the bit is set.
It sets $batstr2 to "empty" if the 3rd least significant bit of $status2 is set - it is a logical AND mask.

How to calculate a Mod b in Casio fx-991ES calculator

Does anyone know how to calculate a Mod b in Casio fx-991ES Calculator. Thanks
This calculator does not have any modulo function. However there is quite simple way how to compute modulo using display mode ab/c (instead of traditional d/c).
How to switch display mode to ab/c:
Go to settings (Shift + Mode).
Press arrow down (to view more settings).
Select ab/c (number 1).
Now do your calculation (in comp mode), like 50 / 3 and you will see 16 2/3, thus, mod is 2. Or try 54 / 7 which is 7 5/7 (mod is 5).
If you don't see any fraction then the mod is 0 like 50 / 5 = 10 (mod is 0).
The remainder fraction is shown in reduced form, so 60 / 8 will result in 7 1/2. Remainder is 1/2 which is 4/8 so mod is 4.
EDIT:
As #lawal correctly pointed out, this method is a little bit tricky for negative numbers because the sign of the result would be negative.
For example -121 / 26 = -4 17/26, thus, mod is -17 which is +9 in mod 26. Alternatively you can add the modulo base to the computation for negative numbers: -121 / 26 + 26 = 21 9/26 (mod is 9).
EDIT2: As #simpatico pointed out, this method will not work for numbers that are out of calculator's precision. If you want to compute say 200^5 mod 391 then some tricks from algebra are needed. For example, using rule
(A * B) mod C = ((A mod C) * B) mod C we can write:
200^5 mod 391 = (200^3 * 200^2) mod 391 = ((200^3 mod 391) * 200^2) mod 391 = 98
As far as I know, that calculator does not offer mod functions.
You can however computer it by hand in a fairly straightforward manner.
Ex.
(1)50 mod 3
(2)50/3 = 16.66666667
(3)16.66666667 - 16 = 0.66666667
(4)0.66666667 * 3 = 2
Therefore 50 mod 3 = 2
Things to Note:
On line 3, we got the "minus 16" by looking at the result from line (2) and ignoring everything after the decimal. The 3 in line (4) is the same 3 from line (1).
Hope that Helped.
Edit
As a result of some trials you may get x.99991 which you will then round up to the number x+1.
You need 10 ÷R 3 = 1
This will display both the reminder and the quoitent
÷R
There is a switch a^b/c
If you want to calculate
491 mod 12
then enter 491 press a^b/c then enter 12. Then you will get 40, 11, 12. Here the middle one will be the answer that is 11.
Similarly if you want to calculate 41 mod 12 then find 41 a^b/c 12. You will get 3, 5, 12 and the answer is 5 (the middle one). The mod is always the middle value.
You can calculate A mod B (for positive numbers) using this:
Pol( -Rec( 1/2πr , 2πr × A/B ) , Y ) ( πr - Y ) B
Then press [CALC], and enter your values for A and B, and any value for Y.
/ indicates using the fraction key, and r means radians ( [SHIFT] [Ans] [2] )
type normal division first and then type shift + S->d
Here's how I usually do it. For example, to calculate 1717 mod 2:
Take 1717 / 2. The answer is 858.5
Now take 858 and multiply it by the mod (2) to get 1716
Finally, subtract the original number (1717) minus the number you got from the previous step (1716) -- 1717-1716=1.
So 1717 mod 2 is 1.
To sum this up all you have to do is multiply the numbers before the decimal point with the mod then subtract it from the original number.
Note: Math error means a mod m = 0
It all falls back to the definition of modulus: It is the remainder, for example, 7 mod 3 = 1.
This because 7 = 3(2) + 1, in which 1 is the remainder.
To do this process on a simple calculator do the following:
Take the dividend (7) and divide by the divisor (3), note the answer and discard all the decimals -> example 7/3 = 2.3333333, only worry about the 2. Now multiply this number by the divisor (3) and subtract the resulting number from the original dividend.
so 2*3 = 6, and 7 - 6 = 1, thus 1 is 7mod3
Calculate x/y (your actual numbers here), and press a b/c key, which is 3rd one below Shift key.
Simply just divide the numbers, it gives yuh the decimal format and even the numerical format. using S<->D
For example: 11/3 gives you 3.666667 and 3 2/3 (Swap using S<->D).
Here the '2' from 2/3 is your mod value.
Similarly 18/6 gives you 14.833333 and 14 5/6 (Swap using S<->D).
Here the '5' from 5/6 is your mod value.

How can I searching for different variants of bioinformatics motifs in string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful to potential helpers on SO have to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates a string of 2 character pairs 5,428 pairs long in an array of 1,000 elements long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my #bps;
push(#bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar #bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my #a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
my $cand=substr($bps[$i], $offset, $nlen);
my #a1=split(//, $cand);
$diffs=0;
foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
next if $diffs > 3;
push (#{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
my #a1=split(//, $hit);
$diffs=0;
my $ds="";
foreach my $j (0..$#a1) {
if($a1[$j] eq $a2[$j]) {
$ds.=" ";
} else {
$diffs++;
$ds.=$a1[$j];
}
}
print "Target: $target\n",
"Candidate: $hit\n",
"Differences: $ds $diffs differences\n",
"Array element: ";
foreach (#{$HoA{$hit}}) {
print "$_ " ;
}
print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(#repeat_info = $tandy->FindRepeat())
{print(join("\t",#repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note, that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you set it upon finding.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).