Branch prediction and performance - cpu-architecture

I'm reading a book about computer architecture and I'm on this chapter talking about branch prediction.
There is this little exercise that I'm having a hard time wrapping my head around.
Consider the following nested for loops:
for (j = 0; j < 2; j++)
{
for (i = 10; i > 0; i = i-1)
x[i] = x[i] + s
}
Loop:   ; inner loop
L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, -8
BNE R1, R3, Loop
Assume register F2 holds the scalar s, R1 holds the address of x[10], and R3 is pre-computed so that the loop ends when i == 0.
a) How would a predictor that alternates between taken/not taken perform?
---- Since the loop is only executed 2 times, I think that alternating predictions would hurt performance in this case (?), with 1 misprediction.
b) Would a 1-bit branch prediction buffer improve performance (compare to a)? Assume the first prediction is "not taken", and no other branches map to this entry.
---- Assuming the first prediction is "not taken", and that a 1-bit predictor inverts its bit whenever the prediction is wrong, the predictions will be NT/T/T. Does that give it the same performance as in a), with 1 misprediction?
c) Would a 2-bit branch prediction buffer improve performance (compare to a)? Assume the first prediction is "not taken", and no other branches map to this entry.
---- A 2-bit predictor starting with "not taken". As I remember, a 2-bit predictor changes its prediction only after it misses twice, so the predictions will go NT/NT/T/T. Therefore its performance will be worse compared to a), with 1 misprediction.
That was my attempt to answer the problems. Can anyone explain to me if my answer is right/wrong in more detail please? Thanks.

Since the loop is only executed 2 times
You mean the outer-loop conditional, the one you didn't show asm for? I'm only answering part of the question for now, in case this confusion was your main issue. Leave a comment if this wasn't what had you confused.
The conditional branch at the bottom of the inner loop executes 20 times, with this pattern: 9xT, 1xNT, 9xT, 1xNT. An alternating predictor there would be wrong either 40% or 60% of the time, depending on whether it started in phase with the taken runs or not.
It's the outer-loop branch that only executes twice: T, NT. (The whole inner loop runs twice.)
That branch would be predicted either perfectly or terribly, depending on whether the alternating prediction started with T or with NT.
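As a sanity check, here is a small Python simulation of the three predictors on the inner-loop branch stream (two passes of 9 taken followed by 1 not taken). The predictor models are my reading of the exercise, not the book's code:

```python
# Inner-loop branch stream: two passes of 9 taken (1) then 1 not-taken (0).
outcomes = ([1] * 9 + [0]) * 2

def alternating(outcomes, first=1):
    """Predict taken/not-taken alternately, starting with `first`."""
    hits, pred = 0, first
    for o in outcomes:
        hits += (pred == o)
        pred ^= 1  # flip every time, regardless of the outcome
    return hits

def one_bit(outcomes, pred=0):
    """1-bit buffer: always predict the last outcome seen."""
    hits = 0
    for o in outcomes:
        hits += (pred == o)
        pred = o  # the bit is inverted on every misprediction
    return hits

def two_bit(outcomes, state=0):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 taken."""
    hits = 0
    for o in outcomes:
        hits += ((state >= 2) == o)
        state = min(state + 1, 3) if o else max(state - 1, 0)
    return hits

print(alternating(outcomes, first=1), alternating(outcomes, first=0))  # 12 8
print(one_bit(outcomes, pred=0))   # 16
print(two_bit(outcomes, state=0))  # 16
```

On this stream the alternating predictor is right 12/20 times (or 8/20 if it starts with not-taken), while the 1-bit and 2-bit buffers both reach 16/20, so the buffers clearly help on the inner-loop branch.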


Huffman tree - highest possible frequency that gives perfect tree

Suppose you have an alphabet of 4 characters: A, B, C, D. What is the highest possible frequency of the most frequent character, given that the Huffman tree is perfect?
We have a theory that it is 2/5 of the total, but we would like to see a more concrete proof or explanation.
Without loss of generality, we will assume that p(A) <= p(B) <= p(C) <= p(D). The Huffman algorithm will combine A and B into a branch. (Again, without loss of generality if some of the probabilities are equal.) In order for the resulting tree to be flat, we must then combine C and D into a branch. Then the final step will be to combine those two branches.
To ensure that we combine C and D into a branch, p(C) and p(D) must both be less than p(A) + p(B), so p(D) < p(A) + p(B). Note that if p(C) = p(D) = p(A) + p(B), then the Huffman algorithm has the option to pick any pair in the next step, and two of those choices result in a skewed tree. So p(D) must be strictly less than p(A) + p(B).
The rest is left as an exercise for the reader.
(Your guess is close. It must be strictly less than 2/5, so 2/5 − ϵ, where ϵ is the smallest amount that lets the probability, computed from a presumably integer frequency, fall below 2/5. An example set of probabilities that reaches the maximum is {1/5, 1/5, 1/5 + ϵ, 2/5 − ϵ}.)
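To check the bound numerically, here is a small Python sketch (the helper `huffman_depths` is my own) that computes Huffman code lengths from integer frequencies. With {2, 2, 2, 3} the largest probability is 3/9 = 1/3 < 2/5 and the tree comes out perfect; with {1, 1, 2, 4} the largest is 4/8 = 1/2 > 2/5 and the tree is skewed:

```python
import heapq
import itertools

def huffman_depths(freqs):
    """Return {symbol_index: code_length} for a Huffman code over `freqs`."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(f, next(counter), {i: 0}) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)  # two lowest-frequency subtrees
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Max probability 3/9 = 1/3 < 2/5: perfect tree, all codes of length 2.
print(sorted(huffman_depths([2, 2, 2, 3]).values()))  # [2, 2, 2, 2]
# Max probability 4/8 = 1/2 > 2/5: skewed tree.
print(sorted(huffman_depths([1, 1, 2, 4]).values()))  # [1, 2, 3, 3]
```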

usage of $past macro in system verilog for a signal to high

I am a beginner in SystemVerilog.
I want to check, on the falling edge of a signal, whether it was high for the past n cycles. Using ##n cycles doesn't work for me.
logic x,y;
x & y -> ##2 $past(y) -> $fell(y);
This doesn't seem to work. With the x & y condition, what I am checking is that, at the falling edge of y, the signal y was high for the past 2 cycles after the condition x & y is met.
Hi and welcome to SVA.
In my answer I shall assume that you have defined a clock and are using it in your definition of "falling edge".
There are a few issues with your code and problem description. I only enumerate these to help with issues in the future:
- $past is not a macro but a system function.
- You are not using the correct SVA implication operator: "->" is a blocking event trigger. The overlapped implication operator I guess you are after is |->.
- ##2 $past(y) will actually insert a delay of 2 cycles and then check that the past value of y was high. Really, you are checking that y is high one cycle after your initial trigger.
I am also not quite sure what your trigger condition is meant to be: x && y will spawn a property thread whenever both x and y are high; in particular, it won't trigger on a negedge of y.
In the following code I attempt to code up the SVA to your spec as I understood it. You can use a simple function to check that y was high for the preceding n cycles. Feel free to replace $fell(y) with any trigger as required.
function bit y_high_preceding_n_cycles(int n);
for (int i = 1; i < n; i++) begin
// if y wasn't high i cycles ago, just return 0
if (!$past(y, i, , @(posedge clk))) return 0;
end
return 1;
endfunction
prop_label: assert property(@(posedge clk) $fell(y) |-> y_high_preceding_n_cycles(n));
This will check that, on detection of $fell(y), y was high for the preceding n cycles. Note that the i == 1 iteration of the for loop is redundant by definition: the trigger $fell(y) already implies $past(y) == 1, assuming no X values.
Hope this helps.

Is there any case where the Bimodal will be better than Not Taken?

Considering these two methods:
Dynamic Bimodal:
Where we have 4 states, 2 for each prediction (taken or not taken); the state moves every time the algorithm predicts wrong, and the prediction flips from taken to not taken (or back) after 2 consecutive wrong predictions.
Static Not Taken:
Here the algorithm will always predict taken OR not taken, swapping between the two states after every single wrong prediction.
I tested both algorithms with the following C code:
for(i=0; i<4; i++) {
}
and by analyzing the if conditional in:
for(i=0; i<4; i++) {
if( i%2 ) {
}
else {
}
}
In both cases they come out even (they predict right/wrong the same number of times).
Is there any possible simple algorithm where the Bimodal will be better than Not Taken?
The Static Not Taken (SNT) predictor is almost always (much) worse than any other predictor. The main reason is that it's terrible at predicting the control flow of loops, because it predicts not taken at every iteration.
Let's assume that the first C loop will be compiled to something like this:
loop body
compute loop condition
branch to the loop body if condition met
So there is only one branch, at the end. The SNT predictor will predict not taken 4 times, but the branch is taken 3 times, so the accuracy is 25%. On the other hand, a bimodal predictor with an initial state of 10 or 11[1] will achieve an accuracy of 75%. The initial states 01 and 00 achieve accuracies of 50% and 25%, respectively; 10 and 11 are considered good initial states.
Let's assume that the second C loop will be compiled to something like this:
compute the if condition
branch to the else body if condition met
the if body
non-conditional branch to the end of the loop
the else body
compute loop condition
branch to the loop body if condition met
So there are two conditional branches. The SNT predictor will predict not taken 8 times, but 5 of those are mispredictions (there are 5 takens and 3 not-takens[2]), so the accuracy is 37.5%. For the bimodal predictor, let's assume both branches map to the same counter. A bimodal predictor with an initial state of 10 or 11 will achieve an accuracy of 62.5%; the initial states 00 and 01 achieve accuracies of 25% and 50%, respectively. If each branch uses its own counter with the same initial state, the calculations are similar.
[1] Where 00 and 01 represent not taken and 10 and 11 represent taken.
[2] T, T, NT, T, T, T, NT, NT.
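To make these accuracy numbers concrete, here is a small Python simulation (helper names are my own; the bimodal model assumes a standard 2-bit saturating counter, with 00/01 predicting not taken and 10/11 predicting taken, and both branches sharing one counter):

```python
def static_not_taken(outcomes):
    """SNT: always predict not taken (0); count correct predictions."""
    return sum(o == 0 for o in outcomes)

def bimodal(outcomes, state):
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 taken."""
    hits = 0
    for o in outcomes:
        hits += ((state >= 2) == o)
        state = min(state + 1, 3) if o else max(state - 1, 0)
    return hits

# First loop: the bottom branch alone, taken 3 of 4 times.
loop1 = [1, 1, 1, 0]
print(static_not_taken(loop1) / 4, bimodal(loop1, state=0b10) / 4)  # 0.25 0.75

# Second loop: both branches interleaved, per footnote [2] (1 = taken).
loop2 = [1, 1, 0, 1, 1, 1, 0, 0]
print(static_not_taken(loop2) / 8, bimodal(loop2, state=0b10) / 8)  # 0.375 0.625
```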

Pseudo randomization in MATLAB with minimum intervals between stimulus categories

For an experiment I need to pseudo-randomize a vector of 100 trials of stimulus categories: 80% category A, 10% B, and 10% C. The B trials must have at least two non-B trials between them, and the C trials must come after two A trials and be followed by two A trials.
At first I tried building a script that randomized a vector and "popped" out the trials that were not where they should be, putting them into a spot in the vector with a long series of A trials. I'm worried, though, that this is overcomplicated, that it will create an endless series of unforeseen errors to debug, and that it won't be random enough.
After that I tried building a script that simply shuffles the vector until it meets the criteria, which seems to require less code. Now that I have spent several hours on it, though, I am wondering whether the criteria are too strict for this to make sense, i.e. whether the vector would have to shuffle forever before it actually met them.
What do you think is the simplest way to handle this problem? Additionally, which would be the best shuffle function to use, since Shuffle in psychtoolbox seems to not be working correctly?
The scope of this question extends well beyond language-specific constructs, and involves a good understanding of probability and permutations/combinations.
An approach to solving this question is:
Create blocks of vectors such that each block can independently be placed anywhere.
Randomly allocate these blocks to get a final random vector satisfying all constraints.
Part 0: Category A
Since category A has no constraints imposed on it, we will go to the next category.
Part 1: Make category C independent
The only constraint on category C is that it must have two A's before it and two A's after it. Hence, we first create blocks of 5 vectors with the pattern A A C A A.
At this point, we have an array of A vectors (excluding blocks), blocks of A A C A A vectors, and B vectors.
Part 2: Resolving placement of B
The constraint on B is that two consecutive Bs must have at least 2 non-B vectors between them.
Visualize as follows: Let's pool A and A A C A A in one array, X. Let's place all Bs in a row (suppose there are 3 Bs):
s0 B s1 B s2 B s3
Where each s is the number of vectors in that gap. Hence, we require that s1 and s2 each be at least 2, and that s0 + s1 + s2 + s3 equal the number of vectors in X.
The task is then to choose random vectors from X and assign them to each s. At the end, we finally have a random vector with all categories shuffled, satisfying the constraints.
P.S. This can be mapped to the classic problem of finding a set of random numbers that add up to a certain sum, with constraints.
It is easier to reduce the constrained sum problem to one with no constraints. This can be done as:
s0 B s1 t1 B s2 t2 B s3
Where t1 and t2 are chosen from X just large enough to satisfy the constraints on B, and s0 + s1 + s2 + s3 equals the number of vectors in X not used in the t's.
Implementation
Implementing the same in MATLAB could benefit from using cell arrays, and this algorithm for the random numbers of constant sum.
You would also need to maintain separate pools for each category, and keep building blocks and piece them together.
Really, this is not trivial, but also not impossible. This is the approach you could try if you want to step away from the brute-force search you tried before.
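For what it's worth, here is a Python sketch of the block approach above (a MATLAB version would follow the same structure with cell arrays; `pseudo_randomize` and its defaults are my own names). It wraps every C in an A A C A A block, reserves two pool items per inner B-gap (each item contributes at least one non-B trial, so two items guarantee the B constraint), and deals the remaining items into random gaps:

```python
import random

def pseudo_randomize(seed=None, n_trials=100, n_b=10, n_c=10):
    """Block-based pseudo-randomization of A/B/C stimulus trials."""
    rng = random.Random(seed)
    n_a = n_trials - n_b - n_c
    # Part 1: wrap every C in two A's on each side so it can go anywhere.
    pool = [['A', 'A', 'C', 'A', 'A']] * n_c + [['A']] * (n_a - 4 * n_c)
    rng.shuffle(pool)
    # Part 2: n_b B's create n_b + 1 gaps; each of the n_b - 1 inner gaps
    # needs at least two non-B trials, so reserve 2 pool items for it.
    gaps = [[] for _ in range(n_b + 1)]
    items = iter(pool)
    for g in range(1, n_b):
        gaps[g] += [next(items), next(items)]
    for item in items:                       # deal the rest at random
        gaps[rng.randrange(n_b + 1)].append(item)
    seq = []
    for g, gap in enumerate(gaps):
        rng.shuffle(gap)
        for item in gap:
            seq += item
        if g < n_b:
            seq.append('B')
    return seq

seq = pseudo_randomize(seed=1)
print(len(seq), seq.count('A'), seq.count('B'), seq.count('C'))  # 100 80 10 10
```

By construction, for any seed the result has 80/10/10 counts, every pair of consecutive B's is separated by at least two non-B trials, and every C is flanked by two A's on each side.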

Revolving Doors Riddle - Matlab Time-Efficient Sparse Matrix Use

I'm running a code with many iterations using large sparse matrices. There are three lines in my code that take about 75% of the running time and I think I can use the special structure of my sparse matrix to reduce that time, but so far I haven't managed to do it. I would love your help!!
Ok, here's the gist of my code:
I = 70;
J = 1000;
A = rand(I);
A = A./repmat(sum(A, 2), 1, I);
S = kron(A, speye(J));
indj = randi(J,I,1);
tic
for i = 1:I
S(:, (i-1)*J+indj(i)) = sum(S(:, (i-1)*J + (1:indj(i))), 2);
end
toc
You can skip the following 2 paragraphs
Here's a story to make the example a bit more lively. An old man is visiting sick people at different hospitals. There are 1000 (J) hospitals, and each hospital has 70 (I) rooms in it. The matrix A is the transition matrix that specifies the probability of the old man moving from one room to another room within the same hospital: A(i1,i2) is the probability that the old man moves from room i1 to room i2 (so rows sum to 1). The big S matrix is the transition probability matrix, where moving from room i1 at hospital j1 to room i2 at hospital j2 is given by the (J*(i1-1)+j1, J*(i2-1)+j2) element. There is no way the old man moves from one hospital to another, so the matrix is sparse.
Something magical happens, and now the doors to room number i in the first indj(i) hospitals all lead to the same hospital, hospital indj(i). So the old man can now magically move between hospitals, and we need to change the S matrix accordingly. This amounts to two things: increasing the probability of moving to room i at hospital indj(i), for all i, and setting to zero the probability of getting into room i at all hospitals lower than indj(i), for all i. The latter I can do very efficiently, but the first part is taking me too long.
Why I think there's a chance to reduce running time
Loop. The part between the tic and toc can be written without a loop. I have done it, but it ran much slower, perhaps because the index vector from sub2ind is very large.
Matrix structure. Notice that we don't need the entire sum; only one element needs to be added to each entry. These loops achieve the same outcome (but here, obviously, much slower):
for i = 1:I
for ii = 1:I
for j = 1:indj(i)-1
S((ii-1)*J+j, (i-1)*J+indj(i)) = S((ii-1)*J+j, (i-1)*J+indj(i)) + S((ii-1)*J+j, (i-1)*J+j);
end
end
end
This makes me somewhat hopeful that there is a way to make the calculation faster…
Your help is HIGHLY appreciated!
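One concrete direction, sketched in Python/SciPy rather than MATLAB (small I and J, names my own): the whole tic/toc loop is equivalent to a single sparse product S*T, where T is the identity except that each target column sums the first indj(i) columns of its block. Building T once and multiplying may beat column-by-column updates, since sparse column assignment is expensive; in MATLAB, T can be built with a single sparse(rows, cols, 1, n, n) call.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
I, J = 5, 8                      # small stand-ins for I = 70, J = 1000
A = rng.random((I, I))
A /= A.sum(axis=1, keepdims=True)
S = sparse.kron(A, sparse.identity(J), format='csc')
indj = rng.integers(1, J + 1, size=I)

# Reference: the original loop, replacing one column per block.
S_loop = S.tolil()
for i in range(I):
    lo = i * J
    col_sum = np.asarray(S[:, lo:lo + indj[i]].sum(axis=1))
    S_loop[:, lo + indj[i] - 1] = col_sum

# One-shot version: S_new = S @ T, where T is the identity except that
# each target column c = i*J + indj[i] - 1 sums columns lo .. c of block i.
n = I * J
targets = {i * J + indj[i] - 1: i * J for i in range(I)}
rows, cols = [], []
for c in range(n):
    if c in targets:
        lo = targets[c]
        k = c - lo + 1           # number of columns being summed
        rows += range(lo, lo + k)
        cols += [c] * k
    else:                        # untouched columns keep their identity entry
        rows.append(c)
        cols.append(c)
T = sparse.csc_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
S_one_shot = S @ T

print(np.allclose(S_loop.toarray(), S_one_shot.toarray()))  # True
```

The equivalence holds because each block's target column is read before it is written, and blocks never read each other's columns.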