Query about a certain programming trick used in open-source software (compiler optimization)

In a certain library (FFTW, which computes discrete Fourier transforms),
I came across a header file that contains the following comment, followed by some #defines. The comment talks about a programming trick,
but I'm not able to understand what exactly this programming trick is.
Could someone please explain?
/* hackery to prevent the compiler from ``optimizing'' induction
variables in codelet loops. The problem is that for each K and for
each expression of the form P[I + STRIDE * K] in a loop, most
compilers will try to lift an induction variable PK := &P[I + STRIDE * K].
For large values of K this behavior overflows the
register set, which is likely worse than doing the index computation
in the first place.
If we guess that there are more than
ESTIMATED_AVAILABLE_INDEX_REGISTERS such pointers, we deliberately confuse
the compiler by setting STRIDE ^= ZERO, where ZERO is a value guaranteed to
be 0, but the compiler does not know this.
16 registers ought to be enough for anybody, or so the amd64 and ARM ISA's
seem to imply.
*/
#define ESTIMATED_AVAILABLE_INDEX_REGISTERS 16
#define MAKE_VOLATILE_STRIDE(nptr, x) \
(nptr <= ESTIMATED_AVAILABLE_INDEX_REGISTERS ? \
0 : \
((x) = (x) ^ X(an_INT_guaranteed_to_be_zero)))
#endif /* PRECOMPUTE_ARRAY_INDICES */

The optimization: Instead of recomputing the array index on every loop iteration, some compilers precompute the upcoming addresses and keep them in registers, because the indexing expression is predictable.
The problem: Some indexing expressions (like I + STRIDE * K) can consume a lot of registers this way, and if their number exceeds the number of available registers, some register values get spilled to stack memory, including other variables that the loop might be using.
The trick: To stop the compiler from applying this optimization, an external integer is brought in. Adding or XORing this zero into x and storing it back is a runtime no-op that "taints" the stride, and consequently the index expression, making it unpredictable to the optimization analysis. The compiler can no longer infer how this variable behaves, even though we know it always stays zero. A relevant extract of the file ifftw.h from which this is derived:
extern const INT X(an_INT_guaranteed_to_be_zero);
#ifdef PRECOMPUTE_ARRAY_INDICES
...
#define MAKE_VOLATILE_STRIDE(nptr, x) (x) = (x) + X(an_INT_guaranteed_to_be_zero)
#else
...
#define ESTIMATED_AVAILABLE_INDEX_REGISTERS 16
#define MAKE_VOLATILE_STRIDE(nptr, x) \
(nptr <= ESTIMATED_AVAILABLE_INDEX_REGISTERS ? \
0 : \
((x) = (x) ^ X(an_INT_guaranteed_to_be_zero)))
#endif /* PRECOMPUTE_ARRAY_INDICES */
Depending on the PRECOMPUTE_ARRAY_INDICES switch, this optimization is either defeated unconditionally, or allowed on the condition that the pointers fit within a guess at the number of available index registers. When the macro allows the optimization, it simply expands to the constant 0 and leaves the stride untouched.
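To make the effect concrete, here is a small, hypothetical C sketch of the kind of loop the macro is aimed at. This is not actual FFTW code: the function name, the loop bounds, and the loop body are invented for illustration, and the "zero" constant is defined in the same file only so the sketch is self-contained (in the real trick it lives in a separate translation unit, behind the X() macro, so the optimizer cannot prove it is zero).

/* Defined here only so the sketch compiles on its own; in FFTW the
   definition is in another translation unit, hidden from the optimizer. */
const int an_INT_guaranteed_to_be_zero = 0;

/* A codelet-style loop: for every value of k the compiler is tempted to
   lift a pointer pk = &p[i + stride * k] into its own register.  XORing
   the stride with a value it cannot prove to be zero keeps the index
   arithmetic inside the loop instead. */
void sum_strided(const double *p, int stride, int n, double *out)
{
    /* runtime no-op, compile-time "taint" of the stride */
    stride ^= an_INT_guaranteed_to_be_zero;

    for (int i = 0; i < n; i++) {
        double acc = 0.0;
        for (int k = 0; k < 32; k++)      /* 32 candidate pointers > 16 registers */
            acc += p[i + stride * k];     /* assumes p holds n + 31*stride doubles */
        out[i] = acc;
    }
}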
Some etymology: The macro MAKE_VOLATILE_STRIDE derives its name from the volatile keyword which indicates that a value may change between different accesses, even if it does not appear to be modified. This keyword prevents an optimizing compiler from optimizing away subsequent reads or writes and thus incorrectly reusing a stale value or omitting writes. (Wikipedia)
Why the volatile keyword would not be sufficient here, instead of XORing in an external value, I don't know.


What does assignment mean to a C11 atomic?

For example,
atomic_int test(void)
{
    atomic_int tmp = ATOMIC_VAR_INIT(14);
    tmp = 47;                     // Looks like atomic_store
    atomic_int mc;                // Probably just uninitialised data
    memcpy(&mc,&tmp,sizeof(mc));  // Probably equivalent to a copy
    tmp = mc + 4;                 // Arithmetic
    return tmp;                   // A copy - perhaps load then store
}
Clang is happy with all this. I've read section 7.17 of the standard, and it says a lot about the memory model and the defined functions (init, store, load etc) but doesn't say anything about the usual operations (+, = etc).
Also of interest is the behaviour of passing struct wot { atomic_int value; } to functions.
I would like to believe that assignment behaves identically to an atomic load then store using memory_order_seq_cst.
Even more optimistically, I would like to believe that struct assignment, passing to function, returning from function and even memcpy also behaves identically to carefully copying the bit pattern across under memory_order_seq_cst.
I can't find any supporting evidence for either belief in the standard though. There's definitely a chance that assignment and memcpy of atomic primitives is undefined behaviour.
How should primitive operations on atomic primitives behave?
Thanks!
Operations on objects that are _Atomic qualified (and atomic_int is just another way of writing that) are guaranteed to have sequential consistency. You find that mentioned at the end of the semantics section for each of the operators. (And maybe the mention for assignment is missing.)
Your code is not correct in two places: initialization must use the ATOMIC_VAR_INIT macro (7.17.2.1), and the memcpy is undefined (the sizes might not agree), although it will probably work on most architectures.
Also the line
tmp = mc + 4; // Arithmetic
doesn't do what your comment claims. This is not arithmetic on an atomic object, but a load followed by an ordinary addition. More interesting would be
mc += 4; // Arithmetic
which is an atomic operation with sequential consistency.
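For comparison, here is a sketch of the example rewritten along the lines of that answer: initialization through the macro, no memcpy on atomics, and the arithmetic done as a compound assignment so it is a single sequentially consistent read-modify-write. This is an illustrative sketch (the function name and return type are mine), not normative wording from the standard.

#include <stdatomic.h>

int test(void)
{
    atomic_int tmp = ATOMIC_VAR_INIT(14);  /* initialization via the 7.17.2.1 macro */

    tmp = 47;               /* plain assignment: a seq_cst atomic store             */

    atomic_int mc;
    atomic_init(&mc, tmp);  /* seq_cst load of tmp, then initialization of mc       */

    mc += 4;                /* compound assignment: one atomic read-modify-write,
                               sequentially consistent                              */

    return tmp + mc;        /* two seq_cst loads, then ordinary (non-atomic) math   */
}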

What does to_unsigned do?

Could someone please explain to me how VHDL's to_unsigned works, or confirm that my understanding is correct?
For example:
C(30 DOWNTO 0) <= std_logic_vector (to_unsigned(-30, 31))
Here is my understanding:
-30 is a signed value, represented in bits as 1111111111100010
all bits should be inverted and 1 added to them to build the value of C
0000000000011101 + 0000000000000001 == 0000000000011110
In IEEE package numeric_std, the declaration for TO_UNSIGNED:
-- Id: D.3
function TO_UNSIGNED (ARG, SIZE: NATURAL) return UNSIGNED;
-- Result subtype: UNSIGNED(SIZE-1 downto 0)
-- Result: Converts a non-negative INTEGER to an UNSIGNED vector with
-- the specified SIZE.
You won't find a declaration of to_unsigned whose ARG or SIZE parameter is of type INTEGER. What is the consequence?
Let's put that in a Minimal, Complete, and Verifiable example:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity what_to_unsigned is
end entity;

architecture does of what_to_unsigned is
    signal C: std_logic_vector (31 downto 0);
begin
    C(30 DOWNTO 0) <= std_logic_vector (to_unsigned(-30, 31));
end architecture;
A VHDL analyzer will give us an error:
ghdl -a what_to_unsigned.vhdl
what_to_unsigned.vhdl:12:53: static constant violates bounds
ghdl: compilation error
And tells us that -30 (line 12, character 53) has a bounds violation, meaning that in this case the numeric literal, converted to universal_integer, does not convert to the type NATURAL expected by the function to_unsigned.
A different tool might tell us a bit more graphically:
nvc -a what_to_unsigned.vhdl
** Error: value -30 out of bounds 0 to 2147483647 for parameter ARG
File what_to_unsigned.vhdl, Line 12
C(30 DOWNTO 0) <= std_logic_vector (to_unsigned(-30, 31));
^^^
And actually tells us where in the source code the error is found.
It's safe to say what you think to_unsigned does is not what the analyzer thinks it does.
VHDL is a strongly typed language: you tried to provide a value in a place where that value is out of range for the argument ARG of function TO_UNSIGNED, declared in IEEE package numeric_std.
The type NATURAL is declared in package STANDARD and is made visible by an implicit context item, library STD; use STD.STANDARD.all;, in the context clause (see IEEE Std 1076-2008, 13.2 Design libraries):
Every design unit except a context declaration and package STANDARD is
assumed to contain the following implicit context items as part of its
context clause:
library STD, WORK; use STD.STANDARD.all;
The declaration of natural found in 16.3 Package STANDARD:
subtype NATURAL is INTEGER range 0 to INTEGER'HIGH;
A value declared as a NATURAL belongs to a subtype of INTEGER with a constrained range that excludes negative numbers.
And at about this point you can see that you could answer this question yourself, given access to a standard-compliant VHDL tool and a copy of IEEE Std 1076-2008, the IEEE Standard VHDL Language Reference Manual.
The TL;DR detail
You could note that 9.4 Static expressions, 9.4.1 General gives permission to evaluate locally static expressions during analysis:
Certain expressions are said to be static. Similarly, certain discrete ranges are said to be static, and the type marks of certain subtypes are said to denote static subtypes.
There are two categories of static expression. Certain forms of expression can be evaluated during the analysis of the design unit in which they appear; such an expression is said to be locally static.
Certain forms of expression can be evaluated as soon as the design hierarchy in which they appear is elaborated; such an expression is said to be globally static.
There may be some standard-compliant tools that do not evaluate locally static expressions during analysis; "can be" is permissive, not mandatory. The two VHDL tools demonstrated on the above code example take advantage of that permission. In both tools the command-line argument -a tells the tool to analyze the provided file, which, if successful, is inserted into the current working library (WORK by default; see 13.5 Order of analysis, 13.2 Design libraries).
Tools that evaluate bounds checking at elaboration for locally static expressions are typically purely interpretive and even that can be overcome with a separate analysis pass.
The VHDL language can be used for formal specification of a design model used in formal proofs, within the bounds specified by Annex D, Potentially nonportable constructs, and when relying on pure functions only (see 4. Subprograms and packages, 4.1 General).
VHDL compliant tools are guaranteed to give the same results, although there is no standardization of error messages nor limitations placed on tool implementation methodology.
to_unsigned is for converting between different types:
signal i : integer := 2;
signal u : unsigned(3 downto 0);
...
u <= i; -- Error, incompatible types
u <= to_unsigned(i, 4); -- OK, conversion function does the right thing
If you try to convert a negative integer, this is an error.
u <= to_unsigned(-2, 4); -- Error, does not work with negative numbers
If you simply want to negate an integer, i.e. 2 becomes -2 and -5 becomes 5, just use the - operator:
u <= to_unsigned(-i, 4); -- OK as long as `i` was negative or zero
If you want the absolute value, the predefined abs operator does this:
u <= to_unsigned(abs(i), 4);

What is the benefit of automatic variables?

I'm looking for the benefits of "automatic" in SystemVerilog.
I have seen the "automatic" factorial example, but I can't follow it. Does anyone know why we use "automatic"?
Traditionally, Verilog has been used for modelling hardware at the RTL and gate-level abstractions. Since both RTL and gate-level abstractions are static/fixed (non-dynamic), Verilog supported only static variables. So, for example, any reg or wire in Verilog would be instantiated/mapped at the beginning of simulation and would remain mapped in the simulation memory till the end of simulation. As a result, you can take a dump of any wire/reg as a waveform, and the reg/wire would have a value from the beginning till the end, since it is always mapped. From a programmer's perspective, such variables are termed static. In the C/C++ world, to declare such a variable you would use the storage class specifier static. In Verilog, every variable is implicitly static.
Note that until the advent of SystemVerilog, Verilog supported only static variables. Even though Verilog also supported some constructs for modelling at the behavioural abstraction, the support was limited by the absence of an automatic storage class.
automatic (called auto in the software world) storage class variables are mapped on the stack. When a function is called, all the local (non-static) variables declared in the function are mapped to individual locations in the stack. Since such variables exist only on the stack, they cease to exist as soon as the execution of the function is complete and the stack correspondingly shrinks.
Amongst other advantages, one possibility that this storage class enables is recursive functions. In the Verilog world, a function cannot be re-entrant. Recursive (or re-entrant) functions do not serve any useful purpose in a world where the automatic storage class is not available. To understand this, you can imagine a re-entrant function as a function which dynamically makes multiple recursive instantiations of itself. Each instance gets its automatic variables mapped on the stack. As we progress into the recursion, the stack grows and each function gets to make its computations using its own set of variables. When the function calls return, the computed values are collated and a final result is made available. With only static variables, each function call would store the variable values at the same common locations, thus erasing any benefit of having multiple calls (instantiations).
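Since the answer already draws the analogy to C's static storage class, here is a small C sketch of the same pitfall: the function names and the example are mine and purely illustrative (C, not SystemVerilog), but the static version mirrors a non-automatic Verilog function and the automatic version mirrors "function automatic".

#include <stdio.h>

/* Every call shares the same n, so the recursive call destroys the caller's
   value - this mirrors a non-automatic Verilog function. */
static int factorial_static(int arg)
{
    static int n;                           /* one storage location for all calls */
    n = arg;
    if (n > 1) {
        int sub = factorial_static(n - 1);  /* this call overwrites n ...          */
        return sub * n;                     /* ... so n is no longer arg here      */
    }
    return 1;
}

/* Each call gets its own n on the stack, so the recursion works - this
   mirrors "function automatic" in SystemVerilog. */
static int factorial_auto(int arg)
{
    int n = arg;                            /* automatic: a fresh location per call */
    if (n > 1)
        return factorial_auto(n - 1) * n;
    return 1;
}

int main(void)
{
    /* factorial_static(5) collapses to 1, factorial_auto(5) gives 120 */
    printf("%d %d\n", factorial_static(5), factorial_auto(5));
    return 0;
}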
Coming to the factorial algorithm, it is relatively easy to conceptualize factorial as a recursive algorithm. In maths we write factorial(n) = n * factorial(n-1). Note that the recursion cannot be completed without a terminating case, which in the case of factorial is n = 1.
function automatic int factorial;
    input int n;
    if (n > 1)
        factorial = factorial(n - 1) * n;
    else
        factorial = 1;
endfunction
Without automatic storage class, since all variables in a function would be mapped to a fixed location, when we call factorial(n-1) from inside factorial(n), the recursive call would overwrite any variable inside the caller context. In the factorial function as defined in the above code snippet, if we do not specify the storage class as automatic, both n and the result factorial would be overwritten by the recursive call to factorial(n-1). As a result the variable n would consecutively be overwritten as n-1, n-2, n-3 and so on till we reach the terminating condition of n = 1. The terminating recursive call to factorial would have a value of 1 assigned to n and when the recursion unwinds, factorial(n-1) * n would evaluate to 1 in each stage.
With automatic storage class, each recursive function call would have its own place in memory (actually on the stack) to store variable n. As a result, consecutive calls to factorial will not overwrite the variable n of the caller, and when the recursion unwinds we shall have the right value for factorial(n) as n*(n-1)*(n-2)* ... *1.
Note that it is possible to define factorial using iteration as well. And that can be done without use of automatic storage class. But in many cases, recursion makes it possible for the user to code algorithms in a more intuitive fashion.
I propose one example below (using fork...join_none):
Ex. 1 (without automatic): the output will be "4 4 4 4", because each spawned process reads i only after the loop has exited, and i is stored in a single static memory location (its value after the loop terminates is 4). This may be a bug in your code.
initial begin
    for (int i = 0; i <= 3; i++)
        fork
            $write("%d ", i);
        join_none
end
Ex. 2 (using automatic): the output will be "0 1 2 3", because on each iteration the value of i is copied to k, and fork...join_none spawns a thread with that value of k (each iteration allocates its own storage for k: k0, k1, k2, k3):
initial begin
    for (int i = 0; i <= 3; i++)
        fork
            automatic int k = i;
            $write("%d ", k);
        join_none
end
Another example is the use of fork...join inside a for loop.
Without the use of automatic, the fork...join inside the for loop will not work correctly.
for (int i = 0; i < `SOME_VALUE; i++) begin
    automatic int id = i;
    fork
        // task/function using the id above;
        ...
    join_none
end

What should '{default:'1} do in SystemVerilog?

I have an array that I would like to initialize to all 1. To do this, I used the following code snippet:
logic [15:0] memory [8];

always_ff @(posedge clk or posedge reset) begin
    if (reset) begin
        memory <= '{default:'1};
    end
    else begin
        ...
    end
end
My simulator does what I think is the correct thing and sets the registers to 16'hFFFF on reset. However, my lint tool gives me a warning that bit 0 has an async set while bits 1 through 15 have async resets. This implies that the linter thinks that this code assigns 16'h0001 to the registers.
Since both tools come from the same vendor, I filed a bug report, since they can't both be right.
The question is: Which behavior is correct according to the spec? There is no example that shows this exact situation. Section 5.7.1 mentions that:
An unsized single-bit value can be specified by preceding the single-bit value with an apostrophe ( ' ), but
without the base specifier. All bits of the unsized value shall be set to the value of the specified bit. In a
self-determined context, an unsized single-bit value shall have a width of 1 bit, and the value shall be treated
as unsigned.
'0, '1, 'X, 'x, 'Z, 'z // sets all bits to specified value
If this is a "self-determined context" then the answer is 1 bit sign-extended to 16'h0001, but if it is not, then I guess the example which says it "sets all bits to the specified value" applies. I am not sure if this is a self-determined context.
The simulator is correct: memory <= '{default:'1}; will set every bit in memory to 1. The linting tool has a bug. See IEEE Std 1800-2012 § 10.9.1 Array assignment patterns:
The default:value applies to elements or subarrays that are not matched by either index or type key. If the type of the element or subarray is a simple bit vector type, matches the self-determined type of the value, or is not an array or structure type, then the value is evaluated in the context of each assignment to an element or subarray by the default and shall be castable to the type of the element or subarray; otherwise, an error is generated. ...
The LRM goes beyond what is synthesizable when it comes to assignment patterns, and most tools are still playing catch-up with supporting all the SystemVerilog features. Experiment to make sure your tools (simulator, synthesizer, lint tool, logic-equivalence checker, etc.) all have the necessary support for the features you want.

What is the default hash code that Mathematica uses?

The online documentation says
Hash[expr]
gives an integer hash code for the expression expr.
Hash[expr,"type"]
gives an integer hash code of the specified type for expr.
It also gives "possible hash code types":
"Adler32" Adler 32-bit cyclic redundancy check
"CRC32" 32-bit cyclic redundancy check
"MD2" 128-bit MD2 code
"MD5" 128-bit MD5 code
"SHA" 160-bit SHA-1 code
"SHA256" 256-bit SHA code
"SHA384" 384-bit SHA code
"SHA512" 512-bit SHA code
Yet none of these correspond to the default returned by Hash[expr].
So my questions are:
What method does the default Hash use?
Are there any other hash codes built in?
The default hash algorithm is, more or less, a basic 32-bit hash function applied to the underlying expression representation, but the exact code is a proprietary component of the Mathematica kernel. It's subject to change (and has changed) between Mathematica versions, and it lacks a number of desirable cryptographic properties, so I personally recommend you use MD5 or one of the SHA variants for any serious application where security matters. The built-in hash is intended for typical data structure use (e.g. in a hash table).
The named hash algorithms you list from the documentation are the only ones currently available. Are you looking for a different one in particular?
I've been doing some reverse engineering on the 32- and 64-bit Windows versions of Mathematica 10.4, and this is what I found:
32 BIT
It uses a Fowler–Noll–Vo hash function (FNV-1, with the multiplication before) with 16777619 as the FNV prime and 84696351 as the offset basis. This function is applied to the Murmur3-32 hash of the address of the expression's data (MMA uses a pointer in order to keep one instance of each piece of data). The address is eventually resolved to the value - for simple machine integers the value is immediate, for others it is a bit trickier. The function implementing Murmur3-32 in fact takes an additional parameter (defaulted to 4; this special case makes it behave as described on Wikipedia) which selects how many bytes to take from the input expression struct. Since a normal expression is internally represented as an array of pointers, one can take the first, the second, etc. by repeatedly adding 4 (bytes = 32 bits) to the base pointer of the expression. So, passing 8 to the function will give the second pointer, 12 the third, and so on. Since internal structs (big integers, machine integers, machine reals, big reals and so on) have different member variables (e.g. a machine integer has only a pointer to int, a complex has 2 pointers to numbers, etc.), for each expression struct there is a "wrapper" that combines its internal members into one single 32-bit hash (basically with FNV-1 rounds). The simplest expression to hash is an integer.
The murmur3_32() function has 1131470165 as seed, n=0 and other params as in Wikipedia.
So we have:
hash_of_number = 16777619 * (84696351 ^ murmur3_32( &number ))
with " ^ " meaning XOR.
I really didn't try it - pointers are encoded using the WINAPI EncodePointer(), so they can't be exploited at runtime. (It may be worth running it in Linux under Wine with a modified version of EncodePointer?)
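To make the 32-bit recipe concrete, here is a C sketch of the arithmetic described above. It assumes a textbook Murmur3-32 with the stated seed over the four bytes at the given address; the helper names are mine, the constants are taken from the answer, and the result is untested (as the answer itself notes).

#include <stdint.h>
#include <string.h>

/* Standard Murmur3 32-bit hash, as described on Wikipedia. */
static uint32_t murmur3_32(const uint8_t *key, size_t len, uint32_t seed)
{
    uint32_t h = seed, k;
    for (size_t i = len / 4; i; i--) {       /* full 4-byte blocks */
        memcpy(&k, key, 4);
        key += 4;
        k *= 0xcc9e2d51u;
        k = (k << 15) | (k >> 17);
        k *= 0x1b873593u;
        h ^= k;
        h = (h << 13) | (h >> 19);
        h = h * 5 + 0xe6546b64u;
    }
    k = 0;
    for (size_t i = len & 3; i; i--) {       /* tail bytes, if any */
        k <<= 8;
        k |= key[i - 1];
    }
    if (len & 3) {
        k *= 0xcc9e2d51u;
        k = (k << 15) | (k >> 17);
        k *= 0x1b873593u;
        h ^= k;
    }
    h ^= (uint32_t)len;                      /* finalization mix */
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* The combining formula quoted above:
   16777619 * (84696351 ^ murmur3_32(&number)), with Murmur3 seed 1131470165. */
static uint32_t mma_hash_of_number_sketch(int32_t number)
{
    uint32_t m = murmur3_32((const uint8_t *)&number, 4, 1131470165u);
    return 16777619u * (84696351u ^ m);
}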
64 BIT
It uses an FNV-1 64-bit hash function with 0xAF63BD4C8601B7DF as the offset basis and 0x100000001B3 as the FNV prime, along with a SIP64-24 hash (here's the reference code) with the first 64 bits of 0x0AE3F68FE7126BBF76F98EF7F39DE1521 as k0 and the last 64 bits as k1. The function is applied to the base pointer of the expression and resolved internally. As in the 32-bit Murmur3 case, there is an additional parameter (defaulted to 8) to select how many pointers to take from the input expression struct. For each expression type there is a wrapper that condenses the struct members into a single hash by means of FNV-1 64-bit rounds.
For a machine integer, we have:
hash_number_64bit = 0x100000001B3 * (0xAF63BD4C8601B7DF ^ SIP64_24( &number ))
Again, I didn't really try it. Could anyone try?
Not for the faint-hearted
If you take a look at their notes on internal implementation, they say that "Each expression contains a special form of hash code that is used both in pattern matching and evaluation."
The hash code they're referring to is the one generated by these functions - at some point in the normal expression wrapper function there's an assignment that puts the computed hash inside the expression struct itself.
It would certainly be cool to understand HOW they can make use of these hashes for pattern matching purposes. So I had a try running through the bigInteger wrapper to see what happens - that's the simplest compound expression.
It begins checking something that returns 1 - dunno what.
So it executes
var1 = 16777619 * (67918732 ^ hashMachineInteger(1));
where hashMachineInteger() is what we described before, including the values.
Then it reads the length in bytes of the bigInt from the struct (bignum_length) and runs
result = 16777619 * (v10 ^ murmur3_32(v6, 4 * v4));
Note that murmur3_32() is called if 4 * bignum_length is greater than 8 (may be related to the max value of machine integers $MaxMachineNumber 2^32^32 and by converse to what a bigInt is supposed to be).
So, the final code is
if (bignum_length > 8) {
    result = 16777619 * (16777619 * (67918732 ^ (16777619 * (84696351 ^ murmur3_32(1, 4)))) ^ murmur3_32(&bignum, 4 * bignum_length));
}
I've made some hypotheses on the properties of this construction. The presence of many XORs and the fact that 16777619 + 67918732 = 84696351 may make one think that some sort of cyclic structure is exploited to check patterns - i.e. subtracting the offset and dividing by the prime, or something like that. The software Cassandra uses the Murmur hash algorithm for token generation - see these images for what I mean by "cyclic structure". Maybe different primes are used for each expression - I must still check.
Hope it helps
It seems that Hash calls the internal Data`HashCode function, then divides it by 2, takes the first 20 digits of N[..] and then the IntegerPart, plus one, that is:
IntegerPart[N[Data`HashCode[expr]/2, 20]] + 1