Why do highly-optimizing compilers not utilize the ANDNOT instruction? - compiler-optimization

First off: I mainly develop games in C# using Unity3D. That means I don't have access to .NET Core intrinsics, but rather to the SIMD intrinsics provided by the Burst compiler made by Unity Technologies - otherwise I wouldn't ask.
I find myself using the andnot semantics a lot when doing bit manipulation where I need *both* the original and the inverted value of a bitmask within a small function, which is probably the ideal use case for it.
I tried to force the fully-optimizing compiler to generate it by defining it as a function both as
static inline int andnot(int a, int b)
{
    return ~a & b;
}
and even as
static inline int andnot(int a, int b)
{
    __m128i whyDoIHaveToDoThis = _mm_insert_epi32(_mm_setzero_si128(), a, 0);
    __m128i compilersAreWaySmarterThanThis = _mm_insert_epi32(_mm_setzero_si128(), b, 0);
    return _mm_extract_epi32(_mm_andnot_si128(whyDoIHaveToDoThis, compilersAreWaySmarterThanThis), 0);
}
and both produce assembly like
not a, a
and b, b, a
(yes, even the second function outputs this)
According to Agner Fog's instruction tables, the latency of "andn" is one clock cycle, it decodes to a single micro-op, and its reciprocal throughput is 0.5 - and this has been true for at least a decade (it also seems trivial to implement in hardware).
So let me reiterate: why do compilers not use the instruction, even when I try my hardest to tell them to?
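For what it's worth, here is a minimal sketch of what I would have hoped to write directly in C (an assumption on my part: the target supports BMI1, which is the extension that actually provides the scalar andn encoding, and _andn_u32 from <immintrin.h> is the matching intrinsic - used here as a stand-in for whatever Burst exposes):

#include <immintrin.h>   /* _andn_u32 requires BMI1; compile with e.g. -mbmi */

static inline unsigned int andnot_direct(unsigned int a, unsigned int b)
{
    /* computes ~a & b; a single andn instruction on BMI1-capable CPUs */
    return _andn_u32(a, b);
}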

Related

Are PyTorch activation functions best stored as fields?

An example of a simple neural network in PyTorch can be found at https://visualstudiomagazine.com/articles/2020/10/14/pytorch-define-network.aspx
class Net(T.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.hid1 = T.nn.Linear(4, 8)  # 4-(8-8)-1
        self.hid2 = T.nn.Linear(8, 8)
        self.oupt = T.nn.Linear(8, 1)
        T.nn.init.xavier_uniform_(self.hid1.weight)
        T.nn.init.zeros_(self.hid1.bias)
        T.nn.init.xavier_uniform_(self.hid2.weight)
        T.nn.init.zeros_(self.hid2.bias)
        T.nn.init.xavier_uniform_(self.oupt.weight)
        T.nn.init.zeros_(self.oupt.bias)

    def forward(self, x):
        z = T.tanh(self.hid1(x))
        z = T.tanh(self.hid2(z))
        z = T.sigmoid(self.oupt(z))
        return z
A distinctive feature of the above is that the layers are stored as fields within the Net object (as they need to be, in the sense that they contain the weights, which need to be remembered across training epochs), but the activation functors such as tanh are re-created on every call to forward. The author says:
The most common structure for a binary classification network is to define the network layers and their associated weights and biases in the __init__() method, and the input-output computations in the forward() method.
Fair enough. On the other hand, perhaps it would be marginally faster to store the functors rather than re-create them on every call to forward. On the third hand, it's unlikely to make any measurable difference, which means it might end up being a matter of code style.
Is the above indeed the most common way to do it? Does either way have any technical advantage, or is it just a matter of style?
On "storing" functors
The snippet is not "re-creating" anything -- calling torch.tanh(x) is literally just calling the function tanh exported by the torch package with argument x.
Other ways of doing it
I think the snippet is a fair example for small neural blocks that are use-and-forget or are just not meant to be parameterizable.
Depending on your intentions, there are of course alternatives, but you'd have to weigh for yourself whether the added complexity offers any value.
activation functions as strings
allow a selection of an activation function from a fixed set
class Model(torch.nn.Module):
    def __init__(..., activation_function: Literal['tanh'] | Literal['relu']):
        ...
        if activation_function == 'tanh':
            self.activation_function = torch.tanh
        elif activation_function == 'relu':
            self.activation_function = torch.relu
        else:
            raise ValueError(f'activation function {activation_function} not allowed, use tanh or relu.')

    def forward(...) -> Tensor:
        output = ...
        return self.activation_function(output)
activation functions as callables
use arbitrary modules or functions as activations
class Model(torch.nn.Module):
    def __init__(..., activation_function: torch.nn.Module | Callable[[Tensor], Tensor]):
        self.activation_function = activation_function

    def forward(...) -> Tensor:
        output = ...
        return self.activation_function(output)
which would for instance work like
def cube(x: Tensor) -> Tensor: return x**3
cubic_model = Model(..., activation_function=cube)
The key difference between the above examples and your snippet is that the former are transparent and adjustable with respect to the activation used: you can inspect the activation function (i.e. model.activation_function) and change it (before or after initialization), whereas in the original snippet it is invisible and baked into the model's functionality (to replicate the model with a different function, you'd need to define it from scratch).
Overall, I think the best way to go is to create small, locally tunable blocks that are as parametric as you need them to be, and wrap them into bigger blocks that make generalizations over the contained parameters. For example, if your big model consists of 5 linear layers, you could make a single, activation-parametric wrapper for one layer (including dropouts, layer norms, whatever), and then another wrapper for a flow of N layers, which asks once for which activation function to initialize its children with. In other words, generalize and parameterize when you anticipate this will save you extra effort and copy-pasted code in the future, but don't overdo it or you'll end up far away from your original specifications and needs.
ps: I don't know whether calling activation functions functors is justifiable.

What happens when we pass-by-value-result in this function?

Consider this code.
foo(int x, int y) {
    x = y + 1;
    y = 10;
    x++;
}

int n = 5;
foo(n, n);
print(n);
If we assume that the language supports pass-by-value-result, what would the answer be? As far as I know, pass-by-value-result copies in and out. But I am not sure what n's value would be when it is copied to two different formal parameters. Should x and y act like references? Or should n get the value of either x or y depending on which is copied out last?
Thanks
Regardless of whether it's plain pass-by-value or pass-by-value-result, x and y become separate copies of n; they are in no way tied to each other, except for the fact that they start with the same value.
However, pass-by-value-result assigns the values back to the original variable upon function exit, meaning that n would take on the value of x and of y. Which one it gets first (or, more importantly, last, since that will be its final value) is open to interpretation, since you haven't specified which language you're actually using.
The Wikipedia page on this entry has this to say on the subject ("call-by-copy-restore" is its terminology for what you're asking about, and I've emphasised the important bit and paraphrased to make it clearer):
The semantics of call-by-copy-restore also differ from those of call-by-reference where two or more function arguments alias one another; that is, point to the same variable in the caller's environment.
Under call-by-reference, writing to one will affect the other immediately; call-by-copy-restore avoids this by giving the function distinct copies, but leaves the result in the caller's environment dependent on the order in which the aliased arguments are copied back. Will the copies be made in left-to-right order both on entry and on return?
I would hope that the language specification would clarify actual consistent behaviour so as to avoid all those undefined-behaviour corners you often see in C and C++ :-)
Examine the code below, slightly modified from your original since I'm inherently lazy and don't want to have to calculate the final values :-)
foo(int x, int y) {
    x = 7;
    y = 42;
}

int n = 5;
foo(n, n);
print(n);
The immediate possibilities I see as the most likely are:
strict left to right copy-on-exit, n will become x then y, so 42.
strict right to left copy-on-exit, n will become y then x, so 7.
undefined behaviour, n may take on either, or possibly any, value.
compiler raises a diagnostic and refuses to compile, if it has no strict rule and doesn't want your code to end up behaving in a (seemingly) random manner.
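To make the copy-in/copy-out mechanics concrete, here is a minimal sketch in plain C (which is itself only pass-by-value) that emulates call-by-copy-restore by hand, assuming the strict left-to-right copy-back order from the first possibility above; swapping the two restore assignments models the right-to-left case:

#include <stdio.h>

/* the body of foo(), operating on its own private copies */
static void foo(int *x, int *y)
{
    *x = 7;
    *y = 42;
}

int main(void)
{
    int n = 5;

    /* copy in: each formal parameter gets its own copy of n */
    int x = n, y = n;
    foo(&x, &y);

    /* copy out (restore), assumed left to right: x first, then y */
    n = x;
    n = y;               /* the copy written back last wins */

    printf("%d\n", n);   /* prints 42; a right-to-left restore would give 7 */
    return 0;
}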

iOS - bitwise XOR on a vector using Accelerate.framework

I am trying to perform a bitwise XOR between a predetermined value and each element of an array.
This can clearly be done in a loop like so (in pseudocode):
int scalar = 123;
for (int i = 0; i < VECTOR_LENGTH; i++) {
    int x_or = scalar ^ a[i];
}
but I'm starting to learn about the performance enhancements by using the Accelerate.framework.
I'm looking through the docs for Accelerate.framework, but I haven't seen any way to do an element-wise bitwise XOR. Does anyone know if this is possible?
Accelerate doesn't implement the operation in question. You can pretty easily write your own vector code to do it, however. One nice approach is to use clang vector extensions:
#include <stddef.h>
typedef int vint8 __attribute__((ext_vector_type(8),aligned(4)));
typedef int vint4 __attribute__((ext_vector_type(4),aligned(4)));
typedef int vint2 __attribute__((ext_vector_type(2),aligned(4)));
int vector_xor(int *x, size_t n) {
    vint8 xor8 = 0;
    while (n >= 8) {
        xor8 ^= *(vint8 *)x;
        x += 8;
        n -= 8;
    }
    vint4 xor4 = xor8.lo ^ xor8.hi;
    vint2 xor2 = xor4.lo ^ xor4.hi;
    int xor = xor2.lo ^ xor2.hi;
    while (n > 0) {
        xor ^= *x++;
        n -= 1;
    }
    return xor ^ 123;
}
This is pretty nice because (a) it doesn't require use of intrinsics and (b) it doesn't tie you to any specific architecture. It generates pretty decent code for any architecture you compile for. On the other hand, it ties you to clang, whereas if you use intrinsics your code may work with other compilers as well.
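For completeness, a tiny usage sketch of the function above (the array contents and length are made up purely for illustration, and vector_xor is assumed to live in the same translation unit):

#include <stdio.h>

int main(void)
{
    int data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    printf("%d\n", vector_xor(data, 10));   /* XOR-reduction of data, then ^ 123 */
    return 0;
}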
Stephen's answer is useful, but as you're looking at Accelerate, keep in mind that it is not a magic "go fast" library. Unless VECTOR_LENGTH is very large (say 10,000 -- EDIT: Stephen disagrees on this scale, and tends to know more about this subject than I do; see comments), the cost of the function call will often overwhelm any benefits you get. Remember, at the end of the day, Accelerate is just code. Very often, simple hand-written loops like yours (especially with good compiler optimizations) are going to be just as good or better on simple operations like xor.
But in many cases you need to let the compiler help you. Clang knows how to do all kinds of useful vector optimizations (just like in Stephen's answer) automatically. But in most cases, the default optimization setting is -Os (Fastest, Smallest). That says "clang, you may do any optimizations you want, but not if it makes the resulting binary any larger." You might notice that Stephen's example is a little larger than yours. That means that the compiler is often forbidden from applying the automatic vector optimizations it knows how to do.
But, if you switch to -Ofast, then you give clang permission to improve performance, even if it increases binary size (and on modern hardware, even mobile hardware, that is often a very good tradeoff). In the Build Settings panel, this is called "Optimization Level: Fastest, Aggressive Optimizations." In nearly every case, that is the correct setting for iOS and OS X apps. (It is not currently the default because of history; I expect that Apple will make it the default in the future.)
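To make that concrete, here is roughly the kind of loop clang's auto-vectorizer can handle on its own - it happens to be the element-wise XOR from the question (the function name is just for illustration); whether it actually gets vectorized depends on the optimization level and target discussed above:

/* Element-wise XOR with a scalar; simple enough for clang to auto-vectorize
 * when optimizations that may grow code size are permitted. */
void xor_with_scalar(int *dst, const int *src, int n, int scalar)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] ^ scalar;
}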
For more discussion on the limitations of Accelerate (wonderful library that it is), you may be interested in "Introduction to Fast Bézier (and Trying the Accelerate.framework)". I also highly recommend "What's New in the LLVM Compiler" (Session 402 from WWDC 2013), which I found even more useful than the introduction to Accelerate. Clang can do some really amazing optimizations if you get out of its way.

intel parallel studio 2011 - summing in parallel

I have serial code that looks something like this:
sum = a;
sum += b;
sum += c;
sum += d;
I would like to parallelize it to something like this:
temp1 = a + b and at the same time temp2 = c + d
sum = temp1 + temp2
How do I do it using Intel Parallel Studio tools?
Thanks!!!
Assuming that all variables are of integral or floating-point types, there is absolutely no sense in parallelizing this code (in the sense of executing it on different threads/cores), as the overhead would be much, much higher than any benefit from it. The applicable parallelism in this example is at the level of multiple execution units and/or vectorization within a single CPU. Optimizing compilers are sophisticated enough nowadays to exploit this automatically, without code changes; but if you wish, you may explicitly use temporary variables, as in the second part of the question.
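For illustration, here is the second form from the question written out as a plain C function (the name sum_four is made up); the additions into temp1 and temp2 have no data dependency, so an out-of-order core can execute them in parallel without any threads at all:

int sum_four(int a, int b, int c, int d)
{
    int temp1 = a + b;   /* independent of the next line */
    int temp2 = c + d;   /* can issue in the same cycle on modern CPUs */
    return temp1 + temp2;
}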
And if you ask just out of curiosity: Intel Parallel Studio provides several ways to parallelize code. For example, let's use Cilk keywords together with C++11 lambda functions:
#include <cilk/cilk.h>
...
temp = cilk_spawn [=]{ return a+b; }();
sum = c+d;
cilk_sync;
sum += temp;
Don't expect to get performance out of that (see above), unless you use classes with a computationally heavy overloaded operator+.

Why is this C-style code 10X slower than this obj-C style code?

//Obj-C version - less than one second on 18,000 iterations
for (NSString* coordStr in splitPoints) {
    const char *buf = [coordStr UTF8String];
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
    i++;
}
//C version - over 13 seconds on 18,000 iterations
for (i = 0; buf != NULL; buf = strchr(buf, '['), ++i) {
    buf += sizeof(char);
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
}
As a corollary question, is there any way to make this loop faster?
Also see this question: Another Speed Boost Possible?
Measure, measure, measure.
Measure the code with the Sampler instrument in Instruments.
With that said, there is an obvious inefficiency in the C code compared to the Objective-C code.
Namely, fast enumeration -- the for(x in y) syntax -- is really fast and, more importantly, implies that splitPoints is an array or set that contains a bunch of data that has already been parsed into individual objects.
The strchr() call in the second loop implies that you are parsing stuff on the fly. In and of itself, strchr() is a looping operation and will consume time, more so as the number of characters between occurrences of the target character increases.
That is all conjecture, though. As with all optimizations, speculation is useless and gathering concrete data using the [rather awesome] set of tools provided is the only way to know for sure.
Once you have measured, then you can make it faster.
Having nothing to do with performance, your C code has an error in it. buf += sizeof(char) should simply be buf++. Pointer arithmetic always moves in units the size of the type. It worked fine in this case because sizeof(char) was 1.
The Obj-C code looks like it has precomputed some split points, while the C code seeks them in each iteration. Simple answer? If N is the length of buf and M the number of your split points, it looks like your two snippets have complexities of O(M) versus O(N*M); which one's slower?
edit: What really amazed me, though, is that some would think C code is axiomatically faster than any other solution.
Vectorization can be used to speed up C code.
Example:
Even faster UTF-8 character counting
(But maybe just try to avoid the function call strchr() in the loop condition.)
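Expanding on that last suggestion, here is a minimal sketch of a single forward pass over the buffer (the record format "[lat,lon,..." and the RoutePoint struct are assumptions based on the original loop): strtof() advances one cursor through the string, so there is no strchr() in the loop condition and no repeated sscanf() on the remaining buffer.

#include <stdlib.h>

typedef struct { float latitude, longitude; } RoutePoint;

static size_t parse_points(const char *buf, RoutePoint *pts, size_t max)
{
    size_t i = 0;
    while (i < max && *buf != '\0') {
        if (*buf != '[') {          /* advance to the start of the next record */
            buf++;
            continue;
        }
        char *end;
        pts[i].latitude = strtof(buf + 1, &end);
        if (*end == ',')
            end++;
        pts[i].longitude = strtof(end, &end);
        buf = end;                  /* continue scanning where the numbers ended */
        i++;
    }
    return i;                       /* number of points parsed */
}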