iOS - bitwise XOR on a vector using Accelerate.framework

I am trying to perform a bitwise XOR between a predetermined value and each element of an array.
This can clearly be done in a loop like so (in pseudocode):
int scalar = 123;
for (int i = 0; i < VECTOR_LENGTH; i++) {
    int x_or = scalar ^ a[i];
}
but I'm starting to learn about the performance enhancements offered by the Accelerate.framework.
I'm looking through the docs for Accelerate.framework, but I haven't seen any way to do an element-based bitwise XOR. Does anyone know if this is possible?

Accelerate doesn't implement the operation in question. You can pretty easily write your own vector code to do it, however. One nice approach is to use clang vector extensions:
#include <stddef.h>

typedef int vint8 __attribute__((ext_vector_type(8),aligned(4)));
typedef int vint4 __attribute__((ext_vector_type(4),aligned(4)));
typedef int vint2 __attribute__((ext_vector_type(2),aligned(4)));

int vector_xor(int *x, size_t n) {
    vint8 xor8 = 0;
    // XOR eight elements at a time into the vector accumulator
    while (n >= 8) {
        xor8 ^= *(vint8 *)x;
        x += 8;
        n -= 8;
    }
    // fold the eight-lane accumulator down to a single int
    vint4 xor4 = xor8.lo ^ xor8.hi;
    vint2 xor2 = xor4.lo ^ xor4.hi;
    int xor = xor2.lo ^ xor2.hi;
    // handle any leftover elements one at a time
    while (n > 0) {
        xor ^= *x++;
        n -= 1;
    }
    return xor ^ 123;
}
This is pretty nice because (a) it doesn't require use of intrinsics and (b) it doesn't tie you to any specific architecture. It generates pretty decent code for any architecture you compile for. On the other hand, it ties you to clang, whereas if you use intrinsics your code may work with other compilers as well.
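The question as written wants an element-wise XOR against a fixed scalar rather than a reduction; the same clang extension handles that too. Here is a minimal sketch, assuming the results should go into a separate output buffer (the function name and signature are made up for illustration):
#include <stddef.h>

// Hypothetical element-wise variant: XOR each element of x with a scalar
// and store the result in out. Same ext_vector_type idea (and the same
// clang-only caveat) as the reduction above.
typedef int vint4 __attribute__((ext_vector_type(4),aligned(4)));

void vector_xor_scalar(const int *x, int *out, size_t n, int scalar) {
    vint4 vscalar = scalar;            // broadcast the scalar into all four lanes
    while (n >= 4) {
        *(vint4 *)out = *(const vint4 *)x ^ vscalar;
        x += 4;
        out += 4;
        n -= 4;
    }
    while (n > 0) {                    // scalar tail for any leftover elements
        *out++ = *x++ ^ scalar;
        n -= 1;
    }
}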

Stephen's answer is useful, but as you're looking at Accelerate, keep in mind that it is not a magic "go fast" library. Unless VECTOR_LENGTH is very large (say 10,000 -- EDIT: Stephen disagrees on this scale, and tends to know more about this subject than I do; see comments), the cost of the function call will often overwhelm any benefits you get. Remember, at the end of the day, Accelerate is just code. Very often, simple hand-written loops like yours (especially with good compiler optimizations) are going to be just as good or better on simple operations like xor.
But in many cases you need to let the compiler help you. Clang knows how to do all kinds of useful vector optimizations (just like in Stephen's answer) automatically. But in most cases, the default optimization setting is -Os (Fastest, Smallest). That says "clang, you may do any optimizations you want, but not if it makes the resulting binary any larger." You might notice that Stephen's example is a little larger than yours. That means that the compiler is often forbidden from applying the automatic vector optimizations it knows how to do.
But, if you switch to -Ofast, then you give clang permission to improve performance, even if it increases binary size (and on modern hardware, even mobile hardware, that is often a very good tradeoff). In the Build Settings panel, this is called "Optimization Level: Fastest, Aggressive Optimizations." In nearly every case, that is the correct setting for iOS and OS X apps. (It is not currently the default because of history; I expect that Apple will make it the default in the future.)
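To make that concrete, here is a sketch of the question's loop written as a plain function (the names are illustrative). With restrict-qualified, non-overlapping buffers and the higher optimization levels discussed here (e.g. -Ofast), clang will typically vectorize this on its own; at -Os it may not:
#include <stddef.h>

// Plain scalar loop; restrict tells the compiler the buffers don't alias,
// which makes it easy for clang's auto-vectorizer to turn this into vector
// code at higher optimization levels.
void xor_with_scalar(const int *restrict in, int *restrict out, size_t n, int scalar) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] ^ scalar;
    }
}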
For more discussion on the limitations of Accelerate (wonderful library that it is), you may be interested in "Introduction to Fast Bézier (and Trying the Accelerate.framework)". I also highly recommend "What's New in the LLVM Compiler" (Session 402 from WWDC 2013), which I found even more useful than the introduction to Accelerate. Clang can do some really amazing optimizations if you get out of its way.

Related

Why do highly-optimizing compilers not utilize the ANDNOT instruction?

First off: I mainly develop games in C# using Unity3D. That means I don't have access to the .NET Core intrinsics, but rather to the SIMD intrinsics provided by the Burst Compiler made by Unity Technologies - otherwise I wouldn't ask.
I find myself using the andnot semantics a ton when doing bit manipulation where I need both the original and the inverted value of a bitmask within a small function, which is probably the ideal use case for it.
I tried to force the fully optimizing compiler to generate it by defining it as a function, both as
static inline int andnot(int a, int b)
{
    return ~a & b;
}
and even as
#include <immintrin.h>

static inline int andnot(int a, int b)
{
    __m128i whyDoIHaveToDoThis = _mm_insert_epi32(_mm_setzero_si128(), a, 0);
    __m128i compilersAreWaySmarterThanThis = _mm_insert_epi32(_mm_setzero_si128(), b, 0);
    return _mm_extract_epi32(_mm_andnot_si128(whyDoIHaveToDoThis, compilersAreWaySmarterThanThis), 0);
}
and both produce assembly like
not a, a
and b, b, a
(yes, even the second function outputs this)?
According to Agner Fog's instruction tables, the latency of "andn" is one clock cycle with one micro-op and a reciprocal throughput of 0.5, and this has been true for at least a decade (it also seems trivial to implement in hardware).
So let me reiterate: why do compilers not use the instruction even when I try my hardest to tell them to?

ARIMA antipersistence

I'm running RStudio version 1.1.419 with R 3.4.3 on Windows 10. I am trying to fit an (f)arima model, setting the fractional differencing parameter during the optimization to lie between (-0.5, 0.5), i.e. allowing for antipersistence (d < 0), short memory (d = 0) and long memory (d > 0). I have tried multiple functions to accomplish that. I am aware that the default drange for fracdiff() is (0, 0.5). Therefore this ...
> result <- fracdiff(MeanPrice, nar = 2, nma = 1, drange = c(-0.5,0.5))
sadly returns this:
Warning: C fracdf() optimization failure
Warning message: unable to compute correlation matrix; maybe change 'h'
Is there a way to fit fracdiff or other models (maybe arfima::arfima()?) with that drange? Your help is very much appreciated.
If you look at the package documentation, it states that the h argument for fracdiff "is used to compute a finite difference approximation to the Hessian, and hence only influences the cov, cor, and std.error computations." However, as they are referring to the Hessian, I would assume that this affects the results of the MLE. There are other functions in that package that may be helpful: fdGPH for estimating the order of fractional differencing based on the Geweke and Porter-Hudak method, and similarly fdSperio.
Take a look at the forecast package. If you estimate the order of fractional differencing using the above mentioned functions, you might be able to use the same method described in the details of the arfima function.

Does Fortran have inherent limitations on numerical accuracy compared to other languages?

While working on a simple programming exercise, I produced a while loop (DO loop in Fortran) that was meant to exit when a real variable had reached a precise value.
I noticed that due to the precision being used, the equality was never met and the loop became infinite. This is, of course, not unheard of, and one is advised that, rather than comparing two numbers for equality, it is best to check whether the absolute difference between them is less than a set threshold.
What I found disappointing was how low I had to set this threshold, even with variables at double precision, for my loop to exit properly. Furthermore, when I rewrote a "distilled" version of this loop in Perl, I had no problems with numerical accuracy and the loop exited fine.
Since the code to produce the problem is so small, in both Perl and Fortran, I'd like to reproduce it here in case I am glossing over an important detail:
Fortran Code
PROGRAM precision_test
  IMPLICIT NONE

  ! Data Dictionary
  INTEGER :: count = 0   ! Number of times the loop has iterated
  REAL(KIND=8) :: velocity
  REAL(KIND=8), PARAMETER :: MACH_2_METERS_PER_SEC = 340.0

  velocity = 0.5 * MACH_2_METERS_PER_SEC   ! Initial Velocity

  DO
    WRITE (*, 300) velocity
    300 FORMAT (F20.8)

    IF (count == 50) EXIT
    IF (velocity == 5.0 * MACH_2_METERS_PER_SEC) EXIT
    ! IF (abs(velocity - (5.0 * MACH_2_METERS_PER_SEC)) < 1E-4) EXIT

    velocity = velocity + 0.1 * MACH_2_METERS_PER_SEC
    count = count + 1
  END DO
END PROGRAM precision_test
Perl Code
#!/usr/bin/perl -w
use strict;

my $mach_2_meters_per_sec = 340.0;
my $velocity = 0.5 * $mach_2_meters_per_sec;

while (1) {
    printf "%20.8f\n", $velocity;
    exit if ($velocity == 5.0 * $mach_2_meters_per_sec);
    $velocity = $velocity + 0.1 * $mach_2_meters_per_sec;
}
The commented-out line in Fortran is what I would need to use for the loop to exit normally. Notice that the threshold is set to 1E-4, which I feel is quite pathetic.
The names of the variables come from the self-study-based programming exercise I was performing and don't have any relevance.
The intent is that the loop stops when the velocity variable reaches 1700.
Here are the truncated outputs:
Perl Output
170.00000000
204.00000000
238.00000000
272.00000000
306.00000000
340.00000000
...
1564.00000000
1598.00000000
1632.00000000
1666.00000000
1700.00000000
Fortran Output
170.00000000
204.00000051
238.00000101
272.00000152
306.00000203
340.00000253
...
1564.00002077
1598.00002128
1632.00002179
1666.00002229
1700.00002280
What good is Fortran's speed and ease of parallelization if its accuracy stinks? Reminds me of the three ways to do things:
The Right Way
The Wrong Way
The Max Power Way
"Isn't that just the wrong way?"
"Yeah! But faster!"
All kidding aside, I must be doing something wrong.
Does Fortran have inherent limitations on numerical accuracy compared to other languages, or am I (quite likely) the one at fault?
My compiler is gfortran (gcc version 4.1.2), Perl v5.12.1, on a dual-core AMD Opteron @ 1 GHz.
Your assignment is accidentally converting the value to single precision and then back to double.
Try making your 0.1 * be 0.1D0 * and you should see your problem fixed.
As already answered, "plain" floating point constants in Fortran will default to the default real type, which will likely be single-precision. This is an almost classic mistake.
Also, using "kind=8" is not portable -- it will give you double precision with gfortran, but not with some other compilers. The safe, portable way to specify precisions for both variables and constants in Fortran >= 90 is to use the intrinsic functions, and request the precision that you need. Then specify "kinds" on the constants where precision is important. A convenient method is to define your own symbols. For example:
integer, parameter :: DR_K = selected_real_kind(14)
real(DR_K), parameter :: MACH_2_METERS_PER_SEC = 340.0_DR_K
real(DR_K) :: mass, velocity, energy

energy = 0.5_DR_K * mass * velocity**2
This can also be important for integers, e.g., if large values are needed. For related questions for integers, see Fortran: integer*4 vs integer(4) vs integer(kind=4) and Long ints in Fortran
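For readers more at home in C, the same pitfall exists there with float literals; here is a small, hypothetical C analogue (not from the question) showing what a single-precision constant does to the step value:
#include <stdio.h>

// Hypothetical C analogue of the Fortran mistake: 0.1f is rounded to float
// before the multiply, so the step picks up an error of about 5e-7 that
// accumulates on every iteration; the double literal 0.1 keeps the error
// down around 1e-14.
int main(void) {
    const double mach = 340.0;
    double bad_step  = 0.1f * mach;   // float constant, like Fortran's 0.1
    double good_step = 0.1  * mach;   // double constant, like Fortran's 0.1D0
    printf("bad step:  %.10f\n", bad_step);    // prints ~34.0000005066
    printf("good step: %.10f\n", good_step);   // prints 34.0000000000
    return 0;
}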

Efficient way to make a Programmatic Audio Mixdown

I'm currently working on the iPhone with Audio Units and I'm playing four tracks simultaneously. To improve the performance of my setup, I thought it would be a good idea to minimize the number of Audio Units / threads, by mixing down the four tracks into one.
With the following code I'm processing the next buffer by adding up the samples of the four tracks, keeping them in the SInt16 range and adding them to a temporary buffer, which is later copied into the ioData.mBuffers of the Audio Unit.
Although it works, I don't have the impression that this is the most efficient way to do this.
SInt16* buffer = bufferToWriteTo;
int reads = bufferSize / sizeof(SInt16);
SInt16** files = circularBuffer->files;
float tempValue;
SInt16 values[reads];
int k, j;
int numFiles = 4;

for (k = 0; k < reads; k++)
{
    tempValue = 0.f;
    for (j = 0; j < numFiles; j++)
    {
        tempValue += files[j][packetNumber];
    }
    if (tempValue > 32767.f) tempValue = 32767.f;
    else if (tempValue < -32768.f) tempValue = -32768.f;
    values[k] = (SInt16) tempValue;
    values[k] += values[k] << 16;
    packetNumber++;
    if (packetNumber >= totalPackets) packetNumber = 0;
}
memcpy(buffer, values, bufferSize);
Any ideas or pointers to speed this up? And am I right that this isn't the most efficient approach?
The biggest improvement you can get from this code would be by not using floating-point arithmetic. While the arithmetic by itself is fast, the conversions that happen in the nested loops take a long time, especially on the ARM processor in the iPhone. You can achieve exactly the same results by using 'SInt32' instead of 'float' for the 'tempValue' variable.
Also, see if you can get rid of the memcpy() on the last line: perhaps you can construct 'buffer' directly, without using the temporary buffer called 'values'. That saves one copy, which would be a significant improvement for such a function.
Other notes: the last two lines of the loop probably belong outside of the loop and the body of the nested loop should use 'k' as a second index, instead of 'packetNumber', but I'm not sure about this logic.
And the last note: you're squashing the peaks of your resulting sound. While this seems like a good idea, it will sound pretty rough. You probably want to scale the result down instead of clipping it. That is, instead of this code
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber];
}
if (tempValue > 32767.f) tempValue = 32767.f;
else if (tempValue < -32768.f) tempValue = -32768.f;
you probably want something like this:
for (j = 0; j < numFiles; j++)
{
    tempValue += files[j][packetNumber] / numFiles;
}
Edit: and please do not forget to measure the performance before and after, to see which of the improvements had the biggest impact. This is the best way to learn about performance: trial and measurement.
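Putting those suggestions together, here is a rough sketch of the loop with an integer accumulator, scaling instead of clipping, and writing straight into the destination buffer. The variable names follow the question's code, and the values[k] << 16 line is left out because its intent isn't clear, so treat this as an illustration rather than a drop-in replacement:
SInt16 *out = bufferToWriteTo;              // write directly, no temporary buffer + memcpy
int reads = bufferSize / sizeof(SInt16);
SInt16 **files = circularBuffer->files;
const int numFiles = 4;
int k, j;

for (k = 0; k < reads; k++)
{
    SInt32 acc = 0;                         // integer accumulator instead of float
    for (j = 0; j < numFiles; j++)
    {
        acc += files[j][packetNumber] / numFiles;   // scale down instead of clipping
    }
    out[k] = (SInt16) acc;
    packetNumber++;
    if (packetNumber >= totalPackets) packetNumber = 0;
}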
A couple of pointers, even though I'm not really familiar with iPhone development.
You could unroll the inner loop. You don't need a for loop to add 4 numbers together, although your compiler may well do this for you.
Write directly into the buffer in your for loop; the memcpy at the end does another pass over the data just to copy it.
Don't use a float for tempValue. Depending on the hardware, integer math is quicker, and you don't need floats to sum channels.
Remove the if/else. Digital clipping will sound horrible anyway, so try to avoid it before summing the channels together, and branching inside a tight loop like this should be avoided if possible.
One thing I found when writing the audio mixing routines for my app is that incremented pointers worked much faster than indexing. Some compilers may sort this out for you (I'm not sure about the iPhone), but this gave my app a big boost in these tight loops (about 30%, if I recall).
e.g. instead of this:
for (k = 0; k < reads; k++)
{
    // Use buffer[k]
}
do this:
SInt16* p = buffer;
SInt16* pEnd = buffer + reads;
while (p != pEnd)
{
    // Use *p
    p++;
}
Also, I believe the iPhone has some sort of SIMD (single instruction, multiple data) support called VFP. This would let you perform math on a number of samples in one instruction, but I know little about this on the iPhone.
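For what it's worth, the SIMD unit on the newer iPhones (the 3GS and later) is NEON rather than VFP. Here is a speculative sketch of the inner mix using NEON intrinsics, assuming contiguous 16-bit samples and a count that is a multiple of 8 (the function and parameter names are made up):
#include <arm_neon.h>

// Speculative NEON sketch (untested on device): mixes four int16_t tracks
// into out, eight samples per iteration. Each track is pre-scaled by an
// arithmetic shift right of 2 (divide by 4) so the sum cannot clip.
void mix4_neon(const int16_t *t0, const int16_t *t1,
               const int16_t *t2, const int16_t *t3,
               int16_t *out, int count)      // count must be a multiple of 8
{
    for (int i = 0; i < count; i += 8) {
        int16x8_t acc = vshrq_n_s16(vld1q_s16(t0 + i), 2);
        acc = vaddq_s16(acc, vshrq_n_s16(vld1q_s16(t1 + i), 2));
        acc = vaddq_s16(acc, vshrq_n_s16(vld1q_s16(t2 + i), 2));
        acc = vaddq_s16(acc, vshrq_n_s16(vld1q_s16(t3 + i), 2));
        vst1q_s16(out + i, acc);
    }
}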

Why is this C-style code 10X slower than this obj-C style code?

// obj C version, with some - less than one second on 18,000 iterations
for (NSString* coordStr in splitPoints) {
    const char *buf = [coordStr UTF8String];
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
    i++;
}

// C version - over 13 seconds on 18,000 iterations
for (i = 0; buf != NULL; buf = strchr(buf, '['), ++i) {
    buf += sizeof(char);
    sscanf(buf, "%f,%f,", &routePoints[i].latitude, &routePoints[i].longitude);
}
As a corollary question, is there any way to make this loop faster?
Also see this question: Another Speed Boost Possible?
Measure, measure, measure.
Measure the code with the Sampler instrument in Instruments.
With that said, there is an obvious inefficiency in the C code compared to the Objective-C code.
Namely, fast enumeration -- the for(x in y) syntax -- is really fast and, more importantly, implies that splitPoints is an array or set that contains a bunch of data that has already been parsed into individual objects.
The strchr() call in the second loop implies that you are parsing stuff on the fly. In and of itself, strchr() is a looping operation and will consume time, more so as the number of characters between occurrences of the target character increases.
That is all conjecture, though. As with all optimizations, speculation is useless and gathering concrete data using the [rather awesome] set of tools provided is the only way to know for sure.
Once you have measured, then you can make it faster.
Having nothing to do with performance, your C code has an error in it. buf += sizeof(char) should simply be buf++. Pointer arithmetic always moves in units the size of the type. It worked fine in this case because sizeof(char) was 1.
The Objective-C code looks like it has precomputed the split points, while the C code seeks them in each iteration. Simple answer: if N is the length of buf and M the number of split points, the two snippets have complexities of roughly O(M) versus O(N*M); which one is slower?
Edit: it really amazes me, though, that some would think C code is axiomatically faster than any other solution.
Vectorization can be used to speed up C code.
Example:
Even faster UTF-8 character counting
(But maybe just try to avoid the function call strchr() in the loop condition.)