C vs assembler vs NEON performance

C vs assembler vs NEON performance - iphone

I am working on an iPhone application that does real time image processing. One of the earliest steps in its pipeline is to convert a BGRA image to greyscale. I tried several different methods and the difference in timing results is far greater than I had imagined possible. First I tried using C. I approximate the conversion to luminosity by adding B+2*G+R /4
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
uchar *pOut = imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;
}
}
This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
unsigned int *pOut = (unsigned int*) imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
register unsigned int nBGRA1 asm("r4");
register unsigned int nBGRA2 asm("r5");
unsigned int nZero=0;
unsigned int nSum1;
unsigned int nSum2;
unsigned int nPacked1;
asm volatile(
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0] \n" // Load in two BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nPacked1], %[nSum1], LSR #2 \n" // Init packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n" // Add to packed word
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8] \n" // Load in two more BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nSum1], %[nSum1], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n" // Add to packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n" // Add to packed word
///////////
////////////
: [pIn]"+r" (pIn),
[nBGRA1]"+r"(nBGRA1),
[nBGRA2]"+r"(nBGRA2),
[nZero]"+r"(nZero),
[nSum1]"+r"(nSum1),
[nSum2]"+r"(nSum2),
[nPacked1]"+r"(nPacked1)
:
: "cc" );
*pOut = nPacked1;
pOut++;
}
}
This function converts the same image in 12ms, almost 5X faster! I have not programmed in assembler before but I assumed that it would not be this much faster than C for such a simple operation. Inspired by this success I continued searching and discovered a NEON conversion example here.
void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr %2, %2, #3 \n"
"# build the three constants: \n"
"mov r4, #28 \n" // Blue channel multiplier
"mov r5, #151 \n" // Green channel multiplier
"mov r6, #77 \n" // Red channel multiplier
"vdup.8 d4, r4 \n"
"vdup.8 d5, r5 \n"
"vdup.8 d6, r6 \n"
"0: \n"
"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"
"# do the weight average: \n"
"vmull.u8 q7, d0, d4 \n"
"vmlal.u8 q7, d1, d5 \n"
"vmlal.u8 q7, d2, d6 \n"
"# shift and store: \n"
"vshrn.u16 d7, q7, #8 \n" // Divide q3 by 256 and store in the d7
"vst1.8 {d7}, [%0]! \n"
"subs %2, %2, #1 \n" // Decrement iteration count
"bne 0b \n" // Repeat unil iteration count is not zero
:
: "r"(output_data),
"r"(input_data),
"r"(tot_pixels)
: "r4", "r5", "r6"
);
}
The timing results were hard to believe. It converts the same image in 1 ms. 12X faster than assembler and an astounding 55X faster than C. I had no idea that such performance gains were possible. In light of this I have a few questions. First off, am I doing something terribly wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are at all accurate, in what kinds of situations can I expect to see these gains? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55X faster. Should I be learning assembler/NEON and using them inside any loop that takes an appreciable amount of time?
Update 1: I have posted the assembler output from my C function in a text file at
http://temp-share.com/show/f3Yg87jQn It was far too large to include directly here.
Timing is done using OpenCV functions.
double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount())-duration;
duration /= cv::getTickFrequency();
//duration should now be elapsed time in ms
Results
I tested several suggested improvements. First, as recommended by Viktor I reordered the inner loop to put all fetches first. The inner loop then looked like.
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
//Jul 16, 2012 MR: Read and writes collected
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut +=4;
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
}
This change brought processing time down to 53ms an improvement of 2ms. Next as recommended by Victor I changed my function to fetch as uint. The inner loop then looked like
unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;
for(; in_int < end; in_int+=4) // Does four pixels at a time
{
unsigned int pixelA = in_int[0];
unsigned int pixelB = in_int[1];
unsigned int pixelC = in_int[2];
unsigned int pixelD = in_int[3];
uchar* byteA = (uchar*)&pixelA;
uchar* byteB = (uchar*)&pixelB;
uchar* byteC = (uchar*)&pixelC;
uchar* byteD = (uchar*)&pixelD;
unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];
out[0] = sumA / 4;
out[1] = sumB / 4;
out[2] = sumC / 4;
out[3] = sumD / 4;
out +=4;
}
This modification had a dramatic effect, dropping processing time to 14ms, a drop of 39ms (75%). This last result is very close the the assembler performance of 11ms. The final optimization as recommended by rob was to include the __restrict keyword. I added it in front of every pointer declaration changing the following lines
__restrict unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
__restrict uchar* out = temp.data;
...
__restrict uchar* byteA = (uchar*)&pixelA;
__restrict uchar* byteB = (uchar*)&pixelB;
__restrict uchar* byteC = (uchar*)&pixelC;
__restrict uchar* byteD = (uchar*)&pixelD;
...
These changes had no measurable effect on processing time. Thank you for all your help, I will be paying much closer attention to memory management in the future.

There is an explanation here concerning some of the reasons for NEON's "success": http://hilbert-space.de/?p=22
Try compiling you C code with the "-S -O3" switches to see the optimized output of the GCC compiler.
IMHO, the key to success is the optimized read/write pattern employed by both assembly versions. And NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use the 'unsigned ints').
See these lines in the loop:
unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;
The reads and writes are really mixed. Slightly better version of the loop's cycle would be
// and the pIn reads can be combined into a single 4-byte fetch
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut +=4;
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
Keep in mind, that the "unsigned in sumA" line here can really mean the alloca() call (allocation on the stack), so you're wasting a lot of cycles on the temporary var allocations (the function call 4 times).
Also, the pIn[i] indexing does only a single-byte fetch from memory. The better way to do this is to read the int and then extract single bytes. To make things faster, use the "unsgined int*" to read 4 bytes (pIn[i * 4 + 0], pIn[i * 4 + 1], pIn[i * 4 + 2], pIn[i * 4 + 3]).
The NEON version is clearly superior: the lines
"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"
and
"#save everything in one shot \n"
"vst1.8 {d7}, [%0]! \n"
save most of the time for the memory access.

If performance is critically important (as it generally is with real-time image processing), you do need to pay attention to the machine code. As you have discovered, it can be especially important to use the vector instructions (which are designed for things like real-time image processing) -- and it is hard for compilers to automatically use the vector instructions effectively.
What you should try, before committing to assembly, is using compiler intrinsics. Compiler intrinsics aren't any more portable than assembly, but they should be easier to read and write, and easier for the compiler to work with. Aside from maintainability problems, the performance problem with assembly is that it effectively turns off the optimizer (you did use the appropriate compiler flag to turn it on, right?). That is: with inline assembly, the compiler is not able to tweak register assignment and so forth, so if you don't write your entire inner loop in assembly, it may still not be as efficient as it could be.
However, you will still be able to use your newfound assembly expertise to good effect -- as you can now inspect the assembly produced by your compiler, and figure out if it's being stupid. If so, you can tweak the C code (perhaps doing some pipelining by hand if the compiler isn't managing to), recompile it, look at the assembly output to see if the compiler is now doing what you want it to, then benchmark to see if it's actually running any faster...
If you've tried the above, and still can't provoke the compiler to do the right thing, go ahead and write your inner loop in assembly (and, again, check to see if the result is actually faster). For reasons described above, be sure to get the entire inner loop, including the loop branch.
Finally, as others have mentioned, take some time to try and figure out what "the right thing" is. Another benefit of learning your machine architecture is that it gives you a mental model of how things work -- so you will have a better chance of understanding how to put together efficient code.

Viktor Latypov's answer has lots of good information, but I want to point out one more thing: in your original C function, the compiler can't tell that pIn and pOut point to non-overlapping regions of memory. Now look at these lines:
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
The compiler has to assume that pOut[0] might be the same as pIn[4] or pIn[5] or pIn[6] (or any other pIn[x]). So it basically can't reorder any of the code in your loop.
You can tell the compiler that pIn and pOut don't overlap by declaring them __restrict:
__restrict uchar *pIn = (uchar*) imBGRA.data;
__restrict uchar *pOut = imByte.data;
This might speed up your original C version a bit.

This is kind of a toss up between performance and maintainability. Typically have an app load and function quickly is very nice for the user, but there is the trade off. Now your app is fairly difficult to maintain and the speed gains may be unwarranted. If the users of your app were complaining that it felt slow then these optimizations are worth the effort and lack of maintainability, but if it came from your need to speed up your app then you should not go this far into the optimization. If you are doing these images conversion at app startup then speed is not of the essence, but if you are constantly doing them ( and doing a lot of them ) while the app is running then they make more sense. Only optimize the parts of the app where the user spends time and actually experiences the slow down.
Also looking at the assembly they do not use division but rather only multiplications so look into that for your C code. Another instance is that it optimizes out your multiplication by 2 out to two additions. This again may be another trick as the multiplication may be slower on a iPhone application than an addition.

Related

Nibble shuffling with x64 SIMD

I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values), concretely I'd like to shuffle 16 nibbles in a 64-bit word. My shuffling indices are also stored as 16 nibbles. What's the most efficient implementation of this?

Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb and then re-pack that result.
Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.
Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.
But without doing that, re-packing is a separate problem. I guess combine adjacent pairs of bytes into a word in the low byte, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswd (against zero or itself) or pshufb (with a constant control vector).
If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.
// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>
uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
__m128i vd = _mm_cvtsi64_si128(data); // movq
__m128i vd_hi = _mm_srli_epi32(vd, 4); // x86 doesn't have a SIMD byte shift
vd = _mm_unpacklo_epi8(vd, vd_hi); // every nibble at the bottom of a byte, with high garbage
vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f)); // clear high garbage for later merging
__m128i vc = _mm_cvtsi64_si128(control);
__m128i vc_hi = _mm_srli_epi32(vc, 4);
vc = _mm_unpacklo_epi8(vc, vc_hi);
vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f)); // make sure high bit is clear, else pshufb zeros that element.
// AVX-512VBMI vpermb doesn't have that problem, if you have it available
vd = _mm_shuffle_epi8(vd, vc);
// left-hand input is the unsigned one, right hand is treated as signed bytes.
vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001)); // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.
// vd has nibbles merged into bytes, but interleaved with zero bytes
vd = _mm_packus_epi16(vd, vd); // duplicate vd into low & high halves.
// Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR and you want the high half of the vector to be zero.
return _mm_cvtsi128_si64(vd);
}
Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units. At least if they already had the uint64_t values in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If coming from GPRs, 1/clock throughput for vmovq xmm, reg means there's a resource conflict between the dep chains so they can't both start in the same cycle. But since we the data might be ready before the control, masking early keeps it off the critical path for control->output latency.
If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, might as well make it a pshufb constant instead of and.
If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.
clang's shuffle optimizer didn't spot anything, just compiling it as-written like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids. Not spotting that it could use vpermb instead of vpand / vpshufb.
shuffle_nibbles(unsigned long, unsigned long):
vmovq xmm0, rdi
vpsrld xmm1, xmm0, 4
vpunpcklbw xmm0, xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
vmovq xmm1, rsi
vpsrld xmm2, xmm1, 4
vpunpcklbw xmm1, xmm1, xmm2 # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
vpand xmm0, xmm0, xmm2
vpand xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm1
vpmaddubsw xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
vpackuswb xmm0, xmm0, xmm0
vmovq rax, xmm0
ret
(Without AVX, it requires 2 extra movdqa register-copy instructions.)

I came across this problem today. In AVX-512 you can use vpmultishiftqb (1), an amusing instruction available in Ice Lake and after (and apparently in Zen 4, according to Wikipedia), to shuffle nibbles much more quickly. Its power lies in its ability to permute bytes in an unaligned fashion: It takes the eight 8-bit chunks in each 64-bit element and selects unaligned 8-bit chunks from the corresponding element. Below is an implementation.
#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
// Convention: (a & (0xf << (4 * i))) >> (4 * i) is the ith nibble of a
// (i.e., lowest-significant is 0)
uint64_t shuffle_nibbles(uint64_t data, uint64_t indices) {
#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
// If your data is already in vectors, then this method also works in parallel
const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);
__m128i v_data = _mm_cvtsi64_si128(data);
__m128i v_indices = _mm_cvtsi64_si128(indices);
__m128i indices_lo = _mm_and_si128(lo_nibble_msk, v_indices);
__m128i indices_hi = _mm_andnot_si128(lo_nibble_msk, v_indices);
indices_lo = _mm_slli_epi32(indices_lo, 2);
indices_hi = _mm_srli_epi32(indices_hi, 2);
// Look up unaligned bytes
__m128i shuffled_hi = _mm_multishift_epi64_epi8(indices_hi, v_data);
__m128i shuffled_lo = _mm_multishift_epi64_epi8(indices_lo, v_data);
shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);
// msk ? lo : hi
__m128i shuffled = _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);
return _mm_cvtsi128_si64(shuffled);
#else
// Fallback scalar implementation (preferably Peter Cordes's SSE solution--this is as an example)
uint64_t result = 0;
for (int i = 0; i < 16; ++i) {
indices = (indices >> 60) + (indices << 4);
int idx = indices & 0xf;
result <<= 4;
result |= (data >> (4 * idx)) & 0xf;
}
return result;
#endif
}
int main() {
// 0xaa025411fe034102
uint64_t r1 = shuffle_nibbles(0xfedcba9876543210, 0xaa025411fe034102);
// 0x55fdabee01fcbefd
uint64_t r2 = shuffle_nibbles(0x0123456789abcdef, 0xaa025411fe034102);
// 0xaaaa00002222aaaa
uint64_t r3 = shuffle_nibbles(0xaa025411fe034102, 0xeeee11110000ffff);
printf("0x%" PRIx64 "\n", r1);
printf("0x%" PRIx64 "\n", r2);
printf("0x%" PRIx64 "\n", r3);
}
Clang yields (2):
.LCPI0_0:
.zero 16,60
shuffle_nibbles(unsigned long, unsigned long):
vmovq xmm0, rdi
vmovq xmm1, rsi
vpslld xmm2, xmm1, 2
vpsrld xmm1, xmm1, 2
vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60]
vpand xmm1, xmm1, xmm3
vpmultishiftqb xmm1, xmm1, xmm0
vpand xmm2, xmm2, xmm3
vpmultishiftqb xmm0, xmm2, xmm0
vpslld xmm1, xmm1, 4
vpternlogd xmm1, xmm0, dword ptr [rip + .LCPI0_1]{1to4}, 216
vmovq rax, xmm1
In my case, I am shuffling nibbles in 64-bit-element vectors; this method also avoids the need for widening. If your shuffle(s) is/are constant and you stay in vectors, this method reduces to a measly four instructions: 2x vpmultishiftqb, 1x vpslld, and 1x vpternlogd. Counting µops suggests a latency of 5 and throughput of one every 2 cycles, bottlenecked on shuffle µops, for 128- and 256-bit vectors; and a throughput of 3 for 512-bit vectors, due to reduced execution units for the latter two instructions.

How to emulate really simple variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128/256
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSHLQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These only need yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>
typedef union
{
int32_t i;
float f;
} uSpec;
void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
uSpec u;
for (int i=0; i<num_loops; i++)
{
int32_t x = *(int32_t *)&in_array[i];
u.i = (127 + x) << 23;
int32_t r = (int32_t) u.f;
out_array[i] = r;
}
}

CRC-32 algorithm from HDL to software

I implemented a Galois Linear-Feedback Shift-Regiser in Verilog (and also in MATLAB, mainly to emulate the HDL design). It's been working great, and as of know I use MATLAB to calculate CRC-32 fields, and then include them in my HDL simulations to verify a data packet has arrived correctly (padding data with CRC-32), which produces good results.
The thing is I want to be able to calculate the CRC-32 I've implemented in software, because I'll be using a Raspberry Pi to input data through GPIO in my FPGA, and I haven't been able to do so. I've tried this online calculator, using the same parameters, but never get to yield the same result.
This is the MATLAB code I use to calculate my CRC-32:
N = 74*16;
data = [round(rand(1,N)) zeros(1,32)];
lfsr = ones(1,32);
next_lfsr = zeros(1,32);
for i = 1:length(data)
next_lfsr(1) = lfsr(2);
next_lfsr(2) = lfsr(3);
next_lfsr(3) = lfsr(4);
next_lfsr(4) = lfsr(5);
next_lfsr(5) = lfsr(6);
next_lfsr(6) = xor(lfsr(7),lfsr(1));
next_lfsr(7) = lfsr(8);
next_lfsr(8) = lfsr(9);
next_lfsr(9) = xor(lfsr(10),lfsr(1));
next_lfsr(10) = xor(lfsr(11),lfsr(1));
next_lfsr(11) = lfsr(12);
next_lfsr(12) = lfsr(13);
next_lfsr(13) = lfsr(14);
next_lfsr(14) = lfsr(15);
next_lfsr(15) = lfsr(16);
next_lfsr(16) = xor(lfsr(17), lfsr(1));
next_lfsr(17) = lfsr(18);
next_lfsr(18) = lfsr(19);
next_lfsr(19) = lfsr(20);
next_lfsr(20) = xor(lfsr(21),lfsr(1));
next_lfsr(21) = xor(lfsr(22),lfsr(1));
next_lfsr(22) = xor(lfsr(23),lfsr(1));
next_lfsr(23) = lfsr(24);
next_lfsr(24) = xor(lfsr(25), lfsr(1));
next_lfsr(25) = xor(lfsr(26), lfsr(1));
next_lfsr(26) = lfsr(27);
next_lfsr(27) = xor(lfsr(28), lfsr(1));
next_lfsr(28) = xor(lfsr(29), lfsr(1));
next_lfsr(29) = lfsr(30);
next_lfsr(30) = xor(lfsr(31), lfsr(1));
next_lfsr(31) = xor(lfsr(32), lfsr(1));
next_lfsr(32) = xor(data2(i), lfsr(1));
lfsr = next_lfsr;
end
crc32 = lfsr;
See I use a 32-zeroes padding to calculate the CRC-32 in the first place (whatever's left in the LFSR at the end is my CRC-32, and if I do the same replacing the zeroes with this CRC-32, my LFSR becomes empty at the end too, which means the verification passed).
The polynomial I'm using is the standard for CRC-32: 04C11DB7. See also that the order seems to be reversed, but that's just because it's mirrored to have the input in the MSB. The results of using this representation and a mirrored one are the same when the input is the same, only the result will be also mirrored.
Any ideas would be of great help.
Thanks in advance

Your CRC is not a CRC. The last 32 bits fed in don't actually participate in the calculation, other than being exclusive-or'ed into the result. That is, if you replace the last 32 bits of data with zeros, do your calculation, and then exclusive-or the last 32 bits of data with the resulting "crc32", then you will get the same result.
So you will never get it to match another CRC calculation, since it isn't a CRC.
This code in C replicates your function, where the data bits come from the series of n bytes at p, least significant bit first, and the result is a 32-bit value:
unsigned long notacrc(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
reg ^= (unsigned long)*dat++ << 24;
n--;
}
return reg;
}
You can immediately see that the last byte of data is simply exclusive-or'ed with the final register value. Less obvious is that the last four bytes are just exclusive-or'ed. This exactly equivalent version makes that evident:
unsigned long notacrc_xor(void const *p, unsigned n) {
unsigned char const *dat = p;
// initial register values
unsigned long const init[] = {
0xffffffff, 0x2dfd1072, 0xbe26ed00, 0x00be26ed, 0xdebb20e3};
unsigned xor = n > 3 ? 4 : n; // number of bytes merely xor'ed
unsigned long reg = init[xor];
while (n > xor) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
switch (n) {
case 4:
reg ^= *dat++;
case 3:
reg ^= (unsigned long)*dat++ << 8;
case 2:
reg ^= (unsigned long)*dat++ << 16;
case 1:
reg ^= (unsigned long)*dat++ << 24;
}
return reg;
}
There you can see that the last four bytes of the message, or all of the message if it is three or fewer bytes, is exclusive-or'ed with the final register value at the end.
An actual CRC must use all of the input data bits in determining when to exclusive-or the polynomial with the register. The inner part of that last function is what a CRC implementation looks like (though more efficient versions make use of pre-computed tables to process a byte or more at a time). Here is a function that computes an actual CRC:
unsigned long crc32_jam(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
return reg;
}
That one is called crc32_jam because it implements a particular CRC called "JAMCRC". That CRC is the closest to what you attempted to implement.
If you want to use a real CRC, you will need to update your Verilog implementation.

AudioQueue Recording Audio Sample

I am currently in the process of building an application that reads in audio from my iPhone's microphone, and then does some processing and visuals. Of course I am starting with the audio stuff first, but am having one minor problem.
I am defining my sampling rate to be 44100 Hz and defining my buffer to hold 4096 samples. Which is does. However, when I print this data out, copy it into MATLAB to double check accuracy, the sample rate I have to use is half of my iPhone defined rate, or 22050 Hz, for it to be correct.
I think it has something to do with the following code and how it is putting 2 bytes per packet, and when I am looping through the buffer, the buffer is spitting out the whole packet, which my code assumes is a single number. So what I am wondering is how to split up those packets and read them as individual numbers.
- (void)setupAudioFormat {
memset(&dataFormat, 0, sizeof(dataFormat));
dataFormat.mSampleRate = kSampleRate;
dataFormat.mFormatID = kAudioFormatLinearPCM;
dataFormat.mFramesPerPacket = 1;
dataFormat.mChannelsPerFrame = 1;
// dataFormat.mBytesPerFrame = 2;
// dataFormat.mBytesPerPacket = 2;
dataFormat.mBitsPerChannel = 16;
dataFormat.mReserved = 0;
dataFormat.mBytesPerPacket = dataFormat.mBytesPerFrame = (dataFormat.mBitsPerChannel / 8) * dataFormat.mChannelsPerFrame;
dataFormat.mFormatFlags =
kLinearPCMFormatFlagIsSignedInteger |
kLinearPCMFormatFlagIsPacked;
}
If what I described is unclear, please let me know. Thanks!
EDIT
Adding the code that I used to print the data
float *audioFloat = (float *)malloc(numBytes * sizeof(float));
int *temp = (int*)inBuffer->mAudioData;
int i;
float power = pow(2, 31);
for (i = 0;i<numBytes;i++) {
audioFloat[i] = temp[i]/power;
printf("%f ",audioFloat[i]);
}

I found the problem with what I was doing. It was a c pointer issue, and since I have never really programmed in C before, I of course got them wrong.
You can not directly cast inBuffer->mAudioData to an int array. So what I simply did was the following
SInt16 *buffer = malloc(sizeof(SInt16)*kBufferByteSize);
buffer = inBuffer->mAudioData;
This worked out just fine and now my data is of correct length and the data is represented properly.

I saw your answer, there also is an underlying issue which gives wrong sample data bytes which is because of an endian issue of bytes being swapped.
-(void)feedSamplesToEngine:(UInt32)audioDataBytesCapacity audioData:(void *)audioData {
int sampleCount = audioDataBytesCapacity / sizeof(SAMPLE_TYPE);
SAMPLE_TYPE *samples = (SAMPLE_TYPE*)audioData;
//SAMPLE_TYPE *sample_le = (SAMPLE_TYPE *)malloc(sizeof(SAMPLE_TYPE)*sampleCount );//for swapping endians
std::string shorts;
double power = pow(2,10);
for(int i = 0; i < sampleCount; i++)
{
SAMPLE_TYPE sample_le = (0xff00 & (samples[i] << 8)) | (0x00ff & (samples[i] >> 8)) ; //Endianess issue
char dataInterim[30];
sprintf(dataInterim,"%f ", sample_le/power); // normalize it.
shorts.append(dataInterim);
}

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have do the following bitwise AND operations among the elements of them.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iphone devices.
How could I accelerate this job on iPhone ?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...

Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap = 0;
uint32_t* a_int = (uint32_t*)a;
uint32_t* b_int = (uint32_t*)b;
a_leny = a_leny / 4;
a_lenx = a_lenx / 4;
a_pitch = a_pitch / 4;
b_pitch = b_pitch / 4;
for(int y=0; y<a_leny_int; y++)
for(int x=0; x<a_lenx_int; x++)
{
uint32_t aVal = a_int[x + y * a_pitch_int];
uint32_t bVal = b_int[x+y*b_pitch_int];
if (aVal & 0xFF) & (bVal & 0xFF)
overlap++;
if ((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF)
overlap++;
if ((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF)
overlap++;
if ((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF)
overlap++;
}
return overlap ;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.

Your code is Rambo for the CPU - its worst nightmare :
byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory
random access. Two absolutely unnecessary multiply/add operations in addition to the already steep performance penalty by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bavl, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}

First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse