Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays and have to do the following bitwise AND operation on their elements.
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y++)
for(int x=0; x<a_lenx; x++)
{
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap ;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iphone devices.
How could I accelerate this job on iPhone ?
I heard that NEON could be useful, but I'm not really familiar with it. In addition it seems that NEON doesn't have bitwise AND...

Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
#include <stdint.h>

// Assumes a_lenx, a_pitch and b_pitch are multiples of 4
// and that both buffers are 4-byte aligned.
int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    uint32_t* a_int = (uint32_t*)a;
    uint32_t* b_int = (uint32_t*)b;
    // Widths and pitches are counted in 32-bit words from here on;
    // the number of rows (a_leny) stays the same.
    int a_lenx_int = a_lenx / 4;
    int a_pitch_int = a_pitch / 4;
    int b_pitch_int = b_pitch / 4;
    for(int y=0; y<a_leny; y++)
        for(int x=0; x<a_lenx_int; x++)
        {
            uint32_t aVal = a_int[x + y * a_pitch_int];
            uint32_t bVal = b_int[x + y * b_pitch_int];
            // Test each of the four bytes packed into the word.
            if ((aVal & 0xFF) & (bVal & 0xFF))
                overlap++;
            if (((aVal >> 8) & 0xFF) & ((bVal >> 8) & 0xFF))
                overlap++;
            if (((aVal >> 16) & 0xFF) & ((bVal >> 16) & 0xFF))
                overlap++;
            if (((aVal >> 24) & 0xFF) & ((bVal >> 24) & 0xFF))
                overlap++;
        }
    return overlap;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
int overlap =0 ;
for(int y=0; y<a_leny; y+= 10)
for(int x=0; x<a_lenx; x+= 10)
{
//we compare 1% of all the pixels, and use that as the result
if(a[x + y * a_pitch] & b[x+y*b_pitch])
overlap++ ;
}
return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.

Your code is Rambo for the CPU, its worst nightmare:
1. Byte access. As aroth mentioned, ARM is VERY slow at reading bytes from memory.
2. Random access. Two absolutely unnecessary multiply/add operations per pixel, on top of the steep performance penalty that random access already carries by its nature.
Simply put, everything is wrong that can be wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how to handle variable declarations, loop, memory access, and AND operations.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
unsigned int *b, unsigned int b_pitch, unsigned int a_lenx, unsigned int a_leny)
{
unsigned int overlap =0;
unsigned int a_gap = (a_pitch - a_lenx)>>2;
unsigned int b_gap = (b_pitch - a_lenx)>>2;
unsigned int aval, bval, xcount;
do
{
xcount = (a_lenx>>2);
do
{
aval = *a++;
// ldr aval, [a], #4
bval = *b++;
// ldr bval, [b], #4
aval &= bval;
// and aval, aval, bval
if (aval & 0x000000ff) overlap += 1;
// tst aval, #0x000000ff
// addne overlap, overlap, #1
if (aval & 0x0000ff00) overlap += 1;
// tst aval, #0x0000ff00
// addne overlap, overlap, #1
if (aval & 0x00ff0000) overlap += 1;
// tst aval, #0x00ff0000
// addne overlap, overlap, #1
if (aval & 0xff000000) overlap += 1;
// tst aval, #0xff000000
// addne overlap, overlap, #1
} while (--xcount);
a += a_gap;
b += b_gap;
} while (--a_leny);
return overlap;
}

First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.
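A minimal sketch of the pointer-increment idea above, assuming the same pitch and length conventions as the question's compare(). The name compare_ptr is made up for illustration. A single flat loop is only possible when both pitches equal a_lenx, so this sketch keeps one loop per row and just advances the row pointers by the pitch:

```c
#include <stddef.h>

/* Hypothetical variant of compare(): no x + y * pitch per pixel,
   only pointer increments inside a row and one add per row. */
int compare_ptr(const unsigned char *a, int a_pitch,
                const unsigned char *b, int b_pitch,
                int a_lenx, int a_leny)
{
    int overlap = 0;
    for (int y = 0; y < a_leny; y++) {
        const unsigned char *pa = a;
        const unsigned char *pb = b;
        const unsigned char *row_end = a + a_lenx;
        while (pa < row_end) {
            /* Adds 1 when the two bytes share at least one set bit. */
            overlap += (*pa++ & *pb++) != 0;
        }
        a += a_pitch;   /* step to the next row of each block */
        b += b_pitch;
    }
    return overlap;
}
```

Whether this beats the indexed version depends on the compiler, which may already hoist the multiply out of the inner loop, so it is worth measuring on the device.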

Related

Flutter/Dart List with set size and bit shifting question

I'm writing to a piece of hardware using bluetooth and need to format my data in a specific way.
When I get the value from the device I have to do a little bit shifting to get the correct answer.
Here is a breakdown of the values I am getting back from the device.
byte[1] = (unsigned char)temp;
byte[2] = (unsigned char)(temp>>8);
byte[3] = (unsigned char)(temp>>16);
byte[4] = (unsigned char)(temp>>24);
It is a List with a size of 4. A real world example would be this:
byte[1] = '46';
byte[2] = '2';
byte[3] = '0';
byte[4] = '0';
This should work out to be
558
My working code to get this is:
int _shiftLeft(int n, int amount) {
return n << amount;
}
int _getValue(List<int> list) {
int temp;
temp = list[1];
temp += _shiftLeft(list[2], 8);
temp += _shiftLeft(list[3], 16);
temp += _shiftLeft(list[4], 24);
return temp;
}
The actual list I get back from the device is quite large but I only need values 1-4.
This works great and gets me the correct value back. Now I have to write to the device. So if I have a value of 558, I need to build a list of size 4 with the same bit shifting but in reverse. Following the exact method above but in reverse. What is the best way to do this?
Basically if I pass a method a value of '558' I need to get back a List<int> of [46,2,0,0]
You can get only the lower 8 bits by the bitwise AND operation & 255 (or & 0xFF).
Just combining this with bit shifting will do.
int _shiftRight(int n, int amount) {
return n >> amount;
}
List<int> _getList(int value) {
final list = <int>[];
list.add(value & 255);
list.add(_shiftRight(value, 8) & 255);
list.add(_shiftRight(value, 16) & 255);
list.add(_shiftRight(value, 24) & 255);
return list;
}
It can be simplified using for as follows:
List<int> _getList(int value) {
final list = <int>[];
for (int i = 0; i < 4; i++) {
list.add(value >> i * 8 & 255);
}
return list;
}
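For comparison, here is roughly what the same round trip looks like in C, the language the device-side snippet in the question appears to use. The function names pack_le32 and unpack_le32 are made up for illustration, and unlike the question's list the buffer here is indexed from 0:

```c
#include <stdint.h>

/* Split a 32-bit value into 4 bytes, least significant byte first. */
void pack_le32(uint32_t value, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)((value >> (i * 8)) & 0xFF);
}

/* Rebuild the 32-bit value from 4 bytes, least significant byte first. */
uint32_t unpack_le32(const uint8_t in[4])
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++)
        value |= (uint32_t)in[i] << (i * 8);
    return value;
}
```

Packing 558 this way gives the bytes 46, 2, 0, 0, and unpacking them returns 558 again, which matches the example in the question.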

How to emulate *really simple* variable bit shifts with SSE?

I have two variable bit-shifting code fragments that I want to SSE-vectorize by some means:
1) a = 1 << b (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/2/4/8/16/32/64/128
2) a = 1 << (8 * b) (where b = 0..7 exactly), i.e. 0/1/2/3/4/5/6/7 -> 1/0x100/0x10000/etc
OK, I know that AMD's XOP VPSHLQ would do this, as would AVX2's VPSLLVQ. But my challenge here is whether this can be achieved on 'normal' (i.e. up to SSE4.2) SSE.
So, is there some funky SSE-family opcode sequence that will achieve the effect of either of these code fragments? These need only yield the listed output values for the specific input values (0-7).
Update: here's my attempt at 1), based on Peter Cordes' suggestion of using the floating point exponent to do simple variable bitshifting:
#include <stdint.h>
typedef union
{
int32_t i;
float f;
} uSpec;
void do_pow2(uint64_t *in_array, uint64_t *out_array, int num_loops)
{
uSpec u;
for (int i=0; i<num_loops; i++)
{
int32_t x = *(int32_t *)&in_array[i];
u.i = (127 + x) << 23;
int32_t r = (int32_t) u.f;
out_array[i] = r;
}
}
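The scalar loop above maps fairly directly onto SSE2 intrinsics. The following is only a sketch of fragment 1), assuming four 32-bit inputs per vector rather than the 64-bit elements used in do_pow2, and the function name pow2_x4 is made up. It builds the float 2^x by writing 127 + x into the exponent field and then converts back to an integer:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Computes out[i] = 1 << in[i] for four lanes at once.
   Only meant for small non-negative inputs such as 0..7. */
void pow2_x4(const int32_t in[4], int32_t out[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    /* (127 + x) << 23 is the bit pattern of the float 2^x. */
    __m128i bits = _mm_slli_epi32(_mm_add_epi32(x, _mm_set1_epi32(127)), 23);
    /* Reinterpret as floats, then convert with truncation to int32. */
    __m128i r = _mm_cvttps_epi32(_mm_castsi128_ps(bits));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Fragment 2) is not covered by this sketch: 1 << (8 * b) exceeds the int32 range for b >= 4, so it would need a different approach such as a small lookup.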

CRC-32 algorithm from HDL to software

I implemented a Galois Linear-Feedback Shift-Register in Verilog (and also in MATLAB, mainly to emulate the HDL design). It's been working great, and as of now I use MATLAB to calculate CRC-32 fields, and then include them in my HDL simulations to verify that a data packet has arrived correctly (padding data with CRC-32), which produces good results.
The thing is I want to be able to calculate the CRC-32 I've implemented in software, because I'll be using a Raspberry Pi to input data through GPIO in my FPGA, and I haven't been able to do so. I've tried this online calculator, using the same parameters, but never get to yield the same result.
This is the MATLAB code I use to calculate my CRC-32:
N = 74*16;
data = [round(rand(1,N)) zeros(1,32)];
lfsr = ones(1,32);
next_lfsr = zeros(1,32);
for i = 1:length(data)
next_lfsr(1) = lfsr(2);
next_lfsr(2) = lfsr(3);
next_lfsr(3) = lfsr(4);
next_lfsr(4) = lfsr(5);
next_lfsr(5) = lfsr(6);
next_lfsr(6) = xor(lfsr(7),lfsr(1));
next_lfsr(7) = lfsr(8);
next_lfsr(8) = lfsr(9);
next_lfsr(9) = xor(lfsr(10),lfsr(1));
next_lfsr(10) = xor(lfsr(11),lfsr(1));
next_lfsr(11) = lfsr(12);
next_lfsr(12) = lfsr(13);
next_lfsr(13) = lfsr(14);
next_lfsr(14) = lfsr(15);
next_lfsr(15) = lfsr(16);
next_lfsr(16) = xor(lfsr(17), lfsr(1));
next_lfsr(17) = lfsr(18);
next_lfsr(18) = lfsr(19);
next_lfsr(19) = lfsr(20);
next_lfsr(20) = xor(lfsr(21),lfsr(1));
next_lfsr(21) = xor(lfsr(22),lfsr(1));
next_lfsr(22) = xor(lfsr(23),lfsr(1));
next_lfsr(23) = lfsr(24);
next_lfsr(24) = xor(lfsr(25), lfsr(1));
next_lfsr(25) = xor(lfsr(26), lfsr(1));
next_lfsr(26) = lfsr(27);
next_lfsr(27) = xor(lfsr(28), lfsr(1));
next_lfsr(28) = xor(lfsr(29), lfsr(1));
next_lfsr(29) = lfsr(30);
next_lfsr(30) = xor(lfsr(31), lfsr(1));
next_lfsr(31) = xor(lfsr(32), lfsr(1));
next_lfsr(32) = xor(data(i), lfsr(1));
lfsr = next_lfsr;
end
crc32 = lfsr;
See I use a 32-zeroes padding to calculate the CRC-32 in the first place (whatever's left in the LFSR at the end is my CRC-32, and if I do the same replacing the zeroes with this CRC-32, my LFSR becomes empty at the end too, which means the verification passed).
The polynomial I'm using is the standard for CRC-32: 04C11DB7. See also that the order seems to be reversed, but that's just because it's mirrored to have the input in the MSB. The results of using this representation and a mirrored one are the same when the input is the same, only the result will be also mirrored.
Any ideas would be of great help.
Thanks in advance
Your CRC is not a CRC. The last 32 bits fed in don't actually participate in the calculation, other than being exclusive-or'ed into the result. That is, if you replace the last 32 bits of data with zeros, do your calculation, and then exclusive-or the last 32 bits of data with the resulting "crc32", then you will get the same result.
So you will never get it to match another CRC calculation, since it isn't a CRC.
This code in C replicates your function, where the data bits come from the series of n bytes at p, least significant bit first, and the result is a 32-bit value:
unsigned long notacrc(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
reg ^= (unsigned long)*dat++ << 24;
n--;
}
return reg;
}
You can immediately see that the last byte of data is simply exclusive-or'ed with the final register value. Less obvious is that the last four bytes are just exclusive-or'ed. This exactly equivalent version makes that evident:
unsigned long notacrc_xor(void const *p, unsigned n) {
unsigned char const *dat = p;
// initial register values
unsigned long const init[] = {
0xffffffff, 0x2dfd1072, 0xbe26ed00, 0x00be26ed, 0xdebb20e3};
unsigned xor = n > 3 ? 4 : n; // number of bytes merely xor'ed
unsigned long reg = init[xor];
while (n > xor) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
switch (n) {
case 4:
reg ^= *dat++;
case 3:
reg ^= (unsigned long)*dat++ << 8;
case 2:
reg ^= (unsigned long)*dat++ << 16;
case 1:
reg ^= (unsigned long)*dat++ << 24;
}
return reg;
}
There you can see that the last four bytes of the message, or all of the message if it is three or fewer bytes, is exclusive-or'ed with the final register value at the end.
An actual CRC must use all of the input data bits in determining when to exclusive-or the polynomial with the register. The inner part of that last function is what a CRC implementation looks like (though more efficient versions make use of pre-computed tables to process a byte or more at a time). Here is a function that computes an actual CRC:
unsigned long crc32_jam(void const *p, unsigned n) {
unsigned char const *dat = p;
unsigned long reg = 0xffffffff;
while (n) {
reg ^= *dat++;
for (unsigned k = 0; k < 8; k++)
reg = reg & 1 ? (reg >> 1) ^ 0xedb88320 : reg >> 1;
n--;
}
return reg;
}
That one is called crc32_jam because it implements a particular CRC called "JAMCRC". That CRC is the closest to what you attempted to implement.
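As mentioned above, faster implementations use a pre-computed table to process a byte at a time. Here is a minimal sketch of that idea for the same JAMCRC parameters. The name crc32_jam_table is made up, the table is built lazily on the first call, and that lazy setup is not thread-safe:

```c
#include <stddef.h>
#include <stdint.h>

/* Table-driven equivalent of crc32_jam: one table lookup per input byte. */
uint32_t crc32_jam_table(const void *p, size_t n)
{
    static uint32_t table[256];
    static int table_ready = 0;
    if (!table_ready) {
        /* table[i] is the register after clocking the byte i through
           eight steps of the bit-wise loop. */
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t reg = i;
            for (unsigned k = 0; k < 8; k++)
                reg = (reg & 1) ? (reg >> 1) ^ 0xedb88320u : reg >> 1;
            table[i] = reg;
        }
        table_ready = 1;
    }
    const unsigned char *dat = p;
    uint32_t reg = 0xffffffffu;
    while (n--)
        reg = (reg >> 8) ^ table[(reg ^ *dat++) & 0xff];
    return reg;
}
```

For the nine ASCII bytes "123456789" this returns 0x340BC6D9, the usual check value listed for CRC-32/JAMCRC, which is a handy sanity test when porting it.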
If you want to use a real CRC, you will need to update your Verilog implementation.

How to perform right shift on RISC

I'd like to know how I can perform a right shift on a Reduced Instruction Set Computer that does not offer this operation on its own.
A left shift can be simply done by adding a register to itself but how about a right shift?
The RISC offers only:
ADD
NOT
NXOR (XOR)
AND (NAND)
so OR and NOR can all be emulated by several (N)AND and NOT operations.
The C program below uses only authorized instructions plus conditional jumps, and it shifts input into output by 1.
If the instruction you are trying to emulate is “shift by n”, then you should start with c equal to 2^n.
unsigned int shift_right(unsigned int input) {
unsigned int d = 1;
unsigned int output = 0;
for (unsigned int c = 2; c != 0; c += c) // c wraps to 0 after the top bit, which ends the loop
{
if (c & input)
output |= d;
d += d;
}
return output;
}
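A sketch of the "shift by n" generalisation described above, under the same restrictions (only additions, AND and conditional jumps). The name shift_right_n is made up. It first builds c = 2^n by repeated doubling and then runs the same bit-copying loop:

```c
/* Emulates input >> n using only ADD, AND and conditional jumps. */
unsigned int shift_right_n(unsigned int input, unsigned int n)
{
    unsigned int c = 1;
    /* Double c n times so that c == 2^n (c becomes 0 if n >= word width). */
    for (unsigned int i = 0; i < n && c != 0; i++)
        c += c;

    unsigned int d = 1;
    unsigned int output = 0;
    while (c != 0) {          /* c wraps to 0 after the top bit */
        if (c & input)
            output += d;      /* each d is a distinct bit, so ADD acts as OR */
        d += d;
        c += c;
    }
    return output;
}
```

Using output += d instead of output |= d avoids having to emulate OR, since no two iterations ever set the same bit.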

Carefully deleting N items from a "circular" vector (or perhaps just an NSMutableArray)

Imagine a std::vector, say, with 100 things in it (0 to 99) currently. You are treating it as a loop. So the 105th item is index 4; forward 7 from index 98 is 5.
You want to delete N items after index position P.
So, delete 5 items after index 50; easy.
Or 5 items after index 99: you delete index 0 five times (or indices 4 down to 0), noting that index 99 no longer exists afterwards.
Worst, 5 items after index 97 - you have to deal with both modes of deletion.
What's the elegant and solid approach?
Here's a boring routine I wrote
-(void)knotRemovalHelper:(NSMutableArray*)original
after:(NSInteger)nn howManyToDelete:(NSInteger)desired
{
#define ORCO ((NSInteger)[original count])
static NSInteger kount, howManyUntilLoop, howManyExtraAferLoop;
if ( ... our array is NOT a loop ... )
// trivial, if messy...
{
for ( kount = 1; kount<=desired; ++kount )
{
if ( (nn+1) >= ORCO )
return;
[original removeObjectAtIndex:( nn+1 )];
}
return;
}
else // our array is a loop
// messy, confusing and inelegant. how to improve?
// here we go...
{
howManyUntilLoop = (ORCO-1) - nn;
if ( howManyUntilLoop > desired )
{
for ( kount = 1; kount<=desired; ++kount )
[original removeObjectAtIndex:( nn+1 )];
return;
}
howManyExtraAferLoop = desired - howManyUntilLoop;
for ( kount = 1; kount<=howManyUntilLoop; ++kount )
[original removeObjectAtIndex:( nn+1 )];
for ( kount = 1; kount<=howManyExtraAferLoop; ++kount )
[original removeObjectAtIndex:0];
return;
}
#undef ORCO
}
Update!
InVariant's second answer leads to the following excellent solution. "starting with" is much better than "starting after". So the routine now uses "start with". Invariant's second answer leads to this very simple solution...
N times do if P < currentsize remove P else remove 0
-(void)removeLoopilyFrom:(NSMutableArray*)ra
startingWithThisOne:(NSInteger)removeThisOneFirst
howManyToDelete:(NSInteger)countToDelete
{
// exception if removeThisOneFirst > ra highestIndex
// exception if countToDelete is > ra size
// so easy thanks to Invariant:
for ( NSInteger i = 0; i < countToDelete; ++i )
{
if ( removeThisOneFirst < [ra count] )
[ra removeObjectAtIndex:removeThisOneFirst];
else
[ra removeObjectAtIndex:0];
}
}
Update!
Toolbox has pointed out the excellent idea of working to a new array - super KISS.
Here's an idea off the top of my head.
First, generate an array of integers representing the indices to remove. So "remove 5 from index 97" would generate [97,98,99,0,1]. This can be done with the application of a simple modulus operator.
Then, sort this array descending giving [99,98,97,1,0] and then remove the entries in that order.
Should work in all cases.
This solution seems to work, and it copies all remaining elements in the vector only once (to their final destination).
Assume kNumElements, kStartIndex, and kNumToRemove are defined as const size_t values.
vector<int> my_vec(kNumElements);
for (size_t i = 0; i < my_vec.size(); ++i) {
my_vec[i] = i;
}
for (size_t i = 0, cur = 0; i < my_vec.size(); ++i) {
// What is the "distance" from the current index to the start, taking
// into account the wrapping behavior?
size_t distance = (i + kNumElements - kStartIndex) % kNumElements;
// If it's not one of the ones to remove, then we keep it by copying it
// into its proper place.
if (distance >= kNumToRemove) {
my_vec[cur++] = my_vec[i];
}
}
my_vec.resize(kNumElements - kNumToRemove);
There's nothing wrong with two loop solutions as long as they're readable and don't do anything redundant. I don't know Objective-C syntax, but here's the pseudocode approach I'd take:
endIdx = after + howManyToDelete
if (Len <= endIdx) //will have a second loop
firstpass = Len - after - 1; //handle end in the first loop, beginning in second
else
firstpass = howManyToDelete; //the first loop will get them all
for (kount = 0; kount < firstpass; kount++)
remove after+1
for ( ; kount < howManyToDelete; kount++) //if firstpass < howManyToDelete, clean up leftovers
remove 0
This solution doesn't use mod, does the limit calculation outside the loop, and touches the relevant samples once each. The second for loop won't execute if all the samples were handled in the first loop.
The common way to do this in DSP is with a circular buffer. This is just a fixed length buffer with two associated counters:
//make sure BUFSIZE is a power of 2 for quick mod trick
#define BUFSIZE 1024
int CircBuf[BUFSIZE];
int InCtr, OutCtr;
void PutData(int *Buf, int count) {
int srcCtr;
int destCtr = InCtr & (BUFSIZE - 1); // if BUFSIZE is a power of 2, equivalent to and faster than destCtr = InCtr % BUFSIZE
for (srcCtr = 0; (srcCtr < count) && (destCtr < BUFSIZE); srcCtr++, destCtr++)
CircBuf[destCtr] = Buf[srcCtr];
for (destCtr = 0; srcCtr < count; srcCtr++, destCtr++)
CircBuf[destCtr] = Buf[srcCtr];
InCtr += count;
}
void GetData(int *Buf, int count) {
int srcCtr = OutCtr & (BUFSIZE - 1);
int destCtr = 0;
for (destCtr = 0; (srcCtr < BUFSIZE) && (destCtr < count); srcCtr++, destCtr++)
Buf[destCtr] = CircBuf[srcCtr];
for (srcCtr = 0; destCtr < count; srcCtr++, destCtr++) // wrapped part: stop once count items have been copied
Buf[destCtr] = CircBuf[srcCtr];
OutCtr += count;
}
int BufferOverflow() {
return ((InCtr - OutCtr) > BUFSIZE);
}
This is pretty lightweight, but effective. And aside from the ctr = BigCtr & (SIZE-1) stuff, I'd argue it's highly readable. The only reason for the & trick is in old DSP environments, mod was an expensive operation so for something that ran often, like every time a buffer was ready for processing, you'd find ways to remove stuff like that. And if you were doing FFT's, your buffers were probably a power of 2 anyway.
These days, of course, you have 1 GHz processors and magically resizing arrays. You kids get off my lawn.
Another method:
N times do {remove entry at index P mod max(ArraySize, P)}
Example:
N=5, P=97, ArraySize=100
1: max(100, 97)=100 so remove at 97%100 = 97
2: max(99, 97)=99 so remove at 97%99 = 97 // array size is now 99
3: max(98, 97)=98 so remove at 97%98 = 97
4: max(97, 97)=97 so remove at 97%97 = 0
5: max(96, 97)=97 so remove at 97%97 = 0
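A small sketch of this rule on a plain C array, just to make the index arithmetic concrete. The function name remove_circular and the use of memmove to close the gap are my own choices, not something from the answer above:

```c
#include <stddef.h>
#include <string.h>

/* Removes n items from a "circular" array, starting with index p.
   Returns the new length. Each removal shifts the tail down by one. */
size_t remove_circular(int *items, size_t len, size_t p, size_t n)
{
    while (n-- > 0 && len > 0) {
        size_t m = len > p ? len : p;   /* max(ArraySize, P) */
        size_t idx = p % m;             /* p while p < len, otherwise 0 */
        memmove(&items[idx], &items[idx + 1],
                (len - idx - 1) * sizeof items[0]);
        len--;
    }
    return len;
}
```

With 100 items holding 0..99, removing 5 starting at index 97 leaves 95 items holding 2..96, the same sequence of removals as in the worked example. Shifting the tail on every removal is the simple option here, not the fastest one.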
I don't program for the iPhone for now, so I imagine a std::vector; it's quite easy, simple and elegant enough:
#include <iostream>
using std::cout;
#include <vector>
using std::vector;
#include <cassert> //no need for using, assert is macro
template<typename T>
void eraseCircularVector(vector<T> & vec, size_t position, size_t count)
{
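assert(count <= vec.size());
if (count == vec.size())
vec.clear(); //erasing everything: the modulo below would otherwise give an empty range
else if (count > 0)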
{
position %= vec.size(); //normalize position
size_t positionEnd = (position + count) % vec.size();
if (positionEnd < position)
{
vec.erase(vec.begin() + position, vec.end());
vec.erase(vec.begin(), vec.begin() + positionEnd);
}
else
vec.erase(vec.begin() + position, vec.begin() + positionEnd);
}
}
int main()
{
vector<int> values;
for (int i = 0; i < 10; ++i)
values.push_back(i);
cout << "Values: ";
for (vector<int>::const_iterator cit = values.begin(); cit != values.end(); cit++)
cout << *cit << ' ';
cout << '\n';
eraseCircularVector(values, 5, 1); //remains 9: 0,1,2,3,4,6,7,8,9
eraseCircularVector(values, 16, 5); //remains 4: 3,4,6,7
cout << "Values: ";
for (vector<int>::const_iterator cit = values.begin(); cit != values.end(); cit++)
cout << *cit << ' ';
cout << '\n';
return 0;
}
However, you might consider:
creating a new loop_vector class, if you use this kind of functionality often enough
using a list if you perform many deletions (or few deletions on a large array, other than from the end, which is a simple pop_back)
If your container (NSMutableArray or whatever) is not a list but a vector (i.e. a resizable array), you most definitely don't want to delete items one by one, but a whole range at once (e.g. std::vector's erase(begin, end))!
Edit: in response to a comment, consider what a vector must do when you erase an element other than the last one: it must move every value after that element (e.g. with 1000 items in the array, erasing the first one means moving 999 items, which is very costly).
Example:
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;
int main()
{
clock_t start, end;
vector<int> vec;
const int items = 64 * 1024;
cout << "using " << items << " items in vector\n";
for (size_t i = 0; i < items; ++i) vec.push_back(i);
start = clock();
while (!vec.empty()) vec.erase(vec.begin());
end = clock();
cout << "Inefficient method took: "
<< (end - start) * 1.0 / CLOCKS_PER_SEC << " s\n"; // clock ticks / CLOCKS_PER_SEC is seconds
for (size_t i = 0; i < items; ++i) vec.push_back(i);
start = clock();
vec.erase(vec.begin(), vec.end());
end = clock();
cout << "Efficient method took: "
<< (end - start) * 1.0 / CLOCKS_PER_SEC << " s\n";
return 0;
}
Produces output:
using 65536 items in vector
Inefficient method took: 1.705 s
Efficient method took: 0 s
Note that it's very easy to end up inefficient; have a look at e.g. http://www.cplusplus.com/reference/stl/vector/erase/