VFP Unit Matrix Multiply problem on the iPhone

I'm trying to write a Matrix3x3 multiply using the Vector Floating Point (VFP) unit on the iPhone, but I'm encountering some problems. This is my first attempt at writing any ARM assembly, so there could be a fairly simple solution that I'm not seeing.
I currently have a small application running on a maths library that I've written. I'm investigating the benefits the Vector Floating Point unit would provide, so I've taken my matrix multiply and converted it to asm. Previously the application ran without a problem, but now my objects all randomly disappear. This seems to be caused by the results of my matrix multiply becoming NaN at some point.
Here's the code:
IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
IMatrix3x3 C;
//C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;
C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;
C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;
//VFP ARM asm for the device
#else
//create a pointer to the Matrices
IMatrix3x3 * pA = &_A;
IMatrix3x3 * pB = &_B;
IMatrix3x3 * pC = &C;
//asm code
asm volatile(
//turn on a vector depth of 3
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00020000 \n\t"
"fmxr fpscr, r0 \n\t"
//load matrix B into the vector bank
"fldmias %1, {s8-s16} \n\t"
//load the first row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.A0, C.A1 and C.A2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the second row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.B0, C.B1 and C.B2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//load the third row of A into the scalar bank
"fldmias %0!, {s0-s2} \n\t"
//calculate C.C0, C.C1 and C.C2
"fmuls s17, s8, s0 \n\t"
"fmacs s17, s11, s1 \n\t"
"fmacs s17, s14, s2 \n\t"
//save this into the output
"fstmias %2!, {s17-s19} \n\t"
//set the vector depth back to 1
"fmrx r0, fpscr \n\t"
"bic r0, r0, #0x00370000 \n\t"
"orr r0, r0, #0x00000000 \n\t"
"fmxr fpscr, r0 \n\t"
//pass the inputs and set the clobber list
: "+r"(pA), "+r"(pB), "+r" (pC) :
:"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
);
#endif
return C;
}
As far as I can see, that makes sense. While debugging I noticed that if I assign _A = C after the asm and before the return, _A will not necessarily be equal to C, which has only increased my confusion. I thought it was possibly due to the pointers I'm giving to the VFP being incremented by lines such as "fldmias %0!, {s0-s2} \n\t"; however, my understanding of asm is not good enough to properly understand the problem, nor to see an alternative approach to that line of code.
Anyway, I was hoping someone with a greater understanding than me would be able to see a solution. Any help would be greatly appreciated, thank you :-)
Edit: I've found that pC seems to be NULL when the asm code is hit, despite being set with pC = &C. I'm assuming this is due to the compiler rearranging the code in a manner that's breaking it? I've tried the various methods I've seen for stopping this from happening (like adding everything relevant to the input list, though this shouldn't even be necessary since I'm listing "memory" in the clobber list) and I'm still getting the same problems.
Edit #2: Right, the memory issue seems to have been caused by my not including "r0" in the clobber list; however, fixing that (if it is indeed fixed) doesn't seem to have fixed the problem. I noticed that multiplying a rotation matrix by the identity matrix doesn't work correctly and gives 0.88 as the last entry in the matrix instead of 1:
| 0.88  0.48  0 |   | 1  0  0 |   | 0.88  0.48  0    |
|-0.48  0.88  0 | * | 0  1  0 | = |-0.48  0.88  0    |
| 0     0     1 |   | 0  0  1 |   | 0     0     0.88 |
I figured then that my logic must be wrong somewhere, so I stepped through the assembly. Everything seems fine up until the last "fmacs s17, s14, s2 \n\t", where:
s0 = 0 s14 = 0 s17 = 0
s1 = 0 s15 = 0 s18 = 0
s2 = 1 s16 = 1 s19 = 0
so surely the fmacs is performing the operation:
s17 = s17 + s14 * s2 = 0 + 0 * 1 = 0
s18 = s18 + s15 * s2 = 0 + 0 * 1 = 0
s19 = s19 + s16 * s2 = 0 + 1 * 1 = 1
However, the result gives s19 = 0.88, which has left me even more confused :S Am I misunderstanding how fmacs works? (P.S. sorry for what has now become a really long question :-P)

Solved the problem! I was unaware that the vector banks are "circular".
The banks s0-s7, s8-s15, s16-s23 and s24-s31 can each hold vectors of up to length 8, and a run of registers is used as a vector simply by naming its first register, for example s16 with a length of 4. In my case, however, I had been using s14 with a length of 3, assuming this would give me s14, s15 and s16. Because the bank is circular, it instead rolls back to s8; in other words, I was using s14, s15 and s8.
It took me a long time to see that, so hopefully anyone else with a similar problem will find this :-)
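For anyone who wants to see the wrap-around in numbers, here is a small Python model of it. This is only a sketch: vector_registers and fmacs are hypothetical helpers written for illustration, not part of any toolchain, and the operand order (rotation matrix loaded into s8-s16) is an assumption chosen because it reproduces the 0.88 reported in the question.

```python
BANK_SIZE = 8  # each VFP register bank holds 8 single-precision registers


def vector_registers(start, length, stride=1):
    """Registers covered by a short vector beginning at s<start>.

    Hypothetical helper: indices wrap around inside the bank.
    """
    bank_base = (start // BANK_SIZE) * BANK_SIZE
    offset = start - bank_base
    return [bank_base + (offset + i * stride) % BANK_SIZE for i in range(length)]


def fmacs(regs, dest, src, scalar, length=3):
    """Model of a vector fmacs: dest[i] += src[i] * scalar."""
    pairs = zip(vector_registers(dest, length), vector_registers(src, length))
    for d, v in pairs:
        regs[d] += regs[v] * regs[scalar]


regs = [0.0] * 32
# Assumption: the rotation matrix is the one loaded into s8-s16.
regs[8:17] = [0.88, 0.48, 0.0, -0.48, 0.88, 0.0, 0.0, 0.0, 1.0]
regs[0:3] = [0.0, 0.0, 1.0]  # third row of the other matrix in s0-s2

fmacs(regs, dest=17, src=14, scalar=2)
print(vector_registers(14, 3))  # registers actually read for the s14 operand
print(regs[17:20])              # contents of s17, s18, s19 afterwards
```

The s14 operand resolves to s14, s15 and s8, so s19 picks up the value held in s8 rather than the one in s16.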

Related

How to include durations in an instrument definition in Csound

Using the function oscil, I define an oscillator bank with given frequencies and amplitudes:
instr 1
a1 oscil .3, 110
outs a1,a1
a2 oscil .2, 220
outs a2,a2
a3 oscil .1, 330
outs a3,a3
endin
I know that I can set the duration in the score section. But how can I give different durations to the different oscillators? Can I do this in the instrument definition? I want to be able to call the instrument (3 oscillators) with one line in the score:
;instr start duration
i 1 0 ;duration of oscils defined under instr 1
e
Opcode instances within an instrument instance all share the same processing context (i.e., p3/duration). There are a few different strategies one could use to get different durations here:
1. Use multiple instrument instances, with one oscillator per instrument. This is probably the most flexible, but also the most verbose.
2. Use some form of envelope and multiply it with the output of each oscillator. For example:
instr 1
p3 = 4
a1 oscil .3, 110
aenv1 linseg 1, 3, 1, 0.01, 0, 0.99, 0
a1 *= aenv1
outs a1,a1
...
endin
In #2, the duration is set by the instrument. The linseg is used as an envelope with the durations written into it. One could then use multiple linseg/oscil pairs and hand-write the durations for each part.
Something that comes to mind is to apply different envelopes to each sinusoid you create inside an instrument:
0dbfs = 1
instr 1
kFirstEnvelope line 0, p3, 1
kSecondEnvelope line 0.5, p3, 0.5
kThirdEnvelope line 1, p3, 0
aFirstSine oscili 1, 440
aSecondSine oscili 1, 660
aThirdSine oscili 1, 880
aMix balance aFirstSine * kFirstEnvelope + aSecondSine * kSecondEnvelope + aThirdSine * kThirdEnvelope, a(0.15)
outs aMix, aMix
endin
You could then call instr 1 from the score with a single line of code, and you would probably want to come up with more interesting envelopes than the ones above.
i 1 0 10
However, if you are doing additive synthesis, a more elegant approach would be to trigger multiple score events from a separate instrument using event_i within an until loop.
instr 2
seed 0
iNoteIndex = 0
iNoteCount = 30
until iNoteIndex == iNoteCount do
iRandomStart = random(0, p3)
iRandomDuration = random(1.2, 0.5 * p3)
event_i "i", 3, iRandomStart, iRandomDuration
iNoteIndex += 1
enduntil
endin
instr 3
iAttack = .2
iDecay = .2
iSustain = .4
iRelease = 0.6
aSineWave oscili 0.1, random(200, 4000)
kEnvelope adsr iAttack, iDecay, iSustain, iRelease
outs aSineWave * kEnvelope
endin
You can then call instr 2 from the score, and that will take care of calling instr 3.
i 2 0 10
Cheers

Sending float from PIC to Raspberry Pi

I'm trying to send a float value from a PIC 16F877A (EasyPIC4 development board) to a Raspberry Pi via UART.
mikroC code:
AValue = 4.88 * ADC_Read(2);
ptr = (unsigned char *)&AValue;
// send the float one byte at a time, lowest address first
for (i = 0; i < sizeof(AValue); i++)
    UART1_Write(*(ptr + i));
UART1_Write(0x0a);
delay_ms(100);
Python code:
import serial, time, struct
from pprint import pprint
ser = serial.Serial("/dev/ttyAMA0", 9600)
ser.write(raw_input("enter char: "))
while True:
    count = 0
    AValue = []
    for ch in ser.read():
        if ch == "\n":
            AValue = []
            time.sleep(0.1)
            while count < 4:
                for ch in ser.read():
                    AValue.append(ch)
                    count += 1
            flt = struct.unpack("<f", str("".join(AValue)))
            pprint(flt)
This is the output in the Python shell on the Raspberry Pi. The value changes as I move the pot around, as you can see, but only the zero value is correct. Actually, not even zero, since the first value should be 1*4.88.
(0.0,)
(-1.1472824864025526e-35),
(-3.2123910193243324e-35),
(-4.405564851100735e-35),
(-3.0045950051756514e-32),
Be careful: Microchip uses a custom float format.
Here you can see both encodings of the number 4997.12:
0F6h, 028h, 09Ch, 045h ; IEEE Real = 4997.12
0F6h, 028h, 01Ch, 08Bh ; Microchip Real = 4997.12
For more information, read: http://ww1.microchip.com/downloads/en/AppNotes/00575.pdf
EDIT: added x86 asm conversion routines...
procedure AnsiSingleToMFormat(var Data: single);
asm
mov cx,[eax + 2]
rcl cl,1
rcl ch,1
rcr cl,1
mov [eax + 2],cx
end;
procedure MFormatToAnsiSingle(var Data: single);
asm
mov cx,[eax + 2]
rcl cl,1
rcr ch,1
rcr cl,1
mov [eax + 2],cx
end;
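On the Raspberry Pi side the same conversion can be done in Python before unpacking. This is a minimal Python 3 sketch, assuming the PIC sends the four bytes in the order listed above (lowest byte first) and that the layout is exponent in the top byte, sign in bit 23 and a 23-bit mantissa below it, as the two example encodings suggest. The function names are made up for illustration.

```python
import struct


def microchip_to_ieee_bits(bits):
    """Rearrange a 32-bit Microchip float pattern into IEEE 754 single layout."""
    exponent = (bits >> 24) & 0xFF   # Microchip keeps the exponent in the top byte
    sign = (bits >> 23) & 0x1        # the sign sits just below the exponent
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits are shared by both formats
    return (sign << 31) | (exponent << 23) | mantissa


def microchip_bytes_to_float(raw):
    """Decode 4 little-endian bytes received from the PIC into a Python float."""
    (bits,) = struct.unpack("<I", raw)
    ieee = struct.pack("<I", microchip_to_ieee_bits(bits))
    return struct.unpack("<f", ieee)[0]


# The Microchip encoding of 4997.12 from the example above.
example = bytes([0xF6, 0x28, 0x1C, 0x8B])
value = microchip_bytes_to_float(example)
print(hex(microchip_to_ieee_bits(0x8B1C28F6)), value)
```

If the bytes are unpacked with "<f" directly, without moving the sign bit, the exponent is misread, which would be consistent with the tiny values shown in the question.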

Easy68K IF-ELSE branching

I'm writing my first assembly language program for class using Easy68K.
I'm using if-else branching to replicate the code:
IF (P > 12)
P = P * 8 + 3
ELSE
P = P - Q
PRINT P
But I think I have my branches wrong, because without the first halt in my code the program runs through the IF branch anyway, even after the CMP finds a case where P < 12. Am I missing something here, or is this a generally accepted way of doing this?
Here is my assembly code:
START: ORG $1000 ; Program starts at loc $1000
MOVE P, D1 ; [D1] <- P
MOVE Q, D2 ; [D2] <- Q
* Program code here
CMP #12, D1 ; is P > 12?
BGT IF ;
SUB D2, D1 ; P = P - Q
MOVE #3, D0 ; assign read command
TRAP #15 ;
SIMHALT ; halt simulator
IF ASL #3, D1 ; P = P * 8
ADD #3, D1 ; P = P + 3
ENDIF
MOVE #3, D0 ; assign read command
TRAP #15 ;
SIMHALT ; halt simulator
* Data and Variables
ORG $2000 ; Data starts at loc $2000
P DC.W 5 ;
Q DC.W 7 ;
END START ; last line of source
To do if..else, you need two jumps; one at the start, and one at the end of the first block.
While it doesn't affect correctness, it is also conventional to retain source order, which means negating the condition.
MOVE P, D1 ; [D1] <- P
MOVE Q, D2 ; [D2] <- Q
* Program code here
CMP #12, D1 ; is P > 12?
BLE ELSE ; P is <= 12
IF
ASL #3, D1 ; P = P * 8
ADD #3, D1 ; P = P + 3
BRA ENDIF
ELSE
SUB D2, D1 ; P = P - Q
ENDIF
MOVE #3, D0 ; assign read command
TRAP #15 ;
SIMHALT ; halt simulator
EASy68K supports structured assembly.
OPT SEX
IF.L P <GT> #12 THEN
ELSE
ENDI
Add the option SEX to expand the structured code during assembly if you wish to view the compare and branch instructions used to implement the structured code.

scipy.optimize.fsolve 'proper array of floats' error

I need to compute the root of a function and I'm using scipy.optimize.fsolve. However when I call fsolve, sometimes it outputs an error that says 'Result from function call is not a proper array of floats.'
Here's an example of the inputs I'm using:
In [45]: guess = linspace(0.1,1.0,11)
In [46]: alpha_old = 0.5
In [47]: n_old = 0
In [48]: n_new = 1
In [49]: S0 = 0.9
In [50]: fsolve(alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: array cannot be safely cast to required type
---------------------------------------------------------------------------
error Traceback (most recent call last)
/home/andres/Documents/UdeA/Proyecto/basis_analysis/<ipython-input-50-f1e9a42ba072> in <module>()
----> 1 fsolve(bb.alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.pyc in fsolve(func, x0, args, fprime, full_output, col_deriv, xtol, maxfev, band, epsfcn, factor, diag)
123 maxfev = 200*(n + 1)
124 retval = _minpack._hybrd(func, x0, args, full_output, xtol,
--> 125 maxfev, ml, mu, epsfcn, factor, diag)
126 else:
127 _check_func('fsolve', 'fprime', Dfun, x0, args, n, (n,n))
error: Result from function call is not a proper array of floats.
In [51]: guess = linspace(0.1,1.0,2)
In [52]: fsolve(alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
Out[52]: array([ 0.54382423, 1.29716005])
In [53]: guess = linspace(0.1,1.0,3)
In [54]: fsolve(alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
Out[54]: array([ 0.54382423, 0.54382423, 1.29716005])
There you can see that for 'guess' as defined in In[45] it outputs an error, whereas for 'guess' as defined in In[51] and In[53] it works fine. As far as I know, In[45], In[51] and In[53] all produce the same type of array, so what's the reason for the error I'm getting in In[50]?
Here are the functions I'm calling in case they're the reason of the problem:
def alpha_eq(alpha2,n1,alpha1,n2,S0):
    return overlap(n1,alpha1,n2,alpha2) - S0
def overlap(n1,alpha1,n2,alpha2):
    aux1 = sqrt((2.0*alpha1)**(2*n1+3)/factorial(2*n1+2))
    aux2 = sqrt((2.0*alpha2)**(2*n2+3)/factorial(2*n2+2))
    return aux1 * aux2 * factorial(n1+n2+2) / (alpha1+alpha2)**(n1+n2+3)
(the functions linspace, sqrt and factorial are imported from scipy)
This is a plot of the function for which I'm trying to find the roots.
[Figure: plot of the function]
It seems to me like this is a bug of fsolve, however I want to make sure I'm not making a stupid mistake before reporting it.
If there's something wrong with my code please let me know. Thanks!
I have modified your overlap function for debugging as follows:
def overlap(n1,alpha1,n2,alpha2):
    print n1, alpha1, n2, alpha2
    aux1 = sqrt((2.0*alpha1)**(2*n1 + 3)/factorial(2*n1 + 2))
    aux2 = sqrt((2.0*alpha2)**(2*n2 + 3)/factorial(2*n2 + 2))
    ret = aux1 * aux2 * factorial(n1+n2+2) / (alpha1+alpha2)**(n1+n2+3)
    print ret, ret.dtype
    return ret
And when I try to reproduce your error, here's what happens:
>>> scipy.optimize.fsolve(alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
0 0.5 1 [ 0.1 0.19 0.28 0.37 0.46 0.55 0.64 0.73 0.82 0.91 1. ]
[ 0.11953652 0.34008953 0.54906314 0.71208678 0.82778065 0.90418052
0.95046505 0.97452352 0.98252708 0.97911263 0.96769965] float64
...
0 0.5 1 [ 0.45613162 0.41366639 0.44818267 0.49222515 0.52879856 0.54371741
0.50642005 0.28700652 -3.72580492 1.81152096 1.41975621]
[ 0.82368346+0.j 0.77371428+0.j 0.81503304+0.j
0.85916030+0.j 0.88922137+0.j 0.89992643+0.j
0.87149667+0.j 0.56353606+0.j 0.00000000+1.21228156j
0.75791881+0.j 0.86627491+0.j ] complex128
So in the process of solving your equation, the square root of a negative number is being calculated, which leads to the complex128 dtype and your error.
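The effect can be reproduced without SciPy. This is a small Python 3 sketch using only the standard library: radicand is a hypothetical helper that evaluates the expression under the square root in overlap, and cmath.sqrt stands in for the complex-capable sqrt used in the question. The trial value is the negative entry visible in the debug output above.

```python
import cmath
from math import factorial


def radicand(n, alpha):
    """Expression under the square root in overlap() for one (n, alpha) pair."""
    return (2.0 * alpha) ** (2 * n + 3) / factorial(2 * n + 2)


trial_alpha = -3.72580492           # negative trial point taken by the solver
value = radicand(1, trial_alpha)    # odd power of a negative number
root = cmath.sqrt(value)            # complex-capable square root

print(value)  # a negative real number
print(root)   # purely imaginary, so the result array is promoted to complex
```

Once one element is complex the whole returned array has a complex dtype, which is what fsolve rejects.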
With your function, if you are only interested in the zeros, I think you can get rid of the sqrts if you raise S0 to the 4th power:
def alpha_eq(alpha2,n1,alpha1,n2,S0):
    return overlap(n1,alpha1,n2,alpha2) - S0**4
def overlap(n1,alpha1,n2,alpha2):
    aux1 = (2.0*alpha1)**(2*n1 + 3)/factorial(2*n1 + 2)
    aux2 = (2.0*alpha2)**(2*n2 + 3)/factorial(2*n2 + 2)
    ret = aux1 * aux2 * factorial(n1+n2+2) / (alpha1+alpha2)**(n1+n2+3)
    return ret
And now:
>>> scipy.optimize.fsolve(alpha_eq,guess,args=(n_old,alpha_old,n_new,S0))
array([ 0.92452239, 0.92452239, 0.92452239, 0.92452239, 0.92452239,
0.92452239, 0.92452239, 0.92452239, 0.92452239, 0.92452239,
0.92452239])
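One caveat: dropping the square roots on their own changes the function, so its zeros need not coincide with those of the original equation (0.92452239 is not one of the roots 0.54382423 and 1.29716005 found in the question). A variant that should keep the original roots for positive alpha values is to square the whole overlap and compare it with S0 squared. The sketch below is an assumption-labelled illustration in plain Python 3; overlap_squared and alpha_eq_squared are hypothetical names, and the check simply plugs in a root reported in the question.

```python
from math import factorial


def overlap_squared(n1, alpha1, n2, alpha2):
    """Square of the original overlap(), written without any square roots."""
    aux1 = (2.0 * alpha1) ** (2 * n1 + 3) / factorial(2 * n1 + 2)
    aux2 = (2.0 * alpha2) ** (2 * n2 + 3) / factorial(2 * n2 + 2)
    norm = factorial(n1 + n2 + 2) / (alpha1 + alpha2) ** (n1 + n2 + 3)
    return aux1 * aux2 * norm ** 2


def alpha_eq_squared(alpha2, n1, alpha1, n2, S0):
    """Zero wherever overlap**2 equals S0**2."""
    return overlap_squared(n1, alpha1, n2, alpha2) - S0 ** 2


# Plug in a root reported in the question (n_old=0, alpha_old=0.5, n_new=1, S0=0.9).
residual = alpha_eq_squared(0.54382423, 0, 0.5, 1, 0.9)
print(residual)
```

The residual comes out at essentially zero, so this form can be handed to fsolve in place of alpha_eq without moving the roots, and it never produces complex values.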

C vs assembler vs NEON performance

I am working on an iPhone application that does real time image processing. One of the earliest steps in its pipeline is to convert a BGRA image to greyscale. I tried several different methods and the difference in timing results is far greater than I had imagined possible. First I tried using C. I approximate the conversion to luminosity as (B + 2*G + R) / 4.
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
uchar *pOut = imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;
}
}
This code takes 55 ms to convert a 352x288 image. I then found some assembler code that does essentially the same thing
void BGRA_To_Byte(Image<BGRA> &imBGRA, Image<byte> &imByte)
{
uchar *pIn = (uchar*) imBGRA.data;
uchar *pLimit = pIn + imBGRA.MemSize();
unsigned int *pOut = (unsigned int*) imByte.data;
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
register unsigned int nBGRA1 asm("r4");
register unsigned int nBGRA2 asm("r5");
unsigned int nZero=0;
unsigned int nSum1;
unsigned int nSum2;
unsigned int nPacked1;
asm volatile(
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #0] \n" // Load in two BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nPacked1], %[nSum1], LSR #2 \n" // Init packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #8 \n" // Add to packed word
"ldrd %[nBGRA1], %[nBGRA2], [ %[pIn], #8] \n" // Load in two more BGRA words
"usad8 %[nSum1], %[nBGRA1], %[nZero] \n" // Add R+G+B+A
"usad8 %[nSum2], %[nBGRA2], %[nZero] \n" // Add R+G+B+A
"uxtab %[nSum1], %[nSum1], %[nBGRA1], ROR #8 \n" // Add G again
"uxtab %[nSum2], %[nSum2], %[nBGRA2], ROR #8 \n" // Add G again
"mov %[nSum1], %[nSum1], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum1], LSL #16 \n" // Add to packed word
"mov %[nSum2], %[nSum2], LSR #2 \n" // Div by four
"add %[nPacked1], %[nPacked1], %[nSum2], LSL #24 \n" // Add to packed word
///////////
////////////
: [pIn]"+r" (pIn),
[nBGRA1]"+r"(nBGRA1),
[nBGRA2]"+r"(nBGRA2),
[nZero]"+r"(nZero),
[nSum1]"+r"(nSum1),
[nSum2]"+r"(nSum2),
[nPacked1]"+r"(nPacked1)
:
: "cc" );
*pOut = nPacked1;
pOut++;
}
}
This function converts the same image in 12ms, almost 5X faster! I have not programmed in assembler before but I assumed that it would not be this much faster than C for such a simple operation. Inspired by this success I continued searching and discovered a NEON conversion example here.
void greyScaleNEON(uchar* output_data, uchar* input_data, int tot_pixels)
{
__asm__ volatile("lsr %2, %2, #3 \n"
"# build the three constants: \n"
"mov r4, #28 \n" // Blue channel multiplier
"mov r5, #151 \n" // Green channel multiplier
"mov r6, #77 \n" // Red channel multiplier
"vdup.8 d4, r4 \n"
"vdup.8 d5, r5 \n"
"vdup.8 d6, r6 \n"
"0: \n"
"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"
"# do the weight average: \n"
"vmull.u8 q7, d0, d4 \n"
"vmlal.u8 q7, d1, d5 \n"
"vmlal.u8 q7, d2, d6 \n"
"# shift and store: \n"
"vshrn.u16 d7, q7, #8 \n" // Divide q7 by 256 and store in d7
"vst1.8 {d7}, [%0]! \n"
"subs %2, %2, #1 \n" // Decrement iteration count
"bne 0b \n" // Repeat until the iteration count reaches zero
:
: "r"(output_data),
"r"(input_data),
"r"(tot_pixels)
: "r4", "r5", "r6"
);
}
The timing results were hard to believe. It converts the same image in 1 ms. 12X faster than assembler and an astounding 55X faster than C. I had no idea that such performance gains were possible. In light of this I have a few questions. First off, am I doing something terribly wrong in the C code? I still find it hard to believe that it is so slow. Second, if these results are at all accurate, in what kinds of situations can I expect to see these gains? You can probably imagine how excited I am at the prospect of making other parts of my pipeline run 55X faster. Should I be learning assembler/NEON and using them inside any loop that takes an appreciable amount of time?
Update 1: I have posted the assembler output from my C function in a text file at
http://temp-share.com/show/f3Yg87jQn It was far too large to include directly here.
Timing is done using OpenCV functions.
double duration = static_cast<double>(cv::getTickCount());
//function call
duration = static_cast<double>(cv::getTickCount())-duration;
duration /= cv::getTickFrequency();
//duration is now the elapsed time in seconds (multiply by 1000 for ms)
Results
I tested several suggested improvements. First, as recommended by Viktor I reordered the inner loop to put all fetches first. The inner loop then looked like.
for(; pIn < pLimit; pIn+=16) // Does four pixels at a time
{
//Jul 16, 2012 MR: Read and writes collected
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut +=4; // advance only after the four writes
}
This change brought processing time down to 53ms, an improvement of 2ms. Next, as recommended by Viktor, I changed my function to fetch as uint. The inner loop then looked like
unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
uchar* out = temp.data;
for(; in_int < end; in_int+=4) // Does four pixels at a time
{
unsigned int pixelA = in_int[0];
unsigned int pixelB = in_int[1];
unsigned int pixelC = in_int[2];
unsigned int pixelD = in_int[3];
uchar* byteA = (uchar*)&pixelA;
uchar* byteB = (uchar*)&pixelB;
uchar* byteC = (uchar*)&pixelC;
uchar* byteD = (uchar*)&pixelD;
unsigned int sumA = byteA[0] + 2 * byteA[1] + byteA[2];
unsigned int sumB = byteB[0] + 2 * byteB[1] + byteB[2];
unsigned int sumC = byteC[0] + 2 * byteC[1] + byteC[2];
unsigned int sumD = byteD[0] + 2 * byteD[1] + byteD[2];
out[0] = sumA / 4;
out[1] = sumB / 4;
out[2] = sumC / 4;
out[3] = sumD / 4;
out +=4;
}
This modification had a dramatic effect, dropping processing time to 14ms, a drop of 39ms (75%). This last result is very close to the assembler performance of 11ms. The final optimization, as recommended by rob, was to include the __restrict keyword. I added it in front of every pointer declaration, changing the following lines
__restrict unsigned int* in_int = (unsigned int*) original.data;
unsigned int* end = (unsigned int*) in_int + out_length;
__restrict uchar* out = temp.data;
...
__restrict uchar* byteA = (uchar*)&pixelA;
__restrict uchar* byteB = (uchar*)&pixelB;
__restrict uchar* byteC = (uchar*)&pixelC;
__restrict uchar* byteD = (uchar*)&pixelD;
...
These changes had no measurable effect on processing time. Thank you for all your help, I will be paying much closer attention to memory management in the future.
There is an explanation here concerning some of the reasons for NEON's "success": http://hilbert-space.de/?p=22
Try compiling your C code with the "-S -O3" switches to see the optimized output of the GCC compiler.
IMHO, the key to success is the optimized read/write pattern employed by both assembly versions. And NEON/MMX/other vector engines also support saturation (clamping results to 0..255 without having to use the 'unsigned ints').
See these lines in the loop:
unsigned int sumA = pIn[0] + 2 * pIn[1] + pIn[2];
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
pOut[1] = sumB / 4;
unsigned int sumC = pIn[8] + 2 * pIn[9] + pIn[10];
pOut[2] = sumC / 4;
unsigned int sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[3] = sumD / 4;
pOut +=4;
The reads and writes are really mixed. A slightly better version of the loop body would be
// and the pIn reads can be combined into a single 4-byte fetch
sumA = pIn[0] + 2 * pIn[1] + pIn[2];
sumB = pIn[4] + 2 * pIn[5] + pIn[6];
sumC = pIn[8] + 2 * pIn[9] + pIn[10];
sumD = pIn[12] + 2 * pIn[13] + pIn[14];
pOut[0] = sumA / 4;
pOut[1] = sumB / 4;
pOut[2] = sumC / 4;
pOut[3] = sumD / 4;
pOut +=4; // advance only after the four writes
Keep in mind that the "unsigned int sumA" line here can really mean an alloca() call (allocation on the stack), so you may be wasting a lot of cycles on the temporary variable allocations (the function call 4 times).
Also, the pIn[i] indexing does only a single-byte fetch from memory. A better way is to read a whole int and then extract the single bytes. To make things faster, use an "unsigned int*" to read 4 bytes at once (pIn[i * 4 + 0], pIn[i * 4 + 1], pIn[i * 4 + 2], pIn[i * 4 + 3]).
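To make the "read a word, then extract bytes" idea concrete, here is a short Python sketch of the arithmetic only (it says nothing about speed). It assumes little-endian BGRA pixels, so that B is the lowest byte of each 32-bit word, and bgra_words_to_grey is a hypothetical name used for illustration.

```python
import struct


def bgra_words_to_grey(data):
    """Convert packed BGRA bytes to grey values using (B + 2*G + R) / 4."""
    count = len(data) // 4
    # One 32-bit little-endian fetch per pixel instead of three byte fetches.
    words = struct.unpack("<%dI" % count, data[:count * 4])
    grey = bytearray(count)
    for i, word in enumerate(words):
        b = word & 0xFF           # lowest byte: blue
        g = (word >> 8) & 0xFF    # next byte: green
        r = (word >> 16) & 0xFF   # next byte: red (alpha in the top byte is ignored)
        grey[i] = (b + 2 * g + r) >> 2
    return bytes(grey)


pixels = bytes([10, 20, 30, 255, 0, 0, 0, 0])  # two BGRA pixels
print(list(bgra_words_to_grey(pixels)))
```

The same shifts and masks are what the C version does once each pixel has been loaded into an unsigned int.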
The NEON version is clearly superior: the lines
"# load 8 pixels: \n"
"vld4.8 {d0-d3}, [%1]! \n"
and
"#save everything in one shot \n"
"vst1.8 {d7}, [%0]! \n"
account for most of the time saved on memory access.
If performance is critically important (as it generally is with real-time image processing), you do need to pay attention to the machine code. As you have discovered, it can be especially important to use the vector instructions (which are designed for things like real-time image processing) -- and it is hard for compilers to automatically use the vector instructions effectively.
What you should try, before committing to assembly, is using compiler intrinsics. Compiler intrinsics aren't any more portable than assembly, but they should be easier to read and write, and easier for the compiler to work with. Aside from maintainability problems, the performance problem with assembly is that it effectively turns off the optimizer (you did use the appropriate compiler flag to turn it on, right?). That is: with inline assembly, the compiler is not able to tweak register assignment and so forth, so if you don't write your entire inner loop in assembly, it may still not be as efficient as it could be.
However, you will still be able to use your newfound assembly expertise to good effect -- as you can now inspect the assembly produced by your compiler, and figure out if it's being stupid. If so, you can tweak the C code (perhaps doing some pipelining by hand if the compiler isn't managing to), recompile it, look at the assembly output to see if the compiler is now doing what you want it to, then benchmark to see if it's actually running any faster...
If you've tried the above, and still can't provoke the compiler to do the right thing, go ahead and write your inner loop in assembly (and, again, check to see if the result is actually faster). For reasons described above, be sure to get the entire inner loop, including the loop branch.
Finally, as others have mentioned, take some time to try and figure out what "the right thing" is. Another benefit of learning your machine architecture is that it gives you a mental model of how things work -- so you will have a better chance of understanding how to put together efficient code.
Viktor Latypov's answer has lots of good information, but I want to point out one more thing: in your original C function, the compiler can't tell that pIn and pOut point to non-overlapping regions of memory. Now look at these lines:
pOut[0] = sumA / 4;
unsigned int sumB = pIn[4] + 2 * pIn[5] + pIn[6];
The compiler has to assume that pOut[0] might be the same as pIn[4] or pIn[5] or pIn[6] (or any other pIn[x]). So it basically can't reorder any of the code in your loop.
You can tell the compiler that pIn and pOut don't overlap by declaring them __restrict (the qualifier goes after the *, since it applies to the pointer itself):
uchar * __restrict pIn = (uchar*) imBGRA.data;
uchar * __restrict pOut = imByte.data;
This might speed up your original C version a bit.
This is kind of a toss-up between performance and maintainability. Typically, having an app load and function quickly is very nice for the user, but there is a trade-off: your app is now fairly difficult to maintain, and the speed gains may be unwarranted. If the users of your app were complaining that it felt slow, then these optimizations are worth the effort and the loss of maintainability, but if it came only from your own wish to speed up the app, then you should not go this far into optimization. If you are doing these image conversions at app startup, then speed is not of the essence, but if you are constantly doing them (and doing a lot of them) while the app is running, then they make more sense. Only optimize the parts of the app where the user spends time and actually experiences the slowdown.
Also, looking at the assembly, it does not use division but only multiplications, so look into that for your C code. Another point is that it turns your multiplication by 2 into two additions. This again may be a useful trick, as a multiplication may be slower than an addition on the iPhone.
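The multiply-and-shift arithmetic of the NEON loop can be sketched in a few lines of Python. The three weights are the constants from the NEON code in the question (28, 151, 77, which sum to 256), so the final shift right by 8 replaces a division by 256. grey_fixed_point is a hypothetical name, and this models a single pixel rather than the 8 pixels NEON handles per iteration.

```python
BLUE_WEIGHT = 28    # constants taken from the NEON example
GREEN_WEIGHT = 151
RED_WEIGHT = 77


def grey_fixed_point(b, g, r):
    """Weighted average with 8-bit fixed-point weights; no division needed."""
    total = BLUE_WEIGHT * b + GREEN_WEIGHT * g + RED_WEIGHT * r
    return total >> 8  # the weights sum to 256, so shift instead of dividing


print(BLUE_WEIGHT + GREEN_WEIGHT + RED_WEIGHT)
print(grey_fixed_point(255, 255, 255), grey_fixed_point(0, 255, 0))
```

Because the weights sum to exactly 256, a pure white pixel maps back to 255 and the result always fits in one byte.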