Problem with increment in inline ARM assembly - iPhone

I have the following bit of inline ARM assembly. It works in a debug build but crashes in a release build with iPhone SDK 3.1. The problem is the add instructions, where I increment the addresses held in the C variables output and x by 4 bytes, i.e. by the size of a float. I think that when I increment them I am overwriting something; can anyone say what the best way to handle this is?
Thanks
This is the C code that the asm is replacing; sum, output and x are all floats:
for(int i = 0; i < count; i++)
    sum += output[i] * (*x++);
asm volatile(
    ".align 4          \n\t"
    "mov   r4, %3      \n\t"
    "flds  s0, [%0]    \n\t"
    "0:                \n\t"
    "flds  s1, [%2]    \n\t"
    //"add   %3, %3, #4  \n\t"
    "flds  s2, [%1]    \n\t"
    //"add   %2, %2, #4  \n\t"
    "subs  r4, r4, #1  \n\t"
    "fmacs s0, s1, s2  \n\t"
    "bne   0b          \n\t"
    "fsts  s0, [%0]    \n\t"
    :
    : "r" (&sum), "r" (output), "r" (x), "r" (count)
    : "r0", "r4", "cc", "memory",
      "s0", "s1", "s2"
);

Did you mean to add 4 to %1 rather than to %3? Adding to %3 could cause damage if that register is used again after your function.
asm volatile(
    ".align 4          \n\t"
    "mov   r4, %3      \n\t"
    "flds  s0, [%0]    \n\t"
    "0:                \n\t"
    "flds  s1, [%2]    \n\t"
    "add   %2, %2, #4  \n\t"
    "flds  s2, [%1]    \n\t"
    "add   %1, %1, #4  \n\t"
    "subs  r4, r4, #1  \n\t"
    "fmacs s0, s1, s2  \n\t"
    "bne   0b          \n\t"
    "fsts  s0, [%0]    \n\t"
    :
    : "r" (&sum), "r" (output), "r" (x), "r" (count)
    : "r0", "r4", "cc", "memory",
      "s0", "s1", "s2"
);
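Note also that once the adds are enabled, the asm modifies registers that are declared as plain inputs; the compiler is allowed to assume inputs are unchanged, which can happen to work in a debug build and break in an optimized release build. A minimal sketch (not tested on a device) of one way to express the same loop with read-write ("+r") operands, using the same variables as in the question:

float *sump = &sum;           /* %0 */
float *outp = output;         /* %1 */
const float *xp = x;          /* %2 */
int n = count;                /* %3 */
asm volatile(
    "flds  s0, [%0]    \n\t"   /* s0 = sum                      */
    "0:                \n\t"
    "flds  s1, [%2]    \n\t"   /* s1 = *x                       */
    "add   %2, %2, #4  \n\t"   /* x++  (4 == sizeof(float))     */
    "flds  s2, [%1]    \n\t"   /* s2 = *output                  */
    "add   %1, %1, #4  \n\t"   /* output++                      */
    "subs  %3, %3, #1  \n\t"   /* count--                       */
    "fmacs s0, s1, s2  \n\t"   /* s0 += s1 * s2                 */
    "bne   0b          \n\t"
    "fsts  s0, [%0]    \n\t"   /* sum = s0                      */
    : "+r" (sump), "+r" (outp), "+r" (xp), "+r" (n)
    :
    : "cc", "memory", "s0", "s1", "s2"
);

Because the pointers and the counter are now outputs, the compiler will not reuse those registers expecting the original values after the asm block, and the r4 scratch register is no longer needed.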

Related

*.s files cannot be correctly assembled in visualGDB

I am new to VisualGDB. Recently I created a stand-alone STM32 project; all the .cpp and .c files compile correctly, but when I migrated µC/OS-II into the project I found that the os_cpu_a.s file could not be assembled correctly.
Error logs are shown as follows:
And my Makefile flags setting is:
Please give me some suggestions! Any help will be appreciated!
The content of the os_cpu_a.s file is shown below:
;/*********************** (C) COPYRIGHT 2010 Libraworks *************************
;* File Name : os_cpu_a.asm
;* Author : Librae
;* Version : V1.0
;* Date : 06/10/2010
;* Description : μCOS-II asm port for STM32
;*******************************************************************************/
IMPORT OSRunning ; External references
IMPORT OSPrioCur
IMPORT OSPrioHighRdy
IMPORT OSTCBCur
IMPORT OSTCBHighRdy
IMPORT OSIntNesting
IMPORT OSIntExit
IMPORT OSTaskSwHook
EXPORT OSStartHighRdy
EXPORT OSCtxSw
EXPORT OSIntCtxSw
EXPORT OS_CPU_SR_Save ; Functions declared in this file
EXPORT OS_CPU_SR_Restore
EXPORT PendSV_Handler
NVIC_INT_CTRL EQU 0xE000ED04
NVIC_SYSPRI2 EQU 0xE000ED20
NVIC_PENDSV_PRI EQU 0xFFFF0000
NVIC_PENDSVSET EQU 0x10000000
PRESERVE8
AREA |.text|, CODE, READONLY
THUMB
;********************************************************************************************************
; CRITICAL SECTION METHOD 3 FUNCTIONS
;
; Description: Disable/Enable interrupts by preserving the state of interrupts. Generally speaking you
; would store the state of the interrupt disable flag in the local variable 'cpu_sr' and then
; disable interrupts. 'cpu_sr' is allocated in all of uC/OS-II's functions that need to
; disable interrupts. You would restore the interrupt disable state by copying back 'cpu_sr'
; into the CPU's status register.
;
; Prototypes : OS_CPU_SR OS_CPU_SR_Save(void);
; void OS_CPU_SR_Restore(OS_CPU_SR cpu_sr);
;
;
; Note(s) : 1) These functions are used in general like this:
;
; void Task (void *p_arg)
; {
; #if OS_CRITICAL_METHOD == 3 /* Allocate storage for CPU status register */
; OS_CPU_SR cpu_sr;
; #endif
;
; :
; :
; OS_ENTER_CRITICAL(); /* cpu_sr = OS_CPU_SaveSR(); */
; :
; :
; OS_EXIT_CRITICAL(); /* OS_CPU_RestoreSR(cpu_sr); */
; :
; :
; }
;********************************************************************************************************
OS_CPU_SR_Save
MRS R0, PRIMASK
CPSID I
BX LR
OS_CPU_SR_Restore
MSR PRIMASK, R0
BX LR
OSStartHighRdy
LDR R4, =NVIC_SYSPRI2 ; set the PendSV exception priority
LDR R5, =NVIC_PENDSV_PRI
STR R5, [R4]
MOV R4, #0 ; set the PSP to 0 for initial context switch call
MSR PSP, R4
LDR R4, =OSRunning ; OSRunning = TRUE
MOV R5, #1
STRB R5, [R4]
LDR R4, =NVIC_INT_CTRL ;trigger the PendSV exception (causes context switch)
LDR R5, =NVIC_PENDSVSET
STR R5, [R4]
CPSIE I ;enable interrupts at processor level
OSStartHang
B OSStartHang ;should never get here
OSCtxSw
PUSH {R4, R5}
LDR R4, =NVIC_INT_CTRL
LDR R5, =NVIC_PENDSVSET
STR R5, [R4]
POP {R4, R5}
BX LR
OSIntCtxSw
PUSH {R4, R5}
LDR R4, =NVIC_INT_CTRL
LDR R5, =NVIC_PENDSVSET
STR R5, [R4]
POP {R4, R5}
BX LR
NOP
PendSV_Handler
CPSID I ; Prevent interruption during context switch
MRS R0, PSP
CBZ R0, PendSV_Handler_Nosave ; Skip register save the first time
SUBS R0, R0, #0x20 ; Save remaining regs r4-11 on process stack
STM R0, {R4-R11}
LDR R1, =OSTCBCur ; OSTCBCur->OSTCBStkPtr = SP;
LDR R1, [R1]
STR R0, [R1] ; R0 is SP of process being switched out
; At this point, entire context of process has been saved
PendSV_Handler_Nosave
PUSH {R14} ; Save LR exc_return value
LDR R0, =OSTaskSwHook ; OSTaskSwHook();
BLX R0
POP {R14}
LDR R0, =OSPrioCur ; OSPrioCur = OSPrioHighRdy;
LDR R1, =OSPrioHighRdy
LDRB R2, [R1]
STRB R2, [R0]
LDR R0, =OSTCBCur ; OSTCBCur = OSTCBHighRdy;
LDR R1, =OSTCBHighRdy
LDR R2, [R1]
STR R2, [R0]
LDR R0, [R2] ; R0 is new process SP; SP = OSTCBHighRdy->OSTCBStkPtr;
LDM R0, {R4-R11} ; Restore r4-11 from new process stack
ADDS R0, R0, #0x20
MSR PSP, R0 ; Load PSP with new process SP
ORR LR, LR, #0x04 ; Ensure exception return uses process stack
CPSIE I
BX LR ; Exception return will restore remaining context
end
By the way, it seems that any Chinese characters in the post are treated as spam by the spam-prevention filter, so I had to delete all the Chinese comments from my code.

TSC MSR and serialization

Regarding the MSR IA32_TIME_STAMP_COUNTER (10h):
Which serialization rules does reading it follow? The same as rdtsc, as rdtscp, or something else?
If it is not serialized, should I provide a cpuid "barrier" before any math computations?
-- Edit --
So far I have implemented two kinds of barriers: cpuid and fences.
With cpuid:
#define RDCOUNTER(_val, _cnt) \
asm volatile \
( \
"xorq %%rax, %%rax \n\t" \
"cpuid \n\t" \
"movq %1, %%rcx \n\t" \
"rdmsr \n\t" \
"push %%rax \n\t" \
"push %%rdx \n\t" \
"xorq %%rax, %%rax \n\t" \
"cpuid \n\t" \
"pop %%rdx \n\t" \
"pop %%rax \n\t" \
"shlq $32, %%rdx \n\t" \
"orq %%rdx, %%rax \n\t" \
"movq %%rax, %0" \
: "=m" (_val) \
: "i" (_cnt) \
: "%rax", "%rbx", "%rcx", "%rdx", "memory" \
)
With fence :
#define RDCOUNTER(_val, _cnt) \
asm volatile \
( \
"movq %1, %%rcx \n\t" \
"mfence \n\t" \
"rdmsr \n\t" \
"mfence \n\t" \
"shlq $32, %%rdx \n\t" \
"orq %%rdx, %%rax \n\t" \
"movq %%rax, %0" \
: "=m" (_val) \
: "i" (_cnt) \
: "%rax", "%rbx", "%rcx", "%rdx", "memory" \
)
The part of my project below tries to estimate the processor external clock frequency (FSB or BCLK).
The algorithm allocates an array of structures in which it reads and measures deltas of the Time Stamp Counter.
This slab of memory is allocated so that it stays resident in the processor cache.
CPU affinity is set to the BSP, and the scheduler and interrupts are suspended for the duration of the computation.
Several loops of TSC reads are done to force cache residency, and the delta that occurs most often is declared the best frequency.
What I expect is to get a constant frequency after several runs.
Unfortunately, I still see variance whether or not a barrier instruction is employed.
The results are pretty close, agreeing to at least three decimal places, but never constant.
(This is tested on a Core 2 and a Core i7.)
DECLARE_COMPLETION(bclk_job_complete);
typedef struct {
    unsigned long long V[2], D;
} TSC_STRUCT;

#define OCCURENCES 32

signed int Compute_Clock(void *arg)
{
    CLOCK *clock=(CLOCK *) arg;
    unsigned int ratio=clock->Q;
    unsigned long long overhead=0;
    struct kmem_cache *hardwareCache=kmem_cache_create(
            "IntelClockCache",
            OCCURENCES * sizeof(TSC_STRUCT), 0,
            SLAB_HWCACHE_ALIGN, NULL);
    TSC_STRUCT *TSC=kmem_cache_alloc(hardwareCache, GFP_KERNEL);
    unsigned int loop=0, best=0, top=0;
    // No preemption, no interrupt.
    unsigned long flags;
    preempt_disable();
    raw_local_irq_save(flags);
    // Warm-up
    RDCOUNTER(TSC[loop].V[0], MSR_IA32_TSC);
    RDCOUNTER(TSC[loop].V[1], MSR_IA32_TSC);
    // Overhead
    RDCOUNTER(TSC[loop].V[0], MSR_IA32_TSC);
    RDCOUNTER(TSC[loop].V[1], MSR_IA32_TSC);
    overhead=TSC[loop].V[1] - TSC[loop].V[0];
    // Pick-up
    for(loop=0; loop < OCCURENCES; loop++)
    {
        RDCOUNTER(TSC[loop].V[0], MSR_IA32_TSC);
        udelay(100);
        RDCOUNTER(TSC[loop].V[1], MSR_IA32_TSC);
    }
    // Restore interrupt and preemption.
    raw_local_irq_restore(flags);
    preempt_enable();
    for(loop=0; loop < OCCURENCES; loop++)
        TSC[loop].D=TSC[loop].V[1] - TSC[loop].V[0] - overhead;
    for(loop=0; loop < OCCURENCES; loop++) {
        unsigned int inner=0, count=0;
        for(inner=loop; inner < OCCURENCES; inner++)
            if(TSC[loop].D == TSC[inner].D)
                count++;
        if((count > top)
        || ((count == top) && (TSC[loop].D < TSC[best].D))) {
            top=count;
            best=loop;
        }
        /* printk("%3u x D[%02u]=%llu\t%llu - %llu\n",
                count, loop, TSC[loop].D, TSC[loop].V[1], TSC[loop].V[0]); */
    }
    printk("Overhead=%llu\tBest=%llu\n", overhead, TSC[best].D);
    clock->Q=TSC[best].D / (ratio * PRECISION);
    clock->R=TSC[best].D % (ratio * PRECISION);
    kmem_cache_free(hardwareCache, TSC);
    kmem_cache_destroy(hardwareCache);
    complete_and_exit(&bclk_job_complete, 0);
}

how to write inline assembly codes about LOOP in Xcode LLVM?

I'm studying inline assembly. I want to write a simple routine for the iPhone under Xcode 4 with the LLVM 3.0 compiler. I have succeeded in writing basic inline assembly.
Example:
int sub(int a, int b)
{
int c;
asm ("sub %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
return c;
}
I found it on stackoverflow.com and it works very well. But I don't know how to write a loop in inline assembly.
I need assembly code for something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity)
{
for(int i=0; i<numPixels; i++)
{
dst[i] = src[i] + intensity;
}
}
Take a look here at the loop section - http://en.wikipedia.org/wiki/ARM_architecture
Basically you'll want something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r3, #0\n"
"Lloop:\n"
"\t cmp r3, %2\n"
"\t bge Lend\n"
"\t ldrb r4, [%0, r3]\n"
"\t add r4, r4, %3\n"
"\t strb r4, [%1, r3]\n"
"\t add r3, r3, #1\n"
"\t b Lloop\n"
"Lend:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r3", "r4");
}
Update:
And here's that NEON version:
void brighten_neon(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r4, #0\n"
"\t vdup.8 d1, %3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.8 d0, [%0]!\n"
"\t vqadd.s8 d0, d0, d1\n"
"\t vst1.8 d0, [%1]!\n"
"\t add r4, r4, #8\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "r4", "d1", "d0");
}
So this NEON version will do 8 pixels at a time. It does not, however, check that numPixels is divisible by 8, so you'd definitely want to handle that, otherwise things will go wrong! Anyway, it's just a start at showing you what can be done. Notice the same number of instructions, but acting on eight pixels of data at once. Oh, and it's got the saturation in there as well, which I assume you would want.
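One simple way to deal with a numPixels that is not a multiple of 8 (a sketch; the wrapper name brighten_safe is made up here) is to run the NEON routine above on the largest multiple of 8 and finish the tail in plain C:

void brighten_safe(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
    int vecPixels = numPixels & ~7;               /* largest multiple of 8 */
    if (vecPixels > 0)
        brighten_neon(src, dst, vecPixels, intensity);
    for (int i = vecPixels; i < numPixels; i++)   /* scalar tail */
        dst[i] = src[i] + intensity;              /* note: wraps like the original C,
                                                     while the NEON path saturates (vqadd.s8) */
}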
Though this is not directly an answer to your question, it is more general advice regarding the use of assembler versus modern compilers.
You will generally have a hard time beating the compiler at optimizing your C code. Of course, by clever use of knowledge about how your data behaves, you might be able to squeeze out a few percent.
One of the reasons for this is that modern compilers use a number of techniques when dealing with code like the one you describe, e.g. loop unrolling and instruction reordering to avoid pipeline stalls and bubbles.
If you really want to make that algorithm scream, you should consider redesigning the algorithm in C so you avoid the worst delays. For instance, reading and writing memory is expensive compared to register access.
One way of accomplishing this could be to have your code load 4 bytes at a time into an unsigned long, do the math on it in registers, and then write those 4 bytes back in one store operation, as sketched below.
So to recap: make your algorithm work smarter, not harder.
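As a rough illustration of the load-4-bytes-at-a-time idea (a sketch under assumptions: numPixels is a multiple of 4, the buffers are 4-byte aligned, and plain wrap-around addition like the original C loop is acceptable), the per-byte add can be done on four packed bytes at once with the usual carry-isolation trick:

#include <stdint.h>

void brighten_word(unsigned char* src, unsigned char* dst, int numPixels, int intensity)
{
    const uint32_t *s = (const uint32_t *)src;
    uint32_t *d = (uint32_t *)dst;
    uint32_t k = (uint8_t)intensity * 0x01010101u;   /* intensity replicated into all 4 bytes */

    for (int i = 0; i < numPixels / 4; i++) {
        uint32_t x = s[i];
        /* Add the low 7 bits of each byte, then XOR the top bits back in,
           so carries never propagate from one byte into the next. */
        uint32_t low = (x & 0x7f7f7f7fu) + (k & 0x7f7f7f7fu);
        d[i] = low ^ ((x ^ k) & 0x80808080u);
    }
}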

Optimizing RGBA8888 to RGB565 conversion with NEON

I'm trying to optimize an image format conversion on iOS using the NEON vector instruction set. I assumed this would map well to NEON because it processes a bunch of similar data.
My attempts haven't gone that well, though, achieving only a marginal speedup vs. the naive C implementation:
for(int i = 0; i < pixelCount; ++i, ++inPixel32) {
const unsigned int r = ((*inPixel32 >> 0 ) & 0xFF);
const unsigned int g = ((*inPixel32 >> 8 ) & 0xFF);
const unsigned int b = ((*inPixel32 >> 16) & 0xFF);
*outPixel16++ = ((r >> 3) << 11) | ((g >> 2) << 5) | ((b >> 3) << 0);
}
1 megapixel image array on iPad 2:
format is [min avg max n=number of timer samples] in milliseconds
C:
[14.446 14.632 18.405 n=1000]ms
NEON:
[11.920 12.032 15.336 n=1000]ms
My attempt at a NEON implementation is below:
int i;
const int pixelsPerLoop = 8;
for(i = 0; i < pixelCount; i += pixelsPerLoop, inPixel32 += pixelsPerLoop, outPixel16 += pixelsPerLoop) {
//Read all r,g,b pixels into 3 registers
uint8x8x4_t rgba = vld4_u8(inPixel32);
//Right-shift r,g,b as appropriate
uint8x8_t r = vshr_n_u8(rgba.val[0], 3);
uint8x8_t g = vshr_n_u8(rgba.val[1], 2);
uint8x8_t b = vshr_n_u8(rgba.val[2], 3);
//Widen b
uint16x8_t r5_g6_b5 = vmovl_u8(b);
//Widen r
uint16x8_t r16 = vmovl_u8(r);
//Left shift into position within 16-bit int
r16 = vshlq_n_u16(r16, 11);
r5_g6_b5 |= r16;
//Widen g
uint16x8_t g16 = vmovl_u8(g);
//Left shift into position within 16-bit int
g16 = vshlq_n_u16(g16, 5);
r5_g6_b5 |= g16;
//Now write back to memory
vst1q_u16(outPixel16, r5_g6_b5);
}
//Do the remainder on normal flt hardware
Code was compiled via LLVM 3.0 into the following (.loc and extra labels removed):
_DNConvert_ARGB8888toRGB565:
push {r4, r5, r7, lr}
mov r9, r1
mov.w r12, #0
add r7, sp, #8
cmp r2, #0
mov.w r1, #0
it ne
movne r1, #1
cmp r0, #0
mov.w r3, #0
it ne
movne r3, #1
cmp.w r9, #0
mov.w r4, #0
it ne
movne r4, #1
tst.w r9, #3
bne LBB0_8
ands r1, r3
ands r1, r4
cmp r1, #1
bne LBB0_8
movs r1, #0
lsr.w lr, r9, #2
cmp.w r1, r9, lsr #2
bne LBB0_9
mov r3, r2
mov r5, r0
b LBB0_5
LBB0_4:
movw r1, #65528
add.w r0, lr, #7
movt r1, #32767
ands r1, r0
LBB0_5:
mov.w r12, #1
cmp r1, lr
bhs LBB0_8
rsb r0, r1, r9, lsr #2
mov.w r9, #63488
mov.w lr, #2016
mov.w r12, #1
LBB0_7:
ldr r2, [r5], #4
subs r0, #1
and.w r1, r9, r2, lsl #8
and.w r4, lr, r2, lsr #5
ubfx r2, r2, #19, #5
orr.w r2, r2, r4
orr.w r1, r1, r2
strh r1, [r3], #2
bne LBB0_7
LBB0_8:
mov r0, r12
pop {r4, r5, r7, pc}
LBB0_9:
sub.w r1, lr, #1
movs r3, #32
add.w r3, r3, r1, lsl #2
bic r3, r3, #31
adds r5, r0, r3
movs r3, #16
add.w r1, r3, r1, lsl #1
bic r1, r1, #15
adds r3, r2, r1
movs r1, #0
LBB0_10:
vld4.8 {d16, d17, d18, d19}, [r0]!
adds r1, #8
cmp r1, lr
vshr.u8 d20, d16, #3
vshr.u8 d21, d17, #2
vshr.u8 d16, d18, #3
vmovl.u8 q11, d20
vmovl.u8 q9, d21
vmovl.u8 q8, d16
vshl.i16 q10, q11, #11
vshl.i16 q9, q9, #5
vorr q8, q8, q10
vorr q8, q8, q9
vst1.16 {d16, d17}, [r2]!
Ltmp28:
blo LBB0_10
b LBB0_4
Full code is available at https://github.com/darknoon/DNImageConvert. I would appreciate any help, thanks!
Here you are, a hand-optimized NEON implementation ready for Xcode:
/* IT DOESN'T WORK!!! USE THE NEXT VERSION BELOW.
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
rg .req red
gb .req blu
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vshr.u8 red, red, #3
vext.8 rg, grn, red, #5
vshr.u8 grn, grn, #2
vext.8 gb, blu, grn, #3
vst2.8 {gb, rg}, [pDst]!
bgt loop
bx lr
This version will be many times faster than what you suggested :
increased cache hit rate via PLD
conversion to "long" not necessary
fewer instructions within the loop
There is still some room for optimization, though: you could modify the loop so that it converts 16 pixels per iteration instead of 8.
Then you could schedule the instructions to avoid the two stalls completely (which is simply not possible in the 8-per-iteration version above) and benefit from NEON's dual-issue capability in addition.
I didn't do this because it would make the code hard to understand.
It's important to know what VEXT is supposed to do.
Now it's up to you. :)
I verified this code to be properly compiled under Xcode.
Although I'm pretty sure it works correctly as well, I cannot guarantee this since I don't have the test environment.
In case of malfunctioning, please let me know. I'll correct it accordingly then.
cya
==============================================================================
Well, here is the improved version.
Because the VSRI instruction does not allow a source operand separate from its destination (it shifts one register and inserts the result into the other), it was not possible to make the register assignment any more flexible.
Please check the image format of your source image (the exact byte order of the elements).
If it's not B, G, R, A, which is the default and native order on iOS, your application will suffer heavily from internal conversions done by iOS.
If it's absolutely not possible to change this for whatever reason, let me know.
I'll write a new version matching it.
PS : I forgot to remove the underscore at the start of the function prototype. Now it's gone.
/*
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*
* Version 1.1
* - bug fix
*
* Version 1.0
* - initial release
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
gb .req grn
rg .req red
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
.loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vsri.8 red, grn, #5
vshl.u8 gb, grn, #3
vsri.8 gb, blu, #3
vst2.8 {gb, rg}, [pDst]!
bgt .loop
bx lr
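For reference, a hypothetical call site (not from the original answer) would look like the snippet below; note that the tst count, #0x7 / bxne lr at the top means the routine returns 0 without converting anything unless count is a multiple of 8:

extern unsigned int *bgra2rgb565_neon(unsigned int *pDst, unsigned int *pSrc, unsigned int count);

/* pixelCount must be a multiple of 8, otherwise the routine bails out */
bgra2rgb565_neon((unsigned int *)outPixel16, (unsigned int *)inPixel32, pixelCount);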
If you are on iOS or OS X, then you may be delighted to discover vImageConvert_RGBA8888toRGB565() and friends, in Accelerate.framework. This function rounds the 8-bit values to nearest 565 value.
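For example, a minimal sketch of calling it (the buffer setup here is assumed, not from the original answer; width and height are whatever your image dimensions are, and the pointer names follow the question's code):

#include <Accelerate/Accelerate.h>

vImage_Buffer src  = { .data = inPixel32,  .height = height, .width = width, .rowBytes = width * 4 };
vImage_Buffer dest = { .data = outPixel16, .height = height, .width = width, .rowBytes = width * 2 };

vImage_Error err = vImageConvert_RGBA8888toRGB565(&src, &dest, kvImageNoFlags);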
For even better dithering, the quality of which is nearly indistinguishable from 8-bit color, try vImageConvert_AnyToAny():
vImage_CGImageFormat RGBA8888Format =
{
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.bitmapInfo = kCGBitmapByteOrderDefault | kCGImageAlphaNoneSkipLast,
.colorSpace = NULL, // sRGB or substitute your own in
};
vImage_CGImageFormat RGB565Format =
{
.bitsPerComponent = 5,
.bitsPerPixel = 16,
.bitmapInfo = kCGBitmapByteOrder16Little | kCGImageAlphaNone,
.colorSpace = RGBA8888Format.colorSpace,
};
vImage_Error err;
vImageConverterRef converter = vImageConverter_CreateWithCGImageFormat(
    &RGBA8888Format, &RGB565Format, NULL, kvImageNoFlags, &err );
err = vImageConvert_AnyToAny( converter, &src, &dest, NULL, kvImageNoFlags );
Either of these approaches will be vectorized and multithreaded for best performance.
You might want to use vld4q_u8() instead of vld4_u8() and adjust the rest of your code accordingly. It's hard to tell where the problem might be, but the assembler doesn't look too bad otherwise.
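For what that adjustment might look like, here is a rough, untested sketch of the loop body using the q-register intrinsics (variable names follow the question's code; each iteration now consumes 16 pixels):

uint8x16x4_t rgba = vld4q_u8((const uint8_t *)inPixel32);
uint8x16_t r = vshrq_n_u8(rgba.val[0], 3);
uint8x16_t g = vshrq_n_u8(rgba.val[1], 2);
uint8x16_t b = vshrq_n_u8(rgba.val[2], 3);

/* low 8 pixels */
uint16x8_t lo = vmovl_u8(vget_low_u8(b));
lo = vorrq_u16(lo, vshlq_n_u16(vmovl_u8(vget_low_u8(r)), 11));
lo = vorrq_u16(lo, vshlq_n_u16(vmovl_u8(vget_low_u8(g)), 5));
vst1q_u16(outPixel16, lo);

/* high 8 pixels */
uint16x8_t hi = vmovl_u8(vget_high_u8(b));
hi = vorrq_u16(hi, vshlq_n_u16(vmovl_u8(vget_high_u8(r)), 11));
hi = vorrq_u16(hi, vshlq_n_u16(vmovl_u8(vget_high_u8(g)), 5));
vst1q_u16(outPixel16 + 8, hi);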
(I'm not familiar with NEON, nor deeply with the memory system of the iPad 2, but this is what we used to do with 88110 pixel ops, which were an early precursor to today's SIMD extensions.)
How big is the memory latency?
Could you hide it by unrolling the inner loop and running the NEON instructions on the "previous" values while the ARM core pulls the "next" values from memory? A brief scan of the NEON manual implies you can run ARM and NEON instructions in parallel.
I don't think converting vld4_u8 to vld4q_u8 would lead to any improvement in performance.
The code seems simple enough. I am not good at ASM, so it would take some time to look into it deeply.
The NEON part seems simple enough, but I am not quite sure about r5_g6_b5 |= g16 being used instead of vorrq_u16.
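For what it's worth, the two forms should be equivalent; |= on NEON vector types is a GCC/Clang extension, and the explicit intrinsic spelling (using the variable names from the question) would be:

r5_g6_b5 = vorrq_u16(r5_g6_b5, g16);  /* same as r5_g6_b5 |= g16 */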
Please have a look at the optimization level too. From what I have heard, NEON code optimization only goes up to level 1, so the performance may differ when the default optimization level is applied to both the reference code and the NEON code.
I can't find any area in the NEON code that would improve on the current code.

linker script load vs. virtual address

I've got the following linker script that is supposed to link code to run on a flash-based microcontroller. The µC has flash at address 0x0 and RAM at 0x40000000. I want to put the data section into flash, but link the program so that accesses to the data section go to RAM. The point being, I'll manually copy it out of flash into the proper RAM location when the controller starts.
MEMORY
{
flash : ORIGIN = 0x00000000, LENGTH = 512K
ram : ORIGIN = 0x40000000, LENGTH = 32K
usbram : ORIGIN = 0x7FD00000, LENGTH = 8K
ethram : ORIGIN = 0x7FE00000, LENGTH = 16K
}
SECTIONS
{
.text : { *(.text) } >flash
__end_of_text__ = .;
.data :
{
__data_beg__ = .;
__data_beg_src__ = __end_of_text__;
*(.data)
__data_end__ = .;
} >ram AT>flash
.bss :
{
__bss_beg__ = .;
*(.bss)
} >ram
}
The code as shown above generates the following output:
40000000 <__data_beg__>:
40000000: 00000001 andeq r0, r0, r1
40000004: 00000002 andeq r0, r0, r2
40000008: 00000003 andeq r0, r0, r3
4000000c: 00000004 andeq r0, r0, r4
40000010: 00000005 andeq r0, r0, r5
40000014: 00000006 andeq r0, r0, r6
which represents an array of the form
int foo[] = {1,2,3,4,5,6};
The problem is that it's linked to 0x40000000, and not to the flash region as I wanted. I expected the AT>flash part of the linker script to specify linking into flash, as explained in the LD manual.
http://sourceware.org/binutils/docs/ld/Output-Section-Attributes.html#Output-Section-Attributes
and here is my ld invocation:
arm-elf-ld -T ./lpc2368.ld entry.o main.o -o binary.elf
Thanks.
Your linker script LINKS .data to RAM and LOADS .data into FLASH. This is due to the AT command.
Why do you want to link volatile data to flash? Data and bss must ALWAYS be linked to RAM. Your linker script is quite correct; you would only link text and constant data into flash.
Please look at your map file. It will have assigned a RAM address to the data variables.
The program startup code then copies (__data_end__ - __data_beg__) bytes from __data_beg_src__ to __data_beg__.
So the first piece of data, which is the array, is copied to the beginning of SRAM.
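In other words, the startup code does something like this (a sketch, assuming .data is word-aligned and word-sized, and using the symbol names from the linker script above; the helper name is hypothetical):

extern unsigned int __data_beg__, __data_end__, __data_beg_src__;

void copy_data_to_ram(void)              /* called from the reset handler before main() */
{
    unsigned int *src = &__data_beg_src__;   /* load address (in flash) */
    unsigned int *dst = &__data_beg__;       /* run address  (in RAM)   */

    while (dst < &__data_end__)
        *dst++ = *src++;
}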
If you need to link data to flash:
.data :
{
    *(.data);
} > flash
The linker will now LINK and LOAD .data into flash. But if I were you, I wouldn't do that.
Your .data virtual address (VMA) = 0x40000000
Your .data load address (LMA) = 0x00000000
This can be seen with the command
objdump -h file.elf
Sections:
Idx Name Size VMA LMA File off Algn
8 .data 00000014 40000000 00000000 00001000 2**2
CONTENTS, ALLOC, LOAD, DATA