Golang - Socket Read Operation is taking long time - sockets

I'm trying to read the response from a socket using the tcp connection type and net package in go.
package main

import (
	"fmt"
	"io"
	"net"
	"strconv"
	"time"
)

func main() {
	con, err := net.Dial("tcp", "prodserver:6778")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer con.Close()
	b := make([]byte, 0, 512)
	k := 0
	for {
		if len(b) == cap(b) {
			// Buffer is full: append forces a reallocation, then reslice to the old length.
			b = append(b, 0)[:len(b)]
		}
		readStart := time.Now()
		n, err := con.Read(b[len(b):cap(b)])
		readEnd := time.Since(readStart)
		fmt.Print("Start Iteraton #" + strconv.Itoa(k) + " ")
		fmt.Print(readEnd)
		fmt.Println(" End Iteraton #" + strconv.Itoa(k) + " byte size: " + strconv.Itoa(len(b)) + " ")
		b = b[:len(b)+n]
		if err != nil {
			if err == io.EOF {
				err = nil
			}
			// return b, err
			break
		}
		k++
	}
}
I measured the duration of each read and printed it to the console. The results were interesting: every read completed in microseconds or milliseconds, except the last one, which took about 1 minute.
Output:
Start Iteraton #0 551.32278ms End Iteraton #0 byte size: 0
Start Iteraton #1 11.577µs End Iteraton #1 byte size: 512
Start Iteraton #2 10.333µs End Iteraton #2 byte size: 896
Start Iteraton #3 5.204µs End Iteraton #3 byte size: 1408
Start Iteraton #4 6.605µs End Iteraton #4 byte size: 2048
Start Iteraton #5 6.452µs End Iteraton #5 byte size: 3072
//
// so on for the rest of iteration the time taken was very similar to above
//
Start Iteraton #238 37.747µs End Iteraton #238 byte size: 340285
Start Iteraton #239 23.235µs End Iteraton #239 byte size: 341630
Start Iteraton #240 58.041µs End Iteraton #240 byte size: 342975
Start Iteraton #241 34.192µs End Iteraton #241 byte size: 344320
Start Iteraton #242 101.094µs End Iteraton #242 byte size: 345665
Start Iteraton #243 1m0.307097139s End Iteraton #243 byte size: 346430
Any idea why the last read operation takes so long to complete? Is there a timeout we need to set?
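On the timeout question, here is a minimal sketch. It assumes the server keeps the connection open after sending its last byte, so the final Read blocks until the peer closes or a deadline expires; the output above cannot confirm that this is the actual cause. `readWithDeadline` and `demo` are hypothetical names, and the in-memory `net.Pipe` peer only stands in for the real server. `SetReadDeadline` itself is the standard `net.Conn` method.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// readWithDeadline reads until EOF or until no data arrives for idle.
// It returns the bytes received and whether it stopped because of a timeout.
func readWithDeadline(conn net.Conn, idle time.Duration) ([]byte, bool, error) {
	var out []byte
	buf := make([]byte, 512)
	for {
		// Re-arm the deadline before every Read so it acts as an idle timeout.
		if err := conn.SetReadDeadline(time.Now().Add(idle)); err != nil {
			return out, false, err
		}
		n, err := conn.Read(buf)
		out = append(out, buf[:n]...)
		if err == nil {
			continue
		}
		if errors.Is(err, io.EOF) {
			return out, false, nil // peer closed the connection
		}
		var netErr net.Error
		if errors.As(err, &netErr) && netErr.Timeout() {
			return out, true, nil // nothing arrived within idle
		}
		return out, false, err
	}
}

// demo stands in for the real server (an assumption): the peer sends
// "hello" and then keeps the connection open without closing it.
func demo() (string, bool, error) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go server.Write([]byte("hello"))
	data, timedOut, err := readWithDeadline(client, 200*time.Millisecond)
	return string(data), timedOut, err
}

func main() {
	data, timedOut, err := demo()
	fmt.Printf("data=%q timedOut=%v err=%v\n", data, timedOut, err)
}
```

With a deadline the loop gives up after the idle period instead of waiting for the peer to close. If the protocol has a length prefix or a terminator, stopping on that is usually the more reliable choice than any timeout.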

Related

Handling return values of ReadKey

Program Example3;
uses Crt;
{ Program to demonstrate the ReadKey function. }
var
  ch : char;
begin
  writeln('Press Left/Right, Esc=Quit');
  repeat
    ch:=ReadKey;
    case ch of
      #0 : begin
        ch:=ReadKey; {Read ScanCode}
        case ch of
          #32: Writeln ('Space');
          #75 : WriteLn('Left');
          #77 : WriteLn('Right');
        end;
      end;
      #27 : WriteLn('ESC');
    end;
  until ch=#27 {Esc}
end.
This is Pascal in the Lazarus IDE. I want to extend an example copied from the documentation so that the program recognizes space as well as the left/right/Esc keys.
I found a program that prints the key codes as you press keys; it reports 32 for space, so I added the #32 case to the inner case statement above. Why do I still see no output when I press space?
case ch of
  #0 : begin
    ch:=ReadKey; {Read ScanCode}
    case ch of
      #75 : WriteLn('Left');
      #77 : WriteLn('Right');
    end;
  end;
  #27 : WriteLn('ESC');
  #32 : WriteLn('Space'); {<- space case should go HERE}
end;
Space is not an extended key, so it is not preceded by #0. The #32 case therefore belongs next to the #0 case in the outer case statement, not inside it.
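The two-step protocol behind this can be modeled in a few lines. This is only an illustrative sketch in Go, not Pascal: `feed` and `decodeKey` are hypothetical names, and `feed` merely replays a fixed byte sequence in place of successive ReadKey calls.

```go
package main

import "fmt"

// feed returns a reader that yields the given bytes one at a time,
// standing in for successive ReadKey calls.
func feed(keys ...byte) func() byte {
	i := 0
	return func() byte {
		k := keys[i]
		i++
		return k
	}
}

// decodeKey models the two-step protocol: a first byte of 0 announces an
// extended key whose scan code follows; any other byte is the key itself.
func decodeKey(next func() byte) string {
	ch := next()
	if ch == 0 {
		switch next() { // second read: the scan code
		case 75:
			return "Left"
		case 77:
			return "Right"
		}
		return "Unknown extended key"
	}
	switch ch { // ordinary keys arrive without a 0 prefix
	case 27:
		return "ESC"
	case 32:
		return "Space"
	}
	return "Other"
}

func main() {
	fmt.Println(decodeKey(feed(32)))    // plain key, no 0 prefix
	fmt.Println(decodeKey(feed(0, 75))) // extended key: 0, then scan code
	fmt.Println(decodeKey(feed(27)))
}
```

Space is handled in the outer switch because it arrives as a single byte; only the arrow keys need the second read.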

Multithreading with runspaces: Method calls cause unexpected behavior

Following the articles Concurrency in PowerShell: Multi-threading with Runspaces and Multithreading PowerShell Scripts, I've been experimenting with multithreading using runspaces, and noticed some behavior that I don't understand.
In the script below, the script block $compute_block simulates an expensive computation by sleeping for a second before returning. Ten threads with this computation are spawned, after which the script waits for all of them to complete and prints the results.
The script can be run with either -Case 1 or -Case 2. In case 1, AddScript is called directly with $compute_block. In case 2, an object is created with a method equivalent to $compute_block, and AddScript is given a script block that calls this method.
threading.ps1
Param([Int] $Case)
Set-StrictMode -Version 2
$start_date = Get-Date
Function Get-Timestamp() {
Return ("{0:N3}" -f ((Get-Date) - $start_date).TotalSeconds)
}
Function Write-Timed([String] $str) {
Write-Host "[$(Get-Timestamp)] $str"
}
$count = 10
$runspace_pool = [RunspaceFactory]::CreateRunspacePool(1, $count)
$runspace_pool.Open()
$compute_block = {
Param($x)
Start-Sleep 1
Return $x
}
# Spawn jobs
$jobs = @()
ForEach($i In 0..($count-1)) {
Write-Timed "Creating job #$i"
$ps = [PowerShell]::Create()
Switch($Case) {
1 {
$job = $ps.AddScript($compute_block).AddArgument($i)
}
2 {
$object = New-Object PSObject `
| Add-Member ScriptMethod Compute $compute_block -PassThru
$job = $ps.AddScript({
Param($o, $x)
Return $o.Compute($x)
}).AddArgument($object).AddArgument($i)
}
}
$job.RunspacePool = $runspace_pool
Write-Timed "Starting job #$i"
$result = $job.BeginInvoke()
Write-Timed "Creating record for job #$i"
$record = New-Object PSObject -Property @{
"Job" = $job;
"Result" = $result;
}
Write-Timed "Adding record for job #$i to list"
$jobs += $record
}
# Wait for all jobs to complete
While($true) {
$running_count = @($jobs | Where-Object { -not $_.Result.IsCompleted }).Count
Write-Timed "Waiting for $running_count/$count jobs"
If(0 -eq $running_count) {
Break
}
Start-Sleep 1
}
# Print results
$i = 0
ForEach($job In $jobs) {
$result = $job.Job.EndInvoke($job.Result)
$job.Job.Dispose()
Write-Timed "Job #$i result: $result"
$i++
}
$runspace_pool.Close()
The results are very different (note the timestamps):
Case 1
PS C:\Users\Miranda\Documents> .\threading.ps1 -Case 1
[0.050] Creating job #0
[0.051] Starting job #0
[0.052] Creating record for job #0
[0.052] Adding record for job #0 to list
[0.053] Creating job #1
[0.053] Starting job #1
[0.056] Creating record for job #1
[0.057] Adding record for job #1 to list
[0.057] Creating job #2
[0.058] Starting job #2
[0.061] Creating record for job #2
[0.062] Adding record for job #2 to list
[0.062] Creating job #3
[0.063] Starting job #3
[0.066] Creating record for job #3
[0.066] Adding record for job #3 to list
[0.067] Creating job #4
[0.067] Starting job #4
[0.070] Creating record for job #4
[0.071] Adding record for job #4 to list
[0.071] Creating job #5
[0.072] Starting job #5
[0.075] Creating record for job #5
[0.076] Adding record for job #5 to list
[0.076] Creating job #6
[0.077] Starting job #6
[0.080] Creating record for job #6
[0.080] Adding record for job #6 to list
[0.081] Creating job #7
[0.081] Starting job #7
[0.084] Creating record for job #7
[0.085] Adding record for job #7 to list
[0.085] Creating job #8
[0.086] Starting job #8
[0.102] Creating record for job #8
[0.103] Adding record for job #8 to list
[0.104] Creating job #9
[0.104] Starting job #9
[0.114] Creating record for job #9
[0.115] Adding record for job #9 to list
[0.119] Waiting for 10/10 jobs
[1.120] Waiting for 0/10 jobs
[1.121] Job #0 result: 0
[1.122] Job #1 result: 1
[1.122] Job #2 result: 2
[1.123] Job #3 result: 3
[1.124] Job #4 result: 4
[1.124] Job #5 result: 5
[1.124] Job #6 result: 6
[1.125] Job #7 result: 7
[1.125] Job #8 result: 8
[1.126] Job #9 result: 9
Case 2
PS C:\Users\Miranda\Documents> .\threading.ps1 -Case 2
[0.080] Creating job #0
[0.117] Starting job #0
[0.120] Creating record for job #0
[0.121] Adding record for job #0 to list
[1.126] Creating job #1
[1.128] Starting job #1
[1.129] Creating record for job #1
[2.130] Adding record for job #1 to list
[2.130] Creating job #2
[2.132] Starting job #2
[2.132] Creating record for job #2
[3.134] Adding record for job #2 to list
[3.135] Creating job #3
[3.136] Starting job #3
[3.137] Creating record for job #3
[4.137] Adding record for job #3 to list
[4.138] Creating job #4
[4.139] Starting job #4
[4.140] Creating record for job #4
[5.141] Adding record for job #4 to list
[5.142] Creating job #5
[5.143] Starting job #5
[5.144] Creating record for job #5
[6.144] Adding record for job #5 to list
[6.145] Creating job #6
[6.146] Starting job #6
[6.147] Creating record for job #6
[7.148] Adding record for job #6 to list
[7.149] Creating job #7
[7.150] Starting job #7
[7.151] Creating record for job #7
[8.152] Adding record for job #7 to list
[8.153] Creating job #8
[8.166] Starting job #8
[8.167] Creating record for job #8
[9.168] Adding record for job #8 to list
[9.169] Creating job #9
[9.170] Starting job #9
[9.171] Creating record for job #9
[10.172] Adding record for job #9 to list
[10.192] Waiting for 0/10 jobs
[10.206] Job #0 result: 0
[10.208] Job #1 result: 1
[10.209] Job #2 result: 2
[10.209] Job #3 result: 3
[10.209] Job #4 result: 4
[10.210] Job #5 result: 5
[10.211] Job #6 result: 6
[10.211] Job #7 result: 7
[10.212] Job #8 result: 8
[10.212] Job #9 result: 9
Case 1 behaves as I would expect — the threads are spawned instantaneously, and the script spends a second in the wait loop before finishing.
However, in case 2, all concurrency seems to be lost. Each iteration of the spawning loop blocks until the spawned thread finishes, and once the wait loop is reached, there's nothing left to wait for. Why is this happening?
Edit:
For the record, I'm working with PS 3.0. As noted by Roman Kuzmin, running case 2 in PS 2.0 produces some very strange errors:
PS>.\threading.ps1 -Case 2
[0.060] Creating job #0
[0.060] Starting job #0
The '=' operator failed: Index was outside the bounds of the array..
At C:\Users\Miranda\Documents\threading.ps1:50 char:14
+ $result = <<<< $job.BeginInvoke()
+ CategoryInfo : InvalidOperation: (System.Manageme...hellAsyncResult:PowerShellAsyncResult) [], RuntimeE
xception
+ FullyQualifiedErrorId : OperatorFailed
C:\Users\Miranda\Documents\threading.ps1 : Index was outside the bounds of the array.
At line:1 char:16
+ .\threading.ps1 <<<< -Case 2
+ CategoryInfo : NotSpecified: (:) [threading.ps1], IndexOutOfRangeException
+ FullyQualifiedErrorId : System.IndexOutOfRangeException,threading.ps1
I cannot tell exactly why PowerShell behaves this way, but I can tell what causes the problem and how to work around it.
In case 2 the same script block $compute_block is used as the script method in 10 objects/runspaces; this is the culprit. If we create and use cloned script blocks, i.e.
$compute_block2 = [scriptblock]::Create($compute_block)
$object = New-Object PSObject `
| Add-Member ScriptMethod Compute $compute_block2 -PassThru
then the problem is resolved.
Interestingly, in PowerShell v2 the original script (case 2) does not work at all; it fails with some weird error messages. With the fix it works in v2 as well as in v3. It looks like you hit a problem case that is best avoided. I do not remember seeing this documented. It may be a bug or a feature.

gpus_ReturnGuiltyForHardwareRestart crash

The application crashes in presentFramebuffer (while running in the foreground, with no interruption occurring).
It does not crash on the first frame; it draws for a while and then suddenly crashes.
I don't have exact steps to reproduce it, but it seems related to drawing something specific. Still, no OpenGL error is reported through the application, including an error check right before presentFramebuffer. If I add glFinish before presentFramebuffer, it crashes in glFinish instead.
The application crashes with EXC_BAD_ACCESS (code=1, address=0x1) and the call stack below, without any other error, log, or debug info.
Here is the call stack reported at the crash:
Thread 1, Queue : com.apple.main-thread
> #0 0x36871e46 in gpus_ReturnGuiltyForHardwareRestart ()
> #1 0x36872764 in gpusSubmitDataBuffers ()
> #2 0x31eae624 in SubmitPacketsIfAny ()
> #3 0x378a337a in gliPresentViewES ()
> #4 0x325b6df2 in -[EAGLContext presentRenderbuffer:] ()
> #5 0x0052986e in EAGLContext_presentRenderbuffer(EAGLContext*, objc_selector*, unsigned int) ()
> #6 0x000e2a48 in -[EAGLView presentFramebuffer] at /svn/src_svn/GG/iphone/Classes/EAGLView.mm:228
> #7 0x000e4066 in -[GGViewController drawFrame] at /svn/src_svn/GG/iphone/Classes/GGViewController.mm:504
> #8 0x3809ab0a in __NSFireTimer ()
> #9 0x39d36856 in __CFRUNLOOP_IS_CALLING_OUT_TO_A_TIMER_CALLBACK_FUNCTION__ ()
> #10 0x39d36502 in __CFRunLoopDoTimer ()
> #11 0x39d35176 in __CFRunLoopRun ()
> #12 0x39ca823c in CFRunLoopRunSpecific ()
> #13 0x39ca80c8 in CFRunLoopRunInMode ()
> #14 0x39b9333a in GSEventRunModal ()
> #15 0x3551b288 in UIApplicationMain ()
> #16 0x000e1bae in main at /svn/src_svn/GG/iphone/main.m:14
Does anyone have a clue about this one?
If you are using VAO, this can be caused by the index buffer (element array buffer) referencing vertices beyond the vertex buffer limits (VBO).
Keep in mind that the element array buffer is stored in the VAO, so as long as the VAO is bound, each call to glBindBuffer( GL_ELEMENT_ARRAY_BUFFER ) replaces the index buffer. If you forget to unbind the VAO when you move to your scene's next object you will be altering the VAO of the previous call.
More info over here: http://www.opengl.org/wiki/Vertex_Specification#Index_buffers
And a debugging tip: oversize your vertex buffers. That might turn this crash into a glitch that you can then inspect with the OpenGL ES frame capture tool in Xcode (this requires Xcode 4.5 and iOS 6).
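The out-of-range hypothesis can also be checked on the CPU before the index data is uploaded. The following is a rough sketch in Go for illustration only (the application itself is Objective-C++); `firstOutOfRange` is a hypothetical helper, and 16-bit indices are an assumption.

```go
package main

import "fmt"

// firstOutOfRange returns the position of the first index that refers past
// the end of a vertex buffer holding vertexCount vertices, or -1 if every
// index is valid.
func firstOutOfRange(indices []uint16, vertexCount int) int {
	for i, idx := range indices {
		if int(idx) >= vertexCount {
			return i
		}
	}
	return -1
}

func main() {
	// Two triangles drawn from a 4-vertex quad: all indices are valid.
	quad := []uint16{0, 1, 2, 2, 1, 3}
	fmt.Println(firstOutOfRange(quad, 4))

	// The same index list against a 3-vertex buffer: index 3 is out of range.
	fmt.Println(firstOutOfRange(quad, 3))
}
```

Running such a check in debug builds for every mesh narrows the problem to a specific object, which is hard to do from the GPU crash alone.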
Looks like the problem was caused by having glEnableClientState(GL_TEXTURE_COORD_ARRAY) for GL_TEXTURE1 but not providing the actual data in the vertex buffer.

import specific columns and range of rows from .dat file

How would I import the data from the fourth row from the following .dat file:
#0 Date-time: 07/06/2011 09:13:53
#1 Recorder: 10T2607
#2 File type: 1
#3 Columns: 3
#4 Channels: 1
#5 Field separation: 0
#6 Decimal point: 0
#7 Date def.: 0 0
#8 Time def.: 0
#9 Channel 1: Temperature(°C) Temp(°C) 3 1
#11 Reconvertion: 0
#19 Line color: 1 2 3 4
#30 Trend Type Number: 1
#33 Limit Temp. Corr. OTCR: 0
1 07.04.11 08:00:00 17,433
2 07.04.11 08:05:00 17,446
3 07.04.11 08:10:00 17,458
4 07.04.11 08:15:00 17,458
So, following the line that begins with #33 I would like to import 17,433 (which should be 17.433) then 17,446 and so on. I have tried to use textscan and headerlines by specifying that the data begins on line 13:
filename = 'Folder\data.dat';
fid = fopen(filename);
data = textscan(fid,'%f\t%f\t%f\t%f\n','Headerlines',13);
fclose(fid);
However, this does not work (in the sense that MATLAB returns an empty array). I guess this is due to the second and third column not being a floating point number, however, it does not work when I specify it to be a string either. What should I try next?
First, note that you have 14 headerlines.
For the data import, you can try the following:
filename = 'Folder\data.dat';
fid = fopen(filename);
data = textscan(fid,'%f\t%s\t%s\t%s','Headerlines',14);
a = cellfun(@(x) str2num(strrep(x, ',', '.')), data{4});
fclose(fid);
This results in
a =
17.4330
17.4460
17.4580
17.4580
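For comparison, the same idea (skip the header, take the last column, replace the decimal comma) can be sketched outside MATLAB. This Go version is only an illustration under assumptions: `parseTemps` is a hypothetical name, the header is recognized by its leading `#` rather than by counting 14 lines, and the sample text is a shortened copy of the file above.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseTemps skips header lines starting with '#' and returns the last field
// of every data row as a float, treating ',' as the decimal separator.
func parseTemps(input string) ([]float64, error) {
	var temps []float64
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // header or blank line
		}
		fields := strings.Fields(line)
		// "17,433" becomes "17.433" so that ParseFloat accepts it.
		last := strings.Replace(fields[len(fields)-1], ",", ".", 1)
		v, err := strconv.ParseFloat(last, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value %q: %w", last, err)
		}
		temps = append(temps, v)
	}
	return temps, sc.Err()
}

func main() {
	sample := `#30 Trend Type Number: 1
#33 Limit Temp. Corr. OTCR: 0
1 07.04.11 08:00:00 17,433
2 07.04.11 08:05:00 17,446`
	temps, err := parseTemps(sample)
	fmt.Println(temps, err)
}
```

Detecting the header by prefix avoids the off-by-one in the header count that the question ran into, at the cost of assuming no data row ever starts with `#`.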

Optimizing RGBA8888 to RGB565 conversion with NEON

I'm trying to optimize an image format conversion on iOS using the NEON vector instruction set. I assumed this would map well to that because it processes a bunch of similar data.
My attempts haven't gone that well, though, achieving only a marginal speedup over the naive C implementation:
for(int i = 0; i < pixelCount; ++i, ++inPixel32) {
    const unsigned int r = ((*inPixel32 >> 0 ) & 0xFF);
    const unsigned int g = ((*inPixel32 >> 8 ) & 0xFF);
    const unsigned int b = ((*inPixel32 >> 16) & 0xFF);
    *outPixel16++ = ((r >> 3) << 11) | ((g >> 2) << 5) | ((b >> 3) << 0);
}
1 megapixel image array on iPad 2:
format is [min avg max n=number of timer samples] in milliseconds
C:
[14.446 14.632 18.405 n=1000]ms
NEON:
[11.920 12.032 15.336 n=1000]ms
My attempt at a NEON implementation is below:
int i;
const int pixelsPerLoop = 8;
for(i = 0; i < pixelCount; i += pixelsPerLoop, inPixel32 += pixelsPerLoop, outPixel16 += pixelsPerLoop) {
    //Read all r,g,b pixels into 3 registers
    uint8x8x4_t rgba = vld4_u8(inPixel32);
    //Right-shift r,g,b as appropriate
    uint8x8_t r = vshr_n_u8(rgba.val[0], 3);
    uint8x8_t g = vshr_n_u8(rgba.val[1], 2);
    uint8x8_t b = vshr_n_u8(rgba.val[2], 3);
    //Widen b
    uint16x8_t r5_g6_b5 = vmovl_u8(b);
    //Widen r
    uint16x8_t r16 = vmovl_u8(r);
    //Left shift into position within 16-bit int
    r16 = vshlq_n_u16(r16, 11);
    r5_g6_b5 |= r16;
    //Widen g
    uint16x8_t g16 = vmovl_u8(g);
    //Left shift into position within 16-bit int
    g16 = vshlq_n_u16(g16, 5);
    r5_g6_b5 |= g16;
    //Now write back to memory
    vst1q_u16(outPixel16, r5_g6_b5);
}
//Do the remainder on normal flt hardware
Code was compiled via LLVM 3.0 into the following (.loc and extra labels removed):
_DNConvert_ARGB8888toRGB565:
push {r4, r5, r7, lr}
mov r9, r1
mov.w r12, #0
add r7, sp, #8
cmp r2, #0
mov.w r1, #0
it ne
movne r1, #1
cmp r0, #0
mov.w r3, #0
it ne
movne r3, #1
cmp.w r9, #0
mov.w r4, #0
it ne
movne r4, #1
tst.w r9, #3
bne LBB0_8
ands r1, r3
ands r1, r4
cmp r1, #1
bne LBB0_8
movs r1, #0
lsr.w lr, r9, #2
cmp.w r1, r9, lsr #2
bne LBB0_9
mov r3, r2
mov r5, r0
b LBB0_5
LBB0_4:
movw r1, #65528
add.w r0, lr, #7
movt r1, #32767
ands r1, r0
LBB0_5:
mov.w r12, #1
cmp r1, lr
bhs LBB0_8
rsb r0, r1, r9, lsr #2
mov.w r9, #63488
mov.w lr, #2016
mov.w r12, #1
LBB0_7:
ldr r2, [r5], #4
subs r0, #1
and.w r1, r9, r2, lsl #8
and.w r4, lr, r2, lsr #5
ubfx r2, r2, #19, #5
orr.w r2, r2, r4
orr.w r1, r1, r2
strh r1, [r3], #2
bne LBB0_7
LBB0_8:
mov r0, r12
pop {r4, r5, r7, pc}
LBB0_9:
sub.w r1, lr, #1
movs r3, #32
add.w r3, r3, r1, lsl #2
bic r3, r3, #31
adds r5, r0, r3
movs r3, #16
add.w r1, r3, r1, lsl #1
bic r1, r1, #15
adds r3, r2, r1
movs r1, #0
LBB0_10:
vld4.8 {d16, d17, d18, d19}, [r0]!
adds r1, #8
cmp r1, lr
vshr.u8 d20, d16, #3
vshr.u8 d21, d17, #2
vshr.u8 d16, d18, #3
vmovl.u8 q11, d20
vmovl.u8 q9, d21
vmovl.u8 q8, d16
vshl.i16 q10, q11, #11
vshl.i16 q9, q9, #5
vorr q8, q8, q10
vorr q8, q8, q9
vst1.16 {d16, d17}, [r2]!
Ltmp28:
blo LBB0_10
b LBB0_4
Full code is available at https://github.com/darknoon/DNImageConvert I would appreciate any help, thanks!
Here you are: a hand-optimized NEON implementation, ready for Xcode:
/* IT DOESN'T WORK!!! USE THE NEXT VERSION BELOW.
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
rg .req red
gb .req blu
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vshr.u8 red, red, #3
vext.8 rg, grn, red, #5
vshr.u8 grn, grn, #2
vext.8 gb, blu, grn, #3
vst2.8 {gb, rg}, [pDst]!
bgt loop
bx lr
This version will be many times faster than what you suggested :
increased cache hit rate via PLD
conversion to "long" not necessary
fewer instructions within the loop
There is still some room for optimizations though, you could modify the loop so that it converts 16 pixels per iteration instead of 8.
Then you can schedule the instructions to avoid the two stalls completely (which is simply not possible in this 8/iteration version above) and benefit from NEON's dual-issue capability in addition.
I didn't do this because it would make the code hard to understand.
It's important to know what VEXT is supposed to do.
Now it's up to you. :)
I verified this code to be properly compiled under Xcode.
Although I'm pretty sure it works correctly as well, I cannot guarantee this since I don't have the test environment.
In case of malfunctioning, please let me know. I'll correct it accordingly then.
cya
==============================================================================
Well, here is the improved version.
Because the VSRI instruction does not allow two source operands in addition to the destination, it was not possible to make the register assignment more robust.
Please check the image format of your source image. (exact byte order of the elements)
If it's not B, G, R, A, which is the default and native one on iOS, your application will suffer heavily from internal conversions by iOS.
If it's absolutely not possible to change this for whatever the reason, let me know.
I'll write a new version matching it.
PS : I forgot to remove the underscore at the start of the function prototype. Now it's gone.
/*
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*
* Version 1.1
* - bug fix
*
* Version 1.0
* - initial release
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
gb .req grn
rg .req red
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
.loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vsri.8 red, grn, #5
vshl.u8 gb, grn, #3
vsri.8 gb, blu, #3
vst2.8 {gb, rg}, [pDst]!
bgt .loop
bx lr
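To see why the three-instruction sequence in the loop above produces RGB565, it can help to model one lane in scalar code. This is a sketch for checking the bit arithmetic only, not a replacement for the assembly: `vsri8`, `packBGRA` and `packScalar` are hypothetical names, and the model assumes that `vst2.8 {gb, rg}` stores `gb` as the low byte and `rg` as the high byte of each little-endian 16-bit pixel.

```go
package main

import "fmt"

// vsri8 models one 8-bit lane of VSRI: shift src right by n and insert it
// into dst, keeping the top n bits of dst unchanged.
func vsri8(dst, src uint8, n uint) uint8 {
	keep := uint8(0xFF) << (8 - n)
	return dst&keep | src>>n
}

// packBGRA mirrors the loop body: vsri.8 red, grn, #5; vshl.u8 gb, grn, #3;
// vsri.8 gb, blu, #3; then the two bytes are interleaved on store.
func packBGRA(blu, grn, red uint8) uint16 {
	rg := vsri8(red, grn, 5)    // RRRRRGGG
	gb := vsri8(grn<<3, blu, 3) // GGGBBBBB
	return uint16(rg)<<8 | uint16(gb)
}

// packScalar is the plain shift-and-or formula for RGB565.
func packScalar(blu, grn, red uint8) uint16 {
	return uint16(red>>3)<<11 | uint16(grn>>2)<<5 | uint16(blu>>3)
}

func main() {
	// Compare both formulas over every 24-bit colour.
	for c := 0; c < 1<<24; c++ {
		b, g, r := uint8(c), uint8(c>>8), uint8(c>>16)
		if packBGRA(b, g, r) != packScalar(b, g, r) {
			fmt.Printf("mismatch at b=%d g=%d r=%d\n", b, g, r)
			return
		}
	}
	fmt.Println("all colours match")
}
```

The model also shows why no widening to 16-bit lanes is needed: each output byte is assembled independently, and the interleaving store does the rest.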
If you are on iOS or OS X, then you may be delighted to discover vImageConvert_RGBA8888toRGB565() and friends, in Accelerate.framework. This function rounds the 8-bit values to nearest 565 value.
For even better dithering, the quality of which is nearly indistinguishable from 8-bit color, try vImageConvert_AnyToAny():
vImage_CGImageFormat RGBA8888Format =
{
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.bitmapInfo = kCGBitmapByteOrderDefault | kCGImageAlphaNoneSkipLast,
.colorSpace = NULL, // sRGB or substitute your own in
};
vImage_CGImageFormat RGB565Format =
{
.bitsPerComponent = 5,
.bitsPerPixel = 16,
.bitmapInfo = kCGBitmapByteOrder16Little | kCGImageAlphaNone,
.colorSpace = RGBA8888Format.colorSpace,
};
vImage_Error err = kvImageNoError;
vImageConverterRef converter = vImageConverter_CreateWithCGImageFormat(
&RGBA8888Format, &RGB565Format, NULL, kvImageNoFlags, &err );
err = vImageConvert_AnyToAny( converter, &src, &dest, NULL, kvImageNoFlags );
Either of these approaches will be vectorized and multithreaded for best performance.
You might want to use vld4q_u8() instead of vld4_u8() and adjust the rest of your code accordingly. It's hard to tell where the problem might be, but the assembler doesn't look too bad otherwise.
(I'm not familiar with NEON, nor deeply with the memory system of the iPad 2, but this is what we used to do with 88110 pixel-ops, which were an early precursor to today's SIMD extensions.)
How big is the memory latency?
Could you hide it by unrolling the inner loop and running the NEON instructions on the "previous" values while the ARM pulls the "next" values from memory? A brief scan of the NEON manual implies you can run ARM and NEON instructions in parallel.
I don't think switching from vld4_u8 to vld4q_u8 would improve performance.
The code seems simple enough. I am not good at assembly, so it would take me some time to look into it deeply.
The NEON code seems simple enough, but I am not quite sure about r5_g6_b5 |= g16 being used instead of vorrq_u16.
Please check the optimization level too. From what I have heard, the optimization level for NEON code goes to a maximum of 1, so the comparison may be skewed if the reference code and the NEON code are built at different default optimization levels.
I don't see anything in the NEON code that could improve on the current version.