How to implement 16 bit->32 bit lookup table in ARM assembly using NEON?

How to implement 16 bit->32 bit lookup table in ARM assembly using NEON? - iphone

In an iOS 6 project, I have a buffer containing two byte words (16 bits) that need to be translated to four byte words (32 bits) via a lookup table. I hard-code the values into the table, and then use the the value of the two byte buffer to determine which 32 bit table value to retrieve. Here's an example:
void map_values(uint32_t *dst,uint16_t *src,uint32_t *lut,int buf_length){
int i=0;
for(i=0;i<buf_length;i++){
*dst = *(lut+(*src));
dst++;
src++;
}
}
The problem is, it's too slow. Could this be sped up by processing 4 output bytes at a time using NEON? The thing is, I'm iffy on how to take the value from the src buffer and use that as an input to the lookup table to figure out what value to retrieve. Also, the word lengths are the same in the table and the output buffer, but not for the source. So, I can only read two 16 bit words as input, versus the four 32 bit word output I need. Any ideas? Is there a better way to approach this problem, perhaps?
Current asm output from clang (clang -O3 -arch armv7 lut.c -S):
.section __TEXT,__text,regular,pure_instructions
.section __TEXT,__textcoal_nt,coalesced,pure_instructions
.section __TEXT,__const_coal,coalesced
.section __TEXT,__picsymbolstub4,symbol_stubs,none,16
.section __TEXT,__StaticInit,regular,pure_instructions
.syntax unified
.section __TEXT,__text,regular,pure_instructions
.globl _map_values
.align 2
.code 16 # #map_values
.thumb_func _map_values
_map_values:
# BB#0:
cmp r3, #0
it eq
bxeq lr
LBB0_1: # %.lr.ph
# =>This Inner Loop Header: Depth=1
ldrh r9, [r1], #2
subs r3, #1
ldr.w r9, [r2, r9, lsl #2]
str r9, [r0], #4
bne LBB0_1
# BB#2: # %._crit_edge
bx lr
.subsections_via_symbols

Lookup tables are (nearly) unvectorizable. Very small lookup tables can be handled with the vtbl instruction, but your lookup table is far too big for that.
What are you using the lookup table for? If the values can be computed on the fly without too much work instead of looking them up, that may actually be a significant win for you.

My first thought is that you might get some luck out of vtablelookup in the vecLib portion of the Accelerate framework. The signature is:
vUInt32 vtablelookup (
vSInt32 Index_Vect,
uint32_t *Table
);
where vSInt32 and vUInt32 are 128 bit packed 32 bit signed/unsigned integers respectively. I believe the function is backed by NEON on ARM. The big problem will be converting your src array into 32 bit indices, which could well slow things down so much as to render the speed gains from vectorising the lookup pointless.

Related

Using scanf in x86-64 gas assembly gives sigsegv [duplicate]

When compiling below code:
global main
extern printf, scanf
section .data
msg: db "Enter a number: ",10,0
format:db "%d",0
section .bss
number resb 4
section .text
main:
mov rdi, msg
mov al, 0
call printf
mov rsi, number
mov rdi, format
mov al, 0
call scanf
mov rdi,format
mov rsi,[number]
inc rsi
mov rax,0
call printf
ret
using:
nasm -f elf64 example.asm -o example.o
gcc -no-pie -m64 example.o -o example
and then run
./example
it runs, print: enter a number:
but then crashes and prints:
Segmentation fault (core dumped)
So printf works fine but scanf not.
What am I doing wrong with scanf so?

Use sub rsp, 8 / add rsp, 8 at the start/end of your function to re-align the stack to 16 bytes before your function does a call.
Or better push/pop a dummy register, e.g. push rdx / pop rcx, or a call-preserved register like RBP you actually wanted to save anyway. You need the total change to RSP to be an odd multiple of 8 counting all pushes and sub rsp, from function entry to any call.
i.e. 8 + 16*n bytes for whole number n.
On function entry, RSP is 8 bytes away from 16-byte alignment because the call pushed an 8-byte return address. See Printing floating point numbers from x86-64 seems to require %rbp to be saved,
main and stack alignment, and Calling printf in x86_64 using GNU assembler. This is an ABI requirement which you used to be able to get away with violating when there weren't any FP args for printf. But not any more.
See also Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
To put it another way, RSP % 16 == 8 on function entry, and you need to ensure RSP % 16 == 0 before you call a function. How you do this doesn't matter. (Not all functions will actually crash if you don't, but the ABI does require/guarantee it.)
gcc's code-gen for glibc scanf now depends on 16-byte stack alignment
even when AL == 0.
It seems to have auto-vectorized copying 16 bytes somewhere in __GI__IO_vfscanf, which regular scanf calls after spilling its register args to the stack1. (The many similar ways to call scanf share one big implementation as a back end to the various libc entry points like scanf, fscanf, etc.)
I downloaded Ubuntu 18.04's libc6 binary package: https://packages.ubuntu.com/bionic/amd64/libc6/download and extracted the files (with 7z x blah.deb and tar xf data.tar, because 7z knows how to extract a lot of file formats).
I can repro your bug with LD_LIBRARY_PATH=/tmp/bionic-libc/lib/x86_64-linux-gnu ./bad-printf, and also it turns out with the system glibc 2.27-3 on my Arch Linux desktop.
With GDB, I ran it on your program and did set env LD_LIBRARY_PATH /tmp/bionic-libc/lib/x86_64-linux-gnu then run. With layout reg, the disassembly window looks like this at the point where it received SIGSEGV:
│0x7ffff786b49a <_IO_vfscanf+602> cmp r12b,0x25 │
│0x7ffff786b49e <_IO_vfscanf+606> jne 0x7ffff786b3ff <_IO_vfscanf+447> │
│0x7ffff786b4a4 <_IO_vfscanf+612> mov rax,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ab <_IO_vfscanf+619> add rax,QWORD PTR [rbp-0x458] │
│0x7ffff786b4b2 <_IO_vfscanf+626> movq xmm0,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ba <_IO_vfscanf+634> mov DWORD PTR [rbp-0x678],0x0 │
│0x7ffff786b4c4 <_IO_vfscanf+644> mov QWORD PTR [rbp-0x608],rax │
│0x7ffff786b4cb <_IO_vfscanf+651> movzx eax,BYTE PTR [rbx+0x1] │
│0x7ffff786b4cf <_IO_vfscanf+655> movhps xmm0,QWORD PTR [rbp-0x608] │
>│0x7ffff786b4d6 <_IO_vfscanf+662> movaps XMMWORD PTR [rbp-0x470],xmm0 │
So it copied two 8-byte objects to the stack with movq + movhps to load and movaps to store. But with the stack misaligned, movaps [rbp-0x470],xmm0 faults.
I didn't grab a debug build to find out exactly which part of the C source turned into this, but the function is written in C and compiled by GCC with optimization enabled. GCC has always been allowed to do this, but only recently did it get smart enough to take better advantage of SSE2 this way.
Footnote 1: printf / scanf with AL != 0 has always required 16-byte alignment because gcc's code-gen for variadic functions uses test al,al / je to spill the full 16-byte XMM regs xmm0..7 with aligned stores in that case. __m128i can be an argument to a variadic function, not just double, and gcc doesn't check whether the function ever actually reads any 16-byte FP args.

Segfault in scan with x86_64 NASM [duplicate]

When compiling below code:
global main
extern printf, scanf
section .data
msg: db "Enter a number: ",10,0
format:db "%d",0
section .bss
number resb 4
section .text
main:
mov rdi, msg
mov al, 0
call printf
mov rsi, number
mov rdi, format
mov al, 0
call scanf
mov rdi,format
mov rsi,[number]
inc rsi
mov rax,0
call printf
ret
using:
nasm -f elf64 example.asm -o example.o
gcc -no-pie -m64 example.o -o example
and then run
./example
it runs, print: enter a number:
but then crashes and prints:
Segmentation fault (core dumped)
So printf works fine but scanf not.
What am I doing wrong with scanf so?

Use sub rsp, 8 / add rsp, 8 at the start/end of your function to re-align the stack to 16 bytes before your function does a call.
Or better push/pop a dummy register, e.g. push rdx / pop rcx, or a call-preserved register like RBP you actually wanted to save anyway. You need the total change to RSP to be an odd multiple of 8 counting all pushes and sub rsp, from function entry to any call.
i.e. 8 + 16*n bytes for whole number n.
On function entry, RSP is 8 bytes away from 16-byte alignment because the call pushed an 8-byte return address. See Printing floating point numbers from x86-64 seems to require %rbp to be saved,
main and stack alignment, and Calling printf in x86_64 using GNU assembler. This is an ABI requirement which you used to be able to get away with violating when there weren't any FP args for printf. But not any more.
See also Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
To put it another way, RSP % 16 == 8 on function entry, and you need to ensure RSP % 16 == 0 before you call a function. How you do this doesn't matter. (Not all functions will actually crash if you don't, but the ABI does require/guarantee it.)
gcc's code-gen for glibc scanf now depends on 16-byte stack alignment
even when AL == 0.
It seems to have auto-vectorized copying 16 bytes somewhere in __GI__IO_vfscanf, which regular scanf calls after spilling its register args to the stack1. (The many similar ways to call scanf share one big implementation as a back end to the various libc entry points like scanf, fscanf, etc.)
I downloaded Ubuntu 18.04's libc6 binary package: https://packages.ubuntu.com/bionic/amd64/libc6/download and extracted the files (with 7z x blah.deb and tar xf data.tar, because 7z knows how to extract a lot of file formats).
I can repro your bug with LD_LIBRARY_PATH=/tmp/bionic-libc/lib/x86_64-linux-gnu ./bad-printf, and also it turns out with the system glibc 2.27-3 on my Arch Linux desktop.
With GDB, I ran it on your program and did set env LD_LIBRARY_PATH /tmp/bionic-libc/lib/x86_64-linux-gnu then run. With layout reg, the disassembly window looks like this at the point where it received SIGSEGV:
│0x7ffff786b49a <_IO_vfscanf+602> cmp r12b,0x25 │
│0x7ffff786b49e <_IO_vfscanf+606> jne 0x7ffff786b3ff <_IO_vfscanf+447> │
│0x7ffff786b4a4 <_IO_vfscanf+612> mov rax,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ab <_IO_vfscanf+619> add rax,QWORD PTR [rbp-0x458] │
│0x7ffff786b4b2 <_IO_vfscanf+626> movq xmm0,QWORD PTR [rbp-0x460] │
│0x7ffff786b4ba <_IO_vfscanf+634> mov DWORD PTR [rbp-0x678],0x0 │
│0x7ffff786b4c4 <_IO_vfscanf+644> mov QWORD PTR [rbp-0x608],rax │
│0x7ffff786b4cb <_IO_vfscanf+651> movzx eax,BYTE PTR [rbx+0x1] │
│0x7ffff786b4cf <_IO_vfscanf+655> movhps xmm0,QWORD PTR [rbp-0x608] │
>│0x7ffff786b4d6 <_IO_vfscanf+662> movaps XMMWORD PTR [rbp-0x470],xmm0 │
So it copied two 8-byte objects to the stack with movq + movhps to load and movaps to store. But with the stack misaligned, movaps [rbp-0x470],xmm0 faults.
I didn't grab a debug build to find out exactly which part of the C source turned into this, but the function is written in C and compiled by GCC with optimization enabled. GCC has always been allowed to do this, but only recently did it get smart enough to take better advantage of SSE2 this way.
Footnote 1: printf / scanf with AL != 0 has always required 16-byte alignment because gcc's code-gen for variadic functions uses test al,al / je to spill the full 16-byte XMM regs xmm0..7 with aligned stores in that case. __m128i can be an argument to a variadic function, not just double, and gcc doesn't check whether the function ever actually reads any 16-byte FP args.

How to truncate a 2's complement output

I have data written into short data type. The data written is of 2's complement form.
Now when I try to print the data using %04x, the data with MSB=0 is printed fine for eg if data=740, print I get is 0740
But when the MSB=1, I am unable to get a proper print. For eg if data=842, print I get is fffff842
I want the data truncated to 4 bytes so expected output is f842

Either declare your data as a type which is 16 bits long, or make sure the printing function uses the right format for 16 bits value. Or use your current type, but do a bitwise AND with 0xffff. What you can do depends on the language you're doing it in really.
But whichever way you go, check your assumptions again. There seems to be a few issues in your question:
2s-complement applies to signed numbers only. There are no negative numbers in your question.
Assuming you mean C's short - it doesn't have to be 16 bits long.
"I get is fffff842 I want the data truncated to 4 bytes" - fffff842 is 4 bytes long. f842 is 2 bytes long.
2-bytes long value 842 does not have the MSB set.

I'm assuming C (or possibly C++) as the language here.
Because of the default argument promotions involved when calling a variable argument function (such as printf), your use of a short will result in an integer promotion, which states that "If an int can represent all values of the original type (as restricted by the width, for a
bit-field), the value is converted to an int".
A short is converted to an int by means of sign-extension, and 0xf842 sign-extended to 32 bits is 0xfffff842.
You can use a bitwise AND to mask off the most significant word:
printf("%04x", data & 0xffff);
You could also add the h length specifier to state that you only want to print an (unsigned) short worth of bits from an int:
printf("%04hx", data);

Why this function uses a lot of memory?

I'm trying to unpack binary vector of 140 Million bits into list.
I'm checking the memory usage of this function, but it looks weird. the memory usage rises to 35GB (GB and not MB). how can I reduce the memory usage?
sub bin2list {
# This sub translates a binary vector to a list of "1","0"
my $vector = shift;
my #unpacked = split //, (unpack "B*", $vector );
return #unpacked;
}

Scalars contain a lot of information.
$ perl -MDevel::Peek -e'Dump("0")'
SV = PV(0x42a8330) at 0x42c57b8
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x42ce670 "0"\0
CUR = 1
LEN = 16
In order to keep them as small as possible, a scalar consists of two memory blocks[1], a fixed-sized head, and a body that can be "upgraded" to contain more information.
The smallest type of scalar that can contain a string (such as the ones returned by split) is a SVt_PV. (It's usually called PV, but PV can also refer to the name of the field that points to the string buffer, so I'll go with the name of the constant.)
The first block is the head.
ANY is a pointer to the body.
REFCNT is a reference count that allows Perl to know when the scalar can be deallocated.
FLAGS contains information about what the scalar actually contains. (e.g. SVf_POK means the scalar contains a string.)
TYPE contains information the type of scalar (what kind of information it can contain.)
For an SVt_PV, the last field points to the string buffer.
The second block is the body. The body of an SVt_PV has the following fields:
STASH is not used in the scalars in question since they're not objects.
MAGIC is not used for the scalars in question. Magic allows code to be called when the variable is accessed.
CUR is the length of the string in the buffer.
LEN is the length of the string buffer. Perl over-allocates to speed up concatenation.
The block on the right is the string buffer. As you might have noticed, Perl over-allocates. This speeds up concatenation.
Ignore the block on the bottom. It's an alternative to the string buffer format for special strings (e.g. hash keys).
To how much does that add up?
$ perl -MDevel::Size=total_size -E'say total_size("0")'
28 # 32-bit Perl
56 # 64-bit Perl
That's just for the scalar itself. It doesn't take into the overhead in the memory allocation system of three memory blocks.
These scalars are in an array. An array is really just a scalar.
So an array has overheard.
$ perl -MDevel::Size=total_size -E'say total_size([])'
56 # 32-bit Perl
64 # 64-bit Perl
That's an empty array. You have 140 million of the scalars in yours, so it needs a buffer that can contain 140 million pointers. (In this particular case, the array won't be over-allocated, at least.) Each pointer is 4 bytes on a 32-bit system, 8 on a 64.
That brings the total up to:
32-bit: 56 + (4 + 28) * 140,000,000 = 4,480,000,056
64-bit: 64 + (8 + 56) * 140,000,000 = 8,960,000,064
That doesn't factor in the memory allocation overhead, but it's still very different from the numbers you gave. Why? Well, the scalars returned by split are actually different than the scalars inside the array. So for a moment, you actually have 280,000,000 scalars in memory!
The rest of the memory is probably held by lexical variables in subs that aren't currently executing. Lexical variables aren't normally freed on scope exit since it's expected that the sub will need the memory the next time it's called. That means bin2list continues to use up 140MB of memory after it exits.
Footnotes
Scalars that are undefined can get away without a body until a value is assigned to them. Scalars that contain only an integer can get away without allocating a memory block for the body by storing the integer in the same field as a SVt_PV stores the pointer to the string buffer.
The images are from illguts. They are protected by Copyright.

A single integer value in Perl is going to be stored in an SVt_IV or SVt_UV scalar, whose size will be four machine-sized words - so on a 32bit machine, 16 bytes. An array of 140 million of those, therefore, is going to consume 2.2 billion bytes, presuming it is densely packed together. Add to that the SV * pointers in the AvARRAY used to reference them and we're now at 2.8 billion bytes. Now double that, because you copied the array when you returned it, and we're now at 5.6 billion bytes.
That of course was on a 32bit machine - on a 64bit machine we're at double again, so 11.2 billion bytes. This presumes totally dense packing inside the memory - in practice this will be allocated in stages and chunks, so RAM fragmentation will further add to this. I could imagine a total size around the 35 billion byte mark for this. It doesn't sound outlandishly unreasonable.
For a very easy way to massively reduce the memory usage (not to mention CPU time required), rather than returning the array itself as a list, return a reference to it. Then a single reference is returned rather than a huge list of 140 million SVs; this avoids a second copy also.
sub bin2list {
# This sub translates a binary vector to a list of "1","0"
my $vector = shift;
my #unpacked = split //, (unpack "B*", $vector );
return \#unpacked;
}

Basic NASM bootstrap

I've recently been researching Operating Systems, the boot process, and NASM. On my journeys I ran into a piece of useful bootstrapping code which I partially understand and have tested via a virtual floppy disk. My basic question is to what some of these lines I don't understand do. I've commented what I think the lines do, and any corrections or confirmations would be much appreciated.
; This is NASM
BITS 16 ; 16 bits!
start: ; Entry point
mov ax, 07C0h ; Move the starting address (after this bootloader) into 'ax'
add ax, 288 ; Leave 288 bytes before the stack beginning for some reason
mov ss, ax ; Show 'stack segment' where our stack starts
mov sp, 4096 ; Tell 'stack pointer' that our stack is 4K in size
mov ax, 07C0h ; Use 'ax' as temporary variable for setting 'ds'
mov ds, ax ; Set data segment to where we're loaded
mov si, text_string ; Put string position into SI (the reg used for this!)
call print_string ; Call our string-printing routine
jmp $ ; Jump here - infinite loop!
text_string db 'This is my cool new OS!', 0 ; Our null terminated string
; For some reason declared after use
print_string: ; Routine: output string in SI to screen
mov ah, 0Eh ; I don't know what this does..
; Continue on to 'repeat'
.repeat:
lodsb ; Get character from DS:SI into AL
cmp al, 0 ; If end of text_string
je .done ; We're done here
int 10h ; Otherwise, print the character (What 10h means)
jmp .repeat ; And repeat
.done:
ret
times 510-($-$$) db 0 ; Pad remainder of boot sector with 0s
dw 0xAA55 ; The standard PC 'magic word' boot signature
Thanks,
Joe

Your comments are largely correct.
mov ah,0Eh
This sets a parameter to the BIOS interrupt call:
int 10h
See here for more details, but essentially the call to 10h expects an operation in ah and data for the operation in al.
The segment registers cannot be loaded directly and can only load from a register, thus the use of ax as a 'temporary variable.'
The 288 bytes added to the base stack pointer are actually not bytes at all. Addresses loaded into the segment registers are actually pointers to 16-byte blocks, so to convert the number to its actual address, shift it left by 4-bits. That means that the address 07C0h is actually referring to 7C00h, which is where your bootloader code is placed. 288 is 120h in hex, and so the actual location of the stack is really 7C00h + 1200h = 8E00h.
Also, you use words like "show" and "tell" which are fine, but it's better to think of defining the stack by setting ss and sp as opposed to reporting where it is at... I hope that makes sense.

mov ah, 0Eh ; I don't know what this does..
Loading 0eh into ah sets up the int 10h function Teletype output, which will print the character in al to the screen.
Your .repeat loop will then load each character from text_string into al and call int 10h.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse