How can I multiply two vectors in NASM x86 assembly? - x86-64

I am trying to multiply two vectors of floating-point values in assembly code. I am using the NASM assembler on the Intel x86_64 architecture.
I recently learned about the SSE extensions, so I tried to implement packed multiplication using SSE instructions and tail recursion. Here is my function:
mul:
movdqa xmm0, [rdi]
movdqa xmm1, [rsi]
mulps xmm0, xmm1
movdqa [rdi], xmm0
add rdi, 16
add rsi, 16
sub rdx, 4
cmp rdx, 0
jg mul
ret
It multiplies the two vectors element by element and stores the results in the first vector. Pointers to the vectors are passed in rdi and rsi respectively, and the number of elements in both is passed in rdx.
The program above crashes with a segmentation fault on the second instruction. I think I am using SSE incorrectly.
Is there any other way to use packed multiplication? Or am I just doing something wrong here?
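One thing worth checking, assuming the buffers passed in are not guaranteed to be 16-byte aligned: movdqa (like movaps) faults when its memory operand is not 16-byte aligned, whereas movdqu and movups accept any address. The sketch below shows the same element-wise multiply in C with SSE intrinsics, using unaligned loads and stores. The function name mul_vectors is hypothetical, and the scalar tail loop is an addition for lengths that are not a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (x86 / x86-64 only) */

/* Hypothetical helper: a[i] *= b[i] for every i in [0, n). */
void mul_vectors(float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Four floats per step. _mm_loadu_ps / _mm_storeu_ps correspond to
       movups, which does not require 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        a[i] *= b[i];
}
```

If the buffers really are aligned (for example, declared with align 16 in NASM), the aligned forms work too; in that case movaps is the usual choice for float data.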

Related

the usage of offset for storing (stw) in powerpc assembly

long long int i=57745158985; #the C code
0000000000100004: li r7,13
0000000000100008: lis r8,29153
000000000010000c: ori r8,r8,0x3349
0000000000100010: stw r7,24(rsp)
0000000000100014: stw r8,28(rsp)
0000000000100018: lfd fp0,24(rsp)
000000000010001c: stfd fp0,8(rsp)
Can anyone explain the part of after the ori instruction? Thanks in advance.
It looks like this is on a 32-bit big endian machine. I will assume i is a local variable.
Starting with these instructions...
li r7,13
lis r8,29153
ori r8,r8,0x3349
After these instructions:
r7 contains 13
r8 contains ((29153 << 16) | 0x3349)
The required value for i is 57745158985, which is equal to
(13<<32) | ((29153 << 16) | 0x3349)
Clearly this value is too big to fit in a single 32-bit register.
The next instructions are where the 64-bit local variable i is "created" on the stack.
stw r7,24(rsp)
stw r8,28(rsp)
rsp is the stack pointer for the function.
Here i is being initialized to its initial value of 57745158985.
stw r7,24(rsp) stores the four bytes of r7 starting at an offset of 24 bytes into the stack.
stw r8,28(rsp) stores the four bytes of r8 starting at an offset of 28 bytes into the stack.
So i is the 8 bytes starting from an offset of 24 on the stack.
As this is a big-endian architecture the most significant bytes are placed first in memory.
Placing the value of r7 at the lower address acts like the (13<<32) when the 8 bytes are read as one long long int.
These next instructions load the value of i into a floating point register and save it at a different location on the stack.
lfd fp0,24(rsp)
stfd fp0,8(rsp)
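To check the arithmetic, here is a small C sketch that emulates the two big-endian word stores and then reads the 8 bytes back as one 64-bit value. The helper names (lis_ori, stw_be, load64_be) and the 32-byte buffer standing in for the stack frame are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: the value that "lis r8,high" followed by
   "ori r8,r8,low" leaves in r8 (for high < 0x8000, as here). */
uint32_t lis_ori(uint16_t high, uint16_t low)
{
    return ((uint32_t)high << 16) | low;
}

/* Hypothetical helper: emulate a big-endian stw of `word` at buf[offset]. */
void stw_be(uint8_t *buf, unsigned offset, uint32_t word)
{
    buf[offset + 0] = (uint8_t)(word >> 24);  /* most significant byte first */
    buf[offset + 1] = (uint8_t)(word >> 16);
    buf[offset + 2] = (uint8_t)(word >> 8);
    buf[offset + 3] = (uint8_t)word;
}

/* Hypothetical helper: read 8 bytes at buf[offset] as one big-endian
   64-bit integer, which is how the long long int is laid out. */
uint64_t load64_be(const uint8_t *buf, unsigned offset)
{
    uint64_t value = 0;
    for (unsigned i = 0; i < 8; i++)
        value = (value << 8) | buf[offset + i];
    return value;
}
```

Storing 13 at offset 24 and lis_ori(29153, 0x3349) at offset 28, then calling load64_be at offset 24, gives back 57745158985.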
These 3 are loading up two 32 bit literal values into GPRs r7 and r8
0000000000100004: li r7,13
0000000000100008: lis r8,29153
000000000010000c: ori r8,r8,0x3349
These two are storing the two 32 bit values out to consecutive 32 bit memory locations pointed to by rsp (which is the stack pointer == r1) + 24
0000000000100010: stw r7,24(rsp)
0000000000100014: stw r8,28(rsp)
This is a 64 bit load from the same location (ie rsp + 24) into floating point register 0 (ie fp0). (you can't move GPRs to FPR directly on this processor, so you go via memory)
0000000000100018: lfd fp0,24(rsp)
This is storing the same 64 bit FPR0 out to a different offset from the stack pointer.
000000000010001c: stfd fp0,8(rsp)

Get a negative result out of sub instruction x86_64

I've been stuck on a rather simple instruction, but everything in assembly only seems simple until I try to really understand it haha.
I've paid attention to this post which clarifies some stuff but I'm still confused: Understanding intel SUB instruction
Here is a super simple situation:
Two strings are sent to a function for comparison. I stripped it down to the essentials.
function:
mov al, [rdi] ; == 0
sub al, [rsi] ; == 32
ret ; returns 224 and not -32 like I would like
I am guessing I need to check the overflow flag right after the sub instruction if it's on, then my result is negative. Or maybe the sign flag, more logically (both seem right in this case ?)
From that, I would need to subtract 256 from rax to make it -32 before returning... But that seems a little weird to me; there has got to be a cleaner way?
The 8-bit values 224 and -32 are the same on x86 and x86-64, since it uses two's complement to represent negative numbers. Subtracting 256 wouldn't help you. It's up to the caller of the function to choose to interpret the result as signed (-128 to 127, in which case it's -32), or unsigned (0 to 255, in which case it's 224). I assume you're asking this because your code that calls this function is interpreting it wrongly. This could happen if you accidentally used movzx on the result instead of movsx to extend it, or worse, if you used ax, eax, or rax without extending it first (in which case you could end up with complete garbage like 0x11223344556677e0).
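The same point can be shown in C. This is only a sketch: the helper names are made up, and sign_extend8 uses portable arithmetic instead of a cast so that it does not depend on implementation-defined conversions.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit subtraction as "sub al, [rsi]" performs it,
   wrapping modulo 256. */
uint8_t sub8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}

/* Like movzx: widen the byte, treating it as unsigned (0 to 255). */
int zero_extend8(uint8_t v)
{
    return (int)v;
}

/* Like movsx: widen the byte, treating bit 7 as the sign (-128 to 127). */
int sign_extend8(uint8_t v)
{
    return (int)(v ^ 0x80u) - 0x80;
}
```

Here sub8(0, 32) produces the bit pattern 0xE0. zero_extend8 reads it as 224 and sign_extend8 reads it as -32; the byte itself is identical in both cases.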

What's so special about 0x55AA?

I have encountered 0x55AA in 2 scenarios:
the final 2 bytes of boot sector in the legacy booting process contains 0x55AA.
the first 2 bytes of the Option ROM must be 0x55AA
So what's special about 0x55AA?
The binary version of 0x55AA is 0101010110101010. Is it because it is evenly interleaved 0s and 1s? But I don't see that as a strong criterion.
0x55AA is a "signature word". It is used as the "end of sector" marker in the last 2 bytes of a 512 byte boot record. This includes the MBR and its extended boot records, as well as the protective MBR used by the newer GPT scheme.
References:
Image from Master Boot Record - microsoft.com.
How Basic Disks and Volumes Work - microsoft.com.
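As a rough illustration, a check for this signature in C might look like the sketch below. The function name has_boot_signature is hypothetical. Note the byte order: offset 510 holds 0x55 and offset 511 holds 0xAA, so reading the two bytes as a little-endian 16-bit word on x86 yields 0xAA55.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/* Hypothetical helper: returns 1 if the buffer holds at least one full
   512-byte sector whose last two bytes are 0x55, 0xAA; 0 otherwise. */
int has_boot_signature(const uint8_t *sector, size_t len)
{
    if (len < SECTOR_SIZE)
        return 0;
    return sector[SECTOR_SIZE - 2] == 0x55 &&
           sector[SECTOR_SIZE - 1] == 0xAA;
}
```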
There is nothing magical or mystical about that combination. Implementers needed a way to determine whether the first sector of a device was bootable (a boot signature), and that combination was chosen because it is so unlikely to occur by chance in the last two bytes of a sector. Similarly, the SMBIOS entry point can be found by scanning the BIOS for the _SM_ signature, which must lie on a 16-byte (segment) boundary, like this:
Find_SMBIOS:
push ds
push bx ; Preserve essential
push si
; Establish DS:BX to point to base of BIOS code
mov ax, 0xf000
mov ds, ax ; Segment where table lives
xor bx, bx ; Initial pointer
mov eax, '_SM_' ; Scan buffer for this signature
; Loop has a maximum of 4096 iterations. As the table is probably at the top of the buffer, cycling
; through it backwards saves time. In my test bed, Bochs 2.6.5 BIOS-bochs-latest, it took
; 1,451 iterations.
.L0: sub bx, 16 ; Bump pointer to previous segment
jnz .J0
; Return NULL in AX and set CF. Either AX or flag can be tested on return.
mov ax, bx
stc
jmp .Done
; Did we find signature at this page
.J0: cmp [bx], eax
jnz .L0 ; NZ, keep looking
; Calculate checksum to verify position
mov cx, 15
mov ax, cx
mov si, bx ; DS:SI = Table entry point
; Compute checksum on next 15 bytes
.L1: lodsb
add ah, al
loop .L1
or ah, ah
jnz .L0 ; Invalid, try to find another occurrence
; As entry point is page aligned, we can do this to determine segment.
shr bx, 4
mov ax, ds
add ax, bx
clc ; NC, found signature
.Done:
pop si
pop bx ; Restore essential
pop ds
ret
That signature is easily identifiable in a hex dump and it fits into a 32 bit register. Whether those two criteria were the deciding factors, I don't know, but again, 0x5f4d535f is very unlikely to appear on a 16 byte boundary by chance.
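For comparison, the signature scan alone can be sketched in C as follows. This is a simplified illustration, not a port of the routine above: find_sm_anchor is a made-up name, it works on an ordinary buffer instead of segment F000h, it scans forward instead of backward, and it leaves out the checksum step.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first "_SM_" that
   starts on a 16-byte boundary within buf, or -1 if there is none. */
long find_sm_anchor(const unsigned char *buf, size_t len)
{
    for (size_t off = 0; off + 4 <= len; off += 16) {
        if (memcmp(buf + off, "_SM_", 4) == 0)
            return (long)off;
    }
    return -1;
}
```

Stepping by 16 is what makes a false match so unlikely: a stray "_SM_" at any other offset is never even looked at.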

MIPS: Branch using indirect jump?

Suppose a label called L1. On MIPS, one can easily do:
beq $t1, $t2, L1
But is there a way to do the same using indirect addressing? By that, I mean using a register that holds the address where L1 is found. I know of the jr command, but I don't see how it could be used for this purpose.
beq requires an immediate value in its 3rd argument, never a register or memory address.
According to page 55 of this manual (page 63 in the PDF), the range of beq is -128 KB to +128KB, which is exactly 4 times as much as a signed 16-bit integer can represent: -32 KB to +32 KB (since instructions are 4 bytes long, a multiplier of 4 is automatically applied).
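To make the range concrete, here is a small C sketch of how a MIPS branch target is formed from the 16-bit immediate: the immediate is sign-extended, multiplied by 4, and added to the address of the instruction following the branch (PC + 4). The function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset encoded by a 16-bit branch immediate
   (sign-extend, then multiply by the 4-byte instruction size). */
int32_t branch_offset_bytes(uint16_t imm16)
{
    int32_t words = (int32_t)(imm16 ^ 0x8000) - 0x8000;  /* sign-extend */
    return words * 4;
}

/* Hypothetical helper: target of a branch located at address pc.
   The offset is relative to the instruction after the branch. */
uint32_t branch_target(uint32_t pc, uint16_t imm16)
{
    return pc + 4 + (uint32_t)branch_offset_bytes(imm16);
}
```

The extreme immediates 0x8000 and 0x7FFF give offsets of -131072 and +131068 bytes, which is the roughly -128 KB to +128 KB range quoted above.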
I think jr should be able to accomplish what you want. Instead of using a register to point to memory address XX, just load the value of address XX into a register and use that to jump.
lw $t0, XX
jr $t0

harmonic series with x86-64 assembly

Trying to compute a harmonic series.
Right now I'm entering the number I want the addition to go up to.
When I enter a small number like 1.2, the program just hangs: it doesn't crash, and it seems to be doing calculations, but it never finishes.
Here is my code:
denominator:
xor r14,r14 ;zero out r14 register
add r14, 2 ;start counter at 2
fld1 ;load 1 into st0
fxch st2
denomLoop:
fld1
mov [divisor], r14 ;put 1 into st0
fidiv dword [divisor] ;divide st0 by r14
inc r14 ;increment r14
fst qword [currentSum] ;pop current sum value into currentSum
jmp addParts
addParts:
fld qword [currentSum]
fadd st2 ;add result of first division to 1
fxch st2 ;place result of addition into st2
fld qword [realNumber] ;place real number into st0
;compare to see if greater than inputed value
fcom st2 ;compare st0 with st2
fstsw ax ;needed to do floating point comparisons on FPU
sahf ;needed to do floating point comparisons on FPU
jg done ;jump if greater than
jmp denomLoop ;jump if less than
The code basically computes 1/2, 1/3, 1/4 and so on, adds each term to a running sum, and then checks whether the sum has gone above the value I entered. Once it has, it should exit the loop.
Do you guys see my error?
This line seems suspicious:
fst qword [currentSum] ;pop current sum value into currentSum
Contrary to the comment, fst stores the top of the stack into memory WITHOUT popping it. Use fstp if you want to pop it.
Overall, the stack behavior of your program seems suspicious -- it pushes various things onto the fp stack but never pops anything. After a couple of iterations, the stack will overflow and wrap around. Depending on your settings, you'll then either get an exception or get bogus values if you don't have exceptions enabled.
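As a reference for what the loop is meant to compute, here is a minimal C sketch, assuming the sum starts at 1 and the loop stops once the running sum exceeds the entered value. The name harmonic_terms is made up. In C the compiler keeps track of where the values live; in the x87 version every fld has to be balanced by a popping instruction (fstp, faddp and so on) so that the register stack is the same depth at the start of each iteration.

```c
/* Hypothetical helper: smallest n such that 1 + 1/2 + ... + 1/n > limit. */
unsigned harmonic_terms(double limit)
{
    double sum = 1.0;   /* the leading 1 of the series */
    unsigned n = 1;

    while (sum <= limit) {
        n++;
        sum += 1.0 / n;  /* add the next term: 1/2, 1/3, ... */
    }
    return n;
}
```

For an input of 1.2 this stops at n = 2, since 1 + 1/2 = 1.5 is already above 1.2.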