Strange value of the double type variable: -nan(0x8000000000000) - double

I am confused by the value of "currdisk->currangle" after adding. Initially the value of "currdisk->currangle" is 0.77500000000000013, but after adding operation, it's changed to "-nan(0x8000000000000)", Can anyone explain ? Thanks! The following is the occasion of gdb debugging.
3338 currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime;
(gdb) p currdisk->currangle
$28 = 0.77500000000000013
(gdb) p (simtime - seg->time) / currdisk->rotatetime
$29 = 0.00833333333333325
(gdb) p (simtime - seg->time)
$30 = 0.092592592592591672
(gdb) p currdisk->rotatetime
$31 = 11.111111111111111
(gdb) p currdisk->currangle + (simtime - seg->time) / currdisk->ratetime
(gdb) n
(gdb) p currdisk->currangle
$32 = -nan(0x8000000000000)
(gdb) p/x (char[8])currdisk->currangle
$52 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf8, 0xff}
(gdb)
Then I divide the sentence.
double tmp1 = (simtime - seg->time) / currdisk->rotatetime;
currdisk->currangle += tmp1
The value of currdisk->currangle is normal.
tmp1 = (simtime - seg->time) / currdisk->rotatetime;
currdisk->currangle += tmp1;
The above code can also cause NaN in the later calling.
So I step into to the assembly.
tmp1 = (simtime - seg->time) / currdisk->rotatetime;
0x0808fdf4 <disk_buffer_sector_done+559>: fldl 0x80b2208
0x0808fdfa <disk_buffer_sector_done+565>: mov -0x38(%ebp),%eax
0x0808fdfd <disk_buffer_sector_done+568>: fldl (%eax)
...Here, the content of %eax is 0x80bdfeb, which is the address of seg->time.
After the above "fldl (%eax) " executed, the content of register st0 is 0x8000000000000000, which represents NaN.
So the NaN is propagated to the following instructions. This is the key issue.
Code:
0x0808fdff <disk_buffer_sector_done+570>: fsubrp %st,%st(1)
0x0808fe01 <disk_buffer_sector_done+572>: mov 0x8(%ebp),%eax
0x0808fe04 <disk_buffer_sector_done+575>: fldl 0xc4(%eax)
0x0808fe0a <disk_buffer_sector_done+581>: fdivrp %st,%st(1)
0x0808fe0c <disk_buffer_sector_done+583>: fstpl -0x28(%ebp)

Related

Calculating jmp's from one segment to another in windows PE files

Assume I have a binary on my disk that I load into memory using VirtualAlloc and ReadFile.
If I want to follow a jmp instruction from one section to another, what do I need to add/subtract to get the destination address.
In other words, I want to know how IDA calculates the loc_140845BB8 from jmp loc_140845BB8
Example:
.text:000000014005D74E jmp loc_140845BB8
Jumps to the section seg007
seg007:0000000140845BB8 ; seg007:0000000140845BC4↓j
seg007:0000000140845BB8 and rbx, r14
PE info (seg007 is the section named "")
Segments are arbitary, it jumps where it jumps, without regard for segments. Jump location is calculated as the signed 32-bit value following the 0xE9 JMP opcode, added to the the address of where the next instruction would be (i.e. the location of JMP + 5 bytes).
def GetInsnLen(ea):
insn = ida_ua.insn_t()
return ida_ua.decode_insn(insn, ea)
def MakeSigned(number, size):
number = number & (1<<size) - 1
return number if number < 1<<size - 1 else - (1<<size) - (~number + 1)
def GetRawJumpTarget(ea):
if ea is None:
return None
insnlen = GetInsnLen(ea)
if not insnlen:
return None
result = MakeSigned(idc.get_wide_dword(ea + insnlen - 4), 32) + ea + insnlen
if ida_ida.cvar.inf.min_ea <= result < ida_ida.cvar.inf.max_ea:
return result
return None

why did not fill with zeros

Allocated array for 10000 bits = 1250 bytes(10000/8):
mov edi, 1250
call malloc
tested the pointer:
cmp rax, 0
jz .error ; error handling at label down the code
memory was allocated:
(gdb) p/x $rax
$3 = 0x6030c0
attempted to fill that allocated memory with zeros:
mov rdi, rax
xor esi, esi
mov edx, 1250 ; 10000 bits
call memset
checked first byte:
(gdb) p/x $rax
$2 = 0x6030c0
(gdb) x/xg $rax + 0
0x6030c0: 0x0000000000000000
checked last byte(0 - first byte, 1249 - last byte)
(gdb) p/x $rax + 1249
$3 = 0x6035a1
(gdb) x/xg $rax + 1249
0x6035a1: 0x6100000000000000
SOLVED QUESTION
Should have typed x/1c $rax + 1249
You interpreted memory as a 64 bit integer, but you forgot that endianness of intel is little endian. So bytes were reversed.
0x6100000000000000 is the value that the CPU reads when de-serializing the memory at this address. Since it's little endian, the 0x61 byte is last in memory (not very convenient to dump memory in this format, unless you have a big endian architecture)
Use x /10bx $rax + 1249 you'll see that it's zero at the correct location. The rest is garbage (happens to be zero for a while, then garbage)
0x00 0x00 0x00 0x00 0x00 0x00 0x61

Read a sector to address zero

%include "init.inc"
[org 0x0]
[bits 16]
jmp 0x07C0:start_boot
start_boot:
mov ax, cs
mov ds, ax
mov es, ax
load_setup:
mov ax, SETUP_SEG
mov es, ax
xor bx, bx
mov ah, 2 ; copy data to es:bx from disk.
mov al, 1 ; read a sector.
mov ch, 0 ; cylinder 0
mov cl, 2 ; read data since sector 2.
mov dh, 0 ; Head = 0
mov dl, 0 ; Drive = 0
int 0x13 ; BIOS call.
jc load_setup
lea si, [msg_load_setup]
call print
jmp $
print:
print_beg:
mov ax, 0xB800
mov es, ax
xor di, di
print_msg:
mov al, byte [si]
mov byte [es:di], al
or al, al
jz print_end
inc di
mov byte [es:di], BG_TEXT_COLOR
inc di
inc si
jmp print_msg
print_end:
ret
msg_load_setup db "Loading setup.bin was completed." , 0
times 510-($-$$) db 0
dw 0xAA55
I want to load setup.bin to memory address zero. So, I input 0 value to es register (SETUP_SEG = 0). bx, too. But it didn't work. then, I have a question about this issue. My test is below.
SETUP_SEG's value
0x0000 : fail
0x0010 : success
0x0020 : fail
0x0030 : fail
0x0040 : fail
0x0050 : success
I can't understand why this situation was happened. All test was carried out on VMware. Does anyone have an idea ?
I'm not sure if this is your problem, but your trying to load setup.bin in the Real Mode IVT (Interrupt Vector Table). The IVT contains the location of each interrupt, so I'm assuming that your boatloader is overwriting them when it loads setup.bin into memory! Interrupts can be sneaky and tricky, since they can be called even if you didn't call them. Any interrupt vector you overwrote will likely cause undefined behavior when called, which will cause some problems.
I suggest setting SETUP_SEG to a higher number like 0x2000 or 0x3000, but the lowest you could safely go is 0x07E0. The Osdev Wiki and Wikipedia have some helpful information on conventional memory and memory mapping.
I hope this helps!

Analog read from all 7 inputs using PRU and a host C program for beaglebone black

I'm kinda new to the beaglebone black world running on a AM335X Cortex A8 processor and I would like to use the PRU for fast analog read with the maximum sampling rate possible.
I would like to read all 7 inputs in a loop form like:
while( n*7 < sampling_rate){ //initial value for n = 0
read(AIN0); //and store it in shared memory(7*n + 0)
read(AIN1); //and store it in shared memory(7*n + 1)
read(AIN2); //and store it in shared memory(7*n + 2)
read(AIN3); //and store it in shared memory(7*n + 3)
read(AIN4); //and store it in shared memory(7*n + 4)
read(AIN5); //and store it in shared memory(7*n + 5)
read(AIN6); //and store it in shared memory(7*n + 6)
n++;
}
so that I can read them from a host program running on the main processor. Any idea how to do so? I tried using a ready code called ADCCollector.c from a package named AM335x_pru_package but I can't figure out how to get all the addresses and values of the registers used.
This is the code I was trying to modify (ADCCollector.p):
.origin 0 // offset of the start of the code in PRU memory
.entrypoint START // program entry point, used by debugger only
#include "ADCCollector.hp"
#define BUFF_SIZE 0x00000fa0 //Total buff size: 4kbyte(Each buffer has 2kbyte: 500 piece of data
#define HALF_SIZE BUFF_SIZE / 2
#define SAMPLING_RATE 1 //Sampling rate(16khz) //***//16000
#define DELAY_MICRO_SECONDS (1000000 / SAMPLING_RATE) //Delay by sampling rate
#define CLOCK 200000000 // PRU is always clocked at 200MHz
#define CLOCKS_PER_LOOP 2 // loop contains two instructions, one clock each
#define DELAYCOUNT DELAY_MICRO_SECONDS * CLOCK / CLOCKS_PER_LOOP / 1000 / 1000 * 3 //if sampling rate = 98000 --> = 3061.224
.macro DELAY
MOV r10, DELAYCOUNT
DELAY:
SUB r10, r10, 1
QBNE DELAY, r10, 0
.endm
.macro READADC
//Initialize buffer status (0: empty, 1: first buffer is ready, 2: second buffer is ready)
MOV r2, 0
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
INITV:
MOV r5, 0 //Shared RAM address of ADC Saving position
MOV r6, BUFF_SIZE //Counting variable
READ:
//Read ADC from FIFO0DATA
MOV r2, 0x44E0D100
LBBO r3, r2, 0, 4
//Add address counting
ADD r5, r5, 4
//Write ADC to PRU Shared RAM
SBCO r3, CONST_PRUSHAREDRAM, r5, 4
DELAY
SUB r6, r6, 4
MOV r2, HALF_SIZE
QBEQ CHBUFFSTATUS1, r6, r2 //If first buffer is ready
QBEQ CHBUFFSTATUS2, r6, 0 //If second buffer is ready
QBA READ
//Change buffer status to 1
CHBUFFSTATUS1:
MOV r2, 1
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA READ
//Change buffer status to 2
CHBUFFSTATUS2:
MOV r2, 2
SBCO r2, CONST_PRUSHAREDRAM, 0, 4
QBA INITV
//Send event to host program
MOV r31.b0, PRU0_ARM_INTERRUPT+16
HALT
.endm
// Starting point
START:
// Enable OCP master port
LBCO r0, CONST_PRUCFG, 4, 4 //#define CONST_PRUCFG C4 taken from ADCCollector.hp
CLR r0, r0, 4
SBCO r0, CONST_PRUCFG, 4, 4
//C28 will point to 0x00012000 (PRU shared RAM)
MOV r0, 0x00000120
MOV r1, CTPPR_0
ST32 r0, r1
//Init ADC CTRL register
MOV r2, 0x44E0D040
MOV r3, 0x00000005
SBBO r3, r2, 0, 4
//Enable ADC STEPCONFIG 1
MOV r2, 0x44E0D054
MOV r3, 0x00000002
SBBO r3, r2, 0, 4
//Init ADC STEPCONFIG 1
MOV r2, 0x44E0D064
MOV r3, 0x00000001 //continuous mode
SBBO r3, r2, 0, 4
//Read ADC and FIFOCOUNT
READADC
Another question is: if I simply changed the #define Sampling_rate from 16000 to any other number below or equal to 200000 in the (.p) file, I will get that sampling rate? or should I change other things?
Thanks in advance.
I used the c wrappers from libpruio: http://www.freebasic.net/forum/viewtopic.php?f=14&t=22501
and then use this code to get all my ADC values:
#include "stdio.h"
#include "c_wrapper/pruio.h" // include header
#include "sys/time.h"
//! The main function.
int main(int argc, char **argv) {
struct timeval start, now;
long mtime, seconds, useconds;
gettimeofday(&start, NULL);
int i,x;
pruIo *io = pruio_new(PRUIO_DEF_ACTIVE, 0x98, 0, 1); //! create new driver structure
if (pruio_config(io, 1, 0x1FE, 0, 4)){ // upload (default) settings, start IO mode
printf("config failed (%s)\n", io->Errr);}
else {
do {
gettimeofday(&now, NULL);
seconds = now.tv_sec - start.tv_sec;
useconds = now.tv_usec - start.tv_usec;
mtime = ((seconds) * 1000 + useconds/1000.0) + 0.5;
printf("%lu",mtime);
for(i = 1; i < 9; i++) {
printf(",%d", io->Adc->Value[i]); //0-66504 for 0-1.8v
}
printf("\n");
x++;
}while (mtime < 100);
printf("count: %d \n", x);
pruio_destroy(io); /* destroy driver structure */
}
return 0;
}
In your example you use libpruio in IO mode (synchronous) and therefore you have no control over the sampling rate, since the host CPU doesn't work in real-time.
To get the maximum sampling rate (as mentioned in the OP) you have to use either RB or MM mode. In those modes libpruio buffers the samples in memory and the host can access them asynchronously. See example rb_file.c (or triggers.bas) in the libpruio package.

Optimizing RGBA8888 to RGB565 conversion with NEON

I'm trying to optimize an image format conversion on iOS using the NEON vector instruction set. I assumed this would map well to that because it processes a bunch of similar data.
My attempts haven't gone that well, though, achieving only a marginal speedup vs the naive c implementation:
for(int i = 0; i < pixelCount; ++i, ++inPixel32) {
const unsigned int r = ((*inPixel32 >> 0 ) & 0xFF);
const unsigned int g = ((*inPixel32 >> 8 ) & 0xFF);
const unsigned int b = ((*inPixel32 >> 16) & 0xFF);
*outPixel16++ = ((r >> 3) << 11) | ((g >> 2) << 5) | ((b >> 3) << 0);
}
1 megapixel image array on iPad 2:
format is [min avg max n=number of timer samples] in milliseconds
C:
[14.446 14.632 18.405 n=1000]ms
NEON:
[11.920 12.032 15.336 n=1000]ms
My attempt at a NEON implementation is below:
int i;
const int pixelsPerLoop = 8;
for(i = 0; i < pixelCount; i += pixelsPerLoop, inPixel32 += pixelsPerLoop, outPixel16 += pixelsPerLoop) {
//Read all r,g,b pixels into 3 registers
uint8x8x4_t rgba = vld4_u8(inPixel32);
//Right-shift r,g,b as appropriate
uint8x8_t r = vshr_n_u8(rgba.val[0], 3);
uint8x8_t g = vshr_n_u8(rgba.val[1], 2);
uint8x8_t b = vshr_n_u8(rgba.val[2], 3);
//Widen b
uint16x8_t r5_g6_b5 = vmovl_u8(b);
//Widen r
uint16x8_t r16 = vmovl_u8(r);
//Left shift into position within 16-bit int
r16 = vshlq_n_u16(r16, 11);
r5_g6_b5 |= r16;
//Widen g
uint16x8_t g16 = vmovl_u8(g);
//Left shift into position within 16-bit int
g16 = vshlq_n_u16(g16, 5);
r5_g6_b5 |= g16;
//Now write back to memory
vst1q_u16(outPixel16, r5_g6_b5);
}
//Do the remainder on normal flt hardware
Code was compiled via LLVM 3.0 into the following (.loc and extra labels removed):
_DNConvert_ARGB8888toRGB565:
push {r4, r5, r7, lr}
mov r9, r1
mov.w r12, #0
add r7, sp, #8
cmp r2, #0
mov.w r1, #0
it ne
movne r1, #1
cmp r0, #0
mov.w r3, #0
it ne
movne r3, #1
cmp.w r9, #0
mov.w r4, #0
it ne
movne r4, #1
tst.w r9, #3
bne LBB0_8
ands r1, r3
ands r1, r4
cmp r1, #1
bne LBB0_8
movs r1, #0
lsr.w lr, r9, #2
cmp.w r1, r9, lsr #2
bne LBB0_9
mov r3, r2
mov r5, r0
b LBB0_5
LBB0_4:
movw r1, #65528
add.w r0, lr, #7
movt r1, #32767
ands r1, r0
LBB0_5:
mov.w r12, #1
cmp r1, lr
bhs LBB0_8
rsb r0, r1, r9, lsr #2
mov.w r9, #63488
mov.w lr, #2016
mov.w r12, #1
LBB0_7:
ldr r2, [r5], #4
subs r0, #1
and.w r1, r9, r2, lsl #8
and.w r4, lr, r2, lsr #5
ubfx r2, r2, #19, #5
orr.w r2, r2, r4
orr.w r1, r1, r2
strh r1, [r3], #2
bne LBB0_7
LBB0_8:
mov r0, r12
pop {r4, r5, r7, pc}
LBB0_9:
sub.w r1, lr, #1
movs r3, #32
add.w r3, r3, r1, lsl #2
bic r3, r3, #31
adds r5, r0, r3
movs r3, #16
add.w r1, r3, r1, lsl #1
bic r1, r1, #15
adds r3, r2, r1
movs r1, #0
LBB0_10:
vld4.8 {d16, d17, d18, d19}, [r0]!
adds r1, #8
cmp r1, lr
vshr.u8 d20, d16, #3
vshr.u8 d21, d17, #2
vshr.u8 d16, d18, #3
vmovl.u8 q11, d20
vmovl.u8 q9, d21
vmovl.u8 q8, d16
vshl.i16 q10, q11, #11
vshl.i16 q9, q9, #5
vorr q8, q8, q10
vorr q8, q8, q9
vst1.16 {d16, d17}, [r2]!
Ltmp28:
blo LBB0_10
b LBB0_4
Full code is available at https://github.com/darknoon/DNImageConvert I would appreciate any help, thanks!
Here you are, hand-optimized NEON implementation ready for XCode :
/* IT DOESN'T WORK!!! USE THE NEXT VERSION BELOW.
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
rg .req red
gb .req blu
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vshr.u8 red, red, #3
vext.8 rg, grn, red, #5
vshr.u8 grn, grn, #2
vext.8 gb, blu, grn, #3
vst2.8 {gb, rg}, [pDst]!
bgt loop
bx lr
This version will be many times faster than what you suggested :
increased cache hit rate via PLD
conversion to "long" not necessary
fewer instructions within the loop
There is still some room for optimizations though, you could modify the loop so that it converts 16 pixels per iteration instead of 8.
Then you can schedule the instructions to avoid the two stalls completely (which is simply not possible in this 8/iteration version above) and benefit from NEON's dual-issue capability in addition.
I didn't do this because it would make the code hard to understand.
It's important to know what VEXT is supposed to do.
Now it's up to you. :)
I verified this code to be properly compiled under Xcode.
Although I'm pretty sure it works correctly as well, I cannot guarantee this since I don't have the test environment.
In case of malfunctioning, please let me know. I'll correct it accordingly then.
cya
==============================================================================
Well, here is the improved version.
Due to the nature of the VSRI instruction not allowing two operands other than the target, it was not possible to create a more robust one regarding the register assignment.
Please check the image format of your source image. (exact byte order of the elements)
If it's not B, G, R, A, which is the default and native one on iOS, your application will suffer heavily from internal conversions by iOS.
If it's absolutely not possible to change this for whatever the reason, let me know.
I'll write a new version matching it.
PS : I forgot to remove the underscore at the start of the function prototype. Now it's gone.
/*
* BGRA2RGB565.s
*
* Created by Jake "Alquimista" Lee on 11. 11. 1..
* Copyright 2011 Jake Lee. All rights reserved.
*
* Version 1.1
* - bug fix
*
* Version 1.0
* - initial release
*/
.align 2
.globl _bgra2rgb565_neon
.private_extern _bgra2rgb565_neon
// unsigned int * bgra2rgb565_neon(unsigned int * pDst, unsigned int * pSrc, unsigned int count);
//ARM
pDst .req r0
pSrc .req r1
count .req r2
//NEON
blu .req d16
grn .req d17
red .req d18
alp .req d19
gb .req grn
rg .req red
_bgra2rgb565_neon:
pld [pSrc]
tst count, #0x7
movne r0, #0
bxne lr
.loop:
pld [pSrc, #32]
vld4.8 {blu, grn, red, alp}, [pSrc]!
subs count, count, #8
vsri.8 red, grn, #5
vshl.u8 gb, grn, #3
vsri.8 gb, blu, #3
vst2.8 {gb, rg}, [pDst]!
bgt .loop
bx lr
If you are on iOS or OS X, then you may be delighted to discover vImageConvert_RGBA8888toRGB565() and friends, in Accelerate.framework. This function rounds the 8-bit values to nearest 565 value.
For even better dithering, the quality of which is nearly indistinguishable from 8-bit color, try vImageConvert_AnyToAny():
vImage_CGImageFormat RGBA8888Format =
{
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.bitmapInfo = kCGBitmapByteOrderDefault | kCGImageAlphaNoneSkipLast,
.colorSpace = NULL, // sRGB or substitute your own in
};
vImage_CGImageFormat RGB565Format =
{
.bitsPerComponent = 5,
.bitsPerPixel = 16,
.bitmapInfo = kCGBitmapByteOrder16Little | kCGImageAlphaNone,
.colorSpace = RGBA8888Format.colorSpace,
};
err = vImageConverterRef converter = vImageConverter_CreateWithCGImageFormat(
&RGBA8888Format, &RGB565Format, NULL, kvImageNoFlags, &err );
err = vImageConvert_AnyToAny( converter, &src, &dest, NULL, kvImageNoFlags );
Either of these approaches will be vectorized and multithreaded for best performance.
You might want to use vld4q_u8() instead of vld4_u8() and adjust the rest of your code accordingly. It's hard to tell where the problem might be, but the assembler doesn't look too bad otherwise.
(I'm not familiar with NEON, nor deeply with the memory system of the Ipad2, but this is what we used to do with 88110 pixel-ops, which were an early precursor to today's SIMD extensions)
How big is the memory latency?
Could you hide it by unrolling the inner loop and running the NEON instructions on the "previous" values while the ARM pulls the "next" values from memory? A brief scan of the NEON manual implies you can run ARM and NEON instructions in parallel.
I don't think converting vld4_u8 to vld4q_u8 would lead to any bettering of the performance.
The code seems simple enough. I am not good at ASM and so it would take some time to look into it deeply.
The neon seems simple enough. But I am not quiet sure about r5_g6_b5 |= g16 being used instead of vorrq_u16
Please have a look at the optimization level too. As far as what I heard neon code optimization level goes to a maximum of 1. So the performance may differ when default optimization is being taken into account for both the reference code and neon code, as the level of optimization of reference by DEFAULT may be different.
I doesnt find any area in neon that can better the current code.