Calling the brk syscall using inline assembler - x86-64

How can I implement the brk syscall on x86-64 Linux? My code is as follows:
Mysyscall(uint64_t n, uint64_t a1){
uint64_t ret;
__asm__ __volatile__("movq %0, %%rax\n\t"
"movq %1, %%rdi\n\t"
"syscall\n"
: "=r"(ret)
: "g"(n), "g"(a1));
return ret; }
Mysyscall(SYS_brk, uint64_t increment);
But it is not working. I think I may be using the wrong constraints for the asm operands, but I still find it difficult to figure out which.

I'm not on Linux, so I can't test this. But based on http://blog.rchapman.org/post/36801038863/linux-system-call-table-for-x86-64, I would expect it to be something like this:
__asm__ __volatile__("syscall"
: "=a" (ret)
: "0" (12), "D" (a1)
: "rcx", "r11", "cc");
You may also need the "memory" clobber.
To learn about constraints, check out the i386 section here: https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html
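For reference, here is a minimal sketch of how the pieces above could fit together on x86-64 Linux. The helper names my_syscall1 and my_sbrk are hypothetical, and the "memory" clobber is included as a precaution. One thing to keep in mind: the raw brk syscall takes the new break address, not an increment, and on failure the kernel returns the current break unchanged, so passing 0 is a way to query it.

```c
#include <stdint.h>
#include <sys/syscall.h>   /* SYS_brk (12 on x86-64) */

/* Hypothetical helper: raw one-argument syscall, x86-64 Linux only. */
static inline uint64_t my_syscall1(uint64_t n, uint64_t a1)
{
    uint64_t ret;
    __asm__ __volatile__("syscall"
                         : "=a"(ret)          /* result comes back in rax */
                         : "0"(n), "D"(a1)    /* number in rax, arg 1 in rdi */
                         : "rcx", "r11", "memory", "cc");
    return ret;
}

/* Hypothetical sbrk-like wrapper: grows the break by `increment` bytes.
   Returns the old break on success, 0 if the kernel refused. */
static inline uint64_t my_sbrk(uint64_t increment)
{
    uint64_t cur = my_syscall1(SYS_brk, 0);              /* query current break */
    uint64_t now = my_syscall1(SYS_brk, cur + increment);
    return now == cur + increment ? cur : 0;
}
```

Letting the compiler load rax and rdi through the "a"/"0" and "D" constraints avoids the hand-written movq instructions, which overwrite registers the compiler was never told about.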

Related

Getting illegal instructions when vectorized code writes to PCI

I am writing a program that writes to a range of a device's HW registers. I am using mmap to map the HW addresses into the user-space virtual address space. I tested the result of the mmap and it is OK. I implemented a copy of a buffer into the device:
void bufferCopy(void *dest, void *src, const size_t size) {
uint8_t *pdest = static_cast<uint8_t *>(dest);
uint8_t *psrc = static_cast<uint8_t *>(src);
size_t iters = 0, tailBytes = 0;
/* iterate 64bit */
iters = (size / sizeof(uint64_t));
for (size_t index = 0; index < iters; ++index) {
*(reinterpret_cast<uint64_t *>(pdest)) =
*(reinterpret_cast<uint64_t *>(psrc));
pdest += sizeof(uint64_t);
psrc += sizeof(uint64_t);
}
.
.
.
But when running it on QEMU I get an illegal instruction exception. When I debugged it, I found that it crashes on the following instruction (below is the asm of the main loop):
movdqu (%rsi,%rax,1),%xmm0
movups %xmm0,(%rdi,%rax,1) <----- this instruction crashes ...
add $0x10,%rax
cmp %rax,%r9
jne 0x7ffff7eca1e0 <_ZN12_GLOBAL__N_110bufferCopyEPvS0_m+64>
Any ideas why? My guess is that PCI can only be written with 32-bit or 64-bit accesses.
The compiler doesn't know about this limitation, so it optimizes my code into a vectorized loop (each iteration loads 128 bits and stores 128 bits). Does that make sense? Can I write to PCI with vector instructions?
Also, whether this is a missing feature in QEMU, a bug, or just a recommendation, how can I prevent the compiler from generating those vector instructions?
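One possible way to keep the stores at 64 bits, sketched here under the assumption that the device accepts 64-bit accesses and that src is suitably aligned, is to write through volatile-qualified pointers. Each volatile access has to be performed as written, so the compiler may not merge two of them into one 128-bit SSE store. The name mmio_copy64 is hypothetical and is not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical replacement for bufferCopy: every store to the device is a
   separate volatile access, so it cannot be widened into a vector store. */
static void mmio_copy64(volatile void *dest, const void *src, size_t size)
{
    volatile uint64_t *d64 = (volatile uint64_t *)dest;
    const uint64_t *s64 = (const uint64_t *)src;
    size_t iters = size / sizeof(uint64_t);

    /* main part: one 64-bit store per iteration */
    for (size_t i = 0; i < iters; ++i)
        d64[i] = s64[i];

    /* tail: remaining bytes, one 8-bit store each */
    volatile uint8_t *d8 = (volatile uint8_t *)(d64 + iters);
    const uint8_t *s8 = (const uint8_t *)(s64 + iters);
    for (size_t i = 0; i < size % sizeof(uint64_t); ++i)
        d8[i] = s8[i];
}
```

Whether byte-wide tail stores are acceptable depends on the device; if it only takes 32/64-bit accesses, the tail would need padding instead.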

Passing C structs through SystemVerilog DPI-C layer

The SystemVerilog LRM has some examples that show how to pass structs between SystemVerilog and C through the DPI-C layer. However, when I try my own example, it does not seem to work at all in the Incisive or Vivado simulators (it does work in ModelSim). I want to know whether I am doing something wrong or whether it is an issue with the simulators. My example is as follows:
#include <stdio.h>
typedef struct {
char f1;
int f2;
} s1;
void SimpleFcn(const s1 * in,s1 * out){
printf("In the C function the struct in has f1: %d\n",in->f1);
printf("In the C function the struct in has f2: %d\n",in->f2);
out->f1=!(in->f1);
out->f2=in->f2+1;
}
I compile the above code into a shared library:
gcc -c -fPIC -Wall -ansi -pedantic -Wno-long-long -fwrapv -O0 dpi_top.c -o dpi_top.o
gcc -shared -lm dpi_top.o -o dpi_top.so
And the SystemVerilog code:
`timescale 1ns / 1ns
typedef struct {
bit f1;
int f2;
} s1;
import "DPI-C" function void SimpleFcn(input s1 in,output s1 out);
module top();
s1 in,out;
initial
begin
in.f1=1'b0;
in.f2 = 400;
$display("The input struct in SV has f1: %h and f2:%d",in.f1,in.f2);
SimpleFcn(in,out);
$display("The output struct in SV has f1: %h and f2:%d",out.f1,out.f2);
end
endmodule
In Incisive I run it using irun:
irun -sv_lib ./dpi_top.so -sv ./top.sv
But it SegV's.
In Vivado I run it using
xvlog -sv ./top.sv
xelab top -sv_root ./ -sv_lib dpi_top.so -R
It runs fine until it exits simulation, then there is a memory corruption:
Vivado Simulator 2017.4
Time resolution is 1 ns
run -all
The input struct in SV has f1: 0 and f2: 400
In the C function the struct in has f1: 0
In the C function the struct in has f2: 400
The output struct in SV has f1: 1 and f2: 401
exit
*** Error in `xsim.dir/work.top/xsimk': double free or corruption (!prev): 0x00000000009da2c0 ***
You were lucky that this worked in ModelSim. Your SystemVerilog prototype does not match your C prototype: f1 is a byte in C but a bit in SystemVerilog.
Modelsim/Questa has a -dpiheader switch that produces a C header file that you can #include into your dpi_top.c file. That way you get a compiler error when the prototypes don't match instead of an unpredictable run-time error. This is the C prototype for your SV code.
typedef struct {
svBit f1;
int f2;
} s1;
void SimpleFcn(
const s1* in,
s1* out);
But I would recommend sticking with C-compatible types in SystemVerilog.
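As a rough illustration of that recommendation, the C side can stay exactly as in the question if the SystemVerilog struct is changed to use byte for f1, since SV byte corresponds to C char and SV int to C int. The SystemVerilog declaration in the comment is an assumption about how the other side would be rewritten; it is not taken from the original post.

```c
#include <stdio.h>

/* Assumed SystemVerilog counterpart (hypothetical rewrite of top.sv):
 *   typedef struct { byte f1; int f2; } s1;
 *   import "DPI-C" function void SimpleFcn(input s1 in, output s1 out);
 */
typedef struct {
    char f1;   /* matches SV byte */
    int  f2;   /* matches SV int  */
} s1;

void SimpleFcn(const s1 *in, s1 *out)
{
    printf("In the C function the struct in has f1: %d\n", in->f1);
    printf("In the C function the struct in has f2: %d\n", in->f2);
    out->f1 = !(in->f1);      /* logical negation: 0 -> 1, nonzero -> 0 */
    out->f2 = in->f2 + 1;
}
```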

Why is the SSE4.2 CRC32 hash value different from the software CRC32 hash value?

In my project, CRC32 is calculated a great many times.
Until now I have used a software CRC32 calculation.
But I noticed that SSE4.2 provides CPU support for it, and Linux also provides a hardware CRC32 function that uses the CPU instruction.
I use an Intel Xeon E5-2650 CPU, so I tried to calculate CRC32 with the Linux function.
But the result is different from that of the software CRC32 function I used.
I used an initial value of 127 in both. The software CRC32 function I used is below:
static uint32_t crc32_tab[] = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9,
0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c,
0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106,
0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};
uint32_t crc32(uint32_t crc, const void *buf, size_t size)
{
const uint8_t *p;
p = buf;
crc = 127;
while (size--)
crc = crc32_tab[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
return crc ^ ~0U;
}
That table is for the "standard" CRC-32 used in Ethernet, zip, gzip, V.42, and many other places.
I need to point out that the code in your question is incorrect because of the crc = 127; statement. crc should not be assigned at all, since it is an input to the function whose value is then lost, and it should not be initialized to 127. What should be there is crc = ~crc;. The same approach should be used at the end, instead of return crc ^ ~0U;, which may not be portable if unsigned is not the same size as uint32_t. The use of ~ is fine with uint32_t, but if a different type is used that is not 32 bits, then crc ^ 0xffffffff is more portable still.
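To make the correction concrete, here is a sketch of the table-driven function with crc = ~crc; at both ends. To keep it short, the table is generated at run time from the reflected polynomial 0xEDB88320 instead of being listed; the names crc32_tab_gen, crc32_init_table and crc32_fixed are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_tab_gen[256];

/* Builds the same 256-entry table as in the question (entry 1 is 0x77073096). */
static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc32_tab_gen[i] = c;
    }
}

/* Pass 0 for the first call; pass the previous result to continue a stream. */
static uint32_t crc32_fixed(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = (const uint8_t *)buf;
    crc = ~crc;                                   /* pre-invert, keep the input */
    while (size--)
        crc = crc32_tab_gen[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return ~crc;                                  /* post-invert */
}
```

With this version, the nine ASCII bytes "123456789" give 0xCBF43926, the usual check value for the standard CRC-32.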
Anyway, to answer your question, Intel chose a different 32-bit CRC to implement in their hardware instruction. That CRC is usually referred to as CRC-32C, using a polynomial discovered by Castagnoli (what the "C" refers to) with better properties than the polynomial used in the standard CRC-32. iSCSI and SCTP use the CRC-32C instead of CRC-32. I presume that that had something to do with Intel's choice.
See this answer for code that computes the CRC-32C in software, as well as in hardware if available.
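As a small self-contained illustration (not the code from the linked answer), a bit-at-a-time software CRC-32C only differs from the standard CRC-32 in its polynomial: the reflected Castagnoli polynomial is 0x82F63B78, where the standard one is 0xEDB88320. The function name crc32c_sw is hypothetical, and this sketch is meant for checking results, not for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected form. Pass 0 as the initial crc. */
static uint32_t crc32c_sw(uint32_t crc, const void *buf, size_t size)
{
    const uint8_t *p = (const uint8_t *)buf;
    crc = ~crc;
    while (size--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            /* shift right; XOR in the polynomial if the bit shifted out was 1 */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

For "123456789" this yields 0xE3069283, the usual CRC-32C check value, which is a handy way to compare against the hardware instruction's output.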

How to write inline assembly code for a loop in Xcode LLVM?

I'm studying inline assembly. I want to write a simple routine for the iPhone with the Xcode 4 LLVM 3.0 compiler. I have succeeded in writing basic inline assembly code.
Example:
int sub(int a, int b)
{
int c;
asm ("sub %0, %1, %2" : "=r" (c) : "r" (a), "r" (b));
return c;
}
I found it on stackoverflow.com and it works very well. But I don't know how to write a loop.
I need assembly code for something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity)
{
for(int i=0; i<numPixels; i++)
{
dst[i] = src[i] + intensity;
}
}
Take a look here at the loop section - http://en.wikipedia.org/wiki/ARM_architecture
Basically you'll want something like:
void brighten(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r3, #0\n"
"Lloop:\n"
"\t cmp r3, %2\n"
"\t bge Lend\n"
"\t ldrb r4, [%0, r3]\n"
"\t add r4, r4, %3\n"
"\t strb r4, [%1, r3]\n"
"\t add r3, r3, #1\n"
"\t b Lloop\n"
"Lend:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "memory", "r3", "r4");
}
Update:
And here's that NEON version:
void brighten_neon(unsigned char* src, unsigned char* dst, int numPixels, int intensity) {
asm volatile (
"\t mov r4, #0\n"
"\t vdup.8 d1, %3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.8 d0, [%0]!\n"
"\t vqadd.s8 d0, d0, d1\n"
"\t vst1.8 d0, [%1]!\n"
"\t add r4, r4, #8\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(src), "=r"(dst), "=r"(numPixels), "=r"(intensity)
: "0"(src), "1"(dst), "2"(numPixels), "3"(intensity)
: "cc", "memory", "r4", "d1", "d0");
}
So this NEON version will do 8 at a time. It does not, however, check that numPixels is divisible by 8, so you'd definitely want to do that, otherwise things will go wrong! Anyway, it's just a start at showing you what can be done. Notice the same number of instructions, but acting on eight pixels of data at once. Oh, and it's got the saturation in there as well, which I assume you would want.
Though this is not a direct answer to your question, it is more general advice regarding the use of assembler versus modern compilers.
You will generally have a hard time beating the compiler at optimizing your C code. Of course, by cleverly using knowledge about how your data behaves, you might be able to tweak it by a few percent.
One of the reasons for this is that modern compilers use a number of techniques when dealing with code like yours, e.g. loop unrolling and instruction reordering to avoid pipeline stalls and bubbles.
If you really want to make that algorithm scream, you should consider redesigning it in C so that you avoid the worst delays. For instance, reading and writing memory is expensive compared to register access.
One way of accomplishing this could be to have your code load 4 bytes at a time into a 32-bit unsigned integer (an unsigned long on 32-bit ARM), do the math on it in registers, and then write those 4 bytes back in one store operation.
So to recap, make your algorithm work smarter, not harder.
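A sketch of that 4-bytes-at-a-time idea in plain C could look like the following. It assumes the wrap-around behaviour of the original dst[i] = src[i] + intensity (no saturation), and the name brighten_swar is made up. The masking keeps a carry in one byte from spilling into its neighbour: the low 7 bits of each byte are added normally, and the top bit is fixed up with an XOR.

```c
#include <stdint.h>
#include <string.h>

void brighten_swar(const unsigned char *src, unsigned char *dst,
                   int numPixels, int intensity)
{
    /* replicate the low byte of intensity into all four byte lanes */
    const uint32_t add = (uint32_t)(intensity & 0xFF) * 0x01010101u;
    int i = 0;

    for (; i + 4 <= numPixels; i += 4) {
        uint32_t x, low, r;
        memcpy(&x, src + i, 4);                       /* one 4-byte load */
        low = (x & 0x7F7F7F7Fu) + (add & 0x7F7F7F7Fu); /* add low 7 bits per lane */
        r = low ^ ((x ^ add) & 0x80808080u);          /* restore each lane's top bit */
        memcpy(dst + i, &r, 4);                       /* one 4-byte store */
    }
    for (; i < numPixels; i++)                        /* leftover 0..3 pixels */
        dst[i] = (unsigned char)(src[i] + intensity);
}
```

Using memcpy for the loads and stores sidesteps alignment and aliasing concerns; compilers typically turn it into a single word access. Whether this actually beats what the compiler generates for the simple loop is something to measure.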

Is a 32-bit Atom in protected mode after a reset?

I work on a 32-bit Intel Atom and have to port MicroC OS II to it, so there is no code yet that configures the Atom (no GDT, no LDT...).
My question is about the state of the 32-bit Atom after a reset: is the Atom in protected mode or not? And, most importantly, how do I check which mode it is in (which registers have to be checked, and how)?
Remark:
CR0.PE = 1 (I checked it). Is that enough to prove that the Atom is in protected mode?
UPDATE:
/*Read the IDTR*/
sidt (idt_ptr)
/*Read the GDTR*/
sgdt (gdt_ptr)
So I tried to use just the IDT's address to link my ISR into the IDT:
fill_interrupt(ISR_Nbr,(unsigned int) isr33, 0x08, 0x8E);
static void fill_interrupt(unsigned char num, unsigned int base, unsigned short sel, unsigned char flags)
{
unsigned short *Interrupt_Address;
/* address = idt_ptr.base + num * 8 bytes (one gate descriptor) */
Interrupt_Address = (unsigned short *)(idt_ptr.base + num*8);
*(Interrupt_Address) = base&0xFFFF; /* handler offset, bits 15..0 */
*(Interrupt_Address+1) = sel; /* code segment selector */
*(Interrupt_Address+2) = (flags<<8)&0xFF00; /* low byte reserved (0), type/attributes in the high byte */
*(Interrupt_Address+3) = (base>>16)&0xFFFF; /* handler offset, bits 31..16 */
}
My ISR is a simple one:
isr33:
nop
nop
cli
push %ebp //save the context to switch back
mov %esp,%ebp
pop %ebp //Return to the calling function
sti
iret //a handler entered through an interrupt gate must return with iret, not ret
Chapter 9 of volume 3 of the Intel Software Developer's Manual says that the reset value of CR0 is 60000010H. As you can see, bit 0, aka PE, is clear.
Regardless, you can set up the descriptor tables in Protected Mode as well as in Real Mode. You just have to be more careful about it.
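As a small sketch of the check being discussed: PE is bit 0 of CR0, so the reset value 60000010H quoted above has it clear. The helper names below are hypothetical. Note that reading CR0 with mov is a privileged operation, so read_cr0 is only usable from ring 0 code such as a kernel or a bare-metal port, not from a normal user program.

```c
#include <stdint.h>

#define CR0_PE 0x1u   /* bit 0 of CR0: Protection Enable */

/* Pure helper: returns 1 if the PE bit is set in the given CR0 value. */
static int cr0_pe_set(uint32_t cr0)
{
    return (cr0 & CR0_PE) != 0;
}

#if defined(__i386__) || defined(__x86_64__)
/* Hypothetical helper: reads CR0. Privileged; faults if run in user mode. */
static inline unsigned long read_cr0(void)
{
    unsigned long v;
    __asm__ __volatile__("mov %%cr0, %0" : "=r"(v));
    return v;
}
#endif
```

In ring 0 code the check would then read something like cr0_pe_set((uint32_t)read_cr0()). A set PE bit says the processor is in protected mode at the time of the read; it does not say who set it.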
I suggest you check if the BIOS or OS are setting this bit at a stage before you read it.
Atom implements the x86 instruction set and, as such, should start in real mode for compatibility. I don't have one on hand to test with, though.
Resolved: I use an Atom N450 board that already has a BIOS, and the BIOS configures the board in Protected Mode.