Can't set the higher half of ymm registers - x86-64

I'm trying to extend the word values from the lower half of the ymm0 register into the lower and higher half of it. When I execute the vpunpcklwd instruction, the higher half of the ymm0 register is left unchanged. This is the code line:
vpunpcklwd ymm0, ymm0, ymm0
Am I overseeing something? Could this be an OS bug or a CPU bug?
I'm using Windows 11 Pro on an Intel Core i5-10210U with Visual Studio 2019. I've tried to compile the code with the Visual C++ and Intel C++ compiler and to use a different destination register but nothing changes.
Many thanks in advance.

Related

Compiled COM files with empty project is over 10 KiB large in Turbo Pascal

I have a problem with the binary's size of old Pascal versions.
We need very small simple programs. We would like to use Turbo Pascal 2 in MS-DOS (higher is the same problem) to compile COM files. But the size is always 10 KiB and larger, even for an empty project like:
begin
end.
Compiled file sizes 10052 bytes. I do not understand why. I tested compiler commands, changed stack/heaps with no results.
Compilation output:
Compiling --> c:emtpy.com
3 lines
code: 0002 paragraphs (32 bytes), 0D7B paragraphs free
data: 0000 paragraphs (0 bytes), 0FE7 paragraphs free
stack/heap: 0400 paragraphs (16384 bytes) (minimum)
4000 paragraphs (262144 bytes) (maximum)
Is it possible to get a smaller COM file, and is it possible to convert the Pascal code automatically into ASM code?
Any version of Turbo Pascal up to 3.02 will result into an executable file which includes the whole Run-Time Library. As you discovered, the size of it for TP2 on your target operating system is about 10,050 bytes.
We need very small simple programs.
... then Turbo Pascal 2 is not a good option to start up. Better try with any version from 4 up, if you want to stick with Pascal and are targeting MS-DOS. Or switch to C or assembly language, which will be able to produce smaller executables, at the cost of being more difficult to develop.
[...] is it possible to convert the Pascal code automatically into ASM code.
It can be done using Turbo Pascal but it is not practical (basically you need a disassembler; IDA is such a tool, used nowadays; the version you need is not free.) Also you won't gain much by smashing some bytes from an already compiled application: you will end much better starting it straight in assembly language.
Anyway, the best course to achieve it is to drop Turbo Pascal and go to Free Pascal, which compiler produces .s files, which are written in assembly language (although maybe not in the the same syntax as you are used.) There is (was?) a sub-project to target the 16-bit i8086 processor, which seems reasonably up-to-date (I never tried it.)
Update
You mentioned in a comment you really need the .COM format (which Turbo Pascal 4-7 does not support directly). The problem then is about the memory model. .COM programs are natively using the so-called tiny model (16-bit code and data segments overlapping at the same location), but it can be somewhat evaded for application (not TSR) which can grab all the available memory; TP 1-3 for MS-DOS uses a variant of the compact model (data pointers are 32-bit "far" but code pointers are 16-bit "near", which caps at 64 Ki bytes of code); TP 4-7 are instead using the large model where each unit have a separate code segment. It could be possible to rewrite the Run-Time Library to use only one code segment, then relink the TP-produced executables to convert the FAR CALLs into NEAR CALLs (that one is easy since all the information is in the relocation table of the .EXE). However, you will be home sooner using directly Free Pascal, which supports natively the tiny memory model and can produce .COM executables; while still being highly compatible with Turbo Pascal.

List of all registers to be queried with `r`

I was looking for a way to get the contents of the MXCSR register in WinDbg. Looking up the help for the r command I found a lot of options. I thought I had covered all registers with the command
0:000> rM 0xfe7f
However, the MXCSR register was still not included. So I did a full search in WinDbg help, which did not give me any results (sorry for the German screenshot):
So I continued my search in the Internet and finally found
0:000> r mxcsr
mxcsr=00001f80
I am now wondering whether there are other registers that will not be displayed by rM 0xfe7f but are available anyways. I am especially interested in user mode and x86 and AMD64 architecture.
I had a look at dbgeng.dll (version 10.0.20153.1000) and found a few more registers by trying some strings around offset 7DC340. Based on some of that information, I found the MSDN websites x64 registers and x86 registers.
In addition I found
brto, brfrom, exto, exfrom
The registers zmm0 through zmm15 can be used as zmm0h, possibly for the high half.
The registers xmm0/ymm0 through xmm15/ymm15 can be used as ymm0h and ymm0l, likely for the high and low half.
some more which didn't work either because of my CPU model or because I tried it in user mode instead of kernel mode.

Do instruction sets like x86 get updated? If so, how is backwards compatibility guaranteed?

How would an older processor know how to decode new instructions it doesn't know about?
New instructions use previously-unused opcodes, or other ways to find more "coding space" (e.g. prefixes that didn't previously mean anything for a given opcode).
How would an older processor know how to decode new instructions it doesn't know about?
It won't. A binary that wants to work on old CPUs as well as new ones has to either limit itself to a baseline feature set, or detect CPU features at run-time and set function pointers to select versions of a few important functions. (aka "runtime dispatching".)
x86 has a good mechanism (the cpuid instruction) for letting code query CPU features without any kernel support needed. Some other ISAs need CPU info hard-coded into the OS or detected via I/O accesses so the only viable method is for the kernel to export info to user-space in an OS-specific way.
Or if you're building from source on a machine with a newer CPU and don't care about old CPUs, you can use gcc -O3 -march=native to let GCC use all the ISA extensions the current CPU supports, making a binary that will fault on old CPUs. (e.g. x86 #UD (UnDefined instruction) hardware exception, resulting in the OS delivering a SIGILL or equivalent to the process.)
Or in some cases, a new instruction may decode as something else on old CPUs, e.g. x86 lzcnt decodes as bsr with an ignored REP prefix on older CPUs, because x86 has basically no unused opcodes left (in 32-bit mode). Sometimes this "decode as something else" is actually useful as a graceful fallback to allow transparent use of new instructions, notably pause = rep nop = nop on old CPUs that don't know about it. So code can use it in spin loops without checking CPUID.
-march=native is common for servers where you're setting things up to just run on that server, not making a binary to distribute.
Most of the times, old processor will have "Undefined Instruction" exception. The instruction is not defined in old CPU.
In more rare cases, the instruction will execute as a different instruction. This happens when then new instruction is encoded via obligatory prefix. As an example, PAUSE is encoded as REP NOP, so it executed as nothing on older CPUs.

Set ba (break on access) breakpoint in managed code programmatically

For investigating managed heap corruption I would like to use ba (break on access) breakpoints. Can I use them in managed code? If yes, how can I set them programmatically?
UPDATE: It would also be okay so set them in WinDbg (-> set ba for every object of type XY)
Breakpoints set by 'ba' command are called processor or hardware breakpoints.
First the good news
It is easy to set hardware breakpoint. You will need to set one of the processor's debug registers (DR0, DR1, DR2 or DR3) with the address of the data and set debug control register DR7 with fields to set size of memory and type of access. The instruction (in x64 assembler) looks like:
MOV rax, DR0
Obviously you will have to somehow execute this assembler instruction from your language of choice, or use interop to C++ and inline assembly, but this is easier than for example setting software breakpoint.
Now the bad news
First of all, on SMP machines you will have to do this for all processors that can touch your code. This is probably solvable if you configure processor affinity for you process, or do debugging on single-proc machine. Second, there are only 4 debug processors on Intel architecture. If you try setting processor breakpoints with WinDbg, after 4th it will complain Too many data breakpoints for thread N after you hit g.
I assume the whole purpose you are asking about automation is because there are too many objects to set breakpoints by hand. Since you are limited to 4 ba breakpoints anyways, there is not much point in automating this.

Mixing 32 bit and 16 bit code with nasm

This is a low-level systems question.
I need to mix 32 bit and 16 bit code because I'm trying to return to real-mode from protected mode. As a bit of background information, my code is doing this just after GRUB boots so I don't have any pesky operating system to tell me what I can and can't do.
Anyway, I use [BITS 32] and [BITS 16] with my assembly to tell nasm which types of operations it should use, but when I test my code use bochs it looks like the for some operations bochs isn't executing the code that I wrote. It looks like the assembler is sticking in extras 0x66 and 0x67's which confuses bochs.
So, how do I get nasm to successfully assemble code where I mix 32 bit and 16 bit code in the same file? Is there some kind of trick?
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
The problem turned out to be that I wasn't setting up my descriptor tables correctly. I had one bit flipped wrong so instead of going to 16-bit mode I was going to 32-bit mode (with segments that happened to have a limit of one meg).
Thanks for the suggestions!
Terry
The 0x66 and 0x67 are opcodes that are used to indicate that the following opcode should be interpreted as a non-default bitness. More specifically, (and according to this link),
"When NASM is in BITS 16 mode, instructions which use 32-bit data are prefixed with an 0x66 byte, and those referring to 32-bit addresses have an 0x67 prefix. In BITS 32 mode, the reverse is true: 32-bit instructions require no prefixes, whereas instructions using 16-bit data need an 0x66 and those working on 16-bit addresses need an 0x67."
This suggests that it's bochs that at fault.
You weren't kidding about this being low-level!
Have you checked the generated opcodes / operands to make sure that nasm is honoring your BITS directives correctly? Also check to make sure the jump targets are correct - maybe nasm is using the wrong offsets.
If it's not a bug in nasm, maybe there is a bug in bochs. I can't imagine that people switch back to 16-bit mode from 32-bit mode very often anymore.
If you're in real mode your default size is implicitly 16 bits, so you should use BITS 16 mode. This way if you need a 32-bit operand size you add the 0x66 prefix, and for a 32-bit address size you add the 0x67 prefix.
Look at the Intel IA-32 Software Developer's Guide, Volume 3, Chapter 16 (MIXING 16-BIT AND 32-BIT CODE; the chapter number might change according to the edition of the book):
Real-address mode, virtual-8086 mode, and SMM are native 16-bit modes.
The BITS 32 directive will only confuse the assembler if you use it outside of Protected Mode or Long Mode.