Trouble figuring out loading into a register with an offset from a different register

I am creating an 8-bit CPU. I have basic instructions like mov, ld, st, add, sub, mult, jmp. I am trying to put my instructions together. First I move the base address of a value into register 1 (R1). I then want to load register 2 (R2) with the value. So my instructions look like this:
1 mov R1, 0xFFFF
2 ld R2, [R1+0]
My opcode definitions are:
ld: 0001
mov: 1111
Register codes are:
R1: 0001
R2: 0010
So my instructions in binary look like:
1 mov R1, 0xFFFF = 1111 0001 0xFFFF
2 ld R2, [R1+0] = 0001 0010
But on my second instruction, the load, how can I ensure that the value stored at the memory address I moved into R1 is the one that gets used? This is my first time doing anything with computer architecture, so I am a little lost.

how can I ensure that the value stored at the memory address I moved into R1 is the one that gets used?
By building your hardware to correctly handle the read-after-write hazard (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards).
Either:
make it a simple non-pipelined CPU where one instruction writes back to the registers before the next instruction reads any registers,
detect the dependency and stall the pipeline, or
use bypassing / forwarding (https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Eliminating_hazards); a rough sketch of the dependency check is shown below.
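For a concrete picture of the dependency check, here is a minimal C sketch. It is only an illustration: the struct fields, names and register widths are assumptions, not something taken from your ISA.

#include <stdbool.h>
#include <stdint.h>

struct instr {
    bool     writes_reg;   /* e.g. true for "mov R1, 0xFFFF" */
    uint8_t  dest;         /* register it writes (R1)        */
    bool     reads_reg;    /* e.g. true for "ld R2, [R1+0]"  */
    uint8_t  src;          /* register it reads (R1)         */
    uint16_t result;       /* value produced in execute, not yet written back */
};

/* Read-after-write hazard: the younger instruction reads a register that
   the older instruction has not yet written back. */
bool raw_hazard(const struct instr *older, const struct instr *younger)
{
    return older->writes_reg && younger->reads_reg && older->dest == younger->src;
}

/* "Stall" would hold the younger instruction until the hazard clears;
   "forwarding" hands it the in-flight result directly, as sketched here. */
uint16_t read_operand(const struct instr *older, const struct instr *younger,
                      const uint16_t regfile[16])
{
    if (raw_hazard(older, younger))
        return older->result;         /* bypass the register file */
    return regfile[younger->src];     /* normal register file read */
}

In a simple non-pipelined CPU (the first option) this check isn't needed at all, because the mov has fully completed before the ld starts.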

Can SIPI be sent from a BSP running in long mode?

Currently I have a multiprocessing operating system running in x86 protected mode, and I want to make it run in x86_64 long mode. Its current logic to wake up the APs is to send an INIT-SIPI-SIPI sequence:
// BSP has already entered protected mode and set up page tables
volatile uint32_t *icr = (volatile uint32_t *)0xfee00300;
*icr = 0x000c4500ul; // send INIT
delay_us(10000); // delay for 10 ms
while (*icr & 0x1000); // wait until Send Pending bit is clear
for (int i = 0; i < 2; i++) {
    *icr = 0x000c4610ul; // send SIPI
    delay_us(200); // delay for 200 us
    while (*icr & 0x1000); // wait until Send Pending bit is clear
}
This program works well in 32-bit protected mode.
However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).
In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that BSP should "Switches to protected mode". Does this process apply to long mode? How should I debug this problem?
Here is some debug information, in case it helps:
CPU registers before sending SIPI instruction (movl $0xc4610,(%rax)) in 64-bit long mode:
rax 0xfee00300 4276093696
rbx 0x40 64
rcx 0x0 0
rdx 0x61 97
rsi 0x61 97
rdi 0x0 0
rbp 0x1996ff78 0x1996ff78
rsp 0x1996ff38 0x1996ff38
r8 0x1996ff28 429326120
r9 0x2 2
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x1020d615 0x1020d615
eflags 0x97 [ IOPL=0 SF AF PF CF ]
cs 0x10 16
ss 0x18 24
ds 0x18 24
es 0x18 24
fs 0x18 24
gs 0x18 24
fs_base 0x0 0
gs_base 0x0 0
k_gs_base 0x0 0
cr0 0x80000011 [ PG ET PE ]
cr2 0x0 0
cr3 0x19948000 [ PDBR=12 PCID=0 ]
cr4 0x20 [ PAE ]
cr8 0x0 0
efer 0x500 [ LMA LME ]
mxcsr 0x1f80 [ IM DM ZM OM UM PM ]
CPU registers before sending SIPI instruction (movl $0xc4610,(%eax)) in 32-bit protected mode:
rax 0xfee00300 4276093696
rbx 0x40000 262144
rcx 0x0 0
rdx 0x61 97
rsi 0x2 2
rdi 0x102110eb 270602475
rbp 0x19968f4c 0x19968f4c
rsp 0x19968f04 0x19968f04
r8 0x0 0
r9 0x0 0
r10 0x0 0
r11 0x0 0
r12 0x0 0
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x1020d075 0x1020d075
eflags 0x97 [ IOPL=0 SF AF PF CF ]
cs 0x8 8
ss 0x10 16
ds 0x10 16
es 0x10 16
fs 0x10 16
gs 0x10 16
fs_base 0x0 0
gs_base 0x0 0
k_gs_base 0x0 0
cr0 0x80000015 [ PG ET EM PE ]
cr2 0x0 0
cr3 0x19942000 [ PDBR=12 PCID=0 ]
cr4 0x30 [ PAE PSE ]
cr8 0x0 0
efer 0x0 [ ]
mxcsr 0x1f80 [ IM DM ZM OM UM PM ]
Can SIPI be sent from a BSP running in long mode?
Yes. The only thing that matters is that you write the right values to the right local APIC registers (with the right delays, sort of - see my method at the end).
However, after I modified the operating system to run in 64-bit long mode, the logic breaks when sending SIPI. In QEMU, immediately after executing the send SIPI line, the BSP is reset (program counter goes to 0xfff0).
I'd assume that either:
a) there's a bug and the address of the local APIC's registers isn't right, causing a triple fault when you attempt to write to the local APIC's register. Don't forget that long mode must use paging, and even though 0xFEE00300 is likely to be the correct physical address, it can be the wrong virtual address (unless you took care of that by identity mapping that specific page when porting the OS to long mode; a rough sketch of such a mapping follows these two points).
b) The data isn't right for some hard to imagine reason, causing the SIPI to restart the BSP.
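For case a), here's a minimal sketch of what "identity mapping that specific page" could look like. The helper name, the page-table pointer and the flag macros are assumptions about your paging code, not a real API:

#include <stdint.h>

#define PTE_PRESENT (1ull << 0)
#define PTE_WRITE   (1ull << 1)
#define PTE_PCD     (1ull << 4)   /* cache disable: the local APIC registers are MMIO */

/* 'pt' is assumed to be the 4 KiB page table that covers 0xFEE00000 in
   your long mode paging structures. */
void map_lapic_identity(uint64_t *pt)
{
    uint64_t lapic_phys = 0xFEE00000ull;
    pt[(lapic_phys >> 12) & 0x1FF] = lapic_phys | PTE_PRESENT | PTE_WRITE | PTE_PCD;
}

With that in place, the virtual address 0xFEE00300 used by the ICR write resolves to the same physical address it did in protected mode.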
In Intel's software developer's manual volume 3, section 8.4.4.1 (Typical BSP Initialization Sequence), it says that BSP should "Switches to protected mode". Does this process apply to long mode?
Intel's "Typical BSP Initialization Sequence" is just one possible example that's only intended for firmware developers. Note that "intended for firmware developers" means that it should not be used by any OS.
The main problem with Intel's example is that it broadcasts the INIT-SIPI-SIPI sequence to all other CPUs (possibly including CPUs that the firmware disabled because they're faulty, and possibly including CPUs that the firmware disabled for other reasons, e.g. because the user disabled hyper-threading); and it fails to detect "CPU exists but failed to start for some reason" (which an OS should report to the user).
The other problem is that typically an OS will want to pre-allocate a stack for each AP before starting it (and store an "address you should use for your stack" value somewhere before starting that AP), and you can't give each AP its own stack like that if you're starting an unknown number of CPUs at the same time.
Essentially: firmware uses (something like) the example Intel described, then builds information in an ACPI "MADT" table (and/or a "MultiProcessor Specification" table for very old computers - it's obsolete now) for the OS to use. The OS uses information from the firmware's table/s to find the physical address of the local APIC in a correct (vendor and platform neutral) way, to find only the CPUs that the firmware says are valid, and to determine whether those CPUs are using the local APIC or the x2APIC (which supports more than 256 APIC IDs and is necessary if there's a huge number of CPUs). It then starts only valid CPUs, one at a time, with a time-out, so that "CPU #123, which I have proof exists, has failed to start" can be reported to the user and/or logged.
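As a rough illustration of "uses information from the firmware's table/s", here is a sketch of walking the MADT to get the local APIC's physical address and the list of usable CPUs. Finding the table itself (via the RSDP and RSDT/XSDT) is assumed to have happened already, and checksum/error handling is omitted:

#include <stdint.h>
#include <string.h>

struct madt_header {             /* ACPI SDT header followed by the MADT fields */
    char     signature[4];       /* "APIC" */
    uint32_t length;
    uint8_t  revision, checksum;
    char     oem_id[6], oem_table_id[8];
    uint32_t oem_revision, creator_id, creator_revision;
    uint32_t local_apic_addr;    /* physical address of the local APIC */
    uint32_t flags;
} __attribute__((packed));

void enumerate_cpus(const uint8_t *madt)
{
    const struct madt_header *h = (const struct madt_header *)madt;
    uint32_t lapic_phys = h->local_apic_addr;   /* use this, not a hard-coded 0xFEE00000 */
    (void)lapic_phys;

    for (uint32_t off = sizeof *h; off + 2 <= h->length; off += madt[off + 1]) {
        if (madt[off] == 0) {                   /* type 0: Processor Local APIC entry */
            uint8_t  apic_id = madt[off + 3];
            uint32_t cpu_flags;
            memcpy(&cpu_flags, &madt[off + 4], sizeof cpu_flags);
            if (cpu_flags & 1) {                /* bit 0: this CPU is enabled/usable */
                (void)apic_id;                  /* remember it; start it later */
            }
        }
        /* type 9 entries describe x2APIC CPUs (needed for more than 255 APIC IDs) */
    }
}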
I should also point out that Intel's example has existed in Intel's manuals mostly unchanged for about 25 years (since before long mode was introduced).
My Method
The delays in Intel's algorithm are annoying, and often a CPU will start on the first SIPI, and sometimes the second SIPI will cause the same CPU to be started twice (causing problems if you have any kind of "started_CPUs++;" in the AP startup code).
To fix these problems (and improve performance) the AP startup code can set an "I started" flag; and instead of having a "delay_us(200);" after sending the first SIPI, the BSP can monitor the "I started" flag with a time-out and skip the second SIPI (and the remainder of the time-out) if the AP already started. In this case the time-out between SIPIs can be longer (e.g. 500 us is fine) and, more importantly, needn't be so precise; and the same "wait for flag with time-out" code can be re-used after sending the second SIPI (if the second SIPI needed to be sent), with a much longer time-out.
This alone doesn't completely solve the "CPU started twice" problem; and it doesn't solve "the AP started after the second SIPI, but only after the time-out expired, so now there are 2 APs running and the OS only knows about one". These problems are fixed with extra synchronization - specifically, the AP sets the "I started" flag and then waits for the BSP to set a "you can continue if your APIC ID is ..." value (and if the AP detects that the APIC ID value is wrong it can do a "CLI then HLT" loop to shut itself down). A rough sketch of this handshake follows.
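Here's a rough sketch of that handshake. send_init(), send_sipi(), read_apic_id() and the trampoline wiring are assumed helpers (not a real API), and the time-outs are just the ballpark figures mentioned above:

#include <stdint.h>

void send_init(uint8_t apic_id);           /* assumed: writes the ICR */
void send_sipi(uint8_t apic_id, uint8_t vector);
uint8_t read_apic_id(void);                /* assumed: APIC ID of the running CPU */
void delay_us(unsigned us);

volatile int ap_started_flag = 0;
volatile int allowed_apic_id = -1;

/* BSP side: returns 1 if the AP set its "I started" flag within timeout_us. */
static int wait_for_ap(unsigned timeout_us)
{
    while (timeout_us--) {
        if (ap_started_flag)
            return 1;
        delay_us(1);
    }
    return 0;
}

/* BSP side: start one AP, skipping the second SIPI if the first one worked. */
int start_ap(uint8_t apic_id, uint8_t vector)
{
    ap_started_flag = 0;
    send_init(apic_id);
    delay_us(10000);                 /* 10 ms after INIT */

    send_sipi(apic_id, vector);
    if (!wait_for_ap(500)) {         /* ~500 us; doesn't need to be precise */
        send_sipi(apic_id, vector);
        if (!wait_for_ap(100000))    /* much longer time-out after the 2nd SIPI */
            return 0;                /* report "CPU failed to start" */
    }
    allowed_apic_id = apic_id;       /* let exactly this AP continue */
    return 1;
}

/* AP side, called from the AP startup trampoline. */
void ap_entry(void)
{
    ap_started_flag = 1;
    while (allowed_apic_id != read_apic_id())
        ;                            /* a duplicate/unexpected start could do "cli; hlt" here */
    /* ...continue normal AP initialisation... */
}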
Finally; if you do the whole "INIT-SIPI-SIPI" sequence one CPU at a time, then it can be slow if there's lots of CPUs (e.g. at least a whole second for 100 CPUs due to the 10 ms delay after sending INIT). This can be significantly reduced by using 2 different methods:
a) starting CPUs in parallel. For best case; BSP can start 1 AP, then BSP+AP can start 2 more APs, then BSP+3 APs can start 4 more APs, etc. This means 128 CPUs can be started in slightly more than 70 ms (instead of over a whole second). To make this work (to give each AP different values to use for stack, etc) it's best to use multiple AP CPU startup trampolines (e.g. so that an AP can do "mov esp,[cs:stackPointer]" where different APs are started with different values in cs because that came from the SIPI).
b) Sending multiple INITs to multiple CPUs one at a time; then having one 10 ms delay; then doing the later "SIPI-SIPI" sequence one CPU at a time. This relies on the later "SIPI-SIPI" sequence being relatively fast (compared to the huge 10 ms delay after INIT) and the CPU not being too fussy about the exact length of that 10 ms delay. For example; if you send 4 INITs to 4 CPUs and you know that (for a worst case) the SIPI-SIPI takes 1 ms for the OS to decide that the CPU failed to start; then there'd be a delay of 13 ms between sending the INIT to the fourth/last CPU and sending the first SIPI to the fourth/last CPU.
Note that if you're brave, both of these approaches can be combined (e.g. you could start 128 CPUs in a little more than 50 ms).

Immediate Addressing mode difference?

Recently I was studying the concept of addressing modes, the first type being immediate addressing mode. Consider the example ADD #NUM1, R0 (instruction execution is from left to right).
Here, is the address of NUM1 stored in register R1?
What about when we do ADD #4, R0 to make it point to the next data? When we use #4, I understand that it adds 4 to the contents of register R0. Is there a difference when we use #NUM1 and #4? Please explain!
Is there a difference when we use #NUM1 and #4
In the final machine code in the executable that a CPU will actually run, no, there isn't.
If you have an assembler that directly creates an executable (no separate linking step), then the assembler will know at assemble time the numeric address of NUM1, and simply expand it as an immediate, producing exactly the same machine code as if you'd written add #0x700, R0. (Assuming the NUM1 label ends up at address 0x700 for this example.)
e.g. if the machine encoding for add #imm, R0 is 00 00 imm16, then you'll get 00 00 07 00 (assuming a big-endian immediate).
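To make that concrete, here is a small C sketch of the encoding step. The 00 00 opcode and the 0x700 address for NUM1 are just the assumptions from the example above, not a real ISA:

#include <stdint.h>
#include <stdio.h>

/* Toy encoder for "add #imm, R0" using the example encoding 00 00 imm16. */
static void encode_add_imm_r0(uint16_t imm, uint8_t out[4])
{
    out[0] = 0x00;
    out[1] = 0x00;
    out[2] = (uint8_t)(imm >> 8);    /* big-endian immediate */
    out[3] = (uint8_t)(imm & 0xFF);
}

int main(void)
{
    uint16_t NUM1 = 0x700;           /* address the assembler/linker resolved for the label */
    uint8_t a[4], b[4];

    encode_add_imm_r0(NUM1, a);      /* add #NUM1, R0  */
    encode_add_imm_r0(0x700, b);     /* add #0x700, R0 */

    /* Both lines print 00 00 07 00: identical machine code. */
    printf("%02x %02x %02x %02x\n", a[0], a[1], a[2], a[3]);
    printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
    return 0;
}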
Here, is the address of NUM1 stored in Register R1?
No, it's added to R0. If R0 previously contained 4, then R0 will now hold the address NUM1+4.
R1 isn't affected.
Often you have an assembler and a separate linker (e.g. as foo.s -o foo.o to assemble, then ld -o foo foo.o to link).
The numeric address isn't available at assemble time, only at link time. An object file format holds metadata for the symbol relocations, which let the linker fill in the absolute numeric addresses once it decides where the code will be loaded.
The resulting machine code will still be the same.

How to interpret double entries in a WinDbg "x /2" result?

I'm debugging a dump file (a memory dump, not a crash dump), which seems to contain twice the expected number of objects. While investigating the corresponding symbols, I've noticed the following:
0:000> x /2 <product_name>!<company>::<main_product>::<chapter>::<subchapter>::<Current_Object>*
012511cc <product_name>!<company>::<main_product>::<chapter>::<subchapter>::<Current_ObjectID>::`vftable'
012511b0 <product_name>!<company>::<main_product>::<chapter>::<subchapter>::<Current_ObjectID>::`vftable'
01251194 <product_name>!<company>::<main_product>::<chapter>::<subchapter>::<Current_Object>::`vftable'
0125115c <product_name>!<company>::<main_product>::<chapter>::<subchapter>::<Current_Object>::`vftable'
For your information, the entries Current_Object and Current_ObjectID are present in the code; no problem there.
What I don't understand is that there seem to be two entries for every symbol, and their memory addresses are very close to each other.
Does anybody know how I can interpret this?
It can be due to a variety of reasons, optimization and redundant-code elimination at link time being one (the PDB is normally made when you compile). See this link by Raymond Chen for an overview.
Quoting the relevant paragraph from the link:
And when you step into the call to p->GetValue() you find yourself in Class1::GetQ.
What happened?
What happened is that the Microsoft linker combined functions that are identical
at the code generation level.
?GetQ@Class1@@QAEPAHXZ PROC NEAR ; Class1::GetQ, COMDAT
00000 8b 41 04 mov eax, DWORD PTR [ecx+4]
00003 c3 ret 0
?GetQ@Class1@@QAEPAHXZ ENDP ; Class1::GetQ
?GetValue@Class2@@UAEHXZ PROC NEAR ; Class2::GetValue, COMDAT
00000 8b 41 04 mov eax, DWORD PTR [ecx+4]
00003 c3 ret 0
?GetValue@Class2@@UAEHXZ ENDP ; Class2::GetValue
Observe that at the object code level, the two functions are identical.
(Note that whether two functions are identical at the object code level is
highly dependent on which version of what compiler you're using, and with
which optimization flags. Identical code generation for different functions
occurs with very high frequency when you use templates.) Therefore, the
linker says, "Well, what's the point of having two identical functions? I'll
just keep one copy and use it to stand for both Class1::GetQ and
Class2::GetValue."
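As a hedged illustration of that folding (the flags are MSVC's real /Gy and /OPT:ICF options, but whether folding actually happens depends on the compiler version and settings, exactly as the quote says):

/* Built with /Gy (function-level linking) and linked with /OPT:ICF, these
   two functions generate byte-identical code, so the linker may keep a
   single copy and let both symbols resolve to the same address - and the
   debugger can then show the "wrong" name for one of them. */
int get_q(const int *obj)     { return obj[1]; }
int get_value(const int *obj) { return obj[1]; }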

How does a microprocessor process an instruction set

For example if I have an 8085 microprocessor.
And below are the instructions.
MVI A, 52H : Store 52H in the accumulator
STA 4000H : Copy accumulator contents at address 4000H
HLT : Terminate program execution
How does the microprocessor understand the commands MVI, STA, and HLT?
If I am correct, HLT has 76 as its opcode. In that case, how does a microprocessor recognize 76 as an instruction rather than data?
It depends on the processor. Some have fixed-length instructions, in which case instruction bytes occur at every <n>th location, whereas some have variable-length instructions, so which words/bytes are opcodes and which are arguments depends on what came before. To further complicate this, some processors have certain instructions which must be aligned or padded to certain addresses. Yikes.
The 8085 has variable length instructions. So you have to start at the PC and interpret each instruction based on its length to know where the next begins, and which bytes are data/arguments as opposed to opcodes.
A value of 76 can represent anything, it depends on how it is being interpreted.
In the case of a microprocessor, there is a special register that contains the memory address of the next instruction to execute. That data is then loaded and interpreted as an instruction to execute. If the address of the next instruction contains the value 76, this will be interpreted as HLT (in your case). Obviously a different processor might interpret 76 as a different instruction.
On the other hand, if the data from this address is interpreted as a numerical value, it will just mean 76.
It's just that when the processor finds 76 as part of the program it is executing (that is, its "program counter" points to the place in memory where the 76 is), it will interpret it as an instruction.
If the processor is then told by its program to load that same 76, from some other place in memory or even from the same place in memory, into a register and use it for calculations, it is interpreted as data.
This is the so-called von Neumann architecture, where program and data are stored in the same computer memory. It all looks the same, but the processor is told by its program which content to treat as data.
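A minimal sketch of that fetch-and-interpret loop, covering only the three instructions from the example (0x3E, 0x32 and 0x76 are the real 8085 opcodes for MVI A, STA and HLT; everything else is simplified):

#include <stdint.h>

uint8_t  mem[65536];   /* program and data share this memory (von Neumann) */
uint8_t  A;            /* accumulator */
uint16_t pc;           /* program counter */

void run(void)
{
    for (;;) {
        uint8_t opcode = mem[pc++];      /* the byte at PC is always decoded as an opcode */
        switch (opcode) {
        case 0x3E:                       /* MVI A, d8: the next byte is data, not an opcode */
            A = mem[pc++];
            break;
        case 0x32: {                     /* STA a16: the next two bytes are an address (low byte first) */
            uint16_t addr = mem[pc] | (uint16_t)(mem[pc + 1] << 8);
            pc += 2;
            mem[addr] = A;
            break;
        }
        case 0x76:                       /* HLT */
            return;
        /* ...the real 8085 decodes the remaining opcodes the same way... */
        }
    }
}

A byte 0x76 sitting somewhere in the data area is just a number, because the program counter never points at it.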

Questions on iPhone code disassembly

This is the disassembly of syscall() on iPhone.
(gdb) disass syscall
Dump of assembler code for function syscall:
0x3195fafc <syscall+0>: mov r12, sp
0x3195fb00 <syscall+4>: push {r4, r5, r6, r8}
0x3195fb04 <syscall+8>: ldm r12, {r4, r5, r6}
0x3195fb08 <syscall+12>: mov r12, #0 ; 0x0
0x3195fb0c <syscall+16>: svc 0x00000080
0x3195fb10 <syscall+20>: pop {r4, r5, r6, r8}
0x3195fb14 <syscall+24>: bcc 0x3195fb2c <syscall+48>
0x3195fb18 <syscall+28>: ldr r12, [pc, #4] ; 0x3195fb24 <syscall+40>
0x3195fb1c <syscall+32>: ldr r12, [pc, r12]
0x3195fb20 <syscall+36>: b 0x3195fb28 <syscall+44>
0x3195fb24 <syscall+40>: cfldrdeq mvd15, [r12], #992
0x3195fb28 <syscall+44>: bx r12
0x3195fb2c <syscall+48>: bx lr
End of assembler dump.
Can someone please explain what the instructions at offsets +28 and +32 are doing? At +28, the value of r12 is 0 (set at +12), so it looks like r12 is being set to (in C notation) *(pc + 4). At +32, r12 is set to *(pc + r12) - note that this instruction does not compile; see the errors below. The 'b' at +36 jumps to +44, which branches to the address in r12. So what value was loaded into r12 by +28 and +32?
What does the cfldrdeq instruction at +40 do? I have checked the ARM instruction set and searched for it, but have not found anything.
I added this code to my C program using asm(). When compiling, the compiler shows these errors. Any idea how to get around this?
/var/folders/62/3px_xsd56ml5gz18lp8dptjc0000gv/T//ccDThXFx.s:7607: cannot use register index with PC-relative addressing -- `ldr r12,[pc,r12]'
/var/folders/62/3px_xsd56ml5gz18lp8dptjc0000gv/T//ccDThXFx.s:7609: selected processor does not support `cfldrdeq mvd15,[r12],#992'
It makes more sense if you know of the small gotcha surrounding reading the PC: most instructions that read PC see a value of address_of_current_instruction+8 (except +4 in thumb mode, and ldm in ARM mode might be either +8 or +12 IIRC).
cfldrdeq mvd15, [r12], #992 is not meant to be an instruction; it's a relative relocation that points to a relocation in the DATA section. In the DATA section, there'll be a dynamic relocation that points to the actual address. Typical pseudocode looks something like this:
ldr r12,[pc,#small_offset_to_foo]
ldr r12,[pc,r12]
bx r12
... a short distance away ...
foo:
int relative_offset_of_bar_from_the_second_ldr
... a galaxy far far away ...
bar:
int pointer_to_the_actual_syscall
I do not know why the disassembly for syscall() places "foo" between ldr r12,[pc,r12] and bx r12, causing the branch over the non-instruction "foo".
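To make the +28/+32 arithmetic concrete, here is a rough C model using the addresses from the dump. read32() and the foo/bar naming are assumptions for illustration only; remember that reading pc in ARM mode yields the instruction's address + 8:

#include <stdint.h>

uint32_t read32(uint32_t addr);   /* assumed: reads a 32-bit word from memory */

uint32_t resolve_syscall_target(void)
{
    /* +28: ldr r12, [pc, #4] -> pc reads as 0x3195fb18 + 8, so
       r12 = *(uint32_t *)0x3195fb24, i.e. the word the disassembler shows
       as "cfldrdeq" (that's "foo", the relative offset). */
    uint32_t r12 = read32(0x3195fb18 + 8 + 4);

    /* +32: ldr r12, [pc, r12] -> pc reads as 0x3195fb1c + 8 = 0x3195fb24,
       so r12 = *(0x3195fb24 + offset): a pointer slot in DATA ("bar") that
       the dynamic linker has filled with the real target address. */
    r12 = read32(0x3195fb1c + 8 + r12);

    return r12;                   /* +44: bx r12 branches there */
}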
It is also worth mentioning that simply pasting the code shown will almost certainly not work: you don't have the relocation that points to the actual implementation of syscall (in a debugger, step past bx r12 and you should get there); you'll just branch to some randomish address.
The error "cannot use register index with PC-relative addressing" is apparently because you're compiling in Thumb mode (the listing is ARM code). As for cfldrdeq, I believe it's just a conditional cfldrd instruction (the "eq" is a condition code), which Google suggests is related to a the Cirrus Logic "Maverick" processor series.