How to write a X86_64 _assembler_? - x86-64

Goal: I want to write an X86_64 assembler. Note: marked as community wiki
Background: I'm familiar with C. I've written MIPS assembly before. I've written some x86 assembly. However, I want to write an x86_64 assembler -- it should output machine code that I can jump to and start executing (like in a JIT).
Question is: what is the best way to approach this? I realize this problem looks kind large to tackle. I want to start out with a basic minimum set:
Load into register
Arithmetric ops on registers (just integers is fine, no need to mess with FPU yet)
Conditionals
Jumps
Just a basic set to make it Turing complete. Anyone done this? Suggestions / resources?

An assembler, like any other "compiler", is best written as a lexical analyser feeding into a language grammar processor.
Assembly language is usually easier than the regular compiled languages since you don't need to worry about constructs crossing line boundaries and the format is usually fixed.
I wrote an assembler for a (fictional) CPU some two years ago for educational purposes and it basically treated each line as:
optional label (e.g., :loop).
operation (e.g., mov).
operands (e.g., ax,$1).
The easiest way to do it is to ensure that tokens are easily distinguishable.
That's why I made the rule that labels had to begin with : - it made the analysis of the line so much easier. The process for handling a line was:
strip off comments (first ; outside a string to end of line).
extract label if present.
first word is then the operation.
rest are the operands.
You can easily insist that different operands have special markers as well, to make your life easier. All this is assuming you have control over the input format. If you're required to use Intel or AT&T format, it's a little more difficult.
The way I approached it is that there was a simple per-operation function that got called (e.g., doJmp, doCall, doRet) and that function decided on what was allowed in the operands.
For example, doCall only allows a numeric or label, doRet allows nothing.
For example, here's a code segment from the encInstr function:
private static MultiRet encInstr(
boolean ignoreVars,
String opcode,
String operands)
{
if (opcode.length() == 0) return hlprNone(ignoreVars);
if (opcode.equals("defb")) return hlprByte(ignoreVars,operands);
if (opcode.equals("defbr")) return hlprByteR(ignoreVars,operands);
if (opcode.equals("defs")) return hlprString(ignoreVars,operands);
if (opcode.equals("defw")) return hlprWord(ignoreVars,operands);
if (opcode.equals("defwr")) return hlprWordR(ignoreVars,operands);
if (opcode.equals("equ")) return hlprNone(ignoreVars);
if (opcode.equals("org")) return hlprNone(ignoreVars);
if (opcode.equals("adc")) return hlprTwoReg(ignoreVars,0x0a,operands);
if (opcode.equals("add")) return hlprTwoReg(ignoreVars,0x09,operands);
if (opcode.equals("and")) return hlprTwoReg(ignoreVars,0x0d,operands);
The hlpr... functions simply took the operands and returned a byte array containing the instructions. They're useful when many operations have similar operand requirements, such as adc,addandand` all requiring two register operands in the above case (the second parameter controlled what opcode was returned for the instruction).
By making the types of operands easily distinguishable, you can check what operands are provided, whether they are legal and which byte sequences to generate. The separation of operations into their own functions provides for a nice logical structure.
In addition, most CPUs follow a reasonably logical translation from opcode to operation (to make the chip designers lives easier) so there will be very similar calculations on all opcodes that allow, for example, indexed addressing.
In order to properly create code in a CPU that allows variable-length instructions, you're best of doing it in two passes.
In the first pass, don't generate code, just generate the lengths of instructions. This allows you to assign values to all labels as you encounter them. The second pass will generate the code and can fill in references to those labels since their values are known. The ignoreVars in that code segment above was used for this purpose (byte sequences of code were returned so we could know the length but any references to symbols just used 0).

Not to discourage you, but there are already many assemblers with various bells and whistles. Please consider contributing to an existing open source project like elftoolchain.

Related

How does one create constants in Tcl?

I have seen the use of set keyword in tcl often. This cannot be used to create constant. How does one create constant in tcl which can then be used by other procedures?
Generally speaking, most of the uses of constants fall into several categories:
enumeration values
magical numbers
looping control factors
scaling factors
In Tcl, for the first case you'd usually just use the name instead of mapping it to an integer, with integer mappings only being applied in the cases that need them. Even bit sets can be handled that way, substituting a list of names for the set of bits (and the name being present in the list is equivalent to the bit being set). Tcl's C API has relevant functions for helping with this, specifically Tcl_GetIndexFromObj().
Magical values are usually best locked away close to the code that handles them. If I was interfacing to hardware, I'd not let the magic values appear at all at the script level (since I'd have the binding code written in C).
Looping control factors are often best represented as default values for procedure arguments, as they are things that you want to sometimes override. But they're often not as needed once custom control structures are available, and they fit a lot more into the Tcl style of working.
Scaling factors are the case where constants might be useful. I tend to simulate those by just using a global or namespace variable and plain old not assigning to it from elsewhere. I'd be quite interested in having code to allow constants (specifically variables that can't be assigned to) as a standard feature, but we don't have that right now.
Once those cases are covered, what remain tend to be unimportant constants. After all, there's almost no need to calculate the sizes of things for allocation and stuff like that, and things like positional binding in SQL statements are discouraged within TDBC in favour of binding by name (an awful lot easier to get right).
A simple way of making a constant is to put a write trace on a variable so that whenever it is written to, it is reset back to its constant value.
set CONSTANT 123
trace add variable CONSTANT write {apply {args {
global CONSTANT
# Reset to the constant value; write traces are after the fact
set CONSTANT 123
# Make the write give an error
error "may not change constant"
}}}

How could I encode "implies" logic in LogicBlox?

I would like to encode "implies" logic in LogicBlox.
I have a predicate:
Number(n),hasNumberName(n:i)->int(i).
isTrue[n] = i -> Number(n), boolean(i).
And I add some data in that predicate:
+Number(1).
Now, I want to create number 2 and number 3, and the truth value for these two number following this logic rule:
If isTrue[1] is true, then isTrue[2] is true or isTrue[3] is true. (isTrue[1] implies (isTrue[2] or isTrue[3]))
So I create a predicate:
implies[n1,n2,n3] = e -> Number(n1), Number(n2), Number(n3),boolean(e).
Then I try to create a rule like that:
isTrue[n2] = true;isTrue[n3] = true <- isTrue[n1] = true,implies[n1,n2,n3] = true.
But LogicBlox reports:"error: disjunction is not supported in the head of a rule "
So how can I encoding this implies logic in LogicBlox?
From your question it looks like you're asking this question with a Prolog background. If so, then it might be helpful to read a Datalog introduction, for example "What you always wanted to know about Datalog (and never dared to ask)".
The logic you want to express is on purpose not allowed in Datalog, because it requires a solving or search strategy. As opposed to Prolog, Datalog is on purpose restricted in the computational complexity of the programs you can express. As a result of these restrictions it meets important requirement for use in a database management system, most importantly supporting very large data sets. The computational complexity restrictions will be more clear after reading a good introduction to Datalog.
People have studied extensions of Datalog to allow more programs to be expressed (without going to full Prolog, which would result in a more procedural semantics). This particular example is called "Disjunctive Datalog". The hits on Google look good for this if you want to read more. LogicBlox does (at least currently) not implement Disjunctive Datalog because our primary objective is to be a scalable database management system.
LogicBlox does support using a solver for specific programs. A typical example is the knapsack problem. If your problem is expressible as an optimization problem (it almost certainly is, but the formulation usually requires some creativity for things that are not conventional optimization problems), then you could use this feature. The solver functionality is not very well documented in publicly available material yet. Please reach out to us directly if you would like to give this a try.
I assume you are trying to enforce a constraint that 1 -> 2 or 3 ? If so, trying to derive a value using <- is not going to work: if neither 2 nor 3 is present, which one(s) are you telling the system create? Instead, just write the constraint using -> syntax. Constraints are implications, after all (the right arrow syntax is no accident!), and that puts the disjunction on the right hand side where the language allows it. Then, if you ever try to create 1 and neither 2 nor 3 exists, the system will report a constraint failure because the implication was not found to hold.
Also, you don't usually need boolean-valued functions in logic languages; isTrue(x) can just be the set of x which you consider to be "true" (and any not present are "false").

Flow Control Instructions in a virtual machine

I've been implementing my own scripting language + virtual machine from scratch for a small experiment. A script reader parses the script and translates it to a stream of instructions that a runtime engine will execute.
At the beginning I didn't think about it but now I'd like to include flow control (loops, branching etc). I'm not well versed with language theory and just looked at some examples for inspiration.
But both the x86 and the java virtual machine have a plethora of instructions used for flow control. In x86 there are plenty instructions that jump based on the state of flags and other instructions that manipulate the relevant flags one way or another. In Java there seem to be 16 instructions that make some sort of comparison and a conditional jump.
This might be efficient or motivated by hardware specific reasons but it's not what I'm looking for.
I look for a lean, elegant solution to flow control that only requires a few dedicated instructions and isn't too complicated to implement and maintain.
I'm pretty confident I could come up with something that works but I'd rather improve my knowledge instead of reinventing the wheel. Any explanations or links to relevant material are very welcome!
Generally the minimum primitives required for flow control are
unconditional jump
conditional jump
Of these, the conditional jump is the complex one, and at a minimum it needs to support the following atomically:
test a binary variable/flag
if the flag is set, cause instruction execution to jump to some specified location
if the flag is unset, allow instruction execution to continue uninterrupted
However with such a primitive conditional jump, you would need ways to set that binary variable/flag to the appropriate value for every type of boolean expression that could be used in the flow control structures of your language.
This would therefore either lead to the need for various primitives of varying complexity for setting the binary variable/flag, or the need to emit complex sequences of instructions to get the desired effect.
The other alternative is to introduce more complex conditional jump primitives.
Generally there will be a trade-off between the number and complexity of each of: conditional jump primitives; condition (variable/flag) setting primitives; emitted instructions.

What did John McCarthy mean by *pornographic programming*?

In the History of Lisp, McCarthy writes :
The unexpected appearance of an interpreter tended to freeze the form of the language, and some of the decisions made rather lightheartedly for the ``Recursive functions ...'' paper later proved unfortunate. These included the COND notation for conditional expressions which leads to an unnecessary depth of parentheses, and the use of the number zero to denote the empty list NIL and the truth value false. Besides encouraging pornographic programming, giving a special interpretation to the address 0 has caused difficulties in all subsequent implementations.
What's he talking about?
... zero to denote the empty list ...
because 0==() has been the emoticon for pornography since 1958.
Now you know.
The fact that too many implementation details were leaking at a higher level, i.e. showing up too much
The original Fortran III spec document, a technical paper disseminated in the Winter of 1958 describes some very explicit additions to the Fortran II language, including ... inline assembly.
The PDF is here
A tantalizing description of the "additions" follows :
Some taboo code is
Mysteriously, Fortran-III was never released to the public (see section 5.), but disseminated in limited fashion before quietly fading away.
I think it is about mixing numerical and logic values, which can still be seen in popular constructs, probably originated in Fortran, like while (1). There are a lot of "clever" C algorithms, that rely on the fact, that 0 is false and every other value isn't.
The same applies at large to API calls, like in POSIX or Linux kernel, some of which return 0 on failure, while some -1 (there's a rule of thumb, when to apply which, but it is just folklore, so often it is broken). Considering the fact, that at McCarthy's time, those things weren't developed yet, you can see his "prophetic" power even here.
Perhaps it was his way of talking about null references: the billion dollar mistake (T. Hoare).

Design - When to create new functions?

This is a general design question not relating to any language. I'm a bit torn between going for minimum code or optimum organization.
I'll use my current project as an example. I have a bunch of tabs on a form that perform different functions. Lets say Tab 1 reads in a file with a specific layout, tab 2 exports a file to a specific location, etc. The problem I'm running into now is that I need these tabs to do something slightly different based on the contents of a variable. If it contains a 1 I may need to use Layout A and perform some extra concatenation, if it contains a 2 I may need to use Layout B and do no concatenation but add two integer fields, etc. There could be 10+ codes that I will be looking at.
Is it more preferable to create an individual path for each code early on, or attempt to create a single path that branches out only when absolutely required.
Creating an individual path for each code would allow my code to be extremely easy to follow at a glance, which in turn will help me out later on down the road when debugging or making changes. The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same.
Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique.
I realize that this may be a case-by-case decision, but in general, if you were handed a previously built program to work on, which would you prefer?
Edit: I've drawn some simple images to help express it. Codes 1/2/3 are the variables and the lines under them represent the paths they would take. All of these steps need to be performed in a linear chronological fashion, so there would be a function to essentially just call other functions in the proper order.
Different Paths
Single Path
Creating a single path that would
branch out only when required will be
a bit messier and more difficult to
follow at a glance, but I would create
less code by placing conditionals only
at steps that are unique.
Im not buying this statement. There is a level of finesse when deciding when to write new functions. Functions should be as simple and reusable as possible (but no simpler). The correct answer is almost never 'one big file that does a lot of branching'.
Less LOC (lines of code) should not be the goal. Readability and maintainability should be the goal. When you create functions, the names should be self documenting. If you have a large block of code, it is good to do something like
function doSomethingComplicated() {
stepOne();
stepTwo();
// and so on
}
where the function names are self documenting. Not only will the code be more readable, you will make it easier to unit test each segment of the code in isolation.
For the case where you will have a lot of methods that call the same exact methods, you can use good OO design and design patterns to minimize the number of functions that do the same thing. This is in reference to your statement "The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same."
The biggest danger in starting with one big block of code is that it will never actually get refactored into smaller units. Just start down the right path to begin with....
EDIT --
for your picture, I would create a base-class with all of the common methods that are used. The base class would be abstract, with an abstract method. Subclasses would implement the abstract method and use the common functions they need. Of course, replace 'abstract' with whatever your language of choice provides.
You should always err on the side of generalization, with the only exception being early prototyping (where throughput of generating working stuff is majorly impacted by designing correct abstractions/generalizations). having said that, you should NEVER leave that mess of non-generalized cloned branches past the early prototype stage, as it leads to messy hard to maintain code (if you are doing almost the same thing 3 different times, and need to change that thing, you're almost sure to forget to change 1 out of 3).
Again it's hard to specifically answer such an open ended question, but I believe you don't have to sacrifice one for the other.
OOP techniques solves this issue by allowing you to encapsulate the reusable portions of your code and generate child classes to handle object specific behaviors.
Personally I think you might (if possible by your API) create inherited forms, create them on fly on master form (with tabs), pass agruments and embed in tab container.
When to inherit form and when to decide to use arguments (code) to show/hide/add/remove functionality is up to you, yet master form should contain only decisions and argument passing and embeddable forms just plain functionality - this way you can separate organisation from implementation.