linker: what is "NMAGIC" section of linked file and what is the aim of section alignment? - ld

ld man here say
-n
--nmagic
Turn off page alignment of sections, and mark the output as "NMAGIC" if possible.
-N
--omagic
Set the text and data sections to be readable and writable. Also, do not page-align the data segment, and disable linking against shared
libraries. If the output format supports Unix style magic numbers,
mark the output as "OMAGIC". Note: Although a writable text section is
allowed for PE-COFF targets, it does not conform to the format
specification published by Microsoft.
--no-omagic
This option negates most of the effects of the -N option. It sets the text section to be read-only, and forces the data segment to be
page-aligned. Note - this option does not enable linking against
shared libraries. Use -Bdynamic for this.
I do understand that theses options are used to make the code (.text) section writable or not, but I don't get the point to align or not the sections, and what is a "NMAGIC" section

On historic (PDP-11) Unix, an executable file's header began with a branch instruction that would jump past the header, to the actual start of the code. When Unix was ported to other processors, that initial PDP-11 branch instruction became fossilized as the "magic number" for the a.out(5) file format. When "pure text" was introduced, initially to allow processes to share their code segments, a new magic number was introduced so that the kernel could tell the difference (there were some important Unix programs that relied on self-modifying code and thus needed to be loaded with writable code segments). The old magic number (0407) was given the name "OMAGIC" -- "old magic" -- and the new magic number (0410) was given the name "NMAGIC", "new magic". The data segment immediately follows the code segment in memory, so when the code segment is made read-only, it must be padded to a page boundary.
Various operating systems and file formats since then introduced other magic numbers; in the last FreeBSD releases to use a.out format, the normal formats were ZMAGIC and QMAGIC, which were introduced to allow page zero in the address space to be unmapped for safety (so that a null-pointer dereference would fault) while still allowing executables to be demand paged (i.e., mmap()ed into the process's address space).
So to answer your question more directly: NMAGIC and OMAGIC are different formats of executable files, not of individual sections. They indicate the desired correspondence between the in-memory and on-disk layouts of the executable. (The reason these numbers are traditionally written in octal rather than hex or decimal is that octal is a natural representation for the instruction format on the PDP-11.) GNU ld uses these names (only) as references to executable formats that have analogous features, even when you are not generating traditional a.out format -- which of course is quite rare today. One particular benefit to using OMAGIC format is that it is more compact than any other format, which may matter in cases like boot loaders where space is limited, there is no demand paging, and there is also no room for any sort of padding.

Related

Was there ever a first parameter for the CLEAR statement?

In both GW-BASIC and QuickBASIC, statements are passed arguments, some of which are optional and can be omitted depending on the statement:
REM Move the text cursor to the specified column and row.
LOCATE row%, column%
REM Move the text cursor to the specified column without changing the row.
LOCATE , column%
In GW-BASIC, the CLEAR statement is rather unusual in that its first "argument" is always omitted:
CLEAR , basicMem
CLEAR , basicMem, basicStack
CLEAR , , basicStack
In QuickBASIC, the basicMem parameter became optional due to the interpreter/runtime managing its own memory:
CLEAR , , basicStack
What I'm wondering is whether that first "argument" ever used for anything prior to GW-BASIC, i.e. something like this was actually useful:
CLEAR missingArg, basicMem, basicStack
REM ^^^^^^^^^^
REM here
That is, was there ever an purposeful non-empty argument before the first comma?
If anybody has any idea, I'd love to know!
What I'm wondering is whether that first "argument" ever used for
anything prior to GW-BASIC, i.e. something like this was actually
useful:
CLEAR missingArg, basicMem, basicStack
REM ^^^^^^^^^^
REM here
That is, was there ever an purposeful non-empty argument before the
first comma?
Yes, there was a first argument, but there was never a 3-argument form that actually made use of it.
Microsoft (originally Micro-Soft) created Altair BASIC. It featured a CLEAR command with no arguments that set all program variables to zero. The 4K version had no strings, so it had no need for managing string space. However, the 8K, Extended, and Disk versions had a CLEAR command that also accepted a single argument of the form CLEAR x. The value x specified the maximum amount of string space available in bytes, with the default at load time of BASIC being 50 bytes in the 8K version and 200 bytes in the Extended and Disk versions until it was changed [source]. That's where the missing first argument came from and what it was used for originally. At the time, however, only that one argument was valid.
Microsoft went on to develop a derivative called "BASIC-80" for several systems, notably the Intel ISIS-II, CP/M, and TEKDOS operating systems. A "Standalone Disk BASIC" version of BASIC-80 was also created that could run on "almost any 8080 or Z80 based disk hardware without an operating system." There was no 4K version of BASIC-80, so it's reasonable to assume all versions of BASIC-80 had strings available as the 8K version of Altair BASIC did. As a result, that string space needed managed. However, it was in BASIC-80 that a second argument was added:
CLEAR [expression![,address]]
expression! was an expression that specified the amount of string space, like in 8K (Altair) BASIC, and address was the maximum address available to BASIC, i.e. the amount of memory available to BASIC, like the argument immediately after the first comma in GW-BASIC.
Eventually, BASIC-80, Release 5.0, was shipped into the world, and it featured the odd syntax instead:
CLEAR [,[expression1][,expression2]]
expression1 was the maximum memory available to BASIC, and expression2 was the amount of stack space. Appendix A: New Features in BASIC-80, Release 5.0 explains why the first argument was dropped:
String space is allocated dynamically, and the first argument in a two-argument CLEAR statement will be ignored.
In other words, CLEAR strSpace!,maxMem would ignore the strSpace! argument in BASIC-80, Release 5.0, so the syntax became CLEAR [,[maxMem][,maxStack]].
QuickBASIC eventually changed the syntax further to just CLEAR [,,stack].
Confusingly, the on-line help system of QuickBASIC 4.5 states the following:
Note: Two commas are used before stack to keep QuickBASIC compatible
with BASICA. BASICA included an additional argument that set the
size of the data segment. Because QuickBASIC automatically manages
the data segment, the first parameter is no longer required.
"The first parameter" mentioned is maxMem as BASICA (and GW-BASIC) used the syntax available with BASIC-80, Release 5.0, rather than the equally missing strSpace! parameter used by pre-5.0 releases of BASIC-80.

How does computer display a character on the screen with the correct encoding?

I'm interested in the encoding of the character in the computer.
When I open my xxx.c with visual studio code, how does the VS code detect the encoding of my file and interprets these "01" sequence. Further on, how the visual studio code (or even the computer system) display the character on the screen acorrding to my "01" sequence file and the character encoding?
Thank you!
I also uses Chinese during my projects. Sometimes, the file encoding really drive my crazy. Sometimes,my correct utf-8 file created by edit A for example, was destroyed by some text editor B that interpret it as GBK file, and edit A can never get it back correct.
I searched a lot, but the most answers seems to be too abstract or irrelevant. I want to figure out how the software and the computer system( or operating system) cooperate together to make this simple but important job done!
First things first, "can never get it back": Always Use Source Code Control
"How the software and the computer system (or operating system) cooperate together to make this simple but important job done!": They don't that's the problem!
Short history: Many decades ago people used small character sets. The idea was a system would always use the same one. Simple. Every time a text file was transferred between systems, it would be immediately transcribed to the local character encoding. Then came the globalization of file exchanges and systems needed to hold text files in different encodings. There was no general way of recording what the encoding was. In 1991 came the huge character set Unicode. Languages (VB4, Java), operating system APIs (Win32), file systems (NTFS), … began adopting it. However, its encodings (UTF-8, UTF-16) are just yet more possibilities for which encoding a text file uses. Many programs that read text files either rely on the old system of a system default encoding or guess ("detect").
In the programming world, some languages require source files to use a specific encoding (say UTF-8); In others, tools default to specific encoding (say UTF-8). In most cases, the toolset provided with a C or C++ implementation will have a consistent set of rules. If you also use an IDE or other form of project system, you can set the encoding for the entire project and in some cases specific files.
So, the only solution is to only use tools that work for you and to properly configure them. If it hurts, stop doing it.
Aside: On the topic of programming and default character encodings, be careful not to get tricked with various language libraries' use of the system default character encoding—unless that is exactly what's needed. Otherwise, you are giving your users the same problem that you are encountering. (In Java, just avoid it with explicit arguments. In C and C++ libraries, encoding is combined into Locales. But note that many systems initialize a program to use default character encoding.

why is there severals encodings for one instruction in ARMv7

I am currently trying to implement a disassembler for the ARM cortex A9, which implement the ARMv7 instruction set.
For that I am using the manual "DDI0406C_b_arm_architecture_reference_manual.pdf" that can be download here (after having registered on arm website) :
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.architecture/index.html
In this manual, in the part A8.8 with the instructions details, I can't understand why there is several encoding for one instruction (like A1, A2, ...), that all seem to be implemented with ARMv7.
Also, as the ARM cortex A9 used thumb-2, does it also implement the A1/A2/... encodings, or only the T1/T2... ?
I really read all parts of this manual are related to encodings, but I still don't understand how we can know which encoding is used for a program.
Different encoding of an instruction do functionally different things.
One example for usage of different encodings is A8.9.12 ADR
This instruction adds an immediate value to the PC value to form a PC-relative address, and writes the result to the
destination register.
If instruction is encoded as A1 then offset must be interpreted as zero or positive, if it is encoded as A2 then offset is negative.
Another example is A8.8.132 POP
If the list contains more than one register, the instruction is assembled to encoding A1. If the list contains exactly one register, the instruction is assembled to encoding A2.
I can imagine different POP encodings are created probably to create different microcodes for performance reasons.
For the second part of your question, Cortex-A9 is an ARMv7-A architecture CPU and it supports all the instructions as specified in the manual you pointed. May be you should also read Cortex™-A9 Technical Reference Manual.
There is no way to really distinguish between ARM and Thumb inside the instruction-stream. You can only decide based on the way a function gets called (if the lowest bit is set to 1 then it's thumb, otherwise arm).
The ARM-Encoding are quite "stable", you'll only find a few A1 encodings, BLX is an example where a A2 encoding is given, but this is mainly because the new ARM-ARM is structured differently from the older ones. BL and BLX were two different instructions, BLX was added in additional instruction space (the upper 4 bits which are normaly used for conditions are set to 1111, which in ARM prior to v5 meant "never execute".
For the Thumb-Encodings it's different, there are a lot of them, because they had to be placed in a more compressed instruction-space, page A6-220 has information about how to decide whatever an thumb instruction consists of two or just one halfword.
The Ax encodings are arm when the processor is in arm mode it will decode bits it finds using those encodings. if there is more than one A1, A2, it should be obvious that there is a different feature or reason for that. those two instructions can be considered separate (look at the overuse of the mov instruction in x86 for example, it has many encodings). Treat each encoding as a separate "instruction".
Then there are the Tx variants, those are thumb and thumb2 extensions. The thumb are all 16 bit (the bl can be decoded as two separate 16 bit instructions) and the descriptions below them indicate "all thumb variants" or "armv4t to the present" or some such language. The thumb2 extensions are all 32 bit, the first 16 bits being an undefined instruction in the thumb world. These have more limitations on what architectures support them.
You are not going to be able to completely create a disassembler for one of these processors, for the same reason you cant do one for x86 or many other processors (all?). If you assume that all the instructions are one mode (arm or thumb or thumb+thumb2) but no mode mixing (arm+thumb) then you can because everything is fixed instruction length and you can simply disassemble everything data and code and you wont run into any problems. In order to disassemble mixed mode you have to basically emulate/execute the instructions and follow instruction flow (just like a variable word length instruction set disassembler) to try to find the transitions, problem here of course the transitions are multi instruction at a minimum load a register then bx that register, sometimes there is math involved on the instruction computation, and there is no guarantee that the address computation or load happens the instruction before the bx. So you could do some of that and get a long way through disassembling the program.
If thumb2 is supported/allowed on the processor you are using then you have the variable instruction length problem for times that you detect entry points to thumb code. And unless you already are doing this you have to follow execution of the code to determine where instructions start (elementary variable instruction length disassembly stuff).
The combination of technical reference manual and architectural reference manual will tell you if the architecture and implementation of that architecture (trm) allow arm and thumb modes. I would assume an A9 supports arm thumb and thumb2, all three.
The cortex-m family is the only one so far that is limited to not supporting arm, and their thumb2 varies widely as the cortex-m0 (and m1) are armv6m and the m3 and m4 are armv7m (A few dozen (armv6m) instructions to many dozen thumb2 extensions in the armv7m). There are separate architectural reference manuals specifically for the -m variants, armv7-m vs the armv7-ar manuals for example.

IDA Pro string function

I have this binary file that I wish to edit, however after loading it, all strings are in some sort of gibberish symbols. Is there anyway to format it?
Why you are seeing "gibberish":
The strings are likely obfuscated. Chances are, before each of the strings is used in the program, a deobfuscation routine is run to convert the string in memory back into something meaningful. This is a common technique used to prevent static analysis tools (such as the GNU "strings" utility or IDA Pro) from properly analyzing the binary. The rest of this answer makes the assumption that this is true of your binary.
How to deobfuscate the strings (dynamic approach):
If you are able to run the binary, you can let it take care of the deobfuscation for you. All you need to do is run the binary in a debugger and analyze the memory after it has been deobfuscated.
Several binaries that obfuscate their strings never re-obfuscate them after their use, so one interesting shortcut you might want to try first is to run the binary in a debugger and break execution right before it exits. If the strings are still debofuscated, you can do a memory dump of the appropriate section to save the deobfuscated strings. (This will not necessarily deobfuscate all of the strings for you; you'll only get the strings that were deobfuscated along the path of the binary's execution)
If the previous method does not work for you, try setting a hardware write breakpoint on the first byte of an obfuscated string, then running the binary. If the breakpoint trips, step through the instructions to allow the rest of the string to be deobfuscated. If the deobfuscation always happens from a common routine, you can place a breakpoint near the end of that routine and possibly script your debugger to print the debofuscated string each time execution passes through that routine.
Once you have a list of deobfuscated strings, you can either patch them directly into the IDA database (discussed below), or you can leave repeatable comments (use the ' key) at the addresses of each of the strings in the database, such that the deobfuscated string will display as a comment on every instruction that references it.
For small binaries, you can get away with doing the annotations by hand, but it would be worthwhile to read into scripting IDA so that you can automate this process. The IDA Pro Book contains a great reference for this.
How to deobfuscate the strings (static approach):
If you can't run the binary, or if the dynamic approach isn't deobfuscating all the strings for you, then you can deobfuscate them yourself.
Chances are good that if you view the cross-references to any of the obfuscated strings in IDA Pro (view them with the x key), you should be taken to the deobfuscation routine. If the routine isn't too complicated -- and they usually aren't -- you should be able to write a script to emulate the debofuscation routine. This will allow you to replace the obfuscated strings with the deobfuscated strings in the IDA database.
(As a point of clarification, the IDA database is entirely separate from the binary itself. Anything you do to the database will have no effect on the actual binary, and anything you do to the binary will have no effect on the database)
Your options for scripting IDA are IDC (IDA's original built-in scripting language) and IDAPython. I highly recommend using IDAPython, as it is much easier to use, and a much more powerful language. I'm not sure if you can install IDAPython on IDA Free 5.0, but it should be bundled with all vaguely recent versions of IDA Pro.
Giving an overview of scripting IDA would be beyond the scope of this answer, but here's an example to get you started. I'm writing it in IDC in case you're using IDA Free. Let's say your deobfuscation routine simply XOR'd each successive byte with 0x1F until the null byte was decoded. Then the following loop might end up being part of your IDC script:
// *EXAMPLE*
auto addr = 0x00401000; // The address of your string
while(1){
auto b = Byte(addr) ^ 0x1F;
PatchByte(addr, b);
if (b == '\0'){
break;
}
addr = addr + 1;
}
Running a script can be done from File > IDC Command... or File > Script file....
As you might guess, Byte returns the byte stored at a given address, and PatchByte writes a byte to an address. Built-in functions in IDAPython share the same names with their IDC counterparts, so the IDAPython version would be nearly identical, sans the C-like syntax. As mentioned before, I highly recommend The IDA Pro Book for a walkthrough on scripting IDA. Once you have the basics down, you can use IDA's built-in help index and The IDAPython documentation as a couple other references.
Always save your database before running a script that patches code! There is no "undo" feature in IDA, so a small coding error could trash your entire database.
Good luck!

Tool to compare/diff HTML in bulk

I have a lot of HTML files (10,000's and GBs worth) scraped from a server and I want to check to make sure the server produces the same results after some modifications but ignore kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of number, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program (HTML is special case) files, builds abstract syntax trees representing the essential structure of each files, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and deterimines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes) the CloneDr will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
Doesn't run under Linux, but I assume your problem is hard to enough to solve so that isn't what you are optimizing.
I use winmerge alot in windows and from what i can see some people enjoy meld in linux, so perhaps that could do the trick for you
http://meld.sourceforge.net/
Other examples i saw from a quick googling was Kompare,xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(could only post 1 link so wrote the adresses to xxdiff and kdiff3 as text)
Beyond Compare is purchased software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive, check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!