Benefits of compiling po files to mo - gettext

What's the benefit and primary reason for compiling GNU gettext .po (Portable Object) files to .mo (Machine Object) files?
I've seen many programs read/parse .po files directly.
I'm not using WordPress, but its documentation says:
https://codex.wordpress.org/I18n_for_WordPress_Developers
PO files are compiled to binary MO files, which give faster access to the strings at run-time
Is the faster access claim true? A PO file can be read once and cached in a hash table, and the same presumably goes for an MO file.

There are several reasons:
You should always compile PO files with msgfmt --check, which performs several important checks on the PO file, not just a syntax check. For example, if you are using printf format strings, it will check that the %-expansions in the translation match those in the original string. Failure to do so may result in crashes at runtime. There are many more checks, depending on the (programming) language.
Reading a binary MO file is usually faster and simpler than parsing a textual PO file.
PO files often have translation entries that should not be used for production, for example fuzzy or obsolete entries.
Many PO parsers are buggy or incomplete.
It is part of the gettext API. Translations are expected to be located under /usr/share/locale/LOCALE/LC_MESSAGES/TEXTDOMAIN.mo and to be in MO format, not PO format. That does not, of course, apply to the countless libraries that implement only a subset of the gettext API.
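As a concrete illustration of that last point, here is a minimal sketch using Python's standard-library gettext module, which is one implementation of the gettext API (the "myapp" domain, the "locale" directory, and the "de" language are placeholders):
import gettext

# Looks for locale/de/LC_MESSAGES/myapp.mo -- a compiled MO catalog.
# The .po source file is never consulted at run time.
t = gettext.translation("myapp", localedir="locale",
                        languages=["de"], fallback=True)
_ = t.gettext
print(_("Hello, world!"))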

Related

Making a shared library that "re-exports" symbols from other shared libraries

Suppose that there is a simple binary that depends on three libraries, libA.so, libB.so, and libC.so. In the usual case, these three dependencies would show up as NEEDED entries in readelf output. However, I am curious whether it is possible to make a shared library libABC.so that does absolutely nothing but act as an interface to the three actual libraries by "redirecting" the symbols. This way, perhaps one could have multiple versions of libABC.so that in turn point to different versions of the three dependencies, and the binary could "depend" on just libABC.so. Is this possible with ELF?
Another possible use case is the inverse, when the binary already depends on an existing library libABC.so that just so happens to have become split up into three individual libraries.
Beware that I do not necessarily have a practical use or actual use case for this. Whether or not the above example cases are practical, I am merely curious about the possibility.
Re-export Shared Library Symbols from Other Library (OS X / POSIX) has a promising title, but the answers seem either Darwin-specific, or do not quite answer this question.
That kind of works with ELF because of the flat namespace of symbols: if you depend on one library, you usually get access to the symbols of its dependencies at the same time (the exception being when dlopen() is used).
But most link editors (ld) no longer do that by default, because it would otherwise let unneeded libraries be added to the dependencies. In GNU ld the feature is controlled by the --as-needed flag and, if I remember correctly, it was turned on by default around ten years ago.
You should be able to force the behaviour you're looking for with GNU ld by linking (e.g. via the GCC frontend) with gcc yourprogram.c -Wl,--no-as-needed -lABC -Wl,--as-needed. That will force linking to libABC.so whether or not the program uses any of its exported symbols.
I have written extensively about this feature on my blog, because it solved many problems for distributions at the time, if you want to look into its practicalities.

linker: what is "NMAGIC" section of linked file and what is the aim of section alignment?

The ld man page here says:
-n
--nmagic
Turn off page alignment of sections, and mark the output as "NMAGIC" if possible.
-N
--omagic
Set the text and data sections to be readable and writable. Also, do not page-align the data segment, and disable linking against shared libraries. If the output format supports Unix style magic numbers, mark the output as "OMAGIC". Note: Although a writable text section is allowed for PE-COFF targets, it does not conform to the format specification published by Microsoft.
--no-omagic
This option negates most of the effects of the -N option. It sets the text section to be read-only, and forces the data segment to be page-aligned. Note - this option does not enable linking against shared libraries. Use -Bdynamic for this.
I understand that these options are used to make the code (.text) section writable or not, but I don't see the point of aligning the sections or not, and I don't know what an "NMAGIC" section is.
On historic (PDP-11) Unix, an executable file's header began with a branch instruction that would jump past the header, to the actual start of the code. When Unix was ported to other processors, that initial PDP-11 branch instruction became fossilized as the "magic number" for the a.out(5) file format. When "pure text" was introduced, initially to allow processes to share their code segments, a new magic number was introduced so that the kernel could tell the difference (there were some important Unix programs that relied on self-modifying code and thus needed to be loaded with writable code segments). The old magic number (0407) was given the name "OMAGIC" -- "old magic" -- and the new magic number (0410) was given the name "NMAGIC", "new magic". The data segment immediately follows the code segment in memory, so when the code segment is made read-only, it must be padded to a page boundary.
Various operating systems and file formats since then introduced other magic numbers; in the last FreeBSD releases to use a.out format, the normal formats were ZMAGIC and QMAGIC, which were introduced to allow page zero in the address space to be unmapped for safety (so that a null-pointer dereference would fault) while still allowing executables to be demand paged (i.e., mmap()ed into the process's address space).
So to answer your question more directly: NMAGIC and OMAGIC are different formats of executable files, not of individual sections. They indicate the desired correspondence between the in-memory and on-disk layouts of the executable. (The reason these numbers are traditionally written in octal rather than hex or decimal is that octal is a natural representation for the instruction format on the PDP-11.) GNU ld uses these names (only) as references to executable formats that have analogous features, even when you are not generating traditional a.out format -- which of course is quite rare today. One particular benefit to using OMAGIC format is that it is more compact than any other format, which may matter in cases like boot loaders where space is limited, there is no demand paging, and there is also no room for any sort of padding.

IDA Pro string function

I have this binary file that I wish to edit; however, after loading it, all the strings are in some sort of gibberish symbols. Is there any way to format it?
Why you are seeing "gibberish":
The strings are likely obfuscated. Chances are, before each of the strings is used in the program, a deobfuscation routine is run to convert the string in memory back into something meaningful. This is a common technique used to prevent static analysis tools (such as the GNU "strings" utility or IDA Pro) from properly analyzing the binary. The rest of this answer makes the assumption that this is true of your binary.
How to deobfuscate the strings (dynamic approach):
If you are able to run the binary, you can let it take care of the deobfuscation for you. All you need to do is run the binary in a debugger and analyze the memory after it has been deobfuscated.
Several binaries that obfuscate their strings never re-obfuscate them after use, so one interesting shortcut you might want to try first is to run the binary in a debugger and break execution right before it exits. If the strings are still deobfuscated, you can do a memory dump of the appropriate section to save them. (This will not necessarily deobfuscate all of the strings for you; you'll only get the strings that were deobfuscated along the path of the binary's execution.)
If the previous method does not work for you, try setting a hardware write breakpoint on the first byte of an obfuscated string, then running the binary. If the breakpoint trips, step through the instructions to allow the rest of the string to be deobfuscated. If the deobfuscation always happens in a common routine, you can place a breakpoint near the end of that routine and possibly script your debugger to print the deobfuscated string each time execution passes through that routine.
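If you end up scripting that part, setting the hardware write breakpoint from IDAPython might look roughly like the following sketch (the address is a placeholder; the names are from the idc module, and older IDAPython spells the call AddBptEx):
import idc

# Hardware write breakpoint on the first byte of an obfuscated string.
STRING_ADDR = 0x00401000                    # placeholder address
idc.add_bpt(STRING_ADDR, 1, idc.BPT_WRITE)  # size 1, trigger on write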
Once you have a list of deobfuscated strings, you can either patch them directly into the IDA database (discussed below), or you can leave repeatable comments (use the ' key) at the addresses of each of the strings in the database, such that the deobfuscated string will display as a comment on every instruction that references it.
For small binaries, you can get away with doing the annotations by hand, but it would be worthwhile to read into scripting IDA so that you can automate this process. The IDA Pro Book contains a great reference for this.
How to deobfuscate the strings (static approach):
If you can't run the binary, or if the dynamic approach isn't deobfuscating all the strings for you, then you can deobfuscate them yourself.
Chances are good that if you view the cross-references to any of the obfuscated strings in IDA Pro (view them with the x key), you will be taken to the deobfuscation routine. If the routine isn't too complicated -- and they usually aren't -- you should be able to write a script to emulate the deobfuscation routine. This will allow you to replace the obfuscated strings with the deobfuscated strings in the IDA database.
(As a point of clarification, the IDA database is entirely separate from the binary itself. Anything you do to the database will have no effect on the actual binary, and anything you do to the binary will have no effect on the database)
Your options for scripting IDA are IDC (IDA's original built-in scripting language) and IDAPython. I highly recommend using IDAPython, as it is much easier to use, and a much more powerful language. I'm not sure if you can install IDAPython on IDA Free 5.0, but it should be bundled with all vaguely recent versions of IDA Pro.
Giving an overview of scripting IDA would be beyond the scope of this answer, but here's an example to get you started. I'm writing it in IDC in case you're using IDA Free. Let's say your deobfuscation routine simply XOR'd each successive byte with 0x1F until the null byte was decoded. Then the following loop might end up being part of your IDC script:
// *EXAMPLE*
auto addr = 0x00401000; // The address of your string
while (1) {
    auto b = Byte(addr) ^ 0x1F;
    PatchByte(addr, b);
    if (b == '\0') {
        break;
    }
    addr = addr + 1;
}
Running a script can be done from File > IDC Command... or File > Script file....
As you might guess, Byte returns the byte stored at a given address, and PatchByte writes a byte to an address. Built-in functions in IDAPython share the same names as their IDC counterparts, so the IDAPython version would be nearly identical, sans the C-like syntax. As mentioned before, I highly recommend The IDA Pro Book for a walkthrough on scripting IDA. Once you have the basics down, you can use IDA's built-in help index and the IDAPython documentation as a couple of other references.
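For reference, an IDAPython version of the example above might look like the sketch below (same placeholder address and XOR key; IDA 7.0 renamed these functions, so the pre-7.0 spellings are noted in the comments):
# *EXAMPLE* -- IDAPython equivalent of the IDC loop above
import idc

addr = 0x00401000                         # the address of your string
while True:
    b = idc.get_wide_byte(addr) ^ 0x1F    # idc.Byte(addr) before IDA 7.0
    idc.patch_byte(addr, b)               # idc.PatchByte(addr, b) before IDA 7.0
    if b == 0:                            # stop once the NUL terminator is decoded
        break
    addr += 1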
Always save your database before running a script that patches code! There is no "undo" feature in IDA, so a small coding error could trash your entire database.
Good luck!

Config file handling in Perl

There are plenty of modules in the Config:: namespace on CPAN, but they are all limited in one way or another.
I'm currently using Config::Std, which is fine most of the time, however it makes certain things difficult:
more than two levels of nested directives
handling of multiple values per key
conf.d directories, i.e. multiple config files which are merged into one big config hash
Config::Std generates a blessed hashref after parsing the config, so all my applications are coded to use a hashref for configuration. I'd prefer not having to change this.
What I am looking for is a universal, lightweight Config Module that produces a hashref.
My Question is: Which Config Modules should I consider for replacing Config::Std?
Config::Any (for loading several files and flattening to a hash) and its Config::General backend (for arbitrarily nested configuration items and multiple values per key à la Apache httpd)
You didn't state where your data is coming from. Are you reading in a configuration file and running into the limits of the configuration file format itself?
Config::Std is a great module. However, it was meant to read and write Windows Config/INI files, and Windows Config/INI files are very flat and simple formats. Thus, I wouldn't expect Config::Std to do much more.
If you're using Windows Config/INI files right now but may need to read more complex data structures in the future, Config::Any is a good way to go. It'll handle Windows Config/INI files and, using the same programming interface, read and write XML, YAML, and JSON file structures too.
If you're merely trying to keep a complex data structure in your program and don't care about reading and writing configuration files, I would recommend looking at XML::Simple for the very simple reason that it is ...well... simple and can handle all sorts of data structures. Plus, XML::Simple is a very commonly used module, so there's lots of help on the Internet if you have any questions about the module, and it is actively supported.
You could use Config::Any, but I find it more complex to use and harder to configure. In fact, you have to install XML::Simple (or a similar module) in order to use it. The advantage of Config::Any is that it is a single interface for all sorts of configuration file formats. That way, you don't have to hack through your program if you decide to switch from Windows Config/INI to XML or YAML.
So, if you're working with Windows Config/INI files now, and need a more complex data structure: Look at Config::Any.
If you're merely wanting a simple way to track complex data structures, look at XML::Simple.
YAML will handle that and more.
And here's the website for the protocol.

parsing different files of the same grammar and calculating file to file similarities

I've got a bunch of ACPI Source Language files and I want to calculate file-to-file similarities between them. I thought of using something like Perl's Parse::RecDescent
but I am stuck at:
1) Translating the ACPI Grammar (www.acpi.info/DOWNLOADS/ACPIspec40a.pdf) to something Parse::RecDescent would understand
2) Having a metric to compare two parsed files
Any ideas?
To get started with Parse::RecDescent you may look at Pro Perl Parsing, Ch. 5, or at Advanced Perl Programming, Ch. 2.
XML diff tools should be appropriate for comparing hierarchically structured data; perhaps you can apply such a tool to ASTs saved in XML format.
So you have two problems:
Parsing ACPI to build an AST. This has the usual troubles of ensuring that you have a well-defined grammar, that your parsing machinery can parse according to that grammar (often you have to bend a good grammar definition to enable the parsing machinery to process it), and building a corresponding AST. You will have these troubles with Perl parsing machinery, too, simply because it is a parsing engine.
Comparing the structure of the ASTs and producing a sensible answer. What you are likely to find here is that there is some literature describing roughly how to do this (e.g., using Levenshtein distance), but that the details for ASTs matter (see, for example, "Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction"). Finally, having determined the distance, you need to print out the deltas in some readable form.
However, AFAIK, my company is the only one that has reduced this to practice. See our Smart Differencer tool. The SmartDifferencers parse, build ASTs, and report changes in terms of AST elements moved, inserted, deleted, replaced, or modified by consistent identifier substitution. They depend on an underlying very strong GLR parsing engine, which minimizes the problems of accepting new grammars. They work for many common languages but not presently for ACPI.
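To make the "metric" part of the question concrete, here is a rough sketch of a token-level similarity ratio, written in Python purely for brevity (the regex tokenizer and the input file names are naive placeholders; a real comparison would walk the ASTs your parser produces rather than raw tokens):
import re
from difflib import SequenceMatcher

def tokens(path):
    # Crude stand-in for a real lexer: identifiers, or single non-space characters.
    with open(path) as f:
        return re.findall(r"[A-Za-z_]\w*|\S", f.read())

def similarity(path_a, path_b):
    # ratio() is 1.0 for identical token sequences, 0.0 for completely different ones.
    return SequenceMatcher(None, tokens(path_a), tokens(path_b)).ratio()

print(similarity("dsdt_a.dsl", "dsdt_b.dsl"))  # hypothetical input files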