In the PNG spec, uncompressed blocks include two pieces of header information:
LEN is the number of data bytes in the block; NLEN is the one's complement of LEN.
Why would the file include the one's complement of a value? How would this be used and/or for what purpose?
Rather than inventing a new compression type for PNG, its authors decided to use an existing industry standard: zlib.
The link you provide does not point to the official PNG specification at http://www.w3.org/TR/PNG/ but only to one part of it: the DEFLATE compression scheme. NLEN is not mentioned in the official spec; it only says that the default compression is done according to zlib (https://www.rfc-editor.org/rfc/rfc1950), and therefore DEFLATE (https://www.rfc-editor.org/rfc/rfc1951).
As to "why": zlib predates today's high-speed internet connections, and at the time it was designed, private internet communication was still done over audio-line modems. Only a few institutions could afford dedicated landlines just for data; the rest of the world was connected via dial-up. Because of this, data transmission was highly susceptible to corruption. A corrupted text document might still be usable, but in compressed data literally every single bit counts.
Apart from outright data corruption, a dumb (or badly configured) transmission program might try to interpret certain bytes, for instance changing Carriage Return (0x0D) into Newline (0x0A), which was a common option at the time. "One's complement" means inverting every single bit, 0 to 1 and vice versa. If either LEN or NLEN happened to be garbled or changed by the transmission software, its one's complement would no longer match the other value.
Effectively, storing both LEN and NLEN adds a cheap consistency check against transmission errors: if the two do not match, there is an error. It is another layer of error checking on top of zlib's ADLER32 and PNG's own per-chunk CRC-32.
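As a sketch of how a decoder might use this (the field layout follows RFC 1951's description of stored blocks; the function name and framing here are purely illustrative):

    import struct

    def stored_block_len(buf: bytes, offset: int = 0) -> int:
        """Validate the LEN/NLEN header of a DEFLATE "stored" (uncompressed) block.

        Assumes `offset` points at the little-endian LEN field, i.e. just after
        the block header bits have been consumed and the stream re-aligned to a
        byte boundary. Returns LEN, or raises ValueError if NLEN does not match.
        """
        length, nlen = struct.unpack_from("<HH", buf, offset)
        if nlen != (~length & 0xFFFF):   # NLEN must be the one's complement of LEN
            raise ValueError("corrupt stored block: NLEN does not match LEN")
        return length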
I have seen many resources about the uses of Base64 on today's internet. As I understand it, all of those resources spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something they receive? Are there any examples of such systems on today's internet which may corrupt or wrongly interpret the data they receive?
Why do we worry only about binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by the same logic, any text characters that do not belong to the Base64 alphabet can be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea: when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time", when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine" as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these have a similar effect when transferring binary (i.e. non-text) data over them: they try to interpret the binary data as textual data in a character encoding, which obviously doesn't make sense (since there is no character encoding in binary data), and as a consequence they modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of metadata (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might mean using quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback, namely that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
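To see both properties at once, that the output is plain ASCII and that it is about a third larger than the input, here is a small illustration (Python used purely as an example):

    import base64

    payload = bytes(range(256))           # every possible byte value, including control and high-bit bytes
    encoded = base64.b64encode(payload)   # uses only A-Z, a-z, 0-9, '+', '/' and '=' padding

    assert encoded.isascii()              # safe for 7-bit, text-only channels
    print(len(payload), len(encoded))     # 256 -> 344 bytes: the ~33% (4:3) size increase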
One of the uses of Base64 is sending email.
Mail servers used a terminal-style channel to transmit data. It was also common to have line-ending translation, e.g. \r\n into a single \n and vice versa. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with ! paths instead of @ addresses). Also, systems might not be fully ASCII.
Also, a line consisting of a single dot (\r\n.\r\n) is considered the end of the message body, and mbox files treat a line starting with "From " as the start of a new mail (hence the ">From" escaping), so even once the 8-bit flag became common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: any \r, \n, etc. inside the encoded data are simply ignored by the decoder.
Note: you can also use Base64 to encode strings in URLs (usually the URL-safe variant, which replaces + and /), without worrying about how web browsers will interpret them. You may also see Base64 in configuration files (e.g. to embed icons): specially crafted images cannot then be misinterpreted as configuration. In short, Base64 is handy for encoding binary data into protocols which were not designed for binary data.
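A small sketch of the URL-safe variant mentioned above (again Python, just for illustration):

    import base64

    blob = b"\xfb\xff\xfe binary \x00 data"
    url_safe = base64.urlsafe_b64encode(blob)           # '-' and '_' instead of '+' and '/'
    print(url_safe.decode("ascii"))

    assert base64.urlsafe_b64decode(url_safe) == blob   # round-trips exactly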
I'm trying to solidify my understanding of encoding and decoding. I'm not sure how the sequence of events works in different settings:
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters in UTF-8 (or whatever encoding is used)?
When I save a file, is it automatically saving it using the encoding standard that was used to decode my text? Let's say I send that document or dataset to someone: am I sending a bunch of 1s and 0s to them, and then their decoder decodes it based on whatever default or encoding standard they specify?
How do code points play into this? Does my computer also have a default code point dictionary it uses?
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Sorry if this isn't clear, or if I'm misunderstanding/using terminology incorrectly.
There are a few ways that this can work, but here is one possibility.
First, yes, in a way, the computer "decodes" each letter you type into some encoding. Each time you press a key on your keyboard, you close a circuit, which signals to other hardware in your computer (e.g., a keyboard controller) that a key was pressed. This hardware then populates a buffer with information about the keyboard event (key up, key down, key repeat) and sends an interrupt to the CPU.
When the CPU receives the interrupt, it jumps to a hardware-defined location in memory and begins executing the code it finds there. This code often will examine which device sent the interrupt and then jump to some other location that has code to handle an interrupt sent by the particular device. This code will then read a "scan code" from the buffer on the device to determine which key event occurred.
The operating system then processes the scan code and delivers it to the application that is waiting for keyboard input. One way it can do this is by populating a buffer with the UTF-8-encoded character that corresponds to the key (or keys) that was pressed. The application would then read the buffer when it receives control back from the operating system.
To answer your second question, we first have to remember what happens as you enter data into your file. As you type, your application receives the letters (perhaps UTF-8-encoded, as mentioned above) corresponding to the keys that you press. Now, your application will need to keep track of which letters it has received so that it can later save the data you've entered to a file. One way that it can do this is by allocating a buffer when the program is started and then copying each character into the buffer as it is received. If the characters are delivered from the OS UTF-8-encoded, then your application could simply copy those bytes to the other buffer. As you continue typing, your buffer will continue to be populated by the characters that are delivered by the OS.
When it's time to save your file, your application can ask the OS to write the contents of the buffer to a file or to send them over the network. Device drivers for your disk or network interface know how to send this data to the appropriate hardware device. For example, to write to a disk, you may have to write your data to a buffer, write to a register on the disk controller to signal it to write the data in the buffer to the disk, and then repeatedly read from another register on the disk controller to check whether the write is complete.
Third, Unicode defines a code point for each character. Each code point can be encoded in more than one way. For example, the code point U+004D ("Latin capital letter M") can be encoded in UTF-8 as 0x4D, in UTF-16 as 0x004D, or in UTF-32 as 0x0000004D (see Table 3-4 in The Unicode Standard). If you have data in memory, then it is encoded using some encoding, and there are libraries available that can convert from one encoding to another.
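You can see those three byte sequences for the same code point directly (Python here, just as an illustration):

    ch = "\u004D"                          # U+004D, LATIN CAPITAL LETTER M
    print(ch.encode("utf-8").hex())        # 4d
    print(ch.encode("utf-16-be").hex())    # 004d
    print(ch.encode("utf-32-be").hex())    # 0000004d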
Finally, you can find out how your computer processes keyboard input by examining the device drivers. You could start by looking at some Linux drivers, as many are open source. Each program, however, can encode and decode data however it chooses to. You would have to examine the code for each individual program to understand how its encoding and decoding works.
It is a complex question, also because it depends on many things.
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters in UTF-8 (or whatever encoding is used)?
This is very complex. Some programs read the raw keyboard codes (e.g. games), but most programs use operating system services to interpret them (taking into account the various keyboard layouts, and modifying the result according to Shift, Control, etc.).
So which encoding you get depends on the operating system and on the program. For terminal programs, the locale of the process also includes the encoding of stdin/stdout (standard input and standard output). For graphical interfaces, you may get a different encoding (according to the system encoding).
But UTF-8 is an encoding, so "decoding my letters in UTF-8" is the wrong way around: the program encodes your keystrokes into UTF-8.
When I save a file, is it automatically saving it using the encoding standard that was used to decode my text? Let's say I send that document or dataset to someone: am I sending a bunch of 1s and 0s to them, and then their decoder decodes it based on whatever default or encoding standard they specify?
This is the complex part. Many systems and computer languages are old, so they were designed with just one system encoding (the C language, for example). So there is not really any decoding: programs use the encoding directly and hard-code the fact that the letter A has a specific numeric value. For the computer, only the numeric value matters. Only when data is printed are things interpreted, and in a complex way (fonts, character size, ligatures, line breaking, ...). [Also, when you use string functions, you explicitly tell the program to treat the numbers as a string of characters.]
Some languages (and HTML: you view a page generated by an external machine, so the system encoding is no longer necessarily the same) introduced a decoding step: internally, a program has one single way to represent a string (e.g. as Unicode code points). But to get such a uniform format, we need to decode incoming strings (and in exchange we can now handle different encodings, instead of being restricted to the encoding of the system).
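As a small illustration of that boundary between bytes and code points (Python's string type is used here only as an example of such an "internal" Unicode representation):

    raw = b"caf\xc3\xa9"                 # bytes as they might arrive from a file or socket (UTF-8 here)
    text = raw.decode("utf-8")           # decode: bytes -> code points
    print([hex(ord(c)) for c in text])   # ['0x63', '0x61', '0x66', '0xe9'] -- encoding-independent code points
    print(text.encode("latin-1"))        # encode: the same code points as bytes in a different encoding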
If you save a file, it will contain a sequence of bytes. To interpret it (i.e. to decode it) you need to know which encoding the file uses. In general you should simply know it, or be given the information out of band ("the following file is UTF-8", e.g. in HTTP headers, in the file extension, in a field definition of a database, or ...). Some systems (Microsoft Windows) use a BOM (byte order mark) to distinguish between UTF-16LE, UTF-16BE, UTF-8 and the old system encoding (some people call it ANSI, but it is not ANSI, and it could be one of many different code pages).
The decoder usually should know the encoding; otherwise it either uses a default or guesses. HTML defines a list of steps to perform to arrive at an estimate. The BOM method above can help, and some tools look for common combinations of characters (in various languages). But this is still guesswork: without a BOM or out-of-band data we can only estimate, and we often get it wrong.
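A minimal sketch of the BOM-sniffing idea (the function name and the fallback are made up for this example; real detectors do much more):

    # Longer BOMs must be tested before shorter ones (the UTF-32LE BOM starts with the UTF-16LE BOM bytes).
    BOMS = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xfe\xff",         "utf-16-be"),
        (b"\xff\xfe",         "utf-16-le"),
        (b"\xef\xbb\xbf",     "utf-8-sig"),
    ]

    def guess_encoding(raw: bytes, default: str = "utf-8") -> str:
        """Guess an encoding from a leading BOM, falling back to `default`.

        Without a BOM or out-of-band metadata this really is just a guess.
        """
        for bom, name in BOMS:
            if raw.startswith(bom):
                return name
        return default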
How do code points play into this? Does my computer also have a default code point dictionary it uses?
Code points are the basis of Unicode. Every "character" has a code point: a fixed number, with a description. This is abstract. In UTF-32 you use that same number for the encoding (as 32-bit integers); in every other encoding there is a function (or a map) from the code point to the encoded value (and back). A code point is just a numeric value which describes the semantics (the meaning) of a character. To transmit such information, we usually need an encoding (or an escape sequence: e.g. U+FEFF denotes, as text, the BOM character).
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Nobody can fully answer that: your computer will use a lot of encodings.
macOS, Unix, POSIX systems: modern systems (for non-root accounts) will probably use UTF-8. The root account will probably use just ASCII (7-bit).
Windows: internally it often uses UTF-16. The output depends on the program, but it nearly always uses an 8-bit encoding (so not UTF-16). Windows can read and write several encodings. You can ask the system for the default encoding (but programs can still write UTF-8 or another encoding if they want). Terminals and settings can give you different default encodings in different programs.
For this reason, if you program on Windows, you should explicitly save files as UTF-8 (my recommendation), possibly with a BOM (though if you need interoperability with non-Windows machines, skip the BOM; in that case you should already know that such files must be UTF-8).
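For example, in a language that lets you specify the encoding when opening a file, be explicit instead of relying on the platform default (Python shown as one possibility):

    text = "Grüße, 世界"

    # Explicit UTF-8 on write; use "utf-8-sig" instead if you want a BOM for Windows tools.
    with open("example.txt", "w", encoding="utf-8") as f:
        f.write(text)

    # Be equally explicit on read, rather than trusting the locale default.
    with open("example.txt", "r", encoding="utf-8") as f:
        assert f.read() == text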
This is actually related to code golf in general, but it is also applicable elsewhere. People commonly use Base64 encoding to store large amounts of binary data in source code.
Assuming all programming languages are happy to read Unicode source code, what is the maximum N for which we can reliably devise a baseN encoding?
Reliability here means being able to encode/decode any data, so every single combination of input bytes can be encoded, and then decoded. The encoded form is free from this rule.
The main goal is to minimize the character count, regardless of byte-count.
Would it be base2147483647 (32-bit) ?
Also, because I know it may vary from browser-to-browser, and we already have problems with copy-pasting code from codegolf answers to our editors, the copy-paste-ability is also a factor here. I know there is a Unicode range of characters that are not displayed.
NOTE:
I know that for binary data, base64 usually expands data, but here the character-count is the main factor.
It really depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less likely the result is to be universally accepted, i.e. the less reliable it is. Base64 isn't immune to this: RFC 3548, published in 2003, mentions that case sensitivity may be an issue, and that the characters + and / may be problematic in certain scenarios. It describes Base32 (no lowercase) and Base16 (hex digits) as potentially safer alternatives.
It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might have different values for N. I'll cover a few possibilities from large N to small N, adding a requirement each time.
1,114,112: Code points. This is the number of possible code points defined by the Unicode Standard.
1,112,064: Valid UTF. This excludes the surrogates which cannot stand on their own.
1,111,998: Valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum N you could justifiably expect for your copy-paste scenario, but as you noted, in practice many other Unicode strings will fail that exercise.
120,503: Printable characters only, depending on your definition. I've defined it to be all characters outside of the Other and Separator general categories. Also, starting from this bullet point, N is subject to change in future versions of Unicode.
103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If the process uses NFKC or NFKD, some information may be lost. For more reliability, the encoding should therefore define a normalization form, with NFKD being better for maximizing character count.
101,684: No combining characters. These are "characters" which shouldn't stand on their own, such as accents, and are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now excluded the Mark category.
85: ASCII85, a.k.a. "I want my ASCII back". Okay, this is no longer Unicode, but I felt like mentioning it because it's a lesser-known ASCII-only encoding. It's mainly used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than Base64's 4:3 ratio.
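For a concrete size comparison of that last item with Base64 (using Python's standard library purely as an illustration):

    import base64

    data = bytes(range(256))
    b64 = base64.b64encode(data)   # 4:3 expansion
    a85 = base64.a85encode(data)   # 5:4 expansion (Ascii85, as used by PostScript/PDF)

    print(len(data), len(b64), len(a85))   # 256 344 320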
I'm planning to write a web application using C/C++ servlets/handlers for the G-WAN web/app server. I would like my application to work with multiple languages, including multi-byte characters, and hence I am wondering how I should handle this in G-WAN servlets.
The xbuf_t structure seems to use char* as its underlying storage buffer for building the HTTP response; since char is a single byte, I would like to know how that affects text with Unicode or multi-byte characters. I'm a bit reluctant to add heavy Unicode libraries like the IBM Unicode library (ICU) and the like.
Could someone explain how others deal with this situation and, if required, what options are available for handling Unicode, preferably with as few and as small dependencies as possible?
The server response (called reply in the servlet examples) can contain binary data, so this is possible, of course. There are examples that dynamically send pictures (GIF, PNG), JSON, etc., so there's no limit to what you can send as a reply.
Without UNICODE, you are using xbuf_xcat() which acts like sprintf() with a dynamically growing buffer (the server reply).
What you should do is simply build your UNICODE reply (with your favorite UNICODE library; ANSI C and almost all languages have one) and then copy it into the reply buffer with xbuf_ncat().
Of course, you can also call xbuf_ncat() on the fly for each piece of data you build, rather than once for the whole buffer at the end of your servlet. Your choice.
Note that using UTF-8 may be (it depends on your application) a better choice than wide-character UNICODE, because then most of your text may be able to use xbuf_xcat() (which is faster than a buffer copy).
You will only need to call xbuf_ncat() for the non-ASCII characters.
The xbuf_xxx() functions could be modified to support UTF-8/UNICODE (with a flag to tell which encoding is used, for example), but that will be for later.
I have to write some code working with character encoding. Is there a good introduction to the subject to get me started?
First posted at What every developer should know about character encoding.
If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this: most Americans can get by without having to take this into account, most of the time. That's because the first 128 values (0–127) in the vast majority of encoding schemes map to the same set of characters, and because we only use A–Z without any other characters, accents, etc., we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside that first range, the trouble starts.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte settled at 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or code pages) developed early on, but we ended up with most everyone using a standard set of code pages where the first 128 values were identical across all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the 128–255 range was used for what were called DBCS (double-byte character sets). For each value of a first byte (in those higher ranges), the second byte then identified one of 256 characters. This gave a total of up to 128 × 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each had their own DBCS code pages.
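You can still see this layout in legacy code pages today; for example (Python used just to poke at the encodings, purely as an illustration):

    # One byte in a single-byte Western code page...
    print("ß".encode("cp1252").hex())       # 'df'
    # ...versus two bytes in a DBCS code page such as Shift_JIS,
    # where the lead byte falls in the high (above-ASCII) range.
    encoded = "漢".encode("shift_jis")
    print(len(encoded), hex(encoded[0]))    # 2, lead byte >= 0x81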
And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along: a website in America using an XML file from Greece to display data to a user browsing in Russia, with each party entering data based on their own country's settings, broke that paradigm.
Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, most programs assume UTF-8, but that is not a standard and is not universally followed. If the encoding is not specified and the program reading the file guesses wrong, the file will be misread.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because as the de facto standard, the way it works gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matches the standard code pages for the first 128 characters, so most existing HTML and XML already matched it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs of the Asian code pages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, a lead byte in the upper 128 values introduces a two-byte sequence, giving us more characters. But wait, there's more: for less common characters there are lead bytes that introduce three-byte sequences, and so on (the original design went up to 6-byte sequences; UTF-8 is now restricted to at most 4 bytes). Using this MBCS (multi-byte character set) scheme you can write the equivalent of every Unicode character and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
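A quick look at those variable lengths in practice (Python, purely illustrative):

    for ch in ("A", "ß", "€", "𝄞"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
    # U+0041  -> 41           (1 byte: ASCII range)
    # U+00DF  -> c3 9f        (2 bytes)
    # U+20AC  -> e2 82 ac     (3 bytes)
    # U+1D11E -> f0 9d 84 9e  (4 bytes: outside the Basic Multilingual Plane)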
But here is what everyone trips over: they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that their text editor, using the code page for their region, inserts as a single byte (a character like ß), and they save the file. Of course it must be correct: their text editor shows it correctly. But feed it to any program that reads according to the declared encoding and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
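That failure mode is easy to reproduce (again Python, only as an illustration):

    raw = "ß".encode("latin-1")      # one byte, 0xDF, as a regional code page would store it
    try:
        raw.decode("utf-8")          # 0xDF announces a 2-byte sequence, but no valid continuation byte follows
    except UnicodeDecodeError as err:
        print(err)                   # the "different character or error" described above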
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files, where you write them out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example: your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0–127 and will choke on anything else.
Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It also adds the byte-order preamble to the file.)
OK, you're reading and writing files correctly, but what about inside your code? That's where it's easy: Unicode. That's what those encoders built into the Java and .NET runtimes are designed to do: you read in and get Unicode; you write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type meant for characters. You probably have this right already, because today's languages don't give you much choice in the matter.
Point 5 – (For developers in languages that have been around a while) – Always use Unicode internally. In C++ this means wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account with text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
From Joel Spolsky
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
As usual, Wikipedia is a good starting point: http://en.wikipedia.org/wiki/Character_encoding
I have a very basic introduction on my blog, which also includes links to in-depth resources if you REALLY want to dig into the subject matter.
http://www.dotnetnoob.com/2011/12/introduction-to-character-encoding.html