Why were PNG chunks named like that?

I've been studying the PNG structure to build something on top of it, and I found something interesting.
The names of the critical PNG chunks (IHDR, PLTE, IDAT, IEND) are all uppercase, while there is at least one lowercase letter in the names of the ancillary chunks (bKGD, cHRM, gAMA, hIST, iCCP, iTXt, pHYs, sBIT, sPLT, sRGB, sTER, tEXt, tIME, tRNS, zTXt, etc.).
I'm curious: was there a naming rule when these chunks were standardized?

According to Jongware, the answer is this:
https://www.w3.org/TR/PNG/#5Chunk-naming-conventions
5.4 Chunk naming conventions
Four bits of the chunk type, the property bits, namely bit 5 (value 32) of each byte, are used to convey chunk properties. This choice means that a human can read off the assigned properties according to whether the letter corresponding to each byte of the chunk type is uppercase (bit 5 is 0) or lowercase (bit 5 is 1). However, decoders should test the properties of an unknown chunk type by numerically testing the specified bits; testing whether a character is uppercase or lowercase is inefficient, and even incorrect if a locale-specific case definition is used.
The property bits are an inherent part of the chunk type, and hence are fixed for any chunk type. Thus, CHNK and cHNk would be unrelated chunk types, not the same chunk with different properties.
The semantics of the property bits are defined in Table 5.2.
Table 5.2 — Semantics of property bits
Ancillary bit: first byte
0 (uppercase) = critical, 1 (lowercase) = ancillary.
Critical chunks are necessary for successful display of the contents of the datastream, for example the image header chunk (IHDR). A decoder trying to extract the image, upon encountering an unknown chunk type in which the ancillary bit is 0, shall indicate to the user that the image contains information it cannot safely interpret.
Ancillary chunks are not strictly necessary in order to meaningfully display the contents of the datastream, for example the time chunk (tIME). A decoder encountering an unknown chunk type in which the ancillary bit is 1 can safely ignore the chunk and proceed to display the image.
Private bit: second byte
0 (uppercase) = public, 1 (lowercase) = private.
A public chunk is one that is defined in this International Standard or is registered in the list of PNG special-purpose public chunk types maintained by the Registration Authority (see 4.9 Extension and registration). Applications can also define private (unregistered) chunk types for their own purposes. The names of private chunks have a lowercase second letter, while public chunks will always be assigned names with uppercase second letters. Decoders do not need to test the private-chunk property bit, since it has no functional significance; it is simply an administrative convenience to ensure that public and private chunk names will not conflict. See clause 14: Editors and extensions and 12.10.2: Use of private chunks.
Reserved bit: third byte
0 (uppercase) in this version of PNG. If the reserved bit is 1, the datastream does not conform to this version of PNG.
The significance of the case of the third letter of the chunk name is reserved for possible future extension. In this International Standard, all chunk names shall have uppercase third letters.
Safe-to-copy bit: fourth byte
0 (uppercase) = unsafe to copy, 1 (lowercase) = safe to copy.
This property bit is not of interest to pure decoders, but it is needed by PNG editors. This bit defines the proper handling of unrecognized chunks in a datastream that is being modified. Rules for PNG editors are discussed further in 14.2: Behaviour of PNG editors.
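For illustration, here is a minimal sketch (in Go, standard library only; the helper names are my own) of testing the property bits numerically, as the specification recommends, instead of checking letter case:

package main

import "fmt"

// Bit 5 (value 32 = 0x20) of each byte of the chunk type carries one property.
func isAncillary(chunkType [4]byte) bool  { return chunkType[0]&0x20 != 0 } // first byte
func isPrivate(chunkType [4]byte) bool    { return chunkType[1]&0x20 != 0 } // second byte
func isReserved(chunkType [4]byte) bool   { return chunkType[2]&0x20 != 0 } // third byte
func isSafeToCopy(chunkType [4]byte) bool { return chunkType[3]&0x20 != 0 } // fourth byte

func main() {
	tIME := [4]byte{'t', 'I', 'M', 'E'}
	// tIME is ancillary, public, reserved-bit clear, and not safe to copy.
	fmt.Println(isAncillary(tIME), isPrivate(tIME), isReserved(tIME), isSafeToCopy(tIME))
	// Output: true false false false
}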

Related

Understanding the need of encoding and decoding in context to saving the strings on disk

I have read the answer here. I understand what a byte stream is (a stream of 1s and 0s), what encoding is (a mapping from the characters we humans understand to corresponding bytes) and what decoding is (the reverse mapping from bytes back to characters).
I still cannot reconcile the entire concept in my head. In RAM we already have everything as bytes, and I guess my interpreter is inherently using some decoding scheme to show me the characters corresponding to that byte stream. What then do we mean by having to encode before saving to disk? If my interpreter is using 'utf-8' to show me this text that I am typing, and I ask it to save the text using 'cp-1252', have I changed the underlying byte stream?
There are different ways to see it.
One way: "Hello World!" could be encoded in different ways. What you care about is the semantics of the string: a salutation and a target. But if you save it to a UTF-8 file you will get different byte values than in a UTF-16LE file, or with an EBCDIC encoding.
E.g. 'A' is 65 in ASCII, 193 in EBCDIC (used e.g. by many IBM mainframes), and 0 65 (or 65 0) in UTF-16. So when you save text you need to specify the encoding (as expected by the reader, so it may depend on the file format).
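A small sketch (written in Go here purely for illustration, standard library only) makes the point concrete: the same text gives different byte sequences under different encodings.

package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

func main() {
	s := "Hello World!"

	// UTF-8: a Go string already holds the UTF-8 bytes.
	fmt.Printf("UTF-8:    % x\n", []byte(s))

	// UTF-16LE: encode to 16-bit code units, then lay each unit out little-endian.
	units := utf16.Encode([]rune(s))
	le := make([]byte, 2*len(units))
	for i, u := range units {
		binary.LittleEndian.PutUint16(le[2*i:], u)
	}
	fmt.Printf("UTF-16LE: % x\n", le)
}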
Also, a language's libraries may not handle all encodings (for all functions). Usually it is better to decode input using the standard libraries, work on the decoded form, and encode again when the data has to go out. That way you only need to implement encoding and decoding (e.g. for EBCDIC), and not all the sorting, upper/lower-case handling, is_digit, is_symbol, etc.
It is standard practice to separate semantics from raw values, or display from logic. If you are a control freak you can do everything without decoding the values, but it is error prone, and you would have to know so many details that few people want to know them.
Another example: do you need to know the real byte values of your data and strings? You have a number: is it encoded little-endian or big-endian? Or maybe as a float (as in JavaScript)? We only have to decide when we save the data: e.g. to send it over the internet we need a way to tell the byte ordering, and when saving images we record the ordering, so that on some machines the bytes are swapped when reading a multi-byte number.
Or another example: you take a selfie. You have an image, but you can save it as a PNG file or as a JPEG file: you will get very different files, with different byte values, but you know the encoding (fortunately, for such image files the first bytes describe the format, followed by a little data about the encoding). For you it is enough to know that it is your image. But do you think the computer treats the bytes of the two formats identically? Probably not. When you read the image it is converted to a different encoding in memory (which you probably do not need to care about): often an RGB (or RGBA) format, but how many bits per channel, or whether some colour rendering (from profiles) is applied, you do not know [JPEG stores it as YCC].
Python takes a stricter, semantic view: you do not know how Python encodes a string internally. It may use 8 bits per character (ASCII/Latin-1), 16 bits (UCS-2), or 32 bits (UTF-32); it chooses the internal representation dynamically, according to the most efficient way to store that particular string. You can still get the code point of each character, and use the many string/character functions. Only when you encode a string do you get a fixed sequence of bytes. On the string side you really do not know how strings are represented in memory. This keeps the two parts of Unicode clearly separated: the semantic values (the description of all characters) and the encoding/decoding (how to represent those values as bytes).
When you are handling a string in Python you should care only about the semantics. The implementation (and so the physical layout of the string in memory) is not your business, and Python is free to change it (it has changed it).
But back to your example:
You may not notice much of this nowadays, because of recent standardisation: ASCII has become practically the only encoding for the most common Latin letters and symbols. Latin-1 is compatible with ASCII, just extended from 7 bits to 8 bits. "Windows ANSI" (cp-1252) takes Latin-1 and adds characters in the unallocated parts. Unicode based its first 256 code points on Latin-1. So today you can expect a character to have a fixed number (or not to be available at all), but this was not the rule, even in early Windows.
So your cp-1252 text is compatible with UTF-8 for the plain ASCII characters, but not for the remaining ones. If you use other encodings you will have to do much more transcoding (converting from one encoding to another). But usually you do this only when you save: you keep the internal encoding and encode a copy to be saved.
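As a sketch of what such a transcoding looks like in practice (this example is in Go and assumes the third-party golang.org/x/text/encoding/charmap package; the byte values are only an illustration), you decode the cp-1252 bytes into the program's internal string form and re-encode on the way out:

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// Bytes as they would appear in a cp-1252 (Windows-1252) file: "café",
	// where 'é' is the single byte 0xE9.
	cp1252 := []byte{'c', 'a', 'f', 0xE9}

	// Decode cp-1252 into the internal (UTF-8) string form.
	decoded, err := charmap.Windows1252.NewDecoder().String(string(cp1252))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s as UTF-8 bytes: % x\n", decoded, []byte(decoded)) // 63 61 66 c3 a9

	// Re-encode only when the data has to go out, e.g. back to cp-1252.
	encoded, err := charmap.Windows1252.NewEncoder().String(decoded)
	if err != nil {
		panic(err)
	}
	fmt.Printf("back to cp-1252: % x\n", []byte(encoded)) // 63 61 66 e9
}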
A byte is 8 bits, whether it is in RAM, on disk, or on the wire.
A bit is the "atom" of computer data. A byte is the "molecule", except that there is only one kind of byte.
A bit is the smallest unit of information in computers. It is usually said to represent 0 or 1, or OFF or ON.
Whether you "interpret" a byte as a number (0 to 255), a signed number (-128 to +127), an "ascii" character, like the characters I am typing, depends on what you (or the computer) does with the byte. Or a byte can be part of a bigger number, one that requires several bytes to represent.
Because there are too many "letters" or "characters" (especially in Chinese) to fit in a byte, there is the additional concept that a "character" may be composed of multiple bytes. UTF-8 is the main standard today. Giacomo discusses several less-common encodings that say what "character" is represented by a byte (or bytes). Remember, each byte is composed of 8 bits.
English letters, digits and some punctuation are represented (encoded) in bytes in the same way in ASCII, Latin-1, cp-1252, and UTF-8 (and some other encodings). But as soon as you get into European accented letters, the encodings diverge.
A common thing you may hear of is to represent one byte as two hexadecimal digits.
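For instance, a tiny sketch (Go, standard library only) printing the same byte three ways:

package main

import "fmt"

func main() {
	b := byte('A')
	fmt.Println(b)          // 65        (decimal value)
	fmt.Printf("%02X\n", b) // 41        (two hexadecimal digits)
	fmt.Printf("%08b\n", b) // 01000001  (the underlying 8 bits)
}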

Why can't we store Unicode directly?

I read some article about Unicode and UTF-8.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA‘. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA on the disk directly. I think the reason is:
Unicode is not a self-synchronizing code, so if
10 represents A
110 represents B
10110 represents C
then when I see 10110 on the disk I can't tell whether it is A followed by B or just C.
Storing code points directly would use much more space than UTF-8 or UTF-16.
Am I right?
Read about Unicode, UTF-8 and the UTF-8 everywhere website.
There are more than a million Unicode code points (you mentioned 1,114,111...), so you need at least 21 bits to be able to distinguish all of them (since 2^21 > 1,114,111).
So you can store Unicode characters directly, if you represent each of them by a wide enough integral type. In practice, that type would be some 32-bit integer (because it is not convenient to handle 3-byte, i.e. 24-bit, integers). This is called UCS-4 (or UTF-32), and some systems or software already handle their Unicode strings in such a format.
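For example, here is a minimal sketch in Go, whose []rune type is exactly this fixed-width 32-bit (UCS-4/UTF-32) view of the text; only the standard library is used:

package main

import "fmt"

func main() {
	s := "wi\u12CA" // contains U+12CA, ETHIOPIC SYLLABLE WI

	// One 32-bit integer (code point) per character:
	for _, r := range []rune(s) {
		fmt.Printf("U+%04X ", r) // U+0077 U+0069 U+12CA
	}
	fmt.Println()

	// The same text as a variable-length UTF-8 byte sequence:
	fmt.Printf("% x\n", []byte(s)) // 77 69 e1 8b 8a
}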
Notice also that displaying Unicode strings is quite difficult, because of the variety of human languages (and also because Unicode has combining characters). Some scripts are displayed right to left (Arabic, Hebrew, ...), others left to right (English, French, Spanish, German, Russian, ...), and some top to bottom (Chinese, ...). A library displaying Unicode strings should be capable of displaying a string containing English, Chinese and Arabic words... Then you see that decoding UTF-8 is the easy part of displaying Unicode strings (and storing UCS-4 strings won't help much).
But since English is the dominant language in IT (for economic reasons), it is very often cheaper to keep strings in UTF-8 form. If most of the strings handled by your system are English (or in some other European language using the Latin alphabet), it is cheaper and takes less space to keep them in UTF-8.
I guess that if China becomes a dominant power in IT, things might change (or maybe not).
(I have no idea of the most common encoding used today on Chinese supercomputers or smartphones; I guess it is still UTF-8.)
In practice, use a library (perhaps libunistring or Glib in C) to process UTF-8 strings and another one (e.g. Pango and GTK in C) to display them. You'll find many Unicode-related libraries in various programming languages.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA on the disk directly.
How do you write 12CA to a disk directly? It is a bigger value than a byte can hold, so you need to write at least two bytes. Do you write 12 followed by CA? You just encoded it in UTF-16BE. That's what an encoding is...a definition of how to write an abstract number as bytes.
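To make that concrete, a short sketch (Go, standard library only) writes the abstract number U+12CA as bytes in two different encodings:

package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

func main() {
	cp := rune(0x12CA)

	// UTF-16BE: one 16-bit code unit, written big-endian.
	units := utf16.Encode([]rune{cp})
	be := make([]byte, 2*len(units))
	for i, u := range units {
		binary.BigEndian.PutUint16(be[2*i:], u)
	}
	fmt.Printf("UTF-16BE: % x\n", be) // 12 ca

	// UTF-8: the same code point becomes three bytes.
	fmt.Printf("UTF-8:    % x\n", []byte(string(cp))) // e1 8b 8a
}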
Other reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
For good and specific reasons, Unicode doesn't specify any particular encoding. If it makes sense for your scenario, you can specify your own.
Because Unicode doesn't specify any serialization, there is no way to "directly" store Unicode, just like you can't "directly" store a mathematical number or a flow chart to implement a program you designed. The question isn't really well-defined.
There are a number of existing serialization formats (encodings) so it is very likely that it makes the most sense to use an existing one unless your requirements are significantly different than what any existing encoding provides; even then, is it really worth the cost?
A stream of bits is just a stream of bits. Conventionally, we chop them up into groups of 8 and call that a "byte", and the latter half of your question is really "if it's not a byte, how can you tell which bits belong to which symbol?" There are many ways to do that, but the common ones generally define a sequence of some particular length (8, 16, and 32 bits are often convenient for reasons of compatibility with bus widths on modern computers, etc.), but again, if you really wanted to, you could come up with something different. Huffman trees come to mind as one way to communicate a structure of variable length (and they are used for precisely that in many compression algorithms).
Consider one situation: even if you could directly save raw Unicode code points to disk and close the file, what happens when you open the file again? It's just a bunch of bytes, and you don't know how many bytes each character occupies. If '🥶' (U+1F976, decimal 129398) and 'A' are the contents of your file and the reader assumes a fixed one or two bytes per character, the bytes of U+1F976 cannot be decoded correctly: instead of the one emoji you saved, you would see several unrelated characters.

"In language x strings are y - e.g. UTF-16 - by default" - what does that mean?

In many places we can read that, for example, "C# uses UTF-16 for its strings" (link). Technically, what does this mean?
My source file is just some text. Say I'm using Notepad++ to code a simple C# app; how the text is represented in bytes on disk, after I save the file, depends on N++, so that's probably not what people mean. Does that mean that:
The language specification requires/recommends that the compiler input be encoded as UTF-16?
The standard library functions are encoding-aware and treat the strings as UTF-16, for example String's operator [] (which returns the n-th character and not the n-th byte)?
Once the compiler produces an executable, the strings stored inside it are in UTF-16?
I've used C# as an example, but this question applies to any language of which one could say that it uses encoding Y for its strings.
"C# uses UTF-16 for its strings"
As far as I understand this concept, this is a simplification at best. A CLI runtime (such as the CLR) is required to store strings it loads from assemblies or that are generated at runtime in UTF-16 encoding in memory - or at least present them as such to the rest of the runtime and the application.
See CLI specification:
III.1.1.3 Character data type
A CLI char type occupies 2 bytes in memory and represents a Unicode code unit using UTF-16 encoding. For the purpose of stack operations char values are treated as unsigned 2-byte integers (§III.1.1.1).
And C# specification:
4.2.4 The string type
Instances of the string class represent Unicode [being UTF-16 in .NET jargon] character strings.
I can't quickly find which source-file encodings the C# compiler supports, but I'm quite sure you can have a source file stored in UTF-8 encoding, or even ASCII (or another non-Unicode code page).
The standard library functions are encoding-aware and treat the strings as UTF-16
No, the BCL just treats strings as strings, essentially a wrapper around a char[] array. Only when transitioning outside the runtime, as in a P/Invoke call, does the runtime "know" which platform functions to invoke and how to marshal a string to those functions. See for example C++/CLI Converting from System::String^ to std::string.
Once the compiler produces an [assembly], the strings are stored inside it in UTF-16?
Yes.
Let's take a look at the C/C++ char type. It is 8 bits (1 byte) long, which means it can hold 256 different values. Now think about what a legacy single-byte code page (and the font built on it) actually is: something like a map in which the values 0 to 255 are mapped to symbols. Such a code page usually contains two alphabets (Cyrillic and Latin, for example) plus special symbols. There is not enough room (only 256 slots) to also fit Greek or Chinese letters.
Now let's see what UTF-8 is. It is an encoding that stores each character in 1 to 4 bytes. For example, if you type the word "word" in Notepad and save the file with UTF-8 encoding, the resulting file will be exactly 4 bytes long; but if you type the word "дума", which is again 4 characters, it will take 8 bytes on your storage, because each Cyrillic letter is stored as 2 bytes while each Latin letter takes 1.
UTF-16 stores most characters in 2 bytes (and the rest in 4, via surrogate pairs), and UTF-32 always uses 4 bytes.
Let's see how this looks from a programming point of view. When you type characters in Notepad they are stored in RAM (in some format that Notepad understands). When you save the file, Notepad writes a sequence of bytes to the disk, and this sequence depends on the chosen encoding. When you later read the file (with C# or some other language) you have to know its encoding: knowing it tells you how to interpret the sequence written on the disk.
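A quick sketch (in Go here, only to show the byte counts; any language would do) of the "word" vs "дума" comparison:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	for _, s := range []string{"word", "дума"} {
		// Both strings are 4 characters long, but their UTF-8 forms differ in
		// size, because each Cyrillic letter takes 2 bytes in UTF-8.
		fmt.Printf("%q: %d characters, %d bytes in UTF-8\n",
			s, utf8.RuneCountInString(s), len(s))
	}
}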

Read UTF-8 encoded string from io.Reader

I am writing a small communication protocol with TCP sockets.
I am able to read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.
The protocol client is written in Java and the server is Go.
From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.
I'd like to know how can I read and write this UTF-8 stream.
Note
I have the byte buffer length available at the time I read the string.
Some theory first:
A rune in Go represents a Unicode code point — a number assigned to a particular character in Unicode. It's an alias for int32.
UTF-8 is a Unicode encoding — a format of representing Unicode code points for the means of storage and transmission. UTF-8 might use 1 to 4 bytes to encode a single code point.
How this maps on Go data types:
Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).
The chief difference is that strings are immutable, so while you can
b := make([]byte, 2)
b[0] = byte('a')
b[1] = byte('z')
you can't
var s string
s[0] = byte('a')
The latter fact is even underlined by the inability to set the string length explicitly (like in imaginary s := make(string, 10)).
While strings in Go contain abstract bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:
A type conversion between string and []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
A range loop over a string iterates over the Unicode code points comprising the string, not just its bytes (see the sketch below).
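A minimal sketch of both behaviours, using only the standard library:

package main

import "fmt"

func main() {
	s := "héllo" // 'é' is U+00E9, which UTF-8 encodes in 2 bytes

	// string -> []rune: one element per code point.
	fmt.Println(len(s), len([]rune(s))) // 6 5

	// range over a string: the index is a byte offset, the value a code point.
	for i, r := range s {
		fmt.Printf("byte %d: U+%04X\n", i, r)
	}
}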
Go also supplies the type conversions between string and []byte and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like
b := make([]byte, 1000)
io.ReadFull(r, b)
s := string(b)
always copies the data, no matter if you convert a slice to a string or back. This wastes space but is type-safe and enforces the semantics.
Now back to your task at hand.
If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by io.Read() (or whatever) to strings. Be sure to reuse the slice you read the data into, to ease the pressure on the garbage collector — that is, do not allocate a new slice for each read, since you are going to copy the data the reading code put into it off to a string anyway.
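For instance, here is a sketch of one way to do it. It assumes, purely for illustration, that the peer sends a big-endian uint32 length prefix followed by the UTF-8 bytes; adjust the framing to whatever your protocol actually specifies:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString reads one length-prefixed UTF-8 string from r, reusing buf when
// it is large enough.
func readString(r io.Reader, buf []byte) (string, error) {
	var n uint32
	if err := binary.Read(r, binary.BigEndian, &n); err != nil {
		return "", err
	}
	if int(n) > cap(buf) {
		buf = make([]byte, n)
	}
	buf = buf[:n]
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil // copies the bytes into an immutable string
}

func main() {
	// Simulated wire data: 4-byte length, then the UTF-8 bytes of "héllo".
	payload := []byte("héllo")
	var wire bytes.Buffer
	binary.Write(&wire, binary.BigEndian, uint32(len(payload)))
	wire.Write(payload)

	buf := make([]byte, 64) // reused across reads to ease GC pressure
	s, err := readString(&wire, buf)
	fmt.Println(s, err) // héllo <nil>
}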
Finally, if you absolutely must not copy the data (say, you're dealing with multi-megabyte strings and you have tight memory requirements), you may try to play dirty tricks by working with memory unsafely — here is an example of how you might transplant the memory from a byte slice to a string. Note that should you resort to something like this, you must understand very well that it is free to break with any new release of Go, and it is not even guaranteed to work at all.

Displaying Unicode Characters

I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
Having to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), then (see the sketch below):
Check the first byte to see how many bytes are in the character.
Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
Use bit masking and shifting to convert the bytes to the code point.
Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
Increment your pointer by the sequence length and repeat for the next character.
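Here is a hypothetical sketch of those steps in Go. A real program would normally just call utf8.DecodeRune from the standard library; this is only to show the masking, shifting, and validity checks (surrogate-range checking is omitted):

package main

import "fmt"

// decodeOne decodes the first UTF-8 sequence in b and returns the code point
// and the number of bytes consumed.
func decodeOne(b []byte) (r rune, size int, err error) {
	if len(b) == 0 {
		return 0, 0, fmt.Errorf("empty input")
	}
	b0 := b[0]
	switch {
	case b0 < 0x80: // 0xxxxxxx: one byte
		return rune(b0), 1, nil
	case b0&0xE0 == 0xC0: // 110xxxxx: two bytes
		r, size = rune(b0&0x1F), 2
	case b0&0xF0 == 0xE0: // 1110xxxx: three bytes
		r, size = rune(b0&0x0F), 3
	case b0&0xF8 == 0xF0: // 11110xxx: four bytes
		r, size = rune(b0&0x07), 4
	default:
		return 0, 0, fmt.Errorf("invalid leading byte %#x", b0)
	}
	if len(b) < size {
		return 0, 0, fmt.Errorf("truncated sequence")
	}
	for _, c := range b[1:size] {
		if c&0xC0 != 0x80 { // trailing bytes must be 10xxxxxx (0x80-0xBF)
			return 0, 0, fmt.Errorf("invalid continuation byte %#x", c)
		}
		r = r<<6 | rune(c&0x3F)
	}
	// Reject overlong encodings: the value must really need this many bytes.
	if min := []rune{0, 0, 0x80, 0x800, 0x10000}[size]; r < min {
		return 0, 0, fmt.Errorf("overlong encoding")
	}
	return r, size, nil
}

func main() {
	r, n, err := decodeOne([]byte("€uro")) // '€' is U+20AC, 3 bytes in UTF-8
	fmt.Println(r, n, err)                 // 8364 3 <nil>
}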
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters weren't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.