Difference between Big Endian and Little Endian byte order - Unicode

What is the difference between Big Endian and Little Endian byte order?
Both of these seem to be related to Unicode and UTF16. Where exactly do we use this?

Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):
Byte Index: 0 1
---------------------
Big-Endian: 12 34
Little-Endian: 34 12
In order to decide whether a text uses UTF-16BE or UTF-16LE, the specification recommends prepending a Byte Order Mark (BOM) to the string, representing the character U+FEFF. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. For FF, FE, it is UTF-16LE.
A visual example: The word "Example" in different encodings (UTF-16 with BOM):
Byte Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
------------------------------------------------------------
ASCII: 45 78 61 6d 70 6c 65
UTF-16BE: FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE: FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00
For further information, please read the Wikipedia page of Endianness and/or UTF-16.
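If you want to reproduce the table above yourself, Python's built-in codecs make a quick sketch easy (the endian-specific codecs do not add a BOM on their own, so U+FEFF is prepended explicitly; bytes.hex with a separator needs Python 3.8+):

text = "Example"

print(text.encode("ascii").hex(" "))                   # 45 78 61 6d 70 6c 65
print(("\ufeff" + text).encode("utf-16-be").hex(" "))  # fe ff 00 45 00 78 00 61 ...
print(("\ufeff" + text).encode("utf-16-le").hex(" "))  # ff fe 45 00 78 00 61 00 ...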

Ferdinand's answer (and others) are correct, but incomplete.
Big Endian (BE) / Little Endian (LE) have nothing to do with UTF-16 or UTF-32.
They existed way before Unicode, and affect how the bytes of numbers get stored in the computer's memory. They depend on the processor.
If you have a number with the value 0x12345678 then in memory it will be represented as 12 34 56 78 (BE) or 78 56 34 12 (LE).
UTF-16 and UTF-32 code units happen to be 2 and 4 bytes wide respectively, so the order of those bytes follows the ordering that any number uses on that platform.
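To see those two memory layouts directly, here is a small sketch using Python's struct module (the format characters '>' and '<' select big- and little-endian respectively; nothing here is specific to Unicode):

import struct

n = 0x12345678
print(struct.pack(">I", n).hex(" "))  # 12 34 56 78  (big-endian)
print(struct.pack("<I", n).hex(" "))  # 78 56 34 12  (little-endian)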

UTF-16 encodes Unicode into 16-bit values. Most modern filesystems operate on 8-bit bytes. So, to save a UTF-16 encoded file to disk, for example, you have to decide which part of the 16-bit value goes in the first byte, and which goes into the second byte.
Wikipedia has a more complete explanation.

little-endian: adj.
Describes a computer architecture in which, within a given 16- or 32-bit word, bytes at lower addresses have lower significance (the word is stored ‘little-end-first’). The PDP-11 and VAX families of computers and Intel microprocessors and a lot of communications and networking hardware are little-endian. The term is sometimes used to describe the ordering of units other than bytes; most often, bits within a byte.
big-endian: adj.
[common; From Swift's Gulliver's Travels via the famous paper On Holy Wars and a Plea for Peace by Danny Cohen, USC/ISI IEN 137, dated April 1, 1980]
Describes a computer architecture in which, within a given multi-byte numeric representation, the most significant byte has the lowest address (the word is stored ‘big-end-first’). Most processors, including the IBM 370 family, the PDP-10, the Motorola microprocessor families, and most of the various RISC designs are big-endian. Big-endian byte order is also sometimes called network order.
---from the Jargon File: http://catb.org/~esr/jargon/html/index.html

Byte endianness (big or little) needs to be specified for Unicode/UTF-16 encoding because, for character codes that use more than a single byte, there is a choice of whether to read/write the most significant byte first or last. UTF-16 uses 16-bit (two-byte) code units, so this choice has to be made. (Note, however, that UTF-8 code units are always one byte long, even though a character can span several of them, so there is no endianness problem.) If the encoder of a stream of bytes representing Unicode text and the decoder do not agree on which convention is being used, the wrong character codes can be interpreted. For this reason, either the endianness convention is known beforehand or, more commonly, a byte order mark is placed at the beginning of the Unicode text file/stream to indicate whether big- or little-endian order is being used.
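As a small sketch of that BOM check (UTF-16 only, and it assumes the stream really does start with a BOM; Python's plain "utf-16" codec does this BOM handling for you automatically):

def decode_utf16_with_bom(data: bytes) -> str:
    # FE FF => big-endian, FF FE => little-endian, per the BOM convention above.
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark found")

print(decode_utf16_with_bom(bytes.fromhex("feff00450078")))  # Ex
print(decode_utf16_with_bom(bytes.fromhex("fffe45007800")))  # Ex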

Related

How to encode data to remove any 0x00 bytes

I am streaming some data of fixed length. The data consists of some 64-bit ints as well as some 32-bit floats. Because the format of the data is fixed and known, I am just sending an array of bytes with a known endian-ness. The data can then be easily reconstructed at the other end.
However, my transport protocol will not allow any 0x00 bytes. Is there a way I can encode my data differently to avoid this? Losing some range in the data is fine (e.g. ints having a maximum of 2^60 is totally fine). Increasing the full size of the message is totally fine too, as long as the full length of the data is fixed no matter what the values of the ints and floats are (e.g. if ints now take 9 bytes to store).
I don't know much about encoding formats, but I learned about CRCs a long time ago and I'm wondering if there's something like that, which will add some fixed length block to the end of the bytestream, but which will prevent the bytestream from containing any 0x00 bytes?
Let's take the case of 64-bit numbers:
Reduce your value range to 2^56 and use the last (or first) 7 bytes to encode that value.
For the 7 value bytes, replace all the 0x00 bytes with 0xff bytes. Record the positions of the bytes that have been flipped.
Use the remaining byte as a bit mask to encode the positions of the bytes that have been flipped. This will take up 7 bits of that remaining byte. The first (or last) bit of that byte needs to be always set to 1 to prevent the encoding byte from becoming 0x00 itself.
For example:
Take the 7 byte value b2 00 c3 d4 e5 ff 00.
Flip the 0x00 bytes to get b2 ff c3 d4 e5 ff ff. Bytes 2 and 7 have been flipped.
Create the bit mask 0100001 and prefix with a 1 bit to get a binary value of 10100001, or a hex value of 0xa1.
Your encoded 64-bit value will then be a1 b2 ff c3 d4 e5 ff ff.
The approach for 32-bit numbers is the same. Use 28 bits for the value, 3 bits to encode which bytes have been flipped, and the leftover bit always set to 1.
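Here is a Python sketch of the 64-bit variant of this scheme (the function names are mine; the worked example above is used as a sanity check):

def encode_no_zero_u64(value: int) -> bytes:
    # Encode a value < 2**56 into 8 bytes that contain no 0x00 byte:
    # 7 value bytes with every 0x00 flipped to 0xff, plus one mask byte
    # (top bit always 1) recording which positions were flipped.
    if not 0 <= value < 1 << 56:
        raise ValueError("value must fit in 56 bits")
    data = bytearray(value.to_bytes(7, "big"))
    mask = 0
    for i, b in enumerate(data):
        if b == 0x00:
            data[i] = 0xFF
            mask |= 1 << (6 - i)       # byte 1 -> highest mask bit, byte 7 -> lowest
    return bytes([0x80 | mask]) + bytes(data)

def decode_no_zero_u64(encoded: bytes) -> int:
    mask, data = encoded[0] & 0x7F, bytearray(encoded[1:])
    for i in range(7):
        if mask & (1 << (6 - i)):
            data[i] = 0x00             # restore the flipped bytes
    return int.from_bytes(data, "big")

# The worked example above: b2 00 c3 d4 e5 ff 00 -> a1 b2 ff c3 d4 e5 ff ff
value = int.from_bytes(bytes.fromhex("b200c3d4e5ff00"), "big")
encoded = encode_no_zero_u64(value)
print(encoded.hex(" "))                # a1 b2 ff c3 d4 e5 ff ff
assert decode_no_zero_u64(encoded) == value
assert 0x00 not in encoded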

Why is the vocab size of Byte level BPE smaller than Unicode's vocab size?

I recently read GPT2 and the paper says:
This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.
I really don't understand this. Unicode represents about 130K characters, so how can this be reduced to 256? Where did the rest of the roughly 129K characters go? What am I missing? Does byte-level BPE allow different characters to share the same representation?
I don't understand the logic. Below are my questions:
Why is the vocab size reduced (from 130K to 256)?
What is the logic of BBPE (byte-level BPE)?
Detail question
Thank you for your answer, but I really don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this basic (unique) vocabulary. Each Unicode character can be converted to 1 to 4 bytes using the UTF-8 encoding. The original BBPE paper (Neural Machine Translation with Byte-Level Subwords) says:
Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.
Each byte can represent 256 values (2^8), and we only need about 2^17 (131072) values to represent the unique Unicode characters. In this case, where did the 256 bytes in the original paper come from? I understand neither the logic nor how to derive this result.
Let me arrange my questions again, in more detail:
How does BBPE work?
Why is the vocab size reduced (from 130K to 256 bytes)?
Anyway, don't we always need space for 130K entries in the vocab? What is the difference between representing unique characters as Unicode code points versus as bytes?
Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.
Sincerely, thank you.
Unicode code points are integers in the range 0..1,114,111 (1,114,112 possible values), of which roughly 130k are in use at the moment. Every Unicode code point corresponds to a character, like "a" or "λ" or "龙", which is handy to work with in many cases (but there are a lot of complicated details, e.g. combining marks).
When you save text data to a file, you use one of the UTFs (UTF-8, UTF-16, UTF-32) to convert code points (integers) to bytes. For UTF-8 (the most popular file encoding), each character is represented by 1, 2, 3, or 4 bytes (there's some inner logic to discriminate single- and multi-byte characters).
So when the base vocabulary are bytes, this means that rare characters will be encoded with multiple BPE segments.
Example
Let's consider a short example sentence like “That’s great 👍”.
With a base vocabulary of all Unicode characters, the BPE model starts off with something like this:
T 54
h 68
a 61
t 74
’ 2019
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
👍 1F44D
(The first column is the character, the second its codepoint in hexadecimal notation.)
If you first encode this sentence with UTF-8, then this sequence of bytes is fed to BPE instead:
T 54
h 68
a 61
t 74
� e2
� 80
� 99
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
� f0
� 9f
� 91
� 8d
The typographic apostrophe "’" and the thumbs-up emoji are represented by multiple bytes.
With either input, the BPE segmentation (after training) may end with something like this:
Th|at|’s|great|👍
(This is a hypothetical segmentation, but it's possible that the capitalised “That” is too rare to be represented as a single segment.)
The number of BPE operations is different though: to arrive at the segment ’s, only one merge step is required for code-point input, but three steps for byte input.
With byte input, the BPE segmentation is likely to end up with sub-character segments for rare characters.
The down-stream language model will have to learn to deal with that kind of input.
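To see the two kinds of input side by side, here is a short Python sketch (no BPE library involved, just the base-vocabulary view of the same sentence):

sentence = "That’s great 👍"

code_points = [hex(ord(ch)) for ch in sentence]           # one symbol per Unicode character
utf8_bytes = [hex(b) for b in sentence.encode("utf-8")]   # one symbol per UTF-8 byte

print(len(code_points), code_points)  # 14 symbols drawn from a ~130K-character alphabet
print(len(utf8_bytes), utf8_bytes)    # 19 symbols drawn from an alphabet of only 256 byte values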
So you already know BPE, right? Byte-level BPE is a change in how the base vocabulary is defined. Recall that there are 143,859 characters in the Unicode standard, yet the GPT-2 vocabulary size is just 50,257. Having a base vocabulary of ~144K would grow even larger during the training process (where we merge frequently co-occurring symbols).
To solve this issue, GPT-2 uses a byte-level process with a base vocabulary of just 256 symbols, with which any Unicode character can be represented by either a single byte-level symbol or several of them. I still don't know the details of how a Unicode character is converted to its byte-level representation.
Does this explanation give you clarity on why we go to a byte-level representation? Once again, GPT-2 starts from this 256-symbol base vocabulary and grows the vocabulary by adding frequently co-occurring symbol sequences.

Oddity when encoding large integers using asn.1

I have found numerous references to the encoding requirements of Integers in ASN.1
and that Integers are inherently signed objects
TLV 02 02 01 23, for example.
However, I have a 2048-bit (256-byte) integer (within a certificate) encoded
30 82 01 09 02 82 01 00 d1 a5 xx xx xx… 02 03 010001
30 start
82 2 byte length
0109 265 bytes
02 Integer
82 2 byte length
0100 256 bytes
d1 a5 xxxx
The d1 is the troubling part, because its leading bit is 1, meaning this 256-byte number would be interpreted as negative when in fact it is an unsigned number, an RSA public key in fact. Does the signed constraint apply to Integers > 64 bits?
Thanks,
BER/DER uses two's-complement representation for encoding integer values. This means that the first bit (not byte) determines whether a number is positive or negative, and that sometimes an extra leading zero byte needs to be added to prevent the first bit from causing the integer to be interpreted as a negative number. Note that it is invalid BER/DER to have the first 9 bits all zero.
Yes, you are right. For any non-negative DER/BER-encoded INTEGER - no matter its length - the MSB of the first payload byte is 0.
The program that generated such key is incorrect.
The "signed constraint" (actually, a rule) totally applies to any size integers. However, depending on a domain you might find all sorts of oddities in how domain objects are encoded. This is something that has to be learned and accounted for the hard way, unfortunately.

How is this octet stream being interpreted as Hebrew UTF-8 encoding?

The following byte stream is identified by the file utility as UTF-8; it contains the Hebrew sentence דירות לשותפים בתל אביב - הומלס. I'm trying to understand the encoding.
ubuntu@ip-10-126-21-104:~$ od -t x1 homeless-title-fromwireshark_followed_by_hexdump.txt
0000000 0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7
0000020 a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7
0000040 aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7
0000060 94 d7 95 d7 9e d7 9c d7 a1 0a
0000072
ubuntu@ip-10-126-21-104:~$ file -i homeless-title-fromwireshark_followed_by_hexdump.txt
homeless-title-fromwireshark_followed_by_hexdump.txt: text/plain; charset=utf-8
The file is UTF-8; I've verified this by opening Notepad (Windows 7), entering the Hebrew character ד and then saving the file. The result yields the following:
ubuntu@ip-10-126-21-104:~$ od -t x1 test_from_notepad_utf8_daled.txt
0000000 ef bb bf d7 93
0000005
ubuntu@ip-10-126-21-104:~$ file -i test_from_notepad_utf8_daled.txt
test_from_notepad_utf8_daled.txt: text/plain; charset=utf-8
Where ef bb bf is the BOM encoded in UTF-8 form and d7 93 is exactly the sequence of bytes that appears in the original stream after 0a 09 (newline and tab in ASCII).
The problem here is that, going by the Unicode code charts, ד should be coded as 05 D3, so why and how did the UTF-8 encoding come to be d7 93?
d7 93 in binary is 11010111 10010011, while
05 D3 in binary is 00000101 11010011
I can't seem to find a correct transformation that will make sense for these encoding, that (to my understanding) represent the same Unicode entity, which is "HEBREW LETTER DALET"
Thank you,
Maxim.
Unicode defines (among other things) a bunch of "code points" and gives each one a numerical value. The value for HEBREW LETTER DALET is U+05D3 or 0x05D3. But that's just a number, and it DOES NOT tell you how to "encode" the code point (i.e. the set of actual bits) in a file or in memory. UTF-8 (as well as UTF-16, UTF-32 and a variety of other schemes) tells you how to do that.
There is actually a formulaic way of translating Unicode code points to UTF-8 byte sequences (but that's a whole different SO question). It turns out that in UTF-8, HEBREW LETTER DALET is encoded as 0xD7 0x93. By the way, if you find a text editor that allows you to save as UTF-32 or UCS-4, you will find that (in addition to getting a very large file) the bytes that you see with a hex editor should match the code points from the Unicode spec.
This page may give a little extra information on some of the representations for that one character.
For a great introduction to Unicode, I would suggest Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unicode code points U+0000..U+007F are encoded in UTF-8 as a single byte 0x00..0x7F.
Unicode code points U+0080..U+07FF (including HEBREW LETTER DALET U+05D3) are encoded in UTF-8 as two bytes. The binary values for these can be split into a group of 5 bits and a group of 6 bits, as in xxxxxyyyyyy. The first byte of the UTF-8 representation has the bit pattern 110xxxxx; the second has the bit pattern 10yyyyyy.
0x05D3 = 0000 0101 1101 0011
The last 6 bits of 0x05D3 are 010011; prefixed by the 10, that gives 1001 0011 or 0x93.
The previous 5 bits are 10111; prefixed by the 110, that gives 1101 0111 or 0xD7.
Hence, the UTF-8 encoding for U+05D3 is 0xD7 0x93.
There are more rules for Unicode code points U+0800 upwards that require 3 or 4 (but not more) bytes for the UTF-8 representation. The continuation bytes always have the 10yyyyyy bit pattern. The first bytes have bit patterns 1110xxxx (3 byte values) and 11110xxx (4 byte values). There are a number of byte values that cannot appear in valid UTF-8; they are 0xC0, 0xC1, and 0xF5..0xFF.
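The same arithmetic in a few lines of Python, checked against the built-in UTF-8 codec:

cp = 0x05D3                        # HEBREW LETTER DALET

byte1 = 0b11000000 | (cp >> 6)     # 110xxxxx with the top 5 bits of the code point
byte2 = 0b10000000 | (cp & 0x3F)   # 10yyyyyy with the low 6 bits of the code point

print(hex(byte1), hex(byte2))             # 0xd7 0x93
print(chr(cp).encode("utf-8").hex(" "))   # d7 93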
Legacy codepages defined a set of characters and their mapping to byte sequences. Unicode separates the concepts of character set and character encoding.
So, the Unicode character set is a list of code points. Each code point is assigned a unique value as an identifier - ד is U+05D3.
The encodings - Unicode transformation formats - describe how to encode each code point as a sequence of code units.
UTF-8 uses a 1-octet code unit and code points are encoded as sequences of between one and four bytes. The algorithm is described in RFC 3629.
A similar procedure exists for UTF-16 which uses 2-octet code units - each code point is two or four bytes. And there isn't anything to do for UTF-32 except make every value four bytes long. These encodings can come in big- or little-endian forms, so U+05D3 might be 00 00 05 D3 or D3 05 00 00 in UTF-32. The BOM is often used to tell which encoding is being used and what the endianness is if the encoding of the data is ambiguous.
There's also UTF-7, but I've never seen it in the wild.
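For comparison, here is the same code point under several transformation formats (a quick sketch with Python's codecs; the endian-specific variants below emit no BOM):

ch = "\u05d3"                            # ד, HEBREW LETTER DALET

print(ch.encode("utf-8").hex(" "))       # d7 93
print(ch.encode("utf-16-be").hex(" "))   # 05 d3
print(ch.encode("utf-16-le").hex(" "))   # d3 05
print(ch.encode("utf-32-be").hex(" "))   # 00 00 05 d3
print(ch.encode("utf-32-le").hex(" "))   # d3 05 00 00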

PNG file format endianness?

I'm not sure if "endian" is the right word, but..
I have been parsing through a PNG file and I have noticed that all of the integer values are in big endian. Is this true?
For example, the width and height are stored in the PNG file as 32-bit unsigned integers. My image is 16x16, and in the file it's stored as:
00 00 00 10
when it should be:
10 00 00 00
Is this true or is there something I am missing?
Yes, according to the specification, integers must be in network byte order (big endian):
All integers that require more than one byte shall be in network byte order: the most significant byte comes first, then the less significant bytes in descending order of significance (MSB LSB for two-byte integers, MSB B2 B1 LSB for four-byte integers). The highest bit (value 128) of a byte is numbered bit 7; the lowest bit (value 1) is numbered bit 0. Values are unsigned unless otherwise noted. Values explicitly noted as signed are represented in two's complement notation.
http://www.w3.org/TR/2003/REC-PNG-20031110/#7Integers-and-byte-order
Integers in PNG are in network byte order (big endian).
See: the spec.
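A minimal sketch of pulling those two fields out with Python's struct module ('>' selects big-endian/network order; "image.png" is just a placeholder path):

import struct

with open("image.png", "rb") as f:
    signature = f.read(8)                            # 89 50 4e 47 0d 0a 1a 0a
    length, chunk_type = struct.unpack(">I4s", f.read(8))
    assert chunk_type == b"IHDR"                     # IHDR is always the first chunk
    width, height = struct.unpack(">II", f.read(8))  # both big-endian 32-bit unsigned ints

print(width, height)                                 # e.g. 16 16 for the image above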