How is this octet stream being interpreted as Hebrew UTF-8 encoding? - unicode

The following byte stream is identified as UTF-8; it contains the Hebrew sentence: דירות לשותפים בתל אביב - הומלס. I'm trying to understand the encoding.
ubuntu#ip-10-126-21-104:~$ od -t x1 homeless-title-fromwireshark_followed_by_hexdump.txt
0000000 0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7
0000020 a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7
0000040 aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7
0000060 94 d7 95 d7 9e d7 9c d7 a1 0a
0000072
ubuntu#ip-10-126-21-104:~$ file -i homeless-title-fromwireshark_followed_by_hexdump.txt
homeless-title-fromwireshark_followed_by_hexdump.txt: text/plain; charset=utf-8
The file is UTF-8. I've verified this by opening Notepad (Windows 7), inputting the Hebrew character ד and then saving the file, which yields the following:
ubuntu#ip-10-126-21-104:~$ od -t x1 test_from_notepad_utf8_daled.txt
0000000 ef bb bf d7 93
0000005
ubuntu#ip-10-126-21-104:~$ file -i test_from_notepad_utf8_daled.txt
test_from_notepad_utf8_daled.txt: text/plain; charset=utf-8
Where ef bb bf is the BOM encoded in utf-8 form and d7 93 is exactly the sequence of bytes that appears in the original stream after 0a 09 (new line, tab in ascii).
The problem here is that according to the Unicode code charts, ד should be coded as 05 D3, so why and how did the UTF-8 encoding come to be d7 93?
d7 93 in binary is 11010111 10010011, while
05 D3 in binary is 00000101 11010011
I can't seem to find a transformation that makes sense between these encodings, which (to my understanding) represent the same Unicode entity, "HEBREW LETTER DALET".
Thank you,
Maxim.

Unicode defines (among other things) a bunch of "code points" and gives each one a numerical value. The value for HEBREW LETTER DALET is U+05D3 or 0x05D3. But that's just a number, and that DOES NOT tell you how to "encode" the code point (i.e. the set of actual bits) in a file/in memory...UTF-8 (as well as UTF-16, UTF-32 and a variety of other schemes) tell you how to do that.
There is actually a formulaic way of translating Unicode code points to UTF-8 byte sequences (but that's a whole different SO question). It turns out that in UTF-8, HEBREW LETTER DALET is encoded as 0xD7 0x93. By the way, if you find a text editor that allows you to save as UTF-32 or UCS-4, you will find that (in addition to a very large file) the bytes that you see with a hex editor match the code points from the Unicode spec, apart from byte order.
This page may give a little extra information on some of the representations for that one character.
For a great introduction to Unicode, I would suggest Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
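As a quick check of the point above, the bytes from the question's od dump can be decoded in Python. This is only a sketch: the variable names are illustrative, and `bytes.hex(" ")` assumes Python 3.8 or later.

```python
# Bytes copied from the `od -t x1` dump in the question.
dump = (
    "0a 09 d7 93 d7 99 d7 a8 d7 95 d7 aa 20 d7 9c d7 "
    "a9 d7 95 d7 aa d7 a4 d7 99 d7 9d 20 20 d7 91 d7 "
    "aa d7 9c 20 d7 90 d7 91 d7 99 d7 91 20 2d 20 d7 "
    "94 d7 95 d7 9e d7 9c d7 a1 0a"
)
raw = bytes.fromhex(dump)      # fromhex ignores the spaces
text = raw.decode("utf-8")     # bytes -> code points

first = text.strip()[0]        # first letter after the newline and tab
print(repr(text))
print(f"U+{ord(first):04X}")               # the code point (just a number)
print(first.encode("utf-8").hex(" "))      # its UTF-8 bytes
print(first.encode("utf-32-be").hex(" "))  # its UTF-32 (big-endian) bytes
```

The UTF-32 line is the one whose bytes read like the code point itself; the UTF-8 line is what actually sits in the file.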

Unicode code points U+0000..U+007F are encoded in UTF-8 as a single byte 0x00..0x7F.
Unicode code points U+0080..U+07FF (including HEBREW LETTER DALET U+05D3) are encoded in UTF-8 as two bytes. The binary values for these can be split into a group of 5 bits and a group of 6 bits, as in xxxxxyyyyyy. The first byte of the UTF-8 representation has the bit pattern 110xxxxx; the second has the bit pattern 10yyyyyy.
0x05D3 = 0000 0101 1101 0011
The last 6 bits of 0x05D3 are 010011; prefixed by the 10, that gives 1001 0011 or 0x93.
The previous 5 bits are 10111; prefixed by the 110, that gives 1101 0111 or 0xD7.
Hence, the UTF-8 encoding for U+05D3 is 0xD7 0x93.
There are more rules for Unicode code points U+0800 upwards that require 3 or 4 (but not more) bytes for the UTF-8 representation. The continuation bytes always have the 10yyyyyy bit pattern. The first bytes have bit patterns 1110xxxx (3 byte values) and 11110xxx (4 byte values). There are a number of byte values that cannot appear in valid UTF-8; they are 0xC0, 0xC1, and 0xF5..0xFF.
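The bit patterns above can be written out as a small hand-rolled encoder. This is a sketch for illustration only (the function name `utf8_encode` is made up here); in real code you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10yyyyyy
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx + three continuation bytes
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


print(utf8_encode(0x05D3).hex(" "))  # d7 93
```

Comparing the result with Python's built-in encoder for a few code points is an easy way to convince yourself the rules are right.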

Legacy codepages defined a set of characters and their mapping to byte sequences. Unicode separates the concepts of character set and character encoding.
So, the Unicode character set is a list of code points. Each code point is assigned a unique value as an identifier - ד is U+05D3.
The encodings - Unicode transformation formats - describe how to encode each code point as a sequence of code units.
UTF-8 uses a 1-octet code unit and code points are encoded as sequences of between one and four bytes. The algorithm is described in RFC 3629.
A similar procedure exists for UTF-16 which uses 2-octet code units - each code point is two or four bytes. And there isn't anything to do for UTF-32 except make every value four bytes long. These encodings can come in big- or little-endian forms, so U+05D3 might be 00 00 05 D3 or D3 05 00 00 in UTF-32. The BOM is often used to tell which encoding is being used and what the endianness is if the encoding of the data is ambiguous.
There's also UTF-7, but I've never seen it in the wild.
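To see the encodings side by side, here is a short sketch using Python's built-in codecs (the dictionary name `encoded` is just for illustration):

```python
dalet = "\u05d3"  # HEBREW LETTER DALET

# Encode the same code point with several Unicode transformation formats.
encoded = {
    name: dalet.encode(name).hex(" ")
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}
for name, hex_bytes in encoded.items():
    print(f"{name:10} {hex_bytes}")

# The BOM is the code point U+FEFF encoded in the same scheme.
print("BOM in utf-8:", "\ufeff".encode("utf-8").hex(" "))
```

The explicit `-be`/`-le` codec names do not emit a BOM; the plain `utf-16` and `utf-32` codecs in Python prepend one in the platform's byte order.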

Related

Why is the vocab size of Byte level BPE smaller than Unicode's vocab size?

I recently read GPT2 and the paper says:
This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.
I really don't understand the words. The number of characters that Unicode represents is 130K but how can this be reduced to 256? Where's the rest of approximately 129K characters? What am I missing? Does byte-level BPE allow duplicating of representation between different characters?
I don't understand the logic. Below are my questions:
Why is the size of the vocab reduced? (from 130K to 256)
What's the logic of the BBPE (Byte-level BPE)?
Detail question
Thank you for your answer but I really don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this basic (unique) vocabulary. Each Unicode character can be converted to 1 to 4 bytes by using the UTF-8 encoding. The original paper on BBPE (Neural Machine Translation with Byte-Level Subwords) says:
Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.
Each byte can represent 256 characters (bits, 2^8), we only need 2^17 (131072) bits for representing the unique Unicode characters. In this case, where did the 256 bytes in the original paper come from? I don't know both the logic and how to derive this result.
I arrange my questions again, more detail:
How does BBPE work?
Why is the size of the vocab reduced? (from 130K to 256 bytes)
Anyway, we always need 130K space for a vocab. What's the difference between representing unique characters as Unicode and Bytes?
Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.
Sincerely, thank you.
Unicode code points are integers in the range 0..1,114,111 (0x10FFFF), of which roughly 130k are in use at the moment. Most assigned code points correspond to a character, like "a" or "λ" or "龙", which is handy to work with in many cases (but there are a lot of complicated details, e.g. combining marks).
When you save text data to a file, you use one of the UTFs (UTF-8, UTF-16, UTF-32) to convert code points (integers) to bytes. For UTF-8 (the most popular file encoding), each character is represented by 1, 2, 3, or 4 bytes (there's some inner logic to discriminate single- and multi-byte characters).
So when the base vocabulary consists of the 256 possible byte values, this means that rare characters will be encoded with multiple BPE segments.
Example
Let's consider a short example sentence like “That’s great 👍”.
With a base vocabulary of all Unicode characters, the BPE model starts off with something like this:
T 54
h 68
a 61
t 74
’ 2019
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
👍 1F44D
(The first column is the character, the second its codepoint in hexadecimal notation.)
If you first encode this sentence with UTF-8, then this sequence of bytes is fed to BPE instead:
T 54
h 68
a 61
t 74
� e2
� 80
� 99
s 73
(space) 20
g 67
r 72
e 65
a 61
t 74
(space) 20
� f0
� 9f
� 91
� 8d
The typographic apostrophe "’" and the thumbs-up emoji are represented by multiple bytes.
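Both listings can be reproduced in a few lines of Python. This is a sketch with illustrative variable names; the apostrophe and emoji are written as escapes so the code does not depend on how this page is encoded.

```python
sentence = "That\u2019s great \U0001F44D"   # That’s great 👍

# Character-level input: one symbol per code point.
code_points = [ord(ch) for ch in sentence]

# Byte-level input: one symbol per UTF-8 byte, each in the range 0..255.
utf8_bytes = list(sentence.encode("utf-8"))

print(len(code_points), [f"{cp:X}" for cp in code_points])
print(len(utf8_bytes), [f"{b:02x}" for b in utf8_bytes])
```

The byte sequence is longer (19 symbols instead of 14), but every symbol comes from a fixed alphabet of 256 values, no matter which characters appear in the text.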
With either input, the BPE segmentation (after training) may end with something like this:
Th|at|’s|great|👍
(This is a hypothetical segmentation, but it's possible that capitalised “That” is too rare to be represented as a single segment.)
The number of BPE operations is different though: to arrive at the segment ’s, only one merge step is required for code-point input, but three steps for byte input.
With byte input, the BPE segmentation is likely to end up with sub-character segments for rare characters.
The down-stream language model will have to learn to deal with that kind of input.
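The merge counts mentioned above follow from the fact that each BPE merge joins two adjacent symbols into one, so a segment made of n base symbols needs n - 1 merges. A minimal sketch (the helper `merges_needed` is hypothetical and not part of any BPE library):

```python
def merges_needed(segment: str, byte_level: bool) -> int:
    """Number of pairwise merges needed to build `segment` from base symbols."""
    if byte_level:
        symbols = list(segment.encode("utf-8"))  # one symbol per byte
    else:
        symbols = list(segment)                  # one symbol per code point
    return len(symbols) - 1


apostrophe_s = "\u2019s"  # ’s
print(merges_needed(apostrophe_s, byte_level=False))  # 1
print(merges_needed(apostrophe_s, byte_level=True))   # 3
```

Whether training actually spends those merges on a given segment depends on how frequent it is in the corpus, which is why rare characters may stay split into sub-character pieces.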
So you already know BPE, right? Byte-level BPE is an improvement in how the base vocabulary is defined. Recall that there are 143,859 characters in Unicode, and yet the GPT-2 vocabulary size is just 50,257. A base vocabulary of roughly 140K would grow even larger during training (where frequently co-occurring symbols are merged into new tokens).
To solve this issue GPT-2 uses a byte-level process with a base vocabulary of just 256 symbols, with which any Unicode character can be represented as one or more byte-level symbols. I still don't know the process by which a Unicode character is converted to its byte-level representation.
Does this explanation make it clear why we go to a byte-level representation? Once again, GPT-2 starts from this 256-symbol base vocabulary and grows the vocabulary by adding frequently co-occurring symbol sequences.
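For what it's worth, the conversion in question is ordinary UTF-8 encoding, as far as I can tell. The sketch below (illustrative variable names, standard library only) walks every encodable code point and confirms that each one becomes 1 to 4 bytes, all drawn from the same 256 possible byte values:

```python
used_byte_values = set()
lengths = set()

for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:      # surrogates cannot be encoded in UTF-8
        continue
    encoded = chr(cp).encode("utf-8")
    used_byte_values.update(encoded)
    lengths.add(len(encoded))

print(sorted(lengths))               # possible sequence lengths
print(len(used_byte_values) <= 256)  # the base alphabet never exceeds 256
```

So the base vocabulary does not have to list characters at all: it lists byte values, and characters are spelled out as short byte sequences.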

Hex to / from datetime stamp

I've an application running on Windows, I don't have the source code, the GUI presents the date as 22/06/2018 08:44, this date/time is written/read from a file. This file contains a Hex representation of the date, some examples below (the latter two have been edited by myself - hence the weird year).
2C 05 0A D4 01 (22/06/2018 08:44)
2C 06 0A D4 01 (22/06/2018 08:51)
2C 08 11 D4 01 (01/07/2018 06:53)
B4 AE 08 D4 01 (06/12/5671 13:13)
B4 AE 11 12 10 (31/07/5270 10:53)
I'm trying to understand the conversion from Hex to the GUI date/time, so that I could modify the Hex in the file direct and see the GUI date/time accordingly
Thanks
Edit: The hex numbers are standard Windows 64-bit values representing the number of 100-nanosecond intervals since January 1, 1601, with the three least significant bytes omitted and written as little endian (least significant byte first). For example, your first hex string, 2C 05 0A D4 01, means hex 01D4 0A05 2C00 0000 units of 100 nanoseconds since January 1, 1601 UTC (this is precisely 22/06/2018 08:44:02.9898752 UTC, but your GUI omits seconds and fractions of a second).
You can read more here: File Times on MSDN.
For the conversion from date and time to hex you may for example use http://www.silisoftware.com/tools/date.php?inputdate=2018-06-22T08%3A44%3A00%2B00%3A00&inputformat=text, enter your date as 2018-06-22T08:44:00+00:00 and get the hex out as 01D40A05:2A37C800. Round up so it ends in three zero bytes: 01D40A05:2B000000. Reorder the remaining bytes: 2B 05 0A D4 01.
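If you would rather not depend on a web tool, the same conversion can be sketched in Python. The helper names `decode_stamp` and `encode_stamp` are made up for this example, and the code assumes the layout described above (the five stored bytes are the upper five bytes of a little-endian 64-bit tick count) and Python 3.8 or later for `bytes.hex(" ")`.

```python
from datetime import datetime, timedelta, timezone

EPOCH_1601 = datetime(1601, 1, 1, tzinfo=timezone.utc)
STEP = 1 << 24  # value of one unit in the lowest stored byte (3 bytes omitted)


def decode_stamp(hex_bytes: str) -> datetime:
    """Turn the five bytes from the file into a UTC datetime."""
    raw = b"\x00\x00\x00" + bytes.fromhex(hex_bytes)  # restore omitted low bytes
    ticks = int.from_bytes(raw, "little")             # 100 ns units since 1601
    return EPOCH_1601 + timedelta(microseconds=ticks // 10)


def encode_stamp(moment: datetime) -> str:
    """Turn a timezone-aware datetime into the five bytes, rounding up."""
    delta = moment - EPOCH_1601
    ticks = (delta.days * 86400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
    ticks = (ticks + STEP - 1) // STEP * STEP          # low three bytes become zero
    return ticks.to_bytes(8, "little")[3:].hex(" ").upper()


print(decode_stamp("2C 05 0A D4 01"))
print(encode_stamp(datetime(2018, 6, 22, 8, 44, tzinfo=timezone.utc)))  # 2B 05 0A D4 01
```

Because the lowest stored byte counts units of 2^24 ticks (about 1.68 seconds), the round trip is only accurate to a couple of seconds, which is invisible in a GUI that shows minutes.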
Original answer
It’s not a date-time encoding scheme that I have met before. And from the data you have provided I am not able to deduce the full scheme. I believe I have found a bit of the scheme. I cannot get further.
Assuming some linear correspondence I first note by comparing the first two samples that a difference of 1 unit of the second group of hex digits (the second byte if you will) makes for a difference of 7 minutes. Or approximately: we don’t know if the times have seconds and maybe even fractions of seconds that are not displayed.
I used this information when comparing to the third sample. The third byte has increased by 7 from the first to the third sample (hex 11 - hex 0A = 7). Taking the increase on the second byte into account it would seem that one unit of the third byte approximates 1832 minutes, which is suspiciously close to 256 * 7 minutes = 1792 minutes. So it would seem that the 2nd and 3rd bytes have a “little endian” relationship, where the 3rd byte is more significant than the 2nd. Using this information we can obtain a little more accuracy: The difference in the times is 12849 minutes, and the difference on the 2nd and 3rd byte is hex 1108 - 0A05 = decimal 1795, so each unit is 7.1582 minutes (it agrees with the 7 minutes from before, only it’s more precise). Using this value I interpolated the second date-time from the hex value 2C 06 0A D4 01 and got 2018-06-22T08:51:09. It agrees. Hypothesis confirmed!
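The arithmetic in this step can be reproduced with a short sketch (variable names are illustrative). The last line compares the estimate with what the 100-nanosecond interpretation from the edit at the top would predict for one unit of the second byte, namely 2^32 ticks:

```python
from datetime import datetime

t1 = datetime(2018, 6, 22, 8, 44)   # 2C 05 0A D4 01
t3 = datetime(2018, 7, 1, 6, 53)    # 2C 08 11 D4 01

minutes = (t3 - t1).total_seconds() / 60   # elapsed minutes between samples
units = 0x1108 - 0x0A05                    # bytes 3 and 2 read as one number
minutes_per_unit = minutes / units

# One unit of the second stored byte is 2**32 ticks of 100 ns each.
filetime_minutes_per_unit = 2**32 * 100e-9 / 60

print(minutes, units, round(minutes_per_unit, 4))
print(round(filetime_minutes_per_unit, 4))
```

The two figures agree to within the rounding of the displayed minutes, which is why the linear guess worked for bytes 2 and 3.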
The information found so far suffices for encoding values between 09/06/2018 14:43 (2C 00 00 D4 01) and 01/05/2019 09:17 (2C FF FF D4 01) with a precision of 7 minutes. I’d be surprised if that were enough for you.
Comparing to the value in the 4th sample it would seem that one unit on the first byte corresponds to 14 128 940 minutes (26.86 years). It doesn’t divide nicely by the 7.1582 minutes from before, as we might have hoped, so I’m not sure how we might use this observation.
Comparing the last two samples it seems that the 4th and 5th byte cannot have the same little endian relationship since the 5th byte increases while the date decreases. It’s still possible, though, if we assume that at least one of the years is before the common era (“BC”) since era is not printed. Another possibility might be that the fifth byte is ignored. This leads to a unit of the fourth byte corresponding to 1 088 006 minutes. Again it bears no nice relationship to the 7.15 minutes from bytes 2 and 3, and it’s suspiciously close to the unit of the first byte, so probably incorrect.
To learn more: First try to see if you get a meaningful date-time from editing (hex) 00 00 00 00 00 into your file. If you do, next try one F at a time:
F0 00 00 00 00
0F 00 00 00 00
…
00 00 00 00 0F
If this doesn’t make a pattern that is clear enough, try one bit at a time, using hex digits 1, 2, 4 and 8 instead of F.

Storing unicode code points, high-endian or low-endian mode?

In his famous blog post The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) Joel said:
The earliest idea for Unicode encoding, which led to the myth about
the two bytes, was, hey, let's just store those numbers in two bytes
each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn't it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
The second representation is faster? Why?
How does swapping the high and low bytes affect performance?
The sentence "Not so fast!" isn't about computing performance but a way to say "hey, don't make assumptions so fast, here's another way to look at it".
The question is Mu.
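For reference, the two byte sequences from the quote are simply the two byte orders of the same 16-bit numbers, which a short Python sketch makes visible (variable names are illustrative):

```python
word = "Hello"

# Same code points, two ways of laying out each 16-bit value in bytes.
big_endian = word.encode("utf-16-be").hex(" ").upper()
little_endian = word.encode("utf-16-le").hex(" ").upper()

print(big_endian)     # 00 48 00 65 00 6C 00 6C 00 6F
print(little_endian)  # 48 00 65 00 6C 00 6C 00 6F 00
```

Neither layout is inherently quicker to process; they differ only in which byte of each pair comes first.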

Oddity when encoding large integers using asn.1

I have found numerous references to the encoding requirements of Integers in ASN.1
and that Integers are inherently signed objects
TLV 02 02 0123 for example.
However, I have a 256-byte (2048-bit) integer (within a certificate) encoded
30 82 01 09 02 82 01 00 d1 a5 xx xx xx… 02 03 010001
30 start
82 2 byte length
0109 265 bytes
02 Integer
82 2 byte length
0100 256 bytes
d1 a5 xxxx
The d1 is the troubling part because the leading bit is 1, meaning this 2048-bit number is signed when in fact it is an unsigned number, an RSA public key in fact. Does the signed constraint apply to Integers > 64 bits?
Thanks,
BER/DER uses two's-complement representation for encoding integer values. This means that the first bit (not byte) determines whether a number is positive or negative, so sometimes an extra leading zero byte needs to be added to prevent the first bit from causing the integer to be interpreted as a negative number. Note that it is invalid BER/DER to have the first 9 bits all zero.
Yes, you are right. For any non negative DER/BER-encoded INTEGER - no matter its length - the MSB of the first payload byte is 0.
The program that generated such a key is incorrect.
The "signed constraint" (actually, a rule) totally applies to any size integers. However, depending on a domain you might find all sorts of oddities in how domain objects are encoded. This is something that has to be learned and accounted for the hard way, unfortunately.
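To make the rule concrete, here is a minimal sketch of a DER INTEGER encoder in Python. The function names `der_length` and `der_integer` are made up for this example, and it only covers the definite-length form; it is not taken from any ASN.1 library.

```python
def der_length(n: int) -> bytes:
    """Definite-length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(raw)]) + raw


def der_integer(value: int) -> bytes:
    """Tag 0x02, length, then the shortest two's-complement content."""
    magnitude = value if value >= 0 else ~value
    size = magnitude.bit_length() // 8 + 1     # leaves room for the sign bit
    content = value.to_bytes(size, "big", signed=True)
    return b"\x02" + der_length(len(content)) + content


print(der_integer(0x0123).hex(" "))  # 02 02 01 23
print(der_integer(65537).hex(" "))   # 02 03 01 00 01
print(der_integer(0xD1A5).hex(" "))  # 02 03 00 d1 a5

# A 2048-bit value whose top byte is d1 needs 257 content bytes.
modulus_like = 0xD1A5 << (8 * 254)
print(der_integer(modulus_like)[:7].hex(" "))  # 02 82 01 01 00 d1 a5
```

In other words, a correctly encoded positive 2048-bit number with its top bit set shows up as 02 82 01 01 00 d1 ..., with a length of 257 and a leading zero byte.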

Difference between Big Endian and little Endian Byte order

What is the difference between Big Endian and Little Endian Byte order ?
Both of these seem to be related to Unicode and UTF16. Where exactly do we use this?
Big-Endian (BE) / Little-Endian (LE) are two ways to organize multi-byte words. For example, when using two bytes to represent a character in UTF-16, there are two ways to represent the character 0x1234 as a string of bytes (0x00-0xFF):
Byte Index: 0 1
---------------------
Big-Endian: 12 34
Little-Endian: 34 12
In order to decide whether a text uses UTF-16BE or UTF-16LE, the specification recommends prepending a Byte Order Mark (BOM) to the string, representing the character U+FEFF. So, if the first two bytes of a UTF-16 encoded text file are FE, FF, the encoding is UTF-16BE. For FF, FE, it is UTF-16LE.
A visual example: The word "Example" in different encodings (UTF-16 with BOM):
Byte Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
------------------------------------------------------------
ASCII: 45 78 61 6d 70 6c 65
UTF-16BE: FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE: FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00
For further information, please read the Wikipedia page of Endianness and/or UTF-16.
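The table above can be reproduced, and the BOM check automated, with a short Python sketch. The helper names `encode_utf16_with_bom` and `sniff_utf16` are illustrative; the BOM constants come from the standard `codecs` module.

```python
import codecs


def encode_utf16_with_bom(text: str, byte_order: str) -> bytes:
    """Encode text as UTF-16 in the given byte order, BOM first."""
    if byte_order == "big":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def sniff_utf16(data: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes."""
    # Note: FF FE followed by 00 00 is also the UTF-32LE BOM; not handled here.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return "unknown"


for order in ("big", "little"):
    data = encode_utf16_with_bom("Example", order)
    print(sniff_utf16(data), data.hex(" "))
```

Text without a BOM comes back as "unknown", in which case the byte order has to be agreed on by other means.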
Ferdinand's answer (and others) are correct, but incomplete.
Big Endian (BE) / Little Endian (LE) have nothing to do with UTF-16 or UTF-32.
They existed way before Unicode, and affect how the bytes of numbers get stored in the computer's memory. They depend on the processor.
If you have a number with the value 0x12345678 then in memory it will be represented as 12 34 56 78 (BE) or 78 56 34 12 (LE).
UTF-16 and UTF-32 happen to use 2-byte and 4-byte code units respectively, so the order of the bytes follows the ordering that any number follows on that platform.
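The same thing can be seen for a plain integer, with no Unicode involved. A small sketch using the standard `struct` module (variable names are illustrative):

```python
import struct
import sys

value = 0x12345678

big = struct.pack(">I", value)      # most significant byte first
little = struct.pack("<I", value)   # least significant byte first
native = struct.pack("=I", value)   # whatever this machine uses

print(big.hex(" "))      # 12 34 56 78
print(little.hex(" "))   # 78 56 34 12
print(sys.byteorder, native.hex(" "))
```

On typical x86 and ARM machines the native order printed on the last line is little-endian.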
UTF-16 encodes Unicode into 16-bit values. Most modern filesystems operate on 8-bit bytes. So, to save a UTF-16 encoded file to disk, for example, you have to decide which part of the 16-bit value goes in the first byte, and which goes into the second byte.
Wikipedia has a more complete explanation.
little-endian: adj.
Describes a computer architecture in which, within a given 16- or 32-bit word, bytes at lower addresses have lower significance (the word is stored ‘little-end-first’). The PDP-11 and VAX families of computers and Intel microprocessors and a lot of communications and networking hardware are little-endian. The term is sometimes used to describe the ordering of units other than bytes; most often, bits within a byte.
big-endian: adj.
[common; From Swift's Gulliver's Travels via the famous paper On Holy Wars and a Plea for Peace by Danny Cohen, USC/ISI IEN 137, dated April 1, 1980]
Describes a computer architecture in which, within a given multi-byte numeric representation, the most significant byte has the lowest address (the word is stored ‘big-end-first’). Most processors, including the IBM 370 family, the PDP-10, the Motorola microprocessor families, and most of the various RISC designs are big-endian. Big-endian byte order is also sometimes called network order.
---from the Jargon File: http://catb.org/~esr/jargon/html/index.html
Byte endianness (big or little) needs to be specified for UTF-16 (and UTF-32) because their code units are more than one byte wide, so there is a choice of whether to read/write the most significant byte first or last. (Note however that UTF-8 code units are always 8 bits/one byte in length [though a character can take several of them], therefore there is no problem with endianness.) If the encoder of a stream of bytes representing Unicode text and the decoder don't agree on which convention is being used, the wrong character code can be interpreted. For this reason, either the endianness convention is known beforehand or, more commonly, a byte order mark is placed at the beginning of the Unicode text file/stream to indicate whether big- or little-endian order is being used.
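A small sketch of what goes wrong when the two sides disagree (variable names are illustrative): the same two bytes decode to different characters depending on the assumed byte order, while the UTF-8 bytes leave no such choice.

```python
dalet = "\u05d3"                      # HEBREW LETTER DALET
data = dalet.encode("utf-16-be")      # bytes 05 d3

right = data.decode("utf-16-be")      # decoder agrees with encoder
wrong = data.decode("utf-16-le")      # decoder assumes the other order

print(f"U+{ord(right):04X}")          # U+05D3
print(f"U+{ord(wrong):04X}")          # U+D305, an unrelated character
print(dalet.encode("utf-8").hex(" ")) # d7 93, the only possible byte order
```

The wrong decode does not raise an error here, which is exactly why a BOM or an out-of-band agreement is needed.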