UTF-8 multibyte & BOM

I read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html
But I didn't understand how UTF-8 solves the big-endian/little-endian machine problem.
For a single byte it's fine, but how does it work for multibyte characters?
Can someone explain this better?

Here is a link that explains UTF-8 in depth: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
At the heart of it, UTF-16 is short-integer (16-bit) oriented and UTF-8 is byte oriented. Since architectures can differ in how the bytes of a data type are ordered (big endian, little endian), the UTF-16 encoding can go either way. On all architectures I am aware of there is no endianness at the nibble or semi-octet level; every byte is a sequential series of 8 bits. Therefore UTF-8 has no endianness.
The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).
UTF-16BE: 30, 42 = 0011 0000 : 0100 0010
UTF-16LE: 42, 30 = 0100 0010 : 0011 0000
UTF-8: e3, 81, 82 = 1110 0011 : 10 000001 : 10 000010
Here is some more information on the Unicode character あ.
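As a small illustration of the point about byte orientation (my own sketch, not part of the linked material), here is how a code point such as U+3042 is split into UTF-8 bytes; since each output unit is a single byte, the result is identical on big-endian and little-endian machines:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point in the U+0800..U+FFFF range (the 3-byte case)
       into UTF-8. The bit layout 1110xxxx 10xxxxxx 10xxxxxx is fixed by
       the encoding itself, so no byte-order decision is ever made. */
    static void utf8_encode3(uint32_t cp, uint8_t out[3])
    {
        out[0] = 0xE0 | (uint8_t)(cp >> 12);          /* top 4 bits    */
        out[1] = 0x80 | (uint8_t)((cp >> 6) & 0x3F);  /* middle 6 bits */
        out[2] = 0x80 | (uint8_t)(cp & 0x3F);         /* low 6 bits    */
    }

    int main(void)
    {
        uint8_t buf[3];
        utf8_encode3(0x3042, buf);                           /* あ */
        printf("%02X %02X %02X\n", buf[0], buf[1], buf[2]);  /* E3 81 82 */
        return 0;
    }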

There is no endianness problem with UTF-8. The problem arises with UTF-16, because a sequence of two-byte units has to be treated as a sequence of bytes when it is written to a file or a communication stream, which may have a different idea about the byte order within a two-byte number. Because UTF-8 works at the byte level, there is no need for a BOM to parse the sequence correctly on both big-endian and little-endian machines. It does not matter that a character may be multibyte: UTF-8 defines exactly the order in which the bytes of a multi-byte encoding of a code point must appear.
The BOM in UTF-8 is for something completely different (so the name 'Byte Order Mark' is a little off). It is there to signal "this is going to be a UTF-8 stream". The UTF-8 BOM is generally unpopular and many programs do not handle it correctly; the site utf8everywhere.org argues it should be deprecated in the future.
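To illustrate that "signature" role (a minimal sketch of my own, not something the answer above prescribes), a reader can simply check for and skip the three bytes EF BB BF at the start of a buffer; they carry no byte-order information at all:

    #include <stddef.h>
    #include <stdint.h>

    /* Return how many bytes to skip: 3 if the buffer starts with the
       UTF-8 signature EF BB BF, otherwise 0. */
    static size_t skip_utf8_signature(const uint8_t *buf, size_t len)
    {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return 3;
        return 0;
    }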

Related

Why UTF-8 encoding doesn't need a Byte Order Mark?

The Unicode FAQ mentions that UTF-8 doesn't need a BOM:
Q: Is the UTF-8 encoding scheme the same irrespective of whether the
underlying processor is little endian or big endian?
A: Yes. Since UTF-8 is interpreted as a sequence of bytes, there is no
endian problem as there is for encoding forms that use 16-bit or
32-bit code units. Where a BOM is used with UTF-8, it is only used as
an encoding signature to distinguish UTF-8 from other encodings — it
has nothing to do with byte order.
For code points above U+0744, UTF-8 needs 2 to 4 bytes to represent them. Doesn't it need a BOM to specify the endianness of these bytes, or does UTF-8 adopt a default?
UTF-8 gives a strict definition for the order of the bytes that encode a character. No variation between computing platforms is allowed.
For example, the Euro sign U+20AC must be encoded as the byte sequence \xE2\x82\xAC. No other ordering of these bytes is permitted.
UTF-8 uses 1-byte code units, so there is no need for a BOM to indicate a byte order, because there is only 1 byte order possible, and the encoding algorithm determines the ordering of the bytes. For example, U+0744 is encoded in UTF-8 as code units 0xDD 0x84, which are represented in bytes as DD 84. Bytes 84 DD would be an illegal UTF-8 sequence.
UTF-16 and UTF-32, by contrast, use 2-byte and 4-byte code units, respectively. The encoding algorithm determines the order of the code units, but since the code units themselves are multi-byte, they are subject to endianness. For example, U+0744 is encoded in UTF-16 as code unit 0x0744 and in UTF-32 as code unit 0x00000744, which are represented in bytes as 07 44 or 44 07 in UTF-16, and as 00 00 07 44 or 44 07 00 00 in UTF-32, depending on endianness.
So a BOM makes sense to indicate which byte order is actually being used for UTF-16/32, but not for UTF-8.
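To make the code-unit/byte distinction concrete, here is a small sketch (my own example, reusing the U+0744 example above) that serializes the code point both ways: UTF-8 always produces the fixed bytes DD 84, while the single UTF-16 code unit 0x0744 becomes 07 44 or 44 07 depending on which byte order is chosen when writing it out:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t cp = 0x0744;

        /* UTF-8: two code units of one byte each; their order is fixed
           by the encoding algorithm. */
        uint8_t u8[2] = { 0xC0 | (uint8_t)(cp >> 6), 0x80 | (uint8_t)(cp & 0x3F) };

        /* UTF-16: one 16-bit code unit; the byte order is a separate,
           platform- or protocol-level choice. */
        uint16_t u16 = (uint16_t)cp;
        uint8_t be[2] = { (uint8_t)(u16 >> 8), (uint8_t)(u16 & 0xFF) };  /* 07 44 */
        uint8_t le[2] = { (uint8_t)(u16 & 0xFF), (uint8_t)(u16 >> 8) };  /* 44 07 */

        printf("UTF-8   : %02X %02X\n", u8[0], u8[1]);   /* DD 84 */
        printf("UTF-16BE: %02X %02X\n", be[0], be[1]);
        printf("UTF-16LE: %02X %02X\n", le[0], le[1]);
        return 0;
    }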

Is a .txt expected to be in UTF-8 encoding these days? Must I end it with .utf8?

I'm producing plain-text files. I do not use ASCII/ANSI but UTF-8 encoding, since the year is 2020 and not 1995. Unicode/UTF-8 is very well established now and it would be madness to assume no UTF-8 support these days.
At the same time, I have a feeling that plain-text files (.txt) are associated with ANSI/ASCII encoding, as if, because they look so primitive, they must also use a primitive encoding.
However, I wish to use all kinds of Unicode characters, and not just be limited to the basic ANSI/ASCII ones.
Since plain text has no metadata like HTML does, there is (as far as I know) no way to tell the reader that this .txt uses Unicode/UTF-8, and from what I have learned you cannot detect it reliably but have to make "educated guesses".
I have seen people add .utf8 to the end of text files before, but this seems kind of ugly and I strongly question how widespread support for this is...
Should I do this?
test.txt.utf8
Whenever the .txt file is using UTF-8? Or will it just make it even harder for people to open them with no actual benefit in terms of detecting it as UTF-8?
You do not elaborate on the use case of the text files you generate, but the way to tell the reader that this .txt uses Unicode/UTF-8 is actually a Byte Order Mark at the beginning of the text file. The way the BOM is represented in actual bytes tells the reader which Unicode encoding to use to read the file.
From the Unicode FAQ:
Bytes        Encoding Form
00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8
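A rough sketch of how a reader could use that table to sniff the encoding (my own illustration, not part of the FAQ; note that the UTF-32LE pattern must be tested before the UTF-16LE one, which is exactly the ambiguity discussed in the last question below):

    #include <stddef.h>
    #include <stdint.h>

    /* Guess an encoding from the leading BOM bytes; returns NULL when no
       BOM is present and the caller has to fall back to a default. */
    static const char *sniff_bom(const uint8_t *b, size_t len)
    {
        if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32BE";
        if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32LE";   /* must come before the UTF-16LE test */
        if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
        return NULL;
    }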

UTF-16 big-endian to UTF-8 conversion failed for 0xdcf0

I am trying to convert UTF-16 to UTF-8. For the string 0xdcf0, the conversion fails with "invalid multi-byte sequence". I don't understand why the conversion fails. In the library I am using to do the UTF-16 to UTF-8 conversion, there is a check:
if ((first_byte & 0xfc) == 0xdc) {
    return -1;
}
Can you please help me understand why this check is present?
Unicode characters in the DC00–DFFF range are "low" surrogates, i.e. they are used in UTF-16 as the second part of a surrogate pair, the first part being a "high" surrogate in the range D800–DBFF.
See e.g. the Wikipedia article on UTF-16 for more information.
The reason you cannot convert to UTF-8 is that you only have half of a Unicode code point.
In UTF-16, the two-byte sequence
DC F0
cannot begin the encoding of any character at all.
The way UTF-16 works is that some characters are encoded in 2 bytes and some characters are encoded in 4 bytes. The characters that are encoded with two bytes use 16-bit sequences in the ranges:
0000 .. D7FF
E000 .. FFFF
All other characters require four bytes to be encoded in UTF-16. For these characters the first pair of bytes must be in the range
D800 .. DBFF
and the second pair of bytes must be in the range
DC00 .. DFFF
This is how the encoding scheme is defined. See the Wikipedia page for UTF-16.
Notice that the FIRST sixteen bits of the encoding of a character can NEVER be in the range DC00 through DFFF. That is simply not allowed in UTF-16. This (if you follow the bitwise arithmetic in the code you found) is exactly what is being checked for.
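Here is the same rule as a sketch in code (my own, assuming the 16-bit code units have already been read with the correct byte order): a unit in DC00..DFFF may only appear as the second half of a pair, so encountering it first is an immediate error, which is exactly what an input starting with 0xDCF0 triggers:

    #include <stdint.h>

    /* Decode one code point from an array of UTF-16 code units.
       Returns the number of units consumed, or -1 on an invalid sequence. */
    static int utf16_decode_one(const uint16_t *u, int avail, uint32_t *cp)
    {
        if (avail < 1)
            return -1;
        uint16_t w1 = u[0];

        if (w1 < 0xD800 || w1 > 0xDFFF) {   /* 0000..D7FF or E000..FFFF */
            *cp = w1;
            return 1;
        }
        if (w1 >= 0xDC00)                   /* lone low surrogate, e.g. DCF0 */
            return -1;
        if (avail < 2)                      /* high surrogate with nothing after it */
            return -1;
        uint16_t w2 = u[1];
        if (w2 < 0xDC00 || w2 > 0xDFFF)     /* high surrogate not followed by low */
            return -1;
        *cp = 0x10000 + (((uint32_t)(w1 - 0xD800) << 10) | (uint32_t)(w2 - 0xDC00));
        return 2;
    }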

How to encode ASCII text in binary opcode instructions?

I do not need a refresher on anything. I am generally asking how one would encode a data string within the data segment of a binary file for execution on bare metal.
Purpose? Say I am writing a bootloader and I need to define the static data used in the file to represent the string to move to a memory address, for example with the Genesis TMSS issue.
I assume that the binary encoding of an address is literally the binary equivalent of its hexadecimal representation in Motorola 68000 memory mapping, so that's not an issue for now.
The issue is: how do I encode strings/characters/glyphs in binary code within an M68000 opcode? I read the manual, references, etc., and none of them quite touch on this (from what I read).
Say I want to encode move.l #'SEGA', $A14000. I would get this resulting opcode (without considering how to encode the ASCII characters):
0010 1101 0100 0010 1000 0000 0000 0000
Nibble 1 = MOVE LONG, Nibble 2 = MEMORY ADDRESSING MODE, Following three bytes equals address.
My question is, do I possibly encode each string in literal ASCII per character as part of the preceding MAM nibble of the instruction?
I am confused at this point, and was hoping somebody might know how to encode data text within an instruction.
Well, I have experience programming in 4 different assembly languages, and Motorola M68HC11 is one of them. In my experience, ASCII is just for display purposes. At the lowest level the CPU treats everything as binary values; it cannot distinguish ASCII characters from other data. Higher-level instruction sets such as x86 do offer instructions like AAA (ASCII adjust after addition), which make sure that after adding two ASCII digits the result is still a legal ASCII digit.
So mainly it is assembler dependent: if the assembler supports the instruction move.l #'SEGA', $A14000, this might work. But since you are not using an assembler and are writing opcodes directly, you have to encode the ASCII into binary yourself; for example, the ASCII digit '1' (0x31) would be encoded as 0000 0000 0011 0001 in a 16-bit representation. Also, in my experience there is no machine instruction that can move a whole string at once, so in the machine code the first character is fetched and copied to the destination address, then the second character is fetched and copied to the second location, and so on.
Assuming the instruction is 32 bits long and an immediate addressing mode is supported, the first two nibbles would indicate the move instruction and the immediate addressing type, the next two nibbles would be the binary-encoded character, and the remaining nibbles would be the address you want to copy it to. Hope this helps.
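To make that concrete for the TMSS case: if I read the 68000 MOVE encoding correctly (please double-check against the M68000 Programmer's Reference Manual), move.l #'SEGA',$A14000 assembles to one opcode word followed by the 32-bit immediate and the 32-bit absolute address, so the ASCII bytes of 'SEGA' simply sit in the instruction stream as the immediate operand:

    #include <stdio.h>

    /* move.l #'SEGA',$A14000 laid out as bytes, in the big-endian order
       the 68000 fetches them: opcode word, 32-bit immediate, 32-bit
       absolute long address. 'S' 'E' 'G' 'A' are just their ASCII values. */
    static const unsigned char tmss[] = {
        0x23, 0xFC,              /* move.l #imm,(abs).l             */
        0x53, 0x45, 0x47, 0x41,  /* 'S' 'E' 'G' 'A' immediate data  */
        0x00, 0xA1, 0x40, 0x00   /* destination address $A14000     */
    };

    int main(void)
    {
        for (size_t i = 0; i < sizeof tmss; i++)
            printf("%02X ", (unsigned)tmss[i]);
        printf("\n");
        return 0;
    }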

Unicode BOM for UTF-16LE vs UTF-32LE

It seems like there's an ambiguity between the Byte Order Marks used for UTF-16LE and UTF-32LE. In particular, consider a file that contains the following 8 bytes:
FF FE 00 00 00 00 00 00
How can I tell if this file contains:
The UTF-16LE BOM (FF FE) followed by 3 null characters; or
The UTF-32LE BOM (FF FE 00 00) followed by one null character?
Unicode BOMs are described here: http://unicode.org/faq/utf_bom.html#bom4 but there's no discussion of this ambiguity. Am I missing something?
As the name suggests, the BOM only tells you the byte order, not the encoding. You have to know what the encoding is first, then you can use the BOM to determine whether the least or most significant bytes are first for multibyte sequences.
A fortunate side-effect of the BOM is that you can also sometimes use it to guess the encoding if you don't know it, but that is not what it was designed for and it is no substitute for sending proper encoding information.
It is unambiguous. FF FE is for UTF-16LE, and FF FE 00 00 denotes UTF-32LE. There is no reason to think that FF FE 00 00 is possibly UTF-16LE because the UTFs were designed for text, and users shouldn't be using NUL characters in their text. After all, when was the last time you opened a hex editor and inserted a few bytes of 00 into a text document? ^_^
I have experienced the same problem as Edward. I agree with Dustin that usually one will not use null characters in text files.
However, I created a file that contains all Unicode characters, first in the UTF-32LE encoding, then in UTF-32BE, UTF-16LE and UTF-16BE encodings, as well as in UTF-8.
When trying to re-encode the files to UTF-8, I wanted to compare the result to the already existing UTF-8 file. Because the first character in my files after the BOM is the null character, I could not correctly detect the file with the UTF-16LE BOM; it showed up as UTF-32LE, because the bytes appeared exactly as Edward described. The first character after the BOM FF FE is 00 00, but the BOM detection found a BOM of FF FE 00 00 and so detected UTF-32LE instead of UTF-16LE, whereby my first U+0000 character was swallowed as part of the BOM.
So one should never use a null character as the first character of a file encoded as UTF-16 little-endian, because it makes the UTF-16LE and UTF-32LE BOMs ambiguous.
To solve my problem, I will swap the first and second characters. :-)
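For what it is worth, here is a rough heuristic (my own sketch, and by no means bulletproof) for the FF FE 00 00 case: only treat the stream as UTF-32LE when the data after the first four bytes still decodes as plausible UTF-32; a stray leading U+0000 like the one described above will still fool it, which is why swapping the characters is the pragmatic fix:

    #include <stddef.h>
    #include <stdint.h>

    /* Given a buffer that starts with FF FE 00 00, guess between UTF-32LE
       and UTF-16LE. Purely heuristic: if every 32-bit little-endian unit
       after the first four bytes is a valid Unicode scalar value, assume
       UTF-32LE; otherwise assume UTF-16LE with a leading U+0000. */
    static const char *guess_fffe0000(const uint8_t *b, size_t len)
    {
        if (len % 4 != 0)
            return "UTF-16LE";
        for (size_t i = 4; i + 4 <= len; i += 4) {
            uint32_t cp = (uint32_t)b[i]
                        | ((uint32_t)b[i + 1] << 8)
                        | ((uint32_t)b[i + 2] << 16)
                        | ((uint32_t)b[i + 3] << 24);
            if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
                return "UTF-16LE";
        }
        return "UTF-32LE";
    }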