UTF-8 encoding makes me confused - Unicode

let buf1 = Buffer.from("3", "utf8");            // <Buffer 33>
let buf2 = Buffer.from("Здравствуйте", "utf8"); // <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>
Why does char '3' encode to '33' in buf1 but 'd0 97' in buf2?

Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.
The former is Unicode code point U+0033 - DIGIT THREE, while the latter is U+0417 - CYRILLIC CAPITAL LETTER ZE, encoded in UTF-8 as d0 97.
The Russian word actually means "hello", pronounced (very roughly, since I only know "hello" and "goodbye", taught by a Russian girlfriend many decades ago) "Strasvoytza"; there is no "three" anywhere in the concept.

The first character of the second buffer is the Cyrillic letter "Ze" (https://en.m.wikipedia.org/wiki/Ze_(Cyrillic)), not the Arabic numeral 3 (https://en.m.wikipedia.org/wiki/3).
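To see the difference programmatically, here is a small Python sketch (my own illustration, not from either answer) that prints each character's code point and its UTF-8 bytes:

# '3' is the ASCII digit; 'З' is the Cyrillic capital letter Ze.
for ch in ("3", "З"):
    print(ch, "U+%04X" % ord(ch), ch.encode("utf-8").hex(" "))

# Output:
# 3 U+0033 33
# З U+0417 d0 97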

Related

What does the \u{...} notation mean in Unicode, and why are only some characters displayed like this in the CLDR project?

The linked CLDR page lists the most-used characters for each language. Why are some characters in some languages displayed using the \u{...} notation?
I think that what is inside the brackets is the hexadecimal code of the character, but I can't understand why they only do it for some characters.
The character sequences enclosed in curly brackets {} are digraphs (trigraphs, …) counted as a distinct letter in a given language (presumably with their own place in the alphabet), for instance:
the digraph {ch} in cs (Czech);
the trigraph {dzs} in hu (Hungarian);
more complex examples from kkj (Kako) are shown by the following Python snippet:
>>> kkj='[a á à â {a\u0327} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ\u0301} {ɛ\u0300} {ɛ\u0302} {ɛ\u0327} f g {gb} {gw} h i í ì î {i\u0327} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ\u0301} {ɔ\u0300} {ɔ\u0302} {ɔ\u0327} p r s t u ú ù û {u\u0327} v w y]'
>>> print( kkj)
[a á à â {a̧} b ɓ c d ɗ {ɗy} e é è ê ɛ {ɛ́} {ɛ̀} {ɛ̂} {ɛ̧} f g {gb} {gw} h i í ì î {i̧} j k {kp} {kw} l m {mb} n {nd} nj {ny} ŋ {ŋg} {ŋgb} {ŋgw} o ó ò ô ɔ {ɔ́} {ɔ̀} {ɔ̂} {ɔ̧} p r s t u ú ù û {u̧} v w y]
>>>
For instance, {a\u0327} renders as {a̧}, i.e. something like Latin Small Letter A with Combining Cedilla, which has no precomposed Unicode equivalent. A counterexample is
ņ (U+0146) LATIN SMALL LETTER N WITH CEDILLA, with canonical decomposition 006E 0327:
>>> import unicodedata
>>> print( 'ņ', unicodedata.normalize('NFC','{n\u0327}'))
ņ {ņ}
Edit:
Characters presented as Unicode literals (\uxxxx = the character with 16-bit hex value xxxx) are those that cannot be rendered on their own (or at least are hard to render). The following Python script shows some of them (Bidi_Class values: L = Left_To_Right, R = Right_To_Left, NSM = Nonspacing_Mark, BN = Boundary_Neutral):
# -*- coding: utf-8 -*-
import unicodedata
pa = 'ੱੰ਼੍ੁੂੇੈੋੌ'
pa = '\u0327 \u0A71 \u0A70 \u0A3C ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ ੴ ੳ ਉ ਊ ਓ ਅ ਆ ਐ ਔ ੲ ਇ ਈ ਏ ਸ {ਸ\u0A3C} ਹ ਕ ਖ {ਖ\u0A3C} ਗ {ਗ\u0A3C} ਘ ਙ ਚ ਛ ਜ {ਜ\u0A3C} ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ {ਫ\u0A3C} ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ \u0A4D ਾ ਿ ੀ \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C'
pa = '\u0300 \u0301 \u0302 \u1DC6 \u1DC7 \u0A71 \u0A70 \u0A3C \u0A4D \u0A41 \u0A42 \u0A47 \u0A48 \u0A4B \u0A4C \u05B7 \u05B8 \u05BF \u200C \u200D \u200E \u200F \u064B \u064C \u064E \u064F \u0650'
# above examples from ·kkj· ·bas· ·pa· ·yi· ·kn· ·ur· ·mzn·
print( pa )
for chr in pa:
    if chr != ' ':
        if chr == '{' or chr == '}':
            print( chr )
        else:
            print( '\\u%04x' % ord(chr), chr,
                   unicodedata.category(chr),
                   unicodedata.bidirectional(chr) + '\t',
                   str( unicodedata.combining(chr)) + '\t',
                   unicodedata.name(chr, '?') )
Result of running .\SO\63659122.py:
̀ ́ ̂ ᷆ ᷇ ੱ ੰ ਼ ੍ ੁ ੂ ੇ ੈ ੋ ੌ ַ ָ ֿ ‌ ‍ ‎ ‏ ً ٌ َ ُ ِ
\u0300 ̀ Mn NSM 230 COMBINING GRAVE ACCENT
\u0301 ́ Mn NSM 230 COMBINING ACUTE ACCENT
\u0302 ̂ Mn NSM 230 COMBINING CIRCUMFLEX ACCENT
\u1dc6 ᷆ Mn NSM 230 COMBINING MACRON-GRAVE
\u1dc7 ᷇ Mn NSM 230 COMBINING ACUTE-MACRON
\u0a71 ੱ Mn NSM 0 GURMUKHI ADDAK
\u0a70 ੰ Mn NSM 0 GURMUKHI TIPPI
\u0a3c ਼ Mn NSM 7 GURMUKHI SIGN NUKTA
\u0a4d ੍ Mn NSM 9 GURMUKHI SIGN VIRAMA
\u0a41 ੁ Mn NSM 0 GURMUKHI VOWEL SIGN U
\u0a42 ੂ Mn NSM 0 GURMUKHI VOWEL SIGN UU
\u0a47 ੇ Mn NSM 0 GURMUKHI VOWEL SIGN EE
\u0a48 ੈ Mn NSM 0 GURMUKHI VOWEL SIGN AI
\u0a4b ੋ Mn NSM 0 GURMUKHI VOWEL SIGN OO
\u0a4c ੌ Mn NSM 0 GURMUKHI VOWEL SIGN AU
\u05b7 ַ Mn NSM 17 HEBREW POINT PATAH
\u05b8 ָ Mn NSM 18 HEBREW POINT QAMATS
\u05bf ֿ Mn NSM 23 HEBREW POINT RAFE
\u200c ‌ Cf BN 0 ZERO WIDTH NON-JOINER
\u200d ‍ Cf BN 0 ZERO WIDTH JOINER
\u200e ‎ Cf L 0 LEFT-TO-RIGHT MARK
\u200f ‏ Cf R 0 RIGHT-TO-LEFT MARK
\u064b ً Mn NSM 27 ARABIC FATHATAN
\u064c ٌ Mn NSM 28 ARABIC DAMMATAN
\u064e َ Mn NSM 30 ARABIC FATHA
\u064f ُ Mn NSM 31 ARABIC DAMMA
\u0650 ِ Mn NSM 32 ARABIC KASRA
It seems that all code points which don't have a well-defined stand-alone appearance (or are not meant to be used as stand-alone characters) are represented with this notation.
For example, U+0A3C appears in the "character" {ਫ\u0A3C}; U+0A3C is a combining code point that modifies the character before it.
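As a quick check of that claim, here is a short Python sketch (my addition, using the standard unicodedata module) inspecting the two code points that make up {ਫ\u0A3C}:

import unicodedata

# U+0A2B is the base letter, U+0A3C the combining nukta from the example above.
for cp in ("\u0A2B", "\u0A3C"):
    print("U+%04X" % ord(cp),
          unicodedata.category(cp),   # 'Mn' = nonspacing (combining) mark
          unicodedata.combining(cp),
          unicodedata.name(cp))

# U+0A2B Lo 0 GURMUKHI LETTER PHA
# U+0A3C Mn 7 GURMUKHI SIGN NUKTA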

How do I print a string in one line in MARIE?

I want to print a set of letters in one line in MARIE. I modified the code to print Hello World and came up with:
ORG 0 / implemented using "do while" loop
WHILE, LOAD STR_BASE / load str_base into ac
ADD ITR / add index to str_base
STORE INDEX / store (str_base + index) into ac
CLEAR / set ac to zero
ADDI INDEX / get the value at ADDR
SKIPCOND 400 / SKIP if ADDR = 0 (or null char)
JUMP DO / jump to DO
JUMP PRINT / JUMP to END
DO, STORE TEMP / output value at ADDR
LOAD ITR / load iterator into ac
ADD ONE / increment iterator by one
STORE ITR / store ac in iterator
JUMP WHILE / jump to while
PRINT, SUBT ONE
SKIPCOND 000
JUMP PR
HALT
PR, OUTPUT
JUMP WHILE
ONE, DEC 1
ITR, DEC 0
INDEX, HEX 0
STR_BASE, HEX 12 / memory location of str
STR, HEX 48 / H
HEX 65 / E
HEX 6C / L
HEX 6C / L
HEX 6F / O
HEX 0 / carriage return
HEX 57 / W
HEX 6F / O
HEX 72 / R
HEX 6C / L
HEX 64 / D
HEX 0 / NULL char
My program ends up halting after two iterations. I can't seem to figure out how to print a set of characters on one line. Thanks.
Your value of STR_BASE is almost certainly incorrect. Based on the code shown here, I would say it needs to be 18 instead of 12. Also, you would want to either replace the null char that currently sits between "HELLO" and "WORLD" with a space (HEX 20), or simply remove that line, depending on your intended output.

How to replace a value with "." in sed

I want to replace all instances of "a number followed by any number of spaces followed by a period and possibly more spaces" with the number and period only.
For example, '14 . x' will become '14.x'.
My test data is:
1. c4 e5 2. g3 c6 { good move. } 3. Bg2 Nf6 4. Nc3 $6 d5 5. cxd5 cxd5 6. Qb3 Nc6 $1.. Nxd5 Nd4
8. Nxf6+ Qxf6 9. Qd1.f5 10. d3 Rc8 (10... Bb4+ $5 11. Bd2 Bxd2+ 12. Qxd2 Qa6 $1.3. Rc1.xa2
14. Bxb7 $2 Rb8 15. Qb4 Bd7) 11. Kf1.c5 12. Nf3 O-O
How can I do that?
If you want any number of spaces removed from either side of the period, you should try s/\([0-9]\) *\. */\1./g:
$ echo '11. A 12 .B 13 . C 14.D 15 . E' | sed 's/\([0-9]\) *\. */\1./g'
11.A 12.B 13.C 14.D 15.E
For your test data, the results are:
1.c4 e5 2.g3 c6 { good move. } 3.Bg2 Nf6 4.Nc3 $6 d5 5.cxd5 cxd5 6.Qb3 Nc6 $1.. Nxd5 Nd4
8.Nxf6+ Qxf6 9.Qd1.f5 10.d3 Rc8 (10... Bb4+ $5 11.Bd2 Bxd2+ 12.Qxd2 Qa6 $1.3.Rc1.xa2
14.Bxb7 $2 Rb8 15.Qb4 Bd7) 11.Kf1.c5 12.Nf3 O-O
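For comparison only (not part of the sed answer), the same substitution written with Python's re module; the pattern mirrors the BRE used above:

import re

line = "11. A 12 .B 13 . C 14.D 15 . E"
# A digit, optional spaces, a period, optional spaces -> the digit immediately followed by the period.
print(re.sub(r"([0-9]) *\. *", r"\1.", line))
# 11.A 12.B 13.C 14.D 15.E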

Why does a Groovy file write with UTF-16LE produce a BOM char?

Do you have any idea why the first and second lines below do not write a BOM to the file, while the third line does? I thought UTF-16LE was the correct encoding name and that this encoding does not automatically add a BOM at the beginning of the file.
new File("foo-wo-bom.txt").withPrintWriter("utf-16le") {it << "test"}
new File("foo-bom1.txt").withPrintWriter("UnicodeLittleUnmarked") {it << "test"}
new File("foo-bom.txt").withPrintWriter("UTF-16LE") {it << "test"}
Another sample:
new File("foo-bom.txt").withPrintWriter("UTF-16LE") {it << "test"}
new File("foo-bom.txt").getBytes().each {System.out.format("%02x ", it)}
prints
ff fe 74 00 65 00 73 00 74 00
and with Java:
PrintWriter w = new PrintWriter("foo.txt","UTF-16LE");
w.print("test");
w.close();
FileInputStream r = new FileInputStream("foo.txt");
int c;
while ((c = r.read()) != -1) {
    System.out.format("%02x ",c);
}
r.close();
prints
74 00 65 00 73 00 74 00
With Java it does not produce a BOM, but with Groovy there is a BOM.
There appears to be a difference in behavior with withPrintWriter. Try this out in your GroovyConsole:
File file = new File("tmp.txt")
try {
    String text = " "
    String charset = "UTF-16LE"
    file.withPrintWriter(charset) { it << text }
    println "withPrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }
    PrintWriter w = new PrintWriter(file, charset)
    w.print(text)
    w.close()
    println "\n\nnew PrintWriter"
    file.getBytes().each { System.out.format("%02x ", it) }
} finally {
    file.delete()
}
It outputs
withPrintWriter
ff fe 20 00
new PrintWriter
20 00
This is because calling new PrintWriter calls the Java constructor, but calling withPrintWriter eventually calls org.codehaus.groovy.runtime.ResourceGroovyMethods.writeUTF16BomIfRequired(), which writes the BOM.
I'm uncertain whether this difference in behavior is intentional. I was curious about this, so I asked on the mailing list. Someone there should know the history behind the design.
Edit: GROOVY-7465 was created out of the aforementioned discussion.
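As a side note, the asker's expectation matches how the two names usually behave elsewhere; for example, in Python (my illustration, unrelated to Groovy's internals) the generic 'utf-16' codec prepends a BOM while the endian-specific 'utf-16-le' codec does not:

# 'utf-16' writes a BOM (byte order depends on the platform); 'utf-16-le' writes none.
print("test".encode("utf-16").hex(" "))     # e.g. ff fe 74 00 65 00 73 00 74 00
print("test".encode("utf-16-le").hex(" "))  # 74 00 65 00 73 00 74 00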

UTF-8 & Unicode, what's with 0xC0 and 0x80?

I've been reading about Unicode and UTF-8 in the last couple of days, and I often come across a bitwise comparison similar to this:
int strlen_utf8(char *s)
{
    int i = 0, j = 0;
    while (s[i])
    {
        if ((s[i] & 0xc0) != 0x80) j++;
        i++;
    }
    return j;
}
Can someone clarify the comparison with 0xc0 and checking if it's the most significant bit?
Thank you!
EDIT: ANDed, not comparison, used the wrong word ;)
It's not a comparison with 0xc0, it's a bitwise AND operation with 0xc0.
The bit mask 0xc0 is 11 00 00 00, so what the AND is doing is extracting only the top two bits:
ab cd ef gh
AND 11 00 00 00
-- -- -- --
= ab 00 00 00
This is then compared to 0x80 (binary 10 00 00 00). In other words, the if statement is checking to see if the top two bits of the value are not equal to 10.
"Why?", I hear you ask. Well, that's a good question. The answer is that, in UTF-8, all bytes that begin with the bit pattern 10 are subsequent bytes of a multi-byte sequence:
UTF-8
Range             Encoding Binary value
----------------- -------- --------------------------
U+000000-U+00007f 0xxxxxxx 0xxxxxxx
U+000080-U+0007ff 110yyyxx 00000yyy xxxxxxxx
                  10xxxxxx
U+000800-U+00ffff 1110yyyy yyyyyyyy xxxxxxxx
                  10yyyyxx
                  10xxxxxx
U+010000-U+10ffff 11110zzz 000zzzzz yyyyyyyy xxxxxxxx
                  10zzyyyy
                  10yyyyxx
                  10xxxxxx
So, what this little snippet is doing is going through every byte of your UTF-8 string and counting up all the bytes that aren't continuation bytes (i.e., it's getting the length of the string in characters, as advertised). See the Wikipedia article on UTF-8 for more detail, and Joel Spolsky's excellent article on Unicode for a primer.
An interesting aside, by the way: you can classify bytes in a UTF-8 stream as follows:
With the high bit set to 0, it's a single-byte value.
With the two high bits set to 10, it's a continuation byte.
Otherwise, it's the first byte of a multi-byte sequence, and the number of leading 1 bits indicates how many bytes there are in total for this sequence (110... means two bytes, 1110... means three bytes, etc.).
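To make the classification concrete, here is a small Python sketch (my own illustration, not part of the answer) that applies these rules to the bytes of a UTF-8 string and reproduces the count from strlen_utf8:

def classify(b):
    # Classify a single UTF-8 byte by its high bits.
    if b & 0x80 == 0x00:
        return "single-byte (ASCII)"
    if b & 0xc0 == 0x80:
        return "continuation byte"
    if b & 0xe0 == 0xc0:
        return "lead byte of a 2-byte sequence"
    if b & 0xf0 == 0xe0:
        return "lead byte of a 3-byte sequence"
    if b & 0xf8 == 0xf0:
        return "lead byte of a 4-byte sequence"
    return "not valid in UTF-8"

data = "aЗ€😀".encode("utf-8")
for b in data:
    print("%02x" % b, classify(b))

# Counting every byte that is NOT a continuation byte gives the character count,
# exactly like the C snippet above:
print(sum((b & 0xc0) != 0x80 for b in data))   # 4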