What is the difference between ö and ö? - unicode

The following characters look alike. But they are not the same. I can not visually see their difference. Could anybody let me know what their difference is? Why are there two Unicode characters that are so similar?
$ xxd <<< ö
00000000: c3b6 0a ...
$ xxd <<< ö
00000000: 6fcc 880a o...

The first is a single Unicode code point, while the second is two Unicode code points. They are two canonically equivalent forms of the same character and render as the same glyph (examples in Python):
import unicodedata as ud
o1 = 'ö' # '\xf6'
o2 = 'ö' # 'o\u0308'
for c in o1:
    print(f'U+{ord(c):04X} {ud.name(c)}')
print()
for c in o2:
    print(f'U+{ord(c):04X} {ud.name(c)}')
U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
U+006F LATIN SMALL LETTER O
U+0308 COMBINING DIAERESIS
Ensure the two strings are in the same normalization form (either composed or decomposed) for comparison:
print(ud.normalize('NFC',o1) == ud.normalize('NFC',o2))
print(ud.normalize('NFD',o1) == ud.normalize('NFD',o2))
True
True
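The byte dumps from the question can be reproduced the same way. This is a minimal sketch that builds both strings from escape sequences, so the result does not depend on how the text was normalized when it was copied; the variable names are only illustrative.

```python
import unicodedata as ud

composed = '\u00f6'      # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = 'o\u0308'   # LATIN SMALL LETTER O + COMBINING DIAERESIS

# UTF-8 bytes, matching the two xxd dumps (minus the trailing newline 0a)
print(composed.encode('utf-8').hex())    # c3b6
print(decomposed.encode('utf-8').hex())  # 6fcc88

# Different code point sequences, but canonically equivalent
print(composed == decomposed)                       # False
print(ud.normalize('NFC', decomposed) == composed)  # True
print(len(composed), len(decomposed))               # 1 2
```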

Related

Extended Grapheme Clusters stop combining

I am having one question with the Extended Grapheme Clusters.
For example, look at following code:
let message = "c\u{0327}a va bien" // => "ça va bien"
How does Swift know it needs to be combined (i.e. ç) rather than treating it as a small letter c AND a "COMBINING CEDILLA"?
Use the unicodeScalars view on the string:
import Foundation // provides the two canonical-mapping properties used below

let message1 = "c\u{0327}".decomposedStringWithCanonicalMapping
for scalar in message1.unicodeScalars {
    print(scalar) // prints c and Combining Cedilla separately
}
let message2 = "c\u{0327}".precomposedStringWithCanonicalMapping
for scalar in message2.unicodeScalars {
    print(scalar) // prints Latin Small Letter C with Cedilla
}
Note that not all composite characters have a precomposed form, as noted by Apple's Technical Q&A:
Important: Do not convert to precomposed Unicode in an attempt to simplify your text processing. Precomposed Unicode can still contain composite characters. For example, there is no precomposed equivalent of U+0065 U+030A (LATIN SMALL LETTER E followed by COMBINING RING ABOVE)
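The same point can be checked outside Swift. The following Python sketch (not part of the original answer) composes both sequences with NFC: c plus cedilla collapses to a single code point, while e plus ring above stays as two because no precomposed character exists for it.

```python
import unicodedata as ud

c_cedilla = ud.normalize('NFC', 'c\u0327')  # c + COMBINING CEDILLA
e_ring = ud.normalize('NFC', 'e\u030a')     # e + COMBINING RING ABOVE

print([f'U+{ord(ch):04X}' for ch in c_cedilla])  # ['U+00E7']
print([f'U+{ord(ch):04X}' for ch in e_ring])     # ['U+0065', 'U+030A']
```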

Sorting Czech in Perl

I have the following perl program
use 5.014_001;
use utf8;
use Unicode::Collate::Locale;
require 'Unicode/Collate/Locale/cs.pl';
binmode STDOUT, ':encoding(UTF-8)';
my @old_list = (
"cash",
"Cash",
"cat",
"Cat",
"čash",
"dash",
"Dash",
"Ďash",
"database",
"Database",
);
my $col= Unicode::Collate::Locale->new(
level => 3,
locale => 'cs',
normalization => 'NFD',
);
my @list = $col->sort(@old_list);
foreach my $item (@list){
    print $item, "\n";
}
This program prints out the output:
cash
Cash
cat
Cat
čash
dash
Dash
Ďash
database
Database
I believe that a careful observer would have to conclude that in Czech either
(1) č is a first-class letter while Ď is not, or
(2) the Unicode::Collate::Locale sorting of Czech in Perl is not correct.
I'd like to believe (1), and the following bolsters my case:
http://en.wiktionary.org/wiki/Index_talk:Czech
where it says:
Let us sort the entries by the existing Czech conventions, as far as practicable. That is, only the following characters have any sorting significance:
a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž
But I'm confused, because I thought "D with a v over it" (and its lowercase equivalent) is a first-class letter of the Czech alphabet.
Where is @tchrist when I need him?
I'd appreciate any insights on this.
I have not yet seen a language that orders Czech or Slovak words correctly. (The Slovak alphabet is quite similar to the Czech one.) .NET, Java, and Python all get it wrong. Raku and Go come closest to the correct solution.
Yes, in Czech and Slovak, the letter ď comes right after d. There are quite a few peculiarities, such as the digraphs ch, dz, and dž.
#!/usr/bin/perl
use v5.30;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use open ":std", ":encoding(UTF-8)";
my @words = qw/čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom/;
my $col = Unicode::Collate::Locale->new(
level => 3,
locale => 'sk',
normalization => 'NFKC',
);
my @sort_asc = $col->sort(@words);
say "@sort_asc";
The example sorts Slovak words; it contains plenty of challenges.
$ ./sort_accented_words.pl
auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka ďateľ drevo
džavot džem džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba
ihla jazva kľak klam kĺb lom márny mat mäta pól pot pôst pýr sýkorka šum
tŕň troska zem
Perl did not order the accented words correctly. Interestingly, it correctly ordered the words with ch, dz, dž digraphs.
#!/usr/bin/raku
my @words = <čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom>;
say @words.sort({ .unival, .NFKD[0], .fc });
This is a Raku example.
$ ./sort_words.raku
(auto banán cesnak chobot chyba cibuľa čaj čerešňa červený čierny čučoriedka
drevo džavot džem džíp ďateľ elektrón fuj gejzír guma hora hrozno hôrny ihla
jazva klam kĺb kľak lom mat márny mäta pot pól pôst pýr sýkorka šum troska
tŕň zem)
Accented words are correctly sorted but the ch, dz, and dž digraphs are wrong.
So in my opinion, unless we create our own solution, we won't get a 100% correct output in any programming language.
A locale is just a set of rules. Here is the cs locale from Unicode::Collate::Locale 1.31. (DUCET is the Default Unicode Collation Element Table.)
Ď may be a first-class letter, but neither DUCET nor this tailoring treats it as one: the entries below cover only č, ch, ř, š and ž. If you want a different order, you can adjust your locale or supply your own.
+{
locale_version => 1.31,
entry => <<'ENTRY', # for DUCET v13.0.0
010D ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
0063 030C ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
010C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0043 030C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0063 0068 ; [.2076.0020.0002] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0063 0048 ; [.2076.0020.0007][.0000.0000.0002] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
0043 0068 ; [.2076.0020.0007][.0000.0000.0008] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.2076.0020.0008] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
0159 ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0072 030C ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0158 ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0052 030C ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0161 ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0073 030C ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0160 ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
0053 030C ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
017E ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
007A 030C ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
017D ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
005A 030C ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
ENTRY
};
If the default sort is not working for you, this common workaround is an easy do-it-yourself:
Make a sort-array by transforming your strings: if a and á should be equivalent, transform both to a; if á should follow a, transform it into a{, for example (any character that sorts after z works; in ASCII, { comes right after lowercase z, whereas [ only follows uppercase Z). Transform ch into h{, as it goes after h, if I understand correctly. Then sort the original array together with the sort-array.
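The workaround above can be sketched in Python with a key function instead of a parallel array. This is only a rough sketch under stated assumptions: the primary alphabet is the list quoted from Wiktionary earlier in the thread, every other accented letter (ď, á, and so on) is treated as a variant that sorts right after its plain base letter, and the names `czech_key` and `strip_marks` are hypothetical. It does not claim to implement the official Czech collation standard.

```python
import unicodedata

# Letters with primary sorting significance, from the list quoted above.
ALPHABET = unicodedata.normalize(
    "NFC", "a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž"
).split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def strip_marks(ch):
    """Return ch without combining marks (for example 'ď' -> 'd')."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def czech_key(word):
    """Hypothetical sort key: primary ranks first, accents as a tie-breaker."""
    text = unicodedata.normalize("NFC", word.lower())
    primary, secondary = [], []
    i = 0
    while i < len(text):
        # Treat the digraph 'ch' as one letter that sorts after 'h'.
        token = "ch" if text.startswith("ch", i) else text[i]
        i += len(token)
        if token in RANK:
            primary.append(RANK[token])
            secondary.append(0)
        else:
            # Unknown or secondary-accented letter: rank by its base letter.
            primary.append(RANK.get(strip_marks(token), len(ALPHABET)))
            secondary.append(1)
    return (primary, secondary, word)

words = ["database", "Ďash", "dash", "čash", "cat", "chata", "hrad", "cibule"]
print(sorted(words, key=czech_key))
# ['cat', 'cibule', 'čash', 'dash', 'Ďash', 'database', 'hrad', 'chata']
```

Returning a tuple of lists keeps the comparison multi-level: accents only matter when the primary ranks of two words are identical.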
Despite Czech being my native language, I don't know Czech collation perfectly. But surely, for ď, ť, ň and vowels with diacritics, the diacritic has a lower significance than for other Czech characters like č.
Why? This is related to pronunciation. Barring assimilation and non-native words, all consonants but d, t and n have a clear pronunciation regardless of their context. (“Ch” is considered a separate letter.) Those three letters (D, T and N) can be “softened” when they are followed by “i”, “í” or “ě”. In those cases, they are pronounced as if they had a caron (háček). As a result, the diacritic on them is less significant.

Swift: Check if first character of String is ASCII or a letter with an accent

In Swift, how do I check whether the first character of a String is ASCII or a letter with an accent? Wider characters, like emoticons & logograms, don't count.
I'm trying to copy how the iPhone Messages app shows a person's initials beside their message in a group chat. But, it doesn't include an initial if it's an emoticon or a Chinese character for example.
I see decomposableCharacterSet & nonBaseCharacterSet, but I'm not sure if those are what I want.
There are many Unicode characters with an accent.
Would you count each of these as a character with an accent?
Ê
Ð
Ï
Ḝ
Ṹ
é⃝
Unicode also has combining sequences, in which two code points are rendered as one character:
let eAcute: Character = "\u{E9}" // é
let combinedEAcute: Character = "\u{65}\u{301}" // e followed by U+0301 COMBINING ACUTE ACCENT
// eAcute is é, combinedEAcute is é
For Swift, the two Character values compare as equal!
A good reference is here.
If you want to know the code units of the characters in the String, you can use the utf8 or utf16 view. There, the two forms are different!
let characterString: String = "abc"
for character in characterString.utf8 {
    print("\(character) ")
}
// output is decimal numbers: 97 98 99
// output for "é" alone: 195 169 when precomposed (U+00E9), 101 204 129 when written as e + U+0301
Then you could check for the ASCII alphabet: A-Z is 65-90 and a-z is 97-122.
And then check for the standard letters with grave and acute accents (the numbers below are Unicode scalar values, not UTF-8 code units):
À 192
Á 193
È 200
É 201
à 224
á 225
è 232
é 233
... and the combined ones and everything else you like.
But there are symbols that look like a Latin letter with an accent but don't have the same meaning!
You should make sure that you accept only the characters you actually want, with the correct linguistic meaning.
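One way to avoid enumerating accented letters by hand is to decompose the first character and test its base letter. The Python sketch below illustrates that idea rather than giving a Swift answer; `initial_is_latin_letter` is a hypothetical name, and the rule "the base must be an ASCII letter" is an assumption that also rejects letters without a canonical decomposition, such as Ð.

```python
import unicodedata

def initial_is_latin_letter(text):
    """True if text starts with an ASCII letter, with or without accents."""
    if not text:
        return False
    # NFD puts the base letter first and any combining marks after it.
    base = unicodedata.normalize("NFD", text)[0]
    return base.isascii() and base.isalpha()

print(initial_is_latin_letter("Eve"))              # True
print(initial_is_latin_letter("\u00c9ric"))        # True  (É -> E + acute)
print(initial_is_latin_letter("e\u0301ric"))       # True  (already decomposed)
print(initial_is_latin_letter("\u4e2d\u6587"))     # False (Chinese characters)
print(initial_is_latin_letter("\U0001F600 Anna"))  # False (emoji)
print(initial_is_latin_letter("\u00d0"))           # False (no decomposition)
```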

Unicode characters having asymmetric upper/lower case. Why?

Why do the following three characters not have symmetric toLower/toUpper results?
/**
* Written in the Scala programming language, typed into the Scala REPL.
* Results commented accordingly.
*/
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"
For the first one, there is this explanation:
In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".
In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.
For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.
Characters need not necessarily have symmetric upper- and lower-case transformations.
Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):
Three letterlike symbols have been given canonical equivalence to regular letters: U+2126
OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.
In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
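The same round trips can be tried in Python, shown here as a hedged comparison rather than a restatement of the Scala results. Python's `str.lower()` and `str.upper()` apply full case mappings to whole strings, so two results differ from the per-Char Scala output above: ß upper-cases to the two letters SS, and U+0130 lower-cases to i followed by U+0307 COMBINING DOT ABOVE.

```python
import unicodedata as ud

sharp_s = '\u1e9e'   # LATIN CAPITAL LETTER SHARP S
kelvin = '\u212a'    # KELVIN SIGN
dotted_i = '\u0130'  # LATIN CAPITAL LETTER I WITH DOT ABOVE

print(sharp_s.lower() == '\u00df')             # True
print('\u00df'.upper())                        # SS
print(kelvin.lower(), kelvin.lower().upper())  # k K
print(ud.normalize('NFC', kelvin) == 'K')      # True
print([f'U+{ord(c):04X}' for c in dotted_i.lower()])  # ['U+0069', 'U+0307']
```

None of the three characters survives a lower-then-upper round trip, and normalization alone already replaces the Kelvin sign with a regular K.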
May I refer to another post about Unicode and upper and lower case:
It is a common mistake to think that the letters of a language have to be available in both upper and lower case!
Unicode-correct title case in Java

How does uʍop-ǝpᴉsdn text work?

Here's a website I found that will produce upside down versions of any English text.
how does it work? does unicode have upside down chars? Or what?
How can I write my own text flipping function?
how does it work? does unicode have upside down chars?
Unicode does have upside-down characters. They have "TURNED" in their name:
ƍ U+018D LATIN SMALL LETTER TURNED DELTA
Ɯ U+019C LATIN CAPITAL LETTER TURNED M
ǝ U+01DD LATIN SMALL LETTER TURNED E
Ʌ U+0245 LATIN CAPITAL LETTER TURNED V
ɐ U+0250 LATIN SMALL LETTER TURNED A
ɒ U+0252 LATIN SMALL LETTER TURNED ALPHA
ɥ U+0265 LATIN SMALL LETTER TURNED H
ɯ U+026F LATIN SMALL LETTER TURNED M
ɰ U+0270 LATIN SMALL LETTER TURNED M WITH LONG LEG
ɹ U+0279 LATIN SMALL LETTER TURNED R
ɺ U+027A LATIN SMALL LETTER TURNED R WITH LONG LEG
ɻ U+027B LATIN SMALL LETTER TURNED R WITH HOOK
ʇ U+0287 LATIN SMALL LETTER TURNED T
ʌ U+028C LATIN SMALL LETTER TURNED V
ʍ U+028D LATIN SMALL LETTER TURNED W
ʎ U+028E LATIN SMALL LETTER TURNED Y
ʞ U+029E LATIN SMALL LETTER TURNED K
ʮ U+02AE LATIN SMALL LETTER TURNED H WITH FISHHOOK
ʯ U+02AF LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
ʴ U+02B4 MODIFIER LETTER SMALL TURNED R
ʵ U+02B5 MODIFIER LETTER SMALL TURNED R WITH HOOK
ʻ U+02BB MODIFIER LETTER TURNED COMMA
̒ U+0312 COMBINING TURNED COMMA ABOVE
ჹ U+10F9 GEORGIAN LETTER TURNED GAN
ᴂ U+1D02 LATIN SMALL LETTER TURNED AE
ᴈ U+1D08 LATIN SMALL LETTER TURNED OPEN E
ᴉ U+1D09 LATIN SMALL LETTER TURNED I
ᴔ U+1D14 LATIN SMALL LETTER TURNED OE
ᴚ U+1D1A LATIN LETTER SMALL CAPITAL TURNED R
ᴟ U+1D1F LATIN SMALL LETTER SIDEWAYS TURNED M
ᵄ U+1D44 MODIFIER LETTER SMALL TURNED A
ᵆ U+1D46 MODIFIER LETTER SMALL TURNED AE
ᵌ U+1D4C MODIFIER LETTER SMALL TURNED OPEN E
ᵎ U+1D4E MODIFIER LETTER SMALL TURNED I
ᵚ U+1D5A MODIFIER LETTER SMALL TURNED M
ᵷ U+1D77 LATIN SMALL LETTER TURNED G
ᶛ U+1D9B MODIFIER LETTER SMALL TURNED ALPHA
ᶣ U+1DA3 MODIFIER LETTER SMALL TURNED H
ᶭ U+1DAD MODIFIER LETTER SMALL TURNED M WITH LONG LEG
ᶺ U+1DBA MODIFIER LETTER SMALL TURNED V
℩ U+2129 TURNED GREEK SMALL LETTER IOTA
Ⅎ U+2132 TURNED CAPITAL F
⅁ U+2141 TURNED SANS-SERIF CAPITAL G
⅂ U+2142 TURNED SANS-SERIF CAPITAL L
⅄ U+2144 TURNED SANS-SERIF CAPITAL Y
⅋ U+214B TURNED AMPERSAND
ⅎ U+214E TURNED SMALL F
⌙ U+2319 TURNED NOT SIGN
❛ U+275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
❝ U+275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
⦢ U+29A2 TURNED ANGLE
Ɐ U+2C6F LATIN CAPITAL LETTER TURNED A
ⱹ U+2C79 LATIN SMALL LETTER TURNED R WITH TAIL
ⱻ U+2C7B LATIN LETTER SMALL CAPITAL TURNED E
Ꝿ U+A77E LATIN CAPITAL LETTER TURNED INSULAR G
ꝿ U+A77F LATIN SMALL LETTER TURNED INSULAR G
Ꞁ U+A780 LATIN CAPITAL LETTER TURNED L
ꞁ U+A781 LATIN SMALL LETTER TURNED L
However, it's far from a complete set. Most upside-down text works by choosing characters that happen to have a close-enough resemblance to upside-down letters. It's the equivalent of typing 0.7734 on your calculator to spell "hELLO".
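A list like the one above can be regenerated by scanning character names. This is a small Python sketch, assuming that a plain substring match on "TURNED" is good enough; the exact result depends on the Unicode version bundled with the interpreter, so it may contain more entries than the list above.

```python
import sys
import unicodedata as ud

# Collect every code point whose character name contains "TURNED".
turned = [
    (cp, ud.name(chr(cp)))
    for cp in range(sys.maxunicode + 1)
    if 'TURNED' in ud.name(chr(cp), '')
]

print(len(turned))  # varies with the Unicode version of the interpreter
for cp, name in turned[:5]:
    print(f'U+{cp:04X} {name}')
```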
does unicode have upside down chars?
Yup! Or at least characters that look like they are upside down. Also, regular English-alphabetical characters can appear to be upside down. Like u could be an upside-down n.
To code it up, you just have to take an array of characters, display them in reverse order and replace those characters with the upside down version of them. This will get you a good start: zʎxʍʌnʇsɹbdouɯןʞſıɥbɟǝpɔqɐ
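The recipe above (reverse the text, then substitute look-alikes) can be sketched in Python as follows. The table is an assumption pieced together from the characters mentioned in this thread, and `flip_text` is a hypothetical name; letters without an entry (l, o, s, x, z) are passed through unchanged.

```python
# Assumed look-alike table for lowercase ASCII letters.
FLIP_TABLE = {
    'a': '\u0250', 'b': 'q', 'c': '\u0254', 'd': 'p', 'e': '\u01dd',
    'f': '\u025f', 'g': 'b', 'h': '\u0265', 'i': '\u0131', 'j': '\u017f',
    'k': '\u029e', 'm': '\u026f', 'n': 'u', 'p': 'd', 'q': 'b',
    'r': '\u0279', 't': '\u0287', 'u': 'n', 'v': '\u028c', 'w': '\u028d',
    'y': '\u028e',
}

def flip_text(text):
    """Reverse the text and swap each letter for an upside-down look-alike."""
    return ''.join(FLIP_TABLE.get(ch, ch) for ch in reversed(text.lower()))

print(flip_text('upside-down'))  # uʍop-ǝpısdn
```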
When 'uʍop-ǝpısdn' is copied and echoed into a hex dump program, the string is seen as:
75 CA 8D 6F 70 2D C7 9D 70 C4 B1 73 64 6E
The UTF-8 breakdown of that is:
0x75 = U+0075 = LATIN SMALL LETTER U
0xCA 0x8D = U+028D = LATIN SMALL LETTER TURNED W
0x6F = U+006F = LATIN SMALL LETTER O
0x70 = U+0070 = LATIN SMALL LETTER P
0x2D = U+002D = HYPHEN-MINUS
0xC7 0x9D = U+01DD = LATIN SMALL LETTER TURNED E
0x70 = U+0070 = LATIN SMALL LETTER P
0xC4 0xB1 = U+0131 = LATIN SMALL LETTER DOTLESS I
0x73 = U+0073 = LATIN SMALL LETTER S
0x64 = U+0064 = LATIN SMALL LETTER D
0x6E = U+006E = LATIN SMALL LETTER N
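The breakdown above can be reproduced programmatically. The following Python sketch decodes the hex dump as UTF-8 and prints each character's bytes, code point and name; only the variable names are invented here.

```python
import unicodedata as ud

hex_dump = '75 CA 8D 6F 70 2D C7 9D 70 C4 B1 73 64 6E'
text = bytes.fromhex(hex_dump).decode('utf-8')

for ch in text:
    # Re-encode each character to show which bytes belong to it.
    encoded = ' '.join(f'0x{b:02X}' for b in ch.encode('utf-8'))
    print(f'{encoded} = U+{ord(ch):04X} = {ud.name(ch)}')
```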
They are just Unicode characters.
Look at the source of the web page:
function flip() {
    var result = flipString(document.f.original.value);
    document.f.flipped.value = result;
}
function flipString(aString) {
    aString = aString.toLowerCase();
    var last = aString.length - 1;
    var result = "";
    // walk the string backwards and flip one character at a time
    for (var i = last; i >= 0; --i) {
        result += flipChar(aString.charAt(i))
    }
    return result;
}
function flipChar(c) {
    if (c == 'a') {
        return '\u0250'
    }
    else if (c == 'b') {
        return 'q'
    }
    else if (c == 'c') {
        return '\u0254' //Open o -- copied from pne
    }
    // ... the remaining mappings are cut off in this excerpt ...
    return c // assumed fallback: leave unmapped characters unchanged
}
There is the "upsidedown" Python module: https://pypi.org/project/upsidedown/. It supports non-English characters too.