Is anyone aware of where I could find a table mapping LaTeX commands to Unicode code points? E.g., \le is 0x2264. I'm looking for something as comprehensive as possible.
The document I've used before is this XML file from the W3C. It maps Unicode to HTML, MathML, LaTeX, Mathematica, and others. (The file is 1.4 MB, uncompressed.)
You can read more about it here: http://www.w3.org/TR/unicode-xml/
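A minimal Python sketch of turning that file into a LaTeX-to-code-point table. It assumes each entry is a `character` element with a decimal `dec` attribute and an optional `latex` child; check this against the actual file before relying on it. The inline `SAMPLE` is a made-up stand-in, not an excerpt from the W3C data.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for unicode.xml; element and attribute names are assumptions.
SAMPLE = """<unicode>
  <character id="U02264" dec="8804"><latex>\\leq</latex></character>
  <character id="U000D7" dec="215"><latex>\\texttimes</latex></character>
  <character id="U00041" dec="65"></character>
</unicode>"""

def latex_to_codepoint(xml_text):
    """Return {latex command: code point} for entries that have a <latex> child."""
    table = {}
    for ch in ET.fromstring(xml_text).iter("character"):
        latex = ch.findtext("latex")
        dec = ch.get("dec")
        # Skip entries without LaTeX or with a non-numeric "dec" value.
        if latex and dec and dec.isdigit():
            # Keep the first code point seen for a given command.
            table.setdefault(latex.strip(), int(dec))
    return table

table = latex_to_codepoint(SAMPLE)
print(hex(table["\\leq"]))
```

For the real file you would pass the file contents (or use `ET.parse`) instead of `SAMPLE`.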
I once cooked up this for a report generator written in Java (hence the Java String literals):
'\\'(REVERSE SOLIDUS) "\\textbackslash{}"
'^'(CIRCUMFLEX ACCENT) "$\\uparrow$"
'_'(LOW LINE) "\\textunderscore{}"
'|'(VERTICAL LINE) "\\vline{}"
'~'(TILDE) "\\textasciitilde{}" "~"
'§'(SECTION SIGN) "\\S{}"
'ª'(FEMININE ORDINAL INDICATOR) "$^a$"
''(SOFT HYPHEN) "\\-"
'²'(SUPERSCRIPT TWO) "$^2$"
'³'(SUPERSCRIPT THREE) "$^3$"
'·'(MIDDLE DOT) "$\\cdot$"
'¹'(SUPERSCRIPT ONE) "$^1$"
'º'(MASCULINE ORDINAL INDICATOR) "$^o$"
'\u013a'(LATIN SMALL LETTER L WITH ACUTE) "\\'l"
'\u013b'(LATIN CAPITAL LETTER L WITH CEDILLA) "\\c{L}"
'\u013c'(LATIN SMALL LETTER L WITH CEDILLA) "\\c{l}"
'\u013d'(LATIN CAPITAL LETTER L WITH CARON) "\\v{L}"
'\u013e'(LATIN SMALL LETTER L WITH CARON) "\\v{l}"
'\u013f'(LATIN CAPITAL LETTER L WITH MIDDLE DOT) "L\\hspace{-0.35em}$\\cdot$"
'\u0140'(LATIN SMALL LETTER L WITH MIDDLE DOT) "l$\\cdot$"
'\u0141'(LATIN CAPITAL LETTER L WITH STROKE) "\\L{}"
'\u0142'(LATIN SMALL LETTER L WITH STROKE) "\\l{}"
'\u0143'(LATIN CAPITAL LETTER N WITH ACUTE) "\\'N"
'\u0144'(LATIN SMALL LETTER N WITH ACUTE) "\\'n"
'\u0145'(LATIN CAPITAL LETTER N WITH CEDILLA) "\\c{N}"
'\u0146'(LATIN SMALL LETTER N WITH CEDILLA) "\\c{n}"
'\u0147'(LATIN CAPITAL LETTER N WITH CARON) "\\v{N}"
'\u0148'(LATIN SMALL LETTER N WITH CARON) "\\v{n}"
'\u0149'(LATIN SMALL LETTER N PRECEDED BY APOSTROPHE) "'n"
'\u014c'(LATIN CAPITAL LETTER O WITH MACRON) "\\={O}"
'\u014d'(LATIN SMALL LETTER O WITH MACRON) "\\={o}"
'\u014e'(LATIN CAPITAL LETTER O WITH BREVE) "\\u{O}"
'\u014f'(LATIN SMALL LETTER O WITH BREVE) "\\u{o}"
'\u0150'(LATIN CAPITAL LETTER O WITH DOUBLE ACUTE) "\\H{O}"
'\u0151'(LATIN SMALL LETTER O WITH DOUBLE ACUTE) "\\H{o}"
'\u0152'(LATIN CAPITAL LIGATURE OE) "\\OE{}"
'\u0153'(LATIN SMALL LIGATURE OE) "\\oe{}"
'\u0154'(LATIN CAPITAL LETTER R WITH ACUTE) "\\'{R}"
'\u0155'(LATIN SMALL LETTER R WITH ACUTE) "\\'{r}"
'\u0156'(LATIN CAPITAL LETTER R WITH CEDILLA) "\\c{R}"
'\u0157'(LATIN SMALL LETTER R WITH CEDILLA) "\\c{r}"
'\u0158'(LATIN CAPITAL LETTER R WITH CARON) "\\v{R}"
'\u0159'(LATIN SMALL LETTER R WITH CARON) "\\v{r}"
'\u015a'(LATIN CAPITAL LETTER S WITH ACUTE) "\\'S"
'\u015b'(LATIN SMALL LETTER S WITH ACUTE) "\\'s"
'\u015c'(LATIN CAPITAL LETTER S WITH CIRCUMFLEX) "\\^{S}"
'\u015d'(LATIN SMALL LETTER S WITH CIRCUMFLEX) "\\^{s}"
'\u015e'(LATIN CAPITAL LETTER S WITH CEDILLA) "\\c{S}"
'\u015f'(LATIN SMALL LETTER S WITH CEDILLA) "\\c{s}"
'\u0160'(LATIN CAPITAL LETTER S WITH CARON) "\\v{S}"
'\u0161'(LATIN SMALL LETTER S WITH CARON) "\\v{s}"
'\u0162'(LATIN CAPITAL LETTER T WITH CEDILLA) "\\c{T}"
'\u0163'(LATIN SMALL LETTER T WITH CEDILLA) "\\c{t}"
'\u0164'(LATIN CAPITAL LETTER T WITH CARON) "\\v{T}"
'\u0165'(LATIN SMALL LETTER T WITH CARON) "\\v{t}"
'\u0168'(LATIN CAPITAL LETTER U WITH TILDE) "\\~{U}"
'\u0169'(LATIN SMALL LETTER U WITH TILDE) "\\~{u}"
'\u016a'(LATIN CAPITAL LETTER U WITH MACRON) "\\={U}"
'\u016b'(LATIN SMALL LETTER U WITH MACRON) "\\={u}"
'\u016c'(LATIN CAPITAL LETTER U WITH BREVE) "\\u{U}"
'\u016d'(LATIN SMALL LETTER U WITH BREVE) "\\u{u}"
'\u016e'(LATIN CAPITAL LETTER U WITH RING ABOVE) "\\r{U}"
'\u016f'(LATIN SMALL LETTER U WITH RING ABOVE) "\\r{u}"
'\u0170'(LATIN CAPITAL LETTER U WITH DOUBLE ACUTE) "\\H{U}"
'\u0171'(LATIN SMALL LETTER U WITH DOUBLE ACUTE) "\\H{u}"
'\u0172'(LATIN CAPITAL LETTER U WITH OGONEK) "\\k{U}"
'\u0173'(LATIN SMALL LETTER U WITH OGONEK) "\\k{u}"
'\u0174'(LATIN CAPITAL LETTER W WITH CIRCUMFLEX) "\\^{W}"
'\u0175'(LATIN SMALL LETTER W WITH CIRCUMFLEX) "\\^{w}"
'\u0176'(LATIN CAPITAL LETTER Y WITH CIRCUMFLEX) "\\^{Y}"
'\u0177'(LATIN SMALL LETTER Y WITH CIRCUMFLEX) "\\^{y}"
'\u0178'(LATIN CAPITAL LETTER Y WITH DIAERESIS) "\\\"Y"
'\u0179'(LATIN CAPITAL LETTER Z WITH ACUTE) "\\'Z"
'\u017a'(LATIN SMALL LETTER Z WITH ACUTE) "\\'z"
'\u017b'(LATIN CAPITAL LETTER Z WITH DOT ABOVE) "\\.{Z}"
'\u017c'(LATIN SMALL LETTER Z WITH DOT ABOVE) "\\.{z}"
'\u017d'(LATIN CAPITAL LETTER Z WITH CARON) "\\v{Z}"
'\u017e'(LATIN SMALL LETTER Z WITH CARON) "\\v{z}"
'\u01CD'(LATIN CAPITAL LETTER A WITH CARON) "\\v A"
'\u01CE'(LATIN SMALL LETTER A WITH CARON) "\\v a"
'\u01CF'(LATIN CAPITAL LETTER I WITH CARON) "\\v I"
'\u01D0'(LATIN SMALL LETTER I WITH CARON) "\\v \\i{}"
'\u01D1'(LATIN CAPITAL LETTER O WITH CARON) "\\v O"
'\u01D2'(LATIN SMALL LETTER O WITH CARON) "\\v o"
'\u01D3'(LATIN CAPITAL LETTER U WITH CARON) "\\v U"
'\u01D4'(LATIN SMALL LETTER U WITH CARON) "\\v u"
'\u01D5'(LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON) "\\=Ü"
'\u01D6'(LATIN SMALL LETTER U WITH DIAERESIS AND MACRON) "\\=ü"
'\u01D7'(LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE) "\\'Ü"
'\u01D8'(LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE) "\\'ü"
'\u01D9'(LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON) "\\v Ü"
'\u01DA'(LATIN SMALL LETTER U WITH DIAERESIS AND CARON) "\\v ü"
'\u01DB'(LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE) "\\` Ü"
'\u01DC'(LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE) "\\` ü"
'\u01DE'(LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON) "\\= Ä"
'\u01DF'(LATIN SMALL LETTER A WITH DIAERESIS AND MACRON) "\\= ä"
'\u01E6'(LATIN CAPITAL LETTER G WITH CARON) "\\v G"
'\u01E7'(LATIN SMALL LETTER G WITH CARON) "\\v g"
'\u01E8'(LATIN CAPITAL LETTER K WITH CARON) "\\v K"
'\u01E9'(LATIN SMALL LETTER K WITH CARON) "\\v k"
'\u01EA'(LATIN CAPITAL LETTER O WITH OGONEK) "\\k O"
'\u01EB'(LATIN SMALL LETTER O WITH OGONEK) "\\k o"
'\u01F1'(LATIN CAPITAL LETTER DZ) "DZ"
'\u01F2'(LATIN CAPITAL LETTER D WITH SMALL LETTER Z) "Dz"
'\u01F3'(LATIN SMALL LETTER DZ) "dz"
'\u01F4'(LATIN CAPITAL LETTER G WITH ACUTE) "\\'G"
'\u01F5'(LATIN SMALL LETTER G WITH ACUTE) "\\'g"
'\u01F8'(LATIN CAPITAL LETTER N WITH GRAVE) "\\`N"
'\u01F9'(LATIN SMALL LETTER N WITH GRAVE) "\\`n"
'\u01FA'(LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE) "\\'Å"
'\u01FB'(LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE) "\\'å"
'\u01FC'(LATIN CAPITAL LETTER AE WITH ACUTE) "\\'Æ"
'\u01FD'(LATIN SMALL LETTER AE WITH ACUTE) "\\'æ"
'\u01FE'(LATIN CAPITAL LETTER O WITH STROKE AND ACUTE) "\\'Ø"
'\u01FF'(LATIN SMALL LETTER O WITH STROKE AND ACUTE) "\\'ø"
'\u0200'(LATIN CAPITAL LETTER A WITH DOUBLE GRAVE) "\\textdoublegrave{A}"
'\u0201'(LATIN SMALL LETTER A WITH DOUBLE GRAVE) "\\textdoublegrave{a}"
'\u0202'(LATIN CAPITAL LETTER A WITH INVERTED BREVE) "\\textroundcap{A}"
'\u0203'(LATIN SMALL LETTER A WITH INVERTED BREVE) "\\textroundcap{a}"
'\u0204'(LATIN CAPITAL LETTER E WITH DOUBLE GRAVE) "\\textdoublegrave{E}"
'\u0205'(LATIN SMALL LETTER E WITH DOUBLE GRAVE) "\\textdoublegrave{e}"
'\u0206'(LATIN CAPITAL LETTER E WITH INVERTED BREVE) "\\textroundcap{E}"
'\u0207'(LATIN SMALL LETTER E WITH INVERTED BREVE) "\\textroundcap{e}"
'\u0208'(LATIN CAPITAL LETTER I WITH DOUBLE GRAVE) "\\textdoublegrave{I}"
'\u0209'(LATIN SMALL LETTER I WITH DOUBLE GRAVE) "\\textdoublegrave{\\i}"
'\u020A'(LATIN CAPITAL LETTER I WITH INVERTED BREVE) "\\textroundcap{I}"
'\u020B'(LATIN SMALL LETTER I WITH INVERTED BREVE) "\\textroundcap{\\i}"
'\u020C'(LATIN CAPITAL LETTER O WITH DOUBLE GRAVE) "\\textdoublegrave{O}"
'\u020D'(LATIN SMALL LETTER O WITH DOUBLE GRAVE) "\\textdoublegrave{o}"
'\u020E'(LATIN CAPITAL LETTER O WITH INVERTED BREVE) "\\textroundcap{O}"
'\u020F'(LATIN SMALL LETTER O WITH INVERTED BREVE) "\\textroundcap{o}"
'\u0210'(LATIN CAPITAL LETTER R WITH DOUBLE GRAVE) "\\textdoublegrave{R}"
'\u0211'(LATIN SMALL LETTER R WITH DOUBLE GRAVE) "\\textdoublegrave{r}"
'\u0212'(LATIN CAPITAL LETTER R WITH INVERTED BREVE) "\\textroundcap{R}"
'\u0213'(LATIN SMALL LETTER R WITH INVERTED BREVE) "\\textroundcap{r}"
'\u0214'(LATIN CAPITAL LETTER U WITH DOUBLE GRAVE) "\\textdoublegrave{U}"
'\u0215'(LATIN SMALL LETTER U WITH DOUBLE GRAVE) "\\textdoublegrave{u}"
'\u0216'(LATIN CAPITAL LETTER U WITH INVERTED BREVE) "\\textroundcap{U}"
'\u0217'(LATIN SMALL LETTER U WITH INVERTED BREVE) "\\textroundcap{u}"
'\u0218'(LATIN CAPITAL LETTER S WITH COMMA BELOW) "\\textcommabelow{S}"
'\u0219'(LATIN SMALL LETTER S WITH COMMA BELOW) "\\textcommabelow{s}"
'\u021A'(LATIN CAPITAL LETTER T WITH COMMA BELOW) "\\textcommabelow{T}"
'\u021B'(LATIN SMALL LETTER T WITH COMMA BELOW) "\\textcommabelow{t}"
'\u021E'(LATIN CAPITAL LETTER H WITH CARON) "\\v{H}"
'\u021F'(LATIN SMALL LETTER H WITH CARON) "\\v{h}"
'\u0226'(LATIN CAPITAL LETTER A WITH DOT ABOVE) "\\.A"
'\u0227'(LATIN SMALL LETTER A WITH DOT ABOVE) "\\.a"
'\u0228'(LATIN CAPITAL LETTER E WITH CEDILLA) "\\c E"
'\u0229'(LATIN SMALL LETTER E WITH CEDILLA) "\\c e"
'\u022A'(LATIN CAPITAL LETTER O WITH DIAERESIS AND MACRON) "\\= Ö"
'\u022B'(LATIN SMALL LETTER O WITH DIAERESIS AND MACRON) "\\= ö"
'\u022C'(LATIN CAPITAL LETTER O WITH TILDE AND MACRON) "\\makeatletter\\@tabacckludge={\\~O}\\makeatother{}"
'\u022D'(LATIN SMALL LETTER O WITH TILDE AND MACRON) "\\makeatletter\\@tabacckludge={\\~o}\\makeatother{}"
'\u022E'(LATIN CAPITAL LETTER O WITH DOT ABOVE) "\\.O"
'\u022F'(LATIN SMALL LETTER O WITH DOT ABOVE) "\\.o"
'\u0232'(LATIN CAPITAL LETTER Y WITH MACRON) "\\=Y"
'\u0233'(LATIN SMALL LETTER Y WITH MACRON) "\\=y"
'\u023A'(LATIN CAPITAL LETTER A WITH STROKE) "/\\hspace{-0.5em}A"
'\u023B'(LATIN CAPITAL LETTER C WITH STROKE) "/\\hspace{-0.5em}C"
'\u023C'(LATIN SMALL LETTER C WITH STROKE) "/\\hspace{-0.4em}c"
'\u023D'(LATIN CAPITAL LETTER L WITH BAR) "-\\hspace{-0.3em}L"
'\u023E'(LATIN CAPITAL LETTER T WITH DIAGONAL STROKE) "-\\hspace{-0.3em}T"
'\u20AC'(EURO SIGN) "\\texteuro{}"
'\u2018'(LEFT SINGLE QUOTATION MARK) "'"
'\u2019'(RIGHT SINGLE QUOTATION MARK) "'"
'\u201A'(SINGLE LOW-9 QUOTATION MARK) "'"
'\u201B'(SINGLE HIGH-REVERSED-9 QUOTATION MARK) "'"
'\u201C'(LEFT DOUBLE QUOTATION MARK) "\"{}"
'\u201D'(RIGHT DOUBLE QUOTATION MARK) "\"{}"
'\u201E'(DOUBLE LOW-9 QUOTATION MARK) "\"{}"
'\u201F'(DOUBLE HIGH-REVERSED-9 QUOTATION MARK) "\"{}"
'\u025B'(LATIN SMALL LETTER OPEN E) "\\textepsilon{}"
'\u0283'(LATIN SMALL LETTER ESH) "\\textesh{}"
But I'm pretty sure there isn't a comprehensive mapping anywhere - Unicode is HUGE. You'll probably have to compile and maintain it yourself. Good luck!
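For what it's worth, using a table like this is just a per-character lookup. Here is a rough Python sketch with a handful of entries copied from the list above; the names `LATEX_ESCAPES` and `to_latex` are made up for the example.

```python
# A few entries copied from the table above; a real table would hold many more.
LATEX_ESCAPES = {
    "\\": "\\textbackslash{}",
    "_": "\\textunderscore{}",
    "\u00a7": "\\S{}",         # SECTION SIGN
    "\u0142": "\\l{}",         # LATIN SMALL LETTER L WITH STROKE
    "\u0161": "\\v{s}",        # LATIN SMALL LETTER S WITH CARON
    "\u20ac": "\\texteuro{}",  # EURO SIGN
}

def to_latex(text):
    """Replace each mapped character; pass everything else through unchanged."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)

print(to_latex("5_\u20ac"))
```

Going character by character means the backslashes produced by one replacement are never escaped again by another.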
Here's a web app based on the data mentioned above: http://www.johndcook.com/unicode_latex.html
Type in Unicode and it looks up the LaTeX symbol and vice versa.
You can check out my LaTeX to Unicode converter. It has a JavaScript API which you can use under MIT license. It is partially based on the W3C document shared earlier, but supports even more mappings that I gathered from here and there.
Most mappings are straightforward table lookups, but some commands have no Unicode equivalent or an ambiguous one, so a comprehensive converter requires creative decisions. For example, fractions are quite complicated: \frac{5}{8} produces ⅝, \frac{5}{80} produces 5⁄80, and \frac{5}{80a} produces (5 / (80a)).
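To illustrate the kind of decision involved, here is a small Python sketch that reproduces the three fraction outputs quoted above. The rules are my guess at a reasonable fallback chain, not the converter's actual implementation, and `frac_to_unicode` is a hypothetical name.

```python
# Precomposed "vulgar fraction" characters, keyed by (numerator, denominator).
VULGAR = {
    ("1", "2"): "\u00bd", ("1", "4"): "\u00bc", ("3", "4"): "\u00be",
    ("1", "8"): "\u215b", ("3", "8"): "\u215c",
    ("5", "8"): "\u215d", ("7", "8"): "\u215e",
}

def frac_to_unicode(num, den):
    """Pick the most compact plain-text form for \\frac{num}{den}."""
    if (num, den) in VULGAR:
        return VULGAR[(num, den)]        # a single precomposed character
    if num.isascii() and num.isdigit() and den.isascii() and den.isdigit():
        return f"{num}\u2044{den}"       # digits joined by FRACTION SLASH
    return f"({num} / ({den}))"          # anything else: spell it out

for args in [("5", "8"), ("5", "80"), ("5", "80a")]:
    print(frac_to_unicode(*args))
```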
This is for the Word 2007 Equation Editor but it shares many similar commands with LaTeX: http://unicode.org/notes/tn28/UTN28-PlainTextMath.pdf
This huge table contains Unicode translation to LaTeX, MathML entities and Mathematica: http://www.ams.org/STIX/bnb/stix-tbl.asc98feb26
The following characters look alike. But they are not the same. I can not visually see their difference. Could anybody let me know what their difference is? Why are there two Unicode characters that are so similar?
$ xxd <<< ö
00000000: c3b6 0a ...
$ xxd <<< ö
00000000: 6fcc 880a o...
The first is a single Unicode code point, while the second is two Unicode code points. They are two forms of the same glyph (examples in Python):
import unicodedata as ud
o1 = 'ö' # '\xf6'
o2 = 'ö' # 'o\u0308'
for c in o1:
    print(f'U+{ord(c):04X} {ud.name(c)}')
print()
for c in o2:
    print(f'U+{ord(c):04X} {ud.name(c)}')
U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
U+006F LATIN SMALL LETTER O
U+0308 COMBINING DIAERESIS
Ensure the two strings are in the same normalization form (either composed or decomposed) for comparison:
print(ud.normalize('NFC',o1) == ud.normalize('NFC',o2))
print(ud.normalize('NFD',o1) == ud.normalize('NFD',o2))
True
True
I have the following Perl program:
use 5.014_001;
use utf8;
use Unicode::Collate::Locale;
require 'Unicode/Collate/Locale/cs.pl';
binmode STDOUT, ':encoding(UTF-8)';
my @old_list = (
    "cash",
    "Cash",
    "cat",
    "Cat",
    "čash",
    "dash",
    "Dash",
    "Ďash",
    "database",
    "Database",
);
my $col = Unicode::Collate::Locale->new(
    level => 3,
    locale => 'cs',
    normalization => 'NFD',
);
my @list = $col->sort(@old_list);
foreach my $item (@list) {
    print $item, "\n";
}
This program prints out the output:
cash
Cash
cat
Cat
čash
dash
Dash
Ďash
database
Database
I believe that a careful observer would have to conclude that in Czech either:
1. č is a first-class letter while Ď is not, or
2. the Unicode::Collate::Locale sorting of Czech in Perl is not correct.
I'd like to believe (1), and the following bolsters my case:
http://en.wiktionary.org/wiki/Index_talk:Czech
where it says:
Let us sort the entries by the existing Czech conventions, as far as practicable. That is, only the following characters have any sorting significance:
a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž
But I'm confused, because I thought "D with a v over it" (and its lowercase equivalent) is a first-class letter of the Czech alphabet.
Where is @tchrist when I need him?
I'd appreciate any insights on this.
I have not yet seen a language that would correctly order Czech or Slovak words. (The Slovak alphabet is quite similar to the Czech one.) .NET, Java, and Python all get it wrong. The closest to the correct solution are Raku and Go.
Yes, in Czech and Slovak, the letter ď comes right after d. There are quite a few peculiarities, such as the digraphs ch, dz, dž.
#!/usr/bin/perl
use v5.30;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use open ":std", ":encoding(UTF-8)";
my @words = qw/čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom/;
my $col = Unicode::Collate::Locale->new(
    level => 3,
    locale => 'sk',
    normalization => 'NFKC',
);
my @sort_asc = $col->sort(@words);
say "@sort_asc";
The example sorts Slovak words; it contains plenty of challenges.
$ ./sort_accented_words.pl
auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka ďateľ drevo
džavot džem džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba
ihla jazva kľak klam kĺb lom márny mat mäta pól pot pôst pýr sýkorka šum
tŕň troska zem
Perl did not order the accented words correctly. Interestingly, it correctly ordered the words with ch, dz, dž digraphs.
#!/usr/bin/raku
my @words = <čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom>;
say @words.sort({ .unival, .NFKD[0], .fc });
This is a Raku example.
./sort_words.raku
(auto banán cesnak chobot chyba cibuľa čaj čerešňa červený čierny čučoriedka
drevo džavot džem džíp ďateľ elektrón fuj gejzír guma hora hrozno hôrny ihla
jazva klam kĺb kľak lom mat márny mäta pot pól pôst pýr sýkorka šum troska
tŕň zem)
Accented words are correctly sorted but the ch, dz, and dž digraphs are wrong.
So in my opinion, unless we create our own solution, we won't get a 100% correct output in any programming language.
A locale is just a set of rules. Here's the locale for cs from Collate::Locale 1.31. DUCET is the Default Unicode Collation Element Table.
The Ď may be a first class letter, but that's not what DUCET thinks. If you want different sorts, you can adjust your locale or supply your own.
+{
locale_version => 1.31,
entry => <<'ENTRY', # for DUCET v13.0.0
010D ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
0063 030C ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
010C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0043 030C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0063 0068 ; [.2076.0020.0002] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0063 0048 ; [.2076.0020.0007][.0000.0000.0002] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
0043 0068 ; [.2076.0020.0007][.0000.0000.0008] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.2076.0020.0008] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
0159 ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0072 030C ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0158 ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0052 030C ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0161 ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0073 030C ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0160 ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
0053 030C ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
017E ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
007A 030C ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
017D ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
005A 030C ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
ENTRY
};
If the default sort is not working for you, this common workaround is an easy do-it-yourself:
Make a sort-array by transforming your strings: if a and á should be equivalent, transform both to a; if á should follow a, transform it into a{, for example (any character that sorts after z works; note that [ sorts after Z but before a, so it is only suitable for upper-cased strings). Transform ch into h{, as it goes after h, if I understand correctly. Then sort the original array together with the sort-array.
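A rough sketch of that workaround in Python, under a few assumptions of mine: strings are compared in lower case, the padding character is "{" (the first ASCII character after "z"), and only ch, č, ř, š and ž are treated as separate letters, with all other diacritics ignored. It is an approximation, not a full Czech collation.

```python
import unicodedata

# Letters that sort as separate letters right after their base letter.
# "c{" sorts after every "c" + letter combination, because "{" follows "z".
SPECIAL = {
    "ch": "h{",
    "\u010d": "c{",  # č
    "\u0159": "r{",  # ř
    "\u0161": "s{",  # š
    "\u017e": "z{",  # ž
}

def czech_key(word):
    """Build an ASCII-ish sort key approximating Czech alphabetical order."""
    key = unicodedata.normalize("NFC", word).lower()
    for letter, replacement in SPECIAL.items():
        key = key.replace(letter, replacement)
    # Drop the remaining diacritics (á, ď, ů, ...) so they sort like the base letter.
    decomposed = unicodedata.normalize("NFD", key)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

words = ["chyba", "cibule", "\u010daj", "hora", "d\u016fm", "cesta", "ihla"]
print(sorted(words, key=czech_key))
```

Words that differ only in ignored diacritics get identical keys, so they keep their input order; add a tiebreaker to the key if that matters.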
Despite Czech being my native language, I don't know Czech collation perfectly. But surely, for ď, ť, ň and vowels with diacritics, the diacritic has lower significance than for other Czech characters like č.
Why? This is related to pronunciation. Barring assimilation and non-native words, all consonants but d, t and n have a clear pronunciation regardless of their context. (“Ch” is considered a separate letter.) Those three letters (D, T and N) can be “softened” when they are followed by “i”, “í” or “ě”. In those cases, they are pronounced as if they had a caron (háček). As a result, the diacritic is less significant for them.
Our product takes advantage of the ASCII folding token filter and our customers are asking for specific information about it. Specifically, they would like the mapping of Unicode characters to their ASCII equivalents. While I believe most conversions are obvious (e.g. ü = u), there are some "tricky" ones like ß, which I believe translates to "ss".
I've googled but have not been able to find a definitive mapping. Is there some place I can get this information?
Thanks for your help,
Eric
You can just read the source code for ASCIIFoldingFilter.
A sample from that source:
case '\u00C0': // À [LATIN CAPITAL LETTER A WITH GRAVE]
case '\u00C1': // Á [LATIN CAPITAL LETTER A WITH ACUTE]
case '\u00C2': // Â [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
case '\u00C3': // Ã [LATIN CAPITAL LETTER A WITH TILDE]
case '\u00C4': // Ä [LATIN CAPITAL LETTER A WITH DIAERESIS]
case '\u00C5': // Å [LATIN CAPITAL LETTER A WITH RING ABOVE]
case '\u0100': // Ā [LATIN CAPITAL LETTER A WITH MACRON]
case '\u0102': // Ă [LATIN CAPITAL LETTER A WITH BREVE]
case '\u0104': // Ą [LATIN CAPITAL LETTER A WITH OGONEK]
case '\u018F': // Ə http://en.wikipedia.org/wiki/Schwa [LATIN CAPITAL LETTER SCHWA]
case '\u01CD': // Ǎ [LATIN CAPITAL LETTER A WITH CARON]
case '\u01DE': // Ǟ [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
case '\u01E0': // Ǡ [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
case '\u01FA': // Ǻ [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
case '\u0200': // Ȁ [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
case '\u0202': // Ȃ [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
case '\u0226': // Ȧ [LATIN CAPITAL LETTER A WITH DOT ABOVE]
case '\u023A': // Ⱥ [LATIN CAPITAL LETTER A WITH STROKE]
case '\u1D00': // ᴀ [LATIN LETTER SMALL CAPITAL A]
case '\u1E00': // Ḁ [LATIN CAPITAL LETTER A WITH RING BELOW]
case '\u1EA0': // Ạ [LATIN CAPITAL LETTER A WITH DOT BELOW]
case '\u1EA2': // Ả [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
case '\u1EA4': // Ấ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
case '\u1EA6': // Ầ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
case '\u1EA8': // Ẩ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
case '\u1EAA': // Ẫ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
case '\u1EAC': // Ậ [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
case '\u1EAE': // Ắ [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
case '\u1EB0': // Ằ [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
case '\u1EB2': // Ẳ [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
case '\u1EB4': // Ẵ [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
case '\u1EB6': // Ặ [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
case '\u24B6': // Ⓐ [CIRCLED LATIN CAPITAL LETTER A]
case '\uFF21': // A [FULLWIDTH LATIN CAPITAL LETTER A]
output[outputPos++] = 'A';
break;
As you can see, it doesn't do anything to Greek and Cyrillic letters, let alone other ones.
Also, as you guessed correctly, ß gets converted into ss:
case '\u00DF': // ß [LATIN SMALL LETTER SHARP S]
output[outputPos++] = 's';
output[outputPos++] = 's';
break;
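If you only need a rough approximation outside of Lucene, something like the following Python sketch gets close for many accented Latin letters. To be clear, this is not the ASCIIFoldingFilter table: it relies on Unicode compatibility decomposition instead, and `ascii_fold` and `EXTRA` are names I made up.

```python
import unicodedata

# Characters with no Unicode decomposition need explicit entries; only ß is listed here.
EXTRA = {"\u00df": "ss"}

def ascii_fold(text):
    """Strip combining marks after NFKD decomposition; apply a few manual rules."""
    out = []
    for ch in text:
        if ch in EXTRA:
            out.append(EXTRA[ch])
            continue
        for part in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out)

print(ascii_fold("\u00c4rger \u00fcber Stra\u00dfe"))
```

Letters such as Ə or Ⱥ have no decomposition, so they pass through unchanged here, whereas the filter source above maps them to A explicitly.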
Why do the following three characters have asymmetric toLower/toUpper results?
/**
* Written in the Scala programming language, typed into the Scala REPL.
* Results commented accordingly.
*/
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"
For the first one, there is this explanation:
In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".
In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.
For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.
Characters need not necessarily have symmetric upper- and lower-case transformations.
Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):
Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.
In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
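The same asymmetry is easy to reproduce in Python, shown here as a small sketch (`round_trip` is just a helper name for the example). One difference from the Scala snippet: Python's `str.upper()` applies full case mappings, so ß upper-cases to the two-letter string "SS" rather than staying a single character.

```python
import unicodedata

def round_trip(ch):
    """Return (lowercased, lowercased-then-uppercased) for one character."""
    lower = ch.lower()
    return lower, lower.upper()

kelvin = "\u212a"     # KELVIN SIGN
capital_ss = "\u1e9e" # LATIN CAPITAL LETTER SHARP S

print(round_trip(kelvin))      # the round trip ends at the ordinary letter K
print(round_trip(capital_ss))  # the round trip ends at "SS"

# Normalization replaces the Kelvin sign with the regular capital K.
print(unicodedata.normalize("NFC", kelvin) == "K")
```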
May I refer to another post about Unicode and upper and lower case.
It is a common mistake to think that signs for a language have to be available in upper and lower case!
Unicode-correct title case in Java
Here's a website I found that will produce upside down versions of any English text.
how does it work? does unicode have upside down chars? Or what?
How can I write my own text flipping function?
how does it work? does unicode have upside down chars?
Unicode does have upside-down characters. They have "TURNED" in their name:
ƍ U+018D LATIN SMALL LETTER TURNED DELTA
Ɯ U+019C LATIN CAPITAL LETTER TURNED M
ǝ U+01DD LATIN SMALL LETTER TURNED E
Ʌ U+0245 LATIN CAPITAL LETTER TURNED V
ɐ U+0250 LATIN SMALL LETTER TURNED A
ɒ U+0252 LATIN SMALL LETTER TURNED ALPHA
ɥ U+0265 LATIN SMALL LETTER TURNED H
ɯ U+026F LATIN SMALL LETTER TURNED M
ɰ U+0270 LATIN SMALL LETTER TURNED M WITH LONG LEG
ɹ U+0279 LATIN SMALL LETTER TURNED R
ɺ U+027A LATIN SMALL LETTER TURNED R WITH LONG LEG
ɻ U+027B LATIN SMALL LETTER TURNED R WITH HOOK
ʇ U+0287 LATIN SMALL LETTER TURNED T
ʌ U+028C LATIN SMALL LETTER TURNED V
ʍ U+028D LATIN SMALL LETTER TURNED W
ʎ U+028E LATIN SMALL LETTER TURNED Y
ʞ U+029E LATIN SMALL LETTER TURNED K
ʮ U+02AE LATIN SMALL LETTER TURNED H WITH FISHHOOK
ʯ U+02AF LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
ʴ U+02B4 MODIFIER LETTER SMALL TURNED R
ʵ U+02B5 MODIFIER LETTER SMALL TURNED R WITH HOOK
ʻ U+02BB MODIFIER LETTER TURNED COMMA
̒ U+0312 COMBINING TURNED COMMA ABOVE
ჹ U+10F9 GEORGIAN LETTER TURNED GAN
ᴂ U+1D02 LATIN SMALL LETTER TURNED AE
ᴈ U+1D08 LATIN SMALL LETTER TURNED OPEN E
ᴉ U+1D09 LATIN SMALL LETTER TURNED I
ᴔ U+1D14 LATIN SMALL LETTER TURNED OE
ᴚ U+1D1A LATIN LETTER SMALL CAPITAL TURNED R
ᴟ U+1D1F LATIN SMALL LETTER SIDEWAYS TURNED M
ᵄ U+1D44 MODIFIER LETTER SMALL TURNED A
ᵆ U+1D46 MODIFIER LETTER SMALL TURNED AE
ᵌ U+1D4C MODIFIER LETTER SMALL TURNED OPEN E
ᵎ U+1D4E MODIFIER LETTER SMALL TURNED I
ᵚ U+1D5A MODIFIER LETTER SMALL TURNED M
ᵷ U+1D77 LATIN SMALL LETTER TURNED G
ᶛ U+1D9B MODIFIER LETTER SMALL TURNED ALPHA
ᶣ U+1DA3 MODIFIER LETTER SMALL TURNED H
ᶭ U+1DAD MODIFIER LETTER SMALL TURNED M WITH LONG LEG
ᶺ U+1DBA MODIFIER LETTER SMALL TURNED V
℩ U+2129 TURNED GREEK SMALL LETTER IOTA
Ⅎ U+2132 TURNED CAPITAL F
⅁ U+2141 TURNED SANS-SERIF CAPITAL G
⅂ U+2142 TURNED SANS-SERIF CAPITAL L
⅄ U+2144 TURNED SANS-SERIF CAPITAL Y
⅋ U+214B TURNED AMPERSAND
ⅎ U+214E TURNED SMALL F
⌙ U+2319 TURNED NOT SIGN
❛ U+275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
❝ U+275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
⦢ U+29A2 TURNED ANGLE
Ɐ U+2C6F LATIN CAPITAL LETTER TURNED A
ⱹ U+2C79 LATIN SMALL LETTER TURNED R WITH TAIL
ⱻ U+2C7B LATIN LETTER SMALL CAPITAL TURNED E
Ꝿ U+A77E LATIN CAPITAL LETTER TURNED INSULAR G
ꝿ U+A77F LATIN SMALL LETTER TURNED INSULAR G
Ꞁ U+A780 LATIN CAPITAL LETTER TURNED L
ꞁ U+A781 LATIN SMALL LETTER TURNED L
However, it's far from a complete set. Most upside-down text works by choosing characters that happen to have a close-enough resemblance to upside-down letters. It's the equivalent of typing 0.7734 on your calculator to spell "hELLO".
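A list like the one above can be regenerated with a few lines of Python, sketched here. The exact set you get depends on the Unicode version bundled with your Python, so newer interpreters may report more characters than shown in this answer.

```python
import sys
import unicodedata

def turned_characters(limit=sys.maxunicode + 1):
    """Return (code point, name) pairs whose name contains the word TURNED."""
    found = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if "TURNED" in name.split():
            found.append((cp, name))
    return found

for cp, name in turned_characters(0x0300):
    print(f"{chr(cp)} U+{cp:04X} {name}")
```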
does unicode have upside down chars?
Yup! Or at least characters that look like they are upside down. Also, regular English-alphabetical characters can appear to be upside down. Like u could be an upside-down n.
To code it up, you just have to take an array of characters, display them in reverse order and replace those characters with the upside down version of them. This will get you a good start: zʎxʍʌnʇsɹbdouɯןʞſıɥbɟǝpɔqɐ
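Here is a short Python sketch of that idea: reverse the string, then swap each character for a look-alike. The table is deliberately partial and only covers letters with an obvious counterpart; unmapped characters are passed through as they are.

```python
# Partial table of look-alikes; extend it as needed.
FLIP = {
    "a": "\u0250", "b": "q", "c": "\u0254", "d": "p", "e": "\u01dd",
    "h": "\u0265", "i": "\u0131", "k": "\u029e", "m": "\u026f", "n": "u",
    "p": "d", "q": "b", "r": "\u0279", "t": "\u0287", "u": "n",
    "v": "\u028c", "w": "\u028d", "y": "\u028e",
}

def flip(text):
    """Reverse the text and replace each character with its upside-down look-alike."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))

print(flip("upside-down"))
```

Lower-casing first keeps the table small, at the cost of losing capitals.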
When 'uʍop-ǝpısdn' is copied and echoed into a hex dump program, the string is seen as:
75 CA 8D 6F 70 2D C7 9D 70 C4 B1 73 64 6E
The UTF-8 breakdown of that is:
0x75 = U+0075 = LATIN SMALL LETTER U
0xCA 0x8D = U+028D = LATIN SMALL LETTER TURNED W
0x6F = U+006F = LATIN SMALL LETTER O
0x70 = U+0070 = LATIN SMALL LETTER P
0x2D = U+002D = HYPHEN-MINUS
0xC7 0x9D = U+01DD = LATIN SMALL LETTER TURNED E
0x70 = U+0070 = LATIN SMALL LETTER P
0xC4 0xB1 = U+0131 = LATIN SMALL LETTER DOTLESS I
0x73 = U+0073 = LATIN SMALL LETTER S
0x64 = U+0064 = LATIN SMALL LETTER D
0x6E = U+006E = LATIN SMALL LETTER N
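If you don't have a hex dump program at hand, the same breakdown can be produced in Python. This is a small sketch; `utf8_breakdown` is a made-up helper, and `bytes.hex(" ")` with a separator needs Python 3.8 or later.

```python
import unicodedata

def utf8_breakdown(text):
    """Return one 'bytes = code point = name' line per character."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8").hex(" ").upper()
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(f"{encoded} = U+{ord(ch):04X} = {name}")
    return rows

for row in utf8_breakdown("u\u028dop-\u01ddp\u0131sdn"):
    print(row)
```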
They are just Unicode characters.
Look at the source of the web page:
function flip() {
    var result = flipString(document.f.original.value);
    document.f.flipped.value = result;
}
function flipString(aString) {
    aString = aString.toLowerCase();
    var last = aString.length - 1;
    var result = "";
    for (var i = last; i >= 0; --i) {
        result += flipChar(aString.charAt(i))
    }
    return result;
}
function flipChar(c) {
    if (c == 'a') {
        return '\u0250'
    }
    else if (c == 'b') {
        return 'q'
    }
    else if (c == 'c') {
        return '\u0254' //Open o -- copied from pne
There is the "upsidedown" Python module: https://pypi.org/project/upsidedown/. It supports non-English characters too.