Sorting Czech in Perl


I have the following perl program
use 5.014_001;
use utf8;
use Unicode::Collate::Locale;
require 'Unicode/Collate/Locale/cs.pl';
binmode STDOUT, ':encoding(UTF-8)';
my @old_list = (
"cash",
"Cash",
"cat",
"Cat",
"čash",
"dash",
"Dash",
"Ďash",
"database",
"Database",
);
my $col= Unicode::Collate::Locale->new(
level => 3,
locale => 'cs',
normalization => 'NFD',
);
my @list = $col->sort(@old_list);
foreach my $item (@list){
print $item, "\n";
}
This program prints out the output:
cash
Cash
cat
Cat
čash
dash
Dash
Ďash
database
Database
I believe that a careful observer would have to conclude that either:
(1) in Czech, č is a first-class letter while Ď is not; or
(2) the Unicode::Collate::Locale sorting of Czech in Perl is not correct.
I'd like to believe (1), and the following bolsters my case:
http://en.wiktionary.org/wiki/Index_talk:Czech
where it says:
Let us sort the entries by the existing Czech conventions, as far as practicable. That is, only the following characters have any sorting significance:
a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž
But I'm confused, because I thought "D with a v over it" (and its lowercase equivalent) is a first-class letter of the Czech alphabet.
Where is @tchrist when I need him?
I'd appreciate any insights on this.

I have not yet seen a language that correctly orders Czech or Slovak words. (The Slovak alphabet is quite similar to the Czech one.) .NET, Java, and Python all get it wrong. The closest to the correct solution are Raku and Go.
Yes, in Czech and Slovak, the letter ď comes right after d. There are quite a few peculiarities, such as the digraphs ch, dz, dž.
#!/usr/bin/perl
use v5.30;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use open ":std", ":encoding(UTF-8)";
my @words = qw/čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom/;
my $col = Unicode::Collate::Locale->new(
level => 3,
locale => 'sk',
normalization => 'NFKC',
);
my @sort_asc = $col->sort(@words);
say "@sort_asc";
The example sorts Slovak words; it contains plenty of challenges.
$ ./sort_accented_words.pl
auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka ďateľ drevo
džavot džem džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba
ihla jazva kľak klam kĺb lom márny mat mäta pól pot pôst pýr sýkorka šum
tŕň troska zem
Perl did not order the accented words correctly. Interestingly, it correctly ordered the words with ch, dz, dž digraphs.
#!/usr/bin/raku
my @words = <čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom>;
say @words.sort({ .unival, .NFKD[0], .fc });
This is a Raku example.
./sort_words.raku
(auto banán cesnak chobot chyba cibuľa čaj čerešňa červený čierny čučoriedka
drevo džavot džem džíp ďateľ elektrón fuj gejzír guma hora hrozno hôrny ihla
jazva klam kĺb kľak lom mat márny mäta pot pól pôst pýr sýkorka šum troska
tŕň zem)
Accented words are correctly sorted but the ch, dz, and dž digraphs are wrong.
So in my opinion, unless we create our own solution, we won't get a 100% correct output in any programming language.

A locale is just a set of rules. Here's the cs locale tailoring from Unicode::Collate::Locale 1.31. DUCET is the Default Unicode Collation Element Table.
The Ď may be a first class letter, but that's not what DUCET thinks. If you want different sorts, you can adjust your locale or supply your own.
+{
locale_version => 1.31,
entry => <<'ENTRY', # for DUCET v13.0.0
010D ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
0063 030C ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
010C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0043 030C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0063 0068 ; [.2076.0020.0002] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0063 0048 ; [.2076.0020.0007][.0000.0000.0002] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
0043 0068 ; [.2076.0020.0007][.0000.0000.0008] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.2076.0020.0008] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
0159 ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0072 030C ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0158 ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0052 030C ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0161 ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0073 030C ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0160 ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
0053 030C ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
017E ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
007A 030C ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
017D ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
005A 030C ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
ENTRY
};

If the default sort is not working for you, this common workaround is an easy do-it-yourself:
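Make a sort-array by transforming your strings: if a and á should be equivalent, transform both to a; if á should follow a, transform it into a followed by a character that sorts after every letter you use (for example a{ for lower-case keys; [ only works if the keys are upper-cased, because in ASCII it sorts after Z but before a). Transform ch into h followed by that same character, as it goes after h, if I understand correctly. Then sort the original array together with the sort-array.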
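A minimal Python sketch of this transform-then-sort workaround, under simplified assumptions: only č, ř, š, ž and the digraph ch get their own slot (following the alphabet quoted in the question), all other diacritics and case are ignored, and `{` is the marker because it sorts after `z` in ASCII. The names `SEPARATE` and `czech_key` are made up for illustration; this is not a full implementation of Czech collation.

```python
import unicodedata

# Letters that get their own slot right after the base letter (simplified assumption).
SEPARATE = {"č": "c{", "ř": "r{", "š": "s{", "ž": "z{"}

def czech_key(word: str) -> str:
    """Build an ASCII sort key: fold case, give č/ř/š/ž/ch their own slot, drop other accents."""
    text = unicodedata.normalize("NFC", word).casefold()
    out = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            out.append("h{")  # the digraph ch sorts after h
            i += 2
            continue
        ch = text[i]
        if ch in SEPARATE:
            out.append(SEPARATE[ch])
        else:
            # Strip remaining diacritics (á -> a, ď -> d, ů -> u, ...).
            base = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in base if not unicodedata.combining(c)))
        i += 1
    return "".join(out)

words = ["chyba", "cibule", "čaj", "hora", "ihla", "dům", "ďábel", "cesta"]
print(sorted(words, key=czech_key))
# ['cesta', 'cibule', 'čaj', 'ďábel', 'dům', 'hora', 'chyba', 'ihla']
```

Words that differ only in ignored accents or case get identical keys here, so they stay in input order; a real collator would break such ties at a secondary or tertiary level.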

Despite Czech being my native language, I don't know Czech collation perfectly. But surely, for ď, ť, ň and vowels with diacritics, the diacritics have a lower significance than for other Czech characters like č.
Why? This is related to pronunciation. Barring assimilation and non-native words, all consonants but d, t and n have a clear pronunciation regardless of their context. (“Ch” is considered a separate letter.) Those three letters (D, T and N) can be “softened” when they are followed by “i”, “í” or “ě”. In those cases, they are pronounced as if they had a caron (háček). As a result, the diacritics on them are less significant.

Related

What is the difference between ö and ö?

The following characters look alike. But they are not the same. I can not visually see their difference. Could anybody let me know what their difference is? Why are there two Unicode characters that are so similar?
$ xxd <<< ö
00000000: c3b6 0a ...
$ xxd <<< ö
00000000: 6fcc 880a o...

The first is a single Unicode code point, while the second is two Unicode code points. They are two forms of the same glyph (examples in Python):
import unicodedata as ud
o1 = 'ö' # '\xf6'
o2 = 'ö' # 'o\u0308'
for c in o1:
    print(f'U+{ord(c):04X} {ud.name(c)}')
print()
for c in o2:
    print(f'U+{ord(c):04X} {ud.name(c)}')
U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
U+006F LATIN SMALL LETTER O
U+0308 COMBINING DIAERESIS
Ensure the two strings are in the same normalization form (either composed or decomposed) for comparison:
print(ud.normalize('NFC',o1) == ud.normalize('NFC',o2))
print(ud.normalize('NFD',o1) == ud.normalize('NFD',o2))
True
True

How to match Unicode vowels?

What character class or Unicode property will match any Unicode vowel in Perl?
Wrong answer: [aeiouAEIOU]. (sermon here, item #24 in the laundry list)
perluniprops mentions vowels only for Hangul and Indic scripts.
Let's set aside the question what a vowel is. Yes, i may not be a vowel in some contexts. So, any character that can be a vowel will do.

There's no such property.
$ uniprops --all a
U+0061 <a> \N{LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
Age=1.1 Age=V1_1 Block=Basic_Latin Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Bidi_Paired_Bracket_Type=None Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR
Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Indic_Positional_Category=NA InPC=NA
Indic_Syllabic_Category=Other InSC=Other Joining_Group=No_Joining_Group JG=NoJoiningGroup
Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
Present_In=6.1 IN=6.1 Present_In=6.2 IN=6.2 Present_In=6.3 IN=6.3 Present_In=7.0 IN=7.0
Present_In=8.0 IN=8.0 SC=Latn Script=Latn Script_Extensions=Latin Scx=Latn
Script_Extensions=Latn Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE
Word_Break=LE
The most important thing when dealing with i18n is to think about what you actually need, yet you didn't even mention what you are trying to accomplish.
Find vowels? That can't be what you are actually trying to do. I could see a use for identifying vowel sounds in a word, but those are often formed from multiple letters (such as "oo" in English, and "in", "an"/"en", "ou", "ai", "au"/"eau", "eu" in French), and it would be language-specific.
As it stands, you're asking for a global solution but you're defining the problem in local terms. You first need to start by defining the actual problem you are trying to solve.

Setting aside the definition of a vowel and the obvious problem that different languages share symbols but use them differently, there's a way that you can define your own property for use in a Perl pattern.
Define a subroutine whose name starts with In or Is and specify the characters that can be in it. The simplest form is one code number per line, or a range of code numbers separated by horizontal whitespace:
#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);
sub InSpecial {
return <<"HERE";
00A7
00B6
2295\t229C
HERE
}
$_ = "ABC\x{00A7}";
say $_;
say /\p{InSpecial}/ ? 'Matched' : 'Missed';
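For comparison, a rough Python analogue. Python's standard `re` module has no `\p{...}` properties, so this sketch builds an ordinary character class from the same code points and range instead. The names `SPECIAL_RANGES`, `char_class` and `IN_SPECIAL` are made up for illustration.

```python
import re

# Same code points as InSpecial above: two single characters and one inclusive range.
SPECIAL_RANGES = [(0x00A7, 0x00A7), (0x00B6, 0x00B6), (0x2295, 0x229C)]

def char_class(ranges):
    """Turn (start, end) code point pairs into a regex character class."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(re.escape(chr(start)))
        else:
            parts.append(f"{re.escape(chr(start))}-{re.escape(chr(end))}")
    return "[" + "".join(parts) + "]"

IN_SPECIAL = re.compile(char_class(SPECIAL_RANGES))

text = "ABC\u00a7"
print("Matched" if IN_SPECIAL.search(text) else "Missed")  # Matched
```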

First of all, not all written languages have "vowels". For one example, 中文 (Zhōngwén) (written Chinese) does not, as it is ideogrammatic instead of phonetic. For another example, Japanese mostly doesn't; it uses mostly consonant+vowel hiragana or katakana syllabics such as "ga", "wa", "tsu" instead.
And some written languages (for example, Hindi, Bangla, Greek, Russian) do have vowels, but use characters which are not easily mappable to aeiou. For such languages you'd have to find (search metacpan?) or make look-up tables specifying which letters are "vowels".
But if you're dealing with any written language based even loosely on the Latin alphabet (abcdefghijklmnopqrstuvwxyz), even if the language uses tons of diacritics (called "combining marks" in Perl and Unicode circles) (eg, Vietnamese), you can easily map those to "vowel" or "not-vowel", yes. The way is to "normalize-to-fully-decomposed-form", then strip-out all the combining marks, then fold-case, then compare each letter to regex /[aeiou]/. The following Perl script will find most-or-all "vowels" in any language using a Latin-based alphabet:
#!/usr/bin/perl -CSDA
# vowel-count.pl
use v5.20;
use Unicode::Normalize 'NFD';
my $vcount;
while (<>)
{
$_ =~ s/[\r\n]+$//;
say "\nRaw string: $_";
my $decomposed = NFD $_;
my $stripped = ($decomposed =~ s/\pM//gr);
say "Stripped string: $stripped";
my $folded = fc $stripped;
my @base_letters = split //, $folded; # split the case-folded text so capitals count too
$vcount = 0;
/[aeiou]/ and ++$vcount for @base_letters;
say "# of vowels: $vcount";
}
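The same decompose, strip marks, fold case, compare pipeline can be sketched in Python with only the standard library. The function name `count_latin_vowels` is made up for illustration, and like the Perl script it only makes sense for Latin-based alphabets.

```python
import unicodedata

def count_latin_vowels(text: str) -> int:
    """Count base letters a, e, i, o, u after removing combining marks and case."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character in a Mark category (Mn, Mc, Me), like s/\pM//g in Perl.
    stripped = "".join(
        c for c in decomposed if not unicodedata.category(c).startswith("M")
    )
    folded = stripped.casefold()
    return sum(1 for c in folded if c in "aeiou")

print(count_latin_vowels("Tiếng Việt"))  # 4
print(count_latin_vowels("ÉCOLE"))       # 3
```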

Multilingual text sorting in Perl, on Windows, using locale

I am building a piece of software for sorting book indexes in different languages. It uses Perl, and keys off of the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or by relying on locale, am I barking up the wrong tree? Bottom line, Windows is really where I need this to work, but I am more comfortable developing in my UNIX environment.

Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, then it is easy to use the Unicode::Collate module as a starting point.
If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.
Decoding into Unicode
If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:
use Encode;
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_) } @ARGV;
# or as a stream for binmode or open
binmode $some_fh, ":encoding(locale)";
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
(If it were me, I would just use ":utf8" for the output.)
Standard Collation, plus locales and tailoring
The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:
use v5.14;
use utf8;
use Unicode::Collate;
my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
@exes = Unicode::Collate->new->sort(@exes);
say "@exes";
# prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.
my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x; # strip leading articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg; # zero-pad numbers
        return $_;
    },
);
Now just use that object’s sort method to sort with.
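To see what that preprocessing step does on its own, here is a small Python sketch of the same two substitutions used as a sort key. It is only an illustration: `title_key` is a made-up name, and the final comparison is a plain code point sort, not Unicode collation.

```python
import re

def title_key(title: str) -> str:
    # Strip one leading English article, as the preprocess step above does.
    key = re.sub(r"^(?:The|An?)[ \t]+", "", title)
    # Zero-pad every run of digits to 20 places so numbers compare by value.
    return re.sub(r"\d+", lambda m: m.group().zfill(20), key)

titles = ["The Zebra", "Chapter 10", "A Chapter 9", "An Apple"]
print(sorted(titles, key=title_key))
# ['An Apple', 'A Chapter 9', 'Chapter 10', 'The Zebra']
```

Without the zero-padding, "Chapter 10" would sort before "Chapter 9" because "1" compares lower than "9".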
Sometimes you need to turn the sort inside out. For example:
my $collator = Unicode::Collate->new();
for my $rec (@recs) {
$rec->{NAME_key} =
$collator->getSortKey( $rec->{NAME} );
}
@srecs = sort {
$b->{AGE} <=> $a->{AGE}
||
$a->{NAME_key} cmp $b->{NAME_key}
} @recs;
The reason you have to do that is because you are sorting on a record with various fields. The binary sort key allows you to use the cmp operator on data that has been through your chosen/custom collator object.
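The same precompute-a-key-per-record pattern, sketched in Python. The `name_key` function here is a hypothetical stand-in for a real collator's sort key (it merely ignores accents and case), and the sample records are invented for illustration.

```python
import unicodedata

def name_key(name: str) -> str:
    """Hypothetical stand-in for a collator sort key: accent- and case-insensitive."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

recs = [
    {"NAME": "Émile", "AGE": 30},
    {"NAME": "adam", "AGE": 30},
    {"NAME": "Zoë", "AGE": 41},
]

# Compute each key once and store it on the record.
for rec in recs:
    rec["NAME_key"] = name_key(rec["NAME"])

# Sort by age descending, then by the precomputed name key ascending.
srecs = sorted(recs, key=lambda r: (-r["AGE"], r["NAME_key"]))
print([r["NAME"] for r in srecs])  # ['Zoë', 'adam', 'Émile']
```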
The full constructor for the collator object has all this for a formal syntax:
$Collator = Unicode::Collate->new(
UCA_Version => $UCA_Version,
alternate => $alternate, # alias for 'variable'
backwards => $levelNumber, # or \@levelNumbers
entry => $element,
hangul_terminator => $term_primary_weight,
highestFFFF => $bool,
identical => $bool,
ignoreName => qr/$ignoreName/,
ignoreChar => qr/$ignoreChar/,
ignore_level2 => $bool,
katakana_before_hiragana => $bool,
level => $collationLevel,
minimalFFFE => $bool,
normalization => $normalization_form,
overrideCJK => \&overrideCJK,
overrideHangul => \&overrideHangul,
preprocess => \&preprocess,
rearrange => \@charList,
rewrite => \&rewrite,
suppress => \@charList,
table => $filename,
undefName => qr/$undefName/,
undefChar => qr/$undefChar/,
upper_before_lower => $bool,
variable => $variable,
);
But you usually don’t have to worry about most of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.
use Unicode::Collate::Locale;
$coll = Unicode::Collate::Locale->
new(locale => "fr");
@french_text = $coll->sort(@french_text);
See how easy that is?
But you can do other cool things, too.
use Unicode::Collate::Locale;
my $Collator = new Unicode::Collate::Locale::
locale => "de__phonebook",
level => 1,
normalization => undef,
;
my $full = "Ich müß Perl studieren.";
my $sub = "MUESS";
if (my ($pos,$len) = $Collator->index($full, $sub)) {
my $match = substr($full, $pos, $len);
say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
}
When run, that says:
Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:
locale name description
--------------------------------------------------------------
af Afrikaans
ar Arabic
as Assamese
az Azerbaijani (Azeri)
be Belarusian
bg Bulgarian
bn Bengali
bs Bosnian
bs_Cyrl Bosnian in Cyrillic (tailored as Serbian)
ca Catalan
cs Czech
cy Welsh
da Danish
de__phonebook German (umlaut as 'ae', 'oe', 'ue')
ee Ewe
eo Esperanto
es Spanish
es__traditional Spanish ('ch' and 'll' as a grapheme)
et Estonian
fa Persian
fi Finnish (v and w are primary equal)
fi__phonebook Finnish (v and w as separate characters)
fil Filipino
fo Faroese
fr French
gu Gujarati
ha Hausa
haw Hawaiian
hi Hindi
hr Croatian
hu Hungarian
hy Armenian
ig Igbo
is Icelandic
ja Japanese [1]
kk Kazakh
kl Kalaallisut
kn Kannada
ko Korean [2]
kok Konkani
ln Lingala
lt Lithuanian
lv Latvian
mk Macedonian
ml Malayalam
mr Marathi
mt Maltese
nb Norwegian Bokmal
nn Norwegian Nynorsk
nso Northern Sotho
om Oromo
or Oriya
pa Punjabi
pl Polish
ro Romanian
ru Russian
sa Sanskrit
se Northern Sami
si Sinhala
si__dictionary Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk Slovak
sl Slovenian
sq Albanian
sr Serbian
sr_Latn Serbian in Latin (tailored as Croatian)
sv Swedish (v and w are primary equal)
sv__reformed Swedish (v and w as separate characters)
ta Tamil
te Telugu
th Thai
tn Tswana
to Tonga
tr Turkish
uk Ukrainian
ur Urdu
vi Vietnamese
wae Walser
wo Wolof
yo Yoruba
zh Chinese
zh__big5han Chinese (ideographs: big5 order)
zh__gb2312han Chinese (ideographs: GB-2312 order)
zh__pinyin Chinese (ideographs: pinyin order) [3]
zh__stroke Chinese (ideographs: stroke order) [3]
zh__zhuyin Chinese (ideographs: zhuyin order) [3]
Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).
Note
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.
[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.
So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.
Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)

Win32::OLE::NLS gives you access to that part of the system. It provides you CompareString and the necessary tools to obtain the necessary locale id.
In case you want/need to locate the system documentation, the underlying system call is named CompareStringEx.

How does uʍop-ǝpᴉsdn text work?

Here's a website I found that will produce upside down versions of any English text.
how does it work? does unicode have upside down chars? Or what?
How can I write my own text flipping function?

how does it work? does unicode have upside down chars?
Unicode does have upside-down characters. They have "TURNED" in their name:
ƍ U+018D LATIN SMALL LETTER TURNED DELTA
Ɯ U+019C LATIN CAPITAL LETTER TURNED M
ǝ U+01DD LATIN SMALL LETTER TURNED E
Ʌ U+0245 LATIN CAPITAL LETTER TURNED V
ɐ U+0250 LATIN SMALL LETTER TURNED A
ɒ U+0252 LATIN SMALL LETTER TURNED ALPHA
ɥ U+0265 LATIN SMALL LETTER TURNED H
ɯ U+026F LATIN SMALL LETTER TURNED M
ɰ U+0270 LATIN SMALL LETTER TURNED M WITH LONG LEG
ɹ U+0279 LATIN SMALL LETTER TURNED R
ɺ U+027A LATIN SMALL LETTER TURNED R WITH LONG LEG
ɻ U+027B LATIN SMALL LETTER TURNED R WITH HOOK
ʇ U+0287 LATIN SMALL LETTER TURNED T
ʌ U+028C LATIN SMALL LETTER TURNED V
ʍ U+028D LATIN SMALL LETTER TURNED W
ʎ U+028E LATIN SMALL LETTER TURNED Y
ʞ U+029E LATIN SMALL LETTER TURNED K
ʮ U+02AE LATIN SMALL LETTER TURNED H WITH FISHHOOK
ʯ U+02AF LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
ʴ U+02B4 MODIFIER LETTER SMALL TURNED R
ʵ U+02B5 MODIFIER LETTER SMALL TURNED R WITH HOOK
ʻ U+02BB MODIFIER LETTER TURNED COMMA
̒ U+0312 COMBINING TURNED COMMA ABOVE
ჹ U+10F9 GEORGIAN LETTER TURNED GAN
ᴂ U+1D02 LATIN SMALL LETTER TURNED AE
ᴈ U+1D08 LATIN SMALL LETTER TURNED OPEN E
ᴉ U+1D09 LATIN SMALL LETTER TURNED I
ᴔ U+1D14 LATIN SMALL LETTER TURNED OE
ᴚ U+1D1A LATIN LETTER SMALL CAPITAL TURNED R
ᴟ U+1D1F LATIN SMALL LETTER SIDEWAYS TURNED M
ᵄ U+1D44 MODIFIER LETTER SMALL TURNED A
ᵆ U+1D46 MODIFIER LETTER SMALL TURNED AE
ᵌ U+1D4C MODIFIER LETTER SMALL TURNED OPEN E
ᵎ U+1D4E MODIFIER LETTER SMALL TURNED I
ᵚ U+1D5A MODIFIER LETTER SMALL TURNED M
ᵷ U+1D77 LATIN SMALL LETTER TURNED G
ᶛ U+1D9B MODIFIER LETTER SMALL TURNED ALPHA
ᶣ U+1DA3 MODIFIER LETTER SMALL TURNED H
ᶭ U+1DAD MODIFIER LETTER SMALL TURNED M WITH LONG LEG
ᶺ U+1DBA MODIFIER LETTER SMALL TURNED V
℩ U+2129 TURNED GREEK SMALL LETTER IOTA
Ⅎ U+2132 TURNED CAPITAL F
⅁ U+2141 TURNED SANS-SERIF CAPITAL G
⅂ U+2142 TURNED SANS-SERIF CAPITAL L
⅄ U+2144 TURNED SANS-SERIF CAPITAL Y
⅋ U+214B TURNED AMPERSAND
ⅎ U+214E TURNED SMALL F
⌙ U+2319 TURNED NOT SIGN
❛ U+275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
❝ U+275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
⦢ U+29A2 TURNED ANGLE
Ɐ U+2C6F LATIN CAPITAL LETTER TURNED A
ⱹ U+2C79 LATIN SMALL LETTER TURNED R WITH TAIL
ⱻ U+2C7B LATIN LETTER SMALL CAPITAL TURNED E
Ꝿ U+A77E LATIN CAPITAL LETTER TURNED INSULAR G
ꝿ U+A77F LATIN SMALL LETTER TURNED INSULAR G
Ꞁ U+A780 LATIN CAPITAL LETTER TURNED L
ꞁ U+A781 LATIN SMALL LETTER TURNED L
However, it's far from a complete set. Most upside-down text works by choosing characters that happen to have a close-enough resemblance to upside-down letters. It's the equivalent of typing 0.7734 on your calculator to spell "hELLO".

does unicode have upside down chars?
Yup! Or at least characters that look like they are upside down. Also, regular English-alphabetical characters can appear to be upside down. Like u could be an upside-down n.
To code it up, you just have to take an array of characters, display them in reverse order and replace those characters with the upside down version of them. This will get you a good start: zʎxʍʌnʇsɹbdouɯןʞſıɥbɟǝpɔqɐ
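A minimal Python sketch of that reverse-and-replace approach. The table is deliberately partial (an assumption for brevity): it covers only letters with a turned or look-alike counterpart, and anything else passes through unchanged. `FLIP` and `flip_text` are illustrative names.

```python
# Partial lookup table: each letter maps to a turned or look-alike character.
FLIP = {
    "a": "\u0250",  # LATIN SMALL LETTER TURNED A
    "b": "q",
    "c": "\u0254",  # LATIN SMALL LETTER OPEN O
    "d": "p",
    "e": "\u01dd",  # LATIN SMALL LETTER TURNED E
    "h": "\u0265",  # LATIN SMALL LETTER TURNED H
    "i": "\u1d09",  # LATIN SMALL LETTER TURNED I
    "k": "\u029e",  # LATIN SMALL LETTER TURNED K
    "m": "\u026f",  # LATIN SMALL LETTER TURNED M
    "n": "u",
    "p": "d",
    "q": "b",
    "r": "\u0279",  # LATIN SMALL LETTER TURNED R
    "t": "\u0287",  # LATIN SMALL LETTER TURNED T
    "u": "n",
    "v": "\u028c",  # LATIN SMALL LETTER TURNED V
    "w": "\u028d",  # LATIN SMALL LETTER TURNED W
    "y": "\u028e",  # LATIN SMALL LETTER TURNED Y
}

def flip_text(text: str) -> str:
    """Lower-case the text, reverse it, and swap each character for its flipped form."""
    return "".join(FLIP.get(ch, ch) for ch in reversed(text.lower()))

print(flip_text("upside-down"))  # uʍop-ǝpᴉsdn
```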

When 'uʍop-ǝpısdn' is copied and echoed into a hex dump program, the string is seen as:
75 CA 8D 6F 70 2D C7 9D 70 C4 B1 73 64 6E
The UTF-8 breakdown of that is:
0x75 = U+0075 = LATIN SMALL LETTER U
0xCA 0x8D = U+028D = LATIN SMALL LETTER TURNED W
0x6F = U+006F = LATIN SMALL LETTER O
0x70 = U+0070 = LATIN SMALL LETTER P
0x2D = U+002D = HYPHEN-MINUS
0xC7 0x9D = U+01DD = LATIN SMALL LETTER TURNED E
0x70 = U+0070 = LATIN SMALL LETTER P
0xC4 0xB1 = U+0131 = LATIN SMALL LETTER DOTLESS I
0x73 = U+0073 = LATIN SMALL LETTER S
0x64 = U+0064 = LATIN SMALL LETTER D
0x6E = U+006E = LATIN SMALL LETTER N
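The breakdown above can be reproduced with a few lines of Python, shown here as a sketch: it decodes the hex dump as UTF-8 and looks up each character's name with the standard `unicodedata` module.

```python
import unicodedata

# The bytes from the hex dump above.
raw = bytes.fromhex("75 CA 8D 6F 70 2D C7 9D 70 C4 B1 73 64 6E")
text = raw.decode("utf-8")
names = [unicodedata.name(ch) for ch in text]

for ch, name in zip(text, names):
    # Show the UTF-8 bytes of each character next to its code point and name.
    encoded = " ".join(f"0x{b:02X}" for b in ch.encode("utf-8"))
    print(f"{encoded:<9} = U+{ord(ch):04X} = {name}")
```

The 14 bytes decode to 11 characters, because three of them (the turned w, the turned e and the dotless i) take two bytes each in UTF-8.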

They are just unicode characters.

Look at the source of the web page:
function flip() {
var result = flipString(document.f.original.value);
document.f.flipped.value = result;
}
function flipString(aString) {
aString = aString.toLowerCase();
var last = aString.length - 1;
var result = "";
for (var i = last; i >= 0; --i) {
result += flipChar(aString.charAt(i))
}
return result;
}
function flipChar(c) {
if (c == 'a') {
return '\u0250'
}
else if (c == 'b') {
return 'q'
}
else if (c == 'c') {
return '\u0254' //Open o -- copied from pne

There is the "upsidedown" Python module: https://pypi.org/project/upsidedown/. It supports non-English characters too.

Map between LaTeX commands and Unicode points

Is anyone aware of where I could find a table mapping LaTeX commands to Unicode code points? eg: \le is 0x2264. I'm looking for something as comprehensive as possible.

The document I've used before is this XML file from the W3C. It maps Unicode to HTML, MathML, LaTeX, Mathematica, and others. (The file is 1.4 MB, uncompressed.)
You can read more about it here: http://www.w3.org/TR/unicode-xml/

I once cooked up this for a report generator written in Java (hence the Java String literals):
'\\'(REVERSE SOLIDUS) "\\textbackslash{}"
'^'(CIRCUMFLEX ACCENT) "$\\uparrow$"
'_'(LOW LINE) "\\textunderscore{}"
'|'(VERTICAL LINE) "\\vline{}"
'~'(TILDE) "\\textasciitilde{}" "~"
'§'(SECTION SIGN) "\\S{}"
'ª'(FEMININE ORDINAL INDICATOR) "$^a$"
''(SOFT HYPHEN) "\\-"
'²'(SUPERSCRIPT TWO) "$^2$"
'³'(SUPERSCRIPT THREE) "$^3$"
'·'(MIDDLE DOT) "$\\cdot$"
'¹'(SUPERSCRIPT ONE) "$^1$"
'º'(MASCULINE ORDINAL INDICATOR) "$^o$"
'\u013a'(LATIN SMALL LETTER L WITH ACUTE) "\\'l"
'\u013b'(LATIN CAPITAL LETTER L WITH CEDILLA) "\\c{L}"
'\u013c'(LATIN SMALL LETTER L WITH CEDILLA) "\\c{l}"
'\u013d'(LATIN CAPITAL LETTER L WITH CARON) "\\v{L}"
'\u013e'(LATIN SMALL LETTER L WITH CARON) "\\v{l}"
'\u013f'(LATIN CAPITAL LETTER L WITH MIDDLE DOT) "L\\hspace{-0.35em}$\\cdot$"
'\u0140'(LATIN SMALL LETTER L WITH MIDDLE DOT) "l$\\cdot$"
'\u0141'(LATIN CAPITAL LETTER L WITH STROKE) "\\L{}"
'\u0142'(LATIN SMALL LETTER L WITH STROKE) "\\l{}"
'\u0143'(LATIN CAPITAL LETTER N WITH ACUTE) "\\'N"
'\u0144'(LATIN SMALL LETTER N WITH ACUTE) "\\'n"
'\u0145'(LATIN CAPITAL LETTER N WITH CEDILLA) "\\c{N}"
'\u0146'(LATIN SMALL LETTER N WITH CEDILLA) "\\c{n}"
'\u0147'(LATIN CAPITAL LETTER N WITH CARON) "\\v{N}"
'\u0148'(LATIN SMALL LETTER N WITH CARON) "\\v{n}"
'\u0149'(LATIN SMALL LETTER N PRECEDED BY APOSTROPHE) "'n"
'\u014c'(LATIN CAPITAL LETTER O WITH MACRON) "\\={O}"
'\u014d'(LATIN SMALL LETTER O WITH MACRON) "\\={o}"
'\u014e'(LATIN CAPITAL LETTER O WITH BREVE) "\\u{O}"
'\u014f'(LATIN SMALL LETTER O WITH BREVE) "\\u{o}"
'\u0150'(LATIN CAPITAL LETTER O WITH DOUBLE ACUTE) "\\H{O}"
'\u0151'(LATIN SMALL LETTER O WITH DOUBLE ACUTE) "\\H{o}"
'\u0152'(LATIN CAPITAL LIGATURE OE) "\\OE{}"
'\u0153'(LATIN SMALL LIGATURE OE) "\\oe{}"
'\u0154'(LATIN CAPITAL LETTER R WITH ACUTE) "\\'{R}"
'\u0155'(LATIN SMALL LETTER R WITH ACUTE) "\\'{r}"
'\u0156'(LATIN CAPITAL LETTER R WITH CEDILLA) "\\c{R}"
'\u0157'(LATIN SMALL LETTER R WITH CEDILLA) "\\c{r}"
'\u0158'(LATIN CAPITAL LETTER R WITH CARON) "\\v{R}"
'\u0159'(LATIN SMALL LETTER R WITH CARON) "\\v{r}"
'\u015a'(LATIN CAPITAL LETTER S WITH ACUTE) "\\'S"
'\u015b'(LATIN SMALL LETTER S WITH ACUTE) "\\'s"
'\u015c'(LATIN CAPITAL LETTER S WITH CIRCUMFLEX) "\\^{S}"
'\u015d'(LATIN SMALL LETTER S WITH CIRCUMFLEX) "\\^{s}"
'\u015e'(LATIN CAPITAL LETTER S WITH CEDILLA) "\\c{S}"
'\u015f'(LATIN SMALL LETTER S WITH CEDILLA) "\\c{s}"
'\u0160'(LATIN CAPITAL LETTER S WITH CARON) "\\v{S}"
'\u0161'(LATIN SMALL LETTER S WITH CARON) "\\v{s}"
'\u0162'(LATIN CAPITAL LETTER T WITH CEDILLA) "\\c{T}"
'\u0163'(LATIN SMALL LETTER T WITH CEDILLA) "\\c{t}"
'\u0164'(LATIN CAPITAL LETTER T WITH CARON) "\\v{T}"
'\u0165'(LATIN SMALL LETTER T WITH CARON) "\\v{t}"
'\u0168'(LATIN CAPITAL LETTER U WITH TILDE) "\\~{U}"
'\u0169'(LATIN SMALL LETTER U WITH TILDE) "\\~{u}"
'\u016a'(LATIN CAPITAL LETTER U WITH MACRON) "\\={U}"
'\u016b'(LATIN SMALL LETTER U WITH MACRON) "\\={u}"
'\u016c'(LATIN CAPITAL LETTER U WITH BREVE) "\\u{U}"
'\u016d'(LATIN SMALL LETTER U WITH BREVE) "\\u{u}"
'\u016e'(LATIN CAPITAL LETTER U WITH RING ABOVE) "\\r{U}"
'\u016f'(LATIN SMALL LETTER U WITH RING ABOVE) "\\r{u}"
'\u0170'(LATIN CAPITAL LETTER U WITH DOUBLE ACUTE) "\\H{U}"
'\u0171'(LATIN SMALL LETTER U WITH DOUBLE ACUTE) "\\H{u}"
'\u0172'(LATIN CAPITAL LETTER U WITH OGONEK) "\\k{U}"
'\u0173'(LATIN SMALL LETTER U WITH OGONEK) "\\k{u}"
'\u0174'(LATIN CAPITAL LETTER W WITH CIRCUMFLEX) "\\^{W}"
'\u0175'(LATIN SMALL LETTER W WITH CIRCUMFLEX) "\\^{w}"
'\u0176'(LATIN CAPITAL LETTER Y WITH CIRCUMFLEX) "\\^{Y}"
'\u0177'(LATIN SMALL LETTER Y WITH CIRCUMFLEX) "\\^{y}"
'\u0178'(LATIN CAPITAL LETTER Y WITH DIAERESIS) "\\\"Y"
'\u0179'(LATIN CAPITAL LETTER Z WITH ACUTE) "\\'Z"
'\u017a'(LATIN SMALL LETTER Z WITH ACUTE) "\\'z"
'\u017b'(LATIN CAPITAL LETTER Z WITH DOT ABOVE) "\\.{Z}"
'\u017c'(LATIN SMALL LETTER Z WITH DOT ABOVE) "\\.{z}"
'\u017d'(LATIN CAPITAL LETTER Z WITH CARON) "\\v{Z}"
'\u017e'(LATIN SMALL LETTER Z WITH CARON) "\\v{z}"
'\u01CD'(LATIN CAPITAL LETTER A WITH CARON) "\\v A"
'\u01CE'(LATIN SMALL LETTER A WITH CARON) "\\v a"
'\u01CF'(LATIN CAPITAL LETTER I WITH CARON) "\\v I"
'\u01D0'(LATIN SMALL LETTER I WITH CARON) "\\v \\i{}"
'\u01D1'(LATIN CAPITAL LETTER O WITH CARON) "\\v O"
'\u01D2'(LATIN SMALL LETTER O WITH CARON) "\\v o"
'\u01D3'(LATIN CAPITAL LETTER U WITH CARON) "\\v U"
'\u01D4'(LATIN SMALL LETTER U WITH CARON) "\\v u"
'\u01D5'(LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON) "\\=Ü"
'\u01D6'(LATIN SMALL LETTER U WITH DIAERESIS AND MACRON) "\\=ü"
'\u01D7'(LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE) "\\'Ü"
'\u01D8'(LATIN SMALL LETTER U WITH DIAERESIS AND ACUTE) "\\'ü"
'\u01D9'(LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON) "\\v Ü"
'\u01DA'(LATIN SMALL LETTER U WITH DIAERESIS AND CARON) "\\v ü"
'\u01DB'(LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE) "\\` Ü"
'\u01DC'(LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE) "\\` ü"
'\u01DE'(LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON) "\\= Ä"
'\u01DF'(LATIN SMALL LETTER A WITH DIAERESIS AND MACRON) "\\= ä"
'\u01E6'(LATIN CAPITAL LETTER G WITH CARON) "\\v G"
'\u01E7'(LATIN SMALL LETTER G WITH CARON) "\\v g"
'\u01E8'(LATIN CAPITAL LETTER K WITH CARON) "\\v K"
'\u01E9'(LATIN SMALL LETTER K WITH CARON) "\\v k"
'\u01EA'(LATIN CAPITAL LETTER O WITH OGONEK) "\\k O"
'\u01EB'(LATIN SMALL LETTER O WITH OGONEK) "\\k o"
'\u01F1'(LATIN CAPITAL LETTER DZ) "DZ"
'\u01F2'(LATIN CAPITAL LETTER D WITH SMALL LETTER Z) "Dz"
'\u01F3'(LATIN SMALL LETTER DZ) "dz"
'\u01F4'(LATIN CAPITAL LETTER G WITH ACUTE) "\\'G"
'\u01F5'(LATIN SMALL LETTER G WITH ACUTE) "\\'g"
'\u01F8'(LATIN CAPITAL LETTER N WITH GRAVE) "\\`N"
'\u01F9'(LATIN SMALL LETTER N WITH GRAVE) "\\`n"
'\u01FA'(LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE) "\\'Å"
'\u01FB'(LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE) "\\'å"
'\u01FC'(LATIN CAPITAL LETTER AE WITH ACUTE) "\\'Æ"
'\u01FD'(LATIN SMALL LETTER AE WITH ACUTE) "\\'æ"
'\u01FE'(LATIN CAPITAL LETTER O WITH STROKE AND ACUTE) "\\'Ø"
'\u01FF'(LATIN SMALL LETTER O WITH STROKE AND ACUTE) "\\'ø"
'\u0200'(LATIN CAPITAL LETTER A WITH DOUBLE GRAVE) "\\textdoublegrave{A}"
'\u0201'(LATIN SMALL LETTER A WITH DOUBLE GRAVE) "\\textdoublegrave{a}"
'\u0202'(LATIN CAPITAL LETTER A WITH INVERTED BREVE) "\\textroundcap{A}"
'\u0203'(LATIN SMALL LETTER A WITH INVERTED BREVE) "\\textroundcap{a}"
'\u0204'(LATIN CAPITAL LETTER E WITH DOUBLE GRAVE) "\\textdoublegrave{E}"
'\u0205'(LATIN SMALL LETTER E WITH DOUBLE GRAVE) "\\textdoublegrave{e}"
'\u0206'(LATIN CAPITAL LETTER E WITH INVERTED BREVE) "\\textroundcap{E}"
'\u0207'(LATIN SMALL LETTER E WITH INVERTED BREVE) "\\textroundcap{e}"
'\u0208'(LATIN CAPITAL LETTER I WITH DOUBLE GRAVE) "\\textdoublegrave{I}"
'\u0209'(LATIN SMALL LETTER I WITH DOUBLE GRAVE) "\\textdoublegrave{\\i}"
'\u020A'(LATIN CAPITAL LETTER I WITH INVERTED BREVE) "\\textroundcap{I}"
'\u020B'(LATIN SMALL LETTER I WITH INVERTED BREVE) "\\textroundcap{\\i}"
'\u020C'(LATIN CAPITAL LETTER O WITH DOUBLE GRAVE) "\\textdoublegrave{O}"
'\u020D'(LATIN SMALL LETTER O WITH DOUBLE GRAVE) "\\textdoublegrave{o}"
'\u020E'(LATIN CAPITAL LETTER O WITH INVERTED BREVE) "\\textroundcap{O}"
'\u020F'(LATIN SMALL LETTER O WITH INVERTED BREVE) "\\textroundcap{o}"
'\u0210'(LATIN CAPITAL LETTER R WITH DOUBLE GRAVE) "\\textdoublegrave{R}"
'\u0211'(LATIN SMALL LETTER R WITH DOUBLE GRAVE) "\\textdoublegrave{r}"
'\u0212'(LATIN CAPITAL LETTER R WITH INVERTED BREVE) "\\textroundcap{R}"
'\u0213'(LATIN SMALL LETTER R WITH INVERTED BREVE) "\\textroundcap{r}"
'\u0214'(LATIN CAPITAL LETTER U WITH DOUBLE GRAVE) "\\textdoublegrave{U}"
'\u0215'(LATIN SMALL LETTER U WITH DOUBLE GRAVE) "\\textdoublegrave{u}"
'\u0216'(LATIN CAPITAL LETTER U WITH INVERTED BREVE) "\\textroundcap{U}"
'\u0217'(LATIN SMALL LETTER U WITH INVERTED BREVE) "\\textroundcap{u}"
'\u0218'(LATIN CAPITAL LETTER S WITH COMMA BELOW) "\\textcommabelow{S}"
'\u0219'(LATIN SMALL LETTER S WITH COMMA BELOW) "\\textcommabelow{s}"
'\u021A'(LATIN CAPITAL LETTER T WITH COMMA BELOW) "\\textcommabelow{T}"
'\u021B'(LATIN SMALL LETTER T WITH COMMA BELOW) "\\textcommabelow{t}"
'\u021E'(LATIN CAPITAL LETTER H WITH CARON) "\\v{H}"
'\u021F'(LATIN SMALL LETTER H WITH CARON) "\\v{h}"
'\u0226'(LATIN CAPITAL LETTER A WITH DOT ABOVE) "\\.A"
'\u0227'(LATIN SMALL LETTER A WITH DOT ABOVE) "\\.a"
'\u0228'(LATIN CAPITAL LETTER E WITH CEDILLA) "\\c E"
'\u0229'(LATIN SMALL LETTER E WITH CEDILLA) "\\c e"
'\u022A'(LATIN CAPITAL LETTER O WITH DIAERESIS AND MACRON) "\\= Ö"
'\u022B'(LATIN SMALL LETTER O WITH DIAERESIS AND MACRON) "\\= ö"
'\u022C'(LATIN CAPITAL LETTER O WITH TILDE AND MACRON) "\\makeatletter\\@tabacckludge={\\~O}\\makeatother{}"
'\u022D'(LATIN SMALL LETTER O WITH TILDE AND MACRON) "\\makeatletter\\@tabacckludge={\\~o}\\makeatother{}"
'\u022E'(LATIN CAPITAL LETTER O WITH DOT ABOVE) "\\.O"
'\u022F'(LATIN SMALL LETTER O WITH DOT ABOVE) "\\.o"
'\u0232'(LATIN CAPITAL LETTER Y WITH MACRON) "\\=Y"
'\u0233'(LATIN SMALL LETTER Y WITH MACRON) "\\=y"
'\u023A'(LATIN CAPITAL LETTER A WITH STROKE) "/\\hspace{-0.5em}A"
'\u023B'(LATIN CAPITAL LETTER C WITH STROKE) "/\\hspace{-0.5em}C"
'\u023C'(LATIN SMALL LETTER C WITH STROKE) "/\\hspace{-0.4em}c"
'\u023D'(LATIN CAPITAL LETTER L WITH BAR) "-\\hspace{-0.3em}L"
'\u023E'(LATIN CAPITAL LETTER T WITH DIAGONAL STROKE) "-\\hspace{-0.3em}T"
'\u20AC'(EURO SIGN) "\\texteuro{}"
'\u2018'(LEFT SINGLE QUOTATION MARK) "'"
'\u2019'(RIGHT SINGLE QUOTATION MARK) "'"
'\u201A'(SINGLE LOW-9 QUOTATION MARK) "'"
'\u201B'(SINGLE HIGH-REVERSED-9 QUOTATION MARK) "'"
'\u201C'(LEFT DOUBLE QUOTATION MARK) "\"{}"
'\u201D'(RIGHT DOUBLE QUOTATION MARK) "\"{}"
'\u201E'(DOUBLE LOW-9 QUOTATION MARK) "\"{}"
'\u201F'(DOUBLE HIGH-REVERSED-9 QUOTATION MARK) "\"{}"
'\u025B'(LATIN SMALL LETTER OPEN E) "\\textepsilon{}"
'\u0283'(LATIN SMALL LETTER ESH) "\\textesh{}"
But I'm pretty sure there isn't a comprehensive mapping anywhere - Unicode is HUGE. You'll probably have to compile and maintain it yourself. Good luck!
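
If you do end up maintaining the mapping yourself, a small lookup table plus a fallback can go a long way. The following is only a sketch in Python, not a complete converter: the names `UNICODE_TO_LATEX`, `COMBINING_TO_ACCENT` and `to_latex` are made up for illustration, the table holds just a handful of entries copied from the list above, and the fallback assumes that a character which decomposes (NFD) into one base letter plus one combining mark can be written as a LaTeX accent command.

```python
import unicodedata

# A few entries copied from the table above; a real table would be far larger.
UNICODE_TO_LATEX = {
    "\u01E6": r"\v G",
    "\u01E7": r"\v g",
    "\u01F3": "dz",
    "\u0218": r"\textcommabelow{S}",
    "\u20AC": r"\texteuro{}",
    "\u025B": r"\textepsilon{}",
}

# Combining marks and the LaTeX accent commands assumed to correspond to them.
COMBINING_TO_ACCENT = {
    "\u0300": r"\`",   # grave
    "\u0301": r"\'",   # acute
    "\u0302": r"\^",   # circumflex
    "\u0303": r"\~",   # tilde
    "\u0308": r'\"',   # diaeresis
    "\u030C": r"\v",   # caron
}

def to_latex(text):
    """Replace each character by its table entry, else try an accent fallback."""
    out = []
    for ch in text:
        if ch in UNICODE_TO_LATEX:
            out.append(UNICODE_TO_LATEX[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in COMBINING_TO_ACCENT:
            # One base letter plus one known combining mark, e.g. "c" + caron.
            out.append(COMBINING_TO_ACCENT[decomposed[1]] + "{" + decomposed[0] + "}")
        else:
            # Unknown character: pass it through unchanged.
            out.append(ch)
    return "".join(out)
```

Characters with two stacked marks (such as U+01DB) decompose into three code points and fall through unchanged here, which is exactly the kind of case the hand-written table entries have to cover.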

Here's a web app based on the data mentioned above: http://www.johndcook.com/unicode_latex.html
Type in Unicode and it looks up the LaTeX symbol and vice versa.

You can check out my LaTeX to Unicode converter. It has a JavaScript API which you can use under MIT license. It is partially based on the W3C document shared earlier, but supports even more mappings that I gathered from here and there.
Most mappings are straightforward table lookups, but some commands have no Unicode equivalent, or an ambiguous one, so a comprehensive converter has to make creative decisions. Fractions, for example, are quite complicated: \frac{5}{8} produces ⅝, \frac{5}{80} produces 5⁄80, and \frac{5}{80a} produces (5 / (80a)).
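
To make the fraction rules concrete, here is a rough Python sketch. It is not the converter's actual code (that is a JavaScript API); the three tiers are only inferred from the three examples above, and the names `VULGAR_FRACTIONS`, `frac_to_unicode` and `convert_fractions` are hypothetical. It also handles only flat arguments with no nested braces.

```python
import re

# Precomposed "vulgar fraction" characters (a partial list).
VULGAR_FRACTIONS = {
    ("1", "2"): "\u00BD",  # ½
    ("1", "4"): "\u00BC",  # ¼
    ("3", "4"): "\u00BE",  # ¾
    ("1", "8"): "\u215B",  # ⅛
    ("3", "8"): "\u215C",  # ⅜
    ("5", "8"): "\u215D",  # ⅝
    ("7", "8"): "\u215E",  # ⅞
}

FRACTION_SLASH = "\u2044"
DIGITS_ONLY = re.compile(r"[0-9]+")
FRAC_COMMAND = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")

def frac_to_unicode(num, den):
    """Pick the most compact plain-text form for num/den."""
    # Tier 1: a single precomposed character exists.
    if (num, den) in VULGAR_FRACTIONS:
        return VULGAR_FRACTIONS[(num, den)]
    # Tier 2: both parts are plain digits, so join them with FRACTION SLASH.
    if DIGITS_ONLY.fullmatch(num) and DIGITS_ONLY.fullmatch(den):
        return num + FRACTION_SLASH + den
    # Tier 3: fall back to an explicit, parenthesized division.
    def group(term):
        return term if len(term) == 1 else "(" + term + ")"
    return "(" + group(num) + " / " + group(den) + ")"

def convert_fractions(text):
    """Replace every flat \\frac{..}{..} in text with its Unicode form."""
    return FRAC_COMMAND.sub(lambda m: frac_to_unicode(m.group(1), m.group(2)), text)
```

With these assumptions, `convert_fractions(r"\frac{5}{80a}")` returns `(5 / (80a))`, matching the third example above.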

This is for the Word 2007 Equation Editor but it shares many similar commands with LaTeX: http://unicode.org/notes/tn28/UTN28-PlainTextMath.pdf
This huge table contains Unicode translation to LaTeX, MathML entities and Mathematica: http://www.ams.org/STIX/bnb/stix-tbl.asc98feb26
