Multilingual text sorting in Perl, on Windows, using locale

I am building a piece of software for sorting book indexes in different languages. It uses Perl, and keys off of the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or by relying on locale, am I barking up the wrong tree? Bottom line, Windows is really where I need this to work, but I am more comfortable developing in my UNIX environment.

Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, then it is easy to use the Unicode::Collate module as a starting point.
If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.
Decoding into Unicode
If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:
use Encode;
use Encode::Locale;
# use "locale" as an arg to encode/decode
@ARGV = map { decode(locale => $_) } @ARGV;
# or as a stream for binmode or open
binmode $some_fh, ":encoding(locale)";
binmode STDIN, ":encoding(console_in)" if -t STDIN;
binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
binmode STDERR, ":encoding(console_out)" if -t STDERR;
(If it were me, I would just use ":utf8" for the output.)
Standard Collation, plus locales and tailoring
The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:
use v5.14;
use utf8;
use Unicode::Collate;
my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
@exes = Unicode::Collate->new->sort(@exes);
say "@exes";
# prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.
my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x;           # strip articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg; # zero-pad numbers
        return $_;
    },
);
Now just use that object’s sort method to sort with.
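For example, with some hypothetical titles (the list is made up):
my @titles = ("A Tale of 12 Cities", "The 2 Towers", "An Unexpected Journey");
@titles = $collator->sort(@titles);
# "The 2 Towers" sorts first: its article is stripped, its number
# is zero-padded, and digits collate before letters.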
Sometimes you need to turn the sort inside out. For example:
my $collator = Unicode::Collate->new();
for my $rec (@recs) {
    $rec->{NAME_key} =
        $collator->getSortKey( $rec->{NAME} );
}
@srecs = sort {
    $b->{AGE} <=> $a->{AGE}
        ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;
You have to do it that way because you are sorting records with multiple fields; the binary sort key lets you use the cmp operator on data that has been run through your chosen/custom collator object.
The full constructor for the collator object has all this for a formal syntax:
$Collator = Unicode::Collate->new(
    UCA_Version              => $UCA_Version,
    alternate                => $alternate, # alias for 'variable'
    backwards                => $levelNumber, # or \@levelNumbers
    entry                    => $element,
    hangul_terminator        => $term_primary_weight,
    highestFFFF              => $bool,
    identical                => $bool,
    ignoreName               => qr/$ignoreName/,
    ignoreChar               => qr/$ignoreChar/,
    ignore_level2            => $bool,
    katakana_before_hiragana => $bool,
    level                    => $collationLevel,
    minimalFFFE              => $bool,
    normalization            => $normalization_form,
    overrideCJK              => \&overrideCJK,
    overrideHangul           => \&overrideHangul,
    preprocess               => \&preprocess,
    rearrange                => \@charList,
    rewrite                  => \&rewrite,
    suppress                 => \@charList,
    table                    => $filename,
    undefName                => qr/$undefName/,
    undefChar                => qr/$undefChar/,
    upper_before_lower       => $bool,
    variable                 => $variable,
);
But you usually don’t have to worry about almost any of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.
use Unicode::Collate::Locale;
$coll = Unicode::Collate::Locale->new(locale => "fr");
@french_text = $coll->sort(@french_text);
See how easy that is?
But you can do other cool things, too.
use Unicode::Collate::Locale;
my $Collator = Unicode::Collate::Locale->new(
    locale        => "de__phonebook",
    level         => 1,
    normalization => undef,
);
my $full = "Ich müß Perl studieren.";
my $sub = "MUESS";
if (my ($pos, $len) = $Collator->index($full, $sub)) {
    my $match = substr($full, $pos, $len);
    say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
}
When run, that says:
Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›
Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:
locale name       description
--------------------------------------------------------------
af                Afrikaans
ar                Arabic
as                Assamese
az                Azerbaijani (Azeri)
be                Belarusian
bg                Bulgarian
bn                Bengali
bs                Bosnian
bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
ca                Catalan
cs                Czech
cy                Welsh
da                Danish
de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
ee                Ewe
eo                Esperanto
es                Spanish
es__traditional   Spanish ('ch' and 'll' as a grapheme)
et                Estonian
fa                Persian
fi                Finnish (v and w are primary equal)
fi__phonebook     Finnish (v and w as separate characters)
fil               Filipino
fo                Faroese
fr                French
gu                Gujarati
ha                Hausa
haw               Hawaiian
hi                Hindi
hr                Croatian
hu                Hungarian
hy                Armenian
ig                Igbo
is                Icelandic
ja                Japanese [1]
kk                Kazakh
kl                Kalaallisut
kn                Kannada
ko                Korean [2]
kok               Konkani
ln                Lingala
lt                Lithuanian
lv                Latvian
mk                Macedonian
ml                Malayalam
mr                Marathi
mt                Maltese
nb                Norwegian Bokmal
nn                Norwegian Nynorsk
nso               Northern Sotho
om                Oromo
or                Oriya
pa                Punjabi
pl                Polish
ro                Romanian
ru                Russian
sa                Sanskrit
se                Northern Sami
si                Sinhala
si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
sk                Slovak
sl                Slovenian
sq                Albanian
sr                Serbian
sr_Latn           Serbian in Latin (tailored as Croatian)
sv                Swedish (v and w are primary equal)
sv__reformed      Swedish (v and w as separate characters)
ta                Tamil
te                Telugu
th                Thai
tn                Tswana
to                Tonga
tr                Turkish
uk                Ukrainian
ur                Urdu
vi                Vietnamese
wae               Walser
wo                Wolof
yo                Yoruba
zh                Chinese
zh__big5han       Chinese (ideographs: big5 order)
zh__gb2312han     Chinese (ideographs: GB-2312 order)
zh__pinyin        Chinese (ideographs: pinyin order) [3]
zh__stroke        Chinese (ideographs: stroke order) [3]
zh__zhuyin        Chinese (ideographs: zhuyin order) [3]
Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).
Note
[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.
[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.
[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.
Note: 'pinyin' is in Latin, 'zhuyin' is in Bopomofo.
So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.
Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)
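Putting both halves together, decoding and then deterministic collation, here is a minimal sketch of my own (French tailoring is an arbitrary example):
use v5.14;
use Encode;
use Encode::Locale;
use Unicode::Collate::Locale;

# Decode command-line arguments from the console's encoding,
# sort them with locale tailoring, and print them as UTF-8.
@ARGV = map { decode(locale => $_) } @ARGV;
binmode STDOUT, ":utf8";

my $collator = Unicode::Collate::Locale->new(locale => "fr");
say for $collator->sort(@ARGV);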

Win32::OLE::NLS gives you access to that part of the system. It provides CompareString and the tools needed to obtain the appropriate locale ID.
In case you want/need to locate the system documentation, the underlying system call is named CompareStringEx.
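A sketch of what that can look like, assuming the constants and functions are imported as the Win32::OLE::NLS documentation describes (CompareString returns 1, 2, or 3 for less/equal/greater, so subtracting 2 yields the -1/0/1 a sort comparator wants):
use strict;
use warnings;
use Win32::OLE::NLS qw(:DEFAULT :LANG :SUBLANG);

# Build a locale id for French and sort with the system collator.
my $lcid = MAKELCID(MAKELANGID(LANG_FRENCH, SUBLANG_DEFAULT));

my @words  = ("côte", "cote", "coté", "côté");
my @sorted = sort { CompareString($lcid, 0, $a, $b) - 2 } @words;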


How to match Unicode vowels?

What character class or Unicode property will match any Unicode vowel in Perl?
Wrong answer: [aeiouAEIOU]. (sermon here, item #24 in the laundry list)
perluniprops mentions vowels only for Hangul and Indic scripts.
Let's set aside the question of what a vowel is. Yes, "i" may not be a vowel in some contexts. So, any character that can be a vowel will do.
There's no such property.
$ uniprops --all a
U+0061 <a> \N{LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
Age=1.1 Age=V1_1 Block=Basic_Latin Bidi_Class=L Bidi_Class=Left_To_Right BC=L
Bidi_Paired_Bracket_Type=None Block=ASCII BLK=ASCII Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR
Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
Hangul_Syllable_Type=Not_Applicable HST=NA Indic_Positional_Category=NA InPC=NA
Indic_Syllabic_Category=Other InSC=Other Joining_Group=No_Joining_Group JG=NoJoiningGroup
Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL
Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
Present_In=6.1 IN=6.1 Present_In=6.2 IN=6.2 Present_In=6.3 IN=6.3 Present_In=7.0 IN=7.0
Present_In=8.0 IN=8.0 SC=Latn Script=Latn Script_Extensions=Latin Scx=Latn
Script_Extensions=Latn Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE
Word_Break=LE
The most important thing when dealing with i18n is to think about what you actually need, yet you didn't even mention what you are trying to accomplish.
Find vowels? That can't be what you are actually trying to do. I could see a use for identifying vowel sounds in a word, but those are often formed from multiple letters (such as "oo" in English, and "in", "an"/"en", "ou", "ai", "au"/"eau", "eu" in French), and it would be language-specific.
As it stands, you're asking for a global solution but you're defining the problem in local terms. You first need to start by defining the actual problem you are trying to solve.
Setting aside the definition of a vowel and the obvious problem that different languages share symbols but use them differently, there's a way that you can define your own property for use in a Perl pattern.
Define a subroutine whose name starts with In or Is, and have it specify the characters that belong to the property. The simplest format is one code number per line, or a range of code numbers separated by horizontal whitespace:
#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);

sub InSpecial {
    return <<"HERE";
00A7
00B6
2295\t229C
HERE
}
$_ = "ABC\x{00A7}";
say $_;
say /\p{InSpecial}/ ? 'Matched' : 'Missed';
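Applied to the vowel question, you can enumerate whichever code points you decide to count as vowels. The list below is deliberately tiny and purely illustrative:
#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);

# Illustrative only: the five basic Latin vowels in both cases,
# plus the precomposed a-with-diacritic range à-å.
sub InMyVowel {
    return <<"HERE";
0041
0045
0049
004F
0055
0061
0065
0069
006F
0075
00E0\t00E5
HERE
}

say "naïve" =~ /\p{InMyVowel}/ ? 'Matched' : 'Missed';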
First of all, not all written languages have "vowels". For one example, 中文 (Zhōngwén) (written Chinese) does not, as it is ideogrammatic instead of phonetic. For another example, Japanese mostly doesn't; it uses mostly consonant+vowel hiragana or katakana syllabics such as "ga", "wa", "tsu" instead.
And some written languages (for example, Hindi, Bangla, Greek, Russian) do have vowels, but use characters which are not easily mappable to aeiou. For such languages you'd have to find (search metacpan?) or make look-up tables specifying which letters are "vowels".
But if you're dealing with any written language based even loosely on the Latin alphabet (abcdefghijklmnopqrstuvwxyz), even if the language uses tons of diacritics (called "combining marks" in Perl and Unicode circles) (eg, Vietnamese), you can easily map those to "vowel" or "not-vowel", yes. The way is to "normalize-to-fully-decomposed-form", then strip-out all the combining marks, then fold-case, then compare each letter to regex /[aeiou]/. The following Perl script will find most-or-all "vowels" in any language using a Latin-based alphabet:
#!/usr/bin/perl -CSDA
# vowel-count.pl
use v5.20;
use Unicode::Normalize 'NFD';

my $vcount;
while (<>)
{
    $_ =~ s/[\r\n]+$//;
    say "\nRaw string: $_";
    my $decomposed = NFD $_;
    my $stripped = ($decomposed =~ s/\pM//gr);
    say "Stripped string: $stripped";
    my $folded = fc $stripped;
    my @base_letters = split //, $folded;   # use the case-folded string
    $vcount = 0;
    /[aeiou]/ and ++$vcount for @base_letters;
    say "# of vowels: $vcount";
}

Filtering out all non-kanji characters in a text with Python 3

I have a text in which there are latin letters and japanese characters (hiragana, katakana & kanji).
I want to filter out all latin characters, hiragana and katakana but I am not sure how to do this in an elegant way.
My direct approach would be to just filter out every single letter of the latin alphabet in addition to every single hiragana/katakana but I am sure there is a better way.
I am guessing that I have to use a regex, but I am not quite sure how to go about it. Are letters somehow classified as Roman, Japanese, Chinese, etc.?
If yes, could I somehow use this?
Here some sample text:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
The program should only return the kanjis (chinese character) like this:
`私、人,方,皆`
I found the answer thanks to Olsgaarddk on reddit.
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[｟-ﾟ]'
alphanum_full = r'[！-～]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall(unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub(unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕｟ﾟabc！～、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))

To split hex file by dcfldd or split?

I am solving the challenge about splitting with simultaneous progress monitoring here.
Assume the detectable field of the header is FA FA FA FA in Hex.
Steps in splitting by the field
binary to binary ascii of the data as described here
bin2hex
mark split points by gsed 's/FA FA FA FA/\0/100g'
split where marks \0 by pseudocode split -p'\0' input.txt which is discussed here
How can you do the conversion bin2hex?
Couldn't access the link you shared...
But you can use the bc (basic calculator) utility for the conversions... Maybe there is a better approach (which I am unaware of :-))
For hex to binary:
v=F; echo "ibase=16;obase=2;$(echo $v)" | bc
This will convert 0xF to 1111b.
ibase is the input base system (16 => hex). obase is the output base system (2 => binary).
For binary to hex:
v=1111; echo "ibase=2;obase=10000;$(echo $v)" | bc
This will convert 1111b to 0xF.
ibase is the input base system (2 => binary). obase is the output base system, and it must be written in the current ibase: 10000 in base 2 is 16 in decimal, i.e. hex.
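If Perl is at hand for the file-level bin2hex step, unpack will do it directly. A sketch (the uppercase, space-separated output matches the FA FA FA FA form used above):
#!/usr/bin/perl
# bin2hex.pl: dump a binary file as space-separated hex byte pairs.
use strict;
use warnings;

my $file = shift or die "usage: $0 FILE\n";
open my $fh, '<:raw', $file or die "$file: $!";
local $/;                                  # slurp the whole file
print uc(join ' ', unpack '(H2)*', <$fh>), "\n";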

Sorting Czech in Perl

I have the following Perl program:
use 5.014_001;
use utf8;
use Unicode::Collate::Locale;
require 'Unicode/Collate/Locale/cs.pl';
binmode STDOUT, ':encoding(UTF-8)';
my @old_list = (
    "cash",
    "Cash",
    "cat",
    "Cat",
    "čash",
    "dash",
    "Dash",
    "Ďash",
    "database",
    "Database",
);

my $col = Unicode::Collate::Locale->new(
    level => 3,
    locale => 'cs',
    normalization => 'NFD',
);

my @list = $col->sort(@old_list);
foreach my $item (@list) {
    print $item, "\n";
}
This program prints out the output:
cash
Cash
cat
Cat
čash
dash
Dash
Ďash
database
Database
I believe that a careful observer would have to conclude that either (1) in Czech, č is a first-class letter while Ď is not, or (2) the Unicode::Collate::Locale sorting of Czech in Perl is not correct.
I'd like to believe (1), and the following bolsters my case:
http://en.wiktionary.org/wiki/Index_talk:Czech
where it says:
Let us sort the entries by the existing Czech conventions, as far as practicable. That is, only the following characters have any sorting significance:
a b c č d e f g h ch i j k l m n o p q r ř s š t u v w x y z ž
But I'm confused, because I thought "D with a v over it" (and its lowercase equivalent) is a first-class letter of the Czech alphabet.
Where is @tchrist when I need him?
I'd appreciate any insights on this.
I have not yet seen a language that would correctly order Czech or Slovak words. (The Slovak alphabet is quite similar to the Czech one.) .NET, Java, Python all get it wrong. The closest to a correct solution are Raku and Go.
Yes, in Czech and Slovak, the letter ď comes (right) after d. There are quite a few peculiarities, such as the digraphs ch, dz, and dž.
#!/usr/bin/perl
use v5.30;
use warnings;
use utf8;
use Unicode::Collate::Locale;
use open ":std", ":encoding(UTF-8)";

my @words = qw/čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom/;

my $col = Unicode::Collate::Locale->new(
    level => 3,
    locale => 'sk',
    normalization => 'NFKC',
);

my @sort_asc = $col->sort(@words);
say "@sort_asc";
The example sorts Slovak words; it contains plenty of challenges.
$ ./sort_accented_words.pl
auto banán cesnak cibuľa čaj čerešňa červený čierny čučoriedka ďateľ drevo
džavot džem džíp elektrón fuj gejzír guma hora hôrny hrozno chobot chyba
ihla jazva kľak klam kĺb lom márny mat mäta pól pot pôst pýr sýkorka šum
tŕň troska zem
Perl did not order the accented words correctly. Interestingly, it correctly ordered the words with ch, dz, dž digraphs.
#!/usr/bin/raku
my @words = <čaj auto pot márny kľak chyba drevo cibuľa džíp džem šum pól čučoriedka
banán čerešňa červený klam čierny tŕň pôst hôrny mat chobot cesnak kĺb mäta ďateľ
troska sýkorka elektrón fuj zem guma hora gejzír ihla pýr hrozno jazva džavot lom>;
say @words.sort({ .unival, .NFKD[0], .fc });
This is a Raku example.
./sort_words.raku
(auto banán cesnak chobot chyba cibuľa čaj čerešňa červený čierny čučoriedka
drevo džavot džem džíp ďateľ elektrón fuj gejzír guma hora hrozno hôrny ihla
jazva klam kĺb kľak lom mat márny mäta pot pól pôst pýr sýkorka šum troska
tŕň zem)
Accented words are correctly sorted but the ch, dz, and dž digraphs are wrong.
So in my opinion, unless we create our own solution, we won't get a 100% correct output in any programming language.
A locale is just a set of rules. Here's the locale for cs from Unicode::Collate::Locale 1.31. DUCET is the Default Unicode Collation Element Table.
The Ď may be a first class letter, but that's not what DUCET thinks. If you want different sorts, you can adjust your locale or supply your own.
+{
    locale_version => 1.31,
    entry => <<'ENTRY', # for DUCET v13.0.0
010D ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
0063 030C ; [.1FD7.0020.0002] # LATIN SMALL LETTER C WITH CARON
010C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0043 030C ; [.1FD7.0020.0008] # LATIN CAPITAL LETTER C WITH CARON
0063 0068 ; [.2076.0020.0002] # <LATIN SMALL LETTER C, LATIN SMALL LETTER H>
0063 0048 ; [.2076.0020.0007][.0000.0000.0002] # <LATIN SMALL LETTER C, LATIN CAPITAL LETTER H>
0043 0068 ; [.2076.0020.0007][.0000.0000.0008] # <LATIN CAPITAL LETTER C, LATIN SMALL LETTER H>
0043 0048 ; [.2076.0020.0008] # <LATIN CAPITAL LETTER C, LATIN CAPITAL LETTER H>
0159 ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0072 030C ; [.2194.0020.0002] # LATIN SMALL LETTER R WITH CARON
0158 ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0052 030C ; [.2194.0020.0008] # LATIN CAPITAL LETTER R WITH CARON
0161 ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0073 030C ; [.21D3.0020.0002] # LATIN SMALL LETTER S WITH CARON
0160 ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
0053 030C ; [.21D3.0020.0008] # LATIN CAPITAL LETTER S WITH CARON
017E ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
007A 030C ; [.2287.0020.0002] # LATIN SMALL LETTER Z WITH CARON
017D ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
005A 030C ; [.2287.0020.0008] # LATIN CAPITAL LETTER Z WITH CARON
ENTRY
};
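One way to see this for yourself is the viewSortKey method from Unicode::Collate, which renders the collation elements a string actually receives under the tailoring. A quick sketch; with the cs locale, ď should show the same primary weight as d, differing only at a later level, while č gets its own primary:
use v5.14;
use utf8;
use Unicode::Collate::Locale;

my $col = Unicode::Collate::Locale->new(locale => 'cs');
say $col->viewSortKey($_) for qw( c č d ď );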
If the default sort is not working for you, this common workaround is an easy do-it-yourself:
Make a sort-array by transforming your strings: if a and á should be equivalent, transform both to a; if á should follow a, transform it into a[, for example (any character after z should be fine). Transform ch into h[, as it goes after h, if I understand correctly. Then sort the original array together with the sort-array.
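A rough sketch of that idea (the mapping is deliberately incomplete; a real table would cover the whole alphabet, and the keys are uppercased here so that '[' falls after 'Z'):
use v5.14;
use utf8;

# á is treated as equivalent to a; č and ď become separate
# letters sorting after c and d; the digraph ch sorts after h.
my %tr = ( 'Á' => 'A', 'Č' => 'C[', 'Ď' => 'D[', 'CH' => 'H[' );

sub sort_key {
    my $s = uc shift;
    $s =~ s{(CH|.)}{ $tr{$1} // $1 }ge;
    return $s;
}

my @words  = qw( ahoj chyba cena čaj dub ďas hora );
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ sort_key($_), $_ ] } @words;
# ahoj cena čaj dub ďas hora chyba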
Despite Czech being my native language, I don't know Czech collation perfectly. But surely, for ď, ť, ň and vowels with diacritics, the diacritics have a lower significance than for other Czech characters like č.
Why? This is related to pronunciation. Barring assimilation and non-native words, all consonants but d, t and n have a clear pronunciation regardless of their context. ("Ch" is considered a separate letter.) Those three letters (d, t and n) can be "softened" when they are followed by "i", "í" or "ě". In those cases, they are pronounced as if they had a caron (háček). As a result, the diacritic for them is less significant.

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through data structures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built-in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As @MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D, which is not the same order as the ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä, for example). See @MartinR's answer here for more info.
If you need to check whether a character is between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to a unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
    println("yep")
} else {
    println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
    println("yep")
} else {
    println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the user perspective, which is a sequence of grapheme clusters, each (usually) with a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé@¶œŎO!@#"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
        + ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
@ is ascii with value 64; @ is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter