How to sort unicode strings alphabetically in Common Lisp?

This:
(sort '("Aaa" "Ééé" "Zzz") #'string-lessp)
;; ("Aaa" "Zzz" "Ééé")
is not satisfying, because "Ééé" should come before "Zzz".
How can we sort unicode strings alphabetically?
My current approach has been to create a copy of the strings, replace accented letters with their unaccented counterparts (with cl-slug:asciify, which calls ppcre:regex-replace-all), sort the copies, and display the original strings.
Thanks.

If you use SBCL, you have integrated support for Unicode; see the "String operations" section of the SBCL manual.
Try sorting with sb-unicode:unicode< instead of string-lessp.
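For comparison, the workaround described in the question (strip the accents for comparison only, keep the original strings) can be sketched outside Lisp. This is a minimal Python illustration, not SBCL code; accent_insensitive_key is a hypothetical helper name, and dropping accents only approximates a real locale-aware collation.

```python
import unicodedata


def accent_insensitive_key(s):
    """Sort key: the string with combining marks removed, case-folded."""
    # NFD splits "É" into "E" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", s)
    # Drop the combining marks; the base letters remain.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


words = ["Aaa", "Zzz", "Ééé"]
# The key is used only for comparison; the original strings are returned.
print(sorted(words, key=accent_insensitive_key))
```

Because the key function is applied only while comparing, there is no need to keep a separate asciified copy of the list.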

Sorting Data in Matlab

I am trying to sort the following data in Matlab, but I am not getting the output I need.
Here is data:
'1B-3A-5A'
'1A-3A-19A'
'2A-2A-4A-5A'
'2B-2A-5A'
'2A-4A-5A'
'2C-5A-30A'
'11A-3A-19A'
'3A-19A-42C'
'4A-4A-12A'
'19A-21A-42C'
'25A-41D'
'41C-41C'
'39C-41C'
'43E'
'39A-41D'
'1A-3A-5A-7A'
'7C-27A-28A'
I need the list sorted so that the leading number is considered first and then the letters, like this:
'1A-3A-19A'
'1A-3A-5A-7A'
'1B-3A-5A'
'2A-2A-4A-5A'
'2A-4A-5A'
'2B-2A-5A'
'2C-5A-30A'
'3A-19A-42C'
'4A-4A-12A'
'7C-27A-28A'
'11A-3A-19A'
'19A-21A-42C'
'25A-41D'
'39A-41D'
'39C-41C'
'41C-41C'
'43E'
Can you please suggest a way to do it? I tried all ways but it doesn't sort it like I want. Thanks!!
How about using sort or sortrows? This does actually sort strings as well:
If A is a string, then sort(A) sorts according to the ASCII dictionary order. The sort is case sensitive with uppercase letters appearing in the output before the lowercase letters.
As @StewieGriffin pointed out, this sorts 11A before 1A. Conveniently, Douglas Schwarz has already written code that overcomes exactly this problem by sorting alphanumeric strings on the numbers first and the characters after.
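To make the ordering rule concrete, here is a minimal sketch in Python (not MATLAB, and not Douglas Schwarz's code): it compares the leading number numerically and the rest of the string in plain character order, which is what the expected output in the question appears to follow. leading_number_key is a hypothetical helper name.

```python
import re


def leading_number_key(s):
    """Sort key: (leading integer, remainder of the string)."""
    m = re.match(r"(\d+)(.*)", s)
    if m is None:
        # Assumption: strings without a leading number sort first.
        return (-1, s)
    return (int(m.group(1)), m.group(2))


data = [
    "1B-3A-5A", "1A-3A-19A", "2A-2A-4A-5A", "2B-2A-5A", "2A-4A-5A",
    "2C-5A-30A", "11A-3A-19A", "3A-19A-42C", "4A-4A-12A", "19A-21A-42C",
    "25A-41D", "41C-41C", "39C-41C", "43E", "39A-41D", "1A-3A-5A-7A",
    "7C-27A-28A",
]

for item in sorted(data, key=leading_number_key):
    print(item)
```

Note that this is not a full natural sort: only the first number is treated numerically, so '1A-3A-19A' still comes before '1A-3A-5A-7A', as in the desired output.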

Is there a rule to match unicode printable characters in parboiled2?

As part of a larger parser, I am writing a rule to match strings like the following using parboiled2:
Italiana Relè
I would like to use something simple like the following:
CharPredicate.Printable
But the parser is failing with an org.parboiled2.ParseError because of the unicode character at the end of the string.
Is there a simple option that I'm not aware of for matching printable unicode characters?
Take a look at https://github.com/sirthias/parboiled2/blob/master/parboiled-core/src/main/scala/org/parboiled2/CharPredicate.scala#L112 - it is very easy to do your own predicates, for instance:
val latinSupplementCharsPredicate = CharPredicate('\u00c0' to '\u00dc') ++ CharPredicate('\u00e0' to '\u00fd')
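The idea of building a predicate from character ranges can be illustrated independently of parboiled2. The sketch below is Python, not Scala; the range tables and function names are hypothetical, and the printable-ASCII range (0x20 to 0x7E) is an assumption about what a "printable" predicate covers.

```python
# Inclusive code point ranges.
PRINTABLE_ASCII = [(0x20, 0x7E)]
# The two ranges from the Scala example above.
LATIN_SUPPLEMENT = [(0xC0, 0xDC), (0xE0, 0xFD)]


def in_ranges(ch, ranges):
    """Return True if the character's code point lies in any range."""
    code = ord(ch)
    return any(low <= code <= high for low, high in ranges)


def is_accepted(text):
    """Return True if every character is printable ASCII or Latin supplement."""
    allowed = PRINTABLE_ASCII + LATIN_SUPPLEMENT
    return all(in_ranges(ch, allowed) for ch in text)


print(is_accepted("Italiana Relè"))
```

The character è is U+00E8, so it falls in the second supplement range but outside the ASCII range, which matches the failure described in the question.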

How can I get a substring of a string in Emacs Lisp?

When I have a string like "Test.m", how can I get just the substring "Test" from that via elisp? I'm trying to use this in my .emacs file.
One way is to use substring (or substring-no-properties):
(substring "Test.m" 0 -2) => "Test"
(substring STRING FROM &optional TO )
Return a new string whose contents are a substring of STRING. The
returned string consists of the characters between index FROM
(inclusive) and index TO (exclusive) of STRING. FROM and TO are
zero-indexed: 0 means the first character of STRING. Negative values
are counted from the end of STRING. If TO is nil, the substring runs
to the end of STRING.
Stefan's answer is idiomatic when you just need a filename without its extension. However, if you manipulate files and file paths heavily in your code, I recommend installing Johan Andersson's f.el file and directory API, because it provides many functions absent in Emacs with a consistent API. Check out the functions f-base and f-no-ext:
(f-base "~/doc/index.org") ; => "index"
(f-no-ext "~/doc/index.org") ; => "~/doc/index"
If, instead, you work with strings often, install Magnar Sveen's s.el for the same reasons. You might be interested in s-chop-suffix:
(s-chop-suffix ".org" "~/doc/index.org") ; => "~/doc/index"
For generic substring retrieval use dkim's answer.
In your particular case, you might like to use file-name-sans-extension.
Probably the most flexible option (although it's not clear if you need flexibility) would be to use replace-regexp-in-string:
See C-h f replace-regexp-in-string RET
e.g.:
(replace-regexp-in-string "\\..*" "" "Test.m")

preg_match a keyword variable against a list of latin and non-latin chars keywords in a local UTF-8 encoded file

I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.
How do I go about matching both Latin and non-Latin keywords?
The badwords.txt file includes one word per line as in this example
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query)) {
            $hasBadword = 1;
        }
        if ($hasBadword == 1) {
            // Bad word detected die...
        }
    }
}
I've read that iconv, multibyte functions (mbstring) and using the operator /u might help with this, and I tried a few things but do not seem to get it right. Any help would be much appreciated in resolving this, and having it match both Latin and non-Latin keywords.
The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. And the problem seems to disappear (i.e., Arabic words get correctly recognized) when I set
$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';
and modify the regexp as follows:
$regexp = "/" . $wstart . $val . $wend . "/iu";
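The same "not adjacent to a letter" idea can be sketched in Python for illustration. This is not the PHP code above: Python's re module has no \p{L}, so the sketch assumes [^\W\d_] as a stand-in for "any letter" and uses lookarounds instead of consuming the neighbouring characters. contains_word is a hypothetical helper name.

```python
import re

# Stand-in for \p{L}: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"


def contains_word(word, text):
    """Return True if word occurs in text with no letter directly beside it."""
    pattern = rf"(?<!{LETTER}){re.escape(word)}(?!{LETTER})"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_word("bad", "a BAD word"))   # standalone word
print(contains_word("bad", "badge"))        # part of a longer word
print(contains_word("جنس", "هذا جنس آخر"))  # standalone Arabic word
```

Escaping the keyword (re.escape here, preg_quote in PHP) keeps a keyword that contains regex metacharacters from changing the pattern.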
Some string functions in PHP are byte-oriented and cannot be used safely on UTF-8 strings. Native Unicode support was planned for version 6, which was never released, so you need to be careful what you do with a string.
It looks like strtolower() is one of them, you need to use mb_strtolower($query, 'UTF-8'). If that doesn't fix it, you'll need to read through the code and find every point where you process $query or badwords.txt and check the documentation for UTF-8 bugs.
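The difference between byte-oriented and character-oriented lowercasing can be shown with a small Python analogy. This is not PHP and does not reproduce strtolower() exactly; it only assumes that a byte-oriented function touches ASCII letters and leaves multibyte sequences alone. Both function names are hypothetical.

```python
def lower_as_text(text):
    """Lowercase a Unicode string; accented letters are handled."""
    return text.lower()


def lower_as_bytes(text):
    """Lowercase the UTF-8 bytes; only ASCII letters are changed."""
    raw = text.encode("utf-8")
    return raw.lower().decode("utf-8")


word = "ÉCOLE"
print(lower_as_text(word))
print(lower_as_bytes(word))
```

In the byte-oriented version the leading É is left in upper case, so a later comparison against a lowercased keyword can fail even though nothing looks broken.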
As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.
Please also double check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, you set it with a <meta> tag).
If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
You shouldn't need to use iconv or any of the other conversion APIs; most of them will simply replace the non-Latin characters with Latin ones, which is obviously not what you want.

Convert a UTF8 string to ASCII in Perl

I've tried everything Google and StackOverflow have recommended (that I could find) including using Encode. My code works but it just uses UTF8 and I get the wide character warnings. I know how to work around those warnings but I'm not using UTF8 for anything else so I'd like to just convert it and not have to adapt the rest of my code to deal with it. Here's my code:
my $xml = XMLin($content);
# Populate the @titles array with each item title.
my @titles;
for my $item (@{$xml->{channel}->{item}}) {
    my $title = Encode::decode_utf8($item->{title});
    #my $title = $item->{title};
    #utf8::downgrade($title, 1);
    Encode::from_to($title, 'utf8', 'iso-8859-1');
    push @titles, $title;
}
return @titles;
Commented out you can see some of the other things I've tried. I'm well aware that I don't know what I'm doing here. I just want to end up with a plain old ASCII string though. Any ideas would be greatly appreciated. Thanks.
The answer depends on how you want to use the title. There are 3 basic ways to go:
Bytes that represent a UTF-8 encoded string.
This is the format that should be used if you want to store the UTF-8 encoded string outside your application, be it on disk or sending it over the network or anything outside the scope of your program.
A string of Unicode characters.
The concept of characters is internal to Perl. When you call Encode::decode_utf8, Perl attempts to convert a sequence of bytes into a string of characters. The Perl VM (and the programmer writing Perl code) cannot externalize that concept except by decoding UTF-8 bytes on input and encoding characters to UTF-8 bytes on output. For example, suppose your program receives two bytes of input that you know represent UTF-8 encoded character(s), say 0xC3 0xB6. In that case decode_utf8 returns a string that contains one character, ö, instead of two bytes.
You can then proceed to manipulate that string in Perl. To illustrate the difference further, consider the following code:
use feature 'say';
use Encode qw(decode_utf8);
my $bytes = "\xC3\xB6";
say length($bytes); # prints "2"
my $string = decode_utf8($bytes);
say length($string); # prints "1"
The special case of ASCII, a subset of UTF-8.
ASCII is a very small subset of Unicode, where characters in that range are represented by a single byte. Converting Unicode into ASCII is an inherently lossy operation, as most of the Unicode characters are not ASCII characters. You're either forced to drop every character in your string which is not in ASCII or try to map from a Unicode character to their closest ASCII equivalents (which isn't possible in the vast majority of cases), when trying to coerce a Unicode string to ASCII.
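The two lossy strategies mentioned above (drop what is not ASCII, or map to the closest ASCII equivalent) can be sketched as follows. This is a Python illustration rather than Perl, the function names are hypothetical, and the "closest equivalent" mapping here relies only on Unicode decomposition, so characters without a decomposition are still dropped.

```python
import unicodedata


def to_ascii_drop(text):
    """Drop every character that is not ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def to_ascii_approx(text):
    """Decompose accented letters first, then drop what is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii_drop("Relè ö"))
print(to_ascii_approx("Relè ö"))
print(to_ascii_approx("Straße"))  # ß has no decomposition, so it is lost
```

Either way information is lost, which is why keeping the text as Unicode and encoding only on output is usually the safer choice.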
Since you have wide character warnings, it means that you're trying to manipulate (possibly output) Unicode characters that cannot be represented as ASCII or ISO-8859-1.
If you do not need to manipulate the title from your XML document as a string, I'd suggest you leave it as UTF-8 bytes (I'd mention that you should be careful not to mix bytes and characters in strings). If you do need to manipulate it, then decode, manipulate, and on output encode it in UTF-8.
For further reading, please use perldoc to study perlunitut, perlunifaq, perlunicode, perluniintro, and Encode.
Although this is an old question, I just spent several hours (!) trying to do more or less the same thing! That is: read data from a UTF-8 XML file, and convert that data into the Windows-1252 codepage (I could also have used Latin1, ISO-8859-1 etc.) in order to be able to create filenames with accented letters.
After much experimentation, and even more searching, I finally managed to get the conversion working. The "trick" is to use Encode::encode instead of Encode::decode.
For example, given the code in the original question, the correct (or at least one :-) way to convert from UTF-8 would be:
my $title = Encode::encode("Windows-1252", $item->{title});
or
my $title = Encode::encode("ISO-8859-1", $item->{title});
or
my $title = Encode::encode("<your-favourite-codepage-here>", $item->{title});
I hope this helps others having similar problems!
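The direction of the two operations is easy to mix up, so here is a minimal sketch of the round trip in Python (an illustration only, not the Perl Encode API): decoding turns bytes into text, and encoding turns text into bytes in the target code page. The function name is hypothetical.

```python
def reencode(utf8_bytes, target):
    """Decode UTF-8 bytes to text, then encode the text in another code page."""
    text = utf8_bytes.decode("utf-8")   # bytes -> characters
    return text.encode(target)          # characters -> bytes


utf8_title = "Café".encode("utf-8")
print(reencode(utf8_title, "iso-8859-1"))
print(reencode(utf8_title, "cp1252"))

# A character missing from the target code page cannot be encoded as-is;
# with errors="replace" it becomes "?".
print("€".encode("iso-8859-1", errors="replace"))
```

The euro sign exists in Windows-1252 but not in ISO-8859-1, so the choice of target code page decides which characters survive.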
You can use the following line to simply get rid of the warning. This assumes that you want to use UTF8, which shouldn't normally be a problem.
binmode(STDOUT, ":encoding(utf8)");