How does Perl compare strings under the hood?

How does Perl compare strings under the hood? - perl

Does Perl just compare the ASCII values of each character of each string until it can place one before the other or does the language compare strings in another way?

Perl does take your current locale into account and uses the sort order defined by this locale. This does not only take charsets (such as ASCII) into account but also languages. For instance words are sorted differently in French than in German, etc…

Related

Unicode range mapping between languages

there is 7707 languages listed in this link http://www.sil.org/iso639-3/download.asp and http://en.wikipedia.org/wiki/ISO_639:a.
And also Unicode support the writing system of the languages, but i want to know mapping beetween the languages and unicode range.
Unicode range is listed in this link http://www.unicode.org/roadmaps/bmp/
Example one of unicode range : "start"=> "0x0900", "end"=> "0x097F", "block_name"=> "Devanagari" (what language use this range of unicode ?)
there is any documentation ? I need full languages mapping that are supported in unicode range.

You can take a look at ICU4C locale (http://icu-project.org/apiref/icu4c/uloc_8h.html)
You can get all the locales (with uloc_getAvailable), then for each locale call uloc_addLikelySubtags, and then uloc_getScript on the result.
This is going to give you the most likely script used by a language. But there are languages that use more than one script. Some of them are captured by ICU, but some are not.

How to convert unicode escape code to character in Objective C (on iPhone)

I have a string that contains unicode escape codes, eg. #"D\u017cem" (\u017c is code for ż). I would like to convert that string to the one containg actual characters. In the example that would be #"Dżem".
Is there any method in SDK or library that can do such replacement AND work on iPhone?
(Obviously I can do the replacement myself, changing characters one by one, but it is rather cumbersome)

According to Apple,
It is not safe is to include high-bit characters in your source code
Note that the "universal character name" \u017c is replaced at compile time with an implementation-defined value which in practice is the UTF8 representation, so the end result is the same as you would get if you (correctly) did the replacement you are talking about. If you're having a problem with some other source-processing tool, you might be better served by teaching that tool to recognize C99 universal character names.

I suggest to start using NSLocalizedString()
http://www.pushplay.net/2009/08/developing-localized-iphone-applications/
http://developer.apple.com

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument that the characters regardless of their shape have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me because I am analyzing individual characters which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.

As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.

It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduct that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduct that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.

As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does but has limitations. I simply don't understand, are the limitation anything big/key or not a big deal?

You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ascii characters, etc. As an example, if you have a string "開発.txt" and you search for "." in that string using string.find (string_var, "."), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct utf8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in unicode-characters of a string, just count the number of characters with the high bit zero (ASCII characters), and the number of characters with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two.
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page

Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)

If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the way that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any kind of Unicode-encoding encoded string in a Lua string.
What is not supported is iteration by unicode character, there is no standard function for string length in unicode characters etc. So the higher-level kind of Unicode support (like what is available in Python with length, lower -> upper case conversion, encoding in arbitrary coding etc) is not available.

Lua 5.3 was released now. It comes with a basic UTF-8 library.
You can use the utf8 library to do things about UTF-8 encoding, like getting the length of a UTF-8 string (not number of bytes as string.len), matching each characters (not bytes), etc.
It doesn't provide native support other than encoding, like is this character a Chinese character?

It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ÐŸÐ¾Ð´Ñ€Ð°Ð¶Ð°Ð½ÑÐºÐ°Ñ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

I believe you could use Text::Unidecode for this, it is precisely what it tries to do.

In the documentation for Text::Unicode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);

If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?

If you get cyrilic text there is no "closest ASCII representation" for many characters.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse