How can I encode UTF-8 to create case-insensitive ASCII filenames?


I have a list of UTF-8 strings, most of which are ASCII, but some of which contain Korean/Japanese characters. I want to create a file associated with each string. To avoid any problems with filesystems that do not support these characters, I want to somehow encode each UTF-8 string into ASCII to generate its filename. The generated ASCII should be case-insensitive and all-lowercase to avoid problems with Windows' case-insensitive filesystem. (So, for example, Base64 encoding alone would not work.) What are some ways to do this?
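One possible approach, offered as a minimal sketch rather than an accepted answer from this thread: Base32 uses only the letters A-Z and the digits 2-7, so its output can be lowercased without losing information. The helper names to_filename and from_filename are hypothetical, and stripping the "=" padding is an assumption made to keep the names shorter.

```python
import base64

def to_filename(name: str) -> str:
    """Encode a Unicode string as a lowercase, ASCII-only filename stem."""
    raw = name.encode("utf-8")
    # Base32 never mixes upper and lower case, so lowercasing is reversible.
    encoded = base64.b32encode(raw).decode("ascii")
    return encoded.rstrip("=").lower()

def from_filename(stem: str) -> str:
    """Invert to_filename: restore case and padding, then decode."""
    padded = stem.upper() + "=" * (-len(stem) % 8)
    return base64.b32decode(padded).decode("utf-8")

print(to_filename("abc"))
print(from_filename(to_filename("한국어 file")))
```

The output grows by roughly 60% compared with the UTF-8 byte length, so very long strings may hit filesystem name limits. Hex encoding (bytes.hex()) would meet the same requirements at double the length.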

Related

Character Encodings compatibility with ASCII

I'm currently reading mails from a file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems because ASCII is embedded at the same place in all these charsets (so 0x41 is an A in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?

Mozilla has a charset detector based on this very interesting article. It can detect a large number of different encodings. There is also a port to C# available on GitHub, which I have used before. It turned out to be quite reliable. Of course, when the text contains only ASCII characters, it cannot distinguish between the encodings that encode ASCII in the same way. But any encoding that encodes ASCII differently should be detected correctly by this library.
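On the question of whether encodings that do not embed ASCII exist at all: EBCDIC code pages and UTF-16 are two examples. The sketch below uses Python's built-in codecs to show where the letter A ends up. The choice of cp037 as the EBCDIC representative is only an illustration.

```python
# The letter "A" is the single byte 0x41 in every ASCII-compatible encoding.
for codec in ("ascii", "utf-8", "cp1252", "iso-8859-1"):
    assert "A".encode(codec) == b"\x41"

# EBCDIC code pages place letters elsewhere, so they do not embed ASCII.
ebcdic_a = "A".encode("cp037")

# UTF-16 keeps the value 0x41 but pads it to a two-byte code unit.
utf16_a = "A".encode("utf-16-le")

print(ebcdic_a, utf16_a)  # b'\xc1' b'A\x00'
```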

addPortalMessage requires decode('utf-8')

Currently it seems that in order for UTF-8 characters to display in a portal message you need to decode them first.
Here is a snippet from my code:
self.context.plone_utils.addPortalMessage(_(u'This document (%s) has already been uploaded.' % (doc_obj.Title().decode('utf-8'))))
If Titles in Plone are already UTF-8 encoded, the string is a unicode string and the underscore function is handled by i18ndude, I do not see a reason why we specifically need to decode utf-8. Usually I forget to add it and remember once I get a UnicodeError.
Any thoughts? Is this the expected behavior of addPortalMessage? Is it i18ndude that is causing the issue?

UTF-8 is a representation of Unicode, not Unicode and not a Python unicode string. In Python, we convert back and forth between Python's unicode strings and representations of unicode via encode/decode.
Decoding a UTF-8 string via utf8string.decode('utf-8') produces a Python unicode string that may be concatenated with other unicode strings.
Python will automatically convert a string to unicode if it needs to by using the ASCII decoder. That will fail if there are non-ASCII characters in the string -- because, for example, it is encoded in UTF-8.
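The thread concerns Python 2, where byte strings and unicode strings mix implicitly. As a rough illustration in Python 3 (where bytes and str are separate types and nothing is converted implicitly), the following sketch shows the explicit decode and the ASCII failure described above. The sample title is made up.

```python
# A title stored as UTF-8 bytes (the role played by doc_obj.Title() in the thread).
title_bytes = "Résumé".encode("utf-8")

# Explicit decoding yields text that can be interpolated into another text string.
title_text = title_bytes.decode("utf-8")
message = "This document (%s) has already been uploaded." % title_text

# Decoding the same bytes as ASCII fails, which is what Python 2's
# implicit conversion attempted behind the scenes.
try:
    title_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(message, ascii_failed)
```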

Scandinavian characters when encoding to Ascii in Powershell

I need to export some data using PowerShell to an ASCII-encoded file.
My problem is that Scandinavian characters like Æ, Ø and Å turn into ? ? ? in the output file.
Example:
$str = "ÆØÅ"
$str | Out-File C:\test\test.txt -Encoding ascii
In the output file the result of this is: ???

It seems as though you have conflicting requirements:
- Save the text in ASCII encoding
- Save characters outside the ASCII character range
ASCII encoding does not support the characters you mention, which is the reason they do not work as you expect them to. The MSDN documentation on ASCII Encoding states that:
ASCII characters are limited to the lowest 128 Unicode characters, from U+0000 to U+007F.
And also further that
If your application requires 8-bit encoding (which is sometimes incorrectly referred to as "ASCII"), the UTF-8 encoding is recommended over the ASCII encoding. For the characters 0-7F, the results are identical, but use of UTF-8 avoids data loss by allowing representation of all Unicode characters that are representable. Note that the ASCII encoding has an 8th bit ambiguity that can allow malicious use, but the UTF-8 encoding removes ambiguity about the 8th bit.
You can read more about ASCII encoding on the Wikipedia page regarding ASCII Encoding (this page also includes tables showing all possible ASCII characters and control codes).
You need to either use a different encoding (such as UTF-8) or accept that you can't use characters which fall outside the ASCII range.
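The same behaviour can be reproduced outside PowerShell. This is a small Python sketch, not the PowerShell fix itself: encoding to ASCII with replacement loses the characters, while UTF-8 round-trips them.

```python
text = "ÆØÅ"

# ASCII cannot represent these characters; with errors="replace" each one
# becomes "?", and the original text cannot be recovered afterwards.
as_ascii = text.encode("ascii", errors="replace")

# UTF-8 can represent them, so encoding and decoding round-trips.
as_utf8 = text.encode("utf-8")
restored = as_utf8.decode("utf-8")

print(as_ascii, restored)  # b'???' ÆØÅ
```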

How do I replace literal \xNN with their character in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is that when I open the text file for viewing, I get a lot of hex characters like \x92 and \x93, which stand for single and double quotes, I guess.
I am using DBI->quote function to escape the special chars before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both the tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?

ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
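The difference between the two encodings is easy to check. The sketch below uses Python rather than Perl, purely as an illustration, and decodes the same two bytes both ways.

```python
raw = b"\x92 and \x93"

# Python's latin-1 codec maps 0x80-0x9F to unprintable C1 control code points.
as_latin1 = raw.decode("latin-1")

# Windows-1252 assigns printable characters to most of that range:
# 0x92 is a right single quotation mark, 0x93 a left double quotation mark.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(as_cp1252)
```

If the data really is Windows-1252, decoding it as such and re-encoding as UTF-8 keeps the quotes instead of deleting them.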

\x92 and \x93 are empty characters in the latin1 character set (see here or here). If you are certain that you are indeed dealing with latin1, you can simply delete them.

It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ÐŸÐ¾Ð´Ñ€Ð°Ð¶Ð°Ð½ÑÐºÐ°Ñ". I have to follow up with the customer and ask him for a "Latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

I believe you could use Text::Unidecode for this, it is precisely what it tries to do.

In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
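use Text::Unidecode;
# utf8::decode converts the UTF-8 bytes to a character string in place and returns true on success
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode($input_to_be_converted);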

If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji signs?

If you get Cyrillic text, there is no "closest ASCII representation" for many characters.
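To illustrate that limit, here is a sketch in Python using only the standard library. It is not what Text::Unidecode does: it merely strips diacritics through Unicode decomposition, and the helper name ascii_fold is hypothetical. Accented Latin letters fold cleanly, while Cyrillic has nothing to fold to and disappears.

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Drop diacritics via NFKD decomposition and discard anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

print(ascii_fold("José"))    # Jose
print(ascii_fold("Москва"))  # prints an empty line
```

A real transliteration table, such as the one Text::Unidecode ships, is needed to get anything readable out of non-Latin scripts, and even then the result is an approximation.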
