When my users upload files with filepicker.io and the filenames contain accented characters (like ąłżźð), some of the files uploaded to S3 use the NFC Unicode normalization form while others use NFD (roughly 20% are NFD, and sources suggest those come from Mac OS X users).
As a result I cannot simply download the NFD files from S3 without trying both forms (the file names I store are always in NFC).
Is there a way to tell filepicker.io to always convert filenames to NFC before uploading to S3? Or do I really have to do the conversion myself in JavaScript (for example using https://github.com/walling/unorm)?
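If filepicker.io can't be told to do it, the conversion itself is a one-liner in any browser that supports String.prototype.normalize; older browsers would need a polyfill such as the unorm library above. A minimal sketch (the wrapper around the picked file is hypothetical, just to show where the call would go):

// Normalize a filename to NFC before handing it to the uploader.
function toNfcFilename(name) {
  return name.normalize('NFC');
}

// Hypothetical usage: rebuild the picked File object with an NFC name
// before passing it to the upload call.
function withNfcName(original) {
  return new File([original], toNfcFilename(original.name), { type: original.type });
}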
For those investigating a similar issue, a few background reads on the topic:
https://github.com/fog/fog/issues/1294
File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)
For a few days I have been trying to produce a new PDF file with PDFBox from a data extraction whose values need different fonts. The text is mostly Latin characters, but some names in my list of strings are in Chinese, Cyrillic, etc.
I have spent a lot of time and energy on Google and Stack Overflow but still can't manage to produce it (glyph issues).
Currently I'm on Windows, but it will be deployed on Linux, and I use version 2.0.26 or 3.0.0-RC1 of PDFBox.
I manage to load a TTF like this:
PDType0Font.load(doc, File("src/main/resources/font/LiberationSans-Regular.ttf").inputStream(), false) // embedSubset = false
If I set embedSubset to true, I always get a cmap issue.
I also tried to load TTC files but failed each time.
I have already started to implement this solution (link) but I can't manage to initialize/load my font correctly.
Do you have any idea how to do it?
Best, Mat
I am serving files from Google Cloud Storage and some of the filenames contain non-ASCII, UTF-8 encoded characters. For example, volvía.mp3.
If I request volvía.mp3, GCS throws an error.
If I percent encode the filename (í = %C3%AD) as volv%C3%ADa.mp3, it still fails.
If I percent encode the filename using the "combining acute accent" = %CC%81 as volvi%CC%81a.mp3, it succeeds.
Any ideas what is going on?
EDIT: The error it throws is an "Access Denied" error: "Anonymous users does not have storage.objects.get access to object." However, this also seems to be the error one gets when requesting an object that doesn't exist.
The problem is due to Mac OS's HFS+ file system, which enforces canonical decomposition (NFD) on filenames. This means it normalizes a character such as í into two code points (i + combining acute accent) rather than the single code point used in the "composed" form (NFC).
GCS treats these two forms as distinct filenames, despite the fact that they appear identical.
One solution is to convert NFD filenames to the more common NFC form (using a utility such as convmv) before uploading to GCS. However, this can't be done on Mac OS itself, because the file system enforces NFD.
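If the uploads are scripted rather than done with convmv, the same normalization can be applied to the object name just before the upload call. A small JavaScript sketch using the filename from the question, which also shows why the two forms percent-encode differently:

var nfdName = 'volvi\u0301a.mp3';         // what HFS+ produces: i + combining acute accent
var nfcName = nfdName.normalize('NFC');   // 'volvía.mp3' with a single precomposed í
console.log(nfdName === nfcName);         // false: GCS sees two different object names
console.log(encodeURIComponent(nfcName)); // "volv%C3%ADa.mp3"
console.log(encodeURIComponent(nfdName)); // "volvi%CC%81a.mp3"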
I was not able to reproduce your issue. I uploaded an object named volvía.mp3 and was able to retrieve it as both http://storage.googleapis.com/bucketname/volvía.mp3 and http://storage.googleapis.com/bucketname/volv%C3%ADa.mp3
I suspect that you actually created an object with the "combining acute accent" character instead. How did you upload your object?
I am integrating data using flat files. The flat files are delivered by FTP as .csv files exported from MS SQL by a business partner.
I asked him to encode them as UTF-8 (just using the standard, I thought).
Now I can see that his files contain a lot of codes such as "&#233;", which show up as plain text when I open a file in Notepad++ (and also in my "ETL" tool).
Before I ask him to fix it and deliver proper UTF-8, I would like to understand the issue and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open the file in Notepad++, and not as these plain-text codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity for é. For some reason the text is HTML-formatted, which I wouldn't count as "plain text"/flat files. Whether the file is additionally encoded in UTF-8 we can't tell from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8, opened in a text editor that correctly interprets the file as UTF-8, looks exactly like the text it should be, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
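If the partner can't change the export, the numeric entities can also be decoded after the fact. A rough JavaScript sketch, illustrative only; a complete HTML entity decoder would also have to handle hexadecimal references and named entities such as &amp;:

function decodeNumericEntities(s) {
  // Replace decimal character references like &#233; with the characters they name.
  return s.replace(/&#(\d+);/g, function (match, code) {
    return String.fromCodePoint(Number(code));
  });
}

console.log(decodeNumericEntities('d&#233;j&#224; vu')); // "déjà vu"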
My plugin reads its control caption text from an INI file (an ANSI-type file whose contents are UTF-8 encoded) in order to display multiple languages. The key point is that it is a plugin: I have no control over, and no ability to change, this INI file's format or type.
The captions are currently read into my plugin with TIniFile.ReadString and stored as a string. I can modify this (data type, read method, etc.) as needed.
The main application reads from its own language files, which are UCS-2 Little Endian encoded TXT files. These display fine when the language is changed, even when the Windows OS is kept in English (in other words, no OS locale changes are needed for the application to switch display languages).
My plugin's form cannot display Asian characters (Chinese, Japanese, Korean, etc). English language is fine.
I have tried various fonts and various combinations of AnsiString, String, etc. What am I missing to be able to display Asian characters on the form? I have not found an existing question that matches what I'm trying to do, specifically with regard to how my language text is read into the plugin.
If the .INI file reader does not interpret the contents of the values and passes them through transparently, then you need to convert the raw strings into strings with the correct code page yourself.
There is a similar question, "Delphi 2010: how do I convert a UTF8-encoded PAnsiChar to a UnicodeString?", that explains how to do the conversion. You may need to extract the contents into a RawByteString to avoid the implicit conversions.
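The underlying problem is pure byte reinterpretation: the INI file holds UTF-8 bytes, but they are read as ANSI, so each byte becomes one (wrong) character before the text ever reaches the form. A sketch of that effect in JavaScript with TextEncoder/TextDecoder, purely to illustrate the conversion that the linked Delphi question describes (e.g. via UTF8ToString on the raw bytes):

// UTF-8 bytes for "Résumé" (analogous to what the INI file actually contains)
var utf8Bytes = new TextEncoder().encode('Résumé');
console.log(new TextDecoder('windows-1252').decode(utf8Bytes)); // "RÃ©sumÃ©" (bytes read as ANSI)
console.log(new TextDecoder('utf-8').decode(utf8Bytes));        // "Résumé" (bytes converted as UTF-8)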
I am creating a ColdFusion page that takes language translation data stored in a table in my database and generates static .js files for each language pairing (English to ___, etc.).
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language shows up as ?????.
I have tried writing it via cffile as UTF-8 and as ISO-8859-1, but neither gets it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (Notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. ColdFusion is quite agnostic in that regard, as it will happily use any JDBC driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
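That lossy conversion is exactly what produces a wall of question marks: every Cyrillic character with no equivalent in the target character set is replaced by the substitution character. A quick sketch of the difference (Node, assuming the third-party iconv-lite package, whose default substitution character for single-byte encodings is '?'):

var iconv = require('iconv-lite');

// ISO-8859-1 has no Cyrillic characters, so every one of them is lost.
var latin1Bytes = iconv.encode('Запуск', 'iso-8859-1');
console.log(iconv.decode(latin1Bytes, 'iso-8859-1')); // "??????"

// UTF-8 can represent them all, so the round trip is lossless.
var utf8Bytes = iconv.encode('Запуск', 'utf-8');
console.log(iconv.decode(utf8Bytes, 'utf-8')); // "Запуск"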
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself saved as UTF-8? (With or without a BOM; it matters not for Russian.) In any case UTF-8 is absolutely what you should be using. Make sure you use a UTF-8 compliant editor, which is most things on a Mac; on Windows you could use SciTE or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the CF template and the .js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru = {
    launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
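If the .js files are generated rather than written by hand, the escaping itself can be automated. A sketch of the idea, shown here in JavaScript; in this setup the equivalent logic would live in the ColdFusion code that writes the file:

function escapeNonAscii(s) {
  // Replace every character above U+007F with its \uXXXX escape.
  return s.replace(/[\u0080-\uffff]/g, function (ch) {
    return '\\u' + ('000' + ch.charCodeAt(0).toString(16)).slice(-4);
  });
}

console.log(escapeNonAscii('Запуск Мой календарь'));
// "\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c"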
When it saves to the file it comes out as "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì", so the charset is wrong.
Looks like the file has been saved in a single-byte Cyrillic encoding (those bytes are "Запуск Мой календарь" in ISO-8859-5) and then read back on a Western server where the default codepage is cp1252.
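The effect is easy to reproduce. A sketch using the WHATWG TextDecoder (available in browsers and in Node builds with full ICU); the bytes below are the ones behind the first word of the garbled string above:

// "Запуск" written out in a single-byte Cyrillic encoding (ISO-8859-5)
var bytes = new Uint8Array([0xb7, 0xd0, 0xdf, 0xe3, 0xe1, 0xda]);
console.log(new TextDecoder('windows-1252').decode(bytes)); // "·ÐßãáÚ" (read with the Western codepage)
console.log(new TextDecoder('iso-8859-5').decode(bytes));   // "Запуск" (read with the codepage it was written in)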
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.