Inno Setup Unicode encoding issue with messages in ISS script

We have a script set up to run with the Inno Setup Unicode compiler. The installer currently supports English, German and French.
It has been brought to our attention that our custom messages in French aren't encoded correctly. The custom message files are saved as UTF-8, so there should not be an encoding issue. We verified that we were using the Unicode compiler and hadn't picked the ANSI one by accident.
Expected Custom message:
French.UninstallOldVersionPrompt=Il semble y avoir une version antérieure de Rubberduck installée sur ce système pour %s, qui devra d'abord être désinstallée. Procéder à la désinstallation?
Link to source
The file is saved as UTF-8, so it shouldn't have had encoding issues. What went wrong?
For more details, you can read up on the GitHub issue.

The .iss file needs to have a UTF-8 BOM if it includes Unicode/UTF-8 strings.
In your case, it's the French.CustomMessages.iss that is missing the BOM.
The German.CustomMessages.iss has the BOM, which is why German works correctly.
See also UTF-8 characters not displaying correctly in Inno Setup.
The BOM requirement is indeed not documented. But I believe it's clear from the code of TTextFileReader.ReadLine.
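If you'd rather not rely on an editor to save the BOM, a tiny script can add it. A minimal Python sketch (the file name is taken from the question; adjust the path as needed):

import os

BOM = b"\xef\xbb\xbf"

path = "French.CustomMessages.iss"
with open(path, "rb") as f:
    data = f.read()

# Prepend the UTF-8 BOM only if the file doesn't already start with one.
if not data.startswith(BOM):
    with open(path, "wb") as f:
        f.write(BOM + data)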

Related

Save file with ANSI encoding in VS Code

I have a text file that needs to be in ANSI mode. It specifies only that: ANSI. Notepad++ has an option to convert to ANSI, and that does the trick.
In VS Code I can't find this encoding option. So I read up on it, and it looks like ANSI doesn't really exist and should actually be called Windows-1252.
However, there's a difference between Notepad++'s ANSI encoding and VS Code's Windows-1252: they pick different code points for characters such as an accented uppercase E (É), as is evident from the document.
When I let VS Code guess the encoding of the document converted to ANSI by Notepad++, however, it still guesses Windows-1252.
So the questions are:
Is there something like pure ANSI?
How can VS Code convert to it?
Check out the upcoming VSCode 1.48 (July 2020) and its browser support.
It also solves issue 33720, where you previously had to force the encoding for the whole project with, for instance:
"files.encoding": "utf8",
"files.autoGuessEncoding": false
Now you can set the encoding file by file:
Text file encoding support
All of the text file encodings of the desktop version of VSCode are now supported for web as well.
As you can see, the encoding list includes ISO 8859-1, which is the standard closest to what "ANSI encoding" usually means.
Windows code page 1252 was based on ISO 8859-1 but is not identical to it.
It differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range
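A quick Python check (illustrative only) makes that 0x80 to 0x9F difference concrete:

b = bytes([0x93])           # a byte in the 0x80-0x9F range

print(b.decode("cp1252"))   # '“' - a printable left double quotation mark
print(b.decode("latin-1"))  # '\x93' - a C1 control character, not printable

# The euro sign exists in Windows-1252 but not in ISO 8859-1:
print("€".encode("cp1252")) # b'\x80'
# "€".encode("latin-1")     # would raise UnicodeEncodeError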
From Wikipedia:
Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1.
Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard.
Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."

Windows code page 1252 able to handle non-English characters

We are in a situation where our program works on some machines and not on others. We identified the problem: we are using the ANSI version of GetTempPath, which fails on non-English OSes. So far so good. However, our code works on "some" computers, and the results of a test app are inconsistent. It seems that if the TEMP path has non-English characters, say TEMP=E:/टेम्प, then on some computers GetTempPath returns E:/??? and a later attempt to open a file in that folder fails. Rightly so. Easy to fix: use the Unicode versions of the API.
But on some other computers it returns the correct encoding, such that file opening ultimately succeeds.
I checked the ACP on these computers; it is 1252. HOW IS 1252 able to encode non-English characters?
It has become a topic of discussion: how was our program working all along? Such a bug should have been reported long ago, etc.
"HOW IS 1252 able to encode non-English characters?"
Because codepage 1252 has various non-English characters in it. See the full character table on Wikipedia. Note that टे, म्, and प are NOT present in 1252, which is why they end up as ? when treated as ANSI.
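A quick way to see this for yourself (a Python sketch, not part of the original answer):

# cp1252 covers many accented Latin characters...
print("é".encode("cp1252"))   # b'\xe9' - encodes fine
print("ä".encode("cp1252"))   # b'\xe4' - encodes fine

# ...but has no Devanagari, so every character degrades to '?':
print("टेम्प".encode("cp1252", errors="replace"))  # b'?????'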
Also, you should be using the Unicode versions of the API functions instead of the ANSI versions; then you wouldn't have this problem anymore.
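For illustration, a ctypes sketch of calling the wide (W) variant (Windows-only; MAX_PATH buffer size assumed):

import ctypes

MAX_PATH = 260
buf = ctypes.create_unicode_buffer(MAX_PATH + 1)

# GetTempPathW returns the path as UTF-16, independent of the ANSI code page,
# so a path like E:\टेम्प\ comes back intact instead of E:\???\.
ctypes.windll.kernel32.GetTempPathW(len(buf), buf)
print(buf.value)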

Identify hidden non-UTF8 encoded characters

I am working in a PostgreSQL database, and I have a text column containing various languages like Russian, Chinese, Korean, English, etc. Although our application handles these languages well, we are having an issue dealing with non-UTF-8 characters.
For example, the image from Notepad++ where I have done Encoding > Encode in UTF-8 neatly shows all the non-recognizable characters.
However, we are facing an issue marking such records as non-processable in Postgres. Something like a flag should do, but what I am trying below flags the valid Russian records as well, whereas Notepad++ explicitly shows the hidden/non-UTF-8 characters.
Notepad++
The weird thing about these characters is that they do not show up in a regular select query, but when I convert them to "UTF-8", they show up like below.
Database
I tried something like the query below, but it does not give me the desired output. The expectation is to set a flag on records that have invalid hidden references without losing valid text, like the valid Russian sentence in the snapshot. I should be able to distinctly identify only such texts.
select text, text ~ '[^[:ascii:]]', text ~ '^[\x00-\x7F]*$'
from sample_data;
Sample Data -
"Я не наркоман. Это у меня всегда, когда мне афигитительно. А если серьёзно, это интересно,…"
"Ya le dieron amor a la foto de instagram de mi #UberCALAVERITA?"
"Executive Admininstrative Assistant in Toronto, ON for a Group"
"Сегодня валютные стратеги BMO обновили прогнозы по основным валютам на ближайшие пять кварталов (на конец периода): читать далее…"
"Flicitations Gestion d'actifs pour 6 Trophes #FundGradeA+2016 de fonds communs de placement :"
This answer might help you go back to fix problems. It doesn't directly help you to go forward in the direction you are asking about.
Looking at Flicitations and F\302\202licitations, the escapes look like octal, which is possibly a presentation choice of your "IDE" and/or the convert_to function. From octal, \302\202 is 0xC2 0x82, decoding as UTF-8 gives U+0082. In Unicode, that's a control character, in ISO 8859-1 it's a non-character, either might explain why some renderings make it invisible or take no space.
Now, Google tells me that Flicitations is almost like a French word, Félicitations. So perhaps there is a character set and encoding where é is encoded as 0x82. Wikipedia helps here: indeed there is one, IBM850, which has been used for some French text.
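A small Python reconstruction (a sketch, assuming the IBM850 hypothesis above) reproduces the exact bytes:

print(b"\x82".decode("cp850"))      # 'é' - the intended character
print(b"\xc2\x82".decode("utf-8"))  # '\x82' - U+0082, an invisible control char

# Plausible corruption path: cp850 bytes mis-read as Latin-1, then re-encoded:
damaged = "Félicitations".encode("cp850").decode("latin-1").encode("utf-8")
print(damaged)                      # b'F\xc2\x82licitations' - i.e. F\302\202licitations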
So, it seems that someone has mishandled the user's text, causing data loss. The fundamental rule of text encoding is that text bytes must be read with the same encoding they were written with. Don't guess; ask, or reference a standard, specification, documentation, or convention. Maybe you can go back and find the misbehaving process or code; at least that would prevent future data loss.
"Dealing with non-UTF-8 characters": There aren't really any non-UTF-8 characters. UTF-8 is an encoding of the Unicode character set. There are areas with exceptions but, practically speaking, Unicode has all characters, and UTF-8 can encode them all. So, if you think there are non-UTF-8 characters, the writer is either non-compliant or the reader is using the wrong encoding.

Asciidoctor: Wrong Encoding in the Footer

I saved my .adoc file as UTF-8 and compiled it with asciidoctor (on Windows 10). In the text I wrote, there are no encoding problems, but in the automatically generated footer I get
Last updated 2016-08-27 11:52:56 Mitteleuropõische Sommerzeit
You see that I compiled on a German machine. For those not too familiar with German, the "õ" should be an "ä" instead.
I guess there is some problem in generating the timestamp. I would like to either correct the misspelling or change the time format so that it does not contain "words". Can anybody help?
This issue has been fixed in Asciidoctor 1.5.8, see: https://github.com/asciidoctor/asciidoctor/issues/2770
We are not using %Z anymore, because %Z is OS-dependent and may contain characters that aren't UTF-8 encoded: https://github.com/asciidoctor/asciidoctor/blob/cb7c20593344bda9bc968a619b02065d3401ad29/lib/asciidoctor/document.rb#L1253-L1254
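The fix itself is in Asciidoctor's Ruby code, but the idea is easy to see in a Python sketch (illustrative; the output depends on your locale and time zone):

from datetime import datetime

now = datetime.now().astimezone()

# %Z yields a locale-dependent zone *name* - the source of the mojibake:
print(now.strftime("%Y-%m-%d %H:%M:%S %Z"))  # e.g. ...56 Mitteleuropäische Sommerzeit
# %z yields a purely numeric offset, so no words can be mis-encoded:
print(now.strftime("%Y-%m-%d %H:%M:%S %z"))  # e.g. ...56 +0200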

Is it safe to use .islu translation files in Inno Setup for all languages?

I'm using Inno Setup for my open source project WinSCP.
So far I'm generating Inno Setup .isl translation files from our project-specific translation files (particularly to translate CustomMessages section).
But the .isl's need to be converted to ANSI encoding. That's a problem for languages that do not have an ANSI encoding at all (like Hindi or Armenian) or whose ANSI encoding is limited (like Romanian).
I see that core Inno Setup translations for some languages use the .islu extension, (probably) indicating that the contents are UTF-8 encoded. I can also see in the Inno Setup source code that the .islu's are used in the Unicode version of Inno Setup only. That's OK, as I'm using the Unicode version only.
But I did not find any mention of .islu in documentation.
Is it OK if I generate just .islu's for all languages? Is there any drawback (apart from inability to use the ANSI version of Inno Setup)?
Or should I keep using .isl for languages with good ANSI encoding, and use .islu just for selected languages?
I'd obviously prefer the first to simplify the process.
Also, what should LanguageCodePage be set to for .islu files? The official Nepali translation uses 0. I'm not sure if that's a general rule for .islu or if it's because Nepali does not have an ANSI encoding.
After a few months of using this, I can confirm that adding the UTF-8 BOM and using the .islu extension for all Inno Setup translations seems to work well.
The .islu files are documented since Inno Setup 5.5.9:
If no code page currently exists for the language, set LanguageCodePage to 0, use a special .islu extension for the language's file, and encode this file as Unicode. Note: this makes your language file unusable by Non Unicode Inno Setup so only do this if really needed. Also note: a LanguageName setting in a .islu file does not need to use the special "<nnnn>" encoding mentioned above.
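For completeness, a hypothetical Python helper matching the workflow described above (the file names and source code page are examples only):

def isl_to_islu(src: str, dst: str, src_encoding: str) -> None:
    """Re-encode an ANSI .isl translation as UTF-8 with BOM (.islu)."""
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()
    # "utf-8-sig" writes the UTF-8 BOM that Unicode Inno Setup expects.
    with open(dst, "w", encoding="utf-8-sig") as f:
        f.write(text)

# Romanian's limited ANSI code page is Windows-1250:
isl_to_islu("Romanian.isl", "Romanian.islu", src_encoding="cp1250")
# Remember to also set LanguageCodePage=0 inside the .islu, per the docs above.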