I saved my .adoc file as UTF-8 and compiled it with asciidoctor (on Windows 10). In the text I wrote, there are no encoding problems, but in the automatically generated footer I get
Last updated 2016-08-27 11:52:56 Mitteleuropõische Sommerzeit
You see that I compiled on a German machine. For those not too familiar with German, the "õ" should be an "ä" instead.
I guess there is some problem in generating the timestamp. I would like to either correct the misspelling or change the time format so that it does not contain "words". Can anybody help?
This issue has been fixed in Asciidoctor 1.5.8, see: https://github.com/asciidoctor/asciidoctor/issues/2770
We are not using %Z anymore because %Z is OS-dependent and may contain characters that aren't UTF-8 encoded: https://github.com/asciidoctor/asciidoctor/blob/cb7c20593344bda9bc968a619b02065d3401ad29/lib/asciidoctor/document.rb#L1253-L1254
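For illustration (the relevant Asciidoctor code is Ruby, so this is just to show the distinction), strftime behaves the same way in Python: %Z is whatever zone name the OS and locale supply, while a numeric %z offset is locale-independent:

from datetime import datetime, timezone

now = datetime.now(timezone.utc).astimezone()   # local time with timezone attached
print(now.strftime("%Y-%m-%d %H:%M:%S %Z"))     # zone name, OS/locale-dependent
print(now.strftime("%Y-%m-%d %H:%M:%S %z"))     # numeric offset, e.g. +0200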
Related
For a current project, I use a number of CSV files that are saved in UTF-8. The motivation for this encoding is that they contain information in German with the special characters ä, ö, ü, ß. My team is working with Stata 13 on Mac OS X and Windows 7 (the software is frequently updated).
When we import a CSV file into Stata (choosing Latin-1 on import), the special characters are displayed correctly on both operating systems. However, when we export the dataset to another CSV file on Mac OS X - which we need to do quite often in our setup - the special characters are replaced, e.g. ä -> Š, ü -> Ÿ, etc. On Windows, exporting works like a charm and the special characters are not replaced.
Troubleshooting: Stata 13 cannot interpret Unicode. I have tried converting the UTF-8 files to Windows-1252 and Latin-1 (ISO 8859-1) encoding (since, after all, all they contain are German characters) using Sublime Text 2 prior to importing them into Stata. However, the same problem remains on Mac OS X.
Yesterday, Stata 14 was announced, which apparently can deal with Unicode. If that is the cause, upgrading would probably help with my problem; however, we will not be able to upgrade soon. Apart from that, I am wondering why the problem arises on Mac but not on Windows. Can anyone help? Thank you.
[EDIT Start] When I import the exported CSV file again using the "Mac Roman" text encoding (Stata allows you to specify that in the import dialogue), my German special characters appear again. Apparently I am not the only one encountering this problem, by the looks of this thread. However, because I need to work with the exported CSV files, I still need a solution to this problem. [EDIT End]
[EDIT2 Start] One example is the word "Bösdorf", which is changed to "Bšsdorf". In the original file the hex code is 42c3 b673 646f 7266, whereas the hex code in the exported file is 42c5 a173 646f 7266. [EDIT2 End]
Until the bug gets fixed, you can work around this with
iconv -f utf-8 -t cp1252 <oldfile.csv | iconv -f mac -t utf-8 >newfile.csv
This undoes an incorrect transcoding which apparently the export function in Stata performs internally.
Based on your indicators, cp1252 seems like a good guess, but it could also be cp1254. More examples could help settle the issue if you can't figure it out (common German characters to test with would still include ä and the uppercase umlauts, the German sharp s ß, etc.).
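The hex dump in the question fits this picture: UTF-8 "ö" (c3 b6) is written out as the Mac Roman byte 9a, which is "š" in cp1252, and that "š" then gets saved as UTF-8 (c5 a1). If a shell with iconv is not at hand, the same round trip can be sketched in Python (file names taken from the iconv command above; cp1252 assumed rather than cp1254):

# Undo the bad transcoding: map the exported UTF-8 text back to cp1252 bytes,
# then reinterpret those bytes as Mac Roman and write the file out as UTF-8 again.
with open("oldfile.csv", encoding="utf-8") as f:
    broken = f.read()

repaired = broken.encode("cp1252").decode("mac_roman")   # e.g. "Bšsdorf" -> "Bösdorf"

with open("newfile.csv", "w", encoding="utf-8") as f:
    f.write(repaired)

If more samples end up pointing to cp1254, that is a one-word change in the sketch.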
Stata 13 and below use a deprecated encoding on Mac OS X, macroman (Mac OS X itself is Unicode). I generally used StatTransfer to convert, for example, from Excel (Unicode) to Stata (Western, macroman; Options -> Encoding options) for Spanish-language data. It was the only way to get á, é, etc. Furthermore, Stata 14 imports Unicode without problems but insists on exporting with es_ES (Spanish, Spain) as the default locale, so you have to add locale UTF-8 at the end of the export command to get a readable Excel file.
Well, this is strange and hard to explain, but I'll try to do my best.
For some reason the values passed to the template change their encoding (I'm pretty sure they do).
Controller file (encoded in UTF-8):
print STDERR "ąęść";
$c->stash->{some_variable} = "ąęść"; # some unicode chars
Template file (encoded in UTF-8):
[% some_variable %]<br>
test: ąęść
As output in the browser I'm getting:
ÄÄÅÄ
test: ąęść
Output on console (with UTF-8 encoding enabled):
ąęść
Please take a look at the good documentation provided by the Catalyst Wiki at Using Unicode and also Catalyst::View::TT. The Perl Unicode Cookbook may help you get a better understanding of Perl's UTF-8 support, which is usually better than in most other languages available today.
You may need to save your templates with a UTF-8 BOM using your editor, so that the editor encodes your template file properly when saving; if you don't set a BOM, then at least define the file encoding as UTF-8 every time you save it.
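As a footnote on the symptom itself: the garbled browser output is exactly what the UTF-8 bytes of "ąęść" look like when they are re-read as Latin-1, which points at the response not being decoded or served as UTF-8. A quick check, sketched in Python purely for illustration (the application itself being Perl):

mojibake = "ąęść".encode("utf-8").decode("latin-1")
print(mojibake)   # each character becomes 'Ä' or 'Å' plus an invisible C1 control,
                  # so in the browser it collapses to "ÄÄÅÄ"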
There have been a ton of fixes to Unicode support and UTF-8 handling in general in the most recent stable release of Catalyst (5.90084). Catalyst is now UTF-8 by default, but of course there are always some rough edges. You should review the most recent docs on the subject to see what is going wrong.
If your template contains multibyte characters then you do indeed need to set the BOM, or review the documentation for your template view of choice.
I created a file with UTF-8 encoded content (using PHP's fputcsv).
When I open this file in Notepad++, the characters are wrong (Notepad++ opens it with ANSI encoding).
When I select Format -> "Encode in UTF-8" from the menu, everything is fine.
I'm worried that Notepad++ should be able to recognize the encoding somehow, and that maybe something is wrong with my file created with fputcsv? The first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
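If you go that route, the check and fix are easy to script. A minimal sketch in Python (rather than PHP), with a placeholder file name:

BOM = b"\xef\xbb\xbf"                      # the UTF-8 byte order mark
with open("export.csv", "rb") as f:        # "export.csv" stands in for your CSV
    data = f.read()
if not data.startswith(BOM):
    with open("export.csv", "wb") as f:
        f.write(BOM + data)                # prepend the BOM so editors can detect UTF-8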
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to the Unicode Standard the BOM is neither required nor recommended for UTF-8.
You have to check the lower right corner of the Notepad++ GUI to see the actual encoding that is being used. The problem is not really specific to Notepad++: guessing the right encoding is a hard problem with no general solution, so it's better to let the user decide which encoding is most appropriate in each case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two things: the encoding and the character set. When you open a text file, you see the encoding under the "Encoding" menu. Additionally, look at the character set submenu: under "Eastern European" you will find "ISO 8859-2", and under "Central European" you will find "Windows-1250". You can set the corresponding encoding in the Java program by looking it up in this table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250" the table suggests the Java encoding "Cp1250". Set that encoding and you will see the characters properly in the program.
I remember that when I used to develop websites in Japan - where there are three different character encodings in common use - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (i.e. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way as to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
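A quick check confirms the point (in Python, just to show the byte-level behaviour):

print(b"\xa3".decode("iso-8859-1"))   # -> £ : a perfectly good ISO-8859-1 byte
try:
    b"\xa3".decode("utf-8")           # 0xA3 can never start a UTF-8 sequence
except UnicodeDecodeError as e:
    print(e)                          # "invalid start byte"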
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can rule out, by using byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint about what encoding to use with a comment in the first line, e.g. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put the character ÿ (0xFF) in the file. It's invalid in UTF-8. BBEdit on the Mac then correctly identifies the file as ISO-8859-1. I'm not sure how your editor of choice will do.
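That makes it an even stronger marker than £: 0xA3 can still occur as a continuation byte inside a valid sequence (UTF-8 encodes £ itself as C2 A3), whereas 0xFF never occurs anywhere in well-formed UTF-8. In Python terms:

try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError as e:
    print(e)   # 0xFF is rejected in any position of a UTF-8 stream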
I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, for example, Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, e.g. Chinese systems using one GB encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks (I up-voted his answer) go to Michael Urman for pointing us in the right direction. But here is the actual working algorithm (with InstallShield 2009), reverse-engineered by a co-worker:
Start with a Unicode (multi-byte-character) string
Write out its length as the encoded-length field in the .ism file
Encode the string as UTF-16 little-endian
Base64-encode it using the uuencode dictionary, except with ` (back-tick) instead of spaces
Write the result to the .ism file, escaping XML entities
Be aware that base64-encoding with the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, footers and one or more data lines, each of which begins with a length character. If you're implementing this with a uuencode codec, you'll need to strip all of that off.
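Putting those reverse-engineered steps together, here is a rough Python sketch (the exact meaning of the length field and the handling of base64 '=' padding are guesses, so treat it as a starting point rather than a verified implementation):

import base64

# Standard base64 alphabet mapped onto the uuencode dictionary (chr(32 + value) for
# values 0..63), with a back-tick standing in for the space that would encode value 0.
STD64 = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
UUDICT = bytes(96 if i == 0 else 32 + i for i in range(64))
TO_UUDICT = bytes.maketrans(STD64, UUDICT)

def encode_ism_string(text):
    # Assumption: the length field is the character count of the original string,
    # and any '=' padding produced by base64 is left untouched (unverified).
    raw = text.encode("utf-16-le")                        # UTF-16 little-endian
    encoded = base64.b64encode(raw).translate(TO_UUDICT)  # base64, uuencode alphabet
    return len(text), encoded.decode("ascii")             # caller escapes XML entities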
I'm also trying to figure this out...
I've inherited some InstallShield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but I have not yet verified that this is the solution.