How to "force" a file's ISO-8859-1ness? - unicode

I remember when I used to develop website in Japan - where there are three different character encodings in currency - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)

Normally, if you have a £ encoded as ISO-8859-1 (ie. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS which you can exclude them by using byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint what encoding to use with a comment in the first line, eg. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)

You can put character ÿ (0xFF) in the file. It's invalid in UTF8. BBEdit on Mac correctly identifies it as ISO-8859-1. Not sure how your editor of choice will do.

Related

Notepad++ can recognize encoding?

I created file with UTF-8 encoded content (using PHP fputcsv).
When I open this file in Notepad++ - characters are wrong (Notepad++ starts with ANSI encoding).
When I set Format->"Encode in UTF-8" from menu - everything is fine.
Im worrying, that Notepad++ can recognize encoding somehow, and maybe something is wrong with my file created with fputcsv? First byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to The Unicode Standard document, the BOM is not recommended.
You have to check the lower right corner of the Notepad++ GUI to see the actual enconding that is being used. The problem it's not that Notepad++ specific because guessing the right encoding is a big problem without any real solution so it's better to let the user decide what is the most appropriate encoding in each single case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two thnigs: encoding and character set. When you open a text file, you see encoding under "Encoding" menu. Additionally look at the character set menu point. Under "Eastern European" you will find "ISO 8859-2", and under Central European "Windows-1250". You can set corresponding encoding in the Java program
when you look up in the table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for Cenntral European character set "Windows-1250" the table suggest Java encoding "Cp1250". Set the encoding and you will see the characters in program properly.

write utf8 with perl bug?

My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My file script is coded in UTF-8)
But if I insert an accentuated character in my print, then it's working and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately ASCII is still perfectly valid UTF-8. However, auto detection by editors will most likely not show UTF-8 as the encoding because they don't have anything to judge the file content's encoding by. I guess you simply don't know how file encodings work.
A file's encoding is a property that in general is not stored in a file or externally alongside a file. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read it with binmode or the corresponding I/O layer.
Now there is one way of marking a text file's encoding if said encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE) . That way is called the BOM (byte order mark). However, producing files with a BOM came from a time when UTF-8 had not been spread as widely as it is today. It usually poses more and different problems than it solves, especially due to editors and applications in general not supporting BOMs at all. Therefore BOMs should probably be avoided nowadays.
There are exceptions, of course, in which the file format contains certain instructions that tell the file's encoding. XML comes to mind with its DOCTYPE declaration. However, even for such files you will have to recognize if a file is encoded in a multi-byte encoding that always uses at least two bytes per character (UTF-16/UTF-32) or not in order to parse the DOCTYPE declaration in the first place. It's simply not simple ;)

How to get vim to show a byte-by-byte representation of file data

I don't want vim to ever interpret my data in any encoding specific way. In other words, when I'm in vim, I want the character that my cursor is on to correspond to the actual byte, not a utf* (etc.) representation of that byte.
I need to use vim to analyze issues caused by Unicode conversion errors made by other people (using other software) so it's important that I see what is actually there.
For example, in Cygwin's vim, I have been able to see UTF-8 BOMs as
 [START OF FILE DATA]
This is perfect. I recognize this as a UTF-8 BOM and if I want to know what the hex for each character is, I can put the cursor on the characters and use 'ga'.
I recently got a proper Linux machine (Fedora). In /etc/vimrc, this line exists
set fileencodings=ucs-bom,utf-8,latin1
When I look at a UTF-8 BOM on this machine, the BOM is completely hidden.
When I add the following line to ~/.vimrc
set fileencodings=latin1
I see

The first 3 characters are the BOM (when ga is used against them). I don't know what the last 3 characters are.
At one point, I even saw the UTF-8 BOM represented as "feff" - the UTF-16 BOM.
Anyway, you see my problem. I need to see exactly what is in my file without vim interpreting the bytes for me. I know I could use xxd, od, etc but vim has always been very convenient as an analysis tool. Plus I want to be able to edit the files and save them without any conversion problems.
Thanks for your help.
Use 'binary' mode:
:edit ++bin file
or
vim -b file
From :help 'binary':
The 'fileencoding' and 'fileencodings' options will not be used, the
file is read without conversion.
I get some good mileage from doing :e ++enc=latin1 after loading the file (VIm's initial guess on the encoding isn't important at this stage).
The sequence  is actually the U+FEFF (BOM) encoded UTF-8, decoded latin1, encoded UTF-8, and decoded latin1 again.  is the U+FEFF (BOM) encoded as UTF-8 and decoded as latin1. You can't get away from encodings. Those aren't the actual bytes, they are the latin1 characters displayed from an incorrect decoding. If you want bytes, use a hex editor; otherwise, use the correct decoding.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it is nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that have non-ASCII characters like accented letters in will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very importaint since your whatevertool will show false chars/whatever if you use the wrong encoding. Try to load a kyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

Displaying Unicode characters above U+FFFF on Windows

the application I'm developing with EVC++ 4 runs on Windows CE 5 and should support unicode (AFAIK wchar_t uses UTF-16 on windows, so I'm using that), so I want to be able to test it with "more exotic" characters. Especially with characters that use 4 Byte in UTF-16 and not just 2. Therefore I'm trying to display such characters in a texteditor (atm on my desktop PC with Windows XP, not on the embedded device).
But I haven't managed it to do so yet. As an example I've chosen this character.
Like mentioned here "MPH 2B Damase" should support this character. So I downloaded the font and put it into Windows\Fonts. I created a textfile using a hexeditor (just to be sure) with following content:
FFFE D802 DC00
When I open it with notepad (which should be unicode-capable, right?) and use the downloaded font it doesn't display 1 char, as intended, but this 2:
˘Ü
What am I doing wrong? :)
Thanks!
hrniels
Edit:
Flipping the BOM, as suggested, doesn't work. Notepad (and all other editors I tried, too) displays two squares in this case. Interesting is that if I copy the two squares here (with firefox) I see the right character:
I've also tried it with Komodo Edit with the same result.
Using UTF-8 doesn't help notepad either.
What happens if you put the byte order mark the other way around?
FEFF D802 DC00
(At the moment the byte sequence is being interpreted as the two characters U+02D8 U+00DC, so hopefully flipping the BOM will cause the bytes to be read in the intended order)
Probably you forgot to read the _wfopen() documentation. There they specify the encoding parameter. BTW, I assumed you are already using Unicode (wchars).
I would recommend you to use UTF-8 in files with or without BOM but forcing your fopen to use UTF-8 flag. It looks _wfopen("newfile.txt", "r, ccs=UTF-8"); will work with UTF-8 with or without BOM and also with UTF-16. Do not make the mistake of using the ccs=Unicode, it is a common thing to have UTF-8 files without BOM.
You should really read a little bit about Unicode before trying to work. This about this as a very good investment - it will save you time if you understand how Unicode works.
Here is a start http://blog.i18n.ro/newbie-guide-to-unicode/ and do not forget to read the links from the end of the article.
If you really need a simple text editor that allows you to play with Unicode encodings, use Notepad++ and forget about Notepad.
Your text editor might not like UTF-16. It probably assumes ANSI or UTF-8.
Try typing in the UTF-8 equivalent instead:
0xF0 0x90 0xA0 0x80
This won't help your testing, but will make sure your font isn't at fault. A text editor that does support UTF-16 is Komodo Edit.