understanding file encodings - encoding

in eclipse, I have a file where some place this is written:
onclick='obj1.help_open_new_window(fn1(), "/redir/url_name")'
and in eclipse Edit menu->set encoding, I see this:
Now I change the encoding to UTF-8 using the same dialog box and the text changes to:
onclick='obj1.help_open_new_window(fn1(),�"/redir/url_name")'
All I know is if this was not happening, then my website would be working fine. Why is this happening and what do I do to prevent this?
I do have some knowledge about encodings: Â and nbsp mystery explained The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) but still I do not understand why this is happening. Feel free to go to byte level(how file is stored) just to explain it.
UPDATE: Here's what I understand: if the file is encoded in latin-1 then every character is a byte and so is the . it should be hex(32). now when I convert it to utf-8, it still remains hex(32) and that is definitely . this leads me to believe that in latin-1, is not hex(32) but a combination of two bytes. How is that possible?

The character you have between the comma and the quote appear sto not be a normal space but some other whitespace character, probably the famous U+00A0 NO-BREAK SPACE. Since the file is encoded in latin1, the character is stored on disk as the byte \xA0, which does not form a valid character in UTF-8. This means that if you reload the file in your editor using UTF-8 you will see the universal replacement character � in its stead. (The proper UTF-8 encoding of no-break space would be \xC2\xA0.)
To get rid of the problem replace the no-break space with a normal space (U+0020). There is no reason why you should use a no-break space in this context, i.e. in program text.

Related

Chinese in Japanese encoding

This may sound like a stupid question. I typed some Chinese characters into an empty text file in VS code text editor (default utf8). Then I saved the file in an encoding for Japanese: shift JIS, which apparently doesn't cover all the characters I have typed in.
However, before I close the file, all Chinese characters are displayed properly in VS code. Now after I closed the file and reopened it using shift JIS encoding, several characters are displayed as a question mark ?. I guess these are the Chinese characters not covered by the Japanese encoding?
What happened in the process? Is there anyway I can 'get back' the Chinese characters that are now shown in ?? I don't really understand how encoding works in this scenario...
Not all encodings cover all characters. (Unicode encodings, in principle, do, but even they don't have quite everything yet.) If you save some text in an encoding which does not include all characters in that text, something has to give.
Options:
you get an error message,
nothing saves at all,
the characters which cannot be included are silently dropped,
the characters which cannot be included are converted to some other character (such as the question mark).
Once that conversion is done, the data is lost, and cannot be recovered. Why not use UTF-8 or another Unicode encoding? (GB 18030 might be the best for large amounts of Chinese text.)

Are there any character sets that don't respect ASCII?

As far as I understand, a character encoding maps bits to integers and a character set maps integers to characters.
So in the Unicode character set there is a telephone character. It is represented using the integer 9742, more commonly represented using Hexadecimal as 260E. This is then saved to a file using UTF-8 which translates the integer 9742 into 10011000001110. Please correct me if I am wrong.
Yesterday I created a text file that used the Unicode character set and UTF-8 encoding and I saved it to my desktop. I then reopened the file in my text editor and started to manually switch the character sets for fun. Unsurprisingly there were problems and odd characters starting displaying! I noticed that only some of the characters are misrepresented though. This got me thinking, why do only some of the characters break? Why not all?
Someone told me that the characters breaking are those outside the original ASCII specification. Upon reflection this seemed to make sense, as it's only non US characters that break. I was told that because all character sets use the ASCII character set up to the first 128 characters they will remain unbroken, and that it's the characters above 127 that break. Please correct me if I am wrong.
Finally, I got thinking. Are there any character sets that don't respect ASCII? If so, what are they called and what are they used for?
Based on my findings from the comments I am able to answer my own question. Thank you to everyone who commented!
Yes, there are a couple; EBCDIC and Baudot.

understanding different character encodings

When I save a text document in UTF-8 that's basically saying: Computer, use the codepage for UTF8 that's installed somewhere on your computer to figure out, how to turn the 1's and 0's to characters, right?
When I save this content:
激光
äüß
#§
in ISO-8895-1, it becomes this (on Linux, using Kate editor):
æ¿å
äüÃ
#§
What is not displayed here is that in the first and second row that are some weird squares displayed instead of characters (can be seen in developer tools).
So my understanding is that this means that the combination of 0's and 1's that represent 激 in utf-8 is mapped to æ in ISO-8895-1, right? And the weird squares > < happen because there is no mapping for that binary number in the ISO-8895-1 character set so the computer defaults to some other encoding.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it usually gets a special byte combination that indicates its type of encoding at the beginning of the file. I think, Kate (don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but was just visualized in a wrong way.
The weird squares are another indicator, that Kate doesn't recognize those leading bytes, cause usually editors hide them from the user and just use the information to display the file correctly.
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes in ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all, or is mapped to a non-printable control character, that's why you can see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.

How to get vim to show a byte-by-byte representation of file data

I don't want vim to ever interpret my data in any encoding specific way. In other words, when I'm in vim, I want the character that my cursor is on to correspond to the actual byte, not a utf* (etc.) representation of that byte.
I need to use vim to analyze issues caused by Unicode conversion errors made by other people (using other software) so it's important that I see what is actually there.
For example, in Cygwin's vim, I have been able to see UTF-8 BOMs as
 [START OF FILE DATA]
This is perfect. I recognize this as a UTF-8 BOM and if I want to know what the hex for each character is, I can put the cursor on the characters and use 'ga'.
I recently got a proper Linux machine (Fedora). In /etc/vimrc, this line exists
set fileencodings=ucs-bom,utf-8,latin1
When I look at a UTF-8 BOM on this machine, the BOM is completely hidden.
When I add the following line to ~/.vimrc
set fileencodings=latin1
I see

The first 3 characters are the BOM (when ga is used against them). I don't know what the last 3 characters are.
At one point, I even saw the UTF-8 BOM represented as "feff" - the UTF-16 BOM.
Anyway, you see my problem. I need to see exactly what is in my file without vim interpreting the bytes for me. I know I could use xxd, od, etc but vim has always been very convenient as an analysis tool. Plus I want to be able to edit the files and save them without any conversion problems.
Thanks for your help.
Use 'binary' mode:
:edit ++bin file
or
vim -b file
From :help 'binary':
The 'fileencoding' and 'fileencodings' options will not be used, the
file is read without conversion.
I get some good mileage from doing :e ++enc=latin1 after loading the file (VIm's initial guess on the encoding isn't important at this stage).
The sequence  is actually the U+FEFF (BOM) encoded UTF-8, decoded latin1, encoded UTF-8, and decoded latin1 again.  is the U+FEFF (BOM) encoded as UTF-8 and decoded as latin1. You can't get away from encodings. Those aren't the actual bytes, they are the latin1 characters displayed from an incorrect decoding. If you want bytes, use a hex editor; otherwise, use the correct decoding.

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop website in Japan - where there are three different character encodings in currency - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (ie. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS which you can exclude them by using byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint what encoding to use with a comment in the first line, eg. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put character ÿ (0xFF) in the file. It's invalid in UTF8. BBEdit on Mac correctly identifies it as ISO-8859-1. Not sure how your editor of choice will do.