Can Eclipse detect a file encoding with a specific text header?

What can I do to a file to have Eclipse open it with UTF-8 encoding on any computer?
Context: I will distribute a text file to multiple people. The file contains UTF-8 characters, but Eclipse does not display them correctly by default, since the file's properties specify the encoding as "Default (Inherited from container: Cp1252)".
The file is displayed correctly if I change this property to "Other: UTF-8"; however, I don't want the people who receive this file to have to configure this property, or change any other setting, just to see the UTF-8 characters correctly.

If the text file starts with a UTF-8 Byte Order Mark (the sequence 0xEF,0xBB,0xBF) then the Eclipse content type system should recognize the file as being UTF-8.
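If you generate the file yourself, you can stamp the BOM on programmatically. Here is a minimal Python sketch of that idea; the file name `readme.txt` and its sample content are just placeholders:

```python
import codecs

def add_utf8_bom(path):
    """Prepend the UTF-8 BOM (EF BB BF) unless the file already starts with one."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):  # avoid stamping it twice
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + data)

# Demo: write a BOM-less UTF-8 file, then stamp it with a BOM.
with open("readme.txt", "wb") as f:
    f.write("Demnächst fällig".encode("utf-8"))
add_utf8_bom("readme.txt")
```

The function is idempotent, so it is safe to run on files that may already carry a BOM.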

Related

PhpStorm: Converting folders encoding to another

I have a project with lots of files in ISO-8859-15, and I need to convert them to UTF-8. If I change one file, PhpStorm asks "Do you want to convert...", and if I say yes, the important symbols do not become ???.
However, since my project contains a huge number of files, I cannot do that one by one. If I change the encoding in the project settings instead, the encoding may change to UTF-8, but all the symbols become ??? (i.e. no conversion takes place).
So how can I tell PhpStorm to convert all files to UTF-8? Is it possible, and if so, how? What is the alternative method?
AFAIK it's not possible to do this for a whole folder tree at once, but it can be done for multiple files (e.g. all files in a certain folder):
Select desired files in Project View panel
Use File | File Encoding
When asked -- make sure you choose "convert" and not just "read in another encoding".
You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).
Another possible alternative is to use something like iconv (or any other similar tool) and do it in terminal/console.
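For the terminal route, the same batch conversion can also be sketched in a few lines of Python, roughly what running `iconv -f ISO-8859-15 -t UTF-8` per file would do. The `*.php` pattern and the source encoding are assumptions you should adjust to your project:

```python
import pathlib

def convert_tree(root, src_enc="iso-8859-15", dst_enc="utf-8", pattern="*.php"):
    """Re-encode every file matching `pattern` under `root` in place.
    Assumes all matched files really are in `src_enc`; mixed trees need care."""
    for path in pathlib.Path(root).rglob(pattern):
        text = path.read_text(encoding=src_enc)   # decode from the old encoding
        path.write_text(text, encoding=dst_enc)   # re-encode as UTF-8
```

Run it on a copy (or a clean VCS checkout) first: converting a file that is already UTF-8 with `src_enc="iso-8859-15"` would mangle it, just like the REOPEN/CONVERT mix-up described below.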
Watch out when opening the file you want to convert in PhpStorm. In my case all the files were still encoded in ISO-8859 but were opened as UTF-8, resulting in mangled umlauts. In that state a direct conversion to UTF-8 is not possible.
If you encounter this, do the following:
Open the ISO-8859 file
Change the file-encoding dropdown (lower right corner) to ISO-8859-1 or ISO-8859-15 and choose REOPEN
The mangled characters will now display correctly
Then change the encoding again (dropdown in the lower right corner), this time to UTF-8, and choose CONVERT
Now the file is properly encoded in UTF-8
cheers

How to keep BOM from removal from Perforce unicode files

I have converted an entire branch of .NET and SQL sources to UTF-8 with BOM, changing their Perforce file type to Unicode in the same operation. (The encoding difference might sound confusing, but in Perforce the Unicode file type denotes UTF-8 file content.) Later I found out that Perforce silently eliminates the BOM marker from UTF-8 files. Is it possible to set Perforce to keep UTF-8 BOM markers in files of the Unicode file type? I can't find it documented.
The Perforce server is switched to Unicode mode, and the connection encoding is "UTF-8 no BOM" (changing it to "UTF-8 with BOM" doesn't make any difference).
Example:
check out a source file from Perforce
change file type to Unicode
convert file content to format "UTF-8 with BOM"
submit the file (now the file still keeps BOM in first 3 bytes)
remove the file from workspace
get the latest revision of the file (now the file doesn't contain BOM at the beginning)
OK, Hans Passant's comment encouraged me to re-examine P4CHARSET, and finally the answer has two parts:
For Perforce command-line access, the setting of the P4CHARSET variable controls the behavior. To enable adding a BOM to files of Unicode type, use the command
p4 set P4CHARSET=utf8-bom
To have these files without a BOM, use
p4 set P4CHARSET=utf8
For P4V, the Perforce Visual Client, the setting can be changed via the menu Connection > Choose Character Encoding.... Use the value Unicode (UTF-8) to enable adding a BOM and Unicode (UTF-8, no BOM) to suppress it.
If the menu item Choose Character Encoding... is disabled, ensure the following (and then check again):
P4V has an open and working connection to the server
the pane containing the depot/workspace tree is focused (click inside it to make sure)
Notes:
if you usually combine both of the above ways to access Perforce, you need to apply both settings, otherwise you will keep getting mixed results
if you want to instantly add/remove a BOM to/from existing files, adjust the above settings, then remove the files from your workspace and add them again (see steps 5 and 6 of the example posted in the question). Other server actions that change file content (integrating, merging, etc.) will do something similar
for other encoding options and their impact on BOM, see the second table in Internationalization Notes for P4D, the Perforce Server and Perforce client applications
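To verify what a sync actually produced, a quick check of the first three bytes is enough. A small Python sketch of that check (the file path is whatever you just synced):

```python
import codecs

def has_utf8_bom(path):
    """True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

Running this on a file before and after the remove/re-sync cycle from the example above shows immediately whether the current P4CHARSET setting keeps or strips the BOM.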

Extra characters in .doc file when opened with textpad

When I open a document in TextPad, an extra null character appears between every two characters.
For example, my document contains the following text
बॉम्बे testing for webmail.
When I open it in TextPad, it comes out as
I....M....I t.e.s.t.i.n.g. f.o.r. w.e.b.m.a.i.l.
Can anybody help me with this?
This file is in UTF-16 or UCS-2 format. When opening it, you must specify which encoding to use. Your text editor does not recognize this encoding automatically.
If your text editor does not allow setting the encoding when opening a file, try using Notepad++ or TextPad.
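The "null between every character" pattern is exactly what ASCII text looks like in UTF-16, which suggests a simple sniffing heuristic. A rough Python sketch (the little-endian fallback for BOM-less files is an assumption, typical of Windows-produced files):

```python
import codecs

def sniff_and_decode(path):
    """Crude sniffing: check for a UTF-16 BOM first, then fall back to a
    NUL-byte heuristic, since ASCII text stored as UTF-16 is full of NULs."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")        # decode() honors and strips the BOM
    if b"\x00" in data:
        return data.decode("utf-16-le")     # assumption: little-endian, no BOM
    return data.decode("utf-8")
```

This is only a heuristic, not a replacement for the editor-level encoding setting: a genuinely binary file, or UTF-16 text in the other byte order, would defeat it.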

GWT: Character encoding umlauts

I want to set a text in a label:
labelDemnaechst.setText(" Demnächst fällig:");
In the output of the application, the character "ä" is displayed incorrectly.
How can I display it correctly?
GWT assumes all source files are encoded in UTF-8 (and this is why you see lÃ¶schen if you open them with an editor that interprets them as Windows-1252 or ISO-8859-1).
GWT also generates UTF-8-encoded files, so your web server should serve them as UTF-8 (or don't specify an encoding at all, to let browsers "sniff" it) and your HTML host page should be served in UTF-8 too (best is to put a <meta charset=utf-8> at the beginning of your <head>).
If a \u00E4 in Java is displayed correctly in the browser, then you're probably in the first case, where your source files are not encoded in UTF-8.
See http://code.google.com/webtoolkit/doc/latest/FAQ_Troubleshooting.html#International_characters_don't_display_correctly
Well, you have to encode your special characters as Unicode escapes. You can find a list of the corresponding Unicode characters here.
Your example would look like this:
labelDemnaechst.setText("Demn\u00E4chst f\u00E4llig:");
Hope this helps, if no one has a better solution.
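Looking up each character by hand gets tedious; the escaping can be automated, much like the JDK's native2ascii tool does. A small Python sketch of the idea (BMP characters only; anything outside the BMP would need Java surrogate pairs, which this does not handle):

```python
def to_java_escapes(s):
    """Escape non-ASCII BMP characters as \\uXXXX Java escapes, similar in
    spirit to the JDK's native2ascii tool. ASCII passes through unchanged."""
    return "".join(ch if ord(ch) < 128 else "\\u%04X" % ord(ch) for ch in s)

print(to_java_escapes("Demnächst fällig:"))  # Demn\u00E4chst f\u00E4llig:
```

Escaped sources survive any source-file encoding, though fixing the editor's encoding (see the appendix below) is usually the cleaner long-term fix.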
Appendix:
Thanks Thomas for your tip; you really have to change the format in which Eclipse saves its source files. By default it uses something like Cp1252. If you change it to UTF-8, your example works correctly. (So Demnächst is written correctly.)
You can edit the save format if you right-click on your file --> Properties.
To get UTF-8 encoding for your entire workspace, go to Window -> Preferences. In the pop-up, start typing "encoding". You should now see Content Types, Workspace, CSS Files, HTML Files, and XML Files as results. In Content Types you can type UTF-8 into the Default encoding text box; for the other elements you can just select the encoding in their respective list boxes.
Then check the encoding for your project in Project -> Properties -> Resource.
Detailed instruction with pictures can be found here:
http://stijndewitt.wordpress.com/2010/05/05/unicode-utf-8-in-eclipse-java/
Cheers
What I did:
open the file with Notepad (from Windows Explorer),
and save it with the option UTF-8 instead of the proposed ANSI.
Encoding the project to UTF-8 didn't work (for me)
Cheerio
Use the ISO-8859-1 (Western European) character set instead of UTF-8.

Is there a way to get the encoding of a text file in UltraEdit?

Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding that is being used to -display- the file is shown in the status bar at the right, together with the line-ending type in use, for example "U8-UNIX". You can also manually set the encoding in which the file is displayed; in version 10 this is under the menu View -> Set Code Page. You can also -convert- the actual code page of the file under the menu File -> Conversions.
If the file does not have a BOM header (a couple of bytes at the start of the file indicating the encoding), the -actual- encoding of the file can only be guessed. And even if the file has a BOM header, there can still be encoding issues.
All text editors do this, and some are better at it than others. I haven't done a comparison to see which is best. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in text files of 1000 lines or more if the first UTF-8 character only appears later in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.
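The BOM-based part of what such tools do is straightforward to sketch in Python; everything beyond that (statistical guessing for BOM-less files) is where they differ. Note the check order below, since the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes:

```python
import codecs

# Longest/most specific BOMs first: FF FE 00 00 (UTF-32-LE) would otherwise
# be misreported as FF FE (UTF-16-LE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_by_bom(path):
    """Return the encoding implied by a BOM, or None if there is no BOM."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A `None` result is exactly the guessing case described above: without a BOM, the file's encoding can only be inferred from its content.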