How to keep the BOM from being removed from Perforce Unicode files

I converted an entire branch of .NET and SQL sources to UTF-8 with BOM and changed their Perforce file type to Unicode in the same operation. (The difference in naming may sound confusing, but in Perforce the Unicode file type denotes UTF-8 file content.) Later I found out that Perforce silently strips the BOM from UTF-8 files. Is it possible to configure Perforce to keep the UTF-8 BOM in files of the Unicode file type? I can't find this documented.
The Perforce server is in Unicode mode, and the connection encoding is UTF-8 without BOM (changing it to UTF-8 with BOM makes no difference).
Example:
1. check out a source file from Perforce
2. change the file type to Unicode
3. convert the file content to the format "UTF-8 with BOM"
4. submit the file (at this point the file still has the BOM in its first 3 bytes)
5. remove the file from the workspace
6. get the latest revision of the file (now the file no longer has the BOM at the beginning)
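To check whether the BOM survived the final sync, it can help to look at the first three bytes of the file directly rather than trusting an editor. The following is a minimal sketch; the helper name has_utf8_bom and the sample file names are hypothetical, and it only inspects local files without talking to Perforce.

```python
import codecs
import tempfile
from pathlib import Path


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    # Hypothetical sample files standing in for synced workspace files.
    with tempfile.TemporaryDirectory() as tmp:
        with_bom = Path(tmp) / "with_bom.sql"
        without_bom = Path(tmp) / "without_bom.sql"
        with_bom.write_bytes(codecs.BOM_UTF8 + b"SELECT 1;\n")
        without_bom.write_bytes(b"SELECT 1;\n")
        for sample in (with_bom, without_bom):
            print(sample.name, has_utf8_bom(sample))
```

Running the check before submitting and again after re-syncing shows at which point the BOM disappears.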

OK, Hans Passant's comment encouraged me to re-examine P4CHARSET, and the answer turns out to have two parts:
For Perforce command-line access, the P4CHARSET variable controls the behavior. To have the BOM added to files of the Unicode type, use the command
p4 set P4CHARSET=utf8-bom
To get these files without a BOM, use
p4 set P4CHARSET=utf8
For P4V, the Perforce Visual Client, the setting can be changed via the menu Connection > Choose Character Encoding.... Use the value Unicode (UTF-8) to have the BOM added, and Unicode (UTF-8, no BOM) to suppress it.
If the menu item Choose Character Encoding... is disabled, check the following and then try again:
P4V has an open, working connection to the server
the pane containing the depot/workspace tree is focused (click inside it to make sure)
Notes:
if you regularly use both of the above ways to access Perforce, you need to apply both solutions, otherwise you will keep getting mixed results
if you want to add or remove the BOM in existing files right away, adjust the above settings, then remove the files from the workspace and get them again (see steps 5 and 6 of the example in the question). Other server actions that change file content (integrating, merging etc.) will have a similar effect
for other encoding options and their impact on the BOM, see the second table in Internationalization Notes for P4D, the Perforce Server and Perforce client applications

Related

PhpStorm: Converting folders encoding to another

I have a project with lots of files in ISO-8859-15, and I need to convert them to UTF-8. If I change the encoding of a single file, PhpStorm asks whether I want to convert it, and if I say yes, the special characters are preserved and do not turn into ???.
However, since the number of files in my project is HUGE, I cannot do that one by one. Changing the encoding in the project settings may switch the encoding to UTF-8, but all the special characters become ??? (so no real conversion takes place).
So, how can I tell PhpStorm to convert all files to UTF-8? Is it possible, and if so, how? If not, what is the alternative?
AFAIK it's not possible to do this for a whole folder at once, but it can be done for multiple files (e.g. all files in a certain folder):
Select the desired files in the Project View panel
Use File | File Encoding
When asked, make sure you choose "convert" and not just "read in another encoding".
You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).
Another possible alternative is to use something like iconv (or any other similar tool) and do it in terminal/console.
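If a script is preferred over iconv, the same batch conversion can be sketched in Python. This is only an illustration under stated assumptions: the function name convert_tree, the *.php pattern and the ISO-8859-15 source encoding are choices made for this example, and files that already decode as valid UTF-8 (including pure ASCII) are skipped to avoid double-encoding. A legacy file whose bytes happen to form valid UTF-8 would also be skipped, so work on a copy or under version control.

```python
import tempfile
from pathlib import Path


def convert_tree(root, pattern="*.php", source="iso-8859-15"):
    """Re-encode matching files under `root` to UTF-8; return the converted paths."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # pure ASCII or already valid UTF-8: leave untouched
        except UnicodeDecodeError:
            pass
        # Decode with the legacy encoding, then write the same text as UTF-8.
        path.write_bytes(raw.decode(source).encode("utf-8"))
        converted.append(path)
    return converted


if __name__ == "__main__":
    # Hypothetical demo tree with a single legacy-encoded file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "legacy.php"
        sample.write_bytes("<?php echo 'Gr\u00fc\u00dfe';".encode("iso-8859-15"))
        print(convert_tree(tmp))
```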
Watch out when opening the file you want to convert in PhpStorm. In my case all the files were still encoded in ISO-8859 but were opened as UTF-8, which resulted in garbled umlauts. In this case direct conversion to UTF-8 is not possible.
If you encounter this, do the following:
Open the ISO-8859 file
Change file encoding dropdown (lower right corner) to ISO-8859-1 or ISO-8859-15 and choose REOPEN
The garbled characters will now disappear
Then change the encoding again (dropdown lower right corner), this time to UTF-8 and choose CONVERT
Now the file is properly encoded in UTF-8
cheers

Why does the ANSI view differ between Notepad and Notepad++?

I am writing some data to an XML file with ISO-8859 encoding. If I open the file in Notepad++, I can see the 'Â' character that is present in the file. But if I open the file in Notepad, the 'Â' character is missing. I am very new to encodings and don't know why this happens. Please suggest a reason.
The file also shows the 'Â' character when opened in a browser.
Thanks in advance
Windows Notepad is a very basic editor with quite a number of limitations, one of which is its limited support for encodings other than ANSI, Unicode and UTF-8. When editing files in other encodings, it can give unreliable or unexpected results.
If you are handling files in different encodings, you are better off avoiding Notepad altogether and using an editor (such as Notepad++) that has better support for multiple encodings.
For more information on how Windows notepad "guesses" at the correct format to use (with varying levels of success) see here
Bear in mind that other editors often use similar techniques to "guess" the format of a file, so it is often a good idea to check/set the encoding for a file manually (where possible) for less common encoding formats to ensure you get the correct results every time.
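To see why the guess matters, here is a small Python illustration of how the same bytes turn into different text depending on the encoding an editor assumes. It is not a diagnosis of the file in the question: the sample bytes and the helper name decode_with are made up for this sketch.

```python
def decode_with(data, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """Decode the same bytes under several encodings and return the results by name."""
    return {name: data.decode(name) for name in encodings}


if __name__ == "__main__":
    sample = "\u00a3".encode("utf-8")  # a pound sign stored as the two bytes C2 A3
    for name, text in decode_with(sample).items():
        print(name, repr(text))
```

Read as UTF-8 the two bytes are a single pound sign, while under ISO-8859-1 or Windows-1252 the first byte shows up as a separate 'Â' in front of it.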

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop websites in Japan, where there are three different character encodings in common use, the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (ie. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can exclude, by using byte sequences that are invalid for that encoding.
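The point about a lone 0xA3 can be checked with a strict UTF-8 decode. The sketch below uses Python; is_valid_utf8 is a hypothetical helper name, and the byte strings are made-up samples.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # A pound sign on its own (ISO-8859-1 byte 0xA3) is not valid UTF-8.
    print(is_valid_utf8(b"cost: \xa3 5"))
    # The unlucky case: 0xC2 followed by 0xA3 happens to form a valid UTF-8 sequence.
    print(is_valid_utf8(b"\xc2\xa3"))
```

An editor that validates the file strictly has to reject UTF-8 in the first case, which is why a lone £ near the top of the file works as a guard.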
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint what encoding to use with a comment in the first line, eg. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put the character ÿ (0xFF) in the file. It's invalid in UTF-8. BBEdit on Mac correctly identifies it as ISO-8859-1. Not sure how your editor of choice will do.

Working with utf-8 files in Eclipse

Quite a straightforward question. Is there a way to configure Eclipse to work with text files encoded in UTF-8 both with and without the BOM?
So far I've used Eclipse with UTF-8 encoding and it works, but when I try to edit a file that was generated by another editor and includes the BOM, Eclipse doesn't handle it properly: it shows an "invisible character" at the beginning of the file (the BOM). Is there a way to make Eclipse understand UTF-8 encoded files with a BOM?
Both bug 78455 ("Provide an option to force writing a BOM to UTF-8 files") and bug 136854 don't leave much hope for such an option.
The support for encoding in the workspace is based on what is available from Java.
For any given resource in the workspace, it is possible to obtain a charset string that can be used with any Java APIs that take charset strings.
Examples are:
'US-ASCII',
'UTF-8',
'Cp1252',
'UTF-16' (Big Endian, BOM inserted automatically),
'UTF-16BE' (Big Endian, BOM not inserted automatically),
'UTF-16LE' (Little Endian, BOM not inserted automatically).
For Java encodings, except for the 'UTF-16' encoding, BOMs are not inserted (when writing) or discarded (when reading) for free.
Even if this is puzzling to end users, this is how all Java applications work.
If applications want to support creating UTF-8 files with BOMs to match their users' expectations, they need to provide such capability on their own (as neither Java nor the Resources model will help with that).
Eclipse does provide some improvements towards detecting BOMs, but not with generating or skipping them.
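What "providing the capability on their own" amounts to can be sketched briefly. The example below is in Python rather than Java and is only an analogy: Python's plain utf-8 codec also leaves the BOM in place as a leading U+FEFF character, while its utf-8-sig codec writes the BOM and drops it on reading. The helper names write_utf8 and read_utf8 are hypothetical.

```python
import codecs
import tempfile
from pathlib import Path


def write_utf8(path, text, with_bom=False):
    """Write `text` as UTF-8, optionally prefixed with the BOM bytes EF BB BF."""
    encoding = "utf-8-sig" if with_bom else "utf-8"
    Path(path).write_text(text, encoding=encoding)


def read_utf8(path):
    """Read UTF-8 text, dropping a leading BOM if one is present."""
    return Path(path).read_text(encoding="utf-8-sig")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "bom.txt"
        write_utf8(sample, "hello", with_bom=True)
        # A plain UTF-8 read keeps the BOM as an invisible first character.
        print(repr(sample.read_text(encoding="utf-8")))
        # The BOM-aware read returns only the text.
        print(repr(read_utf8(sample)))
```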

Is there a way to get the encoding of a text file in UltraEdit?

Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding that is being used to -display- the file is shown on the right side of the status bar, together with the line-ending type in use, for example "U8-UNIX". You can also manually set the encoding in which the file is displayed. In version 10 this is under the menu View -> Set Code Page. You can also -convert- the actual code page of the file under the menu File -> Conversions.
If the file does not have a BOM header (a couple of bytes at the start of the file indicating the encoding), the -actual- encoding of the file can only be guessed. And even if the file has a BOM header, there can still be encoding issues.
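For reference, checking for a BOM header amounts to comparing the first few bytes of the file against a short list of known signatures. The following Python sketch shows the idea; the function name sniff_bom and the returned labels are assumptions made for this example, and it says nothing about how UltraEdit itself does it.

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE, because its BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    print(sniff_bom(b"\xef\xbb\xbfhello"))
    print(sniff_bom(b"hello"))
```

A result of None is the case where the editor has to fall back to guessing from the content.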
All text editors guess the encoding when there is no BOM, and some are better at it than others. I haven't done a comparison to see which is best at it. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in text files of 1000 lines or more if the first UTF-8 character only appears later in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.