Should code files have a BOM or not?

I am starting out on a new project, and I want to know what encoding I should choose for my code files.
I am writing a web app in Python + JavaScript + HTML + CSS on Linux, and my code editors Notepad++ and KomodoEdit give me some options:
Encode in ANSI
Encode in UTF-8 without BOM
Encode in UTF-8
I am not sure which encoding I should choose.

The answer could depend upon the operating system and the versions of software (in particular, the version of Python) you are using, but I would choose "Encode in UTF-8 without BOM".
In practice I would avoid using non-ASCII characters in source code. If you need them, use them only in string literals and in comments.
Avoid having them in identifiers.
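To make the three editor options concrete, here is a small Python sketch of the bytes each one would write for the same text. It assumes that "ANSI" means Windows code page 1252 (true only on a Western install) and that plain "Encode in UTF-8" means UTF-8 with a BOM; the sample string is made up for illustration.

```python
import codecs

text = "café"  # hypothetical sample with one non-ASCII character

ansi = text.encode("cp1252")           # assumption: "ANSI" is cp1252 here
utf8_no_bom = text.encode("utf-8")     # "UTF-8 without BOM"
utf8_bom = text.encode("utf-8-sig")    # Python's name for UTF-8 with a BOM

print(ansi)         # b'caf\xe9'
print(utf8_no_bom)  # b'caf\xc3\xa9'
print(utf8_bom)     # b'\xef\xbb\xbfcaf\xc3\xa9'

# The BOM is just three extra bytes in front of the same UTF-8 data.
assert utf8_bom == codecs.BOM_UTF8 + utf8_no_bom
```

One practical reason to prefer the BOM-less form on Linux: a script's `#!` line has to be the very first bytes of the file, and a BOM would sit in front of it.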

Better to use UTF-8, as it's a web application; UTF-8 is supported by almost all browsers and operating systems.

Related

Unicode vs. UTF-8

I believe Windows currently defaults to UTF-16 for “Unicode”, but that this may not be the case in the future.
For this reason, would it be better to use
[System.Text.Encoding]::UTF8.GetString($someByteArray)
instead of the following?:
[System.Text.Encoding]::Unicode.GetString($someByteArray)

this may not be the case in the future.
Unicode isn't a potentially-variable encoding; it's just Microsoft's (sadly misleading) name for UTF-16LE.
It isn't going to change. Even if Microsoft moved towards implementing Windows APIs natively in UTF-8 or UTF-32 (something there's no sign of ever happening), System.Text.Encoding.Unicode would have to remain UTF-16LE as that is how it is defined by the .NET specification.
would it be better to use UTF8 instead of Unicode?
Use UTF8 if the byte array contains UTF-8-encoded bytes, and use Unicode if they are in UTF-16LE.
If you get to choose what encoding is used to store data at rest, UTF-8 is usually the better choice for space efficiency reasons.
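The same rule can be sketched in Python, where the `utf-16-le` codec plays the role of .NET's `Unicode` and `utf-8` the role of `UTF8`. This is only an analogue of the PowerShell calls above, and the sample string is made up.

```python
text = "naïve text"  # hypothetical sample

as_utf8 = text.encode("utf-8")
as_utf16le = text.encode("utf-16-le")

# Decoding works only with the encoding that produced the bytes.
assert as_utf8.decode("utf-8") == text
assert as_utf16le.decode("utf-16-le") == text

# Picking the wrong decoder does not give the text back.
wrong = as_utf16le.decode("utf-8", errors="replace")
assert wrong != text

# For mostly-ASCII text, UTF-8 is also about half the size.
print(len(as_utf8), len(as_utf16le))  # 11 20
```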

First, yes, Windows defaults to UTF-16. Personally I would use UTF-8, because most of the applications I write have to communicate with Linux applications or some form of HTTP, so UTF-8 is more likely.
Besides, even if all your code is used with Microsoft systems, it's easy to convert to UTF-8, and a simple substitute regular expression could change everything over to Unicode (UTF-16) if .NET started requiring it.

What happens if you set your integration package to Unicode?

I'm importing data from flat files (text files). I do not know which encoding they will use; it may be Unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues? I am using SSIS 2012.

What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly call “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files; you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark on the front of the data that usually allows you to make a good guess, but otherwise you're on your own.
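The "look for a Byte Order Mark" guess can be sketched in a few lines of Python. This is not how SSIS does it; the function name `guess_encoding_from_bom` is hypothetical, and the returned names are Python codec names that consume the BOM when decoding.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = "hello".encode("utf-16")          # Python's "utf-16" codec writes a BOM
print(guess_encoding_from_bom(sample))     # utf-16
print(guess_encoding_from_bom(b"hello"))   # None: no BOM, so you're on your own
```

It remains a guess: a file with no BOM returns `None`, and a UTF-16LE file whose text starts with a NUL character would be mistaken for UTF-32.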

What charset to use to store Russian text in JavaScript files as an array

I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static JS files for each language pairing of English to ___ etc.
I am now starting to work on Russian. I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as utf-8 or ISO-8859-1 but neither seems to get it to display properly.
Any suggestions?

Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.

By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (Notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. ColdFusion is quite agnostic in that regard, as it will happily use any JDBC driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.

I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (With or without a BOM, it matters not for Russian.) In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor, which is most things on Mac. On Windows you could use SciTE or GVim.

The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
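ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar in purpose to Windows code page 1251 (though the two put the Cyrillic letters at different byte values, so they are not interchangeable). cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;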
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
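Since the files are generated, the escapes do not have to be typed by hand. The poster's ColdFusion code is not shown, so the following is only a Python analogue: it relies on `json.dumps`, which by default writes every non-ASCII character as a `\uXXXX` escape, and the dictionary contents are taken from the example above.

```python
import json

translations = {"launchMyCalendar": "Запуск Мой календарь"}

# ensure_ascii=True is the default; it is spelled out because it is the point.
literal = json.dumps(translations, ensure_ascii=True)
js_source = "var translation_ru = " + literal + ";"

print(js_source)

# The generated file is pure ASCII, so the page's encoding no longer matters.
assert js_source.isascii()
```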
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like the file was saved as ISO-8859-5 (those bytes, read as cp1252, come out as exactly this string) and is then being read on a Western server where the default codepage is cp1252.
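The garbled string can be checked in Python by encoding the Russian text with one code page and decoding it with another. A quick sketch, assuming the phrase from the escape example above is the original text: ISO-8859-5 bytes read as cp1252 reproduce the question marks' replacement exactly, whereas cp1251 bytes would come out differently.

```python
original = "Запуск Мой календарь"

# Saved as ISO-8859-5, then misread as cp1252.
garbled = original.encode("iso-8859-5").decode("cp1252")
print(garbled)  # ·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì

# For comparison: saved as cp1251, then misread as cp1252.
from_cp1251 = original.encode("cp1251").decode("cp1252")
print(from_cp1251)  # starts with Çàïóñê

# No bytes were lost, so reversing the two steps recovers the text.
assert garbled.encode("cp1252").decode("iso-8859-5") == original
```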
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?

Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it is nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all “ANSI” text files that contain non-ASCII characters such as accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
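Both points can be seen from the raw bytes. A short Python sketch, with sample strings chosen only for illustration:

```python
text = "Hello"

# Pure-ASCII text is byte-for-byte identical in ASCII and UTF-8...
assert text.encode("utf-8") == text.encode("ascii")

# ...but not in UTF-16, where every ASCII letter drags a zero byte along.
print(text.encode("utf-16-le"))  # b'H\x00e\x00l\x00l\x00o\x00'

# Beyond the 16-bit range, one code point needs two UTF-16 code units.
emoji = "\U0001F600"
assert len(emoji) == 1                        # one code point
assert len(emoji.encode("utf-16-le")) == 4    # two 16-bit code units
```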
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.

If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake

It is very important, since your tool of choice will show wrong characters if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

Working with utf-8 files in Eclipse

Quite a straightforward question. Is there a way to configure Eclipse to work with text files encoded in UTF-8 with and without the BOM?
So far I've used Eclipse with UTF-8 encoding and it works, but when I try to edit a file generated by another editor that includes the BOM, Eclipse doesn't handle it properly: it 'shows an invisible character' at the beginning of the file (the BOM). Is there a way to make Eclipse understand UTF-8 encoded files with a BOM?

Both bug 78455 ("Provide an option to force writing a BOM to UTF-8 files") and bug 136854 don't leave much hope for such an option.
The support for encoding in the workspace is based on what is available from Java.
For any given resource in the workspace, it is possible to obtain a charset string that can be used with any Java APIs that take charset strings.
Examples are:
'US-ASCII',
'UTF-8',
'Cp1252',
'UTF-16' (Big Endian, BOM inserted automatically),
'UTF-16BE' (Big Endian, BOM not inserted automatically),
'UTF-16LE' (Little Endian, BOM not inserted automatically).
For Java encodings, except for the 'UTF-16' encoding, BOMs are not inserted (when writing) or discarded (when reading) for free.
Even if this is puzzling to end users, this is how all Java applications work.
If applications want to support creating UTF-8 files with BOMs to match their users' expectations, they need to provide such capability on their own (as neither Java nor the Resources model will help with that).
Eclipse does provide some improvements towards detecting BOMs, but not with generating or skipping them.
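The "invisible character" from the question is what you get whenever a plain UTF-8 decoder reads a file that starts with a BOM. This is not Eclipse or Java code, only a Python analogue: Python's `utf-8` codec likewise leaves the BOM in place, while its separate `utf-8-sig` codec discards it. The file contents are made up.

```python
import codecs

# Bytes as written by an editor that adds a BOM.
data = codecs.BOM_UTF8 + "first line\n".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as the character U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is consumed by the decoder

assert plain[0] == "\ufeff"
assert stripped == "first line\n"
```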
