I am writing some data to an XML file with ISO-8859 encoding. If I open the file in Notepad++, I can see the 'Â' character that is present in the file. But if I open the same file in Notepad, the 'Â' character is removed. I am very new to encodings, so I don't know why. Please suggest a reason for this.
The file also opens in a browser with the 'Â' character displayed.
Thanks in Advance
Windows Notepad is a very basic editor and has quite a number of limitations, one of which is its limited support for encoding formats other than ANSI, Unicode (UTF-16) and UTF-8. When editing files in other formats, it can give unreliable or unexpected results.
If you are handling files in different encoding formats, you are better off avoiding notepad altogether and using an editor (such as Notepad++) which has better support for multiple encoding formats.
For more information on how Windows Notepad "guesses" at the correct format to use (with varying levels of success), see here.
Bear in mind that other editors often use similar techniques to "guess" the format of a file, so it is often a good idea to check/set the encoding for a file manually (where possible) for less common encoding formats to ensure you get the correct results every time.
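As an aside, if you are the one generating the XML, it helps to make sure the bytes you write and the encoding declared in the XML prolog actually agree, so that tools which honour the declaration (browsers, XML-aware editors) interpret the file consistently. A minimal Java sketch, assuming ISO-8859-1 and a placeholder file name:

    import java.io.Writer;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WriteLatin1Xml {
        public static void main(String[] args) throws Exception {
            Charset latin1 = Charset.forName("ISO-8859-1");
            // Write the bytes in ISO-8859-1 and declare the same encoding in the prolog,
            // so readers that honour the declaration decode 'Â' (byte 0xC2) correctly.
            try (Writer out = Files.newBufferedWriter(Paths.get("data.xml"), latin1)) {
                out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n");
                out.write("<root>\u00C2</root>\n"); // \u00C2 is 'Â'
            }
        }
    }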
I'm looking at an XLIFF file and found some weird boxes which I can't identify. (Please see screenshot.)
Do you have any idea what these weird boxes are?
Thank you very much and I'm looking forward to your reply!
I have never seen that character, but here is how I would go about finding out what it is:
The first thing to do is to check the source and target language of the XLIFF file, which should be defined in the XLIFF header. Perhaps this character is a valid character in either the source or the target language script.
The next step depends on whether you can contact the person who created the XLIFF file. If yes, you can show them what the file looks like for you and ask them if the file has perhaps been garbled during transmission.
If not, you could check the encoding of the XLIFF file. If it is UTF-16, just open the file in a hex editor, find the code point for this character, and look it up on unicode.org. If the file is encoded as UTF-8 open it in Notepad++ (or any other text editor that allows you to change the encoding), convert it to UTF-16, then proceed as described above.
If you don't know the encoding of the file it becomes a matter of guessing. You can look at some other <trans-unit> elements (assuming that there are more than this one in your XLIFF file): if they contain other extended characters and they are displayed correctly, your editor has probably guessed the right encoding, and you can convert to Unicode and look up the character code. Different text editors have different ways of guessing encodings: try a few.
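If you'd rather not use a hex editor, you can also dump the suspicious code points programmatically and look them up on unicode.org. A rough Java sketch; the file name and the UTF-8 assumption are placeholders, so substitute the encoding the file actually uses:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ListNonAsciiCodePoints {
        public static void main(String[] args) throws Exception {
            // Assumes the XLIFF is UTF-8; change the charset if it declares something else.
            String text = new String(Files.readAllBytes(Paths.get("file.xlf")), StandardCharsets.UTF_8);
            text.codePoints()
                .filter(cp -> cp > 127)   // keep only non-ASCII characters
                .distinct()
                .forEach(cp -> System.out.printf("U+%04X %s%n", cp, new String(Character.toChars(cp))));
        }
    }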
It's possible that those characters are the result of an encoding conversion error; such garbled characters are commonly called mojibake.
It's also possible this is some sort of emoji or unusual glyph that's not rendering correctly in your editor. This would be unusual, but given that it appears to be a UI string, it might be possible.
I have a project with lots of files in ISO-8859-15, and I need to convert them to UTF-8. If I change one file, it asks "Do you want to convert - plaplapla"; if I say yes, the important symbols don't turn into ???.
However, since the number of files in my project is HUGE, I cannot do that one by one. Changing the encoding from the project settings might switch the encoding to UTF-8, but then all the symbols become ??? (so no real conversion takes place).
So, how can I tell PhpStorm to convert all files to UTF-8? Is it possible, and if so, how? What is the alternative method?
AFAIK it's not possible to do this for a whole folder at a time, but it can be done for multiple files (e.g. all files in a certain folder):
Select desired files in Project View panel
Use File | File Encoding
When asked -- make sure you choose "convert" and not just "read in another encoding".
You can repeat this procedure for each subfolder (still much faster than doing this for each file individually).
Another possible alternative is to use something like iconv (or any other similar tool) and do it in terminal/console.
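If iconv is not available, a small script can do the same job. Here is a rough Java sketch that walks a directory and rewrites every regular file from ISO-8859-15 to UTF-8; the "src" path is a placeholder, it assumes every file under it really is ISO-8859-15, and it overwrites files in place, so back up first:

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class ConvertTreeToUtf8 {
        public static void main(String[] args) throws IOException {
            Charset from = Charset.forName("ISO-8859-15");
            List<Path> files;
            try (Stream<Path> walk = Files.walk(Paths.get("src"))) {
                files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            for (Path p : files) {
                // Decode with the old charset, then write the same text back as UTF-8.
                String text = new String(Files.readAllBytes(p), from);
                Files.write(p, text.getBytes(StandardCharsets.UTF_8));
            }
        }
    }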
Watch out when opening the file you want to convert in PhpStorm. In my case all the files were still encoded in ISO-8859 but were opened as UTF-8, resulting in garbled umlauts. In that case a direct conversion to UTF-8 is not possible.
If you encounter this, do the following:
Open the ISO-8859 file
Change the file encoding dropdown (lower right corner) to ISO-8859-1 or ISO-8859-15 and choose REOPEN
The garbled characters will now display correctly
Then change the encoding again (dropdown in the lower right corner), this time to UTF-8, and choose CONVERT
Now the file is properly encoded in UTF-8
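In other words, REOPEN means decoding the existing bytes with the charset they were really written in, while CONVERT means re-encoding the decoded text; converting without reopening first just bakes the damage in. A small Java illustration (the byte value is only an example, an ISO-8859-1/-15 'ä'):

    import java.nio.charset.StandardCharsets;

    public class ReopenThenConvert {
        public static void main(String[] args) {
            byte[] isoBytes = { (byte) 0xE4 }; // 'ä' in ISO-8859-1/-15

            // Wrong: interpreting the ISO-8859 bytes as UTF-8 garbles the text,
            // and any later conversion starts from the garbled text.
            String garbled = new String(isoBytes, StandardCharsets.UTF_8);

            // Right: "reopen" = decode with the real charset first...
            String text = new String(isoBytes, StandardCharsets.ISO_8859_1);
            // ...then "convert" = re-encode that text as UTF-8 (bytes 0xC3 0xA4).
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

            System.out.println(garbled + " vs " + text + " (" + utf8.length + " UTF-8 bytes)");
        }
    }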
cheers
I created a file with UTF-8 encoded content (using PHP fputcsv).
When I open this file in Notepad++, the characters are wrong (Notepad++ opens it as ANSI).
When I choose Format -> "Encode in UTF-8" from the menu, everything is fine.
I'm worried that Notepad++ ought to be able to recognize the encoding somehow, and that maybe something is wrong with my file created with fputcsv? The first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to The Unicode Standard the use of a BOM is neither required nor recommended for UTF-8.
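If you do decide to experiment with a BOM, it is just the three bytes EF BB BF written before the UTF-8 content; the idea is the same in any language, although whether every consumer of your CSV tolerates it is something you'd have to verify. A minimal Java sketch with a placeholder file name and content:

    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class WriteUtf8WithBom {
        public static void main(String[] args) throws Exception {
            byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }; // UTF-8 BOM
            try (OutputStream out = Files.newOutputStream(Paths.get("data.csv"))) {
                out.write(bom);                                                          // BOM first...
                out.write("id;caf\u00E9;na\u00EFve\n".getBytes(StandardCharsets.UTF_8)); // ...then the UTF-8 content
            }
        }
    }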
You have to check the lower right corner of the Notepad++ GUI to see the actual encoding that is being used. The problem is not really specific to Notepad++: guessing the right encoding is a hard problem with no real solution, so it's better to let the user decide which encoding is the most appropriate in each case.
When you want to reflect the encoding of a text file in a Java program, you have to consider two things: the encoding and the character set. When you open a text file, you see the encoding under the "Encoding" menu. Additionally, look at the character set menu item: under "Eastern European" you will find "ISO 8859-2", and under "Central European", "Windows-1250". You can set the corresponding encoding in the Java program by looking it up in the table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250" the table suggests the Java encoding "Cp1250". Set that encoding and the characters will display properly in your program.
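For instance, a minimal Java sketch that reads such a file with the matching encoding (the file name is just a placeholder):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public class ReadWindows1250 {
        public static void main(String[] args) throws Exception {
            // "Cp1250" is the java.io/java.lang name from the table; "windows-1250" is the java.nio alias.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream("input.txt"), "Cp1250"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }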
How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters like accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
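Both points, the embedded zero bytes and the size difference, are easy to see by encoding the same string both ways; a small Java illustration:

    import java.nio.charset.StandardCharsets;

    public class Utf8VersusUtf16 {
        public static void main(String[] args) {
            String s = "Hello, \u00E9"; // "Hello, é"
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);     // ASCII chars stay as single non-zero bytes
            byte[] utf16 = s.getBytes(StandardCharsets.UTF_16LE); // every ASCII char gets a 0x00 companion byte
            System.out.println("UTF-8:  " + utf8.length + " bytes");  // 9
            System.out.println("UTF-16: " + utf16.length + " bytes"); // 16
        }
    }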
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
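For a concrete taste of what that looks like, bytes written as UTF-8 and then read back with a single-byte legacy encoding turn 'é' into 'Ã©'. A tiny Java sketch:

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);        // 'é' as UTF-8: 0xC3 0xA9
            String mangled = new String(utf8, StandardCharsets.ISO_8859_1); // decoded with the wrong charset
            System.out.println(mangled);                                    // prints "Ã©"
        }
    }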
It is very important, since your tool of choice will show wrong characters if you use the wrong encoding. Try loading a Cyrillic file in Notepad without using UTF-8 or similar and watch a lot of "?" come up. :)
Is there a setting in UltraEdit that allows me to see the encoding of the file?
In UltraEdit, the encoding that is being used to -display- the file is shown in the status bar at the right somewhere, together with the line-ending type in use, for example "U8-UNIX". You can also manually set the encoding in which the file is to be displayed. In version 10 this is under menu View -> Set Code Page. You can also -convert- the actual codepage of the file under menu File -> Conversions.
If the file does not have a BOM header (a couple of bytes at the start of the file indicating the encoding), the -actual- encoding of the file can only be guessed. And even if the file has a BOM header, there can still be encoding issues.
All text editors do this, and some are better at it than others. I haven't done a comparison to see which is best at it. At the moment (2012), I know UltraEdit fails to detect UTF-8 and other variants in 1000-line (or longer) text files if the first UTF-8 character only appears later in the document. It also fails to show the encoding properly when you set it manually.
Notepad++ is also not great at detecting it, but when you know the encoding, you can set it manually.
Sublime Text is, as far as I know, best at detecting the encoding, also in large files.
I think there are also some very good command line tools out there, ported from GNU to Windows, to detect encoding. My bet would be that that's going to be the best option.
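If you want a quick check of your own, sniffing the first few bytes for a BOM is easy enough, although, as noted above, it only tells you something when a BOM is actually present. A rough Java sketch:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class BomSniffer {
        public static void main(String[] args) throws IOException {
            byte[] b = Files.readAllBytes(Paths.get(args[0]));
            String guess = "no BOM, encoding has to be guessed";
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
                guess = "UTF-8 with BOM";
            } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
                guess = "UTF-16 little-endian (or UTF-32 LE)";
            } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
                guess = "UTF-16 big-endian";
            }
            System.out.println(guess);
        }
    }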