Hi, I am creating a file like so:
this.Stream = File.Create( this.FileName );  // this.Stream holds the FileStream
Then I put data in the file like so:
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this. MSDN says that the default encoding when creating a file this way is UTF-8. When I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF-8" (or perhaps it's an ANSI file that Notepad++ is reading as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also, please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
As far as I know, "ANSI" in this context effectively means a single-byte code page that agrees with US-ASCII for its first 128 characters.
If there are no characters in the file outside the ASCII character set, then the file is valid ASCII and valid UTF-8, and there is no way to distinguish the two. So your program can write it as UTF-8, and any other program would be correct in seeing it as ASCII (ANSI), just as it would be in seeing it as UTF-8.
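If the files you send out need a stable, detectable encoding, one option is to pin it explicitly instead of relying on the default. Below is a minimal C# sketch along those lines; the file name is made up, and passing new UTF8Encoding(true) to the StreamWriter writes a UTF-8 BOM so that editors label the file UTF-8 even when its content happens to be pure ASCII.

using System.IO;
using System.Text;

// A minimal sketch, assuming a made-up file name.
// UTF8Encoding(true) emits a UTF-8 byte order mark; UTF8Encoding(false) writes UTF-8 without one.
class ExplicitEncodingExample
{
    static void Main()
    {
        using (FileStream stream = File.Create("output.txt"))
        using (StreamWriter writer = new StreamWriter(stream, new UTF8Encoding(true)))
        {
            // With the BOM present, editors report UTF-8 even for ASCII-only content.
            writer.WriteLine("é");
        }
    }
}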
I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv files, exported from MS SQL by a business partner.
I asked him to encode them as UTF-8 (just using the standard, I thought).
Now I can see in his files that a lot of UTF-8 codes such as "& # 2 3 3 ;" (without the spaces) show up as plain text when I open the files in Notepad++ (and also in my "ETL" tool).
Before I ask him to fix it and deliver proper UTF-8, I would like to understand the issue, and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open the files in Notepad++, and not as plain-text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
"&#233;" is an HTML entity (the numeric character reference for é). For some reason the text is HTML-formatted, which I wouldn't count as "plain text"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we can't tell from the information given.
A file containing "special characters" (meaning non-ASCII characters), encoded in UTF-8 and opened in a text editor that correctly interprets it as UTF-8, looks exactly like the text is supposed to look, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
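To make the difference concrete: if the export really does contain numeric entities such as &#233;, they can be turned back into real characters after the fact. Here is a hedged C# sketch (the file names are hypothetical, and the cleaner fix is still a proper export on the partner's side):

using System.IO;
using System.Net;
using System.Text;

class EntityDecodeSketch
{
    static void Main()
    {
        // Hypothetical path to the delivered CSV.
        string path = "partner_export.csv";

        // ASCII-only content reads the same whether treated as ANSI or UTF-8.
        string raw = File.ReadAllText(path, Encoding.UTF8);

        // Turn numeric HTML entities like &#233; back into the characters they stand for.
        string decoded = WebUtility.HtmlDecode(raw);

        // Save the result as genuine UTF-8 text (no BOM).
        File.WriteAllText("partner_export.decoded.csv", decoded, new UTF8Encoding(false));
    }
}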
Yesterday I wrote some text in a Notepad file which was full of Unicode characters and saved the file as ANSI. Notepad gave me some warning, which I clicked OK on without reading it fully, and closed Notepad.
Today when I opened the same file in Notepad again, it was full of ??? signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to get this text back, maybe using a hex editor or so?
No. Certain characters simply cannot be encoded in certain encodings. "風" cannot be encoded at all in ISO-8859 or any other single-byte encoding, for example. Each ANSI encoding can only represent a certain subset of all possible characters; characters that are not defined in a particular ANSI code page simply cannot be stored in it.
So they're gone. You'd better pull out a backup.
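A small sketch of why the text is unrecoverable: encoding to an ANSI code page substitutes '?' for every character the code page cannot represent, and that substitution is one-way. Windows-1252 is used here as a typical "ANSI" code page; this is illustrative only.

using System;
using System.Text;

class LossySaveSketch
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the Windows code pages come from the
        // System.Text.Encoding.CodePages package and must be registered first;
        // on the classic .NET Framework GetEncoding(1252) works out of the box.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        // "風" does not exist in Windows-1252, so the default fallback writes '?'.
        byte[] bytes = ansi.GetBytes("風");
        Console.WriteLine(ansi.GetString(bytes)); // prints "?" -- the original character is gone
    }
}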
I have tried a program called UTFCast Professional, which checks the file encoding.
When I write code I use Sublime Text.
Random encoding
What I get is that some files are UTF-8 and some files are ASCII/UTF-8. It appears to be random. All of them are reported as "BOM: No".
Why are some files UTF-8 and some ASCII/UTF-8?
Is it possible that in some cases the tool cannot tell whether a file is ASCII or UTF-8?
Should I be worried about future encoding problems? I haven't had any so far.
(I prefer UTF-8.)
A plain text file does not in any way save what encoding it's in anywhere. Any program that purportedly tells you what encoding a file is in is by definition only giving you its best guess based on the content of the file. Now, since a file which contains only characters which are present in ASCII and is saved as UTF-8 is indistinguishable from a pure ASCII file, either answer is valid. Even Latin-1 and a large number of other answers would be valid.
So the answer to why that program seemingly randomly outputs one or the other is that its detection algorithm triggers on some characteristic of the file content; only the program's author can tell you exactly why. The file is encoded as UTF-8 without a BOM. Whatever any application tells you it thinks the encoding is, is entirely up to that application.
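For what it's worth, the guess such a tool makes can be approximated in a few lines. A rough C# sketch (the file name is made up), using the same three buckets the question mentions: pure ASCII, valid UTF-8 without BOM, or something else:

using System;
using System.IO;
using System.Linq;
using System.Text;

class EncodingGuessSketch
{
    // Best-effort classification only; a pure-ASCII file is equally valid as UTF-8, Latin-1, etc.
    static string Guess(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);

        if (bytes.All(b => b < 0x80))
            return "ASCII (also valid UTF-8)";

        try
        {
            // throwOnInvalidBytes = true makes decoding fail on malformed UTF-8 sequences.
            new UTF8Encoding(false, true).GetString(bytes);
            return "UTF-8 (no BOM)";
        }
        catch (DecoderFallbackException)
        {
            return "not valid UTF-8; probably some single-byte code page";
        }
    }

    static void Main()
    {
        Console.WriteLine(Guess("some-source-file.js"));
    }
}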
I am working with a text file that contains lots of deformed strings such as:
VyplÅ<88>te prosÃm pole "jméno
My editor says that the file encoding is Latin1. The string is supposed to be a Czech sentence containing some diacritics, so no wonder it is displayed wrong. I have tried to force UTF-8 and Latin2 encodings in my editor, but that did not help. I have also tried to use iconv to convert the file from Latin1 to UTF-8 or Latin2, but neither helped. I quite often encounter issues like this, and I don't know any solution other than manually rewriting the strings. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above really is correct UTF-8, as deceze said. But I have just found out something strange. If I try to transcode the file from UTF-8 to UTF-8 (with iconv), I get an error on the word Postgebühr, at the character ü. Looking at the hex dump, this character is represented as \xfc (252 in decimal), which is a valid Latin1 encoding of ü but a completely invalid UTF-8 byte sequence. It seems that part of the file is in Latin1 and another part in UTF-8. Here is the part of the file that is (probably) in Latin1:
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
As I look into this more, this part does not even seem to be valid Latin1, because even interpreted as Latin1 it is garbled ("DE: oplátk Ã" instead of what was probably "DE: oplatky za"). This part of the file seems to contain some damaged text.
I can't understand how encoding in this file could have got mixed up like that. Any ideas?
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.
The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin* encodings are interchangeable at the byte level: any sequence of bytes is "valid" in each of them, it just decodes to different characters. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
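You can check that yourself by decoding the dumped bytes as UTF-8; a quick C# sketch with the byte values copied from the hex dump above:

using System;
using System.Text;

class HexDumpDecodeSketch
{
    static void Main()
    {
        // Bytes around "Vypl..te pros..m" from the dump: 0xC5 0x88 is U+0148 (ň)
        // and 0xC3 0xAD is U+00ED (í), both well-formed UTF-8 sequences.
        byte[] bytes =
        {
            0x56, 0x79, 0x70, 0x6c, 0xc5, 0x88, 0x74, 0x65, 0x20,
            0x70, 0x72, 0x6f, 0x73, 0xc3, 0xad, 0x6d
        };

        Console.WriteLine(Encoding.UTF8.GetString(bytes)); // prints "Vyplňte prosím"
    }
}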
iconv is the way to go, but you must know the correct encoding. Latin2 (ISO-8859-2) is only one of the possibilities, since there have been many ASCII extensions in use across Europe. What language is this supposed to be in?
My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My script file itself is encoded in UTF-8.)
But if I insert an accented character in my print, then it works and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately, ASCII is still perfectly valid UTF-8. However, auto-detection by editors will most likely not report UTF-8 as the encoding, because there is nothing in the file content to judge the encoding by. I guess you simply don't know how file encodings work.
A file's encoding is a property that in general is not stored in the file or externally alongside it. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail, because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read or write it, using binmode or the corresponding I/O layer.
Now, there is one way of marking a text file's encoding if that encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE): the BOM (byte order mark). However, producing files with a BOM dates from a time when UTF-8 was not as widespread as it is today. It usually poses more and different problems than it solves, especially because many editors and applications do not handle BOMs at all. Therefore BOMs should probably be avoided nowadays.
There are exceptions, of course, in which the file format itself contains instructions about the file's encoding. XML comes to mind, with the encoding attribute of its XML declaration. However, even for such files you first have to recognize whether the file uses a multi-byte encoding of at least two bytes per character (UTF-16/UTF-32) or not, just to be able to parse that declaration in the first place. It's simply not simple ;)
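To make the BOM idea concrete, here is a small sketch of what checking for one amounts to, written in C# for consistency with the first question above; the file name is borrowed from the Perl example, purely as an illustration. (On the Perl side, printing the character \x{FEFF} first through the :utf8 layer would produce the same three leading bytes EF BB BF.)

using System;
using System.IO;

class BomSniffSketch
{
    // Returns a label for any BOM at the start of the file, or null if there is none.
    static string SniffBom(string path)
    {
        byte[] b = new byte[4];
        int n;
        using (FileStream fs = File.OpenRead(path))
            n = fs.Read(b, 0, 4);

        // Check the longer signatures first, since FF FE alone also matches UTF-16 LE.
        if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return null; // no BOM: UTF-8 without BOM, plain ASCII, or a single-byte code page
    }

    static void Main()
    {
        Console.WriteLine(SniffBom("utf8.out.2.txt") ?? "no BOM found");
    }
}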