I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in java. I'm using universalchardet but it did not detect the encoding from the first 1,000 bytes of the file.
Within these 1,000 first bytes, there is a sequence that should be read as Boîte, however I cannot find the correct encoding to use to convert this file to UTF-8 strings in java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t and e are using the standard latin encoding with 1 byte per character. The î is also encoded on only one byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However I looked at the extended-ASCII charset, the value 0x94 stands for ö.
Have you got other ideas for guessing the encoding of that file?
This was Mac OS Roman encoding. When using the following java code, the text was properly decoded:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
I don't know how to delete my own question. I don't think it is useful anymore...
Related
I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint or maybe links to websites that explain what meaningful letters I should get out of that would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Игорќ"; // Source encoding UTF-8
byte[] b = s.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
// Игорќ
The data might easily get corrupted. Above I got some results with Windows-1252 (MS Windows Latin-1). The java source must be compiled with encoding UTF-8 to accept those chars.
Since you already pasted the original code into a UTF-8 encoded site as Stack Overflow your code is now corrupt data perfectly encoded as UTF-8. If you want to ask yourself anything about the data encoding you need to use an hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
I am just working with a text file, that contains lots of deformed strings such as:
VyplÅ<88>te prosÃm pole "jméno
My editor says that the file encoding is latin1. The string is supposed to be a czech sentence that contains some diacritics so no wonder it is displayed wrong. I have tried to force utf8 and latin2 encodings in my editor but that did not help. I have also tried to use iconv to convert the file from latin1 to utf8 or latin2 but neither that helped. I quite often encounter issues likes this and I don't know any other solution than to manually rewrite the strings. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above is really correct utf8 as deceze have said. But I have just found out some strange thing. If I try to transcode the file from utf8 to utf8 (with iconv), I get an error on a word: Postgebühr at character ü. If I look at hex dump, this character is represented as \xfc (252 in decimal), which is valid latin1 byte encoding for ü but completely invalid utf8 byte encoding. It seems that part of the file is in latin1 and another part in utf8. Here is part of the file that is in latin1 (probably):
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
As I look into this more, this even does not seem to be valid latin1 cause even in latin1 it is garbled (DE: oplátk à instead of probably DE: oplatky za). This part of the file seems to contain some damaged text.
I can't understand how encoding in this file could have got mixed up like that. Any ideas?
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.
The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin* encodings all look identically interchangeable on a byte level. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
Iconv is the way to go, but you must know the correct enconding. Latin2 (iso8859-2) is only one of the possibilities, since there were many ascii extensions in Europe. What language is this supposed to be in?
I am setting up Perl application . I am getting this error. "utf8 "\x96" does not map to Unicode at"
Can anybody let me know the cause and solution. Am i missing any configuration or it is my installation problem ?
Following is the code :
open(FILE,"<:encoding(UTF-8)",$self->{BASEDIR}.$self->{FILENAME}) || die "could not open file $basedir$filename - $!";
The character 0x96 is not a valid UTF-8 encoding. There is a block of code points just above 0x80 that, in UTF-8, encodes the start of a 2- or 3-byte character.
The input you are reading must not be UTF-8, and is most likely Latin1 or CP1252.
You will need to convert the input data to UTF-8, however one does that in Perl (it's been a long time since I did any Perl and it didn't use UTF-8 by default when I was writing Perl :-)
I suspect that something you believe to be encoded in UTF-8 is not, in fact, encoded with UTF-8.
Just putting this info out there in case it helps someone in the future.
If you're working with a Microsoft product, this can be caused by non-US characters (European, Chinese, etc). For instance, if someone sends you an excel spreadsheet of data that you need to process and it's saved in .csv format, those characters can be outside of the utf-8 range if it wasn't saved properly.
Fortunately, at least in Excel for Mac v. 15, it is possible to take that data and "save as" specifically a CSV UTF-8 file - it's in the list of options. This is a separate option from the other CSV file option. This will convert non-US characters into the UTF-8 charset and solve this issue.
I'm adding data from a csv file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. file says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character converting as a yen symbol?
I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
More generally, is there a tool where
you can specify (i) string (ii) known
character, and find out the encoding?
You can easily write one in Python.
(Examples use 3.x syntax.)
import encodings
ENCODINGS = set(encodings._aliases.values()) - {'mbcs', 'tactis'}
def _decode(data, encoding):
try:
return data.decode(encoding)
except UnicodeError:
return None
def possible_encodings(encoded, decoded):
return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}
So if you know that your bullet point is U+2022, then
>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
You could try
iconv -f latin1 -t utf8 data_clean.csv
if you know it is indeed iso-latin-1
Although in iso-latin-1 \xA5 is indeed a ¥
Edit: Actually this seems to be a problem on Mac, using Word or similar and Arial (?) and printing or converting to PDF. Some issues about fonts and what not. Maybe you need to explicitly massage the file first. Sounds familiar?
http://forums.quark.com/p/14849/61253.aspx
http://www.macosxhints.com/article.php?story=2003090403110643
HI, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this. MSDN says that the default encoding for creating a file this way is UTF8. And when I write a french character such as é Textpad interprets the file as a UTF 8 file, but notepad++ says it's "ANSI as UTF8" or maybe it's an ansi file but is reading it as UTF8. When I create a file the same way without the french character both textpad and notepad++ read the file as an ansi file even though according to msdn it should be a utf 8 file still.
Which program should be trusted. Notepad++ or textpad - Notepad++ seems to be more consistant, but is still the oppossite to what MSDN says it should be. My problem is that we create files that get sent off to another company and depending on whether there are french characters the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file. I've read about byte order marks and preambles but as far as I understand neither are guaranteed to be there.
We initially thought that all the files we were building were ansi. Also please note that both ansi and utf8 should handle the french characters appropriately as the characters are part of both character sets.
as far as i know, "ansi" character encoding is another name for ascii-us.
if there are no characters in the file that aren't in the ascii charset then the file is valid ascii and valid utf8, there's no way to distinguish them. so your program can write it as utf8 and any other program would be correct in seeing it as ascii (ansi) just as it would be seeing it as utf8.