I need to convert an ISO-8859-1 file to UTF-8 encoding, without losing any content information...
I have a file which looks like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>
Now I want to encode it into UTF-8.
I tried the following:
f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
ts=new String(f.getBytes("UTF-8"), "UTF-8")
g=new File('c:/temp/myutf8.xml').write(ts)
This didn't work due to String incompatibilities.
Then I read something about byte stream readers/writers, StreamingMarkupBuilder and others...
Then I tried:
f=new File('c:/temp/myiso88591.xml').getText('ISO-8859-1')
mb = new groovy.xml.StreamingMarkupBuilder()
mb.encoding = "UTF-8"
new OutputStreamWriter(new FileOutputStream('c:/temp/myutf8.xml'),'utf-8') << mb.bind {
    mkp.xmlDeclaration()
    out << f
}
This was not at all what I wanted.
I just want to get the content of an XML file read with an ISO-8859-1 reader and then write it into a new (or the old) file... why is this so complicated :-/
The result should just be the following, and the file should really be encoded in UTF-8:
<?xml version="1.0" encoding="UTF-8" ?>
<HelloEncodingWorld>Üöäüßßß Test!!!</HelloEncodingWorld>
Thanks for any answers
Cheers
def f=new File('c:/data/myiso88591.xml').getText('ISO-8859-1')
new File('c:/data/myutf8.xml').write(f,'utf-8')
(I just gave it a try, it works :-)
Same as in Java: the libraries do the conversion for you...
As deceze said: when you specify an encoding, the text is converted to an internal format (UTF-16, AFAIK). When you specify another encoding when writing the string, it is converted to that encoding.
But if you work with XML, you shouldn't have to worry about the encoding anyway, because the XML parser will take care of it. It reads the first characters, <?xml, and determines the basic encoding from those. After that, it is able to read the encoding information from your XML header and use that.
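For comparison, here is a minimal Java sketch of the same conversion (Java 11+, paths taken from the question): decode the bytes as ISO-8859-1, then encode the resulting string as UTF-8.

import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class IsoToUtf8 {
    public static void main(String[] args) throws Exception {
        // Decode the file's bytes as ISO-8859-1 into Java's internal string form...
        String content = Files.readString(Path.of("c:/temp/myiso88591.xml"),
                                          StandardCharsets.ISO_8859_1);
        // ...and encode that string back out as UTF-8.
        Files.writeString(Path.of("c:/temp/myutf8.xml"), content,
                          StandardCharsets.UTF_8);
    }
}

Note that this converts the bytes only; the encoding="ISO-8859-1" text inside the XML declaration is copied verbatim, so it still has to be changed to UTF-8 separately.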
Making it a little more Groovy, and not requiring the whole file to fit in memory, you can use the readers and writers to stream the file. This was my solution when I had files too big for plain old Unix iconv(1).
new FileOutputStream('out.txt').withWriter('UTF-8') { writer ->
    new FileInputStream('in.txt').withReader('ISO-8859-1') { reader ->
        writer << reader
    }
}
http://www.hjsoft.com/blog/link/A_Useful_Example_in_Java_Ruby_and_Groovy
I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint, or maybe links to websites that explain what meaningful letters I should get out of that, it would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Игорќ"; // source encoding UTF-8, mis-decoded as Windows-1252
// Re-encode with Windows-1252 to recover the original UTF-8 bytes...
byte[] b = s.getBytes("Cp1252");
// ...then decode those bytes as UTF-8.
System.out.println(new String(b, StandardCharsets.UTF_8));
// Игорќ
The data might easily get corrupted. Above I got sensible results with Windows-1252 (MS Windows Latin-1). The Java source must be compiled with encoding UTF-8 to accept those characters.
Since you already pasted the original text into a UTF-8 encoded site such as Stack Overflow, your posted text is now corrupt data perfectly encoded as UTF-8. If you want to find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
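If you can get at the original raw bytes (rather than text pasted into a web page), dumping them in hex makes this easy to confirm. A small sketch, assuming the field was saved to a file whose name here is hypothetical:

import java.nio.file.*;

public class HexDump {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("name-field.bin"));
        StringBuilder hex = new StringBuilder();
        for (byte b : raw) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex);
        // UTF-8 Cyrillic shows up as byte pairs starting with D0 or D1;
        // "Игорќ" would be D0 98 D0 B3 D0 BE D1 80 D1 9C.
    }
}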
I am working on files whose encoding is unknown at first, but I get the encoding with these lines in Java:
InputStream in = new FileInputStream(new File("D:\\lbl2\\1 (26).LBL"));
InputStreamReader inputStreamReader = new InputStreamReader(in);
System.out.print(inputStreamReader.getEncoding());
and we get UTF8 as output.
But the problem is that when I try to view the file content in a browser or a text editor like Notepad++, I can't see the characters correctly. When I change the encoding to Windows-1256, all of the characters display correctly and are readable.
Am I making a mistake?
Java does not attempt to detect the encoding of a file. getEncoding returns the encoding that was selected in the InputStreamReader constructor. If you don't use one of the constructors that take a character set parameter, you get the 'platform default charset', according to Oracle's documentation.
This question discusses what the platform default charset is, and how you can change it.
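A quick way to see this behaviour for yourself (using the path from the question); note that getEncoding may report a charset under its historical name, e.g. UTF8 or Cp1256:

import java.io.*;
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset()); // the platform default

        try (InputStreamReader r =
                 new InputStreamReader(new FileInputStream("D:\\lbl2\\1 (26).LBL"))) {
            // No charset given, so this just reports the platform default,
            // not anything detected from the file's contents.
            System.out.println(r.getEncoding());
        }

        try (InputStreamReader r =
                 new InputStreamReader(new FileInputStream("D:\\lbl2\\1 (26).LBL"),
                                       "Windows-1256")) {
            // Now it reports the charset that was passed in.
            System.out.println(r.getEncoding());
        }
    }
}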
If you know in advance that this file is Windows-1256, you can use:
InputStreamReader inputStreamReader = new InputStreamReader(in, "Windows-1256");
Attempting to detect the encoding of a file usually fails - see for example the Bush hid the facts issue in Windows Notepad.
Unfortunately there is no 100% reliable way to detect the encoding of a file and as the other answer points out Java by default doesn't try. It simply assumes the platform's default encoding.
If you know the files are all in a single encoding then great, you can just specify that encoding and life is good.
If you know that some files are in UTF-8 and some files are in a single legacy encoding then you can generally get away with trying a strict* UTF-8 decode first. If the strict UTF-8 decode errors out then you move on to your legacy encoding.
If you have a wider mix of encodings, things get considerably harder; you may have to resort to some quite complex language processing to sort them out.
* I believe that to get a strict decode in Java you need to first get the Charset, then get a CharsetDecoder, and then use the onMalformedInput method to set it to strict mode.
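Roughly, that looks like the sketch below (using Windows-1256 from the question as the hypothetical legacy fallback):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.*;
import java.nio.file.*;

public class ReadWithFallback {
    // Try a strict UTF-8 decode first; if the bytes are not valid UTF-8,
    // fall back to decoding them with the legacy encoding.
    static String read(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("Windows-1256"));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(Paths.get("D:\\lbl2\\1 (26).LBL")));
    }
}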
I have a problem when I parse XML because I have this character: ö
<?xml version="1.0" encoding="UTF-8"?>
<rsp stat="ok">
<mediaid>abösjdk3</mediaid>
<mediaurl>http://twitöic.com/abc123</mediaurl>
</rsp>
When building the document, I get this error:
parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x9A 0x74 0x68 0x65
<mediaid>ab\232sjdk3</mediaid>
^
Another question, please: if I want to parse this: > 6 < 12 month, will I have a problem? I don't want to replace the >. Does someone have a solution?
You'll have this problem with any parser, not only in Objective-C.
That character isn't encoded as UTF-8, and as such it will halt any parser.
Either remove the encoding information or change it to the correct value.
Edited to answer a comment
I use GDataXmlNode to parse, and in my XML file I do not use <?xml version="1.0" encoding="UTF-8"?> – cs1.6
If the original XML file does not have the encoding attribute, then either when you instantiate the parser or when you load the XML file, specify the proper encoding, whatever it is (I have no idea which one that would be).
Because, from the way the O.P. is posted, it implies that the character ö ended up as the byte shown as \232 in the error output, i.e. the octal escape for 0x9A (matching the "Bytes: 0x9A ..." line). That is not how ö is encoded in UTF-8 (two bytes, 0xC3 0xB6), nor in ISO-8859-1 (a single byte, 0xF6, decimal 246), so the data does not match the declared encoding.
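A quick way to check those byte values (compile the source as UTF-8 so the ö literal survives):

import java.nio.charset.StandardCharsets;

public class ByteValues {
    public static void main(String[] args) {
        byte[] latin1 = "ö".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "ö".getBytes(StandardCharsets.UTF_8);
        // ö is the single byte 0xF6 (decimal 246) in ISO-8859-1...
        System.out.printf("ISO-8859-1: %02X%n", latin1[0] & 0xFF);
        // ...and the two bytes 0xC3 0xB6 in UTF-8.
        System.out.printf("UTF-8: %02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
    }
}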
Go through this, it will help...
How do I parse XML which contains data in the Norwegian language?
Do I need any type of encoding with NSXMLParser?
Thanks.
I guess you are worried about non-ASCII characters in the XML file. Well, you don't need to be. The first line of an XML file should look something like:
<?xml version="1.0" encoding="UTF-8"?>
where the encoding attribute tells you which character set was used to encode the characters in the file. NSXMLParser will use that line to determine which character set it will use. Once it gets to your methods, all the text will be in NSStrings which will be able to cope with your Norwegian characters automatically.
All you need to be concerned about is that the file really is encoded in the character set that the first line says it is.
XML is a language that does not care which human language you are using!! In XML there should be one start tag and its matching end tag. Then you can parse it with XML parsing.
Here is the tutorial to understand XML, and
here is the link to the tutorial to parse the XML file.
Maybe this will be helpful for your problem.
I have an XML document that may have Shift-JIS encoded data in it, and I'm trying to parse it using an NSXMLParser object.
Ordinarily I assume the document is UTF-8 encoded and all is well. Does anyone know if/how I can determine whether an element is Shift-JIS encoded, and then how to decode it?
Thanks
An XML document is UTF-8 encoded unless it has an XML declaration stating otherwise, for example:
<?xml version="1.0" encoding="shift_jis"?>
or:
<?xml version="1.0" encoding="cp932"?>
Any XML parser should detect the encoding given in the XML declaration. (Some parsers may not support some of the CJK codecs so will complain, but AIUI NSXMLParser should be fine.)
If you've got a file with Shift-JIS byte sequences that does not have such a stated encoding, or which contains Shift-JIS byte sequences in some elements and UTF-8 in others, what you have is not well-formed; it's not an XML document at all and no parser will read it.
If you've just got a missing encoding declaration, you really need to fix it at the source end, but in the meantime hacking in a suitable XML declaration or transcoding the bytes manually from Shift-JIS to UTF-8 before feeding it into the parser should help.
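If you go the transcoding route, the same idea expressed in Java (file names are hypothetical) is simply to read the bytes with a Shift-JIS decoder and write them back out as UTF-8 before handing the result to the parser:

import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SjisToUtf8 {
    public static void main(String[] args) throws IOException {
        try (Reader in = new InputStreamReader(
                 new FileInputStream("input-sjis.xml"), Charset.forName("Shift_JIS"));
             Writer out = new OutputStreamWriter(
                 new FileOutputStream("output-utf8.xml"), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}

Remember that if the file declares encoding="shift_jis", that declaration has to be removed or changed as part of the same fix.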