XMLHTTP - Read iso-8859-2 content and write UTF-8 - encoding

I need to read content from a page that is encoded in iso-8859-2 and write it out as UTF-8 in my code.
Code example:
<%@ language="VBSCRIPT" codepage="65001" %>
<%
set xmlhttp=Server.CreateObject("Msxml2.XMLHttp.6.0")
Set re=New RegExp
re.IgnoreCase=True
re.Global=True
xmlhttp.open "get", link, false
xmlhttp.setRequestHeader "Content-type", "application/x-www-form-urlencoded; charset=ISO-8859-2"
xmlhttp.send()
html=xmlhttp.responsetext
re.Pattern="<h1>.*?</h1>"
set aux=re.execute(html)
text = aux(0)
response.write text
%>
Original text at the source:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" >
<h1>Novo público no interior</h1>
Current output on the UTF-8 page:
"Novo pï¿¿o no interior"
I need to output the text correctly in UTF-8. Can anyone help me?

The problem is that .ResponseText will not decode your iso-8859-2 content. See this statement from the MSDN documentation:
IXMLHTTP attempts to decode the response into a Unicode string. It assumes the default encoding is UTF-8, but can decode any type of UCS-2 (big or little endian) or UCS-4 encoding as long as the server sends the appropriate Unicode byte-order mark.
Try using .ResponseBody instead, or failing that, use ADODB.Stream to take .ResponseStream and convert it to UTF-8. See "ASP: I can´t decode some character from utf-8 to iso-8859-1".
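The idea behind that advice can be sketched outside of ASP. The following is a minimal Python illustration, not the VBScript fix itself: `raw` stands in for the undecoded bytes that .ResponseBody would return, and the sample text is the heading from the question. It shows why assuming UTF-8 damages the text and why decoding with the page's real charset first avoids that.

```python
# Bytes as the server would send them for an iso-8859-2 page
# (a stand-in for what .ResponseBody returns; sample text from the question).
raw = "<h1>Novo público no interior</h1>".encode("iso-8859-2")

# Decoding as UTF-8 (the default assumption) damages the accented character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the page's real charset recovers the text,
# which can then be written out as UTF-8.
text = raw.decode("iso-8859-2")
utf8_bytes = text.encode("utf-8")

print(wrong)
print(text)
```

The same two steps (take the raw bytes, then decode them with an explicit charset) are what an ADODB.Stream based conversion would have to perform.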

Related

Arabic text shows strange characters الÙباى انگليسى ØŒ

I have Arabic text (.sql pure text). When I view it in any document, it shows like this:
حر٠اول الÙباى انگليسى ØŒ حر٠اضاÙÙ‡ مثبت
But when I use an HTML document with <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>, it shows properly like this:
حرف اول الفباى انگليسى ، حرف اضافه مثبت
How can I convert it to readable text?
The Arabic text has been encoded to bytes using UTF-8.
You are explicitly telling the HTML document that the bytes are encoded in UTF-8, which is why any HTML viewer will be able to display the text correctly.
However, any other text viewer will not know the bytes are encoded in UTF-8 unless you put a UTF-8 BOM in front of the text and the viewer supports BOMs. Otherwise, as you are seeing, a text viewer may interpret the bytes as Latin-1 or a similar encoding. You would then have to manually tell the text viewer to interpret the bytes as UTF-8, and how you do that depends on the particular viewer you are using. Not all viewers offer this option.
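A small Python sketch of the effect described above, assuming the file really holds UTF-8 bytes (the sample string is taken from the question; the variable names are only illustrative). It reproduces the garbled view, recovers the text, and shows what the BOM looks like.

```python
text = "حرف اول"                  # sample taken from the question
data = text.encode("utf-8")       # the bytes stored in the .sql file

# A viewer that assumes Latin-1 shows one character per byte: mojibake.
mojibake = data.decode("latin-1")

# Telling the decoder the real encoding recovers the text.
recovered = data.decode("utf-8")

# The mojibake can be reversed as long as no bytes were lost along the way.
repaired = mojibake.encode("latin-1").decode("utf-8")

# "utf-8-sig" prepends the BOM (EF BB BF) that BOM-aware viewers look for.
with_bom = text.encode("utf-8-sig")

print(repr(mojibake))
print(recovered, with_bom[:3].hex())
```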

Unknown charset accented characters convert to utf8

I have a website where users may enter search terms that contain accented characters.
Since users come from various countries and use various operating systems, the accented characters they enter may be encoded in windows-1252, iso-8859-1, or another iso-8859-X or windows-125X variant.
I am using Perl, and my index server is Solr 8, with all data in UTF-8.
I can use decode+encode to convert the input if the source charset is known, but how can I convert accented characters in an unknown charset to UTF-8? How can I detect the charset of the input in Perl?
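use utf8;                      # only declares that this source file is UTF-8
use Encode qw(decode encode);
# Works only when the source charset (here cp1252) is known in advance
my $utf8 = encode("UTF-8", decode("cp1252", $input));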
The web page and the form need to specify UTF-8.
Then the browser can accept any script, and will send it to the server as UTF-8.
The form's encoding prevents the browser from sending numeric character references like &#259; for special characters.
Header:
Content-type: text/html; charset=UTF-8
With Perl (the empty line marks the end of the headers):
print "Content-Type: text/html; charset=UTF-8\n\n";
HTML content; in HTML 5:
<!DOCTYPE html>
<html>
<meta charset="UTF-8">
...
<form ... accept-charset="UTF-8"
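To see what this buys you on the server side, here is a hedged Python sketch (not Perl, and not tied to any particular framework). The field name `q` and both query strings are made up for illustration. When the page and form are UTF-8, the browser percent-encodes UTF-8 bytes and decoding is unambiguous; input sent in a legacy single-byte charset fails to decode as UTF-8.

```python
from urllib.parse import parse_qs

# What a browser sends for "café" from a UTF-8 form (percent-encoded UTF-8 bytes).
utf8_query = "q=caf%C3%A9"
# What a Latin-1 / windows-1252 page would have sent for the same input.
latin1_query = "q=caf%E9"

good = parse_qs(utf8_query, encoding="utf-8")["q"][0]
# errors="replace" turns the undecodable byte into U+FFFD instead of raising.
bad = parse_qs(latin1_query, encoding="utf-8", errors="replace")["q"][0]

print(good)
print(repr(bad))
```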

JavaScript can put an ansi string in a text field, but not utf-8?

I always use UTF-8 everywhere. But I just stumbled upon a strange issue.
Here's a minimal example html file:
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script type="text/javascript">
function Foo()
{
var eacute_utf8 = "\xC3\xA9";
var eacute_ansi = "\xE9";
document.getElementById("bla1").value = eacute_utf8;
document.getElementById("bla2").value = eacute_ansi;
}
</script>
</head>
<body onload="Foo()">
<input type="text" id="bla1">
<input type="text" id="bla2">
</body>
</html>
The html contains a utf-8 charset header, thus the page uses utf-8 encoding. Hence I would expect the first field to contain an 'é' (e acute) character, and the second field something like '�', as a single E9 byte is not a valid utf-8 encoded string.
However, to my surprise, the first contains 'Ã©' (as if the utf-8 data is interpreted as some ansi variant, probably iso-8859-1 or windows-1252), and the second contains the actual 'é' char. Why is this!?
Note that my problem is not related to the particular encoding that my text editor uses - this is exactly why I used the explicit \x character constructions. They contain the correct, binary representation (in ascii compatible notation) of this character in ansi and utf-8 encoding.
Suppose I would want to insert a 'ę' character, that's unicode U+0119, or 0xC4 0x99 in utf-8 encoding, and does not exist in iso-8859-1 or windows-1252 or latin1. How would that even be possible?
JavaScript strings are always strings of Unicode characters, never bytes. Encoding headers or meta tags do not affect the interpretation of escape sequences. The \x escapes do not specify bytes but are shorthand for individual Unicode characters, so \xC3 is equivalent to \u00C3 and the behavior is expected. By the same rule, 'ę' can be written as \u0119.
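Python 3 text strings treat \x and \u escapes the same way, so the distinction can be sketched there as an analogy (this is Python, not JavaScript, and the variable names are illustrative). Escapes name code points; bytes only appear once the text is explicitly encoded.

```python
two_chars = "\xC3\xA9"   # U+00C3 and U+00A9, i.e. "Ã©"
one_char = "\xE9"        # U+00E9, i.e. "é"
e_ogonek = "\u0119"      # "ę": above U+00FF, so it needs \u rather than \x

# Bytes only appear when text is explicitly encoded.
utf8_bytes = one_char.encode("utf-8")

print(len(two_chars), len(one_char), utf8_bytes, e_ogonek.encode("utf-8"))
# prints: 2 1 b'\xc3\xa9' b'\xc4\x99'
```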

.ENCODING international chars (hebrew,thai,russian,chinese,....)

International HTML files archived by wget
should contain characters like this
(Hebrew and Thai examples):
אב
הם
and ยคน
Instead, they are saved like this:
íäáåãéú and ÃÒ¡à§é
How can I get these displayed properly?
iconv filename.html
iconv: illegal input sequence at position 1254
SOLVED: There was nothing wrong.
I just didn't notice that the default php.ini sets the charset in the HTTP header.
To use various charsets via a tag like meta http-equiv="Content-Type" content="text/html; charset=windows-874", you need to set: default_charset = "empty";
....
The pages aren't "saved like this"; whatever you're using to view the file is simply interpreting the encoding incorrectly. To know what encoding the file is in, you should have paid attention to the HTTP Content-Type header during download; that's gone now.
Your only other chance is to parse the equivalent HTML meta tag in the <head>, if the document has one.
Otherwise, you can only guess the encoding of the document.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more required background knowledge.
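Parsing the meta tag can be sketched roughly as follows in Python. This is a simplified illustration, not a full HTML parser: the helper name `sniff_charset`, the regular expression and the sample page are all assumptions made for the example, and real pages may declare their charset in ways this pattern misses.

```python
import re

# Looks for charset=... inside a <meta> tag; works on bytes because the
# encoding is not known yet and the declaration itself is plain ASCII.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a meta tag, or a guess if there is none."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else default

# Hypothetical archived page, stored as windows-1255 (Hebrew) bytes.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=windows-1255"></head>'
        '<body>אב</body></html>').encode("cp1255")

charset = sniff_charset(page)
text = page.decode(charset)
print(charset, text)
```

When no meta tag is found, the function can only fall back to a guess, which is the situation the answer warns about.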

What is the difference between EM Dash &#151; and &#8212;?

I've an ASCII file that contains an EM Dash (&#151; or &#8212; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, and it converts the character to 0xC297, which is &#151; in HTML. However, when we pass this file through a different application it converts the character to 0xE28094 or &#8212;.
What would cause these applications to convert these characters differently? Is it perhaps a code page setting?
&#151; is wrong. When you use numeric character references, the number refers to the Unicode codepoint. For numbers below 256 that is the same as the codepoint in ISO-8859-1. In 8859-1, character 151 is amongst the “C1 control codes”, and not a dash or any other visible character.
The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
The first application is reading your “ASCII” file* as ISO-8859-1, but actually it's probably cp1252 and you'll need a way to clue the app in about what encoding it has to expect.
(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)
&#151; is not an em dash; your text was mis-translated from an em dash to that value.
&#8212; is the HTML decimal character reference for the em dash. Specifically, it references the Unicode code point 8212, which represents an em dash.
Your file is not ASCII if it contains an em dash. ASCII chars only encode to decimal range 0 - 127, and em dash is not a character that can be represented by ASCII encoding. If you have em dash stored as 0x97 (151 in decimal) you probably have an ANSI text file (aka Windows Codepage 1252 (w-1252)).
Your first app...
The data started as an em dash encoded in w-1252. In w-1252 the em dash maps to the decimal value 151 (0x97 in hex, or 10010111 in binary).
At some point the em dash was handled by code that thought the bytes in your file were iso-8859-1 encoded text. When that code interpreted 0x97 as a string/char it mapped 0x97 to a character according to the iso-8859-1 encoding. In iso-8859-1 0x97 maps to the char "End of guarded area".
Next, the string, which the code thinks is the "End of guarded area" control char, was encoded as utf-8. "End of guarded area" encoded in utf-8 is the two-byte sequence: 0xC2 0x97.
Your second app...
The text file was correctly interpreted as w-1252, thus the 0x97 is recognized as em dash, which was correctly encoded as the em dash in utf-8: 0xE2 0x80 0x94.
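Both conversions can be reproduced in a few lines of Python. This is only a sketch of the byte-level effect; which code pages the two applications actually use is not known from the question.

```python
raw = b"\x97"   # the byte stored in the file

# First app: treats the byte as iso-8859-1, getting the control character U+0097.
first = raw.decode("iso-8859-1").encode("utf-8")

# Second app: treats the byte as windows-1252, getting the em dash U+2014.
second = raw.decode("cp1252").encode("utf-8")

print(first.hex(), second.hex())   # prints: c297 e28094
```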
What influences this behavior
Not sure if you're dealing with web apps or what, but the concept should be the same whatever it is. We had the same 0x97->0xC297 scenario in a web app where people input data into a form. I found that the charset of the web page was declared as iso8859-1, and the browser's best way to handle the w1252 chars was to just send them along as the iso bytes without alerting the user or the server. The server receives the data, thinks it's iso, and converts it to utf-8, resulting in 0xC297.
Basically any time an app touches text it needs to be told how the text is encoded, or else it might fall back to a system default. If that happens you risk data corruption.
According to the HTML4 specification's character entity reference, the em dash is &mdash; (U+2014).
An ASCII file cannot contain the character 0x97, as the ASCII character set only ranges from 0x00 to 0x7F. Therefore your file is not ASCII, but some other single-byte encoding. The windows-1250 encoding, for example, has the em dash at 0x97.
If the applications decode the text file using some other encoding than the one that was used to create the file, any character above 0x7F will be wrong.
In unicode the em-dash has the character code 0x2014, or 8212 in decimal.
Unicode Character 'EM DASH' (U+2014)
In a web page that, for example, uses windows-1250 as its encoding, the code &#151; will render as an em dash:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>em-dash</title>
<meta http-equiv="content-type" content="text/html; charset=windows-1250"/>
</head>
<body>
<div>&#151;</div>
</body>
</html>
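One caveat worth checking: HTML parsers that follow the HTML5 parsing rules remap numeric references in the 128 to 159 range through the windows-1252 table for compatibility, so &#151; tends to render as an em dash regardless of the page's declared encoding, even though the reference is formally invalid. Python's `html.unescape` follows those rules, so it can serve as a quick, hedged check of how the three spellings resolve:

```python
import html

legacy = html.unescape("&#151;")    # invalid reference, remapped by HTML5 rules
proper = html.unescape("&#8212;")   # the actual code point of the em dash
named = html.unescape("&mdash;")    # the named entity

print([hex(ord(c)) for c in (legacy, proper, named)])
```

All three come back as U+2014 here, but that says nothing about tools that do not apply the HTML5 remapping, so &#8212; or &mdash; remains the safe spelling.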