Unknown charset accented characters convert to utf8 - perl

I have a website where users may enter a search term containing accented characters.
Since users come from various countries and operating systems, the accented characters they submit may be encoded in windows-1252, iso-8859-1, or even iso-8859-X or windows-125X.
I am using Perl, and my index server is Solr 8, with all data in UTF-8.
I can use decode+encode to convert the input when the source charset is known, but how can I convert accented characters in an unknown charset to UTF-8? How can I detect the charset of the input, in Perl?
use utf8;
use Encode;
encode("utf8",decode("cp1252",$input));

The web page and the form need to specify UTF-8. The browser can then accept any script and will send it to the server as UTF-8. Declaring the form's encoding also prevents the browser from sending special characters as HTML entities (e.g. &#259; for ă).
Header:
Content-type: text/html; charset=UTF-8
With Perl (the blank line marks the end of the headers):
print "Content-Type: text/html; charset=UTF-8\n\n";
HTML content; in HTML 5:
<!DOCTYPE html>
<html>
<meta charset="UTF-8">
...
<form ... accept-charset="UTF-8">
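
If legacy-encoded input can still reach you despite the UTF-8 form, Perl's Encode::Guess (shipped with the core Encode distribution) can try a list of suspect encodings. A minimal sketch, assuming cp1252 and iso-8859-2 as candidate charsets; guessing between similar single-byte Latin charsets is inherently unreliable, so a fixed fallback is kept:

use Encode qw(decode encode);
use Encode::Guess;

# Try UTF-8 (checked by default) plus a short list of likely legacy charsets.
my $guess = guess_encoding($input, qw/cp1252 iso-8859-2/);
my $text  = ref($guess)
          ? $guess->decode($input)      # guess succeeded; decode with it
          : decode("cp1252", $input);   # ambiguous or failed; fall back
my $for_solr = encode("UTF-8", $text);  # UTF-8 octets to send to Solr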

Related

Can an email header have a different character encoding than the body of the email?

Is an email with a different character encoding for its header and body valid?
The use case: while processing an email, should I check the character encoding of its header separately, or will checking that of its body be sufficient?
Can someone guide me as to how to figure this out?
Thanks in advance!
Email headers should use the ASCII charset. If you want a header field to carry a different encoding, you need to use the encoded-word syntax: http://en.wikipedia.org/wiki/MIME#Encoded-Word
The email body can be sent directly in a different encoding only if the mail servers that transfer it support 8-bit MIME (nowadays every mail server should, but it's not guaranteed); otherwise you need to wrap the body in a transfer encoding (quoted-printable or base64).
The charsets can differ in each case: every encoded word can use a different charset, and every MIME part can use a different charset, or even a different transfer encoding.
For example you can have:
Subject: =?UTF-8?Q?Zg=C5=82oszenie?= //header value in UTF-8 encoded with quoted printable
and the body encoded:
Content-Type: text/plain; charset="iso-8859-2"
Content-Transfer-Encoding: base64
WmG/87PmIEfqtmyxIEphvPE=
different charsets, different transfer encodings in the same email, no problem.
From experience I can tell you that such mails are very common. Even worse, you can get an email that states one charset in the Content-Type header and another charset in the HTML body's meta tag:
Content-Type: text/html; charset="iso-8859-2"
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
It's up to you to guess the actual charset used. Probably it's the one in the meta tag.
Assume nothing. Expect everything. Take no prisoners. This is Sparta.
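
To bring this back to Perl (used elsewhere on this page), the core Encode and MIME::Base64 modules can undo both layers from the example above. A minimal sketch, assuming the charsets declared in the headers are actually correct:

use Encode qw(decode);
use MIME::Base64 qw(decode_base64);

# Header: the charset and transfer encoding are carried inside the =?...?= token.
my $subject = decode("MIME-Header", "=?UTF-8?Q?Zg=C5=82oszenie?=");  # "Zgłoszenie"

# Body: undo the transfer encoding first, then decode the declared charset.
my $octets = decode_base64("WmG/87PmIEfqtmyxIEphvPE=");
my $body   = decode("iso-8859-2", $octets);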

Laravel issue with: language character encoding

Hi!
I don't understand why non-ASCII characters such as "ç, ñ, я" are not displayed correctly for my different languages.
The text in question is hardcoded, it is not served from a DB.
I have seen identical questions here, such as:
Charset=utf8 not working in my PHP page
I have seen that I should write this:
header('Content-type: text/html; charset=utf-8');
But where the heck does that go? I can't just write it into the page like that; the browser simply mirrors the words and displays them as plain text, with no parsing.
My encoding for the frontpage says this:
<head>
<meta charset="utf-8">
</head>
which is supposed to be Unicode.
I tried to test my page in validator.w3.org and it went:
Sorry, I am unable to validate this document because on line 60 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
Line 60 actually has the word Español (Spanish) with that weird ñ.
Any hint?
thank you
best regards

JavaScript can put an ansi string in a text field, but not utf-8?

I always use UTF-8 everywhere. But I just stumbled upon a strange issue.
Here's a minimal example html file:
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script type="text/javascript">
function Foo()
{
var eacute_utf8 = "\xC3\xA9";
var eacute_ansi = "\xE9";
document.getElementById("bla1").value = eacute_utf8;
document.getElementById("bla2").value = eacute_ansi;
}
</script>
</head>
<body onload="Foo()">
<input type="text" id="bla1">
<input type="text" id="bla2">
</body>
</html>
The html contains a utf-8 charset header, thus the page uses utf-8 encoding. Hence I would expect the first field to contain an 'é' (e acute) character, and the second field something like '�', as a single E9 byte is not a valid utf-8 encoded string.
However, to my surprise, the first contains 'é' (as if the utf-8 data is interpreted as some ansi variant, probably iso-8859-1 or windows-1252), and the second contains the actual 'é' char. Why is this!?
Note that my problem is not related to the particular encoding that my text editor uses - this is exactly why I used the explicit \x character constructions. They contain the correct, binary representation (in ascii compatible notation) of this character in ansi and utf-8 encoding.
Suppose I would want to insert a 'ę' character, that's unicode U+0119, or 0xC4 0x99 in utf-8 encoding, and does not exist in iso-8859-1 or windows-1252 or latin1. How would that even be possible?
JavaScript strings are always sequences of Unicode characters, never bytes. Encoding headers and meta tags do not affect how escape sequences are interpreted: the \x escapes do not specify bytes, they are shorthand for individual Unicode characters, so \xC3 is equivalent to \u00C3. The behaviour you see is therefore expected. To get 'ę' (U+0119), write it as the character escape "\u0119" rather than as its UTF-8 bytes.
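
The same character-versus-byte distinction exists in Perl (the language most of this page is about). A minimal sketch of why the first field shows two characters unless the bytes are explicitly decoded:

use Encode qw(decode);

my $two_chars = "\xC3\xA9";                   # two characters, U+00C3 and U+00A9 ("Ã©")
my $one_char  = decode("UTF-8", $two_chars);  # reinterpret those code points as UTF-8 bytes -> "é"
# length($two_chars) == 2, length($one_char) == 1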

XMLHTTP - Read iso-8859-2 content and write UTF-8

I need to read content from a page that is in iso-8859-2 and write it out as UTF-8 in my code.
Code Example:
<%@ language="VBSCRIPT" codepage="65001" %>
<%
set xmlhttp=Server.CreateObject("Msxml2.XMLHttp.6.0")
Set re=New RegExp
re.IgnoreCase=True
re.Global=True
xmlhttp.open "get", link, false
xmlhttp.setRequestHeader "Content-type", "application/x-www-form-urlencoded; charset=ISO-8859-2"
xmlhttp.send()
html=xmlhttp.responsetext
re.Pattern="<h1>.*?</h1>"
set aux=re.execute(html)
text = aux(0)
response.write text
%>
Original text on the source page:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2" >
<h1>Novo público no interior</h1>
Current output on the UTF-8 page:
"Novo pï¿¿o no interior"
I need to output the text correctly in UTF-8. Can anyone help me?
The problem is that .ResponseText will not decode your iso-8859-2 content; see this statement from the MSDN documentation:
IXMLHTTP attempts to decode the response into a Unicode string. It assumes the default encoding is UTF-8, but can decode any type of UCS-2 (big or little endian) or UCS-4 encoding as long as the server sends the appropriate Unicode byte-order mark.
Try using .ResponseBody instead, or failing that, use ADODB.Stream to take .ResponseStream and convert it to UTF-8; see "ASP: I can't decode some character from utf-8 to iso-8859-1".
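
For comparison, the same conversion in Perl (the language used elsewhere on this page): fetch the raw octets, decode them as iso-8859-2, then re-encode as UTF-8. A sketch only; the URL is a placeholder and the page is assumed to really be iso-8859-2:

use LWP::UserAgent;
use Encode qw(decode encode);

my $link  = "http://example.com/page.html";   # hypothetical URL
my $ua    = LWP::UserAgent->new;
my $res   = $ua->get($link);
my $bytes = $res->content;                    # raw octets, no charset handling
my $text  = decode("iso-8859-2", $bytes);     # characters
my ($h1)  = $text =~ m{<h1>(.*?)</h1>}s;      # grab the heading, as in the ASP example
print encode("UTF-8", $h1), "\n";             # UTF-8 octets for the output page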

Perl Encoding for Japanese characters

Please help me with my Perl encoding problem.
I create an HTML form with some input fields.
I take parameters from the input "name".
The form action is a ".pl" file.
I fill in the input fields, take the parameters, and can see the data I entered, but it is not OK for Japanese characters.
How can I use Encode in that case? E.g. Japanese characters become ã­ã“.
You need to ensure you are setting the character encoding of your web page correctly, usually UTF-8. So if you're using the CGI module you do something like:
my $q = CGI->new();
print $q->header( -charset=> 'utf-8' );
This is assuming your form is also generated by the Perl CGI script. If it's flat HTML, there are some META tags you can use to accomplish the same thing. I think it's:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
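
The header only fixes the output side; the submitted parameters also arrive as UTF-8 bytes and have to be decoded before use, otherwise Japanese text turns into exactly that kind of mojibake. A minimal sketch using CGI.pm's -utf8 pragma, which decodes incoming parameters as UTF-8:

use CGI qw(-utf8);                       # decode incoming form parameters as UTF-8

my $q    = CGI->new();
my $name = $q->param('name');            # now a Perl character string, not raw bytes

binmode STDOUT, ':encoding(UTF-8)';      # encode everything printed back to UTF-8
print $q->header( -charset => 'utf-8' );
print "<p>Hello, $name</p>\n";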