Represent encoding used for a text file - unicode

How is the encoding for a simple text file stored?
In an email there's a header
Content-Type: text/plain; charset="UTF-8"
In HTML we have a meta tag
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
That leaves me with the question of how a text editor knows which encoding is used, since we don't explicitly set this in a text file as we do in an HTML file.

If it's a standard complex format, like .docx or .pdf, the encoding is likely to be stored in the file itself as some sort of property.
If it's a simple file, like .txt or .csv, the encoding is not stored anywhere. A text editor will use heuristics to determine which encoding was used to save the file, but the result is only a guess.
Read more:
How to detect the encoding of a file?
Heuristic to detect encoding
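The kind of guessing described above can be sketched roughly as follows. This is a minimal illustration in Python, not how any particular editor works: the function name guess_encoding and the choice of Latin-1 as the last-resort fallback are assumptions made for the example.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess codec name for raw file bytes (hypothetical helper)."""
    # A byte order mark, if present, is the only explicit hint in a plain file.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: bytes that decode cleanly as UTF-8 are very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; Latin-1 accepts any byte, so this is only a guess.
        return "latin-1"


print(guess_encoding("naïve".encode("utf-8")))
print(guess_encoding("naïve".encode("latin-1")))
```

Real editors usually go further, for example by looking at byte frequencies, but the principle is the same: without a BOM or external metadata, the result is an inference and not a stored fact.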

Related

Arabic text shows strange characters الÙباى انگليسى ØŒ

I have Arabic text (.sql pure text). When I view it in any document, it shows like this:
حر٠اول الÙباى انگليسى ØŒ حر٠اضاÙÙ‡ مثبت
But when I use an HTML document with <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>, it shows properly like this:
حرف اول الفباى انگليسى ، حرف اضافه مثبت
How can I convert it to readable text?
The Arabic text has been encoded to bytes using UTF-8.
You are explicitly telling the HTML document that the bytes are encoded in UTF-8, which is why any HTML viewer will be able to display the text correctly.
However, any other text viewer will not know the bytes are encoded in UTF-8 unless you put a UTF-8 BOM in front of the text and the viewer supports BOMs. Otherwise, as you are seeing, a text viewer may interpret the bytes as Latin-1 or a similar encoding. So you would have to manually tell the text viewer to interpret the bytes as UTF-8. How you actually do that depends on the particular text viewer you are using, and not all viewers offer this option.
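The effect can be reproduced in a few lines of Python. This is only a sketch: it assumes the viewer decoded the bytes as Latin-1 (it may in fact have used Windows-1252 or something similar), and the function names are made up for the example.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo the wrong decoding: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "حرف اول"
garbled = misread_as_latin1(original)   # what a Latin-1 viewer would display
repaired = repair_mojibake(garbled)     # the text as intended

print(repaired == original)
```

The bytes in the file never change. Only the interpretation does, which is why telling the viewer to use UTF-8 is enough to fix the display.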

Laravel issue with: language character encoding

Hello!
I don't understand why non-ASCII characters such as "ç, ñ, я" are not displayed correctly for my different languages.
The text in question is hardcoded, it is not served from a DB.
I have seen identical questions here
Charset=utf8 not working in my PHP page
I have seen that I should write this:
header('Content-type: text/html; charset=utf-8');
But where the heck does that go? I can't write it like that; the browser just echoes the words and displays them as plain text, without parsing them.
My encoding for the frontpage says this:
<head>
<meta charset="utf-8">
</head>
which is supposed to be Unicode.
I tried to test my page in validator.w3.org and it went:
Sorry, I am unable to validate this document because on line 60 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
Line 60 actually has the word Español (Spanish) with that weird n.
Any hint?
thank you
best regards
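A validator message like the one quoted usually means the file on disk is not actually saved as UTF-8, whatever the meta tag says. One way to check is to look for lines whose bytes are not valid UTF-8. The sketch below is in Python, not PHP, and rests on an assumption: that the template was saved in a single-byte encoding such as Latin-1. The helper name find_non_utf8_lines is invented for the example.

```python
from typing import List


def find_non_utf8_lines(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that are not valid UTF-8."""
    bad_lines = []
    # Splitting on b"\n" is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad_lines.append(number)
    return bad_lines


# Simulated file contents: the same text saved in two different encodings.
page_text = "<p>English</p>\n<p>Español</p>\n"
saved_as_latin1 = page_text.encode("latin-1")
saved_as_utf8 = page_text.encode("utf-8")

print(find_non_utf8_lines(saved_as_latin1))
print(find_non_utf8_lines(saved_as_utf8))
```

If the real file behaves like the Latin-1 sample here, re-saving it as UTF-8 in the editor should make the bytes match the declared charset.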

Perl Encoding for Japanese characters

Please help me with a Perl encoding problem.
I created an HTML form with some input fields.
I read the parameters by each input's "name".
The form action is a ".pl" file.
When I fill in the input fields and read the parameters, I can see the data I entered, but it does not work for Japanese characters.
How do I use Encode in this case? For example, Japanese characters become ã­ã“.
You need to make sure you are setting the character encoding of your web page correctly, usually to UTF-8. If you're using the CGI module, you can do something like:
use CGI;
my $q = CGI->new();
print $q->header( -charset => 'utf-8' );
This assumes your form is also generated by the Perl CGI script. If it's flat HTML, there is a META tag you can use to accomplish the same thing. I think it's:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Encoding international chars (Hebrew, Thai, Russian, Chinese, ...)

International HTML files archived by wget
should contain characters like these
(examples in Hebrew and Thai):
אב
הם
and ยคน
instead they are saved like this:
íäáåãéú and ÃÒ¡à§é
How do I get these displayed properly?
iconv filename.html
iconv: illegal input sequence at position 1254
SOLVED: There was nothing wrong with the files.
I just hadn't noticed that the default php.ini sets the charset in the HTTP header.
To serve pages with various charsets declared in a meta tag, such as <meta http-equiv="Content-Type" content="text/html; charset=windows-874">, default_charset needs to be set to an empty value: default_charset = "";
The pages aren't "saved like this"; whatever you're using to view the file is simply interpreting the encoding incorrectly. To know what encoding the file is in, you should have paid attention to the HTTP Content-Type header during download; that's gone now.
Your only other chance is to parse the equivalent HTML meta tag in the <head>, if the document has one.
Otherwise, you can only guess the encoding of the document.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more required background knowledge.
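Reading the declared charset out of the meta tag, as suggested above, might look roughly like this in Python. It is a simplified sketch and not a full HTML parser: the regular expression, the 4096-byte search window, and the function name charset_from_meta are all assumptions made for the example.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the http-equiv form, where the
# charset appears inside the content attribute.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(html: bytes) -> Optional[str]:
    """Return the charset declared in a meta tag, or None if there is none."""
    # Only search the start of the file; the declaration belongs in <head>.
    match = _META_CHARSET.search(html[:4096])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# Simulated archived page: Hebrew text stored as windows-1255 bytes.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=windows-1255"></head><body>'
    + "אב".encode("cp1255")
    + b"</body></html>"
)

declared = charset_from_meta(page)
text = page.decode(declared) if declared else None
print(declared)
```

If the function returns None, or the declared charset turns out to be wrong, you are back to guessing.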

How do I specify an encoding for TextCells in CellList?

I use a CellList like this
CellList<String> cellList = new CellList<String>(new TextCell());
and then give it an ArrayList<String>.
If a String contains an "ü" I get a question mark in the browser (FF4, GWT Dev Plugin). If I use &uuml; I get the literal text &uuml;
Where can I specify the encoding, so that "ü" works? (I'm not sure if it makes a difference, but the "ü" is currently hardcoded in the .java file and not read from somewhere else).
The GWT compiler assumes that your Java files are encoded in UTF-8. Make sure that your editor is set to save in that encoding.
You should also make sure to set the encoding of the HTML page to a Unicode-capable encoding like UTF-8 (this allows you to use even more exotic characters that you won't find in other charsets):
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...
Moreover, if you later want to retrieve the strings from a database, make sure that it is also set up to handle Unicode and that your JDBC driver connects in Unicode mode (required for some databases).