How to add Unicode emoji to the Internet Archive? - unicode

When visiting a website that contains Unicode emoji through the Wayback Machine, the emoji appear to be broken, for example:
https://web.archive.org/web/20210524131521/https://tmh.conlangs.de/emoji-language/
The emoji "😀" is rendered as "ðŸ˜€", and so forth.
This effect happens if a page is mistakenly rendered as if it was ISO-8859-1 encoded, even though it is actually UTF-8.
So it seems that the Wayback Machine is somehow confused about the character encoding of the page.
The original page source has an HTML5 <!doctype html> declaration and is valid HTML according to W3C's validator. The encoding is specified as utf-8 using a meta charset tag.
The original page renders correctly on all major platforms and browsers, for example Chrome on Linux, Safari on Mac OS, and Edge on Windows.
Does the Internet Archive crawler require a special way of specifying the encoding, or are emoji through UTF-8 simply not supported yet?

tl;dr The original page must be served with a charset in the HTTP content-type header.
As @JosefZ pointed out in the comments, the Wayback Machine mistakenly serves the page as windows-1252 (which has an effect similar to ISO-8859-1).
This is apparently the default encoding that the Internet Archive assumes if no charset can be detected.
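To illustrate the effect, here is a minimal Python sketch that decodes the UTF-8 bytes of the emoji with the wrong charset. The variable names are only for illustration; nothing here is specific to the Wayback Machine.

```python
# Simulate a UTF-8 page being decoded with the wrong charset.
emoji = "\U0001F600"          # GRINNING FACE, the emoji from the question
raw = emoji.encode("utf-8")   # the bytes the server actually sends

as_cp1252 = raw.decode("cp1252")      # what a windows-1252 reader sees
as_latin1 = raw.decode("iso-8859-1")  # the same bytes read as ISO-8859-1

print(raw.hex())                               # f09f9880
print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+00F0', 'U+0178', 'U+02DC', 'U+20AC']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00F0', 'U+009F', 'U+0098', 'U+0080']

# No information is lost: re-encoding the mojibake and decoding it
# as UTF-8 gives the emoji back.
repaired = as_cp1252.encode("cp1252").decode("utf-8")
```

Under windows-1252 the four bytes become the visible characters ð, Ÿ, ˜ and €, while ISO-8859-1 maps the last three to invisible control characters, which is why the two encodings look similar but not identical in the browser.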
The meta charset tag in the original page's source never takes effect when the archived page is rendered by the browser, because with all the extra JavaScript and CSS included by the Wayback Machine, the tag comes after the first 1024 bytes, which is too late according to the HTML5 specification: https://www.w3.org/TR/2012/CR-html5-20121217/document-metadata.html#charset
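A rough way to check this for a given page is sketched below in Python. The regular expression is a simplification of the real HTML prescan algorithm, and the injected script in the example is a made-up stand-in for whatever the Wayback Machine actually inserts.

```python
import re

PRESCAN_LIMIT = 1024  # bytes, per the HTML5 rule linked above

# Simplified pattern: matches <meta charset=...> and the http-equiv form.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)""",
    re.IGNORECASE,
)

def meta_charset_in_time(html: bytes) -> bool:
    """Return True if a charset declaration ends within the first 1024 bytes."""
    match = META_CHARSET.search(html)
    return match is not None and match.end() <= PRESCAN_LIMIT

# Hypothetical original page: the meta tag comes right after <head>.
original = b'<!doctype html><html><head><meta charset="utf-8"><title>t</title>'

# Hypothetical archived copy: a large injected script pushes the tag back.
archived = (
    b"<!doctype html><html><head><script>"
    + b"x" * 2000
    + b'</script><meta charset="utf-8"><title>t</title>'
)

print(meta_charset_in_time(original))  # True
print(meta_charset_in_time(archived))  # False
```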
So it seems that the Internet Archive does not take into account meta charset tags when crawling a page.
However, there are other archived pages such as https://web.archive.org/web/20210501053710/https://unicode.org/emoji/charts-13.0/full-emoji-list.html where Unicode emoji are displayed correctly.
It turns out that this correctly rendered page was originally served with an HTTP content-type header that includes a charset: text/html; charset=UTF-8
So, if the webserver of the original page is configured to send such a content-type HTTP header that includes the UTF-8 encoding, the Wayback Machine should display the page correctly after reindexing.
How the webserver can be configured to send the encoding with the content-type header depends on the exact webserver that is being used.
For Apache, for example, adding
AddDefaultCharset UTF-8
to the site's configuration or .htaccess file should work.
Note that for the Internet Archive to actually reindex the page, you may have to make a change to the original page's HTML content, not just change the HTTP headers.
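To verify that the server really sends a charset, something like the following Python sketch could be used. The function names are my own; the parsing relies on the standard library's email.message.Message, which is also the base class of the header object returned by urllib.

```python
from email.message import Message
import urllib.request

def charset_from_content_type(header_value: str):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lower case, or None if none is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def fetch_declared_charset(url: str):
    """Send a HEAD request and return the charset the server declares.

    Needs network access, so it is not called in this sketch.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_content_charset()

print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

If fetch_declared_charset returns None for the original page, the archived copy will probably fall back to the default encoding described above.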

Related

UTF-8 on FF cannot display french accents

On my FF browser, the encoding is set to UTF-8. The French accents are displayed properly on all pages except one. On the trouble page, they show up as '?' marks. When I change the encoding to Western, the trouble page displays French accents properly, while the other pages no longer do.
On IE, the setting is UTF-8 and all pages show proper French accents.
I know it's an old post, but I was facing the same issue and used htmlentities() in PHP when nothing else worked. This solved the problem for me, so I thought I'd mention it here so that someone else can benefit from it.
What's the web page?
Most likely the page's own encoding is ISO 8859-1 or something similar (a pure 8-bit encoding). Some web pages don't bother to specify their own encoding in the Content-Type: header, leaving the browser to guess. Apparently in this case Internet Explorer guesses better than Firefox.
If you have the curl command, try curl --head URL to see how and whether the encoding is specified, or right-click and View Page Info in Firefox.
You might consider contacting the owner of the web page and asking them to set the encoding properly (or, as I'd do, just ignore it).
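Assuming the trouble page really is ISO 8859-1, the symptom can be reproduced in a few lines of Python. The sample word is just an example.

```python
text = "café"

# Bytes as an ISO 8859-1 page would send them.
legacy_bytes = text.encode("iso-8859-1")

# Browser forced to UTF-8: the lone 0xE9 byte is invalid and is shown
# as the replacement character U+FFFD (the '?' mark).
shown_as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Browser switched to Western: the page is fine again.
shown_as_western = legacy_bytes.decode("iso-8859-1")

# The other pages are UTF-8, so switching to Western breaks them instead.
utf8_as_western = text.encode("utf-8").decode("iso-8859-1")

print(ascii(shown_as_utf8))     # 'caf\ufffd'
print(ascii(shown_as_western))  # 'caf\xe9'
print(ascii(utf8_as_western))   # 'caf\xc3\xa9'
```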

how to make german web site

What type of encoding do I need, or what do I have to do, to make my web site properly display text with German characters like this: Käse and not like this: K�se?
Here is what I use for doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
and here is what I use for encoding:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
The collation that I use in MySQL is utf8_general_ci. I have never built web sites from scratch in languages other than English, and I don't know what I am missing!
Thank you for your time!
Your encoding choice looks fine.
There are just two steps left: you have to make sure that the content type in the HTTP header says the same thing, and you have to make sure that what you actually send is encoded using UTF-8.
UTF-8 should be used for sites that cater to many languages, so it is suitable for your needs.
The meta tag is correct too, though you may want to ensure that the server is sending the right Content-Type header.
Ensure that the HTML file is also encoded with UTF-8 and not ASCII or another codepage.
In general, you need to ensure that all steps from the DB to the browser use UTF-8 (so, DB columns are UTF-8, transferred to the server as UTF-8, rendered as UTF-8, transferred to the browser as UTF-8 with the right headers and meta tags).
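As a sketch of the last two steps (the header and the bytes on the wire), here is a minimal Python WSGI app. The page content and the serve helper are made up for the example; the question does not say which server stack is in use.

```python
from wsgiref.simple_server import make_server

PAGE = (
    "<!DOCTYPE html><html><head>"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
    "<title>Käse</title></head><body><p>Käse</p></body></html>"
)

def app(environ, start_response):
    # Encode the text explicitly so the bytes match the declared charset.
    body = PAGE.encode("utf-8")
    headers = [
        # The HTTP header declares the same charset as the meta tag.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]

def serve(port: int = 8000):
    """Run the app locally; not called automatically in this sketch."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Calling serve() and opening http://127.0.0.1:8000/ should show the umlaut correctly, because the meta tag, the header, and the encoded bytes all agree.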
From my experience, for UTF-8 to work right:
The MySQL data needs to be in one of the "utf-8" collations
The meta tag needs to define the charset as "utf-8"
The MySQL connection needs to be set to "utf-8" (for PHP, it's mysql_set_charset, or mysqli_set_charset with the newer mysqli extension)
The server-side file (*.php or the like) needs to be saved in UTF-8 (not strictly necessary, but it saves some pain)

How can I properly display Vietnamese characters in ColdFusion?

I'm having a hard time trying to properly display Vietnamese text in ColdFusion. I've set the charset to UTF-8, but still no luck. The same text works fine in an HTML page. What else am I missing? Any suggestions would be much appreciated.
Html:
ColdFusion:
Thanks!
There are two things you need to watch out for, as far as I recall off the top of my head.
The first is to ensure that the .cfm file itself is saved as UTF-8 - this is a file system option, and will probably be settable in your editor. This ensures that the UTF-8 characters are correctly preserved when saving the file.
The other is that every .cfm file that includes any UTF-8 text should start with:
<cfprocessingdirective pageencoding="utf-8" />
This ensures that ColdFusion delivers the page to the browser in the correct format.
Just to be sure, when you display your working HTML, can you check the page encoding used by your browser (e.g. in Firefox you can right-click and choose View Page Info)? Maybe your text is not UTF-8 encoded, which could explain the problem.

Why are accented characters rendering inconsistently when accessing the same code on the same server at a different URL?

There is a page on our server that's reachable via two different URLs.
http://www.spotlight.com/6213-5613-0721
http://www.spotlight.com/interactive/cv/1/M103546.html
There's classic ASP behind the scenes, and both of those URLs actually do a Server.Transfer to the same underlying ASP page.
The accents in the name at the top of the page are rendering correctly on one URL and incorrectly on the other - but as far as I can tell, the two requests are returning identical responses (same markup, same headers, same everything) - and I have absolutely no idea why one URL should be rendering correctly whilst the other is corrupting the accented characters.
Is there anything else (content encoding?) that I should be examining - and if so, how can I tell what's being returned beyond the information displayed in Firebug?
I ran into this problem in the past, and the cause was that some file (maybe the ASP file that does the transfer, or some include) was not saved as ANSI.
Check that all files involved in the request have the same encoding on the server (try File -> Save As With Encoding).
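If there are many files, a small script can do a first pass. The Python sketch below only distinguishes a few broad cases, and the labels and function names are my own. A file reported as legacy 8-bit is probably what Windows editors call ANSI (usually windows-1252), but that is a guess the script cannot confirm.

```python
import codecs
from pathlib import Path

def classify_encoding(data: bytes) -> str:
    """Roughly classify raw file contents by encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy 8-bit (not UTF-8)"

def classify_files(paths):
    """Map each path to its rough encoding class."""
    return {str(p): classify_encoding(Path(p).read_bytes()) for p in paths}

print(classify_encoding("José".encode("utf-8")))   # utf-8
print(classify_encoding("José".encode("cp1252")))  # legacy 8-bit (not UTF-8)
```

Running classify_files over the ASP page and its includes should make a mismatch between files visible at a glance.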
I have checked the character encoding in your headers and meta tags and they are consistent across both pages. I also agree that the output of the pages is largely similar - except for the special characters, which are "messed up" in the source file.
I don't think this issue exists in the browser; there must be something behind the scenes that causes this. How does the name containing these characters get from the data store to the page?

iPhone's Mobile Safari: Special Characters

The iPhone app I'm working on uses html help files and special characters such as
ü and ê
are being mangled by iPhone's mobile Safari. Anything I can do to correct this?
If you're using XHTML, ensure that the content of your files really is the encoding specified in the doctype. If you're using just plain HTML, consider using XHTML instead, or
Use HTML entities (e.g. &eacute; for é)
Use the META tag to specify an encoding
Have you tried using numerical character references? Alternatively, perhaps you can use a <meta http-equiv="content-type" ...> element. Also, maybe there's a better way to tell mobile Safari the character encoding of HTML files (equivalent to the server's HTTP Content-Type header)?
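Numerical character references can be generated mechanically. A minimal Python sketch (the function name is made up) that converts every non-ASCII character in a help file's text:

```python
import html

def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference like &#252;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

converted = to_numeric_references("ü and ê")
print(converted)  # &#252; and &#234;

# A browser (or html.unescape) turns the references back into the characters.
restored = html.unescape(converted)
```

The result is pure ASCII, so it displays the same no matter which encoding the browser assumes for the file.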