How to change content type of static files? - mojolicious

While investigating why web fonts served from my server were being reported as invalid by browsers, https://stackoverflow.com/a/35898578/4429472 pushed me in the right direction. I checked the response Content-Type of the files, and it was erroneously text/html. How do I change it for the web font files?

I don't know which MIME type you expect to receive, but the types offered by Mojolicious are listed here: http://mojolicious.org/perldoc/Mojolicious/Types
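In Mojolicious itself, a missing mapping is typically registered on the app's types registry (something along the lines of $app->types->type(woff2 => 'font/woff2') in startup; check the Mojolicious::Types docs linked above). As a language-neutral illustration, here is a minimal Python sketch of the extension-to-MIME mapping involved; the extension list and type values are common choices, not taken from Mojolicious:

```python
# Sketch (not Mojolicious itself): map web-font file extensions to MIME types,
# the same kind of registration Mojolicious::Types performs. The type values
# below are common modern choices, assumed for illustration.
FONT_TYPES = {
    ".woff": "font/woff",
    ".woff2": "font/woff2",
    ".ttf": "font/ttf",
    ".eot": "application/vnd.ms-fontobject",
}

def content_type_for(path: str, default: str = "application/octet-stream") -> str:
    """Return the MIME type for a file path, falling back to a safe default."""
    for ext, mime in FONT_TYPES.items():
        if path.lower().endswith(ext):
            return mime
    return default

print(content_type_for("fonts/OpenSans.woff2"))  # font/woff2
```

The point of the fallback is that a server which instead falls back to its error page (or to text/html) produces exactly the symptom described in the question.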


How to add Unicode emoji to the Internet Archive?

When visiting a website that contains Unicode emoji through the Wayback Machine, the emoji appear to be broken, for example:
https://web.archive.org/web/20210524131521/https://tmh.conlangs.de/emoji-language/
The emoji "😀" is rendered as "ðŸ˜€", and so forth.
This effect happens if a page is mistakenly rendered as if it was ISO-8859-1 encoded, even though it is actually UTF-8.
So it seems that the Wayback Machine is somehow confused about the character encoding of the page.
The original page source has an HTML5 <!doctype html> declaration and is valid HTML according to the W3C validator. The encoding is specified as utf-8 using a meta charset tag.
The original page renders correctly on all major platforms and browsers, for example Chrome on Linux, Safari on Mac OS, and Edge on Windows.
Does the Internet Archive crawler require a special way of specifying the encoding, or are emoji through UTF-8 simply not supported yet?
tl;dr The original page must be served with a charset in the HTTP content-type header.
As @JosefZ pointed out in the comments, the Wayback Machine mistakenly serves the page as windows-1252 (which has a similar effect to ISO-8859-1).
This is apparently the default encoding that the Internet Archive assumes if no charset can be detected.
The meta charset tag in the original page's source never takes effect when the archived page is rendered by the browser, because with all the extra JavaScript and CSS included by the Wayback Machine, the tag comes after the first 1024 bytes, which is too late according to the HTML5 specification: https://www.w3.org/TR/2012/CR-html5-20121217/document-metadata.html#charset
So it seems that the Internet Archive does not take into account meta charset tags when crawling a page.
However, there are other archived pages such as https://web.archive.org/web/20210501053710/https://unicode.org/emoji/charts-13.0/full-emoji-list.html where Unicode emoji are displayed correctly.
It turns out that this correctly rendered page was originally served with an HTTP content-type header that includes a charset: text/html; charset=UTF-8
So, if the webserver of the original page is configured to send such a content-type HTTP header that includes the UTF-8 encoding, the Wayback Machine should display the page correctly after reindexing.
How the webserver can be configured to send the encoding with the content-type header depends on the exact webserver that is being used.
For Apache, for example, adding
AddDefaultCharset UTF-8
to the site's configuration or .htaccess file should work.
Note that for the Internet Archive to actually reindex the page, you may have to make a change to the original page's HTML content, not just change the HTTP headers.
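The 1024-byte rule described above is easy to check for yourself. This is a hedged sketch, not the Internet Archive's actual logic; it simply tests whether a meta charset declaration falls within the first 1024 bytes of a document:

```python
# Sketch: check whether an HTML document declares its charset early enough.
# Per the HTML5 spec, a <meta charset> declaration should appear within the
# first 1024 bytes to be honored reliably.
import re

def charset_declared_early(html_bytes: bytes, limit: int = 1024) -> bool:
    """True if a meta charset declaration occurs within the first `limit` bytes."""
    head = html_bytes[:limit]
    return re.search(rb'<meta\s+charset=["\']?[\w-]+', head, re.IGNORECASE) is not None

page = b'<!doctype html><meta charset="utf-8"><title>ok</title>'
print(charset_declared_early(page))  # True

# A page whose meta tag is pushed past 1024 bytes (e.g. by injected scripts,
# as the Wayback Machine does) fails the check:
late = b"<!doctype html>" + b"<script></script>" * 100 + b'<meta charset="utf-8">'
print(charset_declared_early(late))  # False
```

This mirrors what happens to archived pages: the original markup declares its charset early, but the injected Wayback toolbar pushes the tag past the limit.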

Mixed Content: HTTPS site without specification

I am writing a program which will look for Mixed Content within a URL. The aim of this script is to extract all links in a page and convert these links to absolute links, and then to see if the content is mixed.
Let's say we have this page https://www.example.com/xxx1/. I'm assuming that any reference to links within this page will ALWAYS connect through to the HTTPS site, unless the link explicitly specifies otherwise?
E.g.
/index.html = will be HTTPS
http://www.example.com/img/insecureImage.jpg = will be HTTP, and therefore insecure?
True?
Thanks,
The situation with mixed content depends on whether the content is active or passive. If you have an HTTPS site, all active content will be blocked. If it is passive as in the case of the image you provided, it will be displayed by default, but users can choose in their browsers to block this too.
The example you give is of an image file, so that is passive mixed content and that would not be blocked by default, but could be by the user's settings as mentioned.
The following resources fit into that class:
img
audio
video
object
The guide I link to explains the active/passive mixed content quite well.
MDN Guide on Mixed Content
Yes, regardless of mixed content, if you see a relative link it is resolved against the origin, so in your example /index.html should be interpreted as https://www.example.com/index.html.
If they are absolute links, determining if it's mixed content is exactly like you suggest: check the URI scheme. To reference mixed content, even from the same server, you need to use absolute links, which makes your task fairly easy.
You're on the right track.
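The approach both answers describe can be sketched in a few lines. This is a minimal illustration (the page URL and links are the question's examples): resolve each link against the page URL, then flag anything that would load over plain HTTP.

```python
# Sketch of the mixed-content check: resolve each link against the page URL,
# then flag http:// resources found on an https:// page.
from urllib.parse import urljoin, urlparse

def find_mixed_content(page_url: str, links: list[str]) -> list[str]:
    """Return the absolute URLs that would load over plain HTTP."""
    insecure = []
    for link in links:
        absolute = urljoin(page_url, link)  # relative links inherit the page's scheme
        if urlparse(absolute).scheme == "http":
            insecure.append(absolute)
    return insecure

page = "https://www.example.com/xxx1/"
links = ["/index.html", "http://www.example.com/img/insecureImage.jpg"]
print(find_mixed_content(page, links))
# ['http://www.example.com/img/insecureImage.jpg']
```

Note this only classifies URLs by scheme; deciding whether a flagged resource is active or passive content would additionally require looking at the tag (img, audio, video, object) that references it.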

iPhone Safari offline-cache manifest not working correctly

I'm working on a mobile site for the iPhone. I've added a cache manifest and loaded it with a list of resources needed for offline capability. The manifest file has the correct content type. You can view the manifest file in the header of this page:
http://www.rvddps.com/apps/sixshot/booking.html
I had a bunch of links to pages, but due to my user level I'm only allowed to post one link. You can see the manifest file there, along with the source code of the page I'm trying to cache.
I've set the correct MIME type on the server, but the cache only seems to work occasionally, not all the time. I've tried following Apple's official caching guidelines as well.
Can anyone point out where I'm going wrong?
Thanks
Daniel
I looked at the manifest file and found 'Â' characters in some of the blank lines. What text editor are you using? Make sure you use the proper encoding and line ending types.
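Those 'Â' characters are typically UTF-8 non-breaking spaces (bytes 0xC2 0xA0) displayed as Latin-1. A quick sketch of how you might scan a manifest for such stray bytes (the sample manifest content here is made up):

```python
# Sketch: scan a cache manifest for bytes outside printable ASCII, such as the
# UTF-8 non-breaking space (0xC2 0xA0) that shows up as 'Â ' in some editors.
def find_suspicious_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte) pairs for bytes that are not printable ASCII,
    tab, CR, or LF."""
    allowed = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b not in allowed]

manifest = b"CACHE MANIFEST\n\xc2\xa0\nindex.html\n"  # the "blank" line holds a NBSP
print(find_suspicious_bytes(manifest))  # [(15, 194), (16, 160)]
```

An empty result means the manifest is clean ASCII; any hits point at the exact offsets to fix in your editor.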

How to serve .RTFs

I support a web application that displays reports from a database. Occasionally, a report will contain an attachment (typically an image or document, which is stored in the database as well).
We serve the attachment via a dynamic .htm resource which streams the attachment from the database, and populates the content-type based on what type of attachment it is (we support PDFs, RTFs, and various image formats)
For RTFs we've come across a problem. It seems a lot of Windows users don't have a default association for the 'application/rtf' content-type (they do have an association for the *.rtf file extension). As a result, clicking on the link to the attachment doesn't do anything in Internet Explorer 6.
Returning 'application/msword' as the content-type seems to make the RTF viewable when clicking on the link, but only for people who have MS Office installed (some of the users won't have this installed, and will use alternate RTF readers, like OpenOffice).
This application is accessed publicly, so we don't have control of the user's machine settings.
Has anybody here solved this before? And how? Thanks!
Use application/octet-stream content-type to force download. Once it's downloaded, it should be viewable in whatever is registered to handle .rtf files.
In addition to the Content-Type header, you also need to add the following:
Content-Disposition: attachment; filename=my-document.rtf
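The two headers described above can be assembled like this. A minimal framework-neutral sketch (the filename is just an example; real code should also sanitize or encode filenames containing quotes or non-ASCII characters):

```python
# Sketch: the response headers for forcing an RTF download instead of
# relying on the browser's content-type associations.
def rtf_download_headers(filename: str) -> list[tuple[str, str]]:
    return [
        ("Content-Type", "application/octet-stream"),  # force download, don't render
        ("Content-Disposition", f'attachment; filename="{filename}"'),
    ]

print(rtf_download_headers("my-document.rtf"))
```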
Wordpad (which is on pretty much every Windows machine) can view RTF files. Is there an 'application/wordpad' content-type?
Alternatively, given the rarity of RTF files, your best solution might be to use a server-side component to open the RTF file, convert it to some other format (like PDF or straight HTML), and serve that to the requesting client. I don't know what language/platform you're using on the server side, so I can't say what to use for this.

How is mime type of an uploaded file determined by browser?

I have a web app where the user needs to upload a .zip file. On the server-side, I am checking the mime type of the uploaded file, to make sure it is application/x-zip-compressed or application/zip.
This worked fine for me on Firefox and IE. However, when a coworker tested it, it failed for him on Firefox (sent mime type was something like "application/octet-stream") but worked on Internet Explorer. Our setups seem to be identical: IE8, FF 3.5.1 with all add-ons disabled, Windows XP SP3, WinRAR installed as native .zip file handler (not sure if that's relevant).
So my question is: How does the browser determine what mime type to send?
Please note: I know that the mime type is sent by the browser and, therefore, unreliable. I am just checking it as a convenience--mainly to give a more friendly error message than the ones you get by trying to open a non-zip file as a zip file, and to avoid loading the (presumably heavy) zip file libraries.
Chrome
Chrome (version 38 as of writing) has 3 ways to determine the MIME type and does so in a certain order. The snippet below is from file src/net/base/mime_util.cc, method MimeUtil::GetMimeTypeFromExtensionHelper.
// We implement the same algorithm as Mozilla for mapping a file extension to
// a mime type. That is, we first check a hard-coded list (that cannot be
// overridden), and then if not found there, we defer to the system registry.
// Finally, we scan a secondary hard-coded list to catch types that we can
// deduce but that we also want to allow the OS to override.
The hard-coded lists come a bit earlier in the file: https://cs.chromium.org/chromium/src/net/base/mime_util.cc?l=170 (kPrimaryMappings and kSecondaryMappings).
An example: when uploading a CSV file from a Windows system with Microsoft Excel installed, Chrome will report this as application/vnd.ms-excel. This is because .csv is not specified in the first hard-coded list, so the browser falls back to the system registry. HKEY_CLASSES_ROOT\.csv has a value named Content Type that is set to application/vnd.ms-excel.
Internet Explorer
Again using the same example, the browser will report application/vnd.ms-excel. I think it's reasonable to assume Internet Explorer (version 11 as of writing) uses the registry. Possibly it also makes use of a hard-coded list like Chrome and Firefox, but its closed source nature makes it hard to verify.
Firefox
As indicated in the Chrome code, Firefox (version 32 as of writing) works in a similar way. The snippet below is from file uriloader\exthandler\nsExternalHelperAppService.cpp, method nsExternalHelperAppService::GetTypeFromExtension:
// OK. We want to try the following sources of mimetype information, in this order:
// 1. defaultMimeEntries array
// 2. User-set preferences (managed by the handler service)
// 3. OS-provided information
// 4. our "extras" array
// 5. Information from plugins
// 6. The "ext-to-type-mapping" category
The hard-coded lists come earlier in the file, somewhere near line 441. You're looking for defaultMimeEntries and extraMimeEntries.
With my current profile, the browser will report text/csv because there's an entry for it in mimeTypes.rdf (item 2 in the list above). With a fresh profile, which does not have this entry, the browser will report application/vnd.ms-excel (item 3 in the list).
Summary
The hard-coded lists in the browsers are pretty limited. Often, the MIME type sent by the browser will be the one reported by the OS. And this is exactly why, as stated in the question, the MIME type reported by the browser is unreliable.
Kip, I spent some time reading RFCs, MSDN and MDN. Here is what I could understand: when a browser encounters a file for upload, it looks at the first buffer of data it receives and runs tests on it. These tests try to determine whether the file is a known MIME type, and if so, which one, and then act accordingly. I think IE tries to do this first, rather than just determining the file type from the extension. This page explains the process for IE: http://msdn.microsoft.com/en-us/library/ms775147%28v=vs.85%29.aspx. For Firefox, what I could understand is that it tries to read file info from the filesystem or directory entry and then determines the file type. Here is a link for FF: https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIFile. I would still like to have more authoritative info on this.
This is probably OS and possibly browser dependent, but on Windows, the MIME type for a given file extension can be found by looking in the registry under HKCR:
For example:
HKEY_CLASSES_ROOT\.zip
- Content Type
To go from MIME type to file extension, you can look at the keys under
HKEY_CLASSES_ROOT\Mime\Database\Content Type
to get the default extension for a particular MIME type.
While this is not an answer to your question, it does solve the problem you are trying to solve. YMMV.
As you wrote, the MIME type is not reliable, as each browser has its own way of determining it. However, browsers send the original name (including extension) of the file, so the best way to deal with the problem is to inspect the extension of the file instead of the MIME type.
If you still need the mime type, you can use your own apache's mime.types to determine it server-side.
I agree with johndodo, there are so many variables that make MIME types sent from browsers unreliable. I would ignore the subtype that is received and just focus on the type, like 'application'. If your app is PHP-based, you can easily do this using the function explode().
In addition, just check the file extension to make sure it is .zip or whatever other compression you are looking for!
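Putting the advice from these answers together, a server-side validation might look like the sketch below. The accepted-type list reflects values mentioned in the question and answers; adjust it to whatever your own users' browsers actually send:

```python
# Sketch of the validation the answers suggest: rely primarily on the file
# extension, and accept the common zip MIME types browsers are known to send.
ACCEPTED_ZIP_TYPES = {
    "application/zip",
    "application/x-zip-compressed",
    "application/octet-stream",  # generic fallback some browsers send
}

def looks_like_zip_upload(filename: str, reported_mime: str) -> bool:
    """Accept the upload if the extension is .zip and the reported MIME type
    is one we have seen browsers use (or the catch-all)."""
    return filename.lower().endswith(".zip") and reported_mime in ACCEPTED_ZIP_TYPES

print(looks_like_zip_upload("archive.zip", "application/octet-stream"))  # True
print(looks_like_zip_upload("archive.rar", "application/zip"))           # False
```

As the question notes, this is only a convenience check for friendlier error messages, not a security measure.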
According to rfc1867 - Form-based file upload in HTML:
Each part should be labelled with an appropriate content-type if the
media type is known (e.g., inferred from the file extension or
operating system typing information) or as application/octet-stream.
So my understanding is, application/octet-stream is kind of like a blanket catch-all identifier if the type cannot be inferred.
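Because application/octet-stream is that catch-all, the only reliable check is the file content itself. A sketch using the standard library (zip archives start with the "PK" magic bytes, which zipfile checks for you):

```python
# Sketch: verify the content rather than trusting the browser-reported MIME
# type. zipfile.is_zipfile accepts a file-like object and checks the magic.
import io
import zipfile

def is_really_zip(data: bytes) -> bool:
    """True if the bytes are actually a zip archive, whatever the browser said."""
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a tiny in-memory zip to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi")
print(is_really_zip(buf.getvalue()))  # True
print(is_really_zip(b"not a zip"))    # False
```

This sidesteps the whole question of how each browser derives the MIME type, at the cost of reading (part of) the uploaded file on the server.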