You might know that HTML-related file formats can be compressed server-side using gzip (e.g. by mod_gzip on Apache servers) and decompressed by compatible browsers ("content encoding").
Does this only work for HTML/XML files? Let's say my PHP/Perl script generates some simple comma-delimited data and sends it to the browser; will it be encoded by default?
What about platforms like Silverlight or Flash: when they download such data, will it be compressed/decompressed by the browser/runtime automatically? Is there any way to test this?
Does this only work for HTML/XML files?
No: it is quite often used for CSS and JS files too. As those are among the biggest things websites are made of (apart from images), thanks to JS frameworks and full-JS applications, compressing them represents a huge gain!
Actually, any text-based format compresses quite well (images, on the other hand, generally don't, as they are usually already compressed); sometimes the JSON data returned from Ajax requests is compressed too -- it's text data, after all ;-)
Let's say my PHP/Perl script generates some simple comma-delimited data and sends it to the browser; will it be encoded by default?
It's a matter of configuration: if you configured your server to compress that kind of content, it'll probably be compressed :-)
(provided the browser says it accepts gzip-encoded data)
Here's a sample configuration for Apache 2 (using mod_deflate) that I use on my blog:
<IfModule mod_deflate.c>
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css text/javascript application/javascript application/x-javascript application/xml
</IfModule>
Here, I want html/xml/css/JS to be compressed.
And here is the same thing, give or take a few configuration options, that I once used under Apache 1 (mod_gzip):
<IfModule mod_gzip.c>
mod_gzip_on Yes
mod_gzip_can_negotiate Yes
mod_gzip_minimum_file_size 256
mod_gzip_maximum_file_size 500000
mod_gzip_dechunk Yes
mod_gzip_item_include file \.css$
mod_gzip_item_include file \.html$
mod_gzip_item_include file \.txt$
mod_gzip_item_include file \.js$
mod_gzip_item_include mime text/html
mod_gzip_item_exclude mime ^image/
</IfModule>
Things to notice here: I don't want files that are too small (the gain wouldn't be significant) or too big (compressing them would eat too much CPU) to be compressed; and I want css/html/txt/js files to be compressed, but not images.
If you want your comma-separated data to be compressed the same way, you'll have to add either its content-type or its extension to the configuration of your webserver, to activate gzip compression for it.
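For instance, assuming your script sends a text/csv content-type (an assumption; your script may send something else), extending the mod_deflate configuration above would look like this:
<IfModule mod_deflate.c>
AddOutputFilterByType DEFLATE text/csv
</IfModule>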
Is there any way to test this?
For any content returned directly to the browser, the Firefox extensions Firebug and LiveHTTPHeaders are must-haves.
For content that doesn't go through the browser's standard communication channel, it might be harder; in the end, you may have to use something like Wireshark to "sniff" what is really going through the pipes... Good luck with that!
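If you'd rather test programmatically, here's a minimal sketch using PHP's cURL extension: it advertises gzip support and checks whether the response claims to be compressed (the URL is hypothetical; point it at your own resource):
<?php
// Fetch the headers only, advertising gzip support.
// Note: some servers skip compression for HEAD requests;
// drop CURLOPT_NOBODY if in doubt.
$ch = curl_init('http://example.com/data.csv'); // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip'));
$headers = curl_exec($ch);
curl_close($ch);
// If the server compressed the response, it must say so here.
echo (stripos($headers, 'Content-Encoding: gzip') !== false)
    ? "Response is gzip-compressed\n"
    : "Response is not compressed\n";
?>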
What about platforms like Silverlight or Flash: when they download such data, will it be compressed/decompressed by the browser/runtime automatically?
To answer your question about Silverlight and Flash: if they send an Accept-Encoding header indicating they support compressed content, Apache will use mod_deflate or mod_gzip. If they don't support compression, they won't send the header and will get the content uncompressed. It will "just work." – Nate
I think Apache's mod_deflate is more common than mod_gzip these days, because it's built in and does the same thing. Look at the documentation for mod_deflate and you'll see that it's easy to specify which file types to compress, based on their MIME types. Generally it's worth compressing HTML, CSS, XML and JavaScript. Images are already compressed, so they don't benefit from being compressed again.
The browser sends an Accept-Encoding header listing the types of compression it knows how to understand. The server looks at this, along with the User-Agent, and decides how to encode the result. Some browsers lie about what they can understand, so this is more complex than just searching for "deflate" in the header.
Technically, any HTTP 2xx response with content can be content-encoded using any of the valid content encodings (gzip, deflate, etc.), but in practice it's wasteful to apply compression to common image types because it actually makes them larger.
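To illustrate, here is roughly what that negotiation looks like from a PHP script's point of view; a simplified sketch (ob_gzhandler, mentioned below, performs a similar check internally, and real servers also handle q-values and browser quirks):
<?php
// Simplified content negotiation: compress only if the client allows it.
$accepts = isset($_SERVER['HTTP_ACCEPT_ENCODING'])
    ? $_SERVER['HTTP_ACCEPT_ENCODING'] : '';
if (strpos($accepts, 'gzip') !== false) {
    ob_start('ob_gzhandler'); // buffers and gzips the output
}
echo "field1,field2,field3\n"; // hypothetical comma-delimited payload
?>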
You can definitely compress the response from dynamic PHP pages. The simplest method is to add:
<?php ob_start("ob_gzhandler"); ?>
to the start of every PHP page. It's better to set it up through the PHP configuration, of course.
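For instance, the zlib extension can enable it globally in php.ini (both directives exist; the compression level is just an example value):
zlib.output_compression = On
zlib.output_compression_level = 6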
There are many test pages, easily found with Google:
http://www.whatsmyip.org/http_compression/
http://www.gidnetwork.com/tools/gzip-test.php
http://nontroppo.org/tools/gziptest/
http://www.nibbleguru.com/tools/gzip-test.php
Related
When visiting a website that contains Unicode emoji through the Wayback Machine, the emoji appear to be broken, for example:
https://web.archive.org/web/20210524131521/https://tmh.conlangs.de/emoji-language/
The emoji "😀" is rendered as "ðŸ˜€", and so forth.
This effect happens if a page is mistakenly rendered as if it was ISO-8859-1 encoded, even though it is actually UTF-8.
So it seems that the Wayback Machine is somehow confused about the character encoding of the page.
The original page source has an HTML5 <!doctype html> declaration and is valid HTML according to W3C's validator. The encoding is specified as utf-8 using a meta charset tag.
The original page renders correctly on all major platforms and browsers, for example Chrome on Linux, Safari on Mac OS, and Edge on Windows.
Does the Internet Archive crawler require a special way of specifying the encoding, or are emoji through UTF-8 simply not supported yet?
tl;dr: The original page must be served with a charset in the HTTP Content-Type header.
As @JosefZ pointed out in the comments, the Wayback Machine mistakenly serves the page as windows-1252 (which has a similar effect to ISO-8859-1).
This is apparently the default encoding that the Internet Archive assumes if no charset can be detected.
The meta charset tag in the original page's source never takes effect when the archived page is rendered by the browser: with all the extra JavaScript and CSS injected by the Wayback Machine, the tag comes after the first 1024 bytes, which, according to the HTML5 specification, is too late: https://www.w3.org/TR/2012/CR-html5-20121217/document-metadata.html#charset
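For reference, a minimal page of the kind being discussed; per the spec linked above, the meta tag only takes effect if it falls within the first 1024 bytes:
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"> <!-- must appear within the first 1024 bytes -->
  <title>Emoji test</title>
</head>
<body>😀</body>
</html>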
So it seems that the Internet Archive does not take into account meta charset tags when crawling a page.
However, there are other archived pages such as https://web.archive.org/web/20210501053710/https://unicode.org/emoji/charts-13.0/full-emoji-list.html where Unicode emoji are displayed correctly.
It turns out that this correctly rendered page was originally served with an HTTP Content-Type header that includes a charset: text/html; charset=UTF-8
So, if the webserver of the original page is configured to send a Content-Type header that includes the UTF-8 encoding, the Wayback Machine should display the page correctly after reindexing.
How the webserver can be configured to send the encoding with the content-type header depends on the exact webserver that is being used.
For Apache, for example, adding
AddDefaultCharset UTF-8
to the site's configuration or .htaccess file should work.
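If the page is generated dynamically rather than served statically, the same header can be sent from code; a minimal PHP sketch:
<?php
// Must run before any output is sent to the browser.
header('Content-Type: text/html; charset=UTF-8');
?>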
Note that for the Internet Archive to actually reindex the page, you may have to make a change to the original page's HTML content, not just change the HTTP headers.
When I read mail, I sometimes want to select one of the links in the message text to open it in a web browser.
Before you answer: I know there is urlview, but there are also BASE64-encoded mails (or mails with other transfer encodings) in which urlview does not find any URLs. Then there are also HTML-only mails, which can likewise carry transfer encodings.
I wonder if there is a trivial and/or nice solution that I couldn't find. I cannot be the only one with this problem. It does not need to be based on urlview, of course.
urlview will work if you employ the "pipe_decode" setting. Example use in a macro, binding to "\u":
macro index,pager \\u "<enter-command>set pipe_decode = yes<enter><pipe-message>urlview<enter><enter-command>set pipe_decode = no<enter>" "view URLs"
With urlscan, there exists a worthy successor to urlview. From its documentation:
Support for emails in quoted-printable and base64 encodings. [..] For HTML mails, a crude parser is used to render the HTML into text.
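For example, a macro analogous to the urlview one above (the \cb binding is just an illustrative choice):
macro index,pager \cb "<pipe-message>urlscan<enter>" "extract URLs with urlscan"
Since urlscan decodes quoted-printable and base64 itself, as quoted above, the pipe_decode toggling is not strictly needed here.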
I have a series of uncompressed (binary/octet-stream) files on Google Cloud Storage. I'm trying to download them gzip-compressed. According to this page
https://developers.google.com/storage/docs/json_api/v1/how-tos/performance
I can add
Accept-Encoding: gzip
User-Agent: my program (gzip)
and download the files compressed. This does not work for me; the files always come back uncompressed. Am I missing something? Has anyone else experienced the same issue?
You can add that header to indicate that you're willing to receive gzipped content, but the HTTP spec says that there is no guarantee. In case of Google Cloud Storage, unless the object was already uploaded with gzip content-encoding, the response will not have gzipped content (i.e. GCS does not dynamically compress objects).
(The linked docs page could probably be more clear about this, I'll suggest to clarify this issue.)
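If you control the uploads, one way to get compressed transfers is to store the objects gzip-compressed with a matching Content-Encoding in the first place, e.g. with gsutil's -z flag (bucket and file names are hypothetical):
gsutil cp -z csv mydata.csv gs://my-bucket/
GCS then stores the object with Content-Encoding: gzip and can serve it compressed to clients that accept gzip.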
Apologies if this is a basic question; I'm a very casual programmer.
I'm writing a program which will search for torrents, grab them based on certain criteria (one being that they are indicated as freeware, you'll be pleased to hear) and then throw them over to uTorrent. I'm getting stuck downloading the .torrent file because, I believe, of the encoding.
I've worked out thus far that the bulk of the top of the file can be gzip-deflated on the fly using HTTPrequest, but it seems that halfway through the file something changes. Looking in a hex editor at a .torrent grabbed directly from a site versus the one I download here, everything is identical up to a point, and then it is totally different.
If I'm being vague, I'm afraid it's because I'm making this all up as I go along! Is it likely that the encoding/compression in a torrent file would change partway through, and how could I catch this in VB to avoid corrupting the latter half?
Thanks very much in advance,
Dan
I support a web application that displays reports from a database. Occasionally, a report will contain an attachment (typically an image or document, which is stored in the database as well).
We serve the attachment via a dynamic .htm resource that streams the attachment from the database and populates the content-type based on what type of attachment it is (we support PDFs, RTFs, and various image formats).
For RTFs we've come across a problem. It seems a lot of Windows users don't have an association for the 'application/rtf' content-type by default (they do have an association for the *.rtf file extension). As a result, clicking on the link to the attachment doesn't do anything in Internet Explorer 6.
Returning 'application/msword' as the content-type seems to make the RTF viewable when clicking on the link, but only for people who have MS Office installed (some of the users won't have this installed, and will use alternate RTF readers, like OpenOffice).
This application is accessed publicly, so we don't have control of the user's machine settings.
Has anybody here solved this before? And how? Thanks!
Use the application/octet-stream content-type to force a download. Once it's downloaded, it should be viewable in whatever is registered to handle .rtf files.
In addition to the Content-Type header, you also need to add the following:
Content-Disposition: attachment; filename=my-document.rtf
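In PHP, for instance, serving the attachment this way might look like the following sketch (path and filename are hypothetical; in your case you'd stream from the database instead):
<?php
// Force a download rather than inline display.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="my-document.rtf"');
readfile('/path/to/my-document.rtf'); // hypothetical path
?>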
Wordpad (which is on pretty much every Windows machine) can view RTF files. Is there an 'application/wordpad' content-type?
Alternatively, given the rarity of RTF files, your best solution might be to use a server-side component to open the RTF file, convert it to some other format (like PDF or straight HTML), and serve that to the requesting client. I don't know what language/platform you're using on the server side, so I don't know what to tell you to use for this.