Detect Facebook user agent in .htaccess file and disable gzip - facebook

We have an issue with Facebook not recognizing images because of gzip compression on our servers.
First off, our websites need to use gzip, so the answer of simply turning gzip off isn't applicable. Our servers use gzip by default and it's a good thing, so we need to keep that in place.
I understand that gzipping images might have negligible impact, but we are using it nonetheless.
What I'm looking to do, ideally, is turn off gzip when the website is visited by a Facebook bot and leave it enabled otherwise, i.e. when the detected user agent is any of the following...
facebookexternalhit/1.0
facebookexternalhit/1.1
Facebot
We disable gzip (i.e. SetEnv no-gzip 1, I assume).
We want to do this within each site's .htaccess file.
Is there a way to do this in an .htaccess file? If so, can anyone supply a sample?
Appreciate your help.

You should not be gzipping images anyway.
http://gtmetrix.com/enable-gzip-compression.html
Gzip compression won't help for images, PDFs and other binary formats, which are already compressed.
Here is a good sample of mime types that work well with gzip:
application/atom+xml
application/javascript
application/json
application/rss+xml
application/vnd.ms-fontobject
application/x-font-ttf
application/x-web-app-manifest+json
application/xhtml+xml
application/xml
font/opentype
image/svg+xml
image/x-icon
text/css
text/plain
text/x-component
https://github.com/h5bp/server-configs-nginx/blob/3db5d61f81d7229d12b89e0355629249a49ee4ac/nginx.conf#L93
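In Apache terms, limiting mod_deflate to those text-based types (rather than compressing everything, images included) could look roughly like the snippet below. This is a sketch, not a drop-in config: it assumes mod_deflate is enabled and that AddOutputFilterByType is available (mod_filter in Apache 2.4).
# Compress only text-based responses; images and other binaries are left alone
AddOutputFilterByType DEFLATE text/plain text/css text/x-component
AddOutputFilterByType DEFLATE application/javascript application/json application/xml application/xhtml+xml
AddOutputFilterByType DEFLATE application/rss+xml application/atom+xml
AddOutputFilterByType DEFLATE image/svg+xml image/x-icon
AddOutputFilterByType DEFLATE font/opentype application/x-font-ttf application/vnd.ms-fontobject
# You will likely also want text/html here; the nginx list above omits it only because nginx compresses text/html implicitly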
Also see: https://superuser.com/a/139273
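That said, if you still want to switch gzip off only for the Facebook crawlers, something along these lines in each site's .htaccess should do it. This is a sketch that assumes mod_setenvif is enabled and that mod_deflate handles your compression (it skips any request where the no-gzip environment variable is set); verify it against your own setup.
# Flag requests from Facebook's crawlers so compression is skipped for them
SetEnvIfNoCase User-Agent "facebookexternalhit" no-gzip
SetEnvIfNoCase User-Agent "Facebot" no-gzip
The facebookexternalhit pattern is an unanchored match, so it covers both the /1.0 and /1.1 variants listed above.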

Related

How to store and retrieve static json and html files from CMS?

We already have a Hippo CMS instance hosted for serving static web pages. We would like to store static JSON and HTML pages in the same instance and be able to render them as-is (Content-Type: application/json or text/html). Could you please guide me on whether that is possible in Hippo CMS?
I believe the simplest solution is to upload the JSON files as content "Assets" (you can do this for PDF, JSON, CSS, JavaScript, etc.).
Then the path to the file will look like http://localhost:8080/site/binaries/content/assets/my-project/my-file.json (or whatever path & filename it was uploaded to).
The Assets folder will set the mime-type and repeat that as the Content-Type header, such as Content-Type: application/json;charset=UTF-8

Is our robots.txt file formatted correctly?

I'm trying to make sure our robots.txt file is correct and would greatly appreciate some info. We want all bots to be able to crawl and index the homepage and the 'sample triallines' but that's it. Here's the file:
User-agent: *
Allow: /$
Allow: /sample-triallines$
Disallow: /
Can anyone please let me know if this is correct?
Thanks in advance.
You can test your robots.txt file directly with a robots testing tool or within the webmaster tools of most major search engines (e.g. Google Search Console). Your current robots.txt file will work for most crawlers for the exact URLs you mentioned (e.g. https://www.example/ and https://www.example/sample-triallines).
However, just to note: if your URLs deviate from these exact URLs (e.g. with tracking parameters) they will be blocked for crawlers. For example, the URLs below will be blocked with the current robots.txt setup, which may or may not be acceptable for what you're working on.
https://www.example/index.html
https://www.example/?marketing=promo
https://www.example/sample-triallines/
https://www.example/sample-triallines?marketing=promo
If any of these above URLs need to be crawled you'll just need to add additional directives into the robots.txt file as needed and test them within the robots testing tools. Additional information on robots directives can be found here.
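For example, a sketch of what those extra directives could look like for the URLs above; wildcard-style matching like this is understood by Google and Bing but not by every crawler, so re-test it in a robots testing tool:
User-agent: *
Allow: /$
Allow: /index.html$
Allow: /?
Allow: /sample-triallines$
Allow: /sample-triallines/$
Allow: /sample-triallines?
Disallow: /
The ? rules are plain prefix matches (only * and $ are special in robots.txt), so they cover any query string appended to those two pages.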
Hope this helps

encoding issue on some pages (json files) served by github pages

I have an encoding issue on this repository: https://github.com/franceimage/franceimage.github.io
1/ Accents are wrong when I display https://franceimage.github.io/json/youtube.json in my browser (served by GitHub).
2/ However, accents are right when I display the same page served locally (jekyll serve).
3/ Accents are right on the HTML pages (served by GitHub Pages).
Can somebody explain what is happening?
When you request json/youtube.json:
Locally, you get a Content-Type: application/json; charset=UTF-8 response header.
From GitHub Pages, you get Content-Type: application/json.
The transmitted files are identical.
As RFC 4627 states: "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
It seems that browsers are not falling back to UTF-8 when they receive a bare Content-Type: application/json response header.
One idea is to raise this with the Jekyll/GitHub Pages community; maybe you can file a feature request to get GitHub Pages to send the charset in that header.
Jekyll Talk can be a good entry point for such a question.

Stop web.archive.org from saving the site pages

I tried accessing facebook.com webpages from a previous point in time,
and the site showed me an error saying it cannot save pages because of the site's robots.txt.
Can anyone tell me which statements in the robots.txt are making the site inaccessible to web.archive.org?
I guess it is because of the #permission statement mentioned here (http://facebook.com/robots.txt).
Is there any other way I can do this for my site as well?
I also don't want woorank.com or builtwith.com to analyze my site.
Note: search engine bots should face no problems crawling and indexing my site if I add statements to robots.txt to achieve the results mentioned above.
The Internet Archive (archive.org) crawler uses the User-Agent value ia_archiver (see their documentation).
So if you want to target this bot in your robots.txt, use
User-agent: ia_archiver
And this is exactly what Facebook does in its robots.txt:
User-agent: ia_archiver
Allow: /about/privacy
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
Disallow: /
If you would like to submit a request for archives of your site or
account to be excluded from web.archive.org, send us a request to
info@archive.org and indicate:
https://help.archive.org/help/how-do-i-request-to-remove-something-from-archive-org/
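Applied to your own site, the minimal version is the group below; since it only restricts ia_archiver and adds nothing under User-agent: *, search engine bots are unaffected. Blocking woorank.com or builtwith.com the same way would require knowing the exact User-Agent tokens their crawlers send (check their documentation), and keep in mind a robots.txt rule only works for bots that choose to honour it.
# Keep the Internet Archive crawler out; all other bots keep their normal access
User-agent: ia_archiver
Disallow: /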

How to force links to open in iOS Safari?

My webpage has links to download Passbook .pkpass files.
This works fine in Safari for iOS, since Apple's browser supports the MIME type application/vnd.apple.pkpass.
Facebook's iOS browser (as well as others) does not (yet) support this mime type. Therefore, if a user follows a link to my site from within Facebook, they can't download my Passbook files. However, if they click on 'Open in Safari' then they can download the file.
How can I code my webpage such that clicking on a link will force open Safari on iOS?
Andrew
These headers, together with Apache's ForceType directive, should be helpful for what you're doing:
Content-Type "application/force-download"
Content-Description "File Transfer"
Content-Disposition attachment
ForceType "application/octet-stream"
I suggest you try to set them in your .htaccess or httpd.conf file with the following code:
<FilesMatch "\.(pkpass)$">
  ForceType application/octet-stream
  Header set Content-Type "application/force-download"
  Header set Content-Description "File Transfer"
  Header set Content-Disposition attachment
</FilesMatch>
It's a little overkill, but will ensure the download is forced across all browsers. Change the pkpass to anything else to force the download of any other file type.
I didn't manage to find a way to do this yet. Somehow, forced pkpass downloads won't work in the Facebook mobile browser.
The best way is to guide the user to open the page in Safari.