Googlebot visits disallowed pages - robots.txt

The following user agent visits my URL, which is disallowed in robots.txt. In other words, Googlebot still crawls the page.
url: https://www.example.com/page?param=1
robots.txt
User-agent: *
Allow: /
Disallow: /page
userAgent
"userAgent":"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Related

How to redirect from image path

I'm using the Facebook Sharer snippet to allow users to share a specific post to their Facebook page. It's sharing a dynamic image passed into the FB Sharer (not the main url), so the image is www.site.com/path/to/image.jpg - when the user clicks on it from the actual Facebook post, as expected, it routes to that www.site.com/path/to/image.jpg.
My question is: can I set up a redirect to ALWAYS send visitors to the index.html page whenever www.site.com/path is hit? I was thinking I could use JavaScript to redirect, but you can't fire JavaScript on a .jpg page. Is this an htaccess thing? If so, I have no idea how to go about that either. I am using Amazon S3 with CloudFront.
Here is the FB Sharer -
window.open(`http://www.facebook.com/sharer.php?u=${root}&t=${title}`)
This is working correctly, but I'm looking for help on how to redirect any URL that has /path/ after www.site.com.
This could work - you need to exclude the Facebook Crawler:
RewriteEngine on
RewriteCond %{HTTPS} off
RewriteCond %{HTTP_USER_AGENT} !facebookexternalhit/[0-9]
RewriteRule ^path/to/([^.]+\.(jpe?g|gif|bmp|png))$ http://www.newurl.com [R=301,L,NC]
Sources:
htaccess redirect all images to different location and put image name in new url
Exempt Facebook Crawler from .htaccess redirect
I tested it here: https://htaccess.madewithlove.be/
That said, the ideal/usual approach is NOT to share images directly, but to share page URLs that reference the image in an og:image tag.
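For reference, that usual approach means pointing Facebook at a normal HTML page whose Open Graph tags reference the image, roughly like this (all URLs and text here are placeholders):
<meta property="og:url" content="http://www.site.com/path/to/post.html" />
<meta property="og:image" content="http://www.site.com/path/to/image.jpg" />
<meta property="og:title" content="Post title" />
<meta property="og:description" content="Short description of the post" />
You would then pass that page's URL, not the image URL, as the u parameter to sharer.php.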

Need to prevent a website link from being displayed in Bing search results

I have a web application that shows up in Bing search. I do not want the application link to show up in Bing search results. In the application root directory we have a robots.txt file that contains the following:
User-agent: *
Disallow: /
User-agent: bingbot
Disallow: /
However, the link still shows up in Bing search. I also tried using this tag in the header section of specific web pages:
<meta name="robots" content="noindex,nofollow">
However, the link is still displayed in Bing search. We waited 2-3 weeks and more, but the links are still appearing.
We contacted Bing Webmaster tools support and they suggested that for a URL to be removed from their index it has to be deleted from our site so that the URL returns a 404 (Not Found) or 410 (Gone) HTTP status. They also mentioned that in order for Bing to detect that the page has in fact been removed from the site and is now returning a 404 or 410 HTTP status code, Bingbot needs to be able to access the URL, so we should not block the URL from being re-crawled through robots.txt.
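To make that advice concrete (a sketch only, not a recommendation for this particular site): for Bingbot to see either the 410 or the noindex tag above, robots.txt has to stop blocking it, for example:
User-agent: bingbot
Disallow:
An empty Disallow lets Bingbot crawl everything even if the wildcard group still blocks other bots. The retired URL then has to answer 410 Gone (with nginx, for instance, location = /retired-page { return 410; } where the path is a placeholder) or keep serving the noindex meta tag until Bing has re-crawled it.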
Now the problem here is that we cannot delete our site or redirect it to a 404 error page, since it is used by our client. Google search does not show the link, but Bing does. Is there any other way to make the link(s) not appear in Bing search?
You can do that from the Bing Webmaster Tools console if you need quick removal: https://www.bing.com/webmaster/help/block-urls-from-bing-264e560a
Have a look at the last paragraph, on blocking URLs immediately.

Switching website from http to https, lost Facebook share count

Since recently switching a website from HTTP to HTTPS, the News section of the site (which has a Facebook "Share" button) has lost all the counts for those URLs.
I have tried the following:
Adding an og:url tag in the HEAD section of the website: <meta property="og:url" content="http://www.example.com" />
Modifying the Facebook code to count http likes
Using the Facebook Developers sharing debugger - the client entered the old http URL from before the switch and re-scraped it, but there was no change to the share counter
Using a 301 redirect from HTTP to HTTPS, so anybody trying to go to an http URL is redirected to the new https URL
Any help would be greatly appreciated.
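One way to see where Facebook thinks the counts live is to query the Graph API URL node for both variants, assuming the engagement field is still exposed and you have a valid app access token (the article URL and token below are placeholders):
curl "https://graph.facebook.com/?id=http://www.example.com/news/article&fields=engagement&access_token=APP_ID|APP_SECRET"
curl "https://graph.facebook.com/?id=https://www.example.com/news/article&fields=engagement&access_token=APP_ID|APP_SECRET"
If the counts are still attached to the http URL, keeping og:url pointed at that http URL (as in the first attempt above) is the usual way to hold on to them.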

How to retrieve Facebook post (preview) meta data for linking to external site, such as LinkedIn or Google Plus?

I'm trying to link to individual Facebook posts and display the post page's meta data, including an image preview, but looking at the FB post page's source, it has none of the usual meta or Open Graph tags.
I found that LinkedIn and Google Plus are able to retrieve meta data when you post a link to a Facebook post. See below for a LinkedIn example.
How are they doing it!?
I'm not an insider at LinkedIn or Google, so I can't say for sure, but there are two ways they could do it:
Analysing the structure of the Facebook post page and scraping the picture/title/content of the post from the URL.
Fetching the post id (fbid) and using the Graph API to query the details of the post from Facebook.
What I wanted to do was get the post page's metadata, which wasn't appearing when I requested the URL from my server over HTTP.
Instead, what I received was Facebook's "Update your browser" error page.
So I added ...
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0
... to the header. And it worked.
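A rough curl equivalent of that working request would be (the post URL is a placeholder):
curl -A "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0" "https://www.facebook.com/somepage/posts/1234567890"
The -A flag sets the User-Agent header, which is what made Facebook return the full page instead of the unsupported-browser error.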

Nginx config for single page angularjs website to work with prerender.io for Facebook Open Graph

I have a single page AngularJS web app. I'm trying to enable it to be crawled by search engines. To achieve this I'm using prerender.io, a Node.js web server with a PhantomJS browser that renders AJAX pages.
I am basing my nginx config off the following gist: https://gist.github.com/Stanback/6998085
This works for the most part. I can curl my app and get the correct response: curl -o test.html domain.com/?_escaped_fragment_=/path
The request is redirected to the prerender.io proxy and the proxy makes a single request with the following url: domain.com/#!/path
All other requests (JS, img, CSS and XHR) pass through nginx as normal. PhantomJS has no trouble rendering the proxied request after waiting for the following JS variable (initialised to false) to be set to true: window.prerenderReady = false;
This is all great... Google can crawl my website! Enter Facebook.
I'm setting a number of OG metatags so that I can use the Facebook like button (iFrame). The following metatags are set for each page:
<link rel="canonical" href="http://domain.com/#!/asset">
<meta property="og:url" content="http://domain.com/#!/asset">
<meta property="og:type" content="website">
<meta property="og:image" content="http://domain.com/image.jpg">
<meta property="fb:app_id" content="xxx">
<meta property="og:description" content="foo">
<meta property="og:title" content="bar">
<meta property="og:site_name" content="domain.com">
These metatags are updated correctly by AngularJS for each asset, and the PhantomJS proxy correctly waits for them to be updated before returning the response.
However when I test the URL http://domain.com/#!/asset with the Facebook URL linter I get some problems.
1. Facebook claims that the canonical URL and the og:url differ, however when I click "See exactly what our scraper sees for your URL" they are identical
2. When I click "See exactly what our scraper sees for your URL" the canonical and og:url have been replaced with domain.com/?fb_locale=en_GB#!/asset
3. The proxy receives 3 requests: the first for the asset, then it seems to follow the canonical and og:url
4. When a user clicks the Like this page iFrame, the link back to my website looks like domain.com/?_escaped_fragment_=/asset
Number 4 is the issue that is a deal breaker. If a user likes a page on my site, it goes into their Facebook activity stream. If that user then clicks on the link back to my site in their stream, it will direct them through the proxy and render the page through PhantomJS!
I'm guessing that I shouldn't be sharing the links with the hash-bang through Facebook. I think I should be sharing a link and setting the canonical / og:url to something like domain.com/static/asset. The nginx config should be updated to catch /static URLs: if the user agent is Facebook or the params contain _escaped_fragment_, direct the request to the proxy; otherwise redirect the user to #!/asset.
I have tried all that I can think of to get a modified nginx config to work with this, however it has beaten me. When I intercept those /static URLs and rewrite them to the proxy, random image, CSS and JS assets are requested through the proxy and PhantomJS crashes.
Could someone please help me modify this nginx config so that I can forward web crawler requests to the proxy, allow Facebook to scrape the correct og tags off my site AND have the correct link-back URL specified when users share my content on Facebook?
Did you figure this out yet? Facebook doesn't do a very good job with #! URLs. This Stack Overflow answer does a good job explaining it: How to handle facebook sharing/like with hashbang urls?
When a user is on a page on your site (http://domain.com/#!/asset) and does a sharing action on your website, it should share the canonical url http://domain.com/asset.
Then if a user visits http://domain.com/asset, you just redirect them to http://domain.com/#!/asset.
And if Facebook accesses the canonical URL (http://domain.com/asset), then redirect it to your Prerender.io server.
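A minimal nginx sketch of that idea, assuming the prerender proxy from the gist listens on 127.0.0.1:3000 (a placeholder) and that /asset stands in for your canonical asset paths; adapt it to the Stanback gist rather than treating it as a drop-in config:
location /asset {
    # Facebook's scraper gets the prerendered page so it can read the og tags;
    # the rewrite follows the prerender convention of passing the full URL in the path
    if ($http_user_agent ~* "facebookexternalhit|facebot") {
        rewrite .* /$scheme://$host$request_uri? break;
        proxy_pass http://127.0.0.1:3000;
    }
    # Everyone else is sent back to the client-side hash-bang route
    return 302 "$scheme://$host/#!$request_uri";
}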
Or... just switch from #! to HTML5 pushState, and you won't have to do any of the #! redirecting for Facebook. That way the proxy setup becomes simpler: you'd always just proxy any request from Facebook to your Prerender.io server.