Why is the Facebook scraper ignoring OG tags?

It's only happening on a handful of URLs. For example:
https://gateless.com/articles/generate-leads/happy-thanksgiving-to-all-115-million-american-households
Facebook Debugger:
https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fgateless.com%2Farticles%2Fgenerate-leads%2Fhappy-thanksgiving-to-all-115-million-american-households
But this one works great:
https://gateless.com/articles/generate-leads/your-blueprint-for-converting-inbound-leads
Facebook Debugger:
https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fgateless.com%2Farticles%2Fgenerate-leads%2Fyour-blueprint-for-converting-inbound-leads
One possibly relevant detail: via DNS, this site only resolves within the US.

It was the SSL certificate. We reissued one from within AWS and now the URLs scrape OK. Don't buy SSL certificates from GeoTrust.
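If you suspect the certificate, one quick check is to open a TLS connection yourself and dump what the server presents; a minimal Python sketch (the hostname is the one from this thread):

    import socket
    import ssl

    hostname = "gateless.com"  # the affected domain from this thread

    # Connect the way any TLS client would; an untrusted or misissued
    # certificate chain raises ssl.SSLError right here.
    context = ssl.create_default_context()
    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
            print("issuer: ", dict(pair[0] for pair in cert["issuer"]))
            print("expires:", cert["notAfter"])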

The article URLs are throwing this error:
This site can’t be reached
gateless.com’s server DNS address could not be found.
Since your site is inaccessible, the crawler is unable to fetch the data for the OG tags.
The second URL seems to be working because it is showing cached data.
You can click on "See exactly what our scraper sees for your URL" in the Debugger to see what content is being scraped.
In this case, the page is blank, which is a good indication that the crawler cannot see the content, or that your page is not returning the data the crawler expects.
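One way to check what the scraper sees yourself is to request the page with the crawler's User-Agent; a minimal Python sketch (the URL is a placeholder, and the User-Agent string is the one Facebook documents, quoted further down this page):

    import urllib.request

    # Facebook's documented crawler User-Agent (see the Like Button
    # documentation quoted later on this page).
    UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

    url = "https://example.com/your-article"  # placeholder: the page under test
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("status:", resp.status)
        html = resp.read().decode("utf-8", errors="replace")

    # Crude check: does the HTML served to the crawler contain the OG tags?
    for line in html.splitlines():
        if "og:" in line:
            print(line.strip())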

Related

Facebook takes wrong canonical URL

Scenario of the problem:
We enforced HTTPS on a website. Any URL with HTTP now redirects (301 permanent redirect) to an appropriate HTTPS URL.
To avoid the Facebook like/share buttons (placed on many pages of the website) losing their previous like/share counts, we made the buttons link to the old HTTP URLs via the "data-href" property.
Additionally, we placed the "og:url" meta tag on some pages, pointing to the old HTTP URLs.
I then scraped those pages with the Facebook debugger tool https://developers.facebook.com/tools/debug to make sure Facebook gets the fresh data. According to the scraped data, the canonical URLs were indeed pointing to the old HTTP URLs, just as they should given our actions listed above. This was also reflected in the like/share buttons on our pages keeping the old counts.
A few days later I discovered that some pages had lost the old like/share counts. Checking those pages in the Facebook debugger shows that Facebook now takes the HTTPS URLs as canonical. We did not make any changes on our pages, and the "og:url" tag still points to the HTTP URLs, but Facebook wrongly takes the HTTPS URLs as canonical. If I scrape the information again in the debugger, it becomes normal again, showing HTTP as canonical and restoring the old counts. But obviously that is not a solution, because we cannot constantly monitor all our pages and scrape them again and again.
Any ideas what may be causing the problem?
Facebook follows HTTP redirects as well. You need to make the old HTTP URLs available to the scraper without redirecting it to the HTTPS version; the scraper can be recognized by its User-Agent. As the social plugins FAQ mentions:
“This also requires that the old URL still renders a document with Open Graph tags and returns a HTTP 200 response, at least when loaded by Facebook's crawler. If you want other clients to redirect when they visit the URL, you must send your 301 HTTP response to all non-Facebook crawler clients. The old URL should contain its own og:url tag that points to itself.”
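As an illustration, a minimal sketch of that exemption in Python/Flask (a hypothetical stack; the sites in this thread may well run Apache or IIS), keyed off the crawler's User-Agent:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # Facebook's crawler User-Agents; "Facebot" is another one Facebook uses.
    FB_AGENTS = ("facebookexternalhit", "Facebot")

    @app.before_request
    def force_https_except_facebook():
        ua = request.headers.get("User-Agent", "")
        if not request.is_secure and not any(a in ua for a in FB_AGENTS):
            # Everyone else gets the permanent redirect to HTTPS.
            return redirect(request.url.replace("http://", "https://", 1), code=301)
        # Facebook's crawler falls through and receives the HTTP page with a
        # 200 response and its own og:url tag, as the FAQ requires.

(Behind a reverse proxy you would also need something like werkzeug's ProxyFix so that request.is_secure reflects the original scheme.)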

Open Graph scraping the base URL instead of the URL it's given

The Facebook OpenGraph debug tool is scraping the wrong page.
If I give it a full URL (pointing to an individual page on my site) that I want it to scrape, instead of scraping that page and finding its meta tags, it scrapes my site's main page and returns those meta tags (which are obviously wrong in this context).
The weird thing is, it will find and scrape my site's main page even if it's not located at the root of my domain. For example:
I want it to scrape http://mydomain.com/myhomepage/specific_page.html
Instead, it scrapes http://mydomain.com/myhomepage/
This implies to me that the error must be a setting someplace, either on my site or in my Facebook App settings. Would the App settings do that? Redirect to whatever URL is set if a requested URL is a descendant of it?
The URL I'm requesting is not doing a 302 or anything - I can even click the link from the FB debug tool and it will take me to the appropriate page.
A few notes:
specific_page.html is not an actual file; it is routed through index.php using mod_rewrite in Apache's htaccess. I tried being specific with http://mydomain.com/myhomepage/index.php/specific_page.html and it did not work either.
Another SO question led me to believe that the user agent might be getting redirected if it doesn't allow cookies (as the Facebook web crawler does not), so I opened a fresh browser, disabled cookies, tried again, and still reached the appropriate page.
As mentioned in the comments above, in your case this was due to an og:url meta tag redirecting Facebook's crawler to that URL.
In general, cases like this are usually caused by the og:url tag, an HTTP redirect, or a canonical meta tag pointing at the 'other' / 'wrong' URL - Facebook's crawler follows all of those looking for the final URL.
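To see where the crawler actually ends up, you can fetch the URL yourself and look for those three mechanisms; a rough Python sketch (placeholder URL):

    import re
    import urllib.request

    UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
    url = "http://example.com/myhomepage/specific_page.html"  # placeholder

    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        print("final URL after HTTP redirects:", resp.geturl())
        html = resp.read().decode("utf-8", errors="replace")

    # og:url and rel=canonical act as 'soft' redirects for the crawler too.
    for pattern in (r'<meta[^>]+property=["\']og:url["\'][^>]*>',
                    r'<link[^>]+rel=["\']canonical["\'][^>]*>'):
        for tag in re.findall(pattern, html, flags=re.I):
            print(tag)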

The Facebook debugger is giving a 500 error

The Facebook debugger is giving a 500 error for my web page. It is unable to get my og meta tag values, even though the og meta tags are present in the page:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fstaging.eco.ca%2Fcommunity%2Fblog%2F5-fascinating-green-jobs-youve-never-heard-of%2F68170%2F
Take a look at your IIS configuration. Something in there is refusing connections from anything that isn't a web browser.
Your page returns a 500 error for the W3C validator also.
For me it was because robots like Facebook's and Google's don't send a language header the way a browser does, and our code was using the browser language to determine the content language.
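A minimal sketch of that kind of fix (hypothetical helper; assumes English as the fallback): choose the content language from the Accept-Language header, but fall back to a default when the header is missing.

    DEFAULT_LANG = "en"  # assumption: English as the site's fallback language

    def pick_language(accept_language_header):
        """Choose a content language, tolerating clients that send no header."""
        if not accept_language_header:
            # Crawlers such as facebookexternalhit often omit this header.
            return DEFAULT_LANG
        # Take the first tag and strip quality values: "en-US,en;q=0.9" -> "en"
        first = accept_language_header.split(",")[0].strip()
        return first.split(";")[0].split("-")[0].lower() or DEFAULT_LANG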
I want to share my experience with this Facebook debugger 500 error. I was getting it without any explanation in the debugger. I checked my server and domain records and tried what I found on the internet, with no luck. It took me a long time to discover the problem.
The problem was that the first time I shared the URL, it was inside a restricted area of my website. Let me explain: I have a test page in my CMS where, before I publish anything, I check that everything is OK. I realised that whenever I tested content there, Facebook tried to crawl the site, but the page was in a restricted area it could not reach. So I removed everything Facebook-related from my test page.

"URL is unreachable" error for Facebook comments box being cached?

Our website uses the Facebook Comments Box plugin. We include the comments box on our staging site that is behind our firewall, which means Facebook can't access it and generates the "URL is unreachable" error. This I understand.
However, once a page is published, and is reachable by Facebook, the error is still displayed. This can be easily fixed by clicking on the debug link provided along with the error, but my content editors don't want to have to do this every time, and they sometimes forget.
It seems like the reachable status is cached, and only reset once you use the debugger. Can anyone think of another explanation?
I suppose I could omit the Facebook comments box from the staging site, but would prefer not to. Any other ideas?
In the documentation of the Like Button they explain when the page is being scraped:
When does Facebook scrape my page?
Facebook needs to scrape your page to know how to display it around the site.
Facebook scrapes your page every 24 hours to ensure the properties are up to date. The page is also scraped when an admin for the Open Graph page clicks the Like button and when the URL is entered into the Facebook URL Linter. Facebook observes cache headers on your URLs - it will look at "Expires" and "Cache-Control" in order of preference. However, even if you specify a longer time, Facebook will scrape your page every 24 hours.
The user agent of the scraper is: "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Here are three options:
1. You can trigger a scrape by issuing a simple HTTP request; you can do that from the server when you publish your article (or whatever you're publishing), so you don't have to use the debugger tool by hand (see the sketch after this list).
2. You can check the User-Agent string of incoming requests and, if it's the Facebook scraper, let it through so that Facebook can cache the page.
3. You can use different URLs for production and staging; that way the cache of the staging pages won't matter in production.
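For option 1, a minimal sketch (assuming a valid app access token; Facebook's Graph API supports forcing a re-scrape by POSTing the URL with scrape=true):

    import urllib.parse
    import urllib.request

    def rescrape(page_url, access_token):
        """Ask Facebook to re-scrape page_url via the Graph API."""
        data = urllib.parse.urlencode({
            "id": page_url,
            "scrape": "true",
            "access_token": access_token,
        }).encode()
        req = urllib.request.Request("https://graph.facebook.com/", data=data)
        with urllib.request.urlopen(req) as resp:
            return resp.read()  # JSON describing what the scraper saw

    # e.g. call rescrape(published_url, APP_TOKEN) from your publish hook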

Facebook Linter can't connect to site?

I'm getting a critical error when using the Facebook Linter to check og meta info.
Critical Errors That Must Be Fixed
Error Scraping Page: Can't Download
https://developers.facebook.com/tools/debug/og/object?q=theshinebox.com
The site/URL works fine in a browser, though...
I haven't checked this, but it looks to me like your og:url value is incorrect.
Facebook will actually GO to that address to get the canonical version of the page, and you've written the address without the top-level domain, so of course Facebook can't download from it.