Facebook Debugger Returning "Document returned no data"

Recently I noticed that Facebook's Object Debugger was unable to scrape any pages of my website. After troubleshooting and scouring the internet, I'm at a loss for what might be causing this bug.
Whenever I attempt to fetch a new scrape of my website, the following error is returned:
Error parsing input URL, no data was cached, or no data was scraped.
When clicking into "See exactly what our scraper sees for your URL", the scraper returns:
Document returned no data
This is obviously a bit difficult to debug given the lack of data. Here's what I've tried thus far:
Checked DNS settings, everything seems fine
Tried using "Fetch as Google"; Googlebot had no problem fetching the page HTML
Verified all the Open Graph meta tags on the site: fb:app_id, og:title, og:description, og:site_name, og:url, and og:type are all present.
Made sure canonical URL references the homepage, and does not have any trailing slash or trailing data.
Rolled back commits to before the last successful crawl date
I'm at a loss for what could be causing this. If anyone has any ideas, or needs more information, I would be happy to provide it.
After checking access logs, I see the following:
173.252.112.115 - - [22/Jun/2015:20:49:02 +0000] "GET / HTTP/1.1" 404 993 "-" "facebookexternalhit/1.1
(+http://www.facebook.com/externalhit_uatext.php)"
But this is strange, as it is immediately followed by a normal user:
[user ip] - - [22/Jun/2015:20:48:09 +0000] "GET / HTTP/1.1" 200 28227
"-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML,
like Gecko) Chrome/16.0.912.63 Safari/535.7"
There is nothing in robots.txt to disallow bots.
EDIT: This site is running on Django, with AngularJS serving the pages. I'm using django-seo-js with Prerender to improve SEO.

When I visit your page in Chrome and send facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) as the value of the User-Agent header, I get a 404 as well (I used the ModHeader extension for that), whereas requests with my normal Chrome User-Agent show me your start page just fine.
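If you want to reproduce that test without a browser extension, here is a minimal PHP sketch (assuming allow_url_fopen is enabled; the URL is a placeholder) that sends the scraper's User-Agent and prints the status line that comes back:
<?php
// Fetch the page while pretending to be the Facebook scraper, then print
// the HTTP status line that comes back.
$url = 'http://example.com/'; // placeholder: put the affected page here
$context = stream_context_create(array(
    'http' => array(
        'ignore_errors' => true, // read headers/body even on a 4xx/5xx response
        'header'        => "User-Agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)\r\n",
    ),
));
file_get_contents($url, false, $context);
echo $http_response_header[0], "\n"; // e.g. "HTTP/1.1 404 Not Found"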
So investigate if you have any plugins, “firewall” rules or similar set up to fight requests by “bots” – might be something is a little overzealous in that regard when it comes to visits from the FB scraper.
That doesn't seem to be it, though (it was an educated guess only, since this is often the cause of such problems). But as you said, it's throwing a JavaScript stack trace, so this might be caused by prerender – let us know when you find the exact cause.

Related

Facebook IP not showing correct user agent

There are IPs, e.g. 66.220.145.244, which are hitting us too much. I checked, and they belong to Facebook, using the command whois -h whois.radb.net -- '-i origin AS32934' | grep ^route as mentioned here.
With this I am able to get the IPs of the Facebook crawlers; the IP above is one of them.
According to Facebook, such crawlers will show a user-agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) or facebookexternalhit/1.1, but I am seeing neither. What I am seeing is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36.
My setup is cloudflare -> Load Balancer -> nginx -> app.
I am completely confused about why this is happening. It is messing up our analytics a lot. Is there any way to contact Facebook and ask them to look into it? I have not been able to find any contact at my level. Any further guidance on this would be awesome.
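For what it's worth, the whois route list mentioned in the question can be turned into an automated check. A minimal PHP sketch (the two routes shown are samples from AS32934; substitute the full list the query returns):
<?php
// Check whether an IP falls inside one of the CIDR routes returned by the
// whois query above.
function ip_in_cidr($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int)$bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$facebookRoutes = array('66.220.144.0/20', '69.63.176.0/20'); // sample routes only
foreach ($facebookRoutes as $route) {
    if (ip_in_cidr('66.220.145.244', $route)) {
        echo "IP belongs to Facebook route $route\n";
        break;
    }
}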
I don't have enough rep to comment so I have to respond this way. I see the same thing. It has come and gone over the last couple months. I have a couple dozen community pages my app posts to automatically, and there will be periods of time (days on end) where shortly after posting and just after the FB crawler scrapes my pages, it is hit by this other IP from Facebook. Typically it hits a few seconds after the normal bot, but is so soon and regular that it is definitely a bot and not a person (as one of the commenters suggested).
I am getting a different user agent than you, however it is from the same IP (66.220.145.244):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30
It affects all my pages and posts at once for a period (days on end), then it stops for all of them (for a period of a week or more). I noted today it is "back" so I searched for the topic and found this post.
I do note that the referring URL is from l.facebook.com for these, which is FB's external link manager thingy. If I hit that referring URL then I see a message:
Leaving Facebook
We're just checking that you want to follow a link to
this website: http://URL_TO_MY_PAGE
So my guess is this is a validator for the external link system, though I don't understand why it is only invoked from time to time. I have a hunch they may spot-check apps occasionally to make sure the website is not serving the FB bot different content than normal browsers. Still, I don't think it's great that they pretend to be a browser they are not, as it does mess up metrics.
As a solution I'm thinking about filtering from my metrics all hits on my site that occur within 5-10 seconds of when I share it on FB.
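As a rough sketch of that idea in PHP (get_share_times() and log_hit() are hypothetical hooks into your own share log and metrics):
<?php
// Ignore hits that land within a short window after the page was shared
// on Facebook, since those are likely FB's follow-up bots.
$window = 10;           // seconds after a share to treat hits as bot checks
$pageId = 'my-page-42'; // hypothetical page identifier
$isScraperEcho = false;
foreach (get_share_times($pageId) as $sharedAt) { // hypothetical share log
    if (time() - $sharedAt <= $window) {
        $isScraperEcho = true; // too close to a share, likely Facebook's bot
        break;
    }
}
if (!$isScraperEcho) {
    log_hit($pageId); // hypothetical: only count visits outside the window
}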

There are some errors reported in my GWT account

GWT has sent me the message below:
Dear webmaster,
Your smartphone users will be happier if they don’t land on error- or non-existent pages, so we recommend you make sure that your pages return the appropriate HTTP code. Currently, Googlebot for smartphones detects a significant increase in URLs returning a 200 (available) response code, but we think they should return an HTTP 404 (page not found) code.
Recommended actions
Check the Smartphone Crawl Errors page in Webmaster Tools.
Return a 404 (page not found) or 410 (gone) HTTP response code in response to a request for a non-existent URL.
Improve the user experience by configuring your site to display a custom 404 page when returning a 404 response code.
Now, how do I resolve this?
Have you made any significant changes lately, like changing the URLs of all your pages?
First off, make sure your pages are available and working at their URLs. Try searching for your own site on Google with "site:yourdomain.com". Are these pages correct, or do they not exist?
You should also check that if a page does not exist (yourdomain.com/blahblah), it returns HTTP 404 (Not Found) and not HTTP 200 (OK). You can see this in Chrome Developer Tools: go to the Network tab, reload the page, and check the Status column for your HTML page.
How you change the HTTP code depends on your web server and language. In PHP you can use header().
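For example, a minimal sketch (page_exists() is a hypothetical lookup against your own routing or database):
<?php
// Send a real 404 when the requested URL doesn't map to actual content.
// page_exists() stands in for however your app resolves URLs.
if (!page_exists($_SERVER['REQUEST_URI'])) {
    header('HTTP/1.1 404 Not Found'); // or http_response_code(404) on PHP >= 5.4
    include '404.php';                // your custom "page not found" template
    exit;
}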

Best way to track where visitors come from?

I am running an advertisement network where publishers can post affiliate links in their blog posts, which they can then earn money on with pay-per-click.
It is very important for us to make sure that the clicks are as natural as possible, and we automatically check many parameters to verify this. One of the ways we check is to look at whether the visitor (the person who clicks the affiliate link) is coming from the blogger's website or not. This is to make sure that the publisher does not post his link on services other than his blog.
To do this we use HTTP_REFERER in PHP, but as you all know, it's not 100% reliable: about 50% of our visitors have it disabled. We can still catch cheaters if they get multiple clicks from a false source; if they get, for example, 10 clicks from 10 unique visitors, then usually at least one of these has a referrer, and we can see that the publisher is cheating.
I'm writing here to see if there are any solutions to this other than HTTP_REFERER. For example:
Can we use Google Analytics API to check this data instead?
If 9 visitors have HTTP_REFERER disabled, but we get the data from the 10th one, can we match that data together with the other 9 somehow? For example, maybe there is some other information besides HTTP_REFERER included in each visit that would match if they came from the same referrer.
Any other ideas?
In PHP you can access a $_SERVER variable that is called HTTP_REFERER.
$_SERVER["HTTP_REFERER"]
To understand what this variable is, you really have to look at HTTP.
When a browser loads up a website it sends a request to the web server hosting the website that looks something like this:
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
The web server will then send the page that was requested back to the browser.
Now imagine that you were on a webpage, and clicked on a link that takes you to another page (example: Google), the browser will then send a similar request to the one above, but it will look like this:
GET / HTTP/1.1
Host: google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
Referer: http://example.com
Notice the extra line called Referer: it tells the website you are visiting where you came from when you clicked a link to the site.
This is often used for analytics purposes: if you have a blog and you shared a post that you published, you would want to know where people are coming from, so that you know which place will give you the most traffic in the future, etc.
In PHP, this is the value that is put in the variable
$_SERVER["HTTP_REFERER"] // http://example .com
You can use the Google Analytics API to check for this, with the caveat that if JavaScript is not enabled, or if the user has taken steps to block Google Analytics tracking, their data will not show up.
Use the fullReferrer dimension to get the source and referral path (i.e. website name and URL where you can find the link to your site).
You can also use Google Webmaster Tools to see who is linking to your site. Look under Search Traffic > Links to Your Site and you will see the full list. This may help you to find those sites where your publishers have posted their affiliate links.

Like button thinks that my domain is another domain (google.ru)

In the header of my page I send og parameters, but the counter of the Like button shows me a number near 17,000. When I use the Facebook developers debugger, I see that the canonical URL is not my URL; it's google.ru, for example:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fseo-top-news.com.ua%2Fgoogle-vlojit-v-startapy-1-5-milliarda%2F
I really have a high opinion of Google, but I don't want to have a Google Like button on all of my articles :)
Can you help me?
Your site may have been hacked with malicious redirects -- if you look down by the "Redirect Path" section, Facebook's crawler is being redirected to a series of super-sketchy sounding sites.
Check your .htaccess for anything that shouldn't be there (this is a really common vector of attack), change your FTP password, and upgrade your WordPress install (and plugins).
On the debugger result page you link to, you can see the redirect path:
original: http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
302: http://dietrussia.ru/
302: http://webhitexpress.ru/in.cgi?2
302: http://www.google.ru/
The Facebook scraper is being sent through to Google.
This is because of the user-agent. If I curl your site with no user agent, I also get redirected:
curl http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
But if I curl it with a real browser user-agent, I get the proper page:
curl --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17" http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
The Facebook scraper uses the user-agent 'facebookexternalhit', so simply make sure that user-agent is served the full page content, not the 302.
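To verify, a small PHP sketch in the same spirit as the curl tests above (assuming allow_url_fopen is enabled):
<?php
// Request the page with the scraper's user-agent, without following
// redirects, and print the status line plus any Location header, so a
// 302 is easy to spot. Expecting "HTTP/1.1 200 OK" and no Location line.
$url = 'http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/';
$context = stream_context_create(array(
    'http' => array(
        'follow_location' => 0,    // stop at the first response
        'ignore_errors'   => true, // still read headers on 3xx/4xx
        'header'          => "User-Agent: facebookexternalhit/1.1\r\n",
    ),
));
file_get_contents($url, false, $context);
foreach ($http_response_header as $line) {
    if (stripos($line, 'HTTP/') === 0 || stripos($line, 'Location:') === 0) {
        echo $line, "\n";
    }
}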

How is it possible that I see Facebook URLs in my server logs?

I have been seeing URLs in my apache server logs that are clearly meant for another site. The most common one that I see is Facebook, but I've also seen Tumblr and YouTube. I am trying to figure out how that could happen. Here is an example of a request that was logged on my server (I removed the remote IP address):
[IP Redacted] - - [21/Sep/2011:13:31:35 +0000] "POST /ajax/chat/buddy_list.php?__a=1 HTTP/1.1" 404 797 "http://www.facebook.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko)"
This looks like someone was on facebook.com, using Facebook chat, and for whatever reason the post got sent to my server. Does anyone have an idea for how this could happen, or how I could go about investigating it further?
As far as I know, when posting a link on Facebook to your site, Facebook will commonly try to get some information (page title, article title, some text, maybe a photo) from the link. I suspect that that's what's happening here.
There are all kinds of script kiddies that try to use your Apache server as a proxy. I don't know of any vulnerabilities in this area, but I have found related results in my logs too. You should not worry about them.
My guess would be "automated 'hacking' scripts blindly trying everything and anything in their arsenal" - similar to this question. If your server handles the requests correctly (i.e. returns an appropriate error page, and doesn't forward them), you should be in the clear.
Check your hosts file to see if any traffic is being redirected; alternatively, this could be the referer of the website they last used to reach your server.
They could also be using your server as a proxy to gain access to blocked websites, seeing as the traffic is going through you. Looking at it again, it seems you are blocking Facebook, given the 404?
EDIT: You could try blocking common websites from your server using the hosts file to make them go away.