There are IPs, e.g. 66.220.145.244, that are hitting us very heavily. I checked, and they belong to Facebook; I verified this with the command whois -h whois.radb.net -- '-i origin AS32934' | grep ^route as mentioned here.
That lets me list the IPs of Facebook's crawlers, and the IP above is one of them.
According to Facebook, such crawlers will show a user-agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) or facebookexternalhit/1.1, but I am seeing neither. What I am seeing is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36.
My setup is cloudflare -> Load Balancer -> nginx -> app.
I am completely confused about why this is happening. It is messing up our analytics a lot. Is there any way to contact Facebook and ask them to look into it? I have not been able to find one at my level. Any further guidance on this would be awesome.
I don't have enough rep to comment, so I have to respond this way. I see the same thing. It has come and gone over the last couple of months. I have a couple dozen community pages my app posts to automatically, and there are periods of time (days on end) where, shortly after posting and just after the FB crawler scrapes my pages, they are hit by this other Facebook IP. Typically it hits a few seconds after the normal bot, and it is so prompt and regular that it is definitely a bot and not a person (as one of the commenters suggested).
I am getting a different user agent than you, however it is from the same IP (66.220.145.244):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30
It affects all my pages and posts at once for a period (days on end), then it stops for all of them (for a period of a week or more). I noted today it is "back" so I searched for the topic and found this post.
I do note that the referring URL is from l.facebook.com for these, which is FB's external link manager thingy. If I hit that referring URL then I see a message:
Leaving Facebook
We're just checking that you want to follow a link to
this website: http://URL_TO_MY_PAGE
So my guess is this is a validator for the external link system, though I don't understand why it is only invoked from time to time. I have a hunch they may spot-check apps occasionally to make sure the website is not serving different content to the FB bot than to normal browsers. Still, I don't think it's great that they pretend to be a browser they are not, as it does mess up metrics.
As a solution I'm thinking about filtering from my metrics all hits on my site that occur within 5-10 seconds of when I share it on FB.
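That filtering idea can be sketched roughly as follows. This is only an illustration in JavaScript; the shape of the hit object, the shareTimes store, and the 10-second window are all assumptions, not anything Facebook documents:

```javascript
// Sketch: flag analytics hits that look like Facebook's follow-up bot.
// A hit is suspect if its referrer is l.facebook.com and it arrives
// within `windowMs` of when we shared that page on Facebook.
// `shareTimes` (page URL -> share timestamp in ms) is a hypothetical store.
function isLikelyFacebookValidator(hit, shareTimes, windowMs = 10000) {
  const sharedAt = shareTimes.get(hit.url);
  if (sharedAt === undefined) return false; // page was never shared
  const fromFacebook = hit.referrer === "https://l.facebook.com/";
  return fromFacebook && hit.timestamp - sharedAt <= windowMs;
}
```

Hits flagged this way could be excluded from metrics rather than blocked, so the page itself still works for the scraper.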
Related
Recently I noticed that Facebook's Object Debugger was unable to scrape any pages of my website. After troubleshooting and scouring the internet, I'm at a loss for what might be causing this bug.
Whenever I attempt to fetch a new scrape of my website, the following error is returned:
Error parsing input URL, no data was cached, or no data was scraped.
When clicking into "See exactly what our scraper sees for your URL", the scraper returns:
Document returned no data
This is obviously a bit difficult to debug given the lack of data. Here's what I've tried thus far:
Checked DNS settings, everything seems fine
Tried using "Fetch as Google," GoogleBot had no problem fetching the page HTML
Verified all the meta settings on the site. fb:app_id, og:title, og:description, og:site_name, og:url, and og:type are all present.
Made sure canonical URL references the homepage, and does not have any trailing slash or trailing data.
Rolled back commits to before the last successful crawl date
I'm at a loss for what could be causing this. If anyone has any ideas, or needs more information, I would be happy to provide it.
After checking access logs, I see the following:
173.252.112.115 - - [22/Jun/2015:20:49:02 +0000] "GET / HTTP/1.1" 404 993 "-" "facebookexternalhit/1.1
(+http://www.facebook.com/externalhit_uatext.php)"
But this is strange, because less than a minute earlier a normal user was served the page fine:
[user ip] - - [22/Jun/2015:20:48:09 +0000] "GET / HTTP/1.1" 200 28227
"-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML,
like Gecko) Chrome/16.0.912.63 Safari/535.7"
There is nothing in robots.txt to disallow bots.
EDIT: This site is running on Django, and AngularJS is serving my pages. I'm using django-seo-js to work with prerender to improve SEO.
When I visit your page in Chrome and send facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) as value of the User-Agent header, I get a 404 as well (I used ModHeader extension for that), whereas requests with my normal Chrome User-Agent show me your start page just fine.
So investigate if you have any plugins, “firewall” rules or similar set up to fight requests by “bots” – might be something is a little overzealous in that regard when it comes to visits from the FB scraper.
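One common shape for such a filter, sketched in JavaScript purely for illustration (the patterns and names here are hypothetical, not taken from any specific plugin), is a user-agent blocklist that is broad enough to catch the FB scraper; the fix is an explicit allow-list:

```javascript
// Sketch of an over-broad "bot" filter that would 404 the Facebook
// scraper, plus an allow-list fix. All patterns here are hypothetical.
const BLOCKED_BOT_PATTERN = /bot|crawl|spider|externalhit/i; // too broad
const ALLOWED_BOTS = [/facebookexternalhit/i, /Facebot/i];

function shouldBlock(userAgent) {
  // Let known-good scrapers through before applying the blanket rule.
  if (ALLOWED_BOTS.some((re) => re.test(userAgent))) return false;
  return BLOCKED_BOT_PATTERN.test(userAgent);
}
```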
That doesn’t seem to be it, though (it was an educated guess only, since this is often the cause of such problems), but as you said, it's throwing a JavaScript stack trace. That seems like it might be caused by prerender – let us know if you find the exact cause.
I am running an advertising network where publishers can post affiliate links in their blog posts and earn money on them with pay-per-click.
It is very important for us to make sure that the clicks are as natural as possible, and we automatically check many parameters to verify this. One of the ways we check is to look at whether the visitor (the person who clicks the affiliate link) is coming from the blogger's website or not. This is to make sure the publisher does not post his link on services other than his blog.
To do this we use HTTP_REFERER in PHP, but as you all know it's not 100% reliable: about 50% of our visitors have it disabled. We can still catch cheaters if they get multiple clicks from a false source; if they get, say, 10 clicks from 10 unique visitors, then usually at least one of those has a referrer, and we can see that the publisher is cheating.
I'm writing here to see if there are any solutions to this other than the HTTP_REFERER. For example:
Can we use Google Analytics API to check this data instead?
If 9 visitors have HTTP_REFERER disabled but we get the data from the 10th one, can we somehow match that data together with the other 9? For example, maybe each visit includes some other information besides the HTTP_REFERER that would match if they came from the same referrer.
Any other ideas?
In PHP you can access a $_SERVER variable that is called HTTP_REFERER.
$_SERVER["HTTP_REFERER"]
To understand what this variable is, you really gotta look at HTTP.
When a browser loads up a website it sends a request to the web server hosting the website that looks something like this:
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
The web server will then send the page that was requested back to the browser.
Now imagine that you were on a webpage, and clicked on a link that takes you to another page (example: Google), the browser will then send a similar request to the one above, but it will look like this:
GET / HTTP/1.1
Host: google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
Referer: http://example.com
Notice the extra line called Referer: it tells the website you are visiting where you came from when you clicked a link to it.
This is often used for analytics purposes: if you have a blog and you shared a post that you published, you would want to know where people are coming from, so that you know which place will give you the most traffic in the future, etc.
In PHP, this is the value that is put in the variable
$_SERVER["HTTP_REFERER"] // http://example.com
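For the cheater check described in the question, the interesting part of that header is its host. Here is a minimal sketch, shown in JavaScript for illustration rather than PHP; the registeredHost value (the publisher's blog domain from your own records) is a hypothetical input:

```javascript
// Sketch: does the Referer header point at the publisher's own blog?
// `registeredHost` would come from your publisher database (hypothetical).
// Returns true/false when the header gives evidence, null when it is
// absent or malformed (no evidence either way).
function refererMatchesPublisher(referer, registeredHost) {
  if (!referer) return null; // ~50% of visitors send no Referer
  try {
    return new URL(referer).hostname === registeredHost;
  } catch {
    return null; // malformed header value
  }
}
```

Treating "missing" as a third state matters here: as the question notes, absence of a referrer is not itself proof of cheating.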
You can use the Google Analytics API to check for this, with the caveat that if JavaScript is not enabled, or if the user has taken steps to block tracking by Google Analytics, their data will not show up.
Use the fullReferrer dimension to get the source and referral path (i.e. website name and URL where you can find the link to your site).
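Once you have a fullReferrer value back from the API, splitting it into hostname and referral path is straightforward. A sketch (the "(direct)" sentinel is how I recall GA reporting sessions with no referrer; treat that as an assumption to verify):

```javascript
// Sketch: split a Google Analytics fullReferrer value ("host/path")
// into its hostname and referral path. Sessions without a referrer
// are assumed to come back as "(direct)".
function splitFullReferrer(fullReferrer) {
  if (fullReferrer === "(direct)") return null; // no referrer recorded
  const slash = fullReferrer.indexOf("/");
  if (slash === -1) return { host: fullReferrer, path: "/" };
  return { host: fullReferrer.slice(0, slash), path: fullReferrer.slice(slash) };
}
```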
You can also use Google Webmaster Tools to see who is linking to your site. Look under Search Traffic > Links to Your Site and you will see the full list. This may help you to find those sites where your publishers have posted their affiliate links.
In the header of my page I send og parameters, but the counter on the like button shows a number near 17,000. When I use the Facebook developers debugger, I see that the canonical URL is not my URL; it's google.ru, for example:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fseo-top-news.com.ua%2Fgoogle-vlojit-v-startapy-1-5-milliarda%2F
I really have a high opinion of Google, but I don't want to have a Google like button on all of my articles :)
Can you help me?
Your site may have been hacked with malicious redirects -- if you look down by the "Redirect Path" section, Facebook's crawler is being redirected to a series of super-sketchy sounding sites.
Check your .htaccess for anything that shouldn't be there (this is a really common vector of attack), change your FTP password, and upgrade your WordPress install (and plugins).
On the debugger result page you link to, you can see the redirect path:
original: http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
302: http://dietrussia.ru/
302: http://webhitexpress.ru/in.cgi?2
302: http://www.google.ru/
The Facebook scraper is being sent through to Google.
This is because of the user-agent. If I curl your site with no user agent, I also get redirected:
curl http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
But if I curl it with a real browser user-agent, I get the proper page:
curl --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17" http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
The Facebook scraper uses the user-agent 'facebookexternalhit', so simply make sure that user-agent is served the full page content, not the 302.
The "Add to Home Screen" option shows up on all pages of a site, but I want the homepage to be the URL that gets saved.
For example on this page:
http://www.domain.com/category/page.html
Is it possible for the "Add to home screen" to save this url:
http://www.domain.com
Any help would be greatly appreciated.
I found a sort-of workaround for this. You can detect that you were launched from the home screen via window.navigator.standalone and, based on that, potentially redirect.
Also, I have done a little testing and found that on the latest iOS, different user agents are reported to the server, which opens the possibility of a faster redirect. I can't find any information about whether this has always been the case.
Launch from home page:
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0_1 like Mac OS X)
AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10A523
Mobile Safari:
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0_1 like Mac OS X)
AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A523 Safari/8536.25
If your page gets most of its content via AJAX or you notice the different user-agent on the server it might be possible to skip the redirect and just act "as if" you were at another URL, since in standalone mode the URL is invisible anyway. I'm investigating this but haven't got far enough to say whether it will burn you or not.
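Detecting that difference on the server can be sketched as below. This is a heuristic based only on the two user agents listed above (standalone mode lacks the "Version/…" and "Safari/…" tokens); the function name is mine, and newer iOS versions may behave differently:

```javascript
// Sketch: guess whether an iOS request comes from a home-screen
// (standalone) web app. Standalone user agents observed above lack
// the "Safari/…" token that Mobile Safari includes.
function looksStandalone(userAgent) {
  return /iPhone|iPad|iPod/.test(userAgent) && !/Safari\//.test(userAgent);
}
```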
Also note that the user's choice of URL to mark as an app may be meaningful, but I'll leave that to your own UX judgment.
A combination of both WrightsCS's and svachalek's answers: you can't add a remote page to the home screen. However, you can redirect from the page after it has been added to the home screen.
All you need to do is use this simple JavaScript:
if ("standalone" in window.navigator && window.navigator.standalone) { // checks if you're in app mode
    window.location = 'http://www.example.com'; // the URL you want to redirect to
}
Make sure you add this html code to your page:
<meta name="apple-mobile-web-app-capable" content="yes">
No, without being jailbroken (and I know of nothing that achieves this even then), there is no way to edit the actual URL.
Apple restricts this for at least one reason I can think of: security. Editing the URL would allow people to use javascript: URLs, which would inevitably lead to malware.
I got this to work for my purpose of populating the URL of a "share" bookmark on iPhone:
history.pushState('','','/');
This worked since I only needed to cut the URL to the home domain. (And I used document.title="newName" to set the screen name of the bookmark icon.)
This didn't exist back when the question was asked, but the start_url field in the manifest allows this:
https://developer.mozilla.org/en-US/docs/Web/Manifest/start_url
Unfortunately this doesn't allow dynamic control, since the manifest is usually fetched up front. For example, it doesn't let you keep the search the user was on but strip the page number, so that the saved search doesn't start three pages down when it's opened later.
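For reference, a minimal manifest using start_url might look like this (all values illustrative):

```json
{
  "name": "Example App",
  "start_url": "/",
  "display": "standalone"
}
```

With start_url set to "/", the saved icon always opens the homepage regardless of which page the user was on when they added it.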
I have been seeing URLs in my apache server logs that are clearly meant for another site. The most common one that I see is Facebook, but I've also seen Tumblr and YouTube. I am trying to figure out how that could happen. Here is an example of a request that was logged on my server (I removed the remote IP address):
[IP Redacted] - - [21/Sep/2011:13:31:35 +0000] "POST /ajax/chat/buddy_list.php?__a=1 HTTP/1.1" 404 797 "http://www.facebook.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/534.48.3 (KHTML, like Gecko)"
This looks like someone was on facebook.com, using Facebook chat, and for whatever reason the post got sent to my server. Does anyone have an idea for how this could happen, or how I could go about investigating it further?
As far as I know, when a link to your site is posted on Facebook, Facebook will commonly try to fetch some information (page title, article title, some text, maybe a photo) from the link. I suspect that's what's happening here.
There are all kinds of script kiddies who try to use your Apache server as a proxy. I don't know of any specific vulnerability here, but I have found similar entries in my own logs. You should not worry about them.
My guess would be "automated 'hacking' scripts blindly trying everything and anything in their arsenal" - similar to this question. If your server handles the requests correctly (i.e. returns an appropriate error page, and doesn't forward them), you should be in the clear.
Check your hosts file to see if any traffic is being redirected; otherwise, this could simply be the referer of the website they last used before reaching your server.
They could also be using your server as a proxy to gain access to blocked websites, seeing as the traffic is going through you. Looking at it again, it seems you are blocking Facebook, given the 404?
EDIT: You could try blocking common websites from your server using the hosts file to make them piss off.