I am running an advertising network where publishers can post affiliate links in their blog posts and earn money on them on a pay-per-click basis.
It is very important for us that the clicks are as natural as possible, and we automatically check many parameters to verify this. One of the ways we check is to look at whether the visitor (the person who clicks the affiliate link) is coming from the blogger's website or not. This is to make sure that the publisher does not post his link on services other than his blog.
To do this we use HTTP_REFERER in PHP, but as you all know it's not 100% reliable. About 50% of our visitors have it disabled. This still lets us catch cheaters when they get multiple clicks from a false source: if they get, for example, 10 clicks from 10 unique visitors, then usually at least one of those has a referrer and we can see that the publisher is cheating.
I'm writing here to see if there are any other solutions to this besides HTTP_REFERER. For example:
Can we use the Google Analytics API to check this data instead?
If 9 visitors have HTTP_REFERER disabled but we get the data from the 10th one, can we somehow match that data with the other 9? For example, maybe there is some other piece of information besides HTTP_REFERER included in each visit that would match if they came from the same referrer.
Any other ideas?
In PHP you can access a $_SERVER variable that is called HTTP_REFERER.
$_SERVER["HTTP_REFERER"]
To understand what this variable is, you really have to look at HTTP.
When a browser loads a website, it sends a request to the web server hosting the website that looks something like this:
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
The web server will then send the page that was requested back to the browser.
Now imagine that you are on a web page and click a link that takes you to another site (for example, Google). The browser will then send a request similar to the one above, but it will look like this:
GET / HTTP/1.1
Host: google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36
Referer: http://example.com
Notice that there is an extra line called Referer; it tells the website you are visiting which page you came from when you clicked the link.
This is often used for analytics purposes: if you have a blog and you shared a post you published, you would want to know where visitors are coming from, so that you know which places will give you the most traffic in the future, and so on.
In PHP, this is the value that is put in the variable
$_SERVER["HTTP_REFERER"] // http://example .com
You can use the Google Analytics API to check for this, with the caveat that if JavaScript is not enabled, or if the user has taken steps to block tracking by Google Analytics, their data will not show up.
Use the fullReferrer dimension to get the source and referral path (i.e. website name and URL where you can find the link to your site).
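As a rough sketch of what such a query could look like against the Core Reporting API (the view ID, date range, and OAuth access token below are placeholders; in practice you would normally go through the official client library):
$params = http_build_query(array(
    'ids'          => 'ga:12345678',        // your Analytics view (profile) ID
    'start-date'   => '2016-01-01',
    'end-date'     => '2016-01-31',
    'metrics'      => 'ga:sessions',
    'dimensions'   => 'ga:fullReferrer',    // the dimension mentioned above
    'access_token' => 'YOUR_OAUTH_TOKEN',   // obtained separately via OAuth 2.0
));
$json = file_get_contents('https://www.googleapis.com/analytics/v3/data/ga?' . $params);
$report = json_decode($json, true);
if (!empty($report['rows'])) {
    foreach ($report['rows'] as $row) {
        echo $row[0] . ': ' . $row[1] . PHP_EOL; // full referrer and session count
    }
}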
You can also use Google Webmaster Tools to see who is linking to your site. Look under Search Traffic > Links to Your Site and you will see the full list. This may help you to find those sites where your publishers have posted their affiliate links.
I am crawling data from Facebook groups. I am using Selenium and emulating a bot that behaves like a real person, and I am crawling m.facebook.com because it is very easy to analyze. But I have a problem: in some groups, when I scroll I only get a maximum of 20 posts, for example: https://m.facebook.com/groups/3260883254031956, while in other groups I can keep scrolling, for example: https://m.facebook.com/groups/hanoihome.
I tried changing the USER_AGENT to "Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0" and "Mozilla/5.0 (Linux; Android 6.0; HTC One M9 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36", but the HTML returned for those user agents is even harder to parse than www.facebook.com.
I would like to crawl www.facebook.com, but that page is very hard to parse and extract data from. Can you help me understand why some groups on m.facebook.com only return 20 posts when scrolling, and how to fix it? Thank you! Note that you need to log in to see this case, because I need to crawl data from private groups, so I must be logged in.
There are IPs, e.g. 66.220.145.244, that are hitting us very heavily. I checked, and they belong to Facebook, using the command whois -h whois.radb.net -- '-i origin AS32934' | grep ^route as mentioned here.
This gives me the IP ranges of Facebook's crawlers, and the IP above falls into one of them.
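For reference, a minimal sketch of how a request IP could be matched against the routes that whois command returns (the range list below is a small illustrative subset, not the full AS32934 output):
// Returns true if $ip falls inside the IPv4 CIDR range $cidr.
function ipInCidr($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

// Illustrative subset of the routes announced for AS32934.
$facebookRanges = array('66.220.144.0/20', '69.171.224.0/19', '31.13.24.0/21');

$clientIp = $_SERVER['REMOTE_ADDR'];
$fromFacebook = false;
foreach ($facebookRanges as $range) {
    if (ipInCidr($clientIp, $range)) {
        $fromFacebook = true;   // e.g. tag or exclude this hit in analytics
        break;
    }
}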
According to Facebook, such crawlers will show a user agent of facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) or facebookexternalhit/1.1, but I am seeing neither of these. What I am seeing is Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36.
My setup is Cloudflare -> load balancer -> nginx -> app.
I am completely confused about why this is happening. It is messing up our analytics a lot. Is there any way to contact Facebook and ask them to look into it? I have not been able to find one at my level. Any further guidance on this would be awesome.
I don't have enough rep to comment so I have to respond this way. I see the same thing. It has come and gone over the last couple months. I have a couple dozen community pages my app posts to automatically, and there will be periods of time (days on end) where shortly after posting and just after the FB crawler scrapes my pages, it is hit by this other IP from Facebook. Typically it hits a few seconds after the normal bot, but is so soon and regular that it is definitely a bot and not a person (as one of the commenters suggested).
I am getting a different user agent than you do; however, it is from the same IP (66.220.145.244):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30
It affects all my pages and posts at once for a period (days on end), then it stops for all of them (for a period of a week or more). I noted today it is "back" so I searched for the topic and found this post.
I do note that the referring URL for these hits is from l.facebook.com, which is FB's external link redirector. If I visit that referring URL, I see this message:
Leaving Facebook
We're just checking that you want to follow a link to
this website: http://URL_TO_MY_PAGE
So my guess is that this is a validator for the external link system; however, why it is only invoked from time to time I don't understand. I have a hunch they may spot-check apps from time to time to make sure that the website is not serving different content to the FB bot than to normal browsers. Still, I don't think it's great that they pretend to be a browser they are not, as it does mess up metrics.
As a workaround, I'm thinking about filtering out of my metrics all hits on my site that occur within 5-10 seconds of when I share a page on FB.
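A rough sketch of that idea, assuming the share times are recorded somewhere when the app posts to FB (the $shareTimes lookup below is entirely made up for illustration):
// Path => unix timestamp of when it was shared on FB (hypothetical storage).
$shareTimes = array('/my-post/' => 1498000000);

$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$sharedAt = isset($shareTimes[$path]) ? $shareTimes[$path] : null;

// Treat hits arriving within 10 seconds of the share as the FB validator, not a person.
$looksLikeValidator = ($sharedAt !== null) && (time() - $sharedAt <= 10);
if (!$looksLikeValidator) {
    // record the hit in analytics as usual
}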
Recently I noticed that Facebook's Object Debugger was unable to scrape any pages of my website. After troubleshooting and scouring the internet, I'm at a loss for what might be causing this bug.
Whenever I attempt to fetch a new scrape of my website, the following error is returned:
Error parsing input URL, no data was cached, or no data was scraped.
When clicking into "See exactly what our scraper sees for your URL", the scraper returns:
Document returned no data
This is obviously a bit difficult to debug given the lack of data. Here's what I've tried thus far:
Checked DNS settings, everything seems fine
Tried using "Fetch as Google," GoogleBot had no problem fetching the page HTML
Verified all the meta settings on the site. fb:app_id, og:title, og:description, og:site_name, og:url, and og:type are all present.
Made sure canonical URL references the homepage, and does not have any trailing slash or trailing data.
Rolled back commits to before the last successful crawl date
I'm at a loss for what could be causing this. If anyone has any ideas, or needs more information, I would be happy to provide it.
After checking access logs, I see the following:
173.252.112.115 - - [22/Jun/2015:20:49:02 +0000] "GET / HTTP/1.1" 404 993 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
But this is strange, as it is immediately followed by a normal user:
[user ip] - - [22/Jun/2015:20:48:09 +0000] "GET / HTTP/1.1" 200 28227 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7"
There is nothing in robots.txt to disallow bots.
EDIT: This site is running on Django, and AngularJS is serving my pages. I'm using django-seo-js to work with prerender to improve SEO.
When I visit your page in Chrome and send facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) as the value of the User-Agent header, I get a 404 as well (I used the ModHeader extension for that), whereas requests with my normal Chrome User-Agent show me your start page just fine.
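For anyone who wants to reproduce that test from a script instead of a browser extension, a quick sketch (the URL is a placeholder):
$ch = curl_init('http://example.com/');   // placeholder for the affected page
curl_setopt($ch, CURLOPT_USERAGENT, 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);   // we only care about the status code here
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HTTP_CODE); // a 404 here but 200 with a normal browser UA points at UA-based filtering
curl_close($ch);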
So investigate whether you have any plugins, “firewall” rules, or similar set up to fight requests from “bots” – it might be that something is a little overzealous in that regard when it comes to visits from the FB scraper.
That doesn’t seem to be it though (it was an educated guess only, since this is often the cause of such problems), but as you said, “it's throwing a JavaScript stack trace. This seems like it might be getting caused by prerender” – let us know if you found the exact cause.
In the header of my page I send og parameters, but the counter of the Like button shows a number near 17,000. When I use the Facebook developers debug tool, I see that the canonical URL is not my URL, it's google.ru; for example:
http://developers.facebook.com/tools/debug/og/object?q=http%3A%2F%2Fseo-top-news.com.ua%2Fgoogle-vlojit-v-startapy-1-5-milliarda%2F
I really have a high opinion of Google, but I don't want to have a Google Like button on all of my articles :)
Can you help me?
Your site may have been hacked with malicious redirects -- if you look down by the "Redirect Path" section, Facebook's crawler is being redirected to a series of super-sketchy sounding sites.
Check your .htaccess for anything that shouldn't be there (this is a really common vector of attack), change your FTP password, and upgrade your WordPress install (and plugins).
On the debugger result page you link to, you can see the redirect path:
original: http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
302: http://dietrussia.ru/
302: http://webhitexpress.ru/in.cgi?2
302: http://www.google.ru/
The Facebook scraper is being sent through to Google.
This is because of the user-agent. If I curl your site with no user agent, I also get redirected:
curl http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
But if I curl it with a real browser user-agent, I get the proper page:
curl --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17" http://seo-top-news.com.ua/google-vlojit-v-startapy-1-5-milliarda/
The Facebook scraper uses the user-agent 'facebookexternalhit', so just make sure that user agent is being served the full page content, not the 302.
I would like to make a section on my site where users can test whether their websites are mobile-adapted.
Is there any way to do that? I have seen a lot of such sites on the net, but I don't want to just frame one of them; I'd like to make my own.
Mmmm... the trouble is that that link shows me how to make my own site mobile, but what I want to build is a PHP/HTML section of my Joomla site where users can enter their websites and it shows whether they are adapted.
Something like this: www.iphonetester.com
But that site doesn't use the user agent, so if you enter http://www.kalyma.com.ar it shows the desktop version instead of http://m.kalyma.com.ar, which is the mobile version. (This is because I have a plugin that redirects the site based on the user agent!)
Right... the only way I can think of is this (and it is untested...).
You need to curl the content of the page that someone is trying to test and set the user agent like this:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C25 Safari/419.3');
Then you can output the HTML into a div on your page using jQuery's load() function or similar.
You'll have to look into how to do it properly though: because you're downloading the page onto your own web server, any non-absolute links will need the base URL set correctly so that images and other resources load properly.
That's the only way I can think you can do it.
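Putting those pieces together, a minimal sketch (the target URL is a placeholder; the user-agent string is the one from above):
$url = 'http://www.example.com/';   // the page the user wants to test (placeholder)
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1C25 Safari/419.3');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow a redirect to an m. subdomain if there is one
$html = curl_exec($curl);
$finalUrl = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);
curl_close($curl);
// If $finalUrl differs from $url (e.g. an m. subdomain) or $html contains a mobile
// viewport meta tag, the site is at least reacting to mobile user agents.
The $html result is then what you would drop into the preview div mentioned above.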
Let me google that for you:
http://shaunmackey.com/articles/mobile/php-auto-bowser-detection/
Try the code from Redrome.com. They show how to load the page properly.