Ask Google to Stop Googlebot Crawl - bandwidth

Okay, so a WordPress gallery plugin led to a massive headache: with about 17 galleries each having their own pagination, the links within them created what might as well be an infinite number of variant URLs combining the various query variables from each gallery.
As a result, Google has not been very smart about it and has been HAMMERING the server, to the tune of 4 gigs of bandwidth an hour before I took action, and about 800 requests a minute against the same page, sending the server load up to 30 at one point.
It's been about 12 hours, and regardless of the changes I've made, Google is not listening (yet) and is still hammering away.
My question is: Is there a way to contact Google support and tell them to shut their misbehaving bot down on a particular website?
I want a more immediate solution as I do not enjoy the server being bombarded.
Before you say it, even though this isn't what I'm asking about, I've done the following:
- Redirected all traffic using the misused query variable back to the Googlebot IP, in the hope that the bot being forwarded back to itself would be a wake-up call that something is not right with the URL. (I don't care if this is a bad idea.)
- Blocked the most active IP address from accessing that site.
- Disabled the troubled plugin from creating those URLs.
- In Google Webmaster Tools/Search Console, set the URL parameters to "No: Doesn't affect page content" for the query variables.
Regardless of all of this, Google is still hammering away at 800 requests per minute/13 requests a second.
Yes, I could just wait it out, but I'm looking for a "HEY GOOGLE! STOP WHAT YOU ARE DOING!" solution besides being patient and allowing resources to be wasted.

Are URLs in emails indexed by search engines so they become publicly searchable?

I have read a few questions on here about e-mail clients prefetching URLs in e-mails. An answer to this seems to be to add a new confirmation page, where the user has to click a button to confirm the desired action.
But, this answer states the following:
As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.
This effectively makes all one-time use links like login/pass-reset/etc useless.
(Users of my service were complaining that one-time login links don't work for some of them and it appeared that BingPreview/1.0b is hitting the URL before the user even opens the inbox)
Drupal seems to be experiencing the same problem:
https://www.drupal.org/node/2828034
My major concern is with this statement:
As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.
If this is the case, any URL in an e-mail meant to confirm an action, e.g. confirming a login, subscription, or unsubscription, can end up searchable in a search engine, if that's what's meant by "indexed" in the quote above. In this case, it's Bing. Not even a dedicated confirmation page where the user confirms the desired action truly mitigates this.
Scenario #1
If I email the user a login link with a one-time token in the URL, that URL will end up in Bing. The token will have a short lifetime, let's say 5 minutes, so I doubt anyone will manage to find the URL on Bing before the user clicks it or it expires.
Scenario #2
The user gets an e-mail with a link to confirm a subscription. This link is perhaps valid for 24 hours. This might(?) be long enough for someone else to stumble over the link on a search engine and accidentally (or on purpose) confirm the subscription on behalf of the user.
Scenario #2 is not uncommon; as far as I am aware, double opt-in is even considered best practice.
Scenario #3
Unsubscribe URLs at the bottom of newsletters. These may be valid forever. You don't want them publicly searchable in a search engine.
Assume all the one-time confirmation links land on a confirmation page where the user confirms the desired action.
Is it truly the issue that URLs in e-mails are indexed by search engines, at least Bing? And will they actually end up publicly searchable? If not, what is meant by indexed in the quote above?
I'll add, for the sake of completeness, that I don't think I've had much of a problem with this in my own use of the web, so my gut feeling is that this is unlikely to be the case.
Is it truly the issue that URLs in e-mails are indexed by search engines, at least Bing?
I can't say definitively whether they are being indexed or not; only Bing could answer that. But they are certainly being visited, at least with a simple GET request. I just tested this by sending myself a link to a page on my website that logs the requests made against it, and indeed I'm seeing a GET coming from 207.46.13.181 (reverse DNS says msnbot-207-46-13-181.search.msn.com), which suggests that an automated program from search.msn.com is fetching the link. This leads me to believe that yes, they are trying to index the link's content somehow, but that's only my opinion really.
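For anyone wanting to reproduce that test, here is a minimal sketch of such a logging endpoint using the JDK's built-in HTTP server; the /email-test path and port 8080 are arbitrary choices, not anything special. Email yourself a link to it and watch what fetches it:
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

public class LinkLogger {
    public static void main(String[] args) throws Exception {
        // Listen on an arbitrary port and log every request that hits the test path.
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/email-test", exchange -> {
            System.out.printf("%s %s from %s UA=%s%n",
                    exchange.getRequestMethod(),
                    exchange.getRequestURI(),
                    exchange.getRemoteAddress(),
                    exchange.getRequestHeaders().getFirst("User-Agent"));
            exchange.sendResponseHeaders(200, -1); // empty 200 response
            exchange.close();
        });
        server.start();
    }
}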
And will they actually end up publicly searchable? If not, what is meant by "indexed" in the quote above?
Well, again, impossible to say unless you work for Bing. In any case, "indexing" means exactly what you think it does: parsing the content of a page to potentially include it in search results.
The real question here is: does this somehow represent a security problem or will it compromise my website's functionality?
It certainly has the potential to: if your confirmation/reset/subscription/whatever process relies on a single GET request with the appropriate GET parameter, then you should definitely revisit the strategy, as it obviously allows anyone to perform the action (even maliciously, for example by enumerating possible IDs for your GET parameters).
If the link you are sending contains sensitive information or can be used to alter important data for a user of your website, then you should at least put it behind a login page that only gives access to the user concerned. This way, anyone who wants to access it (including search engines) will be redirected to a login page if not already logged in.
If the link you are sending is just some kind of harmless confirmation link (e.g. subscribe/unsubscribe from a newsletter), then at least use a form inside the landing page to do the actual confirmation through a POST request (possibly also using a CSRF token), otherwise you will inevitably end up with false positives from crawlers triggering the action.
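As an illustration of that last point, here is a rough sketch of such a confirmation page written as a Java servlet, where only a POST guarded by a per-session CSRF token performs the action; the /confirm path, the parameter names and confirmSubscription() are hypothetical placeholders, not anything a particular framework provides:
import java.io.IOException;
import java.util.UUID;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class ConfirmServlet extends HttpServlet {

    // GET just renders a confirmation form carrying the emailed token plus a
    // per-session CSRF token; a crawler fetching the URL changes nothing.
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession(true);
        String csrf = UUID.randomUUID().toString();
        session.setAttribute("csrf", csrf);

        resp.setContentType("text/html");
        // Note: HTML-escape the token parameter in real code.
        resp.getWriter().println(
            "<form method='post' action='/confirm'>"
            + "<input type='hidden' name='token' value='" + req.getParameter("token") + "'>"
            + "<input type='hidden' name='csrf' value='" + csrf + "'>"
            + "<button type='submit'>Confirm</button>"
            + "</form>");
    }

    // Only the POST, with a matching CSRF token, actually performs the action.
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        HttpSession session = req.getSession(false);
        String expected = (session == null) ? null : (String) session.getAttribute("csrf");
        if (expected == null || !expected.equals(req.getParameter("csrf"))) {
            resp.sendError(HttpServletResponse.SC_FORBIDDEN, "Bad CSRF token");
            return;
        }
        confirmSubscription(req.getParameter("token")); // placeholder for your own logic
        resp.getWriter().println("Confirmed.");
    }

    private void confirmSubscription(String token) {
        // Placeholder: validate the emailed token and record the confirmation.
    }
}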

Facebook app blocked for posting too fast. What are the limits?

We (a local hackerspace) have a Tumblr blog and wanted to make ourselves a Facebook page. Before going live we wanted to import all our Tumblr content to Facebook so our fans on Facebook can browse it there as well. For this I made an app that reads all the posts from our Tumblr blog and publishes them to our new Facebook page (backdating those posts as well). Here's my problem: after the app does about ~130 re-posts (~260 operations: publish + backdate) I start getting an error:
Received Facebook error response of type OAuthException: It looks like you were misusing this feature by going too fast. You’ve been blocked from using it.
Learn more about blocks in the Help Center. (code 368, subcode 1390008)
The block is gone the next day, but after a similar number of operations it's back. A couple of hours later, once the block was gone again, I introduced 6-second delays between operations, but that didn't help and after 19 re-posts I was blocked again. Some facts:
- I am publishing posts to the feed of an as-yet-unpublished page that I am the (only) owner of.
- The app is a standalone Java application and uses restfb to work with Facebook.
- The line that is causing the error: facebookClient.publish("me/feed", FacebookType.class, params.toArray(new Parameter[0]));
- All publish operations contain a link, mostly to the respective posts on our Tumblr. Some contain a message, caption or name (depending on post type).
- I need to re-post ~900 posts from Tumblr, and I have done ~250 so far. Once the import is finished, I will likely put the app on a server, scheduled, to keep syncing new posts one at a time.
- This app is not meant to be used publicly; it is rather a personal utility (but the code will be posted to GitHub, should anybody need it).
This is my first experience with the Facebook API and I wasn't able to find a place where I could officially ask them this question. I could proceed by doing 100 posts/day, but I'm afraid I will eventually get banned for good, even though I don't feel I'm doing anything wrong.
I haven't included any more code here, as the code itself does not seem to be the problem, but rather the rate at which it is executed.
So, should I proceed with 100 posts/day and hope I won't be banned, or is there another "correct" way of dealing with this?
Thanks in advance!
I'm answering a bit late, but I just had this problem too, so I did some research: it seems that besides the rate limits shown in the Facebook docs, there is also a much stricter, undocumented limit on POST requests, intended to curb spam.
It isn't clearly specified, but it seems to depend on your relationship to the page you're writing to (admin or not), whether you post to multiple pages, and finally whether you post too quickly.
To answer the question: it seems it would have been fine if you had done roughly 1 post per minute or less.
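If it helps, here is a sketch of that kind of throttling with restfb; the one-minute pause is a guess based on the behaviour described here rather than a documented limit, and TumblrPost is just a stand-in for however the imported posts are modelled:
import com.restfb.FacebookClient;
import com.restfb.Parameter;
import com.restfb.types.FacebookType;
import java.util.List;

public class ThrottledPoster {

    public static void publishAll(FacebookClient facebookClient, List<TumblrPost> posts)
            throws InterruptedException {
        for (TumblrPost post : posts) {
            facebookClient.publish("me/feed", FacebookType.class,
                    Parameter.with("link", post.url),
                    Parameter.with("message", post.caption));
            // Stay well clear of the apparent spam threshold before the next publish;
            // ~60 seconds is a guess, not a documented figure.
            Thread.sleep(60_000);
        }
    }

    // Minimal stand-in for however the imported Tumblr posts are modelled.
    static class TumblrPost {
        String url;
        String caption;
    }
}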
I think you exceeded the rate limiting for your user ID.
- Your app can make 200 calls per hour per user in aggregate. As an example, if your app has 100 users, this means that your app can make 20,000 calls. One user could make 19,000 of those calls and another could make 1,000, so this isn't a per-user limit. It's a per-app limit.
- That hour is a sliding window, updated every few minutes.
- If your app is rate limited, all calls for that app will be limited, not just for a specific user.
- The number of users your app has is the average daily active users of your app, plus today's new logins.
Check this: https://developers.facebook.com/docs/graph-api/advanced/rate-limiting

Restrict Users from Programmatically posting form data

I have a very old ASP.NET application with a web form that has one dropdown box, two text boxes, and a Submit button.
All three are mandatory fields. Once the user clicks the Submit button, additional details based on the data entered are pulled from the database and shown on the next page.
On submit, the data is passed via a query string that looks like
http://myserver/myapp/search.aspx?f1=1&f2=tom&f3=sales
Though the application is doing what it is supposed to do, of late we have come across a lot of issues:
A couple of entities that are interested in our data wrote programs to programmatically build these query strings and hit our server.
This is slowing down the server, and regular users who search records manually are experiencing a lot of slowness.
Due to some legal restrictions we couldn't implement a CAPTCHA or have users authenticate.
I would appreciate it if you could let me know whether any of you have come across this kind of situation and how you dealt with it.
Thanks in advance.
You could implement source-based rate limiting, i.e. only allow so many requests per minute per IP address, and simply reject requests beyond that. You could also blacklist the IP addresses that are hitting your app most aggressively. Both of these policies can be enforced by a load balancer such as HAProxy or nginx.
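Your app is ASP.NET and in practice you would configure this in the load balancer, but just to show the shape of the logic, here is a rough per-IP limiter sketched as a Java servlet filter; the 60-requests-per-minute threshold and the fixed one-minute window are arbitrary illustration values:
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class IpRateLimitFilter implements Filter {
    private static final int MAX_PER_MINUTE = 60; // arbitrary illustration value
    private final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long now = System.currentTimeMillis();
        if (now - windowStart > 60_000) { // crude fixed one-minute window
            counts.clear();
            windowStart = now;
        }
        int seen = counts.computeIfAbsent(request.getRemoteAddr(), k -> new AtomicInteger())
                .incrementAndGet();
        if (seen > MAX_PER_MINUTE) {
            ((HttpServletResponse) response).sendError(429, "Too Many Requests");
            return;
        }
        chain.doFilter(request, response);
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}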

Facebook Crawler Bot Crashing Site

Did Facebook just implement some web crawler? My website has crashed a couple of times over the past few days, severely overloaded by IPs that I've traced back to Facebook.
I have tried googling around but can't find any definitive resource regarding controlling Facebook's crawler bot via robots.txt. There is a reference that suggests adding the following:
User-agent: facebookexternalhit/1.1
Crawl-delay: 5
User-agent: facebookexternalhit/1.0
Crawl-delay: 5
User-agent: facebookexternalhit/*
Crawl-delay: 5
But I can't find any specific reference on whether the Facebook bot respects robots.txt. According to older sources, Facebook "does not crawl your site". But this is definitely false, as my server logs show it crawling my site from a dozen-plus IPs in the ranges 69.171.237.0/24 and 69.171.229.115/24, at a rate of many pages per second.
And I can't find any literature on this. I suspect it is something new that FB implemented over the past few days, since my server has never crashed before.
Can someone please advise?
As discussed in this similar question on Facebook and Crawl-delay, Facebook does not consider itself a bot, and doesn't even request your robots.txt, much less pay attention to its contents.
You can implement your own rate-limiting code as shown in the similar question linked above. The idea is to simply return HTTP code 503 when your server is over capacity, or is being inundated by a particular user agent.
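As a rough sketch of that idea (written here as a Java servlet filter purely for illustration; adapt it to whatever your site actually runs on), where serverOverloaded() is a placeholder for your own capacity check:
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class CrawlerBackoffFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String ua = ((HttpServletRequest) req).getHeader("User-Agent");
        if (ua != null && ua.contains("facebookexternalhit") && serverOverloaded()) {
            HttpServletResponse resp = (HttpServletResponse) res;
            resp.setHeader("Retry-After", "60"); // hint to come back later
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            return;
        }
        chain.doFilter(req, res);
    }

    private boolean serverOverloaded() {
        // Placeholder: plug in a real capacity check (load average, queue depth,
        // an in-flight request counter, etc.).
        return false;
    }

    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}
}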
It appears that those working for huge tech companies don't understand that "improve your caching" is something small companies don't have the budget to handle. We are focused on serving the customers who actually pay money, and don't have time to fend off rampaging web bots from "friendly" companies.
We saw the same behaviour at about the same time (mid October) - floods of requests from Facebook that caused queued requests and slowness across the system. To begin with it was every 90 minutes; over a few days this increased in frequency and became randomly distributed.
The requests appeared not to respect robots.txt, so we were forced to think of a different solution. In the end we set up nginx to forward all requests with a Facebook user agent to a dedicated pair of backend servers. If we had been using nginx > v0.9.6 we could have done a nice regex for this, but we weren't, so we used a mapping along the lines of:
map $http_user_agent $fb_backend_http {
    "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)" 127.0.0.1:80;
}
This has worked nicely for us; during the couple of weeks that we were getting hammered this partitioning of requests kept the heavy traffic away from the rest of the system.
It seems to have largely died down for us now - we're just seeing intermittent spikes.
As to why this happened, I'm still not sure - there seems to have been a similar incident in April that was attributed to a bug (http://developers.facebook.com/bugs/409818929057013/), but I'm not aware of anything similar more recently.
Whatever Facebook has invented, you definitely need to fix your server, as it is currently possible to crash it with external requests.
Also, this is just the first hit on Google for facebookexternalhit: http://www.facebook.com/externalhit_uatext.php

Is it possible to know where the user is coming from when they use the back button?

For example:
if a user goes to google -> example.com -> newwebsite.com
If they then go back to example.com, the HTTP referrer will still be google.com.
How can I detect that they went to newwebsite.com?
I believe that the back button will send the HTTP headers that were sent to the site the first time around, since it's not really a new visit.
Say you displayed an error page if the user's HTTP referrer was newwebsite.com. The first time they visited, they would get your site. If they went to newwebsite.com and then hit back (meaning they wanted to go back in time, through their browser history, not load the page again with new headers), they would get an error page, and the purpose of the back button would be defeated. I don't know if this is the reasoning behind that behaviour, but it makes sense to me that way.
Maybe it's possible, but it would be entirely browser-dependent. Why do you need this functionality, anyway? newwebsite.com isn't referring the user to your website at all; there's no connection between the two. It just happens to be the last page that the user visited.
If a visitor uses the back button, the page might be loaded from browser cache. In that case, no referrer is sent.
Using Google Analytics, you can see how many visitors came from a given website. This might give you some information.
I don't believe this is generally possible. You could pull tricks with JavaScript on your site so that all the links navigated from there could be detected and recorded, but once the user is off your site you have no control.
If you provided the browser, i.e. developed your own, then you could choose to expose the browser history via an API.
http://jeremiahgrossman.blogspot.com/2006/08/i-know-where-youve-been.html describes a technique for exploiting the browser's willingness to modify links to indicate that they have been visited (e.g. changing the colour of the link) so that visited sites can be detected. However, this only works for a pre-declared set of links; it's not a generally applicable approach.
My feeling is that attempts to work against the nature of browsers - users can hop around all over the place - tend to lead to unsatisfactory 79% solutions that mystify users.
What problem are you actually trying to solve?
You can use sessions in order to track the path a visitor takes through your pages. It really works well; try it.
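For what it's worth, here is a minimal Java-flavoured sketch of that; the "trail" attribute name is arbitrary, and note that this only records movement within your own site, so it still cannot tell you that the visitor went to newwebsite.com in between:
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpServletRequest;

public class PageTrail {
    // Call this from each page/controller to append the current URI to a
    // per-session list of pages visited on your own site.
    @SuppressWarnings("unchecked")
    public static void record(HttpServletRequest request) {
        List<String> trail = (List<String>) request.getSession(true).getAttribute("trail");
        if (trail == null) {
            trail = new ArrayList<>();
            request.getSession().setAttribute("trail", trail);
        }
        trail.add(request.getRequestURI());
    }
}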