Request bot to reparse robots.txt - robots.txt

I am writing a proxy server that maps youtube.com to another domain (so users can easily access YouTube from countries like Germany without search results and videos being censored).
Unfortunately there was a bug in my robots.txt. It's fixed now, but Baiduspider got my old robots.txt and has been trying to index the whole website for a couple of days.
Because YouTube is quite a big website, I don't think this process will end soon :-)
I already tried redirecting Baiduspider to another page and sending it a 404, but it has already parsed too many paths.
What can I do about this?

Stop processing requests from Baiduspider
With lighttpd, append to lighttpd.conf:
$HTTP["useragent"] =~ "Baiduspider" {
    url.access-deny = ( "" )
}
Sooner or later, Baiduspider should refetch the robots.txt
(see http://blog.bauani.org/2008/10/baiduspider-spider-english-faq.html)
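If you decide you want Baiduspider kept away from the whole site even after it refetches, the corresponding rules in robots.txt would be a minimal sketch like this (assuming a blanket block is really what you want):
User-agent: Baiduspider
Disallow: /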

Related

Facebook share counter reset after HTTPS

I just moved my blog to HTTPS and, of course, all the Facebook share counters are now reset to 0.
I've spent several hours reading stuff online, and the solution I found was to point the og:url tag to the old URLs (with http instead of https).
It worked for a day, but now all the counters are back to 0.
The strange thing is that if I check the URLs (both with https and http) with the Open Graph debugger, it returns 0 shares for both of them!
I really don't know what to do! Is there a way to get back the counters of the http version of the URLs? Or, as an alternative, is there a way to sum the two counters?
p.s. I already activated the 301 redirect for the whole blog in my .htaccess file.
Facebook treats HTTP and HTTPS as two different URLs, and therefore two different Open Graph objects, even if the rest of the URL is the same.
p.s. I already activated the 301 redirect for the whole blog in my .htaccess file.
And that's your mistake ... You need to keep the old HTTP URLs available for the FB scraper to read the OG meta data from; if you redirect the scraper to the HTTPS version as well, then it concludes that the HTTPS version is the actual correct URL for this piece of content - and therefore you have just undone what you tried to do by having og:url point to the old HTTP address.
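For reference, the og:url markup on an HTTPS page would look something like this (www.example.com is a placeholder for your own blog's domain and permalink):
<meta property="og:url" content="http://www.example.com/my-post/" />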
See https://developers.facebook.com/docs/plugins/faqs#faq_1149655968420144 for more details.
The scraper can be recognized by the User-Agent request header it sends - see https://developers.facebook.com/docs/sharing/webmasters/crawler
(How to exclude clients that send a certain user agent from redirection via .htaccess is something that should be easy enough to research.)
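As a rough sketch of that idea - assuming mod_rewrite and that your existing rule simply redirects everything to HTTPS - you could skip the redirect for Facebook's crawler user agents so the scraper can still read the og: tags over plain HTTP:
RewriteEngine On
RewriteCond %{HTTPS} off
# let Facebook's scraper (facebookexternalhit / Facebot) through without redirecting
RewriteCond %{HTTP_USER_AGENT} !(facebookexternalhit|Facebot) [NC]
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]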

Disallow certain URLs in robots.txt

I'm currently running a web service where people can browse products. The URL for that is basically just /products/product_pk/. However, we don't serve products with certain product_pks, e.g. nothing smaller than 200. Is there a way to discourage bots from hitting URLs like /products/10/ (since they will receive a 404)?
Thank you for your help :)
I am pretty sure that crawlers don't try and fail on auto-generated URLs. A crawler fetches your website and finds the next links to crawl. If you have any links that return 404, that is bad design on your site, since they should not be there.
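If you still want to discourage crawling of those IDs explicitly, a robots.txt sketch is below; note that robots.txt has no way to express "everything smaller than 200", so each blocked ID needs its own line (the IDs shown are just examples, and the trailing slash keeps /products/10/ from also matching /products/100/):
User-agent: *
# one Disallow line per product_pk that does not exist
Disallow: /products/10/
Disallow: /products/11/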

Googlebot guesses URLs. How to avoid/handle this crawling?

Googlebot is crawling our site. Based on our URL structure it is guessing new possible URLs.
Our structure is of the kind /x/y/z/param1.value. Now Googlebot exchanges the values of x, y, z and value with tons of different keywords.
The problem is that each call triggers a very expensive operation, and it returns positive results only in very rare cases.
I tried to set a URL parameter in the crawling section of Webmaster Tools (param1. -> no crawling). But this does not seem to work, probably because of our inline URL format (would it be better to use the HTML GET format ?param1=..?).
As Disallow: */param1.* does not seem to be an allowed robots.txt entry, is there another way to disallow Google from crawling these pages?
As another solution I thought of detecting Googlebot and returning it a special page.
But I have heard that this will be punished by Google.
Currently we always return an HTTP status code 200 and a human-readable page which says: "No targets for your filter criteria found". Would it help to return another status code?
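For what it's worth, Google documents support for * wildcards in robots.txt as long as the rule starts with a slash, so a sketch along these lines (with param1 standing for the literal parameter name in your URLs) might be worth a try:
User-agent: Googlebot
Disallow: /*param1.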
Note: This is probably not a general answer!
Joachim was right. It turned out that Googlebot is not guessing URLs.
Doing a bit of research, I found out that half a year ago I added a new DIV to my site containing those special URLs (which I had unfortunately forgotten). A week ago Googlebot started crawling it.
My solution: I deleted the DIV and I now return a 404 status code for those URLs. I think that, sooner or later, Googlebot will stop crawling them after revisiting my site.
Thanks for the help!

How to prevent Google from indexing redirect URL I do not own

A domain name that I do not own is redirecting to my domain. I don't know who owns it or why it is redirecting to my domain.
This domain, however, is showing up in Google's search results. When doing a whois, it also returns this message:
"Domain:http://[baddomain].com webserver returns 307 Temporary Redirect"
Since I do not own this domain, I cannot set up a 301 redirect or disable it. When clicking the bad domain in Google, it shows the content of my website, but baddomain.com stays visible in the URL bar.
My question is: How can I stop Google from indexing and showing this bad domain in the search results and only show my website instead?
Thanks.
Some thoughts:
You cannot directly stop Google from indexing other sites, but what you could do is add the canonical tag to your pages so Google can see that the original content is located on your domain and not on the "bad domain".
For example, check out: https://support.google.com/webmasters/answer/139394?hl=en
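A sketch of what that looks like in the <head> of each page, with www.example.com standing in for your own domain and path:
<link rel="canonical" href="http://www.example.com/your-page/" />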
Other actions can be taken SEO-wise if the 'baddomain' is outscoring you in the search rankings, because then it sounds like your site could use some optimizing.
The better your site and domain rank in the SERPs, the less likely it is that people will see the scraped content and 'baddomain'.
You could, however, also look at the referrer of the request, and if it is the 'bad domain' you should be able to redirect to your own domain, change the content, etc., because the code is being run on your own server.
But that might be more trouble than it's worth, as you'd need to investigate how the 'baddomain' is doing things and code accordingly (probably an iframe or similar from what you describe, but that can still be circumvented using scripts).
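One rough way to act on the 'change the content' part of that idea, as a sketch assuming mod_rewrite (baddomain.com stands in for the actual offending domain, and the Referer header is only present if the other site really frames or links your content):
RewriteEngine On
# refuse to serve content to requests referred by the offending domain
RewriteCond %{HTTP_REFERER} baddomain\.com [NC]
RewriteRule ^ - [F]
Returning 403 here avoids the redirect loops you could run into by redirecting requests that already arrive at your own host.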
Depending on what country you and the 'baddomain' are located in, there are also legal options, so-called DMCA complaints. This, however, can also be quite a task, and it's often not worth it because a new domain will just pop up.

Domain blocked and no data scraped

I recently purchased the domain www.iacro.dk from UnoEuro and installed WordPress, planning to integrate blogging with Facebook. However, I cannot even share a link to the domain.
When I try to share any link on my timeline, it gives the error "The content you're trying to share includes a link that's been blocked for being spammy or unsafe: iacro.dk". Searching, I came across Sucuri SiteCheck which showed that McAfee TrustedSource had marked the site as having malicious content. Strange considering that I just bought it, it contains nothing but WordPress and I can't find any previous history of ownership. But I got McAfee to reclassify it and it now shows up green at SiteCheck. However, now a few days later, Facebook still blocks it. Clicking the "let us know" link in the FB block dialog got me to a "Blocked from Adding Content" form that I submitted, but this just triggered a confirmation mail stating that individual issues are not processed.
I then noticed the same behavior as here and here: When I type in any iacro.dk link on my Timeline it generates a blank preview with "(No Title)". It doesn't matter if it's the front page, an .htm document or even an image - nothing is returned. So I tried the debugger, which returns the very generic "Error Parsing URL: Error parsing input URL, no data was scraped.". Searching on this site, a lot of people suggest that missing "og:" tags might cause no scraping. I installed a WP plugin for that and verified tag generation, but nothing changed. And since FB can't even scrape plain htm / jpg from the domain, I assume tags can be ruled out.
Here someone suggests that 301 redirects could be a problem, but I haven't set up any redirection - I don't even have a .htaccess file.
So, my questions are: Is this all because of the domain being marked as "spammy"? If so, how can I get the FB ban lifted? However, I have seen examples of other "spammy" sites where the preview is being generated just fine, e.g. http://dagbok.nu described in this question. So if the blacklist is not the only problem, what else is wrong?
This is driving me nuts so thanks a lot in advance!
I don't know the details, but it is a problem that Facebook has with websites hosted on shared servers, i.e. the server hosting your website also hosts a number of other websites.