I have links with this structure:
http://www.example.com/tags/blah
http://www.example.com/tags/blubb
http://www.example.com/tags/blah/blubb (for all items that match BOTH tags)
I want Google & Co. to crawl all links that have ONE tag in the URL, but NOT the URLs that have two or more tags.
Currently I use the HTML meta tag "robots" with "noindex, nofollow" to solve the problem.
Is there a robots.txt solution (that works at least for some search bots) or do I need to continue with "noindex, nofollow" and live with the additional traffic?
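For reference, the meta element I currently put on the pages that carry two or more tags looks roughly like this (just a sketch of my current approach):
<meta name="robots" content="noindex, nofollow">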
I don't think you can do it using robots.txt. The standard is pretty narrow (no wildcards, must be at the top level, etc.).
What about disallowing them based on user-agent in your server?
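If you go the server route, here is a minimal sketch of the idea (Flask is used purely for illustration; the bot list, route, and placeholder response are my assumptions, not something from the question):
from flask import Flask, request, abort

app = Flask(__name__)

# Hypothetical list of crawler User-Agent tokens to block on multi-tag URLs.
CRAWLER_TOKENS = ("googlebot", "bingbot", "yandex")

@app.route("/tags/<path:tags>")
def tag_page(tags):
    ua = (request.headers.get("User-Agent") or "").lower()
    # A slash inside <tags> means the URL carries two or more tags, e.g. /tags/blah/blubb.
    if "/" in tags and any(token in ua for token in CRAWLER_TOKENS):
        abort(403)  # refuse known crawlers; normal visitors still get the page
    return "Items tagged: " + tags  # placeholder for the real page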
I was going through many websites' robots.txt files to check whether I could scrape some specific pages. I keep seeing the following pattern:
User-agent: *
Allow: /some-page
Disallow: /some-other-page
There is nothing else in the robots.txt file. Does it mean that all the other pages on the given website are available to be scraped?
P.S. I tried googling this specific case, but had no luck.
According to this website, Allow is used to allow a directory when its parent may be disallowed. I found this website quite useful as well.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Regarding your question, if the remaining pages aren't covered by a Disallow directive, you should be okay.
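For a concrete (made-up) example, a record like this blocks everything under /private/ but still lets bots that understand Allow fetch one page inside that directory:
User-agent: *
Allow: /private/landing-page.html
Disallow: /private/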
We are using the following robots.txt on our site:
User-agent: *
Disallow: /
We'd like to keep this behaviour (not allowing crawlers to index any part of the site), but we would also like search engines to pick up the meta title and description, so that these texts show up nicely when someone enters the domain name into a search engine.
As far as I can see, the only workaround is to create a separate indexable page that contains only meta tags. Is this the only way to achieve our goal? Will it have any side effects?
With this robots.txt, you disallow bots from crawling the documents on your host. Bots are still allowed to index the URLs of your documents (e.g., if they find links on external sites), but since they may not fetch the pages, they can't read your head element and can't use its content to provide a title or description in their SERP.
There’s no standard way to allow bots to access the head but not the body.
Some search engines might display metadata from other sources, e.g., from the Open Directory Project (you could disallow this with the noodp value for the meta-robots element) or the Yahoo Directory (you could disallow this with the noydir value).
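For example, both values can go into a single meta-robots element in the head of the affected pages (a small sketch):
<meta name="robots" content="noodp, noydir">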
If you’d create a document that only contains metadata in the head, and allow bots to crawl it in your robots.txt, bots might crawl and index it, but the metadata will of course be shown for this page, not for other pages on your host.
I have a question for you: I need to maintain two sites (let's name them example.com and yyy.com), and they will be something like aliases of each other.
I want visitors to be able to access pages with the same content via both of them.
What's the best way of doing this without getting into trouble with search engines?
I know about 301 redirects, but I want visitors to stay on example.com or yyy.com, with the same name showing up in the address bar, not to be redirected.
One thing you could do is to use the rel=canonical tag on the pages of the site you consider to be "the copy".
Basically, in the head section of each page's HTML you can tell which page on the "original" site has the same content.
So if (for instance) your sites are called www.yourmainsite.com and www.yoursecondsite.com, you should tag testpage.htm on yoursecondsite.com like this:
<link rel="canonical" href="http://www.yourmainsite.com/testpage.htm"/>
See here for more details.
Otherwise you can simply tell search engines not to index yoursecondsite.com in your robots.txt:
User-agent: *
Disallow: /
Warning: I'm not an SEO person. I did have to implement something similar, but take my advice with a grain of salt.
From a theoretical point of view, the "Content-Location" HTTP header was invented for this, as defined here and explained here.
However, search engines prefer the "canonical" link tag (as in Paolo's explanation) for the same purpose, because the "Content-Location" header is mostly misused by web designers.
I would probably use both.
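For illustration only (reusing the example host names from Paolo's answer), the response for the page on the secondary site could then carry the header roughly like this:
HTTP/1.1 200 OK
Content-Location: http://www.yourmainsite.com/testpage.htm
together with the rel="canonical" link element shown above in the HTML head.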
On my site I have a directory of things that is generated through jQuery AJAX calls, which subsequently create the HTML.
To my knowledge, Google and other bots aren't aware of DOM changes after the page load, and won't index the directory.
What I'd like to achieve is to serve the search bots a dedicated page that contains only the links to the things.
Would adding a noscript tag to the directory page be a solution? (in the noscript section, I would link to a page which merely serves the links to the things.)
I've looked at both the robots.txt and the meta tag, but neither seem to do what I want.
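As a sketch of the noscript idea (the URL is made up), the directory page could contain something like:
<noscript>
  <a href="/things-plain.html">Browse the full list of things</a>
</noscript>
where /things-plain.html would be a static page that simply lists plain links to every thing.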
It looks like you stumbled on the answer to this yourself, but I'll post it here anyway for posterity:
Implement Google's AJAX crawling specification. If links to your page contain #! (a URL fragment starting with an exclamation point), Googlebot will send everything after the ! to the server in the special query string parameter _escaped_fragment_.
You then look for the _escaped_fragment_ parameter in your server code, and if present, return static HTML.
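A rough sketch of that server-side check (Flask is used only for illustration; the markup and URLs are placeholders, not part of the specification):
from flask import Flask, request

app = Flask(__name__)

@app.route("/directory")
def directory():
    fragment = request.args.get("_escaped_fragment_")
    if fragment is not None:
        # Googlebot requested /directory?_escaped_fragment_=...:
        # return pre-rendered static HTML with plain links to the things.
        return "<html><body><a href='/things/1'>Thing 1</a></body></html>"
    # Normal visitors get the JavaScript-driven page that builds the directory via AJAX.
    return "<html><body><div id='directory'></div><script src='/app.js'></script></body></html>"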
(I went into a little more detail in this answer.)
My robots.txt in Google Webmaster Tools shows the following values:
User-agent: *
Allow: /
What does it mean? I don't have much knowledge about this, so I'm looking for your help. I want to allow all robots to crawl my website; is this the right configuration?
That file will allow all crawlers access:
User-agent: *
Allow: /
This basically allows all user agents (the *) access to all parts of the site (the /).
If you want to allow every bot to crawl everything, this is the best way to specify it in your robots.txt:
User-agent: *
Disallow:
Note that the Disallow field has an empty value, which means according to the specification:
Any empty value, indicates that all URLs can be retrieved.
Your way (with Allow: / instead of Disallow:) works, too, but Allow is not part of the original robots.txt specification, so it’s not supported by all bots (many popular ones support it, though, like the Googlebot). That said, unrecognized fields have to be ignored, and for bots that don’t recognize Allow, the result would be the same in this case anyway: if nothing is forbidden to be crawled (with Disallow), everything is allowed to be crawled.
However, formally (per the original spec) it’s an invalid record, because at least one Disallow field is required:
At least one Disallow field needs to be present in a record.
I understand that this is a fairly old question and already has some pretty good answers, but here are my two cents for the sake of completeness.
As per the official documentation, there are four ways to allow robots complete access to your site.
Clean:
Specify a global matcher with an empty Disallow directive, as mentioned by #unor. So your /robots.txt looks like this:
User-agent: *
Disallow:
The hack:
Create a /robots.txt file with no content in it, which defaults to allowing everything for all types of bots.
The "I don't care" way:
Do not create a /robots.txt at all, which should yield exactly the same results as the two options above.
The ugly:
From the robots documentation for meta tags, you can use the following meta tag on all the pages of your site to let bots know that these pages are not supposed to be indexed.
<META NAME="ROBOTS" CONTENT="NOINDEX">
In order for this to apply to your entire site, you will have to add this meta tag to all of your pages, and it must be placed inside the HEAD tag of each page. More about this meta tag here.
It means you allow every (*) user-agent/crawler to access the root (/) of your site. You're okay.
I think you are good; you're allowing all pages to be crawled:
User-agent: *
Allow: /