How to fix robots.txt

Full disclaimer: I am not a programmer, I am an SEO trying to learn how to not rely on my developer for every little question I have.
Currently my issue is this: I use Screaming Frog to crawl my sites to lay out the page titles, meta descriptions, h1, h2, etc., so I can more easily plan out my changes.
The other day I wanted to run a report for my client and my own company website and got the following back.
So I know robots.txt is a way to have pages on your site without having Google crawl them. What I don't know is why an entire site would have this message as opposed to just some pages.
Can anyone give advice on how to fix this, or links to how-tos? I get this issue a lot and would like to educate myself so I don't have to wait for someone else. I get these as well when I try indexing websites on Google Search Console.
Many Thanks

What I don't know is why an entire site would have this message as opposed to just some pages.
The robots.txt for your website has not been written properly, if the intention is to have its content indexed.
Or Screaming Frog might have a bug, if the robots.txt file is in fact written properly.
Or some webmaster decided the content was not worth indexing on Google, or that bots would eat too much bandwidth (and so blocked access wholesale rather than selectively).

Checking the current robots.txt file on that website, I see this content:
User-Agent: *
Disallow:
Which means that any page of that website is allowed to be crawled by any crawler (here is an explanation of that file's syntax: https://moz.com/learn/seo/robotstxt).
So the current file should not cause the error the OP mentions. Seeing that this question is from June 30, 2017 and the robots.txt file was last modified on July 11, 2017, it seems the OP may have already fixed whatever problem they had after opening this question.
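If you want to double-check a robots.txt yourself rather than rely on a crawler's report, here is a minimal sketch using Python's standard urllib.robotparser (the user-agent string and URL below are just placeholders):
from urllib.robotparser import RobotFileParser

# The two lines quoted above: an empty Disallow value blocks nothing.
rules = [
    "User-Agent: *",
    "Disallow:",
]
rp = RobotFileParser()
rp.parse(rules)
# Under these rules any crawler may fetch any page.
print(rp.can_fetch("Screaming Frog SEO Spider", "https://www.example.com/any-page"))  # True
Swapping parse() for set_url("https://www.example.com/robots.txt") followed by read() would test a live file instead.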

Related

Why would "Disallow: /*?s=" be used in a robots.txt file?

We got notice from Google's Search Console that one of our blog posts couldn't be crawled. When inspecting the URL from Google Search Console, it reports that the page was blocked by the following in our robots.txt file.
Disallow: /*?s=
I also ask why "Disallow: /*?s=" would be used. Why worry about paths that contain the letter "s"? If we remove it, what's the risk? Thanks so much in advance for any additional insight that can be shared - P
This query is commonly seen on WordPress-based sites: ?s= is the query-string parameter WordPress uses for its built-in search, so this rule keeps crawlers out of internal search result pages.
There may be several types of content on your site, and the site builder may have wanted search engines to reach only certain types of content by another route.
It makes sense, for example, on a store site that wants users to search for products only through a customized search form, so that they do not wander behind the scenes of the site.
Googlebot has a number of ways to identify a WordPress-based site, which is probably why it is requesting URLs with that pattern at the end of the path.
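To see what that pattern actually matches, here is a rough sketch of robots.txt wildcard matching in Python (the blocked_by_rule helper is hypothetical, written only for illustration; real crawlers implement this matching themselves):
import re
from urllib.parse import urlsplit

def blocked_by_rule(url, rule="/*?s="):
    # '*' in a robots.txt path rule matches any run of characters,
    # and matching starts at the beginning of the path.
    parts = urlsplit(url)
    path = parts.path + ("?" + parts.query if parts.query else "")
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

# WordPress search results use the ?s= parameter, so they are blocked...
print(blocked_by_rule("https://example.com/?s=blue+widgets"))          # True
# ...but an ordinary post URL is not.
print(blocked_by_rule("https://example.com/2017/06/some-blog-post/"))  # False
So removing the rule would mainly mean crawlers could start spending time on internal search result pages.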

Ignoring robots.txt and meta tags in a crawler

Is there a way to make a web crawler ignore the robots.txt file and meta tags? Yes, I know this could come with legal repercussions. This question is much like another question, but the answers were very vague and I didn't quite get it. Any help is appreciated.
A web crawler doesn't have to abide by a robots.txt file, because there is no technical measure in place to stop it if it doesn't.
A simple web crawler might do:
FOR SITE IN SEARCH
    IF ALLOWED_TO_CRAWL_BASED_ON_ROBOTS_TXT(SITE)
        FOR LINK IN SITE
            DO_SOMETHING
this could be modified to:
FOR SITE IN SEARCH
    FOR LINK IN SITE
        DO_SOMETHING
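For comparison, here is a runnable sketch of that second version in Python (assuming the third-party requests and beautifulsoup4 packages; the start URL is a placeholder). It simply never consults robots.txt or a robots meta tag, which is all that "ignoring" them amounts to:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=10):
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        # No robots.txt lookup and no check of <meta name="robots"> here;
        # nothing enforces either, they are purely advisory.
        resp = requests.get(url, timeout=10)
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))
    return seen

crawl("https://www.example.com/")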

Robots.txt Disallow

I'm working with an e-commerce system at the moment that is throwing up hundreds of potential duplicate page URLs, and I'm trying to work out how to hide them via robots.txt until the developers are able to sort their ...... out.
I have managed to block most of them but got stuck on the last type, so the question is:
I have 4 URLs to the same product page with the structure below; how do I block the first one but not the others?
www.example.com/ProductPage
www.example.com/category/ProductPage
www.example.com/category/subcategory/ProductPage
www.example.com/category/subcategory/ProductPage/assessorypage
So far the only idea I can come up with is using:
Disallow: /*?id=*/
This, however, blocks everything…
EDIT: I believe I may have found a way to do it: set up the robots.txt file to disallow all, then just allow the specific paths I want below that, and then…once again disallow any specific paths after that.
Does anyone know if using disallow > allow > disallow like this has a negative effect on SEO?
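For what it's worth, a sketch of that disallow-then-allow idea with placeholder paths (Googlebot resolves conflicts by the most specific matching rule, so the order of the lines does not actually matter to it):
User-agent: *
# block everything by default
Disallow: /
# re-allow the sections you want crawled
Allow: /category/
# then block specific paths inside them again (placeholder path)
Disallow: /category/some-path-to-hide
Be aware that Disallow: / also blocks the home page and everything else you haven't explicitly allowed, which is one reason the rel="canonical" approach in the answer below is often the safer fix.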
You could add a link element with the rel="canonical" attribute. This will help search engines know which URL is the 'right' one, so there is not more than one URL per product in the search results.
Read here for more information
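For example, each duplicate URL would carry the same link element in its <head>, pointing at whichever version you want indexed (placeholder URL shown):
<link rel="canonical" href="https://www.example.com/category/ProductPage">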

Prevent Google from indexing

Hi sirs, what's the best way to prevent Google from showing a folder in the search engine? E.g. www.example.com/support - what should I do if I want the support folder to disappear from Google?
The first thing I did was place a robots.txt file and include this code:
User-agent: *
Disallow: /support/etc
But the result is a total disaster: I am not able to use the support page anymore unless I remove the robots.txt.
What's the best thing to do?
robots.txt shouldn't affect the way your pages function. If in doubt, you can use a generator tool such as http://www.searchenginepromotionhelp.com/m/robots-text-creator/simple-robots-creator.php or http://www.seochat.com/seo-tools/robots-generator/
When disallowing in the robots file, you can explicitly specify a file or subfolder rather than just a folder.
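For example, a sketch with placeholder paths, blocking only the parts of /support you actually want hidden rather than the whole folder:
User-agent: *
# block one subfolder under /support
Disallow: /support/internal/
# block a single file (placeholder name)
Disallow: /support/private-page.html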
You can also use a meta tag in your document to tell the crawler not to index it:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
what's the best way to prevent Google from showing a folder in the search engine?
A robots.txt file is the right way to do this. Your example is correct for blocking the /support/etc directory and its descendants.
I am not able to use the support page anymore unless I remove the robots.txt
It doesn't make sense that a robots.txt file would affect the way your site functions, and certainly it should never affect which pages can be accessed by a human. I suspect something else is awry -- check your server logs to see what kinds of errors are being recorded.
While not the preferred method of limiting robot access, Google talks about using a noindex meta tag here. This will also prevent the various pages from showing up if they are linked to by a site other than your own.
A good discussion of limiting bots that visit your site can be found here.

Ethics of blocking external hotlinking [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm just looking through some of the webmaster stats that Google provides, and noticed that the most common links to our website are to some research articles that we've put up in PDF format. The articles are also available on the site in HTML.
I was looking at the sites (mostly forums and blogs) which link to these articles and was thinking that none of the people clicking the links would actually get to see our website, and that we're giving something away for free and not even getting some page views in return.
I thought that maybe I could change my server settings to redirect external requests to these files to the HTML version. This way, the users still get the same content (albeit in an unexpected format), and we'd get these people to see our website and hopefully explore it some more. Requests coming from my site should be let through to the PDF. Though I don't know how to set this up just yet (keep an eye out for a follow-up question here), I'm sure this is technically possible. The only question is: is that a good idea?
What would you consider the downsides of redirecting traffic from external sources such that they see our site, not just get our content? Do they outweigh the benefits?
The only other alternative I can see is to make our branding and URL much more visible in the PDF files themselves. Any thoughts?
Hopefully your PDFs are equally branded so that visitors will feel compelled to search further in your website. That might be just as important as having visitors briefly stop-over at your website.
I'm usually opposed to all such redirects as harmful to usability. However, in this case a basic content-type negotiation takes place, and this might be acceptable. Just make sure that it doesn't break downloads of the PDF documents for users who might have disabled their referers in the browser (I do this, for one).
Sure you could cut them off, but there is a bigger issue at play: Why aren't these people finding you before they are finding these moocher sites?
Possible reasons are:
a) they did find your site, but not the content they were looking for, even though it's obviously there, or
b) your site never appeared in their search results.
You may want to consider a site redesign in order to address those concerns before cutting off what appears to be a reliable source of information about your target audience (for you and the people who get your PDFs from elsewhere).
In the meantime, I would suggest you allow the traffic, add a cover page to all of your PDFs that is basically a full-page ad for your site, and then enlarge the font on the copyright section of each page so the authorship is very prominent. You have a built-in audience now, they just don't know it yet. Show them where the source is.
Eventually, the traffic will come to you and know you as a reliable source for that information.
I would do it. It's your site and your data.
The hot-linkers are essentially 'guests' and you can make the rules for your guests.
If they don't like it, they don't have to link.
I would add a page at the beginning of each article with info about the website, the current article and links to other articles on your website.
I find it more convenient than redirecting the user to a page on your website (that's annoying). Most people right-click and download PDF files; what would your redirect do then? ;)
I think the proper thing to do in this situation is to leave out the redirects. Here's why:
There's nothing worse than expecting to go somewhere/get something and not getting it (the negative impact would outweigh the positive.)
Modify your content to add a footer such as: "like what you saw, we've got more, check us out at www.url.com"
If your content is good, users will check out your website. These are the visitors you want, they're more likely to stick around and provide your site with value (whatever that may be.) Those that you've coerced may provide you with an extra click or two, but you will likely not see any value given back to your site.
Look at other successful sites that give something away for free: Joel on Software, Seth Godin, Tim Ferriss, 37Signals. The long term will provide better, more consistent value than the short term.
If you go for this solution, check whether redirecting to the HTML version also changes the file name displayed by the browser if somebody uses 'save as' on the link; otherwise an HTML page would be saved with a .pdf extension. Apart from that, I can see no reason why you shouldn't do it.
As an alternative, see if you can add a link to your site at the top of the PDF file. That way readers are reminded where it comes from even if someone else sent it to them by email.