Blocked links in sitemap - robots.txt

I'm using an online sitemap generator tool which generates links even for pages that are blocked in robots.txt. Do these blocked links affect site ranking? Is there any way to overcome it?

If you have made a robots.txt file that blocks the pages, it shouldn't affect your ranking.

Have you tried emailing the developer of the sitemap tool you use? Maybe they can help you so that the tool obeys your robots.txt file.
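If you would rather clean the generated sitemap yourself, here is a minimal sketch using only Python's standard library (the file name, domain, and user agent are assumptions, and the standard parser only understands simple prefix rules, not Google-style wildcards):

import urllib.robotparser
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Load the live robots.txt for the site.
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# Keep only the sitemap URLs that robots.txt allows.
tree = ET.parse("sitemap.xml")
allowed = [loc.text for loc in tree.findall(".//sm:loc", NS)
           if rp.can_fetch("*", loc.text)]

for url in allowed:
    print(url)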

Related

Why would "Disallow: /*?s=" be used in a robots.txt file?

We got a notice from Google's Search Console that one of our blog posts couldn't be crawled. When inspecting the URL in Google Search Console, it reports that the page was blocked by the following line in our robots.txt file:
Disallow: /*?s=
I also want to ask why "Disallow: /*?s=" would be used. Why worry about paths that contain the letter "s"? If we remove it, what's the risk? Thanks so much in advance for any additional insight that can be shared - P
This query parameter ("?s=") is what WordPress uses for its built-in search, so the rule is commonly seen on WordPress-based sites.
There may be several types of content on your site, and the site builder may have wanted to allow searching only certain types of content through a different search mechanism.
It makes sense, for example, on a store site that wants users to search for products only through a customized search form, so that they do not wander behind the scenes of the site.
Google's robot has a number of ways to identify whether a site is WordPress-based, which is probably why it ends up requesting URLs with that query string in the first place.
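To see what that rule actually matches, here is a rough sketch of Google-style wildcard matching (the sample paths are made up):

import re

# "Disallow: /*?s=" blocks any path that contains "?s=", i.e. WordPress
# search-result URLs. "*" stands for any run of characters.
rule = "/*?s="
pattern = re.compile(re.escape(rule).replace(r"\*", ".*"))

for path in ["/?s=shoes", "/blog/post-title/", "/shop/?s=red+shirt"]:
    print(path, "-> blocked" if pattern.match(path) else "-> allowed")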

Ignoring robots.txt and meta tags in a crawler

Is there a way to make a web crawler ignore the robots.txt file and meta tags? Yes, I know this could come with legal repercussions. This question is much like another question, but the answers were very vague and I didn't quite get it. Any help is appreciated.
A web crawler doesn't have to abide by robots.txt, because there is no technical measure in place to stop it if it doesn't.
A simple web crawler might do:
FOR SITE IN SEARCH
    IF ALLOWED_TO_CRAWL_BASED_ON_ROBOTS_TXT(SITE)
        FOR LINK IN SITE
            DO_SOMETHING
This could be modified to:
FOR SITE IN SEARCH
    FOR LINK IN SITE
        DO_SOMETHING
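In concrete terms, a minimal Python sketch (not anyone's production crawler; the user-agent string is an assumption) shows that the only difference is whether the robots.txt check runs before fetching:

import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent="MyCrawler"):
    # Fetch and parse the site's robots.txt, then ask whether this URL may be crawled.
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch(url, respect_robots=True):
    if respect_robots and not allowed_by_robots(url):
        return None          # the polite crawler stops here
    with urllib.request.urlopen(url) as resp:
        return resp.read()   # the "ignoring" variant always reaches this line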

How to fix robots.txt

Full disclaimer: I am not a programmer, I am an SEO trying to learn how to not rely on my developer for every little question I have.
Currently my issue is this. I use Screaming Frog to crawl my sites to lay out the page titles, meta descriptions, H1s, H2s, etc. so I can more easily plan out my changes.
The other day I wanted to run a report for my client and my own company website and got the following back.
So I know robots.txt is a way to have pages on your site but not have Google crawl them. What I don't know is why an entire site would have this message as opposed to just some pages.
Can anyone give advice on how to fix this, or links to how-tos? I get this issue a lot and would like to educate myself so I don't have to wait for someone else. I get these as well when I try indexing websites in Google Search Console.
Many Thanks
What I don't know is why an entire site would have this message as opposed to just some pages.
The robots.txt for your website has not been written properly, if the intention is to have its content indexed.
Or Screaming Frog might have a bug, if indeed the robots.txt file is written properly.
Or some webmaster decided the content was not worth indexing on Google, or that bots would eat too much bandwidth, and blocked everything rather than being selective about access.
Checking the current robots.txt file on that website, I see this content:
User-Agent: *
Disallow:
This means that any page of that website is allowed to be crawled by any crawler (here is an explanation of the file's syntax: https://moz.com/learn/seo/robotstxt).
So the current file should not cause the error the OP mentions. Seeing that this question is from June 30, 2017 and the robots.txt file was last modified on July 11, 2017, it seems the OP may have already fixed whatever problem they had since the question was opened.
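You can confirm what that file allows from Python (a quick sketch; the user-agent string and URL are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow:"])  # an empty Disallow blocks nothing

# Any crawler, any page: allowed.
print(rp.can_fetch("Screaming Frog SEO Spider", "https://example.com/any-page"))  # True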

Automatic web tester for 404 links?

Is there any test framework or software that can automatically go through a site and find 404 errors from links?
You could use an extension for your favourite browser, e.g. LinkChecker for Firefox.
Are you looking for a tool that does complete validation/checking of the site? Or one that does use-case testing of specific parts of the site?
For the latter I recommend TestPlan; it has the ability to check the headers of pages and work with the so-called "meta" response of the page.
The original web-site is no longer available but the project is now hosted on Launchpad.
For the former it isn't the best tool, but as part of a test framework it is easy enough to get it to scan through links on the site looking for errors.
If you're running on Windows there is this one.
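If you prefer to script a basic check yourself, here is a bare-bones sketch using only Python's standard library (the start URL is a placeholder, and it only checks the links found on that one page):

import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    # Collect the href of every <a> tag on the page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def report_404s(start_url):
    html = urllib.request.urlopen(start_url).read().decode("utf-8", "replace")
    collector = LinkCollector()
    collector.feed(html)
    for href in collector.links:
        url = urljoin(start_url, href)
        try:
            urllib.request.urlopen(url)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                print("404:", url)

report_404s("https://example.com/")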

Prevent Google from indexing

Hi, what's the best way to prevent Google from showing a folder in the search engine, e.g. www.example.com/support? What should I do if I want the support folder to disappear from Google?
The first thing I did was place a robots.txt file and include this code:
User-agent: *
Disallow: /support/etc
But the result is a total disaster: I'm not able to use the support page anymore unless I remove the robots.txt.
What's the best thing to do?
robots.txt shouldn't affect the way your pages function. If in doubt, you can use tools to generate one, like http://www.searchenginepromotionhelp.com/m/robots-text-creator/simple-robots-creator.php or http://www.seochat.com/seo-tools/robots-generator/
When disallowing in the robots file, you can explicitly specify a file or subfolder rather than just a folder.
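For example (the paths are made up; a quick check with Python's standard robots.txt parser):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /support/internal/",        # block only this subfolder
    "Disallow: /support/draft-page.html",  # block one specific file
])

print(rp.can_fetch("*", "https://www.example.com/support/"))                 # True
print(rp.can_fetch("*", "https://www.example.com/support/internal/a.html"))  # False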
You can also use a meta tag in your document to tell the crawler not to index or follow it:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
what's the best way to prevent Google from showing a folder in the search engine?
A robots.txt file is the right way to do this. Your example is correct for blocking the /support/etc directory and its descendants.
I'm not able to use the support page anymore unless I remove the robots.txt
It doesn't make sense that a robots.txt file would affect the way your site functions, and certainly it should never affect which pages can be accessed by a human. I suspect something else is awry -- check your server logs to see what kinds of errors are being recorded.
While not the preferred method of limiting robot access, Google talks about using a noindex meta tag here. This will also prevent the various pages from showing up if they are linked to by a site other than your own.
A good discussion of limiting bots that visit your site can be found here.