What happens when GET robots.txt returns an unrelated HTML file?

I have a web server capable of serving the assets of various web apps. When a requested asset doesn't exist, it sends back index.html. In other words:
GET /img/exists.png -> exists.png
GET /img/inexistent.png -> index.html
This also means that:
GET /robots.txt -> index.html
How will Google (and other) crawlers handle this? Will they detect that robots.txt is invalid and ignore it (same as returning 404)? Or will they penalize my ranking for serving an invalid robots.txt? Is this acceptable, or should I make a point of returning 404 when the app I'm serving has no robots.txt?

Every robots.txt handler that I know of deals with invalid lines by simply discarding them. So an HTML file (which presumably does not contain any valid robots.txt directives) would effectively be treated as if it were a blank file. This is not really part of any official standard, though. The (semi-)official standard assumes that any robots.txt file will contain robots.txt directives. Behavior for a robots.txt file that contains HTML is undefined.
If you care about crawlers, your bigger problem is not that you serve an invalid robots.txt file; it's that you have no mechanism to tell crawlers (or anyone else) when a resource does not exist. From the crawler's point of view, your site will contain some normal pages plus an infinite number of exact copies of the home page. I strongly encourage you to find a way to change your setup so that resources that don't exist return status 404.
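If your setup makes that awkward for a single-page app, one common compromise is to fall back to index.html only for extension-less "app route" paths and return 404 for anything that looks like a file. A minimal sketch, assuming a Node/Express server (the build directory and port are hypothetical):

import express from "express";
import path from "path";

const app = express();
const root = path.join(__dirname, "dist"); // hypothetical build output directory

// Serve files that really exist (exists.png, robots.txt, ...).
app.use(express.static(root));

app.get("*", (req, res) => {
  if (path.extname(req.path)) {
    // Looks like a file (.png, .txt, ...) but wasn't found above: report it as missing.
    res.status(404).send("Not found");
  } else {
    // Extension-less paths are treated as client-side routes of the single-page app.
    res.sendFile(path.join(root, "index.html"));
  }
});

app.listen(3000);

With this, GET /img/inexistent.png and GET /robots.txt (when no robots.txt exists) return 404, while GET /some/app/route still gets index.html.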

Related

robots.txt allows & disallows a few pages, what does it mean for other pages?

I was going through many websites' robots.txt files to check if I could scrape some specific pages. When I see the following pattern:
User-agent: *
Allow: /some-page
Disallow: /some-other-page
There is nothing else in the robots.txt file. Does it mean that all other remaining pages on the given website are available to be scraped?
P.S. - I tried googling this specific case but no luck.
According to this website, Allow is used to allow a directory when its parent may be disallowed. I found this website quite useful as well.
Disallow: The command used to tell a user-agent not to crawl a particular URL. Only one "Disallow:" line is allowed for each URL.
Allow (Only applicable for Googlebot): The command to tell Googlebot it can access a page or subfolder even though its parent page or subfolder may be disallowed.
Regarding your question, if the remaining pages aren't covered by a Disallow rule, you should be okay.
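If it helps to see the precedence spelled out: Google (and the current robots.txt RFC) resolve competing rules by the longest matching path, with Allow winning ties. A rough sketch of that logic, simplified to prefix matching (real parsers also handle wildcards and $):

function isAllowed(urlPath: string, allows: string[], disallows: string[]): boolean {
  // Length of the longest rule that is a prefix of the path (0 if none match).
  const longestMatch = (rules: string[]): number =>
    rules
      .filter(rule => urlPath.startsWith(rule))
      .reduce((max, rule) => Math.max(max, rule.length), 0);
  // The longer match wins; Allow wins a tie. No matching Disallow means crawling is allowed.
  return longestMatch(allows) >= longestMatch(disallows);
}

// With the rules from the question:
isAllowed("/some-page", ["/some-page"], ["/some-other-page"]);       // true - explicitly allowed
isAllowed("/some-other-page", ["/some-page"], ["/some-other-page"]); // false - disallowed
isAllowed("/anything-else", ["/some-page"], ["/some-other-page"]);   // true - not matched by any rule, so crawlable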

Rewrite .html files in Netlify

I am trying to rewrite patterns like /abc.html to /search?xyz=abc. 'abc' can be anything.
I've gone through the documentation at https://www.netlify.com/docs/redirects/
Here's what I have now in my _redirect file, but it doesn't seem to work. Kindly help.
/*.html /search?xyz=:splat 200
Disclaimer: I work for Netlify
Our redirects functionality cannot do rewrites like that. Placeholders like slugs (/blog/:year/:month/:day/:title) and stars (/assets/*) are only matched as full path components - that is, the thing between slashes in a URL, or "everything after this /, including files in subdirectories".
It's not an uncommon feature request, but our system doesn't work like that right now.
Some ways you can achieve similar goals:
usually you aren't intending to redirect existing paths. This portion of the docs demonstrates that a standard redirect (/articles/* /search?xyz=:splat 301) will redirect all requests for missing content - be that /articles/1 or /articles/deep/link/that/was/tpyoed.html - to /search with an xyz parameter of the pathname. This doesn't do exactly what you asked - but it is probably the closest thing, and you can hopefully handle the paths appropriately in your /search page (see the sketch after this list). If you have any content under /articles, it will still be served and not redirected, since you didn't put a ! on the redirect to force it.
if you have a single-page app that does its own routing and uses a history pushstate redirect, you could make the router do the right thing for your content, since usually there is only one .html page on your site and any other paths would be redirected to it (where the router takes over with whatever smarts you give it)
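The two approaches above, written out as _redirects rules (the paths are illustrative, not your actual routes):

# 1. Send requests for content that doesn't exist under /articles to the search page
/articles/*  /search?xyz=:splat  301

# 2. Single-page-app fallback: serve index.html for any missing path and let the client-side router decide
/*  /index.html  200

In both cases, existing files are still served as-is because neither rule is forced with !.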

Googlebot guesses URLs. How to avoid/handle this crawling?

Googlebot is crawling our site. Based on our URL structure it is guessing new possible URLs.
Our structure is of the kind /x/y/z/param1.value. Now Googlebot exchanges the values of x, y, z, and value with tons of different keywords.
The problem is that each call triggers a very expensive operation, and it returns positive results only in very rare cases.
I tried to set a URL parameter in the crawling section of Webmaster Tools (param1. -> no crawling), but this does not seem to work, probably because of our inline URL format (would it be better to use the HTML GET format ?param1=..?).
As Disallow: */param1.* does not seem to be an allowed robots.txt entry, is there another way to disallow Google from crawling these sites?
As another solution I thought of detecting Googlebot and returning it a special page.
But I have heard that this will be punished by Google.
Currently we always return an HTTP status code 200 and a human-readable page which says: "No targets for your filter criteria found". Would it help to return another status code?
Note: This is probably not a general answer!
Joachim was right. It turned out that Googlebot is not guessing URLs.
Doing a bit of research, I found out that I had added a new DIV to my site containing those special URLs half a year ago (which I unfortunately forgot about). A week ago Googlebot started crawling it.
My solution: I deleted the DIV, and I also return a 404 status code for those URLs. I think, sooner or later, Googlebot will stop crawling the URLs after revisiting my site.
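A sketch of what the 404 part can look like, assuming an Express-style handler (the route, lookup, and renderer are hypothetical stand-ins for the real ones):

import express from "express";

const app = express();

// Hypothetical stand-ins for the expensive lookup and the result page renderer.
async function findTargets(params: Record<string, string>): Promise<string[]> { return []; }
function renderResults(results: string[]): string { return results.join("\n"); }

// Route matching the /x/y/z/param1.value URL structure.
app.get("/:x/:y/:z/param1.:value", async (req, res) => {
  const results = await findTargets(req.params);
  if (results.length === 0) {
    // 404 instead of a 200 "no targets found" page, so crawlers drop the URL.
    res.status(404).send("No targets for your filter criteria found");
  } else {
    res.send(renderResults(results));
  }
});

app.listen(3000);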
Thanks for the help!

Get Google to index links from JavaScript-generated content

On my site I have a directory of things which is generated through jQuery AJAX calls, which subsequently create the HTML.
To my knowledge, Google and other bots aren't aware of DOM changes after the page load, and won't index the directory.
What I'd like to achieve, is to serve the search bots a dedicated page which only contains the links to the things.
Would adding a noscript tag to the directory page be a solution? (in the noscript section, I would link to a page which merely serves the links to the things.)
I've looked at both the robots.txt and the meta tag, but neither seem to do what I want.
It looks like you stumbled on the answer to this yourself, but I'll post the answer to this question anyway for posterity:
Implement Google's AJAX crawling specification. If links to your page contain #! (a URL fragment starting with an exclamation point), Googlebot will send everything after the ! to the server in the special query string parameter _escaped_fragment_.
You then look for the _escaped_fragment_ parameter in your server code, and if present, return static HTML.
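A sketch of that check, assuming a Node/Express server (file names are hypothetical):

import express from "express";
import path from "path";

const app = express();

app.get("/directory", (req, res) => {
  if ("_escaped_fragment_" in req.query) {
    // Googlebot requested /directory#!... as /directory?_escaped_fragment_=...
    // Serve a pre-rendered HTML snapshot that simply lists plain links to the things.
    res.sendFile(path.join(__dirname, "directory-snapshot.html"));
  } else {
    // Normal visitors get the jQuery/AJAX-driven page.
    res.sendFile(path.join(__dirname, "directory.html"));
  }
});

app.listen(3000);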
(I went into a little more detail in this answer.)

Will redirecting a bunch of old dynamic URLs to a single new index page totally bone my pagerank?

I've got half a dozen legacy dynamic URLs and it turns out redirecting them all will require 18 Rewrite directives in my .htaccess file - that seems messy to me.
What I can do, however, is redirect all of them to my new start page with a single Redirect directive. It's a tiny site, and all the pages people might come in from via Google searches are easily findable from the start page, so I'd like to do that. However...
I'm worried this might kill the site's modest (but worth maintaining) PageRank, as several URLs would then resolve to the same URL and content.
Does anyone know if this would be the case, and if so, whether there are strategies to avoid that other than not implementing the above?
Thanks!
Roger.
As long as you do 301 redirects (moved permanently) rather than 302 redirects (moved temporarily), all the accumulated PageRank from your legacy URLs will transfer to the new URL you are redirecting to.
So you will not "lose" the PageRank; it will simply be transferred over to the new URL.
The important thing is to ensure it's a 301 redirect.
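A sketch of what that can look like in .htaccess (the legacy paths are hypothetical; if your old URLs differ only by query string you will still need mod_rewrite rules instead of mod_alias):

# Redirect individual legacy pages permanently to the start page
Redirect 301 /old-catalogue.php /
Redirect 301 /old-products.php /

# Or collapse several legacy paths with one pattern
RedirectMatch 301 ^/(old-catalogue|old-products|old-contact)\.php$ /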