Robots.txt block access to all https:// pages [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
What syntax would block all bots from accessing https:// pages? I have an old site that no longer has an SSL certificate, and I want to block access to all https:// pages.

I don't know whether this works, that is, whether robots request a separate robots.txt for each protocol. But you could serve a different robots.txt for requests over HTTPS.
So when http://example.com/robots.txt is requested, you serve the normal robots.txt, and when https://example.com/robots.txt is requested, you serve a robots.txt that disallows everything.
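A minimal sketch of that idea, assuming the site is served by a Python WSGI application (the original does not say which server is in use). The name `robots_app` and both robots.txt bodies are hypothetical; `wsgi.url_scheme` is the standard WSGI environ key for the request protocol.

```python
# Hypothetical robots.txt bodies; adjust them to the real site.
HTTP_ROBOTS = "User-agent: *\nDisallow:\n"      # allow everything over http
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"   # disallow everything over https


def robots_app(environ, start_response):
    """Serve /robots.txt, choosing the body by request scheme."""
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]

    # Behind a TLS-terminating proxy this value may always be "http";
    # in that case the proxy's forwarded-protocol header would be needed.
    scheme = environ.get("wsgi.url_scheme", "http")
    body = HTTPS_ROBOTS if scheme == "https" else HTTP_ROBOTS

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]
```

This only helps if the server still answers HTTPS requests at all; a crawler that cannot connect over HTTPS never gets to read either file.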

Related

see api calling information on chrome developer tab [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I want to see the API calls in the Chrome Network tab so that I can implement them in my web client.
I go to the domain food.shahed.ac.ir.
Some API calls are made very quickly, and then the page redirects to another page just as quickly,
so I cannot see the first requests.
You can check the Preserve log checkbox in the Network tab.
Then all requests should remain available after the redirect.

Why is https always used with www subdomain? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I have been trying to reason this out, but haven't been able to. All the HTTPS websites that I have visited use the www subdomain. Is it possible to have something like https://foo.com? If yes, why is it so rare or uncommon?
Technically, there is no reason why HTTPS cannot be used without a www subdomain. Example: https://mail.google.com.
Your observation might be wrong. After all, how many of the millions of domains do we actually get to see? :)

Does Google update redirect URLs? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 9 years ago.
I have shortened the URLs on my e-commerce store to make them more SEO-friendly; however, some of my original URLs rank well on Google.
If I redirect my old URLs to the new ones, will Google automatically update its listings to display the new URLs?
Yes, if you use permanent (301) redirects. That's pretty much the full answer.
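As a rough illustration of a permanent redirect, here is a minimal sketch assuming a Python WSGI application; the original does not say what the store runs on. The name `redirect_app` and the paths in `REDIRECTS` are hypothetical.

```python
# Hypothetical mapping from old URL paths to their shortened replacements.
REDIRECTS = {
    "/products/category/blue-widget-12345": "/blue-widget",
}


def redirect_app(environ, start_response):
    """Answer old paths with a permanent (301) redirect to the new path."""
    path = environ.get("PATH_INFO", "")
    new_path = REDIRECTS.get(path)
    if new_path is not None:
        # 301 tells clients and crawlers that the move is permanent.
        start_response("301 Moved Permanently", [("Location", new_path)])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]
```

In practice the same mapping is often expressed in the web server's own redirect configuration rather than in application code.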

Bingbot ignoring robots.txt and attempting to retrieve a trafficbasedsspsitemap.xml [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I have an app whose content should not be publicly indexed. I've therefore disallowed access to all crawlers.
robots.txt:
# Robots shouldn't index a private app.
User-agent: *
Disallow: /
However, Bing has been ignoring this and daily requests a /trafficbasedsspsitemap.xml file, which I have no need to create.
I also have no need to receive daily 404 error notifications for this file. I'd like to just make the bingbot go away, so what do I need to do to forbid it from making requests?
According to this answer, this is Bingbot checking for an XML sitemap generated by the Bing Sitemap Plugin for IIS and Apache. It apparently cannot be blocked by robots.txt.
For those coming from Google:
You could block bots via Apache user-agent detection or rewrite directives, which would allow you to keep Bingbot out entirely. See, for example:
https://superuser.com/questions/330671/wildcard-blocking-of-bots-in-apache
Block all bots/crawlers/spiders for a special directory with htaccess
etc.
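The same user-agent check can also be done in application code instead of Apache. The following is a minimal sketch under that assumption, written as Python WSGI middleware; `block_bots` and `BLOCKED_AGENTS` are hypothetical names, and the pattern only covers Bingbot.

```python
import re

# Hypothetical pattern; extend it with other crawler names as needed.
BLOCKED_AGENTS = re.compile(r"bingbot", re.IGNORECASE)


def block_bots(app):
    """Wrap a WSGI app and answer blocked user agents with 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_AGENTS.search(agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

User-agent strings are supplied by the client, so this stops well-behaved crawlers that identify themselves but not a bot that lies about its identity. The requests will also still appear in the server logs, now as 403 instead of 404.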

Block subdirectories in robots.txt [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
My site has profiles, and then pages beyond those profiles. (Example: http://www.site.com/profile, http://www.site.com/profile/settings)
I would like to block Google crawlers from the subfolders. I want Google to index /profile/ but not anything beyond it.
Another example:
- http://twitter.com/bmull <-- Allow
- http://twitter.com/bmull/favorites <-- Block
You could also use <meta name="robots" content="noindex, nofollow" /> in the pages you don't want robots to index or follow. However, always remember that everything in these files is voluntary and robots can choose not to comply, so I recommend IP or user-agent blocking as a more reliable route.
The rule below will work with Google, but isn't guaranteed to work with other spiders. As secretformula suggested, your best bet is to go with IP or user-agent blocking in your server-side logic.
User-agent: *
Disallow: /*/settings
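To see which paths a wildcard rule like this covers, here is a minimal sketch of a matcher, assuming Google-style semantics: `*` matches any run of characters, a trailing `$` anchors the end of the path, and a rule otherwise matches as a prefix. The function `rule_matches` is hypothetical and is not how any particular crawler implements it.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    Supports '*' (any run of characters) and a trailing '$' (end of path).
    Without '$', the pattern only has to match a prefix of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/*/settings", "/profile/settings"))   # True
print(rule_matches("/*/settings", "/profile"))            # False
print(rule_matches("/*/settings", "/bmull/favorites"))    # False
```

Under these assumptions the rule blocks only paths of the form /<anything>/settings, so a page such as /bmull/favorites would need its own Disallow line.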