Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I have a folder which is half public: the URL is not linked anywhere, only a few friends know it (and they will not link it), and it is cryptic enough that nobody lands there by accident.
However, the link is sent via Googlemail and Facebook messages. Is there a way to tell Facebook and Google, in a local robots.txt file, not to index the page?
If I add it to the "global" robots.txt file, then everybody who looks there will see that my /secret-folder-12argoe22v4 might contain something interesting, so I will not do that. But will Facebook / Google look at /secret-folder-12argoe22v4/robots.txt?
The content would be
User-agent: *
Disallow: .
or
User-agent: *
Disallow: /secret-folder-12argoe22v4/
As CBroe mentioned, a robots.txt file must always be at the top level of the site. If you put it in a subdirectory, it will be ignored. One way you can block a directory without publicly revealing its full name is to block just part of it, like this:
User-agent: *
Disallow: /secret
This will block any URL that starts with "/secret", including "/secret-folder-12argoe22v4/".
I should point out that the above is not a 100% reliable way to keep the files out of the search engines. It will keep the search engines from directly crawling the directory, but they can still show it in search results if some other site links to it. You may consider using robots meta tags instead, but even this won't prevent someone from directly following an off-site link. The only really reliable way to keep a directory private is to put it behind a password.
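The prefix-matching behaviour described above can be sanity-checked locally with Python's standard urllib.robotparser (the folder name is just the one from the question; example.com is a placeholder host):

```python
from urllib.robotparser import RobotFileParser

# The rules as they would be served from the site root
# (remember: robots.txt must live at the root to be honored).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /secret",
])
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

# "Disallow: /secret" is a plain prefix match, so it covers the full folder
# name without the file ever spelling that name out:
print(rp.can_fetch("*", "http://example.com/secret-folder-12argoe22v4/"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))            # True
```

This only shows what a compliant crawler will refuse to fetch; as noted above, it does nothing against someone following a direct link, so a password is still the only reliable protection.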
Related
I have a sub-domain for testing purposes. I have set robots.txt to disallow this folder.
Some of the results are still showing for some reason. I thought it may be because I hadn't set up the robots.txt originally and Google hadn't removed some of them yet.
Now I'm worried that the robots.txt files within the individual Joomla sites in this folder are causing Google to keep indexing them. Ideally I would like to stop that, because I don't want to have to remember to switch robots.txt back to allow crawling when the sites go live (just in case).
Is there a way to override these explicitly with a robots.txt in a folder above this folder?
As far as a crawler is concerned, robots.txt exists only in the site's root directory. There is no concept of a hierarchy of robots.txt files.
So if you have http://example.com and http://foo.example.com, then you would need two different robots.txt files: one for example.com and one for foo.example.com. When Googlebot reads the robots.txt file for foo.example.com, it does not take into account the robots.txt for example.com.
In other words, when Googlebot is crawling example.com, it will not under any circumstances consult the robots.txt file for foo.example.com, and when it is crawling foo.example.com, it will not consult the robots.txt for example.com.
Does that answer your question?
More info
When Googlebot crawls foo.com, it will read foo.com/robots.txt and use the rules in that file. It will not read and follow the rules in foo.com/portfolio/robots.txt or foo.com/portfolio/mydummysite.com/robots.txt. See the first two sentences of my original answer.
I don't fully understand what you're trying to prevent, probably because I don't fully understand your site hierarchy. But you can't change a crawler's behavior on mydummysite.com by changing the robots.txt file at foo.com/robots.txt or foo.com/portfolio/robots.txt.
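To make the "root only" rule concrete, here is a small Python sketch of how a crawler decides where to look for robots.txt for any given page URL (the hostnames are just examples):

```python
from urllib.parse import urlparse

def robots_txt_url(page_url: str) -> str:
    """Return the only robots.txt location a crawler will consult for this URL."""
    parts = urlparse(page_url)
    # Scheme and host are kept; the path is always replaced with /robots.txt.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# Every URL on a host maps to that host's single root robots.txt,
# no matter how deep the page is nested:
print(robots_txt_url("http://foo.example.com/portfolio/page.html"))
# http://foo.example.com/robots.txt
print(robots_txt_url("http://example.com/some/deep/dir/index.html"))
# http://example.com/robots.txt
```

Any robots.txt file sitting in a subdirectory simply never appears in that mapping, which is why it cannot override or be overridden by anything.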
Closed 9 years ago.
I am not tech savvy, but I was taking a picture that I have had in my FB account album since 2007 and trying to copy it. It was of the first moments after my daughter was born, when they laid her in my arms after a Cesarean section. It is the only place I had the picture, no other copies, and the actual camera was lost. I was trying to do a "then and now" picture, from her birth to her first day of kindergarten, and now I have lost it by accidentally hitting delete instead of copy.
Is there ANY way to get it back? Does FB have a trash folder?
Beth
So, I know this question is probably off-topic but I can't help but feel bad for lost baby photos.
First of all, if anyone else tagged the photo, it would be linked to their account as well. If you ever emailed anyone the photo, you would likely have sent either the photo itself or a direct link to the Facebook page where it is stored. The file is probably still there; you just (unfortunately) no longer have a way to navigate to it. So if you ever sent the link to anyone and can find it again, it should still work.
If you're someone who doesn't routinely purge your browsing history, there may be a chance that a copy of the photo was saved to your browser's temporary files folder while you were looking at the photo. If you use Internet Explorer 7 or 8, here's a guide. Otherwise just Google "<your browser name and version> temp folder location".
http://windows.microsoft.com/en-us/windows-vista/view-temporary-internet-files
Of course, neither of these may work. If so, sorry for your loss of data. Good luck!
Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
I'm not sure how to deploy best practice for SEO in a new project.
I'm building a CMS that will be used by a group of writers to post news articles to a website. I'm developing the site using Perl and Template-Toolkit (TT2). I've also embedded an open source editor (TinyMCE) in the system that will be used for content creation.
I was planning to save the news article content to the DB as text - though I could also save it to flat files and then save the corresponding file paths to the DB.
From an SEO standpoint, I think it would be very helpful if this content could be exposed to search engines. There will be lots of links and images that could help to improve rankings.
If I put this content in the DB, it won't be discoverable ... right?
If I save this content in template files (content.tt) will the .tt files be recognized by search engines?
Note that the template files (.tt) will be displayed as content via a TT2 wrapper.
I'm also planning to generate a Google XML Sitemap using the Sitemap 0.90 standard. Perhaps this is sufficient? Or should I try to make the actual content discoverable?
Thanks ... just not sure how the google dance deals with .tt files and such.
If I put this content in the DB, it won't be discoverable ... right?
The database is part of your backend. Google cares about what you expose to the front end.
If I save this content in template files (content.tt) will the .tt files be recognized by search engines?
Your template files are also part of your backend.
Note that the template files (.tt) will be displayed as content via a TT2 wrapper.
The wrapper takes the template files and the data in the database and produces HTML pages. The HTML pages are what Google sees.
Link to those pages.
just not sure how the google dance deals with .tt files and such
Google doesn't care at all about .tt files and the like. Google cares about URLs and the resources that they represent.
When Google is given the URL of the front page of your site, it will visit that URL. Your site will respond to that request by generating the front page, presumably in HTML. Google will then parse that HTML and extract any URLs it finds. It will then visit all of those URLs and the process will repeat. Many times.
The back-end technologies don't matter at all. What matters is that your site is made up of well-constructed HTML pages with meaningful links between them.
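For reference, since the question mentions the Sitemap 0.90 standard: a minimal sitemap is just an XML list of the final, rendered page URLs (the URL below is a placeholder). It is a hint to crawlers, not a substitute for crawlable HTML links between pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/news/my-article.html</loc>
    <lastmod>2010-06-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

Note that the URLs listed here are the public HTML pages your wrapper produces, never the .tt files or database rows behind them.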
Closed 10 years ago.
How do I disallow all dynamic URLs in robots.txt?
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
I want to disallow everything that starts with /?q=
The answer to your question is to use
Disallow: /?q=
The best (currently accessible) source on robots.txt I could find is on Wikipedia. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)
According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.
The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.
This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
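For what it's worth, the plain prefix behaviour described above is also what Python's standard urllib.robotparser implements, so you can check the single rule locally (example.com stands in for the real site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /?q=",   # plain prefix match, no wildcards needed
])
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

# One prefix rule covers every dynamic URL that starts with /?q= :
print(rp.can_fetch("*", "http://example.com/?q=admin/"))
print(rp.can_fetch("*", "http://example.com/?q=node/add/"))
print(rp.can_fetch("*", "http://example.com/index.html"))
```

The first two should be disallowed (False) and the last allowed (True), all from the single literal prefix, with no reliance on non-standard wildcard support.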
As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.
That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a stubborn crawler that is trying hard to access those dynamic paths.
If you have issues with a specific crawler, you can investigate how that crawler handles robots.txt and add a section of your robots.txt aimed specifically at it.
If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.
More often than not, pages that handle dynamic parameters live under a specific directory or set of directories, which is why it is normally very simple to just Disallow: /cgi-bin or /app and be done with it.
In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
This way the Allow lines override the blanket Disallow by listing specifically what crawlers should index. Note that not all crawlers are created equal, and you may want to refine this robots.txt later by adding a specific section for any crawler that still misbehaves.
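Under a first-match-wins interpreter such as Python's urllib.robotparser, listing the Allow lines before the final Disallow: / behaves as the answer intends (the paths are the ones from the example above; example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /index.html",
    "Allow: /offices",
    "Allow: /static",
    "Disallow: /",      # everything not explicitly allowed above
])
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

print(rp.can_fetch("*", "http://example.com/index.html"))   # True
print(rp.can_fetch("*", "http://example.com/offices/nyc"))  # True
print(rp.can_fetch("*", "http://example.com/?q=admin/"))    # False
```

Keep in mind that Allow: is a non-standard extension and other crawlers may resolve Allow/Disallow conflicts differently (Google, for instance, uses longest-match rather than file order), so test against the crawlers you actually care about.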
Closed 10 years ago.
For some reason when I check on Google Webmaster Tool's "Analyze robots.txt" to see which urls are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that the two robots are reading the correct directives, because Googlebot reports the scripts blocked from line 7 while Mediapartners-Google reports them blocked from line 4. And yet ANY other URL I test from the disallowed URLs under the second user-agent directive is NOT blocked!
I'm wondering if my comment or using absolute urls are screwing things up...
Any insight is appreciated. Thanks.
The reason they are ignored is that you have fully qualified URLs in the Disallow entries, while the specification doesn't allow that: Disallow values must be absolute paths (starting with /). The Sitemap: directive is the exception; it does take a full URL, so that line can stay as it is. Try the following:
Sitemap: http://[omitted]/sitemap_index.xml
User-agent: Mediapartners-Google
Disallow: /scripts
User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
As for caching, Google tries to fetch a fresh copy of the robots.txt file about every 24 hours on average.
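You can see the difference directly with Python's urllib.robotparser: the fully qualified Disallow line never matches anything, while the path-only version does (example.com stands in for the omitted domain):

```python
from urllib.robotparser import RobotFileParser

ARTICLE = "http://example.com/Living/books/book-review-not-stupid.aspx"

# Broken: a fully qualified URL in Disallow is compared against the
# request path and can never match, so the rule is effectively dead.
broken = RobotFileParser()
broken.parse([
    "User-agent: *",
    "Disallow: http://example.com/Living/books/book-review-not-stupid.aspx",
])
broken.modified()

# Fixed: an absolute path works as intended.
fixed = RobotFileParser()
fixed.parse([
    "User-agent: *",
    "Disallow: /Living/books/book-review-not-stupid.aspx",
])
fixed.modified()

print(broken.can_fetch("*", ARTICLE))  # True  -- rule ignored, page crawlable
print(fixed.can_fetch("*", ARTICLE))   # False -- page correctly blocked
```

The comment line (# list of articles...) is harmless, by the way; comments are part of the robots.txt format and are not what broke the rules here.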
It's the absolute URLs. Disallow lines in robots.txt are only supposed to contain paths; the host is implied, since it is the host the robots.txt was fetched from.
It's been up for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it's recent.
Did you recently make this change to your robots.txt file? In my experience it seems that google caches that stuff for a really long time.