Blocking files in robots.txt with [possibly] more than one file extension

Is this correct syntax?
Disallow: /file_name.*
If not, is there a way to accomplish this without listing each file twice [multiple times]?

OK, according to http://tool.motoricerca.info/robots-checker.phtml
The "*" wildchar in file names is not supported by (all) the user-agents addressed by this block of code. You should use the wildchar "*" in a block of code exclusively addressed to spiders that support the wildchar (Eg. Googlebot).
So, I just use:
<meta name="robots" content="noindex,nofollow">
in each page that I wanted to block from search engines.
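Alternatively, if a robots.txt rule is still wanted, the wildcard pattern can be limited to a group addressed only to crawlers known to support wildcards (Googlebot, for example). A hedged sketch, keeping the file name from the question as a placeholder:
User-agent: Googlebot
Disallow: /file_name.*

# other crawlers get no wildcard rules and nothing blocked
User-agent: *
Disallow: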

Related

Can I block search engines from scanning files starting with a certain letter using robots.txt?

I know I can block search engines from accessing certain types of files using a wildcard like this:
Disallow: /*.gif$
That disallows access to GIFs, or more precisely to files ending in .gif.
But is there a way to prevent search engines from accessing, for example, all files starting with "_"?
Would something like this work?
Disallow: /_*.*$
Or at least perhaps this (if I absolutely need to set an extension)?
Disallow: /_*.php$
As per the "official" docs
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines.
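That said, no wildcard is actually needed for a simple prefix match: a Disallow rule blocks anything whose URL path starts with the given string. A minimal sketch, assuming the underscore-prefixed files sit in the site root:
User-agent: *
Disallow: /_
For crawlers that do support wildcards, a rule like Disallow: /*/_ would additionally catch underscore-prefixed names inside subdirectories.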

Robots.txt special characters disallow

Example link: upload.php?id=46. I want to disallow all such links, i.e. id=1, 2, 3, and so on.
How can I do that using a special character?
Will this work for me?
disallow:/upload.php?id=*
Your example will work fine for major search engines, but the final * is unnecessary, and will cause the line to be ignored by older robots that don't support wildcards. The Disallow directive basically means "block anything that starts with the following". Putting a wildcard at the end is redundant, and has no effect on what will be matched. Wildcards are not part of the original robots.txt specification, so not all robots support them. All of the major search engines do, but many older robots do not.
The following does exactly the same thing as your example, but without wildcards:
User-agent: *
Disallow: /upload.php?id=
Why not just use a header in the upload.php file? I.e. put:
header("X-Robots-Tag: noindex, nofollow", true);
At the top of upload.php. If you're using Apache to serve your files, you can also set up rule based headers in your configuration file.
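For the Apache route, a sketch of what such a rule might look like; this assumes mod_headers is enabled and that upload.php is the only file you need to cover:
<Files "upload.php">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>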

How to break long URLs in Doxygen comments to satisfy maximum line length?

The coding guidelines of a programming language limit the line length, e.g. to 80 characters. How can I add a URL that is longer than that limit to Doxygen comments? How do I tell Doxygen that multiple lines are to be joined to form the actual link?
Example:
##
# @file mycode.py
# @sa See the documentation: http://some.host.some.domain/and_here
# _we_have_a_very_long_URL_that_can_not_be_written_in_one_line
# _because_it_would_exceed_the_line_length_limit
The example above doesn't work, and ending the lines with a backslash doesn't work either (the backslash is just copied into the documentation).
You can try it this way; it worked for me. However, I'm not 100% sure it's going to work for you: our IDE uses spaces for indentation rather than tabs, so when you break the line, and hence the link, it might not work.
<a href="http://stackoverflow.com/questions/9098680/
doxygen-link-to-a-url-doesnt-generate-the-link-correctly">
link
</a>
You could use an alias to abbreviate the long URL, i.e.
##
# #file mycode.py
# @sa See the documentation: @longurl
and in the Doxyfile define
ALIASES = longurl="http://some.host.some.domain/and_here/..."
This is performing necromancy on an old question; I am answering for C++-style comments. But if you make your link in the form:
/**
* [link_text](http://foo.com/bar/baz/qux/wibble/flob?id=deadbeef123456789abcdefghijklmnopqrstuvwxyz)
*/
You can wrap that URL in the following ways and the generated HTML output will still contain a working anchor tag:
/**
* [link_
text]
(http://foo.com/bar/baz/qux/wibble/
flob?id=deadbeef123456789abcdefghijklmnopqrstuvwxyz)
*/
Obviously this might make the comment block less readable, but it gives you an idea of what is possible. The main things that are advantageous here are being able to put the URL on a separate line from the link text, and being able to wrap it at least once after a /.

How do I exclude specific folders via robots.txt

I want to exclude all subfolders named "ajax" in any folder from indexing by search engines.
Examples:
.com/a/ajax
.com/b/ajax
.com/c/ajax
Is this possible via robots.txt?
It's only possible if you list out each folder explicitly. The original robots.txt exclusion standard has no wildcard support to accomplish the type of thing you want, so it is a little lacking in this respect.
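A sketch of the explicit approach, using the example paths from the question:
User-agent: *
Disallow: /a/ajax
Disallow: /b/ajax
Disallow: /c/ajax
For crawlers that do support wildcards (e.g. Googlebot, as discussed above), a single rule such as Disallow: /*/ajax should cover every such subfolder, but other robots will ignore it.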

How can I limit file types in CGI file uploads in Perl?

I am using CGI to allow the user to upload some files. I just want the user to be able to upload .txt or .csv files. If the user uploads a file with any other format, I want to be able to put out an error message.
I saw that this can be done with JavaScript: http://www.codestore.net/store.nsf/unid/DOMM-4Q8H9E
But is there a better way to achieve this? Is there some functionality in Perl that allows this?
The disclaimer on the site you link to is important:
Note: This is not entirely foolproof as people can easily change the extension of a file before uploading it, or do some other trickery, as in the case of the "LoveBug" virus.
If you really want to do this right, let the user upload the file, and then use something like File::MimeInfo::Magic (or file(1), the UNIX utility) to guess the actual file type. If you don't like the file type, delete the file and give the user an error message.
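A minimal sketch of that check, assuming File::MimeInfo::Magic is installed, that $path points at the already-uploaded temporary file, and that text/plain and text/csv are the only types you are willing to accept:
use File::MimeInfo::Magic qw(mimetype);

my $path = '/tmp/upload.12345';                 # hypothetical location of the uploaded file
my $type = mimetype($path);                     # e.g. 'text/plain' or 'text/csv'

unless (defined $type && $type =~ m{^text/(?:plain|csv)$}) {
    unlink $path;                               # reject the upload
    die "Only .txt or .csv uploads are accepted\n";
}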
I just want the user to be able to upload .txt or .csv files.
Sounds easy, doesn't it? It's not. And then some.
The simple approach is just to test that the file ends in ‘.txt’ or ‘.csv’ before storing it on the filesystem. This should be part of a much more in-depth validation of what the filename is allowed to contain before you let a user-submitted filename anywhere near the filesystem.
Because the rules about what can go in a filename are complex on some platforms (especially Windows) it's usually best to create your own filename independently with a known-good name and extension.
In any case there is no guarantee that the browser will send you a file with a usable name at all, and even if it does there is no guarantee that name will have ‘.txt’ or ‘.csv’ at the end, even if it is a text or CSV file. (Some platforms simply do not use extensions for file typing.)
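To illustrate the simple extension check and the known-good naming described above, a sketch; the variable names and the generated file name pattern are assumptions, not a complete sanitiser:
# Name as sent by the browser, e.g. obtained from CGI.pm's param()
my $client_name = 'whatever_the_browser_sent.TXT';

# Accept only .txt or .csv, case-insensitively
my ($ext) = $client_name =~ /\.(txt|csv)\z/i
    or die "Only .txt or .csv files are accepted\n";

# Ignore the client-supplied name entirely and build a known-good one
my $safe_name = sprintf 'upload_%d_%d.%s', time(), $$, lc $ext;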
Whilst you can try to sniff the contents of the file to see what type it might be, this is highly unreliable. For example:
<html>,<body>,</body>,</html>
could be plain text, CSV, HTML, XML, or a variety of other formats. Better to give the user an explicit control to say what file type they're uploading (or use one file upload field per type).
Now here's where it gets really nasty. Say you've accepted the upload and stored it as /data/mygoodfilename.txt, and the web server is correctly serving it as the Content-Type ‘text/plain’. What do you think the browser interprets it as? Plain text? You should be so lucky.
The problem is that browsers (primarily IE) don't trust your Content-Type header, and instead sniff the contents of the file to see if it looks like something else. Serve the above snippet as plain text, and IE will happily treat it as HTML. This can be a huge problem, because HTML can include client-side scripts that will take over the user's access to the site (a cross-site-scripting attack).
At this point you might be tempted to sniff the file on the server-side, for example using the ‘file’ command, to check it doesn't contain ‘<html>’. But this is doomed to failure. The ‘file’ command does not sniff for all the same HTML tags as IE does, and other browsers sniff differently anyway. It's quite easy to prepare a file that ‘file’ will claim is not HTML, but that IE will nevertheless treat as if it is (with security-disaster implications).
Content-sniffing approaches such as ‘file’ will give you only a false sense of security. This is a convenience tool for loose guessing of filetypes and not an effective security measure.
At this point your last desperate possibilities are things like:
serving all user-uploaded files from a separate hostname, so that a script injection attack can't purloin the credentials of your main site;
serving all user-uploaded files through a CGI wrapper, adding the header ‘Content-Disposition: attachment’ so that browsers won't attempt to display them directly (a sketch of such a wrapper follows this list);
only accepting uploads from trusted users.
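A minimal sketch of such a wrapper, assuming CGI.pm, a storage directory outside the web root, and a file name policy matching the known-good names generated above; the directory and the parameter name are assumptions:
#!/usr/bin/perl
use strict;
use warnings;
use CGI ();

my $cgi  = CGI->new;
my $name = $cgi->param('f') || '';

# Only serve plain names we generated ourselves, never client-supplied paths
die "bad file name\n" unless $name =~ /\A\w[\w.-]*\z/;

my $path = "/data/uploads/$name";               # hypothetical storage directory
open my $fh, '<', $path or die "not found\n";
binmode $fh;

# Force a download instead of inline display
print $cgi->header(
    -type       => 'application/octet-stream',
    -attachment => $name,                       # adds Content-Disposition: attachment
);
binmode STDOUT;
print while <$fh>;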
On Unix the easiest way is to do as JRockway suggested. If you're not on Unix then your options are limited: you can examine the file extension and you can examine the contents to verify. I'm assuming for your specific case that you only want "* separated value" text files, so one of the Text::CSV::* modules may be useful in verifying the file is the type you asked for.
Security for this operation is a whole other ball of wax.
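To make the Text::CSV suggestion concrete, a rough sketch, again assuming $path is the uploaded file; treating any unparseable line as grounds for rejection is an assumption about how strict you want to be:
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot construct Text::CSV parser\n";

open my $fh, '<', $path or die "cannot open upload: $!\n";
while (my $line = <$fh>) {
    # Bail out on the first line that does not parse as CSV
    die "This does not look like a CSV file\n" unless $csv->parse($line);
}
close $fh;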
try this:
my $file_name = "file.txt";
my $file_cmd  = "file \"$file_name\"";     # build the file(1) command line
my $file_type = `$file_cmd`;               # run it and capture the output
return 0 unless ($file_type =~ /(ASCII|text)/i);