Robots.txt special characters disallow - robots.txt

Example link: upload.php?id=46. I want to disallow all such links, i.e. id=1, 2, 3, and so on.
How can I do that using a special character? Will this work for me?
Disallow: /upload.php?id=*

Your example will work fine for the major search engines, but the final * is unnecessary and will cause the line to be ignored by older robots that don't support wildcards. The Disallow directive basically means "block anything that starts with the following", so putting a wildcard at the end is redundant and has no effect on what will be matched. Wildcards are not part of the original robots.txt specification; all of the major search engines support them, but many older robots do not.
The following does exactly the same thing as your example, but without wildcards:
User-agent: *
Disallow: /upload.php?id=

Why not just use a header in the upload.php file? I.e. put:
header("X-Robots-Tag: noindex, nofollow", true);
At the top of upload.php. If you're using Apache to serve your files, you can also set up rule-based headers in your configuration file.
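For instance, with mod_headers enabled, something along these lines in the Apache configuration (or an .htaccess file) would send the same header for just that one file; treat it as a sketch rather than a drop-in config:
<Files "upload.php">
    Header set X-Robots-Tag "noindex, nofollow"
</Files>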

Can I block search engines from scanning files starting with a certain letter using robots.txt?

I know I can block search engines from accessing types of files using a wild card like this:
Disallow: /*.gif$
That disallows access to GIFs, or more precisely to files ending in .gif.
But is there a way to prevent search engines from accessing for example all files starting with "_"?
Would something like this work?
Disallow: /_*.*$
Or at least perhaps this (if I absolutely need to set an extension)?
Disallow: /_*.php$
As per the "official" docs:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines.
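That said, the Disallow directive matches by prefix (as noted in the first answer above), so if these files sit directly under the site root you can block them without any wildcards. A sketch, assuming the leading underscore appears right after the root:
User-agent: *
Disallow: /_
Any URL path that begins with /_ would then be blocked, whatever its extension.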

Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.
I have run into behaviour that is strange to me, though it may be intended.
When indexing or searching a column that contains file names with extensions (e.g. myimage.jpg), the system treats the value as a URL and does not tokenize it properly.
I referred to the documentation and can see via ts_debug that the file name is taken as the host part of a URL.
Could someone tell me how to have all inputs treated as normal words in PostgreSQL's full-text search?
Also, as a second question, how can one do contains, starts-with, and ends-with searches with it?
Update
I have now tried the statement create text search configuration..., copied from pg_catalog.english, removed host, url, and url_path, and then specified that configuration in the ts_debug call. But still no go: myimage.jpg is still identified as a host.
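Roughly what I ran, with a made-up configuration name:
-- "myconfig" is just a placeholder name
create text search configuration myconfig (copy = pg_catalog.english);
alter text search configuration myconfig drop mapping for host, url, url_path;
select * from ts_debug('myconfig', 'myimage.jpg');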
Version
I use version 9.4
tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, URLs, hosts, etc.).
After trying to figure this out myself, the issue is that you don't seem to be able to easily customise the parser. From my understanding, the parser runs first and generates tokens; those tokens are then matched against dictionaries.
By removing host, url, and url_path from the configuration, all you are doing is making it so that those tokens don't get looked up in a dictionary, resulting in no lexemes from them. That essentially means they don't exist in terms of search, which is not what you want.
Ideally you would customise the parser so that it doesn't generate those tokens in the first place, or so that it also generates overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as for the individual components). This doesn't seem to be possible at the moment without writing a custom parser.
The only workaround is to pre-parse the text to remove the full stop. Note that if you rely on other token types such as version (e.g. 8.3.0) or email (e.g. name@domain.com), this will break those, so you may need to be a bit clever about how you remove characters.
select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));
"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"
In terms of your second question: are you talking about partial word matches? You get this a little bit with stemming when using a configuration like english, so running becomes run, which will match whether you search for run or running. If you're talking about fuzzy matching, it gets a little more complicated. I suggest reading this article: http://rachbelaid.com/postgres-full-text-search-is-good-enough/
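For concreteness, a small sketch of what is built in (none of this comes from the article above): tsquery supports starts-with matching via the :* prefix syntax, while contains and ends-with searches are usually handled with plain pattern matching (or the pg_trgm extension) rather than full-text search.
-- starts-with: prefix matching inside the tsquery
select to_tsvector('english', 'running fast') @@ to_tsquery('english', 'run:*');  -- true
-- contains / ends-with: ordinary pattern matching, outside FTS
select 'myimage.jpg' like '%image%';  -- contains
select 'myimage.jpg' like '%.jpg';    -- ends with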

How to break long URLs in Doxygen comments to satisfy maximum line length?

The coding guidelines of many programming languages limit the line length, e.g. to 80 characters. How can I add a URL that is longer than that limit to a Doxygen comment? How do I tell Doxygen that multiple lines are to be joined to form the actual link?
Example:
##
# @file mycode.py
# @sa See the documentation: http://some.host.some.domain/and_here
# _we_have_a_very_long_URL_that_can_not_be_written_in_one_line
# _because_it_would_exceed_the_line_length_limit
The example above doesn't work, and ending the lines with a backslash doesn't work either (the backslash is just copied into the documentation).
You can try it this way; it worked for me. However, I'm not 100% sure it's going to work for you: our IDE uses spaces for indentation rather than tabs, so when you break the line, and hence the link, it might not work.
<a href="http://stackoverflow.com/questions/9098680/
doxygen-link-to-a-url-doesnt-generate-the-link-correctly">
link
</a>
You could use an alias to abbreviate the long URL, i.e.
##
# @file mycode.py
# @sa See the documentation: @longurl
and in the Doxyfile define
ALIASES = longurl="http://some.host.some.domain/and_here/..."
This is performing necromancy on an old question, and I am answering for C++-style comments. But if you make your link in the form:
/**
* [link_text](http://foo.com/bar/baz/qux/wibble/flob?id=deadbeef123456789abcdefghijklmnopqrstuvwxyz)
*/
You can wrap that URL in the following ways and the generated HTML output will still contain a working anchor tag:
/**
* [link_
text]
(http://foo.com/bar/baz/qux/wibble/
flob?id=deadbeef123456789abcdefghijklmnopqrstuvwxyz)
*/
Obviously this might make the comment block less readable, but it gives you an idea of what is possible. The main advantages here are being able to put the URL on a separate line from the link text, and being able to wrap it at least once after a /.

Blocking files in robots.txt with [possibly] more than one file extension

Is this correct syntax?
Disallow: /file_name.*
If not, is there a way to accomplish this without listing each file twice [or multiple times]?
OK, according to http://tool.motoricerca.info/robots-checker.phtml:
The "*" wildchar in file names is not supported by (all) the user-agents addressed by this block of code. You should use the wildchar "*" in a block of code exclusively addressed to spiders that support the wildchar (Eg. Googlebot).
So, I just use:
<meta name="robots" content="noindex,nofollow">
in each page that I wanted to block from search engines.

How do I exclude specific folders via robots.txt?

I want to exclude all subfolders named "ajax" in any folder from being indexed by search engines.
Examples:
.com/a/ajax
.com/b/ajax
.com/c/ajax
Is this possible via robots.txt?
It's only possible if you list out each folder explicitly. There is no wildcard support in the robots.txt exclusion standard to accomplish the type of thing you want; the standard is a little lacking in this respect.
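If you do list them explicitly, a sketch using the example paths from the question would look like this:
User-agent: *
Disallow: /a/ajax
Disallow: /b/ajax
Disallow: /c/ajax
Since Disallow matches by prefix, each line also blocks anything underneath that path.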