Robots.txt: Disallow repeated subdirectories but allow main directories

I have these directories, and there are many of them:
/dir100/media
/dir200/media
/dir300/media
I want to disallow all */media directories.
How can I do this?

You were almost there in your question!
# User agent the rules apply to, '*' is for 'all'
User-agent: *
Disallow: /*/media
# A less restrictive rule that would also work:
# Disallow: /dir*/media
In general, search engines want to see every resource that might be referenced from your pages; if those resources are disallowed for crawling and are critical for understanding the pages through rendering, there's a chance Google and other search engines will have a hard time understanding the pages. Keep that in mind when setting up disallow directives.

Related

Can I "fix" heading levels in rST on GitHub?

On GitHub, in a .md file I'm able to specify heading levels that are respected in the way they are displayed there, but my .rst files are not: the "highest" level heading is always treated as a level 1 heading.
For example,
## Heading
Stuff
## Sub-heading
More stuff
in a .md will treat the first as a second-level heading and the second as a third-level heading, while its equivalent (e.g as generated by pandoc),
Heading
-------
Stuff
Sub-heading
~~~~~~~~~~~
More stuff
is treated as a first-level and a second-level heading.
Is there a way to overcome this? Can I "fix" the heading level in rST, at least as GitHub interprets it?
No, this is not possible.
Docutils does not allow header levels to be skipped. In fact, it will crash hard on inconsistently nested levels. Additionally, there is no hard rule for which characters in the ReST syntax represent which level. It is simply assumed that they appear in the order they are found (the inconsistency comes when you step back up, then down again -- it is assumed that you use the same pattern going back down). Therefore, the first header is always a level 1 header (<h1>) regardless of which character you use. However, in Markdown the levels are explicit in the syntax. If a user starts with ### Header, then that first header in the document must be level 3 (<h3>). Under the hood, Docutils has no mechanism for retaining that info. It only knows whether a header is the "next higher" or "lower" level in consecutive order.
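For illustration (a hypothetical minimal .rst file, not taken from the question), whichever section title Docutils encounters first in a document becomes the level 1 heading, regardless of the adornment character:
Sub-heading
~~~~~~~~~~~
Stuff
Even though "~" marked a second-level heading in the earlier example, here it underlines the first title Docutils sees, so GitHub renders it as a level 1 heading (<h1>).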

Sphinx with metaphone and wildcard search

We are an anatomy platform and use Sphinx for our search. We want to make our search fuzzier and have started to use metaphone to correct spelling mistakes. It finds, for example, phalanges even though the search word is falanges.
That's good but we want more. We want that the user could type in falange or even falang and we still find phalanges. Any ideas how to accomplish this?
If you are interested you can checkout our sphinx config file here.
Thanks!
Well, you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work (to match phalanges).
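As a rough sketch of what that looks like in the config (the index and source names here are made up, and enable_star is only needed on older Sphinx versions), the relevant index settings might be:
index anatomy
{
    source         = anatomy_src
    path           = /var/data/anatomy
    morphology     = metaphone
    min_prefix_len = 3
    enable_star    = 1
}
With prefix indexing enabled, wildcard queries such as falange* can match indexed forms that share the same prefix.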
The problem is that the 'stripped' letters may change the 'sound' of the word (because they change the pronunciation).
E.g. falange becomes FLNJ, but falang actually becomes FLNK - so they are no longer 'substrings' of one another (i.e. phalanges becomes FLNJS, which FLNK* won't match).
... to be honest I don't know a good solution. You could perhaps get better results if you were to apply stemming BEFORE metaphone (so the endings that change the pronunciation of the words are removed).
Alas Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions: implement stemming outside of Sphinx (or maybe with regexp_filter - not sure if, say, a Porter stemmer can be implemented purely with regular expressions),
or modify Sphinx so that ALL morphology processors apply (rather than just the first one that changes the word).
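As a very rough sketch of the regexp_filter idea (this is not a real stemmer - the suffix-chopping patterns below are made-up examples that will also mangle unrelated words, and regexp_filter requires Sphinx built with RE2), lines like these could go into the index definition so the chopping happens before metaphone runs:
regexp_filter = (\w+)es\b => \1
regexp_filter = (\w+)e\b => \1
morphology    = metaphone
With these toy rules, phalanges becomes phalang and falange becomes falang, and both then metaphone-encode to FLNK, which is the "stem before metaphone" ordering suggested above.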

robots.txt - is this working?

I just ran into a robots.txt that looks like this:
User-agent: *
Disallow: /foobar
User-agent: badbot
Disallow: *
After disallowing only a few folders for all, does the specific badbot rule even apply?
Note: This question is merely for understanding the above ruleset. I know using robots.txt is not a proper security mechanism and I'm neither using nor advocating it.
Each bot only ever complies with at most a single record (block).
A block starts with one or more User-agent lines, typically followed by Disallow lines (at least one is required). Blocks are separated by blank lines.
A bot called "badbot" will look for a record with the line User-agent: badbot (or similar, as the bot "should be liberal in interpreting this field"). If no such line is found, it will look for a record with the line User-agent: *. If even this doesn't exist, the bot is allowed to do everything (= default).
So in your example, the bot called "badbot" will follow only the second record (you probably mean Disallow: / instead of Disallow: *), while all other bots only follow the first record.
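For illustration, a corrected version of that ruleset (using Disallow: / as suggested above) would look like this, with a blank line separating the two records:
User-agent: *
Disallow: /foobar

User-agent: badbot
Disallow: /
All bots except badbot match the first record and are only barred from /foobar; badbot matches its own record and is barred from everything.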

Is File::Spec really necessary?

I know all about the history of different OSes having different path formats, but at this point in time there seems to be a general agreement (with one sorta irrelevant holdout*) about how paths work. I find the whole File::Spec route of path management to be clunky and a useless pain.
Is it really worth having this baroque set of functions to manipulate paths? Please convince me I am being shortsighted.
* Irrelevant because even MS Windows allows forward slashes in paths, which means the only funky thing is the volume at the start and that has never really been a problem for me.
Two major systems have volumes. What's the parent of C:? In unix, it's C:/... In Windows, it's C:... (Unfortunately, most people misuse File::Spec to the point of breaking this.)
There are three different sets of path separators in the major systems. The fact that Windows supports "/" could simplify building paths, but it doesn't help in parsing them or in canonising them.
File::Spec also provides functions that would be useful even if every system did use the same style of paths, such as the one that turns a path into a relative path.
That said, I never use File::Spec. I use Path::Class instead. Without sacrificing any usability or usefulness, Path::Class provides a much better interface. And it doesn't let users mishandle volumes.
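As a small sketch of what that interface looks like (the paths here are made up):
use Path::Class qw(file dir);

my $file = file('foo', 'bar', 'baz.txt');   # foo/bar/baz.txt (foo\bar\baz.txt on Windows)
my $dir  = $file->dir;                      # foo/bar
print $file->basename, "\n";                # baz.txt
print $dir->parent, "\n";                   # foo
print $file->absolute, "\n";                # absolute form, resolved against the current directory
Under the hood Path::Class still delegates to File::Spec, but you work with file and directory objects instead of joining and splitting strings yourself.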
For usual file management inside Perl, no, File::Spec is not necessary; using forward slashes everywhere is much less painful and works on Win32 anyway.
cpanminus is a good example used by lots of people, and it has been proven to work well on the Win32 platform. It doesn't use File::Spec for most file path manipulation and just uses forward slashes - that was even suggested by experienced Perl-Win32 developers.
The only place I had to use File::Spec's catfile in cpanm, though, is where I extract file paths from a perl error message (Can't locate File\Path.pm blah blah) and create a file path to pass to the command line (i.e. cmd.exe).
Meanwhile, File::Spec provides useful functions such as canonpath and rel2abs - they're not "necessary" per se but really useful.
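A short sketch of the kind of calls meant here (the file names are made up):
use File::Spec;

my $path = File::Spec->catfile('lib', 'File', 'Path.pm');  # lib/File/Path.pm, or lib\File\Path.pm on Win32
my $dir  = File::Spec->canonpath('lib//./File/');          # collapses to lib/File on unix (backslashes on Win32)
my $abs  = File::Spec->rel2abs('lib/File/Path.pm');        # absolute path, resolved against the current directory
my $rel  = File::Spec->abs2rel($abs);                      # and back again
None of this is strictly necessary if you are happy with hard-coded forward slashes, but it is what catfile, canonpath and rel2abs are for.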
Yes, absolutely.
Golden rule of programming: never hard-code string literals.
Edit: One of the best ways to avoid porting issues is to avoid OS-specific constants, especially in the form of inline literals,
e.g. drive + ":/" + path + "/" + filename
It is bad practice, yet we all commit these atrocities in the haste of the moment or because it doesn't matter for that piece of code. File::Spec is there for when a programmer is adhering to gospel programming.
In addition, it provides the values of special and often-used system directories, e.g. tmp or devnull, which can vary from one distribution/OS to another.
If anything, it could probably do with some other members added to it, like user to point to the user's home directory.
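For example, those special locations are already exposed as class methods:
use File::Spec;

print File::Spec->tmpdir, "\n";    # usually /tmp (or $ENV{TMPDIR}) on unix, a TEMP directory on Windows
print File::Spec->devnull, "\n";   # /dev/null on unix, "nul" on Windows
(There is no File::Spec method for the user's home directory, as noted above; File::HomeDir is the usual module for that.)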
makepp (makepp.sourceforge.net) has a makefile variable $/ which is either / or \ (on non-Cygwin Windows). The reason is that Windows accepts / in filenames, but not in command names (where it starts an option).
From http://perldoc.perl.org/File/Spec.html:
catdir
Concatenate two or more directory names to form a complete path ending with a directory. But remove the trailing slash from the resulting string, because it doesn't look good, isn't necessary and confuses OS/2. Of course, if this is the root directory, don't cut off the trailing slash :-)
So, in this example, I wouldn't need the regex to remove the trailing slash if I used catdir.
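A tiny sketch of that (directory names made up):
use File::Spec;

my $dir = File::Spec->catdir('foo', 'bar');
print "$dir\n";   # foo/bar (foo\bar on Win32) - trailing slash already stripped, no regex needed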

Is HTML::StripScripts still safe for removing modern exploits?

I need a way in Perl to strip naughty things, such as XSS, image interjection, and the works.
I found HTML::StripScripts, but it hasn't been updated in close to two years, and I'm not up to date with all the new exploits.
Is it safe?
What other markup languages (in Perl) would you use?
XSS is a vast topic and exploits come up every other day.
Just removing scripts will not make your code/site safe.
It is better not to try to strip (blacklist) certain things. It is safer to whitelist the HTML/special characters you will allow on your site, e.g. <b>, <i>.
Defang seems to be the latest/greatest anti-XSS lib for Perl on CPAN.
Blacklisting vs Whitelisting
OWASP XSS Cheat Sheet
And I suggest playing with CAL9000 to get an idea of how widespread / tricky XSS is.
HTML::StripScripts is whitelist-based, can use a tree-based parser, and should be as safe as its whitelist.
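For what it's worth, a minimal sketch of using it through HTML::StripScripts::Parser (option names as I recall them - check the current module docs before relying on this):
use HTML::StripScripts::Parser;

my $hss = HTML::StripScripts::Parser->new(
    {
        Context   => 'Flow',   # only allow tags valid in flow context
        AllowSrc  => 0,        # drop src attributes (images, iframes)
        AllowHref => 1,        # keep href links
    }
);

my $clean = $hss->filter_html($dirty_html);
The important part is that anything not on the module's whitelist is filtered out, so its safety depends on how conservative that whitelist is, as the answer above says.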