Robots.txt taking out form keys

I was wondering if someone would be able to take a look at this and tell me if I configured it correctly. I am not trying to block all parameters (?), just the ones with hsformkey. Here is how I wrote the directive. I tested it in Search Console and it says it is blocked, but I'm not sure I trust it.
Disallow: *?hsFormKey*
Thanks!

Robots.txt is a convention that many automated web crawlers (most often used by search engines to index sites) follow; it tells them which pages you don't want them to crawl. It's a fairly loose convention, and different crawlers support different features.
The original documentation, which is the closest there is to a universal standard, doesn't include the concept of "wildcards" or "globbing". Per "The Web Robots Pages":
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
That being said, it's a common addition that many crawlers do support. For example, in Google's documentation of the directives they support, they describe pattern-matching support that does handle using * as a wildcard. Note that since Disallow directives are beginning-of-URL instructions anyway, the asterisk at the end doesn't do anything useful.
The thing to keep in mind is that the exclusion rules work directly on the URL, not on any path or conventions used by your web server or application framework. While your application may treat characters like ? and & as delimiting parameter information (as it's a pretty common and standard thing to do), the web crawler is just interpreting the entire URL. Also, URL paths are case sensitive. So, you are excluding the web crawler from loading any URL that has ?hsFormKey in it, but not excluding URLs with ?somethingelse=value&hsFormKey=123 in them, nor are you excluding URLs with ?hsformkey in them. This may be exactly what you want, but I'm not sure of your requirements. And as I said, it will only be effective for those crawlers that recognize this style of wildcard.
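For example, if the goal is to block the parameter wherever it appears in the query string, and you are targeting crawlers that support Google-style wildcards, a sketch like this covers both the first-parameter and later-parameter cases (adjust the casing to whatever your URLs actually use):

User-agent: *
Disallow: /*?hsFormKey
Disallow: /*&hsFormKey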
Also be aware that robots.txt only prevents crawlers from loading the excluded URLs. Excluded URLs can still show up in other places that people can find (bookmarked links, links shared on social media, etc.) and can still show up in search engines. Google explicitly states that they will still index and list a URL even if they're not allowed to crawl it:
While Google won't crawl or index the content blocked by robots.txt, we might still find and index a disallowed URL from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Google Search results completely by using other URL blocking methods, such as password-protecting the files on your server or using the noindex meta tag or response header.
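For reference, the two mechanisms the quote points to look like this - a robots meta tag in the page's HTML, or the equivalent HTTP response header:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex

Note that a crawler has to be allowed to fetch the page in order to see either of these, so they don't combine well with a robots.txt Disallow on the same URL.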

Related

How to refuse wget?

I am uploading images to a public directory and I would like to prevent users from downloading the whole lot using wget. Is there a way to do this?
As far as I can see, there must be. I have found a number of sites where, as a public browser, I can download a single image, but as soon as I run wget against them I get a 403 (Forbidden). I have tried using the no-robot argument, but I'm still not able to download them. (I won't name the sites here, for security reasons).
You can restrict access based on the user-agent string; see Apache 2.4's mod_authz_core, for example.
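For instance, a minimal Apache 2.4 sketch using an expression-based Require (the directory path is an assumption, and remember the User-Agent header is trivially spoofable):

<Directory "/var/www/html/images">
    # Refuse requests whose User-Agent header contains "wget" or "Wget".
    Require expr "%{HTTP_USER_AGENT} !~ /[Ww]get/"
</Directory>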
Wget also respects robots.txt directives by default. This should deter any casual user.
However, a careful look at the wget manual shows how to bypass these restrictions. Wget can also add random delays between requests, so even advanced techniques based on access-pattern analysis may be defeated.
So the more robust approach is to defeat wget's link/reference recognition engine. Namely, the content you want to keep unmirrored should be loaded dynamically using JavaScript, and the URLs must be encoded in a way that requires JavaScript code to decode. This would protect your content, but it would require you to manually provide an unobfuscated version for the web bots you do want indexing your site, such as Googlebot (and no, it is not the only one you should care about). Also, some people do not run JavaScript by default (esoteric browsers, low-end machines, and mobile devices may demand such a policy).
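A minimal sketch of that idea, assuming hypothetical class and attribute names: the markup ships only base64-encoded image URLs, and a small script turns them into real <img> elements, so a plain wget mirror never sees the actual URLs.

// Markup shape (hypothetical): <span class="protected-img" data-src="BASE64_OF_IMAGE_URL"></span>
var nodes = document.getElementsByClassName('protected-img');
// Iterate backwards because the live collection shrinks as elements are replaced.
for (var i = nodes.length - 1; i >= 0; i--) {
    var el = nodes[i];
    var img = document.createElement('img');
    img.src = atob(el.getAttribute('data-src'));  // decode the obfuscated URL
    el.parentNode.replaceChild(img, el);
}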

CMS for managing plain-text content, with tagging

We have some quite specific requirements for our app that a CMS may help us with, and we're hoping that someone may know of a CMS that matches these requirements (it's quite a laborious task to download each CMS and verify this manually).
We want a CMS that allows users to create and manage articles, but stores the articles as plain text only. All of the CMSs that we have looked at so far are geared towards creating HTML pages. We want the CMS to manage workflow (an approval process) and track history.
The reason for the plain-text-only requirement is that the intent is to allow business people to generate content which we are going to display in our Silverlight application - we don't want to go down the route of hosting and displaying arbitrary HTML in the app, as we want the styling to be seamless with our app, amongst other reasons.
We would also want to allow the user to link to media stored on the server, but not to external sites (i.e. HTML with no formatting, or some other way of specifying article links). The third requirement is the ability to tag articles and search them.
Does anyone know of any non-HTML-targeted CMS systems that may match these requirements?
I would expect several CMS systems to allow this, but eZ Publish, for one, stores content as plain XML. You have a way of allowing certain tags if you wish, and of explicitly preventing, for example, external links. You then have options for how to present that content according to the templates you choose to use.
You also have control via a /layout/set/myLayout directive.
You could, for example, retrieve the content as a plain XML feed, a print layout, or whatever custom format you choose at the time, with appropriate headers.
http://doc.ez.no/eZ-Publish/Technical-manual/3.10/Reference/Modules/layout/(language)/eng-GB
vs.
http://doc.ez.no/layout/set/print/eZ-Publish/Technical-manual/3.10/Reference/Modules/layout/(language)/eng-GB
You could define a layout such as /layout/set/xml/....
Workflow (as in content approval processes), versioning, tagging, and search are standard.
You can give Statamic a try.
http://statamic.com/
Not sure if you can disallow external links, though.

How to implement WordPress-like permalinks

I was thinking about building a CMS, and I want to implement WordPress-like permalinks for my posts. How do I do that?
I mean, how do I define a custom URL structure for my pages?
What language are you using? I'm assuming that you are thinking about PHP (given your reference to WordPress). You have a few options:
Mod-Rewrite
Router
In my opinion, the best option is to find a modern web framework that provides good routing functionality. You could also look at modifying an existing CMS (many exist; you seem to have heard of WordPress).
I'd recommend creating links that pass in URL parameters, such as "http://...?PostID=123&CatID=232&...", so that when the person clicks on that particular link, you can parse the parameters in the URL and get the exact post based on its ID, or even do further filtering by passing in other fields as needed.
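As a quick sketch of that approach in PHP (parameter names taken from the example URL above):

// Read and sanitise the ids from the query string, e.g. ?PostID=123&CatID=232
$postId = isset($_GET['PostID']) ? (int) $_GET['PostID'] : 0;
$catId  = isset($_GET['CatID'])  ? (int) $_GET['CatID']  : 0;
// ...then load the post by $postId and optionally filter by $catId.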
If you want to build the whole thing yourself, first understand what a front controller is, as it really addresses the underlying issue of how do you execute the same code for different URLs. With this understanding, there are two ways to attack the problem with this design pattern: URL rewriting or physical file generation.
URL Rewriting
With URL rewriting, you would need to intercept the requested URL and send it on to your front controller. Typically this is accomplished at the web server level, although some application servers also act as web servers. With Apache, as others have posted, you would use mod_rewrite with a rule that looks something like this:
RewriteRule ^/(.*) /path/to/front/controller.ext [E=REQUEST_URI:%{REQUEST_URI},QSA,PT,NS]
With this rule, the path originally requested will be sent to the front controller as a variable called "REQUEST_URI". (Note, I'm not sure of the right syntax in PHP to access it.) In the front controller, hash (e.g. MD5) this value and use it to look up the record in a database - taking into account that whatever hashing algorithm you use can produce duplicates. The hash is necessary if you allow URLs longer than the max column size in your database for varchar data, assuming you can't search on CLOBs.
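As a rough illustration, a front controller in PHP might do that lookup roughly like this (the table and column names are assumptions; note that PHP also exposes the original request path directly as $_SERVER['REQUEST_URI'], independent of the E= flag):

<?php
// Front-controller sketch: map the requested path to a post via a hash lookup.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);  // drop any query string
$hash = md5($path);  // small, indexable key; collisions are possible

$pdo  = new PDO('mysql:host=localhost;dbname=cms', 'user', 'password');
$stmt = $pdo->prepare(
    'SELECT id, title, body FROM posts WHERE url_hash = :hash AND url = :url LIMIT 1'
);
// Comparing the stored URL as well guards against hash collisions.
$stmt->execute([':hash' => $hash, ':url' => $path]);
$post = $stmt->fetch(PDO::FETCH_ASSOC);

if ($post === false) {
    http_response_code(404);
    exit('Not found');
}
echo htmlspecialchars($post['title']);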
Physical File Generation
Physical file generation would create a file that maps to the permanent URL you're imagining. So you'd write something that creates/renames the file at the time it's posted. This removes the need for storing a hash; instead, you place information about the post you want to serve inside that file (i.e. the ID of the post) and pass that along to the front controller.
Recommendation
My preference is the URL rewriting approach, so you don't have to worry about writing dynamic code files out at runtime. That said, if you want something with less magic, or you're expecting a lot of requests, the physical file generation is the way to go because it's more obvious and requires the server to do less work.

One robots.txt to allow crawling of only the live website; the rest should be disallowed

I need guidance about using robots.txt; my problem is as follows.
I have one live website, "www.faisal.com" (or "faisal.com"), and two testing web servers:
"faisal.jupiter.com" and "faisal.dev.com"
I want one robots.txt to handle this all. I don't want crawlers to index pages from "faisal.jupiter.com" or "faisal.dev.com"; they should only be allowed to index pages from "www.faisal.com" or "faisal.com".
I want one robots.txt file which will be on all web servers and should allow indexing of only the live website.
The Disallow directive specifies only relative URLs, so I guess you cannot have the same robots.txt file for all.
Why not force HTTP authentication on the dev/test servers?
That way the robots won't be able to crawl these servers.
It seems like a good idea if you want to allow specific people to check them, but not everybody trying to find flaws in your not-yet-debugged new version...
Especially now that you have given the addresses to everybody on the web.
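A minimal Apache sketch of that, for one of the test hosts from the question (the password-file path is an assumption):

<VirtualHost *:80>
    ServerName faisal.dev.com
    <Location "/">
        AuthType Basic
        AuthName "Development server"
        AuthUserFile /etc/apache2/.htpasswd
        Require valid-user
    </Location>
</VirtualHost>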
Depending on who needs to access the dev and test servers -- and from where -- you could use .htaccess or iptables to restrict access at the IP address level.
Or, you could separate your robots.txt file from the web application itself, so that you can control the contents of it relative to the environment.
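For example, with Apache and mod_rewrite you could serve a deny-all file as robots.txt on every host except the live site (the file name robots-disallow.txt is an assumption):

RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www\.)?faisal\.com$ [NC]
RewriteRule ^/?robots\.txt$ /robots-disallow.txt [L]

where robots-disallow.txt contains:

User-agent: *
Disallow: /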

How to read DOM of the iframe loaded with a page from another domain?

Is there a way to access the DOM of the document in an iframe from parent doc if the doc in the iframe is on another domain? I can easily access it if both parent and child pages are on the same domain, but I need to be able to do that when they are on different domains.
If not, maybe there is some other way to READ the contents of an iframe (one consideration was to create an ActiveX control, since this would be for internal corporate use only, but I would prefer it to be cross-browser compatible)?
Not really. This is essential for security – otherwise you could open my online banking site or webmail and mess with it.
You can loosen the restriction a bit by setting document.domain, but the top-level domain must still be the same.
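For example, a page on my.example.com and a page on shop.example.com (hypothetical hosts sharing the same registrable domain) can each run the line below, after which they are allowed to script each other's frames:

document.domain = 'example.com';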
You can work around this limitation by proxying requests via your own server (but don't forget to secure it, otherwise s[cp]ammers may abuse it)
my.example.com/proxy?url=otherdomain.com/page
Theoretically, you can access the content of the iframe using the standard DOM Level 2 contentDocument property. Practically, you may have found out that most browsers deny access to the DOM of the external document due to security concerns.
Access to the full DOM, AFAIK, is not possible (though there might be some browser-specific tweak to disable the same-domain check). For cross-domain XHR, a popular trick is to bounce the data back and forth between the iframe and the main document using URL fragment identifiers (see e.g. this link). You can use the same technique, but:
the document loaded in the iframe must cooperate, and
you don't have access to the full document anyway (you can read back some parameters, or maybe you can try and URL-encode the whole document - but that would be very ugly)
I just found the postMessage method introduced with HTML5; it's already implemented in recent browsers (FF3, IE8 and Safari 4). It allows the exchange of messages between any window objects inside the browser.
For the details see the documentation at MDC and this nice tutorial by John Resig.
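To make that concrete, here is a minimal sketch (the origins and element id are assumptions; older IE needs attachEvent instead of addEventListener):

// In the parent page, which embeds <iframe id="child" src="https://other.example.com/page">:
var frame = document.getElementById('child');
frame.addEventListener('load', function () {
    frame.contentWindow.postMessage('give-me-your-title', 'https://other.example.com');
});
window.addEventListener('message', function (event) {
    if (event.origin !== 'https://other.example.com') return;  // only trust the expected origin
    console.log('iframe says: ' + event.data);
});

// In the page loaded inside the iframe (it has to cooperate):
window.addEventListener('message', function (event) {
    if (event.origin !== 'https://parent.example.com') return;
    event.source.postMessage(document.title, event.origin);
});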