Is sanitizing html by removing angle brackets safe? - code-injection

I want to sanitize a simple text field with a person's name, to protect from XSS and such. Stackoverflow pretty much says I must whitelist. I don't understand this. If I simply remove all < and > from the input value, or replace them with &lt; and &gt;, doesn't that rule out code injection? Or am I missing something? Perhaps you only need to whitelist in more complex scenarios where you have to put up with angle brackets?
Sorry if it's a silly question, it's important to get this right.

Whether to whitelist or encode depends on how you want to use the text.
If you intend to treat the input as plain text, then encoding special characters is enough, and any HTML code entered will display as text only as long as you are careful not to allow unencoded text to end up anywhere in your HTML output. (This includes making sure any other systems you interface with don’t inappropriately use the unencoded text.)
If you want to allow some markup in the input, such as text styling or links, then you must whitelist the tags that you allow and get rid of all others.

No, it's not sufficient because if you were to include the person's name in an html attribute, you would also need to escape any double-quotes contained therein.
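To make the attribute case concrete, here is a minimal sketch in Python (not the asker's stack; `render_name` and `strip_angle_brackets` are hypothetical helper names). It shows a payload with no angle brackets at all breaking out of an attribute, and the same payload neutralised by escaping at output time.

```python
import html


def strip_angle_brackets(name: str) -> str:
    """The approach from the question: just drop < and >."""
    return name.replace("<", "").replace(">", "")


def render_name(name: str) -> str:
    """Escape at output time; quote=True also escapes " and '."""
    safe = html.escape(name, quote=True)
    return f'<input type="text" name="name" value="{safe}"> Hello, {safe}!'


# A payload without any angle brackets.
payload = '" onmouseover="alert(1)'

# Stripping brackets leaves the double quote, so the attribute is closed early
# and a new event-handler attribute is injected.
unsafe = f'<input value="{strip_angle_brackets(payload)}">'

# Escaping turns the quote into &quot;, so it stays inside the value.
safe_html = render_name(payload)
```

The escaped version is safe in both the text and the double-quoted attribute position; other contexts (URLs, inline script, CSS) need their own encoding.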


Handling right-to-left/left-to-right override characters in user input

I need to embed user input in a string; for example, "<User> sent a message".
The problem comes if the user input includes one of the directionality override characters (U+202D or U+202E). If "<User>" includes an RLO character, the displayed string becomes "‪<User>‮ sent a message‬".
My question is how best to handle this. Are there legitimate uses for RLO and LRO, or is stripping them out a plausible option? Otherwise maybe I can wrap the user input with "Left-to-right embedding" (U+202A) and "Pop Directional Formatting" (U+202C), though doing that right probably requires me to make sure that the user input doesn't contain unbalanced PDF characters.
Are there legitimate uses for RLO and LRO, or is stripping them out a plausible option?
I strip them, along with all the other characters designated not suitable for use in markup.
Legitimacy is an arguable point, but real Arabic/Hebrew/etc keyboards can't type BiDi control characters, so you are not likely to come across them in non-malicious user input.
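A rough sketch of the stripping approach in Python. The character set here is an assumption: it covers only the explicit embedding, override and isolate controls, not the full list of characters discouraged in markup that the answer refers to. `strip_bidi_controls` is a hypothetical name.

```python
import re

# Explicit directional formatting characters:
#   U+202A-U+202E  LRE, RLE, PDF, LRO, RLO
#   U+2066-U+2069  LRI, RLI, FSI, PDI
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def strip_bidi_controls(text: str) -> str:
    """Remove explicit bidi controls; ordinary RTL letters are left alone."""
    return BIDI_CONTROLS.sub("", text)


user = "abc\u202edef"  # contains an RLO
message = f"{strip_bidi_controls(user)} sent a message"
```

Hebrew or Arabic text itself passes through unchanged, since its direction comes from the letters, not from these control characters.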

Wikipedia (MediaWiki) URI encoding scheme

How do Wikipedia (or MediaWiki in general) encode page titles in URIs? It's not normal URI encoding, since spaces are replaced with underscores and double quotes are not encoded and things like that.
http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restrictions%29 - here you've got some kind of description of what their engine enforces on article names.
They should have something like this in their LocalSettings.php:
$wgArticlePath = '/wiki/$1';
and proper server URI rewrites configuration - they seem to be using Apache (HTTP header), so it's probably mod_rewrite. http://www.mediawiki.org/wiki/Manual:Short_URL
You can also refer to the index.php file for an article on Wikipedia like this: http://en.wikipedia.org/w/index.php?title=Foo%20bar and get redirected by the engine to http://en.wikipedia.org/wiki/Foo_bar. Behind the scenes mod_rewrite translates it into /index.php?title=Foo_bar. For the MediaWiki engine it's the same as if you visited http://en.wikipedia.org/w/index.php?title=Foo_bar - this page doesn't redirect you.
The process is quite complex and isn't exactly pretty. You need to look at the Title class found in includes/Title.php. You should start with the newFromText method, but the bulk of the logic is in the secureAndSplit method.
Note that (as ever with MediaWiki) the code is not decoupled in the slightest. If you want to replicate it, you'll need to extract the logic rather than simply re-using the class.
The logic looks something like this:
Decode character references (e.g. &eacute; becomes é)
Convert spaces to underscores
Check whether the title is a reference to a namespace or interwiki
Remove hash fragments (e.g. Apple#Name)
Remove forbidden characters
Forbid subdirectory links (e.g. ../directory/page)
Forbid triple tilde sequences (~~~) (for some reason)
Limit the size to 255 bytes
Capitalise the first letter
Furthermore, I believe I'm right in saying that quotation marks don't need to be encoded by the original user -- browsers can handle them transparently.
I hope that helps!
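For anyone wanting to replicate the steps listed above, here is a very simplified sketch in Python. It is not MediaWiki's code: the namespace/interwiki step is skipped, the forbidden-character set is an assumption, and `normalize_title` is a hypothetical name.

```python
import html
import re
from typing import Optional

MAX_TITLE_BYTES = 255
# Assumed forbidden set for this sketch; the real set is defined by MediaWiki's config.
FORBIDDEN = re.compile(r"[#<>\[\]|{}\x00-\x1f\x7f]")


def normalize_title(raw: str) -> Optional[str]:
    """Return a normalised title, or None if the title is rejected."""
    title = html.unescape(raw)                # decode character references
    title = re.sub(r"[ _]+", "_", title)      # spaces become underscores
    title = title.split("#", 1)[0]            # drop the hash fragment
    title = title.strip("_")
    if not title or FORBIDDEN.search(title):  # forbidden characters
        return None
    parts = title.split("/")
    if "." in parts or ".." in parts:         # subdirectory-style links
        return None
    if "~~~" in title:                        # triple tilde sequences
        return None
    if len(title.encode("utf-8")) > MAX_TITLE_BYTES:
        return None
    return title[0].upper() + title[1:]       # capitalise the first letter
```

The result still needs ordinary percent-encoding before it goes into a URI; this only covers the title normalisation part.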

How do I protect against cross-site scripting?

I am using PHP and MySQL with Smarty, and I have places where users can post comments and so on. I've already escaped characters before inserting into the database to prevent SQL injection. What else do I need to do?
XSS is mostly about the HTML-escaping(*). Any time you take a string of plain text and put it into an HTML page, whether that text is from the database, directly from user input, from a file, or from somewhere else entirely, you need to escape it.
The minimal HTML escape is to convert all the & symbols to &amp; and all the < symbols to &lt;. When you're putting something into an attribute value you would also need to escape the quote character being used to delimit the attribute, usually " to &quot;. It does no harm to always escape both quotes (" and the single quote apostrophe '), and some people also escape > to &gt;, though this is only necessary for one corner case in XHTML.
Any good web-oriented language should provide a function to do this for you. For example in PHP it's htmlspecialchars():
<p> Hello, <?php echo htmlspecialchars($name); ?>! </p>
and in Smarty templates it's the escape modifier:
<p> Hello, {$name|escape:'html'}! </p>
Really, since HTML-escaping is what you want 95% of the time (it's relatively rare to want to allow raw HTML markup to be included), this should have been the default. Newer templating languages have learned that making HTML-escaping opt-in is a huge mistake that causes endless XSS holes, so they HTML-escape by default.
You can make Smarty behave like this by changing the default modifiers to html. (Don't use htmlall as they suggest there unless you really know what you're doing, or it'll likely screw up all your non-ASCII characters.)
Whatever you do, don't fall into the common PHP mistake of HTML-escaping or “sanitising” for HTML on the input, before it gets processed or put in the database. This is the wrong place to be performing an output-stage encoding and will give you all sorts of problems. If you want to validate your input to make sure it's what the particular application expects, then fine, but weeding out or escaping “special” characters at this stage is inappropriate.
*: Other aspects of XSS are present when (a) you actually want to allow users to post HTML, in which case you have to whittle it down to acceptable elements and attributes, which is a complicated process usually done by a library like HTML Purifier, and even then there have been holes. Alternative, simpler markup schemes may help. And (b) when you allow users to upload files, which is something very difficult to make secure.
In regards to SQL Injection, escaping is not enough - you should use data access libraries where possible and parameterized queries.
For XSS (cross site scripting), start with html encoding outputted data. Again, anti XSS libraries are your friend.
One current approach is to only allow a very limited number of tags in and sanitize those in the process (whitelist + cleanup).
You'll want to make sure people can't post JavaScript code or scary HTML in their comments. I suggest you disallow anything but very basic markup.
If comments are not supposed to contain any markup, doing a
echo htmlspecialchars($commentText);
should suffice, but it's very crude. Better would be to sanitize all input before even putting it in your database. The PHP strip_tags() function could get you started.
If you want to allow HTML comments, but be safe, you could give HTML Purifier a go.
You should not modify data that is entered by the user before putting it into the database. The modification should take place as you're outputting it to the website. You don't want to lose the original data.
As you're spitting it out to the website, you want to escape the special characters into HTML codes using something like htmlspecialchars("my output & stuff", ENT_QUOTES, 'UTF-8') -- make sure to specify the charset you are using. This string will be translated into my output &amp; stuff for the browser to read.
The best way to prevent SQL injection is simply not to use dynamic SQL that accepts user input. Instead, pass the input in as parameters; that way it will be strongly typed and can't inject code.
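The same two rules (parameters on the way in, escaping on the way out) look like this in a small Python sketch. It uses an in-memory SQLite database rather than the asker's PHP/MySQL setup, and the table and function names are made up for illustration.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")


def save_comment(body: str) -> None:
    # The text is passed as a bound parameter, never spliced into the SQL string,
    # and it is stored exactly as the user typed it.
    conn.execute("INSERT INTO comments (body) VALUES (?)", (body,))


def render_comments() -> str:
    # HTML-escaping happens here, at output time, not before storage.
    rows = conn.execute("SELECT body FROM comments ORDER BY id").fetchall()
    return "\n".join(f"<p>{html.escape(body)}</p>" for (body,) in rows)


save_comment("<script>alert('x')</script> & friends'); DROP TABLE comments;--")
page = render_comments()
```

The hostile comment is stored unmodified and the table survives; only the rendered page contains the escaped form.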

How to handle '&' in URL sent as HTML from iPhone Mail.app

Apologies if this has been answered already. There are similar topics but none that I could find pertaining to Cocoa & NSStrings...
I'm constructing a clickable URL to embed in an HTML email to be sent via the MFMailComposeViewController on the iPhone. I create the URL, then use stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding to polish up white space, etc., then add some surrounding HTML to get:
view
All's well so it's appended to emailBody. However once [mailComposer setMessageBody:emailBody isHTML:YES] is called, all the & become &amp; which isn't ideal within my URL.
Can I control this? Is there a better encoding algorithm? My HTML is a bit rusty; perhaps I'm using the wrong encoding? I'm sure on the server I could parse the &amp; back into & but I'm looking for the Cocoa way...
Thanks!
Actually, & should always be encoded as &amp; in HTML attributes. Including links. Including form value delimiters. So it's done exactly what you want, even though you didn't know you wanted it.
Look at it this way: in your URL, you have &age=53... That's interpreted first as a character entity, and only after that doesn't work is it interpreted as an ampersand followed by more character data.
The W3C spec is quite clear on this:
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
That should settle it: use &amp; not &.
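The two layers of encoding are easier to see side by side. A sketch in Python rather than Cocoa (the URL, parameter names and `make_link` are invented for illustration): percent-encoding builds the URL, then HTML-escaping makes it safe to place in the href attribute.

```python
import html
from urllib.parse import urlencode


def make_link(base_url: str, params: dict, label: str) -> str:
    # Layer 1: percent-encode the query string to get a valid URL.
    url = f"{base_url}?{urlencode(params)}"
    # Layer 2: HTML-escape the whole URL for the attribute, so & becomes &amp;.
    return f'<a href="{html.escape(url, quote=True)}">{html.escape(label)}</a>'


link = make_link("http://example.com/profile", {"name": "Al Smith", "age": 53}, "view")
```

A mail client or browser parsing the HTML decodes &amp; back to & before following the link, so the server still sees a plain & between parameters.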
Are you calling MFMailComposeViewController's
setMessageBody:isHTML:
and what do you set isHTML to?
Depending on its setting, it might very well be that MFMailComposeViewController is trying to help you out by encoding the entire message body...
Either don't encode the body yourself or make the entire body HTML.

What restrictions should I impose on usernames [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
What restrictions should I impose on usernames? why?
What restrictions should I not impose on usernames? why?
P.S. db is via best-practice PDO so no risk of sql injection
Thanks
OK, so let's assume you're doing all your string-encoding tasks right. You've not got any SQL injections, HTML injections, or places where you're not URL-encoding something you should. So we don't need to worry about characters like "<&%\ being magic in some contexts. And you're using UTF-8 for everything so all of Unicode is in play. What other reasons are there to limit usernames?
To start with, all control characters, for sanity. There is no reason to have characters U+0000 to U+001F or U+007F to U+009F in a username.
Next, deny or normalise unexpected whitespace. You may want to allow a space in a username, but you almost certainly don't want to allow leading spaces, trailing spaces, or more than one space in a row. They may render the same in HTML, but are probably a user error that will confuse.
If you intend to allow that username to be used to login through HTTP Basic Authentication, you must disallow the : character, because the Basic Auth scheme encodes a ‘username:password’ pair with no escaping if there's a colon in the username or password. So at least one of the username and password must have the colon excluded, and it's better that that's the username because restricting people's choice of passwords is a much worse thing than usernames.
For Basic Authentication you may also want to disable all non-ASCII characters, as they are handled differently by different browsers. IE encodes them using the system codepage; Firefox encodes them using ISO-8859-1; Opera encodes them using UTF-8. Users should at least be warned before choosing non-ASCII names if HTTP Auth is going to be available, as actually using them will be very unreliable.
Next, consider other Unicode control characters, such as the bidi overrides and the other characters designated as unsuitable for use in markup. You will probably end up putting usernames in markup, and you don't want someone with an RLO in their name to turn a load of the text in your page backwards.
Also, if you allow Unicode, do normalisation on the strings you get. Otherwise someone may have a username with a composed o-umlaut character ö, and wonder why they can't log in on a Mac, which by default would use a separate o character followed by a combining umlaut. It's usual to normalise to the composed form NFC on the web. You may also want to do compatibility decompositions by using the form NFKC; this would allow a user Chris to log in from a Japanese keyboard in fullwidth romaji mode typing Ｃｈｒｉｓ. These are general issues it is good to solve for all your webapp's input, but for identifiers like usernames it can be more critical to get right.
Finally, make sure the length is OK to fit in the database without a silent truncation changing the name, especially if you are storing as UTF-8 bytes which you don't want to get snipped halfway through a byte sequence. Username truncations can also be a security issue in general.
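Pulling the checks above together, a rough Python sketch. The 64-byte limit, the exact bidi character set and the name `clean_username` are assumptions for illustration; adjust them to your own schema and policy.

```python
import re
import unicodedata
from typing import Optional

MAX_USERNAME_BYTES = 64  # hypothetical column size, in UTF-8 bytes

CONTROL_CHARS = re.compile("[\u0000-\u001f\u007f-\u009f]")
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def clean_username(raw: str) -> Optional[str]:
    """Return the normalised username, or None if it should be rejected."""
    # Compatibility normalisation: fullwidth letters fold to ASCII,
    # o + combining diaeresis folds to the composed character.
    name = unicodedata.normalize("NFKC", raw)
    if CONTROL_CHARS.search(name) or BIDI_CONTROLS.search(name):
        return None
    # Collapse runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    if not name:
        return None
    # A colon would break HTTP Basic Authentication's username:password pair.
    if ":" in name:
        return None
    # Reject rather than truncate, so the stored name never silently changes.
    if len(name.encode("utf-8")) > MAX_USERNAME_BYTES:
        return None
    return name
```

Apply the same function at registration and at login, so both sides compare the normalised form.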
If you are using usernames as a unique means of identification, you have much more to worry about: the already-mentioned problem of lookalikes such as Сhris (with a Cyrillic Es С). There are too many of these for you to handle reasonably; either restrict to ASCII or have an additional means of identifying users. (Or don't care, like SO doesn't; when I can easily call myself Chris anyway I have no need to call myself С-hris.)
Depends on many things, for instance, if the users are going to have their own URL, you want to be careful that someone who creates the username "%41llan" doesn't clash with the user called "Allan", while allowing forward-slash may cause problems. Look out for those sorts of constraints.
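A small Python illustration of the URL clash, assuming usernames are placed in a path segment (`profile_path` is a made-up helper): used raw, "%41llan" decodes to "Allan"; percent-encoding the whole name, including % and /, keeps the two apart.

```python
from urllib.parse import quote, unquote


def profile_path(username: str) -> str:
    # safe="" means even "/" and "%" are percent-encoded.
    return "/users/" + quote(username, safe="")


# Put into a URL unencoded, this name would be read back as a different user.
decoded_raw = unquote("%41llan")

path_a = profile_path("%41llan")
path_b = profile_path("Allan")
path_c = profile_path("a/b")
```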
I've never seen the point in adding restrictions to usernames. If your code is resistant to sql injection attacks then let them put in anything they want.
The only restriction I'd add is a max length one so that it can be stored in a DB table
Let them use any Unicode character in their username.
Adding restrictions on the allowed characters will probably just annoy people using a non-ascii language.
SQL injection protection is a must, but that should probably be in your code, not in username restrictions. Certain characters should definitely be escaped, like \, %, etc.
It will depend on what kind of site you're running, but I think some obscene word restrictions would make your site look more professional no matter what. If someone sees that people are allowed to go around with "EXPLETIVE" as their username, your site will look childish. It's like allowing teenagers to run rampant in your book store, IMHO. You probably don't need to get much more picky than that, although it's completely up to you.
This is slightly off topic, but as another piece of username advice, a great feature of any website is allowing users to change their username over time. You can just have a number as a primary key, and allowing them to do this can save a lot of whining and people creating new accounts because they wanted to change their username. :D