ModSecurity OWASP Core Rule Set - Unicode false positive

We run some web services.
We use ModSecurity for the Apache web server with the OWASP Core Rule Set.
We have problems with Greek and Russian requests because of Cyrillic and Greek letters.
In the rules of OWASP CRS there are patterns like
"(^[\"'´’‘;]+|[\"'´’‘;]+$)"
In the ModSecurity log there are UTF-8 code units where there should be Unicode characters. All ASCII letters are shown as characters, as they should be.
Example:
[Matched Data: \x85 2
\xce\xb7\xce\xbb\xce\xb9\xce\xbf\xcf\x85\xcf\x80\xce
found within ARGS:q: 163 45
\xcf\x83\xce\xbf\xcf\x85\xce\xbd\xce\xb9\xce\xbf\xcf\x85
2
\xce\xb7\xce\xbb\xce\xb9\xce\xbf\xcf\x85\xcf\x80\xce\xbf\xce\xbb\xce\xb7]
[Pattern match
"(?i:(?:[\"'\\xc2\\xb4\\xe2\\x80\\x99\\xe2\\x80\\x98]\\\\s*?(x?or|div|like|between|and)\\\\s*?[\\"'\xc2\xb4\xe2\x80\x99\xe2\x80\x98]?\\d)|(?:\\\\x(?:23|27|3d))|(?:^.?[\"'\\xc2\\xb4\\xe2\\x80\\x99\\xe2\\x80\\x98]$)|(?:(?:^[\\"'\xc2\xb4\xe2\x80\x99\xe2\x80\x98\\\\]*?(?:[\\
..."]
Now we know that it was triggered by a request in Greek:
σουνιου ηλιουπολη (a street in Athens)
That's not our problem. We can figure that out.
The problem is that \x80 is part of the character ’ (e2 80 99)
and \x80 is also part of a Greek letter; that's why we get a false positive.
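For anyone who wants to reproduce the mechanism outside ModSecurity, here is a minimal sketch in Python (our choice purely for illustration; the rule itself runs under PCRE, but the byte-level behaviour of the character class is the same kind of thing): compiled against raw bytes, the lone \x80 inside the class matches the second byte of π (UTF-8 0xcf 0x80), even though no quote character is present.

import re

# The quote class from the rule, written as bytes: " ' ´ ’ ‘
quote_class = re.compile(rb"[\"'\xc2\xb4\xe2\x80\x99\xe2\x80\x98]")

text = "σουνιου ηλιουπολη".encode("utf-8")   # the Greek street name from the log
m = quote_class.search(text)
print(m.group())   # b'\x80' – the second byte of 'π' (0xcf 0x80), not a quote at all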
The actual rule that was triggered:
SecRule
REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|!REQUEST_COOKIES:/_pk_ref/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/*
"(?i:(?:[\"'´’‘]\s*?(x?or|div|like|between|and)\s*?[\"'´’‘]?\d)|(?:\\x(?:23|27|3d))|(?:^.?[\"'´’‘]$)|(?:(?:^[\"'´’‘\\]?(?:[\d\"'´’‘]+|[^\"'´’‘]+[\"'´’‘]))+\s*?(?:n?and|x?x?or|div|like|between|and|not|\|\||\&\&)\s*?[\w\"'´’‘][+&!#(),.-])|(?:[^\w\s]\w+\s?[|-]\s*?[\"'´’‘]\s*?\w)|(?:#\w+\s+(and|x?or|div|like|between|and)\s*?[\"'´’‘\d]+)|(?:#[\w-]+\s(and|x?or|div|like|between|and)\s*?[^\w\s])|(?:[^\w\s:]\s*?\d\W+[^\w\s]\s*?[\"'`´’‘].)|(?:\Winformation_schema|table_name\W))"
"phase:2,capture,t:none,t:urlDecodeUni,block,msg:'Detects classic SQL
injection probings
1/2',id:'981242',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',logdata:'Matched
Data: %{TX.0} found within %{MATCHED_VAR_NAME}:
%{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.id}-%{rule.msg}',setvar:tx.sql_injection_score=+1,setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:'tx.%{tx.msg}-OWASP_CRS/WEB_ATTACK/SQLI-%{matched_var_name}=%{tx.0}'"
As a workaround we adjusted some patterns, changing [\"'´’‘] to (\"|'||\xc2\xb4|\xe2\x80\x99|\xe2\x80\x98) so that the pattern matches the actual combinations of UTF-8 code units that make up a character. We could do this for all 55 SQL injection rules of the Core Rule Set, but that is a very time-consuming task.
We wonder whether there is simply a misconfiguration in the decoding done by Apache or ModSecurity. We know that all non-ASCII characters, and some ASCII characters as well, are URL-encoded with % and UTF-8 by web browsers.

I don't think it's a decoding problem, that looks as expected to me, and your (annoyingly verbose) fix is fine if it is known that the application you are protecting treats all its URL input as UTF-8. (It wouldn't be ‘right’ for something that used Windows-1252, for example, as it would start to let ’ through again.)
Alternatively you could remove the smart-quote filtering entirely, assuming you are not trying to protect an application specifically known to have SQL-injection issues as well as poor Unicode handling. The smart quotes are in there because if an application flattens them to ASCII using a platform function that maps non-ASCII characters to ASCII, like Windows's misguided ‘best fit’ mappings, they could get converted to single quotes, thus evading a preceding WAF filter that tried to remove those. (It seems to me the rule fails to include some other characters that would get flattened to quotes, such as U+02B9, U+02BC, U+02C8, U+2032 and U+FF07, so it's probably already not watertight in any case.)
TBH this is par for the course for mod_security CRS rules; especially for sites that use arbitrary strings in path parts you get lots of false positives, and the larger part of deploying tools like this is configuring them to avoid the worst of the damage.
IMO: WAFs are fundamentally flawed in principle (as it's impossible to define what input might constitute an attack vs a valid request), and the default CRS is more flawed than most. They're useful as a tactical measure to block known attacks against software you can't immediately fix at source, but as a general-purpose input filter they typically cause more problems than they fix.


Will precluding surrogate code points also impede entering Chinese characters?

I have a name input field in an app and would like to prevent users from entering emoji. My idea is to filter out any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters used for writing natural language.
But after reading the spec, I'm not sure whether this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplementary code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and excessive.
Not all emoji are outside the Basic Multilingual Plane (so not all of them require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters, for instance, are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless, especially when you are dealing with personal names. You can never know how uncommon your users’ names might be.
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if that framework is at the same time not aware of how UTF-16 really works (as is the case for Java or JavaScript, for example). Surrogates are never treated as actual characters; they are specially reserved code points that exist for the sole purpose of allowing UTF-16 to deal with characters in the higher planes. Other Unicode encodings aren’t allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji, which would need to be steadily updated.
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via input method editor, copy-paste or dozens of other possibilities. You only get to see a character when it is completed (and an IME's work is done), or depending on the widget toolkit, even only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉; then your unnamed – but hopefully Unicode-compliant – programming language must consider the text at the grapheme level (1 grapheme cluster) or the character level (1 character), not the code unit level (the surrogate pair 0xD840 0xDC09). That means it is okay to exclude characters with the Cs property.
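A quick check in Python (our addition, just to make the point concrete) shows why a language that exposes code points rather than UTF-16 code units is unaffected by a Cs filter, while an So filter does catch symbols:

import unicodedata

ch = "\U00020009"                        # 𠀉, a supplementary-plane Han character used in names
print(len(ch))                            # 1 – the program sees one code point, no surrogates
print(unicodedata.category(ch))           # 'Lo' – a letter, so a Cs/So filter leaves it alone
print(unicodedata.category("\u2600"))     # 'So' – ☀ would be caught by an So filter
print(len(ch.encode("utf-16-le")) // 2)   # 2 – only the UTF-16 encoding involves a surrogate pair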

SOLR Dropping Emoji Miscellaneous characters

It looks like SOLR is treating what should be valid Unicode characters as invalid, and dropping them.
I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:
Query = 'ァ☀' (\u30a1\u2600)
Here's what SOLR did with it:
'debug':{
'rawquerystring':u'\u30a1\u2600',
'querystring':u'\u30a1\u2600',
'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
'parsedquery_toString':u'+(text:\u30a1)',
As you can see, it was OK with 'ァ', but it ATE the "Black Sun" character.
I haven't tried all of the block, but I've confirmed it also doesn't like ⛿ (\u26ff) and ♖ (\u2656).
I'm using SOLR with Jetty, so the various Tomcat issues with character encoding shouldn't apply.
This very likely has more to do with the analyzer. I don't see anything specifying the treatment of those sorts of characters exactly, but they are probably being treated much like punctuation by the StandardAnalyzer (or whatever analyzer you may be using), and so will not be present in the final query. StandardAnalyzer implements the rules set out in UAX #29, Unicode Text Segmentation, in order to separate input into tokens.
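You can see the same effect with a quick sketch (Python here, and a plain \w regex rather than the real UAX #29 algorithm, so treat it as an illustration of the behaviour, not of Lucene's code): a word-oriented tokenizer keeps the letter ァ but has no word to attach ☀ to, so the symbol simply never appears in the token stream.

import re

query = "\u30a1\u2600"              # 'ァ☀'
tokens = re.findall(r"\w+", query)   # \w covers letters and digits, not Symbol (So) characters
print(tokens)                         # ['ァ'] – the Black Sun character yields no token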

What Unicode characters are dangerous?

What Unicode characters (more precisely, code points) are dangerous and should be blacklisted and prohibited for users to use?
I know that BIDI override characters and the "zero width space" are very prone to causing problems, but what others are there?
Thanks
Characters aren’t dangerous: only inappropriate uses of them are.
You might consider reading things like:
Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
RFC 3454: Preparation of Internationalized Strings (“stringprep”)
It is impossible to guess what you mean by dangerous.
A golden rule in security is to whitelist instead of blacklist: instead of trying to cover all bad characters, it is a much better idea to validate by ensuring that users only use known-good characters.
There are libraries that help you build the large whitelist required for international input. For example, in .NET there is UnicodeCategory.
The idea is that instead of whitelisting thousands of individual characters, the library assigns them to categories such as alphanumeric characters, punctuation, control characters, and so on.
Tutorial on whitelisting international characters in .NET
Unicode Regex: Categories
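The linked material is .NET-centric; as a rough equivalent of the same category-based whitelist idea, here is a hedged sketch in Python using unicodedata (our language choice for illustration, with a deliberately broad set of allowed categories that you would tune to your own application):

import unicodedata

ALLOWED_PREFIXES = ("L", "N", "M")                                 # Letter, Number, Mark categories
ALLOWED_OTHER = {"Zs", "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps"}   # spaces and punctuation

def is_allowed(ch: str) -> bool:
    cat = unicodedata.category(ch)
    return cat.startswith(ALLOWED_PREFIXES) or cat in ALLOWED_OTHER

def validate(text: str) -> bool:
    return all(is_allowed(ch) for ch in text)

print(validate("José O'Brien-Smith"))   # True – letters, space, apostrophe, hyphen
print(validate("name\u202e"))            # False – RIGHT-TO-LEFT OVERRIDE is category Cf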
'HANGUL FILLER' (U+3164)
Since Unicode 1.1 in 1993, there has been a blank, wide character that looks like a space but isn't one.
We can't see it, and we can't copy/paste it on its own because we can't select it!
It has to be generated, for example with the Unix keyboard shortcut CTRL + SHIFT + U, then 3164.
It can pretty much 💩 up anything: variables, function names, URLs, file names; it can mimic DNS names, invalidate hash strings, corrupt database entries, blog posts and logins, allow fake look-alike accounts, etc.
DEMO 1: Altering variables
The variable hijaㅤcked contains a Hangul Filler character; the console.log call uses the variable name without that character:
const normal = "Hello w488ld"
const hijaㅤcked = "Hello w488ld"   // the identifier contains U+3164 HANGUL FILLER after "hija"
console.log(normal)                  // "Hello w488ld"
console.log(hijacked)                // ReferenceError – the plain name "hijacked" was never declared
DEMO 2: Hijack URLs
These three URLs will all lead to xn--stackoverflow-fr16ea.com:
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
https://stackㅤㅤoverflow.com
See Unicode Security Considerations Report.
It covers various aspects, from spoofing of rendered strings to dangers of processing UTF encodings in unsafe languages.
U+2800 BRAILLE PATTERN BLANK - a Braille character without any "dots". It looks like a regular "space" but is not classified as one.
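If you do want a concrete check for the specific troublemakers mentioned in this thread (BIDI overrides, the zero-width space, HANGUL FILLER, BRAILLE PATTERN BLANK), a small scan along these lines can flag them before they reach identifiers, URLs or account names. This is a sketch we are adding; the explicit code point list is illustrative, not exhaustive.

import unicodedata

SUSPICIOUS = {"\u200b", "\u202d", "\u202e", "\u3164", "\u2800"}   # extend as needed

def find_suspicious(text):
    for i, ch in enumerate(text):
        if ch in SUSPICIOUS or unicodedata.category(ch) == "Cf":
            yield i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")

print(list(find_suspicious("hija\u3164cked")))   # [(4, 'U+3164', 'HANGUL FILLER')]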

How do I recover a document that has been sent through the character encoding wringer?

Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, ï has become Ã¯.
Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?
For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, each byte of the UTF-8 encoding has been taken as a code point in its own right. This pattern has happened to (seemingly random) non-ASCII characters across my site.
The example you cite looks like good old utf8-over-latin1. You might quickly try out a query like:
select convert(convert(the_problem_column using binary) using utf8)
to see if it irons out the problem.
An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations, and as long as none of those transformations were lossy - you're just reversing the effect of some of those transformations.
If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
There is some discussion in this answer of how you might do that kind of repair using hand-made scripts. I don't know of a tool that is aware of the full range of natural languages and encodings, takes a more advanced statistical approach to spotting possible problems, and recommends the exact transformation to fix them – something like that would be useful.
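If you would rather do the reversal outside MySQL, the same fix can be sketched in a couple of lines of Python (assuming, as above, exactly one UTF-8-read-as-Latin-1 round trip and no lossy steps in between):

broken = "Ã¯"                                     # what the page currently shows
fixed = broken.encode("latin-1").decode("utf-8")  # re-pack the bytes and re-read them as UTF-8
print(fixed)                                      # 'ï' – the original character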
You probably want to look into regular expressions: http://en.wikipedia.org/wiki/Regular_expression.
Using them, you can search out and replace the characters in question.
Here is the MySQL regex documentation: http://dev.mysql.com/doc/refman/5.1/en/regexp.html.

Should I use accented characters in URLs?

When you create web content in languages other than English, the problem of search-engine-optimized and user-friendly URLs emerges.
I'm wondering whether it is best practice to use de-accented letters in URLs -- risking that some words have completely different meanings with and without certain accents -- or whether it is better to stick to non-English characters where appropriate, sacrificing the readability of those URLs in less advanced environments (e.g. MSIE, view source).
"Exotic" letters can appear anywhere: in document titles, in tags, in user names, etc., so they are not always under the complete supervision of the website's maintainer.
A possible approach, of course, would be to set up alternate -- unaccented -- URLs as well, pointing to the original destination, but I would like to hear your opinions about using accented URLs as primary document identifiers.
There's no ambiguity here: RFC 3986 says no; that is, URIs cannot contain Unicode characters, only ASCII.
An entirely different matter is how browsers represent encoded characters when displaying a URI; for example, some browsers will display a space in a URL instead of '%20'. This is how IDN works too: punycoded strings are encoded and decoded by browsers on the fly, so if you visit café.com, you're really visiting xn--caf-dma.com. What appear to be Unicode characters in URLs are really just 'visual sugar' on the part of the browser: if you use a browser that doesn't support IDN or Unicode, the encoded version won't work, because the underlying definition of URLs simply doesn't support it; so for it to work consistently, you need to percent-encode.
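To make both layers concrete (percent-encoding for paths, punycode for hostnames), here is a hedged illustration in Python; the library calls are our own choice, not part of the answer:

from urllib.parse import quote, unquote

print(quote("Éléphant"))               # '%C3%89l%C3%A9phant' – what actually goes on the wire
print(unquote("%C3%89l%C3%A9phant"))   # 'Éléphant' – what a modern browser displays
print("café".encode("idna"))           # b'xn--caf-dma' – the hostname form behind café.com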
When faced with a similar problem, I took advantage of URL rewriting to allow such pages to be accessible by either the accented or unaccented character. The actual URL would be something like
http://www.mysite.com/myresume.html
And a rewriting+character translating function allows this reference
http://www.mysite.com/myresumé.html
to load the same resource. So to answer your question, as the primary resource identifier, I confine myself to 0-9, A-Z, a-z and the occasional hyphen.
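The "character translating" part of that rewrite can be sketched roughly as below (our own hedged example; real sites often use a fuller transliteration table): decompose accented letters, drop the combining marks, and keep only the characters the answer confines itself to.

import re
import unicodedata

def deaccent_slug(text):
    # 'é' decomposes to 'e' + U+0301; dropping combining marks leaves plain 'e'
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    # keep only 0-9, A-Z, a-z and hyphens, as in the answer above
    return re.sub(r"[^0-9A-Za-z-]+", "-", stripped).strip("-")

print(deaccent_slug("myresumé"))   # 'myresume'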
Considering that URLs with accents often end up looking like this:
http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
...which is not that nice... I think we'll still be using de-accented URLs for some time.
Things should get better, though, as accented URLs now seem to be accepted by web browsers.
The Firefox 3.5 I'm currently using displays the URL the nice way, not with %stuff, by the way; this seems to be "new" since Firefox 3.0 (see Firefox 3: UTF-8 support in location bar); so it's probably not supported in IE 6, at least -- and there are still far too many people using that one :-(
Maybe URLs with no accents don't look the best they could; but still, people are used to them and generally seem to understand them quite well.
You should avoid non-ASCII characters in URLs that users may type into the browser manually. It's OK for embedded links pre-encoded by the server.
We found out that browsers can encode the URL in different ways, and it's very hard to figure out which encoding they use. See my question on this issue:
Handling Character Encoding in URI on Tomcat
There are several areas in a full URL, and each one may have different rules.
The protocol is plain ASCII.
The DNS entry is governed by IDN (Internationalized Domain Names) rules and can contain most Unicode characters.
The path (after the first /), the user name and the password can again be almost anything. They are escaped (as %XX), but those are just bytes; which encoding those bytes are in is difficult to know (it is interpreted by the HTTP server).
The parameters part (after the first ?) is passed "as is" (after %XX unescaping) to some server-side application (PHP, ASP, JSP, CGI), and how that interprets the bytes is another story.
It is recommended that the path/user/password/arguments be UTF-8, but that is not mandatory, and not everyone respects it.
So you should definitely allow non-ASCII (we are not in the '80s anymore), but exactly what you do with it might be tricky. Try to use Unicode and stay away from legacy code pages, and tag your content with the proper encoding/charset if you can (using meta in HTML, language directives for ASP/JSP, etc.).
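Pulling those areas apart programmatically shows the different rules at work; here is a hedged sketch using Python's urllib, added purely for illustration and with made-up example names:

from urllib.parse import urlsplit, unquote, parse_qs

url = "https://caf\u00e9.example/%C3%89l%C3%A9phant?q=caf%C3%A9"
parts = urlsplit(url)
print(parts.scheme)                    # 'https' – plain ASCII
print(parts.hostname.encode("idna"))   # b'xn--caf-dma.example' – the IDN form of the host
print(unquote(parts.path))             # '/Éléphant' – %XX bytes decoded, assuming UTF-8
print(parse_qs(parts.query))           # {'q': ['café']} – same assumption for the query string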