SOLR dropping emoji / Miscellaneous Symbols characters - unicode

It looks like SOLR is treating characters that should be valid Unicode as invalid, and dropping them.
I "proved" this by turning on query debug to see what the parser was doing with my query. Here's an example:
Query = 'ァ☀' (\u30a1\u2600)
Here's what SOLR did with it:
'debug':{
'rawquerystring':u'\u30a1\u2600',
'querystring':u'\u30a1\u2600',
'parsedquery':u'(+DisjunctionMaxQuery((text:\u30a1)))/no_coord',
'parsedquery_toString':u'+(text:\u30a1)',
As you can see, it was OK with 'ァ', but it ATE the "Black Sun with Rays" character.
I haven't tried ALL of the Miscellaneous Symbols block, but I've confirmed it also doesn't like ⛿ (\u26ff) and ♖ (\u2656).
I'm using SOLR with Jetty, so the various Tomcat issues WRT character encoding shouldn't apply.

This very likely has more to do with the analyzer. I don't see anything specifying the treatment of those sorts of characters exactly, but they are probably being treated much like punctuation by StandardAnalyzer (or whichever analyzer you may be using), and so will not be present in the final query. StandardAnalyzer implements the rules set out in UAX #29, Unicode Text Segmentation, in order to separate input into tokens.
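If you need those symbols to survive analysis, one option is to index the field with a tokenizer that does not apply the UAX #29 word-break rules. Below is a minimal sketch of such a Solr field type, assuming a schema.xml-style configuration; the name text_symbols is illustrative and the filter chain is only an example:

<fieldType name="text_symbols" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- splits on whitespace only, so symbols like \u2600 stay inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The trade-off is that you lose StandardTokenizer's language-aware splitting, so you may prefer to apply this to a separate copyField rather than your main text field.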

Related

How should this mixed-character string be split on unicode word boundaries?

Consider the string "abc를". According to unicode's demo implementation of word segmentation, this string should be split into two words, "abc" and "를". However, 3 different Rust implementations of word boundary detection (regex, unic-segment, unicode-segmentation) have all disagreed, and grouped that string into one word. Which behavior is correct?
As a follow-up, if the grouped behavior is correct, what would be a good way to scan this string for the search term "abc" in a way that still mostly respects word boundaries (for the purpose of checking the validity of string translations)? I'd want to match something like "abc를" but not match something like "abcdef".
I'm not so certain that the demo for word segmentation should be taken as the ground truth, even if it is on an official site. For example, it considers "abc를" ("abc\uB97C") to be two separate words but considers "abc를" ("abc\u1105\u1173\u11af") to be one, even though the former decomposes to the latter.
The idea of a word boundary isn't exactly set in stone. Unicode has a Word Boundary specification which outlines where word-breaks should and should not occur. However, it has an extensive notes section elaborating on other cases (emphasis mine):
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
...
My understanding is that the crates you list are following the spec without further contextual analysis. Why the demo disagrees I cannot say, but it may be an attempt to implement one of these edge cases.
To address your specific problem, I'd suggest using Regex with \b for matching a word boundary. This unfortunately follows the same Unicode rules and will not consider "를" to be a new word. However, this regex implementation offers an escape hatch to fall back to ASCII behaviour. Simply use (?-u:\b) to match a non-Unicode boundary:
use regex::Regex;

fn main() {
    let pattern = Regex::new("(?-u:\\b)abc(?-u:\\b)").unwrap();
    println!("{:?}", pattern.find("some abcdef abc를 sentence"));
}
You can run it for yourself on the playground to test your cases and see if this works for you.

Parsing commas in Sphinx

I have a field which can have multiple commas which are actually critical to some regex pattern matching.
Commas however do not index and adding them to the charset breaks it (for a # of technical reasons on how sphinx searches/indexes).
I cannot change the character prior to indexing (e.g. replacing it with a placeholder like COMMA) so that I'd have some anchor for the pattern, and without one I can't properly extract the pattern.
My only thought is to use exceptions to map , => COMMA (this won't process large text fields, so it's not a huge issue). I'm curious whether this will work and what the pipeline is, i.e. what it could possibly break that I'm not considering. AFAIK exceptions happen first and do not obey charset_table, so this might in fact work. I get that I can test it to see if it does, but again I am more concerned with what this might break, given my rudimentary knowledge of the Sphinx indexing pipeline.
Just use U+2C to add comma to your charset_table, e.g.
charset_table=a..z,A..Z,0..9,U+2C
You might also want to add it to blend_chars instead, so that a comma is treated both as a word separator and as part of the word.
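For illustration, a minimal sketch of how either option might look in the index definition (the index name is a placeholder and the charset_table shown is just the example above; U+2C is used because a literal comma would be read as a list separator):

index my_index
{
    # ... source, path, etc.
    # either treat the comma as an ordinary word character ...
    charset_table = a..z, A..Z, 0..9, U+2C
    # ... or, instead, make it a blended character
    # blend_chars = U+2C
}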

ModSecurity OWASP Core Rule Set - unicode false positive

We run some web services.
We use ModSecurity for Apache webserver with the OWASP core rule set.
We have problems with Greek and Russian requests, because of Cyrillic and Greek letters.
In the rules of OWASP CRS there are patterns like
"(^[\"'´’‘;]+|[\"'´’‘;]+$)"
In the ModSecurity log there are UTF-8 code units where there should be Unicode characters. All ASCII letters are shown as characters, as they should be.
Example:
[Matched Data: \x85 2
\xce\xb7\xce\xbb\xce\xb9\xce\xbf\xcf\x85\xcf\x80\xce
found within ARGS:q: 163 45
\xcf\x83\xce\xbf\xcf\x85\xce\xbd\xce\xb9\xce\xbf\xcf\x85
2
\xce\xb7\xce\xbb\xce\xb9\xce\xbf\xcf\x85\xcf\x80\xce\xbf\xce\xbb\xce\xb7]
[Pattern match
"(?i:(?:[\"'\\xc2\\xb4\\xe2\\x80\\x99\\xe2\\x80\\x98]\\\\s*?(x?or|div|like|between|and)\\\\s*?[\\"'\xc2\xb4\xe2\x80\x99\xe2\x80\x98]?\\d)|(?:\\\\x(?:23|27|3d))|(?:^.?[\"'\\xc2\\xb4\\xe2\\x80\\x99\\xe2\\x80\\x98]$)|(?:(?:^[\\"'\xc2\xb4\xe2\x80\x99\xe2\x80\x98\\\\]*?(?:[\\
..."]
Now we know that it was triggered by a request in greek:
σουνιου ηλιουπολη (a street in Athens)
That's not our problem. We can figure that out.
The problem is that \x80 is part of the character ’ (e2 80 99),
and \x80 is also part of a Greek letter; that's why we get a false positive.
The actual rule that was triggered:
SecRule
REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|!REQUEST_COOKIES:/_pk_ref/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/*
"(?i:(?:[\"'´’‘]\s*?(x?or|div|like|between|and)\s*?[\"'´’‘]?\d)|(?:\\x(?:23|27|3d))|(?:^.?[\"'´’‘]$)|(?:(?:^[\"'´’‘\\]?(?:[\d\"'´’‘]+|[^\"'´’‘]+[\"'´’‘]))+\s*?(?:n?and|x?x?or|div|like|between|and|not|\|\||\&\&)\s*?[\w\"'´’‘][+&!#(),.-])|(?:[^\w\s]\w+\s?[|-]\s*?[\"'´’‘]\s*?\w)|(?:#\w+\s+(and|x?or|div|like|between|and)\s*?[\"'´’‘\d]+)|(?:#[\w-]+\s(and|x?or|div|like|between|and)\s*?[^\w\s])|(?:[^\w\s:]\s*?\d\W+[^\w\s]\s*?[\"'`´’‘].)|(?:\Winformation_schema|table_name\W))"
"phase:2,capture,t:none,t:urlDecodeUni,block,msg:'Detects classic SQL
injection probings
1/2',id:'981242',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',logdata:'Matched
Data: %{TX.0} found within %{MATCHED_VAR_NAME}:
%{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.id}-%{rule.msg}',setvar:tx.sql_injection_score=+1,setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:'tx.%{tx.msg}-OWASP_CRS/WEB_ATTACK/SQLI-%{matched_var_name}=%{tx.0}'"
For a workaround we adjusted some patterns like [\"'´’‘] to (\"|'||\xc2\xb4|\xe2\x80\x99|\xe2\x80\x98) so it matches the actual combinations of UTF-8 code units that build a character. We could do this for all 55 SQL Injection Rules of the Core Rule Set, but this is a heavy time consuming task.
We wonder if there is just a misconfiguration in the decoding done by Apache or ModSecurity. We know that all non-ASCII characters (and some ASCII characters as well) are URL-encoded as UTF-8 percent-escapes by web browsers.
I don't think it's a decoding problem, that looks as expected to me, and your (annoyingly verbose) fix is fine if it is known that the application you are protecting treats all its URL input as UTF-8. (It wouldn't be ‘right’ for something that used Windows-1252, for example, as it would start to let ’ through again.)
Alternatively you could remove the smart-quote filtering entirely, assuming you are not trying to protect an application specifically known to have SQL-injection issues as well as poor Unicode handling. The smart quotes are in there because if an application flattens them to ASCII using a platform function which maps non-ASCII characters to ASCII, like Windows's misguided ‘best fit’ mappings, they could get converted to single quotes, thus evading a preceding WAF filter that tried to remove those. (It seems to me the rule fails to include some other characters that would get flattened to quotes, such as U+02B9, U+02BC, U+02C8, U+2032 and U+FF07, so it's probably already not watertight in any case.)
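If you do go down that road, a blunt but common approach is to switch off the offending rules after the CRS include, or to exclude just the parameter that keeps tripping them. A sketch, using only the rule ID from your log (extend it to whichever rules you actually hit; SecRuleUpdateTargetById needs ModSecurity 2.7+):

# after the Include lines that load the CRS
SecRuleRemoveById 981242
# or, more narrowly, stop inspecting the search parameter with that rule
# SecRuleUpdateTargetById 981242 "!ARGS:q"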
TBH this is par for the course for mod_security CRS rules; especially for sites that use arbitrary strings in path parts you get lots of false positives, and the larger part of deploying tools like this is configuring them to avoid the worst of the damage.
IMO: WAFs are fundamentally flawed in principle (as it's impossible to define what input might constitute an attack vs a valid request), and the default CRS is more flawed than most. They're useful as a tactical measure to block known attacks against software you can't immediately fix at source, but as a general-purpose input filter they typically cause more problems than they fix.

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
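As an illustration, a minimal stylesheet skeleton with that output declaration (the identity template here is just a placeholder for your real transform):

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="us-ascii"/>
  <!-- identity copy: the serialiser emits numeric character references
       for anything in text or attribute values that ASCII cannot hold -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>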
Well, with XSLT 2.0, which you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.
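For example, if only a handful of characters need escaping while the declaration stays UTF-8, a character map can substitute the references for you. A sketch covering just U+201D from the example above (the map name is arbitrary; add one xsl:output-character per character you need):

<xsl:output method="xml" encoding="UTF-8" use-character-maps="escape-specials"/>
<xsl:character-map name="escape-specials">
  <!-- emit the literal text "&#x201D;" in place of the curly quote;
       character-map strings are written out without further escaping -->
  <xsl:output-character character="&#x201D;" string="&amp;#x201D;"/>
</xsl:character-map>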

How do I recover a document that has been sent through the character encoding wringer?

Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, ï has become Ã¯.
Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?
For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, each byte of the original UTF-8 encoding has been turned into a code point of its own and encoded again. This pattern has happened to (seemingly random) non-ASCII characters across my site.
The example you cite looks like good old utf8-over-latin1. You might quickly try out a query like:
select convert(convert(the_problem_column using binary) using utf8)
to see if it irons out the problem.
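If that SELECT comes back clean, the same conversion can be applied in place. A sketch, with placeholder table and column names - back the table up first:

UPDATE the_problem_table
SET    the_problem_column =
       CONVERT(CONVERT(the_problem_column USING binary) USING utf8);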
An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations, and as long as none of those transformations were lossy - you're just reversing the effect of some of those transformations.
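The mechanics of that reversal, sketched in a few lines of Rust for a single mangled string (the sample value is the "ï" example from the question):

fn main() {
    let mangled = "Ã¯"; // what the database now shows where "ï" should be
    // Each visible character is really one byte of the original UTF-8,
    // promoted to its own code point; map it back to that byte...
    let bytes: Vec<u8> = mangled.chars().map(|c| c as u32 as u8).collect();
    // ...and re-decode the byte sequence as UTF-8.
    let fixed = String::from_utf8(bytes).expect("reversal did not yield valid UTF-8");
    println!("{} -> {}", mangled, fixed); // prints "Ã¯ -> ï"
}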
If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
There is some discussion in this answer of how you might do that kind of repair using handmade scripts. I don't know of a tool that is aware of the full range of natural languages and encodings, takes a more advanced statistical approach to spotting possible problems, and recommends the exact transformation to fix them - something like that would be useful.
You probably want to look into regex, http://en.wikipedia.org/wiki/Regular_expression.
Using this you can then search out and replace the characters in question.
Here is the MySQL regex documentation, http://dev.mysql.com/doc/refman/5.1/en/regexp.html.