How to ignore leading special characters in sphinx search? - sphinx

For example, we want the search term "Test" to match the words "#Test" and "(Test)" in the results.

It does this by default, unless you have added these characters to the charset_table in the config.

Did you try putting a backslash escape character just before the special characters?
Also check the EscapeString function available in the Sphinx API.

Related

Searching For Special Characters in postgresql

I am trying to find rows in a PostgreSQL table where a specific column contains special characters, excluding the following:
#^$.!\-#+'~_
Any help appreciated.
Hi, I think I figured it out. I found a solution that worked for me using POSIX regular expressions.
SELECT *
FROM TABLE_NAME
WHERE fieldName ~ '[^A-Za-z0-9#^\\$.!\-#+~_]'
The regular expression matches any character that is not in A-Z, a-z, or 0-9 and is also not one of your whitelisted characters ^$.!-#+~_. Notice that in the regex I had to escape the backslash and the hyphen, because they have a special meaning in regex. You may want to start by evaluating my proposed regex online with a few examples, e.g. here: https://regex101.com
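If you would rather try the pattern locally than online, here is a minimal sketch in Python. It assumes Python's re module treats this particular character class the same way PostgreSQL does, which should hold for a simple negated class like this one, but it is not the PostgreSQL engine. The name has_disallowed is made up for illustration.

```python
import re

# Same negated character class as in the SQL above (raw string, so the
# backslashes reach the regex engine unchanged).
DISALLOWED = re.compile(r"[^A-Za-z0-9#^\\$.!\-#+~_]")


def has_disallowed(value: str) -> bool:
    """Return True if value contains a character outside the whitelist."""
    return DISALLOWED.search(value) is not None


if __name__ == "__main__":
    for sample in ["abc123", "a#b^c$d.e!f-g+h~i_j", "hello world", "a@b", "it's"]:
        print(repr(sample), has_disallowed(sample))
```

Note that the apostrophe from the question's list is not in the character class, so "it's" is reported as containing a special character.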

Perl regex presumably removing non ASCII characters

I found some code with a regex that is claimed to strip the text of any non-ASCII characters.
The code is written in Perl, and the part that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works, so I tried it in regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 denote individual characters such as NULL, TAB, VERTICAL TAB ... But I still do not get why the "-" symbols are used in the regex. Do they really mean a literal dash, as shown in regexr, or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex; tr is Perl's transliteration operator. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters, too (control characters including tab, plus the punctuation in \041-\055 and \173-\177), and it does not touch characters above \377, which are not ASCII either; the full ASCII range would simply be \000-\177.
To be explicit, the d flag says to delete any characters listed between the first pair of slashes that have no counterpart between the second pair, which is empty here. See further the documentation.
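To see exactly which bytes that tr call removes, here is a rough Python equivalent. It is only a sketch: it works on byte strings (so every value is in \000-\377), and the names TR_RANGES and strip_like_tr are invented for this example.

```python
# Octal ranges copied from the Perl tr/// search list.
TR_RANGES = [
    (0o000, 0o011),  # NUL .. TAB
    (0o013, 0o014),  # VT, FF
    (0o016, 0o037),  # remaining control characters
    (0o041, 0o055),  # '!' .. '-'
    (0o173, 0o377),  # '{' .. 0xFF, includes every non-ASCII byte
]

# Every byte value that falls inside one of the ranges gets deleted.
DELETE = bytes(b for lo, hi in TR_RANGES for b in range(lo, hi + 1))


def strip_like_tr(data: bytes) -> bytes:
    """Delete the same bytes the Perl tr///d expression deletes."""
    return data.translate(None, DELETE)


if __name__ == "__main__":
    print(strip_like_tr(b"Hello, (world)!\t#1 {x}\n"))  # b'Hello world1 x\n'
    print(strip_like_tr("café".encode("utf-8")))        # b'caf'
```

What survives is LF (\012), CR (\015), space (\040), and the range \056-\172, i.e. "." through "z". So the expression removes non-ASCII bytes, but it also removes a good deal of ASCII punctuation.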

Escaping vs. charset_table in sphinx

Do I need to include special characters in the conf charset_table if I "manually" escape them in my code (Python)? I haven't included them and it's working fine :-/
They do slightly different things. charset_table influences how the input text itself is tokenized and indexed as words (as well as how the query itself is tokenized).
So if you want these special characters to be taken as separators between words, then leave them out of charset_table and escape them in the query[1]. (This seems to be what you have.)
But if you want these characters to be taken as word characters, included as part of words, then they should be included in charset_table and still escaped[1].
[1] (Well, they only need escaping if they can be mistaken for search query syntax.)
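For the escaping half, a minimal Python sketch might look like the following. The helper name escape_sphinx_query is hypothetical, and the set of characters treated as query syntax is an assumption for illustration; check it against the query syntax of your Sphinx version, or use the API's own EscapeString where available.

```python
import re

# ASSUMPTION: characters that may be read as extended query syntax.
# Adjust this class to match the Sphinx version you are running.
_QUERY_SYNTAX = re.compile(r'([\\()|\-!@~"&/^$=<])')


def escape_sphinx_query(text: str) -> str:
    """Prefix each assumed query-syntax character with a backslash."""
    return _QUERY_SYNTAX.sub(r"\\\1", text)


if __name__ == "__main__":
    print(escape_sphinx_query("(Test)"))  # \(Test\)
    print(escape_sphinx_query("#Test"))   # #Test
```

Escaping only stops the query parser from treating the character as an operator. Whether the character then becomes part of a word or acts as a separator is still decided by charset_table.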

Allowed characters in CSS 'content' property?

I've read that we must use Unicode values inside the content CSS property i.e. \ followed by the special character's hexadecimal number.
But what characters, other than alphanumerics, are actually allowed to be placed as is in the value of content property? (Google has no clue, hence the question.)
The rules for “escaping” characters are in the CSS 2.1 specification, clause 4.1.3 Characters and case. The special rules for quoted strings, as in content property value, are in clause 4.3.7 Strings. Within a quoted string, any character may appear as such, except for the character used to quote the string (" or '), a newline character, or a backslash character \.
The information that you must use \ escapes is thus wrong. You may use them, and may even need to use them if the character encoding of the document containing the style sheet does not let you enter all characters directly. But if the encoding is UTF-8, and is properly declared, then you can write content: '☺ Я Ω ⁴ ®'.
As far as I know, you can insert any Unicode character. (Here's a useful list of Unicode characters and their codes.)
To utilize these codes, you must escape them, like so:
U+27BA becomes \27BA
Or, alternatively, I think you may just be able to escape the character itself:
content: '\➺';
Source: http://mathiasbynens.be/notes/css-escapes
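If you do need escapes (for example because the style sheet's encoding cannot hold the character), generating them is mechanical. Below is a small Python sketch; the function names are made up, and it pads every escape to six hex digits so that no terminating space is needed after it, which is one of the forms CSS 2.1 clause 4.1.3 allows.

```python
def css_escape_char(ch: str) -> str:
    """Return a six-digit CSS hex escape for one character."""
    return "\\{:06X}".format(ord(ch))


def css_string(text: str, quote: str = "'") -> str:
    """Quote text for use as a CSS string, escaping only what must be escaped.

    The quote character, the backslash, control characters (including
    newline) and anything outside printable ASCII are written as escapes.
    """
    out = []
    for ch in text:
        if ch == quote or ch == "\\" or not (0x20 <= ord(ch) <= 0x7E):
            out.append(css_escape_char(ch))
        else:
            out.append(ch)
    return quote + "".join(out) + quote


if __name__ == "__main__":
    print("content: " + css_string("\u27ba") + ";")  # content: '\0027BA';
```

With a properly declared UTF-8 style sheet, only the quote character, the backslash, and newlines would need this treatment; escaping everything non-ASCII is just the conservative choice made in this sketch.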

how to treat * as string and not as wildcard in lucene.net

I have a term test* and I want to treat it as a string and not as a wildcard of test.
How can I do that in Lucene.Net? Any help?
Yes, you can use a backslash to escape special characters, both in the QueryParser and in custom-built searches. The list of characters that require escaping can be found here.
If you're using the newer versions of Lucene.Net, you can use QueryParser.Escape("test*") to escape your search term. QueryParser.Escape() takes a string and returns the string after properly escaping all characters that are special for Lucene.
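To illustrate what that escaping amounts to, here is a rough sketch in Python rather than C#. The character set is an assumption based on the special characters commonly listed for the Lucene query parser and may differ between versions; escape_lucene is a made-up name, and in real code you would call QueryParser.Escape() instead.

```python
# ASSUMPTION: characters the query parser treats as syntax.
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(term: str) -> str:
    """Put a backslash in front of every assumed special character."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in term)


if __name__ == "__main__":
    print(escape_lucene("test*"))  # test\*
```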