Indexing Euro (€) and Pound (£) in Sphinx

These don't seem to index, even when I explicitly add them to my charset_table:
charset_table=... U+20AC->U+20AC, U+00A3->U+00A3
I even tried mapping them to the dollar sign:
U+0024->U+0024, U+20AC->U+0024, U+00A3->U+0024
Yet in each case they are unrecognized; in other words, MATCH('£1000') will not find 'cost is £1000', and if I map them to $ as in the second example, MATCH('$1000') will not find it either.
If I do a MySQL search, however, with WHERE field LIKE '%£%', I do get records, which leads me to believe MySQL is encoding UTF-8 correctly. In other words, the pound and euro characters are being stored correctly in MySQL, but the Sphinx index is not recognizing them, even after I explicitly add their Unicode code points to my charset_table.
Relevant portion of config:
min_stemming_len = 1
stopword_step = 0
html_strip = 1
min_word_len = 1
min_infix_len = 0
index_zones = title,description
charset_type = utf8mb4_unicode_ci
charset_table = 0..9, A..Z->a..z, _, a..z, U+0026->U+0026, U+0027->U+0027, U+002E->U+002E, U+002D->U+002D, U+2014->U+002D#, U+2019->U+0027, U+0024->U+0024, U+20AC->U+0024, U+00A3->U+0024
Confirmed that the table/column is using utf8mb4_unicode_ci
Confirmed I can do a mysql search on Euro: Where Title like '%€%'
Confirmed I cannot find same record with SphinxQL: where MATCH('€')

There are three things you should check:
First, look at This Question to check your MySQL character encoding;
Secondly, look in your Sphinx config to check that charset_type matches it.
Lastly, remember that after any changes to charset_type or charset_table you need to rebuild your indexes.
If none of the above helps, you could post your Sphinx config here, which might give further clues as to the problem.
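For reference, here is a minimal sketch of what the relevant settings and the rebuild step might look like (assuming Sphinx 2.x; the index name my_index is hypothetical). Note that Sphinx's charset_type only accepts sbcs or utf-8, not a MySQL collation name such as utf8mb4_unicode_ci:
index my_index
{
    # Sphinx wants 'utf-8' here; the MySQL collation stays on the MySQL side
    charset_type  = utf-8
    # keep the euro and pound signs as indexable characters
    charset_table = 0..9, A..Z->a..z, _, a..z, U+20AC->U+20AC, U+00A3->U+00A3
}
# then rebuild so the new settings take effect
indexer --all --rotate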

Related

Issue with string comparison contains "_" in Postgres

Issue with comparison in Postgres (Version 11) while comparing _ sign.
I have a string (shown below). I want to compare it with a word and check whether the word _WIN_ exists in the string. If yes, it should give true.
But when I search like this, '%_WIN_%', it returns TRUE even though the searched string doesn't actually contain this word.
Can anyone please suggest what I'm doing wrong?
select 'New_vit_Vitamin_D_IND-tonline_WINTERSEAS_2020.02.09' ILIKE '%_WIN_%'
Note: the expected result should be FALSE, but it is giving TRUE
The underscore is the wildcard for a single character in SQL. If you want to search for the character itself you need to escape it:
ILIKE '%\_WIN\_%' ESCAPE '\'
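With the escape in place, the original example returns the expected result:
select 'New_vit_Vitamin_D_IND-tonline_WINTERSEAS_2020.02.09' ILIKE '%\_WIN\_%' ESCAPE '\';
-- false, because the literal substring _WIN_ does not occur in the value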

Wildcard searching between words with CRC mode in Sphinx

I use Sphinx with CRC mode and min_infix_len = 1, and I want to use wildcard searching between the characters of a keyword. Assume I have some data like this in my index:
name
-------
mickel
mick
mickol
mickil
micknil
nickol
nickal
and when I search for all records whose names start with 'mick' and end with 'l':
select * from all where match ('mick*l')
I expect the results should be like this:
name
-------
mickel
mickol
mickil
micknil
but nothing is returned. How can I do that?
I know that I can do this in dict=keywords mode, but I have to use CRC mode for certain reasons.
I also tried the '^' and '$' operators, and they didn't work.
You can't use 'middle' wildcards with CRC. That is one of the reasons for dict=keywords; the wildcards it can support are much more flexible.
With CRC, it 'precomputes' all the wildcard combinations and injects them as separate keywords into the index. For example, for mickel as a document word, and with min_prefix_len=1, indexer will create the words:
mickel
mickel*
micke*
mick*
mic*
mi*
m*
... as words in the index, so all the combinations can match. If using min_infix_len, it also has to do all the combinations at the start as well (so roughly (word_length)^2 + 1 combinations).
... if it had to precompute all the combinations for wildcards in the middle, it would be a lot more again, particularly if it then allowed middle AND start/end combinations as well.
Having said that, you can rewrite
select * from all where match ('mick*l')
as
select * from all where match ('mick* *l')
because with min_infix_len, the start and end will be indexed as separate words. You just need to insist that both match (although I can't think of a way to make them both match the same word!).
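For comparison, here is a sketch of the dict=keywords setup that supports the middle wildcard directly, in case your constraints ever allow switching away from CRC:
dict          = keywords
min_infix_len = 1
# with dict=keywords, MATCH('mick*l') works as written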

PCRE Regex - How to return matches with multiline string looking for multiple strings in any order

I need to use Perl-compatible regex to match several strings which appear over multiple lines in a file.
The matches need to appear in any order (server servernameA.company.com followed by servernameZ.company.com followed by servernameD.company.com, or any other ordering of the three). Note: All matches will appear at the beginning of each line.
In my testing with grep -P, I haven't even been able to produce a match on simple string terms that appear in any order over new lines (even when using the /s and /m modifiers). I am pretty sure from reading I need a look-ahead assertion but the samples I used didn't produce a match for me even after analyzing each bit of the regex to make sure it was relevant to my scenario.
Since I need to support this in Production, I would like an answer that is simple and relatively straight-forward to interpret.
Sample Input
irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true
Expectation
The regex should match and return the 3 lines beginning with server (which may appear in any order) if and only if there is a perfect match for the strings 'serverA.company.com', 'serverZ.company.com', and 'serverD.company.com', each followed by iburst. All 3 strings must be included.
Finally, if the answer (or a very similar form of the answer) can address checking for strings in any order on a single line, that would be very helpful. For example, if I have a single-line string of: preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms and I want to ensure the terms deny=5 and time=20ms appear in any order and if so match.
Thank you in advance for your assistance.
Regarding the main issue [for the secondary question, see Casimir et Hippolyte's answer] (using the x modifier): https://regex101.com/r/mkxcap/5
(?:
(?<a>.*serverA\.company\.com\s+iburst.*)
|(?<z>.*serverZ\.company\.com\s+iburst.*)
|(?<d>.*serverD\.company\.com\s+iburst.*)
|[^\n]*(?:\n|$)
)++
(?(a)(?(z)(?(d)(*ACCEPT))))(*SKIP)(*F)
The matches are now all in the a, z and d capturing groups.
It's not the most efficient (it goes over each line three times with backtracking...), but the main takeaway is to register the matches with capturing groups and then check whether they are defined.
You don't need to use PCRE features; you can simply write it in ERE:
grep -E '.*(\bdeny=5\b.*\btime=20ms\b|\btime=20ms\b.*\bdeny=5\b).*' file
The PCRE approach is different (although you can also use the previous pattern):
grep -P '^(?=.*\bdeny=5\b).*\btime=20ms\b.*' file
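If you later need more than two required terms in any order, one lookahead per term avoids enumerating every permutation; a sketch along the same lines:
grep -P '^(?=.*\bdeny=5\b)(?=.*\btime=20ms\b)' file
Each additional required term just adds another (?=.*\bterm\b) group.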

Sphinx not returning results in some cases

I'm using extended2 match mode. Here is an example:
Keyword 'méz': 0 results
Keyword 'mézes': 4657 results
The charset is UTF-8, and min_word_len is 3.
Any suggestions?

Allowed characters in filename

Where can I find a list of allowed characters in filenames, depending on the operating system?
(e.g. on Linux, the character : is allowed in filenames, but not on Windows)
You should start with the Wikipedia Filename page. It has a decent-sized table (Comparison of filename limitations), listing the reserved characters for quite a lot of file systems.
It also has a plethora of other information about each file system, including reserved file names such as CON under MS-DOS. I mention that only because I was bitten by that once when I shortened an include file from const.h to con.h and spent half an hour figuring out why the compiler hung.
Turns out DOS ignored extensions for devices so that con.h was exactly the same as con, the input console (meaning, of course, the compiler was waiting for me to type in the header file before it would continue).
OK, so looking at Comparison of file systems, if you only care about the main players' file systems:
Windows (FAT32, NTFS): Any Unicode except NUL, \, /, :, *, ?, ", <, >, |. Also, no space character at the start or end, and no period at the end.
Mac(HFS, HFS+): Any valid Unicode except : or /
Linux(ext[2-4]): Any byte except NUL or /
So, to be safe on all of them: any byte except NUL, \, /, :, *, ?, ", <, >, |; you can't have files/folders called . or ..; and no control characters (of course).
On Windows, try to create a file and give it an invalid character like \ in the filename. You will get a popup listing all the invalid characters.
To be more precise about Mac OS X (now called macOS): / in the Finder is translated to : in the Unix file system.
This was done for backward compatibility when Apple moved from Classic Mac OS.
It is legitimate to use a / in a file name in the Finder; looking at the same file in the terminal, it will show up with a :.
And it works the other way around too: you can't use a / in a file name with the terminal, but a : is OK and will show up as a / in the Finder.
Some applications may be more restrictive and prohibit both characters, either to avoid confusion, because they kept logic from the previous Classic Mac OS, or for name compatibility between platforms.
Rather than trying to identify all the characters that are unwanted, you could just look for anything except the acceptable characters and strip everything else. Here's a regex that removes anything except alphanumerics, dot, underscore and hyphen (Python's re module doesn't support POSIX classes like [:alnum:], so the class is spelled out):
import re
cleaned_name = re.sub(r'[^A-Za-z0-9._-]', '', name)
For "English locale" file names, this works nicely. I'm using this for sanitizing uploaded file names. The file name is not meant to be linked to anything on disk, it's for when the file is being downloaded hence there are no path checks.
$file_name = preg_replace('/([^\x20-~]+)|([\\/:?"<>|]+)/g', '_', $client_specified_file_name);
Basically it strips all non-printable and reserved characters for Windows and other OSs. You can easily extend the pattern to support other locales and functionalities.
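A quick sanity check with a made-up name (spaces are printable, so they are kept):
echo preg_replace('/[^\x20-~]+|[\\\\\/:*?"<>|]+/', '_', 'my file: draft?.pdf');
// prints: my file_ draft_.pdf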
I took a different approach. Instead of checking whether the string contains only valid characters, I look for invalid/illegal characters instead.
NOTE: I needed to validate a path string, not a filename. But if you need to check a filename, simply add / to the set.
def check_path_validity(path: str) -> bool:
    # Check for invalid characters
    for char in set('\\?%*:|"<>'):
        if char in path:
            print(f"Illegal character {char} found in path")
            return False
    return True
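A quick usage check (the paths are made up):
print(check_path_validity('reports/2020/summary.txt'))  # True
print(check_path_validity('reports/draft?.txt'))        # prints the warning, then False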
Here is the code to clean a file name in Python.
import re
import unicodedata

def clean_name(name, replace_space_with=None):
    """
    Remove invalid file name chars from the specified name
    :param name: the file name
    :param replace_space_with: if not None, replace spaces with this string
    :return: a valid name for Win/Mac/Linux
    """
    # ref: https://en.wikipedia.org/wiki/Filename
    # ref: https://stackoverflow.com/questions/4814040/allowed-characters-in-filename
    # No control chars, no: /, \, ?, %, *, :, |, ", <, >
    # remove control chars
    name = ''.join(ch for ch in name if unicodedata.category(ch)[0] != 'C')
    # remove the reserved punctuation characters
    cleaned_name = re.sub(r'[/\\?%*:|"<>]', '', name)
    if replace_space_with is not None:
        return cleaned_name.replace(' ', replace_space_with)
    return cleaned_name
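Example usage (the file names are made up):
print(clean_name('my:report*2020?.txt'))                            # myreport2020.txt
print(clean_name('quarterly report.txt', replace_space_with='_'))   # quarterly_report.txt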