Sphinx search: bug in handling multiple blend_chars in a single term?

I'm using Sphinx 2.2.11 and believe I've found a bug regarding how Sphinx indexes terms that contain more than one instance of a blend character.
For example, I have the hyphen and period set as blend_chars:
blend_chars = ., -
Let's say I have a term in the database as follows:
part1-part2.part3
I would expect that Sphinx would index this term in all possible combinations for each blend_char. For example:
Variant 1: part1-part2.part3
Variant 2: part1 part2.part3
Variant 3: part1-part2 part3
Variant 4: part1 part2 part3
However, that doesn't seem to be the case.
If I search for:
part2.part3
I don't find the record containing the term part1-part2.part3.
However, if I search for:
part2 part3
OR
part1 part2 part3
I do find the record.
This suggests to me that Sphinx does not index all possible combinations of the blend_chars. Instead, it appears to index just two versions:
part1-part2.part3 (with blend_chars intact)
part1 part2 part3 (with blend_chars ignored, treated as whitespace)
If true, I would consider this a bug, as it tends to break searches that use just one of the blend_chars.
Can anyone confirm that they are seeing the same behavior? And can anyone suggest tips on how to fix or work around it?
Thanks very much!

When you have blend_chars = ., - and you search for part2.part3 or part1-part2, Sphinx keeps those as single tokens; it does not expand them into part2 AND part3 or part1 AND part2.
BUT when you index part1-part2.part3 it generates 4 tokens: part1-part2.part3, part1, part2 and part3. That's why you can't find the record with either part1-part2 or part2.part3.
The solution is not to use blended characters in your query. If you want to automate this, you can use CALL KEYWORDS to see how a term is tokenized during indexing before issuing your search query, and then use the results to rewrite the query, e.g.:
mysql> call keywords('part1-part2.part3', 'blend');
+------+-------------------+-------------------+
| qpos | tokenized         | normalized        |
+------+-------------------+-------------------+
| 1    | part1-part2.part3 | part1-part2.part3 |
| 1    | part1             | part1             |
| 2    | part2             | part2             |
| 3    | part3             | part3             |
+------+-------------------+-------------------+
4 rows in set (0.00 sec)
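Once CALL KEYWORDS shows which plain tokens were indexed, the blended characters can simply be replaced with spaces in the query itself. A minimal sketch (the index name myindex is my assumption, not part of the original question):
mysql> SELECT * FROM myindex WHERE MATCH('part2 part3');
This finds the record, because part2 and part3 were indexed as separate tokens.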

Related

How to remove multiple characters between 2 special characters in a column in SSIS expression

I want to remove the characters starting from '#' through the ';' in a derived column expression in SSIS.
For example, my input column values contain fragments like '#some comment;', and I want the output with those fragments removed.
Note: the length after '#' is not fixed.
I have already done this in SQL, but I want to do it via an SSIS derived column expression.
First of all: please do not post pictures. We prefer copy-and-pastable sample data. And please try to provide a minimal, complete and reproducible example, best served as DDL, INSERT and code, as I do here for you.
And just to mention this: if you control the input, you should not mix information within one string... If this is needed, try to use a "real" text container like XML or JSON.
SQL Server is not meant for string manipulation. There is no RegEx or repeated/nested pattern matching, so we would have to use a recursive / procedural / looping approach. But - if performance is not too important - you might use an XML hack.
--DDL and INSERT (note: a table variable must be named with @, not #)
DECLARE @tbl TABLE(ID INT IDENTITY, YourString VARCHAR(1000));
INSERT INTO @tbl VALUES('Here is one without')
,('One#some comment;in here')
,('Two comments#some comment;in here#here is the second;and some more text');
--The query: wrap the unwanted '#...;' fragments in XML comments, then read the XML back without them
SELECT t.ID
,t.YourString
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML) SeeTheIntermediateXML
,CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)') CleanedValue
FROM @tbl t;
The result
+----+--------------------------------------------------------------------------+------------------------------------------+
| ID | YourString                                                               | CleanedValue                             |
+----+--------------------------------------------------------------------------+------------------------------------------+
| 1  | Here is one without                                                      | Here is one without                      |
+----+--------------------------------------------------------------------------+------------------------------------------+
| 2  | One#some comment;in here                                                 | One in here                              |
+----+--------------------------------------------------------------------------+------------------------------------------+
| 3  | Two comments#some comment;in here#here is the second;and some more text  | Two comments in here and some more text  |
+----+--------------------------------------------------------------------------+------------------------------------------+
The idea in short:
Using some string methods we can wrap your unwanted text in XML comments.
Look at this
Two comments<!--some comment--> in here<!--here is the second--> and some more text
Reading this XML with .value() the content will be returned without the comments.
Hint 1: Use '-->;' in your replacement to keep the semi-colon as delimiter.
Hint 2: If there might be a semi-colon ; somewhere else in your string, you would see the --> in the result. In this case you'd need a third REPLACE() against the resulting string.
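A sketch of that third REPLACE(), chained onto the query above (my own addition; it assumes you are not also using Hint 1's '-->;' delimiter):
SELECT REPLACE(
        CAST(REPLACE(REPLACE((SELECT t.YourString AS [*] FOR XML PATH('')),'#','<!--'),';','--> ') AS XML).value('.','nvarchar(max)')
       ,'--> ','') CleanedValue
FROM @tbl t;
--The outer REPLACE() removes any '--> ' left behind by a semi-colon that had no matching '#'.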

Postgres Full-Text Search with Hyphen and Numerals

I have observed what seems to me an odd behavior of Postgres' to_tsvector function.
SELECT to_tsvector('english', 'abc-xyz');
returns
'abc':2 'abc-xyz':1 'xyz':3
However,
SELECT to_tsvector('english', 'abc-001');
returns
'-001':2 'abc':1
Why not something like this?
'abc':2 'abc-001':1 '001':3
And what should I do to be able to search by the numeric portion alone, without the hyphen?
It seems the text search parser identifies a hyphen followed by digits as the sign of a signed integer. Debug with ts_debug():
SELECT * FROM ts_debug('english', 'abc-001');
   alias   |   description   | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+--------------+------------+---------
 asciiword | Word, all ASCII | abc   | {simple}     | simple     | {abc}
 int       | Signed integer  | -001  | {simple}     | simple     | {-001}
Other text search configurations (like 'simple' instead of 'english') won't help, as the parser itself is "at fault" here (debatably).
A simple way around it (other than modifying the parser, which I have never tried) would be to pre-process strings and replace hyphens with an em dash (—) or just blanks, to make sure those are identified as "Space symbols". (Actual signed integers lose their negative sign in the process.)
SELECT to_tsvector('english', translate('abc-001', '-', '—'))
    @@ to_tsquery('english', '001'); -- true now
This can be circumvented with the absval option of the dict_int add-on in PG 13. See the official documentation.
But in case you're stuck with an earlier PG version, here's the generalized version of a "number or negative number" workaround in a query.
select regexp_replace($$'test' & '1':* & '2'$$::tsquery::text,
                      '''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;
This results in:
'test' & ( '1':* | '-1':* ) & ( '2' | '-2' )
It replaces lexemes that look like positive numbers with "number or negative number" kind of subqueries.
The double cast ::tsquery::text is just there to show how you would pass a tsquery casted to text.
Note that it handles prefix matching numeric lexemes as well.
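A hypothetical end-to-end use (the table docs and column tsv are my assumptions, not from the original answer): rewrite the user's tsquery so bare numbers also match their signed forms in the index:
select *
from docs
where tsv @@ regexp_replace(to_tsquery('english', 'abc & 001')::text,
                            '''([.\d]+''(:\*)?)', '(''\1 | ''-\1)', 'g')::tsquery;
Here 'abc & 001' becomes 'abc' & ( '001' | '-001' ), so rows indexed with the default parser's '-001' lexeme are found as well.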

PostgreSQL tuple format

Is there any document describing the tuple format that the PostgreSQL server adheres to? The official documentation appears arcane on this point.
A single tuple seems simple enough to figure out, but when it comes to arrays of tuples, arrays of composite tuples, and finally nested arrays of composite tuples, it is impossible to be certain about the format simply by looking at the output.
I am asking this following my initial attempt at implementing pg-tuple, a parser that's still missing today, to be able to parse PostgreSQL tuples within Node.js
Examples
create type type_A as (
a int,
b text
);
with a simple text: (1,hello)
with a complex text: (1,"hello world!")
create type type_B as (
c type_A,
d type_A[]
);
simple-value array: {"(2,two)","(3,three)"}
for type_B[] we can get:
{"(\"(7,inner)\",\"{\"\"(88,eight-1)\"\",\"\"(99,nine-2)\"\"}\")","(\"(77,inner)\",\"{\"\"(888,eight-3)\"\",\"\"(999,nine-4)\"\"}\")"}
It gets even more complex for multi-dimensional arrays of composite types.
UPDATE
Since it feels like there is no specification at all, I have started working on reverse-engineering it. I am not sure it can be done fully though, because some initial examples make it unclear which formatting rules are applied.
As Nick posted, according to docs:
the whitespace will be ignored if the field type is integer, but not
if it is text.
and
The composite output routine will put double quotes around field
values if they are empty strings or contain parentheses, commas,
double quotes, backslashes, or white space.
and
Double quotes and backslashes embedded in field values will be
doubled.
and now quoting Nick himself:
nested elements are converted to strings, and then quoted / escaped
like any other string
I give a shortened example below, which can be conveniently compared against its nested value:
a=# create table playground (t text, ta text[],f float,fa float[]);
CREATE TABLE
a=# insert into playground select 'space here',array['','bs\'],8.0,array[null,8.1];
INSERT 0 1
a=# insert into playground select 'no_space',array[null,'nospace'],9.0,array[9.1,8.0];
INSERT 0 1
a=# select playground,* from playground;
                    playground                     |     t      |       ta       | f |     fa
---------------------------------------------------+------------+----------------+---+------------
 ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"}    | 8 | {NULL,8.1}
 (no_space,"{NULL,nospace}",9,"{9.1,8}")           | no_space   | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
If you go for deeper nested quoting, look at:
a=# select nested,* from (select playground,* from playground) nested;
                                                         nested                                                          |                    playground                     |     t      |       ta       | f |     fa
-------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+------------+----------------+---+------------
 ("(""space here"",""{"""""""",""""bs\\\\\\\\""""}"",8,""{NULL,8.1}"")","space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"}    | 8 | {NULL,8.1}
 ("(no_space,""{NULL,nospace}"",9,""{9.1,8}"")",no_space,"{NULL,nospace}",9,"{9.1,8}")                                   | (no_space,"{NULL,nospace}",9,"{9.1,8}")           | no_space   | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
As you can see, the output again follows the rules above.
With that, short answers to your questions would be:
Why is an array normally presented inside double quotes, while an empty array is suddenly an unquoted value? (Because the text representation of an empty array contains no comma, space or other character that requires quoting; see the demo after this list.)
Why is a single " suddenly presented as \""? (The text representation of 'one\ two', according to the rules above, is "one\\ two"; nested one level deeper it becomes ""one\\\\two"", since each level doubles the quotes and backslashes, and that is just what you get.)
Why does unicode-formatted text change the escaping for \? How can we tell the difference then? (According to the docs:
PostgreSQL also accepts "escape" string constants, which are an
extension to the SQL standard. An escape string constant is specified
by writing the letter E (upper or lower case) just before the opening
single quote
So it is not unicode text, but rather the way you tell Postgres to interpret escapes in the text as escapes instead of literal characters. E.g. E'\'' will be interpreted as ', while '\'' will make the parser wait for a closing '. In your example E'\\ text', its text representation is "\\ text": we add a backslash for the backslash and take the value in double quotes, all as described in the docs.)
The way that { and } are escaped is not always clear. (I could not answer this one, because the question itself was not clear.)
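A minimal demo of the first point, runnable in psql (my own illustration, not part of the original discussion):
a=# select row('{1,2}'::int[])::text as with_comma, row('{}'::int[])::text as empty;
 with_comma | empty
------------+-------
 ("{1,2}")  | ({})
(1 row)
The array text {1,2} contains a comma, so the composite output routine quotes the field; the empty array's text {} contains none of the triggering characters and stays unquoted.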

Search text between symbol

I have this text (taken from concatenated field row)
Astronomic Event 2013/1434H - Aceh ....
How do we search it by the 2013 or 1434h keywords?
I have tried the code below, but it returns no rows.
to_tsvector result:
'2013/1434h':8,12 'aceh':1 'bin.....
Sample Case:
WITH sample_table as
(SELECT to_tsvector('Astronomic Event 2013/1434H - Aceh') sample_content)
SELECT *
FROM sample_table, to_tsquery('2013') q
WHERE sample_content @@ q
How do We search it by 2013 or 1434h keywords?
It seems like you want to replace:
to_tsquery('1434h') q
with:
to_tsquery('1434h | 2013') q
http://www.postgresql.org/docs/current/static/functions-textsearch.html
Side note: the to_tsquery() syntax is extremely capricious. It doesn't allow much, if any, leeway, and many of the assumptions baked into Postgres are anything but end-user friendly.
More often than not, you'll be better off using plainto_tsquery(), which allows any amount of garbage to be thrown at it. Thus, consider pre-processing the string before issuing the query. For instance, you could split the string and OR the original parts together, as sketched below:
where sc.text_index @@ (plainto_tsquery('1434h') || plainto_tsquery('2013'))
Doing so will make your code a bit more complex, but it won't rely on your users needing to understand that (contrary to what they're accustomed to in Google) they should enter 'quick & brown & fox & jumps & lazy & dog' instead of plain 'The quick brown fox jumps over the lazy dog'.
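A sketch of that split-and-OR pre-processing in SQL (splitting on '/' is my assumption for this particular input):
select string_agg(plainto_tsquery('english', part)::text, ' | ')::tsquery
from unnest(string_to_array('2013/1434H', '/')) as part;
This yields '2013' | '1434h', which matches the document no matter which part the user typed.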
Edit: I ended up actually trying your sample query, and it seems you're actually running into a parser issue:
# SELECT alias, description, token FROM ts_debug('Astronomic Event 2013/1434H - Aceh');
   alias   |    description    |   token
-----------+-------------------+------------
 asciiword | Word, all ASCII   | Astronomic
 blank     | Space symbols     |
 asciiword | Word, all ASCII   | Event
 blank     | Space symbols     |
 file      | File or path name | 2013/1434H
 blank     | Space symbols     |
 blank     | Space symbols     | -
 asciiword | Word, all ASCII   | Aceh
(8 rows)
http://www.postgresql.org/docs/current/static/textsearch-parsers.html
It looks like you might need to write (or find) and configure an app-specific parser. Having never done this personally, the best I can do is to highlight that Postgres allows this and includes a sample:
http://www.postgresql.org/docs/current/static/test-parser.html
Alternatively, change your tsvector-related trigger so that it matches e.g. \d{4}/\d+[a-zA-Z] or whatever seems most appropriate, and adds spaces accordingly, before converting it to a tsvector. Something as simple as the following might do the trick if you never need to store file names:
SELECT alias, description, token
FROM ts_debug(replace('Astronomic Event 2013/1434H - Aceh', '/', ' / '));
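A regexp-based version of the same idea, using the \d{4}/\d+[a-zA-Z] pattern mentioned above (a sketch; adjust the pattern to your data):
SELECT to_tsvector('english',
       regexp_replace('Astronomic Event 2013/1434H - Aceh',
                      '(\d{4})/(\d+[a-zA-Z]?)', '\1 / \2', 'g'));
This indexes 2013 and 1434h as separate lexemes while leaving other slash-separated strings alone.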

Easy to remember fingerprints for data?

I need to create fingerprints for RSA keys that users can memorize or at least easily recognize. The following ideas have come to mind:
Break the SHA1 hash into portions of, say, 4 bits and use them as coordinates for Bézier splines. Draw the splines and use the resulting picture as a fingerprint.
Use the SHA1 hash as input for some fractal algorithm. The result would need to be unique for a given input, i.e. the output can't be a solid square half the time.
Map the SHA1 hash to entries in a word list (as used in spell checkers or password lists). This would create a passphrase consisting of real words.
Instead of a word list, use some other large data set like Google maps (map the SHA1 hash to map coordinates and use the map region(s) as a fingerprint)
Any other ideas? I'm sure this has been implemented in one form or another.
OpenSSH contains something like that, under the name "visual host key". Try this:
ssh -o VisualHostKey=yes somesshhost
where somesshhost is some machine with a SSH server running. It will print out a "fingerprint" of the server key, both in hexadecimal, and as an ASCII-art image which may look like this:
+--[ RSA 2048]----+
| .+ |
| + o |
| o o + |
| + o + |
| . o E S |
| + * . |
| X o . |
| . * o |
| .o . |
+-----------------+
Or like this:
+--[ RSA 1024]----+
| .*BB+ |
| . .++o |
| = oo. |
| . =o+.. |
| So+.. |
| ..E. |
| |
| |
| |
+-----------------+
Apparently, this is inspired by techniques described in this article. OpenSSH is open source, with a BSD-like license, so chances are that you could simply reuse their code (it seems to be in the key.c file, function key_fingerprint_randomart()).
For item 3 (entries in a word list), see RFC-1751 - A Convention for Human-Readable 128-bit Keys, which notes that
The authors of S/Key devised a system to make the 64-bit one-time
password easy for people to enter.
Their idea was to transform the password into a string of small
English words. English words are significantly easier for people to
both remember and type. The authors of S/Key started with a
dictionary of 2048 English words, ranging in length from one to four
characters. The space covered by a 64-bit key (2^64) could be covered
by six words from this dictionary (2^66) with room remaining for
parity. For example, an S/Key one-time password of hex value:
EB33 F77E E73D 4053
would become the following six English words:
TIDE ITCH SLOW REIN RULE MOT
You could also use a compound fingerprint to improve memorability, like English words followed (or preceded) by one or more key-dependent images.
For generating the image, you could use things like Identicon, Wavatar, MonsterID, or RoboHash.
Example:
TIDE ITCH SLOW
REIN RULE MOT
I found something called random art which generates an image from a hash. There is a Python implementation available for download: http://www.random-art.org/about/
There is also a paper about using random art for authentication: http://sparrow.ece.cmu.edu/~adrian/projects/validation/validation.pdf
It's from 1999; I don't know if further research has been done on this.
Your first suggestion (draw the path of splines for every four bytes, then fill using the nonzero fill rule) is exactly what I use for visualization in hashblot.