SphinxQL match equivalent of MySQL LIKE %searchterm% - sphinx

In my MySQL database I have this result querying my data:
mysql> select count(*) from emails where email like '%johndoe%';
+----------+
| count(*) |
+----------+
|      102 |
+----------+
1 row in set (15.50 sec)
My data is indexed under Sphinx (Manticore Search actually) with min_word_len = 1. Now, when I search with SphinxQL I get only partial results:
mysql> SELECT count(*) FROM search1 WHERE MATCH('@email johndoe') LIMIT 1000 OPTION max_matches=1000;
+----------+
| count(*) |
+----------+
|       16 |
+----------+
1 row in set (0.00 sec)
Any idea how to match the results MySQL gives me? I tried SPH_MATCH_ANY and SPH_MATCH_EXTENDED with the sphinxapi, with the same results.

I suspect it's mainly due to whole-word matching: Sphinx matches whole words.
With 'words' defined as per charset_table http://sphinxsearch.com/docs/current/conf-charset-table.html
i.e. MATCH('@email johndoe') only matches addresses containing johndoe as a whole word. The default charset_table keeps . - and @ (common in emails!) all as separators, so it would match johndoe@domain.com or email@johndoe.com, but NOT email@myjohndoe.com, because the word indexed there is myjohndoe, not johndoe.
MySQL's LIKE, on the other hand, will happily match partial words. E.g. email like '%johndoe%' would match johndoesmith@domain.com, johndoes555@domain.com, 555@johndoes.com or whatever. It's a pure substring match.
In short, you might want to tweak charset_table: if . - and @ were all word chars, each whole email address would be indexed as a single word.
Alternatively, you could just enable part-word matching with min_infix_len:
http://sphinxsearch.com/docs/current.html#conf-min-infix-len
Then you could do MATCH('@email *johndoe*'), which should get much closer results.
Complementary to min_infix_len is expand_keywords http://sphinxsearch.com/docs/current.html#conf-expand-keywords
With it the * wildcards are added automatically, so you could go back to MATCH('@email johndoe').
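For reference, a minimal sketch of how those options might look in the index definition; the index name and the exact values are assumptions to adapt to your setup (see the linked docs for the precise syntax):

```
index search1
{
    # ... source, path etc. as before ...

    # make . - and @ (U+2E, U+2D, U+40) word characters, so a whole
    # address like john@doe.com is indexed as one word
    charset_table = 0..9, A..Z->a..z, _, a..z, U+2E, U+2D, U+40

    # index substrings so that *johndoe* style searches work
    min_infix_len = 2

    # optionally add the * wildcards to bare keywords automatically
    expand_keywords = 1
}
```

Note that charset_table and min_infix_len changes require rebuilding the index before they take effect.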

Related

How to combine PostgreSQL text search with fuzzystrmatch

I'd like to be able to query words from a column of type ts_vector, but everything with a Levenshtein distance below X should be considered a match.
Something like this, where my_table is:
 id |       my_ts_vector_column      |   sentence_as_text
----+--------------------------------+----------------------
  1 | 'bananna':3 'tasty':2 'very':1 | Very tasty bananna
  2 | 'banaana':2 'yellow':1         | Yellow banaana
  3 | 'banana':2 'usual':1           | Usual banana
  4 | 'baaaanaaaanaaa':2 'black':1   | Black baaaanaaaanaaa
I want to query something like "Give me the ids of all rows which contain the word banana, or words similar to banana, where similar means the Levenshtein distance is less than 4". So the result should be 1, 2 and 3.
I know I can do something like select id from my_table where my_ts_vector_column @@ to_tsquery('banana');, but this would only get me exact matches.
I also know I could do something like select id from my_table where levenshtein(sentence_as_text, 'banana') < 4;, but this only works on a text column, and only if the sentence contains nothing but the word banana.
But I don't know if or how I could combine the two.
P.S. Table where I want to execute this on contains about 2 million records and the query should be blazing fast (less than 100ms for sure).
P.P.S - I have full control on the table's schema, so changing datatypes, creating new columns, etc would be totally feasible.
2 million short sentences presumably contain far fewer distinct words than that. But if all your sentences have "creative" spellings, maybe not.
So you can perhaps create a table of distinct words to search relatively quickly with the unindexed distance function:
create materialized view words as
select distinct unnest(string_to_array(lower(sentence_as_text),' ')) word from my_table;
And create an exact index into the larger table:
create index on my_table using gin (string_to_array(lower(sentence_as_text),' '));
And then join the two together:
select * from my_table join words
  ON (ARRAY[word] <@ string_to_array(lower(sentence_as_text),' '))
WHERE levenshtein(word,'banana')<4;
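For intuition about the distance<4 cutoff in that WHERE clause, here is the classic dynamic-programming Levenshtein distance in plain code; this is an illustrative stand-in for fuzzystrmatch's levenshtein(), not its actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    # standard DP: prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# the sample rows: 1-3 are within distance 4 of 'banana', row 4 is not
words = ["bananna", "banaana", "banana", "baaaanaaaanaaa"]
matches = [w for w in words if levenshtein(w, "banana") < 4]
```

This is why rows 1, 2 and 3 qualify while 'baaaanaaaanaaa' does not.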

How to optimize inverse pattern matching in Postgresql?

I have Pg version 13.
CREATE TABLE test_schemes (
    pattern   TEXT NOT NULL,
    some_code TEXT NOT NULL
);
Example data
  pattern   | some_code
------------+-----------
 __3_       | c1
 __34       | c2
 1_3_       | a12
 _7__       | a10
 7138       | a19
 _123|123_  | a20
 ___253     | a28
 253        | a29
 2_1        | a30
This table has about 300k rows. I want to optimize a simple query like
SELECT * FROM test_schemes where '1234' SIMILAR TO pattern
  pattern   | some_code
------------+-----------
 __3_       | c1
 __34       | c2
 1_3_       | a12
 _123|123_  | a20
The problem is that this simple query will do a full scan of 300k rows to find all the matches. Given this design, how can I make the query faster (any use of special index)?
Internally, SIMILAR TO works similarly to regexes, which would be evident from running EXPLAIN on the query. You may want to just switch to regexes outright, but it is also worth looking at text_pattern_ops indexes to see if you can improve the performance.
If the pipe is the only feature of SIMILAR TO (other than those present in LIKE) which you use, then you could process it into a form you can use with the much faster LIKE.
SELECT * FROM test_schemes where '1234' LIKE any(string_to_array(pattern,'|'))
In my hands this is about 25 times faster, and gives the same answer as your example on your example data (augmented with a few hundred thousand rows of garbage to get the table row count up to about where you indicated). It does assume there is no escaping of any pipes.
If you store the data already broken apart, it is about 3 times faster yet, but of course gives cosmetically different answers.
create table test_schemes2 as select unnest as pattern, some_code from test_schemes, unnest(string_to_array(pattern,'|'));
SELECT * FROM test_schemes2 where '1234' LIKE pattern;
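To see why the rewrite is equivalent for these patterns, here is a small illustrative sketch (Python, not what Postgres does internally) of splitting on the pipe and then applying LIKE semantics to each alternative:

```python
import re

def like_match(text: str, pattern: str) -> bool:
    # translate a SQL LIKE pattern ('_' = any one char, '%' = any run) to a regex
    rx = "".join(".*" if c == "%" else "." if c == "_" else re.escape(c)
                 for c in pattern)
    return re.fullmatch(rx, text) is not None

def similar_to_via_like(text: str, pattern: str) -> bool:
    # split on '|' as string_to_array(pattern, '|') does, then LIKE any(...)
    return any(like_match(text, alt) for alt in pattern.split("|"))

# the sample patterns from the question, matched against '1234'
patterns = ["__3_", "__34", "1_3_", "_7__", "7138", "_123|123_",
            "___253", "253", "2_1"]
hits = [p for p in patterns if similar_to_via_like("1234", p)]
```

This reproduces the four rows from the example result, and like the SQL rewrite it assumes no pipe is ever escaped inside a pattern.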

Postgres Full Text Search - Find Other Similar Documents

I am looking for a way to use Postgres (version 9.6+) Full Text Search to find other documents similar to an input document, essentially producing results similar to Elasticsearch's more_like_this query. As far as I can tell, Postgres offers no way to compare ts_vectors to each other.
I've tried various techniques like converting the source document back into a ts_query, or reprocessing the original doc but that requires too much overhead.
Would greatly appreciate any advice - thanks!
Looks like the only option is to use pg_trgm instead of the Postgres built in full text search. Here is how I ended up implementing this:
Using a simple table (or materialized view in this case) - it holds the primary key to the post and the full text body in two columns.
Materialized view "public.text_index"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
post | text | | | | extended | |
View definition:
SELECT posts.id,
posts.body AS text
FROM posts
ORDER BY posts.publication_date DESC;
Then using a lateral join we can match rows and order them by similarity to find posts that are "close to" or "related" to any other post:
select * from text_index tx
left join lateral (
  select similarity(tx.text, t.text) from text_index t where t.id = 12345
) s on true
order by similarity desc
limit 10;
This of course is a naive way to match documents and may require further tuning. Additionally, a trgm gin index on the text column will speed up the searches significantly.
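For intuition about what similarity() is measuring, here is a rough sketch of pg_trgm-style trigram similarity; this is a simplification for illustration, not the extension's exact algorithm:

```python
def trigrams(text: str) -> set:
    # roughly pg_trgm's scheme: lowercase each word, pad it with two
    # leading spaces and one trailing space, slide a 3-char window
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    # shared trigrams over total distinct trigrams, like pg_trgm's similarity()
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0
```

Two documents sharing many three-character fragments score close to 1.0; unrelated text scores near 0, which is what the lateral join above orders by.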

Postgresql tsvector not searching few strings

I am using PostgreSQL 11, created tsvector with gin index on column search_fields.
Data in table test
  id   |           name           |         search_fields
-------+--------------------------+--------------------------------
 19973 | Ongoing 10x consultation | '10x' 'Ongoing' 'consultation'
 19974 | 5x marketing             | '5x' 'marketing'
 19975 | Ongoing 15x consultation | '15x' 'Ongoing' 'consultation'
The default text search config is set as 'pg_catalog.english'.
Both queries below output 0 rows.
select id, name, search_fields from test where search_fields @@ to_tsquery('ongoing');
 id | name | search_fields
----+------+---------------
(0 rows)
select id, name, search_fields from test where search_fields @@ to_tsquery('simple','ongoing');
 id | name | search_fields
----+------+---------------
(0 rows)
But when I pass the string '10x' or 'consultation' it returns the correct output.
Any idea why it is not searching for the word 'ongoing'?
Afterwards, I created triggers using the function tsvector_update_trigger(), set default_text_search_config to 'pg_catalog.simple' in the postgresql.conf file, and repopulated search_fields. After that the output was:
select id, name, search_fields from test where search_fields @@ to_tsquery('ongoing');
  id   |           name           |            search_fields
-------+--------------------------+--------------------------------------
 19973 | Ongoing 10x consultation | '10x':2 'consultation':3 'ongoing':1
This time, when I ran the query passing the string 'ongoing', it output the expected result.
select id, name, search_fields from test where search_fields @@ to_tsquery('ongoing');
  id   |           name           |            search_fields
-------+--------------------------+--------------------------------------
 19973 | Ongoing 10x consultation | '10x':2 'consultation':3 'ongoing':1
 19975 | Ongoing 15x consultation | '15x':2 'consultation':3 'ongoing':1
As per the experiment above, setting up the trigger and changing default_text_search_config to 'pg_catalog.simple' achieved the result.
What I don't understand is why it didn't work with default_text_search_config set to 'pg_catalog.english'.
Is a trigger always required when a tsvector column is used?
Any help in understanding the difference between both would be appreciated.
Thanks,
Nishit
You don't describe how you created search_fields initially, but evidently it was not constructed correctly. Since we don't know what you did, we can't say what you did wrong. If you rebuild it correctly, it will start working. When you changed default_text_search_config to 'simple', you appear to have correctly repopulated search_fields, which is why it worked. If you change back to 'english' and correctly repopulate search_fields, it will also work.
You don't always need a trigger. A trigger is one way. Another way is to just manually update the tsvector column every time you update the text column. My usual favorite way is not to store the tsvector at all, and just derive it on the fly:
select id, name, search_fields from test where
  to_tsvector('english', name) @@ to_tsquery('english', 'ongoing');
If you want to do it this way, you need to specify the configuration, not rely on default_text_search_config, otherwise the expressional gin index will not be used. Also, this way is not a good idea if you want to use phrase searching, as the rechecking will be slow.
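As a toy illustration of the point (not Postgres internals): a tsvector only matches a tsquery when the stored lexemes and the query lexemes went through the same normalization.

```python
# stored lexemes as built incorrectly: raw tokens, never lowercased
bad_vector = {"Ongoing", "10x", "consultation"}
# the same field rebuilt with the 'simple' config (lowercase only)
good_vector = {"ongoing", "10x", "consultation"}

def normalize_simple(word: str) -> str:
    # the 'simple' config roughly: lowercase, no stemming (a simplification)
    return word.lower()

q = normalize_simple("Ongoing")
found_bad = q in bad_vector    # the capitalized stored lexeme never matches
found_good = q in good_vector  # the correctly built vector does
```

'10x' and 'consultation' happened to be stored in already-normalized form, which is why only 'Ongoing' misbehaved in the original table.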

Build a list of grouped values

I'm new to this site and this is the first time I've posted a question, so sorry for anything wrong. The question may be old, but I just can't find any answer for SQL Anywhere.
I have a table like
Order | Mark
======|========
1 | AA
2 | BB
1 | CC
2 | DD
1 | EE
I want to have a result like the following
Order | Mark
1 | AA,CC,EE
2 | BB,DD
My current SQL is
Select Order, Cast(Mark as NVARCHAR(20))
From #Order
Group by Order
and it just gives me a result identical to the original table.
Any idea for this?
You can use the ASA LIST() aggregate function (untested; you might need to enclose the Order column name in quotes, as it is also a reserved word):
SELECT Order, LIST( Mark )
FROM #Order
GROUP BY Order;
You can customize the separator character and the ordering if you need to.
Note: it is rather a bad idea to
name your tables and columns like regular SQL keywords (ORDER BY)
use the same name for a column and a table (Order)
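The grouping LIST() performs can be sketched outside the database; a quick illustration of the expected result on the sample rows (plain code, not SQL Anywhere semantics):

```python
from collections import defaultdict

# emulate SELECT "Order", LIST(Mark) ... GROUP BY "Order" on the sample data
rows = [(1, "AA"), (2, "BB"), (1, "CC"), (2, "DD"), (1, "EE")]

grouped = defaultdict(list)
for order, mark in rows:
    grouped[order].append(mark)

# LIST() joins the grouped values with a comma by default
result = {order: ",".join(marks) for order, marks in grouped.items()}
```

This yields 1 -> AA,CC,EE and 2 -> BB,DD, matching the desired output above.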