Why does Sphinx CALL KEYWORDS ignore stopwords?

I have an index with stopwords set up.
I execute CALL KEYWORDS ('word1 stopword word2', 'my_index') and expect to see only two tokens: word1, word2, but I receive three tokens: word1, stopword, word2.
Why is that? How can I make it work as expected?

Regex : findall with a repeated capture group

I would like to understand why :
re.findall(r"(\d[A-Za-z]+)", "My user name is 3e4r 5fg")
returns
['3e', '4r', '5fg']
while :
re.findall(r"(\d[A-Za-z]+)+", "My user name is 3e4r 5fg")
returns
['4r', '5fg']
I tested some combinations with spaces between the "digit-letter" groups, and two things are clearly involved:
the spaces between those groups
the last "+".
I don't really understand why adding "+" after the group changes the result. Can someone explain the steps of the process that lead to these different answers? Thank you very much.
When you put + after the parentheses, you are searching for a pattern made of one or more repetitions of the sub-pattern "one digit followed by one or more letters".
So the pattern "(\d[A-Za-z]+)+" produces 2 matches:
3e4r
5fg
Putting a sub-pattern in parentheses makes it a capture group. When a capture group is repeated, it keeps only the text from its last repetition, so the group values for the two matches are:
4r
5fg
The function re.findall returns only the groups (unless the pattern has no groups, in which case it returns the whole matches).
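The difference between the whole match and the captured group can be checked directly. This is a minimal sketch using the question's own string; the variable names are only illustrative.

```python
import re

text = "My user name is 3e4r 5fg"
pattern = re.compile(r"(\d[A-Za-z]+)+")

# For each match, compare the whole match (group 0) with what the
# repeated capture group holds once the match is finished (group 1).
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(pairs)

# A non-capturing group (?:...) leaves the pattern without capture
# groups, so findall returns the whole matches instead.
whole_matches = re.findall(r"(?:\d[A-Za-z]+)+", text)
print(whole_matches)
```

If the goal is to get 3e4r and 5fg back from findall, the non-capturing form is probably the simplest change.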

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q.\"Content\", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
SELECT ts_rank_cd(f.\"SearchVector\", websearch_to_tsquery('english', {request.Text})) AS Rank, *
FROM public.\"FileExtracts\" f, websearch_to_tsquery('english', {request.Text}) as tsq
WHERE f.\"SearchVector\" @@ tsq
ORDER BY rank DESC
) q
Searches for customer work but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery, but you can do it with to_tsquery (because to_tsquery accepts the :* prefix wildcard) and add the websearch syntax yourself in your backend.
For example, in a node.js environment you could do something like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) // split on runs of whitespace
let searchWithStar = searchArray.join(' & ') + ':*' // AND the words together and add a prefix wildcard to the last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and then use it in your SQL
search_column @@ to_tsquery('english', ${escapedSearch})
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have a hard time handling partial matches. What is it supposed to do: take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial match indicator? And how would it even know a partial match was intended, rather than just another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to get the best of both worlds:
select * from your_table where ts_vector_col @@ to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query, so building the tsquery looks like this:
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non-word chars and adds a trailing wildcard to each term, then ANDs the terms (:* & ). The 'g' flag is needed so that every separator is replaced, not just the first one
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts the string to a tsquery
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you get multi-word partial matches, e.g. searching for spi ma would return results matching spider man.
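The same string transform can also be done in the application before the query is sent. The following is a minimal Python sketch, not part of the answers above: the function name build_prefix_tsquery is hypothetical, and it collects runs of word characters instead of replacing the separators, so an empty or punctuation-only input yields an empty string rather than a bare ':*'.

```python
import re


def build_prefix_tsquery(search_text: str) -> str:
    """Build a to_tsquery string in which every term is a prefix match.

    Hypothetical helper: keeps only runs of word characters, appends
    ':*' to each one and ANDs the terms together.
    """
    terms = re.findall(r"\w+", search_text)
    return " & ".join(f"{term}:*" for term in terms)


print(build_prefix_tsquery(" all the search terms here "))
print(build_prefix_tsquery("spi ma"))
```

The resulting string would still be passed to to_tsquery as a bound query parameter rather than concatenated into the SQL text.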

Wildcard searching between words with CRC mode in Sphinx

I use Sphinx in CRC mode with min_infix_len = 1 and I want to use a wildcard between the characters of a keyword. Assume I have data like this in my index:
name
-------
mickel
mick
mickol
mickil
micknil
nickol
nickal
and I search for all records whose name starts with 'mick' and ends with 'l':
select * from all where match ('mick*l')
I expect results like this:
name
-------
mickel
mickol
mickil
micknil
but nothing is returned. How can I do that?
I know that I can do this in dict=keywords mode, but I have to use CRC mode for other reasons.
I also tried the '^' and '$' operators and they didn't work.
You can't use 'middle' wildcards with CRC. That is one of the reasons for dict=keywords: the wildcards it can support are much more flexible.
With CRC, the indexer 'precomputes' all the wildcard combinations and injects them as separate keywords in the index.
E.g. for mickel as a document word, and with min_prefix_len=1, the indexer will create the words:
mickel
mickel*
micke*
mick*
mic*
mi*
m*
... as words in the index, so all the combinations can match. With min_infix_len, it also has to generate all the combinations from the start of the word as well (so (word_length)^2 + 1 combinations).
... if it had to precompute all the combinations for wildcards in the middle, it would be a lot more again, particularly if it then allowed middle AND start/end combinations as well.
Although having said that, you can rewrite
select * from all where match ('mick*l')
as
select * from all where match ('mick* *l')
because with min_infix_len, the start and end will be indexed as separate words. You just need to insist that both match (although I can't think how to make them both match the same word!).
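To make the precomputation and the rewrite concrete, here is a rough Python sketch. It only mirrors the description above and is not Sphinx's actual indexer code. All three function names are hypothetical, and the last one is an assumed application-side check for the "same word" gap, run on rows after Sphinx has returned them.

```python
from fnmatch import fnmatchcase


def crc_prefix_keywords(word, min_prefix_len=1):
    """List the keywords described above for one document word."""
    keywords = [word]
    for length in range(len(word), min_prefix_len - 1, -1):
        keywords.append(word[:length] + "*")
    return keywords


def split_middle_wildcard(term):
    """Rewrite 'mick*l' as 'mick* *l' (a prefix term plus a suffix term)."""
    head, tail = term.split("*", 1)
    return f"{head}* *{tail}"


def same_word_matches(text, pattern):
    """Assumed post-filter: does a single word of text match the pattern?"""
    return any(fnmatchcase(word, pattern) for word in text.split())


print(crc_prefix_keywords("mickel"))
print(split_middle_wildcard("mick*l"))

names = ["mickel", "mick", "mickol", "mickil", "micknil", "nickol", "nickal"]
print([name for name in names if same_word_matches(name, "mick*l")])
```

With single-word fields like the names in the question the post-filter changes nothing, but for multi-word fields it would drop rows where 'mick*' and '*l' matched different words.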

Sphinx search entire field but not begin/end

I am trying to match a field that contains all the words in a phrase but so far have only been able to use ^ and $ to do it. For instance
^Word1 Word2$
Returns a record named "Word1 Word2" but not "Word3 Word1 Word2".
However, what I actually want is for it to also match "Word2 Word1".
So I get how to use ^ and $ to mean the start and end of the field, but that forces the words I put in to be in a particular order. Clearly I could also search for "Word2 Word1", but that gets more complex (3+ word terms, etc.).
Is there a way to tell Sphinx to look at an entire field, in any order? In other words, I want "Word1 Word2" to match "Word1 Word2" and "Word2 Word1" but not "Word3 Word2 Word1".
Well, you can use NEAR/ or the proximity operator to require the words to be adjacent (but in any order), but there isn't really a good way to require 'entire field'.
The closest option would probably be to enable index_field_lengths, then use the field length in a custom ranking expression. But if there are multiple fields in your index, it will be very tricky to implement.
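The condition that the field-length idea is aiming for can be written out in plain Python. This is only a sketch of the logic, not Sphinx syntax: the function name is hypothetical, and whitespace splitting with lowercasing is an assumed stand-in for Sphinx's tokenizer.

```python
def field_is_exactly(field_text, query_words):
    """True if the field holds all the query words, in any order, and nothing else."""
    tokens = field_text.lower().split()
    wanted = [word.lower() for word in query_words]
    # Every query word is present, and the field is no longer than the query.
    return len(tokens) == len(wanted) and set(wanted) <= set(tokens)


query = ["Word1", "Word2"]
for field in ["Word1 Word2", "Word2 Word1", "Word3 Word2 Word1"]:
    print(field, field_is_exactly(field, query))
```

The same check could be applied as a post-filter on the rows Sphinx returns if the ranking-expression route proves too awkward.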

PostgreSQL prevent non-matching tsqueries from matching tsvector

Given the following query:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('cats ate');
This query will return true as a result. Now, what if I don't want "cats" to also match the word "cat", is there any way I can prevent this?
Also, is there any way I can make sure that the tsquery matches the entire string in that particular order (e.g. the "cats ate" is counted as a single token rather than two). At the moment the following query will also match:
select to_tsvector('fat cat ate rat') @@ plainto_tsquery('ate cats');
cat matching cats is due to English stemming, english probably being your default text search configuration. See the result of show default_text_search_config to be sure.
It can be avoided by using the simple configuration, which does no stemming. Try the function calls with explicit text search configurations:
select to_tsvector('simple', 'fat cat ate rat') @@ plainto_tsquery('simple', 'cats ate');
Or change it with:
set default_text_search_config='simple';