Sphinx search entire field but not begin/end - sphinx

I am trying to match a field that contains all the words in a phrase but so far have only been able to use ^ and $ to do it. For instance
^Word1 Word2$
Returns a record named "Word1 Word2" but not "Word3 Word1 Word2".
However what I want in fact is also "Word2 Word1"
So I get how to use ^ and $ to mean the start and end of the field, but that forces the words I put in to be in a particular order. Clearly I could also search for "Word2 Word1", but it gets more complex (3+ word terms, etc.).
Is there a way to tell sphinx to look at an entire field, in any order? In other words, I want "Word1 Word2" to match "Word1 Word2" and "Word2 Word1" but not "Word3 Word2 Word1".

You can use NEAR/ or the proximity operator to require the words to be adjoining (but in any order), but there isn't really a good way to require 'entire field'.
The closest would probably be to enable index_field_lengths, then use the field length in a custom ranking expression. But if there are multiple fields in your index, this will be very tricky to implement.
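The field-length idea boils down to two conditions: every query word matches, and the field has no more words than the query. If doing that inside Sphinx proves awkward, the same check can be applied as a post-filter in application code. Here is a minimal sketch, assuming simple whitespace tokenization and lowercasing. tokenize and matchesEntireFieldAnyOrder are hypothetical helpers and do not reproduce Sphinx's charset_table or morphology settings.

```javascript
// Hypothetical helper: lowercases and splits on whitespace.
// Sphinx's real tokenization depends on the index settings.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// True when the field consists of exactly the query words, in any order.
function matchesEntireFieldAnyOrder(field, query) {
  const fieldWords = tokenize(field).sort();
  const queryWords = tokenize(query).sort();
  if (fieldWords.length !== queryWords.length) return false;
  return fieldWords.every((word, i) => word === queryWords[i]);
}

const titles = ["Word1 Word2", "Word2 Word1", "Word3 Word2 Word1"];
const hits = titles.filter((t) => matchesEntireFieldAnyOrder(t, "Word1 Word2"));
console.log(hits);
```

This keeps "Word1 Word2" and "Word2 Word1" and drops "Word3 Word2 Word1". It only makes sense on top of a query that has already narrowed the result set.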

Related

MongoDB: How to only pull document with exact search term

I'm sure this is somewhere on here but I can't seem to find it. I'm trying to pull a document from a large file that only matches an exact term in a field, as opposed to anything with those letters in it.
More precisely, I'm trying to use .find({"name":"Eli"}) to pull the documents with that name, but my search is pulling every name with those letters (such as elizabeth or ophelia)
You can use a regular expression match to make sure you do not return names that share the same character formation.
Something like this:
const name = "Eli"
const query = new RegExp(`^${name}$`)
const user = await Collection.find({ name: { $regex: query } })
I am using two key regex operators here: ^ and $.
Putting ^ at the start of a regular expression anchors the match to the beginning of the string, so only strings that start with the pattern match.
Putting $ at the end anchors the match to the end of the string, so only strings that end with the pattern match.
So essentially you are asking mongoose to find the records where the name both begins and ends with Eli, i.e. is exactly Eli. This will prevent Elizabeth from showing up in your result, but it won't filter out other records named Eli.
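One caveat with building the pattern from a variable: if the name contains regex metacharacters (a dot, parentheses, and so on), they are interpreted as regex syntax. Here is a minimal sketch of escaping the value before anchoring it. escapeRegExp and exactNameRegExp are hypothetical helper names, and the check below runs against a plain array rather than a real collection.

```javascript
// Escape regex metacharacters so the name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an anchored, case-sensitive pattern that matches the whole string only.
function exactNameRegExp(name) {
  return new RegExp(`^${escapeRegExp(name)}$`);
}

const pattern = exactNameRegExp("Eli");
const names = ["Eli", "Elizabeth", "ophelia", "eli"];
const exact = names.filter((n) => pattern.test(n));
console.log(exact);
```

Only "Eli" survives the filter. The resulting RegExp can be passed to $regex in the same way as in the snippet above.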

Prefix/wildcard searches with 'websearch_to_tsquery' in PostgreSQL Full Text Search?

I'm currently using the websearch_to_tsquery function for full text search in PostgreSQL. It all works well except for the fact that I no longer seem to be able to do partial matches.
SELECT ts_headline('english', q.\"Content\", websearch_to_tsquery('english', {request.Text}), 'MaxFragments=3,MaxWords=25,MinWords=2') Highlight, *
FROM (
SELECT ts_rank_cd(f.\"SearchVector\", websearch_to_tsquery('english', {request.Text})) AS Rank, *
FROM public.\"FileExtracts\" f, websearch_to_tsquery('english', {request.Text}) as tsq
WHERE f.\"SearchVector\" @@ tsq
ORDER BY rank DESC
) q
Searches for customer work but cust* and cust:* do not.
I've had a look through the documentation and a number of articles but I can't find a lot of info on it. I haven't worked with it before so hopefully it's just something simple that I'm doing wrong?
You can't do this with websearch_to_tsquery, but you can do it with to_tsquery (because to_tsquery allows you to add a :* wildcard) and add the websearch syntax yourself in your backend.
For example, in a node.js environment you could do something like this:
let trimmedSearch = req.query.search.trim()
let searchArray = trimmedSearch.split(/\s+/) // split on runs of whitespace, dropping the whitespace
let searchWithStar = searchArray.join(' & ') + ':*' // join the words back together with AND (&) and add a prefix star to the last word
let escapedSearch = yourEscapeFunction(searchWithStar)
and then use it in your SQL:
search_column @@ to_tsquery('english', ${escapedSearch})
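The escaping step is left open above. One possible shape for it is sketched below, assuming it is acceptable to drop every character that is not a letter or digit so that user input cannot introduce tsquery operators. buildPrefixTsquery and your_table are hypothetical names, and the $1 placeholder assumes a driver with numbered bind parameters.

```javascript
// Hypothetical helper: turns raw user input into a to_tsquery string in which
// all terms are ANDed and only the last term is a prefix match.
function buildPrefixTsquery(rawSearch) {
  const terms = rawSearch
    .trim()
    .split(/\s+/)
    // Strip anything that is not a letter or digit (operators, quotes, colons).
    .map((term) => term.replace(/[^\p{L}\p{N}]/gu, ""))
    .filter(Boolean);
  if (terms.length === 0) return null; // nothing searchable was entered
  return terms.join(" & ") + ":*";
}

// Pass the result as a bound parameter instead of interpolating it into the SQL.
const sql = "SELECT * FROM your_table WHERE search_column @@ to_tsquery('english', $1)";
const tsquery = buildPrefixTsquery("  customer cust ");
console.log(sql, tsquery);
```

Here the input "  customer cust " becomes customer & cust:*. A null result means the search was empty after cleaning, and the query should be skipped.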
You need to write the tsquery directly if you want to use partial matching. plainto_tsquery doesn't pass through partial match notation either, so what were you doing before you switched to websearch_to_tsquery?
Anything that applies a stemmer is going to have a hard time handling partial matches. What is it supposed to do: take off the notation, stem the part, then add it back on again? Not do stemming on the whole string? Not do stemming on just the token containing the partial match indicator? And how would it even know partial match was intended, rather than just being another piece of punctuation?
To add something on top of the other good answers here, you can also compose your query with both websearch_to_tsquery and to_tsquery to have everything from both worlds:
select * from your_table where ts_vector_col @@ to_tsquery('simple', websearch_to_tsquery('simple', 'partial query')::text || ':*')
Another solution I have come up with is to do the text transform as part of the query, so building the tsquery looks like this (note the 'g' flag: without it, regexp_replace only replaces the first match):
to_tsquery(concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'g'), ':*'));
(trim) Removes leading/trailing whitespace
(regexp_replace) Splits the search string on non-word chars and adds trailing wildcards to each term, then ANDs the terms (:* & )
(concat) Adds a trailing wildcard to the final term
(to_tsquery) Converts to a tsquery
You can test the string manipulation by running
SELECT concat(regexp_replace(trim(' all the search terms here '), '\W+', ':* & ', 'gm'), ':*')
the result should be
all:* & the:* & search:* & terms:* & here:*
So you get multi-word partial matches, e.g. searching for "spi ma" would return results matching "spider man".

Can I Exclude Certain Pattern Matches In Rosie?

I want to match all five digit numbers except for a specific pattern. So I want to be able to match 12345 but exclude 00000. Is there a pattern which I can use in Rosie to match this set of patterns?
Yes this is possible. Given the example above the correct expression would be
allButFiveZeroes = {!"00000" [0-9]{5}}
The !"00000" is referred to as a negative look-ahead: it succeeds only if the input at that point does not start with 00000, and it consumes nothing.
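For comparison, the same negative look-ahead idea written as a JavaScript regular expression rather than RPL. This is only an illustration of the technique, not Rosie syntax, and the ^ and $ anchors are an added assumption that the whole string must be the five digits.

```javascript
// (?!00000) fails the match if the next five characters are all zeroes;
// [0-9]{5} then consumes exactly five digits.
const fiveDigitsNotAllZero = /^(?!00000)[0-9]{5}$/;

const samples = ["12345", "00000", "00001", "1234"];
const accepted = samples.filter((s) => fiveDigitsNotAllZero.test(s));
console.log(accepted);
```

"12345" and "00001" are accepted, "00000" is rejected by the look-ahead, and "1234" is rejected because it is too short.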

Wildcard searching between words with CRC mode in Sphinx

I use sphinx with CRC mode and min_infix_len = 1, and I want to use wildcard searching in the middle of a keyword. Assume I have some data like this in my index:
name
-------
mickel
mick
mickol
mickil
micknil
nickol
nickal
and I search for all records whose name starts with 'mick' and ends with 'l':
select * from all where match ('mick*l')
I expect the results should be like this:
name
-------
mickel
mickol
mickil
micknil
but nothing returned. How can I do that?
I know that I can do this in dict=keywords mode but I should use crc mode for some reasons.
I also tried the '^' and '$' operators, and that didn't work either.
You can't use 'middle' wildcards with CRC. That is one of the reasons for dict=keywords: the wildcards it can support are much more flexible.
With CRC, the indexer 'precomputes' all the wildcard combinations and injects them as separate keywords in the index. For example,
with mickel as a document word and min_prefix_len=1, the indexer will create the words:
mickel
mickel*
micke*
mick*
mic*
mi*
m*
... as words in the index, so all the combinations can match. If using min_infix_len, it also has to do all the combinations at the start as well (so roughly (word_length)^2 + 1 combinations).
... If it had to precompute all the combinations for wildcards in the middle, there would be a lot more again (particularly if it then allowed middle AND start/end combinations as well).
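To make the prefix expansion concrete, here is a small sketch that generates the same list as above for a given word. It only illustrates the description in this answer; prefixKeywords is a hypothetical function and says nothing about how the indexer actually stores these entries.

```javascript
// Returns the word itself plus one "prefix*" keyword for every prefix length
// from the full word down to minPrefixLen.
function prefixKeywords(word, minPrefixLen) {
  const keywords = [word];
  for (let len = word.length; len >= minPrefixLen; len--) {
    keywords.push(word.slice(0, len) + "*");
  }
  return keywords;
}

const expanded = prefixKeywords("mickel", 1);
console.log(expanded);
```

For mickel this gives mickel, mickel*, micke*, mick*, mic*, mi*, m*. A query term such as mick* is then just an exact keyword lookup, which is why a wildcard in the middle has nothing to look up.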
Although having said that, you can rewrite
select * from all where match ('mick*l')
as
select * from all where match ('mick* *l')
because with min_infix_len, the start and end will be indexed as separate words. You just need to insist that both match (although I can't think how to make them both match the same word!).
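Since 'mick* *l' can also match a row where mick* and *l hit two different words, one workaround is to re-check the returned rows in application code. Here is a rough sketch, assuming the field text comes back with the results, words are separated by whitespace, and * means "zero or more characters". wildcardToRegExp and rowMatchesWildcard are hypothetical helpers.

```javascript
// Convert a wildcard term such as "mick*l" into an anchored RegExp.
function wildcardToRegExp(term) {
  const escaped = term
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// True when at least one single word of the field matches the whole term.
function rowMatchesWildcard(fieldText, term) {
  const pattern = wildcardToRegExp(term);
  return fieldText.split(/\s+/).some((word) => pattern.test(word));
}

const rows = ["mickel", "mick", "mickol", "mick nil", "nickol"];
const kept = rows.filter((r) => rowMatchesWildcard(r, "mick*l"));
console.log(kept);
```

The row "mick nil" would satisfy 'mick* *l' but is dropped here, because no single word both starts with mick and ends with l.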

how to get matched group number in pcre2

I want to use pcre2 to match a string.
For example, I have several string patterns: "a","b","c","d", and "e".
I have a long text "str" to match.
Now I construct the pattern "a|b|c|d|e" to match "str" using pcre2_match.
How do I know which pattern matched?
I just want to get the number of the matched pattern, not "a" or "b", as I don't want to compare the matched text with "a","b","c","d","e" again.
Assuming you're using the PCRE2 library directly and have access to all of its features, you have several solutions for this, from the simplest to the most involved:
Use numbered capture groups: (a)|(b)|(c)|(d)
Use named capture groups: (?<a>a)|(?<b>b)|(?<c>c)|(?<d>d)
Use marks: a(*MARK:a)|b(*MARK:b)|c(*MARK:c)|d(*MARK:d)
Use callouts: a(?C{a})|b(?C{b})|c(?C{c})|d(?C{d})
If you really can't modify your input pattern, use PCRE2_AUTO_CALLOUT and find some way to map pattern offsets to branches, then remember the last pattern offset seen before the end of the match.
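The numbered-capture-group option can be illustrated without PCRE2. The sketch below uses JavaScript's built-in RegExp, so it is not the PCRE2 API: each alternative gets its own group, and the index of the one group that captured identifies the branch. With PCRE2 itself, the equivalent step would be checking which group is set in the output vector after pcre2_match. matchedBranch is a hypothetical helper and assumes the alternatives contain no capture groups of their own.

```javascript
const alternatives = ["a", "b", "c", "d", "e"];

// Wrap each alternative in its own capture group: (a)|(b)|(c)|(d)|(e)
const combined = new RegExp(alternatives.map((p) => `(${p})`).join("|"));

// Returns the 1-based number of the branch that matched, or -1 if none did.
function matchedBranch(text) {
  const match = combined.exec(text);
  if (match === null) return -1;
  // match[0] is the whole match; groups 1..n correspond to the branches.
  for (let group = 1; group < match.length; group++) {
    if (match[group] !== undefined) return group;
  }
  return -1;
}

console.log(matchedBranch("xxcxx"));
```

For the input "xxcxx" the third group is the only one that captured, so the branch number is 3 and no second string comparison is needed.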