How to make Sphinx prefer complex keywords to its parts

We have two fields: keywords (weight 10) and text (weight 1).
Let's see three records:
A: keywords = "some stuff, happy cat", text = "This is A"
B: keywords = "where stuff is, some dogs", text = "This is B some stuff"
C: keywords = "where some stuff is", text = "This is B some stuff"
When searching for "some stuff" we want to have record A above B and C.
Sphinx shows A below the others, because it has fewer mentions of "stuff". But A has an exact match in keywords (the comma really acts as a separator), so it is the only right answer.
How can Sphinx be configured to achieve that? Any kind of text preprocessing is allowed.

You can check the various ranking modes as per your requirement.
Please see the SPH_RANK_SPH04 ranking mode; this should work as per your expectation.
You should mention which version of Sphinx you are using.
Please read more details on ranking modes here.
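For instance, with SphinxQL the ranker can be selected per query. A sketch, assuming an index named myindex:
SELECT id, WEIGHT() FROM myindex
WHERE MATCH('some stuff')
OPTION ranker=sph04;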

In your example C is the most relevant.
You can filter by exact matches by using quotes around the search term.
You need to set the matching mode to SPH_MATCH_EXTENDED2, which tells Sphinx to fetch documents which contain the exact string.
I recommend you take a look at the extended search syntax.
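As a sketch (myindex is a placeholder index name): in SphinxQL, where the extended syntax is always in effect, a phrase match limited to the keywords field would look like this; with the native API you would first call SetMatchMode(SPH_MATCH_EXTENDED2):
SELECT * FROM myindex
WHERE MATCH('@keywords "some stuff"');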

Related

How to find common patterns in thousands of strings?

I don't want to find "abc" in strings ["kkkabczzz", "shdirabckai"]
Not like that.
But bigger patterns like this:
If I have to __, then I will ___.
["If I have to do it, then I will do it right.", "Even if I have to make it, I will not make it without Jack.", "....If I have to do, I will not...."]
I want to discover patterns in a large array or database of strings, say, going over the contents of an entire book.
Is there a way to find patterns like this?
I can work with JavaScript, Python, PHP.
The following could be a starting point:
The RegExp rx=/(\b\w+(\s+\w+\b)+)(?=.+\1)+/g looks for small (multiple-word) patterns that occur at least twice in the text.
By playing around with the repeat quantifier + after (\s+\w+\b) (i.e. changing it to something like {2}) you can restrict your word patterns to a specific number of words (in the above case to 3: the first word plus 2 repetitions) and you will get different results.
(?=.+\1)+ is a lookahead pattern that will not consume any of the matched parts of the string, so there is "more string" left for the remaining match attempts in the while loop.
const str = "If I have to do it, then I will do it right. Even if I have to make it, I will not make it without Jack. If I have to do, I will not.";
// capture a multi-word sequence (\b\w+(\s+\w+\b)+) that occurs again later in the string
const rx = /(\b\w+(\s+\w+\b)+)(?=.+\1)+/g, r = {};
let t;
// collect each repeated pattern as a key of r; the lastIndex adjustment moves
// the next search to one character past the start of the current match,
// so overlapping patterns are found too
while ((t = rx.exec(str))) r[t[1]] = (rx.lastIndex += 1 - t[1].length);
// count the occurrences of each collected pattern,
// then sort by count (descending) and alphabetically
const res = Object.keys(r).map(p =>
  [p, [...str.matchAll(p)].length]).sort((a, b) => b[1] - a[1] || b[0].localeCompare(a[0]));
// list all repeated patterns and their occurrence counts,
// ordered by occurrence count and alphabet:
console.log(res);
I extended my snippet a little bit by collecting all the matches as keys in an object (r). At the end I list all the keys of this object with Object.keys(r), sorted by occurrence count and then alphabetically.
In the while loop I also reset the rx.lastIndex property to start the search for the next pattern immediately after the start of the last one found: rx.lastIndex += 1 - t[1].length.

Stemming words ending with y

I'm trying to use Postgres' full text search, but I'm struggling to get certain query phrases working properly when stemming is involved.
strawberries matches strawberry
fruity does not match fruit
From what I've read these stemming algorithms are internal to Postgres and can't necessarily be modified easily. Does anyone know if the -y suffix can be stemmed properly?
This is too long for a comment.
I assume you are at least familiar with the documentation on the subject. I think the simplest method would be to create a synonym dictionary with the pairs that are equivalent.
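Roughly like this, as a sketch (my_syns and my_english are made-up names; the synonym file my_syns.syn would live in the tsearch_data directory and contain lines such as "fruity fruit"):
CREATE TEXT SEARCH DICTIONARY my_syns (
    TEMPLATE = synonym,
    SYNONYMS = my_syns
);
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = english);
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH my_syns, english_stem;
With that configuration, to_tsvector('my_english', 'fruity') matches to_tsquery('my_english', 'fruit').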
You need to be careful. There are lots of words in English where you cannot remove the "y":
lay <> la
Gaily <> Gail (woman's name)
Daily <> Dail (Irish parliament)
foxy <> fox
analogy <> analog
And this doesn't include the zillions of words where removing the "y" creates a non-word (a large class are -ly words; another, -way words).
You will need to create these manually yourself.
I am not intimately familiar with Postgres's dictionaries. But you should be able to accomplish what you want.

Configure Sphinx to handle space as possible words separator

Suppose I have the text Foo Bar Baz-Qux. How can I configure Sphinx's indexer so that Sphinx is able to find a match for any of the given strings?
Foo Bar Baz-Qux
Foo BazQux Bar
Baz Qux Foo Bar
Currently I have the dash symbol as the value of the ignore_chars setting, and Sphinx gives me results for the first two queries but not for the third.
Please note that the solution must be general and must not rely on the particular words from the example or on their relative order.
Thanks!
I have found a solution (or a workaround): use of regexp_filter.
So Sphinx index config looks now like this:
...
ignore_chars = -
regexp_filter = \b([\w\d]+)-([\w\d]+)\b => \1\2 \1 \2
...
So right before Sphinx puts text into its index, it splits every dash-containing word into two extra forms: one where the dash is simply removed and one where the dash is replaced with a space. At the moment of index creation, three words will be indexed for the text "Foo-Bar": "FooBar", "Foo" and "Bar". This lets me search with any of the following queries: "Foo-Bar" (the dash will be removed, since it is in the ignore_chars list), "FooBar" (this word is in the index) and "Foo Bar" (both words are in the index).
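For illustration, all three query forms then find the document (SphinxQL, with myindex standing in for the actual index name):
SELECT * FROM myindex WHERE MATCH('Foo-Bar');
SELECT * FROM myindex WHERE MATCH('FooBar');
SELECT * FROM myindex WHERE MATCH('Foo Bar');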
The main problem here is that you cannot use exact phrase matching for both types of queries at the same time. I.e., if you search for "Bar BazQux" or "Bar Baz-Qux" you'll get a result, but for "Bar Baz Qux" you will get nothing. In my specific case it is not an issue, but for anyone who wants to use this approach: I've warned you.
If you know a better way to do this, or if this workaround has some disadvantages that I have missed, please let me know.
Another possible solution is the use of trigrams, as shown here. That approach also helps with possible user mistakes but is more difficult to implement.

get the exact match and the total number of keywords

Is it possible to get an exact match of the keywords based on the number of them? The example below makes it clearer, I guess :)
In index I have this
record 1 "This is text"
record 2 "This is text and text"
then when I search for "This is text" I need to find only the first record.
Please note that I tried many filters but none seems to work; I always get both of them.
An extended match mode query of
"^this is text$"
should do it. Read up on the field-start, field-end and phrase operators for more information.
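In SphinxQL, for example, that query would be (myindex being a placeholder index name):
SELECT * FROM myindex WHERE MATCH('"^this is text$"');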

ensure if hashtag matches in search, that it matches whole hashtag

I have an app that utilizes hashtags to help tag posts. I am trying to have a more detailed search.
Let's say one of the records I'm searching is:
The #bird flew very far.
When I search for "flew", "fle", or "#bird", it should return the record.
However, when I search "#bir", it should NOT return the sentence, because the whole tag being searched for doesn't match.
I'm also not sure if "bird" should even return the sentence. I'd be interested in how to do that as well, though.
Right now, I have a very basic search:
SELECT "posts".* FROM "posts" WHERE (body LIKE '%search%')
Any ideas?
You could do this with LIKE, but it would be rather hideous; regexes will serve you better here. If you want to ignore the hashes, then a simple search like this will do the trick:
WHERE body ~ E'\\mbird\\M'
That would find 'The bird flew very far.' and 'The #bird flew very far.'. You'd want to strip off any #s before searching though, as this:
WHERE body ~ E'\\m#bird\\M'
wouldn't find either of those results due to the nature of \m and \M.
If you don't want to ignore #s in body then you'd have to expand and modify the \m and \M shortcuts yourself with something like this:
WHERE body ~ E'(^|[^\\w#])#bird($|[^\\w#])'
-- search term goes here^^^^^
Using E'(^|[^\\w#])#bird($|[^\\w#])' would find 'The #bird flew very far.' but not 'The bird flew very far.', whereas E'(^|[^\\w#])bird($|[^\\w#])' would find 'The bird flew very far.' but not 'The #bird flew very far.'. You might also want to look at \A instead of ^ and \Z instead of $, as there are subtle differences, but I think ^ and $ would be what you want.
You should keep in mind that none of these regex searches (or your LIKE search, for that matter) will use indexes, so you're setting yourself up for lots of table scans and performance problems unless you can restrict the searches using something that will use an index. You might want to look at a full-text search solution instead.
It might help to parse the hash tags out of the text and store them in an array in a separate column called, say, hashtags when the articles are inserted/updated. Remove them from the article body before feeding it into to_tsvector and store the tsvector in a column of the table. Then use:
WHERE body_tsvector @@ to_tsquery('search') OR 'search' = ANY (hashtags)
You could use a trigger on the table to maintain the hashtags column and the body_tsvector stripped of hash tags, so that the application doesn't have to do the work. Parse them out of the text when entries are INSERTed or UPDATEd.
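A minimal sketch of such a trigger, assuming the illustrative posts table with body, hashtags and body_tsvector columns from above (EXECUTE FUNCTION needs PostgreSQL 11+; older versions use EXECUTE PROCEDURE):
CREATE OR REPLACE FUNCTION posts_maintain_search() RETURNS trigger AS $$
BEGIN
    -- collect every #tag (without the #) into the hashtags array
    NEW.hashtags := ARRAY(
        SELECT m[1] FROM regexp_matches(NEW.body, '#(\w+)', 'g') AS m);
    -- index the body with the hash tags stripped out
    NEW.body_tsvector := to_tsvector('english',
        regexp_replace(NEW.body, '#\w+', '', 'g'));
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_search_trg
    BEFORE INSERT OR UPDATE ON posts
    FOR EACH ROW EXECUTE FUNCTION posts_maintain_search();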