Search on last word in a field using Sphinx? - sphinx

I'm using SphinxQL to prepare Sphinx searches (in fact as part of a NOT operator) but am unable to do something that is pretty simple in MySQL: LIKE '% Word'. I simply need to know when a specific word is the last one in the field/string, but SphinxQL doesn't seem to lend itself to that.
The quick brown fox jumped over the lazy dog.
Lazy Dog day afternoons
I'm essentially looking to search on
select Description from idx_Table WHERE (MATCH('@(Description) Fox Dog (not like '% Dog'))
I get that the above is not proper SphinxQL at all but is essentially what I am trying to achieve.

There is a field-end modifier, so you can specifically match the last word in a field.
... WHERE MATCH('@(Description) Fox Dog$')
This will only get you matches where the last word is Dog. Use phrase marks if you want to match the last two (or more!) words.
... WHERE MATCH('@(Description) "Fox Dog$"')
But there is still no assertion to say: match this, EXCEPT when it's the last word.
... WHERE MATCH('@(Description) Fox Dog -Dog$')
would execute, but may well exclude 'valid' matches.
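One workaround is to let Sphinx return the broader matches and post-filter in the application, dropping rows whose final word is the one you want to exclude. A minimal sketch in Python (the row data is hypothetical, standing in for rows fetched from the index):

```python
import re

def last_word(text):
    """Return the final word of a string, lowercased, or None if empty."""
    words = re.findall(r"\w+", text.lower())
    return words[-1] if words else None

# hypothetical rows returned by the broader 'Fox Dog' match
rows = [
    "The quick brown fox jumped over the lazy dog.",
    "Lazy Dog day afternoons",
]

# keep matches only where 'dog' is NOT the last word in the field
kept = [r for r in rows if last_word(r) != "dog"]
print(kept)  # ['Lazy Dog day afternoons']
```

This trades an extra round of filtering in the application for the assertion SphinxQL can't express.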

Related

Supporting typos in Postgres FTS

This returns true
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')
@@ to_tsquery('j-80');
These return false:
-- no minus char
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')
@@ to_tsquery('j80');
-- a typo, typing 9 instead of 8
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')
@@ to_tsquery('j90');
-- the user searches with a space 'j 80'
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')
@@ to_tsquery('j & 80');
How do I improve the queries, or maybe the tsvector, so that I get true for all of the above?
It is hard to operate effectively on an unannotated mixture of ordinary English and technical jargon, like part numbers. Add in the shortness of the part numbers, the inconsistent punctuation (particularly if the part number can have embedded spaces), and the possibility of misspellings, and it all adds up to a very hard problem. If you can somehow extract the part numbers into their own column and standardize the punctuation both in that column and in the query (by removing all punctuation, for example), then you can use a pg_trgm index or operators. But with the part number being only 3 characters long, you still don't have much to go on. For example, j80 and j90 are only barely related at all in the trigram algorithm:
create extension if not exists pg_trgm;
select similarity('j80', 'j90');
similarity
------------
0.142857
Basically, they both start with j is all you have there. (They also both end with 0, but trigrams need at least 2 characters at the end of a word to be the same to consider it a match--beginnings have more weight than endings).
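To see where that number comes from, here is a small Python sketch of the single-word trigram similarity pg_trgm uses (two spaces of padding in front, one behind; similarity is shared trigrams over total distinct trigrams). This is a simplified model of the extension, not its actual code:

```python
def trigrams(word):
    """pg_trgm-style trigrams for a single word: pad with two leading
    spaces and one trailing space, then take every 3-char window."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (Jaccard)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(trigrams("j80"))                      # {'  j', ' j8', 'j80', '80 '}
print(round(similarity("j80", "j90"), 6))   # 0.142857
```

The only shared trigram is '  j', which is why the two part numbers barely register as related.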

Odd to_tsquery results for s:* and t:*

I was experimenting with PostgreSQL's text search feature - particularly with the normalization function to_tsquery.
I was using the english dictionary (config), and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they matched to a single space and tab?
Here is the query:
select
to_tsquery('english', 'a:*') as for_a,
to_tsquery('english', 's:*') as for_s,
to_tsquery('english', 't:*') as for_t,
to_tsquery('english', 'u:*') as for_u
You would see that 'u:*' is returned as 'u:*' while 'a:*' returns nothing.
The letters s and t are considered stop words in the english text search dictionary, therefore they get discarded. You can read the stop word list in tsearch_data/english.stop under the PostgreSQL shared folder, which you can locate by running pg_config --sharedir.
With PostgreSQL 11 on Ubuntu/Debian/Mint, that would be
cat /usr/share/postgresql/11/tsearch_data/english.stop
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to set aside English grammar and think of words in a programmatic and logical way, as described above. Full text search does not try to infer context based on sentence structure, so it has no use for these words. After all, it's called full text search and not natural language search.
As to how they arrived on the conclusion to add s and t to the stop word list, statistical analysis must have revealed these characters to be noise.
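Mechanically, the effect is just a set-membership filter applied before terms become lexemes, as in this Python sketch (the stop-word set here is a tiny hand-picked subset of english.stop, not the full file):

```python
# a tiny hand-picked subset of tsearch_data/english.stop
STOP_WORDS = {"a", "an", "i", "s", "t", "the", "of"}

def to_lexemes(terms):
    """Discard stop words, as to_tsquery does with the english config."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

print(to_lexemes(["a", "s", "t", "u"]))  # ['u']
```

This mirrors the query in the question: only u:* survives normalization.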

Field position limit in Sphinx to *start* search at character position?

As far as I can tell, the "field position limit" in Sphinx only allows you to restrict the search to the first N characters of a document. Is there any way to use it to force the search AFTER the first N characters instead?
The quick brown fox jumped over the lazy dog and he was crazy as a fox and just as fast
Fox[20]
will find the first fox and not the second.
What I am looking for is something like
Fox[50] that won't start search until char 50 ("and he was crazy as a fox and just as fast")
Well you could say
"bla bla" @field[50] -"bla bla"
But you have the old problem of it also excluding items where it appears after position 50 as well as before.
Otherwise, I think you will have to look at ranking expressions; there is min_hit_pos, which you can use. You would have to use the ranking expression to change the ranking calculation, and then 'post-filter' based on the weight. You can use the weight in WHERE, via virtual attributes.
(This won't work either, see comments!)

ensure if hashtag matches in search, that it matches whole hashtag

I have an app that utilizes hashtags to help tag posts. I am trying to have a more detailed search.
Let's say one of the records I'm searching is:
The #bird flew very far.
When I search for "flew", "fle", or "#bird", it should return the record.
However, when I search "#bir", it should NOT return the sentence, because the whole tag being searched for doesn't match.
I'm also not sure whether "bird" should even return the sentence. I'd be interested in how to do that as well, though.
Right now, I have a very basic search:
SELECT "posts".* FROM "posts" WHERE (body LIKE '%search%')
Any ideas?
You could do this with LIKE, but it would be rather hideous; regexes will serve you better here. If you want to ignore the hashes then a simple search like this will do the trick:
WHERE body ~ E'\\mbird\\M'
That would find 'The bird flew very far.' and 'The #bird flew very far.'. You'd want to strip off any #s before searching though, as this:
WHERE body ~ E'\\m#bird\\M'
wouldn't find either of those results due to the nature of \m and \M.
If you don't want to ignore #s in body then you'd have to expand and modify the \m and \M shortcuts yourself with something like this:
WHERE body ~ E'(^|[^\\w#])#bird($|[^\\w#])'
-- search term goes here  ^^^^^
Using E'(^|[^\\w#])#bird($|[^\\w#])' would find 'The #bird flew very far.' but not 'The bird flew very far.' whereas E'(^|[^\\w#])bird($|[^\\w#])' would find 'The bird flew very far.' but not 'The #bird flew very far.'. You might also want to look at \A instead of ^ and \Z instead of $ as there are subtle differences but I think $ and ^ would be what you want.
You should keep in mind that none of these regex searches (or your LIKE search, for that matter) will use indexes, so you're setting yourself up for lots of table scans and performance problems unless you can restrict the search using something that will use an index. You might want to look at a full-text search solution instead.
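The behaviour of that pattern is easy to check outside the database; here is the same idea in Python's re syntax (note that Python has no \m/\M shortcuts, so the expanded character-class form is used directly):

```python
import re

# the expanded-boundary patterns from the answer, in Python syntax
tagged = re.compile(r"(^|[^\w#])#bird($|[^\w#])")
plain = re.compile(r"(^|[^\w#])bird($|[^\w#])")
partial = re.compile(r"(^|[^\w#])#bir($|[^\w#])")

print(bool(tagged.search("The #bird flew very far.")))   # True
print(bool(tagged.search("The bird flew very far.")))    # False
print(bool(plain.search("The bird flew very far.")))     # True
print(bool(plain.search("The #bird flew very far.")))    # False
print(bool(partial.search("The #bird flew very far.")))  # False: partial tag does not match
```

The last case is the one from the question: "#bir" fails because the 'd' after it is a word character, so neither boundary alternative can match.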
It might help to parse the hashtags out of the text and store them in an array in a separate column, called say hashtags, when the articles are inserted/updated. Remove them from the article body before feeding it into to_tsvector and store the tsvector in a column of the table. Then use:
WHERE body_tsvector @@ to_tsquery('search') OR 'search' = ANY (hashtags)
You could use a trigger on the table to maintain the hashtags column and the body_tsvector stripped of hash tags, so that the application doesn't have to do the work. Parse them out of the text when entries are INSERTed or UPDATEd.
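A sketch of that parsing step in Python (the function name and exact behaviour are my own illustration, not an existing library):

```python
import re

def split_hashtags(body):
    """Pull #tags into their own list and strip them from the body,
    so the tags can go into a separate column and the remaining
    text into to_tsvector."""
    hashtags = re.findall(r"#(\w+)", body)
    stripped = re.sub(r"#\w+\s*", "", body).strip()
    return hashtags, stripped

tags, text = split_hashtags("The #bird flew very far.")
print(tags)  # ['bird']
print(text)  # 'The flew very far.'
```

In Postgres itself the same logic could live in a PL/pgSQL trigger function, so the columns stay consistent no matter which application writes the rows.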

How to search a string for different tenses?

I can use Stemmers, Filters etc. No problem.
But what about this case, for example the source text contains the phrase:
The fox made a jump.
User has entered: fox AND make
Results = 0;
The question is how to process irregular forms of words?
There aren't that many irregular verbs. Get a list of the most common ones, like the one here, and use it to do a find/replace in your query string, replacing "make" with ("make" OR "made") before submitting.
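A minimal sketch of that find/replace in Python, with a deliberately tiny hand-made irregular-verb table (a real one would cover the full list of common irregular verbs):

```python
# deliberately tiny irregular-verb table for illustration
IRREGULAR = {
    "make": ("make", "made"),
    "go": ("go", "went", "gone"),
}

def expand_query(query):
    """Replace each irregular verb with an OR group of its forms."""
    parts = []
    for term in query.split():
        forms = IRREGULAR.get(term.lower())
        if forms:
            parts.append("(" + " OR ".join(forms) + ")")
        else:
            parts.append(term)
    return " ".join(parts)

print(expand_query("fox AND make"))  # fox AND (make OR made)
```

With the expanded query, stemming handles the regular forms and the OR group covers the irregular ones, so "The fox made a jump." now matches.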