Return rows where words match in two columns OR match in one column and the other column is empty? - sphinx

This is a follow-up to another question I recently asked.
I currently have a SphinxQL query like this:
SELECT * FROM my_index
WHERE MATCH('@field1 "a few words"/1 @field2 "more text here"/1')
However, I would still like it to match rows in the case where one of the fields in the row is empty.
For example, let's say the following rows exist in the database:
field1 | field2
-----------------------
words in here | text in here
| text in here
The above query would match the first row, but it would not match the second row because the quorum operator specifies that there has to be one or more matches for each field.
Is what I'm asking possible?
The actual query I'm trying to make this work with was provided in Barry Hunter's answer to my previous question:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);

The first problem is that Sphinx doesn't index "empty", so you can't search for it. (Well, actually the field_len attribute will be zero, but it can be hard to combine an attribute filter with MATCH().)
So arrange for empty to be something that can be indexed:
sql_query = SELECT id,...,IF(tags='','_empty_',tags) AS tags FROM ...
Then modify the query. As it happens your quorum search is easy!
@field1 "a few words _empty_"/1
It's just another word. A more complex query would simply have to be OR'ed with the word.
Then there is making it work within your complex query. As luck would have it, it's really easy: _empty_ is just another word, and when the field is empty, that one word will match. (In that case _empty_ is the only word in the field, so the field's length is one.)
So just add _empty_ into the two quorums and you're done!
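As a rough sketch of what "add _empty_ into the two quorums" produces, the snippet below builds the MATCH() expression for the two-field example from the question. The helper names `quorum_clause` and `build_match` are hypothetical, and the sketch assumes the index was built with the `_empty_` placeholder from the sql_query above.

```python
def quorum_clause(field, words, threshold=1, sentinel="_empty_"):
    """Build one Sphinx quorum clause and append the placeholder word."""
    terms = " ".join(list(words) + [sentinel])
    return f'@{field} "{terms}"/{threshold}'


def build_match(clauses):
    """Join per-field quorum clauses into one extended-syntax expression."""
    return " ".join(quorum_clause(field, words) for field, words in clauses)


# Field names and words are taken from the question's example query.
match_expr = build_match([
    ("field1", ["a", "few", "words"]),
    ("field2", ["more", "text", "here"]),
])
print(match_expr)
```

The resulting string is what would go inside MATCH('...'); a row whose field1 is empty is indexed as `_empty_` and therefore still satisfies the first quorum.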

Related

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table named table and a string column named column, I want to search for the word word in that column in the following way: exact matches on top, followed by prefix matches and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is which one of the two solutions is more efficient or, better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the WHERE, column LIKE 'word' is not needed, as it is covered by column LIKE 'word%'; keeping it might make the DB do two checks instead of one.
But the biggest problem is the third condition, LIKE '%word': a pattern with a leading wildcard cannot be served by an ordinary B-tree index.
So either way, PostgreSQL is going to scan the full table and extract the matches row by row. This is going to be slow for 20,000 rows or more.
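A small sketch of the simplified query, with the redundant exact-match test dropped from the WHERE clause. It runs against an in-memory SQLite database purely as a stand-in for PostgreSQL, and the table name `items`, column name `name`, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("wordplay",), ("sword",), ("word",), ("other",), ("password",)],
)

# The prefix test already covers the exact match, so WHERE needs two tests only.
QUERY = """
SELECT name
FROM items
WHERE name LIKE 'word%'
   OR name LIKE '%word'
ORDER BY CASE
           WHEN name = 'word'     THEN 1   -- exact match first
           WHEN name LIKE 'word%' THEN 2   -- then prefix matches
           ELSE 3                          -- then postfix matches
         END,
         name
"""

ranked = [row[0] for row in conn.execute(QUERY)]
print(ranked)
```

This only illustrates the shape of the query; it says nothing about how the PostgreSQL planner will execute it, so EXPLAIN ANALYZE on real data is still the way to compare plans.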
I recommend you to explore fuzzy string matching and full text search; looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the extension "pg_trgm", as it will enable you to add a GIN index on the column that will speed up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still run the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And even I wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

Returning all rows of a query where the string values in one column match the string values in another column

I have been searching endlessly for the answer to this problem I have been having:
Our team uses a query that returns a dataset with 13 columns. We want to narrow down the results by returning only rows where any string value in column "Actual Collection" is in the adjacent column "PrvPrComments". Additionally, we want to do the same thing for columns "Actual Manufacturer" and "PrvPrComments". If a string value in either Actual Collection or Actual Manufacturer exists in PrvPrComments, then we want to return that row; if it does not, we want to exclude it.
The tricky part is that PrvPrComments is a column that has long text strings in them and so the query needs to parse through to find and match the string. They also need to be exact matches so "Pillow Perfect" and "pillow" would not be the same thing.
Here is an example posted below. I would want to return rows that contains "cowboy" and "chandelier" because there is a match but not the others:
Example of data
My initial guess would be to write a query that uses Full Text Index and/or contains. Any help would be greatly appreciated and I apologize for not having a foundation code to post here, I'm fairly new to this and am having trouble with where to start.
Thank you
where PrvPrComments like '%' + actualCollection + '%'
If there is not much data, you can use a LIKE expression to return the rows:
WHERE PrvPrComments LIKE '%' + actualCollection + '%'
But if the data is huge and full-text search will not be that useful, you could add another column as a flag and populate it at insertion time (when PrvPrComments contains actualCollection, set the flag to 1). Later you only need to query the rows whose flag is 1.
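A minimal sketch of the LIKE approach extended to both columns from the question. It uses an in-memory SQLite database as a stand-in, so concatenation is written with `||` rather than SQL Server's `+`; the table name `orders`, the unspaced column names, and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           ActualCollection TEXT,
           ActualManufacturer TEXT,
           PrvPrComments TEXT
       )"""
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "Cowboy", "Acme", "Customer ordered the Cowboy lamp"),
        (2, "Chandelier", "Lumen", "Chandelier arrived damaged"),
        (3, "Pillow Perfect", "Softco", "customer asked about a pillow"),
        (4, "Rustic", "Oakline", "no remarks"),
        (5, "Harbor", "Oakline", "Reordered from Oakline last year"),
    ],
)

# Keep a row when the comment text contains either column's full value.
QUERY = """
SELECT id
FROM orders
WHERE PrvPrComments LIKE '%' || ActualCollection || '%'
   OR PrvPrComments LIKE '%' || ActualManufacturer || '%'
ORDER BY id
"""

matching_ids = [row[0] for row in conn.execute(QUERY)]
print(matching_ids)
```

Row 3 is excluded because "pillow" alone does not contain the full value "Pillow Perfect". Note that LIKE in SQLite ignores ASCII case by default, and case handling in SQL Server depends on the column collation, so strict case-sensitive matching needs extra care.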

why is this postgresql full text search query returning ts_rank of 0?

Before I invest in using solr or lucene or sphinx, I wanted to try to implement a search capability on my system using postgresql full text search.
I have a national list of businesses in a table that I want to search. I created a ts vector that combines the business name and city so that I can do a search like "outback atlanta".
I am also implementing an auto-completion function by using the prefix-matching capability of the search: I append ":*" to the search pattern and insert " & " between keywords, so the search pattern "outback atl" turns into "outback & atl:*" before getting converted into a query using to_tsquery().
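The pattern-building step described above might look roughly like the following sketch. The function name `to_prefix_query` is hypothetical, and the sketch assumes the input contains only plain words (no escaping of characters that are special to to_tsquery()).

```python
def to_prefix_query(pattern):
    """Turn 'outback atl' into 'outback & atl:*' for to_tsquery()."""
    words = pattern.split()
    if not words:
        return ""
    # Only the last keyword is treated as a prefix being typed.
    words[-1] += ":*"
    return " & ".join(words)


print(to_prefix_query("outback atl"))
```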
Here's the problem that I am running into currently.
if the search pattern is entered as "ou", many "Outback Steakhouse" records are returned.
if the search pattern is entered as "out", no results are returned.
if the search pattern is entered as "outb", many "Outback Steakhouse" records are returned.
doing a little debugging, I came up with this:
select ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('ou:*')) as "ou",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('out:*')) as "out",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('outb:*')) as "outb"
which results in this:
ou        | out | outb
----------+-----+-----------
0.0607927 | 0   | 0.0607927
What am I doing wrong?
Is this a limitation of pg full text search?
Is there something that I can do with my dictionary or configuration to get around this anomaly?
UPDATE:
I think that "out" may be a stop word.
when I run this debug query, I don't get any lexemes for "out"
SELECT * FROM ts_debug('english','out back outback');
alias     | description     | token   | dictionaries   | dictionary   | lexemes
----------+-----------------+---------+----------------+--------------+-----------
asciiword | Word, all ASCII | out     | {english_stem} | english_stem | {}
blank     | Space symbols   |         | {}             |              |
asciiword | Word, all ASCII | back    | {english_stem} | english_stem | {back}
blank     | Space symbols   |         | {}             |              |
asciiword | Word, all ASCII | outback | {english_stem} | english_stem | {outback}
So now I ask how do I modify the stop word list to remove a word?
UPDATE:
Here is the query that I am currently using:
select id,name,address,city,state,likes
from view_business_favorite_count
where textsearchable_index_col @@ to_tsquery('simple',$1)
ORDER BY ts_rank(textsearchable_index_col, to_tsquery('simple',$1)) DESC
When I execute the query (I'm using Strongloop Loopback + Express + Node), I pass the pattern in to replace the $1 param. The pattern (as stated above) will look something like "keyword:*" or "keyword1 & keyword2 & ... & keywordN:*".
thanks
The problem here is that you are searching against business names and, as @Daniel correctly pointed out, the 'english' dictionary will not help you find a "fuzzy" match for non-dictionary words like "Outback Steakhouse".
'simple' dictionary
The 'simple' dictionary on its own will not help you either: in your case business names will match only on exact words, as all words are unstemmed.
'simple' dictionary + pg_trgm
But if you use 'simple' dictionary together with pg_trgm module - it will be exactly what you need, in particular:
for to_tsvector('simple','<business name>') you don't need to worry about the stop words "hack"; you will get all the lexemes unstemmed;
using similarity() from pg_trgm you will get the highest "rank" for the best match,
look at this:
WITH pg_trgm_test(business_name,search_pattern) AS ( VALUES
('Outback Steakhouse','ou'),
('Outback Steakhouse','out'),
('Outback Steakhouse','outb')
)
SELECT business_name,search_pattern,similarity(business_name,search_pattern)
FROM pg_trgm_test;
result:
business_name | search_pattern | similarity
--------------------+----------------+------------
Outback Steakhouse | ou | 0.1
Outback Steakhouse | out | 0.15
Outback Steakhouse | outb | 0.2
(3 rows)
Ordering by similarity DESC you will be able to get what you need.
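To see where those numbers come from, here is a rough pure-Python approximation of how pg_trgm scores two strings: lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, collect the distinct three-character windows, and divide the shared trigrams by the total distinct trigrams. The function names are hypothetical and the sketch only handles ASCII text; it is not the extension's actual implementation.

```python
import re


def trigrams(text):
    """Distinct padded trigrams of every alphanumeric word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    shared = len(ta & tb)
    return shared / (len(ta) + len(tb) - shared)


for pattern in ("ou", "out", "outb"):
    print(pattern, similarity("Outback Steakhouse", pattern))
```

"Outback Steakhouse" yields 19 distinct trigrams, and "ou", "out" and "outb" share 2, 3 and 4 of them, which gives 2/20, 3/20 and 4/20. Keep in mind that these scores sit below pg_trgm's default similarity threshold of 0.3, which is what the % operator compares against, so very short patterns may need a lower threshold to match at all.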
UPDATE
For your situation there are 2 possible options.
Option #1.
Just create a trgm index on the name column of the view_business_favorite_count table; the index definition may be the following:
CREATE INDEX name_trgm_idx ON view_business_favorite_count USING gin (name gin_trgm_ops);
The query will look something like this:
SELECT
id,
name,
address,
city,
state,
likes,
similarity(name,$1) AS trgm_rank -- similarity score
FROM
view_business_favorite_count
WHERE
name % $1 -- trgm search
ORDER BY trgm_rank DESC;
Option #2.
With full text search, you need to:
create a separate table, for example unnested_business_names, where you will store 2 columns: 1st column will keep all lexemes from to_tsvector('simple',name) function, 2nd column will have vbfc_id(FK for id from view_business_favorite_count table);
add trgm index for the column, which contains lexemes;
add trigger for unnested_business_names, which will update OR insert OR delete new values from view_business_favorite_count to keep all words up to date

Longest matching substring

How would you search for the longest match within a varchar variable? For example, table GOB has entries as follows:
magic_word | prize
===================
sh| $0.20
sha| $0.40
shaz| $0.60
shaza| $1.50
I would like to write a plpgsql function that takes amongst other arguments a string as input (e.g. shazam), and returns the 'prize' column on the row of GOB with the longest matching substring. In the example shown, that would be $1.50 on the row with magic_word shaza.
All the function format I can handle, it's just the matching bit. I can't think of an elegant solution. I'm guessing it's probably really easy, but I am scratching my head. I don't know the input string at the start, as it will be derived from the result of a query on another table.
Any ideas?
Simple solution
SELECT magic_word
FROM gob
WHERE 'shazam' LIKE (magic_word || '%')
ORDER BY magic_word DESC
LIMIT 1;
This works because the longest match sorts last - so I sort DESC and pick the first match.
I am assuming from your example that you want to match left-anchored, from the beginning of the string. If you want to match anywhere in the string (which is more expensive and even harder to back up with an index), use:
...
WHERE 'shazam' LIKE ('%' || magic_word || '%')
...
SQL Fiddle.
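A runnable sketch of the left-anchored query, adapted to return the prize column that the question asks for. It uses an in-memory SQLite database as a stand-in for PostgreSQL, the function name `longest_prefix_prize` is hypothetical, and it assumes magic_word values contain no LIKE wildcards (% or _).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gob (magic_word TEXT, prize TEXT)")
conn.executemany(
    "INSERT INTO gob (magic_word, prize) VALUES (?, ?)",
    [("sh", "$0.20"), ("sha", "$0.40"), ("shaz", "$0.60"), ("shaza", "$1.50")],
)


def longest_prefix_prize(conn, text):
    """Return the prize of the longest magic_word that is a prefix of text."""
    row = conn.execute(
        """SELECT prize
           FROM gob
           WHERE ? LIKE (magic_word || '%')
           ORDER BY magic_word DESC   -- among prefixes, the longest sorts last
           LIMIT 1""",
        (text,),
    ).fetchone()
    return row[0] if row else None


print(longest_prefix_prize(conn, "shazam"))
```

Passing the input as a bound parameter fits the case where the string comes from another query rather than being known up front.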
Performance
The query is not sargable. It might help quite a bit if you had additional information, like a minimum length that you could base an index on, to reduce the number of rows to consider. It needs to be criteria that gets you less than ~ 5% of the table to be effective. So, initials (a natural minimum pick) may or may not be useful. But two or three letters at the start might help quite a bit.
In fact you could optimize this iteratively. Something along the line of:
Try a partial index of words with 15 letters+
If not found, try 12 letters+
If not found, try 9 letters+
...
A simple case of what I outlined in this related answer on dba.SE:
Can spatial index help a “range - order by - limit” query
Another approach would be to use a trigram index. You'd need the additional module pg_trgm for that. Normally you would search with a short pattern in a table with longer strings. But trigrams work for your reverse approach, too, with some limitations. Obviously you couldn't match a string with just two characters in the middle of a longer string using trigrams ... Test for corner cases.
There are a number of answers here on SO with more information. Example:
Effectively query on column that includes a substring
Advanced solution
Consider the solution under this closely related question for a whole table of search strings. Implemented with a recursive CTE:
Longest Prefix Match
How about:
1.
select max(FOO.matchingValue)
from (
    -- every magic_word that is a prefix of the input string
    select magic_word as matchingValue
    from T
    where substr('abracadabra', 1, length(magic_word)) = magic_word
) as FOO
2.
select prize
from T
join (
    select max(FOO.matchingValue) as MaxValue
    from (
        select magic_word as matchingValue
        from T
        where substr('abracadabra', 1, length(magic_word)) = magic_word
    ) as FOO
) as BAR on BAR.MaxValue = T.magic_word

Sqlite statement match exact string iphone

I have the following words in table database:
name id
aba 0
abac 1
abaca 2
abace 3
I want to make the following select:
Select id from words where name='abaca';
I tried but it doesn't work. I want the exact match of the word I'm entering to match.
I tried with your entries, and the following query works perfectly fine for an exact match:
SELECT * FROM Match WHERE title='abaca';
Where Match is the table name with following schema
CREATE TABLE Match (title text,recordId integer)
Following is the record inserted
I made the same mistake by using Match as the table name as you did by using id as a column name, and only realized it while posting the code. But that should not really make much difference here.
Have you used the query in some code to fetch results from the database? If so, you should consider avoiding keywords as variable and column names. Just a suggestion.
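For reference, a small sketch showing that the equality test from the question does return only the exact match in SQLite. It uses Python's built-in sqlite3 module rather than the iPhone SQLite API, with the table and rows copied from the question; the helper name `find_ids` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (name TEXT, id INTEGER)")
conn.executemany(
    "INSERT INTO words (name, id) VALUES (?, ?)",
    [("aba", 0), ("abac", 1), ("abaca", 2), ("abace", 3)],
)


def find_ids(conn, word):
    """Return the ids of rows whose name equals word exactly."""
    cursor = conn.execute("SELECT id FROM words WHERE name = ?", (word,))
    return [row[0] for row in cursor]


print(find_ids(conn, "abaca"))
```

If the same statement returns nothing in the app, the cause is more likely in how the statement is prepared or how the value is bound than in the SQL itself.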