Why is this PostgreSQL full text search query returning ts_rank of 0?

Before I invest in using solr or lucene or sphinx, I wanted to try to implement a search capability on my system using postgresql full text search.
I have a national list of businesses in a table that I want to search. I created a ts vector that combines the business name and city so that I can do a search like "outback atlanta".
I am also implementing an auto-completion function by using the wildcard capability of the search: I append ":*" to the last keyword and insert " & " between keywords, so the search pattern "outback atl" turns into "outback & atl:*" before being converted into a query with to_tsquery().
Here's the problem that I am running into currently.
if the search pattern is entered as "ou", many "Outback Steakhouse" records are returned.
if the search pattern is entered as "out", no results are returned.
if the search pattern is entered as "outb", many "Outback Steakhouse" records are returned.
doing a little debugging, I came up with this:
select ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('ou:*')) as "ou",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('out:*')) as "out",
ts_rank(to_tsvector('Outback Steakhouse'),to_tsquery('outb:*')) as "outb"
which results in this:
    ou     | out |   outb
-----------+-----+-----------
 0.0607927 |   0 | 0.0607927
What am I doing wrong?
Is this a limitation of pg full text search?
Is there something that I can do with my dictionary or configuration to get around this anomaly?
UPDATE:
I think that "out" may be a stop word.
When I run this debug query, I don't get any lexemes for "out":
SELECT * FROM ts_debug('english','out back outback');
   alias   |   description   |  token  |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+---------+----------------+--------------+-----------
 asciiword | Word, all ASCII | out     | {english_stem} | english_stem | {}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | back    | {english_stem} | english_stem | {back}
 blank     | Space symbols   |         | {}             |              |
 asciiword | Word, all ASCII | outback | {english_stem} | english_stem | {outback}
So now I ask: how do I modify the stop word list to remove a word?
UPDATE:
Here is the query that I am currently using:
select id,name,address,city,state,likes
from view_business_favorite_count
where textsearchable_index_col @@ to_tsquery('simple',$1)
ORDER BY ts_rank(textsearchable_index_col, to_tsquery('simple',$1)) DESC
When I execute the query (I'm using Strongloop Loopback + Express + Node), I pass the pattern in to replace the $1 param. The pattern (as stated above) will look something like "keyword:*" or "keyword1 & keyword2 & ... & keywordN:*".
thanks

The problem here is that you are searching against business names, and as @Daniel correctly pointed out, the 'english' dictionary will not help you find "fuzzy" matches for NON-dictionary words like "Outback Steakhouse" etc.
'simple' dictionary
The 'simple' dictionary on its own will not help you either; in your case business names will only work for exact matches, since all words are left unstemmed.
'simple' dictionary + pg_trgm
But if you use the 'simple' dictionary together with the pg_trgm module, it will be exactly what you need; in particular:
with to_tsvector('simple','<business name>') you don't need to worry about the stop-word "hack", and you get all the lexemes unstemmed;
using similarity() from pg_trgm you will get the highest "rank" for the best match.
Look at this:
WITH pg_trgm_test(business_name,search_pattern) AS ( VALUES
('Outback Steakhouse','ou'),
('Outback Steakhouse','out'),
('Outback Steakhouse','outb')
)
SELECT business_name,search_pattern,similarity(business_name,search_pattern)
FROM pg_trgm_test;
result:
business_name | search_pattern | similarity
--------------------+----------------+------------
Outback Steakhouse | ou | 0.1
Outback Steakhouse | out | 0.15
Outback Steakhouse | outb | 0.2
(3 rows)
Ordering by similarity DESC, you will be able to get what you need.
UPDATE
For your situation there are 2 possible options.
Option #1.
Just create a trgm index on the name column of the view_business_favorite_count table; the index definition may be the following:
CREATE INDEX name_trgm_idx ON view_business_favorite_count USING gin (name gin_trgm_ops);
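Note that the pg_trgm extension must already be installed in the database before this index can be created; if it is not, it ships as a standard contrib module:
CREATE EXTENSION IF NOT EXISTS pg_trgm;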
The query will look something like this:
SELECT
id,
name,
address,
city,
state,
likes,
similarity(name,$1) AS trgm_rank -- similarity score
FROM
view_business_favorite_count
WHERE
name % $1 -- trgm search
ORDER BY trgm_rank DESC;
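Keep in mind that the % operator only returns rows above pg_trgm's similarity threshold (0.3 by default), which very short patterns like 'ou' (similarity 0.1 in the example above) will not reach. A sketch of how you could lower the threshold for the session:
SELECT set_limit(0.1);                      -- pg_trgm helper function
-- or, on newer PostgreSQL versions:
SET pg_trgm.similarity_threshold = 0.1;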
Option #2.
With full text search, you need to:
create a separate table, for example unnested_business_names, with 2 columns: the 1st column keeps all lexemes from to_tsvector('simple', name), the 2nd column holds vbfc_id (a FK to id in the view_business_favorite_count table);
add a trgm index on the column that contains the lexemes;
add a trigger for unnested_business_names which inserts, updates or deletes values whenever view_business_favorite_count changes, to keep all words up to date (see the sketch below).
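A rough sketch of what that setup could look like (the table, column and index names here are assumptions, and the sync trigger is left out):
CREATE TABLE unnested_business_names (
    vbfc_id bigint NOT NULL,   -- FK to view_business_favorite_count.id
    lexeme  text   NOT NULL    -- one unstemmed, lower-cased word from the name
);

-- initial load: roughly what to_tsvector('simple', name) produces per row
INSERT INTO unnested_business_names (vbfc_id, lexeme)
SELECT b.id, w
FROM view_business_favorite_count b,
     regexp_split_to_table(lower(b.name), '\s+') AS w;

-- trgm index on the lexemes
CREATE INDEX unnested_business_names_lexeme_trgm_idx
    ON unnested_business_names USING gin (lexeme gin_trgm_ops);
A trigger (or the application code) then keeps unnested_business_names in sync whenever the underlying business data changes.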

Related

How to perform a search query on a column value containing a string with comma separated values?

I have a table which looks like below
date       | tags                     | name
-----------+--------------------------+------
2018-10-08 | 100.21.100.1, cpu, del   | ZONE1
2018-10-08 | 100.21.100.1, mem, blr   | ZONE2
2018-10-08 | 110.22.100.3, cpu, blr   | ZONE3
2018-10-09 | 110.22.100.3, down, hyd  | ZONE2
2018-10-09 | 110.22.100.3, down, del  | ZONE1
I want to select the name for those rows which have certain strings in the tags column
Here column tags has values which are strings containing comma separated values.
For example, I have a list of strings ["down", "110.22.100.3"]. Now if I look up the table for rows whose tags contain the strings in the list, I should get the last two rows, which have the names ZONE2 and ZONE1 respectively.
Now I know there is something called the IN operator, but I am not quite sure how to use it here.
I tried something like below
select name from zone_table where 'down, 110.22.100.3' in tags;
But I get a syntax error. How do I do it?
You can do something like this.
select name from zone_table where
string_to_array(replace(tags,' ',''),',') @>
string_to_array(replace('down, 110.22.100.3',' ',''),',');
1) replace() deletes the spaces in the existing string so that string_to_array() splits it cleanly, without stray spaces at the front of the elements.
2) string_to_array() converts your string to an array, split on commas.
3) @> is the "contains" operator.
(OR)
If you want to match as a whole
select name from zone_table where POSITION('down, 110.22.100.3' in tags)!=0
For separate matches you can do
select name from zone_table where POSITION('down' in tags)!=0 and
POSITION('110.22.100.3' in tags)!=0
More about position here
We can try using the LIKE operator here, and check for the presence of each tag in the CSV tag list:
SELECT name
FROM zone_table
WHERE ', ' || tags || ',' LIKE '%, down,%' AND ', ' || tags || ',' LIKE '%, 110.22.100.3,%';
Demo
Important Note: It is generally bad practice to store CSV data in your SQL tables, for the very reason that it is unnormalized and makes it hard to work with. It would be much better design to have each individual tag persisted in its own record.
demo: db<>fiddle
I would do a check with array overlapping (&& operator):
SELECT name
FROM zone_table
WHERE string_to_array('down, 110.22.100.3', ',') && string_to_array(tags,',')
Split your string lists (the column values and the compare text 'down, 110.22.100.3') into arrays with string_to_array() (of course if your compare text is an array already you don't have to split it)
Now the && operator checks if both arrays overlap: it checks whether at least one array element is present in both arrays (documentation).
Notice:
"date" is a reserved word in Postgres. I recommend to rename this column.
In your examples the delimiter of your string lists is ", " and not ",". You should take care of the whitespace. Either your string split delimiter is ", " too or you should concatenate the strings with a simple "," which makes some things easier (aside the fully agreed thoughts about storing the values as string lists made by #TimBiegeleisen)
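For example, splitting on ', ' instead of ',' (a sketch along the lines of the query above):
SELECT name
FROM zone_table
WHERE string_to_array('down, 110.22.100.3', ', ') && string_to_array(tags, ', ');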

Postgres Find and Return Keywords From List Within Select

I have a simple postgres table that contains a comments (text) column.
Within a view, I need to search that comments field for a list of words and then return a comma separated list of the words found as a column (as well as a bunch of normal columns).
The list of defined keywords contains about 20 words, e.g. apples, bananas, pear, peach, plum.
Ideal result would be something like:
id | comments | keywords
-----------------------------------------------------
1 | I like bananas! | bananas
2 | I like apples. | apples
3 | I don't like fruit |
4 | I like apples and bananas! | apples,bananas
I'm thinking I need to do a sub query and array_agg? Or possibly 'where in'. But I can't figure out how to bolt it together.
Many thanks,
Steve
You can use the full-text search facilities to achieve this:
Set up a new ispell dictionary with your list of words.
Create a full-text search configuration based on your dictionary. Don't forget to remove all other dictionaries from the configuration, because in your case all other words are effectively stopwords. A sketch of both steps is shown below.
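A sketch of those two steps (the dictionary, configuration and file names below are assumptions; keywords.dict and keywords.affix have to be placed in the tsearch_data directory of your installation):
CREATE TEXT SEARCH DICTIONARY keyword_dict (
    TEMPLATE = ispell,
    DictFile = keywords,   -- keywords.dict listing apples, bananas, ...
    AffFile  = keywords    -- keywords.affix (may contain no rules)
);

CREATE TEXT SEARCH CONFIGURATION keyword_cfg (COPY = simple);

-- map word-like tokens to the keyword dictionary only;
-- tokens it does not recognize are dropped from the vector/query
ALTER TEXT SEARCH CONFIGURATION keyword_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH keyword_dict;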
After that when you execute
select plainto_tsquery('<your config name>', 'I like apples and bananas!')
you will get only your keywords: 'apples' & 'bananas', or even 'apple' & 'banana' if you set up the dictionary properly.
By default, the english configuration uses snowball dictionaries, which strip word endings, so if you run
select plainto_tsquery('english', 'I like apples and bananas!')
you will get
'like' & 'appl' & 'banana'
which is not exactly suitable for your case.
Another, easier way (but slower):
Create a dictionary table:
create table keywords (nm text);
insert into keywords (nm)
values ('apples'), ('bananas');
Execute the following script against your text to extract keywords
select string_agg(regexp_replace(foo, '[^a-zA-Z\-]*', '', 'ig'), ',') s
from regexp_split_to_table('I like apples and bananas!', E'\\s+') foo
where regexp_replace(foo, '[^a-zA-Z\-]*', '', 'ig') in (select nm from keywords)
This solution is worse in terms of semantics, as banana and bananas will be treated as different keywords.

How to match rows with one or more words in query, but without any words not in query?

I have a table in a MySQL database that has a list of comma separated tags in it.
I want users to be able to enter a list of comma separated tags and then use Sphinx or MySQL to select rows that have at least one of the tags in the query but not any tags the query doesn't have.
The query can have additional tags that are not in the rows, but the rows should not be matched if they have tags not in the query.
I want to use either Sphinx or MySQL to do the searching.
Here's an example:
creatures:
----------------------------
| name | tags |
----------------------------
| cat | wily,hairy |
| dog | cute,hairy |
| fly | ugly |
| bear | grumpy,hungry |
----------------------------
Example searches:
wily,hairy <-- should match cat
cute,hairy,happy <-- should match dog
happy,cute <-- no match (dog has hairy)
ugly,yuck,gross <-- should match fly
hairy <-- no match (dog has cute, cat has wily)
grumpy <-- no match (bear has hungry)
grumpy,hungry <-- should match bear
wily,grumpy,hungry <-- should match bear
Is it possible to do this with Sphinx or MySQL?
To reiterate, the query will be a list of comma separated tags and rows that have at least one of the entered tags but not any tags the query doesn't have should be selected.
Sphinx expression ranker should be able to do this.
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1') AND w > 0
OPTION ranker=expr('IF(word_count>=tags_len,1,0)');
basically you want the number of matched tags never to be less than the number of tags.
Note this just gives all documents a weight of 1; if you want more elaborate ranking (e.g. to also match other keywords) it gets more complicated.
You need index_field_lengths enabled on the index to get the tags_len attribute.
(The same concept is obviously possible in MySQL, probably using FIND_IN_SET to do the matching, and either a second column to store the number of tags or computing it with, say, the REPLACE function; a sketch follows below.)
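A hedged MySQL sketch of that idea, using the creatures table from the question (the tag list hard-coded in the WHERE clause stands in for the user's query):
SELECT name
FROM creatures
-- count how many of the row's tags appear in the query ...
WHERE (  (FIND_IN_SET('cute',  tags) > 0)
       + (FIND_IN_SET('hairy', tags) > 0)
       + (FIND_IN_SET('happy', tags) > 0) )
-- ... and require that to equal the row's total number of tags,
-- i.e. the row has no tag that is missing from the query
      = CHAR_LENGTH(tags) - CHAR_LENGTH(REPLACE(tags, ',', '')) + 1;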
Edit to add, details about multiple fields...
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);
The SUM function is run for each field in turn, so you need to use the user_weight system to be able to tell which field is currently being enumerated.

Return rows where words match in two columns OR match in one column and the other column is empty?

This is a follow-up to another question I recently asked.
I currently have a SphinxQL query like this:
SELECT * FROM my_index
WHERE MATCH('@field1 "a few words"/1 @field2 "more text here"/1')
However, I would still like it to match rows in the case where one of the fields in the row is empty.
For example, let's say the following rows exist in the database:
field1 | field2
-----------------------
words in here | text in here
| text in here
The above query would match the first row, but it would not match the second row because the quorum operator specifies that there has to be one or more matches for each field.
Is what I'm asking possible?
The actual query I'm trying to make this work with was provided in Barry Hunter's answer to my previous question:
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy"/1 @tags2 "one two thee"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);
The first problem is that Sphinx doesn't index "empty", so you can't search for it. (Well, actually the field_len attribute will be zero, but it can be hard to combine an attribute filter with MATCH().)
... so arrange for empty to be something to index
sql_query = SELECT id,...,IF(tags='','_empty_',tags) AS tags FROM ...
Then modify the query. As it happens your quorum search is easy!
#field1 "a few words _empty_"/1
It's just another word. (A more complex query would just have to be OR'ed with the word.)
Then there is making it work within your complex query. But as luck would have it, it's really easy: _empty_ is just another word, and in the case of the field being empty, that one word will match (i.e. it is the field that has no words, not the query).
So just add _empty_ into the two quorums and you're done!
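In other words, the earlier combined query would presumably end up looking something like this (a sketch):
sphinxQL> SELECT *, WEIGHT() AS w FROM index
WHERE MATCH('@tags "cute hairy happy _empty_"/1 @tags2 "one two thee _empty_"/1') AND w = 2
OPTION ranker=expr('SUM(IF(word_count>=IF(user_weight=2,tags2_len,tags_len),1,0))'),
field_weights=(tags=1,tags2=2);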

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigam in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
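If you want to see which trigrams pg_trgm actually extracts for a value, you can inspect them with show_trgm() (output abbreviated here):
SELECT show_trgm('Simple Cafe');
-- lower-cased, word-padded trigrams such as "sim", "imp", "mpl", "caf", "afe", ...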
The catch is, trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
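A minimal sketch of such a cjk_chars() helper (the Unicode ranges and the locations table name below are illustrative assumptions, not complete CJK coverage):
CREATE OR REPLACE FUNCTION cjk_chars(input text) RETURNS text[]
    LANGUAGE sql IMMUTABLE AS
$$
    -- collect single CJK-ish characters (common Han, Hiragana, Katakana ranges),
    -- returning NULL when none are present
    SELECT NULLIF(
        ARRAY(SELECT (regexp_matches(input, '[\u3040-\u30FF\u4E00-\u9FFF]', 'g'))[1]),
        '{}'::text[]
    );
$$;

-- expression index that GIN can use for the array containment (@>) check above
CREATE INDEX locations_cjk_chars_idx ON locations USING gin (cjk_chars(location_name));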
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language,location_name) @@ to_tsquery('english','cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx
ON tbl USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges, for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx
ON tbl USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful if you only want results of a specific language per query, and it would be very fast in that case.