Look up a word in a whitespace-separated string in Postgres

I have a character varying field in postgres containing a 1-white-space-separated set of strings. E.g.:
--> one two three <--
--> apples bananas pears <--
I put --> and <-- to show where the strings start and end (they are not part of the stored string itself).
I need to query this field to find out if the whole string contains a certain word (apple for instance). A possible query would be
SELECT * FROM table WHERE thefield LIKE '%apple%'
But it sucks and won't scale: B-tree indexes only help when the pattern is anchored to the beginning of the string, while in my case the searched word could be positioned anywhere in the field.
How would you recommend approaching the problem?

Consider database-normalization first.
While working with your current design, support the query with a trigram index; that will be pretty fast.
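A minimal sketch of that setup (the index name is invented here; "table" and thefield are the names from the question, with "table" quoted because it is a reserved word):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX thefield_trgm_idx ON "table" USING gin (thefield gin_trgm_ops);

-- the planner can now use the index for unanchored patterns:
SELECT * FROM "table" WHERE thefield LIKE '%apple%';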
More details and links in this closely related answer:
PostgreSQL LIKE query performance variations
Even more about pattern matching and indexes in this related answer on dba.SE:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Related

Postgres LIKE query triggers full table scan

I have a Postgres query where we have several indices set up, including one on a text field where we have a GIN index. My understanding of this based on the pg_trgm documentation is that it's only applicable if the search string is made up of alphanumeric text. Testing bears this out and in a database with tens of millions of records, doing something like the following works great:
SELECT * FROM my_table WHERE target_field LIKE '%foo%'
I've read in various places that anything that's not an alphanumeric string is treated as a separate word in the trigram search, so something like the following also works quite well:
SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%'
However, someone ran a search that was literally just three question marks in a row, and it triggered a full table scan. For some reason, multiple ampersands or question marks used alone in the query are treated differently than a single one placed next to or among actual alphanumeric characters.
The research I've done implies it might be how some database drivers handle the question mark, interpreting it as a parameter placeholder, getting confused when no parameter is supplied, and falling back to a table scan. I don't really believe this is the case; I'd expect that to throw an error rather than complete the query, and completing it anyway would seem like a design flaw.
What makes more sense is that a question mark isn't an alphanumeric character and is thus treated differently. In some technologies, common symbols such as & are considered alphanumeric, but I don't think that's the case with Postgres. In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
What's weird is that I can search for %foo & bar%, which seems to work fine. I can even search for %&% and it returns quickly, though not with the results I wanted. But if I put (for example) three of them together like this: %&&&%, it triggers a full table scan.
After running various experiments, here's what I've seen:
1. %%: uses the index
2. %&%: uses the index
3. %?%: uses the index
4. %foo & bar%: uses the index
5. %foo ? bar%: uses the index
6. %foo && bar%: uses the index
7. %foo ?? bar%: uses the index
8. %&&%: triggers a full table scan
9. %??%: triggers a full table scan
10. %foo&bar%: uses the index, but returns no results
I think that all of those make sense until you get to #8 and #9. And if the ampersand were a word boundary, shouldn't #10 return results?
Anyone have an explanation of why multiple consecutive punctuation characters would be treated differently than a single punctuation character?
I can't reproduce this in v11 on a table full of md5 hashes: I get seq scans (full table scans) for the first 3 of your patterns.
If I force them to use the index by setting enable_seqscan=false, then I do get it to use the index, but it is actually slower than doing the seq scan. So the planner made the right call there. How about for you? You shouldn't force it to use the index on principle when it is actually slower.
It would be interesting to see the estimated number of rows it thinks it will return for all of those examples.
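For example, plain EXPLAIN shows the estimates without running the query (names taken from the question):
EXPLAIN SELECT * FROM my_table WHERE target_field LIKE '%&&%';
-- compare the estimated row counts in the plan for each pattern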
In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
The G in GIN is for "generalized". You can't make blanket statements like that about something which is generalized. They don't even need to operate on text at all. But in your case, you are using the LIKE operator, and the LIKE operator doesn't care about word boundaries. Any GIN index which claims to support the LIKE operator must return the correct results for the LIKE operator. If it can't do that, then it is a bug for it to claim to support it.
It is true that pg_trgm treats & and ? the same as whitespace when extracting trigrams, but it is obliged to insulate LIKE from the effects of this decision. It does this in two ways. One is that it returns "MAYBE" results, meaning all the tuples it reports must be rechecked to see if they actually satisfy the LIKE. So '%foo&bar%' and '%foo & bar%' will return the same set of tuples to the heap scan, but the heap scan will recheck them and so finally return a different set to the user, depending on which ones survive the recheck. The second is that if pg_trgm can't extract any trigrams at all out of the query string, then it must return the entire table to be rechecked. This is what happens with '%%', '%?%', '%??%', etc. Of course, rechecking all rows is slower than just doing the seq scan in the first place.
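You can see which trigrams pg_trgm extracts with its show_trgm() function; a quick check, assuming the extension is installed:
SELECT show_trgm('foo & bar'); -- trigrams for 'foo' and 'bar'; the '&' acts as whitespace
SELECT show_trgm('&&&');       -- empty array: nothing to look up, so every row must be rechecked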

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table named table and a string column named column, I want to search for the word word in that column in the following way: exact matches on top, followed by prefix matches, and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
             case
                 when column like 'word' then 1
                 when column like 'word%' then 2
                 when column like '%word' then 3
             end as rank
      from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
   or column like 'word%'
   or column like '%word'
order by case
             when column like 'word' then 1
             when column like 'word%' then 2
             when column like '%word' then 3
         end;
Now my question is: which one of the two solutions is more efficient? Or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
In the WHERE clause, the like 'word' condition is not needed, as it is covered by like 'word%'; it might make the DB do two checks instead of one.
But the biggest problem is the third condition, like '%word', as it has no way to be optimized by an index.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend you explore fuzzy string matching and full text search; it looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the pg_trgm extension, as it will enable you to add a GIN index on the column that speeds up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still use an FTS query to prefilter and then apply this ranking logic afterwards.
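A rough sketch of that prefilter, using the question's placeholder names (quoted because they are reserved words). Note that an FTS match is word-based: the prefix query below covers exact and prefix hits, but a pure suffix hit such as 'password' for '%word' would not survive the prefilter.
SELECT "column"
FROM "table"
WHERE to_tsvector('simple', "column") @@ to_tsquery('simple', 'word:*')
ORDER BY CASE
             WHEN "column" LIKE 'word'  THEN 1
             WHEN "column" LIKE 'word%' THEN 2
             WHEN "column" LIKE '%word' THEN 3
             ELSE 4
         END;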
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
I even wrote a post recently when I added FTS to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like either trigrams are used or tsvectors are used, but never both. So my questions are: Are they ever used together? If so, how? Do trigrams replace full text search? Are trigrams more accurate? And how do trigrams perform?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % looks for similarity between the two entire strings, not for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') @@ to_tsquery('cats'); --true
This returns true because to_tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words, like 'is' and 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query, but that is not really possible (not in any efficient way, anyway). They do not really know that a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to find similar words, allowing you to implement a "did you mean..." type feature, but this would require maintaining a separate table containing all the words used in your search field.
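A rough sketch of that side table (all names here are invented for illustration):
-- one row per distinct word used in the search field
CREATE TABLE search_words (word text PRIMARY KEY);

CREATE INDEX search_words_trgm_idx ON search_words USING gist (word gist_trgm_ops);

-- candidate corrections for a misspelt term
SELECT word, similarity(word, 'catz') AS sml
FROM search_words
WHERE word % 'catz'      -- above the pg_trgm similarity threshold
ORDER BY word <-> 'catz' -- nearest first; GiST supports this ordering
LIMIT 5;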
If you have some commonly misspelt words/phrases that you want the text index to match, you might want to look at synonym dictionaries.

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple', location_name) @@ to_tsquery('simple', 'cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
Catch is, trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, matches all characters in the CJK Unicode ranges via regexp_matches, and returns an array of any such characters, or NULL if there are none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
        WHEN cjk_chars('query') IS NOT NULL THEN
             cjk_chars(location_name) @> cjk_chars('query')
             AND location_name LIKE '%query%'
        ELSE
             <tsvector/trigrams>
      END
Ta-da, unigrams!
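A hypothetical sketch of such a function (function and index names invented here); the regexp covers only the main CJK Unified Ideographs block (U+4E00..U+9FFF), so real coverage would need additional ranges:
CREATE OR REPLACE FUNCTION cjk_chars(txt text) RETURNS text[]
LANGUAGE sql IMMUTABLE AS $$
  SELECT NULLIF(
           ARRAY(SELECT (regexp_matches(txt, '[\u4e00-\u9fff]', 'g'))[1]),
           '{}'
         );
$$;

-- functional GIN index; the default array operator class supports @> containment
CREATE INDEX location_cjk_idx ON location USING gin (cjk_chars(location_name));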
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
    location_name text,
    location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin (to_tsvector(location_name_language::regconfig, location_name));
and for search:
SELECT * FROM location
WHERE to_tsvector(location_name_language::regconfig, location_name)
      @@ to_tsquery('english', 'cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's full text search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module, but preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx ON tbl
USING gist (location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class generally ignores case, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here, which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
* Very fast.
* Retains capability to find partial matches.
* Adds capability for fuzzy search. Check out the % operator and set_limit().
A GiST index is also very fast for queries with LIMIT n to select the n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx ON tbl
USING gist (location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful if you only want results of a specific language per query, and it would be very fast in that case.
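For the partial index to be used, the query has to repeat the index predicate; a hypothetical example:
SELECT * FROM tbl
WHERE location_name_language = 'German' -- matches the index predicate
  AND location_name ILIKE '%cafe%';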

ends with (suffix) and contains string search using MATCH in SQLite FTS

I am using SQLite FTS extension in my iOS application.
It performs well, but the problem is that it only matches string prefixes (starts-with keyword searches).
i.e.
This works:
SELECT * FROM tablename WHERE columnname MATCH 'searchterm*'
but following two don't:
SELECT * FROM tablename WHERE columnname MATCH '*searchterm'
SELECT * FROM tablename WHERE columnname MATCH '*searchterm*'
Is there any workaround for this, or any way to use FTS to build a query similar to a LIKE '%searchterm%' query?
EDIT:
As pointed out by Retterdesdialogs, storing the entire text in reverse order and running a prefix search on the reversed string is a possible solution for the ends-with/suffix search problem, which was my original question, but it won't work for a 'contains' search. I have updated the question accordingly.
In my iOS and Android applications, I have shied away from FTS for exactly this reason: it doesn't support substring matches, due to the lack of suffix queries.
The workarounds seem complicated.
I have resorted to using LIKE queries, which, while less performant than MATCH, have served my needs.
The workaround is to store the reversed string in an extra column. See this link (it's not exactly the same, but it should give an idea):
Search Suffix using Full Text Search
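A minimal sketch of the reverse-column idea (table and column names invented; SQLite has no built-in reverse(), so the application must reverse the text on insert and reverse the search term before querying):
-- FTS4 table holding the original text and a reversed copy
CREATE VIRTUAL TABLE docs USING fts4(content, content_rev);

-- the application inserts both forms
INSERT INTO docs (content, content_rev) VALUES ('searchterm here', 'ereh mrethcraes');

-- suffix search: a prefix MATCH against the reversed column
SELECT content FROM docs WHERE content_rev MATCH 'mrethcraes*';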
To get it to work for contains queries, you need to store all suffixes of the terms you want to be able to search. This has the downside of making the database really large, but that can be avoided by compressing the data.
SQLite FTS contains and suffix matches