I have a Postgres table with several indices set up, including a GIN index on a text field. My understanding, based on the pg_trgm documentation, is that it's only applicable if the search string is made up of alphanumeric text. Testing bears this out, and in a database with tens of millions of records, doing something like the following works great:
SELECT * FROM my_table WHERE target_field LIKE '%foo%'
I've read in various places that anything that's not an alphanumeric string is treated as a separate word in the trigram search, so something like the following also works quite well:
SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%'
However, someone ran a search that was literally just three question marks in a row and it triggered a full table scan. For some reason, when multiple ampersands or question marks are used alone in the query, they're treated differently than a single one placed next to or among actual alphanumeric characters.
The research I've done has implied that it might be how some database drivers handle the question mark, sometimes interpreting it as a parameter that needs to be supplied, then getting confused because it can't find the parameters and triggering a table scan. I don't really believe this is the case. I'd be more inclined to believe it would throw an error rather than complete the query, and running it anyway seems like a design flaw.
What makes more sense is that a question mark isn't an alpha-numeric character and thus it's treated differently. In some technologies, common symbols such as & are considered alpha-numeric, but I don't think that's the case with Postgres. In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
What's weird is that I can search for %foo & bar%, which seems to work fine. I can even search for %&% and it returns quickly, though not with the results I wanted. But if I put (for example) three of them together like this: %&&&%, it triggers a full table scan.
After running various experiments, here's what I've seen:
1. %%: uses the index
2. %&%: uses the index
3. %?%: uses the index
4. %foo & bar%: uses the index
5. %foo ? bar%: uses the index
6. %foo && bar%: uses the index
7. %foo ?? bar%: uses the index
8. %&&%: triggers a full table scan
9. %??%: triggers a full table scan
10. %foo&bar%: uses the index, but returns no results
I think that all of those make sense until you get to #8 and #9. And if the ampersand were a word boundary, shouldn't #10 return results?
Anyone have an explanation of why multiple consecutive punctuation characters would be treated differently than a single punctuation character?
I can't reproduce this in v11 on a table full of md5 hashes: I get seq scans (full table scans) for the first 3 of your patterns.
If I force them to use the index by setting enable_seqscan=false, then I do get it to use the index, but it is actually slower than doing the seq scan. So it made the right call there. How about for you? You shouldn't force it to use the index just on principle when it is actually slower.
It would be interesting to see the estimated number of rows it thinks it will return for all of those examples.
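For example, something like this (using the table and column names from your examples) would show both the estimate and what actually happens, and lets you compare against the forced index plan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE target_field LIKE '%??%';

-- force the index to compare timings, then restore the default
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE target_field LIKE '%??%';
RESET enable_seqscan;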
In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
The G in GIN is for "generalized". You can't make blanket statements like that about something which is generalized. They don't even need to operate on text at all. But in your case, you are using the LIKE operator, and the LIKE operator doesn't care about word boundaries. Any GIN index which claims to support the LIKE operator must return the correct results for the LIKE operator. If it can't do that, then it is a bug for it to claim to support it.
It is true that pg_trgm treats & and ? the same as white space when extracting trigrams, but it is obliged to insulate LIKE from the effects of this decision. It does this by two methods. One is that it returns "MAYBE" results, meaning all the tuples it reports must be rechecked to see if they actually satisfy the LIKE. So '%foo&bar%' and '%foo & bar%' will return the same set of tuples to the heap scan, but the heap scan will recheck them and so finally return a different set to the user, depending on which ones survive the recheck. The second thing is, if pg_trgm can't extract any trigrams at all out of the query string, then it must return the entire table to then be rechecked. This is what would happen with '%%', '%?%', '%??%', etc. Of course rechecking all rows is slower than just doing the seq scan in the first place.
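You can see exactly which trigrams pg_trgm extracts from a string with its show_trgm() function; for example (the output shown in the comments is approximate):

SELECT show_trgm('foo & bar');  -- trigrams for 'foo' and 'bar' separately, e.g. {"  b","  f"," ba"," fo","ar ","bar","foo","oo "}
SELECT show_trgm('foo&bar');    -- same set: & is treated as a word separator
SELECT show_trgm('???');        -- {}: no trigrams at all, so every row has to be rechecked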
Related
I will be having queries on my database with where clauses similar to this:
SELECT * FROM table WHERE a = 'string_value' AND b = 'other_string_value' AND t > <timestamp>
and less often to this:
SELECT * FROM table WHERE a = 'string_value' AND t > <timestamp>
I have created a multicolumn index on a, b and t, in that order. However I am not sure if it will be optimal for my second - less frequent - query.
Will this index do an index scan on b or skip it and move to the t index immediately? (To be honest, I'm not sure how index scans work exactly.) Should I create a second multi-column index on a and t only for the second query?
The docs state that
'the index is most efficient when there are constraints on the leading (leftmost) columns'
But the example doesn't cover my case, where the 'b' equality column is missing from the WHERE clause.
The 2nd query will be much less effective with the btree index on (a,b,t) because the absence of b means t cannot be used efficiently (it can still be used as an in-index filter, but that is not nearly as good as being used as a start/stop point). An index on (a,t) will be able to support the 2nd query much more efficiently.
But that doesn't mean you have to create that index as well. Indexes take space and must be maintained, so are far from free. It might be better to just live with less-than-optimal plans for the 2nd query, since that query is used "less often". On the other hand, you did bother to post about it, so maybe "less often" is still pretty often. So you might be better off just to build the extra index and spend your time worrying about something else.
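If you do decide to add it, the extra index is a one-liner; a sketch, with a placeholder table name since the real name wasn't given:

CREATE INDEX tbl_a_t_idx ON tbl (a, t);
-- the second query, WHERE a = ... AND t > ..., can then use both columns as start/stop points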
A btree index can be thought of like a phonebook, which is sorted on last name, then first name, then middle name. Your first query is like searching for "people named Mary Smith with a middle name less than Cathy" You can use binary search to efficiently find the first "Mary Smith", then you scan through those until the middle name is > 'Cathy', and you are done. Compare that to "people surnamed Smith with a middle name less than Cathy". Now you have to scan all the Smith's. You can't stop at the first middle name > Cathy, because any change in first name resets the order of the middle names.
Given that b only has 10 distinct values, you could conceivably use the (a,b,t) index in a skip scan quite efficiently. But PostgreSQL doesn't yet implement skip scans natively. You can emulate them, but that is fragile, ugly, a lot of work, and easy to screw up. Nothing you said here makes me think it would be worthwhile to do.
I'm using PostgreSQL 13 and my problem was easily solved with the @> operator, like this:
select id from documents where keywords @> '{"winter", "report", "2020"}';
meaning that keywords array should contain all these elements. Also I've created a GIN index on this column.
Is it possible to achieve similar behavior even if I provide my request like '{"re", "202", "w"}' ? I heard that ngrams have semantics like this, but "intersection" capabilities of arrays are crucial for me.
In your example, the matches are all prefixes. Is that the general rule here? If so, you would probably want to use the prefix-matching feature of full text search, not trigrams. It would require you to reformat your data, or at least your query.
select * from
(values (to_tsvector('simple','winter report 2020'))) f(x)
where x @@ 're:* & 202:* & w:*'::tsquery;
If the strings can contain punctuation which you want preserved, you would need to take pains to properly format them into a quoted tsvector yourself rather than just letting to_tsvector deal with it. Using 'simple' config gets rid of the stemming and stop word removal features, which would interfere with what you want to do.
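Another option, since your keywords are already in an array, might be array_to_tsvector() (available since 9.6): it turns each array element into a lexeme verbatim, so there is no parsing, stemming, or stop-word removal at all. A sketch, assuming keywords is a text[] column with no NULL or empty elements (the index name is made up):

CREATE INDEX documents_keywords_fts_idx
    ON documents USING gin (array_to_tsvector(keywords));

SELECT id
FROM documents
WHERE array_to_tsvector(keywords) @@ 're:* & 202:* & w:*'::tsquery;

Note that with this approach nothing is lower-cased for you either, which may be exactly what you want for intersection-style semantics.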
I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) @@ to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
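You can see the difference directly:

select to_tsquery('english', 'office:*');  -- gives 'offic':*  (stemmed)
select 'office:*'::tsquery;                -- gives 'office':* (taken as written)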
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you will lose that. pg_trgm also provides the '<->' operator which is even fuzzier, and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
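A minimal sketch of what that could look like, assuming the names to search live in a projects.name column (all of the names here are placeholders); a GiST index is used so that both the % filter and the ORDER BY <-> can benefit from it:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX projects_name_trgm_idx
    ON projects USING gist (name gist_trgm_ops);

SELECT name, name <-> 'ofice bmt' AS dist
FROM projects
WHERE name % 'ofice bmt'        -- similar enough, per the set_limit() threshold
ORDER BY name <-> 'ofice bmt'   -- closest matches first
LIMIT 20;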
In your case, to_tsquery() needs to indicate that all words are required; you can use to_tsquery('english', 'off & bmt') and specify a particular dictionary containing the word 'off', as listed in link 4 below.
Some tips to use tsvector:
1. Create a field on your table that contains all the fields with terms that you want to search; this field should be of type tsvector.
2. Your search should use tsquery, as you already do. Within the search you can use some good tricks, such as the following:
2.a. Create a rank with ts_rank(), indicating the search priority; this shows how closely the tsquery approximates the original terms.
2.b. If you have domain-specific words (in my case, chemical terms), you can create a dictionary of the commonly used words; these words can be used to extract roots or parts for similarity comparison.
2.c. About performance: tsquery works very well with GIN and GiST indexes. I have used full text search on a table with 200k+ records and the search returns in under 0.4 seconds.
If you need fuzzier matching on words, you can also use fuzzy matching. Together with tsquery I used the levenshtein_less_equal search with a distance of 3, which finds words differing from the search term by 3 letters or fewer; for single words this is a good way to search.
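For example, with the fuzzystrmatch extension:

CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- returns the Levenshtein distance if it is <= 3, otherwise some value greater than 3
SELECT levenshtein_less_equal('ofice', 'office', 3);  -- 1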
1. tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
2. text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
3. Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
4. Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
I have a Postgres table with a phone field stored as varchar(10), but we search on the area code frequently, e.g.:
select * from bus_t where bus_phone like '555%'
I wanted to create an index to facilitate these searches, but I got an error when trying:
CREATE INDEX bus_ph_3 ON bus_t USING btree (bus_phone::varchar(3));
ERROR: 42601: syntax error at or near "::"
My first question is, how do I accomplish this, but also I am wondering if it makes sense to index on the first X characters of a field or if indexing on the entire field is just as effective.
Actually, a plain B-tree index is normally useless for pattern matching with LIKE (~~) or regexp (~), even with left-anchored patterns, if your installation runs on any locale other than "C", which is the typical case. Here is an overview of pattern matching and indexes in a related answer on dba.SE.
Create an index with the varchar_pattern_ops operator class (matching your varchar column) and be sure to read the chapter on operator classes in the manual.
CREATE INDEX bus_ph_pattern_ops_idx ON bus_t (bus_phone varchar_pattern_ops);
Your original query can use this index:
... WHERE bus_phone LIKE '555%'
Performance of a functional index on the first 3 characters as described in the answer by @a_horse is pretty much the same in this case.
-> SQLfiddle demo.
Generally a functional index on relevant leading characters would be a good idea, but your column has only 10 characters. Consider that the overhead per tuple is already 28 bytes. Saving 7 bytes is just not substantial enough to make a big difference. Add the cost for the function call and the fact that xxx_pattern_ops are generally a bit faster.
In Postgres 9.2 or later the index on the full column can also serve as covering index in index-only scans.
However, the more characters in the columns, the bigger the benefit from a functional index.
You may even have to resort to a prefix index (or some other kind of hash) if the strings get too long. There is a maximum length for indices.
If you decide to go with the functional index, consider using the xxx_pattern_ops variant for a small additional performance benefit. Be sure to read about the pros and cons in the manual and in Peter Eisentraut's blog entry:
CREATE INDEX bus_ph_3 ON bus_t (left(bus_phone, 3) varchar_pattern_ops);
Explain error message
You'd have to use the standard SQL cast syntax for functional indices. This would work - pretty much like the one with left(), but like @a_horse I'd prefer left().
CREATE INDEX bus_ph_3 ON bus_t USING btree (cast(bus_phone AS varchar(3)));
When using like '555%' an index on the complete column will be used just as well. There is no need to only index the first three characters.
If you do want to index only the first 3 characters (e.g. to save space), then you could use the left() function:
CREATE INDEX bus_ph_3 ON bus_t USING btree (left(bus_phone,3));
But in order for that index to be used, you would need to use that expression in your where clause:
where left(bus_phone,3) = '555';
But again: that is most probably overkill and the index on the complete column will be good enough; it can also be used for other queries, e.g. bus_phone = '555-1234', which the index on just the first three characters would not support.
I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigam in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
Catch is, trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, runs regexp_matches over the CJK Unicode ranges, and returns an array of any such characters, or NULL if there are none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
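A possible sketch of that hypothetical cjk_chars() function (the regex below only covers the basic CJK Unified Ideographs block and assumes a UTF-8 database; the table and column names are placeholders):

CREATE OR REPLACE FUNCTION cjk_chars(txt text) RETURNS text[]
LANGUAGE sql IMMUTABLE AS $$
    -- collect every CJK character in the input, or NULL if there are none
    SELECT array_agg(m[1])
    FROM regexp_matches(txt, '[\u4e00-\u9fff]', 'g') AS m
$$;

CREATE INDEX tbl_location_name_cjk_idx
    ON tbl USING gin (cjk_chars(location_name));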
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx
ON location USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language,location_name) @@ to_tsquery('english','cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx
ON tbl USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
* Very fast.
* Retains capability to find partial matches.
* Adds capability for fuzzy search. Check out the % operator and set_limit().
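For instance (the threshold value is only an example; the default is 0.3):

SELECT set_limit(0.25);                          -- lower the similarity threshold
SELECT show_limit();                             -- check the current threshold
SELECT * FROM tbl WHERE location_name % 'cafe';  -- fuzzy match against the threshold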
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx
ON tbl USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful, if you only want results of a specific language per query and would be very fast in this case.