Postgresql: optimal use of multicolumn-index when subset of index is missing from the where clause - postgresql

I will be having queries on my database with where clauses similar to this:
SELECT * FROM table WHERE a = 'string_value' AND b = 'other_string_value' AND t > <timestamp>
and less often to this:
SELECT * FROM table WHERE a = 'string_value' AND t > <timestamp>
I have created a multicolumn index on a, b and t on that order. However I am not sure if it will be optimal for my second -less frequent- query.
Will this index do an index scan on b or skip it and move to the t index immediately? (To be honest Im not sure how index scans work exactly). Should I create a second multi-column index on a and t only for the second query?
The docs state that
'the index is most efficient when there are constraints on the leading (leftmost) columns'
But in the example it doesn't highlight my case where the 'b' equality column is missing in the where clause.

The 2nd query will be much less effective with the btree index on (a,b,t) because the absence of b means t cannot be used efficiently (it can still be used as an in-index filter, but that is not nearly as good as being used as a start/stop point). An index on (a,t) will be able to support the 2nd query much more efficiently.
But that doesn't mean you have to create that index as well. Indexes take space and must be maintained, so are far from free. It might be better to just live with less-than-optimal plans for the 2nd query, since that query is used "less often". On the other hand, you did bother to post about it, so maybe "less often" is still pretty often. So you might be better off just to build the extra index and spend your time worrying about something else.
A btree index can be thought of like a phonebook, which is sorted on last name, then first name, then middle name. Your first query is like searching for "people named Mary Smith with a middle name less than Cathy" You can use binary search to efficiently find the first "Mary Smith", then you scan through those until the middle name is > 'Cathy', and you are done. Compare that to "people surnamed Smith with a middle name less than Cathy". Now you have to scan all the Smith's. You can't stop at the first middle name > Cathy, because any change in first name resets the order of the middle names.
Given that b only has 10 distinct values, you could conceivably use the (a,b,t) index in a skip scan quite efficiently. But PostgreSQL doen't yet implement skip scans natively. You can emulate them, but that is fragile, ugly, a lot of work, and easy to screw up. Nothing you said here makes me think it would be worthwhile to do.

Related

Get all ltree nodes at depth

I have an ltree column containing a tree with a depth of 3. I'm trying to write a query that can select all children at a specific depth (level 1 = get all parents, 2 = get all children, 3 = get all grandchildren). I know this is pretty straightforward with n_level:
SELECT path FROM hierarchies
WHERE
nlevel(path) = 1
LIMIT 1000;
I have 200,000 dummy records and it's pretty fast (~170 ms). However, this query uses a sequential scan. I think it'd be better to write it in a way that takes advantage of the ltree operators supported by the GiST index. Frustratingly, I can't seem to wrap my brain around them, and I haven't found a similar question on SO or DBA (besides this one on finding leaves)
Any advice is appreciated!
The only index that could support your query is a simple b-tree index on an expression.
create index on hierarchies((nlevel(path)))
Note however that it is quite possible for the planner to choose a sequential scan anyway, exemplary in the case the number of rows with level 1 is much more than other levels.

Postgres LIKE query triggers full table scan

I have a Postgres query where we have several indices set up, including one on a text field where we have a GIN index. My understanding of this based on the pg_trgm documentation is that it's only applicable if the search string is made up of alphanumeric text. Testing bears this out and in a database with tens of millions of records, doing something like the following works great:
SELECT * FROM my_table WHERE target_field LIKE '%foo%'
I've read in various places that anything that's not an alphanumeric string is treated as a separate word in the trigram search, so something like the following also works quite well:
SELECT * FROM my_table WHERE target_field LIKE '%foo & bar%'
However someone ran a search that was literally just three question marks in a row and it triggered a full table scan. For some reason, when multiple ampersand or question marks are used alone in the query, they're being treated differently than a single one placed next to or among actual alpha-numeric characters.
The research I've done has implied that it might be how some database drivers handle the question mark, sometimes interpreting it as a parameter that needs to be supplied, but then gets confused because it can't find the parameters and triggers a table scan. I don't really believe this is the case. I might be inclined to believe it would throw an error rather than completing the query, but running it anyway seems like a design flaw.
What makes more sense is that a question mark isn't an alpha-numeric character and thus it's treated differently. In some technologies, common symbols such as & are considered alpha-numeric, but I don't think that's the case with Postgres. In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
What's weird is that I can search for %foo & bar%, which seems to work fine. I can even search for %&% and it returns quickly, though not with the results I wanted. But if I put (for example) three of them together like this: %&&&%, it triggers a full table scan.
After running various experiments, here's what I've seen:
%%: uses the index
%&%: uses the index
%?%: uses the index
%foo & bar%: uses the index
%foo ? bar%: uses the index
%foo && bar%: uses the index
%foo ?? bar%: uses the index
%&&%: triggers a full table scan
%??%: triggers a full table scan
%foo&bar%: uses the index, but returns no results
I think that all of those make sense until you get to #8 and #9. And if if the ampersand were a word boundary, shouldn't #10 return results?
Anyone have an explanation of why multiple consecutive punctuation characters would be treated differently than a single punctuation character?
I can't reproduce this in v11 on a table full of md5 hashes: I get seq scans (full table scans) for the first 3 of your patterns.
If I force them to use the index by setting enable_seqscan=false, then I go get it to use the index, but it is actually slower than doing the seq scan. So it made the right call there. How about for you? You shouldn't force it to use the index just on principle when it is actually slower.
It would be interesting to see the estimated number of rows it thinks it will return for all of those examples.
In fact, the documentation suggests that non-alphanumeric characters are treated as word boundaries in a GIN-based index.
The G in GIN is for "generalized". You can't make blanket statements like that about something which is generalized. They don't even need to operate on text at all. But in your case, you are using the LIKE operator, and the LIKE operator doesn't care about word boundaries. Any GIN index which claims to support the LIKE operator must return the correct results for the LIKE operator. If it can't do that, then it is a bug for it to claim to support it.
It is true that pg_trgm treats & and ? the same as white space when extracting trigrams, but it is obliged to insulate LIKE from the effects if this decision. It does this by two methods. One is that it returns "MAYBE" results, meaning all the tuples it reports must be rechecked to see if they actually satisfy the LIKE. So '%foo&bar%' and '%foo & bar%' will return the same set of tuples to the heap scan, but the heap scan will recheck them and so finally return a different set to the user, depending on which ones survive the recheck. The second thing is, if the pg_trgm can't extract any trigrams at all out of the query string, then it must return the entire table to then be rechecked. This is what would happen with '%%', '%?%', '%??%', etc. Of course rechecking all rows is slower than just doing the seq scan in the first place.

How to force postgresql to use particular index when all fields match

Consider I have a table T with fields a,b,c,d with two indices: first on a,b,c fields and second on a,b,d fields. Types of a,b,c and d are integer. Both indices are almost the same (on production they both have about 2Gb size, they have the same creation time and the same statistics of usage, table overall have about 60 millions rows).
I make two queries:
select * from T where a=... and b=... and c=...;
select * from T where a=... and b=... and d=...;
I expect that for the first query index on a,b,c fields is used and for the second index on a,b,d fields is used. However it's not the case and for both queries first index is used, but in second case with "filter"(I used expect analyze to gain this knowledge). For me such behavior is unacceptable, because in some circumstances number of entries in filter grows very fast and autovacuum/analyze (which actually helps the planner to use the right index) works too slow to prevent the unexpected latencies and downtime.
So my question is: how can I force postgresql not to use wrong index with filtering, but rather use the right index when all fields in query's 'where' and in that index match?
Finally I found the solution. It's not perfect and I will not mark it as the best one, however it works and could help someone.
What I have actually done is I have changed the indices, instead of a,b,c and a,b,d indices now I'm having c,a,b and d,a,b.
One problem appeared: I needed an index on 'a', because some queries rely on it. However when I add index solely on 'a', the problem from the first post appears again (the index on 'a' is used when planner thinks that its cost is lower than cost of c,a,b or d,a,b). So I decided to add new field to the table which is a copy (has the same data) of a field 'a', let's call it 'a1', and I added the index on this field. Now when I need to filter contents of 'a' somehow instead I'm filtering on 'a1' field. It's weird, but I couldn't find another solution.

Create index on first 3 characters (area code) of phone field?

I have a Postgres table with a phone field stored as varchar(10), but we search on the area code frequently, e.g.:
select * from bus_t where bus_phone like '555%'
I wanted to create an index to facilitate with these searches, but I got an error when trying:
CREATE INDEX bus_ph_3 ON bus_t USING btree (bus_phone::varchar(3));
ERROR: 42601: syntax error at or near "::"
My first question is, how do I accomplish this, but also I am wondering if it makes sense to index on the first X characters of a field or if indexing on the entire field is just as effective.
Actually, a plain B-tree index is normally useless for pattern matching with LIKE (~~) or regexp (~), even with left-anchored patterns, if your installation runs on any other locale than "C", which is the typical case. Here is an overview over pattern matching and indices in a related answer on dba.SE
Create an index with the varchar_pattern_ops operator class (matching your varchar column) and be sure to read the chapter on operator classes in the manual.
CREATE INDEX bus_ph_pattern_ops_idx ON bus_t (bus_phone varchar_pattern_ops);
Your original query can use this index:
... WHERE bus_phone LIKE '555%'
Performance of a functional index on the first 3 characters as described in the answer by #a_horse is pretty much the same in this case.
-> SQLfiddle demo.
Generally a functional index on relevant leading characters would be be a good idea, but your column has only 10 characters. Consider that the overhead per tuple is already 28 bytes. Saving 7 bytes is just not substantial enough to make a big difference. Add the cost for the function call and the fact that xxx_pattern_ops are generally a bit faster.
In Postgres 9.2 or later the index on the full column can also serve as covering index in index-only scans.
However, the more characters in the columns, the bigger the benefit from a functional index.
You may even have to resort to a prefix index (or some other kind of hash) if the strings get too long. There is a maximum length for indices.
If you decide to go with the functional index, consider using the xxx_pattern_ops variant for a small additional performance benefit. Be sure to read about the pros and cons in the manual and in Peter Eisentraut's blog entry:
CREATE INDEX bus_ph_3 ON bus_t (left(bus_phone, 3) varchar_pattern_ops);
Explain error message
You'd have to use the standard SQL cast syntax for functional indices. This would work - pretty much like the one with left(), but like #a_horse I'd prefer left().
CREATE INDEX bus_ph_3 ON bus_t USING btree (cast(bus_phone AS varchar(3));
When using like '555%' an index on the complete column will be used just as well. There is no need to only index the first three characters.
If you do want to index only the first 3 characters (e.g. to save space), then you could use the left() funcion:
CREATE INDEX bus_ph_3 ON bus_t USING btree (left(bus_phone,3));
But in order for that index to be used, you would need to use that expression in your where clause:
where left(bus_phone,3) = '555';
But again: that is most probably overkill and the index on the complete column will be good enough and can be used for other queries as well e.g. bus_phone = '555-1234' which the index on just the first three characters would not.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A Mysql example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as primary key automatically generates an index on that column. In your case for using more often select an index is ideal, because they improve time of selection queries and degrade the time of insertion. So you can create the indexes you think you need without worrying about the performance
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.