I have a query with several filter conditions in the WHERE clause.
Also, most of the columns involved have indexes on them.
When I run the EXPLAIN command, I see:
-> Bitmap Index Scan on feature_expr_idx (cost=0.00..8.10 rows=14 width=0)
feature_expr_idx is an index on one of the columns in the WHERE clause.
But the indexes for the other columns are not shown. Instead, those columns appear in the Filter row:
Filter: ((NOT is_deleted) AND (vehicle_type = 'car'::text) AND (source_type = 'NONE'::text))
Why is only a single index shown in the result, while the other columns, which also have indexes, are part of the Filter instead?
PostgreSQL has a clever engine which tries to plan the best way to run your query. Often, this involves reading as little as possible from disk, because disk operations are slow. One of the reasons why indexes are so helpful is that by reading from the index, we can find a small number of rows in the table that need to be read in order to satisfy the query, and thus we can avoid reading through the entire table. Note, however, that the index is also on disk, and so reading the index also takes some time.
Now, imagine your query has two filters, one over column A and one over column B, both of which are indexed. According to the statistics PostgreSQL has collected, there are about 5 rows that satisfy the filter on column A, and about 1000 rows that satisfy the filter on column B. In that case, it makes sense to read only the index on column A, then read all the matching 5 (or so) rows, and filter out any of them which don't match the filter on column B. Reading the index on column B would probably be more expensive than just reading the 5 rows!
The actual reason may be different from my example, but the point is that PostgreSQL is simply trying to be as efficient as possible.
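The two-filter example can be sketched with a toy model in Python. This is only an illustration of the reasoning, not how PostgreSQL is implemented: the table is a list of dicts, an "index" is a dict from value to row positions, and the column names a and b and the data are made up so that about 5 rows match on a and 1000 match on b.

```python
# Toy model, not PostgreSQL internals: a table is a list of rows and an
# "index" maps a column value to the positions of the matching rows.
from collections import defaultdict

def build_index(rows, column):
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index

# Hypothetical data: each value of 'a' matches 5 rows, each value of 'b' 1000.
rows = [{"a": i % 2000, "b": i // 1000} for i in range(10_000)]
index_a = build_index(rows, "a")
index_b = build_index(rows, "b")

# Plan 1: read only the index on 'a', fetch those few rows, filter on 'b'.
candidates = index_a[7]
result = [pos for pos in candidates if rows[pos]["b"] == 4]

# Plan 2 would additionally read every index entry for b = 4.
entries_read_plan1 = len(index_a[7])
entries_read_plan2 = len(index_a[7]) + len(index_b[4])
```

In this sketch plan 1 touches 5 index entries and plan 2 touches 1005, for the same final result. The real planner compares estimated costs rather than raw entry counts, but the trade-off is the same.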
I have the query below to fetch a list of tickets.
EXPLAIN select * from ticket_type
where ticket_type.event_id='89898'
and ticket_type.active=true
and (ticket_type.is_unlimited = true OR ticket_type.number_of_sold_tickets < ticket_type.number_of_tickets)
order by ticket_type.ticket_type_order
I have created the indexes below, but they are not working.
Index on (ticket_type_order,event_id,is_unlimited,active)
Index on (ticket_type_order,event_id,active,number_of_sold_tickets,number_of_tickets).
The perfect index for this query would be
CREATE INDEX ON ticket_type (event_id, ticket_type_order)
WHERE active AND (is_unlimited OR number_of_sold_tickets < number_of_tickets);
Of course, a partial index like that might only be useful for this specific query.
If the WHERE conditions from the index definition are not very selective, or a somewhat slower execution is also acceptable, you can omit parts of or the whole WHERE clause. That makes the index more widely useful.
How large is the table, and how many rows does the query usually return? The planner is usually smart enough to skip an index if it expects the query to return a large part of the table, for example more than half of it.
An index also makes little sense when the set of rows left to check is already small. If, say, only 1000 rows remain after several filtering steps, the server stops using indexes: it is cheaper to finish the query on the CPU than to load another index from disk. For the same reason, indexes are rarely used on small tables.
ORDER BY is applied at the very end of query processing. The first column in the index should be one of the columns from the WHERE filter.
Boolean columns are seldom useful in an index, because they have only two possible values. Indexes should be created on columns with many distinct values.
Avoid OR filtering. That is easy in your case: put a very large number into number_of_tickets if the tickets are unlimited.
A better index in your case would be on just event_id. If the database server supports functional indexes, you can also try adding one on number_of_tickets - number_of_sold_tickets and rewriting the condition as where number_of_tickets - number_of_sold_tickets > 0.
UPDATE: PostgreSQL calls it "Index on Expression":
https://www.postgresql.org/docs/current/indexes-expressional.html
I'm trying to reason about how Postgres partial indexes are stored inside Postgres. Suppose I create an index like this
CREATE INDEX orders_unbilled_index ON orders (order_nr)
WHERE billed is not true
in order to quickly run a query like
SELECT *
FROM orders
WHERE billed is not true AND order_nr > 1000000
Postgres obviously stores an index on order_nr built over a subset of the orders table as defined by the conditional expression billed is not true. However, I have a couple of questions related to this:
1. Does Postgres store another index internally on billed is not true to quickly find the rows associated with the partial index?
2. If (1) is not the case, would it then make the query above run faster if I made a separate index on billed is not true? (assuming a large table and few rows with billed is true)
EDIT: My example query based on the docs is not the best due to how boolean indexes are rarely used, but please consider my questions in the context of any conditional expression.
A B-tree index can be thought of as an ordered list of index entries, each with a pointer to a row in the table.
In a partial index, the list is just smaller: there are only index entries for rows that meet the condition.
If you have the index condition in your WHERE clause, PostgreSQL knows it can use the index and doesn't have to check the index condition, because it will be satisfied automatically.
So:
1. No, any row found via the index will automatically satisfy the index condition, so using the index is enough to make sure it is satisfied.
2. No, an index on a boolean column will not be used, because it would not be cheaper than this partial index, and the partial index can be used to check the condition on order_nr as well.
It is actually the other way around: the partial index could well be used for queries that only have the boolean column in the WHERE condition, if there are few enough rows that satisfy the condition.
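The "ordered list of entries" picture can be sketched in Python. This is a simplified model under stated assumptions, not PostgreSQL's on-disk format: the orders data is invented, a list position stands in for the row pointer, and build_partial_index and scan_greater_than are hypothetical helper names.

```python
import bisect

# Hypothetical orders table; the list position plays the role of the row pointer.
orders = [
    {"order_nr": 1000003, "billed": True},
    {"order_nr": 999998,  "billed": None},
    {"order_nr": 1000001, "billed": False},
    {"order_nr": 1000007, "billed": None},
    {"order_nr": 1000005, "billed": True},
]

def build_partial_index(rows):
    # Only rows satisfying "billed IS NOT TRUE" get an entry, sorted by order_nr.
    return sorted(
        (row["order_nr"], pos)
        for pos, row in enumerate(rows)
        if row["billed"] is not True
    )

def scan_greater_than(index, order_nr):
    # Binary search for the first key > order_nr, then read entries in key order.
    start = bisect.bisect_right(index, (order_nr, float("inf")))
    return [pos for _, pos in index[start:]]

index = build_partial_index(orders)

# WHERE billed IS NOT TRUE AND order_nr > 1000000
matches = scan_greater_than(index, 1000000)

# WHERE billed IS NOT TRUE (no condition on order_nr): read the whole small index.
unbilled = [pos for _, pos in index]
```

Nothing in the scan checks billed again: every entry in the list already satisfies the index condition, which is why no second index on that condition is needed.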
It is my understanding that Postgres will simply build an index which can only be used to look up records for which billed is not true. That is, the resulting B-tree would be indexed by order_nr, but would only link back to rows of the original table where billed is not true.
If you keep reading the documentation, immediately after what you cited, you will find the following query as an example:
SELECT * FROM orders WHERE billed is not true AND amount > 5000.00;
Postgres might even choose to use the index you defined for the above query. It can satisfy this query by scanning the entire index. If there are a relatively small number of orders which are not yet billed, then scanning the index on order_nr might still be preferable to doing a full table scan.
So, the answer to your question #1 is no: there is no separate index for billed; rather, the index on order_nr can only be used for records for which billed is not true. And for #2, yes, a second index on billed is not true could be used, assuming few records are unbilled. However, your current index might be used as is.
It is clear how bitmap indexes work with two possible values (gender: male and female). But how can they be used with 3 or more values?
Can anyone explain how it works in postgresql?
PostgreSQL does not have bitmap indexes; it can do bitmap index scans over regular B-tree indexes.
For that it is not important how many values the indexed column (or columns) can have.
This is how it works:
The index is scanned for the search condition.
Rather than visiting the table for each row found, PostgreSQL builds a bitmap. This bitmap normally has one bit per table row, and the rows are sorted in physical address (ctid) order. The value of the bit indicates if that row matches the search condition or not (which has nothing to do with the value range of the indexed columns).
If work_mem is too small to contain a bitmap with one bit per row, PostgreSQL degrades to storing one bit per 8KB page. This shows up as “lossy” entries in the EXPLAIN (ANALYZE) output and will lead to false positive hits in the next step which affects performance.
During a second step, the bitmap heap scan, PostgreSQL visits the table and fetches (and re-checks, if necessary) all rows that show a hit in the bitmap.
The advantages of a bitmap index scan are:
Even if many rows are selected, PostgreSQL has to visit each table block only once, and the rows are visited in physical order.
Several bitmap index scans on the same table can be combined with a “bitmap AND” or a “bitmap OR” before the table is scanned. This can handle ORs efficiently and combine several conditions with AND if each of the conditions alone is not selective enough to warrant an index scan.
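The steps above can be sketched in Python as a toy model. It is not PostgreSQL's implementation: a Python integer stands in for the bitmap (bit i set means row i matches), the table and column names are invented, and the lossy per-page mode is left out. The column color has three distinct values to show that the number of values does not matter.

```python
# Toy model: a Python int is the bitmap; bit i set means "row i matches".
rows = [
    {"color": "red",   "size": "S"},
    {"color": "blue",  "size": "M"},
    {"color": "green", "size": "S"},
    {"color": "red",   "size": "L"},
    {"color": "green", "size": "M"},
    {"color": "red",   "size": "S"},
]

def build_index(rows, column):
    # Stand-in for a regular index: value -> positions of matching rows.
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[column], []).append(pos)
    return index

def bitmap_index_scan(index, value):
    # Step 1: scan the index and set one bit per matching row.
    bitmap = 0
    for pos in index.get(value, []):
        bitmap |= 1 << pos
    return bitmap

def bitmap_heap_scan(rows, bitmap):
    # Step 2: visit the table once, in physical order, fetching flagged rows.
    return [pos for pos in range(len(rows)) if (bitmap >> pos) & 1]

color_idx = build_index(rows, "color")
size_idx = build_index(rows, "size")

red = bitmap_index_scan(color_idx, "red")
green = bitmap_index_scan(color_idx, "green")
small = bitmap_index_scan(size_idx, "S")

red_and_small = bitmap_heap_scan(rows, red & small)   # "bitmap AND"
red_or_green = bitmap_heap_scan(rows, red | green)    # "bitmap OR"
```

Each bit only records whether a row matched the search condition, so the bitmap looks the same whether the column has two distinct values or two thousand.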
I have a table in PostgreSQL, it has 20 columns, which are mostly of an enum type. And this table has millions of rows.
I'd like to support and speed up for queries searching for rows with multiple fields, for instance: col2=value1&col3=value2&col5=value3 page=1
I can't use PostgreSQL's compound index, because it only works with a fixed order of the columns. For instance, if I build an index on (col2,col3,col5), then it can't be used for queries searching for col1=value1&col2=value2
And I'd like also to support queries like:
col1=value1&col2=(value3 or value4) orderby=col3 page=1
What would be a solution to this problem? And if I don't need full-text search on any of these columns (since they are all enum types), could the solution be lightweight?
If you want an OR in your search condition, that's pretty much “game over” for performance (I'm exaggerating a little for effect).
But if you have only ANDs and equality conditions, I want to call your attention to Bloom filters.
You just have to
CREATE EXTENSION bloom;
and then create an index USING bloom on all columns together.
Unlike other indexes, this single index can speed up queries with all possible combinations of columns in the WHERE condition. The index is just a filter that will pass some false positives, so there always has to be a recheck of the condition, but it will significantly speed up the query.
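The idea of a filter that passes false positives and therefore needs a recheck can be sketched in Python. This is a conceptual model only, not the bloom extension's actual storage format or hash functions: SIGNATURE_BITS, BITS_PER_COLUMN, the use of SHA-256, and the sample rows are all assumptions made for the illustration.

```python
import hashlib

SIGNATURE_BITS = 64   # assumed signature length for this sketch
BITS_PER_COLUMN = 2   # assumed number of bits set per column value

def column_bits(column, value):
    # Hash one (column, value) pair to a few bit positions in the signature.
    bits = 0
    for seed in range(BITS_PER_COLUMN):
        digest = hashlib.sha256(f"{seed}:{column}:{value}".encode()).digest()
        bits |= 1 << (int.from_bytes(digest[:8], "big") % SIGNATURE_BITS)
    return bits

def signature(values):
    # A row's signature is the OR of the bits of all its column values.
    sig = 0
    for column, value in values.items():
        sig |= column_bits(column, value)
    return sig

def search(rows, signatures, condition):
    wanted = signature(condition)
    hits = []
    for pos, sig in enumerate(signatures):
        # Filter step: every wanted bit must be set (may be a false positive).
        if (sig & wanted) == wanted:
            # Recheck step: compare the real column values.
            if all(rows[pos][col] == val for col, val in condition.items()):
                hits.append(pos)
    return hits

rows = [
    {"col1": "a", "col2": "x", "col3": "p"},
    {"col1": "a", "col2": "y", "col3": "q"},
    {"col1": "b", "col2": "x", "col3": "p"},
    {"col1": "a", "col2": "x", "col3": "q"},
]
signatures = [signature(row) for row in rows]
```

The same signatures answer search(rows, signatures, {"col1": "a", "col2": "x"}) and search(rows, signatures, {"col3": "p"}), which is the property that makes one such index usable for any combination of columns with equality conditions.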
I could not reach any conclusive answers reading some of the existing posts on this topic.
I have certain data at 100 locations for the past 10 years. The table has about 800 million rows. I primarily need to generate yearly statistics for each location. Sometimes I need to generate monthly variation statistics and hourly variation statistics as well. I'm wondering if I should create two indexes, one for location and another for year, or create one index on both location and year. My primary key is currently a serial number (I could probably use location and timestamp as the primary key).
Thanks.
Regardless of how many indexes you have created on a relation, usually only one of them is used in a given query (which one depends on the query, statistics, etc.). So in your case you would not get a cumulative advantage from creating two single-column indexes. To get the most performance from an index, I would suggest a composite index on (location, timestamp).
Note that queries like ... WHERE timestamp BETWEEN smth AND smth cannot use the index above efficiently, while queries like ... WHERE location = 'smth' or ... WHERE location = 'smth' AND timestamp BETWEEN smth AND smth can. This is because the first column in the index is crucial for searching and sorting.
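Why the leading column matters can be sketched in Python by modelling the composite index as a sorted list of (location, timestamp, row pointer) tuples. This is a simplified picture, not PostgreSQL's B-tree code; the readings data and the helper names location_and_range and range_only are invented for the illustration.

```python
import bisect

# Hypothetical readings as (location, timestamp); ISO date strings sort correctly.
readings = [
    ("B", "2015-03-01"), ("A", "2016-07-15"), ("A", "2014-01-10"),
    ("C", "2016-02-02"), ("B", "2016-11-30"), ("A", "2016-12-24"),
]

# Composite index on (location, timestamp): sorted entries with a row pointer.
index = sorted((loc, ts, pos) for pos, (loc, ts) in enumerate(readings))

def location_and_range(index, location, ts_from, ts_to):
    # Leading column fixed: all matches form one contiguous slice of the index,
    # found with two binary searches.
    lo = bisect.bisect_left(index, (location, ts_from, -1))
    hi = bisect.bisect_right(index, (location, ts_to, len(index)))
    return [pos for _, _, pos in index[lo:hi]]

def range_only(index, ts_from, ts_to):
    # No condition on the leading column: matches are scattered across the
    # index, so every entry has to be inspected.
    return [pos for _, ts, pos in index if ts_from <= ts <= ts_to]

year_2016_at_a = location_and_range(index, "A", "2016-01-01", "2016-12-31")
year_2016_anywhere = range_only(index, "2016-01-01", "2016-12-31")
```

With a condition on location, the work is proportional to the size of the matching slice; without it, the work is proportional to the size of the whole index.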
Don't forget to perform
ANALYZE;
after index creation in order to collect statistics.
Update:
As @MondKin mentioned in the comments, certain queries can actually use several indexes on the same relation. One example is a query with OR clauses like a = 123 OR b = 456 (assuming that there are indexes on both columns). In this case Postgres would perform bitmap index scans on both indexes, build a union of the resulting bitmaps and use it for a bitmap heap scan. Under certain conditions the same scheme may be used for AND queries, but with an intersection instead of a union.
There is no rule of thumb for situations like these. I suggest you experiment on a copy of your production DB to see what works best for you: a single multi-column index or 2 single-column indexes.
One nice feature of Postgres is you can have multiple indexes and use them in the same query. Check this chapter of the docs:
... PostgreSQL has the ability to combine multiple indexes ... to handle cases that cannot be implemented by single index scans ....
... Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature ...
You can even experiment with creating both the individual and the combined indexes, check how big each one is, and determine whether it's worth having them at the same time.
Some things that you can also experiment with:
If your table is too large, consider partitioning it. It looks like you could partition either by location or by date. Partitioning splits your table's data into smaller tables, reducing the number of places where a query needs to look.
If your data is laid out according to a date (like transaction date) check BRIN indexes.
If multiple queries will be processing your data in a similar fashion (like aggregating all transactions over the same period), check materialized views so you only need to do those costly aggregations once.
About the order of the columns in your multi-column index: put first the column on which you will have an equality operation, and after it the column on which you will have a range, >= or <= operation.
An index on (location,timestamp) should work better than 2 separate indexes for your case. Note that the order of the columns is important.