Postgres: how do you optimize queries on a date column with low selectivity?

I have a table with 143 million rows (and growing); its current size is 107 GB. One of the columns in the table is of type date and has low selectivity. For any given date, it's reasonable to assume that there are somewhere between 0.5 and 4 million records with the same date value.
Now, if someone tries to do something like this:
select * from large_table where date_column > '2020-01-01' limit 100
It will execute "forever", and if you EXPLAIN ANALYZE it, you can see that it's doing a table scan. So the first (and so far only) idea is to try to turn this into an index scan. If Postgres can scan a subsection of an index and return the "limit" number of records, that sounds fast to me:
create index our_index_on_the_date_column ON large_table (date_column DESC);
VACUUM ANALYZE large_table;
EXPLAIN ANALYZE select * from large_table where date_column > '2020-01-01' limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..37.88 rows=100 width=893) (actual time=0.034..13.520 rows=100 loops=1)
-> Seq Scan on large_table (cost=0.00..13649986.80 rows=36034774 width=893) (actual time=0.033..13.506 rows=100 loops=1)
Filter: (date_column > '2020-01-01'::date)
Rows Removed by Filter: 7542
Planning Time: 0.168 ms
Execution Time: 18.412 ms
(6 rows)
It still reverts to a sequential scan. Please disregard the execution time, as this took 11 minutes before caching came into action. We can force it to use the index by reducing the returned columns to those covered by the index:
select date_column from large_table where date_column > '2019-01-15' limit 100
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..3.42 rows=100 width=4) (actual time=0.051..0.064 rows=100 loops=1)
-> Index Only Scan using our_index_on_the_date_column on large_table (cost=0.57..907355.11 rows=31874888 width=4) (actual time=0.050..0.056 rows=100 loops=1)
Index Cond: (date_column > '2019-01-15'::date)
Heap Fetches: 0
Planning Time: 0.082 ms
Execution Time: 0.083 ms
(6 rows)
But this is of course a contrived example, since the table is very wide and covering all parts of the table in the index is not feasible.
So, can anyone share some guidance on how to get reasonable performance when using low-selectivity columns as predicates?

Related

Covering index in Postgres 12

My table has 5500000 test records (Postgres 12).
I'm uncertain about the large costs I see when using a covering index on my tables.
So I create an index:
create index concurrently idx_log_type
on logs (log_type)
include (log_body);
Then I execute a query:
explain (analyze)
select log_body
from logs where log_type = 1
limit 100;
I get this result:
Limit (cost=0.43..5.05 rows=100 width=33) (actual time=0.118..0.138 rows=100 loops=1)
-> Index Only Scan using idx_log_type on logs (cost=0.43..125736.10 rows=2720667 width=33) (actual time=0.107..0.118 rows=100 loops=1)
Index Cond: (log_type = 1)
Heap Fetches: 0
Planning Time: 0.558 ms
Execution Time: 0.228 ms
At first glance this looks good, but the cost range is very large.
Is this normal, or should it be optimized?
That's an artifact of EXPLAIN: it shows the cost for the complete index scan, just as if there were no LIMIT. But the whole plan is priced correctly.
So you can ignore the high upper cost limit on the index scan.
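To see where the Limit node's numbers come from, here is a rough back-of-the-envelope check (an illustration, not something the original answer spells out): the Limit's upper cost is approximately the scan's startup cost plus the fraction of the full scan cost needed to produce 100 of the estimated 2,720,667 rows.
-- Illustrative arithmetic only, using the numbers reported in the plan above:
SELECT 0.43 + (125736.10 - 0.43) * (100.0 / 2720667) AS approx_limit_upper_cost;
-- returns roughly 5.05, matching the upper cost shown on the Limit node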

PostgreSQL index not used

I have a table with several million rows called item with columns that look like this:
CREATE TABLE item (
id bigint NOT NULL,
company_id bigint NOT NULL,
date_created timestamp with time zone,
....
)
There is an index on company_id
CREATE INDEX idx_company_id ON photo USING btree (company_id);
This table is often searched for the last 10 items for a certain customer, i.e.,
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;
Currently, one customer accounts for about 75% of the data in that table, and the other 25% is spread across 25 or so other customers; that is, 75% of the rows have a company_id of 5, and the rest have company_ids between 6 and 25.
The query generally runs very fast for all companies except the predominant one (id = 5). I can understand why, since the index on company_id can be used efficiently for all companies except 5.
I have experimented with different indexes to make this search more efficient for company 5. The one that seemed to make the most sense is
CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);
If I add this index, queries for the predominant company (id = 5) are greatly improved, but queries for all other companies go to crap.
Some results of EXPLAIN ANALYZE for company id 5 & 6 with and without the new index:
Company Id 5
Before new index
QUERY PLAN
Limit (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
-> Sort (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
Sort Key: photo_created
Sort Method: top-N heapsort Memory: 35kB
-> Seq Scan on photo (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 402513
Total runtime: 10482.075 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 26
Total runtime: 0.164 ms
Company Id 6
Before new index:
QUERY PLAN
Limit (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
-> Sort (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
Sort Key: photo_created
Sort Method: quicksort Memory: 28kB
-> Index Scan using idx_photo__company_id on photo (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
Index Cond: (company_id = 6)
Total runtime: 0.100 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
Filter: (company_id = 6)
Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms
I have run a full VACUUM and ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?
This is known as the "abort-early plan problem", and it's been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; see the pgsql mailing list discussions of abort-early plans for a more detailed explanation. Basically, the planner thinks it'll find the 10 rows you want for customer 6 without scanning very far into the date_created index, and it's wrong.
There isn't any hard-and-fast way to improve this query categorically prior to PostgreSQL 10 (now in beta). What you'll want to do is nudge the query planner in various ways in the hope of getting what you want. Primary methods include anything that makes PostgreSQL more likely to use multi-column indexes, such as:
lowering random_page_cost (a good idea anyway if you're on SSDs);
lowering cpu_index_tuple_cost (both settings are sketched just below).
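For illustration, these can be tried at the session level first (the specific values below are illustrative starting points, not recommendations from this answer), before persisting anything in postgresql.conf:
SET random_page_cost = 1.1;        -- default is 4.0; lower values make index scans look cheaper
SET cpu_index_tuple_cost = 0.001;  -- default is 0.005
EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 6 ORDER BY date_created LIMIT 10;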
It's also possible that you may be able to fix the planner behavior by playing with the table statistics. This includes:
raising the statistics target for the relevant columns and running ANALYZE again, in order to make PostgreSQL take more samples and get a better picture of row distribution;
increasing n_distinct in the stats to accurately reflect the number of company_ids or distinct created dates (both tweaks are sketched just below).
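A minimal sketch of those two statistics tweaks, assuming the item table from the question (the concrete values are illustrative only):
ALTER TABLE item ALTER COLUMN company_id SET STATISTICS 1000;    -- sample more rows than the default target of 100
ALTER TABLE item ALTER COLUMN company_id SET (n_distinct = 26);  -- tell the planner the true number of companies
ANALYZE item;                                                    -- recompute statistics with the new settings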
However, all of these solutions are approximate, and if query performance goes to heck as your data changes in the future, this should be the first query you look at.
In PostgreSQL 10 you'll be able to create cross-column statistics (CREATE STATISTICS), which should improve the situation more reliably. Depending on how broken this is for you, you could try using the beta.
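For reference, a hypothetical sketch of what that could look like on PostgreSQL 10+ (the statistics object name is invented, and functional-dependency statistics between company_id and date_created are an assumption about what would help here):
CREATE STATISTICS item_company_date_stats (dependencies)
    ON company_id, date_created
    FROM item;
ANALYZE item;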
If none of that works, I suggest the #postgresql IRC channel on Freenode or the pgsql-performance mailing list. Folks there will ask for your detailed table stats in order to make some suggestions.
Yet another point:
Why do you create the index
CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);
But then run:
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;
Maybe you meant:
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;
It is also better to create a combined index:
CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);
And after that:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1)
-> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1)
Index Cond: (company_id = 5)
Heap Fetches: 10
Planning time: 1.003 ms
Execution time: 0.209 ms
(6 rows)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1)
-> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1)
Index Cond: (company_id = 6)
Heap Fetches: 10
Planning time: 0.136 ms
Execution time: 0.155 ms
(6 rows)
On your server it might be slower, but in any case it should be much better than in the examples above.

Postgres JSONB timestamp query very slow compared to timestamp column query

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
    "lastUpdated": "2016-12-26T12:09:43.901Z",
    "lastUpdatedTimestamp": "1482754183"
}
The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.
I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):
explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';
This takes over 22 seconds:
Aggregate (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..842317.05 rows=591201 width=0)
(actual time=1.410..22142.046 rows=1773603 loops=1)
Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.234 ms
Execution time: 22292.671 ms
I've tried adding the following text index:
CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));
But the query is still rather slow, at over 17 seconds:
Aggregate (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
-> Bitmap Heap Scan on accounts (cost=18942.24..814070.64 rows=591201 width=0)
(actual time=1605.454..17036.081 rows=1773603 loops=1)
Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Heap Blocks: exact=28955 lossy=397518
-> Bitmap Index Scan on accounts_last_updated (cost=0.00..18794.44 rows=591201 width=0)
(actual time=1596.645..1596.645 rows=1773603 loops=1)
Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms
I've also tried following the directions in Create timestamp index from JSON on PostgreSQL and have tried creating the following function and index:
CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;
CREATE INDEX accounts_last_updated ON accounts
(text_to_timestamp(data->>'lastUpdated'));
But this doesn't give me any improvement; in fact, it was slower, taking over 24 seconds for the query versus 22 seconds for the unindexed version:
explain analyze SELECT count(*) FROM "accounts"
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';
Aggregate (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..1285717.79 rows=591201 width=0)
(actual time=4.044..23971.723 rows=1773603 loops=1)
Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms
In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':
alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);
This has given me by far the best performance:
explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';
Aggregate (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
-> Index Only Scan using accounts_updated_at on accounts
(cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms
However, I'd very much like to avoid adding another column just to improve the speed of this one query.
This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual column query (the last query where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes from 17 seconds to 24 seconds for me to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping that by using the text_to_timestamp function that it would improve the performance, but it hasn't been the case (or I'm doing something wrong).
In your first and second attempts, most of the execution time is spent on the index recheck or filter step, which has to read the JSON field for every row the index hits, and reading JSON is expensive. If the index hits only a couple of hundred rows the query will be fast, but if it hits thousands or hundreds of thousands of rows, filtering/rechecking the JSON field takes serious time. In the second attempt, wrapping the value in an additional function makes it even worse.
JSON fields are good for storing data, but they are not intended for analytic queries such as summaries and statistics, and it is bad practice to use JSON values in WHERE conditions, at least as the main filtering condition as in your case.
That last act of desperation of yours is the right way to go :)
To improve query performance, you should add one or several columns holding the key values that are used most often in WHERE conditions.
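If you go that route, one possible way (a sketch, not part of this answer) to keep the extra column in sync with the JSONB value is a trigger, reusing the text_to_timestamp function and the accounts/updated_at names from the question:
CREATE OR REPLACE FUNCTION accounts_sync_updated_at() RETURNS trigger AS $$
BEGIN
    -- copy the JSONB timestamp into the dedicated column on every write
    NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_sync_updated_at
    BEFORE INSERT OR UPDATE ON accounts
    FOR EACH ROW EXECUTE PROCEDURE accounts_sync_updated_at();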

How to optimize BETWEEN condition on big table in PostgreSQL

I have a big table (about ten million rows) and I need to perform a query of the form ? BETWEEN columnA AND columnB.
Script to create database with table and sample data:
CREATE DATABASE test;
\c test
-- Create test table
CREATE TABLE test (id INT PRIMARY KEY, range_start NUMERIC(12, 0), range_end NUMERIC(12, 0));
-- Fill the table with sample data
INSERT INTO test (SELECT value, value, value FROM (SELECT generate_series(1, 10000000) AS value) source);
-- Query I want to be optimized
SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end;
I want to create INDEX so that PostgreSQL can do fast INDEX SCAN instead of SEQ SCAN. However I failed with my initial (and obvious) attempts:
CREATE INDEX test1 ON test (range_start, range_end);
CREATE INDEX test2 ON test (range_start DESC, range_end);
CREATE INDEX test3 ON test (range_end, range_start);
Also note that the number in the query is specifically chosen to be in the middle of generated values (otherwise PostgreSQL is able to recognize that the value is near range boundary and perform some optimizations).
Any ideas or thoughts would be appreciated.
UPDATE 1 Based on the official documentation it seems that PostgreSQL is not able to properly use indexes for multicolumn inequality conditions. I am not sure why there is such limitation and if there is anything I can do to significantly speed up the query.
UPDATE 2 One possible approach would be to limit the INDEX SCAN by knowing the largest range I have; let's say it is 100000:
SELECT * FROM test WHERE range_start BETWEEN 4900000 AND 5000000 AND range_end > 5000000;
Why don't you try a range type with a GiST index?
alter table test add numr numrange;
update test set numr = numrange(range_start,range_end,'[]');
CREATE INDEX test_idx ON test USING gist (numr);
EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000.0 <@ numr;
Bitmap Heap Scan on public.test (cost=2367.92..130112.36 rows=50000 width=48) (actual time=0.150..0.151 rows=1 loops=1)
Output: id, range_start, range_end, numr
Recheck Cond: (5000000.0 <@ test.numr)
-> Bitmap Index Scan on test_idx (cost=0.00..2355.42 rows=50000 width=0) (actual time=0.142..0.142 rows=1 loops=1)
Index Cond: (5000000.0 <@ test.numr)
Total runtime: 0.189 ms
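A possible variation (an assumption, not part of the answer above): an expression GiST index avoids adding and maintaining the extra numr column, as long as queries repeat the exact same expression:
-- index the range built on the fly from the two columns
CREATE INDEX test_range_expr_idx ON test
    USING gist (numrange(range_start, range_end, '[]'));

EXPLAIN ANALYZE
SELECT * FROM test
WHERE numrange(range_start, range_end, '[]') @> 5000000.0;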
On second thought, it is quite obvious why PostgreSQL cannot use a multicolumn index for a two-column inequality condition. However, what I did not understand was why there is a SEQ SCAN even with a LIMIT clause (sorry for not expressing that in my question):
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.09 rows=1 width=16) (actual time=4743.035..4743.037 rows=1 loops=1)
-> Seq Scan on test (cost=0.00..213685.51 rows=2499795 width=16) (actual time=4743.032..4743.032 rows=1 loops=1)
Filter: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 4743.064 ms
Then it hit me that PostgreSQL cannot know that the result is less likely to be found near range_start = 1 than near range_start = 4999999. That is why it starts scanning from the first row until it finds matching row(s).
The solution might be to convince PostgreSQL that there is some benefit to using the index:
test=# EXPLAIN ANALYZE SELECT * FROM test WHERE 5000000 BETWEEN range_start AND range_end ORDER BY range_start DESC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.53 rows=1 width=16) (actual time=0.102..0.103 rows=1 loops=1)
-> Index Scan Backward using test1 on test (cost=0.00..3667714.71 rows=2403325 width=16) (actual time=0.099..0.099 rows=1 loops=1)
Index Cond: ((5000000::numeric >= range_start) AND (5000000::numeric <= range_end))
Total runtime: 0.125 ms
Quite a performance boost, I would say :). Still, this boost will only work if such a range exists; otherwise it will be as slow as a SEQ SCAN. So it might be good to combine this approach with what I outlined in the second update to my original question, as sketched below.
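For completeness, a sketch of that combination (100000 is the assumed maximum range width from UPDATE 2; adjust it to your data):
SELECT * FROM test
WHERE range_start BETWEEN 5000000 - 100000 AND 5000000
  AND range_end >= 5000000
ORDER BY range_start DESC
LIMIT 1;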

Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries

We have queries of the form
select sum(acol)
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol)
What index can be built to speed up the where clause?
A btree index created using
create index idx_01 using btree(xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol))
does not seem to be used at all.
EDIT
Setting enable_seqscan to off, the query using xpath_exists is much faster (by an order of magnitude), and the plan clearly shows the corresponding index being used (the btree index built on xpath_exists).
Any clue why PostgreSQL would not use the index and instead attempts a much slower sequential scan?
Since I do not want to disable sequential scanning globally, I am back to square one, and I happily welcome suggestions.
EDIT 2 - Explain plans
See below: the cost of the first plan (seqscan off) is slightly higher, but its processing time is much faster.
b2box=# set enable_seqscan=off;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.042..606.042 rows=1 loops=1)
-> Aggregate (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.039..606.039 rows=1 loops=1)
-> Bitmap Heap Scan on item (cost=1058.65..22701.38 rows=26102 width=0) (actual time=3.290..603.823 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
-> Bitmap Index Scan on item_counter_01 (cost=0.00..1052.13 rows=56515 width=0) (actual time=2.283..2.283 rows=4085 loops=1)
Index Cond: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) = true)
Total runtime: 606.136 ms
(7 rows)
plan on explain.depesz.com
b2box=# set enable_seqscan=on;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.163..10864.163 rows=1 loops=1)
-> Aggregate (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.160..10864.160 rows=1 loops=1)
-> Seq Scan on item (cost=0.00..22490.45 rows=26102 width=0) (actual time=33.574..10861.672 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
Rows Removed by Filter: 108945
Total runtime: 10864.242 ms
(6 rows)
plan on explain.depesz.com
Planner cost parameters
Cost of first plan (seqscan off) is slightly higher but processing time much faster
This tells me that your random_page_cost and seq_page_cost are probably wrong. You're likely on storage with fast random I/O - either because most of the database is cached in RAM or because you're using SSD, SAN with cache, or other storage where random I/O is inherently fast.
Try:
SET random_page_cost = 1;
SET seq_page_cost = 1.1;
to greatly reduce the cost parameter differences, and then re-run. If that does the job, consider changing those params in postgresql.conf.
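If the session-level test works out, the settings can also be persisted from SQL (assuming PostgreSQL 9.4 or later; on older 9.x releases edit postgresql.conf by hand):
ALTER SYSTEM SET random_page_cost = 1;
ALTER SYSTEM SET seq_page_cost = 1.1;
SELECT pg_reload_conf();  -- reload the configuration so the new values take effect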
Your row-count estimates are reasonable, so it doesn't look like a planner mis-estimation problem or a problem with bad table statistics.
Incorrect query
Your query is also incorrect. OFFSET 0 LIMIT 1 without an ORDER BY will produce unpredictable results unless you're guaranteed to have exactly one match, in which case the OFFSET ... LIMIT ... clauses are unnecessary and can be removed entirely.
You're usually much better off phrasing such queries as SELECT max(...) or SELECT min(...) where possible; PostgreSQL will tend to be able to use an index to just pluck off the desired value without doing an expensive table scan or an index scan and sort.
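As an illustration only (the question doesn't show which column would be aggregated, so id here is an assumption), the min()/max() phrasing would look like:
SELECT max(id)
FROM B2HEAD.item
WHERE cluster = 'B2BOX'
  AND xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content);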
Tips
BTW, for future questions the PostgreSQL wiki has some good information in the performance category and a guide to asking Slow query questions.