Covering index in Postgres 12 - postgresql

My table has 5500000 test records (Postgres 12).
I have an uncertainty because of big costs when I'm using a covering index on my tables.
So, I create index:
create index concurrently idx_log_type
on logs (log_type)
include (log_body);
After I execute a request:
explain (analyze)
select log_body
from logs where log_type = 1
limit 100;
I have getting a result:
Limit (cost=0.43..5.05 rows=100 width=33) (actual time=0.118..0.138 rows=100 loops=1)
- -> Index Only Scan using idx_log_type on logs (cost=0.43..125736.10 rows=2720667 width=33) (actual time=0.107..0.118 rows=100 loops=1)
Index Cond: (log_type = 1)
Heap Fetches: 0
Planning Time: 0.558 ms
Execution Time: 0.228 ms
At first glance, this is so good, but the cost range is very large.
Is this normal or should you optimize?

That's an artifact of EXPLAIN: it shows the cost for the complete index scan, just as if there were no LIMIT. But the whole plan is priced correctly.
So you can ignore the high upper cost limit on the index scan.

Related

Postgres: how do you optimize queries on date column with low selectivity?

I have a table with 143 million rows (and growing), its current size is 107GB. One of the columns in the table is of type date and it has low selectivity. For any given date, its reasonable to assume that there are somewhere between 0.5 to 4 million records with the same date value.
Now, if someone tries to do something like this:
select * from large_table where date_column > '2020-01-01' limit 100
It will execute "forever", and if you EXPLAIN ANALYZE it, you can see that its doing a table scan. So the first (and only so far) idea is to try and make this into an index scan. If Postgres can scan a subsection of an index and return the "limit" number of records, it sounds fast to me:
create index our_index_on_the_date_column ON large_table (date_column DESC);
VACUUM ANALYZE large_table;
EXPLAIN ANALYZE select * from large_table where date_column > '2020-01-01' limit 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..37.88 rows=100 width=893) (actual time=0.034..13.520 rows=100 loops=1)
-> Seq Scan on large_table (cost=0.00..13649986.80 rows=36034774 width=893) (actual time=0.033..13.506 rows=100 loops=1)
Filter: (date_column > '2020-01-01'::date)
Rows Removed by Filter: 7542
Planning Time: 0.168 ms
Execution Time: 18.412 ms
(6 rows)
It still reverts to a sequential scan. Please disregard the execution time as this took 11 minutes before caching came into action. We can force it to use the index, by reducing the number of returned columns to what's being covered by the index:
select date_column from large_table where date_column > '2019-01-15' limit 100
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..3.42 rows=100 width=4) (actual time=0.051..0.064 rows=100 loops=1)
-> Index Only Scan using our_index_on_the_date_column on large_table (cost=0.57..907355.11 rows=31874888 width=4) (actual time=0.050..0.056 rows=100 loops=1)
Index Cond: (date_column > '2019-01-15'::date)
Heap Fetches: 0
Planning Time: 0.082 ms
Execution Time: 0.083 ms
(6 rows)
But this is of course a contrived example, since the table is very wide and covering all parts of the table in the index is not feasible.
So, anyone who can share some guidance on how to get some performance when using columns with low selectivity as predicates?

Postgres not using Covering Index with Aggregate

Engine version: 12.4
Postgres wasn't using index only scan, then I ran vacuum analyze verbose table_name. After that it started using index only scan. Earlier when I had ran analyze verbose table_name without vacuum that time index only scan wasn't used.
So it means there is very heavy dependency on vacuum to use index only plan. Is there any way to eliminate this dependency OR should we configure vacuum very regularly? frequency like daily.
Our objective is to reduce cpu usage.. Overall machine cpu usage is 10%-15% throughout the day but when this query runs then cpu goes very high( this query runs in multiple threads at same time with diff values)
EXPLAIN ANALYZE SELECT COALESCE(requested_debit, 0) AS requestedDebit, COALESCE(requested_credit, 0) AS requestedCredit
FROM (SELECT COALESCE(Sum(le.credit), 0) AS requested_credit, COALESCE(Sum(le.debit), 0) AS requested_debit
FROM ledger_entries le
WHERE le.accounting_entity_id = 1
AND le.general_ledger_id = 503
AND le.post_date BETWEEN '2020-09-10' AND '2020-11-30') AS requested_le;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on requested_le (cost=66602.65..66602.67 rows=1 width=64) (actual time=81.263..81.352 rows=1 loops=1)
-> Finalize Aggregate (cost=66602.65..66602.66 rows=1 width=64) (actual time=81.261..81.348 rows=1 loops=1)
-> Gather (cost=66602.41..66602.62 rows=2 width=64) (actual time=79.485..81.331 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=65602.41..65602.42 rows=1 width=64) (actual time=74.293..74.294 rows=1 loops=3)
-> Parallel Index Only Scan using post_date_gl_id_accounting_entity_id_include_idx on ledger_entries le (cost=0.56..65203.73 rows=79735 width=8) (actual time=47.874..74.212 rows=197 loops=3)
Index Cond: ((post_date >= '2020-09-10'::date) AND (post_date <= '2020-11-30'::date) AND (general_ledger_id = 503) AND (accounting_entity_id = 1))
Heap Fetches: 0
Planning Time: 0.211 ms
Execution Time: 81.395 ms
(11 rows)
There is a very strong connection between VACUUM and index-only scans in PostgreSQL: an index-only scan can only skip fetching the table row (to check for visibility) if the block containing the row is marked all-visible in the visibility map. And the visibility map is updated by VACUUM.
So yes, you have to VACUUM often to get efficient index-only scans.
Typically, there is no need to schedule a manual VACUUM, you can simply
ALTER TABLE mytab SET (autovacuum_vacuum_scale_factor = 0.01);
(or a similar value) and let autovacuum do the job.
The only case where this is problematic are insert-only tables, because for them autovacuum won't be triggered for PostgreSQL versions below v13. In v13, you can simply change autovacuum_vacuum_insert_scale_factor, while in older versions you will have to set autovacuum_freeze_max_age to a low value for that table.

Inordinately slow Nested Loop with join on simple query

I'm running the query below against the primary key lt_id (no other index bar the pkey btree) and joining against 1000 ids.
It might be just my lack of experience with postgres but it seems like it's maybe an order of magnitude slow.. There are 800k rows in the table in total.
This is a low spec machine(4G mem) but still thought it should be faster. CPU is idle.
EXPLAIN (ANALYZE,BUFFERS) SELECT lt_id FROM "mytable" d INNER JOIN ( VALUES (1839147),(...998 more rows here...),(1756908)) v(id) ON (d.lt_id = v.id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..7743.00 rows=1000 width=4) (actual time=69.852..20743.393 rows=1000 loops=1)
Buffers: shared hit=2395 read=1607
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.004..4.770 rows=1000 loops=1)
-> Index Only Scan using lt_id_idx on mytable d (cost=0.42..7.73 rows=1 width=4) (actual time=20.732..20.732 rows=1 loops=1000)
Index Cond: (lt_id = "*VALUES*".column1)
Heap Fetches: 1000
Buffers: shared hit=2395 read=1607
Planning Time: 86.284 ms
Execution Time: 20744.223 ms
(9 rows)
psql 11.7 , I was using 9 but upgraded to 11.7 , no real difference in speed observed.
free
total used free shared buff/cache available
Mem: 3783732 158076 3400932 55420 224724 3366832
Swap: 0 0 0
Even though it's low spec should it really be taking 20 seconds? In fact many other queries are taking twice as long or more. 20 seconds seems to be the best case scenario. There are a couple of other text columns in the table with some small text articles which I doubt is the issue.
I was previously using IN operator but observed similar or worse speeds.
I also made a couple of small changes from the default config, but it doesn't seem to make much difference.
work_mem = 32MB
shared_buffers = 512MB
Any ideas if this is expected performance given the machine? Or is there something else I can try?
edit: I guess what I'm curious about it the time in the actual loop
actual time=20.732..20.732 rows=1 loops=1000
It seems like the actual time is less than or equal 1ms per loop which in worst case would be less than 1 second for 1000 iterations and other operations also seem negligible. Does this mean the issue is simple IO ? slow disk ? What would typically be the situation here.
I notice if I run the query on my desktop which only has 8G ram but is using an SSD the query is massively faster..
Using an SSD is fine of course but I'd like to know if something in my config or query/setup is not optimal..
As #pifor suggested, set track_io_timing=on , can see that this is indeed almost entirely IO slowness..
Nested Loop (cost=0.42..7743.00 rows=1000 width=69) (actual time=0.026..14901.004 rows=1000 loops=1)
Buffers: shared hit=2859 read=1145
I/O Timings: read=14861.578
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.002..5.497 rows=1000 loops=1)
-> Index Scan using mytable_pkey on mytable d (cost=0.42..7.73 rows=1 width=69) (actual time=14.888..14.888 rows=1 loops=1000)
Index Cond: (lt_id = "*VALUES*".column1)
Buffers: shared hit=2859 read=1145
I/O Timings: read=14861.578
Planning Time: 0.420 ms
Execution Time: 14901.734 ms
(10 rows)

PostgresQL index not used

I have a table with several million rows called item with columns that look like this:
CREATE TABLE item (
id bigint NOT NULL,
company_id bigint NOT NULL,
date_created timestamp with time zone,
....
)
There is an index on company_id
CREATE INDEX idx_company_id ON photo USING btree (company_id);
This table is often searched for the last 10 items for a certain customer, i.e.,
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;
Currently, there is one customer that accounts for about 75% of the data in that table, the other 25% of the data is spread across 25 or so other customers, meaning that 75% of the rows have a company id of 5, the other rows have company ids between 6 and 25.
The query generally runs very fast for all companies except the predominant one (id = 5). I can understand why since the index on company_id can be used for companies except 5.
I have experimented with different indexes to make this search more efficient for company 5. The one that seemed to make the most sense is
CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);
If I add this index, queries for the predominant company (id = 5) are greatly improved, but queries for all other companies go to crap.
Some results of EXPLAIN ANALYZE for company id 5 & 6 with and without the new index:
Company Id 5
Before new index
QUERY PLAN
Limit (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
-> Sort (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
Sort Key: photo_created
Sort Method: top-N heapsort Memory: 35kB
-> Seq Scan on photo (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 402513
Total runtime: 10482.075 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
Filter: (company_id = 5)
Rows Removed by Filter: 26
Total runtime: 0.164 ms
Company Id 6
Before new index:
QUERY PLAN
Limit (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
-> Sort (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
Sort Key: photo_created
Sort Method: quicksort Memory: 28kB
-> Index Scan using idx_photo__company_id on photo (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
Index Cond: (company_id = 6)
Total runtime: 0.100 ms
After new index:
QUERY PLAN
Limit (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
-> Index Scan using idx_photo__photo_created on photo (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
Filter: (company_id = 6)
Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms
I have run a full VACUUM and ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?
This is known as the "abort-early plan problem", and it's been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; see that linked mailing list thread for a more detailed explanation. Basically, the planner thinks it'll find the 10 rows you want for customer 6 without scanning the whole date_created index, and it's wrong.
There isn't any hard-and-fast way to improve this query categorically prior to PostgreSQL 10 (not in beta). What you'll want to do is nudge the query planner in various ways in hopes of getting what you want. Primary methods include anything which makes PostgreSQL more likely to use multi-column indexes, such as:
lowering random_page_cost (a good idea anyway if you're on SSDs).
lowering cpu_index_tuple_cost
It's also possible that you may be able to fix the planner behavior by playing with the table statistics. This includes:
raising statistics_target for the table and running ANALYZE again, in order to make PostgreSQL take more samples and get a better picture of row distribution;
increasing n_distinct in the stats to accurately reflect the number of customer_ids or different created_dates.
However, all of these solutions are approximate, and if query performance goes to heck as your data changes in the future, this should be the first query you look at.
In PostgreSQL 10, you'll be able to create Cross-Column Stats which should improve the situation more reliably. Depending on how broken this is for you, you could try using the beta.
If none of that works, I suggest the #postgresql IRC channel on Freenode or the pgsql-performance mailing list. Folks there will ask for your detailed table stats in order to make some suggestions.
Yet another point:
Why do you create index
CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);
But call:
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;
May be you mean
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;
Also is better to create combine index:
CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);
And after that:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1)
-> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1)
Index Cond: (company_id = 5)
Heap Fetches: 10
Planning time: 1.003 ms
Execution time: 0.209 ms
(6 rows)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1)
-> Index Only Scan using idx_company_id_date_created on item (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1)
Index Cond: (company_id = 6)
Heap Fetches: 10
Planning time: 0.136 ms
Execution time: 0.155 ms
(6 rows)
On your server it might be slower but in any case much better than in above examples.

Postgresql 9.x: Index to optimize `xpath_exists` (XMLEXISTS) queries

We have queries of the form
select sum(acol)
where xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol)
What index can be built to speed up the where clause ?
A btree index created using
create index idx_01 using btree(xpath_exists('/Root/KeyValue[Key="val"]/Value//text()', xmlcol))
does not seem to be used at all.
EDIT
Setting enable_seqscan to off, the query using xpath_exists is much faster (one order of magnitude) and clearly shows using the corresponding index (the btree index built with xpath_exists).
Any clue why PostgreSQL would not be using the index and attempt a much slower sequential scan ?
Since I do not want to disable sequential scanning globally, I am back to square one and I am happily welcoming suggestions.
EDIT 2 - Explain plans
See below - Cost of first plan (seqscan off) is slightly higher but processing time much faster
b2box=# set enable_seqscan=off;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.042..606.042 rows=1 loops=1)
-> Aggregate (cost=22766.63..22766.64 rows=1 width=0) (actual time=606.039..606.039 rows=1 loops=1)
-> Bitmap Heap Scan on item (cost=1058.65..22701.38 rows=26102 width=0) (actual time=3.290..603.823 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
-> Bitmap Index Scan on item_counter_01 (cost=0.00..1052.13 rows=56515 width=0) (actual time=2.283..2.283 rows=4085 loops=1)
Index Cond: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) = true)
Total runtime: 606.136 ms
(7 rows)
plan on explain.depesz.com
b2box=# set enable_seqscan=on;
SET
b2box=# explain analyze
Select count(*)
from B2HEAD.item
where cluster = 'B2BOX' and ( ( xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()', content) ) ) offset 0 limit 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.163..10864.163 rows=1 loops=1)
-> Aggregate (cost=22555.71..22555.72 rows=1 width=0) (actual time=10864.160..10864.160 rows=1 loops=1)
-> Seq Scan on item (cost=0.00..22490.45 rows=26102 width=0) (actual time=33.574..10861.672 rows=4085 loops=1)
Filter: (xpath_exists('/MessageInfo[FinalRecipient="ABigBank"]//text()'::text, content, '{}'::text[]) AND ((cluster)::text = 'B2BOX'::text))
Rows Removed by Filter: 108945
Total runtime: 10864.242 ms
(6 rows)
plan on explain.depesz.com
Planner cost parameters
Cost of first plan (seqscan off) is slightly higher but processing time much faster
This tells me that your random_page_cost and seq_page_cost are probably wrong. You're likely on storage with fast random I/O - either because most of the database is cached in RAM or because you're using SSD, SAN with cache, or other storage where random I/O is inherently fast.
Try:
SET random_page_cost = 1;
SET seq_page_cost = 1.1;
to greatly reduce the cost param differences and then re-run. If that does the job consider changing those params in postgresql.conf..
Your row-count estimates are reasonable, so it doesn't look like a planner mis-estimation problem or a problem with bad table statistics.
Incorrect query
Your query is also incorrect. OFFSET 0 LIMIT 1 without an ORDER BY will produce unpredictable results unless you're guaranteed to have exactly one match, in which case the OFFSET ... LIMIT ... clauses are unnecessary and can be removed entirely.
You're usually much better off phrasing such queries as SELECT max(...) or SELECT min(...) where possible; PostgreSQL will tend to be able to use an index to just pluck off the desired value without doing an expensive table scan or an index scan and sort.
Tips
BTW, for future questions the PostgreSQL wiki has some good information in the performance category and a guide to asking Slow query questions.