Large COUNT DISTINCT performs slowly in PostgreSQL

I am running a big COUNT(DISTINCT) GROUP BY query against a table on PostgreSQL 12. The table is roughly 32 GB, 300 million rows. It is partitioned by year. The groups are more or less evenly distributed:
EXPLAIN (ANALYZE,BUFFERS)
SELECT
date_trunc('month', condition_start_date::timestamp) as dt,
condition_source_value,
COUNT(DISTINCT person_id)
FROM synpuf5.condition_occurrence_yrpart
GROUP BY date_trunc('month', condition_start_date::timestamp), condition_source_value
ORDER BY COUNT(DISTINCT person_id) DESC LIMIT 10;
Here is the output of the query planner:
QUERY PLAN
Limit (cost=50052765961.82..50052765961.85 rows=10 width=21) (actual time=691022.306..691022.308 rows=10 loops=1)
Buffers: shared hit=3062256 read=222453
-> Sort (cost=50052765961.82..50052777188.87 rows=4490820 width=21) (actual time=690786.364..690786.364 rows=10 loops=1)
Sort Key: (count(DISTINCT condition_occurrence_yrpart_2007.person_id)) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=3062256 read=222453
-> GroupAggregate (cost=50049709699.80..50052668916.82 rows=4490820 width=21) (actual time=567099.326..690705.612 rows=360849 loops=1)
Group Key: (date_trunc('month'::text, (condition_occurrence_yrpart_2007.condition_start_date)::timestamp without time zone)), condition_occurrence_yrpart_2007.condition_source_value
Buffers: shared hit=3062253 read=222453
-> Sort (cost=50049709699.80..50050432663.48 rows=289185472 width=17) (actual time=567098.345..619461.044 rows=289182385 loops=1)
Sort Key: (date_trunc('month'::text, (condition_occurrence_yrpart_2007.condition_start_date)::timestamp without time zone)), condition_occurrence_yrpart_2007.condition_source_value
Sort Method: quicksort Memory: 30333184kB
Buffers: shared hit=3062246 read=222453
-> Append (cost=10000000000.00..50009068412.44 rows=289185472 width=17) (actual time=0.065..74222.771 rows=289182385 loops=1)
Buffers: shared hit=3062240 read=222453
-> Seq Scan on condition_occurrence_yrpart_2007 (cost=10000000000.00..10000001125.61 rows=42774 width=17) (actual time=0.064..13.756 rows=42774 loops=1)
Buffers: shared read=484
-> Seq Scan on condition_occurrence_yrpart_2008 (cost=10000000000.00..10002732063.72 rows=103678448 width=17) (actual time=0.039..21209.532 rows=103676930 loops=1)
Buffers: shared hit=954918 read=221969
-> Seq Scan on condition_occurrence_yrpart_2009 (cost=10000000000.00..10003024874.44 rows=114743696 width=17) (actual time=0.142..20191.131 rows=114743002 loops=1)
Buffers: shared hit=1303719
-> Seq Scan on condition_occurrence_yrpart_2010 (cost=10000000000.00..10001864406.36 rows=70720224 width=17) (actual time=0.050..12464.117 rows=70719679 loops=1)
Buffers: shared hit=803603
-> Seq Scan on condition_occurrence_yrpart_2011 (cost=10000000000.00..10000000014.95 rows=330 width=17) (actual time=0.022..0.022 rows=0 loops=1)
I have also heavily configured my postgresql to attempt to fit all the data in memory, including:
shared_buffers = 80GB
work_mem = 32GB
max_worker_processes = 32
max_parallel_workers_per_gather = 16
max_parallel_workers = 32
wal_compression = on
max_wal_size = 8GB
enable_seqscan = off
enable_partitionwise_join = on
enable_partitionwise_aggregate = on
parallel_tuple_cost = 0.01
parallel_setup_cost = 100.0
shared_preload_libraries = 'pg_prewarm'
effective_cache_size = 192GB
The VM I am running on is quite a behemoth: 256 GB RAM, 32 cores, and an SSD that houses the PostgreSQL data directory.
Several questions here:
Why is it so slow?
Why is it not operating in parallel?
Why does performance not increase when I run again despite pg_prewarm?
Why is memory released when my session ends, even though I am using pg_prewarm?

Why is it so slow?
Sorting 300 million rows takes a while, even with generous work_mem. More than 9 minutes of the query execution time are spent sorting for the GROUP BY.
Why is it not operating in parallel?
Because sorting cannot be parallelized in PostgreSQL.
Why does performance not increase when I run again despite pg_prewarm?
Because everything is already cached.
Why is memory released when my session ends, even though I am using pg_prewarm?
The memory that your backend uses certainly will be freed when your session ends. The memory used for shared_buffers will not be freed, because that is the cache shared by all processes in the database. You don't want that memory freed.
This is a heavy query, and it takes some time. I don't think that can be improved.
You don't tell us what the partitioning expression is, but since it probably is not date_trunc('month', condition_start_date::timestamp), you don't get partitionwise aggregation despite enable_partitionwise_aggregate = on. PostgreSQL is not smart enough to infer that it could actually do that (assuming you partition on condition_start_date).

It is slow because doing stuff with 300 million rows takes some time.
It is not operating in parallel, I think, because the COUNT(DISTINCT ...) code is very old and has not seen much attention lately. It doesn't know how to use hash aggregation, nor how to operate in parallel. (In my hands, if I lower parallel_tuple_cost all the way to zero, it does operate in parallel, but the Gather sits below the massive sort and doesn't do any good. But I'm not working with your real data, so you could get different results.)
You can get around the inflexibility of COUNT(DISTINCT...) by doing the DISTINCT and the COUNT in separate steps:
select dt, condition_source_value, count(person_id) from (
SELECT distinct
date_trunc('month', condition_start_date::timestamp) as dt,
condition_source_value,
person_id
FROM condition_occurrence_yrpart
) foo
GROUP BY dt, condition_source_value
ORDER BY COUNT(person_id) DESC LIMIT 10;
It still might not do the parallelization in the right spot, though.
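If you want to experiment with the parallelism point above, the cost settings mentioned earlier can be lowered for just your session; a hedged sketch (the worker count is an illustrative value, and as noted, the Gather may still land in an unhelpful spot):
SET parallel_tuple_cost = 0;               -- the setting mentioned above; makes parallel plans look cheaper
SET max_parallel_workers_per_gather = 8;   -- illustrative value, not a recommendation
-- then re-run EXPLAIN (ANALYZE, BUFFERS) on the rewritten query above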

I would rewrite the query like this:
SELECT date_trunc('month', condition_start_date::timestamp) as dt
, condition_source_value
, CNT
FROM (
SELECT CONDITION_START_DATE
, condition_source_value
, COUNT(PERSON_ID) AS CNT
FROM (SELECT CONDITION_START_DATE, CONDITION_SOURCE_VALUE, PERSON_ID
FROM synpuf5.condition_occurrence_yrpart
GROUP BY CONDITION_START_DATE, CONDITION_SOURCE_VALUE, PERSON_ID) A
GROUP BY CONDITION_START_DATE, CONDITION_SOURCE_VALUE
ORDER BY COUNT(PERSON_ID) DESC
LIMIT 10) B;
My answer is almost the same as JJANES' answer, but I think you had better reduce the number of DATE_TRUNC function calls. In the revised query, PostgreSQL would run DATE_TRUNC just 10 times.
The following URL explains the tuning concept behind this query.
https://blog.naver.com/naivety1/222371990077

Related

Improve Postgres performance

I am new to Postgres and I'm sure I'm doing something wrong.
So I just wondered if anybody had experienced something similar to my experiences below or could point me in the right direction to improve Postgres performance.
My initial goal was to speed up the analytical processing of my Datamarts in various Dashboards by moving from MS SQL Server to Postgres.
To get a sample query to compare speeds I ran query profiler on MS SQL Server whilst referencing a BI dashboard, which produced something similar to this (I know there are redundant columns in the sub query):
SELECT COUNT(*)
FROM (
SELECT
BM.Key_Date, BM.[Actual Date], BM.[Month]
,BM.[Month Number], BM.[Month Year], BM.[No of Working Days]
,SDI.Key_Delivery, SDI.[Order Number], SDI.[Quantity SKU]
,SDI.[Quantity Sales Unit], SDI.[FactSales - GBP], SDI.[NNSA Capsules]
,SFI.[Ship-to], SFI.[Sold-to], SFI.[Sales Force Type], SFI.Region
,SFI.[Top Level Account], SFI.[Customer Organisation]
,EX.Rate
,PDI.[Product Description], PDI.[Product Type Group], PDI.[Product Type],
PDI.[Main Product Categories], PDI.Section, PDI.Family
FROM Fact.SalesDataInvoiced AS SDI
JOIN Dimension.SalesforceInvoiced AS SFI
ON SDI.[Key_Ship-to]=SFI.[Key_Ship-to]
JOIN Dimension.BillingMonth AS BM
ON SDI.[Key_Billing Month]=BM.Key_Date
JOIN Dimension.ProductDataInvoiced AS PDI
ON SDI.[Key_Product Code]=PDI.[Key_Product Code]
CROSS JOIN Dimension.Exchange AS EX
WHERE BM.[Actual Date] BETWEEN '20160101' AND '20211001'
) AS a
GROUP BY [Product Type], [Product Type Group],[Main Product Categories]
I then installed Postgres 14 (on CentOS 8) and MS SQL Server Developer 2017 (on Windows 10) on separate, identical laptops and created a database and tables from the same CSV data files to enable replication of the above query.
Running a Postgres query with indexing performs massively slower than MS SQL without indexing.
Adding indexes to MS SQL produces results almost instantly.
Because of the difference in processing time I even installed Citus with Postgres14 and created Fact.SalesDataInvoiced as a columnar table (This made the processing time worse).
I have played about with memory settings in postgresql.conf but nothing seems to enable speeds comparable to MSSQL.
Explain Analyze shows that despite the indexes it always runs a sequential scan of all tables. Forcing indexed scans doesn't make any difference to processing time.
Would I be right in thinking Postgres would perform significantly better using a cluster and partitioning? Even if this is the case, surely a simple query like the one I'm trying to run on a standalone machine should be faster?
TABLE DETAILS
Dimension.BillingMonth
Records 120,
Primary Key is KeyDate,
Clustered Unique Index on KeyDate
Dimension.Exchange
Records 1
Dimension.ProductDataInvoiced
Records 275563,
Primary Key is KeyProduct,
Clustered Unique Index on KeyProduct
Dimension.SalesforceInvoiced
Records 377414,
Primary Key is KeyShipTo,
Clustered Unique Index on KeyShipTo
Fact.SalesDataInvoiced
Records 43807943,
Non-Clustered Unique Index on KeyShipTo, KeyProduct, KeyBillingMonth
Any help would be appreciated; as previously mentioned, I'm sure I must be missing something obvious.
Many thanks in advance.
David
Thank you for the responses. I have placed additional info below.
Forgot to add that my Postgres performance woes came after I'd carried out a full VACUUM and REINDEX. I performed these maintenance tasks after I had imported the data and created my indexes.
Output after querying pg_indexes:

tablename           | indexname                | indexdef
BillingMonth        | BillingMonth_pkey        | CREATE UNIQUE INDEX BillingMonth_pkey ON public.BillingMonth USING btree (KeyDate)
ProductDataInvoiced | ProductDataInvoiced_pkey | CREATE UNIQUE INDEX ProductDataInvoiced_pkey ON public.ProductDataInvoiced USING btree (KeyProductCode)
SalesforceInvoiced  | SalesforceInvoiced_pkey  | CREATE UNIQUE INDEX SalesforceInvoiced_pkey ON public.SalesforceInvoiced USING btree (KeyShipTo)
SalesDataInvoiced   | CI_SalesData             | CREATE INDEX CI_SalesData ON public.SalesDataInvoiced USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Output After running EXPLAIN (ANALYZE, BUFFERS)
Finalize GroupAggregate (cost=1435439.30..1435565.71 rows=480 width=53) (actual time=25960.468..25973.326 rows=31 loops=1)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Buffers: shared hit=71246 read=859119
-> Gather Merge (cost=1435439.30..1435551.31 rows=960 width=53) (actual time=25960.458..25973.282 rows=89 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=71246 read=859119
-> Sort (cost=1434439.28..1434440.48 rows=480 width=53) (actual time=25956.982..25956.989 rows=30 loops=3)
Sort Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=71246 read=859119
Worker 0: Sort Method: quicksort Memory: 29kB
Worker 1: Sort Method: quicksort Memory: 29kB
-> Partial HashAggregate (cost=1434413.10..1434417.90 rows=480 width=53) (actual time=25956.878..25956.895 rows=30 loops=3)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Batches: 1 Memory Usage: 49kB
Buffers: shared hit=71230 read=859119
Worker 0: Batches: 1 Memory Usage: 49kB
Worker 1: Batches: 1 Memory Usage: 49kB
-> Parallel Hash Join (cost=62124.74..1327935.46 rows=10647764 width=45) (actual time=285.864..19240.004 rows=14602648 loops=3)
Hash Cond: (sdi."KeyShipTo" = sfi."KeyShipTo")
Buffers: shared hit=71230 read=859119
-> Hash Join (cost=19648.48..1257508.51 rows=10647764 width=49) (actual time=204.794..12862.063 rows=14602648 loops=3)
Hash Cond: (sdi."KeyProductCode" = pdi."KeyProductCode")
Buffers: shared hit=32264 read=859119
-> Hash Join (cost=3.67..1091456.95 rows=10647764 width=8) (actual time=0.143..7076.104 rows=14602648 loops=3)
Hash Cond: (sdi."KeyBillingMonth" = bm."KeyDate")
Buffers: shared hit=197 read=859119
-> Parallel Seq Scan on "SalesData_Invoiced" sdi (cost=0.00..1041846.10 rows=18253310 width=12) (actual
time=0.071..2585.596 rows=14602648 loops=3)
Buffers: shared hit=194 read=859119
-> Hash (cost=2.80..2.80 rows=70 width=4) (actual time=0.049..0.050 rows=70 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
Buffers: shared hit=3
-> Seq Scan on "BillingMonth" bm (cost=0.00..2.80 rows=70 width=4) (actual time=0.012..0.028
rows=70 loops=3)
Filter: (("ActualDate" >= '2016-01-01'::date) AND ("ActualDate" <= '2021-10-01'::date))
Rows Removed by Filter: 50
Buffers: shared hit=3
-> Hash (cost=16200.27..16200.27 rows=275563 width=49) (actual time=203.237..203.238 rows=275563 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 26832kB
Buffers: shared hit=32067
-> Nested Loop (cost=0.00..16200.27 rows=275563 width=49) (actual time=0.034..104.143 rows=275563 loops=3)
Buffers: shared hit=32067
-> Seq Scan on "Exchange" ex (cost=0.00..1.01 rows=1 width=0) (actual time=0.024..0.024 rows=
1 loops=3)
Buffers: shared hit=3
-> Seq Scan on "ProductData_Invoiced" pdi (cost=0.00..13443.63 rows=275563 width=49) (actual
time=0.007..48.176 rows=275563 loops=3)
Buffers: shared hit=32064
-> Parallel Hash (cost=40510.56..40510.56 rows=157256 width=4) (actual time=79.536..79.536 rows=125805 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 18912kB
Buffers: shared hit=38938
-> Parallel Seq Scan on "Salesforce_Invoiced" sfi (cost=0.00..40510.56 rows=157256 width=4) (actual time=
0.011..42.968 rows=125805 loops=3)
Buffers: shared hit=38938
Planning:
Buffers: shared hit=426
Planning Time: 1.936 ms
Execution Time: 25973.709 ms
(55 rows)
Firstly, remember to run VACUUM ANALYZE after rebuilding indexes, or sometimes after importing a large amount of data. (VACUUM FULL is mainly useful for returning disk space to the OS, and you'd still need to analyse afterwards, especially after rebuilding indexes.)
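For example (VACUUM ANALYZE with no argument covers the whole database; the quoted table name is the large fact table as it appears in the plan above):
VACUUM ANALYZE;                        -- whole database
VACUUM ANALYZE "SalesData_Invoiced";   -- or just the fact table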
It seems from your query that your main table is SalesDataInvoiced (SDI) and that you'd want to use an index on KeyBillingMonth if possible (since it's the main restriction you're placing). In general, you'd also want indexes, at least on the other tables on the columns that are used for the joins.
As the documentation for multi-column indexes in PostgreSQL says:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a, but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.
In your example, the main column you'd want to use a constraint on (KeyBillingMonth) is in third position, so it's unlikely to be used.
CREATE INDEX CI_SalesData ON public.SalesDataInvoiced
USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Creating this should make it more likely to be used:
CREATE INDEX ON SalesDataInvoiced(KeyBillingMonth);
Then, run VACUUM ANALYZE and try your query again.
You may also want an index on BillingMonth(ActualDate), but that's not necessarily useful since there seems to be few rows (and most of them are returned in your query).
It's not clear what the BillingMonth table is for. If it's basically about truncating the ActualDate to have the first day of the month, you could for example get rid of the join on BillingMonth and use the constraint on SalesDataInvoiced.KeyBillingMonth directly. For example ... WHERE SDI.KeyBillingMonth BETWEEN '2016-01-01' AND '2021-10-01' ....
As a side-note, as far as I know, BETWEEN is inclusive for its upper bound. I'd imagine a query like this is meant to represent some monthly statistics, hence should probably not include what's on 2021-10-01 (but not the rest of that month).
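Putting those two points together, here is a hedged sketch of the simplified query, using the Postgres table and column names that appear in the plan above and a half-open date range so October 2021 is not partially included (this assumes KeyBillingMonth is itself a date):
SELECT COUNT(*)
FROM (
    SELECT pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
    FROM "SalesData_Invoiced" sdi
    JOIN "Salesforce_Invoiced" sfi ON sdi."KeyShipTo" = sfi."KeyShipTo"
    JOIN "ProductData_Invoiced" pdi ON sdi."KeyProductCode" = pdi."KeyProductCode"
    CROSS JOIN "Exchange" ex
    WHERE sdi."KeyBillingMonth" >= '2016-01-01'
      AND sdi."KeyBillingMonth" <  '2021-10-01'   -- half-open range: excludes October 2021 entirely
) a
GROUP BY "ProductType", "ProductTypeGroup", "MainProductCategories";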

Efficient querying/indexing in Postgres for WHERE IN (...) and ORDER BY

I have a table of posts, and each post belongs to a classroom. I want to be able to query for most recent posts across several classrooms, like this:
SELECT * FROM posts
WHERE posts.classroom_id IN (6691, 6693, 6695, 6702)
ORDER BY date desc, created_at desc
LIMIT 30;
Unfortunately, this results in Postgres pulling and sorting tens of thousands of records - it has to get all the posts for each classroom, and sort all of them together, in order to find the 30 most recent overall.
Here's the explain+analyze:
-> Sort (cost=67525.77..67571.26 rows=18194 width=489) (actual time=9373.376..9373.381 rows=30 loops=1)
Sort Key: date DESC, created_at DESC
Sort Method: top-N heapsort Memory: 62kB
-> Bitmap Heap Scan on posts (cost=350.74..66988.42 rows=18194 width=489) (actual time=41.360..9271.782 rows=42924 loops=1)
Recheck Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
Heap Blocks: exact=29456
-> Bitmap Index Scan on optimize_finding_photos_and_tagged_posts_by_classroom (cost=0.00..346.19 rows=18194 width=0) (actual time=16.205..16.205 rows=42924 loops=1)
Index Cond: (classroom_id = ANY ('{6691,6693,6695,6702}'::integer[]))
Planning time: 0.216 ms
Execution time: 9390.323 ms
From various index options, the planner chose one that starts with classroom_id, which makes sense (the subsequent fields in that index are irrelevant). But it seems so inefficient that it has to gather 42,924 rows and sort them all.
It seems it could take a big shortcut by retrieving only the 30 most recent for each classroom, and then sorting those. To facilitate this, I tried adding a new index on [classroom_id, date DESC, created_at DESC], but the planner chose not to use it. Is Postgres just not quite clever enough to use the shortcut I describe? Or is there something I'm overlooking?
So, is there a better way to index or query, so that this kind of lookup can be more efficient?
One more side question: in the explain+analyze, why does the sort node take so little time? I would expect the sorting to be fairly slow/expensive.
Creating a test database...
CREATE TABLE posts( classroom_id INT NOT NULL, date FLOAT NOT NULL, foo TEXT );
INSERT INTO posts SELECT random()*100, random() FROM generate_series( 1,1500000 );
CREATE INDEX posts_cd ON posts( classroom_id, date );
CREATE INDEX posts_date ON posts( date );
VACUUM ANALYZE posts;
Note the "foo" column is there to avoid an index-only scan on posts which would be very fast on this test setup which only contains indexed columns classroom_id,date but would be useless for you since you will select other columns also.
If you have an index on date that you use for other things, like displaying the most recent posts for all classrooms, then you can use it here too:
EXPLAIN ANALYZE SELECT * FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..55.67 rows=30 width=44) (actual time=0.040..0.983 rows=30 loops=1)
-> Index Scan Backward using posts_date on posts (cost=0.29..5447.29 rows=2951 width=44) (actual time=0.039..0.978 rows=30 loops=1)
Filter: (classroom_id = ANY ('{1,2,6}'::integer[]))
Rows Removed by Filter: 916
Planning time: 0.117 ms
Execution time: 1.008 ms
This one is a bit risky since the condition on classroom is not indexed: since it will scan the date index backwards, if many classrooms that are excluded by the WHERE condition have recent posts it may have to skip lots of rows in the index before finding the requested rows. My test data distribution is random, but this query may have different performance if your data distribution is different.
Now, without the index on date.
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=10922.61..10922.69 rows=30 width=44) (actual time=41.038..41.049 rows=30 loops=1)
-> Sort (cost=10922.61..11028.44 rows=42331 width=44) (actual time=41.036..41.040 rows=30 loops=1)
Sort Key: date DESC
Sort Method: top-N heapsort Memory: 26kB
-> Bitmap Heap Scan on posts (cost=981.34..9672.39 rows=42331 width=44) (actual time=10.275..33.056 rows=44902 loops=1)
Recheck Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
Heap Blocks: exact=8069
-> Bitmap Index Scan on posts_cd (cost=0.00..970.76 rows=42331 width=0) (actual time=8.613..8.613 rows=44902 loops=1)
Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
Planning time: 0.145 ms
Execution time: 41.086 ms
Note I've adjusted the number of rows in the table so the bitmap scan finds about the same number as yours.
It's the same plan you had, including the Top-N heapsort which is much faster than a complete sort (and uses a lot less memory):
One more side question: in the explain+analyze, why does the sort node take so little time?
Basically what it does is only keep the top N rows in the heapsort buffer since the rest will be discarded by the LIMIT anyway, so it doesn't have to sort everything. As the rows are fetched they are pushed into the heapsort buffer (or discarded if they would be discarded by the LIMIT anyway). So the sort doesn't happen as a separate step after the data to be sorted is gathered, instead it happens while the data is gathered, which is why it takes the same time as retrieving the data.
Now, my query is a lot faster than yours, while they use the same plan. Several reasons could explain this, for example I run it on a SSD which is fast. But I think the most likely explanation is that your posts table probably contains ... posts ... which means large-ish TEXT data. This means a lot of data will have to be fetched, then discarded, keeping only 30 rows. In order to test this I just did:
UPDATE posts SET foo = repeat('x', 992);   -- roughly 992 bytes of text
VACUUM ANALYZE posts;
...and the query is much slower, 360ms, and it says:
Heap Blocks: exact=41046
So that's probably your problem. In order to solve it, the query should not fetch large amounts of data then discard them, which means we're going to use the primary key... you must have one already but I forgot it, so here it is.
ALTER TABLE posts ADD post_id SERIAL PRIMARY KEY;
VACUUM ANALYZE posts;
DROP INDEX posts_cd;
CREATE INDEX posts_cdi ON posts( classroom_id, date, post_id );
I add the PK to the index, and drop the previous index, because I want an index-only scan in order to avoid fetching all the data from the table. Scanning only the index involves much less data since it doesn't contain the actual posts. Of course, we only get the PKs, so we have to JOIN back to the main table to get the posts, but that happens only after all the filtering is done, so it's only 30 rows.
EXPLAIN ANALYZE SELECT p.* FROM posts p
JOIN (SELECT post_id FROM posts WHERE posts.classroom_id IN (1,2,6)
ORDER BY date desc LIMIT 30) pids USING (post_id)
ORDER BY date desc LIMIT 30;
Limit (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.421 rows=30 loops=1)
-> Sort (cost=3212.05..3212.12 rows=30 width=1012) (actual time=38.410..38.419 rows=30 loops=1)
Sort Key: p.date DESC
Sort Method: quicksort Memory: 85kB
-> Nested Loop (cost=2957.71..3211.31 rows=30 width=1012) (actual time=38.108..38.329 rows=30 loops=1)
-> Limit (cost=2957.29..2957.36 rows=30 width=12) (actual time=38.092..38.105 rows=30 loops=1)
-> Sort (cost=2957.29..3067.84 rows=44223 width=12) (actual time=38.092..38.104 rows=30 loops=1)
Sort Key: posts.date DESC
Sort Method: top-N heapsort Memory: 26kB
-> Index Only Scan using posts_cdi on posts (cost=0.43..1651.19 rows=44223 width=12) (actual time=0.023..22.186 rows=44902 loops=1)
Index Cond: (classroom_id = ANY ('{1,2,6}'::integer[]))
Heap Fetches: 0
-> Index Scan using posts_pkey on posts p (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.006 rows=1 loops=30)
Index Cond: (post_id = posts.post_id)
Planning time: 0.305 ms
Execution time: 38.468 ms
OK. Much faster now. This trick is pretty useful: when the table contains lots of data, or even lots of columns, that will have to be lugged around inside the query engine then filtered and most of it discarded, sometimes it is faster to do the filtering and sorting on only the few small columns that are actually used, then fetching the rest of the data only for the rows that remain after the filtering is done. Sometimes it is worth it to split the table in two even, with the columns used for filtering and sorting in one table, and all the rest in the other table.
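A hedged sketch of that split, based on the test table from above (the table names are made up for illustration):
-- narrow table: only the columns used for filtering and sorting
CREATE TABLE posts_meta (
    post_id      SERIAL PRIMARY KEY,
    classroom_id INT NOT NULL,
    date         FLOAT NOT NULL
);
-- wide table: the bulky content, joined back only for the final 30 rows
CREATE TABLE posts_body (
    post_id INT PRIMARY KEY REFERENCES posts_meta(post_id),
    foo     TEXT
);
CREATE INDEX ON posts_meta (classroom_id, date);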
To go even faster we can make the query ugly:
SELECT p.* FROM posts p
JOIN (
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=1 ORDER BY date desc LIMIT 30) a
UNION ALL
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=2 ORDER BY date desc LIMIT 30) b
UNION ALL
SELECT * FROM (SELECT post_id, date FROM posts WHERE posts.classroom_id=3 ORDER BY date desc LIMIT 30) c
ORDER BY date desc LIMIT 30
) q USING (post_id)
ORDER BY date desc LIMIT 30;
This exploits the fact that, if there is only one classroom_id in the WHERE condition, then Postgres will use an index scan backward on (classroom_id, date) directly. And since I've added post_id to it, it doesn't even need to touch the table. And since the three selects in the union have the same sort order, it combines them with a merge, which means it doesn't even need to sort, or even fetch, the rows that are cut off by the outer LIMIT 30.
Limit (cost=257.97..258.05 rows=30 width=1012) (actual time=0.357..0.367 rows=30 loops=1)
-> Sort (cost=257.97..258.05 rows=30 width=1012) (actual time=0.356..0.364 rows=30 loops=1)
Sort Key: p.date DESC
Sort Method: quicksort Memory: 85kB
-> Nested Loop (cost=1.73..257.23 rows=30 width=1012) (actual time=0.063..0.319 rows=30 loops=1)
-> Limit (cost=1.31..3.28 rows=30 width=12) (actual time=0.050..0.085 rows=30 loops=1)
-> Merge Append (cost=1.31..7.24 rows=90 width=12) (actual time=0.049..0.081 rows=30 loops=1)
Sort Key: posts.date DESC
-> Limit (cost=0.43..1.56 rows=30 width=12) (actual time=0.024..0.032 rows=12 loops=1)
-> Index Only Scan Backward using posts_cdi on posts (cost=0.43..531.81 rows=14136 width=12) (actual time=0.024..0.029 rows=12 loops=1)
Index Cond: (classroom_id = 1)
Heap Fetches: 0
-> Limit (cost=0.43..1.55 rows=30 width=12) (actual time=0.018..0.024 rows=9 loops=1)
-> Index Only Scan Backward using posts_cdi on posts posts_1 (cost=0.43..599.55 rows=15950 width=12) (actual time=0.017..0.023 rows=9 loops=1)
Index Cond: (classroom_id = 2)
Heap Fetches: 0
-> Limit (cost=0.43..1.56 rows=30 width=12) (actual time=0.006..0.014 rows=11 loops=1)
-> Index Only Scan Backward using posts_cdi on posts posts_2 (cost=0.43..531.81 rows=14136 width=12) (actual time=0.006..0.014 rows=11 loops=1)
Index Cond: (classroom_id = 3)
Heap Fetches: 0
-> Index Scan using posts_pkey on posts p (cost=0.43..8.45 rows=1 width=1012) (actual time=0.006..0.007 rows=1 loops=30)
Index Cond: (post_id = posts.post_id)
Planning time: 0.445 ms
Execution time: 0.432 ms
The resulting speedup is pretty ridiculous. I think this should work.
To facilitate this, I tried adding a new index on [classroom_id, date DESC, created_at DESC], but the planner chose not to use it. Is Postgres just not quite clever enough to use the shortcut I describe?
It is just not clever enough. You could write it out explicitly to get the execution you envision. It is ugly, but it should be effective:
(SELECT * FROM posts WHERE classroom_id = 6691 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6693 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6695 ORDER BY date desc, created_at desc LIMIT 30)
union all
(SELECT * FROM posts WHERE classroom_id = 6697 ORDER BY date desc, created_at desc LIMIT 30)
order by date desc, created_at desc LIMIT 30;
One more side question: in the explain+analyze, why does the sort node take so little time? I would expect the sorting to be fairly slow/expensive.
CPUs are very fast, and 40,000 rows is just not very many. Unlike CPUs, however, your storage is not nearly as fast, and stomping all over a very large table to collect 40,000 rows takes a lot of time. There are all kinds of ways to try to address this (other than fixing the planner or rewriting your query): get faster primary storage or more caching, get it to use an index-only scan (is selecting * really necessary?), cluster or partition the table on classroom_id so that rows of the same classroom are located together, etc.
If you don't want to rearrange your data or change your hardware or rewrite your query, then maybe the simplest thing to try would be to build an index on (date, created_at). That might lead to a plan which is less good than the perfect plan, but much better than the current one: it could use the index to walk the data in already-sorted order, collecting rows that meet the IN condition until it has collected 30.
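A minimal sketch of that index (the index name is made up; DESC simply matches the ORDER BY direction, though Postgres can also scan an ascending index backward):
CREATE INDEX posts_date_created_at ON posts (date DESC, created_at DESC);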

Inordinately slow Nested Loop with join on simple query

I'm running the query below against the primary key lt_id (no other index bar the pkey btree) and joining against 1000 ids.
It might just be my lack of experience with Postgres, but it seems like it's maybe an order of magnitude too slow. There are 800k rows in the table in total.
This is a low-spec machine (4 GB memory), but I still thought it should be faster. The CPU is idle.
EXPLAIN (ANALYZE,BUFFERS) SELECT lt_id FROM "mytable" d INNER JOIN ( VALUES (1839147),(...998 more rows here...),(1756908)) v(id) ON (d.lt_id = v.id);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.42..7743.00 rows=1000 width=4) (actual time=69.852..20743.393 rows=1000 loops=1)
Buffers: shared hit=2395 read=1607
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.004..4.770 rows=1000 loops=1)
-> Index Only Scan using lt_id_idx on mytable d (cost=0.42..7.73 rows=1 width=4) (actual time=20.732..20.732 rows=1 loops=1000)
Index Cond: (lt_id = "*VALUES*".column1)
Heap Fetches: 1000
Buffers: shared hit=2395 read=1607
Planning Time: 86.284 ms
Execution Time: 20744.223 ms
(9 rows)
PostgreSQL 11.7 (I was using 9 but upgraded to 11.7; no real difference in speed observed).
free
total used free shared buff/cache available
Mem: 3783732 158076 3400932 55420 224724 3366832
Swap: 0 0 0
Even though it's low spec, should it really be taking 20 seconds? In fact many other queries are taking twice as long or more; 20 seconds seems to be the best-case scenario. There are a couple of other text columns in the table with some small text articles, which I doubt are the issue.
I was previously using IN operator but observed similar or worse speeds.
I also made a couple of small changes from the default config, but it doesn't seem to make much difference.
work_mem = 32MB
shared_buffers = 512MB
Any ideas if this is expected performance given the machine? Or is there something else I can try?
edit: I guess what I'm curious about is the time in the actual loop:
actual time=20.732..20.732 rows=1 loops=1000
It seems like the actual time is less than or equal to 1 ms per loop, which in the worst case would be less than 1 second for 1000 iterations, and the other operations also seem negligible. Does this mean the issue is simply I/O? A slow disk? What would typically be the situation here?
I notice that if I run the query on my desktop, which only has 8 GB of RAM but uses an SSD, the query is massively faster.
Using an SSD is fine of course, but I'd like to know if something in my config or query/setup is not optimal.
As pifor suggested, I set track_io_timing = on and can see that this is indeed almost entirely I/O slowness:
Nested Loop (cost=0.42..7743.00 rows=1000 width=69) (actual time=0.026..14901.004 rows=1000 loops=1)
Buffers: shared hit=2859 read=1145
I/O Timings: read=14861.578
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.002..5.497 rows=1000 loops=1)
-> Index Scan using mytable_pkey on mytable d (cost=0.42..7.73 rows=1 width=69) (actual time=14.888..14.888 rows=1 loops=1000)
Index Cond: (lt_id = "*VALUES*".column1)
Buffers: shared hit=2859 read=1145
I/O Timings: read=14861.578
Planning Time: 0.420 ms
Execution Time: 14901.734 ms
(10 rows)
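For reference, the I/O timings in the plan above come from enabling track_io_timing before re-running the query; a minimal sketch (changing it at session level may require superuser privileges, and the VALUES list is abbreviated here):
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT lt_id FROM mytable d
INNER JOIN (VALUES (1839147), (1756908)) v(id) ON (d.lt_id = v.id);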

Slow on first query

I'm having trouble with the first query I perform on a table. Subsequent queries are much faster, even if I change the date range to look for. I assume that PostgreSQL implements a caching mechanism that allows subsequent queries to be much faster. I can try to warm up the cache so the first user request hits the cache. However, I think I can somehow improve the following query:
SELECT
    y.id, y.title, x.visits, x.score
FROM (
    SELECT
        article_id, visits,
        COALESCE(ROUND((visits / NULLIF(hits, 0)::float)::numeric, 4), 0) AS score
    FROM (
        SELECT
            article_id, SUM(visits) AS visits, SUM(hits) AS hits
        FROM
            article_reports a
        WHERE
            a.site_id = 'XYZ' AND a.date >= '2017-04-13' AND a.date <= '2017-06-28'
        GROUP BY
            article_id
    ) q
    ORDER BY score DESC, visits DESC LIMIT 20
) x
INNER JOIN
    articles y ON x.article_id = y.id
Any ideas on how I can improve this? The following is the result of EXPLAIN:
Nested Loop (cost=84859.76..85028.54 rows=20 width=272) (actual time=12612.596..12612.836 rows=20 loops=1)
-> Limit (cost=84859.34..84859.39 rows=20 width=52) (actual time=12612.502..12612.517 rows=20 loops=1)
-> Sort (cost=84859.34..84880.26 rows=8371 width=52) (actual time=12612.499..12612.503 rows=20 loops=1)
Sort Key: q.score DESC, q.visits DESC
Sort Method: top-N heapsort Memory: 27kB
-> Subquery Scan on q (cost=84218.04..84636.59 rows=8371 width=52) (actual time=12513.168..12602.649 rows=28965 loops=1)
-> HashAggregate (cost=84218.04..84301.75 rows=8371 width=36) (actual time=12513.093..12536.823 rows=28965 loops=1)
Group Key: a.article_id
-> Bitmap Heap Scan on article_reports a (cost=20122.78..77122.91 rows=405436 width=36) (actual time=135.588..11974.774 rows=398242 loops=1)
Recheck Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
Heap Blocks: exact=36911
-> Bitmap Index Scan on index_article_reports_on_site_id_and_article_id_and_date (cost=0.00..20021.42 rows=405436 width=0) (actual time=125.846..125.846 rows=398479 loops=1)
Index Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
-> Index Scan using articles_pkey on articles y (cost=0.42..8.44 rows=1 width=128) (actual time=0.014..0.014 rows=1 loops=20)
Index Cond: (id = q.article_id)
Planning time: 1.443 ms
Execution time: 12613.689 ms
Thanks in advance
There are two levels of "cache" that Postgres uses:
OS file cache
shared buffers.
Important: Postgres directly controls only the second one, and relies on the first one, which is under OS' control.
First thing I would check are these two settings in postgresql.conf:
effective_cache_size – usually I set it to ~3/4 of all available RAM. Notice that it's not a setting that tells Postgres how to allocate memory; it's just advice to the Postgres planner, giving it an estimate of the OS file cache size.
shared_buffers – usually I set it to 1/4 of RAM size. This is an allocation setting.
Also, I'd check other memory-related settings (work_mem, maintenance_work_mem) to understand how much RAM might be consumed, so that my effective_cache_size estimate stays correct most of the time.
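As an illustration only, on a hypothetical machine with 16 GB of RAM those rules of thumb would translate to postgresql.conf entries like these (example values, not a recommendation for your workload):
shared_buffers = 4GB            # ~1/4 of RAM; actually allocated
effective_cache_size = 12GB     # ~3/4 of RAM; planner hint only, nothing is allocated
work_mem = 32MB                 # per sort/hash operation, so keep it modest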
But if you have just turned your Postgres on, the first queries will most probably be slow because there is no data in the OS file cache or in shared buffers yet. You can check it with advanced EXPLAIN options:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
-- you will see how many buffers were fetched from disk ("read") or from cache ("hit")
Here you can find good material on using EXPLAIN: http://www.dalibo.org/_media/understanding_explain.pdf
Additionally, there is an extension aiming to solve the "cold cache" problem: pg_prewarm https://www.postgresql.org/docs/current/static/pgprewarm.html
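A minimal sketch of using it, with the table and index names taken from your plan above:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('article_reports');   -- load the table's heap into shared buffers
SELECT pg_prewarm('index_article_reports_on_site_id_and_article_id_and_date');   -- and the index the plan uses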
Also, working with SSD disks instead of magnetic ones will mean that disk reads will be much faster.
Have fun and well working Postgres :-)
If it is the first query after inserting several rows, you must run
ANALYZE
on the whole database or on the involved tables. Try executing it at the database level.
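A sketch of both forms (the table name is taken from the query above):
ANALYZE;                  -- whole database
ANALYZE article_reports;  -- only the involved table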

Trivial order by double type: performance crash

Characters:
id BIGINT
geo_point POINT (PostGIS)
stroke_when TIMESTAMPTZ (indexed!)
stroke_when_second DOUBLE PRECISION
PostgreSQL 9.1, PostGIS 2.0.
1. Query:
SELECT ST_AsText(geo_point)
FROM lightnings
ORDER BY stroke_when DESC, stroke_when_second DESC
LIMIT 1
Total runtime: 31100.911 ms !
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=169529.67..169529.67 rows=1 width=144) (actual time=31100.869..31100.869 rows=1 loops=1)
Buffers: shared hit=3343 read=120342
-> Sort (cost=169529.67..176079.48 rows=2619924 width=144) (actual time=31100.865..31100.865 rows=1 loops=1)
Sort Key: stroke_when, stroke_when_second
Sort Method: top-N heapsort Memory: 17kB
Buffers: shared hit=3343 read=120342
-> Seq Scan on lightnings (cost=0.00..156430.05 rows=2619924 width=144) (actual time=1.589..29983.410 rows=2619924 loops=1)
Buffers: shared hit=3339 read=120342
2. Selecting another field:
SELECT id
FROM lightnings
ORDER BY stroke_when DESC, stroke_when_second DESC
LIMIT 1
Total runtime: 2144.057 ms.
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=162979.86..162979.86 rows=1 width=24) (actual time=2144.013..2144.014 rows=1 loops=1)
Buffers: shared hit=3513 read=120172
-> Sort (cost=162979.86..169529.67 rows=2619924 width=24) (actual time=2144.011..2144.011 rows=1 loops=1)
Sort Key: stroke_when, stroke_when_second
Sort Method: top-N heapsort Memory: 17kB
Buffers: shared hit=3513 read=120172
-> Seq Scan on lightnings (cost=0.00..149880.24 rows=2619924 width=24) (actual time=0.056..1464.904 rows=2619924 loops=1)
Buffers: shared hit=3509 read=120172
3. Correct optimization:
SELECT id
FROM lightnings
ORDER BY stroke_when DESC
LIMIT 1
Total runtime: 0.044 ms
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=0.00..3.52 rows=1 width=16) (actual time=0.020..0.020 rows=1 loops=1)
Buffers: shared hit=5
-> Index Scan Backward using lightnings_idx on lightnings (cost=0.00..9233232.80 rows=2619924 width=16) (actual time=0.018..0.018 rows=1 loops=1)
Buffers: shared hit=5
As you can see, there are two bad and very different problems here, even though the query is quite primitive and the SQL optimizer could use an index:
Even when the optimizer doesn't use the index, why does selecting ST_AsText(geo_point) instead of id take so much more time? There is only one row in the result!
It is impossible to use the index on the leading ORDER BY column when an unindexed field is present in the ORDER BY. Note that in practice there are only a few rows for each second in the DB.
Of course the above is a simplified query, extracted from a more complex construction. Usually I select rows by date range, applying complicated filters.
PostgreSQL can't use your index to produce values in the desired order for the first two queries. When two or more rows have identical stroke_when values, they are returned from the index scan in arbitrary order. Deciding the correct order for those rows would require a secondary sorting pass. Because the PostgreSQL executor doesn't have a facility to perform that secondary sort, it falls back to a full sort approach.
If you regularly need to query the table with that order then replace your current index with a composite index that includes both columns.
You can transform your current query into a form that explicitly applies the secondary sort only to the rows with the largest stroke_when value:
SELECT ST_AsText(geo_point) FROM lightnings
WHERE stroke_when = (SELECT max(stroke_when) FROM lightnings)
ORDER BY stroke_when_second DESC LIMIT 1
A first step could be to create a composite index on {stroke_when, stroke_when_second}.
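A hedged sketch of that composite index (the index name is made up; DESC matches the ORDER BY, although PostgreSQL can also scan an ascending index backwards):
CREATE INDEX lightnings_when_idx ON lightnings (stroke_when DESC, stroke_when_second DESC);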