PostgreSQL chooses not to use index despite improved performance - postgresql

I had a DB in MySQL and am in the process of moving to PostgreSQL with a Django front-end.
I have a table of 650k-750k rows on which I perform the following query:
SELECT "MMG", "Gene", COUNT(*) FROM at_summary_typing WHERE "MMG" != '' GROUP BY "MMG", "Gene" ORDER BY COUNT(*);
In MySQL this returns in ~0.5s. However, when I switched to PostgreSQL the same query takes ~3s. I have put an index on MMG and Gene together to try to speed it up, but when using EXPLAIN (analyse, buffers, verbose) the output shows the index is not used:
Sort (cost=59013.54..59053.36 rows=15927 width=14) (actual time=2880.222..2885.475 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=16093 read=11482, temp read=2230 written=2230
-> GroupAggregate (cost=55915.50..57901.90 rows=15927 width=14) (actual time=2179.809..2861.679 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Sort (cost=55915.50..56372.29 rows=182713 width=14) (actual time=2179.782..2830.232 rows=180657 loops=1)
Output: "MMG", "Gene"
Sort Key: at_summary_typing."MMG", at_summary_typing."Gene"
Sort Method: external merge Disk: 8168kB
Buffers: shared hit=16093 read=11482, temp read=1819 written=1819
-> Seq Scan on public.at_summary_typing (cost=0.00..36821.60 rows=182713 width=14) (actual time=0.010..224.658 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Buffers: shared hit=16093 read=11482
Total runtime: 2888.804 ms
After some searching I found that I could force the use of the index by setting SET enable_seqscan = OFF; and the EXPLAIN now shows the following :
Sort (cost=1181591.18..1181631.00 rows=15927 width=14) (actual time=555.546..560.839 rows=39314 loops=1)
Output: "MMG", "Gene", (count(*))
Sort Key: (count(*))
Sort Method: external merge Disk: 3280kB
Buffers: shared hit=173219 read=87094 written=7, temp read=411 written=411
-> GroupAggregate (cost=0.42..1180479.54 rows=15927 width=14) (actual time=247.546..533.202 rows=39314 loops=1)
Output: "MMG", "Gene", count(*)
Buffers: shared hit=173219 read=87094 written=7
-> Index Only Scan using mm_gene_idx on public.at_summary_typing (cost=0.42..1178949.93 rows=182713 width=14) (actual time=247.533..497.771 rows=180657 loops=1)
Output: "MMG", "Gene"
Filter: ((at_summary_typing."MMG")::text <> ''::text)
Rows Removed by Filter: 559071
Heap Fetches: 739728
Buffers: shared hit=173219 read=87094 written=7
Total runtime: 562.735 ms
Performance now comparable with the MySQL.
The problem is that I understand that setting this is bad practice and that I should try and find a way to improve my query/encourage the use of the index automatically. However I'm very inexperienced with PostgreSQL and cannot work out how or why it is choosing to use a Seq Scan over an Index Scan in the first place.

why it is choosing to use a Seq Scan over an Index Scan in the first place
Because the seq scan is actually twice as fast as the index scan (224ms vs. 497ms) despite the fact that the index was nearly completely in the cache, but the table was not.
So choosing the seq scan was the right thing to do.
The bottleneck in the first plan is the sorting and grouping that needs to be done on disk.
The better strategy would be to increase work_mem to something more realistic than the really small default of 4MB. You might want to start with something like 16MB, by running
set work_mem = '16MB';
before running your query. If that doesn't remove the "Sort Method: external merge Disk" steps, increase work_mem further.
Increasing work_mem may also let Postgres switch to a hash aggregate instead of the sort it currently does, which will probably be faster anyway (but isn't feasible if not enough memory is available).
Once you find a good value, you might want to make that permanent by putting the new value into postgresql.conf
Don't set this too high: that memory may be requested multiple times for each query.
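One way to try a larger work_mem without touching postgresql.conf is to scope it to a single transaction, or later to a single role. The following is a minimal sketch, assuming the query from the question; the role name app_user is hypothetical.

```sql
-- Try the setting for one transaction only; it reverts at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL work_mem = '16MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT "MMG", "Gene", COUNT(*)
FROM at_summary_typing
WHERE "MMG" <> ''
GROUP BY "MMG", "Gene"
ORDER BY COUNT(*);
ROLLBACK;

-- Once a good value is found, it could be applied to one role only
-- instead of globally (app_user is a hypothetical role name).
ALTER ROLE app_user SET work_mem = '16MB';
```

In the new plan, a sort that fits in memory is reported as "Sort Method: quicksort Memory: ..." rather than "external merge Disk: ...".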
If your where condition is static, you could also create a partial index matching that criteria:
create index on at_summary_typing ("MMG", "Gene")
where "MMG" <> '';
Don't forget to analyze the table to update the statistics.
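Put together, the partial index and the statistics refresh might look like the sketch below. The index name is hypothetical. VACUUM is included as an assumption: the "Heap Fetches: 739728" line in the forced plan suggests the visibility map was not up to date, which makes an index-only scan visit the table for every row.

```sql
-- Partial index covering only the rows the query can return
-- (the index name is made up for this example).
CREATE INDEX at_summary_typing_mmg_gene_partial_idx
    ON at_summary_typing ("MMG", "Gene")
    WHERE "MMG" <> '';

-- Refresh planner statistics and the visibility map in one pass.
VACUUM ANALYZE at_summary_typing;
```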

Related

Improve Postgres performance

I am new to Postgres and sure I'm doing something wrong.
So I just wondered if anybody had experienced something similar to my experiences below or could point me in the right direction to improve Postgres performance.
My initial goal was to speed up the analytical processing of my Datamarts in various Dashboards by moving from MS SQL Server to Postgres.
To get a sample query to compare speeds I ran query profiler on MS SQL Server whilst referencing a BI dashboard, which produced something similar to this (I know there are redundant columns in the sub query):
SELECT COUNT(*)
FROM (
SELECT
BM.Key_Date, BM.[Actual Date], BM.[Month]
,BM.[Month Number], BM.[Month Year], BM.[No of Working Days]
,SDI.Key_Delivery, SDI.[Order Number], SDI.[Quantity SKU]
,SDI.[Quantity Sales Unit], SDI.[FactSales - GBP], SDI.[NNSA Capsules]
,SFI.[Ship-to], SFI.[Sold-to], SFI.[Sales Force Type], SFI.Region
,SFI.[Top Level Account], SFI.[Customer Organisation]
,EX.Rate
,PDI.[Product Description], PDI.[Product Type Group], PDI.[Product Type],
PDI.[Main Product Categories], PDI.Section, PDI.Family
FROM Fact.SalesDataInvoiced AS SDI
JOIN Dimension.SalesforceInvoiced AS SFI
ON SDI.[Key_Ship-to]=SFI.[Key_Ship-to]
JOIN Dimension.BillingMonth AS BM
ON SDI.[Key_Billing Month]=BM.Key_Date
JOIN Dimension.ProductDataInvoiced AS PDI
ON SDI.[Key_Product Code]=PDI.[Key_Product Code]
CROSS JOIN Dimension.Exchange AS EX
WHERE BM.[Actual Date] BETWEEN '20160101' AND '20211001'
) AS a
GROUP BY [Product Type], [Product Type Group],[Main Product Categories]
I then installed Postgres 14 (on Centos 8) and MS SQL Server Developer 2017 (on windows 10) on separate identical laptops and created a Database and tables from the same csv data files to enable the replication of the above query.
Running a Postgres query with indexing performs massively slower than MS SQL without indexing.
Adding indexes to MS SQL produces results almost instantly.
Because of the difference in processing time I even installed Citus with Postgres14 and created Fact.SalesDataInvoiced as a columnar table (This made the processing time worse).
I have played about with memory settings in postgresql.conf but nothing seems to enable speeds comparable to MSSQL.
Explain Analyze shows that despite the indexes it always runs a sequential scan of all tables. Forcing indexed scans doesn't make any difference to processing time.
Would I be right in thinking Postgres would perform significantly better using a cluster and partitioning? Even if this is the case surely a simple query like the one I'm trying to run on a stand alone machine should be faster?
TABLE DETAILS
Dimension.BillingMonth
Records 120,
Primary Key is KeyDate,
Clustered Unique Index on KeyDate
Dimension.Exchange
Records 1
Dimension.ProductDataInvoiced
Records 275563,
Primary Key is KeyProduct,
Clustered Unique Index on KeyProduct
Dimension.SalesforceInvoiced
Records 377414,
Primary Key is KeyShipTo,
Clustered Unique Index on KeyShipTo
Fact.SalesDataInvoiced
Records 43807943,
Non-Clustered Unique Index on KeyShipTo, KeyProduct, KeyBillingMonth
Any help would be appreciated as previously mentioned I'm sure I must be missing something obvious.
Many thanks in advance.
David
Thank you for the responses. I have placed additional info below.
Forgot to add my postgres performance woes were after i'd carried out a Full Vacuum and Reindex. I performed these maintenance tasks after I had imported the data and created my indexes.
Output after querying pg_indexes
tablename | indexname | indexdef
BillingMonth | BillingMonth_pkey | CREATE UNIQUE INDEX BillingMonth_pkey ON public.BillingMonth USING btree (KeyDate)
ProductDataInvoiced | ProductDataInvoiced_pkey | CREATE UNIQUE INDEX ProductDataInvoiced_pkey ON public.ProductDataInvoiced USING btree (KeyProductCode)
SalesforceInvoiced | SalesforceInvoiced_pkey | CREATE UNIQUE INDEX SalesforceInvoiced_pkey ON public.SalesforceInvoiced USING btree (KeyShipTo)
SalesDataInvoiced | CI_SalesData | CREATE INDEX CI_SalesData ON public.SalesDataInvoiced USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Output After running EXPLAIN (ANALYZE, BUFFERS)
Finalize GroupAggregate (cost=1435439.30..1435565.71 rows=480 width=53) (actual time=25960.468..25973.326 rows=31 loops=1)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Buffers: shared hit=71246 read=859119
-> Gather Merge (cost=1435439.30..1435551.31 rows=960 width=53) (actual time=25960.458..25973.282 rows=89 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=71246 read=859119
-> Sort (cost=1434439.28..1434440.48 rows=480 width=53) (actual time=25956.982..25956.989 rows=30 loops=3)
Sort Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=71246 read=859119
Worker 0: Sort Method: quicksort Memory: 29kB
Worker 1: Sort Method: quicksort Memory: 29kB
-> Partial HashAggregate (cost=1434413.10..1434417.90 rows=480 width=53) (actual time=25956.878..25956.895 rows=30 loops=3)
Group Key: pdi."ProductType", pdi."ProductTypeGroup", pdi."MainProductCategories"
Batches: 1 Memory Usage: 49kB
Buffers: shared hit=71230 read=859119
Worker 0: Batches: 1 Memory Usage: 49kB
Worker 1: Batches: 1 Memory Usage: 49kB
-> Parallel Hash Join (cost=62124.74..1327935.46 rows=10647764 width=45) (actual time=285.864..19240.004 rows=14602648 loops=3)
Hash Cond: (sdi."KeyShipTo" = sfi."KeyShipTo")
Buffers: shared hit=71230 read=859119
-> Hash Join (cost=19648.48..1257508.51 rows=10647764 width=49) (actual time=204.794..12862.063 rows=14602648 loops=3)
Hash Cond: (sdi."KeyProductCode" = pdi."KeyProductCode")
Buffers: shared hit=32264 read=859119
-> Hash Join (cost=3.67..1091456.95 rows=10647764 width=8) (actual time=0.143..7076.104 rows=14602648 loops=3)
Hash Cond: (sdi."KeyBillingMonth" = bm."KeyDate")
Buffers: shared hit=197 read=859119
-> Parallel Seq Scan on "SalesData_Invoiced" sdi (cost=0.00..1041846.10 rows=18253310 width=12) (actual time=0.071..2585.596 rows=14602648 loops=3)
Buffers: shared hit=194 read=859119
-> Hash (cost=2.80..2.80 rows=70 width=4) (actual time=0.049..0.050 rows=70 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 11kB
Buffers: shared hit=3
-> Seq Scan on "BillingMonth" bm (cost=0.00..2.80 rows=70 width=4) (actual time=0.012..0.028 rows=70 loops=3)
Filter: (("ActualDate" >= '2016-01-01'::date) AND ("ActualDate" <= '2021-10-01'::date))
Rows Removed by Filter: 50
Buffers: shared hit=3
-> Hash (cost=16200.27..16200.27 rows=275563 width=49) (actual time=203.237..203.238 rows=275563 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 26832kB
Buffers: shared hit=32067
-> Nested Loop (cost=0.00..16200.27 rows=275563 width=49) (actual time=0.034..104.143 rows=275563 loops=3)
Buffers: shared hit=32067
-> Seq Scan on "Exchange" ex (cost=0.00..1.01 rows=1 width=0) (actual time=0.024..0.024 rows=1 loops=3)
Buffers: shared hit=3
-> Seq Scan on "ProductData_Invoiced" pdi (cost=0.00..13443.63 rows=275563 width=49) (actual time=0.007..48.176 rows=275563 loops=3)
Buffers: shared hit=32064
-> Parallel Hash (cost=40510.56..40510.56 rows=157256 width=4) (actual time=79.536..79.536 rows=125805 loops=3)
Buckets: 524288 Batches: 1 Memory Usage: 18912kB
Buffers: shared hit=38938
-> Parallel Seq Scan on "Salesforce_Invoiced" sfi (cost=0.00..40510.56 rows=157256 width=4) (actual time=0.011..42.968 rows=125805 loops=3)
Buffers: shared hit=38938
Planning:
Buffers: shared hit=426
Planning Time: 1.936 ms
Execution Time: 25973.709 ms
(55 rows)
Firstly, remember to run VACUUM ANALYZE after rebuilding indexes, or sometimes after importing large amount of data. (VACUUM FULL is mainly useful for the OS to reclaim disk space, and you'd still need to analyse afterwards, especially after rebuilding indexes.)
It seems from your query that your main table is SalesDataInvoiced (SDI) and that you'd want to use an index on KeyBillingMonth if possible (since it's the main restriction you're placing). In general, you'd also want indexes, at least on the other tables on the columns that are used for the joins.
As the documentation for multi-column indexes in PostgreSQL says:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a - but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.
In your example, the main column you'd want to use a constraint on (KeyBillingMonth) is in third position, so it's unlikely to be used.
CREATE INDEX CI_SalesData ON public.SalesDataInvoiced
USING btree (KeyShipTo, KeyProductCode, KeyBillingMonth)
Creating this should make it more likely to be used:
CREATE INDEX ON SalesDataInvoiced(KeyBillingMonth);
Then, run VACUUM ANALYZE and try your query again.
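As a rough sketch, the whole sequence could look like this. The quoted table and column names are taken from the EXPLAIN output in the question (which differ slightly from the pg_indexes listing), the index name is hypothetical, and the query is a cut-down version of the original, so adjust all of them to the real schema.

```sql
-- Single-column index so the billing-month key is the leading column
-- (the index name is made up for this example).
CREATE INDEX salesdata_keybillingmonth_idx
    ON "SalesData_Invoiced" ("KeyBillingMonth");

-- Refresh statistics so the planner knows about the new index and row counts.
VACUUM ANALYZE "SalesData_Invoiced";

-- Check whether the plan changes; with a wide date range the planner
-- may still reasonably prefer a sequential scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM "SalesData_Invoiced" AS sdi
JOIN "BillingMonth" AS bm ON sdi."KeyBillingMonth" = bm."KeyDate"
WHERE bm."ActualDate" BETWEEN DATE '2016-01-01' AND DATE '2021-10-01';
```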
You may also want an index on BillingMonth(ActualDate), but that's not necessarily useful since there seem to be few rows (and most of them are returned by your query).
It's not clear what the BillingMonth table is for. If it's basically about truncating the ActualDate to have the first day of the month, you could for example get rid of the join on BillingMonth and use the constraint on SalesDataInvoiced.KeyBillingMonth directly. For example ... WHERE SDI.KeyBillingMonth BETWEEN '2016-01-01' AND '2021-10-01' ....
As a side note, as far as I know, BETWEEN is inclusive of its upper bound. I'd imagine a query like this is meant to produce monthly statistics, so it probably should not include rows dated 2021-10-01 while leaving out the rest of that month.
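The difference is easy to see on three sample dates. This is a self-contained sketch with invented values; the column aliases are only for illustration.

```sql
-- Compare BETWEEN (closed interval) with a half-open range on the same dates.
SELECT d::date AS actual_date,
       (d::date BETWEEN DATE '2016-01-01' AND DATE '2021-10-01') AS with_between,
       (d::date >= DATE '2016-01-01' AND d::date < DATE '2021-10-01') AS half_open
FROM (VALUES ('2021-09-30'), ('2021-10-01'), ('2021-10-02')) AS t(d)
ORDER BY actual_date;
-- 2021-09-30: with_between = true,  half_open = true
-- 2021-10-01: with_between = true,  half_open = false
-- 2021-10-02: with_between = false, half_open = false
```

The half-open form (>= start AND < end) is the usual way to cover whole months without picking up the first day of the following one.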

Why do seq/index scans take so long when running query after a while? How to make it fast?

Problem:
I have a query that joins three tables. Whenever I run this query after a while (say 24hrs), it would take a lot of time to execute. But from that time onwards, it would execute really fast (~ 70x faster). I wanted to know what's the problem that it takes so long to execute for the first time, and how to solve it.
Table conditions:
The tables are: property_2, property_attribute_2, and property_address_2. Each of which is a partition of a bigger table (i.e. property, property_attribute, and property_address). Also, rows in property_attribute_2 and property_address_2 have reference key to property_2 using column property_id. These columns (property.id, property_attribute_2.property_id, and property_address_2.property_id) are all indexed.
The query is:
select * from public.property_2 a
inner join public.property_attribute_2 b on a.id = b.property_id
left join public.property_address_2 c on a.id=c.property_id
The query plan when I run it after a while is:
Hash Right Join (cost=670010.33..983391.75 rows=2477776 width=185) (actual time=804159.499..1065892.338 rows=2477924 loops=1)
Hash Cond: (c.property_id = a.id)
-> Seq Scan on property_address_2 c (cost=0.00..131660.48 rows=4257948 width=72) (actual time=289.781..247906.955 rows=4257973 loops=1)
-> Hash (cost=595483.13..595483.13 rows=2477776 width=117) (actual time=803833.183..803833.185 rows=2477921 loops=1)
Buckets: 32768 Batches: 128 Memory Usage: 3165kB
-> Hash Join (cost=94193.96..595483.13 rows=2477776 width=117) (actual time=98061.326..802753.642 rows=2477921 loops=1)
Hash Cond: (a.id = b.property_id)
-> Seq Scan on property_2 a (cost=0.00..265463.84 rows=6176884 width=105) (actual time=1349.284..696922.438 rows=4272433 loops=1)
-> Hash (cost=48702.76..48702.76 rows=2477776 width=20) (actual time=95497.307..95497.308 rows=2477921 loops=1)
Buckets: 65536 Batches: 64 Memory Usage: 2624kB
-> Seq Scan on property_attribute_2 b (cost=0.00..48702.76 rows=2477776 width=20) (actual time=464.476..94126.890 rows=2477921 loops=1)
Planning time: 4.034 ms
Execution time: 1065995.827 ms
And the query plan after the first run is:
Hash Right Join (cost=670010.33..983391.75 rows=2477776 width=185) (actual time=8828.873..13764.283 rows=2477924 loops=1)
Hash Cond: (c.property_id = a.id)
-> Seq Scan on property_address_2 c (cost=0.00..131660.48 rows=4257948 width=72) (actual time=0.050..1411.877 rows=4257973 loops=1)
-> Hash (cost=595483.13..595483.13 rows=2477776 width=117) (actual time=8826.620..8826.623 rows=2477921 loops=1)
Buckets: 32768 Batches: 128 Memory Usage: 3165kB
-> Hash Join (cost=94193.96..595483.13 rows=2477776 width=117) (actual time=1356.224..7925.850 rows=2477921 loops=1)
Hash Cond: (a.id = b.property_id)
-> Seq Scan on property_2 a (cost=0.00..265463.84 rows=6176884 width=105) (actual time=0.034..2652.013 rows=4272433 loops=1)
-> Hash (cost=48702.76..48702.76 rows=2477776 width=20) (actual time=1354.828..1354.829 rows=2477921 loops=1)
Buckets: 65536 Batches: 64 Memory Usage: 2624kB
-> Seq Scan on property_attribute_2 b (cost=0.00..48702.76 rows=2477776 width=20) (actual time=0.023..630.081 rows=2477921 loops=1)
Planning time: 1.181 ms
Execution time: 13872.977 ms
Also worth noting that I have a couple of other Postgres databases on this machine and different jobs use different tables on these databases on a regular basis.
If cold cache is the problem, as it seems to be the case, you can warm it up before running the query. Postgres ships with the additional module pg_prewarm providing a range of tools to populate the cache.
Instructions how to set it up here:
PostgreSQL: Force data into memory
Then you run something like:
SELECT pg_prewarm('public.property_2', 'prefetch');
SELECT pg_prewarm('public.property_attribute_2', 'prefetch');
SELECT pg_prewarm('public.property_address_2', 'prefetch');
Of course, if you always run the same SELECT query without filter predicates, you might as well just run the same query to populate the cache, without using the fancy module. Possibly scheduled with a cron job?
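A minimal sketch of the setup, assuming the pg_prewarm extension is available on the server (it ships with PostgreSQL 9.4 and later). The size query is only there to check whether the three tables can plausibly fit in the cache at all.

```sql
-- Install the extension once per database.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;

-- Compare relation sizes (including indexes and TOAST) with shared_buffers.
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
       current_setting('shared_buffers')             AS shared_buffers
FROM pg_class AS c
WHERE c.oid IN ('public.property_2'::regclass,
                'public.property_attribute_2'::regclass,
                'public.property_address_2'::regclass);

-- 'buffer' (the default mode) loads blocks into shared buffers;
-- 'read' and 'prefetch' only warm the operating system cache.
-- The return value is the number of blocks processed.
SELECT pg_prewarm('public.property_2', 'buffer');
```

If the tables are much larger than shared_buffers, the 'prefetch' or 'read' modes may be the more realistic choice, since they rely on the OS file cache instead.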
... are all indexed.
As you can see in the EXPLAIN output, your indexes go unused. You fetch all rows without filter predicate, so indexes typically won't help. And you do it with SELECT *, i.e. get all columns from all joined tables, so index-only scans are out, too. You might improve performance by listing only the columns you actually need in the SELECT list.
Obviously, more RAM (and proper configuration for PostgreSQL buffer cache) can help, too.
Or you might be able to reduce RAM requirements with VACUUM (FULL) or, possibly, with an optimized table definition with proper column types and order. Not just for the tables at hand, also for other tables competing for the same resources (thereby evicting "your" blocks from the cache). See:
Calculating and saving space in PostgreSQL
The difference must be caching: the first time, the data are read from disk, in subsequent runs they are found in RAM. Run EXPLAIN (ANALYZE, BUFFERS) with track_io_timing = on to confirm that.
However, it seems that either your I/O system is really slow or your tables are quite bloated. EXPLAIN (ANALYZE, BUFFERS) would show how many blocks are read, so you would know.
If bloat is indeed your problem, VACUUM (FULL) would help.
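A sketch of both checks, using the query from the question. Changing track_io_timing typically requires superuser privileges, and the dead-tuple counts are only a rough first indicator of bloat, not a measurement of it.

```sql
-- Show time spent on I/O in the plan (adds "I/O Timings" lines).
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM public.property_2 a
INNER JOIN public.property_attribute_2 b ON a.id = b.property_id
LEFT JOIN public.property_address_2 c ON a.id = c.property_id;

-- Rough look at dead rows and the last (auto)vacuum run per table.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('property_2', 'property_attribute_2', 'property_address_2');
```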

Phrase frequency counter with FULL Text Search of PostgreSQL 9.6

I need to calculate the number of times that a phrase appears using ts_query against an indexed text field (ts_vector data type). It works but it is very slow because the table is huge. For single words I pre-calculated all the frequencies but I have no ideas for increasing the speed of a phrase search.
Edit: Thank you for your reply @jjanes.
This is my query:
SELECT substring(date_input::text,0,5) as myear, ts_headline('simple',text_input,q, 'StartSel=<b>, StopSel=</b>,MaxWords=2, MinWords=1, ShortWord=1, HighlightAll=FALSE, MaxFragments=9999, FragmentDelimiter=" ... "') as headline
FROM
db_test, to_tsquery('simple','united<->kingdom') as q WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE and idxfti_simple @@ q
And this is the EXPLAIN (ANALYZE, BUFFERS) output:
Nested Loop (cost=25408.33..47901.67 rows=5509 width=64) (actual time=286.536..17133.343 rows=38127 loops=1)
Buffers: shared hit=224723
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.007 rows=1 loops=1)
-> Append (cost=25408.33..46428.00 rows=5510 width=625) (actual time=285.372..864.868 rows=38127 loops=1)
Buffers: shared hit=165713
-> Bitmap Heap Scan on db_test (cost=25408.33..46309.01 rows=5509 width=625) (actual time=285.368..791.111 rows=38127 loops=1)
Recheck Cond: ((idxfti_simple @@ q.q) AND (date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Rows Removed by Index Recheck: 136
Heap Blocks: exact=29643
Buffers: shared hit=165607
-> BitmapAnd (cost=25408.33..25408.33 rows=5509 width=0) (actual time=278.370..278.371 rows=0 loops=1)
Buffers: shared hit=3838
-> Bitmap Index Scan on idxftisimple_idx (cost=0.00..1989.01 rows=35869 width=0) (actual time=67.280..67.281 rows=176654 loops=1)
Index Cond: (idxfti_simple @@ q.q)
Buffers: shared hit=611
-> Bitmap Index Scan on db_test_date_input_idx (cost=0.00..23142.24 rows=1101781 width=0) (actual time=174.711..174.712 rows=1149456 loops=1)
Index Cond: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
Buffers: shared hit=3227
-> Seq Scan on test (cost=0.00..118.98 rows=1 width=451) (actual time=0.280..0.280 rows=0 loops=1)
Filter: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date) AND (idxfti_simple @@ q.q))
Rows Removed by Filter: 742
Buffers: shared hit=106
Planning time: 0.332 ms
Execution time: 17176.805 ms
Sorry, I can't turn track_io_timing on. I do know that ts_headline is not recommended, but I need it to calculate the number of times that a phrase appears in the same field.
Thank you in advance for your help.
Note that fetching the rows in Bitmap Heap Scan is quite fast, <0.8 seconds, and almost all the time is spent in the top-level node. That time is likely to be spent in ts_headline, reparsing the text_input document. As long as you keep using ts_headline, there isn't much you can do about this.
ts_headline doesn't directly give you what you want (frequency), so you must be doing some kind of post-processing of it. Maybe you could move to postprocessing the tsvector directly, so the document doesn't need to be reparsed.
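A rough sketch of what post-processing the tsvector could look like, assuming PostgreSQL 9.6 or later, where unnest(tsvector) exposes each lexeme together with its positions. The sample sentence is invented for illustration; against the real table, the stored idxfti_simple column would take the place of the to_tsvector call. An occurrence is counted whenever 'kingdom' sits exactly one position after 'united'.

```sql
WITH doc AS (
    -- Stand-in for one row's stored tsvector column.
    SELECT to_tsvector('simple',
           'The United Kingdom left; the united kingdom stayed united.') AS v
)
SELECT count(*) AS phrase_count
FROM doc,
     unnest(doc.v)       AS a,      -- one row per lexeme: (lexeme, positions, weights)
     unnest(a.positions) AS a_pos,  -- one row per position of that lexeme
     unnest(doc.v)       AS b,
     unnest(b.positions) AS b_pos
WHERE a.lexeme = 'united'
  AND b.lexeme = 'kingdom'
  AND b_pos = a_pos + 1;            -- 'kingdom' directly follows 'united'
```

For the sample sentence this should count two occurrences. One caveat: a tsvector keeps at most 256 positions per lexeme and caps position values at 16383, so counts for very long documents can come out too low.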
Another option is to upgrade further, which could allow the work of ts_headline to be spread over multiple CPUs. PostgreSQL 9.6 was the first version which supported parallel query, and it was not mature enough in that version to be able to parallelize this type of thing. v10 is probably enough to get parallel query for this, but you might as well jump all the way to v12.
Version 9.2 is old and out of support. It didn't have native support for phrase searching in the first place (introduced in 9.6).
Please upgrade.
And if it is still slow, show us the query, and the EXPLAIN (ANALYZE, BUFFERS) for it, preferably with track_io_timing turned on.

Slow on first query

I'm having troubles when I perform the first query on a table. Subsequent queries are much faster, even if I change the range date to look for. I assume that PostgreSQL implements a caching mechanism that allows the subsequent queries to be much faster. I can try to warmup the cache so the first user request can hit the cache. However, I think I can somehow improve the following query:
SELECT
y.id, y.title, x.visits, x.score
FROM (
SELECT
article_id, visits,
COALESCE(ROUND((visits / NULLIF(hits ,0)::float)::numeric, 4), 0) score
FROM (
SELECT
article_id, SUM(visits) visits, SUM(hits) hits
FROM
article_reports a
WHERE
a.site_id = 'XYZ' AND a.date >= '2017-04-13' AND a.date <= '2017-06-28'
GROUP BY
article_id
) q ORDER BY score DESC, visits DESC LIMIT(20)
) x
INNER JOIN
articles y ON x.article_id = y.id
Any ideas on how I can improve this? The following is the result of EXPLAIN:
Nested Loop (cost=84859.76..85028.54 rows=20 width=272) (actual time=12612.596..12612.836 rows=20 loops=1)
-> Limit (cost=84859.34..84859.39 rows=20 width=52) (actual time=12612.502..12612.517 rows=20 loops=1)
-> Sort (cost=84859.34..84880.26 rows=8371 width=52) (actual time=12612.499..12612.503 rows=20 loops=1)
Sort Key: q.score DESC, q.visits DESC
Sort Method: top-N heapsort Memory: 27kB
-> Subquery Scan on q (cost=84218.04..84636.59 rows=8371 width=52) (actual time=12513.168..12602.649 rows=28965 loops=1)
-> HashAggregate (cost=84218.04..84301.75 rows=8371 width=36) (actual time=12513.093..12536.823 rows=28965 loops=1)
Group Key: a.article_id
-> Bitmap Heap Scan on article_reports a (cost=20122.78..77122.91 rows=405436 width=36) (actual time=135.588..11974.774 rows=398242 loops=1)
Recheck Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
Heap Blocks: exact=36911
-> Bitmap Index Scan on index_article_reports_on_site_id_and_article_id_and_date (cost=0.00..20021.42 rows=405436 width=0) (actual time=125.846..125.846 rows=398479 loops=1)
Index Cond: (((site_id)::text = 'XYZ'::text) AND (date >= '2017-04-13'::date) AND (date <= '2017-06-28'::date))
-> Index Scan using articles_pkey on articles y (cost=0.42..8.44 rows=1 width=128) (actual time=0.014..0.014 rows=1 loops=20)
Index Cond: (id = q.article_id)
Planning time: 1.443 ms
Execution time: 12613.689 ms
Thanks in advance
There are two levels of "cache" that Postgres uses:
OS file cache
shared buffers.
Important: Postgres directly controls only the second one, and relies on the first one, which is under OS' control.
First thing I would check are these two settings in postgresql.conf:
effective_cache_size - I usually set it to ~3/4 of all available RAM. Note that this setting does not tell Postgres how to allocate memory; it is only a hint that gives the planner an estimate of the OS file cache size.
shared_buffers - I usually set it to 1/4 of RAM. This one is an allocation setting.
Also, I'd check the other memory-related settings (work_mem, maintenance_work_mem) to understand how much RAM might be consumed, and so whether my effective_cache_size estimate will be correct most of the time.
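The current values can be read from inside the database without opening postgresql.conf. A small sketch using the standard pg_settings view:

```sql
-- "setting" is expressed in the unit shown in the "unit" column
-- (for example 8kB pages for shared_buffers).
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'work_mem', 'maintenance_work_mem')
ORDER BY name;
```

The source column shows whether a value comes from the configuration file or is still the built-in default.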
But if you just turned your Postgres on, the first queries will most probably be long because there is no data in OS file cache and in shared buffers. You can check it with advanced EXPLAIN options:
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
-- you will see how many buffers were fetched from disk ("read") or from cache ("hit")
Here you can find good material on using EXPLAIN: http://www.dalibo.org/_media/understanding_explain.pdf
Additionally, there is an extension aiming to solve "cold cache" problem: pg_prewarm https://www.postgresql.org/docs/current/static/pgprewarm.html
Also, working with SSD disks instead of magnetic ones will mean that disk reads will be much faster.
Have fun and well working Postgres :-)
If it is the first query after inserting many rows, you should run
ANALYZE
on the whole database or on the tables involved. Try running it at the database level.

Trivial order by double type: performance crash

Characters:
id BIGINT
geo_point POINT (PostGIS)
stroke_when TIMESTAMPTZ (indexed!)
stroke_when_second DOUBLE PRECISION
PostgreSQL 9.1, PostGIS 2.0.
1. Query:
SELECT ST_AsText(geo_point)
FROM lightnings
ORDER BY stroke_when DESC, stroke_when_second DESC
LIMIT 1
Total runtime: 31100.911 ms !
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=169529.67..169529.67 rows=1 width=144) (actual time=31100.869..31100.869 rows=1 loops=1)
Buffers: shared hit=3343 read=120342
-> Sort (cost=169529.67..176079.48 rows=2619924 width=144) (actual time=31100.865..31100.865 rows=1 loops=1)
Sort Key: stroke_when, stroke_when_second
Sort Method: top-N heapsort Memory: 17kB
Buffers: shared hit=3343 read=120342
-> Seq Scan on lightnings (cost=0.00..156430.05 rows=2619924 width=144) (actual time=1.589..29983.410 rows=2619924 loops=1)
Buffers: shared hit=3339 read=120342
2. Selecting another field:
SELECT id
FROM lightnings
ORDER BY stroke_when DESC, stroke_when_second DESC
LIMIT 1
Total runtime: 2144.057 ms.
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=162979.86..162979.86 rows=1 width=24) (actual time=2144.013..2144.014 rows=1 loops=1)
Buffers: shared hit=3513 read=120172
-> Sort (cost=162979.86..169529.67 rows=2619924 width=24) (actual time=2144.011..2144.011 rows=1 loops=1)
Sort Key: stroke_when, stroke_when_second
Sort Method: top-N heapsort Memory: 17kB
Buffers: shared hit=3513 read=120172
-> Seq Scan on lightnings (cost=0.00..149880.24 rows=2619924 width=24) (actual time=0.056..1464.904 rows=2619924 loops=1)
Buffers: shared hit=3509 read=120172
3. Correct optimization:
SELECT id
FROM lightnings
ORDER BY stroke_when DESC
LIMIT 1
Total runtime: 0.044 ms
EXPLAIN (ANALYZE on, VERBOSE off, COSTS on, BUFFERS on):
Limit (cost=0.00..3.52 rows=1 width=16) (actual time=0.020..0.020 rows=1 loops=1)
Buffers: shared hit=5
-> Index Scan Backward using lightnings_idx on lightnings (cost=0.00..9233232.80 rows=2619924 width=16) (actual time=0.018..0.018 rows=1 loops=1)
Buffers: shared hit=5
As you can see, there are two serious and quite different problems, even though the query is very simple whenever the optimizer uses the index:
Even if the optimizer doesn't use the index, why does selecting ST_AsText(geo_point) instead of id take so much more time? There is only one row in the result!
The index on the first ORDER BY column cannot be used when an unindexed field is also present in ORDER BY. Note that in practice only a few rows share the same second in the DB.
Of course above is a simplified query, extracted from a more complex construction. Usually I select rows by date range, applying complicated filters.
PostgreSQL can't use your index to produce values in the desired order for the first two queries. When two or more rows have identical stroke_when values, they are returned from the index scan in arbitrary order. Deciding the correct order for those rows would require a secondary sorting pass. Because the PostgreSQL executor doesn't have a facility to perform that secondary sort, it falls back to a full sort.
If you regularly need to query the table with that order then replace your current index with a composite index that includes both columns.
You can transform your current query into a form that explicitly specifies the secondary sort on only the largest value of stroke_when:
SELECT ST_AsText(geo_point) FROM lightnings
WHERE stroke_when = (SELECT max(stroke_when) FROM lightnings)
ORDER BY stroke_when_second DESC LIMIT 1
First step could be: create a composite index on {stroke_when, stroke_when_second}
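A minimal sketch of that composite index, with a hypothetical index name. A plain ascending B-tree should be enough here, because it can be scanned backward to satisfy an ORDER BY in which both columns are DESC.

```sql
-- Composite index covering both ORDER BY columns
-- (the index name is made up for this example).
CREATE INDEX lightnings_when_second_idx
    ON lightnings (stroke_when, stroke_when_second);

ANALYZE lightnings;

-- If the index is usable, the plan should show an
-- "Index Scan Backward" node instead of Seq Scan + Sort.
EXPLAIN (ANALYZE, BUFFERS)
SELECT ST_AsText(geo_point)
FROM lightnings
ORDER BY stroke_when DESC, stroke_when_second DESC
LIMIT 1;
```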