I have a view that uses a LATERAL join against a function. The query runs fine and fast, but as soon as I add the WHERE clause and ORDER BY, it crawls.
CREATE OR REPLACE VIEW public.vw_top_info_v1_0
AS
SELECT pse.symbol,
pse.order_book,
pse.company_name,
pse.logo_url,
pse.display_logo,
pse.base_url,
stats.value::numeric(20,4) AS stock_value,
stats.volume::numeric(20,0) AS volume,
stats.last_trade_price,
stats.stock_date AS last_trade_date
FROM ( SELECT pse_1.symbol,
pse_1.company_name,
pse_1.order_book,
pse_1.display_logo,
pse_1.base_url,
pse_1.logo_url
FROM vw_pse_traded_companies pse_1
WHERE pse_1.group_name::text = 'N'::text) pse,
LATERAL iq_get_stats_security_for_top_data_v1_0(pse.order_book, (( SELECT date(d.added_date) AS date
FROM prod_itchbbo_p_small_message d
ORDER BY d.added_date DESC
LIMIT 1))::timestamp without time zone) stats(value, volume, stock_date, last_trade_price)
WHERE stats.value IS NOT NULL
ORDER BY stats.value DESC;
Here's the explain output.
Subquery Scan on vw_top_info_v1_0 (cost=161022.59..165450.34 rows=354220 width=192)
-> Sort (cost=161022.59..161908.14 rows=354220 width=200)
Sort Key: stats.value DESC
InitPlan 1 (returns $0)
-> Limit (cost=49734.18..49734.18 rows=1 width=12)
-> Sort (cost=49734.18..51793.06 rows=823553 width=12)
Sort Key: d.added_date DESC
-> Seq Scan on prod_itchbbo_p_small_message d (cost=0.00..45616.41 rows=823553 width=12)
-> Nested Loop (cost=188.59..10837.44 rows=354220 width=200)
-> Sort (cost=188.34..189.23 rows=356 width=2866)
Sort Key: info.order_book, listed.symbol
-> Hash Join (cost=18.19..173.25 rows=356 width=2866)
Hash Cond: ((info.symbol)::text = (listed.symbol)::text)
-> Seq Scan on prod_stock_information info (cost=0.00..151.85 rows=1220 width=12)
Filter: ((group_name)::text = 'N'::text)
-> Hash (cost=13.64..13.64 rows=364 width=128)
-> Seq Scan on prod_pse_listed_companies listed (cost=0.00..13.64 rows=364 width=128)
-> Function Scan on iq_get_stats_security_for_top_data_v1_0 stats (cost=0.25..10.25 rows=995 width=32)
Filter: (value IS NOT NULL)
Is there a way to improve the query?
I don't fully understand everything this plan is doing, but a significant part of the cost comes from the seq scan on prod_itchbbo_p_small_message, which is then sorted by date to find the latest added_date.
You indicated the cost changes when you add the sort, so if you don't have one, I'd add a b-tree index on prod_itchbbo_p_small_message.added_date.
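A minimal sketch of that index (the index name is illustrative):

CREATE INDEX prod_itchbbo_p_small_message_added_date_idx
    ON prod_itchbbo_p_small_message (added_date);

With it in place, the InitPlan's ORDER BY added_date DESC LIMIT 1 can be satisfied by a single-row backward index scan instead of seq-scanning and sorting ~820k rows.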
We have two tables - event_deltas and deltas_to_retrieve - which both have BTREE indexes on the same two columns:
CREATE TABLE event_deltas
(
event_id UUID REFERENCES events(id) NOT NULL,
version INT NOT NULL,
json_patch JSONB NOT NULL,
PRIMARY KEY (event_id, version)
);
CREATE TABLE deltas_to_retrieve(event_id UUID NOT NULL, version INT NOT NULL);
CREATE UNIQUE INDEX event_id_version ON deltas_to_retrieve (event_id, version);
In terms of table size, deltas_to_retrieve is a tiny lookup table of ~500 rows. The event_deltas table contains ~7,000,000 rows. Due to the size of the latter table, we want to limit how much we retrieve at once. Therefore, the tables are queried as follows:
SELECT ed.event_id, ed.version
FROM deltas_to_retrieve zz, event_deltas ed
WHERE zz.event_id = ed.event_id
AND ed.version > zz.version
ORDER BY ed.event_id, ed.version
LIMIT 5000;
Without the LIMIT, for the example I'm looking at the query returns ~30,000 rows.
What's odd about this query is the impact of the ORDER BY. Due to the existing indexes, the data comes back in the order we want with or without it. I would rather keep the explicit ORDER BY so we're future-proofed against changes, as well as for readability etc. However, as things stand it has a significant negative impact on performance.
According to the docs:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
This makes me think that, given the indexes we already have in place, the ORDER BY should not slow down the query at all. However, in practice I'm seeing execution times of ~10s with the ORDER BY and <1s without. I've included the plans outputted by EXPLAIN below:
Without ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.56..20033.38 rows=5000 width=20)
-> Nested Loop (cost=0.56..331980.39 rows=82859 width=20)
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..9.37 rows=537 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..616.66 rows=154 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.56..20039.35 rows=5000 width=20) (actual time=3.675..2083.063 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Nested Loop (cost=0.56..1055082.88 rows=263260 width=20) (actual time=3.673..2080.745 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..27.00 rows=1700 width=20) (actual time=0.022..0.307 rows=241 loops=1)
Buffers: local hit=2
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..619.07 rows=155 width=20) (actual time=1.317..8.612 rows=21 loops=241)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Heap Fetches: 5000
Buffers: shared hit=1450 read=4783
Planning Time: 1.150 ms
Execution Time: 2084.647 ms
With ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20)
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20)
-> Materialize (cost=0.28..6178.03 rows=1700 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20)
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20) (actual time=4457.770..506706.443 rows=5000 loops=1)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20) (actual time=4457.768..506704.815 rows=5000 loops=1)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20) (actual time=4.566..505443.407 rows=1813438 loops=1)
Heap Fetches: 1814767
Buffers: shared hit=78806 read=1071004 dirtied=148
-> Materialize (cost=0.28..6178.03 rows=1700 width=20) (actual time=0.063..2.524 rows=5000 loops=1)
Buffers: local hit=63
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20) (actual time=0.056..0.663 rows=78 loops=1)
Heap Fetches: 78
Buffers: local hit=63
Planning Time: 1.088 ms
Execution Time: 506709.819 ms
I'm not very experienced at reading these plans, but it's obviously thinking that it needs to retrieve everything, sort it and then return the top N, rather than just grabbing the first N using the index. It's doing a Seq Scan on the smaller deltas_to_retrieve table rather than an Index Only Scan - is that the problem? That table is very small (~500 rows), so I wonder if it's just not bothering to use the index because of that?
Postgres version: 11.12
Upgrading to Postgres 13 fixed this for us, with the introduction of incremental sort. From some docs on the feature:
Incremental sorting: Sorting is a performance-intensive task, so every improvement in this area can make a difference. Now PostgreSQL 13 introduces incremental sorting, which leverages early-stage sorts of a query and sorts only the incremental unsorted fields, increasing the chances the sorted block will fit in memory and by that, improving performance.
The new query plan from EXPLAIN is as follows, with the query now completing in <500ms consistently:
QUERY PLAN
Limit (cost=71.06..820.32 rows=5000 width=20)
-> Incremental Sort (cost=71.06..15461.82 rows=102706 width=20)
" Sort Key: ed.event_id, ed.version"
Presorted Key: ed.event_id
-> Nested Loop (cost=0.84..6659.05 rows=102706 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..1116.39 rows=541 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..8.35 rows=190 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Note:
- Start by running VACUUM ANALYZE on both tables.
- Since deltas_to_retrieve only needs to contain the lowest versions, it could be unique on event_id (a DDL sketch follows the query below).
- You can simplify the query to:
SELECT event_id, version
FROM event_deltas ed
WHERE EXISTS (
SELECT * FROM deltas_to_retrieve zz
WHERE zz.event_id = ed.event_id
AND zz.version < ed.version
)
ORDER BY event_id, version
LIMIT 5000;
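A sketch of that uniqueness change, assuming one row per event_id really is the intent; the INCLUDE form (available since PostgreSQL 11) keeps version in the index so index-only scans still work:

-- hypothetical DDL, reusing the original index name
DROP INDEX event_id_version;
CREATE UNIQUE INDEX event_id_version
    ON deltas_to_retrieve (event_id) INCLUDE (version);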
We have a view that does string_agg on a table with 54 million records, where string_agg needs to process 2 strings per group. A view defined with a query like the following causes the PostgreSQL server process to die due to out of memory:
SELECT id, string_agg(msg, ',') FROM msgs GROUP BY id
When the view query is modified as below, it works fine:
SELECT id, string_agg(msg, ',') as msgs
FROM ( SELECT id, msg, row_number() over (partition by id) as row_num
FROM msgs) as limited_alerts
WHERE row_num < 5
GROUP BY id
What is the reason? Is it because Postgres is able to use temporary files in the 2nd case? Are there any links/articles that explain the details?
Table msgs
id bigint PRIMARY KEY
service text
msg text
Total number of rows: 54 million
Max num of msgs per id: 2
Looks like the one that crashes is using HashAggregate and the other one GroupAggregate:
explain select count(*) from (SELECT msgs.id, string_agg(msgs.alerts, ','::text) AS alerts FROM msgs GROUP BY msgs.id) as a;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Aggregate (cost=3032602.58..3032602.59 rows=1 width=0)
-> HashAggregate (cost=3032597.58..3032600.08 rows=200 width=108)
Group Key: msgs.id
-> Append (cost=0.00..2587883.05 rows=88942906 width=108)
-> Seq Scan on msgs (cost=0.00..0.00 rows=1 width=548)
-> Seq Scan on msgs_2016_01_24 (cost=0.00..506305.40 rows=17394640 width=107)
-> Seq Scan on msgs_2016_01_31 (cost=0.00..509979.80 rows=17512480 width=107)
-> Seq Scan on msgs_2016_02_07 (cost=0.00..491910.32 rows=16883332 width=108)
-> Seq Scan on msgs_2016_02_14 (cost=0.00..496443.84 rows=17071384 width=108)
-> Seq Scan on msgs_2016_02_21 (cost=0.00..552162.84 rows=19026084 width=108)
-> Seq Scan on msgs_2016_02_28 (cost=0.00..31038.05 rows=1054705 width=111)
-> Seq Scan on msgs_2016_03_06 (cost=0.00..10.70 rows=70 width=548)
-> Seq Scan on msgs_2016_03_13 (cost=0.00..10.70 rows=70 width=548)
-> Seq Scan on msgs_2016_03_20 (cost=0.00..10.70 rows=70 width=548)
-> Seq Scan on msgs_2016_03_27 (cost=0.00..10.70 rows=70 width=548)
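One way to test that theory, assuming the wildly low group estimate (rows=200 for the HashAggregate against ~89M input rows) is what steers the planner toward hashing: ANALYZE the partitions so the estimate improves, then disable hash aggregation for the session and compare. This is a diagnostic experiment, not a fix:

-- session-level experiment
SET enable_hashagg = off;
EXPLAIN SELECT id, string_agg(msg, ',') FROM msgs GROUP BY id;
RESET enable_hashagg;

GroupAggregate sorts its input (spilling to temporary files as needed) and finishes one group at a time, so its memory use stays bounded; HashAggregate holds every group in memory at once and, before PostgreSQL 13, could not spill to disk.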
My end goal is to optimize my query using indexes, but I'm having trouble adding the right index. Everything I've tried results in the same cost in the EXPLAIN diagram, with no indication that it's even using any indexes.
I have two tables:
event that has two date columns: start_date and end_date (can be null).
fiscal_date that has:
two date columns start_date and end_date (cannot be null)
a fiscal_year column of type char(4)
a fiscal_quarter column of type char(1)
There's another table, address, that's just one-to-one with a foreign key in event. There are no indexes on it save the primary key.
I have a query that I can't change that figures out what fiscal quarter and year the event starts in:
SELECT
e.*,
(select 'Q' || fd.fiscal_quarter || ' FY' || fd.fiscal_year
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_quarter_year,
(select 'Q' || fd.fiscal_quarter
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_quarter,
(select 'FY' || fd.fiscal_year
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_year,
a.street1, a.street2, a.street3, a.city, a.state, a.country, a.postal_code
FROM event AS e
LEFT OUTER JOIN address a ON e.address_id=a.address_id;
Here's an EXPLAIN of the query (notice all the expensive seq scans):
As requested, here's the output of explain analyze:
Hash Left Join (cost=115.78..2846.64 rows=1649 width=5087) (actual time=18.334..134.279 rows=1649 loops=1)
Hash Cond: (e.address_id = a.address_id)
-> Seq Scan on event e (cost=0.00..323.49 rows=1649 width=5031) (actual time=0.223..19.808 rows=1649 loops=1)
-> Hash (cost=68.68..68.68 rows=3768 width=60) (actual time=17.797..17.797 rows=3768 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 248kB
-> Seq Scan on address a (cost=0.00..68.68 rows=3768 width=60) (actual time=0.004..9.071 rows=3768 loops=1)
SubPlan 1
-> Limit (cost=0.00..0.49 rows=1 width=28) (actual time=0.011..0.014 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.46 rows=3 width=28) (actual time=0.006..0.006 rows=1 loops=1649)
Filter: (($0 >= start_date) AND ($0 <= end_date))
SubPlan 2
-> Limit (cost=0.00..0.48 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.43 rows=3 width=8) (actual time=0.006..0.006 rows=1 loops=1649)
Filter: (($1 >= start_date) AND ($1 <= end_date))
SubPlan 3
-> Limit (cost=0.00..0.48 rows=1 width=20) (actual time=0.010..0.012 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.43 rows=3 width=20) (actual time=0.005..0.005 rows=1 loops=1649)
Filter: (($2 >= start_date) AND ($2 <= end_date))
Total runtime: 138.008 ms
I've tried adding indexes to event that index the start and end dates (both and individually), adding an index to fiscal_date's date columns, but nothing seems to be reducing the cost calculation of this query.
How do I optimize this query, or isn't it possible?
OK, so your problem isn't the fiscal_date sequential scans; the number of rows is so small that a sequential scan is probably the correct thing to do. You probably want an index on address_id in both tables.
If address_id is the primary key of the address table, it is already indexed.
Also, just to be sure, run VACUUM FULL and VACUUM ANALYZE on all tables.
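A minimal sketch of the suggested index on the event side (the index name is illustrative; address.address_id, if it is the primary key as noted above, is already covered):

CREATE INDEX event_address_id_idx ON event (address_id);

With only ~1,649 rows in event the hash join may still win, but this gives the planner the option.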
EDIT:
The performance seems really bad considering there are so few rows (under 10,000 is nothing). Are the tables really big, or is the hardware ancient? If not, you should probably take a serious look at the configuration (work_mem etc.).
The docs for Pg's window functions say:
The rows considered by a window function are those of the "virtual table" produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways by means of different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
However, I'm not seeing this. In the plan below, the Filter sits very near the left margin and the top, i.e. it is applied last.
=# EXPLAIN SELECT * FROM chrome_nvd.view_options where fkey_style = 303451;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Subquery Scan view_options (cost=2098450.26..2142926.28 rows=14825 width=180)
Filter: (view_options.fkey_style = 303451)
-> Sort (cost=2098450.26..2105862.93 rows=2965068 width=189)
Sort Key: o.sequence
-> WindowAgg (cost=1446776.02..1506077.38 rows=2965068 width=189)
-> Sort (cost=1446776.02..1454188.69 rows=2965068 width=189)
Sort Key: h.name, k.name
-> WindowAgg (cost=802514.45..854403.14 rows=2965068 width=189)
-> Sort (cost=802514.45..809927.12 rows=2965068 width=189)
Sort Key: h.name
-> Hash Join (cost=18.52..210141.57 rows=2965068 width=189)
Hash Cond: (o.fkey_opt_header = h.id)
-> Hash Join (cost=3.72..169357.09 rows=2965068 width=166)
Hash Cond: (o.fkey_opt_kind = k.id)
-> Seq Scan on options o (cost=0.00..128583.68 rows=2965068 width=156)
-> Hash (cost=2.21..2.21 rows=121 width=18)
-> Seq Scan on opt_kind k (cost=0.00..2.21 rows=121 width=18)
-> Hash (cost=8.80..8.80 rows=480 width=31)
-> Seq Scan on opt_header h (cost=0.00..8.80 rows=480 width=31)
(19 rows)
These two WindowAggs essentially change the plan to something that seems to never finish, compared with the much faster:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan view_options (cost=329.47..330.42 rows=76 width=164) (actual time=20.263..20.403 rows=42 loops=1)
-> Sort (cost=329.47..329.66 rows=76 width=189) (actual time=20.258..20.300 rows=42 loops=1)
Sort Key: o.sequence
Sort Method: quicksort Memory: 35kB
-> Hash Join (cost=18.52..327.10 rows=76 width=189) (actual time=19.427..19.961 rows=42 loops=1)
Hash Cond: (o.fkey_opt_header = h.id)
-> Hash Join (cost=3.72..311.25 rows=76 width=166) (actual time=17.679..18.085 rows=42 loops=1)
Hash Cond: (o.fkey_opt_kind = k.id)
-> Index Scan using options_pkey on options o (cost=0.00..306.48 rows=76 width=156) (actual time=17.152..17.410 rows=42 loops=1)
Index Cond: (fkey_style = 303451)
-> Hash (cost=2.21..2.21 rows=121 width=18) (actual time=0.432..0.432 rows=121 loops=1)
-> Seq Scan on opt_kind k (cost=0.00..2.21 rows=121 width=18) (actual time=0.042..0.196 rows=121 loops=1)
-> Hash (cost=8.80..8.80 rows=480 width=31) (actual time=1.687..1.687 rows=480 loops=1)
-> Seq Scan on opt_header h (cost=0.00..8.80 rows=480 width=31) (actual time=0.030..0.748 rows=480 loops=1)
Total runtime: 20.893 ms
(15 rows)
What is going on, and how do I fix it? I'm using Postgresql 8.4.8. Here is what the actual view is doing:
SELECT o.fkey_style, h.name AS header, k.name AS kind
, o.code, o.name AS option_name, o.description
, count(*) OVER (PARTITION BY h.name) AS header_count
, count(*) OVER (PARTITION BY h.name, k.name) AS header_kind_count
FROM chrome_nvd.options o
JOIN chrome_nvd.opt_header h ON h.id = o.fkey_opt_header
JOIN chrome_nvd.opt_kind k ON k.id = o.fkey_opt_kind
ORDER BY o.sequence;
No, PostgreSQL will only push down a WHERE clause into a VIEW that does not have an aggregate. (Window functions are considered aggregates.)
< x> I think that's just an implementation limitation
< EvanCarroll> x: I wonder what would have to be done to push the
WHERE clause down in this case.
< EvanCarroll> the planner would have to know that the WindowAgg doesn't itself add selectivity and therefore it is safe to push the WHERE down?
< x> EvanCarroll; a lot of very complicated work with the planner, I'd presume
And,
< a> EvanCarroll: nope. a filter condition on a view applies to the output of the view and only gets pushed down if the view does not involve aggregates
I'm using a partitioned Postgres table following the documentation's rule-based approach, with a partitioning scheme based on date ranges (my date column is an epoch integer).
The problem is that a simple query to select the row with the minimum value of the partitioning column is not using indexes:
First, some settings to coerce postgres to do what I want:
SET constraint_exclusion = on;
SET enable_seqscan = off;
The query on a single partition works:
explain (SELECT * FROM urls_0 ORDER BY date_created ASC LIMIT 1);
Limit (cost=0.00..0.05 rows=1 width=38)
-> Index Scan using urls_date_created_idx_0 on urls_0 (cost=0.00..436.68 rows=8099 width=38)
However, the same query on the entire table is seq scanning:
explain (SELECT * FROM urls ORDER BY date_created ASC LIMIT 1);
Limit (cost=50000000274.88..50000000274.89 rows=1 width=51)
-> Sort (cost=50000000274.88..50000000302.03 rows=10859 width=51)
Sort Key: public.urls.date_created
-> Result (cost=10000000000.00..50000000220.59 rows=10859 width=51)
-> Append (cost=10000000000.00..50000000220.59 rows=10859 width=51)
-> Seq Scan on urls (cost=10000000000.00..10000000016.90 rows=690 width=88)
-> Seq Scan on urls_15133 urls (cost=10000000000.00..10000000016.90 rows=690 width=88)
-> Seq Scan on urls_15132 urls (cost=10000000000.00..10000000016.90 rows=690 width=88)
-> Seq Scan on urls_15131 urls (cost=10000000000.00..10000000016.90 rows=690 width=88)
-> Seq Scan on urls_0 urls (cost=10000000000.00..10000000152.99 rows=8099 width=38)
Finally, a lookup by date_created does work correctly, with constraint exclusion and index scans:
explain (SELECT * FROM urls where date_created = 1212)
Result (cost=10000000000.00..10000000052.75 rows=23 width=45)
-> Append (cost=10000000000.00..10000000052.75 rows=23 width=45)
-> Seq Scan on urls (cost=10000000000.00..10000000018.62 rows=3 width=88)
Filter: (date_created = 1212)
-> Index Scan using urls_date_created_idx_0 on urls_0 urls (cost=0.00..34.12 rows=20 width=38)
Index Cond: (date_created = 1212)
Does anyone know how to use partitioning so that this type of query will use an index scan?
PostgreSQL 9.1 knows how to optimize this out of the box.
In 9.0 or earlier, you need to decompose the query manually, UNIONing the per-partition subqueries, each with its own ORDER BY/LIMIT (a sketch follows).
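A sketch of that manual decomposition, using the partition names from the plan above (extend the UNION ALL list to cover every partition; ONLY keeps the parent table from recursing into its children):

SELECT *
FROM (
              (SELECT * FROM ONLY urls  ORDER BY date_created ASC LIMIT 1)
    UNION ALL (SELECT * FROM urls_0     ORDER BY date_created ASC LIMIT 1)
    UNION ALL (SELECT * FROM urls_15131 ORDER BY date_created ASC LIMIT 1)
    UNION ALL (SELECT * FROM urls_15132 ORDER BY date_created ASC LIMIT 1)
    UNION ALL (SELECT * FROM urls_15133 ORDER BY date_created ASC LIMIT 1)
) AS candidates
ORDER BY date_created ASC
LIMIT 1;

Each branch can use its partition's index on date_created, so the outer sort only has to order one candidate row per partition.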