PostgreSQL: Same request is slower with plpgsql language compared to sql - postgresql

I'm new to PostgreSQL and I'm facing an issue with table function performance. What I need is the equivalent of a stored procedure in MSSQL. After some research I found that a table function is the way to go, so I followed an example to create my function using plpgsql.
Comparing the execution times, the function was twice as slow as calling the query directly (the query inside the function is exactly the same).
After digging a bit, I found that using the SQL language in my function improves the execution time a lot (it becomes exactly the same as calling the query directly). From what I've read, plpgsql adds a little overhead, but the difference here is too big to be explained by that.
Since I'm not using any plpgsql functionality, this solution is fine for me and makes total sense. However, I'd like to understand where the difference comes from. Comparing the execution plans, the plpgsql version does a HashAggregate over a sequential scan, while the SQL version does a GroupAggregate with some pre-sorting... I did use auto_explain as suggested by Laurenz Albe and added both execution plans at the end.
Why is the execution plan of the same query so different when the only change is the language? Moreover, even the result of the SUM (see my query below) differs noticeably. I know I'm using floating-point values, so the result can vary slightly between calls, but here the difference between the query and the function is around 3, which is unexpected (~10001 vs ~9998).
Below is the code to reproduce the problem using 2 tables and 2 functions.
Note that I'm using PostgreSQL 12.
Any explanation is appreciated :) Thanks.
-- Step 1: Create database
-- Step 2: Create tables
-- table1
CREATE TABLE public.table1(area real, code text COLLATE pg_catalog."default");
-- table 2
CREATE TABLE public.table2(code text COLLATE pg_catalog."default" NOT NULL, surface real, CONSTRAINT table2_pkey PRIMARY KEY (code));
-- Step 3: create functions
-- plpgsql
CREATE OR REPLACE FUNCTION public.test_function()
RETURNS TABLE(code text, value real)
LANGUAGE 'plpgsql'
COST 100
VOLATILE
ROWS 1000
AS $BODY$
BEGIN
RETURN QUERY
SELECT table2.code, (case when (sum(area) * surface) IS NULL then 0 else (sum(area) * surface) end) AS value
FROM table1
INNER JOIN table2 on table1.code = table2.code
GROUP BY table2.code, surface
;
END;
$BODY$;
-- sql
CREATE OR REPLACE FUNCTION public.test_function2()
RETURNS TABLE(code text, value real)
LANGUAGE SQL
AS $BODY$
SELECT table2.code, (case when (sum(area) * surface) IS NULL then 0 else (sum(area) * surface) end) AS value
FROM table1
INNER JOIN table2 on table1.code = table2.code
GROUP BY table2.code, surface
$BODY$;
-- Step 4: insert some random data
-- table 2
INSERT INTO table2(code, surface) VALUES ('AAAAA', 1);
INSERT INTO table2(code, surface) VALUES ('BBBBB', 1);
INSERT INTO table2(code, surface) VALUES ('CCCCC', 1);
INSERT INTO table2(code, surface) VALUES ('DDDDD', 1);
INSERT INTO table2(code, surface) VALUES ('EEEEE', 1);
-- table1 (will take some time; this simulates my real data with 10 million rows)
DO
$$
DECLARE random_code text;
DECLARE code_count int := (SELECT COUNT(*) FROM table2);
BEGIN
FOR i IN 1..10000000 LOOP
random_code := (SELECT code FROM table2 OFFSET floor(random() * code_count) LIMIT 1);
INSERT INTO public.table1(area, code) VALUES (random() / 100, random_code);
END LOOP;
END
$$;
-- Step 5: compare
SELECT * FROM test_function();
SELECT * FROM test_function2(); -- 2 times faster
Execution plan for test_function (plpgsql version)
2021-04-14 11:52:10.335 GMT [5056] LOG: duration: 3808.919 ms plan:
Query Text: SELECT table2.code, (case when (sum(area) * surface) IS NULL then 0 else (sum(area) * surface) end) AS value
FROM table1
INNER JOIN table2 on table1.code = table2.code
GROUP BY table2.code, surface
HashAggregate (cost=459899.03..459918.08 rows=1270 width=40) (actual time=3808.908..3808.913 rows=5 loops=1)
Group Key: table2.code
Buffers: shared hit=34 read=162130
-> Hash Join (cost=38.58..349004.15 rows=14785984 width=40) (actual time=215.340..2595.247 rows=10000000 loops=1)
Hash Cond: (table1.code = table2.code)
Buffers: shared hit=34 read=162130
-> Seq Scan on table1 (cost=0.00..310022.84 rows=14785984 width=10) (actual time=215.294..1036.615 rows=10000000 loops=1)
Buffers: shared hit=33 read=162130
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.019..0.020 rows=5 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
Buffers: shared hit=1
-> Seq Scan on table2 (cost=0.00..22.70 rows=1270 width=36) (actual time=0.013..0.014 rows=5 loops=1)
Buffers: shared hit=1
2021-04-14 11:52:10.335 GMT [5056] CONTEXT: PL/pgSQL function test_function() line 3 at RETURN QUERY
Execution plan for test_function2 (sql version)
2021-04-14 11:54:24.122 GMT [5056] LOG: duration: 1513.001 ms plan:
Query Text:
SELECT table2.code, (case when (sum(area) * surface) IS NULL then 0 else (sum(area) * surface) end) AS value
FROM table1
INNER JOIN table2 on table1.code = table2.code
GROUP BY table2.code, surface
Finalize GroupAggregate (cost=271918.31..272252.77 rows=1270 width=40) (actual time=1484.846..1512.998 rows=5 loops=1)
Group Key: table2.code
Buffers: shared hit=96 read=162098
-> Gather Merge (cost=271918.31..272214.67 rows=2540 width=40) (actual time=1484.840..1512.990 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=96 read=162098
-> Sort (cost=270918.29..270921.46 rows=1270 width=40) (actual time=1435.897..1435.899 rows=5 loops=3)
Sort Key: table2.code
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
Buffers: shared hit=96 read=162098
-> Partial HashAggregate (cost=270840.11..270852.81 rows=1270 width=40) (actual time=1435.857..1435.863 rows=5 loops=3)
Group Key: table2.code
Buffers: shared hit=74 read=162098
-> Hash Join (cost=38.58..240035.98 rows=6160827 width=40) (actual time=204.916..1022.133 rows=3333333 loops=3)
Hash Cond: (table1.code = table2.code)
Buffers: shared hit=74 read=162098
-> Parallel Seq Scan on table1 (cost=0.00..223771.27 rows=6160827 width=10) (actual time=204.712..486.850 rows=3333333 loops=3)
Buffers: shared hit=65 read=162098
-> Hash (cost=22.70..22.70 rows=1270 width=36) (actual time=0.155..0.156 rows=5 loops=3)
Buckets: 2048 Batches: 1 Memory Usage: 17kB
Buffers: shared hit=3
-> Seq Scan on table2 (cost=0.00..22.70 rows=1270 width=36) (actual time=0.142..0.143 rows=5 loops=3)
Buffers: shared hit=3
2021-04-14 11:54:24.122 GMT [5056] CONTEXT: SQL function "test_function2" statement 1

First, a general discussion of how to get execution plans in such a case
To get to the bottom of that, activate auto_explain and track function execution in postgresql.conf:
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 0
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_nested_statements = on
track_functions = 'pl'
Then restart the database. Don't do that on a busy production database, as it will log a lot and add considerable overhead!
Reset the database statistics with
SELECT pg_stat_reset();
Now the execution plans of all the SQL statements inside your functions will be logged, and PostgreSQL keeps track of function execution times.
Look at the execution plans and execution times of the statements when called from the SQL function and from the PL/pgSQL function and see if you can spot a difference. Then look at pg_stat_user_functions to compare the functions' execution times.
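For example, a query like this (using the standard columns of the pg_stat_user_functions view) lists the collected timings:
-- per-function call counts and timings gathered since pg_stat_reset()
SELECT schemaname, funcname, calls, total_time, self_time
FROM pg_stat_user_functions
ORDER BY total_time DESC;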
Explanation in the current case, after looking at the execution plans
The query run from PL/pgSQL is not parallelized: due to a limitation in the implementation, queries run with RETURN QUERY never are. The SQL-language function has no such restriction, which is why it gets the parallel plan and runs about twice as fast, so sticking with LANGUAGE sql, as you already did, is the right fix. The small difference in the SUM result is also explained by parallelism: the partial aggregates add the real values in a different order, and floating-point addition is not associative, so the rounded totals differ slightly.

Related

How can I get PostgreSQL to use an index with a computed value in a predicate?

What I mean is this. In PostgreSQL (v 15.1) I have a table foo created in the following way.
create table foo (
id integer primary key generated by default as identity,
id_mod_7 int generated always as (id % 7) stored
);
create index on foo (id_mod_7, id);
insert into foo (id) select generate_series(1, 10000);
If I query this table with a predicate that doesn't use a literal constant but rather uses a function, a sequential scan is used:
explain analyze
select count(1) from foo where id_mod_7 = extract(dow from current_date);
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=245.12..245.13 rows=1 width=8) (actual time=7.218..7.219 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..245.00 rows=50 width=0) (actual time=0.020..7.028 rows=1428 loops=1)
Filter: ((id_mod_7)::numeric = EXTRACT(dow FROM CURRENT_DATE))
Rows Removed by Filter: 8572
Planning Time: 0.178 ms
Execution Time: 7.281 ms
However, if I query this table with a predicate that does use a literal constant, an index scan is used:
explain analyze
select count(1) from foo where id_mod_7 = 6;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=48.84..48.85 rows=1 width=8) (actual time=0.321..0.322 rows=1 loops=1)
-> Index Only Scan using foo_id_mod_7_id_idx on foo (cost=0.29..45.27 rows=1428 width=0) (actual time=0.022..0.214 rows=1428 loops=1)
Index Cond: (id_mod_7 = 6)
Heap Fetches: 0
Planning Time: 0.106 ms
Execution Time: 0.397 ms
I thought maybe I could fool it into using the index if I used the caching (alleged?) properties of Common Table Expressions (CTE), but to no avail:
explain analyze
with param as (select extract(dow from current_date) as dow)
select count(1) from foo join param on id_mod_7 = dow;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Aggregate (cost=245.12..245.13 rows=1 width=8) (actual time=5.830..5.831 rows=1 loops=1)
-> Seq Scan on foo (cost=0.00..245.00 rows=50 width=0) (actual time=0.025..5.668 rows=1428 loops=1)
Filter: ((id_mod_7)::numeric = EXTRACT(dow FROM CURRENT_DATE))
Rows Removed by Filter: 8572
Planning Time: 0.234 ms
Execution Time: 5.894 ms
It's not fatal, but I'm just trying to understand what's going on here. Thanks!
P.S. and just to avoid confusion, it's not the table column that is being computed within the SQL query. It's the value in the predicate expression that is (or would be) computed within the SQL query.
Like I said, I've tried using a CTE because I believed the CTE would be cached or materialized and expected an index scan, but unfortunately still got a sequential scan.
This is because extract() returns a numeric value, but the column is an integer. You can see this in the execution plan: (id_mod_7)::numeric = ... The column has to be cast to numeric to match the value returned by extract(), and the plain index on the integer column cannot be used for that casted expression.
You need to cast the result of the extract() function to an int:
select count(*)
from foo
where id_mod_7 = extract(dow from current_date)::int
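As an aside, if the query text cannot be changed, an expression index on the cast should also work, because the planner can then match the predicate exactly as it appears in the plan (a sketch, not part of the answer above):
-- index the casted expression so (id_mod_7)::numeric = <stable expression> can use it
create index on foo ((id_mod_7::numeric));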

Getting incorrect total number of likes and dislikes sometimes

I have a table feed_item_likes_dislikes in PostgreSQL (feed_item_id, user_id, vote) where
feed_item_id is uuid
user_id is integer
vote = TRUE means it is a like
vote = FALSE means it is a dislike
vote = NULL means the user originally liked or disliked but came back and removed the vote by unvoting
I have another table feed_item_likes_dislikes_aggregate(feed_item_id, likes, dislikes) in which I want to maintain the total number of likes and dislikes per post.
When the user adds a new like to the feed_item_likes_dislikes table with
INSERT INTO feed_item_likes_dislikes VALUES('54d67b62-9b71-a6bc-d934-451c1eaae3bc', 1, TRUE);
I want to update the total number of likes in the aggregate table. The same needs to happen for dislikes and when a user unvotes something by setting vote to NULL.
A user may also change their like to a dislike and vice versa, and in every case the total number of likes and dislikes for that post needs to stay correct.
I wrote the following trigger function to accomplish this
CREATE OR REPLACE FUNCTION update_votes() RETURNS trigger AS $$
DECLARE
feed_item_id_val uuid;
likes_val integer;
dislikes_val integer;
BEGIN
IF (TG_OP = 'DELETE') THEN
-- when a row is deleted, store feed_item_id of the deleted row so that we can update its likes and dislikes count
feed_item_id_val:=OLD.feed_item_id;
ELSIF (TG_OP = 'UPDATE') OR (TG_OP='INSERT') THEN
feed_item_id_val:=NEW.feed_item_id;
END IF;
-- get total number of likes and dislikes for the given feed_item_id
SELECT COUNT(*) FILTER(WHERE vote=TRUE) AS likes, COUNT(*) FILTER(WHERE vote=FALSE) AS dislikes INTO likes_val, dislikes_val FROM feed_item_likes_dislikes WHERE feed_item_id=feed_item_id_val;
-- update the aggregate count for only this feed_item_id
INSERT INTO feed_item_likes_dislikes_aggregate (feed_item_id, likes, dislikes) VALUES (feed_item_id_val, likes_val, dislikes_val) ON CONFLICT(feed_item_id) DO UPDATE SET likes=likes_val, dislikes=dislikes_val;
RETURN NULL; -- result is ignored since this is an AFTER trigger
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER update_votes_trigger AFTER INSERT OR UPDATE OR DELETE ON feed_item_likes_dislikes FOR EACH ROW EXECUTE PROCEDURE update_votes();
But when I do a bulk insert into feed_item_likes_dislikes, sometimes the total number of likes and dislikes is incorrect.
Can someone kindly tell me how I can fix this?
Update 1
I tried creating a view, but it takes a lot of time on my production data set; here is the db fiddle: https://www.db-fiddle.com/f/2ZAkjQhUydMaV9o5xvLgMT/17
Query #1
EXPLAIN ANALYZE SELECT f.feed_item_id,pubdate,link,guid,title,summary,author,feed_id,COALESCE(likes, 0) AS likes,COALESCE(dislikes, 0) AS dislikes,COALESCE(bullish, 0) AS bullish,COALESCE(bearish, 0) AS bearish FROM feed_items f LEFT JOIN likes_dislikes_aggregate l ON f.feed_item_id = l.feed_item_id LEFT JOIN bullish_bearish_aggregate b ON f.feed_item_id = b.feed_item_id ORDER BY pubdate DESC, f.feed_item_id DESC LIMIT 10;
QUERY PLAN
Limit (cost=112.18..112.21 rows=10 width=238) (actual time=0.257..0.260 rows=10 loops=1)
-> Sort (cost=112.18..112.93 rows=300 width=238) (actual time=0.257..0.257 rows=10 loops=1)
Sort Key: f.pubdate DESC, f.feed_item_id DESC
Sort Method: top-N heapsort Memory: 27kB
-> Hash Left Join (cost=91.10..105.70 rows=300 width=238) (actual time=0.162..0.222 rows=100 loops=1)
Hash Cond: (f.feed_item_id = b.feed_item_id)
-> Hash Left Join (cost=45.55..59.35 rows=300 width=222) (actual time=0.080..0.114 rows=100 loops=1)
Hash Cond: (f.feed_item_id = l.feed_item_id)
-> Seq Scan on feed_items f (cost=0.00..13.00 rows=300 width=206) (actual time=0.004..0.011 rows=100 loops=1)
-> Hash (cost=43.05..43.05 rows=200 width=32) (actual time=0.069..0.069 rows=59 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Subquery Scan on l (cost=39.05..43.05 rows=200 width=32) (actual time=0.037..0.052 rows=59 loops=1)
-> HashAggregate (cost=39.05..41.05 rows=200 width=32) (actual time=0.036..0.046 rows=59 loops=1)
Group Key: feed_item_likes_dislikes.feed_item_id
-> Seq Scan on feed_item_likes_dislikes (cost=0.00..26.60 rows=1660 width=17) (actual time=0.003..0.008 rows=95 loops=1)
-> Hash (cost=43.05..43.05 rows=200 width=32) (actual time=0.064..0.064 rows=63 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 12kB
-> Subquery Scan on b (cost=39.05..43.05 rows=200 width=32) (actual time=0.029..0.044 rows=63 loops=1)
-> HashAggregate (cost=39.05..41.05 rows=200 width=32) (actual time=0.028..0.038 rows=63 loops=1)
Group Key: feed_item_bullish_bearish.feed_item_id
-> Seq Scan on feed_item_bullish_bearish (cost=0.00..26.60 rows=1660 width=17) (actual time=0.002..0.007 rows=93 loops=1)
Planning Time: 0.140 ms
Execution Time: 0.328 ms
Trying to keep a running aggregate is always riddled with traps and is almost always not worth the effort (with concurrent inserts, each trigger recomputes the totals from only the rows its own snapshot can see, so parallel writers can overwrite each other's results, which is exactly the "sometimes incorrect" behaviour you observe). The solution is not to store aggregates at all but to derive them as needed. You do this by creating a VIEW rather than a table. This removes all the additional processing, especially in this case, as your trigger basically already contains the query needed to generate the view. (See demo here.)
create or replace VIEW likes_dislikes_aggregate as
select id
, count(*) filter(where vote) as likes
, count(*) filter(where not vote) as dislikes
, count(*) filter(where vote is null) as no_vote
from likes_dislikes
group by id;
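Reading the aggregate on demand is then an ordinary query against the view, for example (a sketch against the demo schema, using the uuid from the question as a sample id):
SELECT id, likes, dislikes, no_vote
FROM likes_dislikes_aggregate
WHERE id = '54d67b62-9b71-a6bc-d934-451c1eaae3bc';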
No trigger, no additional code: everything goes through the view with standard DML. Notice that the view is basically nothing but your count query, without the trigger overhead and maintenance:
SELECT COUNT(*) FILTER(WHERE vote=TRUE) AS likes, COUNT(*) FILTER(WHERE vote=FALSE) AS dislikes INTO likes_val, dislikes_val FROM feed_item_likes_dislikes WHERE feed_item_id=feed_item_id_val;

PostgreSQL slow order

I have a table (over 100 million records) on PostgreSQL 13.1:
CREATE TABLE report
(
id serial primary key,
license_plate_id integer,
datetime timestamp
);
Indexes (for testing I created both of them):
create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);
So, my question is why a query like
select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100
is very slow (~10 s), while the query without the ORDER BY clause is fast (milliseconds).
Explain:
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
Buffers: shared hit=103
-> Index Scan using report_lp_id_idx on report r (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
Buffers: shared hit=11455 read=671
-> Sort (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
Sort Key: datetime DESC
Sort Method: top-N heapsort Memory: 128kB
Buffers: shared hit=11455 read=671
-> Bitmap Heap Scan on report r (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Heap Blocks: exact=12063
Buffers: shared hit=11455 read=671
-> Bitmap Index Scan on report_lp_id_idx (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms
You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.
The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:
CLUSTER report USING report_lp_id_idx;
Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.
But if you need this query to be fast no matter what, CLUSTER is the way to go.
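As a side note, once the table has been clustered with USING, PostgreSQL remembers that index, so the periodic re-clustering mentioned above does not need the index name again:
CLUSTER report;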
Your two indices do exactly the same thing, so you can remove the second one; it's useless.
To optimize your query, the order of the fields inside the index must be reversed:
create index report_lp_datetime_index on report (datetime,license_plate_id);
BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;
Limit (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
-> Index Only Scan Backward using foo_d_i on foo (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9016
Heap Fetches: 0
Planning Time: 0.339 ms
Execution Time: 9.387 ms
Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column i (the license_plate_id stand-in) to the index, the condition on i can be checked from the index alone, without hitting the table for every row. Since the LIMIT is low, it does not need to scan the whole index; it only scans it in date DESC order until it has found enough rows satisfying the WHERE condition to return the result.
It will be faster if you create the index in date DESC order; this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.
Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.
EXPLAIN (ANALYZE,buffers) SELECT * FROM foo
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i)
ORDER BY d DESC LIMIT 100;
Limit (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
Buffers: shared hit=453
-> Nested Loop (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
Buffers: shared hit=453
-> Limit (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
Buffers: shared hit=53
-> Index Only Scan using foo_d_i on foo foo_1 (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=53
-> Index Scan using foo_d_i on foo (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
Buffers: shared hit=400
Execution Time: 3.663 ms
Another option is to just add the primary key to the date,license_plate index.
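The DDL for that variant isn't shown in the demo; presumably it is something like the following, judging from the index name foo_d_i_id and the foo_pkey scan in the plan below (it assumes an id primary key column has been added to foo):
CREATE INDEX foo_d_i_id ON foo (d DESC, i, id);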
SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;
Limit (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
Buffers: shared hit=473
-> Sort (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
Sort Key: foo.d DESC
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=473
-> Nested Loop (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
Buffers: shared hit=473
-> Limit (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
Buffers: shared hit=73
-> Index Only Scan using foo_d_i_id on foo foo_1 (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=73
-> Index Scan using foo_pkey on foo (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
Index Cond: (id = foo_1.id)
Buffers: shared hit=400
Execution Time: 3.972 ms
Edit
After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:
EXPLAIN (ANALYZE,BUFFERS)
SELECT * FROM foo
JOIN (SELECT d,i FROM
(VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist
CROSS JOIN LATERAL
(SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2
ORDER BY d DESC LIMIT 100
) f3 USING (d,i)
ORDER BY d DESC LIMIT 100;
It's even faster: 2 ms, and it uses the index on (license_plate_id, date) instead of the other way around. Also, and this is important, each subquery in the lateral join hits only the index pages that contain rows that will actually be selected, while the previous queries hit many more index pages. So you save on RAM buffers.
If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.
Please post results for the winning query 🔥

postgres function update large table and deletion performance

postgres version: 9.3
postgres.conf: all default configurations
I have 2 tables, A and B; both have 1 million rows.
There is a Postgres function that executes every 2 seconds; it updates table A where the ids are in an array (array size = 20), and then deletes the corresponding rows from table B.
The DB function is shown below:
CREATE OR REPLACE FUNCTION test_function (ids NUMERIC[])
RETURNS void AS $$
BEGIN
UPDATE A a
SET status = 'begin', end_time = (NOW() AT TIME ZONE 'UTC')
WHERE a.id = ANY (ids);
DELETE FROM B b
WHERE b.aid = ANY (ids)
AND b.status = 'end';
END;
$$ LANGUAGE plpgsql;
The analysis is shown below:
explain(ANALYZE,BUFFERS,VERBOSE) select test_function('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}');
QUERY PLAN
Result (cost=0.00..0.26 rows=1 width=0) (actual time=14030.435..14030.436 rows=1 loops=1)
Output: test_function('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}'::numeric[])
Buffers: shared hit=24297 read=26137 dirtied=20
Total runtime: 14030.444 ms
(4 rows)
My questions are:
In the production environment, why does this function need up to 7 seconds to complete?
While this function is executing, the process eats up to 60% CPU. --> This is the key problem
EDIT:
Analysis of each individual SQL statement:
explain(ANALYZE,VERBOSE,BUFFERS) UPDATE A a SET status = 'begin',
end_time = (now()) WHERE a.id = ANY
('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}');
QUERY PLAN
Update on public.A a (cost=0.45..99.31 rows=20 width=143) (actual time=1.206..1.206 rows=0 loops=1)
Buffers: shared hit=206 read=27 dirtied=30
-> Index Scan using A_pkey on public.a a (cost=0.45..99.31 rows=20 width=143) (actual time=0.019..0.116 rows=19 loops=1)
Output: id, start_time, now(), 'begin'::character varying(255), xxxx... ctid
Index Cond: (t.id = ANY('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}'::integer[]))
Buffers: shared hit=75 read=11
Trigger test_trigger: time=5227.111 calls=1
Total runtime: 5228.357 ms
(8 rows)
explain(ANALYZE,BUFFERS,VERBOSE) DELETE FROM
B b WHERE b.aid = ANY
('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}');
QUERY PLAN
Delete on B b (cost=0.00..1239.11 rows=20 width=6) (actual time=6.013..6.013 rows=0 loops=1)
Buffers: shared hit=448
-> Seq Scan on B b (cost=0.00..1239.11 rows=20 width=6) (actual time=6.011..6.011 rows=0 loops=1)
Output: ctid
Filter: (b.aid = ANY ('{2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}'::bigint[]))
Rows Removed by Filter: 21743
Buffers: shared hit=448
Total runtime: 6.029 ms
(8 rows)
CPU usage
Before calling:
After frequent operations:

SQL function very slow compared to query without function wrapper

I have this PostgreSQL 9.4 query that runs very fast (~12ms):
SELECT
auth_web_events.id,
auth_web_events.time_stamp,
auth_web_events.description,
auth_web_events.origin,
auth_user.email,
customers.name,
auth_web_events.client_ip
FROM
public.auth_web_events,
public.auth_user,
public.customers
WHERE
auth_web_events.user_id_fk = auth_user.id AND
auth_user.customer_id_fk = customers.id AND
auth_web_events.user_id_fk = 2
ORDER BY
auth_web_events.id DESC;
But if I embed it into a function, the query runs very slowly over all the data; it seems to be going through every record. What am I missing? I have ~1M rows and I want to simplify my database layer by storing the large queries in functions and views.
CREATE OR REPLACE FUNCTION get_web_events_by_userid(int) RETURNS TABLE(
id int,
time_stamp timestamp with time zone,
description text,
origin text,
userlogin text,
customer text,
client_ip inet
) AS
$func$
SELECT
auth_web_events.id,
auth_web_events.time_stamp,
auth_web_events.description,
auth_web_events.origin,
auth_user.email AS user,
customers.name AS customer,
auth_web_events.client_ip
FROM
public.auth_web_events,
public.auth_user,
public.customers
WHERE
auth_web_events.user_id_fk = auth_user.id AND
auth_user.customer_id_fk = customers.id AND
auth_web_events.user_id_fk = $1
ORDER BY
auth_web_events.id DESC;
$func$ LANGUAGE SQL;
The query plan is:
"Sort (cost=20.94..20.94 rows=1 width=791) (actual time=61.905..61.906 rows=2 loops=1)"
" Sort Key: auth_web_events.id"
" Sort Method: quicksort Memory: 25kB"
" -> Nested Loop (cost=0.85..20.93 rows=1 width=791) (actual time=61.884..61.893 rows=2 loops=1)"
" -> Nested Loop (cost=0.71..12.75 rows=1 width=577) (actual time=61.874..61.879 rows=2 loops=1)"
" -> Index Scan using auth_web_events_fk1 on auth_web_events (cost=0.57..4.58 rows=1 width=61) (actual time=61.860..61.860 rows=2 loops=1)"
" Index Cond: (user_id_fk = 2)"
" -> Index Scan using auth_user_pkey on auth_user (cost=0.14..8.16 rows=1 width=524) (actual time=0.005..0.005 rows=1 loops=2)"
" Index Cond: (id = 2)"
" -> Index Scan using customers_id_idx on customers (cost=0.14..8.16 rows=1 width=222) (actual time=0.004..0.005 rows=1 loops=2)"
" Index Cond: (id = auth_user.customer_id_fk)"
"Planning time: 0.369 ms"
"Execution time: 61.965 ms"
I'm calling the function this way:
SELECT * from get_web_events_by_userid(2)
The query plan for the function:
"Function Scan on get_web_events_by_userid (cost=0.25..10.25 rows=1000 width=172) (actual time=279107.142..279107.144 rows=2 loops=1)"
"Planning time: 0.038 ms"
"Execution time: 279107.175 ms"
EDIT: I just changed the parameters, and the issue persists.
EDIT2: Query plan for Erwin's answer:
"Sort (cost=20.94..20.94 rows=1 width=791) (actual time=0.048..0.049 rows=2 loops=1)"
" Sort Key: w.id"
" Sort Method: quicksort Memory: 25kB"
" -> Nested Loop (cost=0.85..20.93 rows=1 width=791) (actual time=0.030..0.037 rows=2 loops=1)"
" -> Nested Loop (cost=0.71..12.75 rows=1 width=577) (actual time=0.023..0.025 rows=2 loops=1)"
" -> Index Scan using auth_user_pkey on auth_user u (cost=0.14..8.16 rows=1 width=524) (actual time=0.011..0.012 rows=1 loops=1)"
" Index Cond: (id = 2)"
" -> Index Scan using auth_web_events_fk1 on auth_web_events w (cost=0.57..4.58 rows=1 width=61) (actual time=0.008..0.008 rows=2 loops=1)"
" Index Cond: (user_id_fk = 2)"
" -> Index Scan using customers_id_idx on customers c (cost=0.14..8.16 rows=1 width=222) (actual time=0.003..0.004 rows=1 loops=2)"
" Index Cond: (id = u.customer_id_fk)"
"Planning time: 0.541 ms"
"Execution time: 0.101 ms"
While rewriting your function I realized that you added column aliases here:
SELECT
...
auth_user.email AS user,
customers.name AS customer,
... which wouldn't do anything to begin with, since those aliases are invisible outside the function and are not referenced inside it, so they would be ignored. For documentation purposes, better to use a comment.
But it also makes your query invalid, because user is a completely reserved word and cannot be used as a column alias unless double-quoted.
Oddly, in my tests the function seems to work with the invalid alias. Probably because it is ignored (?). But I am not sure this couldn't have side effects.
Your function rewritten (otherwise equivalent):
CREATE OR REPLACE FUNCTION get_web_events_by_userid(int)
RETURNS TABLE (
id int
, time_stamp timestamptz
, description text
, origin text
, userlogin text
, customer text
, client_ip inet
)
LANGUAGE sql STABLE AS
$func$
SELECT w.id
, w.time_stamp
, w.description
, w.origin
, u.email -- AS user -- make this a comment!
, c.name -- AS customer
, w.client_ip
FROM public.auth_user u
JOIN public.auth_web_events w ON w.user_id_fk = u.id
JOIN public.customers c ON c.id = u.customer_id_fk
WHERE u.id = $1 -- reverted the logic here
ORDER BY w.id DESC
$func$;
Obviously, the STABLE keyword changed the outcome. Function volatility should not be an issue in the test situation you describe. The setting does not normally benefit a single, isolated function call. Read the details in the manual. Also, standard EXPLAIN does not show query plans for what's going on inside functions. You could employ the additional module auto_explain for that:
Postgres query plan of a UDF invocation written in pgpsql
You have a very odd data distribution:
auth_web_events table has 100000000 records, auth_user->2 records, customers-> 1 record
Since you didn't declare otherwise, the planner assumes the function returns 1000 rows. But your function actually returns only 2 rows. If all your calls return only (in the vicinity of) 2 rows, just declare that with ROWS 2 added to the function definition. That might change the query plan for the VOLATILE variant as well (even if STABLE is the right choice here anyway).
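For example (a one-liner sketch; ROWS can also be attached to the existing function with ALTER FUNCTION):
ALTER FUNCTION get_web_events_by_userid(int) STABLE ROWS 2;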
You will get better performance by making this query dynamic and using plpgsql.
CREATE OR REPLACE FUNCTION get_web_events_by_userid(uid int) RETURNS TABLE(
id int,
time_stamp timestamp with time zone,
description text,
origin text,
userlogin text,
customer text,
client_ip inet
) AS $$
BEGIN
RETURN QUERY EXECUTE
'SELECT
auth_web_events.id,
auth_web_events.time_stamp,
auth_web_events.description,
auth_web_events.origin,
auth_user.email,
customers.name,
auth_web_events.client_ip
FROM
public.auth_web_events,
public.auth_user,
public.customers
WHERE
auth_web_events.user_id_fk = auth_user.id AND
auth_user.customer_id_fk = customers.id AND
auth_web_events.user_id_fk = ' || uid ||
' ORDER BY
auth_web_events.id DESC;';
END;
$$ LANGUAGE plpgsql;
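For what it's worth, here is a sketch of the same dynamic-SQL idea that passes the parameter with EXECUTE ... USING instead of string concatenation (not the original answer's code; it also drops the problematic user alias):
CREATE OR REPLACE FUNCTION get_web_events_by_userid(uid int) RETURNS TABLE(
id int,
time_stamp timestamp with time zone,
description text,
origin text,
userlogin text,
customer text,
client_ip inet
) AS $$
BEGIN
-- $1 refers to the USING parameter, so the query text stays constant across calls
RETURN QUERY EXECUTE
'SELECT w.id, w.time_stamp, w.description, w.origin,
u.email, c.name, w.client_ip
FROM public.auth_web_events w
JOIN public.auth_user u ON w.user_id_fk = u.id
JOIN public.customers c ON u.customer_id_fk = c.id
WHERE w.user_id_fk = $1
ORDER BY w.id DESC'
USING uid;
END;
$$ LANGUAGE plpgsql;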