Is there a way to improve the query performance in PostgreSQL?

I'm writing a query in PostgreSQL which gets stuck while running and never returns any records. Could anyone help me with this?
Actual Query:
select a.auditdate, b.description as auditcategory, remoteaddress, u.name as user1,
       e.name || '[' || e.employeenumber || ']' as employee, a.additionalinfo
from tblauditlog a
inner join tblauditcategory b on b.cid = a.auditcategory and b.cid <> 756
left outer join tbluser u on a.userid = u.cid and u.usertype <> 250
left outer join tblemployee e on (a.affectedemployeeid = e.cid or a.affectedemployee = e.cid or a.affectedemployee = e.employeenumber)
where auditdate >= '01 sep 2022' and auditdate <= '15 sep 2022'
order by a.auditdate desc;
Query Plan:
Nested Loop Left Join  (cost=0.71..659969657.09 rows=1026362 width=151)
  Join Filter: (a.userid = u.cid)
  ->  Nested Loop Left Join  (cost=0.71..654861596.89 rows=1026362 width=142)
        Join Filter: ((a.affectedemployeeid = e.cid) OR ((a.affectedemployee)::text = textin(int4out(e.cid))) OR ((a.affectedemployee)::text = (e.employeenumber)::text))
        ->  Nested Loop  (cost=0.71..268627.45 rows=566679 width=134)
              ->  Index Scan Backward using idx_tblauditlog_auditdate on tblauditlog a  (cost=0.43..101251.69 rows=567370 width=112)
                    Index Cond: ((auditdate >= '2022-09-01 00:00:00'::timestamp without time zone) AND (auditdate <= '2022-09-15 00:00:00'::timestamp without time zone))
              ->  Index Scan using tblauditcategory_pkey on tblauditcategory b  (cost=0.28..0.30 rows=1 width=30)
                    Index Cond: (cid = a.auditcategory)
                    Filter: (cid <> 756)
        ->  Materialize  (cost=0.00..8005.07 rows=46205 width=33)
              ->  Seq Scan on tblemployee e  (cost=0.00..7774.05 rows=46205 width=33)
  ->  Materialize  (cost=0.00..4554.94 rows=331 width=14)
        ->  Seq Scan on fk_tbluser u  (cost=0.00..4553.29 rows=331 width=14)
              Filter: (usertype <> '250'::numeric)
(15 rows)
Actual number of records in each table:
tblAuditlog : 6852333
tblAuditCategory : 825
tbluser : 46342
tblemployee : 46014
Index created:
tblAuditlog:
1. "tblauditlog_pkey" PRIMARY KEY, btree (cid)
2. "idx_tblauditlog_auditdate" btree (auditdate, auditcategory, userid)
tblAuditcategory:
1. "tblauditcategory_pkey" PRIMARY KEY, btree (cid)
2. "tblauditCategory_unique" UNIQUE CONSTRAINT, btree (code)
3. "idx_tblauditcategory_code" btree (code)
tbluser:
1. "tbluser_pkey" PRIMARY KEY, btree (cid)
2. "tbluser_employeeid_key" UNIQUE CONSTRAINT, btree (employeeid)
3. "tbluser_name_key" UNIQUE CONSTRAINT, btree (name)
4. "uq_fk_tbluser_name_type_employeeid" UNIQUE CONSTRAINT, btree (name, usertype, employeeid)
5. "idx_tbluser_utype" btree (cid, usertype)
tblemployee:
1. "tblemployee_pkey" PRIMARY KEY, btree (cid)
2. "tblemployee_employeeno_key" UNIQUE CONSTRAINT, btree (employeenumber)
3. "tblemployee_guid_key" UNIQUE CONSTRAINT, btree (sid)
4. "idx_tblemployee_employeeno" btree (employeenumber)
Thanks in advance...

Thanks @jjanes. As you suggested, the OR condition in the join was the culprit. I have changed the query as shown below; after the modification, it executes within a fraction of a second. Thank you all for your support.
The modified query is:
select a.auditdate, b.description as auditcategory, remoteaddress, u.name as user1,
       coalesce(e.name, e1.name, e2.name) || '[' || coalesce(e.employeeno, e1.employeeno, e2.employeeno) || ']' as employee,
       a.additionalinfo
from tblauditlog a
inner join tblauditcategory b on b.cid = a.auditcategory and b.cid <> 756
left outer join tbluser u on a.userid = u.cid and u.usertype <> 250
left outer join tblemployee e on (a.affectedemployeeid = e.cid)
left join tblemployee e1 on (a.affectedemployee = e1.cid)
left join tblemployee e2 on (a.affectedemployee = e2.employeeno)
where auditdate >= '01 sep 2022' and auditdate <= '15 sep 2022'
order by a.auditdate desc;
There is a huge improvement in the query plan:
Sort  (cost=433884.73..436109.96 rows=890089 width=151)
  Sort Key: a.auditdate DESC
  ->  Hash Join  (cost=252015.26..306723.76 rows=890089 width=151)
        Hash Cond: (a.auditcategory = b.cid)
        ->  Merge Left Join  (cost=251985.75..297667.18 rows=891175 width=184)
              Merge Cond: ((a.affectedemployee)::text = (e2.employeeno)::text)
              ->  Merge Left Join  (cost=251985.33..275914.91 rows=891175 width=172)
                    Merge Cond: ((a.affectedemployee)::text = (textin(int4out(e1.cid))))
                    ->  Sort  (cost=227637.20..229094.12 rows=582767 width=143)
                          Sort Key: a.affectedemployee
                          ->  Hash Left Join  (cost=20079.20..147328.20 rows=582767 width=143)
                                Hash Cond: (a.userid = u.cid)
                                ->  Hash Left Join  (cost=15491.10..141210.24 rows=582767 width=137)
                                      Hash Cond: (a.affectedemployeeid = e.cid)
                                      ->  Index Scan Backward using idx_tblauditlog_auditdate on tblauditlog a  (cost=0.43..103968.76 rows=582767 width=112)
                                            Index Cond: ((auditdate >= '2022-09-01 00:00:00'::timestamp without time zone) AND (auditdate <= '2022-09-15 00:00:00'::timestamp without time zone))
                                      ->  Hash  (cost=13227.63..13227.63 rows=111363 width=33)
                                            ->  Seq Scan on tblemployee e  (cost=0.00..13227.63 rows=111363 width=33)
                                ->  Hash  (cost=4583.89..4583.89 rows=337 width=14)
                                      ->  Seq Scan on tbluser u  (cost=0.00..4583.89 rows=337 width=14)
                                            Filter: (usertype <> '250'::numeric)
                    ->  Materialize  (cost=24348.13..24904.95 rows=111363 width=33)
                          ->  Sort  (cost=24348.13..24626.54 rows=111363 width=33)
                                Sort Key: (textin(int4out(e1.cid)))
                                ->  Seq Scan on tblemployee e1  (cost=0.00..13227.63 rows=111363 width=33)
              ->  Index Scan using tblemployee_employeeno_key on tblemployee e2  (cost=0.42..15775.39 rows=111363 width=29)
        ->  Hash  (cost=19.26..19.26 rows=820 width=30)
              ->  Seq Scan on tblauditcategory b  (cost=0.00..19.26 rows=820 width=30)
                    Filter: (cid <> 756)
(29 rows)
Thank you once again...
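A follow-up note on the improved plan: the sort key textin(int4out(e1.cid)) shows that a.affectedemployee is text while e1.cid is an integer, so the e1 join still has to sort the text form of every cid instead of using an index. If that join ever becomes the bottleneck, an expression index is one option; a minimal sketch (the index name is made up, and it is worth confirming with EXPLAIN that the planner actually matches the cast):
-- Hypothetical expression index over the text form of cid, so the
-- a.affectedemployee = e1.cid comparison can use an index scan
CREATE INDEX idx_tblemployee_cid_text ON tblemployee ((cid::text));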

Related

Bad execution plan on Postgresql

I'm trying to migrate from SQL Server to PostgreSQL.
Here is my PostgreSQL code:
Create View person_names As
SELECT lp."Code", n."Name", n."Type"
from "Persons" lp
Left Join LATERAL
(
Select *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
Limit 1
) n on true
limit 100;
Explain
Select "Code" From person_names;
It prints
"Subquery Scan on person_names (cost=0.42..448.85 rows=100 width=10)"
" -> Limit (cost=0.42..447.85 rows=100 width=56)"
" -> Nested Loop Left Join (cost=0.42..303946.91 rows=67931 width=56)"
" -> Seq Scan on ""Persons"" lp (cost=0.00..1314.31 rows=67931 width=10)"
" -> Limit (cost=0.42..4.44 rows=1 width=100)"
" -> Index Only Scan Backward using ""IX_Names_Person"" on ""Names"" n (cost=0.42..4.44 rows=1 width=100)"
" Index Cond: ("id" = (lp."id")::numeric)"
Why is there an "Index Only Scan" on the "Names" table? This table is not required to produce the result. On SQL Server I get only a single scan over the "Persons" table.
How can I tune Postgres to get a better query plan? I'm on the latest version, PostgreSQL 15 beta 3.
Here is SQL Server version:
Create View person_names As
SELECT top 100 lp."Code", n."Name", n."Type"
from "Persons" lp
Outer Apply
(
Select Top 1 *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
) n
GO
SET SHOWPLAN_TEXT ON;
GO
Select "Code" From person_names;
It gives the correct execution plan:
|--Top(TOP EXPRESSION:((100)))
|--Index Scan(OBJECT:([Persons].[IX_Persons] AS [lp]))
Change the lateral join to a regular left join; Postgres is then able to remove the scan of the Names table entirely:
create View person_names
As
SELECT lp.Code, n.Name, n.Type
from Persons lp
Left Join (
Select distinct on (id) *
From Names n
Order By id, Date desc
) n on n.id = lp.id
limit 100;
The following index will support the distinct on () in case you do include columns from the Names table:
create index on "Names"(id, "Date" desc);
For select code from person_names this gives me this plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on persons lp (cost=0.00..309.00 rows=20000 width=7) (actual time=0.009..1.348 rows=20000 loops=1)
Planning Time: 0.262 ms
Execution Time: 1.738 ms
For select Code, name, type From person_names; this gives me this plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Right Join  (cost=559.42..14465.93 rows=20000 width=25) (actual time=5.585..68.545 rows=20000 loops=1)
  Hash Cond: (n.id = lp.id)
  ->  Unique  (cost=0.42..13653.49 rows=20074 width=26) (actual time=0.053..57.323 rows=20000 loops=1)
        ->  Index Scan using names_id_date_idx on names n  (cost=0.42..12903.49 rows=300000 width=26) (actual time=0.052..41.125 rows=300000 loops=1)
  ->  Hash  (cost=309.00..309.00 rows=20000 width=11) (actual time=5.407..5.407 rows=20000 loops=1)
        Buckets: 32768  Batches: 1  Memory Usage: 1116kB
        ->  Seq Scan on persons lp  (cost=0.00..309.00 rows=20000 width=11) (actual time=0.011..2.036 rows=20000 loops=1)
Planning Time: 0.460 ms
Execution Time: 69.180 ms
Of course I had to guess the table structures as you haven't provided any DDL.
Online example
Change your view definition like this:
create view person_names as
select p."Code",
(select "Name"
from "Names" n
where n.id = p.id
order by "Date" desc
limit 1)
from "Persons" p
limit 100;
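Since the correlated subquery now lives in the select list, the planner can skip it entirely when that column is not referenced; re-running the original EXPLAIN is a quick way to check (the exact plan will vary with your data):
-- Verify that "Names" no longer appears in the plan when only "Code" is selected
EXPLAIN
SELECT "Code" FROM person_names;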

Postgres: How to remove duplicates of old records in log table

I have a log table with about 5 million records:
id BIGSERIAL,
object_type_name VARCHAR(255),
object_id BIGINT,
user_id BIGINT,
service_id BIGINT,
op_id INTEGER,
dt TIMESTAMP(0) WITHOUT TIME ZONE DEFAULT now(),
property_name VARCHAR(255),
CONSTRAINT object_log_object_log_pkey PRIMARY KEY(id)
I need to delete duplicate records leaving only the latest record (the one with the max id). The problem is that my query is very slow (> 1 min):
DELETE FROM sys.object_log AS t5
USING (
SELECT t3.id
FROM sys.object_log t3 LEFT JOIN (
SELECT t1.id
FROM sys.object_log t1
WHERE t1.id = (
SELECT max(t2.id)
FROM sys.object_log t2
WHERE t2.object_type_name = t1.object_type_name
AND t2.object_id = t1.object_id
AND t2.property_name = t1.property_name
)
) t4 ON t3.id=t4.id
WHERE t4.id IS NULL
) t6
WHERE t5.id = t6.id
QUERY PLAN
Delete on object_log t5  (cost=1.30..72821293.06 rows=8298362 width=18)
  ->  Merge Join  (cost=1.30..72821293.06 rows=8298362 width=18)
        Merge Cond: (t3.id = t5.id)
        ->  Merge Anti Join  (cost=0.86..72365877.02 rows=8298362 width=20)
              Merge Cond: (t3.id = t1.id)
              ->  Index Scan using object_log_object_log_pkey on object_log t3  (cost=0.43..330836.36 rows=8340062 width=14)
              ->  Index Scan using object_log_object_log_pkey on object_log t1  (cost=0.43..72013669.25 rows=41700 width=14)
                    Filter: (id = (SubPlan 1))
                    SubPlan 1
                      ->  Aggregate  (cost=8.58..8.59 rows=1 width=8)
                            ->  Index Only Scan using object_log_idx1 on object_log t2  (cost=0.56..8.58 rows=1 width=8)
                                  Index Cond: ((object_type_name = (t1.object_type_name)::text) AND (object_id = t1.object_id) AND (property_name = (t1.property_name)::text))
        ->  Index Scan using object_log_object_log_pkey on object_log t5  (cost=0.43..330836.36 rows=8340062 width=14)
Any idea how to improve performance?
UPD.1
The following query is also slow:
DELETE FROM sys.object_log
WHERE id IN (
SELECT id
FROM (
SELECT id, ROW_NUMBER() OVER w AS rnum
FROM sys.object_log
WINDOW w AS (PARTITION BY object_type_name, object_id, property_name ORDER BY id)
) t
WHERE t.rnum > 1)
QUERY PLAN
Delete on object_log  (cost=1703454.67..1960873.74 rows=2780021 width=38)
  ->  Hash Semi Join  (cost=1703454.67..1960873.74 rows=2780021 width=38)
        Hash Cond: (object_log.id = t.id)
        ->  Seq Scan on object_log  (cost=0.00..197648.62 rows=8340062 width=14)
        ->  Hash  (cost=1668704.40..1668704.40 rows=2780021 width=40)
              ->  Subquery Scan on t  (cost=1355952.08..1668704.40 rows=2780021 width=40)
                    Filter: (t.rnum > 1)
                    ->  WindowAgg  (cost=1355952.08..1564453.63 rows=8340062 width=38)
                          ->  Sort  (cost=1355952.08..1376802.23 rows=8340062 width=30)
                                Sort Key: object_log_1.object_type_name, object_log_1.object_id, object_log_1.property_name, object_log_1.id
                                ->  Seq Scan on object_log object_log_1  (cost=0.00..197648.62 rows=8340062 width=30)
This is what I use for this: https://wiki.postgresql.org/wiki/Deleting_duplicates
Hope you find it useful.
Bjarni
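For a table shaped like this one, the approach from that wiki page boils down to a self-join delete that keeps the newest row per logical key. A minimal sketch, assuming object_type_name, object_id, and property_name are the deduplication key and are never NULL (NULLs would not match with =):
DELETE FROM sys.object_log a
USING sys.object_log b
WHERE a.id < b.id                               -- a newer duplicate exists,
  AND a.object_type_name = b.object_type_name   -- so row a is not the max-id
  AND a.object_id = b.object_id                 -- row of its group and can go
  AND a.property_name = b.property_name;
Judging by the plan above, the existing object_log_idx1 index appears to cover exactly these key columns, which should keep the inner side of the self-join on index scans.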

PostgreSQL table indexing

I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%'))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); --> don't know if this one is true!!
After indexing as above, my query still spends over 4.5 seconds just to select 1000 rows!
I am selecting from the following table, which has 34 columns including 3 FOREIGN KEYs, and it has over 3 million rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
pub_date" timestamptz(6) NOT NULL,
"service_id" varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) to search customer.phone and to select customer.client. To get from the main_transaction table to the main_customer table, I can only go through main_profile.
My question is: how can I index my tables to increase performance for the above query?
Please do not suggest using UNION instead of OR for the condition (upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%')); could we use a CASE WHEN condition instead? I have to convert my PostgreSQL query into Hibernate JPA, and I don't know how to convert a UNION except with Hibernate native SQL, which I am not allowed to use.
Explain:
Limit  (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
  ->  Sort  (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
        Sort Key: t.pub_date DESC, t.id
        Sort Method: quicksort  Memory: 27kB
        ->  Hash Join  (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
              Hash Cond: (t.profile_id = profile.id)
              Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
              Rows Removed by Join Filter: 593118
              ->  Seq Scan on main_transaction t  (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
                    Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
                    Rows Removed by Filter: 2132732
              ->  Hash  (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
                    Buckets: 131072  Batches: 4  Memory Usage: 3166kB
                    ->  Hash Join  (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
                          Hash Cond: (customer.id = profile.user_id)
                          ->  Seq Scan on main_customer customer  (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
                                Filter: ((client)::text = 'corp'::text)
                                Rows Removed by Filter: 16920
                          ->  Hash  (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
                                Buckets: 131072  Batches: 4  Memory Usage: 2966kB
                                ->  Seq Scan on main_profile profile  (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.
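Written out as DDL, those suggestions would look roughly like this (letting PostgreSQL generate the index names):
-- Equality columns first, then the pub_date range condition as the last column
CREATE INDEX ON main_transaction ((service_type::text), (status::text), (mode::text),
                                  (transaction_type::text), pub_date);
-- Supports the client = 'corp' filter
CREATE INDEX ON main_customer ((client::text));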

Postgres No Index Only Scan On Delete?

I have a query running in Postgres 9.3.9 where I want to delete some records from a temp table, using an EXISTS clause that matches a specific partial index condition I created. The following related query uses an Index Only Scan on this partial index (its condition is abbreviated as 'conditions' below):
EXPLAIN
SELECT l.id
FROM temp_table l
WHERE NOT EXISTS
(SELECT 1
FROM customers cx
WHERE cx.id = l.customer_id
AND ( conditions ));
QUERY PLAN
----------------------------------------------------------------------------------------------
Nested Loop Anti Join  (cost=0.42..252440.38 rows=43549 width=4)
  ->  Seq Scan on temp_table l  (cost=0.00..1277.98 rows=87098 width=8)
  ->  Index Only Scan using customers__bad on customers cx  (cost=0.42..3.35 rows=1 width=4)
        Index Cond: (id = l.customer_id)
(4 rows)
Here is the actual delete query SQL. It doesn't use the same Index Only Scan as above, though I am convinced it should, and I wonder if it's a bug in Postgres. Notice the higher cost:
DELETE
FROM temp_table l
WHERE EXISTS(SELECT 1
FROM cnu.customers cx
WHERE cx.id = l.customer_id
AND ( conditions ));
QUERY PLAN
------------------------------------------------------------------------------------------------
Delete on temp_table l  (cost=0.42..495426.94 rows=43549 width=12)
  ->  Nested Loop Semi Join  (cost=0.42..495426.94 rows=43549 width=12)
        ->  Seq Scan on temp_table l  (cost=0.00..1277.98 rows=87098 width=10)
        ->  Index Scan using customers__bad on customers cx  (cost=0.42..6.67 rows=1 width=10)
              Index Cond: (id = l.customer_id)
(5 rows)
To show that the same plan should be possible on delete, I had to do the following. It gave me the plan I wanted, and it was twice as fast as the query above, which uses an Index Scan instead of an Index Only Scan:
WITH the_right_records AS
(SELECT l.id
FROM temp_table l
WHERE NOT EXISTS
(SELECT 1
FROM cnu.customers cx
WHERE cx.id = l.customer_id
AND ( conditions )))
DELETE FROM temp_table t
WHERE NOT EXISTS (SELECT 1
FROM the_right_records x
WHERE x.id = t.id);
QUERY PLAN
------------------------------------------------------------------------------------------------------
Delete on temp_table t  (cost=253855.72..256902.88 rows=43549 width=34)
  CTE the_right_records
    ->  Nested Loop Anti Join  (cost=0.42..252440.38 rows=43549 width=4)
          ->  Seq Scan on temp_table l  (cost=0.00..1277.98 rows=87098 width=8)
          ->  Index Only Scan using customers__bad on customers cx  (cost=0.42..3.35 rows=1 width=4)
                Index Cond: (id = l.customer_id)
  ->  Hash Anti Join  (cost=1415.34..4462.50 rows=43549 width=34)
        Hash Cond: (t.id = x.id)
        ->  Seq Scan on temp_table t  (cost=0.00..1277.98 rows=87098 width=10)
        ->  Hash  (cost=870.98..870.98 rows=43549 width=32)
              ->  CTE Scan on the_right_records x  (cost=0.00..870.98 rows=43549 width=32)
(11 rows)
I've noticed this same behavior in other examples. So, does anyone have any ideas?

Why isn't PostgreSQL using an index in a Merge Join scenario?

explain select count(1) from tab1_201502 t1, tab2_201502 t2
where t1.serv_no=t2.serv_no
and t1.PC_LOGIN_COUNT1 >5
and t1.FET_WZ_FEE < 80
and t2.ALL_FLOW_2G<50;
QUERY PLAN
----------------------------------------------------------------------
Aggregate  (cost=4358706.25..4358706.26 rows=1 width=0)
  ->  Merge Join  (cost=4339930.99..4358703.30 rows=1179 width=0)
        Merge Cond: ((t1.serv_no)::text = (t2.serv_no)::text)
        ->  Index Scan using tab1_201502_serv_no_idx on tab1_201502 t1  (cost=0.56..6239071.57 rows=263219 width=12)
              Filter: ((pc_login_count1 > 5::numeric) AND (fet_wz_fee < 80::numeric))
        ->  Sort  (cost=4339914.76..4340306.63 rows=156747 width=12)
              Sort Key: t2.serv_no
              ->  Seq Scan on tab2_201502 t2  (cost=0.00..4326389.00 rows=156747 width=12)
                    Filter: (all_flow_2g < 50::numeric)
All tables are indexed on serv_no.
Why is PostgreSQL ignoring the tab2_201502 index for this scan?
This is your query:
select count(1)
from tab1_201502 t1 join
tab2_201502 t2
on t1.serv_no = t2.serv_no
where t1.PC_LOGIN_COUNT1 > 5 and t1.FET_WZ_FEE < 80 and t2.ALL_FLOW_2G < 50;
Postgres is deciding that filtering by the where clause is more important than performing the join.
I would recommend trying two sets of indexes for this query. The first pair is: tab2_201502(ALL_FLOW_2G, serv_no) and tab1_201502(serv_no, PC_LOGIN_COUNT1, FET_WZ_FEE).
The second pair is: tab1_201502(PC_LOGIN_COUNT1, FET_WZ_FEE, serv_no) and tab2_201502(serv_no, ALL_FLOW_2G).
Which works better depends on which table is the driving table for the join.
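Spelled out as DDL, the two candidate sets would be roughly the following (letting PostgreSQL generate the index names); try each set and compare the resulting plans:
-- First set: filter tab2_201502 first, then probe tab1_201502 by serv_no
CREATE INDEX ON tab2_201502 (all_flow_2g, serv_no);
CREATE INDEX ON tab1_201502 (serv_no, pc_login_count1, fet_wz_fee);

-- Second set: filter tab1_201502 first, then probe tab2_201502 by serv_no
CREATE INDEX ON tab1_201502 (pc_login_count1, fet_wz_fee, serv_no);
CREATE INDEX ON tab2_201502 (serv_no, all_flow_2g);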