Poor performance when upserting data to Postgres - postgresql

I am submitting 3 million records to Postgres table1 from a staging table table2. My update and insert queries are as below:
UPDATE table1 t set
col1 = stage.col1,
col2 = stage.col2 ,
col3 = stage.col3
from table2 stage
where t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level');
INSERT INTO table1
(col1,
col2,
col3,
col4,
id,
name,
level)
select
stage.col1,
stage.col2,
stage.col3,
stage.col4,
stage.id,
stage.name,
stage.level
from table2 stage
where NOT EXISTS (select 1
from table1 t where
t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level'));
I am facing performance issues (it takes around 1.5 hours), even though I am using exactly the same indexed keys (btree) as defined on the table. To test the cause, I created a replica of table1 without indexes and was able to submit the entire data set in roughly 15~17 minutes, so I am inclined to think the indexes are killing the performance on the table, as there are so many of them (including some unused indexes which I cannot drop due to permission issues). I am looking for suggestions to improve/optimize my query, or maybe some other upsert strategy, to reduce the data load time. Any suggestion is appreciated.

Running an explain analyze on the query helped me realize that it was never using the indexes defined on the target table and was doing a sequential scan over a large number of rows. The cause was that one of the keys used in the update/insert was wrapped in a coalesce, while the corresponding index was defined without one, so the planner could not use it. Although this means I have to handle nulls properly before feeding data into my code, it improved the performance significantly. I am open to further improvements.
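For reference, a minimal sketch of the two ways to remove such a mismatch, using the hypothetical names from the question (an expression index is only usable when it matches the query's expressions exactly; table1_upsert_key_idx is a made-up index name):
-- Option 1: index the exact expressions used in the join
CREATE INDEX table1_upsert_key_idx ON table1
((id::uuid), coalesce(name, 'name'), coalesce(level, 'level'));
-- Option 2: replace nulls in the staging table up front,
-- then join on the plain columns so existing plain-column indexes apply
UPDATE table2 SET name = 'name' WHERE name IS NULL;
UPDATE table2 SET level = 'level' WHERE level IS NULL;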

Related

Postgres Force Query To Recreate Execution Plan

To the best of my knowledge, Postgres finalizes an execution plan for a query somewhere around its 1st~5th execution and then sticks to it.
My query (the table contains billions of rows and I have to pick the top n):
select col1, col2
from table_a
where col1='a'
and col3='b'
order by col1 desc
limit 5;
There is an existing index (ix_1) on (col1, col3) that the query is using.
After moving up to Postgres 12, I created a covering index (ix_2) as under:
(col1 desc, col3) include (col2)
Now I want the query to use ix_2 to make it an index-only scan, as col2 is included in ix_2, but the query still uses the old ix_1.
Since index-forcing hints also do not work in Postgres, is there any way in Postgres to force a query to recreate its execution plan, so that it may consider my new index (ix_2)?
I think your assumptions about what is going on are wrong. Creating a new index sends out an invalidation message, which should force all other sessions to re-plan the query even if they think they already know the best plan.
Most likely what is going on is that the planner just re-picks the old plan anyway, because it still thinks it will be fastest. An index-only scan is only beneficial if many of the table pages are marked as all-visible. If few of them are, then there isn't much gain. But the new index is probably larger, which will give it a (slightly) higher cost estimate. You should VACUUM the table to make sure the visibility map is current.
But really, if you just want to get it to use the index-only scan rather than do a root cause analysis, you can just drop the old index. There is no point in having both.
Also, I wouldn't bother with INCLUDE, unless col2 is of a type that doesn't define btree operators. Just throw it into the main body of the index like (col1 desc, col3, col2).
Finally, there is no point in ordering by a column whose values you have just forced to all be identical.
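Putting that advice together, a minimal sketch (ix_3 is a hypothetical name for the combined index; the VACUUM refreshes the visibility map so the index-only scan pays off):
DROP INDEX ix_1;
DROP INDEX ix_2;
CREATE INDEX ix_3 ON table_a (col1 DESC, col3, col2);
VACUUM (ANALYZE) table_a;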

Top 5 unique rows without DISTINCT from 7 million rows

Postgres 12 on GCP.
Table with approx 7 million rows and growing.
select Distinct col1, col2
from tab_a
where col3='abc'
and col4='xyz'
order by col2
limit 5;
With Distinct this query takes around 2.1 to 2.8 sec.
Without Distinct it takes 0.25 sec, but my table contains duplicate data as per the business requirement.
Is there any way I can get the top 5 unique records without the costly Distinct clause?
I can do the following, but it is not a robust solution:
select Distinct
col1, col2
from (
select col1, col2
from tab_a
where col3='abc'
and col4='xyz'
order by col2
limit 50
) sub
limit 5;
Can someone guide me to a more robust solution, please?
SQL, especially for large tables, relies on indexes to run queries efficiently. You didn't tell us anything about your indexes, so there's some guessing in this answer.
Creating the following index should help your first query a lot. There's nothing wrong with that first query: if it meets your business requirements, go ahead and use it.
CREATE INDEX CONCURRENTLY tab_a_distinct
ON tab_a USING BTREE
(col3, col4, col2, col1);
Why will this help?
BTREE indexes work as if they were sorted in the order of the items in the index.
You match for equality on col3 and col4, so PostgreSQL can random-access the BTREE index to the first matching row and then scan the index sequentially until it finds no more matching rows.
You want your output ordered by col2, so that column is next. PostgreSQL will scan the index until it has the five rows you need, then stop.
You want DISTINCT values of col1 and col2. PostgreSQL can get them from the index.
In other words, this is a covering composite index for your query.
Read https://use-the-index-luke.com/ to learn more.
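If you want to verify the plan (a sketch using the names and literals from the question), EXPLAIN should show an Index Only Scan over the new index once the table has been vacuumed:
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT col1, col2
FROM tab_a
WHERE col3 = 'abc'
  AND col4 = 'xyz'
ORDER BY col2
LIMIT 5;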

How does Postgres choose which index to use when multiple indexes are present?

I am new to Postgres and a bit confused about how Postgres decides which index to use when I have more than one btree index defined, as below.
CREATE INDEX index_1 ON sample_table USING btree (col1, col2, COALESCE(col3, 'col3'::text));
CREATE INDEX index_2 ON sample_table USING btree (col1, COALESCE(col3, 'col3'::text));
I am using col1, col2, COALESCE(col3, 'col3'::text) in my join condition when I write to sample_table (from source tables), but when I do an explain analyze to get the query plan, I see that it sometimes uses index_2 for the scan rather than index_1, and sometimes just goes with a sequential scan. I want to understand what can make Postgres use one index over another.
Without seeing EXPLAIN (ANALYZE, BUFFERS) output, I can only give a generic answer.
PostgreSQL considers all execution plans that are feasible and estimates the row count and cost for each node. Then it takes the plan with the lowest cost estimate.
It could be that the condition on col2 is sometimes more selective and sometimes less, for example because you sometimes compare it to rare and sometimes to frequent values. If the condition involving col2 is not selective, it does not matter much which of the two indexes is used. In that case, PostgreSQL prefers the smaller two-column index.
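One way to see the size difference the planner is weighing (a small sketch, assuming the index names from the question):
SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM pg_class
WHERE relname IN ('index_1', 'index_2');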

Can a heavily indexed table have slower updates even if the updated columns aren't in any of the indexes?

I'm trying to understand why a 14 million row table is so slow to update, even though I'm joining on its primary key and updating in batches (5,000 rows).
THIS IS THE QUERY
UPDATE A
SET COL1= B.COL1,
COL2 = B.COL2,
COL3 = 'ALWAYS THE SAME VAL'
FROM TABLE_X A, TABLE_Y B
WHERE A.PK = B.PK
TABLE_X has 14 Million rows
TABLE_X has 12 indexes; however, the updated columns do not belong to any of them, so it's not expected that this slowness is caused by having so many indexes, right?
TABLE_Y has 5000 rows
ADDITIONAL INFORMATION
I must update in the order of another column (Group) rather than the PK. If I could update in the order of the PK, it would be way faster.
This is a business need: if they need to stop the process, they want groups to be either fully updated or not updated at all.
What could be causing such slow updates?
Database is SYBASE 15.7

Way to reduce the cost in DB2 for count(*)

Hi, I have a DB2 query as below:
select count(*) as count from
table_a a,
table_b b,
table_c c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
For the above query I even tried count(1), but when I took the access plan, the cost is the same.
select count(1) as count from
table_a a,
table_b b,
table_c c
where
b.xxx=234 AND
b.yyy=c.wedf
Result set:
Count
618543562
Is there any other way to reduce the cost?
PS: b.xxx, b.yyy, and c.wedf are indexed.
Thanks in advance.
I think one of the problems is the statistics on the tables. Did you execute RUNSTATS? Probably the data distribution or the quantity of rows that has to be read is large, and DB2 concludes that it is better to read the whole table instead of processing an index and then fetching the rows from the table.
It seems that both queries take the same access plan, and I think they are doing table scans.
Are the three columns part of the same index, or are they indexed separately? If they are part of different indexes, is there any index ANDing between indexes in the access plan? If there is no ANDing of different indexes, the columns have to be read from the table in order to process the predicates.
The reason count(1) and count(*) give the same cost is that both have to do a table scan.
Please take a look at the access plan: not only the cost in timerons, but also the steps. Does the access plan use the indexes? How many sorts does it execute?
Try changing the optimization level, and you will see that the access plans change. I think you are executing with the default one (5).
If you want to force the query to take an index into account, you can create an optimization profile.
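For example, refreshing the statistics the optimizer relies on might look like this (a sketch; MYSCHEMA stands in for your actual schema name):
RUNSTATS ON TABLE MYSCHEMA.TABLE_B WITH DISTRIBUTION AND DETAILED INDEXES ALL;
RUNSTATS ON TABLE MYSCHEMA.TABLE_C WITH DISTRIBUTION AND DETAILED INDEXES ALL;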
What is the relation between the (B, C) tables and table A? In your query you just use a CROSS JOIN between A and (B, C). So that is the MAIN performance issue.
If you really need this count, just multiply the counts for A and (B, C):
select
(select count(*) from a)
*
(select count(*) from b, c where b.xxx=234 AND b.yyy=c.wedf )
For DB2, use this:
select a1.cnt1 *
(select count(*) as cnt2 from b, c where b.xxx=234 AND b.yyy=c.wedf)
from
(select count(*) as cnt1 from a) a1