Can a heavily indexed table have slower updates even if the updated columns aren't in any of the indexes? - database-performance

I'm trying to understand why a 14 million row table is so slow to update, even though I'm joining on its primary key and updating in batches (5,000 rows).
THIS IS THE QUERY:
UPDATE A
SET COL1= B.COL1,
COL2 = B.COL2,
COL3 = 'ALWAYS THE SAME VAL'
FROM TABLE_X A, TABLE_Y B
WHERE A.PK = B.PK
TABLE_X has 14 Million rows
TABLE_X has 12 indexes; however, the updated columns do not belong to any index, so this slowness shouldn't be caused by having so many indexes, right?
TABLE_Y has 5000 rows
ADDITIONAL INFORMATION
I must update in the order of another column (Group) rather than the PK. If I could update in PK order, it would be much faster.
This is a business need: if the process has to be stopped, they want each group to be either fully updated or not updated at all.
What could be causing such slow updates?
Database is SYBASE 15.7
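For reference, a minimal sketch of the per-group, all-or-nothing batching described above, in Transact-SQL. This is an assumption-laden illustration, not the original job: GROUP_ID stands in for the "Group" column, and the loop structure is hypothetical.

declare @grp int
select @grp = min(GROUP_ID) from TABLE_Y    -- GROUP_ID is a hypothetical name for the "Group" column

while @grp is not null
begin
    -- one transaction per group, so an interrupted run leaves whole groups untouched
    begin transaction

    update A
    set    COL1 = B.COL1,
           COL2 = B.COL2,
           COL3 = 'ALWAYS THE SAME VAL'
    from   TABLE_X A, TABLE_Y B
    where  A.PK = B.PK
      and  B.GROUP_ID = @grp

    commit transaction

    -- move on to the next group
    select @grp = min(GROUP_ID) from TABLE_Y where GROUP_ID > @grp
end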

Related

How to sort a column in PostgreSQL and set it to sorted values

I am new to PostgreSQL and would like to know how to set a column in a table to its sorted version. For example:
(table: t1, column: points)
5
7
3
9
8
4
(table: t1, column: points) // please note it is sorted
3
4
5
7
8
9
My incorrect version:
UPDATE outputTable SET points_count = (SELECT points_count FROM outputTable ORDER BY points_count ASC)
Try with this:
UPDATE outputTable
SET points_count = s.points_count
FROM (SELECT points_count, ctid FROM outputTable ORDER BY points_count ASC) s
WHERE outputTable.ctid = s.ctid;
As you are updating the same table with reference to itself, you need a row-level equality criterion such as ctid to match each row.
It seems like you want to sort the rows in a table.
Now this is normally a pointless exercise, since tables have no fixed order. In fact, any UPDATE can change the physical order of rows in a PostgreSQL table.
The only way you can get a certain order in the result rows of a query is by using the ORDER BY clause, which will sort the rows regardless of their physical order in the table (which is not dependable, as mentioned above).
There is one use case for physically reordering a table: an index range scan using an index on points_count will be much more efficient if the table is physically sorted like the index. The reason is that far fewer table blocks will be accessed.
Therefore, there is a way to rewrite the table in a certain order as long as you have an index on the column:
CREATE INDEX points_count_idx ON outputtable (points_count);
CLUSTER outputtable USING points_count_idx;
But – as I said above – unless you plan a range scan on that index, the exercise is pointless.
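For illustration, the kind of query that benefits is a range scan such as the one below (the bounds are made up); EXPLAIN (ANALYZE, BUFFERS) shows how many table blocks are read, which should drop after CLUSTER:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM outputtable
WHERE points_count BETWEEN 100 AND 200;   -- hypothetical range; after CLUSTER the matching rows sit in far fewer heap blocks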

Poor performance while upserting data to Postgres

I am submitting 3 million records to Postgres table1 from a staging table table2. My update and insert queries are below:
UPDATE table1 t set
col1 = stage.col1,
col2 = stage.col2 ,
col3 = stage.col3
from table2 stage
where t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level');
INSERT INTO table1
(col1,
col2,
col3,
col4,
id,
name,
level)
select
stage.col1,
stage.col2,
stage.col3,
stage.col4,
stage.id,
stage.name,
stage.level
from table2 stage
where NOT EXISTS (select 1
from table1 t where
t.id::uuid = stage.id::uuid
and coalesce(t.name,'name') = coalesce(stage.name,'name')
and coalesce(t.level,'level') = coalesce(stage.level,'level'));
I am facing performance issues (it takes about 1.5 hours) even though I am using exactly the same indexed keys (btree) as defined on the table. To test the cause, I created a replica of table1 without indexes and was able to load the entire data set in roughly 15-17 minutes, so I am inclined to think the indexes are killing the performance on the table, as there are so many of them (including some unused indexes which I cannot drop due to permission issues). I am looking for suggestions to improve/optimize my query, or perhaps some other strategy to upsert the data and reduce the load time. Any suggestion is appreciated.
Running EXPLAIN ANALYZE on the query helped me realize that it was never using the defined indexes on the target table and was doing a sequential scan over a large number of rows. The cause was that one of the keys used in the update/insert was wrapped in COALESCE in the query, while the corresponding index was defined without it. Removing the COALESCE means I have to handle NULLs properly before feeding data into my code, but it improved the performance significantly. I am open to further improvements.
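An alternative sketch, hedged: PostgreSQL also supports expression indexes, so an index matching the COALESCE/cast predicates could let the planner use an index scan without changing the query (assuming you have permission to create indexes; the index name and column types here are guesses):

-- hypothetical expression index matching the join predicates in the UPDATE/INSERT above
CREATE INDEX table1_upsert_idx
    ON table1 ((id::uuid), (coalesce(name, 'name')), (coalesce(level, 'level')));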

Delete from a table on the basis of indexed columns is taking forever

We have a table with three indexed columns, say:
column1 of type bigint
column2 of type timestamp without time zone
column3 of type timestamp without time zone
The table has more than 12 crore (120 million) records, and we are trying to delete all records older than current date - 45 days using the query below:
delete from tableA
where column2 <= '2019-04-15 00:00:00.00'
OR column3 <= '2019-04-15 00:00:00.00';
This executes forever and never completes.
Is there any way we can improve the performance of this query?
What I tried: dropping the indexes, running the same DELETE, and recreating the indexes. But this did not work; I was not able to delete the data even after dropping the indexes.
I do not want to change the query; I want Postgres configured through some property so that it is able to delete the records.
See Best way to delete millions of rows by ID for a good discussion of the issue.
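One commonly suggested workaround is to delete in bounded batches so each transaction stays small; a hedged sketch (ctid-based batching, arbitrary batch size), not taken verbatim from the linked discussion:

-- delete at most 10,000 matching rows per statement; repeat from the client until zero rows are deleted
DELETE FROM tableA
WHERE ctid IN (
    SELECT ctid
    FROM tableA
    WHERE column2 <= '2019-04-15 00:00:00.00'
       OR column3 <= '2019-04-15 00:00:00.00'
    LIMIT 10000
);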
12 crores == 120 million rows?
Deleting from a large indexed table is slow because every index has to be maintained for each deleted row. If you can select the rows you want to keep, use them to create a new table, and then drop the old one, the process is much faster. If you do this regularly, use table partitioning and detach a partition when required; the detached partition can then be dropped.
1) Check the logs; you may be blocked by locks or suffering from deadlocks.
2) Try creating a new table by selecting the data you need, then drop the old one and rename. Use all the columns in your index in the query. DROP TABLE is much faster than DELETE ... FROM:
CREATE TABLE new_table AS (
SELECT * FROM old_table WHERE
column1 >= 1 AND column2 >= current_date - 45 AND column3 >= current_date - 45);
DROP TABLE old_table;
ALTER TABLE new_table RENAME TO old_table;
CREATE INDEX ...
3) Create a new table using partitions based on date, with a partition for, say, 15, 30 or 45 days (if you regularly remove data that is 45 days old). See https://www.postgresql.org/docs/10/ddl-partitioning.html for details.
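A minimal sketch of what that could look like with declarative range partitioning (PostgreSQL 10+); the column list is taken from the question, but the monthly ranges and table names are assumptions:

CREATE TABLE tableA_part (
    column1 bigint,
    column2 timestamp without time zone,
    column3 timestamp without time zone
) PARTITION BY RANGE (column2);

CREATE TABLE tableA_2019_05 PARTITION OF tableA_part
    FOR VALUES FROM ('2019-05-01') TO ('2019-06-01');

-- expiring old data then becomes a cheap metadata operation instead of a huge DELETE:
DROP TABLE tableA_2019_05;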

PostgreSQL different index creation time for same datatype

I have a table with three columns A, B, C, all of type bytea.
There are around 180,000,000 rows in the table. A, B and C all have exactly 20 bytes of data; C sometimes contains NULLs.
When creating indexes for all columns with
CREATE INDEX index_A ON transactions USING hash (A);
CREATE INDEX index_B ON transactions USING hash (B);
CREATE INDEX index_C ON transactions USING hash (C);
index_A was created in around 10 minutes, while index_B and index_C had been running for over 10 hours when I aborted them. I ran every CREATE INDEX on its own, so no indexes were created in parallel. There are also no other queries running in the database.
When running
SELECT * FROM pg_stat_activity;
wait_event_type and wait_event are both NULL, state is active.
Why are the other two index creations taking so long, and can I do anything to speed them up?
Ensure the statistics on your table are up-to-date.
Then execute the following query:
SELECT attname, n_distinct, correlation
from pg_stats
where tablename = '<Your table name here>'
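If the statistics might be out of date, a hedged one-liner to refresh them first (table name taken from the question):

ANALYZE transactions;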
Basically, the database will have more work to create indexes when:
The number of distinct values gets higher.
The correlation (= are values in the field physically stored in order) is close to 0.
I suspect you will see field A is different in terms of distinct values and/or a higher correlation than the other 2 fields.
Edit: Basically, creating an index = a FULL SCAN of the table, creating entries in the index as you progress. With the stats you have shared below, that means:
Column A: it was detected as unique.
A single scan is enough, as the DB knows 1 record = 1 index entry.
Columns B & C: they were detected as having very few distinct values, and abs(correlation) is very low.
Each index entry takes an entire FULL SCAN of the table.
Note: the description is simplified to highlight the difference.
Solution 1:
Do not create indexes for B and C.
It might sound stupid, but in fact, as explained here, a small correlation means the indexes will probably not be used (an index is useful only when the matching entries are not scattered across all the table blocks).
Solution 2:
Order records on the disk.
The initialization would be something like this:
CREATE TABLE Transactions_order as SELECT * FROM Transactions;
TRUNCATE TABLE Transactions;
INSERT INTO Transactions SELECT * FROM Transactions_order ORDER BY B,C,A;
DROP TABLE Transactions_order;
The tricky part comes next: as records are inserted/updated/deleted, you need to keep track of the correlation and ensure it does not drop too much.
If you can't guarantee that, stick to solution 1.
Solution 3:
Create partitions and enjoy partition pruning.
There has been quite a lot of work on partitioning in PostgreSQL recently; it could be worth looking into.
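A hedged sketch of what hash partitioning could look like here (PostgreSQL 11+); the four-way split is arbitrary, and whether pruning actually helps depends on the query patterns:

-- columns taken from the question; partition on B so equality lookups (WHERE B = ...) prune to one partition
CREATE TABLE transactions_part (
    A bytea,
    B bytea,
    C bytea
) PARTITION BY HASH (B);

CREATE TABLE transactions_part_0 PARTITION OF transactions_part FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE transactions_part_1 PARTITION OF transactions_part FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE transactions_part_2 PARTITION OF transactions_part FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE transactions_part_3 PARTITION OF transactions_part FOR VALUES WITH (MODULUS 4, REMAINDER 3);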

Update with join condition is taking too long in redshift

I have a table in a Redshift cluster with 5 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
Update tbl1
set price=tbl2.price, flag=true
from tbl2 join tbl1 on tbl1.id=tbl2.id
where tbl1.time between (some value) and
tbl2.createtime between (some value)
I have a sort key on time and a dist key on id. When I checked the stl_scan table, it shows that my query is scanning 50 million rows on each slice and returning only 50K rows per slice. I stopped the query after 20 minutes.
For testing, I created the same table with 1 billion rows, and the same update query took 3 minutes.
When I run a SELECT with the same conditions I get the results in a few seconds. Is there anything I am doing wrong?
I believe the correct syntax is:
Update tbl1
set price = tbl2.price,
flag = true
from tbl2
where tbl1.id = tbl2.id and
tbl1.time between (some value) and
tbl2.createtime between (some value);
Note that tbl1 is only mentioned once, in the UPDATE clause. There is no join, just a correlation clause.