I have a table in a Redshift cluster with 5 billion rows. I have a job that tries to update some column values based on some filter. Updating anything at all in this table is incredibly slow. Here's an example:
Update tbl1
set price=tbl2.price, flag=true
from tbl2 join tbl1 on tbl1.id=tbl2.id
where tbl1.time between (some value) and
tbl2.createtime between (some value)
I have a sort key on time and a dist key on id. When I checked the stl_scan table, it showed that my query was scanning 50 million rows on each slice and returning only 50K rows per slice. I stopped the query after 20 minutes.
For testing, I created the same table with 1 billion rows, and the same update query took 3 minutes.
When I run a select with the same conditions, I get the results in a few seconds. Is there anything I am doing wrong?
I believe the correct syntax is:
Update tbl1
set price = tbl2.price,
flag = true
from tbl2
where tbl1.id = tbl2.id and
tbl1.time between (some value) and
tbl2.createtime between (some value);
Note that tbl1 is only mentioned once, in the UPDATE clause. There is no join, just a correlation clause. Mentioning tbl1 again in the FROM clause creates an unintended self-join, which is the likely reason for the huge scans you observed.
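To verify the rewrite, you can compare rows scanned versus rows returned per slice, as you did before. A minimal sketch against the system tables, assuming it runs right after the update in the same session:
SELECT slice, perm_table_name, SUM(rows) AS rows_scanned
FROM stl_scan
WHERE query = pg_last_query_id()
GROUP BY slice, perm_table_name
ORDER BY slice;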
I have seen some strange behavior in PostgreSQL (PostGIS).
I have two tables in PostGIS with geometry columns. One table is a grid and the other one is lines. I want to delete all grid cells that no line passes through.
In other words, I want to delete a row from the first table when it has no spatial intersection with any row of the second table.
First, in a subquery, I find the ids of the rows that have an intersection. Then, I delete any row whose id is not in that returned list of ids.
DELETE FROM base_grid_916453354
WHERE id NOT IN
(
SELECT DISTINCT bg.id
FROM base_grid_916453354 bg, (SELECT * FROM tracks_heatmap_1000
LIMIT 100000) tr
WHERE bg.geom && tr.buffer
);
The following subquery returns in only 12 seconds:
SELECT DISTINCT bg.id
FROM base_grid_916453354 bg,
     (SELECT * FROM tracks_heatmap_1000 LIMIT 100000) tr
WHERE bg.geom && tr.buffer
while the whole query did not finish even after 1 hour!
I ran the EXPLAIN command and this is its result, but I cannot interpret it:
How can I improve this query, and why does deleting from the returned list take so much time?
It is very strange, because the subquery is a spatial query between two tables of 9 million and 100k rows, while the delete part is just checking a list and deleting! In my mind, the delete part should be much, much easier.
Don't post text as images of text!
Increase work_mem until the subplan becomes a hashed subplan.
Or rewrite it to use NOT EXISTS rather than NOT IN.
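A quick way to test the first suggestion (the 256MB figure is just an illustrative starting point):
SET work_mem = '256MB';  -- session-local setting
EXPLAIN
DELETE FROM base_grid_916453354
WHERE id NOT IN
(
    SELECT DISTINCT bg.id
    FROM base_grid_916453354 bg,
         (SELECT * FROM tracks_heatmap_1000 LIMIT 100000) tr
    WHERE bg.geom && tr.buffer
);
-- the filter in the plan should now say "hashed SubPlan" instead of plain "SubPlan"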
I found a fast way to do this query. As @jjanes suggested, I used NOT EXISTS:
DELETE FROM base_grid_916453354 bg
WHERE NOT EXISTS
(
SELECT 1
FROM tracks_heatmap_1000 tr
WHERE bg.geom && tr.buffer
);
This query takes around 1 minute, which is acceptable for the size of my tables.
I am new to PostgreSQL and would like to know how to set a column in a table to its sorted version. For example:
(table: t1, column: points)
5
7
3
9
8
4
(table: t1, column: points) // please note it is sorted
3
4
5
7
8
9
My incorrect version:
UPDATE outputTable SET points_count = (SELECT points_count FROM outputTable ORDER BY points_count ASC)
Try this:
UPDATE outputTable
SET points_count = s.points_count
FROM (SELECT ctid, row_number() OVER (ORDER BY ctid) AS rn
      FROM outputTable) o
JOIN (SELECT points_count, row_number() OVER (ORDER BY points_count) AS rn
      FROM outputTable) s ON o.rn = s.rn
WHERE outputTable.ctid = o.ctid;
As you are updating the table from itself, you need a row-level identifier like ctid to pair each physical row with the value it should receive; simply ordering the subquery and matching each row back on its own ctid would assign every row its original value.
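To see what the UPDATE will do before running it, the same pairing can be inspected with a plain SELECT:
SELECT o.ctid, o.points_count AS old_value, s.points_count AS new_value
FROM (SELECT ctid, points_count,
             row_number() OVER (ORDER BY ctid) AS rn
      FROM outputTable) o
JOIN (SELECT points_count,
             row_number() OVER (ORDER BY points_count) AS rn
      FROM outputTable) s USING (rn);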
It seems like you want to sort the rows in a table.
Now this is normally a pointless exercise, since tables have no fixed order. In fact, every UPDATE will change the order of rows in a PostgreSQL table.
The only way you can get a certain order in the result rows of a query is by using the ORDER BY clause, which will sort the rows regardless of their physical order in the table (which is not dependable, as mentioned above).
There is one use case for physically reordering a table: an index range scan using an index on points_count will be much more efficient if the table is physically sorted like the index. The reason is that far fewer table blocks will be accessed.
Therefore, there is a way to rewrite the table in a certain order as long as you have an index on the column:
CREATE INDEX points_count_idx ON outputtable (points_count);
CLUSTER outputtable USING points_count_idx;
But – as I said above – unless you plan a range scan on that index, the exercise is pointless.
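For illustration, this is the kind of query that profits from the physical ordering (the range bounds are made up):
SELECT *
FROM outputtable
WHERE points_count BETWEEN 10 AND 20;
-- after CLUSTER, the matching rows sit in adjacent blocks, so the index
-- range scan reads far fewer pages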
I have 3 tables:
table1:{id, uid}
table2:{id, uid}
table1_table2:{table1_id, table2_id}
I need to execute the following queries:
SELECT 1 FROM table1_table2
LEFT JOIN table1 ON table1.id = table1_table2.table1_id
LEFT JOIN table2 ON table2.id = table1_table2.table2_id
WHERE table1.uid = ? and table2.uid = ?
I have unique indices on the UUID columns, so I expected the search to be fast. With an almost empty database, the select takes 0 ms; with 50,000 records in table1, 100 records in table2 and 110,000 records in table1_table2, the select takes 10 ms, which is a lot, because I have to make 400,000 queries. Can I get O(1) on the select?
I'm currently using Hibernate (Spring Data) and Postgres.
You have unique indices, but have you updated statistics with ANALYZE as well?
What type is used for the uid column, and what type are you feeding it with from Java?
Is there any difference when you run it from Hibernate/Java and from the Postgres console?
Run the query with "EXPLAIN", get the execution plan - from Java as well as from Postgres console, and observe any differences. See How to get query plan information from Postgres into JDBC
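For reference, a console version of that check could look like this (the UUID literals are placeholders for your parameters):
EXPLAIN (ANALYZE, BUFFERS)
SELECT 1
FROM table1_table2
LEFT JOIN table1 ON table1.id = table1_table2.table1_id
LEFT JOIN table2 ON table2.id = table1_table2.table2_id
WHERE table1.uid = '00000000-0000-0000-0000-000000000001'
  AND table2.uid = '00000000-0000-0000-0000-000000000002';
-- if this shows sequential scans instead of index scans on the unique
-- indices, compare the parameter types Hibernate sends with uuid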
I'm trying to understand why a 14 million row table is so slow to update, even though I'm joining on its primary key and updating in batches (5000 rows).
THIS IS THE QUERY:
UPDATE A
SET COL1= B.COL1,
COL2 = B.COL2,
COL3 = 'ALWAYS THE SAME VAL'
FROM TABLE_X A, TABLE_Y B
WHERE A.PK = B.PK
TABLE_X has 14 million rows.
TABLE_X has 12 indexes; however, the updated columns do not belong to any index, so it's not expected that this slowness is caused by having so many indexes, right?
TABLE_Y has 5000 rows
ADDITIONAL INFORMATION
I must update in the order of another column (Group) rather than the PK. If I could update in PK order, it would be much faster.
This is a business need: if the process has to be stopped, they want each group to be either fully updated or not updated at all. (A sketch of this batching pattern is below.)
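For illustration only, the per-group loop I am describing looks roughly like this (GroupCol and the loop bookkeeping are placeholders, not my real schema):
DECLARE @grp INT
SELECT @grp = MIN(GroupCol) FROM TABLE_X
WHILE @grp IS NOT NULL
BEGIN
    BEGIN TRAN
    -- same update as above, restricted to one group, committed atomically
    UPDATE A
    SET COL1 = B.COL1,
        COL2 = B.COL2,
        COL3 = 'ALWAYS THE SAME VAL'
    FROM TABLE_X A, TABLE_Y B
    WHERE A.PK = B.PK
      AND A.GroupCol = @grp
    COMMIT TRAN
    SELECT @grp = MIN(GroupCol) FROM TABLE_X WHERE GroupCol > @grp
END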
What could be causing such slow updates?
Database is SYBASE 15.7
I am currently trying to join two tables, where both tables have very many distinct values in the columns I am joining on.
Here's the T-SQL:
SELECT AVG(Position) AS Position
FROM MonitoringGsc_Keywords AS sk
JOIN GSC_RankingData ON sk.Id = GSC_RankingData.KeywordId
GROUP BY sk.Id
The execution plan shows me that the join takes a lot of time. I think it is because a huge set of values from the first table has to be compared with a huge set of values in the second table.
MonitoringGsc_Keywords.Id has about 60,000 distinct values.
GSC_RankingData has about 100,000,000 values.
MonitoringGsc_Keywords.Id is the primary key of MonitoringGsc_Keywords; GSC_RankingData.KeywordId is indexed.
So, what can I do to increase performance?
Is the Position column from the GSC_RankingData table? If yes, then the JOIN is redundant and the query should look like this:
SELECT AVG(rd.Position) as Position
FROM GSC_RankingData rd
GROUP BY rd.KeywordId;
If the Position column is in the GSC_RankingData table, then the index on GSC_RankingData should include this column, like so:
CREATE INDEX IX_GSC_RankingData_KeywordId_Position ON GSC_RankingData(KeywordId) INCLUDE(Position);
You should also check index fragmentation for these tables; to do this you can use this query:
SELECT * FROM sys.dm_db_index_physical_stats(db_id(), object_id('MonitoringGsc_Keywords'), null, null, 'DETAILED')
If avg_fragmentation_in_percent is above 5% and below 30%, then:
ALTER INDEX [index name] ON [table name] REORGANIZE;
If avg_fragmentation_in_percent is 30% or higher, then:
ALTER INDEX [index name] ON [table name] REBUILD;
It could also be a problem with statistics; you can check them with this query:
SELECT
sp.stats_id, name, filter_definition, last_updated, rows, rows_sampled,
steps, unfiltered_rows, modification_counter
FROM sys.stats AS stat
CROSS APPLY sys.dm_db_stats_properties(stat.object_id, stat.stats_id) AS sp
WHERE stat.object_id = object_id('GSC_RankingData');
Check the last update date and the row counts; if they are not current, update the statistics. It is also possible that the statistics do not exist, in which case you must create them.
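For example (the statistic name below is illustrative):
UPDATE STATISTICS GSC_RankingData WITH FULLSCAN;  -- refresh, scanning all rows instead of sampling
CREATE STATISTICS ST_GSC_RankingData_KeywordId
ON GSC_RankingData (KeywordId);  -- create missing statistics on the join column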