update slow for large rows but ok in chunks - tsql

So i am trying to update say 1 million rows in a 10 million row table.
I have an index on Y.
One mass update seemingly never finishes. I am thinking this is due to maintaining the data in tempdb incase the transaction needs to rollback, and the tempdb might not be big enough and needs to page to disk etc. Whereas the chunks loop only needs to do this for 50k rows at a time.
Just after others thoughts, correct me if I am wrong, I am from an Oracle background so not entirely sure what the rollback segment equivalent is
update X
set Y = 0
where Y is null
but doing in chunks of 50000 finishes in a minute
declare #i int=1
while #i > 0
begin
update top (50000) X
set Y = 0
where Y is null
set #i = ##ROWCOUNT
end

This might be a case where the database is under such heavy load (or the server is so underpowered) that the set based operation is to heavy and simply cant complete because it cant get a lock or requires to much resources all at once.
By design a cursor works on a smaller subset of data, this in theory should be slower but might not be in your particular case.
From here: https://stackoverflow.com/a/10013070/937131
A cursor can save you time - because you don't need to wait for the
processing and download of your complete recordset It will save you
memory, both on the server and on the client because they don't have
to dedicate a big chunk of memory to resultsets
Consistency: Using a cursor, you do (usually) not operate on a
consistent snapshot of the data, but on a row. So your
concurrency/consistency/isolation guarantees drop from the whole
database (ACID) to only one row. You can usually inform your DBMS what
level of concurrency you want, but if you are too nitpicky (locking
the complete table you are in), you will throw away many of the
resource savings on the server side.

Related

What about expected performance in Pentaho?

I am using Pentaho to create ETL's and I am very focused on performance. I develop an ETL process that copy 163.000.000 rows from Sql server 2088 to PostgreSQL and it takes 17h.
I do not know how good or bad is this performance. Do you know how to measure if the time that takes some process is good? At least as a reference to know if I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of ETL process it load 2M rows. I calculate how long will take to load all the rows. The expected result is 6 hours, but then the performance decrease and it takes 17h.
I have been investigating in goole and I do not find any time references neither any explanations about performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours, this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (CREATE TABLE dummy LIKE table)
No indexes, No constraints (I think it is the default, double check the created table)
BEFORE INSERT trigger on it which returns NULL and drop the row.
The rows will be processed, just not inserted.
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These do not have any crash-resistance because they don't use the journal, but for importing data it's OK.... if it crashes during the insert you're gonna wipe it out and restart anyway.
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
With the table loaded with data, add indexes and constraints one by one, find out if you got, say, a CHECK that uses a slow SQL function, or a FK into a table which has no index, that kind of stuff. Just check how long it takes to create the constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy due to the large volume of data, due to too-low checkpointing settings in the config. Or some issue like that, probably random IO writes related. You put the WAL on a fast SSD, right?
17H is too much. Far too much. For 200 Million rows, 6 hours is even a lot.
Hints for optimization:
Check the memory size: edit the spoon.bat, find the line containing -Xmx and change it to half your machine memory size. Details varies with java version. Example for PDI V7.1.
Check if the query from the source database is not too long (because too complex, or server memory size, or ?).
Check the target commit size (try 25000 for PostgresSQL), the Use batch update for inserts in on, and also that the index and constraints are disabled.
Play with the Enable lazy conversion in the Table input. Warning, you may produce difficult to identify and debug errors due to data casting.
In the transformation property you can tune the Nr of rows in rowset (click anywhere, select Property, then the tab Miscelaneous). On the same tab check the transformation is NOT transactional.

Postgres: Do we always need at least 3-4 times free the space of the biggest table?

we are using Postgres to store ~ 2.000.000.000 samples. This ends up in tables with ~ 500 mio entries and ~100GB Size each table.
What I want to do:
E.g. update the table entries: UPDATE table SET flag = true;
After this, the table is twice as big, i.e. 200GB
To get the space (stored on a SSD) back we: "VACCUM FULL table"
Unfortunately, this step needs again loads of space which results in the Vacuum to fail due to too little space left.
My Questions:
Does this mean, that, in order to make this UPDATE query only once and to get the space back for other tables in this DB we need at least 300-400GB space for a 100GB table?
In your scenario, you won't get away without having at least twice as much space as the table data would require.
The cheapest solution is probably to define the table with a fillfactor of 50 so that half of each block is left empty, thereby doubling the table size. Then the updated rows can all be in the same block as the original rows, and the UPDATE won't increase the table size because PostgreSQL can use the heap only tuple (HOT) update feature. The old versions will be freed immediately if there are no long running transactions that can still see them.
NOTE: This will only work if the colum you are updating is not indexed.
The downside of this approach is that the table is always twice the necessary size, and all sequential scans will take twice as long. It won't bother you if you don't use sequential scans of the table.

postgres 9.2 table size with pg_total_relation_size

I process a table with ~10^7 rows the following way: take last N rows, update them in some way, and delete, then vacuum table. In the end I make a query for pg_total_relation_size. Loop repeats until the table is over. Each iteration last for several seconds. There are no any other queries for this table except mentioned above. The problem is that I get the same results for table size. It changes about once a several hours.
So the question is -- does postgres store somewhere the table size or does it calculate it every time the function is invoked? I.e., does my table size really stays the same in spite of its processing?
Your table really does stay the same size on disk despite the DELETEs and VACUUMing you're doing. As per the documentation on VACUUM, ordinary VACUUM only releases space back to the OS if it can do so by truncating free space from the end of the file without rearranging live rows.
The space is still "free" in that PostgreSQL can re-use it for other new rows. It is much, much faster to re-use space that PostgreSQL hasn't given back to the OS than it is to extend a relation with new space, so this is often preferable.
The other reason Pg doesn't just give this space back is that it can only give space back to the OS when it's a contiguous chunk with no visible rows until the end of the file. This doesn't happen much so in practice Pg needs to move some rows around to compact the table and allow it to free space at the end, kind of like a defrag on a file system. This is an inefficient and slow process that can counter-intuitively make the table slower to access instead of faster, so it's not always a good idea.
If you have a relation that's mostly but not entirely empty it can be worth doing a VACUUM FULL (Pg 9.0 and above) or CLUSTER (all versions) to free the space. If you expect to refill the table this is usually counter-productive; it's actually better to leave it as-is.
(For what I mean by terms like "live" and "visible" see the documentation on MVCC which will help you understand Pg's table organisation.)
Personally I'd skip the manual VACUUM in your case. Turn autovacuum up if you need to. If you really need to you could consider partitioning your table, processing it partition by partition and TRUNCATE each partition when you're done processing it.

DB Trigger to limit maximum table size in Postgres

Is it possible, perhaps using DB-triggers to set a maximum table-size in a postgres DB?
For example, say I have a table called: Comments.
From the user perspective, this can be done as frequently as possible, but say I only want to store the 100 most recent comments in the DB. So what I want to do is have a trigger that automatically maintains this. I.e. when more than 100 comments are there, it deletes the oldest one, etc.
Could someone help me with writing such a trigger?
I think a trigger is the wrong tool for the job; although it is possible to implement this. Something about spawning a "delete" from an executing insert makes the hair on my neck neck stand up. You will generate a lot of locking and possibly contention that way; and inserts should generally not generate locks.
To me this says "stored procedure" all the way.
But I also think you should ask yourself, "why delete" old comments? Deletes are an anathema. Better just limit them when you display them. If you are really worried about the size of the table, use a TEXT column. Postgres will maintain these in a shadow table and full scans of the original table will blaze along just fine.
Limiting to 100 comments per user is rather simple, e.g.
delete from comments where user_id = new.user_id
order by comment_date desc offset 100;
Limiting the byte size is trickier. You'd need to calculate the relevant row sizes and that won't account for index sizes, dead rows, etc. At best you'd use the admin functions to get the table size but these won't yield the size per user, only the total size.
We could in theory create a table of 100 dummy records and then simply overwrite them with the actual comments. Once we pass the 100th we will overwrite the 1st one, etc.
This way we are suppose to keep the same size of the table, but that is not possible, because an update is equivalent to delete,insert in Postgresql. So the size of the table will continue to grow.
So if the objective is not to overflow the disk drive then once the disk is full at 80% a "vacuum full" should be performed to free up disk space. "Vacuum full" requires disk space by itself. If you kept the records to a fixed number then there will be an effect of the vacuum. Also there seems to be cases where vacuum can fail.

PostgreSQL - Why are some queries on large datasets so incredibly slow

I have two types of queries I run often on two large datasets. They run much slower than I would expect them to.
The first type is a sequential scan updating all records:
Update rcra_sites Set street = regexp_replace(street,'/','','i')
rcra_sites has 700,000 records. It takes 22 minutes from pgAdmin! I wrote a vb.net function that loops through each record and sends an update query for each record (yes, 700,000 update queries!) and it runs in less than half the time. Hmmm....
The second type is a simple update with a relation and then a sequential scan:
Update rcra_sites as sites
Set violations='No'
From narcra_monitoring as v
Where sites.agencyid=v.agencyid and v.found_violation_flag='N'
narcra_monitoring has 1,700,000 records. This takes 8 minutes. The query planner refuses to use my indexes. The query runs much faster if I start with a set enable_seqscan = false;. I would prefer if the query planner would do its job.
I have appropriate indexes, I have vacuumed and analyzed. I optimized my shared_buffers and effective_cache_size best I know to use more memory since I have 4GB. My hardware is pretty darn good. I am running v8.4 on Windows 7.
Is PostgreSQL just this slow? Or am I still missing something?
Possibly try reducing your random_page_cost (default: 4) compared to seq_page_cost: this will reduce the planner's preference for seq scans by making random-accesses driven by indices more attractive.
Another thing to bear in mind is that MVCC means that updating a row is fairly expensive. In particular, updating every row in a table requires doubling the amount of storage for the table, until it can be vacuumed. So in your first query, you may want to qualify your update:
UPDATE rcra_sites Set street = regexp_replace(street,'/','','i')
where street ~ '/'
(afaik postgresql doesn't automatically suppress the update if it looks like you're not actually updating anything. Istr there was a standard trigger function added in 8.4 (?) to allow you to do that, but it's perhaps better to address it in the client side)
When a row is updated, a new row version is written.
If the new row does not fit in the same disk block, then every index entry pointing to the old row needs to be updated to point to the new row.
It is not just indexes on the updated data that need updating.
If you have a lot of indexes on rcra_sites, and only one or two frequently updated fields, then you might gain by separating the frequently updated fields into a table of their own.
You can also reduce the fillfactor percentage below its default of 100, so that some of the updates can result in new rows being written to the same block, resulting in the indexes pointing to that block not needing to be updated.