Is killing a "CLUSTER ON index" dangerous for database? - postgresql

The whole question is in the title:
if we kill a CLUSTER query on a 100 million row table, will it be dangerous for the database?
The query has been running for 2 hours now, and I need to access the table tomorrow morning (12h left, hopefully).
I thought it would be far quicker; my database is running on RAID SSDs and a dual-Xeon processor.
Thanks for your wise advice.
Sid

No, you can kill the CLUSTER operation without any risk. Until the operation is done, nothing has been changed in the original table and index files. From the manual:
When an index scan is used, a temporary copy of the table is created
that contains the table data in the index order. Temporary copies of
each index on the table are created as well. Therefore, you need free
space on disk at least equal to the sum of the table size and the
index sizes.
When a sequential scan and sort is used, a temporary sort file is also
created, so that the peak temporary space requirement is as much as
double the table size, plus the index sizes.
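If you prefer to cancel it from another session instead of interrupting the client, a hedged sketch (the pg_stat_activity column is called pid on 9.2 and later, procpid on older releases):

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'CLUSTER%';
-- pg_cancel_backend() stops the statement but keeps the session alive;
-- pg_terminate_backend(pid) is the heavier option that ends the whole backend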

As @Frank points out, it is perfectly fine to do so.
Assuming you want to run this query in the future and assuming you have the luxury of a service window and can afford some downtime, I'd tweak some settings to boost the performance a bit.
In your configuration:
turn off fsync, for higher throughput to the file system
Fsync refers to the fsync() system call: with fsync on, the database waits for writes to actually reach stable storage before continuing. Be aware that a crash while fsync is off can corrupt the whole cluster, so only do this inside the maintenance window and turn it back on afterwards.
maximize your maintenance_work_mem
It's OK to give it most of the available memory, as it will not be allocated during production hours. I don't know how big your table and the index you are working on are, but things will run faster when they can be held fully in main memory. A sketch of both settings follows below.
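A minimal sketch of both tweaks, assuming a maintenance window; the memory figure and the table/index names are placeholders, not recommendations:

-- in postgresql.conf (then reload, e.g. SELECT pg_reload_conf()):
--   fsync = off        -- only for the service window; turn it back on right after
-- in the session that runs the CLUSTER:
SET maintenance_work_mem = '8GB';        -- as much as you can spare
CLUSTER big_table USING big_table_pkey;  -- placeholder table and index names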

Related

How to determine how much "slack" is in a postgres database?

I've got a postgres database which I recently vacuumed. I understand that process marks space as available for future use, but for the most part does not return it to the OS.
I need to track how close I am to using up that available "slack space" so I can ensure the entire database does not start to grow again.
Is there a way to see how much empty space the database has inside it?
I'd prefer to just do a VACUUM FULL and monitor disk consumption, but I can't lock the table for a prolonged period, nor do I have the disk space.
Running version 13 on headless Ubuntu if that's important.
Just like internal free space is not given back to the OS, it also isn't shared between tables or other relations (like indexes). So having free space in one table isn't going to help if a different table is the one growing. You can use pg_freespacemap to get a fast approximate answer for each table, or pgstattuple for more detailed data.
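For example, both ship as contrib extensions with PostgreSQL (the table name is a placeholder):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- fast, approximate: free space recorded in the free space map
SELECT sum(avail) AS approx_free_bytes
FROM pg_freespace('my_table');

-- slower but exact: scans the whole table
SELECT free_space, dead_tuple_len
FROM pgstattuple('my_table');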

What about expected performance in Pentaho?

I am using Pentaho to create ETLs and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL, and it takes 17h.
I do not know how good or bad this performance is. Do you know how to measure whether the time a process takes is reasonable? At least as a reference, to know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of the ETL process it loads 2M rows. Extrapolating from that rate, the whole load should take about 6 hours, but then performance decreases and it ends up taking 17h.
I have been searching on Google and I cannot find any time references or explanations about expected performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours; this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exact same columns (CREATE TABLE dummy (LIKE your_table))
No indexes, No constraints (I think it is the default, double check the created table)
A BEFORE INSERT trigger on it which returns NULL and drops the row.
The rows will be processed, just not inserted (a minimal sketch follows this list).
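A minimal sketch of that throwaway setup; target_table, dummy and the trigger names are placeholders:

CREATE TABLE dummy (LIKE target_table);   -- same columns, no indexes, no constraints

CREATE FUNCTION discard_row() RETURNS trigger AS $$
BEGIN
    RETURN NULL;   -- returning NULL from a BEFORE trigger silently drops the row
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER drop_everything
BEFORE INSERT ON dummy
FOR EACH ROW EXECUTE PROCEDURE discard_row();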
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These do not have any crash-resistance because they don't write to the WAL, but for importing data that's OK... if it crashes during the insert you're going to wipe it out and restart anyway (see the one-liner below).
Still No indexes, No constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
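The unlogged variant of the test is a one-liner (again a placeholder table name); point the ETL job at it and compare timings:

CREATE UNLOGGED TABLE dummy_unlogged (LIKE target_table);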
With the table loaded with data, add indexes and constraints one by one, find out if you got, say, a CHECK that uses a slow SQL function, or a FK into a table which has no index, that kind of stuff. Just check how long it takes to create the constraint.
Note: on an import like this you would normally add indices and constraints after the import.
My gut feeling is that PG is checkpointing like crazy because of the large volume of data combined with too-low checkpoint settings in the config. Or some issue like that, probably related to random write IO. You put the WAL on a fast SSD, right?
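If checkpoints are indeed the culprit, the knobs involved look roughly like this; the values are illustrative only, ALTER SYSTEM needs 9.4+, and max_wal_size exists from 9.5 on (older versions use checkpoint_segments instead):

ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();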
17h is too much. Far too much. For 200 million rows, even 6 hours is a lot.
Hints for optimization:
Check the memory size: edit spoon.bat, find the line containing -Xmx and change it to half your machine's memory size. Details vary with the Java version. Example for PDI V7.1.
Check if the query from the source database is not too long (because too complex, or server memory size, or ?).
Check the target commit size (try 25,000 for PostgreSQL), that Use batch update for inserts is on, and also that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors that are difficult to identify and debug, due to data casting.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

Heroku Postgres database size doesn't go down after deleting rows

I'm using a Dev level database on Heroku that was about 63GB and approaching about 9.9 million rows (close to the limit of 10 million for this tier). I ran a script that deleted about 5 million rows I didn't need, and now (a few days later) the Postgres control panel/pginfo:table-size shows roughly 4.7 million rows, but it's still at 63GB. 64GB is the limit for the next tier, so I need to reduce the size.
I've tried vacuuming but pginfo:bloat said the bloat was only about 3GB. Any idea what's happening here?
If you have VACUUMed the table, don't worry about the size on disk still remaining unchanged. The space has been marked as reusable for new rows. So you can easily add another 4.7 million rows and the size on disk won't grow.
The standard form of VACUUM removes dead row versions in tables and
indexes and marks the space available for future reuse. However, it
will not return the space to the operating system, except in the
special case where one or more pages at the end of a table become
entirely free and an exclusive table lock can be easily obtained. In
contrast, VACUUM FULL actively compacts tables by writing a complete
new version of the table file with no dead space. This minimizes the
size of the table, but can take a long time. It also requires extra
disk space for the new copy of the table, until the operation
completes.
If you want to shrink it on disk, you will need to run VACUUM FULL, which locks the table and needs as much extra space as the size of the table while the operation is in progress. So you will have to check your quota before you try this, and your site will be unresponsive in the meantime.
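A rough pre-flight check, with a placeholder table name:

SELECT pg_size_pretty(pg_total_relation_size('big_table'));  -- you need roughly this much free quota
VACUUM FULL big_table;   -- holds an ACCESS EXCLUSIVE lock for the whole duration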
Update:
You can get a rough idea about the size of your data on disk by querying the pg_class table like this (8192 is the default block size):
SELECT SUM(relpages) * 8192 FROM pg_class;
Another method is a query of this nature:
SELECT pg_database_size('yourdbname');
This link: https://www.postgresql.org/docs/9.5/static/disk-usage.html provides additional information on disk usage.
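If you want that figure broken down per table (including indexes and TOAST), something along these lines works:

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;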

When should one vacuum a database, and when analyze?

I just want to check that my understanding of these two things is correct. If it's relevant, I am using Postgres 9.4.
I believe that one should vacuum a database when looking to reclaim space from the filesystem, e.g. periodically after deleting tables or large numbers of rows.
I believe that one should analyse a database after creating new indexes, or (periodically) after adding or deleting large numbers of rows from a table, so that the query planner can make good calls.
Does that sound right?
vacuum analyze;
collects statistics and should be run as often as the data changes (especially after bulk inserts). It does not take exclusive locks on objects. It puts some load on the system, but it is worth it. It does not reduce the size of the table, but marks scattered freed-up space (e.g. deleted rows) for reuse.
vacuum full;
reorganises the table by creating a copy of it and switching to it. This vacuum requires additional disk space to run, but it reclaims all unused space in the object. It therefore requires an exclusive lock on the object (other sessions must wait for it to complete). It should be run when a lot of data has changed (deletes, updates) and when you can afford to make others wait.
Both are very important on a dynamic database.
Correct.
I would add that you can change the value of the default_statistics_target parameter (default 100) in the postgresql.conf file to a higher number, after which you should reload the configuration and run ANALYZE to obtain more accurate statistics.
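As a sketch; the target value is only an example, and ALTER SYSTEM needs 9.4+ (otherwise edit postgresql.conf directly):

ALTER SYSTEM SET default_statistics_target = 200;
SELECT pg_reload_conf();   -- a reload is enough, this parameter does not need a restart
ANALYZE;                   -- rebuild planner statistics with the new target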

postgres 9.2 table size with pg_total_relation_size

I process a table with ~10^7 rows the following way: take the last N rows, update them in some way, delete them, then vacuum the table. At the end I query pg_total_relation_size. The loop repeats until the whole table has been processed. Each iteration lasts several seconds. There are no other queries against this table except the ones mentioned above. The problem is that I keep getting the same result for the table size; it only changes about once every several hours.
So the question is: does Postgres store the table size somewhere, or does it calculate it every time the function is invoked? I.e., does my table size really stay the same in spite of its processing?
Your table really does stay the same size on disk despite the DELETEs and VACUUMing you're doing. As per the documentation on VACUUM, ordinary VACUUM only releases space back to the OS if it can do so by truncating free space from the end of the file without rearranging live rows.
The space is still "free" in that PostgreSQL can re-use it for other new rows. It is much, much faster to re-use space that PostgreSQL hasn't given back to the OS than it is to extend a relation with new space, so this is often preferable.
The other reason Pg doesn't just give this space back is that it can only give space back to the OS when it's a contiguous chunk with no visible rows until the end of the file. This doesn't happen much so in practice Pg needs to move some rows around to compact the table and allow it to free space at the end, kind of like a defrag on a file system. This is an inefficient and slow process that can counter-intuitively make the table slower to access instead of faster, so it's not always a good idea.
If you have a relation that's mostly but not entirely empty it can be worth doing a VACUUM FULL (Pg 9.0 and above) or CLUSTER (all versions) to free the space. If you expect to refill the table this is usually counter-productive; it's actually better to leave it as-is.
(For what I mean by terms like "live" and "visible" see the documentation on MVCC which will help you understand Pg's table organisation.)
Personally I'd skip the manual VACUUM in your case. Turn autovacuum up if you need to. If you really need to you could consider partitioning your table, processing it partition by partition and TRUNCATE each partition when you're done processing it.
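A sketch of that last approach using declarative partitioning (PG 10+; on older versions such as the 9.2 in the question you would build the same thing with inheritance; all names are placeholders):

CREATE TABLE work_queue (
    id       bigint,
    batch_no int,
    payload  text
) PARTITION BY LIST (batch_no);

CREATE TABLE work_queue_b1 PARTITION OF work_queue FOR VALUES IN (1);
CREATE TABLE work_queue_b2 PARTITION OF work_queue FOR VALUES IN (2);

-- process batch 1, then hand its disk space straight back to the OS:
TRUNCATE work_queue_b1;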