Place fast-growing table on another data file - postgresql

I am a postgres newbie and could not find the answer in the documentation or from google. Please help me out.
We have a fast-growing table whose content will be periodically off-loaded to a file, after which the table will be vacuumed and cleaned.
The table grows rapidly and can go from 20% of the anticipated database size to 80% or more; in some cases it could reach 100% of the volume size due to bad capacity planning. This table also has the highest write volume, so it would be better to place it on a volume with better write performance. Hence we want to handle this table specifically, in another location.
The growth rate is about 1GB/hour. There is only about one hour of activity per day though. The cleanup happens at 45 days, so 45GB of data is generated. We delete a month's worth of data, so the db size goes down to 15GB again.
Is it possible in PostgreSQL to place a fast-growing table (which is periodically cleaned and vacuumed) into a separate file on a separate volume? I want it to be part of the same database, just in a different file.
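The usual mechanism for this in PostgreSQL is a tablespace: a directory on another volume that the server can place tables in. A minimal sketch, where the mount point and table name are purely hypothetical:
CREATE TABLESPACE fastspace LOCATION '/mnt/fast_ssd/pgdata';
-- create the fast-growing table there directly:
CREATE TABLE event_log (id bigserial, payload text) TABLESPACE fastspace;
-- or move an existing table (this rewrites the table and takes an exclusive lock):
ALTER TABLE event_log SET TABLESPACE fastspace;
The table stays part of the same database; only its storage lives on the other volume.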

Related

huge postgres database size reduction

I have a huge database (9 billion rows, 3000 columns) currently hosted on Postgres; it is about 30 TB in size. I am wondering if there are practical ways to reduce the size of the database while preserving the same information, as storage is really costly.
If you don't want to delete any data:
Vacuuming. Depending on how many updates/deletes your database performs, there will be a lot of garbage. A database that size is likely full of tables that do not cross autovacuum thresholds often (PG 13 has a fix for this). Manually running vacuums will mark dead rows for re-use, and free up space where the ends of pages are no longer needed.
Index management. Indexes bloat over time and should be smaller than your tables. Re-indexing (concurrently) will give you some space back, or allow re-use of existing pages (see the sketch after this list).
Data de-duplication / normalisation. See where you can remove data from tables where it is not needed, or where it is already present elsewhere in the database.
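A minimal sketch of that manual maintenance, with hypothetical table and index names (REINDEX ... CONCURRENTLY needs PostgreSQL 12 or later):
-- reclaim dead-row space and refresh planner statistics:
VACUUM (VERBOSE, ANALYZE) big_table;
-- rebuild a bloated index without blocking writes:
REINDEX INDEX CONCURRENTLY big_table_pkey;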

What about expected performance in Pentaho?

I am using Pentaho to create ETLs and I am very focused on performance. I developed an ETL process that copies 163,000,000 rows from SQL Server 2008 to PostgreSQL and it takes 17 hours.
I do not know how good or bad this performance is. Do you know how to measure whether the time a process takes is reasonable? At least as a reference, to know whether I need to keep working heavily on performance or not.
Furthermore, I would like to know if it is normal that in the first 2 minutes of the ETL process it loads 2M rows. From that rate I estimated how long it would take to load all the rows; the expected result was 6 hours, but then performance decreases and it ends up taking 17 hours.
I have been searching on Google and I cannot find any timing references or explanations about performance.
Divide and conquer, and proceed by elimination.
First, add a LIMIT to your query so it takes 10 minutes instead of 17 hours; this will make it a lot easier to try different things.
Are the processes running on different machines? If so, measure network bandwidth utilization to make sure it isn't a bottleneck. Transfer a huge file, make sure the bandwidth is really there.
Are the processes running on the same machine? Maybe one is starving the other for IO. Are source and destination the same hard drive? Different hard drives? SSDs? You need to explain...
Examine IO and CPU usage of both processes. Does one process max out one cpu core?
Does a process max out one of the disks? Check iowait, iops, IO bandwidth, etc.
How many columns? Two INTs, 500 FLOATs, or a huge BLOB with a 12 megabyte PDF in each row? Performance would vary between these cases...
Now, I will assume the problem is on the POSTGRES side.
Create a dummy table, identical to your target table, which has:
Exactly the same columns (CREATE TABLE dummy (LIKE target_table))
No indexes and no constraints (I think that is the default; double-check the created table)
A BEFORE INSERT trigger on it which returns NULL and drops the row.
The rows will be processed, just not inserted (a sketch follows below).
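A sketch of that setup, assuming the real target is called target_table (the function and trigger names are made up):
CREATE TABLE dummy (LIKE target_table);

CREATE FUNCTION discard_row() RETURNS trigger AS $$
BEGIN
    RETURN NULL;  -- returning NULL from a BEFORE trigger silently drops the row
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER discard_all
    BEFORE INSERT ON dummy
    FOR EACH ROW EXECUTE PROCEDURE discard_row();
Point the ETL at dummy and time the run; the rows flow through the whole pipeline but nothing is written to the heap.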
Is it fast now? OK, so the problem was insertion.
Do it again, but this time using an UNLOGGED TABLE (or a TEMPORARY TABLE). These have no crash-resistance because they bypass the WAL, but for importing data that's OK... if it crashes during the insert you're going to wipe it out and restart anyway.
Still no indexes, no constraints. Is it fast?
If slow => IO write bandwidth issue, possibly caused by something else hitting the disks
If fast => IO is OK, problem not found yet!
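For the unlogged variant, a sketch (again using the hypothetical target_table):
CREATE UNLOGGED TABLE dummy_unlogged (LIKE target_table);
-- on PostgreSQL 9.5+ an existing table can also be flipped with
-- ALTER TABLE some_table SET UNLOGGED;  (and SET LOGGED again afterwards)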
With the table loaded with data, add the indexes and constraints one by one, and find out whether you have, say, a CHECK that uses a slow SQL function, or a FK to a table which has no index, that kind of stuff. Just check how long it takes to create each constraint.
Note: on an import like this you would normally add indices and constraints after the import.
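In psql you can time each addition separately; the index, constraint, and column names here are made up:
\timing on
CREATE INDEX target_table_user_idx ON target_table (user_id);
ALTER TABLE target_table
    ADD CONSTRAINT target_table_user_fk FOREIGN KEY (user_id) REFERENCES users (id);
Whichever statement takes disproportionately long is the one worth investigating.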
My gut feeling is that PG is checkpointing like crazy due to the large volume of data, because the checkpoint settings in the config are too low. Or some issue like that, probably related to random IO writes. You put the WAL on a fast SSD, right?
17 hours is too much. Far too much. For 200 million rows, even 6 hours is a lot.
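If checkpoints are the issue, these are the usual postgresql.conf knobs; the values below are only illustrative and assume PostgreSQL 9.5+ (older versions use checkpoint_segments instead of max_wal_size):
max_wal_size = 8GB                      # fewer forced checkpoints during the bulk load
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9      # spread checkpoint writes over more time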
Hints for optimization:
Check the memory size: edit spoon.bat, find the line containing -Xmx and change it to half your machine's memory size. Details vary with the Java version; this example is for PDI V7.1.
Check that the query from the source database is not too slow (because it is too complex, or because of the server memory size, or something else).
Check the target commit size (try 25000 for PostgreSQL), that Use batch update for inserts is on, and that the indexes and constraints are disabled.
Play with Enable lazy conversion in the Table input step. Warning: you may produce errors that are difficult to identify and debug, due to data casting.
In the transformation properties you can tune the Nr of rows in rowset (click anywhere, select Properties, then the Miscellaneous tab). On the same tab, check that the transformation is NOT transactional.

MongoDb "Working Set" exceeding RAM

I am collecting time series in MongoDB. Eventually my working set will be larger than my RAM, but I mostly need to access the recent data.
If I put everything in just one collection, would that still be possible? The index size will keep growing if I put all the data in one collection.
I was thinking of creating a new collection every month and putting the data there. That way, the very old data will not be loaded into RAM unless someone (rarely) needs that archive data.
So my question is: is it better to manually partition the data like that, or just leave everything up to MongoDB?

Is killing a "CLUSTER ON index" dangerous for database?

The whole question is in the title:
if we kill a CLUSTER query on a 100-million-row table, will it be dangerous for the database?
The query has been running for 2 hours now, and I need to access the table tomorrow morning (hopefully 12 hours from now).
I thought it would be far quicker; my database is running on RAID SSDs and a dual-Xeon processor.
Thanks for your wise advice.
Sid
No, you can kill the CLUSTER operation without any risk. Until the operation is done, nothing has changed in the original table and index files. From the manual:
When an index scan is used, a temporary copy of the table is created that contains the table data in the index order. Temporary copies of each index on the table are created as well. Therefore, you need free space on disk at least equal to the sum of the table size and the index sizes.
When a sequential scan and sort is used, a temporary sort file is also created, so that the peak temporary space requirement is as much as double the table size, plus the index sizes.
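To actually stop it, a sketch using the standard admin functions (pg_stat_activity has these column names on PostgreSQL 9.2+):
SELECT pid, state, query FROM pg_stat_activity WHERE query ILIKE 'cluster%';
SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE query ILIKE 'cluster%';
pg_cancel_backend sends a polite cancel request; pg_terminate_backend is the harder variant if the cancel is ignored.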
As @Frank points out, it is perfectly fine to do so.
Assuming you want to run this query in the future and assuming you have the luxury of a service window and can afford some downtime, I'd tweak some settings to boost the performance a bit.
In your configuration:
Turn off fsync, for higher throughput to the file system.
fsync stands for file system sync. With fsync on, the database waits for the file system to confirm that every page flush has actually reached disk.
Maximize your maintenance_work_mem.
It's OK to just take all available memory, as it will not be allocated during production hours. I don't know how big your table and the index you are working on are, but things will run faster when they can be fully loaded into main memory.
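A sketch of those two settings in postgresql.conf; the maintenance_work_mem value is hypothetical and should be sized to your RAM, and fsync must be turned back on after the service window:
fsync = off                     # no crash safety while this is off
maintenance_work_mem = 8GB      # only for the maintenance window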

DB Trigger to limit maximum table size in Postgres

Is it possible, perhaps using DB-triggers to set a maximum table-size in a postgres DB?
For example, say I have a table called: Comments.
Users can post comments as frequently as they like, but say I only want to store the 100 most recent comments in the DB. So what I want is a trigger that automatically maintains this, i.e. when there are more than 100 comments, it deletes the oldest one, and so on.
Could someone help me with writing such a trigger?
I think a trigger is the wrong tool for the job, although it is possible to implement this. Something about spawning a "delete" from an executing insert makes the hair on my neck stand up. You will generate a lot of locking and possibly contention that way, and inserts should generally not generate locks.
To me this says "stored procedure" all the way.
But I also think you should ask yourself: why delete old comments? Deletes are anathema. Better to just limit them when you display them. If you are really worried about the size of the table, use a TEXT column: Postgres will keep large values in a shadow (TOAST) table, and full scans of the original table will blaze along just fine.
Limiting to 100 comments per user is rather simple, e.g. (assuming a primary key column id, since DELETE does not accept ORDER BY/OFFSET directly):
delete from comments where id in (
    select id from comments where user_id = new.user_id
    order by comment_date desc offset 100);
Limiting the byte size is trickier. You'd need to calculate the relevant row sizes, and that won't account for index sizes, dead rows, etc. At best you could use the admin functions to get the table size, but these won't yield the size per user, only the total size.
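If you do go the trigger route anyway, a hedged sketch of what it could look like (it assumes the id, user_id, and comment_date columns used above; the function and trigger names are made up):
CREATE FUNCTION trim_comments() RETURNS trigger AS $$
BEGIN
    DELETE FROM comments WHERE id IN (
        SELECT id FROM comments WHERE user_id = NEW.user_id
        ORDER BY comment_date DESC OFFSET 100);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trim_comments_trg
    AFTER INSERT ON comments
    FOR EACH ROW EXECUTE PROCEDURE trim_comments();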
We could in theory create a table of 100 dummy records and then simply overwrite them with the actual comments. Once we pass the 100th we would overwrite the 1st one, and so on.
This way we are supposed to keep the table at the same size, but that is not possible, because an update is equivalent to a delete plus an insert in PostgreSQL. So the size of the table will continue to grow.
So if the objective is not to overflow the disk drive, then once the disk is 80% full a VACUUM FULL should be performed to free up disk space. VACUUM FULL requires disk space by itself. If you keep the number of records fixed, a regular vacuum will actually have an effect. Also, there seem to be cases where vacuum can fail.