How can I reinitialize a large table? - postgresql

I use a PostgreSQL database. I have a big problem with a single table that has grown to 227 GB even though it only contains 420,000 records. I ran a VACUUM FULL on the table and waited for two days, but nothing happened and I had to cancel it. The same goes for REINDEX and TRUNCATE. I therefore looked in the database's data directory for the files corresponding to this table: there are 227 files of 1 GB each. Is there a way to reset my table?
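For reference, a quick way to confirm which files belong to the table and how big it really is; the table name my_big_table is a placeholder:
-- path of the table's data files, relative to the data directory;
-- segments beyond 1 GB show up as <filenode>.1, <filenode>.2, ...
SELECT pg_relation_filepath('my_big_table');
-- total on-disk size (including TOAST and indexes) versus the row count
SELECT pg_size_pretty(pg_total_relation_size('my_big_table'));
SELECT count(*) FROM my_big_table;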

Related

Postgres pg_dump getting very slow while copying large objects

I was performing a pg_dump operation on a Postgres (v9) database of around 80 GB.
The operation never seemed to finish, even when trying the following:
running a FULL VACUUM before dumping
dumping the db into a directory-format archive (using -Fd)
without compression (-Z 0)
dumping the db into a directory in parallel (tried up to 10 threads -j 10)
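Taken together, those options correspond roughly to an invocation like the one below; the database name and output directory are placeholders, not from the original post:
# directory-format archive, no compression, 10 parallel jobs
pg_dump -Fd -Z 0 -j 10 -f /path/to/dump_dir mydb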
When using the --verbose flag I saw that most of the log output was related to creating/executing large objects.
When I tried dumping each table on its own (pg_dump -t table_name), the result was fast again (a matter of minutes), but when restoring the dump to another db, the application that uses the db started throwing exceptions about some resources not being found (they should have been in the db).
As stated in the Postgres pg_dump docs, when the -t flag is used the command does not copy blobs.
I added the flag -b (pg_dump -b -t table_name) and the operation went back to being slow.
So I guess the problem is with exporting the blobs in the db.
The number of blobs should be around 5 million, which can explain the slowness in general, but the dump keeps running for as long as 5 hours before I kill the process manually.
The blobs are relatively small (max 100 KB per blob).
Is this expected, or is there something fishy going on?
The slowness was due to a high number of orphaned blobs.
Apparently, running a VACUUM FULL on a Postgres database does not remove orphaned large objects.
When I queried the number of large objects in my database:
select count(distinct loid) from pg_largeobject;
output:
151200997
The count returned by the query did not match the expected value; the number of blobs should be around 5 million in my case.
In my case, the application table that references those blobs is subject to frequent updates, and Postgres does not delete the old tuples (rows) but rather marks them as 'dead' and inserts new ones. With each update, the old blob is no longer referenced by any live tuple, only by dead ones, which makes it an orphaned blob.
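One way to estimate how many large objects have become orphaned is to compare pg_largeobject_metadata against the referencing table; the names app_table and blob_oid below are hypothetical and need to be adapted:
-- count large objects not referenced by any live row in the application table
-- (app_table / blob_oid are placeholders, not from the original post)
SELECT count(*)
FROM pg_largeobject_metadata m
WHERE NOT EXISTS (
    SELECT 1 FROM app_table t WHERE t.blob_oid = m.oid
);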
Postgres has a dedicated command 'vacuumlo' to vacuum orphaned blobs.
After using it (the vacuum took around 4h), the dump operation became much faster. The new duration is around 2h (previously it ran for hours and hours without finishing).
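For reference, a minimal sketch of running vacuumlo from the shell; the connection options and database name are placeholders:
# dry run: report how many orphaned large objects would be removed
vacuumlo -n -v -U postgres mydb
# actually remove the orphaned large objects
vacuumlo -v -U postgres mydb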

psql copy command hangs with large CSV data set

I'm trying to load some large data sets from CSV into a Postgres 11 database (Windows) to do some testing. The first problem I ran into was that with a very large CSV I got this error: ERROR: could not stat file "D:/temp/data.csv": Unknown error. After searching, I found a workaround to load the data from a zip file, so I set up 7-Zip and was able to load some data with a command like this:
psql -U postgres -h localhost -d MyTestDb -c "copy my_table(id,name) FROM PROGRAM 'C:/7z e -so d:/temp/data.zip' DELIMITER ',' CSV"
Using this method, I was able to load a bunch of files of varying sizes, one with 100 million records that was 700 MB zipped. But then I have one more large file, also with 100 million records, that's around 1 GB zipped, and that one for some reason is giving me grief. Basically, the psql process just keeps running and never stops. I can see from the data files growing that it generates data up to a certain point, but then at some point the files stop growing. I'm seeing 6 files in a data folder named 17955, 17955.1, 17955.2, etc. up to 17955.5. The Date Modified timestamp on those files continues to be updated, but they're not growing in size, and my psql program just sits there. If I shut down the process, I lose all the data, since I assume it is rolled back when the process does not run to completion.
I looked at the logs in the data/log folder; there doesn't seem to be anything meaningful there. I can't say I'm very used to Postgres (I've used SQL Server the most), so I'm looking for tips on where to look, what extra logging to turn on, or anything else that could help figure out why this process is stalling.
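Not from the original thread, but one place worth checking while the COPY appears stuck is pg_stat_activity from a second session; in Postgres 11 it shows what the backend is waiting on:
-- inspect the state and wait event of the running COPY
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'copy%';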
Figured it out thanks to jjanes' comment above (sadly they didn't add an answer). I was adding 100 million records to a table with a foreign key to another table holding 100 million records. I removed the foreign key, added the records, and then re-added the foreign key; that did the trick. I guess checking the foreign key is just too much with a bulk insert of this size.
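A sketch of that drop/reload/re-add sequence; the table, column, and constraint names are placeholders, not the ones from the original post:
-- drop the foreign key before the bulk load
ALTER TABLE my_table DROP CONSTRAINT my_table_parent_id_fkey;
-- ... run the COPY ... FROM PROGRAM load here ...
-- re-add the foreign key afterwards; it is validated in a single pass
ALTER TABLE my_table
    ADD CONSTRAINT my_table_parent_id_fkey
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);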

How can I automatically maintain a dump of modified rows in PostgreSQL

So, I have a PostgreSQL DB. For some chosen tables in that DB I want to maintain a plain dump of rows as they are modified. Note that this is not a recovery or backup dump; it is just a file holding the incremental rows. That is, whenever a row is inserted or updated, I want it appended to this file, or to a file in a folder. The idea is to load that folder periodically into something like Hive so that I can run queries to check previous states of certain rows and columns. These are very high-transaction tables, and the dump does not need to be real time; it can be done in batches, every hour. I want to avoid a trigger firing hundreds of times every minute. I am looking for something off the shelf, already available in PostgreSQL. I did some research, but everything I found relates to PostgreSQL backups, which is not the exact use case.
I have read links like https://clarkdave.net/2015/02/historical-records-with-postgresql-and-temporal-tables-and-sql-2011/ and "Implementing history of PostgreSQL table", but these are based on insert/update triggers and create the history table in PostgreSQL itself. I want to avoid both: I cannot keep the history in PostgreSQL, as it will soon become huge, and I do not want to keep writing to files through a trigger that fires constantly.
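The thread itself does not give an answer, but the kind of built-in, trigger-free change capture described above is usually done with logical decoding; a minimal sketch using the bundled test_decoding plugin (the slot name is arbitrary, and wal_level = logical is assumed):
-- create a logical replication slot once
SELECT pg_create_logical_replication_slot('audit_slot', 'test_decoding');
-- periodically (e.g. hourly) consume the accumulated insert/update/delete changes;
-- the output can be written to files and loaded into Hive
SELECT * FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);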

Redshift query results incorrect before VACUUM

When Redshift uses an index to run a query (say counts), does it exclude counting the rows in the unsorted region?
I had copied a lot of data using the COPY command, but did not VACUUM the table afterwards. On running my query (involving joins with several tables), the results were wrong - the newly copied rows in the unsorted region weren't counted.
Then, after vacuuming the table, the query began returning the correct results. Is this expected behavior, or is this a bug introduced by Amazon?
Vacuuming won't have any effect on rows loaded with COPY, which are effectively inserts. Vacuum physically deletes rows previously removed with an SQL DELETE statement, which only marks the rows as deleted so they don't participate in subsequent queries but still consume disk space.
Redshift is an eventually consistent database, so even if your COPY command had completed, the rows may not yet have been visible to queries.
Running vacuum is basically a defrag, which requires all rows to be reorganized. This (probably) brings the table into a consistent state, i.e. all rows become visible to queries.
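For reference, the maintenance commands involved look roughly like this in Redshift; the table name is a placeholder:
-- re-sort rows and reclaim space after a large COPY
VACUUM my_table;
-- refresh planner statistics for the newly loaded rows
ANALYZE my_table;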

DB2 9.5 + dropped indexes + SMS tablespaces = pages not reduced

I am using DB2 v9.5. The database is not automatic storage and the table spaces are all SMS (I know SMS is not best practice; I'm studying it in order to perform the migration later).
I dropped a total of 144 unused indexes, but the number of pages used/allocated in the database did not change after the DROP INDEX statements.
As far as I remember, for SMS tablespaces a REORG should not be necessary after dropping objects (tables or indexes); it is only needed after deleting rows from a table, in order to reduce the space allocated for that table.
Any opinion on what can be done to actually free the space from the indexes that were dropped?
Thanks
If you are sure your indexes were in SMS tablespaces, you should look at the corresponding filesystem, e.g. with df -h or some such.
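A sketch of how that check might look from the shell; the tablespace ID and container path are placeholders:
# list the directory containers behind the SMS tablespace (ID 2 is a placeholder)
db2 "LIST TABLESPACE CONTAINERS FOR 2"
# then check free space on the filesystem holding those containers
df -h /path/to/tablespace/container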