psql copy command hangs with large CSV data set - postgresql

I'm trying to load some large data sets from CSV into a Postgres 11 database (Windows) to do some testing. The first problem I ran into was that with a very large CSV I got this error: ERROR: could not stat file "D:/temp/data.csv": Unknown error. After some searching, I found a workaround: load the data from a zip file instead. So I set up 7-Zip and was able to load some data with a command like this:
psql -U postgres -h localhost -d MyTestDb -c "copy my_table(id,name) FROM PROGRAM 'C:/7z e -so d:/temp/data.zip' DELIMITER ',' CSV"
Using this method, I was able to load a bunch of files of varying sizes, including one with 100 million records that was 700 MB zipped. But then I have one more large file, also with 100 million records, that's around 1 GB zipped, and that one for some reason is giving me grief. Basically, the psql process just keeps running and never stops. I can see from the growing data files that it generates data up to a certain point, but then the files stop growing. I'm seeing 6 files in a data folder named 17955, 17955.1, 17955.2, etc. up to 17955.5. The Date Modified timestamp on those files continues to be updated, but they're not growing in size, and my psql program just sits there. If I shut down the process, I lose all the data, since I assume it rolls back when the process does not run to completion.
I looked at the logs in the data/log folder and there doesn't seem to be anything meaningful there. I can't say I'm very used to Postgres (I've used SQL Server the most), so I'm looking for tips on where to look, what extra logging to turn on, or anything else that could help figure out why this process is stalling.

Figured it out thanks to @jjanes' comment above (sadly he/she didn't add an answer). I was adding 100 million records to a table with a foreign key to another table that also has 100 million records. I removed the foreign key, added the records, and then re-added the foreign key, and that did the trick. I guess checking the foreign key is just too much work with a bulk insert of this size.
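For anyone else hitting this, the drop/load/re-add sequence looks roughly like the following (the constraint and column names are hypothetical; re-adding the constraint still validates every row, but in one pass at the end instead of row by row during the COPY):
ALTER TABLE my_table DROP CONSTRAINT my_table_parent_id_fkey;
-- run the bulk COPY ... FROM PROGRAM command as above
ALTER TABLE my_table ADD CONSTRAINT my_table_parent_id_fkey FOREIGN KEY (parent_id) REFERENCES parent_table (id);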

Related

Postgres pg_dump getting very slow while copying large objects

I was performing a pg_dump operation on a Postgres (v9) database around 80 GB in size.
The operation never seemed to finish even when trying the following:
running a FULL VACUUM before dumping
dumping the db into a directory-format archive (using -Fd)
without compression (-Z 0)
dumping the db into a directory in parallel (tried up to 10 threads -j 10)
When using the --verbose flag I saw that most of the log output was related to creating/executing large objects.
When I tried dumping each table on its own (pg_dump -t table_name) the result was fast again (a matter of minutes), but when restoring the dump to another db, the application that uses the db started throwing exceptions about some resources not being found (they should have been in the db).
As stated in the Postgres pg_dump docs, when using the -t flag the command will not copy blobs.
I added the flag -b (pg_dump -b -t table_name) and the operation went back to being slow.
So the problem I guess is with exporting the blobs in the db.
The number of blobs should be around 5 million, which could explain the slowness in general, but the execution lasted as long as 5 hours before I killed the process manually.
The blobs are relatively small (max 100 KB per blob).
Is this expected? Or is there something fishy going on?
The slowness was due to a high number of orphaned blobs.
Apparently, running a FULL VACUUM on a Postgres database does not remove orphaned large objects.
When I queried the number of large objects in my database:
select count(distinct loid) from pg_largeobject;
output:
151200997
The number returned by the query did not match the expected value; the expected number of blobs should be around 5 million in my case.
The table (the one I created in the app) that references those blobs is subject to frequent updates, and Postgres does not delete the old tuples (rows) but rather marks them as 'dead' and inserts the new ones. With each update to the table, the old blob is no longer referenced by live tuples, only by dead ones, which makes it an orphaned blob.
Postgres has a dedicated command, 'vacuumlo', to remove orphaned blobs.
After using it (the vacuum took around 4 hours) the dump operation became much faster. The new duration is around 2 hours (previously it ran for hours and hours without finishing).
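For anyone who wants to check whether they are in the same situation first, a rough sketch (the application table and OID column here, my_blobs.blob_oid, are placeholders):
-- total number of large objects in the database
SELECT count(*) FROM pg_largeobject_metadata;
-- large objects no longer referenced by the application table
SELECT count(*)
FROM pg_largeobject_metadata m
WHERE NOT EXISTS (SELECT 1 FROM my_blobs b WHERE b.blob_oid = m.oid);
vacuumlo then unlinks every large object whose OID no longer appears in any oid or lo column in the database.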

How to update the Postgresql using CSV file multiple times

I have a CSV file whose data is to be imported into a Postgres database. I did it using the import function in pgAdmin III, but the problem is that my CSV file changes frequently. How can I import the data from the CSV file so that it overwrites the data already in the database?
You can take advantage of a WAL-logging optimization by running TRUNCATE and COPY in the same transaction. The basic idea is to wipe the table with TRUNCATE and reimport the data with COPY. This doesn't need to be done manually with pgAdmin each time; it can be scripted with something like:
BEGIN;
-- The CSV file is 'mydata.csv' and the table is 'mydata'.
TRUNCATE mydata;
COPY mydata FROM 'mydata.csv' WITH (FORMAT csv);
COMMIT;
Note that the server-side COPY FROM a file requires superuser access to work. The COPY command also takes various options, so you can adjust settings for NULLs, headers, etc.
Finally, it should be noted that you ideally want both statements in the same transaction, as shown above, so that a failed COPY rolls back and leaves the old data in place. This level of care isn't needed in many of the real-world cases where one is copying in a CSV file, but it costs little here.
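If you don't have superuser access on the server, a client-side variant using psql's \copy meta-command should work the same way; this is a sketch assuming the same table and file names, with the file read from the client machine:
BEGIN;
TRUNCATE mydata;
\copy mydata FROM 'mydata.csv' WITH (FORMAT csv)
COMMIT;
Saved as a script, this can be rerun with psql -f whenever the CSV changes.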

File loading issues in DB2 using Load utility

I have a .csv file, comma-delimited (located at C:/). I am using the DB2 LOAD utility to load data present in the CSV file in a DB2 table.
LOAD CLIENT FROM C:\Users\somepath\FileName.csv of del
MODIFIED BY NOCHARDEL COLDEL, insert into SchemaName.TABLE_NAME;
The CSV file has 25 rows. After the utility completed I got an error message about NOCHARDEL, but my table has all 25 rows properly loaded. Now when I try to execute an insert/update/delete statement on any of the tables present in that schema I get the following error.
Lookup Error - DB2 Database Error: ERROR [55039] [IBM][DB2/AIX64] SQL0290N Table space access is not allowed.
Could you please help me figure out whether I am making a mistake or missing a parameter that is causing the lock on the table?
Earlier, while loading the file, a similar situation occurred, and the DBA confirmed that the table space in question was in the "load in progress" state.
Changes generated by the DB2 LOAD utility are not logged (one of the side-effects of its high performance). If the database crashes immediately after the load it will be impossible to recover the table that was loaded by replaying log records, because there are no such records. For this reason the tablespace containing the loaded table is automatically placed in the BACKUP PENDING mode, forcing you to take a backup of that tablespace or the entire database to ensure it is fully recoverable.
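If you do end up in BACKUP PENDING, a tablespace-level backup along these lines should clear it (this is only a sketch; the database name, tablespace name, and path are placeholders, and the exact syntax should be checked for your DB2 version):
BACKUP DATABASE MyDb TABLESPACE (USERSPACE1) ONLINE TO /path/to/backups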
There are options that you can specify for the LOAD command that can help you avoid this situation in the future:
NONRECOVERABLE -- this option does not place the tablespace into the BACKUP PENDING mode, but, as its name implies, the table you're loading to becomes non-recoverable in case of a crash, and your only option in that situation will be to drop and re-create the table.
COPY YES -- this option saves a copy of the data being loaded, which can be used during recovery to roll the table forward through the LOAD in case of a crash.
If you are only loading 25 records, I suggest you use the IMPORT utility instead -- it does not have these restrictions because it is fully logged (at the price of lower performance, which for 25 records won't matter).
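For reference, the original command with the NONRECOVERABLE option added would look roughly like this (a sketch only; double-check the LOAD command syntax for your DB2 version and platform):
LOAD CLIENT FROM C:\Users\somepath\FileName.csv OF DEL
MODIFIED BY NOCHARDEL COLDEL, INSERT INTO SchemaName.TABLE_NAME NONRECOVERABLE;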
Thanks @mustaccio. I had 60 million rows to insert; I was using 25 as a sample to check the outcome.
To add another point, we later came to know that this is a known DB2 bug that leaves the load in the "load in progress" state (DB2 is unable to acknowledge that the load has completed and the session remains open indefinitely) and places the table space in the backup pending state.
Recovery is the only option to release the table space once it is in the pending state.
This issue is fixed in fix pack 10 according to the DB2 team (we are yet to deploy and test it). Meanwhile, the NONRECOVERABLE keyword is working fine for us.
The reason why your table is stuck in the LOAD IN PROGRESS state is the NOCHARDEL error happening at the end of the LOAD.
Have you tried restarting the database? This should reinitialize all table spaces and remove any rogue states.
http://www-01.ibm.com/support/docview.wss?uid=swg1IC65395
http://www-01.ibm.com/support/docview.wss?uid=swg21427102

Reconstructing PostgreSQL database from data files

You've heard the story, disk fails, no recent db backup, recovered files in disarray...
I've got a pg 9.1 database with a particular table I want to bring up to date. In the postgres data/base/444444 directory are all the raw files with table and index data. One particular table I can identify, and its files are as follows:
[relfilenode]
[relfilenode]_vm
[relfilenode]_fsm
where [relfilenode] is the number corresponding to the table I want to reconstruct.
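As a sanity check, the mapping from a table to its file number can be confirmed with a query along these lines, where my_table stands for the table in question:
SELECT relfilenode FROM pg_class WHERE relname = 'my_table';
SELECT pg_relation_filepath('my_table');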
In the current out-of-date database the main [relfilenode] file is 16 MB.
In my recovered files, I have found the corresponding [relfilenode] file and the _vm and _fsm files. The main [relfilenode] file is 20 MB, so I'm hopeful that this contains more up-to-date data.
However, when I swap the files over and restart my machine (OS X) and I inspect the table, it has approximately the same number of records in it (not exactly the same).
Question: is it possible just to swap out these files and expect it to work? How else can I extract the data from the 20 MB table file? I've read the other threads here about reconstructing from raw data files.
Thanks.

Repair Corrupt database postgresql

I have multiple errors with my postgresql db, which appeared after a power surge:
I cannot access most tables in my database. When I try, for example, select * from ac_cash_collection, I get the following error:
ERROR: missing chunk number 0 for toast value 118486855 in pg_toast_2619
When I try pg_dump I get the following error:
Error message from server: ERROR: relation "public.st_stock_item_newlist" does not exist
pg_dump: The command was: LOCK TABLE public.st_stock_item_newlist IN ACCESS SHARE MODE
I went ahead and tried to run a reindex of the whole database; I left it running, went to sleep, and found it had not done anything by the morning, so I had to cancel it.
I need some help to fix this as soon as possible. Please help.
Before you do anything else, read http://wiki.postgresql.org/wiki/Corruption and act on the instructions there. Failure to do so risks making the problem worse.
There are two configuration parameters listed in the Fine Manual that might be of use: ignore_system_indexes and zero_damaged_pages. I have never used them, but I would if I were desperate ...
I don't know if they help with TOAST tables. In any case, if setting them causes your database(s) to become usable again, I would {backup + drop + restore} to get all tables and catalogs into newborn shape again. Success!
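A minimal sketch of how zero_damaged_pages might be used for a salvage dump (superuser only, and it permanently zeroes bad pages as they are read, so only do this on a copy of the cluster):
SET zero_damaged_pages = on;
SELECT count(*) FROM ac_cash_collection;  -- pages with invalid headers are zeroed (and their rows lost) as they are read
ignore_system_indexes, by contrast, has to be set at connection start (e.g. via PGOPTIONS) rather than with SET mid-session.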
If you have backups, just restore from them.
If not - you've just learned why you need regular backups. There's nothing PostgreSQL can do if hardware misbehaves.
In addition, if you ever find yourself in this situation again, first stop PostgreSQL and take a complete file-level backup of everything - all tablespaces, WAL etc. That way you have a known starting point.
So, if you still want to recover some data:
Try dumping individual tables. Get what you can this way.
Drop indexes if they cause problems
Dump sections of tables (id=0..9999, 10000..19999, etc.) - that way you can identify where some rows may be corrupted and dump ever-smaller sections to recover what's still good (see the sketch just after this list)
Try dumping just certain columns - large text values are stored out-of-line (in toast tables) so avoiding them might get the rest of your data out.
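As a rough illustration of the last two points, something like this run from psql (the id range and column list are just examples for a table from the question):
\copy (SELECT id, name FROM ac_cash_collection WHERE id BETWEEN 0 AND 9999) TO 'slice_0_9999.csv' WITH (FORMAT csv)
If a given slice fails, split it further until the bad rows are isolated.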
If you've got corrupted system tables then you're getting into a lot of work.
That's a lot of work, and then you'll need to go through and audit what you've recovered and try to figure out what's missing/incorrect.
There are more things you can do (creating empty blocks in some cases can let you dump partial data) but they're all more complicated and fiddly and unless the data is particularly valuable not worth the effort.
Key message to take away from this - make sure you take regular backups, and make sure they work.
Before you do ANYTHING ELSE, take a complete file-system-level copy of the damaged database.
http://wiki.postgresql.org/wiki/Corruption
Failure to do so destroys evidence about what caused the corruption, and means that if your repair efforts go badly and make things worse you can't undo them.
Copy it now!
If only a few specific files are corrupted, the following tricks might help.
Restore an older dump on a different node or a second installation.
Copy the required files from the RESTORED/second installation to the FAILED node.
Stop & start PostgreSQL.
From Today's experience!
Error message from server: ERROR: could not read block 226448 in file "base/12345/12345.1": Input/output error
First, try copying the damaged file aside (this will probably fail with the same I/O error):
cp base/12345/12345.1 /root/backup/12345.1-orig
Then try moving it instead, expecting this to finish:
mv base/12345/12345.1 /root/backup/12345.1-orig
If that also fails, remove both: rm -rf base/12345/12345.1 /root/backup/12345.1-orig
Finally, the magic of tar (if the tar below completes, you're in luck!):
tar -zcvf my_backup.tar.gz /var/lib/postgresql/xx/main/xx
Extract the corrupted file from the tar archive.
Replace it in the original location base/12345/12345.1.
Stop & start PostgreSQL.
IMPORTANT: please try googling and do try vacuum, reindex and disk checks like fsck etc. before getting to this stage.
Also, always take a filesystem backup before trying any trial-and-error method :)