How do I check that I removed required data only? - postgresql

I have a really big database (running on PostgreSQL) containing a lot of tables with sophisticated relations between them (foreign keys, on delete cascade and so on).
I need remove some data from a number of tables, but I'm not sure what amount of data will be really deleted from database due to cascade removals.
How can I check that I'll not delete data that should not be deleted?
I have a test database - just a copy of real one where I can do what I want :)
The only idea I have is dump database before and after and check it. But it not looks comfortable.
Another idea - dump part of database, that, as I think, should not be affected by my DELETE statements and check this part before and after data removal. But I see no simple ways to do it (there are hundreds of tables and removal should work with ~10 of them). Is there some way to do it?
Any other ideas how to solve the problem?

You can query the information_schema to draw yourself a picture on how the constraints are defined in the database. Then you'll know what is going to happen when you delete. This will be useful not only for this case, but always.
Something like (for constraints)
select table_catalog,table_schema,table_name,column_name,rc.* from
information_schema.constraint_column_usage ccu,
information_schema.referential_constraints rc
where ccu.constraint_name = rc.constraint_name

Using psql, start a transaction, perform your deletes, then run whatever checking queries you can think of. You can then either rollback or commit.

If the worry is keys left dangling (i.e.: pointing to a deleted record) then run the deletion on your test database, then use queries to find any keys that now point to invalid targets. (while you're doing this you can also make sure the part that should be unaffected did not change)
A better solution would be to spend time mapping out the delete cascades so you know what to expect - knowing how your database works is pretty valuable so the effort spent on this will be useful beyond this particular deletion.
And no matter how sure you are back the DB up before doing big changes!

Thanks for answers!
Vinko, your answer is very useful for me and I'll study it dipper.
actually, for my case, it was enough to compare tables counts before and after records deletion and check what tables were affected by it.
it was done by simple commands described below
psql -U U_NAME -h`hostname` -c '\d' | awk '{print $3}' > tables.list
for i in `cat tables.list `; do echo -n "$i: " >> tables.counts; psql -U U_NAME -h`hostname` -t -c "select count(*) from $i" >> tables.counts; done
for i in `cat tables.list `; do echo -n "$i: " >> tables.counts2; psql -U U_NAME -h`hostname` -t -c "select count(*) from $i" >> tables.counts2; done
diff tables.counts tables.counts2

Related

Is there a way to filter pg_dump by timestamp for PostgreSQL?

I have a database that needs backing up, but only for specific timestamps and tables, like from the first of October to the 15th of October. After looking up on multiple sites, I have not found any methods that can suit my requirements.
Let's say I have database_A, and database A has 15 tables. I want to be able to use pg_dump to back up 10 tables from database_A, from the 1st of October to the 15th of October, all into 1 file. Below is what I have managed to do, but have not gotten the date portion yet as I'm not entirely sure.
pg_dump -U postgres -t"\"table_1\"" -t"\"table_2\"" database_A > backup.csv
This above code will work if I want to back up multiple tables into one file, and it will back up the entire table, from start to end.
I would much appreciate if someone could help me with this, as I am still mostly a beginner at this. Thank you!
If the data you're copying has a column named timestamp you can use psql and the COPY command to accomplish this:
# Optional: clear existing table since COPY FROM will append data
psql -c "TRUNCATE TABLE my_table" target_db
psql -c "COPY (SELECT * FROM my_table WHERE timestamp >= '...' AND timestamp <= '...') TO STDOUT" source_db | psql -c "COPY my_table FROM STDIN" target_db
You can repeat this pattern for as many tables as necessary. I've used this approach before to copy a subset of live data into a development database and it works quite well, especially if you put the above commands into a shell script.

How to load bulk data to table from table using query as quick as possible? (postgresql)

I have a large table(postgre_a) which has 0.1 billion records with 100 columns. I want to duplicate this data into the same table.
I tried to do this using sql
INSERT INTO postgre_a select i1 + 100000000, i2, ... FROM postgre_a;
However, this query is running more than 10 hours now... so I want to do this more faster. I tried to do this with copy, but I cannot find the way to use copy from statement with query.
Is there any other method can do this faster?
You cannot directly use a query in COPY FROM, but maybe you can use COPY FROM PROGRAM with a query to do what you want:
COPY postgre_a
FROM PROGRAM '/usr/pgsql-10/bin/psql -d test'
' -c ''copy (SELECT i1+ 100000000, i2, ... FROM postgre_a) TO STDOUT''';
(Of course you have to replace the path to psql and the database name with your values.)
I am not sure if that is faster than using INSERT, but it is worth a try.
You should definitely drop all indexes and constraints before the operation and recreate them afterwards.

How to ignore errors with psql \copy meta-command

I am using psql with a PostgreSQL database and the following copy command:
\COPY isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' WITH DELIMITER '|'
I get:
ERROR: extra data after last expected column
How can I skip the lines with errors?
You cannot skip the errors without skipping the whole command up to and including Postgres 14. There is currently no more sophisticated error handling.
\copy is just a wrapper around SQL COPY that channels results through psql. The manual for COPY:
COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will
already have received earlier rows in a COPY FROM. These rows will
not be visible or accessible, but they still occupy disk space. This
might amount to a considerable amount of wasted disk space if the
failure happened well into a large copy operation. You might wish to
invoke VACUUM to recover the wasted space.
Bold emphasis mine. And:
COPY FROM will raise an error if any line of the input file contains
more or fewer columns than are expected.
COPY is an extremely fast way to import / export data. Sophisticated checks and error handling would slow it down.
There was an attempt to add error logging to COPY in Postgres 9.0 but it was never committed.
Solution
Fix your input file instead.
If you have one or more additional columns in your input file and the file is otherwise consistent, you might add dummy columns to your table isa and drop those afterwards. Or (cleaner with production tables) import to a temporary staging table and INSERT selected columns (or expressions) to your target table isa from there.
Related answers with detailed instructions:
How to update selected rows with values from a CSV file in Postgres?
COPY command: copy only specific columns from csv
It is too bad that in 25 years Postgres doesn't have -ignore-errors flag or option for COPY command. In this era of BigData you get a lot of dirty records and it can be very costly for the project to fix every outlier.
I had to make a work-around this way:
Copy the original table and call it dummy_original_table
in the original table, create a trigger like this:
CREATE OR REPLACE FUNCTION on_insert_in_original_table() RETURNS trigger AS $$
DECLARE
v_rec RECORD;
BEGIN
-- we use the trigger to prevent 'duplicate index' error by returning NULL on duplicates
SELECT * FROM original_table WHERE primary_key=NEW.primary_key INTO v_rec;
IF v_rec IS NOT NULL THEN
RETURN NULL;
END IF;
BEGIN
INSERT INTO original_table(datum,primary_key) VALUES(NEW.datum,NEW.primary_key)
ON CONFLICT DO NOTHING;
EXCEPTION
WHEN OTHERS THEN
NULL;
END;
RETURN NULL;
END;
Run a copy into the dummy table. No record will be inserted there, but all of them will be inserted in the original_table
psql dbname -c \copy dummy_original_table(datum,primary_key) FROM '/home/user/data.csv' delimiter E'\t'
Workaround: remove the reported errant line using sed and run \copy again
Later versions of Postgres (including Postgres 13), will report the line number of the error. You can then remove that line with sed and run \copy again, e.g.,
#!/bin/bash
bad_line_number=5 # assuming line 5 is the bad line
sed ${bad_line_number}d < input.csv > filtered.csv
[per the comment from #Botond_Balázs ]
Here's one solution -- import the batch file one line at a time. The performance can be much slower, but it may be sufficient for your scenario:
#!/bin/bash
input_file=./my_input.csv
tmp_file=/tmp/one-line.csv
cat $input_file | while read input_line; do
echo "$input_line" > $tmp_file
psql my_database \
-c "\
COPY my_table \
FROM `$tmp_file` \
DELIMITER '|'\
CSV;\
"
done
Additionally, you could modify the script to capture the psql stdout/stderr and exit
status, and if the exit status is non-zero, echo $input_line and the captured stdout/stderr to stdin and/or append it to a file.

PostgreSQL: How to modify the text before /copy it

Lets say I have some customer data like the following saved in a text file:
|Mr |Peter |Bradley |72 Milton Rise |Keynes |MK41 2HQ |
|Mr |Kevin |Carney |43 Glen Way |Lincoln |LI2 7RD | 786 3454
I copied the aforementioned data into my customer table using the following command:
\copy customer(title, fname, lname, addressline, town, zipcode, phone) from 'customer.txt' delimiter '|'
However, as it turns out, there are some extra space characters before and after various parts of the data. What I'd like to do is call trim() before copying the data into the table - what is the best way to achieve this?
Is there a way to call trim() on every value of every row and avoid inserting unclean data in the first place?
Thanks,
I think the best way to go about this is to add a BEFORE INSERT trigger to the table you're inserting to. This way, you can write a stored procedure that will execute before every record is inserted and trim whitepsace (or do any other transformations you may need) on any columns that need it. When you're done, simply remove the trigger (or leave it, which will improve data integrity if you never want that whitespace int those columns). I think explaining how to create a trigger and stored procedure in PostgreSQL is probably outside the scope of this question, but I will link to the documentation for each.
I think this is the best way because it is simpler than parsing through a text file or writing shell code to do this. This kind of sanitization is the kind of thing triggers do very well and very simply.
Creating a Trigger
Creating a Trigger Function
I have somehow similar use case in one of the projects. My input files:
has number of lines in the file as a last line;
needs to have line numbers added on every line;
needs to have file_id added to every line.
I use the following piece of shell code:
FACT=$( dosql "TRUNCATE tab_raw RESTART IDENTITY;
COPY tab_raw(file_id,lnum,bnum,bname,a_day,a_month,a_year,a_time,etype,a_value)
FROM stdin WITH (DELIMITER '|', ENCODING 'latin1', NULL '');
$(sed -e '$d' -e '=' "$FILE"|sed -e 'N;s/\n/|/' -e 's/^/'$DSID'|/')
\.
VACUUM ANALYZE tab_raw;
SELECT count(*) FROM tab_raw;
" | sed -e 's/^[ ]*//' -e '/^$/d'
)
dosql is a shell function, that executes psql with proper connectivity info and executes everything, that was given as an argument.
As a result of this operation I will have $FACT variable holding a total count of inserter records (for error detection).
Later I do another dosql call:
dosql "SET work_mem TO '800MB';
SELECT tab_prepare($DSID);
VACUUM ANALYZE tab_raw;
SELECT tab_duplicates($DSID);
SELECT tab_dst($DSID);
SELECT tab_gaps($DSID);
SELECT tab($DSID);"
to get analyze and move data into the final tables from auxiliary one.

Transaction time out workaround for PostgreSQL

AFAIK, PostgreSQL 8.3 does not support transaction time out. I've read about supporting this feature in the future and there's some discussion about it. However, for specific reasons, I need a solution for this problem. So what I did is a script that runs periodically:
1) Based on locks and activity, query in order to retrieve processID of the transactions that is taking too long, and keeping the oldest (trxTimeOut.sql):
SELECT procpid
FROM
(
SELECT DISTINCT age(now(), query_start) AS age, procpid
FROM pg_stat_activity, pg_locks
WHERE pg_locks.pid = pg_stat_activity.procpid
) AS foo
WHERE age > '30 seconds'
ORDER BY age DESC
LIMIT 1
2) Based on this query, kill the corresponding process (trxTimeOut.sh):
psql -h localhost -U postgres -t -d test_database -f trxTimeOut.sql | xargs kill
Although I've tested it and seems to work, I'd like to know if it's an acceptable approach or should I consider a different one?
PostgreSQL provides idle_in_transaction_session_timeout since version 9.6, to automatically terminate transactions that are idle for too long.
It's also possible to set a limit on how long a command can take, through statement_timeout, independently on the duration of the transaction it's in, or why it's stuck (busy query or waiting for a lock).
To auto-abort transactions that are stuck specifically waiting for a lock, see lock_timeout.
These settings can be set at the SQL level with commands like SET shown below, or can be set as defaults to a database with ALTER DATABASE, or to a user with ALTER USER, or to the entire instance through postgresql.conf.
SET statement_timeout=10000; -- time out after 10 seconds