PostgreSQL: How to modify the text before \copy it - postgresql

Let's say I have some customer data like the following saved in a text file:
|Mr |Peter |Bradley |72 Milton Rise |Keynes |MK41 2HQ |
|Mr |Kevin |Carney |43 Glen Way |Lincoln |LI2 7RD | 786 3454
I copied the aforementioned data into my customer table using the following command:
\copy customer(title, fname, lname, addressline, town, zipcode, phone) from 'customer.txt' delimiter '|'
However, as it turns out, there are some extra space characters before and after various parts of the data. What I'd like to do is call trim() before copying the data into the table - what is the best way to achieve this?
Is there a way to call trim() on every value of every row and avoid inserting unclean data in the first place?
Thanks,

I think the best way to go about this is to add a BEFORE INSERT trigger to the table you're inserting into. That way, you can write a trigger function that runs before every record is inserted and trims whitespace (or does any other transformation you may need) on the columns that need it. When you're done, simply remove the trigger (or leave it in place, which will improve data integrity if you never want that whitespace in those columns). Explaining how to create a trigger and a trigger function in PostgreSQL is probably outside the scope of this question, but I will link to the documentation for each.
I think this is the best way because it is simpler than parsing through a text file or writing shell code to do this. This kind of sanitization is the kind of thing triggers do very well and very simply.
Creating a Trigger
Creating a Trigger Function
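As a rough illustration of the trigger approach, here is a minimal sketch for the customer table from the question. It is written as a shell function that only prints the SQL, so it can be inspected first and then piped into psql. The names customer_trim and customer_trim_before_insert are made up for this example, and it assumes all seven columns are text.

```bash
#!/bin/bash
# Print the SQL for a whitespace-trimming BEFORE INSERT trigger on "customer".
# customer_trim and customer_trim_before_insert are hypothetical names;
# all columns are assumed to be of a text type.
emit_trim_trigger_sql() {
    cat <<'SQL'
CREATE OR REPLACE FUNCTION customer_trim() RETURNS trigger AS $fn$
BEGIN
    -- btrim() strips leading and trailing spaces; NULL stays NULL
    NEW.title       := btrim(NEW.title);
    NEW.fname       := btrim(NEW.fname);
    NEW.lname       := btrim(NEW.lname);
    NEW.addressline := btrim(NEW.addressline);
    NEW.town        := btrim(NEW.town);
    NEW.zipcode     := btrim(NEW.zipcode);
    NEW.phone       := btrim(NEW.phone);
    RETURN NEW;  -- the modified row is what gets inserted
END;
$fn$ LANGUAGE plpgsql;

CREATE TRIGGER customer_trim_before_insert
    BEFORE INSERT ON customer
    FOR EACH ROW EXECUTE PROCEDURE customer_trim();
SQL
}
```

Something like `emit_trim_trigger_sql | psql dbname` would install it before running \copy, since COPY fires row-level BEFORE INSERT triggers. `DROP TRIGGER customer_trim_before_insert ON customer;` removes it again afterwards.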

I have a somewhat similar use case in one of my projects. My input files:
have the number of lines in the file as the last line;
need line numbers added to every line;
need file_id added to every line.
I use the following piece of shell code:
FACT=$( dosql "TRUNCATE tab_raw RESTART IDENTITY;
COPY tab_raw(file_id,lnum,bnum,bname,a_day,a_month,a_year,a_time,etype,a_value)
FROM stdin WITH (DELIMITER '|', ENCODING 'latin1', NULL '');
$(sed -e '$d' -e '=' "$FILE"|sed -e 'N;s/\n/|/' -e 's/^/'$DSID'|/')
\.
VACUUM ANALYZE tab_raw;
SELECT count(*) FROM tab_raw;
" | sed -e 's/^[ ]*//' -e '/^$/d'
)
dosql is a shell function that runs psql with the proper connection info and executes everything given to it as an argument.
As a result of this operation, the $FACT variable holds the total count of inserted records (for error detection).
Later I do another dosql call:
dosql "SET work_mem TO '800MB';
SELECT tab_prepare($DSID);
VACUUM ANALYZE tab_raw;
SELECT tab_duplicates($DSID);
SELECT tab_dst($DSID);
SELECT tab_gaps($DSID);
SELECT tab($DSID);"
to analyze the data and move it from the auxiliary table into the final tables.
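To see what the sed pipeline in the first dosql call does without touching a database, it can be pulled out into a small function. This is only a sketch: number_and_tag is a hypothetical name, and the file id is assumed to be a plain number (a value containing / or & would break the sed replacement).

```bash
#!/bin/bash
# number_and_tag FILE FILE_ID
# Drops the trailing line-count line, then prefixes every remaining line
# with "FILE_ID|LINE_NUMBER|". FILE_ID is assumed to be numeric.
number_and_tag() {
    local file=$1 file_id=$2
    # 1st sed: '$d' deletes the last line, '=' prints each line number on its own line
    # 2nd sed: 'N' joins the number with the following data line, then the id is prepended
    sed -e '$d' -e '=' "$file" | sed -e 'N;s/\n/|/' -e "s/^/${file_id}|/"
}
```

For a file containing the lines `a|b`, `c|d` and a trailing `2`, calling `number_and_tag file 7` yields `7|1|a|b` and `7|2|c|d`, which matches the file_id, lnum, ... column order in the COPY above.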

Related

How to use psql command line to remotely save multiple FOR loop results to remote client files?

I have an anonymous function containing a query within a FOR loop that executes 100 times, and I need to save the 100 result sets as 100 files on the remote client (not on the server).
It seems like the psql \copy meta-command should be the way to do this, but I'm at a loss. Something of this form, maybe?
\copy (anonymous_function_w/_FOR_loop_here) to 'filename.txt'
where filename.txt is built from the FOR loop variable's value in each iteration. That's important - the files on the remote client need to be named based on the FOR loop's variable.
Is there any way to pull this off? I suppose an alternative approach would be to UNION all 100 query results into one big result, with the FOR loop's variable value in one field, and then use bash scripting to split it into 100 appropriately named files. But my bash skills are pretty lame. If psql can do the job directly that would be great.
EDIT: I should add that here's what the FOR loop variable looks like:
FOR rec IN SELECT DISTINCT county FROM voter.counties
so the file name would be built from rec.county + '.txt'
The typical approach to this is to use a SQL statement that generates the necessary statements, spool the output into a script file, then run that file.
Something like:
-- prepare for a "plain" output without headers or something similar
\a
\t
-- spool the output into export.sql
\o export.sql
select format('\copy (select * from some_table where county = %L) to ''%s.txt''', county, county)
from (select distinct county from voter.counties) t;
-- turn spooling off
\o
-- run the generated file
\i export.sql
So for each county name in voter.counties, export.sql will contain a line like:
\copy (select * from some_table where county = 'foobar') to 'foobar.txt'
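The same generate-then-run idea can also be sketched on the shell side, for readers who prefer to build the script outside psql. The function name gen_copy_commands is hypothetical. Unlike the format() version above, this sketch replaces characters that look unsafe in a file name with underscores, which is a design choice of the example and not something the original answer does.

```bash
#!/bin/bash
# gen_copy_commands: read county names on stdin, print one \copy line per name.
# The county value is escaped as a SQL literal by doubling single quotes.
# The file name keeps only letters, digits, '.', '_' and '-' (an assumption
# of this sketch, so that odd names cannot break the \copy line).
gen_copy_commands() {
    local county literal safe sq="'"
    while IFS= read -r county; do
        [ -n "$county" ] || continue            # skip empty lines
        literal=${county//$sq/$sq$sq}           # O'Brien -> O''Brien
        safe=${county//[^A-Za-z0-9._-]/_}       # O'Brien -> O_Brien
        printf "\\\\copy (select * from some_table where county = '%s') to '%s.txt'\n" \
            "$literal" "$safe"
    done
}
```

A possible way to use it is `psql -At -c "select distinct county from voter.counties" dbname | gen_copy_commands > export.sql`, followed by `psql -f export.sql dbname`.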

How to ignore errors with psql \copy meta-command

I am using psql with a PostgreSQL database and the following copy command:
\COPY isa (np1, np2, sentence) FROM 'c:\Downloads\isa.txt' WITH DELIMITER '|'
I get:
ERROR: extra data after last expected column
How can I skip the lines with errors?
You cannot skip the errors without skipping the whole command up to and including Postgres 14. There is currently no more sophisticated error handling.
\copy is just a wrapper around SQL COPY that channels results through psql. The manual for COPY:
COPY stops operation at the first error. This should not lead to problems in the event of a COPY TO, but the target table will
already have received earlier rows in a COPY FROM. These rows will
not be visible or accessible, but they still occupy disk space. This
might amount to a considerable amount of wasted disk space if the
failure happened well into a large copy operation. You might wish to
invoke VACUUM to recover the wasted space.
Bold emphasis mine. And:
COPY FROM will raise an error if any line of the input file contains
more or fewer columns than are expected.
COPY is an extremely fast way to import / export data. Sophisticated checks and error handling would slow it down.
There was an attempt to add error logging to COPY in Postgres 9.0 but it was never committed.
Solution
Fix your input file instead.
If you have one or more additional columns in your input file and the file is otherwise consistent, you might add dummy columns to your table isa and drop those afterwards. Or (cleaner with production tables) import to a temporary staging table and INSERT selected columns (or expressions) to your target table isa from there.
Related answers with detailed instructions:
How to update selected rows with values from a CSV file in Postgres?
COPY command: copy only specific columns from csv
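One mechanical way to "fix the input file" is to split it by field count before loading. This is a minimal sketch: the function name and file arguments are hypothetical, and it counts raw | characters, so it assumes the delimiter never appears inside a quoted field.

```bash
#!/bin/bash
# split_by_field_count INPUT EXPECTED_FIELDS GOOD_OUT BAD_OUT
# Lines with exactly EXPECTED_FIELDS '|'-separated fields go to GOOD_OUT;
# all other lines go to BAD_OUT, prefixed with their original line number.
# Assumes '|' never occurs inside a quoted field (no CSV quoting awareness).
split_by_field_count() {
    local input=$1 expected=$2 good=$3 bad=$4
    : > "$good"   # make sure both output files exist, even if one stays empty
    : > "$bad"
    awk -F'|' -v n="$expected" -v good="$good" -v bad="$bad" '
        NF == n { print $0 > good; next }
                { print NR ": " $0 > bad }
    ' "$input"
}
```

For the isa table from the question, which has three columns, `split_by_field_count isa.txt 3 isa_good.txt isa_bad.txt` would leave only three-field lines in isa_good.txt for \copy, and the rejected lines can be inspected or repaired separately.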
It is too bad that after 25 years Postgres still doesn't have an -ignore-errors flag or option for the COPY command. In this era of big data you get a lot of dirty records, and it can be very costly for a project to fix every outlier.
I had to make a work-around this way:
Copy the original table and call it dummy_original_table.
On the dummy table, create a trigger like this:
CREATE OR REPLACE FUNCTION on_insert_in_original_table() RETURNS trigger AS $$
DECLARE
    v_rec RECORD;
BEGIN
    -- we use the trigger to prevent 'duplicate index' error by returning NULL on duplicates
    SELECT * FROM original_table WHERE primary_key = NEW.primary_key INTO v_rec;
    IF FOUND THEN
        RETURN NULL;
    END IF;
    BEGIN
        INSERT INTO original_table(datum, primary_key) VALUES (NEW.datum, NEW.primary_key)
        ON CONFLICT DO NOTHING;
    EXCEPTION
        WHEN OTHERS THEN
            NULL;  -- swallow any other error so the copy keeps going
    END;
    RETURN NULL;  -- never store the row in the dummy table itself
END;
$$ LANGUAGE plpgsql;
-- attach the function to the dummy table (the trigger name is illustrative)
CREATE TRIGGER trigger_on_insert_in_original_table
BEFORE INSERT ON dummy_original_table
FOR EACH ROW EXECUTE PROCEDURE on_insert_in_original_table();
Run a copy into the dummy table. No record will be inserted there, but all of them will be inserted into original_table:
psql dbname -c "\copy dummy_original_table(datum,primary_key) FROM '/home/user/data.csv' delimiter E'\t'"
Workaround: remove the reported errant line using sed and run \copy again
Later versions of Postgres (including Postgres 13) report the line number of the error. You can then remove that line with sed and run \copy again on the filtered file, e.g.,
#!/bin/bash
bad_line_number=5 # assuming line 5 is the bad line
# delete just that line and write the rest to a new file
sed "${bad_line_number}d" input.csv > filtered.csv
[per the comment from @Botond_Balázs]
Here's one solution -- import the batch file one line at a time. The performance can be much slower, but it may be sufficient for your scenario:
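#!/bin/bash
input_file=./my_input.csv
tmp_file=/tmp/one-line.csv
# read the input line by line; -r keeps backslashes intact
while IFS= read -r input_line; do
    printf '%s\n' "$input_line" > "$tmp_file"
    # \copy reads the file on the client; the file name goes in single quotes, not backticks
    psql my_database -c "\copy my_table FROM '$tmp_file' DELIMITER '|' CSV"
done < "$input_file"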
Additionally, you could modify the script to capture the psql stdout/stderr and exit status and, if the exit status is non-zero, echo $input_line and the captured stdout/stderr to stdout and/or append them to a file.
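The capture-and-log variant described above could look roughly like this. It is a sketch under assumptions: load_each_line and the reject file are hypothetical names, and the loader command is passed in as arguments so the loop itself does not depend on psql.

```bash
#!/bin/bash
# load_each_line INPUT REJECT_FILE LOADER [ARGS...]
# Runs LOADER once per input line, passing a temp file holding that single
# line as the last argument. When LOADER exits non-zero, the line and the
# captured stdout/stderr are appended to REJECT_FILE, separated by a tab.
load_each_line() {
    local input_file=$1 reject_file=$2
    shift 2
    local tmp_file input_line output
    tmp_file=$(mktemp) || return 1
    : > "$reject_file"
    while IFS= read -r input_line; do
        printf '%s\n' "$input_line" > "$tmp_file"
        # </dev/null keeps the loader from consuming the rest of the input file
        if ! output=$("$@" "$tmp_file" 2>&1 </dev/null); then
            printf '%s\t%s\n' "$input_line" "$output" >> "$reject_file"
        fi
    done < "$input_file"
    rm -f "$tmp_file"
}

# Hypothetical loader for the table above:
# load_one() { psql -v ON_ERROR_STOP=1 my_database \
#     -c "\copy my_table FROM '$1' DELIMITER '|' CSV"; }
# load_each_line ./my_input.csv ./rejects.txt load_one
```

Passing the loader as a parameter also makes it easy to dry-run the loop with a stand-in command before pointing it at a real database.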

export table to csv on postgres

How can I export a table to .csv in Postgres, when I'm not superuser and can't use the copy command?
I can still import the data to postgres with "import" button on the right click, but no export option.
Use psql and redirect stream to file:
psql -U <USER> -d <DB_NAME> -c "COPY <YOUR_TABLE> TO stdout DELIMITER ',' CSV HEADER;" > file.csv
COPY your_table TO '/path/to/your/file.csv' DELIMITER ',' CSV HEADER;
For more details go to this manual
Besides what marvinorez suggests in his answer, you can run this from psql:
\copy your_table TO '/path/to/your/file.csv' DELIMITER ',' CSV HEADER
On the other hand, in pgAdmin III you can also open the table by right-clicking its name and selecting View Data. Then click the upper-left corner of the table (the gray empty square where the column-name row meets the row-number column) to select all rows. Finally, copy with Ctrl+C or Edit -> Copy in the menu. The data is copied to the clipboard in CSV format, delimited by semicolons (;).
You can then paste it into LibreOffice Calc or MS Excel, for instance, to display it.
If your table is large (what is large depends on the amount of RAM of your machine, among other things) it might not fit in the clipboard, so in that case, I would not use this method but the first one (\copy).
The easiest way would indeed be a COPY to stdout I think. If you can't do this, how about using pg_dump and then transform the output file with sed, AWK or even a text editor? This should work even with search and replace in an acceptable amount of time :)
I was having trouble with superuser rights and running psql, so I took the simple, stupid way using pgAdmin III.
1) SELECT * FROM ;
Before running it, open Query in the menu bar and select 'Query to File'.
This will save the result to a folder of your choice. You may have to play with the export settings; by default it likes quoting and ; as the delimiter.
2) SELECT * FROM ;
run normally and then save the output by selecting export in the File menu. This will save as a .csv
This is not a good approach for large tables. Tables I have done this for are a few 100,000 rows and 10-30 columns. Large tables may have problems.

Printing to the screen in a .sql file in PostgreSQL

I have a .sql file I am building for an upgrade to my application that alters tables, inserts/updates, etc.
I want to write to the screen after every command finishes.
So, for instance if I have something like:
insert into X...
I want to see something like,
Starting to insert into table X
Finished inserting into table X
Is this possible in PostgreSQL?
This sounds like it should be a very easy thing to do, however, I cannot find anywhere how to do it.
If you're just feeding a big pile of SQL to psql then you have a couple of options.
You could run psql with --echo-all:
-a
--echo-all
Print all input lines to standard output as they
are read. This is more useful for script processing than interactive
mode. This is equivalent to setting the variable ECHO to all.
That and the other "echo everything of this type" options (see the manual) are probably too noisy though. If you just want to print things manually, use \echo:
\echo text [ ... ]
Prints the arguments to the standard output, separated by one space and followed by a newline. This can be useful to intersperse information in the output of scripts.
So you can say:
\echo 'Starting to insert into table X'
-- big pile of inserts go here...
\echo 'Finished inserting into table X'
Via an answer to How can I run an ad-hoc script in PostgreSQL?:
DO language plpgsql $$
BEGIN
RAISE NOTICE 'Hello, World!';
END
$$;
Depending on what you're doing, I'd be worried about doing a bunch of anonymous code blocks. You might consider storing the above as a function, and passing in whatever value you want logged.
There's probably a better way to do it. But if you need to use vanilla SQL, try this:
SELECT NULL AS "Starting to insert into table X";
-- big pile of inserts go here...
SELECT NULL AS "Finished inserting into table X";

syntax for COPY in postgresql

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of INSERT to reduce the time it takes to insert the values. I fetch the data with a SELECT. How can I insert it into a table using the COPY command in PostgreSQL? Could you please give an example? Any other suggestion for reducing the insert time is welcome too.
As your rows are already in the database (because you apparently can SELECT them), then using COPY will not increase the speed in any way.
To be able to use COPY you have to first write the values into a text file, which is then read into the database. But if you can SELECT them, writing to a text file is a completely unnecessary step and will slow down your insert, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster hard disk) is to remove any index on contacts_lists that contains the column contact_id or list_id and re-create the index once the insert is finished.
You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO '/path/to/some_file';
And
COPY contacts_lists (contact_id, list_id) FROM '/path/to/some_file';
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.