How use GNU parallel and GNU SQL with \copy in PostgreSQL - postgresql

I want to do a bulk load to a PostgreSQL database, there are several files and are pretty big. I just read in Using GNU Parallel With split about the GNU Parallel and GNU SQL, and It looks fantastic, Could some one help me with an example of using GNU Parallel, GNU SQL and \copy or COPY for doing a bulk load to PostgreSQL?

I'd just use pg_bulkload instead. It does that, and more, for you.

Both GNU parallel and pg_bulkload are cool, but not available in most default installations. I think this task can be achieved by using '&' (background sub-shell) and 'wait' in a shell script, which invokes multi \COPY operations simultaneously. Here is an example:
#!/bin/bash
PG_USER_NAME=bizusr
PG_DB_NAME=bizdb
BCP_DIR=/data/biz/bcp/input
do_bcp()
{
TABLE_NAME=$1
echo "`date` $$ copy $TABLE_NAME begin"
psql -q -U $PG_USER_NAME -d $PG_DB_NAME << EOF
-- SET DATESTYLE TO 'ISO,YMD'; -- you may need this when dealing with timestamps
\COPY $TABLE_NAME FROM '${BCP_DIR}/${TABLE_NAME}.bcp' WITH (FORMAT CSV, DELIMITER '|');
EOF
echo "`date` $$ copy $TABLE_NAME done"
}
echo "`date` $$ parallel copy started"
for BCP_FILE in `ls ${BCP_DIR}/*.bcp`; do
TABLE_NAME=`echo $BCP_FILE|awk -F"/" '{print $NF}'|sed -e s/\.bcp$//`
do_bcp $TABLE_NAME &
done
wait
echo "`date` $$ parallel copy finished"

Related

What "*#" means after executint a command in PostgreSql 10 on Windows 7?

I'm using PostgreSQL on Windows 7 through the command line. I want to import the content of different CSV files into a newly created table.
After executing the command the database name appeared like:
database=#
Now appears like
database*# after executing:
type directory/*.csv | psql -c 'COPY sch.trips(value1, value2) from stdin CSV HEADER';
What does *# mean?
Thanks
This answer is for Linux and as such doesn't answer OP's question for Windows. I'll leave it up anyway for anyone that comes across this in the future.
You accidentally started a block comment with your type directory/*.csv. type doesn't do what you think it does. From the bash built-ins:
With no options, indicate how each name would be interpreted if used as a command name.
Try doing cat instead:
cat directory/*.csv | psql -c 'COPY sch.trips(value1, value2) from stdin CSV HEADER';
If this gives you issues because each CSV has its own header, you can also do:
for file in directory/*.csv; do cat "$file" | psql -c 'COPY sch.trips(value1, value2) from stdin CSV HEADER'; done
Type Command
The type built-in command in Bash is a way of viewing command interpreter results. For example, using it with ssh:
$ type ssh
ssh is /usr/bin/ssh
This indicates how ssh would be interpreted when you run ssh as a command in the current Bash environment. This is useful for things like aliases. As an example for this, ll is usually an alias to ls -l. Here's what my Bash environment had for ll:
$ type ll
ll is aliased to `ls -l --color=auto'
For you, when you pipe the result of this command to psql, it encounters the /* in the input and assumes it's a block comment, which is what the database*# prompt means (the * indicates it's waiting for the comment close pattern, */).
Cat Command
cat is for concatenating multiple files together. By default, it writes to standard out, so cat directory/*.csv will write each CSV file to standard out one after another. However, piping this means that each CSV's header will also be piped mid-stream of the copy. This may not be desirable, so:
For Loop
We can use for to loop over each file and individually import it. The version I have above, for file in directory/*.csv, will properly handle files with spaces. Properly formatted:
for file in directory/*; do
cat "$file" | psql -c 'COPY sch.trips(value1, value2) from stdin CSV HEADER'
done
References
PostgreSQL 10 Comments Documentation (postgresql.org)
type built-in Manual page (mankier.com)
cat Manual page (mankier.com)
Bash looping tutorial (tldp.org)

How to copy a csv file from a url to Postgresql

Is there any way to use copy command for batch data import and read data from a url. For example, copy command has a syntax like :
COPY sample_table
FROM 'C:\tmp\sample_data.csv' DELIMITER ',' CSV HEADER;
What I want is not to give a local path but a url. Is there any way?
It's pretty straightforward, provided you have an appropriate command-line tool available:
COPY sample_table FROM PROGRAM 'curl "http://www.example.com/file.csv"'
Since you appear to be on Windows, I think you'll need to install curl or wget yourself. There is an example using wget on Windows here which may be useful.
My solution is
cat $file |
tail -$numberLine |
sed 's/ / ,/g' |
psql -q -d $dataBaseName -c "COPY tableName FROM STDIN DELIMITER ','"
You can insert a awk between sed and psql to add missing column.
Interesting if already you know what to put in the missing column.
awk '{print $0" , "'info_about_missing_column'"\n"}'
I have done that and it works and faster than INSERT.

sh: variable substitution with heredoc

cat "${pos}" | /usr/bin/iconv -f CP1251 -t UTF-8 | uniq | sed -En "/^CLIENT_ID.*/!p" | while read line
do
.....
......
cat >> "$TMPFILE" << EOF
INSERT INTO ......;
EOF
done
As you can see each iteration writes a SQL statement to a tmp-file.
I launched this script from a regular interactive shell and got the expected output. Launched from a cron job - nothing.
After investigating I found a problem. When I use "$TMPFILE" without "" the script works ok. Why does this happen?
OS: FreeBSD, bourne shell.
IIRC, cron doesn't source all the files that a login shell does, so you will end up with different settings for environment variables. Could be the path $TMPFILE is pointing to contains spaces when run from cron for example.
Also, on some systems (depending on setup), cron uses a different shell. So if you start your script from command line, for example /usr/bin/sh might be used, whereas when started by cron, /bin/sh is used. (I have no experience with *BSD, but I have observed this on linux.)

Postgres COPY command to tail a file?

I want to use to copy command to copy data into postgres; while there other processes are simultaneously writing into the CSV file.
Is something like this possible? Take the stdout from tail and pipe into the stdin of postgres.
COPY targetTable ( column1, column2 )
FROM `tail -f 'path/to/data.csv'`
WITH CSV
Assuming PostgreSQL 9.3 or better, there's the possibility of copying from a program output with:
COPY FROM PROGRAM 'command'
From the doc:
PROGRAM
A command to execute. In COPY FROM, the input is read from standard
output of the command, and in COPY TO, the output is written to the
standard input of the command.
This may be what you need except for the fact that tail -f being a never-ending command by design, it's not obvious how you plan for the COPY to ever finish. Presumably you'd need to replace tail -f by a more elaborate script with some exit condition.
you can also do COPY FROM STDIN;
example:
tail -f datafile.csv | psql -tc "COPY table from STDIN" database

PostgreSQL: How to pass parameters from command line?

I have a somewhat detailed query in a script that uses ? placeholders. I wanted to test this same query directly from the psql command line (outside the script). I want to avoid going in and replacing all the ? with actual values, instead I'd like to pass the arguments after the query.
Example:
SELECT *
FROM foobar
WHERE foo = ?
AND bar = ?
OR baz = ? ;
Looking for something like:
%> {select * from foobar where foo=? and bar=? or baz=? , 'foo','bar','baz' };
You can use the -v option e.g:
$ psql -v v1=12 -v v2="'Hello World'" -v v3="'2010-11-12'"
and then refer to the variables in SQL as :v1, :v2 etc:
select * from table_1 where id = :v1;
Please pay attention to how we pass string/date values using two quotes " '...' " But this way of interpolation is prone to SQL injections, because it's you who's responsible for quoting. E.g. need to include a single quote? -v v2="'don''t do this'".
A better/safer way is to let PostgreSQL handle it:
$ psql -c 'create table t (a int, b varchar, c date)'
$ echo "insert into t (a, b, c) values (:'v1', :'v2', :'v3')" \
| psql -v v1=1 -v v2="don't do this" -v v3=2022-01-01
Found out in PostgreSQL, you can PREPARE statements just like you can in a scripting language. Unfortunately, you still can't use ?, but you can use $n notation.
Using the above example:
PREPARE foo(text,text,text) AS
SELECT *
FROM foobar
WHERE foo = $1
AND bar = $2
OR baz = $3 ;
EXECUTE foo('foo','bar','baz');
DEALLOCATE foo;
In psql there is a mechanism via the
\set name val
command, which is supposed to be tied to the -v name=val command-line option. Quoting is painful, In most cases it is easier to put the whole query meat inside a shell here-document.
Edit
oops, I should have said -v instead of -P (which is for formatting options) previous reply got it right.
You can also pass-in the parameters at the psql command-line, or from a batch file. The first statements gather necessary details for connecting to your database.
The final prompt asks for the constraint values, which will be used in the WHERE column IN() clause. Remember to single-quote if strings, and separate by comma:
#echo off
echo "Test for Passing Params to PGSQL"
SET server=localhost
SET /P server="Server [%server%]: "
SET database=amedatamodel
SET /P database="Database [%database%]: "
SET port=5432
SET /P port="Port [%port%]: "
SET username=postgres
SET /P username="Username [%username%]: "
SET /P bunos="Enter multiple constraint values for IN clause [%constraints%]: "
ECHO you typed %constraints%
PAUSE
REM pause
"C:\Program Files\PostgreSQL\9.0\bin\psql.exe" -h %server% -U %username% -d %database% -p %port% -e -v v1=%constraints% -f test.sql
Now in your SQL code file, add the v1 token within your WHERE clause, or anywhere else in the SQL. Note that the tokens can also be used in an open SQL statement, not just in a file. Save this as test.sql:
SELECT * FROM myTable
WHERE NOT someColumn IN (:v1);
In Windows, save the whole file as a DOS BATch file (.bat), save the test.sql in the same directory, and launch the batch file.
Thanks for Dave Page, of EnterpriseDB, for the original prompted script.
I would like to offer another answer inspired by #malcook's comment (using bash).
This option may work for you if you need to use shell variables within your query when using the -c flag. Specifically, I wanted to get the count of a table, whose name was a shell variable (which you can't pass directly when using -c).
Assume you have your shell variable
$TABLE_NAME='users'
Then you can get the results of that by using
psql -q -A -t -d databasename -c <<< echo "select count(*) from $TABLE_NAME;"
(the -q -A -t is just to print out the resulting number without additional formatting)
I will note that the echo in the here-string (the <<< operator) may not be necessary, I originally thought the quotes by themselves would be fine, maybe someone can clarify the reason for this.
It would appear that what you ask can't be done directly from the command line. You'll either have to use a user-defined function in plpgsql or call the query from a scripting language (and the latter approach makes it a bit easier to avoid SQL injection).
I've ended up using a better version of #vol7ron answer:
DO $$
BEGIN
IF NOT EXISTS(SELECT 1 FROM pg_prepared_statements WHERE name = 'foo') THEN
PREPARE foo(text,text,text) AS
SELECT *
FROM foobar
WHERE foo = $1
AND bar = $2
OR baz = $3;
END IF;
END$$;
EXECUTE foo('foo','bar','baz');
This way you can always execute it in this order (the query prepared only if it does not prepared yet), repeat the execution and get the result from the last query.