Importing zipped CSV file into PostgreSQL

I have a big compressed CSV file (25 GB) and I want to import it into PostgreSQL 9.5. Is there any fast way to import a zip or gzip file into Postgres without extracting the file first?

There is an old trick that uses a named pipe (works on Unix; I don't know about Windows):
create a named pipe: mkfifo /tmp/omyfifo
write the file contents to it: zcat mycsv.csv.z > /tmp/omyfifo &
[from psql] copy mytable(col1,...) from '/tmp/omyfifo'
[when finished] rm /tmp/omyfifo
The zcat in the background will block until a reader (here: the COPY command) starts reading, and it will finish at EOF (or when the reader closes the pipe).
You could even start multiple pipe+zcat pairs, which will be picked up by multiple COPY statements in your SQL script.
This will work from pgAdmin, but the fifo (plus the zcat process) must be present on the machine where the DBMS server runs.
BTW: a similar trick using netcat can be used to read a file from a remote machine (which, of course, should write the file to the network socket).
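A minimal sketch of that netcat variant, with hypothetical hostnames and port (exact nc flags differ between netcat implementations, and the sender should connect only after the listener is up):
[on the DB server] mkfifo /tmp/omyfifo
[on the DB server] nc -lp 9999 > /tmp/omyfifo &
[from psql] copy mytable(col1,...) from '/tmp/omyfifo'
[on the remote machine] zcat mycsv.csv.gz | nc dbserver.example 9999
[when finished] rm /tmp/omyfifo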

An example of how to do it with zcat and a pipe:
-bash-4.2$ psql -p 5555 t -c "copy tp to '/tmp/tp.csv';"
COPY 1
-bash-4.2$ gzip /tmp/tp.csv
-bash-4.2$ zcat /tmp/tp.csv.gz | psql -p 5555 t -c "copy tp from stdin;"
COPY 1
-bash-4.2$ psql -p 5555 t -c "select count(*) from tp"
 count
-------
     2
(1 row)
Also, since the 9.3 release, you can do:
psql -p 5555 t -c "copy tp from program 'zcat /tmp/tp.csv.gz';"
without a pipe at all.

If you have a ZIP (.zip) instead of a GZIP (.gz) archive, you can use unzip -p to pipe the zipped file.
psql -p 5555 t -c "copy tp from program 'unzip -p /tmp/tp.csv.zip';"
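If the .zip archive contains more than one file, unzip -p also accepts the name of the member to extract (the archive and member names below are just placeholders):
psql -p 5555 t -c "copy tp from program 'unzip -p /tmp/archive.zip data/tp.csv';"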

Related

PostgreSQL COPY pipe output to gzip and then to STDOUT

The following command works well
$ psql -c "copy (select * from foo limit 3) to stdout csv header"
# output
column1,column2
val1,val2
val3,val4
val5,val6
However the following does not:
$ psql -c "copy (select * from foo limit 3) to program 'gzip -f --stdout' csv header"
# output
COPY 3
Why do I have COPY 3 as the output from this command? I would expect that the output would be the compressed CSV string, after passing through gzip.
The command below works, for instance:
$ psql -c "copy (select * from foo limit 3) to stdout csv header" | gzip -f -c
# output (this garbage is just the compressed string and is as expected)
߉T`M�A �0 ᆬ}6�BL�I+�^E�gv�ijAp���qH�1����� FfВ�,Д���}������+��
How can I write a single SQL command that pipes the result directly into gzip and sends the compressed string to STDOUT?
When you use COPY ... TO PROGRAM, the PostgreSQL server process (backend) starts a new process and pipes the file to the process's standard input. The standard output of that process is lost. It only makes sense to use COPY ... TO PROGRAM if the called program writes the data to a file or similar.
If your goal is to compress the data that go across the network, you could use sslmode=require sslcompression=on in your connect string to use the SSL network compression feature I built into PostgreSQL 9.2. Unfortunately this has been deprecated and most OpenSSL binaries are shipped with the feature disabled.
There is currently a native network compression patch under development, but it is questionable whether that will make v14.
Other than that, you cannot get what you want at the moment.
copy is running gzip on the server and not forwarding the STDOUT from gzip on to the client.
You can use \copy instead, which would run gzip on the client:
psql -q -c "\copy (select * from foo limit 3) to program 'gzip -f --stdout' csv header"
This is fundamentally the same as piping to gzip, which you show in your question.
If the goal is to compress the output of copy so it transfers faster over the network, then...
psql "postgresql://ip:port/dbname?sslmode=require&sslcompression=1"
It should display "compression active" if it's enabled. That probably requires some server config variable to be enabled though.
Or you can simply use ssh:
ssh user@dbserver "psql -c \"copy (select * from foo limit 3) to stdout csv header\" | gzip -f -c" >localfile.csv.gz
But... of course, you need ssh access to the db server.
If you don't have ssh access to the db server, maybe you have ssh access to another box in the same datacenter with a fast network link to the db server; in that case you can ssh to that box instead of the db server. Data will be transferred uncompressed between that box and the database, compressed on the box, and piped via ssh to your local machine. That will even save CPU on the database server, since it won't be doing the compression.
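A sketch of that relay setup, with hypothetical hostnames (jumpbox is the intermediate machine; it needs psql and network access to the database server):
ssh user@jumpbox "psql -h dbserver -d dbname -c \"copy (select * from foo limit 3) to stdout csv header\" | gzip -c" > localfile.csv.gz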
If that doesn't work, well then, why not put the ssh command into the "to program" and have the server send it via ssh to your machine? You'll have to set up your router and open a port, but it can be done. Of course you'll have to find a way to put the password on the ssh command line, which is usually a big no-no, but maybe just this once. Or just use netcat instead, which doesn't require a password.
Also, if you want speed, use zstd instead of gzip.
Here's an example with netcat. I just tested it and it worked.
On the destination machine, which is 192.168.0.1:
nc -lp 65001 | zstd -d >file.csv
In another terminal:
psql -c "copy (select * from foo) to program 'zstd -9 |nc -N 192.168.0.1 65001' csv header" test
Note the -N option for netcat, which makes it shut down the connection after it reaches EOF on its input.
You can use COPY ... TO PROGRAM:
COPY foo_table TO PROGRAM 'gzip > /tmp/foo_table.csv' DELIMITER ',' CSV HEADER;

How to import a sample DB into postgres?

According to a website I can download their sample file dvdrental.zip, but
The database file is in zip format (dvdrental.zip) so you need to extract it to dvdrental.tar
First of all, what is a tar? I thought it had to be tar.gz to be compressed? I don't even know how to create a "tar" by itself. I tried:
tar -zcvf dvdrental.tar.gz dvdrental
and
tar -cf dvdrental.tar dvdrental
I try to import with pgAdmin 4 and I get either:
pg_restore: [archiver] input file does not appear to be a valid archive
or
pg_restore: [tar archiver] could not find header for file "toc.dat" in tar archive
respectively. Now, don't ask me why a popular tutorial site created a file in the wrong format. But, can you tell me how to repackage this file so I can use it as a sample DB?
Using Mac OS 10.12.4. Postgres 9.6. And PgAdmin 4 (not sure if it's in beta? It crashed and does all kinds of nonsensical window movement and highlighting)
I extracted the .zip archive first, then opened pgAdmin and followed the guide "Load the DVD Rental database using pgAdmin":
https://www.postgresqltutorial.com/load-postgresql-sample-database/
Pay attention to changing the 'Format' field from 'Custom or Tar' to 'Directory'. Then you should be able to restore the DB.
If you look into the .tar archive you will find restore.sql, which says at the top:
-- File paths need to be edited. Search for $$PATH$$ and
-- replace it with the path to the directory containing
-- the extracted data files.
So to create the sample DB you could extract the .tar contents somewhere and use a single command:
sed -e 's/\$\$PATH\$\$/\/path\/to\/extracted\/files/g' restore.sql | psql
Or
sed -e 's/\$\$PATH\$\$/\/path\/to\/extracted\/files/g' restore.sql > r.sql
and execute the contents of r.sql using pgAdmin.
Get the sample dataset from the link you cited and save it somewhere.
Assuming Postgres is installed and running, do the following:
Run createdb dvdrental
Run pg_restore -d dvdrental ./dvdrental where "./dvdrental" is the path to the downloaded and unzipped file.
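Putting it together, a minimal sketch assuming the zip was downloaded to the current directory (pg_restore can read the tar-format archive directly):
unzip dvdrental.zip        # produces dvdrental.tar
createdb dvdrental
pg_restore -d dvdrental dvdrental.tar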
To create the sample DB in Postgres, follow these steps:
1.- Create directory and enter it:
mkdir -p /tmp/dvdrental && cd /tmp/dvdrental
2.- Download zip file dvdrental.zip:
wget https://www.postgresqltutorial.com/wp-content/uploads/2019/05/dvdrental.zip
3.- Uncompress the .zip file and then the .tar:
unzip dvdrental.zip
tar -xvf dvdrental.tar
4.- Replace the $$PATH$$ variable at execution time and check the result with grep:
sed -e 's/\$\$PATH\$\$/\/tmp\/dvdrental/g' restore.sql | grep --color dvdrental
5.- Import the sample DB for a specific host (localhost), port (5433), user (db) and database name (postgres):
sed -e 's/\$\$PATH\$\$/\/tmp\/dvdrental/g' restore.sql | psql -h localhost -p 5433 -U db -d postgres
Finally, verify that the import succeeded with pgAdmin III.

PSQL COPY can't find file

I am trying to execute this command:
PS C:\Program Files\PostgreSQL\9.4\bin> .\psql.exe -h front.linux-pgsql01.qa.local -p 5432 -d site -U qa -w -c "Delete from product_factor_lolek; COPY product_factor_lolek FROM E'C:\\OP_data\\SEARCH\\1.csv' delimiter '^' CSV;"
My file is located at this path: C:\OP_data\SEARCH\1.csv. But I get an error:
ERROR: could not open file "C:\OP_data\SEARCH\1.csv" for reading: No such file or directory
I am using Windows Server, PostgreSQL 9.4. What should I write for the correct path?
P.S. I can't use \COPY
The COPY command will attempt to access your CSV file on the server (front.linux-pgsql01.qa.local), not on the client. So you must either send your CSV there and point the command to its path, or use STDIN:
psql.exe -h front.linux-pgsql01.qa.local -p 5432 -d site -U qa -w -c "Delete from product_factor_lolek; COPY product_factor_lolek FROM STDIN delimiter '^' CSV;" < C:\OP_data\SEARCH\1.csv

Copying data from local .CSV file to pgsql table in remote server

I am trying to copy data from a CSV file on my local machine into a remote pgsql table named states, but I am getting ERROR: Syntax error at or near "FROM". Can someone guide me as to why I am receiving this error?
COPY FROM STDIN states FROM '/Users/Shared/data.csv' DELIMITER AS ',';
The problem is that the path to the file is interpreted on the remote server, not on your local machine.
You need psql, and you need to pipe the file to STDIN:
psql -h host -d remoteDB -U myuser -c "copy states from STDIN with delimiter as ',';" < /path/file.csv
alternatively you can also do:
cat /path/file.csv | psql -h host -d remoteDB -U myuser -c "copy states from STDIN with delimiter as ',';"
You can't do it directly with a filename unless the file is on the Postgres server. The docs of COPY state:
COPY with a file name instructs the PostgreSQL server to directly read from or write to a file. The file must be accessible to the server and the name must be specified from the viewpoint of the server. When STDIN or STDOUT is specified, data is transmitted via the connection between the client and the server.
You'll have to pipe the file in via STDIN.
Many Postgres drivers provide a method to make this easier. For example, ruby-pg provides copy_data.
conn.copy_data "COPY states FROM STDIN (FORMAT CSV)" do
  File.foreach('/Users/Shared/data.csv') do |line|
    conn.put_copy_data(line)
  end
end

Bulk loading into PostgreSQL from a remote client

I need to bulk load a large file into PostgreSQL. I would normally use the COPY command, but this file needs to be loaded from a remote client machine. With MSSQL, I can install the local tools and use bcp.exe on the client to connect to the server.
Is there an equivalent way for PostgreSQL? If not, what is the recommended way of loading a large file from a client machine if I cannot copy the file to the server first?
Thanks.
The COPY command is supported in PostgreSQL protocol v3.0 (PostgreSQL 7.4 or newer).
The only thing you need to use COPY from a remote client is a libpq-enabled client, such as the psql command line utility.
From the remote client run:
$ psql -d dbname -h 192.168.1.1 -U uname < yourbigscript.sql
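For context, yourbigscript.sql would contain a COPY ... FROM stdin statement followed by the data inline (tab-separated in the default text format) and a terminating \., which is the format pg_dump emits; the table and column names here are made up:
COPY mytable (id, name) FROM stdin;
1	foo
2	bar
\.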
You can use the \copy command from the psql tool, like:
psql -h IP_REMOTE_POSTGRESQL -d DATABASE -U USER_WITH_RIGHTS -c "\copy TABLE(FIELD_LIST_SEPARATE_BY_COMMA) from 'FILE_IN_CLIENT_MACHINE(MAYBE IN THE SAME DIRECTORY)' with csv header"
Assuming you have some sort of client in order to run the query, you can use the COPY FROM STDIN form of the COPY command: http://www.postgresql.org/docs/current/static/sql-copy.html
Use psql's \copy command to load the data:
$ psql -h <IP> -p <port> -U <username> -d <database>
database# \copy schema.tablename from '/home/localdir/bulkdir/file.txt' delimiter as '|'
database# \copy schema.tablename from '/home/localdir/bulkdir/file.txt' with csv header
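Since the thread started with compressed input: the same client-side approach works for a gzipped file by piping it through zcat into COPY ... FROM STDIN (host, table and file names are placeholders):
zcat /home/localdir/bulkdir/file.csv.gz | psql -h <IP> -p <port> -U <username> -d <database> -c "copy schema.tablename from stdin with csv header"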