How to skip first several lines when importing CSV from Postgresql? - postgresql

I was trying to import a CSV file to my Postgresql with the frist 8 lines skipped and start from the 9th line. My codes below works to read from the second line and treat the first line as header:
create table report(
id integer,
name character(3),
orders integer,
shipments float
);
COPY report
FROM 'C:\Users\sample.csv' DELIMITER ',' CSV HEADER;
Now how to improve this code to read from the 9th line.
Thank you!
CSV details

With PostgreSQL 9.3 or newer, COPY can refer to a program to preprocess the data, for instance Unix tail.
To start an import at line 9:
COPY report FROM PROGRAM 'tail -n +9 /path/to/file.csv' delimiter ',' csv;
Apparently you're using Windows, so tail might not be immediately available. Personally I would install it from MSYS, otherwise there are alternatives mentioned in
Looking for a windows equivalent of the unix tail command
or Windows equivalent of the 'tail' command.

Related

COPY Postgres table with Delimiter as double byte

I want to copy a Postgres (version 11) table into a csv file with delimiter as double byte character. Please assist if this can be achieved.
I am trying this:
COPY "Tab1" TO 'C:\Folder\Tempfile.csv' with (delimiter E'অ');
Getting an error:
COPY delimiter must be a single one-byte character
You could use COPY TO PROGRAM. On Unix system that could look like
COPY "Tabl" TO PROGRAM 'sed -e ''s/|/অ/g'' > /outfile.csv' (FORMAT 'csv', delimiter '|');
Choose a delimiter that does not occur in the data. On Windows, perhaps you can write a Powershell command that translates the characters.

Cannot COPY UTF-8 data to ScyllaDB with cqlsh

I'm trying to copy a large data set from Postgresql to ScyllaDB, which is supposed to be compatible with Cassandra.
This is what I'm trying:
psql <db_name> -c "COPY (SELECT row_number() OVER () as id, * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| \
CQLSH_HOST=172.17.0.3 cqlsh -e 'COPY test.mytable (id, "Ist Einpöster", [....]) FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
I get an obscure error without a stack trace:
:1:'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
My data, and column names, including the ones already in the created table in ScyllaDB, contain values with German text. It's not ASCII, but I haven't found anywhere to set the encoding, and everywhere I looked it seemed to be using utf-8 already. I tried this as well, and saw in the vicinity of line 1135 that, and changed it in my local cqlsh (using vim $(which cqlsh)), but it had no effect.
I'm using cqlsh 5.0.1, installed using pip. (weirdly it was pip install cqlsh==5.0.4)
I also tried the cqlsh from the docker image that I used to install ScyllaDB, and it has the exact same error.
<Update>
As suggested, I piped the data to a file:
psql <db_name> -c "COPY (SELECT row_number() OVER (), * FROM ds.my_data_set ds) TO stdout WITH (FORMAT csv, HEADER);" | head -n 1 > test.csv
I thinned it down to the first row (CSV header). Piping it to cqlsh made it cry with the same error. Then, using python3.5 interactive shell, I did this:
>>> with open('test.csv', 'rb') as fp:
... data = fp.read()
>>> data
b'row_number,..... Ist Einp\xc3\xb6ster ........`
So there we are, \xc3 in the flesh. Is it UTF-8?
>>> data.decode('utf-8')
'row_number,....... Ist Einpöster ........`
Yes, it's utf-8. So how does the error happen?
>>> data.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 336: ordinal not in range(128)
Same error text, so it's probably Python as well, but without a stack trace, I have no idea where this is happening, and default encodings are utf-8. I tried overriding the default with utf-8 but nothing changed. Still, somewhere, something is trying to decode a stream using ASCII.
This is the locale on the server/client:
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Someone on Slack suggested this answer UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
Once I added the last 2 lines in cqlsh.py at the beginning, it got past the decoding issue, but the same column was reported as invalid with another error:
:1:Invalid column name Ist Einpöster
side note:
I lost interest in this test at this point, and I'm just trying to not have an unanswered question, so please excuse the wait time. As I was trying it out as an analytical engine, coupled with Spark, as a data source for Tableau, I found "better" alternatives, like Vertica and ClickHouse. "Better" because both of them have limitations.
</Update>
How can I complete this import?
What was it?
The query passed in as an argument, contained the column list, which contained that column with a non-ASCII character. At some point, cqlsh parsed those as ascii and not utf-8, which lead to this error.
How it was fixed?
First attempt was to add these 2 lines in cqlsh:
reload(sys)
sys.setdefaultencoding('utf-8')
but that still made the script unable to work with that column.
Second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh understands stdin as "interactive", from a prompt, and not piped in. The result is that it doesn't import anything. One could just create a file, and load it from the file, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. And it's likely that it's again some limitation from the engine related to unicode and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you wish to ask it on our slack channel?
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY
Update: utf8: Print invalid UTF-8 character position
Add new validate_with_error_position function
which returns -1 if data is a valid UTF-8 string
or otherwise a byte position of first invalid
character. The position is added to exception
messages of all UTF-8 parsing errors in Scylla.
validate_with_error_position is done in two
passes in order to preserve the same performance
in common case when the string is valid.
https://github.com/scylladb/scylla/commit/ffd8c8c505b92a71df7e34d5196c7545f11cb12f

how to pass variable to copy command in Postgresql

I tried to make a variable in SQL statement in Postgresql, but it did not work.
There are many csv files stored under the path. I want to set path in Postgresql that can tell copy command where can find csv files.
SQL statement sample:
\set outpath '/home/clients/ats-dev/'
\COPY licenses (_id, name,number_seats ) FROM :outpath + 'licenses.csv' CSV HEADER DELIMITER ',';
\COPY uploaded_files (_id, added_date ) FROM :outpath + 'files.csv' CSV HEADER DELIMITER ',';
It did not work. I got error: no such files. The two files licneses.csv and files.csv are stored under /home/cilents/ats-dev on Ubuntu. I found some sultion that use "\set file 'license.csv'". It did not work for me becacuse I have many csv files. also I tried to use "from : outpath || 'licenses.csv'". it did not work ether. Appreciate for any helps.
Using 9.3.
It looks like psql does not support :variable substitution withinpsql backslash commands.
test=> \set somevar fred
test=> \copy z from :somevar
:somevar: No such file or directory
so you will need to do this via an external tool like the unix shell. e.g.
for f in *.sql; do
psql -c "\\copy $(basename $f) FROM '$f'"
done
You can try COPY command
\set outpath '\'/home/clients/ats-dev/'
COPY licenses (_id, name,number_seats ) FROM :outpath/licenses.csv' WITH CSV HEADER DELIMITER ',';
COPY uploaded_files (_id, added_date ) FROM :outpath/files.csv' WITH CSV HEADER DELIMITER ',';
Note: Files named in a COPY command are read or written directly by the server, not by the client application. Therefore, they must reside on or be accessible to the database server machine, not the client. They must be accessible to and readable or writable by the PostgreSQL user (the user ID the server runs as), not the client. Similarly, the command specified with PROGRAM is executed directly by the server, not by the client application, must be executable by the PostgreSQL user. COPY naming a file or command is only allowed to database superusers, since it allows reading or writing any file that the server has privileges to access.
Documentation: Postgresql 9.3 COPY
It may have been true when this was originally asked, that psql backslash commands didn't support variable interpolation, but in my PostgreSQL 14 instance that's no longer the case. However, the psql manpage is clear that \copy specifically does not support variable interpolation.

PostgreSQL 9.5: Skip first two lines in the text file

I have the text file to import with the following format:
columA | columnB | columnC
-----------------------------------------
1 | A | XYZ
2 | B | XZ
3 | C | YZ
I can skip first line by using:
WITH CSV HEADER;
in copy command, but got stuck while skipping second line.
If you're using COPY FROM 'filename', you could instead use COPY FROM PROGRAM to invoke some shell command which removes the header from the file and returns the rest.
In Windows:
COPY t FROM PROGRAM 'more +2 "C:\Path\To\File.txt"'
In Linux:
COPY t FROM PROGRAM 'tail -n +3 /path/to/file.txt'
If you're trying to send a local file to a remote server, you can do something similar through psql, e.g.:
tail -n +3 file.txt | psql -c 'COPY t FROM STDIN'
The COPY command can only skip the first line. The easiest solution would be to manually remove the second line before importing but if that is not possible, then you have to use a "dirty" trick.
You create a table that has a single column of varchar type and import the text file into that table. After import you run a PL/pgSQL function to read all the rows in the table (except the header rows, obviously) and extract the information you want to insert in the destination table with, for instance, the regexp_matches() or regexp_split_to_array() function. You can also automate the whole process by using an after insert trigger on the import table, if you have to import many files with the same issue.

Importing CSV file into PostgreSQL

Using MySQL Administrator GUI tool I have exported some data tables retrieved from an sql dumpfile to csv files.
I then tried to import these CSV files into a PostgreSQL database using the postgres COPY command. I've tried entering
COPY articles FROM '[insert .csv dir here]' DELIMITERS ',' CSV;
and also the same command without the delimiters part.
I get an error saying
ERROR: invalid input syntax for integer: "id"
CONTEXT: COPY articles, line 1, column id: "id"
In conclusion my question is what are some thoughts and solutions to this problem? Could it possibly be something to do with the way I created the csv files? or have I made a rookie mistake elsewhere?
If you have header columns just add the header qualifier to the copy statement as per
documentation to skip that line
http://www.postgresql.org/docs/8.4/static/sql-copy.html