How to debug Postgres copy command failure - postgresql

I have around 75k records that I am loading into a Postgres table with the COPY command, and it is failing. I get this exception:
ERROR: invalid byte sequence for encoding "UTF8": 0xbd
Now I need to find which line contains this byte. Is there any way to do this? I am thinking along the lines of enabling some Postgres logging that might help, or any other solution.
Note: I am getting the issue with only one particular file. Other files load without issues.

I always seem to get a line number in my error, no matter whether I use COPY or \copy and feed a file via redirection or -f.
ERROR: invalid byte sequence for encoding "UTF8": 0xa3
CONTEXT: COPY z, line 3
If there are only a couple of bad characters and you just want to strip them, you can use iconv (assuming you're on a Unix-like system).
iconv -c --from=utf8 --to=utf8 /tmp/badchars.txt > /tmp/stripped.txt
You could always run diff against the before and after versions if you want to see what was stripped out.
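If you would rather locate the offending line yourself before loading, here is a rough sketch (GNU grep, same /tmp/badchars.txt file as above): with a UTF-8 locale, . only matches valid characters, so any line that '.*' cannot match in full contains at least one invalid byte.
LC_ALL=en_US.UTF-8 grep -naxv '.*' /tmp/badchars.txt
You can then dump the raw bytes of the reported line (for example line 3 from the error above):
sed -n '3p' /tmp/badchars.txt | od -c | head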

Related

How to fix syntax errors in postgresql .sql dump file when restoring with psql?

I have a PostgreSQL .sql dump file created by pg_dump on another Windows 10 box. I am trying to restore it on my Windows 10 laptop with
"psql -U user -d database -1 -f filename.sql". I created the database, but when I run the command to do the restore I get an error from psql after I give it my password:
psql:filename.sql:1:1: ERROR: syntax error at or near "ÿ_"
LINE 1: ÿ_;
The file looks like straight ASCII (I only see two dashes on line one; I don't see a 'y' with an umlaut anywhere). I ran file on the .sql file under Cygwin bash, and it says:
Little-endian UTF-16 Unicode text, with very long lines, with CRLF, CR line terminators
I really don't want to recreate the database by hand. I am looking for any suggestions.
I tried psql with and without the '-1' option; no luck. I tried putting a ';' at the top of the sql file, which I found suggested somewhere; again no luck.
I did a psql -l on my postgresql installation and the encoding on all my databases (including the one to which I am trying to do the restore) shows UTF8.
There really is no code. It is just that I can't seem to restore this dump file because it errors out.
I think that captures my problem. The windows box that I got the dump from is not available to me now; so I'm just hoping there is a way to get around this problem. Recreating the database by hand table by table is something I would prefer to avoid.
Thanks--
Al
In my case, this exact thing happened because I was taking the dump using Windows PowerShell, which caused extra characters to get included in the dump file.
Simply using Command Prompt to take the dump solved my problem.
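For what it's worth, a sketch of that idea (the database and file names are just placeholders): Windows PowerShell's > redirection re-encodes the output (typically as UTF-16), so either run the dump from cmd.exe or let pg_dump write the file itself with -f, so the shell never touches the bytes.
pg_dump -U user -d database -f dump.sql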
I can only give you leads on how to debug the problem, because the cause is not immediately obvious.
First, there should be a line close to the beginning of the dump file that sets client_encoding. The dump file should be in that encoding.
I can see two possibilities:
1. The file got mangled during transfer. To test for that, calculate a checksum for both files and compare. Always use binary mode to transfer PostgreSQL dumps.
2. Some editor or something else sneaked a BOM (byte order mark) into the file at the very beginning. That's my prime suspect, since the problem is at line 1. Use a hex editor or od (in Cygwin) to verify that. If this is the problem, simply replace the BOM with spaces.
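A quick sketch of both checks, run from Cygwin or any Unix-like shell (filename.sql stands for your dump file):
md5sum filename.sql                      # compare this value on both machines to rule out a mangled transfer
od -A x -t x1z filename.sql | head -n 1  # a UTF-8 BOM shows up as ef bb bf, a UTF-16LE BOM as ff fe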

ERROR: could not stat file "XX.csv": Unknown error

I run this command:
COPY XXX FROM 'D:/XXX.csv' WITH (FORMAT CSV, HEADER TRUE, NULL 'NULL')
On Windows 7, it successfully imports CSV files smaller than 1 GB.
If the file is bigger than 1 GB, I get an "unknown error".
[Code: 0, SQL State: XX000] ERROR: could not stat file "'D:/XXX.csv' Unknown error
How can I fix this issue?
You can work around this by piping the file through a program. For example, I just used this to copy from a 24 GB file on Windows 10 with PostgreSQL 11.
copy t(c,d) from program 'cmd /c "type x:\path\to\file.txt"' with (format text);
This copies the text file file.txt into the table t, columns c and d.
The trick here is to run cmd in single-command mode with /c, telling it to type out the file in question.
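For the CSV file from the question, an untested sketch of the same workaround (note that COPY ... FROM PROGRAM runs on the server and needs superuser or equivalent rights, just like the original server-side COPY FROM a file) might look like:
COPY XXX FROM PROGRAM 'cmd /c "type D:\XXX.csv"' WITH (FORMAT CSV, HEADER TRUE, NULL 'NULL');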
https://github.com/MIT-LCP/mimic-code/issues/493
alistairewj commented on Nov 3, 2018 (edited):
Okay, the could not stat file "CHARTEVENTS.csv": Unknown error is actually a bug in PostgreSQL 11. Under the hood it makes a call to fstat() to make sure the file is not a directory, and unfortunately that fstat() call is 32-bit and can't handle large files like CHARTEVENTS. I tested the build on Windows with PostgreSQL 10.5 and I didn't get this error, so I think it's fairly new.
The best workaround is to keep the files compressed (i.e. keep them as .csv.gz files) and use 7zip to load in the data directly from compressed files. In testing this seemed to still work. There is a pretty detailed tutorial on how to do this here: https://mimic.physionet.org/tutorials/install-mimic-locally-windows/
The brief version of above is that you keep the .csv.gz files, you add the 7zip binary to your windows environment path, and then you call the postgres_load_data_7zip.sql file to load in the data. You can use the postgres_checks.sql file after everything to make sure you loaded in all the data correctly.
edit: For your later error, where you are using this 7zip approach, I'm not sure why it's not loading. Try redownloading just the ADMISSIONS.csv.gz file and seeing if it still throws you that same error. Maybe there is a new version of 7zip which requires me to update the script or something!
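For reference, the load that such a script performs boils down to a COPY ... FROM PROGRAM per table; this is a sketch only, with a made-up path and 7z assumed to be on the PATH (the real commands are in postgres_load_data_7zip.sql from the tutorial):
COPY chartevents FROM PROGRAM '7z e -so C:\mimic\CHARTEVENTS.csv.gz' WITH (FORMAT CSV, HEADER TRUE);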
For anyone else who googled this Postgres error message after attempting to work with a >1 GB file in Postgres 11, I can confirm that @亚军吴's answer above is spot-on. It is indeed a size issue.
I tried a different approach, though, than @亚军吴's and @Loren's: I simply uninstalled Postgres 11 and installed the stable version of Postgres 10.7. (I'm on Windows 10, by the way, in case that matters.)
I re-ran the original code that had prompted the error and voilà, a few minutes later I'd filled a new table with data from a medium-ish-size CSV file (~3 GB). I initially tried to use CSVSplitter, per @Loren, which was working fine until I got close to running out of storage space on my machine. (Thanks, Battlefield 5.)
In my case, there isn't anything in PGSQL 11 that I was relying on that wasn't in version 10.7, so I think this could be a good solution for anyone else who runs into this problem. Thanks everyone above for contributing, especially to the OP for posting this in the first place. I cured a huge, huge headache!
This has been fixed in commit bed90759f in PostgreSQL v14.
The file limit for the error is actually 4 GB.
The fix was too invasive to be backported, so you can only upgrade to avoid the problem. Once the fix has had some field testing, you could lobby the pgsql-hackers mailing list to get it backported.
With pgAdmin and AWS, I used CSVSplitter to split into files of less than 1 GB. Lame, but it worked. pgAdmin import appends to the existing table. (I changed the escape character from ' to " in order to avoid an error caused by unquoted text in the source file. Typically I apply quotes in LibreOffice, but these files were too big to open.)
It seems this is not a database problem, but a problem of psql/pgAdmin. The workaround is to use admin software from a previous release:
Use the existing PostgreSQL 11 database
Install psql or pgAdmin from the PostgreSQL 10 installation and use it to upload the file (with the command shown in the question)
Hope this helps anyone coming across the same problem.
Add two lines to your CSV file: one at the beginning and one at the end:
COPY XXX FROM STDIN WITH (FORMAT CSV, HEADER TRUE, NULL 'NULL');
<here are the lines your file already contains>
\.
Don't forget another newline after the \. line. Then call
psql -h hostname -d dbname -U username -f 'D:/XXX.csv'
This is what worked for me:
\COPY member_data.lab_result FROM PROGRAM 'gzip -dcf lab_result.dat.gz' WITH (FORMAT 'csv', DELIMITER '|', QUOTE '`')

Cannot COPY UTF-8 data to ScyllaDB with cqlsh

I'm trying to copy a large data set from PostgreSQL to ScyllaDB, which is supposed to be compatible with Cassandra.
This is what I'm trying:
psql <db_name> -c "COPY (SELECT row_number() OVER () as id, * FROM ds.my_data_set LIMIT 20) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| \
CQLSH_HOST=172.17.0.3 cqlsh -e 'COPY test.mytable (id, "Ist Einpöster", [....]) FROM STDIN WITH DELIMITER = $$;$$ AND HEADER = TRUE;'
I get an obscure error without a stack trace:
:1:'ascii' codec can't decode byte 0xc3 in position 9: ordinal not in range(128)
My data and column names, including the ones already in the created table in ScyllaDB, contain values with German text. It's not ASCII, but I haven't found anywhere to set the encoding, and everywhere I looked it seemed to be using UTF-8 already. I tried this as well, saw the setting in the vicinity of line 1135, and changed it in my local cqlsh (using vim $(which cqlsh)), but it had no effect.
I'm using cqlsh 5.0.1, installed using pip. (weirdly it was pip install cqlsh==5.0.4)
I also tried the cqlsh from the docker image that I used to install ScyllaDB, and it has the exact same error.
<Update>
As suggested, I piped the data to a file:
psql <db_name> -c "COPY (SELECT row_number() OVER (), * FROM ds.my_data_set ds) TO stdout WITH (FORMAT csv, HEADER);" | head -n 1 > test.csv
I thinned it down to the first row (the CSV header). Piping it to cqlsh made it cry with the same error. Then, using the Python 3.5 interactive shell, I did this:
>>> with open('test.csv', 'rb') as fp:
... data = fp.read()
>>> data
b'row_number,..... Ist Einp\xc3\xb6ster ........'
So there we are, \xc3 in the flesh. Is it UTF-8?
>>> data.decode('utf-8')
'row_number,....... Ist Einpöster ........'
Yes, it's utf-8. So how does the error happen?
>>> data.decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 336: ordinal not in range(128)
Same error text, so it's probably Python as well, but without a stack trace, I have no idea where this is happening, and default encodings are utf-8. I tried overriding the default with utf-8 but nothing changed. Still, somewhere, something is trying to decode a stream using ASCII.
This is the locale on the server/client:
LANG=
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8
Someone on Slack suggested this answer: UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128)
Once I added the last two lines from that answer at the beginning of cqlsh.py, it got past the decoding issue, but the same column was reported as invalid with another error:
:1:Invalid column name Ist Einpöster
side note:
I lost interest in this test at this point, and I'm just trying to not have an unanswered question, so please excuse the wait time. As I was trying it out as an analytical engine, coupled with Spark, as a data source for Tableau, I found "better" alternatives, like Vertica and ClickHouse. "Better" because both of them have limitations.
</Update>
How can I complete this import?
What was it?
The query passed in as an argument contained the column list, which included that column with a non-ASCII character. At some point, cqlsh parsed those as ASCII rather than UTF-8, which led to this error.
How was it fixed?
The first attempt was to add these two lines in cqlsh:
reload(sys)
sys.setdefaultencoding('utf-8')
but the script was still unable to work with that column.
The second attempt was to simply pass the query from a file. If you can't, know that bash supports process substitution, so instead of this:
cqlsh -f path/to/query.cql
you can have
cqlsh -f <(echo "COPY .... FROM STDIN;")
And it's all great, except that it doesn't work either. cqlsh treats stdin as "interactive", coming from a prompt rather than piped in, with the result that it doesn't import anything. One could just create a file and load it from there, but that's an extra step that might take minutes or hours, depending on the data size.
Thankfully, POSIX systems have these virtual files like '/dev/stdin', so the above command is equivalent to this:
cqlsh -f <(echo "COPY .... FROM '/dev/stdin';")
except that cqlsh now thinks that you actually have a file, and it reads it like a file, so you can pipe your data and be happy.
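Putting it together, the pipeline from the question becomes something like this (a sketch; the column list is elided here just as it is above):
psql <db_name> -c "COPY (SELECT row_number() OVER () AS id, * FROM ds.my_data_set) TO stdout WITH (FORMAT csv, HEADER, DELIMITER ';');" \
| CQLSH_HOST=172.17.0.3 cqlsh -f <(echo "COPY test.mytable (id, \"Ist Einpöster\", ...) FROM '/dev/stdin' WITH DELIMITER = ';' AND HEADER = TRUE;")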
This would probably work, but for some reason I got the last kick:
cqlsh.sql:2:Failed to import 15 rows: InvalidRequest - Error from server: code=2200 [Invalid query] message="Batch too large", will retry later, attempt 4 of 5
I think it's funny that 15 rows are too much for a distributed storage engine. It's likely again some limitation of the engine related to Unicode, and just a wrong error message. Or I'm wrong. Nevertheless, the initial question was answered, with some BIG help from the guys in Slack.
I don't see that you ever got an answer to this. UTF-8 should be the default.
Did you try --encoding?
Docs: https://docs.scylladb.com/getting-started/cqlsh/
If you didn't get an answer here, would you like to ask it on our Slack channel?
I would try to eliminate all the extra complexity you have in there first. Try to dump a few rows into a CSV, and then load it into Scylla using COPY.
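A minimal version of that suggestion, reusing the names from the question (sample.csv is just an example file name, and the non-ASCII column names may still need the fix described in the update above):
psql <db_name> -c "\copy (SELECT row_number() OVER () AS id, * FROM ds.my_data_set LIMIT 20) TO 'sample.csv' WITH (FORMAT csv, HEADER)"
CQLSH_HOST=172.17.0.3 cqlsh -e "COPY test.mytable FROM 'sample.csv' WITH HEADER = TRUE;"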
Update: utf8: Print invalid UTF-8 character position. This adds a new validate_with_error_position function, which returns -1 if the data is a valid UTF-8 string, or otherwise the byte position of the first invalid character. The position is added to the exception messages of all UTF-8 parsing errors in Scylla. validate_with_error_position is done in two passes in order to preserve the same performance in the common case when the string is valid.
https://github.com/scylladb/scylla/commit/ffd8c8c505b92a71df7e34d5196c7545f11cb12f

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. Instead of "use Parse::MediaWikiDump;" I used "use MediaWiki::DumpFile::Compat;" and made the proper changes in the code. Then I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-UTF-8 characters contained in the dump, so I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is: what is the encoding of the wiki dump, and if I wish to convert it to UTF-8, what should I do? Or how should I modify wikiprep.pl to avoid such problems?
Many thanks
-- [solved] I should unzip the file first.
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
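Concretely, that means decompressing and then pointing wikiprep.pl at the resulting XML file, along these lines:
bunzip2 -k enwiki-20121101-pages-articles.xml.bz2
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml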

Postgresql copying data into a table

I am using the COPY command in PostgreSQL, and I have a line of data in a text file that is tab-separated which I would like to copy into the DB table.
I get an error saying:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
SQL state: 22021
Context: COPY real_acct1, line 113038
So I went to line 113038 in the text file and copied it, along with 4 or 5 neighboring lines, into a new text file, and behold, that new data went in.
Any helpful thoughts? This is parcel data attribute info.
Your problem is actually one of character encoding.
The easiest way to deal with this is to run your import data through iconv (assuming you're on a Unix machine).
iconv -f original_charset -t utf-8 originalfile > newfile
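Note that in this particular case the rejected byte is 0x00, a NUL, which PostgreSQL never accepts in text values regardless of encoding, so a conversion alone may not be enough. A sketch of the two usual steps (file names are placeholders):
file -bi originalfile                   # guess the source charset to use as original_charset
tr -d '\000' < originalfile > newfile   # strip NUL (0x00) bytes, which COPY rejects in any encoding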