PostgreSQL invalid byte sequence for encoding "UTF8": 0xbf

I am importing a CSV file with property data. It has \n between the values. While trying to import it into a table, the following error shows up:
PostgreSQL invalid byte sequence for encoding utf8 0xbf
I tried importing just the single column, but it does not work.
Column values will look like this:
"Job No 305385917-001: To attached Garage (Single remain).\n10305 - 132 STREET NW
Plan 23AF Blk 84 Lot 14\n2002995 LERTA LTD O/A LIR HOMES DONTON\nHENORA"
I want to import the whole value above into a single column.
COPY edmonton.general_filtered (descriptive)
FROM 'D:/property_own/descriptive_details.csv'
DELIMITER ',' CSV HEADER;

Your COPY statement is correct, but your data are not in UTF8 encoding.
They are probably in Latin-1 or Windows-1252, where 0xBF is ¿.
Specify the encoding correctly, e.g.:
COPY edmonton.general_filtered (descriptive)
FROM 'D:/property_own/descriptive_details.csv'
(FORMAT 'csv', HEADER, ENCODING 'WIN1252');
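If the file is on the client machine rather than on the server, an alternative sketch is to let psql handle the conversion via \copy (this assumes the data really is Windows-1252):
-- Declare the client-side encoding; the server converts the incoming
-- data from the client encoding to the database encoding.
\encoding WIN1252
\copy edmonton.general_filtered (descriptive) FROM 'D:/property_own/descriptive_details.csv' DELIMITER ',' CSV HEADER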

Related

PostgreSQL how to read csv file with decimal comma?

I am trying to read a CSV file containing real numbers that use a comma as the decimal separator. I read the file with \copy in psql:
\copy table FROM 'filename.csv' DELIMITER ';' CSV HEADER;
psql does not recognize the comma as decimal point.
psql:filename.sql:44: ERROR: invalid input syntax for type real: "9669,84"
CONTEXT: COPY filename, line 2, column col-3: "9669,84"
I did some googling but could not find any answer other than "change the decimal comma into a decimal point". I tried SET DECIMALSEPARATORCOMMA=ON; but that did not work. I also experimented with some encoding but I couldn't find whether encoding governs the decimal point (I got the impression it didn't).
Is there really no solution other than changing the input data?
COPY into a staging table, putting the number into a varchar field. Then do something like this in psql:
--Temporarily change numeric formatting to one that uses ',' as
--decimal separator.
set lc_numeric = 'de_DE.UTF-8';
--Below is just an example. In your case the select would be part of
--insert into the target table. Also the first part of to_number
--would be the field from your staging table.
select to_number('9669,84', '99999D999');
9669.84
You might need to change the format string to match all of the numbers. For more information on what is available, see the PostgreSQL documentation on data type formatting functions, Table 9.28, "Template Patterns for Numeric Formatting".
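Putting the pieces together, a minimal sketch of the whole staging approach in psql (staging and target are placeholder names, not from the original post):
-- Staging table that accepts the raw text as-is.
CREATE TABLE staging (col3 varchar);
\copy staging FROM 'filename.csv' DELIMITER ';' CSV HEADER
-- The 'D' template pattern reads the decimal separator from lc_numeric,
-- so switch to a locale that uses a comma (it must be installed on the server).
SET lc_numeric = 'de_DE.UTF-8';
INSERT INTO target (col3)
SELECT to_number(col3, '99999D999') FROM staging;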

How to convert hex characters when using Postgres COPY FROM?

I am importing data from a file to PostgreSQL database table using COPY FROM.
Some of the strings in my file contain hexadecimal escape sequences (mostly \x0d and \x0a) and I'd like COPY to convert them into the actual characters during the import.
My problem is that they are treated as regular text and remain in the string unchanged.
How can I get the hex values converted?
Here is a simplified example of my situation:
-- The table I am importing to
CREATE TABLE my_pg_table (
id serial NOT NULL,
data text
);
COPY my_pg_table(id, data)
FROM 'location/data.file'
WITH CSV
DELIMITER ' ' -- this is actually a tab
QUOTE ''''
ENCODING 'UTF-8'
Example file:
1 'some data'
2 'some more data \x0d'
3 'even more data \x0d\x0a'
Note: the file is tab delimited.
Now, doing:
SELECT * FROM my_pg_table
would get me results still containing the literal \x0d and \x0a sequences.
Additional info for context:
My task is to export data from Sybase tables (many hundreds) and import them into Postgres. I am using UNLOAD to export the data to files like so:
UNLOAD
TABLE my_sybase_table
TO 'location/data.file'
DELIMITED BY ' ' -- this is actually a tab
BYTE ORDER MARK OFF
ENCODING 'UTF-8'
It seems to me that (for a reason I don't understand) the escape sequences are only converted when using FORMAT TEXT, while FORMAT CSV treats them as a regular string.
Solving the problem in my situation:
Because I had to use TEXT, I no longer had the QUOTE option, and because of that I couldn't have quoted strings in my files anymore. So I needed my files in a slightly different format, and eventually used this to export my table from Sybase:
UNLOAD
SELECT
COALESCE(cast(id as long varchar), '(NULL)'),
COALESCE(cast(data as long varchar), '(NULL)')
FROM my_sybase_table
TO 'location/data.file'
DELIMITED BY ' ' -- still tab delimited
BYTE ORDER MARK OFF
QUOTES OFF
ENCODING 'UTF-8'
and to import it to postgres:
COPY my_pg_table(id, data)
FROM 'location/data.file'
DELIMITER ' ' -- tab delimited
NULL '(NULL)'
ENCODING 'UTF-8'
I used (NULL) because I needed a way to differentiate between an empty string and NULL. I cast every column to long varchar to make my mass export/import more convenient.
I'd still be very interested to know why the escapes aren't converted when using FORMAT CSV.
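The reason is documented behaviour: in the text format, COPY defines its own backslash escapes (\n, \r, \t, \xdigits and so on) as part of the data representation, so \x0d is decoded into a carriage return on input. The CSV format does no backslash processing at all; special characters are handled by quoting instead, so a backslash is just an ordinary data character. If rows with the literal sequences were already loaded via CSV, a cleanup sketch along these lines should work (names as in the example above):
-- '\x0d' here is four plain characters (with standard_conforming_strings
-- on, the default since PostgreSQL 9.1); E'\r' is an actual carriage return.
UPDATE my_pg_table
SET data = replace(replace(data, '\x0d', E'\r'), '\x0a', E'\n');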

Using ASCII 31 field separator character as Postgresql COPY delimiter

We are exporting data from Postgres 9.3 into a text file for ingestion by Spark.
We would like to use the ASCII 31 field separator character as a delimiter instead of \t so that we don't have to worry about escaping issues.
We can do so in a shell script like this:
#!/bin/bash
DELIMITER=$'\x1F'
echo "copy ( select * from table limit 1) to STDOUT WITH DELIMITER '${DELIMITER}'" | (psql ...) > /tmp/ascii31
But we're wondering, is it possible to specify a non-printable glyph as a delimiter in "pure" postgres?
edit: we attempted to use the postgres escaping convention per http://www.postgresql.org/docs/9.3/static/sql-syntax-lexical.html
warehouse=> copy ( select * from table limit 1) to STDOUT WITH DELIMITER '\x1f';
and received
ERROR: COPY delimiter must be a single one-byte character
Try prepending E before the sequence you're trying to use as a delimiter. For example E'\x1f' instead of '\x1f'. Without the E, PostgreSQL will read '\x1f' as four separate characters and not a hexadecimal escape sequence, hence the error message.
See the PostgreSQL manual on "String Constants with C-style Escapes" for more information.
From my testing, both of the following work:
echo "copy (select 1 a, 2 b) to stdout with delimiter u&'\\001f'"| psql;
echo "copy (select 1 a, 2 b) to stdout with delimiter e'\\x1f'"| psql;
I've extracted a small file from Actian Matrix (formerly ParAccel, the database Amazon Redshift was derived from; both are Postgres derivatives), using this notation for ASCII character code 30, the "Record Separator".
unload ('SELECT btrim(class_cd) as class_cd, btrim(class_desc) as class_desc
FROM transport.stg.us_fmcsa_carrier_classes')
to '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter as '\036' leader;
This is an example of how this file looks in VI:
C^^Private Property
D^^Private Passenger Business
E^^Private Passenger Non-Business
I then moved this file over to a machine hosting PostgreSQL 9.5 via sftp, and used the following copy command, which seems to work well:
copy fmcsa.carrier_classes
from '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter u&'\001E';
Each derivative of Postgres, and Postgres itself, seems to prefer a slightly different notation. Too bad we don't have a single standard!

DB2 UTF-8 Data Storage - Extraneous Byte Values

I am attempting to store Unicode characters in UTF-8 format in a DB2 database. I have confirmed that the code page is 1208 and that the database is specified to hold UTF-8.
I am, however, getting odd results when querying some unicode data.
select hex(firstname), firstname from my_schema.my_table where my_pk = 1234;
The results are as below:
C383C289 Ã
The character in the result displays incorrectly. From what I gather, it's represented by the hex values "C383C289". The actual character sent on the insert was É, which should be represented in UTF-8 as C389.
At this stage I'm assuming that it could be the program that I am using to query the data that is interpreting it wrong. But to what extent are the hex values (first result column) wrong? They seem to have unused fluff "83C2" between the actual bytes. Or, is "C383C289" actually correct, and some UTF8 decoding engines can't handle the fluff? This seems unlikely to me.
The clients (Toad for DB2, and WinSQL) both display the character as Ã, which is represented in UTF-8 as C383.
Edit: I tested on the CLI and it correctly returns the É character. Am I missing something? Is the hex function returning something that it shouldn't be?
É (U+00C9) in UTF-8 is 0xC3 0x89.
à (U+00C3) in UTF-8 is 0xC3 0x83.
U+0089 (a C1 control character, which Windows-1252 renders as ‰) in UTF-8 is 0xC2 0x89.
This means your insert code is taking É, encoding it to UTF-8 octets 0xC3 0x89 before then inserting those octets into the DB. The DB is interpreting them as individual characters 0xC3 and 0x89 and encoding them a second time into UTF-8, thus producing 0xC3 0x83 0xC2 0x89.
You need to fix your insert code to not perform that initial encode anymore, so the DB will see the original É as-is and not a pre-encoded version of it. How you actually do it is anyone's guess, since you did not show your actual insert code yet.
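The arithmetic is easy to reproduce; here is the same round trip shown in PostgreSQL, purely as an illustration (the original problem is on DB2):
-- 'É' encoded once:
SELECT encode(convert_to('É', 'UTF8'), 'hex');  -- c389
-- the same two bytes misread as Latin-1 and encoded a second time:
SELECT encode(convert_to(convert_from('\xc389'::bytea, 'LATIN1'), 'UTF8'), 'hex');  -- c383c289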
This is not really an answer, just to demonstrate the correct behaviour:
> db2 "insert into t1 values ('Élan')"
DB20000I The SQL command completed successfully.
> db2 "select hex(f1), f1 from t1"
1 F1
---------- -----
C3896C616E Élan
1 record(s) selected.

SPOOL - Format columns with french characters

I am creating a file from a SELECT query using sqlplus with the SPOOL command. Some of the columns in my SELECT query contain French characters, which are not written properly to the file.
SELECT RPAD(Column1, 32, ' ') FROM TableX;
If the value of Column1 contains, for example, the character "é", then the output has length 31 instead of 32 and the "é" character is not shown correctly in the output file.
How can I format the columns so that I get proper value and length from my columns?
I found out how to resolve my formatting problem (a short sketch follows below):
1. The definition of the selected column must be changed from Column1 VARCHAR2(32 BYTE) to VARCHAR2(32 CHAR), so the length is counted in characters rather than bytes.
2. The charset environment variable NLS_LANG must accept French characters: NLS_LANG=FRENCH_FRANCE.WE8ISO8859P15.
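In SQL terms, a sketch of fix 1 (assuming Column1 lives on TableX, as in the question):
-- Count the column length in characters instead of bytes.
ALTER TABLE TableX MODIFY (Column1 VARCHAR2(32 CHAR));
Fix 2 is an environment setting on the client, made before starting sqlplus, e.g. NLS_LANG=FRENCH_FRANCE.WE8ISO8859P15.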
Thanks anyway!