Windows right single quote error when copying data into Postgres from CSV file - postgresql

I'm trying to import a CSV file into a Postgres DB (version 9.3, with the database encoding set to UTF8). Using the command below, I get the error shown beneath it:
copy mytable from 'C:/candidate_analyze.csv' delimiter ',' csv;
ERROR: invalid byte sequence for encoding "UTF8": 0x96
After researching, I see that this error is related to Windows-1252, specifically the Windows right single quotation mark used in place of an apostrophe.
There is a text field in the CSV file (called "orig_text") that contains that right single quote mark.
This copy functionality is going to be automated, so I can't manually do a search and replace for the Windows right quote mark every time.
Any ideas as to a solution to this problem?
Any help would be greatly appreciated. Thank you in advance.

The COPY command has an ENCODING option:
ENCODING
Specifies that the file is encoded in the encoding_name. If this option is omitted, the current client encoding is used.
So if your file really is encoded in windows-1252 then you could say:
copy mytable from 'C:/candidate_analyze.csv' delimiter ',' encoding 'windows-1252' csv;
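
If you would rather leave the COPY statement itself alone, the same effect can be had at the session level; a minimal sketch, again assuming the file really is Windows-1252:

set client_encoding to 'WIN1252';  -- COPY assumes this encoding when ENCODING is omitted
copy mytable from 'C:/candidate_analyze.csv' delimiter ',' csv;
reset client_encoding;             -- restore the session default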

Related

PGAdmin / Postgres : Cannot write Timestamp data to Postgres (2018-04-18 05:40:28)

When I try to write the following date-time value to Postgres using pgAdmin:
2018-04-18 05:40:28
I get the following error. ERROR: invalid input syntax for type timestamp: "2018-04-18 05:40:28"
CONTEXT: COPY timestamp, line 1, column date: "2018-04-18 05:40:28"
I am trying to write the data using the timestamp format within Postgres.
Any pointers on where I am going wrong would be much appreciated.
Thank you.
My educated guess: you have a leading BOM (byte order mark) in the file that should be removed.
How do I remove ï»¿ from the beginning of a file?
Or some exotic whitespace or non-printing character that should be removed or replaced.
Trim trailing spaces with PostgreSQL
And the offending character (well, the BOM is not a "character", strictly speaking; it's just mistaken for one) was not copied into the question. That would explain the otherwise contradictory error message.
To test, copy the "2018-04-18 05:40:28" part from the error message, paste it into a pgAdmin SQL editor window (which you seem to be using), and run:
SELECT '"2018-04-18 05:40:28"' = '"2018-04-18 05:40:28"';
---------^ BOM here?
I added a leading BOM to the first string to demonstrate. Type the second string by hand to be sure it's plain ASCII. If you get false, we are on to something here.
But we cannot be sure; your question is confusing and essential information is missing. Also, for sanity's sake, don't use the basic type names timestamp and date as identifiers.
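
If the BOM does turn out to be the culprit, it can also be stripped in SQL before the cast; a minimal sketch, assuming a UTF-8 database and a staging table holding the raw value in a text column (staging_table and raw_ts are hypothetical names):

-- E'\uFEFF' is the BOM; remove it, then cast the cleaned text
select replace(raw_ts, E'\uFEFF', '')::timestamp from staging_table;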

Decipher encoding UTF-8 issue

I have to compare two text files, one generated from SQL Server (directly) and one from Impala (through Unix). Both are saved as UTF-8 files. I converted the SQL Server file using dos2unix for a direct compare in Unix. Some of the data is encoded in a way I cannot identify.
Below is some sample data from the SQL Server file.
Rock<81>ller
<81>hern
<81>ber
R<81>cking
Below is sample data from Unix file.
Rock�ller
�ber
R�cking
�ber
I checked the files using the HxD editor, and both the SQL Server data and the Unix file showed byte code 81. I looked up code 81 in Unicode and found it is a <control> character.
I am really lost, as encoding is fairly new to me. Any help deciphering which encoding is actually used would be very helpful.
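
One way to probe a mystery byte from inside PostgreSQL is to decode the raw bytes under candidate encodings with convert_from; a small sketch (the hex below spells the Rock<81>ller sample):

-- in ISO-8859-1, byte 0x81 maps to the C1 control U+0081, which editors
-- typically render as <81> or the � replacement glyph
select convert_from('\x526f636b816c6c6572'::bytea, 'LATIN1');

For what it's worth, in the old DOS code pages CP437/CP850 byte 0x81 is ü, which would make the samples read Rocküller, über, and Rücking; PostgreSQL has no CP850 conversion, so that particular check has to happen outside the database.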

Non-ISO extended-ASCII CSV giving special character while importing in DB

I am getting a CSV from an S3 server and inserting it into PostgreSQL using Java.
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
// no charset is passed here, so InputStreamReader falls back to the platform default
BufferedReader reader = new BufferedReader(
        new InputStreamReader(object.getObjectContent())
);
For some of the rows, the value in one column contains the special character �. I tried the encodings UTF-8, UTF-16, and ISO-8859-1 with InputStreamReader, but it didn't work out.
When the encoding WIN-1252 is used, the DB still shows some special characters, but when I export the data to CSV it shows the same characters that I found in the raw file.
Again, when I open the file in Notepad the characters look fine, but when I open it in Excel the same special characters appear.
All the PostgreSQL stuff is quite irrelevant; PostgreSQL can deal with practically any encoding. Check your data with a utility such as enca to determine how it is encoded, and set your PostgreSQL session to that encoding. If the server is in the same encoding or in some Unicode encoding, it should work fine.
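
A sketch of that session-level fix, assuming enca (or a similar tool) reports the file as Windows-1252; run it once per session before the inserts:

set client_encoding = 'WIN1252';  -- declare how the incoming bytes are encoded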

Uploading data to RedShift using COPY

I am trying to upload data to RedShift using COPY command.
On this row:
4072462|10013868|default|2015-10-14 21:23:18.0|0|'A=0
I am getting this error:
Delimited value missing end quote
This is the COPY command:
copy test
from 's3://test/test.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx' removequotes escape gzip
First, I hope you know why you are getting this error: you have a single quote in one of the column values. For the removequotes option, the Redshift documentation clearly says:
If a string has a beginning single or double quotation mark but no corresponding ending mark, the COPY command fails to load that row and returns an error.
One thing is certain: removequotes is not what you are looking for.
Second, what are your options?
If preprocessing the S3 file is in your control, consider using the escape option. Per the documentation,
When this parameter is specified, the backslash character (\) in input data is treated as an escape character.
So your input row in S3 should change to something like:
4072462|10013868|default|2015-10-14 21:23:18.0|0|\'A=0
Alternatively, see if CSV DELIMITER '|' works for you (sketched below); check the Redshift COPY documentation for details.
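
A sketch of that CSV-based alternative (same table and placeholder credentials as the command above); in CSV mode a lone single quote inside an unquoted field is plain data, so the 'A=0 value loads as-is:

copy test
from 's3://test/test.gz'
credentials 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
csv delimiter '|' gzip;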

Character with encoding UTF8 has no equivalent in WIN1252

I am getting the following exception:
Caused by: org.postgresql.util.PSQLException: ERROR: character 0xefbfbd of encoding "UTF8" has no equivalent in "WIN1252"
Is there a way to eradicate such characters, either via SQL or programmatically?
(An SQL solution would be preferred.)
I was thinking of connecting to the DB using WIN1252, but it will give the same problem.
I had a similar issue, and I solved it by setting the encoding to UTF8 with \encoding UTF8 in the client before attempting an INSERT INTO foo (SELECT * FROM bar WHERE x=y);. My client was using WIN1252 encoding but the database was in UTF8, hence the error.
More info is available on the PostgreSQL wiki under Character Set Support (devel docs).
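Spelled out as a psql session (foo, bar, x, and y are the placeholder names from the answer above):

\encoding UTF8
INSERT INTO foo SELECT * FROM bar WHERE x = y;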
What do you do when you get this message? Do you import a file into Postgres? As devstuff said, it is a BOM character. This is a character Windows writes first to a text file when it is saved in UTF8 encoding; it is an invisible, zero-width character, so you won't see it when opening the file in a text editor.
Try opening the file in, for example, Notepad and saving it as ANSI encoding, then add (or replace the similar) set client_encoding to 'WIN1252' line in your file.
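
In other words, after the save-as, the first line of the file would be:

set client_encoding to 'WIN1252';

so the server knows how to interpret everything that follows.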
Don't eradicate the characters; they're real and used for good reasons. Instead, eradicate Win1252.
I had a very similar issue. I had a linked server from SQL Server to a PostgreSQL database. Some data in the table I was selecting from using an openquery statement had a character with no equivalent in Win1252. The problem was that the System DSN entry (found under the ODBC Data Source Administrator) I had used for the connection was configured to use PostgreSQL ANSI(x64) rather than PostgreSQL Unicode(x64). Creating a new data source with Unicode support, creating a new modified linked server, and referencing the new linked server in the openquery resolved the issue for me. Happy days.
That is the byte sequence 0xEF, 0xBF, 0xBD: the UTF-8 encoded form of the Unicode replacement character U+FFFD (the UTF-8 byte-order mark, by contrast, is 0xEF, 0xBB, 0xBF). A replacement character in the data usually means some earlier step already decoded the text with the wrong charset and substituted it for whatever it could not read.
In any case, your exception is due to this code point not having a mapping in the Win1252 code page. This will occur with most other non-Latin characters too, such as those used in Asian scripts.
Can you change the database encoding to be UTF8 instead of 1252? This will allow your columns to contain almost any character.
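Note that an existing database's encoding cannot simply be switched in place; the usual route is to create a fresh UTF8 database and reload the data into it. A minimal sketch (mydb_utf8 is a hypothetical name; depending on the cluster you may also need to pin LC_COLLATE/LC_CTYPE to a UTF-8 locale):

-- template0 avoids inheriting encoding/locale assumptions from template1
CREATE DATABASE mydb_utf8 ENCODING 'UTF8' TEMPLATE template0;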
I was able to get around it by using Postgres' substring function and selecting that instead:
select substring(comments from 1 for 200) from billing
A comment pointing out that the special character started each field was a great help in finally resolving it.
This problem appeared for us around 19/11/2016, with our old Access 97 app accessing a PostgreSQL 9.1 DB.
It was solved by changing the driver to UNICODE instead of ANSI (see plang's comment).
Here's what worked for me:
1. Enable ad-hoc queries in sp_configure.
2. Add an ODBC DSN for your linked PostgreSQL server.
3. Make sure you have both the ANSI and Unicode (x64) drivers (try with both).
4. Run a query like the one below, changing the UID, server IP, DB name, and password.
5. Keep the query in the last line in PostgreSQL format.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
EXEC sp_configure 'ad hoc distributed queries', 1
RECONFIGURE
GO
SELECT * FROM OPENROWSET('MSDASQL',
'Driver=PostgreSQL Unicode(x64);
uid=loginid;
Server=1.2.3.41;
port=5432;
database=dbname;
pwd=password',
'select * FROM table_name limit 10;')
I faced this issue when my Windows 10 machine was using Mandarin (China) as the default language. The problem occurred because I tried to import a database with UTF-8 encoding; checking via psql with \l showed the collate and ctype as Mandarin China.
The solution: reset the OS language back to US and re-install PostgreSQL. Once the collate is back to UTF-8, you can set your OS language back again.
I wrote up the full context and solution here: https://www.yodiw.com/fix-utf8-encoding-win1252-cputf8-postgresql-windows-10/