Character with encoding UTF8 has no equivalent in WIN1252 - postgresql

I am getting the following exception:
Caused by: org.postgresql.util.PSQLException: ERROR: character 0xefbfbd of encoding "UTF8" has no equivalent in "WIN1252"
Is there a way to eradicate such characters, either via SQL or programmatically?
(An SQL solution would be preferred.)
I was thinking of connecting to the DB using WIN1252, but that would give the same problem.

I had a similar issue, and I solved it by setting the encoding to UTF8 with \encoding UTF8 in the client before attempting an INSERT INTO foo (SELECT * from bar WHERE x=y);. My client was using WIN1252 encoding but the database was in UTF8, hence the error.
More info is available in the PostgreSQL documentation under Character Set Support (devel docs).
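A minimal psql sketch of that workflow, reusing the placeholder names from above (foo, bar, x, y):
\encoding UTF8
INSERT INTO foo SELECT * FROM bar WHERE x = y;
Note that \encoding is a psql meta-command, not SQL; it changes the client encoding for that psql session only.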

What do you do when you get this message? Do you import a file into Postgres? As devstuff said, it is a BOM character. This is a character Windows writes first into a text file when the file is saved in UTF8 encoding; it is an invisible, zero-width character, so you won't see it when opening the file in a text editor.
Try opening the file in, for example, Notepad, saving it as ANSI encoding, and adding a set client_encoding to 'WIN1252' line to your file (or replacing the similar existing one).
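For instance, after the save-as, the top of the SQL file might look roughly like this (only the first line is the point; the rest of your script follows unchanged):
set client_encoding to 'WIN1252';
-- ... rest of the script ...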

Don't eradicate the characters; they're real and used for good reasons. Instead, eradicate Win1252.

I had a very similar issue. I had a linked server from SQL Server to a PostgreSQL database. Some of the data in the table I was selecting from with an openquery statement had characters with no equivalent in Win1252. The problem was that the System DSN entry (found under the ODBC Data Source Administrator) I had used for the connection was configured to use PostgreSQL ANSI(x64) rather than PostgreSQL Unicode(x64). Creating a new data source with Unicode support, creating a new linked server against it, and referencing the new linked server in the openquery (sketched below) resolved the issue for me. Happy days.
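A sketch of the re-pointed query, assuming the new linked server is named PG_UNICODE and the table name is a placeholder:
SELECT * FROM OPENQUERY(PG_UNICODE, 'select * from some_table limit 10;');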

That looks like the byte sequence 0xBD, 0xBF, 0xEF read as a little-endian integer. This is the UTF8-encoded form of the Unicode byte-order mark (BOM) character U+FEFF.
I'm not sure what PostgreSQL's normal behaviour is, but the BOM is normally used only for encoding detection at the beginning of an input stream and is usually not returned as part of the result.
In any case, your exception is due to this code point not having a mapping in the Win1252 code page. The same will occur with most other non-Latin characters too, such as those used in Asian scripts.
Can you change the database encoding to UTF8 instead of 1252? That would allow your columns to contain almost any character.
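To check what encodings are in use, and to create a fresh UTF8 database to migrate into (you cannot change the encoding of an existing database in place; you dump and restore into the new one), a sketch with placeholder names:
SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;
CREATE DATABASE mydb_utf8 ENCODING 'UTF8' TEMPLATE template0;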

I was able to get around it by using Postgres' substring function and selecting that instead:
select substring(comments from 1 for 200) from billing
The comment pointing out that the special character appeared at the start of each field was a great help in finally resolving it.
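If you first need to find which rows contain multi-byte (non-ASCII) characters at all, a sketch that works when the server encoding is UTF8, reusing the table and column from the example above:
SELECT comments FROM billing WHERE octet_length(comments) <> char_length(comments);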

This problem appeared for us around 19/11/2016 with our old Access 97 app accessing a PostgreSQL 9.1 DB.
It was solved by changing the driver to UNICODE instead of ANSI (see plang's comment).

Here's what worked for me:
1. Enable ad-hoc queries in sp_configure.
2. Add an ODBC DSN for your linked PostgreSQL server.
3. Make sure you have both the ANSI and Unicode (x64) drivers (try both).
4. Run a query like the one below; change the UID, server IP, DB name and password.
5. Keep the inner query on the last line in PostgreSQL syntax.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
EXEC sp_configure 'ad hoc distributed queries', 1
RECONFIGURE
GO
SELECT * FROM OPENROWSET('MSDASQL',
'Driver=PostgreSQL Unicode(x64);
uid=loginid;
Server=1.2.3.41;
port=5432;
database=dbname;
pwd=password',
'select * FROM table_name limit 10;')

I faced this issue when my Windows 10 was using Mandarin (China) as the default language. The problem occurred because I tried to import a database with UTF-8. Checking via psql with "\l" showed that the collate and ctype were Mandarin (China).
The solution: reset the OS language back to US and re-install PostgreSQL. Once the collate is back to UTF-8, you can set your OS language back again.
I wrote up the full context and solution here: https://www.yodiw.com/fix-utf8-encoding-win1252-cputf8-postgresql-windows-10/
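If you would rather not depend on the OS default locale at all, a hedged alternative is to create the database with an explicit UTF-8 locale (the database and locale names are placeholders; on Windows the locale name looks different):
CREATE DATABASE mydb ENCODING 'UTF8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8' TEMPLATE template0;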

Related

Can PostgreSQL convert entries to UTF-8 even though the input is Latin1?

I have psql (PostgreSQL) 10.10 and client_encoding is UTF8. Entries are made by an older Delphi version which cannot use UTF8, so the special signs in the DB are not represented as UTF8; a ™ sign is stored as \u0099, for instance. Is it possible to force a conversion when the sign is entered into the database? Switching Delphi is not an option right now. I am sorry if this is a basic question; my knowledge about databases is limited.
It looks like your Delphi client is not using LATIN1 but WINDOWS-1252, because ™ is code point 0x99 in that encoding.
You can change client_encoding per session, and that is what you should do.
Either let your application execute
SET client_encoding = WIN1252;
or set the PGCLIENTENCODING environment variable or specify client_encoding as part of the connect string.
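If you cannot touch the application or its environment, another option (a sketch; the role name is a placeholder) is to attach the setting to the database role the Delphi application connects as, so every new session picks it up automatically:
ALTER ROLE delphi_app SET client_encoding = 'WIN1252';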

Incorrect Special Character Handling in Informatica Powercenter 9.1

I am currently working on a project in my organisation where we are migrating Informatica PowerCenter in our application from v8.1 to v9.1.
Informatica PC loads data from data files but is not able to preserve certain special characters present in a few of the input .dat files.
The data was getting loaded correctly in v8.1.
I tried changing the character set settings in Informatica as below:
CodePage movement = Unicode
NLS_LANG changed from AMERICAN_AMERICA.UTF8 to ENGLISH_UNITEDKINGDOM.UTF8
"DataMovementMode" = Unicode
After making the above changes I am getting the below warnings in the Informatica log:
READER_1_2_1> FR_3015 Warning! Row [2258], field [exDestination]: Data [TO] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [exDestination]: Data [IOMR] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [parentOID]: Data [O-MS1109ZTRD00:esm4:iomr-2_20040510_0_0] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2268], field [exDestination]: Data [IOMR] was truncated.
The special characters that are being sent in the data and are not being handled correctly are:
Ø
Ù
Ɨ
¿
Á
Can somebody please guide me on how to resolve this issue? What else needs to be changed at the Informatica end?
Does it need any session parameters to be set in the database?
I posted this in another thread about special characters. Please check if this is of any help.
Start with the source in the Designer. Are you able to see the data correctly in the source qualifier preview? If not, you might want to set the flat file source definition encoding to UTF-8.
The Integration Service should be running in Unicode mode and not ASCII mode. You can check this from the Integration Service properties in the Admin Console.
The target should use UTF-8 encoding.
Check the relational connection (if the target is a database) encoding in Workflow Manager to ensure it is UTF-8.
If the problem persists, write the output to a UTF-8 flat file and check whether the data loads properly. If it does, then the issue is with writing to the database.
Check the database settings like NLS_LANG, NLS_CHARACTERSET (for Oracle), etc., for example with the query below.
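If the target is Oracle, a sketch of checking the server-side character sets mentioned above:
SELECT parameter, value FROM nls_database_parameters WHERE parameter LIKE '%CHARACTERSET';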
Also set your Integration Service (IS) to run in Unicode mode for best results, apart from configuring the ODBC and relational connections to use Unicode.
Details for Unicode & ASCII modes:
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese/Chinese characters)
b) ASCII - the IS holds all data in a single byte
Make sure that the size of the field is big enough to hold the data. Sometimes the warnings mentioned above appear when the size is too small for the incoming data.

Force Unicode on Data Transfer utility for iSeries AS400 for TSV tab delimited files

I am using the Data Transfer utility for IBM i in order to create TSV files from my AS400s and then import them into my SQL Server data warehouse.
Following this SO question about SSIS encoding scripts, I want to stop using the conversion in the SSIS task and have the data ready from the source.
I have tried various code pages in the TSV creation (1200 etc.), but 1208 only does half the trick: it creates UTF8, which I then have to convert to Unicode as shown in the other question.
What CCSID do I have to use to get Unicode from the start?
On IBM i, CCSID support is intended to be seamless. Imagine the situation where the table is in German encoding, your job is in English and you are creating a new table in French - all on a system whose default encoding is Chinese. Use the appropriate CCSID for each of these and the operating system will do the character encoding conversion for you.
Unfortunately, many midrange systems aren't configured properly. Their system default CCSID is 'no CCSID / binary' - a remnant of a time some 20 years ago, before CCSID support. DSPSYSVAL QCCSID will tell you what the default CCSID is for your system. If it's 65535, that's 'binary'. This causes no end of problems, because the operating system can't figure out what the true character encoding is. Because CCSID(65535) was set for many years, almost all the tables on the system have this encoding. All the jobs on the system run under this encoding. When everything on the system is 65535, then the OS doesn't need to do any character conversion, and all seems well.
Then, someone needs multi-byte characters. It might be an Asian language or, as in your case, Unicode. If the system as a whole is 'binary / no conversion', it can be very frustrating because, essentially, the system admins have lied to the operating system with respect to the character encoding that is in effect for the database and jobs.
I'm guessing that you are dealing with a CCSID(65535) environment. I think you are going to have to request some changes. At the very least, create a new work table using an appropriate CCSID, like EBCDIC US English (37). Use a system utility like CPYF to populate this table. Now try to download that, using a CCSID of, say, 13488. If that does what you need, then perhaps all you need is an intermediate table to pass your data through.
Ultimately, the right solution is a proper CCSID configuration. Have the admins set the QCCSID system value and consider changing the encoding on the existing tables. After that, the system will handle multiple encodings seamlessly, as intended.
The CCSID on IBM i called 13488 is Unicode of type UCS-2 (UTF-16 big endian). There is no "one Unicode" - there are several Unicode formats. I looked at your other question: 1208 is also Unicode, in UTF-8 form. So what exactly is meant by "to get Unicode to begin with" is not clear (you are getting Unicode to begin with, in UTF-8 format) -- but then I read your other question, and the function you mention does not say what kind of "Unicode" it expects:
using (StreamWriter writer = new StreamWriter(to, false, Encoding.Unicode, 1000000))
By default, the operating system on IBM i mainly stores data in EBCDIC database tables, and there are some rare applications built on this system that use Unicode natively. It will translate the data into whatever type of Unicode it supports.
As for SQL Server and Java - I am fairly sure they use UCS-2 type Unicode, so if you try using CCSID 13488 on the AS/400 side for the transfer, it may let you avoid the extra conversion from UTF-8, because CCSID 13488 is UCS-2 style Unicode.
https://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.html
There are two CCSIDs for UTF-8 Unicode on System i: 1208 and 1209. 1208 is UTF-8 with IBM PUA (private use area); 1209 is plain UTF-8. See the link above.

Reading special characters from FoxPro using OLEDB

I'm using the FoxPro OLEDB driver (VFPOLEDB.1) to connect to a DBF using ADO.NET. The problem I am having is that some characters don't come across correctly. For example the '²' character comes out as '_'.
I have tried issuing the SET ANSI OFF command, to no avail.
I have found that the DBF is code page 850.
Does anyone know what is going on?
FoxPro doesn't support Unicode, if that is what you appear to be getting. It only works with the ASCII 0-255 character set. Code page 850 is, I believe, MS-DOS. There is a CPCONVERT() function (for code page conversion), but I don't know whether it is usable through the OLE DB provider.
It turns out that I had to add CodePage=850 to the connection string so that it matched the DBF's code page.
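For reference, a connection string along those lines might look like this (the Data Source path is a placeholder; CodePage=850 is the part that mattered):
Provider=VFPOLEDB.1;Data Source=C:\data\dbfs\;CodePage=850;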

Encoding problems with ogr2ogr and Postgis/PostgreSQL database

In our organization, we handle GIS content in different file formats. I need to put these files into a PostGIS database, and that is done using ogr2ogr. The problem is that the database is UTF8 encoded, and the files might have a different encoding.
I found descriptions of how to specify the encoding by adding an options parameter to ogr2ogr, but apparently it doesn't have an effect.
ogr2ogr -f PostgreSQL PG:"host=localhost user=username dbname=dbname \
password=password options='-c client_encoding=latin1'" sourcefile;
The error I receive is:
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "målsætning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe56c73
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "påvirkning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe57669
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: INSERT command for new feature failed.
ERROR: invalid byte sequence for encoding "UTF8": 0xf8
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
Currently, my source file is a shapefile and I'm pretty sure that it is Latin1 encoded.
What am I doing wrong here, and can you help me?
Kind regards, Casper
Magnus is right, and I will discuss the solution here.
I have seen the option to inform PostgreSQL about the character encoding, options='-c client_encoding=xxx', used in many places, but it does not seem to have any effect. If someone knows how this part works, feel free to elaborate.
Magnus suggested setting the environment variable PGCLIENTENCODING to LATIN1. According to a mailing list I queried, this can be done by modifying the call to ogr2ogr:
ogr2ogr --config PGCLIENTENCODING LATIN1 -f PostgreSQL
PG:"host=hostname user=username dbname=databasename password=password" inputfile
This didn't do anything for me. What worked for me was to run the following before the call to ogr2ogr:
SET PGCLIENTENCODING=LATIN1
It would be great to hear more details from experienced users and I hope it can help others :)
That does sound like it would set the client encoding to LATIN1. Exactly what error do you get?
Just in case ogr2ogr doesn't pass it along properly, you can also try setting the environment variable PGCLIENTENCODING to latin1.
I suggest you double-check that the files actually are LATIN1. Simply running file on one will give you a good idea, assuming the encoding is consistent within the file. You can also try sending it through iconv to convert it to either LATIN1 or UTF8.
You need to write your command line like this:
PGCLIENTENCODING=LATIN1 ogr2ogr -f PostgreSQL PG:"dbname=...
Currently, OGR from GDAL does not perform any recoding of character data during translation between vector formats. The team has prepared the RFC 23.1: Unicode support in OGR document, which discusses support for recoding in OGR drivers. RFC 23 was adopted and the core functionality was released in GDAL 1.6.0. However, most OGR drivers have not been updated, including the Shapefile driver.
For the time being, I would describe OGR as encoding agnostic and ignorant. It takes what it gets and sends it out without any processing. OGR uses the char type to manipulate textual data, which is fine for handling multi-byte encoded strings (like UTF-8) - it's just a plain stream of bytes stored as an array of char elements.
It is advised that developers of OGR drivers return UTF-8 encoded strings for attribute values, but this rule has not been widely adopted across OGR drivers, which makes this functionality not end-user ready yet.
On Windows the command is
SET PGCLIENTENCODING=LATIN1
On Linux
export PGCLIENTENCODING=LATIN1
or
PGCLIENTENCODING=LATIN1
Moreover, this discussion helped me:
https://gis.stackexchange.com/questions/218443/ogr2ogr-encoding-on-windows-using-os4geo-shell-with-census-data
On Windows,
SET PGCLIENTENCODING=LATIN1 ogr2ogr...
does not give me any error, but ogr2ogr does not work... I needed to change the system variable (e.g. System --> Advanced system settings --> Environment variables --> New system variable), reboot the system and then run
ogr2ogr...
I solved this problem using this command:
pg_restore --host localhost --port 5432 --username postgres --dbname {DBNAME} --schema public --verbose "{FILE_PATH to import}"
I don't know if this is the right solution, but it worked for me.
For some reason, I don't know why, I could not import tables with ÅÄÖ in them into the public schema.
When I created a new schema, I could import the tables into the new schema.
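The only SQL involved in that workaround is creating the schema itself (the name is a placeholder); the answer does not spell out how the tables were then directed into it:
CREATE SCHEMA import_staging;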