Encoding problems with ogr2ogr and a PostGIS/PostgreSQL database

In our organization, we handle GIS content in different file formats. I need to put these files into a PostGIS database, and that is done using ogr2ogr. The problem is that the database is UTF-8 encoded, and the files might have a different encoding.
I found descriptions of how I can specify the encoding by adding an options parameter to ogr2ogr, but apparently it doesn't have an effect.
ogr2ogr -f PostgreSQL PG:"host=localhost user=username dbname=dbname \
password=password options='-c client_encoding=latin1'" sourcefile;
The error I receive is:
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "målsætning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe56c73
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "påvirkning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe57669
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: INSERT command for new feature failed.
ERROR: invalid byte sequence for encoding "UTF8": 0xf8
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
Currently, my source file is a shapefile, and I'm pretty sure that it is Latin1 encoded.
What am I doing wrong here and can you help me?
Kind regards, Casper

Magnus is right and I will discuss the solution here.
I have seen the option to inform PostgreSQL about the character encoding, options='-c client_encoding=xxx', used in many places, but it does not seem to have any effect. If someone knows how this part works, feel free to elaborate.
Magnus suggested to set the environment variable PGCLIENTENCODING to LATIN1. This can, according to a mailing list I queried, be done by modifying the call to ogr2ogr:
ogr2ogr --config PGCLIENTENCODING LATIN1 -f PostgreSQL
PG:"host=hostname user=username dbname=databasename password=password" inputfile
This didn't do anything for me. What worked for me was, before the call to ogr2ogr, to run:
SET PGCLIENTENCODING=LATIN1
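For example, on Windows the full sequence might look like this (host, user, database and file names are placeholders):
SET PGCLIENTENCODING=LATIN1
ogr2ogr -f PostgreSQL PG:"host=localhost user=username dbname=dbname password=password" sourcefile.shp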
It would be great to hear more details from experienced users and I hope it can help others :)

That does sound like it would set the client encoding to LATIN1. Exactly what error do you get?
Just in case ogr2ogr doesn't pass it along properly, you can also try setting the environment variable PGCLIENTENCODING to latin1.
I suggest you double-check that they are actually LATIN1. Simply running the file command on it will give you a good idea, assuming the encoding is actually consistent within the file. You can also try sending it through iconv to convert it to either LATIN1 or UTF8.
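For a plain-text source, for instance, the check and conversion might look like this (file names are placeholders):
file sourcefile.csv
iconv -f LATIN1 -t UTF-8 sourcefile.csv > sourcefile_utf8.csv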

You need to write your command line like this:
PGCLIENTENCODING=LATIN1 ogr2ogr -f PostgreSQL PG:"dbname=...

Currently, OGR from GDAL does not perform any recoding of character data during translation between vector formats. The team has prepared the RFC 23.1: Unicode support in OGR document, which discusses recoding support for OGR drivers. RFC 23 was adopted and the core functionality was released in GDAL 1.6.0; however, most OGR drivers, including the Shapefile driver, have not been updated.
For the time being, I would describe OGR as encoding agnostic and ignorant. That means OGR takes what it gets and sends it out without any processing. OGR uses the char type to manipulate textual data. This is fine for handling multi-byte encoded strings (like UTF-8): it's just a plain stream of bytes stored as an array of char elements.
It is advised that developers of OGR drivers return UTF-8 encoded strings of attribute values; however, this rule has not been widely adopted across OGR drivers, so this functionality is not yet end-user ready.

On Windows the command is
SET PGCLIENTENCODING=LATIN1
On Linux
export PGCLIENTENCODING=LATIN1
or
PGCLIENTENCODING=LATIN1
This discussion also helped me:
https://gis.stackexchange.com/questions/218443/ogr2ogr-encoding-on-windows-using-os4geo-shell-with-census-data
On Windows, running
SET PGCLIENTENCODING=LATIN1 ogr2ogr...
did not give me any error, but ogr2ogr did not work. I needed to change the system variable instead (e.g. System --> Advanced system settings --> Environment variables --> New system variable), reboot the system, and then run
ogr2ogr...
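As an alternative sketch, the variable can also be set persistently from a command prompt with setx (it only applies to newly opened consoles, so start a fresh prompt before running ogr2ogr):
setx PGCLIENTENCODING LATIN1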

I solved this problem using this command:
pg_restore --host localhost --port 5432 --username postgres --dbname {DBNAME} --schema public --verbose "{FILE_PATH to import}"
I don't know if this is the right solution, but it worked for me.

For some reason, I don't know why, I could not import tables containing ÅÄÖ into the public schema.
When I created a new schema, I could import the tables into that new schema.
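As a sketch of that workaround (schema, connection details and file name are placeholders, and it assumes the PostgreSQL driver's SCHEMA layer creation option):
psql -d dbname -c "CREATE SCHEMA import_data;"
ogr2ogr -f PostgreSQL PG:"host=localhost user=username dbname=dbname password=password" sourcefile.shp -lco SCHEMA=import_data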

Related

How to transcode Unicode to ISO 8859-1 with postgres 13

How can I transcode a UTF-8 string to Latin1 with PostgreSQL 13+?
I've read this SO thread but the functions convert(), convert_from() and convert_to() no longer exist starting from Postgres 13.
EDIT: the solution is given by Laurenz Albe, who pointed out that the functions still exist. It was only afterwards that I noticed:
Google made me land on the manual for 8.2, for which convert() has a different signature than in version 8.3+.
I tried the 8.2 SQL code, which resulted in ERROR: syntax error at or near "USING".
I couldn't find the function in the version 13 docs, because the function manual has been moved to Binary functions.
So the correct SQL should have been:
SELECT convert('text_in_utf8', 'UTF8', 'LATIN1');
convert_from and convert_to still exist, but they cannot convert from text to text, because text is always a string in the database encoding. Strings in other encodings can only be stored as bytea.
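For illustration (the literals are just examples):
SELECT convert_to('Måling', 'LATIN1');          -- text in the database encoding -> bytea in LATIN1
SELECT convert_from('\xc5'::bytea, 'LATIN1');   -- bytea in LATIN1 -> text ('Å')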
I cannot guide you any further, because you didn't tell us what problem you are trying to solve.

Can PostgreSQL convert entries to UTF-8 even though the input is Latin1?

I have psql (PostgreSQL) 10.10 and client_encoding is UTF8. Entries are made by an older Delphi version which cannot use UTF-8, so the special characters in the DB are not represented as UTF-8; a ™ sign is represented by \u0099, for instance. Is it possible to force a conversion when the sign is entered into the database? Switching Delphi is not an option right now. I am sorry if this is a basic question; my knowledge about databases is limited.
It looks like your Delphi client is not using LATIN1, but WINDOWS-1252, because ™ is code point 99 in that encoding.
You can change client_encoding per session, and that is what you should do.
Either let your application execute
SET client_encoding = WIN1252;
or set the PGCLIENTENCODING environment variable, or specify client_encoding as part of the connection string.
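For example (a sketch; connection details are placeholders), the encoding can also be passed in a libpq connection string, which psql accepts as well:
psql "host=localhost dbname=mydb user=me client_encoding=WIN1252"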

Incorrect Special Character Handling in Informatica Powercenter 9.1

I am currently working on a project in my organisation where we are migrating Informatica PowerCenter in our application from v8.1 to v9.1.
Informatica PC loads data from data files but is not able to maintain certain special characters present in a few of the input data files.
The data was getting loaded correctly in v8.1.
I tried changing the character set settings in Informatica as below:
CodePage movement = Unicode
NLS_LANG = AMERICAN_AMERICA.UTF8 to ENGLISH_UNITEDKINGDOM.UTF8
"DataMovementMode" = Unicode
After making the above settings I am getting the below error in the Informatica log:
READER_1_2_1> FR_3015 Warning! Row [2258], field [exDestination]: Data [TO] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [exDestination]: Data [IOMR] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [parentOID]: Data [O-MS1109ZTRD00:esm4:iomr-2_20040510_0_0] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2268], field [exDestination]: Data [IOMR] was truncated.
The special characters that are being sent in the data and are not being handled correctly are:
Ø
Ù
Ɨ
¿
Á
Can somebody please advise how to resolve this issue? What else needs to be changed on the Informatica side?
Do any session parameters need to be set in the database?
I posted this in another thread about special characters. Please check if this is of any help.
Start with the Source in Designer. Are you able to see the data correctly in the Source Qualifier preview? If not, you might want to set the flat file source definition encoding to UTF-8.
The Integration service should be running in Unicode mode and not ASCII mode. You can check this from the Integration service properties in Admin Console.
The target should be UTF-8 encoding.
Check the relational connection ( if the target is a database) encoding in workflow manager to ensure it is UTF-8
If the problem persists, write the output to a UTF-8 flat file and check if the data is loading properly. If yes, then the issue is with writing to the database.
Check the database settings like NLS_LANG and NLS_CHARACTERSET (for Oracle), etc.
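As a sketch for an Oracle target, the database character set can be checked with a query such as:
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');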
Also set your Integration Service (IS) to run in Unicode mode for best results, apart from configuring ODBC and relational connections to use Unicode.
Details for Unicode & ASCII:
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese/Chinese characters)
b) ASCII - the IS holds all data in a single byte
Make sure that the size of the variable is big enough to hold the data. Sometimes the warnings mentioned above are received when the size is too small to hold the incoming data.

Is there a command similar to MySQL's "set names" in MongoDB?

MongoDB uses UTF-8 internally; how do I set the output charset? Is there a command similar to MySQL's "set names"? I am using the C++ mongoclient.
I'm not sure if the C++ driver behaves differently, but from what I know you'll always get a UTF-8 encoded result back. So if you want that data converted to another character set, you need to perform the conversion yourself (I don't know what options you have in C++).
MongoDB exclusively deals with UTF-8. You can't change either the input or output character sets and encodings. You will need to do that in your application, where you also need to make sure that every string you send to MongoDB is actually UTF-8. None of the drivers currently support anything else, and it's not likely they ever will.

Character with encoding UTF8 has no equivalent in WIN1252

I am getting the following exception:
Caused by: org.postgresql.util.PSQLException: ERROR: character 0xefbfbd of encoding "UTF8" has no equivalent in "WIN1252"
Is there a way to eradicate such characters, either via SQL or programmatically?
(An SQL solution would be preferred.)
I was thinking of connecting to the DB using WIN1252, but it would give the same problem.
I had a similar issue, and I solved it by setting the encoding to UTF8 with \encoding UTF8 in the client before attempting an INSERT INTO foo (SELECT * FROM bar WHERE x=y);. My client was using WIN1252 encoding but the database was in UTF8, hence the error.
More info is available on the PostgreSQL wiki under Character Set Support (devel docs).
What do you do when you get this message? Do you import a file into Postgres? As devstuff said, it is a BOM character. This is a character Windows writes first to a text file when it is saved in UTF8 encoding; it is an invisible, zero-width character, so you will not see it when opening the file in a text editor.
Try opening this file in, for example, Notepad, save it as ANSI encoding, and add (or replace a similar) set client_encoding to 'WIN1252' line in your file.
Don't eradicate the characters; they're real and used for good reasons. Instead, eradicate Win1252.
I had a very similar issue. I had a linked server from SQL Server to a PostgreSQL database. Some data in the table I was selecting from using an OPENQUERY statement had a character that didn't have an equivalent in Win1252. The problem was that the System DSN entry (found under the ODBC Data Source Administrator) I had used for the connection was configured to use PostgreSQL ANSI(x64) rather than PostgreSQL Unicode(x64). Creating a new data source with Unicode support, creating a new modified linked server, and referencing the new linked server in your OPENQUERY resolved the issue for me. Happy days.
That looks like the byte sequence 0xBD, 0xBF, 0xEF as a little-endian integer. This is the UTF8-encoded form of the Unicode byte-order-mark (BOM) character 0xFEFF.
I'm not sure what Postgres's normal behaviour is, but the BOM is normally used only for encoding detection at the beginning of an input stream and is usually not returned as part of the result.
In any case, your exception is due to this code point not having a mapping in the Win1252 code page. This will occur with most other non-Latin characters too, such as those used in Asian scripts.
Can you change the database encoding to UTF8 instead of 1252? That would allow your columns to contain almost any character.
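If that is an option, a minimal sketch would be (the database name is a placeholder, and the locale names depend on the operating system):
CREATE DATABASE mydb_utf8
  WITH ENCODING 'UTF8'
  LC_COLLATE 'en_US.UTF-8'
  LC_CTYPE 'en_US.UTF-8'
  TEMPLATE template0;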
I was able to get around it by using Postgres' substring function and selecting that instead:
select substring(comments from 1 for 200) from billing
The comment noting that the special character started each field was a great help in finally resolving it.
This problem appeared for us around 19/11/2016 with our old Access 97 app accessing a postgresql 9.1 DB.
This was solved by changing the driver to UNICODE instead of ANSI (see plang comment).
Here's what worked for me:
1. Enable ad hoc queries in sp_configure.
2. Add an ODBC DSN for your linked PostgreSQL server.
3. Make sure you have both the ANSI and Unicode (x64) drivers (try with both).
4. Run a query like the one below; change the UID, server IP, DB name and password.
5. Keep the last line of the query in PostgreSQL format.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
EXEC sp_configure 'ad hoc distributed queries', 1
RECONFIGURE
GO
SELECT * FROM OPENROWSET('MSDASQL',
'Driver=PostgreSQL Unicode(x64);
uid=loginid;
Server=1.2.3.41;
port=5432;
database=dbname;
pwd=password',
'select * FROM table_name limit 10;')
I faced this issue when my Windows 10 was using Mandarin (China) as the default language. The problem occurred because I tried to import a database with UTF-8. Checking via psql with "\l", the collate and ctype showed as Mandarin (China).
The solution: reset the OS language back to US and re-install PostgreSQL. Once the collate is back to UTF-8, you can set your OS language back again.
I wrote up the full context and solution here: https://www.yodiw.com/fix-utf8-encoding-win1252-cputf8-postgresql-windows-10/