Can PostgreSQL convert entries to UTF-8 even though the input is Latin1? - postgresql

I have psql (PostgreSQL) 10.10 and client_encoding is UTF8. Now entries are made by an older Delphi version which cannot use UTF8 so the entries in the DB have the special signs not represented as UTF8. A ™ sign is represented by \u0099 for instance. Is it possible to force a conversion when the sign is entered into the data base? Switching Delphi is not an option right now. I am sorry if this is a basic question. My knowledge about data bases is limited.

It looks like your Delphi client is not using LATIN1, but WINDOWS-1252, because ™ is code point 99 in that encoding.
You can change client_encoding per session, and that is what you should do.
Either let your application execute
SET client_encoding = WIN1252;
or set the PGCLIENTENCODING environment variable or specify client_encoding as part of the connect string.

Related

Convert Emoji UTF8 to Unicode in Powershell

I need to persist data in a database with powershell. Occasionally the data contains an emoji.
DB: On the DB side everything should be fine. The attributes are set to NVARCHAR which makes it possible to persist emojis. When I inserted the data manually the emoji got displayed after I query them(🤝💰).
I tested it with example data in SSMS and it worked perfectly.
Powershell: When preparing the SQL Statement in Powershell I noticed that the emojis are interpreted in UTF8 (ðŸ¤ðŸ’°). Basically gibberish.
Is a conversion from UTF8 to Unicode even necessary? How can I persist the emojis as 🤝💰 and not as ðŸ¤ðŸ’°/1f600
My colleague had the correct answer to this problem.
To persist emojis in a MS SQL Database you need to declare the column as nvarchar(max) (max is not necessarily), which I already did.
I tried to persist example data which I had hardcoded in my PS Script like this
#{ description = "Example Description😊😊 }
Apparently VS Code adds some kind of encoding on top of the data(our guess).
What basically solved the issue was simply requesting the data from the API and persist it into the database with prefix string literal with N + the nvarchar(max) column datatype
Example:
SET #displayNameFormatted = N'"+$displayName+"'
And then include that variable in my insert statement.
Does this answer your question? "Use NVARCHAR(size) datatype and prefix string literal with N:"
Add emoji / emoticon to SQL Server table
1 Emoji in powershell is 2 utf16 surrogate characters, since the code for it is too high for 16 bits. Surrogates and Supplementary Characters
'🤝'.length
2
Microsoft has made "unicode" a confusing term, since that's what they call utf16 le encoding.
Powershell 5.1 doesn't recognize utf8 no bom encoded scripts automatically.
We don't know what your commandline actually is, but see also: Unicode support for Invoke-Sqlcmd in PowerShell

Which Postgres multi client_encoding best practice?

I have a DB server and client with encoding UTF8,and the client some time write data in SJIS or LATIN1,
ERROR: invalid byte sequence for encoding "UTF8": 0xd5 0x78
I tried to set client_encoding to SJIS then it's worked
I wonder why this error happened?
Because I think UTF8 was support both of them descriped in this docs
https://www.postgresql.org/docs/11/multibyte.html#id-1.6.10.5.7
Are the any way to make it automatic convert without setting encoding manually?
No, you need to explicitly tell it what encoding the client uses.
Postgres does support automatic conversion between the encodings, but it does not support automatic encoding detection. If the client is assumed to use UTF-8 but then writes some SJIS bytes, they simply might end up as invalid.

Character encoding for Postgres API function return values?

I have a 9.0 postgres server instance and a database using UTF8 character encoding with German_Germany.1252 collation. I'm trying to get my libpq error messages on the client as US-ASCII strings. To this end I do:
PQsetClientEncoding( connection, "SQL_ASCII" );
which returns no error. However, the strings returned from PQerrorMessage() still seem to be UTF8.
Is the return value from PQerrorMessage always guaranteed to be UTF8? No matter the client/server settings?
SQL_ASCII as a client encoding means, pass the bytes through as is, which is exactly what you didn't want. There actually isn't any client encoding that corresponds to just ASCII. If your messages are in German, then you might want a setting such as LATIN1 or LATIN9. Otherwise change the language to English and the messages will be in ASCII anyway.

Character with encoding UTF8 has no equivalent in WIN1252

I am getting the following exception:
Caused by: org.postgresql.util.PSQLException: ERROR: character 0xefbfbd of encoding "UTF8" has no equivalent in "WIN1252"
Is there a way to eradicate such characters, either via SQL or programmatically?
(SQL solution should be preferred).
I was thinking of connecting to the DB using WIN1252, but it will give the same problem.
I had a similar issue, and I solved by setting the encoding to UTF8 with \encoding UTF8 in the client before attempting an INSERT INTO foo (SELECT * from bar WHERE x=y);. My client was using WIN1252 encoding but the database was in UTF8, hence the error.
More info is available on the PostgreSQL wiki under Character Set Support (devel docs).
What do you do when you get this message? Do you import a file to Postgres? As devstuff said it is a BOM character. This is a character Windows writes as first to a text file, when it is saved in UTF8 encoding - it is invisible, 0-width character, so you'll not see it when opening it in a text editor.
Try to open this file in for example Notepad, save-as it in ANSI encoding and add (or replace similar) set client_encoding to 'WIN1252' line in your file.
Don't eridicate the characters, they're real and used for good reasons. Instead, eridicate Win1252.
I had a very similar issue. I had a linked server from SQL Server to a PostgreSQL database. Some data I had in the table I was selecting from using an openquery statement had some character that didn't have an equivalent in Win1252. The problem was that the System DSN entry (to be found under the ODBC Data Source Administrator) I had used for the connection was configured to use PostgreSQL ANSI(x64) rather than PostgreSQL Unicode(x64). Creating a new data source with the Unicode support and creating a new modified linked server and refernecing the new linked server in your openquery resolved the issue for me. Happy days.
That looks like the byte sequence 0xBD, 0xBF, 0xEF as a little-endian integer. This is the UTF8-encoded form of the Unicode byte-order-mark (BOM) character 0xFEFF.
I'm not sure what Postgre's normal behaviour is, but the BOM is normally used only for encoding detection at the beginning of an input stream, and is usually not returned as part of the result.
In any case, your exception is due to this code point not having a mapping in the Win1252 code page. This will occur with most other non-Latin characters too, such as those used in Asian scripts.
Can you change the database encoding to be UTF8 instead of 1252? This will allow your columns to contain almost any character.
I was able to get around it by using Postgres' substring function and selecting that instead:
select substring(comments from 1 for 200) from billing
The comment that the special character started each field was a great help in finally resolving it.
This problem appeared for us around 19/11/2016 with our old Access 97 app accessing a postgresql 9.1 DB.
This was solved by changing the driver to UNICODE instead of ANSI (see plang comment).
Here's what worked for me :
1 enable ad-hoc queries in sp_configure.
2 add ODBC DSN for your linked PostgreSQL server.
3 make sure you have both ANSI and Unicode (x64) drivers (try with both).
4 run query like this below - change UID, server ip, db name and password.
5 just keep the query in last line in postgreSQL format.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE
GO
EXEC sp_configure 'ad hoc distributed queries', 1
RECONFIGURE
GO
SELECT * FROM OPENROWSET('MSDASQL',
'Driver=PostgreSQL Unicode(x64);
uid=loginid;
Server=1.2.3.41;
port=5432;
database=dbname;
pwd=password',
'select * FROM table_name limit 10;')
I have face this issue when my Windows 10 using Mandarin China as default language. This problem has occurred because I did try to import a database with UTF-8. Checking via psql and do "\l", it shows collate and cytpe is Mandarin China.
The solution, reset OS language back to US and re-install PostgreSQL. As the collate back to UTF-8, you can reset back your OS language again.
I write the full context and solution here https://www.yodiw.com/fix-utf8-encoding-win1252-cputf8-postgresql-windows-10/

Encoding problems with ogr2ogr and Postgis/PostgreSQL database

In our organization, we handle GIS content in different file formats. I need to put these files into a PostGIS database, and that is done using ogr2ogr. The problem is, that the database is UTF8 encoded, and the files might have a different encoding.
I found descriptions of how I can specify the encoding by adding an options parameter to org2ogr, but appearantly it doesn't have an effect.
ogr2ogr -f PostgreSQL PG:"host=localhost user=username dbname=dbname \
password=password options='-c client_encoding=latin1'" sourcefile;
The error I recieve is:
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "målsætning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe56c73
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: ALTER TABLE "soer_vd" ADD COLUMN "påvirkning" CHAR(10)
ERROR: invalid byte sequence for encoding "UTF8": 0xe57669
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
ERROR 1: INSERT command for new feature failed.
ERROR: invalid byte sequence for encoding "UTF8": 0xf8
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
Currently, my source file is a Shape file and I'm pretty sure, that it is Latin1 encoded.
What am I doing wrong here and can you help me?
Kind regards, Casper
Magnus is right and I will discuss the solution here.
I have seen the option to inform PostgreSQL about character encoding, options=’-c client_encoding=xxx’, used many places, but it does not seem to have any effect. If someone knows how this part is working, feel free to elaborate.
Magnus suggested to set the environment variable PGCLIENTENCODING to LATIN1. This can, according to a mailing list I queried, be done by modifying the call to ogr2ogr:
ogr2ogr -–config PGCLIENTENCODING LATIN1 –f PostgreSQL
PG:”host=hostname user=username dbname=databasename password=password” inputfile
This didn’t do anything for me. What worked for me was to, before the call to ogr2ogr, to:
SET PGCLIENTENCODING=LATIN1
It would be great to hear more details from experienced users and I hope it can help others :)
That does sound like it would set the client encoding to LATIN1. Exactly what error do you get?
Just in case ogr2ogr doesn't pass it along properly, you can also try setting the environment variable PGCLIENTENCODING to latin1.
I suggest you double check that they are actually LATIN1. Simply running file on it will give you a good idea, assuming it's actually consistent within the file. You can also try sending it through iconv to convert it to either LATIN1 or UTF8.
You need to write your command line like this :
PGCLIENTENCODING=LATIN1 ogr2ogr -f PostgreSQL PG:"dbname=...
Currently, OGR from GDAL does not perform any recoding of character data during translation between vector formats. The team has prepared RFC 23.1: Unicode support in OGR document which discusses support of recoding for OGR drivers. The RFC 23 was adopted and the core functionality was already released in GDAL 1.6.0. However, most of OGR drivers have not been updated, including Shapefile driver.
For the time being, I would describe OGR as encoding agnostic and ignorant. It means, OGR does take what it gets and sends it out without any processing. OGR uses char type to manipulate textual data. This is fine to handle multi-byte encoded strings (like UTF-8) - it's just a plain stream of bytes stored as array of char elements.
It is advised that developers of OGR drivers should return UTF-8 encoded strings of attribute values, however this rule has not been widely adopted across OGR drivers, thus making this functionality not end-user ready yet.
On windows a command is
SET PGCLIENTENCODING=LATIN1
On linux
export PGCLIENTENCODING=LATIN1
or
PGCLIENTENCODING=LATIN1
Moreover this discussion help me:
https://gis.stackexchange.com/questions/218443/ogr2ogr-encoding-on-windows-using-os4geo-shell-with-census-data
On windows
SET PGCLIENTENCODING=LATIN1 ogr2ogr...
do not give me any error, but ogr2ogr do not works...I need to change the system variable (e.g. System--> Advanced system settings--> Environment variables -->New system variable) reboot the system and then run
ogr2ogr...
I solved this problem using this command:
pg_restore --host localhost --port 5432 --username postgres --dbname {DBNAME} --schema public --verbose "{FILE_PATH to import}"
I don't know if this is the right solution, but it worked for me.
For some reason, I dont know why, I could not import tables with ÅÄÖ in them to the public schema.
When I created a new schema I could import the tables to the new schema.