Non UTF8 chars in function parameters - postgresql

I have a badly behaved client application that is sending non UTF8 characters in a sql string to postgres. The string is in the form
"select call_function('paramvalue_1', paramvalue_2')"
the function looking like:
create or replace function(parameter_1 text, parameter_2 text)
returns...
Unfortunately the client occasionaly sends dodgy characters in the function parameters. It's going to be tricky to clean up the client app and I was wondering if there is any way for a function in a UTF-8 database to accept non-UTF8 characters and strip them from the param values before they get inserted anywhere.
I'm pretty convinced the answer is no but I thought it was worth asking here.

Related

Convert Emoji UTF8 to Unicode in Powershell

I need to persist data in a database with powershell. Occasionally the data contains an emoji.
DB: On the DB side everything should be fine. The attributes are set to NVARCHAR which makes it possible to persist emojis. When I inserted the data manually the emoji got displayed after I query them(🤝💰).
I tested it with example data in SSMS and it worked perfectly.
Powershell: When preparing the SQL Statement in Powershell I noticed that the emojis are interpreted in UTF8 (ðŸ¤ðŸ’°). Basically gibberish.
Is a conversion from UTF8 to Unicode even necessary? How can I persist the emojis as 🤝💰 and not as ðŸ¤ðŸ’°/1f600
My colleague had the correct answer to this problem.
To persist emojis in a MS SQL Database you need to declare the column as nvarchar(max) (max is not necessarily), which I already did.
I tried to persist example data which I had hardcoded in my PS Script like this
#{ description = "Example Description😊😊 }
Apparently VS Code adds some kind of encoding on top of the data(our guess).
What basically solved the issue was simply requesting the data from the API and persist it into the database with prefix string literal with N + the nvarchar(max) column datatype
Example:
SET #displayNameFormatted = N'"+$displayName+"'
And then include that variable in my insert statement.
Does this answer your question? "Use NVARCHAR(size) datatype and prefix string literal with N:"
Add emoji / emoticon to SQL Server table
1 Emoji in powershell is 2 utf16 surrogate characters, since the code for it is too high for 16 bits. Surrogates and Supplementary Characters
'🤝'.length
2
Microsoft has made "unicode" a confusing term, since that's what they call utf16 le encoding.
Powershell 5.1 doesn't recognize utf8 no bom encoded scripts automatically.
We don't know what your commandline actually is, but see also: Unicode support for Invoke-Sqlcmd in PowerShell

PostgreSQL Escape Microsoft Special Characters In Select Query

PostgreSQL, DBvisualizer and Salesforce
I'm selecting records from a database table and exporting them to a csv file: comma-separated and UTF8 encoded. I send the file to a user who is uploading the data into Saleforce. I do not know Salesforce, so I'm totally ignorant on that side of this. She is reporting that some data in the file is showing up as gibberish (non UTF8) characters (see below).
It seems that some of our users are copy/pasting emails into a web form which then inserts them into our db. Dates from the email headers (I believe) are the text that are showing as gibberish.
11‎/‎17‎/‎2015‎ ‎7‎:‎26‎:‎26‎ ‎AM
becomes
‎11‎/‎16‎/‎2015‎ ‎07‎:‎26‎:‎26‎ ‎AM
The text in the db field looks normal. It's when it is exported to a csv file and then that file is viewed in a text-editor like Wordpad or Salesforce. Then she sees the odd characters.
This only happens with dates from the text that is copy/pasted into the form/db. I have no idea how, or if there is a way, remove these "unseen" characters.
It's the same three-characters each time: ‎ I did a regex_replace() on these to strip them out, but it doesn't work. I think since they are not seen in the db field, the regex does see them.
It seems like even though I cannot see these characters, they must be there in some form that is making them show in text-editors like Wordpad or the Salesforce client after being exported to csv.
I can probably do a mass search/find/replace in the text editor, but it would be nice to do this in the sql and avoid the extra step each time.
Hoping someone has seen this and knows an easy fix.
Thanks for any ideas or pointers that may help.
The sequence ‎ is a left-to-right mark, encoded in UTF-8 (as 0xE2 0x80 0x8E), but being read as if it were in Windows-1252.
A left-to-right mark is invisible, so the fact that you can't see it in the database suggests that it's encoded correctly, but without knowing precisely what path the data took after that, it's hard to guess exactly where it was misinterpreted.
In any case, you should be able to replace the character in your Postgres query by using its Unicode escape sequence: E'\u200E'

Replace characters with multi-character strings

I am trying to replace German and Dutch umlauts such as ä, ü, or ß. They should be written like ae instead of ä. So I can't simply translate one char with another.
Is there a more elegant way to do that? Actually it looks like that (not completed yet):
SELECT addr, REPLACE (REPLACE(addr, 'ü','ue'),'ß','ss') FROM search;
On my way trying different commands I got another problem:
When I searched for Ü I got this:
ERROR: invalid byte sequence for encoding "UTF8": 0xdc27
Tried it with U&'\0220', it didn't replace anything. Only by using ü (for lowercase ü) it was replaced correctly. Has to do something with unicode, but how to solve this issue?
Kind regards from Germany. :)
Your server encoding seems to be UTF8.
I suspect your client_encoding does not match, which might give you a wrong impression of what you are dealing with. Check with:
SHOW client_encoding; -- in your actual session
And read this related answers:
Can not insert German characters in Postgres
Replace unicode characters in PostgreSQL
The rest of the tool chain has to be in sync, too. When using puTTY, for instance, one has to make sure, the terminal agrees with the rest: Change settings... Window -> Translation -> Remote character set = UTF-8.
As for your first question, you already have the best solution. A couple of umlauts are best replaced with a string of replace() statements.
As you seem to know already as well, single character replacements are more efficient with (a single) translate() statement.
Related:
Replace unicode characters in PostgreSQL
Regex remove all occurrences of multiple characters in a string
Beside other reasons I decided to write the replacement in python. Like Erwin wrote before, it seems there is no better solution as combining replace- commands.
In general pretty simple, even no encoding had to benn used. My "final" solution now looks like this:
ger_UE="Ü"
ger_AE="Ä"
ger_OE="Ö"
ger_SS="ß"
dk_AA="Å"
dk_OE="Ø"
dk_AE="Æ"
cur.execute("""Select addr, REPLACE (REPLACE (REPLACE( REPLACE (REPLACE (REPLACE (REPLACE(addr, '%s','UE'),'%s','OE'),'%s','AE'),'%s','SS'),'%s','AA'),'%s','OE'),'%s','AE')
from search WHERE x = '1';"""%(ger_UE,ger_OE,ger_AE,ger_SS,dk_AA,dk_OE,dk_AE))
I am now looking forward to the speed when it hits the large table. If anyone would like to make some annotations, they are very welcome.

Differentiate properly escaped HTML metacharacters from improperly escaped ones

I'm working on a replacement for a desktop Java app, a single page app written in Scala and Lift.
I have this situation where some of data in the database has properly used HTML metacharacters, such as Unicode escape sequences for accented characters in non-English names. At the same time, I have other data with improper HTML metacharacters, such as ampersands in the names or organizations.
Good (don't escape): Universita\u0027
Bad (needs escape): Bob & Jim
How do I determine whether or not the data needs to be fixed before I send it to the client?
There are two ways to approach this. One is a function that takes a string and returns the index of any improperly escaped HTML metacharacters (which I can fix myself). Alternately it could be a function that takes a string and returns a string with the improperly escaped metacharacters fixed, and leaves the proper ones alone.

how can i know what is the format code of parameter that comes to my page?

how can i know what is the format code of parameter that comes to my page?
whorking with ASP
some of the character that i see cant found in the DB (access)
i want to know the unicode of the value
Assuming you're using C#, you can get the 16-bit characters of strings as integers using (int) foo[0].
(In VB.NET, I think it would be Convert.ToInt32(foo.Chars(0)). In VBA, Asc(foo).)
If the string you're passing to ADO is correct, but the results are not as expected, either the data in the database was already corrupted by some earlier non-Unicode-friendly process, or else Access is not treating the input string and the database record as equal for some unexpected reason. (Equivalence of Unicode strings is a bit complicated.)