I have words encoded like this: "cizaña"; the encoded result is 63697A61F161.
When I try to convert it back to 'cizaña' with
select decode('63697A61F161'::text, 'hex')
I get:
"ciza\361a"
What can I do? I tried set client_encoding to 'UTF-8'; without luck.
the encoded result is 63697A61F161
"encoded" how? I think you care confusing text encoding with representation format of binary data.
63697A61F161 is the iso-8859-1 ("latin-1") encoding of the text "cizaña", with the binary represented as hex octets.
decode('63697A61F161', 'hex') produces the bytea value '\x63697A61F161' if bytea_output is hex, or 'ciza\361a' if bytea_output is escape. Either way, it's a representation of a binary string, not text.
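For example, with the escape output format you get exactly what the question shows (a quick demo; hex is the default on modern versions):
demo=> SET bytea_output = 'escape';
SET
demo=> SELECT decode('63697A61F161', 'hex');
   decode
-----------
 ciza\361a
(1 row)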
If you want text, you must decode the text encoding into the current database text encoding with convert_from, e.g.
test=> select convert_from(decode('63697A61F161', 'hex'), 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
This should help explain:
demo=> select convert_from(BYTEA 'ciza\361a', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
See? 'ciza\361a' is an octal-escape representation of the binary data for the iso-8859-1 encoding of the text 'cizaña'. It's the exact same value as the bytea hex-format value '\x63697A61F161':
demo=> select convert_from(BYTEA '\x63697A61F161', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
So:
decode and encode transform text-string representations of binary data to and from bytea values, Postgres binary objects (which are output in a text form for display). The encoding/decoding here is one of binary representation, e.g. hex or base64.
convert_from and convert_to take binary data and apply text encoding processing to convert it to or from the local native database text encoding, producing text strings. The encoding here is text encoding.
It's ... not easy to follow at first. You might have to learn more about text encodings.
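A complete round trip shows both layers at work (a sketch, assuming a UTF-8 database; the intermediate hex string is the text you would store or pass around):
test=> SELECT encode(convert_to('cizaña', 'iso-8859-1'), 'hex');
    encode
--------------
 63697a61f161
(1 row)

test=> SELECT convert_from(decode('63697a61f161', 'hex'), 'iso-8859-1');
 convert_from
--------------
 cizaña
(1 row)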
Currently I am using Python to get an emoji's Unicode representation.
I want to be able to do this using PostgreSQL. Example:
messageText
----------
😀
select unicodeValue(messageText) from table where messageText = '😀';
Result: 'U+1F600'
This assumes that the database encoding is UTF-8, but that is a requirement anyway if you want to represent such strange characters:
SELECT to_hex(ascii('😀'));
to_hex
--------
1f600
(1 row)
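If you want the exact 'U+1F600' format from the question, you can build it yourself; a minimal sketch (note that ascii() only looks at the first character of the string, and code points below U+1000 would need left-padding to four digits):
SELECT 'U+' || upper(to_hex(ascii('😀')));
 ?column?
----------
 U+1F600
(1 row)
Substitute the messageText column from the question for the literal to use it in a query.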
I'm storing binary data in a bytea field, but during the import I converted it to hex twice. How can I undo the double encoding?
My binary file starts with the character "0". In hex that's the character 30. In psql I expect to see a string that starts with \x30, since it will display it to me in hex by default. But what I see is that it starts with \x783330, where "78" is hex for "x", "33" for "3", and "30" for "0". So it's saying the stored string is: x30.
I can make it worse by casting text to a bytea, like encode(data, 'hex')::bytea, which will then turn it into \x373833333330, but I can't find a way to do the reverse. If I try decode(data::text, 'hex') it complains that it is not a valid hex string. If I use decode(substring(data::text from 3), 'hex'), I get back my original string.
You probably stored the bytea the wrong way.
If you INSERT a hexadecimal string into a bytea, it is interpreted as a string and not as hexadecimal digits unless you prepend it with \x.
See
SELECT 'DEADBEEF'::bytea, '\xDEADBEEF'::bytea;
bytea | bytea
--------------------+------------
\x4445414442454546 | \xdeadbeef
(1 row)
When you use a program to insert a bytea, there are also ways to directly insert binary data; how that is done depends on the API you are using.
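To answer the "how can I undo it" part: since the stored bytes are really just the hex text of the original (with a leading x, as described in the question), you can strip that prefix and hex-decode once. A sketch, assuming a UTF-8 database and that the table and column are called my_table and data; check with a SELECT before running the UPDATE, and adjust the substring offset if your rows also contain a leading backslash:
SELECT decode(substring(convert_from(data, 'UTF8') from 2), 'hex') FROM my_table;

UPDATE my_table SET data = decode(substring(convert_from(data, 'UTF8') from 2), 'hex');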
Let's say we have a string 'a\b'.
I need to encode it first, save it to a file, then read it from the file and put it back in the db.
How to encode and decode text that has escape characters?
select encode(E'a\b'::bytea, 'base64')
"YQg="
select decode('YQg=', 'base64')
"a\010"
After decoding I am not getting the string back in its original form.
You're using an E'' string (escape string) and casting to bytea. The result will be a representation of that string in your current database encoding - probably UTF-8.
E'a\b' is the character a then the character represented by the escape \b which is ordinal \x08. PostgreSQL represents this string with a hex-escape when printing to the terminal because it's a non-printable character. The string is still two characters long.
postgres=> SELECT E'a\b';
?column?
----------
a\x08
(1 row)
postgres=> SELECT length(E'a\b');
length
--------
2
(1 row)
The cast to bytea implicitly does a conversion to the current database encoding:
postgres=> SELECT E'a\b'::bytea;
bytea
--------
\x6108
(1 row)
(\x61 is the ASCII ordinal for a in most encodings).
Except you must be on an old PostgreSQL, or have bytea_output = escape set, resulting in octal escape output instead:
postgres=> SELECT E'a\b'::bytea;
bytea
-------
a\010
(1 row)
You need to decode the bytea back into a text string, e.g.
convert_from(decode('YQg=', 'base64'), 'utf-8')
... and even then the nonprintable character \b will be printed as \x08 by psql. You can verify that it is really that character inside the database using another client.
BTW, what's going on would be clearer if you instead explicitly encoded it when you stored it rather than relying on a cast to bytea:
encode(convert_to(E'a\b', 'utf-8'), 'base64')
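Putting both directions together (a quick check in psql, assuming a UTF-8 database; the equality test confirms the original two-character string really does come back):
postgres=> SELECT encode(convert_to(E'a\b', 'utf-8'), 'base64');
 encode
--------
 YQg=
(1 row)

postgres=> SELECT convert_from(decode('YQg=', 'base64'), 'utf-8') = E'a\b';
 ?column?
----------
 t
(1 row)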
I am attempting to store Unicode characters in UTF8 format in a DB2 database. I have confirmed that the charset is 1208 and that the database is specified to hold UTF8.
I am, however, getting odd results when querying some unicode data.
select hex(firstname), firstname from my_schema.my_table where my_pk = 1234;
The results are as below:
C383C289 Ã
The character in the result is displaying wrong. From what I gather, it's being represented by the hex values "C383C289". The actual character sent on the insert was É and should be represented in UTF8 as C389.
At this stage I'm assuming that it could be the program that I am using to query the data that is interpreting it wrong. But to what extent are the hex values (first result column) wrong? They seem to have unused fluff "83C2" between the actual bytes. Or, is "C383C289" actually correct, and some UTF8 decoding engines can't handle the fluff? This seems unlikely to me.
The clients (Toad for DB2 and WinSQL) both display the character as an Ã, which is represented in UTF8 as C383.
Edit: I tested on the CLI and it correctly returns the É character. Am I missing something? Is the "hex" function returning something that it shouldn't be?
É (U+00C9) in UTF-8 is 0xC3 0x89.
à (U+00C3) in UTF-8 is 0xC3 0x83.
U+0089 (a C1 control character, displayed as ‰ when interpreted as Windows-1252) in UTF-8 is 0xC2 0x89.
This means your insert code is taking É, encoding it to UTF-8 octets 0xC3 0x89 before then inserting those octets into the DB. The DB is interpreting them as individual characters 0xC3 and 0x89 and encoding them a second time into UTF-8, thus producing 0xC3 0x83 0xC2 0x89.
You need to fix your insert code to not perform that initial encode anymore, so the DB will see the original É as-is and not a pre-encoded version of it. How you actually do it is anyone's guess, since you did not show your actual insert code yet.
This is not really an answer, just to demonstrate the correct behaviour:
> db2 "insert into t1 values ('Élan')"
DB20000I The SQL command completed successfully.
> db2 "select hex(f1), f1 from t1"
1 F1
---------- -----
C3896C616E Élan
1 record(s) selected.
I have a normal string in PowerShell that came from a text file containing Base64 text; it is stored in $x. I am trying to decode it as such:
$z = [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($x));
This works if $x is a Base64 string that was created in PowerShell (but it's not), and it does not work on the Base64 string that came from the file: $z simply ends up as something like 䐲券.
What am I missing? For example, $x could be YmxhaGJsYWg= which is Base64 for blahblah.
In a nutshell: YmxhaGJsYWg= is in a text file, gets read into a string by this PowerShell code, and when I try to decode it I end up with 䐲券 etc.
Isn't encoding taking the text TO base64 and decoding taking base64 BACK to text? You seem to be mixing them up here. When I decode using an online decoder I get:
BASE64: blahblah
UTF8: nVnV
not the other way around. I can't reproduce it completely in PS though. See sample below:
PS > [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("blahblah"))
nV�nV�
PS > [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes("nVnV"))
blZuVg==
EDIT: I believe you're using the wrong encoding for your text. The Base64 string was encoded from a UTF-8 (or ASCII) string, not a UTF-16 ("Unicode") one.
PS > [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
blahblah
PS > [System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
汢桡汢桡
PS > [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String("YmxhaGJsYWg="))
blahblah
There are no PowerShell-native commands for Base64 conversion - yet (as of PowerShell [Core] 7.1), but adding dedicated cmdlets has been suggested in GitHub issue #8620.
For now, direct use of .NET is needed.
Important:
Base64 encoding is an encoding of binary data using bytes whose values are constrained to a well-defined 64-character subrange of the ASCII character set representing printable characters, devised at a time when sending arbitrary bytes was problematic, especially with the high bit set (byte values > 0x7f).
Therefore, you must always specify explicitly what character encoding the Base64 bytes do / should represent.
Ergo:
on converting TO Base64, you must first obtain a byte representation of the string you're trying to encode using the character encoding the consumer of the Base64 string expects.
on converting FROM Base64, you must interpret the resultant array of bytes as a string using the same encoding that was used to create the Base64 representation.
Examples:
Note:
The following examples convert to and from UTF-8 encoded strings:
To convert to and from UTF-16LE ("Unicode") instead, substitute [Text.Encoding]::Unicode for [Text.Encoding]::UTF8
Convert TO Base64:
PS> [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes('Motörhead'))
TW90w7ZyaGVhZA==
Convert FROM Base64:
PS> [Text.Encoding]::Utf8.GetString([Convert]::FromBase64String('TW90w7ZyaGVhZA=='))
Motörhead
This page shows up when you google how to convert to base64, so for completeness:
$b = [System.Text.Encoding]::UTF8.GetBytes("blahblah")
[System.Convert]::ToBase64String($b)
Base64 encoding converts each group of three 8-bit bytes (0-255) into four 6-bit values (0-63). Each 6-bit value indexes into a 64-character alphabet, so the final output is four printable ASCII characters. The alphabet is typically 'A-Za-z0-9+/', with '=' used as padding. This is why encoded data is about 4/3 as long as the input.
Base64 decoding is the inverse process. And as one would expect, the decoded data is 3/4 as long.
While base64 encoding can encode plain text, its real benefit is encoding non-printable characters which may be interpreted by transmitting systems as control characters.
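For instance, the classic three-byte string 'Man' becomes the four Base64 characters 'TWFu', and decoding gives the three bytes back:
PS> [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes('Man'))
TWFu
PS> ([Convert]::FromBase64String('TWFu')).Length
3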
I suggest the original poster render $z as bytes with each bit having meaning to the application. Rendering non-printable characters as text typically invokes Unicode which produces glyphs based on your system's localization.
Base64decode("the answer to life the universe and everything") = 00101010
If anyone would like to do it with a pipe in PowerShell (like a filter, e.g. read file contents and decode them), it can be achieved with a one-liner like this:
Get-Content base64.txt | %{[Text.Encoding]::UTF8.GetString([Convert]::FromBase64String($_))}
I had spaces showing up in between my output and could not find a fix online; after a lot of trial and error, this worked for me: [System.Text.Encoding]::UTF8.GetString(([System.Convert]::FromBase64String($base64string)|?{$_})). The |?{$_} filters out zero bytes, which is likely where the spaces came from: the original string was presumably Base64-encoded as UTF-16, so every other byte is a NUL.
Still not built in, but published to the PowerShell Gallery and authored by Microsoft:
https://github.com/powershell/textutility
TextUtility
ConvertFrom-Base64
Return a string decoded from base64.
ConvertTo-Base64
Return a base64 encoded representation of a string.