Encoding and decoding in postgresql

Let's say we have a string 'a\b'.
I need to encode it first, save it to a file, then read it back from the file and put it back in the db.
How to encode and decode text that has escape characters?
select encode(E'a\b'::bytea, 'base64')
"YQg="
select decode('YQg=', 'base64')
"a\010"
After decoding I am not getting the string back in its original form.

You're using an E'' string (escape string) and casting to bytea. The result will be a representation of that string in your current database encoding - probably UTF-8.
E'a\b' is the character a then the character represented by the escape \b which is ordinal \x08. PostgreSQL represents this string with a hex-escape when printing to the terminal because it's a non-printable character. The string is still two characters long.
postgres=> SELECT E'a\b';
?column?
----------
a\x08
(1 row)
postgres=> SELECT length(E'a\b');
length
--------
2
(1 row)
The cast to bytea implicitly does a conversion to the current database encoding:
postgres=> SELECT E'a\b'::bytea;
bytea
--------
\x6108
(1 row)
(\x61 is the ASCII ordinal for a in most encodings).
Your output shows octal escapes instead, so you are probably on an old PostgreSQL (hex became the default output format in 9.0) or have bytea_output = escape set:
postgres=> SELECT E'a\b'::bytea;
bytea
-------
a\010
(1 row)
You need to decode the bytea back into a text string, e.g.
convert_from(decode('YQg=', 'base64'), 'utf-8')
... and even then the nonprintable character \b will be printed as \x08 by psql. You can verify that it is really that character inside the database using another client.
BTW, what's going on would be clearer if you instead explicitly encoded it when you stored it rather than relying on a cast to bytea:
encode(convert_to(E'a\b', 'utf-8'), 'base64')
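
The same round trip can be modelled outside the database. This is a minimal Python sketch, not part of the original answer; it assumes UTF-8 as the text encoding and only mirrors what convert_to/encode and decode/convert_from do to the bytes:

```python
import base64

original = "a\b"                 # same two characters as E'a\b' in SQL
raw = original.encode("utf-8")   # text -> bytes, like convert_to(..., 'utf-8')
encoded = base64.b64encode(raw).decode("ascii")  # bytes -> base64 text, like encode(..., 'base64')

# Reverse direction: base64 text -> bytes -> text,
# like convert_from(decode(..., 'base64'), 'utf-8')
restored = base64.b64decode(encoded).decode("utf-8")

print(encoded)         # YQg=
print(repr(restored))  # 'a\x08'
```

The restored value is the original two-character string; only its printed form uses an escape for the non-printable backspace.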

Related

(Postgresql) How to convert an emoji (or emoticon) to its unicode representation

Currently I am using python to get an emojis unicode representation.
I want to be able to do this using postgresql. Example:
messageText
----------
😀
select unicodeValue(messageText) from table where messageText = '😀';
Result: 'U+1F600'

This assumes that the database encoding is UTF-8, but that is a requirement anyway if you want to represent such strange characters:
SELECT to_hex(ascii('😀'));
to_hex
--------
1f600
(1 row)
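
For comparison with the Python approach mentioned in the question, here is a small sketch of the same idea: take the code point of a single character and format it as hex. The helper name code_point_label is made up for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Format one character as 'U+XXXX' (hypothetical helper for illustration)."""
    # ord() returns the Unicode code point, the same number ascii() returns
    # in PostgreSQL when the database encoding is UTF-8.
    return f"U+{ord(ch):04X}"

label = code_point_label("😀")
print(label)  # U+1F600
```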

Double encoded bytea in PostgreSQL

I'm storing binary data in a bytea field, but have during the import converted it twice to hex. How can I undo the double encoding?
My binary file starts with the character "0". In hex that's the character 30. In psql I expect to see the string that starts with \x30, since it will display it to me in hex by default. But what I see is that it starts with \x783330, where "78" is hex for "x", "33" from "3", and "30" for "0". So it's saying the stored string is: x30.
I can make it worse by casting text to a bytea, like encode(data, 'hex')::bytea, which will then turn it into \x373833333330, but I can't find a way to do the reverse. If I try decode(data::text, 'hex') it will complain about '' is not a valid hex string. If I use decode(substring(data::text from 3), 'hex'), I get back my original string.

You probably stored the bytea the wrong way.
If you INSERT a hexadecimal string into a bytea, it is interpreted as a string and not as hexadecimal digits unless you prepend it with \x.
See
SELECT 'DEADBEEF'::bytea, '\xDEADBEEF'::bytea;
bytea | bytea
--------------------+------------
\x4445414442454546 | \xdeadbeef
(1 row)
When you use a program to insert a bytea, there are also ways to directly insert binary data; how that is done depends on the API you are using.
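
The difference between the two casts above can be reproduced byte for byte. This is a Python sketch for illustration only, assuming plain ASCII hex digits; it shows that storing the digits as characters doubles the data, and that decoding the hex once more undoes it:

```python
hex_text = "DEADBEEF"

# What 'DEADBEEF'::bytea stores: the eight characters themselves.
as_characters = hex_text.encode("ascii")
# What '\xDEADBEEF'::bytea stores: the four bytes the digits describe.
as_hex_digits = bytes.fromhex(hex_text)

print(as_characters.hex())  # 4445414442454546
print(as_hex_digits.hex())  # deadbeef

# Undoing one accidental layer: read the stored characters as hex digits again.
recovered = bytes.fromhex(as_characters.decode("ascii"))
print(recovered == as_hex_digits)  # True
```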

Postgres - decode special characters

I have words like this encoded: "cizaña", the encoded result is 63697A61F161
When I try to convert to 'cizaña' again
select decode('63697A61F161'::text, 'hex')
I get:
"ciza\361a"
What can I do? I tried set client_encoding to 'UTF-8'; without luck.

the encoded result is 63697A61F161
"encoded" how? I think you are confusing text encoding with the representation format of binary data.
63697A61F161 is the iso-8859-1 ("latin-1") encoding of the text "cizaña" with the binary represented as hex octets.
decode('63697A61F161', 'hex') produces a bytea value that is displayed as '\x63697a61f161' if bytea_output is hex or as 'ciza\361a' if bytea_output is escape. Either way, it's a representation of a binary string, not text.
If you want text, you must decode the text encoding into the current database text encoding with convert_from, e.g.
test=> select convert_from(decode('63697A61F161', 'hex'), 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
This should help explain:
demo=> select convert_from(BYTEA 'ciza\361a', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
See? 'ciza\361a' is an octal-escape representation of the binary data for the iso-8859-1 encoding of the text 'cizaña'. It's the exact same value as the bytea hex-format value '\x63697A61F161':
demo=> select convert_from(BYTEA '\x63697A61F161', 'iso-8859-1');
convert_from
--------------
cizaña
(1 row)
So:
decode and encode transform text-string representations of binary data to and from bytea values (Postgres binary strings), which are themselves output in a text form for display. The encoding/decoding here is one of binary representation, e.g. hex or base64.
convert_from and convert_to take binary data and apply text encoding processing to convert it to or from the local native database text encoding, producing text strings. The encoding here is text encoding.
It's ... not easy to follow at first. You might have to learn more about text encodings.
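
The two layers can also be seen outside the database. A minimal Python sketch, for illustration only: bytes.fromhex plays the role of decode(..., 'hex'), and bytes.decode plays the role of convert_from.

```python
# Layer 1: hex text -> raw bytes (binary representation, like decode(..., 'hex'))
raw = bytes.fromhex("63697A61F161")

# Layer 2: raw bytes -> text (text encoding, like convert_from(..., 'iso-8859-1'))
text = raw.decode("iso-8859-1")
print(text)  # cizaña

# The same bytes are not valid UTF-8, so naming the wrong text encoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```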

Replacing nonbreaking spaces (%A0) in Postgres

I've got some values in a varchar column that are separated by nonbreaking spaces (urlencoded %A0 instead of %20). I'm trying to replace them with spaces, but can't seem to get the syntax right:
select regexp_replace('hello world', E'\xa0', ' ')
What is the correct way to encode the character in a Postgres regexp_replace function? Or, is there a better way to do the replacement?

Replacing '\xa0' didn't work for me, possibly because my strings were in UTF-8 rather than Latin1 or another encoding where the character is encoded directly as the single byte A0. (U+A0 is encoded as the two bytes C2 A0 in UTF-8.)
I found it more practical to replace it as a code point (U+A0) rather than as the encoded bytes (C2 A0 or A0):
select replace('456321 ', E'\u00a0', '') -- value is E'456321\u00a0'
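
The code point versus byte distinction is easy to check in a short Python sketch (illustration only; the sample value is the one from the query above):

```python
nbsp = "\u00a0"  # the code point U+00A0, independent of any encoding

print(nbsp.encode("utf-8"))       # b'\xc2\xa0'
print(nbsp.encode("iso-8859-1"))  # b'\xa0'

# Replacing by code point works regardless of how the text is stored.
value = "456321" + nbsp
cleaned = value.replace(nbsp, "")
print(repr(cleaned))  # '456321'
```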

This may help you
select replace('Hello world', '\xa0', '')
Ref Postgresql (Current) Section 9.4. String Functions and Operators

bytea type & nulls, Postgres

I'm using a bytea type in PostgreSQL, which, to my understanding, contains just a series of bytes. However, I can't get it to play well with nulls. For example:
=# select length(E'aa\x00aa'::bytea);
length
--------
2
(1 row)
I was expecting 5. Also:
=# select md5(E'aa\x00aa'::bytea);
md5
----------------------------------
4124bc0a9335c27f086f24ba207a4912
(1 row)
That's the MD5 of "aa", not "aa\x00aa". Clearly, I'm Doing It Wrong, but I don't know what I'm doing wrong. I'm also on an older version of Postgres (8.1.11) for reasons outside of my control. (I'll see if this behaves the same on the latest Postgres as soon as I get home...)

Try this:
# select length(E'aa\\000aa'::bytea);
length
--------
5
Updated: Why the original didn't work? First, understand the difference between one slash and two:
pg=# select E'aa\055aa', length(E'aa\055aa') ;
?column? | length
----------+--------
aa-aa | 5
(1 row)
pg=# select E'aa\\055aa', length(E'aa\\055aa') ;
?column? | length
----------+--------
aa\055aa | 8
In the first case, I'm writing a literal string with 4 unescaped characters ('a') and one escaped. The slash is consumed by the parser in a first pass, which converts the full \055 to a single char ('-' in this case).
In the second case, the first slash just escapes the second, the pair \\ is translated by the parser to a single \ and the 055 is seen as three characters.
Now, when converting a text to a bytea, escape characters (in an already parsed or produced text) are parsed/interpreted again! (Yes, this is confusing).
So, when I write
select E'aa\000aa'::bytea;
in the first parsing, the literal E'aa\000aa' is converted to an internal text with a null character in the third position (and depending on your postgresql version, the null character is interpreted as an EOS, and the text is assumed to be of length two - or in other versions an illegal string error is thrown).
Instead, when I write
select E'aa\\000aa'::bytea;
in the first parsing, the literal string "aa\000aa" (eight characters) is seen, and is assigned to a text; then in the cast to bytea, it is parsed again, and the sequence of characters '\000' is interpreted as a null byte.
IMO postgresql kind of sucks here.
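
The two passes can be imitated in Python, whose string literals happen to use similar backslash escapes. This is a rough sketch under stated assumptions: parse_bytea_escape is a made-up helper that models only the \\ and \ooo forms of the bytea escape input format, not PostgreSQL's actual parser.

```python
import re

def parse_bytea_escape(text: str) -> bytes:
    """Second pass: turn \\ooo octal escapes and doubled backslashes into bytes.

    Hypothetical helper modelling only part of the bytea escape input format.
    """
    def replace_token(match):
        token = match.group(0)
        if token == b"\\\\":                  # doubled backslash -> one backslash
            return b"\\"
        return bytes([int(token[1:], 8)])     # backslash + 3 octal digits -> one byte

    return re.sub(rb"\\\\|\\[0-3][0-7]{2}", replace_token, text.encode("ascii"))

# First pass (the literal parser): "\\" collapses to a single backslash,
# leaving the eight characters aa\000aa, as with E'aa\\000aa'.
after_first_pass = "aa\\000aa"
print(len(after_first_pass))          # 8

# Second pass (the cast to bytea): \000 becomes a single null byte.
as_bytes = parse_bytea_escape(after_first_pass)
print(len(as_bytes))                  # 5
```

Unlike PostgreSQL text, a Python str may contain a null character, so the single-backslash case cannot be reproduced exactly here.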

You can use regular strings or dollar-quoted strings instead of escaped strings:
# select length('aa\000aa'::bytea);
length
════════
5
(1 row)
# select length($$aa\000aa$$::bytea);
length
════════
5
(1 row)
I think that dollar-quoted strings are a better option because, if the configuration parameter standard_conforming_strings is off, then PostgreSQL recognizes backslash escapes in both regular and escape string constants. However, as of PostgreSQL 9.1, the default is on, meaning that backslash escapes are recognized only in escape string constants.
