bytea type & nulls, Postgres

bytea type & nulls, Postgres - postgresql

I'm using a bytea type in PostgreSQL, which, to my understanding, contains just a series of bytes. However, I can't get it to play well with nulls. For example:
=# select length(E'aa\x00aa'::bytea);
length
--------
2
(1 row)
I was expecting 5. Also:
=# select md5(E'aa\x00aa'::bytea);
md5
----------------------------------
4124bc0a9335c27f086f24ba207a4912
(1 row)
That's the MD5 of "aa", not "aa\x00aa". Clearly, I'm Doing It Wrong, but I don't know what I'm doing wrong. I'm also on an older version of Postgres (8.1.11) for reasons outside of my control. (I'll see if this behaves the same on the latest Postgres as soon as I get home...)

Try this:
# select length(E'aa\\000aa'::bytea);
length
--------
5
Updated: Why the original didn't work? First, understand the difference between one slash and two:
pg=# select E'aa\055aa', length(E'aa\055aa') ;
?column? | length
----------+--------
aa-aa | 5
(1 row)
pg=# select E'aa\\055aa', length(E'aa\\055aa') ;
?column? | length
----------+--------
aa\055aa | 8
In the first case, I'm writing a literal string, 4 characters unescaped('a') and one escaped. The slash is consumed by the parser in a first pass, which converts the full \055
to a single char ('-' in this case).
In the second case, the first slash just escapes the second, the pair \\ is translated by the parser to a single \ and the 055 is seen as three characters.
Now, when converting a text to a bytea, escape characters (in a already parsed or produced text) are parsed/interpreted again! (Yes, this is confusing).
So, when I write
select E'aa\000aa'::bytea;
in the first parsing, the literal E'aa\000aa' is converted to an internal text with a null character in the third position (and depending on your postgresql version, the null character is interpreted as an EOS, and the text is assumed to be of length two - or in other versions an illegal string error is thrown).
Instead, when I write
select E'aa\\000aa'::bytea;
in the first parsing, the literal string "aa\000aa" (eight characters) is seen, and is asigned to a text; then in the casting to bytea, it is parsed again, and the sequence of characters '\000' is interpreted as a null byte.
IMO postgresql kind of sucks here.

You can use regular strings or dollar-quoted strings instead of escaped strings:
# select length('aa\000aa'::bytea);
length
════════
5
(1 row)
# select length($$aa\000aa$$::bytea);
length
════════
5
(1 row)
I think that dollar-quoted strings are a better option because, if the configuration parameter standard_conforming_strings is off, then PostgreSQL recognizes backslash escapes in both regular and escape string constants. However, as of PostgreSQL 9.1, the default is on, meaning that backslash escapes are recognized only in escape string constants.

Related

Is there a way to see raw string values using SQL / presto SQL / athena?

Edit after asked to better specify my need:
TL;DR: How to show whitespace escaped characters (such as /r) in the Athena console when performing a query? So this: "abcdef/r" instead of this "abcdef ".
I have a dataset with a column that contains some strings of variable length, all of them with a trailing whitespace.
Now, since I had analyzed this data before, using python, I know that this whitespace is a \r; however, if in Athena I SELECT my_column, it obviously doesn't show the escaped whitespace.
Essentially, what I'm trying to achieve:
my_column | ..
----------+--------
abcdef\r | ..
ghijkl\r | ..
What I'm getting instead:
my_column | ..
----------+--------
abcdef | ..
ghijkl | ..
If you're asking why would I want that, it's just to avoid having to parse this data through python if I ever incur in this situation again, so that I can immediately know if there's any weird escaped characters in my strings.
Any help is much appreciated.

Need explanation on character types in PostgreSQL

I went through the documentation on character types of PostgreSQL. But still I have some questions
"char" is fixed length i,e 1
if so what is length of "char[]" because I am not able to change it in pgadmin I thought it is used for variable length character array. So, what is the actual default size?
What is use of character as compared to "char"[]? For now I assume character is used for fixed length character array for which we define size.
Why character[] is used?
What is difference between character varying and character varying[]?

In case you come from a C background, a character string type in PostgreSQL is not an array of characters.
For each type foo in PostgreSQL, there is a corresponding array type foo[] that denotes an array of values of type foo. Use array types only if you do not plan to manipulate them much inside the database; if you do, it is usually better to normalize the array into a separate table.
Leaving aside array types, there are different character types:
"char" (the double quotes are required): a single character. Mostly used in catalog tables. Don't use this type unless you know what you are doing.
character(n) or char(n): fixed length characters string. No matter what you store there, it will always be padded with spaces on the right side. The behaviour, as dictated by the SQL standard, is sometimes surprising, so you rarely want this type.
text: arbitrary-length character string. This is the type you want for characters strings unless you want the database to impose a length limit.
character varying(n) or varchar(n): this is the same as text with an additional length limit.
To round it off with an example:
CREATE TABLE strtest(
id serial PRIMARY KEY,
sc "char",
c character(10),
vc character varying(10),
vca character varying(10)[]
);
INSERT INTO strtest (sc, c, vc, vca)
VALUES (
'x',
'short',
'short',
ARRAY['short1', 'short2', 'short3']
);
SELECT sc, c, vc, vca[2] FROM strtest;
sc | c | vc | vca
----+------------+-------+--------
x | short | short | short2
(1 row)

PostgreSQL tuple format

Is there any document describing the tuple format that PostgreSQL server adheres to? The official documentation appears arcane about this.
A single tuple seems simple enough to figure out, but when it comes to arrays of tuples, arrays of composite tuples, and finally nested arrays of composite tuples, it is impossible to be certain about the format simply by looking at the output.
I am asking this following my initial attempt at implementing pg-tuple, a parser that's still missing today, to be able to parse PostgreSQL tuples within Node.js
Examples
create type type_A as (
a int,
b text
);
with a simple text: (1,hello)
with a complex text: (1,"hello world!")
create type type_B as (
c type_A,
d type_A[]
);
simple-value array: {"(2,two)","(3,three)"}
for type_B[] we can get:
{"(\"(7,inner)\",\"{\"\"(88,eight-1)\"\",\"\"(99,nine-2)\"\"}\")","(\"(77,inner)\",\"{\"\"(888,eight-3)\"\",\"\"(999,nine-4)\"\"}\")"}
It gets even more complex for multi-dimensional arrays of composite types.
UPDATE
Since it feels like there is no specification at all, I have started working on reversing it. Not sure if it can be done fully though, because from some initial examples it is often unclear what formatting rules are applied.

As Nick posted, according to docs:
the whitespace will be ignored if the field type is integer, but not
if it is text.
and
The composite output routine will put double quotes around field
values if they are empty strings or contain parentheses, commas,
double quotes, backslashes, or white space.
and
Double quotes and backslashes embedded in field values will be
doubled.
and now quoting Nick himself:
nested elements are converted to strings, and then quoted / escaped
like any other string
I give shorted example below, comfortably compared against its nested value:
a=# create table playground (t text, ta text[],f float,fa float[]);
CREATE TABLE
a=# insert into playground select 'space here',array['','bs\'],8.0,array[null,8.1];
INSERT 0 1
a=# insert into playground select 'no_space',array[null,'nospace'],9.0,array[9.1,8.0];
INSERT 0 1
a=# select playground,* from playground;
playground | t | ta | f | fa
---------------------------------------------------+------------+----------------+---+------------
("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
(no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
If you go for deeper nested quoting, look at:
a=# select nested,* from (select playground,* from playground) nested;
nested | playground | t | ta | f | fa
-------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+------------+----------------+---+------------
("(""space here"",""{"""""""",""""bs\\\\\\\\""""}"",8,""{NULL,8.1}"")","space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | ("space here","{"""",""bs\\\\""}",8,"{NULL,8.1}") | space here | {"","bs\\"} | 8 | {NULL,8.1}
("(no_space,""{NULL,nospace}"",9,""{9.1,8}"")",no_space,"{NULL,nospace}",9,"{9.1,8}") | (no_space,"{NULL,nospace}",9,"{9.1,8}") | no_space | {NULL,nospace} | 9 | {9.1,8}
(2 rows)
As you can see, the output again follows rules the above.
This way in short answers to your questions would be:
why array is normally presented inside double-quotes, while an empty array is suddenly an open value? (text representation of empty array does not contain comma or space or etc)
why a single " is suddenly presented as \""? (text representation of 'one\ two', according to rules above is "one\\ two", and text representation of the last is ""one\\\\two"" and it is just what you get)
why unicode-formatted text is changing the escaping for \? How can we tell the difference then? (According to docs,
PostgreSQL also accepts "escape" string constants, which are an
extension to the SQL standard. An escape string constant is specified
by writing the letter E (upper or lower case) just before the opening
single quote
), so it is not unicode text, but the the way you tell postgres that it should interpret escapes in text not as symbols, but as escapes. Eg E'\'' will be interpreted as ' and '\'' will make it wait for closing ' to be interpreted. In you example E'\\ text' the text represent of it will be "\\ text" - we add backslsh for backslash and take value in double quotes - all as described in online docs.
the way that { and } are escaped is not always clear (I could not anwer this question, because it was not clear itself)

Encoding and decoding in postgresql

Let's say we have a string 'a\b'.
I need to encode it first, save to file, then read it from file and puck back in db.
How to encode and decode text that has escape characters?
select encode(E'a\b'::bytea, 'base64')
"YQg="
select decode('YQg=', 'base64')
"a\010"
After decoding I am not getting back string as it was in it's original form.

You're using an E'' string (escape string) and casting to bytea. The result will be a representation of that string in your current database encoding - probably UTF-8.
E'a\b' is the character a then the character represented by the escape \b which is ordinal \x08. PostgreSQL represents this string with a hex-escape when printing to the terminal because it's a non-printable character. The string is still two characters long.
postgres=> SELECT E'a\b';
?column?
----------
a\x08
(1 row)
postgres=> SELECT length(E'a\b');
length
--------
2
(1 row)
The cast to bytea implicitly does a conversion to the current database encoding:
postgres=> SELECT E'a\b'::bytea;
bytea
--------
\x6108
(1 row)
(\x61 is the ASCII ordinal for a in most encodings).
Except you must be on an old PostgreSQL since you have bytea_output = escape, resulting in octal escape output instead:
postgres=> SELECT E'a\b'::bytea;
bytea
-------
a\010
(1 row)
You need to decode the bytea back into a text string, e.g.
convert_from(decode('YQg=', 'base64'), 'utf-8')
... and even then the nonprintable character \b will be printed as \x08 by psql. You can verify that it is really that character inside the database using another client.
BTW, what's going on would be clearer if you instead explicitly encoded it when you stored it rather than relying on a cast to bytea:
encode(convert_to(E'a\b', 'utf-8'), bytea)

PostgreSQL - how to check if my data contains a backslash

SELECT count(*) FROM table WHERE column ilike '%/%';
gives me the number of values containing "/"
How to do the same for "\"?

SELECT count(*)
FROM table
WHERE column ILIKE '%\\\\%';

Excerpt from the docs:
Note that the backslash already has a special meaning in string literals, so to write a pattern constant that contains a backslash you must write two backslashes in an SQL statement (assuming escape string syntax is used, see Section 4.1.2.1). Thus, writing a pattern that actually matches a literal backslash means writing four backslashes in the statement. You can avoid this by selecting a different escape character with ESCAPE; then a backslash is not special to LIKE anymore. (But it is still special to the string literal parser, so you still need two of them.)

Better yet - don't use like, just use standard position:
select count(*) from table where 0 < position( E'\\' in column );

I found on 12.5 I did not need an escape character
# select * from t;
x
-----
a/b
c\d
(2 rows)
# select count(*) from t where 0 < position('/' in x);
count
-------
1
(1 row)
# select count(*) from t where 0 < position('\' in x);
count
-------
1
(1 row)
whereas on 9.6 I did.
Bit strange but there you go.
Usefully,
position(E'/' in x)
worked on both versions.
You need to be careful - E'//' seems to work (i.e. parses) but does not actually find a slash.

You need E'\\\\' because the argument to LIKE is a regex and regex escape char is already \ (e.g ~ E'\\w' would match any string containing a printable char).
See the doc