How to convert literal \u sequences into UTF-8? [duplicate] - postgresql

I am loading a data dump from an external source and some strings contain \uXXXX sequences for UTF-8 characters, like this one:
\u017D\u010F\u00E1r nad S\u00E1zavou
I can check the contents by using an E'' constant in psql, but I cannot find any function/operator that returns the proper value.
Is it possible to convert such a string with Unicode escapes into normal UTF-8 without using PL/pgSQL functions?

I don't think there is a built-in method for that. The easiest way I can think of is the PL/pgSQL function you wanted to avoid:
CREATE OR REPLACE FUNCTION str_eval(text, OUT t text)
  LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
BEGIN
   -- Double any embedded single quotes, then let the server evaluate the
   -- result as a Posix escape string constant (E'...'),
   -- which resolves the \uXXXX sequences.
   EXECUTE 'SELECT E''' || replace($1, '''', '''''') || ''''
   INTO t;
END
$func$;
The updated version safeguards against SQLi and is faster, too.
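For illustration, applied to the sample string from the question (assumes a UTF-8 database and standard_conforming_strings = on, so the backslashes reach the function literally):
SELECT str_eval('\u017D\u010F\u00E1r nad S\u00E1zavou');
returns:
Žďár nad Sázavou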

ERROR: requested character too large for encoding: 14844072

I am converting the following line of code from Oracle to PostgreSQL.
In Oracle:
select CHR(14844072) from dual
Output:
"
"
In PostgreSQL:
select CHR(14844072);
Getting an error:
SQL Error [54000]: ERROR: requested character too large for encoding:
14844072
The behavior of the function differs between Oracle and PostgreSQL.
In Oracle the statement is valid. So is, for example:
select CHR(0) from dual;
While in PostgreSQL, you can't SELECT CHR(0):
chr(0) is disallowed because text data types cannot store that
character.
Source: https://www.postgresql.org/docs/14/functions-string.html
This is just an example. More specifically: what do you expect for the value 14844072? An empty string makes no sense in PostgreSQL.
In Oracle you have this situation:
For single-byte character sets, if n > 256, then Oracle Database returns the binary equivalent of n mod 256
For multibyte character sets, n must resolve to one entire code point
But:
Invalid code points are not validated, and the result of specifying
invalid code points is indeterminate.
Source: https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions019.htm
In PostgreSQL the function depends on the encoding, but, assuming you use UTF8:
In UTF8 encoding the argument is treated as a Unicode code point. In
other multibyte encodings the argument must designate an ASCII
character
Short answer: you need to work on the application code OR build your own function, something like this (just an example):
CREATE OR REPLACE FUNCTION myCHR(integer) RETURNS text
AS $$
BEGIN
   IF $1 = 0 THEN
      RETURN '';            -- mimic Oracle's behavior for 0
   ELSIF $1 <= 1114111 THEN -- 1114111 = U+10FFFF; replace the number according to your encoding
      RETURN CHR($1);
   ELSE
      RETURN '';            -- out of range: return an empty string instead of raising an error
   END IF;
END;
$$ LANGUAGE plpgsql;
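A quick sanity check of the sketch above (assuming a UTF-8 database; 352 is just an illustrative code point, that of 'Š'):
SELECT myCHR(0);        -- returns ''
SELECT myCHR(352);      -- returns 'Š'
SELECT myCHR(14844072); -- returns '' instead of raising an error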
In Oracle, this function expects a UTF-8 encoded value. Now 14844072 in hex is E280A8, which corresponds to the Unicode code point hex 2028, the "line separator" character.
In PostgreSQL, chr() expects the code point as its argument. So feed it the decimal value that corresponds to hex 2028:
SELECT chr(8232);
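If you have to deal with more such Oracle-style values, one possible sketch is to treat the integer as raw UTF-8 bytes and decode them. This assumes every value encodes a complete UTF-8 byte sequence (values whose hex form has an odd number of digits would need zero-padding first):
-- 14844072 = 0xE280A8, the UTF-8 encoding of U+2028 (line separator)
SELECT convert_from(decode(to_hex(14844072), 'hex'), 'UTF8');  -- same result as chr(8232)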

Postgres replacing 'text' with e'text'

I inserted a bunch of rows with a text field like content='...\n...\n...'.
I didn't use e in front, like content=e'...\n...\n...', so now \n is not actually displayed as a newline - it's printed as text.
How do I fix this, i.e. how to change every row's content field from '...' to e'...'?
The syntax variant E'string' makes Postgres interpret the given string as a Posix escape string. \n encoding a newline is only one of many interpreted escape sequences (even if it is the most common one). See:
Insert text with single quotes in PostgreSQL
To "re-evaluate" your Posix escape string, you could use a simple function with dynamic SQL like this:
CREATE OR REPLACE FUNCTION f_eval_posix_escapes(INOUT _string text)
  LANGUAGE plpgsql AS
$func$
BEGIN
   -- Wrap the input in E'...' and let the server re-evaluate the escape sequences
   EXECUTE 'SELECT E''' || _string || '''' INTO _string;
END
$func$;
WARNING 1: This is inherently unsafe! We have to evaluate input strings dynamically without quoting and escaping, which allows SQL injection. Only use this in a safe environment.
WARNING 2: Don't apply repeatedly. Or it will misinterpret your actual string with genuine \ characters, etc.
WARNING 3: This simple function is imperfect as it cannot cope with nested single quotes properly. If you have some of those, consider instead:
Unescape a string with escaped newlines and carriage returns
Apply:
UPDATE tbl
SET content = f_eval_posix_escapes(content)
WHERE content IS DISTINCT FROM f_eval_posix_escapes(content);
Note the added WHERE clause to skip updates that would not change anything. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
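A minimal illustration of what the function does (hypothetical input, not the OP's data):
SELECT f_eval_posix_escapes('line one\nline two');
returns:
line one
line two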
Use REPLACE in an update query. Something like this: (I'm on mobile so please ignore any typo or syntax error)
UPDATE table
SET
column = REPLACE(column, '\n', e'\n')
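If the stored text contains other literal escape sequences as well (e.g. \t), the same idea can be nested. This is only a sketch covering two sequences (table and column names as in the UPDATE further above), not a general solution:
UPDATE tbl
SET    content = replace(replace(content, '\t', E'\t'), '\n', E'\n');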

Convert all hex in a string to its char value in Redshift

In Redshift, I'm trying to convert strings like this:
http%3A%2F%2Fwww.amazon.com%2FTest%3Fname%3DGary%26Bob
To look like this:
http://www.amazon.com/Test?name=Gary&Bob
Basically I need to convert all of the hex in a string to its char value. The only way I can think of is to use a regex function. I tried to do it in two different ways and received error messages for both:
SELECT REGEXP_REPLACE(hex_string, '%([[:xdigit:]][[:xdigit:]])', CHR(x'\\1'::int))
ERROR: 22P02: "\" is not a valid hexadecimal digit
SELECT REGEXP_REPLACE(hex_string, '%([[:xdigit:]][[:xdigit:]])',CHR(STRTOL('0x'||'\\1', 16)::int))
ERROR: 22023: The input 0x\1 is not valid to be converted to base 16
The CHR and STRTOL functions work by themselves. For example:
SELECT CHR(x'3A'::int)
SELECT CHR(STRTOL('0x3A', 16)::int)
both return
:
And if I run the same pattern using a different function (other than CHR and STRTOL), it works:
REGEXP_REPLACE(hex_string, '%([[:xdigit:]][[:xdigit:]])', LOWER('{H}'||'\\1'||'{/H}'))
returns
http{h}3A{/h}{h}2F{/h}{h}2F{/h}www.amazon.com{h}2F{/h}Test{h}3F{/h}name{h}3D{/h}Gary{h}26{/h}Bob
But for some reason those functions won't recognize the regex matching group.
Any tips on how I can do this?
I guess the other solution is to use nested REPLACE() functions for all of the special hex characters, but that's probably a very last resort.
What you want to do is called "URL decode".
Currently there is no built-in function for doing this, but you can create a custom User-Defined Function (make sure you have the required privileges):
CREATE FUNCTION urldecode(url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
import urllib
return urllib.unquote(url).decode('utf8') # or 'latin-1', depending on how the text is encoded
$$ LANGUAGE plpythonu;
Example query:
SELECT urldecode('http%3A%2F%2Fwww.amazon.com%2FTest%3Fname%3DGary%26Bob');
Result:
http://www.amazon.com/Test?name=Gary&Bob
I tried #hiddenbit's answer in REDSHIFT, but Python 3 isn't supported. The following Py2 code did work for me, however:
DROP FUNCTION urldecode(varchar);
CREATE FUNCTION urldecode(url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
import urllib
return urllib.unquote(url)
$$ LANGUAGE plpythonu;

PostgreSQL Trim excessive trailing zeroes: type numeric but expression is of type text

I'm trying to clean out excessive trailing zeros, I used the following query...
UPDATE _table_ SET _column_=trim(trailing '00' FROM '_column_');
...and I received the following error:
ERROR: column "_column_" is of type numeric but
expression is of type text.
I've played around with the quotes, since that is usually what it boils down to for text versus numeric, though without any luck.
The CREATE TABLE syntax:
CREATE TABLE _table_ (
id bigint NOT NULL,
x bigint,
y bigint,
_column_ numeric
);
You can cast the arguments from and the result back to numeric:
UPDATE _table_ SET _column_=trim(trailing '00' FROM _column_::text)::numeric;
Also note that you should not quote column names with single quotes as you did.
Postgres version 13 now comes with the trim_scale() function:
UPDATE _table_ SET _column_ = trim_scale(_column_);
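For example (Postgres 13 or later):
SELECT trim_scale(30.00);        -- 30
SELECT trim_scale(4.123456000);  -- 4.123456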
trim takes string parameters, so _column_ has to be cast to a string (varchar for example). Then, the result of trim has to be cast back to numeric.
UPDATE _table_ SET _column_=trim(trailing '00' FROM _column_::varchar)::numeric;
Another (arguably more consistent) way to clean out the trailing zeroes from a NUMERIC field would be to use something like the following:
UPDATE _table_ SET _column_ = CAST(to_char(_column_, 'FM999999999990.999999') AS NUMERIC);
Note that you would have to modify the FM pattern to match the maximum expected precision and scale of your _column_ field. For more details on the FM pattern modifier and/or the to_char(..) function see the PostgreSQL docs here and here.
Edit: Also, see the following post on the gnumed-devel mailing list for a longer and more thorough explanation on this approach.
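A short worked example of that approach (the pattern below is only an assumption about precision/scale; widen it to fit your data):
SELECT CAST(to_char(30.00,       'FM999999999990.999999') AS numeric);  -- 30
SELECT CAST(to_char(4.123456000, 'FM999999999990.999999') AS numeric);  -- 4.123456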
Be careful with all the answers here. Although this looks like a simple problem, it's not.
If you have pg 13 or higher, you should use trim_scale (there is an answer about that already). If not, here is my "Polyfill":
DO $x$
BEGIN
   IF count(*) = 0 FROM pg_proc WHERE proname = 'trim_scale' THEN
      CREATE FUNCTION trim_scale(numeric) RETURNS numeric AS $$
         SELECT CASE WHEN trim($1::text, '0')::numeric = $1
                     THEN trim($1::text, '0')::numeric
                     ELSE $1 END
      $$ LANGUAGE SQL;
   END IF;
END;
$x$;
And here is a query for testing the answers:
WITH test AS (
   SELECT unnest(string_to_array('1|2.0|0030.00|4.123456000|300000', '|'))::numeric AS _column_
)
SELECT _column_                                                 AS original,
       trim(trailing '00' FROM _column_::text)::numeric         AS accepted_answer,
       CAST(to_char(_column_, 'FM999999999990.999') AS numeric) AS another_fancy_one,
       CASE WHEN trim(_column_::text, '0')::numeric = _column_
            THEN trim(_column_::text, '0')::numeric
            ELSE _column_ END                                    AS my
FROM test;
Well... it looks like I'm mostly trying to show the flaws of the earlier answers here, while I just can't come up with more test cases. Maybe you should write more, if you can.
I like short syntax instead of fancy SQL keywords, so I always go with :: over CAST and function calls with comma-separated arguments over constructs like trim(trailing '00' FROM _column_). But it's personal taste only; you should check your company or team standards (and fight to change them XD)

How can I mimic the php urldecode function in postgresql?

I have a column URL-encoded with urlencode in PHP. I wish to make a select like this:
SELECT some_mix_of_functions(...) AS Decoded FROM table
Replace is not a good solution because I would have to add all the decoding by hand. Any other solution to get the desired result?
Yes you can:
CREATE OR REPLACE FUNCTION decode_url_part(p varchar) RETURNS varchar AS $$
SELECT convert_from(CAST(E'\\x' || string_agg(
         CASE WHEN length(r.m[1]) = 1 THEN encode(convert_to(r.m[1], 'SQL_ASCII'), 'hex')
              ELSE substring(r.m[1] from 2 for 2) END, '') AS bytea), 'UTF8')
FROM regexp_matches($1, '%[0-9a-f][0-9a-f]|.', 'gi') AS r(m);
$$ LANGUAGE SQL IMMUTABLE STRICT;
This creates a function decode_url_part, then you can use it like that:
SELECT decode_url_part('your%20urlencoded%20string')
Or you can just use the mix of functions and subqueries from the body of the above function.
This doesn't handle '+' characters (representing whitespace), but I guess adding this is quite easy (if you ever need it).
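If you ever do need the '+' handling, one simple (admittedly naive) approach is to normalize '+' to a space before decoding:
-- naive: treats every '+' as a space, which is only correct for query-string data
SELECT decode_url_part(replace('name%3DGary+%26+Bob', '+', ' '));
returns:
name=Gary & Bob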
Also, this assumes utf-8 encoding for non-ascii characters, but you can replace 'UTF8' with your own encoding if you want.
It should be noted that the above code relies on an undocumented PostgreSQL feature, namely that the results of the regexp_matches function are processed in the order they occur in the original string (which is natural, but not specified in the docs).
As Pablo Santa Cruz notes, string_agg is a PostgreSQL 9.0 aggregate function. The equivalent code below doesn't use it (I hope it works for 8.x):
SELECT convert_from(CAST(E'\\x' || array_to_string(ARRAY(
SELECT CASE WHEN length(r.m[1]) = 1 THEN encode(convert_to(r.m[1], 'SQL_ASCII'), 'hex')
            ELSE substring(r.m[1] from 2 for 2) END
FROM regexp_matches($1, '%[0-9a-f][0-9a-f]|.', 'gi') AS r(m)
), '') AS bytea), 'UTF8');
Not out of the box. But you could create a pl/perl function that wraps the perl equivalent. (Or a pl/php function).