How to use regex capture groups in postgres stored procedures (if possible at all)?

In a system, I'm using a standard urn (RFC8141) as one of the fields. From that urn, one can derive a unique identifier. The weird thing about the urns described in RFC8141 is that you can have two different urns which are equal.
In order to check for unique keys, I need to extract different parts of the urn that make a unique key. To do so, I have this regex (Regex which matches URN by rfc8141):
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}[^-]):(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
which results in five named capture groups (nid, nss, rcomponent, qcomponent and fcomponent). Only nid and nss matter when checking for uniqueness/equality: as long as nid and nss are the same, two items/records are equal, no matter the values of the components. nid is compared case-insensitively, nss case-sensitively.
Now, in order to check for uniqueness/equality, I'm defining a 'cleaned urn', which is the primary key. I've added a trigger, so I can extract the different capture groups. What I'd like to do is:
extract the nid and nss (see regex) of the urn
capture them by name. This is where I don't know how to do it: how can I capture these two capture groups in a postgresql stored procedure?
add them as 'cleaned urn', lowercasing nid (to get case-insensitivity on that part) and url-encoding or url-decoding the string (one of the two, it doesn't matter, as long as it's consistent). (I'm also not sure if there is a url encode/decode function in Postgres, but that'll be another question once the previous one is solved :) ).
Example:
all these urns are equal/equivalent (and I want the primary key to be urn:example:a123,z456):
urn:example:a123,z456
URN:example:a123,z456
urn:EXAMPLE:a123,z456
urn:example:a123,z456?+abc (?+ denotes the start of the rcomponent)
urn:example:a123,z456?=xyz/something (?= denotes the start of the qcomponent)
urn:example:a123,z456#789 (# denotes the start of the fcomponent)
urn:example:a123%2Cz456
URN:EXAMPLE:a123%2cz456
urn:example:A123,z456 and urn:Example:A123,z456 both have key urn:example:A123,z456, which is different from the previous examples (because of the case-sensitiveness of the A123,z456).
just for completeness: urn:example:a123,z456?=xyz/something is different from urn:example:a123,z456/something?=xyz: everything after ?= (or ?+ or #) can be omitted, so the /something is part of the primary key in the latter case, but not in the former. (That's what the regex is actually capturing already.)
== EDIT 1: unnamed capture groups ==
with unnamed capture groups, this would be doing the same:
select
g[2] as nid,
g[3] as nss,
g[4] as rcomp,
g[5] as qcomp,
g[6] as fcomp
from (
select regexp_matches('uRn:example:a123,z456?=xyz/something',
'\A(urn:(?!urn:)([a-z0-9][a-z0-9-]{1,31}[^-]):((?:[-a-z0-9()+,.:=#;$_!*''&~\/]|%[0-9a-f]{2})+)(?:\?\+(.*?))?(?:\?=(.*?))?(?:#(.*?))?)$', 'i')
g)
as ar;
(g[1] is the full match, which I don't need)
I updated the query:
case insensitive matching should be done as flag
no named capturing groups (postgres seems to have issues with named capturing groups)
and did a select on the array, splitting the array into columns.

Named capture groups don't seem to be supported, and there seem to be some issues with greedy/lazy matching and the negative lookahead. So, here's a solution that works fine:
DO $$
BEGIN
    IF NOT EXISTS (SELECT 1 FROM pg_type WHERE typname = 'urn') THEN
        CREATE TYPE urn AS (nid text, nss text, rcomp text, qcomp text, fcomp text);
    END IF;
END
$$;

CREATE OR REPLACE FUNCTION spliturn(urnstring text)
RETURNS urn AS $$
DECLARE
    urn urn;
    urnregex text = concat(
        '\A(urn:(?!urn:)',
        '([a-z0-9][a-z0-9-]{1,31}[^-]):',
        '((?:[-a-z0-9()+,.:=#;$_!*''&~\/]|%[0-9a-f]{2})+)',
        '(?:\?\+(.*?))??',
        '(?:\?=(.*?))??',
        '(?:#(.*?))??',
        ')$');
BEGIN
    SELECT
        lower(g[2]) AS nid,
        g[3] AS nss,
        g[4] AS rcomp,
        g[5] AS qcomp,
        g[6] AS fcomp
    INTO urn
    FROM (SELECT regexp_matches(urnstring, urnregex, 'i') g) AS ar;
    RETURN urn;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;
note
no named groups (?<...>)
indicate case insensitive search with a flag
replacement of \z with $ to match the end of the string
escaping a quote with another quote ('') to allow for quotes
the double ?? for non-greedy search (Table 9-14)
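For the trigger part of the question, a minimal sketch could look like the following. It deliberately uses a much simpler pattern than the regex above: it relies on the fact that the r-, q- and f-components can only start at the first ? or #, so everything between the nid and that point is taken as the nss. It is not a full RFC 8141 validator (for example, it does not reject urn:urn:...), it does not normalise percent-encoding, and it assumes PostgreSQL 11 or later (regexp_match, EXECUTE FUNCTION). The names cleaned_urn, items and items_set_cleaned are made up for illustration.

```sql
-- Hypothetical helper: returns 'urn:<lowercased nid>:<nss>',
-- or NULL when the input does not look like a URN.
CREATE OR REPLACE FUNCTION cleaned_urn(urnstring text)
RETURNS text AS $$
    SELECT 'urn:' || lower(m[1]) || ':' || m[2]
    FROM (
        SELECT regexp_match(
                   urnstring,
                   '^urn:([a-z0-9][a-z0-9-]{0,30}[a-z0-9]):([^?#]+)',
                   'i') AS m
    ) AS s
$$ LANGUAGE sql IMMUTABLE STRICT;

-- Hypothetical table: the cleaned URN is the primary key.
CREATE TABLE items (
    cleaned text PRIMARY KEY,
    urn     text NOT NULL
);

-- Trigger function: derive the key from the raw URN on every write.
CREATE OR REPLACE FUNCTION items_set_cleaned() RETURNS trigger AS $$
BEGIN
    NEW.cleaned := cleaned_urn(NEW.urn);
    IF NEW.cleaned IS NULL THEN
        RAISE EXCEPTION 'not a valid URN: %', NEW.urn;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_set_cleaned
    BEFORE INSERT OR UPDATE ON items
    FOR EACH ROW EXECUTE FUNCTION items_set_cleaned();
```

With this in place, inserting urn:example:a123,z456 and then URN:EXAMPLE:a123,z456?+abc should fail on the second insert with a primary key violation, because both map to the same cleaned value.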

Related

How to sort semver in postgresql database?

I have a problem with sorting semantic versions (semver) in a PostgreSQL query.
I found a topic like this:
https://dba.stackexchange.com/questions/74283/how-to-order-by-typical-software-release-versions-like-x-y-z
but it only covers sorting versions of the form MAJOR.MINOR.PATCH,
which is indeed quite easy. But a semver may also include a prerelease (MAJOR.MINOR.PATCH-prerelease).
Quote from here: https://semver.org/
Precedence for two pre-release versions with the same major, minor,
and patch version MUST be determined by comparing each dot separated
identifier from left to right until a difference is found as follows:
Identifiers consisting of only digits are compared numerically.
Identifiers with letters or hyphens are compared lexically in ASCII
sort order.
Numeric identifiers always have lower precedence than non-numeric
identifiers.
A larger set of pre-release fields has a higher precedence than a
smaller set, if all of the preceding identifiers are equal.
Example: 1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta <
1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0.
The hard part for me is to create "conditional" sorting which takes two cases into consideration:
when an identifier consists only of digits (numerical comparison should happen)
when it consists only of letters or hyphens (ASCII sort order should be used)
and acts accordingly to return the correct order.
My table consists of two columns: version (major.minor.patch) and prerelease.
It is possible to sort prerelease similarly to the version column, by splitting it on "." and treating each segment (identifier) as a string, but there are cases where that doesn't work.
Example:
If we consider semversions like:
1.0.0-dev.123, 1.0.0-dev.124, 1.0.0-dev.1234
the correct order is :
1.0.0-dev.123, 1.0.0-dev.124, 1.0.0-dev.1234
whereas when we compare them as strings the output will be:
1.0.0-dev.123, 1.0.0-dev.1234, 1.0.0-dev.124
Thank you!

I found a solution. Sometimes you cannot use libraries.
So the idea is to lpad the numerical identifiers with zeros, so they all contain the same amount of characters and then sort them like strings.
My functions are following:
CREATE OR REPLACE FUNCTION numerical_identifier_pad (string text)
RETURNS text AS
'
DECLARE
    result text;
BEGIN
    -- Left-pad purely numeric identifiers so that they sort numerically as strings
    IF string ~ ''^[0-9]{1,20}$'' THEN
        result := lpad(string, 20, ''0'');
    ELSE
        result := string;
    END IF;
    RETURN result;
END;
'
LANGUAGE plpgsql
IMMUTABLE;
CREATE OR REPLACE FUNCTION pad_numeric_identifiers_with_zeros (prerelease_column text)
RETURNS text AS
'
DECLARE
    result text;
BEGIN
    -- Pad each dot-separated identifier, then join the identifiers back with dots
    SELECT array_to_string(array_agg(a.numerical_identifier_pad), ''.'')
    INTO result
    FROM (
        SELECT numerical_identifier_pad(
            unnest(string_to_array(prerelease_column, ''.'')::text[])
        )
    ) AS a;
    RETURN result;
END;
'
LANGUAGE plpgsql
IMMUTABLE;
So what I did is I used pad_numeric_identifiers_with_zeros function on my prerelease column and then I applied sorting on the result.
Let's say we have prereleases like in the example from my first post.
dev.123
dev.124
dev.1234
pad_numeric_identifiers_with_zeros would do something like this (shown padded to 4 digits for readability; the function pads to 20):
dev.0123
dev.0124
dev.1234
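A variation on the same padding idea, as a rough sketch: return the padded identifiers as a text[] instead of a single string, so that PostgreSQL compares them identifier by identifier and a shorter list sorts before a longer one with the same prefix. The function name prerelease_sort_key is made up for this example, and the COLLATE "C" in the ORDER BY is an assumption on my side to get plain ASCII ordering regardless of the database locale. A NULL or empty prerelease yields a NULL key, which sorts last in ascending order by default, so a final release ends up after its prereleases.

```sql
-- Hypothetical helper: one array element per dot-separated identifier,
-- with purely numeric identifiers left-padded to 20 characters.
CREATE OR REPLACE FUNCTION prerelease_sort_key(prerelease text)
RETURNS text[] AS $$
    SELECT array_agg(
               CASE WHEN ident ~ '^[0-9]+$'
                    THEN lpad(ident, 20, '0')
                    ELSE ident
               END
               ORDER BY ord)
    FROM unnest(string_to_array(prerelease, '.'))
         WITH ORDINALITY AS t(ident, ord)
$$ LANGUAGE sql IMMUTABLE;

-- Usage with the values from the question
SELECT p
FROM (VALUES ('dev.1234'), ('dev.124'), ('dev.123')) AS v(p)
ORDER BY prerelease_sort_key(p) COLLATE "C";
```

For the three values above this returns dev.123, dev.124, dev.1234. One caveat: an identifier that starts with a hyphen would still sort before a numeric one under ASCII ordering, which is not what semver asks for, so this is not a complete implementation of the precedence rules.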

How can I break a long string in an "XMLTABLE" embedded SQL statement in RPGLE across multiple lines?

I have an XML path that exceeds 100 characters (and therefore truncates when the source is saved). My statement is something like this:
Exec SQL
Select Whatever
Into :Stuff
From Table as X,
XmlTable(
XmlNamespaces('http://namespace.url/' as "namespacevalue"),
'$X/really/long/path' Passing X.C1 as "X"
Columns
Field1 Char(3) Path 'example1',
Field2 Char(8) Path 'example2',
Field3 Char(32) Path '../example3'
) As R;
I must break $X/really/long/path across multiple lines. Per IBM's documentation,
The plus sign (+) can be used to indicate a continuation of a string constant.
However, this does not even pass precompile ("Token + was not valid"). I suspect this is due to where the string is in the statement.
I have also tried:
Putting the path in a host variable; this was not allowed
Using SQL CONCAT or ||; not allowed
Putting the path in a SQL global variable instead of a host variable; not allowed
I have considered:
Preparing the entire statement, but this is not ideal for a multitude of reasons
Truncating the path at a higher level in the hierarchy, but this does not return the desired "granularity" of records
Is there any way to span this specific literal in an XmlTable function across multiple lines in my source? Thanks for any and all ideas!

Something like
Exec SQL
Select Whatever
Into :Stuff
From Table as X,
XmlTable(
XmlNamespaces('http://namespace.url/' as "namespacevalue"),
'$X/really/+
long/path' Passing X.C1 as "X"
Columns
Field1 Char(3) Path 'example1',
Field2 Char(8) Path 'example2',
Field3 Char(32) Path '../example3'
) As R;
should work. Is that what you tried?

The + didn't work for me, so I had to shorten the path with // instead of /, which might be suboptimal.

removing leading zero and hyphen in Postgres

I need to remove leading zeros and hyphens from a column value in Postgresql database, for example:
121-323-025-000 should look like 12132325
060579-0001 => 605791
482-322-004 => 4823224
timely help will be really appreciated.

Postgresql string functions.
For more advanced string editing, regular expressions can be very powerful. Be aware that complex regular expressions may not be considered maintainable by people not familiar with them.
CREATE TABLE testdata (id text, expected text);
INSERT INTO testdata (id, expected) VALUES
('121-323-025-000', '12132325'),
('060579-0001', '605791'),
('482-322-004', '4823224');
SELECT id, expected, regexp_replace(id, '(^|-)0*', '', 'g') AS computed
FROM testdata;
How regexp_replace works. In this case we look for the beginning of the string or a hyphen for a place to start matching. We include any zeros that follow that as part of the match. Next we replace that match with an empty string. Finally, the global flag tells us to repeat the search until we reach the end of the string.
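If the regular expression feels too opaque, the same steps can be spelled out without one. This is only a sketch of an equivalent approach, not something from the answer above: split the value on hyphens, strip the leading zeros from each part with ltrim, and glue the parts back together. The inline VALUES list just repeats the sample ids so the query runs on its own.

```sql
SELECT id,
       array_to_string(
           ARRAY(
               -- one row per hyphen-separated part, in original order
               SELECT ltrim(part, '0')
               FROM unnest(string_to_array(id, '-'))
                    WITH ORDINALITY AS t(part, ord)
               ORDER BY ord
           ),
           '') AS computed
FROM (VALUES ('121-323-025-000'),
             ('060579-0001'),
             ('482-322-004')) AS v(id);
```

For the three sample rows this gives 12132325, 605791 and 4823224, the same as the expected column.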

list trigger no system ending with "_BI"

I want to list the non-system triggers whose names end with "_BI" in a Firebird database,
but I get no result with this:
select * from rdb$triggers
where
rdb$trigger_source is not null
and (coalesce(rdb$system_flag,0) = 0)
and (rdb$trigger_source not starting with 'CHECK' )
and (rdb$trigger_name like '%BI')
With this condition instead, I get names ending in "_bi" and "_BI0U":
and (rdb$trigger_name like '%BI%')
And with this condition I get no result at all:
and (rdb$trigger_name like '%#_BI')
Thank you in advance.

The problem is that the Firebird system tables use CHAR(31) for object names, which means that they are padded with spaces up to the declared length. As a result, like '%BI' will not yield results unless BI are the 30th and 31st characters.
There are several solutions
For example you can trim the name before checking
trim(rdb$trigger_name) like '%BI'
or you can require that the name is followed by at least one space
rdb$trigger_name || ' ' like '%BI %'
On a related note, if you want to check if your trigger name ends in _BI, then you should also include the underscore in your condition. And as an underscore in like is a single character matcher, you need to escape it:
trim(rdb$trigger_name) like '%\_BI' escape '\'
Alternatively, you could use a regular expression, as you then won't need to trim or otherwise mangle the left-hand side of the expression:
rdb$trigger_name similar to '%\_BI[[:SPACE:]]*' escape '\'
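Put together with the query from the question, the trimmed and escaped variant would look roughly like this. It is only the question's own conditions with the last one swapped out, so treat it as a sketch rather than a tested query:

```sql
-- Non-system triggers whose (trimmed) name ends in _BI
select trim(rdb$trigger_name) as trigger_name
from rdb$triggers
where rdb$trigger_source is not null
  and coalesce(rdb$system_flag, 0) = 0
  and rdb$trigger_source not starting with 'CHECK'
  and trim(rdb$trigger_name) like '%\_BI' escape '\'
```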

PostgreSQL Trim excessive trailing zeroes: type numeric but expression is of type text

I'm trying to clean out excessive trailing zeros, I used the following query...
UPDATE _table_ SET _column_=trim(trailing '00' FROM '_column_');
...and I received the following error:
ERROR: column "_column_" is of type numeric but expression is of type text
I've played around with the quotes, since that is usually what it boils down to for text versus numeric, but without any luck.
The CREATE TABLE syntax:
CREATE TABLE _table_ (
id bigint NOT NULL,
x bigint,
y bigint,
_column_ numeric
);

You can cast the argument to text and the result back to numeric:
UPDATE _table_ SET _column_=trim(trailing '00' FROM _column_::text)::numeric;
Also note that you don't quote column names with single quotes as you did.

Postgres version 13 now comes with the trim_scale() function:
UPDATE _table_ SET _column_ = trim_scale(_column_);

trim takes string parameters, so _column_ has to be cast to a string (varchar for example). Then, the result of trim has to be cast back to numeric.
UPDATE _table_ SET _column_=trim(trailing '00' FROM _column_::varchar)::numeric;

Another (arguably more consistent) way to clean out the trailing zeroes from a NUMERIC field would be to use something like the following:
UPDATE _table_ SET _column_ = CAST(to_char(_column_, 'FM999999999990.999999') AS NUMERIC);
Note that you would have to modify the FM pattern to match the maximum expected precision and scale of your _column_ field. For more details on the FM pattern modifier and/or the to_char(..) function see the PostgreSQL docs here and here.
Edit: Also, see the following post on the gnumed-devel mailing list for a longer and more thorough explanation on this approach.

Be careful with all the answers here. Although this looks like a simple problem, it's not.
If you have pg 13 or higher, you should use trim_scale (there is an answer about that already). If not, here is my "Polyfill":
DO $x$
BEGIN
IF count(*)=0 FROM pg_proc where proname='trim_scale' THEN
CREATE FUNCTION trim_scale(numeric) RETURNS numeric AS $$
SELECT CASE WHEN trim($1::text, '0')::numeric = $1 THEN trim($1::text, '0')::numeric ELSE $1 END $$
LANGUAGE SQL;
END IF;
END;
$x$;
And here is a query for testing the answers:
WITH test as (SELECT unnest(string_to_array('1|2.0|0030.00|4.123456000|300000','|'))::numeric _column_)
SELECT _column_ original,
trim(trailing '00' FROM _column_::text)::numeric accepted_answer,
CAST(to_char(_column_, 'FM999999999990.999') AS NUMERIC) another_fancy_one,
CASE WHEN trim(_column_::text, '0')::numeric = _column_ THEN trim(_column_::text, '0')::numeric ELSE _column_ END my FROM test;
Well... it looks like I'm trying to show the flaws of the earlier answers, but I just can't come up with other test cases. Maybe you should write more, if you can.
I like short syntax better than fancy SQL keywords, so I always go with :: over CAST and with function calls with comma-separated args over constructs like trim(trailing '00' FROM _column_). But that's only personal taste; you should check your company or team standards (and fight to change them XD)
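One more test case worth adding is plain 0: trim('0', '0') gives an empty string, and casting that to numeric appears to raise an error, so the polyfill above would fail there. Below is a sketch of a variant that only touches values that actually have a decimal point. The name trim_trailing_zeros is made up so that it does not clash with the built-in trim_scale on PostgreSQL 13 and later.

```sql
-- Hypothetical helper: strip trailing zeros after the decimal point only.
CREATE OR REPLACE FUNCTION trim_trailing_zeros(n numeric)
RETURNS numeric AS $$
BEGIN
    IF n::text LIKE '%.%' THEN
        -- '2.0' -> '2.' -> '2', '4.123456000' -> '4.123456'
        RETURN rtrim(rtrim(n::text, '0'), '.')::numeric;
    END IF;
    -- integers such as 300000 (and NaN) are returned unchanged
    RETURN n;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

SELECT v AS original, trim_trailing_zeros(v) AS trimmed
FROM unnest(ARRAY[1, 2.0, 0030.00, 4.123456000, 300000, 0, 0.00]::numeric[]) AS v;
```

The integer check comes first on purpose: rtrim on '300000' would otherwise eat the zeros that are part of the value.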
