Identify special characters in a string with a mix of Arabic, alphanumeric, and special characters

I have a requirement where I need to identify whether a string has any special/junk characters, excluding Arabic, alphanumeric, and space characters. I have tried the query below, but it's not detecting the special characters:
select count(*) from table
where not regexp_like (column1,UNISTR('[\0600-\06FF\0750-\077F\0870-\089F\08A0-\08FF\FB50-\FDFF\FE70-\FEFF\0030-\0039\0041-\005A\0061-\007A]'));
The column has the following value: 'طًيAa1##$'

You have NOT REGEXP_LIKE(column, allowed_characters)
This means that any string with at least one allowed character will return TRUE from the regular expression, and so be excluded by the WHERE clause.
You want REGEXP_LIKE(column, disallowed_characters)
This will identify any strings that have at least one disallowed character.
You can accomplish this with ^ at the start of the bracket expression (^ meaning 'not any of these characters'):
select count(*) from table
where regexp_like (Column1, UNISTR('[^\0600-\06FF\0750-\077F\0870-\089F\08A0-\08FF\FB50-\FDFF\FE70-\FEFF\0030-\0039\0041-\005A\0061-\007A]'));
Demo: https://dbfiddle.uk/Rq1Zzopk
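As a quick sanity check, the sample value from the question is flagged, because '#' and '$' fall outside the allowed ranges (a minimal sketch against DUAL rather than the real table):
select case
         when regexp_like('طًيAa1##$',
              UNISTR('[^\0600-\06FF\0750-\077F\0870-\089F\08A0-\08FF\FB50-\FDFF\FE70-\FEFF\0030-\0039\0041-\005A\0061-\007A]'))
         then 'has special/junk characters'
         else 'clean'
       end as verdict
from dual;
-- verdict => 'has special/junk characters'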

Related

Postgres replacing 'text' with e'text'

I inserted a bunch of rows with a text field like content='...\n...\n...'.
I didn't use e in front, like content=e'...\n...\n...', so now \n is not actually interpreted as a newline - it's stored and displayed as literal text.
How do I fix this, i.e. how to change every row's content field from '...' to e'...'?
The syntax variant E'string' makes Postgres interpret the given string as a POSIX escape string. \n encoding a newline is only one of many interpreted escape sequences (even if the most common one). See:
Insert text with single quotes in PostgreSQL
To "re-evaluate" your Posix escape string, you could use a simple function with dynamic SQL like this:
CREATE OR REPLACE FUNCTION f_eval_posix_escapes(INOUT _string text)
LANGUAGE plpgsql AS
$func$
BEGIN
EXECUTE 'SELECT E''' || _string || '''' INTO _string;
END
$func$;
WARNING 1: This is inherently unsafe! We have to evaluate input strings dynamically without quoting and escaping, which allows SQL injection. Only use this in a safe environment.
WARNING 2: Don't apply it repeatedly, or it will misinterpret strings that genuinely contain \ characters, etc.
WARNING 3: This simple function is imperfect as it cannot cope with nested single quotes properly. If you have some of those, consider instead:
Unescape a string with escaped newlines and carriage returns
Apply:
UPDATE tbl
SET content = f_eval_posix_escapes(content)
WHERE content IS DISTINCT FROM f_eval_posix_escapes(content);
db<>fiddle here
Note the added WHERE clause to skip updates that would not change anything. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
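As a quick sanity check (a minimal sketch, assuming the function above has been created and standard_conforming_strings = on):
SELECT f_eval_posix_escapes('line1\nline2');
-- the literal backslash-n is re-evaluated as a newline, so the result is:
-- line1
-- line2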
Use REPLACE in an update query. Something like this (I'm on mobile, so please ignore any typos or syntax errors):
UPDATE table
SET
column = REPLACE(column, '\n', e'\n')
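A slightly fuller sketch of that idea (mytable and content are placeholder names; with standard_conforming_strings = on, '\n' is the two literal characters backslash + n, while e'\n' is an actual newline):
UPDATE mytable
SET    content = replace(content, '\n', e'\n')
WHERE  position('\n' in content) > 0;  -- only touch rows that contain the literal \n sequence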

How to use regex capture groups in postgres stored procedures (if possible at all)?

In a system, I'm using a standard urn (RFC8141) as one of the fields. From that urn, one can derive a unique identifier. The weird thing about the urns described in RFC8141 is that you can have two different urns which are equal.
In order to check for unique keys, I need to extract different parts of the urn that make a unique key. To do so, I have this regex (Regex which matches URN by rfc8141):
\A(?i:urn:(?!urn:)(?<nid>[a-z0-9][a-z0-9-]{1,31}[^-]):(?<nss>(?:[-a-z0-9()+,.:=#;$_!*'&~\/]|%[0-9a-f]{2})+)(?:\?\+(?<rcomponent>.*?))?(?:\?=(?<qcomponent>.*?))?(?:#(?<fcomponent>.*?))?)\z
which results in five named capture groups (nid, nss, rcomponent, qcomponent and fcomponent). Only the nid and nss are important to check for uniqueness/equality. Or: even if the components change, as long as nid and nss are the same, two items/records are equal (no matter the values of the components). nid is checked case-insensitively, nss is checked case-sensitively.
Now, in order to check for uniqueness/equality, I'm defining a 'cleaned urn', which is the primary key. I've added a trigger, so I can extract the different capture groups. What I'd like to do is:
extract the nid and nss (see regex) of the urn
capture them by name. This is where I don't know how to do it: how can I capture these two capture groups in a postgresql stored procedure?
add them as the 'cleaned urn', lowercasing nid (so as to have case-insensitivity on that part) and url-encoding or url-decoding the string (one of the two, it doesn't matter, as long as it's consistent). (I'm also not sure if there is a url encode/decode function in Postgres, but that'll be another question once the previous one is solved :) ).
Example:
all these urns are equal/equivalent (and I want the primary key to be urn:example:a123,z456):
urn:example:a123,z456
URN:example:a123,z456
urn:EXAMPLE:a123,z456
urn:example:a123,z456?+abc (?+ denotes the start of the rcomponent)
urn:example:a123,z456?=xyz/something (?= denotes the start of the qcomponent)
urn:example:a123,z456#789 (# denotes the start of the fcomponent)
urn:example:a123%2Cz456
URN:EXAMPLE:a123%2cz456
urn:example:A123,z456 and urn:Example:A123,z456 both have key urn:example:A123,z456, which is different from the previous examples (because of the case-sensitivity of A123,z456).
just for completeness: urn:example:a123,z456?=xyz/something is different from urn:example:a123,z456/something?=xyz: everything after ?= (or ?+ or #) can be omitted, so the /something is part of the primary key in the latter case, but not in the former. (That's what the regex is actually capturing already.)
== EDIT 1: unnamed capture groups ==
with unnamed capture groups, this would be doing the same:
select
g[2] as nid,
g[3] as nss,
g[4] as rcomp,
g[5] as qcomp,
g[6] as fcomp
from (
select regexp_matches('uRn:example:a123,z456?=xyz/something',
'\A(urn:(?!urn:)([a-z0-9][a-z0-9-]{1,31}[^-]):((?:[-a-z0-9()+,.:=#;$_!*''&~\/]|%[0-9a-f]{2})+)(?:\?\+(.*?))?(?:\?=(.*?))?(?:#(.*?))?)$', 'i')
g)
as ar;
(g[1] is the full match, which I don't need)
I updated the query:
case-insensitive matching should be done with a flag
no named capture groups (postgres seems to have issues with named capture groups)
and did a select on the array, splitting the array into columns.
Named capture groups don't seem to be supported, and there seem to be some issues with greedy/lazy matching and the negative lookahead. So, here's a solution that works fine:
DO $$
BEGIN
if not exists (SELECT 1 FROM pg_type WHERE typname = 'urn') then
CREATE TYPE urn AS (nid text, nss text, rcomp text, qcomp text, fcomp text);
end if;
END
$$;
CREATE or REPLACE FUNCTION spliturn(urnstring text)
RETURNS urn as $$
DECLARE
urn urn;
urnregex text = concat(
'\A(urn:(?!urn:)',
'([a-z0-9][a-z0-9-]{1,31}[^-]):',
'((?:[-a-z0-9()+,.:=#;$_!*''&~\/]|%[0-9a-f]{2})+)',
'(?:\?\+(.*?))??',
'(?:\?=(.*?))??',
'(?:#(.*?))??',
')$');
BEGIN
select
lower(g[2]) as nid,
g[3] as nss,
g[4] as rcomp,
g[5] as qcomp,
g[6] as fcomp
into urn
from (select regexp_matches(urnstring, urnregex, 'i')
g) as ar;
RETURN urn;
END;
$$ language 'plpgsql' immutable strict;
Note:
no named groups (?<...>)
case-insensitive search is indicated with a flag
replacement of \z with $ to match the end of the string
escaping a quote with another quote ('') to allow for quotes
the double ?? for non-greedy search (Table 9-14)
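For illustration, the 'cleaned urn' could then be built from nid and nss only (a sketch assuming the type and function above are in place; note that percent-encoding such as %2C is not normalized here):
SELECT 'urn:' || (u).nid || ':' || (u).nss AS cleaned_urn
FROM  (SELECT spliturn('URN:EXAMPLE:a123,z456?=xyz/something') AS u) s;
-- cleaned_urn => urn:example:a123,z456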

db2 remove all non-alphanumeric, including non-printable, and special characters

This may sound like a duplicate, but the existing solutions do not work.
I need to remove all non-alphanumerics from a varchar field. I'm using the following, but it doesn't work in all cases (it works with diamond question mark characters):
select TRANSLATE(FIELDNAME, '?',
TRANSLATE(FIELDNAME , '', 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'))
from TABLENAME
What it's doing is: the inner TRANSLATE isolates all the non-alphanumeric characters, and the outer TRANSLATE then replaces each of them with '?'. This seems to work for the replacement character �. However, it throws "The second, third or fourth argument of the TRANSLATE scalar function is incorrect.", which is expected according to IBM:
The TRANSLATE scalar function does not allow replacement of a character by another character which is encoded using a different number of bytes. The second and third arguments of the TRANSLATE scalar function must end with correctly formed characters.
Is there any way to get around this?
Edit: @Paul Vernon's solution seems to be working:
· 6005308 ??6005308
–6009908 ?6009908
–6011177 ?6011177
��6011183�� ??6011183??
Try regexp_replace(c,'[^\w\d]','') or regexp_replace(c,'[^a-zA-Z\d]','')
E.g.
select regexp_replace(c,'[^a-zA-Z\d]','') from table(values('AB_- C$£abc�$123£')) t(c)
which returns
1
---------
ABCabc123
BTW Note that the allowed regular expression patterns are listed on this page Regular expression control characters
Outside of a set, the following must be preceded with a backslash to be treated as a literal
* ? + [ ( ) { } ^ $ | \ . /
Inside a set, the following must be preceded with a backslash to be treated as literals:
Characters that must be quoted to be treated as literals are [ ] \
Characters that might need to be quoted, depending on the context are - &
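For example, per the rules above, keeping a literal ']' inside the set requires escaping it with a backslash (a sketch, assuming Db2 11.1 or later where REGEXP_REPLACE is available):
select regexp_replace(c, '[^a-zA-Z0-9\]]', '')
from table(values('AB[c]-12£')) t(c)
-- returns: ABc]12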

Remove/replace special characters in column values?

I have a table column containing values which I would like to remove all the hyphens from. The values may contain more than one hyphen and vary in length.
Example: for all values I would like to replace 123 - ABCD - efghi with 123ABCDefghi.
What is the easiest way to remove all hyphens & update all column values in the table?
You can use the regexp_replace function to leave only the digits and letters, like this:
update mytable
set myfield = regexp_replace(myfield, '[^\w]+','');
Which means that everything that is not a digit, a letter, or an underscore will be replaced by nothing (that includes -, space, dot, comma, etc.).
If you want to also include the _ to be replaced (\w will leave it) you can change the regex to [^\w]+|_.
Or if you want to be strict with the characters that must be removed you use: [- ]+ in this case here a dash and a space.
Also, as suggested by Luiz Signorelly, you can add the 'g' flag to replace all occurrences:
update mytable
set myfield = regexp_replace(myfield, '[^\w]+','','g');
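A quick sanity check with the example value from the question (the second input is a made-up variant with an underscore, to show the [^\w]+|_ form):
SELECT regexp_replace('123 - ABCD - efghi', '[- ]+', '', 'g');    -- => 123ABCDefghi
SELECT regexp_replace('123 - ABCD_efghi', '[^\w]+|_', '', 'g');   -- => 123ABCDefghi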
You can use this.
update table
set column = format('%s%s', left(column, 3), right(column, -6));

PostgreSQL - how to check if my data contains a backslash

SELECT count(*) FROM table WHERE column ilike '%/%';
gives me the number of values containing "/"
How to do the same for "\"?
SELECT count(*)
FROM table
WHERE column ILIKE '%\\\\%';
Excerpt from the docs:
Note that the backslash already has a special meaning in string literals, so to write a pattern constant that contains a backslash you must write two backslashes in an SQL statement (assuming escape string syntax is used, see Section 4.1.2.1). Thus, writing a pattern that actually matches a literal backslash means writing four backslashes in the statement. You can avoid this by selecting a different escape character with ESCAPE; then a backslash is not special to LIKE anymore. (But it is still special to the string literal parser, so you still need two of them.)
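To illustrate, the four-backslash pattern does match the row containing a backslash (a minimal sketch with inline values, assuming standard_conforming_strings = on):
SELECT count(*)
FROM  (VALUES ('a/b'), ('c\d')) t(x)
WHERE  x LIKE E'%\\\\%';
-- count => 1   (only 'c\d' contains a backslash)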
Better yet - don't use like, just use standard position:
select count(*) from table where 0 < position( E'\\' in column );
I found that on PostgreSQL 12.5 I did not need an escape character:
# select * from t;
x
-----
a/b
c\d
(2 rows)
# select count(*) from t where 0 < position('/' in x);
count
-------
1
(1 row)
# select count(*) from t where 0 < position('\' in x);
count
-------
1
(1 row)
whereas on 9.6 I did.
Bit strange but there you go.
Usefully,
position(E'\\' in x)
worked on both versions.
You need to be careful - E'\\\\' seems to work (i.e. it parses), but it is a string of two backslashes, so it does not actually find a single backslash.
You need E'\\\\' with LIKE because \ is also the default escape character in a LIKE pattern, so a literal backslash has to be escaped once more (the same applies to the regex operator, e.g. ~ E'\\w' matches any string containing a word character).
See the doc