Remove all characters except A-Za-z0-9 without regex - postgresql-9.4

Is there a way to remove all characters except A-Za-z0-9 when querying a column in PostgreSQL without regex? Or is regex the shortest way?
SELECT foo FROM bar;
-- ajhg asjg, asga. 1234 ax
-- desired result: ajhgasjgasga1234ax
-- regex solution:
SELECT regexp_replace(foo,'[^A-Za-z0-9]','','g') FROM bar;
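One regex-free possibility, offered as a sketch rather than a definitive answer, is a nested translate() call. The inner call deletes every allowed character from the value, leaving only the unwanted ones; the outer call then deletes exactly those characters from the original value. The string literal below stands in for bar.foo, and whether this counts as shorter than regexp_replace is debatable.

```sql
-- Sketch: the literal in VALUES stands in for bar.foo from the question.
-- Inner translate(): removes all allowed characters, leaving the unwanted ones.
-- Outer translate(): removes those unwanted characters from the original string.
SELECT translate(
         s,
         translate(s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789', ''),
         ''
       ) AS cleaned
FROM (VALUES ('ajhg asjg, asga. 1234 ax')) AS t(s);
-- cleaned: ajhgasjgasga1234ax
```

Against the real table this would read SELECT translate(foo, translate(foo, '...', ''), '') FROM bar; with the same list of allowed characters.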

Do Postgres string functions (including regexp_*) consider tabs as spaces?

I am running some SQL on a Postgres 9.5 server.
A field sometimes has leading whitespace, including literal spaces and tab characters ('\t').
In many programming languages this is easy to strip with a regexp replace, like this in JavaScript:
> ' \tafsdfwef\t \n'.replace(/\s+/g, '')
'afsdfwef'
Then I found that PostgreSQL also has a regexp_replace function, and that it also supports \s to mean [[:space:]]:
https://www.postgresql.org/docs/10/functions-matching.html#FUNCTIONS-POSIX-REGEXP
But this \s seems to recognize only literal spaces (' '). The question is: does the PostgreSQL regex \s include all kinds of whitespace (tabs, newlines)?
db=> SELECT regexp_replace('\tafsdfwef', '\s+', '');
regexp_replace
----------------
\tafsdfwef
(1 row)
db=> SELECT regexp_matches('\tafsdfwef', '\s+');
regexp_matches
----------------
(0 rows)
Then I tested whether the trim function recognizes the other whitespace characters. It seems it does not either:
db=> SELECT trim('\tafsdfwef\t');
btrim
--------------
\tafsdfwef\t
(1 row)
db=> SELECT trim(' \tafsdfwef\t');
btrim
--------------
\tafsdfwef\t
(1 row)
db=> SELECT trim(' \tafsdfwef\t \n ');
btrim
------------------
\tafsdfwef\t \n
(1 row)
So, does PostgreSQL have an easy function that strips all kinds of whitespace at the start, in the middle, and at the end of a string?
EDIT: My complaint is also about the PostgreSQL documentation: it maps \s to [[:space:]], but that does not seem to cover all kinds of whitespace as most programmers understand it, and it mentions POSIX regex but does not seem to really be POSIX.
Does anyone know a good place to file a bug report?
https://www.postgresql.org/docs/10/functions-matching.html#FUNCTIONS-POSIX-REGEXP
EDIT: here is what \s means according to the Mozilla JavaScript documentation:
a single white space character, including space, tab, form feed, line feed and other Unicode spaces. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
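Yes, Postgres regexp functions do treat a tab as whitespace. The catch is that the text '\tafsdfwef' does not contain a tab character: in a plain string constant the backslash is an ordinary character (with standard_conforming_strings on, the default since 9.1). You have to write the letter E (upper or lower case) just before the opening single quote to get a tab character (and/or other escape sequences) in it: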
SELECT regexp_replace(E'\ta\nb\fc\rd', '\s', '', 'g')
regexp_replace
----------------
abcd
(1 row)
Read about string constants in the documentation.
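To make the difference visible, and to cover the "leading, middle, and tail" part of the question, here is a small sketch. It assumes standard_conforming_strings = on; the sample string is the one from the question, written with the E prefix.

```sql
-- Assumes standard_conforming_strings = on (the default since PostgreSQL 9.1).
SELECT length('\t')  AS plain_len,    -- 2: a backslash and the letter t
       length(E'\t') AS escaped_len;  -- 1: a real tab character

-- Strip whitespace everywhere in the string (note the 'g' flag).
SELECT regexp_replace(E' \tafsdfwef\t \n ', '\s+', '', 'g');         -- afsdfwef

-- Strip only leading and trailing whitespace.
SELECT regexp_replace(E' \tafsdfwef\t \n ', '^\s+|\s+$', '', 'g');   -- afsdfwef

-- Regex-free variant: btrim() with an explicit list of characters to trim.
SELECT btrim(E' \tafsdfwef\t \n ', E' \t\n\r');                      -- afsdfwef
```

Plain trim() with no character list removes only spaces, which is why the trim attempts in the question left the other characters in place even apart from the missing E prefix.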

regexp_replace in PostgreSQL

I want to turn the string E-*-[_]F* into E-*-\_F*. The code I am using is below.
select regexp_replace('E-*-[_]F*','-[\[(.)\]]', E'\\', 'g'); -- E-*\_]F*
I am not able to remove the closing bracket.
Assuming you want the character inside the brackets to be placed after a backslash:
jasen=# select regexp_replace('E-*-[_]F*','-\[(.)\]', '\\\1', 'g');
regexp_replace
----------------
E-*\_F*
(1 row)
The pattern looks for any character (.) between -[ and ];
the parentheses make it remember that character.
The whole matched part is replaced with a backslash, written as \\, followed by the first (and only) remembered part, \1.
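Note that the pattern above also consumes the hyphen in front of the bracket, whereas the target string in the question, E-*-\_F*, keeps it. If the hyphen should survive, a possible variant (a sketch, assuming standard_conforming_strings = on) simply leaves it out of the pattern:

```sql
-- Sketch: same capture-and-replace idea, but the hyphen is not part of the match.
-- '\[(.)\]' matches "[", one remembered character, "]".
-- '\\\1'    is a literal backslash (\\) followed by the remembered character (\1).
SELECT regexp_replace('E-*-[_]F*', '\[(.)\]', '\\\1', 'g');
-- E-*-\_F*
```

The original attempt failed because [\[(.)\]] is a single bracket expression, so it matches just one character from the set (here the opening bracket) and never reaches the closing one.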

LIKE clause with \ character in PostgreSQL

I have this behavior in PostgreSQL 9.3:
-- (1) this "doesn't" work
select 't\om' like '%t\om%'
-- result = false
-- (2) this works
select 't/om' like '%t/om%'
-- result = true
Why is the (1) query result false? What is the best way to get true in (1) query?
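The \ has no special meaning in SQL except inside the pattern of a LIKE condition, where it is the default escape character for the wildcards. In query (1) the pattern's \o is therefore read as an escaped, literal o, so the pattern looks for tom, which 't\om' does not contain.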
But you can define a different escape character for LIKE which then makes the \ a "normal" character:
select 't\om' like '%t\om%' escape '#';
Edit:
As Sunrelax has commented, you can also use an empty string as the "escape" sequence, which disables escaping altogether:
select 't\om' like '%t\om%' escape '';
\ is the escape character in LIKE patterns, so you need to escape it, too:
select 't\om' like '%t\\om%';
There is also a configuration option you can set. See Escaping backslash in Postgresql
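When the search term comes from user input and the default escape character is kept, one possible approach is to escape the backslash first and then the two wildcards before building the pattern. This is only a sketch: the column name term is hypothetical, and it assumes standard_conforming_strings = on, so that '\' is a one-character string.

```sql
-- Sketch: escape \, % and _ in the search term, then wrap it in wildcards.
-- The backslash must be replaced first, otherwise the backslashes added
-- for % and _ would themselves be doubled.
SELECT 't\om' LIKE (
         '%' || replace(replace(replace(term, '\', '\\'), '%', '\%'), '_', '\_') || '%'
       ) AS matches
FROM (VALUES ('t\om')) AS v(term);
-- matches: true
```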

Column names with line breaks

I know that for text strings in PostgreSQL, line breaks can be written uniformly by putting the symbol E or e in front of the string:
SELECT E'first\nsecond'
results in:
first
second
But PostgreSQL also supports line breaks within column names. I am not sure why, or how evil this practice is, but one can do the following:
CREATE TABLE One("first\nsecond" text);
CREATE TABLE Two("first
second" text);
When you are unfortunate enough to run into one of these, you would find that while these queries work:
SELECT "first\nsecond" from One;
SELECT "first
second" from Two;
these ones do not:
SELECT "first
second" from One;
SELECT "first\nsecond" from Two;
My question is: Is there a way in PostgreSQL that unifies such differences, similar to the situation with the column values?
I have tried putting E in front of the "first\nsecond" column name, but that is not supported. Trying \r\n instead (I'm using Windows) gave me a third kind of column name, one that can only be queried as:
SELECT "first\r\nsecond" FROM Third
Column names are identifiers, and the gory details of the syntax for identifiers are described at:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
TL;DR: use the U&"..." syntax to inject non-printable characters into identifiers through their Unicode codepoints, and there's no way to unify CR,LF with LF alone.
How to refer to the column in a single line
We're allowed to use Unicode escape sequences in identifiers, so per documentation, the following does work:
select U&"first\000asecond" from Two;
if it's just a newline character between the two words.
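If you are free to change the schema, the same U&"..." form can also be used once to rename the column to something that no longer needs quoting. This is a sketch under that assumption; the new name first_second is purely illustrative.

```sql
-- Sketch: rename the column whose name contains a newline (U+000A).
-- first_second is a hypothetical replacement name.
ALTER TABLE Two RENAME COLUMN U&"first\000asecond" TO first_second;

-- From then on the column can be referenced without quoting.
SELECT first_second FROM Two;
```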
What happens with the queries on the first table
The table is created with:
CREATE TABLE One("first\nsecond" text);
As the backslash character has no special meaning here, this column does not contain any newline.
It contains first followed by \ followed by n followed by second.
So:
SELECT "first\nsecond" from One;
does work because it's the same as what's in the CREATE TABLE
whereas
SELECT "first
second" from One;
fails because there's a newline in that SELECT, whereas the actual column name in the table has a backslash followed by an n.
What happens with the queries on the second table
This is the opposite of "One".
CREATE TABLE Two("first
second" text);
The newline is taken verbatim and is part of the column.
So
SELECT "first
second" from Two;
works because the newline is there exactly as in the CREATE TABLE,
with an embedded newline,
whereas
SELECT "first\nsecond" from Two;
fails because, as before, \n in this context does not mean a newline.
Carriage Return followed by Newline, or anything weirder
As mentioned in comments and your edit, this could be carriage return and newline instead, in which case the following should do:
select U&"first\000d\000asecond" from Two;
although in my test, hitting Enter in the middle of a column with psql on Unix and Windows has the same effect: a single newline in the column's name.
To check what exact characters ended up in a column name, we can inspect them in hexadecimal.
When applied to your create table example, from inside psql under Unix:
CREATE TABLE Two("first
second" text);
select convert_to(column_name::text,'UTF-8')
from information_schema.columns
where table_schema='public'
and table_name='two';
The result is:
convert_to
----------------------------
\x66697273740a7365636f6e64
For more complex cases (e.g. non-ascii characters with several bytes in UTF-8), a more advanced query might help, for easy-to-read codepoints:
select c,lpad(to_hex(ascii(c)),4,'0') from (
select regexp_split_to_table(column_name::text,'') as c
from information_schema.columns
where table_schema='public'
and table_name='two'
) as g;
c | lpad
---+------
f | 0066
i | 0069
r | 0072
s | 0073
t | 0074
+| 000a
|
s | 0073
e | 0065
c | 0063
o | 006f
n | 006e
d | 0064

Using Perl to move commas from end of line to beginning of line

I've inherited a few dozen sql scripts that look like this:
select
column_a,
column_b,
column_c
from
my_table
To format them so they match the rest of our sql library, I'd like to change them to look like this:
select
column_a
,column_b
,column_c
from
my_table
where the commas start at the beginning of the line instead of at the end. I've taken a few passes at this in Perl, but haven't been able to get it to work just right.
Can any of you Perl gods provide some enlightenment here?
perl -pi.bak -0777 -wle's/,[^\n\S]*\n([^\n\S]*)/\n$1,/g' file1.sql file2.sql ...
The character class [^\n\S] matches any whitespace character other than a newline.
-0777 causes it to operate on whole files, not lines.