Do Postgres string functions (including regexp_*) treat tabs as whitespace? - postgresql

I am running some SQL against a Postgres 9.5 server.
One field sometimes has leading whitespace, including literal spaces and tab characters ('\t'), that I want to strip.
In many programming languages this is easy to do with a regexp replace, like this in JavaScript:
> ' \tafsdfwef\t \n'.replace(/\s+/g, '')
'afsdfwef'
Then I found that PostgreSQL also has a regexp_replace function, and that it also supports \s as shorthand for [[:space:]]:
https://www.postgresql.org/docs/10/functions-matching.html#FUNCTIONS-POSIX-REGEXP
But \s seems to recognize only literal spaces (' '). The question is: does the PostgreSQL regex \s cover all kinds of whitespace (tabs, newlines)?
db=> SELECT regexp_replace('\tafsdfwef', '\s+', '');
regexp_replace
----------------
\tafsdfwef
(1 row)
db=> SELECT regexp_matches('\tafsdfwef', '\s+');
regexp_matches
----------------
(0 rows)
Then I tested whether the trim function recognizes the other whitespace characters. It seems not:
db=> SELECT trim('\tafsdfwef\t');
btrim
--------------
\tafsdfwef\t
(1 row)
db=> SELECT trim(' \tafsdfwef\t');
btrim
--------------
\tafsdfwef\t
(1 row)
db=> SELECT trim(' \tafsdfwef\t \n ');
btrim
------------------
\tafsdfwef\t \n
(1 row)
So, does PostgreSQL have an easy function that strips all kinds of whitespace at the start, in the middle, and at the end of a string?
EDIT: My complaint is also about the PostgreSQL documentation. It maps \s to [[:space:]], but that does not seem to cover all kinds of whitespace as most programmers understand it, and it mentions POSIX regex although the behavior isn't really POSIX.
Does anyone know the right place to file a bug about this?
https://www.postgresql.org/docs/10/functions-matching.html#FUNCTIONS-POSIX-REGEXP
EDIT: here is what \s means according to the Mozilla JavaScript documentation:
a single white space character, including space, tab, form feed, line feed and other Unicode spaces. Equivalent to [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff].
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

Yes, Postgres regexp functions do treat a tab as whitespace. However, the text '\tafsdfwef' does not contain a tab character: in a plain string constant the backslash is taken literally (with the default standard_conforming_strings = on). To get a tab (or other backslash escapes) into the string, write the letter E (upper or lower case) immediately before the opening single quote:
SELECT regexp_replace(E'\ta\nb\fc\rd', '\s', '', 'g');
regexp_replace
----------------
abcd
(1 row)
Read about string constants in the documentation.
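The distinction is the same one many languages draw between raw and escaped string literals. As a rough analogy only (this uses Python's re module, not the Postgres regex engine, and the names plain, escaped and strip_whitespace are made up for illustration), a raw string behaves like a plain '...' constant and a normal string behaves like E'...':

```python
import re

# Analogy only: Python raw string ~ Postgres '...' (backslash kept literally),
# Python normal string ~ Postgres E'...' (backslash escapes are interpreted).
plain = r'\tafsdfwef'    # backslash, letter t, then the word: no tab inside
escaped = '\tafsdfwef'   # starts with a real tab character


def strip_whitespace(text):
    # Remove every run of whitespace, wherever it occurs in the string.
    return re.sub(r'\s+', '', text)


print(len(plain), len(escaped))    # 10 9
print(strip_whitespace(plain))     # \tafsdfwef  (nothing to strip)
print(strip_whitespace(escaped))   # afsdfwef
```

The plain version comes back unchanged for the same reason the queries in the question did: there was never any whitespace in it.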

Related

Postgres escape double quotes

I am working with a malformed database which seems to have double quotes as part of the column names.
Example:
|"Market" |
|---------|
|Japan |
|UK |
|USA |
And I want to select like below
SELECT "\"Market\"" FROM mytable; /* Does not work */
How does one select such a thing?
The documentation says
[A] delimited identifier or quoted identifier […] is formed by enclosing an arbitrary sequence of characters in double-quotes ("). […]
Quoted identifiers can contain any character, except the character with code zero. (To include a double quote, write two double quotes.)
So you'll want to use
SELECT """Market""" AS "Market" FROM mytable;
An alternative would be
A variant of quoted identifiers allows including escaped Unicode characters identified by their code points. This variant starts with U& (upper or lower case U followed by ampersand) immediately before the opening double quote, without any spaces in between, for example U&"foo". […] Inside the quotes, Unicode characters can be specified in escaped form by writing a backslash followed by the four-digit hexadecimal code point number or alternatively a backslash followed by a plus sign followed by a six-digit hexadecimal code point number.
which in your case would mean
SELECT U&"\0022Market\0022" AS "Market" FROM mytable;
SELECT U&"\+000022Market\+000022" AS "Market" FROM mytable;
Disclaimer: your database may not actually have double quotes as part of the name itself. As mentioned in the comments, this might just be how the tool you are using displays a column named Market (not market), since
Quoting an identifier also makes it case-sensitive
So all you might need could be
SELECT "Market" FROM mytable;
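If such identifiers have to be generated from client code, the doubling rule quoted above is easy to apply mechanically. The following is a minimal sketch, assuming a hypothetical client-side helper named quote_ident; on the server side PostgreSQL offers its own quote_ident() function and the %I specifier of format(), which are usually the better choice when building SQL inside the database.

```python
def quote_ident(name: str) -> str:
    # Hypothetical helper: always wrap the name in double quotes and
    # double any double quote embedded in it.
    if "\x00" in name:
        raise ValueError("identifiers cannot contain the character with code zero")
    return '"' + name.replace('"', '""') + '"'


print(quote_ident('"Market"'))   # """Market"""
print(quote_ident('Market'))     # "Market"
```

Unlike the server-side function, this sketch quotes unconditionally, so the result is always case-sensitive.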

Error when running postgresql COPY Command

I would like to be able to use characters like '-' in the schema name when running the COPY command in PostgreSQL. Is there any way to get around this? Thanks!
psql -d postgres -c "\COPY (SELECT * FROM test-schema.tableName) TO data.csv DELIMITER ',' CSV"
ERROR: syntax error at or near "-"
LINE 1: COPY ( SELECT * FROM test-schema.tableName ) TO STDOUT DELIMITER ',...
Yes, though I tend to discourage it.
Identifiers:
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable. The SQL standard will not define a key word that contains digits or starts or ends with an underscore, so identifiers of this form are safe against possible conflict with future extensions of the standard.
There is a second kind of identifier: the delimited identifier or quoted identifier. It is formed by enclosing an arbitrary sequence of characters in double-quotes (").
So:
create schema "test-schema";
CREATE SCHEMA
\dn "test-schema"
List of schemas
Name | Owner
-------------+----------
test-schema | postgres
create table "test-schema"."test-table"(id int);
select * from test-schema."test-table";
ERROR: syntax error at or near "-"
LINE 1: select * from test-schema."test-table";
select * from "test-schema"."test-table";
id
----
(0 rows)
As you can see, if you double-quote an identifier to get around the identifier naming rules, you are then bound to always quote it.
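One extra wrinkle for the command in the question: it is passed through a shell inside double quotes, so the double quotes around the schema name would themselves need escaping. A minimal sketch of one way around that, assuming a POSIX shell and using Python's shlex.quote to build the command line (the database name and file name are the ones from the question; the variable names are illustrative):

```python
import shlex

# SQL from the question, with only the schema name double-quoted.
sql = "\\copy (SELECT * FROM \"test-schema\".tableName) TO data.csv DELIMITER ',' CSV"

# shlex.quote wraps the whole meta-command so that the shell passes it
# through unchanged, inner double and single quotes included.
command = "psql -d postgres -c " + shlex.quote(sql)
print(command)
```

The table name is left unquoted here on purpose: quoting tableName would make it case-sensitive, and it would then no longer match a table that was created without quotes.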

How can I put a psql meta-command in a psql variable to execute it later?

I create macros of sorts for psql by putting the SQL I want to run in a psql variable. I "call" the "macro" by just writing :variablename.
e.g.
psql=> \set example 'SELECT 1, ''string_literal'';'
psql=> :example
?column? | ?column?
----------+----------------
1 | string_literal
(1 row)
All good so far.
But now I want to toggle some psql settings as part of my macro. In this case, I want to set unaligned tuples-only mode only for this query, then restore it at the end.
How?
\set example '\a SELECT 1, ''string_literal''; \a'
won't work:
ERROR: syntax error at or near "a"
LINE 1: a SELECT 1, 'string_literal';
^
The best solution I've found so far uses the following behaviours within psql meta command parsing:
Anything within 'single quotes' is subject to C-string-like backslash-escape processing for \n etc. This is important because it means \a expands to a, \\a expands to \a, etc.
Within a single quoted string, single-quotes may be escaped by doubling them, like in SQL literals, so '' expands to '.
Outside of a quoted string the special sequence \\ marks the end of meta-command arguments and resumes normal SQL reading.
Thus, it's valid to write:
psql=> \a\\ SELECT 1, 'string literal'; \a\\
Output format is unaligned.
?column?|?column?
1|string literal
(1 row)
Output format is aligned.
To put that in a psql variable, single quote it, doubling the backslashes and doubling any embedded single quotes:
psql=> \set example '\\a\\\\ SELECT 1, ''string literal''; \\a\\\\'
psql=> :example
Output format is unaligned.
?column?|?column?
1|string literal
(1 row)
Output format is aligned.
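The quoting step is mechanical, so it can be scripted when macros are generated rather than typed. A minimal sketch, assuming a hypothetical helper named psql_quote that applies exactly the two doublings described above:

```python
def psql_quote(text: str) -> str:
    # Hypothetical helper: single-quote text for use as a psql \set value.
    escaped = text.replace("\\", "\\\\")   # each backslash is doubled
    escaped = escaped.replace("'", "''")   # each single quote is doubled
    return "'" + escaped + "'"


# What should run when the macro is expanded:
macro = r"\a\\ SELECT 1, 'string literal'; \a\\"

print(r"\set example " + psql_quote(macro))
# \set example '\\a\\\\ SELECT 1, ''string literal''; \\a\\\\'
```

The printed line is the same \set command as the one shown above.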

regexp_replace in PostgreSQL

I want to turn the string E-*-[_]F* into E-*-\_F*. The code I am using is below.
select regexp_replace('E-*-[_]F*','-[\[(.)\]]', E'\\', 'g'); -- E-*\_]F*
I am not able to remove the closing bracket.
Assuming you want the character inside the brackets to be placed after a backslash:
jasen=# select regexp_replace('E-*-[_]F*','-\[(.)\]', '\\\1', 'g');
regexp_replace
----------------
E-*\_F*
(1 row)
The pattern looks for any character (.) between -[ and ].
The parentheses make it remember the character.
The whole matched part is replaced with a backslash, represented by \\ , followed by the first (and only) remembered part \1.
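Note that the hyphen before the bracket is part of the match, so it is replaced as well: the result is E-*\_F*, whereas the target stated in the question is E-*-\_F*. The sketch below reproduces both behaviors with Python's re module, used here only as an analogy for this simple pattern (it is not the Postgres regex engine, and the variable names are illustrative); the second pattern is an assumed variant that leaves the hyphen out of the match.

```python
import re

source = 'E-*-[_]F*'

# Same pattern as the answer: the hyphen is matched and therefore replaced.
hyphen_consumed = re.sub(r'-\[(.)\]', r'\\\1', source)

# Assumed variant: match only the bracketed character, keeping the hyphen.
hyphen_kept = re.sub(r'\[(.)\]', r'\\\1', source)

print(hyphen_consumed)   # E-*\_F*
print(hyphen_kept)       # E-*-\_F*
```

In Postgres the corresponding change would be to drop the leading - from the pattern, all else being equal.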

Column names with line breaks

I know that for text strings in PostgreSQL, line breaks can be written uniformly by prefixing the literal with E or e:
SELECT E'first\nsecond'
results in:
first
second
But PostgreSQL also supports line breaks within column names. I'm not sure why, or how evil this practice is, but one can do the following:
CREATE TABLE One("first\nsecond" text);
CREATE TABLE Two("first
second" text);
When you are unfortunate enough to run into one of these, you would find that while these queries work:
SELECT "first\nsecond" from One;
SELECT "first
second" from Two;
these ones do not:
SELECT "first
second" from One;
SELECT "first\nsecond" from Two;
My question is: Is there a way in PostgreSQL that unifies such differences, similar to the situation with the column values?
I have tried putting E in front of "first\nsecond" column names, but it is not supported. Trying to put \r\n instead (I'm using Windows) gave me a third type of column names, one that can only be queried as:
SELECT "first\r\nsecond" FROM Third
Column names are identifiers, and the gory details of the syntax for identifiers are described at:
http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS
TL;DR: use the U&"..." syntax to inject non-printable characters into identifiers through their Unicode codepoints, and there's no way to unify CR,LF with LF alone.
How to refer to the column in a single line
We're allowed to use Unicode escape sequences in identifiers, so per documentation, the following does work:
select U&"first\000asecond" from Two;
if it's just a newline character between the two words.
What happens with the queries on the first table
The table is created with:
CREATE TABLE One("first\nsecond" text);
As the backslash character has no special meaning here, this column does not contain any newline.
It contains first followed by \ followed by n followed by second.
So:
SELECT "first\nsecond" from One;
does work because it's the same as what's in the CREATE TABLE
whereas
SELECT "first
second" from One;
fails because there's a newline in that SELECT, whereas the actual column name in the table has a backslash followed by an n.
What happens with the queries on the second table
This is the opposite of "One".
CREATE TABLE Two("first
second" text);
The newline is taken verbatim and is part of the column.
So
SELECT "first
second" from Two;
works because the newline is there exactly as in the CREATE TABLE,
with an embedded newline,
whereas
SELECT "first\nsecond" from Two;
fails because, as before, \n in this context does not mean a newline.
Carriage Return followed by Newline, or anything weirder
As mentioned in comments and your edit, this could be carriage return and newline instead, in which case the following should do:
select U&"first\000d\000asecond" from Two;
although in my test, hitting Enter in the middle of a column with psql on Unix and Windows has the same effect: a single newline in the column's name.
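When the awkward name is already known to client code, the U& spelling can be generated instead of written by hand. A minimal sketch, assuming a hypothetical helper named unicode_ident; it escapes every character that is not printable ASCII, and also the double quote and the backslash, using the four-digit and six-digit forms described in the documentation:

```python
def unicode_ident(name: str) -> str:
    # Hypothetical helper: spell an identifier as U&"..." on a single line.
    parts = []
    for ch in name:
        cp = ord(ch)
        if cp == 0:
            raise ValueError("identifiers cannot contain the character with code zero")
        if ch in '"\\' or not (ch.isascii() and ch.isprintable()):
            # Four hex digits for the basic plane, \+ and six digits beyond it.
            parts.append(f"\\+{cp:06x}" if cp > 0xFFFF else f"\\{cp:04x}")
        else:
            parts.append(ch)
    return 'U&"' + "".join(parts) + '"'


print(unicode_ident('first\nsecond'))     # U&"first\000asecond"
print(unicode_ident('first\r\nsecond'))   # U&"first\000d\000asecond"
```

Escaping the double quote and the backslash as code points is a conservative choice made for this sketch, so that no second escaping rule is needed inside the quotes.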
To check what exact characters ended up in a column name, we can inspect them in hexadecimal.
When applied to your create table example, from inside psql under Unix:
CREATE TABLE Two("first
second" text);
select convert_to(column_name::text,'UTF-8')
from information_schema.columns
where table_schema='public'
and table_name='two';
The result is:
convert_to
----------------------------
\x66697273740a7365636f6e64
For more complex cases (e.g. non-ascii characters with several bytes in UTF-8), a more advanced query might help, for easy-to-read codepoints:
select c,lpad(to_hex(ascii(c)),4,'0') from (
select regexp_split_to_table(column_name::text,'') as c
from information_schema.columns
where table_schema='public'
and table_name='two'
) as g;
c | lpad
---+------
f | 0066
i | 0069
r | 0072
s | 0073
t | 0074
+| 000a
|
s | 0073
e | 0065
c | 0063
o | 006f
n | 006e
d | 0064
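Both outputs can be cross-checked outside the database. A minimal sketch, assuming two hypothetical helpers (decode_bytea_hex and code_points) and taking the hex string from the convert_to result above:

```python
def decode_bytea_hex(value: str) -> str:
    # Decode the hex form in which psql displays bytea (\x...) as UTF-8 text.
    if value.startswith("\\x"):
        value = value[2:]
    return bytes.fromhex(value).decode("utf-8")


def code_points(text: str):
    # Pair every character with its code point as four hex digits.
    return [(ch, format(ord(ch), "04x")) for ch in text]


name = decode_bytea_hex(r"\x66697273740a7365636f6e64")
for ch, cp in code_points(name):
    print(repr(ch), cp)
```

The sixth pair printed is the newline with code point 000a, which matches the row that psql displays with a + continuation marker.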