T-SQL to Postgres wildcard logic - postgresql

I am in the process of converting some t-sql queries to postgres and I am having trouble wrapping my head around the postgres wildcard logic.
For example, the following query yields 'A' in T-SQL, but in Postgres it returns 'B':
Select
case when 'abcd 1234' like '%[a-z]%[0-9]%' then 'A' else 'B' end as Q1
What would be the postgres equivalent to the above case when statement?
Furthermore, does anyone have a general rule of thumb for converting T-SQL string logic to Postgres?
Thanks in advance!

The difference that you're running into here is that SQL Server's TSQL accepts character range wildcards through the square bracket [] syntax but PostgreSQL does not.
Instead, PostgreSQL supports POSIX regular expressions within a query through the regex match operators (variations of ~), used in place of LIKE. These offer quite a bit of flexibility with respect to case sensitivity and string matching.
Restating your original query in a POSIX RegEx syntax to achieve an output of 'A' will resemble this:
Select
case when 'abcd 1234' ~ '(.*)[a-z](.*)[0-9](.*)' then 'A'
else 'B' end as Q1
As for the notion of general heuristics for handling these sorts of conversions, I would suggest that T-SQL character-set wildcards should be implemented as POSIX regular expressions using the RegEx match operator rather than LIKE. Otherwise, the T-SQL % and _ wildcards behave equivalently to the same PostgreSQL wildcards.
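As a rough illustration of that rule of thumb, the sketch below compares a few PostgreSQL spellings of the T-SQL pattern from the question. SIMILAR TO is not mentioned above; it is included here because it keeps the % and _ wildcards while also accepting bracket expressions. Note that whether T-SQL's LIKE is case sensitive depends on the SQL Server collation, so ~* (case-insensitive match) may be the closer translation in some databases.

```sql
-- T-SQL original: 'abcd 1234' LIKE '%[a-z]%[0-9]%'
SELECT
    'abcd 1234' ~  '[a-z].*[0-9]'           AS regex_match,        -- true (regex is unanchored, so no leading/trailing .* needed)
    'ABCD 1234' ~* '[a-z].*[0-9]'           AS regex_match_nocase, -- true (~* ignores case)
    'abcd 1234' SIMILAR TO '%[a-z]%[0-9]%'  AS similar_to_match,   -- true (must match the whole string, like LIKE)
    'abcd 1234' LIKE '%[a-z]%[0-9]%'        AS plain_like;         -- false (brackets are literal characters in LIKE)
```

Which form to prefer is a judgment call: SIMILAR TO lets the T-SQL pattern be reused almost verbatim, while the ~ operators are the more common idiom in PostgreSQL.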
References:
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/like-transact-sql?view=sql-server-ver15
https://www.postgresql.org/docs/current/functions-matching.html#LIKE
https://www.postgresql.org/docs/current/functions-matching.html#POSIX-BRACKET-EXPRESSIONS

Related

Eliminate accents of a string in postgresql [duplicate]

In Microsoft SQL Server, it's possible to specify an "accent insensitive" collation (for a database, table or column), which means that it's possible for a query like
SELECT * FROM users WHERE name LIKE 'João'
to find a row with a Joao name.
I know that it's possible to strip accents from strings in PostgreSQL using the unaccent_string contrib function, but I'm wondering if PostgreSQL supports these "accent insensitive" collations so the SELECT above would work.
Update for Postgres 12 or later
Postgres 12 adds nondeterministic ICU collations, enabling case-insensitive and accent-insensitive grouping and ordering. The manual:
ICU locales can only be used if support for ICU was configured when PostgreSQL was built.
If so, this works for you:
CREATE COLLATION ignore_accent (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
CREATE INDEX users_name_ignore_accent_idx ON users(name COLLATE ignore_accent);
SELECT * FROM users WHERE name = 'João' COLLATE ignore_accent;
Read the manual for details.
This blog post by Laurenz Albe may help to understand.
But ICU collations also have drawbacks. The manual:
[...] they also have some drawbacks. Foremost, their use leads to a
performance penalty. Note, in particular, that B-tree cannot use
deduplication with indexes that use a nondeterministic collation.
Also, certain operations are not possible with nondeterministic
collations, such as pattern matching operations. Therefore, they
should be used only in cases where they are specifically wanted.
My "legacy" solution may still be superior:
For all versions
Use the unaccent module for that - which is completely different from what you are linking to.
unaccent is a text search dictionary that removes accents (diacritic
signs) from lexemes.
Install once per database with:
CREATE EXTENSION unaccent;
If you get an error like:
ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory
Install the contrib package on your database server like instructed in this related answer:
Error when creating unaccent extension on PostgreSQL
Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).
SELECT *
FROM users
WHERE unaccent(name) = unaccent('João');
Index
To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for indexes. If a function can return a different result for the same input, the index could silently break.
unaccent() only STABLE not IMMUTABLE
Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:
It depends on the behavior of a dictionary.
There is no hard-wired connection to this dictionary.
It therefore also depends on the current search_path, which can change easily.
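To see the restriction in practice, a minimal sketch follows. It assumes the unaccent extension is installed and the users table from the question exists; the index name is made up for illustration.

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- An expression index directly on the STABLE function is rejected:
-- CREATE INDEX users_unaccent_direct_idx ON users (unaccent(name));
-- ERROR:  functions in index expression must be marked IMMUTABLE

-- Inspect the declared volatility of both unaccent() variants:
-- 'i' = immutable, 's' = stable, 'v' = volatile
SELECT p.oid::regprocedure AS function_signature,
       p.provolatile
FROM   pg_proc p
WHERE  p.proname = 'unaccent';
```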
Some tutorials on the web instruct to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.
Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).
There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.
Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:
Best for now
This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired, schema-qualified function and dictionary.
Since nesting a non-immutable function would disable function inlining, base it on a copy of the C-function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper. Not meant to be used on its own.
The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
RETURNS text
LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
Then:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;
In Postgres 14 or later, an SQL-standard function is slightly cheaper, yet:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
BEGIN ATOMIC
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1);
END;
See:
What does BEGIN ATOMIC mean in a PostgreSQL SQL function / procedure?
Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.
public being the schema where you installed the extension (public is the default).
The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. Still (Postgres 12) not too obvious from documentation.
If you lack the necessary privileges to create C functions, you are back to the second best implementation: An IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.unaccent('public.unaccent', $1) -- schema-qualify function and dictionary
$func$;
Finally, the expression index to make queries fast:
CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));
Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
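A minimal sketch of rebuilding the index after such a change, assuming the index name used above:

```sql
-- Rebuild the expression index so stored values reflect the current
-- function and dictionary behavior (blocks writes to the table meanwhile):
REINDEX INDEX users_unaccent_name_idx;

-- In Postgres 12 or later, the rebuild can run without blocking writes:
REINDEX INDEX CONCURRENTLY users_unaccent_name_idx;
```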
Adapt queries to match the index (so the query planner will use it):
SELECT * FROM users
WHERE f_unaccent(name) = f_unaccent('João');
We don't need the function in the expression to the right of the operator. There we can also supply unaccented strings like 'Joao' directly.
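For illustration, a sketch of the same query with an already unaccented literal on the right-hand side. It assumes the users table, f_unaccent() and the expression index defined above; EXPLAIN is only there to check whether the planner picks the index, which also depends on table size and statistics.

```sql
-- The left-hand expression must match the indexed expression;
-- the right-hand side can be any plain, unaccented string.
EXPLAIN
SELECT *
FROM   users
WHERE  public.f_unaccent(name) = 'Joao';
```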
The faster function does not translate to much faster queries using the expression index. Index look-ups operate on pre-computed values and are very fast either way. But index maintenance and queries not using the index benefit. And access methods like bitmap index scans may have to recheck values in the heap (the main relation), which involves executing the underlying function. See:
"Recheck Cond:" line in query plans with a bitmap index scan
Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:
'text search dictionary “unaccent” does not exist' entries in postgres log, supposedly during automatic analyze
Ligatures
In Postgres 9.5 or older ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S
You will love this update to unaccent in Postgres 9.6:
Extend contrib/unaccent's standard unaccent.rules file to handle all
diacritics known to Unicode, and expand ligatures correctly (Thomas
Munro, Léonard Benedetti)
Bold emphasis mine. Now we get:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
OE AE oe ae ss
Pattern matching
For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GIST expression index. Example for GIN:
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
Can be used for queries like:
SELECT * FROM users
WHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
GIN and GIST indexes are more expensive (to maintain) than plain B-tree:
Difference between GiST and GIN index
There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
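One such simpler option for left-anchored patterns, sketched under assumptions: a plain B-tree expression index with the text_pattern_ops operator class, which is typically needed for LIKE prefix matching when the database does not use the C locale. The index name is hypothetical, and f_unaccent() is the wrapper defined above.

```sql
-- B-tree expression index usable for left-anchored LIKE patterns
CREATE INDEX users_unaccent_name_pattern_idx
ON users (public.f_unaccent(name) text_pattern_ops);

-- Prefix search: the pattern is anchored at the start, wildcard only at the end.
-- Because f_unaccent() is IMMUTABLE, the pattern should be folded to a
-- constant at planning time.
SELECT *
FROM   users
WHERE  public.f_unaccent(name) LIKE public.f_unaccent('João') || '%';
```

If the search term comes from user input, any % or _ characters in it act as wildcards unless escaped first.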
pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:
PostgreSQL accent + case insensitive search
No, PostgreSQL does not support collations in that sense
PostgreSQL does not support collations like that (accent insensitive or not) because no comparison can return equal unless things are binary-equal. This is because internally it would introduce a lot of complexities for things like a hash index. For this reason collations in their strictest sense only affect ordering and not equality.
Workarounds
Full-Text-Search Dictionary that Unaccents lexemes.
For FTS, you can define your own dictionary using unaccent,
CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
Which you can then index with a functional index,
-- Just some sample data...
CREATE TABLE myTable ( myCol )
AS VALUES ('fóó bar baz'),('qux quz');
-- No index required, but feel free to create one
CREATE INDEX ON myTable
USING GIST (to_tsvector('mydict', myCol));
You can now query it very simply
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar';
mycol
-------------
fóó bar baz
(1 row)
See also
Creating a case-insensitive and accent/diacritics insensitive search on a field
Unaccent by itself.
The unaccent module can also be used by itself without FTS-integration, for that check out Erwin's answer
I'm pretty sure PostgreSQL relies on the underlying operating system for collation. It does support creating new collations, and customizing collations. I'm not sure how much work that might be for you, though. (Could be quite a lot.)

Postgresql: literal table names

I am making an application that needs to construct Postgresql queries that will execute successfully in scenarios when table names are reserved keywords etc.
In Sql Server syntax this is achieved by wrapping everything in square brackets [] i.e. SELECT * FROM [database].[schema].[table_name].
I thought the equivalent in Postgresql was the use of double quotes "" i.e. SELECT * FROM "database"."schema"."table_name".
However, when I try this in Postgresql I get the error
Relation X doesn't exist
This works:
SELECT * FROM "postgres"."schema_a".Academic_Attainment
But this doesn't:
SELECT * FROM "postgres"."schema_a"."Academic_Attainment"
Related to: Escaping keyword-like column names in Postgres
Any suggestions?
As documented in the manual unquoted identifiers are folded to lowercase.
A quoted identifier is also case sensitive, so "Foo" is a different name than "foo".
So the name Academic_Attainment is the same as academic_attainment. If you really insist on using those dreaded double quotes, then you need to use a lower case identifier:
SELECT *
FROM "schema_a"."academic_attainment"
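A small sketch of the folding rule in action. The schema and table names follow the question; the id column is made up for illustration.

```sql
CREATE SCHEMA IF NOT EXISTS schema_a;

-- Unquoted, so the name is folded and stored as academic_attainment
CREATE TABLE schema_a.Academic_Attainment (id int);

SELECT * FROM schema_a.Academic_Attainment;       -- works: folded to lower case
SELECT * FROM schema_a.ACADEMIC_ATTAINMENT;       -- works: folded to lower case
SELECT * FROM "schema_a"."academic_attainment";   -- works: quoted name matches the stored name

-- Fails, because the quoted mixed-case name is not the stored name:
-- SELECT * FROM "schema_a"."Academic_Attainment";
```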
In general it's strongly recommended to not use quoted identifiers at all. As a rule of thumb: never use double quotes and you are fine.
If you are constructing dynamic SQL, use the format() function to do that together with the %I placeholder. It will take care of quoting if necessary (and only then), e.g.
format('select * from %I.%I', 'public', 'some_table') yields select * from public.some_table but format('select * from %I.%I', 'public', 'group') yields select * from public."group"
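A sketch of how this might look in dynamic SQL, assuming a table schema_a.academic_attainment exists; the variable name is illustrative.

```sql
-- %I only adds double quotes when the identifier needs them
SELECT format('SELECT * FROM %I.%I', 'schema_a', 'academic_attainment');
-- SELECT * FROM schema_a.academic_attainment

SELECT format('SELECT * FROM %I.%I', 'schema_a', 'Academic_Attainment');
-- SELECT * FROM schema_a."Academic_Attainment"

-- Using the generated statement from PL/pgSQL
DO $$
DECLARE
    _row_count bigint;
BEGIN
    EXECUTE format('SELECT count(*) FROM %I.%I', 'schema_a', 'academic_attainment')
    INTO _row_count;
    RAISE NOTICE 'rows: %', _row_count;
END
$$;
```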
Unrelated to your question: Postgres doesn't support cross-database queries, so you should not get into the habit of including the database name in your fully qualified names. The syntax you are using only works because you are connected to the database postgres. So I would recommend that you stop using the database name in any table reference.

Is there a Natural Language Match function like the one in MySQL in PostgreSQL?

I was looking at the natural language match function in MySQL, which finds any strings matching a query and returns a match score for each result. Is there a similar function in PostgreSQL?
I am aware of the TSQuery function and was looking for something more similar to the said MySQL function.
I don't know what exactly MySQL's natural language match function does, but it makes me think of the following PostgreSQL features:
soundex, metaphone and dmetaphone from the fuzzystrmatch extension (soundex is somewhat old-fashioned, the others more state of the art)
the similarity operator % from the pg_trgm extension
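A brief sketch of those features, assuming both extensions are available on the server. The sample strings are arbitrary, and the ts_rank() call at the end is an addition not mentioned above: it is the built-in full text search function that returns a relevance score.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT similarity('natural language', 'natural languages') AS trigram_score,    -- number between 0 and 1
       'natural language' % 'natural languages'            AS above_threshold,  -- compares score to the pg_trgm threshold
       soundex('Robert') = soundex('Rupert')               AS sounds_alike,
       metaphone('Thompson', 4)                            AS metaphone_code,
       dmetaphone('Thompson')                              AS dmetaphone_code;

-- Relevance score from built-in full text search
SELECT ts_rank(to_tsvector('english', 'The quick brown fox'),
               plainto_tsquery('english', 'quick fox')) AS rank;
```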

Postgres stemming throwing out matches

The query
SELECT to_tsvector('recreation') @@ to_tsquery('recreatio:*');
returns false even though 'recreatio' is a prefix of 'recreation'. This seems to happen because 'recreation' is stored as its stem, 'recreat'. For example, if we deliberately break the stemming algorithm by running
SELECT to_tsvector('recreation1') @@ to_tsquery('recreatio:*');
the query returns true.
Is there a way to make the first query match?
Not sure if this answer is useful given the age of the question, but:
Concerning stemming
It seems you are right:
select ts_lexize('english_stem','recreation');
outputs
ts_lexize
-----------
{recreat}
(1 row)
and the documentation says
Also, * can be attached to a lexeme to specify prefix matching:
SELECT to_tsquery('supern:*A & star:A*B');
Such a lexeme will match any word in a tsvector that begins with the given string.
So it seems there is no way to make the original query match.
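A variation of the query (not the original one) does match if stemming is given up: the built-in simple configuration only lower-cases words, so the stored lexeme keeps its full spelling. This is a sketch of a trade-off, not a fix, since searches then no longer match across word forms. It assumes the default configuration in the question is english.

```sql
-- With stemming, the stored lexeme is shorter than the prefix being searched for
SELECT to_tsvector('english', 'recreation');   -- 'recreat':1

-- Without stemming, the prefix query matches
SELECT to_tsvector('simple', 'recreation') @@ to_tsquery('simple', 'recreatio:*');   -- true
```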
A solution based on partial matching
One could fall back to looking for partial matches between the stems and the query, e.g. using the pg_trgm extension:
SELECT (to_tsvector('recreation creation') @@ to_tsquery('recreatio:*')) or
'recreatio:*' % any (
select trim(both '''' from regexp_split_to_table(strip(to_tsvector('recreation creation'))::text, ' '))
);
(Maybe the array of stems can be formed in a more elegant way.)

How to select usernames start with a-to-c range in PostgreSQL?

I have a table with a username column. I want to select the usernames that start with a letter from a to c. What is the SQL syntax for this in PostgreSQL? Thanks.
There are a couple of different ways, but I'd probably use a regular expression with a character set match:
SELECT * FROM users WHERE username ~ '^[a-cA-C]';
or a substring search:
SELECT * FROM users WHERE lower(left(username,1)) BETWEEN 'a' AND 'c';
In older versions of PostgreSQL the left function isn't available, so you have to use substring(username from 1 for 1) instead.
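For reference, a sketch of the substring() variant alongside a case-insensitive form of the regular expression, assuming the same users table:

```sql
-- Same logic as the left() version, written with substring()
SELECT *
FROM   users
WHERE  lower(substring(username from 1 for 1)) BETWEEN 'a' AND 'c';

-- ~* matches case-insensitively, so the upper-case range is not needed
SELECT *
FROM   users
WHERE  username ~* '^[a-c]';
```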
See string functions and pattern matching for more information.