How to specify flags in regexp? - postgresql

My tableA has src_text field of type text. I want to be able to specify flags in my regexp in order to properly filter my text fields, like on this website.
select *
from tableA
where tableA.src_text ~ '/Par/gms'
This statement literally looks for '/Par/gms' string while I want it to filter by src_text fields with Par text inside it, using g, m and s regexp flags.
Thanks in advance for your time.

As GMB said, some flags like g only make sense if you are interested in the matching substrings, not if you want to test if the pattern matches or not.
You could use the function
regexp_match(string, pattern, flags)
which will return matching substrings. The result IS NOT NULL if the pattern matches.
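A minimal sketch of how this could look for the query in the question. PostgreSQL does not use the /pattern/flags notation: the pattern is written without delimiters, and the flags go in a separate argument or are embedded at the start of the pattern as (?flags). In PostgreSQL's flag letters, n turns on newline-sensitive matching, m is a historical synonym for n, and s (non-newline-sensitive matching) is already the default. The flag chosen below is only an illustration, not necessarily what the asker needs.

```sql
-- Rows whose src_text contains 'Par', using newline-sensitive matching ('n').
SELECT *
FROM tableA
WHERE regexp_match(src_text, 'Par', 'n') IS NOT NULL;

-- The same filter with the ~ operator, embedding the flag in the pattern.
SELECT *
FROM tableA
WHERE src_text ~ '(?n)Par';
```

regexp_match returns only the first match, so the g flag does not apply to it. If all matching substrings are needed, regexp_matches accepts g.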

Related

PostgreSQL pattern matching with Unicode graphemes

Is there any way to pattern match with Unicode graphemes?
As a quick example, when I run this query:
CREATE TABLE test (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (id),
UNIQUE (name)
);
INSERT INTO test (name) VALUES ('👍🏻 One');
INSERT INTO test (name) VALUES ('👍 Two');
SELECT * FROM public.test WHERE test.name LIKE '👍%';
I get both rows returned, rather than just '👍 Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match '👍 Two', because 👍🏻 is a different grapheme.
Is this possible?
It's a very interesting question!
I am not quite sure if it is possible anyway:
The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand 👍 which is followed by an emoji skin modifier 🏻
This is how the light skinned hand is stored internally. So, for me, your result makes sense:
When you query any string, that begins with 👍, it will return:
👍 Two (trivial)
👍_🏻 One (ignore the underscore, I try to suppress the automated ligature with this)
So, you can see, the light skinned emoji internally also starts with 👍. That's why I believe, that your query doesn't work the way you like.
Workarounds/Solutions:
You can add a space to your query. This ensures, that there's no skin modifier after your 👍 character. Naturally, this only works in your case, where all data sets have a space after the hand:
SELECT * FROM test WHERE name LIKE '👍 %';
You can simply extend the WHERE clause like this:
SELECT * FROM test
WHERE name LIKE '👍%'
AND name NOT LIKE '👍🏻%'
AND name NOT LIKE '👍🏼%'
AND name NOT LIKE '👍🏽%'
AND name NOT LIKE '👍🏾%'
AND name NOT LIKE '👍🏿%'
You can use regular expression pattern matching to exclude the skins:
SELECT * FROM test
WHERE name ~ '^👍[^🏻🏼🏽🏾🏿]*$'
See the demo on db<>fiddle (note that the fiddle does not seem to render the automated ligatures, so the two characters are displayed separately there).
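As a variant of the regular expression workaround, here is a sketch that checks only the character directly after the hand, using a negative lookahead. It assumes a UTF-8 database and the test table from the question. The five characters in the bracket expression are the skin tone modifiers U+1F3FB to U+1F3FF.

```sql
-- Sketch: names that start with the plain thumbs-up emoji.
-- The lookahead (?! ... ) rejects a skin tone modifier right after the hand,
-- but places no restriction on the rest of the string.
SELECT *
FROM test
WHERE name ~ '^👍(?![🏻🏼🏽🏾🏿])';
```

Unlike the [^...]*$ form, this does not rule out rows that contain a skin tone modifier later in the name. Like the other workarounds, it only covers skin tones and is not general grapheme matching.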

Searching for two Word wildcard strings that are nested

I'm having trouble finding the proper Word wildcard string to find numbers that fit the following patterns:
"NN NN NN" or "NN NN NN.NN" (where N is any number 0-9)
The trouble is the first string is a subset of the second string. My goal is to find a single wildcard string that will capture both. Unfortunately, I need to use an operator that is zero or more occurrences for the ".NN" portion and that doesn't exist.
I'm having to do two searches, and I'm using the following patterns:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}?[!0-9]
[0-9]{2}[^s ][0-9]{2}[^s ][0-9]{2}.[0-9]{2}
The problem is the first pattern. It works well unless the number is in a table cell or somewhere else where nothing follows it, so there is no character for the [!0-9] to match.
You could use a single wildcard Find:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1,4}
or:
[0-9]{2}[^s ][0-9]{2}[^s ][0-9][0-9.]{1;4}
to capture both. Which one you use depends on your regional settings (whether the list separator is a comma or a semicolon).

SIMILAR TO function in postgresql is not working as expected

I am using the code below to filter rows:
select column1 from tablename where code SIMILAR TO '%(-|_|–)EST[1-2][0-9](-|_)%'
For the column value -CSEST190-KCY18-04-01-L the condition passes, but I actually want to exclude this kind of data.
The values that should pass the above condition are:
-CS-EST19-0-KCY18-04-01-L
-CS_EST19-0-KCY18-04-01-L
Any suggestions, how to avoid this type of confusion?
Easiest way is to go full-regex, instead of using SQL standard SIMILAR TO.
select column1 from tablename where code ~ '[_–-]EST[12][0-9][_-]'
Notice that this does not have to match the full string, so you don't have to add .* on both ends (the equivalent of % in LIKE and SIMILAR TO). The reason you got a match on that value is the underscore _, which in SIMILAR TO is a wildcard for any single character.
Also, I switched the order in the square brackets so that the dash is the last character. That way it's treated as a literal character, not as a range specifier.
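To see the difference side by side, here is a small sketch that runs both conditions against the three values from the question. The inline VALUES list and the column aliases are only for illustration.

```sql
-- Compare the original SIMILAR TO pattern with the regex version.
SELECT code,
       code SIMILAR TO '%(-|_|–)EST[1-2][0-9](-|_)%' AS similar_to_match,
       code ~ '[_–-]EST[12][0-9][_-]'                AS regex_match
FROM (VALUES
        ('-CSEST190-KCY18-04-01-L'),
        ('-CS-EST19-0-KCY18-04-01-L'),
        ('-CS_EST19-0-KCY18-04-01-L')
     ) AS t(code);
```

The first value should come back true for SIMILAR TO, because each _ matches an arbitrary character (the S before EST and the 0 after 19), and false for the regex, where _ is a literal underscore. The other two values should be true in both columns.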

Find only exact word matches using SphinxQL

I'm trying to use Sphinx to find rows having words in their title column.
The query looks like this:
SELECT * FROM my_table WHERE MATCH ('@title "words"')
But it also returns rows having word (without the s) instead of words in the title.
What am I doing wrong?
Sounds like you have morphology (specifically stemming?) enabled on the index.
Should consider enabling index_exact_words
http://sphinxsearch.com/docs/current.html#conf-index-exact-words
which gives you exact form operator.
MATCH('@title =words')
Also gives you the possibility of the interesting expand_keywords option :)
http://sphinxsearch.com/docs/current.html#conf-expand-keywords
... or, if you don't ever want these matches, you could disable stemming :) Alas, there isn't a 'stemming optional' mode (e.g. a ~ fuzzy operator to stem specific words).

Postgresql Function to sort characters within a string

Is there a postgresql function, preferably native function, that can sort a string such as 'banana' to 'aaabnn'?
Algorithmic efficiency of sorting is not of much importance since words will never be too long. However, database join efficiency is of some but not critical importance.
There is no native function with such functionality but you can use regexp_split_to_table to do so as this:
select theword
from (select regexp_split_to_table('banana',E'(?=.)') theword) tab
order by theword;
The result will be:
theword
a
a
a
b
n
n
The pattern (?=.) is a lookahead that matches the empty position before each character, so the string is split into single characters and nothing is consumed as a separator. Spaces are returned as well. If your word contains spaces and you do not want them in the result, split on E'(\\s*)' instead, which also consumes any whitespace.
This behaviour is explained in the docs for regexp_split_to_table.
The E before the string marks an escape string constant, in which backslash sequences such as \\ are interpreted; see: What's the "E" before a Postgres string?
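If a single sorted string is wanted rather than one row per character, the characters can be glued back together with an ordered aggregate. This is a sketch of one way to do it, not the only one. It splits with string_to_array and a NULL delimiter (which yields one element per character) instead of a regular expression, and the alias names ch and sorted_word are made up for the example.

```sql
-- Split 'banana' into single characters, then concatenate them in sorted order.
SELECT string_agg(ch, '' ORDER BY ch) AS sorted_word
FROM unnest(string_to_array('banana', NULL)) AS ch;
-- sorted_word: aaabnn
```

The sort order follows the collation in effect, so results for mixed-case or non-ASCII input may differ between databases.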