PostgreSQL pattern matching with Unicode graphemes - postgresql

Is there any way to pattern match with Unicode graphemes?
As a quick example, when I run this query:
CREATE TABLE test (
id SERIAL NOT NULL,
name VARCHAR NOT NULL,
PRIMARY KEY (id),
UNIQUE (name)
);
INSERT INTO test (name) VALUES ('πŸ‘πŸ» One');
INSERT INTO test (name) VALUES ('πŸ‘ Two');
SELECT * FROM public.test WHERE test.name LIKE 'πŸ‘%';
I get both rows returned, rather than just 'πŸ‘ Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match 'πŸ‘ Two', because πŸ‘πŸ» is a different grapheme.
Is this possible?

It's a very interesting question!
I am not sure it is possible at all, and here is why:
The skin-toned emojis are, in fact, two joined characters (similar to ligatures). The first character is the yellow hand πŸ‘, which is followed by the emoji skin-tone modifier 🏻.
This is how the light-skinned hand is stored internally, so your result makes sense to me:
When you query any string that begins with πŸ‘, it will return:
πŸ‘ Two (trivial)
πŸ‘_🏻 One (ignore the underscore; it is only there to keep the two characters from being rendered as one emoji)
So you can see that the light-skinned emoji internally also starts with πŸ‘. That is why I believe your query doesn't work the way you would like.
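A quick way to see this in psql, assuming a UTF-8 database (length() counts code points, not graphemes):
SELECT length('πŸ‘') AS plain, length('πŸ‘πŸ»') AS light_skinned;
-- plain | light_skinned
-- ------+---------------
--     1 |             2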
Workarounds/Solutions:
You can add a space to your query. This ensures that there is no skin-tone modifier after the πŸ‘ character. Naturally, this only works in your case, where every row has a space after the hand:
SELECT * FROM test WHERE name LIKE 'πŸ‘ %';
You can simply extend the WHERE clause like this:
SELECT * FROM test
WHERE name LIKE 'πŸ‘%'
AND name NOT LIKE 'πŸ‘πŸ»%'
AND name NOT LIKE 'πŸ‘πŸΌ%'
AND name NOT LIKE 'πŸ‘πŸ½%'
AND name NOT LIKE 'πŸ‘πŸΎ%'
AND name NOT LIKE 'πŸ‘πŸΏ%'
You can use regular expression pattern matching to exclude the skin-tone modifiers:
SELECT * FROM test
WHERE name ~ '^πŸ‘[^🏻🏼🏽🏾🏿]*$'
See demo: db<>fiddle (note that the fiddle does not seem to render the combined emoji, so the two characters are displayed separately there)
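If your PostgreSQL version supports lookahead constraints in regular expressions, a variant of the same idea (a sketch, not tested against your data) is to reject a skin-tone modifier only when it immediately follows the leading πŸ‘, rather than banning modifiers anywhere in the string:
SELECT * FROM test
WHERE name ~ '^πŸ‘(?![🏻-🏿])';  -- negative lookahead: πŸ‘ not followed by a modifier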

Related

How can I use tsvector on a string with numbers?

I would like to use a postgres tsquery on a column that has strings that all contain numbers, like this:
FRUIT-239476234
If I try to make a tsquery out of this:
select to_tsquery('FRUIT-239476234');
What I get is:
'fruit' & '-239476234'
I want to be able to search by just the numeric portion of this value like so:
239476234
It seems that it is unable to match this because it is interpreting my hyphen as a "negative sign" and doesn't think 239476234 matches -239476234. How can I tell postgres to treat all of my characters as text and not try to be smart about numbers and hyphens?
An answer from the future: once version 13 of PostgreSQL is released, you will be able to use the dict_int module to do this.
CREATE EXTENSION dict_int;
ALTER TEXT SEARCH DICTIONARY intdict (MAXLEN = 100, ABSVAL=true);
ALTER TEXT SEARCH CONFIGURATION english ALTER MAPPING FOR int WITH intdict;
select to_tsquery('FRUIT-239476234');
to_tsquery
-----------------------
'fruit' & '239476234'
But you would probably be better off creating your own TEXT SEARCH DICTIONARY, as well as copying the 'english' CONFIGURATION and modifying the copy, rather than modifying the default ones in place. Otherwise you run the risk that an upgrade will silently lose your changes.
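A sketch of what that could look like (my_intdict and my_english are illustrative names; the options mirror the statements above, and ABSVAL still requires PostgreSQL 13):
-- own dictionary based on the dict_int template, own copy of 'english'
CREATE TEXT SEARCH DICTIONARY my_intdict ( TEMPLATE = intdict_template, MAXLEN = 100, ABSVAL = true );
CREATE TEXT SEARCH CONFIGURATION my_english ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION my_english ALTER MAPPING FOR int WITH my_intdict;
SELECT to_tsquery('my_english', 'FRUIT-239476234');
-- 'fruit' & '239476234'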
If you don't want to wait for v13, you could back-patch this change and compile into your own version of the extension for a prior server.
This is done by the text search parser, which is not configurable (short of writing your own parser in C, which is supported).
The simplest solution is to pre-process all search strings by replacing - with a space.
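For example, a minimal sketch of that pre-processing on the query side (the same replacement would presumably be needed when building the tsvector, so the stored lexemes match):
SELECT plainto_tsquery(replace('FRUIT-239476234', '-', ' '));
-- 'fruit' & '239476234'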

SIMILAR TO function in postgresql is not working as expected

I am using the code below to do the calculation:
select column1 from tablename where code SIMILAR TO '%(-|_|–)EST[1-2][0-9](-|_)%'
For the column value -CSEST190-KCY18-04-01-L the condition passes, but I actually want to exclude this type of data.
The correct values that should pass the above condition are:
-CS-EST19-0-KCY18-04-01-L
-CS_EST19-0-KCY18-04-01-L
Any suggestions, how to avoid this type of confusion?
The easiest way is to go full regex instead of using the SQL-standard SIMILAR TO.
select column1 from tablename where code ~ '[_–-]EST[12][0-9][_-]'
Notice this does not have to match the full string, so you don't have to add .* on both ends (the equivalent of % in LIKE and SIMILAR TO). The reason your unwanted value matched is the underscore _ in the pattern: in SIMILAR TO it is a single-character wildcard, so it matched the S before EST and the 0 after 19 in -CSEST190-.
Also, I moved the dash to the last position inside the square brackets, so that it is treated as a literal character rather than a range specifier.
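A quick way to check the pattern against the sample values from the question (the VALUES list is just for illustration):
SELECT code, code ~ '[_–-]EST[12][0-9][_-]' AS matches
FROM (VALUES ('-CSEST190-KCY18-04-01-L'),
             ('-CS-EST19-0-KCY18-04-01-L'),
             ('-CS_EST19-0-KCY18-04-01-L')) AS t(code);
-- only the last two rows return true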

Selecting words out of table which sound similar

I read an interesting article about English and phonetics, and I would like to see whether my newfound knowledge can be applied in T-SQL to generate a fuzzy result set. In one of my applications there is a table containing words, which I extracted from a word list. It is literally a one-column table:
Word |
------
A
An
Apple
...
their
there
Is there a built-in function in SQL Server to select a word that sounds the same, even though it is spelled differently? (The globalization settings are en-ZA, as of the last time I checked.)
SELECT Word FROM WordTable WHERE Word = <word that sounds similar>
SoundEx()
SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken.
Difference()
Returns an integer value that indicates the difference between the SOUNDEX values of two character expressions.
SELECT word
, SOUNDEX(word) AS word_soundex
, SOUNDEX(word_that_sounds_similar) AS similar_word_soundex
, DIFFERENCE(word, word_that_sounds_similar) AS how_similar
FROM wordtable
WHERE DIFFERENCE(word, word_that_sounds_similar) >= 3 /* quite close! */
DIFFERENCE() applies SOUNDEX to its arguments itself, so it is given the raw words above. The value it returns indicates how similar the two words are:
a value of 4 indicates a strong match and a value of 0 means little to no match.
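For a hedged illustration with two words from the question's table ('their' and 'there' share the SOUNDEX code T600):
SELECT SOUNDEX('their') AS soundex_their,           -- T600
       SOUNDEX('there') AS soundex_there,           -- T600
       DIFFERENCE('their', 'there') AS how_similar; -- 4 = strong match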

Can I use PIVOT without the brackets?

PIVOT
(
count(DueCount) FOR dueLibraries.s_folder IN ([Assembly Report-TUL],[Balance-TUL],[BOM-TUL],[Hydrostatic-TUL],[Inspection-TUL],[IOM Manual-TUL],[MTR-TUL],[NDT-TUL],[Performance-TUL],[Inputs - TUL],[Transmitted])
) as MonthlyTally
I'd rather just have this:
PIVOT
(
count(DueCount) FOR dueLibraries.s_folder IN (select * from dueLibraries)
) as MonthlyTally
Is there a way to do that?
Your question is two-fold, it seems.
First of all, the identifiers in your first snippet's IN list are delimited identifiers. They have to be delimited with square brackets because they do not follow the rules for regular identifiers in Transact-SQL (in particular because they include spaces and hyphens).
The second part of your question is about replacing the explicit list of columns with something like a mask, to make the list dynamic. There is no syntax available for that, so your only option seems to be a dynamic query that builds the PIVOT clause; one way it can be implemented is sketched below.
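A rough sketch of that dynamic approach (SourceQuery stands in for whatever query feeds your PIVOT, and STRING_AGG needs SQL Server 2017+; on older versions FOR XML PATH can build the list instead):
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);
-- build the bracketed column list from the folder names themselves
SELECT @cols = STRING_AGG(QUOTENAME(s_folder), ',')
FROM (SELECT DISTINCT s_folder FROM dueLibraries) AS f;
SET @sql = N'
SELECT *
FROM SourceQuery
PIVOT ( COUNT(DueCount) FOR s_folder IN (' + @cols + N') ) AS MonthlyTally;';
EXEC sp_executesql @sql;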

PostgreSQL ignores dashes when ordering

I have a PostgreSQL 8.4 database that is created with the da_DK.utf8 locale.
dbname=> show lc_collate;
lc_collate
------------
da_DK.utf8
(1 row)
When I select from a table and order on a character varying column, I get what I consider strange behaviour: when ordering the result, PostgreSQL ignores dashes that prefix the value, e.g.:
select name from mytable order by name asc;
May return something like
name
----------------
Ad...
Ae...
Ag...
- Ak....
At....
The dash prefix seems to be ignored.
I can fix this issue by converting the column to latin1 when ordering:
select name from mytable order by convert_to(name, 'latin1') asc;
Then I get the expected result:
name
----------------
- Ak....
Ad...
Ae...
Ag...
At....
Why does the dash prefix get ignored by default? Can that behavior be changed?
This is because the da_DK.utf8 locale defines it this way. Linux locale-aware utilities, for example sort, behave the same way.
Your convert_to(name, 'latin1') will break if it encounters a character that is not in the Latin-1 character set, for example €, so it isn't a good workaround.
You can use order by convert_to(name, 'SQL_ASCII'), which ignores the locale-defined sort order and simply compares byte values.
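On newer PostgreSQL versions (9.1 and later, so not the 8.4 in the question), a sketch of the same byte-wise ordering requested directly with the C collation:
SELECT name FROM mytable ORDER BY name COLLATE "C" ASC;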
Ugly hack edit:
order by
(
ascii(name) between ascii('a') and ascii('z')
or ascii(name) between ascii('A') and ascii('Z')
or ascii(name)>127
),
name;
This sorts first anything that starts with an ASCII non-letter. It is very ugly, because sorting further into the string would still behave strangely, but it may be good enough for you.
A workaround that will work in my specific case is to replace dashes with exclamation points. I happen to know that I will never get exclamation points, and they are sorted before any letters or digits.
select name from mytable order by translate(name, '-', '!') asc
It will certainly affect performance, so I may look into creating a special column for sorting, but I really don't like that either...
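One lighter-weight shape that "special column for sorting" idea could take is an expression index on the translated value, so nothing extra has to be stored (a sketch; the index name is illustrative):
CREATE INDEX mytable_name_sort_idx ON mytable (translate(name, '-', '!'));
SELECT name FROM mytable ORDER BY translate(name, '-', '!') ASC;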
I don't know what the ordering rules for Danish are, but in Polish, special characters like spaces and dashes are not "counted" when sorting in most dictionaries. Some good sort routines do the same and ignore such special characters. Danish probably has a similar rule, and it is implemented by Ubuntu's locale-aware sort functions.