Do text_pattern_ops comparators understand UTF-8? - postgresql

According to the PostgreSQL 9.2 documentation, if I am using a locale other than the C locale (en_US.UTF-8 in my case), btree indexes on text columns for supporting queries like
SELECT * from my_table WHERE text_col LIKE 'abcd%'
need to be created using text_pattern_ops like so
CREATE INDEX my_idx ON my_table (text_col text_pattern_ops)
Now section 11.9 of the documentation states that this results in a "character by character" comparison. Are these (non-wide) C characters or does the comparison understand UTF-8?

Good question. I'm not totally sure, but my tentative understanding is:
Here PostgreSQL means "real characters" (possibly multibyte), not bytes. The comparison always "understands UTF-8", with or without this special index.
The point is that, for locales with special (non-C) collation rules, we normally want to follow those rules (and call the respective locale libraries) when comparing (<, >, ...) and sorting. But we don't want to use those collations for POSIX regular expression matching and LIKE patterns. Hence the existence of two different types of indexes for text.

The operators in the text_pattern_ops operator class actually do a memcmp() on the strings, so the documentation is perhaps slightly inaccurate talking about characters.
But this doesn't really affect the question whether they support UTF-8. The indexing of pattern matching operations in the described fashion does support UTF-8. The underlying operators don't have to worry about the encoding.
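To illustrate why a byte-wise comparison does not need to know about the encoding, here is a small Python sketch (not PostgreSQL code; the sample words are made up). It relies on a property of UTF-8: ordering strings by their raw bytes gives the same order as comparing them code point by code point.

```python
# Sketch: byte-wise ordering of UTF-8 strings, similar in spirit to memcmp().
# The sample words are made up for illustration.
words = ["abcd", "abcdé", "abce", "abcdz", "abc", "abcd€", "éclair"]

# Order by the raw UTF-8 bytes, with no knowledge of characters.
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Python compares str values code point by code point.
by_code_points = sorted(words)

# Both orderings agree, so the byte-wise comparator never has to decode anything.
same_order = by_bytes == by_code_points
print(same_order)
print(by_bytes)
```

Note that this only shows the ordering property of UTF-8; it says nothing about language-specific collation rules, which is exactly what text_pattern_ops sidesteps.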

Related

single byte, locale and index scan in postgres

Is there any relation between single byte character locale and index scan in postgreSQL?
I already read the pages in https://www.postgresql.org/doc, but I could not find it.
No, not really. The important division is not between single-byte and multi-byte locales, but between the C (or POSIX) locale and all others.
The C locale is English with a sort order (collation) that sorts words character by character, with lower code points (for example, ASCII values) sorted before higher ones. Because this collation is much simpler than natural language collations, comparisons and sorting are much faster with the C locale. Moreover, and this is where indexes come in, a simple B-tree index on a string column can support LIKE conditions, because words are ordered character by character. With other collations, you need to use the text_pattern_ops operator class to support LIKE.
So if you can live with the strange sort order of the C collation, use the C locale by all means.
Do not use any single-byte locales, ever. Even with the C locale, use the encoding UTF8. There is no benefit in choosing any other encoding, and you will find it quite annoying if you encounter a character that you want to store in your database, but cannot.
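The reason a character-by-character sort order can serve LIKE 'prefix%' is that all strings sharing a prefix sit next to each other, so the match turns into a range lookup. A rough Python sketch of that idea, with a sorted list standing in for the B-tree (the function name prefix_range and the sample values are hypothetical, and the upper-bound trick assumes the last character of the prefix is not the highest possible code point):

```python
import bisect


def prefix_range(sorted_values, prefix):
    """Return all values starting with prefix, using two binary searches.

    sorted_values must be sorted code point by code point (as in the C locale).
    """
    if not prefix:
        return list(sorted_values)
    # Smallest string that sorts after every string starting with prefix:
    # bump the last character by one code point (assumes it is not the maximum).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]


# Hypothetical sample data.
values = sorted(["5551234567", "5550000000", "5561234567",
                 "4155550000", "555", "5549999999"])
result = prefix_range(values, "555")
print(result)
```

With a natural-language collation the strings sharing a prefix are not guaranteed to form such a range under the default comparison, which is why the pattern operator classes exist.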

PostgreSQL SELECT can alter a table?

So I'm new to SQL like databases and the place that I work at migrated to PostgreSQL. One table drastically reduced its contents. The point is, I only used SELECT statements, and changed the name of the columns with AS. Is there a way I might have changed the table data?
When you migrate from one DBMS to another, you must make sure that the objects created are strictly equivalent. The question seems trivial, but it isn't.
As a matter of fact, one important consideration for string literals (char/varchar...) is to check the collation used formerly against the collation you used to create the new database in PostgreSQL.
Collation in an RDBMS is the way to adjust the behavior of character strings with regard to certain parameters, such as whether upper and lower case letters are distinguished, whether diacritical characters (accents, ligatures...) are distinguished, language-specific sorting, etc. It constitutes a superset of the character encoding.
Did you verify this point when using a WHERE clause to search for string literals? If not, try applying the right collation to the literal (COLLATE operator), or use the UPPER function to avoid distinguishing between upper and lower case characters.

Should I save ASCII-only varchar in UTF-8 or ASCII?

I have a varchar column that contains only ASCII symbols. I don't need to sort by this field, but I need to search it by full equality.
Default locale is en.UTF8. Will I gain anything if I create this column with collate "C"?
Yes, it makes a difference.
Even if you do not sort deliberately, there are various operations requiring sort steps internally (some aggregate functions, DISTINCT, merge joins etc.).
Also, any index on the field has to sort values internally - and observe collation rules unless COLLATE "C" applies (no collation rules).
For searches by full equality you'll want an index - which works either way (for equality), but it's faster overall without collation rules. Depending on the details of your use case, the effect may be negligible or substantial. The impact grows with the length of your strings. I ran a benchmark on a related case some time ago:
Slow query ordering by a column in a joined table
Also, there are more pattern matching options with locale "C". The alternative would be to create an index with the special varchar_pattern_ops operator class.
Related:
PostgreSQL LIKE query performance variations
Operator “~<~” uses varchar_pattern_ops index while normal ORDER BY clause doesn't?
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Postgres 9.5 introduced performance improvements with a technique called "abbreviated keys", which ran into problems with some locales. So it was deactivated, except for the C locale. Quoting the release notes of Postgres 9.5.2:
Disable abbreviated keys for string sorting in non-C locales (Robert Haas)
PostgreSQL 9.5 introduced logic for speeding up comparisons of string
data types by using the standard C library function strxfrm() as a
substitute for strcoll(). It now emerges that most versions of glibc
(Linux's implementation of the C library) have buggy implementations
of strxfrm() that, in some locales, can produce string comparison
results that do not match strcoll(). Until this problem can be better
characterized, disable the optimization in all non-C locales. (C
locale is safe since it uses neither strcoll() nor strxfrm().)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according to
an affected locale and were built or modified under PostgreSQL 9.5.0
or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no
evidence of trouble in English-based locales such as en_US, but some
other popular locales such as de_DE are affected in most glibc
versions.
The problem also illustrates where collation rules come in, generally.

Create index on first 3 characters (area code) of phone field?

I have a Postgres table with a phone field stored as varchar(10), but we search on the area code frequently, e.g.:
select * from bus_t where bus_phone like '555%'
I wanted to create an index to facilitate with these searches, but I got an error when trying:
CREATE INDEX bus_ph_3 ON bus_t USING btree (bus_phone::varchar(3));
ERROR: 42601: syntax error at or near "::"
My first question is, how do I accomplish this, but also I am wondering if it makes sense to index on the first X characters of a field or if indexing on the entire field is just as effective.
Actually, a plain B-tree index is normally useless for pattern matching with LIKE (~~) or regexp (~), even with left-anchored patterns, if your installation runs on any locale other than "C", which is the typical case. Here is an overview of pattern matching and indices in a related answer on dba.SE
Create an index with the varchar_pattern_ops operator class (matching your varchar column) and be sure to read the chapter on operator classes in the manual.
CREATE INDEX bus_ph_pattern_ops_idx ON bus_t (bus_phone varchar_pattern_ops);
Your original query can use this index:
... WHERE bus_phone LIKE '555%'
Performance of a functional index on the first 3 characters as described in the answer by @a_horse is pretty much the same in this case.
-> SQLfiddle demo.
Generally a functional index on relevant leading characters would be a good idea, but your column has only 10 characters. Consider that the overhead per tuple is already 28 bytes. Saving 7 bytes is just not substantial enough to make a big difference. Add the cost for the function call and the fact that xxx_pattern_ops are generally a bit faster.
In Postgres 9.2 or later the index on the full column can also serve as covering index in index-only scans.
However, the more characters in the columns, the bigger the benefit from a functional index.
You may even have to resort to a prefix index (or some other kind of hash) if the strings get too long. There is a maximum length for indices.
If you decide to go with the functional index, consider using the xxx_pattern_ops variant for a small additional performance benefit. Be sure to read about the pros and cons in the manual and in Peter Eisentraut's blog entry:
CREATE INDEX bus_ph_3 ON bus_t (left(bus_phone, 3) varchar_pattern_ops);
Explain error message
You'd have to use the standard SQL cast syntax for functional indices. This would work, pretty much like the one with left(), but like @a_horse I'd prefer left().
CREATE INDEX bus_ph_3 ON bus_t USING btree (cast(bus_phone AS varchar(3)));
When using like '555%' an index on the complete column will be used just as well. There is no need to only index the first three characters.
If you do want to index only the first 3 characters (e.g. to save space), then you could use the left() function:
CREATE INDEX bus_ph_3 ON bus_t USING btree (left(bus_phone,3));
But in order for that index to be used, you would need to use that expression in your where clause:
where left(bus_phone,3) = '555';
But again: that is most probably overkill and the index on the complete column will be good enough and can be used for other queries as well e.g. bus_phone = '555-1234' which the index on just the first three characters would not.

Set Order By to ignore punctuation on a per-column basis

Is it possible to order the results of a PostgreSQL query by a title field that contains characters like [](),; etc but do so ignoring these punctuation characters and sorting only by the text characters?
I've read articles on changing the database collation or locale but have not found any clear instructions on how to do this on an existing database and on a per-column basis. Is this even possible?
"Normalize" for sorting
You could use regexp_replace() with the pattern '[^a-zA-Z]' in the ORDER BY clause but that only recognizes pure ASCII letters. Better use the class shorthand '\W' which recognizes additional non-ASCII letters in your locale like äüóèß etc.
Or you could improvise and "normalize" all characters with diacritic elements to their base form with the help of the unaccent() function. Consider this little demo:
SELECT *
, regexp_replace(x, '[^a-zA-Z]', '', 'g')
, regexp_replace(x, '\W', '', 'g')
, regexp_replace(unaccent(x), '\W', '', 'g')
FROM (
SELECT 'XY ÖÜÄöüäĆČćč€ĞğīїıŁłŃńŇňŐőōŘřŠšŞşůŽžż‘´’„“”­–—[](),;.:̈� XY'::text AS x) t
->SQLfiddle for Postgres 9.2.
->SQLfiddle for Postgres 9.1.
Regular expression code has been updated in version 9.2. I am assuming this is the reason for the improved handling in 9.2 where all letter characters in the example are matched, while 9.1 only matches some.
unaccent() is provided by the additional module unaccent. Run:
CREATE EXTENSION unaccent;
once per database to use it (Postgres 9.1+; older versions use a different technique).
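For readers who want to try the "normalize for sorting" idea outside the database, here is a rough Python analogue. It is only a sketch under assumptions: strip_accents() approximates unaccent() via Unicode decomposition (the real module uses its own rules file and handles more cases), Python's \W is not guaranteed to match exactly the same characters as the Postgres regex engine, the lower-casing is an addition of mine, and the sample titles are made up.

```python
import re
import unicodedata


def strip_accents(text):
    """Rough stand-in for unaccent(): decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(title):
    """Remove accents and non-word characters, then lower-case (my addition)."""
    return re.sub(r"\W", "", strip_accents(title)).lower()


# Hypothetical sample titles.
titles = ["[Zebra]", "(apple)", "Éclair;", "banana,", "...cherry"]
ordered = sorted(titles, key=sort_key)
print(ordered)
```

The original values are kept as they are; only the sort key is normalized, which mirrors putting the regexp_replace() expression in ORDER BY rather than changing the stored data.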
locales / collation
You must be aware that Postgres relies on the underlying operating system for locales (including collation). The sort order is governed by your chosen locale or, more specifically, by LC_COLLATE. More in this related answer:
String sort order (LC_COLLATE and LC_CTYPE)
There are plans to incorporate collation support into Postgres directly, but that's not available at this time.
Many locales ignore the special characters you describe for sorting character data out of the box. If you have a locale installed in your system that provides the sort order you are looking for, you can use it ad-hoc in Postgres 9.1 or later:
SELECT foo FROM bar ORDER BY foo COLLATE "xy_XY"
To see which collations are installed and available in your current Postgres installation:
SELECT * FROM pg_collation;
Unfortunately it is not possible to define your own custom collation (yet) unless you hack the source code.
The collation rules are usually governed by the rules of a language as spoken in a country. The sort order telephone books would be in, if there were still telephone books ... Your operating system provides them.
For instance, in Debian Linux you can use:
locale -a
to display all generated locales. And:
dpkg-reconfigure locales
as root user (one way of several) to generate / install more.
If you want to have this ordering in one particular query, you can use:
ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')
It deletes every character other than the ASCII letters A-Z and a-z from the string and orders by the resulting value.