Single-byte locale and index scan in PostgreSQL

Is there any relation between a single-byte character locale and index scans in PostgreSQL?
I have already read the pages at https://www.postgresql.org/doc, but I could not find an answer.

No, not really. The important division is not between single-byte and multi-byte locales, but between the C (or POSIX) locale and all others.
The C locale is English with a sort order (collation) that sorts words character by character, with lower code points (for example, ASCII values) sorted before higher ones. Because this collation is much simpler than natural language collations, comparisons and sorting are much faster with the C locale. Moreover, and this is where indexes come in, a simple B-tree index on a string column can support LIKE conditions, because words are ordered character by character. With other collations, you need to use the text_pattern_ops operator class to support LIKE.
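For illustration, a minimal sketch (table and index names are made up): with the C locale a plain B-tree index can serve a left-anchored LIKE condition, while under any other collation a second index with the special operator class is needed.
CREATE TABLE words (word text);
-- With lc_collate = 'C', this index supports both ORDER BY and LIKE 'abc%':
CREATE INDEX words_word_idx ON words (word);
-- Under any other collation, left-anchored LIKE needs the pattern operator class:
CREATE INDEX words_word_pattern_idx ON words (word text_pattern_ops);
SELECT * FROM words WHERE word LIKE 'abc%';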
So if you can live with the strange sort order of the C collation, use the C locale by all means.
Do not use any single-byte encodings, ever. Even with the C locale, use the encoding UTF8. There is no benefit in choosing any other encoding, and you will find it quite annoying when you encounter a character that you want to store in your database but cannot.
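For example, a database with the C locale but UTF8 encoding could be created like this (a sketch; the database name is made up):
-- template0 is required when the locale differs from the template database's:
CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';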

Related

postgres ORDER BY - How to get the sort order to treat any letter as coming before numbers or punctuation

When I use an ORDER BY clause, the returned sort order puts punctuation characters and numbers before letters. So I'm getting:
;something
1something
something
What I would prefer is for letters to be treated as coming before numbers and punctuation characters in the sort, like the following:
something
1something
;something
I understand that COLLATIONs define the sort order, and I have tried a few (e.g. "en_GB", "en_US"), but it has not made any difference.
What collation puts letters before numbers?
When using collations, does a column's collation have to be defined when creating the table, or can it be specified just in the ORDER BY clause?
Thanks

Should I save ASCII-only varchar in UTF-8 or ASCII?

I have a varchar column that contains only ASCII characters. I don't need to sort by this field, but I need to search it for full equality.
The default locale is en.UTF8. Will I gain anything if I create this column with COLLATE "C"?
Yes, it makes a difference.
Even if you do not sort deliberately, various operations require sort steps internally (some aggregate functions, DISTINCT, merge joins etc.).
Also, any index on the field has to sort values internally - and observe collation rules, unless COLLATE "C" applies (which has no collation rules).
For searches by full equality you'll want an index - which works either way (for equality), but it's faster overall without collation rules. Depending on the details of your use case, the effect may be negligible or substantial; the impact grows with the length of your strings. I ran a benchmark on a related case some time ago:
Slow query ordering by a column in a joined table
Also, there are more pattern-matching options with the locale "C". With any other locale, the alternative is to create an index with the special varchar_pattern_ops operator class.
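A minimal sketch of the idea (hypothetical table and column names): a column declared with COLLATE "C" gets a plain B-tree index that serves equality searches without collation overhead and also supports left-anchored LIKE patterns.
CREATE TABLE users (login varchar COLLATE "C");
CREATE INDEX users_login_idx ON users (login);
-- Both of these can use the index:
SELECT * FROM users WHERE login = 'alice';
SELECT * FROM users WHERE login LIKE 'ali%';
-- With a non-C collation you would instead need:
-- CREATE INDEX users_login_pattern_idx ON users (login varchar_pattern_ops);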
Related:
PostgreSQL LIKE query performance variations
Operator “~<~” uses varchar_pattern_ops index while normal ORDER BY clause doesn't?
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
Postgres 9.5 introduced a performance improvement for sorting called "abbreviated keys", which ran into problems with some locales, so it was deactivated except for the C locale. Quoting the release notes of Postgres 9.5.2:
Disable abbreviated keys for string sorting in non-C locales (Robert Haas)
PostgreSQL 9.5 introduced logic for speeding up comparisons of string
data types by using the standard C library function strxfrm() as a
substitute for strcoll(). It now emerges that most versions of glibc
(Linux's implementation of the C library) have buggy implementations
of strxfrm() that, in some locales, can produce string comparison
results that do not match strcoll(). Until this problem can be better
characterized, disable the optimization in all non-C locales. (C
locale is safe since it uses neither strcoll() nor strxfrm().)
Unfortunately, this problem affects not only sorting but also entry
ordering in B-tree indexes, which means that B-tree indexes on text,
varchar, or char columns may now be corrupt if they sort according to
an affected locale and were built or modified under PostgreSQL 9.5.0
or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of
known-affected locales. C locale is known safe, and there is no
evidence of trouble in English-based locales such as en_US, but some
other popular locales such as de_DE are affected in most glibc
versions.
The problem also illustrates, more generally, where collation rules come into play.
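As the release notes say, potentially affected indexes need to be rebuilt; a minimal sketch (index and table names are made up):
-- Rebuild a single suspect index, or all indexes on a table:
REINDEX INDEX my_text_idx;
REINDEX TABLE my_table;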

Do text_pattern_ops comparators understand UTF-8?

According to the PostgreSQL 9.2 documentation, if I am using a locale other than the C locale (en_US.UTF-8 in my case), btree indexes on text columns for supporting queries like
SELECT * FROM my_table WHERE text_col LIKE 'abcd%'
need to be created using text_pattern_ops like so
CREATE INDEX my_idx ON my_table (text_col text_pattern_ops)
Now section 11.9 of the documentation states that this results in a "character by character" comparison. Are these (non-wide) C characters or does the comparison understand UTF-8?
Good question; I'm not totally sure, but my tentative understanding is:
Here PostgreSQL means "real characters" (possibly multibyte), not bytes. The comparison "understands UTF-8" always, with or without this special index.
The point is that, for locales with special (non-C) collation rules, we normally want to follow those rules (and call the respective locale libraries) when doing comparisons (<, >, ...) and sorting. But we don't want to use those collations for POSIX regular-expression matching and LIKE patterns. Hence the existence of two different operator classes for text indexes.
The operators in the text_pattern_ops operator class actually do a memcmp() on the strings, so the documentation is perhaps slightly inaccurate in talking about characters.
But this doesn't really affect the question of whether they support UTF-8. Indexing pattern-matching operations in the described fashion does support UTF-8; the underlying operators don't have to worry about the encoding.
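To see this in action, a sketch using the names from the question: with the pattern-ops index in place, the planner rewrites a left-anchored LIKE into range conditions on the byte-wise pattern operators, which is why the encoding never gets in the way.
CREATE INDEX my_idx ON my_table (text_col text_pattern_ops);
EXPLAIN SELECT * FROM my_table WHERE text_col LIKE 'abcd%';
-- The plan shows index conditions roughly like
--   text_col ~>=~ 'abcd' AND text_col ~<~ 'abce'
-- i.e. a byte-wise range scan that works the same for multibyte UTF-8 data.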

PostgreSQL: wrong sorting on Ukrainian text

I have a table with all countries in three languages: English, Russian, and Ukrainian. For the first two languages the sorting is OK, but the Ukrainian country names are sorted incorrectly.
The first two places are taken by the letters 'є' (8th position in the alphabet) and 'і' (12th position in the alphabet); all the other letters are sorted fine.
How can I prevent this behaviour? The DB encoding is UTF-8.
If you are on 9.1, you can add the collation to be used for sorting to your ORDER BY clause:
SELECT *
FROM your_table
ORDER BY your_column COLLATE "uk_UA"
The exact name of the collation depends on your operating system - on Linux the Ukrainian locale is typically named "uk_UA". But I think you get the idea.
You might also want to read this blog entry:
http://www.depesz.com/index.php/2011/03/04/waiting-for-9-1-per-column-collation-support/
UTF-8 doesn't know anything about "language". For an alphabetical sort to make any sense to Postgres, you need to set a locale. Your question doesn't mention locales at all, so I'm guessing you're sorting with whatever your default locale is (probably English or Russian).
If you are already using locales, then I suggest providing details of your client/server locale settings, as there may be a mistake there.
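For reference, the server-wide defaults can be checked like this (a sketch for PostgreSQL 9.x):
SHOW lc_collate;  -- default collation used for sorting
SHOW lc_ctype;    -- default character classification
-- Per-column collations (9.1+) show up in psql's \d output for the table.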

What Is The PostgreSQL Equivalent To SQL Server NVARCHAR?

If I have fields of NVARCHAR (or NTEXT) data type in a Microsoft SQL Server database, what would be the equivalent data type in a PostgreSQL database?
I'm pretty sure Postgres varchar is the same as Oracle/Sybase/MSSQL nvarchar, even though it is not explicit in the manual:
http://www.postgresql.org/docs/7.4/static/datatype-character.html
Encoding conversion functions are here:
http://www.postgresql.org/docs/current/static/functions-string.html
http://www.postgresql.org/docs/current/static/functions-string.html#CONVERSION-NAMES
Example:
CREATE TABLE nvctest (
    utf8fld varchar(12)
);
INSERT INTO nvctest
SELECT convert('PostgreSQL' USING ascii_to_utf8);
SELECT * FROM nvctest;
Also, there is this response to a similar question from a PostgreSQL rep:
All of our TEXT datatypes are multibyte-capable, provided you've installed PostgreSQL correctly. This includes: TEXT (recommended), VARCHAR, CHAR.
Short answer: There is no PostgreSQL equivalent to SQL Server NVARCHAR.
The types called NVARCHAR(N) on different databases are not equivalent.
The standard allows a wide choice of character collations and encodings/character sets. When dealing with Unicode, PostgreSQL and SQL Server fall into different camps, and no equivalence exists.
They differ with respect to:
1. length semantics
2. representable content
3. sort order
4. padding semantics
Thus, moving data from one DB system (or encoding/character set) to another can lead to truncation/content loss.
Specifically, there is no PostgreSQL (9.1) character type equivalent to SQL Server NVARCHAR.
You may migrate the data to a PostgreSQL binary type, but you would then lose text-querying capabilities.
(Unless PostgreSQL starts supporting a UTF-16 based Unicode character set.)
Length semantics
N is interpreted differently (characters, bytes, or 2*N bytes) depending on the database and encoding.
Microsoft SQL Server uses the UCS-2 encoding, with the NVARCHAR length interpreted in UCS-2 code units, that is, length*2 = length in bytes (https://learn.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-2017):
their NVARCHAR(1) can store one UCS-2 character (2 bytes of UCS-2).
Oracle's UTF encoding has the same semantics (and internal CESU-8 storage).
Postgres 9.1 only has a Unicode UTF-8 character set (https://www.postgresql.org/docs/9.1/multibyte.html), which, like Oracle in the AL32UTF8 or AL16UTF16 encodings, can store full Unicode code points; a single character may take up to 4 bytes (see e.g. http://www.oracletutorial.com/oracle-basics/oracle-nvarchar2/, which explicitly states that an NVARCHAR2(50) column may take up to 200 bytes).
The difference becomes significant for characters outside the Basic Multilingual Plane, which count as one character unit in UTF-8/UTF-32 (Go, char32_t, PostgreSQL) but are represented as surrogate pairs in UTF-16 and count as two units (Java, JavaScript, C#, ABAP, wchar_t on Windows, SQL Server).
For example, U+1F60A SMILING FACE WITH SMILING EYES uses up all the space in a SQL Server NVARCHAR(2), but only one character unit in PostgreSQL.
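A sketch to illustrate (runs as-is in psql against a UTF8 database): PostgreSQL counts the emoji as one character of four bytes, while the same value occupies both UCS-2 code units of an NVARCHAR(2) in SQL Server.
SELECT char_length('😊') AS characters,  -- 1
       octet_length('😊') AS bytes;      -- 4 (UTF-8 encoding of U+1F60A)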
Classical enterprise-grade DBs offer at least a choice with UTF-16-like semantics (SAP HANA (CESU-8), DB2 with collation, SQL Anywhere (CESU8BIN), ...).
For example, Oracle also offers what it misleadingly calls a UTF8 character set, which is effectively CESU-8.
This has the same length semantics and representable content as UTF-16 (i.e. as Microsoft SQL Server) and is suitable for UTF-16-based enterprise systems (e.g. SAP R/3) or under a Java application server.
Note that some databases may still interpret NVARCHAR(N) as a limit on the length in bytes, even with a variable-length Unicode encoding (e.g. SAP IQ).
Unrepresentable content
UTF-16/CESU-8-based systems can represent half surrogate pairs, while UTF-8/UTF-32-based systems cannot.
Such content cannot be represented in a UTF-8 character set, but it is a frequent occurrence in UTF-16-based enterprise systems
(e.g. Windows pathnames may contain characters that are not representable in UTF-8; see e.g. https://github.com/rust-lang/rust/issues/12056).
Thus UTF-16 is a "superset" of UTF-8/UTF-32, which is typically a killer criterion when dealing with data from enterprise/OS systems based on that encoding (SAP, Windows, Java, JavaScript). Note that the JavaScript JSON encoding took specific care to be able to represent these characters (https://www.rfc-editor.org/rfc/rfc8259#page-10).
Points (2) and (3) are more relevant when migrating queries, less so for data migration.
Binary sort order
Note that the binary sort order of CESU-8/UTF-16 differs from that of UTF-8/UTF-32.
UTF-16/CESU-8/Java/JavaScript/ABAP sort order:
U+0041 LATIN CAPITAL LETTER A
U+1F60A SMILING FACE WITH SMILING EYES
U+FB03 LATIN SMALL LIGATURE ffi
UTF-8/UTF-32 (Go) sort order:
U+0041 LATIN CAPITAL LETTER A
U+FB03 LATIN SMALL LIGATURE ffi
U+1F60A SMILING FACE WITH SMILING EYES
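A sketch of the UTF-8 side in PostgreSQL (runs in psql against a UTF8 database): with the byte-wise C collation the order follows code points, so the ligature sorts before the emoji; a UTF-16-based system would order them the other way around.
SELECT ch
FROM (VALUES ('A'), ('ﬃ'), ('😊')) AS t(ch)
ORDER BY ch COLLATE "C";
-- Result: A, ﬃ, 😊 (code-point order)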
Padding semantics
Padding semantics differ between databases, especially when comparing VARCHAR with CHAR content.
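A sketch of PostgreSQL's behaviour: trailing blanks are insignificant when comparing char values, but significant for varchar.
SELECT 'abc'::char(5) = 'abc  '::char(5);  -- true: padding is ignored for char
SELECT 'abc'::varchar = 'abc  '::varchar;  -- false: trailing blanks count for varchar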
It's varchar and text, assuming your database uses a Unicode encoding (i.e. UTF8). If your database uses a non-Unicode encoding, there is no special datatype that will give you a Unicode string - you can store it as a bytea stream, but that will not be a string.
The standard TEXT datatype is perfectly fine for this.