PostgreSQL UTF-8 binary collation

PostgreSQL UTF-8 binary collation - postgresql

I would like to have a collation which orders the UTF-8 encoding of 0x1234 below of 0x1235 regardless of the character mapping in the Unicode standard. MySQL uses utf8_bin for this. MSSQL apparently http://msdn.microsoft.com/en-us/library/ms143350.aspx have BIN and BIN2 collations. While finding these were easy, I can't even find a list of collations PostgreSQL supports much less answer to this specific question.

The C locale will do. UTF-8 is designed so that byte ordering is also codepoint ordering. This is not trivial but consider how UTF-8 works:
Number range Byte 1 Byte 2 Byte 3
0000-007F 0xxxxxxx
0080-07FF 110xxxxx 10xxxxxx
0800-FFFF 1110xxxx 10xxxxxx 10xxxxxx
When sorting binary data aka C locale, the first non-equal byte will determine the ordering. What we neeed to see that if two numbers encoded into UTF-8 differ then the first non-equal byte will be lower for the lower value. If the numbers are in different ranges then the first byte will indeed be lower for the lower number. Within the same range, the order is determined by literally the same bits as without encoding.

Sort order of text depends on lc_collate (not on the system locale!). The system locale only serves as a default when creating the db cluster if you don't provide another locale.
The behaviour you are expecting only works with locale C. Read all about it in the fine manual:
The C and POSIX collations both specify "traditional C" behavior, in
which only the ASCII letters "A" through "Z" are treated as letters,
and sorting is done strictly by character code byte values.
Emphasis mine. PostgreSQL 9.1 has a couple of new features for collation. Might be exactly what you are looking for.

Postgres uses the collation defined by the system locale on cluster creation.
You might try to ORDER BY encode(column,'hex')

Related

Need some clarification about LC_COLLATE and LC_CTYPE

I have gone through the official postgres documentation to know about the LC_COLLATE and LC_TYPE. But, still I don't understand it correctly.
Can anyone help me in understanding these concepts and impact of these, especially when we are trying to load data which is at oracle of encoding WE8ISO8859P15 and at postgres encoding is as utf-8 and collation/ctype is en_US.UTF-8.
Thanks in advance

This is part of the “locale”, the national language support, which is different from the encoding (but the locale has to belong to the encoding).
LC_CTYPE determines which characters are letters, numbers, space characters, punctuation etc. Different languages have different ideas about that.
LC_COLLATE determines how strings are compared and sorted.
The first has little impact on the behavior of PostgreSQL, but the second is very relevant: it determines how b-tree indexes on string columns are ordered (which is why it cannot be changed after a database has been created) and how ORDER BY sorts strings by default (which is directly user-visible).

Upgrading MariaDB collation from utf8mb4_unicode_520_ci

Currently I use:
The utf8mb4 database character set.
The utf8mb4_unicode_520_ci database collation.
I understand that utf8mb4 supports up to four bytes per character. I also understand that Unicode is a standard that continues to get updates. In the past I thought utf8 was sufficient until I had some test data get corrupted, lesson learned. However I'm having difficulty understanding the upgrade path for both the character set and collations.
The utf8mb4_unicode_520_ci database collation is based off of Unicode Collation Algorithm version 5.2.0. If you navigate to the parent directory you'll see up to version 14.0 listed at the time of typing this. Now those are the Unicode standards, then there is the supported MariaDB character sets and collations.
Offhand I'm not sure when the need to go from four bytes per character gets superseded to go to eight bytes per character or even 16 so it's not as simple a measure of just updating the database collation. Additionally I'm not seeing anything that seems to be newer than version 5.2.0 on MariaDB's documentation.
So in short my three highly related questions are:
Are the newer collations such as version 14 still fully compatible with four byte characters or have they exhausted all combinations and now require up to eight or 16 bytes per character?
What is the latest database collation that MariaDB supports in regards to Unicode versions?
In regards to the second question, a newer version than 5.2.0 is supported by MariaDB then is utf8mb4 still sufficient for a character set or not?
I am not bound to or care about MySQL compatibility.

You can inspect the collations currently supported by your MariaDB instance:
SELECT * FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4';
As far as I know, MariaDB does not support any UTF-8 collation version more current than utf8_unicode_520ci. If you try to use the '900' version, for example importing metadata from MySQL to MariaDB, you get errors.
There is no such thing as an 8-byte or 16-byte UTF-8 encoding. UTF-8 is an encoding that uses between 1 and 4 bytes per character, no more than that.
MariaDB also supports utf16 and utf32, but neither of these uses more than 4 bytes per character. Utf16 is variable-length, using one or two 16-bit code units per character. Utf32 is fixed-width, always using 32-bits (4 bytes) per character.

Postgresql support for national character data types

Looking for a reference that discusses PostgreSQL's support for the NATIONAL CHARACTER set of data types. e.g. this query runs without error:
select cast('foo' as national character varying(10))
yet the docs don't seem to discuss that type Postgres character data types
Does Postgres implement these differently from the CHARACTER data types? That is, how does the NATIONAL keyword affect how data is stored or represented?
Can someone share a link or two to any references I can't seem to find? (other than some mailing list correspondence from a while back)

If you request a national character varying in PostgresSQL, you'll get a regular character varying.
PostgreSQL uses the same encoding for normal and national characters.
“National character” is a leftover from the bad old days when people still used single-byte encodings like LATIN-1 and needed a different encoding for characters that didn't fit.
PostgreSQL has always supported UNICODE encodings, so this is not an issue. Just make sure that you don't specify an encoding other than the default UTF8.

NATIONAL CHARACTER has no real meaning in the SQL:92 standard (section 4.2.1), saying only that it means “a particular implementation-defined character repertoire”. If you are surprised, don’t be. There are many screwy aspects to the SQL standard.
As for text handling in Postgres, you would likely be interested in learning about:
character encoding
Unicode
UTF-8
collations
support for ICU in Postgres 10 and later.
See:
More robust collations with ICU support in PostgreSQL 10 by Peter Eisentraut, post, 2017-05.
Collations: Introduction, Features, Problems by Peter Eisentraut, video, 2019-07-12.
Unicode collation algorithm ( UCA )
ICU User Guide – Locale
List of locales with 209 languages, 501 regions & variants, as defined in ICU

cant find Varchar Chart of acceptable characters

Does anyone know of a simple chart or list that would show all acceptable varchar characters? I cannot seem to find this in my googling.

What codepage? Collation? Varchar stores characters assuming a specific codepage. Only the lower 127 characters (the ASCII subset) is standard. Higher characters vary by codepage.
The default codepage used matches the collation of the column, whose defaults are inherited from the table,database,server. All of the defaults can be overriden.
In sort, there IS no "simple chart". You'll have to check the character chart for the specific codepage, eg. using the "Character Map" utility in Windows.
It's far, far better to use Unicode and nvarchar when storing to the database. If you store text data from the wrong codepage you can easily end up with mangled and unrecoverable data. The only way to ensure the correct codepage is used, is to enforce it all the way from the client (ie the desktop app) to the application server, down to the database.
Even if your client/application server uses Unicode, a difference in the locale between the server and the database can result in faulty codepage conversions and mangled data.
On the other hand, when you use Unicode no conversions are needed or made.

How do you set strings to uppercase / lowercase in Unicode?

This is mostly a theoretical question I'm just very curious about. (I'm not trying to do this by coding it myself or anything, I'm not reinventing wheels.)
My question is how the uppercase/lowercase table of equivalence works for Unicode.
For example, if I had to do this in ASCII, I'd take a character, and if it falls withing the [a-z] range, I'd sum the difference between A and a.
If it doesn't fall on that range, I'd have a small equivalence table for the 10 or so accented characters plus ñ.
(Or, I could just have a full equivalence array with 256 entries, most of which would be the same as the input)
However, I'm guessing that there's a better way of specifying the equivalences in Unicode, given that there are hundreds of thousands of characters, and that theoretically, a new language or set of characters can be added (and I'm expecting that you wouldn't need to patch windows when that happens).
Does Windows have a huge hard-coded equivalence table for each character? Or how is this implemented?
A related question is how SQL Server implements Unicode-based accent-insensitive and case-insensitive queries. Does it have an internal table that tells it that é ë è E É È and Ë are all equivalent to "e"?
That doesn't sound very fast when it comes to comparing strings.
How does it access Indexes quickly? Does it already index values converted to their "base" characters, corresponding to that field's collation?
Does anyone know the internals for these things?
Thank you!

I'm going to address the MS SQL Server part of this question, but the "correct" answer actually depends on the language(s) supported and application.
When you create a table in SQL Server, each text field has either an implicitly or explicitly specified collation. This affects both sort order and comparison behavior. The default, for most English (US) locales, is Latin1_General_CI_AS, or Latin 1, Case-insensitive, Accent-Sensitive. That means that, for example, a=A, but a!=Ä and a!=ä. You can also use accent-insensitive (Latin1_General_CI_AI) which treats all the diacritic variations of "A" as equal.
Some locales support other categories of comparison; for example, French orders words containing diacritics somewhat differently than German does. Turkish considers a dotless i and dotted i semantically different, so I and i don't match even with case-insensitive comparisons if you use Turkish, case-insensitive, accent-sensitive collation.
You can change the collation per database, per table, per field, and, with some cost, even per-query. My understanding is that indices normalize according to the specified collation order, which means that basically the index keeps a flattened version of the original string. For example, with case-insensitive collations, Apple and apple are stored as apple. Queries are flattened with the same collation before the search.
In Japanese, there's another category of normalization, where fullwidth and halfwidth characters like ア=ｱ, and in some cases, two halfwidth characters are flattened to a single, semantically equivalent character (バ=ﾊﾞ). Finally, for some languages, there's another ball of wax with composite characters, where isolated diacritic characters can be composed with other characters (e.g. the umlaut in ä is one character, composed with the simple form a). Vietnamese, Thai and a few other languages have variations of this category. If there's a canonical form, Unicode normalization allows the composed and decomposed forms to be treated as equivalent. Unicode normalization is typically applied before any comparisons are made.
To summarize, for a case-insensitive comparison, you do something much like you would when comparing ASCII-range strings: flatten the left and right side of the comparison "to lower case" (for example), then compare the array as a binary array. The difference is that you need to
1) normalize the strings to the same unicode form (kC or kD)
2) normalize the strings to the same case according to the rules of that locale
3) normalize the accents according to the accent-sensitivity rules
4) compare according to a binary comparison
4) if applicable, such as in the case of sorting, compare using additional secondary and ternary sorting rules, which include things analogous to things like "Mc" sorts before "M" in some languages.
And yes, Windows stores tables for all of these rules. You don't get all of them by default in every installation, unless you add support for them with the East Asian Language Support and Complex Scripts support from control panel.

There is a mapping file that contains all the case mappings that have a 1:1 mapping ratio. Usually operating systems/frameworks/libraries support a specific version of Unicode, and since this case mappings file is versioned, you would get the mappings for whichever version of Unicode your particular OS/framework/library/whatever happened to support.
For more information on Unicode case mappings, see: http://www.unicode.org/faq/casemap_charprop.html

Most writing systems do not have separate uppercase and lowercase letters. According to Wikipedia, exceptions include "Roman, Greek, Cyrillic and Armenian alphabets".
So there aren't that many letters to worry about. This page shows that large ranges of characters follow a simple scheme of adding 1 to an uppercase character to get the lowercase equivalent (though of course there are some exceptions).

The correct answer is a little more complicated, depending on what you are trying to do.
When comparing character strings, for sorting or searching applications, the correct algorithm to use is specified in UTS #10: "Unicode Collation Algorithm". Case-insensitivity is part of the mix, but there are different ways to represent a many characters, and applications often need to treat the various representations as equivalent.
The sorting rules are locale-dependent. This is mainly an issue when you are sorting results for display to a user. Ignoring the rules can frustrate users and even result in security vulnerabilities.
If you are just trying to capitalize words for display purposes, the rules there can be tricky too; there are one-to-many conversions and other issues. Depending on the locale, the same letter may capitalize differently. The letter's position in a word can make a difference. There's also a a distinct notion of "title case", where you just want to capitalize the first letter of each word. Sometimes the title-case of a character is not the same as its upper-case.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse