Is using T to separate date and time inconsistent with RFC3339? - postgresql

In the documentation for date/time type in Postgres, it says:
ISO 8601 specifies the use of uppercase letter T to separate the date and time. PostgreSQL accepts that format on input, but on output it uses a space rather than T, as shown above. This is for readability and for consistency with RFC 3339 as well as some other database systems.
However, I cannot find that part in RFC3339.
Can anybody help me?

It is in RFC 3339, Section 5.6 (see the second NOTE):
date-time = full-date "T" full-time
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively.
This date/time format may be used in some environments or contexts
that distinguish between the upper- and lower-case letters 'A'-'Z'
and 'a'-'z' (e.g. XML). Specifications that use this format in
such environments MAY further limit the date/time syntax so that
the letters 'T' and 'Z' used in the date/time syntax must always
be upper case. Applications that generate this format SHOULD use
upper case letters.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.
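As a rough illustration of the two separators discussed above, here is a small Python sketch. It uses only the standard library `datetime` module; the sample timestamp is an arbitrary value chosen for the example and comes from neither the PostgreSQL documentation nor the RFC.

```python
from datetime import datetime

# Arbitrary sample value, chosen only for illustration.
moment = datetime(2021, 4, 20, 13, 3, 5)

# ISO 8601 style: date and time separated by "T" (the default).
with_t = moment.isoformat()

# RFC 3339 NOTE style: a space instead of "T", for readability.
with_space = moment.isoformat(sep=" ")

# datetime.fromisoformat accepts either separator and yields the same value.
assert datetime.fromisoformat(with_t) == datetime.fromisoformat(with_space)

print(with_t)      # 2021-04-20T13:03:05
print(with_space)  # 2021-04-20 13:03:05
```

Both strings describe the same instant; only the separator differs, which is the readability trade-off the PostgreSQL documentation refers to.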

Related

What does T mean in "YYYY-mm-DDTHH:MM"?

I am trying to pull some data from Twitter, and the date format is "YYYY-mm-DDTHH:MM". What does T mean in "YYYY-mm-DDTHH:MM"?
The T isn't substituted for a value; it is a literal character in the output that marks the start of the time part.
For example: 2021-04-20T13:03
The format is part of the ISO 8601 international standard.
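To make the "literal character" point concrete, here is a minimal Python sketch using the example value above. The format string is an assumption about how one might parse it with the standard library; it is not something Twitter's API prescribes.

```python
from datetime import datetime

text = "2021-04-20T13:03"

# The "T" appears in the format string as itself: it is matched literally
# and carries no value, unlike %Y, %m, %d, %H and %M.
parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M")

# Splitting on the "T" gives the date part and the time part.
date_part, time_part = text.split("T")

print(parsed)     # 2021-04-20 13:03:00
print(date_part)  # 2021-04-20
print(time_part)  # 13:03
```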

Why doesn't ICU4J match UTF-8 sort order?

I am having a hard time understanding unicode sorting order.
When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1 I get a return value of -1 indicating that _ comes before #.
However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?
First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
Now, the page you linked to shows everything in numerical Code Point order. That is the order things would sort in if using a binary collation (in SQL Server, that would be collations with names ending in _BIN and _BIN2). But the non-binary ordering is far more complex. The rules are described here: Unicode Collation Algorithm (UCA).
The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt
It shows:
005F ; [*010A.0020.0002] # LOW LINE
...
0023 ; [*0290.0020.0002] # NUMBER SIGN
It is very important to keep in mind that any locale / culture can override these base rules. Hence, while the few lines noted above explain this specific circumstance, other circumstances would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see if there are any locale-specific overrides.
Converting Mark Ransom's comments into an answer:
The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the collation numbers for _ are lower than the numbers for #.
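The difference between code point order and collation order can be sketched in a few lines of Python. This is only an illustration: the two primary weights are copied from the allkeys.txt lines quoted just above, and a real collator such as ICU uses the full table plus secondary and tertiary weights and any locale tailoring.

```python
# Primary weights taken from the allkeys.txt entries quoted above.
# This two-entry table is a stand-in for the real collation table.
PRIMARY_WEIGHT = {"_": 0x020B, "#": 0x0391}

chars = ["_", "#"]

# Plain sorting compares code points: U+0023 (#) < U+005F (_).
by_code_point = sorted(chars)

# Sorting by collation weight reverses the order: 0x020B (_) < 0x0391 (#).
by_collation = sorted(chars, key=PRIMARY_WEIGHT.__getitem__)

print(by_code_point)  # ['#', '_']
print(by_collation)   # ['_', '#']
```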

Unicode character default collation table

I don't know exactly which site this question belongs on, so I am posting it here.
I use PostgreSQL 9.2 on RHEL 6.4 and observe the following:
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "kk_KZ.utf8"
gives
а
ә
б
в
г
д
е
ж
BUT
select foo
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo
order by foo collate "en_US.utf8"
gives
а
б
в
г
д
е
ә -- misplaced
ж
Further, I found that there is the Default Unicode Collation Element Table [1], which lists the character in question (04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA) in proper order.
I understand that it is silly to expect Cyrillic characters to be handled properly by the "en_US.utf8" locale, but what is the correct behavior according to Unicode or any other relevant standard when a character does not normally belong to the language/locale used for collation?
[1] http://www.unicode.org/Public/UCA/latest/allkeys.txt
It's not misplaced. It might be to you, but it's not to me. :-) In all seriousness, there is no correct behavior by Unicode; there simply cannot be. A character set is a mapping; the collation is a locale-specific set of rules to sort the characters in that set -- and even within the same locale there can be multiple collations.
The ICU docs have colorful examples of how thorny this kind of stuff gets, in case you're curious. Quoting extensively:
http://userguide.icu-project.org/collation
[H]ere are some of the ways languages vary in ordering strings:
The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".
Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".
Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".
Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".
Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.
A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".
Thai requires that the order of certain letters be reversed.
French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".
Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.
Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.
Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.
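The idea that the same characters can legitimately sort in different ways can be sketched in Python. The two alphabet strings below are hand-written stand-ins that reproduce the two outputs shown in the question; they are assumptions for illustration, not real glibc or CLDR locale data.

```python
letters = ["а", "ә", "б", "в", "г", "д", "е", "ж"]

# Hand-written orderings, for illustration only (not real locale data).
SCHWA_AFTER_A = "аәбвгдеж"    # the order seen with kk_KZ.utf8 in the question
SCHWA_AFTER_IE = "абвгдеәж"   # the order seen with en_US.utf8 in the question

def sort_with(alphabet, items):
    """Sort items by their position in the given alphabet string."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return sorted(items, key=rank.__getitem__)

# Code point order is a third possibility: ә is U+04D9, so it sorts last.
by_code_point = sorted(letters)
schwa_after_a = sort_with(SCHWA_AFTER_A, letters)
schwa_after_ie = sort_with(SCHWA_AFTER_IE, letters)

print("".join(by_code_point))   # абвгдежә
print("".join(schwa_after_a))   # аәбвгдеж
print("".join(schwa_after_ie))  # абвгдеәж
```

None of the three results is wrong in itself; each follows from a different set of ordering rules.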
PostgreSQL uses the locales provided by the operating system. In your setup, locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on current pains in trying to update glibc locale data).
As of glibc 2.28, to be released on 2018-08-01, glibc will use data from ISO 14651:2016 (which is synchronized to Unicode 9), and will give the order the OP expects for en_US.
ISO 14651 is Method for comparing character strings and description of the common template tailorable ordering and it is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO14651 equivalent of the DUCET, and they are aligned.
The first time CYRILLIC SMALL LETTER SCHWA appeared in a collation table in glibc was for the az_AZ locale (Azerbaijani), where it is ordered after CYRILLIC SMALL LETTER IE. This corresponds to:
commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper#redhat.com>
Date: Fri Aug 3 08:42:28 2001 +0000
Update.
2001-08-03 Ulrich Drepper <drepper#redhat.com>
* locale/iso-639.def: Add Tigrinya.
From there, that ordering was eventually moved to the file iso14651_t1 as per Bug 672 - Include iso14651_t1 in collation rules, which was an effort to simplify glibc locale data. This corresponds to:
commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper#redhat.com>
Date: Sun Feb 18 04:34:28 2007 +0000
[BZ #672]
2005-01-16 Denis Barbier <barbier#linuxfr.org>
[BZ #672]
* locales/ca_ES: Replace current collation rules by including
iso14651_t1 and adding extra rules if needed. There should be
no noticeable changes in sorted text. only ligatures and
ignoreable characters have modified weights.
* locales/da_DK: Likewise.
* locales/en_CA: Likewise.
* locales/es_US: Likewise.
* locales/fi_FI: Likewise.
* locales/nb_NO: Likewise.
[BZ #672]
* locales/iso14651_t1: Simplified. Extended.
Most locales in glibc start from iso14651_t1, and tailor it, which is what you are seeing with en_US.
While glibc based its default ordering on Azerbaijani, the DUCET instead bases it on the ordering for Kazakh and Tatar, which is where the difference comes from.
The Unicode Collation Algorithm allows any tailorings to be made to the DUCET.
There isn't a "correct" behaviour. There are various behaviours one could expect, and the most appropriate depends on the context and the audience. Sometimes any behaviour could be correct, since there isn't really a reason to force any particular order of Cyrillic letters in an American English collation.
The Common Locale Data Repository provides locale-specific tailorings to the DUCET. The CLDR uses LDML (Locale Data Markup Language) to specify the tailorings, and the syntax is given by the Unicode Technical Specification #35, part 5.
The latest version of the data provided by the CLDR for en_US has no tailorings: it uses a modified version of the DUCET (as stated in UTS#35 under "Root collation"). It lists the cyrillic schwa after the cyrillic A, i.e., the order you were expecting.
There is also data for an en_US_POSIX locale, and that one includes some modifications, but none changes anything that isn't in ASCII.
It appears the en_US locale installed in your system uses a tailoring that puts the schwa next to E probably because of their similar form. It could be argued that would cause fewer surprises to an American English audience than sorting the schwa after A: ask people what that is and see how many will just tell you it is an "upside-down E". It isn't right or wrong, but if you ask me, it seems more appropriate than the collation found in the CLDR.

Struggling with dates formats, want YYYY-MM-DD

As an absolute beginner to SAS I quickly ran into problems with date formatting.
I have a dataset containing transactions with three types of dates: BUSDATE, SPOTDATE, MATURITY. Each transaction is represented on two lines, and I want BUSDATE and SPOTDATE from line 1 but MATURITY from line 2.
In the original set, the dates are in YYYY-MM-DD format.
DATA masterdata;
SET sourcedata(rename=(BUSDATE=BUSDATE2 SPOTDATE=SPOTDATE2 MATURITY=MATURITY2));
BUSDATE=BUSDATE2;
SPOTDATE=SPOTDATE2;
IF TRANS_TYPE='Swap' THEN;
MATURITY=SPOTDATE;
RUN;
Problem is, this returns something like 17169 (which I guess is the number of days from a certain date).
How can I make it output in YYYY-MM-DD format - or is this approach wrong; should I first convert the date variables to some SAS date format?
If you have valid SAS dates, just add a FORMAT statement to your DATA step.
Format busdate spotdate maturity yymmdd10. ;
SAS dates are numeric variables. They represent the number of days since 1/1/1960. You use a FORMAT to display dates.
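To see what a raw value such as 17169 stands for, the day count can be converted by hand. The sketch below is in Python rather than SAS, purely to illustrate the arithmetic; the function name `sas_date_to_iso` is made up for this example.

```python
from datetime import date, timedelta

# SAS counts days from January 1, 1960 (day 0).
SAS_EPOCH = date(1960, 1, 1)

def sas_date_to_iso(days):
    """Convert a SAS day count to a YYYY-MM-DD string (hypothetical helper)."""
    return (SAS_EPOCH + timedelta(days=days)).isoformat()

print(sas_date_to_iso(0))      # 1960-01-01
print(sas_date_to_iso(17169))  # 2007-01-03
```

Inside SAS itself the FORMAT statement does this for display, so no manual conversion is needed there.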
Adding to CarolinaJay's answer, you normally want to keep them as numeric variables, since you can do math with them (like "# of days since date X"). However, if for some reason you need a character variable, you can do this:
date_As_char=put(datevar,YYMMDD10.);
Incidentally, YYMMDD10. will actually give you YYYY-MM-DD, as you asked for. If you want a different separator, see http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000589916.htm (YYMMDDxw. format): for certain letters, putting a letter after the last D selects a different separator. For example, YYMMDDn10. gives you no separator and YYMMDDs10. gives you slashes, while YYMMDDd10. gives you dashes, just like omitting the letter would. This concept also applies to MMDDYY formats, and I think a few others.

In an ISO 8601 date, is the T character mandatory?

I'm wondering if the following date is ISO 8601 compliant:
2012-03-02 14:57:05.456+0500
(Certainly 2012-03-02T14:57:05.456+0500 is compliant, but it is not very human-readable!)
In other words, is the T between date and time mandatory?
It's required unless the "partners in information interchange" agree to omit it.
Quoting an earlier version of the ISO 8601 standard, section 4.3.2:
The character [T] shall be used as time designator to indicate the
start of the representation of the time of day component in these
expressions. [...]
NOTE By mutual agreement of the partners in information interchange,
the character [T] may be omitted in applications where there is no
risk of confusing a date and time of day representation with others
defined in this International Standard.
Omitting it is fairly common, but leaving it in is advisable if the representation is meant to be machine-readable and you don't have a clear agreement that you can omit it.
But according to Wikipedia:
In ISO 8601:2004 it was permitted to omit the "T" character by mutual agreement as in "200704051430", but this provision was removed in ISO 8601-1:2019. Separating date and time parts with other characters such as space is not allowed in ISO 8601, but allowed in its profile RFC 3339.
UPDATE: Mark Amery's comment makes a good point, that permission to omit the [T] does not necessarily imply permission to replace it with a space. So this:
2012-03-02T14:57:05.456+0500
is clearly compliant, and this:
2012-03-0214:57:05.456+0500
was permitted by earlier versions of the standard if the partners agreed to omit the T, but this:
2012-03-02 14:57:05.456+0500
apparently is not (though it's much more readable than the version with the T simply omitted).
Personally, if ISO 8601 compliance were required, I'd include the T, and if it weren't then I'd use a space (or a hyphen if it's going to be part of a file name).
See also RFC 3339 section 5.6, mentioned in Charles Burns's answer.
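The three variants above can be told apart mechanically. The following Python sketch uses a hand-written regular expression that covers only the shapes discussed in this answer (full date, time with optional fraction, numeric offset or Z); it is an assumption for illustration and not a complete ISO 8601 or RFC 3339 validator. The function name `separator_kind` is made up.

```python
import re

# Covers only the timestamp shapes discussed here, not all of ISO 8601.
_TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2})"          # full date
    r"([Tt ]?)"                      # separator: T, t, space, or nothing
    r"(\d{2}:\d{2}:\d{2}(?:\.\d+)?)" # time with optional fraction
    r"([+-]\d{2}:?\d{2}|[Zz])$"      # numeric offset or Z
)

def separator_kind(text):
    """Report which date/time separator a timestamp uses (hypothetical helper)."""
    match = _TIMESTAMP.match(text)
    if match is None:
        return "unrecognized"
    separator = match.group(2)
    if separator in ("T", "t"):
        return "T"        # the form ISO 8601 defines
    if separator == " ":
        return "space"    # the readability variant from the RFC 3339 NOTE
    return "omitted"      # allowed by older ISO 8601 only by mutual agreement

print(separator_kind("2012-03-02T14:57:05.456+0500"))  # T
print(separator_kind("2012-03-0214:57:05.456+0500"))   # omitted
print(separator_kind("2012-03-02 14:57:05.456+0500"))  # space
```

A check like this is one way to enforce the "include the T when ISO 8601 compliance is required" advice before data leaves a system.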
That date is not ISO-8601 compliant as Keith Thompson indicated, but it is compliant with RFC 3339, a profile of ISO 8601.
Sort of. See NOTE at the bottom of the following text from RFC 3339:
date-time = full-date "T" full-time
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
syntax may alternatively be lower case "t" or "z" respectively.
This date/time format may be used in some environments or contexts
that distinguish between the upper- and lower-case letters 'A'-'Z'
and 'a'-'z' (e.g. XML). Specifications that use this format in
such environments MAY further limit the date/time syntax so that
the letters 'T' and 'Z' used in the date/time syntax must always
be upper case. Applications that generate this format SHOULD use
upper case letters.
NOTE: ISO 8601 defines date and time separated by "T".
Applications using this syntax may choose, for the sake of
readability, to specify a full-date and full-time separated by
(say) a space character.