0D type and n?0D randoms - kdb

In A brief introduction to q and kdb+ there are several places where time values are created with code like 0D00:01.
There is even a random time generation technique using syntax such as:
n?0D0
fcn?0D00:00:20
I found 0D mentioned only in q4m3 2.5.2 Time Types, where it is described as optional.
Are there any references to this syntax on code.kx? And do any other useful date/time random generators exist? I checked the capital letters; it seems 0D is the only one, see: q)#[value;;::] each ("0",/:.Q.A)

Let me first note that the 0D... syntax is not specific to the rand operator. The prefix 0D is needed when the type kdb would infer for a literal without it differs from the one you intended. For example:
q)type 08:09:10.123 / time
-19h
q)type 0D08:09:10.123 / timespan
-16h
The prefix is optional when the type can be inferred unambiguously; in the case of timespan literals it is sufficient to supply more than 4 digits after the dot when using the hh:mm:ss.nnnnnnnnn notation:
q)type 08:09:10.123 / time
-19h
q)type 08:09:10.1234 / still time
-19h
q)type 08:09:10.12345 / timespan
-16h
The 0D notation is very handy when you need a timespan value but don't want to specify all the details down to nanoseconds. I think you will agree that 0D00:01 (1 minute) is easier to type and read than 00:01:00.000000000.
Going back to your question, 0D0 is just a zero-valued timespan, the same as 00:00:00.000000000. However, ? treats it as if 1D0 (or 0D24:00:00.000000000) had been passed. I didn't see it documented anywhere on code.kx.com, but if you think about it you'll agree that generating a timespan in the range [0; 24h) is such a common case that it definitely deserves a shortcut. And there you have it!

Related

What does Regex Index Reference Mean When Referring to Single Quotes

I was playing around with inputting regex patterns in https://regex101.com/
Beginning with the simplest of examples (see the screenshot image below), I don't understand the explanation for my pattern '.' (all characters except line terminators). I am assuming that the index values provided are referring to number bases, that is, 39 base 10, 27 base 16 and 47 base 8?
If I am correct, then what is this indexing telling me?
I understand start indexes as used in MathWorks regexp, but this particular example I have posted here has regex101 referencing the single quote itself.
Perhaps my question is a little esoteric but I would appreciate any suggestions.
The ASCII code representations of the quote char ' are:
39 in decimal,
27 in hex,
47 in octal
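If you want to double-check the conversions yourself, here is a tiny Java snippet (mine, not part of the original answer) that prints the code of the quote character in all three bases:
public class QuoteChar {
    public static void main(String[] args) {
        int code = '\'';  // the single quote character
        System.out.println(code);                        // 39  (decimal)
        System.out.println(Integer.toHexString(code));   // 27  (hex)
        System.out.println(Integer.toOctalString(code)); // 47  (octal)
    }
}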

Extracting Portions of String

I have a field with the following type of string:
X000233756_9981900025_201901_EUR_/
Firstly, I need to take the characters to the left of the first _.
Secondly, I need to take the characters between the first and second _.
First _ is CHARINDEX('_',[Line_Item_Text],1) AS Position_1
Second _ is CHARINDEX('_',[Line_Item_Text],CHARINDEX('_',[Line_Item_Text],1)+1) AS Position_2
I was then expecting to be able to do
left([Line_Item_Text],CHARINDEX('_',[Line_Item_Text],1)-1) AS Data_1
Substring([Line_Item_Text],CHARINDEX('_',[Line_Item_Text],1)+1),CHARINDEX('_',[Line_Item_Text],CHARINDEX('_',[Line_Item_Text],1)+1) - CHARINDEX('_',[Line_Item_Text],1)+1)) AS Data_2
Which should give me
X000233756
9981900025
But I am getting errors about an incorrect number of functions when I start adding and subtracting from the CHARINDEX function.
Any ideas where I am going wrong?
TIA
Geoff
Actually, using the base string functions here is going to be an ugly nightmare. You might find that STRING_SPLIT along with some clever logic is easier:
SELECT value
FROM STRING_SPLIT('X000233756_9981900025_201901_EUR_', '_')
WHERE LEN(value) > 6 AND NOT value LIKE '[A-Z]%';
This answer assumes that the third and fourth components will always be a 6-digit date and a 3-letter currency code, and that the first (but not the second) component will always start with some letter.
Demo

`uuuu` versus `yyyy` in `DateTimeFormatter` formatting pattern codes in Java?

The DateTimeFormatter class documentation says about its formatting codes for the year:
u year year 2004; 04
y year-of-era year 2004; 04
…
Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years as per SignStyle.NORMAL. Otherwise, the sign is output if the pad width is exceeded, as per SignStyle.EXCEEDS_PAD.
No other mention of “era”.
So what is the difference between these two codes, u versus y, year versus year-of-era?
When should I use something like this pattern uuuu-MM-dd and when yyyy-MM-dd when working with dates in Java?
It seems that example code written by those in the know uses uuuu, but why?
Other formatting classes such as the legacy SimpleDateFormat have only yyyy, so I am confused about why java.time brings in this uuuu for “year of era”.
Within the scope of the java.time package, we can say:
It is safer to use "u" instead of "y" because DateTimeFormatter will otherwise insist on having an era in combination with "y" (= year-of-era). So using "u" avoids some possible unexpected exceptions in strict formatting/parsing. See also this SO post. Another minor thing improved by the "u" symbol compared with "y" is printing/parsing negative Gregorian years (in the far past).
Otherwise we can clearly state that using "u" instead of "y" breaks long-standing habits in Java programming. It is also not intuitively clear that "u" denotes any kind of year because a) the first letter of the English word "year" is not in agreement with this symbol and b) SimpleDateFormat has used "u" for a different purpose since Java 7 (ISO day number of week). Confusion is guaranteed - forever?
We should also note that using eras (symbol "G") in the context of ISO is in general dangerous if we consider historic dates. If "G" is used with "u" then both fields are unrelated to each other. And if "G" is used with "y" then the formatter is satisfied but still uses the proleptic Gregorian calendar even when the historic date mandates different calendars and date handling.
Background information:
When developing and integrating JSR 310 (the java.time packages), the designers decided to use the Common Locale Data Repository (CLDR)/LDML spec as the basis for the pattern symbols in DateTimeFormatter. The symbol "u" was already defined in CLDR as the proleptic Gregorian year, so this meaning was adopted into the then-upcoming JSR-310 (but not into SimpleDateFormat, for backwards-compatibility reasons).
However, this decision to follow CLDR was not quite consistent, because JSR-310 also introduced new pattern symbols which didn't and still don't exist in CLDR; see also this old CLDR ticket. The suggested symbol "I" was changed by CLDR to "VV" and finally taken over by JSR-310, including the new symbols "x" and "X". But "n" and "N" still don't exist in CLDR, and since this old ticket is closed, it is not at all clear whether CLDR will ever support them in the sense of JSR-310. Furthermore, the ticket does not mention the symbol "p" (a padding instruction in JSR-310, not defined in CLDR). So we still have no perfect agreement between pattern definitions across different libraries and languages.
And about "y": we should also not overlook the fact that CLDR associates this year-of-era with at least some kind of mixed Julian/Gregorian year and not with the proleptic Gregorian year as JSR-310 does (leaving the oddity of negative years aside). So there is no perfect agreement between CLDR and JSR-310 here either.
The javadoc section Patterns for Formatting and Parsing for DateTimeFormatter lists the following 3 relevant symbols:
Symbol Meaning Presentation Examples
------ ------- ------------ -------
G era text AD; Anno Domini; A
u year year 2004; 04
y year-of-era year 2004; 04
Just for comparison, these other symbols are easy enough to understand:
D day-of-year number 189
d day-of-month number 10
E day-of-week text Tue; Tuesday; T
The day-of-year, day-of-month, and day-of-week are obviously the day within the given scope (year, month, week).
So, year-of-era means the year within the given scope (era), and right above it era is shown with an example value of AD (the other value of course being BC).
year is the signed year, where year 0 is 1 BC, year -1 is 2 BC, and so forth.
To illustrate: When was Julius Caesar assassinated?
March 15, 44 BC (using pattern MMMM d, y GG)
March 15, -43 (using pattern MMMM d, u)
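For the record, here is a small sketch that reproduces the two outputs above. The pattern strings come from this illustration; the Locale.ENGLISH argument is my addition so the month and era names are stable regardless of the default locale:
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class CaesarYear {
    public static void main(String[] args) {
        LocalDate ides = LocalDate.of(-43, 3, 15);  // proleptic year -43 is 44 BC
        System.out.println(ides.format(DateTimeFormatter.ofPattern("MMMM d, y GG", Locale.ENGLISH)));
        // March 15, 44 BC
        System.out.println(ides.format(DateTimeFormatter.ofPattern("MMMM d, u", Locale.ENGLISH)));
        // March 15, -43
    }
}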
The distinction will of course only matter if the year is zero or negative, and since that is rare, most people don't care, even though they should.
Conclusion: If you use y you should also use G. Since G is rarely used, the correct year symbol is u, not y, otherwise a non-positive year will show incorrectly.
This is known as defensive programming:
Defensive programming is a form of defensive design intended to ensure the continuing function of a piece of software under unforeseen circumstances.
Note that DateTimeFormatter is consistent with SimpleDateFormat:
Letter Date or Time Component Presentation Examples
------ ---------------------- ------------ --------
G Era designator Text AD
y Year Year 1996; 96
Negative years have always been a problem, and they have now fixed it by adding u.
Long story short
For 99 % of purposes you can toss a coin; it will make no difference whether you use yyyy or uuuu (or whether you use yy or uu for a 2-digit year).
It depends on what you want to happen in case a year earlier than 1 CE (1 AD) occurs. The point is that in 99 % of programs such a year will never occur.
Two other answers have already presented the facts of how u and y work very nicely, but I still felt something was missing, so I am contributing this slightly more opinion-based answer.
For formatting
Assuming that you don’t expect a year before 1 CE to be formatted, the best thing you can do is to check this assumption and react appropriately in case it breaks. For example, depending on circumstances and requirements, you may print an error message or throw an exception. One very soft failure path might be to use a pattern with y (year of era) and G (era) in this case and a pattern with either u or y in the normal, current era case. Note that if you are printing the current date or the date your program was compiled, you can be sure that it is in the common era and may opt to skip the check.
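One way that soft failure path might look in code (the method name, the pattern strings and the Locale.ENGLISH argument are mine, chosen for illustration; they are not from the original answer):
import java.time.LocalDate;
import java.time.chrono.IsoEra;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class DefensiveFormat {
    // Use a plain year for the common case, but fall back to year-of-era plus era
    // when the date lies before 1 CE, so the output stays unambiguous.
    static String format(LocalDate date) {
        DateTimeFormatter normal = DateTimeFormatter.ofPattern("uuuu-MM-dd");
        DateTimeFormatter withEra = DateTimeFormatter.ofPattern("yyyy-MM-dd GG", Locale.ENGLISH);
        return date.getEra() == IsoEra.CE ? date.format(normal) : date.format(withEra);
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDate.of(2018, 9, 29)));  // 2018-09-29
        System.out.println(format(LocalDate.of(-43, 3, 15)));   // 0044-03-15 BC
    }
}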
For parsing
In many (most?) cases parsing also means validating, meaning you have no guarantees about what your input string looks like. Typically it comes from the user or from another system. An example: a date string comes as 2018-09-29. Here the choice between uuuu and yyyy should depend on what you want to happen in case the string contains a year of 0 or a negative year (e.g., 0000-08-17 or -012-11-13). Assuming that this would be an error, the immediate answer is: use yyyy in order for an exception to be thrown in this case. Finer still: use uuuu and, after parsing, perform a range check on the parsed date. The latter approach allows both a finer validation and a better error message in case of a validation error.
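A sketch of that "parse with uuuu, then range-check" approach might look like this (the range bounds, class and method names are arbitrary examples of mine):
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.ResolverStyle;

public class ParseAndValidate {
    private static final DateTimeFormatter F =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    static LocalDate parseDate(String text) {
        LocalDate date = LocalDate.parse(text, F);
        // Reject years outside a range that makes sense for this (hypothetical) application.
        if (date.getYear() < 1900 || date.getYear() > 2100) {
            throw new IllegalArgumentException("Year out of expected range 1900-2100: " + date);
        }
        return date;
    }

    public static void main(String[] args) {
        System.out.println(parseDate("2018-09-29")); // 2018-09-29
        try {
            parseDate("0000-08-17");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());      // clear, application-specific error message
        }
    }
}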
Special case (already mentioned by Meno Hochschild): if your formatter uses strict resolver style and contains y without G, parsing will always fail because, strictly speaking, year-of-era is ambiguous without an era: 1950 might mean 1950 CE or 1950 BCE (1950 BC). So in this case you need u (or to supply a default era, which is possible through a DateTimeFormatterBuilder).
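To make the special case concrete, here is a small sketch (pattern strings and structure chosen by me) comparing strict parsing with u, with y alone, and with y plus a defaulted era:
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.time.temporal.ChronoField;

public class StrictEraDemo {
    public static void main(String[] args) {
        DateTimeFormatter strictU = DateTimeFormatter.ofPattern("dd.MM.uuuu")
                .withResolverStyle(ResolverStyle.STRICT);
        System.out.println(LocalDate.parse("29.09.2018", strictU)); // 2018-09-29

        DateTimeFormatter strictY = DateTimeFormatter.ofPattern("dd.MM.yyyy")
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse("29.09.2018", strictY);
        } catch (DateTimeParseException e) {
            System.out.println(e.getMessage()); // fails: year-of-era cannot be resolved without an era
        }

        // Supplying a default era (1 = CE) makes the y-based pattern resolvable again.
        DateTimeFormatter strictYWithEra = new DateTimeFormatterBuilder()
                .appendPattern("dd.MM.yyyy")
                .parseDefaulting(ChronoField.ERA, 1)
                .toFormatter()
                .withResolverStyle(ResolverStyle.STRICT);
        System.out.println(LocalDate.parse("29.09.2018", strictYWithEra)); // 2018-09-29
    }
}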
Long story short again
An explicit range check of your dates, specifically your years, is better than relying on the choice between uuuu and yyyy to catch unexpected very early years.
Short comparison, if you need strict parsing:
Examples with the invalid date 31.02.2022:
System.out.println(DateTimeFormatter.ofPattern("dd.MM.yyyy").withResolverStyle(ResolverStyle.STRICT).parse("31.02.2022"));
prints "{MonthOfYear=2, DayOfMonth=31, YearOfEra=2022},ISO"
System.out.println(DateTimeFormatter.ofPattern("dd.MM.uuuu").withResolverStyle(ResolverStyle.STRICT).parse("31.02.2022"));
throws java.time.DateTimeException: Invalid date 'FEBRUARY 31'
So you must use 'dd.MM.uuuu' to get the expected behaviour.

BCPL octal numerical constants

I've been digging into the history of BCPL due to a question I was asked about the reasoning behind using the prefix "0x" for the representation of hexadecimal numbers.
In my search I stumbled upon a really good explanation of the history behind this token. (Why are hexadecimal numbers prefixed with 0x?)
From this post, however, another question arose:
For octal constants, did BCPL use 8 <digit> (as per the spec: http://cm.bell-labs.com/cm/cs/who/dmr/bcpl.pdf), or did it use #<digit> (as per http://rabbit.eng.miami.edu/info/bcpl_reference_manual.pdf), or were both of these syntaxes valid in different implementations of the language?
I've also been able to find a second answer here that used the # syntax which further intrigued me in the subject. (Why are leading zeroes used to represent octal numbers?)
Any historical insights are greatly appreciated.
There were many slight variations on syntax in BCPL.
For example, while the one we used had 16-bit cells (so that x!y gave you the 16-bit word from word address x + y, a word address being half of the byte address), we also had a need to extract words from byte addresses and to extract byte values (since we were primarily creating OS and control software on a 6809 byte-addressable CPU).
Hence in addition to:
x!y - get word from byte address (x + y) * 2
we also had
x!%y - get byte from byte address (x * 2) + y
x%!y - get word from byte address x + (y * 2)
x%%y - get byte from byte address x + y
I'm pretty certain they were implementation-specific as I never saw them anywhere else. And BCPL was around long before language standards were as important as they are today.
The canonical language specification would have been the earlier one from Richards since he wrote the language (and your second document is for the Essex BCPL implementation about a decade later). But keep in mind that Project MAC was the earliest iteration - there were plenty of advancements after that as well.
For example, there's a 2013 revision of the BCPL User Guide (see Martin's home page) which specifies #b, #o and #x as prefixes for various non-decimal bases.

Confused about BER (Basic Encoding Rules)

I'm trying to study and understand BER (Basic Encoding Rules).
I've been using the website http://asn1-playground.oss.com/ to experiment with different ASN.1 objects and encoding them using BER.
However, even the simplest encodings seem to confuse me.
Let's take a simple ASN.1 schema:
World-Schema DEFINITIONS AUTOMATIC TAGS ::=
BEGIN
  Human ::= SEQUENCE {
     name UTF8String
  }
END
So basically this is just a SEQUENCE with a single UTF8String type field called name.
An example of a value that matches this sequence would be something like:
{ "Bob" }
So, using http://asn1-playground.oss.com/, I produce the BER encoding of the following data:
some-guy Human ::=
{
  name "Bob"
}
I would expect this to produce one sequence object, followed by a single string object.
What I get is:
30 05 80 03 42 6F 62
Now, I understand some of this encoding. The first octet, 30, is the identifier, which tells us that a SEQUENCE type is the first object. The 30 is 00110000 in binary, which means that we have a class of 0, a PC (primitive/constructed) bit of 1 (meaning constructed), and a tag number of 10000 (16 in decimal), which means SEQUENCE.
So far so good. The next value is the LENGTH in bytes of the SEQUENCE, which is 05.
Okay, still so far so good.
But then... I'm totally confused by the next octet, 80. What does that mean??? I would have expected a value of 00001100 (for tag number 12, meaning UTF8String).
The bytes following the 80 are pretty straightforward: the 03 means a length of 3, and the 42 6F 62 is just the UTF8String value itself, "Bob".
The 80 is a context-specific tag 0. Please note that "AUTOMATIC TAGS" is used at the beginning of the module. This indicates that all SEQUENCE, SET and CHOICE types will have context-specific tags for their components, starting with [0] and incrementing by 1 for each subsequent component. This way, you don't have to worry about tag conflicts when creating your messages, especially when dealing with components which are OPTIONAL or have a DEFAULT value. If you change "AUTOMATIC" to "EXPLICIT" (which I would not recommend), you will see the [UNIVERSAL 12] that you were expecting in the encoding.
Please note that AUTOMATIC TAGS applies only to tags on components of SEQUENCE, SET, or CHOICE. It does not apply to top-level types, which is why you saw the [UNIVERSAL 16] for the SEQUENCE rather than a context-specific tag there as well.
80 indicates context-specific class, primitive, tag number 0. It is there because you specified an automatic tagging environment, which automatically assigned a [0] tag to the field name in type Human.
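If it helps to see this programmatically, here is a minimal, hand-rolled Java sketch (no ASN.1 library; written only for this example) that walks the identifier/length/value layout of 30 05 80 03 42 6F 62:
public class BerWalk {
    public static void main(String[] args) {
        int[] ber = {0x30, 0x05, 0x80, 0x03, 0x42, 0x6F, 0x62};

        // Outer TLV: identifier 0x30 = universal, constructed, tag 16 (SEQUENCE)
        describe(ber[0]);
        int seqLen = ber[1];                       // 5 content octets follow
        System.out.println("length " + seqLen);

        // Inner TLV: identifier 0x80 = context-specific, primitive, tag 0 ([0] name)
        describe(ber[2]);
        int strLen = ber[3];                       // 3 content octets follow
        StringBuilder value = new StringBuilder();
        for (int i = 0; i < strLen; i++) {
            value.append((char) ber[4 + i]);       // 0x42 0x6F 0x62 -> "Bob"
        }
        System.out.println("length " + strLen + ", value \"" + value + "\"");
    }

    // Splits a single-octet identifier into class, primitive/constructed bit and tag number.
    static void describe(int identifier) {
        String[] classes = {"universal", "application", "context-specific", "private"};
        String tagClass = classes[(identifier >> 6) & 0x03];
        boolean constructed = ((identifier >> 5) & 0x01) == 1;
        int tagNumber = identifier & 0x1F;         // only valid for tag numbers below 31
        System.out.println(tagClass + ", " + (constructed ? "constructed" : "primitive")
                + ", tag " + tagNumber);
    }
}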