I'm trying to represent dates in data storage without the hassle of Java's Date object. So I thought of using just a time in milliseconds and storing the UTC offset as well. My idea was to use simple shift routines to combine everything into a single long, since a time zone offset needs only 5 bits (±12 hours).
Can anyone see a problem with this? What other compact (non-textual) storage schemes for dates exist, and how do they compare?
I think you're undervaluing the granularity of your time zone and overvaluing the number of bits needed for the timestamp.
A long gives you 8 bytes for this purpose.
Let's say you allow yourself 2 bytes for the time zone. That leaves 6 bytes for the timestamp: 6 * 8 = 48 bits.
The largest number a 48-bit unsigned integer can handle is 281,474,976,710,655.
Divide by 1,000 to get from milliseconds to seconds: 281,474,976,710.
Punch that number into an epoch converter: 10889-08-02T05:31:50+00:00
That's the year 10,889, and we're in 2015.
Just use 2 bytes for the time zone. You've got the space. That easily lets you represent the time zone as a minute offset within ±24 hours. And since it's whole bytes, the packing code will be simpler to comprehend.
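A minimal sketch of that layout in Java (my own illustration; the class and method names are mine): the low 48 bits hold the millisecond timestamp and the high 16 bits hold the signed offset in minutes.

public final class PackedInstant {
    private static final long TIMESTAMP_MASK = (1L << 48) - 1;

    // offsetMinutes is the UTC offset in minutes, e.g. +330 for India.
    // This sketch assumes a non-negative epochMillis (1970 onward).
    public static long pack(long epochMillis, short offsetMinutes) {
        return (epochMillis & TIMESTAMP_MASK) | ((long) offsetMinutes << 48);
    }

    public static long unpackMillis(long packed) {
        return packed & TIMESTAMP_MASK;
    }

    public static short unpackOffsetMinutes(long packed) {
        return (short) (packed >> 48); // arithmetic shift preserves the sign
    }
}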
If at some point the epoch is ffffffff, then the ObjectId created at that moment is something like:
ffffffff15580625bcb65364
Then, what would the ObjectId created one second later look like, after the Unix epoch rolls over past 32 bits?
This would depend on the specific implementation, its programming language, and how they handle the arithmetic.
It is possible that some implementations and languages would error when they retrieve the number of seconds since the Unix epoch as a 64-bit integer (which is quite common today) and then try to use a value that exceeds 32 bits for ObjectId generation. If this happens, the driver will cease to be able to generate ObjectIds, and consequently it may be unable to insert documents unless the application provides _id values using some other generation strategy.
In other implementations the timestamp itself may roll over to zero, at which point ObjectId generation will succeed with a very small timestamp value.
Yet other implementations may truncate the timestamp (from either the most or least significant side) to coerce it into the 32 available bits of an ObjectId.
The ObjectId value itself doesn't actually have to carry an accurate timestamp: it is required to be unique within the collection and it is "generally increasing", but MongoDB-the-database wouldn't care if ObjectId values wrapped around to zero at some point.
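As a sketch of the truncating case (my own illustration, not any particular driver's code), Java's narrowing cast keeps exactly the low 32 bits:

long epochSeconds = 0x1_0000_0001L;         // one second past the 32-bit rollover
int objectIdTimestamp = (int) epochSeconds; // narrowing cast keeps the low 32 bits
System.out.println(objectIdTimestamp);      // prints 1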
As the docs say, the timestamp is represented by 4 bytes:
4-byte timestamp value, representing the ObjectId’s creation, measured in seconds since the Unix epoch
A signed 4-byte integer ranges from -2,147,483,648 to 2,147,483,647; treated as unsigned, the maximum value is 4,294,967,295.
And a Unix timestamp of 4,294,967,295 corresponds to Sunday, 7 February 2106 06:28:15 GMT.
After this date, ObjectId won't be able to store the timestamp.
So, can ObjectId overflow? In about 85 years, creating a new ObjectId will fail because the timestamp will no longer fit in only 4 bytes.
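You can check that cutoff with java.time (a quick verification I added):

import java.time.Instant;

long maxUnsigned32 = 0xFFFFFFFFL; // 4,294,967,295
System.out.println(Instant.ofEpochSecond(maxUnsigned32)); // 2106-02-07T06:28:15Z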
When reading the PostgreSQL (13) documentation, I came across this page, which lists the storage sizes of different datetime types.
Inter alia it states:
Name                         Storage Size  Description                            Low Value      High Value     Resolution
---------------------------  ------------  -------------------------------------  -------------  -------------  -------------
timestamp without time zone  8 bytes       both date and time (no time zone)      4713 BC        294276 AD      1 microsecond
timestamp with time zone     8 bytes       both date and time, with time zone     4713 BC        294276 AD      1 microsecond
time with time zone          12 bytes      time of day (no date), with time zone  00:00:00+1559  24:00:00-1559  1 microsecond
I can understand timestamp without time zone: there are roughly (4713 + 294276) * 365 * 24 * 60 * 60 * 1,000,000 microseconds between the low and high values (not accounting for calendar changes and such). This amounts to roughly 2^63 values, which can be stored in 8 bytes.
However, timestamp with time zone has the same range and resolution and can additionally store the time zone information. If we already use 63 bits for the values, there is only one bit left, which cannot be enough to store the time zone, so the internal storage must work differently somehow.
Even stranger, while timestamps use only 8 bytes, time with time zone needs 12 bytes, although it has the same resolution and a much smaller range of allowed values.
Thus my question: How does the internal storage of these types in PostgreSQL work?
Both timestamp with time zone and timestamp without time zone are stored as an 8-byte integer representing microseconds since midnight on January 1, 2000.
The difference is that timestamp with time zone is interpreted as being in UTC and is converted according to the current setting of the timezone parameter on display.
The definitions are in src/include/datatype/timestamp.h:
typedef int64 Timestamp;
typedef int64 TimestampTz;
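To make that concrete, here is a small sketch (mine, not PostgreSQL code) that computes the internal value for a given instant, i.e. microseconds since the Postgres epoch of 2000-01-01 00:00:00 UTC:

import java.time.Instant;
import java.time.temporal.ChronoUnit;

Instant pgEpoch = Instant.parse("2000-01-01T00:00:00Z");
Instant ts = Instant.parse("2024-01-01T00:00:00Z");
long micros = ChronoUnit.MICROS.between(pgEpoch, ts);
System.out.println(micros); // 757382400000000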
time without time zone is an 8-byte integer representing microseconds since midnight (a 4-byte integer wouldn't suffice). See src/include/utils/date.h:
typedef int64 TimeADT;
time with time zone has an additional 4-byte integer representing the time zone offset in seconds.
See src/include/utils/date.h:
typedef struct
{
    TimeADT     time;   /* all time units other than months and years */
    int32       zone;   /* numeric time zone, in seconds */
} TimeTzADT;
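In Java terms the struct is roughly the following (a hypothetical analogue, names are mine); the 8-byte time plus the 4-byte offset is what brings the on-disk size to 12 bytes:

record TimeTz(long microsSinceMidnight, int zoneOffsetSeconds) { }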
Follow the documentation and avoid time with time zone:
The type time with time zone is defined by the SQL standard, but the definition exhibits properties which lead to questionable usefulness.
and can additionally store the time zone information
No, it doesn't store time zone information.
From the same page you linked to:
All timezone-aware dates and times are stored internally in UTC. They are converted to local time in the zone specified by the TimeZone configuration parameter before being displayed to the client.
(emphasis mine)
So timestamp and timestamptz are stored in the same way. They just differ in how values are accepted from and returned to the client.
I have about 32 million tuples of data in the format:
2012-02-22T16:46:28.9670320+00:00
I have been told that the +00:00 indicates an hour:minute time zone offset, but also that Postgres only accepts an hour offset (even with decimals), not minutes. So would I have to process the data to remove the trailing :00 from every tuple before reading it in as timestamps? I would like to avoid pre-processing the data file, but if Postgres will not accept the values otherwise, then I will do so.
In addition, the given data has 7 decimal places of precision in the seconds part, whereas the Postgres timestamp type allows at most 6 decimal places (microseconds). Would I have to reduce the 7 decimal places to 6 for Postgres to read the records in, or will Postgres automatically convert from 7 to 6 as it reads the tuples?
pgsql=# SELECT '2016-07-10 20:12:21.8372949999+02:30'::timestamp with time zone AS ts;
              ts
-------------------------------
 2016-07-10 17:42:21.837295+00
(1 row)
It seems that at least in PostgreSQL 9.4 and up (maybe earlier), a minutes component in the time zone offset is not documented but does get processed properly. In a similar vein, a timestamp with 7 decimal places of seconds precision is automatically rounded to 6 decimal places (microsecond precision).
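If you would rather normalize the values yourself before loading, here is a sketch (my own, assuming Java is available in your pipeline) that parses the 7-digit input and truncates it to microseconds:

import java.time.OffsetDateTime;
import java.time.temporal.ChronoUnit;

OffsetDateTime odt = OffsetDateTime.parse("2012-02-22T16:46:28.9670320+00:00");
System.out.println(odt.truncatedTo(ChronoUnit.MICROS)); // 2012-02-22T16:46:28.967032Z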
I am in the process of optimizing some UniVerse data access code we have which uses UniObjects. After some experimentation, it seems that using a UniSession.OConv call to parse certain things such as decimal numbers (most of ours are MR4, MR2, or MR2$) and dates (almost all are D2/) is extremely slow (I think it may make a call back to the server to do the parsing).
I have already built a parser for the MR*[$] codes, but I was wondering how the dates are stored so I can build one for D2/. Usually they seem to be stored as a 5-digit number. I thought it could be the number of days since the Unix epoch, since our UniVerse server runs on HP-UX, but after finding 15766 as a last-modified date and multiplying it by 86400 (seconds per day), I got March 02, 2013, which doesn't make sense as a last-modified date since, as far as I know, that is still in the future.
Does anyone know what the time base of these date numbers is?
It is stored as a number of days. Just do a conversion on 0 and you will get the start date.
Edit:
As noted by Los, the Epoch used in UniVerse (and UniData) is 31st Dec 1967.
In UniVerse and any other Pick database, dates and times are stored as separate values.
The internal date is the number of days before or after 31/12/1967, which is day zero.
The internal time is the number of seconds after midnight. It can be stored as a decimal but is not normally.
In TCL there is a CDT command (it stands for Convert Date) that converts dates from human-readable to numeric and vice versa:
CDT 9/28/2017
* Result: 18169
CDT 18169
* Result: 09/28/2017
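For code outside of UniVerse, the same conversion is easy with java.time (a sketch I added; verify the results against CDT):

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

LocalDate dayZero = LocalDate.of(1967, 12, 31);
System.out.println(dayZero.plusDays(18169)); // 2017-09-28
System.out.println(ChronoUnit.DAYS.between(dayZero, LocalDate.of(2017, 9, 28))); // 18169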
I need to generate a suffix to uniquify a value. I thought of using the current date and time, but the suffix must be no more than 5 bytes long. Are there any hashing methods that can produce a hash of 5 bytes or less from a date in yyyyMMddHHmmss format?
Any other ideas? It would be simple to maintain a running counter and use the next value, but I would prefer not to rely on any kind of stored value.
If you do not need printable characters, I would suggest simply using the Unix timestamp. That will work great even with 4 bytes (until January 19, 2038).
If you want to use only a subset of characters, I would suggest creating a list of the values you want to use.
Let's say you want to use the letters (uppercase and lowercase) and the digits -> 62 values.
Now you need to convert the timestamp into base-62. Let's say your timestamp is 100:
100 = (1 * 62^1) + (38 * 62^0)
If you have stored your printable values in an array, you can use the coefficients 1 and 38 as indexes into that array.
If you choose your base too small, five bytes will not be enough. In that case you can either subtract a constant from the timestamp (which will buy you some time) or estimate when duplicate timestamps will occur and check whether that date is past your retirement date ;-)
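Here is a minimal sketch of that encoding in Java (my own code; the alphabet choice is arbitrary). Note that five base-62 characters cover 62^5 = 916,132,832 values, which is already smaller than current Unix timestamps, so in practice you would subtract a constant as suggested above:

public final class Base62 {
    private static final char[] ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".toCharArray();

    public static String encode(long value) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET[(int) (value % 62)]); // least significant digit first
            value /= 62;
        }
        return sb.reverse().toString();
    }
}

For the example above, Base62.encode(100) returns "1c": 100 = 1 * 62 + 38, and ALPHABET[38] is 'c'.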