Handling oddly-formatted timestamp in Postgres? - postgresql

I have about 32 million tuples of data of the format:
2012-02-22T16:46:28.9670320+00:00
I have been told that the +00:00 indicates an hour:minute timezone offset, but also that Postgres only accepts an hour offset (possibly fractional), not hours and minutes. So would I have to process the data to remove the trailing :00 from every tuple before reading the data in as timestamps? I would like to avoid pre-processing the data file, but if Postgres will not accept the values otherwise, then I will do so.
In addition, the precision given in the data is 7 decimal places in the seconds part, whereas the Postgres timestamp type allows for at most 6 decimal places (microsecond precision). Would I have to trim the 7 decimal places down to 6 in order for Postgres to read the records in, or will Postgres automatically round 7 digits to 6 as it reads the tuples?

pgsql=# SELECT '2016-07-10 20:12:21.8372949999+02:30'::timestamp with time zone AS ts;
              ts
-------------------------------
 2016-07-10 17:42:21.837295+00
(1 row)
It seems that at least in PostgreSQL 9.4 and up (and possibly earlier), an hour:minute timezone offset is processed correctly even if that handling is not obvious from the documentation. In the same vein, if I read in a timestamp that has 7 decimal places in the seconds, it is automatically rounded to 6 decimal place (microsecond) precision instead.
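For completeness, the exact literal from the question can be tested the same way (the displayed offset assumes the session time zone is UTC); the seventh fractional digit is rounded away on input:
SELECT '2012-02-22T16:46:28.9670320+00:00'::timestamptz AS ts;
-- 2012-02-22 16:46:28.967032+00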

Related

PostgreSQL: smaller timestamptz type?

timestamptz is 8 bytes in PostgreSQL. Is there a way to get a 6-byte timestamptz by dropping some precision?
6 bytes is pretty much out of the question, since there is no data type with that size.
With some contortions you could use a 4-byte real value:
CREATE CAST (timestamp AS bigint) WITHOUT FUNCTION;  -- reinterpret the internal value: microseconds since 2000-01-01
-- divide down to seconds, shift the epoch by 662774400 s (2000-01-01 to 2021-01-01), and store as a 4-byte real
SELECT (localtimestamp::bigint / 1000000 - 662774400)::real;
    float4
--------------
 2.695969e+06
(1 row)
That would give you the time since 2021-01-01 00:00:00 with a precision of about a second (but of course for dates farther from that point, the precision will deteriorate).
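To see that deterioration, a quick check with a date about 24 years away from the 2021 reference point (this assumes the CREATE CAST from above is in place): at that distance a float4 can only resolve steps of roughly 64 seconds, so two timestamps one second apart collapse to the same value.
SELECT ('2045-01-01 00:00:00'::timestamp::bigint / 1000000 - 662774400)::real AS t0,
       ('2045-01-01 00:00:01'::timestamp::bigint / 1000000 - 662774400)::real AS t0_plus_1s;
-- both come out as 7.573824e+08; the one-second difference is lost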
But the whole exercise is pretty much pointless. Trying to save 2 or 4 bytes in this way is not a good idea:
the space savings will be minimal; today, when you can have terabytes of storage with little effort, a couple of bytes per row are hardly worth it
if you don't carefully arrange your table columns, you will lose the bytes you think you have won to alignment padding (see the sketch below)
using a number instead of a proper timestamp data type will make your queries more complicated and the results hard to interpret, and it will keep you from using date arithmetic
For all these reasons, I would place this idea firmly in the realm of harmful micro-optimization.
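To make the alignment point concrete, here is a hypothetical comparison (table and column names are made up, and the sizes assume a typical 64-bit build): a 4-byte real followed by an 8-byte column picks up 4 bytes of padding, so the row ends up no smaller than with a plain timestamptz.
CREATE TABLE t_real (ts real,        big bigint);
CREATE TABLE t_tstz (ts timestamptz, big bigint);
INSERT INTO t_real VALUES (2.695969e+06, 1);
INSERT INTO t_tstz VALUES (now(), 1);
SELECT pg_column_size(t_real.*) FROM t_real;  -- likely 40: 4-byte real + 4 bytes padding + 8-byte bigint
SELECT pg_column_size(t_tstz.*) FROM t_tstz;  -- likely 40 as well, so nothing was saved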

Can a 4 byte timestamp value in MongoDb ObjectId overflow?

If at some time the epoch is ffffffff, then the ObjectId created at this moment is something like:
ffffffff15580625bcb65364
Then, what could be the ObjectId created after 1 second?
Then, what could be the ObjectId created after [the Unix epoch rolls over in 32 bits]?
This would depend on the specific implementation, its programming language, and how it handles integer arithmetic.
It is possible that some implementations and languages would error when they retrieve the number of seconds since the Unix epoch as a 64-bit integer (which is quite common today) and then try to use a value which exceeds 32 bits in size for ObjectId generation. If this happens, the driver will no longer be able to generate ObjectIds, and consequently it may be unable to insert documents unless the application provides _id values using some other generation strategy.
In other implementations the timestamp itself may roll over to zero, at which point the ObjectId generation will succeed with a very small timestamp value.
Yet other implementations may truncate (from either most or least significant side) the timestamp to coerce it into the 32 available bits of an ObjectId.
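Purely to illustrate the 32-bit arithmetic involved (this is not the behaviour of any particular driver, and SQL is used here only because it is the language of the rest of this page): one second after the 32-bit rollover, the epoch value needs 33 bits, and keeping only the low 32 bits wraps it to zero.
SELECT to_hex(4294967296::bigint)              AS seconds_hex,   -- '100000000': needs 33 bits
       to_hex(4294967296::bigint & 4294967295) AS low_32_bits;   -- '0': wrapped/truncated to zero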
The ObjectId value itself doesn't actually have to have an accurate timestamp - it is required to be unique within the collection and it is "generally increasing", but MongoDB-the-database wouldn't care if ObjectId values wrapped around to zero at some point.
As the docs say, the timestamp is represented by 4 bytes:
4-byte timestamp value, representing the ObjectId’s creation, measured in seconds since the Unix epoch
4 bytes can hold 4,294,967,296 distinct values (signed, that is -2,147,483,648 to 2,147,483,647); treated as an unsigned timestamp, the largest representable value is 4,294,967,295.
And 4,294,967,295 interpreted as a Unix timestamp is: GMT: Sunday, 7 February 2106 06:28:15.
After this date, ObjectId won't be able to store the timestamp.
So, can ObjectId overflow? In about 85 years, every new ObjectId created will fail, because the timestamp can no longer be represented with only 4 bytes.
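As a quick cross-check of that date, using PostgreSQL's to_timestamp purely as an epoch calculator (output shown with the session time zone set to UTC):
SELECT to_timestamp(4294967295);  -- 2106-02-07 06:28:15+00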

How to understand Bson Timestamp?

For the Mongo BSON type Timestamp, there is a constructor: BsonTimestamp(final int seconds, final int increment). How should the increment be understood, and what is the design consideration behind it?
Timestamp is an internal BSON type used by MongoDB to reflect the operation time (ts) for entries in the replication oplog.
The BSON Timestamp type is designed for the specific use case of logging ordered batches of time-based operations:
the first 32 bits (time_t) are an integer value representing seconds since the Unix epoch
the second 32 bits are an integer value (ordinal) indicating ordering within a given second
The design requirement for Timestamps is based on preserving strict ordering for oplog entries rather than time precision (eg milliseconds or microseconds). The leading time component gives a coarse granularity of seconds; appending an incrementing ordinal value ensures strict ordering of unique Timestamp values within a given second. Using an ordered sequence instead of finer time precision avoids the potential conflict of two operations occurring in the same millisecond (or microsecond).
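To illustrate the layout, here is the packing arithmetic only (not MongoDB's API; SQL is used simply because it is the language of the rest of this page): with the seconds in the high 32 bits and the ordinal in the low 32 bits, packed values compare in operation order, and the ordinal breaks ties within a second.
SELECT (1700000000::bigint << 32) | 1 AS ts_op1,   -- first operation in that second
       (1700000000::bigint << 32) | 2 AS ts_op2;   -- second operation: larger value, same second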
For application use cases you should use the BSON Date type instead of a Timestamp. A BSON Date is the same size (in bits) as a Timestamp, but provides more granularity for time:
BSON Date is a 64-bit integer that represents the number of milliseconds since the Unix epoch (Jan 1, 1970). This results in a representable date range of about 290 million years into the past and future.

Storing date as time in millis

I am trying to represent date objects in a data store without the hassle of the Date object in Java. So I thought of just using a time in milliseconds and storing the UTC time-zone offset as well. I thought about using simple shift routines to combine everything into a single long, as the time zone is just 5 bits (+/-12).
Can someone see any problem with this? What other compact storage schemes for dates (other than a textual representation) exist, and how do they compare to this?
I think you're undervaluing granularity in your time zone and overvaluing the need for bits in the timestamp.
A long has 8 bytes for this purpose.
Let's say you allow yourself 2 bytes for the time zone. That leaves you with 6 for the timestamp: 6 * 8 = 48 bits for a timestamp.
The largest number a 48-bit unsigned integer can hold is 281,474,976,710,655.
Divide by 1,000 to get from milliseconds to seconds: 281,474,976,710.
Punch that number into an epoch converter: 10889-08-02T05:31:50+00:00
That's the year 10889, and we're currently in 2015.
Just use 2 bytes for the timezone. You've got the space. That will easily allow you to represent the timezone as a minutes offset of +/-24 hours. And since it's whole bytes, the packing code will be simpler to comprehend.
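A rough sketch of that packing, shown as bit arithmetic in SQL only because that is the language used elsewhere on this page; the same shifts apply to a Java long. The layout and the 32768 bias on the minutes offset are assumptions for illustration, not an established format: 48 high bits carry the milliseconds, 16 low bits carry the offset in minutes.
-- pack: millis in the high 48 bits, biased minutes-offset in the low 16 bits
SELECT ((1397071518123::bigint) << 16) | (330 + 32768) AS packed;   -- +05:30 is 330 minutes
-- unpack:
SELECT (packed >> 16)             AS millis,
       ((packed & 65535) - 32768) AS tz_offset_minutes
FROM (SELECT ((1397071518123::bigint) << 16) | (330 + 32768) AS packed) AS x;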

PostgreSQL field type for unix timestamp?

PostgreSQL field type for a unix timestamp:
to store it as a unix timestamp
to retrieve it as a unix timestamp as well.
I have been going through Date/Time Types in PostgreSQL 9.1.
Is integer the best way to go!? (This is what I had done when I was using MySQL: I had used int(10).)
The unix epoch timestamp right now (2014-04-09) is 1397071518. So we need a data type capable of storing a number at least this large.
What data types are available?
If you refer to the PostgreSQL documentation on numeric types you'll find the following options:
Name       Size      Minimum                 Maximum
smallint   2 bytes   -32768                  +32767
integer    4 bytes   -2147483648             +2147483647
bigint     8 bytes   -9223372036854775808    +9223372036854775807
What does that mean in terms of time representation?
Now, we can take those numbers and convert them into dates using an epoch converter:
Name       Size      Minimum Date          Maximum Date
smallint   2 bytes   1969-12-31            1970-01-01
integer    4 bytes   1901-12-13            2038-01-19
bigint     8 bytes   -292275055-05-16      292278994-08-17
Note that for bigint, interpreting the value as seconds puts you so far into the past and the future that it probably doesn't matter; the dates I've given in the table are for a Unix timestamp expressed in milliseconds.
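These bounds are easy to double-check from psql itself, using to_timestamp as an epoch converter (the displayed offset assumes a UTC session time zone):
SELECT to_timestamp(2147483647);   -- 2038-01-19 03:14:07+00, the integer ceiling
SELECT to_timestamp(-2147483648);  -- 1901-12-13 20:45:52+00, the integer floor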
So, what have we learned?
smallint is clearly a bad choice.
integer is a decent choice for the moment, but your software will blow up in the year 2038. The Y2K apocalypse has nothing on the Year 2038 Problem.
Using bigint is the best choice. This is future-proofed against most conceivable human needs, though the Doctor may still criticise it.
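A minimal sketch of the bigint approach (table and column names are just examples): store extract(epoch ...) as a bigint and convert back with to_timestamp when needed.
CREATE TABLE events (
    id         serial PRIMARY KEY,
    created_at bigint NOT NULL      -- seconds since the Unix epoch
);
INSERT INTO events (created_at) VALUES (extract(epoch FROM now())::bigint);
SELECT to_timestamp(created_at) AS created_at_tz FROM events;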
You may also want to consider whether it would be better to store your timestamp in another format, such as the ISO 8601 standard.
I'd just go with using TIMESTAMP WITH(OUT) TIME ZONE and use EXTRACT to get a UNIX timestamp representation when you need one.
Compare
SELECT NOW();
with
SELECT EXTRACT(EPOCH FROM NOW());
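And to go the other way, from a Unix timestamp back to a timestamptz, to_timestamp does the job (the example value is the one from the question; the displayed offset assumes a UTC session time zone):
SELECT to_timestamp(1397071518);                              -- 2014-04-09 19:25:18+00
SELECT extract(epoch FROM to_timestamp(1397071518))::bigint;  -- 1397071518, round-tripped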
integer would be good, but not good enough, because PostgreSQL doesn't support unsigned types.