How to reduce date in yyyyMMddHHmmss format to 5 bytes?

I need to generate a suffix to uniquify a value. I thought of using the current date and time, but the suffix must be no more than 5 bytes long. Are there any hashing methods that can produce a hash of 5 bytes or less from a date in yyyyMMddHHmmss format?
Any other ideas? It would be simple to maintain a running counter and use the next value, but I would prefer not to rely on any kind of stored value.

If you do not need printable characters, I would suggest that you simply use the Unix timestamp. It fits in 4 bytes as a signed 32-bit value until January 19, 2038.
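As a minimal sketch of the raw-bytes variant, the following packs a Unix timestamp into 4 big-endian bytes. The function name `timestamp_suffix` and the choice of big-endian signed packing are assumptions made for illustration, not something the question prescribes.

```python
import struct
import time
from datetime import datetime, timedelta, timezone


def timestamp_suffix(now=None) -> bytes:
    """Pack a Unix timestamp (seconds) into 4 big-endian bytes.

    `now` is a hypothetical parameter that lets callers pass a fixed time;
    by default the current time is used.
    """
    seconds = int(time.time() if now is None else now)
    # ">i" is a signed 32-bit integer; struct.error is raised past 2**31 - 1.
    return struct.pack(">i", seconds)


# The last instant that still fits into a signed 32-bit value.
LAST_SIGNED_32 = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(seconds=2**31 - 1)

print(timestamp_suffix().hex())
print(LAST_SIGNED_32.isoformat())
```

Note that a one-second resolution means two values generated within the same second get the same suffix.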
If you want to use only a subset of characters, I would suggest that you create a list of the values you want to use.
Let's say you want to use the letters (upper and lower case) and the digits, which gives 62 values.
Now you need to convert the timestamp into base-62. Let's say your timestamp is 100:
100 = (1 * 62^1) + (38 * 62^0)
If you have stored your printable value in an array, you could use the coefficients 1 and 38 as an index into that array.
If you choose a base that is too small, five bytes will not be enough. In that case you can either subtract a constant from the timestamp (which will buy you some time), or you can estimate when duplicate suffixes will start to occur and check whether that date is past your retirement date ;-)
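The base-62 conversion described above can be sketched as follows. The alphabet order (digits, then upper case, then lower case), the names `to_base62` and `suffix`, and the example custom epoch are all assumptions for illustration. With 62 symbols, five characters cover 62^5 = 916,132,832 seconds (roughly 29 years), so a present-day Unix timestamp only fits after subtracting a constant.

```python
import string

# 62 printable symbols; the ordering is an arbitrary choice.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def to_base62(n: int) -> str:
    """Convert a non-negative integer to base 62, most significant digit first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def suffix(timestamp: int, custom_epoch: int = 0, width: int = 5) -> str:
    """Encode (timestamp - custom_epoch) as exactly `width` base-62 characters."""
    value = timestamp - custom_epoch
    if not 0 <= value < 62 ** width:
        raise OverflowError("value does not fit into the requested width")
    return to_base62(value).rjust(width, ALPHABET[0])


# 100 = 1 * 62^1 + 38 * 62^0, so the digits are ALPHABET[1] and ALPHABET[38].
print(to_base62(100))  # 1c
print(suffix(1_600_000_100, custom_epoch=1_600_000_000))  # 0001c
```

Padding to a fixed width keeps the suffixes sortable as strings; drop the `rjust` call if variable-length suffixes are acceptable.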

Related

Can a 4 byte timestamp value in MongoDb ObjectId overflow?

If at some point the epoch value is ffffffff, then the ObjectId created at that moment is something like:
ffffffff15580625bcb65364
Then, what could be the ObjectId created after 1 second?
Then, what could be the ObjectId created after [the Unix epoch rolls over in 32 bits]?
This would depend on the specific implementation, its programming language and their handling of math calculations.
It is possible that some implementations and languages would error when they retrieve the number of seconds since the Unix epoch as a 64-bit integer (which is quite common today) and then try to use a value which exceeds 32 bits in size for ObjectId generation. If this happens the driver will cease to be able to generate ObjectIds, consequently it may be unable to insert documents without _id values being provided by the application using some other generation strategy.
In other implementations the timestamp itself may roll over to zero, at which point the ObjectId generation will succeed with a very small timestamp value.
Yet other implementations may truncate (from either most or least significant side) the timestamp to coerce it into the 32 available bits of an ObjectId.
The ObjectId value itself doesn't actually have to have an accurate timestamp - it is required to be unique within the collection and it is "generally increasing" but MongoDB-the-database wouldn't care if ObjectId values wrapped to around zero at some point.
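To make the first two outcomes concrete, here is a small sketch of a strict and a wrapping way of squeezing a seconds counter into 4 bytes. The function names are hypothetical and this is not taken from any actual MongoDB driver; it only illustrates the kinds of behavior described above.

```python
import struct

MAX_U32 = 0xFFFFFFFF


def strict_timestamp_bytes(seconds: int) -> bytes:
    """Refuse values that do not fit: struct.error is raised above 2**32 - 1."""
    return struct.pack(">I", seconds)


def wrapping_timestamp_bytes(seconds: int) -> bytes:
    """Keep only the low 32 bits, so the value rolls over to zero."""
    return struct.pack(">I", seconds & MAX_U32)


print(strict_timestamp_bytes(MAX_U32).hex())        # ffffffff
print(wrapping_timestamp_bytes(MAX_U32 + 1).hex())  # 00000000
```

Masking with `0xFFFFFFFF` is the same as truncating from the most significant side, which is why the wrapped value restarts at zero.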
As the docs say, the timestamp is represented by 4 bytes:
4-byte timestamp value, representing the ObjectId’s creation, measured in seconds since the Unix epoch
Interpreted as a signed integer, 4 bytes range from -2,147,483,648 to 2,147,483,647; interpreted as an unsigned integer, they range from 0 to 4,294,967,295.
And the date from 4,294,967,295 according to unix timestamp is: GMT: Sunday, 7 February 2106 6:28:15
After this date, ObjectId won't be able to store the timestamp.
So, can ObjectId overflow? In 85 years every new ObjectId created will fail because it won't be able to create the timestamp with only 4 bytes.
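The date arithmetic above can be checked with a few lines of Python. The calculation adds the largest unsigned 32-bit value, as seconds, to the Unix epoch; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Largest value an unsigned 4-byte seconds counter can hold.
last_moment = EPOCH + timedelta(seconds=2**32 - 1)

print(last_moment.isoformat())  # 2106-02-07T06:28:15+00:00
```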

Operating with datetimes in SQLite

I'm interested in knowing the different ways to operate on datetimes in SQLite and in understanding their pros and cons. I could not find a detailed explanation of all the alternatives anywhere.
So far I have learned that
SQLite doesn't actually have a native storage class for timestamps /
dates. It stores these values as NUMERIC or TEXT typed values
depending on input format. Date manipulation is done using the builtin
date/time functions, which know how to convert inputs from the other
formats.
(quoted from here)
When any operation between datetimes is needed, I have seen two different approaches:
julianday function
SELECT julianday(OneDatetime) - julianday(AnotherDatetime) FROM MyTable;
Number of days is returned, but this can be fractional.
Therefore, you can also get some other measures of time with some extra operations. For instance, to get minutes:
SELECT CAST ((
julianday(OneDatetime) - julianday(AnotherDatetime)
) * 24 * 60 AS INTEGER) FROM MyTable;
Apparently julianday could cause some problems:
Bear in mind that julianday returns the (fractional) number of 'days'
- i.e. 24hour periods, since noon UTC on the origin date. That's usually not what you need, unless you happen to live 12 hours west of
Greenwich. E.g. if you live in London, this morning is on the same
julianday as yesterday afternoon.
More information in this post.
strftime function
SELECT strftime('%s', OneDatetime) - strftime('%s', AnotherDatetime) FROM MyTable;
Number of seconds is returned. Similarly, you can also get some other measures of time with some extra operations. For instance, to get minutes:
SELECT (strftime('%s', OneDatetime) - strftime('%s', AnotherDatetime)) / 60 FROM MyTable;
More information here.
My conclusion so far is: julianday seems easier to use, but can cause some problems. strftime seems more verbose, but also safer. Both of them return the result in a single unit only (days, hours, minutes or seconds), not as a combination of several.
Question
1) Is there any other possibility to operate with datetimes?
2) What would be the best way to get directly the difference of two datetimes in time format (or date or datetime), where datetime would be formatted as 'YYYY-mm-dd HH:MM:SS', and the result would be something in the same format?
I would have imagined that something like the following would work, but it does not:
SELECT DATETIME('2016-11-04 08:05:00') - DATETIME('2016-11-04 07:00:00') FROM MyTable;
> 01:05:00
Julian day numbers are perfectly safe when computing differences.
The only problem would be if you tried to convert them into a date by truncating any fractional digits; this would result in noon, not midnight. (The same could happen if you tried to store them in integer variables.) But that is not what you're doing here.
SQLite has no built-in function to compute date/time differences; you have to convert date/time values into some number first. Whether you use (Julian) days or seconds does not really matter from a technical point of view; use whatever is easier in your program.
If you started with a different format, you might want to convert the resulting difference back into that format, e.g.:
time(difference_value, 'unixepoch') -- from seconds to hh:mm:ss
time(0.5 + difference_value) -- from Julian days to hh:mm:ss
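A runnable sketch of both routes, using Python's built-in sqlite3 module with an in-memory database and the two datetimes from the question. The variable names are illustrative. Keep in mind that time() only returns a time of day, so this formatting is only meaningful for differences shorter than 24 hours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
later, earlier = "2016-11-04 08:05:00", "2016-11-04 07:00:00"

# Difference in whole seconds, via Unix timestamps.
seconds = conn.execute(
    "SELECT strftime('%s', ?) - strftime('%s', ?)", (later, earlier)
).fetchone()[0]

# Difference in fractional days, via Julian day numbers.
days = conn.execute(
    "SELECT julianday(?) - julianday(?)", (later, earlier)
).fetchone()[0]

# Convert each difference back to hh:mm:ss.
from_seconds = conn.execute(
    "SELECT time(?, 'unixepoch')", (seconds,)
).fetchone()[0]
from_days = conn.execute("SELECT time(0.5 + ?)", (days,)).fetchone()[0]

print(seconds, from_seconds)
print(days, from_days)
conn.close()
```

The 0.5 offset is needed because Julian day numbers start at noon, so 0.5 corresponds to midnight.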

Handling oddly-formatted timestamp in Postgres?

I have about 32 million tuples of data of the format:
2012-02-22T16:46:28.9670320+00:00
I have been told that the +00:00 indicates an hour:minute timezone offset, but also that Postgres only takes in hour offset (even in decimals), not the minute. So would I have to process the data in order to remove the last :00 from every tuple and read the data in as timestamps? I would like to avoid pre-processing the data file, but if Postgres will not accept the values otherwise, then I will do so.
In addition, the given data has 7 decimal places in the seconds part, whereas the Postgres timestamp data type allows a maximum of 6 decimal places (microseconds). Would I have to reduce the 7 decimal places to 6 for Postgres to read the records in, or will Postgres automatically convert the 7 to 6 as it reads the tuples?
pgsql=# SELECT '2016-07-10 20:12:21.8372949999+02:30'::timestamp with time zone AS ts;
              ts
-------------------------------
 2016-07-10 17:42:21.837295+00
(1 row)
It seems that at least in PostgreSQL 9.4 and up (maybe earlier), minutes timezone offset handling is not documented, but does get processed properly if used. In a similar vein, if I try to read in a timestamp that has 7 decimal place precision in the seconds, then it will automatically convert that to 6 decimal place (microsecond) precision instead.

Storing date as time in millis

I am trying to represent date objects in a data store without the hassle of the Date object in Java. So I thought of using just a time in milliseconds and storing the UTC time zone offset as well. I thought about using simple shift routines to combine everything in a single long, as the time zone is just 5 bits (+/-12).
Can someone see any problem with this? What other compact storage schemes (other than textual representation) of date exist and how do they compare to this?
I think you're undervaluing granularity in your time zone and overvaluing the need for bits in the timestamp.
A long has 8 bytes for this purpose.
Let's say you allow yourself 2 bytes for the time zone. That leaves you with 6 for the timestamp. 6*8 = 48 bits for a timestamp.
The largest number a 48-bit unsigned integer can hold is 281474976710655.
Divide by 1000 to get from milliseconds to seconds: 281474976710
Punch that number into an epoch converter: 10889-08-02T05:31:50+00:00
That's the year 10,889 when we're in 2,015.
Just use 2 bytes for the timezone. You've got the space. That will easily allow you to represent the timezone as minutes offset +-24 hours. And since it's whole bytes, the packing code will be simpler to comprehend.
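A sketch of that layout, assuming the timestamp occupies the upper 48 bits and the offset (in minutes, as a signed 16-bit value) the lower 16 bits. The bit order and the function names are assumptions; the answer only fixes the sizes. It is written in Python for brevity, and the same shifts and masks carry over to a Java long.

```python
def pack_instant(millis: int, offset_minutes: int) -> int:
    """Combine epoch milliseconds (48 bits) and a zone offset (16 bits)."""
    if not 0 <= millis < 1 << 48:
        raise ValueError("millis must fit into 48 unsigned bits")
    if not -32768 <= offset_minutes <= 32767:
        raise ValueError("offset_minutes must fit into 16 signed bits")
    # Mask the offset so a negative value becomes its two's-complement form.
    return (millis << 16) | (offset_minutes & 0xFFFF)


def unpack_instant(packed: int) -> tuple[int, int]:
    """Split a packed value back into (millis, offset_minutes)."""
    millis = packed >> 16
    raw = packed & 0xFFFF
    # Undo the two's-complement encoding of the 16-bit offset.
    offset_minutes = raw - 0x10000 if raw & 0x8000 else raw
    return millis, offset_minutes


packed = pack_instant(1_446_000_000_000, -330)
print(unpack_instant(packed))  # (1446000000000, -330)
```

Putting the timestamp in the high bits keeps packed values ordered by time when compared as unsigned integers.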

What does a negative integer Date value mean in MongoDB?

I've discovered an issue with some of the data being stored in MongoDB. We have a field that stores a Date, and normally this includes values like ISODate("1992-08-30T00:00:00.000Z") or ISODate("1963-08-15T00:00:00.000Z"). That's nice and straightforward; I can easily look at those dates and see August 30, 1992 or August 15, 1963.
However, I've noticed a couple of entries where the date looks something like this instead:
Date(-61712668800000)
I'm honestly not sure how the data got persisted that way in the first place, as it should have been stored the former way. And I'll have to address the software bug with my code that is intermittently causing it to be stored that way.
However, the bigger problem is what to do with data entries that look like that. I'm not even sure what date that was supposed to be. My first assumption is that it's just milliseconds, like a UNIX timestamp or something, but that's not right. Even if I flip the negative sign and remove some of the trailing zeros, that still ends up being a date way in the future (e.g. July 23, 2165), and that's not correct. It should be a date in the past.
And the other big problem is that I'm not sure how to even search for this in the database. I can't utilize a $type query because the type is still 9 (i.e. it still thinks it's a "Date").
Has anyone else encountered these weird date entries before? How can I find them? And how can I recover the actual date from them?
The problem seems to be that your code is storing dates prior to the epoch, which are furthermore so far into the past that they cannot be represented using an ISODate wrapper:
As per the documentation
(emphasis added)
Date
BSON Date is a 64-bit integer that represents the number of
milliseconds since the Unix epoch (Jan 1, 1970). This results in a
representable date range of about 290 million years into the past and
future.
The official BSON specification refers to the BSON Date type as the
UTC datetime.
Changed in version 2.0: BSON Date type is signed. [2] Negative values
represent dates before 1970.
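Since a BSON Date is a signed count of milliseconds since the Unix epoch, the value from the question can be decoded by adding it to the epoch. A minimal sketch in Python follows; the function name is hypothetical, and Python's datetime uses the proleptic Gregorian calendar and only covers years 1 through 9999.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bson_millis_to_datetime(millis: int) -> datetime:
    """Interpret a signed millisecond count relative to the Unix epoch."""
    return EPOCH + timedelta(milliseconds=millis)


print(bson_millis_to_datetime(-61712668800000).isoformat())
# 0014-05-28T00:00:00+00:00
```

The value lands in the year 14, which might point to a two-digit year reaching the database without being expanded, though that is only a guess about the original bug.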
Although it is not explicitly stated in the Mongo documentation, it appears that they are following a strict interpretation of the ISO 8601 standard and not one of the variants which are allowed "by trading partner agreement", based on what I found on Wikipedia:
Years
YYYY ±YYYYY ISO 8601 prescribes, as a minimum, a
four-digit year [YYYY] to avoid the year 2000 problem. It therefore
represents years from 0000 to 9999, year 0000 being equal to 1 BC and
all others AD. However, years prior to 1583 are not automatically
allowed by the standard. Instead "values in the range [0000] through
[1582] shall only be used by mutual agreement of the partners in
information interchange."[9]
To represent years before 0000 or after 9999, the standard also
permits the expansion of the year representation but only by prior
agreement between the sender and the receiver.[10] An expanded year
representation [±YYYYY] must have an agreed-upon number of extra year
digits beyond the four-digit minimum, and it must be prefixed with a +
or − sign[11] instead of the more common AD or BC (or the less widely
used BCE/CE) notation; by convention 1 BC is labelled +0000, 2 BC is
labeled -0001, and so on.[12]
If you read through the rest of the article you will also see that the reason the number of digits must be pre-defined is so that the date can be stored unambiguously without using separator characters such as "-" between the components.