redis key scheme for analytics - hash

I want to create analytics using Redis - basic counters per object, per hour/day/week/month/year plus a total.
What Redis data structure would be effective for this, and how can I avoid making many calls to Redis?
Would it be better for each model to have this set of keys:
hash - model:<id>:years => every year has a counter
hash - model:<id>:<year> => every month has a counter
hash - model:<id>:<year>:<month> => every day has a counter
hash - model:<id>:<year>:<month>:<day> => every hour has a counter
If this scheme is correct, how would I chart this data without making many calls to Redis? Would I have to loop over every year in model:<id>:years and fetch its months, then loop over each month, and so on? Or should I grab all fields and their values from all keys in one batch request and process them on the server?

It's better to use a zset for this instead of a hash. Using the timestamp as the score, you will be able to retrieve data for a specific time range.
For a day range you would use model:<id>:<year>:<month>, for an hour range model:<id>:<year>:<month>:<day>, and so on...
Indeed, if the date range spans more than one month (e.g. from January 1st 2014 to March 20th 2014), you will have to retrieve multiple zsets (model:<id>:2014:01, model:<id>:2014:02 and model:<id>:2014:03) and merge the results.
If you really want to cover a date range with a single request, you can always store day-precision data inside model:<id>:<year>. And if you want to handle date ranges spanning multiple years, you only need a single zset, e.g. model:<id>:byDay.
However, please note that storing historical data will increase memory consumption over time, so you should think about data retention from the start. With Redis you can either use EXPIRE on the zsets or handle it yourself with cron jobs.
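
As a rough sketch of the zset idea in Python (assuming redis-py, and assuming each event is stored as its own member with its epoch timestamp as the score; the key name just follows the byDay suggestion above):

import time
import uuid
import redis

r = redis.Redis()

def record_event(model_id, ts=None):
    ts = ts if ts is not None else time.time()
    # One member per event; the score is the timestamp, so any time
    # range can be counted later with a single ZCOUNT.
    r.zadd(f"model:{model_id}:byDay", {uuid.uuid4().hex: ts})

def count_between(model_id, start_ts, end_ts):
    # ZCOUNT returns the number of members whose score falls in the range.
    return r.zcount(f"model:{model_id}:byDay", start_ts, end_ts)

def counts_for_buckets(model_id, buckets):
    # Batch many range queries into one round trip with a pipeline,
    # which addresses the "many calls to Redis" concern in the question.
    pipe = r.pipeline()
    for start_ts, end_ts in buckets:
        pipe.zcount(f"model:{model_id}:byDay", start_ts, end_ts)
    return pipe.execute()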

Related

How to combine rows hours into just one day with MongoDB?

Are you able to use MongoDB to combine rows of data into one row?
I'm using dates with year, month, day and hour, and the data is shown per hour. Is there a way to combine the hourly data into a single per-day figure? I would basically remove the hour column and sum the hourly data into per-day data.
I'm not sure what you mean by "the data is shown per hour" - do you mean it's stored in the database that way?
MongoDB doesn't have rows and columns - the equivalent of a row is a document, and the column equivalent is a field. Unlike in traditional SQL, a field isn't just one piece of information (a string, number/date, boolean, null, etc). It can be more than one piece of data - it can be an array, or a document, or an array of documents, etc.
Anyway, based on the small amount of information I have on your situation, I'd absolutely design the data with the bucket pattern. https://www.mongodb.com/blog/post/building-with-patterns-the-bucket-pattern
You could $unset the 'measurements' array and just keep the sum/count fields if that's what you want.
If your data is already set in stone, then I'd use an aggregation pipeline to group all the documents ('rows') together - the group _id would be year, month, day, and you could sum/count/min/max/etc the data in the group too.
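
For illustration, a minimal pymongo sketch of such a pipeline (the collection and field names here - measurements, value - are made up for the example):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["measurements"]

pipeline = [
    # Group all hourly documents that share the same calendar day.
    {"$group": {
        "_id": {"year": "$year", "month": "$month", "day": "$day"},
        "total": {"$sum": "$value"},   # sum of the hourly values per day
        "count": {"$sum": 1},          # how many hourly rows were merged
    }},
    {"$sort": {"_id": 1}},
]

for day in coll.aggregate(pipeline):
    print(day["_id"], day["total"], day["count"])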

Weekly hour allocation problem in Rails and Postgresql

I have a list of tasks with certain date ranges, and each task is broken into weekly hour chunks of work (e.g. 30 hours from 2018-12-31 to 2019-01-06, and so on, starting from Monday).
The kind of operations I would like to do are
Display all the weekly hours of all the tasks for a list of users
Sum the weekly hours for a user for all his tasks for the week
When the duration of the task is modified, create/destroy the weekly hour chunks.
Would it be more efficient to store these weekly records as
start date/end date/hours,
year/week number/hours
Storing start/end dates probably gives the table more flexibility, as it could potentially store non-week-aligned hours.
Storing week numbers means that, given a date range, creating the weekly chunks is as simple as finding the week numbers of the start and end dates and populating the weeks in between (without converting to date ranges). It also makes validating an update of a week's hours easier: the week number just has to be between 1 and 53.
Wondering if anyone has tried out either option and can give any pointers on their preferred option.
I would probably go for a daterange column.
That gives you the flexibility to have differently sized chunks and allows you to define an exclusion constraint to prevent overlapping ranges.
Finding the row for a given week is still quite simple using the "contains" operator @>, e.g. where the_column @> to_date('2019-24', 'iyyy-iw') finds the row(s) that contain week number 24 in 2019.
The expression to_date('2019-24', 'iyyy-iw') returns the first day (Monday) of the specified week.
Finding all rows that fall between two weeks can also be done; however, constructing the corresponding date range looks a bit ugly. You can either construct an inclusive range with the first and last day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-24', 'iyyy-iw') + 6, '[]')
Or you can create a range with an exclusive upper bound using the next week's first day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-25', 'iyyy-iw'), '[)')
While ranges can be indexed quite efficiently, the required GiST indexes are a bit more expensive to maintain than a B-Tree index on two integer columns.
Another downside of using ranges (if you don't really need the flexibility) is that they take up more space than two integer columns (14 bytes instead of 8, or even 4 with two smallint columns). So if the size of the table is of any concern, your current solution with the year/week columns is more efficient.
"Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date"
If your input is a start and end date to begin with (rather than a "week number"), then I would definitely go for a daterange column. If that start and end date cover more than one week, then you store only one row, rather than multiple rows.
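
To illustrate the daterange option, here is a rough Python/psycopg2 sketch; the task_hours table and its columns are hypothetical, and btree_gist is needed so the exclusion constraint can mix = with &&:

import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE EXTENSION IF NOT EXISTS btree_gist;
        CREATE TABLE IF NOT EXISTS task_hours (
            task_id integer   NOT NULL,
            period  daterange NOT NULL,
            hours   numeric   NOT NULL,
            -- no two chunks of the same task may overlap in time
            EXCLUDE USING gist (task_id WITH =, period WITH &&)
        );
    """)
    # Find the chunk containing ISO week 24 of 2019 for one task.
    cur.execute("""
        SELECT hours
        FROM task_hours
        WHERE task_id = %s
          AND period @> to_date('2019-24', 'iyyy-iw')
    """, (42,))
    print(cur.fetchall())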

Loading date or datetime into date dimension

Let's say I have a date dimension and from my business requirements I know that the most granular I would need to go is to examine the specific day of the month that an event occurred.
The data I am given provides the exact time that an event occurred (YYYY-MM-DD HH:MM:SS). I have two options:
Before loading the data into the date dimension, slice the HH:MM:SS from the date.
Create the time attributes in my date dimension and insert the full date time.
The way I see it, I should go with the option 1. This would remove redundant data and save some space. However, if I go with option 2, should the business requirements ever change or if my manager suddenly wants to be more granular I wouldn't need to modify my original design. Which option is more commonly used? Are there more options that I did not consider?
Update - follow up question
I receive new data every month. If I used a pre-built date dimension with all the dates, would I then need to run my script every month to populate the table with that month's new dates, or would I have a continuous process that inserts one row, that day's date, every day?
I would agree with you and avoid option 2. A standard date dimension table is at the individual date level. If you did need to analyse by time of day, you could create an additional time of day dimension at the level of a second in a single day, and link to that from your fact table.
Your date dimension should be created by script automatically, rather than from the dates that events occurred. This allows you to analyse across a range of events from other facts, and on dates where no events occur, using a standard, prebuilt dimension.
I would also include the full date/time stamp as a column in the fact table, along with the 'DateKey' to the dimension table. This would allow you some visibility/analysis of the timestamp, you would not lose the data, and would still allow you to analyse by the date dimension.
Update - follow up question
Your pre-built date dimension (the standard way of doing it) would usually contain some dates in the future. There's no reason not to, for example, include another 5 years of dates in the table. But if you'd like it to grow gradually over time, you could have a script that runs once a day, once a month, or once a year to add new dates. It's totally up to you! There are many example scripts for building date dimensions - just google "date dimension script". They exist for the language of your choice, e.g. SQL, C#, Power Query, etc.
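
For illustration, a minimal Python sketch of such a script; the column layout here (date_key, full_date, year, month, day, weekday) is a made-up example:

from datetime import date, timedelta

def date_dimension_rows(start: date, end: date):
    d = start
    while d <= end:
        yield {
            "date_key": int(d.strftime("%Y%m%d")),  # surrogate key, e.g. 20190624
            "full_date": d.isoformat(),
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "weekday": d.isoweekday(),  # 1 = Monday .. 7 = Sunday
        }
        d += timedelta(days=1)

# Pre-build several years ahead, as suggested above.
for row in date_dimension_rows(date(2019, 1, 1), date(2024, 12, 31)):
    pass  # insert the row into the dimension table here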

Storing dates as nodes in Neo4j

I'm new to Neo4j so maybe I'm just completely wrong on this, but I'll give it a try!
Our data is mostly composed by reservations, users and facilities stored as nodes.
I need to count the total reservations that occurred in a specific time frame, and also the overall income (stored as reservation.income) in that time frame.
I was thinking of overcoming the problem by creating each date as a node; this way I can assign a [:PURCHASED_ON] relationship to all the reservations that occurred on a specific date.
As far as I've understood, creating the date as a node could give me a few pros:
I could split the dd/mm/yyyy date and store the parts as integer properties, so that I could use comparison operators such as > and <
I could create a label for the node representing each month as a word
It should be easier to sum() the income on a day or a month
Basically, I was thinking about doing something like this:
CREATE (d:Day {name: '01/11/2016', day: toInt('01'), month: toInt('11'), year: toInt('2016')})
I have seen that a possible solution could be to create a node for every year, every month (1-12) and every day (1-31), but I think that would terribly complicate the architecture of my graph, since every reservation has an "insert_date" (the day it's created) and then the official "reservation_date" (the day it's due).
Am I onto something here or is it just a waste of time? Thanks!
You may want to look at the GraphAware TimeTree library, as date handling is a complex thing, and this seems to be the logical conclusion of the direction you're going. The TimeTree also has support for finding events attached to your time tree between date ranges, at which point you can perform further operations (counting, summing of income, etc).
There are many date/time functions in the APOC plugin that you should take a look at.
As an example, you can use apoc.date.fields (incorrectly called by the obsolete name apoc.date.fieldsFormatted in the APOC doc) to get the year, month, day to put in your node:
WITH '01/11/2016' AS dateStr
WITH dateStr, apoc.date.fields(dateStr, 'MM/dd/yyyy') AS f
CREATE (d:Day {name: dateStr, day: f.days, month: f.months, year: f.years});
NOTE: The properties in the returned map have names that are oddly plural. I have submitted an issue requesting that the names be made singular.
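
To tie this back to the question, here is a rough sketch (using the Python neo4j driver) of counting reservations and summing income over a date range with the integer day/month/year properties proposed above; the Reservation label, the income property and the connection details are assumptions:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Encode each day as a yyyymmdd integer so > and < comparisons work,
# as the question suggests with its integer day/month/year properties.
query = """
MATCH (r:Reservation)-[:PURCHASED_ON]->(d:Day)
WHERE d.year * 10000 + d.month * 100 + d.day >= $start
  AND d.year * 10000 + d.month * 100 + d.day <= $end
RETURN count(r) AS reservations, sum(r.income) AS income
"""

with driver.session() as session:
    rec = session.run(query, start=20160101, end=20161101).single()
    print(rec["reservations"], rec["income"])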

Setting up indexes for date comparison in DynamoDB

How do I set up an index on a DynamoDB table to compare dates? For example, I have a column in my table called synchronizedAt, and I want my query to fetch all the rows that were never synchronized (i.e. 0) or weren't synchronized in the past 2 weeks, i.e. (new Date().getTime()) - (1000 * 60 * 60 * 24 * 7 * 2)
It depends on the other attributes of your table.
You may use a Hash and Range primary key if the set of hash values is relatively small and stable. In that case you could filter the dates by putting them in the Range key, but queries still have to specify a hash value as well, so it may or may not make sense to pre-query all the hash values and then loop over them, asking for the interesting range for each one.
An alternative could be a Hash and Range GSI. In this case you might put a fixed dummy value in the hash key, so that you can query the range across all items at once.
Lastly, there is the less efficient Scan, which becomes a problem with large tables (the larger the table, the longer a Scan takes to complete).
I had a similar requirement to query on a date range; in my case the date range was the only criterion. The issue with DynamoDB is that you cannot create an index with just a range key: it always requires a hash key, and a Query on such an index always expects an equality condition on the hash key.
So I tricked the DB. I created a key called Century and populated it with the century digits of the date. For example, for 1 Jan 2019 the century value is 20; for 1 Jan 2020 it is also 20. It is very easy to derive from any date. Then I created a GSI with Century as the hash key and the date as the range key. While querying, it is easy to derive the century from the date range and build the query condition: hash key equal to the century, plus the date range. Since I am dealing with data spanning no more than 5 years, the trick won't fail for the next 75 years. :)
It is not the nicest workaround, but it works quite well for me. Maybe it will help someone else as well.
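
As an illustration of that trick, a minimal boto3 sketch; the table, index and attribute names are made up, and it assumes the date is stored as an ISO yyyy-mm-dd string so lexicographic order matches date order:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("MyTable")

def items_synced_between(start_date: str, end_date: str):
    # The near-constant hash key ("20" for all 20xx dates) lets us use
    # Query instead of Scan while filtering on the date range key.
    resp = table.query(
        IndexName="century-date-index",
        KeyConditionExpression=Key("century").eq(start_date[:2])
                               & Key("syncDate").between(start_date, end_date),
    )
    return resp["Items"]

items = items_synced_between("2019-01-01", "2019-01-14")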