Weekly hour allocation problem in Rails and PostgreSQL

Suppose I have a list of tasks, each with a certain date range, where each task is broken into weekly hour chunks of work (e.g. 30 hours from 2018-12-31 to 2019-01-06, etc., with weeks starting on Monday).
The kinds of operations I would like to do are:
Display all the weekly hours of all the tasks for a list of users
Sum the weekly hours for a user for all his tasks for the week
When the duration of the task is modified, create/destroy the weekly hour chunks.
Would it be more efficient to store these weekly records as
start date/end date/hours,
year/week number/hours
Storing a start/end date probably gives more flexibility to the table, as it could potentially store non-week-aligned hours.
Storing the week number means that, given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date, and populating the weeks in between (without converting to date ranges). It also makes validation easier when updating the hours for a week: the week number just has to be 1-53.
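For illustration, a minimal sketch of that week-populating step in SQL (assuming the range starts on a Monday, as in the example above; extract(isoyear) keeps year boundaries correct):
SELECT extract(isoyear FROM d)::int AS year,
       extract(week FROM d)::int    AS week
FROM generate_series(date '2018-12-31', date '2019-01-20',
                     interval '1 week') AS g(d);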
Wondering if anyone has tried out either option and can give any pointers on their preferred option.

I would probably go for a daterange column.
That gives you the flexibility to have differently sized chunks and allows you to define an exclusion constraint to prevent overlapping ranges.
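For reference, a minimal sketch of such a table (names are hypothetical; the btree_gist extension is needed so the constraint can mix equality on a scalar column with range overlap):
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE task_hours (
    task_id integer   NOT NULL,
    period  daterange NOT NULL,
    hours   numeric   NOT NULL,
    -- no two chunks of the same task may overlap
    EXCLUDE USING gist (task_id WITH =, period WITH &&)
);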
Finding the row for a given week is still quite simple using the "contains" operator @>, e.g. where the_column @> to_date('2019-24', 'iyyy-iw') finds the row(s) that contain week number 24 in 2019.
The expression to_date('2019-24', 'iyyy-iw') returns the first day (Monday) of the specified week.
Finding all rows that lie between two weeks can also be done; however, constructing the corresponding date range looks a bit ugly. You can either construct an inclusive range with the first and last day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-24', 'iyyy-iw') + 6, '[]')
Or you can create a range with an exclusive upper range with the next week's first day: daterange(to_date('2019-24', 'iyyy-iw'), to_date('2019-25', 'iyyy-iw'), '[)')
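Putting that together with the hypothetical table from above (the task_id value is made up for the example; && is the range "overlaps" operator):
-- hours of the chunk covering ISO week 24 of 2019
SELECT hours
FROM   task_hours
WHERE  task_id = 42
AND    period @> to_date('2019-24', 'iyyy-iw');

-- all chunks overlapping weeks 24 and 25 of 2019
SELECT *
FROM   task_hours
WHERE  task_id = 42
AND    period && daterange(to_date('2019-24', 'iyyy-iw'),
                           to_date('2019-26', 'iyyy-iw'), '[)');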
While ranges can be indexed quite efficiently, the required GiST indexes are a bit more expensive to maintain than a B-tree index on two integer columns.
Another downside of using ranges (if you don't really need the flexibility) is that they take up more space than two integer columns (14 bytes instead of 8, or even 4 with two smallint columns). So if the size of the table is of any concern, then your current solution with the year/week columns is more efficient.
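For comparison, the compact year/week variant might look like this (hypothetical names; the CHECK constraint is the simple validation mentioned in the question):
CREATE TABLE weekly_hours (
    task_id integer  NOT NULL,
    year    smallint NOT NULL,
    week    smallint NOT NULL CHECK (week BETWEEN 1 AND 53),
    hours   numeric  NOT NULL,
    PRIMARY KEY (task_id, year, week)  -- a plain B-tree index, no GiST required
);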
"Storing week number means given a date range, creating the weekly chunks is as simple as finding the week number of the start date and the week number of the end date"
If your input is a start and end date to begin with (rather than a "week number"), then I would definitely go for a daterange column. If that start and end date cover more than one week, then you store only one row, rather than multiple rows.

Related

Extract highest date per month from a list of dates

I have a date column which I am trying to query to return only the largest date per month.
What I currently have, albeit very simple, returns 99% of what I am looking for. For example, if I list the column in ascending order, the first entry is 2016-10-17 and the values range up to 2017-10-06.
A point to note is that the last day of every month may not be present in the data, so I'm really just looking to pull back whatever is the "largest" date present for any existing month.
The query I'm running at the moment looks like
SELECT MAX(date_col)
FROM schema_name.table_name
WHERE <condition1>
AND <condition2>
GROUP BY EXTRACT (MONTH FROM date_col)
ORDER BY max;
This does actually return most of what I'm looking for - what I'm actually getting back is
"2016-11-30"
"2016-12-30"
"2017-01-31"
"2017-02-28"
"2017-03-31"
"2017-04-28"
"2017-05-31"
"2017-06-30"
"2017-07-31"
"2017-08-31"
"2017-09-29"
"2017-10-06"
which are indeed the maximal values present for every month in the column. However, the result set doesn't seem to include the maximum date value from October 2016 (the first month's worth of data in the column). There are multiple values in the column for that month, ranging up to 2016-10-31.
If anyone could point out why the max value for this month isn't being returned, I'd much appreciate it.
You are grouping by month (1 to 12) rather than by month and year. Since 2017-10-06 is greater than any day in October 2016, that's what you get for the "October" group.
You should
GROUP BY date_trunc('month', date_col)
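Applied to the query above (schema and conditions kept as placeholders), that would be:
SELECT MAX(date_col)
FROM schema_name.table_name
WHERE <condition1>
AND <condition2>
GROUP BY date_trunc('month', date_col)
ORDER BY max;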

How can I make a database structure in mongodb for saving the date range?

Here I'm saving a date range using Go. Suppose we have to save all the Mondays that fall in the range 1-May-2018 to 14-July-2018.
How will we find all the Mondays in that range using Go? On top of that, say we have set the start_time (8:00 A.M.) and end_time (6:00 P.M.) for the first two coming Mondays, but on the third Monday the schedule changes, with start_time (9:00 A.M.) and end_time (5:00 P.M.). How should I design my database to handle this situation in Go?
Can anybody help me solve this? I made a design on paper and came up with the fields shown below:
Fields for Schedule //Schedule is a collection name
Id (int)
Day (string)
Start_hours (int)
Start_minutes (int)
End_hours (int)
End_minutes (int)
Start_date (timestamp)
End_date (timestamp)
How will I select the Mondays in the selected range, and how will I handle the situation I explained above? Can anybody give me guidance to make this easier? Thank you, and if this is a basic question then I'm really sorry.
I'd do something like this:
Find the first Monday in the date range (see for example How do I get the first Monday of a given month in Go?).
Mondays happen every week, so you can easily find the rest of the dates by adding 7 days until you reach the end date; see the sketch below.
Store all the Monday dates you found, together with the start and end times.
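A minimal Go sketch of the first two steps (the function name is made up; it only uses the standard time package):
package main

import (
	"fmt"
	"time"
)

// mondaysBetween returns every Monday from start to end, inclusive.
func mondaysBetween(start, end time.Time) []time.Time {
	// advance to the first Monday on or after start
	for start.Weekday() != time.Monday {
		start = start.AddDate(0, 0, 1)
	}
	var mondays []time.Time
	for !start.After(end) {
		mondays = append(mondays, start)
		start = start.AddDate(0, 0, 7) // Mondays repeat every 7 days
	}
	return mondays
}

func main() {
	start := time.Date(2018, time.May, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2018, time.July, 14, 0, 0, 0, 0, time.UTC)
	for _, m := range mondaysBetween(start, end) {
		fmt.Println(m.Format("2006-01-02"))
	}
}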
I wouldn't bother with hours and minutes as you can easily get them from the timestamps in Go. Here is the simplified DB structure I would make
Fields for Schedule //Schedule is a collection name
Id (int)
Day (string)
Date (timestamp) // the actual date
Start (timestamp)
End (timestamp)
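In Go that structure might map to something like this (a sketch; the bson tags assume a MongoDB driver such as mgo or mongo-go-driver):
type Schedule struct {
	ID    int       `bson:"id"`
	Day   string    `bson:"day"`
	Date  time.Time `bson:"date"` // the actual date
	Start time.Time `bson:"start"`
	End   time.Time `bson:"end"`
}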
You don't need any more fields. You could get the day of the week (Day (string) in your structure, e.g. Monday) from the Date field too, but I believe that if you want to query the collection by different days, keeping it might speed things up. Be careful if you need to adjust for time zones: if you work with more than one, store everything in UTC, and you may want an extra Timezone field, because a date could be Monday in one zone and Sunday in another.
So the Schedule will hold weekdays and the start and end times for each of them. I'm not sure you need to store the initial date range separately; the Schedule collection holds that range as well, from the first record to the last one. In my mind, I'd initially populate the collection from a given date range, then later on modify it by adding new days or deleting them.
When you query this collection with some start and end range on the Date field, if your first result is more than 7 days after the range start, you are missing one or more entries at the start. If the last result is more than 7 days before the range end, you are missing some entries before the end.
There is nothing Go-specific here; in my opinion Go works well with dates, and you don't need any special date structures in your DB.

OBIEE YTD Issues

I have a fact table housing data at different granularities (date grain):
Monthly
Daily
The monthly data can be accessed by filtering on the end-of-month date or by using the YYYYMM date format. In the OBIEE RPD repository, the fact is set to LAST aggregation.
I want to perform year-to-date analysis, and I want to sum only month-end dates.
Using the TODATE(Measure) function, it tends to sum up all the data throughout the month, e.g.:
Date        Amount  YTD TODATE(Amount)
31/01/2016  100     100
28/02/2016  200     300
14/03/2016  50      350 *
31/03/2016  100     450
I want YTD to ignore the 50 and return 400, and likewise for any other dates that fall mid-month. But if I select 14/03/2016, I want 350 to be returned.
Thanks.
Alter the table to add a flag: something that is Y if the record is at the monthly grain and N if it is not.
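As a rough sketch, the flag could be populated like this (syntax varies by database; names are hypothetical, and LAST_DAY is Oracle-style, matching the end-of-month convention described in the question):
ALTER TABLE fact_table ADD month_flag CHAR(1);

UPDATE fact_table
SET month_flag = CASE
        WHEN date_col = LAST_DAY(date_col) THEN 'Y'  -- month-end row, i.e. monthly grain
        ELSE 'N'
    END;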
In the logical layer, create two distinct LTSs with the first filtering on the flag for Y. This will be where you will calculate and source all your to date measures. The second LTS can either be filtered to N, or can be left to all the data depending on what you want to do with it.
The performance increases should come from the fact that any month measures you build off that monthly LTS will only hit records flagged as month, and will bypass all that other data that is not relevant. So if a user runs a report only asking for monthly measures, the query will automatically filter to that specific data.
What will happen is if a user selects your to date measure and a specific date measure on the same report, OBIEE should fire off two separate queries to get the data and stitch together based on common dimensions.
Could someone create this in the front end? Probably. You would have to use some sort of PERIODROLLING function and tell it to aggregate at the month level, but I am afraid it may still roll those days up into a larger-than-desired number. A TODATE function will not work here.

Postgres - Convert Date Range to Individual Month

I have found help on similar issues, but they were more complex; I am only good with the basics of SQL and am striking out here. I get a handful of columns a,b,c,startdate,enddate and I need to expand that data into multiple rows depending on how many months are within the range.
E.g.: a,b,c,1/1/2015,3/15/2015 would become:
a,b,c,1/1/2015,value_here_doesnt_matter
a,b,c,2/1/2015,value_here_doesnt_matter
a,b,c,3/1/2015,value_here_doesnt_matter
Does not matter if the start date or end date is on a specific day, the only thing that matters is month and year. So if the range included any day in a given month, I'd want to output start days for each month in the range, with the 1st as a default day.
Could I have any advice on where to begin? I'm attempting generate_series, but am unsure whether this is the right approach, or how to make it work while keeping the data in the first few arbitrary columns consistent.
I think generate_series is the way to go. Without knowing what the rest of your data looks like, I would start with something like this (truncating both dates to the first of their months, so a mid-month start date still yields first-of-month rows):
select
  a, b, c,
  generate_series(date_trunc('month', startdate),
                  date_trunc('month', enddate),
                  interval '1 month')::date as month_start
from
  my_table
For the example row, this returns 2015-01-01, 2015-02-01 and 2015-03-01 alongside a, b and c.

Web analytics schema with postgres

I am building a web analytics tool and use PostgreSQL as a database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time   country   browser   num_visits
=====  ========  ========  ==========
0      USA       Chrome    12
0      USA       IE        7
5      France    IE        5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned in queries, I am thinking of having multiple tables with the above schema based on their resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day, I will go to the hour-resolution table, which is smaller than the 5-second-resolution table (although I could have used that one too; it's just more rows to scan).
Now what if the hour-resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period, I could do multiple queries against the different resolution tables, so that I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, etc. AFAIU, I have traded one query against a huge table (with many relevant rows to scan) for multiple queries against medium tables, plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so that I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, etc.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table and many child tables, one per resolution (hourly, and perhaps minute- and second-resolution children as well).
Then you only have to select from the master table, letting each child table's constraint decide which rows belong where.
Of course, you have to add a trigger function to route inserts into the appropriate child tables.
It's complexity at insert time versus complexity at query time.
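A minimal sketch of the inheritance approach (pre-declarative-partitioning style; all names are hypothetical):
CREATE TABLE stats (
    res        interval    NOT NULL,  -- bucket size of the row
    ts         timestamptz NOT NULL,
    country    text,
    browser    text,
    num_visits integer
);

CREATE TABLE stats_5sec  (CHECK (res = interval '5 seconds')) INHERITS (stats);
CREATE TABLE stats_1hour (CHECK (res = interval '1 hour')) INHERITS (stats);

CREATE FUNCTION stats_route() RETURNS trigger AS $$
BEGIN
    IF NEW.res = interval '5 seconds' THEN
        INSERT INTO stats_5sec VALUES (NEW.*);
    ELSIF NEW.res = interval '1 hour' THEN
        INSERT INTO stats_1hour VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no child table for resolution %', NEW.res;
    END IF;
    RETURN NULL;  -- the row has already been routed to a child table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert BEFORE INSERT ON stats
    FOR EACH ROW EXECUTE PROCEDURE stats_route();
With constraint exclusion enabled, a query on stats that filters on res should scan only the matching child table.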