Given a fixed (but non-linear) calendar of future equipment usage, how to calculate the life at a given date in time - tsql

I have a challenge that I'm really struggling with.
I have a table containing a 'scenario' that a user has defined, describing how they will consume 'usage'. For example, how many hours a machine will be turned on for.
In month 1 they will use 300 (hours, stored as an integer in minutes), in month 2 100, in month 3 450, and so on.
I then have a list of tasks which need to be performed at specific intervals of usage. Given the above scenario, how could I forecast when these tasks will be due (the date)? I also need to show repeat accomplishments and their dates.
Each task holds the number of consumed hours at the last point of accomplishment, the interval between accomplishments, and the expected life total the next time it is due (Last Done + Interval = Next Due).
I've tried so many different options. Ideally I want this to be computed at run time, i.e. the only things saved into a permanent table are the forecast and the list of tasks. I have 700-800 scenarios and, given the number of pieces of equipment, 12,000 tasks to be carried out. I need to show at least the next 100 years of tasks.
At the minute, I take the scenario and cross apply a list of dates between now and the year 2118. In the WHERE clause I filter out rows where the period number (the month number of the year) doesn't match the date, then divide the period usage by the number of days in that period. That gives me a day-by-day usage over the next 100 years (~36,000 rows). When I join on the 12,000 tasks and then try to filter where the due-at value matches my dates table, I can't return ANY rows; even selecting just the date column, the query runs for 7-8 minutes.
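For reference, the shape of what I'm doing looks roughly like the sketch below. It's simplified, and the table and column names (ScenarioUsage, Dates, etc.) are made up for illustration; the running total at the end is what I then compare each task's Next Due value against.

-- Simplified sketch only; ScenarioUsage, Dates and their columns are illustrative names.
;WITH DailyUsage AS (
    SELECT
        s.ScenarioId,
        d.CalendarDate,
        -- spread the month's usage evenly across the days of that month
        s.PeriodUsageMinutes * 1.0 / DAY(EOMONTH(d.CalendarDate)) AS DailyUsageMinutes
    FROM dbo.ScenarioUsage AS s
    CROSS JOIN dbo.Dates   AS d                  -- one row per day, today .. 2118
    WHERE s.PeriodNumber = MONTH(d.CalendarDate)
)
SELECT
    ScenarioId,
    CalendarDate,
    SUM(DailyUsageMinutes) OVER (
        PARTITION BY ScenarioId
        ORDER BY CalendarDate
        ROWS UNBOUNDED PRECEDING)                AS CumulativeUsageMinutes
FROM DailyUsage;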
We measure more than just hours of usage too; there are 25 different measurement points, all of which are specified in the scenario.
I can't use linear regression because, let's say, we shut down over the summer and don't utilize the equipment; an average usage over the year then means that tasks come due even though the scenario says we're shut down.
Are there any strategies out there that I could apply? I'm not looking for a 'Here's the SQL' answer, I'm just running out of strategies to form a solid query that can deal with the volume of data I'm dealing with.
I can get a query running perfectly with one task, but it just doesn't scale... at all...
If SQL isn't the answer, then I'm open to suggestions.
Thanks,
Harry

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, so that we can have a real-time view of the sensors' measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data, as sketched below. Can anyone think of a better option? I very much welcome any suggestions and input.
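What I have in mind for the cron job is roughly the following. This is only a minimal sketch: sensor_raw, sensor_agg and their columns are made-up names, not the actual Thingsboard schema.

-- Roll per-second rows older than today up into 5-minute averages, then drop the raw rows.
BEGIN;

INSERT INTO sensor_agg (sensor_id, bucket_start, avg_value)
SELECT
    sensor_id,
    -- truncate each timestamp down to its 5-minute bucket
    to_timestamp(floor(extract(epoch FROM measured_at) / 300) * 300) AS bucket_start,
    avg(value) AS avg_value
FROM sensor_raw
WHERE measured_at < date_trunc('day', now())
GROUP BY sensor_id, bucket_start;

DELETE FROM sensor_raw
WHERE measured_at < date_trunc('day', now());

COMMIT;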

What is the most convenient alternative to store quantity of time elapsed in a database?

Take into account that elapsed time (for example: 1 month, 1 hour and 30 minutes, 45 minutes, 2 weeks, etc.) is a concept called a duration in the moment.js library, or the interval data type in Postgres databases.
Both are abstractions of the same concept with pros and cons for calculations from the backend / frontend.
What would be the best way to store this data:
an integer that stores the amount of milliseconds, or
a field of the Postgres interval type,
so that when working with this data the duration can be interpreted, possibly in the same or different units of time?
It is usually a good idea to use the data type that is designed for the purpose, in this case interval.
The main advantage is that an interval like 1 month or 1 year 1 hour (which does not correspond to a fixed number of seconds) will behave as expected if you add it to or subtract it from a timestamp.
Storing an interval as a numeric data type will only work well if you don't need to represent intervals like the above correctly; but then you can also use an interval like 36173.034 seconds, which will work just as well.
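A quick way to see the difference (the dates here are just an example):

-- '1 month' respects the calendar; a fixed number of seconds does not.
SELECT timestamp '2024-01-31 12:00' + interval '1 month';          -- 2024-02-29 12:00:00
SELECT timestamp '2024-01-31 12:00' + interval '2592000 seconds';  -- 2024-03-01 12:00:00 (30 * 86400 s)

-- Fractional seconds are also fine as an interval.
SELECT interval '36173.034 seconds';                               -- 10:02:53.034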

Writing an Apple Watch Complication that predicts future values and displays time sensitive data

I am in the process of writing an Apple Watch Complication for WatchOS 2. The particular data I am trying to show is given (via web request) in intervals of time ranging from 3-6 minutes. I have a predictive algorithm that can predict what the data values will look like. This presents a problem to me.
Because I want to display the data my predictive algorithm has to offer in time travel, I would like to use getTimelineEntriesForComplication (the version that asks for data after a certain date) to supply the future values that my algorithm believes will be true to the timeline. However, when time moves forward (as it tends to do) and we reach the time that one of these predicted data points was set to occur at, the predicted value is no longer accurate.
For instance, let's say it is 12:00 PM, and I currently have an (accurate) data value of A. The predictive algorithm might predict the following data values for the next two hours:
12:30 PM | B
1:00 PM | C
1:30 PM | D
2:00 PM | E
However, when 12:30 PM actually comes around, the actual data value might be F. In addition, the algorithm will generate a new set of predictions all the way to 2:30 PM. I understand I can use updateTimelineForComplication to indicate that the timeline has to be rebuilt, but I have two problems with this method:
I fear I will exceed the execution time limit rather quickly
updateTimelineForComplication flushes the entire timeline, which seems wasteful to me considering that all the past data is perfectly valid; it's simply the next 4 or so values that need to be updated.
Is there a better way to handle this problem?
At present, there's no way to alter a specific timeline entry, without reloading the entire timeline. You could submit a feature request to Apple.
Summary
To summarize the details that follow, even though your server updates its predictions every 3-6 minutes, the complication server will only update itself at 10 minute intervals, starting at the top of an hour. Reloading the timeline is your only option, as it will guarantee that all your predictions are updated and accurate within 10 minutes.
Specific findings
What I've found in past tests involving extendTimelineForComplication: using the minimum 10-minute update interval, is that the dataSource is asked for 100 entries before and after a sliding window based on the current time.
The sliding window isn't centered on the current time. For watchOS 2.0.1, it appears to be skewed to ask for more recent future entries (after ~14-27 minutes in the future), and less recent past entries (before ~100 minutes in the past).
Reloading is the only way to update any entries that fall within the ~two hour sliding window.
Issues
In my experience, extendTimelineForComplication has been less reliable than reloading the timeline, as a timeline that is never reloaded needs to be trimmed to discard entries. The fewer entries per hour, the less frequently this occurs, but once the timeline cache grows large enough, the SDK appears to aggressively discard entries from the head and tail of the cache, even if those entries fall within the 72-hour time-travel window. The worst I've seen is only being able to time-travel forward 30 entries, instead of 100.
Having provided those details, I wouldn't suggest that anyone try to take advantage of any behaviors that may change in the future.
Daily budget and battery life
As for the daily budget, it sounds more ominous than it is; I think you'd have to do some intense calculations before the complication server cuts you off. Even with ten-minute updates, I never exceeded the budget. The real issue is battery use. You'll find that frequent updates can drain the battery before the day is over. This is probably the most significant reason for Apple's recommendation:
Complications should provide as much data as possible during each update cycle, so specify a date as far into the future as you can manage. Do not ask the system to update your complication within minutes. Provide data to last for many hours or for an entire day.

Core Reporting API Total results found

I want to return a large result-set of Google Analytics data across a two month period.
However, the total results found is not accurate, or not what I expect.
If I narrow the start-date and end-date to a particular day, it returns roughly 40k results, which over a two-month period should mean around 2.4 million records. However, the total results found from the API suggests 350k.
There is some discrepancy and the numbers do not add up when I select a larger date range. I can confirm there is no gap in the GA data over the two-month period.
Would be great if someone has come across this issue and has found a reason for it.
In your query you need to supply a sampling level:
samplingLevel=DEFAULT Optional.
Use this parameter to set the sampling level (i.e. the number of visits used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
•DEFAULT — Returns response with a sample size that balances speed and accuracy.
•FASTER — Returns a fast response with a smaller sample size.
•HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used.
There is no way to completely remove sampling; large requests will still return sampled data even if you have set it to HIGHER_PRECISION. Make your requests smaller, going day by day if you have to.
If you pay for a premium Google Analytics account, you can extract your data into BigQuery and you will have access to unsampled reports.
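For example, once the BigQuery export is in place, an unsampled daily session count over the two-month window is a straightforward query against the exported ga_sessions_ tables (the project and dataset names below are placeholders):

-- One exported table per day: ga_sessions_YYYYMMDD. Project/dataset names are placeholders.
SELECT
  date,
  SUM(totals.visits) AS sessions
FROM `my-project.my_ga_dataset.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20180101' AND '20180228'
GROUP BY date
ORDER BY date;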

Random Lookup Methodology

I have a postgres database with a table that contains rows I want to look up at pseudorandom intervals. Some I want to look up once an hour, some once a day, and some once a week. I would like the lookups to be at pseudorandom intervals inside their time window. So, the look up I want to do once a day should happen at a different time each time that it runs.
I suspect there is an easier way to do this, but here's the rough plan I have:
Have a settings column for each lookup item. When the script starts, it randomizes the epoch time for each lookup and sets it in the settings column, identifying the time for the next lookup. I then run a continuous loop with a 1-second wait to see if the epoch time matches any of the requested lookups. Upon running the lookup, it recalculates when the next lookup should be.
My questions:
Even in the design phase, this looks like it's going to be a duct tape and twine routine. What's the right way to do this?
If, by chance, my idea is the right way to do this, is repeating the loop with a 1-second wait the right way to go? If I had 2 lookups back to back, there's a chance I could miss one, but I can live with that.
Thanks for your help!
Add a column to the table for NextCheckTime. You could use either a timestamp or just an integer with the raw epoch time. Add a (non-unique) index on NextCheckTime.
When you add a row to the database, populate NextCheckTime by taking the current time, adding the base interval, and adding/subtracting a random factor (maybe 25% of the base interval, or whatever is appropriate for your situation). For example:
my $interval = 3600;                                          # base interval: 1 hour in seconds
my $next_check = time + int($interval * (0.75 + rand 0.5));   # base interval +/- 25% jitter
Then in your loop, just SELECT * FROM table ORDER BY NextCheckTime LIMIT 1. Then sleep until the NextCheckTime returned by that (assuming it's not already in the past), perform the lookup, and update NextCheckTime as described above.
If you need to handle rows newly added by some other process, you might put a limit on the sleep. If the NextCheckTime is more than 10 minutes in the future, then sleep 10 minutes and repeat the SELECT to see if any new rows have been added. (Again, the exact limit depends on your situation.)
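In Postgres terms, the schema change and the two queries look roughly like this (lookups and its columns are stand-in names, and the 1-hour base interval is just an example):

-- Stand-in table/column names; adjust to your schema.
ALTER TABLE lookups ADD COLUMN next_check_time timestamptz;
CREATE INDEX lookups_next_check_time_idx ON lookups (next_check_time);

-- Pick the lookup that is due next.
SELECT * FROM lookups ORDER BY next_check_time LIMIT 1;

-- After performing the lookup: reschedule with the base interval +/- 25% jitter.
UPDATE lookups
SET next_check_time = now() + interval '1 hour' * (0.75 + random() * 0.5)
WHERE id = $1;   -- id of the row that was just processed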
How big is your data set? If it's a few thousand rows, then just randomizing the whole list and grabbing the first x rows is OK. As the size of your set grows, this becomes less and less scalable; the performance drops off at a non-linear rate. But if you only need to run this once an hour at most, then it's no big deal if it takes a minute or two, as long as it doesn't kill other processes on the same box.
If you have a gapless sequence, whether there from the beginning or added on, then you can use indexes with something like:
$i = int(rand($sizeofset));              # random index in 0 .. sizeofset - 1
SELECT * FROM table WHERE seqid = $i;
and get good scalability to millions and millions of rows.
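If the set lives entirely in Postgres, the same idea can be expressed as a single query (items is a placeholder table name, with seqid assumed to run from 0 to count - 1):

-- Pick the random id once (the CTE is not inlined because random() is volatile), then fetch by index.
WITH pick AS (
    SELECT floor(random() * (SELECT count(*) FROM items))::int AS seqid
)
SELECT i.*
FROM items AS i
JOIN pick  AS p ON i.seqid = p.seqid;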