Creating date cutoffs on continuous processes - PostgreSQL

I'm an intern building a database of CNC machine metrics and I'm a little stuck with a particular query. I only started SQL last week, so please forgive me if this is a dumb question.
If a machine is still running (state = on) past 23:59 on a given date and I want to collect machine hours for that day, there is no logged off time, because the state = off row has not been recorded yet, so I cannot report that machine's hours. To work around this, I want to record a state-off time of 23:59:59 and then create a new row with the same entity ID and a state-on time of 00:00:01 on the following day.
Here is what I have written so far; where am I going wrong? What combination of trigger, insert, procedure, CASE, etc. should I use? Any suggestions are welcome. I've tried to look at some reference material, and I want the first bit to look something like this:
CASE
WHEN min(stoff.last_changed) IS NULL
AND now() = '____-__-__ 23:59:59.062538+13'
THEN min(stoff.last_changed) IS now()
ELSE min(stoff.last_changed)
END
I know this is only the first component, but it fits into a larger SELECT used within a view. Let me know if I need to post anything else.

This is a fairly complex query because there are a few possibilities you need to consider (given the physical setup of the CNC machine, both may not apply):
The machine might be running at midnight so you need to 'insert' a midnight start time (and treat midnight as the stop time for the previous day).
The machine might be running all day (so there are no records in your table for the day at all)
There are a number of ways you can implement this; I have chosen one that I think will be easiest for you to understand (and have avoided window functions). The response does use CTEs, but I think these are easy enough to understand (it's really just a chain of queries, each one using the result of the previous one).
Let's set up some example data:
create table states (
entity_id int,
last_changed timestamp,
state bool
);
insert into states(entity_id, last_changed, state) values
(1, '2019-11-26 01:00', false),
(1, '2019-11-26 20:00', true),
(1, '2019-11-27 01:00', false),
(1, '2019-11-27 02:00', true),
(1, '2019-11-27 22:00', false);
Now the query (it's not as bad as it looks!):
-- Let's start with a range of dates that you want the report for (needed because it's possible that there are
-- no entries at all in the states table for a particular date)
with date_range as (
select i::date from generate_series('2019-11-26', '2019-11-29', '1 day'::interval) i
),
-- Get all combinations of machine (entity_id) and date
allMachineDays as (
select distinct states.entity_id, date_range.i as started
from
states cross join date_range),
-- Work out what the state at the start of each day was (if no earlier state available then assume false)
stateAtStartOfDay as (
select
entity_id, started,
COALESCE(
(
select state
from states
where states.entity_id = allMachineDays.entity_id
and states.last_changed<=allMachineDays.started
order by states.last_changed desc limit 1
)
,false) as state
from allMachineDays
)
,
-- Now we can add in the state at the start of each day to the other state changes
statesIncludingStartOfDay as (
select * from stateAtStartOfDay
union
select * from states
),
-- Next we add the time that each state ended (i.e. the time of the next change)
statesWithEnd as (
select
entity_id,
state,
started,
(
select started from statesIncludingStartOfDay substate
where
substate.entity_id = states.entity_id and
substate.started > states.started
order by started asc
limit 1
) as ended
from
statesIncludingStartOfDay states
)
-- finally, let's work out the duration
select
entity_id,
state,
started,
ended,
ended - started as duration
from
statesWithEnd states
where
ended is not null -- cut off the last midnight as it's no longer needed
order by entity_id,started
Hopefully this makes sense and there are no errors in my logic! Note that I have made some assumptions about your data (i.e. last_changed is the time that the state began). If you just want a runtime for each day it's pretty easy to just add a group by to the last query and sum up duration.
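For example, a rough sketch of that per-day total (replacing the final select above, and assuming you only want to count the intervals where state is true):
select
entity_id,
started::date as day,
sum(ended - started) as runtime
from
statesWithEnd states
where
ended is not null
and state -- only count intervals where the machine was on
group by entity_id, started::date
order by entity_id, day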
It might help you to understand this if you run it one step at a time; for example, start with the following and then add in the extra with clauses one at a time:
with date_range as (
select i::date from generate_series('2019-11-26', '2019-11-29', '1 day'::interval) i
)
select * from date_range

Related

Best way to model state changes for point in time queries

I'm working on a system that needs to be able to find the "state" of an item at a particular time in history. The state is binary (either on or off). In this case it's used to decide where to direct a piece of timestamped data (to a particular "keyspace"), based on the timestamp of that data. I'm having a hard time deciding what the best way to model the data is.
Method 1 is to use the tstzrange with state being implied by the bounds of the range:
create extension btree_gist;
create table core.range_director (
range tstzrange,
directee_id text,
keyspace text,
-- allow a directee to be directed to multiple keyspaces at once
exclude using gist (directee_id with =, keyspace with =, range with &&)
);
insert into core.range_director values
('[2021-01-15 00:00:00 -0:00,2021-01-20 00:00:00 -0:00)', 'THING_ID', 'KEYSPACE_1'),
('[2021-01-15 00:00:00 -0:00,)', 'THING_ID', 'KEYSPACE_2');
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-15'::timestamptz;
-- returns KEYSPACE_1 and KEYSPACE_2
select keyspace from core.range_director
where directee_id = 'THING_ID' and range_director.range @> '2021-01-21'::timestamptz;
-- returns KEYSPACE_2
Method 2 is to have explicit state changes:
create table core.status_director (
status_time timestamptz,
status text,
directee_id text,
keyspace text
); -- not sure what pk to use for this method
insert into core.status_director values
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_1'),
('2021-01-20 00:00:00 -0:00','Closed','THING_ID','KEYSPACE_1'),
('2021-01-15 00:00:00 -0:00','Open','THING_ID','KEYSPACE_2');
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-16'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Open KEYSPACE_2:Open
select distinct on(keyspace) keyspace, status from core.status_director
where directee_id = 'THING_ID'
and status_time < '2021-01-21'
order by keyspace, status_time desc;
-- returns KEYSPACE_1:Closed, KEYSPACE_2:Open
-- so, client code has to ensure that it only directs to status=Open keyspaces
Maybe there are other methods that would work as well, but these two seem to make the most sense to me. The benefit of the first method is the really easy query, but the downside is that you have to update rows to close out a state, whereas in the second method you can just insert new states, which seems easier.
The table could conceivably grow into thousands or tens of thousands of rows, but will probably not grow into millions (but does the best method change depending on the expected row count?). I have a couple of similar tables with the same point-in-time "state" queries, so it's really important that I get the model for them right.
My instinct is to go with Method 1, but are there any footguns or performance considerations that I'm not thinking of that would urge the use case towards Method 2 (or another method I haven't considered)?
No footguns with Method 1, just great big huge cannons. With that method, how do you determine the current status? You need to scan each status change and toggle the status for each one, or perhaps use something like count(*) % 2 (odd gives one state, even the other). What happens if any row gets deleted, or data is purged, and you do not know how many state transitions there were? With Method 2 you retrieve the greatest date and directly obtain the status.
For myself I would do Method 3, that being Method 1 + Method 2: I would have a date range for the status and the status value itself. That gives me complex historical analysis, as I have the complete history as well as direct access to the current status at any time.
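A rough illustration of what that combined table could look like (not from the original answer; it reuses the btree_gist extension created above, and the names are made up for the example):
create table core.range_status_director (
range tstzrange,
status text, -- explicit status value, as in Method 2
directee_id text,
keyspace text,
-- still prevent overlapping ranges per directee/keyspace
exclude using gist (directee_id with =, keyspace with =, range with &&)
);
-- current status: the row whose range contains "now"
select keyspace, status
from core.range_status_director
where directee_id = 'THING_ID'
and range @> now();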
So after doing a bunch of research on the topic I found that my case is a variation of a "Valid-Time State Table". See ch. 2 and ch. 5 of Developing Time-Oriented Database Applications in SQL by Richard Snodgrass.
The support for these tables isn't great but it's not terrible either (at least PostgreSQL has tstzranges to work with). Method 1 of my post is largely sufficient - the main wrinkle is enforcing referential integrity between the state table and other tables.
Since PostgreSQL doesn't have native support for these kinds of temporal tables, you have to build referential integrity yourself. There's a bunch of ways to do this, but for anyone in the future looking for some direction, here is an example of what that might look like for a referential query on two bitemporal tables:
create table a (
row_id bigserial, -- to track individual rows
id int,
pov tstzrange, -- period of validity
pop tstzrange -- period of presence
);
create table b (
row_id bigserial,
id int,
pov tstzrange,
pop tstzrange,
a_id int
);
-- are we good?
with each_pov as (
select bool_or(a.pov @> b.pov) as ok
from a
join b on a.id = b.a_id
and upper(a.pop) is null
and upper(b.pop) is null
group by b.pov
) select coalesce(
bool_and(each_pov.ok),
(select count(*) = 0 from b where upper(pop) is null)
) from each_pov;
You can put the query into a constraint trigger on both the main table and the referenced table to get something approaching sequenced referential integrity for the current period of presence.
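A minimal sketch of what that wiring might look like (the function and trigger names are made up for illustration; the body is just the query above, raising an error when the check fails):
create or replace function enforce_b_pov_in_a() returns trigger
language plpgsql as $$
declare
    all_ok boolean;
begin
    with each_pov as (
        select bool_or(a.pov @> b.pov) as ok
        from a
        join b on a.id = b.a_id
            and upper(a.pop) is null
            and upper(b.pop) is null
        group by b.pov
    )
    select coalesce(
        bool_and(each_pov.ok),
        (select count(*) = 0 from b where upper(pop) is null)
    )
    into all_ok
    from each_pov;
    if not all_ok then
        raise exception 'current rows in b are not covered by matching rows in a';
    end if;
    return null;
end;
$$;
create constraint trigger b_pov_in_a
    after insert or update or delete on b
    deferrable initially deferred
    for each row
    execute function enforce_b_pov_in_a();
-- a similar trigger on table a guards the referenced side against updates and deletes
Note that this re-runs the whole check on every modified row; declaring it deferrable at least pushes the work to commit time.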

Proper table to track employee changes over time?

I have been using Python to do this in memory, but I would like to know the proper way to set up an employee mapping table in Postgres.
row_id | employee_id | other_id | other_dimensions | effective_date | expiration_date | is_current
Unique constraint on (employee_id, other_id), so a new row would be inserted whenever there is a change
I would want the expiration date from the previous row to be updated to the new effective_date minus 1 day, and the is_current should be updated to False
Ultimate purpose is to be able to map each employee back accurately on a given date
Would love to hear some best practices so I can move away from my file-based method where I read the whole roster into memory and use pandas to make changes, then truncate the original table and insert the new one.
Here's a general example built using the column names you provided that I think does more or less what you want. Don't treat it as a literal ready-to-run solution, but rather an example of how to make something like this work that you'll have to modify a bit for your own actual use case.
The rough idea is to make an underlying raw table that holds all your data, and establish a view on top of this that gets used for ordinary access. You can still use the raw table to do anything you need to do to or with the data, no matter how complicated, but the view provides more restrictive access for regular use. Rules are put in place on the view to enforce these restrictions and perform the special operations you want. While it doesn't sound like it's significant for your current application, it's important to note that these restrictions can be enforced via PostgreSQL's roles and privileges and the SQL GRANT command.
We start by making the raw table. Since the is_current column is likely to be used for reference a lot, we'll put an index on it. We'll take advantage of PostgreSQL's SERIAL type to manage our raw table's row_id for us. The view doesn't even need to reference the underlying row_id. We'll default the is_current to a True value as we expect most of the time we'll be adding current records, not past ones.
CREATE TABLE raw_employee (
row_id SERIAL PRIMARY KEY,
employee_id INTEGER,
other_id INTEGER,
other_dimensions VARCHAR,
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN DEFAULT TRUE
);
CREATE INDEX employee_is_current_index ON raw_employee (is_current);
Now we define our view. To most of the world this will be the normal way to access employee data. Internally it's a special SELECT run on-demand against the underlying raw_employee table that we've already defined. If we had reason to, we could further refine this view to hide more data (it's already hiding the low-level row_id as mentioned earlier) or display additional data produced either via calculation or relations with other tables.
CREATE OR REPLACE VIEW employee AS
SELECT employee_id, other_id,
other_dimensions, effective_date, expiration_date,
is_current
FROM raw_employee;
Now our rules. We construct these so that whenever someone tries an operation against our view, internally it'll perform an operation against our raw table according to the restrictions we define. First INSERT; it mostly just passes the data through without change, but it has to account for the hidden row_id:
CREATE OR REPLACE RULE employee_insert AS ON INSERT TO employee DO INSTEAD
INSERT INTO raw_employee VALUES (
NEXTVAL('raw_employee_row_id_seq'),
NEW.employee_id, NEW.other_id,
NEW.other_dimensions,
NEW.effective_date, NEW.expiration_date,
NEW.is_current
);
The NEXTVAL part enables us to lean on PostgreSQL for row_id handling. Next is our most complicated one: UPDATE. Per your described intent, it has to match against employee_id, other_id pairs and perform two operations: updating the old record to be no longer current, and inserting a new record with updated dates. You didn't specify how you wanted to manage new expiration dates, so I took a guess. It's easy to change it.
CREATE OR REPLACE RULE employee_update AS ON UPDATE TO employee DO INSTEAD (
UPDATE raw_employee SET is_current = FALSE
WHERE raw_employee.employee_id = OLD.employee_id AND
raw_employee.other_id = OLD.other_id;
INSERT INTO raw_employee VALUES (
NEXTVAL('raw_employee_row_id_seq'),
COALESCE(NEW.employee_id, OLD.employee_id),
COALESCE(NEW.other_id, OLD.other_id),
COALESCE(NEW.other_dimensions, OLD.other_dimensions),
COALESCE(NEW.effective_date, OLD.expiration_date - '1 day'::INTERVAL),
COALESCE(NEW.expiration_date, OLD.expiration_date + '1 year'::INTERVAL),
TRUE
);
);
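As one example of such a change: if you want the superseded row's expiration_date closed to the day before the new effective_date (as described in the question), the first statement of the rule could be adjusted along these lines (an untested sketch):
UPDATE raw_employee SET
    is_current = FALSE,
    expiration_date = NEW.effective_date - 1
WHERE raw_employee.employee_id = OLD.employee_id AND
      raw_employee.other_id = OLD.other_id AND
      raw_employee.is_current;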
The use of COALESCE enables us to update columns that have explicit updates, but keep old values for ones that don't. Finally, we need to make a rule for DELETE. Since you said you want to ensure you can track employee histories, the best way to do this is also the simplest: we just disable it.
CREATE OR REPLACE RULE employee_delete_protect AS
ON DELETE TO employee DO INSTEAD NOTHING;
Now we ought to be able to insert data into our raw table by performing INSERT operations on our view. Here are two sample employees; the first has a few weeks left but the second is about to expire. Note that at this level we don't need to care about the row_id. It's an internal implementation detail of the lower level raw table.
INSERT INTO employee VALUES (
1, 1,
'test', CURRENT_DATE - INTERVAL '1 week', CURRENT_DATE + INTERVAL '3 weeks',
TRUE
);
INSERT INTO employee VALUES (
2, 2,
'another test', CURRENT_DATE - INTERVAL '1 month', CURRENT_DATE,
TRUE
);
The final example is deceptively simple after all the build-up that we've done. It performs an UPDATE operation on the view, and internally it results in an update to the existing employee #2 plus a new entry for employee #2.
UPDATE employee SET expiration_date = CURRENT_DATE + INTERVAL '1 year'
WHERE employee_id = 2 AND other_id = 2;
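And for the original goal of mapping an employee back to their row as of a given date, a point-in-time lookup against the view is just a date-range check. A rough sketch (the date is arbitrary, and it assumes effective_date and expiration_date are inclusive):
SELECT *
FROM employee
WHERE employee_id = 2
  AND effective_date <= DATE '2020-06-01'
  AND (expiration_date >= DATE '2020-06-01' OR expiration_date IS NULL);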
Again I'll stress that this isn't meant to just take and use without modification. There should be enough info here though for you to make something work for your specific case.

Find closest positive value based on multiple criteria

First of all I am still learning sql / postgresql so I am eagerly looking for explanations and thought process / strategy instead of just the raw answer. And I apologize in advance for the potential future misunderstandings or "stupid" questions.
Also, if you know a great site which proposes exercises or challenges to help master SQL / PostgreSQL, I'll take anything :)
I am looking for a way to return the closest value, based on other specific results in the same table.
In the same table, I am tracking different events:
ESESS = End session event. Gives me a new timestamp (ts) every time Georges (id) finishes a session (let's say Georges is using a computer, so end session = shut the computer down)
USD = Amount of money inventory update event. Each time Georges spends/earns money, those 3 columns will give me the new balance (v), as well as his id and the timestamp (ts) when the balance was updated.
What I am trying to get is the balance at the end of each session.
My plan was to return esess.id and usd.v only if (ts.esess - ts.usd) is equal to the smallest positive value.
So some sort of lookup from the ts.usd, where (ts.esess - ts.usd) matches the condition... but I'm struggling with that part.
Here is the query:
SELECT
sessId, moneyV
FROM
(
SELECT
ts as sessTs,
mid as sessId
FROM
table1
WHERE
n='esess'
) as sess
INNER JOIN
(
SELECT
ts as moneyTs,
mid as moneyId,
v as moneyV
FROM
table1
WHERE
n='usd'
)as balance
ON sessId = moneyId
WHERE
sessTs - moneyTs =
(
SELECT
sessTs - moneyTs as timeDiff
FROM
table1
WHERE
sessTs - moneyTs > 0
ORDER BY
timeDiff ASC
LIMIT 1
)
;
So how should I proceed?
Also, I dug around to find answers and found this post in particular, but I did not understand everything and did not manage to make it work properly...
Thanks in advance!
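For reference, the lookup described above ("the latest usd balance at or before each esess timestamp") is commonly written in PostgreSQL with a LATERAL subquery. A rough, untested sketch using the column names from the query above (table1 with columns n, mid, ts, v):
SELECT
    sess.mid AS sessId,
    sess.ts AS sessTs,
    balance.v AS moneyV
FROM table1 AS sess
LEFT JOIN LATERAL (
    SELECT b.v
    FROM table1 AS b
    WHERE b.n = 'usd'
      AND b.mid = sess.mid
      AND b.ts <= sess.ts -- only balance updates at or before the session end
    ORDER BY b.ts DESC    -- the most recent one wins
    LIMIT 1
) AS balance ON TRUE
WHERE sess.n = 'esess';
The LEFT JOIN ... ON TRUE keeps sessions that have no earlier balance row; a plain CROSS JOIN LATERAL would drop them instead.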

TSQL view select optimization when function is present

I have this simple SQL as a source in an SSIS task:
Select * from budgetview
the source is:
CREATE VIEW [dbo].[BudgetView] AS
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
dbo.fn_DateFromJulian(SDIVD) AS Date,
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460
There are NO index recommendations; everything seems optimized.
The function fn_DateFromJulian source is:
CREATE FUNCTION [dbo].[fn_DateFromJulian]
(
    @JulianDate numeric(6,0)
)
RETURNS date
AS
BEGIN
    declare @resultdate date = dateadd(year, @JulianDate/1000, '1900-01-01')
    set @resultdate = dateadd(day, @JulianDate%1000 - 1, @resultdate)
    return @resultdate
END
The problem is that I am waiting around 20 minutes just to get the rows flowing in SSIS...
I am waiting there 20 minutes BEFORE it even gets started.
Are there any suggestions to find the culprit?
My assumption is that the time spent on the view is consumed by calculating the Julian date value. Without seeing the actual query plan, that seems a fair guess based on the articles below.
Rewrite the original function as a table-valued function, as below (I've simply mashed your code together; there are likely opportunities for improvement):
CREATE FUNCTION dbo.fn_DateFromJulianTVF
(
    @JulianDate numeric(6,0)
)
RETURNS TABLE AS
RETURN
(
    SELECT dateadd(day, @JulianDate%1000 - 1, dateadd(year, @JulianDate/1000, CAST('1900-01-01' AS date))) AS JDEDate
)
Usage would be:
CREATE VIEW [dbo].[BudgetView] AS
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
J.JDEDate AS [Date],
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460 AS T
CROSS APPLY
dbo.fn_DateFromJulianTVF(T.SDIVD) AS J
Scalar-valued functions smell like code reuse, but perform like a reused disposable diaper:
https://sql.kiwi/2012/09/compute-scalars-expressions-and-execution-plan-performance.html
http://blogs.lobsterpot.com.au/2011/11/08/when-is-a-sql-function-not-a-function/
Just checking, but am I right to understand that for every unique value of T.SDIVD there will be just one unique result value of the function? In other words, no two different T.SDIVD values will return the same value from the function?
In that case, what is happening here (IMHO) is that you first scan the entire table, calculate the f(SDIVD) value for each and every record, and then send that entire result set through an aggregation (DISTINCT).
Since scalar functions are far from optimal in MSSQL, I'd suggest limiting their use by turning the chain of events around and doing it like this:
CREATE VIEW [dbo].[BudgetView] AS
SELECT /* DISTINCT */
Country,
Company,
Customer,
PrintableItemNumber,
dbo.fn_DateFromJulian(SDIVD) AS Date,
Quantity,
Value,
Weight
FROM (
SELECT DISTINCT Country,
SDCO AS Company,
SDAN8 AS Customer,
SDLITM AS PrintableItemNumber,
SDIVD,
SDPQOR/100.0 AS Quantity,
SDAEXP/100.0 AS Value,
SDITWT/10000.0 AS Weight
FROM dbo.F553460 ) dist_F553460
If you have lots of duplicate records this should improve performance; if you only have a few of them it won't make much of a difference, if any. If you know you have no duplicates at all, you should get rid of the DISTINCT in the first place, as that is what is causing the delay!
Anyway, regarding the function, you can add the following trick:
CREATE FUNCTION [dbo].[fn_DateFromJulian]
(
    @JulianDate numeric(6,0)
)
RETURNS date
WITH SCHEMABINDING
AS
BEGIN
    declare @resultdate date = dateadd(year, @JulianDate/1000, '1900-01-01')
    set @resultdate = dateadd(day, @JulianDate%1000 - 1, @resultdate)
    return @resultdate
END
The WITH SCHEMABINDING causes some internal optimisations that will make its execution slightly faster, YMMV. There are limitations to it, but here it will work nicely.
Edit: removed the 'outer' DISTINCT since it's (likely, cf my first assumption) not needed.

Using DATEDIFF in T-SQL

I am using DATEDIFF in an SQL statement. I am selecting it, and I need to use it in the WHERE clause as well. This statement does not work...
SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM MyTable
WHERE InitialSave <= 10
It gives the message: Invalid column name "InitialSave"
But this statement works fine...
SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM MyTable
WHERE DATEDIFF(ss, BegTime, EndTime) <= 10
The programmer in me says that this is inefficient (seems like I am calling the function twice).
So two questions. Why doesn't the first statement work? Is it inefficient to do it using the second statement?
Note: When I originally wrote this answer I said that an index on one of the columns could create a query that performs better than other answers (and mentioned Dan Fuller's). However, I was not thinking 100% correctly. The fact is, without a computed column or indexed (materialized) view, a full table scan is going to be required, because the two date columns being compared are from the same table!
I believe there is still value in the information below, namely 1) the possibility of improved performance in the right situation, as when the comparison is between columns from different tables, and 2) promoting the habit in SQL developers of following best practice and reshaping their thinking in the right direction.
Making Conditions Sargable
The best practice I'm referring to is one of moving one column to be alone on one side of the comparison operator, like so:
SELECT InitialSave = DateDiff(second, T.BegTime, T.EndTime)
FROM dbo.MyTable T
WHERE T.EndTime <= T.BegTime + '00:00:10'
As I said, this will not avoid a scan on a single table, however, in a situation like this it could make a huge difference:
SELECT InitialSave = DateDiff(second, T.BegTime, T.EndTime)
FROM
dbo.BeginTime B
INNER JOIN dbo.EndTime E
ON B.BeginTime <= E.EndTime
AND B.BeginTime + '00:00:10' > E.EndTime
EndTime is in both conditions now alone on one side of the comparison. Assuming that the BeginTime table has many fewer rows, and the EndTime table has an index on column EndTime, this will perform far, far better than anything using DateDiff(second, B.BeginTime, E.EndTime). It is now sargable, which means there is a valid "search argument"--so as the engine scans the BeginTime table, it can seek into the EndTime table. Careful selection of which column is by itself on one side of the operator is required--it can be worth experimenting by putting BeginTime by itself by doing some algebra to switch to AND B.BeginTime > E.EndTime - '00:00:10'
Precision of DateDiff
I should also point out that DateDiff does not return elapsed time, but instead counts the number of boundaries crossed. If a call to DateDiff using seconds returns 1, this could mean 3 ms elapsed time, or it could mean 1997 ms! This is essentially a precision of +- 1 time unit. For the better precision of +- 1/2 time unit, you would want the following query comparing 0 to EndTime - BegTime:
SELECT DateDiff(second, 0, EndTime - BegTime) AS InitialSave
FROM MyTable
WHERE EndTime <= BegTime + '00:00:10'
This now has a maximum rounding error of only one second total, not two (in effect, a floor() operation). Note that you can only subtract the datetime data type--to subtract a date or a time value you would have to convert to datetime or use other methods to get the better precision (a whole lot of DateAdd, DateDiff and possibly other junk, or perhaps using a higher precision time unit and dividing).
This principle is especially important when counting larger units such as hours, days, or months. A DateDiff of 1 month could be 62 days apart (think July 1, 2013 - Aug 31 2013)!
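A quick illustration of the boundary-counting behaviour (the values are arbitrary, chosen only for the example):
SELECT DATEDIFF(hour, '2013-07-01 08:59:59', '2013-07-01 09:00:01') -- returns 1, though only 2 seconds elapsed
SELECT DATEDIFF(hour, '2013-07-01 08:00:00', '2013-07-01 08:59:59') -- returns 0, though almost a full hour elapsed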
You can't reference a column alias defined in the SELECT list from the WHERE clause, because the SELECT list isn't evaluated until after the WHERE has executed.
You can do this however
select InitialSave from
(SELECT DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM MyTable) aTable
WHERE InitialSave <= 10
As a side note, this essentially moves the DATEDIFF into the WHERE clause in terms of where it's first defined. Using functions on columns in WHERE clauses prevents indexes from being used as efficiently and should be avoided if possible; however, if you've got to use DATEDIFF then you've got to do it!
Beyond making it "work", you need to use an index.
Use a computed column with an index, or a view with an index; otherwise you will table scan. When you get enough rows, you will feel the PAIN of the slow scan!
computed column & index:
ALTER TABLE MyTable ADD
ComputedDate AS DATEDIFF(ss,BegTime, EndTime)
GO
CREATE NONCLUSTERED INDEX IX_MyTable_ComputedDate ON MyTable
(
ComputedDate
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
create a view & index:
CREATE VIEW YourNewView
WITH SCHEMABINDING -- required before a view can be indexed
AS
SELECT
KeyValues -- the table's key column(s)
,DATEDIFF(ss, BegTime, EndTime) AS InitialSave
FROM dbo.MyTable -- indexed views need two-part table names
GO
-- the first index on a view must be unique and clustered
CREATE UNIQUE CLUSTERED INDEX IX_YourNewView
ON YourNewView (InitialSave, KeyValues)
GO
You have to use the function instead of the column alias - it is the same with count(*), etc. PITA.
As an alternative, you can use computed columns.