Calculate time between steps in PostgreSQL

I have an audit table that tracks the steps in a process, and I need to track the time for each step. It used to be stored in MS SQL Server but is now stored in PostgreSQL. I have the query from SQL Server but have not been successful at converting it. Here is the MS SQL working example: http://www.sqlfiddle.com/#!18/1b423/1
Here are the rules:
- The steps are not required to be sequential, so step 1 can happen after step 5.
- The records for an order are not stored sequentially by step or order; they are intermixed with other orders based upon the Time Entered.
- The sample data being ordered by Order Number and then New is NOT normal and cannot be depended upon.
- Each step can be repeated for any given order; if a step is repeated for an order, sum the times by step.
- The starting step record always has NULL in the Old column.
- The time for the starting step is calculated as the difference between when the step appears in the New column and when it appears in the Old column for a given order.
- For steps that the order never came out of, the time is computed up to the present moment.
- A step can be repeated many times, and I am only looking for the total time spent in each step.
I cannot get the date difference to sum or handle the NULL Old value for the first step. I get various forms of this error when running the following SQL.
ERROR:  function isnull(timestamp without time zone, timestamp with time zone) does not exist
LINE 4: sum(a1.timeentered - isnull(a2.timeentered,now())) as "tota...
                             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
SELECT
a1.ordernumber,
a1."new" AS "Step",
sum(a1.timeentered - isnull(a2.timeentered,now())) as "total time"
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1."new" = a2."old" AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1."new"
ORDER BY
a1.ordernumber ASC
Here is the sample data as well as a link to a sample online: http://www.sqlfiddle.com/#!17/e6fd5a
Old New Time Entered Order Number
NULL Step 1 4/30/12 10:43 1C2014A
Step 1 Step 2 5/2/12 10:17 1C2014A
Step 2 Step 3 5/2/12 10:28 1C2014A
Step 3 Step 4 5/2/12 11:14 1C2014A
Step 4 Step 5 5/2/12 11:19 1C2014A
Step 5 Step 9 5/3/12 11:23 1C2014A
NULL Step 1 5/18/12 15:49 1C2014B
Step 1 Step 2 5/21/12 9:21 1C2014B
Step 2 Step 3 5/21/12 9:34 1C2014B
Step 3 Step 4 5/21/12 10:08 1C2014B
Step 4 Step 5 5/21/12 10:09 1C2014B
Step 5 Step 6 5/21/12 16:27 1C2014B
Step 6 Step 9 5/21/12 18:07 1C2014B
NULL Step 1 6/12/12 10:28 1C2014C
Step 1 Step 2 6/13/12 8:36 1C2014C
Step 2 Step 3 6/13/12 9:05 1C2014C
Step 3 Step 4 6/13/12 10:28 1C2014C
Step 4 Step 6 6/13/12 10:50 1C2014C
Step 6 Step 8 6/13/12 12:14 1C2014C
Step 8 Step 4 6/13/12 15:13 1C2014C
Step 4 Step 5 6/13/12 15:23 1C2014C
Step 5 Step 8 6/13/12 15:30 1C2014C
Step 8 Step 9 6/18/12 14:04 1C2014C
This is the expected result:
| OrderNumber | Step | Total Time in Step (seconds) |
|-------------|--------|------------------------------|
| 1C2014A | Step 1 | 171240 |
| 1C2014A | Step 2 | 660 |
| 1C2014A | Step 3 | 2760 |
| 1C2014A | Step 4 | 300 |
| 1C2014A | Step 5 | 86640 |
| 1C2014A | Step 9 | 324902599 |
| 1C2014B | Step 1 | 235920 |
| 1C2014B | Step 2 | 780 |
| 1C2014B | Step 3 | 2040 |
| 1C2014B | Step 4 | 60 |
| 1C2014B | Step 5 | 22680 |
| 1C2014B | Step 6 | 6000 |
| 1C2014B | Step 9 | 323323159 |
| 1C2014C | Step 1 | 79680 |
| 1C2014C | Step 2 | 1740 |
| 1C2014C | Step 3 | 4980 |
| 1C2014C | Step 4 | 3840 |
| 1C2014C | Step 5 | 420 |
| 1C2014C | Step 6 | 5040 |
| 1C2014C | Step 8 | 875160 |
| 1C2014C | Step 9 | 320918539 |
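For readers who cannot open the fiddle links, the schema and the first few rows roughly correspond to the sketch below. Column names are taken from the query above; the types are assumptions, with timeentered as timestamp without time zone per the error message.
create table audittrail (
    "old"        text,
    "new"        text,
    timeentered  timestamp,   -- assumed: timestamp without time zone
    ordernumber  text
);
insert into audittrail ("old", "new", timeentered, ordernumber) values
    (null,     'Step 1', '2012-04-30 10:43', '1C2014A'),
    ('Step 1', 'Step 2', '2012-05-02 10:17', '1C2014A'),
    ('Step 2', 'Step 3', '2012-05-02 10:28', '1C2014A');
    -- ...remaining rows as in the sample data above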

This turned out to be harder than I thought. This is the full query I used. It doesn't have the substitution for the ISNULL function, but it gets most of the way there. I used the extract function from Date/Time Functions; specifically, to get everything in seconds, I used extract(epoch from ...
SELECT
a1.ordernumber,
a1."new" AS "Step",
sum(extract(epoch from a2.timeentered) -
extract(epoch from a1.timeentered)) as total_time
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1.new = a2.old AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1.new
ORDER BY a1.ordernumber ASC
which gives
 ordernumber |  Step  | total_time
-------------+--------+------------
 1C2014A     | Step 1 |     171240
 1C2014A     | Step 2 |        660
 1C2014A     | Step 3 |       2760
 1C2014A     | Step 4 |        300
 1C2014A     | Step 5 |      86640
 1C2014A     | Step 9 |       NULL
 1C2014B     | Step 1 |     235920
 1C2014B     | Step 2 |        780
 1C2014B     | Step 3 |       2040
 1C2014B     | Step 4 |         60
 1C2014B     | Step 5 |      22680
 1C2014B     | Step 6 |       6000
 1C2014B     | Step 9 |       NULL
 1C2014C     | Step 1 |      79680
 1C2014C     | Step 2 |       1740
 1C2014C     | Step 3 |       4980
 1C2014C     | Step 4 |       3840
 1C2014C     | Step 5 |        420
 1C2014C     | Step 6 |       5040
 1C2014C     | Step 8 |     875160
 1C2014C     | Step 9 |       NULL
To me this calculation looks wrong. It makes more sense (for example) that the Step 3 entry for order 1C2014A should have a total_time of 11 minutes, i.e. 660 seconds. To achieve this, swap old and new in the join and swap a1 and a2 in the sum(extract(epoch ...)) expression, giving:
SELECT
a1.ordernumber,
a1.new AS Step,
sum(extract(epoch from a1.timeentered) -
extract(epoch from a2.timeentered)) as total_time
FROM
audittrail AS a1
LEFT JOIN
audittrail AS a2
ON
a1.old = a2.new AND
a1.ordernumber = a2.ordernumber
GROUP BY
a1.ordernumber,
a1.new
ORDER BY a1.ordernumber ASC

Just replace ISNULL and DATEDIFF with the equivalent PostgreSQL expressions (COALESCE and extract(epoch from ...)); only the second line of the query really changes.
select a1.OrderNumber as "OrderNumber", a1.New as "Step",
extract('epoch' from
sum(coalesce(a2.TimeEntered, now()) - a1.TimeEntered))::integer
as "Total Time in Step (seconds)"
from AuditTrail a1
left join AuditTrail a2
on a1.New = a2.Old
and a1.OrderNumber = a2.OrderNumber
group by a1.OrderNumber, a1.New
order by a1.OrderNumber;
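As an aside, a window-function variant avoids the self-join entirely. This is only a sketch, assuming the same table and column names and that each order's audit rows form a single time-ordered chain of transitions; it attributes to each step the time from entering it until the next transition for that order, or until now() for a step the order never left. For orders that revisit a step, the totals can differ from the join-based query above, because the join pairs every entry into a step with every exit from it.
select ordernumber,
       step,
       extract(epoch from sum(time_in_step))::integer as "Total Time in Step (seconds)"
from (
    select ordernumber,
           "new" as step,
           -- time in the step just entered: until the next transition for the
           -- same order, or until now() for a step the order never left
           coalesce(lead(timeentered) over (partition by ordernumber
                                            order by timeentered),
                    now()) - timeentered as time_in_step
    from audittrail
) t
group by ordernumber, step
order by ordernumber, step;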

Related

A Postgres query to get subtraction of a value in a row by the value in the next row

I have a table (mytable) like this:
id | value
=========
1 | 4
2 | 5
3 | 8
4 | 16
5 | 8
...
I need a query that gives me, for each row, the difference between its value and the previous row's value:
id | value | diff
=================
1 | 4 | 4 (4-Null)
2 | 5 | 1 (5-4)
3 | 8 | 3 (8-5)
4 | 16 | 8 (16-8)
5 | 8 | -8 (8-16)
...
Right now I use a Python script to do this, but I guess it would be faster to create a view from this table.
You should use window functions - LAG() in this case:
SELECT id, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff
FROM mytable
ORDER BY id;
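Since the question mentions replacing a Python script with a view, a minimal sketch of that (assuming the table really is called mytable and that 0 is an acceptable fallback for the first row) could be:
create view mytable_diff as
select id,
       value,
       -- LAG() returns NULL for the first row; the three-argument form
       -- LAG(value, 1, 0) falls back to 0 so the first diff comes out as 4 (4 - 0)
       value - lag(value, 1, 0) over (order by id) as diff
from mytable;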

kdb+/q: Fast vector update given a list of keys and values to be updated

Given a list of ids/keys and a set of corresponding values for a constant column:
q)ikeys: 1 2 3 5;
q)ivals: 100 100.5 101.5 99.5;
What is the fastest way to update the `toupd column in the following table such that the rows matching the given ikeys are updated to the new values in ivals, i.e.
q) show tab;
ikeys | `toupd `noupd
------|--------------
1 | 0.5 1
2 | 100.5 2
3 | 500.5 4
4 | 400.5 8
5 | 400.5 16
6 | 600.5 32
7 | 700.5 64
is updated to:
q) show restab;
ikeys | `toupd `noupd
------|--------------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
Furthermore, is there a canonical method with which one could update multiple columns in this manner?
Thanks.
A dot amend is another approach, and one which generalises more easily to more than one column. It can also take advantage of amend-in-place, which is the most memory-efficient approach as it doesn't create a duplicate copy of the table in memory (this assumes the table is a global).
ikeys:1 2 3 5
ivals:100 100.5 101.5 99.5
tab:([ikeys:1+til 7]toupd:.5 100.5 500.5 400.5 400.5 600.5 700.5;noupd:1 2 4 8 16 32 64)
q).[tab;(([]ikeys);`toupd);:;ivals]
ikeys| toupd noupd
-----| -----------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
/amend in place
.[`tab;(([]ikeys);`toupd);:;ivals]
/generalise to two columns
q).[tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
ikeys| toupd noupd
-----| -----------
1 | 100 1000
2 | 100.5 2000
3 | 101.5 3000
4 | 400.5 8
5 | 99.5 4000
6 | 600.5 32
7 | 700.5 64
/you could amend in place here too
.[`tab;(([]ikeys);`toupd`noupd);:;flip(ivals;1000 2000 3000 4000)]
Here are two different ways of doing it.
tab lj ([ikeys] toupd: ivals)
or
m: ikeys
update toupd: ivals from tab where ikeys in m
I'm sure there are plenty more ways. If you want to find out which is fastest for your purpose (and your data), try timing each with \t:1000 yourCodeHere on large tables and see which suits you best.
As for which is the canonical way for multiple columns, I imagine it would be the update, but it's a matter of personal preference, just do whatever is fastest.
A dictionary is also a common method of updating values given a mapping. Indexing the dictionary with the ikeys column gives the new values and then we fill in nulls with the old toupd column values.
q)show d:ikeys!ivals
1| 100
2| 100.5
3| 101.5
5| 99.5
q)update toupd:toupd^d ikeys from tab
ikeys| toupd noupd
-----| -----------
1 | 100 1
2 | 100.5 2
3 | 101.5 4
4 | 400.5 8
5 | 99.5 16
6 | 600.5 32
7 | 700.5 64
It is also worth noting that the update with the where clause is not guaranteed to work in all cases, e.g. if you have more mapping values than appear in your ikeys column:
q)m:ikeys:1 2 3 5 7 11
q)ivals:100 100.5 101.5 99.5 100 100
q)update toupd: ivals from tab where ikeys in m
'length

How would you create a group identifier based on one column, but sorted by another?

I am attempting to create column Group via T-SQL.
If a cluster of accounts are in a row, consider that one group. If the account is seen again lower in the list (cluster or not), then consider it a new group. This seems straightforward, but I cannot seem to see the solution... Below there are three clusters of account 3456, each having a different group number (Groups 1, 4, and 6).
+-------+---------+------+
| Group | Account | Sort |
+-------+---------+------+
| 1 | 3456 | 1 |
| 1 | 3456 | 2 |
| 2 | 9878 | 3 |
| 3 | 5679 | 4 |
| 4 | 3456 | 5 |
| 4 | 3456 | 6 |
| 4 | 3456 | 7 |
| 5 | 1295 | 8 |
| 6 | 3456 | 9 |
+-------+---------+------+
UPDATE: I left this out of the original requirements, but a cluster of accounts could have more than two accounts. I updated the example data to include this scenario.
Here's how I'd do it:
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(3456,8);
--Solution
SELECT [Group] = DENSE_RANK() OVER (ORDER BY grouper.groupID), grouper.Account, grouper.Sort
FROM
(
SELECT t.*, groupID = ROW_NUMBER() OVER (ORDER BY t.sort) +
CASE t.Account WHEN LEAD(t.Account,1) OVER (ORDER BY t.sort) THEN 1 ELSE 0 END
FROM @table AS t
) AS grouper;
Results:
Group Account Sort
------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
6 3456 8
Update based on OP's comment below (2019-05-08)
I spent a couple of days banging my head on how to handle groups of three or more; it was surprisingly difficult, but what I came up with handles bigger clusters and is much better than my first answer. I updated the sample data to include bigger clusters.
Note that I include a UNIQUE constraint on the Sort column - this creates a unique index. You don't need the constraint for this solution to work, but having an index on that column (clustered, nonclustered unique, or just nonclustered) will improve the performance dramatically.
--Sample Data
DECLARE @table TABLE (Account INT, Sort INT UNIQUE);
INSERT @table
VALUES (3456,1),(3456,2),(9878,3),(5679,4),(3456,5),(3456,6),(1295,7),(1295,8),(1295,9),(1295,10),(3456,11);
-- Better solution
WITH Groups AS
(
    -- Grouper is 1 whenever the account differs from the previous row's account;
    -- the three-argument LAG defaults to t.Account, so the very first row gets 0
    SELECT t.*, Grouper =
        CASE t.Account WHEN LAG(t.Account,1,t.Account) OVER (ORDER BY t.Sort) THEN 0 ELSE 1 END
    FROM @table AS t
)
-- A running total of those 1s, in Sort order, numbers the clusters
SELECT [Group] = SUM(sg.Grouper) OVER (ORDER BY sg.Sort)+1, sg.Account, sg.Sort
FROM Groups AS sg;
Results:
Group Account Sort
----------- ----------- -----------
1 3456 1
1 3456 2
2 9878 3
3 5679 4
4 3456 5
4 3456 6
5 1295 7
5 1295 8
5 1295 9
5 1295 10
6 3456 11

PostgreSQL WITH RECURSIVE query to get ordered parent-child chain by a Partition Key

I am having trouble writing a SQL script on PostgreSQL 9.6.6 that orders the steps in a process using the steps' parent-child IDs, grouped/partitioned per process ID. I couldn't find this particular case covered here, so I apologize if I missed it; please point me to an existing solution in the comments if there is one.
The case: I have a table which looks like this:
processID | stepID | parentID
1 1 NULL
1 3 5
1 2 4
1 4 3
1 5 1
2 1 NULL
2 3 5
2 2 4
2 4 3
2 5 1
Now I have to order the steps, starting with the step where parentID is NULL, for each processID.
Note: I cannot simply order by stepID or parentID, as new steps inserted into the process get a higher stepID than the last step in the process (a continuously generated surrogate key).
I have to order the steps for every processID so that I receive the following output:
processID | stepID | parentID
1 1 NULL
1 5 1
1 3 5
1 4 3
1 2 4
2 1 NULL
2 5 1
2 3 5
2 4 3
2 2 4
I tried to do this with a recursive CTE (WITH RECURSIVE):
WITH RECURSIVE
starting (processID,stepID, parentID) AS
(
SELECT b.processID,b.stepID, b.parentID
FROM process b
WHERE b.parentID ISNULL
),
descendants (processID,stepID, parentID) AS
(
SELECT b.processID,b.stepID, b.stepparentID
FROM starting b
UNION ALL
SELECT b.processID,b.stepID, b.parentID
FROM process b
JOIN descendants AS c ON b.parentID = c.stepID
)
SELECT * FROM descendants
The result is not what I am searching for. As we have hundreds of processes, I receive a list where the first records are the different processIDs which have a NULL value as parentID.
I guess I have to make the recursion run per processID as well, but I have no idea how.
Thank you for your help!
You should calculate the level of each step:
with recursive starting as (
select processid, stepid, parentid, 0 as level
from process
where parentid is null
union all
select p.processid, p.stepid, p.parentid, level+ 1
from starting s
join process p on s.stepid = p.parentid and s.processid = p.processid
)
select *
from starting
order by processid, level
processid | stepid | parentid | level
-----------+--------+----------+-------
1 | 1 | | 0
1 | 5 | 1 | 1
1 | 3 | 5 | 2
1 | 4 | 3 | 3
1 | 2 | 4 | 4
2 | 1 | | 0
2 | 5 | 1 | 1
2 | 3 | 5 | 2
2 | 4 | 3 | 3
2 | 2 | 4 | 4
(10 rows)
Of course, you can skip the last column in the final select if you do not need it.
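For example, keeping the same recursive CTE, the final SELECT can drop the level column while still using it to order:
select processid, stepid, parentid
from starting
order by processid, level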

How to count rows using a variable date range provided by a table in PostgreSQL

I suspect I require some sort of windowing function to do this. I have the following item data as an example:
count | date
------+-----------
3 | 2017-09-15
9 | 2017-09-18
2 | 2017-09-19
6 | 2017-09-20
3 | 2017-09-21
So, first off, there are gaps in my data. I also have another query:
select until_date, until_date - (lag(until_date) over ()) as delta_days from ranges
which generates the following data:
until_date | delta_days
-----------+-----------
2017-09-08 |
2017-09-11 | 3
2017-09-13 | 2
2017-09-18 | 5
2017-09-21 | 3
2017-09-22 | 1
So I'd like my final query to produce this result:
start_date | ending_date | total_items
-----------+-------------+--------------
2017-09-08 | 2017-09-10 | 0
2017-09-11 | 2017-09-12 | 0
2017-09-13 | 2017-09-17 | 3
2017-09-18 | 2017-09-20 | 15
2017-09-21 | 2017-09-22 | 3
Which tells me the total count of items from the first table, per day, based on the custom ranges from the second table.
In this particular example, I would be summing up total_items BETWEEN start AND end (since there would be overlap on the dates, I'd subtract 1 from the end date to not count duplicates)
Anyone know how to do this?
Thanks!
Use the daterange type. Note that you do not have to calculate delta_days; just convert the ranges to dateranges and use the operator <# ("element is contained by").
with counts(count, date) as (
values
(3, '2017-09-15'::date),
(9, '2017-09-18'),
(2, '2017-09-19'),
(6, '2017-09-20'),
(3, '2017-09-21')
),
ranges (until_date) as (
values
('2017-09-08'::date),
('2017-09-11'),
('2017-09-13'),
('2017-09-18'),
('2017-09-21'),
('2017-09-22')
)
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <# daterange
where not lower_inf(daterange)
group by 1
order by 1;
daterange | total_items
-------------------------+-------------
[2017-09-08,2017-09-11) | 0
[2017-09-11,2017-09-13) | 0
[2017-09-13,2017-09-18) | 3
[2017-09-18,2017-09-21) | 17
[2017-09-21,2017-09-22) | 3
(5 rows)
Note that in the dateranges above the lower bounds are inclusive while the upper bounds are exclusive.
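For reference, the three-argument daterange constructor spells the bounds out explicitly; '[)' (the default used above) means the lower bound is included and the upper bound excluded:
select daterange('2017-09-08', '2017-09-11', '[)') @> date '2017-09-08' as contains_lower,  -- true
       daterange('2017-09-08', '2017-09-11', '[)') @> date '2017-09-11' as contains_upper;  -- false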
If you want to calculate items per day in the dateranges:
select
daterange, total_items,
round(total_items::dec/(upper(daterange)- lower(daterange)), 2) as items_per_day
from (
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <# daterange
where not lower_inf(daterange)
group by 1
) s
order by 1
daterange | total_items | items_per_day
-------------------------+-------------+---------------
[2017-09-08,2017-09-11) | 0 | 0.00
[2017-09-11,2017-09-13) | 0 | 0.00
[2017-09-13,2017-09-18) | 3 | 0.60
[2017-09-18,2017-09-21) | 17 | 5.67
[2017-09-21,2017-09-22) | 3 | 3.00
(5 rows)