Postgres function: how to return the first full set of data that occurs after a specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 through 80 every minute as part of our production process.
I am generating a new report (a Postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. one minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at/after this time. However I see that, depending on network activity and industrial computer priorities, the table is not religiously updated at exactly 00:01:00 AM, or any minute for that matter. Updates can occur milliseconds or even seconds later, and take 500 ms to 800 ms to reach the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, but the sequence data is not super critical anyway).
My thinking is it would be more reliable to look for the first complete set of data any time from 00:01:00 AM onwards. So effectively, I have a table that looks a bit like this (an illustrative sample; the column names match the query further down):

binTime      | binMinute | binBinNo
-------------+-----------+---------
00:01:00.662 | 00:01:00  | 1
00:01:00.671 | 00:01:00  | 2
...          | ...       | ...
00:01:01.220 | 00:01:00  | 80
00:02:00.514 | 00:02:00  | 1
...          | ...       | ...
Basically, the above table is typical, but the 1st minute is not guaranteed, and for that matter, I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how to return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of row count in the function, ensuring there are 80 rows for a given minute, but this seems less intuitive. I would like to know for sure that for each row of a given minute, bin 1 is represented, bin 2, bin 3...
Ultimately, a call to this function will supply a min/max date/time, and that period of time will be checked for the first available minute with a full set of bin data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used window functions a few times now, but I'm still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from @klin:
StartTime := DATE_TRUNC('minute', tme1);
EndTime   := DATE_TRUNC('day', tme1) + '23 hours'::interval;

SELECT "BinSequence".*
FROM "BinSequence"
JOIN (
    SELECT "binMinute" AS binminute
    FROM "BinSequence"
    WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
    GROUP BY 1
    HAVING COUNT(DISTINCT "binBinNo") = 80 -- verifies that each and every bin is represented in this minute
    ORDER BY 1
    LIMIT 1 -- keep only the first complete minute
) theseTuplesOnly
ON theseTuplesOnly.binminute = "BinSequence"."binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
ORDER BY "binBinNo";
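For reference, the same filter can be written with a window function instead of a join, as the question anticipated. This is a sketch under the same table and column names, assuming at most one row per bin per minute (StartTime and EndTime are the variables set above):

SELECT "binTime", "binMinute", "binBinNo"
FROM (
    SELECT bs.*,
           COUNT(*) OVER (PARTITION BY "binMinute") AS bins_in_minute -- rows logged in this minute
    FROM "BinSequence" bs
    WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
) t
WHERE bins_in_minute = 80 -- complete minutes only
ORDER BY "binMinute", "binBinNo"
LIMIT 80; -- the 80 rows of the first complete minute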

Use the aggregate function count(*), grouping the data by minute (date_trunc('minute', datestamp) truncates datestamp to a full minute), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1;
minute | count
----------+-------
00:01:00 | 4
00:02:00 | 1
00:03:00 | 3
(3 rows)
If you are not sure that all bins are unique within a given minute, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts aggregated by minute and all columns of the table in a single simple select. If you want to do that, you should join a derived table, use the IN operator, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
    select date_trunc('minute', datestamp) as minute, count(bin)
    from bins
    group by 1
    having count(bin) = 4
) s
on date_trunc('minute', b.datestamp) = s.minute
order by 1;
datestamp | bin | param | count
-----------+-----+-------+-------
00:01:10 | 1 | a | 4
00:01:20 | 2 | b | 4
00:01:30 | 3 | c | 4
00:01:40 | 4 | d | 4
(4 rows)
Note also how HAVING is used to filter the aggregated results in the above query.
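For completeness, the IN variant mentioned above would look like this, a sketch against the same sample table:

select *
from bins
where date_trunc('minute', datestamp) in (
    select date_trunc('minute', datestamp)
    from bins
    group by 1
    having count(bin) = 4
)
order by datestamp;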

Related

How can I count elements satisfying a condition in a group, with PostgreSQL

With this query:
SELECT date_trunc('minute', ts) ts, instrument
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
I am grouping rows by minute, but I would like to generate a boolean value that tells me if, in the group, there is at least one row whose timestamp has seconds < 10 and at least one row whose timestamp has seconds > 50.
In short, something like:
lessThan10 = false
moreThan50 = false
for each row in the one-minute group:
    if row.ts.seconds < 10 then lessThan10 = true
    if row.ts.seconds > 50 then moreThan50 = true
return lessThan10 && moreThan50
What I am trying to achieve is to find out if all the records I aggregate cover the beginning and the end of the minute; it's ok if there are holes here and there, but it's possible the data we capture stops and restarts at second 40 for example and, in that case, I'd like to be able to discard the whole minute.
As the data rate varies quite a lot, I can't check for a minimum number of rows. There may be a better solution to achieve this, so I'm open to that as well.
Use EXTRACT() to get the seconds of the min and max values of ts:
SELECT date_trunc('minute', ts) ts, instrument,
EXTRACT(SECOND FROM MIN(ts)) < 10 lessThan10,
EXTRACT(SECOND FROM MAX(ts)) > 50 moreThan50
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts
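Alternatively, the per-row checks from the pseudocode can be written directly with the bool_or() aggregate. A sketch against the same table; it is equivalent to the MIN/MAX version, because the smallest and largest seconds within the minute decide both flags:

SELECT date_trunc('minute', ts) ts, instrument,
       bool_or(EXTRACT(SECOND FROM ts) < 10) lessThan10, -- true if any row falls in the first 10 seconds
       bool_or(EXTRACT(SECOND FROM ts) > 50) moreThan50  -- true if any row falls in the last 10 seconds
FROM test
GROUP BY date_trunc('minute', ts), instrument
ORDER BY ts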

Best way to maintain an ordered list in PostgreSQL?

Say I have a table called list, where there are items like these (the ids are random uuids):
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 Bar
x 4 Baz
I want to maintain the property that the rank column always goes from 0 to n-1 (n being the number of rows): if a client asks to insert an item with rank = 3, then the pg server should push the current ranks 3 and 4 up to 4 and 5, respectively:
id rank text
--- ----- -----
x 0 Hello
x 1 World
x 2 Foo
x 3 New Item!
x 4 Bar
x 5 Baz
My current strategy is to have a dedicated insertion function add_item(item) that scans through the table, selects the items with rank equal to or greater than that of the item being inserted, and increments those ranks by one. However, I think this approach will run into all sorts of problems, like race conditions.
Is there a more standard practice or more robust approach?
Note: The rank column is completely independent of the rest of the columns, and insertion is not the only operation I need to support. Think of it as the back-end of a sortable to-do list, where the user can add/delete/reorder items on the fly.
Doing verbatim what you suggest might be difficult or not possible at all, but I can suggest a workaround. Maintain a new column ts which stores the time a record is inserted. Then insert the current time along with the rest of the record, i.e.:
id rank text ts
--- ----- ----- --------------------
x 0 Hello 2017-12-01 12:34:23
x 1 World 2017-12-03 04:20:01
x 2 Foo ...
x 3 New Item! 2017-12-12 11:26:32
x 3 Bar 2017-12-10 14:05:43
x 4 Baz ...
Now we can easily generate the ordering you want via a query:
SELECT id, rank, text,
       ROW_NUMBER() OVER (ORDER BY rank, ts DESC) - 1 AS new_rank
FROM yourTable;
This would generate 0 to 5 ranks in the above sample table. The basic idea is to just use the already existing rank column, but to let the timestamp break the tie in ordering should the same rank appear more than once.
You can wrap it up in a function if you think it's worth it:
t=# with u as (
update r set rank = rank + 1 where rank >= 3
)
insert into r values('x',3,'New val!')
;
INSERT 0 1
the result:
t=# select * from r;
id | rank | text
----+------+----------
x | 0 | Hello
x | 1 | World
x | 2 | Foo
x | 3 | New val!
x | 4 | Bar
x | 5 | Baz
(6 rows)
Also worth mentioning: you might run into a concurrency ("race condition") problem on highly loaded systems. The code above is just a sample.
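A sketch of the wrapper function alluded to above, using the same sample table r (the function name and signature are illustrative):

create function add_item(_id text, _rank int, _text text) returns void
language sql as $$
    with u as (
        update r set rank = rank + 1 where rank >= _rank
    )
    insert into r values (_id, _rank, _text);
$$;

-- usage:
-- select add_item('x', 3, 'New val!');

The race-condition caveat above applies to the function as well.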
You can have a “computed rank” which is a double precision and a “displayed rank” which is an integer that is computed using the row_number window function on output.
When a row is inserted that should rank between two rows, compute the new rank as the arithmetic mean of the two ranks.
The advantage is that you don't have to update existing rows.
The down side is that you have to calculate the displayed ranks before you can insert a new row so that you know where to insert it.
This solution (like all the others) is subject to race conditions.
To deal with these, you can either use table locks or serializable transactions.
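A minimal sketch of this scheme, assuming a table list(id, pos, txt) where pos is the stored double precision rank (all names here are illustrative):

-- pos is the stored "computed rank"; the displayed rank is derived on output
create table list (
    id  uuid primary key default gen_random_uuid(), -- needs pgcrypto before PostgreSQL 13
    pos double precision not null,
    txt text not null
);

-- displayed ranks, 0-based
select txt, row_number() over (order by pos) - 1 as rank
from list;

-- insert between the items currently displayed at ranks 2 and 3
-- by taking the arithmetic mean of their pos values
insert into list (pos, txt)
select (p2.pos + p3.pos) / 2, 'New Item!'
from (select pos from list order by pos offset 2 limit 1) p2,
     (select pos from list order by pos offset 3 limit 1) p3;

Repeated halving eventually exhausts the precision of double precision, so an occasional renumbering of pos (e.g. back to 0, 1, 2, ...) is a sensible maintenance step.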
The only way to prevent a race condition would be to lock the table
https://www.postgresql.org/docs/current/sql-lock.html
Of course this would slow you down if there are lots of updates and inserts.
If you can somehow limit the scope of your updates, then you can do a SELECT ... FOR UPDATE on that scope. For example, if the records have a parent_id, you can do a SELECT ... FOR UPDATE on the parent record first, and any other insert that does the same SELECT ... FOR UPDATE will have to wait until your transaction is done.
https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS
Read the section on advisory locks to see if you can use those in your application. They are not enforced by the system so you'll need to be careful of how you write your application.
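For example, a transaction-scoped advisory lock keyed on the list can serialize the shift-and-insert from the earlier answer. A sketch; the lock key is an arbitrary value that every writer of this list must agree on:

begin;
select pg_advisory_xact_lock(hashtext('todo-list')); -- illustrative key; blocks other writers taking the same key
update r set rank = rank + 1 where rank >= 3;
insert into r values ('x', 3, 'New val!');
commit; -- the advisory lock is released automatically at commit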

How to accumulate values tsql

I have to solve a problem and don't know how to do it. I'm using SQL Server 2012.
My data has this schema:
DriverId | BeginDate | EndDate  | NextBeginDate | Rest in Hours | Drive Time in Minutes | Drive KM
---------|-----------|----------|---------------|---------------|-----------------------|--------------
integer  | datetime  | datetime | datetime      | integer       | integer               | decimal(10,3)
Rest in Hours = NextBeginDate - EndDate
Drive Time in Minutes = EndDate - BeginDate
I have to find the first rest >= 36 hours, then:

Do
    Compute how many days there are
    SUM(DriveTime)
    SUM(TotalKM)
    until the next rest >= 36 hours
    If there is no more rest, exit the loop
Loop

Everything from the beginning to the first rest is discarded.
Everything from the last rest to the end is discarded.
I have the data in an Excel sheet you can download from here: Download Excel with data example
I'm sorry for my English. I hope you can understand and help me; thank you in advance.
There are several parts to the query. The first part pulls out the rows where Rest is >= 36 and assigns a row number. The result is stored in a CTE called BigRest.
with BigRest(RowNumber, DriverId, BeginDate, EndDate)
as
(
    select ROW_NUMBER() over(partition by d.DriverId order by d.DriverId, d.BeginDate) RowNumber,
           d.DriverId, d.BeginDate, d.EndDate
    from Drive d
    where d.Rest >= 36
)
Then I assign the row number from BigRest to each row in Drive (which is what I'm calling the table that has all the data in it) based on the BeginDate. So the data is effectively segmented by the days where Rest >= 36. Each segment gets a number called DriveGroup.
;with Grouped(DriverId, BeginDate, EndDate, DriveTime, DriveKM, DriveGroup)
as
(
    select d.DriverId, d.BeginDate, d.EndDate, d.DriveTime, d.DriveKM,
           (select top 1 b.RowNumber
            from BigRest b
            where b.DriverId = d.DriverId and b.BeginDate >= d.BeginDate
            order by b.BeginDate)
    from Drive d
)
Finally, I select the data from Grouped, cross applying it with some aggregate data from itself. We can filter out the rows where the DriveGroup is 1 or null because those represent the beginning and end rows that don't matter (the "do nothing" rows).
select distinct DriverId, MinBeginDate BeginDate, MaxEndDate EndDate, DATEDIFF(D, MinBeginDate, MaxEndDate)+1 Days, DriveTimeSum Drive, DriveKMSum KM
from
(
select g.DriverId, g.BeginDate, g.EndDate, g.DriveGroup, g.DriveTime, c.DriveTimeSum, c.DriveKMSum, c.MinBeginDate, c.MaxEndDate
from Grouped g
cross apply(select SUM(g2.DriveTime) DriveTimeSum,
SUM(g2.DriveKM) DriveKMSum,
MIN(g2.BeginDate) MinBeginDate,
MAX(g2.EndDate) MaxEndDate
from Grouped g2
where g2.DriverId = g.DriverId
and g2.DriveGroup = g.DriveGroup) as c
where g.DriveGroup is not null
and g.DriveGroup > 1
) x
I'd encourage you to look at the results at each step of the query to see what's actually going on.
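Since SQL Server 2012 supports windowed aggregates, the CROSS APPLY step could also be replaced with OVER (PARTITION BY ...) aggregates. A sketch that assumes the Grouped CTE from above is in scope:

select distinct
       g.DriverId,
       min(g.BeginDate) over (partition by g.DriverId, g.DriveGroup) as BeginDate,
       max(g.EndDate)   over (partition by g.DriverId, g.DriveGroup) as EndDate,
       datediff(d,
                min(g.BeginDate) over (partition by g.DriverId, g.DriveGroup),
                max(g.EndDate)   over (partition by g.DriverId, g.DriveGroup)) + 1 as Days,
       sum(g.DriveTime) over (partition by g.DriverId, g.DriveGroup) as Drive,
       sum(g.DriveKM)   over (partition by g.DriverId, g.DriveGroup) as KM
from Grouped g
where g.DriveGroup is not null
  and g.DriveGroup > 1;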

TSQL Cursor Alternative to Speed up my query

Row Status Time
1 Status1 1383264075
2 Status1 1383264195
3 Status1 1383264315
4 Status2 1383264435
5 Status2 1383264555
6 Status2 1383264675
7 Status2 1383264795
8 Status1 1383264915
9 Status3 1383265035
10 Status3 1383265155
11 Status2 1383265275
12 Status3 1383265395
13 Status1 1383265515
14 Status1 1383265535
15 Status2 1383265615
The [Time] column holds POSIX time
I want to be able to calculate the number of seconds a given [Status] is active for within a given time period, without using cursors. If a cursor is the only way then that is fine, as I've already done it that way.
Using the above sample data extract, how do I calculate how long "Status1" has been active for?
That is: subtract Row1.[Time] from Row4.[Time], subtract Row8.[Time] from Row9.[Time], subtract Row13.[Time] from Row15.[Time].
Thank you in advance.
Assuming that each row represents that the specific Status is active from the specified Time until the next row, one would have to somehow calculate the difference between row N and N+1. One way would be to use a nested query:
SELECT SUM(Duration) as Duration
FROM (
SELECT f.Status, s.Time-f.Time as Duration
FROM Table1 f
JOIN Table1 s on s.Row = f.Row+1
WHERE f.Status = 'Status1') a
The solution by @erikxiv will work if the Row values have no gaps. If they do have gaps, you could try the following method:
SELECT
TotalDuration = SUM(next.Time - curr.Time)
FROM
dbo.atable AS curr
CROSS APPLY
(
SELECT TOP (1) Time
FROM dbo.atable
WHERE Row > curr.Row
ORDER BY Row ASC
) AS next
WHERE
curr.Status = 'Status1'
;
For every row matching the specified status, the correlated subquery in the CROSS APPLY clause will fetch the next Time value based on the ascending order of Row. The current row's time is then subtracted from the next row's time and all the differences are added up using SUM().
Please note that in both solutions it is implied that the order of Row values follows the order of Time values. In other words, ORDER BY Row is assumed to be equivalent to ORDER BY Time or, if Time can have duplicates, to ORDER BY Time, Row.
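On SQL Server 2012 and later, LEAD() avoids the correlated subquery entirely. A sketch against the same assumed table:

SELECT
    TotalDuration = SUM(NextTime - [Time])
FROM
(
    SELECT [Status], [Time],
           LEAD([Time]) OVER (ORDER BY [Row]) AS NextTime -- Time of the following row
    FROM dbo.atable
) AS t
WHERE
    [Status] = 'Status1'
    AND NextTime IS NOT NULL -- the last row has no successor
;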

how to use multiple arguments in kdb where query?

I want to select the max elements from a table within the next 5, 10, 30 minutes, etc.
I suspect this is not possible with multiple elements in the where clause.
Using both the normal < and </: is failing. My code/query is below:
select max price from dat where time</: (09:05:00; 09:10:00; 09:30:00)
Any ideas what I am doing wrong here?
The idea is to get the max price for each row within the next 5, 10, 30... minutes of the time in that row, and not just 3 max prices for the entire table.
select max price from dat where time</: time+\:(5 10 30)
This won't work but should give the general idea.
To further clarify, I want to calculate the max price in 5, 10, 30 minute intervals from time[i] of each row of the table. So for each table row, the max price within x+5, x+10, x+30 minutes, where x is the time entry in that row.
You could try something like this:
select c1:max price[where time <09:05:00],c2:max price[where time <09:10:00],c3:max price from dat where time< 09:30:00
You can parameterize this query however you like. So if you have a list of times, l:09:05:00 09:10:00 09:15:00 09:20:00 ..., you can create a function using a functional form of the query above that works for different lengths of l, something like:
q)f:{[t]?[dat;enlist (<;`time;max t);0b;(`$"c",/:string til count t)!flip (max;flip (`price;flip (where;((<),/:`time,/:t))))]}
q)f l
You can extend f to take different functions instead of max, work for different tables etc.
This works but takes a lot of time: for 20k records, ~20 seconds. Too much! Any way to make it faster?
dat: update tmlst: time+\:mtf*60 from dat;
dat[`pxs]: {[x;y] {[x; ts] raze flip raze {[x;y] select min price from x where time<y}[x] each ts }[x; y`tmlst]} [dat] each dat;
This constructs a step dictionary to map the times to your buckets:
q)-1_select max price by(`s#{((neg w),x)!x,w:(type x)$0W}09:05:00 09:10:00 09:30:00)time from dat
You may also be able to abuse wj:
q)wj[{(prev x;x)}09:05:00 09:10:00 09:30:00;`time;([]time:09:05:00 09:10:00 09:30:00);(delete sym from dat;(max;`price))]
If all your buckets are the same size, it's much easier:
q)select max price by 300 xbar time from dat where time<09:30:00 / 300-second (5-min) buckets