How to find end point of same value with interval? - postgresql

I have a table like this below.
ID  Time  State
1   1     "active"
1   2     "active"
1   3     "active"
1   4     "inactive"
2   2     "inactive"
2   3     "active"
3   1     "active"
3   3     "active"
3   4     "inactive"
I want to collapse this into one row per run of the same state, with start and end times.
It probably needs the lag() window function, but I don't know how to find the end point of a run of the same state.
My expected output should look like this:
ID  Start  End   State
1   1      4     "active"
1   4      NULL  "inactive"
2   2      3     "inactive"
2   3      NULL  "active"
3   1      4     "active"
3   4      NULL  "inactive"

demo:db<>fiddle
SELECT DISTINCT ON (sum)    -- 5
    id,
    -- 4
    first_value(time) OVER (PARTITION BY sum ORDER BY time) as start,
    first_value(lead) OVER (PARTITION BY sum ORDER BY time DESC) as "end",
    state
FROM (
    SELECT
        *,
        -- 3
        SUM(CASE WHEN is_prev_state THEN 0 ELSE 1 END) OVER (ORDER BY id, time)
    FROM (
        SELECT
            *,
            -- 1
            lead(time) OVER (PARTITION BY id ORDER BY time),
            -- 2
            state = lag(state) OVER (PARTITION BY id ORDER BY time) as is_prev_state
        FROM states
    ) s
) s
(1) lead() brings the next value onto the current row. So, e.g., the time == 4 (id == 1) goes onto the row with time == 3. The idea is to get the possible end of a group onto the right row.
(2) lag() does the opposite: it brings the previous value onto the current row. With that I can check whether the state has changed: is the current state the same as the previous one?
(3) This line creates the groups for each run of a state: if the state changed, the running sum goes up by one; if not, it keeps the same value (adding 0).
(4) Now I have the possible last value per group (provided by (1)) and can fetch the first value. This is done with the window function first_value(), which gives you the first value of an ordered group. To get the last value you just order the group descending. (Why not use last_value()?)
(5) DISTINCT ON keeps only the very first row of each group generated by the SUM() in (3).
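
If you want to reproduce this locally, a minimal setup could look like the sketch below, assuming the table and column names used in the query (states(id, time, state)); the data matches the question's sample.

CREATE TABLE states (id int, time int, state text);
INSERT INTO states VALUES
    (1, 1, 'active'), (1, 2, 'active'), (1, 3, 'active'), (1, 4, 'inactive'),
    (2, 2, 'inactive'), (2, 3, 'active'),
    (3, 1, 'active'), (3, 3, 'active'), (3, 4, 'inactive');

Running the query above against this table should return one row per state run with its start and end time, matching the expected output.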

Related

how to get last known contiguous value in postgres ltree field?

I have a child table called wbs_numbers. The primary key id is an ltree.
A typical example is:
id             series_id
abc.xyz.00001  1
abc.xyz.00002  1
abc.xyz.00003  1
abc.xyz.00101  1
The parent table is called series. It has a field called last_contigous_max.
Given the above example, I want the series with id 1 to have its last contiguous max be 3.
You can always assume that the wbs ltree is 3 fragments separated by dots, and that the last fragment is always a 5-digit numeric string left-padded with zeros. You can also assume the first child always ends with 00001 and that the theoretical total number of children in a series will never exceed 9999.
If you think of it as gaps and islands, the wbs_numbers within a series will never start with a gap; they will always start with an island.
Meaning to say, this is not possible:
id             series_id
abc.xyz.00010  1
abc.xyz.00011  1
abc.xyz.00012  1
abc.xyz.00101  1
This is possible:
id             series_id
abc.xyz.00001  1
abc.xyz.00004  1
abc.xyz.00005  1
abc.xyz.00051  1
abc.xyz.00052  1
abc.xyz.00100  1
abc.xyz.10001  2
abc.xyz.10002  2
abc.xyz.10003  2
abc.xyz.10051  2
abc.xyz.10052  2
abc.xyz.10100  2
abc.xyz.20001  3
abc.xyz.20002  3
abc.xyz.20003  3
abc.xyz.20004  3
abc.xyz.20052  3
abc.xyz.20100  3
So the last contiguous max in this case is:
for series id 1 => 1
for series id 2 => 3
for series id 3 => 4
What's the query to calculate the last_contigous_max number for any given series_id?
I also don't mind having another table just to store "islands".
Also, you can safely assume that wbs_number records will never be deleted once created. The id in the wbs_numbers table will never be altered once filled in as well.
Meaning to say islands will only grow and never shrink.
You can solve this problem with the following steps:
extract the integer value from your "id" field
compute a ranking value alongside your id value
filter out the rows where the ranking value does not match the id value
get the tied last row for each series
WITH cte AS (
    SELECT *, CAST(RIGHT(id_, 4) AS INTEGER) AS idval
    FROM tab
), ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY series_id ORDER BY idval) AS rn
    FROM cte
)
SELECT series_id, idval
FROM ranked
WHERE idval = rn
ORDER BY ROW_NUMBER() OVER (PARTITION BY series_id ORDER BY idval DESC)
FETCH FIRST ROWS WITH TIES
Check the demo here.
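
If you want to try the query without the fiddle, here is a minimal setup sketch. The query refers to the column as id_ and applies RIGHT() directly, so this sketch stores it as text; with a real ltree column you would likely need an explicit cast such as id_::text first, and FETCH FIRST ... WITH TIES needs PostgreSQL 13 or later.

CREATE TABLE tab (id_ text, series_id int);
INSERT INTO tab VALUES
    ('abc.xyz.00001', 1), ('abc.xyz.00004', 1), ('abc.xyz.00005', 1),
    ('abc.xyz.10001', 2), ('abc.xyz.10002', 2), ('abc.xyz.10003', 2), ('abc.xyz.10051', 2);
-- Expected result of the query above: series 1 => 1, series 2 => 3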

How can I query concurrent events, i.e. usage, in postgres?

From my first table of events below, what query would give me my second table of usage?
start  end
08:42  08:47
08:44  08:50
start  end    count
08:42  08:44  1
08:44  08:47  2
08:47  08:50  1
What if any indexes should I create to speed this up?
The main thing I often need is the peak usage and when it is (i.e. max count row from above), so also is there a quicker way to get one/both of these?
Also, is it quicker to query for each second (which I can imagine how to do), e.g:
time   count
08:42  1
08:43  1
08:44  2
08:45  2
08:46  2
08:47  1
08:48  1
08:49  1
NB my actual starts/ends are timestamp(6) with time zone and I have thousands of records, but I hope my example above is useful.
step-by-step demo:db<>fiddle
SELECT
    t as start,
    lead as "end",
    sum as count
FROM (
    SELECT
        t,
        lead(t) OVER (ORDER BY t),     -- 2a
        type,
        SUM(type) OVER (ORDER BY t)    -- 2b
    FROM (
        SELECT                         -- 1
            start as t,
            1 as type
        FROM mytable
        UNION
        SELECT
            stop,
            -1 as type
        FROM mytable
    ) s
) s
WHERE sum > 0                          -- 3
(1) Put all time values into one column, adding the value 1 to former start values and -1 to former end values.
(2a) Bring the next time value onto the current record. (2b) Take a cumulative SUM() over the newly added 1/-1 values: each start point increases the count, each end point decreases it. This is your expected count.
(3) Remove all records with no open interval (sum = 0).
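For the two sample events above, the intermediate result of steps (1) and (2) would look roughly like this, before the filter in step (3):

t      type  lead   sum
08:42  1     08:44  1
08:44  1     08:47  2
08:47  -1    08:50  1
08:50  -1    NULL   0    <- dropped by WHERE sum > 0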
The above only works properly if your interval borders are distinct. If several borders can fall on the same time point, you have to change the UNION into UNION ALL (which keeps duplicate values) and group the result afterwards, e.g. to generate -2 from two -1 values at the same time slot:
step-by-step demo:db<>fiddle
SELECT
    t as start,
    lead as "end",
    sum as count
FROM (
    SELECT
        t,
        lead(t) OVER (ORDER BY t),
        type,
        SUM(type) OVER (ORDER BY t)
    FROM (
        SELECT
            t,
            SUM(type) AS type
        FROM (
            SELECT
                start as t,
                1 as type
            FROM mytable
            UNION ALL
            SELECT
                stop,
                -1 as type
            FROM mytable
        ) s
        GROUP BY t
    ) s
) s
WHERE sum > 0
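
For completeness, a minimal setup sketch to try both variants; note that the queries use a column named stop rather than end (end is a reserved word), so adjust to your real column names:

CREATE TABLE mytable (start time, stop time);
INSERT INTO mytable VALUES
    ('08:42', '08:47'),
    ('08:44', '08:50');

With timestamp(6) with time zone columns, as in your real data, the same approach should apply unchanged.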

Postgresql : Average over a limit of Date with group by

I have a table like this
item_id  date        number
1        2000-01-01  100
1        2003-03-08  50
1        2004-04-21  10
1        2004-12-11  10
1        2010-03-03  10
2        2000-06-29  1
2        2002-05-22  2
2        2002-07-06  3
2        2008-10-20  4
I'm trying to get the average for each unique item_id over its last 3 dates.
It's difficult because there are missing dates in between, so a range of hardcoded dates doesn't always work.
I expect a result like:
item_id  MyAverage
1        10
2        3
I don't really know how to do this. Currently I manage to do it for one item, but I have trouble extending it to multiple items:
SELECT AVG(MyAverage.number) FROM (
    SELECT date, number
    FROM item_list
    WHERE item_id = 1
    ORDER BY date DESC LIMIT 3
) as MyAverage;
My main problem is generalising the "DESC LIMIT 3" over a GROUP BY item_id.
My attempt:
SELECT item_id, AVG(MyAverage.number)
FROM (
    SELECT item_id, date, number
    FROM item_list
    ORDER BY date DESC LIMIT 3
) as MyAverage
GROUP BY item_id;
The LIMIT is messing things up there.
I have made it "work" using BETWEEN date AND date, but that is not what I want because I need a limit, not a hardcoded date range.
Can anybody help?
You can use row_number() to assign 1 to 3 to the records with the latest dates for each ID and then filter on that.
SELECT x.item_id,
       avg(x.number)
FROM (SELECT il.item_id,
             il.number,
             row_number() OVER (PARTITION BY il.item_id
                                ORDER BY il.date DESC) rn
      FROM item_list il) x
WHERE x.rn BETWEEN 1 AND 3
GROUP BY x.item_id;
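
A quick sanity check against the sample data; this setup sketch assumes item_list(item_id, date, number) with the types below:

CREATE TABLE item_list (item_id int, date date, number numeric);
INSERT INTO item_list VALUES
    (1, '2000-01-01', 100), (1, '2003-03-08', 50), (1, '2004-04-21', 10),
    (1, '2004-12-11', 10), (1, '2010-03-03', 10),
    (2, '2000-06-29', 1), (2, '2002-05-22', 2), (2, '2002-07-06', 3),
    (2, '2008-10-20', 4);
-- Item 1: three most recent numbers are 10, 10, 10 => average 10
-- Item 2: three most recent numbers are 4, 3, 2    => average 3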

How can 'brand new, never before seen' IDs be counted per month in redshift?

A fair amount of material is available detailing methods utilising dense_rank() and the like to count distinct somethings per month. However, I've been unable to find anything that counts distinct IDs per month while also removing/discounting any IDs that have been seen in prior month groups.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. It also uses integer for storage (both positive and negative) of the id, and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
    SELECT *,
           date_trunc('month', observed_time) AS month_group
    FROM my_table
    WHERE observed_time > '2017-01-01'),
id_months AS (
    SELECT DISTINCT
           month_group,
           id
    FROM records_months
    GROUP BY month_group, id)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new IDs that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which one or how.
First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
       -- Null out the id if the observed_month that we're grouping by
       -- is NOT the earliest month that the id was seen.
       -- Then count distinct id
       count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
    select t.id,
           date_trunc('month', t.observed_time) as observed_month,
           earliest.earliest_month
    from my_table t
    join (
        -- What's the earliest month an id was seen?
        select id,
               date_trunc('month', min(observed_time)) as earliest_month
        from my_table
        group by 1
    ) earliest
        on t.id = earliest.id
)
group by 1
order by 1;
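
If you prefer to avoid the self-join, an equivalent sketch using a window function should also work; this is a variation rather than the answer above, and it assumes the same my_table(id, observed_time) plus Redshift's MIN() OVER (PARTITION BY ...):

select observed_month,
       count(distinct case when observed_month = earliest_month then id end) as num_new_ids
from (
    select id,
           date_trunc('month', observed_time) as observed_month,
           -- earliest month this id was ever seen
           min(date_trunc('month', observed_time)) over (partition by id) as earliest_month
    from my_table
) m
group by 1
order by 1;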

T-SQL table variable data order

I have a UDF which returns a table variable, like:
--
--
RETURNS @ElementTable TABLE
(
    ElementID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
    ElementValue VARCHAR(MAX)
)
AS
--
--
Is the order of data in this table variable guaranteed to be the same as the order in which the data was inserted? E.g. if I issue
INSERT INTO @ElementTable(ElementValue) VALUES ('1')
INSERT INTO @ElementTable(ElementValue) VALUES ('2')
INSERT INTO @ElementTable(ElementValue) VALUES ('3')
I expect data will always be returned in that order when I say
SELECT ElementValue FROM @ElementTable -- here I don't use ORDER BY
EDIT:
If the order is not guaranteed, then the following query
SELECT T1.ElementValue, T2.ElementValue FROM dbo.MyFunc() T1
CROSS APPLY dbo.MyFunc() T2
ORDER BY t1.ElementID
will not produce this 9-row matrix
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
consistently.
Is there any possibility that it could be like
1 2
1 1
1 3
2 3
2 2
2 1
3 1
3 2
3 3
How to do it using my above function?
No, the order is not guaranteed to be the same.
Unless, of course, you are using ORDER BY. Then it is guaranteed to be the same.
Given your update, you obtain it in the obvious way - you ask the system to give you the results in the order you want:
SELECT T1.ElementValue,T2.ElementValue FROM dbo.MyFunc() T1
Cross join dbo.MyFunc() T2
order by t1.elementid, t2.elementid
You are guaranteed that, if you're using inefficient single-row inserts within your UDF, the IDENTITY values will match the order in which the individual INSERT statements were specified.
Order is not guaranteed.
But if all you want is simply to get your records back in the same order you inserted them, then just order by your primary key. Since you already have that field set up as an auto-increment, it should suffice.
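For example (a sketch reusing the dbo.MyFunc() name from the question):

SELECT ElementValue
FROM dbo.MyFunc()
ORDER BY ElementID;  -- ElementID is the IDENTITY primary key, so this returns
                     -- the rows in insertion order deterministically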
...or use a deterministic function
SELECT TOP 9
    M1 = (ROW_NUMBER() OVER (ORDER BY id) + 2) / 3,
    M2 = (ROW_NUMBER() OVER (ORDER BY id) + 2) % 3 + 1
FROM
    sysobjects
M1 M2
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3