Postgres - partitioning a table - postgresql

How does the code look like to partition the following table. date and status are given, partition column shall be added. Column group is only to explain where the group starts and ends.
Finally, I like to do some analytics, e.g. how long takes the process per group.
In words but don't know to convert to code:
status 'approved' always defines the end. Only an 'open' after 'approval' defines the start. The other 'open' are not relevant.
date
status
Group
Partition
1.10.2022
open
Group 1 Starts
1
2.10.2022
waiting
1
3.10.2022
open
1
4.10.2022
waiting
1
5.10.2022
approved
Group 1 Ends
1
7.10.2022
open
Group 2 Start
2
8.10.2022
waiting
2
9.10.2022
open
2
10.10.2022
waiting
2
11.10.2022
open
2
12.10.2022
waiting
2
15.10.2022
approved
Group 2 Ends
2
17.10.2022
open
Group 3 Starts
3
20.10.2022
waiting
3
Thanks for the solution. Works fine :-) And sorry for not using the right expression. If Group is better than Partition even better...
Can we make it slightly more complicated?
This patter in the table applis to several parent records. So in reality there is an additional column Parent ID. This table below is then for example for parent ID A. There are many more parents.
How can an additional grouping be added by Parent ID?
At eeach new parent the counting starts again at 1

Assuming you have the first two columns and want to derive the last two, treat this as a gaps-and-islands problem:
with groups as ( -- Assign partitions
select *,
coalesce(
sum(case when status = 'approved' then 1 else 0 end)
over (order by date rows between unbounded preceding
and 1 preceding),
0
) + 1 as partition
from do_part
)
select date, status,
case -- Construct text descriptions
when partition != coalesce(lead(partition) over w, partition)
then format('Group %s Ends', partition)
when partition = lag(partition) over w
then ''
else format('Group %s Starts', partition)
end as "group",
partition
from groups
window w as (order by date);
Fiddle here

demo based on (Mike Organek)'s fiddle.
idea: left join with distinct on can properly cut the group.
SELECT DISTINCT ON (date,status)
date,
status,
coalesce(date_d, CURRENT_DATE) AS date_end
FROM
do_part t
LEFT JOIN (
SELECT
date AS date_d
FROM
do_part
WHERE
status = 'approved'
ORDER BY
date) s ON s.date_d >= t.date
ORDER BY
date,status,
date_d;
Final query (can be simplified):
WITH cte AS (
SELECT DISTINCT ON (date,
status)
date,
status,
coalesce(date_d, CURRENT_DATE) AS date_end
FROM
do_part t
LEFT JOIN (
SELECT
date AS date_d
FROM
do_part
WHERE
status = 'approved'
ORDER BY
date) s ON s.date_d >= t.date
ORDER BY
date,
status,
date_d
),
cte1 AS (
SELECT
*,
date_end - first_value(date) OVER (PARTITION BY date_end ORDER BY date) AS date_gap,
dense_rank() OVER (ORDER BY date_end),
CASE WHEN (date = first_value(date) OVER (PARTITION BY date_end ORDER BY date)) THEN
'group begin'
WHEN (status = 'approved') THEN
'group end '
ELSE
NULL
END AS grp
FROM
cte
)
SELECT
*,
CASE WHEN grp IS NOT NULL THEN
grp || dense_rank::text
END
FROM
cte1;

Related

PostgreSQL - SQL function to loop through all months of the year and pull 10 random records from each

I am attempting to pull 10 random records from each month of this year using this query here but I get an error "ERROR: relation "c1" does not exist
"
Not sure where I'm going wrong - I think it may be I'm using Mysql syntax instead, but how do I resolve this?
My desired output is like this
Month
Another header
2021-01
random email 1
2021-01
random email 2
total of ten random emails from January, then ten more for each month this year (til November of course as Dec yet to happen)..
With CTE AS
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
DISTINCT(TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM')) AS month
,CASE
WHEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
)
SELECT *
FROM CTE C1
LEFT JOIN
(SELECT RN
,month
,email
FROM CTE C2
WHERE C2.month = C1.month
ORDER BY RANDOM() LIMIT 10) C3
ON C1.RN = C3.RN
ORDER By month ASC```
You can't reference an outer table inside a derived table with a regular join. You need to use left join lateral to make that work
I did end up finding a more elegant solution to my query here via this source from github :
SELECT
month
,email
FROM
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM') AS month
,CASE
WHEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
) q
WHERE
RN <=10
ORDER BY month ASC

SQL Server - Select with Group By together Raw_Number

I'm using SQL Server 2000 (80). So, it's not possible to use the LAG function.
I have a code a data set with four columns:
Purchase_Date
Facility_no
Seller_id
Sale_id
I need to identify missing Sale_ids. So every sale_id is a 100% sequential, so the should not be any gaps in order.
This code works for a specific date and store if specified. But i need to work on entire data set looping looping through every facility_id and every seller_id for ever purchase_date
declare #MAXCOUNT int
set #MAXCOUNT =
(
select MAX(Sale_Id)
from #table
where
Facility_no in (124) and
Purchase_date = '2/7/2020'
and Seller_id = 1
)
;WITH TRX_COUNT AS
(
SELECT 1 AS Number
union all
select Number + 1 from TRX_COUNT
where Number < #MAXCOUNT
)
select * from TRX_COUNT
where
Number NOT IN
(
select Sale_Id
from #table
where
Facility_no in (124)
and Purchase_Date = '2/7/2020'
and seller_id = 1
)
order by Number
OPTION (maxrecursion 0)
My Dataset
This column:
case when
Sale_Id=0 or 1=Sale_Id-LAG(Sale_Id) over (partition by Facility_no, Purchase_Date, Seller_id)
then 'OK' else 'Previous Missing' end
will tell you which Seller_Ids have some sale missing. If you want to go a step further and have exactly your desired output, then filter out and distinct the 'Previous Missing' ones, and join with a tally table on not exists.
Edit: OP mentions in comments they can't use LAG(). My suggestion, then, would be:
Make a temp table that that has the max(sale_id) group by facility/seller_id
Then you can get your missing results by this pseudocode query:
Select ...
from temptable t
inner join tally N on t.maxsale <=N.num
where not exists( select ... from sourcetable s where s.facility=t.facility and s.seller=t.seller and s.sale=N.num)
> because the only way to "construct" nonexisting combinations is to construct them all and just remove the existing ones.
This one worked out
; WITH cte_Rn AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY Facility_no, Purchase_Date, Seller_id ORDER BY Purchase_Date) AS [Rn_Num]
FROM (
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id
FROM MyTable WITH (NOLOCK)
) a
)
, cte_Rn_0 as (
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id,
-- [Rn_Num] AS 'Skipped Sale'
-- , case when Sale_id = 0 Then [Rn_Num] - 1 Else [Rn_Num] End AS 'Skipped Sale for 0'
, [Rn_Num] - 1 AS 'Skipped Sale for 0'
FROM cte_Rn a
)
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id,
-- [Skipped Sale],
[Skipped Sale for 0]
FROM cte_Rn_0 a
WHERE NOT EXISTS
(
select * from cte_Rn_0 b
where b.Sale_id = a.[Skipped Sale for 0]
and a.Facility_no = b.Facility_no
and a.Purchase_Date = b.Purchase_Date
and a.Seller_id = b.Seller_id
)
--ORDER BY Purchase_Date ASC

How to shorten loading time with multiple subqueries and joins?

I have a query to find which v_address belonged to which v_group, and when was the creation date of v_address. However, each v_group was sometimes active and inactive and it has dates. So, I wrote query to find the period of those changes.
My problem is my query took too long time to run because it has multiple subqueries and joins.
Does anyone have a better idea to shorten to load data?
I attached my query below identifier = 81 is for smaller testing. For the final result, I will need to retrieve data for more than 100k ids.
I tried to change inner joins to left joins but it loses null values and still took too long time to retrieve data.
with v_address as (
select id, v_group_id, created_at
from prod.venue_address --different schema and table. v_address.v_group_id can have multiple v_address.id
group by 1,2,3
order by 3 asc
),
v_group as (
select identifier, final_status, created_at
from dwh.venue_address_archive
where identifier = 81
group by 1,2,3
order by 3 asc
),
filtering as (
select identifier, created_at,
case when sum(case when final_state = 'active' then 1 else 0 end) > 0 then 'active' else 'inactive' end as filtered_status --This filters either of active or inactive
from v_group
group by 1,2
order by 2 asc
),
prev as (
select identifier, created_at, filtered_status,
lag(case when filtered_status = 'active' then 'active' else 'inactive' end) over (partition by identifier order by created_at) = filtered_status as is_prev_status
from filtering
group by 1,2,3
),
periods as (
select identifier, filtered_status, created_at, is_prev_status,
sum(case when is_prev_status = true then 0 else 1 end) over (order by identifier, created_at) as period
from prev
group by 1,2,3,4
),
islands_gaps_start as (
select identifier, period, min(created_at) as start_at
from periods
group by 2,1
),
islands_gaps as (
select identifier, period, start_at,
lead(start_at) over (partition by identifier order by period) as end_at
from islands_gaps_start
)
select vg.identifier as "vg_id", p.created_at as "vg_created_at", p.filtered_status as "status", p.is_prev_status, p.period, va.id as "va_id", g.start_at, g.end_at
from v_address va
left join v_group vg
on va.venue_id = vg.identifier
inner join filtering f
on vg.identifier = f.identifier
inner join prev pr
on pr.filtered_status = f.filtered_status
inner join periods p
on p.filtered_status = pr.filtered_status
inner join islands_gaps_start gs
on p.period = gs.period
inner join islands_gaps g
on gs.start_at = g.start_at
group by 6,1,2,3,4,5,7,8
order by 2 asc
I already have the output with an example identifier = 81 but I have to run this query for more than 100k identifiers so, I'm looking for any advice that I can shorten my query.

How to match records for two different groups?

I have one main table called Event_log which contains all of the records that I need for this query. Within this table there is one column that I'm calling "Grp". To simplify things, assume that there are only two possible values for this Grp: A and B. So now we have one table, Event_log, with one column "Grp" and one more column called "Actual Date". Lastly I want to add one more Flag column to this table, which works as follows.
First, I order all of the records in descending order by date as demonstrated below. Then, I want to flag each Group "A" row with a 1 or a 0. For all "A" rows, if the previous record (earlier in date) = "B" row then I want to flag 1. Otherwise flag a 0. So this initial table looks like this before setting this flag:
Actual Date Grp Flag
1-29-13 A
12-27-12 B
12-26-12 B
12-23-12 A
12-22-12 A
But after these calculations are done, it should look like this:
Actual Date Grp Flag
1-29-13 A 1
12-27-12 B NULL
12-26-12 B NULL
12-23-12 A 0
12-22-12 A 0
How can I do this? This is simpler to describe than it is to query!
You can use something like:
select el.ActualDate
, el.Grp
, Flag = case
when el.grp = 'B' then null
when prev.grp = 'B' then 1
else 0
end
from Event_log el
outer apply
(
select top 1 prev.grp
from Event_log prev
where el.ActualDate > prev.ActualDate
order by prev.ActualDate desc
) prev
order by el.ActualDate desc
SQL Fiddle with demo.
Try this
;with cte as
(
SELECT CAST('01-29-13' As DateTime) ActualDate,'A' Grp
UNION ALL SELECT '12-27-12','B'
UNION ALL SELECT '12-26-12','B'
UNION ALL SELECT '12-23-12','A'
UNION ALL SELECT '12-22-12','A'
)
, CTE2 as
(
SELECT *, ROW_NUMBER() OVER (order by actualdate desc) rn
FROM cte
)
SELECT a.*,
case
when A.Grp = 'A' THEN
CASE WHEN b.Grp = 'B' THEN 1 ELSE 0 END
ELSE NULL
END Flag
from cte2 a
LEFT OUTER JOIN CTE2 b on a.rn + 1 = b.rn

T-SQL if value exists use it other wise use the value before

I have the following table
-----Account#----Period-----Balance
12345---------200901-----$11554
12345---------200902-----$4353
12345 --------201004-----$34
12345 --------201005-----$44
12345---------201006-----$1454
45677---------200901-----$14454
45677---------200902-----$1478
45677 --------201004-----$116776
45677 --------201005-----$996
56789---------201006-----$1567
56789---------200901-----$7894
56789---------200902-----$123
56789 --------201003-----$543345
56789 --------201005-----$114
56789---------201006-----$54
I want to select the account# that have a period of 201005.
This is fairly easy using the code below. The problem is that if a user enters 201003-which doesnt exist- I want the query to select the previous value.*NOTE that there is an account# that has a 201003 period and I still want to select it too.*
I tried CASE, IF ELSE, IN but I was unsuccessfull.
PS:I cannot create temp tables due to system limitations of 5000 rows.
Thank you.
DECLARE #INPUTPERIOD INT
#INPUTPERIOD ='201005'
SELECT ACCOUNT#, PERIOD , BALANCE
FROM TABLE1
WHERE PERIOD =#INPUTPERIOD
SELECT t.ACCOUNT#, t.PERIOD, t.BALANCE
FROM (SELECT ACCOUNT#, MAX(PERIOD) AS MaxPeriod
FROM TABLE1
WHERE PERIOD <= #INPUTPERIOD
GROUP BY ACCOUNT#) q
INNER JOIN TABLE1 t
ON q.ACCOUNT# = t.ACCOUNT#
AND q.MaxPeriod = t.PERIOD
select top 1 account#, period, balance
from table1
where period >= #inputperiod
; WITH Base AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE Period <= 201003
)
SELECT * FROM Base WHERE RN = 1
Using CTE and ROW_NUMBER() (we take all the rows with Period <= the selected date and we take the top one (the one with auto-generated ROW_NUMBER() = 1)
; WITH Base AS
(
SELECT *, 1 AS RN FROM #MyTable WHERE Period = 201003
)
, Alternative AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE NOT EXISTS(SELECT 1 FROM Base) AND Period < 201003
)
, Final AS
(
SELECT * FROM Base
UNION ALL
SELECT * FROM Alternative WHERE RN = 1
)
SELECT * FROM Final
This one is a lot more complex but does nearly the same thing. It is more "imperative like". It first tries to find a row with the exact Period, and if it doesn't exists does the same thing as before. At the end it unite the two result sets (one of the two is always empty). I would always use the first one, unless profiling showed me the SQL wasn't able to comprehend what I'm trying to do. Then I would try the second one.