Random number generation based on condition in SQL Server (T-SQL)

I need to generate a random number based on the ID3 column. The logic for each row is described in the remarks.
Note: this is sample data; I need to apply the logic to a big set of data.
Is there any way to create a random or incrementing number based on a condition: if the condition passes, retain the same number, else generate another one (rand + 1)?

This will give you an incrementing number with the behaviour you want, assuming ID1 is an incrementing number. If it's not, you will need some other way to order the data, since order is inherent to the behaviour you want.
Here I have assumed we can order by ID1, since it's not used anywhere else in the logic:
create table #t
(
    ID1 INT,
    ID2 INT,
    ID3 INT
);
insert into #t (ID1, ID2, ID3)
values (1,1,1),(2,1,1),(3,2,1),(4,2,31),(5,2,1),(6,2,1),(7,2,23),(8,2,31);
with c1 as
(
    -- Flag the rows where a new number must start: either this row's ID3
    -- is not 1, or the previous row's ID3 (defaulting to 2 on the first row,
    -- so the first row is always flagged) is not 1.
    select ID1, ID2, ID3,
        case when ID3 != 1 or lag(ID3, 1, 2) over (order by ID1) != 1 then 1 else 0 end as IncrementHere
    from #t
)
-- A running total of the flags yields the incrementing number.
select ID1, ID2, ID3,
    sum(IncrementHere) over (order by ID1 rows unbounded preceding) as IncrementingNumber
from c1;
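With the sample rows above, tracing the flag-and-running-sum logic by hand, the query returns:
ID1 ID2 ID3 IncrementingNumber
1   1   1   1
2   1   1   1
3   2   1   1
4   2   31  2
5   2   1   3
6   2   1   3
7   2   23  4
8   2   31  5
If you need an actual random number per group rather than an incrementing one, here is a minimal sketch building on the query above (my addition, not part of the original answer; ABS(CHECKSUM(NEWID())) is just one way to get a per-row random integer in T-SQL):
with c1 as
(
    select ID1, ID2, ID3,
        case when ID3 != 1 or lag(ID3, 1, 2) over (order by ID1) != 1 then 1 else 0 end as IncrementHere
    from #t
), c2 as
(
    select ID1, ID2, ID3,
        sum(IncrementHere) over (order by ID1 rows unbounded preceding) as IncrementingNumber
    from c1
)
select c2.ID1, c2.ID2, c2.ID3, g.RandomNumber
from c2
join (
    -- NEWID() is evaluated once per row of the distinct list,
    -- so every group gets exactly one random value
    select IncrementingNumber, abs(checksum(newid())) % 100000 as RandomNumber
    from (select distinct IncrementingNumber from c2) d
) g on g.IncrementingNumber = c2.IncrementingNumber;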

Related

Calculate the cumulative total between 2 columns until a non-zero value is reached in 1st column. Once non-zero value is reached, the sum restarts

I am querying a database using T-SQL in SSMS. I have a dataset that contains two unique IDs, A and B. For each of these IDs I want to sum columns OAC and Adj cumulatively until the next non-zero value is reached in column OAC. In other words, the non-zero values in column OAC remain the same, while the rows between them keep adding the subsequent Adj values, using each non-zero OAC value as a stop-and-restart point.
The data table can be created using the below
drop table if exists #T
CREATE TABLE #T(
Id varchar(10),
PeriodNum int,
row_num int,
OAC money ,
Adj money
) ON [PRIMARY]
GO
insert into #T
values
('A','201606','1','5','0'),
('A','201905','2','0','-2'),
('A','201906','3','100','0'),
('A','202008','4','0','-6'),
('A','202009','5','0','-8'),
('A','202106','6','0','-11'),
('A','202109','7','23','0'),
('B','201606','1','3','0'),
('B','201905','2','0','25'),
('B','201906','3','60','0'),
('B','202008','4','0','12'),
('B','202009','5','0','-5'),
('B','202106','6','0','6'),
('B','202109','7','6','0')
I tried the following code to calculate the desired result
select * , sum( iif(t.OAC<> 0,(t.OAC + t.Adj),0)) over (partition by t.ID order by t.row_num asc) as Calc
from #T t
This did not work. The needed result (the Calc column) should keep each non-zero OAC value as-is and then cumulatively add the Adj values to it until the next non-zero OAC value.
Courtesy of https://blog.jooq.org/10-sql-tricks-that-you-didnt-think-were-possible/ :
Of course, that vendor is not Microsoft, so we're stuck with less finessed options.
One way to go about this is to first use a conditional aggregate on a flag to create "groupings", let's call that [sumstep], and then partition on that to get separate running totals:
;with cte as
(
    -- Every non-zero OAC starts a new group: a running count of the
    -- non-zero OAC values gives each row its group number [sumstep].
    select *,
        sum(case when OAC = 0 then 0 else 1 end) over (partition by Id order by row_num) as sumstep
    from #T
)
-- The running total per (Id, sumstep) restarts at every non-zero OAC value.
select *,
    sum(OAC + Adj) over (partition by Id, sumstep order by row_num) as Calc
from cte
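With the sample data, this produces the required Calc values (traced by hand from the grouping logic):
Id row_num OAC  Adj  sumstep Calc
A  1       5    0    1       5
A  2       0    -2   1       3
A  3       100  0    2       100
A  4       0    -6   2       94
A  5       0    -8   2       86
A  6       0    -11  2       75
A  7       23   0    3       23
B  1       3    0    1       3
B  2       0    25   1       28
B  3       60   0    2       60
B  4       0    12   2       72
B  5       0    -5   2       67
B  6       0    6    2       73
B  7       6    0    3       6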

How to find the average of the three maximum values in a specific group in a moving window in Big Query?

I have a data set as in the table below. I want to find the average of the maximum three values in a rolling 12 month window grouped by id.
id date value
id1 2020/01/01 500
id1 2021/02/01 300
id1 2021/03/01 150
id1 2021/08/01 100
id1 2021/12/01 400
id2 2020/01/01 50
id2 2020/02/01 900
id2 2021/12/01 100
So my expected output is:
id date value
id1 2020/01/01 500
id1 2021/02/01 300
id1 2021/03/01 225
id1 2021/08/01 183.33
id1 2021/12/01 283.33
id2 2020/01/01 50
id2 2020/02/01 500
id2 2021/12/01 100
I.e. for id1 2021/12/01: (400+300+150)/3 = 283.33 which is the average of the three largest values in a rolling 12 month window for group ID1.
I managed to get to this point:
CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
SELECT AVG(val) FROM(
SELECT val FROM UNNEST(arr) val ORDER BY val DESC LIMIT 3)
)
);
SELECT id, date, avg_array(val_arr)
FROM (
SELECT
id, date, ARRAY_AGG(value) OVER (
PARTITION BY id
ORDER BY id, date DESC ROWS BETWEEN CURRENT ROW AND 11 FOLLOWING
) as val_arr
FROM `table` )
This works, but I feel like there must be a better way to do it. Specifically, I can't figure out how to get the average of the maximum three from the OVER clause as well, rather than creating a separate function.
(If it is not possible to combine a date window with finding maximum values, it would also be useful for me to know how to find the average of the maximum three per group without creating a separate function.)
In your code, the year of the date is missing from the partition; it should be PARTITION BY id, EXTRACT(YEAR FROM date):
CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
SELECT AVG(val) FROM(
SELECT val FROM UNNEST(arr) val ORDER BY val DESC LIMIT 3))
);
SELECT id, date, avg_array(val_arr)
FROM (
SELECT
id, date, ARRAY_AGG(value) OVER (
PARTITION BY id,EXTRACT(YEAR FROM date)
ORDER BY id, date DESC ROWS BETWEEN CURRENT ROW AND 11 FOLLOWING
) as val_arr
FROM `table` )
order by id,date asc
Here is sample code to get the average of the maximum 3 numbers of a group:
select id, avg(value) as vg
from (
    -- rank each row within its id group, highest value first
    select id, date, value,
        row_number() over (partition by id order by value desc) as rn
    from `table`
)
where rn <= 3
group by id
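As an alternative sketch, BigQuery's QUALIFY clause expresses the same per-group top-3 filter more compactly (note that BigQuery wants a WHERE, GROUP BY, or HAVING clause alongside QUALIFY, hence the where true):
select id, avg(value) as vg
from (
    select id, value
    from `table`
    where true
    qualify row_number() over (partition by id order by value desc) <= 3
)
group by id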
You can see more information about the OVER clause in this link.
Consider the approach below:
select id, date,
    (select round(avg(value), 2)
     from (
        select value from t.arr value   -- implicit UNNEST of the collected array
        order by value desc
        limit 3
     )) value
from (
    select *, array_agg(value) over last_12_month arr
    from `table`
    -- the named window collects the values of the previous 12 calendar months
    window last_12_month as (partition by id
        order by 12 * extract(year from date) + extract(month from date)
        range between 11 preceding and current row
    )
) t
If applied to the sample data in your question, this produces the rolling 12-month top-3 averages per id.
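The trick that makes this a true rolling window is the ORDER BY expression inside the named window: 12 * extract(year from date) + extract(month from date) maps every date onto a linear month index, so range between 11 preceding and current row spans exactly the previous 12 calendar months even when some months have no rows, which a ROWS-based frame (as in the question) cannot guarantee.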

Update null values in a column based on non null values percentage of the column

I need to update the NULL values of a column in a table, for each category, based on the percentage distribution of the non-null values.
There are only two types of values in the column, CV and NCV; based on the existing rows, the split is roughly 38% CV and 63% NCV.
The number of rows with NULL values is 7. I need to randomly populate the NULL values according to the percentage share of the non-null values: 38% (CV) of 7 = 3, 63% (NCV) of 7 = 4.
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
    -- share of each type among the non-null rows, plus the number of NULL rows
    select
        (select count(*)::numeric from the_table where type = 'cv') / (select count(*) from the_table where type is not null) as cv_pct,
        (select count(*)::numeric from the_table where type = 'ncv') / (select count(*) from the_table where type is not null) as ncv_pct,
        (select count(*) from the_table where type is null) as null_count
), calc as (
    -- number the NULL rows; the first round(null_count * cv_pct) of them become 'cv'
    select d.ctid,
        p.cv_pct,
        p.ncv_pct,
        row_number() over () as rn,
        case
            when row_number() over () <= round(null_count * p.cv_pct) then 'cv'
            else 'ncv'
        end as new_type
    from the_table d
    cross join pcts p
    where type is null
)
update the_table t
set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness).
The second CTE then calculates, for each NULL row, which new type should be used. This is done by comparing the row number against the expected number of 'cv' rows, i.e. null_count multiplied by the percentage (the CASE expression).
This is then used to update the target table. I have used the ctid as an alternative to a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column, as sketched below.
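For example, assuming a (hypothetical) primary key column named pk, the statement would look like this:
with pcts as (
    select
        (select count(*)::numeric from the_table where type = 'cv') / (select count(*) from the_table where type is not null) as cv_pct,
        (select count(*) from the_table where type is null) as null_count
), calc as (
    select d.pk,   -- hypothetical primary key instead of ctid
        case
            when row_number() over () <= round(p.null_count * p.cv_pct) then 'cv'
            else 'ncv'
        end as new_type
    from the_table d
    cross join pcts p
    where d.type is null
)
update the_table t
set type = c.new_type
from calc c
where t.pk = c.pk;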
I wouldn't be surprised, though, if there were a shorter, more efficient way to do it, but for now I can't think of a better alternative.
Online example
If you are on PG11 or later, you can use the groups frame to do this in what should be close to a single pass (except reordering for output when sorted by tid) with window functions:
select tid, category, id, type,
case
when type is not null then type
when round(
(count(*) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding))::numeric /
coalesce(
nullif(
count(*) over (partition by category
order by type nulls last
groups 2 preceding
exclude group), 0), 1
) *
count(*) over (partition by category
order by type nulls last
groups current row)
) >= row_number() over (partition by category, type
order by tid)
then
first_value(type) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding)
else
first_value(type) over (partition by category
order by type nulls last
groups 1 preceding
exclude group)
end as extended_type
from cv_ncv
order by tid;
Working fiddle here.
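To unpack the frame arithmetic (assuming exactly two non-null types, as in the sample data): with order by type nulls last, the peer groups per category are 'cv', 'ncv' and NULL, so for a NULL row groups between 2 preceding and 2 preceding is exactly the 'cv' group, groups 2 preceding exclude group covers all non-null rows, and groups current row is the NULL group itself. The CASE therefore assigns the first round(cv_share * null_count) NULL rows (ordered by tid) the 'cv' label and the remaining ones 'ncv'.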

find rows not preceded by the same values in 3 columns

I have a table named raw_data with the following data.
As you can see, ids 1 and 2 share the same values in the fields desa, kecamatan and kabupaten; the same goes for ids 3, 4 and 5.
So basically I want to select all rows that are not preceded by the same values. The expected result would be:
I know it's easy to do this in any programming language such as PHP, but I need this in PostgreSQL. Is this doable? Thanks in advance.
Assuming that a higher id denotes a later row: if a row with the same values in all three columns appears again later (not immediately after), you don't want to filter it out, because it does not have the same values as its previous row (ordered by id or created_date). You can make use of the analytic lag() function:
select *
from (
    select
        t.*,
        -- flag = 0 when the previous row (by id) has identical values
        -- in all three columns, 1 otherwise
        case
            when desa = lag(desa) over (order by id)
             and kecamatan = lag(kecamatan) over (order by id)
             and kabupaten = lag(kabupaten) over (order by id)
            then 0 else 1
        end flag
    from your_table t
) t where flag = 1;
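One caveat: plain = never matches NULLs, so if any of the three columns is nullable, two consecutive NULLs would count as different values and the row would be kept. A variant of the CASE expression using PostgreSQL's IS NOT DISTINCT FROM, which treats NULLs as equal, avoids that:
case
    when desa is not distinct from lag(desa) over (order by id)
     and kecamatan is not distinct from lag(kecamatan) over (order by id)
     and kabupaten is not distinct from lag(kabupaten) over (order by id)
    then 0 else 1
end flag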

PostgreSQL: set a column with the ordinal of the row sorted via another field

I have a table segnature describing an item with a varchar field deno and a numeric field ord. A foreign key fk_collection tells which collection the row is part of.
I want to update field ord so that it contains the ordinal of that row per each collection, sorted by field deno.
E.g. if I have something like
deno | ord | fk_collection
abc  |     | 10
aab  |     | 10
bcd  |     | 10
zxc  |     | 20
vbn  |     | 20
Then I want a result like
deno | ord | fk_collection
abc  | 1   | 10
aab  | 0   | 10
bcd  | 2   | 10
zxc  | 1   | 20
vbn  | 0   | 20
I tried with something like
update segnature s1 set ord = (select count(*)
from segnature s2
where s1.fk_collection=s2.fk_collection and s2.deno<s1.deno
)
but the query is really slow: about 150 collections with roughly 30,000 items in total take about 10 minutes to update.
Any suggestion to speed up the process?
Thank you!
You can use a window function to generate the "ordinal" number:
with numbered as (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
)
update segnature
set ord = n.rn
from numbered n
where n.id = segnature.ctid;
This uses the internal column ctid to uniquely identify each row. The ctid comparison is quite slow, though, so if you have a real primary (or unique) key in that table, use that column instead.
Alternatively without the common table expression:
update segnature
set ord = n.rn
from (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
) as n
where n.id = segnature.ctid;
SQLFiddle example: http://sqlfiddle.com/#!15/e997f/1
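The reason this is so much faster than the correlated subquery in the question: row_number() needs only one sort per partition (roughly O(n log n) overall), whereas count(*) with s2.deno < s1.deno re-scans the collection for every single row, which is quadratic in the collection size.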