How to Return Records Equal to a Specific Percentage of an Aggregate in Transact-SQL?

My requirement is to provide a random sample of claims that comprise 2.5% of the total amount paid and also comprise 2.5% of total claims for a given population. The goal is to deliver records in a report that meet both criteria. My staging table is defined as follows:
[RecordId] UniqueIdentifier NOT NULL PRIMARY KEY DEFAULT NEWID()
,ClaimNO varchar(50)
,Company_ID varchar(10)
,HPCode varchar(10)
,FinancialResponsibility varchar(30)
,ProviderType varchar(50)
,DateOfService date
,DatePaid date
,ClaimType varchar(50)
,TotalBilled numeric(11,2)
,TotalPaid numeric(11,2)
,ProcessorType varchar(100)
I've already built the logic to return 2.5% of the total number of claims, but I need guidance on how best to ensure both criteria are met.
Here's what I've tried thus far:
with cteTotals as (
Select Count(*) as TotalClaims, sum(TotalPaid) as TotalPaid, sum(TotalPaid) * .025 as PaidSampleAmount
from [Z_Monthly_Quality_Review]
),
ctePopulation as (
Select *
from [Z_Monthly_Quality_Review]
),
cteSampleRows as (
select TOP 2.5 PERCENT NEWID() RandomID, RecordID, ClaimNo, HPCode, FinancialResponsibility, ProviderType, ProcessorType,
Format(DateOfService, 'MM/dd/yyyy') as DateOfService, Format(DatePaid, 'MM/dd/yyyy') as DatePaid, ClaimType, TotalBilled, TotalPaid
from [Z_Monthly_Quality_Review]
order by NEWID()
),
cteSamplePaid as (
Select Top 2.5 PERCENT NEWID() RandomID, RecordID, ClaimNo, HPCode, FinancialResponsibility, ProviderType, ProcessorType,
Format(DateOfService, 'MM/dd/yyyy') as DateOfService, Format(DatePaid, 'MM/dd/yyyy') as DatePaid, ClaimType, TotalBilled, TotalPaid
from [Z_Monthly_Quality_Review] mqr
inner join ctePopulation cte on mqr.ClaimNo = cte.ClaimNO
order by NEWID()
)
Since both criteria must be satisfied, how should I structure both CTEs to ensure this? In my cteSamplePaid, how do I ensure that the sum of total paid equals 2.5% of the total population? Would this be accomplished with a HAVING clause? The end result will be displayed to my business users via SQL Server Reporting Services. Ideally, I would want to provide them with one sample that meets both criteria. If that's not possible, how do I randomly sample claims for each criterion?

I don't think there is a guaranteed way to make the sample add up to exactly 2.5% of the total. Finding an exact match would essentially mean brute-forcing every possible combination of rows, so performance would be very poor and a result still isn't guaranteed. A way to get very close to your goal is to return rows whose sum falls within an acceptable margin of error of the target.
Since no sample data was provided, I used the AdventureWorks2017 sample database.
USE AdventureWorks2017
GO
DROP TABLE IF EXISTS #SalesData
SELECT SalesOrderID AS ID,TotalDue
INTO #SalesData
FROM Sales.SalesOrderHeader
DECLARE @DesiredPercentage Numeric(10,3) = .025 /*Desired sum percentage of total rows*/
    ,@AcceptableMargin Numeric(10,3) = .01 /*Random row total can be plus or minus this percentage of the desired sum*/
DECLARE @DesiredSum Numeric(16,2) = @DesiredPercentage *(SELECT SUM(TotalDue) FROM #SalesData)
/*For loop*/
DECLARE @RowNum INT
    ,@LoopCounter INT = 1
WHILE (1=1)
BEGIN
    DROP TABLE IF EXISTS #RandomData
    SELECT RowNum = ROW_NUMBER() OVER (ORDER BY B.RandID),A.*,RunningTotal = SUM(TotalDue) OVER (ORDER BY B.RandID)
    INTO #RandomData
    FROM #SalesData AS A
    CROSS APPLY (SELECT RandID = NEWID()) AS B
    WHERE TotalDue < @DesiredSum /*If single row bigger than desired sum, then filter it out*/
    ORDER BY B.RandID
    SELECT Top(1) @RowNum = RowNum
    FROM #RandomData AS A
    CROSS APPLY (SELECT DeltaFromDesiredSum = ABS(RunningTotal-@DesiredSum)) AS B
    WHERE RunningTotal BETWEEN @DesiredSum *(1-@AcceptableMargin) AND @DesiredSum *(1+@AcceptableMargin)
    ORDER BY DeltaFromDesiredSum
    IF (@RowNum IS NOT NULL)
        BREAK;
    IF (@LoopCounter >=100) /*Prevents infinite loops*/
        THROW 59194,'Result unable to be generated in 100 tries. Recommend expanding acceptable margin',1;
    SET @LoopCounter +=1;
END
SELECT *
FROM #RandomData
WHERE RowNum <= @RowNum
SELECT RandomRowTotal = SUM(TotalDue)
    ,DesiredSum = @DesiredSum
    ,PercentageFromDesiredSum = Concat(Cast(Round(100*(1-SUM(TotalDue)/@DesiredSum),2) as Float),'%')
FROM #RandomData
WHERE RowNum <= @RowNum
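The script above only targets the paid amount. As a rough sketch of how one might look at both criteria side by side, the query below computes, for one random ordering, the cumulative share of rows and the cumulative share of the paid amount. The table variable @Claims, its sample values, and the output column names CumulativeRowShare and CumulativePaidShare are made up for illustration. RandID is filled once per row by a DEFAULT so that every window function sees the same random order.

```sql
-- Hypothetical sample data; RandID is assigned once per row at insert time.
DECLARE @Claims TABLE (
    ClaimNo   int PRIMARY KEY,
    TotalPaid numeric(11,2) NOT NULL,
    RandID    uniqueidentifier NOT NULL DEFAULT NEWID()
);

INSERT INTO @Claims (ClaimNo, TotalPaid)
VALUES (1, 120.00), (2, 80.00), (3, 45.50), (4, 300.00),
       (5, 15.25), (6, 60.00), (7, 210.00), (8, 95.75);

-- For each position in the random order: share of rows taken so far
-- and share of the total paid amount taken so far.
SELECT ClaimNo,
       TotalPaid,
       CAST(ROW_NUMBER() OVER (ORDER BY RandID) * 1.0
            / COUNT(*) OVER () AS decimal(9,4)) AS CumulativeRowShare,
       CAST(SUM(TotalPaid) OVER (ORDER BY RandID ROWS UNBOUNDED PRECEDING)
            / SUM(TotalPaid) OVER () AS decimal(9,4)) AS CumulativePaidShare
FROM @Claims
ORDER BY RandID;
```

A prefix of this ordering meets both criteria only at a row where both shares are close to the target at the same time, which is why a margin of error (and possibly several reshuffles) is needed.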

Related

SQL Server - Select with Group By together Raw_Number

I'm using SQL Server 2000 (80). So, it's not possible to use the LAG function.
I have a data set with four columns:
Purchase_Date
Facility_no
Seller_id
Sale_id
I need to identify missing Sale_ids. Sale_ids are strictly sequential, so there should not be any gaps in the sequence.
This code works for a specific date and store when they are specified, but I need it to work on the entire data set, looping through every Facility_no and every Seller_id for every Purchase_Date.
declare @MAXCOUNT int
set @MAXCOUNT =
(
select MAX(Sale_Id)
from #table
where
Facility_no in (124) and
Purchase_date = '2/7/2020'
and Seller_id = 1
)
;WITH TRX_COUNT AS
(
SELECT 1 AS Number
union all
select Number + 1 from TRX_COUNT
where Number < @MAXCOUNT
)
select * from TRX_COUNT
where
Number NOT IN
(
select Sale_Id
from #table
where
Facility_no in (124)
and Purchase_Date = '2/7/2020'
and seller_id = 1
)
order by Number
OPTION (maxrecursion 0)
My Dataset
This column:
case when
Sale_Id=0 or 1=Sale_Id-LAG(Sale_Id) over (partition by Facility_no, Purchase_Date, Seller_id order by Sale_Id)
then 'OK' else 'Previous Missing' end
will tell you which Seller_Ids have some sale missing. If you want to go a step further and have exactly your desired output, then filter out and distinct the 'Previous Missing' ones, and join with a tally table on not exists.
Edit: OP mentions in comments they can't use LAG(). My suggestion, then, would be:
Make a temp table that has the max(sale_id) grouped by facility/seller_id.
Then you can get your missing results by this pseudocode query:
Select ...
from temptable t
inner join tally N on N.num <= t.maxsale
where not exists( select ... from sourcetable s where s.facility=t.facility and s.seller=t.seller and s.sale=N.num)
This works because the only way to "construct" nonexisting combinations is to construct them all and just remove the existing ones.
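The temp-table-plus-tally idea above could be sketched as follows. This is only an illustration: the table variable @Sales and its rows are made up, Sale_id is assumed to start at 1, and the tally is built inline for 1 to 100. It also assumes a SQL Server version that supports CTEs and the date type (the question's own code already relies on a CTE).

```sql
-- Hypothetical sample data; Sale_id is assumed to start at 1.
DECLARE @Sales TABLE (
    Facility_no   int,
    Purchase_Date date,
    Seller_id     int,
    Sale_id       int
);

INSERT INTO @Sales (Facility_no, Purchase_Date, Seller_id, Sale_id)
VALUES (124, '2020-02-07', 1, 1), (124, '2020-02-07', 1, 2), (124, '2020-02-07', 1, 5),
       (124, '2020-02-07', 2, 1), (124, '2020-02-07', 2, 3);

WITH Digits AS (
    SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS v(d)
),
Tally AS (
    -- Numbers 1..100; add more cross joins if Sale_id can go higher.
    SELECT tens.d * 10 + ones.d + 1 AS num
    FROM Digits AS tens
    CROSS JOIN Digits AS ones
),
MaxSale AS (
    -- Highest Sale_id per facility, date and seller.
    SELECT Facility_no, Purchase_Date, Seller_id, MAX(Sale_id) AS MaxSaleId
    FROM @Sales
    GROUP BY Facility_no, Purchase_Date, Seller_id
)
-- Build every expected Sale_id, then keep the ones that do not exist.
SELECT m.Facility_no, m.Purchase_Date, m.Seller_id, t.num AS MissingSaleId
FROM MaxSale AS m
INNER JOIN Tally AS t
    ON t.num <= m.MaxSaleId
WHERE NOT EXISTS (
    SELECT 1
    FROM @Sales AS s
    WHERE s.Facility_no   = m.Facility_no
      AND s.Purchase_Date = m.Purchase_Date
      AND s.Seller_id     = m.Seller_id
      AND s.Sale_id       = t.num
)
ORDER BY m.Facility_no, m.Purchase_Date, m.Seller_id, t.num;
-- Returns missing Sale_id 3 and 4 for seller 1, and 2 for seller 2.
```

If Sale_id can start at 0, as the LAG-based expression allows for, the tally would need to start at 0 as well.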
This one worked out
; WITH cte_Rn AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY Facility_no, Purchase_Date, Seller_id ORDER BY Purchase_Date) AS [Rn_Num]
FROM (
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id
FROM MyTable WITH (NOLOCK)
) a
)
, cte_Rn_0 as (
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id
-- , [Rn_Num] AS 'Skipped Sale'
-- , case when Sale_id = 0 Then [Rn_Num] - 1 Else [Rn_Num] End AS 'Skipped Sale for 0'
, [Rn_Num] - 1 AS 'Skipped Sale for 0'
FROM cte_Rn a
)
SELECT
Facility_no,
Purchase_Date,
Seller_id,
Sale_id,
-- [Skipped Sale],
[Skipped Sale for 0]
FROM cte_Rn_0 a
WHERE NOT EXISTS
(
select * from cte_Rn_0 b
where b.Sale_id = a.[Skipped Sale for 0]
and a.Facility_no = b.Facility_no
and a.Purchase_Date = b.Purchase_Date
and a.Seller_id = b.Seller_id
)
--ORDER BY Purchase_Date ASC

Update null values in a column based on non null values percentage of the column

I need to update the null values of a column in a table for each category based on the percentage of the non-null values. The following table shows the null values for a particular category -
There are only two types of values in the column. The percentage of types based on rows is -
The number of rows with null values is 7, I need to randomly populate the null values based on the percentage share of the non-null values as shown below - 38%(CV) of 7 = 3, 63%(NCV) of 7 = 4
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
select
(select count(*)::numeric from the_table where type = 'cv') / (select count(*) from the_table where type is not null) as cv_pct,
(select count(*)::numeric from the_table where type = 'ncv') / (select count(*) from the_table where type is not null) as ncv_pct,
(select count(*) from the_table where type is null) as null_count
), calc as (
select d.ctid,
p.cv_pct,
p.ncv_pct,
row_number() over () as rn,
case
when row_number() over () <= round(null_count * p.cv_pct) then 'cv'
else 'ncv'
end as new_type
from the_table d
cross join pcts p
where type is null
)
update the_table t
set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness)
The second then calculates which new type each row should get. This is done by comparing the row number with the number of NULL rows multiplied by the expected percentage (the CASE expression).
This is then used to update the target table. I have used the ctid as an alternative for a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column.
I wouldn't be surprised though, if there was a shorter, more efficient way to do it, but for now I can't think of a better alternative.
If you are on PG11 or later, you can use the groups frame to do this in what should be close to a single pass (except reordering for output when sorted by tid) with window functions:
select tid, category, id, type,
case
when type is not null then type
when round(
(count(*) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding))::numeric /
coalesce(
nullif(
count(*) over (partition by category
order by type nulls last
groups 2 preceding
exclude group), 0), 1
) *
count(*) over (partition by category
order by type nulls last
groups current row)
) >= row_number() over (partition by category, type
order by tid)
then
first_value(type) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding)
else
first_value(type) over (partition by category
order by type nulls last
groups 1 preceding
exclude group)
end as extended_type
from cv_ncv
order by tid;

Finding the percentage (%) range of average value in SQL

I want to return the values that lie within 20% of the average value of the Duration column in my database.
I want to build on the code below, but instead of returning rows where Duration is less than the average Duration, I want to return all values which lie within 20% of the AVG(Duration) value.
Select * From table
Where Duration < (Select AVG(Duration) from table)
Here is one way...
Select * From table
Where Duration between (Select AVG(Duration)*0.8 from table)
and (Select AVG(Duration)*1.2 from table)
perhaps this to avoid repeated scans:
with cte as ( Select AVG(Duration) as AvgDuration from table )
Select * From table
Where Duration between (Select AvgDuration*0.8 from cte)
and (Select AvgDuration*1.2 from cte)
or
Select table.* From table
cross join ( Select AVG(Duration) as AvgDuration from table ) cj
Where Duration between cj.AvgDuration*0.8 and cj.AvgDuration*1.2
or using a window function:
Select d.*
from (
SELECT table.*
, AVG(Duration) OVER() as AvgDuration
From table
) d
Where d.Duration between d.AvgDuration*0.8 and d.AvgDuration*1.2
The last one might be the most efficient method.
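One thing to watch for in SQL Server: AVG over an integer column returns an integer, so the bounds can shift slightly. Here is a small sketch of the window-function variant, assuming Duration is an int column (the question does not say) and using a made-up table variable @Calls:

```sql
-- Hypothetical data; Duration is assumed to be an int column.
DECLARE @Calls TABLE (CallId int PRIMARY KEY, Duration int NOT NULL);

INSERT INTO @Calls (CallId, Duration)
VALUES (1, 8), (2, 9), (3, 10), (4, 11), (5, 13);

-- Cast before averaging so the mean is 10.2 rather than the truncated 10.
SELECT d.CallId, d.Duration, d.AvgDuration
FROM (
    SELECT c.CallId,
           c.Duration,
           AVG(CAST(c.Duration AS decimal(10,2))) OVER () AS AvgDuration
    FROM @Calls AS c
) AS d
WHERE d.Duration BETWEEN d.AvgDuration * 0.8 AND d.AvgDuration * 1.2
ORDER BY d.CallId;
-- Bounds are 8.16 and 12.24, so durations 9, 10 and 11 are returned.
```

Without the cast, the average would be 10, the bounds 8.0 and 12.0, and the row with Duration 8 would be returned as well.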

How do I find the sum of all transactions since an event?

So, let's say that I have a group of donors, and they make donations on an irregular basis. I can put the donor name, the donation amount, and the donation date into a table, but then I want to do a report that shows all of that information PLUS the value of all donations after that amount.
I know that I can parse through this using a loop, but is there a better way?
I'm cheating here by not bothering with the code that would go through and assign a transaction number by donor and ensure that everything is the right order. That's easy enough.
DECLARE @Donors TABLE (
ID INT IDENTITY
, Name NVARCHAR(30)
, NID INT
, Amount DECIMAL(7,2)
, DonationDate DATE
, AmountAfter DECIMAL(7,2)
)
INSERT INTO @Donors VALUES
('Adam Zephyr',1,100.00,'2017-01-14',NULL)
, ('Adam Zephyr',2,200.00,'2017-01-17',NULL)
, ('Adam Zephyr',3,150.00,'2017-01-20',NULL)
, ('Braden Yu',1,50.00,'2017-01-11',NULL)
, ('Braden Yu',2,75.00,'2017-01-19',NULL)
DECLARE @Counter1 INT = 0
, @Name NVARCHAR(30)
WHILE @Counter1 < (SELECT MAX(ID) FROM @Donors)
BEGIN
SET @Counter1 += 1
SET @Name = (SELECT Name FROM @Donors WHERE ID = @Counter1)
UPDATE d1
SET AmountAfter = (SELECT ISNULL(SUM(Amount),0) FROM @Donors d2 WHERE ID > @Counter1 AND Name = @Name)
FROM @Donors d1
WHERE d1.ID = @Counter1
END
SELECT * FROM @Donors
It seems like there ought to be a way to do this recursively, but I just can't wrap my head around it.
This would show the latest donation per Name which I presume is the donor and the total of all amounts donated by that person. Perhaps it's more appropriate to use NID for the partitions.
;with MostRecentDonations as (
select *,
row_number() over (partition by Name order by DonationDate desc) as rn,
sum(Amount) over (partition by Name) as TotalDonations
from @Donors
)
select * from MostRecentDonations
where rn = 1;
There's certainly no need to store a running total anywhere unless you have some kind of performance issue.
EDIT:
I've thought about your question and now I'm thinking that you just want a running total with all the transactions included. That's easy too:
select *,
sum(Amount) over (partition by Name order by DonationDate) as DonationsToDate
from @Donors
order by Name, DonationDate;
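If what is wanted is the total of the donations made after each one, as in the question's loop, the same window function can be pointed forward with a frame. A minimal sketch, assuming SQL Server 2012 or later for the ROWS frame and reusing the question's sample rows in a trimmed-down table variable:

```sql
-- Sample rows from the question; the NID and AmountAfter columns are left out.
DECLARE @Donors TABLE (
    ID           int IDENTITY PRIMARY KEY,
    Name         nvarchar(30),
    Amount       decimal(7,2),
    DonationDate date
);

INSERT INTO @Donors (Name, Amount, DonationDate)
VALUES ('Adam Zephyr', 100.00, '2017-01-14'),
       ('Adam Zephyr', 200.00, '2017-01-17'),
       ('Adam Zephyr', 150.00, '2017-01-20'),
       ('Braden Yu',    50.00, '2017-01-11'),
       ('Braden Yu',    75.00, '2017-01-19');

-- Sum only the rows that come after the current one for the same donor.
-- The last donation per donor has no following rows, so ISNULL turns NULL into 0.
SELECT Name,
       Amount,
       DonationDate,
       ISNULL(SUM(Amount) OVER (PARTITION BY Name
                                ORDER BY DonationDate
                                ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS AmountAfter
FROM @Donors
ORDER BY Name, DonationDate;
-- Adam Zephyr: 350.00, 150.00, 0.00; Braden Yu: 75.00, 0.00
```

If a donor can donate twice on the same date, adding a tiebreaker such as ID to the ORDER BY keeps the result deterministic.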

T-SQL if value exists use it other wise use the value before

I have the following table
Account#  Period  Balance
12345     200901  $11554
12345     200902  $4353
12345     201004  $34
12345     201005  $44
12345     201006  $1454
45677     200901  $14454
45677     200902  $1478
45677     201004  $116776
45677     201005  $996
56789     201006  $1567
56789     200901  $7894
56789     200902  $123
56789     201003  $543345
56789     201005  $114
56789     201006  $54
I want to select the account# that have a period of 201005.
This is fairly easy using the code below. The problem is that if a user enters 201003, which doesn't exist, I want the query to select the previous value. Note that there is an account# that has a 201003 period, and I still want to select it too.
I tried CASE, IF ELSE, and IN, but I was unsuccessful.
PS:I cannot create temp tables due to system limitations of 5000 rows.
Thank you.
DECLARE @INPUTPERIOD INT
SET @INPUTPERIOD = 201005
SELECT ACCOUNT#, PERIOD , BALANCE
FROM TABLE1
WHERE PERIOD = @INPUTPERIOD
SELECT t.ACCOUNT#, t.PERIOD, t.BALANCE
FROM (SELECT ACCOUNT#, MAX(PERIOD) AS MaxPeriod
FROM TABLE1
WHERE PERIOD <= @INPUTPERIOD
GROUP BY ACCOUNT#) q
INNER JOIN TABLE1 t
ON q.ACCOUNT# = t.ACCOUNT#
AND q.MaxPeriod = t.PERIOD
select top 1 account#, period, balance
from table1
where period <= @inputperiod
order by period desc
; WITH Base AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE Period <= 201003
)
SELECT * FROM Base WHERE RN = 1
Using a CTE and ROW_NUMBER(): we take all the rows with Period <= the selected period and keep the top one (the one with the auto-generated ROW_NUMBER() = 1).
; WITH Base AS
(
SELECT *, 1 AS RN FROM #MyTable WHERE Period = 201003
)
, Alternative AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE NOT EXISTS(SELECT 1 FROM Base) AND Period < 201003
)
, Final AS
(
SELECT * FROM Base
UNION ALL
SELECT * FROM Alternative WHERE RN = 1
)
SELECT * FROM Final
This one is a lot more complex but does nearly the same thing. It is more "imperative-like". It first tries to find a row with the exact Period, and if it doesn't exist, it does the same thing as before. At the end it unites the two result sets (one of the two is always empty). I would always use the first one, unless profiling showed me that SQL Server wasn't able to comprehend what I'm trying to do. Then I would try the second one.
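The ROW_NUMBER() queries above return a single row overall. If the fallback should apply per account, as the question's sample suggests, one possible variation is to partition the numbering by account. This is a sketch only: the table variable @Table1 holds a subset of the question's rows, Balance is stored as a plain integer, and the names Ranked and @InputPeriod are chosen for illustration.

```sql
DECLARE @InputPeriod int = 201003;

-- Subset of the question's sample rows.
DECLARE @Table1 TABLE ([Account#] int, Period int, Balance int);

INSERT INTO @Table1 ([Account#], Period, Balance)
VALUES (12345, 200901, 11554), (12345, 200902, 4353), (12345, 201004, 34),
       (45677, 200901, 14454), (45677, 200902, 1478), (45677, 201004, 116776),
       (56789, 200901, 7894),  (56789, 200902, 123),  (56789, 201003, 543345);

WITH Ranked AS (
    -- Number each account's rows from the latest period at or before the input.
    SELECT [Account#], Period, Balance,
           ROW_NUMBER() OVER (PARTITION BY [Account#] ORDER BY Period DESC) AS RN
    FROM @Table1
    WHERE Period <= @InputPeriod
)
SELECT [Account#], Period, Balance
FROM Ranked
WHERE RN = 1
ORDER BY [Account#];
-- 12345 gets 200902 (4353), 45677 gets 200902 (1478), 56789 gets 201003 (543345).
```

An account with an exact match gets that row, and every other account falls back to its most recent earlier period, without temp tables.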