Split set into uneven percentage buckets - tsql

Every day I am returned a set of x rows (between 5 and 2000).
I need to update a column in this set based on percentage rules. I think this (not exactly working) example demonstrates the idea:
/*
35% a
25% b
30% c
10% null
*/
WITH tally
(vals, updateThis, bucket)
AS
(
SELECT
DATEADD(DAY, - ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), GETDATE())
, NULL
, NTILE(100) OVER (ORDER BY (SELECT NULL))
FROM
(
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS a(n)
CROSS JOIN (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS b(n)
CROSS JOIN (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS c(n)
)
--UPDATE
--SET updateThis
, updated
AS
(
SELECT
t.vals
, CASE
WHEN t.bucket <= 35 THEN 'a'
WHEN t.bucket > 35 AND t.bucket <=60 THEN 'b'
WHEN t.bucket > 60 AND t.bucket <=90 THEN 'c'
WHEN t.bucket > 60 AND t.bucket <=90 THEN 'NULL'
END AS updated
, t.bucket
FROM tally t
)
SELECT
U.updated
, COUNT(1) AS actual
FROM
updated u
GROUP BY U.updated
This solution is not precise and it might not update all the rows, even if a + b + c added up to 100%. It also wouldn't work for sets smaller than 100 rows.
My current working solution is:
Calculate the total rows.
Calculate the actual rows needed: CEILING((#totalRows * ratio) / 100).
Update the final set in a WHILE loop, selecting the current value and the rows needed (roughly as in the sketch below).
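For illustration, the loop approach looks something like this sketch (the table dbo.MyTable with columns vals and updateThis, and the 35/25/30 ratios, are my assumptions; CHOOSE needs SQL Server 2012+):
DECLARE @totalRows  INT = (SELECT COUNT(*) FROM dbo.MyTable);
DECLARE @i          INT = 1;
DECLARE @val        CHAR(1);
DECLARE @ratio      INT;
DECLARE @rowsNeeded INT;

WHILE @i <= 3
BEGIN
    -- pick the current value and its ratio
    SET @val        = CHOOSE(@i, 'a', 'b', 'c');
    SET @ratio      = CHOOSE(@i, 35, 25, 30);
    SET @rowsNeeded = CEILING((@totalRows * @ratio) / 100.0);

    -- stamp the next @rowsNeeded not-yet-updated rows with the current value
    ;WITH toUpdate AS
    (
        SELECT TOP (@rowsNeeded) updateThis
        FROM dbo.MyTable
        WHERE updateThis IS NULL
        ORDER BY vals
    )
    UPDATE toUpdate SET updateThis = @val;

    SET @i += 1;
END;
-- whatever is still NULL afterwards forms the remaining ~10% bucket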
Is there a better, set-based solution that would help me get rid of the loop?

I don't know if I get this correctly...
First of all there seems to be a rather obvious mistake here:
WHEN t.bucket > 60 AND t.bucket <=90 THEN 'NULL'
Shouldn't this be:
WHEN t.bucket >90 THEN 'NULL'
The NTILE function will spread your set into rather even buckets. Check my output to see how it behaves in the corner cases. I suggest using a computed percentage per row, like here:
WITH tally
(vals, bucket)
AS
(
SELECT
DATEADD(DAY, - ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), GETDATE())
,NTILE(100) OVER (ORDER BY (SELECT NULL))
FROM
(
VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS a(n)
CROSS JOIN (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS b(n)
CROSS JOIN (VALUES (0), (0), (0), (0), (0), (0), (0), (0), (0)) AS c(n)
)
SELECT *
INTO #tmpBuckets
FROM Tally;
-- I use this #tmpBuckets table to get closer to your "I have a table" scenario
WITH Numbered AS
(
SELECT *
,ROW_NUMBER() OVER(ORDER BY vals DESC) / ((SELECT COUNT(*) FROM #tmpBuckets)/100.0) AS RunningPercentage
FROM #tmpBuckets
)
,ComputeBuckets AS
(
SELECT
t.*
, CASE
WHEN t.RunningPercentage <= 35 THEN 'a'
WHEN t.RunningPercentage > 35 AND t.RunningPercentage <=60 THEN 'b'
WHEN t.RunningPercentage > 60 AND t.RunningPercentage <=90 THEN 'c'
WHEN t.RunningPercentage >90 THEN 'NULL'
END AS ShnugoMethod
, CASE
WHEN t.bucket <= 35 THEN 'a'
WHEN t.bucket > 35 AND t.bucket <=60 THEN 'b'
WHEN t.bucket > 60 AND t.bucket <=90 THEN 'c'
WHEN t.bucket > 90 THEN 'NULL'
END AS ZikatoMethod
FROM Numbered t
)
SELECT cb.*
FROM ComputeBuckets cb
ORDER BY cb.vals DESC
GO
DROP TABLE #tmpBuckets;
I think you know how to use such a CTE to update the source table. Otherwise just come back with another question :-)
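In case it helps, here is a minimal sketch of such an update (the table name dbo.MyTable and its columns vals and updateThis are assumptions standing in for your real table; thresholds as in the question):
WITH Numbered AS
(
    SELECT updateThis
         , ROW_NUMBER() OVER (ORDER BY vals DESC) / (COUNT(*) OVER () / 100.0) AS RunningPercentage
    FROM dbo.MyTable
)
UPDATE Numbered
SET updateThis = CASE
                     WHEN RunningPercentage <= 35 THEN 'a'
                     WHEN RunningPercentage <= 60 THEN 'b'
                     WHEN RunningPercentage <= 90 THEN 'c'
                     ELSE NULL    -- the remaining ~10%
                 END;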

Related

Is there a smarter method to create series with different intervals for counting within a query?

I want to create different intervals:
0 to 10 in steps of 1
10 to 100 in steps of 10
100 to 1,000 in steps of 100
1,000 to 10,000 in steps of 1,000
to query a table for a count of the items.
with "series" as (
(SELECT generate_series(0, 10, 1) AS r_from)
union
(select generate_series(10, 90, 10) as r_from)
union
(select generate_series(100, 900, 100) as r_from)
union
(select generate_series(1000, 9000, 1000) as r_from)
order by r_from
)
, "range" as ( select r_from
, case
when r_from < 10 then r_from + 1
when r_from < 100 then r_from + 10
when r_from < 1000 then r_from + 100
else r_from + 1000
end as r_to
from series)
select r_from, r_to,(SELECT count(*) FROM "my_table" WHERE "my_value" BETWEEN r_from AND r_to) as "Anz."
FROM "range";
I think generate_series is the right way. There is another approach: we can use simple math to calculate the numbers.
SELECT 0 as r_from,1 as r_to
UNION ALL
SELECT power(10, steps ) * v ,
power(10, steps ) * v + power(10, steps )
FROM generate_series(1, 9, 1) v
CROSS JOIN generate_series(0, 3, 1) steps
so the full query might be as below:
with "range" as
(
SELECT 0 as r_from,1 as r_to
UNION ALL
SELECT power(10, steps) * v ,
power(10, steps) * v + power(10, steps)
FROM generate_series(1, 9, 1) v
CROSS JOIN generate_series(0, 3, 1) steps
)
select r_from, r_to,(SELECT count(*) FROM "my_table" WHERE "my_value" BETWEEN r_from AND r_to) as "Anz."
FROM "range";
sqlfiddle
Rather than generate_series you could create defined integer range types (int4range), then test whether your value is included within the range (see Range/Multirange Functions and Operators). So:
with ranges (range_set) as
( values ( int4range(0,10,'[)') )
, ( int4range(10,100,'[)') )
, ( int4range(100,1000,'[)') )
, ( int4range(1000,10000,'[)') )
) --select * from ranges;
select lower(range_set) range_start
, upper(range_set) - 1 range_end
, count(my_value) cnt
from ranges r
left join my_table mt
on (mt.my_value <# r.range_set)
group by r.range_set
order by lower(r.range_set);
Note the 3rd parameter in creating the ranges.
Creating a CTE as above is good if your ranges are static; however, if dynamic ranges are required you can put the ranges into a table (a sketch of such a table follows the query below). Changing ranges then becomes a matter of managing the table: not simpler, but it does not require code updates. The query then reduces to just the main part of the above:
select lower(range_set) range_start
, upper(range_set) - 1 range_end
, count(my_value) cnt
from range_tab r
left join my_table mt
on (mt.my_value <# r.range_set)
group by r.range_set
order by lower(r.range_set);
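For the dynamic variant, the range table itself could be set up roughly like this (a sketch; the names range_tab and range_set are simply the ones used in the query above):
create table range_tab (range_set int4range not null);

insert into range_tab (range_set)
values (int4range(0, 10, '[)'))
     , (int4range(10, 100, '[)'))
     , (int4range(100, 1000, '[)'))
     , (int4range(1000, 10000, '[)'));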
See demo for both here.

Redshift dist key: identity column or join column? Column cardinality vs. use in joins when choosing a sort key

I have a table that has an Identity column called ID and another column called DateID that references another table.
The date column is used in joins but the ID column has much more cardinality.
Distinct count for ID column : 657167
Distinct count for DateID column: 350
Can anyone please provide any insights as to which column would be a better choice for distribution key?
Also, regarding another question (merged into this one, as they are related):
I have a dilemma in selecting sort and dist keys for my table.
Sort keys:
Should I consider cardinality when selecting a sort key?
A column that would be joined with other tables would be a candidate for a sort key; is my assumption correct?
If I use a compound sort key with two columns, does the order of the columns matter?
If I define the column DateID as the dist key, should I put DateID in front of customerId when defining the compound sort key?
P.S. I read some articles about choosing a dist key, and they say I should use a column that is used in joins with other tables and has higher cardinality.
SELECT SP.*,
CP.*,
TV.*
FROM
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f -- contains about 600K records
JOIN DimDate d -- contains about 700 records
ON f.DateID = d.DateID
JOIN DimTime t -- contains 24 records
ON f.TimeID = t.HourID
JOIN DimSalesBranch s -- contains about 64K records
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-12-31'
AND StartHour >= 9
AND starthour > 0
AND (EndHour <= 22)
) SP
LEFT JOIN
(
SELECT * --> there are about 20 aggregation statements in the select statement
FROM FactCustomer f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimTime t
ON f.TimeID = t.HourID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
AND StartHour >= 9
AND (EndHour <= 22)
) CP
ON SP.StartDate = CP.StartDate_CP
AND SP.EndDate = CP.EndDate_CP
LEFT JOIN
(
SELECT * --> there are about 6 aggregation statements in the select statement
FROM FactSalesTargetBranch f
JOIN DimDate d
ON f.DateID = d.DateID
JOIN DimSalesBranch s
ON f.BranchID = s.BranchID
WHERE s.BranchID IN ( 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 )
AND d.DateTimeInfo >= (CASE
WHEN s.OpeningDate > '2018-01-01' THEN
s.OpeningDate
ELSE
'2018-01-01'
END
)
AND d.DateTimeInfo <= '2018-09-16'
) TV
ON SP.StartDate = TV.StartDate_TV
AND SP.EndDate = TV.EndDate_TV;
Any insights much appreciated.
Regards.
In this case:
Use "even" distribution for your main table; this will allow good parallelism (dateid would be a bad candidate).
Use "all" distribution for your dateid table (the smaller table that you join with).
Generally, "even" distribution is a good choice and will give you the best results unless you need to join large tables together.
see https://docs.aws.amazon.com/redshift/latest/dg/c_choosing_dist_sort.html
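As a rough sketch of what that could look like in the DDL (column lists abbreviated; the compound sort key of DateID then BranchID is only an assumption based on the filters in your query):
CREATE TABLE FactCustomer
(
    ID       BIGINT IDENTITY(1,1),
    DateID   INT,
    TimeID   INT,
    BranchID INT
    -- ... measure columns ...
)
DISTSTYLE EVEN                        -- spread the large fact table evenly across slices
COMPOUND SORTKEY (DateID, BranchID);  -- range-restricted filter column first

CREATE TABLE DimDate
(
    DateID       INT,
    DateTimeInfo TIMESTAMP
)
DISTSTYLE ALL                         -- replicate the small dimension to every node
SORTKEY (DateID);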

Postgresql Query Results in Division by 0 After Use of Case to Check for 0

The following query uses a subquery so that a weighted value can be calculated. The problem I am running into is a division-by-zero error that occurs seemingly at random, both for aggregates that are truly 0 and, possibly, for aggregates greater than 0 returned from the subquery.
SELECT
table1.id,
SUM(subquery1.total_value_1),
CASE
WHEN SUM(subquery1.total_value_1) = 0 THEN 0
ELSE ROUND(SUM(percentage_value * (table1.value_1 /subquery1.total_value_1 ::FLOAT)) ::NUMERIC,2)
END AS percentage_value
FROM
table1,
(SELECT
id,
SUM(value_1) AS total_value_1
FROM
table1
WHERE
report_time BETWEEN '2016-10-28 00:00' AND '2016-10-29 23:59'
GROUP BY
id
) subquery1
WHERE
table1.id = subquery1.id
AND report_time BETWEEN '2016-10-28 00:00' AND '2016-10-29 23:59'
AND table1.id = 12572
GROUP BY
table1.id
ORDER BY
table1.id
In some instances, the CASE statement still ends up evaluating the division despite the value of subquery1.total_value_1 being 0. Just to note, there is no possibility of subquery1.total_value_1 being NULL, as the table defaults this value to 0 on insert if the value added is not defined.
In the example below, sum(column) is 1 for both rows, while the column itself is equal to zero or one:
a=# with v as (
select generate_series(0,1,1) al
)
select sum(v.al) over(),v.al
from v;
sum | al
-----+----
1 | 0
1 | 1
(2 rows)
So in your query SUM(subquery1.total_value_1) can be non-zero while an individual subquery1.total_value_1::FLOAT on some row is zero; this is how you get the division by zero.
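One way around it is to guard the denominator on each row instead of testing the aggregate. A sketch of the same query (NULLIF turns a zero denominator into NULL, the division then yields NULL, and SUM simply ignores those rows):
SELECT
    table1.id,
    SUM(subquery1.total_value_1),
    ROUND(SUM(percentage_value * (table1.value_1 / NULLIF(subquery1.total_value_1, 0)::FLOAT))::NUMERIC, 2) AS percentage_value
FROM
    table1,
    (SELECT id, SUM(value_1) AS total_value_1
     FROM table1
     WHERE report_time BETWEEN '2016-10-28 00:00' AND '2016-10-29 23:59'
     GROUP BY id
    ) subquery1
WHERE
    table1.id = subquery1.id
    AND report_time BETWEEN '2016-10-28 00:00' AND '2016-10-29 23:59'
    AND table1.id = 12572
GROUP BY table1.id
ORDER BY table1.id;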

T-SQL Counting Distinct Rows to Include Zeroes as Unique

I have the following test script :
DECLARE #Test TABLE (number INT)
INSERT INTO #Test VALUES (6)
INSERT INTO #Test VALUES (6)
INSERT INTO #Test VALUES (6)
INSERT INTO #Test VALUES (2)
INSERT INTO #Test VALUES (2)
INSERT INTO #Test VALUES (0)
INSERT INTO #Test VALUES (0)
INSERT INTO #Test VALUES (0)
INSERT INTO #Test VALUES (0)
INSERT INTO #Test VALUES (0)
SELECT * FROM #Test
SELECT count(*) FROM #Test GROUP BY number
Results
number
6
6
6
2
2
0
0
0
0
0
(No column name)
5
2
3
I'm trying to get a count of 7, i.e. distinct for the 6's and 2's, but with each zero counted as unique.
The simplest way I came up with is this:
SELECT COUNT(DISTINCT NULLIF(Number, 0)) + SUM(CASE WHEN Number = 0 THEN 1 END)
FROM #Test
The NULLIF makes the COUNT ignore numbers that are equal to 0, the DISTINCT is responsible for counting each number only once, and the SUM with the CASE is calculating the number of 0 records.
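One small variation, only needed if the table might contain no zeros at all (in that case the SUM(CASE ...) above returns NULL), is to wrap the second term in ISNULL:
SELECT COUNT(DISTINCT NULLIF(Number, 0)) + ISNULL(SUM(CASE WHEN Number = 0 THEN 1 END), 0)
FROM #Test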
Not exactly sure why you would want this but you could do both queries separately and perform a UNION ALL to combine the results.
Test Data
CREATE TABLE #TestData (Number int)
INSERT INTO #TestData (Number)
VALUES
(6), (6), (6), (2), (2), (0), (0), (0), (0), (0)
Query
SELECT DISTINCT
Number
FROM #TestData
WHERE Number <> 0
UNION ALL
SELECT
Number
FROM #TestData
WHERE Number = 0
Results
Number
2
6
0
0
0
0
0
If you want to return the number 7, then just wrap this in an outer query like this:
SELECT
COUNT(1) FinalCount
FROM
(
SELECT DISTINCT
Number
FROM #TestData
WHERE Number <> 0
UNION ALL
SELECT
Number
FROM #TestData
WHERE Number = 0
) a
Result
FinalCount
7
try this:
DECLARE #Test TABLE (number INT)
INSERT INTO #Test VALUES (6), (6), (6), (2), (2), (0), (0), (0), (0) ,(0)
SELECT COUNT(DISTINCT NewNumber)
FROM (SELECT number,
(CASE WHEN number = 0 THEN ROW_NUMBER() OVER(ORDER BY number) * RAND()
ELSE number END) AS NewNumber
FROM #Test) AS T
Output result:
7
If you add GROUP BY [number] you will receive:
number cnt
0 5
2 1
6 1

Check for equal amounts of negative numbers as positive numbers

I have a table with two columns: intGroupID, decAmount
I want to have a query that can basically return the intGroupID as a result if for every positive(+) decAmount, there is an equal and opposite negative(-) decAmount.
So a table of (id=1,amount=1.0),(1,2.0),(1,-1.0),(1,-2.0) would return back the intGroupID of 1, because for each positive number there exists a negative number to match.
What I know so far is that there must be an equal number of decAmounts (so I enforce a count(*) % 2 = 0) and the sum of all rows must = 0.0. However, some cases that get by that logic are:
ID | Amount
1 | 1.0
1 | -1.0
1 | 2.0
1 | -2.0
1 | 3.0
1 | 2.0
1 | -4.0
1 | -1.0
This has a sum of 0.0 and has an even number of rows, but there is not a 1-for-1 relationship of positives to negatives. I need a query that can basically tell me if there is a negative amount for each positive amount, without reusing any of the rows.
I tried counting the distinct absolute values of the numbers and enforcing that it is less than the count of all rows, but it's not catching everything.
The code I have so far:
DECLARE #tblTest TABLE(
intGroupID INT
,decAmount DECIMAL(19,2)
);
INSERT INTO #tblTest (intGroupID ,decAmount)
VALUES (1,-1.0),(1,1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0);
DECLARE #intABSCount INT = 0
,#intFullCount INT = 0;
SELECT #intFullCount = COUNT(*) FROM #tblTest;
SELECT #intABSCount = COUNT(*) FROM (
SELECT DISTINCT ABS(decAmount) AS absCount FROM #tblTest GROUP BY ABS(decAmount)
) AS absCount
SELECT t1.intGroupID
FROM #tblTest AS t1
/* Make Sure Even Number Of Rows */
INNER JOIN
(SELECT COUNT(*) AS intCount FROM #tblTest
)
AS t2 ON t2.intCount % 2 = 0
/* Make Sure Sum = 0.0 */
INNER JOIN
(SELECT SUM(decAmount) AS decSum FROM #tblTest)
AS t3 ON decSum = 0.0
/* Make Sure Count of Absolute Values < Count of Values */
WHERE
#intABSCount < #intFullCount
GROUP BY t1.intGroupID
I think there is probably a better way to check this table, possibly by finding pairs and removing them from the table and seeing if there's anything left in the table once there are no more positive/negative matches, but I'd rather not have to use recursion/cursors.
Create TABLE #tblTest (
intA INT
,decA DECIMAL(19,2)
);
INSERT INTO #tblTest (intA,decA)
VALUES (1,-1.0),(1,1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0), (5,-5.0),(5,5.0) ;
SELECT * FROM #tblTest;
SELECT
intA
, MIN(Result) as IsBalanced
FROM
(
SELECT intA, X,Result =
CASE
WHEN count(*)%2 = 0 THEN 1
ELSE 0
END
FROM
(
---- Start thinking here --- inside-out
SELECT
intA
, x =
CASE
WHEN decA < 0 THEN
-1 * decA
ELSE
decA
END
FROM #tblTest
) t1
Group by intA, X
)t2
GROUP BY intA
Not tested, but I think you can get the idea.
This returns the ids that do not conform.
The negative ("does not conform") case is easier to test / debug:
select pos.*, neg.*
from
( select id, amount, count(*) as ccount
from tbl
where amount > 0
group by id, amount ) pos
full outer join
( select id, amount, count(*) as ccount
from tbl
where amount < 0
group by id, amount ) neg
on pos.id = neg.id
and pos.amount = -neg.amount
and pos.ccount = neg.ccount
where pos.id is null
or neg.id is null
I think this will return a list of the ids that do conform:
select distinct(id) from tbl
except
select distinct(isnull(pos.id, neg.id))
from
( select id, amount, count(*) as ccount
from tbl
where amount > 0
group by id, amount ) pos
full outer join
( select id, amount, count(*) as ccount
from tbl
where amount < 0
group by id, amount ) neg
on pos.id = neg.id
and pos.amount = -neg.amount
and pos.ccount = neg.ccount
where pos.id is null
or neg.id is null
Boy, I found a simpler way to do this than my previous answers. I hope all my crazy edits are saved for posterity.
This works by grouping all numbers for an id by their absolute value (1, -1 grouped by 1).
The sum of the group determines if there are an equal number of pairs. If it is 0 then the group is balanced; any other value for the sum means there is an imbalance.
The detection of evenness by the COUNT aggregate is only necessary to detect an even number of zeros. I assumed that 0's could exist and they should occur an even number of times. Remove it if this isn't a concern, as 0 will always pass the first test.
I rewrote the query a bunch of different ways to get the best execution plan. The final result below only has one big heap sort which was unavoidable given the lack of an index.
Query
WITH tt AS (
SELECT intGroupID,
CASE WHEN SUM(decAmount) <> 0 OR COUNT(*) % 2 = 1 THEN 1 ELSE 0 END unequal
FROM #tblTest
GROUP BY intGroupID, ABS(decAmount)
)
SELECT tt.intGroupID,
CASE WHEN SUM(unequal) != 0 THEN 'not equal' ELSE 'equal' END [pair]
FROM tt
GROUP BY intGroupID;
Tested Values
(1,-1.0),(1,1.0),(1,2),(1,-2), -- should work
(2,-1.0),(2,1.0),(2,2),(2,2), -- fail, two positive twos
(3,1.0),(3,1.0),(3,-1.0), -- fail two 1's , one -1
(4,1),(4,2),(4,-.5),(4,-2.5), -- fail: adds up the same sum, but different values
(5,1),(5,-1),(5,0),(5,0), -- work, test zeros
(6,1),(6,-1),(6,0), -- fail, test zeros
(7,1),(7,-1),(7,-1),(7,1),(7,1) -- fail, 3 x 1
Results
A pairs
_ _____
1 equal
2 not equal
3 not equal
4 not equal
5 equal
6 not equal
7 not equal
The following should return the "unbalanced" groups:
;with pos as (
select intGroupID, ABS(decAmount) m
from TableName
where decAmount > 0
), neg as (
select intGroupID, ABS(decAmount) m
from TableName
where decAmount < 0
)
select distinct IsNull(p.intGroupID, n.intGroupID) as intGroupID
from pos p
full join neg n on n.intGroupID = p.intGroupID and abs(n.m - p.m) < 1e-8
where p.m is NULL or n.m is NULL
To get the unpaired elements, the select statement can be changed to the following:
select IsNull(p.intGroupID, n.intGroupID) as intGroupID, IsNull(p.m, -n.m) as decAmount
from pos p
full join neg n on n.intGroupID = p.intGroupID and abs(n.m - p.m) < 1e-8
where p.m is NULL or n.m is NULL
Does this help?
-- Expected result - group 1 and 3
declare #matches table (groupid int, value decimal(5,2))
insert into #matches select 1, 1.0
insert into #matches select 1, -1.0
insert into #matches select 2, 2.0
insert into #matches select 2, -2.0
insert into #matches select 2, -2.0
insert into #matches select 3, 3.0
insert into #matches select 3, 3.5
insert into #matches select 3, -3.0
insert into #matches select 3, -3.5
insert into #matches select 4, 4.0
insert into #matches select 4, 4.0
insert into #matches select 4, -4.0
-- Get groups where we have matching positive/negatives, with the same number of each
select mat.groupid, min(case when pos.PositiveCount = neg.NegativeCount then 1 else 0 end) as 'Match'
from #matches mat
LEFT JOIN (select groupid, SUM(1) as 'PositiveCount', Value
from #matches where value > 0 group by groupid, value) pos
on pos.groupid = mat.groupid and pos.value = ABS(mat.value)
LEFT JOIN (select groupid, SUM(1) as 'NegativeCount', Value
from #matches where value < 0 group by groupid, value) neg
on neg.groupid = mat.groupid and neg.value = case when mat.value < 0 then mat.value else mat.value * -1 end
group by mat.groupid
-- If at least one pair within a group don't match, reject
having min(case when pos.PositiveCount = neg.NegativeCount then 1 else 0 end) = 1
You can compare your values this way:
declare #t table(id int, amount decimal(4,1))
insert #t values(1,1.0),(1,-1.0),(1,2.0),(1,-2.0),(1,3.0),(1,2.0),(1,-4.0),(1,-1.0),(2,-1.0),(2,1.0)
;with a as
(
select count(*) cnt, id, amount
from #t
group by id, amount
)
select id from #t
except
select b.id from a
full join a b
on a.cnt = b.cnt and a.amount = -b.amount
where a.id is null
For some reason I can't write comments. However, Daniel's comment is not correct, and my solution does accept (6,1),(6,-1),(6,0), which can be correct: 0 is not specified in the question, and since it is a zero value it can be handled either way. My answer does NOT accept (3,1.0),(3,1.0),(3,-1.0).
To Blam: No, I am not missing "or b.id is null". My solution is like yours, but not exactly identical.