I have this table
person outcome
Peter positive
Peter positive
Peter positive
Eric positive
Eric positive
Eric negative
and want to count, for each person, the number of rows with a positive/negative outcome.
select person, outcome, count(*)
from public.test123
group by person, outcome
person outcome count
Peter positive 3
Eric positive 2
Eric negative 1
But I also want a zero count for Peter negative. I've seen answers that solve this with a join, but I have nothing to join the table to.
How can I GROUP BY, count, and include zeros?
person outcome count
Peter positive 3
Peter negative 0
Eric positive 2
Eric negative 1
create table public.test123 (
person VARCHAR(20),
outcome VARCHAR(20));
insert into public.test123(person, outcome)
VALUES
('Peter', 'positive'),
('Peter', 'positive'),
('Peter', 'positive'),
('Eric', 'positive'),
('Eric', 'positive'),
('Eric', 'negative');
Step-by-step demo: db<>fiddle
SELECT
s.person,
s.outcome,
SUM((t.outcome IS NOT NULL)::int) as cnt -- 4
FROM (
SELECT
*
FROM unnest(ARRAY['positive', 'negative']) as x(outcome), -- 1
(
SELECT DISTINCT -- 2
person
FROM test123
) s
) s
LEFT JOIN test123 t ON t.person = s.person AND t.outcome = s.outcome -- 3
GROUP BY s.person, s.outcome
1. Create a list of all possible outcome values.
2. Join it with all possible person values. Now you have a Cartesian table of all possible combinations.
3. Use this to LEFT JOIN your original table.
4. Count all non-NULL values for each combination (here via SUM() over a flag that is 1 for non-NULL values and 0 otherwise).
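For comparison, here is a minimal sketch of the same idea using a VALUES list instead of unnest() (PostgreSQL; COUNT over the nullable joined column skips NULLs, which produces the zeros):

SELECT p.person, o.outcome, COUNT(t.outcome) AS cnt
FROM (SELECT DISTINCT person FROM test123) p
CROSS JOIN (VALUES ('positive'), ('negative')) o(outcome)
LEFT JOIN test123 t ON t.person = p.person AND t.outcome = o.outcome
GROUP BY p.person, o.outcome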
I have two tables: a_table and b_table. They contain closing records and checkout records, which for each customer can occur on different dates. I would like to combine these two tables so that there is only one date field, one customer field, one close field, and one check field.
a_table
time_modified customer_name
2021-05-03 Ben
2021-05-08 Ben
2021-07-10 Jerry
b_table
time_modified account_id
2021-05-06 Ben
2021-07-08 Jerry
2021-07-12 Jerry
Expected result
date account_id_a close check
2021-05-03 Ben 1 0
2021-05-06 Ben 0 1
2021-05-08 Ben 1 0
2021-07-08 Jerry 0 1
2021-07-10 Jerry 1 1
2021-07-12 Jerry 0 1
The query so far:
with a_table as (
select rz.time_modified::date, rz.customer_name,
case when rz.time_modified::date is not null then 1 else 0 end as close
from schema.rz
),
b_table as (
select bo.time_modified::date, bo.customer_name,
case when bo.time_modified::date is not null then 1 else 0 end as check
from schema.bo
)
SELECT (CURRENT_DATE::TIMESTAMP - (i * interval '1 day'))::date as date,
a.*, b.*
FROM generate_series(1,2847) i
left join a_table a
on a.time_modified = i.date
left join b_table b
on b.time_modified = i.date
The query above returns:
SQL Error [500310] [0A000]: [Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
You just need to do a UNION rather than a JOIN.
A join merges two tables side by side into wider rows, whereas a union appends the rows of the second table to the first.
First off, the error you are getting is due to the use of the generate_series() function in a query where its results need to be combined with table data. generate_series() is a leader-node-only function and its results cannot be used on the compute nodes. You will need to generate the number series you want in another way; see How to Generate Date Series in Redshift for possible ways to do this.
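For example, one common workaround is to derive the series from row_number() over any table that has enough rows; a hedged sketch (my_big_table is a hypothetical stand-in, not a table from the question):

WITH seq AS (
    SELECT row_number() OVER () AS i
    FROM my_big_table   -- hypothetical: any table with at least 2847 rows
    LIMIT 2847
)
SELECT (CURRENT_DATE::TIMESTAMP - (i * INTERVAL '1 day'))::date AS date
FROM seq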
I'm not sure I follow your query entirely, but it seems like you want to UNION the tables, not JOIN them. You haven't defined what rz and bo are, so it is a bit confusing. However, UNION plus some calculation for close and check seems like the way to go.
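A hedged sketch of that approach, assuming the real sources are schema.rz and schema.bo as in the question (check is a reserved word in Redshift, hence the quoting):

SELECT time_modified::date AS date,
       customer_name AS account_id_a,
       MAX(close_flag) AS close,
       MAX(check_flag) AS "check"
FROM (
    SELECT time_modified, customer_name, 1 AS close_flag, 0 AS check_flag
    FROM schema.rz
    UNION ALL
    SELECT time_modified, customer_name, 0 AS close_flag, 1 AS check_flag
    FROM schema.bo
) u
GROUP BY 1, 2
ORDER BY 1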
Most databases have a built-in function for calculating the median, but I don't see anything for median in Amazon Redshift.
You could calculate the median using a combination of the nth_value() and count() analytic functions, but that seems janky. I would be very surprised if an analytics DB didn't have a built-in method for computing the median, so I'm assuming I'm missing something.
http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
And as of 2014-10-17, Redshift supports the MEDIAN window function:
# select min(median) from (select median(num) over () from temp);
min
-----
4.0
Try the NTILE function.
You would divide your data into 2 ranked groups and pick the minimum value from the first group. That's because in datasets with an odd number of values, the first ntile will have 1 more value than the second. This approximation should work very well for large datasets.
create table temp (num smallint);
insert into temp values (1),(5),(10),(2),(4);
select num, ntile(2) over(order by num desc) from temp ;
num | ntile
-----+-------
10 | 1
5 | 1
4 | 1
2 | 2
1 | 2
select min(num) as median from (select num, ntile(2) over(order by num desc) from temp) where ntile = 1;
median
--------
4
I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.
They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.
# select count(num), min(median) as median
from
(select num, percentile_cont (0.5) within group (order by num) over () as median from temp);
count | median
-------+--------
5 | 4.0
(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)
In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, where DISC(rete) will pick one of them.
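For illustration, a hedged sketch assuming temp holds an even set of values, say 1, 2, 4, 5, 9, 10:

# select min(med_cont) as median_cont,   -- 4.5: interpolated between 4 and 5
         min(med_disc) as median_disc    -- 4: an actual value from the set
  from
  (select percentile_cont(0.5) within group (order by num) over () as med_cont,
          percentile_disc(0.5) within group (order by num) over () as med_disc
   from temp);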
I typically use the NTILE function to split the data into two groups if I’m looking for an answer that’s close enough. However, if I want the exact median (e.g. the midpoint of an even set of rows), I use a technique suggested on the AWS Redshift Discussion Forum.
This technique numbers the rows in both ascending and descending order; if there is an odd number of rows, it returns the average of the middle row (that is, where row_num_asc = row_num_desc), which is simply the middle row itself.
CREATE TABLE temp (num SMALLINT);
INSERT INTO temp VALUES (1),(5),(10),(2),(4);
SELECT
AVG(num) AS median
FROM
(SELECT
num,
SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
FROM
temp) AS ordered
WHERE
row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);
median
--------
4
If there is an even number of rows, it returns the average of the two middle rows.
INSERT INTO temp VALUES (9);
SELECT
AVG(num) AS median
FROM
(SELECT
num,
SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
FROM
temp) AS ordered
WHERE
row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);
median
--------
4.5
I have a pretty simple chart with a likely common issue. I've searched for several hours on the interweb but only got so far in finding a similar situation.
The basics of what I'm pulling contain a created_by, a person_id, and a risk score.
The risk score can be:
1 VERY LOW
2 LOW
3 MODERATE STABLE
4 MODERATE AT RISK
5 HIGH
6 VERY HIGH
I want to get a headcount of persons at each risk score, and display a risk count even if there is a count of 0 for that risk score, but SSRS 2005 likes to suppress zero counts.
I've tried this in the point labels:
=IIF(IsNothing(Count(Fields!person_id.Value)),0,Count(Fields!person_id.Value))
Ex: I'm missing values for "1 LOW" because the creator hasn't assigned any risk scores at that level.
Here's a screenshot of what I get, but I'd like to have a column for a count even when it doesn't exist in the returned results.
@Nathan
Example scenario:
select professor.name, grades.score, student.person_id
from student
inner join grades on student.person_id = grades.person_id
inner join professor on student.professor_id = professor.professor_id
where
student.professor_id = @professor
Not all students are necessarily in the grades table.
I have a =Count(Fields!person_id.Value) for my data points & series is grouped on =Fields!score.Value
If there were a bunch of A, B, and D grades but no C's or F's, how would I show labels for the potentially non-existent counts?
In your example, the problem is that no results are returned for grades that are not linked to any students. To solve this, ideally there would be a table in your source system listing all the possible values of "score" (e.g. A - F), which you would join into your query so that at least one row was returned for each possible value.
If such a table doesn't exist and the possible score values are known and static, then you could manually create a list of them in your query. In the example below I create a subquery that returns a combination of all professors and all possible scores (A - F) and then LEFT join this to the grades and students tables (left join means that the professor/score rows will be returned even if no students have those scores in the "grades" table).
SELECT
professor.name
, professorgrades.score
, student.person_id
FROM
(
SELECT professor_id, score
FROM professor
CROSS JOIN
(
SELECT 'A' AS score
UNION
SELECT 'B'
UNION
SELECT 'C'
UNION
SELECT 'D'
UNION
SELECT 'E'
UNION
SELECT 'F'
) availablegrades
) professorgrades
INNER JOIN professor ON professorgrades.professor_id = professor.professor_id
LEFT JOIN grades ON professorgrades.score = grades.score
LEFT JOIN student ON grades.person_id = student.person_id AND
professorgrades.professor_id = student.professor_id
WHERE professorgrades.professor_id = 1
See a live example of how this works here: SQLFIDDLE
Another option: union a real row (count 1) per person with a zero-count row per risk score, so every score survives the GROUP BY:
SELECT RS.RiskScoreId, RS.Description, SUM(DT.RiskCount) AS RiskCount
FROM (
SELECT RiskScoreId, 1 AS RiskCount
FROM People
UNION ALL
SELECT RiskScoreId, 0 AS RiskCount
FROM RiskScores
) DT
INNER JOIN RiskScores RS ON RS.RiskScoreId = DT.RiskScoreId
GROUP BY RS.RiskScoreId, RS.Description
ORDER BY RS.RiskScoreId
I have a problem that I believe has a perfectly elegant solution, but would like some help.
So I have a table of persons and a numerical value. Besides that, there is a table with the rules for dividing that value (per person) among multiple accounts; a rule can be either a max value or a percentage of the value.
This is a simplified version of these tables.
Persons(PersonID int, Value decimal)
Account(AccountID int, PersonID int)
Distribution(AccountID int, MaxValue decimal Null, Percentage decimal null)
At some point I need to divide those numerical values to a third table - that holds the account and value divided to that account.
AccountValues(AccountID int, AccountValue decimal)
The count of the accounts (per person) is not fixed. In the distribution table, if both of the distribution values are null, all the leftover value goes to that account.
The order of distribution is by account ID.
The data could look something like this.
Persons table
PersonID Value
1 1000.00
2 2000.00
3 5000.00
4 500.00
Accounts table
AccountID PersonID
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
Distribution table
AccountID MaxValue Percentage
1 500.00 null
2 null null
3 null 0.5
4 null 0.2
5 null null
6 1000.00 null
7 null null
8 2000.00 null
9 null 0.2
10 null null
Still a bit new to T-SQL, so I need help with the simplest and most efficient solution.
So for now I'm thinking of 3 possible solutions.
1. The least elegant - count the max number of accounts per person and do a loop that many times.
2. Cursors - the best way perhaps?
3. CTE recursion (about which I know nothing)
I've used a CTE. There might be a smarter way to do the totalling, but I think this works.
Data setup:
declare @Persons table (PersonID int not null,Value decimal(18,4) not null)
insert into @Persons(PersonID,Value) values
(1,1000.00),
(2,2000.00),
(3,5000.00),
(4,500.00)
declare @Accounts table (AccountID int not null,PersonID int not null)
insert into @Accounts(AccountID,PersonID) values
(1,1),
(2,1),
(3,2),
(4,2),
(5,2),
(6,3),
(7,3),
(8,4),
(9,4),
(10,4)
declare @Distribution table (AccountID int not null,MaxValue decimal(18,4) null,Percentage decimal(6,5) null)
insert into @Distribution (AccountID,MaxValue,Percentage) values
(1,500.00,null),
(2,null,null),
(3,null,0.5),
(4,null,0.2),
(5,null,null),
(6,1000.00,null),
(7,null,null),
(8,2000.00,null),
(9,null,0.2),
(10,null,null)
declare @AccountValues table (AccountID int not null,Value decimal(18,4) null)
Actual query:
;With DisbValues as (
select
a.AccountID,
p.PersonID,
CASE
WHEN d.MaxValue is not null then d.MaxValue
WHEN d.Percentage is not null then d.Percentage * p.Value
END as Value,
p.Value as TotalAvailable
from
@Distribution d
inner join
@Accounts a
on
d.AccountID = a.AccountID
inner join
@Persons p
on
a.PersonID = p.PersonID
), CumulativeValues as (
select
AccountID,
PersonID,
Value,
COALESCE((select SUM(Value) from DisbValues d2 where d2.PersonID = d.PersonID and d2.AccountID < d.AccountID),0) as PrevValue,
TotalAvailable
from
DisbValues d
)
insert into @AccountValues (AccountID,Value)
select
AccountID,
CASE WHEN PrevValue < TotalAvailable THEN
CASE WHEN PrevValue + Value < TotalAvailable THEN Value --Entirely satisfied
ELSE TotalAvailable - PrevValue --Partially satisfied
END
ELSE
0 --Not satisfied
END
from CumulativeValues
The first CTE (DisbValues) eliminates the need to think in terms of percentages (I've assumed that we're working with a percentage of the total value available, not of the remainder left over when trying to satisfy a particular account). The second CTE (CumulativeValues) then adds up all of the values that earlier accounts would require to be filled.
We can then, in the final query, break things down into the 3 cases indicated by the comments. Note that accounts where both MaxValue and Percentage are NULL produce a NULL Value in DisbValues, so the inner comparison is unknown and the query falls through to TotalAvailable - PrevValue, handing those accounts whatever remains.
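Worked through by hand for the sample data above, the final query inserts:

AccountID Value
1 500.00 (capped at MaxValue)
2 500.00 (remainder: both rule columns NULL)
3 1000.00 (50% of 2000)
4 400.00 (20% of 2000)
5 600.00 (remainder)
6 1000.00 (capped at MaxValue)
7 4000.00 (remainder)
8 500.00 (partially satisfied: capped at the person's total)
9 0.00 (not satisfied)
10 0.00 (not satisfied)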
Assume we have a table and we want to do a sum of the Expend column so that the summation only adds up values of the same Week_Name.
SN Week_Name Exp Sum
-- --------- --- ---
1 Week 1 10 0
2 Week 1 20 0
3 Week 1 30 60
4 Week 2 40 0
5 Week 2 50 90
6 Week 3 10 0
I assume we will need to ORDER BY Week_Name, then compare the previous row's Week_Name with the current row's Week_Name.
If both are the same, put zero in the Sum column.
If they are not the same, add up all expenditure where Week_Name matches the previous row's Week_Name and place it in the Sum column. The final output should look like the table above.
Any help on how to achieve this in T-SQL is highly appreciated.
Okay, I was eventually able to resolve this issue, praise Jesus! If you want the exact table I gave above, you can use GilM's response below; it is perfect. If you want your table to have running cumulatives, i.e. row 3 should have 60, row 5 should have 150, row 6 160, etc., then you can use my code below:
USE CAPdb
IF OBJECT_ID ('dbo.[tablebp]') IS NOT NULL
DROP TABLE [tablebp]
GO
CREATE TABLE [tablebp] (
tablebpcCol1 int PRIMARY KEY
,tabledatekey datetime
,tableweekname varchar(50)
,expenditure1 numeric
,expenditure_Cummulative numeric
)
INSERT INTO [tablebp](tablebpcCol1,tabledatekey,tableweekname,expenditure1,expenditure_Cummulative)
SELECT b.s_tablekey,d.PK_Date,d.Week_Name,
SUM(b.s_expenditure1) AS s_expenditure1,
SUM(b.s_expenditure1) + COALESCE((SELECT SUM(s_expenditure1)
FROM source_table bs JOIN dbo.Time dd ON bs.[DATE Key] = dd.[PK_Date]
WHERE dd.PK_Date < d.PK_Date),0)
FROM source_table b
INNER JOIN dbo.Time d ON b.[Date key] = d.PK_Date
GROUP BY d.[PK_Date],d.Week_Name,b.s_tablekey,b.s_expenditure1
ORDER BY d.[PK_Date]
;WITH CTE AS (
SELECT tableweekname
,Max(expenditure_Cummulative) AS Week_expenditure_Cummulative
,MAX(tablebpcCol1) AS MaxSN
FROM [tablebp]
GROUP BY tableweekname
)
SELECT [tablebp].*
,CASE WHEN [tablebp].tablebpcCol1 = CTE.MaxSN THEN Week_expenditure_Cummulative
ELSE 0 END AS [RunWeeklySum]
FROM [tablebp]
JOIN CTE on CTE.tableweekname = [tablebp].tableweekname
I'm not sure why your SN=6 line is 0 rather than 10. Do you really not want the sum for the last week? If having the last week's total is okay, then you might want something like:
;WITH CTE AS (
SELECT Week_Name,SUM([Expend.]) as SumExpend
,MAX(SN) AS MaxSN
FROM T
GROUP BY Week_Name
)
SELECT T.*,CASE WHEN T.SN = CTE.MaxSN THEN SumExpend
ELSE 0 END AS [Sum]
FROM T
JOIN CTE on CTE.Week_Name = T.Week_Name
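On SQL Server 2012 or later, LEAD() can express the same logic without the CTE; a hedged sketch (like the query above, it puts 10 rather than 0 on the last row):

SELECT T.SN, T.Week_Name, T.Exp,
       CASE WHEN LEAD(T.Week_Name) OVER (ORDER BY T.SN) = T.Week_Name
            THEN 0
            ELSE SUM(T.Exp) OVER (PARTITION BY T.Week_Name)
       END AS [Sum]
FROM T
ORDER BY T.SN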
Based on the request in the comment wanting a running total in SUM, you could try this:
;WITH CTE AS (
SELECT Week_Name, MAX(SN) AS MaxSN
FROM T
GROUP BY Week_Name
)
SELECT T.SN, T.Week_Name,T.Exp,
CASE WHEN T.SN = CTE.MaxSN THEN
(SELECT SUM(EXP) FROM T T2
WHERE T2.SN <= T.SN) ELSE 0 END AS [SUM]
FROM T
JOIN CTE ON CTE.Week_Name = T.Week_Name
ORDER BY SN