Postgres - ranking with additional conditions, then pulling the top 3 results

I want to combine two conditions and get the top 3 ranked results.
My dataset has two levels: major groups, and sub-groups that map to each major group.
For example, here is the part of the dataset for Group A:
Group | Sub-group | Revenue | Source
---------------------------------------
A | A-1 | 50 | Y
A | A-2 | 40 | Y
A | A-3 | 60 | Y
A | A-4 | 80 | Y
A | A-5 | 100 | Y
A | A-6 | 140 | X
A | A-7 | 20 | X
A | A-8 | 300 | X
Revenue comes from different sources, such as source X and source Y. I am only interested in source Y.
So I want to get a list of eligible results, ranked, keeping the top 3 by revenue. A separate step pulls the data for Group A for the sub-groups that have revenue from source Y. My steps:
WITH
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue
GROUP BY 1, 2
),
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
source = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_sourceY_rev
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT eligible.subgroup, ',') AS eligible_list
FROM subgroup_rev
INNER JOIN eligible USING (group)
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
I'm expecting to get an aggregated list of the top 3 subgroups by revenue that fulfil the condition of source Y revenue < $1.
But I am getting the full list (or a partial aggregated list that is not the top 3); my result gave me 7. I tried running without the aggregate and it returned 7 individual rows as well.
What could have gone wrong? I thought I had filtered with row <= 3, and I also tried INNER JOIN in case LEFT JOIN was introducing redundant subgroups.
It seems to be an issue with joining to the eligible alias, because when I test by pulling directly from subgroup_rev without joining eligible, I get the top 3 revenue subgroups.
However, I need the 3 highest subgroups among those that have source Y revenue, so ideally I shouldn't hardcode row as 3 (because in some cases some of the overall top 3 might not have source Y revenue).
[EDIT] based on feedback
Result:
Group | eligible_list
---------------------------
A | A-1,A-2,A-3,A-4,A-5
Expected result:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5

It may be your subgroup_rev subquery.
Since you are using OVER with PARTITION BY as well as a GROUP BY clause, you may just end up creating groups for each combination of group and subgroup. I believe your subgroup_rev subquery can just be:
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue)
Like @GMB said, it's hard to follow without seeing how sources play into the revenue column, but too much grouping may ruin your rankings.
You may end up with all groups having only 1 member, so all of their rankings would be 1 and they would all be returned by your final WHERE clause.

I restructured my funnel of data and finally got it to work:
WITH
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
source = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_sourceY_rev
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
INNER JOIN eligible
USING (group)
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
subgroup,
sum_revenue.total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY sum_revenue.total_rev DESC
)
AS row
FROM eligible
INNER JOIN sum_revenue
USING (group)
GROUP BY 1, 2,3
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT subgroup_rev.subgroup, ',') AS eligible_list
FROM subgroup_rev
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
The difference is that I:
1. first narrow the data to revenue from source Y;
2. find the eligible data with a conditional select;
3. pull the total revenue across all sources in a separate CTE and INNER JOIN the two together;
4. do the ranking, so that it is only performed on the eligible data (i.e. source Y revenue < $1); and finally
5. aggregate the results.
Result with eligible list with source Y revenue < $1:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5
I note that some of the feedback mentioned having too many subgroups. I'm happy to hear any other suggestions for a more efficient way to do this. Thanks!
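For anyone who wants to try the "filter first, then rank" order of operations on the sample data, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ is assumed for window functions). The table name revenue_data and the column name grp are made up for the sketch, since group and table are reserved words, and "eligible" is assumed here to simply mean "has revenue from source Y", because the exact < $1 condition depends on the real data.

```python
import sqlite3

# Hypothetical table/column names: "grp" stands in for the reserved word "group".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue_data (grp TEXT, subgroup TEXT, revenue INTEGER, source TEXT)"
)
rows = [
    ("A", "A-1", 50, "Y"), ("A", "A-2", 40, "Y"), ("A", "A-3", 60, "Y"),
    ("A", "A-4", 80, "Y"), ("A", "A-5", 100, "Y"), ("A", "A-6", 140, "X"),
    ("A", "A-7", 20, "X"), ("A", "A-8", 300, "X"),
]
conn.executemany("INSERT INTO revenue_data VALUES (?, ?, ?, ?)", rows)

query = """
WITH source_rev AS (          -- 1. keep only source Y, totalled per subgroup
    SELECT grp, subgroup, SUM(revenue) AS total_rev
    FROM revenue_data
    WHERE source = 'Y'
    GROUP BY grp, subgroup
),
ranked AS (                   -- 2. rank only the eligible subgroups within each group
    SELECT grp, subgroup,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY total_rev DESC) AS rn
    FROM source_rev
)
SELECT grp, subgroup          -- 3. keep the top 3 per group
FROM ranked
WHERE rn <= 3
ORDER BY grp, subgroup
"""

# Build the comma-separated list per group on the Python side.
top3 = {}
for grp, subgroup in conn.execute(query):
    top3.setdefault(grp, []).append(subgroup)
eligible_list = {grp: ",".join(subs) for grp, subs in top3.items()}
print(eligible_list)  # {'A': 'A-3,A-4,A-5'}
```

Because the ranking runs on rows that were already filtered and aggregated per subgroup, no join back on group alone is needed, which is where the extra rows tend to come from. In PostgreSQL the last step could probably be done with STRING_AGG(subgroup, ',' ORDER BY subgroup) and GROUP BY instead of the Python loop.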

Related

Count With Conditional on PostgreSQL

I have a table with people and another with visits. I want to count all visits but if the person signed up with 'emp' or 'oth' on ref_signup then remove the first visit. Example:
These are my tables:
PEOPLE:
id | ref_signup
---------------------
20 | emp
30 | oth
23 | fri
VISITS
id | date
-------------------------
20 | 10-01-2019
20 | 10-05-2019
23 | 10-09-2019
23 | 10-10-2019
30 | 09-10-2019
30 | 10-07-2019
In this example the visit count should be 4, because the persons with ids 20 and 30 have their ref_signup as emp or oth, so their first visit should be excluded and only the second visit onward counted.
This is what I have as a query:
SELECT COUNT(*) as visit_count FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1
Would using a CASE inside the count help here? I just want to remove one visit, not all of the visits, for each such person.
Subtract from COUNT(*) the distinct number of person.ids with person.ref_signup IN ('emp', 'oth'):
SELECT
COUNT(*) -
COUNT(DISTINCT CASE WHEN p.ref_signup IN ('emp', 'oth') THEN p.id END) as visit_count
FROM visits v LEFT JOIN people p
ON p.id = v.id
See the demo.
Result:
| visit_count |
| ----------- |
| 4 |
Note: this code and demo fiddle use the column names of your sample data.
Premise: select the count of visits for each person, along with a synthetic column that contains 1 if the referral was from emp or oth and 0 otherwise. Then select the sum of the counts minus the sum of that column.
SELECT SUM(count) - SUM(ignore_first) AS visit_count
FROM (SELECT COUNT(*) AS count, CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END AS ignore_first
FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1 GROUP BY people.id, people.ref_signup) a
Where is people_id in your example?
SELECT COUNT(*) as visit_count
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth');
then remove the first visit.
You cannot select count and delete the first visit at same time.
DELETE FROM visits
WHERE id IN (
SELECT id
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth')
ORDER BY v.id
LIMIT 1
);
First, I create the tables
create table people (id int primary key, ref_signup varchar(3));
insert into people (id, ref_signup) values (20, 'emp'), (30, 'oth'), (23, 'fri');
create table visits (people_id int not null, visit_date date not null);
insert into visits (people_id, visit_date) values (20, '10-01-2019'), (20, '10-05-2019'), (23, '10-09-2019'), (23, '10-10-2019'), (30, '09-10-2019'), (30, '10-07-2019');
You can use the row_number() window function to mark which visit is "visit number one":
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
Once you have that, you can do another query on those results, and use the filter clause to count up the correct rows that match the condition where visit_num > 1 or ref_signup = 'fri':
-- wrap the first query in a WITH clause
with joined_visits as (
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
)
select count(1) filter (where visit_num > 1 or ref_signup = 'fri')
from joined_visits;
-- First get the corrected counts for all users
WITH grouped_visits AS (
SELECT
COUNT(visits.*) -
CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END
AS visit_count
FROM visits
INNER JOIN people ON (people.id = visits.id)
GROUP BY people.id, people.ref_signup
)
-- Then sum them
SELECT SUM(visit_count)
FROM grouped_visits;
This should give you the result you're looking for.
On a side note, I can't help but think clever use of a window function could do this in a single shot without the CTE.
EDIT: No, it can't since window functions run after needed WHERE and GROUP BY and HAVING clauses.
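For comparison, here is a small self-contained sketch of the "number the visits, then skip visit one for emp/oth" idea, run against SQLite through Python's sqlite3 module (SQLite 3.25+ assumed for window functions). It assumes the visits table has a people_id column, as in the CREATE TABLE statements above, and stores the dates as ISO strings so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, ref_signup TEXT);
CREATE TABLE visits (people_id INTEGER NOT NULL, visit_date TEXT NOT NULL);
INSERT INTO people VALUES (20, 'emp'), (30, 'oth'), (23, 'fri');
INSERT INTO visits VALUES
    (20, '2019-10-01'), (20, '2019-10-05'),
    (23, '2019-10-09'), (23, '2019-10-10'),
    (30, '2019-09-10'), (30, '2019-10-07');
""")

query = """
WITH numbered AS (
    -- visit_num = 1 marks each person's earliest visit
    SELECT p.ref_signup,
           ROW_NUMBER() OVER (PARTITION BY v.people_id ORDER BY v.visit_date) AS visit_num
    FROM visits v
    JOIN people p ON p.id = v.people_id
)
SELECT COUNT(*)
FROM numbered
-- drop only the first visit of people who signed up via emp or oth
WHERE visit_num > 1 OR ref_signup NOT IN ('emp', 'oth')
"""
visit_count = conn.execute(query).fetchone()[0]
print(visit_count)  # 4
```

Using NOT IN ('emp', 'oth') rather than = 'fri' keeps the query correct if other ref_signup values appear later.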

Querying for the distinct count used to create a grouped, aggregated, and filtered row set

I have a table that looks like this:
control=# select * from animals;
age_range | weight | species
-----------+--------+---------
0-9 | 1 | lion
0-9 | 2 | lion
10-19 | 2 | tiger
10-19 | 3 | horse
20-29 | 2 | tiger
20-29 | 2 | zebra
I perform a query that summarizes weights of animals within age range groups, and I only want to return rows that have aggregated weights above a certain number.
Summary Query:
SELECT
age_range,
SUM(animals.weight) AS weight,
COUNT(DISTINCT animals.species) AS distinct_species
FROM animals
GROUP BY age_range
HAVING SUM(animals.weight) > 3;
Summary Results:
age_range | weight | distinct_species
-----------+--------+------------------
10-19 | 5 | 2
20-29 | 4 | 2
Now here's the rub. Along with this summary, I want to report the distinct number of species used to create the above summary row set as a whole. For simplicity, let's refer to this number as the 'Distinct Species Total'. In this simple example, since only 3 species (tiger, zebra, horse) were used in yielding the 2 rows of this summary, and not 'lion', the 'Distinct Species Total' should be 3. But I can't figure out how to successfully query for that number. Since the summary query must use a having clause in order to apply a filter to an already grouped and aggregated row set, this presents problems in trying to query for the 'Distinct Species Total'.
This returns the wrong number, 2, because it is incorrectly a distinct count of a distinct count:
SELECT
COUNT(DISTINCT distinct_species) AS distinct_species_total
FROM (
SELECT
age_range,
SUM(animals.weight) AS weight,
COUNT(DISTINCT animals.species) AS distinct_species
FROM animals
GROUP BY age_range
HAVING SUM(animals.weight) > 3
) x;
And of course this returns the wrong number, 4, because it does not consider filtering the grouped and aggregated summary result using a having clause:
SELECT
COUNT(DISTINCT species) AS distinct_species_total
FROM animals;
Any help at all in leading me onto the right path here is appreciated, and will hopefully help others with a similar problem, but in the end I do need a solution that will work with Amazon Redshift.
Join the result set with the original animals table and count the distinct species.
select count(distinct y.species) as distinct_species_total
from
(
select age_range,sum(animals.weight) as weight
from animals
group by age_range
having sum(animals.weight) > 3
) x
join animals y on x.age_range=y.age_range
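As a quick check of the join-back idea, here is a minimal sketch using Python's sqlite3 module and the sample rows from the question. It runs on SQLite rather than Redshift, so treat it as an illustration of the query shape only; the alias kept is made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (age_range TEXT, weight INTEGER, species TEXT)")
conn.executemany(
    "INSERT INTO animals VALUES (?, ?, ?)",
    [
        ("0-9", 1, "lion"), ("0-9", 2, "lion"),
        ("10-19", 2, "tiger"), ("10-19", 3, "horse"),
        ("20-29", 2, "tiger"), ("20-29", 2, "zebra"),
    ],
)

query = """
SELECT COUNT(DISTINCT a.species) AS distinct_species_total
FROM animals a
JOIN (
    -- age ranges that survive the HAVING filter of the summary query
    SELECT age_range
    FROM animals
    GROUP BY age_range
    HAVING SUM(weight) > 3
) kept ON kept.age_range = a.age_range
"""
distinct_species_total = conn.execute(query).fetchone()[0]
print(distinct_species_total)  # 3 (tiger, horse, zebra)
```

The subquery only decides which age ranges survive; the distinct count is then taken over the original rows of those ranges, so lion (only in 0-9) drops out.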

PostgreSQL Window Function "column must appear in the GROUP BY clause"

I'm trying to get a leaderboard of summed user scores from a list of user score entries. A single user can have more than one entry in this table.
I have the following table:
rewards
=======
user_id | amount
I want to add up all of the amount values for given users and then rank them on a global leaderboard. Here's the query I'm trying to run:
SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION BY user_id) FROM rewards;
I'm getting the following error:
ERROR: column "rewards.user_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION...
Isn't user_id already in an "aggregate function" because I'm trying to partition on it? The PostgreSQL manual shows the following entry which I feel is a direct parallel of mine, so I'm not sure why mine's not working:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
They're not grouping by depname, so how come theirs works?
For example, for the following data:
user_id | score
===============
1 | 2
1 | 3
2 | 5
3 | 1
I would expect the following output (I have made a "tie" between users 1 and 2):
user_id | SUM(score) | rank
===========================
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
So user 1 has a total score of 5 and is ranked #1, user 2 is tied with a score of 5 and thus is also rank #1, and user 3 is ranked #3 with a score of 1.
You need to GROUP BY user_id since it's not being aggregated. Then you can rank by SUM(score) descending as you want;
SQL Fiddle Demo
SELECT user_id, SUM(score), RANK() OVER (ORDER BY SUM(score) DESC)
FROM rewards
GROUP BY user_id;
user_id | sum | rank
---------+-----+------
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
There is a difference between window functions and aggregate functions. Some functions can be used both as a window function and an aggregate function, which can cause confusion. Window functions can be recognized by the OVER clause in the query.
In your case the query is split into two steps: first an aggregate on user_id, followed by a window function on total_amount.
SELECT user_id, total_amount, RANK() OVER (ORDER BY total_amount DESC)
FROM (
SELECT user_id, SUM(amount) total_amount
FROM rewards
GROUP BY user_id
) q
ORDER BY total_amount DESC
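To see the aggregate-then-rank shape produce the expected leaderboard, here is a small sketch with Python's sqlite3 module (SQLite 3.25+ assumed for window functions) and the sample data from the question. The extra ORDER BY on user_id is only there to make the row order deterministic for tied users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rewards (user_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO rewards VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 5), (3, 1)],
)

query = """
SELECT user_id, total_amount,
       RANK() OVER (ORDER BY total_amount DESC) AS rnk   -- window step
FROM (
    SELECT user_id, SUM(amount) AS total_amount           -- aggregate step
    FROM rewards
    GROUP BY user_id
) q
ORDER BY rnk, user_id
"""
leaderboard = conn.execute(query).fetchall()
print(leaderboard)  # [(1, 5, 1), (2, 5, 1), (3, 1, 3)]
```

Note that there is no PARTITION BY here: partitioning by user_id would rank each user only against themselves, so every rank would be 1.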
If you have
SELECT user_id, SUM(amount) ....
^^^
aggregate function (not window function)
....
FROM .....
You need
GROUP BY user_id

One to one left join across two tables

So.. This is the query I have:
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) AS count
FROM table2
GROUP BY day, domain
)
SELECT c.domain,
COALESCE(SUM(d.count),0) AS count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
GROUP BY c.domain;
Here's what the "cat" returns:
day | domain
------------+-----------
2015-10-01 | nba.com
2015-10-02 | nba.com
And here's what the "dog" returns:
day | domain | count
------------+----------------+--------
2015-10-03 | nba.com | 2
And here is what the full query returns:
domain | count
------------+-------
nba.com | 4
Count is 4 because the LEFT JOIN matches both of the rows in "cat". However, I want the left join applied only ONCE (i.e. a count of 2 rather than 4). That is, the "dog" count of 2 should be applied only ONCE (if it is ever satisfied), and not more than that. Is this possible? I hope this makes sense.
Your question is not quite clear.
If you want to find the first row from dog:
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) count
FROM table2
GROUP BY day, domain
)
SELECT DISTINCT on (c.domain) c.domain,
COALESCE(d.count, 0) AS count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
ORDER BY c.domain, c.day, d.day;
If you want to sum all rows from dog (which meet the conditions):
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) count
FROM table2
GROUP BY day, domain
), cat_and_dogs as (
SELECT DISTINCT ON(c.domain, d.day) c.domain, d.count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
)
SELECT domain,
COALESCE(sum(count), 0) AS count
FROM cat_and_dogs
GROUP BY domain;
Basically, I agree with @klin that the question is not clear. If you do NOT want the LEFT JOIN applied once for every domain entry in the cat result, the domain column should be unique, so that you do NOT hit SUM(count) in the dog result twice.
You could keep the count column in the cat table to simplify your query, but if you have a reason to keep them separate, it might be a good idea to rely on some other unique column to join on.
WITH cat(day,domain) AS (
VALUES
('2015-10-01'::DATE,'nba.com'),
('2015-10-02'::DATE,'nba.com')
),dog(day,domain,count) AS (
VALUES
('2015-10-03'::DATE, 'nba.com', 1),
('2015-10-03'::DATE, 'nba.com', 1)
),dog_data AS (
SELECT day, domain, SUM(count) count
FROM dog
GROUP BY day, domain
), result_to_normilize AS (
SELECT DISTINCT ON(c.domain,c.day) c.domain,c.day,
COALESCE(SUM(d.count),0) AS count
FROM cat c
LEFT JOIN dog_data d
ON c.domain = d.domain
AND c.day <= d.day
GROUP BY c.domain,c.day
-- | domain | day | count
-- | nba.com | 2015-10-01 | 2
-- | nba.com | 2015-10-02 | 2
)
SELECT DISTINCT ON (r.domain) r.domain, r.count
FROM result_to_normilize r;
-- | domain | count
-- | nba.com | 2
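Another way to get each dog row counted at most once, offered only as a sketch under the assumption that "applied once" means "a dog row counts if at least one cat day is on or before it": collapse cat to its earliest day per domain before joining. That is a different technique from the DISTINCT ON queries above. The example uses Python's sqlite3 module, and the CTE name first_cat and the output alias total are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cat (day TEXT, domain TEXT);
CREATE TABLE dog (day TEXT, domain TEXT, count INTEGER);
INSERT INTO cat VALUES ('2015-10-01', 'nba.com'), ('2015-10-02', 'nba.com');
INSERT INTO dog VALUES ('2015-10-03', 'nba.com', 2);
""")

query = """
WITH first_cat AS (
    -- one row per domain, so a dog row can match at most once
    SELECT domain, MIN(day) AS first_day
    FROM cat
    GROUP BY domain
)
SELECT f.domain, COALESCE(SUM(d.count), 0) AS total
FROM first_cat f
LEFT JOIN dog d
       ON d.domain = f.domain
      AND f.first_day <= d.day
GROUP BY f.domain
"""
result = conn.execute(query).fetchall()
print(result)  # [('nba.com', 2)]
```

A dog day is on or after some cat day exactly when it is on or after the earliest cat day, so reducing cat to MIN(day) keeps the same matches while removing the duplication.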

Finding exact matches to a requested set of values

Hi I'm facing a challenge. There is a table progress.
User_id | Assessment_id
-----------------------
1 | Test_1
2 | Test_1
3 | Test_1
1 | Test_2
2 | Test_2
1 | Test_3
3 | Test_3
I need to pull out the user_ids that have completed only Test_1 and Test_2 (i.e. User_id 2). The input parameter would be the list of assessment ids.
Edit:
I want those who have completed all the assessments on the list, but no others.
User 3 did not complete Test_2, and so is excluded.
User 1 completed an extra test, and is also excluded.
Only User 2 has completed exactly those assessments requested.
You don't need a complicated join or even subqueries. Simply use the INTERSECT operator:
select user_id from progress where assessment_id = 'Test_1'
intersect
select user_id from progress where assessment_id = 'Test_2'
I interpreted your question to mean that you want users who have completed all of the tests in your assessment list, but not any other tests. I'll use a technique called common table expressions so that you can follow step by step, but it is all one query statement.
Let's say you supply your assessment list as rows in a table called Checktests. We can count those values to find out how many tests are needed.
If we use a LEFT OUTER JOIN then values from the right-side table will be null wherever there is no match. So the test_matched column will be null if an assessment is not on your list. COUNT() ignores null values, so we can use this to find out how many tests were taken that were on the list, and then compare this to the number of all tests the user took.
with x as
(select count(assessment_id) as tests_needed
from checktests
),
dtl as
(select p.user_id,
p.assessment_id as test_taken,
c.assessment_id as test_matched
from progress p
left join checktests c on p.assessment_id = c.assessment_id
),
y as
(select user_id,
count(test_taken) as all_tests,
count(test_matched) as wanted_tests -- count() ignores nulls
from dtl
group by user_id
)
select user_id
from y
join x on y.wanted_tests = x.tests_needed
where y.wanted_tests = y.all_tests ;
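The same "matched count equals total count equals list length" comparison can be sketched without a separate checktests table by passing the list as query parameters. This is a minimal example using Python's sqlite3 module; it assumes there are no duplicate (user_id, assessment_id) rows and no duplicates in the requested list, and the variable names are made up for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE progress (user_id INTEGER, assessment_id TEXT)")
conn.executemany(
    "INSERT INTO progress VALUES (?, ?)",
    [
        (1, "Test_1"), (2, "Test_1"), (3, "Test_1"),
        (1, "Test_2"), (2, "Test_2"),
        (1, "Test_3"), (3, "Test_3"),
    ],
)

wanted = ["Test_1", "Test_2"]                 # the requested assessment list
placeholders = ",".join("?" * len(wanted))    # one "?" per requested assessment

query = f"""
SELECT user_id
FROM progress
GROUP BY user_id
HAVING COUNT(*) = ?                                                     -- no extra tests
   AND SUM(CASE WHEN assessment_id IN ({placeholders}) THEN 1 ELSE 0 END) = ?  -- all wanted tests
ORDER BY user_id
"""
# Parameters follow the order of the "?" marks in the query text.
params = [len(wanted), *wanted, len(wanted)]
exact_users = [row[0] for row in conn.execute(query, params)]
print(exact_users)  # [2]
```

User 1 fails the first condition (three tests taken), user 3 fails the second (only one of the wanted tests), and user 2 passes both.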