One to one left join across two tables - postgresql

So.. This is the query I have:
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) count
FROM table2
GROUP BY day, domain
)
SELECT c.domain,
COALESCE(SUM(d.count),0) AS count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
GROUP BY c.domain;
Here's what the "cat" returns:
day | domain
------------+-----------
2015-10-01 | nba.com
2015-10-02 | nba.com
And here's what the "dog" returns:
day | domain | count
------------+----------------+--------
2015-10-03 | nba.com | 2
And here is what the full query returns:
domain | count
------------+-------
nba.com | 4
Count is 4 because the LEFT JOIN matches both of the rows in "cat". However, I want the LEFT JOIN to be applied only ONCE (i.e. to get a count of 2 rather than 4). That is, the "dog" count of 2 should be counted only once (if it is ever matched), and not more than that. Is this possible? I hope I made sense here.

Your question is not quite clear.
If you want to find the first row from dog:
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) count
FROM table2
GROUP BY day, domain
)
SELECT DISTINCT on (c.domain) c.domain,
COALESCE(d.count, 0) AS count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
ORDER BY c.domain, c.day, d.day;
If you want to sum all rows from dog (which meet the conditions):
WITH cat AS (
SELECT day, domain
FROM table1
GROUP BY day, domain
), dog AS (
SELECT day, domain, SUM(count) count
FROM table2
GROUP BY day, domain
), cat_and_dogs as (
SELECT DISTINCT ON(c.domain, d.day) c.domain, d.count
FROM cat c
LEFT JOIN dog d
ON c.domain = d.domain
AND c.day <= d.day
)
SELECT domain,
COALESCE(sum(count), 0) AS count
FROM cat_and_dogs
GROUP BY domain;

Basically, I agree with #klin that the question is not clear. If you do NOT want the LEFT JOIN to be applied for every domain entry in the cat result, the domain column should be unique, so you will NOT hit the SUM(count) from the dog result twice.
You could keep a count column directly in the cat table to simplify your query, but if you have a reason to keep the tables separate, it might be a good idea to rely on some other unique column to join on.
WITH cat(day,domain) AS (
VALUES
('2015-10-01'::DATE,'nba.com'),
('2015-10-02'::DATE,'nba.com')
),dog(day,domain,count) AS (
VALUES
('2015-10-03'::DATE, 'nba.com', 1),
('2015-10-03'::DATE, 'nba.com', 1)
),dog_data AS (
SELECT day, domain, SUM(count) count
FROM dog
GROUP BY day, domain
), result_to_normalize AS (
SELECT DISTINCT ON(c.domain,c.day) c.domain,c.day,
COALESCE(SUM(d.count),0) AS count
FROM cat c
LEFT JOIN dog_data d
ON c.domain = d.domain
AND c.day <= d.day
GROUP BY c.domain,c.day
-- | domain | day | count
-- | nba.com | 2015-10-01 | 2
-- | nba.com | 2015-10-02 | 2
)
SELECT DISTINCT ON (r.domain) r.domain, r.count
FROM result_to_normalize r;
-- | domain | count
-- | nba.com | 2

Related

Postgres - ranking with additional conditions, then pulling the top 3 results

I want to combine 2 conditions and get the top 3 ranked results.
My dataset has 2 components: a major group, with sub-groups mapping to each major group.
For example, here is part of the dataset for Group A:
Group | Sub-group | Revenue | Source
---------------------------------------
A | A-1 | 50 | Y
A | A-2 | 40 | Y
A | A-3 | 60 | Y
A | A-4 | 80 | Y
A | A-5 | 100 | Y
A | A-6 | 140 | X
A | A-7 | 20 | X
A | A-8 | 300 | X
And under Revenue, there are different sources, such as revenue from source X and source Y. I am only interested in source Y.
So I want to pull a list of eligible results, rank them, and take the top 3 by revenue, with a separate step to pull Group A data for the sub-groups that have revenue from source Y. My steps:
WITH
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue
GROUP BY 1, 2
),
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
subgroup = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_revenue,
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT eligible.subgroup) AS eligible_list
INNER JOIN eligible USING (group)
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
I'm expecting to get an aggregated list of the top 3 revenue sub-groups which fulfil the condition of source Y revenue < $1.
But I am getting the full list (or a partially aggregated list, not the top 3); my result gave me 7. I tried running it without the aggregate and it returned 7 individual rows as well.
What could have gone wrong? I thought I had filtered with row <= 3, and I also tried an INNER JOIN because I thought I might have introduced redundant subgroups by using LEFT JOIN.
It seems to be an issue with joining on the eligible alias, because when I break it down and pull directly from subgroup_rev without joining eligible, I can get the top 3 revenue subgroups.
However, I need the condition that the subgroups have source Y revenue and are among the highest 3, so ideally I shouldn't hardcode row as 3 (because in some cases some of the top 3 might not have source Y revenue).
[EDIT] based on feedback
Result:
Group | eligible_list
---------------------------
A | A-1,A-2,A-3,A-4,A-5
Expected result:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5
It may be your subgroup_rev subquery.
Since you are using OVER with a PARTITION as well as a GROUP BY clause, you may just end up creating groups for each combination of group and subgroup. I believe your subgroup_rev subquery can just be:
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue)
Like #GMB said, it's hard to follow without seeing how sources play into the revenue column, but too much grouping may ruin your rankings.
You may end up with all groups having only 1 member, so all of their rankings would be 1 and thus all would be returned by your final WHERE clause.
I restructured my funnel of data and finally got it to work:
WITH
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
subgroup = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_revenue,
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
INNER JOIN eligible
USING (group)
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
subgroup,
sum_revenue.total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY sum_revenue.total_rev DESC
)
AS row
FROM eligible
INNER JOIN sum_revenue
USING (group)
GROUP BY 1, 2,3
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT eligible.subgroup) AS eligible_list
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
The difference is that
I first narrow the data to revenue from source Y and
find the eligible data with a conditional select,
then I have an alias that pulls the total revenue from all sources and INNER JOIN them together,
do the ranking, so that the ranking is only performed on the eligible data (i.e. source Y revenue < $1), and finally
aggregate them.
Result with eligible list with source Y revenue < $1:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5
I note that some have shared feedback with me about having too many subgroups. Happy to get any other feedback on a better way to do this if there is a more efficient suggestion. Thanks!

Count With Conditional on PostgreSQL

I have a table with people and another with visits. I want to count all visits, but if the person signed up with 'emp' or 'oth' in ref_signup, then exclude their first visit. Example:
These are my tables:
PEOPLE:
id | ref_signup
---------------------
20 | emp
30 | oth
23 | fri
VISITS
id | date
-------------------------
20 | 10-01-2019
20 | 10-05-2019
23 | 10-09-2019
23 | 10-10-2019
30 | 09-10-2019
30 | 10-07-2019
In this example the visit count should be 4, because the persons with ids 20 and 30 have their ref_signup as emp or oth, so their first visit should be excluded and only counted from the second one onward.
This is what I have as a query:
SELECT COUNT(*) as visit_count FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1
Would using a CASE inside the COUNT help in this case, as I just want to remove one visit, not all of the visits from the person?
Subtract from COUNT(*) the number of distinct people.id values where people.ref_signup IN ('emp', 'oth'):
SELECT
COUNT(*) -
COUNT(DISTINCT CASE WHEN p.ref_signup IN ('emp', 'oth') THEN p.id END) as visit_count
FROM visits v LEFT JOIN people p
ON p.id = v.id
See the demo.
Result:
| visit_count |
| ----------- |
| 4 |
Note: this code and demo fiddle use the column names of your sample data.
Premise, select the count of visits from each person, along with a synthetic column that contains a 1 if the referral was from emp or oth, a 0 otherwise. Select the sum of the count minus the sum of that column.
SELECT SUM(count) - SUM(ignore_first) AS visit_count
FROM (
SELECT COUNT(*) AS count,
CASE WHEN ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END AS ignore_first
FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1
GROUP BY people.id, ref_signup
) a
where's "people_id" in your example ?
SELECT COUNT(*) as visit_count
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth');
then remove the first visit.
You cannot select a count and delete the first visit at the same time.
DELETE FROM visits
WHERE id IN (
SELECT id
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth')
ORDER BY v.id
LIMIT 1
);
edit: typos
First, I create the tables
create table people (id int primary key, ref_signup varchar(3));
insert into people (id, ref_signup) values (20, 'emp'), (30, 'oth'), (23, 'fri');
create table visits (people_id int not null, visit_date date not null);
insert into visits (people_id, visit_date) values (20, '10-01-2019'), (20, '10-05-2019'), (23, '10-09-2019'), (23, '10-10-2019'), (30, '09-10-2019'), (30, '10-07-2019');
You can use the row_number() window function to mark which visit is "visit number one":
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
Once you have that, you can do another query on those results, and use the filter clause to count up the correct rows that match the condition where visit_num > 1 or ref_signup = 'fri':
-- wrap the first query in a WITH clause
with joined_visits as (
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
)
select count(1) filter (where visit_num > 1 or ref_signup = 'fri')
from joined_visits;
-- First get the corrected counts for all users
WITH grouped_visits AS (
SELECT
COUNT(visits.*) -
CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END
AS visit_count
FROM visits
INNER JOIN people ON (people.id = visits.id)
GROUP BY people.id, people.ref_signup
)
-- Then sum them
SELECT SUM(visit_count)
FROM grouped_visits;
This should give you the result you're looking for.
On a side note, I can't help but think clever use of a window function could do this in a single shot without the CTE.
EDIT: No, it can't, since window functions run after any WHERE, GROUP BY, and HAVING clauses.

Sum with different condition for every line

In my Postgresql 9.3 database I have a table stock_rotation:
+----+-----------------+---------------------+------------+---------------------+
| id | quantity_change | stock_rotation_type | article_id | date |
+----+-----------------+---------------------+------------+---------------------+
| 1 | 10 | PURCHASE | 1 | 2010-01-01 15:35:01 |
| 2 | -4 | SALE | 1 | 2010-05-06 08:46:02 |
| 3 | 5 | INVENTORY | 1 | 2010-12-20 08:20:35 |
| 4 | 2 | PURCHASE | 1 | 2011-02-05 16:45:50 |
| 5 | -1 | SALE | 1 | 2011-03-01 16:42:53 |
+----+-----------------+---------------------+------------+---------------------+
Types:
SALE has negative quantity_change
PURCHASE has positive quantity_change
INVENTORY resets the actual number in stock to the given value
In this implementation, to get the current value that an article has in stock, you need to sum up all quantity changes since the latest INVENTORY for the specific article (including the inventory value). I do not know why it is implemented this way and unfortunately it would be quite hard to change this now.
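For a single article (say, a hypothetical article_id = 1), that rule amounts to something like the following illustrative sketch:
-- illustration only: sum everything since the latest INVENTORY row of one article
SELECT SUM(quantity_change) AS stock
FROM stock_rotation
WHERE article_id = 1
AND date >= COALESCE(
    (SELECT MAX(date)
     FROM stock_rotation
     WHERE article_id = 1
       AND stock_rotation_type = 'INVENTORY'),
    '1970-01-01');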
My question now is how to do this for more than a single article at once.
My latest attempt was this:
WITH latest_inventory_of_article as (
SELECT MAX(date)
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
)
SELECT a.id, sum(quantity_change)
FROM stock_rotation sr
INNER JOIN article a ON a.id = sr.article_id
WHERE sr.date >= (COALESCE(
(SELECT date FROM latest_inventory_of_article),
'1970-01-01'
))
GROUP BY a.id
But the date for the latest stock_rotation of type INVENTORY can be different for every article.
I was trying to avoid looping over multiple article ids to find this date.
In this case I would use a different internal query to get the max inventory per article. You are effectively using stock_rotation twice but it should work. If it's too big of a table you can try something else:
SELECT sr.article_id, sum(quantity_change)
FROM stock_rotation sr
LEFT JOIN (
SELECT article_id, MAX(date) AS date
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
GROUP BY article_id) AS latest_inventory
ON latest_inventory.article_id = sr.article_id
WHERE sr.date >= COALESCE(latest_inventory.date, '1970-01-01')
GROUP BY sr.article_id
You can use DISTINCT ON together with ORDER BY to get the latest INVENTORY row for each article_id in the WITH clause.
Then you can join that with the original table to get all later rows and add the values:
WITH latest_inventory as (
SELECT DISTINCT ON (article_id) id, article_id, date
FROM stock_rotation
WHERE stock_rotation_type = 'INVENTORY'
ORDER BY article_id, date DESC
)
SELECT article_id, sum(sr.quantity_change)
FROM stock_rotation sr
JOIN latest_inventory li USING (article_id)
WHERE sr.date >= li.date
GROUP BY article_id;
Here is my take on it: First, build the list of products at their last inventory state, using a window function. Then, join it back to the entire list, filtering on operations later than the inventory date for the item.
with initial_inventory as
(
select article_id, date, quantity_change from
(select article_id, date, quantity_change, rank() over (partition by article_id order by date desc)
from stock_rotation
where stock_rotation_type = 'INVENTORY'
) a
where rank = 1
)
select ii.article_id, ii.quantity_change + sum(sr.quantity_change)
from initial_inventory ii
join stock_rotation sr on ii.article_id = sr.article_id and sr.date > ii.date
group by ii.article_id, ii.quantity_change

Cascading sum hierarchy using recursive cte

I'm trying to perform a recursive CTE with Postgres, but I can't wrap my head around it. In terms of performance, there are only 50 items in TABLE 1, so this shouldn't be an issue.
TABLE 1 (expense):
id | parent_id | name
------------------------------
1 | null | A
2 | null | B
3 | 1 | C
4 | 1 | D
TABLE 2 (expense_amount):
ref_id | amount
-------------------------------
3 | 500
4 | 200
Expected Result:
id, name, amount
-------------------------------
1 | A | 700
2 | B | 0
3 | C | 500
4 | D | 200
Query
WITH RECURSIVE cte AS (
SELECT
expenses.id,
name,
parent_id,
expense_amount.amount
FROM expenses
LEFT JOIN expense_amount ON expense_amount.expense_id = expenses.id
WHERE expenses.parent_id IS NULL
UNION ALL
SELECT
expenses.id,
expenses.name,
expenses.parent_id,
expense_amount.amount
FROM cte
JOIN expenses ON expenses.parent_id = cte.id
LEFT JOIN expense_amount ON expense_amount.expense_id = expenses.id
)
SELECT
id,
SUM(amount)
FROM cte
GROUP BY 1
ORDER BY 1
Results
id | sum
--------------------
1 | null
2 | null
3 | 500
4 | 200
You can do a conditional sum() for only the root row:
with recursive tree as (
select id, parent_id, name, id as root_id
from expense
where parent_id is null
union all
select c.id, c.parent_id, c.name, p.root_id
from expense c
join tree p on c.parent_id = p.id
)
select e.id,
e.name,
e.root_id,
case
when e.id = e.root_id then sum(ea.amount) over (partition by root_id)
else amount
end as amount
from tree e
left join expense_amount ea on e.id = ea.ref_id
order by id;
I prefer doing the recursive part first, then join the related tables to the result of the recursive query, but you could do the join to the expense_amount also inside the CTE.
Online example: http://rextester.com/TGQUX53703
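The in-CTE variant mentioned above might look roughly like this (untested sketch, same tables, same result shape as the query before it):
-- same idea, but expense_amount is joined inside the recursive CTE
with recursive tree as (
  select e.id, e.parent_id, e.name, e.id as root_id, ea.amount
  from expense e
  left join expense_amount ea on ea.ref_id = e.id
  where e.parent_id is null
  union all
  select c.id, c.parent_id, c.name, p.root_id, ea.amount
  from expense c
  join tree p on c.parent_id = p.id
  left join expense_amount ea on ea.ref_id = c.id
)
select id,
  name,
  case
    when id = root_id then sum(amount) over (partition by root_id)
    else amount
  end as amount
from tree
order by id;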
However, the above only aggregates on the top-level parent, not for any intermediate non-leaf rows.
If you want to see intermediate aggregates as well, this gets a bit more complicated (and is probably not very scalable for large results, but you said your tables aren't that big)
with recursive tree as (
select id, parent_id, name, 1 as level, concat('/', id) as path, null::numeric as amount
from expense
where parent_id is null
union all
select c.id, c.parent_id, c.name, p.level + 1, concat(p.path, '/', c.id), ea.amount
from expense c
join tree p on c.parent_id = p.id
left join expense_amount ea on ea.ref_id = c.id
)
select e.id,
lpad(' ', (e.level - 1) * 2, ' ')||e.name as name,
e.amount as element_amount,
(select sum(amount)
from tree t
where t.path like e.path||'%') as sub_tree_amount,
e.path
from tree e
order by path;
Online example: http://rextester.com/MCE96740
The query builds up a path of all IDs belonging to a (sub)tree and then uses a scalar sub-select to get all child rows belonging to a node. That sub-select is what will make this quite slow as soon as the result of the recursive query can't be kept in memory.
I used the level column to create a "visual" display of the tree structure - this helps me debugging the statement and understanding the result better. If you need the real name of an element in your program you would obviously only use e.name instead of pre-pending it with blanks.
I could not get your query to work for some reason. Here's my attempt that works for the particular table you provided (parent-child, no grandchild) without recursion. SQL Fiddle
--- step 1: get parent-child data together
with parent_child as(
select t.*, amount
from
(select e.id, f.name as name,
coalesce(f.name, e.name) as pname
from expense e
left join expense f
on e.parent_id = f.id) t
left join expense_amount ea
on ea.ref_id = t.id
)
--- final step is to group by id, name
select id, pname, sum(amount)
from
(-- step 2: group by parent name and find corresponding amount
-- returns A, B
select e.id, t.pname, t.amount
from expense e
join (select pname, sum(amount) as amount
from parent_child
group by 1) t
on t.pname = e.name
-- step 3: to get C, D we union and get corresponding columns
-- results in all rows and corresponding value
union
select id, name, amount
from expense e
left join expense_amount ea
on e.id = ea.ref_id
) t
group by 1, 2
order by 1;

Unpack expression results from case statement

Four categories in category table.
id | name
--------------
1 | 'wine'
2 | 'chocolate'
3 | 'autos'
4 | 'real estate'
Two of the many (thousands of) forecasters in forecaster table.
id | name
--------------
1 | 'sothebys'
2 | 'cramer'
Relevant forecasts by the forecasters for the categories in the forecast table.
| id | forecaster_id | category_id | forecast |
|----+---------------+-------------+--------------------------------------------------------------|
| 1 | 1 | 1 | 'bad weather, prices rise short-term' |
| 2 | 1 | 2 | 'cocoa bean surplus, prices drop' |
| 3 | 1 | 3 | 'we dont deal with autos - no idea' |
| 4 | 2 | 2 | 'sell, sell, sell' |
| 5 | 2 | 3 | 'demand for cocoa will skyrocket - prices up - buy, buy buy' |
I want a prioritized mapping of (forecaster, category, forecast) such that, if a forecast exists from the primary forecaster (e.g. 'cramer'), use it because I trust him more. Otherwise, if a forecast exists from a secondary forecaster (e.g. 'sothebys'), use that. If no forecast exists for a category, return a row with that category and null for the forecast.
I have something that almost works, and after I get the logic down I hope to turn it into a parameterized query.
select
case when F1.category is not null
then (F1.forecaster, F1.category, F1.forecast)
when F2.category is not null
then (F2.forecaster, F2.category, F2.forecast)
else (null, C.category, null)
end
from
(
select
FR.name as forecaster,
C.id as cid,
C.category as category,
F.forecast
from
forecast F
inner join forecaster FR on (F.forecaster_id = FR.id)
inner join category C on (C.id = F.category_id)
where FR.name = 'cramer'
) F1
right join (
select
FR.name as forecaster,
C.id as cid,
C.category as category,
F.forecast
from
forecast F
inner join forecaster FR on (F.forecaster_id = FR.id)
inner join category C on (C.id = F.category_id)
where FR.name = 'sothebys'
) F2 on (F1.cid = F2.cid)
full outer join category C on (C.id = F2.cid);
This gives:
'(sothebys,wine,"bad weather, prices rise short-term")'
'(cramer,chocolate,"sell, sell, sell")'
'(cramer,autos,"demand for cocoa will skyrocket - prices up - buy, buy buy")'
'(,"real estate",)'
While that is the desired data, it is a record in one column instead of three. The CASE was the only way I could find to achieve the ordering of cramer first, sothebys next, and there is lots of duplication. Is there a better way, and how can the tuple-like results be pulled back apart into columns?
Any suggestions, especially related to removal of duplication or general simplification appreciated.
This sounds like a case for DISTINCT ON (untested):
SELECT DISTINCT ON (c.id)
fr.name AS forecaster,
c.name AS category,
f.forecast
FROM forecast f
JOIN forecaster fr ON f.forecaster_id = fr.id
RIGHT JOIN category c ON f.category_id = c.id
ORDER BY
c.id,
CASE WHEN fr.name = 'cramer' THEN 0
WHEN fr.name = 'sothebys' THEN 1
ELSE 2
END;
For each category, the first row in the ordering will be picked. Since Cramer is mapped to a lower value than Sotheby's in the CASE expression, it will be given preference.
Adapt the ORDER BY clause if you need a more complicated ranking.
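For example, if the priority list should come from a parameter instead of a hard-coded CASE, one hypothetical variant (assuming PostgreSQL 9.5+ for array_position) orders by the forecaster's position in an array; names not in the array, and the null row from the RIGHT JOIN, sort last:
-- sketch: preference order passed as an array instead of a CASE expression
SELECT DISTINCT ON (c.id)
fr.name AS forecaster,
c.name AS category,
f.forecast
FROM forecast f
JOIN forecaster fr ON f.forecaster_id = fr.id
RIGHT JOIN category c ON f.category_id = c.id
ORDER BY
c.id,
array_position(ARRAY['cramer', 'sothebys'], fr.name) NULLS LAST;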