Sum of one column grouped by 2nd column with groups made based on 3rd column - group-by

Data
So my data looks like:
product  user_id  value  id
pizza    1        50     1
burger   1        30     2
pizza    2        50     3
fries    1        10     4
pizza    3        50     5
burger   1        30     6
burger   2        30     7
Problem Statement
I want to compute the lifetime value (LTV) of the customers of each product, as a metric for which product is doing best in terms of user retention.
Desired Output
My desired output is:
product  value_by_customers_of_these_products  total_customers  ltv
pizza    250                                   3                250/3 = 83.33
burger   200                                   2                200/2 = 100
fries    120                                   1                120/1 = 120
Columns Description:
value_by_customers_of_these_products: total value generated by the customers of each product, including their orders that do not contain the product.
total_customers: a simple COUNT(DISTINCT user_id) grouped by product.
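For example, pizza's customers are users 1, 2 and 3; their combined value across all of their orders (whether or not those orders contain pizza) is 120 + 80 + 50 = 250, so pizza's ltv is 250/3 = 83.33.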
Current Workaround
Currently I am doing this:
SELECT "pizza" AS product, SUM(value) value_by_customers_of_these_products, COUNT(DISTINCT user_id) users FROM orders WHERE user_id in (SELECT user_id FROM orders WHERE product = "pizza")
UNION ALL
SELECT "burger" AS product, SUM(value) value_by_customers_of_these_products, COUNT(DISTINCT user_id) users FROM orders WHERE user_id in (SELECT user_id FROM orders WHERE product = "burger")
UNION ALL
SELECT "fries" AS product, SUM(value) value_by_customers_of_these_products, COUNT(DISTINCT user_id) users FROM orders WHERE user_id in (SELECT user_id FROM orders WHERE product = "fries")
I have a Python script that fetches the DISTINCT product names from the table, repeats the query string for each product, and updates the query from time to time. This is a real pain, since I have to redo it every time a new product is launched, and the ever-growing length of the query is another issue. How can I achieve this with built-in BigQuery functions, or at least with minimal headache?
Code to generate Sample Data
WITH orders AS (
  SELECT "pizza" AS product, 1 AS user_id, 50 AS value, 1 AS id
  UNION ALL SELECT "burger", 1, 30, 2
  UNION ALL SELECT "pizza", 2, 50, 3
  UNION ALL SELECT "fries", 1, 10, 4
  UNION ALL SELECT "pizza", 3, 50, 5
  UNION ALL SELECT "burger", 1, 30, 6
  UNION ALL SELECT "burger", 3, 30, 7
)

Use the query below:
with user_value as (
select user_id, sum(value) values
from `project.dataset.table`
group by user_id
), product_user as (
select distinct product, user_id
from `project.dataset.table`
)
select product,
sum(values) as value_by_customers_of_these_products,
count(user_id) as total_customers,
round(sum(values) / count(user_id), 2) as ltv
from product_user
join user_value
using(user_id)
group by product
If applied to the sample data in your question, the output is:
product  value_by_customers_of_these_products  total_customers  ltv
pizza    250                                   3                83.33
burger   200                                   2                100.0
fries    120                                   1                120.0
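The product_user CTE keeps one row per (product, user_id) pair, so each customer's total value is added exactly once per product they bought, and newly launched products are picked up automatically without editing the query.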

Related

postgresql RIGHT Join: limit returned rows

I have the following schema:
expenses
id  name, varchar  cost, double  date, DATE    category_id, int f_key  user_id, int f_key
1   Pizza          22.9          22/08/2022    1                       1
2   Pool           34.9          23/08/2022    2                       1
categories
id  name, varchar
1   Food
2   Leisure
3   Medicine
4   Fancy food
users_categories (user_id int f_key, category_id int f_key)
user_id  category_id
1        1
1        2
1        3
2        4
And two users with id 1 and 2.
Relation between user and category is many to many.
Problem:
I want to get statistics (total cost amount and count) for all categories. For categories where there are no expenses I want to return 0. Here is my query:
SELECT categories.name as name, count(expenses.name) as count, round(SUM(price)::numeric,2) as sum
FROM expenses
Right JOIN categories ON expenses.category_id = categories.id
and expenses.category_id in (
select users_categories.category_id from users_categories where users_categories.user_id = 1
)
and expenses.id in(
Select expenses.id from expenses
join users_categories on expenses.category_id = users_categories.category_id
and expenses.user_id = 1
AND (extract(year from date) = 2022 OR CAST(2022 AS int) is null)
AND (extract(month from date) = 8 OR CAST(8 AS int) is null)
)
GROUP BY categories.id ORDER BY categories.id
The response is:
name        count  sum
Food        1      22.9
Leisure     1      33.9
Medicine    0      null
Fancy food  0      null
How should I edit my query to eliminate the last row, since this category doesn't belong to user 1?
In your query you used users_categories in a subquery inside the join condition, so it does not filter the category ids.
Try this query:
SELECT categories.name AS name,
       count(expenses.name) AS count,
       coalesce(round(SUM(price)::numeric, 2), 0) AS sum
FROM categories
LEFT JOIN users_categories ON users_categories.category_id = categories.id
LEFT JOIN expenses ON expenses.category_id = categories.id
    AND (extract(year from date) = 2022 OR CAST(2022 AS int) is null)
    AND (extract(month from date) = 8 OR CAST(8 AS int) is null)
WHERE users_categories.user_id = '1'
GROUP BY categories.name, categories.id
ORDER BY categories.id
OUTPUT:
name      count  sum
Food      1      22.90
Leisure   1      34.90
Medicine  0      0
You want to move expenses.category_id in ... out of the ON condition and into a WHERE clause.
When it is in the ON clause, that means rows which were removed by the in-test just get NULL-fabricated anyway. You want to remove those rows after the NULL-fabrication is done, so that they remain removed. But why do you use that in-test anyway? Seems like it would be much simpler written as another join.
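For example, here is a sketch of that idea, reusing the table and column names from the question (and the cost column from the schema, where the queries above use price); treat it as an untested outline rather than a drop-in answer:
SELECT c.name,
       count(e.name) AS count,
       coalesce(round(sum(e.cost)::numeric, 2), 0) AS sum
FROM categories c
JOIN users_categories uc            -- inner join: keeps only the categories of user 1
  ON uc.category_id = c.id
 AND uc.user_id = 1
LEFT JOIN expenses e                -- left join: empty categories still produce a row
  ON e.category_id = c.id
 AND e.user_id = 1
 AND extract(year FROM e.date) = 2022
 AND extract(month FROM e.date) = 8
GROUP BY c.id, c.name
ORDER BY c.id;
The inner join to users_categories drops categories the user doesn't have, while the filters on the LEFT JOIN's ON clause keep the NULL-extended rows for categories with no matching expenses.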
What I understand is that you are trying to get the count and sum of expenses for all the categories related to user_id 1 within the month of August 2022.
Please try out the following query.
WITH statistics
     AS (SELECT e.category_id,
                Count(e.*)                     AS count,
                Round(Sum(e.cost)::numeric, 2) AS sum
         FROM   expenses e
         WHERE  e.user_id = 1
                AND ( e.date BETWEEN '01/08/2022' AND '31/08/2022' )
         GROUP  BY e.category_id),
     user_category
     AS (SELECT uc.category_id,
                COALESCE(s.count, 0) AS count,
                COALESCE(s.sum, 0)   AS sum
         FROM   users_categories uc
                LEFT JOIN statistics s
                       ON uc.category_id = s.category_id
         WHERE  uc.user_id = 1)
SELECT c.name,
       u.count,
       u.sum
FROM   categories c
       INNER JOIN user_category u
               ON u.category_id = c.id;

Selecting other columns not in count, group by

So I have a table as follows
product_id  sender_id  timestamp  ...other columns...
1           2          1222
1           2          3423
1           2          1231
2           2          890
3           4          234
2           3          234234
I want to get rows where sender_id = 2, but I want to count and group by product_id and sort by timestamp descending. This means I need the following result
product_id  sender_id  timestamp  count  ...other columns...
1           2          3423       3
2           2          890        1
I tried the following query:
SELECT product_id, sender_id, timestamp, count(product_id), ...other columns...
FROM table
WHERE sender_id = 2
GROUP BY product_id
But I get the following error: ERROR: column "table.sender_id" must appear in the GROUP BY clause or be used in an aggregate function.
It seems I cannot SELECT columns that are not in the GROUP BY. Another method I found online was to use a join:
SELECT product_id, sender_id, timestamp, count, ...other columns...
FROM table
JOIN (
SELECT product_id, COUNT(product_id) AS count
FROM table
GROUP BY (product_id)
) table1 ON table.product_id = table1.product_id
WHERE sender_id = 2
GROUP BY product_id
But doing this simply lists all rows without grouping or counting. My guess is that the ON part simply extends table again.
Try grouping using product_id, sender_id
select product_id, sender_id, count(product_id), max(timestamp) maxtm
from t
where sender_id = 2
group by product_id, sender_id
order by maxtm desc
If you want other columns too:
select t.*, t1.product_count
from t
inner join (
select product_id, sender_id, count(product_id) product_count, max(timestamp) maxtm
from t
where sender_id = 2
group by product_id, sender_id
) t1
on t.product_id = t1.product_id and t.sender_id = t1.sender_id and t.timestamp = t1.maxtm
order by t1.maxtm desc
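Since the error message in the question looks like PostgreSQL, a window-function variant is another option. This is only a sketch, assuming the table really is called t and that you want the latest row per product:
SELECT DISTINCT ON (product_id)
       product_id, sender_id, timestamp,
       count(*) OVER (PARTITION BY product_id) AS count  -- rows per product for sender_id = 2
FROM t
WHERE sender_id = 2
ORDER BY product_id, timestamp DESC;  -- DISTINCT ON keeps the first (latest) row per product
Wrap it in an outer query if the final result needs to be ordered by timestamp descending.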
Just work it out with your data:
CREATE TABLE products (product_id INTEGER,
                       sender_id  INTEGER,
                       time_stamp INTEGER);

INSERT INTO products VALUES
(1, 2, 1222),
(1, 2, 3423),
(1, 2, 1231),
(2, 2, 890),
(3, 4, 234),
(2, 3, 234234);
SELECT product_id,sender_id,string_agg(time_stamp::text,','),count(product_id)
FROM products
WHERE sender_id=2
GROUP BY product_id,sender_id
Here you have distinct time_stamp values, so you either need to apply some aggregate to that column or simply remove it from the select list.
If you remove time_stamp from the select list it becomes very easy, like below:
SELECT product_id,sender_id,count(product_id)
FROM products
WHERE sender_id=2
GROUP BY product_id,sender_id

select only when an item exists on a table 3 or more times postgres

I have 2 tables. SalesOrderDetail and SalesOrderHeader.
SalesOrderDetail contains SalesOrderID and ProductID columns.
SalesOrderHeader contains SalesOrderID and CustomerID.
I want to write a query that shows all the customers who ordered 3 or more products with different ProductIDs, and how many such orders (with 3 or more different products) each of them made. I know that a customer placed an order of 3 or more products when their SalesOrderID appears 3 or more times in SalesOrderDetail.
So the Customer with ID 29825 has ordered 12 different Products.
And here's my code:
SELECT "SalesOrderHeader"."CustomerID", count("SalesOrderDetail"."SalesOrderID") AS TotalOrders
FROM
public."SalesOrderHeader",
public."SalesOrderDetail"
WHERE
"SalesOrderHeader"."SalesOrderID" = "SalesOrderDetail"."SalesOrderID"
GROUP BY "SalesOrderHeader"."CustomerID"
HAVING count("SalesOrderDetail"."SalesOrderID") >= 3
The problem with this is that it shows the number of products each customer ordered, but I want the number of orders with 3 or more different products.
If you want the total orders with 3 or more products per customer, then use two levels of aggregation:
select soh."CustomerID", count(*) as NumOrders
from public."SalesOrderHeader" soh join
(select SalesOrderID, count(distinct ProductID) as numproducts
from public."SalesOrderDetail" sod
group by SalesOrderId
) sod
on sod."SalesOrderID" = soh."SalesOrderID"
where numproducts >= 3
group by soh."CustomerID"

How to normalize group by count results?

How can the results of a "group by" count be normalized by the count's sum?
For example, given:
User  Rating (1-5)
----------------------
1     3
1     4
1     2
3     5
4     3
3     2
2     3
The result will be:
User  Count  Percentage
---------------------------
1     3      .42 (=3/7)
2     1      .14 (=1/7)
3     2      .28 (...)
4     1      .14
So for each user the number of ratings they provided is given as the percentage of the total ratings provided by everyone.
SELECT DISTINCT ON ("user") "user",
       count(*) OVER (PARTITION BY "user") AS cnt,
       count(*) OVER (PARTITION BY "user")::numeric / count(*) OVER () AS percentage
FROM MyTable;
The count(*) OVER (PARTITION BY user) is a so-called window function. Window functions let you perform some operation over a "window" created by some "partition" which is here made over the user id. In plain and simple English: the partitioned count(*) is calculated for each distinct user value, so in effect it counts the number of rows for each user value.
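An equivalent sketch with a plain GROUP BY (again assuming the table is called MyTable, as in the answer below) sums the per-user counts with a window function and avoids DISTINCT ON entirely:
SELECT "user",
       count(*) AS cnt,
       -- per-user row count divided by total row count; ::numeric avoids integer division
       count(*)::numeric / sum(count(*)) OVER () AS percentage
FROM MyTable
GROUP BY "user"
ORDER BY "user";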
Without using a window function or variables, you will need to cross join a grouped subquery with a second, totalled subquery, then select again to return a result set you can work with.
SELECT
B.UserID,
B.UserCount,
C.CountAll
FROM
(
SELECT
CountAll=SUM(UserCount)
FROM
(
SELECT
UserCount=COUNT(*)
FROM
MyTable
GROUP BY
UserID
) AS A
)AS C
CROSS JOIN(
SELECT
UserID,
UserCount=COUNT(*)
FROM
MyTable
GROUP BY
UserID
)AS B
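To also produce the Percentage the question asks for, the outer SELECT list can be extended; here is a sketch that keeps this answer's alias = expression (SQL Server) style:
SELECT
    B.UserID,
    B.UserCount,
    C.CountAll,
    Percentage = B.UserCount * 1.0 / C.CountAll  -- * 1.0 forces non-integer division
FROM
(
    SELECT CountAll = SUM(UserCount)
    FROM
    (
        SELECT UserCount = COUNT(*)
        FROM MyTable
        GROUP BY UserID
    ) AS A
) AS C
CROSS JOIN
(
    SELECT
        UserID,
        UserCount = COUNT(*)
    FROM MyTable
    GROUP BY UserID
) AS B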

Using Derived Tables and CTEs to Display Details?

I am teaching myself T-SQL and am struggling to comprehend the following example.
Suppose you want to display several nonaggregated columns along with
some aggregate expressions that apply to the entire result set or to a
larger grouping level. For example, you may need to display several
columns from the Sales.SalesOrderHeader table and calculate the
percent of the TotalDue for each sale compared to the TotalDue for all
the customer’s sales. If you group by CustomerID, you can’t include
other nonaggregated columns from Sales.SalesOrderHeader unless you
group by those columns. To get around this, you can use a derived
table or a CTE.
Here are two examples given...
SELECT c.CustomerID, SalesOrderID, TotalDue, AvgOfTotalDue,
TotalDue/SumOfTotalDue * 100 AS SalePercent
FROM Sales.SalesOrderHeader AS soh
INNER JOIN
(SELECT CustomerID, SUM(TotalDue) AS SumOfTotalDue,
AVG(TotalDue) AS AvgOfTotalDue
FROM Sales.SalesOrderHeader
GROUP BY CustomerID) AS c ON soh.CustomerID = c.CustomerID
ORDER BY c.CustomerID;
WITH c AS
(SELECT CustomerID, SUM(TotalDue) AS SumOfTotalDue,
AVG(TotalDue) AS AvgOfTotalDue
FROM Sales.SalesOrderHeader
GROUP BY CustomerID)
SELECT c.CustomerID, SalesOrderID, TotalDue,AvgOfTotalDue,
TotalDue/SumOfTotalDue * 100 AS SalePercent
FROM Sales.SalesOrderHeader AS soh
INNER JOIN c ON soh.CustomerID = c.CustomerID
ORDER BY c.CustomerID;
Why doesn't this query produce the same result..
SELECT CustomerID, SalesOrderID, TotalDue, AVG(TotalDue) AS AvgOfTotalDue,
TotalDue/SUM(TotalDue) * 100 AS SalePercent
FROM Sales.SalesOrderHeader
GROUP BY CustomerID, SalesOrderID, TotalDue
ORDER BY CustomerID
I'm looking for someone to explain the above examples in another way or step through it logically so I can understand how they work?
The aggregates in this statement (i.e. SUM and AVG) don't do anything:
SELECT CustomerID, SalesOrderID, TotalDue, AVG(TotalDue) AS AvgOfTotalDue,
TotalDue/SUM(TotalDue) * 100 AS SalePercent
FROM Sales.SalesOrderHeader
GROUP BY CustomerID, SalesOrderID, TotalDue
ORDER BY CustomerID
The reason for this is that you're grouping by TotalDue, so all records in the same group have the same value for this field. In the case of AVG this means you're guaranteed that AvgOfTotalDue will always equal TotalDue. For SUM it's possible you'd get a different result, but as you're also grouping by SalesOrderID (which I'd imagine is unique in the SalesOrderHeader table) you will only have one record per group, so again this will always equal the TotalDue value.
With the CTE example you're only grouping by CustomerId; as a customer may have many sales orders associated with it, these aggregate values will be different to the TotalDue.
EDIT
Explanation of aggregating a field that is included in the GROUP BY:
When you group by a value, all rows with that same value are collected together and the aggregate functions are performed over them. Say you had 3 rows with a total due of 1 and 2 rows with a total due of 2: you'd get two result lines, one for the 1s and one for the 2s. If you now sum each line you have 3*1 and 2*2. Divide by the number of rows in each result line (to get the average) and you have 3*1/3 and 2*2/2, so things cancel out, leaving you with 1 and 2.
select totalDue, avg(totalDue)
from (
select 1 totalDue
union all select 1 totalDue
union all select 1 totalDue
union all select 2 totalDue
union all select 2 totalDue
) x
group by totalDue
select uniqueId, totalDue, avg(totalDue), sum(totalDue)
from (
select 1 uniqueId, 1 totalDue
union all select 2 uniqueId, 1 totalDue
union all select 3 uniqueId, 1 totalDue
union all select 4 uniqueId, 2 totalDue
union all select 5 uniqueId, 2 totalDue
) x
group by uniqueId
Runnable Example: http://sqlfiddle.com/#!2/d41d8/21263