Converting counts inside query result tables to percentages of total - postgresql

I have a table and want to calculate, for each (category_id, store_id) subtotal, the percentage it represents of the total for that store_id. My code is below:
WITH example_table (name, store_id) AS
(
    select name, store_id
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
)
SELECT name, store_id,
       cast(count(*) as numeric) / (SELECT count(*) FROM example_table)
FROM example_table
GROUP BY name, store_id
ORDER BY name, store_id
This code runs without throwing an error, but it doesn't produce the results I'm looking for: each subtotal is divided by the grand total across both stores and all 16 names. Instead, I want each subtotal divided by its respective store total, or by its respective name total.
I'm wondering how to perform calculations on those subtotals in general.
Thanks in advance,

I believe you need to explore using aggregate functions combined with an OVER(PARTITION BY ...) clause, e.g.
SELECT DISTINCT
       name, store_id, store_id_count, name_count
FROM (
    select name, store_id
         , count(*) over(partition by store_id) as store_id_count
         , count(*) over(partition by name) as name_count
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
) AS example_table
When you use an aggregate function with the OVER clause, you get the wanted counts on every row of the result, which appears to be what you need here. Note that SELECT DISTINCT is used simply to reduce the final number of rows returned; you might still need a GROUP BY, but I am not sure that you do.
Once you have the needed values within the derived table (aliased as example_table), it should be a simple matter of some arithmetic in the overall select clause.
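For example, a minimal sketch of that arithmetic, which adds a per-(name, store_id) subtotal via another window count so each group can be divided by its store total or its name total (the pct_ column names are just illustrative):
SELECT DISTINCT
       name,
       store_id,
       -- subtotal for this (name, store_id) as a share of the store total
       round(100.0 * pair_count / store_id_count, 2) AS pct_of_store,
       -- or as a share of the name total
       round(100.0 * pair_count / name_count, 2) AS pct_of_name
FROM (
    select name, store_id
         , count(*) over(partition by name, store_id) as pair_count
         , count(*) over(partition by store_id) as store_id_count
         , count(*) over(partition by name) as name_count
    from category
    join film_category using (category_id)
    join film using (film_id)
    join inventory using (film_id)
    join rental using (inventory_id)
) AS example_table
ORDER BY name, store_id;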

Related

Postgres query filter by non column in table

I have a challenge that consists of filtering a query not by a value present in a table, but by a value that is retrieved by a function.
Let's consider a table that contains all sales in the database:
id, description, category, price, col1, ..., col n
I have a function that, given one sale, returns the similar sales (based on rules and business logic). This function queries all records in the sales table again and validates matches on some fields.
similar_sales(sale_id integer) -> returns integer[]
Now I need to list all similar sales for each row present in the sales table.
select s.id, similar_sales (s.id)
from sales s
But similar_sales can return null, and I am only interested in returning sales that have at least one similar sale.
select id, similar
from (
    select s.id, similar_sales(s.id) as similar
    from sales s
) q
where #similar > 1 (Pseudocode)
limit x
I can't put the LIMIT in the subquery because I don't know in advance which sales have similar ones and which don't.
I just want the subquery to work on a small set of rows rather than the entire table, to gain query performance (a pagination strategy).
You can try this:
select id, similar
from sales s
cross join lateral similar_sales(s.id) as similar
-- cardinality() works on arrays; NULL or empty results are filtered out here
where cardinality(similar) > 0
limit x
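For the pagination goal mentioned in the question, a keyset-style sketch (assuming sales.id is the paging key; :last_seen_id is a hypothetical parameter holding the last id of the previous page) typically keeps each page's scan small, because the function is evaluated row by row until the limit is reached:
select s.id, similar
from sales s
cross join lateral similar_sales(s.id) as similar
where s.id > :last_seen_id        -- hypothetical parameter: last id returned on the previous page
  and cardinality(similar) > 0
order by s.id
limit 20;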

ROW_NUMBER() OVER PARTITION optimization

I have the following query:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY Code ORDER BY Price ASC) as RowNum
    from Offers
) r
where RowNum = 1
The Offers table contains about 10 million records, but there are only ~4000 distinct codes. I need to get the row with the lowest price for each code, so there will be only ~4000 rows in the result.
I have an index on the (Code, Price) columns, with all other columns in the INCLUDE clause.
The query runs for 2 minutes, and if I look at the execution plan, I see an index scan with 10M actual rows. So I guess it scans the whole index to get the needed values.
Why does MSSQL do the whole index scan? Is it because the subquery needs all the data? How can I avoid this scan? Is there a SQL hint to process only the first row in each partition?
Is there another way to optimize such a query?
After trying multiple different solutions, I found that the fastest query uses a CROSS APPLY:
SELECT C.*
FROM (SELECT DISTINCT Code from Offers) A
CROSS APPLY (SELECT TOP 1 *
FROM Offers B
WHERE A.Code = B.Code
ORDER by Price) C
It takes ~1 second to run.
Try creating an index on ( Code, Price ) without including the other columns and then (assuming that there is a unique Id column):
select L.*
from Offers as L
inner join (
    select Id,
           Row_Number() over (partition by Code order by Price) as RN
    from Offers
) as R
    on R.Id = L.Id and R.RN = 1
An index scan on a smaller index ought to help.
A second guess would be to get the Id of the row with the lowest Price for each Code explicitly: get the distinct Code values, get the Id of the TOP 1 lowest-Price row for each Code (TOP 1 avoids problems with duplicate prices), then join back to Offers to get the complete rows. Again, the more compact index should help.
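A minimal sketch of that second approach, assuming (as above) that Offers has a unique Id column:
SELECT L.*
FROM Offers AS L
INNER JOIN (
    -- for each distinct Code, pick the Id of one lowest-priced row
    SELECT D.Code,
           (SELECT TOP 1 B.Id
            FROM Offers AS B
            WHERE B.Code = D.Code
            ORDER BY B.Price) AS Id
    FROM (SELECT DISTINCT Code FROM Offers) AS D
) AS R
    ON R.Id = L.Id;
With the narrow (Code, Price) index, each correlated TOP 1 should turn into a short seek per Code rather than a full scan.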
Not sure if you'll get any significant performance gains, but you may want to try the WITH TIES clause
Example
Select Top 1 with Ties *
From Offers
Order By Row_Number() over (Partition By Code Order By Price)

select last of an item for each user in postgres

I want to get the last entry for each user, but customer_id is a hash ('ASAG#...') and ordering by customer_id destroys the query. Is there an alternative?
Select Distinct On (l.customer_id)
l.customer_id
,l.created_at
,l.text
From likes l
Order By l.customer_id, l.created_at Desc
Your current query already appears to be working, q.v. here:
Demo
I don't know why your current query is not generating the results you would expect. It should return one distinct record for every customer, corresponding to the most recent one, given your ORDER BY clause.
In any case, if it does not do what you want, an alternative would be to use ROW_NUMBER() with a partition by customer. The inner query assigns a row number to each customer's records, with the value 1 going to the most recent record for each customer. The outer query then retains only that latest record.
SELECT t.customer_id,
       t.created_at,
       t.text
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) rn
    FROM likes
) t
WHERE t.rn = 1
To speed up the inner query which uses ROW_NUMBER() you can try adding a composite index on the customer_id and created_at columns:
CREATE INDEX yourIdx ON likes (customer_id, created_at);
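If the DISTINCT ON query from the question does return the right rows and the only objection is that the output comes back ordered by the hashed customer_id, one possible workaround (a sketch, reusing the same likes columns) is to wrap it in a subquery and re-order the outer query however you need:
SELECT customer_id, created_at, text
FROM (
    SELECT DISTINCT ON (l.customer_id)
           l.customer_id,
           l.created_at,
           l.text
    FROM likes l
    ORDER BY l.customer_id, l.created_at DESC
) latest
ORDER BY created_at DESC;  -- re-order the one-row-per-customer result as desired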

Inner join with count and group by

I have 2 tables
Timetable :
pupil_id, staff_id, subject, lesson_id
Staff_info :
staff_id, surname
The timetable table contains thousands of rows because each student's ID is listed under each period they take.
I want to list all the teachers' names and the number of lessons each one does (a count), so I have to use SELECT with DISTINCT.
SELECT DISTINCT TIMETABLE.STAFF_ID,
COUNT(TIMETABLE.LESSON_ID),
STAFF.SURNAME
FROM STAFF
INNER JOIN TIMETABLE ON TIMETABLE.STAFF_ID = STAFF.STAFF_ID
GROUP BY TIMETABLE.STAFF_ID
However I get the error:
Column 'STAFF.SURNAME' is invalid in the select list because it is not
contained in either an aggregate function or the GROUP BY clause.
This should do what you want:
SELECT s.STAFF_ID, COUNT(tt.LESSON_ID),
s.SURNAME
FROM STAFF s INNER JOIN
TIMETABLE tt
ON tt.STAFF_ID = s.STAFF_ID
GROUP BY s.STAFF_ID, s.SURNAME;
Notes:
You don't need DISTINCT unless there are duplicates in either table. That seems unlikely with this data structure, but if a staff member could have two of the same lesson, you would use COUNT(DISTINCT tt.LESSON_ID).
Table aliases make the query easier to write and to read.
You should include STAFF.SURNAME in the GROUP BY as well as the id.
I have a preference for taking the STAFF_ID column from the table where it is the primary key.
If you wanted staff with no lessons, you would change the INNER JOIN to a LEFT JOIN, as sketched below.
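A minimal sketch of that LEFT JOIN variant; counting tt.LESSON_ID (rather than *) means staff with no timetable rows get a count of 0:
SELECT s.STAFF_ID, s.SURNAME,
       COUNT(tt.LESSON_ID) AS LESSON_COUNT  -- counts only matched rows, so unmatched staff show 0
FROM STAFF s
LEFT JOIN TIMETABLE tt
    ON tt.STAFF_ID = s.STAFF_ID
GROUP BY s.STAFF_ID, s.SURNAME;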
SELECT T.STAFF_ID,
T.CNT,
S.SURNAME
FROM STAFF S
JOIN (
SELECT STAFF_ID, CNT = COUNT(/*DISTINCT*/ LESSON_ID)
FROM TIMETABLE
GROUP BY STAFF_ID
) T ON T.STAFF_ID = S.STAFF_ID
Another option:
SELECT DISTINCT si.staff_id, surname, COUNT(lesson_id) OVER(PARTITION BY staff_Id)
FROM Staff_info si
INNER JOIN Timetable tt ON si.staff_id = tt.staff_id
When you use an aggregate function (COUNT, SUM, MIN, MAX, AVG) in the SELECT list, any other columns in the SELECT list that are not inside an aggregate function must also appear in the GROUP BY clause. So you need to change your query as follows and add STAFF.SURNAME to the GROUP BY clause too:
SELECT TIMETABLE.STAFF_ID,
COUNT(TIMETABLE.LESSON_ID),
STAFF.SURNAME
FROM STAFF
INNER JOIN TIMETABLE ON TIMETABLE.STAFF_ID = STAFF.STAFF_ID
GROUP BY TIMETABLE.STAFF_ID, STAFF.SURNAME
DISTINCT is also unnecessary in your scenario. And since you are only going to show the teachers' surnames and the lesson counts, you do not need TIMETABLE.STAFF_ID in the SELECT list, but it should remain in the GROUP BY clause so that teachers who share a surname are not merged into one row.
SELECT COUNT(TIMETABLE.LESSON_ID),
STAFF.SURNAME
FROM STAFF
INNER JOIN TIMETABLE ON TIMETABLE.STAFF_ID = STAFF.STAFF_ID
GROUP BY TIMETABLE.STAFF_ID, STAFF.SURNAME
You may need to take a look at this W3C post for more info

SQL SELECT in another table with most recent date

I have a list of Matter data in Table1 that I need to query, and I also need to get the most recent Invoice Number in Table2 that is tied to the original Matter. I'm having extreme difficulty joining these tables and getting only one result for each Matter, since I only want the most recent Invoice #.
Any and all help would be greatly appreciated.
(Sample data for Table1 and Table2, and the expected RESULT, were shown as images in the original post.)
The following assigns numbers to each invoice row in order of date, and selects only the most recent. Note that this assumes InvoiceDate is stored as a date, datetime, or something else that will sort chronologically, and that in the event of two invoices for the same date, returning either will be fine. If you need to return both invoices in the event of ties, replace row_number with rank.
Select *
from Table1 a
inner join (
    Select *,
           row_number() over (partition by MatterID order by InvoiceDate desc) as RN
    from Table2
) b
    on a.MatterID = b.MatterID and b.RN = 1
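For the tie case mentioned above, the rank() variant might look like this (the same query, with only the window function changed):
Select *
from Table1 a
inner join (
    Select *,
           rank() over (partition by MatterID order by InvoiceDate desc) as RN
    from Table2
) b
    on a.MatterID = b.MatterID and b.RN = 1  -- rank() assigns 1 to every invoice tied for the latest date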