How do I do LIMIT within GROUP in the same table? - postgresql

I can't figure out how to do limit within group although I've read all similar questions here. Reading PSQL doc didn't help either :( Consider the following:
CREATE TABLE article_relationship
(
article_from INT NOT NULL,
article_to INT NOT NULL,
score INT
);
I want to get a list of top 5 related articles per given article IDs sorted by score.
Here is what I tried:
select DISTINCT o.article_from
from article_relationship o
join lateral (
select i.article_from, i.article_to, i.score from article_relationship i
order by score desc
limit 5
) p on p.article_from = o.article_from
where o.article_from IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.article_from;
And it returns nothing. I was under impression that outer query is like loop so I guess I only need source IDs there.
Also what if I want to join on articles table where there are columns id and title and get titles of related articles in resultset?
I added join in inner query:
select o.id, p.*
from articles o
join lateral (
select a.title, i.article_from, i.article_to, i.score
from article_relationship i
INNER JOIN articles a on a.id = i.article_to
where i.article_from = o.id
order by score desc
limit 5
) p on true
where o.id IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.id;
But it made it very very slow.

The problem with no rows returning from your query is that your join condition is wrong: ON p.article_from = o.article_from; this should obviously be ON p.article_from = o.article_to.
That issue aside, your query will not return the top 5 scoring relations per article id; instead it will return the article IDs that reference one of the 5 top rated referenced articles throughout the table and (also) at least 1 of the 5 referenced articles for which you specify the id.
You can get the top 5 rated referenced articles per referencing article with a window function to rank the scores in a sub-select and then select only the top 5 in the main query. Specifying a list of referenced article IDs effectively means that you will rank how these referenced articles are scored for each referencing article:
SELECT article_from, article_to, score
FROM (
SELECT article_from, article_to, score,
rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
FROM article_relationship
WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
WHERE rnk < 6
ORDER BY article_from, score DESC;
This is different from your code in that it returns up to 5 records for each article_from but it is consistent with your initial description.
Adding columns from table articles is trivially done in the main query:
SELECT a.article_from, a.article_to, a.score, articles.*
FROM (
SELECT article_from, article_to, score,
rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
FROM article_relationship
WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
JOIN articles ON articles.id = a.article_to
WHERE a.rnk < 6
ORDER BY a.article_from, a.score DESC;

Version with join lateral
select o.id as from_id, p.article_to as to_id, a.title, a.journal_id, a.pub_date_p from articles o
join lateral (
select i.article_to from article_relationship i
where i.article_from = o.id
order by score desc
limit 5
) p on true
INNER JOIN articles a on a.id = p.article_to
where o.id IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.id;

Related

Performing a batch select latest n in Postgres?

Suppose I have query for fetching the latest 10 books for a given author like this:
SELECT *
FROM books
WHERE author_id = #author_id
ORDER BY published DESC, id
LIMIT 10
Now if I have a list of n authors I want to get the latest books for, then I can run this query n times. Note that n is reasonably small. However, this seems like an optimization opportunity.
Is there are single query that can efficiently fetch the latest 10 books for n given authors?
This query doesn't work (only fetches 10, not n * 10 books):
SELECT *
FROM books
WHERE author_id = ANY(#author_ids)
ORDER BY published DESC, id
LIMIT 10
First provided author wise book where book is serialized by recent published date for generating a number using ROW_NUMBER() and then in outer subquery add a condition for fetching the desired result.
SELECT *
FROM (SELECT *
, ROW_NUMBER() OVER (PARTITION BY author_id ORDER BY published DESC) row_num
FROM books
WHERE author_id = ANY(#author_ids)) t
WHERE t.row_num <= 10
SELECT b.*
FROM authors a
JOIN LATERAL (
SELECT *
FROM books b
WHERE b.author = a.id
ORDER BY b.published DESC, b.id
LIMIT 10
) b ON TRUE
WHERE a.id = ANY(#author_ids)

How can I sort rows by number of corresponding rows in different table?

I'm making a small-scale reddit clone. There is a table for posts, a table for comments (relevant only for context), and a table for posts_comments. I'm trying to sort posts by the number of comments the post has.
This is the init for the posts_comments table
CREATE TABLE posts_comments (
id SERIAL PRIMARY KEY,
parent_id INTEGER,
comment_id INTEGER,
post_id INTEGER
)
This is the call I have, but it doesn't seem right
SELECT * FROM posts p
JOIN posts_comments pc ON p.id = pc.post_id
ORDER BY (SELECT COUNT(*) FROM pc WHERE pc.post_id = p.id) DESC
LIMIT $1
OFFSET $2
I want the output to be a list of posts sorted by the number of comments linked to that post
maybe like this:
SELECT
COUNT(pc.post_id) OVER (PARTITION BY p.id) AS num_comments
,* FROM posts p
LEFT OUTER JOIN posts_comments pc ON p.id = pc.post_id
ORDER BY 1 DESC
LIMIT $1
OFFSET $2
of it you only want the list of posts and not the comments.
SELECT
COUNT(pc.post_id) AS num_comments
,p.* FROM posts p
LEFT OUTER JOIN posts_comments pc ON p.id = pc.post_id
GROUP BY p.id
ORDER BY 1 DESC
LIMIT $1
OFFSET $2

Can't solve this SQL query

I have a difficulty dealing with a SQL query. I use PostgreSQL.
The query says: Show the customers that have done at least an order that contains products from 3 different categories. The result will be 2 columns, CustomerID, and the amount of orders. I have written this code but I don't think it's correct.
select SalesOrderHeader.CustomerID,
count(SalesOrderHeader.SalesOrderID) AS amount_of_orders
from SalesOrderHeader
inner join SalesOrderDetail on
(SalesOrderHeader.SalesOrderID=SalesOrderDetail.SalesOrderID)
inner join Product on
(SalesOrderDetail.ProductID=Product.ProductID)
where SalesOrderDetail.SalesOrderDetailID in
(select DISTINCT count(ProductCategoryID)
from Product
group by ProductCategoryID
having count(DISTINCT ProductCategoryID)>=3)
group by SalesOrderHeader.CustomerID;
Here are the database tables needed for the query:
where SalesOrderDetail.SalesOrderDetailID in
(select DISTINCT count(ProductCategoryID)
Is never going to give you a result as an ID (SalesOrderDetailID) will never logically match a COUNT (count(ProductCategoryID)).
This should get you the output I think you want.
SELECT soh.CustomerID, COUNT(soh.SalesOrderID) AS amount_of_orders
FROM SalesOrderHeader soh
INNER JOIN SalesOrderDetail sod ON soh.SalesOrderID = sod.SalesOrderID
INNER JOIN Product p ON sod.ProductID = p.ProductID
HAVING COUNT(DISTINCT p.ProductCategoryID) >= 3
GROUP BY soh.CustomerID
Try this :
select CustomerID,count(*) as amount_of_order from
SalesOrder join
(
select SalesOrderID,count(distinct ProductCategoryID) CategoryCount
from SalesOrderDetail JOIN Product using (ProductId)
group by 1
) CatCount using (SalesOrderId)
group by 1
having bool_or(CategoryCount>=3) -- At least on CategoryCount>=3

SQL Server 2012 Passing parameter from main query to the Joined subquery

I need to select some settings from some joined tables, but only if Items ORDER BY EndTime DESC ItemID is among first 1000 Items.
Do do this I built the following Query that, although surely can be improved, works:
SELECT ss.ModuleCode, ss.MaxItems , w.*
FROM Subscriptions ss
JOIN Sellers s ON s.UID=ss.UID
JOIN Items i ON s.UserID=i.UserID
JOIN Items ii ON i.ItemID=ii.ItemID
JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
FULL JOIN GoogleFonts f ON f.FontCode=a.FontFamily
JOIN ( SELECT
ItemID
FROM Items
WHERE UserID=#UserID
ORDER BY EndTime DESC
OFFSET 0 ROWS
FETCH FIRST (1000) ROWS ONLY
) it ON it.ItemID=i.ItemID
WHERE it.ItemID=#ItemID
AND .....
but since MaxItems is not always 1000 and its value is defined by ss.MaxItems,
I would replace the fixed value of 1000 with the dynamic value of ss.MaxItems, but I haven't find a way to do it:
Although not optimal since makes the query much heavier, I tried putting instead of 1000 a further query with this result:
SELECT ss.ModuleCode, ss.MaxItems , w.*
FROM Subscriptions ss
JOIN Sellers s ON s.UID=ss.UID
JOIN Items i ON s.UserID=i.UserID
JOIN Items ii ON i.ItemID=ii.ItemID
JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
FULL JOIN GoogleFonts f ON f.FontCode=a.FontFamily
JOIN ( SELECT
ItemID
FROM Items
WHERE UserID=#UserID
ORDER BY EndTime DESC
OFFSET 0 ROWS
FETCH FIRST ( SELECT ss.MaxItems
FROM Subscriptions ss
JOIN Sellers s ON s.UID=ss.UID
JOIN Items i ON s.UserID=i.UserID
JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
WHERE i.ItemID=#ItemID) ROWS ONLY
) it ON it.ItemID=i.ItemID
Where it.ItemID=#ItemID
AND .....
but since this returns more than 1 value it is not accepted: limiting to TOP 1 result the latest subquery will work but will not be fully dynamic as required.
Can suggest how to solve or at least suggest the path for the solution?
Thanks!
Instead of fetch use row_number:
JOIN (SELECT ItemID, ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EndTime) as seqnum
FROM Items it
WHERE UserID = #UserID
) it
ON it.ItemID = i.ItemID AND seqnum <= ss.maxitems

T-SQL Subquery on Latest Date with InnerJoin

I want to do a similiar thing like this guy:
T-SQL Subquery Max(Date) and Joins
I have to do this with an n:m relation.
So the layout is:
tbl_Opportunity
tbl_Opportunity_tbl_OpportunityData
tbl_OpportunityData
So as you see there is an intersection table which connects opportunity with opportunitydata.
For every opportunity there are multiple opportunity datas. In my view i only want a list with all opportunites and the data from the latest opportunity datas.
I tried something like this:
SELECT
dbo.tbl_Opportunity.Id, dbo.tbl_Opportunity.Subject,
dbo.tbl_User.UserName AS Responsible, dbo.tbl_Contact.Name AS Customer,
dbo.tbl_Opportunity.CreationDate, dbo.tbl_Opportunity.ActionDate AS [Planned Closure],
dbo.tbl_OpportunityData.Volume,
dbo.tbl_OpportunityData.ChangeDate, dbo.tbl_OpportunityData.Chance
FROM
dbo.tbl_Opportunity
INNER JOIN
dbo.tbl_User ON dbo.tbl_Opportunity.Creator = dbo.tbl_User.Id
INNER JOIN
dbo.tbl_Contact ON dbo.tbl_Opportunity.Customer = dbo.tbl_Contact.Id
INNER JOIN
dbo.tbl_Opprtnty_tbl_OpprtnityData ON dbo.tbl_Opportunity.Id = dbo.tbl_Opprtnty_tbl_OpprtnityData.Id
INNER JOIN
dbo.tbl_OpportunityData ON dbo.tbl_Opprtnty_tbl_OpprtnityData.Id2 = dbo.tbl_OpportunityData.Id
The problem is my view now includes a row for every opportunity data, since I don't know how to filter that I only want the latest data.
Can you help me? is my problem description clear enough?
thank you in advance :-)
best wishes,
laurin
; WITH Base AS (
SELECT dbo.tbl_Opportunity.Id, dbo.tbl_Opportunity.Subject, dbo.tbl_User.UserName AS Responsible, dbo.tbl_Contact.Name AS Customer,
dbo.tbl_Opportunity.CreationDate, dbo.tbl_Opportunity.ActionDate AS [Planned Closure], dbo.tbl_OpportunityData.Volume,
dbo.tbl_OpportunityData.ChangeDate, dbo.tbl_OpportunityData.Chance
FROM dbo.tbl_Opportunity INNER JOIN
dbo.tbl_User ON dbo.tbl_Opportunity.Creator = dbo.tbl_User.Id INNER JOIN
dbo.tbl_Contact ON dbo.tbl_Opportunity.Customer = dbo.tbl_Contact.Id INNER JOIN
dbo.tbl_Opprtnty_tbl_OpprtnityData ON dbo.tbl_Opportunity.Id = dbo.tbl_Opprtnty_tbl_OpprtnityData.Id INNER JOIN
dbo.tbl_OpportunityData ON dbo.tbl_Opprtnty_tbl_OpprtnityData.Id2 = dbo.tbl_OpportunityData.Id
)
, OrderedByDate AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ChangeDate DESC) RN FROM Base
)
SELECT * FROM OrderedByDate WHERE RN = 1
To make it more readable I'm using CTE (the WITH part). In the end the real "trick" is doing a ROW_NUMBER() partitioning the data by tbl_Opportunity.Id and ordering the partitions by ChangeDate DESC (and I call it RN). Clearly the maximum date in each partition will be RN = 1 and then we filter it by RN.
Without using CTE it will be something like this:
SELECT * FROM (
SELECT dbo.tbl_Opportunity.Id, dbo.tbl_Opportunity.Subject, dbo.tbl_User.UserName AS Responsible, dbo.tbl_Contact.Name AS Customer,
dbo.tbl_Opportunity.CreationDate, dbo.tbl_Opportunity.ActionDate AS [Planned Closure], dbo.tbl_OpportunityData.Volume,
dbo.tbl_OpportunityData.ChangeDate, dbo.tbl_OpportunityData.Chance,
ROW_NUMBER() OVER (PARTITION BY dbo.tbl_Opportunity.Id ORDER BY dbo.tbl_OpportunityData.ChangeDate DESC) RN
FROM dbo.tbl_Opportunity INNER JOIN
dbo.tbl_User ON dbo.tbl_Opportunity.Creator = dbo.tbl_User.Id INNER JOIN
dbo.tbl_Contact ON dbo.tbl_Opportunity.Customer = dbo.tbl_Contact.Id INNER JOIN
dbo.tbl_Opprtnty_tbl_OpprtnityData ON dbo.tbl_Opportunity.Id = dbo.tbl_Opprtnty_tbl_OpprtnityData.Id INNER JOIN
dbo.tbl_OpportunityData ON dbo.tbl_Opprtnty_tbl_OpprtnityData.Id2 = dbo.tbl_OpportunityData.Id
) AS Base WHERE RN = 1
The statement can be simplified for one more step further:
SELECT TOP 1 WITH TIES
dbo.tbl_Opportunity.Id, dbo.tbl_Opportunity.Subject, dbo.tbl_User.UserName AS Responsible,
dbo.tbl_Contact.Name AS Customer, dbo.tbl_Opportunity.CreationDate,
dbo.tbl_Opportunity.ActionDate AS [Planned Closure], dbo.tbl_OpportunityData.Volume,
dbo.tbl_OpportunityData.ChangeDate, dbo.tbl_OpportunityData.Chance
FROM
dbo.tbl_Opportunity INNER JOIN
dbo.tbl_User ON dbo.tbl_Opportunity.Creator = dbo.tbl_User.Id INNER JOIN
dbo.tbl_Contact ON dbo.tbl_Opportunity.Customer = dbo.tbl_Contact.Id INNER JOIN
dbo.tbl_Opprtnty_tbl_OpprtnityData ON dbo.tbl_Opportunity.Id = dbo.tbl_Opprtnty_tbl_OpprtnityData.Id INNER JOIN
dbo.tbl_OpportunityData ON dbo.tbl_Opprtnty_tbl_OpprtnityData.Id2 = dbo.tbl_OpportunityData.Id
ORDER BY
ROW_NUMBER() OVER (PARTITION BY dbo.tbl_Opportunity.Id ORDER BY dbo.tbl_OpportunityData.ChangeDate DESC);