The query:
SELECT COUNT(*) as count_all,
posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id;
returns n records in PostgreSQL:
count_all | post_id
-----------+---------
1 | 6
3 | 4
3 | 5
3 | 1
1 | 9
1 | 10
(6 rows)
I just want to retrieve the number of records returned: 6.
I used a subquery to achieve what I want, but this doesn't seem optimal:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
How can I get the number of records directly in PostgreSQL in this context?
I think you just need SELECT COUNT(DISTINCT post_id) FROM votes.
See "4.2.7. Aggregate Expressions" section in http://www.postgresql.org/docs/current/static/sql-expressions.html.
EDIT: Corrected my careless mistake per Erwin's comment.
There is also EXISTS:
SELECT count(*) AS post_ct
FROM posts p
WHERE EXISTS (SELECT FROM votes v WHERE v.post_id = p.id);
In Postgres and with multiple entries on the n-side like you probably have, it's generally faster than count(DISTINCT post_id):
SELECT count(DISTINCT p.id) AS post_ct
FROM posts p
JOIN votes v ON v.post_id = p.id;
The more rows per post there are in votes, the bigger the difference in performance. Test with EXPLAIN ANALYZE.
count(DISTINCT post_id) has to read all rows, sort or hash them, and then only consider the first per identical set. EXISTS will only scan votes (or, preferably, an index on post_id) until the first match is found.
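To see that the two forms agree on the result (not on speed), here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres. The table contents are made up to mirror the vote counts in the question, and SQLite needs SELECT 1 inside EXISTS because it does not accept an empty select list:

```python
import sqlite3

# Hypothetical miniature of the posts / votes tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY);
    CREATE TABLE votes (
        vote_id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id)
    );
""")
conn.executemany("INSERT INTO posts (id) VALUES (?)", [(i,) for i in range(1, 11)])

# Votes per post as in the question's output: post_id -> number of votes.
votes_per_post = {6: 1, 4: 3, 5: 3, 1: 3, 9: 1, 10: 1}
conn.executemany(
    "INSERT INTO votes (post_id) VALUES (?)",
    [(post_id,) for post_id, n in votes_per_post.items() for _ in range(n)],
)

# EXISTS form: can stop at the first matching vote for each post.
exists_ct = conn.execute("""
    SELECT count(*) FROM posts p
    WHERE EXISTS (SELECT 1 FROM votes v WHERE v.post_id = p.id)
""").fetchone()[0]

# count(DISTINCT ...) form: joins every vote, then de-duplicates.
distinct_ct = conn.execute("""
    SELECT count(DISTINCT p.id) FROM posts p
    JOIN votes v ON v.post_id = p.id
""").fetchone()[0]

print(exists_ct, distinct_ct)
```

Both counts come out as 6 for this data. Timings from SQLite say nothing about Postgres, so for the performance claim you still want EXPLAIN ANALYZE on your own tables.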
If every post_id in votes is guaranteed to be present in the table posts (referential integrity enforced with a foreign key constraint), this short form is equivalent to the longer form:
SELECT count(DISTINCT post_id) AS post_ct
FROM votes;
This may actually be faster than the EXISTS query with no or few entries per post.
The query you had works in simpler form, too:
SELECT count(*) AS post_ct
FROM (
SELECT FROM posts
JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) sub;
Benchmark
To verify my claims I ran a benchmark on my test server with limited resources. All in a separate schema:
Test setup
Fake a typical post / vote situation:
CREATE SCHEMA y;
SET search_path = y;
CREATE TABLE posts (
id int PRIMARY KEY
, post text
);
INSERT INTO posts
SELECT g, repeat(chr(g%100 + 32), (random()* 500)::int) -- random text
FROM generate_series(1,10000) g;
DELETE FROM posts WHERE random() > 0.9; -- create ~ 10 % dead tuples
CREATE TABLE votes (
vote_id serial PRIMARY KEY
, post_id int REFERENCES posts(id)
, up_down bool
);
INSERT INTO votes (post_id, up_down)
SELECT g.*
FROM (
SELECT ((random()* 21)^3)::int + 1111 AS post_id -- uneven distribution
, random()::int::bool AS up_down
FROM generate_series(1,70000)
) g
JOIN posts p ON p.id = g.post_id;
All of the following queries returned the same result (8093 of 9107 posts had votes).
I ran 4 tests with EXPLAIN ANALYZE and took the best of five on Postgres 9.1.4 with each of the three queries, and appended the resulting total runtimes.
1. As is.
2. After:
ANALYZE posts;
ANALYZE votes;
3. After:
CREATE INDEX foo ON votes(post_id);
4. After:
VACUUM FULL ANALYZE posts;
CLUSTER votes USING foo;
count(*) ... WHERE EXISTS
1. 253 ms
2. 220 ms
3. 85 ms -- winner (seq scan on posts, index scan on votes, nested loop)
4. 85 ms
count(DISTINCT x) - long form with join
1. 354 ms
2. 358 ms
3. 373 ms -- (index scan on posts, index scan on votes, merge join)
4. 330 ms
count(DISTINCT x) - short form without join
1. 164 ms
2. 164 ms
3. 164 ms -- (always seq scan)
4. 142 ms
Best time for original query in question:
353 ms
For simplified version:
348 ms
@wildplasser's query with a CTE uses the same plan as the long form (index scan on posts, index scan on votes, merge join) plus a little overhead for the CTE. Best time:
366 ms
Index-only scans in the upcoming PostgreSQL 9.2 can improve the result for each of these queries, most of all for EXISTS.
Related, more detailed benchmark for Postgres 9.5 (actually retrieving distinct rows, not just counting):
Select first row in each GROUP BY group?
Using OVER() and LIMIT 1:
SELECT COUNT(1) OVER()
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
LIMIT 1;
WITH uniq AS (
SELECT DISTINCT posts.id as post_id
FROM posts
JOIN votes ON votes.post_id = posts.id
-- GROUP BY not needed anymore
-- GROUP BY posts.id
)
SELECT COUNT(*)
FROM uniq;
For followers, I like the OP's inner query method:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
) as x;
because you can then use HAVING in there as well:
SELECT COUNT(*) FROM (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id HAVING count(*) > 1
) as x;
Or the equivalent CTE:
with posts_coalesced as (
SELECT COUNT(*) as count_all, posts.id as post_id
FROM posts
INNER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id )
select count(*) from posts_coalesced;
Related
Suppose I have a query for fetching the latest 10 books for a given author like this:
SELECT *
FROM books
WHERE author_id = #author_id
ORDER BY published DESC, id
LIMIT 10
Now if I have a list of n authors I want to get the latest books for, then I can run this query n times. Note that n is reasonably small. However, this seems like an optimization opportunity.
Is there a single query that can efficiently fetch the latest 10 books for each of n given authors?
This query doesn't work (only fetches 10, not n * 10 books):
SELECT *
FROM books
WHERE author_id = ANY(#author_ids)
ORDER BY published DESC, id
LIMIT 10
First, number each author's books by most recent publication date using ROW_NUMBER(); then, in the outer query, add a condition on that number to fetch the desired result.
SELECT *
FROM (SELECT *
, ROW_NUMBER() OVER (PARTITION BY author_id ORDER BY published DESC) row_num
FROM books
WHERE author_id = ANY(#author_ids)) t
WHERE t.row_num <= 10
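A quick way to try the ROW_NUMBER() approach without a Postgres instance is Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). This is only a sketch: the sample rows and the helper name latest_books are made up, IN (...) with placeholders stands in for = ANY(...), and id is added as a tie-breaker as in the question's ORDER BY:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, published TEXT)"
)
# Made-up sample data: (id, author_id, published).
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    (1, 1, "2020-01-01"), (2, 1, "2021-01-01"), (3, 1, "2022-01-01"),
    (4, 2, "2019-05-01"), (5, 2, "2019-05-01"), (6, 2, "2018-01-01"),
    (7, 3, "2023-01-01"),
])


def latest_books(conn, author_ids, per_author):
    """Return the newest `per_author` books for each author in `author_ids`."""
    placeholders = ", ".join("?" for _ in author_ids)
    sql = f"""
        SELECT id, author_id, published
        FROM (
            SELECT *, ROW_NUMBER() OVER (
                       PARTITION BY author_id
                       ORDER BY published DESC, id) AS row_num
            FROM books
            WHERE author_id IN ({placeholders})
        ) t
        WHERE row_num <= ?
        ORDER BY author_id, row_num
    """
    return conn.execute(sql, (*author_ids, per_author)).fetchall()


result = latest_books(conn, [1, 2], 2)
print(result)
```

With the sample rows this returns books 3 and 2 for author 1 and books 4 and 5 for author 2; author 3 is filtered out before the numbering happens.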
SELECT b.*
FROM authors a
JOIN LATERAL (
SELECT *
FROM books b
WHERE b.author_id = a.id
ORDER BY b.published DESC, b.id
LIMIT 10
) b ON TRUE
WHERE a.id = ANY(#author_ids)
Let's say I want the ids that are counted by my query, and then want to check whether the same ids appear in a different month.
Here I join the tables on distinct ids and count the returned rows to know how many matching ids I have. This is for the month of June.
I'd like:
e.g. in June, 100 distinct ids
e.g. in July, 90 of the same ids left
Please help!
I am stuck, as my SQL is not very advanced.
with total as (
select distinct(transactions.u_id), count(*)
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' AND '2020-06-30'
group by transactions.u_id
HAVING COUNT(*) > 1
)
SELECT
COUNT(*)
FROM
total
Let's say you are interested in a query like this:
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' and '2020-06-30'
group by transactions.u_id;
You are also interested in the same query for a second month:
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-08-01' and '2020-08-30'
group by transactions.u_id;
And you want to get:
- the ids which are returned by both queries
- the smaller of the two totals
Then you can do something like this:
select t1.u_id,
case
when t1.total > t2.total then t2.total
else t1.total
end as total
from (
-- totals per id for the first month
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-06-01' and '2020-06-30'
group by transactions.u_id) t1
join (
-- totals per id for the second month
select transactions.u_id, count(*) as total
from transactions
join contacts using (u_id)
join table using (contact_id)
where transactions.when_created between '2020-08-01' and '2020-08-30'
group by transactions.u_id) t2
on t1.u_id = t2.u_id
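If you want to try the idea of joining two per-month subqueries on a toy data set first, a rough sketch with Python's built-in sqlite3 module could look like this. Everything here is an assumption for illustration: the joins to contacts and the third table are left out, the rows are invented, the helper ids_in_both is hypothetical, and half-open date ranges (>= start, < next month) are used so the last day of the month is not cut off:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (u_id INTEGER, when_created TEXT)")
# Invented sample data: (u_id, when_created as ISO date text).
conn.executemany("INSERT INTO transactions VALUES (?, ?)", [
    (1, "2020-06-03"), (1, "2020-06-10"), (1, "2020-06-21"), (1, "2020-07-02"),
    (2, "2020-06-05"), (2, "2020-06-06"),
    (3, "2020-06-30"), (3, "2020-07-01"), (3, "2020-07-09"),
    (3, "2020-07-15"), (3, "2020-07-31"),
    (4, "2020-07-04"), (4, "2020-07-20"),
])

# Per-id totals for one period; the two ? are start (inclusive) and end (exclusive).
MONTH_TOTALS = """
    SELECT u_id, count(*) AS total
    FROM transactions
    WHERE when_created >= ? AND when_created < ?
    GROUP BY u_id
"""


def ids_in_both(conn, first, second):
    """Ids present in both periods, with the smaller of the two totals."""
    sql = f"""
        SELECT t1.u_id,
               CASE WHEN t1.total > t2.total THEN t2.total ELSE t1.total END AS total
        FROM ({MONTH_TOTALS}) t1
        JOIN ({MONTH_TOTALS}) t2 ON t1.u_id = t2.u_id
        ORDER BY t1.u_id
    """
    return conn.execute(sql, (*first, *second)).fetchall()


both = ids_in_both(conn, ("2020-06-01", "2020-07-01"), ("2020-07-01", "2020-08-01"))
print(both, len(both))
```

Here ids 1 and 3 show up in both months, so len(both) is the "how many of the June ids are left in July" number the question asks for.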
I have the SQL query below to get three records for notification purposes.
SELECT orders.msg
FROM orders
INNER JOIN
(
SELECT id
FROM orders
WHERE type_id = 12
ORDER BY id DESC LIMIT 3 OFFSET 0
) AS items
ON orders.id = items.id;
When trying to optimize the query, I made the changes below.
SELECT orders.msg
FROM orders
WHERE type_id = 12
ORDER BY id DESC LIMIT 3 OFFSET 0;
Does the modified query look OK, did I miss anything, or is there another way of doing this?
The simplified version on the bottom looks logically identical, to me, to the one on top:
SELECT msg
FROM orders
WHERE type_id = 12
ORDER BY id DESC LIMIT 3;
Note that the above query could benefit from the following index:
CREATE INDEX idx ON orders (type_id, id, msg);
This index would completely cover the WHERE, ORDER BY, and SELECT clauses.
You can try this also:
SELECT orders.msg
FROM orders
WHERE orders.id
IN (
SELECT id
FROM orders
WHERE type_id = 12
ORDER BY id
DESC LIMIT 3 OFFSET 0
)
I'm making a small-scale reddit clone. There is a table for posts, a table for comments (relevant only for context), and a table for posts_comments. I'm trying to sort posts by the number of comments the post has.
This is the init for the posts_comments table
CREATE TABLE posts_comments (
id SERIAL PRIMARY KEY,
parent_id INTEGER,
comment_id INTEGER,
post_id INTEGER
)
This is the call I have, but it doesn't seem right
SELECT * FROM posts p
JOIN posts_comments pc ON p.id = pc.post_id
ORDER BY (SELECT COUNT(*) FROM pc WHERE pc.post_id = p.id) DESC
LIMIT $1
OFFSET $2
I want the output to be a list of posts sorted by the number of comments linked to that post
Maybe like this:
SELECT
COUNT(pc.post_id) OVER (PARTITION BY p.id) AS num_comments
,* FROM posts p
LEFT OUTER JOIN posts_comments pc ON p.id = pc.post_id
ORDER BY 1 DESC
LIMIT $1
OFFSET $2
Or, if you only want the list of posts and not the comments:
SELECT
COUNT(pc.post_id) AS num_comments
,p.* FROM posts p
LEFT OUTER JOIN posts_comments pc ON p.id = pc.post_id
GROUP BY p.id
ORDER BY 1 DESC
LIMIT $1
OFFSET $2
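For what it's worth, the GROUP BY variant can be checked on a tiny data set with Python's built-in sqlite3 module. This is a sketch under assumptions: the posts table is reduced to id and title, the rows are invented, the helper posts_by_comment_count is hypothetical, and p.id is added as a tie-breaker so the paging order is stable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE posts_comments (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,
        comment_id INTEGER,
        post_id INTEGER
    );
    INSERT INTO posts (id, title) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO posts_comments (comment_id, post_id) VALUES
        (10, 2), (11, 2), (12, 2), (13, 1);
""")


def posts_by_comment_count(conn, limit, offset):
    """One row per post, most-commented first; posts without comments count as 0."""
    return conn.execute("""
        SELECT p.id, p.title, COUNT(pc.post_id) AS num_comments
        FROM posts p
        LEFT JOIN posts_comments pc ON pc.post_id = p.id
        GROUP BY p.id
        ORDER BY num_comments DESC, p.id
        LIMIT ? OFFSET ?
    """, (limit, offset)).fetchall()


page = posts_by_comment_count(conn, 10, 0)
print(page)
```

Post 3 has no comments and still appears with a count of 0, which is the reason for the LEFT JOIN and for counting pc.post_id rather than *.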
I can't figure out how to apply a limit within each group, although I've read all the similar questions here. Reading the PostgreSQL docs didn't help either :( Consider the following:
CREATE TABLE article_relationship
(
article_from INT NOT NULL,
article_to INT NOT NULL,
score INT
);
I want to get a list of the top 5 related articles for each of the given article IDs, sorted by score.
Here is what I tried:
select DISTINCT o.article_from
from article_relationship o
join lateral (
select i.article_from, i.article_to, i.score from article_relationship i
order by score desc
limit 5
) p on p.article_from = o.article_from
where o.article_from IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.article_from;
And it returns nothing. I was under the impression that the outer query works like a loop, so I guess I only need the source IDs there.
Also, what if I want to join on the articles table, which has the columns id and title, and get the titles of the related articles in the result set?
I added a join in the inner query:
select o.id, p.*
from articles o
join lateral (
select a.title, i.article_from, i.article_to, i.score
from article_relationship i
INNER JOIN articles a on a.id = i.article_to
where i.article_from = o.id
order by score desc
limit 5
) p on true
where o.id IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.id;
But that made the query very slow.
The problem with no rows returning from your query is that your join condition is wrong: ON p.article_from = o.article_from; this should obviously be ON p.article_from = o.article_to.
That issue aside, your query will not return the top 5 scoring relations per article id; instead it will return the article IDs that reference one of the 5 top rated referenced articles throughout the table and (also) at least 1 of the 5 referenced articles for which you specify the id.
You can get the top 5 rated referenced articles per referencing article with a window function to rank the scores in a sub-select and then select only the top 5 in the main query. Specifying a list of referenced article IDs effectively means that you will rank how these referenced articles are scored for each referencing article:
SELECT article_from, article_to, score
FROM (
SELECT article_from, article_to, score,
rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
FROM article_relationship
WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
WHERE rnk < 6
ORDER BY article_from, score DESC;
This is different from your code in that it returns up to 5 records for each article_from but it is consistent with your initial description.
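One caveat on "up to 5": rank() gives tied scores the same rank, so a filter like rnk < 6 can let through more than 5 rows per article_from when scores tie at the cut-off, while row_number() always caps the count but picks arbitrarily among ties. A small sketch with Python's built-in sqlite3 module (SQLite 3.25+ for window functions) shows the difference; the rows and the helper top_n are made up, and the cut-off is 2 instead of 5 to keep the data short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_relationship (
        article_from INT NOT NULL,
        article_to   INT NOT NULL,
        score        INT
    )
""")
# Invented data: articles 11 and 12 tie on score 8.
conn.executemany("INSERT INTO article_relationship VALUES (?, ?, ?)", [
    (1, 10, 9), (1, 11, 8), (1, 12, 8), (1, 13, 5),
])


def top_n(conn, func, n):
    """Top-n related articles per article_from using the given window function."""
    # func is interpolated into the SQL text, so only allow the two known names.
    if func not in ("rank", "row_number"):
        raise ValueError(f"unsupported window function: {func}")
    sql = f"""
        SELECT article_to, score
        FROM (
            SELECT article_to, score,
                   {func}() OVER (PARTITION BY article_from
                                  ORDER BY score DESC) AS rnk
            FROM article_relationship
        ) a
        WHERE rnk <= ?
        ORDER BY score DESC, article_to
    """
    return conn.execute(sql, (n,)).fetchall()


with_rank = top_n(conn, "rank", 2)
with_row_number = top_n(conn, "row_number", 2)
print(len(with_rank), len(with_row_number))
```

With this data rank() returns three rows for a top-2 request and row_number() returns two. Which behaviour you want depends on whether ties should all be shown or the list length must be fixed.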
Adding columns from table articles is trivially done in the main query:
SELECT a.article_from, a.article_to, a.score, articles.*
FROM (
SELECT article_from, article_to, score,
rank() OVER (PARTITION BY article_from ORDER BY score DESC) AS rnk
FROM article_relationship
WHERE article_to IN (18329382, 61913904, 66538293, 66540477, 66496909) ) a
JOIN articles ON articles.id = a.article_to
WHERE a.rnk < 6
ORDER BY a.article_from, a.score DESC;
Version with join lateral:
select o.id as from_id, p.article_to as to_id, a.title, a.journal_id, a.pub_date_p from articles o
join lateral (
select i.article_to from article_relationship i
where i.article_from = o.id
order by score desc
limit 5
) p on true
INNER JOIN articles a on a.id = p.article_to
where o.id IN (18329382, 61913904, 66538293, 66540477, 66496909)
order by o.id;