PostgreSQL: selecting with a limit on equal values?

I have one PostgreSQL table where I store stories from different sites.
The table has story_id and site_id fields,
where story_id is the primary key and site_id is the id of the site the story came from.
I need to SELECT the latest 30 added stories from this table,
but I don't want more than 2 stories coming from the same site.
So if I have something like this:
story_id | site_id
---------+--------
       1 |       1
       2 |       1
       3 |       2
       4 |       1
       5 |       3
My results must be: story_ids = 1, 2, 3, 5.
4 must be skipped because I have already picked 2 ids with site_id 1.

select story_id,
       site_id
from (
    select story_id,
           site_id,
           row_number() over (partition by site_id order by story_id desc) as rn
    from the_table
) t
where rn <= 2
order by story_id desc
limit 30
If you want more or fewer than 2 entries per group, adjust the value in the outer WHERE clause.
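To see what the inner query produces, here is a self-contained sketch that inlines the sample data from the question (the CTE stands in for the_table):

-- Sketch: inline the sample rows and inspect the per-site row numbers.
WITH the_table (story_id, site_id) AS (
    VALUES (1, 1), (2, 1), (3, 2), (4, 1), (5, 3)
)
SELECT story_id,
       site_id,
       row_number() OVER (PARTITION BY site_id ORDER BY story_id DESC) AS rn
FROM the_table
ORDER BY story_id;
-- rn restarts at 1 for each site_id, counting from the newest story,
-- so the outer "where rn <= 2" keeps at most two stories per site.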

Related

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from #MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
    SELECT id, title, TO_DATE(thedate, 'YYYY-MM-DD'),
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY id asc) AS Row
    FROM table1
) dups
where dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
    SELECT
        *,
        LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
    FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window functions for this; you can use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column is not unique, however, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15
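As an aside, the self-join can take advantage of a composite index on the join columns; a minimal sketch, assuming the table and column names from the question (the index name is invented):

CREATE INDEX table1_title_thedate_idx ON table1 (title, thedate);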

delete duplicates in a table and update references

I have a table with an id. We have now added a new field holding a unique id calculated from an external source, which made us realize we actually have duplicates in the database:
Main Table
id | unique_id | ...
---|-----------|----
 4 | A         |
 5 | A         |
 6 | B         |
We can see: 5 is actually a duplicate of 4, as they both have the same unique_id.
Now this needs to be cleaned up.
Sadly, I cannot simply delete those duplicates (5), as other tables reference them:
Other Table (OtherTable.main_id REFERENCES MainTable.id)
id | main_id | ...
---|---------|-----
 1 | 4       | Blah
 2 | 5       |
 3 | 6       |
Now I have to clean up the references to the duplicates, i.e. here:
UPDATE OtherTable SET main_id = 4 WHERE main_id = 5
How can I do that in an efficient update?
I tried simply updating every reference to point at the first entry with the same unique_id; however, that didn't complete within a day.
UPDATE "OtherTable"
SET "main_id" = (
    SELECT "id" FROM "MainTable"
    WHERE "unique_id" = (
        SELECT "unique_id" FROM "MainTable" WHERE "id" = "OtherTable"."main_id"
    )
    LIMIT 1
)
If it helps, the MainTable contains about 750,000 entries and the OtherTable contains 12,000,000 rows.
That's probably because the triple-nested select is quite inefficient.
For the simpler part, deleting the duplicates (once I'm done changing the references to the first one of its kind), I found this query to work swiftly enough:
DELETE FROM MainTable
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY id) AS row_num
        FROM MainTable
    ) t
    WHERE t.row_num > 1
);
However I need a way to update the references to the non-deleted ones of the duplicates.
Instead of UPDATE with a nested query, I'd suggest using UPDATE FROM for a join, and the same window function as in your DELETE statement:
UPDATE "OtherTable" AS other
SET main_id = main.min_id
FROM (SELECT
id,
first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
FROM "MainTable"
) AS main
WHERE main.id = other.main_id
AND main.id <> main.min_id
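As a sanity check (my own addition, reusing the table and column names above), after the UPDATE the following should return zero rows, meaning no reference still points at a duplicate, before you run the DELETE from the question:

SELECT o.id
FROM "OtherTable" AS o
JOIN (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY unique_id ORDER BY id) AS row_num
    FROM "MainTable"
) AS dup ON dup.id = o.main_id
WHERE dup.row_num > 1;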

PostgreSQL Window Function "column must appear in the GROUP BY clause"

I'm trying to get a leaderboard of summed user scores from a list of user score entries. A single user can have more than one entry in this table.
I have the following table:
rewards
=======
user_id | amount
I want to add up all of the amount values for given users and then rank them on a global leaderboard. Here's the query I'm trying to run:
SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION BY user_id) FROM rewards;
I'm getting the following error:
ERROR: column "rewards.user_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION...
Isn't user_id already in an "aggregate function" because I'm trying to partition on it? The PostgreSQL manual shows the following entry which I feel is a direct parallel of mine, so I'm not sure why mine's not working:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
They're not grouping by depname, so how come theirs works?
For example, for the following data:
user_id | score
===============
      1 |     2
      1 |     3
      2 |     5
      3 |     1
I would expect the following output (I have made a "tie" between users 1 and 2):
user_id | SUM(score) | rank
===========================
      1 |          5 |    1
      2 |          5 |    1
      3 |          1 |    3
So user 1 has a total score of 5 and is ranked #1, user 2 is tied with a score of 5 and thus is also rank #1, and user 3 is ranked #3 with a score of 1.
You need to GROUP BY user_id since it's not being aggregated. Then you can rank by SUM(score) descending, as you want:
SQL Fiddle Demo
SELECT user_id, SUM(score), RANK() OVER (ORDER BY SUM(score) DESC)
FROM rewards
GROUP BY user_id;
 user_id | sum | rank
---------+-----+------
       1 |   5 |    1
       2 |   5 |    1
       3 |   1 |    3
There is a difference between window functions and aggregate functions. Some functions can be used both as a window function and an aggregate function, which can cause confusion. Window functions can be recognized by the OVER clause in the query.
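For illustration, a minimal sketch using the rewards table from the question, contrasting the two uses of SUM: the aggregate collapses rows per group, while the window function keeps every row:

-- aggregate use of SUM: one output row per user_id
SELECT user_id, SUM(amount) AS total_amount
FROM rewards
GROUP BY user_id;

-- window use of SUM: every row kept, with the per-user total attached to each
SELECT user_id, amount,
       SUM(amount) OVER (PARTITION BY user_id) AS total_amount
FROM rewards;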
Your query then becomes a two-step one: first aggregate per user_id, then apply the window function to the total_amount.
SELECT user_id, total_amount, RANK() OVER (ORDER BY total_amount DESC)
FROM (
    SELECT user_id, SUM(amount) AS total_amount
    FROM rewards
    GROUP BY user_id
) q
ORDER BY total_amount DESC
If you have
SELECT user_id, SUM(amount) ...
                ^^^
                aggregate function (not a window function)
FROM ...
you need
GROUP BY user_id

Query to get last conversations for user inbox

I need a specific SQL query to select last 10 conversations for user inbox.
The inbox shows only conversations (threads) with other users: it selects the last message of each conversation and shows it in the inbox.
Edited.
Expected result: extract the latest message from each of the 10 latest conversations. Facebook shows the latest conversations in the same way.
And one more question: how do I paginate, showing the next 10 latest conversations on the next page?
Private messages in the database look like:
id | user_id | recipient_id | text
---|---------|--------------|-------------------
 1 |       2 |            3 | Hi John!
 2 |       3 |            2 | Hi Tom!
 3 |       2 |            3 | How are you?
 4 |       3 |            2 | Thanks, good! You?
As per my understanding, you need to get the latest message of the conversation on a per-user basis (for the 10 latest conversations).
Update: I have modified the query to get the latest_conversation_message_id for every user conversation.
The query below gets the details for user_id = 2; change users.id = 2 to get it for any other user.
SQLFiddle; hope this solves your purpose.
SELECT
    user_id,
    users.name,
    users2.name AS sent_from_or_sent_to,
    subquery.text AS latest_message_of_conversation
FROM
    users
JOIN (
    SELECT
        text,
        row_number() OVER (PARTITION BY user_id + recipient_id ORDER BY id DESC) AS row_num,
        user_id,
        recipient_id,
        id
    FROM
        private_messages
    GROUP BY
        id,
        recipient_id,
        user_id,
        text
) AS subquery
    ON ((subquery.user_id = users.id OR subquery.recipient_id = users.id) AND row_num = 1)
JOIN users AS users2
    ON (users2.id = CASE WHEN users.id = subquery.user_id THEN subquery.recipient_id ELSE subquery.user_id END)
WHERE
    users.id = 2
ORDER BY
    subquery.id DESC
LIMIT 10
Info: the query gets the latest message of every conversation with any other user. If user_id 2 sends a message to user_id 3, that too is displayed, as it marks the start of a conversation.
To solve groupwise-max in Postgres you can use DISTINCT ON, like this:
SELECT DISTINCT ON (pm.user_id)
    pm.user_id,
    pm.text
FROM private_messages AS pm
WHERE pm.recipient_id = <my user id>
ORDER BY pm.user_id, pm.id DESC;
http://sqlfiddle.com/#!12/4021d/19
To get the latest X, however, we will have to use it in a subselect:
SELECT
    q.user_id,
    q.id,
    q.text
FROM (
    SELECT DISTINCT ON (pm.user_id)
        pm.user_id,
        pm.id,
        pm.text
    FROM private_messages AS pm
    WHERE pm.recipient_id = 2
    ORDER BY pm.user_id, pm.id DESC
) AS q
ORDER BY q.id DESC
LIMIT 10;
http://sqlfiddle.com/#!12/4021d/28
To get both sent and received threads:
SELECT
    q.user_id,
    q.recipient_id,
    q.id,
    q.text
FROM (
    SELECT DISTINCT ON (pm.user_id, pm.recipient_id)
        pm.user_id,
        pm.recipient_id,
        pm.id,
        pm.text
    FROM private_messages AS pm
    WHERE pm.recipient_id = 2 OR pm.user_id = 2
    ORDER BY pm.user_id, pm.recipient_id, pm.id DESC
) AS q
ORDER BY q.id DESC
LIMIT 10;
http://sqlfiddle.com/#!12/4021d/42
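Note that DISTINCT ON (pm.user_id, pm.recipient_id) treats the two directions of a conversation as separate threads. If you want one thread per pair of users, a possible variation (my own sketch, not part of the answer above) keys the DISTINCT ON on the unordered pair:

SELECT
    q.user_id,
    q.recipient_id,
    q.id,
    q.text
FROM (
    SELECT DISTINCT ON (LEAST(pm.user_id, pm.recipient_id),
                        GREATEST(pm.user_id, pm.recipient_id))
        pm.user_id,
        pm.recipient_id,
        pm.id,
        pm.text
    FROM private_messages AS pm
    WHERE pm.recipient_id = 2 OR pm.user_id = 2
    -- LEAST/GREATEST make A->B and B->A fall into the same group
    ORDER BY LEAST(pm.user_id, pm.recipient_id),
             GREATEST(pm.user_id, pm.recipient_id),
             pm.id DESC
) AS q
ORDER BY q.id DESC
LIMIT 10;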
Paste this after your WHERE clause to sort the results:
ORDER BY "ColumnName" [ASC | DESC]
UNION (description at W3Schools) combines the results of these 2 statements:
SELECT "ColumnName" FROM "TableName"
UNION
SELECT "ColumnName" FROM "TableName"
For large data sets, you might like to try running the two statements separately and then consolidating the results, as an index scan on (user_id, id) or (recipient_id, id) ought to be very efficient at getting the 10 most recent conversations of each type.
with sent_messages as (
    select *
    from private_messages
    where user_id = my_user_id
    order by id desc
    limit 10),
received_messages as (
    select *
    from private_messages
    where recipient_id = my_user_id
    order by id desc
    limit 10),
all_messages as (
    select *
    from sent_messages
    union all
    select *
    from received_messages)
select *
from all_messages
order by id desc
limit 10
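For reference, the kind of indexes this plan relies on might look like the following (a sketch; the index names are made up, the columns come from the question):

CREATE INDEX pm_user_id_id_idx ON private_messages (user_id, id DESC);
CREATE INDEX pm_recipient_id_id_idx ON private_messages (recipient_id, id DESC);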
Edit: Actually another query worth trying might be:
select *
from private_messages
where id in (
    select id
    from (
        (select id
         from private_messages
         where user_id = my_user_id
         order by id desc
         limit 10)
        union all
        (select id
         from private_messages
         where recipient_id = my_user_id
         order by id desc
         limit 10)
    ) all_ids
    order by id desc
    limit 10)
order by id desc
This might be better in 9.2+, where the indexes alone could be used to get the ids, or in cases where the number of recent rows to retrieve is very large. Still a bit unclear on that, though. If in doubt, I'd go for the former version.

How to rank in postgres query

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() function for Postgres; maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id, feed_id, pub_date, rank
from (
    select feed_id, pub_date, cluster_id, rank() over (order by pub_date asc)
    from url_info
) as bar
where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info in cluster_id 9876, yet this query ran for 10 minutes and never came back. (I actually re-ran it for quite a while and it returned no results, even though there is a row in cluster 9876 for id 1234.) I'm expecting this to tell me something like "id 1234 was 5th for the criteria given". It should return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the OVER clause or any predicate in that subselect, your query asks for a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long: to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the WHERE clause, in which case all you need do is eliminate the subselect; the rank function is then implicitly over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
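If you instead wanted a per-cluster rank over the whole table in one pass, a sketch (my own variation on the queries above) would add PARTITION BY to the window:

select cluster_id
      ,feed_id
      ,pub_date
      ,rank() over (partition by cluster_id order by pub_date asc) as rank
from url_info;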
Sharing another example of DENSE_RANK() in PostgreSQL:
a sample query to find the top 3 students.
Reference taken from a blog post.
Create a table with sample data:
CREATE TABLE tbl_Students
(
     StudID INT
    ,StudName CHARACTER VARYING
    ,TotalMark INT
);

INSERT INTO tbl_Students
VALUES
     (1, 'Anvesh', 88), (2, 'Neevan', 78)
    ,(3, 'Roy', 90), (4, 'Mahi', 88)
    ,(5, 'Maria', 81), (6, 'Jenny', 90);
Using DENSE_RANK(), Calculate RANK of students:
WITH cteStud AS
(
    SELECT
         StudName
        ,TotalMark
        ,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
    FROM tbl_Students
)
SELECT
     StudName
    ,TotalMark
    ,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
 studname | totalmark | studrank
----------+-----------+----------
 Roy      |        90 |        1
 Jenny    |        90 |        1
 Anvesh   |        88 |        2
 Mahi     |        88 |        2
 Maria    |        81 |        3
(5 rows)
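For comparison, a sketch of the same CTE with RANK() instead of DENSE_RANK() (same sample table as above):

WITH cteStud AS
(
    SELECT
         StudName
        ,TotalMark
        ,RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
    FROM tbl_Students
)
SELECT StudName, TotalMark, StudRank
FROM cteStud
WHERE StudRank <= 3;
-- RANK() leaves gaps after ties: Roy and Jenny get 1, Anvesh and Mahi get 3,
-- Maria gets 5, so the "StudRank <= 3" filter would now drop Maria.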