PostgreSQL Window Function "column must appear in the GROUP BY clause" - postgresql

I'm trying to get a leaderboard of summed user scores from a list of user score entries. A single user can have more than one entry in this table.
I have the following table:
rewards
=======
user_id | amount
I want to add up all of the amount values for given users and then rank them on a global leaderboard. Here's the query I'm trying to run:
SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION BY user_id) FROM rewards;
I'm getting the following error:
ERROR: column "rewards.user_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION...
Isn't user_id already in an "aggregate function" because I'm trying to partition on it? The PostgreSQL manual shows the following entry which I feel is a direct parallel of mine, so I'm not sure why mine's not working:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
They're not grouping by depname, so how come theirs works?
For example, for the following data:
user_id | score
===============
1 | 2
1 | 3
2 | 5
3 | 1
I would expect the following output (I have made a "tie" between users 1 and 2):
user_id | SUM(score) | rank
===========================
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
So user 1 has a total score of 5 and is ranked #1, user 2 is tied with a score of 5 and thus is also rank #1, and user 3 is ranked #3 with a score of 1.

You need to GROUP BY user_id since it's not being aggregated. Then you can rank by SUM(score) descending as you want;
SQL Fiddle Demo
SELECT user_id, SUM(score), RANK() OVER (ORDER BY SUM(score) DESC)
FROM rewards
GROUP BY user_id;
user_id | sum | rank
---------+-----+------
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3

There is a difference between window functions and aggregate functions. Some functions can be used both as a window function and an aggregate function, which can cause confusion. Window functions can be recognized by the OVER clause in the query.
The query in your case then becomes, split in doing first an aggregate on user_id followed by a window function on the total_amount.
SELECT user_id, total_amount, RANK() OVER (ORDER BY total_amount DESC)
FROM (
SELECT user_id, SUM(amount) total_amount
FROM table
GROUP BY user_id
) q
ORDER BY total_amount DESC

If you have
SELECT user_id, SUM(amount) ....
^^^
agreagted function (not window function)
....
FROM .....
You need
GROUP BY user_id

Related

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from #MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window funtions for this, you an use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15

How can I rank a table in postgresql and then find the rank of a specific row?

I have a postgresql table
cubing=# SELECT * FROM times;
count | name | time
-------+---------+--------
4 | sean | 32.97
5 | Austin | 15.64
6 | Kirk | 117.02
I retrieve all from it with SELECT * FROM times ORDER BY time ASC. But now I want to give the user the option to search for a specific value (say, WHERE name = Austin) and have it tell them what rank they are in the table. Right now, I have SELECT name,time, RANK () OVER ( ORDER BY time ASC) rank_number FROM times. From how I understand it, that is giving me the rank of the entire table. I would like the rank, name, and time of who I am searching for. I am afraid if I added a where clause to my last SELECT statement with the name Austin, it would only find where the name equals Austin and rank those, rather than the rank of Austin in the rest of the table.
thanks for reading
I think the behavior you want here is to first rank your current data, then query it with some WHERE filter:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
)
SELECT count, name, time
FROM cte
WHERE name = 'Austin';
The point here is that at the time we do a query searching for Austin, the ranks for each row in your original table have already been generated.
Edit:
If you're running this query from an application, it would probably be best to avoid CTE syntax. Instead, just inline the CTE as a subquery:
SELECT count, name, time, rank_number
FROM
(
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
) t
WHERE name = 'Austin';

How to get id of the row which was selected by aggregate function? [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 4 years ago.
I have next data:
id | name | amount | datefrom
---------------------------
3 | a | 8 | 2018-01-01
4 | a | 3 | 2018-01-15 10:00
5 | b | 1 | 2018-02-20
I can group result with the next query:
select name, max(amount) from table group by name
But I need the id of selected row too. Thus I have tried:
select max(id), name, max(amount) from table group by name
And as it was expected it returns:
id | name | amount
-----------
4 | a | 8
5 | b | 1
But I need the id to have 3 for the amount of 8:
id | name | amount
-----------
3 | a | 8
5 | b | 1
Is this possible?
PS. This is required for billing task. At some day 2018-01-15 configuration of a was changed and user consumes some resource 10h with the amount of 8 and rests the day 14h -- 3. I need to count such a day by the maximum value. Thus row with id = 4 is just ignored for 2018-01-15 day. (for next day 2018-01-16 I will bill the amount of 3)
So I take for billing the row:
3 | a | 8 | 2018-01-01
And if something is wrong with it. I must report that row with id == 3 is wrong.
But when I used aggregation function the information about id is lost.
Would be awesome if this is possible:
select current(id), name, max(amount) from table group by name
select aggregated_row(id), name, max(amount) from table group by name
Here agg_row refer to the row which was selected by aggregation function max
UPD
I resolve the task as:
SELECT
(
SELECT id FROM t2
WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
) id,
name,
MAX(amount) ma,
SUM( ratio )
FROM t2 tf
GROUP BY name
UPD
It would be much better to use window functions
There are at least 3 ways, see below:
CREATE TEMP TABLE test (
id integer, name text, amount numeric, datefrom timestamptz
);
COPY test FROM STDIN (FORMAT csv);
3,a,8,2018-01-01
4,a,3,2018-01-15 10:00
5,b,1,2018-02-20
6,b,1,2019-01-01
\.
Method 1. using DISTINCT ON (PostgreSQL-specific)
SELECT DISTINCT ON (name)
id, name, amount
FROM test
ORDER BY name, amount DESC, datefrom ASC;
Method 2. using window functions
SELECT id, name, amount FROM (
SELECT *, row_number() OVER (
PARTITION BY name
ORDER BY amount DESC, datefrom ASC) AS __rn
FROM test) AS x
WHERE x.__rn = 1;
Method 3. using corelated subquery
SELECT id, name, amount FROM test
WHERE id = (
SELECT id FROM test AS t2
WHERE t2.name = test.name
ORDER BY amount DESC, datefrom ASC
LIMIT 1
);
demo: db<>fiddle
You need DISTINCT ON which filters the first row per group.
SELECT DISTINCT ON (name)
*
FROM table
ORDER BY name, amount DESC
You need a nested inner join. Try this -
SELECT id, T2.name, T2.amount
FROM TABLE T
INNER JOIN (SELECT name, MAX(amount) amount
FROM TABLE
GROUP BY name) T2
ON T.amount = T2.amount

Postgresql selecting with limit equal values?

I have one postgresql table where I store some stories from different sites.
At this table I got story_id and site_id fields.
Where story_id is the primary key and site_id is the id of the site where I got this story from.
I need to make SELECT from this table picking the latest 30 added stories.
But I dont want to get more than 2 stories comming from same site...
So if I have something like this:
story_id | site_id
1 | 1
2 | 1
3 | 2
4 | 1
5 | 3
My results must be : story_ids = 1,2,3,5!
4 must be skipped because I have already picked 2 ids with site_id 1.
select story_id,
site_id
from (
select story_id,
site_id,
row_number() over (partition by site_id order by story_id desc) as rn
from the_table
) t
where rn <= 2
order by story_id desc
limit 30
If you want more or less than 2 entries "per group" you have to adjust the value in the outer where clause.

How to rank in postgres query

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876 and this query ran for 10 minutes and never came back. (actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given). It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the over clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long as to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
Sharing another example of DENSE_RANK() of PostgreSQL.
Find top 3 students sample query.
Reference taken from this blog:
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), Calculate RANK of students:
;WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)