I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:
I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:
select cluster_id,feed_id,pub_date,rank
from (select feed_id,pub_date,cluster_id,rank()
over (order by pub_date asc) from url_info)
as bar where cluster_id = 9876 and feed_id = 1234;
I'm modeling this after the following stackoverflow post: postgres rank
The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876 and this query ran for 10 minutes and never came back. (actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given). It will return a relative rank according to my query constraints, correct?
This is postgres 8.4 btw.
By placing the rank() function in the subselect and not specifying a PARTITION BY in the over clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long as to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.
It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;
If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:
select ranked.*
from (
select
cluster_id
,feed_id
,pub_date
,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876
) as ranked
where feed_id = 1234;
Sharing another example of DENSE_RANK() of PostgreSQL.
Find top 3 students sample query.
Reference taken from this blog:
Create a table with sample data:
CREATE TABLE tbl_Students
(
StudID INT
,StudName CHARACTER VARYING
,TotalMark INT
);
INSERT INTO tbl_Students
VALUES
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);
Using DENSE_RANK(), Calculate RANK of students:
;WITH cteStud AS
(
SELECT
StudName
,Totalmark
,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
FROM tbl_Students
)
SELECT
StudName
,Totalmark
,StudRank
FROM cteStud
WHERE StudRank <= 3;
The Result:
studname | totalmark | studrank
----------+-----------+----------
Roy | 90 | 1
Jenny | 90 | 1
Anvesh | 88 | 2
Mahi | 88 | 2
Maria | 81 | 3
(5 rows)
Related
I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from #MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
Which I'm trying to use as a base to solve my specific problem: I need to find duplicates according to column values like in the example, but only for records posted within 15 days of each other (the date of record insertion in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window funtions for this, you an use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also, you can decide if you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15
I have a postgresql table
cubing=# SELECT * FROM times;
count | name | time
-------+---------+--------
4 | sean | 32.97
5 | Austin | 15.64
6 | Kirk | 117.02
I retrieve all from it with SELECT * FROM times ORDER BY time ASC. But now I want to give the user the option to search for a specific value (say, WHERE name = Austin) and have it tell them what rank they are in the table. Right now, I have SELECT name,time, RANK () OVER ( ORDER BY time ASC) rank_number FROM times. From how I understand it, that is giving me the rank of the entire table. I would like the rank, name, and time of who I am searching for. I am afraid if I added a where clause to my last SELECT statement with the name Austin, it would only find where the name equals Austin and rank those, rather than the rank of Austin in the rest of the table.
thanks for reading
I think the behavior you want here is to first rank your current data, then query it with some WHERE filter:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
)
SELECT count, name, time
FROM cte
WHERE name = 'Austin';
The point here is that at the time we do a query searching for Austin, the ranks for each row in your original table have already been generated.
Edit:
If you're running this query from an application, it would probably be best to avoid CTE syntax. Instead, just inline the CTE as a subquery:
SELECT count, name, time, rank_number
FROM
(
SELECT *, RANK() OVER (ORDER BY time) rank_number
FROM times
) t
WHERE name = 'Austin';
I've used WINDOW functions before but only when working with data that has a fixed cadence/interval. I am likely missing something simple in aggregation but I've never had a scenario where I'm not working with fixed intervals.
I have a table the records samples at arbitrary timestamps. A sample is only recorded when it is a delta from the previous sample and the sample rate is completely irregular due to a large number of conditions. The table is very simple:
id (int)
happened_at (timestamp)
sensor_id (int)
new_value (float)
I'm trying to construct a query that will include a count of all of the samples before the happened_at of a given result row. So given an ultra simple 2 row sample data set:
id|happened_at |sensor_id| new_value
1 |2019-06-07:21:41|134679 | 123.331
2 |2019-06-07:19:00|134679 | 100.009
I'd like the result set to look like this:
happened_at |sensor_id | new_value | sample_count
2019-06-07:21:41|134679 |123.331 |2
2019-06-07:19:00|134679 |123.331 |1
I've tried:
SELECT *,
(SELECT count(sample_history.id) OVER (PARTITION BY score_history.sensor_id
ORDER BY sample_history.happened_at DESC))
FROM sensor_history
ORDER by happened_at DESC
and the duh not going to work.
(SELECT count(*)
FROM sample_history
WHERE sample_history.happened_at <= sample_timestamp)
Insights greatly appreciated.
Get rid of the SELECT (sub-query) when using the window function.
SELECT *,
count(*) OVER (PARTITION BY sensor_id ORDER BY happened_at DESC)
FROM sensor_history
ORDER BY happened_at DESC
I'm trying to get a leaderboard of summed user scores from a list of user score entries. A single user can have more than one entry in this table.
I have the following table:
rewards
=======
user_id | amount
I want to add up all of the amount values for given users and then rank them on a global leaderboard. Here's the query I'm trying to run:
SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION BY user_id) FROM rewards;
I'm getting the following error:
ERROR: column "rewards.user_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION...
Isn't user_id already in an "aggregate function" because I'm trying to partition on it? The PostgreSQL manual shows the following entry which I feel is a direct parallel of mine, so I'm not sure why mine's not working:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
They're not grouping by depname, so how come theirs works?
For example, for the following data:
user_id | score
===============
1 | 2
1 | 3
2 | 5
3 | 1
I would expect the following output (I have made a "tie" between users 1 and 2):
user_id | SUM(score) | rank
===========================
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
So user 1 has a total score of 5 and is ranked #1, user 2 is tied with a score of 5 and thus is also rank #1, and user 3 is ranked #3 with a score of 1.
You need to GROUP BY user_id since it's not being aggregated. Then you can rank by SUM(score) descending as you want;
SQL Fiddle Demo
SELECT user_id, SUM(score), RANK() OVER (ORDER BY SUM(score) DESC)
FROM rewards
GROUP BY user_id;
user_id | sum | rank
---------+-----+------
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
There is a difference between window functions and aggregate functions. Some functions can be used both as a window function and an aggregate function, which can cause confusion. Window functions can be recognized by the OVER clause in the query.
The query in your case then becomes, split in doing first an aggregate on user_id followed by a window function on the total_amount.
SELECT user_id, total_amount, RANK() OVER (ORDER BY total_amount DESC)
FROM (
SELECT user_id, SUM(amount) total_amount
FROM table
GROUP BY user_id
) q
ORDER BY total_amount DESC
If you have
SELECT user_id, SUM(amount) ....
^^^
agreagted function (not window function)
....
FROM .....
You need
GROUP BY user_id
I have a table segnature describing an item with a varchar field deno and a numeric field ord. A foreign key fk_collection tells which collection the row is part of.
I want to update field ord so that it contains the ordinal of that row per each collection, sorted by field deno.
E.g. if I have something like
[deno] ord [fk_collection]
abc 10
aab 10
bcd 10
zxc 20
vbn 20
Then I want a result like
[deno] ord [fk_collection]
abc 1 10
aab 0 10
bcd 2 10
zxc 1 20
vbn 0 20
I tried with something like
update segnature s1 set ord = (select count(*)
from segnature s2
where s1.fk_collection=s2.fk_collection and s2.deno<s1.deno
)
but query is really slow: 150 collections per about 30000 items are updated in 10 minutes about.
Any suggestion to speed up the process?
Thank you!
You can use a window function to generate the "ordinal" number:
with numbered as (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
)
update segnature
set ord = n.rn
from numbered n
where n.id = segnature.ctid;
This uses the internal column ctid to uniquely identify each rows. The ctid comparison is quite slow, so if you have a real primary (or unique) key in that table, use that column instead.
Alternatively without the common table expression:
update segnature
set ord = n.rn
from (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
) as n
where n.id = segnature.ctid;
SQLFiddle example: http://sqlfiddle.com/#!15/e997f/1