I have a table created as
CREATE TABLE T0
(
id text,
kind_of_datetime text,
... 10 more text fields
PRIMARY KEY (id, kind_of_datetime)
);
It has about 31M rows, with about 800K unique id values and 13K unique kind_of_datetime values.
I want to run this query:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() over(PARTITION BY id ORDER BY kind_of_datetime DESC) as rn_col
FROM T0
WHERE kind_of_datetime <= 'some_value'
) as tmp
WHERE rn_col = 1
The plan shows a WindowAgg that reads the whole table and sorts it, and the query runs really long (minutes).
I tried creating an index:
CREATE INDEX index_name ON T0 (id, kind_of_datetime DESC NULLS LAST)
and it works better, but only if the final SELECT contains just the two key columns id + kind_of_datetime. Otherwise it's always a full scan.
Maybe I should change the way of storing the data? Or create some other index?
What I don't want to do is INCLUDE the 10 other columns in the index, because it will take too much RAM.
Try a subquery:
SELECT *,
ROW_NUMBER() over(PARTITION BY id ORDER BY kind_of_datetime DESC) as rn_col
FROM (SELECT * FROM tab
WHERE kind_of_datetime <= 'some_value'
) AS t
That will definitely apply the filter first.
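If the goal is the latest row per id, another pattern worth testing in PostgreSQL is DISTINCT ON, which avoids the window function entirely and can walk the (id, kind_of_datetime DESC) index (a sketch against the question's T0; 'some_value' is the question's placeholder):

```sql
-- One row per id: the newest kind_of_datetime at or before the cutoff.
SELECT DISTINCT ON (id) *
FROM T0
WHERE kind_of_datetime <= 'some_value'
ORDER BY id, kind_of_datetime DESC;
```

Whether the planner actually uses the index depends on selectivity, so compare EXPLAIN ANALYZE output for both forms.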
Quick question: I'm trying to update a column only when there are duplicates (row number within the title partition > 1), and I've selected them using a window function, but the current query updates the whole table! Please check the query below; any leads would be greatly appreciated :)
UPDATE public.database_tag
SET deleted_at= '2022-04-25 19:33:29.087133+00'
FROM (
SELECT *,
row_number() over (partition by title order by created_at) as RN
FROM public.database_tag
ORDER BY RN DESC) X
WHERE X.RN > 1
Thanks very much!
Assuming that every row has a unique ID, it can be done like below.
UPDATE database_tag
SET deleted_at= '2022-04-25 19:33:29.087133+00'
WHERE <some_unique_id> in (
select <some_unique_id> from (
SELECT <some_unique_id>,
row_number() over (partition by title order by created_at) as RN
FROM public.database_tag
) X
WHERE X.RN > 1
)
Or we can reverse the query to update everything except a set of IDs:
UPDATE database_tag
SET deleted_at= '2022-04-25 19:33:29.087133+00'
WHERE <some_unique_id> not in (
select distinct on (title)
<some_unique_id> from database_tag
order by title, created_at
)
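If the table has no unique ID column at all, PostgreSQL's system column ctid can serve as the row identifier instead (a sketch; the timestamp literal is copied from the question):

```sql
UPDATE database_tag
SET deleted_at = '2022-04-25 19:33:29.087133+00'
WHERE ctid IN (
    SELECT ctid FROM (
        SELECT ctid,
               row_number() OVER (PARTITION BY title ORDER BY created_at) AS rn
        FROM database_tag
    ) x
    WHERE x.rn > 1
);
```

ctid is only stable within a single statement, which is fine here because the subquery and the update run as one statement.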
Here is a sqlfiddle that demonstrates the problem: http://sqlfiddle.com/#!17/626cc2/2
Tested in 9.6 and 10.2
But in case that isn't preferred, the schema and query are:
CREATE TABLE events("id" varchar(36), "ts" timestamp);
CREATE INDEX events_default_order ON events(ts DESC, id DESC);
INSERT INTO events("id", "ts") VALUES
('dccc3a43-8a48-4c29-84e5-906b7817d9a4', '2019-05-20 11:46:19'),
('f7355c58-1e09-4043-b4ee-fb3d3d997ac7', '2019-05-17 20:05:01'),
... -- 50 or so rows
;
-- In this query, the window function using an empty OVER returns out of order numbers.
SELECT id
, ts
, row_number() OVER (ORDER BY ts DESC, id ASC) AS row_num_good
, row_number() OVER () AS row_num_bad
FROM events
WHERE (ts > cast('2019-05-10 14:20:13-0400' as timestamptz))
ORDER BY ts DESC, id ASC
LIMIT 20;
-- But this one gives both lists in order.
SELECT id
, ts
, row_number() OVER (ORDER BY ts DESC, id DESC) AS row_num_good
, row_number() OVER () AS row_num_is_now_good
FROM events
WHERE (ts > cast('2019-05-10 14:20:13-0400' as timestamptz))
ORDER BY ts DESC, id DESC
LIMIT 20;
As the sticky comment mentions, if you don't specify an order, then any order is a good order. And you can't assume anything, even if a previous result gave you the order you wanted: a change in hardware, database version, indexes, or data can change that result.
In this case, because of
CREATE INDEX events_default_order ON events(ts DESC, id DESC);
the data is retrieved in index order, therefore it is already ordered, and the row_number is computed over an ordered set. But again, that can change if the engine decides to use a different index for a different query, or the data is moved to different storage.
If you DROP the index, you will see the exact same result.
Also if you change the index to:
CREATE INDEX events_default_order ON events(ts DESC, id ASC);
You will get the inverted behaviour
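The practical fix is simply to never rely on OVER () when the numbering matters: give every row_number() its own ORDER BY, matching the outer one (a sketch using the question's events table):

```sql
-- The row order and the numbering are now both deterministic,
-- whatever index or plan the engine picks.
SELECT id,
       ts,
       row_number() OVER (ORDER BY ts DESC, id ASC) AS row_num
FROM events
WHERE ts > '2019-05-10 14:20:13-0400'::timestamptz
ORDER BY ts DESC, id ASC
LIMIT 20;
```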
I have a table with 3 columns (ID, Signature, and Datetime) and a query that groups by Signature having COUNT(*) > 9:
select * from (
select s.Signature
from #Sigs s
group by s.Signature
having count(*) > 9
) b
join #Sigs o
on o.Signature = b.Signature
order by o.Signature desc, o.DateTime
I now want to select the 1st and 10th records only, per Signature. What determines rank is the Datetime descending. Thus, I would expect every Signature to have 2 rows.
Thanks,
I would go with a couple of common table expressions.
The first will select all records from the table along with a count of records per signature; the second will select from the first where the record count > 9 and add a row_number partitioned by signature. Then just select from that where the row_number is either 1 or 10:
With cte1 AS
(
SELECT ID, Signature, Datetime, COUNT(*) OVER(PARTITION BY Signature) As NumberOfRows
FROM #Sigs
), cte2 AS
(
SELECT ID, Signature, Datetime, ROW_NUMBER() OVER(PARTITION BY Signature ORDER BY DateTime DESC) As Rn
FROM cte1
WHERE NumberOfRows > 9
)
SELECT ID, Signature, Datetime
FROM cte2
WHERE Rn IN (1, 10)
ORDER BY Signature desc
Because I don't know what your data looks like, this might need some adjustment.
The simplest way here, since you already know your sort order (DateTime DESC) and partitioning (Signature), is probably to assign row numbers and then select the rows you want.
SELECT *
FROM
(
select o.Signature
,o.DateTime
,ROW_NUMBER() OVER (PARTITION BY o.Signature ORDER BY o.DateTime DESC) [Row]
from (
select s.Signature
from #Sigs s
group by s.Signature
having count(*) > 9
) b
join #Sigs o
on o.Signature = b.Signature
) t
WHERE [Row] IN (1, 10)
ORDER BY Signature DESC, [Row]
I'm trying to update a column (pop_1_rank) in a postgresql table with the results from a rank() like so:
UPDATE database_final_form_merge
SET
pop_1_rank = r.rnk
FROM (
SELECT pop_1, RANK() OVER ( ORDER BY pop_1 DESC) FROM database_final_form_merge WHERE territory_name != 'north' AS rnk)r
The SELECT query by itself works fine, but I just can't get it to update correctly. What am I doing wrong here?
I'd rather use CTE notation:
WITH cte as (
SELECT pop_1,
RANK() OVER ( ORDER BY pop_1 DESC) AS rnk
FROM database_final_form_merge
WHERE territory_name <> 'north'
)
UPDATE database_final_form_merge
SET pop_1_rank = cte.rnk
FROM cte
WHERE database_final_form_merge.pop_1 = cte.pop_1
As far as I know, Postgres updates tables, not subqueries. So you can join back to the table:
UPDATE database_final_form_merge
SET pop_1_rank = r.rnk
FROM (SELECT pop_1, RANK() OVER ( ORDER BY pop_1 DESC) as rnk
FROM database_final_form_merge
WHERE territory_name <> 'north'
) r
WHERE database_final_form_merge.pop_1 = r.pop_1;
In addition:
The column alias goes right after the column expression, not at the end of the query (your AS rnk ended up after the WHERE clause).
This assumes that pop_1 is the id connecting the two tables.
You're missing a WHERE clause on the UPDATE query, because when doing UPDATE ... FROM you're basically doing a join.
So you need to select the primary key in the subquery and then match on the primary key, updating just the rows you computed the rank over.
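Putting that advice together, and assuming the table has a primary key column named id (the question doesn't show one, so the name is hypothetical):

```sql
UPDATE database_final_form_merge AS m
SET pop_1_rank = r.rnk
FROM (
    SELECT id,  -- hypothetical primary key column
           RANK() OVER (ORDER BY pop_1 DESC) AS rnk
    FROM database_final_form_merge
    WHERE territory_name <> 'north'
) AS r
WHERE m.id = r.id;
```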
I have this table:
TopScores
Username char(255)
Score int
DateAdded datetime2
which will have a lot of rows.
I run the following query (code for a stored procedure) against it to get the top 5 high scorers, and the score for a particular Username preceded by the person directly above them in position and the person below:
WITH Rankings
AS (SELECT Row_Number() OVER (ORDER BY Score DESC, DateAdded DESC) AS Pos,
--if score same, latest date higher
Username,
Score
FROM TopScores)
SELECT TOP 5 Pos,
Username,
Score
FROM Rankings
UNION ALL
SELECT Pos,
Username,
Score
FROM Rankings
WHERE Pos BETWEEN (SELECT Pos
FROM Rankings
WHERE Username = #User) - 1 AND (SELECT Pos
FROM Rankings
WHERE Username = #User) + 1
I had to index the table so I added clustered: ci_TopScores(Username) first and nonclustered: nci_TopScores(Dateadded, Score).
Query plan showed that clustered was completely ignored (before I created the nonclustered I tested and it was used by the query), and logical reads were more (as compared to a table scan without any index).
Sort was the highest costing operator. So I adjusted indexes to clustered: ci_TopScores(Score desc, Dateadded desc) and nonclustered: nci_TopScores(Username).
Still sort costs the same. Nonclustered: nci_TopScores(Username) is completely ignored again.
How can I avoid the high cost of sort and index this table effectively?
The CTE does not filter on Username, so it is no surprise it does not use that index.
A CTE is just syntax. You are evaluating that CTE 4 times.
Try a #temp so it is only evaluated once.
But you need to think about the indexes.
I would skip the ROW_NUMBER and just put an identity primary key on the #temp to serve as pos.
I would skip any other indexes on #temp.
For TopScores, an index on (Score DESC, DateAdded DESC, Username ASC) will help.
But it won't help if it is fragmented, and that is an index that will fragment as you insert.
create table #temp
(
    pos int identity(1, 1) primary key,
    Score int,
    DateAdded datetime2,
    Username char(255)
);

insert into #temp (Score, DateAdded, Username)
select Score, DateAdded, Username
from TopScores
order by Score desc, DateAdded desc, Username asc;

select *
from (select top 5 * from #temp order by pos) as topFive
union
select three.*
from #temp
join #temp as three
    on #temp.Username = #User
   and abs(three.pos - #temp.pos) <= 1;
So what if there is a table scan on #temp.Username?
One scan does not take as long as creating an index, and that index would be severely fragmented anyway.