Indexing to reduce cost of SORT - T-SQL

I have this table, which will have a lot of rows:
TopScores
    Username  char(255)
    Score     int
    DateAdded datetime2
I run the following query (the body of a stored procedure) against it to get the top 5 high scorers, plus the score for a particular Username together with the person directly above them in position and the person directly below:
WITH Rankings
AS (SELECT ROW_NUMBER() OVER (ORDER BY Score DESC, DateAdded DESC) AS Pos,
           -- if the score is the same, the latest date ranks higher
           Username,
           Score
    FROM TopScores)
SELECT TOP 5 Pos,
       Username,
       Score
FROM Rankings
UNION ALL
SELECT Pos,
       Username,
       Score
FROM Rankings
WHERE Pos BETWEEN (SELECT Pos
                   FROM Rankings
                   WHERE Username = @User) - 1
              AND (SELECT Pos
                   FROM Rankings
                   WHERE Username = @User) + 1
I had to index the table, so I first added the clustered index ci_TopScores(Username) and the nonclustered index nci_TopScores(DateAdded, Score).
The query plan showed that the clustered index was completely ignored (before I created the nonclustered one I tested, and it was used by the query), and logical reads were higher than for a table scan without any index.
Sort was the highest-costing operator, so I adjusted the indexes to clustered ci_TopScores(Score DESC, DateAdded DESC) and nonclustered nci_TopScores(Username).
The sort still costs the same, and nci_TopScores(Username) is completely ignored again.
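Expressed as DDL, the two attempts were roughly as follows (a sketch; names from above, exact options assumed):
-- first attempt
CREATE CLUSTERED INDEX ci_TopScores ON TopScores (Username);
CREATE NONCLUSTERED INDEX nci_TopScores ON TopScores (DateAdded, Score);
-- second attempt, after dropping the first pair
CREATE CLUSTERED INDEX ci_TopScores ON TopScores (Score DESC, DateAdded DESC);
CREATE NONCLUSTERED INDEX nci_TopScores ON TopScores (Username);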
How can I avoid the high cost of the sort and index this table effectively?

The CTE does not use Username, so it is no surprise that it does not use that index.
A CTE is just syntax: you are evaluating that CTE 4 times.
Try a #temp so it is only evaluated once.
But you need to think about the indexes.
I would skip the ROW_NUMBER and just put an identity primary key on the #temp to serve as pos.
I would skip any other indexes on #temp.
For TopScores, an index on (Score DESC, DateAdded DESC, Username ASC) will help, as sketched below.
But it won't help if it is fragmented,
and that is an index that will fragment when you insert.
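A minimal sketch of that index (the index name is an assumption for illustration):
CREATE NONCLUSTERED INDEX nci_TopScores_Rank  -- name assumed
    ON TopScores (Score DESC, DateAdded DESC, Username ASC);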
-- #temp gets an identity primary key (pos), per the advice above
create table #temp
(
    pos       int identity(1, 1) primary key,
    Score     int,
    DateAdded datetime2,
    Username  char(255)
);

-- the ORDER BY here guarantees the identity values are assigned in rank order
insert into #temp (Score, DateAdded, Username)
select Score, DateAdded, Username
from TopScores
order by Score desc, DateAdded desc, Username asc;

select top5.*
from (select top 5 * from #temp order by pos) as top5
union
select three.*
from #temp
join #temp as three
  on #temp.Username = @user
 and abs(three.pos - #temp.pos) <= 1
order by pos;
So what if there is a table scan on #temp for Username?
One scan does not take as long as creating an index,
and that index would be severely fragmented anyway.

Related

Optimizing row_number to not scan all table

I have a table created as
CREATE TABLE T0
(
id text,
kind_of_datetime text,
... 10 more text fields
),
PRIMARY KEY(id, kind_of_datetime)
It is about 31M rows with about 800K of unique id values and 13K unique kind_of_datetime values.
I want to run this query:
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() over(PARTITION BY id ORDER BY kind_of_datetime DESC) as rn_col
    FROM T0
    WHERE kind_of_datetime <= 'some_value'
) as tmp
WHERE rn_col = 1
The plan does a WindowAgg with an actual read of the whole table plus a sort, and it runs really long (minutes).
I tried creating an index:
CREATE INDEX index_name ON T0 (id, kind_of_datetime DESC NULLS LAST)
and it works better, but only if the final select consists of the two key columns id + kind_of_datetime (so an index-only scan is possible). Otherwise it's always a full scan.
Maybe I should change the way the data is stored? Or create some other index?
What I don't want to do is add INCLUDE with the 10 other columns, because it will take too much RAM.
Try a subquery:
SELECT *,
       ROW_NUMBER() over(PARTITION BY id ORDER BY kind_of_datetime DESC) as rn_col
FROM (SELECT * FROM T0
      WHERE kind_of_datetime <= 'some_value'
     ) AS t
That will definitely apply the filter first.
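Keeping the question's final filter, the whole query then becomes (a sketch reusing the question's names):
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() over(PARTITION BY id ORDER BY kind_of_datetime DESC) as rn_col
    FROM (SELECT * FROM T0
          WHERE kind_of_datetime <= 'some_value'
         ) AS t
) AS tmp
WHERE rn_col = 1;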

How to update duplicate rows in a table in PostgreSQL

I have created synthetic data for a typical call center.
Below is a screenshot of the table I have created.
[Table 1: screenshot of aa_dev.calls]
Problem statement: Since this is completely random data, I noticed that there are some customers who are being assigned to the same agents whenever they call again.
So using this query I was able to test such a case and count the number of times agents are being repeated for each customer.
select agentid, customerid, count(customerid) from aa_dev.calls group by agentid, customerid having count(customerid) > 1 ;
[Table 2: output of the query above]
I have a separate agents table called aa_dev.agents in which the agent ids are stored.
Now I want to replace the agentid in such cases, so that if an agentid is repeated 6 times for a single customer, then 5 of those times the agentid should be updated to some other agentid from the table, but the call times shouldn't overlap. That means the agent we are replacing with should not be busy at the time the call is going on.
I have assigned row numbers to each of the repeated ones:
WITH cte AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY agentid, customerid ORDER BY random()) rn,
           COUNT(*) OVER (PARTITION BY agentid, customerid) cnt
    FROM aa_dev.calls
)
SELECT agentid, customerid, rn
FROM cte
WHERE cnt > 1;
This way I could visualize the repetition clearly.
So I don't want to update row 1 but the rest.
Is there any way I can achieve this? Can I use the row numbers and write a query that updates rows with rn = 2 onwards one by one, with each row getting a different agent?
If you don't want duplicates in your artificial data, it's probably better to not generate them.
But if you already have a table with duplicates and want to work on the duplicates, either updating them or deleting, here is the easy way:
You need a unique ID for each updated row. If you don't have one,
add it temporarily. Then you can use this pattern to update all duplicates
except the first one.
To add an artificial id column to a preexisting table, use:
ALTER TABLE calls ADD id serial;
In my case I generated a test table with 100 random rows:
CREATE TEMP TABLE calls (id serial, agentid int, customerid int);
INSERT INTO calls (agentid, customerid)
SELECT (random()*10)::int, (random()*10)::int
FROM generate_series(1, 100) n;
Define what constitutes a duplicate and find duplicates in data:
SELECT agentid, customerid, count(*), array_agg(id) id
FROM calls
GROUP BY 1,2 HAVING count(*)>1
ORDER BY 1,2;
Update all the duplicate rows except the first one (with NULLs or whatever is needed):
UPDATE calls SET agentid = whatever_needed
FROM (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
Alternatively, remove all duplicates except first one:
DELETE FROM calls
USING (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
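After either statement, re-running the duplicate check from above is a quick sanity test (not part of the original answer); it should return zero rows:
SELECT agentid, customerid, count(*)
FROM calls
GROUP BY 1, 2
HAVING count(*) > 1;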

If a query misses an index, why is the row_number over () in an effectively random order?

Here is a sqlfiddle that demonstrates the problem: http://sqlfiddle.com/#!17/626cc2/2
Tested in 9.6 and 10.2
But in case that isn't preferred, the schema and query are:
CREATE TABLE events("id" varchar(36), "ts" timestamp);
CREATE INDEX events_default_order ON events(ts DESC, id DESC);
INSERT INTO events("id", "ts") VALUES
('dccc3a43-8a48-4c29-84e5-906b7817d9a4', '2019-05-20 11:46:19'),
('f7355c58-1e09-4043-b4ee-fb3d3d997ac7', '2019-05-17 20:05:01'),
... -- 50 or so rows
;
-- In this query, the window function using an empty OVER returns out of order numbers.
SELECT id
, ts
, row_number() OVER (ORDER BY ts DESC, id ASC) AS row_num_good
, row_number() OVER () AS row_num_bad
FROM events
WHERE (ts > cast('2019-05-10 14:20:13-0400' as timestamptz))
ORDER BY ts DESC, id ASC
LIMIT 20;
-- But this one gives both lists in order.
SELECT id
, ts
, row_number() OVER (ORDER BY ts DESC, id DESC) AS row_num_good
, row_number() OVER () AS row_num_is_now_good
FROM events
WHERE (ts > cast('2019-05-10 14:20:13-0400' as timestamptz))
ORDER BY ts DESC, id DESC
LIMIT 20;
As sticky mentions, if you don't specify an order then any order is a good order, and you can't assume anything even if a previous run returned the order you want. A change in hardware, DB version, index, or data can change that result.
In this case, because of
CREATE INDEX events_default_order ON events(ts DESC, id DESC);
the data is retrieved in index order, so it arrives already ordered and the row_number is computed over an ordered set. But again, that can change if the DB engine decides to use a different index for another query, or the data is moved to different storage.
If you DROP the index you will see the exact same result.
Also if you change the index to:
CREATE INDEX events_default_order ON events(ts DESC, id ASC);
You will get the inverted behaviour
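In short, the only dependable numbering comes from giving the window function its own ORDER BY, exactly as the row_num_good column in the question does:
-- a window with an explicit ORDER BY is deterministic regardless of the plan
SELECT id
, ts
, row_number() OVER (ORDER BY ts DESC, id DESC) AS row_num
FROM events
WHERE (ts > cast('2019-05-10 14:20:13-0400' as timestamptz))
ORDER BY ts DESC, id DESC
LIMIT 20;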

postgres - get top category purchased by customer

I have a denormalized table with the columns:
buyer_id
order_id
item_id
item_price
item_category
I would like to return something that gives 1 row per buyer_id:
buyer_id, sum(item_price), item_category
-- but ONLY for the category with the highest sales for that specific buyer_id.
I can't get row_number() or partition to work, because I need to order by the sum of item_price per item_category per buyer_id. Am I overlooking anything obvious?
You need a few layers of fudging here:
SELECT buyer_id, item_sum, item_category
FROM (
    SELECT buyer_id,
           rank() OVER (PARTITION BY buyer_id ORDER BY item_sum DESC) AS rnk,
           item_sum, item_category
    FROM (
        SELECT buyer_id, sum(item_price) AS item_sum, item_category
        FROM my_table
        GROUP BY 1, 3
    ) AS sub2
) AS sub
WHERE rnk = 1;
In sub2 you calculate the sum of 'item_price' for each 'item_category' for each 'buyer_id'. In sub you rank these with a window function by 'buyer_id', ordering by 'item_sum' in descending order (so the highest 'item_sum' comes first). In the main query you select those rows where rnk = 1.
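Note that rank() returns every tied category, so a buyer whose top two categories have equal sums comes back twice. If exactly one row per buyer is required, Postgres's DISTINCT ON is an idiomatic alternative (a sketch, not from the original answer):
-- keeps only the first row per buyer_id after sorting by item_sum DESC
SELECT DISTINCT ON (buyer_id)
       buyer_id, sum(item_price) AS item_sum, item_category
FROM my_table
GROUP BY buyer_id, item_category
ORDER BY buyer_id, item_sum DESC;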

Simple SELECT, but adding JOIN returns too many rows

The query below returns 9,817 records. Now I want to SELECT one more field from another table. See the two lines that are commented out, where I've simply selected this additional field and added a JOIN to bring in the new column. With those lines added, the query returns 649,200 records and I can't figure out why! I guess something is wrong with my WHERE criteria in conjunction with the JOIN. Please help, thanks.
SELECT DISTINCT dbo.IMPORT_DOCUMENTS.ITEMID, BEGDOC, BATCHID
--, dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.CATEGORY_ID
FROM IMPORT_DOCUMENTS
--JOIN dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS ON
--  dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID = dbo.IMPORT_DOCUMENTS.ITEMID
WHERE (BATCHID LIKE 'IC0%' OR BATCHID LIKE 'LP0%')
  AND dbo.IMPORT_DOCUMENTS.ITEMID IN
      (SELECT dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID
       FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
       WHERE SCORE >= .7 AND SCORE <= .75
         AND CATEGORY_ID IN (SELECT CATEGORY_ID
                             FROM CATEGORY_COLLECTION_CATS
                             WHERE COLLECTION_ID IN (11, 16))
         AND Sample_Id > 0)
  AND dbo.IMPORT_DOCUMENTS.ITEMID NOT IN
      (SELECT ASSIGNMENT_FOLDER_DOCUMENTS.Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
One possible reason is that one of your tables contains data at a lower level than your join key; for example, there may be multiple records per ITEMID, with the same ITEMID repeated X times, so the join multiplies rows. I would fix the query as below. Without knowledge of the data, try running the modified query; if the output is not what you're looking for, convert it into a SELECT within a SELECT (sketched after the query).
Hope this helps....
Try this SQL:
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID, b.CATEGORY_ID
FROM IMPORT_DOCUMENTS a
JOIN (SELECT DISTINCT ITEMID, CATEGORY_ID
      FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
      WHERE SCORE >= .7 AND SCORE <= .75
        AND CATEGORY_ID IN (SELECT DISTINCT CATEGORY_ID
                            FROM CATEGORY_COLLECTION_CATS
                            WHERE COLLECTION_ID IN (11, 16))
        AND Sample_Id > 0) b
  ON a.ITEMID = b.ITEMID
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
  AND a.ITEMID NOT IN (SELECT DISTINCT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
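And a minimal sketch of the "SELECT within a SELECT" variant mentioned above, pulling CATEGORY_ID through a correlated subquery so the outer query cannot multiply rows (MAX() is an assumed way to pick a single category per item, not from the original answer):
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID,
       (SELECT MAX(r.CATEGORY_ID)   -- assumed tie-breaker when an item has several categories
        FROM CATEGORY_COLLECTION_CATEGORY_RESULTS r
        WHERE r.ITEMID = a.ITEMID
          AND r.SCORE >= .7 AND r.SCORE <= .75
          AND r.Sample_Id > 0
          AND r.CATEGORY_ID IN (SELECT CATEGORY_ID
                                FROM CATEGORY_COLLECTION_CATS
                                WHERE COLLECTION_ID IN (11, 16))) AS CATEGORY_ID
FROM IMPORT_DOCUMENTS a
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
  AND a.ITEMID IN (SELECT ITEMID
                   FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
                   WHERE SCORE >= .7 AND SCORE <= .75 AND Sample_Id > 0
                     AND CATEGORY_ID IN (SELECT CATEGORY_ID
                                         FROM CATEGORY_COLLECTION_CATS
                                         WHERE COLLECTION_ID IN (11, 16)))
  AND a.ITEMID NOT IN (SELECT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS);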