Subquery in UPDATE doesn't see already updated records - PostgreSQL

I have a few thousand records with a duplicate sortorder (which causes duplicate entries in other queries), so I'm trying to set a correct sort order for all those records.
First I set them all to -1 so the sortorder would start from 0, and then I execute this query:
UPDATE op.customeraddress SET sortorder = (SELECT MAX(ca.sortorder) + 1
FROM op.customeraddress AS ca
WHERE ca.customerid = customeraddress.customerid)
WHERE id IN (<subquery for IDs>)
The problem is that the MAX() in the subquery always seems to return the same value - it doesn't know about an earlier update.
The query works fine if I manually apply it record by record.
Any ideas on how to do this without having to resort to looping?

This should do it:
with new_order as
(
select ctid as rid,
row_number() over (partition by customerid order by sortorder) as rn
from customeraddress
)
update customeraddress ca
set sortorder = new_order.rn
from new_order
where ca.ctid = new_order.rid
and ca.id IN (<subquery for IDs>);
No need to reset the sortorder before running this; it will renumber all customeraddresses for each customerid according to the old order.
You need PostgreSQL 9.1 or later for the above solution (writeable CTEs).
For previous versions, this should do it:
update customeraddress ca
set sortorder = t.rn
from
(
select ctid as rid,
row_number() over (partition by customerid order by sortorder) as rn
from customeraddress
) t
where ca.ctid = t.rid
and ca.id IN (<subquery for IDs>);
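Either way, a quick check (a sketch against the same table) can confirm that no duplicate sort orders remain per customer:
-- should return no rows once the renumbering has been applied
select customerid, sortorder, count(*)
from customeraddress
group by customerid, sortorder
having count(*) > 1;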

You could use a sequence:
CREATE TEMPORARY SEQUENCE sort_seq;
UPDATE op.customeraddress SET sortorder = nextval('sort_seq')
WHERE id IN ...
Note that this numbers the rows across all customers rather than per customer, and the order in which UPDATE assigns the values is not guaranteed.

Related

How to update duplicate rows in a table in PostgreSQL

I have created synthetic data for a typical call center.
Below is a screenshot of the table I have created (Table 1).
Problem statement: Since this is completely random data, I noticed that there are some customers who are being assigned to the same agents whenever they call again.
Using this query I was able to test for such cases and count the number of times an agent is repeated for each customer:
select agentid, customerid, count(customerid) from aa_dev.calls group by agentid, customerid having count(customerid) > 1;
The result is shown in Table 2.
I have a separate agents table called aa_dev.agents in which the agent ids are stored.
Now I want to replace the agentid in such cases, so that if an agentid is repeated 6 times for a single customer, then 5 of those times the agentid should be updated with some other agentid from the table, but the call times shouldn't overlap. That means the replacement agent should not be busy at the time the call is going on.
I have assigned row numbers to the repeated ones:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY agentid, customerid ORDER BY random()) rn,
COUNT(*) OVER (PARTITION BY agentid, customerid) cnt
FROM aa_dev.calls
)
SELECT agentid, customerid, rn
FROM cte
WHERE cnt > 1;
This way I could visualize the repetition clearly.
So I don't want to update row 1, but the rest.
Is there any way I can achieve this? Can I use the row number to update rows from rn = 2 onwards, giving each row a unique agent?
If you don't want duplicates in your artificial data, it's probably better not to generate them in the first place.
But if you already have a table with duplicates and want to work on the duplicates, either updating or deleting them, here is the easy way:
You need a unique ID for each updated row. If you don't have one, add it temporarily. Then you can use this pattern to update all duplicates except the first one.
To add an artificial id column to a preexisting table, use:
ALTER TABLE calls ADD id serial;
In my case I generated a test table with 100 random rows:
CREATE TEMP TABLE calls (id serial, agentid int, customerid int);
INSERT INTO calls (agentid, customerid)
SELECT (random()*10)::int, (random()*10)::int
FROM generate_series(1, 100) n;
Define what constitutes a duplicate and find duplicates in data:
SELECT agentid, customerid, count(*), array_agg(id) id
FROM calls
GROUP BY 1,2 HAVING count(*)>1
ORDER BY 1,2;
Update all the duplicate rows except the first one (agentid is set to a placeholder, whatever_needed):
UPDATE calls SET agentid = whatever_needed
FROM (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
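As a sketch of picking the replacement value itself (the call-overlap constraint is left out, and aa_dev.agents is assumed to have an agentid column), each duplicate row could get a random different agent:
UPDATE calls c
SET agentid = (
-- pick any agent other than the current one; the overlap check is not included
SELECT a.agentid FROM aa_dev.agents a
WHERE a.agentid <> c.agentid
ORDER BY random() LIMIT 1
)
FROM (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE c.id = ANY(dup.id) AND c.id <> dup.idmin;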
Alternatively, remove all duplicates except the first one:
DELETE FROM calls
USING (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;

Using TOP in a subquery

I have the following data structure and I want to write a query that returns, for a given order number, all the OrderLineIds with the most recent StatusId for each order line.
If I were just interested in a particular order line I could use:
select top 1 StatusId from task where OrderLineId = @OrderLineId order by TaskId desc
but I can't figure out how to get all the results for a given OrderId in one SQL Statement.
If I'm understanding your question correctly, you could use row_number in a subquery:
select orderid, orderlineid, statusid
from (
select o.orderid,
ol.orderlineid,
t.statusid,
row_number() over (partition by ol.orderlineid order by t.taskid desc) rn
from order o
join orderline ol on o.orderid = ol.orderid
join task t on ol.orderlineid = t.orderlineid
) t
where orderid = ? and rn = 1
Please note that order is a reserved word in SQL Server, so if that's your real table name you'll need to put brackets around it ([order]). But I'd recommend renaming it to make your life easier.
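Alternatively, keeping the TOP 1 pattern from the question, a CROSS APPLY variant (a sketch assuming the same table and column names) returns the latest status for each order line of a given order:
select ol.orderlineid, t.statusid
from orderline ol
cross apply (
-- latest task per order line, mirroring the question's TOP 1 query
select top 1 StatusId as statusid
from task
where OrderLineId = ol.orderlineid
order by TaskId desc
) t
where ol.orderid = ?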

Update Postgresql table using rank()

I'm trying to update a column (pop_1_rank) in a PostgreSQL table with the results from a rank() like so:
UPDATE database_final_form_merge
SET
pop_1_rank = r.rnk
FROM (
SELECT pop_1, RANK() OVER ( ORDER BY pop_1 DESC) FROM database_final_form_merge WHERE territory_name != 'north' AS rnk)r
The SELECT query by itself works fine, but I just can't get it to update correctly. What am I doing wrong here?
I'd rather use the CTE notation.
WITH cte as (
SELECT pop_1,
RANK() OVER ( ORDER BY pop_1 DESC) AS rnk
FROM database_final_form_merge
WHERE territory_name <> 'north'
)
UPDATE database_final_form_merge
SET pop_1_rank = cte.rnk
FROM cte
WHERE database_final_form_merge.pop_1 = cte.pop_1
As far as I know, Postgres updates tables, not subqueries. So you can join back to the table:
UPDATE database_final_form_merge
SET pop_1_rank = r.rnk
FROM (SELECT pop_1, RANK() OVER ( ORDER BY pop_1 DESC) as rnk
FROM database_final_form_merge
WHERE territory_name <> 'north'
) r
WHERE database_final_form_merge.pop_1 = r.pop_1;
In addition:
The column alias goes right after the column expression, not at the end of the WHERE clause.
This assumes that pop_1 is the id connecting the two tables.
You're missing a WHERE clause on the UPDATE query; when doing UPDATE ... FROM you're basically doing a join.
So you need to select the primary key in the subquery and then match on it, to update just the rows you computed the rank over.
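A minimal sketch of that advice, assuming the table's primary key is a column named id:
UPDATE database_final_form_merge m
SET pop_1_rank = r.rnk
FROM (
-- carry the assumed primary key (id) through the subquery so each row can be matched back
SELECT id, RANK() OVER (ORDER BY pop_1 DESC) AS rnk
FROM database_final_form_merge
WHERE territory_name <> 'north'
) r
WHERE m.id = r.id;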

most efficient query to get the first and last record id in a large dataset

I need to write a query against a large dataset to get the first and last record id, plus the first record's created time. A sample of the data is as follows:
In the above case, if the category "Blue" is passed into the query as a parameter, I expect the query to return "A12, 13:00, E66".
I can use an aggregate function to get the max and min time from the dataset, and join to get the first and last record. But I'm wondering whether there is a more efficient way to achieve the same output.
My advice would be to try to reduce the number of scan/seek operations by comparing execution plans and to place indexes on the categoryID (for lookup) and time (for sorting) columns.
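For example, a covering index along those lines (the index name and the Include column are assumptions) might look like:
-- hypothetical index supporting both the CategoryID lookup and the CreatedTime sort
Create Index IX_Table_CategoryID_CreatedTime
On <Table> (CategoryID, CreatedTime)
Include (RecordID)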
If you have SQL Server 2008 or later, you could use the following, which requires two scans/seeks:
Declare @CategoryID As Varchar(16)
Set @CategoryID = 'Blue'
Select
First_Record.RecordID,
First_Record.CreatedTime,
Last_Record.RecordID
From
(
Select Top 1
RecordID,
CreatedTime
From
<Table>
Where
CategoryID = @CategoryID
Order By
CreatedTime Asc
) First_Record
Cross Apply
(
Select Top 1
RecordID
From
<Table>
Where
CategoryID = @CategoryID
Order By
CreatedTime Desc
) Last_Record
If you have SQL Server 2012 or later, you could write the following, which requires only one scan/seek:
Select Top 1
First_Value(RecordID) Over (Partition By CategoryID Order By CreatedTime Asc) As FirstRecordID,
First_Value(CreatedTime) Over (Partition By CategoryID Order By CreatedTime Asc) As FirstCreatedTime,
First_Value(RecordID) Over (Partition By CategoryID Order By CreatedTime Desc) As LastRecordID
From
<Table>
Where
CategoryID = @CategoryID

Updating a CTE table fails because of a derived or constant field

I'm using MS-SQL 2012
WITH C1 AS
(
SELECT ID, 0 as Match, Field2, Count(*) as Cnt
FROM TableX
GROUP BY ID, Field2
)
UPDATE C1 SET Match = 1
WHERE ID = (SELECT MATCHING_ID FROM AnotherTable WHERE ID = C1.ID)
This TSQL statement gives me the following error:
Update or insert of view or function 'C1' failed because it contains a derived or constant field.
Ideally I would like to create a "fake field" named Match and set its default value to 0. Then, with the update, I would like to update ONLY the records that have an existing entry in AnotherTable.
Any thoughts what am I doing wrong?
Thanks in advance.
Try doing a Left Outer Join, like:
SELECT x.ID, ISNULL(a.Matching_ID, 0) as Match, x.Field2, Count(*)
FROM TableX x
LEFT OUTER JOIN AnotherTable a on x.ID = a.ID
GROUP BY x.ID, ISNULL(a.Matching_ID, 0), x.Field2
without the need for the C1 CTE.
If I am understanding correctly, the problem is that you are trying to update the CTE table. If you update the table directly you should be fine.
Does this modified version help?
SELECT t.ID
, CASE WHEN (EXISTS (SELECT MATCHING_ID FROM AnotherTable WHERE ID = t.ID)) THEN 1 ELSE 0 END as Match
,t.Field2
,Count(*)
FROM TableX t
GROUP BY t.ID, t.Field2
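And if Match were a real column on TableX, updating the base table directly, as suggested above, could look like this sketch:
-- hypothetical: assumes TableX has a real Match column defaulting to 0
UPDATE TableX
SET Match = 1
WHERE EXISTS (SELECT 1 FROM AnotherTable a WHERE a.ID = TableX.ID)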