I have created synthetic data for a typical call center.
Below is the screenshot of the table I have created.
Table 1:
Problem statement: Since this is completely random data, I noticed that there are some customers who are being assigned to the same agents whenever they call again.
So, using the query below, I was able to test such a case and count the number of times an agent is repeated for each customer:
SELECT agentid, customerid, count(customerid)
FROM aa_dev.calls
GROUP BY agentid, customerid
HAVING count(customerid) > 1;
Table 2
I have a separate agents table called aa_dev.agents in which the agent IDs are stored.
Now I want to replace the agentid in such cases, such that if an agentid is repeated 6 times for a single customer, then 5 of those times the agentid should be updated to some other agentid from the table, but the call times shouldn't overlap. That means the agent we are replacing with should not be busy at the time the call is going on.
I have assigned row numbers to the repeated ones:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY agentid, customerid ORDER BY random()) rn,
COUNT(*) OVER (PARTITION BY agentid, customerid) cnt
FROM aa_dev.calls
)
SELECT agentid, customerid, rn
FROM cte
WHERE cnt > 1;
This way I could visualize the repetition clearly.
So I don't want to update row 1 but the rest.
Is there any way I can achieve this? Can I use the row number and write a query that updates rows with rn = 2 onwards, one by one, with each row getting a unique agent?
If you don't want duplicates in your artificial data, it's probably better to not generate them.
But if you already have a table with duplicates and want to work on the duplicates, either updating them or deleting, here is the easy way:
You need a unique ID for each updated row. If you don't have it,
add it temporarily. Then you can use this pattern to update all duplicates
except the first one:
To add an artificial id column to a preexisting table, use:
ALTER TABLE calls ADD id serial;
In my case I generated a test table with 100 random rows:
CREATE TEMP TABLE calls (id serial, agentid int, customerid int);
INSERT INTO calls (agentid, customerid)
SELECT (random()*10)::int, (random()*10)::int
FROM generate_series(1, 100) n;
Define what constitutes a duplicate and find duplicates in data:
SELECT agentid, customerid, count(*), array_agg(id) id
FROM calls
GROUP BY 1,2 HAVING count(*)>1
ORDER BY 1,2;
Update all the duplicate rows except the first one (replace whatever_needed with the value you want):
UPDATE calls SET agentid = whatever_needed
FROM (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
Alternatively, remove all duplicates except the first one:
DELETE FROM calls
USING (
SELECT array_agg(id) id, min(id) idmin FROM calls
GROUP BY agentid, customerid HAVING count(*)>1
) AS dup
WHERE calls.id = ANY(dup.id) AND calls.id <> dup.idmin;
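To answer the row-number part of the question directly: the rn from the question's CTE can drive the update, once each row is identifiable via the id column added above. A minimal sketch, assuming aa_dev.calls got that id column and picking a random replacement agent from aa_dev.agents; it does not yet enforce the non-overlapping call-time constraint, which would additionally need the call's start and end times:
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY agentid, customerid ORDER BY random()) AS rn
    FROM aa_dev.calls
)
UPDATE aa_dev.calls c
SET agentid = (
    SELECT a.agentid
    FROM aa_dev.agents a
    WHERE a.agentid <> c.agentid   -- any agent other than the current one
    ORDER BY random()
    LIMIT 1
)
FROM cte
WHERE c.id = cte.id
  AND cte.rn > 1;                  -- keep row 1, update the rest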
Related
This code gives me a table of the unique values (without duplicates):
SELECT id, firstname, lastname, startdate, position
FROM (
SELECT id, firstname, lastname, startdate, position,
ROW_NUMBER() OVER (PARTITION BY (firstname, lastname) ORDER BY startdate DESC) rn
FROM people
) tmp
WHERE rn = 1;
What syntax would replace the current table with just the results of this one?
Alternatively, I could use WHERE rn <> 1 to get all the data I want to delete, but again, I am struggling to get the syntax of the DELETE right using this method.
Assuming values in firstname, lastname and startdate are never NULL, this simple query with a NOT EXISTS anti-semi-join does the job:
DELETE FROM people AS p
WHERE EXISTS (
SELECT FROM people AS p1
WHERE p1.firstname = p.firstname
AND p1.lastname = p.lastname
AND p1.startdate > p.startdate
);
It deletes every row where a newer copy exists, effectively keeping the latest row per group of peers. (Of course, (firstname, lastname) is a poor way of establishing identity. There are many distinct people with identical names. The demo may be simplified ...)
Can there be identical values in startdate? Then you need a tiebreaker ...
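For example, assuming id is unique, a row-wise comparison makes the id the tiebreaker (a sketch):
DELETE FROM people AS p
WHERE EXISTS (
   SELECT FROM people AS p1
   WHERE p1.firstname = p.firstname
   AND p1.lastname = p.lastname
   AND (p1.startdate, p1.id) > (p.startdate, p.id)  -- id breaks ties on startdate
);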
Typically faster than using a subquery with row_number() (a sketch of that variant follows below the links). There are a hundred and one ways to make this faster, depending on your precise situation and requirements. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If compared columns can be NULL, consider:
How to delete duplicate rows without unique identifier
There is a whole dedicated tag for duplicate-removal. Combine it with postgres to narrow down:
https://stackoverflow.com/questions/tagged/duplicates+postgresql
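For comparison, the row_number() variant asked about in the question could look roughly like this (a sketch, assuming id is unique per row):
DELETE FROM people
WHERE id IN (
   SELECT id
   FROM (
      SELECT id,
             ROW_NUMBER() OVER (PARTITION BY firstname, lastname
                                ORDER BY startdate DESC) AS rn
      FROM people
   ) tmp
   WHERE rn > 1   -- everything except the newest row per (firstname, lastname)
);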
I need to write a sql code that probably is very simple but I am very new to it.
I need to find all the records from one table that have a matching id (but no more than one) in the other table. E.g. one table contains records of the employees and the second one the employees' telephone numbers. I need to find all employees with only one telephone number.
Sample data would be nice. In the absence of any:
SELECT
employees.employee_id
FROM
employees
LEFT JOIN
(SELECT distinct on(employee_id) employee_id FROM emp_phone) AS phone
ON
employees.employee_id = phone.employee_id
WHERE
phone.employee_id IS NOT NULL;
You need a join of the two tables, a GROUP BY on the employee, and the condition in the HAVING clause:
SELECT e.employee_id, e.name
FROM employees e INNER JOIN numbers n
ON e.employee_id = n.employee_id
GROUP BY e.employee_id, e.name
HAVING COUNT(*) = 1;
If there can be more than a few numbers per employee in the table with the employees' telephone numbers (calling it tel), then it's cheaper to avoid GROUP BY and HAVING which has to process all rows. Find employees with "unique" numbers using a self-anti-join with NOT EXISTS.
As long as you don't need more than the employee_id and their unique phone number, you don't even have to involve the employee table at all:
SELECT *
FROM tel t
WHERE NOT EXISTS (
SELECT FROM tel
WHERE employee_id = t.employee_id
AND tel_number <> t.tel_number -- or use PK column
);
If you need additional columns from the employee table:
SELECT * -- or any columns you need
FROM (
SELECT employee_id AS id, tel_number -- or any columns you need
FROM tel t
WHERE NOT EXISTS (
SELECT FROM tel
WHERE employee_id = t.employee_id
AND tel_number <> t.tel_number -- or use PK column
)
) t
JOIN employee e USING (id);
The column alias in the subquery (employee_id AS id) is just for convenience. Then the outer join condition can be USING (id), and the ID column is only included once in the result, even with SELECT * ...
Simpler with a smart naming convention that uses employee_id for the employee ID everywhere. But it's a widespread anti-pattern to use employee.id instead.
Related:
JOIN table if condition is satisfied, else perform no join
I have a result set in a temp table that is the result of some complicated joins, and I need to know the best way to filter out rows that have a duplicate AccountId/HealthPlanId (shown below).
select * from #HealthPlans
And the contents are as follows:
AccountId  MemberId  HealthPlanId  RankNo
101273     47570     5215          1
101273     47570     2187          2
101273     55551     5179          3
160026     48102     5620          1
160026     48446     5620          2
In this scenario, RankNo is not a value computed by my original query; it is a db column that ranks the member/healthPlan combinations where there is more than one on a given account.
In the case of account 101273, I have the same member (47570) with 3 separate health plans (5215, 2187, 5179). That's fine. I want to rank the health plans.
However, for accountId 160026, I have healthPlanId 5620 listed twice but with different memberIds. I need to keep either of these member IDs and discard the other (it doesn't matter which I keep since I'm only interested in ranking the HealthPlanId).
Basically, an account should only have one row for each unique health plan. However, duplicate memberIds are OK and should be ranked as long as the HealthPlanId differs.
In other words, select rows from #HealthPlans such that the following is the result set:
AccountId  MemberId  HealthPlanId  RankNo
101273     47570     5215          1
101273     47570     2187          2
101273     55551     5179          3
160026     48102     5620          1
There's no need to show the original joins because this is basically a simplification of my original issue.
Thanks,
Sean
Another method using a window function:
DECLARE #tab TABLE (AccountId int, MemberId int, HealthPlanId int, RankNo int)
INSERT #tab VALUES
(101273,47570,5215,1),
(101273,47570,2187,2),
(101273,55551,5179,3),
(160026,48102,5620,1),
(160026,48446,5620,2)
SELECT *
FROM(
SELECT ROW_NUMBER() OVER(PARTITION BY t.AccountId, t.HealthPlanId ORDER BY t.RankNo) rn, t.*
FROM #tab t
) t2
WHERE t2.rn = 1
Your particular query might look like:
SELECT *
FROM(
SELECT ROW_NUMBER() OVER(PARTITION BY hp.AccountId, hp.HealthPlanId ORDER BY hp.RankNo) rn, hp.*
FROM #HealthPlans hp
) hp2
WHERE hp2.rn = 1
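If the duplicate rows should actually be removed from the temp table rather than just filtered in the SELECT, the same window function can drive a DELETE through an updatable CTE (a sketch):
WITH hp2 AS (
    SELECT ROW_NUMBER() OVER(PARTITION BY hp.AccountId, hp.HealthPlanId ORDER BY hp.RankNo) rn
    FROM #HealthPlans hp
)
DELETE FROM hp2
WHERE rn > 1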
My table is something like:
CREATE TABLE table1
(
_id text,
name text,
data_type int,
data_value int,
data_date timestamp -- insertion time
);
Now, due to a system bug, many duplicate entries were created, and I need to remove those duplicates and keep only unique entries, excluding data_date from the comparison because it is a system-generated date.
My query to do that is something like:
DELETE FROM table1 A
USING ( SELECT _id, name, data_type, data_value, MIN(data_date) min_date
FROM table1
GROUP BY _id, name, data_type, data_value
HAVING count(data_date) > 1) B
WHERE A._id = B._id
AND A.name = B.name
AND A.data_type = B.data_type
AND A.data_value = B.data_value
AND A.data_date != B.min_date;
However, although this query works, with millions of records in the table I want a faster way to do it. My idea is to create a new column whose value is assigned per partition of [_id, name, data_type, data_value] (i.e. the columns in the GROUP BY). However, I could not find a way to create such a column.
I would appreciate it if anyone could suggest a way to create such a column.
Edit 1:
There is another thing to add: I don't want to use a CTE or subquery for updating this new column, because it would be the same as my existing query.
The best way is simply creating a new table without the duplicated records:
CREATE TABLE table1_dedup AS   -- pick any name for the new table
SELECT _id, name, data_type, data_value, MIN(data_date) AS min_date
FROM table1
GROUP BY _id, name, data_type, data_value;
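If the deduplicated data has to end up under the original table name, the usual follow-up is to swap the tables (a sketch, assuming the new table was created as table1_dedup as above; verify the result before dropping the old table):
BEGIN;
ALTER TABLE table1 RENAME TO table1_old;
ALTER TABLE table1_dedup RENAME TO table1;
COMMIT;
-- DROP TABLE table1_old;   -- once the new contents are verified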
Alternatively, you can create a rank and then filter, but a subquery is needed.
RANK() OVER (PARTITION BY your_variables ORDER BY data_date ASC) r
And then filter r=1.
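A minimal sketch of that approach, assuming PostgreSQL so that ctid can stand in for the missing unique identifier (if data_date can tie within a group, ROW_NUMBER() is safer than RANK()):
DELETE FROM table1 a
USING (
    SELECT ctid,
           RANK() OVER (PARTITION BY _id, name, data_type, data_value
                        ORDER BY data_date ASC) r
    FROM table1
) b
WHERE a.ctid = b.ctid
  AND b.r > 1;   -- keep only the earliest row per group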
I have an interim table without any primary key or identity column. I need to check one of the columns (branch_ref) for duplicate values and mark the flag as 'D' when the branch_ref is the same for more than one record, except for the last occurrence in the table. How can we do this?
Actual data as stored in the table:
select branch_name,branch_reference,address_1,zip_cd,null as flag_val FROM Branch_Master
As per the above table, I need all flags to be updated to 'D' except for the 6th record (branch_reference=9910) and the 16th record (branch_reference=99100 and zip_cd=612).
When I use the ROW_NUMBER function to identify the duplicates, the order gets changed.
SELECT branch_name,branch_reference,address_1,zip_cd,flag_val, ROW_NUMBER() OVER(PARTITION BY branch_reference ORDER BY branch_reference) RID
FROM Branch_Master
I am using the below query to update flag_val, and it's updating the wrong records.
;WITH CTE AS
(
SELECT branch_name,branch_reference,address_1,zip_cd,flag_val, ROW_NUMBER() OVER(PARTITION BY branch_reference ORDER BY branch_reference) RID
FROM Branch_Master
WHERE flag_val IS NULL
)
UPDATE C1 SET flag_val = 'D'
FROM CTE C1
LEFT OUTER JOIN (SELECT branch_reference, max(RID) MRID FROM CTE GROUP BY branch_reference) C2
ON C1.branch_reference=C2.branch_reference and C1.RID=C2.MRID
WHERE C2.branch_reference IS NULL
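The core issue is that the ROW_NUMBER here is effectively unordered: the ORDER BY uses the same column as the PARTITION BY, so "last occurrence" is not well defined until some column captures row order. A sketch, assuming an identity column is added for that purpose (note that SQL Server does not guarantee the generated values follow the original insertion order):
ALTER TABLE Branch_Master ADD seq int IDENTITY(1,1);

;WITH CTE AS
(
    SELECT flag_val,
           ROW_NUMBER() OVER(PARTITION BY branch_reference ORDER BY seq DESC) RID
    FROM Branch_Master
)
UPDATE CTE SET flag_val = 'D'
WHERE RID > 1    -- every occurrence except the last one per branch_reference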