I have a table with an id column. We recently added a new field, unique_id, calculated from an external source, which made us realize that we actually have duplicates in the database:
Main Table
id | unique_id | ...
---|------------
4 | A |
5 | A
6 | B
We can see that 5 is actually a duplicate of 4, as both have the same unique_id.
Now this needs to be cleaned up.
Sadly, I cannot simply delete the duplicates (5), as other tables depend on them:
Other Table (OtherTable.main_id REFERENCES MainTable.id)
id | main_id | ...
---|------------
1 | 4 | Blah
2 | 5
3 | 6
Now I have to clean up the duplicates, here by pointing every reference to 5 at 4 instead:
UPDATE OtherTable SET main_id = 4 WHERE main_id = 5
How can I do that in an efficient update?
I tried to simply update every reference to the first one with that same unique_id, however that didn't complete in a day.
UPDATE "OtherTable" SET "main_id" = (SELECT "id" FROM "MainTable" WHERE "unique_id" = (SELECT "unique_id" FROM "MainTable" WHERE "id" = "OtherTable"."main_id") LIMIT 1)
If it helps, the MainTable contains about 750,000 entries, the OtherTable contains 12,000,000 rows.
That's probably because the triple nested SELECT is quite inefficient.
For the simple part, deleting the duplicates (once I am done changing the references to the first one of its kind), I found this query to work swiftly enough:
DELETE FROM MainTable
WHERE id IN
(SELECT id
FROM
(SELECT id,
ROW_NUMBER() OVER( PARTITION BY unique_id
ORDER BY id ) AS row_num
FROM MainTable ) t
WHERE t.row_num > 1 );
However I need a way to update the references to the non-deleted ones of the duplicates.
Instead of UPDATE with a nested query, I'd suggest using UPDATE FROM for a join, and the same window function as in your DELETE statement:
UPDATE "OtherTable" AS other
SET main_id = main.min_id
FROM (SELECT
id,
first_value(id) OVER (PARTITION BY unique_id ORDER BY id) AS min_id
FROM "MainTable"
) AS main
WHERE main.id = other.main_id
AND main.id <> main.min_id
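To try the whole clean-up (remap the references, then delete the duplicates) on a toy data set, here is a minimal sketch in Python with the built-in sqlite3 module. It is not the UPDATE ... FROM statement above: to stay independent of the SQLite version it swaps in a temporary mapping table built with MIN(id) per unique_id. The table name remap and the function names are made up for illustration.

```python
import sqlite3

def remap_and_dedupe(conn):
    """Point OtherTable at the lowest id per unique_id, then drop the duplicates."""
    cur = conn.cursor()
    # Hypothetical helper table: one row per duplicate id plus the id that survives.
    cur.execute("""
        CREATE TEMP TABLE remap AS
        SELECT m.id AS old_id, k.keep_id
        FROM MainTable AS m
        JOIN (SELECT unique_id, MIN(id) AS keep_id
              FROM MainTable GROUP BY unique_id) AS k
          ON k.unique_id = m.unique_id
        WHERE m.id <> k.keep_id
    """)
    # Only rows that reference a duplicate are touched.
    cur.execute("""
        UPDATE OtherTable
        SET main_id = (SELECT keep_id FROM remap WHERE old_id = OtherTable.main_id)
        WHERE main_id IN (SELECT old_id FROM remap)
    """)
    # With all references moved, the duplicates can be deleted safely.
    cur.execute("DELETE FROM MainTable WHERE id IN (SELECT old_id FROM remap)")
    cur.execute("DROP TABLE remap")
    conn.commit()

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE MainTable (id INTEGER PRIMARY KEY, unique_id TEXT);
        CREATE TABLE OtherTable (id INTEGER PRIMARY KEY,
                                 main_id INTEGER REFERENCES MainTable(id));
        INSERT INTO MainTable VALUES (4, 'A'), (5, 'A'), (6, 'B');
        INSERT INTO OtherTable VALUES (1, 4), (2, 5), (3, 6);
    """)
    remap_and_dedupe(conn)
    main = conn.execute("SELECT id, unique_id FROM MainTable ORDER BY id").fetchall()
    other = conn.execute("SELECT id, main_id FROM OtherTable ORDER BY id").fetchall()
    return main, other
```

Calling demo() on the sample rows from the question leaves ids 4 and 6 in MainTable, and the OtherTable row that pointed at 5 now points at 4. On 12 million rows an index on the referencing column would presumably matter more than the exact shape of the statement.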
I have a table like this:
id product amount
1 A 6
1 A 8
1 A
1 B 1
1 B
2 C 2
2 C
2 C 4
2 C
2 C
and I need to make it like this:
id product amount
1 A 6
1 A 8
1 A 8
1 B 1
1 B 1
2 C 2
2 C 2
2 C 4
2 C 4
2 C 4
That is, each missing amount should be filled with the previous non-missing value.
I tried to use the lag() function. However, the window function lag() is not allowed in UPDATE:
update tableA set amount = lag(amount);
What can I do using PostgreSQL?
You can SELECT what you want to UPDATE, but there is no (easy) way to actually do the UPDATE, because the table fox does not have a primary key (yet).
CREATE TABLE fox (
id integer NOT NULL,
product text NOT NULL,
amount integer
);
To populate the fox with some data.
INSERT INTO fox VALUES
(1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL),
(3, 'What does the fox say?', 5);
The query.
WITH ranks (rank, id, product, amount) AS (
SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
SELECT r.id, r.product,
(SELECT amount FROM ranks
WHERE id = r.id AND product = r.product
AND rank < r.rank AND amount IS NOT NULL
ORDER BY rank DESC LIMIT 1 -- nearest preceding non-NULL amount
)
FROM ranks r WHERE r.amount IS NULL ORDER BY 1, 2, 3;
Yields the rows which previously had a NULL and now have the appropriate amount.
id | product | amount
----+---------+--------
1 | A | 8
1 | B | 1
2 | C | 2
2 | C | 4
2 | C | 4
But you cannot use this data to update, because rows are still not uniquely identified by (id, product) - which means you cannot write a WHERE condition identifying your rows uniquely. How would the WHERE clause know whether to change the amount to 2 or 4 in the UPDATE? The multiple rows with (id, product) = (2, 'C') are indistinguishable in the WHERE of the UPDATE.
Let's give the fox a primary key.
ALTER TABLE fox ADD COLUMN IF NOT EXISTS pkey serial ;
ALTER TABLE fox ADD PRIMARY KEY (pkey) ;
Now we can identify the rows by the PRIMARY KEY pkey.
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
)
SELECT pkey,
id, product, -- you can leave these out in your UPDATE: pkey is UNIQUE
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n ORDER BY 1, 2, 3, 4;
This displays the changes to be made:
pkey | id | product | amount
------+----+---------+--------
3 | 1 | A | 8
5 | 1 | B | 1
7 | 2 | C | 2
9 | 2 | C | 4
10 | 2 | C | 4
And we can use pkey in the UPDATE.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
), changes AS (
SELECT pkey,
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n
) UPDATE fox f SET amount = c.amount FROM changes c WHERE f.pkey = c.pkey ;
Check the result is okay:
SELECT * FROM fox ORDER BY 1, 2, 3, 4;
And accept using COMMIT or ROLLBACK accordingly.
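For experimenting without a PostgreSQL server, the same fill-forward idea can be sketched in Python with the built-in sqlite3 module. This is only an illustration under assumptions: pkey is an INTEGER PRIMARY KEY assigned in insertion order, the correlated subquery sits directly in the UPDATE instead of a CTE, and the function names are invented.

```python
import sqlite3

ROWS = [(1, 'A', 6), (1, 'A', 8), (1, 'A', None), (1, 'B', 1), (1, 'B', None),
        (2, 'C', 2), (2, 'C', None), (2, 'C', 4), (2, 'C', None), (2, 'C', None)]

def forward_fill(conn):
    """Fill each NULL amount with the nearest earlier non-NULL amount
    of the same (id, product), using pkey as the row order."""
    conn.execute("""
        UPDATE fox
        SET amount = (SELECT prev.amount
                      FROM fox AS prev
                      WHERE prev.id = fox.id AND prev.product = fox.product
                        AND prev.pkey < fox.pkey AND prev.amount IS NOT NULL
                      ORDER BY prev.pkey DESC LIMIT 1)
        WHERE amount IS NULL
    """)
    conn.commit()

def demo():
    conn = sqlite3.connect(":memory:")
    # pkey plays the role of the serial primary key added above.
    conn.execute("""CREATE TABLE fox (pkey INTEGER PRIMARY KEY,
                                      id INTEGER NOT NULL,
                                      product TEXT NOT NULL,
                                      amount INTEGER)""")
    conn.executemany("INSERT INTO fox (id, product, amount) VALUES (?, ?, ?)", ROWS)
    forward_fill(conn)
    return [r[0] for r in conn.execute("SELECT amount FROM fox ORDER BY pkey")]
```

demo() returns the amounts in pkey order. A row whose group has no earlier non-NULL amount simply stays NULL.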
Alternative to adding a PRIMARY KEY
Every table should always have a primary key.
If you insist on not having one, you could also compute the rows with their then-not-NULL amount and, instead of UPDATEing them, INSERT them into your table and then remove the rows which had no amount with DELETE FROM fox WHERE amount IS NULL. This way you get around adding a unique primary key. Of course the INSERT and DELETE must be packaged into one TRANSACTION so as not to interfere with other transactions running concurrently. For example, another transaction might add rows with a NULL amount AFTER you have calculated the data to be INSERTed using SELECT and before you DELETE all NULL amounts. You'd lose the concurrently added row with the NULL amount in this case (data loss due to concurrency; think ACID).
But a missing primary key will probably bite you later on, anyway.
Without knowing what defines the "previous row", everything is a guess. But you can use an anonymous block to do what you want; just adapt it to your needs:
CREATE TEMPORARY TABLE test_lag AS
SELECT column1 AS id, column2 AS product, column3 AS amount FROM (
VALUES (1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL)) AS tmp;
DO $$
BEGIN
-- Loop until no more NULL amounts can be filled.
-- Why do we need this? Because PostgreSQL doesn't support the IGNORE NULLS clause on lag().
LOOP
WITH tmp AS (
-- You MUST change this ORDER BY to the right columns (what defines the previous row?)
SELECT ctid, lag(amount) OVER (ORDER BY id, product) AS last_amount FROM test_lag
)
UPDATE test_lag SET amount = tmp.last_amount FROM tmp
WHERE test_lag.ctid = tmp.ctid AND amount IS NULL
AND tmp.last_amount IS NOT NULL; -- otherwise a leading NULL would make the loop run forever
IF NOT FOUND THEN
EXIT;
END IF;
END LOOP;
END $$;
SELECT * FROM test_lag ORDER BY id, product, amount;
I have a many-to-many relation representing containers holding items.
I have a primary key row_id in the table.
I insert four rows: (container_id, item_id) values (1778712425160346751, 4). These rows will be identical except the aforementioned unique row_id.
I subsequently execute the following query:
delete from contains
where item_id = 4 and
container_id = '1778712425160346751' and
row_id =
(
select max(row_id) from contains
where container_id = '1778712425160346751' and
item_id = 4
)
returning
(
select count(*) from contains
where container_id = '1778712425160346751' and
item_id = 4
);
Now I expected to get 3 returned from this query, but I got a 4. Getting a 4 is the desired behavior, but it is not what was expected.
My question is: can I always expect that the returning clause executes before the delete, or is this an idiosyncrasy of certain versions or specific software?
The use of a query in the RETURNING clause is allowed but not documented. From the documentation:
output_expression
An expression to be computed and returned by the DELETE command after each row is deleted. The expression can use any column names of the table named by table_name or table(s) listed in USING. Write * to return all columns.
It seems logical that the query sees the table in a state before deleting, as the statement is not completed yet.
create temp table test as
select id from generate_series(1, 4) id;
delete from test
returning id, (select count(*) from test);
id | count
----+-------
1 | 4
2 | 4
3 | 4
4 | 4
(4 rows)
The same applies to UPDATE:
drop table if exists test;
create temp table test as
select id from generate_series(1, 4) id;
update test
set id = id+ 1
returning id, (select sum(id) from test);
id | sum
----+-----
2 | 10
3 | 10
4 | 10
5 | 10
(4 rows)
I apologize for the long problem description, but I was unable to break it down more than this.
Before reading on, keep in mind that my end goal is T-SQL (maybe some recursive CTE?). However, a shove in the right direction would be much appreciated (I've tried a million things and have been scratching my head for hours).
Consider the following problem: I have a table of categories which is self-referencing through ParentCategoryID->CategoryID:
-----------------------
| Category |
-----------------------
| *CategoryID |
| Name |
| ParentCategoryID |
-----------------------
This type of table allows me to build a tree structure, say:
           -----------------
           | parentCategory|
           -----------------
          /        |        \
    child(1)     child     child
    /      \
child(2)  child(3)
where "child" means "child category" (ignore the numbers, I'll explain them later). Obviously I can have as many children as I like at any level.
Every day, a program I've written stores values to a table "ValueRegistration", which is connected to "Category" like so:
------------------------        ----------------------          ----------------------
|  ValueRegistration   |        |        Item        |          |      Category      |
------------------------        ----------------------          ----------------------
| *RegID               |        | *ItemID            |          | *CategoryID        |
| Date                 |>-------| CategoryID         |>---------| Name               |
| ItemID               |        | ItemTypeID         |          | ParentCategoryID   |
| Value                |        ----------------------          ----------------------
------------------------                  Y
                                          |
                                          |
                                ---------------------
                                |     ItemType      |
                                ---------------------
                                | *ItemTypeID       |
                                | ItemType          |
                                ---------------------
As you can see, a ValueRegistration concerns a specific Item, which in turn belongs to a certain category. The category may or may not have a parent (and grandparents and great-grandparents and so on). For instance, it may be the child all the way to the bottom left (number 2) in the tree I illustrated above. Also, an Item is of a certain ItemType.
My goal:
I register values to the ValueRegistration table daily (in other words, Date and ItemID combined also form a key). I want to be able to retrieve a result set of the following form:
[ValueRegistration.Date, ItemType.ItemTypeID, Category.CategoryID, Value]
which seems simple enough (it's obviously just a bunch of joins). However, I also want results for rows that actually don't exist in the ValueRegistration table, namely results in which the values of sibling nodes for a given date and itemID are summed, and a new row is produced where ValueRegistration.Date and ItemType.ItemTypeID are the same as in the child nodes but where CategoryID is that of the parent of the child nodes. Keep in mind that an Item will NOT exist for this type of row in the resultset.
Consider for instance a scenario where I have ValueRegistrations for child 2 and 3 on a bunch of dates and a bunch of ItemIDs. Obviously, each registration belongs to a certain ItemType and Category. It should be clear to the reader that
ValueRegistration.Date, ItemType.ItemTypeID, Category.CategoryID
is a sufficient key to identify a specific ValueRegistration (in other words, it's possible to solve my problem without having to create temporary Item rows), and so I can inner join all tables and get, for instance, the following result:
ValueReg.Date, ItemType.ItemTypeID, Category.CategoryID, ValueReg.Value
08-mar-2013, 1, 5, 200
08-mar-2013, 1, 6, 250
Assume now that I have four category rows that look like this:
1, category1, NULL
2, category2, 1
5, category5, 2
6, category6, 2
I.e. category 1 is the parent of category 2, and category 2 is the parent of categories 5 and 6. Category 1 has no parent. I now wish to append the following rows to my resultset:
08-mar-2013, 1, 2, (200+250)
08-mar-2013, 1, 1, (200+250+sum(values in all other child nodes of node 1))
Remember:
the solution needs to be recursive, so that it is performed upwards in the tree (until NULL is reached)
an Item row will NOT exist for tree nodes which are calculated, so CategoryID and ItemTypeID must be used
yes, I know I could simply create "virtual" Item rows and add ValueRegistrations when I originally INSERT INTO my database, but that solution is prone to errors, particularly if other programmers code against my database but either forget or are unaware that results must be passed up to the parent node. A solution that calculates this on request instead is much safer and, frankly, much more elegant.
I've tried to set something up along the lines of this, but I seem to get stuck with having to group by Date and ItemTypeID, and that's not allowed in a CTE. The programmer in me just wants to make a recursive function, but I'm really struggling to do that in SQL.
Anyone have an idea where to begin, what things I should try, or even (fingers crossed) a solution?
Thanks!
Alexander
EDIT:
SQL FIDDLE
CREATE TABLE ItemType(
ItemTypeID INT PRIMARY KEY,
ItemType VARCHAR(50)
);
CREATE TABLE Category(
CategoryID INT PRIMARY KEY,
Name VARCHAR(50),
ParentCategoryID INT,
FOREIGN KEY(ParentCategoryID) REFERENCES Category(CategoryID)
);
CREATE TABLE Item(
ItemID INT PRIMARY KEY,
CategoryID INT NOT NULL,
ItemTypeID INT NOT NULL,
FOREIGN KEY(CategoryID) REFERENCES Category(CategoryID),
FOREIGN KEY(ItemTypeID) REFERENCES ItemType(ItemTypeID)
);
CREATE TABLE ValueRegistration(
RegID INT PRIMARY KEY,
Date DATE NOT NULL,
Value INT NOT NULL,
ItemID INT NOT NULL,
FOREIGN KEY(ItemID) REFERENCES Item(ItemID)
);
INSERT INTO ItemType VALUES(1, 'ItemType1');
INSERT INTO ItemType VALUES(2, 'ItemType2');
INSERT INTO Category VALUES(1, 'Category1', NULL); -- Top parent (1)
INSERT INTO Category VALUES(2, 'Category2', 1); -- A child of 1
INSERT INTO Category VALUES(3, 'Category3', 1); -- A child of 1
INSERT INTO Category VALUES(4, 'Category4', 2); -- A child of 2
INSERT INTO Category VALUES(5, 'Category5', 2); -- A child of 2
INSERT INTO Category VALUES(6, 'Category6', NULL); -- Another top parent
INSERT INTO Item VALUES(1, 4, 1); -- Category 4, ItemType 1
INSERT INTO Item VALUES(2, 5, 1); -- Category 5, ItemType 1
INSERT INTO Item VALUES(3, 3, 1); -- Category 3, ItemType 1
INSERT INTO Item VALUES(4, 1, 2); -- Category 1, ItemType 2
INSERT INTO ValueRegistration VALUES(1, '2013-03-08', 100, 1);
INSERT INTO ValueRegistration VALUES(2, '2013-03-08', 200, 2);
INSERT INTO ValueRegistration VALUES(3, '2013-03-08', 300, 3);
INSERT INTO ValueRegistration VALUES(4, '2013-03-08', 400, 4);
INSERT INTO ValueRegistration VALUES(5, '2013-03-09', 120, 1);
INSERT INTO ValueRegistration VALUES(6, '2013-03-09', 220, 2);
INSERT INTO ValueRegistration VALUES(7, '2013-03-09', 320, 3);
INSERT INTO ValueRegistration VALUES(8, '2013-03-09', 420, 4);
-- -------------------- RESULTSET I WANT ----------------------
-- vr.Date | ItemType | CategoryTypeID | Value
-- ------------------------------------------------------------
-- 2013-03-08 | 'ItemType1' | 'Category4' | 100 Directly available
-- 2013-03-08 | 'ItemType1' | 'Category5' | 200 Directly available
-- 2013-03-08 | 'ItemType1' | 'Category3' | 300 Directly available
-- 2013-03-08 | 'ItemType1' | 'Category2' | 100+200 Calculated tree node
-- 2013-03-08 | 'ItemType1' | 'Category1' | 100+200+300 Calculated tree node
-- 2013-03-08 | 'ItemType2' | 'Category1' | 400 Directly available
-- 2013-03-09 | 'ItemType1' | 'Category4' | 120 Directly available
-- 2013-03-09 | 'ItemType1' | 'Category5' | 220 Directly available
-- 2013-03-09 | 'ItemType1' | 'Category3' | 320 Directly available
-- 2013-03-09 | 'ItemType1' | 'Category2' | 120+220 Calculated tree node
-- 2013-03-09 | 'ItemType1' | 'Category1' | 120+220+320 Calculated tree node
-- 2013-03-09 | 'ItemType2' | 'Category1' | 420 Directly available
If you replace all joins to the table Category with joins to this dynamic relation, you will get the hierarchy you are looking for:
with Category as (
select * from ( values
(1,'Fred',null),
(2,'Joan',1),
(3,'Greg',2),
(4,'Jack',2),
(5,'Jill',4),
(6,'Bill',3),
(7,'Sam',6)
) Category(CategoryID,Name,ParentCategoryID)
)
, Hierarchy as (
select 0 as [Level],* from Category
-- where Parent is null
union all
select super.[Level]+1, sub.CategoryID, super.Name, super.ParentCategoryID
from Category as sub
join Hierarchy as super on super.CategoryID = sub.ParentCategoryID and sub.ParentCategoryID is not null
)
select * from Hierarchy
-- where CategoryID = 6
-- order by [Level], CategoryID
For example, uncommenting the two lines at the bottom will yield this result set:
Level CategoryID Name ParentCategoryID
----------- ----------- ---- ----------------
0 6 Bill 3
1 6 Greg 2
2 6 Joan 1
3 6 Fred NULL
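To see how such a hierarchy can feed the roll-up asked for in the question, here is a sketch in Python with the built-in sqlite3 module, using the sample data from the SQL Fiddle above. It is an illustration under assumptions, not the T-SQL itself: the CTE name Closure and its columns AncestorID and DescendantID are made up, and SQLite needs the RECURSIVE keyword that T-SQL omits. The grouping happens outside the recursive CTE, so the GROUP BY restriction mentioned in the question does not get in the way.

```python
import sqlite3

SETUP = """
CREATE TABLE ItemType(ItemTypeID INT PRIMARY KEY, ItemType VARCHAR(50));
CREATE TABLE Category(CategoryID INT PRIMARY KEY, Name VARCHAR(50),
                      ParentCategoryID INT);
CREATE TABLE Item(ItemID INT PRIMARY KEY, CategoryID INT NOT NULL,
                  ItemTypeID INT NOT NULL);
CREATE TABLE ValueRegistration(RegID INT PRIMARY KEY, Date DATE NOT NULL,
                               Value INT NOT NULL, ItemID INT NOT NULL);
INSERT INTO ItemType VALUES (1, 'ItemType1'), (2, 'ItemType2');
INSERT INTO Category VALUES (1, 'Category1', NULL), (2, 'Category2', 1),
                            (3, 'Category3', 1), (4, 'Category4', 2),
                            (5, 'Category5', 2), (6, 'Category6', NULL);
INSERT INTO Item VALUES (1, 4, 1), (2, 5, 1), (3, 3, 1), (4, 1, 2);
INSERT INTO ValueRegistration VALUES
    (1, '2013-03-08', 100, 1), (2, '2013-03-08', 200, 2),
    (3, '2013-03-08', 300, 3), (4, '2013-03-08', 400, 4),
    (5, '2013-03-09', 120, 1), (6, '2013-03-09', 220, 2),
    (7, '2013-03-09', 320, 3), (8, '2013-03-09', 420, 4);
"""

ROLLUP = """
WITH RECURSIVE Closure(AncestorID, DescendantID) AS (
    -- Every category counts as its own ancestor (the directly available rows).
    SELECT CategoryID, CategoryID FROM Category
    UNION ALL
    -- Walk one level up per step until a category without a parent is reached.
    SELECT cat.ParentCategoryID, cl.DescendantID
    FROM Closure AS cl
    JOIN Category AS cat ON cat.CategoryID = cl.AncestorID
    WHERE cat.ParentCategoryID IS NOT NULL
)
SELECT vr.Date, it.ItemType, anc.Name, SUM(vr.Value) AS Value
FROM ValueRegistration AS vr
JOIN Item AS i ON i.ItemID = vr.ItemID
JOIN ItemType AS it ON it.ItemTypeID = i.ItemTypeID
JOIN Closure AS cl ON cl.DescendantID = i.CategoryID
JOIN Category AS anc ON anc.CategoryID = cl.AncestorID
GROUP BY vr.Date, it.ItemTypeID, it.ItemType, anc.CategoryID, anc.Name
ORDER BY vr.Date, it.ItemTypeID, anc.CategoryID
"""

def rollup():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn.execute(ROLLUP).fetchall()
```

rollup() returns twelve rows, six per date. For 2013-03-08 and ItemType1, Category2 sums to 300 (100+200) and Category1 to 600 (100+200+300), matching the result set wanted in the question.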
I have two tables in a database. The first table tblTracker contains many columns, but the column of particular interest is called siteAdmin and each row in that column can contain multiple loginIDs of 5 digits like 21457, 21456 or just one like 21444. The next table users contains columns like LoginID, fname, and lname.
What I would like to be able to do is take the loginIDs contained in tblTracker.siteAdmin and return fname + lname from users. I can successfully do this when there is only one loginID in the row such as 21444 but I cannot figure out how to do this when there is more than one like 21457, 21456.
Here is the SQL statement I use for when there is one loginID in that column
SELECT b.FName + ' ' + b.LName AS siteAdminName
FROM tblTracker a
LEFT OUTER JOIN users b ON a.siteAdmin= b.Login_Id
However this doesn't work when it tries to join a siteAdmin with more than one LoginID in it
Thanks!
I prefer the number table approach to split a string in TSQL
For this method to work, you need to do this one time table setup:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this split function:
CREATE FUNCTION [dbo].[FN_ListToTable]
(
@SplitOn char(1) --REQUIRED, the character to split the @List string on
,@List varchar(8000)--REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
----------------
--SINGLE QUERY-- --this will not return empty rows
----------------
SELECT
ListValue
FROM (SELECT
LTRIM(RTRIM(SUBSTRING(List2, number+1, CHARINDEX(@SplitOn, List2, number+1)-number - 1))) AS ListValue
FROM (
SELECT @SplitOn + @List + @SplitOn AS List2
) AS dt
INNER JOIN Numbers n ON n.Number < LEN(dt.List2)
WHERE SUBSTRING(List2, number, 1) = @SplitOn
) dt2
WHERE ListValue IS NOT NULL AND ListValue!=''
);
GO
You can now easily split a CSV string into a table and join on it:
select * from dbo.FN_ListToTable(',','1,2,3,,,4,5,6777,,,')
OUTPUT:
ListValue
-----------------------
1
2
3
4
5
6777
(6 row(s) affected)
You can now use CROSS APPLY to split every row in your table, like this:
DECLARE @users table (LoginID int, fname varchar(5), lname varchar(5))
INSERT INTO @users VALUES (1, 'Sam', 'Jones')
INSERT INTO @users VALUES (2, 'Don', 'Smith')
INSERT INTO @users VALUES (3, 'Joe', 'Doe')
INSERT INTO @users VALUES (4, 'Tim', 'White')
INSERT INTO @users VALUES (5, 'Matt', 'Davis')
INSERT INTO @users VALUES (15,'Sue', 'Me')
DECLARE @tblTracker table (RowID int, siteAdmin varchar(50))
INSERT INTO @tblTracker VALUES (1,'1,2,3')
INSERT INTO @tblTracker VALUES (2,'2,3,4')
INSERT INTO @tblTracker VALUES (3,'1,5')
INSERT INTO @tblTracker VALUES (4,'1')
INSERT INTO @tblTracker VALUES (5,'5')
INSERT INTO @tblTracker VALUES (6,'')
INSERT INTO @tblTracker VALUES (7,'8,9,10')
INSERT INTO @tblTracker VALUES (8,'1,15,3,4,5')
SELECT
t.RowID, u.LoginID, u.fname+' '+u.lname AS YourAdmin
FROM @tblTracker t
CROSS APPLY dbo.FN_ListToTable(',',t.siteAdmin) st
LEFT OUTER JOIN @users u ON st.ListValue=u.LoginID --to get all rows even if missing siteAdmin
--INNER JOIN @users u ON st.ListValue=u.LoginID --to remove rows without any siteAdmin
ORDER BY t.RowID,u.fname,u.lname
OUTPUT:
RowID LoginID YourAdmin
----------- ----------- -----------
1 2 Don Smith
1 3 Joe Doe
1 1 Sam Jones
2 2 Don Smith
2 3 Joe Doe
2 4 Tim White
3 5 Matt Davis
3 1 Sam Jones
4 1 Sam Jones
5 5 Matt Davis
7 NULL NULL
7 NULL NULL
7 NULL NULL
8 3 Joe Doe
8 5 Matt Davis
8 1 Sam Jones
8 15 Sue Me
8 4 Tim White
(18 row(s) affected)
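If you want to play with the Numbers table idea without a SQL Server instance, here is a rough port to SQLite, driven from Python's built-in sqlite3 module. It is a sketch under assumptions, not the function above: SQLite has no CHARINDEX with a start position, so instr() on the remaining substring stands in for it, the split is inlined in a CTE named parts (a made-up name), and the sample rows are cut down.

```python
import sqlite3

SPLIT_JOIN = """
WITH parts AS (
    -- Wrap the list in delimiters, then cut at every position holding a comma.
    SELECT t.RowID, n.Number AS Pos,
           trim(substr(t.List2, n.Number + 1,
                       instr(substr(t.List2, n.Number + 1), ',') - 1)) AS ListValue
    FROM (SELECT RowID, ',' || siteAdmin || ',' AS List2 FROM tblTracker) AS t
    JOIN Numbers AS n ON n.Number < length(t.List2)
    WHERE substr(t.List2, n.Number, 1) = ','
)
SELECT p.RowID, u.LoginID, u.fname || ' ' || u.lname AS YourAdmin
FROM parts AS p
LEFT JOIN users AS u ON u.LoginID = CAST(p.ListValue AS INTEGER)
WHERE p.ListValue <> ''          -- skip empty list entries
ORDER BY p.RowID, p.Pos          -- keep the order of the original list
"""

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Numbers (Number INTEGER PRIMARY KEY);
        CREATE TABLE users (LoginID INTEGER, fname TEXT, lname TEXT);
        CREATE TABLE tblTracker (RowID INTEGER, siteAdmin TEXT);
        INSERT INTO users VALUES (1, 'Sam', 'Jones'), (2, 'Don', 'Smith'),
                                 (3, 'Joe', 'Doe');
        INSERT INTO tblTracker VALUES (1, '1,2,3'), (2, '3'), (3, ''), (4, '8, 2');
    """)
    # 100 numbers are plenty for these short lists.
    conn.executemany("INSERT INTO Numbers VALUES (?)",
                     [(n,) for n in range(1, 101)])
    return conn.execute(SPLIT_JOIN).fetchall()
```

In demo(), row 3 (the empty list) produces no output, and the unknown LoginID 8 in row 4 comes back with NULLs because of the LEFT JOIN, just like RowID 7 in the output above.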