I have values in a table that should be treated as duplicates when the name is the same; for those rows, the corresponding ids should be combined into a CSV string column, as below:
Original table:
create table #original(id int, unique_id varchar(500), name varchar(200))
insert into #original
values
( 1, '12345', 'A'), ( 2, '12345', 'A'), ( 3, null, 'B'), ( 4, '45678', 'B'),
( 5, '900', 'C'), ( 6, '901', 'C'), ( 7, null, 'D'), ( 8, null, 'D'),
( 9, null, 'E'), (10, '1000', 'E'), (11, null, 'E'), (12, '1100', 'F'),
(13, '1101', 'F'), (14, '1102', 'F')
, (15, '9999', 'G'), (16, '9998', 'G'), (17, '', 'G')
, (18, '1111', 'H')
, (19, '1010', 'I'), (20, '1010', 'I'), (21, '', 'I')
The rows for names A and B are duplicates, but the rows for C aren't, because C's unique ids differ.
A record is a duplicate if the name is the same AND the unique id matches or is null. When the name is the same but the unique ids differ, they aren't the same person.
I'm selecting the data as below:
;with cte as
(select name
from #original
group by name
having count(*) > 1)
I need to get the data as below:
Id unique_id Name
1,2 12345 A
3,4 45678 B
7,8 null D
9,10,11 1000 E
19,20,21 1010 I
C and F should be excluded because their unique_ids differ even though the names are the same. H should be excluded because it isn't a duplicate. G should be excluded because its unique ids differ. I should be selected: when a unique id is present for duplicates by name, it must be the same for all of them for the group to be selected.
Thanks
I think you are looking for something like:
SELECT
    STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) AS id,
    (SELECT TOP 1 unique_id FROM original o3
     WHERE o3.name = o1.name AND o3.unique_id IS NOT NULL) AS unique_id,
    name
FROM original o1
WHERE NOT EXISTS
    (SELECT 1 FROM original o2
     WHERE o2.name = o1.name AND o2.unique_id <> o1.unique_id)
GROUP BY name
ORDER BY name
The NOT EXISTS condition eliminates names C and F (you could use an IN clause if you prefer, but I don't think it's any prettier in this case).
The GROUP BY name combined with the aggregate STRING_AGG gets the comma separated list of ids for the name.
This uses a TOP 1 subquery to get a non-null unique_id. You could use MAX(unique_id) instead, which certainly looks cleaner, but you will get a warning ("Null value is eliminated by an aggregate or other SET operation"). If you're comfortable ignoring the warning and don't think it will be confusing, I would use the MAX version.
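For illustration, the MAX variant might look like this (a sketch against the same table, not the exact Fiddle code):
SELECT
    STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) AS id,
    MAX(unique_id) AS unique_id, -- raises the NULL-elimination warning when NULLs are present
    name
FROM original o1
WHERE NOT EXISTS
    (SELECT 1 FROM original o2
     WHERE o2.name = o1.name AND o2.unique_id <> o1.unique_id)
GROUP BY name
ORDER BY name;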
You can see both versions working in this Fiddle.
Edited to add: To address the new requirement in the comments, please see this Fiddle.
There will be multiple ways of doing this, but the condition...
(SELECT COUNT(DISTINCT unique_id) FROM original o2 WHERE o2.name = o1.name and COALESCE(unique_id, '') <> '') <= 1
... will count the distinct non-null, non-empty unique_ids per name, ensuring the count is always 0 (to allow cases like D, where every unique_id is null) or 1 (to allow the other cases, like E and I).
Note that this also adds an ORDER BY to the TOP 1 subquery version in order to prefer unique_ids with values over both nulls and empty strings.
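Putting those pieces together, the revised query might look something like this (a sketch, not the exact Fiddle code; the HAVING clause restores the duplicates-only filter from the question's CTE, and the CASE in the ORDER BY prefers non-empty unique_ids over nulls and empty strings):
SELECT
    STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) AS id,
    (SELECT TOP 1 unique_id FROM original o3
     WHERE o3.name = o1.name
     ORDER BY CASE WHEN COALESCE(o3.unique_id, '') = '' THEN 1 ELSE 0 END) AS unique_id,
    name
FROM original o1
WHERE (SELECT COUNT(DISTINCT unique_id) FROM original o2
       WHERE o2.name = o1.name AND COALESCE(o2.unique_id, '') <> '') <= 1
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY name;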
This is my emails table:
create table emails (
id bigint not null primary key generated by default as identity,
name text not null
);
And contacts table:
create table contacts (
id bigint not null primary key generated by default as identity,
email_id bigint not null,
user_id bigint not null,
full_name text not null,
ordering int not null
);
As you can see, I have a user_id field here. There can be multiple rows with the same user id in my result, so I want to join them with a comma (,).
Insert some data into the tables:
insert into emails (name)
values
('dennis1'),
('dennis2');
insert into contacts (id, email_id, user_id, full_name, ordering)
values
(5, 1, 1, 'dennis1', 9),
(6, 2, 1, 'dennis1', 5),
(7, 2, 1, 'dennis1', 1),
(8, 1, 3, 'john', 2),
(9, 2, 4, 'dennis7', 1),
(10, 2, 4, 'dennis7', 1);
My query is:
select em.name,
c.user_ids
from emails em
join (
select email_id, string_agg(user_id::text, ',' order by ordering desc) as user_ids
from contacts
group by email_id
) c on c.email_id = em.id
order by em.name;
Actual Result
name user_ids
dennis1 1,3
dennis2 1,1,4,4
Expected Result
name user_ids
dennis1 1,3
dennis2 1,4
On my real-world data, the same user id appears something like 50 times; it should appear only once. In the example above, you can see that users 1 and 4 each appear twice for dennis2.
How can I deduplicate them?
Demo: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=2e957b52eb46742f3ddea27ec36effb1
P.S.: I tried adding user_id to the GROUP BY, but then I get duplicate rows...
demo:db<>fiddle
SELECT
name,
string_agg(user_id::text, ',' order by ordering desc)
FROM (
SELECT DISTINCT ON (em.id, c.user_id)
*
FROM emails em
JOIN contacts c ON c.email_id = em.id
) s
GROUP BY name
Join the tables.
DISTINCT ON the email id and the user_id, so that for every email record there are no duplicate users.
Aggregate.
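If the order given by the ordering column doesn't matter to you, a shorter alternative is string_agg(DISTINCT ...) — a sketch; note that with DISTINCT you may only ORDER BY the aggregated expression itself, so the order by ordering desc is lost:
select em.name,
       string_agg(distinct c.user_id::text, ',') as user_ids -- order within the list is arbitrary here
from emails em
join contacts c on c.email_id = em.id
group by em.name
order by em.name;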
I have the following problem:
Given the two tables contacts and organisations :
WITH contacts(oe_id, name, email, person_id) AS (VALUES
(1, 'Mark', 'm.smith#test.nl', 19650728),
(2, 'Tom', 't.b.smith#test.nl', 20010627),
(1, 'Frank', 'f.j.smith#test.nl', 20040709),
(3, 'Petra', 'p.ringenaldus#test.nl', 19700317),
(3, 'Paul', 'p.m.sprengers#test.nl', 19681006)),
organisations(oe_id, name) AS (VALUES
(1, 'Cardiology'),
(2, 'Neurology'),
(3, 'Dermatology'),
(4, 'Churgery'))
I want to get a table with 3 columns: the organisation name, the organisation id, and an array of contact persons for that organisation.
Every array element should itself be an array with the data of one contact person.
First I created a table in which all contact columns are aggregated into an array, one array per row:
WITH contacts(oe_id, name, email, person_id) AS (VALUES
(1, 'Mark', 'm.smith#test.nl', 19650728),
(2, 'Tom', 't.b.smith#test.nl', 20010627),
(1, 'Frank', 'f.j.smith#test.nl', 20040709),
(3, 'Petra', 'p.ringenaldus#test.nl', 19700317),
(3, 'Paul', 'p.m.sprengers#test.nl', 19681006)),
organisations(oe_id, name) AS (VALUES
(1, 'Cardiology'),
(2, 'Neurology'),
(3, 'Dermatology'),
(4, 'Churgery')),
contacts_aggregated(oe_id, cdata) AS (
select oe_id, ARRAY[name, email, person_id::text] from contacts)
select * from contacts_aggregated;
This results in:
oe_id | cdata
-------+---------------------------------------
1 | {Mark,m.smith#test.nl,19650728}
2 | {Tom,t.b.smith#test.nl,20010627}
1 | {Frank,f.j.smith#test.nl,20040709}
3 | {Petra,p.ringenaldus#test.nl,19700317}
3 | {Paul,p.m.sprengers#test.nl,19681006}
(5 rows)
Next step is to aggregate cdata (contact data) for each organisation id:
WITH contacts(oe_id, name, email, person_id) AS (VALUES
(1, 'Mark', 'm.smith#test.nl', 19650728),
(2, 'Tom', 't.b.smith#test.nl', 20010627),
(1, 'Frank', 'f.j.smith#test.nl', 20040709),
(3, 'Petra', 'p.ringenaldus#test.nl', 19700317),
(3, 'Paul', 'p.m.sprengers#test.nl', 19681006)),
organisations(oe_id, name) AS (VALUES
(1, 'Cardiology'),
(2, 'Neurology'),
(3, 'Dermatology'),
(4, 'Churgery')),
contacts_aggregated(oe_id, cdata) AS (
select oe_id, ARRAY[name, email, person_id::text] from contacts),
contacts_for_organisations(oe_id, contacts) AS (
SELECT organisations.oe_id, array_agg(contacts_aggregated.cdata::text)
FROM organisations
JOIN contacts_aggregated USING(oe_id)
GROUP BY oe_id)
SELECT * FROM contacts_for_organisations;
This results in the following:
oe_id | contacts
-------+------------------------------------------------------------------------------------
1 | {"{Mark,m.smith#test.nl,19650728}","{Frank,f.j.smith#test.nl,20040709}"}
2 | {"{Tom,t.b.smith#test.nl,20010627}"}
3 | {"{Petra,p.ringenaldus#test.nl,19700317}","{Paul,p.m.sprengers#test.nl,19681006}"}
(3 rows)
As you can see, the result is an array, but its elements should also be arrays. Instead, each element is the inner array imploded into a string.
What I want is something like this:
oe_id | contacts
-------+------------------------------------------------------------------------------------
1 | {{Mark,m.smith#test.nl,19650728},{Frank,f.j.smith#test.nl,20040709}}
2 | {{Tom,t.b.smith#test.nl,20010627}}
3 | {{Petra,p.ringenaldus#test.nl,19700317},{Paul,p.m.sprengers#test.nl,19681006}}
(3 rows)
If I remove the cast to text (array_agg(contacts_aggregated.cdata) instead of array_agg(contacts_aggregated.cdata::text)) I get:
could not find array type for data type text[]
What am I forgetting/doing wrong?
Postgres: psql (9.2.24) and psql (9.6.10, server 9.2.24)
If I run the code using a Postgres 9.6 client on a Postgres 9.6 server, everything works fine.
I just moved to a higher Postgres version, and everything works now.
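For reference: if I remember correctly, array_agg over array inputs (producing a two-dimensional array) was only added in PostgreSQL 9.5, which is why 9.2 raised the error. On 9.5 and later the last CTE can simply drop the cast (all inner arrays must have the same length), e.g.:
contacts_for_organisations(oe_id, contacts) AS (
    SELECT organisations.oe_id, array_agg(contacts_aggregated.cdata) -- no ::text cast needed on 9.5+
    FROM organisations
    JOIN contacts_aggregated USING (oe_id)
    GROUP BY oe_id)
SELECT * FROM contacts_for_organisations;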
I have following tables:
CREATE TABLE person (
id INTEGER NOT NULL,
name TEXT,
CONSTRAINT person_pkey PRIMARY KEY(id)
);
INSERT INTO person ("id", "name")
VALUES
(1, E'Person1'),
(2, E'Person2'),
(3, E'Person3'),
(4, E'Person4'),
(5, E'Person5'),
(6, E'Person6');
CREATE TABLE person_book (
id INTEGER NOT NULL,
person_id INTEGER,
book_id INTEGER,
receive_date DATE,
expire_date DATE,
CONSTRAINT person_book_pkey PRIMARY KEY(id)
);
/* Data for the 'person_book' table (Records 1 - 9) */
INSERT INTO person_book ("id", "person_id", "book_id", "receive_date", "expire_date")
VALUES
(1, 1, 1, E'2016-01-18', NULL),
(2, 1, 2, E'2016-02-18', E'2016-10-18'),
(3, 1, 4, E'2016-03-18', E'2016-12-18'),
(4, 2, 3, E'2017-02-18', NULL),
(5, 3, 5, E'2015-02-18', E'2016-02-23'),
(6, 4, 34, E'2016-12-18', E'2018-02-18'),
(7, 5, 56, E'2016-12-28', NULL),
(8, 5, 34, E'2018-01-19', E'2018-10-09'),
(9, 5, 57, E'2018-06-09', E'2018-10-09');
CREATE TABLE book (
id INTEGER NOT NULL,
type TEXT,
CONSTRAINT book_pkey PRIMARY KEY(id)
) ;
/* Data for the 'book' table (Records 1 - 8) */
INSERT INTO book ("id", "type")
VALUES
( 1, E'Btype1'),
( 2, E'Btype2'),
( 3, E'Btype3'),
( 4, E'Btype4'),
( 5, E'Btype5'),
(34, E'Btype34'),
(56, E'Btype56'),
(67, E'Btype67');
My query should list the names of all persons. For persons whose most recently received book is of one of the types (book_id IN (2, 4, 34, 56, 67)), it should also display that book type and its expire date; for persons who haven't received such a book type, the book type and expire date should be blank.
My query looks like this:
SELECT p.name,
       pb.expire_date,
       b.type
FROM (
    SELECT p.id AS person_id, MAX(pb.receive_date) AS recent_date
    FROM person p
    JOIN person_book pb ON pb.person_id = p.id
    WHERE pb.book_id IN (2, 4, 34, 56, 67)
    GROUP BY p.id
) tmp
JOIN person_book pb ON pb.person_id = tmp.person_id
                   AND pb.receive_date = tmp.recent_date
                   AND pb.book_id IN (2, 4, 34, 56, 67)
JOIN book b ON b.id = pb.book_id
RIGHT JOIN person p ON p.id = pb.person_id
The (correct) result:
name | expire_date | type
---------+-------------+---------
Person1 | 2016-12-18 | Btype4
Person2 | |
Person3 | |
Person4 | 2018-02-18 | Btype34
Person5 | 2018-10-09 | Btype34
Person6 | |
The query works fine, but since I'm right joining a small table to a huge one, it's slow. Is there an efficient way to rewrite this query?
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well, since that's our production version.
Problems with your setup
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well, since that's our production version.
That makes two major problems before even looking at the query:
Postgres 8.4 is just too old, especially for "production". It reached EOL in July 2014: no more security upgrades, hopelessly outdated. Urgently consider upgrading to a current version.
It's a loaded footgun to use very different versions for development and production, inviting confusion and errors that go undetected. We have seen more than one desperate request here on SO stemming from this folly.
Better query
This equivalent should be substantially simpler and faster (works in pg 8.4, too):
SELECT p.name, pb.expire_date, b.type
FROM (
SELECT DISTINCT ON (person_id)
person_id, book_id, expire_date
FROM person_book
WHERE book_id IN (2, 4, 34, 56, 67)
ORDER BY person_id, receive_date DESC NULLS LAST
) pb
JOIN book b ON b.id = pb.book_id
RIGHT JOIN person p ON p.id = pb.person_id;
To optimize read performance, this partial multicolumn index with matching sort order would be perfect:
CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST)
WHERE book_id IN (2, 4, 34, 56, 67);
In modern Postgres versions (9.2 or later) you might append book_id, expire_date to the index columns to get index-only scans. See:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
About DISTINCT ON:
Select first row in each GROUP BY group?
About DESC NULLS LAST:
PostgreSQL sort by datetime asc, null first?
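Back to the index: a sketch of the covering variant mentioned above (the two appended columns don't change the sort order; they only enable index-only scans):
CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST, book_id, expire_date)
WHERE book_id IN (2, 4, 34, 56, 67);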
I have a table like this:
id product amount
1 A 6
1 A 8
1 A
1 B 1
1 B
2 C 2
2 C
2 C 4
2 C
2 C
and I need to make it like this:
id product amount
1 A 6
1 A 8
1 A 8
1 B 1
1 B 1
2 C 2
2 C 2
2 C 4
2 C 4
2 C 4
That is, fill each missing amount with the previous non-missing value.
I tried the lag() window function; however, window functions like lag() are not allowed in UPDATE:
update tableA set amount = lag(amount);
What can I do in PostgreSQL?
You can SELECT what you want to UPDATE, but there is no (easy) way to actually do the UPDATE, because the table fox does not have a primary key (yet).
CREATE TABLE fox (
id integer NOT NULL,
product text NOT NULL,
amount integer
);
To populate the fox with some data.
INSERT INTO fox VALUES
(1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL),
(3, 'What does the fox say?', 5);
The query.
WITH ranks (rank, id, product, amount) AS (
    -- ROW_NUMBER() OVER () relies on the physical row order to define "previous"
    SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
SELECT r.id, r.product,
       (SELECT amount FROM ranks
        WHERE id = r.id AND product = r.product
          AND rank < r.rank AND amount IS NOT NULL
        ORDER BY rank DESC LIMIT 1 -- nearest previous non-NULL amount
       )
FROM ranks r WHERE r.amount IS NULL ORDER BY 1, 2, 3;
Yields the rows which previously had a NULL and now have the appropriate amount.
id | product | amount
----+---------+--------
1 | A | 8
1 | B | 1
2 | C | 2
2 | C | 4
2 | C | 4
But you cannot use this data to update, because the rows are still not uniquely identified by (id, product): there is no WHERE condition that identifies your rows uniquely. How would the WHERE clause know whether to change the amount to 2 or 4 in the UPDATE? The multiple rows with (id, product) = (2, 'C') are indistinguishable in the WHERE of the UPDATE.
Let's give the fox a primary key.
ALTER TABLE fox ADD COLUMN IF NOT EXISTS pkey serial ;
ALTER TABLE fox ADD PRIMARY KEY (pkey) ;
Now we can identify the rows by the PRIMARY KEY pkey.
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
)
SELECT pkey,
id, product, -- you can leave these out in your UPDATE: pkey is UNIQUE
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n ORDER BY 1, 2, 3, 4;
This displays the changes to be made:
pkey | id | product | amount
------+----+---------+--------
3 | 1 | A | 8
5 | 1 | B | 1
7 | 2 | C | 2
9 | 2 | C | 4
10 | 2 | C | 4
And we can use pkey in the UPDATE.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
), changes AS (
SELECT pkey,
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n
) UPDATE fox f SET amount = c.amount FROM changes c WHERE f.pkey = c.pkey ;
Check the result is okay:
SELECT * FROM fox ORDER BY 1, 2, 3, 4;
And accept using COMMIT or ROLLBACK accordingly.
Alternative to adding a PRIMARY KEY
Every table should always have a primary key.
If you insist on not having one, you could instead compute the rows with their then-not-NULL amounts and, rather than UPDATEing them, INSERT them into the table and then remove the old rows with DELETE FROM fox WHERE amount IS NULL. This way you get around adding a (unique) primary key. Of course the INSERT and DELETE must be packaged into a TRANSACTION so as not to interfere with other transactions running concurrently. For example, another transaction could add rows with NULL amount after you have computed the data to be INSERTed and before you DELETE all NULL amounts; you'd miss those concurrently added rows (data loss due to concurrency; think ACID).
But a missing primary key will probably bite you later on, anyway.
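For completeness, a minimal sketch of that INSERT-then-DELETE variant, reusing the ranks CTE from above (assuming no primary key was added):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;

WITH ranks (rank, id, product, amount) AS (
    SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
), filled AS (
    -- the rows that had NULL, with their nearest previous non-NULL amount
    SELECT r.id, r.product,
           (SELECT amount FROM ranks
            WHERE id = r.id AND product = r.product
              AND rank < r.rank AND amount IS NOT NULL
            ORDER BY rank DESC LIMIT 1) AS amount
    FROM ranks r WHERE r.amount IS NULL
)
INSERT INTO fox (id, product, amount)
SELECT id, product, amount FROM filled ;

DELETE FROM fox WHERE amount IS NULL ;

COMMIT ;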
Without knowing what defines a "previous row", this is all a guess. But you can use an anonymous block to do what you want; just adapt it to your needs:
CREATE TEMPORARY TABLE test_lag AS
SELECT column1 AS id, column2 AS product, column3 AS amount FROM (
VALUES (1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL)) AS tmp;
DO $$
BEGIN
    -- Loop until all NULL amounts are updated.
    -- Why do we need this? Because PostgreSQL doesn't support the IGNORE NULLS
    -- clause on lag(), one pass only fills the first NULL of each gap.
    LOOP
        WITH tmp AS (
            SELECT ctid,
                   -- You MUST change this ORDER BY to the right columns (what is the previous row?)
                   lag(amount) OVER (ORDER BY id, product) AS last_amount
            FROM test_lag
        )
        UPDATE test_lag
        SET amount = tmp.last_amount
        FROM tmp
        WHERE test_lag.ctid = tmp.ctid
          AND test_lag.amount IS NULL
          AND tmp.last_amount IS NOT NULL; -- avoids an endless loop when no previous value exists
        IF NOT FOUND THEN
            EXIT;
        END IF;
    END LOOP;
END $$;
SELECT * FROM test_lag ORDER BY id, product, amount;
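For completeness, a loop-free alternative using the usual gaps-and-islands trick (a sketch; like the block above, it assumes the physical row order via ctid defines the "previous" row, and rows before the first non-NULL value of a group stay NULL):
SELECT id, product,
       -- each group (grp) holds one leading non-NULL amount plus the NULLs after it,
       -- so max() over the group returns that leading value
       max(amount) OVER (PARTITION BY id, product, grp) AS amount
FROM (
    SELECT id, product, amount,
           -- count(amount) ignores NULLs: each non-NULL amount starts a new group
           count(amount) OVER (PARTITION BY id, product ORDER BY ctid) AS grp
    FROM test_lag
) s
ORDER BY id, product, amount;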