Postgres remove duplicates (multiple columns) in order to add unique constraint

Postgres remove duplicates (multiple columns) in order to add unique constraint - postgresql

I have a table:
CREATE TABLE public.assignment (
id integer NOT NULL,
dining_table_id integer NOT NULL,
guest_group_id integer NOT NULL,
start_timestamp timestamp without time zone DEFAULT '1999-01-01 00:00:00'::timestamp without time zone NOT NULL,
end_timestamp timestamp without time zone DEFAULT '1999-01-02 00:00:00'::timestamp without time zone NOT NULL,
assignment_related_id text
);
When I add an unique constraint:
ALTER TABLE assignment ADD CONSTRAINT unique_assignment UNIQUE (dining_table_id, guest_group_id, start_timestamp, end_timestamp);
I get:
ERROR: could not create unique index "unique_assignment"
DETAIL: Key (dining_table_id, guest_group_id, start_timestamp, end_timestamp)=(1433, 101476, 2019-07-16 18:30:00, 2019-07-16 20:30:00) is duplicated.
So how can I delete all duplicates, which have the same values in the concerning columns.

DELETE FROM assignment
WHERE id IN (SELECT id
FROM (SELECT id,
ROW_NUMBER() OVER (partition BY dining_table_id, guest_group_id, start_timestamp, end_timestamp ORDER BY id) AS rnum
FROM assignment) t
WHERE t.rnum > 1);

Related

column "birthdate" is of type timestamp without time zone but expression is of type text

This query throws an error: column "birthdate" is of type timestamp without time zone but expression is of type text
INSERT INTO profile.profile(name,gender,birthDate,userId)
SELECT
userId,
substr(md5(random()::text), 0, 5) AS name,
substr(md5(random()::text), 0, 2) AS gender,
to_timestamp('2021-08-09 13:57:40', 'YYYY-MM-DD hh24:mi:ss') AS birthDate
FROM
generate_series(1,10) AS y(userId)
My table:
CREATE TABLE profile.profile
(
id SERIAL NOT NULL,
name character varying NOT NULL,
gender character varying NOT NULL,
birthDate TIMESTAMP NOT NULL,
image character varying NOT NULL DEFAULT
'https://e7.pngegg.com/pngimages/274/947/png-clipart-computer-icons-user-business-believer-business-service-people.png',
userId integer NOT NULL,
CONSTRAINT UQ_profile_user UNIQUE (userId),
CONSTRAINT PK_profile PRIMARY KEY (id)
)
What am I doing wrong? Thanks in advance.

just change the order of the columns in the INSERT so that it corresponds to the order of the values to be inserted :
INSERT INTO profile.profile(userId,name,gender,birthDate)

Using 'on conflict' with a unique constraint on a table partitioned by date

Given the following table:
CREATE TABLE event_partitioned (
customer_id varchar(50) NOT NULL,
user_id varchar(50) NOT NULL,
event_id varchar(50) NOT NULL,
comment varchar(50) NOT NULL,
event_timestamp timestamp with time zone DEFAULT NOW()
)
PARTITION BY RANGE (event_timestamp);
And partitioning by calendar week [one example]:
CREATE TABLE event_partitioned_2020_51 PARTITION OF event_partitioned
FOR VALUES FROM ('2020-12-14') TO ('2020-12-20');
And the unique constraint [event_timestamp necessary since the partition key]:
ALTER TABLE event_partitioned
ADD UNIQUE (customer_id, user_id, event_id, event_timestamp);
I would like to update if customer_id, user_id, event_id exist, otherwise insert:
INSERT INTO event_partitioned (customer_id, user_id, event_id)
VALUES ('9', '99', '999')
ON CONFLICT (customer_id, user_id, event_id, event_timestamp) DO UPDATE
SET comment = 'I got updated';
But I cannot add a unique constraint only for customer_id, user_id, event_id, hence event_timestamp as well.
So this will insert duplicates of customer_id, user_id, event_id. Even so with adding now() as a fourth value, unless now() precisely matches what's already in event_timestamp.
Is there a way that ON CONFLICT could be less 'granular' here and update if now() falls in the week of the partition, rather than precisely on '2020-12-14 09:13:04.543256' for example?
Basically I am trying to avoid duplication of customer_id, user_id, event_id, at least within a week, but still benefit from partitioning by week (so that data retrieval can be narrowed to a date range and not scan the entire partitioned table).

I don't think you can do this with on conflict in a partitioned table. You can, however, express the logic with CTEs:
with
data as ( -- data
select '9' as customer_id, '99' as user_id, '999' as event_id
),
ins as ( -- insert if not exists
insert into event_partitioned (customer_id, user_id, event_id)
select * from data d
where not exists (
select 1
from event_partitioned ep
where
ep.customer_id = d.customer_id
and ep.user_id = d.user_id
and ep.event_id = d.event_id
)
returning *
)
update event_partitioned ep -- update if insert did not happen
set comment = 'I got updated'
from data d
where
ep.customer_id = d.customer_id
and ep.user_id = d.user_id
and ep.event_id = d.event_id
and not exists (select 1 from ins)

#GMB's answer is great and works well. Since enforcing a unique constrain on a partitioned table (parent table) partitioned by time range is usually not that useful, why now just have a unique constraint/index placed on the partition itself?
In your case, event_partitioned_2020_51 can have a unique constraint:
ALTER TABLE event_partitioned_2020_51
ADD UNIQUE (customer_id, user_id, event_id, event_timestamp);
And subsequent query can just use
INSERT ... INTO event_partitioned_2020_51 ON CONFLICT (customer_id, user_id, event_id, event_timestamp)
as long as this its the partition intended, which is usually the case.

Postgresql not choosing rows grouping

I have query. There is a construction like this example: (online demo)
You will see the in result created_at field. I have to use query the created_at field. So I have to use it in select created_at. I don't want to use it created_at field in select. Because, there are millions of records in the deposits table. How can i escape this problem?
(Note: I have many table to query, like "deposits" table. this is just a short example.)
create table payment_methods
(
payment_method_id bigserial not null
constraint payment_methods_pkey
primary key
);
create table currencies_of_payment_methods
(
copm_id bigserial not null
constraint currencies_of_payment_methods_pkey
primary key,
payment_method_id integer not null
);
create table deposits
(
deposit_id bigserial not null
constraint deposits_pkey
primary key,
amount numeric(18,2) not null,
copm_id integer not null,
created_at timestamp(0)
);
INSERT INTO payment_methods (payment_method_id) VALUES (1);
INSERT INTO payment_methods (payment_method_id) VALUES (2);
INSERT INTO currencies_of_payment_methods (copm_id, payment_method_id) VALUES (1, 1);
INSERT INTO deposits (amount, copm_id, created_at) VALUES (100, 1, '2020-09-10 08:49:37');
INSERT INTO deposits (amount, copm_id, created_at) VALUES (200, 1, '2020-09-10 08:49:37');
INSERT INTO deposits (amount, copm_id, created_at) VALUES (40, 1, '2020-09-10 08:49:37');
Query:
SELECT payment_methods.payment_method_id,
deposit_copm_id.deposit_copm_id,
manuel_deposit_amount.manuel_deposit_amount,
manuel_deposit_amount.created_at
FROM payment_methods
CROSS JOIN lateral
(
SELECT currencies_of_payment_methods.copm_id AS deposit_copm_id
FROM currencies_of_payment_methods
WHERE currencies_of_payment_methods.payment_method_id = payment_methods.payment_method_id) deposit_copm_id
CROSS JOIN lateral
(
SELECT sum(deposits.amount) AS manuel_deposit_amount,
array_agg(deposits.created_at) AS created_at
FROM deposits
WHERE deposits.copm_id = deposit_copm_id.deposit_copm_id) manuel_deposit_amount
WHERE payment_methods.payment_method_id = 1

Query too slow for just 4 tables with 50000 rows each

I've been struggling for hours and I can't find why this query takes too long (> 60 minutes). All 4 tables have less than 50.000 records.
Also if I remove any table (gel6, gf6 or ger6) the query takes less than 500 ms to execute. What am I doing wrong?
Explain plan:
https://explain.depesz.com/s/ldm2
SELECT COUNT(*)
FROM agroapp.ganado g
INNER JOIN (SELECT gel5.ganado_id, gel5.estado_leche
FROM agroapp.ganado_estado_leche gel5
INNER JOIN (SELECT MAX(gel3.ganado_estado_leche_id) ganado_estado_leche_id
FROM agroapp.ganado_estado_leche gel3
INNER JOIN (SELECT gel.ganado_id, MAX(gel.created) created
FROM agroapp.ganado_estado_leche gel
GROUP BY gel.ganado_id) gel2 ON (gel2.ganado_id = gel3.ganado_id AND gel2.created = gel3.created)
GROUP BY gel3.ganado_id) gel4 ON gel4.ganado_estado_leche_id = gel5.ganado_estado_leche_id
) gel6 ON gel6.ganado_id = g.ganado_id
INNER JOIN (SELECT gf5.ganado_id, gf5.fundo_id
FROM agroapp.ganado_fundo gf5
INNER JOIN (SELECT MAX(gf3.ganado_fundo_id) ganado_fundo_id
FROM agroapp.ganado_fundo gf3
INNER JOIN (SELECT gf.ganado_id, MAX(gf.created) created
FROM agroapp.ganado_fundo gf
GROUP BY gf.ganado_id) gf2 ON (gf2.ganado_id = gf3.ganado_id AND gf2.created = gf3.created)
GROUP BY gf3.ganado_id) gf4 ON gf4.ganado_fundo_id = gf5.ganado_fundo_id
) gf6 ON gf6.ganado_id = g.ganado_id
INNER JOIN (SELECT ger5.ganado_id, ger5.estado_reproductivo
FROM agroapp.ganado_estado_reproductivo ger5
INNER JOIN (SELECT MAX(ger3.ganado_estado_reproductivo_id) ganado_estado_reproductivo_id
FROM agroapp.ganado_estado_reproductivo ger3
INNER JOIN (SELECT ger.ganado_id, MAX(ger.created) created
FROM agroapp.ganado_estado_reproductivo ger
GROUP BY ger.ganado_id) ger2 ON (ger2.ganado_id = ger3.ganado_id AND ger2.created = ger3.created)
GROUP BY ger3.ganado_id) ger4 ON ger4.ganado_estado_reproductivo_id = ger5.ganado_estado_reproductivo_id
) ger6 ON ger6.ganado_id = g.ganado_id
WHERE g.organizacion_id = 21
Tables
CREATE TABLE agroapp.ganado_estado_leche
(
ganado_estado_leche_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
estado_leche character varying(80) NOT NULL,
ganado_id integer NOT NULL,
fecha_manejo timestamp without time zone NOT NULL,
CONSTRAINT ganado_estado_leche_pk PRIMARY KEY (ganado_estado_leche_id),
CONSTRAINT ganado_fk FOREIGN KEY (ganado_id)
REFERENCES agroapp.ganado (ganado_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE TABLE agroapp.ganado_fundo
(
ganado_fundo_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
fundo_id integer NOT NULL,
ganado_id integer NOT NULL,
CONSTRAINT ganado_fundo_pk PRIMARY KEY (ganado_fundo_id),
CONSTRAINT ganado_fk FOREIGN KEY (ganado_id)
REFERENCES agroapp.ganado (ganado_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE TABLE agroapp.ganado_estado_reproductivo
(
ganado_estado_reproductivo_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
estado_reproductivo character varying(80) NOT NULL,
ganado_id integer NOT NULL,
fecha_manejo timestamp without time zone NOT NULL,
CONSTRAINT ganado_estado_reproductivo_pk PRIMARY KEY (ganado_estado_reproductivo_id),
CONSTRAINT ganado_fk FOREIGN KEY (ganado_id)
REFERENCES agroapp.ganado (ganado_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE TABLE agroapp.ganado
(
ganado_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
fecha_nacimiento timestamp without time zone NOT NULL,
tipo_ganado character varying(80) NOT NULL,
diio_id integer NOT NULL,
fundo_id integer NOT NULL,
raza_id integer NOT NULL,
estado_reproductivo character varying(80) NOT NULL,
estado_leche character varying(80),
CONSTRAINT ganado_pk PRIMARY KEY (ganado_id),
CONSTRAINT diio_fk FOREIGN KEY (diio_id)
REFERENCES agroapp.diio (diio_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fundo_fk FOREIGN KEY (fundo_id)
REFERENCES agroapp.fundo (fundo_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT raza_fk FOREIGN KEY (raza_id)
REFERENCES agroapp.raza (raza_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)

Table design
This looks very much like a boolean column (yes / no):
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar
If so, replace with:
isactive bool NOT NULL DEFAULT TRUE
If you might involve multiple times zones in any way, use timestamptz instead of timestamp here:
created timestamp without time zone NOT NULL DEFAULT now(),
The default now() produces timestamptz and after the assignment cast results in the current time according to the time zone of the session. I.e., the value changes with the timezone of the session, which is a sneaky point of failure. See:
- Ignoring time zones altogether in Rails and PostgreSQL
And:
createdby numeric(10,0) NOT NULL
et al. look like they should really be just integer. (Or maybe bigint if you really think you might burn through more than 2147483648 numbers ...)
Query
Looking at the first subquery:
SELECT gel5.ganado_id, gel5.estado_leche
FROM agroapp.ganado_estado_leche gel5
INNER JOIN (
SELECT MAX(gel3.ganado_estado_leche_id) ganado_estado_leche_id
FROM agroapp.ganado_estado_leche gel3
INNER JOIN (
SELECT gel.ganado_id, MAX(gel.created) created
FROM agroapp.ganado_estado_leche gel
GROUP BY gel.ganado_id
) gel2 ON (gel2.ganado_id = gel3.ganado_id AND gel2.created = gel3.created)
GROUP BY gel3.ganado_id
) gel4 ON gel4.ganado_estado_leche_id = gel5.ganado_estado_leche_id
The innermost subquery gets the max. created per ganado_id, the next one the max ganado_estado_leche_id of those rows. And finally you join back and retrieve all ganado_id that appear in combination with the identified max ganado_estado_leche_id per partition. I have a hard time making sense of this, but it can be simplified to:
SELECT gel2.ganado_id
FROM agroapp.ganado_estado_leche gel2
JOIN (
SELECT DISTINCT ON (ganado_id) ganado_estado_leche_id
FROM agroapp.ganado_estado_leche
ORDER BY ganado_id, created DESC NULLS LAST, ganado_estado_leche_id DESC NULLS LAST
) gel1 USING (ganado_estado_leche_id)
See:
Select first row in each GROUP BY group?
Looks like an incorrect query to me. Same with the rest of the query: the joins multiply rows in an odd fashion. Not sure what you are trying to count, but I doubt the query counts just that. You did not provide enough information to make sense of it.

Query execution time increased dramatically without type cast

The query in this state takes more than 5 minutes to execute. If I remove any of the ::DATE conversions (see comment in code) the execution time goes < 500 ms.
For example, if I change gf.created::DATE to gf.created the performance is dramatically increased. Same happens if I change gtg.created::DATE to gtg.created.
Why is there a huge difference when using both ::DATE conversions if each shows great performance on its own?
SELECT gtg6.tipo_ganado, COUNT(gtg6.tipo_ganado) animales
FROM agroapp.ganado g
INNER JOIN (SELECT gf5.ganado_id, gf5.fundo_id
FROM agroapp.ganado_fundo gf5
INNER JOIN (SELECT MAX(gf3.ganado_fundo_id) ganado_fundo_id
FROM agroapp.ganado_fundo gf3
INNER JOIN (SELECT gf.ganado_id, MAX(gf.created) created
FROM agroapp.ganado_fundo gf
WHERE gf.isactive = 'Y'
-- HERE CHANGING gf.created::DATE TO gf.created
AND gf.created::DATE <= '20181030'::DATE
GROUP BY gf.ganado_id) gf2 ON (gf2.ganado_id = gf3.ganado_id AND gf2.created = gf3.created)
WHERE gf3.isactive = 'Y'
GROUP BY gf3.ganado_id) gf4 ON gf4.ganado_fundo_id = gf5.ganado_fundo_id
) gf6 ON gf6.ganado_id = g.ganado_id
INNER JOIN (SELECT gtg5.ganado_id, gtg5.tipo_ganado
FROM agroapp.ganado_tipo_ganado gtg5
INNER JOIN (SELECT MAX(gtg3.ganado_tipo_ganado_id) ganado_tipo_ganado_id
FROM agroapp.ganado_tipo_ganado gtg3
INNER JOIN (SELECT gtg.ganado_id, MAX(gtg.created) created
FROM agroapp.ganado_tipo_ganado gtg
WHERE gtg.isactive = 'Y'
-- OR HERE CHANGING gtg.created::DATE TO gtg.created
AND gtg.created::DATE <= '20181030'::DATE
GROUP BY gtg.ganado_id) gtg2 ON (gtg2.ganado_id = gtg3.ganado_id AND gtg2.created = gtg3.created)
WHERE gtg3.isactive = 'Y'
GROUP BY gtg3.ganado_id) gtg4 ON gtg4.ganado_tipo_ganado_id = gtg5.ganado_tipo_ganado_id
) gtg6 ON gtg6.ganado_id = g.ganado_id
WHERE g.organizacion_id = 21
GROUP BY gtg6.tipo_ganado
ORDER BY gtg6.tipo_ganado;
Table definitions
All 3 tables have around 50000 rows:
CREATE TABLE agroapp.ganado_fundo
(
ganado_fundo_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
fundo_id integer NOT NULL,
ganado_id integer NOT NULL,
CONSTRAINT ganado_fundo_pk PRIMARY KEY (ganado_fundo_id),
CONSTRAINT ganado_fk FOREIGN KEY (ganado_id)
REFERENCES agroapp.ganado (ganado_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE TABLE agroapp.ganado_tipo_ganado
(
ganado_tipo_ganado_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
tipo_ganado character varying(80) NOT NULL,
ganado_id integer NOT NULL,
CONSTRAINT ganado_tipo_ganado_pk PRIMARY KEY (ganado_tipo_ganado_id),
CONSTRAINT ganado_fk FOREIGN KEY (ganado_id)
REFERENCES agroapp.ganado (ganado_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)
CREATE TABLE agroapp.ganado
(
ganado_id serial NOT NULL,
organizacion_id integer NOT NULL,
isactive character(1) NOT NULL DEFAULT 'Y'::bpchar,
created timestamp without time zone NOT NULL DEFAULT now(),
createdby numeric(10,0) NOT NULL,
updated timestamp without time zone NOT NULL DEFAULT now(),
updatedby numeric(10,0) NOT NULL,
fecha_nacimiento timestamp without time zone NOT NULL,
tipo_ganado character varying(80) NOT NULL,
diio_id integer NOT NULL,
fundo_id integer NOT NULL,
raza_id integer NOT NULL,
estado_reproductivo character varying(80) NOT NULL,
estado_leche character varying(80),
CONSTRAINT ganado_pk PRIMARY KEY (ganado_id),
CONSTRAINT diio_fk FOREIGN KEY (diio_id)
REFERENCES agroapp.diio (diio_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT fundo_fk FOREIGN KEY (fundo_id)
REFERENCES agroapp.fundo (fundo_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT raza_fk FOREIGN KEY (raza_id)
REFERENCES agroapp.raza (raza_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
)

Most probably because the forced cast voids the option to use an index on the column agroapp.ganado_fundo.created
Guessing (for lack of information) that gf.created is of type timestamp with time zone (or timestamp), replace
AND gf.created::DATE <= '20181030'::DATE
with:
AND gf.created < '2018-10-31'::timestamp -- match the data type of the column!
to achieve the same result, but with index support.
If you operate with timestamtptz, be aware of implications on the date: it depends on the current time zone. Details:
Ignoring time zones altogether in Rails and PostgreSQL

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Postgres remove duplicates (multiple columns) in order to add unique constraint - postgresql

DELETE FROM assignment WHERE id IN (SELECT id FROM (SELECT id, ROW_NUMBER() OVER (partition BY dining_table_id, guest_group_id, start_timestamp, end_timestamp ORDER BY id) AS rnum FROM assignment) t WHERE t.rnum > 1);

Related

column "birthdate" is of type timestamp without time zone but expression is of type text

Using 'on conflict' with a unique constraint on a table partitioned by date

Postgresql not choosing rows grouping

Query too slow for just 4 tables with 50000 rows each

Query execution time increased dramatically without type cast

Categories

Resources