I have 61 million non-unique emails with statuses.
These emails need to be deduplicated, with logic based on the status.
I wrote a stored procedure, but it runs too long.
How can I optimize the execution time of this procedure?
CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$
DECLARE
row record;
rec record;
new_id int;
BEGIN
FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
IF rec IS NOT NULL THEN
IF row.status = 3 THEN
UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
END IF;
ELSE
INSERT INTO oxy_emails_clean(id, email, status) VALUES(nextval('oxy_emails_clean_id_seq'), row.email, row.status);
SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id) VALUES(new_id, row.website_id);
END IF;
END LOOP;
RETURN true;
END;
$$
LANGUAGE 'plpgsql';
Don't do it with a loop.
Row-by-row processing (also known as "slow-by-slow") is almost always a lot slower than a bulk change, where a single statement processes many rows "in one go".
The change of the status can easily be done using a single statement:
update oxy_emails_clean oec
SET status = 3
from oxy_email oe
where oe.email = oec.email --<< match on email, as the loop does; the ids of the two tables are unrelated
and oe.status = 3;
The copying of the rows can be done using a chain of CTEs:
with to_copy as (
select distinct on (email) email, status, website_id
from oxy_email
where status <> 3 --<< all those that have a different status
order by email, id --<< one row per email: the one with the lowest id
), clean_inserted as (
INSERT INTO oxy_emails_clean (id, email, status)
select nextval('oxy_emails_clean_id_seq'), email, status
from to_copy
returning id, email
)
insert into oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
from clean_inserted ci
join to_copy tc on tc.email = ci.email; --<< the generated id is not oxy_email.id, so join back on email
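For comparison, here is a sketch of the whole procedure from the question as one set-based statement. It rests on assumptions that the question does not confirm: oxy_emails_clean is empty before the load, email is never NULL, and the intended rule is "keep the first row per email (lowest id), but set status 3 if any row for that email has status 3", which is what the loop effectively does. Treat it as a starting point, not a drop-in replacement.

```sql
CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$
BEGIN
    WITH first_rows AS (
        -- One row per email: the row with the lowest id supplies website_id
        -- and the default status. The window aggregate checks whether any
        -- row for the same email has status 3.
        SELECT DISTINCT ON (email)
               email,
               website_id,
               CASE WHEN bool_or(status = 3) OVER (PARTITION BY email)
                    THEN 3
                    ELSE status
               END AS status
        FROM oxy_email
        ORDER BY email, id
    ), clean_inserted AS (
        INSERT INTO oxy_emails_clean (id, email, status)
        SELECT nextval('oxy_emails_clean_id_seq'), email, status
        FROM first_rows
        RETURNING id, email
    )
    -- Link each new clean row to the website of its first source row.
    INSERT INTO oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
    SELECT ci.id, fr.website_id
    FROM clean_inserted ci
    JOIN first_rows fr ON fr.email = ci.email;

    RETURN true;
END;
$$ LANGUAGE plpgsql;
```

With 61 million source rows, an index on oxy_email (email, id) may help the DISTINCT ON step, but that is worth checking with EXPLAIN on the real data.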
I'm trying to enforce this rule in PostgreSQL:
the table must not store more than 5 records (rows).
CREATE FUNCTION contar_dest()
RETURNS TRIGGER AS
$body$
BEGIN
IF (SELECT COUNT(ID) FROM "LUGARES" WHERE demanda is not null) > 5
THEN
DELETE FROM "LUGARES"
WHERE ID = (SELECT max(ID) FROM "LUGARES");
RETURN NULL;
END IF;
END;
$body$
LANGUAGE plpgsql;
CREATE TRIGGER contar
AFTER INSERT
ON "LUGARES"
FOR EACH ROW
EXECUTE PROCEDURE contar_dest();
When I try to insert a row, I get this:
ERROR: control reached end of trigger procedure without RETURN
CONTEXT: PL/pgSQL function contar_dest()
I changed the function as shown below. It now does what I want, but I get a different error:
CREATE FUNCTION contar_dest()
RETURNS TRIGGER AS
$body$
BEGIN
IF ((SELECT COUNT(ID) FROM "LUGARES" WHERE demanda is not null) > 5) THEN
DELETE FROM "LUGARES"
WHERE ID = (SELECT NEW.ID FROM "LUGARES");
ELSIF ((SELECT COUNT(ID) FROM "LUGARES" WHERE demanda is not null) < 5) THEN
RETURN NULL;
END IF;
RETURN NULL;
END;
$body$
LANGUAGE plpgsql;
CREATE TRIGGER contar
AFTER INSERT
ON "LUGARES"
FOR EACH ROW
EXECUTE PROCEDURE contar_dest();
INSERT INTO "LUGARES"(
nombre, demanda)
VALUES ('Valencia',2000);
ERROR: more than one row returned by a subquery used as an expression
CONTEXT: SQL statement "DELETE FROM "LUGARES"
WHERE ID = (SELECT NEW.ID FROM "LUGARES")"
PL/pgSQL function contar_dest() line 4 at SQL statement
The error occurs because the = comparison requires the subquery to return at most one row, but this subquery returns one row for every row in the table.
Change:
DELETE FROM "LUGARES"
WHERE ID = (SELECT NEW.ID FROM "LUGARES");
To:
DELETE FROM "LUGARES"
WHERE ID IN (SELECT NEW.ID FROM "LUGARES");
The = operator compares against exactly one value, whereas IN accepts many values.
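Since NEW.ID is already a single value, the subquery is not needed at all: WHERE ID = NEW.ID would do. As an alternative sketch, the trigger function can reject the insert with an error instead of silently deleting the new row. The error message text is made up for illustration, and the existing AFTER INSERT trigger from the question is assumed to stay in place.

```sql
CREATE OR REPLACE FUNCTION contar_dest()
  RETURNS trigger
  LANGUAGE plpgsql AS
$body$
BEGIN
   -- The trigger fires AFTER INSERT, so the new row is already counted here.
   IF (SELECT count(*) FROM "LUGARES" WHERE demanda IS NOT NULL) > 5 THEN
      -- Raising an error aborts the INSERT, so the extra row is never stored.
      RAISE EXCEPTION 'LUGARES may hold at most 5 rows with a demanda value';
   END IF;
   RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END
$body$;
```

Note that two sessions inserting at the same time cannot see each other's uncommitted rows, so this check alone does not guarantee the limit under concurrent inserts.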
Assuming a schema like the following:
CREATE TABLE node (
id SERIAL PRIMARY KEY,
name VARCHAR,
parentid INT REFERENCES node(id)
);
Further, let's assume the following data is present:
INSERT INTO node (name,parentid) VALUES
('A',NULL),
('B',1),
('C',1);
Is there a way to prevent cycles from being created? Example:
UPDATE node SET parentid = 2 WHERE id = 1;
This would create a cycle of 1->2->1->...
Here is your trigger, simplified and optimized. It should be considerably faster:
CREATE OR REPLACE FUNCTION detect_cycle()
RETURNS TRIGGER
LANGUAGE plpgsql AS
$func$
BEGIN
IF EXISTS (
WITH RECURSIVE search_graph(parentid, path, cycle) AS ( -- relevant columns
-- check ahead, makes 1 step less
SELECT g.parentid, ARRAY[g.id, g.parentid], (g.id = g.parentid)
FROM node g
WHERE g.id = NEW.id -- only test starting from new row
UNION ALL
SELECT g.parentid, sg.path || g.parentid, g.parentid = ANY(sg.path)
FROM search_graph sg
JOIN node g ON g.id = sg.parentid
WHERE NOT sg.cycle
)
SELECT FROM search_graph
WHERE cycle
LIMIT 1 -- stop evaluation at first find
)
THEN
RAISE EXCEPTION 'Loop detected!';
ELSE
RETURN NEW;
END IF;
END
$func$;
You don't need dynamic SQL, you don't need to count, you don't need all the columns and you don't need to test the whole table for every single row.
CREATE TRIGGER detect_cycle_after_update
AFTER INSERT OR UPDATE ON node
FOR EACH ROW EXECUTE PROCEDURE detect_cycle();
An INSERT like this has to be prohibited, too:
INSERT INTO node (id, name,parentid) VALUES (8,'D',9), (9,'E',8);
To answer my own question, I came up with a trigger that prevents this:
CREATE OR REPLACE FUNCTION detect_cycle() RETURNS TRIGGER AS
$func$
DECLARE
loops INTEGER;
BEGIN
EXECUTE 'WITH RECURSIVE search_graph(id, parentid, name, depth, path, cycle) AS (
SELECT g.id, g.parentid, g.name, 1,
ARRAY[g.id],
false
FROM node g
UNION ALL
SELECT g.id, g.parentid, g.name, sg.depth + 1,
path || g.id,
g.id = ANY(path)
FROM node g, search_graph sg
WHERE g.id = sg.parentid AND NOT cycle
)
SELECT count(*) FROM search_graph where cycle = TRUE' INTO loops;
IF loops > 0 THEN
RAISE EXCEPTION 'Loop detected!';
ELSE
RETURN NEW;
END IF;
END
$func$ LANGUAGE plpgsql;
CREATE TRIGGER detect_cycle_after_update
AFTER UPDATE ON node
FOR EACH ROW EXECUTE PROCEDURE detect_cycle();
So, if you try to create a loop, like in the question:
UPDATE node SET parentid = 2 WHERE id = 1;
You get an EXCEPTION:
ERROR: Loop detected!
CREATE OR REPLACE FUNCTION detect_cycle()
RETURNS TRIGGER AS
$func$
DECLARE
cycle int[];
BEGIN
EXECUTE format('WITH RECURSIVE search_graph(%4$I, path, cycle) AS (
SELECT g.%4$I, ARRAY[g.%3$I, g.%4$I], (g.%3$I = g.%4$I)
FROM %1$I.%2$I g
WHERE g.%3$I = $1.%3$I
UNION ALL
SELECT g.%4$I, sg.path || g.%4$I, g.%4$I = ANY(sg.path)
FROM search_graph sg
JOIN %1$I.%2$I g ON g.%3$I = sg.%4$I
WHERE NOT sg.cycle)
SELECT path
FROM search_graph
WHERE cycle
LIMIT 1', TG_TABLE_SCHEMA, TG_TABLE_NAME, TG_ARGV[0], TG_ARGV[1]) -- %I already quotes identifiers; quote_ident() on top would quote twice
INTO cycle
USING NEW;
IF cycle IS NULL
THEN
RETURN NEW;
ELSE
RAISE EXCEPTION 'Loop in %.% detected: %', TG_TABLE_SCHEMA, TG_TABLE_NAME, array_to_string(cycle, ' -> ');
END IF;
END
$func$ LANGUAGE plpgsql;
CREATE TRIGGER detect_cycle_after_update
AFTER INSERT OR UPDATE ON node
FOR EACH ROW EXECUTE PROCEDURE detect_cycle('id', 'parentid');
While the currently accepted answer by @Erwin Brandstetter is fine when you process one update/insert at a time, it can still fail under concurrent execution.
Assume the table content defined by
INSERT INTO node VALUES
(1, 'A', NULL),
(2, 'B', 1),
(3, 'C', NULL),
(4, 'D', 3);
and then in one transaction, execute
-- transaction A
UPDATE node SET parentid = 2 where id = 3;
and in another
-- transaction B
UPDATE node SET parentid = 4 where id = 1;
Both UPDATE commands will succeed, and you can afterwards commit both transactions.
-- transaction A
COMMIT;
-- transaction B
COMMIT;
You will then have a cycle 1->4->3->2->1 in the table.
To make it work, you will either have to use isolation level SERIALIZABLE or use explicit locking in the trigger.
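One way to add explicit locking is a transaction-level advisory lock taken at the start of the trigger function. The following is a sketch only: the lock key 727001 is an arbitrary number picked for illustration, the recursive check is the one from the answer above with the flag column renamed to is_cycle, and the default READ COMMITTED isolation level is assumed so that the check takes a fresh snapshot after the lock is granted.

```sql
CREATE OR REPLACE FUNCTION detect_cycle()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   -- Serialize all cycle checks. A second transaction waits here until the
   -- first one commits or rolls back, and only then runs its own check.
   -- 727001 is an arbitrary, hypothetical key; any fixed bigint works.
   PERFORM pg_advisory_xact_lock(727001);

   IF EXISTS (
      WITH RECURSIVE search_graph(parentid, path, is_cycle) AS (
         SELECT g.parentid, ARRAY[g.id, g.parentid], (g.id = g.parentid)
         FROM   node g
         WHERE  g.id = NEW.id            -- start from the changed row only
         UNION ALL
         SELECT g.parentid, sg.path || g.parentid, g.parentid = ANY(sg.path)
         FROM   search_graph sg
         JOIN   node g ON g.id = sg.parentid
         WHERE  NOT sg.is_cycle
      )
      SELECT FROM search_graph
      WHERE  is_cycle
      LIMIT  1
   ) THEN
      RAISE EXCEPTION 'Loop detected!';
   END IF;

   RETURN NEW;
END
$func$;
```

In the two-transaction example above, transaction B would then block inside the trigger until A commits, see A's change, and fail with "Loop detected!". The price is that all writes to node that fire this trigger are serialized.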
Here is a variant that is slightly different from Erwin's:
CREATE OR REPLACE FUNCTION detect_cycle ()
RETURNS TRIGGER
LANGUAGE plpgsql
AS $func$
BEGIN
IF EXISTS ( WITH RECURSIVE search_graph (
id,
name,
parentid,
is_cycle,
path
) AS (
SELECT *, FALSE,ARRAY[ROW (n.id,n.parentid)]
FROM
node n
WHERE
n.id = NEW.id
UNION ALL
SELECT
n.*,
ROW (n.id,n.parentid) = ANY (path),
path || ROW (n.id,n.parentid)
FROM
node n,
search_graph sg
WHERE
n.id = sg.parentid
AND NOT is_cycle
)
SELECT *
FROM
search_graph
WHERE
is_cycle
LIMIT 1) THEN
RAISE EXCEPTION 'Loop detected!';
ELSE
RETURN new;
END IF;
END
$func$;
I have a problem with my Postgres and it looks like a simple one. I have done my research but I have not seen anything similar online and would like some clarification:
This is done inside a function, here is the whole code:
BEGIN
IF($5 IS NOT NULL) THEN
BEGIN
INSERT INTO "PushDevice"("DeviceId","PushNotificationId", "pushId","deviceType",sound)
SELECT DISTINCT d.id, $4,d.pushid,d.type,d.sound FROM "Device" d inner join "DeviceLocation" dl ON d.id = dl."DeviceId"
WHERE dl."FIPScode" in (select "FIPScode" from "CountyFIPS" where "stateCode"=$5) AND dl."AppId"=$2 AND d.pushId is not null and d.pushId <>'' and d.pushId<>'1234-5678-9101-2345-3456' and d."isTest"=$3 and d."enableNotification"=TRUE and dl."isDeleted"=0
AND NOT EXISTS (SELECT 1 FROM "PushDevice" t where t."DeviceId"=d.id AND t."PushNotificationId"=$4);
END;
ELSE
DECLARE "epiCentre" VARCHAR := NULL;
magnitude FLOAT = NULL;
BEGIN
SELECT polygon INTO "epiCentre" from alert where id=$1 and "disablePush"=FALSE;
END;
IF("epiCentre" IS NOT NULL) THEN
BEGIN
INSERT INTO "PushDevice"("DeviceId","PushNotificationId", "pushId","deviceType","sound")
SELECT DISTINCT d.id, $4,d."pushId",d.type,d.sound FROM "Device" d inner join "DeviceLocation" dl ON d.id = dl."DeviceId"
WHERE dl."AppId"=$2 AND d."pushId" is not null and d."pushId" <>'' and d."pushId" <>'1234-5678-9101-2345-3456' and d."isTest" =$3 and ST_Distance_Sphere(ST_GeometryFromText("epiCentre"), ST_GeometryFromText(geoPoint))<=d.radius * 1609.344 and magnitude>= d.magnitude and d."enableNotification"=1 and dl."isDeleted"=0
AND NOT EXISTS (SELECT 1 FROM "PushDevice" t where t."DeviceId"=d.id AND t."PushNotificationId"=$4);
END;
END IF;
RETURN QUERY SELECT pd.* FROM "PushDevice" pd
WHERE pd."PushNotificationId" =$4 and pd."sentAt" is null;
END IF;
END;
The problem is here specifically:
DECLARE "epiCentre" VARCHAR := NULL;
magnitude FLOAT = NULL;
BEGIN
SELECT polygon INTO "epiCentre" from alert where id=$1 and "disablePush"=FALSE;
END;
IF("epiCentre" IS NOT NULL) THEN
With error:
Procedure execution failed
ERROR: column "epiCentre" does not exist
LINE 1: SELECT ("epiCentre" IS NOT NULL)
^
QUERY: SELECT ("epiCentre" IS NOT NULL)
CONTEXT: PL/pgSQL function "GetDevicesForPush... line 18 at IF.
So somehow the IF statement treats epiCentre as a column instead of a variable, and it does not even know the variable exists, although I declared it right above.
Any thoughts?
I think you have too many BEGIN-END blocks. The declaration of epiCentre is only valid up to the first END, and the IF comes after that. Therefore I would use one block for the whole ELSE part.
http://www.postgresql.org/docs/8.3/static/plpgsql-structure.html
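The scoping rule can be seen in isolation with a small anonymous block. This is a minimal sketch with made-up variable names (outer_val, inner_val), not the function from the question:

```sql
DO $$
DECLARE
    outer_val text := 'declared in the outer block';
BEGIN
    DECLARE
        inner_val text := 'declared in the nested block';
    BEGIN
        -- Both variables are visible inside the nested block.
        RAISE NOTICE '%', inner_val;
        RAISE NOTICE '%', outer_val;
    END;  -- inner_val goes out of scope here

    -- Only outer_val is still visible after the nested END.
    RAISE NOTICE '%', outer_val;

    -- Uncommenting the next line reproduces the error from the question,
    -- because PL/pgSQL then looks for a column named inner_val:
    -- RAISE NOTICE '%', inner_val;
END
$$;
```

Declaring the variable in the DECLARE section of the block that also contains the IF keeps it in scope for that IF.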
As you have already found out yourself, DECLARE must be placed before the BEGIN of each block.
More importantly, you do not need multiple blocks here at all. And you don't need a variable either. Use this simpler, safer and faster form:
CREATE function foo(...)
RETURNS ... AS
$func$
BEGIN
IF($5 IS NOT NULL) THEN
-- no redundant BEGIN!
INSERT INTO ... ;
-- and no END!
ELSIF EXISTS (SELECT 1 FROM alert
WHERE id = $1
AND "disablePush" = FALSE
AND polygon IS NOT NULL -- only if polygon can be NULL
) THEN
INSERT INTO ... ;
...
END IF;
END
$func$ LANGUAGE plpgsql;
More Details:
PL/pgSQL checking if a row exists - SELECT INTO boolean
I'm trying a query that runs on MSSQL but does not run on PostgreSQL.
The SQL query is:
IF EXISTS (SELECT * FROM Kategoriler WHERE KategoriId = 119)
BEGIN
SELECT * FROM Kategoriler
END
ELSE
SELECT * FROM Adminler
I searched and found this on Stack Overflow:
DO
$BODY$
BEGIN
IF EXISTS (SELECT 1 FROM orders) THEN
DELETE from orders;
ELSE
INSERT INTO orders VALUES (1,2,3);
END IF;
END;
$BODY$
But I do not want to use DO, $BODY$, and so on, and I do not want to write a function.
I only want to write a plain IF ... ELSE statement in PostgreSQL. Please help.
T-SQL supports procedural statements such as IF in plain SQL. PostgreSQL does not, so you cannot simply port your query to Postgres. Sometimes you can use Igor's solution, sometimes you can use PL/pgSQL (functions), and sometimes you have to modify your application and move procedural code from the server to the client.
Try something like
SELECT *
FROM Kategoriler
UNION ALL
SELECT *
FROM Adminler
WHERE NOT EXISTS (SELECT * FROM Kategoriler WHERE KategoriId = 119)
This only works if Kategoriler and Adminler have the same structure. Otherwise you need to specify a list of columns instead of *.
In my case I needed to know if a record existed.
I had to write a function
CREATE OR REPLACE FUNCTION public.pro_device_exists(vdn character varying)
RETURNS boolean
LANGUAGE plpgsql
AS $function$
BEGIN
IF EXISTS (SELECT 1 FROM tags WHERE device_name = upper(vdn)) THEN
return true;
ELSE
return false;
END IF;
END; $function$
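Because EXISTS already yields a boolean, the IF/ELSE inside the function is optional. A shorter sketch of the same function, assuming the same tags table and device_name column, returns the result directly. The STABLE marker is an addition that is reasonable for a function that only reads data:

```sql
CREATE OR REPLACE FUNCTION public.pro_device_exists(vdn character varying)
  RETURNS boolean
  LANGUAGE plpgsql
  STABLE
AS $function$
BEGIN
   -- EXISTS evaluates to true or false, so it can be returned as is.
   RETURN EXISTS (SELECT 1 FROM tags WHERE device_name = upper(vdn));
END;
$function$;
```

The calling code stays the same.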
Then I was able to call this function in my code ... just a portion of my code
if pro_device_exists(vdn) then
update tags
set device_id = 11 where device_id = pro_device_id(vdn) and tag_type=10;
update tags
set device_id = pro_device_id(vdn) where tag_id = vtag_id;
vmsg = (select 'Device Now set to ' || first_name || ' ' || last_name from tags where tag_id=vtag_id);
vaction = 'Refresh Device Data';
else
vmsg = 'Device is not registered on this system';
vaction = 'No Nothing';
end if;