I am having problems with the following function. Its purpose is to return a set of records that will not be returned again if called within 60 seconds (almost like a queue).
It seems to work fine when I run it one at a time, but when I use it in my threaded application, duplicates show up. Am I locking the rows correctly? What is the correct way to use FOR UPDATE when inserting into a temp table?
CREATE OR REPLACE FUNCTION needs_quantities(computer TEXT)
  RETURNS TABLE(id BIGINT, listing_id CHARACTER VARYING, asin CHARACTER VARYING, retry_count INT)
  LANGUAGE plpgsql
AS $$
BEGIN
  CREATE TEMP TABLE temp_needs_quantity ON COMMIT DROP AS
  SELECT listing.id,
         listing.listing_id,
         listing.asin,
         listing.retry_count
  FROM   listing
  WHERE  listing.id IN (
            SELECT min(listing.id) AS id
            FROM   listing
            WHERE  (listing.quantity_assigned_to IS NULL
                    -- or: quantity is null,
                    -- quantity assigned date is at least 60 seconds ago,
                    -- and quantity date is within 2 hours
                    OR (quantity IS NULL
                        AND listing.quantity_assigned_date < now_utc() - INTERVAL '60 second'
                        AND (listing.quantity_date IS NULL
                             OR listing.quantity_date > now_utc() - INTERVAL '2 hour')))
            AND    listing.retry_count < 10
            GROUP  BY listing.asin
            ORDER  BY min(listing.retry_count), min(listing_date)
            LIMIT  10)
  FOR UPDATE;

  UPDATE listing
  SET    quantity_assigned_date = now_utc(),
         quantity_assigned_to   = computer
  WHERE  listing.id IN (SELECT temp_needs_quantity.id FROM temp_needs_quantity);

  RETURN QUERY
  SELECT *
  FROM   temp_needs_quantity
  ORDER  BY id;
END
$$;
In the first thread that comes along, your function locks the rows in listing just as you intended.
The problem in the second thread is that this subselect:
...
WHERE listing.id IN (
SELECT min(listing.id) AS id
FROM listing
...
LIMIT 10
)
is not blocked by the lock on the rows, even though the enclosing SELECT ... FOR UPDATE is.
So the subselect happily sees the old row versions from before the UPDATE of the first thread, then blocks in the enclosing SELECT ... FOR UPDATE until the first thread is done, and then proceeds to update the same rows again.
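To make that interleaving concrete, here is a rough timeline of the two threads (an illustration only, not runnable code):

-- Thread 1                                 -- Thread 2
-- BEGIN;                                   -- BEGIN;
-- SELECT ... FOR UPDATE;  (locks rows)     --
-- UPDATE listing ...;                      -- subselect runs: still sees the OLD row versions
--                                          -- enclosing SELECT ... FOR UPDATE blocks on thread 1
-- COMMIT;                                  --
--                                          -- resumes, locks the same rows, UPDATEs them again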
I am not sure if that is to be considered a bug or not – you may want to ask at the pgsql-general mailing list.
There has been a similar problem with CTEs recently, see this commit message fixing this bug. One could argue that this is a similar case.
Unfortunately with a complicated query like yours I can't think of a better solution than to
LOCK TABLE listing IN EXCLUSIVE MODE;
before you begin processing, which is not very satisfactory.
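To show where that would go, a minimal sketch (only the lock at the top of the function body is new; everything else stays as it is):

BEGIN
    -- Serialize all callers up front. The lock is held until commit,
    -- so a second thread waits here and then sees the updated rows.
    LOCK TABLE listing IN EXCLUSIVE MODE;

    CREATE TEMP TABLE temp_needs_quantity ON COMMIT DROP AS
    ...  -- rest of the function unchanged
END
$$;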
I've got a Postgres 12.3 question: Can I rely on CLOCK_TIMESTAMP() in a trigger to stamp an updated_dts timestamp in exactly the same order as changes are committed to the permanent data?
On the face of it, this might sound like kind of a silly question, but I just spent two days tracking down a super rare race condition in a non-Postgres system that hinged on exactly this behavior. (Lagging commits made their 'last value seen' tracking data unreliable.) Now I'm trying to figure out if it's possible for CLOCK_TIMESTAMP() not to match the order of changes recorded in the WAL perfectly.
It's simple to see how this could occur with NOW/TRANSACTION_TIMESTAMP/CURRENT_TIMESTAMP as they're returning the transaction start time, not the completion time. It's pretty easy, in that case, to record a timestamp sequence where the stamps and log order don't agree. But I can't figure out if there's any chance for commits to be saved in a different order to the BEFORE trigger CLOCK_TIMESTAMP() values.
For background, we need a 100% reliable timeline for an external search to use. As I understand it, I can create one using logical replication, and a replication-target side trigger to stamp changes as they're replayed from the log. What I'm unclear on, is if it's possible to get the same fidelity from CLOCK_TIMESTAMP() on a single server.
I haven't got the chops to get deep into the Postgres internals, and see how requests are interleaved, nor how granular execution is, and am hoping that someone here knows definitively. If this is more of a question for one of the PG mailing lists, please let me know.
-- Thanks
Below is a bit of sample code for how I'm looking at building the timestamps. It works fine, but doesn't prove anything about behavior with lots of concurrent processes.
---------------------------------------------
-- Create the trigger function
---------------------------------------------
DROP FUNCTION IF EXISTS api.set_updated CASCADE;
CREATE OR REPLACE FUNCTION api.set_updated()
RETURNS TRIGGER
AS $BODY$
BEGIN
NEW.updated_dts = CLOCK_TIMESTAMP();
RETURN NEW;
END;
$BODY$
language plpgsql;
COMMENT ON FUNCTION api.set_updated() IS 'Sets updated_dts field to CLOCK_TIMESTAMP(), if the record has changed.';
---------------------------------------------
-- Create the table
---------------------------------------------
DROP TABLE IF EXISTS api.numbers;
CREATE TABLE api.numbers (
id uuid NOT NULL DEFAULT extensions.gen_random_uuid (),
number integer NOT NULL DEFAULT NULL,
updated_dts timestamptz NOT NULL DEFAULT 'epoch'::timestamptz
);
---------------------------------------------
-- Define the triggers (binding)
---------------------------------------------
-- NOTE: I'm guessing that in production I can use DEFAULT CLOCK_TIMESTAMP() instead of a BEFORE INSERT trigger.
-- I'm using a distinct DEFAULT value, as I want it to pop out if I'm not getting the trigger to fire.
CREATE TRIGGER trigger_api_number_before_insert
BEFORE INSERT ON api.numbers
FOR EACH ROW
EXECUTE PROCEDURE api.set_updated();
CREATE TRIGGER trigger_api_number_before_update
BEFORE UPDATE ON api.numbers
FOR EACH ROW
WHEN (OLD.* IS DISTINCT FROM NEW.*)
EXECUTE PROCEDURE api.set_updated();
---------------------------------------------
-- INSERT some data
---------------------------------------------
INSERT INTO api.numbers (number) VALUES (1),(2),(3);
---------------------------------------------
-- Take a look
---------------------------------------------
SELECT * FROM api.numbers ORDER BY updated_dts ASC; -- The values should be listed as 1, 2, 3, oldest to newest.
---------------------------------------------
-- UPDATE a row
---------------------------------------------
UPDATE api.numbers SET number = 11 WHERE number = 1;
---------------------------------------------
-- Take a look
---------------------------------------------
SELECT * FROM api.numbers ORDER BY updated_dts ASC; -- The values should be listed as 2, 3, 11, oldest to newest.
No, you cannot depend on clock_timestamp() order during trigger execution (or while evaluating a DEFAULT clause) being the same as commit order.
Commit will always happen later than the function call, and you cannot control how long it takes between them.
But I am surprised that that is a problem for you. Typically, the commit time is not visible or relevant. Why don't you simply accept the clock_timestamp() as the measure of things?
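To illustrate how the stamp order and the commit order can diverge, a hand-drawn timeline (not runnable code):

-- Session A: BEGIN; UPDATE ...;  trigger fires, clock_timestamp() = 10:00:00.000001
-- Session B: BEGIN; UPDATE ...;  trigger fires, clock_timestamp() = 10:00:00.000002
-- Session B: COMMIT;   -- reaches the WAL first
-- Session A: COMMIT;   -- commits later, but carries the EARLIER stamp
-- Commit order: B, A.  Timestamp order: A, B.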
I have the following table in Postgres, which would typically be populated like below:
id  day        visits      passes
1   Monday     {11,13,19}  {13,17}
2   Tuesday    {7,9}       {11,13,19}
3   Wednesday  {2,5,21}    {21,27}
4   Thursday   {3,11,39}   {21,19}
In order to get the visits or passes ids over a range of days, I have written the following function:
CREATE OR REPLACE FUNCTION day_entries(p_column TEXT, VARIADIC ids int[]) RETURNS bigint[] AS
$$
DECLARE
  result bigint[];
  hold   bigint[];
BEGIN
  FOR i IN 1 .. array_upper(ids, 1) LOOP
    EXECUTE format('SELECT %I FROM days WHERE id = $1', p_column) USING ids[i] INTO hold;
    -- merge the two arrays; UNION removes duplicates and has no defined order
    result := ARRAY(SELECT unnest(result) UNION SELECT unnest(hold));
  END LOOP;
  RETURN result;
END;
$$
LANGUAGE plpgsql;
which works with a subsequent call to day_entries('visits',1,2,3) returning
{11,9,19,21,5,13,2,7}
While it does the job, I am concerned that, based on my one-day-old knowledge of writing Postgres functions, I have worked one or more inefficiencies into the process. Can the function be simplified in some way?
The other issue is more a curiosity than a problem: the order of elements in the result appears to bear no relation to the order of the visits entries in the three rows that are touched. Although this is not an issue as far as I am concerned, I am curious to know why it happens.
You can do the unnesting and aggregating in a single statement; there is no need for a loop. And you can use the ANY operator with the array to select all matching rows. (As for your curiosity: the arbitrary element order comes from UNION, which removes duplicates and makes no ordering promises.)
CREATE OR REPLACE FUNCTION day_entries(p_column TEXT, variadic p_ids int[])
RETURNS bigint[] AS
$$
DECLARE
result bigint[];
BEGIN
execute
format('SELECT array(select unnest(%I) from days WHERE id = any($1))', p_column)
USING p_ids -- pass the whole array as a parameter
INTO result;
RETURN result;
END;
$$
LANGUAGE plpgsql;
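With the sample data above, the rewritten function is called the same way as before, for example:

SELECT day_entries('visits', 1, 2, 3);
-- collects the eight visits ids of rows 1-3 into one array,
-- e.g. {11,13,19,7,9,2,5,21} (element order is not guaranteed)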
Not related to your questions, but I think you are going down the wrong road with that design. While arrays might look intriguing to beginners, they should only be used rarely.
And if you find yourself unnesting and aggregating things back and forth, this is a strong indication that something could be improved.
I would split your table up into two tables: one that stores the "day" information and one that stores visits and passes together, with a column distinguishing the two. Finding visits is then as simple as adding a where ... = 'visit' rather than having to cope with (slow and error prone) dynamic SQL.
Without knowing more details, I would probably create the tables like this:
create table days
(
id integer not null primary key,
day character varying(9) not null
);
create table event
(
day_id integer not null references days,
event_id integer not null,
event_type varchar(10) not null check (event_type in ('visit', 'pass'))
);
event_id might even be a foreign key to another table you haven't shown us - again, something you can't really do with de-normalized tables.
Getting all visits for specific days is then as simple as:
select event_id
from event
where day_id in (1,2)
and event_type = 'visit';
Or if you do need that as an array:
select array_agg(event_id)
from event
where day_id in (1,2)
and event_type = 'visit';
I have an application where I receive a stream of ticks (buys or sells of a commodity) and am trying to generate a table of minutely OHLC (open, high, low, close) columns from this data. The reason I am creating these in a table rather than deriving them from the tick table is the high volume of ticks I get (10,000,000 per day). With this strategy I can delete all the ticks from the database on a schedule to keep my database size manageable.
My schema is roughly equivalent to this (unnecessary columns removed for brevity):
CREATE TABLE tick (
executed TIMESTAMP WITH TIME ZONE NOT NULL,
price NUMERIC
);
CREATE TABLE ohlc_minute (
created TIMESTAMP WITH TIME ZONE NOT NULL PRIMARY KEY,
open NUMERIC,
high NUMERIC,
low NUMERIC,
close NUMERIC
);
My idea was to create an after insert trigger on tick which computes the last minute of OHLC and upserts this into the ohlc_minute table but with this trigger enabled the cpu usage on the database jumps to 100% almost instantly.
CREATE OR REPLACE FUNCTION update_ohlc()
RETURNS trigger AS
$BODY$
BEGIN
INSERT INTO ohlc_minute (created, open, high, low, close)
SELECT
date_trunc('minute', NEW.executed) executed,
(array_agg(price ORDER BY executed ASC))[1] as open,
MAX(price) as high,
MIN(price) as low,
(array_agg(price ORDER BY executed DESC))[1] as close
FROM tick
WHERE executed BETWEEN date_trunc('minute', NEW.executed) AND date_trunc('minute', NEW.executed) + interval '1 Min'
ON CONFLICT (created)
DO UPDATE
SET open = EXCLUDED.open, high=EXCLUDED.high, low=EXCLUDED.low, close=EXCLUDED.close;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
CREATE TRIGGER tick_insert
AFTER INSERT
ON tick
FOR EACH ROW
EXECUTE PROCEDURE update_ohlc();
One possible alternative I have is just to run an equivalent function manually on a schedule to update all OHLC bars, but I like the idea of always having up-to-date partial OHLC information available (e.g. the current bar, covering less than one minute). Are there any easy optimisations I can make to lower the CPU usage of my trigger function?
Are ticks guaranteed to arrive in order? If the insert succeeds, then your aggregation has been done over only one row, so the answer to all aggregations is just the price. If the insert conflicts, then you should be able to compute each value based on just the existing row and the excluded one.
CREATE OR REPLACE FUNCTION update_ohlc()
RETURNS trigger AS
$BODY$
BEGIN
INSERT INTO ohlc_minute (created, open, high, low, close)
values (
date_trunc('minute', NEW.executed),
NEW.price,
NEW.price,
NEW.price,
NEW.price
)
ON CONFLICT (created)
DO UPDATE
SET high=greatest(ohlc_minute.high,EXCLUDED.high),
low=least(ohlc_minute.low,EXCLUDED.low),
close=EXCLUDED.close;
RETURN NEW;
END;
$BODY$
LANGUAGE plpgsql;
If they are not guaranteed to arrive in order, then I think your current solution would be about optimal, if you insist on having partial results available within the accruing minute.
I solved my own problem; the answer was the obvious one: a missing index. I created the index
CREATE INDEX IF NOT EXISTS execute_index ON tick (executed);
and CPU usage has fallen to an acceptable level, I would however still be interested to see optimized solutions.
If there's a queue of work to do in a table that is going to be periodically polled by a number of different worker clients... what's the best way to prevent each worker from getting the same item to work on?
Say a table like: ItemId, LastAttemptDateTime, AttemptCount, and various item details.
Given an index on LastAttemptDateTime sorted in ascending order, various clients query the table to grab an item to be worked on.
I use a stored procedure in MS SQL to do this...something like:
CREATE PROCEDURE GetNextQueueItem AS
SET NOCOUNT ON
DECLARE @ItemId INT

UPDATE myqueue
SET @ItemId = ItemId, AttemptCount = AttemptCount + 1, LastAttemptDateTime = GetDate()
WHERE ItemId = (SELECT TOP 1 ItemId
                FROM myqueue
                ORDER BY LastAttemptDateTime ASC)

SELECT ItemId, AttemptCount, and various item detail fields
FROM myqueue
WHERE ItemId = @ItemId
I'm fairly new to PostgreSQL and was wondering if there's alternate approaches available. (The TOP 1 will change to LIMIT 1.)
PostgreSQL equivalent could look like this:
CREATE OR REPLACE FUNCTION get_next_queue_item()
RETURNS SETOF myqueue AS
$BODY$
BEGIN
RETURN QUERY
UPDATE myqueue
SET attempt_count = attempt_count + 1
,last_attempt_ts = now()
WHERE item_id = (
SELECT item_id
FROM myqueue
ORDER BY last_attempt_ts
LIMIT 1
)
RETURNING myqueue.*;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
Major points
You only need 1 statement to do it all. UPDATE can return the updated row in the same command with the RETURNING clause.
The state of the row is post-update. There are ways to get the pre-update state if needed.
No need for any variables.
I changed all identifiers to lower case, which is the cleanest style in PostgreSQL.
I renamed your column LastAttemptDateTime to last_attempt_ts ('ts' for "timestamp", because that's the name of the timestamp / datetime type in Postgres).
As you mentioned yourself, LIMIT 1 instead of TOP 1.
I use RETURNS SETOF myqueue as return type.
myqueue is the associated row-type of the table myqueue - for every table or view a row-type of the same name is automatically created in PostgreSQL.
This declaration allows for multiple rows to be returned, but LIMIT 1 guarantees that it will only ever be one.
This return type allows for RETURN QUERY to return the resulting row directly without any intermediate step. Fast, clean.
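The function is then called like a table function:

SELECT * FROM get_next_queue_item();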
Actually, you don't need a plpgsql function at all. You can do it with a simple SQL statement:
UPDATE myqueue
SET attempt_count = attempt_count + 1
,last_attempt_ts = now()
WHERE item_id = (
SELECT item_id
FROM myqueue
ORDER BY last_attempt_ts
LIMIT 1
)
RETURNING myqueue.*;
Since PostgreSQL has sequences separate from the identity columns incremented with them, and they can be used for other things, one nice way is to have one sequence used to set the id on the table, and another for getting the item:
1. Look at the currval of the fetch sequence; if it's higher than or equal to the max id of the table, there are no items waiting.
2. Obtain nextval. If there is no item with a matching id, then loop back to 1 (this can happen if an insert to the table failed).
3. Obtain the row with the matching id.
This isn't the only way to skin this cat (and not the way I've used with other databases), but it has the advantage of being light on writes to the database (altering only the sequence, not the table). A sketch of the idea follows below.
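A minimal sketch of that idea, under assumed names: myqueue.item_id is fed by a sequence myqueue_item_id_seq, and a second, hypothetical sequence myqueue_fetch_seq tracks the last item handed out. Reading last_value from the sequence relations avoids currval's requirement that nextval was already called in the session:

CREATE OR REPLACE FUNCTION get_next_queue_item_seq()
RETURNS SETOF myqueue AS
$BODY$
DECLARE
  next_id bigint;
BEGIN
  LOOP
    -- Step 1: nothing is waiting if the fetch sequence has caught up
    -- with the id sequence.
    IF (SELECT last_value FROM myqueue_fetch_seq)
       >= (SELECT last_value FROM myqueue_item_id_seq) THEN
      RETURN;  -- empty result set: queue drained
    END IF;

    -- Step 2: claim the next id; two workers can never draw the same one.
    next_id := nextval('myqueue_fetch_seq');

    -- Step 3: return the row if it exists; on a gap (failed insert),
    -- FOUND is false and we loop back to step 1.
    RETURN QUERY SELECT * FROM myqueue WHERE item_id = next_id;
    IF FOUND THEN
      RETURN;
    END IF;
  END LOOP;
END;
$BODY$
LANGUAGE plpgsql;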
I want to define a trigger in PostgreSQL to check that the inserted row, on a generic table, has the property: "no other row exists with the same key in the same valid time" (the keys are sequenced keys). In fact, I have already implemented it. But since the trigger has to scan the entire table, I'm now wondering: is there a need for a table-level lock? Or is this managed somehow by PostgreSQL itself?
Here is an example.
In the upcoming PostgreSQL 9.0 I would have defined the table in this way:
CREATE TABLE medicinal_products
(
aic_code CHAR(9), -- sequenced key
full_name VARCHAR(255),
market_time PERIOD,
EXCLUDE USING gist
(aic_code WITH =,
market_time WITH &&)
);
but in fact I have defined it like this:
CREATE TABLE medicinal_products
(
PRIMARY KEY (aic_code, vs),
aic_code CHAR(9), -- sequenced key
full_name VARCHAR(255),
vs DATE NOT NULL,
ve DATE,
CONSTRAINT valid_time_range
CHECK (ve > vs OR ve IS NULL)
);
Then I have written a trigger that checks the constraint: "two distinct medicinal products can have the same code in two different periods, but not at the same time".
So the code:
INSERT INTO medicinal_products VALUES ('1','A','2010-01-01','2010-04-01');
INSERT INTO medicinal_products VALUES ('1','A','2010-03-01','2010-06-01');
fails with an error.
One solution is to have a second table to use for detecting clashes, and populate that with a trigger. Using the schema you added into the question:
CREATE TABLE medicinal_product_date_map(
aic_code char(9) NOT NULL,
applicable_date date NOT NULL,
UNIQUE(aic_code, applicable_date));
(Note: this is the second attempt, due to misreading your requirement the first time round. Hope it's right this time.)
Some functions to maintain this table:
CREATE FUNCTION add_medicinal_product_date_range(aic_code_in char(9), start_date date, end_date date)
RETURNS void STRICT VOLATILE LANGUAGE sql AS $$
INSERT INTO medicinal_product_date_map
SELECT $1, $2 + d
FROM generate_series(0, $3 - $2) AS g(d)
$$;
CREATE FUNCTION clr_medicinal_product_date_range(aic_code_in char(9), start_date date, end_date date)
RETURNS void STRICT VOLATILE LANGUAGE sql AS $$
DELETE FROM medicinal_product_date_map
WHERE aic_code = $1 AND applicable_date BETWEEN $2 AND $3
$$;
And populate the table the first time with:
SELECT count(add_medicinal_product_date_range(aic_code, vs, ve))
FROM medicinal_products;
Now create triggers to populate the date map after changes to medicinal_products: after insert calls add_, after update calls clr_ (old values) and add_ (new values), after delete calls clr_.
CREATE FUNCTION sync_medicinal_product_date_map()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
IF TG_OP = 'UPDATE' OR TG_OP = 'DELETE' THEN
PERFORM clr_medicinal_product_date_range(OLD.aic_code, OLD.vs, OLD.ve);
END IF;
IF TG_OP = 'UPDATE' OR TG_OP = 'INSERT' THEN
PERFORM add_medicinal_product_date_range(NEW.aic_code, NEW.vs, NEW.ve);
END IF;
RETURN NULL;
END;
$$;
CREATE TRIGGER sync_date_map
AFTER INSERT OR UPDATE OR DELETE ON medicinal_products
FOR EACH ROW EXECUTE PROCEDURE sync_medicinal_product_date_map();
The uniqueness constraint on medicinal_product_date_map will trap any products being added with the same code on the same day:
steve@steve[local] =# INSERT INTO medicinal_products VALUES ('1','A','2010-01-01','2010-04-01');
INSERT 0 1
steve@steve[local] =# INSERT INTO medicinal_products VALUES ('1','A','2010-03-01','2010-06-01');
ERROR: duplicate key value violates unique constraint "medicinal_product_date_map_aic_code_applicable_date_key"
DETAIL: Key (aic_code, applicable_date)=(1 , 2010-03-01) already exists.
CONTEXT: SQL function "add_medicinal_product_date_range" statement 1
SQL statement "SELECT add_medicinal_product_date_range(NEW.aic_code, NEW.vs, NEW.ve)"
PL/pgSQL function "sync_medicinal_product_date_map" line 6 at PERFORM
This depends on the values being checked having a discrete space, which is why I asked about dates vs timestamps. Although timestamps are technically discrete, since PostgreSQL only stores microsecond resolution, adding an entry to the map table for every microsecond the product is applicable for is not practical.
Having said that, you could probably also get away with something better than a full-table scan to check for overlapping timestamp intervals, with some trickery on looking for only the first interval not after or not before... however, for easy discrete spaces I prefer this approach which IME can also be handy for other things too (e.g. reports that need to quickly find which products are applicable on a certain day).
I also like this approach because it feels right to leverage the database's uniqueness-constraint mechanism this way. Also, I feel it will be more reliable in the context of concurrent updates to the master table: without locking the table against concurrent updates, it would be possible for a validation trigger to see no conflict and allow inserts in two concurrent sessions, which are then seen to conflict when both transactions' effects are visible.
Just a thought: if the valid time blocks could be coded with a number or something, creating a UNIQUE index on Id+TimeBlock would be blazingly fast and resolve all table lock problems. See the sketch below.
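A minimal sketch of that idea, assuming a hypothetical integer column time_block that numbers the discrete validity blocks (one row per block a product is valid in):

-- hypothetical column: the number of the validity block this row covers
ALTER TABLE medicinal_products ADD COLUMN time_block integer;
-- two products may never share a code within the same time block
CREATE UNIQUE INDEX medicinal_products_code_block_key
    ON medicinal_products (aic_code, time_block);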
It is managed by PostgreSQL itself. On a SELECT it acquires an ACCESS SHARE lock, which means that you can query the table but not perform updates.
A radical solution which might help you is to use a cache like Ehcache or memcached to store the id/timeblock info and not use PostgreSQL at all. Many caches can be persisted so they would survive a server restart, and they do not exhibit this locking behavior.
Why can't you use a UNIQUE constraint? It will be much faster (it's an index) and easier.