Try-parsing data types in PipelineDB and streaming failed records to an errors table? - postgresql

We were using PipelineDB to receive data into a streaming table and, via two streaming views, separate the records: one view filters out records that would fail typecasting validation, and the other filters in the records that fail typecasting. Ideally, we're trying to separate good from bad records and have them materialize into two final tables.
For example, a system was configured to receive data from a 3rd party in the format YYYY/MM/DD HH24:MI:SS, but for some reason values showed up where the day and month are flipped. In PipelineDB, the PostgreSQL expression to_timestamp(mycolumn,'YYYY/MM/DD HH24:MI:SS') will throw a hard error if the text in "mycolumn" is something like '2019/15/05 13:10:24', and any records fed into the stream within that transaction are rolled back. (So, if PG COPY was used, one record failing the materializing streaming view causes zero records to be inserted altogether. This is not an ideal scenario in data automation, where the 3rd-party automated system couldn't care less about our problems processing its data.)
From what I can see:
- PostgreSQL has no "native SQL" way of doing a "try-parse"
- PipelineDB does not support user-defined functions (e.g., if we wrote a function with two outputs, one returning the parsed value and the other returning a boolean "is_valid" column). (My assumption is that the function resides on the server, and PipelineDB executes as a foreign server, which is a different thing altogether.)
Ideally, a function would return the typecast value plus a boolean flag indicating whether it was valid, and it could be used in the WHERE clause of the streaming views to fork good records from bad records. But I can't seem to solve this. Any thoughts?
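(For reference, the kind of helper we had in mind would look something like the sketch below in plain plpgsql; the function name, default format and OUT columns are just illustrative.)
CREATE OR REPLACE FUNCTION try_parse_timestamp(
    p_value  text,
    p_format text DEFAULT 'YYYY/MM/DD HH24:MI:SS',
    OUT parsed   timestamp,
    OUT is_valid boolean)
AS $$
BEGIN
    parsed   := to_timestamp(p_value, p_format);
    is_valid := true;
EXCEPTION WHEN others THEN
    parsed   := NULL;
    is_valid := false;
END;
$$ LANGUAGE plpgsql;
-- SELECT * FROM try_parse_timestamp('2019/15/05 13:10:24');  --> parsed = NULL, is_valid = false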

After lots of time, I found a solution to this problem. I don't like it, but it will work.
The solution dawned on me after realizing that the entire problem is predicated on the following, from the PipelineDB docs:
http://docs.pipelinedb.com/continuous-transforms.html
"You can think of continuous transforms as being triggers on top of incoming streaming data where the trigger function is executed for each new row output by the continuous transform. Internally the function is executed as an AFTER INSERT FOR EACH ROW trigger so there is no OLD row and the NEW row contains the row output by the continuous transform."
I spent hours trying to figure out why the custom functions I wrote to "try-parse" data types for incoming data streams weren't working: nothing would show up in the materializing view or output table, and no hard errors were being thrown by PipelineDB. After a few hours, I realized the problem wasn't that PipelineDB couldn't handle user-defined functions, but rather that in a continuous transform, the transformation expressed as SQL happens AFTER THE ROW IS INSERTED. So, fundamentally, altering the typecasting of the data field within the materializing stream was failing before it started.
The solution (which is not very elegant) is to:
1 - move the typecasting logic, or any SQL logic that may result in an error, into the trigger function
2 - create an "EXCEPTION WHEN others THEN" section inside the trigger function
3 - make sure that RETURN NEW; happens in both the successful and the failed transformation cases
4 - make the continuous transform a mere passthrough that applies no logic; it exists only to call the trigger. (In which case it really negates the entire point of using PipelineDB, to some extent, for this initial data-staging problem. But I digress.)
With that, I created a table to capture the errors, and by ensuring that all of the steps listed above are implemented, we ensure that the transaction will be successful.
That is important because if that is not done and you get an exception inside the exception handler, or you don't handle an exception gracefully, then no records will be loaded.
This supports the strategic goal: we are just trying to make a data-processing "fork in the river" to route records that transform successfully into one table (or streaming table), and records that fail their transformation into an errors table.
Below I show a POC where we process the records as a stream and materialize them into a physical table (it could just as well have been another stream). The keys to this are realizing that:
- The errors table uses text columns
- The trigger function captures errors in the attempted transformation and writes them to the errors table with a basic description of the error reported back by the system.
I mention that I don't "like" the solution, but this was the best I could find in a few hours to get around the limitation of PipelineDB doing things as a trigger post-insert, so the failure on insert couldn't be caught, and PipelineDB didn't have an intrinsic capability built in to:
- continue processing the stream within a transaction on failure
- fail gracefully at the row level and provide an easier mechanism to route failed transformations to an errors table
DROP SCHEMA IF EXISTS pdb CASCADE;
CREATE SCHEMA IF NOT EXISTS pdb;
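-- Final table: successfully typecast records land here (timestamp/date columns are properly typed)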
DROP TABLE IF EXISTS pdb.lis_final;
CREATE TABLE pdb.lis_final(
edm___row_id bigint,
edm___created_dtz timestamp with time zone DEFAULT current_timestamp,
edm___updatedat_dtz timestamp with time zone DEFAULT current_timestamp,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt timestamp,
samplereceived_dt timestamp,
testperformed_dt timestamp,
testresultsreleased_dt timestamp,
extractedfromsourceat_dt timestamp,
birthdate_d date
);
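-- Errors table: all payload columns are text so any failing row can be captured verbatim, along with the error message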
DROP TABLE IF EXISTS pdb.lis_errors;
CREATE TABLE pdb.lis_errors(
edm___row_id bigint,
edm___errorat_dtz timestamp with time zone default current_timestamp,
edm___errormsg text,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt text,
samplereceived_dt text,
testperformed_dt text,
testresultsreleased_dt text,
extractedfromsourceat_dt text,
birthdate_d text
);
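-- The stream itself, declared as a foreign table on the pipelinedb server; all payload fields arrive as text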
DROP FOREIGN TABLE IF EXISTS pdb.lis_streaming_table CASCADE;
CREATE FOREIGN TABLE pdb.lis_streaming_table (
edm___row_id serial,
patient_id text,
encounter_id text,
order_id text,
sample_id text,
container_id text,
result_id text,
orderrequestcode text,
orderrequestname text,
testresultcode text,
testresultname text,
testresultcost text,
testordered_dt text,
samplereceived_dt text,
testperformed_dt text,
testresultsreleased_dt text,
extractedfromsourceat_dt text,
birthdate_d text
)
SERVER pipelinedb;
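-- Trigger function: attempts the typecasts and inserts into the final table; on any error it writes the raw row plus SQLERRM to the errors table. RETURN NEW in both paths keeps the transaction alive.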
CREATE OR REPLACE FUNCTION insert_into_t()
RETURNS trigger AS
$$
BEGIN
INSERT INTO pdb.lis_final
SELECT
NEW.edm___row_id,
current_timestamp as edm___created_dtz,
current_timestamp as edm___updatedat_dtz,
NEW.patient_id,
NEW.encounter_id,
NEW.order_id,
NEW.sample_id,
NEW.container_id,
NEW.result_id,
NEW.orderrequestcode,
NEW.orderrequestname,
NEW.testresultcode,
NEW.testresultname,
NEW.testresultcost,
to_timestamp(NEW.testordered_dt,'YYYY/MM/DD HH24:MI:SS') as testordered_dt,
to_timestamp(NEW.samplereceived_dt,'YYYY/MM/DD HH24:MI:SS') as samplereceived_dt,
to_timestamp(NEW.testperformed_dt,'YYYY/MM/DD HH24:MI:SS') as testperformed_dt,
to_timestamp(NEW.testresultsreleased_dt,'YYYY/MM/DD HH24:MI:SS') as testresultsreleased_dt,
to_timestamp(NEW.extractedfromsourceat_dt,'YYYY/MM/DD HH24:MI:SS') as extractedfromsourceat_dt,
to_date(NEW.birthdate_d,'YYYY/MM/DD') as birthdate_d;
-- Return NEW; nothing more needs to happen for the success case
RETURN NEW;
EXCEPTION WHEN others THEN
INSERT INTO pdb.lis_errors
SELECT
NEW.edm___row_id,
current_timestamp as edm___errorat_dtz,
SQLERRM as edm___errormsg,
NEW.patient_id,
NEW.encounter_id,
NEW.order_id,
NEW.sample_id,
NEW.container_id,
NEW.result_id,
NEW.orderrequestcode,
NEW.orderrequestname,
NEW.testresultcode,
NEW.testresultname,
NEW.testresultcost,
NEW.testordered_dt,
NEW.samplereceived_dt,
NEW.testperformed_dt,
NEW.testresultsreleased_dt,
NEW.extractedfromsourceat_dt,
NEW.birthdate_d;
-- Return NEW back to the streaming view so that process doesn't error; we already routed the record above to the errors table as text.
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
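-- Continuous transform: a pure passthrough whose only job is to invoke the trigger function for each incoming row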
DROP VIEW IF EXISTS pdb.lis_tryparse CASCADE;
CREATE VIEW pdb.lis_tryparse WITH (action=transform, outputfunc=insert_into_t) AS
SELECT
edm___row_id,
patient_id,
encounter_id,
order_id,
sample_id,
container_id,
result_id,
orderrequestcode,
orderrequestname,
testresultcode,
testresultname,
testresultcost,
testordered_dt,
samplereceived_dt,
testperformed_dt,
testresultsreleased_dt,
extractedfromsourceat_dt,
birthdate_d
FROM pdb.lis_streaming_table as st;
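To sanity-check the fork, an illustrative test (values made up) is to write one parseable row and one row with the day/month flipped into the stream; the first should land in pdb.lis_final and the second in pdb.lis_errors:
INSERT INTO pdb.lis_streaming_table (edm___row_id, patient_id, testordered_dt, samplereceived_dt, testperformed_dt, testresultsreleased_dt, extractedfromsourceat_dt, birthdate_d)
VALUES (1, 'p1', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '1980/01/31'),
       (2, 'p2', '2019/15/05 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '2019/05/15 13:10:24', '1980/01/31');
SELECT count(*) FROM pdb.lis_final;         -- expect the good row here
SELECT edm___errormsg FROM pdb.lis_errors;  -- expect the bad row here, with the to_timestamp error message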

Related

Does CLOCK_TIMESTAMP from a BEFORE trigger match log/commit order *exactly* in PG 12.3?

I've got a Postgres 12.3 question: Can I rely on CLOCK_TIMESTAMP() in a trigger to stamp an updated_dts timestamp in exactly the same order as changes are committed to the permanent data?
On the face of it, this might sound like kind of a silly question, but I just spent two days tracking down a super-rare race condition in a non-Postgres system that hinged on exactly this behavior. (Lagging commits made their 'last value seen' tracking data unreliable.) Now I'm trying to figure out if it's possible for CLOCK_TIMESTAMP() to not match the order of changes recorded in the WAL perfectly.
It's simple to see how this could occur with NOW/TRANSACTION_TIMESTAMP/CURRENT_TIMESTAMP as they're returning the transaction start time, not the completion time. It's pretty easy, in that case, to record a timestamp sequence where the stamps and log order don't agree. But I can't figure out if there's any chance for commits to be saved in a different order to the BEFORE trigger CLOCK_TIMESTAMP() values.
For background, we need a 100% reliable timeline for an external search to use. As I understand it, I can create one using logical replication, and a replication-target side trigger to stamp changes as they're replayed from the log. What I'm unclear on, is if it's possible to get the same fidelity from CLOCK_TIMESTAMP() on a single server.
I haven't got the chops to get deep into the Postgres internals, and see how requests are interleaved, nor how granular execution is, and am hoping that someone here knows definitively. If this is more of a question for one of the PG mailing lists, please let me know.
-- Thanks
Below is a bit of sample code for how I'm looking at building the timestamps. It works fine, but doesn't prove anything about behavior with lots of concurrent processes.
---------------------------------------------
-- Create the trigger function
---------------------------------------------
DROP FUNCTION IF EXISTS api.set_updated CASCADE;
CREATE OR REPLACE FUNCTION api.set_updated()
RETURNS TRIGGER
AS $BODY$
BEGIN
NEW.updated_dts = CLOCK_TIMESTAMP();
RETURN NEW;
END;
$BODY$
language plpgsql;
COMMENT ON FUNCTION api.set_updated() IS 'Sets updated_dts field to CLOCK_TIMESTAMP(), if the record has changed.';
---------------------------------------------
-- Create the table
---------------------------------------------
DROP TABLE IF EXISTS api.numbers;
CREATE TABLE api.numbers (
id uuid NOT NULL DEFAULT extensions.gen_random_uuid (),
number integer NOT NULL DEFAULT NULL,
updated_dts timestamptz NOT NULL DEFAULT 'epoch'::timestamptz
);
---------------------------------------------
-- Define the triggers (binding)
---------------------------------------------
-- NOTE: I'm guessing that in production I can use DEFAULT CLOCK_TIMESTAMP() instead of a BEFORE INSERT trigger.
-- I'm using a distinct DEFAULT value, as I want it to pop out if I'm not getting the trigger to fire.
CREATE TRIGGER trigger_api_number_before_insert
BEFORE INSERT ON api.numbers
FOR EACH ROW
EXECUTE PROCEDURE api.set_updated();
CREATE TRIGGER trigger_api_number_before_update
BEFORE UPDATE ON api.numbers
FOR EACH ROW
WHEN (OLD.* IS DISTINCT FROM NEW.*)
EXECUTE PROCEDURE api.set_updated();
---------------------------------------------
-- INSERT some data
---------------------------------------------
INSERT INTO api.numbers (number) values (1),(2),(3);
---------------------------------------------
-- Take a look
---------------------------------------------
SELECT * from api.numbers ORDER BY updated_dts ASC; -- The values should be listed as 1, 2, 3 as oldest to newest.
---------------------------------------------
-- UPDATE a row
---------------------------------------------
UPDATE api.numbers SET number = 11 where number = 1;
---------------------------------------------
-- Take a look
---------------------------------------------
SELECT * from api.numbers ORDER BY updated_dts ASC; -- The values should be listed as 2, 3, 11 as oldest to newest.
No, you cannot depend on clock_timestamp() order during trigger execution (or while evaluating a DEFAULT clause) being the same as commit order.
Commit will always happen later than the function call, and you cannot control how long it takes between them.
But I am surprised that that is a problem for you. Typically, the commit time is not visible or relevant. Why don't you simply accept the clock_timestamp() as the measure of things?
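To make that concrete, here is a hypothetical interleaving using the api.numbers table from the question, showing how clock_timestamp() order and commit order can diverge:
-- Session A
BEGIN;
INSERT INTO api.numbers (number) VALUES (1);   -- BEFORE trigger stamps updated_dts = t1
-- Session B, starting after A's trigger has fired
BEGIN;
INSERT INTO api.numbers (number) VALUES (2);   -- trigger stamps updated_dts = t2, where t2 > t1
COMMIT;                                        -- B's row is committed (and WAL-logged) first
-- Session A, some time later
COMMIT;                                        -- A commits last: commit order is B then A, but timestamp order is A then B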

Default date for insert doesn't change in continuos transformation

I created the table below.
create table foo
(
ibutton text NULL,
severidade int4 NULL,
dt_insercao timestamptz NULL DEFAULT now()
)
My insert:
insert into foo (ibutton, severidade)values ('aa', 4);
In every case, the value of dt_insercao, which should default to now(), comes out as '2017-06-08 10:35:35'...
I have no idea where this value comes from.
These inserts are executed inside my PipelineDB continuous transformation. When I execute the insert from my PGAdmin client, the date is correct.
Not sure how PipelineDB comes into play here, but in Postgres, now() returns the same value for all inserts in a single transaction:
Quote from the manual
Since these functions return the start time of the current transaction, their values do not change during the transaction. This is considered a feature: the intent is to allow a single transaction to have a consistent notion of the "current" time, so that multiple modifications within the same transaction bear the same time stamp.
If you need a different value for each row that is inserted in one transaction, use clock_timestamp() instead in your table definition.
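Applied to the table from the question, that change would look like this:
create table foo
(
ibutton text NULL,
severidade int4 NULL,
dt_insercao timestamptz NULL DEFAULT clock_timestamp() -- advances for every row, even within one transaction
);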

record "new" has no field "cure" postgreSQL

So here's the thing: I have two tables, apointments (with a single p) and medical_folder, and I get this:
ERROR: record "new" has no field "cure"
CONTEXT: SQL statement "insert into medical_folder(id,"patient_AMKA",cure,drug_id)
values(new.id,new."patient_AMKA",new.cure,new.drug_id)"
PL/pgSQL function new_medical() line 3 at SQL statement
create trigger example_trigger after insert on apointments
for each row execute procedure new_medical();
create or replace function new_medical()
returns trigger as $$
begin
if apointments.diagnosis is not null then
insert into medical_folder(id,"patient_AMKA",cure,drug_id)
values(new.id,new."patient_AMKA",new.cure,new.drug_id);
return new;
end if;
end;
$$ language plpgsql;
insert into apointments(id,time,"patient_AMKA","doctor_AMKA",diagnosis)
values('30000','2017-05-24 07:42:15','4017954515276','6304745877947815701','M3504');
I have checked multiple times and all of my tables and columns exist.
Please help! Thank you!
Table structures are:
create table medical_folder (
id bigInt,
patient bigInt,
cure text,
drug_id bigInt);
create table apointments (
id bigint,
time timestamp without time zone,
"patient_AMKA" bigInt,
"doctor_AMKA" bigInt);
I was facing the same issue.
Change:
values(new.id,new."patient_AMKA",new.cure,new.drug_id);
to:
values(new.id,new."patient_AMKA",new."cure",new."drug_id");
This error means that the table apointments (with one p) doesn't have a field named cure. The trigger fires when inserting an apointment, so "new" is an apointments row. Maybe cure is really part of the diagnosis?
The values for the second table are simply not available in the "new" row. You need a way to get and insert them, and using a trigger is not the easiest/cleanest way to go.
You can have your application do two inserts, one per table, and wrap them in a transaction to ensure they are both committed or rolled back together. Another option, which lets you better enforce data integrity, is to create a stored procedure that takes the values to be inserted into both tables and does the two inserts. You can go as far as forbidding users from writing to the tables directly, effectively leaving the stored procedure as the only way to insert the data.
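A rough sketch of that stored-procedure approach (the function name is made up, and the column lists follow the INSERT statements in the question, which don't entirely match the table definitions shown):
create or replace function add_apointment_with_folder(
    p_id bigint,
    p_time timestamp,
    p_patient_amka bigint,
    p_doctor_amka bigint,
    p_diagnosis text,
    p_cure text,
    p_drug_id bigint)
returns void as $$
begin
    -- both inserts run in the caller's transaction, so they commit or roll back together
    insert into apointments(id, time, "patient_AMKA", "doctor_AMKA", diagnosis)
    values (p_id, p_time, p_patient_amka, p_doctor_amka, p_diagnosis);
    if p_diagnosis is not null then
        insert into medical_folder(id, "patient_AMKA", cure, drug_id)
        values (p_id, p_patient_amka, p_cure, p_drug_id);
    end if;
end;
$$ language plpgsql;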

How to check a sequence efficiently for used and unused values in PostgreSQL

In PostgreSQL (9.3) I have a table defined as:
CREATE TABLE charts
( recid serial NOT NULL,
groupid text NOT NULL,
chart_number integer NOT NULL,
"timestamp" timestamp without time zone NOT NULL DEFAULT now(),
modified timestamp without time zone NOT NULL DEFAULT now(),
donotsee boolean,
CONSTRAINT pk_charts PRIMARY KEY (recid),
CONSTRAINT chart_groupid UNIQUE (groupid),
CONSTRAINT charts_ichart_key UNIQUE (chart_number)
);
CREATE TRIGGER update_modified
BEFORE UPDATE ON charts
FOR EACH ROW EXECUTE PROCEDURE update_modified();
I would like to replace the chart_number with a sequence like:
CREATE SEQUENCE charts_chartnumber_seq START 16047;
So that, by trigger or function, adding a new chart record automatically generates a new chart number in ascending order. However, no existing chart record can have its chart number changed, and over the years there have been skips in the assigned chart numbers. Hence, before assigning a new chart number to a new chart record, I need to be sure that the "new" chart number has not yet been used and that no existing chart record gets assigned a different number.
How can this be done?
Consider not doing it. Read these related answers first:
Gap-less sequence where multiple transactions with multiple tables are involved
Compacting a sequence in PostgreSQL
If you still insist on filling in gaps, here is a rather efficient solution:
1. To avoid searching large parts of the table for the next missing chart_number, create a helper table with all current gaps once:
CREATE TABLE chart_gap AS
SELECT chart_number
FROM generate_series(1, (SELECT max(chart_number) - 1 -- max is no gap
FROM charts)) chart_number
LEFT JOIN charts c USING (chart_number)
WHERE c.chart_number IS NULL;
2. Set charts_chartnumber_seq to the current maximum and convert chart_number to an actual serial column:
SELECT setval('charts_chartnumber_seq', max(chart_number)) FROM charts;
ALTER TABLE charts
ALTER COLUMN chart_number SET NOT NULL
, ALTER COLUMN chart_number SET DEFAULT nextval('charts_chartnumber_seq');
ALTER SEQUENCE charts_chartnumber_seq OWNED BY charts.chart_number;
Details:
How to reset postgres' primary key sequence when it falls out of sync?
Safely and cleanly rename tables that use serial primary key columns in Postgres?
3. While chart_gap is not empty fetch the next chart_number from there.
To resolve possible race conditions with concurrent transactions, without making transactions wait, use advisory locks:
WITH sel AS (
SELECT chart_number, ... -- other input values
FROM chart_gap
WHERE pg_try_advisory_xact_lock(chart_number)
LIMIT 1
FOR UPDATE
)
, ins AS (
INSERT INTO charts (chart_number, ...) -- other target columns
TABLE sel
RETURNING chart_number
)
DELETE FROM chart_gap c
USING ins i
WHERE i.chart_number = c.chart_number;
Alternatively, Postgres 9.5 or later has the handy FOR UPDATE SKIP LOCKED to make this simpler and faster:
...
SELECT chart_number, ... -- other input values
FROM chart_gap
LIMIT 1
FOR UPDATE SKIP LOCKED
...
Detailed explanation:
Postgres UPDATE ... LIMIT 1
Check the result. Once all rows are filled in, this returns 0 rows affected. (you could check in plpgsql with IF NOT FOUND THEN ...). Then switch to a simple INSERT:
INSERT INTO charts (...) -- don't list chart_number
VALUES (...); -- don't provide chart_number
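Tying step 3 and the fallback INSERT together, a wrapper function could look roughly like this simplified sketch (it assumes charts only needs a groupid besides chart_number, and uses the 9.5+ SKIP LOCKED variant rather than advisory locks):
CREATE OR REPLACE FUNCTION insert_chart(_groupid text)
RETURNS integer AS
$$
DECLARE
   _chart_number int;
BEGIN
   -- grab a free number from the gap table, if any; SKIP LOCKED skips rows claimed by concurrent sessions
   SELECT chart_number INTO _chart_number
   FROM   chart_gap
   ORDER  BY chart_number
   LIMIT  1
   FOR    UPDATE SKIP LOCKED;
   IF FOUND THEN
      INSERT INTO charts (chart_number, groupid) VALUES (_chart_number, _groupid);
      DELETE FROM chart_gap WHERE chart_number = _chart_number;
   ELSE
      -- no gaps left: fall back to the serial default set up in step 2
      INSERT INTO charts (groupid) VALUES (_groupid)
      RETURNING chart_number INTO _chart_number;
   END IF;
   RETURN _chart_number;
END
$$ LANGUAGE plpgsql;
-- usage: SELECT insert_chart('some group');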
In PostgreSQL, a SEQUENCE ensures the two requirements you mention, that is:
No repeats
No changes once assigned
But because of how a SEQUENCE works (see manual), it can not ensure no-skips. Among others, the first two reasons that come to mind are:
How a SEQUENCE handles concurrent blocks with INSERTS (you could also add that the concept of Cache also makes this impossible)
Also, user triggered DELETEs are an uncontrollable aspect that a SEQUENCE can not handle by itself.
In both cases, if you still do not want skips, (and if you really know what you're doing) you should have a separate structure that assign IDs (instead of using SEQUENCE). Basically a system that has a list of 'assignable' IDs stored in a TABLE that has a function to pop out IDs in a FIFO way. That should allow you to control DELETEs etc.
But again, this should be attempted only if you really know what you're doing! There's a reason why people don't roll their own SEQUENCEs. There are hard corner cases (e.g. concurrent INSERTs), and most probably you're over-engineering your problem, which can probably be solved in a much better / cleaner way.
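As a rough illustration of that idea (table and function names are made up; this is not a drop-in replacement for a SEQUENCE):
CREATE TABLE assignable_ids (
   id integer PRIMARY KEY
);
CREATE OR REPLACE FUNCTION pop_assignable_id()
RETURNS integer AS
$$
   -- hand out the lowest available ID; SKIP LOCKED (9.5+) keeps concurrent callers from blocking each other
   DELETE FROM assignable_ids
   WHERE id = (SELECT id
               FROM assignable_ids
               ORDER BY id
               LIMIT 1
               FOR UPDATE SKIP LOCKED)
   RETURNING id;
$$ LANGUAGE sql;
Rows would be added to assignable_ids by whatever process tracks deletions and skips.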
Sequence numbers usually have no meaning, so why worry? But if you really want this, then follow the below, cumbersome procedure. Note that it is not efficient; the only efficient option is to forget about the holes and use the sequence.
In order to avoid having to scan the charts table on every insert, you should scan the table once and store the unused chart_number values in a separate table:
CREATE TABLE charts_unused_chart_number AS
SELECT seq.unused
FROM (SELECT max(chart_number) FROM charts) mx,
generate_series(1, mx.max) seq(unused)
LEFT JOIN charts ON charts.chart_number = seq.unused
WHERE charts.recid IS NULL;
The above query generates a contiguous series of numbers from 1 to the current maximum chart_number value, then LEFT JOINs the charts table to it and finds the entries where there is no corresponding charts data, meaning that value of the series is unused as a chart_number.
Next you create a trigger that fires on an INSERT on the charts table. In the trigger function, pick a value from the table created in the step above:
CREATE FUNCTION pick_unused_chart_number() RETURNS trigger AS $$
BEGIN
-- Get an unused chart number
SELECT unused INTO NEW.chart_number FROM charts_unused_chart_number LIMIT 1;
-- If the table is empty, get one from the sequence
IF NOT FOUND THEN
NEW.chart_number := nextval('charts_chartnumber_seq');
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_charts_cn
BEFORE INSERT ON charts
FOR EACH ROW EXECUTE PROCEDURE pick_unused_chart_number();
Easy. But the INSERT may fail because of some other trigger aborting the procedure or any other reason. So you need a check to ascertain that the chart_number was indeed inserted:
CREATE FUNCTION verify_chart_number() RETURNS trigger AS $$
BEGIN
-- If you get here, the INSERT was successful, so delete the chart_number
-- from the temporary table.
DELETE FROM charts_unused_chart_number WHERE unused = NEW.chart_number;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_charts_verify
AFTER INSERT ON charts
FOR EACH ROW EXECUTE PROCEDURE verify_chart_number();
At a certain point the table with unused chart numbers will be empty, whereupon you can (1) ALTER TABLE charts to use the sequence instead of an integer for chart_number; (2) delete the two triggers; and (3) drop the table with unused chart numbers; all in a single transaction.
While what you want is possible, it can't be done using only a SEQUENCE and it requires an exclusive lock on the table, or a retry loop, to work.
You'll need to:
LOCK thetable IN EXCLUSIVE MODE
Find the first free ID by querying for the max id then doing a left join over generate_series to find the first free entry. If there is one.
If there is a free entry, insert it.
If there is no free entry, call nextval and return the result.
Performance will be absolutely horrible, and transactions will be serialized. There'll be no concurrency. Also, unless the LOCK is the first thing you run that affects that table, you'll face deadlocks that cause transaction aborts.
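A rough sketch of that lock-then-scan approach, using the charts table from the question (not recommended, for the reasons above; 'some group' is a placeholder for the required groupid):
BEGIN;
LOCK TABLE charts IN EXCLUSIVE MODE;
-- take the lowest unused chart_number if there is one, otherwise fall back to the sequence
INSERT INTO charts (chart_number, groupid)
SELECT COALESCE(
         (SELECT g.n
          FROM   generate_series(1, (SELECT max(chart_number) FROM charts)) AS g(n)
          LEFT   JOIN charts c ON c.chart_number = g.n
          WHERE  c.chart_number IS NULL
          ORDER  BY g.n
          LIMIT  1),
         nextval('charts_chartnumber_seq')::int
       ),
       'some group';
COMMIT;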
You can make this less bad by using an AFTER DELETE ... FOR EACH ROW trigger that keeps track of entries you delete by INSERTing them into a one-column table of spare IDs. You can then SELECT the lowest ID from that table in your ID-assignment function (or in the column's DEFAULT expression), avoiding the need for the explicit table lock, the left join on generate_series, and the max call. Transactions will still be serialized on a lock on the free-IDs table. In PostgreSQL you can even solve that using SELECT ... FOR UPDATE SKIP LOCKED. So if you're on 9.5 you can actually make this non-awful, though it'll still be slow.
I strongly advise you to just use a SEQUENCE directly, and not bother with re-using values.

PostgreSQL, triggers, and concurrency to enforce a temporal key

I want to define a trigger in PostgreSQL to check that the inserted row, on a generic table, has the property: "no other row exists with the same key in the same valid time" (the keys are sequenced keys). In fact, I have already implemented it. But since the trigger has to scan the entire table, now I'm wondering: is there a need for a table-level lock? Or is this managed somehow by PostgreSQL itself?
Here is an example.
In the upcoming PostgreSQL 9.0 I would have defined the table in this way:
CREATE TABLE medicinal_products
(
aic_code CHAR(9), -- sequenced key
full_name VARCHAR(255),
market_time PERIOD,
EXCLUDE USING gist
(aic_code CHECK WITH =,
market_time CHECK WITH &&)
);
but in fact I have defined it like this:
CREATE TABLE medicinal_products
(
PRIMARY KEY (aic_code, vs),
aic_code CHAR(9), -- sequenced key
full_name VARCHAR(255),
vs DATE NOT NULL,
ve DATE,
CONSTRAINT valid_time_range
CHECK (ve > vs OR ve IS NULL)
);
Then, I wrote a trigger that checks the constraint: "two distinct medicinal products can have the same code in two different periods, but not at the same time".
So the code:
INSERT INTO medicinal_products VALUES ('1','A','2010-01-01','2010-04-01');
INSERT INTO medicinal_products VALUES ('1','A','2010-03-01','2010-06-01');
should raise an error, because the second period overlaps the first.
One solution is to have a second table to use for detecting clashes, and populate that with a trigger. Using the schema you added into the question:
CREATE TABLE medicinal_product_date_map(
aic_code char(9) NOT NULL,
applicable_date date NOT NULL,
UNIQUE(aic_code, applicable_date));
(note: this is the second attempt due to misreading your requirement the first time round. hope it's right this time).
Some functions to maintain this table:
CREATE FUNCTION add_medicinal_product_date_range(aic_code_in char(9), start_date date, end_date date)
RETURNS void STRICT VOLATILE LANGUAGE sql AS $$
INSERT INTO medicinal_product_date_map
SELECT $1, $2 + d.offs
FROM generate_series(0, $3 - $2) AS d(offs)
$$;
CREATE FUNCTION clr_medicinal_product_date_range(aic_code_in char(9), start_date date, end_date date)
RETURNS void STRICT VOLATILE LANGUAGE sql AS $$
DELETE FROM medicinal_product_date_map
WHERE aic_code = $1 AND applicable_date BETWEEN $2 AND $3
$$;
And populate the table first time with:
SELECT count(add_medicinal_product_date_range(aic_code, vs, ve))
FROM medicinal_products;
Now create triggers to populate the date map after changes to medicinal_products: after insert calls add_, after update calls clr_ (old values) and add_ (new values), after delete calls clr_.
CREATE FUNCTION sync_medicinal_product_date_map()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
IF TG_OP = 'UPDATE' OR TG_OP = 'DELETE' THEN
PERFORM clr_medicinal_product_date_range(OLD.aic_code, OLD.vs, OLD.ve);
END IF;
IF TG_OP = 'UPDATE' OR TG_OP = 'INSERT' THEN
PERFORM add_medicinal_product_date_range(NEW.aic_code, NEW.vs, NEW.ve);
END IF;
RETURN NULL;
END;
$$;
CREATE TRIGGER sync_date_map
AFTER INSERT OR UPDATE OR DELETE ON medicinal_products
FOR EACH ROW EXECUTE PROCEDURE sync_medicinal_product_date_map();
The uniqueness constraint on medicinal_product_date_map will trap any products being added with the same code on the same day:
steve@steve[local] =# INSERT INTO medicinal_products VALUES ('1','A','2010-01-01','2010-04-01');
INSERT 0 1
steve@steve[local] =# INSERT INTO medicinal_products VALUES ('1','A','2010-03-01','2010-06-01');
ERROR: duplicate key value violates unique constraint "medicinal_product_date_map_aic_code_applicable_date_key"
DETAIL: Key (aic_code, applicable_date)=(1 , 2010-03-01) already exists.
CONTEXT: SQL function "add_medicinal_product_date_range" statement 1
SQL statement "SELECT add_medicinal_product_date_range(NEW.aic_code, NEW.vs, NEW.ve)"
PL/pgSQL function "sync_medicinal_product_date_map" line 6 at PERFORM
This depends on the values being checked having a discrete space - which is why I asked about dates vs timestamps. Although timestamps are, technically, discrete since PostgreSQL only stores microsecond resolution, adding an entry to the map table for every microsecond the product is applicable for is not practical.
Having said that, you could probably also get away with something better than a full-table scan to check for overlapping timestamp intervals, with some trickery on looking for only the first interval not after or not before... however, for easy discrete spaces I prefer this approach which IME can also be handy for other things too (e.g. reports that need to quickly find which products are applicable on a certain day).
I also like this approach because it feels right to leverage the database's uniqueness-constraint mechanism this way. Also, I feel it will be more reliable in the context of concurrent updates to the master table: without locking the table against concurrent updates, it would be possible for a validation trigger to see no conflict and allow inserts in two concurrent sessions that are then seen to conflict once both transactions' effects are visible.
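For completeness: on later PostgreSQL versions (9.2+) the rule from the question can be declared directly with built-in range types instead of the draft PERIOD type, avoiding both the trigger and the date-map table. A sketch (btree_gist is needed to combine = and && in one exclusion constraint):
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE medicinal_products
(
aic_code CHAR(9),
full_name VARCHAR(255),
vs DATE NOT NULL,
ve DATE,
EXCLUDE USING gist (aic_code WITH =, daterange(vs, ve, '[)') WITH &&)
);
A NULL ve is treated as an open-ended upper bound, matching the intent of the original CHECK (ve > vs OR ve IS NULL).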
Just a thought, in case the valid time blocks could be coded with a number or something, creating a UNIQUE index on Id+TimeBlock would be blazingly fast and resolve all table lock problems.
It is managed by PostgreSQL itself. On a SELECT it acquires an ACCESS SHARE lock, which means that you can query the table but not perform updates.
A radical solution which might help you is to use a cache like ehcache or memcached to store the id/timeblock info and not use the postgresql at all. Many can be persisted so they would survive a server restart and they do not exhibit this locking behavior.
Why can't you use a UNIQUE constraint? Will be much faster (it's an index) and easier.