I'd like to create a web service that allows a client to fetch all rows in a table, and then later allows the client to only fetch new or updated rows.
The simplest implementation seems to be to send the current timestamp to the client, and then have the client ask for rows that are newer than the timestamp in the following request.
It seems that this is doable by keeping an "updated_at" column with a timestamp set to NOW() in update and insert triggers, and then querying newer rows, and also passing down the value of NOW().
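For concreteness, the setup I have in mind is roughly this (just a sketch; "items" and "updated_at" are placeholder names):

ALTER TABLE items ADD COLUMN updated_at timestamptz NOT NULL DEFAULT now();

CREATE FUNCTION items_touch_updated_at() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    NEW.updated_at := now();
    RETURN NEW;
END
$$;

CREATE TRIGGER items_touch_updated_at
BEFORE INSERT OR UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE items_touch_updated_at();

-- each poll then runs something like:
-- SELECT * FROM items WHERE updated_at > $last_seen;  -- $last_seen = the NOW() value sent previously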
The problem is that if there are uncommitted transactions, these transactions will set updated_at to the start time of the transaction, not the commit time.
As a result, this simple implementation doesn't work, because rows can be lost since they can appear with a timestamp in the past.
I have been unable to find any simple solution to this problem, despite the fact that it seems to be a very common need: any ideas?
Possible solutions:
Keep a monotonic timestamp in a table, update it at the start of every transaction to MAX(NOW(), last_timestamp + 1), and use it as the row timestamp (see the sketch below). Problem: this effectively means that all write transactions are fully serialized and lock the whole database, since they all conflict on the single row in the timestamp table.
At the end of the transaction, add a mapping from NOW() to a monotonic time in a table like the one in the previous solution. This seems to require taking an explicit lock and using a sequence to generate non-temporal "timestamps", because just using an UPDATE on a single row would cause rollbacks in SERIALIZABLE mode.
Somehow have PostgreSQL, at commit time, iterate over all updated rows and set updated_at to a monotonic timestamp
Somehow have PostgreSQL itself maintain a table of transaction commit times, which it doesn't seem to do at the moment
Using the built-in xmin column also seems impossible, because VACUUM can trash it.
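To make the first solution (the monotonic timestamp table) concrete, this is roughly what I have in mind (names are made up):

CREATE TABLE sync_clock (last_ts timestamptz NOT NULL);
INSERT INTO sync_clock VALUES (now());

-- at the start of every write transaction:
UPDATE sync_clock
SET last_ts = greatest(now(), last_ts + interval '1 microsecond')
RETURNING last_ts;  -- use the returned value as updated_at for every row written

The drawback is exactly the one described above: every write transaction serializes on that single row.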
It would be nice to be able to do this in the database without modifications to all updates in the application.
What is the usual way this is done?
The problem with the naive solution
In case it's not obvious, this is the problem with using NOW() or CLOCK_TIMESTAMP():
At time 1, we run NOW() or CLOCK_TIMESTAMP() in a transaction, which gives 1 and we update a row setting time 1 as the update time
At time 2, a client fetches all rows, and we tell it that it has received all rows up to time 2
At time 3, the transaction commits with "time 1" in the updated_at field
The client asks for rows updated since time 2 (the time it got from the previous full-fetch request); we query for updated_at >= 2 and return nothing, instead of returning the row that was just committed
That row is lost and will never be seen by the client
Your whole proposition goes against some of the underlying fundamentals of an ACID-compliant RDBMS like PostgreSQL. The time of transaction start (e.g. current_timestamp) and other time-based metrics are meaningless as a measure of what a particular client has or has not received. Abandon the whole idea.
Assuming that your clients connect through a persistent session to the database you can follow this procedure:
When the session starts, create a TEMP TABLE for the session user (temporary tables are not WAL-logged, so there is no need for UNLOGGED). This table contains nothing but the PK and the last update time of the table you want to fetch the data from; a sketch follows after this list.
The client polls for new data and receives only those records that have a PK not yet in the temp table, or an existing PK with a newer last update time. Currently uncommitted transactions are invisible, but they will be retrieved at the next poll for new or updated records. The update time is required because there is no way to delete records from the temp tables of all concurrent clients.
The PK and last update time of each retrieved record are stored in the temp table.
When the user closes the session, the temp table is deleted.
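A minimal sketch of the above, assuming the data table is "items" with primary key "id" and an "updated_at" column (all names are placeholders):

CREATE TEMP TABLE fetched (
    id        bigint PRIMARY KEY,
    last_seen timestamptz NOT NULL
);

-- one poll: return unseen or newer rows and remember what was sent, in a single statement
WITH new_rows AS (
    SELECT i.*
    FROM items i
    LEFT JOIN fetched f ON f.id = i.id
    WHERE f.id IS NULL
       OR i.updated_at > f.last_seen
), remember AS (
    INSERT INTO fetched (id, last_seen)
    SELECT id, updated_at FROM new_rows
    ON CONFLICT (id) DO UPDATE SET last_seen = EXCLUDED.last_seen
)
SELECT * FROM new_rows;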
If you want to persist the retrieved records over multiple sessions for each client, or if the client disconnects after every query, then you need a regular table instead. In that case I would suggest also adding the oid of the user, so that all users can share a single table for keeping track of the retrieved records. You can then create an AFTER UPDATE trigger on the table with your data which deletes the PK from the table of fetched records for all users in one sweep; on their next poll the clients will then get the updated record.
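A sketch of this persistent variant, again with placeholder names ("items" is the data table):

CREATE TABLE fetched (
    user_oid oid    NOT NULL,
    id       bigint NOT NULL,
    PRIMARY KEY (user_oid, id)
);

CREATE FUNCTION forget_fetched() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    DELETE FROM fetched WHERE id = NEW.id;  -- removes the mark for all users in one sweep
    RETURN NEW;
END
$$;

CREATE TRIGGER items_forget_fetched
AFTER UPDATE ON items
FOR EACH ROW EXECUTE PROCEDURE forget_fetched();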
Add a column, which will be used to track what record has been sent to a client:
alter table table_under_view
add column access_order int null;
create sequence table_under_view_access_order_seq
owned by table_under_view.access_order;
create function table_under_view_reset_access_order()
returns trigger
language plpgsql
as $func$
begin
  -- clear access_order so the row will be assigned a new value on the next fetch
  new.access_order := null;
  return new;
end
$func$;
create trigger table_under_view_reset_access_order_before_update
before update on table_under_view
for each row execute procedure table_under_view_reset_access_order();
create index table_under_view_access_order_idx
on table_under_view (access_order);
create index table_under_view_access_order_where_null_idx
on table_under_view (access_order)
where (access_order is null);
(You could use a before insert on table_under_view trigger too, to ensure only NULL values are inserted into access_order).
You need to update this column after transactions with INSERTs & UPDATEs on this table have finished, but before any client queries your data. There is no way to run something just after a transaction finishes, so let's do it just before a query happens instead. You can do this with a function, for example:
create function table_under_access(from_access int)
returns setof table_under_view
language sql
as $func$
update table_under_view
set access_order = nextval('table_under_view_access_order_seq'::regclass)
where access_order is null;
select *
from table_under_view
where access_order > from_access;
$func$;
Now, your first "chunk" of data (which will fetch all rows in a table), looks like:
select *
from table_under_access(0);
The key element after this is that your client needs to process every "chunk" of data to determine the greatest access_order it received (unless you include it in the result with e.g. window functions, but if you're going to process the results anyway, which seems highly likely, you don't need that). Always use that value for the subsequent calls, as in the example below.
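For example, if the greatest access_order in the previous chunk was 1234 (an arbitrary example value), the next call would be:

select *
from table_under_access(1234);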
You can add an updated_at column too for ordering your results, if you want to.
You can also use a view + rule(s) for the last part (instead of the function), to make it more transparent.
Related
Is it possible to perform a query that will SELECT for some values and if those values do not exist, perform an INSERT and return the very same values - in a single query?
Background:
I am writing an application with a great deal of concurrency. At one point a function will check the database to see if a certain key value exists, using SELECT. If the key exists, the function can safely exit. If the value does not exist, the function will perform a REST API call to capture the necessary data, then INSERT the values into the database.
This works fine until it is run concurrently. Two threads (I am using Go, so goroutines) will each independently run the SELECT. Since both queries report that the key does not exist, both will independently perform the REST API call and both will attempt to INSERT the values.
Currently, I avoid double-insertion by using a duplicate constraint. However, I would like to avoid even the double API call by having the first query SELECT for the key value and if it does not exist, INSERT a placeholder - then return those values. This way, subsequent SELECT queries report that the key value already exists and will not perform the API calls or INSERT.
In Pseudo-code, something like this:
SELECT values FROM my_table WHERE key=KEY_HERE;
if found:
    RETURN SELECTED VALUES;
if not found:
    INSERT values, key VALUES(random_placeholder, KEY_HERE) INTO table;
    SELECT values from my_table WHERE key=KEY_HERE;
The application code will insert a random value so that a routine/thread can determine if it was the one that generated the new INSERT and will subsequently go ahead and perform the Rest API call.
This is for a Go application using the pgx library.
Thanks!
You could write a stored procedure, and it would be a single query for the client to execute; PostgreSQL, of course, would still execute multiple statements. PostgreSQL's INSERT statement can return values with the RETURNING keyword, so you may not need the second SELECT.
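If you are on PostgreSQL 9.5 or later, one possible single-statement shape is INSERT ... ON CONFLICT ... RETURNING (a sketch only; my_table, key and value are placeholder names, and it assumes a unique constraint on key):

INSERT INTO my_table (key, value)
VALUES ($1, $2)                  -- $2 is the random placeholder value
ON CONFLICT (key) DO UPDATE
SET value = my_table.value       -- no-op update so RETURNING always yields a row
RETURNING value;

If the returned value equals the placeholder it just sent, the caller knows it was the one that created the row and should go on to make the API call.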
Lock the table in an appropriate lock mode.
For example in the strictest possible mode ACCESS EXCLUSIVE:
BEGIN TRANSACTION;
LOCK elbat
IN ACCESS EXCLUSIVE MODE;
SELECT *
FROM elbat
WHERE id = 1;
-- if no row was returned, make the API call and
INSERT INTO elbat
(id,
<other columns>)
VALUES (1,
<API call return values>);
COMMIT;
-- return the values to the application
Once one transaction has acquired the ACCESS EXCLUSIVE lock, no other transaction can even read from the table until the acquiring transaction ends. And ACCESS EXCLUSIVE won't be granted while any other (even weaker) lock is held. That way the instance of your component that gets the lock first will do the check and the INSERT if necessary. The other one will be blocked in the meantime, and by the time it finally gets access, the INSERT has already been done by the first transaction, so it no longer needs to make the API call (unless the first transaction fails for some reason and is rolled back).
If this is too strict for your use case, you may need to find out which lock level might be appropriate for you. Maybe, if you can make any component accessing the database (or at least the table) cooperative (and it sounds like you can do this), even advisory locks are enough.
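For example, with transaction-scoped advisory locks (here the lock key is just a hash of the lookup key; adapt as needed):

BEGIN TRANSACTION;
-- serialize only the workers handling this particular key, not the whole table
SELECT pg_advisory_xact_lock(hashtext('KEY_HERE'));
-- SELECT the row; if it is missing, make the API call and INSERT as above
COMMIT;  -- the advisory lock is released automatically at commit or rollback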
I want to create a table that has a column updated_date that is updated to SYSDATE every time any field in that row is updated. How should I do this in Redshift?
You should create the table with a definition like the one below; that will make sure that whenever you insert a record, SYSDATE is populated.
create table test(
id integer not null,
update_at timestamp DEFAULT SYSDATE);
What about updating it every time a field is updated?
Remember, Redshift is a DW (data warehouse) solution, not a regular OLTP database, so updates should be avoided or minimized.
UPDATE = DELETE + INSERT
Ideally, instead of updating a record, you should delete and re-insert it; the column default then takes care of populating update_at, since an update is effectively DELETE + INSERT anyway.
Also, in most ETL setups you may be using an stg_sales staging table to populate your data; in that case the above solution still works, and you could do something like below.
DELETE from SALES where id in (select Id from stg_sales);
INSERT INTO SALES select id from stg_sales;
Hope this answers your question.
Redshift doesn't support UPSERTs, so you should load your data into a temporary/staging table first and check for IDs in the main table that also exist in the staging table (i.e. which need to be updated).
Delete those records, and INSERT the data from the staging table, which will have the new updated_date.
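A sketch of that staging flow, with placeholder table and column names:

BEGIN;

DELETE FROM sales
USING stg_sales
WHERE sales.id = stg_sales.id;

INSERT INTO sales (id, amount, updated_date)
SELECT id, amount, SYSDATE
FROM stg_sales;

COMMIT;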
Also, don't forget to run VACUUM on your tables every once in a while, because your use case involves a lot of DELETEs and UPDATEs.
Refer to this for additional info.
In PostgreSQL: multiple sessions each want to get one record from the table, but we need to make sure they don't interfere with each other. I could do it using a message queue: put the data in a queue, and then let each session get data from the queue. But is it doable in PostgreSQL? It would be easier for the SQL guys to call a stored procedure. Is there any way to configure a stored procedure so that no concurrent calls will happen, or to use some special lock?
I would recommend making sure the stored procedure uses SELECT FOR UPDATE, which should prevent the same row in the table from being accessed by multiple transactions.
Per the Postgres doc:
FOR UPDATE causes the rows retrieved by the SELECT statement to be
locked as though for update. This prevents them from being modified or
deleted by other transactions until the current transaction ends. That
is, other transactions that attempt UPDATE, DELETE, SELECT FOR UPDATE,
SELECT FOR NO KEY UPDATE, SELECT FOR SHARE or SELECT FOR KEY SHARE of
these rows will be blocked until the current transaction ends. The FOR
UPDATE lock mode is also acquired by any DELETE on a row, and also by
an UPDATE that modifies the values on certain columns. Currently, the
set of columns considered for the UPDATE case are those that have an
unique index on them that can be used in a foreign key (so partial
indexes and expressional indexes are not considered), but this may
change in the future.
More SELECT info.
So that you don't end up locking all of the rows in the table at once (i.e. by SELECTing all of the records), I would recommend using ORDER BY to sort the table in a consistent manner and then LIMIT 1, so that each call only gets the next one in the queue. Also add a WHERE clause that checks for a certain column value (e.g. processed), and once a row is processed, set that column to a value that will prevent the WHERE clause from picking it up again (see the sketch below).
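A sketch of that query (assuming a jobs table with an id primary key and a boolean processed column; names are placeholders):

BEGIN;

SELECT *
FROM jobs
WHERE processed = false
ORDER BY id
LIMIT 1
FOR UPDATE;  -- on 9.5+ you can add SKIP LOCKED so waiting sessions take the next row instead

-- ... do the work for the returned row, then mark it as done:
UPDATE jobs SET processed = true WHERE id = <id just fetched>;

COMMIT;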
I am trying to link a remote database to a local database using dblink. What I want to achieve here is:
I want to fetch only the data from the latest row(s), every 10 seconds, from a table in the remote database.
I wish to insert that data into a pre-existing table in the local database. In this case, I wish to insert the data I collected from the remote database, plus some other values such as a primary key and some sequences which I don't get from the remote database, into the table.
Any suggestions would be appreciated.
You'd need an indexed column in the remote table that increases every time a row is inserted or updated there (not a timestamp, as many rows can have the same timestamp, a computer clock can sometimes go backwards, etc.). If you never update, then a serial primary key would suffice (enforce that updates are not allowed with a trigger if you decide to rely on this; a sketch of such a trigger follows). Deletes would also not be synchronized, so I'd recommend enforcing that they're not allowed with a trigger too.
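A sketch of such a guard trigger (names are placeholders):

create function remote_tablename_forbid_change() returns trigger
language plpgsql as $$
begin
  raise exception 'updates and deletes are not allowed on %', tg_table_name;
end
$$;

create trigger remote_tablename_forbid_change
before update or delete on remote_tablename
for each row execute procedure remote_tablename_forbid_change();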
You'd need a cron (or scheduler on Windows) job that will connect to your database and perform synchronization, as PostgreSQL has no mechanism for periodic tasks.
This job would just do:
start transaction;
lock table local_tablename in exclusive mode;
select dblink_connect(…);
insert into local_tablename (id, data, row_counter)
select * from dblink(
'select id, data, row_counter from remote_tablename
where row_counter>'||(select coalesce(max(row_counter),-1) from local_tablename)
) as t(id int, data text, row_counter int);
commit;
It needs to be in a transaction and protected with a lock, because it can break if it runs concurrently with another synchronization job (for example, if the previous job took more than 10 seconds).
`coalesce` is needed in case `local_tablename` has no rows yet; without it, the query would not insert anything. It assumes that `row_counter>=0` always holds.
I have a question. Transaction isolation level is set to serializable. When the one user opens a transaction and INSERTs or UPDATEs data in "table1" and then another user opens a transaction and tries to INSERT data to the same table, does the second user need to wait 'til the first user commits the transaction?
Generally, no. The second transaction is inserting only, so unless there is a unique index check or other trigger that needs to take place, the data can be inserted unconditionally. In the case of a unique index (including a primary key), it will block only if both transactions try to insert a row with the same key value, e.g.:
-- Session 1:
CREATE TABLE t (x INT PRIMARY KEY);
BEGIN;
INSERT INTO t VALUES (1);

-- Session 2:
BEGIN;
INSERT INTO t VALUES (1); -- blocks here

-- Session 1:
COMMIT;

-- Session 2:
-- finally completes with duplicate key error
Things are less obvious in the case of updates that may affect insertions by the other transaction. I understand PostgreSQL does not yet support "true" serialisability in this case. I do not know how commonly supported it is by other SQL systems.
See http://www.postgresql.org/docs/current/interactive/mvcc.html
The second user will be blocked until the first user commits or rolls back his/her changes.