Schedule insert of data from dblink into a local table

I am trying to link a remote database to a local database using dblink. What I want to achieve here is:
I want to fetch only the latest row(s) from a table in the remote database, every 10 seconds.
I wish to insert that data into a pre-existing table in the local database, together with some values I do not get from the remote database, such as a primary key and some sequence values.
Any suggestions would be appreciated.

You'd need an indexed column in the remote table that increases every time a row is inserted or updated there (not a timestamp, since many rows can share the same timestamp, a computer clock can sometimes go backwards, etc.). If rows are never updated, a serial primary key would suffice (enforce that updates are not allowed with a trigger if you decide to rely on this). Deletes would also not be synchronized, so I'd recommend a trigger to forbid them too (see the sketch at the end of this answer).
You'd also need a cron job (or a Task Scheduler job on Windows) that connects to your database and performs the synchronization, as PostgreSQL has no built-in mechanism for periodic tasks.
This job would just do:
start transaction;

-- Serialize concurrent synchronization jobs
lock table local_tablename in exclusive mode;

-- Open the remote connection (connection string elided here)
select dblink_connect(…);

-- Pull only rows newer than anything we already have locally
insert into local_tablename (id, data, row_counter)
select * from dblink(
  'select id, data, row_counter from remote_tablename
   where row_counter > '||(select coalesce(max(row_counter),-1) from local_tablename)
) as t(id int, data text, row_counter int);

commit;
It needs to run in a transaction and be protected with a lock, because it can break if it runs concurrently with another synchronization job (for example if the previous job took more than 10 seconds).
`coalesce` is needed for the case where `local_tablename` has no rows yet - without it, the comparison against a NULL `max(row_counter)` would match nothing and no rows would ever be inserted. It assumes that `row_counter >= 0` always holds.
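As suggested above, you can block UPDATEs and DELETEs on the remote table with a trigger. A minimal sketch, assuming the remote table is called remote_tablename (the trigger and function names here are made up for illustration):
-- Hypothetical trigger to reject updates and deletes on the remote table
create function forbid_update_delete() returns trigger
language plpgsql as $$
begin
  raise exception 'updates and deletes are not allowed on %', tg_table_name;
end;
$$;

create trigger remote_tablename_no_update_delete
  before update or delete on remote_tablename
  for each row execute procedure forbid_update_delete();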

Related

PostgreSQL: Optimizing bulk INSERT INTO/SELECT FROM

In my Postgresql schema I have a jobs table and an accounts table. Once a day I need to schedule a job for each account by inserting a row per account into the jobs table. This can be done using a simple INSERT INTO ... SELECT FROM statement, but is there any empirical way to know whether I am straining my DB with this bulk insert, and whether I should chunk the inserts instead?
Postgres often does miraculous work so I have no idea if bulk inserting 500k records at a time is better than 100 x 5k, for example. The bulk insert works today but can take minutes to complete.
One additional data point: the jobs table has a uniqueness constraint on account ID, so this statement includes an ON CONFLICT clause too.
In PostgreSQL, it doesn't matter how many rows you modify in a single transaction, so it is preferable to do everything in a single statement so that you don't end up with half the work done in case of a failure. The only consideration is that transactions should not take too long, but if that happens once a day, it is no problem.
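A minimal sketch of such a single-statement daily run, assuming hypothetical jobs(account_id, scheduled_for) and accounts(id) tables with a unique constraint on jobs.account_id:
-- One statement schedules a job for every account; the unique constraint
-- on jobs.account_id is handled by the ON CONFLICT clause
insert into jobs (account_id, scheduled_for)
select id, current_date
from accounts
on conflict (account_id) do nothing;  -- or DO UPDATE ... to refresh the existing row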

How to lock a SELECT in PostgreSQL?

I am used to doing this in MySQL:
INSERT INTO ... SELECT ...
which would lock the table I SELECT from.
Now, I am trying to do something similar in PostgreSQL, where I select a set of rows in a table, and then I insert some stuff into other tables based on those rows' values. I want to prevent working with outdated data, so I am wondering how I can lock a SELECT in PostgreSQL.
There is no need to explicitly lock anything. A SELECT statement will always see a consistent snapshot of the table, no matter how long it runs.
The result will be no different if you lock the table against concurrent modifications before starting the SELECT, but you will harm concurrency unnecessarily.
If you need several queries to see a consistent state of the database, start a transaction with the REPEATABLE READ isolation level. Then all statements in the transaction will see the same state of the database.
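A minimal sketch, with placeholder table names, of running several statements against one consistent snapshot:
begin isolation level repeatable read;

-- Both statements see the same snapshot of the database,
-- regardless of concurrent commits in other sessions
select * from source_rows where category = 'x';
insert into derived_table (source_id)
select id from source_rows where category = 'x';

commit;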

Move truncated records to another table in Postgresql 9.5

The problem is the following: remove all records from one table, and insert them into another.
I have a table that is partitioned by date criteria. To avoid routing each record into its partition one by one, I'm collecting the data in one table and periodically moving it to another table. Copied records have to be removed from the first table. I'm using a DELETE query with RETURNING, but the side effect is that autovacuum has a lot of work to do to clean up the mess in the original table.
I'm trying to achieve the same effect (copy and remove records), but without creating additional work for the vacuum mechanism.
As I'm removing all rows (a DELETE without a WHERE condition), I was thinking about TRUNCATE, but it does not support a RETURNING clause. Another idea was to somehow configure the table to automatically remove the tuple from the page on a delete operation, without waiting for vacuum, but I could not find out whether that is possible.
Can you suggest something, that I could use to solve my problem?
You need to use something like:
-- Open your transaction
BEGIN;
-- Prevent concurrent writes, but allow concurrent reads
LOCK TABLE table_a IN SHARE MODE;
-- Copy the data from table_a to table_b (you could also use CREATE TABLE ... AS for this)
INSERT INTO table_b SELECT * FROM table_a;
-- Empty table_a
TRUNCATE TABLE table_a;
-- Commit and release the locks
COMMIT;

Sending only updated rows to a client

I'd like to create a web service that allows a client to fetch all rows in a table, and then later allows the client to only fetch new or updated rows.
The simplest implementation seems to be to send the current timestamp to the client, and then have the client ask for rows that are newer than the timestamp in the following request.
It seems that this is doable by keeping an "updated_at" column with a timestamp set to NOW() in update and insert triggers, and then querying newer rows, and also passing down the value of NOW().
The problem is that if there are uncommitted transactions, these transactions will set updated_at to the start time of the transaction, not the commit time.
As a result, this simple implementation doesn't work, because rows can be lost since they can appear with a timestamp in the past.
I have been unable to find any simple solution to this problem, despite the fact that it seems to be a very common need: any ideas?
Possible solutions:
Keep a monotonic timestamp in a table, update it at the start of every transaction to MAX(NOW(), last_timestamp + 1) and use it as a row timestamp. Problem: this effectively means that all write transactions are fully serialized and lock the whole database since they conflict on the update time table.
At the end of the transaction, add a mapping from NOW() to the time in an update table like the above solution. This seems to require to take an explicit lock and use a sequence to generate non-temporal "timestamps" because just using an UPDATE on a single row would cause rollbacks in SERIALIZABLE mode.
Somehow have PostgreSQL, at commit time, iterate over all updated rows and set updated_at to a monotonic timestamp
Somehow have PostgreSQL itself maintain a table of transaction commit times, which it doesn't seem to do at the moment
Using the built-in xmin column also seems impossible, because VACUUM can trash it.
It would be nice to be able to do this in the database without modifications to all updates in the application.
What is the usual way this is done?
The problem with the naive solution
In case it's not obvious, this is the problem with using NOW() or CLOCK_TIMESTAMP():
At time 1, we run NOW() or CLOCK_TIMESTAMP() in a transaction, which gives 1, and we update a row, setting time 1 as the update time
At time 2, a client fetches all rows, and we tell it that it now has all rows up to time 2
At time 3, the transaction commits with "time 1" in the updated_at field
The client asks for rows updated since time 2 (the time it got from the previous full fetch request); we query for updated_at >= 2 and return nothing, instead of returning the row that was just committed
That row is lost and will never be seen by the client
Your whole proposition goes against some of the underlying fundamentals of an ACID-compliant RDBMS like PostgreSQL. The time of transaction start (e.g. current_timestamp) and other time-based metrics are meaningless as a measure of what a particular client has or has not received. Abandon the whole idea.
Assuming that your clients connect through a persistent session to the database you can follow this procedure:
When the session starts, create a TEMP table for the session user (temporary tables are not WAL-logged anyway). This table contains nothing but the PK and the last update time of the table you want to fetch the data from.
The client polls for new data and receives only those records that have a PK not yet in the temp table or an existing PK but a newer last update time. Currently uncommitted transactions are invisible but will be retrieved at the next poll for new or updated records. The update time is required because there is no way to delete records from the temp tables of all concurrent clients.
The PK and last update time of each retrieved record are stored in the temp table.
When the user closes the session, the temp table is deleted.
If you want to persist the set of retrieved records over multiple sessions for each client, or the client disconnects after every query, then you need a regular table instead. In that case I would suggest also adding the oid of the user, so that all users can share a single table for keeping track of the retrieved records. In that latter case you can create an AFTER UPDATE trigger on the table with your data which deletes the PK from the table of fetched records for all users in one sweep; on their next poll the clients will then get the updated record.
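A minimal sketch of the per-session bookkeeping, assuming the data lives in a hypothetical table data_table(id primary key, payload, updated_at) and the temp table is called fetched:
-- Per-session bookkeeping of what this client has already received
create temp table fetched (
  id int primary key,
  last_seen timestamptz not null
);

-- One poll: rows never sent to this client, or rows updated since we last sent them
select d.*
from data_table d
left join fetched f on f.id = d.id
where f.id is null
   or d.updated_at > f.last_seen;

-- Afterwards, record what was handed out (in the same transaction as the poll)
insert into fetched (id, last_seen)
select d.id, d.updated_at
from data_table d
left join fetched f on f.id = d.id
where f.id is null or d.updated_at > f.last_seen
on conflict (id) do update set last_seen = excluded.last_seen;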
Add a column that will be used to track which records have been sent to a client:
alter table table_under_view
add column access_order int null;
create sequence table_under_view_access_order_seq
owned by table_under_view.access_order;
create function table_under_view_reset_access_order()
returns trigger
language plpgsql
as $func$
begin
  new.access_order := null;
  return new;
end;
$func$;
create trigger table_under_view_reset_access_order_before_update
before update on table_under_view
for each row execute procedure table_under_view_reset_access_order();
create index table_under_view_access_order_idx
on table_under_view (access_order);
create index table_under_view_access_order_where_null_idx
on table_under_view (access_order)
where (access_order is null);
(You could use a BEFORE INSERT trigger on table_under_view too, to ensure that only NULL values can be inserted into access_order; see the sketch below.)
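A minimal sketch of that BEFORE INSERT trigger, reusing the same trigger function defined above:
-- Force access_order to NULL on insert as well, so freshly inserted rows
-- are always picked up by the next call to table_under_access()
create trigger table_under_view_reset_access_order_before_insert
  before insert on table_under_view
  for each row execute procedure table_under_view_reset_access_order();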
You need to update this column after transactions with INSERTs and UPDATEs on this table have finished, but before any client queries your data. You cannot run anything right after a transaction finishes, so let's do it just before a query happens. You can do this with a function, for example:
create function table_under_access(from_access int)
returns setof table_under_view
language sql
as $func$
update table_under_view
set access_order = nextval('table_under_view_access_order_seq'::regclass)
where access_order is null;
select *
from table_under_view
where access_order > from_access;
$func$;
Now your first "chunk" of data (which fetches all rows in the table) looks like:
select *
from table_under_access(0);
The key element after this is that your client needs to process every "chunk" of data to determine the greatest access_order it last received (unless you include it in the result with, for example, a window function, but if you're going to process the results anyway - which seems highly likely - you don't need that). Always use that value for the subsequent calls, as shown below.
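For example, if the greatest access_order seen in the previous chunk was 42 (a made-up value), the next poll would be:
select *
from table_under_access(42);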
You can add an updated_at column too for ordering your results, if you want to.
You can also use a view + rule(s) for the last part (instead of the function), to make it more transparent.

INSERT and transaction serialization in PostgreSQL

I have a question. The transaction isolation level is set to serializable. When one user opens a transaction and INSERTs or UPDATEs data in "table1", and then another user opens a transaction and tries to INSERT data into the same table, does the second user need to wait until the first user commits the transaction?
Generally, no. The second transaction is inserting only, so unless there is a unique index check or other trigger that needs to take place, the data can be inserted unconditionally. In the case of a unique index (including a primary key), it will block if both transactions insert rows with the same key value, e.g.:
-- Session 1:
CREATE TABLE t (x INT PRIMARY KEY);
BEGIN;
INSERT INTO t VALUES (1);

-- Session 2:
BEGIN;
INSERT INTO t VALUES (1);  -- blocks here, waiting for session 1

-- Session 1:
COMMIT;

-- Session 2: the blocked INSERT now fails with a duplicate key error
Things are less obvious in the case of updates that may affect insertions by the other transaction. I understand PostgreSQL does not yet support "true" serialisability in this case. I do not know how commonly supported it is by other SQL systems.
See http://www.postgresql.org/docs/current/interactive/mvcc.html
The second user will be blocked until the first user commits or rolls back his/her changes.