Postgres - Bulk transfer of data from one table to another

I need to transfer a large amount of data (several million rows) from one table to another. So far I’ve tried doing this….
INSERT INTO TABLE_A (field1, field2)
SELECT field1, field2 FROM TABLE_A_20180807_BCK;
This worked (eventually) for a table with about 10 million rows in it (took 24 hours). The problem is that I have several other tables that need the same process applied and they’re all a lot larger (the biggest is 20 million rows). I have attempted a similar load with a table holding 12 million rows and it failed to complete in 48 hours so I had to cancel it.
Other issues that probably affect performance are: 1) TABLE_A has a field based on an auto-generated sequence, and 2) TABLE_A has an AFTER INSERT trigger that parses each new record and adds a second record to TABLE_B.
A number of other threads have suggested doing a pg_dump of TABLE_A_20180807_BCK and then loading the data back into TABLE_A. I'm not sure a pg_dump would actually work for me because I'm only interested in a couple of fields from TABLE_A, not the whole lot.
Instead I was wondering about the following….
Export to a CSV file…..
COPY TABLE_A_20180807_BCK (field1,field2) to 'd:\tmp\dump\table_a.dump' DELIMITER ',' CSV;
Import back into the desired table….
COPY TABLE_A(field1,field2) FROM 'd:\tmp\dump\table_a.dump' DELIMITER ',' CSV
Is the export/import method likely to be any quicker? I'd like some guidance on this before I start on another job that may take days to run, and may not even work any better! The obvious answer of "just try it and see" isn't really an option; I can't afford more downtime!
(this is a follow-on question from this one, if any background details are required)
Update....
I don't think there are any significant problems with the trigger. Under normal circumstances records are inserted into TABLE_A at a rate of about 1000/sec (including trigger time). I think the issue is likely to be the size of the transaction: under normal circumstances records are inserted in blocks of 100 records per INSERT, whereas the statement shown above attempts to add 10 million records in a single transaction. My guess is that this is the problem, but I have no way of knowing whether it really is, whether there's a suitable workaround, or whether the export/import method I've proposed would be any quicker.
Maybe I should have emphasized this earlier: every insert into TABLE_A fires a trigger that adds a record to TABLE_B. It's the data in TABLE_B that's the final objective, so disabling the trigger isn't an option! This whole problem came about because I accidentally disabled the trigger for a few days, and the preferred solution to the question 'how to run a trigger on existing rows' seemed to be 'remove the rows and add them back again' - see the original post (link above) for details.
My current attempt involves using the COPY command with a WHERE clause to split the contents of TABLE_A_20180807_BCK into a dozen small files and then re-load them one at a time. This may not give me any overall time saving, but although I can't afford 24 hours of continuous downtime, I can afford 6 hours of downtime for 4 nights.
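For what it's worth, a sketch of that batching idea, assuming the backup table has a numeric key (called id here purely for illustration) that can be used to slice it into ranges:
-- export one slice (COPY ... TO takes no WHERE clause directly, but it accepts a query)
COPY (
    SELECT field1, field2
    FROM TABLE_A_20180807_BCK
    WHERE id BETWEEN 1 AND 1000000
) TO 'd:\tmp\dump\table_a_part01.dump' DELIMITER ',' CSV;

-- re-load that slice during the next maintenance window
COPY TABLE_A (field1, field2)
FROM 'd:\tmp\dump\table_a_part01.dump' DELIMITER ',' CSV;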

Preparation (if you have access and can restart your server): set checkpoint_segments to 32 or perhaps more. This will reduce the frequency and number of checkpoints during this operation. You can undo it when you're finished. This step is not strictly necessary, but it should speed up writes considerably.
Edit postgresql.conf and set checkpoint_segments to 32 or more.
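In postgresql.conf that would be something like the following (note that checkpoint_segments was removed in PostgreSQL 9.5; on newer versions the rough equivalent is raising max_wal_size):
# postgresql.conf -- reduce checkpoint frequency during the bulk load
checkpoint_segments = 32      # PostgreSQL 9.4 and older
# max_wal_size = 2GB          # rough equivalent knob on PostgreSQL 9.5+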
Step 1: drop/delete all indexes and triggers on table A.
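For example (the index and trigger names are hypothetical; list the real ones with \d table_a in psql):
DROP INDEX IF EXISTS table_a_field1_idx;                        -- repeat per index
ALTER TABLE table_a DISABLE TRIGGER table_a_after_insert_trg;   -- or: DISABLE TRIGGER ALL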
EDIT: Step 1a
ALTER TABLE table_a SET UNLOGGED;
(repeat step 1 for each table you're inserting into)
Step 2. (unnecessary if you do one table at a time)
begin transaction;
Step 3.
INSERT INTO TABLE_A (field1, field2)
SELECT field1, field2 FROM TABLE_A_20180807_BCK;
(repeat step 3 for all tables being inserted into)
Step 4. (unnecessary if you do one table at a time)
commit;
Step 5: re-enable indexes and triggers on all tables.
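Again with hypothetical names:
ALTER TABLE table_a ENABLE TRIGGER ALL;
CREATE INDEX table_a_field1_idx ON table_a (field1);   -- recreate whatever was dropped in step 1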
Step 5a.
ALTER TABLE table_a SET LOGGED;

Related

PostgreSQL - 100 million records transfer from archive to a new table

I have a requirement to transfer data from 2 tables (Table A and Table B) into a new table.
I am using a query to join both A and B tables using an ID column.
Table A and B are archive tables without any indexes. (Millions of records)
Table X and Y are a replica of A and B with good indexes. (Some thousands of records)
Below is the code for my project.
with data as
(
    SELECT a.*, b.* FROM A_archive a
    JOIN B_archive b ON a.transaction_id = b.transaction_id
    UNION
    SELECT x.*, y.* FROM X x
    JOIN Y y ON x.transaction_id = y.transaction_id
)
INSERT INTO Another_Table
(
    columns
)
SELECT * FROM data
ON CONFLICT (transaction_id)
DO UPDATE ...
The whole thing above is running in a production environment with nearly 140 million records.
Because of this, the production database is taking almost 10 hours to process the data and then failing.
I also have a distributed job scheduler in AWS that schedules this query inside a function and retrieves the latest records every 5 hours. The archive tables store closed invoice data. The Pega UI will use this table to retrieve data about closed invoices and show it to the customer.
Please suggest something that is a bit more performant.
UNION removes duplicate rows. On big unindexed tables that is an expensive operation. Try UNION ALL if you don't need deduplication. It will save the enormous amount of data shuffling and comparisons required for deduplication.
Without indexes on your archival tables your JOIN operation will be grossly inefficient. Index, at a minimum, the transaction_id columns you use in your ON clause.
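For example (the index names are just illustrative):
CREATE INDEX IF NOT EXISTS a_archive_transaction_id_idx ON A_archive (transaction_id);
CREATE INDEX IF NOT EXISTS b_archive_transaction_id_idx ON B_archive (transaction_id);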
You don't say what you want to do with the resulting table. In many cases you'll be able to use a VIEW rather than a table for your purposes. A VIEW removes the work of creating the derived table. Actually it defers the work to the time of SELECT operations using the derived structure. If your SELECT operations have highly selective WHERE clauses the savings can be astonishing. For this to work well you may need to put appropriate indexes on your archival tables.
You use SELECT * when you could enumerate the columns you need. That certainly puts one redundant column into your result: it generates two copies of transaction_id. It also may generate other redundant or unused data. Always avoid SELECT * in production software unless you know you need it.
Keep this in mind: SQL is declarative, not procedural. You declare (describe) the result you require, and you let the server work out the best way to get it. VIEWs let the server do this work for you in cases like your table combination. It will use the indexes you provide as best it can.
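A minimal sketch of that idea, assuming you only need a few named columns (the column names besides transaction_id are hypothetical):
CREATE VIEW closed_invoices AS
SELECT a.transaction_id, a.invoice_date, b.amount   -- enumerate only the columns you actually need
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
UNION ALL
SELECT x.transaction_id, x.invoice_date, y.amount
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id;
Queries against the view with selective WHERE clauses can then make direct use of the transaction_id indexes on the underlying tables.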
That UNION must be costly: it pretty much builds a temp table in the background containing all the A-B and X-Y records, sorts it (over all fields), and then removes any duplicates. If you say 100 million records are involved, then that's a LOT of sorting going on, most likely involving spills to disk.
Keep in mind that you only need to do this if there are expected duplicates
in the result from the JOIN between A and B
in the result from the JOIN between X and Y
in the combined result from the two above
If no duplicates are expected in any of those cases, just use UNION ALL.
In fact, in that case, why not have 1 INSERT operation for A-B and another one for X-Y? Going by the description I'd say that whatever is in X-Y should overrule whatever is in A-B anyway, right?
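A rough sketch of that, with purely illustrative column names standing in for the real column list of Another_Table:
-- first pass: closed invoices from the archive tables (amount and status are hypothetical columns)
INSERT INTO Another_Table (transaction_id, amount, status)
SELECT a.transaction_id, a.amount, b.status
FROM A_archive a
JOIN B_archive b ON a.transaction_id = b.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
    SET amount = EXCLUDED.amount, status = EXCLUDED.status;

-- second pass: X-Y, which then overrules anything loaded from A-B
INSERT INTO Another_Table (transaction_id, amount, status)
SELECT x.transaction_id, x.amount, y.status
FROM X x
JOIN Y y ON x.transaction_id = y.transaction_id
ON CONFLICT (transaction_id) DO UPDATE
    SET amount = EXCLUDED.amount, status = EXCLUDED.status;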
Also, as mentioned by O.Jones, archive tables or not, they should come at least with a (preferably clustered) index on the transaction_id fields you're JOINing on. (same for the Another_Table btw)
All that said, processing 100M records in 1 transaction IS going to take some time, it's just a lot of data that's being moved around. But 10h does sound excessive indeed.

PostgreSQL: UPDATE large table

I have a large PostgreSQL table of 29 million rows. The size (according to the stats tab in pgAdmin) is almost 9 GB. The table is PostGIS-enabled and has an empty geometry column.
I want to UPDATE the geometry column using ST_GeomFromText, reading from X and Y coordinate columns (SRID: 27700) stored in the same table. However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
To overcome this, should I UPDATE the 29 million rows in batches/stages? How can I do 1 million rows (the FIRST 1 million), then do the next 1 million rows until I reach 29 million?
Or are there other more efficient ways of updating large tables like this?
I should add, the table is hosted in AWS.
My UPDATE query is:
UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')',27700);
You did not give any server specs, writing 9GB can be pretty fast on recent hardware.
You should be OK with one long update - unless you have concurrent writes to this table.
A common trick to overcome this problem (a very long transaction, locking writes to the table) is to split the UPDATE into ranges based on the primary key, run in separate transactions.
/* Use PK or any attribute with a known distribution pattern */
UPDATE schema.table SET ... WHERE id BETWEEN 0 AND 1000000;
UPDATE schema.table SET ... WHERE id BETWEEN 1000001 AND 2000000;
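If writing the ranges out by hand is impractical, one way is to generate the statements with generate_series and run them via psql's \gexec; this assumes autocommit is on (so each generated UPDATE commits in its own transaction), that the primary key is a numeric column named id, and it reuses the SET expression from the question's final query:
SELECT format(
         'UPDATE schema.table
          SET geom = ST_SetSRID(ST_Point(eastingcolumn, northingcolumn), 27700)
          WHERE id BETWEEN %s AND %s',
         g, g + 999999)
FROM generate_series(0, (SELECT max(id) FROM schema.table), 1000000) AS g
\gexec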
For high levels of concurrent writes people use more subtle tricks (like: SELECT FOR UPDATE / NOWAIT, lightweight locks, retry logic, etc).
From my original question:
However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
Turns out our Amazon AWS instance database was running out of space, stopping my original ST_GeomFromText query from completing. Freeing up space fixed it.
On an important note, as suggested by #mlinth, ST_Point ran my query far quicker than ST_GeomFromText (24 minutes vs 2 hours).
My final query being:
UPDATE schema.tablename
SET geom = ST_SetSRID(ST_Point(eastingcolumn,northingcolumn),27700);

How to list all locked rows of a table?

My application uses pessimistic locking. When a user opens the form to update a record, the application executes this query (table names are exemplary):
begin;
select *
from master m
natural join detail d
where m.master_id = 123456
for update nowait;
The query locks one master row and several (up to several dozen) detail rows. The transaction stays open until the user confirms or cancels the updates.
I need to know what rows (at least master rows) are locked. I have excavated the documentation and postgres wiki without success.
Is it possible to list all locked rows?
PostgreSQL 9.5 added a new option to FOR UPDATE that provides a straightforward way to do this.
SELECT master_id
FROM master
WHERE master_id NOT IN (
SELECT master_id
FROM master
FOR UPDATE SKIP LOCKED);
This acquires locks on all the not-currently-locked rows, so think through whether that's a problem for you, especially if your table is large. If nothing else, you'll want to avoid doing this in an open transaction. If your table is huge you can apply additional WHERE conditions and step through it in chunks to avoid locking everything at once.
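For example, walking the table in primary-key ranges (the bounds below are just illustrative):
SELECT master_id
FROM master
WHERE master_id BETWEEN 1 AND 100000
  AND master_id NOT IN (
      SELECT master_id
      FROM master
      WHERE master_id BETWEEN 1 AND 100000
      FOR UPDATE SKIP LOCKED);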
Is it possible? Probably yes, but it is the Greatest Mystery of Postgres. I think you would need to write your own extension for it (*).
However, there is an easy way to work around the problem. You can use a very nice Postgres feature: advisory locks. The two arguments of the function pg_try_advisory_lock(key1 int, key2 int) can be interpreted as the table oid (key1) and the row id (key2). Then
select pg_try_advisory_lock(('master'::regclass)::integer, 123456)
locks row 123456 of table master, if it was not locked earlier. The function returns a boolean.
After the update, the lock has to be freed:
select pg_advisory_unlock(('master'::regclass)::integer, 123456)
And the nicest thing, list of locked rows:
select classid::regclass, objid
from pg_locks
where locktype = 'advisory'
Advisory locks may be complementary to regular locks, or you can use them independently. The second option is very tempting, as it can significantly simplify the code. But it should be applied with caution, because you have to make sure that all updates (and deletes) on the table, in all applications, are performed with this locking.
(*) Mr. Tatsuo Ishii did it (I did not know about it and have only just found it).

How to execute TSQL select without blocking?

I have a table which occasionally gets massive amounts of rows (300k+) inserted into it in a batch.
However, if the table is being read from during this insert period, the inserts and selects time out.
Preventing all selects allows the inserts to run just fine.
Is there a way I can allow the selects to happen in a way that doesn't block the inserts?
I'm selecting with READ UNCOMMITTED but that doesn't seem to be enough.
I don't care if the read isn't 100% accurate (with regard to the inserted data); it can miss rows if need be. I just need the select to be fast and not upset the insert. Is this possible?
NOLOCK - http://technet.microsoft.com/en-us/library/aa213026(v=sql.80).aspx
Does this help?
SELECT * FROM [TableName] WITH (NOLOCK)

In-order sequence generation

Is there a way to generate some kind of in-order identifier for a table's records?
Suppose that we have two threads doing queries:
Thread 1:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
Thread 2:
begin;
insert into table1(id, value) values (nextval('table1_seq'), 'world');
commit;
It's entirely possible (depending on timing) that an external observer would see the (2, 'world') record appear before the (1, 'hello').
That's fine, but I want a way to get all the records in the 'table1' that appeared since the last time the external observer checked it.
So, is there any way to get the records in the order they were inserted? Maybe OIDs can help?
No. Since there is no natural order of rows in a database table, all you have to work with is the values in your table.
Well, there are the Postgres-specific system columns cmin and ctid that you could abuse to some degree.
The tuple ID (ctid) contains the file block number and position in the block for the row. So this represents the current physical ordering on disk. Later additions will have a bigger ctid, normally. Your SELECT statement could look like this
SELECT *, ctid -- save ctid from last row in last_ctid
FROM tbl
WHERE ctid > last_ctid
ORDER BY ctid
ctid has the data type tid. Example: '(0,9)'::tid
However, it is not stable as a long-term identifier, since VACUUM or any concurrent UPDATE or some other operations can change the physical location of a tuple at any time. For the duration of a transaction it is stable, though. And if you are just inserting and nothing else, it should work locally for your purpose.
I would add a timestamp column with default now() in addition to the serial column ...
I would also let a column default populate your id column (a serial or IDENTITY column). That retrieves the number from the sequence at a later stage than explicitly fetching and then inserting it, thereby minimizing (but not eliminating) the window for a race condition - the chance that a lower id would be inserted at a later time. Detailed instructions:
Auto increment table column
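Putting both suggestions together might look roughly like this (table and column names are illustrative only):
CREATE TABLE table1 (
    id          bigserial PRIMARY KEY,                -- populated by the column default, not by the client
    value       text,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- let the defaults fill in id and created_at
INSERT INTO table1 (value) VALUES ('hello');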
What you want is to force transactions to commit (making their inserts visible) in the same order that they did the inserts. As far as other clients are concerned the inserts haven't happened until they're committed, since they might roll back and vanish.
This is true even if you don't wrap the inserts in an explicit begin / commit. Transaction commit, even if done implicitly, still doesn't necessarily run in the same order that the row itself was inserted. It's subject to operating system CPU scheduler ordering decisions, etc.
Even if PostgreSQL supported dirty reads this would still be true. Just because you start three inserts in a given order doesn't mean they'll finish in that order.
There is no easy or reliable way to do what you seem to want that will preserve concurrency. You'll need to do your inserts in order on a single worker - or use table locking as Tometzky suggests, which has basically the same effect since only one of your insert threads can be doing anything at any given time.
You can use advisory locking, but the effect is the same.
Using a timestamp won't help, since you don't know if for any two timestamps there's a row with a timestamp between the two that hasn't yet been committed.
You can't rely on an identity column where you read rows only up to the first "gap" because gaps are normal in system-generated columns due to rollbacks.
I think you should step back and look at why you have this requirement and, given this requirement, why you're using individual concurrent inserts.
Maybe you'll be better off doing small-block batched inserts from a single session?
If you mean that any query that sees the world row also has to see the hello row, then you'd need to do:
begin;
lock table table1 in share update exclusive mode;
insert into table1(id, value) values (nextval('table1_seq'), 'hello');
commit;
This share update exclusive mode is the weakest lock mode which is self-exclusive — only one session can hold it at a time.
Be aware that this will not make this sequence gap-less — this is a different issue.
We found another solution with recent PostgreSQL servers, similar to #erwin's answer but with txid.
When inserting rows, instead of using a sequence, insert txid_current() as row id. This ID is monotonically increasing on each new transaction.
Then, when selecting rows from the table, add to the WHERE clause id < txid_snapshot_xmin(txid_current_snapshot()).
txid_snapshot_xmin(txid_current_snapshot()) corresponds to the transaction ID of the oldest still-open transaction. Thus, if row 20 is committed before row 19, it will be filtered out because transaction 19 will still be open. When transaction 19 is committed, both rows 19 and 20 will become visible.
When no transaction is opened, the snapshot xmin will be the transaction id of the currently running SELECT statement.
The returned transaction IDs are 64-bit: the higher 32 bits are an epoch and the lower 32 bits are the actual ID.
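A small sketch of that scheme (last_seen_id stands in for whatever bookmark the observer keeps; the txid functions are documented at the link below):
-- writer: use the current transaction id as the row id (the id column must be bigint)
INSERT INTO table1 (id, value) VALUES (txid_current(), 'hello');

-- reader: only rows whose writing transaction can no longer be in flight
SELECT *
FROM table1
WHERE id > last_seen_id   -- last_seen_id: the observer's stored bookmark (placeholder)
  AND id < txid_snapshot_xmin(txid_current_snapshot())
ORDER BY id;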
Here is the documentation of these functions: https://www.postgresql.org/docs/9.6/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT
Credits to tux3 for the idea.