Insert query having DataFileRead wait event - postgresql

There is an insert query inserting data into a partitioned table using a VALUES clause.
insert into t (c1, c2, c3) values (v1,v2,v3);
The database is AWS Aurora v11. Around 20 sessions run in parallel, executing ~2 million individual insert statements in total. I'm seeing DataFileRead as the wait event and wondering why this wait event would show up for an insert statement. Would it be because each insert statement has to check whether the PK/UK keys already exist in the table before committing the insert? Or are there other reasons?

Each inserted row has to read the relevant leaf pages of each of the table's indexes in order to do index maintenance (inserting the index entries for the new row into their proper locations dirties the page, but the page first has to be read before it can be dirtied), and also to verify PK/UK constraints. It may also need to read index leaf pages of other tables' indexes in order to verify FKs.
If you insert the new tuples in the right order, you can hit the same leaf pages over and over in quick succession, maximizing their cacheability. But if you have multiple indexes, there might be no ordering that satisfies all of them.
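A minimal sketch of what inserting in index order could look like when the batch is first loaded into a staging table (staging_t and the assumption that c1 is the leading primary key column are hypothetical):

-- Load the batch into a staging table first, then insert into the
-- partitioned target in primary key order, so consecutive rows hit the
-- same index leaf pages while they are still cached.
INSERT INTO t (c1, c2, c3)
SELECT c1, c2, c3
FROM staging_t
ORDER BY c1;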

Related

PostgreSQL: Return auto-generated ids from COPY FROM insertion

I have a non-empty PostgreSQL table with a GENERATED ALWAYS AS IDENTITY column id. I do a bulk insert with the C++ binding pqxx::stream_to, which I'm assuming uses COPY FROM. My problem is that I want to know the ids of the newly created rows, but COPY FROM has no RETURNING clause. I see several possible solutions, but I'm not sure if any of them is good, or which one is the least bad:
Provide the ids manually through COPY FROM, taking care to give the values which the identity sequence would have provided, then afterwards synchronize the sequence with setval(...).
First stream the data to a temp table with a custom index column for ordering, then do something like
INSERT INTO foo (col1, col2)
SELECT ttFoo.col1, ttFoo.col2 FROM ttFoo
ORDER BY ttFoo.idx
RETURNING foo.id
and depend on the fact that the identity sequence produces ascending numbers to correlate them with ttFoo.idx. (I cannot do RETURNING ttFoo.idx too, because RETURNING only has access to the inserted row, which doesn't contain idx.)
Query the current value of the identity sequence prior to insertion, then check afterwards which rows are new.
I would assume that this is a common situation, yet I don't see an obviously correct solution. What do you recommend?
You can find out which rows have been affected by your current transaction using the system columns. The xmin column contains the ID of the inserting transaction, so to return the id values you just copied, you could:
BEGIN;
COPY foo(col1,col2) FROM STDIN;
SELECT id FROM foo
WHERE xmin::text = (txid_current() % (2^32)::bigint)::text
ORDER BY id;
COMMIT;
The WHERE clause comes from this answer, which explains the reasoning behind it.
I don't think there's any way to optimise this with an index, so it might be too slow on a large table. If so, I think your second option would be the way to go, i.e. stream into a temp table and INSERT ... RETURNING.
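If you do go the temp-table route, one way to correlate the generated ids back to the staging rows is to number both sides and join on the row number. This is only a sketch, and it relies on the same assumption the question makes, namely that the identity sequence assigns ascending values in the ORDER BY order:

WITH inserted AS (
    INSERT INTO foo (col1, col2)
    SELECT col1, col2
    FROM ttFoo
    ORDER BY idx
    RETURNING foo.id
)
SELECT t.idx, i.id
FROM (SELECT idx, row_number() OVER (ORDER BY idx) AS rn FROM ttFoo) AS t
JOIN (SELECT id,  row_number() OVER (ORDER BY id)  AS rn FROM inserted) AS i
  USING (rn);
-- Joins each staging row (by its idx) to the id it should have received.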
You could also create the id column with type uuid.
Generate random ids on the client first and then bulk insert the rows; that way you won't need to return the ids from the database.
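A sketch of that idea, with hypothetical table and column names: the client generates the uuid itself and streams it in with COPY, so there is nothing to ask the database for afterwards.

-- The id is supplied by the client rather than generated by the database.
CREATE TABLE foo (
    id   uuid PRIMARY KEY,
    col1 text,
    col2 integer
);

-- The client (e.g. via pqxx::stream_to) includes its generated uuid in each row:
COPY foo (id, col1, col2) FROM STDIN;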

Postgres Insert as select from table ignore any errors

I am trying to bulk load records from a temp table into a table using an INSERT ... SELECT statement with an ON CONFLICT ... DO UPDATE strategy.
I want to load as many records as possible. Currently, if there are any foreign key violations, no records get inserted and everything gets rolled back. Is there a way to insert the valid records and skip the faulty ones?
In https://dba.stackexchange.com/a/46477 I saw a strategy of joining the referenced table into the query to filter out the faulty rows. I don't want to do that either, as I may have many foreign keys on that table and it would make my query more complex and table-specific. I would like it to be generic.
Sample use case: if I have 100 rows in the temp table and rows 5 and 7 cause insertion failures, I want to insert the remaining 98 records and identify which two rows failed.
I want to avoid inserting record by record and catching the error, as that is not efficient; the whole point of this exercise is to avoid loading the table row by row.
Oracle provides support for catching bulk errors in one shot.
Sample https://stackoverflow.com/a/36430893/8575780
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1422998100346727312
I have already explored loading with COPY; it catches NOT NULL constraint and other data type errors, but when a foreign key violation happens nothing gets committed.
I am looking for something closer to what pgloader does when it encounters an error.
https://pgloader.readthedocs.io/en/latest/pgloader.html#batches-and-retry-behaviour
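The question rules out joining every referenced table, but for reference, a filtered insert plus a report of the skipped rows might look like the sketch below. All names (temp_t, parent, target, payload) are hypothetical, and target.id is assumed to be its primary key:

-- Insert only the rows whose foreign key value actually exists in the parent:
INSERT INTO target (id, parent_id, payload)
SELECT t.id, t.parent_id, t.payload
FROM temp_t t
JOIN parent p ON p.id = t.parent_id
ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload;

-- Identify the rows that were skipped because their parent row is missing:
SELECT t.*
FROM temp_t t
LEFT JOIN parent p ON p.id = t.parent_id
WHERE p.id IS NULL;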

Move truncated records to another table in Postgresql 9.5

The problem is the following: remove all records from one table and insert them into another.
I have a table that is partitioned by a date criterion. To avoid routing each record into its partition one by one, I'm collecting the data in one table and periodically moving it to another table. The copied records have to be removed from the first table. I'm using a DELETE query with RETURNING, but the side effect is that autovacuum has a lot of work to do to clean up the mess left in the original table.
I'm trying to achieve the same effect (copy and remove records), but without creating additional work for the vacuum mechanism.
As I'm removing all rows (by DELETE without WHERE conditions), I was thinking about TRUNCATE, but it does not support a RETURNING clause. Another idea was to somehow configure the table to remove tuples from their pages immediately on delete, without waiting for vacuum, but I could not find out whether that is possible.
Can you suggest something that I could use to solve my problem?
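For reference, the delete-and-move step described above is usually written as a single statement (table names are hypothetical, and both tables are assumed to have the same column layout); it is exactly this pattern that leaves the dead tuples behind for autovacuum:

-- Move every row from the staging table into the target in one statement.
-- The deleted rows still become dead tuples that autovacuum must clean up.
WITH moved AS (
    DELETE FROM staging_events
    RETURNING *
)
INSERT INTO archived_events
SELECT * FROM moved;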
You need to use something like:
--Open your transaction
BEGIN;
--Prevent concurrent writes, but allow concurrent data access
LOCK TABLE table_a IN SHARE MODE;
--Copy the data from table_a to table_b, you can also use CREATE TABLE AS to do this
INSERT INTO table_b SELECT * FROM table_a;
--Empty table_a
TRUNCATE TABLE table_a;
--Commit and release the lock
COMMIT;

Is it possible to catch a foreign key violation in postgres

I'm trying to insert data into a table which has a foreign key constraint. If there is a constraint violation in a row that I'm inserting, I want to chuck that data away.
The issue is that postgres returns an error every time I violate the constraint. Is it possible for me to have some statement in my insert statement like 'ON FOREIGN KEY CONSTRAINT DO NOTHING'?
EDIT:
This is the query that I'm trying to do, where info is a dict:
cursor.execute("INSERT INTO event (case_number_id, date, \
session, location, event_type, worker, result) VALUES \
(%(id_number)s, %(date)s, %(session)s, \
%(location)s, %(event_type)s, %(worker)s, %(result)s) ON CONFLICT DO NOTHING", info)
It errors out when there is a foreign key violation
If you're only inserting a single row at a time, you can create a savepoint before the insert and rollback to it when the insert fails (or release it when the insert succeeds).
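A minimal sketch of that savepoint flow, reusing the event table from the question (the literal values are invented and the column types are assumed):

BEGIN;

-- Before each row:
SAVEPOINT before_row;

INSERT INTO event (case_number_id, date, session, location,
                   event_type, worker, result)
VALUES (42, '2020-01-01', 'AM', 'Court 3', 'hearing', 'jsmith', 'adjourned');

-- If the INSERT raised a foreign key violation, the application issues
--   ROLLBACK TO SAVEPOINT before_row;
-- and moves on to the next row; if it succeeded, it issues
--   RELEASE SAVEPOINT before_row;

COMMIT;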
For Postgres 9.5 or later, you can use INSERT ... ON CONFLICT DO NOTHING which does what it says. You can also write ON CONFLICT DO UPDATE SET column = value..., which will automagically convert your insert into an update of the row you are conflicting with (this functionality is sometimes called "upsert").
However, this does not work here, because the OP is dealing with a foreign key constraint rather than a unique constraint. In that case, you can most easily use the savepoint method I described earlier, but for multiple rows it may prove tedious. If you need to insert multiple rows at once, it should be reasonably performant to split them into multiple insert statements, provided you are not working in autocommit mode, all inserts occur in one transaction, and you are not inserting a very large number of rows.
Sometimes, you really do need multiple inserts in a single statement, because the round-trip overhead of talking to your database plus the cost of having savepoints on every insert is simply too high. In this case, there are a number of imperfect approaches. Probably the least bad is to build a nested query which selects your data and joins it against the other table, something like this:
INSERT INTO table_A (column_A, column_B, column_C)
SELECT A_rows.*
FROM (VALUES (...)) AS A_rows(column_A, column_B, column_C)
JOIN table_B ON A_rows.column_B = table_B.column_B;

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which processes the work queue will delete all processed IDs and therefore take care of duplicates. (BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem.)
But I also want the documents to be processed in order: documents with smaller IDs should always be processed first.
Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table.
Ideally given that I have only one column, this should be the primary key. But the primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (PostgreSQL does that for every additional index) and double the amount of disc operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless; would it become a kind of primary index that is stored directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
- One column with an integer
- Allow duplicates
- Efficient ORDER BY + LIMIT on the column
- INSERTs should not do any lookup in that table or in any kind of unique index. INSERTs shall just locate the best page in the table's main file/main btree and insert the row in between other rows, ordered by ID.
- INSERTs will happen in bulk and must not fail, except for disc full, etc.
- There shall not be additional btree files for this table, so no secondary indexes
- The rows should not occupy much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: Add a PRIMARY KEY covering the integer and also a dummy column, like OIDs, a timestamp or a SERIAL.
Another solution would be to use a hypothetical INSERT IF NOT EXISTS, a nested transaction, or a special INSERT with a WHERE clause (see the sketch just after this question). All of these solutions would add a lookup of the btree when inserting.
Also, they might cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
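For reference, the "special INSERT with a WHERE" mentioned above might look like the sketch below (the workqueue table name is hypothetical). It performs exactly the extra btree/table probe the question wants to avoid, and two concurrent inserts can still both pass the check:

-- Insert the ID only if it is not already queued; this still probes the
-- table and is not race-free under concurrency.
INSERT INTO workqueue (document_id)
SELECT 42
WHERE NOT EXISTS (
    SELECT 1 FROM workqueue WHERE document_id = 42
);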
You said
Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
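Not part of the original answer, but on PostgreSQL 9.5 or later the inserting processes would not even need to trap the duplicate-key error: with a primary key or unique constraint on document_id they could write, for example:

-- Duplicate markings become no-ops instead of errors (requires a unique or
-- primary key constraint on document_id, PostgreSQL 9.5+).
INSERT INTO dirty_documents (document_id)
VALUES (42)
ON CONFLICT (document_id) DO NOTHING;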
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now let's delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took 0.054 ms. That's about half the time it took using a table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took 0.054 ms. That's roughly five times faster than the delete from the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queueing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.