PostgreSQL: Return auto-generated ids from COPY FROM insertion

I have a non-empty PostgreSQL table with a GENERATED ALWAYS AS IDENTITY column id. I do a bulk insert with the C++ binding pqxx::stream_to, which I'm assuming uses COPY FROM. My problem is that I want to know the ids of the newly created rows, but COPY FROM has no RETURNING clause. I see several possible solutions, but I'm not sure if any of them is good, or which one is the least bad:
Provide the ids manually through COPY FROM, taking care to give the values which the identity sequence would have provided, then afterwards synchronize the sequence with setval(...).
First stream the data into a temp table with a custom index column for ordering. Then do something like
INSERT INTO foo (col1, col2)
SELECT ttFoo.col1, ttFoo.col2 FROM ttFoo
ORDER BY ttFoo.idx RETURNING foo.id
and depend on the fact that the identity sequence produces ascending numbers to correlate them with ttFoo.idx (I cannot also do RETURNING ttFoo.idx, because RETURNING only sees the inserted row, which doesn't contain idx).
Query the current value of the identity sequence prior to insertion, then check afterwards which rows are new.
I would assume that this is a common situation, yet I don't see an obviously correct solution. What do you recommend?

You can find out which rows have been affected by your current transaction using the system columns. The xmin column contains the ID of the inserting transaction, so to return the id values you just copied, you could:
BEGIN;
COPY foo(col1,col2) FROM STDIN;
SELECT id FROM foo
WHERE xmin::text = (txid_current() % (2^32)::bigint)::text
ORDER BY id;
COMMIT;
The WHERE clause comes from this answer, which explains the reasoning behind it.
I don't think there's any way to optimise this with an index, so it might be too slow on a large table. If so, I think your second option would be the way to go, i.e. stream into a temp table and INSERT ... RETURNING.
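A minimal sketch of that temp-table variant, assuming foo(id GENERATED ALWAYS AS IDENTITY, col1, col2) as in the question; the column types, and the assumption that the RETURNING rows come back in the ORDER BY idx insertion order, are mine and not guaranteed by the standard:
BEGIN;
CREATE TEMP TABLE ttFoo (idx int, col1 text, col2 text) ON COMMIT DROP;
COPY ttFoo (idx, col1, col2) FROM STDIN;   -- or pqxx::stream_to into ttFoo
INSERT INTO foo (col1, col2)
SELECT col1, col2 FROM ttFoo ORDER BY idx
RETURNING id;                              -- correlate these ids with ttFoo.idx
COMMIT;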

I think you can create the id column with type uuid.
First, generate random ids on the client side, and only then bulk insert them; this way you will not need to return ids from the database.
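A rough sketch of that idea (the column names follow the question; gen_random_uuid() is built in from PostgreSQL 13 on, and available via the pgcrypto extension before that):
CREATE TABLE foo (
    id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- server-side fallback only
    col1 text,
    col2 text
);
-- the client generates the UUIDs itself, so COPY already knows every id up front
COPY foo (id, col1, col2) FROM STDIN;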

Related

Is PostgreSQL sequence nextval consistent with insert order?

Given a table with id bigint default nextval('foo_sequence'),
can I assume that the order of id is consistent with the insert order?
I mean, is a later inserted id always greater than earlier inserted ids?
I am trying to calculate and save an incrementing, continuous sequence number per row.
Here is how I did it:
SELECT count(*) as seq_no from foo where id < some_id;
-- get the seq_no
UPDATE foo SET seq_no = seq_no_above + 1 WHERE id = some_id;
But it sometimes gives duplicate seq_no values;
if the id order were consistent with the insert order, there should be no duplicates.
In the simplest and purest sense, yes. It depends what you mean by "earlier" and "later", though, as you have to consider opening the transaction and closing the transaction. If a transaction has not been committed, then theoretically a record could show up later with an earlier ID.
The IDs are allocated when the insert happens, but the records will not show up until the records are committed. So if commit order is different, you may see some strange behavior depending on how strict your use case is.
I.e.
Open Transaction A
Insert records 1,2
Open Transaction B
Insert records 3,4
Close transaction B
Select * (get 3,4)
Close transaction A
Select * (get 1,2,3,4)
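As a concrete sketch of that scenario (assuming the foo table from the question, with every column other than id nullable or defaulted so that DEFAULT VALUES works):
-- session A
BEGIN;
INSERT INTO foo DEFAULT VALUES;   -- allocates id 1
INSERT INTO foo DEFAULT VALUES;   -- allocates id 2

-- session B
BEGIN;
INSERT INTO foo DEFAULT VALUES;   -- allocates id 3
INSERT INTO foo DEFAULT VALUES;   -- allocates id 4
COMMIT;

-- a third session now sees only 3 and 4:
SELECT id FROM foo ORDER BY id;

-- session A
COMMIT;

-- the same query now sees 1, 2, 3, 4:
SELECT id FROM foo ORDER BY id;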
You also have to consider sequence caching when deciding whether the values are sequential. From the (very good) Postgres docs:
Furthermore, although multiple sessions are guaranteed to allocate distinct sequence values, the values might be generated out of sequence when all the sessions are considered. For example, with a cache setting of 10, session A might reserve values 1..10 and return nextval=1, then session B might reserve values 11..20 and return nextval=11 before session A has generated nextval=2. Thus, with a cache setting of one it is safe to assume that nextval values are generated sequentially; with a cache setting greater than one you should only assume that the nextval values are all distinct, not that they are generated purely sequentially. Also, last_value will reflect the latest value reserved by any session, whether or not it has yet been returned by nextval.
One last caveat is someone with appropriate privileges can always reset the sequence to a different value, which obviously would throw a wrench into things.
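For reference, that cache setting lives on the sequence itself and can be inspected or changed; a small sketch using the foo_sequence name from the question (the pg_sequences view exists in PostgreSQL 10 and later):
ALTER SEQUENCE foo_sequence CACHE 1;   -- per the quoted docs, CACHE 1 keeps nextval values sequential
SELECT cache_size FROM pg_sequences WHERE sequencename = 'foo_sequence';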
EDIT:
To address your use case above, you definitely want to use sequences (and likely add NOT NULL / PRIMARY KEY constraints as well, to ensure uniqueness). In pgAdmin, at least, you can do all of this by setting data type serial. Though I have mentioned caveats, for 99% of practical purposes, you get uniqueness and sequential ordering (hence sequences) the way that you want.
In any case, we would need to see example data to confirm why you are seeing duplication (how to create a reproducible example). I presume the duplication you are seeing is in seq_no and not id, which illustrates that the problem is your query. If duplication is in id, then you have other problems, and that would explain duplication in seq_no.
Sequences are much better for transactional definition in the data (they take care of uniqueness for you, perform well in concurrency, and do not cause duplication... plus you get sequential ordering for the most part). For unique keys, they are best used with NOT NULL and PRIMARY KEY or UNIQUE constraints.
But if you need a perfect increment, it is better to do something like the below:
select *, row_number() over (order by value) as id
from foo
;
Postgres window functions are very powerful, but are definitely not the standard to use for inserting data with sequential keys. They are more useful for reporting, analysis, and complex queries after the fact.
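That said, if you do want to persist a gap-free seq_no computed this way (as in the question), one hedged way to do it in a single statement, assuming foo has the columns id and seq_no, is:
UPDATE foo
SET    seq_no = numbered.rn
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM foo) AS numbered
WHERE  foo.id = numbered.id;
Note that this renumbers every row each time it runs, so it is a batch/reporting fix rather than something to execute per insert.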

Would Postgres really update the page file when fields are all equal before and after an update?

I am working on a little website crawler program. I use PostgreSQL to store data and use a statement like this to update it:
INSERT INTO topic (......) VALUES (......)
ON CONFLICT DO UPDATE /* update all fields here */
The question is: if all fields before the update and after the update are really equal, would PostgreSQL really perform the update?
Postgres (like nearly all other DBMSs) will not check whether the target values differ from the original ones. So the answer is: yes, it will update the row even if the new values are identical to the old ones.
However, you can easily prevent the "empty" update in this case by including a where clause:
INSERT INTO topic (......)
VALUES (......)
ON CONFLICT (...)
DO UPDATE
set ... -- update all columns
WHERE topic IS DISTINCT FROM excluded;
The WHERE clause prevents updating a row that is identical to the one being inserted. For that to work correctly, your INSERT has to list all columns of the target table; otherwise the topic IS DISTINCT FROM excluded condition will always be true, because the excluded row has fewer columns than the topic row and is therefore "distinct" from it.
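Spelled out with hypothetical columns (the question doesn't show topic's real definition; assume it has exactly the columns url and title, with a unique constraint on url):
INSERT INTO topic (url, title)
VALUES ('https://example.com/page', 'Example title')
ON CONFLICT (url)
DO UPDATE
   SET title = excluded.title
WHERE topic IS DISTINCT FROM excluded;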
Adding a check for modified values has been discussed multiple times on the mailing list and has always been rejected. The main reason is that it doesn't make sense to impose the overhead of checking for changes on every statement just to cope with a few badly written ones.

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, but I need that it works the same as in Postgres. There is a table that generates values from a sequence, but the values can also be explicitly given.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Note the nextval() call in the hist_id column's default, which supplies the primary key value.)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error (SQL0803N) when a generated value collides with an already inserted one, because the "internal sequence" does not take the explicitly inserted values into account and does not retry with the next value.
And, I do not want to restart the sequence each time a provided ID was inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as it does in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids will never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (eg, JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? It isn't efficient - for one thing, attempting to do so would have to do one of the following:
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best not to have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this: SEQUENCE. It provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow that. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id
NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as @mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which works through the queue will delete all processed IDs and thereby take care of duplicates. (BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem.)
But also I want the documents to be processed in order: Always shall documents with smaller IDs be processed first.
Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table.
Ideally given that I have only one column, this should be the primary key. But the primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disk (PostgreSQL does that for every additional index) and double the number of disk operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless, would it become kind of the primary index which is directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
One column with an integer
Allow duplicates
Efficient ORDER BY+LIMIT on the column
INSERTs should not do any query in that table or use any kind of unique index. INSERTs shall just locate the best page of the main file/main btree for this table and insert the row between other rows, ordered by ID.
INSERTs will happen in bulk and must not fail, except for disk full, etc.
There shall not be additional btree files for this table, so no secondary indexes
The rows should occupy not much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: Add a PRIMARY KEY covering the integer and also a dummy column, like OIDs, a timestamp or a SERIAL.
Another solution would use a hypothetical INSERT IF NOT EXISTS, a nested transaction, or a special INSERT with a WHERE. All of these would add a btree lookup when inserting.
Also they might cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
You said
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. Therefore duplicates must be possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
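A minimal sketch of that error trapping, assuming dirty_documents(document_id) carries a primary key or unique constraint; the function name mark_dirty is hypothetical (on PostgreSQL 9.5 and later you could instead write INSERT ... ON CONFLICT DO NOTHING):
CREATE OR REPLACE FUNCTION mark_dirty(p_document_id integer) RETURNS void AS $$
BEGIN
    INSERT INTO dirty_documents (document_id) VALUES (p_document_id);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- already marked dirty, nothing to do
END;
$$ LANGUAGE plpgsql;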
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now let's delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took .054 ms. That's less than half the time it took using the table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took .054 ms. That's roughly five times faster than the delete on the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queuing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.

How to use BULK INSERT when rows depend on foreign keys values?

My question is related to this one I asked on ServerFault.
Based on this, I've considered the use of BULK INSERT. I now understand that I have to prepare a file for each entity I want to save into the database. No matter what, I still wonder whether this BULK INSERT will avoid the memory issue on my system described in the referenced question on ServerFault.
As for the Streets table, it's quite simple! I have only two cities and five sectors to care about as the foreign keys. But then, how about the Addresses? The Addresses table is structured like this:
AddressId int not null identity(1,1) primary key
StreetNumber int null
NumberSuffix_Value int not null DEFAULT 0
StreetId int null references Streets (StreetId)
CityId int not null references Cities (CityId)
SectorId int null references Sectors (SectorId)
As I said on ServerFault, I have about 35,000 addresses to insert. Shall I memorize all the IDs? =P
And then, I have the citizens to insert, who are associated with the addresses.
PersonId int not null identity(1,1) primary key
Surname nvarchar not null
FirstName nvarchar not null
IsActive bit
AddressId int null references Addresses (AddressId)
The only thing I can think of is to force the IDs to static values, but then I lose any flexibility that I had with my former approach of the INSERT..SELECT strategy.
What are then my options?
If I force the IDs to always be the same, then I have to SET IDENTITY_INSERT ON so that I can force the values into the table; this way I always have the same IDs for each of my rows, just as suggested here.
How do I BULK INSERT with foreign keys? I can't find any docs on this anywhere. =(
Thanks for your kind assistance!
EDIT
I edited in order to include the BULK INSERT SQL instruction that finally made it for me!
I had my Excel workbook ready with the information I needed to insert. So, I simply created a few supplemental worksheets and began writing formulas to "import" the data into these new sheets. I had one for each of my entities:
Streets;
Addresses;
Citizens.
As for the two other entities, it wasn't worth bulk inserting them, as I had only two cities and five sectors (city subdivisions) to insert. Once both the cities and sectors were inserted, I noted their respective IDs and began to prepare my record sets for the bulk insert. Using the power of Excel to compute the values and to "import" the foreign keys was a charm in itself, by the way. Afterwards, I saved each of the worksheets to a separate CSV file. My records were then ready to be bulk inserted.
USE [DatabaseName]
GO
delete from Citizens
delete from Addresses
delete from Streets
BULK INSERT Streets
FROM N'C:\SomeFolder\SomeSubfolder\Streets.csv'
WITH (
FIRSTROW = 2
, KEEPIDENTITY
, FIELDTERMINATOR = N','
, ROWTERMINATOR = N'\n'
, CODEPAGE = N'ACP'
)
GO
FIRSTROW
Indicates the row number at which to begin the insert. In my situation, my CSVs contained the column headers, so the second row was the one to begin with. As an aside, one could start anywhere in the file, say at the 15th row.
KEEPIDENTITY
Allows one to bulk-insert the entity IDs specified in the file even though the table has an identity column. This parameter is the same as running SET IDENTITY_INSERT my_table ON before a row insert when you wish to insert with a precise id.
As for the other parameters, they speak by themselves.
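For comparison, the hand-rolled equivalent of KEEPIDENTITY when inserting rows directly (the Name column here is hypothetical; only StreetId is known from the schema above):
SET IDENTITY_INSERT Streets ON;
INSERT INTO Streets (StreetId, Name) VALUES (42, N'Main Street');
SET IDENTITY_INSERT Streets OFF;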
Now that this is explained, the same code was repeated for each of the two remaining entities to insert, Addresses and Citizens. And because KEEPIDENTITY was specified, all of my foreign keys remained intact, even though my primary keys were set as identities in SQL Server.
One last tweak though, exactly what marc_s said in his answer: import your data as fast as you can into a staging table with no restrictions at all. This way, you're going to make your life much easier, while still following good practices. =)
The basic idea is to bulk insert your data into a staging table that doesn't have any restrictions, any constraints etc. - just bulk load the data as fast as you can.
Once you have the data in the staging table, then you need to start to worry about constraints etc. when you insert the data from the staging table into the real tables.
Here, you could e.g.
insert only those rows into your real work tables that match all the criteria (and mark them as "successfully inserted" in your staging table)
handle all rows that are left in the staging table that aren't successfully inserted by some error / recovery process - whatever that could be: printing a report with all the "problem" rows, tossing them into an "error bin" or whatever - totally up to you.
Key point is: the actual BULK INSERT should be into a totally unconstrained table - just load the data as fast as you can - and only then in a second step start to worry about constraints and lookup data and references and stuff like that
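A rough sketch of that two-step flow for the Addresses load; the staging table, its columns, and the Name columns used to match Cities, Streets, and Sectors are all hypothetical:
CREATE TABLE Staging_Addresses (
    StreetNumber       int          NULL,
    NumberSuffix_Value int          NOT NULL DEFAULT 0,
    StreetName         nvarchar(80) NULL,
    CityName           nvarchar(80) NOT NULL,
    SectorName         nvarchar(80) NULL
);

BULK INSERT Staging_Addresses
FROM N'C:\SomeFolder\SomeSubfolder\Addresses_staging.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = N',', ROWTERMINATOR = N'\n', CODEPAGE = N'ACP');

-- second step: resolve the foreign keys against the real tables
INSERT INTO Addresses (StreetNumber, NumberSuffix_Value, StreetId, CityId, SectorId)
SELECT s.StreetNumber, s.NumberSuffix_Value, st.StreetId, c.CityId, sec.SectorId
FROM Staging_Addresses s
JOIN Cities c         ON c.Name   = s.CityName
LEFT JOIN Streets st  ON st.Name  = s.StreetName
LEFT JOIN Sectors sec ON sec.Name = s.SectorName;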