Postgres - How to delete a row with its sequence? [duplicate] - postgresql

This question already has an answer here:
Why in PostgreSQL when you delete a row in a table the id number of the future inserted row is not sequential?
(1 answer)
Closed 10 months ago.
I created a table called Product with a serial ID in Postgres:
CREATE TABLE PRODUCTS (
PRODUCT_ID serial PRIMARY KEY,
NAME VARCHAR ( 100 ) NOT NULL );
If I deleted a row with
DELETE FROM PRODUCTS WHERE PRODUCT_ID = 2
The next product to insert will have the id: 3 even if the Product_Id: 2 was already deleted, I'd like to delete the data with it's sequence so the next insert has the id: 2.

It's a generally bad idea. IDs are usually not the place where you need to fill the gaps.
However, if you are only looking to do this manually, then you can insert the new row with including the ID value by hand, then sequence will not be used.
You can reset the sequence with Postgres functions (eg. setval) , but you really shouldn't do it except in some database recovery cases maybe.
Just get used to the fact that IDs will not be sequential. That's the norm. If you have time going around chasing sequence numbering, you might be doing something productive instead.
If you do want sequential IDs for some non changing things like categories, just export and recreate table after you've inserted all the data, omitting ID column when importing. But remember that IDs are usually used by other tables, so you might have to update all your database every time you want to nicely renumber your IDs.

Related

Adding a Serial Column to existing table with 100,000,000 rows

I have a table with roughly 100,000,000 rows. We need to delete around 80,000 of them for a remediation.
In order to prevent downtime, I have a job setup to grab the records that needs to be deleted and then processes the delete in chunks of 100. However, even processing the first 100 is taking forever.
There is no primary ID on this table and the only way I can reliably reference each row is with a unique column called tx which is a varchar(250)` (though the field is never longer than 18-20 characters). I created an index on this row, but still takes roughly 4-6s to select a row.
Seemed likely the varchar was causing the problem, so I wanted to add a new id bigint serial column, but was trying to figure out whether or not doing this would lock the table until it's able to populate all of the ID's.
I know alter table add column is non blocking as long as there is no default value. But does Serial count as a default value?
I couldn't find an answer to this in the documentation. We're on Postgres 12.
Adding a new column with a sequence-generated value will rewrite the table, which will cause down time. With some care, it could be done without down time, but that is complicated and not worth the effort if you already have a varchar column with a unique index on it that does not contain NULL values.
Searching for rows with the existing index should be a matter of milliseconds. If it isn't, that's the problem you have to solve. Can you add EXPLAIN (ANALYZE, BUFFERS) output for the query to the question?

How to achieve reliable db reads calls with concurrency

Consider you have the very simple table definition of:
CREATE TABLE first_name
(
id Integer NOT NULL,
name varchar(10),
PRIMARY KEY id
);
Now consider you have two rows like:
id , name
1 Dan
2 Jack
Imagine you have X processes that read from time to time the max(id) value,
then decide what is the sequential id to be written as the new record.
The problem is when having multiple processes like that, while reading we can already have another id entered.
What is the best option to guarantee in postgres an atomic action of read latest id and then write the next one, when having multiple processes doing the same all the time?
I know we have the Serial type (like mysql autoincrement) which allows automatic management of field updates in a sequential manner, how will it perform when multiple processes won't have any lock mechanism applied and just the serial definition, is it sufficient? are we protected here for concurrency problem?
Example for the second declaration from point 2:
CREATE TABLE first_name
(
id Serial,
name varchar(10),
PRIMARY KEY id
);
To get the max just query the serial value:
select currval(pg_get_serial_sequence('first_name', 'id'));
for example:
clima=# CREATE TABLE first_name
(
id serial,
name varchar(10)
);
CREATE TABLE
clima=# insert into first_name(name) select 'Diego';
INSERT 0 1
clima=# select currval(pg_get_serial_sequence('first_name', 'id'));
currval
---------
1
(1 row)
Yes and No, Depends, have you transactions... with witch scope?
The whole point of SERIAL is that the database solves the concurrency issue for you.
With respect to postgresql:
Here's a page from the postgresql documentation (Data Types/Numeric Types/Serial Types) which tells you that SERIAL columns are built on sequences.
Note: Because smallserial, serial and bigserial are implemented using sequences...
Here we see sequence generators, CREATE SEQUENCE a postgresql (only?) construct that lets you make your own integer sequences without tying them to an (identity) column. It discusses the semantics, which includes the property that not every sequential number might appear in your sequence (because the sequence ids are generated even if the row isn't actually added to the table, e.g., if the inserting transaction is rolled back).
Because nextval and setval calls are never rolled back, sequence objects cannot be used if "gapless" assignment of sequence numbers is needed. It is possible to build gapless assignment by using exclusive locking of a table containing a counter; but this solution is much more expensive than sequence objects, especially if many transactions need sequence numbers concurrently.
(Also you can "cache" sequence generation but then you have issues with non-sequential sequence ids).
Finally, here we see you can also use GENERATED AS IDENTITY to the same effect, standard SQL.

Indexing for efficient querying and pagination of financial data in PostgreSQL

I'm working in an API that needs to return a list of financial transactions. These records are held in 6 different tables, but all have 3 common fields:
transaction_id int NOT NULL,
account_id bigint NOT NULL,
created timestamptz NOT NULL
note: might have actually
been a good use of table in inheritance in postgresql but it wasn't done like that.
The business requirement is to return all transactions for a given account_id in 1 list sorted by created in descending order (similar to an online banking page where your last transaction is at the top). Originally, they want to paginate in groups of 50 records, but I've got them to do it on date ranges (believing that I can do it more efficiently in the database than using offset and limits).
My intent is to create an index on each of these tables like this:
CREATE INDEX idx_table_1_account_created ON table_1(account_id, created desc);
ALTER TABLE table_1 CLUSTER ON idx_table_1_account_created;
Then finally to create a view to union all of the records from the 6 tables into one list and then obviously the records from the 6 tables will need to be *resorted" to come up with a unified list (in the correct order). This call will look like:
SELECT * FROM vw_all_transactions
WHERE account_id = 12345678901234
AND created >= '2014-01-01' AND created < '2014-02-01'
ORDER BY created desc;
My question is related to creating the indexing and clustering scheme. Since the records are going to have to be resorted by the view anyway is there any reason to do specify the individual indexes as created desc? And does sorting this way have any penalties when periodicially calling CLUSTER;
I've done some googling and reading but can't really seem to find any information that answers how this clustering is going to work.
Using PostgreSQL 9.2 on Heroku.

Duplicate Key error when using INSERT DEFAULT

I am getting a duplicate key error, DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, when I try to INSERT records. The primary key is one column, INTEGER 4, Generated, and it is the first column.
the insert looks like this: INSERT INTO SCHEMA.TABLE1 values (DEFAULT, ?, ?, ...)
It's my understanding that using the value DEFAULT will just let DB2 auto-generate the key at the time of insert, which is what I want. This works most of the time, but sometimes/randomly I get the duplicate key error. Thoughts?
More specifically, I'm running against DB2 9.7.0.3, using Scriptella to copy a bunch of records from one database to another. Sometimes I can process a bunch with no problems, other times I'll get the error right away, other times after 2 records, or 20 records, or 30 records, etc. Does not seem to be a pattern, nor is it the same record every time. If I change the data to copy 1 record instead of a bunch, sometimes I'll get the error one time, then it's fine the next time.
I thought maybe some other process was inserting records during my batch program, and creating keys at the same time. However, the tables I'm copying TO should not have any other users/processes trying to INSERT records during this same time frame, although there could be READS happening.
Edit: adding create info:
Create table SCHEMA.TABLE1 (
SYSTEM_USER_KEY INTEGER NOT NULL
generated by default as identity (start with 1 increment by 1 cache 20),
COL2...,
)
alter table SCHEMA.TABLE1
add constraint SYSTEM_USER_SYSTEM_USER_KEY_IDX
Primary Key (SYSTEM_USER_KEY);
You most likely have records in your table with IDs that are bigger then the next value in your identity sequence. To find out what the current value your sequence is about at, run the following query.
select s.nextcachefirstvalue-s.cache, s.nextcachefirstvalue-s.increment
from syscat.COLIDENTATTRIBUTES as a inner join syscat.sequences as s on a.seqid=s.seqid
where a.tabschema='SCHEMA'
and a.TABNAME='TABLE1'
and a.COLNAME='SYSTEM_USER_KEY'
So basically what happened is that somehow you got records in your table with ids that are bigger then the current last value of your identity sequence. So sooner or later these ids will collide with identity generated ids.
There are different reasons on how this could have happened. One possibility is that data was loaded which already contained values for the id column or that records were inserted with an actual value for the ID. Another option is that the identity sequence was reset to start at a lower value than the max id in the table.
Whatever the cause, you may also want the fix:
SELECT MAX(<primary_key_column>) FROM onsite.forms;
ALTER TABLE <table> ALTER COLUMN <primary_key_column> RESTART WITH <number from previous query + 1>;

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which processes the work queue will delete all processes IDs and therefore take care of duplicates. (BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem)
But also I want the documents to be processed in order: Always shall documents with smaller IDs be processed first.
Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table.
Ideally given that I have only one column, this should be the primary key. But the primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (PostgreSQL does that for every additional index) and use the double amount of disc operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless, would it become kind of the primary index which is directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
One column with an integer
Allow duplicates
Efficient ORDER BY+LIMIT on the column
INSERTs should not do any query in that table or any kind of unique index. INSERTs shall just locate the best page for the main file/main btree for this table and just insert the row in between to other rows, ordered by ID.
INSERTs will happen in bulk and must not fail, expect for disc full, etc.
There shall not be additional btree files for this table, so no secondary indexes
The rows should occupy not much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: Add a PRIMARY KEY covering the integer and also a dummy column, like OIDs, a timestamp or a SERIAL.
Another solution would either use a hypothetical INSERT IF NOT EXISTS, or nested transaction or a special INSERT with a WHERE. All these solutions would add a query of the btree when inserting.
Also they might cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
You said
Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now lets delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took .054 ms. That's about half the time it took using a table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took .054 ms. That's roughly 50 times faster than the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queuing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system. ZeroMQ, RabbitMQ, PGQ, etc etc etc etc. There are lots to choose from and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.