PostgreSQL - Creating index on multiple partitioned tables

I am trying to create indexes on multiple (1000) partitioned tables. Since I'm using Postgres 10.2, I have to do this for each partition separately, executing 1000 statements in total.
I have figured out how to do it, and it works in environments where the table sizes and transaction volumes are small. Below is the query to be executed for one of the tables (to be repeated for all the tables: user_2, user_3, etc.):
CREATE INDEX IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
where user_data is a jsonb column
This query does not work for large tables with a high number of transactions when I run it for all the tables at once. The error thrown:
ERROR: SQL State : 40P01
Error Code : 0
Message : ERROR: deadlock detected
Detail: Process 77999 waits for ShareLock on relation 1999264 of database 16311; blocked by process 77902.
Process 77902 waits for RowExclusiveLock on relation 1999077 of database 16311; blocked by process 77999
I am able to run it in small batches (of 25 each) - still encountering the issue at times, but succeeding when I retry once or twice. The smaller the batch, the lower the chance of a deadlock.
I would think this happens because all the user tables (user_1, user_2, etc.) are linked to the parent table user. I don't want to lock the entire table for the index creation (since in theory only one table is being modified at a time). Why does this happen, and is there any way around it, to ensure that the index is created without deadlocks?

This worked:
CREATE INDEX CONCURRENTLY IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
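CREATE INDEX CONCURRENTLY builds the index without holding a lock that blocks concurrent writes, which avoids the lock contention that caused the deadlocks. Two caveats: it cannot run inside a transaction block, and if a concurrent build fails it leaves an INVALID index behind that has to be dropped and recreated. A minimal sketch for spotting such leftover invalid indexes (a plain catalog query, nothing specific to this schema):
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;   -- concurrent builds that failed show up here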

Related

DELETE then INSERT is randomly raising a duplicate key violates unique constraint error

I have a query (for a website) that replaces old data with new data.
I run the query in one call to the database via the PHP pg_query function, and I also use pgbouncer in transaction pooling mode. I would be very surprised if two of the same queries were running at the same time, but is that the only explanation for this? I don't have any triggers or SERIAL columns on the table.
CREATE TABLE mydata (
id INT NOT NULL,
val TEXT NOT NULL
);
ALTER TABLE mydata ADD CONSTRAINT mydata_unique UNIQUE (id);
The statements that raise the conflict are:
DELETE FROM mydata WHERE id IN (1,2,3);
INSERT INTO mydata (id,val) VALUES (1,'one');
INSERT INTO mydata (id,val) VALUES (2,'two');
INSERT INTO mydata (id,val) VALUES (3,'three');
Version PostgreSQL 12.2
I assume that you are not running these statements in parallel, but one after the other.
Still, this could easily cause conflicts if several database sessions are doing the same thing at the same time: a second session may insert rows after the first session deleted the old rows, but before it inserted the new rows.
To protect yourself from that with row locks, run all statements in a single transaction. This may occasionally lead to a deadlock, which is no big deal - just repeat the transaction that failed.
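A minimal sketch of that single-transaction approach, assuming the same mydata table as above:
BEGIN;
DELETE FROM mydata WHERE id IN (1,2,3);   -- takes row locks on the existing rows
INSERT INTO mydata (id,val) VALUES (1,'one');
INSERT INTO mydata (id,val) VALUES (2,'two');
INSERT INTO mydata (id,val) VALUES (3,'three');
COMMIT;
If the transaction fails with a deadlock, retry it, as noted above.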

Is row level locking atomic in PostgreSQL?

I have a table in PostgreSQL which I'd like to treat as a queue. I have some selection criteria which I'm using to lock and then delete rows from the table like this:
DELETE FROM queue
WHERE itemid IN (
SELECT itemid
FROM queue
WHERE some_column = 'some value'
ORDER BY itemid
FOR UPDATE SKIP LOCKED
)
RETURNING *;
How does row locking work in PostgreSQL? When the SELECT query is executed, will it lock all matching rows atomically? I'm asking because grouping is important to me, and I want to process all rows where some_column='some value' in the same worker.
Clarification: What I really want to know is whether it can happen that two workers are executing the same query (the one above) for the same parameters (some value) and one of them locks a few rows for update and the other worker picks up the rest. This is what I'd like to avoid. What I expect to happen is that one of the workers will get all the rows (if row locking is atomic) and the other one gets nothing. Is this the case?
If two of your queries are running concurrently, each of them can return and delete some of the rows in the table. In that sense, your query is not atomic.
You should serialize your processes, either outside the database or using PostgreSQL advisory locks.
Since you're working on a queuing table, be sure to check out SKIP LOCKED:
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
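If you do choose to serialize inside the database, a transaction-scoped advisory lock keyed on the group value is one way to do it. A minimal sketch, assuming the queue table from the question and using hashtext() to map the text value to an advisory-lock key:
BEGIN;
-- only one worker can hold the lock for 'some value' at a time;
-- it is released automatically at COMMIT or ROLLBACK
SELECT pg_advisory_xact_lock(hashtext('some value'));
DELETE FROM queue
WHERE some_column = 'some value'
RETURNING *;
COMMIT;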

Deleting rows in Postgres table using ctid

We have a table with nearly 2 billion events recorded. As per our data model, each event is uniquely identified by a combined primary key of 4 columns. Excluding the primary key, there are 5 B-tree indexes, each on a single, different column, so 6 B-tree indexes in total.
The recorded events span several years, and now we need to remove the data older than 1 year.
We have a time column with long values recorded for each event, and we use the following query:
delete from events where ctid = any ( array (select ctid from events where time < 1517423400000 limit 10000) )
Do the indexes get updated?
During testing, they didn't.
After insertion,
total_table_size - 27893760
table_size - 7659520
index_size - 20209664
After deletion,
total_table_size - 20226048
table_size - 0
index_size - 20209664
Reindexing can be done:
Command: REINDEX
Description: rebuild indexes
Syntax:
REINDEX { INDEX | TABLE | DATABASE | SYSTEM } name [ FORCE ]
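For the events table above, that could look like the following (just a sketch; note that REINDEX locks out writes to the table while it runs):
REINDEX TABLE events;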
That said, the partitioning method suggested by a_horse_with_no_name is the better solution.
What we had:
Postgres version 9.4.
1 table with 2 billion rows, 21 columns (all bigint), a 5-column combined primary key, and 5 individual column indexes, with data spanning 2 years.
It looks similar to time-series data, with a time column containing UNIX timestamps, except that it is an analytics project, so time does not increase in order. The table was insert- and select-only (most select queries use aggregate functions).
What we need: our data span is 6 months, and we need to remove the old data.
What we did (with limited knowledge of Postgres internals):
Delete rows in batches of 10000.
Initially the deletes were fast, taking milliseconds; as the bloat increased, each batch delete slowed to nearly 10 s. Then autovacuum got triggered and ran for almost 3 months. The insert rate was high, and each batch delete increased the WAL size too. Poor statistics on the table made the current queries so slow that they ran for minutes and hours.
So we decided to go for partitioning. We implemented it using table inheritance in 9.4.
Note: Postgres has declarative partitioning from version 10, which handles most of the manual work needed when partitioning with table inheritance.
Please go through the official docs, as they have a clear explanation.
Simplified, this is how we implemented it (see the sketch after this list):
Create the parent table.
Create child tables inheriting from it, with check constraints. (We had monthly partitions, created using a scheduler.)
Indexes need to be created separately for each child table.
To drop old data, just drop the child table, so no vacuum is needed and the removal is instant.
Make sure the Postgres setting constraint_exclusion is set to partition.
VACUUM ANALYZE the old partition after you start inserting into the new partition. (In our case, it helped the query planner use an index-only scan instead of a sequential scan.)
Using triggers as mentioned in the docs may make the inserts slower, so we deviated from that: since we partitioned on the time column, we calculated the child table name at the application level from the time value before every insert, and it did not affect our insert rate.
Also read the other caveats mentioned there.
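A minimal sketch of the inheritance-based setup described above, using hypothetical names (an events parent table, a bigint time column in milliseconds, monthly children):
-- parent table; rows are never inserted here directly
CREATE TABLE events (
event_id bigint NOT NULL,
time bigint NOT NULL
-- other columns omitted
);
-- one monthly child; the CHECK constraint is what enables constraint exclusion
CREATE TABLE events_2018_02 (
CHECK ( time >= 1517443200000 AND time < 1519862400000 )
) INHERITS (events);
-- indexes must be created per child table
CREATE INDEX ON events_2018_02 (time);
-- the application computes the child table name from the time value and inserts directly
INSERT INTO events_2018_02 (event_id, time) VALUES (1, 1517443200123);
-- constraint_exclusion should be 'partition' (the default) so the planner skips irrelevant children
SHOW constraint_exclusion;
-- dropping old data is instant: no DELETE, no VACUUM needed
DROP TABLE events_2017_01;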

How to obtain an indefinite lock on database in Rails (postgres database) for QA purposes?

I'm trying to obtain an indefinite lock on my PostgreSQL database (specifically on a table called orders) for QA purposes. In short, I want to know if certain locks on a table prevent or indefinitely block database migrations that add columns (I think ALTER TABLE grabs an ACCESS EXCLUSIVE lock).
My plan is to:
grab a table lock or a row lock on the orders table
run the migration to add a column (an ALTER TABLE statement that grabs an ACCESS EXCLUSIVE LOCK)
issue a read statement to see if (2) is blocked (the ACCESS EXCLUSIVE LOCK blocks reads, and so this would be a problem that I'm trying to QA).
How would one do this? How do I grab a row lock on a table called orders via the Rails Console? How else could I do this?
Does my plan make sense?
UPDATE
It turns out that transactions holding row-level locks actually do block ALTER TABLE statements that grab an ACCESS EXCLUSIVE lock, such as migrations that add columns. For example, when I run this code in one process:
Order.first.with_lock do
binding.pry
end
It blocks my migration in another process that adds a column to the orders table. That migration's queued ACCESS EXCLUSIVE lock request then blocks all subsequent reads and SELECT statements on the orders table, causing problems for end users.
Why is this?
Let's say you're in a transaction, selecting rows from a table with various where clauses. Halfway through, some other transaction adds a column to that table. Now you are getting back more fields than you did previously. How is your application supposed to handle this?
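As a side note for the QA setup: you can confirm the blocking chain from a third session while the locks are held. A minimal sketch, assuming PostgreSQL 9.6 or later (pg_blocking_pids was added in 9.6):
-- sessions that are currently waiting, and the PIDs they are waiting on
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;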

Postgresql table with one ID column, sorted index, with duplicate primary key

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.
I want to create a table for this queue, without OIDs, with just one column: the ID of the document as an integer. If a document's ID exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done.
The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.
Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
The process which processes the work queue will delete all processed IDs and thereby take care of duplicates. (BTW: there is another queue for the next step, so regarding race conditions the idea should be clean and have no problem.)
But I also want the documents to be processed in order: documents with smaller IDs should always be processed first.
Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table.
Ideally, given that I have only one column, this would be the primary key. But a primary key must not have duplicates, so it seems I can't do that.
Without the index, ORDER BY and LIMIT would be slow.
I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disk (PostgreSQL does that for every additional index) and double the amount of disk operations for that table.
What is the best thing to do?
Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?
Or is adding the second index harmless? Would it become a kind of primary index that lives directly in the primary tuple btree?
Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.
I want to have a table in PostgreSQL with these properties:
One column with an integer
Allow duplicates
Efficient ORDER BY+LIMIT on the column
INSERTs should not do any lookup in that table or in any kind of unique index. INSERTs shall just locate the best page in the main file/main btree for this table and insert the row in between the other rows, ordered by ID.
INSERTs will happen in bulk and must not fail, except for disk full, etc.
There shall not be additional btree files for this table, so no secondary indexes
The rows should not occupy much space, e.g. have no OIDs
I cannot think of a solution that solves all of this.
My only solution would compromise on the last bullet point: add a PRIMARY KEY covering the integer and also a dummy column, like an OID, a timestamp, or a SERIAL.
Another solution would use a hypothetical INSERT IF NOT EXISTS, or nested transactions, or a special INSERT with a WHERE clause. All of these would add a btree lookup when inserting.
They might also cause deadlocks.
(Also posted here: https://dba.stackexchange.com/q/45126/7788)
You said
Many parts of my system would mark documents as dirty and therefore
insert IDs to process into that table. Therefore duplicates must be
possible.
and
5 rows with the same ID mean the same thing as 1 or 10 rows with that
same ID: They mean that the document with that ID is dirty.
You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.
A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.
Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
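On PostgreSQL 9.5 and later (newer than what was current when this was written), you can also sidestep the duplicate key error instead of trapping it. A minimal sketch, assuming a dirty_documents table with a primary key on document_id, as created further down in this answer:
INSERT INTO dirty_documents (document_id)
SELECT generate_series(1, 100)
ON CONFLICT (document_id) DO NOTHING;   -- duplicates are silently skipped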
An implementation that allows duplicates . . .
create table dirty_documents (
document_id integer not null
);
create index on dirty_documents (document_id);
Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.
Pick the first dirty document ID number for cleaning up.
select min(document_id)
from dirty_documents;
document_id
--
1
Took only 0.136 ms. Now let's delete every row that has document ID 1.
delete from dirty_documents
where document_id = 1;
Took 0.272 ms.
Let's start over.
drop table dirty_documents;
create table dirty_documents (
document_id integer primary key
);
insert into dirty_documents
select generate_series(1,100000);
Took 500 ms. Let's find the first one again.
select min(document_id)
from dirty_documents;
Took .054 ms. That's about half the time it took using a table that allowed duplicates.
delete from dirty_documents
where document_id = 1;
Also took .054 ms. That's roughly 5 times faster than the other table.
Let's start over again, and try an unindexed table.
drop table dirty_documents;
create table dirty_documents (
document_id integer not null
);
insert into dirty_documents
select generate_series(1,100000);
insert into dirty_documents
select generate_series(1, 100);
insert into dirty_documents
select generate_series(1, 50);
insert into dirty_documents
select generate_series(88000, 93245);
insert into dirty_documents
select generate_series(83000, 87245);
Get the first document.
select min(document_id)
from dirty_documents;
Took 32.5 ms. Delete those documents . . .
delete from dirty_documents
where document_id = 1;
Took 12 ms.
All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
Reading between the lines, I think you're trying to implement a work-queueing system.
Stop. Now.
Work queueing is hard. Work queuing in a relational DBMS is very hard. Most of the "clever" solutions people come up with end up serializing work on a lock without them realising it, or they have nasty bugs in concurrent operation.
Use an existing message/task queueing system: ZeroMQ, RabbitMQ, PGQ, etc. There are lots to choose from, and they have the significant advantages of (a) working and (b) being efficient. You'll most likely need to run an external helper process or server, but the limitations of the relational database model tend to make that necessary.
The scheme you seem to be envisioning, as best as I can guess, sounds like it'll suffer from hopeless concurrency problems when it comes to failure handling, insert/delete races, etc. Really, do not try to design this yourself, especially when you don't have a really good grasp of the underlying concurrency and performance issues.