postgresql CLUSTER command not clearing dead tuples

We have a background process (linux daemon in an infinite loop) that automatically takes all lines from csv files that are placed in a certain directory and imports them into a table. The daemon processes any files that appear in the directory one by one, is written in python, and uses psycopg2 to connect to our postgresql database.
That process imports those records using INSERT statements, but first DELETES any table records that have the same unique key as any of the records in the csv file. Generally the process is DELETING a record for every record it INSERTS. So as this daemon is running in the background it is DELETING and then INSERTING rows. Every time it processes one file it specifically commits the transaction, closes the cursor, and then closes the connection.
Periodically (twice a day) we want to run CLUSTER to remove the dead tuples and keep the table to a manageable on disk size.
However, something in this process is stopping the CLUSTER command from removing the dead tuples for all the records that are being deleted as the process is running. We know this is happening because if we run CLUSTER while the process is running, the on-disk size of the table containing this imported data does not decrease and pg_stat_user_tables shows many dead tuples.
If we stop the process and then run CLUSTER, the on-disk size of the table decreases dramatically and pg_stat_user_tables reports that all of the dead tuples are gone.
What's strange is we are committing the transaction and closing the connections every time we process each file, so I have no idea what is not allowing the dead tuples to be removed while the process is running.
Also strange is that if we stop the process, start it again, and then run CLUSTER, it will remove all of the dead tuples created by the previous run of the daemon; but any subsequent calls to CLUSTER will not clear any dead tuples created by the current run of the daemon (while it is still running, of course).
So something is maintaining some kind of link to the dead tuples until the process is stopped, even though we have committed the transaction and closed all connections to PostgreSQL that created those dead tuples. pg_locks does not report any open locks and no running transactions are reported, so it doesn't seem like it's a lock or open-transaction issue.
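For reference, the kind of checks meant above can be run with a short script like this sketch (connection parameters are placeholders; the pg_stat_activity column names assume PostgreSQL 9.2 or later):

import psycopg2

# Sketch: look for open transactions and ungranted locks that could be
# pinning the dead tuples. Connection parameters are placeholders.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cur = conn.cursor()

# Sessions that are active or idle inside an open transaction.
cur.execute("""
    SELECT pid, state, xact_start, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY xact_start
""")
for row in cur.fetchall():
    print(row)

# Locks that have been requested but not granted.
cur.execute("SELECT locktype, relation::regclass, mode, granted FROM pg_locks WHERE NOT granted")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()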
At the end of the day, this is stopping us from periodically running CLUSTER on the table so that it doesn't keep growing and growing.
I'm sure there is a simple answer to this, but I can't find it anywhere. Some skeleton code for the process is below. It really is a simple process so I have no idea what is going on here. Any guidance would be greatly appreciated.
import os
import time

import psycopg2

while True:
    l = [(get_modified_time(fname), fname) for fname in os.listdir('/tmp/data')]
    l.sort()
    for (t, fname) in l:
        conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
        cursor = conn.cursor()
        # Calls a postgresql function that reads a file and imports it into
        # a table via INSERT statements and DELETEs any records that have the
        # same unique key as any of the records in the file.
        # Values are passed as parameters rather than interpolated into the
        # SQL string, which avoids quoting problems.
        cursor.execute("SELECT import(%s, %s);", (fname, t))
        conn.commit()
        cursor.close()
        conn.close()
        os.remove(get_full_pathname(fname))
    time.sleep(0.100)

What's wrong with autovacuum? When autovacuum does its job, you don't have to use CLUSTER to clean up dead tuples. CLUSTER isn't made for this; VACUUM is.
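As a quick check, pg_stat_user_tables shows whether autovacuum is keeping up with the churn; a minimal sketch (connection parameters and the table name are placeholders):

import psycopg2

# Sketch: dead-tuple counts and the last time (auto)vacuum touched the table.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cur = conn.cursor()
cur.execute("""
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = %s
""", ('your_table',))
print(cur.fetchone())
cur.close()
conn.close()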
If you change the process to UPDATE duplicates instead of DELETE + INSERT, things might get even better when combined with a lower FILLFACTOR: HOT updates. These are faster, reclaim space, keep the same physical order in storage, and need neither VACUUM nor CLUSTER.
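A rough sketch of that approach, assuming a table imported_data with a unique key column id and a payload column (all names are placeholders): the lower FILLFACTOR leaves free space on each page so updates can be HOT, and the UPDATE-then-INSERT replaces the DELETE + INSERT pair. (On PostgreSQL 9.5+ the pair could also be a single INSERT ... ON CONFLICT.)

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
cur = conn.cursor()

# One-time setup: leave free space on each page so updated row versions can
# stay on the same page (HOT). Only non-indexed columns may change for HOT
# to apply.
cur.execute("ALTER TABLE imported_data SET (fillfactor = 70);")

# Per record from the csv file: try an UPDATE first, INSERT only if the key
# is new.
record = {'id': 42, 'payload': 'example'}
cur.execute("UPDATE imported_data SET payload = %(payload)s WHERE id = %(id)s;", record)
if cur.rowcount == 0:
    cur.execute("INSERT INTO imported_data (id, payload) VALUES (%(id)s, %(payload)s);", record)

conn.commit()
cur.close()
conn.close()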

Related

Sudden Increase in row exclusive locks and connection exhaustion in PostgreSQL

I have a scenario that repeats itself every few hours: there is a sudden increase in row exclusive locks in the PostgreSQL DB. At the same time some queries are not answered in time, which causes connection exhaustion, so that PostgreSQL does not accept new clients anymore. After 2-3 minutes the lock and connection numbers drop and the system comes back to its normal state.
I wonder whether autovacuum can be the root cause of this. I see that analyze and vacuum (NOT VACUUM FULL) take about 20 seconds to complete on one of the tables. I have INSERT, SELECT, UPDATE and DELETE operations going on from my application, and no DDL commands (ALTER TABLE, DROP TABLE, CREATE INDEX, ...). Can the autovacuum process conflict with queries from my application and cause them to wait until the vacuum has completed? Or is it all my application's and my bad design's fault? I should add that one of my tables has a field of type jsonb that holds relatively large data in each row (roughly 10 MB).
I have attached an image from the monitoring application that shows the sudden increase in row exclusive locks.
ROW EXCLUSIVE locks are perfectly harmless; they are taken on any table against which DML statements run, so your graph reveals nothing. You should set log_lock_waits = on and log_min_duration_statement to a reasonable value; perhaps you can then spot something in the logs. Also watch out for long-running transactions.
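A minimal sketch of those suggestions (the 1-second threshold is an arbitrary example; ALTER SYSTEM needs superuser rights and PostgreSQL 9.4+, otherwise set the parameters in postgresql.conf):

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
cur = conn.cursor()

# Log any statement that waits longer than deadlock_timeout for a lock, and
# any statement that runs longer than 1 second (example threshold).
cur.execute("ALTER SYSTEM SET log_lock_waits = on;")
cur.execute("ALTER SYSTEM SET log_min_duration_statement = '1s';")
cur.execute("SELECT pg_reload_conf();")

# Spot long-running transactions.
cur.execute("""
    SELECT pid, state, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()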

Postgres insert slow after snapshot restore but not after restart

My setup
Postgres 11 running on an AWS EC2 t4g.xlarge instance (4 vCPU, 16GB) running Amazon Linux.
Set up to take a nightly disk snapshot (my workload doesn't require high reliability).
Database has table xtc_table_1 with ~6.3 million rows, about 3.2GB.
Scenario
To test some new data processing code, I created a new test AWS instance from the nightly snapshot of my production instance.
I create a new UNLOGGED table, and populate it with INSERT INTO holding_table_1 SELECT * FROM xtc_table_1;
It takes around 2 min 24 sec for the INSERT to execute.
I truncate holding_table_1 and run the INSERT again, and it completes in 30 sec. The ~30 second timing is consistent across successive truncate-and-repopulate cycles.
I think this may be because of some caching of data. I tried restarting the Postgres service, then rebooting the AWS instance (after stopping Postgres with sudo service postgresql stop), then stopping and starting the AWS instance. However, repopulating the table still takes ~30 sec.
If I rebuild a new instance from the snapshot, the first time I run the CREATE statement it's back to the ~2m+ time.
Similar behavior for other tables xtc_table_2, xtc_table_3.
Hypothesis
After researching and finding this answer, I wonder if what's happening is that the disk snapshot contains some WAL data that is replayed the first time I touch xtc_table_n, and that subsequently, because Postgres was shut down cleanly, there is no WAL left to play back.
Does this sound plausible?
I don't know enough about Postgres internals to be sure. I would have imagined that any WAL playback would happen on starting up postgres, but maybe it happens at the individual table level the first time a table is touched?
Knowing the reason is more than just theoretical; I'm using the test instance to do some tuning on some processing code, and need to be confident in having a consistent baseline to measure from.
Let me know if more information is needed about my setup or what I'm doing.
#jellycsc's suggestion was correct; adding more info here in case it's helpful to anyone else.
The problem I was encountering was not a postgres issue at all, but because of the way AWS handles volumes and snapshots.
From this page:
For volumes that were created from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.
I used the fio utility as described in the linked AWS page to initialize the restored volume, and first-time performance was consistent with subsequent query times.
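For completeness, that initialization pass can also be driven from a small Python wrapper around fio; the device path below is an assumption that must be replaced with the actual restored volume, and the flags mirror the sequential-read pass described in the AWS documentation:

import subprocess

# Sketch: read every block of the restored EBS volume once so that later
# queries hit the volume instead of lazily pulling blocks from S3.
# /dev/xvdf is an assumption; replace it with your actual device.
subprocess.run(
    [
        "sudo", "fio",
        "--filename=/dev/xvdf",
        "--rw=read",          # sequential read of the whole device
        "--bs=1M",
        "--iodepth=32",
        "--ioengine=libaio",
        "--direct=1",
        "--name=volume-initialize",
    ],
    check=True,
)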

Postgres: Repercussions to killing a large transaction?

I have a very large multi-million row transaction that I ended up needing to kill.
This transaction scanned a very large number of rows and created new rows in a new table if certain conditions were met.
This was inside a transaction block and did not complete before I killed the process. Are there any repercussions to killing the process and restarting the server? I do not even see the new tables in the database (presumably because the commit never happened). Can I just immediately try my migration again?
The answer depends on how you “killed” the transaction.
If you hit Ctrl+C or canceled the query with pg_cancel_backend or pg_terminate_backend, the transaction will have rolled back normally.
Any table you created in the session will be gone.
If you modified rows in pre-existing tables, the new rows will be dead and autovacuum will remove them.
At worst, you will have some bloat in some tables that will be reused by the next attempt at your transaction.
Similarly, if you used a regular kill to kill the backend process of the session, everything will be fine.
If you used kill -9 to kill the session's backend process, PostgreSQL will have gone into crash recovery.
Your database will be consistent after crash recovery, but it is possible that some files (belonging to newly created tables) will be left behind. Such orphans take up space and are never removed, and the only safe way to get rid of that wasted space is to dump the database and restore it to a new database cluster.
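For reference, a sketch of the "safe" cancellation path mentioned above, using pg_stat_activity and pg_cancel_backend (connection parameters and the query filter are placeholders):

import psycopg2

# Sketch: cancel a runaway statement from SQL instead of killing the server
# process with kill -9, so the transaction rolls back cleanly.
conn = psycopg2.connect("dbname='dbname' user='user' password='password'")
conn.autocommit = True
cur = conn.cursor()

# Find the backend(s) running the migration (filter is a placeholder).
cur.execute("""
    SELECT pid
    FROM pg_stat_activity
    WHERE query ILIKE %s
      AND pid <> pg_backend_pid()
""", ('%INSERT INTO new_table%',))

for (pid,) in cur.fetchall():
    # pg_cancel_backend() cancels the current statement; pg_terminate_backend()
    # closes the whole session. Either way the transaction rolls back normally.
    cur.execute("SELECT pg_cancel_backend(%s)", (pid,))

cur.close()
conn.close()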
Theoretically, yes. You should be able to just go ahead and try again. It might mean that some of the cleanup hasn't been performed yet, so there are some partial tables floating around taking up disk space, but nothing that should impact your data quality.

Deleting selective data from MongoDB Secondary Only

Is it possible to delete data from a single Mongo secondary instance, by running delete command directly on a secondary, without affecting the primary and other secondary instances?
Explanation: I want to purge a large collection of ~500 GB, holding ~500 million records. I want to keep the last few months of data, so I will have to remove ~400 million records. It is a replica set with one primary and 2 secondaries; the storage engine is WiredTiger. I do not want any downtime or slowness, as it is the production DB of a live transactional system. I am thinking of the options below:
Create a new collection, copy the last few months of records into it, and drop the old one. But copying such a huge amount of data will slow down the DB server.
Take a backup of the entire collection, then run a bulk delete with a batch size of 1000 (a sketch of this follows the list). This will take weeks to delete so many records, and will also create a huge oplog, since every delete produces an oplog entry that is synced to the secondaries. These oplog entries will take up a lot of disk space.
Another option is to run the bulk delete on one secondary only. Once the data is deleted, I promote it to primary and then run the same delete on the other 2 secondary instances. This would not affect the prod environment. Hence the question: can we run a delete on a secondary only? Once this secondary comes back into the cluster after the deletion, what will the behaviour of the sync process between primary and secondary be?
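For option (2), a batched delete might look roughly like the sketch below (database, collection and date-field names are placeholders); it deletes in chunks of 1000 and sleeps between batches to limit I/O and oplog pressure:

import time
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

# Sketch for option (2): batched deletes. All names and the URI are placeholders.
client = MongoClient("mongodb://primary-host:27017/")
coll = client.mydb.mycollection
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # keep the last ~3 months

while True:
    # Grab one batch of _ids older than the cutoff, then delete exactly those.
    ids = [d["_id"] for d in coll.find({"created_at": {"$lt": cutoff}},
                                       {"_id": 1}).limit(1000)]
    if not ids:
        break
    coll.delete_many({"_id": {"$in": ids}})
    time.sleep(0.5)  # throttle to reduce I/O and oplog pressure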
I ran a small test on a local MongoDB cluster. In principle it seems to work when you follow this procedure:
1. Shut down the secondary.
2. Restart the secondary as a standalone (you cannot perform any changes on a SECONDARY).
3. Connect to the standalone and delete the old data.
4. Shut down the standalone.
5. Restart the standalone normally as a replica set member.
6. Repeat steps (1) to (5) with the other secondary. You may run the above steps in parallel on all secondaries; however, then you have no redundancy in case of problems.
7. Promote one of the secondaries from above to primary.
8. Repeat steps (1) to (5) with the last node.
As I said, I did a "quick and dirty" test with a few documents and it seems to work.
However, I don't think it will work in your setup: the "delete old data" step will take some time, maybe hours or even days. When the deletion has finished, you will most likely run into this situation:
Resync a Member of a Replica Set:
A replica set member becomes "stale" when its replication process falls so far behind that the primary overwrites oplog entries the member has not yet replicated. The member cannot catch up and becomes "stale." When this occurs, you must completely resynchronize the member by removing its data and performing an initial sync.
That is, the initial sync would add all of the deleted data back again.
Perhaps there are hacks to force a "stale" member back to SECONDARY. Then you would have to drop the old PRIMARY and add it again as a SECONDARY. But by doing this you would lose all data newly inserted into production while the delete was running. I assume the application is constantly inserting new data (otherwise you would not have accumulated such a number of documents), and that data would be lost.
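A rough way to sanity-check that concern before trying the procedure is to compare the time span covered by the primary's oplog with how long the per-node maintenance is expected to take. A sketch using pymongo (the connection URI is a placeholder):

from pymongo import MongoClient

# Sketch: estimate the oplog window on the primary. If the delete on a
# detached node takes longer than this window, the node will come back
# "stale" and need a full initial sync. The URI is a placeholder.
client = MongoClient("mongodb://primary-host:27017/")
oplog = client.local.oplog.rs

first = oplog.find().sort("$natural", 1).limit(1).next()
last = oplog.find().sort("$natural", -1).limit(1).next()

window_seconds = last["ts"].time - first["ts"].time
print("oplog window: %.1f hours" % (window_seconds / 3600.0))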

Mongo continues to insert documents, slowly, long after script is quit

Do I have a zombie somewhere?
My script finished inserting a massive amount of new data.
However, the server continues with high lock rates and is slowly inserting new records. It's been about an hour since the script that did the inserts finished, and documents are still trickling in.
Where are these coming from, and how do I purge the queue? (I refactored the code to use an index and want to redo the process while avoiding the 100-200% lock rate.)
This could be caused by one of the following scenarios.
1. Throughput-bound disk I/O
Look at the following metrics using mongostat and the MongoDB Management Service:
Average flush time (how long MongoDB's periodic sync to disk is taking)
IOStats in the hardware tab (look specifically at IOWait)
If disk I/O is slower than the CPU can generate work, the inserts get queued up, and draining that queue can continue for quite a while after the client has finished. You can check the server status with db.serverStatus() and look at the "globalLock" field (writes acquire the global lock); its "currentQueue" sub-document shows the number of writers waiting in the queue, as in the sketch below.
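A minimal version of that check from pymongo (the URI is a placeholder); a persistently non-zero writer queue while no new work is being submitted points at the server still draining queued writes:

from pymongo import MongoClient

# Sketch: inspect the global-lock queue reported by serverStatus.
client = MongoClient("mongodb://mongo-host:27017/")
status = client.admin.command("serverStatus")

queue = status["globalLock"]["currentQueue"]
print("writers queued:", queue["writers"])
print("readers queued:", queue["readers"])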
2. The sharded cluster balancer is enabled (which is the default)
If you are working in a sharded cluster, the balancer kicks in during heavy write activity to keep the cluster in a balanced state, and it can keep moving chunks from one shard to another even after your script has completed. In such a case I would suggest keeping the balancer off during the bulk load; all your documents then go to a single shard, but the balancer can be turned back on during any downtime.
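One way to check the balancer state from pymongo, assuming you connect through a mongos (the URI is a placeholder); on older versions the state lives in config.settings, while on MongoDB 3.4+ the balancerStatus admin command and sh.stopBalancer() in the shell cover the same ground:

from pymongo import MongoClient

# Sketch: read the balancer state from the config database via mongos.
client = MongoClient("mongodb://mongos-host:27017/")
doc = client.config.settings.find_one({"_id": "balancer"})
enabled = not (doc and doc.get("stopped", False))
print("balancer enabled:", enabled)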
3. Write concern
This may also contribute slightly to the problem: if writes are issued with Acknowledged or Replica Acknowledged write concern, each write waits for the corresponding acknowledgement. Whether to relax this is up to you, based on the kind of data involved.
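A sketch of how a write concern is set per collection with pymongo (names and the URI are placeholders); w=1 waits only for the primary's acknowledgement, while w="majority" also waits for replication:

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Sketch: choose the write concern explicitly per collection.
client = MongoClient("mongodb://mongo-host:27017/")
coll = client.mydb.mycollection

# Acknowledged by the primary only vs. acknowledged by a majority of replicas.
fast_coll = coll.with_options(write_concern=WriteConcern(w=1))
safe_coll = coll.with_options(write_concern=WriteConcern(w="majority"))

fast_coll.insert_one({"example": True})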