If I understand correctly, when I delete a record and commit, Postgres updates the write-ahead log (WAL) and waits for the next checkpoint before flushing the changes to the data files.
My question is:
Is there any way I can recover the deleted record after it has been committed but before Postgres checkpoints?
Btw, why does this method reduce disk writes? Isn't the WAL an append-only log file?
I couldn't find any explanation of how to do this without paying Postgres engineers.
Changes to the datafiles must be written and flushed by the end of the next checkpoint, but they can also be written earlier if the pages need to be (or are expected to be) evicted to make room for other data to be read in. As soon as the corresponding WAL is flushed, the change to the datafile is eligible to be written.
There is no user-available way to recover the deleted record. You could force a crash, then interfere with the recovery process. But you would have to be quite an expert to pull this off. It would be easier to just retrieve the record from a backup and then re-insert it. You don't even need to be an engineer to make this happen, just have valid backups (possibly including WAL archives) and know how to use them. Or better yet, don't commit things you don't want committed.
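If you do go the backup route, here is a minimal sketch of pulling a single lost row back from a restored copy over postgres_fdw. Everything in it is hypothetical: it assumes the backup was restored into a separate database called restored_backup on the same host, and the table name (orders) and key (id = 42) stand in for your own.

    -- All names below (restored_backup, restore_srv, orders, id = 42) are placeholders.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER restore_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'restored_backup');
    CREATE USER MAPPING FOR CURRENT_USER SERVER restore_srv
        OPTIONS (user 'postgres');
    -- Expose only the table we need from the restored copy.
    CREATE SCHEMA restore_schema;
    IMPORT FOREIGN SCHEMA public LIMIT TO (orders)
        FROM SERVER restore_srv INTO restore_schema;
    -- Copy the deleted row back into the live table.
    INSERT INTO orders SELECT * FROM restore_schema.orders WHERE id = 42;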
The system is designed this way for crash safety, not for reduced disk writing.
I'm looking for an easy method to find the culprit process holding onto the transaction log and causing the pg_wal full issues.
The transaction log contains all transactions, and it does not contain a reference to the process that caused an entry to be written. So you cannot infer from WAL what process causes the data modification activity that fills your disk.
You can turn on logging (log_min_duration_statement = 0) and find the answer in the log file.
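For reference, a minimal sketch of enabling that cluster-wide from SQL (it can just as well go into postgresql.conf; beware that logging every statement can produce very large log files):

    ALTER SYSTEM SET log_min_duration_statement = 0;  -- log every statement with its duration
    SELECT pg_reload_conf();                          -- picked up on reload, no restart needed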
But I think that you are looking at the problem in the wrong way: the problem is not that WAL is generated, but that full WAL segments are not removed soon enough.
That can happen for a variety of reasons:
WAL archiving has problems or is too slow
a stale replication slot is blocking WAL removal
wal_keep_segments is too high
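A few catalog queries can help check the last two causes; this is a sketch assuming a reasonably recent PostgreSQL (wal_keep_segments was replaced by wal_keep_size in v13):

    SELECT slot_name, slot_type, active, restart_lsn
    FROM pg_replication_slots;    -- an inactive slot with an old restart_lsn blocks WAL removal

    SELECT last_archived_wal, last_failed_wal, failed_count
    FROM pg_stat_archiver;        -- failures here point at an archiving problem

    SHOW wal_keep_segments;       -- SHOW wal_keep_size on v13 and later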
I want to be able to stop PostgreSQL from checkpointing automatically.
I just want to fsync the WAL files to disk without writing out the changes made in shared_buffers.
I set checkpoint_segments and checkpoint_timeout to big values, but it still performs additional checkpoints.
I don't want a checkpoint even when it needs to evict pages or runs out of memory.
There are other causes for checkpoints:
Recovery has finished (could happen after a server crash).
Start of an online backup.
Database shutdown.
Before and after CREATE DATABASE.
After DROP DATABASE.
Before and after ALTER DATABASE SET TABLESPACE (you probably don't do that every day).
During DROP TABLESPACE (ditto).
And, of course, an explicit CHECKPOINT command.
I hope I haven't forgotten anything – could one of these cause the checkpoints you observe?
Set log_checkpoints to on, then the log file will show the cause of the checkpoint in the checkpoint starting message.
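A minimal sketch for turning that on, plus a rough way to see how many checkpoints were timed versus requested (on releases before v17, where these counters live in pg_stat_bgwriter):

    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();

    SELECT checkpoints_timed, checkpoints_req
    FROM pg_stat_bgwriter;   -- these columns moved to pg_stat_checkpointer in v17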
Are you sure that it is a good idea to avoid checkpoints? They are needed so that you can recover your database if there is a problem.
With slow query logging turned on, we see a lot of COMMITs taking upwards of multiple seconds to complete on our production database. On investigation, these are generally simple transactions: fetch a row, UPDATE the row, COMMIT. The SELECTs and UPDATEs in these particular transactions aren't being logged as slow. Is there anything we can do, or tools that we can use, to figure out the reason for these slow commits? We're running on an SSD, and are streaming to a slave, if that makes a difference.
Postgres commits are synchronous by default. This means a COMMIT waits for its WAL to be flushed to disk before returning. You can adjust the WAL and commit settings in the config file to change this.
You can set commits to asynchronous at the session or user level, or database-wide, with the synchronous_commit setting in the config file.
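For example, a sketch of the three scopes (the role and database names are placeholders; with synchronous_commit = off a crash can lose the most recently acknowledged transactions, but it cannot corrupt the database):

    SET synchronous_commit = off;                         -- current session only
    ALTER ROLE app_user SET synchronous_commit = off;     -- hypothetical role
    ALTER DATABASE app_db SET synchronous_commit = off;   -- hypothetical database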
On the database side:
Vacuum your tables and update the statistics. This will get rid of dead tuples; since you're performing updates, there will be many.
VACUUM ANALYZE
I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.
So this is what I understand of the process so far; please correct me where I'm wrong:
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that writes are flushed from the journal onto disk every 60 seconds. By this I can only assume they mean written to the primary's datafiles and the oplog. If this is the case, how do writes appear earlier than the 60-second sync interval?
Some time later, secondaries pull data from the primary or their sync source and update their oplog and databases. The docs seem very vague about when exactly this happens and what delays it.
I'm also wondering, if journaling were disabled (I understand that's a really bad idea), at what point would the oplog and database get updated?
Lastly, I'm a bit stumped about at which points in this process the write locks get created. Is this just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Simon
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongod. No writes are done so far, and no connection is torn down, because what happens next really depends on the write concern. Let's assume that we go with the (as of this writing) default, "acknowledged".
The client sends its write operation. Here is where I am really not sure: either after this step or after the next one, the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. This is probably where the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the previous step, but if I had to bet, I'd say it happens after this one.
The output of the query optimizer is then applied to the data in memory, or more precisely to the memory-mapped datafiles, to the memory-mapped oplog and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the corresponding data is mapped into memory to answer the query. The oplog is read from memory if present, too.
In general, the journal is synced to disk every 100 ms. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory-mapped files is synced to disk. I think the journal is then truncated to the entries which haven't been applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal hasn't been applied to the current data.
If you have read carefully, you will have noticed that the data is available in the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to its own memory-mapped files and synced to disk the same way as on the primary.
Some things to note: as soon as the (relatively small) data is written to the journal, it is quite safe. If a node goes down between two syncs of the datafiles, both the datafiles and the oplog can be restored from their last on-disk state plus the journal. In general, the maximum data loss you can have is the operations recorded after the last journal commit, 50 ms in the median case.
As for the locks: if you have read carefully, you'll see there aren't locks imposed at the database level when the data is synced to disk. Write locks may be taken to ensure that only one thread modifies a given document at any point in time. Other write locks are possible, but in general they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.
I have a two-member replica set. I accidentally removed all documents in a collection; I am not sure how I did this, but they're gone.
Is it possible to get all the data back?
Unless you have a backup (always recommended for just this type of thing), or one of the replicas is using slavedelay, then I am afraid the removal of the records is final. You might have been able to force a shutdown in time to save the data on-disk if you killed the process before the next fsync to disk (similarly if you broke replication before the removal was replicated), but even then it would be tricky.