PostgreSQL 9.6: understanding WAL files

I am trying to understand the behaviour of wal files. The wal related settings of the database are as follows:
"min_wal_size" "2GB"
"max_wal_size" "20GB"
"wal_segment_size" "16MB"
"wal_keep_segments" "0"
"checkpoint_completion_target" "0.8"
"checkpoint_timeout" "15min"
The number of wal files is always 1281 or higher:
SELECT COUNT(*) FROM pg_ls_dir('pg_xlog') WHERE pg_ls_dir ~ '^[0-9A-F]{24}';
-- count 1281
As I understand it, this means the number of WAL files currently never falls below max_wal_size (1281 * 16 MB = 20496 MB, just over max_wal_size = 20480 MB)?
I would expect the number of wal files to decrease below maximum right after a checkpoint is reached and data is synced to disk. But this is clearly not the case. What am I missing?

As per the documentation (emphasis added):
The number of WAL segment files in pg_xlog directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old log segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of log output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest
So, as per your observation, you are probably observing the "recycle" effect -- the old WAL files are getting renamed instead of getting removed. This saves the disk some I/O, especially on busy systems.
Bear in mind that once a particular file has been recycled, it will not be reconsidered for removal/recycle again until it has been used (i.e., the relevant LSN is reached and checkpointed). That may take a long time if your system suddenly becomes less active.
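The arithmetic in the question can be checked with a short sketch. It only uses the settings quoted in the question; the helper name `segments` is made up for illustration, and the interpretation (files recycled up to the limit) is the one suggested above, not something the numbers prove by themselves.

```python
# Sizes in MB, taken from the settings quoted in the question.
WAL_SEGMENT_MB = 16
MAX_WAL_SIZE_MB = 20 * 1024   # max_wal_size = 20GB
MIN_WAL_SIZE_MB = 2 * 1024    # min_wal_size = 2GB


def segments(size_mb: int, segment_mb: int = WAL_SEGMENT_MB) -> int:
    """Number of whole WAL segments that fit into size_mb (hypothetical helper)."""
    return size_mb // segment_mb


max_segments = segments(MAX_WAL_SIZE_MB)
min_segments = segments(MIN_WAL_SIZE_MB)
observed = 1281  # file count reported by the query in the question

print(max_segments)               # 1280
print(min_segments)               # 128
print(observed * WAL_SEGMENT_MB)  # 20496
print(observed - max_segments)    # 1
```

So the observed count sits one segment above max_wal_size, which is at least consistent with segments having been recycled up to the limit rather than removed.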

If your server is very busy and then abruptly becomes mostly idle, you can get into a situation where the WAL files remain at max_wal_size for a very long time. At the time it was deciding whether to remove or recycle the files, it was using them up quickly and so decided to recycle up to max_wal_size for predicted future use, rather than remove them. Once recycled, they will never get removed until they have been used (you could argue that that is a bug), and if the server is now mostly idle it will take a very long time for them to be used and thus removed.


What is the difference between the background writer and the checkpointer in PostgreSQL?

As per my understanding:
the checkpoint periodically writes all dirty buffers (data) to disk, and
the background writer writes some specific dirty buffers (data) to disk.
It looks like both do almost the same work.
But which specific dirty buffers does the background writer write to disk?
How frequently are the checkpoint and the bgwriter called?
I want to know what the difference between them is.
Thanks in advance.
"It looks like both do almost the same work."
Looking at the source code link given by Adrian, you can see these words in the comments for the background writer:
As of Postgres 9.2 the bgwriter no longer handles checkpoints.
...which means in the past, the background writer and checkpointer tasks were handled by one component, which explains the similarity that probably led you to ask this question. The two components were split on 1/Nov/2011 in this commit and you can learn more about the checkpointer here.
From my own understanding, they are doing the same task from different perspectives. The task is making sure we use a limited amount of resources:
For the background writer, that resource is RAM and it writes dirty buffers to disk so the buffers can be reused to store other data hence limiting the amount of RAM required.
For the checkpointer, that resource is DISK and it writes all dirty buffers to disk so it can add a checkpoint record to the WAL, which allows all segments of the WAL prior to that record to be removed/recycled hence limiting the amount of DISK required to store the WAL files. You can confirm this in the docs which say ...after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed.
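The two perspectives can be illustrated with a toy buffer pool. This is only a sketch under simplifying assumptions: `BufferPool` and its methods are invented for illustration and do not mirror PostgreSQL's internal structures, and the `max_pages` cap is only loosely analogous to a setting such as bgwriter_lru_maxpages.

```python
from dataclasses import dataclass, field


@dataclass
class BufferPool:
    """Toy shared-buffer pool mapping page id -> dirty flag (not PostgreSQL's real layout)."""
    dirty: dict = field(default_factory=dict)
    writes: int = 0

    def modify(self, page):
        # A backend changed this page in memory; it now differs from disk.
        self.dirty[page] = True

    def background_write(self, max_pages):
        """Write at most max_pages dirty pages so their buffers can be reused."""
        victims = [p for p, is_dirty in self.dirty.items() if is_dirty][:max_pages]
        for p in victims:
            self.dirty[p] = False
            self.writes += 1
        return len(victims)

    def checkpoint(self):
        """Write every remaining dirty page; older WAL is then no longer needed."""
        victims = [p for p, is_dirty in self.dirty.items() if is_dirty]
        for p in victims:
            self.dirty[p] = False
            self.writes += 1
        return len(victims)


pool = BufferPool()
for page in range(10):
    pool.modify(page)

written_by_bgwriter = pool.background_write(max_pages=3)
written_by_checkpoint = pool.checkpoint()
print(written_by_bgwriter, written_by_checkpoint)  # 3 7
```

The background writer pass cleans only a bounded number of buffers, while the checkpoint leaves no dirty buffer behind, which is what allows the WAL before it to be recycled or removed.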
It may be helpful to read more about the WAL (Write-Ahead Log) in general.

How does write ahead logging improve IO performance in Postgres?

I've been reading through the WAL chapter of the Postgres manual and was confused by a portion of the chapter:
Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
How is continuous writing to WAL more performant than simply writing to the table/index data itself?
As I see it (forgetting for now the resiliency benefits of WAL), Postgres needs to complete two disk operations: first it needs to commit to WAL on disk, and then it still needs to change the table data to be consistent with WAL. I'm sure there's a fundamental aspect of this I've misunderstood, but it seems like adding an additional step between a client transaction and the final state of the table data couldn't actually increase overall performance. Thanks in advance!
You are fundamentally right: the extra writes to the transaction log will per se not reduce the I/O load.
But a transaction will normally touch several files (tables, indexes etc.). If you force all these files out to storage (“sync”), you will incur more I/O load than if you sync just a single file.
Of course all these files will have to be written and sync'ed eventually (during a checkpoint), but often the same data are modified several times between two checkpoints, and then the corresponding files will have to be sync'ed only once.
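A rough way to see the effect is to count sync operations for a made-up workload. This is a deliberately crude model: the function names and the workload are hypothetical, every commit is assumed to force its own sync, and effects such as group commit or full-page writes are ignored.

```python
def syncs_without_wal(transactions):
    """Each commit must sync every data file it touched."""
    return sum(len(set(files)) for files in transactions)


def syncs_with_wal(transactions):
    """Each commit syncs only the WAL; one checkpoint later syncs each dirty file once."""
    wal_syncs = len(transactions)
    dirty_files = set()
    for files in transactions:
        dirty_files.update(files)
    return wal_syncs + len(dirty_files)


# Hypothetical workload between two checkpoints:
# 100 transactions, each touching one table and two indexes.
workload = [("orders", "orders_pkey", "orders_customer_idx")] * 100

print(syncs_without_wal(workload))  # 300
print(syncs_with_wal(workload))     # 103
```

The more often the same files are modified between two checkpoints, the larger the gap becomes, because the data files are synced once per checkpoint instead of once per commit.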

Size of SQL disk storage space growing 32GB a day

Since April 1st, the size of my DB storage space grows by 32GB a day. It's very unusual, and based on the 500GB disk, this will not last for much longer.
Why is the DB growing by 32GB a day?
For context, I've allocated a 500GB disk; binary logs are enabled; automated backups are enabled.
I tested further. The reason for the DB growing so dramatically every night is due to the binary logs. Every night Magento indexes run, and produce 32GB of binary logging data. Not all Magento stores will be the same, but large Magento stores beware.
The solution, temporarily at least, is to disable binary logging. Have a look at the image to see the reclaimed disk space after disabling the option.
This will make it a challenge when setting up read/failover replicas. It would be nice if the MySQL instance is configured to purge/prune binary logs after a set amount of time has passed, or at least once operations have been copied to slave instances. Maybe it does, but I haven't investigated. Given current time constraints, I was not going to wait until the purge/prune happened, if it even would.
Could it be your DB log that is growing at a rapid pace?
I have had this issue in the past and ended up creating a job for the SQL agent that runs once a week and purges the log.

Any performance gotchas when doing mass deletes from MongoDB?

We are looking at writing log information to a MongoDB logging database but have essentially zero practical experience running Mongo in a production environment.
Every day we'll be writing a million+ log entries. Logs older than (say) a month need to be purged (say) daily. My concern is how Mongo will handle these deletes.
What are the potential issues with this plan with Mongo?
Do we need to chunk the deletes?
Given we'll be deleting by chronological age (ie: insert order), can I assume fragmentation will not be an issue?
Will the database need to be compacted regularly?
Potential issues: None, if you can live with eventual consistency.
No. A far better approach is to have an (ISO)Date field in your documents and set up a TTL index on it. Assuming the mentioned field holds the time at which the log entry was made, you would setup said index like:
// seconds per minute * minutes per hour * hours per day * days per (commercial) month = 60 * 60 * 24 * 30
{"expireAfterSeconds": 2592000}
This way, a mongod subprocess takes care of deleting the expired data, turning the collection into a sort of round-robin database. Fewer moving parts, less to care about. Please note that the documents will not be deleted the very same second they expire. Under the worst circumstances, it can take up to 2 minutes from their time of expiration (iirc) before they are actually deleted. At median, an expired document should be deleted some 30 seconds after its expiration.
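As a sketch of how such an index could be set up from Python: the helper below assumes a PyMongo collection, whose `create_index` accepts `expireAfterSeconds` as a keyword option. The field name `createdAt`, the helper names and the `RecordingCollection` stand-in are all hypothetical; the stand-in exists only so that the example runs without a server.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to seconds."""
    return days * SECONDS_PER_DAY


def ensure_ttl_index(collection, field: str, days: int) -> int:
    """Ask the collection for a TTL index on `field`.

    `collection` is assumed to behave like a pymongo Collection;
    `field` must hold a date for the TTL index to have any effect.
    """
    seconds = ttl_seconds(days)
    collection.create_index(field, expireAfterSeconds=seconds)
    return seconds


class RecordingCollection:
    """Stand-in that records the call, so this sketch runs without a MongoDB server."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **kwargs):
        self.calls.append((keys, kwargs))
        return "stub_index"


logs = RecordingCollection()  # with a real server this would be e.g. client.mydb.logs
print(ensure_ttl_index(logs, "createdAt", days=30))  # 2592000
```

With a real connection, the same helper would be passed the actual log collection instead of the stand-in.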
Compacting does not reclaim disk space on mmapv1, only on WiredTiger. Keep in mind that documents are never fragmented. With the fun fact that the database being compacted will be locked, I have yet to find a proper use case for the compact command. If disk space is your concern: freed space in the datafiles will be reused. So yes, in a worst case scenario you can have a few additional datafiles allocated. Since I don't know the project's requirements and details, it is you who must decide whether reclaiming a few GB of disk space is worth locking the database for extended periods of time.
You can configure MongoDB for log file rotation:
You would certainly be interested in the "Manage journaling" section too:
My last suggestion is about the "smallfiles" option:
Set to false to prevent the overhead of journaling in situations where durability is not required. To reduce the impact of the journaling on disk usage, you can leave journal enabled, and set smallfiles to true to reduce the size of the data and journal files.

MongoDB Write and lock processes

I've been reading a lot about MongoDB recently, but one topic I can't find any clear material on is how data is written to the journal and oplog.
So this is what I understand of the process so far; please correct me where I'm wrong.
A client connects to mongod and performs a write. The write is stored in the socket buffer.
When Mongo is available (not sure what available means at this point), data is written to the journal?
The MongoDB docs then say that writes are flushed from the journal onto disk every 60 seconds. By this I can only assume this means written to the primary and the oplog. If this is the case, how do writes appear earlier than the 60 second sync interval?
Some time later, secondaries pull data from the primary or their sync source and update their oplog and databases. It seems very vague about when exactly this happens and what delays it.
I'm also wondering: if journaling was disabled (I understand that's a really bad idea), at what point do the oplog and database get updated?
Lastly, I'm a bit stumped as to at which points in this process the write locks get created. Is this just when the database and oplog are updated, or at other times too?
Thanks to anyone who can shed some light on this or point me to some reading material.
Here is what happens as far as I understand it. I simplified a bit, but it should make clear how it works.
A client connects to mongo. No writes done so far, and no connection torn down, because it really depends on the write concern what happens now. Let's assume that we go with the (at the time of this writing) default "acknowledged".
The client sends its write operation. Here is where I am really not sure. Either after this step or the next one the acknowledgement is sent to the driver.
The write operation is run through the query optimizer. It is here where the acknowledgment is sent, because with an acknowledged write concern you may be returned a duplicate key error. It is possible that this was checked in the last step. If I had to bet, I'd say it is after this one.
The output of the query optimizer is then applied to the data in memory. Actually, to the data of the memory-mapped datafiles, to the memory-mapped oplog and to the journal's memory-mapped files. Queries are answered from these memory-mapped parts, or the relevant data is mapped into memory for answering the query. The oplog is read from memory if present, too.
Every 100ms in general the journal is synced to disk. The precise value is determined by a number of factors, one of them being the journalCommitInterval configuration parameter. If you have a write concern of journaled, the driver will be notified now.
Every syncDelay seconds, the current state of the memory-mapped files is synced to disk. I think the journal is truncated to the entries which weren't applied to the data yet, but I am not too sure of that, since it should basically never happen that data in the journal isn't yet applied to the current data.
If you have read carefully, you noticed that the data is ready for the oplog as soon as it has been run through the query optimizer and applied to the files mapped into memory. When the oplog entry is pulled by one of the secondaries, it is immediately applied to its memory-mapped data files and synced to disk the same way as on the primary.
Some things to note: As soon as the relatively small data is written to the journal, it is quite safe. If a node goes down between two syncs to the datafiles, both the datafiles and the oplog can be restored from their last state in the datafiles and the journal. In general, the maximum data loss you can have is the operations recorded into the log after the last commit, 50ms in median.
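The recovery behaviour described here can be sketched with a toy model. It is only an illustration of the steps as described above, not MongoDB's actual implementation: the class `ToyNode` and its method names are invented, and truncating the journal on a datafile sync follows the guess made above rather than documented behaviour.

```python
class ToyNode:
    """Toy model of the described write path; not MongoDB's real implementation."""

    def __init__(self):
        self.memory = {}          # current state of the memory-mapped data
        self.journal_buffer = []  # journal entries not yet synced to disk
        self.journal_disk = []    # journal entries safely on disk
        self.datafiles_disk = {}  # data files as of the last sync

    def write(self, key, value):
        # The write is applied in memory and recorded for the journal.
        self.memory[key] = value
        self.journal_buffer.append((key, value))

    def commit_journal(self):
        """Roughly every 100 ms: the journal is synced to disk."""
        self.journal_disk.extend(self.journal_buffer)
        self.journal_buffer.clear()

    def sync_datafiles(self):
        """Every syncDelay seconds: data files are synced, journal truncated (assumed)."""
        self.commit_journal()
        self.datafiles_disk = dict(self.memory)
        self.journal_disk.clear()

    def recover_after_crash(self):
        """Rebuild state from the last synced data files plus the journal on disk."""
        recovered = dict(self.datafiles_disk)
        for key, value in self.journal_disk:
            recovered[key] = value
        return recovered


node = ToyNode()
node.write("a", 1)
node.sync_datafiles()   # "a" is now in the data files
node.write("b", 2)
node.commit_journal()   # "b" is only in the journal on disk
node.write("c", 3)      # "c" is only in memory when the crash happens
print(node.recover_after_crash())  # {'a': 1, 'b': 2}
```

Only the write made after the last journal commit is lost. If commits happen every 100ms and a crash is equally likely at any moment, the exposed window averages about half the interval, which matches the 50ms figure above.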
As for the locks: if you have read carefully, there aren't locks imposed on a database level when the data is synced to disk. Write locks may be created in order to ensure that only one thread at any given point in time modifies a given document. There are other write locks possible, but in general, they should be rather rare.
Write locks on the filesystem layer are created once, though only implicitly, iirc. During application startup, a lock file is created in the root directory of the dbpath. Any other mongod instance will refuse to do any operation on those datafiles while a valid lock exists. And you shouldn't either ;)
Hope this helps.