Postgres - wal files not getting deleted - postgresql

We have one master and two slaves, and WAL files are not getting deleted. I am new to the team; people assume the files used to be deleted and no longer are.
But we found WAL files dating back as far as Sep 2015.
Master:
archive_mode = off
archive_command = ''
slave1:
archive_cleanup_command -- not present in the recovery.conf file
slave2:
archive_cleanup_command -- not present in the recovery.conf file
The WAL files take up 600GB now, and I have only a couple of days before I have to delete them. Right now I am checking whether base backups exist, so that I can delete the WAL files and run pg_resetxlog.
I do not understand why WAL files are generated when archive_mode is off. Can someone explain what else to look for?
I will provide more information as needed.

Related

How to recover the current wal file that was being written in master and not yet archived?

I am new to postgres and was trying to simulate a postgresql cluster so:
I have two nodes running the latest version of Postgres as active / hot standby, with this master configuration:
archive_mode = on
archive_command = 'test ! -f /data/%f && cp %p /data/%f'
and slave configuration
primary_slot_name = 'standby_db2_slot'
hot_standby = on
and the other related settings left at their defaults.
My question is: if the standby was off for some time and the master crashes, how do I recover the data from my archived WAL files, and how do I get the last WAL file that the master was writing to before crashing?
You could copy the files from the archive (if it is still available) into the replica's pg_wal folder. Or more typically, you would set restore_command to copy each of them from the archive upon request.
how to get the last wal file that the master was writing to before crashing?
If it was a hard crash where the master's storage was irreparably destroyed, you likely can't get it. That is why streaming is great: it copies the data stream in near-real time to minimize loss. And if it was a soft crash, why are you trying to promote the replica anyway, rather than just turning the master back on? If the master's storage was only partially destroyed, then just copy this last file to the archive manually.
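The restore_command approach mentioned above can be sketched as a small helper script. This is only an illustration, not a tested recovery setup: the archive location /data is taken from the question's archive_command, while the function name restore_wal and the script path in the comment are made up for the example.

```shell
#!/bin/sh
# Sketch of a restore_command helper. ARCHIVE_DIR is an assumed location.
# Intended use in the standby's configuration (hypothetical script path):
#   restore_command = '/usr/local/bin/restore_wal.sh %f %p'
ARCHIVE_DIR="${ARCHIVE_DIR:-/data}"

restore_wal() {
    wal_name="$1"   # %f: file name PostgreSQL is asking for
    target="$2"     # %p: path PostgreSQL wants the file copied to
    # A non-zero exit tells PostgreSQL the file is not in the archive,
    # which is normal once recovery reaches the end of the archived WAL.
    [ -f "$ARCHIVE_DIR/$wal_name" ] || return 1
    cp "$ARCHIVE_DIR/$wal_name" "$target"
}

# Run only when called with the two arguments restore_command passes in.
if [ "$#" -eq 2 ]; then
    restore_wal "$1" "$2"
fi
```

For a plain local archive the one-liner restore_command = 'cp /data/%f %p' does the same job; a script only pays off once you need decompression or a remote copy.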

PostgreSQL - Does a single archive file contain information for only a specific database on a cluster or is it the entire cluster?

Note: this question is with regards to PostgreSQL version 13.
On my Ubuntu server, I have a cluster called main which has 2 databases inside it (the first one being for a fruits company and the second one for a car company).
Here are my postgresql.conf file settings:
wal_level = replica
archive_mode = on
archive_command = 'pxz --compress --keep --force -6 --to-stdout --quiet %p > /datadrive/postgresql/13/wal_archives/%f.xz'
This creates .xz files in /datadrive/postgresql/13/wal_archives/ as expected.
For example: a file name may look like this:
0000000100000460000000A4.xz
Now my questions regarding this archiving process are as follows:
Is this particular .xz file an archive of all the databases in the postgresql cluster? i.e. does this particular xz file contain an archive for both the fruits and the car databases or does it only contain an archive for only one of them?
What is an archive file? Is it just a single WAL file or is it an archive point + a WAL file?
I have read the official documentation found here and here and also looked at a large number of stackoverflow and database stack exchange questions and have not managed to gain a good understanding of the archive concept.
Such a file is called a "WAL segment". WAL is short for "write ahead log" and is the transaction log, which contains the information required to replay data modifications for the whole database cluster. So it contains data for all databases in the cluster.
WAL is an endless append-only stream, which is split into segments of a fixed size. A WAL archive is nothing more than a faithful copy of a WAL segment.
WAL archives are used together with a base backup to perform point-in-time-recovery. Other uses for WAL files are crash recovery and replication, but these don't require archived WAL segments.
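To make the "stream split into segments" idea concrete, here is a rough sketch that takes a segment file name apart. It assumes the default 16MB segment size (a different size changes the arithmetic), and the function name describe_wal_segment is invented for this example.

```shell
#!/bin/sh
# Split a WAL segment file name into its three 8-hex-digit fields:
# timeline, high part of the position ("log"), and segment within that log.
# Assumes the default 16MB segment size.
describe_wal_segment() {
    name="$1"
    timeline=$(printf '%s' "$name" | cut -c1-8)
    log_id=$(printf '%s' "$name" | cut -c9-16)
    seg_id=$(printf '%s' "$name" | cut -c17-24)
    # Each log spans 4GB, i.e. 256 segments of 16MB.
    seg_no=$(( 0x$log_id * 256 + 0x$seg_id ))
    echo "timeline=$((0x$timeline)) segment_number=$seg_no"
}

describe_wal_segment 0000000100000460000000A4
# prints: timeline=1 segment_number=286884
```

Nothing in the name refers to a database: the position is a single counter for the whole cluster, which is why one segment carries changes for both the fruits and the car database.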

Postgres Continuous Archiving and Point-in-Time Recovery (PITR)

I am trying to setup Continuous Archiving and Point-in-Time Recovery (PITR) in Postgres. When I go through the documentation it says:
The archive command should generally be designed to refuse to overwrite any pre-existing archive file. This is an important safety feature to preserve the integrity of your archive in case of administrator error (such as sending the output of two different servers to the same archive directory).
But I see that the same WAL file is changing multiple times when I open a connection and do some changes time to time. So for example, when I first connect the database and do some changes (like deleting or inserting some rows), it creates a WAL file named 000000010000000000000090 and my archive_command is immediately run. My archive_command is
test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f
This is based on the documentation: it checks whether the file already exists in the archive directory and copies it only if it does not. So the first time the condition passes and the file is copied. But when I make more changes over the same connection (I have the same issue when I reconnect from the same PC), the original WAL file is changed, and the next copy does not happen because the file already exists.
If this is allowed to happen, we may lose some changes in the backup. Does anyone know of a solution, so that a new file is created for every change instead of the old file being modified?
I am using Postgres version 10.2 on my local computer (Mac).
Does that really happen to you? Because it shouldn't.
PostgreSQL writes transaction logs in “WAL files” (WAL for Write Ahead Log) of 16MB size.
Whenever a WAL file is full, the log is switched to a new WAL file, and the old WAL file is archived with archive_command.
If archive_command completes with an exit status of 0 (success), the WAL file is recycled, otherwise archiving is retried until it succeeds. Failures will be logged.
Anyway, as long as there are no errors, each WAL file will only be archived once.
The behavior you describe shouldn't happen.
Check the PostgreSQL log to see if there were errors reported from archive_command. If you fix the error condition, normal operation will resume.
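The exit-status behaviour described above is easy to try outside PostgreSQL. The sketch below wraps the documented test/cp command in a function (archive_wal is a made-up name, and the temporary directory stands in for /mnt/server/archivedir) and calls it twice for the same segment name.

```shell
#!/bin/sh
# Mimics: archive_command = 'test ! -f ARCHIVE/%f && cp %p ARCHIVE/%f'
archive_wal() {
    src="$1"           # %p: path of the completed WAL segment
    name="$2"          # %f: its file name
    archive_dir="$3"   # assumed archive location
    test ! -f "$archive_dir/$name" && cp "$src" "$archive_dir/$name"
}

demo=$(mktemp -d)
mkdir "$demo/archive"
echo "wal data" > "$demo/000000010000000000000090"

for attempt in first second; do
    if archive_wal "$demo/000000010000000000000090" 000000010000000000000090 "$demo/archive"; then
        echo "$attempt attempt: archived (exit status 0)"
    else
        echo "$attempt attempt: refused, file already in archive (non-zero exit status)"
    fi
done
rm -rf "$demo"
```

The second call fails on purpose. If PostgreSQL ever offered the same segment twice, that non-zero status would show up in the server log as a failed archive_command rather than silently overwriting the archived copy.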

How to avoid a big log file with WAL archiving?

I enabled WAL archiving in EDB PostgreSQL 9.6 for PITR, but now a new 16MB log file keeps being created and the disk volume is filling up. How do I avoid that?
These are the changes made to Postgresql.conf to enable the wal archive:
wal_level = replica
archive_mode = on
archive_command = 'cp %p /postgres/cluster/wals/%f'
(cp from pg_xlog to the wals folder)
Now the wals folder keeps filling up.
You avoid filling up the destination directory by
providing enough disk space there
deleting WAL archives you don't need any more.
PostgreSQL does not automatically delete WAL archives for you — it does not even know where they are.

Which Postgresql WAL files can I safely remove from the WAL archive folder

Current situation
So I have WAL archiving set up to an independent internal harddrive on a data logging computer running Postgres. The harddrive containing the WAL archives is filling up and I'd like to remove and archive all the WAL archive files, including the initial base backup, to external backup drives.
The directory structure is like:
D:/WALBACKUP/ which is the parent folder for all the WAL files (00000110000.CA00000004 etc)
D:/WALBACKUP/BASEBACKUP/ which holds the .tar of the initial base backup
The question I have then is:
Can I safely move literally every single WAL file except the current WAL archive file (000000000001.CA0000.. and so on), including the base backup, to another HDD? (Note that the database is live and receiving data.)
cheers!
WAL archives
You can use the pg_archivecleanup command to remove WAL from an archive (not pg_xlog) that's not required by a given base backup.
In general I suggest using PgBarman or a similar tool to automate your base backups and WAL retention though. It's easier and less error prone.
pg_xlog
Never remove WAL from pg_xlog manually. If you have too much WAL then:
your wal_keep_segments setting is keeping WAL around;
you have archive_mode on and archive_command set but it isn't working correctly (check the logs);
your checkpoint_segments is ridiculously high so you're just generating too much WAL; or
you have a replication slot (see the pg_replication_slots view) that's preventing the removal of WAL.
You should fix the problem that's causing WAL to be retained. If nothing seems to have happened after changing a setting run a manual CHECKPOINT command.
If you have an offline server and need to remove WAL to start it, you can use pg_archivecleanup if you must. It knows how to remove only WAL that isn't needed by the server itself ... but it might break your archive-based backups, streaming replicas, etc. So don't use it unless you must.
WAL files are incremental, so the simple answer is: You cannot throw any files out. The solution is to make a new base backup and then all previous WALs can be deleted.
The WAL files record the individual changes made to the database, so if you throw some older WALs out, the recovery process will fail (it will not silently skip missing WAL files) because the state of the database cannot be restored reliably. You can move the WAL files to some other location without upsetting the WAL process, but then you'd have to make all WAL files available again from a single location if you ever need to recover your database from some point in the past. If you are running out of disk space, that may mean recovering from some location where you have enough space to store the base backup and all WAL files. The main issue here is whether you can do that fast enough to restore a full database after an incident.
Another issue is that if you cannot identify where/when a problem occurred that needs to be corrected your only option is to start with the base backup and then replay all the WAL files. This procedure is not difficult, but if you have an old base backup and many WAL files to process, this simply takes a lot of time.
The best approach for your case, in general, is to make a new base backup every x months and collect WALs with that base backup. After every new base backup you can delete the old base backup and its subsequent WALs or move them to cheap offline storage (DVD, tape, etc). In the case of a major incident you can quickly restore the database to a known correct state from the recent base backup and the relatively few WAL files collected since then.
A solution that we went for, is executing pg_basebackup every night. This would create a base backup and later on we can use pg_archivecleanup to clean up all the "old" WAL files before that base using something like
"%POSTGRES_INSTALLDIR%\bin\pg_archivecleanup" -d %WAL_backup_dir% %newestBaseFile%
Fortunately, we never had to recover yet, but it should work in theory.
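Before letting pg_archivecleanup delete anything, it can help to see which files it would pick. The sketch below is a dry run that only lists names and deletes nothing. It approximates the tool's rule as I understand it (plain segment files that sort before the oldest kept file, ignoring the timeline part of the name); the function name list_removable_wal is made up, so check the result against the real tool before relying on it.

```shell
#!/bin/sh
# Dry run: list the archived segments that sort before OLDEST_KEPT.
# The timeline (first 8 characters of the name) is ignored in the comparison.
list_removable_wal() {
    archive_dir="$1"
    # Accept either a segment name or a ".backup" history file name.
    oldest_kept=$(printf '%s' "$2" | cut -c1-24)
    keep_key=${oldest_kept#????????}
    for path in "$archive_dir"/*; do
        f=$(basename "$path")
        # Only consider plain 24-character hexadecimal segment names.
        [ "${#f}" -eq 24 ] || continue
        case "$f" in *[!0-9A-F]*) continue ;; esac
        if [ "${f#????????}" \< "$keep_key" ]; then
            echo "$f"
        fi
    done
}
```

A call would look like list_removable_wal /path/to/archive 000000010000000000000010.00000028.backup, where the second argument is the backup history file that the newest base backup left in the archive (both values here are placeholders).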
In case someone found this by searching for how to safely clean up the WAL directory in a replication setup, consider the scenario where there are leftovers from offline replicas: unused replication slots that wait for the replica to come back online and so keep a lot of WAL files on the master DB.
In our case a replica went down due to hardware failure. We had to recreate it along with its replication slot on the master DB, but forgot to get rid of the previously used slot. Once we cleared that out, PostgreSQL got rid of the unused WAL files and all was good.
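A quick way to spot such leftover slots is to look at the active column of pg_replication_slots. The sketch below filters psql's unaligned output; the function name inactive_slots and the two sample rows are invented for illustration, and on a real master the input would come from the query shown in the comment.

```shell
#!/bin/sh
# Reads "slot_name|active|restart_lsn" lines (psql -At format) on stdin and
# prints the slots that no client is currently connected to.
inactive_slots() {
    while IFS='|' read -r slot_name active restart_lsn; do
        if [ "$active" = "f" ]; then
            echo "$slot_name (holding WAL from $restart_lsn)"
        fi
    done
}

# Against a live master this would be fed by something like:
#   psql -At -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots" | inactive_slots
# Hypothetical sample rows stand in for that query here.
printf '%s\n' 'standby_db2_slot|t|460/A4000000' 'old_replica_slot|f|12/3000028' | inactive_slots
# prints: old_replica_slot (holding WAL from 12/3000028)
```

An inactive slot is not necessarily abandoned, since the replica may just be down for maintenance. Once you are sure it is no longer needed, SELECT pg_drop_replication_slot('slot_name'); removes it, and the retained WAL is released at the next checkpoint.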
You can add a script to automatically clean up pg_wal files. This one works with PostgreSQL 11. For another PostgreSQL version, simply replace "/usr/pgsql-11/bin/pg_archivecleanup" with /usr/pgsql-12/bin/pg_archivecleanup (or 13, and so on), and adjust the pg_controldata and data directory paths to match.
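#!/bin/bash
# Ask pg_controldata for the WAL file holding the latest checkpoint's REDO location.
redo_wal=$(/usr/pgsql-11/bin/pg_controldata -D /var/lib/pgsql/11/data/ | awk '/REDO WAL file/ {print $6}')
# Remove every segment in pg_wal that sorts before that file (-d prints debug output).
/usr/pgsql-11/bin/pg_archivecleanup -d /var/lib/pgsql/11/data/pg_wal "$redo_wal"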