PostgreSQL failed with vacuum and autovacuum - postgresql

Postgres v11.9
The Postgres log contains many errors like this:
2020-09-05 17:35:37 GMT [22464]: #: [6-1] ERROR: uncommitted xmin 636700836 from before xid cutoff 809126794 needs to be frozen
2020-09-05 17:35:37 GMT [22464]: #: [7-1] CONTEXT: automatic vacuum of table "table_nane"
Manual vacuum fails with this error too.
What can I do to fix this error?

Export the database with pg_dump.
Create a new database cluster and restore the dump into it.
Remove the original database cluster.
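The three steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: the ports, the new data directory, the dump file path, and the "postgres" superuser name are hypothetical and not taken from the question. It also uses pg_dumpall rather than pg_dump so that roles and all databases are included in one file. Nothing here repairs the old cluster; it only copies the data out of it.

```shell
#!/bin/sh
# Sketch: move all data from a damaged cluster into a freshly initialized one.
# Hypothetical defaults (assumptions, adjust to your installation):
OLD_PORT="${OLD_PORT:-5432}"
NEW_PORT="${NEW_PORT:-5433}"
NEW_PGDATA="${NEW_PGDATA:-/var/lib/postgresql/11/new_cluster}"
DUMP_FILE="${DUMP_FILE:-/tmp/all_databases.sql}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_cluster() {
    # 1. Logical dump of every database (plus roles) from the old cluster.
    run pg_dumpall -U postgres -p "$OLD_PORT" -f "$DUMP_FILE" || return 1
    # 2. Create a new, empty cluster and start it on a different port.
    run initdb -U postgres -D "$NEW_PGDATA" || return 1
    run pg_ctl -D "$NEW_PGDATA" -o "-p $NEW_PORT" -w start || return 1
    # 3. Replay the dump into the new cluster.
    run psql -U postgres -p "$NEW_PORT" -d postgres -f "$DUMP_FILE" || return 1
    # The old cluster is deliberately NOT removed here: delete it by hand
    # only after the restored data has been checked.
}
```

Running `DRY_RUN=1; migrate_cluster` prints the commands without executing them, which is a cheap way to review the plan before touching a production server.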

Related

Recover Postgresql pgBarman

I've set up a PostgreSQL database and I want to back it up.
I have one server with my main DB and one with Barman.
The whole setup works, and I can back up my DB with Barman.
I just don't understand how to recover my DB to an exact point in time between the backups I take every day.
barman@ubuntu:~$ barman check main-db-server
WARNING: No backup strategy set for server 'main-db-server' (using default 'exclusive_backup').
WARNING: The default backup strategy will change to 'concurrent_backup' in the future. Explicitly set 'backup_options' to silence this warning.
Server main-db-server:
PostgreSQL: OK
is_superuser: OK
wal_level: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (interval provided: 1 day, latest backup age: 9 minutes, 59 seconds)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 6 backups, expected at least 0)
ssh: OK (PostgreSQL server)
not in recovery: OK
systemid coherence: OK (no system Id available)
archive_mode: OK
archive_command: OK
continuous archiving: OK
archiver errors: OK
And when I back up my DB:
barman@ubuntu:~$ barman backup main-db-server
WARNING: No backup strategy set for server 'main-db-server' (using default 'exclusive_backup').
WARNING: The default backup strategy will change to 'concurrent_backup' in the future. Explicitly set 'backup_options' to silence this warning.
Starting backup using rsync-exclusive method for server main-db-server in /var/lib/barman/main-db-server/base/20210427T150505
Backup start at LSN: 0/1C000028 (00000005000000000000001C, 00000028)
Starting backup copy via rsync/SSH for 20210427T150505
Copy done (time: 2 seconds)
Asking PostgreSQL server to finalize the backup.
Backup size: 74.0 MiB. Actual size on disk: 34.9 KiB (-99.95% deduplication ratio).
Backup end at LSN: 0/1C0000C0 (00000005000000000000001C, 000000C0)
Backup completed (start time: 2021-04-27 15:05:05.289717, elapsed time: 11 seconds)
Processing xlog segments from file archival for main-db-server
00000005000000000000001B
00000005000000000000001C
00000005000000000000001C.00000028.backup
I don't know how to restore my DB to a point in time between two backups.
Thanks

Docker Postgres database corrupted

I had a docker container running timescaleDB. The database data was stored outside the container.
docker run -d --name timescale -v /<DATA>:/var/lib/postgresql/data timescale/timescaledb-postgis:latest-pg10
Something strange happened lately: I logged in and saw that all the databases had suddenly vanished.
I see the following in the log file:
2021-03-13 11:32:00.215 UTC [21] LOG: database system was interrupted; last known up at 2021-03-11 16:16:19 UTC
2021-03-13 11:32:00.242 UTC [21] LOG: database system was not properly shut down; automatic recovery in progress
2021-03-13 11:32:00.243 UTC [21] LOG: redo starts at 0/15C1270
2021-03-13 11:32:00.243 UTC [21] LOG: invalid record length at 0/15C12A8: wanted 24, got 0
2021-03-13 11:32:00.243 UTC [21] LOG: redo done at 0/15C1270
2021-03-13 11:32:00.247 UTC [8] LOG: database system is ready to accept connections
2021-03-13 20:33:10.424 UTC [31] LOG: could not receive data from client: Operation timed out
2021-03-13 20:33:10.424 UTC [29] LOG: could not receive data from client: Operation timed out
Does that mean the database is corrupted? If so, is there a way to recover it? The container had been running for 3 years without a problem before this unexpected loss of the databases.
Thanks
Yes, the database was corrupted, but the automatic recovery process repaired it. The database system evidently started working again, since it logged the message "database system is ready to accept connections". This means that recovery from the write-ahead log completed properly (which does not guarantee that the database files are fully consistent).
When a database is shut down abruptly, there is also a small chance of file-level corruption. The good news is that nothing in the log after the recovery suggests this is the case. You should still have a backup of the files.
The later log message "could not receive data from client: Operation timed out" is not related to recovery; it is caused by a client application that terminated without properly closing its connection.
See the PostgreSQL wiki for more information on corruption and its causes.
If you depend on the data in the database, always keep a backup. The easiest way is to use pg_dumpall, which dumps the data in plain-text format as a series of SQL statements that you can later import into newer versions of PostgreSQL.
So my recommendation, before you do anything else with it: STOP THE CONTAINER AND TAKE A BACKUP OF THE FILES. Recovery is a trial-and-error process, and you will need a fresh copy of the files to try different things. After that, export the data with pg_dumpall. If this succeeds, you can resume normal operation of the database.
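The recommendation above (file-level copy first, logical dump second) could look roughly like the following. This is a sketch under assumptions, not a tested recovery procedure: the backup directory, the "postgres" superuser name, and the /tmp/all_databases.sql path inside the container are hypothetical, and the host data directory must be passed in because the question only shows it as a placeholder. Copying the data directory may require root on the host.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: backup_container_db <container> <host_data_dir> <backup_dir>
backup_container_db() {
    container="$1"
    data_dir="$2"
    backup_dir="$3"
    stamp="$(date +%Y%m%dT%H%M%S)"

    # 1. Stop the container so the files are not changing while copied.
    run docker stop "$container" || return 1
    # 2. Keep an untouched copy of the data directory for later attempts.
    run cp -a "$data_dir" "$backup_dir/pgdata-$stamp" || return 1
    # 3. Start the server again and wait until it accepts connections.
    run docker start "$container" || return 1
    tries=0
    until run docker exec "$container" pg_isready -U postgres; do
        tries=$((tries + 1))
        [ "$tries" -ge 30 ] && return 1
        sleep 1
    done
    # 4. Logical dump of all databases, written inside the container,
    #    then copied out to the host.
    run docker exec "$container" pg_dumpall -U postgres -f /tmp/all_databases.sql || return 1
    run docker cp "$container:/tmp/all_databases.sql" "$backup_dir/all_databases-$stamp.sql" || return 1
}
```

For the container in the question the first argument would be `timescale`. Setting `DRY_RUN=1` before the call prints the commands instead of running them.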

PostgreSQL backup in custom format ( -F c) fails during pg_restore ( copy command in log )

We have a PostgreSQL custom-format (-F c) database backup, ~1 GB in size, that could not be restored on two of our users' machines.
The error that occurs is:
pg_restore: [archiver (db)] error returned by PQputCopyData
The server log also shows an error in the COPY command.
All the reports we found of COPY errors during pg_restore were related to plain-text (SQL) backups, which is not the case here.
Any ideas?
Below is the information that describes the issue in more detail:
1. File integrity is OK, checked with "Microsoft File Checksum Integrity Verifier".
2. Backup and restore are performed with PostgreSQL 9.6.5 64-bit.
3. The backup is created with pg_dump:
pg_dump -U username -F c -Z 9 mydatabase > myarchive
4. Database on client is created with:
CREATE DATABASE mydatabase WITH TEMPLATE = template0 ENCODING = 'UTF8' OWNER=user;
5. pg_restore call:
pg_restore.exe -U user --dbname=mydatabase --verbose --no-owner --role=user
6. Example of logs, there are repeating rows with random table errors:
2020-12-07 13:40:56 GMT LOG: checkpoints are occurring too frequently (21 seconds apart)
2020-12-07 13:40:56 GMT HINT: Consider increasing the configuration parameter "max_wal_size".
2020-12-07 13:40:57 GMT ERROR: extra data after last expected column
2020-12-07 13:40:57 GMT CONTEXT: COPY substance, line 21511: "21743 \N 2 1d8c29d2d4dc17ccec4a29710c2f190a e98906e08d4cf1ac23bc4a5a26f83e73 1d8c29d2d4dc17ccec4a297..."
2020-12-07 13:40:57 GMT STATEMENT: COPY substance (id, text_id, storehouse_id, i_tb_id, i_twod_tb_id, tb_id, twod_tb_id, o_smiles, i_smiles_id, i_twod_smiles_id, smiles_id, twod_smiles_id, substance_type)
2020-12-07 13:40:57 GMT FATAL: invalid frontend message type 48
2020-12-07 13:40:57 GMT LOG: PID 105976 in cancel request did not match any process
or
2020-12-07 14:35:42 GMT LOG: checkpoints are occurring too frequently (16 seconds apart)
2020-12-07 14:35:42 GMT HINT: Consider increasing the configuration parameter "max_wal_size".
2020-12-07 14:35:59 GMT LOG: checkpoints are occurring too frequently (17 seconds apart)
2020-12-07 14:35:59 GMT HINT: Consider increasing the configuration parameter "max_wal_size".
2020-12-07 14:36:09 GMT ERROR: invalid byte sequence for encoding "UTF8": 0x00
2020-12-07 14:36:09 GMT CONTEXT: COPY scalar_calculation, line 3859209
2020-12-07 14:36:09 GMT STATEMENT: COPY scalar_calculation (calculator_id, smiles_id, mean_value, remark) FROM stdin;
2020-12-07 14:36:09 GMT FATAL: invalid frontend message type 49
2020-12-07 14:36:10 GMT LOG: PID 109816 in cancel request did not match any process
I am seeing similar behavior on Windows 10 Pro machines with PG 11.x.
I used pg_dump as suggested above and restored to said machines with psql and had no error.
I also noted that the error moves around when pg_restore is run with different "-j" settings. For instance, without the setting or with "-j 1", pg_restore always fails on the same table and record. Changing to "-j 4" lets that table load the record without error, but the error then occurs on another table.
Changing a particular column to null in the failing record lets the entire restore succeed.
Using pgAdmin 4 to run the restore never produces the error.
Copying the exact command displayed in pgAdmin reproduces the same error:
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 32780; 0 5435293 TABLE DATA REDACTED_TABLE_NAME postgres
pg_restore: [archiver (db)] COPY failed for table "REDACTED_TABLE_NAME": ERROR: extra data after last expected column
CONTEXT: COPY mi_gmrfutil, line 117: "REDACTED PLAIN TEXT \N REDACTED PLAIN TEXT \N \N \N \N \N \N REDACTED PLAIN TEXT \N \N REDACTED PLAIN TEXT \N ..."
pg_restore: FATAL: invalid frontend message type 49
I tried using pg_restore version 14 with the same outcome.

Postgres is not accepting commands and Vacuum failed due to missing chunk number error

Version: 9.4.4
Exception while inserting a record in health_status.
org.postgresql.util.PSQLException: ERROR: database is not accepting commands to avoid wraparound data loss in database "db"
Hint: Stop the postmaster and vacuum that database in single-user mode.
As the hint suggests, I started the server in single-user mode and tried to run a full vacuum, but received the error below instead:
PostgreSQL stand-alone backend 9.4.4
backend> vacuum full;
< 2019-11-06 14:26:25.179 UTC > WARNING: database "db" must be vacuumed within 999999 transactions
< 2019-11-06 14:26:25.179 UTC > HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
< 2019-11-06 14:26:25.215 UTC > ERROR: missing chunk number 0 for toast value xxxx in pg_toast_1234
< 2019-11-06 14:26:25.215 UTC > STATEMENT: vacuum full;
I tried to run a plain vacuum, but that leads to another error indicating missing attributes for relid xxxxx:
backend> vacuum;
< 2019-11-06 14:27:47.556 UTC > ERROR: catalog is missing 3 attribute(s) for relid xxxxx
< 2019-11-06 14:27:47.556 UTC > STATEMENT: vacuum;
I tried to do a vacuum freeze of the entire db, but after some time it fails with the catalog error again.
Furthermore, a vacuum freeze of a single table works fine, but when I vacuum all tables, the run presumably includes the corrupted one as well and ends with the same error:
backend> vacuum full freeze
< 2019-11-07 08:54:25.958 UTC > WARNING: database "db" must be vacuumed within 999987 transactions
< 2019-11-07 08:54:25.958 UTC > HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
< 2019-11-07 08:54:26.618 UTC > ERROR: missing chunk number 0 for toast value xxxxx in pg_toast_xxxx
< 2019-11-07 08:54:26.618 UTC > STATEMENT: vacuum full freeze
Is there a way to identify the corrupted table and restore the integrity of the database so the application can access the rest of the data?
P.S. I do not have a backup to restore from, so deleting the corrupted data or somehow fixing it would be the only solution here.

Dockerized Postgresql cannot access postgresql.conf on custom image

I am in the process of experimenting/tinkering/learning/breaking with Docker. I am currently writing Docker code to create a snapshotted testing environment for my application.
By snapshotted I mean that my database is reset on purpose on every restart, so that I can work with old data from a certain point in time. What is peculiar in my case is that I want to populate a PostgreSQL database at build time, not at start time. The PostgreSQL image can populate the db with SQL scripts at container start, but that takes hours.
My application is made by a Tomcat 8.5 server running my WAR and a Postgresql database, which is the focus of my question now. I am creating a Gist while I write for full code.
The code I have done
Full code on Gist
I have followed a tutorial on how to build a Docker image of Postgres with a full database, rather than have Postgres populate itself on boot. This is because I have a million-record database and only a .sql.gz dump that the sysop gave me.
So the relevant parts of the Dockerfile are
WORKDIR /opt/setup/
COPY db-setup.sh /opt/setup/
COPY db-pack.sh /opt/setup/
COPY db-run.sh /opt/setup/
RUN ./db-setup.sh
RUN ./db-pack.sh
#VOLUME $PGDATA (Note it is commented out, now)
EXPOSE 5432
The db-setup.sh script is run on image build and picks up files from data-scripts.d. Of course I am not allowed to share the contents of the dump, but it's a plain .sql.gz with plenty of OIDs that takes a huge amount of time to restore. The db-setup.sh shown in the Gist is derived from both the tutorial and the original Postgres image so that it handles compression correctly (the tutorial only uses plain SQL).
Build succeeds, startup fails
When I build the image, it takes a considerable amount of time to load the data, which is what I want:
2019-08-07 07:57:04.149 UTC [49] LOG: database system was shut down at 2019-08-07 07:57:03 UTC
2019-08-07 07:57:04.231 UTC [48] LOG: database system is ready to accept connections
done
server started
./db-setup.sh: running methodinv_pcp3.sql.gz
2019-08-07 08:49:52.052 UTC [117] ERROR: canceling autovacuum task
2019-08-07 08:49:52.052 UTC [117] CONTEXT: automatic analyze of table "postgres.public.ftt_interactive_data_492"
2019-08-07 08:49:59.086 UTC [118] ERROR: canceling autovacuum task
2019-08-07 08:49:59.086 UTC [118] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
2019-08-07 08:50:34.086 UTC [118] ERROR: canceling autovacuum task
2019-08-07 08:50:34.086 UTC [118] CONTEXT: automatic analyze of table "postgres.public.ftt_validation_492"
2019-08-07 08:51:11.889 UTC [119] ERROR: canceling autovacuum task
2019-08-07 08:51:11.889 UTC [119] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
2019-08-07 08:54:21.131 UTC [123] ERROR: canceling autovacuum task
2019-08-07 08:54:21.131 UTC [123] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
waiting for server to shut down...2019-08-07 08:54:28.652 UTC [48] LOG: received fast shutdown request
.2019-08-07 08:54:28.797 UTC [48] LOG: aborting any active transactions
2019-08-07 08:54:28.799 UTC [48] LOG: worker process: logical replication launcher (PID 55) exited with exit code 1
2019-08-07 08:54:28.800 UTC [50] LOG: shutting down
..2019-08-07 08:54:31.407 UTC [48] LOG: database system is shut down
done
When I run the image with docker run, startup fails because it can't find the Postgres configuration:
D:\IdeaProjects\pcp\ftt-containers\ftt-db-method>docker run -p 5432:5432 -l ftt-db-method ftt-db-method:latest
Restoring /var/lib/postgresql/data ...
Done.
Launching command: postgres ...
postgres: could not access the server configuration file "/var/lib/postgresql/data/postgresql.conf": No such file or directory
Originally, my Dockerfile declared a VOLUME, which is now commented out. The above output occurs both when I declare the volume (which is not exactly what I want; I am new to Docker and copy-pasted it at first) and when I comment the volume out.
Question
What is wrong with the Docker image of Postgres fully loaded with s**tloads of data that I am experimenting with?
How can I effectively start Postgres with an already full database that will not (necessarily) survive container restarts?
Edit 1
By bash-ing into the container I have found that the data dump created at build time is 10K, so basically empty.
This doesn't solve my problem yet, but it explains why Postgres is unable to find its beloved data dir.
Edit 2
I was able to bash into a temporary container, specifically at the point between the database being restored and the data directory being packed.
Basically the Dockerfile does
RUN ./db-setup.sh
Which executes the restore of the sql
echo "$0: running $f"; gunzip -c "$f" | "${psql[@]}" > /dev/null 2>&1 ; echo ;;
The output is saved to a temporary container.
Now Dockerfile does
RUN ./db-pack.sh
Which tars /var/lib/postgresql/data into /zdata. The build output is:
2019-08-07 16:43:51.532 UTC [42] LOG: received fast shutdown request
waiting for server to shut down....2019-08-07 16:43:51.676 UTC [42] LOG: aborting any active transactions
2019-08-07 16:43:51.679 UTC [42] LOG: worker process: logical replication launcher (PID 49) exited with exit code 1
2019-08-07 16:43:51.681 UTC [44] LOG: shutting down
...2019-08-07 16:43:54.952 UTC [42] LOG: database system is shut down
done
server stopped
Removing intermediate container 8dbe2a4e776a
---> 263896b905ce
Step 15/19 : RUN ./db-pack.sh
---> Running in 56132ecb90cc
Packing data folder: /var/lib/postgresql/data
Pack & clean finished successfully.
Removing intermediate container 56132ecb90cc
---> 1a7f8d68e8df
Step 16/19 : VOLUME $PGDATA
---> Running in 10d222beed81
Removing intermediate container 10d222beed81
---> e1a9355882d1
So I tagged 263896b905ce (YMMV if you replicate this on your PC) as a new image, then ran bash in it. The data dir was empty, so the script would have packed nothing:
docker tag 263896b905ce examine
docker run -it --entrypoint /bin/bash examine
root@ab963ace16a1:/opt/setup# ls
data-scripts.d db-pack.sh db-run.sh db-setup.sh
root@ab963ace16a1:/opt/setup# cd /zdata/
root@ab963ace16a1:/zdata# ls
root@ab963ace16a1:/zdata# cd /var/lib/postgresql/
root@ab963ace16a1:/var/lib/postgresql# ls
data
root@ab963ace16a1:/var/lib/postgresql# cd data/
root@ab963ace16a1:/var/lib/postgresql/data# ls
root@ab963ace16a1:/var/lib/postgresql/data# ls -lah
total 8.0K
drwxrwxrwx 2 postgres postgres 4.0K Jul 17 23:55 .
drwxr-xr-x 1 postgres postgres 4.0K Jul 17 23:55 ..
root@ab963ace16a1:/var/lib/postgresql/data#
root@ab963ace16a1:/var/lib/postgresql/data# ls^C
root@ab963ace16a1:/var/lib/postgresql/data# exit
exit
Fixed
According to https://stackoverflow.com/a/52762779/471213
"why doesn't VOLUME work?" When you define a VOLUME in the Dockerfile, you can only define the target, not the source of the volume. During the build, you will only get an anonymous volume from this. That anonymous volume will be mounted at every RUN command, prepopulated with the contents of the image, and then discarded at the end of the RUN command. Only changes to the container are saved, not changes to the volume.
So I basically had to run both scripts in a single RUN instruction, so that the data directory is packed before the anonymous volume is discarded:
RUN ./db-setup.sh && ./db-pack.sh
#RUN ./db-pack.sh