Docker Postgres database corrupted - postgresql

I had a docker container running timescaleDB. The database data was stored outside the container.
docker run -d --name timescale -v /<DATA>:/var/lib/postgresql/data timescale/timescaledb-postgis:latest-pg10
Something strange happened lately. I log in and see all the databases have suddenly vanished
I see the below in the log file
2021-03-13 11:32:00.215 UTC [21] LOG: database system was interrupted; last known up at 2021-03-11 16:16:19 UTC
2021-03-13 11:32:00.242 UTC [21] LOG: database system was not properly shut down; automatic recovery in progress
2021-03-13 11:32:00.243 UTC [21] LOG: redo starts at 0/15C1270
2021-03-13 11:32:00.243 UTC [21] LOG: invalid record length at 0/15C12A8: wanted 24, got 0
2021-03-13 11:32:00.243 UTC [21] LOG: redo done at 0/15C1270
2021-03-13 11:32:00.247 UTC [8] LOG: database system is ready to accept connections
2021-03-13 20:33:10.424 UTC [31] LOG: could not receive data from client: Operation timed out
2021-03-13 20:33:10.424 UTC [29] LOG: could not receive data from client: Operation timed out
Does that means that database has corrupted? If so is there a way to recover it somehow? The container has been running for 3 years without a problem and suddenly this unexpected loss of database.
Thanks

Yes, the database was corrupted, but it was recovering by the automated recovery process. It looked like the db system started working since it sent this message: database system is ready to accept connections. This means that the logfile recovery was done properly (which doesn't mean that the database files are fully consistent).
When the database is abruptly shutdown, there is small chance for filelvel corruption as well, but the good news is that I don't see anything in the log, after the recovery that can suggest that this is the case, however, you need to have backup of the files.
The next log message could not receive data from client: Operation timed out is not related to recovery, it's due to the client application which had terminated without properly closing the connection.
Check more information on corruptions and reasons in Postgresql wiki.
If you depend on the data in the database, always keep backup. Easiest way is to use pg_dumpall. This will dump the data in plain text format as a series of SQL statements and you will be able to import the data on later versions of PostgreSQL.
So my recommendation, before you do anything else with it, STOP THE CONTAINER AND TAKE BACKUP OF THE FILES. The recovery is trial and error process, and you will need to have the fresh copy of the files to try different thing. After you do this, export the data with pg_dumpall. If this passes, you can resume normal operations of the database.

Related

Postgres.exe crashes and tears down all apps, recovers and is running again

I'm running an application with about 20 processes connected to a postgres DB (10.0) on windows server 2016.
Since about a month I have unexpected crashes of postgres.exe.
To isolate the problem I extended the logging by setting log_min_duration_statement = 0
This creates more detailed logfile. What I can see is:
LOG: server process (PID xxxxx) was terminated by exception
0xFFFFFFFF DETAIL: Failed process was running: COMMIT HINT: See C
include file "ntstatus.h" for a description of the hexadecimal value.
Then it tears down all 20 processes like this:
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.Then DB recovers:
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2021-06-11 18:17:18 CEST
DB enters recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: database system was not properly shut down; automatic recovery in progress
...
LOG: redo starts at 1B2/33319E58
FATAL: the database system is in recovery mode
LOG: invalid record length at 1B2/33D29930: wanted 24, got 0
LOG: redo done at 1B2/33D29908
LOG: last completed transaction was at log time 2021-06-11 18:21:39.830526+02
FATAL: the database system is in recovery mode
...
FATAL: the database system is in recovery mode
LOG: database system is ready to accept connections
Now it's running again like normal
The crashed PID xxxxx I can identify to a postgres.exe running for one of the 20 application processes. It's not always the same one. This happens about every 5-10 days.
Can anybody give me some advice how to track down the reason of this crash?
Extensions used:
oracle_fdw 2.0.0, PostgreSQL 10.0, Oracle client 11.2.0.3.0, Oracle server 11.2.0.2.0
Crashdump:
Followed the link :
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
Although the postgres user has "full control" of the crashdump folder in the security tab it does not write something. Folder stays empty.
Follow-Up on the comment #Laurenz Albe:
The COMMIT is not the reason of the crash. It is the last successfull executed command of the session. Explained on the following example:
Process gets a job and starts to do it's job
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.061 ms statement: DISCARD ALL
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.012 ms statement: BEGIN
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.015 ms statement: SET TRANSACTION ISOLATION LEVEL READ COMMITTED
now a lot of action going on within session 25604
and among others the oracle foreign datawrapper
2021-06-15 16:28:13.792 CEST [25604] LOG: duration: 0.016 ms execute <unnamed>: FETCH ALL FROM "<unnamed portal 689>"
finishes action successfully (data of the transaction in the database)
2021-06-15 16:28:13.823 CEST [25604] LOG: duration: 0.059 ms statement: COMMIT
a lot of action is going in different sessions
among others the oracle foreign datawrapper
more the 7 minutes afterwards the next job is requested and now postgres.exe crash
2021-06-15 16:36:01.524 CEST [17904] LOG: server process (PID 25604) was terminated by exception 0xFFFFFFFF
The process does not do DISCARD ALL, BEGIN and SET TRANSACTION ISOLATION LEVEL READ COMMITTED
It crashes immediately
My Conclusion:
"the possibly corrupted shared memory" was initiated by one of the processes before. Meaning between the last successful COMMIT and the new request.
That's a 7 minutes time span where the problem occurs.
Some feedback on this conclusion?

Change PostgreSQL data directory to directory created by PostgresSQL on another machine

I have a database stored on an external hard drive. The database was created using PostgreSQL 11, on an Ubuntu 18.04 machine. The folder it's stored in was the data directory of my PostgreSQL instance on my Ubuntu machine, everything worked fine. I don't have access to this Ubuntu machine anymore, and this will last for a few months, but I have the external drive. I'm working under macOS 14.6 in the meantime. I setup PostgreSQL on my Mac using the Postgres.app. I created a new server, making sure to use version 11. The defaut data directory was of course not the one I want, so I changed its path in postgresql.conf to point to my existing data dir:
data_directory = 'path_to_external_HDD_data_directory'
Note that this is all I changed in the .conf file (should I change anything else?). When I try to connect to the server via Postgres.app, I get the following error:
pg_ctl: server did not start in time
And the log is:
2019-10-21 22:06:47.628 CEST [72547] LOG: listening on IPv6 address "::1", port 5432
2019-10-21 22:06:47.629 CEST [72547] LOG: listening on IPv4 address "127.0.0.1", port 5432
2019-10-21 22:06:47.654 CEST [72547] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2019-10-21 22:06:47.742 CEST [72548] LOG: database system was interrupted; last known up at 2019-10-21 22:00:07 CEST
2019-10-21 22:06:58.263 CEST [72548] LOG: database system was not properly shut down; automatic recovery in progress
2019-10-21 22:06:58.266 CEST [72548] LOG: redo starts at 4A/B2804E40
2019-10-21 22:06:58.266 CEST [72548] LOG: invalid record length at 4A/B2804E78: wanted 24, got 0
2019-10-21 22:06:58.266 CEST [72548] LOG: redo done at 4A/B2804E40
2019-10-21 22:06:58.314 CEST [72547] LOG: database system is ready to accept connections
Postgres.app then tells me that the port is in use. Running lsof -n -i4TCP:5432, I see that postgres is listening. I killed it and retried but got the same pg_ctl error. Any idea of what I can do?
Your server started successfully. You don't use pg_ctl to connect to PostgreSQL, but the command line client psql.
Anyway, you should stop what you are doing right now before any damage is done.
It is not supported to use a PostgreSQL data directory created with one architecture (Linux) on a different architecture (MacOS).
If the server starts, it is by coincidence. Connecting might work, but it might just as well corrupt your database.

Dockerized Postgresql cannot access postgresql.conf on custom image

I am in the process of experimenting/tinkering/learning/breaking with Docker. I am currently writing Docker code to create a snapshotted testing environment for my application.
By snapshotted I mean that my database is reset on purpose on every restart, so that I can work with old data at a certain time. What is peculiar in my case is that I want to populate a Postgresql database at build time, not at start time. Postgresql image is ready for populating the db with sql scripts at container start, but it takes hours.
My application is made by a Tomcat 8.5 server running my WAR and a Postgresql database, which is the focus of my question now. I am creating a Gist while I write for full code.
The code I have done
Full code on Gist
I have followed a tutorial on how to build a Docker image of Postgres with a full database, rather than have Postgres populate itself on boot. This because I have a million record database and only a .sql.gz dump that sysop gave me.
So the relevant parts of the Dockerfile are
WORKDIR /opt/setup/
COPY db-setup.sh /opt/setup/
COPY db-pack.sh /opt/setup/
COPY db-run.sh /opt/setup/
RUN ./db-setup.sh
RUN ./db-pack.sh
#VOLUME $PGDATA (Note it is commented out, now)
EXPOSE 5432
The db-setup.sh is run on image build, and picks files from data-scripts.d. Of course I am not allowed to share the contents of the dump, but it's a plain .sql.gz with plenties of OIDs that take a huge amount of time to restore. The db-setup.sh shown in Gist is derived from both the tutorial and the original Postgres image so that it handles correctly the compression (the tutorial only uses plain SQL)
Build succeeds, startup fails
When I build the image, it takes considerable amount of time to load the data, which is what I want
2019-08-07 07:57:04.149 UTC [49] LOG: database system was shut down at 2019-08-07 07:57:03 UTC
2019-08-07 07:57:04.231 UTC [48] LOG: database system is ready to accept connections
done
server started
./db-setup.sh: running methodinv_pcp3.sql.gz
2019-08-07 08:49:52.052 UTC [117] ERROR: canceling autovacuum task
2019-08-07 08:49:52.052 UTC [117] CONTEXT: automatic analyze of table "postgres.public.ftt_interactive_data_492"
2019-08-07 08:49:59.086 UTC [118] ERROR: canceling autovacuum task
2019-08-07 08:49:59.086 UTC [118] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
2019-08-07 08:50:34.086 UTC [118] ERROR: canceling autovacuum task
2019-08-07 08:50:34.086 UTC [118] CONTEXT: automatic analyze of table "postgres.public.ftt_validation_492"
2019-08-07 08:51:11.889 UTC [119] ERROR: canceling autovacuum task
2019-08-07 08:51:11.889 UTC [119] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
2019-08-07 08:54:21.131 UTC [123] ERROR: canceling autovacuum task
2019-08-07 08:54:21.131 UTC [123] CONTEXT: automatic analyze of table "postgres.public.ftt_oper_492"
waiting for server to shut down...2019-08-07 08:54:28.652 UTC [48] LOG: received fast shutdown request
.2019-08-07 08:54:28.797 UTC [48] LOG: aborting any active transactions
2019-08-07 08:54:28.799 UTC [48] LOG: worker process: logical replication launcher (PID 55) exited with exit code 1
2019-08-07 08:54:28.800 UTC [50] LOG: shutting down
..2019-08-07 08:54:31.407 UTC [48] LOG: database system is shut down
done
When I run the image with docker run, startup fails because it can't find Postgres configuration
D:\IdeaProjects\pcp\ftt-containers\ftt-db-method>docker run -p 5432:5432 -l ftt-db-method ftt-db-method:latest
Restoring /var/lib/postgresql/data ...
Done.
Launching command: postgres ...
postgres: could not access the server configuration file "/var/lib/postgresql/data/postgresql.conf": No such file or directory
Originally, my Dockerfile exposed a VOLUME which is now commented out. The above output occurs both when I declare a volume (which is not exactly what I want, I am new to Docker and copied&pasted on first chance) and when I comment the volume out.
Question
What is wrong with the Docker image of Postgres fully loaded with s**tloads of data I am experimenting?
How can I effectively start Postgres with an already full database that will not (necessarily) survive container restarts?
Edit 1
By bash-ing into the container I have found that the data dump created during build time is 10K, so basically empty.
This doesn't solve my problem yet, but answers why Postgres is unable to find its beloved data dir
Edit 2
I was able to bash into a temporary container, in particular between the moment the database is restored and the data lib is packed.
Basically the Dockerfile does
RUN ./db-setup.sh
Which executes the restore of the sql
echo "$0: running $f"; gunzip -c "$f" | "${psql[#]}" > /dev/null 2>&1 ; echo ;;
The output is saved to a temporary container.
Now Dockerfile does
RUN ./db-pack.sh
Which tars /var/lib/postgresql/data into /zdata. I have
2019-08-07 16:43:51.532 UTC [42] LOG: received fast shutdown request
waiting for server to shut down....2019-08-07 16:43:51.676 UTC [42] LOG: aborting any active transactions
2019-08-07 16:43:51.679 UTC [42] LOG: worker process: logical replication launcher (PID 49) exited with exit code 1
2019-08-07 16:43:51.681 UTC [44] LOG: shutting down
...2019-08-07 16:43:54.952 UTC [42] LOG: database system is shut down
done
server stopped
Removing intermediate container 8dbe2a4e776a
---> 263896b905ce
Step 15/19 : RUN ./db-pack.sh
---> Running in 56132ecb90cc
Packing data folder: /var/lib/postgresql/data
Pack & clean finished successfully.
Removing intermediate container 56132ecb90cc
---> 1a7f8d68e8df
Step 16/19 : VOLUME $PGDATA
---> Running in 10d222beed81
Removing intermediate container 10d222beed81
---> e1a9355882d1
So I tagged 263896b905ce (YHMV if you replicate on your pc) into a new image, then executed bash on it. The data dir was empty, the script would have packed nothing
docker tag 263896b905ce examine
docker run -it --entrypoint /bin/bash examine
root#ab963ace16a1:/opt/setup# ls
data-scripts.d db-pack.sh db-run.sh db-setup.sh
root#ab963ace16a1:/opt/setup# cd /zdata/
root#ab963ace16a1:/zdata# ls
root#ab963ace16a1:/zdata# cd /var/lib/postgresql/
root#ab963ace16a1:/var/lib/postgresql# ls
data
root#ab963ace16a1:/var/lib/postgresql# cd data/
root#ab963ace16a1:/var/lib/postgresql/data# ls
root#ab963ace16a1:/var/lib/postgresql/data# ls -lah
total 8.0K
drwxrwxrwx 2 postgres postgres 4.0K Jul 17 23:55 .
drwxr-xr-x 1 postgres postgres 4.0K Jul 17 23:55 ..
root#ab963ace16a1:/var/lib/postgresql/data#
root#ab963ace16a1:/var/lib/postgresql/data# ls^C
root#ab963ace16a1:/var/lib/postgresql/data# exit
exit
Fixed
According to https://stackoverflow.com/a/52762779/471213
"why doesn't VOLUME work?" When you define a VOLUME in the Dockerfile, you can only define the target, not the source of the volume. During the build, you will only get an anonymous volume from this. That anonymous volume will be mounted at every RUN command, prepopulated with the contents of the image, and then discarded at the end of the RUN command. Only changes to the container are saved, not changes to the volume.
So I had basically to run both RUNs at the same time
RUN ./db-setup.sh && ./db-pack.sh
#RUN ./db-pack.sh

Restore postgres db from folder

i have an old copy of my postgresql db folder (/var/lib/postgresql/9.5/main/) from my server. Now I want to get the data out of the files. So i copied the main folder to my local machine and changed the postgresql config (/etc/postgresql/9.5/main/postgresql.conf) to point to that directory. Also i changed the permission of the main directory to the user postgres. After restarting the postgresql service (sudo service postgresql restart) it doesn't really work.
What I'm doing wrong? (Yea I know, pg_dump is the preferred way, but in this way...)
So my question, does this even work?
Or is there a other way to get the data out of this?
everything is done on ubuntu 16.04.
Edit:
the log file after changing the postgresql.conf file to point to the new directory.
2017-10-13 06:15:43 CEST [968-1] LOG: database system was shut down at 2017-10-13 00:21:04 CEST
2017-10-13 06:15:43 CEST [968-2] LOG: MultiXact member wraparound protections are now enabled
2017-10-13 06:15:43 CEST [959-1] LOG: database system is ready to accept connections
2017-10-13 06:15:43 CEST [975-1] LOG: autovacuum launcher started
2017-10-13 06:15:43 CEST [983-1] [unknown]#[unknown] LOG: incomplete startup packet
2017-10-13 06:47:55 CEST [975-2] LOG: autovacuum launcher shutting down
2017-10-13 06:47:55 CEST [959-2] LOG: received smart shutdown request
2017-10-13 06:47:55 CEST [972-1] LOG: shutting down
2017-10-13 06:47:55 CEST [972-2] LOG: database system is shut down
2017-10-13 06:47:55 CEST [4667-1] FATAL: database files are incompatible with server
2017-10-13 06:47:55 CEST [4667-2] DETAIL: The database cluster was initialized without USE_FLOAT8_BYVAL but the server was compiled with USE_FLOAT8_BYVAL.
2017-10-13 06:47:55 CEST [4667-3] HINT: It looks like you need to recompile or initdb.
Ok that pointed me to this. The server is a armv7l, whereas the local machine is x86_64 (uname -m). So there is no chance to get the data out of it?
thx, Luc
If it's really true that your data directory is from an ARM7l system, and your local system is x86_64, you're going to have some difficulties.
The immediate error about USE_FLOAT8_BYVAL is because ARM7L is 32-bit, and cannot pass 64-bit floating point values (8 byte) by-value. Your 64-bit host can. But if you recompiled a custom postgres with USE_FLOAT8_BYVAL disabled you'd likely just run into other issues.
I suggest installing PostgreSQL on a matching ARM system to recover the data. Data directories for PostgreSQL are not portable across architectures (for performance reasons).
If you do not have access to the ARM system anymore, an emulator like qemu should be able to help you.
Otherwise, maybe you can compile a modified PostgreSQL (probably starting with 32-bit x86) that can read the data-dir, with appropriate configure options etc. I've never needed to try this.

What is the purpose of `pg_logical` directory inside PostgreSQL data?

I've just stumbled upon this error while testing failover of a PostgreSQL 9.4 cluster I've set up. Here I'm trying to promote a slave to be the new master:
$ repmgr -f /etc/repmgr/repmgr.conf --verbose standby promote
2014-09-22 10:46:37 UTC LOG: database system shutdown was interrupted; last known up at 2014-09-22 10:44:02 UTC
2014-09-22 10:46:37 UTC LOG: database system was not properly shut down; automatic recovery in progress
2014-09-22 10:46:37 UTC LOG: redo starts at 0/18000028
2014-09-22 10:46:37 UTC LOG: consistent recovery state reached at 0/19000600
2014-09-22 10:46:37 UTC LOG: record with zero length at 0/1A000090
2014-09-22 10:46:37 UTC LOG: redo done at 0/1A000028
2014-09-22 10:46:37 UTC LOG: last completed transaction was at log time 2014-09-22 10:36:22.679806+00
2014-09-22 10:46:37 UTC FATAL: could not open directory "pg_logical/snapshots": No such file or directory
2014-09-22 10:46:37 UTC LOG: startup process (PID 2595) exited with exit code 1
2014-09-22 10:46:37 UTC LOG: aborting startup due to startup process failure
pg_logical/snapshots dir in fact exists on master node and it is empty.
UPD: I've just manually created empty directories pg_logical/snapshots and pg_logical/mappings and server has started without complaining. repmgr standby clone seems to omit this dirs while syncing. But the question still remains because I'm just curious what this directory is for, maybe I'm missing something in my setup. Simply Googling it did not yield any meaningful results.
It's for the new logical changeset extraction / logical replication feature in 9.4.
This shouldn't happen, though... it suggests a significant bug somewhere, probably repmgr. I'll wait for details (repmgr version, etc).
Update: Confirmed, it's a repmgr bug. It's fixed in git master already (and was before this report) and will be in the next release. Which had better be soon, given the significance of this issue.