Restoring a MongoDB database fails using VMC tunneling - mongodb

I have successfully published an app to CloudFoundry. When I try to seed the database using VMC tunneling and mongorestore, only part of the data is transferred: the restore process hangs part way into the collection. If I use mongorestore to restore the dump to my local mongo instance, it works fine.
$vmc tunnel energy mongorestore
Opening tunnel on port 10000... OK
Waiting for local tunnel to become available... OK
Directory or filename to restore from> ./dump/energy
connected to: localhost:10000
Wed Jan 16 09:22:25 ./dump/energy/twohourlyhistoryDatas.bson
Wed Jan 16 09:22:25 going into namespace [db.twohourlyhistoryDatas]
Wed Jan 16 09:22:27 warning: Restoring to db.twohourlyhistoryDatas without dropping.
Restored data will be inserted without raising errors; check your server log
795 objects found
Wed Jan 16 09:22:27 Creating index: { key: { _id: 1 }, ns: "db.twohourlyhistoryDatas", name: "_id_" }
I've left this for several hours and it hasn't finished. Using a network monitor I can see the data being transferred for 10-15 seconds and then stopping suddenly. Turning on verbose mode for vmc hasn't revealed any failures, and running mongorestore directly with the same command and very verbose output hasn't shed any light on the problem either.
Apart from this, using CloudFoundry has been outstandingly easy. Any suggestions on where to look now to resolve the issue are welcome!

There are size limits on the database (for MongoDB it's 240 MB), and there are also time limits on operations over the tunnel. How big is the database?
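One way to check where you stand against that limit (the database name and credentials are whatever vmc tunnel prints when it opens the tunnel, so adjust accordingly) is to compare the size of the dump on disk with what the tunnelled instance reports:
$ du -sh ./dump/energy                          # size of the dump being restored (path from the question)
$ mongo localhost:10000/<db-name> -u <user> -p <password> --eval "printjson(db.stats())"
db.stats() reports dataSize and storageSize, which you can compare against the 240 MB cap.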

Related

Using Docker, what triggered PANIC: could not locate a valid checkpoint record

I am trying to understand Docker a little better, and in doing so, it appears I corrupted my PostgreSQL DB for my application.
I am using Docker Swarm to start my application and I'm getting the following error in a loop in the PostgreSQL Container:
2021-02-10 15:38:51.304 UTC 120 LOG: database system was shut down at 2021-02-10 14:49:14 UTC
2021-02-10 15:38:51.304 UTC 120 LOG: invalid primary checkpoint record
2021-02-10 15:38:51.304 UTC 120 LOG: invalid secondary checkpoint record
2021-02-10 15:38:51.304 UTC 120 PANIC: could not locate a valid checkpoint record
2021-02-10 15:38:51.447 UTC 1 LOG: startup process (PID 120) was terminated by signal 6
2021-02-10 15:38:51.447 UTC 1 LOG: aborting startup due to startup process failure
2021-02-10 15:38:51.455 UTC 1 LOG: database system is shut down
Initially, I was trying to modify the pg_hba.conf file in the container by going to the mount drive in the FS, which is in
/var/lib/docker/volumes/postgres96-data-volume/_data
However, every time I restarted the container, my changes to pg_hba.conf were reverted. So this morning I added a dummy file called test in the mount folder and restarted the container, expecting the file to be deleted, as visual confirmation that restarting the container automatically resets everything in that mount to its original state. After restarting it again, that's when I started getting those error messages, preventing my application from starting.
I deleted the test file and restarted the container again, but the error message continues.
I read many solutions on how to fix it, but my question is more about understanding why adding a file would cause that. Is my volume corrupted simply because I added a file in there?
Thanks
WARNING
For the people who jump onto using the solution in the accepted answer, here's your WARNING:
The solution in the accepted answer asks to remove the docker volume which means that all the data in the PostgreSQL instance will be lost!!!
Refer to my answer here if you wish to preserve the data of the database instance.
Context in which I faced the same error
I am also using docker swarm to deploy containers and recently encountered this issue when I tried to scale the postgres db to create 2 replicas, both pointing to the same physical volume (mounted using docker, shared using NFS).
This was needed so that the data is in sync across both replicas.
But this led me to the same error you are seeing:
PANIC: could not locate a valid checkpoint record
My findings
Firstly, the database volume is not corrupted; only the transaction WAL (write-ahead log) is corrupted or has lost consensus. I did a lot of digging on it and found two scenarios in which this error may occur:
The database was executing a live transaction but suddenly it shut down due to some error. In this case, the WAL tells the database what it was supposed to be doing when it unexpectedly shut down. However, if the DB shut down during a WAL update, the WAL may reflect some transactions which were actually executed but have improper execution info. This leads to an inconsistency in DB data vs WAL or a corrupt transaction log which leads to a checkpoint error.
You create multiple replicas of the db which point to the same volume. Consider the case of 2 replicas that I faced. When both replicas simultaneously try to execute a transaction on the same db volume, the transaction WAL loses consensus as there are two simultaneous checkpoints. The db fails to execute any further transactions as it is unable to determine which checkpoint to consider as the correct one. This can also happen if two containers (not necessarily replicas) point to the same mount path for PG_DATA.
Eventually, the db fails to start. The container does not start as the db throws an error which closes the container.
You may reset the WAL to fix this issue. When the WAL is reset, you will lose the transactions that had not yet been applied to the data files; however, data that is already written and transactions that were already processed are preserved.
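If you want to try that inside Docker, a minimal sketch (assuming the postgres:9.6 image and the volume name from the question, with the failing container stopped; on PostgreSQL 10+ the tool is pg_resetwal rather than pg_resetxlog) is to run the reset tool in a throwaway container against the existing volume:
$ docker run --rm -v postgres96-data-volume:/var/lib/postgresql/data postgres:9.6 \
    gosu postgres pg_resetxlog /var/lib/postgresql/data   # add -f if it refuses because the server was not shut down cleanly
Keep a copy of the volume first; discarding WAL is exactly the step that can lose recent transactions.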
This error means the Postgres volume is corrupted. This can happen when two containers try to connect to the same volume at the same time. See this answer for slightly more info. Not sure how modifying a file corrupted the drive. You'll need to delete and recreate the volume though. To do this you can:
$ docker stop <your_container_name> # stops a running container
$ docker image prune # removes all images that are not attached to a container
$ docker volume ls # list out active volumes
$ docker volume rm <volume_name> # Remove the volume that's corrupted
I had to run the above code to stop a container, clean images that somehow weren't attached to any containers and then finally delete the offending volume where corrupted data was held.
To resolve this error, you can try the following steps:
Stop and remove the existing PostgreSQL container:
docker stop <container_name>
docker rm <container_name>
Delete the old PostgreSQL data directory, which is usually located at /var/lib/postgresql/data. This will delete all of your database data, so make sure to back up any important data before doing this (see the backup sketch after these steps).
Create a new PostgreSQL container with a fresh data directory:
docker run --name <postgres_container_name> -d postgres
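For the backup mentioned above (before deleting anything), a minimal sketch that copies the raw files out of the volume without needing the database to start (the volume name is taken from the question, the archive name is arbitrary):
$ docker run --rm -v postgres96-data-volume:/source -v "$(pwd)":/backup alpine \
    tar czf /backup/pgdata-backup.tar.gz -C /source .   # raw copy of the data directory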

PostgreSQL FATAL: the database system is starting up - windows 10

I have installed PostgreSQL on Windows 10 on a USB disk.
Every day when I wake my PC at work from sleep and plug the disk back in, then try to start PostgreSQL, I get this error:
FATAL: the database system is starting up
The service starts with following command:
E:\PostgresSql\pg96\pgservice.exe "//RS//PostgreSQL 9.6 Server"
It is the default one.
logs from E:\PostgresSql\data\logs\pg96
2019-02-28 10:30:36 CET [21788]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
2019-02-28 10:31:08 CET [9796]: [1-1] user=postgres,db=postgres,app=[unknown],client=::1 FATAL: the database system is starting up
I want this start up to happen faster.
When you commit data to a Postgres database, the only thing which is immediately saved to disk is the write-ahead log. The actual table changes are only applied to the in-memory buffers, and won't be permanently saved to disk until the next checkpoint.
If the server is stopped abruptly, or if it suddenly loses access to the file system, then everything in memory is lost, and the next time you start it up, it needs to resort to replaying the log in order to get the tables back to the correct state (which can take quite a while, depending on how much has happened since the last checkpoint). And until it's finished, any attempt to use the server will result in FATAL: the database system is starting up.
If you make sure you shut the server down cleanly before unplugging the disk - giving it a chance to set a checkpoint and flush all of its buffers - then it should be able to start up again more or less immediately.
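A minimal sketch of a clean shutdown before unplugging the disk, assuming the service name from the question's service command (adjust if yours differs):
net stop "PostgreSQL 9.6 Server"
or, pointing pg_ctl at your actual data directory:
pg_ctl stop -D <path-to-data-directory> -m fast
Either way the server gets a chance to set a checkpoint and flush its buffers, so the next start doesn't need to replay the log.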

How can I tell if barman is receiving the WAL stream during the day?

I followed the directions in this and this. I've also successfully backed up from one server and restored it to another server. My barman is on a dedicated machine. Looking good. But how can I tell if it's receiving the WAL stream during the day?
I can see the base backups in [barman-server]:/var/lib/barman
barman check mydb is reporting good things
[root@barman barman]# barman check mydb
Server mydb:
PostgreSQL: OK
is_superuser: OK
PostgreSQL streaming: OK
wal_level: OK
replication slot: OK
directories: OK
retention policy settings: OK
backup maximum age: OK (interval provided: 7 days, latest backup age: 24 minutes)
compression settings: OK
failed backups: OK (there are 0 failed backups)
minimum redundancy requirements: OK (have 3 backups, expected at least 0)
pg_basebackup: OK
pg_basebackup compatible: OK
pg_basebackup supports tablespaces mapping: OK
pg_receivexlog: OK
pg_receivexlog compatible: OK
receive-wal running: OK
archiver errors: OK
I have made a cron entry to run the barman backup mydb command (I think it makes more base backups)
[root@barman ~]# cat /etc/cron.d/do_backups
30 23 * * * /usr/bin/barman backup mydb
I share this guy's opinion that this doesn't belong in a separate cron job -- it belongs in the /etc/barman.d/.conf files as some kind of setting that says "Take a Base-Backup every X days" or some such, but that's not my problem in this question.
How do I tell if this is receiving the WAL stream intra-day?
What do I look for to see some progress?
Is there a way to see the IP address or a database connection for this so I know for sure?
(I think I need a little education on WAL streams as well) Are WAL streams something that the PG server "sends" to barman? or is it "pulled" from a process on the barman?
barman uses the cron command to make sure WAL streaming actually works as expected; you can see the related documentation here.
This command runs every minute and is added to your system cron if you installed barman via the Debian/Fedora packages.
On Debian you can check it here: /etc/cron.d/barman
To get a sense of what the barman cron job does, set log_level to DEBUG in /etc/barman.conf
and watch the barman log via tailf /var/log/barman/barman.log.
Every minute, this command takes care of new WAL files and archives them.
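Two quick checks, as a minimal sketch (the server name mydb is taken from the question; the SQL runs on the PostgreSQL server as a superuser):
$ barman status mydb        # on the barman host: shows the current WAL segment and backup status for the server
$ psql -x -c "SELECT application_name, client_addr, state FROM pg_stat_replication;"
The second one also answers the push-vs-pull question: barman's receive-wal process (pg_receivexlog) opens a replication connection to the server, the server streams WAL over it, and that connection shows up as a row in pg_stat_replication with its client address.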

Exception when performing restart from replica set to standalone

I am currently experimenting with MongoDB replica set mechanism.
I already have a working standalone Mongo server with a main database of about 20GB of data.
I decided to convert this mongo server to a primary replica set server, then added a 2nd machine with a similar configuration (but a newer mongo version), as a secondary replica set server.
This works fine, all data is replicated to the secondary as expected.
But I would like to perform some alteration operations on the data (because my data model has changed and I need to, for example, rename some properties or convert references to a simple ObjectId, things like that). At the same time I would like to update the first server, which runs an old version (2.4), to the latest version available (2.6).
So I decided to follow the instructions on the MongoDB website to perform maintenance on replica set members.
shut down the secondary server. (ok)
restart server as standalone on another port (both servers usually run on 27017)
mongod --dbpath /my/database/path --port 37017
And then, the server never restarts correctly and I get this:
2014-10-03T08:20:58.716+0200 [initandlisten] opening db: myawesomedb
2014-10-03T08:20:58.735+0200 [initandlisten] myawesomedb Assertion failure _name == nsToDatabaseSubstring( ns ) src/mongo/db/catalog/database.cpp 472
2014-10-03T08:20:58.740+0200 [initandlisten] myawesomedb 0x11e6111 0x1187e49 0x116c15e 0x8c2208 0x765f0e 0x76ab3f 0x76c62f 0x76cedb 0x76d475 0x76d699 0x7fd958c3eec5 0x764329
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1187e49]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x17e) [0x116c15e]
/usr/bin/mongod(_ZN5mongo8Database13getCollectionERKNS_10StringDataE+0x288) [0x8c2208]
/usr/bin/mongod(_ZN5mongo17checkForIdIndexesEPNS_8DatabaseE+0x19e) [0x765f0e]
/usr/bin/mongod() [0x76ab3f]
/usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x5df) [0x76c62f]
/usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1b) [0x76cedb]
/usr/bin/mongod() [0x76d475]
/usr/bin/mongod(main+0x9) [0x76d699]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fd958c3eec5]
/usr/bin/mongod() [0x764329]
2014-10-03T08:20:58.756+0200 [initandlisten] exception in initAndListen: 0 assertion src/mongo/db/catalog/database.cpp:472, terminating
2014-10-03T08:20:58.757+0200 [initandlisten] dbexit:
What am I doing wrong?
Note that at this time, the first server is still running as primary member.
Thanks in advance!
I believe you are hitting a bug in VMWare here (can you confirm you are using VMWare VMs? confirmed) - I have seen it confirmed on Ubuntu and Fedora so far. The bug causes pieces of previous data not to be zeroed out when creating the MongoDB namespace files (not always, but sometimes). That previous data essentially manifests as corruption in the namespace files and leads to the assertion you saw.
To work around the issue, there will be a fix released in MongoDB versions 2.4.12 and 2.6.5+ as part of SERVER-15369. The OS/Kernel level fix will eventually percolate down from the kernel bug and the Ubuntu patch, but that may take some time to actually be available as an official update (hence the need for the workaround change in MongoDB itself in the interim).
The issue only becomes apparent when you upgrade to 2.6, because of additional checking added in that version that was not present in 2.4; however, the corruption is still there on 2.4, just not reported.
If you still have your primary running, and it does not have the corruption, I would recommend syncing a secondary that is not on a VMWare VM and/or taking a backup of your files as soon as possible for safety - there is no automatic way to fix this corruption right now.
You can also look at using version 2.6.5 once it is released (2.6.5-rc4, which includes the fix, is available as of this writing). You will still need to resync with that version off your good source to create a working secondary, but at least there will then be no corruption of the ns files.
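For the resync itself, a minimal sketch (the dbpath comes from the question; the replica set name is a placeholder): stop the affected member, move its data directory aside, and restart it as a replica set member so it performs a full initial sync from the good primary:
$ mongod --dbpath /my/database/path --shutdown                      # stop the affected member
$ mv /my/database/path /my/database/path.bak && mkdir /my/database/path
$ mongod --dbpath /my/database/path --port 27017 --replSet <your-replset-name>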
Updates:
Version 2.6.5 which includes the fix mentioned was released on October 9th
Version 2.4.12 which includes the fix was released on October 16th
Official MongoDB Advisory: https://groups.google.com/forum/#!topic/mongodb-announce/gPjazaAePoo

Heroku: update database plan, then delete the first one

I updated my DB plan on heroku quite some time ago, following this clear tutorial: https://devcenter.heroku.com/articles/upgrade-heroku-postgres-with-pgbackups
So now I have 2 DB running:
$ heroku pg:info
=== HEROKU_POSTGRESQL_NAVY_URL (DATABASE_URL)
Plan: Crane
Status: Available
Data Size: 26.1 MB
Tables: 52
PG Version: 9.2.6
Connections: 8
Fork/Follow: Available
Rollback: Unsupported
Created: 2013-11-04 09:42 UTC
Region: eu-west-1
Maintenance: not required
=== HEROKU_POSTGRESQL_ORANGE_URL
Plan: Dev
Status: available
Connections: 0
PG Version: 9.2.7
Created: 2013-08-13 20:05 UTC
Data Size: 11.8 MB
Tables: 49
Rows: 7725/10000 (In compliance, close to row limit) - refreshing
Fork/Follow: Unsupported
Rollback: Unsupported
Region: Europe
I keep receiving emails saying that I'm close to the row limit on HEROKU_POSTGRESQL_ORANGE_URL. I'd rather delete it, but I'd like to make sure I'm not going to lose any data. Heroku is not clear about it:
The original database will continue to run (and incur charges) even after the upgrade. If desired, remove it after the upgrade is successful.
But can I be 100% sure that all the data in HEROKU_POSTGRESQL_ORANGE_URL is duplicated in HEROKU_POSTGRESQL_NAVY_URL? Because if HEROKU_POSTGRESQL_ORANGE_URL were a follower of HEROKU_POSTGRESQL_NAVY_URL, its data should be as large as the first one's.
So I just need a confirmation.
Thanks
It sounds to me like the upgrade dumped and reloaded the DB. So the new DB is a copy of the old one. If that's the case, it will contain all data from the old one at the time it was copied - but if you kept on adding new data to the old database, that data wouldn't appear in the new one.
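If you want to spot-check that before dropping anything, one way (the table name is a placeholder; add --app <your-app> if you are not inside the app's directory) is to compare row counts in both databases:
$ psql "$(heroku config:get HEROKU_POSTGRESQL_ORANGE_URL)" -c "SELECT count(*) FROM <some_table>;"
$ psql "$(heroku config:get DATABASE_URL)" -c "SELECT count(*) FROM <some_table>;"
DATABASE_URL points at HEROKU_POSTGRESQL_NAVY_URL according to your pg:info output.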
I strongly recommend that before dropping the DB you:
Disable access to it except for pg_dump
Dump it with pg_dump (or use Heroku's tools to do that; see the sketch after this list)
... and only then delete it.
That way, if you discover you've made a mistake, you have a dump to restore.
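A minimal sketch of that dump step (command names vary with the CLI version: at the time of this question it was heroku pgbackups:capture, newer CLIs use pg:backups; the dump filename is arbitrary):
$ heroku pg:backups:capture HEROKU_POSTGRESQL_ORANGE_URL
$ heroku pg:backups:download                         # saves latest.dump locally
$ pg_dump --format=custom --file=orange.dump "$(heroku config:get HEROKU_POSTGRESQL_ORANGE_URL)"
The pg_dump line is the do-it-yourself equivalent, dumping straight over the database's connection URL.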