postgres index contains "unexpected zero page at block" exceptions - postgresql

I have the following errors in the pg_log file, repeated several thousand times. How can I resolve them?
index "meeting_pkey" contains unexpected zero page at block 410. Please REINDEX it.
index "faultevent_props_pkey" contains unexpected zero page at block 37290.
index "faultevent_pkey" contains unexpected zero page at block 1704

The errors are caused by bad index pages that PostgreSQL is unable to read.
Reindex the problematic index to overcome the issue:
REINDEX INDEX <schema_name>.<index_name>;
Here you have some hints.

Your database is corrupt.
Try to run pg_dumpall to get a logical dump of the database.
If that fails, buy support from somebody who can salvage data from corrupted databases.
If it succeeds:
Check the hardware, particularly storage and RAM.
Once you are certain the hardware is ok, install the latest patch update for your PostgreSQL version.
Create a new database cluster with initdb.
Restore the dump into the new cluster.
Did you have crashes recently?
Did you test if your storage handles fsync requests properly?
Do you have any dangerous settings like fsync = off?
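The last question can be checked mechanically. Below is a minimal Python sketch that scans the text of a postgresql.conf for settings that are explicitly switched off. The function name and the settings list are hypothetical: fsync comes from the question above, while full_page_writes is my own addition as another setting that can cause corruption after a crash. The parser is simplified (it ignores include files and values containing spaces), so running SHOW fsync; in psql remains the authoritative check.

```python
import re

# Settings that can lead to corruption after a crash when disabled.
# fsync is named in the answer above; full_page_writes is an added assumption.
RISKY_WHEN_OFF = ("fsync", "full_page_writes")
OFF_VALUES = {"off", "false", "no", "0"}

# name, then "=" or whitespace, then a single optionally quoted value.
SETTING = re.compile(r"^(\w+)\s*(?:=|\s)\s*'?([^'\s]+)'?$")


def find_risky_settings(conf_text):
    """Return the names of risky settings that are explicitly turned off."""
    effective = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        match = SETTING.match(line)
        if match:
            # A later line overrides an earlier one, as in postgresql.conf.
            effective[match.group(1).lower()] = match.group(2).lower()
    return [name for name in RISKY_WHEN_OFF if effective.get(name) in OFF_VALUES]


sample = "fsync = off\n#full_page_writes = off\nshared_buffers = 128MB\n"
print(find_risky_settings(sample))
```

For the sample text only fsync is reported, because the full_page_writes line is commented out.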

I ran into this issue and after a lot of reading I decided to do a complete DB reindex:
reindex DATABASE <DATABASE NAME>
and it solved the issue for me. Hope this helps you.
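If reindexing the whole database is too heavy, the affected indexes can be collected from the log instead. The following is a rough Python sketch, assuming the log lines look exactly like the ones quoted in the question; the function name is made up for illustration. Note that the log message does not include the schema, so an index outside the search_path would still need to be schema-qualified by hand.

```python
import re

# Matches the message format quoted in the question.
ZERO_PAGE = re.compile(r'index "([^"]+)" contains unexpected zero page at block \d+')


def reindex_statements(log_text):
    """Return one REINDEX INDEX statement per distinct index named in the log."""
    seen = []
    for name in ZERO_PAGE.findall(log_text):
        if name not in seen:  # keep first-seen order, drop repeats
            seen.append(name)
    return ['REINDEX INDEX "{}";'.format(name) for name in seen]


log = (
    'index "meeting_pkey" contains unexpected zero page at block 410\n'
    'index "faultevent_pkey" contains unexpected zero page at block 1704\n'
    'index "meeting_pkey" contains unexpected zero page at block 411\n'
)
for statement in reindex_statements(log):
    print(statement)
```

Reindexing only hides the symptom if the underlying storage is still faulty, so the hardware checks from the other answer still apply.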

Related

PostgreSQL PANIC: WAL contains references to invalid pages

I have a problem with a PostgreSQL database running as a replica of the master database server. The database on the master runs without any problems, but the replica runs only for a few hours (a random amount of time) and then crashes with this error:
WARNING: page 3318889 of relation base/16389/19632 is uninitialized
...
PANIC: WAL contains references to invalid pages
Do you have any idea what is wrong? I have not been able to solve this problem for many days. Thanks.
There have been several Postgres bugs with these symptoms, and many of them have already been fixed. Please check whether your Postgres is on the latest minor release. If it is, report this issue to the mailing list https://www.postgresql.org/list/pgsql-hackers/.
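While waiting for a fix, it can help to know exactly which file the WARNING points at. The path in the message has the form base/<database OID>/<relfilenode>, and large relations are split into segment files. Here is a small Python sketch of that arithmetic, assuming the default 8 kB page size and 1 GB segment size (both are compile-time options, so treat them as assumptions); the function name is hypothetical.

```python
BLOCK_SIZE = 8192          # default PostgreSQL page size (assumed)
SEGMENT_SIZE = 1024 ** 3   # default 1 GB segment files (assumed)
PAGES_PER_SEGMENT = SEGMENT_SIZE // BLOCK_SIZE


def locate_page(relation_path, page_number):
    """Map a relation path and page number to a segment file and byte offset."""
    segment, page_in_segment = divmod(page_number, PAGES_PER_SEGMENT)
    # Segment 0 has no suffix; later segments are named <filenode>.<n>.
    if segment == 0:
        file_path = relation_path
    else:
        file_path = "{}.{}".format(relation_path, segment)
    return file_path, page_in_segment * BLOCK_SIZE


print(locate_page("base/16389/19632", 3318889))
```

Comparing that file between the master and the replica (size, modification time) may show whether the replica's copy is truncated.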

MongoDB WiredTiger error: WiredTiger.turtle: handle-open: open: operation not permitted

MongoDB was working beautifully for me for several months until I had an unexpected shutdown a week or two ago. Since then, I've been getting the error in the title that snowballs into an invalid argument, then a library panic, then some fatal assertions which cause MongoDB to crash.
Now, I've done my research: the normal answers are to run the repair function and to make sure SELinux isn't screwing up the process. Neither of those has worked. The error gets thrown during WiredTiger's checkpoint process, so reads/writes to the database aren't the issue, and because it happens during the checkpoint process, MongoDB is guaranteed not to stay up for more than a day.
To be clear: all the files in the database are owned by mongod:mongod, have permissions set to 600 (default, and I tried setting them to 755 to see if that fixed it, and it didn't). I'm running mongodb as a service on a CentOS 7 box, and the service file specifies that it should run as user mongod. The mongod.conf file specifies a mounted filesystem as the database, and it was happy with that until the unexpected shutdown. I'm running MongoDB version 4.0.1, so WiredTiger really doesn't like it if I disable Journaling either (disregarding the fact that I shouldn't disable it in the first place).
I feel like I've exhausted all my options, and that the only thing I can do is backup my data and reinstall MongoDB. Are there any that I've missed?
After creating a backup of my data via mongodump, shutting down mongo, removing the entire database with rm -rf 'path-to-database', rebooting mongo (without the replication config), and restoring the data with mongorestore, mongodb still crashes. This time, however, it's with an Invariant failure after the open: operation not permitted. The only conclusion I can think of is that the data itself has become corrupted in some way. Thankfully, this isn't "mission critical" data, so to speak, and I can easily obtain new data.
Unfortunately, this doesn't answer my original question of "what other options do I have?". However, I'm still posting this in case others run into this same kind of issue.
EDIT: invariant issue was caused by me forgetting to re-initialize my replication set. After fixing that, it's clean. Because of this, I no longer believe it was a data corruption issue, but a checkpoint corruption issue.
EDIT 2: So the issue arose again after about a week, and after another week of trying various debugging methods, I tried simply moving the mongo process to another server. So far, that's been working. The previous server was acting up (I couldn't even run top at one point, because another process had a lock on a library file needed to run it), so here's to hoping that the current server doesn't follow suit.

missing chunk number 0 for toast value 37946637 in pg_toast_2619

Main Issue:
Getting "ERROR: missing chunk number 0 for toast value 37946637 in pg_toast_2619" while selecting from tables.
Steps that led to the issue:
- Used pg_basebackup from a Primary db and tried to restore it onto a Dev host.
- Did a pg_resetxlog -f /${datadir} and started up the Dev db.
- After starting up the Dev db, when I query a varchar column, I keep getting:
psql> select text_col_name from big_table;
ERROR: missing chunk number 0 for toast value 37946637 in pg_toast_2619
This seems to be happening for most varchar columns in the restored db.
Has anyone else seen it?
Does anyone have ideas of why it happens and how to fix it?
pg_resetxlog is a last-resort utility that you should avoid if at all possible. The easiest way to make a fully working backup is to use pg_basebackup with the -X s option (that is an uppercase X). With this option, pg_basebackup opens two connections: one to copy all the data files and one to receive all of the WAL that is written for the duration of the backup. This way you cannot run into the problem that parts of the WAL you need have already been deleted.
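As an illustration, here is a small Python sketch that assembles such a pg_basebackup call. The function name is hypothetical, and the host, user and directory in the example are placeholders rather than values from the question. It uses "stream", the spelled-out form of the "s" value mentioned above.

```python
def basebackup_command(target_dir, host, user):
    """Build a pg_basebackup call that streams WAL alongside the data files."""
    return [
        "pg_basebackup",
        "-h", host,        # primary server to copy from
        "-U", user,        # role allowed to open replication connections
        "-D", target_dir,  # empty directory that receives the backup
        "-X", "stream",    # second connection that receives WAL during the backup
        "-P",              # show progress
    ]


# Placeholder values for illustration only.
print(" ".join(basebackup_command("/path/to/new/datadir", "primary.example.com", "replicator")))
```

The returned list can be passed to subprocess.run(cmd, check=True) on the Dev host. A backup taken this way should start up without pg_resetxlog.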
I tried a few things since my original question. I can confirm that the source of my error "ERROR: missing chunk number 0 for toast value 37946637 in pg_toast_2619" was doing a pg_resetxlog during the restore process.
I re-did the restore today, but this time I applied the pg_xlog files from the Primary using recovery.conf. The restored db now starts up fine and all queries are running as expected.

Mongo DB Invariant failure

Our DB of roughly 400 GB keeps stopping on one of our servers.
From the logs:
2015-07-07T09:09:51.072+0200 I STORAGE [conn10] _getOpenFile() invalid file index requested 8388701
2015-07-07T09:09:51.072+0200 I - [conn10] Invariant failure false src/mongo/db/storage/mmap_v1/mmap_v1_extent_manager.cpp 201
2015-07-07T09:09:51.082+0200 I CONTROL [conn10]
Any idea in what area I should start looking? Storage issue?
I am just answering this question in case some people make the same non-technical mistake again:
I tried to scp all the files in the /data/db directory to the server. As there are many files (dbname.1 to dbname.55, about 100GB), the transfer was interrupted in the middle (last successful file dbname.22), so I restarted it and uploaded dbname.23 to dbname.55. When I ran queries in the mongo client, they worked in some cases and failed in others with the same error message as in the question. I thought some file might have been broken in transfer, but the md5 checks were all fine. Only after spending a long time on all the md5 checks did I find the reason.
It turned out that scp uploads dbname.21 to dbname.29 right after dbname.2, so dbname.3 to dbname.9 were never uploaded to the server. I am going to upload them, and this should solve the problem.
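The ordering trap is easy to reproduce. The Python sketch below assumes the files were sent in plain alphabetical order (which is what a shell glob produces) and uses a hypothetical helper, missing_data_files, to list the numbered files that are absent between the lowest and highest suffix seen.

```python
import re


def missing_data_files(names, prefix):
    """Return numbered data files absent between the lowest and highest suffix seen."""
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)$")
    present = {int(m.group(1)) for m in map(pattern.match, names) if m}
    if not present:
        return []
    return [
        "{}.{}".format(prefix, i)
        for i in range(min(present), max(present) + 1)
        if i not in present
    ]


all_files = ["dbname.{}".format(i) for i in range(1, 56)]
# Alphabetical order puts dbname.10 to dbname.19 before dbname.2, and
# dbname.20 to dbname.29 before dbname.3, so a transfer interrupted
# after dbname.22 has sent only these files:
first_run = sorted(all_files)[:15]
# The manual second run covered dbname.23 through dbname.55.
second_run = ["dbname.{}".format(i) for i in range(23, 56)]
print(missing_data_files(first_run + second_run, "dbname"))
```

This prints dbname.3 through dbname.9, the same gap described above. Running a check like this on the target directory is much faster than comparing md5 sums, which cannot detect a file that was never copied.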
I ran into a variant of this today as well. Mysteriously one of my data files disappeared (or didn't make it in a migration from another server). None of the repair/recovery procedures would work, failing on the same error you reference. Luckily I have a separate mongod that has a collection with the same name, so as a cheap hack I copied the (admittedly wrong) data file to the other server, and while I knew I wouldn't get any data back, the repair tools (such as mongod --repair) were then able to work their magic, but as expected, they recovered some data from the bad file I copied in, so I had to weed out some docs. Luckily it was the "mycollection.1" file, which is only 128MB.
I don't think this applies in your case, since the index of the missing data file your log is talking about is ridiculously high. Your log is essentially saying it can't find /data/dbname/mycollection.8388701. You said your data set is only 400GB, so an index that high just doesn't make sense. You should have only roughly 200 data files, since most of them are 2GB each by default. What is the result of db.stats() (specifically the fileSize attribute)?
This mongolab blog entry helped me understand the data file structure.
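The back-of-the-envelope estimate can be written down as a short Python sketch. The function names and the slack factor are my own, for illustration; the only inputs taken from this thread are the 400GB data size, the 2GB default file size and the requested index 8388701.

```python
import math

MAX_FILE_GB = 2  # default maximum data file size mentioned above


def estimated_file_count(data_size_gb):
    """Rough estimate that treats every data file as a full 2GB file."""
    return math.ceil(data_size_gb / MAX_FILE_GB)


def index_is_plausible(file_index, data_size_gb, slack=2):
    # slack is an arbitrary allowance for the smaller files at the start.
    return file_index < slack * estimated_file_count(data_size_gb)


print(estimated_file_count(400))         # 200
print(index_is_plausible(8388701, 400))  # False
```

An index several orders of magnitude above the estimate suggests corrupted extent metadata rather than a file that is genuinely missing from disk.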
My advice for where you should start looking:
Run the db.stats() command to get an idea of how big your data on disk actually is.
Does it make sense for your server to be looking for a data file with a crazy high index? If not, the issue isn't really with storage, but with the extents and the metadata of your collection/database.
Do your repair tools work? If you have at least as much free disk space as the size of your data set (on disk), try mongod --repair or db.repairDatabase() to start a repair. I'm assuming it won't work, since my repair attempts crashed with the same invalid file index requested error.
Try copying a "bad" file like I did that roughly matches what the missing file would look like (keeping in mind how the file sizes of the data files aren't all the same, do your best to match it up and try a repair). If this works, your data files will be cleaned up (but it does take a lot of disk space).
Hope that helps point you in the right direction.
In my case this happened in a development setting with MongoDB 3.6.20 on macOS 10.14.6. Another program restarted the Mac and closed any open terminals, including the terminal that ran the mongod process. After the OS restart, I could not restart mongod because of the Invariant failure. The error also mentioned a bad lockfile.
I was able to solve the issue with the following steps, yet I am not exactly sure which did the job:
remove corrupted lock file: rm -rf data/db/mongod.lock
direct outcome: mongod still failed due to Invariant failure but at least no mention about the lockfile anymore.
run mongod --repair
direct outcome: repair still failed due to Invariant failure. Error output mentions SocketException: Address already in use.
restart the machine again to free the socket.
direct outcome: mongod starts and runs without problems. Yay.
The first successful mongod run after the issue gave the following output:
[ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost.
Thus, it runs smoothly again. Maybe I was fortunate. I hope the same approach helps some of you.

Error in mongodb: "getFile(): bad file number value (corrupt db?): run repair"

After my last Meteor upgrade my database became corrupted. First it started with this error message when I tried to create a new user (we're using meteor-accounts):
getFile(): bad file number value (corrupt db?): run repair
Then I saw in another question that I should run db.repairDatabase() but, although mongo shell said that the database was now ok, it didn't really work. The error message above was still showing up.
So I read something about corrupted indexes and dropped the indexes in the users collections and this obviously just made everything worse. Now I have two users with the same email address and Meteor doesn't start anymore:
MongoError: E11000 duplicate key error index: meteor.users.$emails.address_1 dup key: { : "thiago#gdeahj.com" }
When I try to remove one of these users, the original error shows up again:
meteor:PRIMARY> db.users.remove({ _id: "cAtu2XsEXTbqL2Wvx"})
getFile(): bad file number value (corrupt db?): run repair
Fortunately we're still in the development phase and we can just drop the whole database and start over, but this has made me really insecure about running Meteor in a production environment. Is there any way to fix a database in this state?
You can run db.repairDatabase to try to repair the data files - but read the linked page first for details and warnings. Make sure you run with journaling on if you didn't have it on before and, at least for production, run a replica set. Normally, in this situation it'd be preferable to resync from another replica set member or restore a backup rather than repair. You can find more information about data recovery in this article from the MongoDB Manual.