pg_dump is failing with the error message:
"pg_dump FATAL: segment too big"
What does that mean?
PostgreSQL 10.4 on Ubuntu 16.04.
It appears that pg_dump passes the error messages it receives from the queries it runs into the logs.
The line that follows in the logs (possibly buried deeper if your logs are busy) shows the query that failed.
In this case, we had a corrupted sequence. Any query on the sequence, whether interactive, via a column default, or via pg_dump, returned the "segment too big" error and killed the querying process.
I figured out the new start value for the sequence, dropped the dependencies, and created a new sequence starting where the old one left off and then put the dependencies back.
pg_dump worked fine after that.
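For anyone repeating this, the steps above can be sketched as a small Python helper that only builds the SQL statements; it does not connect to a database. The table, column, and sequence names and the start value in the usage note are hypothetical, and the post does not say how the start value was determined, so it is taken here as a given input.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Quote a SQL identifier by doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sequence_swap_sql(table: str, column: str, new_seq: str, start: int) -> List[str]:
    """Build statements that detach a column default from a broken sequence
    and attach a freshly created one (assumed names, illustration only)."""
    t, c = quote_ident(table), quote_ident(column)
    new = quote_ident(new_seq)
    # nextval() takes a regclass; pass the quoted identifier as a string literal.
    new_literal = "'" + new.replace("'", "''") + "'"
    return [
        # 1. Drop the dependency on the old sequence.
        f"ALTER TABLE {t} ALTER COLUMN {c} DROP DEFAULT;",
        # 2. Create the replacement, starting where the old one left off.
        f"CREATE SEQUENCE {new} START WITH {int(start)};",
        # 3. Put the dependencies back.
        f"ALTER TABLE {t} ALTER COLUMN {c} SET DEFAULT nextval({new_literal}::regclass);",
        f"ALTER SEQUENCE {new} OWNED BY {t}.{c};",
    ]
```

Calling sequence_swap_sql("orders", "id", "orders_id_seq_new", 1001) returns four statements to review and run by hand, ideally inside a single transaction. Removing the old, corrupted sequence afterwards is left out on purpose, since the post does not describe that step.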
It is not clear why or how a sequence could get so corrupted that you would have a session killing error when it was accessed. We did have a recent database hard-crash though, so it may be related. (Although that sequence is accessed very rarely and it is unlikely we went down in the middle of incrementing it.)
Related
I have Postgres installed on a CentOS machine, and another application uses it to save data.
For some time now, and I can't find the reason, all the database tables have been ending up empty on weekends.
I have searched a lot for clues about the cause of this behaviour, but the logs are not giving me that information.
I am pretty sure the application is not executing anything to delete the records; my suspicion is that some process on the Postgres side is doing it for some reason.
The pg_log only shows this warning the day it happens:
LOG: checkpoints are occurring too frequently (11 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
Apart from that I have no other clues.
Running VACUUM ANALYZE VERBOSE reports that there is no dead data, so it has nothing to remove.
Can you tell me what I should look at to find the cause? Could some Postgres process be doing this?
LOG: checkpoints are occurring too frequently (11 seconds apart)
This log message should also include all the information log_line_prefix tells it to include. So you should set log_line_prefix to include more information, like application name (voluntarily supplied by the client), database username, and host name/IP from which the connection came.
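As a rough illustration of that suggestion, here is a small Python sketch that assembles a log_line_prefix setting from the escapes PostgreSQL documents for it (%t timestamp, %p process ID, %a application name, %u user name, %d database name, %h remote host). The field labels and the helper names are made up for this example, and it assumes a server version that supports %a.

```python
# Escapes documented for PostgreSQL's log_line_prefix setting.
PREFIX_ESCAPES = {
    "timestamp": "%t",
    "pid": "%p",
    "application": "%a",
    "user": "%u",
    "database": "%d",
    "remote_host": "%h",
}


def build_log_line_prefix(fields):
    """Return a prefix such as 'user=%u database=%d ' for the given fields."""
    parts = [f"{name}={PREFIX_ESCAPES[name]}" for name in fields]
    # A trailing space keeps the prefix separate from the log message itself.
    return " ".join(parts) + " "


def conf_line(fields):
    """Return a line suitable for pasting into postgresql.conf."""
    return f"log_line_prefix = '{build_log_line_prefix(fields)}'"


if __name__ == "__main__":
    print(conf_line(["timestamp", "pid", "application", "user", "database", "remote_host"]))
```

After changing the setting, reload the server configuration so that new log lines carry the extra fields.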
But perhaps more directly at issue, if things are connecting to your database and doing things you don't understand or approve of, it is time to change your passwords.
MongoDB was working beautifully for me for several months until I had an unexpected shutdown a week or two ago. Since then, I've been getting the error in the title that snowballs into an invalid argument, then a library panic, then some fatal assertions which cause MongoDB to crash.
Now, I've done my research: the normal answers are to run the repair function and to make sure SELinux isn't screwing up the process. Neither of those has worked. The error gets thrown during WiredTiger's checkpoint process, so reads/writes to the database aren't the issue, and because it happens during the checkpoint process, it guarantees that MongoDB won't stay up for more than a day.
To be clear: all the files in the database are owned by mongod:mongod, have permissions set to 600 (default, and I tried setting them to 755 to see if that fixed it, and it didn't). I'm running mongodb as a service on a CentOS 7 box, and the service file specifies that it should run as user mongod. The mongod.conf file specifies a mounted filesystem as the database, and it was happy with that until the unexpected shutdown. I'm running MongoDB version 4.0.1, so WiredTiger really doesn't like it if I disable Journaling either (disregarding the fact that I shouldn't disable it in the first place).
I feel like I've exhausted all my options, and that the only thing I can do is back up my data and reinstall MongoDB. Are there any that I've missed?
After creating a backup of my data via mongodump, shutting down mongo, removing the entire database with rm -rf 'path-to-database', rebooting mongo (without the replication config), and restoring the data with mongorestore, mongodb still crashes. This time, however, it's with an Invariant failure after the open: operation not permitted. The only conclusion I can think of is that the data itself has become corrupted in some way. Thankfully, this isn't "mission critical" data, so to speak, and I can easily obtain new data.
Unfortunately, this doesn't answer my original question of "what other options do I have?". However, I'm still posting this in case others run into this same kind of issue.
EDIT: invariant issue was caused by me forgetting to re-initialize my replication set. After fixing that, it's clean. Because of this, I no longer believe it was a data corruption issue, but a checkpoint corruption issue.
EDIT 2: So the issue arose again after about a week, and after another week of trying various debugging methods, I tried simply moving the mongo process to another server. So far, that's been working. The previous server was acting up (I couldn't even run top at one point, because another process had a lock on a library file it needed), so here's hoping that the current server doesn't follow suit.
I was debugging a PostgreSQL 9.2 database corruption issue (on Solaris, but I doubt it matters) recently, and I found that we could reproduce it reliably if the client died in the middle of a transaction and then I shut down PostgreSQL by doing pkill postgres (which basically sends SIGTERM to every running postgres process). If instead we did pkill -QUIT postgres to send SIGQUIT, the database would shut down cleanly and no corruption would occur.
Based on the PostgreSQL 9.2 docs, I think that SIGTERM should be 100% expected by the database server, so why is it not safe to shut down like this? Is it a bug in PostgreSQL, or could I be doing something (configuration, etc.) that would allow the corruption to occur?
I don't think SIGTERM is what is causing your problem. Again, I recommend you ask on dba.stackexchange instead.
If the client dies in the middle of a transaction, is the problem that the network connection hangs? And then when you kill it, you get corruption during WAL replay?
This is a complicated area to troubleshoot but here are some places to begin:
What is going on concurrently when this happens? What sort of transaction commit load?
How often do WAL logs normally get rotated?
It is possible you are running into a rare, obscure bug in PostgreSQL (possibly somewhere between the database, kernel, and filesystem), but if so, please start by upgrading to the latest 9.2 release and trying to reproduce it again. TERM and even KILL signals are supposed to be 100% safe on PostgreSQL, so if you are seeing database corruption, that is not expected.
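When reproducing, it may help to signal only the main server process rather than every process named postgres. The PostgreSQL documentation describes three shutdown modes chosen by the signal the main process receives: SIGTERM for smart, SIGINT for fast, and SIGQUIT for immediate. The sketch below is only an aid for experiments, not a diagnosis: it reads the main process ID from the first line of postmaster.pid in the data directory and does not send any signal itself. The data directory path in the usage note is an assumption.

```python
import os

# Shutdown mode selected by each signal when sent to the main server process,
# as described in the PostgreSQL server shutdown documentation.
POSTMASTER_SHUTDOWN_MODES = {
    "SIGTERM": "smart",
    "SIGINT": "fast",
    "SIGQUIT": "immediate",
}


def postmaster_pid(data_dir):
    """Return the main server PID stored on the first line of postmaster.pid."""
    with open(os.path.join(data_dir, "postmaster.pid")) as f:
        return int(f.readline().strip())
```

For example, postmaster_pid("/var/lib/pgsql/data") gives the PID to pass to kill -TERM, so that the backends are shut down by the server itself instead of being signalled one by one.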
After my last Meteor upgrade my database became corrupted. First it started with this error message when I tried to create a new user (we're using meteor-accounts):
getFile(): bad file number value (corrupt db?): run repair
Then I saw in another question that I should run db.repairDatabase() but, although mongo shell said that the database was now ok, it didn't really work. The error message above was still showing up.
So I read something about corrupted indexes and dropped the indexes in the users collections and this obviously just made everything worse. Now I have two users with the same email address and Meteor doesn't start anymore:
MongoError: E11000 duplicate key error index: meteor.users.$emails.address_1 dup key: { : "thiago#gdeahj.com" }
When I try to remove one of these users, the original error shows up again:
meteor:PRIMARY> db.users.remove({ _id: "cAtu2XsEXTbqL2Wvx"})
getFile(): bad file number value (corrupt db?): run repair
Fortunately we're still in the development phase and can just drop the whole database and start over, but this has made me really insecure about running Meteor in a production environment. Is there any way to fix a database in this state?
You can run db.repairDatabase to try to repair the data files - but read the linked page first for details and warnings. Make sure you run with journaling on if you didn't have it on before and, at least for production, run a replica set. Normally, in this situation it'd be preferable to resync from another replica set member or restore a backup rather than repair. You can find more information about data recovery in this article from the MongoDB Manual.
My application is using libpq to write data to Postgres using the COPY API. After over 900000 successful COPY+commit (each containing a single row, don't ask) actions, one errored out with the following:
ERROR: canceling statement due to user request
CONTEXT: COPY [...]
My code never calls PQcancel or related friends, which I think is precluded anyway by the fact that libpq is being used synchronously and my app is not multi-threaded.
libpq v8.3.0
Postgres v9.2.4
Is there any reasonable explanation for what might have caused the COPY to be cancelled? Will upgrading libpq (as I have done in more recent versions of my application) be expected to improve the situation?
The customer reports that the Postgres server may have been shut down when this error was reported, but I'm not convinced since the error text is pretty specific.
That error will be emitted when you:
send a PQcancel
use pg_cancel_backend
hit control-C in psql (which invokes PQcancel)
send SIGINT to a backend, e.g. kill -INT or kill -2.
My initial answer was incorrect in claiming that the following also produce the same error. They don't; these:
pg_terminate_backend
pg_ctl shutdown -m fast
will emit a different error: FATAL: terminating connection due to administrator command.
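To tell the two cases apart in application logs, a client could match on the message text. This is a minimal Python sketch assuming the server uses the default English messages quoted above; the function name and return labels are invented for the example.

```python
CANCEL_MSG = "canceling statement due to user request"
TERMINATE_MSG = "terminating connection due to administrator command"


def classify_pg_error(message):
    """Classify a server error message as 'cancel', 'terminate', or 'other'."""
    if CANCEL_MSG in message:
        # Query cancel: PQcancel, pg_cancel_backend, or SIGINT to the backend.
        return "cancel"
    if TERMINATE_MSG in message:
        # Backend terminated: pg_terminate_backend or a fast shutdown.
        return "terminate"
    return "other"
```

A "cancel" result means only the statement was aborted, while "terminate" means the connection itself is gone and the client has to reconnect.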