Error "MultiXactId **** has not been created yet -- apparent wraparound" on PG11 - postgresql

I face an annoying problem on a Postgres 11.13 database when trying to get data from a big table.
The first 6 millions rows can be fetched, then I get a "MultiXactId **** has not been created yet -- apparent wraparound" message.
I've already tested, on this table :
various "select ..." queries (even in functions with exception management to ignore possible errors)
pg_dump
REINDEX TABLE
VACUUM FULL, with and without the zero_damaged_pages enabled
VACUUM FREEZE, with and without the zero_damaged_pages enabled
Nothing to do: I get every time that "MultiXactId **** has not been created yet -- apparent wraparound" error.
Is there a solution to fix that kind of problem, or is that "broken/corrupted" table definitively lost ?
Thanks in advance for any advice

Related

Postgresql is having off behavior after disk was run out

Background of the issue (could be irrelevent, but only relation to these issues makes sense for me)
In our production environment, disk space had run out. (We do have monitoring and notifications for this, but no-one read them - the classical)
Anyway, after fixing the issue Postgresql PostgreSQL 9.4.17 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit has shown couple of weird behaviors.
1. Unique indexes
I have couple of (multi column) unique indexes specified for the database, but they do not appear to be functioning. However I can find duplicate rows from the database.
2. Sorting based on date
We got one table which is basically just logging some json data. We got three columns: id, json and insertedAt DEFAULT NOW(). If I do simple query where I try to sort based on the insertedAt column, the sorting doesn't work around the area of disk overflow. All of the data is valid and readable, but order is invalid
3. Db dumps / backups are having some corruption.
Again, when I was browsing this logging data and tried to recover a backup to my local machine for better observation it gave an error around some random row. When I examined the sql-file with text editor, I encountered that data was otherwise valid expect that it was missing some semicolons on some rows. I think I'll give a try shortly for the never backup if it's still having the same error or if it was random issue with the one backup I tried playing with.
I've tried the basic ones: restarting the machine and PG process.

Deal with Postgresql Error -canceling statement due to conflict with recovery- in psycopg2

I'm creating a reporting engine that makes a couple of long queries over a standby server and process the result with pandas. Everything works fine but sometimes I have some issues with the execution of those queries using a psycopg2 cursor: the query is cancelled with the following message:
ERROR: cancelling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed
I was investigating this issue
PostgreSQL ERROR: canceling statement due to conflict with recovery
https://www.postgresql.org/docs/9.0/static/hot-standby.html#HOT-STANDBY-CONFLICT
but all solutions suggest fixing the issue making modifications to the server's configuration. I can't make those modifications (We won the last football game against IT guys :) ) so I want to know how can I deal with this situation from the perspective of a developer. Can I resolve this issue using python code? My temporary solution is simple: catch the exception and retry all the failed queries. Maybe could be done better (I hope so).
Thanks in advance
There is nothing you can do to avoid that error without changing the PostgreSQL configuration (from PostgreSQL 9.1 on, you could e.g. set hot_standby_feedback to on).
You are dealing with the error in the correct fashion – simply retry the failed transaction.
The table data on the hot standby slave server is modified while a long running query is running. A solution (PostgreSQL 9.1+) to make sure the table data is not modified is to suspend the replication on the slave and resume after the query.
select pg_xlog_replay_pause(); -- suspend
select * from foo; -- your query
select pg_xlog_replay_resume(); --resume
I recently encountered a similar error and was also in the position of not being a dba/devops person with access to the underlying database settings.
My solution was to reduce the time of the query where ever possible. Obviously this requires deep knowledge of your tables and data, but I was able to solve my problem with a combination of a more efficient WHERE filter, a GROUPBY aggregation, and more extensive use of indexes.
By reducing the amount of server side execute time and data, you reduce the chance of a rollback error occurring.
However, a rollback can still occur during your shortened window, so a comprehensive solution would also make use of some retry logic for when a rollback error occurs.
Update: A colleague implemented said retry logic as well as batching the query to make the data volumes smaller. These three solutions have made the problem go away entirely.
I got the same error. What you CAN do (if the query is simple enough), is deviding the data into smaller chunks as a workaround.
I did this within a python loop to call the query multiple times with the LIMIT and OFFSET parameter like:
query_chunk = f"""
SELECT *
FROM {database}.{datatable}
LIMIT {chunk_size} OFFSET {i_chunk * chunk_size}
"""
where database and datatable are the names of your sources..
The chunk_size is individually and to set this to a not too high value is crucial for the query to finish.

Redshift Drop Table Stuck

I have a cronjob that is kicked off every night that involves building up a temporary table, dropping the current table on Redshift, and swapping in the temporary table for the old one. More than half of the time, this specific job gets stuck when dropping the existing table and behaving as if there is some pending transaction that is stopping the drop from going through.
This is just one of dozens of jobs that uses the exact same script to run overnight, none of which have ever had this issue; however, there are a few minor differences:
The box that this particular job is running on is a different box from all of the other production jobs, as this one is currently in a testing state.
The S3 key used on this box is different from the other box.
In addition to the fact that I have never seen this on any other job, this issue has been extremely difficult to troubleshoot for the following reasons:
I have not been able to replicate this issue by running the script manually on the same box it is currently being run on; the script executes as expected, with the table drop occurring in mere seconds. The only difference I can think of here is that I'm executing the script as ubuntu whereas the cronjob is executed from root.
I have not had any success identifying or terminating the sessions that are causing the drop to stall; I've looked high and low on Stack Overflow (this is the most applicable question with answers - redshift drop or truncate table very very slow), the Redshift docs, and otherwise, but nothing I've found has been the answer. When I see that the job is stalled, I've checked the following tables on Redshift and usually find that things are in the following state:
The temporary table has been created, but the old version of the destination table is still there.
The stv_locks table shows that that there are three processes running, with the lock_status of "Holding write lock," "Holding delete lock," and "Holding insert lock" respectively. The process ID associated with these is NOT the ID related to the current job.
The stv_tr_conflict table shows nothing.
The stv_recents table shows the drop with a status of Running.
The query that's supposedly creating the lock described above shows up in the svl_qlog as finished, so that seems to contradict the stv_locks table.
Using pg_terminate_backend to stop the associated process does not actually remove the session when querying stv_sessions, but DOES free up something that allows the job to finish.
Any help in figuring out what exactly is going on here would be greatly appreciated!
I faced the same problem, I just rebot RS then it works again normally.

File loading issues in DB2 using Load utility

I have a .csv file, comma-delimited (located at C:/). I am using the DB2 LOAD utility to load data present in the CSV file in a DB2 table.
LOAD CLIENT FROM C:\Users\somepath\FileName.csv of del
MODIFIED BY NOCHARDEL COLDEL, insert into SchemaName.TABLE_NAME;
CSV file has 25 rows. After the utility completed I got an error message for NOCHARDEL. My table has all 25 rows properly loaded. Now when I try to execute an insert/update/delete statement on any of the tables present in that schema I am getting following error.
Lookup Error - DB2 Database Error: ERROR [55039] [IBM][DB2/AIX64] SQL0290N Table space access is not allowed.
Could you please help me whether I am making any mistake or missing a parameter that is causing lock on the table.
Earlier while loading the file similar situation occurred, where DBA confirmed that Table space in question is in “load in progress” state
Changes generated by the DB2 LOAD utility are not logged (one of the side-effects of its high performance). If the database crashes immediately after the load it will be impossible to recover the table that was loaded by replaying log records, because there are no such records. For this reason the tablespace containing the loaded table is automatically placed in the BACKUP PENDING mode, forcing you to take a backup of that tablespace or the entire database to ensure it is fully recoverable.
There are options that you can specify for the LOAD command that can help you avoid this situation in the future:
NONRECOVERABLE -- this option does not place the tablespace into the BACKUP PENDING mode, but, as its name implies, the table you're loading to becomes non-recoverable in case of a crash, and your only option in that situation will be to drop and re-create the table.
COPY YES -- this option creates a copy of the table prior to loading, which can be used to recover the table to its pre-LOAD state in case of a crash.
If you are only loading 25 records, I suggest you use the IMPORT utility instead -- it does not have these restrictions because it is fully logged (at the price of lower performance, which for 25 records won't matter).
Thanks #mustaccio. I had 60 Million rows to insert. I was using 25 as sample to check the outcome.
To add another point, we later came to know that this is a known DB2 bug that keeps the load in progress state (DB2 is unable to acknowledge that the load has completed and the session remains open indefinitely) and place the table space in backup pending state.
Recovery is the only option to release the table space once it is in pending state.
This issue is fixed in fix pack 10 as per the DB2 team (we are yet to deploy and test). Mean while NONRECOVERABLE key word is working fine for us
The reason why your table is stuck in the LOAD IN PROGRESS state is the NOCHARDEL error happening at the end of the LOAD.
Have you tried restarting the database? This should reinitialize all table spaces and remove any rogue states.
http://www-01.ibm.com/support/docview.wss?uid=swg1IC65395
http://www-01.ibm.com/support/docview.wss?uid=swg21427102

Repair Corrupt database postgresql

I have multiple errors with my postgresql db, which resulted after a power surge:
I cannot access most tables from my database. When I try for example select * from ac_cash_collection, I get the foolowing error:
ERROR: missing chunk number 0 for toast value 118486855 in pg_toast_2619
when I try pg_dump I get the following error:
Error message from server: ERROR: relation "public.st_stock_item_newlist" does not exist
pg_dump: The command was: LOCK TABLE public.st_stock_item_newlist IN ACCESS SHARE MODE
I went ahead and tried to run reindex of the whole database, I actually I left it runnng, went to sleep, and I found it had not done anything in the morning, so I had to cancel it.
I need some help to fix this as soon as possible, Please help.
Before you do anything else, http://wiki.postgresql.org/wiki/Corruption and act on the instructions. Failure to do so risks making the problem worse.
There are two configuration parameters listed in the Fine Manual that might be of use: ignore_system_indexes and zero_damaged_pages. I have never used them, but I would if I were desparate ...
I don't know if they help against toast-tables. In any case, if setting them causes your database(s) to become usable again, I would {backup + drop + restore} to get all tables and catalogs into newborn shape again. Success!
If you have backups, just restore from them.
If not - you've just learned why you need regular backups. There's nothing PostgreSQL can do if hardware misbehaves.
In addition, if you ever find yourself in this situation again, first stop PostgreSQL and take a complete file-level backup of everything - all tablespaces, WAL etc. That way you have a known starting point.
So - if you still want to recover some data.
Try dumping individual tables. Get what you can this way.
Drop indexes if they cause problems
Dump sections of tables (id=0..9999, 1000..19999 etc) - that way you can identify where some rows may be corrupted and dump ever-smaller sections to recover what's still good.
Try dumping just certain columns - large text values are stored out-of-line (in toast tables) so avoiding them might get the rest of your data out.
If you've got corrupted system tables then you're getting into a lot of work.
That's a lot of work, and then you'll need to go through and audit what you've recovered and try to figure out what's missing/incorrect.
There are more things you can do (creating empty blocks in some cases can let you dump partial data) but they're all more complicated and fiddly and unless the data is particularly valuable not worth the effort.
Key message to take away from this - make sure you take regular backups, and make sure they work.
Before you do ANYTHING ELSE, take a complete file-system-level copy of the damaged database.
http://wiki.postgresql.org/wiki/Corruption
Failure to do so destroys evidence about what caused the corruption, and means that if your repair efforts go badly and make things worse you can't undo them.
Copy it now!
If few/specific files are corrupted, following tricks might help.
Restore Older dump in different node or a second installation.
Copy required files from RESTORED/second installation to FAILED node.
Stop & Start PSQL.
From Today's experience!
Error message from server: ERROR: could not read block 226448 in file "base/12345/12345.1": Input/output error
try (probably it will fail)
cp base/12345/12345.1 /root/backup/12345.1-orig
try mv base/12345/12345.1 /root/backup/12345.1-orig #expecting this to finish. Else do rm -rf base/12345/12345.1 /root/backup/12345.1-orig
Finally,
Magic of tar. (if below tar completes, you have luck!)
tar -zcvf my_backup.tar.gz /var/lib/postgresql/xx/main/xx
Extract the corrupted file from TAR.
Replace it in original locationbase/12345/12345.1.
Stop & Start PGSQL
IMPORTANT: Please try googling and do try vaccum, reindex and disk checks like fsck etc before getting to this stage.
Also, always take a Filesystem backup before doing any TRIAL and ERROR method :)