Crashed Postgres database in Artifactory - postgresql

We have Artifactory (5.6.2) running on Docker with a Postgres (9.6.6-alpine) container, also on Docker.
We realized that Artifactory was performing very badly, so I looked at the containers and saw that Postgres was taking all the CPU it could get.
So I tried to restart Postgres, but it failed:
23.2.2018 11:28:12 PANIC: could not locate a valid checkpoint record
23.2.2018 11:28:12 LOG: startup process (PID 20) was terminated by signal 6
23.2.2018 11:28:12 LOG: aborting startup due to startup process failure
23.2.2018 11:28:12 LOG: database system is shut down
I then restored the whole DB folder from backup and tried to restart the DB again. The Postgres DB came up, but when I started Artifactory it got stuck waiting at this point:
23.2.2018 15:03:39 2018-02-23 15:03:39,537 [localhost-startStop-1] [JFrog-Access] [INFO ] (o.j.a.s.AccessServerBootstrapImpl:91) - [ACCESS BOOTSTRAP] Starting JFrog Access bootstrap...
23.2.2018 15:03:39 2018-02-23 15:03:39,576 [localhost-startStop-1] [JFrog-Access] [INFO ] (o.j.a.s.AccessServerBootstrapImpl:164) - [ACCESS BOOTSTRAP] Updating server ....
So Artifactory communicates with the DB, and the DB again eats up all the CPU.
Is this normal? This has been running for an hour or so; can somebody tell me whether it ever finishes successfully?
Do I have any other options besides waiting, or does someone have a tip for how I can get my Artifactory up and running again?
Any help is welcome.
Thanks.
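In case it helps with diagnosis, one way to see what the database is busy with while it burns CPU is to query pg_stat_activity inside the container. A minimal sketch, assuming the container is named postgresql and the DB user and database are both artifactory (all three names are placeholders for your own setup):
docker exec -it postgresql psql -U artifactory -d artifactory -c "SELECT pid, state, now() - query_start AS runtime, left(query, 80) AS query FROM pg_stat_activity ORDER BY runtime DESC NULLS LAST;"
Long-running rows here would show whether the Artifactory bootstrap is actually making progress or is stuck on a single statement.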

Related

Unable to start Keycloak 19 in production mode

We have Keycloak 14 connected to an Amazon RDS Aurora PostgreSQL database and are now trying to update to Keycloak 19 with the same DB, but it is failing with the error below:
2022-09-20 09:54:56,959 ERROR [org.keycloak.quarkus.runtime.cli.ExecutionExceptionHandler] (main) ERROR: Failed to start server in (production) mode
2022-09-20 09:54:56,960 ERROR [org.keycloak.quarkus.runtime.cli.ExecutionExceptionHandler] (main) ERROR: ISPN000324: Cache 'realms' is in 'STOPPING' state and this is an invocation not belonging to an on-going transaction, so it does not accept new invocations. Either restart it or recreate the cache container.
Any help would be appreciated.

Postgres in recovery mode after failed delete queries from partitioned table (PG 12)

I have code that used to work on a simple table and stopped working when the same table was partitioned into many sub-partitions.
In a distributed application (Spark) we have code that performs batch delete queries in parallel from different computers at the same time (deleting different records).
Most of the queries work, but then one of them fails on what seems to be a socket connection timeout:
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM my_table WHERE vessel_id='xxxxxx' AND day='2020-09-15 00:00:00+00'::timestamp was aborted: An I/O error occurred while sending to the backend. Call getNextException to see other errors in the batch.
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
When the code retries the task, the connection fails with:
FATAL: the database system is in recovery mode
In the database log I see:
2020-09-21 16:44:27 UTC::#:[26848]:DETAIL: Failed process was running: DELETE FROM my_table WHERE vessel_id=$1 AND day=$2
2020-09-21 16:44:27 UTC::#:[26848]:LOG: terminating any other active server processes
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC::#:[22480]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC::#:[22480]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC::#:[22480]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:127.0.0.1(31826):rdsadmin#rdsadmin:[27967]:FATAL: the database system is in recovery mode
Any ideas why the database fails when the table is partitioned?
Why are all the other connections on the other computers closed, and why does the database go into recovery mode?
After looking at the logs I found that the problem was out-of-memory.
This database instance is the main instance; it does the writing, replicating, and deleting, and it didn't have enough memory to handle all of these tasks at the same time.
The fix was simply to add more memory.
Nothing fancy.
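For anyone debugging something similar: the out-of-memory evidence usually shows up in the PostgreSQL log as backends killed by signal 9. A rough sketch of pulling and searching the RDS log with the AWS CLI (the instance identifier and log file name are placeholders for your own):
aws rds download-db-log-file-portion --db-instance-identifier mydb --log-file-name error/postgresql.log.2020-09-21-16 --output text | grep -iE "out of memory|terminated by signal 9"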

iRODS configuration - Could not start iRODS server during setup

I've installed Postgres as the database and then iRODS on Ubuntu 14.04. Then I started its configuration:
sudo /var/lib/irods/packaging/setup_irods.sh
After the configuration phase, when iRODS starts updating, the first 4 steps go well:
Stopping iRODS server...
-----------------------------
Running irods_setup.pl...
Step 1 of 4: Configuring database user...
Updating user's .pgpass...
Skipped. File already uptodate.
Step 2 of 4: Creating database and tables...
Checking whether iCAT database exists...
[mydb] on [localhost] found.
Updating user's .odbc.ini...
Creating iCAT tables...
Skipped. Tables already created.
Testing database communications...
Step 3 of 4: Configuring iRODS server...
Updating /etc/irods/server_config.json...
Updating /etc/irods/database_config.json...
Step 4 of 4: Configuring iRODS user and starting server...
Updating iRODS user's ~/.irods/irods_environment.json...
Starting iRODS server...
but at the end I get this error:
Could not start iRODS server.
Starting iRODS server...
Traceback (most recent call last):
File "/var/lib/irods/iRODS/scripts/python/get_db_schema_version.py", line 77, in <module>
current_schema_version = get_current_schema_version(cfg)
File "/var/lib/irods/iRODS/scripts/python/get_db_schema_version.py", line 61, in get_current_schema_version
'get_current_schema_version: failed to find result line for schema_version\n\n{}'.format(format_cmd_result(result)))
RuntimeError: get_current_schema_version: failed to find result line for schema_version
return code: [0]
stdout:
stderr:
ERROR: relation "r_grid_configuration" does not exist
LINE 1: ...option_value from R_GRID_CON...
^
Confirming catalog_schema_version... Success
Validating [/var/lib/irods/.irods/irods_environment.json]... Success
Validating [/etc/irods/server_config.json]... Success
Validating [/etc/irods/hosts_config.json]... Success
Validating [/etc/irods/host_access_control_config.json]... Success
Validating [/etc/irods/database_config.json]... Success
(1) Waiting for process bound to port 5432 ... [-]
(2) Waiting for process bound to port 5432 ... [-]
(4) Waiting for process bound to port 5432 ... [-]
Port 5432 In Use ... Not Starting iRODS Server
Install problem:
Cannot start iRODS server.
Found 0 processes:
There are no iRODS servers running.
Abort.
Do you have any ideas about what went wrong?
Because I don't have enough reputation to comment:
Which version of iRODS are you using?
This portion of the output:
Creating iCAT tables...
Skipped. Tables already created.
combined with this portion:
ERROR: relation "r_grid_configuration" does not exist
suggests that the setup ran before, but only partially completed, leaving the system in a broken state. I would recommend reinstalling from scratch, which includes:
Uninstalling the iRODS icat and db plugin packages:
sudo dpkg -P irods-icat irods-database-plugin-postgres
Note: make sure to use -P (purge), so that the configuration files are removed from dpkg's database.
Dropping and remaking the database (see the sketch after this list)
Deleting the following directories:
sudo rm -rf /tmp/irods /etc/irods /var/lib/irods
Reinstalling the packages and running sudo /var/lib/irods/packaging/setup_irods.sh
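For the "drop and remake the database" step, a minimal sketch, assuming the iCAT database is the [mydb] shown in your output and is owned by a Postgres role named irods (adjust both names to your setup):
sudo -u postgres dropdb mydb
sudo -u postgres createdb -O irods mydb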
This portion of the output:
(1) Waiting for process bound to port 5432 ... [-]
(2) Waiting for process bound to port 5432 ... [-]
(4) Waiting for process bound to port 5432 ... [-]
Port 5432 In Use ... Not Starting iRODS Server
suggests that you are using port 5432 as your iRODS server port. This will conflict with the default Postgres port. I recommend using the default iRODS server port of 1247. This value was queried during setup as:
iRODS server's port [1247]:
and is recorded in /etc/irods/server_config.json under the zone_port entry.
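A quick way to double-check the current value is to look for the zone_port entry mentioned above:
grep zone_port /etc/irods/server_config.json
If it prints 5432, change it to 1247 (or another free port) and rerun the setup.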
iRODS-Chat:
It may be easier to continue this on the iRODS-Chat Google group. Repairing installs can require back-and-forth communication, which may not be in line with standard Stack Overflow usage.

PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication is initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location with rsync.
Everything works as described in the documentation as long as the restore_command script fails with an exit code other than 255:
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
But when the restore_command script fails with exit code 255 (because the exit code of a failed rsync call is returned by the script), the server process dies with the following error:
2012-05-09 23:21:30 CEST - # LOG: database system was interrupted; last known up at 2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - # LOG: entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - # FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - # LOG: startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - # LOG: aborting startup due to startup process failure
So my question is: is this a bug, is there a special meaning of exit code 255 that is missing from the otherwise excellent documentation, or am I missing something else here?
On the primary server, you have WAL files sitting in the pg_xlog/ directory. While WAL files are there, PostgreSQL is able to deliver them to the standby should they be requested.
Typically, you also have a local archived-WAL location; when files are moved there by PostgreSQL, they can no longer be delivered to the standby on-line, and the standby expects them to come from the archived-WAL location via restore_command.
If you have different locations for archived WALs set up on the primary and standby servers, then there's no way for a WAL file to reach the standby and you have a gap.
In your case this might mean that:
00000001000000000000003D had been archived by the primary PostgreSQL;
the standby's restore_command doesn't see it at the configured source location.
You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It might also be necessary to review your WAL locations and make sure both servers look at the same location.
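A sketch of the manual copy, assuming the archive directory is /var/lib/postgresql/wal_archive on both machines (the paths and host names are placeholders):
scp primary:/var/lib/postgresql/wal_archive/00000001000000000000003D standby:/var/lib/postgresql/wal_archive/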
EDIT:
Grepping for restore_command in the sources, only access/transam/xlog.c references it. In the function RestoreArchivedFile, almost at the end (around line 3115 in the 9.1.3 sources), there's a check whether restore_command exited normally or received a signal.
In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and wasn't able to handle it properly, I guess), a FATAL error is reported. The same is true for all exit codes greater than 125.
I won't be able to tell you why, though.
I recommend asking on the hackers list.
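For reference, the check described above can be located roughly like this (a sketch against an unpacked 9.1.3 source tree; the path is an assumption):
grep -n "restore_command" postgresql-9.1.3/src/backend/access/transam/xlog.c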
This looks like an rsync problem I encountered temporarily using NFS (with rpcbind/rstatd on port 837):
$ rsync -avz /var/backup/* backup@storage:/data/backups
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
This fixed it for me:
service rpcbind stop
I had the same issue creating a hot standby (Postgres 9.5). Streaming was working (I seeded the standby via pg_basebackup, using the same credentials as would later be used in the standby's recovery.conf).
After taking the base backup, I set up the following recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'
Starting the server would yield:
2016-03-08 12:34:58.981 UTC (/)LOG: database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC (/)FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC (/)LOG: startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC (/)LOG: aborting startup due to startup process failure
If I removed the restore_command line from recovery.conf, the standby started up fine and began streaming WALs from the master.
I eventually traced the problem down to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I'd also forgotten to add the WAL archive host's server fingerprint to the known_hosts file of the standby postgres user.
These two mistakes were (I assume) causing the sftp restore_command to exit with code 255. As tscho says, the Postgres docs suggest that if the restore_command exits with ANY non-zero value, Postgres will simply move on to trying to stream from the master rather than refusing to start. In reality this doesn't seem to be the case if the exit code is higher than a certain number (maybe 125, as vyegorov's source code grepping suggests?).
Once I fixed the two SSH issues, the standby started fine with the restore_command present in recovery.conf.
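Roughly what the two SSH fixes amount to, run as the standby's postgres user (the user and host are the placeholders from the recovery.conf above):
ssh-copy-id user@ip.of.wal.archive.host   # puts the standby's public key into ~/.ssh/authorized_keys on the archive host
ssh user@ip.of.wal.archive.host true      # accepting the prompt records the host key in the standby user's ~/.ssh/known_hosts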
Here is the comment describing why this behavior was chosen for high exit statuses from the command process, and the current code that implements it.
/*
* Remember, we rollforward UNTIL the restore fails so failure here is
* just part of the process... that makes it difficult to determine
* whether the restore failed because there isn't an archive to restore,
* or because the administrator has specified the restore program
* incorrectly. We have to assume the former.
*
* However, if the failure was due to any sort of signal, it's best to
* punt and abort recovery. (If we "return false" here, upper levels will
* assume that recovery is complete and start up the database!) It's
* essential to abort on child SIGINT and SIGQUIT, because per spec
* system() ignores SIGINT and SIGQUIT while waiting; if we see one of
* those it's a good bet we should have gotten it too.
*
* On SIGTERM, assume we have received a fast shutdown request, and exit
* cleanly. It's pure chance whether we receive the SIGTERM first, or the
* child process. If we receive it first, the signal handler will call
* proc_exit, otherwise we do it here. If we or the child process received
* SIGTERM for any other reason than a fast shutdown request, postmaster
* will perform an immediate shutdown when it sees us exiting
* unexpectedly.
*
* Per the Single Unix Spec, shells report exit status > 128 when a called
* command died on a signal. Also, 126 and 127 are used to report
* problems such as an unfindable command; treat those as fatal errors
* too.
*/
if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
    proc_exit(1);

signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

ereport(signaled ? FATAL : DEBUG2,
        (errmsg("could not restore file \"%s\" from archive: %s",
                xlogfname, wait_result_to_str(rc))));
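Tying this back to the "return code 65280" in the question: that value is a raw wait()-style status whose high byte is the child's exit code, so it falls into the WEXITSTATUS(rc) > 125 branch above:
echo $(( 65280 >> 8 ))   # prints 255: rsync's exit code, which is greater than 125, hence the FATAL report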

MongoDB on RHEL fails to start up after a cold reboot

MongoDB is failing to start on RHEL.
Here is the output from the service restart; any ideas?
2011-06-17 18:44:06,387 [INFO][Dummy-3] initialize() # connection.py:48 - Attempting Database connection with seeds = localhost
2011-06-17 18:44:06,389 [CRITICAL][Dummy-3] initialize() # connection.py:55 - Database initialization failed
It is best to look at the mongod server logs to find the issue. Most likely the lock file was not cleaned up because you had an unclean shutdown.
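If it is indeed the stale lock file, the classic remedy for MongoDB of that era looks roughly like this (a sketch; the /var/lib/mongo data path and the mongod service and user names are assumptions, so check your configured dbpath first):
sudo rm /var/lib/mongo/mongod.lock
sudo -u mongod mongod --repair --dbpath /var/lib/mongo
sudo service mongod start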