Postgres OperationalError('SSL SYSCALL error: EOF detected') - postgresql

Problem description:
A resource- and compute-intensive ETL pipeline job that runs every minute is failing, and we receive many copies of the same error message, shown below (it seems to fail almost every minute now). We did observe that the new parameter template is based on a generic template and can be optimized; for example, work_mem was 4MB and we increased it. Any advice would be appreciated. Thanks in advance.
Steps we have taken to resolve the issue:
ran VACUUM ANALYZE
increased work_mem (see the session-level sketch below)
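Since the parameter change came from editing the instance-wide template, one alternative worth knowing (a minimal sketch, assuming psycopg2; the DSN and the 256MB value are placeholders, not recommendations) is to raise work_mem only for the ETL session, so the rest of the workload keeps the template default:

import psycopg2

# Placeholder connection string; substitute your real settings.
conn = psycopg2.connect("dbname=etl user=etl_user host=db.example.com")
with conn, conn.cursor() as cur:
    # SET applies to this session only, not the whole instance.
    cur.execute("SET work_mem = '256MB'")
    # ... run the ETL statements (upserts, temp tables, sorts) here ...
conn.close()

Keep in mind that work_mem is a per sort/hash-operation limit, so a complex ETL query can consume several multiples of it at once.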
Error Message 1:
(OperationalError('SSL SYSCALL error: EOF detected\n')
ETL SQL (upsert, update, create temp table, indexing, sorting, etc)
....
Exception: connection already closed
Error Message 2:
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
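Both messages describe the same event: something closed the TCP connection while the client was waiting, whether the backend itself (e.g. terminated for memory pressure, which raising work_mem can make more likely) or a firewall/NAT device dropping a long-idle connection; the Postgres server log and dmesg at the failure timestamps usually say which. On the client side, a minimal sketch of the usual mitigations, assuming psycopg2 (the keepalive values and the retry wrapper are illustrative, not your actual code):

import time
import psycopg2

def run_with_retry(dsn, sql, attempts=3):
    """Reconnect and retry when the connection is dropped mid-statement."""
    for attempt in range(attempts):
        try:
            # libpq keepalive parameters help detect half-open connections
            # that a firewall or NAT device has silently discarded.
            conn = psycopg2.connect(
                dsn,
                keepalives=1,
                keepalives_idle=30,
                keepalives_interval=10,
                keepalives_count=3,
            )
            try:
                with conn, conn.cursor() as cur:
                    cur.execute(sql)
                return
            finally:
                conn.close()
        except psycopg2.OperationalError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt)  # simple backoff before reconnecting

Retrying a whole batch like this is only safe if each run is idempotent (upserts into the real tables, temp tables recreated from scratch), which your statement mix suggests it may already be.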

Related

Large number of Sleep connections to database in TYPO3

Our TYPO3 application is experiencing downtime issues, with the logs displaying the error:
Core: Exception handler (WEB): Uncaught TYPO3 Exception: An exception occured in driver: Too many connections
If I connect to the MySQL database and run SHOW PROCESSLIST, all I see are lots of connections with the command “Sleep”. This seems like a red flag to me but this is not my area of expertise; is there a good reason for this and if not what might the fix be?
The PHP mysqli driver allows for persistent connections,
which basically means a connection is kept open and put to "sleep" until needed again.
This is a tradeoff: an already open connection does not need to be re-established, so you can query faster, but at the cost that both systems spend some resources (memory, CPU) on keeping the connection alive.
see the PHP documentation for more details:
https://www.php.net/manual/de/mysqli.persistconns.php
and configuration options (php.ini)
https://www.php.net/manual/de/mysqli.configuration.php
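If the sleeping connections keep piling up, the second page above also documents php.ini directives for capping them; a sketch (the values are illustrative, not recommendations):

; php.ini
mysqli.allow_persistent = On   ; set Off to disable persistent connections entirely
mysqli.max_persistent = 20     ; cap on persistent connections per PHP process
mysqli.max_links = 50          ; cap on total mysqli connections per process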

Timeout - Clone and Base table

A procedure in my mainframe job has a step which performs an exchange between a clone table and its base table. This step fails every time the job runs with a "resource unavailable" error; the resource is a package for another program which reads the base table used in my job.
Since the job is failing with a timeout error, I usually just restart it. But to fix this permanently, is it possible to increase the timeout limit for this EXCHANGE process? In the IBM manual I could see "SET CURRENT LOCK TIMEOUT 30", but is this valid? My EXCHANGE statement between the clone and base table is coded in a control card. Is there any way I can increase the timeout so that the job does not go into error?
If any further details are required, please let me know.
Any help on this is appreciated.
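For reference, the statement from the manual is meant to run in the same unit of work, ahead of the exchange; a sketch with placeholder table names, assuming your Db2 level supports the CURRENT LOCK TIMEOUT special register:

SET CURRENT LOCK TIMEOUT 30;
EXCHANGE DATA BETWEEN TABLE MYSCHEMA.BASE_TBL AND MYSCHEMA.CLONE_TBL;

Note that a longer timeout only makes the EXCHANGE wait longer for the other program to release the base table; if that reader holds its claim indefinitely, the step will still fail.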

Multiple could not receive data from client: Connection reset by peer Postgresql and Resque

I have a server that runs PostgreSQL. In the logs I am seeing this message from my Resque-based 'worker' box multiple times a minute; some minutes there isn't a message, others there could be 10.
2016-01-12 13:40:36 EST:1.1.8.2(33899):[16141]: LOG: could not receive data from client: Connection reset by peer
Now, when I go onto the 1.1.8.2 box and look at netstat -ntp, I don't see port 33899, and most of the ports are at least in the 40xxx range by now. That may be conjecture, but I'm at a loss to find out why a Redis/Resque/Puma Rails stack would be printing these messages, let alone what they mean even if I get to the bottom of it.
Will I gain memory back if they are closed 'normally'?
Is this a thing to be wary of?
How does one debug OLD ports that are open when the db box and the worker box both don't display the ports any more?
This message is probably due to the Resque worker task not closing its database connection before it exits. It's not a huge problem, but presumably Postgres is doing a little extra work to clean it up, and it makes a mess of your log file...
One solution is to add a hook to your Resque worker's task file (the same file that contains the self.perform definition):
def self.after_perform(*args)
  # Resque calls this hook once the job has run; disconnect so the
  # worker process doesn't exit with the connection still open.
  ActiveRecord::Base.connection.disconnect!
end
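Resque forks a child process for each job, so the child inherits the parent's database connection and can abandon it when it exits; disconnecting in after_perform (or establishing a fresh connection in Resque's after_fork hook, which it also provides) makes the teardown explicit instead of leaving the socket to be reset.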

Orientdb network connection lost during commit

I am using the Blueprints graph API for OrientDB against a 2-node cluster running OrientDB 1.7.10. When ingesting simple parent-child data I intermittently get the following error on commit:
Warning: caught I/O errors from not connected (local socket=?), trying to reconnect (error: java.io.IOException: Channel is closed)
The connection is then reestablished:
Connection re-acquired transparently after 31ms and 1 retries: no errors will be thrown at application level.
This occurs midway through the commit (100 vertices and edges), with the result that the server thinks it has sent the response but the client hangs forever.
Is there a way to catch this at the application level and, e.g., roll back?
I would be very grateful for any help.
As far as I know, a very similar issue was fixed some time ago: https://github.com/orientechnologies/orientdb/issues/2930
One thing to be aware of is the graph's autostart-transaction setting: if it is enabled (and it is by default) you don't need to call begin, just commit. If you do call begin, the transaction will instead be committed at shutdown, and in that case it can create exactly that problem.
Another suggestion is to migrate to the 2.0-* releases, which have important improvements on that side too, especially if you are in the development phase; the 2.0 final is going to be released very soon and will be the one with major focus in the next months.
Bye

SQLBulkCopy connection errors when working with SQL Azure

We are currently trying out the SqlBulkCopy API on the new SQL Azure CTP.
While we have been able to consistently migrate tables with about a million rows, we are facing connection errors when working with larger tables. After a random number of rows have been transferred, we keep getting the following error:
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
I understand that the SQL Azure connection policies (mentioned here) state that the connection can be terminated for a number of reasons, and they also mention some error codes that are returned. But I am not able to work out which of these might be causing the error, or how to capture the error code.
Is there a way we can get past this error and continue with the migration of the table rows?
The SqlBulkCopy options used are:
BatchSize=1000
BulkCopyTimeout=5000
Knowledge Base article 977291 gives this error message as a symptom of a Windows 2003 TCP/IP issue.