Can't obtain database connection within 5 seconds - Sidekiq workers, Unicorn, Redis To Go on Heroku - PostgreSQL

I have a bunch of background tasks (Sidekiq workers) that update the database, and I keep getting this exception on the failing threads:
Heroku Log: WARN: could not obtain a database connection within 5.000 seconds (waited 5.001 seconds)
My Heroku setup:
Heroku Postgres :: Olive - 20-connection limit.
Redis To Go - 10-connection limit.
Sidekiq - 2 connections.
Each client request creates ~50 threads; in the end, ~20 threads are trying to update the database.
I know this is too many threads trying to obtain a connection (to update ActiveRecord models).
I don't mind if they wait and try again until they succeed.
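For example, wrapping the update in a retry along these lines would be fine by me (a rough sketch; the worker and model names are made up for illustration):

class UpdateWorker
  include Sidekiq::Worker

  MAX_RETRIES = 5

  def perform(record_id)
    attempts = 0
    begin
      # a connection is checked out from the pool around the query
      SomeModel.find(record_id).touch
    rescue ActiveRecord::ConnectionTimeoutError
      attempts += 1
      raise if attempts >= MAX_RETRIES
      sleep(2 ** attempts) # back off, then try the checkout again
      retry
    ensure
      # return the connection to the pool so other threads can use it
      ActiveRecord::Base.clear_active_connections!
    end
  end
end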
config/unicorn.rb:
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 30
preload_app true

before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end

  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end
end

after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end

  if defined?(ActiveRecord::Base)
    config = ActiveRecord::Base.configurations[Rails.env] ||
             Rails.application.config.database_configuration[Rails.env]
    config['pool'] = ENV['DB_POOL'] || 2
    config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
    ActiveRecord::Base.establish_connection(config)
  end
end
config/initializers/sidekiq.rb:
require 'sidekiq'

Sidekiq.configure_server do |config|
  if (database_url = ENV['DATABASE_URL'])
    p pool_size = Sidekiq.options[:concurrency] + 2
    p ENV['DATABASE_URL'] = "#{database_url}?pool=#{pool_size}"
    ActiveRecord::Base.establish_connection
  end
end
config/sidekiq.yml:
:concurrency: 2
Thanks a lot for all the help,
Eldar

Related

Julia Sockets TCP Connection Refused when running script, not REPL

I am following the simple TCP example found here: https://docs.julialang.org/en/v1/manual/networking-and-streams/#A-simple-TCP-example-1. The code can be seen below (I modified it slightly from the link):
using Sockets
using Printf

# listen will create a server waiting for incoming connections on the
# specified port (in this case it will be localhost:2000)
@async begin
    server = listen(2000)
    x::Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end

for i = 1:10
    connect(2000)
end
When I execute the commands within the REPL, it works correctly, producing the following output:
julia> @async begin
           server = listen(2000)
           x::Int = 1
           while true
               sock = accept(server)
               @printf "Connection number %d\n" x
               x += 1
           end
       end
Task (runnable) @0x00000001169d06d0

julia> for i = 1:10
           connect(2000)
       end
Connection number 1
Connection number 2
Connection number 3
Connection number 4
Connection number 5
Connection number 6
Connection number 7
Connection number 8
Connection number 9
Connection number 10
However, when I try to place these commands in a file:
using Sockets
using Printf

# listen will create a server waiting for incoming connections on the
# specified port (in this case it will be localhost:2000)
@async begin
    server = listen(2000)
    x::Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end

for i = 1:10
    connect(2000)
end
and run it using julia <scriptName> or (from within the REPL) include("<scriptName>"), I get the following error message:
ERROR: LoadError: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(::TCPSocket) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:520
[2] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:555 [inlined]
[3] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:541 [inlined]
[4] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:537 [inlined]
[5] top-level scope at <pathToFile>:18
[6] include at ./boot.jl:328 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1105
[8] include(::Module, ::String) at ./Base.jl:31
[9] include(::String) at ./client.jl:424
[10] top-level scope at REPL[1]:1
in expression starting at <pathToFile>:17
How would I run this program from a script? I am fairly new to Julia and to sockets, so apologies if this is an easy question!
You're getting an error because connect will fail if listen has not yet had time to run. As you discovered, introducing a small pause between your server and client sides fixes the issue; let's try to explain this in more detail.
In the REPL-based, interactive version, as soon as the server task is created using @async, control goes back to the REPL, which performs blocking operations (such as waiting for new commands to be entered in the command line). This gives the scheduler an opportunity to yield control to the newly created task. The call to listen appears early in it, ensuring that whenever control reaches the client side again - and calls to connect are executed - the server is listening.
In the script version, the server task gets created and scheduled, but the scheduler does not get any opportunity to actually run it, since the main task does not perform any blocking call before running connect. Therefore no one is listening on the socket when the connection is attempted, and the call fails. However, any blocking call placed there will give the scheduler an opportunity to run the server task. This includes sleep (of an arbitrarily small amount of time, as you noted), but also any function performing I/O: a simple println("hello") would have worked too.
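In the same spirit (still a workaround rather than a real fix), an explicit yield() between the two parts also hands control to the scheduler, which runs the server task until it blocks in accept:

using Sockets

# same server task as before, created but not yet running
@async begin
    server = listen(2000)
    while true
        accept(server)
    end
end

yield()  # let the scheduler start the server task; it blocks in accept,
         # and control then returns here with the server listening

for i = 1:10
    connect(2000)
end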
I would think a cleaner way to fix things would be to ensure that the call to listen is performed first, by running it synchronously beforehand:
using Sockets
using Printf

# listen will create a server waiting for incoming connections on the
# specified port (in this case it will be localhost:2000)
server = listen(2000)

@async begin
    x = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end

for i = 1:10
    connect(2000)
end
Now you're left with another problem: when the client loop terminates, all calls to connect have been issued, but not necessarily all corresponding calls to accept have had time to run. This means that you're likely to get truncated output such as:
Connection number 1
Connection number 2
Connection number 3
Connection number
You'd need further coordination between your tasks to correct this, but I suspect this second problem might be only related to the MWE posted here, and might not appear in your real use case.
For example, presumably the server is meant to send something to the client. The read-write operations performed on the sockets in this case naturally synchronize both tasks:
using Sockets
using Printf

# listen will create a server waiting for incoming connections on the
# specified port (in this case it will be localhost:2000)
server = listen(2000)

@async begin
    x = 1
    while true
        sock = accept(server)
        msg = "Connection number $x\n"
        print("Server sent: $msg")
        write(sock, msg)
        x += 1
    end
end

for i = 1:10
    sock = connect(2000)
    msg = readline(sock)
    println("Client received: $msg")
end
The above example correctly yields a complete (untruncated) output:
Server sent: Connection number 1
Client received: Connection number 1
Server sent: Connection number 2
Client received: Connection number 2
...
Server sent: Connection number 10
Client received: Connection number 10
So I've figured this out, I believe. The server and client either need to be separated into two files, or a pause (an arbitrarily small pause seems to be okay) needs to go between the server and client sides:
#!/Applications/Julia-1.3.app/Contents/Resources/julia/bin/julia
using Sockets
using Printf

# listen will create a server waiting for incoming connections on the
# specified port (in this case it will be localhost:2000)
port = 2000

@async begin
    server = listen(IPv6(0), port)
    x::Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end

sleep(1E-10)

for i = 1:10
    connect(port)
end
This is now working as expected!

Airflow sql alchemy pool size ignored

I am running the Airflow LocalExecutor with a Postgres database and I am getting: (psycopg2.OperationalError) FATAL: remaining connection slots are reserved
My configs:
sql_alchemy_pool_size = 5
sql_alchemy_pool_recycle = 1800
sql_alchemy_reconnect_timeout = 300
Still, I can see that Airflow always keeps more connections open than the pool size.
How can we limit Airflow to actually use the pool limit?
airflow_version = 1.10.2; postgres_version = 9.6
You have forks of the main process as workers, each of which manages its own pool.
Check the implementation of LocalExecutor; it uses multiprocessing under the hood. SQLAlchemy will close any open connections upon forking a LocalWorker, but each fork's pool size is the same as the parent's, so at most you'd theoretically have k * (n + 1) connections, where n is your parallelism constant and k is your sql_alchemy_pool_size.
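To make that bound concrete, here is a sketch of the relevant airflow.cfg section (the values are illustrative only, not recommendations; both settings live under [core] in Airflow 1.10):

[core]
# n: up to this many LocalWorker forks under the LocalExecutor
parallelism = 8
# k: SQLAlchemy pool size, inherited by each fork
sql_alchemy_pool_size = 2
# worst case from the executor alone: k * (n + 1) = 2 * (8 + 1) = 18 connections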
People having the same problem with the Celery executor can try reducing scheduler.max_threads (AIRFLOW__SCHEDULER__MAX_THREADS).
Theoretically, the maximum number of connections from the scheduler = (sql_alchemy_pool_size + sql_alchemy_max_overflow) × max_threads + 1 (DAG processor manager) + 1 (main scheduler process). For example, with a pool size of 5, an overflow of 10, and 2 threads, that is (5 + 10) × 2 + 2 = 32.
Check out https://stackoverflow.com/a/57411598/5860608 for further info.

orientdb 2.2.26 cluster setup - queries hang

OrientDB version - 2.2.26
Cluster - 3-node setup, readQuorum = 2, writeQuorum = "majority", ridBag.embeddedToSbtreeBonsaiThreshold = 2147483647
Nodes - CentOS 7.0, 24 cores and 96 GB RAM
The gremlin-scala/TinkerPop APIs are used for querying and inserting.
This code works fine on a single-node setup.
The code checks for an existing vertex in the graph; if the vertex does not exist, the insert operations are batched and sent to the database within a transaction.
I see the following warnings in the OrientDB log on all three nodes:
2017-09-15 16:37:31:025 WARNI [dev2] Timeout (852567ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.354 task=record_read(#65:22)) [ODistributedDatabaseImpl]
2017-09-15 16:52:18:239 WARNI [dev2] Timeout (1049042ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.568 task=record_read(#63:24)) [ODistributedDatabaseImpl]
2017-09-15 17:25:22:477 WARNI [dev2] Timeout (1984236ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.889 task=record_read(#63:24)) [ODistributedDatabaseImpl]
There is no problem with the network, and the firewall is disabled on all three nodes.
Are these logs related to the problem?
What else should I check to fix the problem?

Backup restore fails on multi-user mode error

I have a script to automate restoring a database from a backup. My script first stops all appserver instances, stops all databases, then restores from a backup. Below is the pseudo-code:
foreach appserver:
    asbman -name (appserver) -stop
foreach database:
    dbman -name (database) -stop
proutil database.db -C enablelargefiles
echo y | prorest database.db backup.bak -verbose
Once my script reaches the prorest command, it outputs the following error:
** The database D:\Directory\Wrk\db\database is in use in multi-user mode. (276)
After waiting ~60 seconds, running the prorest command again executes as expected, and the database is restored correctly. My guess is that there are processes tied to the database that are still running after the database is stopped. I would like to find a solution to this problem without having to use methods such as a sleep-retry to determine when the database is capable of being restored. Is there a solution to this problem, or are there better methods for restoring a database in this way?
There are some timeouts that can come into play:
When an unconditional batch shutdown runs (PROSHUT -by), the following sequence of events takes place if there are any running processes left:
30 seconds - wake up clients waiting for locks.
60 seconds - wake up clients waiting for locks.
90 seconds - wake up clients waiting on screen input.
5 minutes - resend the shutdown signal to remaining clients.
10 minutes - send a terminate (SIGTERM) signal to remaining clients.
More info here:
http://knowledgebase.progress.com/articles/Article/P3222
You can tail the database.lg file and look for the messages telling you that the database is shut down:
[2017/02/06#20:20:56.353+0100] P-14292 T-13420 I SHUT 5: (542) Server shutdown started by Jens on CON:.
[2017/02/06#20:20:56.499+0100] P-10276 T-11404 I BROKER 0: (15193) The normal shutdown of the database will continue for 10 Min 0 Sec if required.
[2017/02/06#20:20:56.499+0100] P-10276 T-11404 I BROKER 0: (2248) Begin normal shutdown
[2017/02/06#20:20:57.499+0100] P-10276 T-11404 I BROKER 0: (2263) Resending shutdown request to 0 user(s).
[2017/02/06#20:21:01.692+0100] P-10276 T-11404 I BROKER 0: (15109) At Database close the number of live transactions is 0.
[2017/02/06#20:21:01.692+0100] P-10276 T-11404 I BROKER 0: (15743) Before Image Log Completion at Block 1 Offset 5300.
[2017/02/06#20:21:01.693+0100] P-10276 T-11404 I BROKER 0: (453) Logout by Jens on CON:.
[2017/02/06#20:21:01.694+0100] P-10276 T-11404 I BROKER : (16869) Removed shared memory with segment_id: 50528256
[2017/02/06#20:21:01.694+0100] P-10276 T-11404 I BROKER : (334) Multi-user session end.
[2017/02/06#20:21:02.356+0100] P-14292 T-13420 I SHUT 5: (453) Logout by Jens on CON:.
The (334) message is basically telling you that the database is shut down.
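For instance, in a shell-based version of the script (a rough sketch; the log path and database name are placeholders), the restore could wait for a new (334) message to appear:

# Count existing "(334)" messages, stop the DB, then wait for a new one.
before=$(grep -c "(334)" /path/to/db/mydb.lg)
dbman -name mydb -stop
while [ "$(grep -c "(334)" /path/to/db/mydb.lg)" -le "$before" ]; do
    sleep 1
done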
Another option could be to check for the database lock file (database.lk). It's only there if the database is running:
...
2017-02-06 20:21 2 228 224 mySportsDb.b1
2017-02-06 20:21 1 703 936 mySportsDb.d1
2017-02-06 20:21 32 768 mySportsDb.db
2017-02-06 20:21 89 643 mySportsDb.lg
2017-02-06 18:00 920 mySportsDb.lic
2017-02-06 20:26 265 mySportsDb.lk
...
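In script form, that check could look like this (again a rough sketch in POSIX shell with placeholder paths; a Windows script would need the equivalent file-existence test):

#!/bin/sh
# Stop the database, then wait for its .lk lock file to disappear
# before restoring.
dbman -name mydb -stop

while [ -f /path/to/db/mydb.lk ]; do
    sleep 1
done

proutil /path/to/db/mydb.db -C enablelargefiles
echo y | prorest /path/to/db/mydb.db backup.bak -verbose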
There are also a couple of scripts you can run to check the status of the database. See more here:
http://knowledgebase.progress.com/articles/Article/P136887

(Unknown queue MSG=requested queue not found)

I am trying to run WIEN2k jobs with the Torque job scheduler (Maui scheduler) on Ubuntu 14.04. The app installed correctly without any error messages, but when I submit the bash script I encounter this error: "qsub: submit error (Unknown queue MSG=requested queue not found)". I read about this error on this website.
Can anyone help me, please?
Here is my PBS queue manager output after running the qmgr -c "p s" command:
# Create queues and set their attributes.
# Create and define queue batch
create queue batch
set queue batch queue_type = Execution
set queue batch max_running = 20
set queue batch resources_max.ncpus = 20
set queue batch resources_max.nodes = 20
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 76790:53:51
set queue batch enabled = True
set queue batch started = True
# Set server attributes.
set server scheduling = True
set server acl_hosts = seconduser
set server acl_hosts += firstuser
set server managers = root@localhost
set server managers += secfir@localhost
set server operators = root@localhost
set server operators += secfir@localhost
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 0
set server next_job_number = 47
set server moab_array_compatible = True
set server nppcu = 1
Try running:
qmgr -c "p s"
Look through the config for queue names. If there is a routing queue, try submitting your job with qsub -q queuename jobfile. If a routing queue doesn't exist, you can try the execution queues. Otherwise, it may be best to ask the cluster admin, since they have the power to kill your jobs if you don't submit them properly.
Just had the same problem and fixed it. The fullest way to do a qsub (described here) is:
qsub -q queuename scriptname
Your queue name can be found using the command described in the previous answer, as you did:
qmgr -c "p s"
The result for yours (and mine as well) is that the queue name is "batch". Multiple lines indicate this, such as "# Create and define queue batch" and "create queue batch", but most definitively:
set server default_queue = batch
So, submit your job script with
qsub -q batch scriptname
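Alternatively, the queue can be named inside the job script itself with a PBS directive, so a plain qsub scriptname works too (a minimal sketch; the job name and resource request are placeholders):

#!/bin/bash
#PBS -q batch            # submit to the "batch" queue found above
#PBS -N myjob            # job name (placeholder)
#PBS -l nodes=1:ppn=1    # resource request (placeholder)

cd $PBS_O_WORKDIR
# ... run your WIEN2k commands here ...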