Configuring the number of config server retries on mongos start? - mongodb

I'm attempting to start mongos and fail fast if the config server is unavailable. Right now, with an unavailable config server, I'm seeing:
Tue Feb 12 11:09:13 [mongosMain] can't resolve DNS for [compute-1-3] sleeping and trying 10 more times
How do I configure the 10?

The 10 retries are hard-coded; the count is not configurable. You can see it here:
https://github.com/mongodb/mongo/blob/master/src/mongo/s/config.cpp#L742
Just in case line numbers change, here's the relevant counter/loop:
for ( int x=10; x>0; x-- ) {
    if ( ! hostbyname( host.c_str() ).empty() ) {
        ok = true;
        break;
    }
    log() << "can't resolve DNS for [" << host << "] sleeping and trying " << x << " more times" << endl;
    sleepsecs( 10 );
}
You could, in theory, alter the code and rebuild mongos yourself, but then you would have to maintain that patch across new versions. I would recommend instead that you keep the config server available, or at least have it up within ~100 seconds of mongos starting (10 retries with a 10-second sleep each).
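If the goal is to fail fast without patching mongos, one option is a small pre-flight check run before mongos is launched. The following is a minimal Python sketch of that idea, not part of MongoDB: the function name wait_for_dns and its parameters are hypothetical, and it only checks DNS resolution (like the loop quoted above), not whether the config server accepts connections.

```python
import socket
import time


def wait_for_dns(host, retries=3, delay=1.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Return True as soon as `host` resolves, False after `retries` failures.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without real DNS lookups or real waiting.
    """
    for remaining in range(retries, 0, -1):
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            print(f"can't resolve DNS for [{host}], {remaining - 1} tries left")
            if remaining > 1:
                sleep(delay)
    return False
```

A wrapper script could call wait_for_dns("compute-1-3", retries=1) and exit with a non-zero status when it returns False, so that mongos is never started against an unresolvable config server.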

Julia Sockets TCP Connection Refused when running script, not REPL

I am following the simple TCP example found here: https://docs.julialang.org/en/v1/manual/networking-and-streams/#A-simple-TCP-example-1. The code can be seen below (I modified that from the link slightly):
using Sockets
using Printf
# listen will create a server waiting for incoming connections on the specified
# port (in this case it will be localhost::2000)
@async begin
    server = listen(2000)
    x :: Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end
for i = 1:10
    connect(2000)
end
When I execute the commands within the REPL it works correctly producing the following output:
julia> @async begin
           server = listen(2000)
           x :: Int = 1
           while true
               sock = accept(server)
               @printf "Connection number %d\n" x
               x += 1
           end
       end
Task (runnable) @0x00000001169d06d0
julia> for i = 1:10
connect(2000)
end
Connection number 1
Connection number 2
Connection number 3
Connection number 4
Connection number 5
Connection number 6
Connection number 7
Connection number 9
Connection number 10
However, when I try to place these commands in a file:
using Sockets
using Printf
# listen will create a server waiting for incoming connections on the specified
# port (in this case it will be localhost::2000)
@async begin
    server = listen(2000)
    x :: Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end
for i = 1:10
    connect(2000)
end
and run using julia <scriptName> or (from within the REPL) include("<scriptName>") I get the following error message:
ERROR: LoadError: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(::TCPSocket) at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:520
[2] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:555 [inlined]
[3] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:541 [inlined]
[4] connect at /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.3/Sockets/src/Sockets.jl:537 [inlined]
[5] top-level scope at <pathToFile>:18
[6] include at ./boot.jl:328 [inlined]
[7] include_relative(::Module, ::String) at ./loading.jl:1105
[8] include(::Module, ::String) at ./Base.jl:31
[9] include(::String) at ./client.jl:424
[10] top-level scope at REPL[1]:1
in expression starting at <pathToFile>:17
How would I run this program from a script? I am fairly new to Julia and to sockets so apologies if this is an easy question!!
You're getting an error because connect will fail if listen has not yet had time to run. As you discovered, introducing a small pause between your server and client sides fixes the issue; let's try to explain this in more detail.
In the REPL-based, interactive version, as soon as the server task is created using @async, control goes back to the REPL, which performs blocking operations (such as waiting for new commands to be entered in the command line). This gives the scheduler an opportunity to yield control to the newly created task. The call to listen appears early in it, ensuring that whenever control reaches the client side again - and calls to connect are executed - the server is listening.
In the script version, the server task gets created and scheduled, but the scheduler does not have any opportunity to actually run it, since the main task does not perform any blocking call before running connect. Therefore no-one is listening to the socket when the connection is performed, and the call fails. However, any blocking call put in this place will give an opportunity for the scheduler to run the server task. This includes sleep (of arbitrarily small amounts of time, as you noted), but also any function performing I/O: a simple println("hello") would have worked too.
I would think a cleaner way to fix things would be to ensure that the call to listen is performed first, by running it synchronously beforehand:
using Sockets
using Printf
# listen will create a server waiting for incoming connections on the specified
# port (in this case it will be localhost::2000)
server = listen(2000)
@async begin
    x = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end
for i = 1:10
    connect(2000)
end
Now you're left with another problem: when the client loop terminates, all calls to connect have been issued, but not necessarily all corresponding calls to accept have had time to run. This means that you're likely to get truncated output such as:
Connection number 1
Connection number 2
Connection number 3
Connection number
You'd need further coordination between your tasks to correct this, but I suspect this second problem might be only related to the MWE posted here, and might not appear in your real use case.
For example, presumably the server is meant to send something to the client. The read-write operations performed on sockets in this case naturally synchronize both tasks:
using Sockets
using Printf
# listen will create a server waiting for incoming connections on the specified
# port (in this case it will be localhost::2000)
server = listen(2000)
@async begin
    x = 1
    while true
        sock = accept(server)
        msg = "Connection number $x\n"
        print("Server sent: $msg")
        write(sock, msg)
        x += 1
    end
end
for i = 1:10
    sock = connect(2000)
    msg = readline(sock)
    println("Client received: $msg")
end
The above example correctly yields a complete (untruncated) output:
Server sent: Connection number 1
Client received: Connection number 1
Server sent: Connection number 2
Client received: Connection number 2
...
Server sent: Connection number 10
Client received: Connection number 10
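The same two ideas (start listening before any client connects, and let the read/write exchange keep both sides in step) can be sketched outside Julia. The following is a minimal Python illustration under stated assumptions: it uses an OS thread rather than a Julia task, binds to port 0 so the OS picks a free port instead of the 2000 used above, and the function name run_demo is hypothetical.

```python
import socket
import threading


def run_demo(n_clients=3):
    # Bind and listen synchronously, before any client tries to connect.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server.listen()
    port = server.getsockname()[1]

    def serve():
        # Accept exactly n_clients connections and greet each one.
        for x in range(1, n_clients + 1):
            conn, _ = server.accept()
            with conn:
                conn.sendall(f"Connection number {x}\n".encode())

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()

    received = []
    for _ in range(n_clients):
        with socket.create_connection(("127.0.0.1", port)) as client:
            # Reading a full line blocks until the server has written it,
            # which keeps client and server in step.
            with client.makefile() as reader:
                received.append(reader.readline().strip())

    worker.join()
    server.close()
    return received


if __name__ == "__main__":
    for line in run_demo():
        print("Client received:", line)
```

Because listen has already returned before the first connect is issued, the connection cannot be refused for lack of a listener, and because each client blocks on its read, no message is lost when the client loop ends.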
I believe I've figured this out. The server and client either need to be split into two files, or a pause (an arbitrarily small one seems to be enough) needs to go between the server and client sides:
#!/Applications/Julia-1.3.app/Contents/Resources/julia/bin/julia
using Sockets
using Printf
# listen will create a server waiting for incoming connections on the specified
# port (in this case it will be localhost::2000)
port = 2000
@async begin
    server = listen(IPv6(0),port)
    x::Int = 1
    while true
        sock = accept(server)
        @printf "Connection number %d\n" x
        x += 1
    end
end
sleep(1E-10)
for i = 1:10
    connect(port)
end
This is now working as expected!

Hystrix & Ribbon Timeout Warnings

Environment
Spring Boot 1.5.13.RELEASE
Spring Cloud Edgware.SR3
Compiled with Java version "1.8.0_172-ea",Java(TM) SE Runtime Environment (build 1.8.0_172-ea-b03) and source level 1.8
Runtime JRE: in docker with openjdk:10.0.1-jre-slim
Question
I have a ribbon client called serviceA and associated
serviceA.ribbon.ConnectTimeout=5000
serviceA.ribbon.ReadTimeout=15000
hystrix.command.serviceA.execution.isolation.thread.timeoutInMilliseconds = 20000
I have not (knowingly) got spring-retry on the classpath. I execute ./mvnw dependency:list | grep -i retry and get no results.
At runtime I get these warnings:
The Hystrix timeout of 20000ms for the command serviceA is set lower than the combination of the Ribbon read and connect timeout, 40000ms.
I'm not sure where these numbers come from given that I thought I'd set them to 15 and 5 seconds respectively. Why is this figure double?
Actually, the Ribbon timeout reported in that warning includes every retry on the same server and every retry on the next server:
ribbonTimeout = (ribbon.ConnectTimeout + ribbon.ReadTimeout) * (ribbon.MaxAutoRetries + 1) * (ribbon.MaxAutoRetriesNextServer + 1);
// ...
if(hystrixTimeout < ribbonTimeout) {
LOGGER.warn("The Hystrix timeout of " + hystrixTimeout + "ms for the command " + commandKey +
" is set lower than the combination of the Ribbon read and connect timeout, " + ribbonTimeout + "ms.");
}
In your configuration:
ribbon.ConnectTimeout is 5000
ribbon.ReadTimeout is 15000
ribbon.MaxAutoRetries is 0 (default)
ribbon.MaxAutoRetriesNextServer is 1 (default)
So the ribbonTimeout works out as follows, and the hystrixTimeout should be at least that large:
(5000 + 15000) * (1 + 0) * (1 + 1) // -> 40000 ms
If you choose not to configure the Hystrix timeout, the default Hystrix timeout will be 40000ms.
See section 19.13, "Zuul Timeouts", in the Spring Cloud documentation.
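The arithmetic above can be checked with a few lines of Python. This is only a restatement of the formula quoted in the answer, not Spring Cloud code; the function name ribbon_timeout_ms is hypothetical, and the defaults for the two retry settings are the ones stated in the answer.

```python
def ribbon_timeout_ms(connect_timeout, read_timeout,
                      max_auto_retries=0, max_auto_retries_next_server=1):
    """Combined Ribbon timeout, following the formula quoted above."""
    return ((connect_timeout + read_timeout)
            * (max_auto_retries + 1)
            * (max_auto_retries_next_server + 1))


# Values from the question: 5 s connect, 15 s read, default retry settings.
print(ribbon_timeout_ms(5000, 15000))  # 40000

# With next-server retries disabled the figure is no longer doubled.
print(ribbon_timeout_ms(5000, 15000, max_auto_retries_next_server=0))  # 20000
```

In other words, the doubling comes from the one retry on the next server, so either raising the Hystrix timeout to at least 40000 ms or setting MaxAutoRetriesNextServer to 0 makes the two figures consistent.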

Mongocxx connection pool (VS Release version) has issue connecting to remote server

I installed the mongocxx driver via vcpkg (and have been using it for a while now). Everything installed correctly and runs perfectly in the Debug version (I'm using Visual Studio 2017 and my application is a Windows Forms C++ (CLR) application). In my application, I create a connection pool and acquire a client every time I upload some data to the server. The typical interval of my automatic data upload is 10 minutes.
My settings are
// Create pool (once)
mongocxx::uri uri_remote{ "mongodb://user:pwd@remote-host:PORT/database-name?minPoolSize=2&maxPoolSize=5" };
mongocxx::pool pool_remote{ uri_remote };
// The code below runs as a scheduled process after every 10 minutes
auto client_remote = pool_remote.acquire();
// The client is returned to the pool when it goes out of scope.
auto collection_remote = (*client_remote)["database-name"]["collection-1-name"];
auto collection_st_remote = (*client_remote)["database-name"]["collection-2-name"];
bsoncxx::document::value doc1= document
<< std::string(keys[0]) << entries[0] // A short string (device identifier)
<< std::string(keys[1]) << entries[1] // A short string location
<< std::string(keys[2]) << bsoncxx::types::b_date(std::chrono::system_clock::now()) // Current insert time
<< std::string(keys[3]) << entries[2] // String: updated entry name
<< std::string(keys[4]) << entries[3] // String: Updated entry description
<< std::string(keys[5]) << <float number>
<< std::string(keys[6]) << <integer>
<< finalize;
// Below are the statuses I'm recording. A binary array (length = 7)
bsoncxx::document::value doc2= document
<< std::string(status_keys[0]) << statuses[0]
<< std::string(status_keys[1]) << statuses[1]
<< std::string(status_keys[2]) << statuses[2]
<< std::string(status_keys[3]) << statuses[3]
<< std::string(status_keys[4]) << bsoncxx::types::b_date(std::chrono::system_clock::now())
<< std::string(status_keys[5]) << statuses[4] // Device identifier
<< std::string(status_keys[6]) << statuses[5]
<< finalize;
// And finally insert
try {
// Insert remote. lines of code for doc1 and doc2 are skipped
collection_remote.insert_one(doc1.view());
collection_st_remote.insert_one(doc2.view());
// I'm skipping the rest of the code section here (just a catch statement after this). . .
The problem: in the Debug version the documents get uploaded every 10 minutes without a problem, but with the Release version (when I loaded the Release build of my application and started using it), the mongo insert doesn't run every 10 minutes. It misses/skips some entries (mostly one after each successful attempt, from what I observed).
With the Release version loaded on a remote computer I'm unable to do any debugging, and the Debug version works perfectly even with shorter intervals (like 1 minute each).

Can't obtain database connection within 5 seconds - Sidekiq workers, Unicorn, Redis ToGo on Heroku

I have a bunch of background tasks (Sidekiq workers) that update the database, and I keep getting this failing-thread exception.
Heroku Log: WARN: could not obtain a database connection within 5.000 seconds (waited 5.001 seconds)
Heroku
Heroku Postgres :: Olive - 20 connections limit.
Redis ToGo - 10 connections limit.
Sidekiq - 2 connections.
Each client request creates ~50 threads, of which ~20 end up trying to update the db.
Now I know this is too many threads trying to make a connection (updating ActiveRecord..).
I don't mind them waiting and trying again until they succeed.
-config/unicorn.rb
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 30
preload_app true
before_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
  if defined?(ActiveRecord::Base)
    ActiveRecord::Base.connection.disconnect!
  end
end
after_fork do |server, worker|
  Signal.trap 'TERM' do
    puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
  end
  if defined?(ActiveRecord::Base)
    config = ActiveRecord::Base.configurations[Rails.env] ||
      Rails.application.config.database_configuration[Rails.env]
    config['pool'] = ENV['DB_POOL'] || 2
    config['reaping_frequency'] = ENV['DB_REAP_FREQ'] || 10 # seconds
    ActiveRecord::Base.establish_connection(config)
  end
end
-config/initializers/sidekiq.rb
require 'sidekiq'
Sidekiq.configure_server do |config|
  if(database_url = ENV['DATABASE_URL'])
    p pool_size = Sidekiq.options[:concurrency] + 2
    p ENV['DATABASE_URL'] = "#{database_url}?pool=#{pool_size}"
    ActiveRecord::Base.establish_connection
  end
end
-config/sidekiq.yml
:concurrency: 2
Thanks a lot for all the help,
Eldar
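For readers unfamiliar with the warning in the log: it is what a bounded connection pool reports when every connection is checked out and a waiting thread runs out of patience. The following Python sketch illustrates that mechanism only. It is not ActiveRecord code; the class TinyPool and its methods are hypothetical, and the pool size of 2 simply mirrors the DB_POOL default in the unicorn.rb above.

```python
import queue


class TinyPool:
    """A toy bounded pool: checkout blocks until a connection is free."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def checkout(self, timeout):
        try:
            # Wait up to `timeout` seconds for another thread to check in.
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError(
                f"could not obtain a connection within {timeout} seconds"
            ) from None

    def checkin(self, conn):
        self._free.put(conn)


if __name__ == "__main__":
    pool = TinyPool(size=2)
    first = pool.checkout(timeout=0.1)
    second = pool.checkout(timeout=0.1)
    try:
        pool.checkout(timeout=0.1)  # a third caller finds the pool empty
    except TimeoutError as err:
        print("WARN:", err)
    pool.checkin(first)  # once a connection is returned, checkout succeeds
    print(pool.checkout(timeout=0.1))
```

Under this model, ~20 threads sharing a pool of 2 (or of concurrency + 2 in the Sidekiq process) will time out whenever the holders keep their connections longer than the wait limit, independently of the 20-connection limit on the Postgres plan.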

MongoDB ReplicaSet on windows azure - Socket exception error

I have a 3-node replica set deployed in Windows Azure. While doing performance testing, the test code halts after some time. On the server I can see the following error log:
Fri Aug 30 23:14:59.982 [conn2454] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [ip:port]
For the performance test I am using multithreaded code to only read data from the replicaset.
So far I have tried http://docs.mongodb.org/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-sharded-clusters-and-replica-sets, but it did not help.
Any thoughts/suggestions will be welcomed.
Thanks
This is old, but just in case someone else stumbles across this.
You need to set the TCP/IP keepalive times to values different from the stock Linux configuration if you are running under Azure:
echo 45 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 30 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes
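To see what these three values mean in practice, the following Python sketch computes how long an idle connection waits before the first keepalive probe and the worst-case time until an unresponsive peer is declared dead. The function name is hypothetical; the stock figures (7200, 75, 9) are the usual Linux defaults, and the formula is the standard reading of the three settings.

```python
def keepalive_timings(keepalive_time, keepalive_intvl, keepalive_probes):
    """Return (seconds until first probe, seconds until a dead peer is dropped)."""
    first_probe = keepalive_time
    # After the idle period, `keepalive_probes` unanswered probes are sent
    # `keepalive_intvl` seconds apart before the connection is dropped.
    dead_after = keepalive_time + keepalive_intvl * keepalive_probes
    return first_probe, dead_after


print(keepalive_timings(45, 30, 20))   # values from the answer: (45, 645)
print(keepalive_timings(7200, 75, 9))  # stock Linux defaults: (7200, 7875)
```

With the stock settings an idle connection sends nothing for two hours, whereas the values above generate traffic after 45 seconds of idleness, which is presumably what keeps intermediate network devices from discarding the connection.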