How do I make a socket connection just time out vs. being completely down? - sockets

I have a bug where a long-running process works fine for the first few days, but then queries to Redis start hitting the 45-second timeout I have set. That is, if Redis were totally down my program would just crash, but it doesn't: it waits the full 45 seconds, times out, and tries again for another 45 seconds, over and over.
If I stop the process and restart it, everything is fine again for another few days.
This is running on EC2 with Elastic Load Balancing, with my process on a different box than Redis.
I need to re-create this situation in my local development environment. How can I avoid killing my local Redis and instead put it into a state where reads will time out?

Maybe turn off the port? Although that might be interpreted as connection refused, i.e. the server being down.
Better: put another non-Redis app on that port and have it not respond, in other words, accept incoming connections but never reply. You could write a simple app that accepts TCP connections and then does nothing, in the language of your choice, and start it on the Redis port to reproduce this situation, as in the sketch below.
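A minimal sketch of that stub in Go, assuming your local Redis normally listens on the default port 6379 and has been stopped first (adjust the port if yours differs): it accepts connections, drains whatever the client sends, and never replies, so client reads block until their own 45-second timeout fires.

```go
// blackhole.go: accept TCP connections on the Redis port but never respond,
// so clients connect successfully and then time out on reads.
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Default Redis port; assumes the real local Redis has been stopped.
	ln, err := net.Listen("tcp", ":6379")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("accepting connections on :6379 and ignoring them")

	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Println("accept:", err)
			continue
		}
		go func(c net.Conn) {
			defer c.Close()
			// Drain whatever the client sends but never write back;
			// the client's read will block until its own timeout.
			io.Copy(io.Discard, c)
		}(conn)
	}
}
```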

Related

Why does my MongoDB account have 292 connections?

I only write data into my MongoDB database once a day, and I am not currently writing any data into it, but there have been a consistent 292 connections to my database for the past three hours. No reads or writes, just connections and a consistent 29 commands per second since this started.
Concerned by this, I adjusted the settings to only allow access from one specific IP and changed all my passwords, but the number hasn't changed: still 292 connections and 29 commands per second. Any idea what is causing this, or how I can dig in further?
The number of connections depends on the cluster setup. A connection can be external (e.g. your app or monitoring tools) or internal (e.g. to replicate your data to secondary nodes or a backup process).
You can use db.currentOp() to list the operations currently in progress, and db.serverStatus().connections to see the current connection counts.
Consider that your app instance(s) may not open just one connection but several, depending on the driver that connects to the DB and how it handles connection pooling. The connection pool size can be thought of as the max number of concurrent requests that your driver can service; for example, the default connection pool size for the Node.js MongoDB driver is 5. If you have set a high pool size, either in the driver or in the connection string, your app may open many connections to process write commands concurrently.
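For illustration only, capping the pool at 5 with the official MongoDB Go driver would look roughly like the sketch below (the URI is a placeholder; your app's driver and settings may differ, and other drivers usually expose the same knob as a maxPoolSize option in the connection string):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Placeholder URI; appending "?maxPoolSize=5" to the connection string is equivalent.
	opts := options.Client().
		ApplyURI("mongodb://localhost:27017").
		SetMaxPoolSize(5) // cap concurrent connections from this client

	client, err := mongo.Connect(ctx, opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)
}
```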
You can start by process of elimination:
Completely cut your app off from the DB. There is a keep-alive time, so connections won't close immediately unless the driver closes them explicitly; you may have to wait some time, depending on the keep-alive setting. You can also restart your cluster and see how many connections there are initially.
Connect your app to the DB and check how the connection number changes with each request. Check whether your app properly closes connections to the DB at some point after opening them (one way to watch the count is sketched after this list).
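As a rough sketch of how to watch that connection count programmatically while you do this (the official Go driver is used here purely for illustration, and the URI is a placeholder):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// serverStatus reports current/available/totalCreated connection counts.
	var status bson.M
	if err := client.Database("admin").
		RunCommand(ctx, bson.D{{Key: "serverStatus", Value: 1}}).
		Decode(&status); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("connections: %v\n", status["connections"])
}
```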

MongoDB connection fails on multiple app servers

We have MongoDB with the mgo driver for Go. There are two app servers connecting to MongoDB, which runs alongside the apps (Go binaries). MongoDB runs as a replica set, and each server connects to the primary or a secondary depending on the replica set's current state.
We experienced a SocketException handling request, closing client connection: 9001 socket exception on one of the mongo servers, which caused the connections to MongoDB from our apps to die. After that, the replica set continued to be functional, but the connection also died on our second server (on which the error didn't happen).
In the golang logs it was manifested as:
read tcp 10.10.0.5:37698->10.10.0.7:27017: i/o timeout
Why did this happen? How can this be prevented?
As I understand it, mgo connects to the whole replica set from the URL (it detects the whole topology from a single instance's URL), but why did the connection dying on one of the servers kill it on the second one?
Edit:
The full package path that is used is "gopkg.in/mgo.v2".
Unfortunately I can't share the mongo log files here, but besides the SocketException the mongo logs don't contain anything useful. There is an indication of some degree of lock contention, where the lock-acquisition time is at times quite high, but nothing beyond that.
MongoDB does some heavy indexing at times, but there weren't any unusual spikes recently, so nothing beyond normal.
First, the mgo driver you are using, gopkg.in/mgo.v2, developed by Gustavo Niemeyer (hosted at https://github.com/go-mgo/mgo), is not maintained anymore.
Instead, use the community-supported fork github.com/globalsign/mgo. This one continues to get patched and evolve.
Its changelog includes "Improved connection handling", which seems to relate directly to your issue.
The details can be read at https://github.com/globalsign/mgo/pull/5, which points to the original pull request https://github.com/go-mgo/mgo/pull/437:
If mongoServer fails to dial the server, it will close all sockets that are alive, whether they're currently in use or not.
There are two cons:
Inflight requests will be interrupted rudely.
All sockets are closed at the same time, and are likely to dial the server at the same time. Any occasional failure among the massive dial requests (high-concurrency scenario) will make all sockets close again, and repeat... (it happened in our production environment).
So I think sockets currently in use should be closed only after they become idle.
Note that github.com/globalsign/mgo has a backward-compatible API; it basically just adds a few new features (besides the fixes and patches), which means you should be able to just change the import path and everything should work without further changes.
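Assuming a fairly typical mgo setup, the switch is essentially the import line. A minimal sketch (the replica-set URL and timeouts below are placeholders, not taken from your configuration):

```go
package main

import (
	"log"
	"time"

	// Before: "gopkg.in/mgo.v2"
	"github.com/globalsign/mgo"
)

func main() {
	// Placeholder replica-set URL; mgo discovers the full topology from it.
	url := "mongodb://10.10.0.7:27017,10.10.0.8:27017/mydb?replicaSet=rs0"

	session, err := mgo.DialWithTimeout(url, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	session.SetMode(mgo.Monotonic, true)
	// The "i/o timeout" in your logs is governed by the socket timeout.
	session.SetSocketTimeout(1 * time.Minute)

	if err := session.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected via github.com/globalsign/mgo")
}
```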

Fabric Network - what happens when a downed peer connects back to the network?

I recently deployed a Fabric network using Docker Compose and was trying to simulate a downed peer. Essentially, this is what happens:
Four peers are brought online using docker-compose, running a Fabric network.
One peer, i.e. the 4th peer, goes down (done via the docker stop command).
Invoke transactions are sent to the root peer, which is verified by querying the peers (excluding the downed peer) after some time.
The downed peer is brought back up with docker start. Query transactions run fine on the always-on peers but fail on the newly woken-up peer.
Why isn't the 4th peer synchronizing the blockchain once it's up? Is there a step to be taken to ensure it does, or is it discarded as a rogue peer?
This might be the expected behavior of PBFT (assuming you are using it): with 4 peers, f = 1, so 2f + 1 = 3 replicas make progress while the remaining one lags behind. As explained on issue 933,
I think what you're seeing is normal PBFT behavior: 2f+1 replicas are making progress, and f replicas are lagging slightly behind, and catch up occasionally. If you shut down another peer, you should observe that the one you originally shut off and restarted will now participate fully, and the network will continue to make progress. As long as the network is making progress, and the participating nodes share a correct prefix, you're all good.
The reason for f replicas lagging behind is that those f may be acting byzantine and progress deliberately slowly. You cannot tell the difference between a slower correct replica and a deliberately slower byzantine replica. Therefore we cannot wait for the last f stragglers. They will be left behind and sync up occasionally. If it turns out that some other replica is crashed, the network will stop making progress until one correct straggler catches up, and then the network will progress normally.
Hyperledger Fabric v0.6 does not support adding peers dynamically. I am not sure about HF v1.0.

Google Cloud SQL: Periodic Read Spikes Associated With Loss of Connectivity

I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles connections in the connection pool every 60 seconds, so this is not a wait_timeout issue; I have verified that the connections are recycled. Also, it occurred twice in 30 minutes, and the wait_timeout is 8 hours.
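For concreteness, by "recycles connections" I mean capping the pool's connection lifetime at 60 seconds. A minimal sketch of that kind of setting, using Go's database/sql purely as an illustration (the driver import and DSN are placeholders, not my actual stack):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // placeholder MySQL driver
)

func main() {
	// Placeholder DSN; point it at the Cloud SQL instance's IP.
	db, err := sql.Open("mysql", "user:password@tcp(CLOUD_SQL_IP:3306)/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Recycle pooled connections after 60 seconds, far below the 8 hour
	// wait_timeout, so the server should never be the side closing them.
	db.SetConnMaxLifetime(60 * time.Second)
	db.SetMaxIdleConns(5)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
}
```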
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first incident lasted 17 seconds from the time the connection loss was detected until it was reestablished. The second was only 5 seconds, but given that my connections are idle for 60 seconds, the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC. The read spikes are not initiated by me; my app continued to be idle during the issue, so this is some sort of internal Cloud SQL process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
Update:
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance, and the operations section of the web console does not list any events at that time, so now the question becomes: what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart, period?
That was caused by one of our regular releases. Because of the way the update takes place, an instance might be restarted more than once during the push to production.
Was your instance restarted? During a restart, the spinning down/up of an instance will trigger reads/writes, which may be one reason why you are seeing that read/write activity.

Cassandra giving TTransportException after some inserts/updates

Around 15 processes were inserting/updating unique entries in Cassandra. Everything was working fine, but after some time I get the error below.
(When I restart the processes, everything is fine again for a while.)
An attempt was made to connect to each of the servers twice, but none
of the attempts succeeded. The last failure was TTransportException:
Could not connect to 10.204.81.77:9160
I did a CPU/memory analysis of all the Cassandra machines. CPU usage sometimes goes to around 110%, and memory usage was between 60% and 77%. Not sure if this might be the cause, as it was working fine with such memory and CPU usage most of the time.
P.S.: How do I ensure Cassandra updates/inserts work error-free?
Cassandra will throw an exception if anything goes wrong with your inserts; otherwise, you can assume it was error free.
Connection failures are a network problem, not a Cassandra problem. Some places to start: is the Cassandra process still alive? Does netstat show it still listening on 9160? Can you connect to non-Cassandra services on that machine? Is your server or router configured to firewall off frequent connection attempts?
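As a quick sketch of the "can you still connect to 9160" check from a client box (host and port are taken from the error message above; 9160 is the Thrift RPC port):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Thrift RPC address from the TTransportException above.
	addr := "10.204.81.77:9160"

	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Printf("cannot connect to %s: %v\n", addr, err)
		return
	}
	conn.Close()
	fmt.Printf("%s is accepting TCP connections\n", addr)
}
```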